Social networks are spreading their roots across communities and nations around the world, and social networking has become a major aspect of human life. In recent years, priorities have shifted from in-person social networking to online social networking, which brings an intricate structure of relationships and data. Analyzing this data is crucial for many organizations, as is running the algorithms that make social networking websites friendly and secure communities.
Our project aims to analyze a Twitter social network dataset with the distributed graph database Neo4j and big data technologies like Hadoop and Spark. The dataset contains over 100,000 nodes and is stored in a distributed Neo4j database. We implemented the K-Means clustering algorithm on Hadoop, performed link prediction using Hadoop, and used distributed clustering packages for Neo4j. In our project, we also extract the most important nodes and subgraphs with a given probability.
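The link prediction mentioned above scores candidate user pairs; one common baseline for this is the common-neighbours count. The sketch below is a minimal single-machine illustration with made-up node ids, not our Hadoop implementation:

```java
import java.util.*;

// Minimal single-machine sketch of a common-neighbours link-prediction
// score; the graph and node ids here are illustrative only.
public class LinkPrediction {

    // score(u, v) = |N(u) ∩ N(v)|: user pairs that share many neighbours
    // are likely candidates for a future link
    public static int commonNeighbours(Map<Integer, Set<Integer>> adj, int u, int v) {
        Set<Integer> shared = new HashSet<>(adj.getOrDefault(u, Set.of()));
        shared.retainAll(adj.getOrDefault(v, Set.of()));
        return shared.size();
    }

    public static void main(String[] args) {
        // users 1 and 5 share neighbours 3 and 4, so they score 2
        Map<Integer, Set<Integer>> adj = Map.of(
            1, Set.of(2, 3, 4),
            5, Set.of(3, 4, 6));
        System.out.println(commonNeighbours(adj, 1, 5)); // prints 2
    }
}
```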
Twitter is an extremely popular social media platform with billions of visits every day, accounting for some 200 billion tweets every year. All this generates an enormous amount of data that falls under the category of big data. Big data has become a major field of study in computer science in recent years, drawing attention to the various ways of processing, analyzing, and storing the enormous data generated by social networking websites like Twitter. Storing such large amounts of data is not the challenge; the challenge is storing it in a way that lets a data scientist retrieve it whenever needed, without the overhead of the high processing power required to load and process it.
We downloaded the Twitter social network dataset, which contains millions of nodes related to user tweets, follows, and mentions, among others. It is a valuable resource for analyzing the interactions and relationships between users. Our project used this dataset to gain insights into the behavior of Twitter users and to implement advanced algorithms such as clustering and link prediction. We first downloaded multiple datasets and analyzed them for deployment in Neo4j, and finalized the Twitter Neo4j dataset for use in our big data analysis.
We analyzed the dataset and explored the interactions and relationships between different user nodes. We examined properties of the dataset such as the degree distribution, clustering coefficient, and centrality measures. These properties gave us better insight into the Twitter social network data and enabled us to better target the user node relationships.
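To make two of these properties concrete, the sketch below computes the degree distribution and the local clustering coefficient on a tiny undirected example graph; the node ids and edges are made up for illustration, not taken from the Twitter dataset:

```java
import java.util.*;

// Illustrative computation of degree distribution and local clustering
// coefficient on a small undirected graph (made-up data).
public class GraphMetrics {

    // build an undirected adjacency list from an edge array
    public static Map<Integer, Set<Integer>> buildGraph(int[][] edges) {
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        return adj;
    }

    // degree distribution: degree -> number of nodes with that degree
    public static Map<Integer, Integer> degreeDistribution(Map<Integer, Set<Integer>> adj) {
        Map<Integer, Integer> dist = new TreeMap<>();
        for (Set<Integer> nbrs : adj.values())
            dist.merge(nbrs.size(), 1, Integer::sum);
        return dist;
    }

    // local clustering coefficient of v: fraction of pairs of v's
    // neighbours that are themselves connected
    public static double clusteringCoefficient(Map<Integer, Set<Integer>> adj, int v) {
        List<Integer> nbrs = new ArrayList<>(adj.get(v));
        int links = 0, pairs = 0;
        for (int i = 0; i < nbrs.size(); i++)
            for (int j = i + 1; j < nbrs.size(); j++) {
                pairs++;
                if (adj.get(nbrs.get(i)).contains(nbrs.get(j))) links++;
            }
        return pairs == 0 ? 0.0 : (double) links / pairs;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> adj =
            buildGraph(new int[][]{{0, 1}, {0, 2}, {1, 2}, {1, 3}, {3, 4}});
        System.out.println("degree distribution: " + degreeDistribution(adj));
        System.out.println("C(1) = " + clusteringCoefficient(adj, 1));
    }
}
```

On the full Twitter graph the same quantities are computed by Neo4j's graph algorithm procedures rather than by hand, but the definitions are identical.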
We also deployed the Airbnb dataset with over 1.6 million nodes to visualize big data and run queries over it.
A sample visualization of the Airbnb data on Neo4J Bloom has been given in the image below.
We also ran the K-Means clustering algorithm on the Hadoop filesystem on an Ubuntu Linux AWS EC2 instance and created different clusters of data.
We ran different queries for extracting different subgraphs.
The Cypher query and visualization for getting a certain number of mentions from the graph is as follows:
// the MATCH pattern assumes the standard Twitter Neo4j schema (:POSTS, :MENTIONS)
MATCH (u:User)-[:POSTS]->(t:Tweet)-[:MENTIONS]->(m:User)
// group by the mentioned user m only, so COUNT aggregates mentions per user
WITH m, COUNT(m.screen_name) AS count
ORDER BY count DESC
RETURN m.screen_name AS mention, count
LIMIT 10
The Cypher query and the table for getting the trending hashtags is as follows:
// assumes tweets are linked to hashtags via a :TAGS relationship
MATCH (h:Hashtag)<-[:TAGS]-(:Tweet)
WITH h, COUNT(h) AS Hashtags
ORDER BY Hashtags DESC
RETURN h.name, Hashtags
LIMIT 10
The Cypher query and the table for retweeted links and the number of times they are favourited is as follows:
MATCH (:User:Me)-[:POSTS]->(t:Tweet)-[:RETWEETS]->(rt)-[:CONTAINS]->(link:Link)
RETURN t.id_str AS tweet, link.url AS url, rt.favorites AS favorites
ORDER BY favorites DESC
It was a great learning experience and a hands-on journey in analyzing big distributed graph data on Neo4j Desktop and Neo4j AuraDB in the cloud. We learned different ways to visualize a dataset with millions of nodes and then query it using Cypher, without compromising speed and scalability, using the different clustering algorithms.
The commands for setting up the Hadoop Filesystem on Linux are as follows:
$ cd /usr/local
$ sudo tar xzf hadoop-2.7.3.tar.gz
$ sudo mv hadoop-2.7.3 hadoop
$ sudo chown -R hduser:hadoop hadoop
The corresponding properties in core-site.xml are:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Permissions and Ownership
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp
Formatting Hadoop Filesystem
$ /usr/local/hadoop/bin/hadoop namenode -format
Starting the Node Cluster
$ /usr/local/hadoop/sbin/start-dfs.sh
$ /usr/local/hadoop/sbin/start-yarn.sh
Stopping the Node Cluster
$ /usr/local/hadoop/sbin/stop-yarn.sh
$ /usr/local/hadoop/sbin/stop-dfs.sh
The Java code for K-Means clustering is as follows:
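The full Hadoop listing is not reproduced here; as a stand-in, below is a minimal single-node sketch of the K-Means step itself (Lloyd's algorithm) on 2-D points with Euclidean distance and made-up example data. In the MapReduce version, the assignment loop corresponds to the mapper (each point is keyed by its nearest centroid) and the centroid update to the reducer:

```java
import java.util.*;

// Single-node sketch of K-Means (Lloyd's algorithm) on 2-D points.
// Example data and initial centroids are illustrative only.
public class KMeans {

    public static double[][] cluster(double[][] points, double[][] centroids, int iterations) {
        int k = centroids.length;
        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            // assignment step: each point goes to its nearest centroid
            for (double[] p : points) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = p[0] - centroids[c][0], dy = p[1] - centroids[c][1];
                    double d = dx * dx + dy * dy;
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                sums[best][0] += p[0];
                sums[best][1] += p[1];
                counts[best]++;
            }
            // update step: each centroid moves to the mean of its points
            for (int c = 0; c < k; c++)
                if (counts[c] > 0) {
                    centroids[c][0] = sums[c][0] / counts[c];
                    centroids[c][1] = sums[c][1] / counts[c];
                }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1.5, 2}, {1, 0.6}, {8, 8}, {9, 11}, {8, 2}};
        double[][] centroids = {{1, 1}, {8, 8}};
        System.out.println(Arrays.deepToString(cluster(points, centroids, 10)));
    }
}
```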
This was our project on Social Network Analysis of Twitter’s network of users, tweets, mentions, hashtags, links and sources.
A huge thanks to our mentors and faculty Dr. Vishal Srivastava, Dr. Sudhanshu Gupta, and Dr. Deepika Pantola for guiding us continuously throughout the project, introducing us to the intriguing and compelling graph database Neo4j and its query language Cypher, and giving us a head start in deploying distributed graphs and running clustering algorithms on distributed filesystems like Hadoop.