This is the last of three blog posts from this summer internship project showcasing how to answer questions concerning big datasets stored in MongoDB using MongoDB’s frameworks and connectors.
Once we’d familiarized ourselves with a sizable amount of data from the Flights dataset, the next step was to move towards really BIG data: Twitter Memes. SNAP (the Stanford Network Analysis Project) provides large network datasets for free. We were particularly interested in the Twitter Memes dataset from the 2008 presidential election, generated by MemeTracker, a joint Stanford and Cornell project that built maps of the news cycle from news articles and blog posts. The dataset is a compilation of blogs, news articles, and other media postings that pertain to the presidential election. MemeTracker itself focused on the time lapse between the first mention of a quote and the time it took for mass media to pick the quote up; our analysis, however, again focuses on PageRank and the importance of both individual URLs and whole websites.
The entire dataset contains over 96 million documents and over 418 million links between them. To begin, we focused solely on the memes from April 2009, which alone amount to over 15 million documents. The goal was to run PageRank on this dataset, with the web pages and the links between them forming a graph. The higher the PageRank of a web page, the more important the page; web pages with a relatively high PageRank usually have a high ratio of incoming links to outgoing links compared to the other pages in the graph.
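To make that intuition concrete, here is the standard PageRank recurrence (stated for reference only, not as the exact formulation any particular implementation uses), where d is the damping factor (commonly 0.85), N is the total number of pages, the sum runs over the pages q that link to p, and L(q) is the number of outgoing links on q:

```latex
PR(p) = \frac{1 - d}{N} + d \sum_{q \to p} \frac{PR(q)}{L(q)}
```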
Disclaimer: As we’d learn through this project, the pages with the highest PageRank are not necessarily related to the 2008 presidential election.
Importing the Data
The source file quotes_2009-04.txt was 11 GB. It came in this format:
    P http://blogs.abcnews.com/politicalpunch/2008/09/obama-says-mc-1.html
    T 2008-09-09 22:35:24
    Q that's not change
    Q you know you can put lipstick on a pig
    Q what's the difference between a hockey mom and a pit bull lipstick
    Q you can wrap an old fish in a piece of paper called change
    L http://reuters.com/article/politicsnews/idusn2944356420080901?pagenumber=1&virtualbrandchannel=10112
    L http://cbn.com/cbnnews/436448.aspx
    L http://voices.washingtonpost.com/thefix/2008/09/bristol_palin_is_pregnant.html?hpid=topnews
- P denotes the URL of the document.
- T represents the time of the post.
- Q is a quote found in the post.
- L is a link that exists in the post.
This was not an ideal schema for MongoDB. Using inputMongo.py, we converted the above input into documents resembling the following:
{ "_id" : ObjectId("51c8a3f200a7f40aae706e86"), "url" : "http://blogs.abcnews.com/politicalpunch/2008/09/obama-says-mc-1.html", "quotes" : [ "that's not change", "you know you can put lipstick on a pig", "what's the difference between a hockey mom and a pit bull lipstick", "you can wrap an old fish in a piece of paper called change" ], "links" : [ "http://reuters.com/article/politicsnews/idusn2944356420080901?pagenumber=1&virtualbrandchannel=10112", "http://cbn.com/cbnnews/436448.aspx", "http://voices.washingtonpost.com/thefix/2008/09/bristol_palin_is_pregnant.html?hpid=topnews" ], "time" : ISODate("2008-09-09T22:35:24Z")}
This resulted in 15,312,738 documents. We also used bulk insertion instead of inserting each document individually; inserting all of these documents into a collection in the database took about 1 hour and 48 minutes.
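As a sketch of what that bulk insertion might look like with PyMongo’s insert_many API (the database and collection names, memes and posts, and the batch size of 1,000 are our assumptions, not details from the original project):

```python
from pymongo import MongoClient

client = MongoClient()            # assumes a mongod running on localhost:27017
posts = client["memes"]["posts"]  # hypothetical database/collection names

batch = []
for doc in parse_quotes("quotes_2009-04.txt"):
    batch.append(doc)
    if len(batch) == 1000:        # send documents in batches, not one by one
        posts.insert_many(batch, ordered=False)
        batch = []
if batch:                         # flush the final partial batch
    posts.insert_many(batch, ordered=False)
```

Passing ordered=False lets the server keep going past any individual failed insert and can also improve throughput, since the batch doesn’t have to be applied strictly in sequence.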