Twitter Memes Dataset Overview with PageRank

Posted by 搞代码 in mysql on 2022-01-09

This is the last of three blog posts from this summer internship project showcasing how to answer questions concerning big datasets stored in MongoDB using MongoDB’s frameworks and connectors.

Once we’d familiarized ourselves with a sizable amount of data from the Flights dataset, the next step was to move towards really BIG data: Twitter Memes. SNAP (the Stanford Network Analysis Project) provides free large network datasets. We were interested in the Twitter Memes dataset from the 2008 presidential election, generated by MemeTracker, a joint project between Stanford and Cornell that created news cycle maps from news articles and blog posts. The dataset is a compilation of blogs, news articles, and other media postings that pertain to the presidential election. The MemeTracker project focused on the time lapse between the first mention of a quote and the moment it was picked up by mass media. Our analysis, however, again focuses on PageRank and the importance of both individual URLs and websites.

The entire dataset contains over 96 million documents and over 418 million links between them. To begin, we focused solely on the memes during the month of April in 2009, which had over 15 million documents. The goal was to run PageRank on this dataset (where all web pages form a graph). The higher the PageRank of a web page, the more important the page. Web pages with a relatively high PageRank usually have a high ratio of incoming links to outgoing links compared to all other pages in the graph.
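To make the idea concrete, here is a minimal sketch of PageRank computed by power iteration on a tiny in-memory graph. It is not the implementation used in this project: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the three-page toy graph are all illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# Hypothetical toy graph: a and b both link to c, and c links back to a.
toy_graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
toy_ranks = pagerank(toy_graph)
```

In this toy graph, c ends up with the highest rank because both other pages link to it, while b, which receives no links, ends up with the lowest.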

Disclaimer: As we’d learn through this project, the pages with the highest PageRank are not necessarily related to the 2008 presidential election.

Importing the Data

The source file quotes_2009-04.txt was 11 GB. It came in this continuous format:

P       http://blogs.abcnews.com/politicalpunch/2008/09/obama-says-mc-1.html
T       2008-09-09 22:35:24
Q       that's not change
Q       you know you can put lipstick on a pig
Q       what's the difference between a hockey mom and a pit bull lipstick
Q       you can wrap an old fish in a piece of paper called change
L       http://reuters.com/article/politicsnews/idusn2944356420080901?pagenumber=1&virtualbrandchannel=10112
L       http://cbn.com/cbnnews/436448.aspx
L       http://voices.washingtonpost.com/thefix/2008/09/bristol_palin_is_pregnant.html?hpid=topnews

  • P denotes the URL of the document.
  • T represents the time of the post.
  • Q is a quote found in the post.
  • L is a link that exists in the post.

This was not an ideal schema for MongoDB. With the use of inputMongo.py, the above input was converted into documents resembling the following:

{
    "_id" : ObjectId("51c8a3f200a7f40aae706e86"),
    "url" : "http://blogs.abcnews.com/politicalpunch/2008/09/obama-says-mc-1.html",
    "quotes" : [
        "that's not change",
        "you know you can put lipstick on a pig",
        "what's the difference between a hockey mom and a pit bull lipstick",
        "you can wrap an old fish in a piece of paper called change"
    ],
    "links" : [
        "http://reuters.com/article/politicsnews/idusn2944356420080901?pagenumber=1&virtualbrandchannel=10112",
        "http://cbn.com/cbnnews/436448.aspx",
        "http://voices.washingtonpost.com/thefix/2008/09/bristol_palin_is_pregnant.html?hpid=topnews"
    ],
    "time" : ISODate("2008-09-09T22:35:24Z")
}
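The contents of inputMongo.py are not shown here, so the following is only a hypothetical sketch of how such a conversion could work. It assumes that each field sits on its own line, that the one-letter tag is separated from its value by whitespace, and that every P line starts a new record. The function name `parse_records` and the shortened sample are illustrative.

```python
from datetime import datetime


def parse_records(lines):
    """Yield one dict per record from P/T/Q/L tagged lines."""
    doc = None
    for line in lines:
        # Split the one-letter tag from its value on the first whitespace run.
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # Skip blank lines and tags without a value.
        tag, value = parts
        if tag == "P":
            # Assumption: a P line starts a new record, so emit the previous one.
            if doc is not None:
                yield doc
            doc = {"url": value, "quotes": [], "links": []}
        elif doc is None:
            continue  # Ignore stray lines before the first P line.
        elif tag == "T":
            # Naive datetime; time zone handling is left out of this sketch.
            doc["time"] = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        elif tag == "Q":
            doc["quotes"].append(value)
        elif tag == "L":
            doc["links"].append(value)
    if doc is not None:
        yield doc


# Shortened version of the record above, with tab-separated tags.
sample = """P\thttp://blogs.abcnews.com/politicalpunch/2008/09/obama-says-mc-1.html
T\t2008-09-09 22:35:24
Q\tthat's not change
Q\tyou know you can put lipstick on a pig
L\thttp://cbn.com/cbnnews/436448.aspx
"""
docs = list(parse_records(sample.splitlines()))
```

Because `parse_records` is a generator, it can be fed an open file object directly, so the 11 GB source never has to be held in memory at once.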

This resulted in 15,312,738 documents. We used bulk insertion rather than inserting documents one at a time. Even so, it took about 1 hour and 48 minutes to insert all of these documents into a collection in the database.
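As a rough illustration of bulk insertion, the sketch below groups documents into fixed-size batches and hands each batch to a collection's `insert_many` method, which is the bulk insert call in current PyMongo. This is not necessarily how the original import script was written: the helper name `insert_in_batches`, the batch size of 1000, and the `FakeCollection` stand-in (used so the example runs without a database server) are assumptions.

```python
from itertools import islice


def insert_in_batches(collection, docs, batch_size=1000):
    """Insert docs through collection.insert_many in fixed-size batches.

    Returns the number of documents handed to the collection.
    """
    iterator = iter(docs)
    total = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        # One call per batch instead of one call per document.
        collection.insert_many(batch, ordered=False)
        total += len(batch)
    return total


class FakeCollection:
    """Stand-in for a real collection so the sketch runs without a server."""

    def __init__(self):
        self.batch_sizes = []

    def insert_many(self, documents, ordered=True):
        # Only record the batch size; a real collection would store the documents.
        self.batch_sizes.append(len(documents))


fake = FakeCollection()
inserted = insert_in_batches(fake, ({"n": i} for i in range(2500)), batch_size=1000)
```

With PyMongo installed and a server running, a real collection object obtained from `pymongo.MongoClient` could be passed in place of `FakeCollection`; the database and collection names would be your own choice.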

