Lab 3 - PageRank on the Wikipedia Corpus

Except where otherwise noted, all portions of this work are Copyright © 2007 University of Washington and are licensed under the Creative Commons Attribution 3.0 License.

The Goal --
    Implement PageRank: turn Wikipedia into a giant link graph, run PageRank on that graph, run it several more times (ideally until the values converge), and report the PageRank of all the articles in a human-readable form.

The Algorithm -- (see section 2.1.1 of the assigned reading)
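As a reminder of what the reading describes, one common formulation of the update rule is PR(p) = (1-d)/N + d * Σ PR(q)/outdeg(q), summed over all pages q linking to p, with damping factor d (conventionally 0.85). Below is a minimal in-memory sketch of one iteration, not the MapReduce version you will write; the class and method names are invented for illustration, and it assumes a closed graph with no dangling pages (so the ranks keep summing to 1).

```java
import java.util.*;

public class PageRankStep {
    // One synchronous PageRank update over an in-memory adjacency list:
    // PR(p) = (1-d)/N + d * sum over in-neighbors q of PR(q)/outdeg(q).
    public static Map<String, Double> step(Map<String, List<String>> links,
                                           Map<String, Double> ranks, double d) {
        int n = links.size();
        Map<String, Double> next = new HashMap<>();
        for (String p : links.keySet()) next.put(p, (1 - d) / n);
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            // Each page divides its current rank evenly among its out-links.
            double share = ranks.get(e.getKey()) / e.getValue().size();
            for (String target : e.getValue())
                next.merge(target, d * share, Double::sum);
        }
        return next;
    }

    public static void main(String[] args) {
        // Tiny three-page graph: A -> B, C;  B -> C;  C -> A.
        Map<String, List<String>> links = Map.of(
                "A", List.of("B", "C"),
                "B", List.of("C"),
                "C", List.of("A"));
        Map<String, Double> ranks = new HashMap<>();
        for (String p : links.keySet()) ranks.put(p, 1.0 / links.size());
        for (int i = 0; i < 50; i++) ranks = step(links, ranks, 0.85);
        System.out.println(ranks); // ranks still sum to ~1.0; C ends up highest
    }
}
```

In the MapReduce version, the "divide rank among out-links" step becomes the map phase and the summation becomes the reduce phase, with the adjacency list passed through alongside the scores.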

The Corpus --
1.  Wikipedia offers a compressed XML file for download that contains every article on the site, including redirect and disambiguation pages.
2.  We've reformatted the data set into 270 files of roughly 30MB each. Each file contains several thousand Wikipedia articles, formatted one per line.
3.  The characters '<' and '>' have been replaced by '&lt;' and '&gt;' because the input format we recommend strips XML tags.
4.  That input format is org.apache.hadoop.mapred.TextInputFormat.class
5.  The title of each article is marked by "<title>Name of the article</title>", in which the brackets have been reformatted (as mentioned above), so the title is now marked by "&lt;title&gt;Name of the article&lt;/title&gt;".
6.  Links to other Wikipedia articles are of the form "[[Name of other article]]".
7.  Find the data files on dfs at /user/jhebert/wikiPieces
8.  Find a more manageable test data file on dfs at /user/jhebert/wikiTest
9.  Find the original XML file on dfs at /wikipedia/wikipedia/enwiki-20061130-pages-articles.xml
10. Get your own copy of the compressed XML download at
11. Get your own copy of the HTML page download at (several languages are available!)
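Given the format above, extracting each article's title and out-links is a small parsing job. The sketch below is plain Java showing what a graph-building mapper might do per line; the class and method names are invented, and real wiki markup has cases this ignores (namespaces, section anchors, nested brackets). Note that links can carry display text after a pipe, as in "[[Target|shown text]]", so we keep only the part before the pipe.

```java
import java.util.*;
import java.util.regex.*;

public class ArticleParser {
    // Titles arrive with escaped angle brackets, per items 3 and 5 above.
    private static final Pattern TITLE =
            Pattern.compile("&lt;title&gt;(.*?)&lt;/title&gt;");
    // Internal links look like [[Target]] or [[Target|display text]].
    private static final Pattern LINK = Pattern.compile("\\[\\[(.*?)\\]\\]");

    public static String title(String article) {
        Matcher m = TITLE.matcher(article);
        return m.find() ? m.group(1) : null;
    }

    public static List<String> links(String article) {
        List<String> out = new ArrayList<>();
        Matcher m = LINK.matcher(article);
        // Keep only the link target, dropping any "|display text" suffix.
        while (m.find()) out.add(m.group(1).split("\\|", 2)[0]);
        return out;
    }
}
```

Because each article occupies exactly one line, TextInputFormat hands a mapper one complete article at a time, which is why this per-line parse suffices.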

One Possible Outline (each of these is a MapReduce) --
1.  graphBuilder - Takes the Wikipedia data and constructs a link graph in a SequenceFile-format file.
2.  pageRankIter - Iteratively updates the PageRank value of each node of a graph stored in SequenceFile format.
3.  pageRankViewer - Converts the SequenceFile format to text and sorts nodes by their PageRank score. It's helpful here to negate the PageRank scores so that the highest-ranked pages appear first.
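The negation trick in step 3 works because Hadoop sorts reducer input keys in ascending order: emit -score as the key and the largest scores come out first. A plain-Java illustration of the same idea (the class and method names are made up; a TreeMap is only a sketch, since exact score ties would collide here, whereas Hadoop's sort keeps duplicate keys):

```java
import java.util.*;

public class RankViewer {
    // Ascending sort on negated scores == descending sort on scores.
    public static List<String> byDescendingRank(Map<String, Double> ranks) {
        TreeMap<Double, String> sorted = new TreeMap<>(); // ascending by key
        for (Map.Entry<String, Double> e : ranks.entrySet())
            sorted.put(-e.getValue(), e.getKey());
        return new ArrayList<>(sorted.values());
    }
}
```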

Complications --
This isn't super efficient (sorting all of the Wikipedia articles in the reduce step of PageRank is slow!), so keep three things in mind:
1.  Don't save this until the last minute.
2.  Set the number of reduce tasks (five is good, more is probably better).
3.  Get extra use out of your PageRank reduce method by also setting it as your combiner class.
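Reusing the reducer as the combiner (item 3) is safe here because the PageRank reducer sums rank contributions, and summation is associative and commutative: pre-summing any subset of a page's incoming contributions on the map side cannot change the final total. A tiny demonstration, with invented names (note that if your reducer also re-emits the adjacency list alongside the score, the combiner path needs care so structural records aren't double-counted):

```java
import java.util.*;

public class CombinerDemo {
    // The reduce function for PageRank's score update: sum the
    // contributions arriving at one page.
    public static double reduce(List<Double> contributions) {
        double total = 0;
        for (double c : contributions) total += c;
        return total;
    }
}
```

Reducing everything at once and reducing two pre-reduced halves give the same total (up to floating-point rounding), which is exactly the property a combiner requires.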

Some recommended--but not mandatory--extension possibilities --
1.  We used the TextInputFormat class because we didn't have the time to figure out how to MapReduce over an XML file. Figure it out, write it up, let us know. It's probably trivial.
2.  Rework your graphBuilder so that it works with the Wikipedia HTML input and then run PageRank on some other languages. Make a couple sweeping conclusions about cultural differences evidenced by differences in highly-ranked pages and include these in your write up.
3.  Link Hack Wikipedia. Use your data to subtly edit important Wikipedia pages to increase the PageRank of a pet cause. Tell us what you did and why it should work.
4.  Implement variably weighted links. Explain your heuristic and convince us that it works.
5.  Run PageRank over the Nutch crawl or your InvertedIndexer over Wikipedia and then combine both data sets into a rudimentary search engine.

Write-up --
1.  How long do you think this project will take you to finish? (Do this before you start)
2.  How much time did you actually spend on the project?
3.  Acknowledge any assistance you received from anyone except assigned course readings and the course staff.
4.  Explain why it's OK to use a combiner class with the PageRank algorithm.
5.  What are the ten pages with highest rank in the provided Wikipedia corpus?
6.  Describe any extensions you implemented.