Project 1: PageRank in Python

Due Monday, Jan 22, 2024 at 8pm ET

A PDF version of this document is located here.

In this project, you will implement a basic graph library in Python 3 and then implement a simplified version of PageRank, a famous algorithm in search-engine optimization. The primary learning goal of the project is to gain familiarity with the syntax, data structures, and idioms of Python 3. A secondary goal is to review the basics of abstract data types (ADTs) and object-oriented programming.

The project is divided into multiple suggested phases. We recommend completing the project in the order of the phases below.

You must work alone for this project. As a reminder, you may not share any part of your solution. This includes both code and test cases.

Reported Time to Complete the Project

The following is the time students reported they spent on the project in Winter 2023.

Time to complete project 1. 18% less than 5 hours, 41% 5-9 hours, 27% 10-14 hours, 11% 15-19 hours, 2% 20-24 hours, 1% 25-30 hours, 1% greater than 30 hours

These data are for planning purposes only. We do not consider the exact time spent on the project to be reflective of anyone’s learning or ability. Rather, completing the project regardless of how much time it takes is what is important to achieve the learning goals of the project.

Background

Review of Terminology

We start by reviewing concepts in graph data structures, which you may have seen in your prior discrete mathematics and data structures courses. We will also introduce some terminology used in search-engine algorithms.

A graph is an abstract data type (ADT) that consists of a set of nodes (or vertices) and a set of edges (or links) connecting pairs of nodes. Less formally, graphs model connections or interactions between entities. In this project we model both undirected and directed graphs, either of which may be optionally attributed.

In an attributed graph, nodes and/or edges have additional labels, or attributes, associated with them. For example, a social network is a type of attributed graph in which nodes, or users, are associated with data like age, country, and gender.

_images/digraph.svg — Figure 1 A directed graph.

An undirected graph is one such that an edge from node \(u\) to node \(v\) is equivalent to an edge from node \(v\) to node \(u\), whereas in a directed graph those edges are not the same. Figure 1 shows a small directed graph.

In a directed graph, an edge leaving \(u\) and arriving at \(v\) may be called a forward link of \(u\), whereas an edge leaving \(v\) and arriving at \(u\) may be called a backlink of \(u\). For example, in Figure 1, node \(e\) has one forward link and two backlinks.

In an undirected graph, the degree of node \(u\) is the number of edges to which \(u\) is incident. In a directed graph, the in-degree of node \(u\) is the number of edges that arrive at \(u\), or equivalently the number of backlinks of \(u\); the out-degree of node \(u\) is the number of edges that leave \(u\), or equivalently the number of forward links of \(u\). In Figure 1, node \(e\) has an out-degree of 1 and an in-degree of 2. If the graph were not directed, node \(e\) would simply have a degree of 3.

PageRank

In 1998, Michigan alum Larry Page and his fellow Stanford PhD student Sergey Brin introduced PageRank, a novel web search ranking algorithm. PageRank was the foundation of what became known as the Google search engine. More generally, PageRank can be used to approximate the “importance” of any given node in a graph structure.

Intuitively, a node in a graph will have a high PageRank if the sum of the PageRanks of its backlinked nodes are high. Thus, a node’s PageRank depends not only on the number of backlinks it has, but also the importance of those backlinked nodes. For example, if a node \(u\) only has one backlink from node \(v\), but node \(v\) has a high PageRank, then node \(u\) will also have a relatively high PageRank.

A real-world example is the Twitter graph of who-follows-whom, where a directed edge from user A to user B exists if A follows B. Famous people like Barack Obama have millions of followers, some of whom are also famous with millions of followers. Thus, these users, when represented as nodes in a graph, will have high PageRank. Furthermore, such celebrities’ high PageRank will contribute to the PageRank of the users that they follow. If Barack Obama follows another user and is that user’s only follower, that user will still have a relatively high PageRank.

Note that this definition of PageRank is inherently recursive: computing a node’s PageRank depends on other nodes’ PageRanks, which in turn depend on other nodes’ PageRanks, and so on. However, Page and Brin show that the PageRank algorithm may be computed iteratively until convergence, starting with any set of assigned ranks to nodes[1].

The PageRank computation models a theoretical web surfer. Given that the surfer is on a particular webpage, the algorithm assumes that they will follow any of the outgoing links with equal probability. Thus, the PageRank value of the webpage is divided equally among each of the linked webpages, raising each of their values. Using this reasoning, we end up with the following formula for computing a webpage \(u\)’s PageRank:

\[PR_{k+1}(u) = \sum_{v \in BL(u)}\frac{PR_{k}(v)}{v^+}\]

Here, \(BL(u)\) is the set of nodes that link to \(u\) (i.e. the nodes in its backlinks), and \(v^+\) is the out-degree of node \(v\).

The reasoning above, however, breaks down when it comes to a sink webpage that has no outgoing links: it assumes that the surfer gets stuck, never leaving that webpage. Instead, we will assume that the surfer restarts their session at a random webpage. Thus, the sink contributes \(1/N\) of its PageRank value to every webpage, where \(N\) is the total number of pages. The formula then becomes:

\[PR_{k+1}(u) = \sum_{v \in BL(u)}\frac{PR_{k}(v)}{v^+} + \sum_{w\ \textrm{is a sink}}\frac{PR_{k}(w)}{N}\]

Finally, we introduce a damping factor \(d \in (0, 1]\), which models the fact that surfers may restart their web session even if they are on a page that does have outlinks. Thus, we damp a webpage’s PageRank by multiplying it by \(d\), and then we assume that the total residual probability \(1-d\) is distributed among all webpages, so that each page receives \((1-d)/N\) as its share[2]:

\[PR_{k+1}(u) = \frac{1-d}{N} + d\left(\sum_{v \in BL(u)}\frac{PR_{k}(v)}{v^{+}} + \sum_{w\ \textrm{is a sink}}\frac{PR_{k}(w)}{N}\right)\]

We apply this update rule iteratively, with a uniform initial distribution of \(PR_0(u) = 1/N\). In the absence of cycles[3], this algorithm will converge after a small number of iterations, on the order of 50 for large sets of webpages. The PageRanks can then be considered the relative probability that a surfer will visit each webpage, and the PageRanks of all webpages add up to 1.

The PageRank problem can also be formulated in terms of matrix algebra. We do not use that formulation here, but you can refer to the original paper for details if you are interested.

Distribution Code

Use the following commands to download and unpack the distribution code:

$ wget https://eecs390.github.io/project-pagerank/starter-files.tar.gz
$ tar xzf starter-files.tar.gz

The distribution code consists of the following files:

`graph.py`	Definition of the graph ADTs.
`pagerank.py`	Implementation and driver for computing PageRanks.
`graph_test.py`	Basic test cases. Add your own to this file.
`graph_test.expect`	Expected output from running `graph_test.py`. Update this when you add more test cases.
`Makefile`	A Makefile for running test cases.
`characters-nodes.csv`	The nodes and their attributes for a small graph.
`characters-edges.csv`	The edges and their attributes for a small graph.
`characters-pagerank.expect`	Expected output from running `pagerank.py` on the `characters` graph.
`email-Eu-core.txt-nodes.csv`	The nodes for a medium graph from a SNAP email dataset.
`email-Eu-core.txt-edges.csv`	The edges for a medium graph from a SNAP email dataset.
`email-Eu-core.txt-pagerank.expect`	Expected output from running `pagerank.py` on the `email-Eu-core.txt` graph.

Data Format

The distribution code includes a function read_graph_from_csv(), which constructs a graph from spreadsheets in CSV format. A node file must start with a header row, where one of the columns must be Id. The remaining columns may have arbitrary labels. Each data row must have a value for each column. The function creates a directed or undirected graph, and for each data row in the node file, it adds a node to the graph. The node’s ID is the string value in the Id column, and the remaining columns hold the attribute values for the node, with the column labels as the attribute names. An edge file has a similar structure, except that the header must contain entries Node_Id_1 and Node_Id_2. These columns contain the IDs of the the two nodes connected by the edge.

Phase 1: Graph ADTs

Start by reading through the code in graph.py, which contains the following classes:

GraphError: An exception class used by the graph ADTs, derived from the built-in Exception.
Node: Represents a node in a graph, with an ID and attributes.
Edge: Represents an edge in a graph, connecting two nodes and with its own attributes.
BaseGraph: The base class for both directed and undirected graphs. Most of the functionality in the graph ADTs is implemented here.
UndirectedGraph: A class representing an undirected graph, derived from BaseGraph.
DirectedGraph: A class representing a directed graph, derived from BaseGraph.

`GraphError`

The GraphError class contains the following methods:

__init__: The special method in Python for initializing an object, i.e. the constructor. For this class, the constructor takes an optional argument representing the message associated with the exception, defaulted to an empty string. This message should be saved as an instance variable, an attribute of the object itself using the self argument.
__str__: The special method for retrieving the printed string representation of an object. For this class, the __str__ method should produce the saved message.
__repr__: The special method for producing a code representation of the object. Calling the built-in eval() on the result should produce an equivalent object.

Fill in the implementations of the first two methods.

The strings at the beginning of the class and each method are docstrings, serving as their documentation. They are surrounded by triple quotes, allowing the strings to span multiple lines. The docstring for the class also contains some basic test cases, called doctests, which can be run using the Python doctest module. Normally, the doctests can be invoked from the command line with the following command:

$ python3 -m doctest <python file>

However, for the graph.py module, we need to configure the doctest module to ignore exception details, since some of the doctests should produce an exception when run. As such, we have arranged for the doctests to run when invoking the graph.py module itself:

$ python3 graph.py

Once you have implemented GraphError, its doctests should pass, but the remaining doctests will fail until you implement the rest of the module.

`Node`

The Node class is used in the interface for the graph ADTs. Looking up a node ID will produce a Node object, with the given ID and the attributes associated with the given node. The class contains a constructor, which takes in an ID and any number of keyword arguments (also see the section on functions in the course notes). These are packaged into a dictionary by Python, which you should store as an instance variable. In keeping with Python convention, use a variable name that begins with an underscore to mark it as a private variable.

A node ID can be any hashable object, including a tuple, but all node IDs used in a particular graph must be mutually comparable so that they can be sorted. You should not make any further assumptions about the type of an ID.

The attributes() method should return a copy of the dictionary received by the constructor. This is so that if a user modifies the dictionary returned by attributes(), it does not modify the dictionary stored in the Node object itself:

>>> n = Node('foo', a=3)
>>> n.attributes()
{'a': 3}
>>> n.attributes()['a'] = 4
>>> n.attributes()
{'a': 3}

A shallow copy of the dictionary suffices, since the attributes we will use are immutable (strings and numbers). Be aware, however, that Python has reference semantics, so assigning one variable to another does not make a copy of the underlying object; rather, it copies the reference (i.e. pointer) to the object, so that both variables refer to the same object. The built-in id() function illustrates this:

>>> x = []
>>> y = x
>>> id(x), id(y)
(4452125376, 4452125376)
>>> x.append(3)
>>> x, y
([3], [3])

There are several ways to copy a container; one way is to invoke the constructor for the container type with the original container as the argument:

>>> y = list(x)
>>> id(x), id(y)
(4452125376, 4452190272)
>>> x.append(5)
>>> x, y
([3, 5], [3])

This results in a shallow copy of the container – it copies over the elements from the original container to the new one, but the elements are themselves references, so both containers refer to the same set of objects.

The __str__ method should produce a representation of the Node in the following format:

The first line of the string should be Node [<id>], where <id> should be the ID of the node.
The remaining lines should the attributes of the node, one per line, in lexicographically increasing order by attribute name. Such a line should begin with exactly four spaces, then the attribute name, a space, a colon, another space, and then the value of the attribute. The line should be terminated with a newline character.

The following is an example:

>>> n = Node('bar', a=4, b=3)
>>> str(n)
'Node [bar]\n    a : 4\n    b : 3\n'

In order to construct the string representation, we recommend using Python 3’s string formatting functionality. You may also find the built-in sorted() function useful for ordering attributes, and the join() method of strings for combining substrings with a separator between them.

`Edge`

The Edge class is also used in the interface for the graph ADTs, representing a directed edge between two nodes. The constructor takes in the two Node objects and the attributes of the edge itself. The nodes() method returns a tuple containing the two Node objects, in the same order as passed to the constructor. The string representation is similar to that of Node, except that the initial line has the form:

Edge from node [<id1>] to node [<id2>]

Here, <id1> and <id2> should be the IDs of the two Nodes.

`BaseGraph`

The BaseGraph class implements the core functionality of a graph ADT, and it is the base class for the other graph classes. It has the following methods:

__init__: The constructor initializes the graph to be empty, so that it contains no nodes or edges.

Expected runtime: \(O(1)\)
__len__: A Python special method, called by the built-in len() function. For BaseGraph, it should return the number of nodes in the graph.

Expected runtime: \(O(1)\)
add_node: Takes in a node ID and keyword arguments representing the attributes of the node. Adds a corresponding Node to the graph. Raises a GraphError if the graph already contains a node with the given ID.

Expected runtime: \(O(1)\)
node: Returns the Node object corresponding to the given ID. If the ID does not match a node in the graph, a GraphError is raised.

Expected runtime: \(O(1)\)
nodes: Returns a list of all nodes in the graph, sorted by node ID in lexicographic order. You may find the built-in sorted() function useful here.

Expected runtime: \(O(N \log N\)), where \(N\) is the number of nodes in the graph
add_edge: Adds an edge between the nodes corresponding to the given IDs. The keyword arguments are the attributes of the edge. If either of the given IDs does not match a node in the graph, raises a GraphError. Also raises a GraphError if the graph already contains an edge in the same direction between the nodes (we are not implementing multigraphs in this project).

Expected runtime: \(O(1)\)
edge: Returns an Edge object corresponding to the edge between the given node IDs. Order is relevant here: the edge corresponds to the call to add_edge() with IDs in the same order as the arguments to edge(). Raises a GraphError if no such edge is in the graph.

Expected runtime: \(O(1)\)
edges: Returns a list of the edges in the graph, sorted in lexicographic order by the pair of node IDs corresponding to each edge.

Expected runtime: \(O(E \log E\)), where \(E\) is the number of edges in the graph
__getitem__: A Python special method for overloading the subscript (square-brackets) operator on an object. For BaseGraph, the method should do the following in order:
- If the given argument is the ID of a node in the graph, return the corresponding Node object with that ID.
- Otherwise, if the given argument matches the pair of IDs corresponding to an edge in the graph, return the corresponding Edge object.
- Otherwise, raise a GraphError.
Do not use isinstance() or type() to query the type of the given argument. Rather, implement the logic above, which can be done without knowledge of the argument’s type.

Expected runtime: \(O(1)\)
__contains__: A Python special method for overloading the in operator. For BaseGraph, it should return True if the the argument is the ID of a node in the graph, or if it is a pair of IDs corresponding to an edge in the graph. Otherwise, the method should return False. Use similar logic as for __getitem__ above.

Expected runtime: \(O(1)\)
__str__: Returns a string representation of the graph. The nodes and edges are stringified in lexicographic order by ID(s).

Expected runtime: \(O(N \log N + E \log E)\), where \(N\) is the number of nodes and \(E\) is the number of edges in the graph

The data representation for BaseGraph is up to you. Any representation is acceptable, as long as the interface and the expected runtimes are met.

`UndirectedGraph`

The UndirectedGraph class derives from BaseGraph and represents an undirected graph. Adding an edge between two nodes in an undirected graph should introduce edges in both directions between the two nodes. Looking up an edge using edge() should return the Edge object corresponding to the given direction. Retrieving all edges using edges() should result in a list that includes Edge objects in each direction for each pair of nodes connected by an edge.

We do not permit self-loops in UndirectedGraph. Your code should raise an exception if a self-loop is attempted to be added.

The distribution code includes the following methods:

__init__: The constructor, which should delegate most of the work to the base-class constructor using super().

Expected runtime: \(O(1)\)
degree: Returns the degree of the node with the given ID. Raises a GraphError if no such node is in the graph.

Expected runtime: \(O(1)\)

Hint: What data structure can you use here that provides \(O(1)\) expected access time? How can you efficiently ensure that the data structure has the necessary data in it? You might need to add extra work in other operations, but it shouldn’t change the expected runtime of those operations.

You may add additional methods or override any base-class methods. Make sure that you don’t repeat yourself! If the base-class method suffices, then do not override it. If you need to add additional functionality on top of what the base-class method does, then use super() to delegate the shared functionality to the base-class method.

`DirectedGraph`

The DirectedGraph class also derives from BaseGraph. It includes a constructor as well as in_degree() and out_degree() methods for retrieving the in- and out-degree of a node (expected runtime: \(O(1)\)). As with UndirectedGraph, you may add methods or override methods from BaseGraph.

Unlike UndirectedGraph, we permit self-loops in DirectedGraph. Such an edge should add one to both the in-degree and out-degree of the adjacent node.

Testing

The doctests include only basic tests, and you should write further tests of your own. A handful of test cases are in graph_test.py, and you should add your tests to this file. Modify graph_test.expect as needed to contain the expected output. Use the assert construct to assert that a condition is true in your test cases.

As with testing any ADT, your test cases should only rely on the public interface for the graph module. Thus, your test cases should be able to run successfully with our implementation of graph.py.

Phase 2: PageRank

Now that you have working graph ADTs, proceed to implement the PageRank algorithm itself. The formula you are to implement is as follows:

\[PR_{k+1}(u) = \frac{1-d}{N} + d\left(\sum_{v \in BL(u)}\frac{PR_{k}(v)}{v^{+}} + \sum_{w\ \textrm{is a sink}}\frac{PR_{k}(w)}{N}\right)\]

Fill in the pagerank() function in pagerank.py to iteratively apply this formula on the given graph. The num_iterations argument specifies the number of iterations to run, and the damping_factor argument corresponds to the value of \(d\) in the formula above. Start with an initial uniform distribution of \(PR_0(u) = 1/N\) for each node \(u\), where \(N\) is the number of nodes in the graph.

The PageRank formula above specifies an out-of-place computation: the new values of \(PR_{k+1}(u)\) should be computed in a separate data structure from the old values of \(PR_k(u)\). This is to ensure that we get consistent results that are independent of the order in which the algorithm processes nodes and edges.

Read through the rest of pagerank.py in order to understand the command-line interface and how to run the doctests.

Since pagerank() is not part of the graph ADTs, it should only rely on the public interface of the latter. A key component of modular programming is separation of concerns – a module should only implement the functionality that is appropriate for the abstraction the module is providing. In the case of this project, we deliberately separate the graph and pagerank modules. Functionality that is specific to the PageRank algorithm (e.g. sinks, backlinks) should go in the pagerank module, not the graph module. Thus, any helper functions you write for PageRank should go in pagerank.py and only depend on the public interface for the graph classes.

Performance

While performance is not a focus of this project, your implementation should be reasonably efficient. Asymptotically, the algorithm should run in \(O(N \log N + E \log E + m(N + E))\) time, where \(m\) is the number of iterations, \(N\) is the number of nodes in the graph, and \(E\) is the number of edges:

Invoking nodes() and edges() on the graph takes \(O(N \log N + E \log E)\) time. This should only be done once, since the graph structure does not change over the course of the algorithm.
Computing the sinks should take \(O(N)\) time. This should only be done once.
Computing the backlinks should take \(O(N + E)\) time. This should only be done once.
In each iteration, computing the sink contribution should take \(O(N)\) time. This should only be done once in each iteration, since it is the same for all nodes. Computing the backlink contributions should take \(O(E)\) time across all nodes (the total number of backlinks is exactly \(E\)).

Running the PageRank algorithm with the default arguments on the email-Eu-core.txt graph should take no more than a minute on an average machine. Our solution takes about a second to run PageRank on this graph on a slow machine.

A larger graph dataset of Twitter connections is available as follows:

$ wget https://eecs390.github.io/project-pagerank/twitter_combined.tar.gz

As an optional challenge, try to improve the efficiency of your code so that you can compute its PageRanks within 1-2 minutes.

Grading

This project will not be hand graded, so your entire score will be determined by your results on the autograder. We will test your graph.py and pagerank.py both together as well as independently with our own implementations of the two modules, so make sure that you only use the public interfaces defined in the spec and starter code.

You are required to adhere to the coding practices in the course style guide. We will use the automated tools listed there to evaluate your code. You can run the style checks yourself as described in the guide.

Submission

Submit graph.py, graph_test.py, and pagerank.py to the autograder before the deadline.

Acknowledgments

The project was written by Tara Safavi and Amir Kamil in Fall 2017. The medium and large datasets for this project are from the Stanford Network Analysis Project (SNAP).

Frequently Asked Questions

Q: I’m getting an error such as TypeError: 'str' object is not callable.

A: As in C++, you cannot have a field (member/instance variable) in Python with the same name as a method (member function). Use the Python convention for naming non-public fields:

Use one leading underscore only for non-public methods and instance variables.
Q: The graph doctests fail for the cases that involve exceptions. My output has graph.GraphError instead of just GraphError.

A: Refer to the GraphError section for the correct way to run the graph doctests.
Q: My pagerank implementation computes the correct results for the top 10 nodes, but the sum of all the ranks is not equal to 1.

A: The most common culprit is doing an in-place rather than an out-of-place computation of the ranks in each iteration. Often, this inadvertently happens due to Python’s reference semantics. Assigning one variable to another does not make a copy of the underlying object, but rather copies the reference (i.e. pointer) held by the variable, so that both variables refer to the same object.
Q: My pagerank implementation is too slow.

A: We recommend looking at two things. First, make sure that you are not unnecessarily repeating an expensive computation. For instance, if you are recomputing which nodes are sinks in each iteration of the algorithm, that is unnecessary recomputation since the graph does not change. Second, make sure that your graph operations are implemented efficiently. For example, checking whether a graph contains an edge should be an expected \(O(1)\) operation, not linear. Obtaining the in/out degree of a node can be done in \(O(1)\) time. As a hint, think about an implementation of a linked list with a size() operation. The naïve implementation would traverse the list to count the elements, taking linear time. Can you think of a way to reduce the time to constant without making other operations asymptotically more costly?