Datasets

Download the AOL dataset
How to use the dataset

The AOL query log

We have modeled the AOL query-log as an undirected bipartite labeled graph with two kinds of vertices: query and URL, and with (multi-)labels on both edges and vertices. A query/URL vertex is labeled with a query/URL occurring in the AOL query-log, respectively. An edge (q,u) between a query vertex q and a URL vertex u is defined if query q has lead some user of AOL to click on the URL u. We prune from the graph all (query) vertices that have no edge incident into them, i.e. queries that do not lead to any click. The graph consists of a total of 6.444.438 nodes (4.811.650 queries and 1.632.788 URLs) and 10.741.956 undirected edges. We store and process the adjacency list of the graph by means of the WebGraph library.

Additionally, the graph is (multi-)labeled on its edges. Namely, each edge (q,u) is associated with the following attributes:
  1. click: the number of times that q has clicked on u in the AOL query-log.
  2. session: the IDs of the user sessions containing these clicks (multi-value attribute).
  3. time: the timestamps of these clicks (multi-value attribute).
  4. rank: the positions of u in the ranked lists of answers returned for q (multi-value attribute).
We note that the existence of attributes with multi-values on edges is justified by the fact that a pair (q,u) may occur many times in the AOL query-log, because many times the query q may be issued by some user (possibly the same one) and this user has then clicked on u. Therefore, each of those pairs has its own timestamp and ranked position in the list of results returned by the search engine. Although these positions refer to the same URL for a given query they may be different because the search engine might have changed the ranking of u for q in the meanwhile. So that, instead of defining a (multi-)edge, one edge per pair occurrence, we preferred to define a (single-)edge with (multi-)attributes thus enforcing locality of reference during the compression and accessing operations. As an example, suppose that the query "aol" has clicked 10 times onto the URL "http://www.aol.com" in the AOL query-log. We then create an edge between the corresponding query and URL vertices, and then set the click attribute to the value 10, and associate 10 distinct values for the attributes session, time and rank. Each of these values will be separated by the special character sequence "::". Additionally, these values will be aligned, in the sense that the i-th values of these three attributes will refer to the same user click.

The vertex and edge labels are stored by means of an instance of our GraphLabeler class, which achieves the following space vs. time performance on the AOL query-log.

Dataset Original Size Compressed Size Compression Ratio TextDB Compressor
Query-node labels108MB43MB40% it.unipi.di.textdb.BucketedHuffword
URL-node labels46MB22MB48% it.unipi.di.textdb.FrontCoding
click edge attribute42MB15MB36% it.unipi.di.textdb.BucketedHuffword
rank edge attribute97MB36MB37% it.unipi.di.textdb.BucketedHuffword
session edge attribute316MB164MB52% it.unipi.di.textdb.BucketedHuffword
time edge attribute759MB278MB36% it.unipi.di.textdb.BucketedHuffword
Total1.4GB557MB40%


Download

How to use the AOL dataset

We propose a fully-commented Java example that shows how to use the AOL query-log graph stored into the above files. The example is structured into three parts concerning with the initialization of the data structures, the fetching of vertex labels, and the fetching of the values for the edge-attribute click.
In detail, the Java example deploys WebGraph to know the vertex IDs connected by edges in the query-log graph, and deploys GraphLabeler to fetch the corresponding vertex and edge labels. Precisely, we start by loading in memory all WebGraph and GraphLabeler data structures. Second, we scan all nodes in the graph and, for each of them, we determine the IDs of its adiacent vertices by using WebGraph. Also, we fetch from disk the labels corresponding to these IDs (which are either URL or query strings) by using GraphLabeler. Finally, for each node in the graph, we fetch the value of the attribute click for all of its outgoing edges still using GraphLabeler.

At the end of each part some statistical information is dumped on stdout as: the number of the fetched strings, the rate at which these strings have been fetched form disk and the overall time to complete. Running this experiment on our machine (SMP architecture with two processors of type Intel(R) Pentium(R) 4 CPU 3.00GHz) we have taken the following times:
  1. ~2 secs to open the labeler and its underlying databases
  2. ~80 secs @ 265K str/sec for the second part (i.e. fetching the vertex labels)
  3. ~18 secs @ ~1.1M str/sec for the third part (i.e. fetching the click attributes)
In the latest two cases we fetched approximately 21.5 millions of textual strings (one per edge). It is clear that these result strongly depend on our disk/hardware performance.

To run the provided Java class you must have on your CLASSPATH the SmallText library and its dependencies together with the WebGraph library and its dependencies (check out the WebGraph documentation). Then, having unzipped the entire dataset and the Java example into the directory aol-dataset, you can run the class by doing:
$ cd aol-dataset/
$ java -Xmx512m AOLGraphExample webgraph/wg-aol graphlabeler/graphlabeler.conf