Introduction

Small Text is a 100% pure Java library designed to compress and access huge textual files at various levels of granularity in a efficient way. A textual file is logically decomposed into records, each consisting of a sequence of fields having variable length and being in variable number among the records. This feature provides a significant flexibility to the approach, with respect to (compressed) relational DBs.

Operationally, a record is a line of the textual file terminated by the new line character ('\n'), and the fields of that record are substrings of the line having variable length and being separated by a user defined character (by default this is '\t'). In addition, a field may contain multiple values which are separated by a user-specified separator character that, of course, must be different from the previous ones. Records and/or fields can be accessed by specifying their ordinal position in the file. Namely, a record is identified with its (absolute) ordinal number in the file (namely, its line number), whereas a field is identified with its (relative) ordinal number in its record (namely, the number of preceding '\t').

Small Text offers a collection of compression techniques which provide various trade-offs between compression ratio and time to access records and fields of the compressed file. Small Text does not incur in the whole decompression of the file at each record/field access, so that it typically outperforms standard compression methods (like gzip), which instead incur in the sequential scan of the whole compressed file (i.e. zcat).

This two-level logical view of the (compressed) textual file is very useful in the context of web search and data mining applications. As an example, consider a sequence of queries issued to a search engine: here, records are the queries and fields are their composing words. As another example, consider a dictionary of URLs: here, records are the URL-strings and fields are their URL-parts like protocol, domain, directories, filename, filetype. As a final example, consider a graph whose vertices and edges are labeled with variable length strings. Here, the i-th record consists of the sequence of strings: label of vertex i, labels of the edges outgoing from vertex i ordered according to the vertex-ID of the edge destination. Each of these strings constitutes one field of that record, and all of them are stored as the ith line of a textual file. Notice that the variability in number of the fields present in each record is crucial to implement those examples.

File compression

The core of Small Text is the collection of compression techniques made available through the interface TextDB. Each TextDB implements a different compression method, suitable for a specific type of text, and offers access to single records/fields or group of records/fields.

While the building stage requires specific parameters, which depend on the chosen TextDB, the access stage is uniform over all TextDBs. Take a look to the APIs to have a complete list of the available methods (under the package it.unipi.di.textdb). We report here a list of the most interesting ones: To understand deeply how a TextDB works and how to build and use these compression tools, please take a look at the documentation section of this site.

An interesting application: GraphLabeler

As we commented above, Small Text could be used in many web/data mining applications. Here we comment on a novel one, which we call the graph labeler. This application concerns with the efficient storage and access of a graph whose vertices and edges are labeled with strings of variable length. As far as the structure of the graph is concerned, the programmer may use her preferred library, such as WebGraph. The efficiency of our GraphLabeler depends on the efficiency (in space and time) of the TextDBs used to store and access the vertex/edge labels of the graph.

Objects of class GraphLabeler (see the package it.unipi.di.graph) provide access to two types of information: vertex and edge labels. Each label is a string of variable length (record). While vertex labels are unique (i.e. one field), edge labels may consists of many sub-information (i.e. fields), each identified by a unique name. The programmer can deploy a configuration file to define how these information have to be compressed and associated to vertices and edges. This file has a simple and intuitive syntax, and is loaded by the GraphLabeler at run-time.

Check the documentation section of this site to have more information about how a GraphLabeler works and how to build it.