Introduction
Small Text is a 100% pure Java
library designed to compress and access huge textual files at various levels of
granularity in a efficient way. A textual file is logically
decomposed into records, each consisting of a sequence
of fields having variable length and being in variable number among the records. This feature
provides a significant flexibility to the approach, with respect to (compressed)
relational DBs.
Operationally, a record is a line of the textual file terminated by the new line
character ('\n'), and the fields of that record are substrings of the line having
variable length and being separated by a user defined
character (by default this is '\t'). In addition, a field may contain
multiple values which are separated by a user-specified
separator character that, of course, must be different from the previous ones.
Records and/or fields can be accessed by specifying
their ordinal position in the file. Namely, a record is identified with its
(absolute) ordinal number in the file (namely, its line number), whereas a field
is identified with its (relative) ordinal number in its record (namely, the number of
preceding '\t').
Small Text offers a collection of compression techniques which provide
various trade-offs between compression ratio and time to access records and fields
of the compressed file. Small Text does not incur in the whole
decompression of the file at each record/field access, so that it typically outperforms
standard compression methods (like gzip), which instead
incur in the sequential scan of the whole compressed file (i.e.
zcat).
This two-level logical view of the (compressed) textual file is very useful in the context of
web search and data mining applications. As an example, consider a sequence of
queries issued to a search engine: here, records are the queries and fields are their
composing words. As another example, consider a dictionary of URLs: here, records
are the URL-strings and fields are their URL-parts like protocol, domain, directories,
filename, filetype.
As a final example, consider a graph whose vertices and edges are labeled with
variable length strings. Here, the i-th record consists of the sequence of
strings: label of vertex i, labels of the edges outgoing from vertex i
ordered according to the vertex-ID of the edge destination. Each of these
strings constitutes one field of that record, and all of them are stored as the ith
line of a textual file. Notice that the variability in
number of the fields present in each record is crucial to implement those examples.
File compression
The core of Small Text is the collection of compression techniques made
available through the interface TextDB. Each TextDB implements a different
compression method, suitable for a specific type of text, and offers access to
single records/fields or group of records/fields.
While the building stage requires specific parameters, which depend on
the chosen TextDB, the access stage is uniform over all TextDBs.
Take a look to the APIs to have a complete list of the available
methods (under the package it.unipi.di.textdb).
We report here a list of the most interesting ones:
- Front Coding
Assume that the records (lines) of the input file are lexicographically sorted.
This compression method replaces the prefixes shared by consecutive records with a
proper encoding of their lengths. To support efficient decoding, records are partitioned
into buckets, and the first record of each bucket is stored explicitly.
At query time, when a record (or a group of records) is requested, the containing
bucket is identified and loaded into memory. Then it is sequentially scanned, and records
are reconstructed on-the-fly by appending back their squeezed prefix. This proceeds
until the queried record is found. Compression ratio can be improved
by gzipping each front-coded bucket. However,
the query time slows down because of the additional compression level.
The Front Coding scheme is suitable for records that share long common prefixes as
the case of URLs or terms in a dictionary.
- Bucketed Huffword
This is an extension of the classical Huffman coder to the case in which the dictionary
of symbols consists of terms of the input file (and thus not just characters).
These terms are identified by means of a tokenizer which may be possibly defined
by the programmer.
The input file is first partitioned into buckets, consisting of a fixed-number of
records, and then each bucket is compressed by replacing each term with its codeword.
Huffword assigns shorter codewords to more frequent terms, by means
of a Huffman tree.
At query time, when a record (or a group of records) is requested, the
containing bucket is identified and loaded into memory. Then it is sequentially decompressed
to find the queried record. Actually, codewords are prefix-free and thus they could be
skipped until the queried record is met, thus reducing the decompression time. This compression
method is general purpose and achieves good results in access time and 0-th order entropy (of
the terms) in space occupancy.
- Rank/Select Huffword
The file is compressed using the Huffword compression technique explained above. Access to
the resulting file is obtained by means of the Rank/Select mechanism where records and fields are
marked by bit vectors stored in a compressed way. At run time, the requested record positions
are determined using these vectors in a fast way and the compressed records are then loaded
into memory and uncompressed. Given the same results in space occupancy than the standard Bucketed
Huffword, the use of that bit vectors avoid the sequential access of the compressed blocks resulting
in a faster access. Information on the Rank/Select technology can be found in
Compressed full-text indexes by
Gonzalo Navarro and Veli Mäkinen (see also the references page).
- Bucketed GZip
The input file is divided into buckets, each consisting of a fixed-number of records. Then
every bucket is compressed using gzip. At query time, the bucket containing
the requested record is identified and loaded into memory. Then it is sequentially
decompressed, by using a zcat, until the queried record is found. This technique
is general purpose and achieves good results in space occupancy but not in accessing
time.
To understand deeply how a TextDB works and how to build and use these compression tools,
please take a look at the documentation section of this site.
An interesting application: GraphLabeler
As we commented above, Small Text could be used in many web/data mining applications.
Here we comment on a novel one, which we call the graph labeler. This application
concerns with the efficient storage and access of a graph whose vertices and edges
are labeled with strings of variable length. As far as the structure of the graph
is concerned, the programmer may use her preferred library, such as
WebGraph. The efficiency of
our GraphLabeler depends on the efficiency (in space and time) of the TextDBs used
to store and access the vertex/edge labels of the graph.
Objects of class GraphLabeler (see the
package it.unipi.di.graph)
provide access to two types of information: vertex and edge labels. Each label
is a string of variable length (record). While vertex labels are unique (i.e. one field), edge labels may consists
of many sub-information (i.e. fields), each identified by a unique name. The programmer can deploy a
configuration file to define how these information have to be compressed and associated
to vertices and edges. This file has a simple and intuitive syntax, and is loaded
by the GraphLabeler at run-time.
Check the documentation section of this site to have more
information about how a GraphLabeler works and how to build it.