Lab of Advanced Algorithms and Applications
Highlights
We are developing a new semantic-annotation technology for short textual fragments, called TAGME [ACM CIKM 2010, IEEE Software 2012]. This tool has been applied successfully in many contexts involving the clustering, classification, and similarity comparison of short texts.
We are also studying the design of “multi-objective” data compressors, a novel paradigm in which optimization techniques are plugged into the design of data compressors. For example, in Bicriteria Data Compression ([ACM-SIAM SODA14], [ESA2014] to appear), the compressor solves the following problem: "return a compressed file which minimizes the compressed space provided that the decompression time is less than T". Similarly, the compressor can "return a compressed file which minimizes the decompression time provided that the compressed space is less than S bits".
A third line of research is a classic one for our group, pursued for 20 years now: the design of compressed indexes for big data (such as String B-tree [JACM 1999] and FM-index [JACM 2005]). Recently we have designed cache-oblivious compressed versions of those indexes for dictionaries of strings [ESA 2013], and distribution-aware compressed indexes for data collections [Algorithmica 2013].
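The two bicriteria formulations quoted above can be illustrated with a toy sketch. It is not the algorithm from the cited papers: it simply assumes that a handful of candidate compressed outputs have already been produced and measured, and picks the best one under each constraint by exhaustive search. The names `Candidate`, `min_space_within_time`, and `min_time_within_space` are hypothetical and introduced here only for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    """A hypothetical, already-measured compressed output (illustration only)."""
    name: str
    space_bits: int
    decompress_seconds: float


def min_space_within_time(candidates: Sequence[Candidate],
                          max_seconds: float) -> Optional[Candidate]:
    """Smallest output whose decompression time is less than max_seconds (T)."""
    feasible = [c for c in candidates if c.decompress_seconds < max_seconds]
    return min(feasible, key=lambda c: c.space_bits, default=None)


def min_time_within_space(candidates: Sequence[Candidate],
                          max_bits: int) -> Optional[Candidate]:
    """Fastest-to-decompress output whose size is less than max_bits (S)."""
    feasible = [c for c in candidates if c.space_bits < max_bits]
    return min(feasible, key=lambda c: c.decompress_seconds, default=None)
```

Both functions return `None` when no candidate satisfies the constraint. A real bicriteria compressor has to search a far larger space of possible encodings of the same file, so enumerating candidates this way is only meant to make the problem statement concrete.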
- Cerved Group
- Google (Zurich)
- Yahoo! Research (Barcelona)
- Tiscali Italia
-  Yahoo! Faculty Research and Engagement Program 2014
- [2013-2015] Regione Toscana, Net7, StudioFlu, SpazioDati
- [2013-2014] Bassilichi
- [2013-2014] Google Research Award
- [2012-2014] Italian MIUR-PRIN Project "ARS-Technomedia"
- [2011-2012] Telecom Italia Working Capital
-  Google Research Award
- [2010-2012] Italian MIUR-PRIN Project "The Mad Web"
- [2006-2011] Yahoo! Research
- [2009-2012] Italian MIUR-FIRB Project "Linguistica"
RLZAP Dataset » May 9, 2016
The dataset contains four collections of files: three collections of genomes, each belonging to a distinct species, and a set of three 32-bit integer arrays. In particular:
- Cere: collection of 39 strains of the yeast Saccharomyces cerevisiae;
- E. Coli: collection of 33 strains of the bacterium Escherichia coli;
- Para: collection of 36 strains of the yeast Saccharomyces paradoxus;
- DLCP: Differential Longest Common Prefix arrays computed by the Relative-FM data structure from a set of three human genomes.
These files are formatted as follows:
- Cere, E. Coli, Para: text files (ASCII), each a sequence of characters drawn from the alphabet ACTGN.
- DLCP: binary files, each a sequence of signed 32-bit integers in little-endian byte order (as obtained by dumping an array of int32_t into a file with a single fwrite on any modern machine).
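As a minimal sketch of how a DLCP file in the format described above could be read back, the following Python generator decodes signed 32-bit little-endian integers in fixed-size chunks, so that a large file need not be loaded into memory at once. The function name `iter_int32_le` and the chunk size are assumptions made for illustration; they are not part of the dataset or of any accompanying tool.

```python
import struct


def iter_int32_le(path, chunk_values=1 << 16):
    """Yield signed 32-bit little-endian integers from a binary file.

    Hypothetical helper: reads chunk_values integers (4 bytes each) at a time.
    """
    with open(path, "rb") as f:
        while True:
            chunk = f.read(4 * chunk_values)
            if not chunk:
                break  # end of file
            if len(chunk) % 4 != 0:
                raise ValueError("file size is not a multiple of 4 bytes")
            # "<i" = little-endian, signed 32-bit integer
            for (value,) in struct.iter_unpack("<i", chunk):
                yield value
```

The explicit `<` in the format string fixes the byte order, so the values are decoded the same way regardless of the endianness of the machine that reads the file.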
The dataset (gzipped tar file, ~7.5GB) can be downloaded here.
Acubelab at WWW 2016 » April 28, 2016
We are glad to announce that our paper “A Piggyback System for Joint Entity Mention Detection and Linking in Web Queries” has been accepted at the 25th International World Wide Web Conference (WWW 2016). It is now available for download.
ICWSM 2015 accepted paper » May 23, 2015
In this paper we build a novel graph upon hashtags and (Wikipedia) entities (HE-Graph) and we exploit it to address two challenging problems regarding the “meaning of hashtags”: hashtag relatedness and hashtag classification.
We also constructed two datasets for hashtag relatedness and classification. We are happy to release them to the research community, together with the HE-Graph we constructed (Hashtag Datasets).
WWW 2015 accepted papers » January 26, 2015
We announce that our papers “Compressed indexes for string-searching in labeled graphs” and “GERBIL – General Entity Annotator Benchmark” have been accepted at the 24th International World Wide Web Conference (WWW 2015).
See you in Florence