Compressed Text Indexes: From Theory to Practice! - http://arxiv.org/abs/0712.3360
Compress text to 40-80% of its original size and: (1) execute 20K-50K count queries per second for patterns of 20 chars, (2) locate 100K patterns/sec, and (3) decompress from arbitrary locations at 1MB/sec. Well, not exactly, but you'll have to read it for the full details :) - Sanjeev Singh
Hm, Navarro. He is impressively productive - http://www.dcc.uchile.cl/~gnavar... - Amund Tveit
I implemented this to play with a few months ago. It compresses well (it's bzip2, after all), but searching doesn't come close to the performance of a dedicated index -- because, just like decompressing bzip2, you're chasing pointers all the time, which pretty much guarantees a cache miss for every character in your search string. There are ways to improve it by a constant factor (and to slightly change the representation so that you can parallelize one search), but I couldn't get it to perform well enough. - Tudor Bosman
and, Sanjeev, you probably know what I was trying to do with this, too :) - Tudor Bosman
OK, I spoke too soon (I saw two Italian names and a discussion about compressed text indexes, and I thought the article was referring to the same thing I played with). That may still be the case (Paolo Ferragina was one of the people behind both, and I haven't read the article yet), but I was talking about the FM-index: http://www.mfn.unipmn.it/~manzin... - Tudor Bosman
Is there a compression scheme that doesn't involve chasing pointers all the time? It seems like that's sort of what compression is mostly about (the whole eliminating redundancy through backreferences thing). I guess if you can fit a Huffman table in your cache... but that's not really a big deal. - ⓞnor
Huffman decompression is linear, and so cache-efficient (if, as you said, you can fit the table in your cache). LZ77 (gzip) decompression means following backreferences -- references to text you've already decompressed, and hopefully still in your cache for reasonably small block sizes. - Tudor Bosman
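(A minimal Python sketch of the back-reference chasing described above; the token format is invented for illustration and is far simpler than gzip's actual DEFLATE encoding:)

    # Hypothetical token stream: an int is a literal byte, a (distance, length)
    # pair copies bytes already written to the output -- each copy can land
    # anywhere in the window behind the cursor, hence the pointer chasing.
    def lz77_decode(tokens):
        out = bytearray()
        for tok in tokens:
            if isinstance(tok, int):        # literal byte
                out.append(tok)
            else:                           # back-reference
                dist, length = tok
                for _ in range(length):     # byte-by-byte so copies may overlap
                    out.append(out[-dist])
        return bytes(out)

    # "abc" as literals, then an overlapping back-reference of length 5.
    assert lz77_decode([97, 98, 99, (3, 5)]) == b"abcabcab"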
bzip2 uses the Burrows-Wheeler transform to permute the original text into a string of equal length that compresses better with RLE (bzip2 applies Huffman coding afterwards). During decompression, after the cache-efficient and simple Huffman and RLE decoding stages, you essentially have to follow one pointer for every byte of decompressed text, and that pointer can lead you anywhere in a ~500k block. - Tudor Bosman
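(To make that concrete, here is a toy inverse BWT in Python, assuming we're given the block's last column plus the index of the original row -- roughly what a BWT-based compressor records per block. Each decoded byte costs one jump through nxt[] to an arbitrary position in the block:)

    def inverse_bwt(last_col, primary):
        n = len(last_col)
        # Stable sort of the last column: nxt[i] is the row whose rotation
        # starts one position later in the text -- the per-byte pointer.
        nxt = sorted(range(n), key=lambda i: last_col[i])
        out, i = [], primary
        for _ in range(n):
            i = nxt[i]                  # one random-access jump per output byte
            out.append(last_col[i])
        return ''.join(out)

    # "annb$aa" is the BWT of "banana$"; the original text is row 4.
    assert inverse_bwt("annb$aa", 4) == "banana$"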
Anyway: the insight in Ferragina and Manzini's work is that the data structure produced by the Burrows-Wheeler transform is very similar to a suffix array (http://en.wikipedia.org/wiki...), which is quite useful for full-text indexing. And yes, we can use it for both, and efficiently implement "decompress starting from the first occurrence of a query string". - Tudor Bosman
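(A toy version of the counting side of that, assuming all we keep is the BWT string itself; rank() here is a naive O(n) scan where real FM-indexes use compressed rank structures, but the backward-search loop is the actual algorithm:)

    def count(bwt, pattern):
        # C[c] = number of characters in the text strictly smaller than c.
        C, total = {}, 0
        for c in sorted(set(bwt)):
            C[c] = total
            total += bwt.count(c)
        def rank(c, i):                      # occurrences of c in bwt[:i]
            return bwt[:i].count(c)
        lo, hi = 0, len(bwt)                 # current suffix-array interval
        for c in reversed(pattern):          # extend the match backwards
            if c not in C:
                return 0
            lo, hi = C[c] + rank(c, lo), C[c] + rank(c, hi)
            if lo >= hi:
                return 0
        return hi - lo                       # number of occurrences

    assert count("annb$aa", "ana") == 2      # "ana" occurs twice in "banana$"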
yeah, searching is still relatively slow, "100 to 1000x slower than pure suffix arrays". On the other hand, compressed text indexes are pretty new and I think aggressive algorithm engineering hasn't happened yet. - Sanjeev Singh
Also, there are other small disadvantages: unlike pure suffix arrays, you don't automatically know *where* your needle is in the haystack -- the only operation you have is "decompress starting from the match onwards"; you have to encode the text in some way (inserting self-referential position markers) and decompress until you find the next marker, which tells you where you are in the string. - Tudor Bosman
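(The textbook form of that marker trick is suffix-array sampling: record the text position for a subset of rows, and walk any other match backwards with LF-steps until you hit a recorded row, counting the steps. A sketch reusing the "banana$" example above, with all names invented for illustration:)

    def locate(bwt, C, samples, row):
        def rank(c, i):                      # occurrences of c in bwt[:i]
            return bwt[:i].count(c)
        steps = 0
        while row not in samples:            # LF-step: move one text position left
            row = C[bwt[row]] + rank(bwt[row], row)
            steps += 1
        return samples[row] + steps

    # For bwt("banana$") = "annb$aa": rows 4, 2, 0 correspond to text positions
    # 0, 3, 6 (every third position sampled); C as in backward search.
    C = {'$': 0, 'a': 1, 'b': 4, 'n': 5}
    assert locate("annb$aa", C, {4: 0, 2: 3, 0: 6}, 3) == 1   # row 3 = "anana$"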