Re: lztext and compression ratios... - Mailing list pgsql-general
From: Jeffery Collins
Subject: Re: lztext and compression ratios...
Date:
Msg-id: 3963CDDA.43643AF7@onyx-technologies.com
In response to: Re: lztext and compression ratios... (JanWieck@t-online.de (Jan Wieck))
List: pgsql-general
Jan,

Thank you for your comments on lztext. They were very informative. I hope you (and the rest of the list) don't mind my ramblings on compression, but I think there is some performance/space advantage if compression can be integrated into a table.

I admit to not being a compression expert. All that I know has been learned from reading some zlib source. The compression that I have been working with (and am more familiar with) is Huffman compression. You are probably familiar with it and may have even evaluated it when you were looking at compression techniques. Compared to LZ compression, as far as I understand it, Huffman compression has some advantages and disadvantages. The disadvantages seem to be:

- In normal Huffman compression, the string to be compressed is examined twice: once to build the translation table and once to do the translation.

- The translation table must be written out with the string. This makes the result larger and limits its effectiveness with small strings.

- I don't know for sure, but I wouldn't be surprised if decompression is slower with Huffman than with LZ compression.

To get around these disadvantages, I modified my Huffman support in the following ways (see the sketch after this list):

- I trained the translation table once and stored it separately. I was able to do this because I knew beforehand the type of text that would be in my columns.

- I "improved" the compression by giving the translation table the ability to look for well-known substrings in addition to byte values. For example, the string "THE" in an English text might have its own Huffman translation value instead of relying on the individual T, H and E translations. Again, I was able to do this because I knew the type of text I would be storing in the column.

Because the translation table is external to the string, it is no longer included with the string. This improves the compression ratio and speeds up decompression, since the table does not need to be read and interpreted each time. This approach also allows for, and takes advantage of, cross-row commonalities. After doing all of this, I was able to get the average compression ratio to about 3.5 (i.e. original size / compressed size) on text columns of 1K or less.

Of course, the disadvantage is that if the dataset changes over time, the compression translations do not change with it, and the compression ratio gets worse. In the very worst case, where the dataset changes entirely (say the language changed from English to Russian), the "compressed" string could actually come out larger than the original string, because everything the compression algorithm thought it knew about the data is wrong.

As I said, I did this with a different database (CTREE). My company is now converting from CTREE to PostgreSQL, and I would like to find some fairly elegant way to include this type of compression support. My CTREE implementation was rather a hack, but now that we are moving to a better database, I would like a better solution for compression. Originally I latched onto the idea of using lztext or something like lztext, but I agree it is better if the column type is something standard and the compression is hidden by the backend. I guess my vision would be to define either a column type or a column attribute that would let me indicate that a column should be compressed, and possibly which compression algorithm and translation set to use.
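To make the idea concrete, here is a rough Python sketch of a trained Huffman coder whose alphabet mixes well-known substrings (like "THE") with single characters. The frequencies, symbols, and function names are hypothetical illustrations, not my actual CTREE code; in practice the table would be trained on representative column data and stored separately from the rows.

import heapq
from itertools import count

def build_code_table(frequencies):
    """Build a Huffman code table (symbol -> bit string) from a
    symbol-frequency dict. Symbols may be multi-character strings."""
    tie = count()  # unique tie-breaker so heapq never compares subtrees
    heap = [(freq, next(tie), sym) for sym, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):   # internal node: recurse both ways
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                         # leaf: a trained symbol
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

def compress(text, codes, substrings):
    """Greedily match trained substrings first, falling back to single
    characters. Assumes every character in `text` appeared in the
    training set. Returns the encoded bit string."""
    out, i = [], 0
    while i < len(text):
        for sub in substrings:        # longest trained token first
            if text.startswith(sub, i):
                out.append(codes[sub])
                i += len(sub)
                break
        else:
            out.append(codes[text[i]])
            i += 1
    return "".join(out)

# Hypothetical frequencies trained offline on English-like column data.
freqs = {"THE": 50, " ": 40, "E": 30, "T": 20, "H": 15, "R": 10, "O": 10}
codes = build_code_table(freqs)
substrings = sorted((s for s in codes if len(s) > 1), key=len, reverse=True)
print(compress("THE OTHER", codes, substrings))

Because the table is built once from trained frequencies rather than per string, nothing but the bit stream needs to be stored with each row, which is where the small-string advantage comes from.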
Actually, the real vision is to have it all hidden and have the database learn about the column as rows are added, modifying the compression translation to adapt to the changing column data.

Jeff