Table and Index compression - Mailing list pgsql-hackers
| From | PFC |
|---|---|
| Subject | Table and Index compression |
| Date | |
| Msg-id | op.ux8if71gcigqcu@soyouz |
| List | pgsql-hackers |
With the talk about adding compression to pg_dump lately, I've been wondering if tables and indexes could be compressed too. So I've implemented a quick on-the-fly compression patch for Postgres. Sorry for the long email, but I hope you find this interesting.

Why compress?

1- To save disk space? Disks are so cheap now that this alone is not a valid reason. Besides, anyone with a database big enough to fill a modern hard disk probably also has a monster RAID to get more IO throughput.

2- To make the database faster? This would be a valid reason. If the database is made smaller through compression, this has several interesting effects:
- you need less RAM to cache it entirely, so random IO hits the cache more often than before
- if the CPU can decompress faster than the disk can read, even seq scans can be faster (this will not be the case on monster RAID setups, but for cheap servers it could be)

So, why not? I coded it.

I've benchmarked lzo, lzf, quicklz, lzjb, and fastLZ. The best for this purpose is lzo: very fast decompression, a good compression ratio on a sample of Postgres tables and indexes, and a license that could work. QuickLZ compresses faster and better, but is not free.

Each page is compressed independently, so decompressing a single page is fast. This means 8k pages are a bit small; I used 32k pages to get better compression.

I've checked various table and index files:
- your average table with many columns, some TEXT, etc., compresses about 2x
- when the row header is significant relative to the row data, it can reach 3x
- indexes compress well too, 2-4x
- gist indexes could sometimes be compressed up to 6x

Roughly, 2-3x compression is achievable on a complete database.

--------------------
Implementation

This lives in md.c. The implementation is about 100 lines of code (!). Instead of calling FileWrite and FileRead, md.c calls special functions that read and write compressed blocks.

* Writing: the block is compressed into a temporary memory buffer. A header (prepended to the compressed data) gives the length of that data. Then it is written at the block's place in the disk file.

* Reading: the first 4k of compressed data is read. Looking at the length header, we know how much more to read. The rest of the data (if any) is read, and the data is decompressed.

Since a compressed block can be larger than the original block, I have enlarged the per-block slot in the files, so that there is a block every 40k instead of every 32k (with 32k postgres pages).

That's it, very simple.

--------------------

Now, the reason it works is that the underlying file is handled as a sparse file by the OS. The holes between compressed blocks are removed: not recorded on disk, and never cached either.

However, sparse files can have big performance problems if you do this:
- write a small block in the middle of the file, surrounded by holes
- later enlarge this block

When the block is enlarged, if it needs an extra filesystem page, that page will not be allocated contiguously. When it is read later, it will need an extra seek, which is really bad.

So, looking at the compression statistics:
- gist index for geometric coordinate search: a 32k page is compressed to 1-2 4k-pages, very rarely 3
- btree indexes: a 32k page is compressed to 2-3 4k-pages
- large table: a 32k page is compressed to 2-4 4k-pages

Therefore, on write, I pre-allocate some space in the sparse file by writing more than needed: currently I write 5 4k-blocks. Whatever follows the compressed data is garbage left over in the buffer; it is ignored on reads. (A rough sketch of this read/write scheme follows below.)
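Not the actual patch, but to make the scheme above concrete, here is a minimal sketch of what such md.c-level hooks could look like, assuming 32k pages, one fixed 40k slot per page on disk, a 4-byte length header, and the LZO library's lzo1x_1_compress / lzo1x_decompress_safe calls. The names (cfs_write_block, cfs_read_block, SLOT_SIZE, PREALLOC_BYTES) are made up for illustration, and error handling is reduced to return codes:

    /* Illustrative sketch only; not the actual patch. */
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <lzo/lzo1x.h>      /* lzo_init() must have been called once at startup */

    #define PAGE_SIZE       32768                      /* 32k postgres pages assumed */
    #define SLOT_SIZE       40960                      /* one on-disk slot every 40k */
    #define PREALLOC_BYTES  (5 * 4096)                 /* pre-write 5 x 4k filesystem pages */
    #define LZO_WORST(n)    ((n) + (n) / 16 + 64 + 3)  /* LZO worst-case expansion */

    /* Compress one page and write it into its slot in the (sparse) relation file. */
    static int
    cfs_write_block(int fd, uint32_t blocknum, const char *page)
    {
        unsigned char   out[4 + LZO_WORST(PAGE_SIZE)];
        lzo_align_t     wrkmem[(LZO1X_1_MEM_COMPRESS + sizeof(lzo_align_t) - 1)
                               / sizeof(lzo_align_t)]; /* aligned LZO work memory */
        lzo_uint        clen;
        uint32_t        hdr;
        size_t          towrite;

        if (lzo1x_1_compress((const unsigned char *) page, PAGE_SIZE,
                             out + 4, &clen, wrkmem) != LZO_E_OK)
            return -1;

        hdr = (uint32_t) clen;
        memcpy(out, &hdr, 4);                          /* 4-byte length header */

        /*
         * Always write at least PREALLOC_BYTES so the filesystem allocates the
         * slot contiguously; the bytes past the compressed data are whatever was
         * left in the buffer and are ignored on read.
         */
        towrite = 4 + clen;
        if (towrite < PREALLOC_BYTES)
            towrite = PREALLOC_BYTES;

        return pwrite(fd, out, towrite, (off_t) blocknum * SLOT_SIZE) == (ssize_t) towrite ? 0 : -1;
    }

    /* Read the first 4k of a slot, then the remainder if needed, and decompress. */
    static int
    cfs_read_block(int fd, uint32_t blocknum, char *page)
    {
        unsigned char   in[SLOT_SIZE];
        uint32_t        clen;
        lzo_uint        dlen = PAGE_SIZE;              /* capacity of the output buffer */
        off_t           off = (off_t) blocknum * SLOT_SIZE;

        if (pread(fd, in, 4096, off) != 4096)
            return -1;
        memcpy(&clen, in, 4);
        if (clen > SLOT_SIZE - 4)
            return -1;                                 /* corrupt length header */
        if (4 + clen > 4096 &&
            pread(fd, in + 4096, 4 + clen - 4096, off + 4096) != (ssize_t) (4 + clen - 4096))
            return -1;

        if (lzo1x_decompress_safe(in + 4, clen, (unsigned char *) page, &dlen, NULL) != LZO_E_OK
            || dlen != PAGE_SIZE)
            return -1;
        return 0;
    }

The unconditional PREALLOC_BYTES write is exactly the pre-allocation trick described above: the filesystem allocates the whole slot contiguously on the first write, so later rewrites of the same block almost never have to grow it.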
This means the disk space savings are less than with full compression, but access is much smoother, in fact much like a regular non-sparse file, since the blocks between the holes almost never need to grow. Without pre-allocation, performance is abysmal, not even worth talking about. Pre-allocated but unused blocks are never read, except maybe by OS readahead during a seq scan. On a heavily random-access database they will not be touched, so they waste no space in the OS cache.

--------------------

shared_buffers thus contains decompressed blocks: a row that is updated very often will not go through decompression-compression cycles each time. The OS cache contains compressed data.

Some tests: it appears to behave as expected. It didn't crash (yet...). Basically it looks like RAM has doubled and CPU speed has halved.

- Random-access queries are faster, even on a cold cache, and of course much better cached afterwards, since the amount of data the OS cache can hold is at least doubled.
- Seq scans on a huge table, reading data from disk, are a tiny bit slower, which is strange: on my test box the disks are slow (50 MB/s) and lzo can decompress much faster than that. At least it isn't much slower.
- Seq scans on a cached table that would fit in RAM anyway are slower, because the data needs to be decompressed.
- Seq scans on a table that can only be cached because of the compression are much faster than seq scans without compression, where it has to hit the disk.

Compressing the data pages directly also reduces the need for many ad-hoc schemes like compressing index tuples, etc. Who cares if the row header is huge? If all rows have the same row header, lzo will shrink it down to 2 bytes, and the rest of Postgres doesn't need to know about it. Also, this compression scheme operates on whole pages: no need to split a page if it suddenly becomes larger after compression. Only the storage manager needs to know about it; the OS sparse file handling takes care of the gremlins.

For someone wanting to run a cheap server like this, it could be an interesting tradeoff:
http://www.ovh.com/fr/produits/superplan_best_of.xml

--------------------

I need to clean it up a little and do some more checks before I send you the patch. I am not suggesting this as part of the commitfest for 8.5, but as a proof of concept, since many features are missing:

- Ability to specify per-table/index or per-tablespace compression (per-tablespace would be better, I think). Right now it compresses everything, even the sys catalogs, even the FSM, which is a bit dumb... There should really be an md.c and an md_compress.c, and an ALTER TABLESPACE SET COMPRESSION TO 1; then move tables to it. Then you can keep your tables uncompressed, except that huge fulltext index that eats your RAM, which is now much smaller.
- Check how it corrupts data during recovery (ahem, notice I didn't say "if").
- Check how it works under heavy concurrent load; I don't know under which locks md.c operates.
- Determine good preallocation values; those vary according to the nature of the data, so maybe user-supplied?
- Test with 16k pages too: better caching granularity at the cost of worse compression.
- TOAST: yes, the TOAST table is compressed too, even if it was already compressed by TOAST, so some "tuning" is needed.
- Page header: I want to add 1 byte to the page header to specify whether a page is compressed or not, and the length (in 4k-blocks) of the compressed data (a possible layout is sketched below).
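Purely as illustration of that last item, one guess at what the extra byte could encode (the names and bit layout here are hypothetical, nothing like this exists in the patch yet):

    /*
     * Hypothetical encoding of the proposed page-header byte:
     *   bit 7     - is the page stored compressed?
     *   bits 0..6 - compressed length, in 4k blocks (0..127)
     */
    #define PD_COMPRESSED            0x80
    #define PageIsCompressed(b)      (((b) & PD_COMPRESSED) != 0)
    #define PageCompressedLen4k(b)   ((b) & 0x7F)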
Then, more ideas come to mind:

- Read-only: someone talked about adding a read-only option to a table, which would do some magic tricks to the data when rebuilding the table (CLUSTER, VACUUM FULL), making it more efficient. This would be a good place to also compress it, and to write it as a sparse file without the need for pre-allocation.
- Lazy decompression: right now pages are decompressed on read, which is not optimal. It would be better to issue the next read() immediately and decompress the page later. The page would be stored compressed in shared_buffers; the first process that needs it would decompress it, and it would stay decompressed until written.
- Parallel decompression: a new friend of the bgwriter, the bg-decompressor, hunts recently-read compressed pages in shared_buffers and decompresses them, with one process per CPU core. One core can decompress 200 MB/s or even more... A cheap quad-core CPU, even with crummy SATA disks, would be able to seq scan like a monster RAID (some rough arithmetic on this below).
- Parallel compression: the bg-decompressor could also be a bg-compressor, hunting pages that are about to be written and compressing them; this would speed up pg_restore by reducing the IO.
- Relation extend: right now it seems that, when a relation needs a new page, this page is written (empty) at the end of the relation, then at some point later written again, with data. So it has to be compressed twice (or more), and the first time, since it is empty, it compresses to something like 100 bytes, hence the need for pre-allocation.

All the tests above were done on a Core 2, 4GB RAM (3.5 actually, since it's a 32-bit OS), SATA RAID1, Kubuntu, ext3 filesystem. I'm going to do some checks on an XFS volume right now to see how it handles sparse files...

Cheers,
Pierre
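For what it's worth, a rough back-of-the-envelope check of the parallel decompression idea, using only the figures quoted in this mail (2-3x compression, roughly 200 MB/s of decompression per core, ~50 MB/s disks): a quad core would offer on the order of 800 MB/s of decompression throughput, while 2-3x compression turns 50 MB/s of physical reads into only about 100-150 MB/s of logical data to decompress. So the CPU would not be the bottleneck; the disks would remain it, which is consistent with the seq scan observations above.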