Table and Index compression - Mailing list pgsql-hackers

From: PFC
Subject: Table and Index compression
Msg-id: op.ux8if71gcigqcu@soyouz
With the talk about adding compression to pg_dump lately, I've been  
wondering if tables and indexes could be compressed too.
So I've implemented a quick on-the-fly compression patch for postgres.

Sorry for the long email, but I hope you find this interesting.

Why compress?

1- To save disk space?
Disks are so cheap now that it is not a valid reason.
Besides, anyone with a database big enough to fill a modern hard disk
probably also has a monster RAID to get more IO throughput.

2- To make the database faster?
This would be a valid reason.

If the database is made smaller through compression, this can have several
interesting effects:
- you need less RAM to cache it entirely, so random IO hits the cache more
often than before
- if the CPU can decompress faster than the disk IO, even seq scans can be
faster (this will not be the case on monster RAID setups, but for cheap
servers, it could be)

So, why not? I coded it.

I've benchmarked lzo, lzf, quicklz, lzjb, and fastLZ.
The best for this is lzo: very fast decompression, a good compression
ratio on a sample of postgres tables and indexes, and a license that could
work.
QuickLZ compresses faster and more, but is not free.

Each page is compressed independently, so decompressing a single page is
fast.
This means 8k pages are a bit small; I used 32k pages to get better
compression.

I've checked various table and index files:
- your average table with many columns, some TEXT, etc., compresses about 2x
- when the row header is significant relative to the row data, it can reach 3x
- indexes compress well too, 2-4x
- gist indexes could sometimes be compressed up to 6x

Roughly, 2-3x compression is achievable on a complete database.

--------------------

Implementation

This lives in md.c.
The implementation is about 100 lines of code (!)

Instead of calling FileWrite and FileRead, md.c calls special functions  
that read and write compressed blocks.

* Writing:

The block is compressed in a temp memory buffer.
A header (prepended to the compressed data) tells the length of this data.
Then it is written at its slot in the disk file.

* Reading:

The first 4k of compressed data is read.
Looking at the length header, we know how much more to read.
The rest of the data (if any) is read.
The data is decompressed.

Since a compressed block can be larger than the original block, I have
enlarged the per-block slot in the files, so that there is a block every
40k instead of every 32k (with 32k postgres pages).

That's it, very simple.
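
To make this concrete, here is a rough, self-contained sketch of what the
two paths could look like. It is not the actual patch: it uses plain
pread()/pwrite() instead of md.c's FileRead/FileWrite, and the 4-byte length
header and error handling are my assumptions.

/*
 * Sketch of the compressed write/read paths described above (assumptions:
 * plain pread()/pwrite(), a 4-byte length header, 32k pages in 40k slots).
 * lzo_init() must be called once at startup before using liblzo2.
 */
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <lzo/lzo1x.h>

#define PAGE_SIZE   (32 * 1024)        /* postgres page size used here      */
#define SLOT_SIZE   (40 * 1024)        /* on-disk slot per page (see above) */
#define HEADER_SIZE ((lzo_uint) sizeof(uint32_t))

/* Writing: compress the page into a temp buffer, prepend the compressed
 * length, write the result at the page's fixed slot offset. */
static int
compressed_write(int fd, uint32_t blockno, const char *page)
{
    unsigned char buf[SLOT_SIZE];
    unsigned char wrkmem[LZO1X_1_MEM_COMPRESS];
    lzo_uint      clen;
    off_t         slotoff = (off_t) blockno * SLOT_SIZE;
    uint32_t      len32;

    if (lzo1x_1_compress((const unsigned char *) page, PAGE_SIZE,
                         buf + HEADER_SIZE, &clen, wrkmem) != LZO_E_OK)
        return -1;

    len32 = (uint32_t) clen;
    memcpy(buf, &len32, sizeof(len32));

    return pwrite(fd, buf, HEADER_SIZE + clen, slotoff) < 0 ? -1 : 0;
}

/* Reading: read the first 4k, look at the length header to know how much
 * more to read, read the rest (if any), decompress. */
static int
compressed_read(int fd, uint32_t blockno, char *page)
{
    unsigned char buf[SLOT_SIZE];
    uint32_t      clen;
    lzo_uint      dlen = PAGE_SIZE;
    off_t         slotoff = (off_t) blockno * SLOT_SIZE;

    if (pread(fd, buf, 4096, slotoff) < 0)
        return -1;
    memcpy(&clen, buf, sizeof(clen));

    if (HEADER_SIZE + clen > 4096 &&
        pread(fd, buf + 4096, HEADER_SIZE + clen - 4096, slotoff + 4096) < 0)
        return -1;

    if (lzo1x_decompress_safe(buf + HEADER_SIZE, clen,
                              (unsigned char *) page, &dlen, NULL) != LZO_E_OK)
        return -1;

    return dlen == PAGE_SIZE ? 0 : -1;
}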

--------------------

Now, the reason this works is that the underlying file is handled as a
sparse file by the OS.
The holes between compressed blocks are removed: not recorded on disk,
and never cached either.

However, sparse files can have big performance problems if you do this:

- write a small block in the middle of the file, surrounded by holes
- later enlarge this block

When the block is enlarged, if it needs an extra filesystem page, it will  
not be allocated contiguously.
When it is read later, it will need an extra seek, which is really bad.

So, looking at the compression statistics:

gist index for geometric coordinates search:
a 32k page is compressed to 1-2 4k-pages, very rarely 3 pages.

btree indexes:
a 32k page is compressed to 2-3 4k-pages.

large table:
a 32k page is compressed to 2-4 4k-pages.

Therefore, on write, I pre-allocate some space in the sparse file by
writing more than needed: currently I write 5 4k-blocks.
Whatever is written after the compressed data is just whatever garbage was
previously in the buffer; it is ignored on reads.

This means the disk space savings are less than with full compression, but
access is much smoother, in fact much like with a regular non-sparse file,
since the blocks between the holes almost never need to grow.

Without pre-allocating, performance is abysmal, not even worth talking  
about.

Pre-allocated but not actually used blocks are never read, except maybe by
OS readahead during a seq scan.
On a heavily random-access database they will not be touched, so they do
not waste any space in the OS cache.
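
In the sketch above, the write path would then simply pad each write up to
the pre-allocation size, something like the helper below. The 5-block figure
is the one mentioned above; treating it as a minimum rather than an exact
size is my assumption.

#include <stddef.h>

/* Pad every write up to 5 4k-blocks so the filesystem allocates the space
 * contiguously up front; the padding is leftover garbage in the buffer and
 * is ignored on reads.  compressed_write() above would pass
 * padded_write_len(HEADER_SIZE + clen) to pwrite() instead of the exact
 * length. */
#define PREALLOC_BYTES ((size_t) 5 * 4096)

static size_t
padded_write_len(size_t exact_len)
{
    return exact_len < PREALLOC_BYTES ? PREALLOC_BYTES : exact_len;
}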

--------------------

shared_buffers thus contains decompressed blocks: a row that is updated
very often will not go through decompression-compression cycles each time.
The OS cache contains compressed data.

Some tests:

It appears to behave as expected. It didn't crash (yet...).

Basically it looks like RAM has doubled and CPU speed is halved.

Random access queries are faster, even on a cold cache, and of course,  
much better cached afterwards, since the amount of data the OS cache can  
hold is at least doubled.

Seq scans on a huge table, reading data from disk, are a tiny bit slower,
which is strange: on my test box, the disks are slow (50 MB/s) and lzo
can decompress much faster than that. At least it isn't much slower.

Seq scans on a cached table that would fit in RAM anyway are slower,  
because it needs to be decompressed.

Seq scans on a table that can be cached only because of the compression
are much faster than seq scans without compression, where it has to hit
the disk.

Compressing the data pages directly also reduces the need for many ad-hoc  
schemes like compressing index tuples, etc.
Who cares if the row header is huge? If all rows have the same row header,
lzo will shrink it down to 2 bytes, and the rest of Postgres doesn't need  
to know about it.

Also, this compression scheme operates on whole pages: there is no need to
split a page if it suddenly becomes larger after compression. Only the
storage manager needs to know about it. The OS sparse file handling takes
care of the gremlins.

For someone wanting to run a cheap server like this, it could be an  
interesting tradeoff.
http://www.ovh.com/fr/produits/superplan_best_of.xml

--------------------

I need to clean it up a little and do some more checks before I send the  
patch to you.
I am not suggesting this as part of the commitfest for 8.5, but as a
proof of concept, since many features are missing:

- ability to specify per-table/index or per-tablespace compression
(per-tablespace would be better, I think)
Right now it compresses everything, even the sys catalogs, even the
fsm, which is a bit dumb...
There should really be a md.c and a md_compress.c,
and an ALTER TABLESPACE SET COMPRESSION TO 1, then move tables to it.

Then you can keep your tables un-compressed, except that this huge fulltext
index that eats your RAM is now much smaller.

- check how it corrupts data during recovery
ahem, notice I didn't say "if"

- check how it works under heavy concurrent load
I don't know under which locks md.c operates.

- determine good preallocation values
these vary according to the nature of the data; maybe user-supplied?

- test with 16k pages also
better caching granularity at the cost of worse compression

- TOAST
Yes, the TOAST table is compressed too, even if it was already compressed
by TOAST, so some "tuning" is needed.

- Page header
I want to add 1 byte to the page header to specify if a page is compressed  
or not, and the length (in 4k-blocks) of the compressed data.
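
Purely as an illustration of what packing that into one byte could look
like (hypothetical, not the current patch, and not the real PageHeaderData
layout): one bit for the "compressed" flag and the low bits for the length
in 4k-blocks, since a 40k slot is at most 10 of them.

#include <stdint.h>

/* Hypothetical encoding of per-page compression info in one spare header
 * byte: high bit = "page is stored compressed", low 4 bits = compressed
 * length in 4k-blocks (a 40k slot holds at most 10 of them). */
#define PAGE_COMPRESSED   0x80
#define PAGE_CLEN_MASK    0x0F

static inline uint8_t
compression_info_byte(int is_compressed, unsigned len_in_4k_blocks)
{
    return (uint8_t) ((is_compressed ? PAGE_COMPRESSED : 0) |
                      (len_in_4k_blocks & PAGE_CLEN_MASK));
}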


Then, more ideas come to mind:

- Read-only
Someone talked about adding a read-only option to tables, which would do
some magic tricks to the data when re-building the table (CLUSTER, VACUUM
FULL), making it more efficient. This would be a good place to also
compress it, and to write it as a sparse file without any need for
pre-allocation.

- Lazy decompression
Right now pages are decompressed on read, which is not optimal. It would  
be better to issue another read() immediately, and decompress the page  
later.
The page would be stored compressed in shared_buffers; then the first
process which needs it would decompress it, and it would remain
decompressed until written.

- Parallel decompression
A new friend of the bgwriter, the bg-decompressor, hunts recently-read  
compressed pages in shared_buffers and decompresses them.
There would be one such process per CPU core.
One core can decompress 200 MB/s or even more... a cheap quad core CPU,  
even with crummy SATA disks, would be able to seq scan like a monster  
RAID...

- Parallel compression
The bg-decompressor could also be a bg-compressor, hunting pages that are
about to be written and compressing them; this would speed up pg_restore
by reducing the IO.

- Relation Extend
Right now it seems that, when a relation needs a new page, this page is
written (empty) at the end of the relation, and then at some point later
written again, with data. So it has to be compressed twice (or more), and
the first time, since it is empty, it will compress to something like 100
bytes, hence the need for pre-allocation.

All the tests above were done on a Core 2, 4GB RAM (3.5 actually, since
it's a 32-bit OS), SATA RAID1, kubuntu, ext3 filesystem.
I'm going to do some checks on an XFS volume right now to see how it
handles sparse files...

Cheers,
Pierre

