Re: Table and Index compression - Mailing list pgsql-hackers

From Pierre Frédéric Caillaud
Subject Re: Table and Index compression
Date
Msg-id op.ux997d1qcke6l8@soyouz
In response to Re: Table and Index compression  (Greg Stark <gsstark@mit.edu>)
Responses Re: Table and Index compression  (Sam Mason <sam@samason.me.uk>)
Re: Table and Index compression  (Greg Stark <gsstark@mit.edu>)
List pgsql-hackers
First, a few things that I forgot to mention in the previous message:

> I like the idea too, but I think there are some major problems to
> solve. In particular I think we need a better solution to blocks
> growing than sparse files.

Sparse files allow something great: they let us test this concept in real
life with less code than the length of this email.

> The main problem with using sparse files is that currently postgres is
> careful to allocate blocks early so it can fail  if there's not enough
> space. With your sparse file solution Postgres might only find out
> there's no space after it has already committed a transaction.
> bgwriter has no good course of action to take if it finds out there's
> nowhere to put the data it has in shared buffers.

Yes, that is a big problem.

Note that the same problem can happen if you use a filesystem that
supports compression:

- you preallocate a file full of empty blocks
- filesystem compresses it well, says it fits
- you fill it with data that compresses much worse than zeros
- at some point one of the write() calls fails with a disk full error

This would be an interesting argument in the "can we use compressed ZFS on  
postgres" debate...

Also, about compressed NTFS: it can give you disk-full errors on read().
While this may appear stupid, it is in fact very good.
When you read() something compressed, NTFS reserves the space on disk
needed to re-write this data, assuming the worst compression, so your app
fails early (on read, not on write). If there is not enough space left on
the volume, read() fails with a disk-full error. Many applications do not
expect this.

On the other hand, filling a disk completely is never a good idea, since
it generally triggers fragmentation so extreme that the only way to
recover is to go buy another harddisk and copy your data.

As a side note, I have also tested lzjb (the ZFS compressor) and lzo is  
much faster, and compresses much better (sometimes 2x better).
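
For reference, the comparison was done with a small harness along these
lines; lzo and lzjb need third-party bindings, so the sketch below uses
zlib from the Python stdlib purely as a stand-in to show the methodology,
and the sample path is made up:

import time, zlib

BLOCK = 32 * 1024

def bench(name, compress, path):
    # Feed the compressor 32K blocks of a sample file, report speed and ratio.
    total_in = total_out = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK)
            if not block:
                break
            total_in += len(block)
            total_out += len(compress(block))
    elapsed = time.perf_counter() - start
    print("%-8s %7.1f MB/s   ratio %.2f"
          % (name, total_in / elapsed / 1e6, total_in / total_out))

# Point this at any relation file copied out of the data directory.
bench("zlib-1", lambda b: zlib.compress(b, 1), "/tmp/sample_relation")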

Back to the point of how to handle disk full errors:
- we could write a file the size of shared_buffers at startup
- if a write() reports disk full, delete the file above
- we now have enough space to flush all of shared_buffers
- flush and exit gracefully

This will waste disk space. Of course... but the purpose of compression is  
not to save disk space, it is to have a faster database.
Also, if you use compression,
- you'll do it because your database is much bigger than shared_buffers,  
so the wasted space is not huge.
- you'll use a smaller shared_buffers than usual, because shared_buffers  
contains uncompressed pages, and you'll want to use lots of OS disk cache  
to cache compressed data.

Doesn't seem too crazy to me.
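
To make the reserve-file idea above concrete, a minimal sketch (all names
are made up, and this is obviously not how the backend would really shut
down; the flush and shutdown steps are passed in as callbacks):

import errno, os

RESERVE_PATH = "pg_compress_reserve"        # made-up file in the data directory
SHARED_BUFFERS_BYTES = 128 * 1024 * 1024    # example value

def create_reserve_file():
    # At startup: physically allocate the space now (zeros written explicitly
    # do allocate blocks; this is not a sparse file).
    chunk = b"\0" * (1024 * 1024)
    with open(RESERVE_PATH, "wb") as f:
        for _ in range(SHARED_BUFFERS_BYTES // len(chunk)):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())

def handle_write_error(err, flush_all_buffers, shutdown):
    # Called when a write() of a compressed block fails.
    if err.errno != errno.ENOSPC:
        raise err
    # Give the reserved space back to the filesystem, flush everything that
    # is still in shared_buffers, then stop cleanly.
    os.unlink(RESERVE_PATH)
    flush_all_buffers()
    shutdown("out of disk space while writing compressed blocks")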

> But I think even if you solve that it's not really a good long-term
> solution. We don't know how the OS handles block allocation for this
> type of file. I'm actually moderately surprised it isn't skipping
> enough blocks assuming you'll allocate them eventually. Even if it
> does handle it the way you expect what happens when you do grow a
> block, it'll have to allocate it way out of the way and we have no way
> to repair that discontinuity later.

I made a quick check before implementing it, using Python scripts to play
with sparse files on ext3:

- writing a sparse file is a bit (not much) slower than a regular file,
- reading from a non-fragmented sparse file is as fast as reading a  
regular file
- holes do not eat OS disk cache (which is the most interesting point)
- reading from cache is as fast as reading a regular file (and faster if  
you don't read the holes because you know they are holes, which is the  
case here)

And, also, growing a sparse file by plugging the holes in it WILL allocate  
blocks all over the place and render IO extremely inefficient.
You can defrag it, of course (using fs-specific tools or just cpio), but  
that's not "high-availability"...

> Also, the way you've prellocated blocks effectively nails the maximum
> compression at 2x. That seems to be leaving a lot of money on the
> table.

Yes, it does, but there are a few subtleties.

First, about preallocation: I had to implement it because of the way the
files are extended.
Since an empty block is written first, it compresses well, to 1 OS page.
When this block is later written again with real data, it compresses to a
larger number of pages, which triggers allocation all over the place. In
testing, the resulting sparse file is slow in the "forget it" sense...

The file could also be non-sparse: in this case, no disk space is saved.

Not saving disk space is OK with me: disks are really cheap these days.

I know this sounds crazy ;) but as I said above, compression is meant to  
save expensive resources. When you use H-264 to compress 1TB of raw video  
to 1GB of compressed video, you save an expensive resource (0.999 TB of  
harddisk). But if you use lzo to compress a 40GB database down to 20GB,
you save maybe $3 in disk space, which is ludicrous. However, you also
save much more expensive resources: disk throughput and OS cache RAM. So
it might really pay off.

So, a non-sparse file would work well for random accesses too, since we  
only read the parts of it that contain compressed data, and the "wasted"  
pages would never be touched.

However, it would have drawbacks:

- seq scan would not be able to skip the "empty" pages, so it would have  
no benefit whatsoever
- the empty pages, read-ahead by the OS, would pollute the disk cache
- the empty pages would also pollute the controller's cache and the disks'  
internal caches

So, a balance has to be reached on how many OS pages (4K on Linux) to
preallocate for a Postgres block (16K or 32K, to get good compression).

If you preallocate too little, adding data (INSERT, UPDATEs) will heavily  
fragment the file.
If you preallocate too much, seq scan will be slower and perhaps cache  
will be wasted (but it will still be much faster than the huge seek-fest  
which happens if you preallocate too little).

How many OS pages to preallocate for a postgres block should be specified  
by the user.

- If you intend to modify the table contents, you should preallocate a bit  
more.
- If the table will be mostly read-only, a tight fit is okay.
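
To make the tradeoff concrete, a small helper like this (purely
illustrative, not part of any patch) can be run against a histogram of
compressed block sizes, for example the one from the first stats table
below:

OS_PAGE = 4 * 1024
PG_BLOCK = 32 * 1024

def evaluate(histogram, prealloc_pages):
    # histogram maps "compressed size in 4K pages" -> number of 32K blocks.
    total = sum(histogram.values())
    overflow = sum(n for pages, n in histogram.items() if pages > prealloc_pages)
    # Blocks that fit use exactly the preallocated pages; overflowing blocks
    # are counted as falling back to the full 32K (worst case).
    on_disk = (total - overflow) * prealloc_pages * OS_PAGE + overflow * PG_BLOCK
    print("prealloc %d pages: %.1f%% of blocks overflow, efficiency %.2fx"
          % (prealloc_pages, 100.0 * overflow / total, total * PG_BLOCK / on_disk))

# Histogram of the first stats table below (pages -> block count).
evaluate({1: 16, 2: 848, 3: 4451, 4: 4730, 5: 25}, prealloc_pages=4)
evaluate({1: 16, 2: 848, 3: 4451, 4: 4730, 5: 25}, prealloc_pages=3)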

A nice thing is that it makes read-only tables a bit redundant. Basically,
if you ask Postgres to "CLUSTER or VACUUM FULL this table and compress it,
preallocating nothing more than necessary", you will get a table and
indexes that are well compressed, but updates and deletes will be slow, so
it is effectively a read-mostly table.

If you use partitions, the current partition can be uncompressed (or  
compressed with a generous pre-allocation), but the past partitions, which  
are usually read-only, can be compressed tight, and their indexes too.

Then you could change the preallocation value for this table so that it
preallocates more, and INSERT into it with no problem, except for the
indexes, for which you would need to specify beforehand that a few more
pages should be preallocated, to allow index pages to grow...

So, how many pages to preallocate is a tuning parameter that should have a
reasonable default, but be user-settable per relation (table, index...).

Note that if a page contains lots of dead rows which are similar except a  
few updated columns, compression will exploit this redundancy.

Here are some stats on some tables of more than 300K rows:

* Table with lots of columns, including a TEXT column of a few hundred
bytes containing French text, which is usually too short for TOASTING:
Size               : 314.69 MB
Paged Size         : 133.24 MB
Unpaged efficiency : 2.77
Paged Efficiency   : 2.36
pages | count
    1 |    16
    2 |   848
    3 |  4451
    4 |  4730
    5 |    25

The above is a histogram: for instance, 4730 32K postgresql blocks
compressed to four 4K pages (a ratio of 2X), and only 16 postgres blocks
compressed to a single 4K page.

"Size" is uncompressed size
"Paged Size" is the sum of all 4K which contain compressed data
"Unpaged Efficiency" is (raw data) / (compressed data), showing compressor  
performance.
"Paged Efficiency" is computed from the real number of pages used, so it  
is the interesting metric.

In this case, efficiency is 2.36 because some pages compress better than  
2X, but if we preallocate 16K for each 32K page, real efficiency will be  
2X.
We can also see that 25 blocks would need some extra allocation. This is a
really small percentage of the table, so it is not significant for speed.
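
Figures of this kind can be produced with a script in roughly this spirit
(zlib from the Python stdlib stands in for lzo here, so the absolute
ratios will not match the numbers above):

import sys, zlib
from collections import Counter

PG_BLOCK = 32 * 1024
OS_PAGE = 4 * 1024

def stats(path):
    # Compress each 32K block, round the result up to whole 4K pages, and
    # accumulate the histogram plus the four summary figures.
    histogram = Counter()
    raw = compressed = paged = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(PG_BLOCK)
            if not block:
                break
            c = len(zlib.compress(block, 1))
            pages = -(-c // OS_PAGE)              # round up to whole 4K pages
            histogram[pages] += 1
            raw += len(block)
            compressed += c
            paged += pages * OS_PAGE
    print("Size               : %.2f MB" % (raw / 2.0**20))
    print("Paged Size         : %.2f MB" % (paged / 2.0**20))
    print("Unpaged efficiency : %.2f" % (float(raw) / compressed))
    print("Paged Efficiency   : %.2f" % (float(raw) / paged))
    print("pages | count")
    for pages in sorted(histogram):
        print("%5d | %5d" % (pages, histogram[pages]))

stats(sys.argv[1])   # path of a relation file copied from the data directory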

* GIN fulltext index on text field of above table

Size               : 136.75 MB
Paged Size         : 48.04 MB
Unpaged efficiency : 3.63
Paged Efficiency   : 2.85
pages | count
    1 |  1528
    2 |   470
    3 |   944
    4 |   172
    5 |  1262


* GIST fulltext index on text field of above table

annonces_ts_gist
Size               : 63.84 MB
Paged Size         : 33.89 MB
Unpaged efficiency : 2.14
Paged Efficiency   : 1.88
pages | count
    1 |     3
    2 |    32
    3 |   165
    4 |  1103
    5 |   739
    6 |     1

* table with 32 columns including a few TEXT fields which contain at most  
a few chars

Size               : 393.50 MB
Paged Size         : 127.24 MB
Unpaged efficiency : 3.83
Paged Efficiency   : 3.09
pages | count
    2 |  5207
    3 |  7381
    4 |     4

* btree on DATE column of above table

Size               : 56.34 MB
Unpaged efficiency : 2.63
Paged Efficiency   : 2.08
pages | count
    1 |     2
    2 |     1
    3 |   282
    4 |  1518


* btree on field "zipcode" of above table

Size               : 56.34 MB
Paged Size         : 27.04 MB
Unpaged efficiency : 2.90
Paged Efficiency   : 2.67
pages | count
    1 |     2
    2 |     2
    3 |  1794
    4 |     5

* gist index on GPS coordinates (every row has (lat,lon)) of above table

Size               : 196.47 MB
Paged Size         : 39.79 MB
Unpaged efficiency : 6.93
Paged Efficiency   : 4.94
pages | count
    1 |  2486
    2 |  3704
    3 |    96
    4 |     1

Note that the 32-column table is 400 MB... and this gist index alone is
almost 200 MB.
This table has 770 MB of indexes on it...
Compression shrinks the table to 127 MB and indexes to a total of about  
300 MB.

* table of (increasing timestamp, 1 SERIAL, 1 increasing INT, 3 INT  
columns which contain random() data)

Size               : 344.06 MB
Paged Size         : 215.02 MB
Unpaged efficiency : 1.97
Paged Efficiency   : 1.60
pages | count
    1 |     1
    5 | 11009


* multicolumn btree on (int, timestamp) above

Size               : 153.19 MB
Paged Size         : 57.42 MB
Unpaged efficiency : 2.73
Paged Efficiency   : 2.67
pages | count
    1 |     3
    3 |  4899


* btree on int column above

Size               : 102.06 MB
Paged Size         : 63.73 MB
Unpaged efficiency : 1.87
Paged Efficiency   : 1.60
pages | count
    1 |     2
    2 |     1
    4 |     3
    5 |  3260


As you can see, efficiency varies widely between average (1.6), good  
(2-2.5), and amazing (almost 5) depending on what is compressed.

> To handle read-write tables I think we would need to directly
> implement the kind of indirection layer that you're getting out of the
> filesystem's block layer currently. That would let you allocate enough
> blocks to hold the data uncompressed and then free up those blocks
> once you're sure the data is compressible.

Just like NTFS does, then?
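
For what it's worth, here is a purely hypothetical sketch of such an
indirection layer, just to make the quoted idea concrete (no claim that
NTFS or anything else actually works this way):

OS_PAGE = 4 * 1024
PG_BLOCK = 32 * 1024
PAGES_PER_BLOCK = PG_BLOCK // OS_PAGE

class BlockMap(object):
    """Maps each logical 32K block to a list of 4K pages in the heap file."""

    def __init__(self):
        self.extents = {}       # logical block number -> list of 4K page numbers
        self.free_pages = []    # freelist of 4K pages already in the file
        self.next_page = 0      # next brand-new 4K page at the end of the file

    def _alloc_page(self):
        if self.free_pages:
            return self.free_pages.pop()
        page = self.next_page
        self.next_page += 1
        return page

    def write_uncompressed(self, blockno):
        # Reserve the full 8 pages up front, so any ENOSPC happens here and
        # not after the transaction has committed.
        self.extents[blockno] = [self._alloc_page() for _ in range(PAGES_PER_BLOCK)]

    def shrink(self, blockno, compressed_pages):
        # Once the block is known to compress, give the surplus pages back.
        surplus = self.extents[blockno][compressed_pages:]
        del self.extents[blockno][compressed_pages:]
        self.free_pages.extend(surplus)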


> One possibility is to handle only read-only tables. That would make
> things a *lot* simpler. But it sure would be inconvenient if it's only
> useful on large static tables but requires you to rewrite the whole
> table -- just what you don't want to do with large static tables -- to
> get the benefit.

See above: a read-mostly table that you can still update (but it will be
very slow) or insert into (if you change the preallocation setting).


I forgot to talk about SSDs in the previous message. SSDs are quite
expensive, but they seek really fast. Saving disk space would be good on
an SSD, because the price per byte is much higher. And since SSDs seek
very fast, they probably wouldn't be bothered that much by sparse blocks
being allocated all over the place. Also, the price of SSDs means you
wouldn't make a monster RAID out of them, so the CPU could decompress
faster than the disk in a seqscan. This would need a test.


Ah, I would like to know one thing: if I add a 2-byte field to the
postgres page header, containing the compressed length, will I get any
problems? (i.e. do some pages use a non-standard header?)

I need to modify a few things and add comments, then will submit a patch  
(not for committing, but for hacking).

