Thread: Table and Index compression

Table and Index compression

From
PFC
Date:
With the talk about adding compression to pg_dump lately, I've been  
wondering if tables and indexes could be compressed too.
So I've implemented a quick on-the-fly compression patch for postgres

Sorry for the long email, but I hope you find this interesting.

Why compress ?

1- To save disk space ?
Disks are so cheap now that it is not a valid reason.
Besides, anyone with a database that is big enough to fill a modern  
harddisk probably also has a monster RAID to get more IO throughput.

2- To make the database faster ?
This would be a valid reason.

If the database is made smaller through compression, this can have several  
interesting effects :
- you need less RAM to cache it entirely, so random IO hits more cache  
than before
- if the CPU can decompress faster than the disk IO, even seq scans can be  
faster (this will not be the case on monster RAID setups, but for cheap  
servers, it could be)

So, why not ? I coded it.

I've benchmarked lzo, lzf, quicklz, lzjb, and fastLZ.
The best for this is lzo : very fast decompression, a good compression  
ratio on a sample of postgres table and indexes, and a license that could  
work.
QuickLZ compresses faster and more, but is not free.

Compression would compress each page independently, so it is fast to  
decompress a single page.
This means 8k pages are a bit small, I used 32k pages to get better  
compression.

I've checked various table and index files.
- your average table with many columns, some TEXT, etc, compresses about 2x
- when the row header is significant vs the row data, it can reach 3x
- indexes compress well too, 2-4x
- gist indexes could be compressed up to 6x sometimes

Roughly, 2-3x compression is achievable on a complete database.

--------------------

Implementation

This lives in md.c
The implementation is about 100 lines of code (!)

Instead of calling FileWrite and FileRead, md.c calls special functions  
that read and write compressed blocks.

* Writing :

The block is compressed in a temp memory buffer.
A header (prepended to the compressed data) tells the length of this data.
Then, it is written where it belongs in the disk file.

* Reading :

The first 4k of compressed data is read.
Looking at the length header, we know how much more to read.
The rest of the data (if any) is read.
The data is decompressed.

Since a compressed block can be larger than the original block, I have  
enlarged each block's slot in the file, so that there is a block slot every  
40k instead of every 32k (with 32k postgres pages).

That's it, very simple.
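
To make this concrete, here is a minimal standalone sketch of the write and
read paths described above. It is not the actual md.c patch: plain POSIX
pread()/pwrite() stand in for FileRead/FileWrite, names like SLOT_SIZE and
compressed_write() are illustrative, and the pre-allocation padding discussed
further down is left out.

/*
 * Standalone sketch of the scheme above -- not the actual md.c patch.
 * POSIX pread/pwrite stand in for FileRead/FileWrite, and lzo_init()
 * must have been called once before these are used.
 */
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include "minilzo.h"

#define BLCKSZ     32768                /* uncompressed postgres page */
#define SLOT_SIZE  40960                /* one block slot every 40k   */
#define MAX_COMP   (BLCKSZ + BLCKSZ / 16 + 64 + 3)  /* lzo worst case */

typedef struct { uint32_t comp_len; } CompressedHeader;

static unsigned char lzo_wrkmem[LZO1X_1_MEM_COMPRESS];

/* Writing: compress into a temp buffer, prepend the length, write only
 * the compressed bytes; the rest of the 40k slot stays a hole. */
static int
compressed_write(int fd, uint32_t blocknum, const char *page)
{
    unsigned char    buf[sizeof(CompressedHeader) + MAX_COMP];
    CompressedHeader hdr;
    lzo_uint         comp_len;

    if (lzo1x_1_compress((const unsigned char *) page, BLCKSZ,
                         buf + sizeof hdr, &comp_len, lzo_wrkmem) != LZO_E_OK)
        return -1;

    hdr.comp_len = (uint32_t) comp_len;
    memcpy(buf, &hdr, sizeof hdr);

    return pwrite(fd, buf, sizeof hdr + comp_len,
                  (off_t) blocknum * SLOT_SIZE) < 0 ? -1 : 0;
}

/* Reading: read the first 4k, look at the length header, read the rest
 * (if any), then decompress. */
static int
compressed_read(int fd, uint32_t blocknum, char *page)
{
    unsigned char    buf[sizeof(CompressedHeader) + MAX_COMP];
    CompressedHeader hdr;
    off_t            pos = (off_t) blocknum * SLOT_SIZE;
    lzo_uint         out_len = BLCKSZ;
    size_t           total;

    if (pread(fd, buf, 4096, pos) < (ssize_t) sizeof hdr)
        return -1;
    memcpy(&hdr, buf, sizeof hdr);
    if (hdr.comp_len > MAX_COMP)
        return -1;

    total = sizeof hdr + hdr.comp_len;
    if (total > 4096 && pread(fd, buf + 4096, total - 4096, pos + 4096) < 0)
        return -1;

    return (lzo1x_decompress_safe(buf + sizeof hdr, hdr.comp_len,
                                  (unsigned char *) page, &out_len, NULL)
            == LZO_E_OK && out_len == BLCKSZ) ? 0 : -1;
}

With 32k postgres pages in 40k slots, the unwritten tail of each slot is
simply never allocated by the filesystem.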

--------------------

Now, the reason it works is that the underlying file is handled as sparse  
by the OS.
The holes between compressed blocks are removed : not recorded on disk,  
and never cached either.

However sparse files can have big performance problems if you do this :

- write a small block in the middle of the file, surrounded by holes
- later enlarge this block

When the block is enlarged, if it needs an extra filesystem page, it will  
not be allocated contiguously.
When it is read later, it will need an extra seek, which is really bad.

So, looking at the compression statistics :

gist index for geometric coordinates search :
a 32k page is compressed to 1-2 4k-pages, very rarely 3 pages.

btree indexes :
a 32k page is compressed to 2-3 4k-pages

large table :
a 32k page is compressed to 2-4 4k-pages

Therefore, on write, I pre-allocate some space in the sparse file, by  
writing more than needed : currently I write 5 4k-blocks.
Whatever is written after the compressed data is just leftover garbage from  
the buffer; it is ignored on reads.

This means the disk space savings are less than a full compression, but  
access is much smoother, in fact much like a regular non-sparse file,  
since the blocks between the holes almost never need to be grown.

Without pre-allocating, performance is abysmal, not even worth talking  
about.

Pre-allocated but not actually used blocks are never read, except maybe by  
OS readahead during seq scan.
On a heavy random access database they will not be touched, not wasting  
any space in the OS cache.
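
The padding rule itself is tiny; here is a sketch, assuming the 5 x 4k figure
above (the real patch may compute it differently):

#include <stddef.h>

#define OS_PAGE        4096
#define PREALLOC_PAGES 5

/* Round an on-disk write up to the pre-allocated size, so that later
 * growth of the compressed block rarely needs new filesystem blocks. */
static size_t
padded_write_len(size_t header_plus_comp_len)
{
    size_t len = (header_plus_comp_len + OS_PAGE - 1) & ~((size_t) OS_PAGE - 1);
    size_t min = (size_t) PREALLOC_PAGES * OS_PAGE;

    return (len < min) ? min : len;
}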

--------------------

shared_buffers thus contains decompressed blocks : a row that is updated  
very often will not go through decompression-compression cycles each time.
The OS cache contains compressed data.

Some tests :

It appears to behave as expected. It didn't crash (yet...).

Basically it looks like RAM has doubled and CPU speed is halved.

Random access queries are faster, even on a cold cache, and of course,  
much better cached afterwards, since the amount of data the OS cache can  
hold is at least doubled.

Seq scans on a huge table, reading data from disk, are a tiny bit slower,  
which is strange : on my test box, the disks are slow (50 MB/s) and lzo  
can decompress much faster than this. At least it isn't slower.

Seq scans on a cached table that would fit in RAM anyway are slower,  
because it needs to be decompressed.

Seq scans on a table that can be cached because of the compression, are  
much faster than seq scans without compression, where it has to hit the  
disk.

Compressing the data pages directly also reduces the need for many ad-hoc  
schemes like compressing index tuples, etc.
Who cares if the row header is huge ? If all rows have the same row header,  
lzo will shrink it down to 2 bytes, and the rest of Postgres doesn't need  
to know about it.

Also this compression scheme operates on whole pages : no need to split a  
page if suddenly it becomes larger after compression. Only the storage  
manager needs to know about it. The OS sparse file handling takes care of  
the gremlins.

For someone wanting to run a cheap server like this, it could be an  
interesting tradeoff.
http://www.ovh.com/fr/produits/superplan_best_of.xml

--------------------

I need to clean it up a little and do some more checks before I send the  
patch to you.
I am not suggesting this as a part of the commitfest for 8.5, but as a  
proof of concept, since many features are missing :

- ability to specify per-table/index or per-tablespace compression  
(per-tablespace would be better I think)
right now, it's compressing everything, even the sys catalogs, even the  
fsm, which is a bit dumb...
there should really be a md.c and a md_compress.c
and an ALTER TABLESPACE SET COMPRESSION TO 1, then move tables to it

Then you can keep your tables un-compressed, except this huge fulltext  
index that eats your RAM is now muuuch smaller.

- check how it corrupts data during recovery
ahem, notice I didn't say "if"

- check how it works under heavy concurrent load
I don't know under which locks md.c operates.

- determine good preallocation values
those vary according to the nature of the data, maybe user-supplied ?

- test with 16k pages also
better caching granularity at the cost of worse compression

- TOAST
Yes, the TOAST table is compressed too, even if it was compressed already  
by TOAST, so some "tuning" is needed

- Page header
I want to add 1 byte to the page header to specify if a page is compressed  
or not, and the length (in 4k-blocks) of the compressed data.


Then, more ideas come to mind :

- Read-only
Someone talked about adding a read-only option to a table, which would do  
some magic tricks to the data when re-building the table (CLUSTER,VAC  
FULL), making it more efficient. This would be a good place to also  
compress it, and write it as a sparse file without need for pre-allocation

- Lazy decompression
Right now pages are decompressed on read, which is not optimal. It would  
be better to issue another read() immediately, and decompress the page  
later.
The page would be stored as compressed in shared_buffers, then the first  
process which needs it would decompress it, and it would remain  
de-compressed until written.

- Parallel decompression
A new friend of the bgwriter, the bg-decompressor, hunts recently-read  
compressed pages in shared_buffers and decompresses them.
It would have one process per CPU core.
One core can decompress 200 MB/s or even more... a cheap quad core CPU,  
even with crummy SATA disks, would be able to seq scan like a monster  
RAID...

- Parallel compression
the bg-decompressor can also be a bg-compressor, hunting pages that are  
about to be written, and compressing them; this would speed up pg_restore  
by reducing the IO.

- Relation Extend
right now it seems that, when a relation needs a new page, this page is  
written (empty) at the end of the relation, then at some point later,  
written again, with data. So it has to be compressed twice (or more), and  
the first time, since it is empty, it will compress to something like 100  
bytes, hence the need for pre-allocation.

All tests above done on a Core 2, 4GB RAM (3.5 actually since 32-bit OS),  
SATA RAID1, kubuntu, ext3 filesystem.
I'm going to do some checks on a XFS volume right now to see how it  
handles sparse files...

Cheers,
Pierre


Re: Table and Index compression

From
Josh Berkus
Date:
On 8/6/09 2:39 AM, PFC wrote:
> 
>     
> With the talk about adding compression to pg_dump lately, I've been
> wondering if tables and indexes could be compressed too.
> So I've implemented a quick on-the-fly compression patch for postgres

I find this very interesting, and would like to test it further on some
client workloads, before you/we put more work into completing it.

I think if we can implement compressed database as an option (perhaps at
initdb time, perhaps at tablespace creation time) then it will be very
attractive.

Where is the patch?

BTW, who are you actually?

-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


Re: Table and Index compression

From
Guillaume Smet
Date:
Pierre,

On Thu, Aug 6, 2009 at 11:39 AM, PFC<lists@peufeu.com> wrote:
> The best for this is lzo : very fast decompression, a good compression ratio
> on a sample of postgres table and indexes, and a license that could work.

The license of lzo doesn't allow us to include it in PostgreSQL
without relicensing PostgreSQL as GPL.

I'm not sure of what you imply by "a license that could work".

Note that it doesn't change the interest of your approach. It's just
that I'm not sure we can find a performance-acceptable BSD licensed
compression library (it was discussed a lot of times here).

-- 
Guillaume


Re: Table and Index compression

From
Greg Stark
Date:
I like the idea too, but I think there are some major problems to
solve. In particular I think we need a better solution to blocks
growing than sparse files.

The main problem with using sparse files is that currently postgres is
careful to allocate blocks early so it can fail  if there's not enough
space. With your sparse file solution Postgres might only find out
there's no space after it has already committed a transaction.
bgwriter has no good course of action to take if it finds out there's
nowhere to put the data it has in shared buffers.

But I think even if you solve that it's not really a good long-term
solution. We don't know how the OS handles block allocation for this
type of file. I'm actually moderately surprised it isn't skipping
enough blocks assuming you'll allocate them eventually. Even if it
does handle it the way you expect, what happens when you do grow a
block? It'll have to allocate it way out of the way and we have no way
to repair that discontinuity later.

Also, the way you've preallocated blocks effectively nails the maximum
compression at 2x. That seems to be leaving a lot of money on the
table.

To handle read-write tables I think we would need to directly
implement the kind of indirection layer that you're getting out of the
filesystem's block layer currently. That would let you allocate enough
blocks to hold the data uncompressed and then free up those blocks
once you're sure the data is compressible.

One possibility is to handle only read-only tables. That would make
things a *lot* simpler. But it sure would be inconvenient if it's only
useful on large static tables but requires you to rewrite the whole
table -- just what you don't want to do with large static tables -- to
get the benefit.


Re: Table and Index compression

From
Robert Haas
Date:
On Thu, Aug 6, 2009 at 4:03 PM, Greg Stark<gsstark@mit.edu> wrote:
> I like the idea too, but I think there are some major problems to
> solve. In particular I think we need a better solution to blocks
> growing than sparse files.

How much benefit does this approach have over using TOAST compression
more aggressively?

...Robert


Re: Table and Index compression

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote: 
> On Thu, Aug 6, 2009 at 4:03 PM, Greg Stark<gsstark@mit.edu> wrote:
>> I like the idea too, but I think there are some major problems to
>> solve. In particular I think we need a better solution to blocks
>> growing than sparse files.
> 
> How much benefit does this approach have over using TOAST
> compression more aggressively?

I was wondering the same thing.  It seems like compressing a page at a
time should allow more space savings than a column at a time, and
possibly do it faster.

One question I have banging around in my head is what to do with
out-of-line storage.  Sometimes you have a large column which you know
contains data which is already compressed and/or encrypted, so
attempting compression would give little or no benefit; so I'm
inclined to think that if we do page compression, it shouldn't deal
with toast tables.  We could leave them to the current techniques.
That leaves some subtle problems with how to deal with a datum which
currently compresses from, say, several kB down to one or two hundred
bytes.  Current TOAST logic would typically compress and inline it.
What would we do if we're trying to push heap compression to the page
level?

-Kevin


Re: Table and Index compression

From
Josh Berkus
Date:
On 8/6/09 1:03 PM, Greg Stark wrote:
> One possibility is to handle only read-only tables. That would make
> things a *lot* simpler. But it sure would be inconvenient if it's only
> useful on large static tables but requires you to rewrite the whole
> table -- just what you don't want to do with large static tables -- to
> get the benefit.

While less flexible, I could see combining this with partitioning to
still be useful.  If we could rewrite specific partitions as compressed,
then there's a lot of cumulative data applications which it would benefit.

Not as exciting as being able to compress the whole thing, of course.

-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


Re: Table and Index compression

From
Ron Mayer
Date:
I'm curious what advantages there are in building compression into
the database itself, rather than using filesystem-based compression.

I see ZFS articles[1] discuss how enabling compression
improves performance with ZFS; for Linux, Btrfs has compression
features as well[2]; and on Windows NTFS seems to too.

[1]http://blogs.sun.com/observatory/entry/zfs_compression_a_win_win
[2]http://lwn.net/Articles/305697/




Re: Table and Index compression

From
Pierre Frédéric Caillaud
Date:
First, a few things that I forgot to mention in the previous message :





> I like the idea too, but I think there are some major problems to
> solve. In particular I think we need a better solution to blocks
> growing than sparse files.

Sparse files allow something great : to test this concept in real life,  
with a length of code that is shorter than this email.

> The main problem with using sparse files is that currently postgres is
> careful to allocate blocks early so it can fail  if there's not enough
> space. With your sparse file solution Postgres might only find out
> there's no space after it has already committed a transaction.
> bgwriter has no good course of action to take if it finds out there's
> nowhere to put the data it has in shared buffers.

Yes, that is a big problem.

Note that the same problem can happen if you use a filesystem that  
supports compression :

- you preallocate a file full of empty blocks
- filesystem compresses it well, says it fits
- you fill it with data that compresses much worse than zeros
- at some point one of the write() calls fails with a disk full error

This would be an interesting argument in the "can we use compressed ZFS on  
postgres" debate...

Also, about compressed NTFS : it can give you disk-full errors on read().
While this may appear stupid, it is in fact very good.
When you read() something compressed, NTFS will reserve the space on disk  
to re-write this data, assuming the worst compression, so your app fails  
early (on read, not on write). If there is not enough space on the volume,  
read() will fail with a disk full error. Many applications do not expect  
this.

On the other hand, filling a disk completely is never a good idea, since  
it generally triggers fragmentation so extreme that the only way to  
recover is to go buy another harddisk and copy your data.

As a side note, I have also tested lzjb (the ZFS compressor) and lzo is  
much faster, and compresses much better (sometimes 2x better).

Back to the point of how to handle disk full errors :
- we could write a file the size of shared_buffers at startup
- if a write() reports disk full, delete the file above
- we now have enough space to flush all of shared_buffers
- flush and exit gracefully

This will waste disk space. Of course... but the purpose of compression is  
not to save disk space, it is to have a faster database.
Also, if you use compression,
- you'll do it because your database is much bigger than shared_buffers,  
so the wasted space is not huge.
- you'll use a smaller shared_buffers than usual, because shared_buffers  
contains uncompressed pages, and you'll want to use lots of OS disk cache  
to cache compressed data.

Doesn't seem too crazy to me.
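
A rough sketch of that reserve-file idea; the file name and the use of
posix_fallocate() are assumptions for illustration, and the flush-and-shutdown
path is only hinted at in a comment:

/*
 * Rough sketch of the reserve-file idea above.  The file name and the
 * use of posix_fallocate() are assumptions, not part of an actual patch.
 */
#include <fcntl.h>
#include <unistd.h>

#define RESERVE_PATH "pg_compress_reserve"

/* At startup: pin down roughly shared_buffers worth of real disk blocks. */
static int
create_reserve_file(long long shared_buffers_bytes)
{
    int fd = open(RESERVE_PATH, O_CREAT | O_WRONLY, 0600);

    if (fd < 0)
        return -1;
    if (posix_fallocate(fd, 0, (off_t) shared_buffers_bytes) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

/* On a disk-full error from a compressed write: free the reserve so the
 * dirty contents of shared_buffers can still be flushed, then shut down. */
static void
release_reserve(void)
{
    unlink(RESERVE_PATH);
    /* ... flush shared_buffers, then exit gracefully ... */
}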

> But I think even if you solve that it's not really a good long-term
> solution. We don't know how the OS handles block allocation for this
> type of file. I'm actually moderately surprised it isn't skipping
> enough blocks assuming you'll allocate them eventually. Even if it
> does handle it the way you expect what happens when you do grow a
> block, it'll have to allocate it way out of the way and we have no way
> to repair that discontinuity later.

I made a quick check before implementing it, using python scripts to play  
with sparse files on ext3 :

- writing a sparse file is a bit (not much) slower than a regular file,
- reading from a non-fragmented sparse file is as fast as reading a  
regular file
- holes do not eat OS disk cache (which is the most interesting point)
- reading from cache is as fast as reading a regular file (and faster if  
you don't read the holes because you know they are holes, which is the  
case here)

And, also, growing a sparse file by plugging the holes in it WILL allocate  
blocks all over the place and render IO extremely inefficient.
You can defrag it, of course (using fs-specific tools or just cpio), but  
that's not "high-availability"...
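
For anyone who wants to repeat the check, here is an equivalent quick test in
C (the original used python scripts; the file name and offset below are
arbitrary):

/*
 * Quick standalone check of the sparse-file behaviour described above.
 * st_blocks shows how many 512-byte blocks are really allocated.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    char        buf[4096] = "pretend this is a compressed page";
    struct stat st;
    int         fd = open("sparse_test", O_CREAT | O_RDWR | O_TRUNC, 0600);

    if (fd < 0)
        return 1;
    /* one 4k block at the 10 MB mark; everything before it is a hole */
    pwrite(fd, buf, sizeof buf, 10 * 1024 * 1024);
    fstat(fd, &st);
    printf("apparent size %lld bytes, really allocated %lld bytes\n",
           (long long) st.st_size, (long long) st.st_blocks * 512);
    close(fd);
    return 0;
}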

> Also, the way you've preallocated blocks effectively nails the maximum
> compression at 2x. That seems to be leaving a lot of money on the
> table.

Yes, it does, but there are a few subtleties.

First, about preallocation : I had to implement this, because of the way  
the files are extended.
Since first an empty block is written, it will compress well, to 1 OS  
page. When, later, this block is written again with data, it will compress  
to a larger amount of pages, which will trigger allocation all over the  
place. In testing, the resulting sparse file is slow, of the "forget  
it" kind...

The file could also be non-sparse : in this case, no disk space is saved.

Not saving disk space is OK with me : disks are really cheap these days.

I know this sounds crazy ;) but as I said above, compression is meant to  
save expensive resources. When you use H-264 to compress 1TB of raw video  
to 1GB of compressed video, you save an expensive resource (0.999 TB of  
harddisk). But if you use lzo to compress 40GB of database down to 20GB,  
you save maybe $3 in disk space, which is ludicrous. However, you also  
save much more expensive resources : disk throughput and RAM OS cache. So  
it might really pay off.

So, a non-sparse file would work well for random accesses too, since we  
only read the parts of it that contain compressed data, and the "wasted"  
pages would never be touched.

However it would have drawbacks :

- seq scan would not be able to skip the "empty" pages, so it would have  
no benefit whatsoever
- the empty pages, read-ahead by the OS, would pollute the disk cache
- the empty pages would also pollute the controller's cache and the disks'  
internal caches

So, a balance has to be reached on how many OS pages (4K on Linux) to  
preallocate for a Postgres block (16K or 32K to get good compression).

If you preallocate too little, adding data (INSERT, UPDATEs) will heavily  
fragment the file.
If you preallocate too much, seq scan will be slower and perhaps cache  
will be wasted (but it will still be much faster than the huge seek-fest  
which happens if you preallocate too little).

How many OS pages to preallocate for a postgres block should be specified  
by the user.

- If you intend to modify the table contents, you should preallocate a bit  
more.
- If the table will be mostly read-only, a tight fit is okay.

A nice thing is that it makes read-only tables a bit redundant. Basically,  
if you ask postgres
"CLUSTER or VAC FULL this table, and compress it, not preallocating  
anything more than necessary"
You will get a table and indexes that are well compressed, but updates,  
deletes, will be slow, so it is a read-mostly table.

If you use partitions, the current partition can be uncompressed (or  
compressed with a generous pre-allocation), but the past partitions, which  
are usually read-only, can be compressed tight, and their indexes too.

Then you could change the pre-allocation value for this table to  
preallocate more, and INSERT into it, no problem, except for the indexes,  
for which you'd need to specify beforehand to preallocate a bit more pages  
to allow for growing index pages...

So, how many pages to preallocate is a tuning parameter, that should have  
a reasonable default, but be user-settable, per-relation (table, index...)

Note that if a page contains lots of dead rows which are similar except a  
few updated columns, compression will exploit this redundancy.

Here are some stats on some tables of more than 300K rows :

* Table with lots of columns including a TEXT column of a few hundred  
bytes, containing french text, which is usually too short for TOASTING :
Size               : 314.69 MB
Paged Size         : 133.24 MB
Unpaged efficiency : 2.77
Paged Efficiency   : 2.36
pages | count
    1 |    16
    2 |   848
    3 |  4451
    4 |  4730
    5 |    25

The above is a histogram : for instance 4730 32K postgresql blocks  
compressed to 4 4K-pages (a ratio of 2X), and only 16 postgres blocks  
compressed to 4K.

"Size" is uncompressed size
"Paged Size" is the sum of all 4K which contain compressed data
"Unpaged Efficiency" is (raw data) / (compressed data), showing compressor  
performance.
"Paged Efficiency" is computed from the real number of pages used, so it  
is the interesting metric.

In this case, efficiency is 2.36 because some pages compress better than  
2X, but if we preallocate 16K for each 32K page, real efficiency will be  
2X.
We can see, also, that 25 blocks would need some extra allocation. This is  
a really small percentage of the table, so not significant to speed.

* GIN fulltext index on text field of above table

Size               : 136.75 MB
Paged Size         : 48.04 MB
Unpaged efficiency : 3.63
Paged Efficiency   : 2.85
pages | count
    1 |  1528
    2 |   470
    3 |   944
    4 |   172
    5 |  1262


* GIST fulltext index on text field of above table

annonces_ts_gist
Size               : 63.84 MB
Paged Size         : 33.89 MB
Unpaged efficiency : 2.14
Paged Efficiency   : 1.88
pages | count
    1 |     3
    2 |    32
    3 |   165
    4 |  1103
    5 |   739
    6 |     1

* table with 32 columns including a few TEXT fields which contain at most  
a few chars

Size               : 393.50 MB
Paged Size         : 127.24 MB
Unpaged efficiency : 3.83
Paged Efficiency   : 3.09
pages | count
    2 |  5207
    3 |  7381
    4 |     4

* btree on DATE column of above table

Size               : 56.34 MB
Unpaged efficiency : 2.63
Paged Efficiency   : 2.08
pages | count
    1 |     2
    2 |     1
    3 |   282
    4 |  1518


* btree on field "zipcode" of above table

Size               : 56.34 MB
Paged Size         : 27.04 MB
Unpaged efficiency : 2.90
Paged Efficiency   : 2.67
pages | count
    1 |     2
    2 |     2
    3 |  1794
    4 |     5

* gist index on GPS coordinates (every row has (lat,lon)) of above table

Size               : 196.47 MB
Paged Size         : 39.79 MB
Unpaged efficiency : 6.93
Paged Efficiency   : 4.94
pages | count
    1 |  2486
    2 |  3704
    3 |    96
    4 |     1

Note that the 32 column table is 400MB... and this gist index is almost  
200 MB.
This table has 770 MB of indexes on it...
Compression shrinks the table to 127 MB and indexes to a total of about  
300 MB.

* table of (increasing timestamp, 1 SERIAL, 1 increasing INT, 3 INT  
columns which contain random() data)

Size               : 344.06 MB
Paged Size         : 215.02 MB
Unpaged efficiency : 1.97
Paged Efficiency   : 1.60
pages | count
    1 |     1
    5 | 11009


* multicolumn btree on (int, timestamp) above

Size               : 153.19 MB
Paged Size         : 57.42 MB
Unpaged efficiency : 2.73
Paged Efficiency   : 2.67
pages | count
    1 |     3
    3 |  4899


* btree on int column above

Size               : 102.06 MB
Paged Size         : 63.73 MB
Unpaged efficiency : 1.87
Paged Efficiency   : 1.60
pages | count
    1 |     2
    2 |     1
    4 |     3
    5 |  3260


As you can see, efficiency varies widely between average (1.6), good  
(2-2.5), and amazing (almost 5) depending on what is compressed.

> To handle read-write tables I think we would need to directly
> implement the kind of indirection layer that you're getting out of the
> filesystem's block layer currently. That would let you allocate enough
> blocks to hold the data uncompressed and then free up those blocks
> once you're sure the data is compressible.

Just like NTFS does, then ?


> One possibility is to handle only read-only tables. That would make
> things a *lot* simpler. But it sure would be inconvenient if it's only
> useful on large static tables but requires you to rewrite the whole
> table -- just what you don't want to do with large static tables -- to
> get the benefit.

See above : read-mostly table that you can still update (but it will be  
very slow) or insert into (if you change the preallocation setting).


I forgot to talk about SSDs in the previous message. SSDs are quite  
expensive, but seek really fast. Saving disk space would be good on a SSD,  
because the price per byte is much higher. And since SSDs seek very fast,  
they probably wouldn't be bothered that much by sparse blocks being  
allocated all over the place. Also the price of SSDs means you wouldn't  
make a monster RAID out of them, so the CPU could decompress faster than  
the disk in a seqscan. This would need a test.


Ah, I would like to know a thing : if I add a 2-byte field to the postgres  
page header, which contains the compressed length, will I get any problems  
? (ie. do some pages use a non-standard header ?)

I need to modify a few things and add comments, then will submit a patch  
(not for committing, but for hacking).


Re: Table and Index compression

From
Pierre Frédéric Caillaud
Date:
> On Thu, Aug 6, 2009 at 4:03 PM, Greg Stark<gsstark@mit.edu> wrote:
>> I like the idea too, but I think there are some major problems to
>> solve. In particular I think we need a better solution to blocks
>> growing than sparse files.
>
> How much benefit does this approach have over using TOAST compression
> more aggressively?
>
> ...Robert
>


The two are different :

- TOAST compresses a large column value.

To store a 100KB text file, TOAST is great.

- page compression compresses whole pages.

Suppose you have a table with a TIMESTAMP, and a few INT columns. The rows  
are small enough to make per-row compression useless, and TOAST cannot  
compress non-varlenas anyway. However, if (for instance) the timestamp is  
the row insertion date, and you INSERT several rows per second, most  
timestamps on a page will have lots of bytes in common. Also, row headers  
(which are larger than the rows) will have much redundant data. Page  
compression can exploit this, without the need for the rest of the code to  
know about it.

Page compression can also handle indexes, etc.

Also, External TOAST is nice if you seldom need the field : for instance,  
you search on in-page columns, get the row you need, and fetch it.
Suppose you have a forum : in this case, when you display a topic page,  
you need all the posts text. It would be a very bad idea to store them in  
a separate TOAST table, because it would create more random IO. Storing  
the posts in the page means less IO, and if you regularly CLUSTER your  
table, all the posts you need to display a topic page are on the same (or  
adjacent) postgres page. In this case, individual post text can be  
TOASTed, too, but compression tends to work better with longer blocks, so  
compressing the whole page will be more efficient.
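
A quick way to see the redundancy argument outside postgres: compress a
synthetic 32k page whose row headers are identical and whose data varies only
slightly (a standalone illustration with miniLZO, not part of the patch; the
36-byte "row" layout is made up for the demo):

/*
 * Demo of the redundancy argument: a page full of rows with identical
 * headers and slightly varying data compresses very well with LZO.
 */
#include <stdio.h>
#include <string.h>
#include "minilzo.h"

#define PAGE  32768
#define ROW   36                 /* roughly: a row header plus a little data */

int main(void)
{
    static unsigned char page[PAGE], out[PAGE + PAGE / 16 + 64 + 3];
    static unsigned char wrkmem[LZO1X_1_MEM_COMPRESS];
    lzo_uint out_len;

    if (lzo_init() != LZO_E_OK)
        return 1;

    /* fill the page with rows: identical headers, slightly varying data */
    for (int i = 0; i + ROW <= PAGE; i += ROW)
    {
        memset(page + i, 0xAB, 28);               /* "row header"    */
        memcpy(page + i + 28, &i, sizeof i);      /* "timestamp-ish" */
    }

    lzo1x_1_compress(page, PAGE, out, &out_len, wrkmem);
    printf("32k page -> %lu bytes (%.1fx)\n",
           (unsigned long) out_len, (double) PAGE / out_len);
    return 0;
}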




Re: Table and Index compression

From
Sam Mason
Date:
On Fri, Aug 07, 2009 at 10:36:39AM +0200, Pierre Frédéric Caillaud wrote:
> Also, about compressed NTFS : it can give you disk-full errors on read().
> While this may appear stupid, it is in fact very good.

Is this not just because they've broken the semantics of read?

> As a side note, I have also tested lzjb (the ZFS compressor) and lzo is  
> much faster, and compresses much better (sometimes 2x better).

Disks are fast and cheap; a basic IDE disk runs at over 100MB/s now, and
it's doing this in the background while your CPU is doing other stuff.
If you're also decompressing stuff you're serializing even more and
you're doing so with a much power hungrier device (the CPU).

How fast is decompression (as that seems to be your selling point)?  Lzo
claims to run at about a third of main memory bandwidth, which is
nice, however research projects found this to be far too slow and were
only getting positive results when decompression stayed in secondary
cache.  Basically decompression has to run at several GB/s for it to
have much measurable benefit.

> I made a quick check before implementing it, using python scripts to play  
> with sparse files on ext3 :
> 
> - writing a sparse file is a bit (not much) slower than a regular file,
> - reading from a non-fragmented sparse file is as fast as reading a  
> regular file
> - holes do not eat OS disk cache (which is the most interesting point)
> - reading from cache is as fast as reading a regular file (and faster if  
> you don't read the holes because you know they are holes, which is the  
> case here)

Numbers?

> And, also, growing a sparse file by plugging the holes in it WILL allocate  
> blocks all over the place and render IO extremely inefficient.
> You can defrag it, of course (using fs-specific tools or just cpio), but  
> that's not "high-availability"...

That would not seem too difficult to solve.

> I forgot to talk about SSDs in the previous message. SSDs are quite  
> expensive, but seek really fast.

SSDs are about decreasing latency; if you're putting compression in
there you're pushing latency up as well.  If you don't care about
latency you get traditional rotating media.

--  Sam  http://samason.me.uk/


Re: Table and Index compression

From
Greg Stark
Date:
2009/8/7 Pierre Frédéric Caillaud <lists@peufeu.com>:
>
> Also, about compressed NTFS : it can give you disk-full errors on read().

I suspect it's unavoidable for similar reasons to the problems
Postgres faces. When you issue a read() you have to find space in the
filesystem cache to hold the data. Some other data has to be evicted.
If that data doesn't compress as well as it did previously it could
take more space and cause the disk to become full.

This also implies that fsync() could generate that error...

> Back to the point of how to handle disk full errors :
> - we could write a file the size of shared_buffers at startup
> - if a write() reports disk full, delete the file above
> - we now have enough space to flush all of shared_buffers
> - flush and exit gracefully

Unfortunately that doesn't really help. That only addresses the issue
for a single backend (or as many as are actually running when the
error starts). The next connection could read in new data and expand
that and now you have no slop space.

Put another way, we don't want to exit at all, gracefully or not. We
want to throw an error, abort the transaction (or subtransaction) and
keep going.


--
greg
http://mit.edu/~gsstark/resume.pdf


Re: Table and Index compression

From
Sam Mason
Date:
On Fri, Aug 07, 2009 at 10:33:33AM +0100, Greg Stark wrote:
> 2009/8/7 Pierre Frédéric Caillaud <lists@peufeu.com>:
> > Also, about compressed NTFS : it can give you disk-full errors on read().
> 
> I suspect it's unavoidable for similar reasons to the problems
> Postgres faces. When you issue a read() you have to find space in the
> filesystem cache to hold the data. Some other data has to be evicted.
> If that data doesn't compress as well as it did previously it could
> take more space and cause the disk to become full.
> 
> This also implies that fsync() could generate that error...

If that's indeed how it works, it seems like one broken file system that
needs to get its block accounting in order.

When you choose a compression algorithm you know how much space a worst
case compression will take (i.e. lzo takes up to 8% more for a 4kB block
size).  This space should be reserved in case of situations like the
above and the filesystem shouldn't over-commit on this.

Never had to think about this before though so I'm probably missing
something obvious.

--  Sam  http://samason.me.uk/


Re: Table and Index compression

From
Greg Stark
Date:
On Fri, Aug 7, 2009 at 11:29 AM, Sam Mason<sam@samason.me.uk> wrote:
> When you choose a compression algorithm you know how much space a worst
> case compression will take (i.e. lzo takes up to 8% more for a 4kB block
> size).  This space should be reserved in case of situations like the
> above and the filesystem shouldn't over-commit on this.
>
> Never had to think about this before though so I'm probably missing
> something obvious.

Well most users want compression for the space savings. So running out
of space sooner than without compression when most of the space is
actually unused would disappoint them.

Also, I'm puzzled why the space increase would be proportional
to the amount of data and be more than 300 bytes. There's no reason it
wouldn't be a small fixed amount. The ideal is you set aside one bit
-- if the bit is set the rest is compressed and has to save at least
one bit. If the bit is not set then the rest is uncompressed. Maximum
bloat is 1-bit. In real systems it's more likely to be a byte or a
word.

--
greg
http://mit.edu/~gsstark/resume.pdf


Re: Table and Index compression

From
Pierre Frédéric Caillaud
Date:
> Also, I'm puzzled why the space increase would be proportional
> to the amount of data and be more than 300 bytes. There's no reason it
> wouldn't be a small fixed amount. The ideal is you set aside one bit
> -- if the bit is set the rest is compressed and has to save at least
> one bit. If the bit is not set then the rest is uncompressed. Maximum
> bloat is 1-bit. In real systems it's more likely to be a byte or a
> word.
I'm working on cleaning the patch...
I added a field in PageHeader which contains :
- 0 to indicate a non-compressed page
- length of compressed data if compressed

If compression gains nothing (ie gains less than 4K), the page is stored  
raw.
It seems that only pages having a PageHeader are handled by md.c, so it  
should work (am I mistaken ?)
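
A sketch of that rule, with the 0-means-raw convention described above (the
helper and its names are illustrative, not the patch's code):

#include <stdint.h>

#define BLCKSZ  32768
#define OS_PAGE 4096

/* Value to put in the new header field: the compressed length if storing
 * the page compressed saves at least one 4k OS page, else 0 = stored raw. */
static uint32_t
header_compressed_len(uint32_t comp_len)
{
    uint32_t raw_pages  = BLCKSZ / OS_PAGE;
    uint32_t comp_pages = (comp_len + OS_PAGE - 1) / OS_PAGE;

    return (comp_pages < raw_pages) ? comp_len : 0;
}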



Re: Table and Index compression

From
Sam Mason
Date:
On Fri, Aug 07, 2009 at 11:49:46AM +0100, Greg Stark wrote:
> On Fri, Aug 7, 2009 at 11:29 AM, Sam Mason<sam@samason.me.uk> wrote:
> > When you choose a compression algorithm you know how much space a worst
> > case compression will take (i.e. lzo takes up to 8% more for a 4kB block
> > size).  This space should be reserved in case of situations like the
> > above and the filesystem shouldn't over-commit on this.
> >
> > Never had to think about this before though so I'm probably missing
> > something obvious.
> 
> Well most users want compression for the space savings. So running out
> of space sooner than without compression when most of the space is
> actually unused would disappoint them.

Note, that as far as I can tell for a filesystems you only need to keep
enough reserved for the amount of uncompressed dirty buffers you have in
memory.  As space runs out in the filesystem all that happens is that
the amount of (uncompressed?) dirty buffers you can safely have around
decreases.  In practical terms, this says that performance drops off
when there is less free space than the size of the filesystem's cache
and I think you have to reserve exactly one block to handle the base
case.  But there are so many problems associated with completely filling
a filesystem that I'm not sure if this would really matter.

> Also, I'm puzzled why the space increase would be proportional
> to the amount of data and be more than 300 bytes. There's no reason it
> wouldn't be a small fixed amount. The ideal is you set aside one bit
> -- if the bit is set the rest is compressed and has to save at least
> one bit. If the bit is not set then the rest is uncompressed. Maximum
> bloat is 1-bit. In real systems it's more likely to be a byte or a
> word.

It'll depend on the compression algorithm; lz algorithms are dictionary
based so you'd have a single entry for the incompressible data and then
a pointer to the entry.

In PG's case, it would seem possible to do the compression and then
check to see if the resulting size is greater than 4kB.  If it is you
write into the 4kB page size and write uncompressed data.  Upon reading
you do the inverse, if it's 4kB then no need to decompress.  I believe
TOAST does this already.

--  Sam  http://samason.me.uk/


Re: Table and Index compression

From
Greg Stark
Date:
On Fri, Aug 7, 2009 at 12:48 PM, Sam Mason<sam@samason.me.uk> wrote:
>> Well most users want compression for the space savings. So running out
>> of space sooner than without compression when most of the space is
>> actually unused would disappoint them.
>
> Note, that as far as I can tell for a filesystems you only need to keep
> enough reserved for the amount of uncompressed dirty buffers you have in
> memory.  As space runs out in the filesystem all that happens is that
> the amount of (uncompressed?) dirty buffers you can safely have around
> decreases.

And when it drops to zero?

> In PG's case, it would seem possible to do the compression and then
> check to see if the resulting size is greater than 4kB.  If it is you
> write into the 4kB page size and write uncompressed data.  Upon reading
> you do the inverse, if it's 4kB then no need to decompress.  I believe
> TOAST does this already.

It does, as does gzip and afaik every compression system.

--
greg
http://mit.edu/~gsstark/resume.pdf


Re: Table and Index compression

From
Greg Stark
Date:
For reference what I'm picturing is this:

When a table is compressed it's marked read-only which bars any new
tuples from being inserted or existing tuples being deleted. Then it's
frozen and any pages which contain tuples which can't be frozen are
waited on until they can be. When it's finished every tuple has to be
guaranteed to be fully frozen.

Then the relation is rewritten in compressed form. Each block is
compressed one by one and written one after the other to disk.

At the same time a new fork is written which contains a pointer to
each block. It could just be a directly addressed array of offsets and
lengths. All block lookups have to first load the page of the
indirection map, then read the appropriate section of the original
file and decompress it into shared buffers.
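
A sketch of what such an indirection fork could look like (types and names are
illustrative, not an actual postgres fork layout):

#include <stdint.h>

/* One entry per logical 32k block: where its compressed image lives. */
typedef struct
{
    uint64_t    offset;     /* byte position in the compressed relation file */
    uint32_t    length;     /* compressed length in bytes; 0 = stored raw    */
} BlockMapEntry;

#define MAP_BLCKSZ            32768
#define MAP_ENTRIES_PER_PAGE  (MAP_BLCKSZ / sizeof(BlockMapEntry))

/* Which page of the indirection fork, and which slot in it, describe a
 * given logical block number. */
static inline void
blockmap_locate(uint32_t blocknum, uint32_t *map_page, uint32_t *map_slot)
{
    *map_page = (uint32_t) (blocknum / MAP_ENTRIES_PER_PAGE);
    *map_slot = (uint32_t) (blocknum % MAP_ENTRIES_PER_PAGE);
}

Reading a block is then two steps: fetch the map entry, then read "length"
bytes at "offset" and decompress them into shared buffers.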

From a programming point of view this is nice and simple. From a
user's point of view it's a bit of a pain since it means you have to
rewrite your whole table when you want to compress it. And it means
you have to rewrite it all again if you decide you want to set it back
to read-write. My experience with people who have very large tables is
that they design their whole process around the goal of avoiding
having to move the data once it's written.


Re: Table and Index compression

From
Sam Mason
Date:
On Fri, Aug 07, 2009 at 12:59:57PM +0100, Greg Stark wrote:
> On Fri, Aug 7, 2009 at 12:48 PM, Sam Mason<sam@samason.me.uk> wrote:
> >> Well most users want compression for the space savings. So running out
> >> of space sooner than without compression when most of the space is
> >> actually unused would disappoint them.
> >
> > Note, that as far as I can tell for a filesystems you only need to keep
> > enough reserved for the amount of uncompressed dirty buffers you have in
> > memory.  As space runs out in the filesystem all that happens is that
> > the amount of (uncompressed?) dirty buffers you can safely have around
> > decreases.
> 
> And when it drops to zero?

That was why I said you need to have one page left "to handle the base
case".  I was treating the inductive case as the interesting common case
and considered the base case of lesser interest.

> > In PG's case, it would seem possible to do the compression and then
> > check to see if the resulting size is greater than 4kB.  If it is you
> > write into the 4kB page size and write uncompressed data.  Upon reading
> > you do the inverse, if it's 4kB then no need to decompress.  I believe
> > TOAST does this already.
> 
> It does, as does gzip and afaik every compression system.

It's still a case that needs to be handled explicitly by the code.  Just
for reference, gzip does not appear to do this when I test it:
  echo -n 'a' | gzip > tmp.gz
  gzip -l --verbose tmp.gz

says the compression ratio is "-200%" (an empty string results in
an infinite increase in size yet gets displayed as "0%" for some
strange reason).  It's only when you hit six 'a's that you start to get
positive ratios.  Note that that this is taking headers into account;
the compressed size is 23 bytes for both 'aaa' and 'aaaaaa' but the
uncompressed size obviously changes.

gzip does indeed have a "copy" method, but it doesn't seem to be being
used.

--  Sam  http://samason.me.uk/


Re: Table and Index compression

From
Robert Haas
Date:
On Fri, Aug 7, 2009 at 8:18 AM, Greg Stark<gsstark@mit.edu> wrote:
> For reference what I'm picturing is this:
>
> When a table is compressed it's marked read-only which bars any new
> tuples from being inserted or existing tuples being deleted. Then it's
> frozen and any pages which contain tuples which can't be frozen are
> waited on until they can be. When it's finished every tuple has to be
> guaranteed to be fully frozen.
>
> Then the relation is rewritten in compressed form. Each block is
> compressed one by one and written one after the other to disk.
>
> At the same time a new fork is written which contains a pointer to
> each block. It could just be a directly addressed array of offsets and
> lengths. All block lookups have to first load the page of the
> indirection map, then read the appropriate section of the original
> file and decompress it into shared buffers.
>
> From a programming point of view this is nice and simple. From a
> user's point of view it's a bit of a pain since it means you have to
> rewrite your whole table when you want to compress it. And it means
> you have to rewrite it all again if you decide you want to set it back
> to read-write. My experience with people who have very large tables is
> that they design their whole process around the goal of avoiding
> having to move the data once it's written.

If you add an indirection table, it's not strictly necessary for the
table to be read-only, though if you want to make it read-write you'd
need to think about how to defragment.

...Robert


Re: Table and Index compression

From
Pierre Frédéric Caillaud
Date:
> For reference what I'm picturing is this:
>
> When a table is compressed it's marked read-only which bars any new
> tuples from being inserted or existing tuples being deleted. Then it's
> frozen and any pages which contain tuples which can't be frozen are
> waited on until they can be. When it's finished every tuple has to be
> guaranteed to be fully frozen.
>
> Then the relation is rewritten in compressed form. Each block is
> compressed one by one and written one after the other to disk.
>
> At the same time a new fork is written which contains a pointer to
> each block. It could just be a directly addressed array of offsets and
> lengths. All block lookups have to first load the page of the
> indirection map, then read the appropriate section of the original
> file and decompress it into shared buffers.
I had pondered the idea of a fork storing the compressed status of each  
page, because it has advantages :
- no need to change the page layout to insert a "is compressed" flag
- possible to compress any data, not just standard pages
- if you know the compressed size of a page in advance, it is much easier  
to prefetch it entirely and not just the first chunk, or read too much...

> From a programming point of view this is nice and simple. From a
> user's point of view it's a bit of a pain since it means you have to
> rewrite your whole table when you want to compress it. And it means
> you have to rewrite it all again if you decide you want to set it back
> to read-write. My experience with people who have very large tables is
> that they design their whole process around the goal of avoiding
> having to move the data once it's written.
Note that if a table is huge, it is always cut in (currently) 1GB slices,  
so you could operate on one slice at a time, then release a lock, let the  
backlog of queries flow, and resume.
Realtime compression would be much less of a hassle to use, though...




Re: Table and Index compression

From
Pierre Frédéric Caillaud
Date:
Not strictly related to compression, but I've noticed something really
strange...

pg 8.4 (vanilla) is doing it, and my compressed version is doing it too.
tablespace is a RAID5 of 3 drives, xlog is on a RAID1 of 2 drives,
but it does it too if I put the tablespace and data on the same volume.

The traditional test table :

BEGIN; CREATE TABLE dwb (
bid SERIAL,
aid INTEGER NOT NULL,
ts TIMESTAMP NOT NULL,
i1 INTEGER NOT NULL,
i2 INTEGER NOT NULL,
i3 INTEGER NOT NULL,
i4 INTEGER NOT NULL
) WITHOUT OIDS;

The traditional test data :

INSERT INTO dwb (ts,aid,i1,i2,i3,i4)
SELECT now() + '1 sec'::INTERVAL, (1+random()*99998), random()*10000000,
n+random()*10000, n+random()*1000, n  FROM generate_series( 1, 60000000 ) AS n;

vmstat output :

it starts out relatively fast :
  si   so    bi    bo   in   cs    us sy id wa
     0    0     0 43680 2796 19162 42 18 37  3
     0    0     0 45515 3165 20652 44 17 35  4
     0    0     0 43130 3046 21991 43 17 38  2

then here it starts to slow down : check "bo" output

     0    0   181 24439  577 3541 31  6 40 23
     0    0   176 17258  292 1324 31  4 43 22
     0    0     0 18626  162  693 35  3 49 12
     0    0     1 21554  235 1362 31  5 50 14
     0    0     0 19177  324 2053 35  4 50 12
     0    0     0 19208  206 1155 36  4 48 12
     0    0     1 20740  215 1117 33  4 50 13
     0    0     0 20154  258 1100 32  4 50 14
     0    0     0 20355  316 2056 34  5 49 12
 

... and it stays like this until the end of the INSERT...

It's not writing any xlog since the table was created after the BEGIN...

I'm trying to benchmark insert speed, but no luck. This volume does about
100 MB/s sustained write speed, so ?......


Re: Table and Index compression

From
Sam Mason
Date:
On Fri, Aug 07, 2009 at 03:29:44PM +0200, Pierre Frédéric Caillaud wrote:
> vmstat output :

Sorry, I don't know enough of PGs internals to suggest anything here,
but iostat may give you more details as to what's going on.

--  Sam  http://samason.me.uk/


Re: Table and Index compression

From
"Kevin Grittner"
Date:
Pierre Frédéric Caillaud<lists@peufeu.com> wrote:
> tablespace is a RAID5 of 3 drives, xlog is on a RAID1 of 2 drives,
> but it does it too if I put the tablespace and data on the same
> volume.
> it starts out relatively fast :
> 
>   si   so    bi    bo   in   cs    us sy id wa
>      0    0     0 43680 2796 19162 42 18 37  3
>      0    0     0 45515 3165 20652 44 17 35  4
>      0    0     0 43130 3046 21991 43 17 38  2
> 
> then here it starts to slow down : check "bo" output
> 
>      0    0   181 24439  577 3541 31  6 40 23
>      0    0   176 17258  292 1324 31  4 43 22
>      0    0     0 18626  162  693 35  3 49 12
>      0    0     1 21554  235 1362 31  5 50 14
>      0    0     0 19177  324 2053 35  4 50 12
>      0    0     0 19208  206 1155 36  4 48 12
>      0    0     1 20740  215 1117 33  4 50 13
>      0    0     0 20154  258 1100 32  4 50 14
>      0    0     0 20355  316 2056 34  5 49 12
> 
> ... and it stays like this until the end of the INSERT...
I don't know if this is it, but we tend to see outrageously high
performance at the start of a benchmark because of the battery-backed
cache in the RAID controller.  Every write comes back immediately
after copying the data to RAM.  After a while the cache gets filled
and you settle down to a steady state.  If it's not BBU with
write-back enabled, perhaps you have drives that lie about write
completion?
-Kevin


Re: Table and Index compression

From
Pierre Frédéric Caillaud
Date:
On Fri, 07 Aug 2009 15:42:35 +0200, Kevin Grittner  
<Kevin.Grittner@wicourts.gov> wrote:

> Pierre Frédéric Caillaud<lists@peufeu.com> wrote:
>
>> tablespace is a RAID5 of 3 drives, xlog is on a RAID1 of 2 drives,
>> but it does it too if I put the tablespace and data on the same
>> volume.
>
>> it starts out relatively fast :
>>
>>   si   so    bi    bo   in   cs    us sy id wa
>>      0    0     0 43680 2796 19162 42 18 37  3
>>      0    0     0 45515 3165 20652 44 17 35  4
>>      0    0     0 43130 3046 21991 43 17 38  2
>>
>> then here it starts to slow down : check "bo" output
>>
>>      0    0   181 24439  577 3541 31  6 40 23
>>      0    0   176 17258  292 1324 31  4 43 22
>>      0    0     0 18626  162  693 35  3 49 12
>>      0    0     1 21554  235 1362 31  5 50 14
>>      0    0     0 19177  324 2053 35  4 50 12
>>      0    0     0 19208  206 1155 36  4 48 12
>>      0    0     1 20740  215 1117 33  4 50 13
>>      0    0     0 20154  258 1100 32  4 50 14
>>      0    0     0 20355  316 2056 34  5 49 12
>>
>> ... and it stays like this until the end of the INSERT...
> I don't know if this is it, but we tend to see outrageously high
> performance at the start of a benchmark because of the battery-backed
> cache in the RAID controller.  Every write comes back immediately
> after copying the data to RAM.  After a while the cache gets filled
> and you settle down to a steady state.  If it's not BBU with
> write-back enabled, perhaps you have drives that lie about write
> completion?
> -Kevin
>

I'm answering my own question : at the beginning of the run, postgres  
creates a 800MB temporary file, then it fills the table, then deletes the  
temp file.
Is this because I use generate_series to fill the test table ?


Re: Table and Index compression

From
Sam Mason
Date:
On Fri, Aug 07, 2009 at 04:17:18PM +0200, Pierre Frédéric Caillaud wrote:
> I'm answering my own question : at the beginning of the run, postgres  
> creates a 800MB temporary file, then it fills the table, then deletes the  
> temp file.
> Is this because I use generate_series to fill the test table ?

Doh, yes.  A function's result is written to temp location first and
then read back again once the function returns success.  You'll have
more luck if you do:
  SELECT now() + '1 sec'::INTERVAL, (1+random()*99998), random()*10000000,
         n+random()*10000, n+random()*1000, n
  FROM (SELECT generate_series( 1, 60000000 )) x(n);
 

--  Sam  http://samason.me.uk/


Re: Table and Index compression

From
Josh Berkus
Date:
Pierre,

>     I added a field in PageHeader which contains :
>     - 0 to indicate a non-compressed page
>     - length of compressed data if compressed
> 
>     If compression gains nothing (ie gains less than 4K), the page is
> stored raw.
> 
>     It seems that only pages having a PageHeader are handled by md.c, so
> it should work (am I mistaken ?)

Well, there's the issue of upgradability; this would require us to have
an incompatible upgrade of on-disk formats.  So we don't want to go
further down this route unless we're sure it's worthwhile.

-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


Re: Table and Index compression

From
Pierre Frédéric Caillaud
Date:
    Well, here is the patch. I've included a README, which I paste here.
    If someone wants to play with it (after the CommitFest...) feel free to
do so.
    While it was an interesting thing to try, I don't think it has enough
potential to justify more effort...


* How to test

- apply the patch
- copy minilzo.c and minilzo.h to
src/backend/storage/smgr

- configure & make
- enjoy

* How it works

- pg block size set to 32K
- an extra field is added in the header telling the compressed length

THIS IS BAD, this information should be stored in a separate fork of the
relation, because
    - it would then be backwards compatible
    - the number of bytes to read from a compressed page would be known in
advance

- the table file is sparse
- the page header is not compressed
- pages are written at their normal positions, but only the compressed
bytes are written
- if compression gains nothing, un-compressed page is stored
- the filesystem doesn't store the un-written blocks

* Benefits

- Sparse file holes are not cached, so OS disk cache efficiency is at
least x2
- Random access is faster, having a better probability to hit cache
(sometimes a bit faster, sometimes it's spectacular)
- Yes, it does save space (> 50%)

* Problems

- Biggest problem : any write to a table that writes data that compresses
less than whatever was there before can fail on a disk full error.

- ext3 sparse file handling isn't as fast as I wish it would be : on seq
scans, even if it reads 2x less data, and decompresses very fast, it's
still slower...

- many seq scans (especially with aggregates) are CPU bound anyway

- therefore, some kind of background-reader-decompressor would be needed

- pre-allocation has to be done to avoid extreme fragmentation of the
file, which kind of defeats the purpose

- it still causes fragmentation

* Conclusion (for now)

It was a nice thing to try, but I believe it would be better if this was
implemented directly in the filesystem, on the condition that it should be
implemented well (ie not like NTFS compression).





Re: Table and Index compression

From
Peter Eisentraut
Date:
On Tuesday 11 August 2009 13:05:39 Pierre Frédéric Caillaud wrote:
>     Well, here is the patch. I've included a README, which I paste here.
>     If someone wants to play with it (after the CommitFest...) feel free to
> do so.
>     While it was an interesting thing to try, I don't think it has enough
> potential to justify more effort...
>
>
> * How to test
>
> - apply the patch
> - copy minilzo.c and minilzo.h to
> src/backend/storage/smgr

For future reference, and since this keeps appearing every few months: The
license of LZO is not acceptable for inclusion or use with PostgreSQL.  You
need to find a different library if you want to pursue this further.



Re: Table and Index compression

From
Pierre Frédéric Caillaud
Date:
> For future reference, and since this keeps appearing every few months:  
> The
> license of LZO is not acceptable for inclusion or use with PostgreSQL.   
> You
> need to find a different library if you want to pursue this further.
Yes, I know about the license... I used LZO for tests, but since my  
little experiment with compression didn't give any really motivating  
results, I won't pursue this further.