Re: Table and Index compression - Mailing list pgsql-hackers
From           | Pierre Frédéric Caillaud
Subject        | Re: Table and Index compression
Date           |
Msg-id         | op.ux997d1qcke6l8@soyouz
In response to | Re: Table and Index compression (Greg Stark <gsstark@mit.edu>)
Responses      | Re: Table and Index compression
               | Re: Table and Index compression
List           | pgsql-hackers
First, a few things that I forgot to mention in the previous message:

> I like the idea too, but I think there are some major problems to
> solve. In particular I think we need a better solution to blocks
> growing than sparse files.

Sparse files do allow one great thing: testing this concept in real life, with less code than the length of this email.

> The main problem with using sparse files is that currently postgres is
> careful to allocate blocks early so it can fail if there's not enough
> space. With your sparse file solution Postgres might only find out
> there's no space after it has already committed a transaction.
> bgwriter has no good course of action to take if it finds out there's
> nowhere to put the data it has in shared buffers.

Yes, that is a big problem. Note that the same problem can happen if you use a filesystem that supports compression:

- you preallocate a file full of empty blocks
- the filesystem compresses it well and says it fits
- you fill it with data that compresses much worse than zeros
- at some point one of the write() calls fails with a disk full error

This would be an interesting argument in the "can we use compressed ZFS on postgres" debate...

Also, about compressed NTFS: it can give you disk-full errors on read(). While this may appear stupid, it is in fact very good. When you read() something compressed, NTFS reserves the disk space needed to re-write that data assuming the worst compression, so your application fails early (on read, not on write): if there is not enough space left on the volume, the read() fails with a disk full error. Many applications do not expect this. On the other hand, filling a disk completely is never a good idea, since it generally triggers fragmentation so extreme that the only way to recover is to go buy another harddisk and copy your data over.

As a side note, I have also tested lzjb (the ZFS compressor): lzo is much faster and compresses much better (sometimes 2x better).

Back to the point of how to handle disk full errors:

- we could write a file the size of shared_buffers at startup
- if a write() reports disk full, delete the file above
- we now have enough space to flush all of shared_buffers
- flush and exit gracefully

This will waste disk space. Of course... but the purpose of compression here is not to save disk space, it is to have a faster database. Also, if you use compression:

- you'll do it because your database is much bigger than shared_buffers, so the wasted space is not huge;
- you'll use a smaller shared_buffers than usual, because shared_buffers contains uncompressed pages, and you'll want to use lots of OS disk cache to cache compressed data.

Doesn't seem too crazy to me.

> But I think even if you solve that it's not really a good long-term
> solution. We don't know how the OS handles block allocation for this
> type of file. I'm actually moderately surprised it isn't skipping
> enough blocks assuming you'll allocate them eventually. Even if it
> does handle it the way you expect what happens when you do grow a
> block, it'll have to allocate it way out of the way and we have no way
> to repair that discontinuity later.
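One thing that can at least be observed from user space is how the filesystem treats holes: compare a sparse file's apparent size (st_size) with the space it actually allocates (st_blocks). A minimal sketch of such a probe, similar to the python checks described below (the file name and sizes are arbitrary, this is not part of the patch):

```c
/*
 * Sparse-file probe: write one block at the start and one at the 1 GB
 * mark, then compare the apparent size (st_size) with the space
 * actually allocated (st_blocks, in 512-byte units on Linux).
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    char        block[8192];
    struct stat st;
    int         fd = open("sparse_test.dat", O_CREAT | O_TRUNC | O_WRONLY, 0644);

    if (fd < 0)
        return 1;

    memset(block, 'x', sizeof(block));
    pwrite(fd, block, sizeof(block), 0);                     /* block at offset 0 */
    pwrite(fd, block, sizeof(block), 1024L * 1024 * 1024);   /* block at ~1 GB: a hole in between */
    fstat(fd, &st);

    printf("apparent size: %lld bytes, allocated: %lld bytes\n",
           (long long) st.st_size, (long long) st.st_blocks * 512);
    close(fd);
    return 0;
}
```

On ext3 this reports an apparent size of about 1 GB while only the two written blocks (plus a little filesystem bookkeeping) are actually allocated, which is what makes whole-file preallocation with holes so cheap.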
I made a quick check before implementing it, using python scripts to play with sparse files on ext3:

- writing a sparse file is a bit (not much) slower than writing a regular file;
- reading from a non-fragmented sparse file is as fast as reading a regular file;
- holes do not eat OS disk cache (which is the most interesting point);
- reading from cache is as fast as reading a regular file (and faster if you skip the holes because you know they are holes, which is the case here).

And, also, growing a sparse file by plugging the holes in it WILL allocate blocks all over the place and render IO extremely inefficient. You can defrag it, of course (using fs-specific tools or just cpio), but that's not "high-availability"...

> Also, the way you've preallocated blocks effectively nails the maximum
> compression at 2x. That seems to be leaving a lot of money on the
> table.

Yes, it does, but there are a few subtleties.

First, about preallocation: I had to implement it because of the way the files are extended. Since an empty block is written first, it compresses very well, down to one OS page. When this block is later rewritten with real data, it compresses to a larger number of pages, which triggers allocation all over the place. In testing, the resulting sparse file was slow, of the "forget it" kind...

The file could also be non-sparse: in that case, no disk space is saved. Not saving disk space is OK with me: disks are really cheap these days. I know this sounds crazy ;) but as I said above, compression is meant to save expensive resources. When you use H.264 to compress 1 TB of raw video down to 1 GB, you save an expensive resource (0.999 TB of harddisk). But if you use lzo to compress a 40 GB database down to 20 GB, you save maybe $3 in disk space, which is ludicrous. However, you also save much more expensive resources: disk throughput and OS cache RAM. So it might really pay off.

So, a non-sparse file would work well for random accesses too, since we only read the parts of it that contain compressed data, and the "wasted" pages would never be touched. However it would have drawbacks:

- a seq scan would not be able to skip the "empty" pages, so it would get no benefit whatsoever;
- the empty pages, read ahead by the OS, would pollute the disk cache;
- the empty pages would also pollute the controller's cache and the disks' internal caches.

So, a balance has to be reached on how many OS pages (4K on Linux) to preallocate for a Postgres block (16K or 32K to get good compression). If you preallocate too little, adding data (INSERTs, UPDATEs) will heavily fragment the file. If you preallocate too much, seq scans will be slower and some cache will be wasted (but it will still be much faster than the huge seek-fest which happens if you preallocate too little).

How many OS pages to preallocate for a Postgres block should be specified by the user:

- if you intend to modify the table contents, you should preallocate a bit more;
- if the table will be mostly read-only, a tight fit is okay.

A nice thing is that this makes dedicated read-only tables a bit redundant: basically, if you ask Postgres to "CLUSTER or VACUUM FULL this table, compress it, and don't preallocate anything more than necessary", you get a table and indexes that are well compressed, but updates and deletes will be slow, so it is a read-mostly table.
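To make the preallocation trade-off concrete, here is a tiny sketch of the arithmetic (the constants and names are made up for the illustration, not taken from the patch): with 4K OS pages and 4 pages reserved per 32K block, a block that compresses below 16K leaves the rest of its reservation as a hole, while one that compresses worse overflows the reservation and forces the filesystem to allocate extra pages far away, which is what fragments the file.

```c
/*
 * Illustration only: how a given preallocation interacts with the
 * compressed size of a 32K Postgres block.  Numbers are invented.
 */
#include <stdio.h>

#define OS_PAGE   4096      /* Linux page size */
#define PG_BLOCK  32768     /* Postgres block size used in the tests */

/* OS pages really needed to store 'compressed_len' bytes */
static int pages_needed(int compressed_len)
{
    return (compressed_len + OS_PAGE - 1) / OS_PAGE;
}

int main(void)
{
    int prealloc = 4;                           /* 4 x 4K reserved per block: 2x worst case */
    int sizes[]  = { 3500, 11000, 19000, 31000 };

    for (int i = 0; i < 4; i++)
    {
        int need = pages_needed(sizes[i]);

        printf("block compressed to %5d bytes: %d/%d pages used%s\n",
               sizes[i], need, prealloc,
               need > prealloc ? "  -> overflow, extra pages allocated far away"
                               : "  -> fits, the rest stays a hole");
    }
    return 0;
}
```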
If you use partitions, the current partition can be uncompressed (or compressed with a generous preallocation), but the past partitions, which are usually read-only, can be compressed tight, and their indexes too. You could then change the preallocation value of such a table so that it preallocates more, and INSERT into it, no problem, except for the indexes, for which you would need to specify beforehand a few extra preallocated pages to allow index pages to grow...

So, how many pages to preallocate is a tuning parameter; it should have a reasonable default, but be user-settable per relation (table, index, ...).

Note that if a page contains lots of dead rows which are similar except for a few updated columns, compression will exploit this redundancy.

Here are some stats on tables of more than 300K rows.

* Table with lots of columns, including a TEXT column of a few hundred bytes containing french text, which is usually too short for TOASTing:

Size               : 314.69 MB
Paged Size         : 133.24 MB
Unpaged efficiency : 2.77
Paged Efficiency   : 2.36

pages | count
    1 |    16
    2 |   848
    3 |  4451
    4 |  4730
    5 |    25

The above is a histogram: for instance, 4730 of the 32K postgresql blocks compressed to 4 4K-pages (a ratio of 2x), and only 16 postgres blocks compressed down to 4K.

- "Size" is the uncompressed size.
- "Paged Size" is the sum of all the 4K pages which contain compressed data.
- "Unpaged efficiency" is (raw data) / (compressed data), showing compressor performance.
- "Paged Efficiency" is computed from the real number of pages used, so it is the interesting metric.

In this case, efficiency is 2.36 because some pages compress better than 2x, but if we preallocate 16K for each 32K block, the real efficiency will be 2x. We can also see that 25 blocks would need some extra allocation. This is a really small percentage of the table, so it is not significant for speed.

* GIN fulltext index on the text field of the above table:

Size               : 136.75 MB
Paged Size         : 48.04 MB
Unpaged efficiency : 3.63
Paged Efficiency   : 2.85

pages | count
    1 |  1528
    2 |   470
    3 |   944
    4 |   172
    5 |  1262

* GIST fulltext index on the text field of the above table (annonces_ts_gist):

Size               : 63.84 MB
Paged Size         : 33.89 MB
Unpaged efficiency : 2.14
Paged Efficiency   : 1.88

pages | count
    1 |     3
    2 |    32
    3 |   165
    4 |  1103
    5 |   739
    6 |     1

* Table with 32 columns, including a few TEXT fields which contain at most a few chars:

Size               : 393.50 MB
Paged Size         : 127.24 MB
Unpaged efficiency : 3.83
Paged Efficiency   : 3.09

pages | count
    2 |  5207
    3 |  7381
    4 |     4

* btree on a DATE column of the above table:

Size               : 56.34 MB
Unpaged efficiency : 2.63
Paged Efficiency   : 2.08

pages | count
    1 |     2
    2 |     1
    3 |   282
    4 |  1518

* btree on the "zipcode" field of the above table:

Size               : 56.34 MB
Paged Size         : 27.04 MB
Unpaged efficiency : 2.90
Paged Efficiency   : 2.67

pages | count
    1 |     2
    2 |     2
    3 |  1794
    4 |     5

* gist index on GPS coordinates (every row has (lat,lon)) of the above table:

Size               : 196.47 MB
Paged Size         : 39.79 MB
Unpaged efficiency : 6.93
Paged Efficiency   : 4.94

pages | count
    1 |  2486
    2 |  3704
    3 |    96
    4 |     1

Note that the 32-column table is 400 MB... and this gist index alone is almost 200 MB; the table has 770 MB of indexes on it. Compression shrinks the table to 127 MB and the indexes to a total of about 300 MB.
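As a cross-check, the "Size", "Paged Size" and "Paged Efficiency" figures of the first table above can be recomputed from its histogram alone, assuming 32K Postgres blocks and 4K OS pages (a throwaway sketch, not part of the patch; the "Unpaged" figure cannot be reproduced this way because it needs the exact compressed byte counts, not just the page counts):

```c
/* Recompute the first table's figures from its histogram. */
#include <stdio.h>

int main(void)
{
    /* histogram: index = OS pages used, value = number of 32K blocks */
    const long   counts[] = { 0, 16, 848, 4451, 4730, 25 };
    const double block_kb = 32.0, os_page_kb = 4.0;
    long         blocks = 0, os_pages = 0;

    for (int p = 1; p <= 5; p++)
    {
        blocks   += counts[p];
        os_pages += (long) p * counts[p];
    }

    double size_mb  = blocks   * block_kb   / 1024.0;   /* ~314.69 MB */
    double paged_mb = os_pages * os_page_kb / 1024.0;   /* ~133.24 MB */

    printf("Size: %.2f MB, Paged Size: %.2f MB, Paged Efficiency: %.2f\n",
           size_mb, paged_mb, size_mb / paged_mb);
    return 0;
}
```

This reproduces the 314.69 MB / 133.24 MB / 2.36 reported above.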
* Table of (increasing timestamp, 1 SERIAL, 1 increasing INT, 3 INT columns which contain random() data):

Size               : 344.06 MB
Paged Size         : 215.02 MB
Unpaged efficiency : 1.97
Paged Efficiency   : 1.60

pages | count
    1 |     1
    5 | 11009

* Multicolumn btree on (int, timestamp) above:

Size               : 153.19 MB
Paged Size         : 57.42 MB
Unpaged efficiency : 2.73
Paged Efficiency   : 2.67

pages | count
    1 |     3
    3 |  4899

* btree on the int column above:

Size               : 102.06 MB
Paged Size         : 63.73 MB
Unpaged efficiency : 1.87
Paged Efficiency   : 1.60

pages | count
    1 |     2
    2 |     1
    4 |     3
    5 |  3260

As you can see, efficiency varies widely between average (1.6), good (2-2.5), and amazing (almost 5), depending on what is compressed.

> To handle read-write tables I think we would need to directly
> implement the kind of indirection layer that you're getting out of the
> filesystem's block layer currently. That would let you allocate enough
> blocks to hold the data uncompressed and then free up those blocks
> once you're sure the data is compressible.

Just like NTFS does, then?

> One possibility is to handle only read-only tables. That would make
> things a *lot* simpler. But it sure would be inconvenient if it's only
> useful on large static tables but requires you to rewrite the whole
> table -- just what you don't want to do with large static tables -- to
> get the benefit.

See above: a read-mostly table that you can still update (but it will be very slow) or insert into (if you change the preallocation setting).

I forgot to talk about SSDs in the previous message. SSDs are quite expensive, but they seek really fast. Saving disk space would be good on an SSD, because the price per byte is much higher. And since SSDs seek very fast, they probably wouldn't be bothered much by sparse blocks being allocated all over the place. Also, the price of SSDs means you wouldn't make a monster RAID out of them, so the CPU could decompress faster than the disk in a seq scan. This would need a test.

Ah, one thing I would like to know: if I add a 2-byte field to the postgres page header, which contains the compressed length, will I get any problems? (i.e. do some pages use a non-standard header?)

I need to modify a few things and add comments, then I will submit a patch (not for committing, but for hacking).
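PS: to make the page header question above concrete, here is a toy illustration of the idea (the structure and field name are invented for the example; the real layout is PageHeaderData in src/include/storage/bufpage.h, and this is not the actual patch):

```c
/*
 * Toy illustration of the proposed 2-byte field: NOT the real
 * PageHeaderData, it only shows the idea of keeping the compressed
 * length in the page header so a reader knows how many OS pages of a
 * 32K block actually contain data.
 */
#include <stdint.h>
#include <stdio.h>

typedef struct ToyPageHeader
{
    uint16_t pd_lower;           /* stand-ins for the usual bookkeeping */
    uint16_t pd_upper;
    uint16_t pd_compressed_len;  /* proposed field: 0 = stored uncompressed */
} ToyPageHeader;

int main(void)
{
    ToyPageHeader h = { 24, 8192, 11437 };

    /* with 4K OS pages, only the pages holding compressed bytes are read */
    unsigned pages = h.pd_compressed_len
                   ? (h.pd_compressed_len + 4095) / 4096
                   : 32768 / 4096;

    printf("compressed length %u -> read %u OS page(s) of this block\n",
           (unsigned) h.pd_compressed_len, pages);
    return 0;
}
```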