Thread: Table and Index compression
With the talk about adding compression to pg_dump lately, I've been wondering if tables and indexes could be compressed too. So I've implemented a quick on-the-fly compression patch for postgres. Sorry for the long email, but I hope you find this interesting.

Why compress ?

1- To save disk space ?
Disks are so cheap now that it is not a valid reason. Besides, anyone with a database that is big enough to fill a modern harddisk probably also has a monster RAID to get more IO throughput.

2- To make the database faster ?
This would be a valid reason. If the database is made smaller through compression, this can have several interesting effects :
- you need less RAM to cache it entirely, so random IO hits more cache than before
- if the CPU can decompress faster than the disk IO, even seq scans can be faster (this will not be the case on monster RAID setups, but for cheap servers, it could be)

So, why not ? I coded it.

I've benchmarked lzo, lzf, quicklz, lzjb, and fastLZ. The best for this is lzo : very fast decompression, a good compression ratio on a sample of postgres tables and indexes, and a license that could work. QuickLZ compresses faster and more, but is not free.

Compression compresses each page independently, so it is fast to decompress a single page. This means 8k pages are a bit small ; I used 32k pages to get better compression.

I've checked various table and index files :
- your average table with many columns, some TEXT, etc, compresses about 2x
- when the row header is significant vs the row data, it can reach 3x
- indexes compress well too, 2-4x
- gist indexes could be compressed up to 6x sometimes

Roughly, 2-3x compression is achievable on a complete database.

--------------------
Implementation

This lives in md.c. The implementation is about 100 lines of code (!). Instead of calling FileWrite and FileRead, md.c calls special functions that read and write compressed blocks.

* Writing : the block is compressed into a temp memory buffer. A header (prepended to the compressed data) tells the length of this data. Then it is written at its usual position in the disk file.

* Reading : the first 4k of compressed data is read. Looking at the length header, we know how much more to read. The rest of the data (if any) is read, then the data is decompressed.

Since a compressed block can be larger than the original block, I have enlarged the block size in the files by 4k, so that there is a block every 40k instead of every 32k (with 32k postgres pages).

That's it, very simple.

--------------------

Now, the reason it works is that the underlying file is handled as a sparse file by the OS. The holes between compressed blocks are removed : not recorded on disk, and never cached either.

However, sparse files can have big performance problems if you do this :
- write a small block in the middle of the file, surrounded by holes
- later enlarge this block

When the block is enlarged, if it needs an extra filesystem page, it will not be allocated contiguously. When it is read later, it will need an extra seek, which is really bad.

So, looking at the compression statistics :
- gist index for geometric coordinates search : a 32k page is compressed to 1-2 4k-pages, very rarely 3 pages
- btree indexes : a 32k page is compressed to 2-3 4k-pages
- large table : a 32k page is compressed to 2-4 4k-pages

Therefore, on write, I pre-allocate some space in the sparse file by writing more than needed : currently I write 5 4k-blocks. Whatever is written after the compressed data is garbage previously left in the buffer ; it is ignored on reads.
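A minimal sketch of the write/read path described above — not the actual md.c code : the helper names, the 4-byte length header and the direct use of minilzo's lzo1x calls are assumptions for illustration (lzo_init() must have been called once beforehand) :

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include "minilzo.h"

#define BLCKSZ    32768                 /* 32k postgres pages, as in the tests */
#define SLOTSZ    40960                 /* on-disk spacing : "a block every 40k" */
#define PREALLOC  (5 * 4096)            /* pre-allocate five 4k OS pages per write */
#define WORSTCASE (BLCKSZ + BLCKSZ / 16 + 64 + 3)   /* lzo1x worst-case output */

static unsigned char wrkmem[LZO1X_1_MEM_COMPRESS];

/* Compress one page and write it (plus pre-allocation padding) into the sparse file. */
int write_compressed_block(int fd, uint32_t blkno, unsigned char *page)
{
    unsigned char buf[4 + WORSTCASE];
    lzo_uint clen = 0;
    uint32_t hdr;
    size_t towrite;

    if (lzo1x_1_compress(page, BLCKSZ, buf + 4, &clen, wrkmem) != LZO_E_OK)
        return -1;
    hdr = (uint32_t) clen;
    memcpy(buf, &hdr, 4);               /* length header, read back first on reads */

    /* pad up to the pre-allocated size so the OS allocates the space now ;
     * the padding is garbage and is never looked at on reads */
    towrite = 4 + clen;
    if (towrite < PREALLOC)
        towrite = PREALLOC;
    return pwrite(fd, buf, towrite, (off_t) blkno * SLOTSZ) < 0 ? -1 : 0;
}

/* Read the first 4k, look at the length header, read the rest, decompress. */
int read_compressed_block(int fd, uint32_t blkno, unsigned char *page)
{
    unsigned char buf[4 + WORSTCASE];
    uint32_t hdr;
    lzo_uint dlen = BLCKSZ;

    if (pread(fd, buf, 4096, (off_t) blkno * SLOTSZ) < 0)
        return -1;
    memcpy(&hdr, buf, 4);
    if (4 + hdr > 4096 &&
        pread(fd, buf + 4096, 4 + hdr - 4096, (off_t) blkno * SLOTSZ + 4096) < 0)
        return -1;
    return lzo1x_decompress(buf + 4, hdr, page, &dlen, NULL) == LZO_E_OK ? 0 : -1;
}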
This means the disk space savings are less than with full compression, but access is much smoother, in fact much like a regular non-sparse file, since the blocks between the holes almost never need to be grown. Without pre-allocating, performance is abysmal, not even worth talking about.

Pre-allocated but not actually used blocks are never read, except maybe by OS readahead during a seq scan. On a heavy random-access database they will not be touched, so they don't waste any space in the OS cache.

--------------------

shared_buffers thus contains decompressed blocks : a row that is updated very often will not go through decompression-compression cycles each time. The OS cache contains compressed data.

Some tests :

It appears to behave as expected. It didn't crash (yet...). Basically it looks like RAM has doubled and CPU speed is halved.

Random access queries are faster, even on a cold cache, and of course much better cached afterwards, since the amount of data the OS cache can hold is at least doubled.

Seq scans on a huge table, reading data from disk, are a tiny bit slower, which is strange : on my test box, the disks are slow (50 MB/s) and lzo can decompress much faster than this. At least it isn't much slower.

Seq scans on a cached table that would fit in RAM anyway are slower, because it needs to be decompressed.

Seq scans on a table that can be cached only because of the compression are much faster than seq scans without compression, where it has to hit the disk.

Compressing the data pages directly also reduces the need for many ad-hoc schemes like compressing index tuples, etc. Who cares if the row header is huge ? If all rows have the same row header, lzo will shrink it down to 2 bytes, and the rest of Postgres doesn't need to know about it. Also, this compression scheme operates on whole pages : no need to split a page if it suddenly becomes larger after compression. Only the storage manager needs to know about it. The OS sparse file handling takes care of the gremlins.

For someone wanting to run a cheap server like this, it could be an interesting tradeoff :
http://www.ovh.com/fr/produits/superplan_best_of.xml

--------------------

I need to clean it up a little and do some more checks before I send the patch to you. I am not suggesting this as a part of the commitfest for 8.5, but as a proof of concept, since many features are missing :

- ability to specify per-table/index or per-tablespace compression (per-tablespace would be better I think)
  Right now it's compressing everything, even the sys catalogs, even the fsm, which is a bit dumb... There should really be a md.c and a md_compress.c, and an ALTER TABLESPACE SET COMPRESSION TO 1, then move tables to it. Then you can keep your tables un-compressed, except this huge fulltext index that eats your RAM is now much smaller.

- check how it corrupts data during recovery
  Ahem, notice I didn't say "if".

- check how it works under heavy concurrent load
  I don't know under which locks md.c operates.

- determine good preallocation values
  Those vary according to the nature of the data, maybe user-supplied ?

- test with 16k pages also
  Better caching granularity at the cost of worse compression.

- TOAST
  Yes, the TOAST table is compressed too, even if it was compressed already by TOAST, so some "tuning" is needed.

- Page header
  I want to add 1 byte to the page header to specify if a page is compressed or not, and the length (in 4k-blocks) of the compressed data.
Then, more ideas come to mind :

- Read-only
  Someone talked about adding a read-only option to a table, which would do some magic tricks to the data when re-building the table (CLUSTER, VAC FULL), making it more efficient. This would be a good place to also compress it, and write it as a sparse file without the need for pre-allocation.

- Lazy decompression
  Right now pages are decompressed on read, which is not optimal. It would be better to issue another read() immediately, and decompress the page later. The page would be stored compressed in shared_buffers, then the first process which needs it would decompress it, and it would remain decompressed until written.

- Parallel decompression
  A new friend of the bgwriter, the bg-decompressor, hunts recently-read compressed pages in shared_buffers and decompresses them. It would have one process per CPU core. One core can decompress 200 MB/s or even more... a cheap quad core CPU, even with crummy SATA disks, would be able to seq scan like a monster RAID...

- Parallel compression
  The bg-decompressor can also be a bg-compressor, hunting pages that are about to be written and compressing them ; this would speed up pg_restore by reducing the IO.

- Relation extend
  Right now it seems that, when a relation needs a new page, this page is written (empty) at the end of the relation, then at some point later written again, with data. So it has to be compressed twice (or more), and the first time, since it is empty, it will compress to something like 100 bytes, hence the need for pre-allocation.

All tests above were done on a Core 2, 4GB RAM (3.5 actually since 32-bit OS), SATA RAID1, kubuntu, ext3 filesystem. I'm going to do some checks on a XFS volume right now to see how it handles sparse files...

Cheers,
Pierre
On 8/6/09 2:39 AM, PFC wrote:
> With the talk about adding compression to pg_dump lately, I've been
> wondering if tables and indexes could be compressed too.
> So I've implemented a quick on-the-fly compression patch for postgres

I find this very interesting, and would like to test it further on some client workloads, before you/we put more work into completing it. I think if we can implement compressed database as an option (perhaps at initdb time, perhaps at tablespace creation time) then it will be very attractive.

Where is the patch? BTW, who are you actually?

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com
Pierre,

On Thu, Aug 6, 2009 at 11:39 AM, PFC <lists@peufeu.com> wrote:
> The best for this is lzo : very fast decompression, a good compression ratio
> on a sample of postgres tables and indexes, and a license that could work.

The license of lzo doesn't allow us to include it in PostgreSQL without relicensing PostgreSQL as GPL. I'm not sure what you mean by "a license that could work".

Note that it doesn't change the interest of your approach. It's just that I'm not sure we can find a performance-acceptable BSD-licensed compression library (it has been discussed many times here).

--
Guillaume
I like the idea too, but I think there are some major problems to solve. In particular I think we need a better solution to blocks growing than sparse files.

The main problem with using sparse files is that currently postgres is careful to allocate blocks early so it can fail if there's not enough space. With your sparse file solution Postgres might only find out there's no space after it has already committed a transaction. bgwriter has no good course of action to take if it finds out there's nowhere to put the data it has in shared buffers.

But I think even if you solve that it's not really a good long-term solution. We don't know how the OS handles block allocation for this type of file. I'm actually moderately surprised it isn't skipping enough blocks assuming you'll allocate them eventually. Even if it does handle it the way you expect, what happens when you do grow a block? It'll have to allocate it way out of the way, and we have no way to repair that discontinuity later.

Also, the way you've preallocated blocks effectively nails the maximum compression at 2x. That seems to be leaving a lot of money on the table.

To handle read-write tables I think we would need to directly implement the kind of indirection layer that you're getting out of the filesystem's block layer currently. That would let you allocate enough blocks to hold the data uncompressed and then free up those blocks once you're sure the data is compressible.

One possibility is to handle only read-only tables. That would make things a *lot* simpler. But it sure would be inconvenient if it's only useful on large static tables but requires you to rewrite the whole table -- just what you don't want to do with large static tables -- to get the benefit.
On Thu, Aug 6, 2009 at 4:03 PM, Greg Stark <gsstark@mit.edu> wrote:
> I like the idea too, but I think there are some major problems to
> solve. In particular I think we need a better solution to blocks
> growing than sparse files.

How much benefit does this approach have over using TOAST compression more aggressively?

...Robert
Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Aug 6, 2009 at 4:03 PM, Greg Stark<gsstark@mit.edu> wrote:
>> I like the idea too, but I think there are some major problems to
>> solve. In particular I think we need a better solution to blocks
>> growing than sparse files.
>
> How much benefit does this approach have over using TOAST
> compression more aggressively?

I was wondering the same thing. It seems like compressing a page at a time should allow more space savings than a column at a time, and possibly do it faster.

One question I have banging around in my head is what to do with out-of-line storage. Sometimes you have a large column which you know contains data which is already compressed and/or encrypted, so attempting compression would give little or no benefit; so I'm inclined to think that if we do page compression, it shouldn't deal with toast tables. We could leave them to the current techniques.

That leaves some subtle problems with how to deal with a datum which currently compresses from, say, several kB down to one or two hundred bytes. Current TOAST logic would typically compress and inline it. What would we do if we're trying to push heap compression to the page level?

-Kevin
On 8/6/09 1:03 PM, Greg Stark wrote:
> One possibility is to handle only read-only tables. That would make
> things a *lot* simpler. But it sure would be inconvenient if it's only
> useful on large static tables but requires you to rewrite the whole
> table -- just what you don't want to do with large static tables -- to
> get the benefit.

While less flexible, I could see combining this with partitioning to still be useful. If we could rewrite specific partitions as compressed, then there are a lot of cumulative-data applications which would benefit. Not as exciting as being able to compress the whole thing, of course.

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com
I'm curious what advantages there are in building compression into the database itself, rather than using filesystem-based compression. I see ZFS articles[1] discuss how enabling compression improves performance with ZFS; for Linux, Btrfs has compression features as well[2]; and on Windows, NTFS seems to as well.

[1] http://blogs.sun.com/observatory/entry/zfs_compression_a_win_win
[2] http://lwn.net/Articles/305697/
First, a few things that I forgot to mention in the previous message :

> I like the idea too, but I think there are some major problems to
> solve. In particular I think we need a better solution to blocks
> growing than sparse files.

Sparse files allow something great : to test this concept in real life, with a length of code that is shorter than this email.

> The main problem with using sparse files is that currently postgres is
> careful to allocate blocks early so it can fail if there's not enough
> space. With your sparse file solution Postgres might only find out
> there's no space after it has already committed a transaction.
> bgwriter has no good course of action to take if it finds out there's
> nowhere to put the data it has in shared buffers.

Yes, that is a big problem. Note that the same problem can happen if you use a filesystem that supports compression :
- you preallocate a file full of empty blocks
- the filesystem compresses it well, says it fits
- you fill it with data that compresses much worse than zeros
- at some point one of the write() calls fails with a disk full error

This would be an interesting argument in the "can we use compressed ZFS on postgres" debate...

Also, about compressed NTFS : it can give you disk-full errors on read(). While this may appear stupid, it is in fact very good. When you read() something compressed, NTFS will reserve the space on disk needed to re-write this data, assuming the worst compression, so your app fails early (on read, not on write). If there is not enough space on the volume, read() will fail with a disk full error. Many applications do not expect this. On the other hand, filling a disk completely is never a good idea, since it generally triggers fragmentation so extreme that the only way to recover is to go buy another harddisk and copy your data.

As a side note, I have also tested lzjb (the ZFS compressor) and lzo is much faster, and compresses much better (sometimes 2x better).

Back to the point of how to handle disk full errors :
- we could write a file the size of shared_buffers at startup
- if a write() reports disk full, delete the file above
- we now have enough space to flush all of shared_buffers
- flush and exit gracefully

This will waste disk space. Of course... but the purpose of compression is not to save disk space, it is to have a faster database. Also, if you use compression :
- you'll do it because your database is much bigger than shared_buffers, so the wasted space is not huge
- you'll use a smaller shared_buffers than usual, because shared_buffers contains uncompressed pages, and you'll want to use lots of OS disk cache to cache compressed data

Doesn't seem too crazy to me.

> But I think even if you solve that it's not really a good long-term
> solution. We don't know how the OS handles block allocation for this
> type of file. I'm actually moderately surprised it isn't skipping
> enough blocks assuming you'll allocate them eventually. Even if it
> does handle it the way you expect, what happens when you do grow a
> block? It'll have to allocate it way out of the way, and we have no way
> to repair that discontinuity later.
I made a quick check before implementing it, using python scripts to play with sparse files on ext3 :

- writing a sparse file is a bit (not much) slower than writing a regular file
- reading from a non-fragmented sparse file is as fast as reading a regular file
- holes do not eat OS disk cache (which is the most interesting point)
- reading from cache is as fast as reading a regular file (and faster if you don't read the holes because you know they are holes, which is the case here)

And, also, growing a sparse file by plugging the holes in it WILL allocate blocks all over the place and render IO extremely inefficient. You can defrag it, of course (using fs-specific tools or just cpio), but that's not "high-availability"...

> Also, the way you've preallocated blocks effectively nails the maximum
> compression at 2x. That seems to be leaving a lot of money on the
> table.

Yes, it does, but there are a few subtleties.

First, about preallocation : I had to implement this because of the way the files are extended. Since first an empty block is written, it will compress well, to 1 OS page. When, later, this block is written again with data, it will compress to a larger number of pages, which will trigger allocation all over the place. In testing, the resulting sparse file is slow, of the "forget it" kind...

The file could also be non-sparse : in this case, no disk space is saved. Not saving disk space is OK with me : disks are really cheap these days. I know this sounds crazy ;) but as I said above, compression is meant to save expensive resources. When you use H-264 to compress 1TB of raw video to 1GB of compressed video, you save an expensive resource (0.999 TB of harddisk). But if you use lzo to compress 40GB of database down to 20GB, you save maybe $3 in disk space, which is ludicrous. However, you also save much more expensive resources : disk throughput and RAM OS cache. So it might really pay off.

So, a non-sparse file would work well for random accesses too, since we only read the parts of it that contain compressed data, and the "wasted" pages would never be touched. However it would have drawbacks :
- seq scan would not be able to skip the "empty" pages, so it would have no benefit whatsoever
- the empty pages, read ahead by the OS, would pollute the disk cache
- the empty pages would also pollute the controller's cache and the disks' internal caches

So, a balance has to be reached on how many OS pages (4K on Linux) to preallocate for a Postgres block (16K or 32K to get good compression). If you preallocate too little, adding data (INSERTs, UPDATEs) will heavily fragment the file. If you preallocate too much, seq scan will be slower and perhaps cache will be wasted (but it will still be much faster than the huge seek-fest which happens if you preallocate too little).

How many OS pages to preallocate for a postgres block should be specified by the user :
- if you intend to modify the table contents, you should preallocate a bit more
- if the table will be mostly read-only, a tight fit is okay

A nice thing is that it makes read-only tables a bit redundant. Basically, if you ask postgres "CLUSTER or VAC FULL this table, and compress it, not preallocating anything more than necessary", you will get a table and indexes that are well compressed, but updates and deletes will be slow, so it is a read-mostly table.
If you use partitions, the current partition can be uncompressed (or compressed with a generous pre-allocation), but the past partitions, which are usually read-only, can be compressed tight, and their indexes too. Then you could change the pre-allocation value for this table to preallocate more, and INSERT into it, no problem, except for the indexes, for which you'd need to specify beforehand to preallocate a bit more pages to allow for growing index pages...

So, how many pages to preallocate is a tuning parameter, that should have a reasonable default, but be user-settable, per-relation (table, index...).

Note that if a page contains lots of dead rows which are similar except for a few updated columns, compression will exploit this redundancy.

Here are some stats on some tables of more than 300K rows :

* Table with lots of columns including a TEXT column of a few hundred bytes, containing french text, which is usually too short for TOASTING :

Size : 314.69 MB
Paged Size : 133.24 MB
Unpaged efficiency : 2.77
Paged Efficiency : 2.36

pages | count
    1 |    16
    2 |   848
    3 |  4451
    4 |  4730
    5 |    25

The above is a histogram : for instance 4730 32K postgresql blocks compressed to 4 4K-pages (a ratio of 2X), and only 16 postgres blocks compressed to 4K.

"Size" is the uncompressed size.
"Paged Size" is the sum of all 4K pages which contain compressed data.
"Unpaged Efficiency" is (raw data) / (compressed data), showing compressor performance.
"Paged Efficiency" is computed from the real number of pages used, so it is the interesting metric.

In this case, efficiency is 2.36 because some pages compress better than 2X, but if we preallocate 16K for each 32K page, real efficiency will be 2X. We can see, also, that 25 blocks would need some extra allocation. This is a really small percentage of the table, so not significant to speed.

* GIN fulltext index on text field of above table

Size : 136.75 MB
Paged Size : 48.04 MB
Unpaged efficiency : 3.63
Paged Efficiency : 2.85

pages | count
    1 |  1528
    2 |   470
    3 |   944
    4 |   172
    5 |  1262

* GIST fulltext index on text field of above table (annonces_ts_gist)

Size : 63.84 MB
Paged Size : 33.89 MB
Unpaged efficiency : 2.14
Paged Efficiency : 1.88

pages | count
    1 |     3
    2 |    32
    3 |   165
    4 |  1103
    5 |   739
    6 |     1

* table with 32 columns including a few TEXT fields which contain at most a few chars

Size : 393.50 MB
Paged Size : 127.24 MB
Unpaged efficiency : 3.83
Paged Efficiency : 3.09

pages | count
    2 |  5207
    3 |  7381
    4 |     4

* btree on DATE column of above table

Size : 56.34 MB
Unpaged efficiency : 2.63
Paged Efficiency : 2.08

pages | count
    1 |     2
    2 |     1
    3 |   282
    4 |  1518

* btree on field "zipcode" of above table

Size : 56.34 MB
Paged Size : 27.04 MB
Unpaged efficiency : 2.90
Paged Efficiency : 2.67

pages | count
    1 |     2
    2 |     2
    3 |  1794
    4 |     5

* gist index on GPS coordinates (every row has (lat,lon)) of above table

Size : 196.47 MB
Paged Size : 39.79 MB
Unpaged efficiency : 6.93
Paged Efficiency : 4.94

pages | count
    1 |  2486
    2 |  3704
    3 |    96
    4 |     1

Note that the 32 column table is 400MB... and this gist index is almost 200 MB. This table has 770 MB of indexes on it... Compression shrinks the table to 127 MB and the indexes to a total of about 300 MB.
* table of (increasing timestamp, 1 SERIAL, 1 increasing INT, 3 INT columns which contain random() data)

Size : 344.06 MB
Paged Size : 215.02 MB
Unpaged efficiency : 1.97
Paged Efficiency : 1.60

pages | count
    1 |     1
    5 | 11009

* multicolumn btree on (int, timestamp) above

Size : 153.19 MB
Paged Size : 57.42 MB
Unpaged efficiency : 2.73
Paged Efficiency : 2.67

pages | count
    1 |     3
    3 |  4899

* btree on int column above

Size : 102.06 MB
Paged Size : 63.73 MB
Unpaged efficiency : 1.87
Paged Efficiency : 1.60

pages | count
    1 |     2
    2 |     1
    4 |     3
    5 |  3260

As you can see, efficiency varies widely between average (1.6), good (2-2.5), and amazing (almost 5) depending on what is compressed.

> To handle read-write tables I think we would need to directly
> implement the kind of indirection layer that you're getting out of the
> filesystem's block layer currently. That would let you allocate enough
> blocks to hold the data uncompressed and then free up those blocks
> once you're sure the data is compressible.

Just like NTFS does, then ?

> One possibility is to handle only read-only tables. That would make
> things a *lot* simpler. But it sure would be inconvenient if it's only
> useful on large static tables but requires you to rewrite the whole
> table -- just what you don't want to do with large static tables -- to
> get the benefit.

See above : a read-mostly table that you can still update (but it will be very slow) or insert into (if you change the preallocation setting).

I forgot to talk about SSDs in the previous message. SSDs are quite expensive, but seek really fast. Saving disk space would be good on a SSD, because the price per byte is much higher. And since SSDs seek very fast, they probably wouldn't be bothered that much by sparse blocks being allocated all over the place. Also the price of SSDs means you wouldn't make a monster RAID out of them, so the CPU could decompress faster than the disks in a seqscan. This would need a test.

Ah, I would like to know a thing : if I add a 2-byte field to the postgres page header, which contains the compressed length, will I get any problems ? (ie. do some pages use a non-standard header ?)

I need to modify a few things and add comments, then will submit a patch (not for committing, but for hacking).
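A standalone C rendering of the kind of sparse-file check mentioned at the top of this message (the original quick tests used python scripts) ; the file name and sizes are arbitrary and error checking is omitted :

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    struct stat st;
    int fd = open("sparse_test", O_CREAT | O_RDWR | O_TRUNC, 0600);

    memset(buf, 'x', sizeof(buf));

    /* one 4k block every 40k : the gaps are holes, not allocated and not cached */
    for (int i = 0; i < 1000; i++)
        pwrite(fd, buf, sizeof(buf), (off_t) i * 40960);

    fstat(fd, &st);
    printf("apparent size %lld bytes, really allocated %lld bytes\n",
           (long long) st.st_size, (long long) st.st_blocks * 512);

    /* "plugging" a hole afterwards : this block lands far from its neighbours,
     * which is the seek-inducing fragmentation discussed above */
    pwrite(fd, buf, sizeof(buf), 20480);

    close(fd);
    unlink("sparse_test");
    return 0;
}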
> On Thu, Aug 6, 2009 at 4:03 PM, Greg Stark<gsstark@mit.edu> wrote:
>> I like the idea too, but I think there are some major problems to
>> solve. In particular I think we need a better solution to blocks
>> growing than sparse files.
>
> How much benefit does this approach have over using TOAST compression
> more aggressively?
>
> ...Robert

The two are different :

- TOAST compresses a large column value. To store a 100KB text file, TOAST is great.
- Page compression compresses whole pages. Suppose you have a table with a TIMESTAMP and a few INT columns. The rows are small enough to make per-row compression useless, and TOAST cannot compress non-varlenas anyway. However, if (for instance) the timestamp is the row insertion date, and you INSERT several rows per second, most timestamps on a page will have lots of bytes in common. Also, row headers (which are larger than the rows) will have much redundant data. Page compression can exploit this, without the need for the rest of the code to know about it. Page compression can also handle indexes, etc.

Also, external TOAST is nice if you seldom need the field : for instance, you search on in-page columns, get the row you need, and fetch it. Suppose you have a forum : in this case, when you display a topic page, you need all the posts' text. It would be a very bad idea to store them in a separate TOAST table, because it would create more random IO. Storing the posts in the page means less IO, and if you regularly CLUSTER your table, all the posts you need to display a topic page are on the same (or adjacent) postgres page. In this case, individual post text can be TOASTed too, but compression tends to work better with longer blocks, so compressing the whole page will be more efficient.
On Fri, Aug 07, 2009 at 10:36:39AM +0200, Pierre Frédéric Caillaud wrote:
> Also, about compressed NTFS : it can give you disk-full errors on read().
> While this may appear stupid, it is in fact very good.

Is this not just because they've broken the semantics of read?

> As a side note, I have also tested lzjb (the ZFS compressor) and lzo is
> much faster, and compresses much better (sometimes 2x better).

Disks are fast and cheap; a basic IDE disk runs at over 100MB/s now, and it's doing this in the background while your CPU is doing other stuff. If you're also decompressing stuff you're serializing even more, and you're doing so with a much power-hungrier device (the CPU). How fast is decompression (as that seems to be your selling point)? Lzo claims to run at about a third of main memory bandwidth, which is nice; however, research projects found this to be far too slow and were only getting positive results when decompression stayed in secondary cache. Basically decompression has to run at several GB/s for it to have much measurable benefit.

> I made a quick check before implementing it, using python scripts to play
> with sparse files on ext3 :
>
> - writing a sparse file is a bit (not much) slower than a regular file,
> - reading from a non-fragmented sparse file is as fast as reading a
>   regular file
> - holes do not eat OS disk cache (which is the most interesting point)
> - reading from cache is as fast as reading a regular file (and faster if
>   you don't read the holes because you know they are holes, which is the
>   case here)

Numbers?

> And, also, growing a sparse file by plugging the holes in it WILL allocate
> blocks all over the place and render IO extremely inefficient.
> You can defrag it, of course (using fs-specific tools or just cpio), but
> that's not "high-availability"...

That would not seem too difficult to solve.

> I forgot to talk about SSDs in the previous message. SSDs are quite
> expensive, but seek really fast.

SSDs are about decreasing latency; if you're putting compression in there you're pushing latency up as well. If you don't care about latency you get traditional rotating media.

--
Sam
http://samason.me.uk/
2009/8/7 Pierre Frédéric Caillaud <lists@peufeu.com>:
> Also, about compressed NTFS : it can give you disk-full errors on read().

I suspect it's unavoidable, for similar reasons to the problems Postgres faces. When you issue a read() you have to find space in the filesystem cache to hold the data. Some other data has to be evicted. If that data doesn't compress as well as it did previously it could take more space and cause the disk to become full.

This also implies that fsync() could generate that error...

> Back to the point of how to handle disk full errors :
> - we could write a file the size of shared_buffers at startup
> - if a write() reports disk full, delete the file above
> - we now have enough space to flush all of shared_buffers
> - flush and exit gracefully

Unfortunately that doesn't really help. That only addresses the issue for a single backend (or as many as are actually running when the error starts). The next connection could read in new data and expand that, and now you have no slop space.

Put another way, we don't want to exit at all, gracefully or not. We want to throw an error, abort the transaction (or subtransaction) and keep going.

--
greg
http://mit.edu/~gsstark/resume.pdf
On Fri, Aug 07, 2009 at 10:33:33AM +0100, Greg Stark wrote:
> 2009/8/7 Pierre Frédéric Caillaud <lists@peufeu.com>:
> > Also, about compressed NTFS : it can give you disk-full errors on read().
>
> I suspect it's unavoidable, for similar reasons to the problems
> Postgres faces. When you issue a read() you have to find space in the
> filesystem cache to hold the data. Some other data has to be evicted.
> If that data doesn't compress as well as it did previously it could
> take more space and cause the disk to become full.
>
> This also implies that fsync() could generate that error...

If that's indeed how it works it seems like a broken file system that needs to get its block accounting in order. When you choose a compression algorithm you know how much space a worst-case compression will take (i.e. lzo takes up to 8% more for a 4kB block size). This space should be reserved in case of situations like the above, and the filesystem shouldn't over-commit on this.

Never had to think about this before though so I'm probably missing something obvious.

--
Sam
http://samason.me.uk/
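For what it's worth, the worst-case output bound usually quoted for LZO's lzo1x variant is input + input/16 + 64 + 3 bytes, which for a 4kB block is presumably where the "up to 8%" figure above comes from :

#include <stdio.h>

int main(void)
{
    size_t in = 4096;
    size_t worst = in + in / 16 + 64 + 3;   /* commonly quoted lzo1x bound */

    /* prints : worst case 4419 bytes, 7.9% overhead */
    printf("worst case %zu bytes, %.1f%% overhead\n",
           worst, 100.0 * (double) (worst - in) / (double) in);
    return 0;
}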
On Fri, Aug 7, 2009 at 11:29 AM, Sam Mason <sam@samason.me.uk> wrote:
> When you choose a compression algorithm you know how much space a
> worst-case compression will take (i.e. lzo takes up to 8% more for a
> 4kB block size). This space should be reserved in case of situations
> like the above, and the filesystem shouldn't over-commit on this.
>
> Never had to think about this before though so I'm probably missing
> something obvious.

Well most users want compression for the space savings. So running out of space sooner than without compression, when most of the space is actually unused, would disappoint them.

Also, I'm puzzled why the space increase would be proportional to the amount of data and be more than 300 bytes. There's no reason it wouldn't be a small fixed amount. The ideal is you set aside one bit -- if the bit is set the rest is compressed and has to save at least one bit. If the bit is not set then the rest is uncompressed. Maximum bloat is 1 bit. In real systems it's more likely to be a byte or a word.

--
greg
http://mit.edu/~gsstark/resume.pdf
> Also, I'm puzzled why the space increase would be proportional
> to the amount of data and be more than 300 bytes. There's no reason it
> wouldn't be a small fixed amount. The ideal is you set aside one bit
> -- if the bit is set the rest is compressed and has to save at least
> one bit. If the bit is not set then the rest is uncompressed. Maximum
> bloat is 1 bit. In real systems it's more likely to be a byte or a
> word.

I'm working on cleaning the patch... I added a field in PageHeader which contains :
- 0 to indicate a non-compressed page
- the length of the compressed data if compressed

If compression gains nothing (ie gains less than 4K), the page is stored raw.

It seems that only pages having a PageHeader are handled by md.c, so it should work (am I mistaken ?)
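A self-contained sketch of that decision (store the page raw when compression gains less than 4K) ; the function names, the separate out-parameter standing in for the new header field, and the stand-in compressor are assumptions for illustration, not what the patch actually does inside PageHeader :

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 32768

/* stand-in so the sketch compiles on its own ; the patch uses lzo here */
static size_t compress_block(const unsigned char *src, size_t len, unsigned char *dst)
{
    memcpy(dst, src, len);
    return len;                         /* "compresses" to the same size */
}

/*
 * Fill 'out' with what goes to disk and set *compressed_len to the value
 * stored in the page header : 0 means "stored raw", otherwise the length
 * of the compressed data. Returns the number of bytes to write.
 */
size_t prepare_page(const unsigned char *page, unsigned char *out, uint16_t *compressed_len)
{
    unsigned char tmp[BLCKSZ + BLCKSZ / 16 + 64 + 3];   /* room for worst-case growth */
    size_t clen = compress_block(page, BLCKSZ, tmp);

    if (clen > BLCKSZ - 4096)
    {
        /* gains less than 4K : keep the page raw */
        *compressed_len = 0;
        memcpy(out, page, BLCKSZ);
        return BLCKSZ;
    }
    *compressed_len = (uint16_t) clen;
    memcpy(out, tmp, clen);
    return clen;
}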
On Fri, Aug 07, 2009 at 11:49:46AM +0100, Greg Stark wrote:
> On Fri, Aug 7, 2009 at 11:29 AM, Sam Mason<sam@samason.me.uk> wrote:
> > When you choose a compression algorithm you know how much space a
> > worst-case compression will take (i.e. lzo takes up to 8% more for a
> > 4kB block size). This space should be reserved in case of situations
> > like the above, and the filesystem shouldn't over-commit on this.
> >
> > Never had to think about this before though so I'm probably missing
> > something obvious.
>
> Well most users want compression for the space savings. So running out
> of space sooner than without compression, when most of the space is
> actually unused, would disappoint them.

Note that, as far as I can tell, for a filesystem you only need to keep enough reserved for the amount of uncompressed dirty buffers you have in memory. As space runs out in the filesystem, all that happens is that the amount of (uncompressed?) dirty buffers you can safely have around decreases.

In practical terms, this says that performance drops off when there is less free space than the size of the filesystem's cache, and I think you have to reserve exactly one block to handle the base case. But there are so many problems associated with completely filling a filesystem that I'm not sure if this would really matter.

> Also, I'm puzzled why the space increase would be proportional
> to the amount of data and be more than 300 bytes. There's no reason it
> wouldn't be a small fixed amount. The ideal is you set aside one bit
> -- if the bit is set the rest is compressed and has to save at least
> one bit. If the bit is not set then the rest is uncompressed. Maximum
> bloat is 1 bit. In real systems it's more likely to be a byte or a
> word.

It'll depend on the compression algorithm; lz algorithms are dictionary based, so you'd have a single entry for the incompressible data and then a pointer to the entry.

In PG's case, it would seem possible to do the compression and then check to see if the resulting size is greater than 4kB. If it is, you write into the 4kB page size and write uncompressed data. Upon reading you do the inverse: if it's 4kB then there's no need to decompress. I believe TOAST does this already.

--
Sam
http://samason.me.uk/
On Fri, Aug 7, 2009 at 12:48 PM, Sam Mason <sam@samason.me.uk> wrote:
> > Well most users want compression for the space savings. So running out
> > of space sooner than without compression, when most of the space is
> > actually unused, would disappoint them.
>
> Note that, as far as I can tell, for a filesystem you only need to keep
> enough reserved for the amount of uncompressed dirty buffers you have in
> memory. As space runs out in the filesystem, all that happens is that
> the amount of (uncompressed?) dirty buffers you can safely have around
> decreases.

And when it drops to zero?

> In PG's case, it would seem possible to do the compression and then
> check to see if the resulting size is greater than 4kB. If it is, you
> write into the 4kB page size and write uncompressed data. Upon reading
> you do the inverse: if it's 4kB then there's no need to decompress. I
> believe TOAST does this already.

It does, as does gzip and afaik every compression system.

--
greg
http://mit.edu/~gsstark/resume.pdf
For reference, what I'm picturing is this:

When a table is compressed it's marked read-only, which bars any new tuples from being inserted or existing tuples being deleted. Then it's frozen, and any pages which contain tuples which can't be frozen are waited on until they can be. When it's finished every tuple has to be guaranteed to be fully frozen.

Then the relation is rewritten in compressed form. Each block is compressed one by one and written one after the other to disk.

At the same time a new fork is written which contains a pointer to each block. It could just be a directly addressed array of offsets and lengths. All block lookups have to first load the page of the indirection map, then read the appropriate section of the original file and decompress it into shared buffers.

From a programming point of view this is nice and simple. From a user's point of view it's a bit of a pain since it means you have to rewrite your whole table when you want to compress it. And it means you have to rewrite it all again if you decide you want to set it back to read-write. My experience with people who have very large tables is that they design their whole process around the goal of avoiding having to move the data once it's written.
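A rough sketch of that indirection-map lookup, with invented types and plain file descriptors standing in for the two forks (this is not how PostgreSQL actually accesses forks) :

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct BlockMapEntry
{
    uint64_t offset;    /* byte offset of the compressed block in the data file */
    uint32_t length;    /* compressed length in bytes */
} BlockMapEntry;

/*
 * Fetch logical block 'blkno' of a compressed, read-only relation :
 * map_fd is the indirection fork (a directly addressed array of entries),
 * data_fd holds the compressed blocks written back to back.
 */
int read_block(int map_fd, int data_fd, uint32_t blkno,
               unsigned char *compressed, uint32_t *compressed_len)
{
    BlockMapEntry e;

    if (pread(map_fd, &e, sizeof(e), (off_t) (blkno * sizeof(e))) != (ssize_t) sizeof(e))
        return -1;
    if (pread(data_fd, compressed, e.length, (off_t) e.offset) != (ssize_t) e.length)
        return -1;
    *compressed_len = e.length;
    return 0;           /* the caller decompresses into shared buffers */
}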
On Fri, Aug 07, 2009 at 12:59:57PM +0100, Greg Stark wrote:
> On Fri, Aug 7, 2009 at 12:48 PM, Sam Mason<sam@samason.me.uk> wrote:
> >> Well most users want compression for the space savings. So running out
> >> of space sooner than without compression, when most of the space is
> >> actually unused, would disappoint them.
> >
> > Note that, as far as I can tell, for a filesystem you only need to keep
> > enough reserved for the amount of uncompressed dirty buffers you have in
> > memory. As space runs out in the filesystem, all that happens is that
> > the amount of (uncompressed?) dirty buffers you can safely have around
> > decreases.
>
> And when it drops to zero?

That was why I said you need to have one page left "to handle the base case". I was treating the inductive case as the interesting common case and considered the base case of lesser interest.

> > In PG's case, it would seem possible to do the compression and then
> > check to see if the resulting size is greater than 4kB. If it is, you
> > write into the 4kB page size and write uncompressed data. Upon reading
> > you do the inverse: if it's 4kB then there's no need to decompress. I
> > believe TOAST does this already.
>
> It does, as does gzip and afaik every compression system.

It's still a case that needs to be handled explicitly by the code. Just for reference, gzip does not appear to do this when I test it:

  echo -n 'a' | gzip > tmp.gz
  gzip -l --verbose tmp.gz

says the compression ratio is "-200%" (an empty string results in an infinite increase in size yet gets displayed as "0%" for some strange reason). It's only when you hit six 'a's that you start to get positive ratios. Note that this is taking headers into account; the compressed size is 23 bytes for both 'aaa' and 'aaaaaa' but the uncompressed size obviously changes. gzip does indeed have a "copy" method, but it doesn't seem to be being used.

--
Sam
http://samason.me.uk/
On Fri, Aug 7, 2009 at 8:18 AM, Greg Stark <gsstark@mit.edu> wrote:
> For reference, what I'm picturing is this:
>
> When a table is compressed it's marked read-only, which bars any new
> tuples from being inserted or existing tuples being deleted. Then it's
> frozen, and any pages which contain tuples which can't be frozen are
> waited on until they can be. When it's finished every tuple has to be
> guaranteed to be fully frozen.
>
> Then the relation is rewritten in compressed form. Each block is
> compressed one by one and written one after the other to disk.
>
> At the same time a new fork is written which contains a pointer to
> each block. It could just be a directly addressed array of offsets and
> lengths. All block lookups have to first load the page of the
> indirection map, then read the appropriate section of the original
> file and decompress it into shared buffers.
>
> From a programming point of view this is nice and simple. From a
> user's point of view it's a bit of a pain since it means you have to
> rewrite your whole table when you want to compress it. And it means
> you have to rewrite it all again if you decide you want to set it back
> to read-write. My experience with people who have very large tables is
> that they design their whole process around the goal of avoiding
> having to move the data once it's written.

If you add an indirection table, it's not strictly necessary for the table to be read-only, though if you want to make it read-write you'd need to think about how to defragment.

...Robert
> For reference, what I'm picturing is this:
>
> When a table is compressed it's marked read-only, which bars any new
> tuples from being inserted or existing tuples being deleted. Then it's
> frozen, and any pages which contain tuples which can't be frozen are
> waited on until they can be. When it's finished every tuple has to be
> guaranteed to be fully frozen.
>
> Then the relation is rewritten in compressed form. Each block is
> compressed one by one and written one after the other to disk.
>
> At the same time a new fork is written which contains a pointer to
> each block. It could just be a directly addressed array of offsets and
> lengths. All block lookups have to first load the page of the
> indirection map, then read the appropriate section of the original
> file and decompress it into shared buffers.

I had pondered the idea of a fork storing the compressed status of each page, because it has advantages :
- no need to change the page layout to insert a "is compressed" flag
- possible to compress any data, not just standard pages
- if you know the compressed size of a page in advance, it is much easier to prefetch it entirely and not just the first chunk, or read too much...

> From a programming point of view this is nice and simple. From a
> user's point of view it's a bit of a pain since it means you have to
> rewrite your whole table when you want to compress it. And it means
> you have to rewrite it all again if you decide you want to set it back
> to read-write. My experience with people who have very large tables is
> that they design their whole process around the goal of avoiding
> having to move the data once it's written.

Note that if a table is huge, it is always cut in (currently) 1GB slices, so you could operate on one slice at a time, then release a lock, let the backlog of queries flow, and resume. Realtime compression would be much less of a hassle to use, though...
Not strictly related to compression, but I've noticed something really strange... pg 8.4 (vanilla) is doing it, and my compressed version is doing it too. The tablespace is a RAID5 of 3 drives, the xlog is on a RAID1 of 2 drives, but it does it too if I put the tablespace and data on the same volume.

The traditional test table :

BEGIN;
CREATE TABLE dwb (
    bid SERIAL,
    aid INTEGER NOT NULL,
    ts TIMESTAMP NOT NULL,
    i1 INTEGER NOT NULL,
    i2 INTEGER NOT NULL,
    i3 INTEGER NOT NULL,
    i4 INTEGER NOT NULL
) WITHOUT OIDS;

The traditional test data :

INSERT INTO dwb (ts,aid,i1,i2,i3,i4)
SELECT now() + '1 sec'::INTERVAL, (1+random()*99998), random()*10000000,
       n+random()*10000, n+random()*1000, n
FROM generate_series( 1, 60000000 ) AS n;

vmstat output : it starts out relatively fast :

si so bi  bo    in   cs    us sy id wa
 0  0  0  43680 2796 19162 42 18 37  3
 0  0  0  45515 3165 20652 44 17 35  4
 0  0  0  43130 3046 21991 43 17 38  2

then here it starts to slow down : check "bo" output

 0  0 181 24439  577  3541 31  6 40 23
 0  0 176 17258  292  1324 31  4 43 22
 0  0   0 18626  162   693 35  3 49 12
 0  0   1 21554  235  1362 31  5 50 14
 0  0   0 19177  324  2053 35  4 50 12
 0  0   0 19208  206  1155 36  4 48 12
 0  0   1 20740  215  1117 33  4 50 13
 0  0   0 20154  258  1100 32  4 50 14
 0  0   0 20355  316  2056 34  5 49 12

... and it stays like this until the end of the INSERT... It's not writing any xlog since the table was created after the BEGIN... I'm trying to benchmark insert speed, but no luck. This volume does about 100 MB/s sustained write speed, so ?......
On Fri, Aug 07, 2009 at 03:29:44PM +0200, Pierre Frédéric Caillaud wrote:
> vmstat output :

Sorry, I don't know enough of PG's internals to suggest anything here, but iostat may give you more details as to what's going on.

--
Sam
http://samason.me.uk/
Pierre Frédéric Caillaud <lists@peufeu.com> wrote:

> The tablespace is a RAID5 of 3 drives, the xlog is on a RAID1 of 2 drives,
> but it does it too if I put the tablespace and data on the same volume.

> it starts out relatively fast :
>
> si so bi  bo    in   cs    us sy id wa
>  0  0  0  43680 2796 19162 42 18 37  3
>  0  0  0  45515 3165 20652 44 17 35  4
>  0  0  0  43130 3046 21991 43 17 38  2
>
> then here it starts to slow down : check "bo" output
>
>  0  0 181 24439  577  3541 31  6 40 23
>  0  0 176 17258  292  1324 31  4 43 22
>  0  0   0 18626  162   693 35  3 49 12
>  0  0   1 21554  235  1362 31  5 50 14
>  0  0   0 19177  324  2053 35  4 50 12
>  0  0   0 19208  206  1155 36  4 48 12
>  0  0   1 20740  215  1117 33  4 50 13
>  0  0   0 20154  258  1100 32  4 50 14
>  0  0   0 20355  316  2056 34  5 49 12
>
> ... and it stays like this until the end of the INSERT...

I don't know if this is it, but we tend to see outrageously high performance at the start of a benchmark because of the battery-backed cache in the RAID controller. Every write comes back immediately after copying the data to RAM. After a while the cache gets filled and you settle down to a steady state. If it's not BBU with write-back enabled, perhaps you have drives that lie about write completion?

-Kevin
On Fri, 07 Aug 2009 15:42:35 +0200, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:

> Pierre Frédéric Caillaud <lists@peufeu.com> wrote:
>
>> The tablespace is a RAID5 of 3 drives, the xlog is on a RAID1 of 2 drives,
>> but it does it too if I put the tablespace and data on the same volume.
>
>> it starts out relatively fast :
>>
>> si so bi  bo    in   cs    us sy id wa
>>  0  0  0  43680 2796 19162 42 18 37  3
>>  0  0  0  45515 3165 20652 44 17 35  4
>>  0  0  0  43130 3046 21991 43 17 38  2
>>
>> then here it starts to slow down : check "bo" output
>>
>>  0  0 181 24439  577  3541 31  6 40 23
>>  0  0 176 17258  292  1324 31  4 43 22
>>  0  0   0 18626  162   693 35  3 49 12
>>  0  0   1 21554  235  1362 31  5 50 14
>>  0  0   0 19177  324  2053 35  4 50 12
>>  0  0   0 19208  206  1155 36  4 48 12
>>  0  0   1 20740  215  1117 33  4 50 13
>>  0  0   0 20154  258  1100 32  4 50 14
>>  0  0   0 20355  316  2056 34  5 49 12
>>
>> ... and it stays like this until the end of the INSERT...
>
> I don't know if this is it, but we tend to see outrageously high
> performance at the start of a benchmark because of the battery-backed
> cache in the RAID controller. Every write comes back immediately
> after copying the data to RAM. After a while the cache gets filled
> and you settle down to a steady state. If it's not BBU with
> write-back enabled, perhaps you have drives that lie about write
> completion?
>
> -Kevin

I'm answering my own question : at the beginning of the run, postgres creates a 800MB temporary file, then it fills the table, then deletes the temp file. Is this because I use generate_series to fill the test table ?
On Fri, Aug 07, 2009 at 04:17:18PM +0200, Pierre Frédéric Caillaud wrote:
> I'm answering my own question : at the beginning of the run, postgres
> creates a 800MB temporary file, then it fills the table, then deletes the
> temp file.
> Is this because I use generate_series to fill the test table ?

Doh, yes. A function's result is written to a temp location first and then read back again once the function returns success. You'll have more luck if you do:

SELECT now() + '1 sec'::INTERVAL, (1+random()*99998), random()*10000000,
       n+random()*10000, n+random()*1000, n
FROM ( SELECT generate_series( 1, 60000000 )) x(n);

--
Sam
http://samason.me.uk/
Pierre,

> I added a field in PageHeader which contains :
> - 0 to indicate a non-compressed page
> - the length of the compressed data if compressed
>
> If compression gains nothing (ie gains less than 4K), the page is
> stored raw.
>
> It seems that only pages having a PageHeader are handled by md.c, so
> it should work (am I mistaken ?)

Well, there's the issue of upgradability; this would require us to have an incompatible upgrade of on-disk formats. So we don't want to go further down this route unless we're sure it's worthwhile.

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com
Well, here is the patch. I've included a README, which I paste here. If someone wants to play with it (after the CommitFest...) feel free to do so. While it was an interesting thing to try, I don't think it has enough potential to justify more effort...

* How to test

- apply the patch
- copy minilzo.c and minilzo.h to src/backend/storage/smgr
- configure & make
- enjoy

* How it works

- pg block size set to 32K
- an extra field is added in the header telling the compressed length
  THIS IS BAD, this information should be stored in a separate fork of the relation, because :
  - it would then be backwards compatible
  - the number of bytes to read from a compressed page would be known in advance
- the table file is sparse
- the page header is not compressed
- pages are written at their normal positions, but only the compressed bytes are written
- if compression gains nothing, the un-compressed page is stored
- the filesystem doesn't store the un-written blocks

* Benefits

- Sparse file holes are not cached, so OS disk cache efficiency is at least x2
- Random access is faster, having a better probability of hitting cache (sometimes a bit faster, sometimes it's spectacular)
- Yes, it does save space (> 50%)

* Problems

- Biggest problem : any write to a table that writes data which compresses less than whatever was there before can fail on a disk full error.
- ext3 sparse file handling isn't as fast as I wish it would be : on seq scans, even if it reads 2x less data, and decompresses very fast, it's still slower...
- many seq scans (especially with aggregates) are CPU bound anyway
- therefore, some kind of background-reader-decompressor would be needed
- pre-allocation has to be done to avoid extreme fragmentation of the file, which kind of defeats the purpose
- it still causes fragmentation

* Conclusion (for now)

It was a nice thing to try, but I believe it would be better if this was implemented directly in the filesystem, on the condition that it should be implemented well (ie not like NTFS compression).
On Tuesday 11 August 2009 13:05:39 Pierre Frédéric Caillaud wrote:
> Well, here is the patch. I've included a README, which I paste here.
> If someone wants to play with it (after the CommitFest...) feel free to
> do so.
> While it was an interesting thing to try, I don't think it has enough
> potential to justify more effort...
>
> * How to test
>
> - apply the patch
> - copy minilzo.c and minilzo.h to src/backend/storage/smgr

For future reference, and since this keeps appearing every few months: the license of LZO is not acceptable for inclusion or use with PostgreSQL. You need to find a different library if you want to pursue this further.
> For future reference, and since this keeps appearing every few months:
> the license of LZO is not acceptable for inclusion or use with PostgreSQL.
> You need to find a different library if you want to pursue this further.

Yes, I know about the license... I used LZO for tests, but since my little experiment with compression didn't give any really motivating results, I won't pursue this further.