Thread: Compressing Fields?
I was wondering if anyone has had occasion to hook up a compression scheme (Huffman/gzip/zlib/whatever) as a PostgreSQL function. I've got a case where there is a need to store fairly large chunks of XML in a database. There seems little reason to parse it beforehand, as many seem wont to do. In fact, there will often be no need to look at it at all. Typically, it will be ignored, only to be looked at if a human specifically asks for it. Ergo, it would be a slick idea to cut a 1700 byte field down to 253 by [somehow compressing it] before sticking it into a "bytea" field. That will save on disk space, pack more records into pages, and generally lead to queries being a bit cheaper. insert into log_table (id, txn_date, metadata, xml) values(4271324, '2003-07-01', 'Useless Data', compress('<xml> <a> <b>This is worthless data </b> <c status="139"/> <d> <e>Foo </e> </d> <bar> And 8K of futile murmurings </bar> </a> </xml>'); 90% of the time, we need only: select id, txn_date, metadata from log_table; And the other 10%, we do select id, txn_date, metadata, uncompress(xml) from logtable where id in (871009, 873281, 8321947); It would surely be possible for the client software to do the compression, but it seems an interesting thought to do it on the server, thereby making it "language-neutral" such that I could write client software in Perl, Python, or even use Pierre Mai's binding to CMU/CL without having to worry about whether or not there's a binding of the compression algorithm to whatever client language I might imagine using. I don't see anything in contrib for this. pgcrypto obviously does something similar for cryptographic functions, but I don't see compression on the list. Presumably something could be constructed using zlib; if anyone has done this already, it would be nice to know of... -- output = reverse("moc.enworbbc" "@" "enworbbc") http://www.ntlug.org/~cbbrowne/advocacy.html "Bureaucracies interpret communication as damage and route around it" -- Jamie Zawinski
You are going to love the answer to this question --- it already does compression of any long fields when it is stored in the TOAST table. In fact, you have to turn off compression if you don't want it using ALTER TABLE ... SET STORAGE. --------------------------------------------------------------------------- Christopher Browne wrote: > I was wondering if anyone has had occasion to hook up a compression scheme > (Huffman/gzip/zlib/whatever) as a PostgreSQL function. > > I've got a case where there is a need to store fairly large chunks of XML in a > database. There seems little reason to parse it beforehand, as many seem wont > to do. In fact, there will often be no need to look at it at all. Typically, > it will be ignored, only to be looked at if a human specifically asks for it. > > Ergo, it would be a slick idea to cut a 1700 byte field down to 253 by > [somehow compressing it] before sticking it into a "bytea" field. That will > save on disk space, pack more records into pages, and generally lead to > queries being a bit cheaper. > > insert into log_table (id, txn_date, metadata, xml) values > (4271324, '2003-07-01', 'Useless Data', > compress('<xml> <a> <b> This is worthless data </b> <c status="139"/> <d> <e> > Foo </e> </d> <bar> And 8K of futile murmurings </bar> </a> </xml>'); > > 90% of the time, we need only: > > select id, txn_date, metadata from log_table; > > And the other 10%, we do > select id, txn_date, metadata, uncompress(xml) from logtable where id in > (871009, 873281, 8321947); > > It would surely be possible for the client software to do the compression, but > it seems an interesting thought to do it on the server, thereby making it > "language-neutral" such that I could write client software in Perl, Python, or > even use Pierre Mai's binding to CMU/CL without having to worry about whether > or not there's a binding of the compression algorithm to whatever client > language I might imagine using. > > I don't see anything in contrib for this. pgcrypto obviously does something > similar for cryptographic functions, but I don't see compression on the list. > > Presumably something could be constructed using zlib; if anyone has done this > already, it would be nice to know of... > -- > output = reverse("moc.enworbbc" "@" "enworbbc") > http://www.ntlug.org/~cbbrowne/advocacy.html > "Bureaucracies interpret communication as damage and route around it" > -- Jamie Zawinski > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faqs/FAQ.html > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian wrote: > You are going to love the answer to this question --- it already does > compression of any long fields when it is stored in the TOAST table. Well, sort of. The compression algorithm is extremely poor compared to anything Christopher mentioned. It was choosen because it's free (no patent or whatsoever) and lighning fast. It does a good job on padded input or stuff with a lot of redundancy (like tags in HTML). A typical HTML tree like the PostgreSQL documentation once reduces by 50%. That is not bad considering that each item has to be compressed separately and the average size isn't that big too (I think it was around 4K or so). Also know that the toaster kicks in as soon as a row exceeds a quarter of the block size, so 2K in a default installation. It will not waste any time on small rows, and it will stop efforts as soon as it reduced the rows size under said quarter block size. Jan > > In fact, you have to turn off compression if you don't want it using > ALTER TABLE ... SET STORAGE. > > --------------------------------------------------------------------------- > > Christopher Browne wrote: >> I was wondering if anyone has had occasion to hook up a compression scheme >> (Huffman/gzip/zlib/whatever) as a PostgreSQL function. >> >> I've got a case where there is a need to store fairly large chunks of XML in a >> database. There seems little reason to parse it beforehand, as many seem wont >> to do. In fact, there will often be no need to look at it at all. Typically, >> it will be ignored, only to be looked at if a human specifically asks for it. >> >> Ergo, it would be a slick idea to cut a 1700 byte field down to 253 by >> [somehow compressing it] before sticking it into a "bytea" field. That will >> save on disk space, pack more records into pages, and generally lead to >> queries being a bit cheaper. >> >> insert into log_table (id, txn_date, metadata, xml) values >> (4271324, '2003-07-01', 'Useless Data', >> compress('<xml> <a> <b> This is worthless data </b> <c status="139"/> <d> <e> >> Foo </e> </d> <bar> And 8K of futile murmurings </bar> </a> </xml>'); >> >> 90% of the time, we need only: >> >> select id, txn_date, metadata from log_table; >> >> And the other 10%, we do >> select id, txn_date, metadata, uncompress(xml) from logtable where id in >> (871009, 873281, 8321947); >> >> It would surely be possible for the client software to do the compression, but >> it seems an interesting thought to do it on the server, thereby making it >> "language-neutral" such that I could write client software in Perl, Python, or >> even use Pierre Mai's binding to CMU/CL without having to worry about whether >> or not there's a binding of the compression algorithm to whatever client >> language I might imagine using. >> >> I don't see anything in contrib for this. pgcrypto obviously does something >> similar for cryptographic functions, but I don't see compression on the list. >> >> Presumably something could be constructed using zlib; if anyone has done this >> already, it would be nice to know of... >> -- >> output = reverse("moc.enworbbc" "@" "enworbbc") >> http://www.ntlug.org/~cbbrowne/advocacy.html >> "Bureaucracies interpret communication as damage and route around it" >> -- Jamie Zawinski >> >> >> ---------------------------(end of broadcast)--------------------------- >> TIP 5: Have you checked our extensive FAQ? >> >> http://www.postgresql.org/docs/faqs/FAQ.html >> > -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #