Re: [HACKERS] compression in LO and other fields - Mailing list pgsql-hackers
From | wieck@debis.com (Jan Wieck)
---|---
Subject | Re: [HACKERS] compression in LO and other fields
Date |
Msg-id | m11mHDk-0003kLC@orion.SAPserv.Hamburg.dsh.de
In response to | Re: [HACKERS] compression in LO and other fields (Karel Zak - Zakkr <zakkr@zf.jcu.cz>)
Responses | Re: [HACKERS] compression in LO and other fields
List | pgsql-hackers
Karel Zak - Zakkr wrote:

> On Fri, 12 Nov 1999, Jan Wieck wrote:
>
> > I already made some tests with a type I called 'lztext'
> > locally. Only the input-/output-functions exist so far and
>
> Is it your original implementation, or do you use existing
> compression code? I tried bzip2, but the output from that algorithm
> is totally binary; I don't know how to use it in PgSQL when all the
> backend (in/out) routines use char* (yes, I'm a newbie at PgSQL
> hacking :-).

The internal storage format is based on an article I found at:

    http://www.neutralzone.org/home/faqsys/docs/slz_art.txt

    Simple Compression using an LZ buffer
    Part 3 Revision 1.d: An introduction to compression on the Amiga
    by Adisak Pochanayon

    Freely Distributable as long as reproduced completely.
    Copyright 1993 Adisak Pochanayon

I've written the code from scratch. The internal representation is
binary, for sure. It's a PostgreSQL variable length data format as
usual.

I don't know if there's a compression library available that fits our
needs. First and most important, it must have a license that permits
us to include it in the distribution under our existing license.
Second, its implementation must not cause any problems in the backend,
like memory leakage or the like.

> > The compression rates aren't that gigantic. I've got 30-50%
>
> Isn't it a problem that your implementation compresses all the data
> at once? Typically compression works on a stream and compresses only
> a small buffer in each cycle.

No, that's no problem. On type input, the original value is completely
in memory, given as a char*, and the internal representation is
returned as a palloc()'d Datum. For output it's vice versa.

O.K., some details on the compression rate. I've used 112 .html files
with a total size of 1188346 bytes this time. The smallest one was 131
bytes, the largest one 114549 bytes, and most of the files are
somewhere between 3-12K. Compression results on the binary level (the
rate is the space saved, e.g. 1 - 398180/1188346 = 66.5% for gzip -9):

    gzip -9   outputs  398180 bytes  (66.5% rate)
    gzip -1   outputs  447597 bytes  (62.3% rate)
    my code   outputs  529420 bytes  (55.4% rate)

HTML input might be somewhat optimal for Adisak's storage format, but
taking into account that my source implementing the type input and
output functions is smaller than 600 lines, I think an 11% difference
from gzip -9 is a good result anyway.

> > Sorry for the compression specific slang here. Well, anyone
> > interested in the code?
>
> Yes, me - I'm finishing the to_char()/to_data() Oracle-compatible
> routines (Thomas, you still quiet?) and this is a new appeal for
> me :-)

Bruce suggested the contrib area, but I'm not sure that's the right
place. If it goes into the distribution at all, I'd like to use this
data type for rule plan strings and function source text in the system
catalogs. I don't expect we'll have a general solution for tuples
split across multiple blocks in v7.0, and using lztext for rules and
function sources would lower some FRP's. But using it in the catalogs
requires it to be builtin.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #
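To make the in/out contract Jan describes concrete - a plain char* on
input, a palloc()'d variable-length Datum back, and the reverse on
output - here is a minimal sketch. The struct layout and the
lz_compress()/lz_decompress() helpers are assumptions made for
illustration; this is not Jan's actual lztext code.

```c
/*
 * Sketch of an lztext input/output pair, assuming a varlena-style
 * header and hypothetical compression helpers.
 */
#include "postgres.h"

/*
 * Hypothetical helpers, assumed to fall back to storing the bytes
 * uncompressed so that the output never exceeds the input size.
 */
extern int32 lz_compress(const char *src, int32 srclen, char *dst);
extern void  lz_decompress(const char *src, int32 rawsize, char *dst);

typedef struct lztext
{
	int32		varsize;	/* total size, varlena-style header */
	int32		rawsize;	/* size of the uncompressed text */
	char		data[1];	/* compressed bytes follow */
} lztext;

/* type input: C string -> palloc()'d internal representation */
lztext *
lztext_in(char *str)
{
	int32		rawsize = strlen(str);
	lztext	   *result = (lztext *) palloc(2 * sizeof(int32) + rawsize);
	int32		complen = lz_compress(str, rawsize, result->data);

	result->rawsize = rawsize;
	result->varsize = 2 * sizeof(int32) + complen;
	return result;
}

/* type output: internal representation -> palloc()'d C string */
char *
lztext_out(lztext *value)
{
	char	   *result = (char *) palloc(value->rawsize + 1);

	lz_decompress(value->data, value->rawsize, result);
	result[value->rawsize] = '\0';
	return result;
}
```

Because both directions work on a complete value held in memory, no
streaming interface is needed - which is the point Jan makes above in
reply to Karel's question.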
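The article Jan cites describes LZ-buffer compression, a family of
schemes in which control bytes mix literal bytes with back-references
into the output already produced. Below is a hedged sketch of a
decompressor for one such token layout (8 flag bits per control byte,
matches encoded as a 12-bit offset and 4-bit length); the exact layout
is an illustrative assumption, not the article's or Jan's format, and
bounds checks on malformed input are omitted for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decompress src (srclen bytes) into dst (rawsize bytes expected).
 * Each control byte carries 8 flags: bit clear = copy one literal
 * byte, bit set = copy a match from the history already written.
 */
static size_t
lz_decompress_sketch(const uint8_t *src, size_t srclen,
					 uint8_t *dst, size_t rawsize)
{
	size_t		si = 0, di = 0;

	while (si < srclen && di < rawsize)
	{
		uint8_t		ctrl = src[si++];
		int			bit;

		for (bit = 0; bit < 8 && si < srclen && di < rawsize; bit++)
		{
			if (ctrl & (1 << bit))
			{
				/* match: 4-bit length (3..18), 12-bit offset */
				unsigned	len = (src[si] >> 4) + 3;
				unsigned	off = ((src[si] & 0x0F) << 8) | src[si + 1];

				si += 2;
				/* byte-by-byte copy handles overlapping matches */
				while (len-- > 0 && di < rawsize)
				{
					dst[di] = dst[di - off];
					di++;
				}
			}
			else
				dst[di++] = src[si++];	/* literal byte */
		}
	}
	return di;
}
```

Compression is the mirror image: slide a window over the input,
search the recent history for the longest match, and emit a tagged
two-byte reference when a match of length 3 or more is found,
otherwise a literal. Rates like the 30-50% Jan quotes then depend on
how often such matches occur in the data.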