TOAST != BLOB (was Re: Unanswered questions about Postgre) - Mailing list pgsql-general
From | Tim Kientzle |
---|---|
Subject | TOAST != BLOB (was Re: Unanswered questions about Postgre) |
Date | |
Msg-id | 3A36CD85.B9BDB351@acm.org Whole thread Raw |
List | pgsql-general |
Fabrizio Ermini writes: > ... when TOAST will be available ... 90% of the > work it is already done to support also > column-style BLOBs ... Any web programmer worth > its salt could put up a simple layer that > does base64 encode/decode ... Actually, if NULL characters are the only problem, then there are several more compact encodings that could be used. My favorite is a technique that gaurantees worst-case 1-in-254 expansion of the input data. I can't find the link to the original paper right now, so I'll describe the technique as well as I can. The following technique encodes arbitrary binary data into an output stream that contains no NULL bytes. The data size is enlarged by no more than ceiling(input size / 254). To encode data, break the input stream into variable-length "packets" just after each null byte. A packet then consists of N bytes of non-null data followed by a null. Encode each packet as follows, depending on the number of non-null bytes preceding the null: 1 <= N <= 253: packet is stored as 1-byte "packet size" N followed by N bytes of non-null data. N > 253: Store a byte 254 followed by the first 254 bytes of non-null data. You've now reduced the packet size, so you can encode the remaining packet using these same rules. N == 0: Store a byte 255. Note that no expansion occurs at all unless there are more than 253 consecutive non-null bytes. In all other cases, you drop the trailing null byte of a packet and add a leading packet size, which preserves the data size exactly. Decoding is very simple, just read the "packet size" byte first: * if the packet size is 255, write a null to the output. * if the packet size is 254, copy 254 bytes to the output. * if the packet size is N < 254, copy N bytes to the output and write a null. This encoding is compact, easy to implement, fast, and eliminates all null bytes from the input. It should be very easy to implement this within PostgreSQL to permit arbitrary binary data to be stored into columns. (It might be useful to implement such encoding on all variable-length character column types, which would simplify storing Unicode, for example.) This technique can be modified to eliminate any single byte value (just use the encoder above as-is, then convert the unwanted value to null). It can also be easily extended to eliminate any M specific byte values (break the "packet size" codes into M ranges so that the packet size encodes the trailing value that was dropped), but the enlargement is then limited to something like ceiling( <input size> / (254/M) ), so it's only really a good idea if M is quite small. Fabrizio is right, TOAST is 90% of the work to get full BLOB support; the remaining pieces are to ensure that clients correctly encode nulls for wire transfer and either use something like the above to eliminate null bytes from the storage or modify PGs internals to use dimensioned strings rather than null-terminated strings. A more radical approach would build an encoding like the above into the wire protocol; but that might create ugly backwards-compatibility headaches. - Tim Kientzle
pgsql-general by date: