RE: TOAST != BLOB (was Re: Unanswered questions about P ostgre) - Mailing list pgsql-general
From | Franck Martin |
---|---|
Subject | RE: TOAST != BLOB (was Re: Unanswered questions about P ostgre) |
Date | |
Msg-id | F12ECEA0435AD211B5280008C7ACBC857FF0ED@BIGIRON Whole thread Raw |
List | pgsql-general |
Sorry for this question, but I always thought that the type bytea authorised to put anything you want inside including NULL Characters. The problem may be to cast or copy the data into a useful structure. Am I wrong here? Franck Martin Database Development Officer SOPAC South Pacific Applied Geoscience Commission Fiji E-mail: franck@sopac.org Web site: http://www.sopac.org/ This e-mail is intended for its recipients only. Do not forward this e-mail without approval. The views expressed in this e-mail may not be necessarily the views of SOPAC. -----Original Message----- From: Tim Kientzle [mailto:kientzle@acm.org] Sent: Wednesday, 13 December 2000 1:15 To: PostgreSQL general mailing list Subject: [GENERAL] TOAST != BLOB (was Re: Unanswered questions about Postgre) Fabrizio Ermini writes: > ... when TOAST will be available ... 90% of the > work it is already done to support also > column-style BLOBs ... Any web programmer worth > its salt could put up a simple layer that > does base64 encode/decode ... Actually, if NULL characters are the only problem, then there are several more compact encodings that could be used. My favorite is a technique that gaurantees worst-case 1-in-254 expansion of the input data. I can't find the link to the original paper right now, so I'll describe the technique as well as I can. The following technique encodes arbitrary binary data into an output stream that contains no NULL bytes. The data size is enlarged by no more than ceiling(input size / 254). To encode data, break the input stream into variable-length "packets" just after each null byte. A packet then consists of N bytes of non-null data followed by a null. Encode each packet as follows, depending on the number of non-null bytes preceding the null: 1 <= N <= 253: packet is stored as 1-byte "packet size" N followed by N bytes of non-null data. N > 253: Store a byte 254 followed by the first 254 bytes of non-null data. You've now reduced the packet size, so you can encode the remaining packet using these same rules. N == 0: Store a byte 255. Note that no expansion occurs at all unless there are more than 253 consecutive non-null bytes. In all other cases, you drop the trailing null byte of a packet and add a leading packet size, which preserves the data size exactly. Decoding is very simple, just read the "packet size" byte first: * if the packet size is 255, write a null to the output. * if the packet size is 254, copy 254 bytes to the output. * if the packet size is N < 254, copy N bytes to the output and write a null. This encoding is compact, easy to implement, fast, and eliminates all null bytes from the input. It should be very easy to implement this within PostgreSQL to permit arbitrary binary data to be stored into columns. (It might be useful to implement such encoding on all variable-length character column types, which would simplify storing Unicode, for example.) This technique can be modified to eliminate any single byte value (just use the encoder above as-is, then convert the unwanted value to null). It can also be easily extended to eliminate any M specific byte values (break the "packet size" codes into M ranges so that the packet size encodes the trailing value that was dropped), but the enlargement is then limited to something like ceiling( <input size> / (254/M) ), so it's only really a good idea if M is quite small. Fabrizio is right, TOAST is 90% of the work to get full BLOB support; the remaining pieces are to ensure that clients correctly encode nulls for wire transfer and either use something like the above to eliminate null bytes from the storage or modify PGs internals to use dimensioned strings rather than null-terminated strings. A more radical approach would build an encoding like the above into the wire protocol; but that might create ugly backwards-compatibility headaches. - Tim Kientzle
pgsql-general by date: