Home > mailing lists

RE: TOAST != BLOB (was Re: Unanswered questions about P ostgre) - Mailing list pgsql-general

From	Franck Martin
Subject	RE: TOAST != BLOB (was Re: Unanswered questions about P ostgre)
Date	December 13, 2000 00:26:33
Msg-id	F12ECEA0435AD211B5280008C7ACBC857FF0ED@BIGIRON Whole thread Raw
List	pgsql-general

Tree view

Sorry for this question, but I always thought that the type bytea authorised
to put anything you want inside including NULL Characters.

The problem may be to cast or copy the data into a useful structure.

Am I wrong here?

Franck Martin
Database Development Officer
SOPAC South Pacific Applied Geoscience Commission
Fiji
E-mail: franck@sopac.org
Web site: http://www.sopac.org/

This e-mail is intended for its recipients only. Do not forward this e-mail
without approval. The views expressed in this e-mail may not be necessarily
the views of SOPAC.

-----Original Message-----
From: Tim Kientzle [mailto:kientzle@acm.org]
Sent: Wednesday, 13 December 2000 1:15
To: PostgreSQL general mailing list
Subject: [GENERAL] TOAST != BLOB (was Re: Unanswered questions about
Postgre)

Fabrizio Ermini writes:
> ... when TOAST will be available ... 90% of the
> work it is already done to support also
> column-style BLOBs ... Any web programmer worth
> its salt could put up a simple layer that
> does base64 encode/decode ...

Actually, if NULL characters are the only problem, then
there are several more compact encodings that could
be used.  My favorite is a technique that gaurantees
worst-case 1-in-254 expansion of the input data.
I can't find the link to the original paper right
now, so I'll describe the technique as well as
I can.

The following technique encodes arbitrary binary
data into an output stream that contains no NULL
bytes.  The data size is enlarged by no more than
ceiling(input size / 254).

To encode data, break the input stream
into variable-length "packets" just after each
null byte.  A packet then consists of N bytes
of non-null data followed by a null.  Encode
each packet as follows, depending on the number
of non-null bytes preceding the null:

1 <= N <= 253: packet is stored as 1-byte
    "packet size" N followed by N bytes of
    non-null data.

N > 253: Store a byte 254 followed by the first 254
   bytes of non-null data.  You've now reduced the
   packet size, so you can encode the remaining packet
   using these same rules.

N == 0: Store a byte 255.

Note that no expansion occurs at all unless there
are more than 253 consecutive non-null bytes.
In all other cases, you drop the trailing null
byte of a packet and add a leading packet size,
which preserves the data size exactly.

Decoding is very simple, just read the
"packet size" byte first:

* if the packet size is 255, write a null to the output.
* if the packet size is 254, copy 254 bytes to the output.
* if the packet size is N < 254, copy N bytes to the
  output and write a null.

This encoding is compact, easy to implement, fast,
and eliminates all null bytes from the input.
It should be very easy to implement this within
PostgreSQL to permit arbitrary binary data to be
stored into columns.  (It might be useful to
implement such encoding on all variable-length
character column types, which would simplify
storing Unicode, for example.)

This technique can be modified to eliminate
any single byte value (just use the encoder above
as-is, then convert the unwanted value to null).
It can also be easily extended to eliminate any
M specific byte values (break the "packet size"
codes into M ranges so that the packet size
encodes the trailing value that was dropped),
but the enlargement is then limited to something
like  ceiling( <input size> / (254/M) ), so it's
only really a good idea if M is quite small.

Fabrizio is right, TOAST is 90% of the work to
get full BLOB support; the remaining pieces are
to ensure that clients correctly encode nulls for
wire transfer and either use something like the
above to eliminate null bytes from the storage
or modify PGs internals to use dimensioned strings
rather than null-terminated strings.

A more radical approach would build an encoding like
the above into the wire protocol; but that might
create ugly backwards-compatibility headaches.

                        - Tim Kientzle

pgsql-general by date:

From: "rob"
Date: 12 December 2000, 23:48:31
Subject: Re: Too much traffic

From: MEOW961021@aol.com
Date: 13 December 2000, 00:33:05
Subject: info please

RE: TOAST != BLOB (was Re: Unanswered questions about P ostgre) - Mailing list pgsql-general

Previous

Next