Thread: Binary data representations for new protocol

From: Tom Lane
Here are some concrete suggestions for the on-the-wire representation
of binary data (this is format code 1 in the new 3.0 protocol):

integer datatypes: the integer value, in network byte order, width
depending on data type.
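
As a rough illustration of the integer rule, here is a standalone sketch
that writes and reads fixed-width integers most significant byte first.
The names put_int_be and get_int_be are made up for this example and are
not part of the backend; shifting rather than swapping keeps the code
independent of the host's byte order.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: store the low `width` bytes of `value`
 * (width = 2, 4, or 8) in network byte order, i.e. most significant
 * byte first, regardless of the host's endianness.
 */
static void
put_int_be(unsigned char *out, uint64_t value, size_t width)
{
    for (size_t i = 0; i < width; i++)
        out[i] = (unsigned char) (value >> (8 * (width - 1 - i)));
}

/* Hypothetical inverse: read a `width`-byte network-order integer. */
static uint64_t
get_int_be(const unsigned char *in, size_t width)
{
    uint64_t value = 0;

    for (size_t i = 0; i < width; i++)
        value = (value << 8) | in[i];
    return value;
}
```

A signed int4 would be passed through an unsigned cast on the way in and
cast back to int32_t on the way out.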

float datatypes: the server's internal representation, but with byte
ordering swapped if on a little-endian machine.  This will give us a
common representation across all IEEE-float machines, which is just
about everyone, and will make life no worse for those with weird float
hardware.
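
A minimal sketch of the float rule, assuming the host uses IEEE 754
doubles and that floats and integers share the same byte order (true on
common hardware, but an assumption all the same). The function names are
hypothetical. Copying the bits into an integer and shifting has the same
effect as "swap if little-endian" without having to test the host's
endianness.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: emit a double as 8 bytes, most significant first. */
static void
put_float8_be(unsigned char *out, double value)
{
    uint64_t bits;

    /* memcpy reinterprets the bits without violating aliasing rules */
    memcpy(&bits, &value, sizeof bits);
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char) (bits >> (8 * (7 - i)));
}

/* Hypothetical inverse: rebuild a double from 8 network-order bytes. */
static double
get_float8_be(const unsigned char *in)
{
    uint64_t bits = 0;
    double   value;

    for (int i = 0; i < 8; i++)
        bits = (bits << 8) | in[i];
    memcpy(&value, &bits, sizeof value);
    return value;
}
```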

text, varchar, etc: I'm not entirely sure whether the contents should be
converted to client encoding or left in server encoding.  The former seems
to make the most sense, but it bothers me that COPY BINARY output sent to
the client might be different from a COPY BINARY file written directly by
the server (which presumably would stay in server encoding).  Comments?

bytea, bit arrays, etc: just the bytes, ma'am.

numeric: pretty much the current internal representation, which boils
down to an array of int2 values.

arrays: the array header info (dimensionality, element datatype OID,
flags) with each field in network byte order, then the element values,
each one as a byte count and data bytes just as though it were a separate
field converted according to the current format code.  Note that this
representation will support a future implementation of null elements in
arrays: put -1 for a byte count, just as in the FE/BE protocol.
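
To make the per-element framing concrete, here is a small sketch of
writing one array element as a byte count followed by its data bytes,
with -1 marking a null element. It assumes the count is a 4-byte integer
in network byte order, as field lengths are in the FE/BE protocol; the
array header is left out, and put_element is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: append one array element to `out` and return the
 * number of bytes written.  data == NULL stands for a null element,
 * which is encoded as a count of -1 with no data bytes following.
 */
static size_t
put_element(unsigned char *out, const unsigned char *data, int32_t len)
{
    uint32_t count = (uint32_t) (data ? len : -1);

    /* 4-byte count, most significant byte first */
    out[0] = (unsigned char) (count >> 24);
    out[1] = (unsigned char) (count >> 16);
    out[2] = (unsigned char) (count >> 8);
    out[3] = (unsigned char) count;

    if (data == NULL)
        return 4;

    memcpy(out + 4, data, (size_t) len);
    return 4 + (size_t) len;
}
```

The data bytes themselves would be whatever the element type's own send
routine produces under the current format code.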

Other datatypes such as point, line, etc can be handled as combinations
of the above cases, and don't offer any great interest AFAICS.

I intend that these representations will be used even when talking to
old-protocol clients; we won't accept raw unchecked internal
representations from anyone.  This will create some compatibility
issues, but if we don't do that then we still have what's arguably
a security hole.


Server-side implementation:

We will restore the typsend and typreceive columns of pg_type.  A type is
allowed not to define these (have zeroes in these columns), in which case
it can't participate in binary I/O.  If typsend is supplied, it must have
the signature "typsend(mytype) returns bytea".  The returned bytea object
contains the bytes to be sent.  If typreceive is supplied, it must have
the signature "typreceive(internal) returns mytype".  The supplied
argument will be a StringInfo data structure initialized to hold the
received bytes.  (The motivation for using StringInfo rather than bytea is
to avoid unnecessary data copying.)

The actual implementations of these routines will make use of the existing
conversion subroutines in src/backend/libpq/pqformat.c, which convert data
into or out of a StringInfo buffer.  typsend routines will need just a
couple of new support routines to handle setting up a buffer and packaging
its finished contents as a bytea result.  I envision the coding of,
for example, int4send as
    StringInfoData buf;

    pq_begintypsend(&buf);
    pq_sendint(&buf, value, 4);
    PG_RETURN_BYTEA_P(pq_endtypsend(&buf));

typreceive routines can use the pqformat.c routines like pq_getmsgint
directly on the supplied StringInfo buffer.  We'll need to add new
pqformat.c routines for handling float data according to the above
byte-swapping rules, but just about all the other infrastructure is
there already.
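
For the receive side, the following is a standalone toy rather than
backend code: MsgBuf and msg_getint are hypothetical stand-ins for
StringInfo and pq_getmsgint, written only so the example compiles on its
own. It sketches the idea of reading from a cursor in the received bytes
and refusing input that is too short, instead of trusting a raw internal
representation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a StringInfo holding the received bytes. */
typedef struct
{
    const unsigned char *data;
    size_t      len;
    size_t      cursor;     /* next unread byte; always <= len */
} MsgBuf;

/*
 * Hypothetical stand-in for pq_getmsgint: read a `width`-byte (at most 4)
 * network-order integer.  Returns 0 on success, or -1 if fewer than
 * `width` bytes remain.
 */
static int
msg_getint(MsgBuf *msg, size_t width, uint32_t *result)
{
    uint32_t value = 0;

    if (width > 4 || msg->len - msg->cursor < width)
        return -1;
    for (size_t i = 0; i < width; i++)
        value = (value << 8) | msg->data[msg->cursor++];
    *result = value;
    return 0;
}

/* Toy counterpart of an int4 receive routine. */
static int
toy_int4recv(MsgBuf *msg, int32_t *value)
{
    uint32_t raw;

    if (msg_getint(msg, 4, &raw) != 0)
        return -1;
    *value = (int32_t) raw;
    return 0;
}
```

In the backend, a short read would presumably raise an error rather than
return a status code; the return value here just keeps the toy
self-contained.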

Comments?
        regards, tom lane