Re: Re: COPY BINARY file format proposal - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Re: COPY BINARY file format proposal
Date
Msg-id 13671.976217308@sss.pgh.pa.us
In response to Re: Re: COPY BINARY file format proposal  (Philip Warner <pjw@rhyme.com.au>)
Responses Re: Re: COPY BINARY file format proposal  (Philip Warner <pjw@rhyme.com.au>)
Re: Re: COPY BINARY file format proposal  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Re: COPY BINARY file format proposal  (ncm@zembu.com (Nathan Myers))
List pgsql-hackers
Philip Warner <pjw@rhyme.com.au> writes:
>> Just thinking that the only way an endianness flag inside the header
>> would be useful is if we pick a magic number that's a bytewise
>> palindrome.

> You could just read the 1st, 2nd, 3rd, etc bytes and require that they be
> 'P', 'G', 'C', 'P', 'Y' or some such. I *think* reading five bytes and
> doing a strcmp works...ie. don't rely on the integer value, use a string.

Oh.  We could use a string instead of an integer, I suppose, although
I'm not sure I see the point for what's basically a binary format.

Given all that, here is a proposed spec for the header:

First 8 bytes: signature, ASCII "PGBCOPY\0" --- note that the null is a
required part of the signature.  (This is to catch files that have been
munged by a non-8-bit-clean transfer.)

Next 4 bytes: integer layout field.  This consists of the int32 constant
0x0A820D0A expressed in the source machine's endianness.  (Again, value
chosen with malice aforethought, to catch files munged by things like
DOS/Unix newline conversion or high-bit-stripping.)  Potentially, a
reader could engage in byte-flipping of subsequent fields if the wrong
byte order is detected here.
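
To illustrate how a reader might use the layout field, here is a rough
Python sketch (not proposed backend code). The function name
detect_byte_order and the choice to return struct-module prefixes are
assumptions made for the example only. It tries both byte orders and
treats anything else as a damaged file.

```python
import struct

LAYOUT = 0x0A820D0A  # constant from the proposed spec


def detect_byte_order(field: bytes) -> str:
    """Return '<' (little-endian) or '>' (big-endian) for a layout field.

    Raises ValueError if the four bytes match neither byte order, which
    is what newline conversion or high-bit stripping would cause.
    """
    if len(field) != 4:
        raise ValueError("layout field must be exactly 4 bytes")
    for prefix in ("<", ">"):
        if struct.unpack(prefix + "I", field)[0] == LAYOUT:
            return prefix
    raise ValueError("layout field damaged or unrecognized")
```

The big-endian image of the constant is 0A 82 0D 0A and the
little-endian image is 0A 0D 82 0A.  Either way the field contains LF,
CR, and a byte with the high bit set, so a transfer that rewrites line
endings or strips the high bit changes the field and the check fails.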

Next 4 bytes: version number, currently 1 (expressed in source machine's
endianness, as are all subsequent integer fields).  A reader should
abort if it does not recognize the version number.

Next 4 bytes: length of remainder of header, not including self.  In
the initial version this will be zero, and the first tuple follows
immediately.  Future changes to the format might allow additional data
to be present in the header.  A reader should silently ignore any header
extension data it does not know what to do with.

This allows for both backwards-compatible header additions (extend the
header without changing the version number) and non-backwards-compatible
changes (bump the version number).
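
For concreteness, the header rules above could be sketched in Python
roughly as follows.  This is an illustration under stated assumptions,
not an implementation anyone has written: the names write_header and
read_header are made up, the integer fields are treated as unsigned
32-bit values, and the reader byte-flips when it sees the opposite
byte order (the spec only says a reader could do so).

```python
import io
import struct

SIGNATURE = b"PGBCOPY\0"   # 8 bytes, trailing null is part of the signature
LAYOUT = 0x0A820D0A        # integer layout constant from the spec
KNOWN_VERSIONS = {1}


def write_header(out, byte_order="<", version=1, extension=b""):
    """Write the header; byte_order is a struct prefix ('<' or '>')."""
    out.write(SIGNATURE)
    out.write(struct.pack(byte_order + "III", LAYOUT, version, len(extension)))
    out.write(extension)


def read_header(inp):
    """Validate the header, skip any extension, return (byte_order, version)."""
    if inp.read(8) != SIGNATURE:
        raise ValueError("bad signature")
    layout = inp.read(4)
    for byte_order in ("<", ">"):
        if layout == struct.pack(byte_order + "I", LAYOUT):
            break
    else:
        raise ValueError("layout field damaged or unrecognized")
    fixed = inp.read(8)
    if len(fixed) != 8:
        raise ValueError("truncated header")
    version, ext_len = struct.unpack(byte_order + "II", fixed)
    if version not in KNOWN_VERSIONS:
        raise ValueError("unrecognized version %d" % version)
    # Silently ignore header extension data we do not understand.
    if len(inp.read(ext_len)) != ext_len:
        raise ValueError("truncated header extension")
    return byte_order, version
```

After read_header returns, the stream is positioned at the first
tuple, whether or not the writer added extension data.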

Since we don't yet know what we might do about the issue of
floating-point format, I left that out of the spec.  It can be added to
the header extension area when and if we figure out how to do it.

Likewise, add-ons such as column names are punted until later.

Comments?
        regards, tom lane

