Thread: COPY BINARY file format proposal

COPY BINARY file format proposal

From
Tom Lane
Date:
Well, no one seemed very unhappy at the idea of changing the file format
for binary COPY, so here is a proposal.

The objectives of this change are:

1. Get rid of the tuple count at the front of the file.  This requires
an extra pass over the relation, which is a lot more trouble than the
count is worth.  Use an explicit EOF marker instead.
2. Send fields of a tuple individually, instead of dumping out raw tuples
(complete with alignment padding and so forth) as is currently done.
This is mainly to simplify TOAST-related processing.
3. Make the format somewhat self-identifying, so that the reader has at
least some chance of detecting it when the data doesn't match the table
it's supposed to be loaded into.

The proposed format consists of a file header, zero or more tuples, and a
file trailer.

The file header will just be a 32-bit magic number; it's present so that a
reader can reject non-COPY-binary input data, as well as detect problems
like incompatible endianness.  (We could also use changes in the magic
number as a flag for future format changes.)

Each tuple begins with an int16 count of the number of fields in the
tuple.  (Presently, all tuples in a table will have the same count, but
that might not always be true.)  Then, repeated for each field in the
tuple, there is an int16 typlen word possibly followed by field data.
The typlen field is interpreted thus:
    Zero    Field is NULL.  No data follows.
    > 0     Field is a fixed-length datatype.  Exactly N
            bytes of data follow the typlen word.
    -1      Field is a varlena datatype.  The next four
            bytes are the varlena header, which contains
            the total value length including itself.
    < -1    Reserved for future use.

For non-NULL fields, the reader can check that the typlen matches the
expected typlen for the destination column.  This provides a simple
but very useful check that the data is as expected.
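
A minimal sketch of how a reader might interpret the typlen word described above. The function names, the little-endian default, and the treatment of the varlena header as a plain int32 length are assumptions made for illustration, not part of the proposal.

```python
import io
import struct


def read_exact(stream, n):
    """Read exactly n bytes or raise, so truncated input is not silently accepted."""
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("unexpected end of input")
    return data


def read_field(stream, expected_typlen=None, byte_order="<"):
    """Return None for a NULL field, otherwise the field's raw data bytes."""
    (typlen,) = struct.unpack(byte_order + "h", read_exact(stream, 2))
    if typlen == 0:
        return None  # NULL: no data follows, so no typlen check is possible
    if expected_typlen is not None and typlen != expected_typlen:
        raise ValueError(f"typlen {typlen} does not match expected {expected_typlen}")
    if typlen > 0:
        return read_exact(stream, typlen)  # fixed-length: exactly N bytes follow
    if typlen == -1:
        # varlena: 4-byte header holding the total length, itself included
        # (assumption: the header is a plain int32 with no flag bits)
        (total,) = struct.unpack(byte_order + "i", read_exact(stream, 4))
        if total < 4:
            raise ValueError("varlena length is smaller than its own header")
        return read_exact(stream, total - 4)
    raise ValueError(f"typlen {typlen} is reserved for future use")
```

Passing the destination column's typlen as expected_typlen gives the consistency check mentioned above; NULL fields carry no typlen information, so they pass unchecked.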

There is no alignment padding or any other extra data between fields.
Note also that the format does not distinguish whether a datatype is
pass-by-reference or pass-by-value.  Both of these provisions are
deliberate: they might help improve portability of the files (although
of course endianness and floating-point-format issues can still keep
you from moving a binary file across machines).

The file trailer consists of an int16 word containing -1.  This is
easily distinguished from a tuple's field-count word.

A reader should report an error if a field-count word is neither -1
nor the expected number of columns.  This provides a pretty strong
check against somehow getting out of sync with the data.
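
The field-count check could look roughly like the following sketch. The function name and the little-endian default are hypothetical; the only behavior taken from the proposal is that -1 marks the trailer and any other unexpected count is an error.

```python
import io
import struct

TRAILER = -1  # int16 value that marks the end of the file


def next_field_count(stream, expected_columns, byte_order="<"):
    """Return the field count of the next tuple, or None at the file trailer."""
    raw = stream.read(2)
    if len(raw) != 2:
        raise ValueError("unexpected end of file: trailer not found")
    (count,) = struct.unpack(byte_order + "h", raw)
    if count == TRAILER:
        return None
    if count != expected_columns:
        # Neither the trailer nor the expected column count: the reader is
        # probably out of sync with the data.
        raise ValueError(f"field count {count}, expected {expected_columns}")
    return count
```

A reader loop would call this before each tuple and stop when it returns None.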

Comments?
        regards, tom lane


Re: COPY BINARY file format proposal

From
Tom Lane
Date:
Grumble, I forgot about COPY WITH OIDS.  Amend that proposal as follows:

... We should use two different
magic numbers depending on whether OIDs are included in the dump or not.

If OIDs are included in the dump, the OID field immediately follows the
field-count word.  It is a normal field except that it's not included
in the field-count.  In particular it has a typlen --- this will allow
handling of 4-byte vs 8-byte OIDs without too much pain, and will allow
OIDs to be shown as NULL if we someday allow OIDs to be optional.
        regards, tom lane
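
As a rough illustration of the amendment above, a writer could emit the OID as an ordinary field placed right after the field-count word without counting it. The helper names, the little-endian default, and the 4-byte OID width are assumptions for this sketch.

```python
import struct


def encode_field(typlen, data, byte_order="<"):
    """Encode one field as a typlen word plus data; data=None means NULL."""
    if data is None:
        return struct.pack(byte_order + "h", 0)
    if typlen == -1:
        # varlena: the length word counts its own four bytes
        return struct.pack(byte_order + "hi", -1, len(data) + 4) + data
    if typlen <= 0 or len(data) != typlen:
        raise ValueError("data does not match the fixed typlen")
    return struct.pack(byte_order + "h", typlen) + data


def encode_tuple(fields, oid=None, with_oids=False, byte_order="<"):
    """fields is a list of (typlen, data-or-None); oid=None encodes a NULL OID."""
    out = [struct.pack(byte_order + "h", len(fields))]  # count excludes the OID
    if with_oids:
        oid_bytes = None if oid is None else struct.pack(byte_order + "I", oid)
        out.append(encode_field(4, oid_bytes, byte_order))
    out.extend(encode_field(t, d, byte_order) for t, d in fields)
    return b"".join(out)
```

Because the OID carries its own typlen, switching to 8-byte OIDs or writing a NULL OID would not change the surrounding layout.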


Re: Re: COPY BINARY file format proposal

From
Tom Lane
Date:
Philip Warner <pjw@rhyme.com.au> writes:
> I'd prefer to see a single magic number for all binary COPY output, then a
> few bytes of header including a version number, and flags to indicate
> endianness, OIDs etc. It seems a lot cleaner than overloading the magic
> number.

OK, we can do it that way.  I'm still going to pick a magic number that
looks different depending on endianness, however ;-).

What might we need in the header besides a version indicator and a
has-OIDs flag?

> Also, IIRC part of the problem with text-based COPY is that we can't
> specify field order (I think this affects dumping the regression DB).
> Would it be possible to add the ability to (a) specify field order, and (b)
> dump a subset of fields?

This is not an issue for the file format, but for the COPY command itself.
And considering we're in beta now (or as soon as Marc gets the tarball
made, anyway) I'm going to call that a new feature and say it should
wait for 7.2.
        regards, tom lane


Re: Re: COPY BINARY file format proposal

From
Philip Warner
Date:
At 20:40 6/12/00 -0500, Tom Lane wrote:
>Philip Warner <pjw@rhyme.com.au> writes:
>> I'd prefer to see a single magic number for all binary COPY output, then a
>> few bytes of header including a version number, and flags to indicate
>> endianness, OIDs etc. It seems a lot cleaner than overloading the magic
>> number.
>
>OK, we can do it that way.  I'm still going to pick a magic number that
>looks different depending on endianness, however ;-).

What does the smiley mean in this context? I hope you're not serious...or
if you are, I'd be interested to know why.


>What might we need in the header besides a version indicator and a
>has-OIDs flag?

Just off the top of my head, some things that could be there in the future:

- floating point representation (for portability)

- flag for compressed or uncompressed toast fields (I assume you dump them
uncompressed?)

- version number may be important if we dump a subset of fields (ie. we'll
need to store the field names somewhere).

I really have no idea what might be there, but it seems prudent to do it
this way.


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: Re: COPY BINARY file format proposal

From
Tom Lane
Date:
Philip Warner <pjw@rhyme.com.au> writes:
>> OK, we can do it that way.  I'm still going to pick a magic number that
>> looks different depending on endianness, however ;-).

> What does the smiley mean in this context?

Just thinking that the only way an endianness flag inside the header
would be useful is if we pick a magic number that's a bytewise
palindrome.

> - floating point representation (for portability)

Specified how?  (For that matter, determined how?)

> - flag for compressed or uncompressed toast fields (I assume you dump them
> uncompressed?)

Yes, I want COPY to force 'em to uncompressed so as to avoid problems
with cross-version changes of compression algorithm.  (Right at the
moment it gets that wrong.)

> - version number may be important if we dump a subset of fields (ie. we'll
> need to store the field names somewhere).

No we don't.  ASCII COPY format doesn't store field names either ... at
least not as part of the data stream ... and should not IMHO.  Don't you
want to be able to reload into a table that you've changed the column
names of?
        regards, tom lane


Re: Re: COPY BINARY file format proposal

From
Philip Warner
Date:
At 21:12 6/12/00 -0500, Tom Lane wrote:
>Philip Warner <pjw@rhyme.com.au> writes:
>>> OK, we can do it that way.  I'm still going to pick a magic number that
>>> looks different depending on endianness, however ;-).
>
>> What does the smiley mean in this context?
>
>Just thinking that the only way an endianness flag inside the header
>would be useful is if we pick a magic number that's a bytewise
>palindrome.

You could just read the 1st, 2nd, 3rd, etc bytes and require that they be
'P', 'G', 'C', 'P', 'Y' or some such. I *think* reading five bytes and
doing a strcmp works...ie. don't rely on the integer value, use a string.


>> - floating point representation (for portability)
>
>Specified how?  (For that matter, determined how?)

I'd recommend a crystal ball. You did ask a question about the future ;-}.


>> - flag for compressed or uncompressed toast fields (I assume you dump them
>> uncompressed?)
>
>Yes, I want COPY to force 'em to uncompressed so as to avoid problems
>with cross-version changes of compression algorithm.  (Right at the
>moment it gets that wrong.)

Sounds reasonable, but there could be an advantage in allowing a binary
compressed dump for short-term work.


>> - version number may be important if we dump a subset of fields (ie. we'll
>> need to store the field names somewhere).
>
>No we don't.  ASCII COPY format doesn't store field names either ... at
>least not as part of the data stream ... and should not IMHO.  Don't you
>want to be able to reload into a table that you've changed the column
>names of?

This is essential if we ever allow subsets of columns - even if it is only
for displaying information to the user. If I dump 5 out of 7 columns then
rename half of them, I'd say I'm asking for trouble. At least with the
names available, you have a chance of working out what goes where. But
again, without copy-a-subset-of-columns, this also requires a crystal ball.


It all gets back to whether it's a good idea to overload a magic number. 





Re: Re: COPY BINARY file format proposal

From
Philip Warner
Date:
At 15:36 6/12/00 -0500, Tom Lane wrote:
>Grumble, I forgot about COPY WITH OIDS.  Amend that proposal as follows:
>
>... We should use two different
>magic numbers depending on whether OIDs are included in the dump or not.

I'd prefer to see a single magic number for all binary COPY output, then a
few bytes of header including a version number, and flags to indicate
endianness, OIDs etc. It seems a lot cleaner than overloading the magic
number.

Also, IIRC part of the problem with text-based COPY is that we can't
specify field order (I think this affects dumping the regression DB).
Would it be possible to add the ability to (a) specify field order, and (b)
dump a subset of fields?




Re: Re: COPY BINARY file format proposal

From
Tom Lane
Date:
Philip Warner <pjw@rhyme.com.au> writes:
>> Just thinking that the only way an endianness flag inside the header
>> would be useful is if we pick a magic number that's a bytewise
>> palindrome.

> You could just read the 1st, 2nd, 3rd, etc bytes and require that they be
> 'P', 'G', 'C', 'P', 'Y' or some such. I *think* reading five bytes and
> doing a strcmp works...ie. don't rely on the integer value, use a string.

Oh.  We could use a string instead of an integer, I suppose, although
I'm not sure I see the point for what's basically a binary format.

Given all that, here is a proposed spec for the header:

First 8 bytes: signature, ASCII "PGBCOPY\0" --- note that the null is a
required part of the signature.  (This is to catch files that have been
munged by a non-8-bit-clean transfer.)

Next 4 bytes: integer layout field.  This consists of the int32 constant
0x0A820D0A expressed in the source machine's endianness.  (Again, value
chosen with malice aforethought, to catch files munged by things like
DOS/Unix newline conversion or high-bit-stripping.)  Potentially, a
reader could engage in byte-flipping of subsequent fields if the wrong
byte order is detected here.

Next 4 bytes: version number, currently 1 (expressed in source machine's
endianness, as are all subsequent integer fields).  A reader should
abort if it does not recognize the version number.

Next 4 bytes: length of remainder of header, not including self.  In
the initial version this will be zero, and the first tuple follows
immediately.  Future changes to the format might allow additional data
to be present in the header.  A reader should silently ignore any header
extension data it does not know what to do with.

This allows for both backwards-compatible header additions (extend the
header without changing the version number) and non-backwards-compatible
changes (bump the version number).

Since we don't yet know what we might do about the issue of
floating-point format, I left that out of the spec.  It can be added to
the header extension area when and if we figure out how to do it.

Likewise, addons such as column names are also punted until later.

Comments?
        regards, tom lane


Re: Re: COPY BINARY file format proposal

From
Philip Warner
Date:
At 14:28 7/12/00 -0500, Tom Lane wrote:
>
>Next 4 bytes: version number, currently 1 (expressed in source machine's
>endianness

I don't want to continue being picky, but you could just use 4 bytes for a
maj-min-rev-patch version number (in that order), and avoid the endian
issues by reading and writing each byte. No big deal, though.


>This allows for both backwards-compatible header additions (extend the
>header without changing the version number) and non-backwards-compatible
>changes (bump the version number).

That's where the rev & patch levels help if you adopt the above version
numbering - 1.0-** should all be compatible, 1.1 should be able to
read <= 1.1-**, 1.0-** should not be expected to read 1.1-** etc.
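
A small sketch of the byte-wise version scheme suggested here. The function names are hypothetical, and treating a different major number as unreadable is an assumption, since only the minor-number rule is spelled out above.

```python
def parse_version(raw):
    """Split four version bytes into (major, minor, rev, patch)."""
    if len(raw) != 4:
        raise ValueError("version field must be exactly 4 bytes")
    return tuple(raw)  # read byte by byte, so endianness does not matter


def can_read(reader_version, file_version):
    """A reader handles files with the same major and an equal or older minor."""
    r_major, r_minor = reader_version[:2]
    f_major, f_minor = file_version[:2]
    # rev and patch are ignored: they are meant to be compatible changes
    return f_major == r_major and f_minor <= r_minor
```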


>
>Comments?
>

Sounds reasonable even without the above suggestions.




Re: Re: COPY BINARY file format proposal

From
Tom Lane
Date:
Philip Warner <pjw@rhyme.com.au> writes:
> I don't want to continue being picky, but you could just use 4 bytes for a
> maj-min-rev-patch version number (in that order), and avoid the endian
> issues by reading and writing each byte. No big deal, though.

Well, the thing is that we need to protect the contents of
datatype-specific structures.  If it were just a matter of byte-flipping
the counts and lengths defined by the (proposed) file format, I'd have
specified that we write 'em all in network byte order and be done with
it.  But knowing the internal structure of every datatype in the system
is a very different game, and I don't want to try to play that game ...
at least not yet.  So the proposal is just to identify the endianness
that the file is being written with.  Recovering the data on a machine
of different endianness is a project for future data archeologists.

>> This allows for both backwards-compatible header additions (extend the
>> header without changing the version number) and non-backwards-compatible
>> changes (bump the version number).

> That's where the rev & patch levels help if you adopt the above version
> numbering - 1.0-** should all be compatible, 1.1 should be able to
> read <= 1.1-**, 1.0-** should not be expected to read 1.1-** etc.

Tell you the truth, I don't believe in file-format version numbers at
all.  My experience with such things is that they defeat portability
rather than promote it, because readers tend to reject files that they
could actually have read as a result of insignificant version number
issues.  You can read all about my view of this issue in the PNG spec
(RFC 2083, esp section 12.13) --- the versioning philosophy described
there is largely yours truly's.

I will not complain about sticking a "version 1.0" field into a format
when there is no real intention of changing it in the future ... but
assigning deep significance to major/minor numbers, or something like
that, is wrongheaded.  You need a much finer-grained view of
compatibility issues than that if you want to achieve anything much
in cross-version compatibility.  Feature-based versioning, like PNG's
notion of critical vs. ancillary chunks, is the thing you need for
that.  I didn't bring up the issue in this morning's proposal --- but
if we ever do add stuff to the proposed extensible header, I will hold
out for self-identifying feature-related items much like PNG chunks.
        regards, tom lane


Re: Re: COPY BINARY file format proposal

From
Tom Lane
Date:
I wrote:
> Next 4 bytes: integer layout field.  This consists of the int32 constant
> 0x0A820D0A expressed in the source machine's endianness.  (Again, value
> chosen with malice aforethought, to catch files munged by things like
> DOS/Unix newline conversion or high-bit-stripping.)

Actually, that won't do.  A little-endian machine would write 0A 0D 82
0A which would fail to trigger newline converters that are looking for
\r followed by \n (0D 0A).  If we're going to take seriously the idea of
detecting newline transforms, then we need to incorporate the test
pattern into the fixed-byte-order signature.
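
The byte sequences in question can be checked with a short sketch using Python's struct module; the helper name is made up for the example.

```python
import struct

LAYOUT = 0x0A820D0A  # the constant from the earlier header proposal

big_endian = struct.pack(">I", LAYOUT)     # bytes 0A 82 0D 0A
little_endian = struct.pack("<I", LAYOUT)  # bytes 0A 0D 82 0A


def contains_crlf(data):
    """True if the bytes hold a \\r\\n pair that a newline converter would rewrite."""
    return b"\r\n" in data
```

Only the big-endian form contains the 0D 0A pair, so a file written on a little-endian machine would pass through a CR/LF-to-LF converter with this field untouched.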

How about:

Signature: 12-byte sequence "PGBCOPY\n\377\r\n\0" (detects newline
replacements, dropped nulls, dropped high bits, parity changes);

Integer layout field: int32 constant 0x01020304 in source's byte order.

The rest as before.
        regards, tom lane
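
To see why the 12-byte signature catches the listed transfer problems, the sketch below applies simplified stand-ins for each kind of damage and compares the result. The munging functions are illustrative approximations, not models of any particular tool.

```python
SIGNATURE = b"PGBCOPY\n\377\r\n\0"  # 12 bytes, identical on every machine


def unix_to_dos(data):
    """Simplified LF -> CR/LF conversion."""
    return data.replace(b"\n", b"\r\n")


def dos_to_unix(data):
    """Simplified CR/LF -> LF conversion."""
    return data.replace(b"\r\n", b"\n")


def strip_high_bit(data):
    """What a 7-bit transfer channel does to each byte."""
    return bytes(b & 0x7F for b in data)


def drop_nulls(data):
    """Remove zero bytes, as some text-oriented tools do."""
    return data.replace(b"\0", b"")


def signature_ok(header):
    """True if the file starts with the unmodified signature."""
    return header[:len(SIGNATURE)] == SIGNATURE
```

Each transformation changes at least one byte of the signature, so a plain byte comparison is enough to reject a damaged file before any integer field is read.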


Re: Re: COPY BINARY file format proposal

From
Philip Warner
Date:
At 02:31 8/12/00 -0500, Tom Lane wrote:
>
>How about:
>
>Signature: 12-byte sequence "PGBCOPY\n\377\r\n\0" (detects newline
>replacements, dropped nulls, dropped high bits, parity changes);
>
>Integer layout field: int32 constant 0x01020304 in source's byte order.
>

How about a CRC? ;-P




Re: Re: COPY BINARY file format proposal

From
Bruce Momjian
Date:
> I will not complain about sticking a "version 1.0" field into a format
> when there is no real intention of changing it in the future ... but
> assigning deep significance to major/minor numbers, or something like

I assume the version would be the COPY format version, not the
PostgreSQL version.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Re: COPY BINARY file format proposal

From
Bruce Momjian
Date:
> Also, IIRC part of the problem with text-based COPY is that we can't
> specify field order (I think this affects dumping the regression DB).
> Would it be possible to add the ability to (a) specify field order, and (b)
> dump a subset of fields?

Informix does this nicely:
    UNLOAD TO "file"
        SELECT *
        FROM tab

Merging COPY and SELECT has some real advantages.  You can specify
columns, parts of a table using WHERE, and even joins.  Very flexible.

Perhaps, if the table name is missing from COPY, we can allow a SELECT:
    COPY TO 'file'
        SELECT *
        FROM tab

 


Re: Re: COPY BINARY file format proposal

From
ncm@zembu.com (Nathan Myers)
Date:
On Thu, Dec 07, 2000 at 02:28:28PM -0500, Tom Lane wrote:
> Given all that, here is a proposed spec for the header:
>  ...
> Comments?

I've been thinking about this.  

I'd like to see a timestamp for when the image was created, and a 
128-byte comment field to allow annotations, even after the fact.
(I don't think we're pressed for space, right?)  The more chances
that you don't have to actually load the file to find out what's
in it, the better.

(I have also suggested, in private mail, that the "header length" 
field should be the length of the whole header, not just whatever 
was added on in versions 2..n.  Tom didn't agree.)

Nathan Myers
ncm@zembu.com


Re: Re: COPY BINARY file format proposal

From
"Ross J. Reedstrom"
Date:
On Sun, Dec 10, 2000 at 04:08:58PM -0800, Nathan Myers wrote:
> On Thu, Dec 07, 2000 at 02:28:28PM -0500, Tom Lane wrote:
> > Given all that, here is a proposed spec for the header:
> >  ...
> > Comments?
> 
> (I have also suggested, in private mail, that the "header length" 
> field should be the length of the whole header, not just whatever 
> was added on in versions 2..n.  Tom didn't agree.)

I had the same thought, but didn't get around to posting it.

Ross


Re: Re: COPY BINARY file format proposal

From
Tom Lane
Date:
ncm@zembu.com (Nathan Myers) writes:
> I'd like to see a timestamp for when the image was created, and a 
> 128-byte comment field to allow annotations, even after the fact.

Both seem like reasonable options.  If you don't mind, however,
I'd suggest that they be left for inclusion as chunks in the header
extension area, rather than nailing them down in the fixed header.

The advantage of handling a comment that way is obvious: it needn't
be fixed-length.  As for the timestamp, handling it as an optional
chunk would allow graceful substitution of a different timestamp
format, which we'll need when 2038 begins to loom.

Basically what I want to do at the moment is get a minimal format
spec nailed down for 7.1.  There'll be time for neat extras later
as long as we get it right now --- but there's not a lot of time
for extras before 7.1.
        regards, tom lane


Re: Re: COPY BINARY file format proposal

From
Bruce Momjian
Date:
> ncm@zembu.com (Nathan Myers) writes:
> > I'd like to see a timestamp for when the image was created, and a 
> > 128-byte comment field to allow annotations, even after the fact.
> 
> Both seem like reasonable options.  If you don't mind, however,
> I'd suggest that they be left for inclusion as chunks in the header
> extension area, rather than nailing them down in the fixed header.
> 
> The advantage of handling a comment that way is obvious: it needn't
> be fixed-length.  As for the timestamp, handling it as an optional
> chunk would allow graceful substitution of a different timestamp
> format, which we'll need when 2038 begins to loom.
> 
> Basically what I want to do at the moment is get a minimal format
> spec nailed down for 7.1.  There'll be time for neat extras later
> as long as we get it right now --- but there's not a lot of time
> for extras before 7.1.

These have the look of creeping-featurism to me.

 


Re: Re: COPY BINARY file format proposal

From
Philip Warner
Date:
At 01:27 8/12/00 -0500, Tom Lane wrote:
>Recovering the data on a machine
>of different endianness is a project for future data archeologists.

It's frightening to think that in 1000 years time people will be deducing
things about our society from the way we stored data.


>
>Tell you the truth, I don't believe in file-format version numbers at
>all...
>(RFC 2083, esp section 12.13) --- the versioning philosophy described
>there is largely yours truly's.

Seems to be a much better approach; (non)critical chunks & chunk types are
much more portable.




Re: Re: COPY BINARY file format proposal

From
Philip Warner
Date:
At 19:55 8/12/00 -0500, Tom Lane wrote:
>Philip Warner <pjw@rhyme.com.au> writes:
>> How about a CRC? ;-P
>
>I take it from the smiley that you're not serious, but actually it seems
>like it might not be a bad idea.  I could see appending a CRC to each
>tuple record.  Comments anyone?

More a matter of not thinking it was important enough to worry about, and
not really wanting to drag the MD5/MD4/CRC64/etc debate into this one.
Having said that, I think it would be a nice-to-have, like CRCs on db pages
- in the latter case I'd really like VACUUM (or another utility) to be able
to report 'invalid pages' on a nightly basis (or, better still, not report
them). 


>Attached is the current state of the proposal.  I haven't added a CRC
>field but am willing to do so if that's the consensus.

Sounds good to me. I'm not sure you need it on a per-tuple basis - but it
can't hurt, assuming it's cheap to generate. Does the backend send tuples
or blocks of tuples? If the latter, and if CRC is expensive, then maybe 1
CRC for each group of tuples.

Also having a CRC on a per-tuple basis will prevent getting out of sync
with the data, and make partial data recovery possible.


>Next 4 bytes: length of remainder of header, not including self.  In
>the initial version this will be zero, and the first tuple follows
>immediately.  Future changes to the format might allow additional data
>to be present in the header.  A reader should silently ignore any header
>extension data it does not know what to do with.

Don't you need to at least define how to specify non-essential chunks,
since the flags are not to be used to describe the header extensions. Or
are we going to make the initial version barf when it encounters any header
extension?


>Tuples
>------
>
>Each tuple begins with an int16 count of the number of fields in the
>tuple.  (Presently, all tuples in a table will have the same count, but
>that might not always be true.)

Another option would be to:

- dump the field sizes in the header somewhere (they will all be the same), 
- for each row output a bitmap of non-null fields, followed by the data.
- varlena would have a -1 length in the header, and an int32 length in the row.

This is harder to read and to write, but saves space, if that is desirable.
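
A sketch of what a row might look like under this alternative. The bit ordering within the bitmap, the int32 length counting only the data bytes, and the function name are all assumptions; the thread does not pin them down.

```python
import struct


def encode_row(field_sizes, values, byte_order="<"):
    """Encode one row as a non-NULL bitmap followed by the non-NULL field data.

    field_sizes: per-column sizes as they would appear in the file header
                 (-1 for varlena columns).
    values:      per-column bytes, or None for NULL.
    """
    if len(field_sizes) != len(values):
        raise ValueError("row does not match the column list in the header")
    bitmap = bytearray((len(values) + 7) // 8)
    body = []
    for i, (size, value) in enumerate(zip(field_sizes, values)):
        if value is None:
            continue  # NULL: bit stays clear and no data is written
        bitmap[i // 8] |= 1 << (i % 8)  # assumption: bit i set means non-NULL
        if size == -1:
            # varlena: per-row int32 length (assumed to count data bytes only)
            body.append(struct.pack(byte_order + "i", len(value)))
        elif len(value) != size:
            raise ValueError(f"column {i}: expected {size} bytes")
        body.append(value)
    return bytes(bitmap) + b"".join(body)
```

Compared with a typlen word per field, a fixed-length column costs no per-row overhead here and a NULL costs a single bit, which is where the space saving comes from.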

>
>For non-NULL fields, the reader can check that the typlen matches the
>expected typlen for the destination column.  This provides a simple
>but very useful check that the data is as expected.

CRC seems like the go here...






Re: Re: COPY BINARY file format proposal

From
Tom Lane
Date:
Philip Warner <pjw@rhyme.com.au> writes:
> More a matter of not thinking it was important enough to worry about, and
> not really wanting to drag the MD5/MD4/CRC64/etc debate into this one.

I'd just as soon not drag that debate in here either ;-) ... but once we
settle on an appropriate CRC method for WAL it's easy enough to call the
same routine for this code.

> Sounds good to me. I'm not sure you need it on a per-tuple basis - but it
> can't hurt, assuming it's cheap to generate. Does the backend send tuples
> or blocks of tuples? If the latter, and if CRC is expensive, then maybe 1
> CRC for each group of tuples.

Extending the CRC over multiple tuples would just complicate life,
I think.  The per-byte cost is the biggest factor, so you don't really
save all that much.
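
A per-tuple CRC could be sketched as below. Since the CRC method had not been settled, zlib.crc32 is used purely as a stand-in, and the function names and the 4-byte little-endian encoding are assumptions.

```python
import struct
import zlib


def append_crc(tuple_bytes, byte_order="<"):
    """Return the tuple record followed by a CRC-32 of its bytes."""
    return tuple_bytes + struct.pack(byte_order + "I", zlib.crc32(tuple_bytes))


def check_crc(record, byte_order="<"):
    """Verify the trailing CRC and return the tuple bytes without it."""
    if len(record) < 4:
        raise ValueError("record is too short to hold a CRC")
    body, stored = record[:-4], record[-4:]
    (expected,) = struct.unpack(byte_order + "I", stored)
    if zlib.crc32(body) != expected:
        raise ValueError("CRC mismatch: tuple record is corrupt")
    return body
```

Every byte of the tuple passes through the checksum once regardless of how tuples are grouped, which is why grouping several tuples under one CRC saves little.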

>> Next 4 bytes: length of remainder of header, not including self.  In
>> the initial version this will be zero, and the first tuple follows
>> immediately.  Future changes to the format might allow additional data
>> to be present in the header.  A reader should silently ignore any header
>> extension data it does not know what to do with.

> Don't you need to at least define how to specify non-essential chunks,
> since the flags are not to be used to describe the header extensions. Or
> are we going to make the initial version barf when it encounters any header
> extension?

No, the initial version will just silently skip the whole header
extension; it's defined so that that's a legal behavior (everything
in the header extension is inessential).  We can come back and define
a format for the entries in the header extension area when we need some.

> Another option would be to:
> - dump the field sizes in the header somewhere (they will all be the same), 
> - for each row output a bitmap of non-null fields, followed by the data.
> - varlena would have a -1 length in the header, and an int32 length in the row.

That would work if you are willing to assume that all the tuples indeed
always have the same set of fields --- you're not, for example, doing an
inheritance-tree-walk "COPY FROM foo*".  But Chris Bitmead still has a
gleam in his eye about that sort of thing, so we might want it someday.
I think it's worth a small amount of extra space to avoid that
assumption, especially since it simplifies the code too.
        regards, tom lane


Re: Re: COPY BINARY file format proposal

From
Tom Lane
Date:
Philip Warner <pjw@rhyme.com.au> writes:
> How about a CRC? ;-P

I take it from the smiley that you're not serious, but actually it seems
like it might not be a bad idea.  I could see appending a CRC to each
tuple record.  Comments anyone?

You seemed to like the PNG philosophy of using feature flags rather than
a version number.  Accordingly, I propose dropping the version number
field in favor of a flags word.  (Which was needed anyway, because I had
*again* forgotten about COPY WITH OIDS :-(.)

Attached is the current state of the proposal.  I haven't added a CRC
field but am willing to do so if that's the consensus.
        regards, tom lane


COPY BINARY file format proposal

The objectives of this change are:

1. Get rid of the tuple count at the front of the file.  This requires
an extra pass over the relation, which is a lot more trouble than the
count is worth.  Use an explicit EOF marker instead.
2. Send fields of a tuple individually, instead of dumping out raw tuples
(complete with alignment padding and so forth) as is currently done.
This is mainly to simplify TOAST-related processing.
3. Make the format somewhat self-identifying, so that the reader has at
least some chance of detecting it when the data doesn't match the table
it's supposed to be loaded into.

The proposed format consists of a file header, zero or more tuples, and a
file trailer.


File Header
-----------

The proposed file header consists of 24 bytes of fixed fields, followed
by a variable-length header extension area.

Signature: 12-byte sequence "PGBCOPY\n\377\r\n\0" --- note that the null
is a required part of the signature.  (The signature is designed to allow
easy identification of files that have been munged by a non-8-bit-clean
transfer.  The proposed signature will be changed by newline-translation
filters, dropped nulls, dropped high bits, or parity changes.)
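
A minimal sketch of how a reader might test the signature, and of why the
listed kinds of damage are caught.  The function names are illustrative
only and are not part of the proposal; the three "damage" helpers are
crude stand-ins for what a non-8-bit-clean transfer might do.

```python
SIGNATURE = b"PGBCOPY\n\xff\r\n\x00"  # 12 bytes; the trailing NUL is required


def has_valid_signature(data: bytes) -> bool:
    """Return True if data begins with the proposed 12-byte signature."""
    return data[:len(SIGNATURE)] == SIGNATURE


# Hypothetical stand-ins for the kinds of damage the signature should expose.
def crlf_to_lf(data: bytes) -> bytes:
    """Simulate a newline-translation filter (CRLF collapsed to LF)."""
    return data.replace(b"\r\n", b"\n")


def drop_nulls(data: bytes) -> bytes:
    """Simulate a transfer that silently discards NUL bytes."""
    return data.replace(b"\x00", b"")


def strip_high_bits(data: bytes) -> bytes:
    """Simulate a 7-bit channel that clears the high bit of every byte."""
    return bytes(b & 0x7F for b in data)
```

Each helper alters at least one byte of the signature, so a file that has
passed through any of them fails the check before any tuple data is read.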

Integer layout field: int32 constant 0x01020304 in source's byte order.
Potentially, a reader could engage in byte-flipping of subsequent fields
if the wrong byte order is detected here.
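
As a rough illustration, a reader could determine the writer's byte order
by trying both interpretations of the four bytes.  The function name and
the idea of returning a struct-module prefix are assumptions of this
sketch, not something the proposal specifies.

```python
import struct

LAYOUT_CONSTANT = 0x01020304


def detect_byte_order(field: bytes) -> str:
    """Return '<' (little-endian) or '>' (big-endian) for the layout field.

    The returned character can be used as a struct format prefix when
    decoding the integer fields that follow.
    """
    if len(field) != 4:
        raise ValueError("integer layout field must be exactly 4 bytes")
    for prefix in ("<", ">"):
        if struct.unpack(prefix + "I", field)[0] == LAYOUT_CONSTANT:
            return prefix
    raise ValueError("unrecognized integer layout field: %r" % (field,))
```

A reader that does not want to byte-flip can instead compare the result
with its own native order and simply reject a mismatch.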

Flags field: a 4-byte bit mask to denote important aspects of the file
format.  Bits are numbered from 0 (LSB) to 31 (MSB) --- note that this
field is stored with source's endianness, as are all subsequent integer
fields.  Bits 16-31 are reserved to denote critical file format issues;
a reader should abort if it finds an unexpected bit set in this range.
Bits 0-15 are reserved to signal backwards-compatible format issues;
a reader should simply ignore any unexpected bits set in this range.
Currently only one flag bit is defined, and the rest must be zero:
    Bit 16:    if 1, OIDs are included in the dump; if 0, not
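
The critical/ignorable split might be checked along these lines.  This is
only a sketch: the constant and function names are made up for the
example, and the flags word is assumed to have been decoded already into
an unsigned integer using the source's byte order.

```python
FLAG_OIDS = 1 << 16            # the only flag bit defined so far
CRITICAL_MASK = 0xFFFF0000     # bits 16-31: reader must understand these
KNOWN_CRITICAL = FLAG_OIDS     # critical bits this reader knows about


def interpret_flags(flags: int) -> bool:
    """Return True if the file includes OIDs.

    Raises ValueError if any unknown bit in the critical range (16-31)
    is set.  Unknown bits in the range 0-15 are ignored.
    """
    unknown = flags & CRITICAL_MASK & ~KNOWN_CRITICAL
    if unknown:
        raise ValueError("unsupported critical flag bits: 0x%08x" % unknown)
    return bool(flags & FLAG_OIDS)
```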

Next 4 bytes: length of remainder of header, not including self.  In
the initial version this will be zero, and the first tuple follows
immediately.  Future changes to the format might allow additional data
to be present in the header.  A reader should silently ignore any header
extension data it does not know what to do with.
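
A reader that knows nothing about header extensions could skip the area
as sketched below.  The function name is hypothetical, and treating the
length word as unsigned is an assumption of this sketch (the proposal
only says "4 bytes").  The stream is assumed to be positioned just after
the flags field.

```python
import io
import struct


def skip_header_extension(stream, order="<"):
    """Consume the length word and the extension area; return the area's size.

    order is the struct byte-order prefix for the source's endianness.
    """
    raw = stream.read(4)
    if len(raw) != 4:
        raise ValueError("truncated header: missing extension length")
    (ext_len,) = struct.unpack(order + "I", raw)
    skipped = stream.read(ext_len)
    if len(skipped) != ext_len:
        raise ValueError("truncated header extension area")
    return ext_len


# Usage: an initial-version file has a zero-length extension area, so the
# stream is left positioned at the first tuple.
stream = io.BytesIO(struct.pack("<I", 0) + b"tuple data")
skip_header_extension(stream)
```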

Note that I envision the content of the header extension area as being a
sequence of self-identifying chunks (but the specific design of same is
postponed until we need 'em).  The flags field is not intended to tell
readers what is in the extension area.

This design allows for both backwards-compatible header additions (add
header extension chunks, or set low-order flag bits) and non-backwards-
compatible changes (set high-order flag bits to signal such changes,
and add supporting data to the extension area if needed).


Tuples
------

Each tuple begins with an int16 count of the number of fields in the
tuple.  (Presently, all tuples in a table will have the same count, but
that might not always be true.)  Then, repeated for each field in the
tuple, there is an int16 typlen word possibly followed by field data.
The typlen field is interpreted thus:
    Zero    Field is NULL.  No data follows.
    > 0     Field is a fixed-length datatype.  Exactly N
            bytes of data follow the typlen word.
    -1      Field is a varlena datatype.  The next four
            bytes are the varlena header, which contains
            the total value length including itself.
    < -1    Reserved for future use.

For non-NULL fields, the reader can check that the typlen matches the
expected typlen for the destination column.  This provides a simple
but very useful check that the data is as expected.
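
To make the typlen rules concrete, here is a sketch of reading one field.
The helper names and the expected_typlen parameter are inventions of this
example; the varlena header is assumed to be an int32 in the source's
byte order, and the returned bytes exclude that four-byte header.

```python
import io
import struct


def _read_exact(stream, n):
    """Read exactly n bytes or raise ValueError on a short read."""
    data = stream.read(n)
    if len(data) != n:
        raise ValueError("unexpected end of data")
    return data


def read_field(stream, order="<", expected_typlen=None):
    """Read one field; return None for NULL, else the raw data bytes."""
    (typlen,) = struct.unpack(order + "h", _read_exact(stream, 2))
    if typlen == 0:
        return None  # NULL: no data follows
    if expected_typlen is not None and typlen != expected_typlen:
        raise ValueError("typlen %d, expected %d" % (typlen, expected_typlen))
    if typlen > 0:
        return _read_exact(stream, typlen)  # fixed-length datatype
    if typlen == -1:
        # The varlena header holds the total length, including itself.
        (total,) = struct.unpack(order + "i", _read_exact(stream, 4))
        if total < 4:
            raise ValueError("bad varlena length %d" % total)
        return _read_exact(stream, total - 4)
    raise ValueError("reserved typlen %d" % typlen)


# Usage: a varlena field carrying the five bytes b"hello" (total length 9).
value = read_field(io.BytesIO(struct.pack("<hi", -1, 9) + b"hello"))
```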

There is no alignment padding or any other extra data between fields.
Note also that the format does not distinguish whether a datatype is
pass-by-reference or pass-by-value.  Both of these provisions are
deliberate: they might help improve portability of the files (although
of course endianness and floating-point-format issues can still keep
you from moving a binary file across machines).

If OIDs are included in the dump, the OID field immediately follows the
field-count word.  It is a normal field except that it's not included
in the field-count.  In particular it has a typlen --- this will allow
handling of 4-byte vs 8-byte OIDs without too much pain, and will allow
OIDs to be shown as NULL if we someday allow OIDs to be optional.


File Trailer
------------

The file trailer consists of an int16 word containing -1.  This is
easily distinguished from a tuple's field-count word.

A reader should report an error if a field-count word is neither -1
nor the expected number of columns.  This provides a pretty strong
check against somehow getting out of sync with the data.
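
Putting the field-count word, the optional OID, and the trailer together,
a reader's main loop might look roughly like this.  All names are
hypothetical; the per-field decoding is delegated to a caller-supplied
function, and the toy field reader below handles only NULL and
fixed-length fields so that the example stays short.

```python
import io
import struct

TRAILER = -1  # int16 value that ends the file


def read_tuples(stream, ncolumns, read_field, with_oids=False, order="<"):
    """Return a list of (oid, fields) pairs read up to the file trailer.

    read_field is a caller-supplied function (hypothetical here) that
    consumes one typlen word plus its data from stream and returns the value.
    """
    rows = []
    while True:
        raw = stream.read(2)
        if len(raw) != 2:
            raise ValueError("input ended before the file trailer")
        (count,) = struct.unpack(order + "h", raw)
        if count == TRAILER:
            return rows
        if count != ncolumns:
            raise ValueError("field count %d, expected %d" % (count, ncolumns))
        # The OID, when present, precedes the counted fields and is not counted.
        oid = read_field(stream) if with_oids else None
        rows.append((oid, [read_field(stream) for _ in range(count)]))


def read_fixed_field(stream, order="<"):
    """Toy field reader for the demo: NULL and fixed-length fields only."""
    (typlen,) = struct.unpack(order + "h", stream.read(2))
    return stream.read(typlen) if typlen > 0 else None
```

Because every tuple must start with either the expected column count or
-1, a reader that has drifted out of sync will almost always hit the
field-count error on the next tuple rather than load garbage.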


Re: Re: COPY BINARY file format proposal

From
ncm@zembu.com (Nathan Myers)
Date:
On Sun, Dec 10, 2000 at 08:51:52PM -0500, Tom Lane wrote:
> ncm@zembu.com (Nathan Myers) writes:
> > I'd like to see a timestamp for when the image was created, and a 
> > 128-byte comment field to allow annotations, even after the fact.
> 
> Both seem like reasonable options.  If you don't mind, however,
> I'd suggest that they be left for inclusion as chunks in the header
> extension area, rather than nailing them down in the fixed header.
> 
> The advantage of handling a comment that way is obvious: it needn't
> be fixed-length.  As for the timestamp, handling it as an optional
> chunk would allow graceful substitution of a different timestamp
> format, which we'll need when 2038 begins to loom.

I don't know if you get the point of the fixed-size comment field.  
The idea was that a comment could be poked into an existing COPY 
image, after it was written.  A variable-size comment field in an
already-written image might leave no space to poke in anything.  A 
variable-size comment field with a required minimum size would 
satisfy both needs, at some cost in complexity.  

> Basically what I want to do at the moment is get a minimal format
> spec nailed down for 7.1.  There'll be time for neat extras later
> as long as we get it right now --- but there's not a lot of time
> for extras before 7.1.

I understand.

Nathan Myers
ncm@zembu.com


Re: Re: COPY BINARY file format proposal

From
Tom Lane
Date:
ncm@zembu.com (Nathan Myers) writes:
> I don't know if you get the point of the fixed-size comment field.  
> The idea was that a comment could be poked into an existing COPY 
> image, after it was written.

Yes, I did get the point ...

> A variable-size comment field in an
> already-written image might leave no space to poke in anything.  A 
> variable-size comment field with a required minimum size would 
> satisfy both needs, at some cost in complexity.  

This strikes me as a perfect argument for a variable-size field.
If you want to leave N bytes for a future poked-in comment, you do that.
If you don't, then not.  Leaving 128 bytes (or any other frozen-by-the-
file-format number) is guaranteed to satisfy nobody.
        regards, tom lane


Re: Re: COPY BINARY file format proposal

From
Peter Eisentraut
Date:
Tom Lane writes:

> I take it from the smiley that you're not serious, but actually it seems
> like it might not be a bad idea.  I could see appending a CRC to each
> tuple record.  Comments anyone?

I think I missed the point here.  With CRC you typically want to detect
data corruption.  Where's the possible source of corruption here?

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/