Re: Re: COPY BINARY file format proposal - Mailing list pgsql-hackers

From Philip Warner
Subject Re: Re: COPY BINARY file format proposal
Date
Msg-id 3.0.5.32.20001209144019.02c9e100@mail.rhyme.com.au
In response to COPY BINARY file format proposal  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Re: COPY BINARY file format proposal  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
At 19:55 8/12/00 -0500, Tom Lane wrote:
>Philip Warner <pjw@rhyme.com.au> writes:
>> How about a CRC? ;-P
>
>I take it from the smiley that you're not serious, but actually it seems
>like it might not be a bad idea.  I could see appending a CRC to each
>tuple record.  Comments anyone?

More a matter of not thinking it was important enough to worry about, and
not really wanting to drag the MD5/MD4/CRC64/etc debate into this one.
Having said that, I think it would be a nice-to-have, like CRCs on db pages
- in the latter case I'd really like VACUUM (or another utility) to be able
to report 'invalid pages' on a nightly basis (or, better still, not report
them).


>Attached is the current state of the proposal.  I haven't added a CRC
>field but am willing to do so if that's the consensus.

Sounds good to me. I'm not sure you need it on a per-tuple basis - but it
can't hurt, assuming it's cheap to generate. Does the backend send tuples
or blocks of tuples? If the latter, and if CRC is expensive, then maybe 1
CRC for each group of tuples.

Also, having a CRC on a per-tuple basis will let a reader notice when it has
got out of sync with the data, and make partial data recovery possible.
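To make the per-tuple versus per-group trade-off concrete, here is a minimal sketch in Python. It assumes CRC-32 (via zlib) and a little-endian uint32 trailer after each record; neither choice comes from the proposal, and the function names are only illustrative.

```python
import struct
import zlib

def frame_tuple(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload (assumed layout: payload, then uint32)."""
    return payload + struct.pack("<I", zlib.crc32(payload))

def check_tuple(record: bytes) -> bytes:
    """Return the payload if the trailing CRC matches, else raise ValueError."""
    if len(record) < 4:
        raise ValueError("record too short to hold a CRC")
    payload = record[:-4]
    (stored,) = struct.unpack("<I", record[-4:])
    if zlib.crc32(payload) != stored:
        raise ValueError("CRC mismatch: corrupt record or reader out of sync")
    return payload

def group_crc(payloads) -> int:
    """One running CRC over a group of tuples instead of one per tuple."""
    crc = 0
    for payload in payloads:
        crc = zlib.crc32(payload, crc)  # continue from the previous value
    return crc
```

With the per-tuple form a reader can drop a single bad record and carry on; with the group form a mismatch only says that something in the group is wrong.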


>Next 4 bytes: length of remainder of header, not including self.  In
>the initial version this will be zero, and the first tuple follows
>immediately.  Future changes to the format might allow additional data
>to be present in the header.  A reader should silently ignore any header
>extension data it does not know what to do with.

Don't you need to at least define how to specify non-essential chunks,
since the flags are not to be used to describe the header extensions? Or
are we going to make the initial version barf when it encounters any header
extension?
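One way the question could be answered, sketched in Python. Everything about the chunk layout here is hypothetical: a uint32 chunk id, a uint32 length, then the data, with the top bit of the id meaning "a reader must understand this chunk". The little-endian byte order is also an assumption. Only the 4-byte "length of remainder of header" field comes from the proposal.

```python
import io
import struct

ESSENTIAL = 0x80000000  # hypothetical flag bit: reader must understand the chunk

def read_header_extension(stream, known_ids=()):
    """Read the extension area; return {chunk_id: data} for known chunks."""
    (ext_len,) = struct.unpack("<I", stream.read(4))  # remainder of header
    area = stream.read(ext_len)
    if len(area) != ext_len:
        raise ValueError("truncated header extension")
    chunks = {}
    pos = 0
    while pos < ext_len:
        if ext_len - pos < 8:
            raise ValueError("truncated chunk header")
        chunk_id, size = struct.unpack_from("<II", area, pos)
        pos += 8
        if size > ext_len - pos:
            raise ValueError("chunk runs past end of header extension")
        data = area[pos:pos + size]
        pos += size
        if chunk_id in known_ids:
            chunks[chunk_id] = data
        elif chunk_id & ESSENTIAL:
            raise ValueError(f"unknown essential chunk {chunk_id:#x}")
        # otherwise: unknown and non-essential, silently ignored
    return chunks
```

With a zero-length extension (the initial version) this reads the four length bytes and leaves the stream positioned at the first tuple.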


>Tuples
>------
>
>Each tuple begins with an int16 count of the number of fields in the
>tuple.  (Presently, all tuples in a table will have the same count, but
>that might not always be true.)

Another option would be to:

- dump the field sizes in the header somewhere (they will all be the same), 
- for each row output a bitmap of non-null fields, followed by the data.
- varlena would have a -1 length in the header, and an int32 length in the row.

This is harder to read and to write, but saves space, if that is desirable.
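A rough Python sketch of that layout, to show what "harder to read and to write" amounts to. The exact encoding is assumed, not specified above: an int16 field count followed by one int16 size per field in the header, a bitmap with bit i set when field i is non-null, and little-endian integers throughout.

```python
import struct

VARLENA = -1  # header size marker for variable-length fields

def encode_header(field_sizes):
    """int16 field count, then one int16 size per field (-1 for varlena)."""
    return struct.pack(f"<h{len(field_sizes)}h", len(field_sizes), *field_sizes)

def encode_row(field_sizes, values):
    """values holds bytes, or None for NULL; returns bitmap followed by data."""
    bitmap = bytearray((len(field_sizes) + 7) // 8)
    body = bytearray()
    for i, (size, value) in enumerate(zip(field_sizes, values)):
        if value is None:
            continue  # NULL: bit stays clear, no data written
        bitmap[i // 8] |= 1 << (i % 8)
        if size == VARLENA:
            body += struct.pack("<i", len(value))  # int32 length in the row
        elif len(value) != size:
            raise ValueError(f"field {i}: expected {size} bytes")
        body += value
    return bytes(bitmap) + bytes(body)

def decode_row(field_sizes, buf, pos=0):
    """Return (values, next_pos) for the row starting at pos."""
    nbytes = (len(field_sizes) + 7) // 8
    bitmap = buf[pos:pos + nbytes]
    pos += nbytes
    values = []
    for i, size in enumerate(field_sizes):
        if not bitmap[i // 8] & (1 << (i % 8)):
            values.append(None)
            continue
        if size == VARLENA:
            (size,) = struct.unpack_from("<i", buf, pos)
            pos += 4
        values.append(bytes(buf[pos:pos + size]))
        pos += size
    return values, pos
```

The saving comes from writing the fixed sizes once in the header rather than in every row; the cost is that a row can no longer be parsed without the header in hand.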

>
>For non-NULL fields, the reader can check that the typlen matches the
>expected typlen for the destination column.  This provides a simple
>but very useful check that the data is as expected.

CRC seems like the go here...




----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/

