Re: Re: COPY BINARY file format proposal - Mailing list pgsql-hackers

From	Philip Warner
Subject	Re: Re: COPY BINARY file format proposal
Date	December 12, 2000 14:12:05
Msg-id	3.0.5.32.20001209144019.02c9e100@mail.rhyme.com.au
In response to	COPY BINARY file format proposal (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Re: COPY BINARY file format proposal (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-hackers

At 19:55 8/12/00 -0500, Tom Lane wrote:
>Philip Warner <pjw@rhyme.com.au> writes:
>> How about a CRC? ;-P
>
>I take it from the smiley that you're not serious, but actually it seems
>like it might not be a bad idea.  I could see appending a CRC to each
>tuple record.  Comments anyone?

More a matter of not thinking it was important enough to worry about, and
not really wanting to drag the MD5/MD4/CRC64/etc debate into this one.
Having said that, I think it would be a nice-to-have, like CRCs on db pages
- in the latter case I'd really like VACUUM (or another utility) to be able
to report 'invalid pages' on a nightly basis (or, better still, not report
them). 


>Attached is the current state of the proposal.  I haven't added a CRC
>field but am willing to do so if that's the consensus.

Sounds good to me. I'm not sure you need it on a per-tuple basis - but it
can't hurt, assuming it's cheap to generate. Does the backend send tuples
or blocks of tuples? If the latter, and if CRC is expensive, then maybe 1
CRC for each group of tuples.
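
The trade-off between a per-tuple CRC and one CRC per group of tuples can be sketched as follows. This is only an illustration, not the proposed format: the use of CRC-32 from zlib, the network byte order, and the names frame_per_tuple and frame_per_group are all assumptions made for the example.

```python
import struct
import zlib


def frame_per_tuple(tuples):
    """Append a 4-byte CRC-32 after every tuple record (assumed layout)."""
    out = bytearray()
    for payload in tuples:
        out += payload
        out += struct.pack("!I", zlib.crc32(payload))
    return bytes(out)


def frame_per_group(tuples, group_size):
    """Append one 4-byte CRC-32 after each group of `group_size` tuples."""
    out = bytearray()
    for start in range(0, len(tuples), group_size):
        group = b"".join(tuples[start:start + group_size])
        out += group
        out += struct.pack("!I", zlib.crc32(group))
    return bytes(out)


# Three already-serialized tuple records (placeholder bytes).
rows = [b"row-one", b"row-two", b"row-three"]
per_tuple = frame_per_tuple(rows)                # 3 checksums
per_group = frame_per_group(rows, group_size=2)  # 2 checksums
```

With a group size of 2 the three rows above carry two checksums instead of three, at the cost of losing a whole group rather than a single tuple when a checksum fails.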

Also having a CRC on a per-tuple basis will prevent getting out of sync
with the data, and make partial data recovery easier.


>Next 4 bytes: length of remainder of header, not including self.  In
>the initial version this will be zero, and the first tuple follows
>immediately.  Future changes to the format might allow additional data
>to be present in the header.  A reader should silently ignore any header
>extension data it does not know what to do with.

Don't you need to at least define how to specify non-essential chunks,
since the flags are not to be used to describe the header extensions? Or
are we going to make the initial version barf when it encounters any header
extension?
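
For reference, the "silently ignore" rule in the quoted text amounts to a reader like the sketch below. It assumes the stream is already positioned at the 4-byte length field, that the length is an unsigned integer in network byte order, and that skip_header_extension is a hypothetical helper name; none of these details are settled in the proposal as quoted here.

```python
import io
import struct


def skip_header_extension(stream):
    """Read the 4-byte extension length, then skip that many bytes.

    Returns the number of extension bytes that were skipped.
    """
    raw = stream.read(4)
    if len(raw) != 4:
        raise ValueError("truncated header: missing extension length")
    (ext_len,) = struct.unpack("!I", raw)  # byte order is an assumption
    skipped = stream.read(ext_len)
    if len(skipped) != ext_len:
        raise ValueError("truncated header extension")
    return ext_len


# Initial version: zero-length extension, tuple data follows immediately.
v1 = io.BytesIO(struct.pack("!I", 0) + b"TUPLES")
n1 = skip_header_extension(v1)
rest1 = v1.read()

# Hypothetical later version carrying 3 bytes the reader does not understand.
v2 = io.BytesIO(struct.pack("!I", 3) + b"\x01\x02\x03" + b"TUPLES")
n2 = skip_header_extension(v2)
rest2 = v2.read()
```

A reader written this way cannot tell an essential extension from a non-essential one, which is the gap the question above points at.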


>Tuples
>------
>
>Each tuple begins with an int16 count of the number of fields in the
>tuple.  (Presently, all tuples in a table will have the same count, but
>that might not always be true.)

Another option would be to:

- dump the field sizes in the header somewhere (they will all be the same),
- for each row output a bitmap of non-null fields, followed by the data,
- varlena would have a -1 length in the header, and an int32 length in the row.

This is harder to read and to write, but saves space, if that is desirable.
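
A minimal sketch of that alternative layout, to make the space/complexity trade-off concrete. The bit order within the bitmap (least significant bit first), the network byte order of the int32 length, and the names encode_row and decode_row are assumptions for illustration only.

```python
import struct

VARLENA = -1  # header marker for a variable-length field, as suggested above


def encode_row(field_sizes, values):
    """Encode one row: non-null bitmap, then data for the non-null fields."""
    if len(field_sizes) != len(values):
        raise ValueError("field count mismatch")
    bitmap = bytearray((len(values) + 7) // 8)
    body = bytearray()
    for i, (size, value) in enumerate(zip(field_sizes, values)):
        if value is None:
            continue  # NULL: bit stays 0 and no data is written
        bitmap[i // 8] |= 1 << (i % 8)
        if size == VARLENA:
            body += struct.pack("!i", len(value))  # per-row int32 length
        elif len(value) != size:
            raise ValueError(f"field {i}: expected {size} bytes, got {len(value)}")
        body += value
    return bytes(bitmap) + bytes(body)


def decode_row(field_sizes, data):
    """Decode one row; returns (values, number of bytes consumed)."""
    pos = (len(field_sizes) + 7) // 8
    bitmap = data[:pos]
    values = []
    for i, size in enumerate(field_sizes):
        if not bitmap[i // 8] & (1 << (i % 8)):
            values.append(None)
            continue
        if size == VARLENA:
            (size,) = struct.unpack_from("!i", data, pos)
            pos += 4
        values.append(data[pos:pos + size])
        pos += size
    return values, pos


# Header says: a 4-byte field, a varlena field, and a 2-byte field.
sizes = [4, VARLENA, 2]
row = [b"\x00\x00\x00\x07", b"hello", None]
encoded = encode_row(sizes, row)
decoded, consumed = decode_row(sizes, encoded)
```

Here a NULL costs one bit and a fixed-length field carries no per-row length at all, but the reader needs the header's field sizes before it can find any field boundary.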

>
>For non-NULL fields, the reader can check that the typlen matches the
>expected typlen for the destination column.  This provides a simple
>but very useful check that the data is as expected.

CRC seems like the go here...




----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/
