Re: pg_dump directory archive format / parallel pg_dump - Mailing list pgsql-hackers

From	Joachim Wieland
Subject	Re: pg_dump directory archive format / parallel pg_dump
Date	January 19, 2011 13:02:20
Msg-id	AANLkTikqrGJ0zq9Vw34-4T+70EeRpVOZt=cwAe-mzDX-@mail.gmail.com
In response to	Re: pg_dump directory archive format / parallel pg_dump (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses	Re: pg_dump directory archive format / parallel pg_dump (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List	pgsql-hackers

On Wed, Jan 19, 2011 at 7:47 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>> Here are the latest patches, all of them rebased to current HEAD.
>> Will update the commitfest app as well.
>
> What's the idea of storing the file sizes in the toc file? It looks like
> it's not used for anything.

It's part of the overall idea to make sure that files are not
inadvertently exchanged between different backups and that no file has
been truncated. In the future I'd also like to add a checksum to the TOC
so that a backup can be checked for integrity. This will cost
performance, but with the parallel backup the work can be distributed
across several processors.
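
As a rough illustration of the kind of check meant here (not the patch's
actual code), the sketch below assumes a hypothetical manifest that maps
each file name to its expected size and SHA-256 digest. The function names
and the manifest layout are assumptions for illustration only; the real TOC
format is different.

```python
import hashlib
import os


def file_sha256(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(directory, manifest):
    """Compare each file against its recorded size and checksum.

    `manifest` is a hypothetical stand-in for the TOC: a dict mapping
    file name -> (expected_size, expected_sha256_hex).
    Returns a dict of file name -> problem description.
    """
    problems = {}
    for name, (size, checksum) in manifest.items():
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            problems[name] = "missing"
        elif os.path.getsize(path) != size:
            # Cheap check first: a truncated file is caught without hashing.
            problems[name] = "size mismatch"
        elif file_sha256(path) != checksum:
            problems[name] = "checksum mismatch"
    return problems
```

The size comparison is nearly free, while the checksum pass has to read
every byte, which is the part that could be spread over parallel workers.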

> It would be nice to have this format match the tar format. At the moment,
> there's a couple of cosmetic differences:
>
> * TOC file is called "TOC", instead of "toc.dat"
>
> * blobs TOC file is called "BLOBS.TOC" instead of "blobs.toc"
>
> * each blob is stored as "blobs/<oid>.dat", instead of "blob_<oid>.dat"

That can be done easily...

> The only significant difference is that in the directory archive format,
> each data file has a header in the beginning.
>
> What are the benefits of the data file header? Would it be better to leave
> it out, so that the format would be identical to the tar format? You could
> then just tar up the directory to get a tar archive, or vice versa.

The header is there to identify a file. It is the same header that
every other pg_dump file contains, including the internal version
number and the unique backup id.
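
To make the identification idea concrete, here is a minimal sketch that
reads a per-file header and flags files whose backup id does not match the
expected one. The byte layout (a 5-byte magic string, three version bytes,
an 8-byte backup id) and the names `parse_header` and `foreign_files` are
assumptions chosen for illustration, not the layout the patch uses.

```python
import struct

# Hypothetical header layout: magic, major, minor, revision, backup id.
HEADER = struct.Struct("<5s3BQ")
MAGIC = b"PGDMP"  # assumed magic string for this sketch


def parse_header(raw):
    """Return ((major, minor, rev), backup_id) from the start of `raw`."""
    magic, major, minor, rev, backup_id = HEADER.unpack_from(raw)
    if magic != MAGIC:
        raise ValueError("not a dump data file")
    return (major, minor, rev), backup_id


def foreign_files(files, expected_id):
    """List files whose header carries a different backup id.

    `files` maps file name -> file contents as bytes.
    """
    return sorted(
        name
        for name, raw in files.items()
        if parse_header(raw)[1] != expected_id
    )
```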

The tar format doesn't support compression, so going from one to the
other would only work for an uncompressed archive, and special care
must be taken to get the order of the files in the tar archive right.

If you want to drop the header altogether, that is fine with me, but if
it's just for the tar <-> directory conversion, then I fail to see
what the use case for that would be.

A tar archive has the advantage that you can postprocess the dump data
with other tools, but for this we could also add an option that gives
you only the data part of a dump file (and uncompresses it at the same
time if compressed). Once we have that, however, the question is what
anybody would still want to use the tar format for...
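
Such an option could behave roughly like the sketch below: skip the
per-file header and hand back the payload, inflating it if it was
compressed. The fixed `HEADER_SIZE`, the use of zlib, and the name
`extract_data` are assumptions for illustration; no such option exists in
the patch as posted.

```python
import zlib

HEADER_SIZE = 16  # hypothetical fixed header length in bytes


def extract_data(raw, compressed):
    """Return only the data part of a dump file's contents.

    `raw` is the whole file as bytes; `compressed` says whether the
    payload after the header is zlib-compressed.
    """
    payload = raw[HEADER_SIZE:]
    return zlib.decompress(payload) if compressed else payload
```

The output would then be plain data that other tools can process, without
needing the tar format as an intermediate step.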

Joachim
