Re: pg_dump directory archive format / parallel pg_dump - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: pg_dump directory archive format / parallel pg_dump
Date
Msg-id 4D385319.1060005@enterprisedb.com
In response to Re: pg_dump directory archive format / parallel pg_dump  (Joachim Wieland <joe@mcknight.de>)
Responses Re: pg_dump directory archive format / parallel pg_dump  (Florian Pflug <fgp@phlo.org>)
Re: pg_dump directory archive format / parallel pg_dump  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List pgsql-hackers
On 20.01.2011 15:46, Joachim Wieland wrote:
> On Thu, Jan 20, 2011 at 6:07 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>>> The header is there to identify a file, it contains the header that
>>> every other pgdump file contains, including the internal version
>>> number and the unique backup id.
>>>
>>> The tar format doesn't support compression so going from one to the
>>> other would only work for an uncompressed archive and special care
>>> must be taken to get the order of the tar file right.
>>
>> Hmm, tar format doesn't support compression, but looks like the file format
>> issue has been thought of already: there's still code there to add .gz
>> suffix for compressed files. How about adopting that convention in the
>> directory format too? That would make an uncompressed directory format
>> compatible with the tar format.
>
> So what you could do is dump in the tar format, untar and restore in
> the directory format. I see that this sounds nice but still I am not
> sure why someone would dump to the tar format in the first place.

I'm not sure either. Maybe you want to pipe the output of "pg_dump -F t" 
via an ssh tunnel to another host, where you untar it, producing a 
directory format dump. You can then edit the directory format dump, and 
restore it back to the database without having to tar it again.

It gives you a lot of flexibility if the formats are compatible, which 
is generally good.

> But you still cannot go back from the directory archive to the tar
> archive because the standard command line tar will not respect the
> order of the objects that pg_restore expects in a tar format, right?

Hmm, I didn't realize pg_restore requires the files to be in a certain 
order in the tar file. There's no mention of that in the docs either; we 
should add that. It doesn't actually require that if you read from a 
file, but from stdin it does.

You can put files in the archive in a certain order if you list them 
explicitly on the tar command line, like "tar cf backup.tar toc.dat 
...". It's hard to know the right order, though. In practice you would 
need to do "tar tf backup.tar >files" before untarring, and use "files" 
to tar them again in the right order.
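The "tar tf" round trip described above can be sketched roughly as follows, using Python's standard tarfile module in place of the tar command line. This is only an illustration under stated assumptions: the member names and their order are placeholders rather than the order a real pg_dump archive uses, and member_order, retar_in_order and demo are hypothetical helpers, not part of pg_dump or pg_restore.

```python
import io
import os
import tarfile
import tempfile


def member_order(tar_path):
    """Return member names in the order they are stored (like "tar tf")."""
    with tarfile.open(tar_path, "r") as tar:
        return tar.getnames()


def retar_in_order(src_dir, names, out_path):
    """Write files from src_dir into a new tar in exactly the given order."""
    with tarfile.open(out_path, "w") as tar:
        for name in names:
            tar.add(os.path.join(src_dir, name), arcname=name)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # Build a stand-in archive; names and order are placeholders.
        original = os.path.join(tmp, "backup.tar")
        wanted = ["toc.dat", "2.dat", "1.dat", "restore.sql"]
        with tarfile.open(original, "w") as tar:
            for name in wanted:
                data = name.encode()
                info = tarfile.TarInfo(name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))

        # Step 1: record the order before untarring.
        names = member_order(original)

        # Step 2: unpack into a directory, where the files could be edited.
        unpacked = os.path.join(tmp, "dump")
        os.makedirs(unpacked)
        with tarfile.open(original, "r") as tar:
            for member in tar.getmembers():
                with open(os.path.join(unpacked, member.name), "wb") as out:
                    out.write(tar.extractfile(member).read())

        # Step 3: tar the directory again using the recorded order.
        rebuilt = os.path.join(tmp, "rebuilt.tar")
        retar_in_order(unpacked, names, rebuilt)
        return names, member_order(rebuilt)
```

Listing the names explicitly is what keeps the order stable; adding the directory as a whole would leave the order up to the filesystem.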

>> That seems pretty attractive anyway, because you can then dump to a
>> directory, and manually gzip the data files later.
>
> The command line gzip will probably add its own header to the file
> that pg_restore would need to strip off...

Yeah, we should write the header too. That's not hard, e.g. gzopen will 
do that automatically, or you can pass a flag to deflateInit2.
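As a rough illustration of the header point, here is a sketch using Python's zlib binding rather than the C API. It assumes the deflateInit2 flag in question is the usual zlib convention of adding 16 to windowBits to request a gzip header and trailer; the function names are hypothetical and none of this is taken from the pg_dump patch.

```python
import gzip
import zlib


def deflate_with_gzip_header(data, level=6):
    """Compress with a gzip header and trailer (windowBits = 15 + 16)."""
    comp = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS | 16)
    return comp.compress(data) + comp.flush()


def deflate_with_zlib_header(data, level=6):
    """Compress with the default zlib wrapper, which gzip tools do not read."""
    comp = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS)
    return comp.compress(data) + comp.flush()
```

Output from the first function starts with the gzip magic bytes and can be read back with gzip.decompress, so the same file should be readable by the command line gunzip as well. Output from the second needs zlib.decompress instead.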

>>> A tar archive has the advantage that you can postprocess the dump data
>>> with other tools  but for this we could also add an option that gives
>>> you only the data part of a dump file (and uncompresses it at the same
>>> time if compressed). Once we have that however, the question is what
>>> anybody would then still want to use the tar format for...
>>
>> I don't know how popular it'll be in practice, but it seems very nice to me
>> if you can do things like parallel pg_dump in directory format first, and
>> then tar it up to a file for archival.
>
> Yes, but you cannot pg_restore the archive then if it was created with
> standard tar, right?

See above: you can, unless you try to pipe it to pg_restore. In fact, 
that's listed as an advantage of the tar format over other formats in 
the pg_dump documentation.

(I'm working on this, no need to submit a new patch)

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

