Re: design for parallel backup - Mailing list pgsql-hackers

From Peter Eisentraut
Subject Re: design for parallel backup
Date
Msg-id 892b057f-bc33-6bb3-0abf-8bd5674ba901@2ndquadrant.com
In response to design for parallel backup  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: design for parallel backup  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 2020-04-15 17:57, Robert Haas wrote:
> Over at http://postgr.es/m/CADM=JehKgobEknb+_nab9179HzGj=9EiTzWMOd2mpqr_rifm0Q@mail.gmail.com
> there's a proposal for a parallel backup patch which works in the way
> that I have always thought parallel backup would work: instead of
> having a monolithic command that returns a series of tarballs, you
> request individual files from a pool of workers. Leaving aside the
> quality-of-implementation issues in that patch set, I'm starting to
> think that the design is fundamentally wrong and that we should take a
> whole different approach. The problem I see is that it makes a
> parallel backup and a non-parallel backup work very differently, and
> I'm starting to realize that there are good reasons why you might want
> them to be similar.

That would clearly be a good goal.  Non-parallel backup should ideally 
be parallel backup with one worker.

But it doesn't follow that the proposed design is wrong.  It might just 
be that the design of the existing backup should change.

I think making the wire format so heavily tied to the tar format is 
dubious.  There is nothing particularly fabulous about the tar format. 
If the server just sends a bunch of files with metadata for each file, 
the client can assemble them any way it wants: unpacked, packed into 
several tarballs like now, packed all into one tarball, packed into a 
zip file, sent to S3, etc.
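
To illustrate, a rough Python sketch.  The record layout (path, mode, 
mtime, payload) is made up for illustration, not a proposed wire format; 
the point is that one stream of per-file records supports any 
client-side assembly:

    import io
    import os
    import tarfile

    def assemble_unpacked(records, destdir):
        # Write each received file directly into a directory tree.
        for path, mode, mtime, payload in records:
            target = os.path.join(destdir, path)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as f:
                f.write(payload)
            os.chmod(target, mode)

    def assemble_tarball(records, tarpath):
        # The same stream can just as well be packed into one tarball.
        with tarfile.open(tarpath, "w") as tar:
            for path, mode, mtime, payload in records:
                info = tarfile.TarInfo(name=path)
                info.size = len(payload)
                info.mode = mode
                info.mtime = mtime
                tar.addfile(info, io.BytesIO(payload))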

Another thing I would like to see sometime is this: Pull a minimal 
basebackup, start recovery and possibly hot standby before you have 
received all the files.  When you need to access a file that's not there 
yet, request it from the server as a priority.  If you nudge the file 
order a little, perhaps using prewarm-like data, you could get a mostly 
functional standby without having to wait for the full basebackup to 
finish.  Pulling a file on request is a prerequisite for this.

> So, my new idea for parallel backup is that the server will return
> tarballs, but just more of them. Right now, you get base.tar and
> ${tablespace_oid}.tar for each tablespace. I propose that if you do a
> parallel backup, you should get base-${N}.tar and
> ${tablespace_oid}-${N}.tar for some or all values of N between 1 and
> the number of workers, with the server deciding which files ought to
> go in which tarballs.

I understand the other side of this:  Why not compress or encrypt the 
backup already on the server side?  Makes sense.  But this way seems 
weird and complicated.  If I want a backup, I want one file, not an 
unpredictable set of files.  How do I even know I have them all?  Do we 
need a meta-manifest?
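
If we do, it would not need to be much.  Purely as an illustration (the 
JSON layout here is made up), a meta-manifest could be as small as a 
list of expected archive names with checksums:

    import hashlib
    import json

    def write_meta_manifest(tarball_names, manifest_path):
        # List every archive the parallel backup produced, with a
        # checksum, so the client can verify the set is complete.
        entries = []
        for name in tarball_names:
            digest = hashlib.sha256()
            with open(name, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    digest.update(chunk)
            entries.append({"file": name, "sha256": digest.hexdigest()})
        with open(manifest_path, "w") as f:
            json.dump({"files": entries}, f, indent=2)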

A format such as ZIP would offer more flexibility, I think.  You can 
build a single target file incrementally, and you can compress or 
encrypt each member file separately, which allows doing some of the 
compression or encryption work on the server.  I'm not saying ZIP is 
perfect for this, but more thought about archive formats could open up 
some possibilities.
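
Both properties are easy to demonstrate with a stock ZIP library; this 
is only meant to show the archive-format behavior, not a backup tool:

    import zipfile

    def append_member(archive_path, name, payload, compress):
        # Opening in "a" mode grows a single archive incrementally, and
        # the compression method is chosen per member, so incompressible
        # data can be stored as-is while text files are deflated.
        method = zipfile.ZIP_DEFLATED if compress else zipfile.ZIP_STORED
        with zipfile.ZipFile(archive_path, "a") as z:
            z.writestr(name, payload, compress_type=method)

    append_member("backup.zip", "base/1/1234", b"...", compress=False)
    append_member("backup.zip", "postgresql.conf", b"# ...", compress=True)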

All things considered, we'll probably want more options and more ways of 
doing things.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


