Re: design for parallel backup - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: design for parallel backup |
| Date | |
| Msg-id | 20200504194145.lw6c34moqsykxmfj@alap3.anarazel.de |
| In response to | Re: design for parallel backup (Robert Haas <robertmhaas@gmail.com>) |
| List | pgsql-hackers |
Hi,

On 2020-05-04 14:04:32 -0400, Robert Haas wrote:
> OK, thanks. Let me see if I can summarize here. On the strength of
> previous experience, you'll probably tell me that some parts of this
> summary are wildly wrong or at least "not quite correct" but I'm going
> to try my best.

> - Server-side compression seems like it has the potential to be a
> significant win by stretching bandwidth. We likely need to do it with
> 10+ parallel threads, at least for stronger compressors, but these
> might be threads within a single PostgreSQL process rather than
> multiple separate backends.

That seems right. I think it might be reasonable to just support "compression parallelism" for zstd, as the library has all the code internally (a minimal libzstd sketch appears at the end of this message). So we basically wouldn't have to care about it.

> - Client-side cache management -- that is, use of
> posix_fadvise(DONTNEED), posix_fallocate, and sync_file_range, where
> available -- looks like it can improve write rates and CPU efficiency
> significantly. Larger block sizes show a win when used together with
> such techniques.

Yea. Alternatively direct IO, but I am not sure we want to go there for now.

> - The benefits of multiple concurrent connections remain somewhat
> elusive. Peter Eisentraut hypothesized upthread that such an approach
> might be the most practical way forward for networks with a high
> bandwidth-delay product, and I hypothesized that such an approach
> might be beneficial when there are multiple tablespaces on independent
> disks, but we don't have clear experimental support for those
> propositions. Also, both your data and mine indicate that too much
> parallelism can lead to major regressions.

I think for that we'd basically have to create two high-bandwidth nodes across the pond. My experience in the somewhat recent past is that I could saturate multi-gbit cross-Atlantic links without too much trouble, at least once I changed net.ipv4.tcp_congestion_control to something appropriate for such setups (BBR is probably the thing to use here these days).

> - Any work we do while trying to make backup super-fast should also
> lend itself to super-fast restore, possibly including parallel
> restore.

I'm not sure I see a super clear case for parallel restore in any of the experiments done so far. The only case where we know it's a clear win is when there are independent filesystems for parts of the data. There's an obvious case for parallel decompression, however.

> Compressed tarfiles don't permit random access to member files.

This is an issue for selective restores too, not just parallel restore. I'm not sure how important a case that is, although it'd certainly be useful if e.g. pg_rewind could read from compressed base backups.

> Uncompressed tarfiles do, but software that works this way is not
> commonplace.

I am not 100% sure which part you comment on not being commonplace here. Supporting randomly accessing data in tarfiles? My understanding of that is that one still has to "skip" through the entire archive, right? What not being compressed allows is to not have to read the files in between. Given the size of our data files compared to the metadata size, that's probably fine?
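For concreteness, here is a minimal sketch of that access pattern, assuming a plain ustar archive; find_tar_member() and its simplified header handling are illustrative only, not code from pg_basebackup or any existing tool. Every 512-byte header block has to be visited, but the member data in between is skipped with seeks instead of being read:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Walk an uncompressed ustar archive and return the size of the member
 * named "wanted", leaving the file position at the start of its data.
 * Every header block is read, but member contents are skipped via seeks.
 */
static long
find_tar_member(FILE *archive, const char *wanted)
{
	char		hdr[512];

	while (fread(hdr, 1, sizeof(hdr), archive) == sizeof(hdr))
	{
		char		sizebuf[13];
		unsigned long size;

		if (hdr[0] == '\0')
			break;				/* end-of-archive marker */

		/* member size is stored as a NUL/space-terminated octal string at offset 124 */
		memcpy(sizebuf, hdr + 124, 12);
		sizebuf[12] = '\0';
		size = strtoul(sizebuf, NULL, 8);

		if (strncmp(hdr, wanted, 100) == 0)
			return (long) size;	/* member data follows immediately */

		/* skip the data, which is padded to a 512-byte boundary */
		if (fseek(archive, (long) ((size + 511) & ~511UL), SEEK_CUR) != 0)
			break;
	}
	return -1;
}
```

Once whole-archive compression is applied, those seeks turn into sequential decompression of everything in between, which is why compressed tarfiles lose random access.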
> The only mainstream archive format that seems to support random access
> seems to be zip. Adopting that wouldn't be crazy, but might limit our
> choice of compression options more than we'd like.

I'm not sure that's *really* an issue - there are compression format codes in zip ([1] 4.4.5, also 4.3.14.3 & 4.5 for another approach), and several tools seem to have used that to add additional compression methods.

> A tar file of individually compressed files might be a plausible
> alternative, though there would probably be some hit to compression
> ratios for small files.

I'm not entirely sure using zip over an uncompressed tar of individually compressed files gains us all that much. AFAIU zip compresses each file individually. So the advantage would be a more efficient (less seeking) storage of archive metadata (i.e. which file is where) and that the metadata could be compressed.

> Then again, if a single, highly-efficient process can handle a
> server-to-client backup, maybe the same is true for extracting a
> compressed tarfile...

Yea. I'd expect that to be the case, at least for the single-filesystem case. Depending on the way multiple tablespaces / filesystems are handled, it could even be doable to handle that reasonably - but it'd probably be harder.

Greetings,

Andres Freund

[1] https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
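Regarding the "library has all the code internally" point about zstd earlier in this message: a minimal sketch of what driving libzstd's built-in worker threads looks like. The function name, worker count, and compression level below are illustrative assumptions, and a libzstd new enough to provide ZSTD_c_nbWorkers and built with multithreading support is assumed.

```c
#include <zstd.h>

/*
 * Compress one buffer using libzstd's internal worker threads.  With
 * ZSTD_c_nbWorkers > 0, ZSTD_compressStream2() splits the input into
 * jobs and compresses them on background threads, so the caller does
 * not have to manage any threading itself.  Returns the compressed
 * size, or 0 on failure / insufficient output space.
 */
static size_t
compress_with_workers(const void *src, size_t srclen,
					  void *dst, size_t dstcap)
{
	ZSTD_CCtx  *cctx = ZSTD_createCCtx();
	ZSTD_inBuffer in = {src, srclen, 0};
	ZSTD_outBuffer out = {dst, dstcap, 0};
	size_t		remaining;

	ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 3);	/* illustrative */
	ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 8);			/* illustrative */

	do
	{
		remaining = ZSTD_compressStream2(cctx, &out, &in, ZSTD_e_end);
		if (ZSTD_isError(remaining) || out.pos == out.size)
			break;				/* error, or output buffer full */
	} while (remaining != 0);

	ZSTD_freeCCtx(cctx);
	return (!ZSTD_isError(remaining) && remaining == 0) ? out.pos : 0;
}
```

If the library was built without multithreading, setting ZSTD_c_nbWorkers simply fails and compression proceeds single-threaded, so the calling code needs no thread management of its own either way.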