multithreaded zstd backup compression for client and server - Mailing list pgsql-hackers
From: Robert Haas
Subject: multithreaded zstd backup compression for client and server
Date:
Msg-id: CA+Tgmobj6u-nWF-j=FemygUhobhryLxf9h-wJN7W-2rSsseHNA@mail.gmail.com
In response to: Re: refactoring basebackup.c (Dipesh Pandit <dipesh.pandit@gmail.com>)
Responses:
  Re: multithreaded zstd backup compression for client and server (Andres Freund <andres@anarazel.de>)
  Re: multithreaded zstd backup compression for client and server (Justin Pryzby <pryzby@telsasoft.com>)
List: pgsql-hackers
[ Changing subject line in the hopes of attracting more eyeballs. ]

On Mon, Mar 14, 2022 at 12:11 PM Dipesh Pandit <dipesh.pandit@gmail.com> wrote:
> I tried to implement support for parallel ZSTD compression.

Here's a new patch for this. It's more of a rewrite than an update, honestly; commit ffd53659c46a54a6978bcb8c4424c1e157a2c0f1 necessitated totally different options handling, but I also redid the test cases, the documentation, and the error message.

For those who may not have been following along, here's an executive summary: libzstd offers an option for parallel compression. It's intended to be transparent: you just say you want it, and the library takes care of it for you. Since we have the ability to do backup compression on either the client or the server side, we can expose this option in both locations. That would be cool, because it would allow for really fast backup compression with a good compression ratio.

It would also mean that we would be, or really libzstd would be, spawning threads inside the PostgreSQL backend. Short of cats and dogs living together, it's hard to think of anything more terrifying, because the PostgreSQL backend is very much not thread-safe. However, a lot of the things we usually worry about when people make noises about using threads in the backend don't apply here, because the threads are hidden away behind libzstd interfaces and can't execute any PostgreSQL code. Therefore, I think it might be safe to just ... turn this on. One reason I think that is that this whole approach was recommended to me by Andres ... but that's not to say that there couldn't be problems. I worry a bit that the mere presence of threads could in some way mess things up, but I don't know what the mechanism for that would be, and I don't want to postpone shipping useful features based on nebulous fears.

In my ideal world, I'd like to push this into v15.
I've done a lot of work to improve the backup code in this release, and this is actually a very small change, yet one that potentially enables the project to get a lot more value out of the work that has already been committed. That said, I also don't want to break the world, so if you have an idea what this would break, please tell me.

For those curious as to how this affects performance and backup size, I loaded up the UK land registry database. That creates a 3769MB database. Then I backed it up using client-side compression and server-side compression using the various different algorithms that are supported in the master branch, plus parallel zstd.

no compression: 3.7GB, 9 seconds
gzip: 1.5GB, 140 seconds with server-side, 141 seconds with client-side
lz4: 2.0GB, 13 seconds with server-side, 12 seconds with client-side
zstd, client-side: 1.7GB, 17 seconds
zstd, server-side: 1.3GB, 25 seconds
parallel zstd, 4 workers, client-side: 1.7GB, 7.5 seconds
parallel zstd, 4 workers, server-side: 1.3GB, 7.2 seconds

For both parallel and non-parallel zstd compression, I see differences in the compressed size depending on where the compression is done. I don't know whether this is expected behavior of the zstd library or a bug. Both files uncompress OK and pass pg_verifybackup, but that doesn't mean we're not, for example, selecting different compression levels where we shouldn't be. I'll try to figure out what's going on here.

Notice that compressing the backup with parallel zstd is actually faster than taking an uncompressed backup, even though this test is all being run on the same machine. That's kind of crazy to me: the parallel compression is so fast that we save more time on I/O than we spend on compression. This assumes, of course, that you have plenty of CPU resources and limited I/O resources, which won't be true for everyone, but it's not an unusual situation.

I think the documentation changes in this patch might not be quite up to scratch.
I think there's a brewing problem here: as we add more compression options, whether or not that happens in this release, and regardless of what specific options we add, the way things are structured right now, we're going to end up either duplicating a bunch of stuff between the pg_basebackup documentation and the BASE_BACKUP documentation, or else one of those places is going to end up lacking information that someone reading it might like to have. I'm not exactly sure what to do about this, though.

This patch contains a trivial adjustment to PostgreSQL::Test::Cluster::run_log to make it return a useful value instead of not. I think that should be pulled out and committed independently regardless of what happens to this patch overall, and possibly back-patched.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com