Re: design for parallel backup - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: design for parallel backup
Msg-id: CA+TgmobZGVGXVrHSmCwSxaewQNShLa5gQXxgh0oSzoOaPGmwmw@mail.gmail.com
In response to: Re: design for parallel backup (Andres Freund <andres@anarazel.de>)
Responses: Re: design for parallel backup (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Tue, Apr 21, 2020 at 11:36 AM Andres Freund <andres@anarazel.de> wrote:
> It's all CRC overhead. I don't see a difference with
> --manifest-checksums=none anymore. We really should look for a better
> "fast" checksum.

Hmm, OK. I'm wondering exactly what you tested here. Was this over your 20GiB/s connection between laptop and workstation, or was this local TCP? Also, was the database being read from persistent storage, or was it RAM-cached? How do you expect to take advantage of I/O parallelism without multiple processes/connections?

Meanwhile, I did some local-only testing on my new 16GB MacBook Pro laptop with all combinations of:

- UNIX socket, local TCP socket, local TCP socket with SSL
- Plain format, tar format, tar format with gzip
- No manifest ("omit"), manifest with no checksums, manifest with CRC-32C checksums, manifest with SHA256 checksums

The database is a fresh scale-factor 1000 pgbench database. No concurrent database load. Observations:

- A UNIX socket was slower than a local TCP socket, and about the same speed as a TCP socket with SSL.
- CRC-32C is about 10% slower than no manifest and/or no checksums in the manifest. SHA256 is 1.5-2x slower, but less when compression is also used (see below).
- Plain format is a little slower than tar format; tar with gzip is typically >~5x slower, but less when the checksum algorithm is SHA256 (again, see below).
- SHA256 + tar format with gzip is the slowest combination, but it's "only" about 15% slower than no manifest, and about 3.3x slower than no compression, presumably because the checksumming is slowing down the server and the compression is slowing down the client.
- The fastest speeds I see in any test are ~650MB/s, and the slowest are ~65MB/s, obviously benefiting greatly from the fact that this is a local-only test.
- The time for a raw cp -R of the backup directory is about 10s, and the fastest time to take a backup (tcp+tar+m:omit) is about 22s.
- In all cases I've checked so far, both pg_basebackup and the server backend are pegged at 98-100% CPU usage. I haven't looked into where that time is going yet.

Full results and test script attached. I and/or my colleagues will try to test out some other environments, but I'm not sure we have easy access to anything as high-powered as a 20GiB/s interconnect.

It seems to me that the interesting cases may involve having lots of available CPUs and lots of disk spindles, but a comparatively slow pipe between the machines. I mean, if it takes 36 hours to read the data from disk, you can't realistically expect to complete a full backup in less than 36 hours. Incremental backup might help, but otherwise you're just dead. On the other hand, if you can read the data from the disk in 2 hours but it takes 36 hours to complete a backup, it seems like you have more justification for thinking that the backup software could perhaps do better. In such cases efficient server-side compression may help a lot, but even then, I wonder whether you can read the data at maximum speed with only a single process. I tend to doubt it, but I guess you only have to be fast enough to saturate the network. Hmm.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
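The disk-versus-network reasoning in the last paragraph can be sketched as a back-of-envelope pipeline model. This is an illustrative sketch only, not part of the thread: the `backup_hours` function, its parameters, and the example figures are assumptions chosen to reproduce the "36 hours" and "2 hours" cases mentioned above.

```python
def backup_hours(db_gib, disk_mibps, net_mibps,
                 compress_ratio=1.0, compress_mibps=None):
    """Rough wall-clock hours for a streaming base backup (hypothetical model).

    Assumes the stages are pipelined, so the slowest stage dominates:
      db_gib         -- database size in GiB
      disk_mibps     -- sequential read throughput, MiB/s
      net_mibps      -- network throughput, MiB/s
      compress_ratio -- output size / input size (1.0 = no compression)
      compress_mibps -- compressor input throughput, MiB/s (None = not limiting)
    """
    mib = db_gib * 1024
    stages = [
        mib / disk_mibps,                    # reading from disk
        (mib * compress_ratio) / net_mibps,  # shipping the (compressed) bytes
    ]
    if compress_mibps is not None:
        stages.append(mib / compress_mibps)  # CPU-bound compression
    return max(stages) / 3600

# "36 hours to read from disk": a 10 TiB database on an ~80 MiB/s disk.
# The disk dominates, so faster backup software cannot help.
slow_disk = backup_hours(db_gib=10 * 1024, disk_mibps=80, net_mibps=1000)

# "2 hours to read, 36 hours to back up": same database, fast disk,
# ~80 MiB/s pipe. Now the network dominates, and 4x server-side
# compression cuts the transfer stage by 4x.
slow_net = backup_hours(db_gib=10 * 1024, disk_mibps=1500, net_mibps=80)
with_comp = backup_hours(db_gib=10 * 1024, disk_mibps=1500, net_mibps=80,
                         compress_ratio=0.25)
```

Under these made-up numbers, `slow_disk` and `slow_net` both come out near 36 hours, while `with_comp` drops to roughly 9, which is the sense in which server-side compression only pays off when the pipe, not the disk, is the bottleneck.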