Thread: design for parallel backup
Hi,

Over at http://postgr.es/m/CADM=JehKgobEknb+_nab9179HzGj=9EiTzWMOd2mpqr_rifm0Q@mail.gmail.com there's a proposal for a parallel backup patch which works in the way that I have always thought parallel backup would work: instead of having a monolithic command that returns a series of tarballs, you request individual files from a pool of workers. Leaving aside the quality-of-implementation issues in that patch set, I'm starting to think that the design is fundamentally wrong and that we should take a whole different approach. The problem I see is that it makes a parallel backup and a non-parallel backup work very differently, and I'm starting to realize that there are good reasons why you might want them to be similar.

Specifically, as Andres recently pointed out[1], almost anything that you might want to do on the client side, you might also want to do on the server side. We already have an option to let the client compress each tarball, but you might also want the server to, say, compress each tarball[2]. Similarly, you might want either the client or the server to be able to encrypt each tarball, or compress but with a different compression algorithm than gzip. If, as is presently the case, the server is always returning a set of tarballs, it's pretty easy to see how to make this work in the same way on either the client or the server, but if the server returns a set of tarballs in non-parallel backup cases, and a set of individual files in parallel backup cases, it's a lot harder to see how any sort of server-side processing should work, or how the same mechanism could be used on either the client side or the server side.

So, my new idea for parallel backup is that the server will return tarballs, but just more of them. Right now, you get base.tar and ${tablespace_oid}.tar for each tablespace. I propose that if you do a parallel backup, you should get base-${N}.tar and ${tablespace_oid}-${N}.tar for some or all values of N between 1 and the number of workers, with the server deciding which files ought to go in which tarballs. This is more or less the naming convention that BART uses for its parallel backup implementation, which, incidentally, I did not write. I don't really care if we pick something else, but it seems like a sensible choice. The reason why I say "some or all" is that some workers might not get any of the data for a given tablespace. In fact, it's probably desirable to have different workers work on different tablespaces as far as possible, to maximize parallel I/O, but it's quite likely that you will have more workers than tablespaces. So you might end up, with pg_basebackup -j4, having the server send you base-1.tar and base-2.tar and base-4.tar, but not base-3.tar, because worker 3 spent all of its time on user-defined tablespaces, or was just out to lunch.

Now, if you use -Fp, those tar files are just going to get extracted anyway by pg_basebackup itself, so you won't even know they exist. However, if you use -Ft, you're going to end up with more files than before. This seems like something of a wart, because you wouldn't necessarily expect that the set of output files produced by a backup would depend on the degree of parallelism used to take it. However, I'm not sure I see a reasonable alternative.
The client could try to glue all of the related tar files sent by the server together into one big tarfile, but that seems like it would slow down the process of writing the backup by forcing the different server connections to compete for the right to write to the same file. Moreover, if you end up needing to restore the backup, having a bunch of smaller tar files instead of one big one means you can try to untar them in parallel if you like, so it seems not impossible that it could be advantageous to have them split in that case as well. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] http://postgr.es/m/20200412191702.ul7ohgv5gus3tsvo@alap3.anarazel.de [2] https://www.postgresql.org/message-id/20190823172637.GA16436%40tamriel.snowman.net
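To make the proposed naming slightly more concrete, here is a tiny sketch of how a client might derive the archive name it expects from a given worker; the helper and its signature are purely illustrative and are not taken from any patch in this thread.

#include <stdio.h>

/*
 * Hypothetical illustration of the proposed archive naming: worker N of a
 * parallel base backup contributes base-N.tar, plus <tablespace_oid>-N.tar
 * for any user-defined tablespace it touched, while a non-parallel backup
 * keeps the existing names (base.tar, <tablespace_oid>.tar).
 */
static void
expected_archive_name(char *buf, size_t len, const char *tsoid, int worker)
{
	const char *prefix = tsoid ? tsoid : "base";

	if (worker <= 0)			/* non-parallel backup */
		snprintf(buf, len, "%s.tar", prefix);
	else
		snprintf(buf, len, "%s-%d.tar", prefix, worker);
}

int
main(void)
{
	char		name[64];

	expected_archive_name(name, sizeof(name), NULL, 2);
	printf("%s\n", name);		/* prints "base-2.tar" */
	expected_archive_name(name, sizeof(name), "16385", 4);
	printf("%s\n", name);		/* prints "16385-4.tar" */
	return 0;
}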
On Wed, Apr 15, 2020 at 9:27 PM Robert Haas <robertmhaas@gmail.com> wrote: > > Over at http://postgr.es/m/CADM=JehKgobEknb+_nab9179HzGj=9EiTzWMOd2mpqr_rifm0Q@mail.gmail.com > there's a proposal for a parallel backup patch which works in the way > that I have always thought parallel backup would work: instead of > having a monolithic command that returns a series of tarballs, you > request individual files from a pool of workers. Leaving aside the > quality-of-implementation issues in that patch set, I'm starting to > think that the design is fundamentally wrong and that we should take a > whole different approach. The problem I see is that it makes a > parallel backup and a non-parallel backup work very differently, and > I'm starting to realize that there are good reasons why you might want > them to be similar. > > Specifically, as Andres recently pointed out[1], almost anything that > you might want to do on the client side, you might also want to do on > the server side. We already have an option to let the client compress > each tarball, but you might also want the server to, say, compress > each tarball[2]. Similarly, you might want either the client or the > server to be able to encrypt each tarball, or compress but with a > different compression algorithm than gzip. If, as is presently the > case, the server is always returning a set of tarballs, it's pretty > easy to see how to make this work in the same way on either the client > or the server, but if the server returns a set of tarballs in > non-parallel backup cases, and a set of tarballs in parallel backup > cases, it's a lot harder to see how that any sort of server-side > processing should work, or how the same mechanism could be used on > either the client side or the server side. > > So, my new idea for parallel backup is that the server will return > tarballs, but just more of them. Right now, you get base.tar and > ${tablespace_oid}.tar for each tablespace. I propose that if you do a > parallel backup, you should get base-${N}.tar and > ${tablespace_oid}-${N}.tar for some or all values of N between 1 and > the number of workers, with the server deciding which files ought to > go in which tarballs. > It is not apparent how you are envisioning this division on the server-side. I think in the currently proposed patch, each worker on the client-side requests the specific files. So, how are workers going to request such numbered files and how we will ensure that the work division among workers is fair? > This is more or less the naming convention that > BART uses for its parallel backup implementation, which, incidentally, > I did not write. I don't really care if we pick something else, but it > seems like a sensible choice. The reason why I say "some or all" is > that some workers might not get any of the data for a given > tablespace. In fact, it's probably desirable to have different workers > work on different tablespaces as far as possible, to maximize parallel > I/O, but it's quite likely that you will have more workers than > tablespaces. So you might end up, with pg_basebackup -j4, having the > server send you base-1.tar and base-2.tar and base-4.tar, but not > base-3.tar, because worker 3 spent all of its time on user-defined > tablespaces, or was just out to lunch. > > Now, if you use -Fp, those tar files are just going to get extracted > anyway by pg_basebackup itself, so you won't even know they exist. > However, if you use -Ft, you're going to end up with more files than > before. 
> This seems like something of a wart, because you wouldn't
> necessarily expect that the set of output files produced by a backup
> would depend on the degree of parallelism used to take it. However,
> I'm not sure I see a reasonable alternative. The client could try to
> glue all of the related tar files sent by the server together into one
> big tarfile, but that seems like it would slow down the process of
> writing the backup by forcing the different server connections to
> compete for the right to write to the same file.
>

I think it also depends to some extent on what we decide in the nearby thread [1] related to support of compression/encryption. Say, if we want to support a new compression method on the client side, then we need to process the contents of each tar file anyway, in which case combining them into a single tar file might be okay, but I am not sure what the right thing is here. I think this part needs some more thought.

[1] - https://www.postgresql.org/message-id/CA%2BTgmoYr7%2B-0_vyQoHbTP5H3QGZFgfhnrn6ewDteF%3DkUqkG%3DFw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Apr 20, 2020 at 8:50 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> It is not apparent how you are envisioning this division on the
> server-side. I think in the currently proposed patch, each worker on
> the client-side requests the specific files. So, how are workers going
> to request such numbered files and how we will ensure that the work
> division among workers is fair?

I think that the workers would just say "give me my share of the base backup" and then the server would divide up the files as it went. It would probably keep a queue of whatever files still need to be processed in shared memory and each process would pop items from the queue to send to its client.

> I think it also depends to some extent on what we decide in the nearby
> thread [1] related to support of compression/encryption. Say, if we
> want to support a new compression method on the client side, then we
> need to process the contents of each tar file anyway, in which case
> combining them into a single tar file might be okay, but I am not sure
> what the right thing is here. I think this part needs some more thought.

Yes, it needs more thought, but the central idea is to try to create something that is composable. For example, if we have code to do LZ4 compression and code to do GPG encryption, then we should be able to do both without adding any more code. Ideally, we should also be able to do either of those operations either on the client side or on the server side, using the same code either way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
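A rough sketch of the work-distribution scheme Robert describes above, assuming (hypothetically) a single shared list of files from which each worker pops the next entry. In the real design the queue would live in PostgreSQL shared memory and be drained by walsender processes; threads and fixed example file names are used here only to keep the sketch self-contained.

#include <pthread.h>
#include <stdio.h>

#define NFILES 4
#define NWORKERS 3

typedef struct BackupQueue
{
	const char *files[NFILES];
	int			next;			/* index of the next unassigned file */
	pthread_mutex_t lock;
} BackupQueue;

static BackupQueue queue = {
	{"base/1/1259", "base/1/2608", "base/1/1249", "global/pg_control"},
	0,
	PTHREAD_MUTEX_INITIALIZER
};

/* Pop the next file, or NULL once the whole backup has been handed out. */
static const char *
pop_next_file(BackupQueue *q)
{
	const char *result = NULL;

	pthread_mutex_lock(&q->lock);
	if (q->next < NFILES)
		result = q->files[q->next++];
	pthread_mutex_unlock(&q->lock);
	return result;
}

static void *
worker_main(void *arg)
{
	long		worker_id = (long) arg;
	const char *file;

	/* each worker keeps pulling work until the queue is empty */
	while ((file = pop_next_file(&queue)) != NULL)
		printf("worker %ld streams %s into base-%ld.tar\n",
			   worker_id, file, worker_id);
	return NULL;
}

int
main(void)
{
	pthread_t	workers[NWORKERS];

	for (long i = 0; i < NWORKERS; i++)
		pthread_create(&workers[i], NULL, worker_main, (void *) (i + 1));
	for (int i = 0; i < NWORKERS; i++)
		pthread_join(workers[i], NULL);
	return 0;
}

With this shape of design, fairness falls out of the pull model: a worker that lands on a big file simply pops fewer entries, and no up-front static division of the file list is needed.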
On 2020-04-15 17:57, Robert Haas wrote: > Over at http://postgr.es/m/CADM=JehKgobEknb+_nab9179HzGj=9EiTzWMOd2mpqr_rifm0Q@mail.gmail.com > there's a proposal for a parallel backup patch which works in the way > that I have always thought parallel backup would work: instead of > having a monolithic command that returns a series of tarballs, you > request individual files from a pool of workers. Leaving aside the > quality-of-implementation issues in that patch set, I'm starting to > think that the design is fundamentally wrong and that we should take a > whole different approach. The problem I see is that it makes a > parallel backup and a non-parallel backup work very differently, and > I'm starting to realize that there are good reasons why you might want > them to be similar. That would clearly be a good goal. Non-parallel backup should ideally be parallel backup with one worker. But it doesn't follow that the proposed design is wrong. It might just be that the design of the existing backup should change. I think making the wire format so heavily tied to the tar format is dubious. There is nothing particularly fabulous about the tar format. If the server just sends a bunch of files with metadata for each file, the client can assemble them in any way they want: unpacked, packed in several tarball like now, packed all in one tarball, packed in a zip file, sent to S3, etc. Another thing I would like to see sometime is this: Pull a minimal basebackup, start recovery and possibly hot standby before you have received all the files. When you need to access a file that's not there yet, request that as a priority from the server. If you nudge the file order a little with perhaps prewarm-like data, you could get a mostly functional standby without having to wait for the full basebackup to finish. Pull a file on request is a requirement for this. > So, my new idea for parallel backup is that the server will return > tarballs, but just more of them. Right now, you get base.tar and > ${tablespace_oid}.tar for each tablespace. I propose that if you do a > parallel backup, you should get base-${N}.tar and > ${tablespace_oid}-${N}.tar for some or all values of N between 1 and > the number of workers, with the server deciding which files ought to > go in which tarballs. I understand the other side of this: Why not compress or encrypt the backup already on the server side? Makes sense. But this way seems weird and complicated. If I want a backup, I want one file, not an unpredictable set of files. How do I even know I have them all? Do we need a meta-manifest? A format such as ZIP would offer more flexibility, I think. You can build a single target file incrementally, you can compress or encrypt each member file separately, thus allowing some compression etc. on the server. I'm not saying it's perfect for this, but some more thinking about the archive formats would potentially give some possibilities. All things considered, we'll probably want more options and more ways of doing things. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
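For what it's worth, the kind of file-by-file framing Peter sketches might look something like the following; this is purely hypothetical and does not correspond to any actual or proposed wire-protocol message.

#include <stdint.h>

/*
 * Hypothetical sketch of per-file framing: instead of a tar stream, each
 * file is announced with its own metadata message and then streamed as raw
 * content chunks.  Nothing here matches a real PostgreSQL protocol message;
 * it only illustrates the idea that the client, not the server, decides how
 * to package the files.
 */
typedef enum BackupMsgType
{
	BACKUP_MSG_FILE_BEGIN,		/* metadata for the next file */
	BACKUP_MSG_FILE_DATA,		/* a chunk of file content */
	BACKUP_MSG_FILE_END,		/* file complete (could carry a checksum) */
	BACKUP_MSG_BACKUP_END		/* no more files */
} BackupMsgType;

typedef struct BackupFileBegin
{
	char		path[1024];		/* path relative to the data directory */
	uint64_t	size;			/* size in bytes at the time of the scan */
	uint32_t	mode;			/* permissions to restore */
	int64_t		mtime;			/* last modification time, unix epoch */
	uint32_t	tablespace_oid; /* 0 for the main data directory */
} BackupFileBegin;

/*
 * With framing like this, pg_basebackup could write the stream unpacked,
 * repack it into one or more tarballs, build a zip archive, or hand each
 * file to an uploader, without the server caring which.
 */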
Hi, On 2020-04-15 11:57:29 -0400, Robert Haas wrote: > Over at http://postgr.es/m/CADM=JehKgobEknb+_nab9179HzGj=9EiTzWMOd2mpqr_rifm0Q@mail.gmail.com > there's a proposal for a parallel backup patch which works in the way > that I have always thought parallel backup would work: instead of > having a monolithic command that returns a series of tarballs, you > request individual files from a pool of workers. Leaving aside the > quality-of-implementation issues in that patch set, I'm starting to > think that the design is fundamentally wrong and that we should take a > whole different approach. The problem I see is that it makes a > parallel backup and a non-parallel backup work very differently, and > I'm starting to realize that there are good reasons why you might want > them to be similar. > > Specifically, as Andres recently pointed out[1], almost anything that > you might want to do on the client side, you might also want to do on > the server side. We already have an option to let the client compress > each tarball, but you might also want the server to, say, compress > each tarball[2]. Similarly, you might want either the client or the > server to be able to encrypt each tarball, or compress but with a > different compression algorithm than gzip. If, as is presently the > case, the server is always returning a set of tarballs, it's pretty > easy to see how to make this work in the same way on either the client > or the server, but if the server returns a set of tarballs in > non-parallel backup cases, and a set of tarballs in parallel backup > cases, it's a lot harder to see how that any sort of server-side > processing should work, or how the same mechanism could be used on > either the client side or the server side. > > So, my new idea for parallel backup is that the server will return > tarballs, but just more of them. Right now, you get base.tar and > ${tablespace_oid}.tar for each tablespace. I propose that if you do a > parallel backup, you should get base-${N}.tar and > ${tablespace_oid}-${N}.tar for some or all values of N between 1 and > the number of workers, with the server deciding which files ought to > go in which tarballs. This is more or less the naming convention that > BART uses for its parallel backup implementation, which, incidentally, > I did not write. I don't really care if we pick something else, but it > seems like a sensible choice. The reason why I say "some or all" is > that some workers might not get any of the data for a given > tablespace. In fact, it's probably desirable to have different workers > work on different tablespaces as far as possible, to maximize parallel > I/O, but it's quite likely that you will have more workers than > tablespaces. So you might end up, with pg_basebackup -j4, having the > server send you base-1.tar and base-2.tar and base-4.tar, but not > base-3.tar, because worker 3 spent all of its time on user-defined > tablespaces, or was just out to lunch. One question I have not really seen answered well: Why do we want parallelism here. Or to be more precise: What do we hope to accelerate by making what part of creating a base backup parallel. There's several potential bottlenecks, and I think it's important to know the design priorities to evaluate a potential design. 
Bottlenecks (not ordered by importance):
- compression performance (likely best solved by multiple compression threads and a better compression algorithm)
- unencrypted network performance (I'd like to see benchmarks showing in which cases multiple TCP streams help / at which bandwidth it starts to help)
- encrypted network performance, i.e. SSL overhead (not sure this is an important problem on modern hardware, given hardware accelerated AES)
- checksumming overhead (a serious problem for cryptographic checksums, but presumably not for others)
- file IO (presumably multiple facets here, number of concurrent in-flight IOs, kernel page cache overhead when reading TBs of data)

I'm not really convinced that a design addressing the more crucial bottlenecks really needs multiple fe/be connections. But that seems to have been the focus of the discussion so far.

Greetings,

Andres Freund
Thanks for your thoughts. On Mon, Apr 20, 2020 at 4:02 PM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > That would clearly be a good goal. Non-parallel backup should ideally > be parallel backup with one worker. Right. > But it doesn't follow that the proposed design is wrong. It might just > be that the design of the existing backup should change. > > I think making the wire format so heavily tied to the tar format is > dubious. There is nothing particularly fabulous about the tar format. > If the server just sends a bunch of files with metadata for each file, > the client can assemble them in any way they want: unpacked, packed in > several tarball like now, packed all in one tarball, packed in a zip > file, sent to S3, etc. Yeah, that's true, and I agree that there's something a little unsatisfying and dubious about the current approach. However, I am not sure that there is sufficient reason to change it to something else, either. After all, what purpose would such a change serve? The client can already do any of the things you mention here, provided that it can interpret the data sent by the server, and pg_basebackup already has code to do exactly this. Right now, we have pretty good pg_basebackup compatibility across server versions, and if we change the format, then we won't, unless we make both the client and the server understand both formats. I'm not completely averse to such a change if it has sufficient benefits to make it worthwhile, but it's not clear to me that it does. > Another thing I would like to see sometime is this: Pull a minimal > basebackup, start recovery and possibly hot standby before you have > received all the files. When you need to access a file that's not there > yet, request that as a priority from the server. If you nudge the file > order a little with perhaps prewarm-like data, you could get a mostly > functional standby without having to wait for the full basebackup to > finish. Pull a file on request is a requirement for this. True, but that can always be implemented as a separate feature. I won't be sad if that feature happens to fall out of work in this area, but I don't think the possibility that we'll some day have such advanced wizardry should bias the design of this feature very much. One pretty major problem with this is that you can't open for connections until you've reached a consistent state, and you can't say that you're in a consistent state until you've replayed all the WAL generated during the backup, and you can't say that you're at the end of the backup until you've copied all the files. So, without some clever idea, this would only allow you to begin replay sooner; it would not allow you to accept connections sooner. I suspect that makes it significantly less appealing. > > So, my new idea for parallel backup is that the server will return > > tarballs, but just more of them. Right now, you get base.tar and > > ${tablespace_oid}.tar for each tablespace. I propose that if you do a > > parallel backup, you should get base-${N}.tar and > > ${tablespace_oid}-${N}.tar for some or all values of N between 1 and > > the number of workers, with the server deciding which files ought to > > go in which tarballs. > > I understand the other side of this: Why not compress or encrypt the > backup already on the server side? Makes sense. But this way seems > weird and complicated. If I want a backup, I want one file, not an > unpredictable set of files. How do I even know I have them all? Do we > need a meta-manifest? 
Yes, that's a problem, but... > A format such as ZIP would offer more flexibility, I think. You can > build a single target file incrementally, you can compress or encrypt > each member file separately, thus allowing some compression etc. on the > server. I'm not saying it's perfect for this, but some more thinking > about the archive formats would potentially give some possibilities. ...I don't think this really solves anything. I expect you would have to write the file more or less sequentially, and I think that Amdahl's law will not be kind to us. > All things considered, we'll probably want more options and more ways of > doing things. Yes. That's why I'm trying to figure out how to create a flexible framework. Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Apr 20, 2020 at 4:19 PM Andres Freund <andres@anarazel.de> wrote: > Why do we want parallelism here. Or to be more precise: What do we hope > to accelerate by making what part of creating a base backup > parallel. There's several potential bottlenecks, and I think it's > important to know the design priorities to evaluate a potential design. > > Bottlenecks (not ordered by importance): > - compression performance (likely best solved by multiple compression > threads and a better compression algorithm) > - unencrypted network performance (I'd like to see benchmarks showing in > which cases multiple TCP streams help / at which bandwidth it starts > to help) > - encrypted network performance, i.e. SSL overhead (not sure this is an > important problem on modern hardware, given hardware accelerated AES) > - checksumming overhead (a serious problem for cryptographic checksums, > but presumably not for others) > - file IO (presumably multiple facets here, number of concurrent > in-flight IOs, kernel page cache overhead when reading TBs of data) > > I'm not really convinced that design addressing the more crucial > bottlenecks really needs multiple fe/be connections. But that seems to > be have been the focus of the discussion so far. I haven't evaluated this. Both BART and pgBackRest offer parallel backup options, and I'm pretty sure both were performance tested and found to be very significantly faster, but I didn't write the code for either, nor have I evaluated either to figure out exactly why it was faster. My suspicion is that it has mostly to do with adequately utilizing the hardware resources on the server side. If you are network-constrained, adding more connections won't help, unless there's something shaping the traffic which can be gamed by having multiple connections. However, as things stand today, at any given point in time the base backup code on the server will EITHER be attempting a single filesystem I/O or a single network I/O, and likewise for the client. If a backup client - either current or hypothetical - is compressing and encrypting, then it doesn't have either a filesystem I/O or a network I/O in progress while it's doing so. You take not only the hit of the time required for compression and/or encryption, but also use that much less of the available network and/or I/O capacity. While I agree that some of these problems could likely be addressed in other ways, parallelism seems to offer an approach that could solve multiple issues at the same time. If you want to address it without that, you need asynchronous filesystem I/O and asynchronous network I/O and both of those on both the client and server side, plus multithreaded compression and multithreaded encryption and maybe some other things. That sounds pretty hairy and hard to get right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
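To illustrate what the non-parallel alternative Robert mentions would involve, here is a minimal sketch of overlapping the read of the next chunk with the send of the current one using POSIX AIO. The function, the input file name, and the use of stdout as a stand-in for the socket are all hypothetical, and partial writes and error handling are glossed over; it is not taken from the server.

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (128 * 1024)

/*
 * Keep a read of the *next* chunk in flight while the *current* chunk is
 * being written to the network, so the process is not strictly alternating
 * between one filesystem I/O and one network I/O.
 */
static void
stream_file_overlapped(int fd, int sock)
{
	static char bufs[2][CHUNK];
	struct aiocb cb;
	int			cur = 0;
	off_t		offset = 0;

	/* start the first read */
	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = bufs[cur];
	cb.aio_nbytes = CHUNK;
	cb.aio_offset = offset;
	(void) aio_read(&cb);

	for (;;)
	{
		const struct aiocb *const list[] = {&cb};
		ssize_t		nread;
		int			prev;

		/* wait for the outstanding read to finish */
		while (aio_error(&cb) == EINPROGRESS)
			aio_suspend(list, 1, NULL);
		nread = aio_return(&cb);
		if (nread <= 0)
			break;				/* EOF or error */
		offset += nread;

		/* kick off the read of the next chunk into the other buffer ... */
		prev = cur;
		cur = 1 - cur;
		memset(&cb, 0, sizeof(cb));
		cb.aio_fildes = fd;
		cb.aio_buf = bufs[cur];
		cb.aio_nbytes = CHUNK;
		cb.aio_offset = offset;
		(void) aio_read(&cb);

		/* ... and overlap it with sending the chunk we already have */
		(void) write(sock, bufs[prev], nread);
	}
}

int
main(void)
{
	int			fd = open("base/1/1259", O_RDONLY);	/* hypothetical input */

	if (fd >= 0)
		stream_file_overlapped(fd, STDOUT_FILENO);
	return 0;
}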
Hi, On 2020-04-20 16:36:16 -0400, Robert Haas wrote: > My suspicion is that it has mostly to do with adequately utilizing the > hardware resources on the server side. If you are network-constrained, > adding more connections won't help, unless there's something shaping > the traffic which can be gamed by having multiple connections. > However, as things stand today, at any given point in time the base > backup code on the server will EITHER be attempting a single > filesystem I/O or a single network I/O, and likewise for the client. Well, kinda, but not really. Both file reads (server)/writes(client) and network send(server)/recv(client) are buffered by the OS, and the file IO is entirely sequential. That's not true for checksum computations / compressions to the same degree. They're largely bottlenecked in userland, without the kernel doing as much async work. > If a backup client - either current or hypothetical - is compressing > and encrypting, then it doesn't have either a filesystem I/O or a > network I/O in progress while it's doing so. You take not only the hit > of the time required for compression and/or encryption, but also use > that much less of the available network and/or I/O capacity. I don't think it's really the time for network/file I/O that's the issue. Sure memcpy()'ing from the kernel takes time, but compared to encryption/compression it's not that much. Especially for compression, it's not really lack of cycles for networking that prevent a higher throughput, it's that after buffering a few MB there's just no point buffering more, given compression will plod along with 20-100MB/s. > While I agree that some of these problems could likely be addressed in > other ways, parallelism seems to offer an approach that could solve > multiple issues at the same time. If you want to address it without > that, you need asynchronous filesystem I/O and asynchronous network > I/O and both of those on both the client and server side, plus > multithreaded compression and multithreaded encryption and maybe some > other things. That sounds pretty hairy and hard to get right. I'm not really convinced. You're complicating the wire protocol by having multiple tar files with overlapping contents. With the consequence that clients need additional logic to deal with that. We'll not get one manifest, but multiple ones, etc. We already do network IO non-blocking, and leaving the copying to kernel, the kernel does the actual network work asynchronously. Except for file boundaries the kernel does asynchronous read IO for us (but we should probably hint it to do that even at the start of a new file). 
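The readahead hint alluded to just above could be as simple as a posix_fadvise() call when each file is opened; where exactly such a call would belong in the server is not settled in this thread, so the following only shows the API.

#include <fcntl.h>

/*
 * Tell the kernel, right when a new relation file is opened, that it will
 * be read sequentially and that it can start fetching it, so readahead
 * does not have to ramp up from scratch at every file boundary.
 */
static void
hint_sequential_read(int fd, off_t filesize)
{
#ifdef POSIX_FADV_SEQUENTIAL
	(void) posix_fadvise(fd, 0, filesize, POSIX_FADV_SEQUENTIAL);
#endif
#ifdef POSIX_FADV_WILLNEED
	(void) posix_fadvise(fd, 0, filesize, POSIX_FADV_WILLNEED);
#endif
}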
I think we're quite a bit away from where we need to worry about making encryption multi-threaded:

andres@awork3:~/src/postgresql$ openssl speed -evp aes-256-ctr
Doing aes-256-ctr for 3s on 16 size blocks: 81878709 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 64 size blocks: 71062203 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 256 size blocks: 31738391 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 1024 size blocks: 10043519 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 8192 size blocks: 1346933 aes-256-ctr's in 3.00s
Doing aes-256-ctr for 3s on 16384 size blocks: 674680 aes-256-ctr's in 3.00s
OpenSSL 1.1.1f  31 Mar 2020
built on: Tue Mar 31 21:59:59 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-hsg853/openssl-1.1.1f=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-ctr     436686.45k  1515993.66k  2708342.70k  3428187.82k  3678025.05k  3684652.37k

So that really just leaves compression (and perhaps cryptographic checksumming). Given that we can provide nearly all of the benefits of multi-stream parallelism in a compatible way by using parallelism/threads at that level, I just have a hard time believing the complexity of doing those tasks in parallel is bigger than multi-stream parallelism. And I'd be fairly unsurprised if you'd end up with a lot more "bubbles" in the pipeline when using multi-stream parallelism.

Greetings,

Andres Freund
On Tue, Apr 21, 2020 at 2:40 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2020-04-20 16:36:16 -0400, Robert Haas wrote:
>
> > If a backup client - either current or hypothetical - is compressing
> > and encrypting, then it doesn't have either a filesystem I/O or a
> > network I/O in progress while it's doing so. You take not only the hit
> > of the time required for compression and/or encryption, but also use
> > that much less of the available network and/or I/O capacity.
>
> I don't think it's really the time for network/file I/O that's the
> issue. Sure memcpy()'ing from the kernel takes time, but compared to
> encryption/compression it's not that much. Especially for compression,
> it's not really lack of cycles for networking that prevent a higher
> throughput, it's that after buffering a few MB there's just no point
> buffering more, given compression will plod along with 20-100MB/s.
>

It is quite likely that compression can benefit more from parallelism as compared to the network I/O as that is mostly a CPU intensive operation but I am not sure if we can just ignore the benefit of utilizing the network bandwidth. In our case, after copying from the network we do write that data to disk, so during filesystem I/O the network can be used if there is some other parallel worker processing other parts of data.

Also, there may be some users who don't want their data to be compressed for some reason, such as the overhead of decompression being so high that restore takes more time and they are not comfortable with that, because for them a faster restore is much more critical than a compressed or fast backup. So, for such things, the parallelism during backup as being discussed in this thread will still be helpful. OTOH, I think without some measurements it is difficult to say that we have a significant benefit by parallelizing the backup without compression. I have scanned the other thread [1] where the patch for parallel backup was discussed and didn't find any performance numbers, so probably having some performance data with that patch might give us a better understanding of introducing parallelism in the backup.

[1] - https://www.postgresql.org/message-id/CADM=JehKgobEknb+_nab9179HzGj=9EiTzWMOd2mpqr_rifm0Q@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi, On 2020-04-21 10:20:01 +0530, Amit Kapila wrote: > It is quite likely that compression can benefit more from parallelism > as compared to the network I/O as that is mostly a CPU intensive > operation but I am not sure if we can just ignore the benefit of > utilizing the network bandwidth. In our case, after copying from the > network we do write that data to disk, so during filesystem I/O the > network can be used if there is some other parallel worker processing > other parts of data. Well, as I said, network and FS IO as done by server / pg_basebackup are both fully buffered by the OS. Unless the OS throttles the userland process, a large chunk of the work will be done by the kernel, in separate kernel threads. My workstation and my laptop can, in a single thread each, get close 20GBit/s of network IO (bidirectional 10GBit, I don't have faster - it's a thunderbolt 10gbe card) and iperf3 is at 55% CPU while doing so. Just connecting locally it's 45Gbit/s. Or over 8GBbyte/s of buffered filesystem IO. And it doesn't even have that high per-core clock speed. I just don't see this being the bottleneck for now. > Also, there may be some users who don't want their data to be > compressed due to some reason like the overhead of decompression is so > high that restore takes more time and they are not comfortable with > that as for them faster restore is much more critical then compressed > or fast back up. So, for such things, the parallelism during backup > as being discussed in this thread will still be helpful. I am not even convinced it'll be helpful in a large fraction of cases. The added overhead of more connections / processes isn't free. I believe there are some cases where it'd help. E.g. if there are multiple tablespaces on independent storage, parallelism as described here could end up to a significantly better utilization of the different tablespaces. But that'd require sorting work between processes appropriately. > OTOH, I think without some measurements it is difficult to say that we > have significant benefit by paralysing the backup without compression. > I have scanned the other thread [1] where the patch for parallel > backup was discussed and didn't find any performance numbers, so > probably having some performance data with that patch might give us a > better understanding of introducing parallelism in the backup. Agreed, we need some numbers. Greetings, Andres Freund
Hi,

On 2020-04-20 22:31:49 -0700, Andres Freund wrote:
> On 2020-04-21 10:20:01 +0530, Amit Kapila wrote:
> > It is quite likely that compression can benefit more from parallelism
> > as compared to the network I/O as that is mostly a CPU intensive
> > operation but I am not sure if we can just ignore the benefit of
> > utilizing the network bandwidth. In our case, after copying from the
> > network we do write that data to disk, so during filesystem I/O the
> > network can be used if there is some other parallel worker processing
> > other parts of data.
>
> Well, as I said, network and FS IO as done by server / pg_basebackup are
> both fully buffered by the OS. Unless the OS throttles the userland
> process, a large chunk of the work will be done by the kernel, in
> separate kernel threads.
>
> My workstation and my laptop can, in a single thread each, get close
> 20GBit/s of network IO (bidirectional 10GBit, I don't have faster - it's
> a thunderbolt 10gbe card) and iperf3 is at 55% CPU while doing so. Just
> connecting locally it's 45Gbit/s. Or over 8GBbyte/s of buffered
> filesystem IO. And it doesn't even have that high per-core clock speed.
>
> I just don't see this being the bottleneck for now.

FWIW, I just tested pg_basebackup locally.

Without compression and a stock postgres I get:
unix        tcp         tcp+ssl:
1.74GiB/s   1.02GiB/s   699MiB/s

That turns out to be bottlenecked by the backup manifest generation. Without compression, a stock postgres, and --no-manifest I get:
unix        tcp         tcp+ssl:
2.51GiB/s   1.63GiB/s   1.00GiB/s

I.e. all of them are already above 10Gbit/s network.

Looking at a profile it's clear that our small output buffer is the bottleneck: 64kB Buffers + --no-manifest:
unix        tcp         tcp+ssl:
2.99GiB/s   2.56GiB/s   1.18GiB/s

At this point the backend is not actually the bottleneck anymore; instead it's pg_basebackup. Which is in part due to the small buffer used for output data (i.e. libc's FILE buffering), and in part because we spend too much time memmove()ing data, because of the "left-justify" logic in pqCheckInBufferSpace().

- Andres
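For reference, the "64kB Buffers" tweak above amounts to giving the output stream a much larger stdio buffer than the few kilobytes it typically gets by default; a sketch of what that looks like with setvbuf() follows. Whether and where pg_basebackup should actually do this, and with what size, is exactly what is being probed here, not a conclusion.

#include <stdio.h>
#include <stdlib.h>

/*
 * Open an output file with an enlarged stdio buffer, so that a multi-GB/s
 * tar stream does not pay for a write() syscall every few kilobytes.
 * setvbuf() must be called before any other I/O on the stream.
 */
static FILE *
open_output_with_big_buffer(const char *path, size_t bufsize)
{
	FILE	   *f = fopen(path, "wb");
	char	   *buf;

	if (f == NULL)
		return NULL;

	buf = malloc(bufsize);		/* deliberately lives as long as f does */
	if (buf == NULL || setvbuf(f, buf, _IOFBF, bufsize) != 0)
	{
		/* fall back to the default stdio buffer */
		free(buf);
	}
	return f;
}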
On Tue, Apr 21, 2020 at 2:44 AM Andres Freund <andres@anarazel.de> wrote: > FWIW, I just tested pg_basebackup locally. > > Without compression and a stock postgres I get: > unix tcp tcp+ssl: > 1.74GiB/s 1.02GiB/s 699MiB/s > > That turns out to be bottlenecked by the backup manifest generation. Whoa. That's unexpected, at least for me. Is that because of the CRC-32C overhead, or something else? What do you get with --manifest-checksums=none? > Without compression and a stock postgres I get, and --no-manifest > unix tcp tcp+ssl: > 2.51GiB/s 1.63GiB/s 1.00GiB/s > > I.e. all of them area already above 10Gbit/s network. > > Looking at a profile it's clear that our small output buffer is the > bottleneck: > 64kB Buffers + --no-manifest: > unix tcp tcp+ssl: > 2.99GiB/s 2.56GiB/s 1.18GiB/s > > At this point the backend is not actually the bottleneck anymore, > instead it's pg_basebackup. Which is in part due to the small buffer > used for output data (i.e. libc's FILE buffering), and in part because > we spend too much time memmove()ing data, because of the "left-justify" > logic in pqCheckInBufferSpace(). Hmm. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-04-21 07:18:20 -0400, Robert Haas wrote: > On Tue, Apr 21, 2020 at 2:44 AM Andres Freund <andres@anarazel.de> wrote: > > FWIW, I just tested pg_basebackup locally. > > > > Without compression and a stock postgres I get: > > unix tcp tcp+ssl: > > 1.74GiB/s 1.02GiB/s 699MiB/s > > > > That turns out to be bottlenecked by the backup manifest generation. > > Whoa. That's unexpected, at least for me. Is that because of the > CRC-32C overhead, or something else? What do you get with > --manifest-checksums=none? It's all CRC overhead. I don't see a difference with --manifest-checksums=none anymore. We really should look for a better "fast" checksum. Regards, Andres
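For context, the manifest checksums being discussed here are CRC-32C. A "better fast checksum" in this sense usually means a non-cryptographic hash such as xxHash; the sketch below only illustrates the xxHash (XXH3) streaming API from the xxhash library and is not a proposal made in this thread.

#include <stdio.h>
#include <xxhash.h>				/* xxHash library, provides the XXH3 API */

/*
 * Purely illustrative: checksum a file in 1MB chunks with XXH3, the sort
 * of hash people tend to reach for when CRC becomes the bottleneck.
 * Nothing here is PostgreSQL code.
 */
static unsigned long long
checksum_file_xxh3(FILE *f)
{
	static char buf[1024 * 1024];
	XXH3_state_t *state = XXH3_createState();
	size_t		n;
	unsigned long long result;

	XXH3_64bits_reset(state);
	while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
		XXH3_64bits_update(state, buf, n);
	result = XXH3_64bits_digest(state);
	XXH3_freeState(state);
	return result;
}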
On Tue, Apr 21, 2020 at 11:36 AM Andres Freund <andres@anarazel.de> wrote:
> It's all CRC overhead. I don't see a difference with
> --manifest-checksums=none anymore. We really should look for a better
> "fast" checksum.

Hmm, OK. I'm wondering exactly what you tested here. Was this over your 20GiB/s connection between laptop and workstation, or was this local TCP? Also, was the database being read from persistent storage, or was it RAM-cached? How do you expect to take advantage of I/O parallelism without multiple processes/connections?

Meanwhile, I did some local-only testing on my new 16GB MacBook Pro laptop with all combinations of:

- UNIX socket, local TCP socket, local TCP socket with SSL
- Plain format, tar format, tar format with gzip
- No manifest ("omit"), manifest with no checksums, manifest with CRC-32C checksums, manifest with SHA256 checksums.

The database is a fresh scale-factor 1000 pgbench database. No concurrent database load.

Observations:

- UNIX socket was slower than a local TCP socket, and about the same speed as a TCP socket with SSL.
- CRC-32C is about 10% slower than no manifest and/or no checksums in the manifest. SHA256 is 1.5-2x slower, but less when compression is also used (see below).
- Plain format is a little slower than tar format; tar with gzip is typically >~5x slower, but less when the checksum algorithm is SHA256 (again, see below).
- SHA256 + tar format with gzip is the slowest combination, but it's "only" about 15% slower than no manifest, and about 3.3x slower than no compression, presumably because the checksumming is slowing down the server and the compression is slowing down the client.
- Fastest speeds I see in any test are ~650MB/s, and slowest are ~65MB/s, obviously benefiting greatly from the fact that this is a local-only test.
- The time for a raw cp -R of the backup directory is about 10s, and the fastest time to take a backup (tcp+tar+m:omit) is about 22s.
- In all cases I've checked so far both pg_basebackup and the server backend are pegged at 98-100% CPU usage. I haven't looked into where that time is going yet.

Full results and test script attached. I and/or my colleagues will try to test out some other environments, but I'm not sure we have easy access to anything as high-powered as a 20GiB/s interconnect.

It seems to me that the interesting cases may involve having lots of available CPUs and lots of disk spindles, but a comparatively slow pipe between the machines. I mean, if it takes 36 hours to read the data from disk, you can't realistically expect to complete a full backup in less than 36 hours. Incremental backup might help, but otherwise you're just dead. On the other hand, if you can read the data from the disk in 2 hours but it takes 36 hours to complete a backup, it seems like you have more justification for thinking that the backup software could perhaps do better. In such cases efficient server-side compression may help a lot, but even then, I wonder whether you can read the data at maximum speed with only a single process? I tend to doubt it, but I guess you only have to be fast enough to saturate the network. Hmm.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2020-04-21 14:01:28 -0400, Robert Haas wrote:
> On Tue, Apr 21, 2020 at 11:36 AM Andres Freund <andres@anarazel.de> wrote:
> > It's all CRC overhead. I don't see a difference with
> > --manifest-checksums=none anymore. We really should look for a better
> > "fast" checksum.
>
> Hmm, OK. I'm wondering exactly what you tested here. Was this over
> your 20GiB/s connection between laptop and workstation, or was this
> local TCP?

It was local TCP. The speeds I can reach are faster than the 10GiB/s (unidirectional) I can do between the laptop & workstation, so testing it over "actual" network isn't informative - I basically can reach line speed between them with any method.

> Also, was the database being read from persistent storage, or was it
> RAM-cached?

It was in kernel buffer cache. But I can reach 100% utilization of storage too (which is slightly slower than what I can do over unix socket).

pg_basebackup --manifest-checksums=none -h /tmp/ -D- -Ft -cfast -Xnone |pv -B16M -r -a > /dev/null
2.59GiB/s
find /srv/dev/pgdev-dev/base/ -type f -exec dd if={} bs=32k status=none \; |pv -B16M -r -a > /dev/null
2.53GiB/s
find /srv/dev/pgdev-dev/base/ -type f -exec cat {} + |pv -B16M -r -a > /dev/null
2.42GiB/s

I tested this with a -s 5000 DB, FWIW.

> How do you expect to take advantage of I/O parallelism without
> multiple processes/connections?

Which kind of I/O parallelism are you thinking of? Independent tablespaces? Or devices that can handle multiple in-flight IOs? WRT the latter, at least linux will keep many IOs in-flight for sequential buffered reads.

> - UNIX socket was slower than a local TCP socket, and about the same
> speed as a TCP socket with SSL.

Hm. Interesting. Wonder if that's a question of the unix socket buffer size?

> - CRC-32C is about 10% slower than no manifest and/or no checksums in
> the manifest. SHA256 is 1.5-2x slower, but less when compression is
> also used (see below).
> - Plain format is a little slower than tar format; tar with gzip is
> typically >~5x slower, but less when the checksum algorithm is SHA256
> (again, see below).

I see about 250MB/s with -Z1 (from the source side). If I hack pg_basebackup.c to specify a deflate level of 0 to gzsetparams, which the zlib docs say should disable compression, I get up to 700MB/s. Which still is a factor of ~3.7 to uncompressed. This seems largely due to zlib's crc32 computation not being hardware accelerated:

- 99.75%  0.05%  pg_basebackup  pg_basebackup  [.] BaseBackup
   - 99.95% BaseBackup
      - 81.60% writeTarData
         - gzwrite
            - gz_write
               - gz_comp.constprop.0
                  - 85.11% deflate
                     - 97.66% deflate_stored
                        + 87.45% crc32_z
                        + 9.53% __memmove_avx_unaligned_erms
                        + 3.02% _tr_stored_block
                       2.27% __memmove_avx_unaligned_erms
                  + 14.86% __libc_write
      + 18.40% pqGetCopyData3

> It seems to me that the interesting cases may involve having lots of
> available CPUs and lots of disk spindles, but a comparatively slow
> pipe between the machines.

Hm, I'm not sure I am following. If network is the bottleneck, we'd immediately fill the buffers, and that'd be that?

ISTM all of this is only really relevant if either pg_basebackup or walsender is the bottleneck?

> I mean, if it takes 36 hours to read the
> data from disk, you can't realistically expect to complete a full
> backup in less than 36 hours. Incremental backup might help, but
> otherwise you're just dead.
On the other hand, if you can read the > data from the disk in 2 hours but it takes 36 hours to complete a > backup, it seems like you have more justification for thinking that > the backup software could perhaps do better. In such cases efficient > server-side compression may help a lot, but even then, I wonder > whether you can you read the data at maximum speed with only a single > process? I tend to doubt it, but I guess you only have to be fast > enough to saturate the network. Hmm. Well, I can do >8GByte/s of buffered reads in a single process (obviously cached, because I don't have storage quite that fast - uncached I can read at nearly 3GByte/s, the disk's speed). So sure, there's a limit to what a single process can do, but I think we're fairly far away from it. I think it's fairly obvious that we need faster compression - and that while we clearly can win a lot by just using a faster algorithm/implementation than standard zlib, we'll likely also need parallelism in some form. I'm doubtful that using multiple connections and multiple backends is the best way to achieve that, but it'd be a way. Greetings, Andres Freund
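To make the zlib experiment described above concrete, here is a sketch of writing data through zlib's gzFile layer with the deflate level forced to 0; the file handling is simplified relative to pg_basebackup's actual gzip path, and the point is simply that gzwrite() still computes zlib's software CRC-32 over every byte even when no compression is done, which is what dominates the profile quoted above.

#include <stdio.h>
#include <zlib.h>

/*
 * Copy a stream into a .gz file with the deflate level forced to 0,
 * i.e. stored (uncompressed) blocks only.
 */
static int
copy_to_gz(FILE *in, const char *outpath)
{
	gzFile		out = gzopen(outpath, "wb");
	char		buf[128 * 1024];
	size_t		n;

	if (out == NULL)
		return -1;

	/* level 0 = no compression; the strategy is irrelevant at level 0 */
	gzsetparams(out, 0, Z_DEFAULT_STRATEGY);

	while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
	{
		if (gzwrite(out, buf, (unsigned) n) != (int) n)
		{
			gzclose(out);
			return -1;
		}
	}
	return gzclose(out) == Z_OK ? 0 : -1;
}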
On Tue, Apr 21, 2020 at 4:14 PM Andres Freund <andres@anarazel.de> wrote: > It was local TCP. The speeds I can reach are faster than the 10GiB/s > (unidirectional) I can do between the laptop & workstation, so testing > it over "actual" network isn't informative - I basically can reach line > speed between them with any method. Is that really a conclusive test, though? In the case of either local TCP or a fast local interconnect, you'll have negligible latency. It seems at least possible that saturating the available bandwidth is harder on a higher-latency connection. Cross-region data center connections figure to have way higher latency than a local wired network, let alone the loopback interface. > It was in kernel buffer cache. But I can reach 100% utilization of > storage too (which is slightly slower than what I can do over unix > socket). > > pg_basebackup --manifest-checksums=none -h /tmp/ -D- -Ft -cfast -Xnone |pv -B16M -r -a > /dev/null > 2.59GiB/s > find /srv/dev/pgdev-dev/base/ -type f -exec dd if={} bs=32k status=none \; |pv -B16M -r -a > /dev/null > 2.53GiB/s > find /srv/dev/pgdev-dev/base/ -type f -exec cat {} + |pv -B16M -r -a > /dev/null > 2.42GiB/s > > I tested this with a -s 5000 DB, FWIW. But that's not a real test either, because you're not writing the data anywhere. It's going to be a whole lot easier to saturate the read side if the write side is always zero latency. > > How do you expect to take advantage of I/O parallelism without > > multiple processes/connections? > > Which kind of I/O parallelism are you thinking of? Independent > tablespaces? Or devices that can handle multiple in-flight IOs? WRT the > latter, at least linux will keep many IOs in-flight for sequential > buffered reads. Both. I know that the kernel will prefetch for sequential reads, but it won't know what file you're going to access next, so I think you'll tend to stall when you reach the end of each file. It also seems possible that on a large disk array, you could read N files at a time with greater aggregate bandwidth than you can read a single file. > > It seems to me that the interesting cases may involve having lots of > > available CPUs and lots of disk spindles, but a comparatively slow > > pipe between the machines. > > Hm, I'm not sure I am following. If network is the bottleneck, we'd > immediately fill the buffers, and that'd be that? > > ISTM all of this is only really relevant if either pg_basebackup or > walsender is the bottleneck? I agree that if neither pg_basebackup nor walsender is the bottleneck, parallelism is unlikely to be very effective. I have realized as a result of your comments that I actually don't care intrinsically about parallel backup; what I actually care about is making backups very, very fast. I suspect that parallelism is a useful means to that end, but I interpret your comments as questioning that, and specifically drawing attention to the question of where the bottlenecks might be. So I'm trying to think about that. > I think it's fairly obvious that we need faster compression - and that > while we clearly can win a lot by just using a faster > algorithm/implementation than standard zlib, we'll likely also need > parallelism in some form. I'm doubtful that using multiple connections > and multiple backends is the best way to achieve that, but it'd be a > way. I think it has a good chance of being pretty effective, but it's certainly worth casting about for other possibilities that might deliver more benefit or be less work. 
In terms of better compression, I did a little looking around and it seems like LZ4 is generally agreed to be a lot faster than gzip, and also significantly faster than most other things that one might choose to use. On the other hand, the compression ratio may not be as good; e.g. https://facebook.github.io/zstd/ cites a 2.1 ratio (on some data set) for lz4 and a 2.9 ratio for zstd. While the compression and decompression speeds are slower, they are close enough that you might be able to make up the difference by using 2x the cores for compression and 3x for decompression. I don't know if that sort of thing is worth considering. If your limitation is the line speed, and you have CPU cores to burn, a significantly higher compression ratio means significantly faster backups. On the other hand, if you're backing up over the LAN and the machine is heavily taxed, that's probably not an appealing trade.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi, On 2020-04-21 17:09:50 -0400, Robert Haas wrote: > On Tue, Apr 21, 2020 at 4:14 PM Andres Freund <andres@anarazel.de> wrote: > > It was local TCP. The speeds I can reach are faster than the 10GiB/s > > (unidirectional) I can do between the laptop & workstation, so testing > > it over "actual" network isn't informative - I basically can reach line > > speed between them with any method. > > Is that really a conclusive test, though? In the case of either local > TCP or a fast local interconnect, you'll have negligible latency. It > seems at least possible that saturating the available bandwidth is > harder on a higher-latency connection. Cross-region data center > connections figure to have way higher latency than a local wired > network, let alone the loopback interface. Sure. But that's what the TCP window etc should take care of. You might have to tune the OS if you have a high latency multi-GBit link, but you'd have to do that regardless of whether a single process or multiple processes are used. And the number of people with high-latency multi-gbit links isn't that high, compared to the number taking backups within a datacenter. > > It was in kernel buffer cache. But I can reach 100% utilization of > > storage too (which is slightly slower than what I can do over unix > > socket). > > > > pg_basebackup --manifest-checksums=none -h /tmp/ -D- -Ft -cfast -Xnone |pv -B16M -r -a > /dev/null > > 2.59GiB/s > > find /srv/dev/pgdev-dev/base/ -type f -exec dd if={} bs=32k status=none \; |pv -B16M -r -a > /dev/null > > 2.53GiB/s > > find /srv/dev/pgdev-dev/base/ -type f -exec cat {} + |pv -B16M -r -a > /dev/null > > 2.42GiB/s > > > > I tested this with a -s 5000 DB, FWIW. > > But that's not a real test either, because you're not writing the data > anywhere. It's going to be a whole lot easier to saturate the read > side if the write side is always zero latency. I also stored data elsewhere in separate threads. But the bottleneck of that is lower (my storage is faster on reads than on writes, at least after the ram on the nvme is exhausted)... > > > It seems to me that the interesting cases may involve having lots of > > > available CPUs and lots of disk spindles, but a comparatively slow > > > pipe between the machines. > > > > Hm, I'm not sure I am following. If network is the bottleneck, we'd > > immediately fill the buffers, and that'd be that? > > > > ISTM all of this is only really relevant if either pg_basebackup or > > walsender is the bottleneck? > > I agree that if neither pg_basebackup nor walsender is the bottleneck, > parallelism is unlikely to be very effective. I have realized as a > result of your comments that I actually don't care intrinsically about > parallel backup; what I actually care about is making backups very, > very fast. I suspect that parallelism is a useful means to that end, > but I interpret your comments as questioning that, and specifically > drawing attention to the question of where the bottlenecks might be. > So I'm trying to think about that. I agree that trying to make backups very fast is a good goal (or well, I think not very slow would be a good descriptor for the current situation). I am just trying to make sure we tackle the right problems for that. My gut feeling is that we have to tackle compression first, because without addressing that "all hope is lost" ;) FWIW, here's the base backup from pgbench -i -s 5000 compressed a number of ways. The uncompressed backup is 64622701911 bytes. 
Unfortunately pgbench -i -s 5000 is not a particularly good example, it's just too compressible.

method   level  parallelism  wall-time  cpu-user-time  cpu-kernel-time  size        rate  format
gzip     1      1            380.79     368.46         12.15            3892457816  16.6  .gz
gzip     6      1            976.05     963.10         12.84            3594605389  18.0  .gz
pigz     1      10           34.35      364.14         23.55            3892401867  16.6  .gz
pigz     6      10           101.27     1056.85        28.98            3620724251  17.8  .gz
zstd-gz  1      1            278.14     265.31         12.81            3897174342  15.6  .gz
zstd-gz  1      6            906.67     893.58         12.52            3598238594  18.0  .gz
zstd     1      1            82.95      67.97          11.82            2853193736  22.6  .zstd
zstd     1      6            228.58     214.65         13.92            2687177334  24.0  .zstd
zstd     1      10           25.05      151.84         13.35            2847414913  22.7  .zstd
zstd     6      10           43.47      374.30         12.37            2745211100  23.5  .zstd
zstd     6      20           32.50      468.18         13.44            2745211100  23.5  .zstd
zstd     9      20           57.99      949.91         14.13            2606535138  24.8  .zstd
lz4      1      1            49.94      36.60          13.33            7318668265  8.8   .lz4
lz4      3      1            201.79     187.36         14.42            6561686116  9.84  .lz4
lz4      6      1            318.35     304.64         13.55            6560274369  9.9   .lz4
pixz     1      10           92.54      925.52         37.00            1199499772  53.8  .xz
pixz     3      10           210.77     2090.38        37.96            1186219752  54.5  .xz
bzip2    1      1            2210.04    2190.89        17.67            1276905211  50.6  .bz2
pbzip2   1      10           236.03     2352.09        34.01            1332010572  48.5  .bz2
plzip    1      10           243.08     2430.18        25.60            915598323   70.6  .lz
plzip    3      10           359.04     3577.94        27.92            1018585193  63.4  .lz
plzip    3      20           197.36     3911.85        22.02            1018585193  63.4  .lz

(zstd-gz is zstd with --format=gzip, zstd with parallelism 1 is with --single-thread to avoid a separate IO thread it uses by default, even with -T0)

These weren't taken on a completely quiesced system, and I tested gzip and bzip2 in parallel, because they took so long. But I think this still gives a good overview (cpu-user-time is not that affected by smaller amounts of noise too).

It looks to me that bzip2/pbzip2 are clearly too slow. pixz looks interesting as it achieves pretty good compression rates at a lower cost than plzip. plzip's rates are impressive, but damn, is it expensive. And a higher compression level using more space is also a bit "huh"?

Does anybody have a better idea what exactly to use as a good test corpus? pgbench -i clearly sucks, but ...

One thing this reminded me of is whether using a format (tar) that doesn't allow efficient addressing of individual files is a good idea for base backups. The compression rates very likely will be better when not compressing tiny files individually, but at the same time it'd be very useful to be able to access individual files more efficiently than O(N). I can imagine that being important for some cases of incremental backup assembly.
> While the compression and decompression speeds are slower, they are
> close enough that you might be able to make up the difference by using
> 2x the cores for compression and 3x for decompression. I don't know if
> that sort of thing is worth considering. If your limitation is the line
> speed, and you have CPU cores to burn, a significantly higher
> compression ratio means significantly faster backups. On the other
> hand, if you're backing up over the LAN and the machine is heavily
> taxed, that's probably not an appealing trade.

I think zstd with a low compression "setting" would be a pretty good default for most cases. lz4 is considerably faster, true, but the compression rates are also considerably worse. I think lz4 is great for mostly in-memory workloads (e.g. a compressed cache / live database with compressed data, as it allows you to have reasonably close to memory speeds but with twice the data), but for anything longer lived zstd is probably better.

The other big benefit is that zstd's library has multi-threaded compression built in, whereas that's not the case for other libraries that I am aware of.

Greetings,

Andres Freund
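For reference, the multi-threading mentioned above is driven entirely through libzstd's ordinary streaming API; the following is a minimal sketch (not taken from any patch in this thread, and assuming a libzstd built with multithreading support) of stdin-to-stdout compression with four worker threads:

/* Illustrative only: stream stdin to stdout with zstd, level 1, 4 worker threads. */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

int main(void)
{
    ZSTD_CCtx  *cctx = ZSTD_createCCtx();
    size_t      inSize = ZSTD_CStreamInSize();
    size_t      outSize = ZSTD_CStreamOutSize();
    void       *inBuf = malloc(inSize);
    void       *outBuf = malloc(outSize);

    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 1);
    /* Needs a libzstd built with ZSTD_MULTITHREAD; otherwise the setting is rejected. */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 4);

    for (;;)
    {
        size_t              nread = fread(inBuf, 1, inSize, stdin);
        int                 lastChunk = (nread < inSize);
        ZSTD_EndDirective   mode = lastChunk ? ZSTD_e_end : ZSTD_e_continue;
        ZSTD_inBuffer       in = { inBuf, nread, 0 };
        int                 finished = 0;

        while (!finished)
        {
            ZSTD_outBuffer  out = { outBuf, outSize, 0 };
            size_t          remaining = ZSTD_compressStream2(cctx, &out, &in, mode);

            if (ZSTD_isError(remaining))
            {
                fprintf(stderr, "zstd error: %s\n", ZSTD_getErrorName(remaining));
                return 1;
            }
            fwrite(outBuf, 1, out.pos, stdout);
            /* Done with this chunk once input is consumed (or the final frame is flushed). */
            finished = lastChunk ? (remaining == 0) : (in.pos == in.size);
        }
        if (lastChunk)
            break;
    }
    ZSTD_freeCCtx(cctx);
    free(inBuf);
    free(outBuf);
    return 0;
}

The point is that the parallelism lives entirely inside the library: the caller remains single-threaded and simply sets ZSTD_c_nbWorkers before making plain ZSTD_compressStream2() calls.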
On Tue, Apr 21, 2020 at 6:57 PM Andres Freund <andres@anarazel.de> wrote:
> I agree that trying to make backups very fast is a good goal (or well, I
> think not very slow would be a good descriptor for the current
> situation). I am just trying to make sure we tackle the right problems
> for that. My gut feeling is that we have to tackle compression first,
> because without addressing that "all hope is lost" ;)

OK. I have no objection to the idea of starting with (1) server side compression and (2) a better compression algorithm. However, I'm not very sold on the idea of relying on parallelism that is specific to compression. I think that parallelism across the whole operation - multiple connections, multiple processes, etc. - may be a more promising approach than trying to parallelize specific stages of the process. I am not sure about that; it could be wrong, and I'm open to the possibility that it is, in fact, wrong.

Leaving out all the three and four digit wall times from your table:

> method level parallelism wall-time cpu-user-time cpu-kernel-time size rate format
> pigz 1 10 34.35 364.14 23.55 3892401867 16.6 .gz
> zstd 1 1 82.95 67.97 11.82 2853193736 22.6 .zstd
> zstd 1 10 25.05 151.84 13.35 2847414913 22.7 .zstd
> zstd 6 10 43.47 374.30 12.37 2745211100 23.5 .zstd
> zstd 6 20 32.50 468.18 13.44 2745211100 23.5 .zstd
> zstd 9 20 57.99 949.91 14.13 2606535138 24.8 .zstd
> lz4 1 1 49.94 36.60 13.33 7318668265 8.8 .lz4
> pixz 1 10 92.54 925.52 37.00 1199499772 53.8 .xz

It's notable that almost all of the fast wall times here are with zstd; the surviving entries with pigz and pixz are with ten-way parallelism, and both pigz and lz4 have worse compression ratios than zstd. My impression, though, is that LZ4 might be getting a bit of a raw deal here because of the repetitive nature of the data. I theorize based on some reading I did yesterday, and general hand-waving, that maybe the compression ratios would be closer together on a more realistic data set.

It's also notable that lz4 -1 is BY FAR the winner in terms of absolute CPU consumption. So I kinda wonder whether supporting both LZ4 and ZSTD might be the way to go, especially since once we have the LZ4 code we might be able to use it for other things, too.

> One thing this reminded me of is whether using a format (tar) that
> doesn't allow efficient addressing of individual files is a good idea
> for base backups. The compression rates very likely will be better when
> not compressing tiny files individually, but at the same time it'd be
> very useful to be able to access individual files more efficiently than
> O(N). I can imagine that being important for some cases of incremental
> backup assembly.

Yeah, being able to operate directly on the compressed version of the file would be very useful, but I'm not sure that we have great options available there. I think the only widely-used format that supports that is ".zip", and I'm not too sure about emitting zip files. Apparently, pixz also supports random access to archive members, and it did have one entry that survived my arbitrary cut in the table above, but the last release was in 2015, and it seems to be only a command-line tool, not a library. It also depends on libarchive and liblzma, which is not awful, but I'm not sure we want to suck in that many dependencies. But that's really a secondary thing: I can't imagine us depending on something that hasn't had a release in 5 years, and has less than 300 total commits.
Now, it is based on xz/liblzma, and those seem to have some built-in indexing capabilities which it may be leveraging, so possibly we could roll our own. I'm not too sure about that, though, and it would limit us to using only that form of compression.

Other options include, perhaps, (1) emitting a tarfile of compressed files instead of a compressed tarfile, and (2) writing our own index files. We don't know when we begin emitting the tarfile what files we're going to find or how big they will be, so we can't really emit a directory at the beginning of the file. Even if we thought we knew, files can disappear or be truncated before we get around to archiving them. However, when we reach the end of the file, we do know what we included and how big it was, so possibly we could generate an index for each tar file, or include something in the backup manifest.

> The other big benefit is that zstd's library has multi-threaded
> compression built in, whereas that's not the case for other libraries
> that I am aware of.

Wouldn't it be a problem to let the backend become multi-threaded, at least on Windows?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
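To make the "write our own index" idea above slightly more concrete: the per-file information such an index needs is small - a path plus an offset and a length. The following is a purely hypothetical sketch (none of these structures or names exist in PostgreSQL) of what an entry could carry:

#include <stdint.h>

/* Hypothetical per-file index entry, for illustration only. */
typedef struct BackupArchiveIndexEntry
{
    uint64_t    archive_offset; /* where the tar member header starts in the emitted stream */
    uint64_t    file_size;      /* length of the file's contents in bytes */
    char        path[1024];     /* file name relative to the data directory */
} BackupArchiveIndexEntry;

/*
 * Because sizes are only known after each file has been streamed, entries
 * would be accumulated while writing and emitted at the end - e.g. as a
 * trailing archive member, or as extra per-file fields in the backup manifest.
 */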
Hi,

On 2020-04-22 09:52:53 -0400, Robert Haas wrote:
> On Tue, Apr 21, 2020 at 6:57 PM Andres Freund <andres@anarazel.de> wrote:
> > I agree that trying to make backups very fast is a good goal (or well, I
> > think not very slow would be a good descriptor for the current
> > situation). I am just trying to make sure we tackle the right problems
> > for that. My gut feeling is that we have to tackle compression first,
> > because without addressing that "all hope is lost" ;)
>
> OK. I have no objection to the idea of starting with (1) server side
> compression and (2) a better compression algorithm. However, I'm not
> very sold on the idea of relying on parallelism that is specific to
> compression. I think that parallelism across the whole operation -
> multiple connections, multiple processes, etc. - may be a more
> promising approach than trying to parallelize specific stages of the
> process. I am not sure about that; it could be wrong, and I'm open to
> the possibility that it is, in fact, wrong.

*My* gut feeling is that you're going to have a harder time using CPU time efficiently when doing parallel compression via multiple processes and independent connections. You're e.g. going to have a lot more context switches, I think. And there will be network overhead from doing more connections (including worse congestion control).

> Leaving out all the three and four digit wall times from your table:
>
> > method level parallelism wall-time cpu-user-time cpu-kernel-time size rate format
> > pigz 1 10 34.35 364.14 23.55 3892401867 16.6 .gz
> > zstd 1 1 82.95 67.97 11.82 2853193736 22.6 .zstd
> > zstd 1 10 25.05 151.84 13.35 2847414913 22.7 .zstd
> > zstd 6 10 43.47 374.30 12.37 2745211100 23.5 .zstd
> > zstd 6 20 32.50 468.18 13.44 2745211100 23.5 .zstd
> > zstd 9 20 57.99 949.91 14.13 2606535138 24.8 .zstd
> > lz4 1 1 49.94 36.60 13.33 7318668265 8.8 .lz4
> > pixz 1 10 92.54 925.52 37.00 1199499772 53.8 .xz
>
> It's notable that almost all of the fast wall times here are with
> zstd; the surviving entries with pigz and pixz are with ten-way
> parallelism, and both pigz and lz4 have worse compression ratios than
> zstd. My impression, though, is that LZ4 might be getting a bit of a
> raw deal here because of the repetitive nature of the data. I theorize
> based on some reading I did yesterday, and general hand-waving, that
> maybe the compression ratios would be closer together on a more
> realistic data set.

I agree that most datasets won't get even close to what we've seen here. And that disadvantages e.g. lz4.

To come up with a much less compressible case, I generated data the following way:

CREATE TABLE random_data(id serial NOT NULL, r1 float not null, r2 float not null, r3 float not null);
ALTER TABLE random_data SET (FILLFACTOR = 100);
ALTER SEQUENCE random_data_id_seq CACHE 1024;

-- with pgbench, I ran this in parallel for 100s
INSERT INTO random_data(r1,r2,r3) SELECT random(), random(), random() FROM generate_series(1, 100000);

-- then created indexes, using a high fillfactor to ensure few zeroed out parts
ALTER TABLE random_data ADD CONSTRAINT random_data_id_pkey PRIMARY KEY(id) WITH (FILLFACTOR = 100);
CREATE INDEX random_data_r1 ON random_data(r1) WITH (fillfactor = 100);

this results in a 16GB base backup. I think this is probably a good bit less compressible than most PG databases.
method  level parallelism wall-time cpu-user-time cpu-kernel-time size        rate format
gzip    1     1           305.37    299.72        5.52            7067232465  2.28
lz4     1     1           33.26     27.26         5.99            8961063439  1.80 .lz4
lz4     3     1           188.50    182.91        5.58            8204501460  1.97 .lz4
zstd    1     1           66.41     58.38         6.04            6925634128  2.33 .zstd
zstd    1     10          9.64      67.04         4.82            6980075316  2.31 .zstd
zstd    3     1           122.04    115.79        6.24            6440274143  2.50 .zstd
zstd    3     10          13.65     106.11        5.64            6438439095  2.51 .zstd
zstd    9     10          100.06    955.63        6.79            5963827497  2.71 .zstd
zstd    15    10          259.84    2491.39       8.88            5912617243  2.73 .zstd
pixz    1     10          162.59    1626.61       15.52           5350138420  3.02 .xz
plzip   1     20          135.54    2705.28       9.25            5270033640  3.06 .lz

> It's also notable that lz4 -1 is BY FAR the winner in terms of
> absolute CPU consumption. So I kinda wonder whether supporting both
> LZ4 and ZSTD might be the way to go, especially since once we have the
> LZ4 code we might be able to use it for other things, too.

Yea. I think the case for lz4 is far stronger in other places. E.g. having lz4 -1 for toast can make a lot of sense; suddenly repeated detoasting is much less of an issue, while still achieving higher compression than pglz.

.oO(Now I really see how pglz compares to the above)

> > One thing this reminded me of is whether using a format (tar) that
> > doesn't allow efficient addressing of individual files is a good idea
> > for base backups. The compression rates very likely will be better when
> > not compressing tiny files individually, but at the same time it'd be
> > very useful to be able to access individual files more efficiently than
> > O(N). I can imagine that being important for some cases of incremental
> > backup assembly.
>
> Yeah, being able to operate directly on the compressed version of the
> file would be very useful, but I'm not sure that we have great options
> available there. I think the only widely-used format that supports
> that is ".zip", and I'm not too sure about emitting zip files.

I don't really see a problem with emitting .zip files. It's an extremely widely used container format for all sorts of file formats these days. Except for needing a bit more complicated (and I don't think it's *that* big of a difference) code during generation / unpacking, it seems clearly advantageous over .tar.gz etc.

> Apparently, pixz also supports random access to archive members, and
> it did have one entry that survived my arbitrary cut in the table
> above, but the last release was in 2015, and it seems to be only a
> command-line tool, not a library. It also depends on libarchive and
> liblzma, which is not awful, but I'm not sure we want to suck in that
> many dependencies. But that's really a secondary thing: I can't
> imagine us depending on something that hasn't had a release in 5
> years, and has less than 300 total commits.

Oh, yea. I just looked at the various tools I could find that did parallel compression.

> Other options include, perhaps, (1) emitting a tarfile of compressed
> files instead of a compressed tarfile

Yea, that'd help some. Although I am not sure how good the tooling to seek through tarfiles in an O(files) rather than O(bytes) manner is.

I think there are some cases where using separate compression state for each file would hurt us. Some of the archive formats have support for reusing compression state, but I don't know which.

> , and (2) writing our own index files. We don't know when we begin
> emitting the tarfile what files we're going to find or how big they
> will be, so we can't really emit a directory at the beginning of the
> file.
> Even if we thought we knew, files can disappear or be truncated
> before we get around to archiving them. However, when we reach the end
> of the file, we do know what we included and how big it was, so
> possibly we could generate an index for each tar file, or include
> something in the backup manifest.

Hm. There's some appeal to just store offsets in the manifest, and to make sure it's a seekable offset in the compression stream. OTOH, it makes it pretty hard for other tools to generate a compatible archive.

> > The other big benefit is that zstd's library has multi-threaded
> > compression built in, whereas that's not the case for other libraries
> > that I am aware of.
>
> Wouldn't it be a problem to let the backend become multi-threaded, at
> least on Windows?

We already have threads on Windows, e.g. the signal handler emulation stuff runs in one. Are you thinking of this bit in postmaster.c?

#ifdef HAVE_PTHREAD_IS_THREADED_NP

	/*
	 * On macOS, libintl replaces setlocale() with a version that calls
	 * CFLocaleCopyCurrent() when its second argument is "" and every relevant
	 * environment variable is unset or empty.  CFLocaleCopyCurrent() makes
	 * the process multithreaded.  The postmaster calls sigprocmask() and
	 * calls fork() without an immediate exec(), both of which have undefined
	 * behavior in a multithreaded program.  A multithreaded postmaster is the
	 * normal case on Windows, which offers neither fork() nor sigprocmask().
	 */
	if (pthread_is_threaded_np() != 0)
		ereport(FATAL,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("postmaster became multithreaded during startup"),
				 errhint("Set the LC_ALL environment variable to a valid locale.")));
#endif

I don't really see any of the concerns there to apply for the base backup case.

Greetings,

Andres Freund
On Wed, Apr 22, 2020 at 11:24 AM Andres Freund <andres@anarazel.de> wrote:
> *My* gut feeling is that you're going to have a harder time using CPU
> time efficiently when doing parallel compression via multiple processes
> and independent connections. You're e.g. going to have a lot more
> context switches, I think. And there will be network overhead from doing
> more connections (including worse congestion control).

OK, noted. I'm still doubtful that the optimal number of connections is 1, but it might be that the optimal number of CPU cores to apply to compression is much higher than the optimal number of connections. For instance, suppose there are two equally sized tablespaces on separate drives, but zstd with 10-way parallelism is our chosen compression strategy. It seems to me that two connections has an excellent chance of being faster than one, because with only one connection I don't see how you can benefit from the opportunity to do I/O in parallel. However, I can also see that having twenty connections just as a way to get 10-way parallelism for each tablespace might be undesirable and/or inefficient for various reasons.

> this results in a 16GB base backup. I think this is probably a good bit
> less compressible than most PG databases.
>
> method level parallelism wall-time cpu-user-time cpu-kernel-time size rate format
> gzip 1 1 305.37 299.72 5.52 7067232465 2.28
> lz4 1 1 33.26 27.26 5.99 8961063439 1.80 .lz4
> lz4 3 1 188.50 182.91 5.58 8204501460 1.97 .lz4
> zstd 1 1 66.41 58.38 6.04 6925634128 2.33 .zstd
> zstd 1 10 9.64 67.04 4.82 6980075316 2.31 .zstd
> zstd 3 1 122.04 115.79 6.24 6440274143 2.50 .zstd
> zstd 3 10 13.65 106.11 5.64 6438439095 2.51 .zstd
> zstd 9 10 100.06 955.63 6.79 5963827497 2.71 .zstd
> zstd 15 10 259.84 2491.39 8.88 5912617243 2.73 .zstd
> pixz 1 10 162.59 1626.61 15.52 5350138420 3.02 .xz
> plzip 1 20 135.54 2705.28 9.25 5270033640 3.06 .lz

So, picking a better compressor in this case looks a lot less exciting. Parallel zstd still compresses somewhat better than single-core lz4, but the difference in compression ratio is far less, and the amount of CPU you have to burn in order to get that extra compression is pretty large.

> I don't really see a problem with emitting .zip files. It's an extremely
> widely used container format for all sorts of file formats these days.
> Except for needing a bit more complicated (and I don't think it's *that*
> big of a difference) code during generation / unpacking, it seems
> clearly advantageous over .tar.gz etc.

Wouldn't that imply buying into DEFLATE as our preferred compression algorithm? Either way, I don't really like the idea of having PostgreSQL have its own code to generate and interpret various archive formats. That seems like a maintenance nightmare and a recipe for bugs. How can anyone even verify that our existing 'tar' code works with all 'tar' implementations out there, or that it's correct in all cases? Do we really want to maintain similar code for other formats, or even for this one? I'd say "no". We should pick archive formats that have good, well-maintained libraries with permissive licenses and then use those. I don't know whether "zip" falls into that category or not.

> > Other options include, perhaps, (1) emitting a tarfile of compressed
> > files instead of a compressed tarfile
>
> Yea, that'd help some. Although I am not sure how good the tooling to
> seek through tarfiles in an O(files) rather than O(bytes) manner is.

Well, considering that at present we're using hand-rolled code...
> I think there are some cases where using separate compression state for each
> file would hurt us. Some of the archive formats have support for reusing
> compression state, but I don't know which.

Yeah, I had the same thought. People with mostly 1GB relation segments might not notice much difference, but people with lots of little relations might see a more significant difference.

> Hm. There's some appeal to just store offsets in the manifest, and to
> make sure it's a seekable offset in the compression stream. OTOH, it
> makes it pretty hard for other tools to generate a compatible archive.

Yeah. FWIW, I don't see it as being entirely necessary to create a seekable compressed archive format, let alone to make all of our compressed archive formats seekable. I think supporting multiple compression algorithms in a flexible way that's not too tied to the capabilities of particular algorithms is more important. If you want fast restores of incremental and differential backups, consider using -Fp rather than -Ft. Or we can have a new option that's like -Fp but every file is compressed individually in place, or files larger than N bytes are compressed in place using a configurable algorithm. It might be somewhat less efficient but it's also way less complicated to implement, and I think that should count for something.

I don't want to get so caught up in advanced features here that we don't make any useful progress at all. If we can add better features without a large complexity increment, and without drawing objections from others on this list, great. If not, I'm prepared to summarily jettison it as nice-to-have but not essential.

> I don't really see any of the concerns there to apply for the base
> backup case.

I felt like there was some reason that threads were bad, but it may have just been the case you mentioned and not relevant here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2020-04-20 22:36, Robert Haas wrote: > My suspicion is that it has mostly to do with adequately utilizing the > hardware resources on the server side. If you are network-constrained, > adding more connections won't help, unless there's something shaping > the traffic which can be gamed by having multiple connections. This is a thing. See "long fat network" and "bandwidth-delay product" (https://en.wikipedia.org/wiki/Bandwidth-delay_product). The proper way to address this is presumably with TCP parameter tuning, but in practice it's often easier to just start multiple connections, for example, when doing a backup via rsync. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
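To put a rough number on the effect being described (illustrative arithmetic, not from the thread): the bandwidth-delay product is simply bandwidth times round-trip time, so a 10 Gbit/s path with 50 ms of RTT can hold about 1.25 GB/s * 0.05 s = 62.5 MB in flight. A single TCP connection can only keep such a link full if its window (and the OS limits governing it) can grow to roughly that size, which is why opening several connections is the common pragmatic workaround.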
On Wed, Apr 22, 2020 at 12:20 PM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 2020-04-20 22:36, Robert Haas wrote: > > My suspicion is that it has mostly to do with adequately utilizing the > > hardware resources on the server side. If you are network-constrained, > > adding more connections won't help, unless there's something shaping > > the traffic which can be gamed by having multiple connections. > > This is a thing. See "long fat network" and "bandwidth-delay product" > (https://en.wikipedia.org/wiki/Bandwidth-delay_product). The proper way > to address this is presumably with TCP parameter tuning, but in practice > it's often easier to just start multiple connections, for example, when > doing a backup via rsync. Very interesting -- thanks! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2020-04-22 12:12:32 -0400, Robert Haas wrote:
> On Wed, Apr 22, 2020 at 11:24 AM Andres Freund <andres@anarazel.de> wrote:
> > *My* gut feeling is that you're going to have a harder time using CPU
> > time efficiently when doing parallel compression via multiple processes
> > and independent connections. You're e.g. going to have a lot more
> > context switches, I think. And there will be network overhead from doing
> > more connections (including worse congestion control).
>
> OK, noted. I'm still doubtful that the optimal number of connections
> is 1, but it might be that the optimal number of CPU cores to apply to
> compression is much higher than the optimal number of connections.

Yea, that's basically what I think too.

> For instance, suppose there are two equally sized tablespaces on
> separate drives, but zstd with 10-way parallelism is our chosen
> compression strategy. It seems to me that two connections has an
> excellent chance of being faster than one, because with only one
> connection I don't see how you can benefit from the opportunity to do
> I/O in parallel.

Yea. That's exactly the case for "connection level" parallelism I had upthread as well. It'd require being somewhat careful about different tablespaces in the selection for each connection, but that's not that hard.

I also can see a case for using N backends and one connection, but I think that'll be too complicated / too much bound by locking around the socket etc.

> > this results in a 16GB base backup. I think this is probably a good bit
> > less compressible than most PG databases.
> >
> > method level parallelism wall-time cpu-user-time cpu-kernel-time size rate format
> > gzip 1 1 305.37 299.72 5.52 7067232465 2.28
> > lz4 1 1 33.26 27.26 5.99 8961063439 1.80 .lz4
> > lz4 3 1 188.50 182.91 5.58 8204501460 1.97 .lz4
> > zstd 1 1 66.41 58.38 6.04 6925634128 2.33 .zstd
> > zstd 1 10 9.64 67.04 4.82 6980075316 2.31 .zstd
> > zstd 3 1 122.04 115.79 6.24 6440274143 2.50 .zstd
> > zstd 3 10 13.65 106.11 5.64 6438439095 2.51 .zstd
> > zstd 9 10 100.06 955.63 6.79 5963827497 2.71 .zstd
> > zstd 15 10 259.84 2491.39 8.88 5912617243 2.73 .zstd
> > pixz 1 10 162.59 1626.61 15.52 5350138420 3.02 .xz
> > plzip 1 20 135.54 2705.28 9.25 5270033640 3.06 .lz
>
> So, picking a better compressor in this case looks a lot less
> exciting.

Oh? I find it *extremely* exciting here. This is pretty close to the worst case compressibility-wise, and zstd takes only ~22% of the time as gzip does, while still delivering better compression. A nearly 5x improvement in compression times seems pretty exciting to me.

Or do you mean for zstd over lz4, rather than anything over gzip? 1.8x -> 2.3x is a pretty decent improvement still, no? And being able to do it in 1/3 of the wall time seems pretty helpful.

> Parallel zstd still compresses somewhat better than single-core lz4,
> but the difference in compression ratio is far less, and the amount of
> CPU you have to burn in order to get that extra compression is pretty
> large.

It's "just" a ~2x difference for "level 1" compression, right? For having 1.9GiB less to write / permanently store of a 16GiB base backup that doesn't seem that bad to me.

> > I don't really see a problem with emitting .zip files. It's an extremely
> > widely used container format for all sorts of file formats these days.
> > Except for needing a bit more complicated (and I don't think it's *that*
> > big of a difference) code during generation / unpacking, it seems
> > clearly advantageous over .tar.gz etc.
> Wouldn't that imply buying into DEFLATE as our preferred compression
> algorithm?

zip doesn't have to imply DEFLATE, although it is the most common option. There's a compression method associated with each file.

> Either way, I don't really like the idea of having PostgreSQL have its
> own code to generate and interpret various archive formats. That seems
> like a maintenance nightmare and a recipe for bugs. How can anyone
> even verify that our existing 'tar' code works with all 'tar'
> implementations out there, or that it's correct in all cases? Do we
> really want to maintain similar code for other formats, or even for
> this one? I'd say "no". We should pick archive formats that have good,
> well-maintained libraries with permissive licenses and then use those.
> I don't know whether "zip" falls into that category or not.

I agree we should pick one. I think tar is not a great choice. .zip seems like it'd be a significant improvement - but not necessarily optimal.

> > > Other options include, perhaps, (1) emitting a tarfile of compressed
> > > files instead of a compressed tarfile
> >
> > Yea, that'd help some. Although I am not sure how good the tooling to
> > seek through tarfiles in an O(files) rather than O(bytes) manner is.
>
> Well, considering that at present we're using hand-rolled code...

Good point. Also looks like at least gnu tar supports seeking (when not reading from a pipe etc).

> > I think there are some cases where using separate compression state for each
> > file would hurt us. Some of the archive formats have support for reusing
> > compression state, but I don't know which.
>
> Yeah, I had the same thought. People with mostly 1GB relation segments
> might not notice much difference, but people with lots of little
> relations might see a more significant difference.

Yea. I suspect it's close to immeasurable for large relations. Reusing the dictionary might help, although it likely would imply some overhead. OTOH, the overhead of small relations will usually probably be in the number of files, rather than the actual size.

FWIW, not that it's really relevant to this discussion, but I played around with using trained compression dictionaries for postgres contents. Can improve e.g. lz4's compression ratio a fair bit, in particular when compressing small amounts of data. E.g. per-block compression or such.

> FWIW, I don't see it as being entirely necessary to create a seekable
> compressed archive format, let alone to make all of our compressed
> archive formats seekable. I think supporting multiple compression
> algorithms in a flexible way that's not too tied to the capabilities
> of particular algorithms is more important. If you want fast restores
> of incremental and differential backups, consider using -Fp rather
> than -Ft.

Given how compressible many real-world databases are (maybe not quite the 50x as in the pgbench -i case, but still extremely so), I don't quite find -Fp a convincing alternative.

> Or we can have a new option that's like -Fp but every file
> is compressed individually in place, or files larger than N bytes are
> compressed in place using a configurable algorithm. It might be
> somewhat less efficient but it's also way less complicated to
> implement, and I think that should count for something.

Yea, I think that'd be a decent workaround.

> I don't want to get so caught up in advanced features here that we
> don't make any useful progress at all.
> If we can add better features without a large complexity increment,
> and without drawing objections from others on this list, great. If
> not, I'm prepared to summarily jettison it as nice-to-have but not
> essential.

Just to be clear: I am not at all advocating tying a change of the archive format to compression method / parallelism changes or anything.

> > I don't really see any of the concerns there to apply for the base
> > backup case.
>
> I felt like there was some reason that threads were bad, but it may
> have just been the case you mentioned and not relevant here.

I mean, they do have some serious issues when postgres infrastructure is needed. Not being threadsafe and all. One needs to be careful to not let "threads escape", to not fork() etc. That doesn't seem like a problem here though.

Greetings,

Andres Freund
On Wed, Apr 22, 2020 at 2:06 PM Andres Freund <andres@anarazel.de> wrote:
> I also can see a case for using N backends and one connection, but I
> think that'll be too complicated / too much bound by locking around the
> socket etc.

Agreed.

> Oh? I find it *extremely* exciting here. This is pretty close to the
> worst case compressibility-wise, and zstd takes only ~22% of the time as
> gzip does, while still delivering better compression. A nearly 5x
> improvement in compression times seems pretty exciting to me.
>
> Or do you mean for zstd over lz4, rather than anything over gzip? 1.8x
> -> 2.3x is a pretty decent improvement still, no? And being able to do
> it in 1/3 of the wall time seems pretty helpful.

I meant the latter thing, not the former. I'm taking it as given that we don't want gzip as the only option. Yes, 1.8x -> 2.3x is decent, but not as earth-shattering as 8.8x -> ~24x. In any case, I lean towards adding both lz4 and zstd as options, so I guess we're not really disagreeing here.

> > Parallel zstd still compresses somewhat better than single-core lz4,
> > but the difference in compression ratio is far less, and the amount of
> > CPU you have to burn in order to get that extra compression is pretty
> > large.
>
> It's "just" a ~2x difference for "level 1" compression, right? For
> having 1.9GiB less to write / permanently store of a 16GiB base
> backup that doesn't seem that bad to me.

Sure, sure. I'm just saying some people may not be OK with ramping up to 10 or more compression threads on their master server, if it's already heavily loaded, and maybe only has 4 vCPUs or whatever, so we should have lighter-weight options for those people. I'm not trying to argue against zstd or against the idea of ramping up large numbers of compression threads, just saying that lz4 looks awfully nice for people who need some compression but are tight on CPU cycles.

> I agree we should pick one. I think tar is not a great choice. .zip
> seems like it'd be a significant improvement - but not necessarily
> optimal.

Other ideas?

> > I don't want to get so caught up in advanced features here that we
> > don't make any useful progress at all. If we can add better features
> > without a large complexity increment, and without drawing objections
> > from others on this list, great. If not, I'm prepared to summarily
> > jettison it as nice-to-have but not essential.
>
> Just to be clear: I am not at all advocating tying a change of the
> archive format to compression method / parallelism changes or anything.

Good, thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2020-04-22 14:40:17 -0400, Robert Haas wrote:
> > Oh? I find it *extremely* exciting here. This is pretty close to the
> > worst case compressibility-wise, and zstd takes only ~22% of the time as
> > gzip does, while still delivering better compression. A nearly 5x
> > improvement in compression times seems pretty exciting to me.
> >
> > Or do you mean for zstd over lz4, rather than anything over gzip? 1.8x
> > -> 2.3x is a pretty decent improvement still, no? And being able to do
> > it in 1/3 of the wall time seems pretty helpful.
>
> I meant the latter thing, not the former. I'm taking it as given that
> we don't want gzip as the only option. Yes, 1.8x -> 2.3x is decent,
> but not as earth-shattering as 8.8x -> ~24x.

Ah, good.

> In any case, I lean towards adding both lz4 and zstd as options, so I
> guess we're not really disagreeing here.

We're agreeing, indeed ;)

> > I agree we should pick one. I think tar is not a great choice. .zip
> > seems like it'd be a significant improvement - but not necessarily
> > optimal.
>
> Other ideas?

The 7zip format, perhaps. Does have format level support to address what we were discussing earlier: "Support for solid compression, where multiple files of like type are compressed within a single stream, in order to exploit the combined redundancy inherent in similar files.".

Greetings,

Andres Freund
On Wed, Apr 22, 2020 at 3:03 PM Andres Freund <andres@anarazel.de> wrote: > The 7zip format, perhaps. Does have format level support to address what > we were discussing earlier: "Support for solid compression, where > multiple files of like type are compressed within a single stream, in > order to exploit the combined redundancy inherent in similar files.". I think that might not be a great choice. One potential problem is that according to https://www.7-zip.org/license.txt the license is partly LGPL, partly three-clause BSD with an advertising clause, and partly some strange mostly-free thing with reverse-engineering restrictions. That sounds pretty unappealing to me as a key dependency for core technology. It also seems like it's mostly a Windows thing. p7zip, the "port of the command line version of 7-Zip to Linux/Posix", last released a new version in 2016. I therefore think that there is room to question how well supported this is all going to be on the systems where most of us work all day. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Apr 20, 2020 at 4:19 PM Andres Freund <andres@anarazel.de> wrote:
> One question I have not really seen answered well:
>
> Why do we want parallelism here. Or to be more precise: What do we hope
> to accelerate by making what part of creating a base backup
> parallel. There's several potential bottlenecks, and I think it's
> important to know the design priorities to evaluate a potential design.

I spent some time today trying to understand just one part of this, which is how long it will take to write the base backup out to disk and whether having multiple independent processes helps. I settled on writing and fsyncing 64GB of data, written in 8kB chunks, divided into 1, 2, 4, 8, or 16 equal size files, with each file written by a separate process, and an fsync() at the end before process exit. So in this test, there is no question of whether the master can read the data fast enough, nor is there any issue of network bandwidth. It's purely a test of whether it's faster to have one process write a big file or whether it's faster to have multiple processes each write a smaller file.

I tested this on EDB's cthulhu. It's an older server, but it happens to have 4 mount points available for testing, one with XFS + magnetic disks, one with ext4 + magnetic disks, one with XFS + SSD, and one with ext4 + SSD. I did the experiment described above on each mount point separately, and then I also tried 4, 8, or 16 equal size files spread evenly across the 4 mount points.

To summarize the results very briefly:

1. ext4 degraded really badly with >4 concurrent writers. XFS did not.
2. SSDs were faster than magnetic disks, but you had to use XFS and >=4 concurrent writers to get the benefit.
3. Spreading writes across the mount points works well, but the slowest mount point sets the pace.
Here are more detailed results, with times in seconds:

filesystem media  1@64GB 2@32GB 4@16GB 8@8GB 16@4GB
xfs        mag     97     53     60     67    71
ext4       mag     94     68     66    335   549
xfs        ssd     97     55     33     27    25
ext4       ssd    116     70     66    227   450
spread     spread  n/a    n/a    48     42    44

The spread test with 16 files @ 4GB looks like this:

[/mnt/data-ssd/robert.haas/test14] open: 0, write: 7, fsync: 0, close: 0, total: 7
[/mnt/data-ssd/robert.haas/test10] open: 0, write: 7, fsync: 2, close: 0, total: 9
[/mnt/data-ssd/robert.haas/test2] open: 0, write: 7, fsync: 2, close: 0, total: 9
[/mnt/data-ssd/robert.haas/test6] open: 0, write: 7, fsync: 2, close: 0, total: 9
[/mnt/data-ssd2/robert.haas/test3] open: 0, write: 16, fsync: 0, close: 0, total: 16
[/mnt/data-ssd2/robert.haas/test11] open: 0, write: 16, fsync: 0, close: 0, total: 16
[/mnt/data-ssd2/robert.haas/test15] open: 0, write: 17, fsync: 0, close: 0, total: 17
[/mnt/data-ssd2/robert.haas/test7] open: 0, write: 18, fsync: 0, close: 0, total: 18
[/mnt/data-mag/robert.haas/test16] open: 0, write: 7, fsync: 18, close: 0, total: 25
[/mnt/data-mag/robert.haas/test4] open: 0, write: 7, fsync: 19, close: 0, total: 26
[/mnt/data-mag/robert.haas/test12] open: 0, write: 7, fsync: 19, close: 0, total: 26
[/mnt/data-mag/robert.haas/test8] open: 0, write: 7, fsync: 22, close: 0, total: 29
[/mnt/data-mag2/robert.haas/test9] open: 0, write: 20, fsync: 23, close: 0, total: 43
[/mnt/data-mag2/robert.haas/test13] open: 0, write: 18, fsync: 25, close: 0, total: 43
[/mnt/data-mag2/robert.haas/test5] open: 0, write: 19, fsync: 24, close: 0, total: 43
[/mnt/data-mag2/robert.haas/test1] open: 0, write: 18, fsync: 25, close: 0, total: 43

The fastest write performance of any test was the 16-way XFS-SSD test, which wrote at about 2.56 gigabytes per second. The fastest single-file test was on ext4-magnetic, though ext4-ssd and xfs-magnetic were similar, around 0.66 gigabytes per second. Your system must be a LOT faster, because you were seeing pg_basebackup running at, IIUC, ~3 gigabytes per second, and that would have been a second process both writing and doing other things. For comparison, some recent local pg_basebackup testing on this machine by some of my colleagues ran at about 0.82 gigabytes per second.

I suspect it would be possible to get significantly higher numbers on this hardware by (1) changing all the filesystems over to XFS and (2) dividing the data dynamically based on write speed rather than writing the same amount of it everywhere. I bet we could reach 6-8 gigabytes per second if we did all that.

Now, I don't know how much this matters. To get limited by this stuff, you'd need an incredibly fast network - 10 or maybe 40 or 100 Gigabit Ethernet or something like that - or to be doing a local backup. But I thought that it was interesting and that I should share it, so here you go! I do wonder if the apparent concurrency problems with ext4 might matter on systems with high connection counts just in normal operation, backups aside.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
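The attached test program itself isn't reproduced in the archive; the following is a rough, illustrative re-sketch (not the actual attachment) of the harness described above - one writer process per file, fixed-size writes, fsync() before exit, with the 64GB total and 8kB block size simply hard-coded - reporting the same open/write/fsync/close phases shown in the per-file output:

/* Illustrative sketch only -- not the attached write_and_fsync.c. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static double now(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void write_one_file(const char *path, long long filesize, size_t blocksize)
{
    char   *buf = calloc(1, blocksize);
    double  t0 = now();
    int     fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    double  t1 = now();

    if (fd < 0)
    {
        perror("open");
        exit(1);
    }
    for (long long written = 0; written < filesize; written += blocksize)
        if (write(fd, buf, blocksize) != (ssize_t) blocksize)
        {
            perror("write");
            exit(1);
        }
    double  t2 = now();
    fsync(fd);
    double  t3 = now();
    close(fd);
    double  t4 = now();

    printf("[%s] open: %.0f, write: %.0f, fsync: %.0f, close: %.0f, total: %.0f\n",
           path, t1 - t0, t2 - t1, t3 - t2, t4 - t3, t4 - t0);
    free(buf);
}

int main(int argc, char **argv)
{
    long long   total = 64LL * 1024 * 1024 * 1024;  /* 64GB split across all files */
    size_t      blocksize = 8192;
    int         nfiles = argc - 1;

    if (nfiles < 1)
    {
        fprintf(stderr, "usage: %s file [file ...]\n", argv[0]);
        return 1;
    }
    /* One child per target file; each writes its share and fsyncs before exiting. */
    for (int i = 1; i < argc; i++)
        if (fork() == 0)
        {
            write_one_file(argv[i], total / nfiles, blocksize);
            _exit(0);
        }
    while (wait(NULL) > 0)
        ;
    return 0;
}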
Hi,

On 2020-04-30 14:50:34 -0400, Robert Haas wrote:
> On Mon, Apr 20, 2020 at 4:19 PM Andres Freund <andres@anarazel.de> wrote:
> > One question I have not really seen answered well:
> >
> > Why do we want parallelism here. Or to be more precise: What do we hope
> > to accelerate by making what part of creating a base backup
> > parallel. There's several potential bottlenecks, and I think it's
> > important to know the design priorities to evaluate a potential design.
>
> I spent some time today trying to understand just one part of this,
> which is how long it will take to write the base backup out to disk
> and whether having multiple independent processes helps. I settled on
> writing and fsyncing 64GB of data, written in 8kB chunks

Why 8kB? That's smaller than what we currently do in pg_basebackup, afaict, and you're actually going to be bottlenecked by syscall overhead at that point (unless you disable / don't have the whole intel security mitigation stuff).

> , divided into 1, 2, 4, 8, or 16 equal size files, with each file
> written by a separate process, and an fsync() at the end before
> process exit. So in this test, there is no question of whether the
> master can read the data fast enough, nor is there any issue of
> network bandwidth. It's purely a test of whether it's faster to have
> one process write a big file or whether it's faster to have multiple
> processes each write a smaller file.

That's not necessarily the only question though, right? There's also the approach of one process writing out multiple files (via buffered, not async IO)? E.g. one basebackup connecting to multiple backends, or just shuffling multiple files through one copy stream.

> I tested this on EDB's cthulhu. It's an older server, but it happens
> to have 4 mount points available for testing, one with XFS + magnetic
> disks, one with ext4 + magnetic disks, one with XFS + SSD, and one
> with ext4 + SSD.

IIRC cthulhu's SSDs are not that fast, compared to NVMe storage (by nearly an order of magnitude). So this might be disadvantaging the parallel case more than it should. Also perhaps the ext4 disadvantage is smaller on more modern kernel versions?

If you can provide me with the test program, I'd happily run it on some decent, but not upper end, NVMe SSDs.

> The fastest write performance of any test was the 16-way XFS-SSD test,
> which wrote at about 2.56 gigabytes per second. The fastest
> single-file test was on ext4-magnetic, though ext4-ssd and
> xfs-magnetic were similar, around 0.66 gigabytes per second.

I think you might also be seeing some interaction with write caching on the raid controller here. The file sizes are small enough to fit in there to a significant degree for the single file tests.

> Your system must be a LOT faster, because you were seeing
> pg_basebackup running at, IIUC, ~3 gigabytes per second, and that
> would have been a second process both writing and doing other
> things.

Right. On my workstation I have an NVMe SSD that can do ~2.5 GiB/s sustained, in my laptop one that peaks at ~3.2GiB/s but then quickly goes to ~2GiB/s.

FWIW, I ran a "benchmark" just now just using dd, on my laptop, on battery (so take this with a huge grain of salt). With 1 dd writing out 150GiB in 8kB blocks I get 1.8GiB/s, and with two writing 75GiB each ~840MiB/s, with three writing 50GiB each 550MiB/s.

> Now, I don't know how much this matters. To get limited by this stuff,
> you'd need an incredibly fast network - 10 or maybe 40 or 100 Gigabit
> Ethernet or something like that - or to be doing a local backup. But I
> thought that it was interesting and that I should share it, so here
> you go! I do wonder if the apparent concurrency problems with ext4
> might matter on systems with high connection counts just in normal
> operation, backups aside.

I have seen such problems. Some of them have gotten better though. For most (all?) linux filesystems we can easily run into filesystem concurrency issues from within postgres. There's basically a file-level exclusive lock for buffered writes (only for the copy into the page cache though), due to posix requirements about the effects of a write being atomic.

Greetings,

Andres Freund
On Thu, Apr 30, 2020 at 3:52 PM Andres Freund <andres@anarazel.de> wrote:
> Why 8kB? That's smaller than what we currently do in pg_basebackup,
> afaict, and you're actually going to be bottlenecked by syscall
> overhead at that point (unless you disable / don't have the whole intel
> security mitigation stuff).

I just picked something. Could easily try other things.

> > , divided into 1, 2, 4, 8, or 16 equal size files, with each file
> > written by a separate process, and an fsync() at the end before
> > process exit. So in this test, there is no question of whether the
> > master can read the data fast enough, nor is there any issue of
> > network bandwidth. It's purely a test of whether it's faster to have
> > one process write a big file or whether it's faster to have multiple
> > processes each write a smaller file.
>
> That's not necessarily the only question though, right? There's also the
> approach of one process writing out multiple files (via buffered, not async
> IO)? E.g. one basebackup connecting to multiple backends, or just
> shuffling multiple files through one copy stream.

Sure, but that seems like it can't scale better than this. You have the scaling limitations of the filesystem, plus the possibility that the process is busy doing something else when it could be writing to any particular file.

> If you can provide me with the test program, I'd happily run it on some
> decent, but not upper end, NVMe SSDs.

It was attached, but I forgot to mention that in the body of the email.

> I think you might also be seeing some interaction with write caching on
> the raid controller here. The file sizes are small enough to fit in
> there to a significant degree for the single file tests.

Yeah, that's possible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
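For a rough sense of the syscall arithmetic behind the block-size question (illustrative numbers, not from the thread): writing 64GB in 8kB chunks means 64 GiB / 8 KiB = 8,388,608 write() calls per test, versus 1,048,576 calls at 64kB - an 8x reduction - so even on the order of a microsecond of per-call overhead adds up to several seconds of CPU time at the smaller size.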
On Thu, Apr 30, 2020 at 6:06 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Apr 30, 2020 at 3:52 PM Andres Freund <andres@anarazel.de> wrote:
> > Why 8kB? That's smaller than what we currently do in pg_basebackup,
> > afaict, and you're actually going to be bottlenecked by syscall
> > overhead at that point (unless you disable / don't have the whole intel
> > security mitigation stuff).
>
> I just picked something. Could easily try other things.

I tried changing the write size to 64kB, keeping the rest the same. Here are the results:

filesystem media  1@64GB 2@32GB 4@16GB 8@8GB 16@4GB
xfs        mag     65     53     64     74    79
ext4       mag     96     68     75    303   437
xfs        ssd     75     43     29     33    38
ext4       ssd     96     68     63    214   254
spread     spread  n/a    n/a    43     38    40

And here again are the previous results with an 8kB write size:

xfs        mag     97     53     60     67    71
ext4       mag     94     68     66    335   549
xfs        ssd     97     55     33     27    25
ext4       ssd    116     70     66    227   450
spread     spread  n/a    n/a    48     42    44

Generally, those numbers look better than the previous numbers, but parallelism still looks fairly appealing on the SSD storage - less so on magnetic disks, at least in this test.

Hmm, now I wonder what write size pg_basebackup is actually using.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2020-05-01 16:32:15 -0400, Robert Haas wrote:
> On Thu, Apr 30, 2020 at 6:06 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Thu, Apr 30, 2020 at 3:52 PM Andres Freund <andres@anarazel.de> wrote:
> > > Why 8kB? That's smaller than what we currently do in pg_basebackup,
> > > afaict, and you're actually going to be bottlenecked by syscall
> > > overhead at that point (unless you disable / don't have the whole intel
> > > security mitigation stuff).
> >
> > I just picked something. Could easily try other things.
>
> I tried changing the write size to 64kB, keeping the rest the same.
> Here are the results:
>
> filesystem media  1@64GB 2@32GB 4@16GB 8@8GB 16@4GB
> xfs        mag     65     53     64     74    79
> ext4       mag     96     68     75    303   437
> xfs        ssd     75     43     29     33    38
> ext4       ssd     96     68     63    214   254
> spread     spread  n/a    n/a    43     38    40
>
> And here again are the previous results with an 8kB write size:
>
> xfs        mag     97     53     60     67    71
> ext4       mag     94     68     66    335   549
> xfs        ssd     97     55     33     27    25
> ext4       ssd    116     70     66    227   450
> spread     spread  n/a    n/a    48     42    44
>
> Generally, those numbers look better than the previous numbers, but
> parallelism still looks fairly appealing on the SSD storage - less so
> on magnetic disks, at least in this test.

I spent a fair bit of time analyzing this, and my conclusion is that you might largely be seeing NUMA effects. Yay.

I don't have as large a NUMA machine at hand, but here's what I'm seeing on my local machine, during a run of writing out 400GiB (this is a run with noise on the machine; the benchmarks below are without that). The machine has 192GiB of RAM, evenly distributed to two sockets / NUMA domains.

At start I see:

numastat -m | grep -E 'MemFree|MemUsed|Dirty|Writeback|Active\(file\)|Inactive\(file\)'
MemFree                 91908.20        92209.85       184118.05
MemUsed                  3463.05         4553.33         8016.38
Active(file)              105.46          328.52          433.98
Inactive(file)             68.29          190.14          258.43
Dirty                       0.86            0.90            1.76
Writeback                   0.00            0.00            0.00
WritebackTmp                0.00            0.00            0.00

For a while there's pretty decent IO throughput (all 10s samples):

Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 1955.67 2299.32 0.00 0.00 42.48 1203.94 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 82.10 89.33

Then it starts to be slower on a sustained basis:

Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 1593.33 1987.85 0.00 0.00 42.90 1277.55 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 67.55 76.53

And then performance tanks completely:

Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 646.33 781.85 0.00 0.00 132.68 1238.70 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 85.43 58.63

That amount of degradation confused me for a while, especially because I couldn't reproduce it the more controlled I made the setups. In particular, I stopped seeing the same magnitude of issues after pinning processes to one NUMA socket (both running and memory).
After a few seconds:

Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 1882.00 2320.07 0.00 0.00 42.50 1262.35 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 79.05 88.07

MemFree                 35356.50        80986.46       116342.96
MemUsed                 60014.75        15776.72        75791.47
Active(file)              179.44          163.28          342.72
Inactive(file)          58293.18        13385.15        71678.33
Dirty                   18407.50          882.00        19289.50
Writeback                 235.78          335.43          571.21
WritebackTmp                0.00            0.00            0.00

A bit later IO starts to get slower:

Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 1556.30 1898.70 0.00 0.00 40.92 1249.29 0.00 0.00 0.00 0.00 0.00 0.00 0.20 24.00 62.90 72.01

MemFree                   519.56        36086.14        36605.70
MemUsed                 94851.69        60677.04       155528.73
Active(file)              303.84          212.96          516.80
Inactive(file)          92776.70        58133.28       150909.97
Dirty                   10913.20         5374.07        16287.27
Writeback                 812.94          331.96         1144.90
WritebackTmp                0.00            0.00            0.00

And then later it gets worse:

Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
nvme1n1 0.00 0.00 0.00 0.00 0.00 0.00 1384.70 1671.25 0.00 0.00 40.87 1235.91 0.00 0.00 0.00 0.00 0.00 0.00 0.20 7.00 55.89 63.45

MemFree                   519.54          242.98          762.52
MemUsed                 94851.71        96520.20       191371.91
Active(file)              175.82          246.03          421.85
Inactive(file)          92820.19        93985.79       186805.98
Dirty                   10482.75         4140.72        14623.47
Writeback                   0.00            0.00            0.00
WritebackTmp                0.00            0.00            0.00

When using a 1s iostat instead of a 10s one, it's noticeable that performance swings widely between very slow (<100MB/s) and very high throughput (>2500MB/s). It's clearly visible that performance degrades substantially first when all of one NUMA node's free memory is exhausted, and then when the second NUMA node's is.

Looking at a profile, I see a lot of cacheline bouncing between the kernel threads that "reclaim" pages (i.e. make them available for reuse), the kernel threads that write out dirty pages, the kernel threads where the IO completes (i.e. where the dirty bit can be flipped / locks get released), and the writing process. I think there's a lot that can improve on the kernel side - but it's not too surprising that letting the kernel cache / forcing it to make caching decisions for a large streaming write has substantial costs.

I changed Robert's test program to optionally fallocate, sync_file_range(WRITE), and posix_fadvise(DONTNEED), to avoid a large footprint in the page cache. The performance differences are quite substantial:

gcc -Wall -ggdb ~/tmp/write_and_fsync.c -o /tmp/write_and_fsync && \
  rm -ff /srv/dev/bench/test* && echo 3 | sudo tee /proc/sys/vm/drop_caches && \
  /tmp/write_and_fsync --sync_file_range=0 --fallocate=0 --fadvise=0 --filesize=$((400*1024*1024*1024)) /srv/dev/bench/test1

running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0
[/srv/dev/bench/test1][11450] open: 0, fallocate: 0 write: 214, fsync: 6, close: 0, total: 220

comparing that with --sync_file_range=1 --fallocate=1 --fadvise=1:

running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1
[/srv/dev/bench/test1][14098] open: 0, fallocate: 0 write: 161, fsync: 0, close: 0, total: 161

Below are the results of running the program with a variation of parameters (both file and results attached).
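First, though, a minimal sketch (not the attached program; it assumes Linux-specific fallocate()/sync_file_range(), and the 16MB flush batch is an arbitrary choice) of the cache-controlled write loop just described - write, start writeback early with sync_file_range(SYNC_FILE_RANGE_WRITE), and drop already written-back ranges with posix_fadvise(POSIX_FADV_DONTNEED):

/* Illustrative only: write a file while limiting its page cache footprint. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCKSIZE   8192
#define FLUSH_EVERY (16 * 1024 * 1024)  /* arbitrary: flush in 16MB batches */

int main(int argc, char **argv)
{
    long long   filesize = 1024LL * 1024 * 1024;    /* 1GB example */
    char       *buf = calloc(1, BLOCKSIZE);
    long long   written = 0, flushed = 0;
    int         fd;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);

    /* Pre-reserve the space, so the filesystem doesn't have to grow the file piecemeal. */
    fallocate(fd, 0, 0, filesize);

    while (written < filesize)
    {
        if (write(fd, buf, BLOCKSIZE) != BLOCKSIZE)
        {
            perror("write");
            return 1;
        }
        written += BLOCKSIZE;

        if (written - flushed >= FLUSH_EVERY)
        {
            /* Start writeback of the batch we just wrote ... */
            sync_file_range(fd, flushed, written - flushed, SYNC_FILE_RANGE_WRITE);
            /* ... and advise dropping earlier, already written-back ranges from the cache. */
            if (flushed > 0)
                posix_fadvise(fd, 0, flushed, POSIX_FADV_DONTNEED);
            flushed = written;
        }
    }
    fsync(fd);
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    free(buf);
    return 0;
}

The real program presumably differs in its details (per-process files, per-phase timing, option parsing); the point is just the write -> sync_file_range -> fadvise(DONTNEED) cadence that keeps the dirty/cached footprint bounded.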
I used perf stat in this run to measure the difference in CPU usage. ref_cycles is the number of CPU cycles, across all 20 cores / 40 threads, during which the CPUs were doing *something*; it is not affected by CPU frequency scaling, just by the time CPUs were not "halted", whereas cycles is affected by frequency scaling. A high ref_cycles_sec, combined with a decent number of total instructions/cycles, is *good*, because it indicates that fewer CPUs were used; a very high ref_cycles_tot means that more CPUs were busy doing something for the duration of the benchmark.

The run-to-run variation between the runs without cache control is pretty large, so these are probably not the end-all-be-all numbers, but I think the trends are pretty clear.

test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0 248.430736196 1,497,048,950,014 150.653M/sec 1,226,822,167,960 0.123GHz 705,950,461,166 0.54
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1 310.275952938 1,921,817,571,226 154.849M/sec 1,499,581,687,133 0.121GHz 944,243,167,053 0.59
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1 164.175492485 913,991,290,231 139.183M/sec 762,359,320,428 0.116GHz 678,451,556,273 0.84
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=0 243.609959554 1,802,385,405,203 184.970M/sec 1,449,560,513,247 0.149GHz 855,426,288,031 0.56
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=0 230.880100449 1,328,417,418,799 143.846M/sec 1,148,924,667,393 0.124GHz 723,158,246,628 0.63
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=1 253.591234992 1,548,485,571,798 152.658M/sec 1,229,926,994,613 0.121GHz 1,117,352,436,324 0.95
numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1 164.488835158 911,974,902,254 138.611M/sec 760,756,011,483 0.116GHz 672,105,046,261 0.84
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0 164.052510134 1,561,521,537,336 237.972M/sec 1,404,761,167,120 0.214GHz 715,274,337,015 0.51
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0 192.151682414 1,526,440,715,456 198.603M/sec 1,037,135,756,007 0.135GHz 802,754,964,096 0.76
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1 242.648245159 1,782,637,416,163 183.629M/sec 1,463,696,313,881 0.151GHz 1,000,100,694,932 0.69
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1 188.772193248 1,418,274,870,697 187.803M/sec 923,133,958,500 0.122GHz 799,212,291,243 0.92
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=0 421.580487642 2,756,486,952,728 163.449M/sec 1,387,708,033,752 0.082GHz 990,478,650,874 0.72
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=0 169.854206542 1,333,619,626,854 196.282M/sec 1,036,261,531,134 0.153GHz 666,052,333,591 0.64
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=1 305.078100578 1,970,042,289,192 161.445M/sec 1,505,706,462,812 0.123GHz 954,963,240,648 0.62
numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1 166.295223626 1,290,699,256,763 194.044M/sec 857,873,391,283 0.129GHz 761,338,026,415 0.89
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0 455.096916715 2,808,715,616,077 154.293M/sec 1,366,660,063,053 0.075GHz 888,512,073,477 0.66
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0 256.156100686 2,407,922,637,215 235.003M/sec 1,133,311,037,956 0.111GHz 748,666,206,805 0.65
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1 215.255015340 1,977,578,120,924 229.676M/sec 1,461,504,758,029 0.170GHz 1,005,270,838,642 0.68
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1 158.262790654 1,720,443,307,097 271.769M/sec 1,004,079,045,479 0.159GHz 826,905,592,751 0.84
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=0 334.932246893 2,366,388,662,460 176.628M/sec 1,216,049,589,993 0.091GHz 796,698,831,717 0.68
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=0 161.697270285 1,866,036,713,483 288.576M/sec 1,068,181,502,433 0.165GHz 739,559,279,008 0.70
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=1 231.440889430 1,965,389,749,057 212.391M/sec 1,407,927,406,358 0.152GHz 997,199,361,968 0.72
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1 214.433248700 2,232,198,239,769 260.300M/sec 1,073,334,918,389 0.125GHz 861,540,079,120 0.80
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=0 644.521613661 3,688,449,404,537 143.079M/sec 2,020,128,131,309 0.078GHz 961,486,630,359 0.48
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=0 243.830464632 1,499,608,983,445 153.756M/sec 1,227,468,439,403 0.126GHz 691,534,661,654 0.59
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=1 292.866419420 1,753,376,415,877 149.677M/sec 1,483,169,463,392 0.127GHz 860,035,914,148 0.56
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=1 162.152397194 925,643,754,128 142.719M/sec 743,208,501,601 0.115GHz 554,462,585,110 0.70
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=0 211.369510165 1,558,996,898,599 184.401M/sec 1,359,343,408,200 0.161GHz 766,769,036,524 0.57
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=0 233.315094908 1,427,133,080,540 152.927M/sec 1,166,000,868,597 0.125GHz 743,027,329,074 0.64
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=1 290.698155820 1,732,849,079,701 149.032M/sec 1,441,508,612,326 0.124GHz 835,039,426,282 0.57
numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=1 159.945462440 850,162,390,626 132.892M/sec 724,286,281,548 0.113GHz 670,069,573,150 0.90
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=0 163.244592275 1,524,807,507,173 233.531M/sec 1,398,319,581,978 0.214GHz 689,514,058,243 0.46
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=0 231.795934322 1,731,030,267,153 186.686M/sec 1,124,935,745,020 0.121GHz 736,084,922,669 0.70
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=1 315.564163702 1,958,199,733,216 155.128M/sec 1,405,115,546,716 0.111GHz 1,000,595,890,394 0.73
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=1 210.945487961 1,527,169,148,899 180.990M/sec 906,023,518,692 0.107GHz 700,166,552,207 0.80
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=0 161.759094088 1,468,321,054,671 226.934M/sec 1,221,167,105,510 0.189GHz 735,855,415,612 0.59
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=0 158.578248952 1,354,770,825,277 213.586M/sec 936,436,363,752 0.148GHz 654,823,079,884 0.68
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=1 274.628500801 1,792,841,068,080 163.209M/sec 1,343,398,055,199 0.122GHz 996,073,874,051 0.73
numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=1 179.140070123 1,383,595,004,328 193.095M/sec 850,299,722,091 0.119GHz 706,959,617,654 0.83
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=0 445.496787199 2,663,914,572,687 149.495M/sec 1,267,340,496,930 0.071GHz 787,469,552,454 0.62
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=0 261.866083604 2,325,884,820,091 222.043M/sec 1,094,814,208,219 0.105GHz 649,479,233,453 0.57
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=1 172.963505544 1,717,387,683,260 248.228M/sec 1,356,381,335,831 0.196GHz 822,256,638,370 0.58
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=1 157.934678897 1,650,503,807,778 261.266M/sec 970,705,561,971 0.154GHz 637,953,927,131 0.66
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=0 225.623143601 1,804,402,820,599 199.938M/sec 1,086,394,788,362 0.120GHz 656,392,112,807 0.62
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=0 157.930900998 1,797,506,082,342 284.548M/sec 1,001,509,813,741 0.159GHz 644,107,150,289 0.66
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=1 165.772265335 1,805,895,001,689 272.353M/sec 1,514,173,918,970 0.228GHz 823,435,044,810 0.54
numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=1 187.664764448 1,964,118,348,429 261.660M/sec 978,060,510,880 0.130GHz 668,316,194,988 0.67

Greetings,

Andres Freund
On Sat, May 2, 2020 at 10:36 PM Andres Freund <andres@anarazel.de> wrote:
> I changed Robert's test program to optionally fallocate, sync_file_range(WRITE), posix_fadvise(DONTNEED), to avoid a large footprint in the page cache. The performance differences are quite substantial:
>
> gcc -Wall -ggdb ~/tmp/write_and_fsync.c -o /tmp/write_and_fsync && \
> rm -f /srv/dev/bench/test* && echo 3 | sudo tee /proc/sys/vm/drop_caches && \
> /tmp/write_and_fsync --sync_file_range=0 --fallocate=0 --fadvise=0 --filesize=$((400*1024*1024*1024)) /srv/dev/bench/test1
>
> running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0
> [/srv/dev/bench/test1][11450] open: 0, fallocate: 0 write: 214, fsync: 6, close: 0, total: 220
>
> comparing that with --sync_file_range=1 --fallocate=1 --fadvise=1
> running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1
> [/srv/dev/bench/test1][14098] open: 0, fallocate: 0 write: 161, fsync: 0, close: 0, total: 161

Ah, nice.

> The run-to-run variation between the runs without cache control is pretty large, so these are probably not the end-all-be-all numbers, but I think the trends are pretty clear.

Could you be explicit about what you think those clear trends are?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2020-05-03 09:12:59 -0400, Robert Haas wrote:
> On Sat, May 2, 2020 at 10:36 PM Andres Freund <andres@anarazel.de> wrote:
> > I changed Robert's test program to optionally fallocate, sync_file_range(WRITE), posix_fadvise(DONTNEED), to avoid a large footprint in the page cache. The performance differences are quite substantial:
> >
> > gcc -Wall -ggdb ~/tmp/write_and_fsync.c -o /tmp/write_and_fsync && \
> > rm -f /srv/dev/bench/test* && echo 3 | sudo tee /proc/sys/vm/drop_caches && \
> > /tmp/write_and_fsync --sync_file_range=0 --fallocate=0 --fadvise=0 --filesize=$((400*1024*1024*1024)) /srv/dev/bench/test1
> >
> > running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0
> > [/srv/dev/bench/test1][11450] open: 0, fallocate: 0 write: 214, fsync: 6, close: 0, total: 220
> >
> > comparing that with --sync_file_range=1 --fallocate=1 --fadvise=1
> > running test with: numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1
> > [/srv/dev/bench/test1][14098] open: 0, fallocate: 0 write: 161, fsync: 0, close: 0, total: 161
>
> Ah, nice.

Btw, I forgot to include the result for 0 / 0 / 0 in the results (off-by-one error in a script :))

numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0 220.210155081 1,569,524,602,961 178.188M/sec 1,363,686,761,705 0.155GHz 833,345,334,408 0.68

> > The run-to-run variation between the runs without cache control is pretty large, so these are probably not the end-all-be-all numbers, but I think the trends are pretty clear.
>
> Could you be explicit about what you think those clear trends are?

Largely that concurrency can help a bit, but also hurt tremendously. Below is some more detailed analysis; it'll be a bit long...

Taking the no-concurrency / no-cache-management case as a baseline:

> test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0 220.210155081 1,569,524,602,961 178.188M/sec 1,363,686,761,705 0.155GHz 833,345,334,408 0.68

and comparing cache management with using some concurrency:

> test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1 164.175492485 913,991,290,231 139.183M/sec 762,359,320,428 0.116GHz 678,451,556,273 0.84
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0 164.052510134 1,561,521,537,336 237.972M/sec 1,404,761,167,120 0.214GHz 715,274,337,015 0.51

we can see very similar timing, which makes sense, because that's roughly the device's max speed. But then going to higher concurrency, there are clear regressions:

> test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0 455.096916715 2,808,715,616,077 154.293M/sec 1,366,660,063,053 0.075GHz 888,512,073,477 0.66

And I think it is instructive to look at the ref_cycles_tot/cycles_tot/instructions_tot vs ref_cycles_sec/cycles_sec/ipc. The units are confusing because they are across all cores and most are idle, but it's pretty obvious that numprocs=1 sfr=1 fadvise=1 has cores running for a lot shorter time (reference cycles basically count the time cores were running on an absolute time scale). Compared to numprocs=2 sfr=0 fadvise=0, which has the same resulting performance, it's clear that cores were busier, but less efficient (lower ipc).
With cache mangement there's very little benefit, and some risk (1->2 regression) in this workload with increasing concurrency: > test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1 164.175492485 913,991,290,231 139.183M/sec 762,359,320,428 0.116GHz 678,451,556,273 0.84 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1 188.772193248 1,418,274,870,697 187.803M/sec 923,133,958,500 0.122GHz 799,212,291,243 0.92 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1 158.262790654 1,720,443,307,097 271.769M/sec 1,004,079,045,479 0.159GHz 826,905,592,751 0.84 And there's good benefit, but tremendous risk, of concurrency in the no cache control case: > test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0 220.210155081 1,569,524,602,961 178.188M/sec 1,363,686,761,705 0.155GHz 833,345,334,408 0.68 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0 164.052510134 1,561,521,537,336 237.972M/sec 1,404,761,167,120 0.214GHz 715,274,337,015 0.51 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0 455.096916715 2,808,715,616,077 154.293M/sec 1,366,660,063,053 0.075GHz 888,512,073,477 0.66 sync file range without fadvise isn't a benefit at low concurrency, but prevents bad regressions at high concurency: > test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0 220.210155081 1,569,524,602,961 178.188M/sec 1,363,686,761,705 0.155GHz 833,345,334,408 0.68 > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0 248.430736196 1,497,048,950,014 150.653M/sec 1,226,822,167,960 0.123GHz 705,950,461,166 0.54 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0 164.052510134 1,561,521,537,336 237.972M/sec 1,404,761,167,120 0.214GHz 715,274,337,015 0.51 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0 192.151682414 1,526,440,715,456 198.603M/sec 1,037,135,756,007 0.135GHz 802,754,964,096 0.76 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0 455.096916715 2,808,715,616,077 154.293M/sec 1,366,660,063,053 0.075GHz 888,512,073,477 0.66 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0 256.156100686 2,407,922,637,215 235.003M/sec 1,133,311,037,956 0.111GHz 748,666,206,805 0.65 fadvise alone is similar: > test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0 220.210155081 1,569,524,602,961 178.188M/sec 1,363,686,761,705 0.155GHz 833,345,334,408 0.68 > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1 310.275952938 1,921,817,571,226 154.849M/sec 1,499,581,687,133 0.121GHz 944,243,167,053 0.59 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0 164.052510134 1,561,521,537,336 237.972M/sec 1,404,761,167,120 0.214GHz 715,274,337,015 0.51 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1 242.648245159 1,782,637,416,163 183.629M/sec 1,463,696,313,881 0.151GHz 1,000,100,694,932 0.69 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0 455.096916715 2,808,715,616,077 
154.293M/sec 1,366,660,063,053 0.075GHz 888,512,073,477 0.66 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1 215.255015340 1,977,578,120,924 229.676M/sec 1,461,504,758,029 0.170GHz 1,005,270,838,642 0.68 There does not appear to be a huge of benefit in fallocate in this workload, the OS's delayed allocation works well. Compare: numprocs=1 > test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0 220.210155081 1,569,524,602,961 178.188M/sec 1,363,686,761,705 0.155GHz 833,345,334,408 0.68 > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=0 243.609959554 1,802,385,405,203 184.970M/sec 1,449,560,513,247 0.149GHz 855,426,288,031 0.56 > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0 248.430736196 1,497,048,950,014 150.653M/sec 1,226,822,167,960 0.123GHz 705,950,461,166 0.54 > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=0 230.880100449 1,328,417,418,799 143.846M/sec 1,148,924,667,393 0.124GHz 723,158,246,628 0.63 > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1 310.275952938 1,921,817,571,226 154.849M/sec 1,499,581,687,133 0.121GHz 944,243,167,053 0.59 > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=1 253.591234992 1,548,485,571,798 152.658M/sec 1,229,926,994,613 0.121GHz 1,117,352,436,324 0.95 > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1 164.175492485 913,991,290,231 139.183M/sec 762,359,320,428 0.116GHz 678,451,556,273 0.84 > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1 164.488835158 911,974,902,254 138.611M/sec 760,756,011,483 0.116GHz 672,105,046,261 0.84 numprocs=2 > test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0 164.052510134 1,561,521,537,336 237.972M/sec 1,404,761,167,120 0.214GHz 715,274,337,015 0.51 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=0 421.580487642 2,756,486,952,728 163.449M/sec 1,387,708,033,752 0.082GHz 990,478,650,874 0.72 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0 192.151682414 1,526,440,715,456 198.603M/sec 1,037,135,756,007 0.135GHz 802,754,964,096 0.76 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=0 169.854206542 1,333,619,626,854 196.282M/sec 1,036,261,531,134 0.153GHz 666,052,333,591 0.64 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1 242.648245159 1,782,637,416,163 183.629M/sec 1,463,696,313,881 0.151GHz 1,000,100,694,932 0.69 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=1 305.078100578 1,970,042,289,192 161.445M/sec 1,505,706,462,812 0.123GHz 954,963,240,648 0.62 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1 188.772193248 1,418,274,870,697 187.803M/sec 923,133,958,500 0.122GHz 799,212,291,243 0.92 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1 166.295223626 1,290,699,256,763 194.044M/sec 857,873,391,283 0.129GHz 761,338,026,415 0.89 numprocs=4 > test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0 455.096916715 2,808,715,616,077 154.293M/sec 1,366,660,063,053 0.075GHz 888,512,073,477 0.66 > 
numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=0 334.932246893 2,366,388,662,460 176.628M/sec 1,216,049,589,993 0.091GHz 796,698,831,717 0.68 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0 256.156100686 2,407,922,637,215 235.003M/sec 1,133,311,037,956 0.111GHz 748,666,206,805 0.65 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=0 161.697270285 1,866,036,713,483 288.576M/sec 1,068,181,502,433 0.165GHz 739,559,279,008 0.70 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1 215.255015340 1,977,578,120,924 229.676M/sec 1,461,504,758,029 0.170GHz 1,005,270,838,642 0.68 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=1 231.440889430 1,965,389,749,057 212.391M/sec 1,407,927,406,358 0.152GHz 997,199,361,968 0.72 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1 158.262790654 1,720,443,307,097 271.769M/sec 1,004,079,045,479 0.159GHz 826,905,592,751 0.84 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1 214.433248700 2,232,198,239,769 260.300M/sec 1,073,334,918,389 0.125GHz 861,540,079,120 0.80 I would say that it seems to help concurrent cases without cache control, but not particularly reliably so. At higher concurrency it seems to hurt with cache control, not sure I undstand why. I was at first confused why 128kb write sizes hurt (128kb is probably on the higher end of useful, but I wanted to have see a more extreme difference): > test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0 220.210155081 1,569,524,602,961 178.188M/sec 1,363,686,761,705 0.155GHz 833,345,334,408 0.68 > numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=0 644.521613661 3,688,449,404,537 143.079M/sec 2,020,128,131,309 0.078GHz 961,486,630,359 0.48 > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0 248.430736196 1,497,048,950,014 150.653M/sec 1,226,822,167,960 0.123GHz 705,950,461,166 0.54 > numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=0 243.830464632 1,499,608,983,445 153.756M/sec 1,227,468,439,403 0.126GHz 691,534,661,654 0.59 > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1 310.275952938 1,921,817,571,226 154.849M/sec 1,499,581,687,133 0.121GHz 944,243,167,053 0.59 > numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=1 292.866419420 1,753,376,415,877 149.677M/sec 1,483,169,463,392 0.127GHz 860,035,914,148 0.56 > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1 164.175492485 913,991,290,231 139.183M/sec 762,359,320,428 0.116GHz 678,451,556,273 0.84 > numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=1 162.152397194 925,643,754,128 142.719M/sec 743,208,501,601 0.115GHz 554,462,585,110 0.70 > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=0 243.609959554 1,802,385,405,203 184.970M/sec 1,449,560,513,247 0.149GHz 855,426,288,031 0.56 > numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=0 211.369510165 1,558,996,898,599 184.401M/sec 1,359,343,408,200 0.161GHz 766,769,036,524 0.57 > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=0 230.880100449 1,328,417,418,799 143.846M/sec 1,148,924,667,393 0.124GHz 723,158,246,628 0.63 > numprocs=1 filesize=429496729600 
blocksize=131072 fallocate=1 sfr=1 fadvise=0 233.315094908 1,427,133,080,540 152.927M/sec 1,166,000,868,597 0.125GHz 743,027,329,074 0.64 > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=1 253.591234992 1,548,485,571,798 152.658M/sec 1,229,926,994,613 0.121GHz 1,117,352,436,324 0.95 > numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=1 290.698155820 1,732,849,079,701 149.032M/sec 1,441,508,612,326 0.124GHz 835,039,426,282 0.57 > numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1 164.488835158 911,974,902,254 138.611M/sec 760,756,011,483 0.116GHz 672,105,046,261 0.84 > numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=1 159.945462440 850,162,390,626 132.892M/sec 724,286,281,548 0.113GHz 670,069,573,150 0.90 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0 164.052510134 1,561,521,537,336 237.972M/sec 1,404,761,167,120 0.214GHz 715,274,337,015 0.51 > numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=0 163.244592275 1,524,807,507,173 233.531M/sec 1,398,319,581,978 0.214GHz 689,514,058,243 0.46 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0 192.151682414 1,526,440,715,456 198.603M/sec 1,037,135,756,007 0.135GHz 802,754,964,096 0.76 > numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=0 231.795934322 1,731,030,267,153 186.686M/sec 1,124,935,745,020 0.121GHz 736,084,922,669 0.70 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1 242.648245159 1,782,637,416,163 183.629M/sec 1,463,696,313,881 0.151GHz 1,000,100,694,932 0.69 > numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=1 315.564163702 1,958,199,733,216 155.128M/sec 1,405,115,546,716 0.111GHz 1,000,595,890,394 0.73 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1 188.772193248 1,418,274,870,697 187.803M/sec 923,133,958,500 0.122GHz 799,212,291,243 0.92 > numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=1 210.945487961 1,527,169,148,899 180.990M/sec 906,023,518,692 0.107GHz 700,166,552,207 0.80 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=0 421.580487642 2,756,486,952,728 163.449M/sec 1,387,708,033,752 0.082GHz 990,478,650,874 0.72 > numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=0 161.759094088 1,468,321,054,671 226.934M/sec 1,221,167,105,510 0.189GHz 735,855,415,612 0.59 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=0 169.854206542 1,333,619,626,854 196.282M/sec 1,036,261,531,134 0.153GHz 666,052,333,591 0.64 > numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=0 158.578248952 1,354,770,825,277 213.586M/sec 936,436,363,752 0.148GHz 654,823,079,884 0.68 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=1 305.078100578 1,970,042,289,192 161.445M/sec 1,505,706,462,812 0.123GHz 954,963,240,648 0.62 > numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=1 274.628500801 1,792,841,068,080 163.209M/sec 1,343,398,055,199 0.122GHz 996,073,874,051 0.73 > numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1 166.295223626 1,290,699,256,763 194.044M/sec 857,873,391,283 0.129GHz 761,338,026,415 0.89 > numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=1 179.140070123 1,383,595,004,328 193.095M/sec 850,299,722,091 0.119GHz 
706,959,617,654 0.83 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0 455.096916715 2,808,715,616,077 154.293M/sec 1,366,660,063,053 0.075GHz 888,512,073,477 0.66 > numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=0 445.496787199 2,663,914,572,687 149.495M/sec 1,267,340,496,930 0.071GHz 787,469,552,454 0.62 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0 256.156100686 2,407,922,637,215 235.003M/sec 1,133,311,037,956 0.111GHz 748,666,206,805 0.65 > numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=0 261.866083604 2,325,884,820,091 222.043M/sec 1,094,814,208,219 0.105GHz 649,479,233,453 0.57 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1 215.255015340 1,977,578,120,924 229.676M/sec 1,461,504,758,029 0.170GHz 1,005,270,838,642 0.68 > numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=1 172.963505544 1,717,387,683,260 248.228M/sec 1,356,381,335,831 0.196GHz 822,256,638,370 0.58 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1 158.262790654 1,720,443,307,097 271.769M/sec 1,004,079,045,479 0.159GHz 826,905,592,751 0.84 > numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=1 157.934678897 1,650,503,807,778 261.266M/sec 970,705,561,971 0.154GHz 637,953,927,131 0.66 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=0 334.932246893 2,366,388,662,460 176.628M/sec 1,216,049,589,993 0.091GHz 796,698,831,717 0.68 > numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=0 225.623143601 1,804,402,820,599 199.938M/sec 1,086,394,788,362 0.120GHz 656,392,112,807 0.62 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=0 161.697270285 1,866,036,713,483 288.576M/sec 1,068,181,502,433 0.165GHz 739,559,279,008 0.70 > numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=0 157.930900998 1,797,506,082,342 284.548M/sec 1,001,509,813,741 0.159GHz 644,107,150,289 0.66 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=1 231.440889430 1,965,389,749,057 212.391M/sec 1,407,927,406,358 0.152GHz 997,199,361,968 0.72 > numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=1 165.772265335 1,805,895,001,689 272.353M/sec 1,514,173,918,970 0.228GHz 823,435,044,810 0.54 > numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1 214.433248700 2,232,198,239,769 260.300M/sec 1,073,334,918,389 0.125GHz 861,540,079,120 0.80 > numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=1 187.664764448 1,964,118,348,429 261.660M/sec 978,060,510,880 0.130GHz 668,316,194,988 0.67 It's pretty clear that the larger write block size can hurt quite badly. I was somewhat confused by this at first, but after thinking about it for a while longer it actually makes sense: For the OS to finish an 8k write it needs to find two free pagecache pages. For an 128k write it needs to find 32. Which means that it's much more likely that kernel threads and the writes are going to fight over locks / cachelines: In the 8k page it's quite likely that ofen the kernel threads will do so while the memcpy() from userland is happening, but that's less the case with 32 pages that need to be acquired before the memcpy() can happen. 
With cache control that problem doesn't exist, which is why the larger block size is beneficial:

> test time ref_cycles_tot ref_cycles_sec cycles_tot cycles_sec instructions_tot ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1 164.175492485 913,991,290,231 139.183M/sec 762,359,320,428 0.116GHz 678,451,556,273 0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=1 162.152397194 925,643,754,128 142.719M/sec 743,208,501,601 0.115GHz 554,462,585,110 0.70
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1 164.488835158 911,974,902,254 138.611M/sec 760,756,011,483 0.116GHz 672,105,046,261 0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=1 159.945462440 850,162,390,626 132.892M/sec 724,286,281,548 0.113GHz 670,069,573,150 0.90
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1 188.772193248 1,418,274,870,697 187.803M/sec 923,133,958,500 0.122GHz 799,212,291,243 0.92
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=1 210.945487961 1,527,169,148,899 180.990M/sec 906,023,518,692 0.107GHz 700,166,552,207 0.80
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1 166.295223626 1,290,699,256,763 194.044M/sec 857,873,391,283 0.129GHz 761,338,026,415 0.89
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=1 179.140070123 1,383,595,004,328 193.095M/sec 850,299,722,091 0.119GHz 706,959,617,654 0.83
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1 158.262790654 1,720,443,307,097 271.769M/sec 1,004,079,045,479 0.159GHz 826,905,592,751 0.84
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=1 157.934678897 1,650,503,807,778 261.266M/sec 970,705,561,971 0.154GHz 637,953,927,131 0.66
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1 214.433248700 2,232,198,239,769 260.300M/sec 1,073,334,918,389 0.125GHz 861,540,079,120 0.80
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=1 187.664764448 1,964,118,348,429 261.660M/sec 978,060,510,880 0.130GHz 668,316,194,988 0.67

Note how, especially in the first few cases, the total number of instructions required is improved (although due to the way I did the perf stat the sampling error is pretty large).

I haven't run that test yet, but after looking at all this I would bet that reducing the block size to 4kb (i.e. a single os/hw page) would help the no cache control case significantly, in particular in the concurrent case. And conversely, I'd expect that the CPU efficiency will be improved by a larger block size for the cache control case for just about any realistic block size.

I'd love to have faster storage available (faster NVMes, or multiple ones I can use for benchmarking) to see what the cutoff point for actually benefiting from concurrency is.

Also worthwhile to note that even the "best case" from a CPU usage point of view here absolutely *pales* against using direct-IO.
It's not an apples/apples comparison, but comparing buffered IO using write_and_fsync with unbuffered IO using fio:

128KiB blocksize:

write_and_fsync:
echo 3 | sudo tee /proc/sys/vm/drop_caches && /usr/bin/time perf stat -a -e cpu-clock,ref-cycles,cycles,instructions /tmp/write_and_fsync --blocksize $((128*1024)) --sync_file_range=1 --fallocate=1 --fadvise=1 --sequential=0 --filesize=$((400*1024*1024*1024)) /srv/dev/bench/test1

Performance counter stats for 'system wide':

    6,377,903.65 msec cpu-clock       # 39.999 CPUs utilized
  628,014,590,200     ref-cycles      # 98.467 M/sec
  634,468,623,514     cycles          #  0.099 GHz
  795,771,756,320     instructions    #  1.25  insn per cycle

   159.451492209 seconds time elapsed

fio:
rm -f /srv/dev/bench/test* && echo 3 | sudo tee /proc/sys/vm/drop_caches && /usr/bin/time perf stat -a -e cpu-clock,ref-cycles,cycles,instructions fio --name=test --iodepth=512 --iodepth_low=8 --iodepth_batch_submit=8 --iodepth_batch_complete_min=8 --iodepth_batch_complete_max=128 --ioengine=libaio --rw=write --bs=128k --filesize=$((400*1024*1024*1024)) --direct=1 --numjobs=1

Performance counter stats for 'system wide':

    6,313,522.71 msec cpu-clock       # 39.999 CPUs utilized
  458,476,185,800     ref-cycles      # 72.618 M/sec
  196,148,015,054     cycles          #  0.031 GHz
  158,921,457,853     instructions    #  0.81  insn per cycle

   157.842080440 seconds time elapsed

CPU usage most of the time was around 98% for write_and_fsync and 40% for fio. I.e. system-wide CPUs were active 0.73x the time, and 0.2x as many instructions had to be executed in the DIO case.

Greetings,

Andres Freund
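To make concrete what fio's --direct=1 is doing differently from the buffered write_and_fsync path: the writes bypass the page cache entirely, at the cost of alignment requirements on both the buffer and the IO size. A hypothetical minimal C equivalent (not part of the benchmarks above; file path, sizes and the lack of error handling are illustrative only):

/*
 * Hypothetical sketch of an O_DIRECT write loop.  Both the buffer address
 * and the write size must be suitably aligned (typically 4kB / the
 * device's logical block size), or the writes fail with EINVAL.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    const size_t blocksize = 128 * 1024;                  /* matches the fio run */
    const off_t  filesize = (off_t) 1024 * 1024 * 1024;   /* small 1GB demo */
    void        *buf;
    int          fd;

    fd = open("/srv/dev/bench/test_dio",
              O_CREAT | O_TRUNC | O_WRONLY | O_DIRECT, 0644);
    posix_memalign(&buf, 4096, blocksize);
    memset(buf, 'x', blocksize);

    for (off_t written = 0; written < filesize; written += blocksize)
        write(fd, buf, blocksize);      /* bypasses the page cache */

    fsync(fd);      /* still needed, for file metadata */
    close(fd);
    free(buf);
    return 0;
}

Because no pagecache pages are dirtied, none of the reclaim/writeback kernel-thread work shows up in the system-wide counters, which is a large part of the instruction-count difference above.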
On Sun, May 3, 2020 at 1:49 PM Andres Freund <andres@anarazel.de> wrote:
> > > The run-to-run variation between the runs without cache control is pretty large, so these are probably not the end-all-be-all numbers, but I think the trends are pretty clear.
> >
> > Could you be explicit about what you think those clear trends are?
>
> Largely that concurrency can help a bit, but also hurt tremendously. Below is some more detailed analysis; it'll be a bit long...

OK, thanks. Let me see if I can summarize here. On the strength of previous experience, you'll probably tell me that some parts of this summary are wildly wrong or at least "not quite correct", but I'm going to try my best.

- Server-side compression seems like it has the potential to be a significant win by stretching bandwidth. We likely need to do it with 10+ parallel threads, at least for stronger compressors, but these might be threads within a single PostgreSQL process rather than multiple separate backends.

- Client-side cache management -- that is, use of posix_fadvise(DONTNEED), posix_fallocate, and sync_file_range, where available -- looks like it can improve write rates and CPU efficiency significantly. Larger block sizes show a win when used together with such techniques.

- The benefits of multiple concurrent connections remain somewhat elusive. Peter Eisentraut hypothesized upthread that such an approach might be the most practical way forward for networks with a high bandwidth-delay product, and I hypothesized that such an approach might be beneficial when there are multiple tablespaces on independent disks, but we don't have clear experimental support for those propositions. Also, both your data and mine indicate that too much parallelism can lead to major regressions.

- Any work we do while trying to make backup super-fast should also lend itself to super-fast restore, possibly including parallel restore. Compressed tarfiles don't permit random access to member files. Uncompressed tarfiles do, but software that works this way is not commonplace. The only mainstream archive format that seems to support random access is zip. Adopting that wouldn't be crazy, but might limit our choice of compression options more than we'd like. A tar file of individually compressed files might be a plausible alternative, though there would probably be some hit to compression ratios for small files. Then again, if a single, highly-efficient process can handle a server-to-client backup, maybe the same is true for extracting a compressed tarfile...

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
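To illustrate the kind of "random access" an uncompressed tarfile does permit: a reader can hop from member header to member header with lseek() and never touch the data of files it does not care about, because every member is a 512-byte header followed by its data padded to a 512-byte boundary. A rough sketch (plain ustar only, ignoring pax/GNU extensions and error handling; not code from the thread):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    char header[512];
    int  fd;

    if (argc != 2)
        return 1;
    fd = open(argv[1], O_RDONLY);

    while (read(fd, header, sizeof(header)) == sizeof(header) &&
           header[0] != '\0')       /* all-zero block terminates the archive */
    {
        /* member name is at offset 0 (100 bytes), size at offset 124 (octal ASCII) */
        unsigned long size = strtoul(header + 124, NULL, 8);

        printf("%.*s: %lu bytes\n", 100, header, size);

        /* seek past the data, rounded up to the next 512-byte block */
        lseek(fd, (size + 511) & ~511UL, SEEK_CUR);
    }
    close(fd);
    return 0;
}

The catch, as discussed in the next message, is that one still walks the whole archive header by header; what staying uncompressed buys is not having to read the data in between.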
Hi,

On 2020-05-04 14:04:32 -0400, Robert Haas wrote:
> OK, thanks. Let me see if I can summarize here. On the strength of previous experience, you'll probably tell me that some parts of this summary are wildly wrong or at least "not quite correct", but I'm going to try my best.
>
> - Server-side compression seems like it has the potential to be a significant win by stretching bandwidth. We likely need to do it with 10+ parallel threads, at least for stronger compressors, but these might be threads within a single PostgreSQL process rather than multiple separate backends.

That seems right. I think it might be reasonable to just support "compression parallelism" for zstd, as the library has all the code internally. So we basically wouldn't have to care about it.

> - Client-side cache management -- that is, use of posix_fadvise(DONTNEED), posix_fallocate, and sync_file_range, where available -- looks like it can improve write rates and CPU efficiency significantly. Larger block sizes show a win when used together with such techniques.

Yea. Alternatively direct IO, but I am not sure we want to go there for now.

> - The benefits of multiple concurrent connections remain somewhat elusive. Peter Eisentraut hypothesized upthread that such an approach might be the most practical way forward for networks with a high bandwidth-delay product, and I hypothesized that such an approach might be beneficial when there are multiple tablespaces on independent disks, but we don't have clear experimental support for those propositions. Also, both your data and mine indicate that too much parallelism can lead to major regressions.

I think for that we'd basically have to create two high-bandwidth nodes across the pond. My experience in the somewhat recent past is that I could saturate multi-gbit cross-atlantic links without too much trouble, at least once I changed net.ipv4.tcp_congestion_control to something appropriate for such setups (BBR is probably the thing to use here these days).

> - Any work we do while trying to make backup super-fast should also lend itself to super-fast restore, possibly including parallel restore.

I'm not sure I see a super clear case for parallel restore in any of the experiments done so far. The only case where we know it's a clear win is when there are independent filesystems for parts of the data. There's an obvious case for parallel decompression, however.

> Compressed tarfiles don't permit random access to member files.

This is an issue for selective restores too, not just parallel restore. I'm not sure how important a case that is, although it'd certainly be useful if e.g. pg_rewind could read from compressed base backups.

> Uncompressed tarfiles do, but software that works this way is not commonplace.

I am not 100% sure which part you're saying is not commonplace here. Supporting random access to data in tarfiles? My understanding is that one still has to "skip" through the entire archive, right? What not being compressed buys you is not having to read the files in between. Given the size of our data files compared to the metadata size, that's probably fine?

> The only mainstream archive format that seems to support random access is zip. Adopting that wouldn't be crazy, but might limit our choice of compression options more than we'd like.
I'm not sure that's *really* an issue - there are compression format codes in zip ([1] 4.4.5, also 4.3.14.3 & 4.5 for another approach), and several tools seem to have used that to add additional compression methods.

> A tar file of individually compressed files might be a plausible alternative, though there would probably be some hit to compression ratios for small files.

I'm not entirely sure using zip over an uncompressed tar of individually compressed files gains us all that much. AFAIU zip compresses each file individually. So the advantage would be more efficient (less seeking) storage of archive metadata (i.e. which file is where), and that the metadata could be compressed.

> Then again, if a single, highly-efficient process can handle a server-to-client backup, maybe the same is true for extracting a compressed tarfile...

Yea. I'd expect that to be the case, at least for the single-filesystem case. Depending on the way multiple tablespaces / filesystems are handled, it could even be doable to handle that reasonably - but it'd probably be harder.

Greetings,

Andres Freund

[1] https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
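To illustrate the "the library has all the code internally" point about zstd made earlier in this message: with libzstd's advanced streaming API, parallelism is just another compression parameter, so neither the server nor the client would need its own worker management. A rough sketch (assuming a libzstd new enough for ZSTD_c_nbWorkers and built with threading support; error handling omitted):

/*
 * Stream data from 'in' to 'out', letting libzstd fan the compression work
 * out to 'workers' internal threads.  Sketch only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

static void
compress_stream(FILE *in, FILE *out, int level, int workers)
{
    ZSTD_CCtx  *cctx = ZSTD_createCCtx();
    size_t      in_cap = ZSTD_CStreamInSize();
    size_t      out_cap = ZSTD_CStreamOutSize();
    void       *in_buf = malloc(in_cap);
    void       *out_buf = malloc(out_cap);
    size_t      nread;

    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, workers);   /* the parallelism knob */

    do
    {
        ZSTD_EndDirective mode;
        size_t            remaining;

        nread = fread(in_buf, 1, in_cap, in);
        mode = (nread < in_cap) ? ZSTD_e_end : ZSTD_e_continue;

        ZSTD_inBuffer input = { in_buf, nread, 0 };

        do
        {
            ZSTD_outBuffer output = { out_buf, out_cap, 0 };

            remaining = ZSTD_compressStream2(cctx, &output, &input, mode);
            fwrite(out_buf, 1, output.pos, out);
        } while (mode == ZSTD_e_end ? remaining != 0 : input.pos < input.size);
    } while (nread == in_cap);

    ZSTD_freeCCtx(cctx);
    free(in_buf);
    free(out_buf);
}

Whether the tarball is compressed on the server or the client side, the calling code looks the same; only which process runs it changes.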