Thread: large regression for parallel COPY
On Thu, Mar 10, 2016 at 8:29 PM, Andres Freund <andres@anarazel.de> wrote:
> Allow to trigger kernel writeback after a configurable number of writes.

While testing out Dilip Kumar's relation extension patch today, I
discovered (with some help from Andres) that this causes nasty
regressions when doing parallel COPY on hydra (3.2.6-3.fc16.ppc64,
lousy disk subsystem). What I did was (1) run pgbench -i -s 100, (2)
copy the results to a file, (3) truncate the original table and drop
its indexes, and (4) try copying in one or more copies of the data
from the file. Typical command line:

time pgbench -n -f f -t 1 -c 4 -j 4 && psql -c "select
pg_size_pretty(pg_relation_size('pgbench_accounts'));" && time psql -c
checkpoint && psql -c "truncate pgbench_accounts; checkpoint;"

With default settings against
96f8373cad5d6066baeb7a1c5a88f6f5c9661974, pgbench takes 9 to 9.5
minutes and the subsequent checkpoint takes 9 seconds. After setting
backend_flush_after=0, bgwriter_flush_after=0, it takes 1 minute and
11 seconds and the subsequent checkpoint takes 11 seconds. With a
single copy of the data (that is, -c 1 -j 1 but otherwise as above),
it takes 28-29 seconds with default settings and 26-27 seconds with
backend_flush_after=0, bgwriter_flush_after=0. So the difference is
rather small with a straight-up COPY, but with 4 copies running at the
same time, it's near enough to an order of magnitude.

Andres reports that on his machine, non-zero *_flush_after settings
make things faster, not slower, so apparently this is hardware- or
kernel-dependent. Nevertheless, it seems to me that we should try to
get some broader testing here to see which experience is typical.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
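For anyone who wants to try reproducing this, a rough sketch of the steps above. The original post doesn't show the contents of the per-client script file "f" or the dump path, so those (and the index name) are guesses here:

```shell
# Step 1: build the pgbench tables at scale 100.
pgbench -i -s 100

# Step 2: dump pgbench_accounts out to a flat file (path is an assumption).
psql -c "copy pgbench_accounts to '/tmp/accounts.copy'"

# Step 3: truncate the table and drop its index; pgbench -i creates the
# primary key pgbench_accounts_pkey on the table.
psql -c "truncate pgbench_accounts"
psql -c "alter table pgbench_accounts drop constraint pgbench_accounts_pkey"

# Step 4: the script "f" presumably contains a single COPY, so that each
# pgbench client performs one full COPY of the data.
echo "copy pgbench_accounts from '/tmp/accounts.copy';" > f

# Run 4 concurrent copies of the same data and time them.
time pgbench -n -f f -t 1 -c 4 -j 4
```

With -t 1 each client runs the script exactly once, so -c 4 -j 4 gives four concurrent COPYs into the same table, which is what stresses the interleaved-write path.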
On 2016-03-30 15:50:21 -0400, Robert Haas wrote:
> On Thu, Mar 10, 2016 at 8:29 PM, Andres Freund <andres@anarazel.de> wrote:
> > Allow to trigger kernel writeback after a configurable number of writes.
> [...]
> Andres reports that on his machine, non-zero *_flush_after settings
> make things faster, not slower, so apparently this is
> hardware-dependent or kernel-dependent. Nevertheless, it seems to me
> that we should try to get some broader testing here to see which
> experience is typical.

Indeed. On SSDs I see about a 25-35% gain, on HDDs about 5%. If I
increase backend_flush_after to 64 pages (as it is for bgwriter),
however, I do get about 15% on HDDs as well.
I wonder if the default value of backend_flush_after is too small for
some scenarios. I reasoned that backend_flush_after should have a
*lower* default value than e.g. checkpointer's or bgwriter's, because
there are many concurrent writers increasing the total amount of
unflushed dirty writes. That is true for OLTP write workloads, but
less so for bulk loads.

Andres
On 03/30/2016 01:10 PM, Andres Freund wrote:
> On 2016-03-30 15:50:21 -0400, Robert Haas wrote:
>> On Thu, Mar 10, 2016 at 8:29 PM, Andres Freund <andres@anarazel.de> wrote:
>>> Allow to trigger kernel writeback after a configurable number of writes.
>>
>> While testing out Dilip Kumar's relation extension patch today, I
>> discovered (with some help from Andres) that this causes nasty
>> regressions when doing parallel COPY on hydra (3.2.6-3.fc16.ppc64,
>> lousy disk subsystem).

Unless Fedora/Red Hat fixed the 3.2 kernel in a subsequent patch
(.6-3?), I would look hard right at that. Kernels from 3.2 through 3.8
are going to be miserable for anything doing concurrent writes. I
understand that this is a regression regardless, but I think we need
wider testing to see if the changes are somehow related.

Sincerely,

jD

--
Command Prompt, Inc.  http://the.postgres.company/  +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
On Wed, Mar 30, 2016 at 4:10 PM, Andres Freund <andres@anarazel.de> wrote:
> Indeed. On SSDs I see about a 25-35% gain, on HDDs about 5%. If I
> increase the size of backend_flush_after to 64 (like it's for bgwriter)
> I however do get about 15% for HDDs as well.

I tried the same test mentioned in the original post on cthulhu (EDB
machine, CentOS 7.2, 8 sockets, 8 cores per socket, 2 threads per
core, Xeon E7-8830 @ 2.13 GHz). I attempted to test both the effects
of multi_extend_v21 and the *_flush_after settings. The machine has
both HD and SSD, but I used HD for this test.

master, logged tables, 4 parallel copies: 1m15.411s, 1m14.248s, 1m15.040s
master, logged tables, 1 copy: 0m28.336s, 0m28.040s, 0m29.576s
multi_extend_v21, logged tables, 4 parallel copies: 0m46.058s, 0m44.515s, 0m45.688s
multi_extend_v21, logged tables, 1 copy: 0m28.440s, 0m28.129s, 0m30.698s
master, logged tables, 4 parallel copies, {backend,bgwriter}_flush_after=0: 1m2.817s, 1m4.467s, 1m12.319s
multi_extend_v21, logged tables, 4 parallel copies, {backend,bgwriter}_flush_after=0: 0m41.301s, 0m41.104s, 0m41.342s
master, logged tables, 1 copy, {backend,bgwriter}_flush_after=0: 0m26.948s, 0m26.829s, 0m26.616s

So the flushing is a small win with only 1 parallel copy, but with 4
parallel copies it's a significant loss. However, the relation
extension patch reduces the regression significantly, probably because
it makes it far more likely that a backend doing a flush is flushing a
consecutive range of blocks, all of which it added to the relation, so
that there is no interleaving.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello Robert,

> I tried the same test mentioned in the original post on cthulhu (EDB
> machine, CentOS 7.2, 8 sockets, 8 cores per socket, 2 threads per
> core, Xeon E7-8830 @ 2.13 GHz). I attempted to test both the effects
> of multi_extend_v21 and the *_flush_after settings.

I'm not sure of the intrinsic effectiveness of
{backend,bgwriter}_flush_after, especially on HDDs: for the
checkpointer (checkpoint_flush_after) there is a great deal of effort
to generate large sequential writes, but there are no such provisions
for the other write activities.

I'm not sure how the write activity of the "parallel" copy is
organized, but it sounds like it will generate fewer sequential writes
than before, and the negative performance impact could be accentuated
by flushing.

This might suggest that the benefit of these two settings is more
irregular/hard to predict, so their default value should be 0 (i.e.
off)? Or maybe the documentation should warn clearly about the
uncertain effects of these two settings.

--
Fabien.
On 2016-04-05 17:12:11 -0400, Robert Haas wrote:
> On Wed, Mar 30, 2016 at 4:10 PM, Andres Freund <andres@anarazel.de> wrote:
> > Indeed. On SSDs I see about a 25-35% gain, on HDDs about 5%. If I
> > increase the size of backend_flush_after to 64 (like it's for bgwriter)
> > I however do get about 15% for HDDs as well.
>
> I tried the same test mentioned in the original post on cthulhu (EDB
> machine, CentOS 7.2, 8 sockets, 8 cores per socket, 2 threads per
> core, Xeon E7-8830 @ 2.13 GHz). I attempted to test both the effects
> of multi_extend_v21 and the *_flush_after settings. The machine has
> both HD and SSD, but I used HD for this test.
> [...]
> master, logged tables, 4 parallel copies, {backend,bgwriter}_flush_after=0: 1m2.817s, 1m4.467s, 1m12.319s
> multi_extend_v21, logged tables, 4 parallel copies, {backend,bgwriter}_flush_after=0: 0m41.301s, 0m41.104s, 0m41.342s
> master, logged tables, 1 copy, {backend,bgwriter}_flush_after=0: 0m26.948s, 0m26.829s, 0m26.616s

Any chance you could repeat with backend_flush_after set to 64? I
wonder if the current default isn't just too small for HDDs, due to
their increased latency.

- Andres
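For the record, one way to flip the setting for such a re-run without editing postgresql.conf by hand (a sketch; assumes a build where the GUC exists, and if I read the units right, the value is in 8kB blocks, so 64 means 512kB):

```shell
# Raise backend_flush_after to 64 blocks (64 x 8kB = 512kB) and reload.
psql -c "alter system set backend_flush_after = 64"
psql -c "select pg_reload_conf()"
psql -c "show backend_flush_after"   # confirm the new value took effect

# ... re-run the pgbench COPY test here, then put things back:
psql -c "alter system reset backend_flush_after"
psql -c "select pg_reload_conf()"
```

ALTER SYSTEM writes the override to postgresql.auto.conf, so RESET plus a reload cleanly restores the build's default between runs.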