Thread: large regression for parallel COPY

large regression for parallel COPY

From: Robert Haas
On Thu, Mar 10, 2016 at 8:29 PM, Andres Freund <andres@anarazel.de> wrote:
> Allow to trigger kernel writeback after a configurable number of writes.

While testing out Dilip Kumar's relation extension patch today, I
discovered (with some help from Andres) that this causes nasty
regressions when doing parallel COPY on hydra (3.2.6-3.fc16.ppc64,
lousy disk subsystem).  What I did was (1) run pgbench -i -s 100, (2)
copy the results to a file, (3) truncate and drop the indexes on the
original table, and (4) try copying in one or more copies of the data
from the file.  Typical command line:

time pgbench -n -f f -t 1 -c 4 -j 4 && psql -c "select
pg_size_pretty(pg_relation_size('pgbench_accounts'));" && time psql -c
checkpoint && psql -c "truncate pgbench_accounts; checkpoint;"
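
Spelled out (the dump path is arbitrary, and the exact contents of the
pgbench script file "f" aren't shown here, so take that part as
approximate), the setup amounts to:

    # (1) generate the data
    pgbench -i -s 100
    # (2) dump pgbench_accounts to a flat file
    psql -c "copy pgbench_accounts to '/tmp/pgbench_accounts.copy'"
    # (3) truncate the table and drop its index (the pkey pgbench created)
    psql -c "truncate pgbench_accounts"
    psql -c "alter table pgbench_accounts drop constraint pgbench_accounts_pkey"
    # (4) the script file "f" just wraps the reload, something like:
    #     copy pgbench_accounts from '/tmp/pgbench_accounts.copy';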

With default settings against
96f8373cad5d6066baeb7a1c5a88f6f5c9661974, pgbench takes 9 to 9.5
minutes and the subsequent checkpoint takes 9 seconds.  After setting
backend_flush_after=0, bgwriter_flush_after=0, it takes 1 minute and 11
seconds and the subsequent checkpoint takes 11 seconds.  With a single
copy of the data (that is, -c 1 -j 1 but
otherwise as above), it takes 28-29 seconds with default settings and
26-27 seconds with backend_flush_after=0, bgwriter_flush_after=0.  So
the difference is rather small with a straight-up COPY, but with 4
copies running at the same time, it's near enough to an order of
magnitude.

Andres reports that on his machine, non-zero *_flush_after settings
make things faster, not slower, so apparently this is
hardware-dependent or kernel-dependent.  Nevertheless, it seems to me
that we should try to get some broader testing here to see which
experience is typical.
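
For anyone who wants to try this elsewhere, the comparison just toggles
the two GUCs between 0 and their defaults between runs; for example
(editing postgresql.conf and reloading works equally well):

    -- run one: backend/bgwriter-triggered writeback disabled
    ALTER SYSTEM SET backend_flush_after = 0;
    ALTER SYSTEM SET bgwriter_flush_after = 0;
    SELECT pg_reload_conf();

    -- run two: back to the built-in defaults
    ALTER SYSTEM RESET backend_flush_after;
    ALTER SYSTEM RESET bgwriter_flush_after;
    SELECT pg_reload_conf();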

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: large regression for parallel COPY

From: Andres Freund
On 2016-03-30 15:50:21 -0400, Robert Haas wrote:
> On Thu, Mar 10, 2016 at 8:29 PM, Andres Freund <andres@anarazel.de> wrote:
> > Allow to trigger kernel writeback after a configurable number of writes.
> 
> While testing out Dilip Kumar's relation extension patch today, I
> discovered (with some help from Andres) that this causes nasty
> regressions when doing parallel COPY on hydra (3.2.6-3.fc16.ppc64,
> lousy disk subsystem).  What I did was (1) run pgbench -i -s 100, (2)
> copy the results to a file, (3) truncate and drop the indexes on the
> original table, and (4) try copying in one or more copies of the data
> from the file.  Typical command line:
> 
> time pgbench -n -f f -t 1 -c 4 -j 4 && psql -c "select
> pg_size_pretty(pg_relation_size('pgbench_accounts'));" && time psql -c
> checkpoint && psql -c "truncate pgbench_accounts; checkpoint;"
> 
> With default settings against
> 96f8373cad5d6066baeb7a1c5a88f6f5c9661974, pgbench takes 9 to 9.5
> minutes and the subsequent checkpoint takes 9 seconds.  After setting
> backend_flush_after=0, bgwriter_flush_after=0, it takes 1 minute and 11
> seconds and the subsequent checkpoint takes 11 seconds.  With a single
> copy of the data (that is, -c 1 -j 1 but
> otherwise as above), it takes 28-29 seconds with default settings and
> 26-27 seconds with backend_flush_after=0, bgwriter_flush_after=0.  So
> the difference is rather small with a straight-up COPY, but with 4
> copies running at the same time, it's near enough to an order of
> magnitude.
> 
> Andres reports that on his machine, non-zero *_flush_after settings
> make things faster, not slower, so apparently this is
> hardware-dependent or kernel-dependent.  Nevertheless, it seems to me
> that we should try to get some broader testing here to see which
> experience is typical.

Indeed. On SSDs I see about a 25-35% gain, on HDDs about 5%. If I
increase backend_flush_after to 64 (as it already is for bgwriter),
however, I do get about a 15% gain on HDDs as well.

I wonder if the default value of backend_flush_after is too small for
some scenarios. My reasoning was that backend_flush_after should have a
*lower* default than e.g. the checkpointer's or bgwriter's, because many
concurrent writers increase the total amount of unflushed dirty writes.
That holds for OLTP write workloads, but less so for bulk loads.
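
For reference, the HDD experiment above is just bumping the GUC to the
bgwriter's value; the setting is measured in 8kB blocks, so 64
corresponds to 512kB:

    ALTER SYSTEM SET backend_flush_after = 64;  -- 64 x 8kB = 512kB
    SELECT pg_reload_conf();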

Andres



Re: large regression for parallel COPY

From: "Joshua D. Drake"
On 03/30/2016 01:10 PM, Andres Freund wrote:
> On 2016-03-30 15:50:21 -0400, Robert Haas wrote:
>> On Thu, Mar 10, 2016 at 8:29 PM, Andres Freund <andres@anarazel.de> wrote:
>>> Allow to trigger kernel writeback after a configurable number of writes.
>>
>> While testing out Dilip Kumar's relation extension patch today, I
>> discovered (with some help from Andres) that this causes nasty
>> regressions when doing parallel COPY on hydra (3.2.6-3.fc16.ppc64,
>> lousy disk subsystem).

Unless Fedora/Red Hat fixed the 3.2 kernel in a subsequent patch (.6-3?),
I would look hard at that first. Kernels from 3.2 through 3.8 are going
to be miserable for anything doing concurrent writes. I understand that
this is a regression regardless, but I think we need wider testing to see
how much the kernel version is a factor here.

Sincerely,

jD

-- 
Command Prompt, Inc.                  http://the.postgres.company/                        +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.



Re: large regression for parallel COPY

From: Robert Haas
On Wed, Mar 30, 2016 at 4:10 PM, Andres Freund <andres@anarazel.de> wrote:
> Indeed. On SSDs I see about a 25-35% gain, on HDDs about 5%. If I
> increase the size of backend_flush_after to 64 (like it's for bgwriter)
> I however do get about 15% for HDDs as well.

I tried the same test mentioned in the original post on cthulhu (EDB
machine, CentOS 7.2, 8 sockets, 8 cores per socket, 2 threads per
core, Xeon E7-8830 @ 2.13 GHz).  I attempted to test both the effects
of multi_extend_v21 and the *_flush_after settings.  The machine has
both HD and SSD, but I used HD for this test.

master, logged tables, 4 parallel copies: 1m15.411s, 1m14.248s, 1m15.040s
master, logged tables, 1 copy: 0m28.336s, 0m28.040s, 0m29.576s
multi_extend_v21, logged tables, 4 parallel copies: 0m46.058s, 0m44.515s, 0m45.688s
multi_extend_v21, logged tables, 1 copy: 0m28.440s, 0m28.129s, 0m30.698s
master, logged tables, 4 parallel copies, {backend,bgwriter}_flush_after=0: 1m2.817s, 1m4.467s, 1m12.319s
multi_extend_v21, logged tables, 4 parallel copies, {backend,bgwriter}_flush_after=0: 0m41.301s, 0m41.104s, 0m41.342s
master, logged tables, 1 copy, {backend,bgwriter}_flush_after=0: 0m26.948s, 0m26.829s, 0m26.616s

So the flushing is a small win with only 1 parallel copy, but with 4
parallel copies it's a significant loss.  However, the relation
extension patch reduces the regression significantly, probably because
it makes it far more likely that a backend doing a flush is flushing a
consecutive range of blocks all of which it added to the relation, so
that there is no interleaving.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: large regression for parallel COPY

From: Fabien COELHO
Hello Robert,

> I tried the same test mentioned in the original post on cthulhu (EDB 
> machine, CentOS 7.2, 8 sockets, 8 cores per socket, 2 threads per core, 
> Xeon E7-8830 @ 2.13 GHz).  I attempted to test both the effects of 
> multi_extend_v21 and the *_flush_after settings.

I'm not sure how intrinsically effective {backend,bgwriter}_flush_after
are, especially on HDDs: for the checkpointer (checkpoint_flush_after) a
great deal of effort goes into generating large sequential writes, but
there is no such provision for the other write activities. I'm not sure
how the write activity of the "parallel" copy is organized, but it sounds
like it will generate fewer sequential writes than before, and flushing
could accentuate the negative performance impact.

This might suggest that the benefits of these two settings are irregular
and hard to predict, so perhaps their default value should be 0 (i.e. off)?

Or maybe warn clearly in the documentation about the uncertain effects of
these two settings?
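
In any case, when posting comparative numbers it would help to state the
values actually in effect; a quick way to list all the related settings:

    SELECT name, setting, unit, boot_val
      FROM pg_settings
     WHERE name LIKE '%flush_after';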

-- 
Fabien.



Re: large regression for parallel COPY

From: Andres Freund
On 2016-04-05 17:12:11 -0400, Robert Haas wrote:
> On Wed, Mar 30, 2016 at 4:10 PM, Andres Freund <andres@anarazel.de> wrote:
> > Indeed. On SSDs I see about a 25-35% gain, on HDDs about 5%. If I
> > increase the size of backend_flush_after to 64 (like it's for bgwriter)
> > I however do get about 15% for HDDs as well.
> 
> I tried the same test mentioned in the original post on cthulhu (EDB
> machine, CentOS 7.2, 8 sockets, 8 cores per socket, 2 threads per
> core, Xeon E7-8830 @ 2.13 GHz).  I attempted to test both the effects
> of multi_extend_v21 and the *_flush_after settings.  The machine has
> both HD and SSD, but I used HD for this test.

> master, logged tables, 4 parallel copies:                                             1m15.411s, 1m14.248s, 1m15.040s
> master, logged tables, 1 copy:                                                        0m28.336s, 0m28.040s, 0m29.576s
> multi_extend_v21, logged tables, 4 parallel copies:                                   0m46.058s, 0m44.515s, 0m45.688s
> multi_extend_v21, logged tables, 1 copy:                                              0m28.440s, 0m28.129s, 0m30.698s
> master, logged tables, 4 parallel copies, {backend,bgwriter}_flush_after=0:           1m2.817s, 1m4.467s, 1m12.319s
> multi_extend_v21, logged tables, 4 parallel copies, {backend,bgwriter}_flush_after=0: 0m41.301s, 0m41.104s, 0m41.342s
> master, logged tables, 1 copy, {backend,bgwriter}_flush_after=0:                      0m26.948s, 0m26.829s, 0m26.616s

Any chance you could repeat with backend_flush_after set to 64? I wonder
if the current value isn't just too small a default for HDDs due to
their increased latency.

- Andres