Thread: 8.4 open item: copy performance regression?

8.4 open item: copy performance regression?

From
Stefan Kaltenbrunner
Date:
Any objections if I add:

http://archives.postgresql.org/pgsql-performance/2009-06/msg00215.php

to the (currently empty) list of open items for 8.4?

A 25-30% performance regression in our main bulk loading mechanism 
should at least be explained before the release...
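
For reference, the linked test case isn't reproduced here, but a generic
harness for timing a WAL-logged bulk load of this shape is roughly the
following (table definition, row count, and file path are made up for
illustration):

    # create the target table in its own transaction so the COPY stays WAL-logged
    psql -c "CREATE TABLE copy_test (id int, payload text)"
    # generate ~10M rows of throwaway data
    seq 1 10000000 | awk '{printf "%d\tfiller text for row %d\n", $1, $1}' > /tmp/copy_test.dat
    psql -c "CHECKPOINT"
    time psql -c "COPY copy_test FROM '/tmp/copy_test.dat'"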




Stefan


Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Any objections if I add:
> http://archives.postgresql.org/pgsql-performance/2009-06/msg00215.php
> to the (currently empty) list of open items for 8.4?

I am unable to duplicate any slowdown on this test case.  AFAICT
8.4 and 8.3 branch tip are about the same speed; if anything 8.4
is faster.  Testing on x86_64 Fedora 10 ...
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> Any objections if I add:
>>
http://archives.postgresql.org/pgsql-performance/2009-06/msg00215.php
>> to the (currently empty) list of open items for 8.4?
> 
> I am unable to duplicate any slowdown on this test case.  AFAICT
> 8.4 and 8.3 branch tip are about the same speed; if anything 8.4
> is faster.  Testing on x86_64 Fedora 10 ...
I just ran the specified test on:
Linux ATHENA 2.6.16.60-0.31-smp #1 SMP Tue Oct 7 16:16:29 UTC 2008
x86_64 x86_64 x86_64 GNU/Linux
Linux version 2.6.16.60-0.31-smp (geeko@buildhost) (gcc version 4.1.2
20070115 (SUSE Linux)) #1 SMP Tue Oct 7 16:16:29 UTC 2008
SUSE Linux Enterprise Server 10 (x86_64)
VERSION = 10
PATCHLEVEL = 2
I did configure with a prefix and these options: --enable-integer-datetimes --enable-debug --disable-nls --with-libxml
I did initdb with --no-locale
8.3.7
real    0m24.249s
real    0m24.054s
real    0m24.361s
8.4rc1
real    0m33.503s
real    0m34.198s
real    0m33.931s
-Kevin


Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
>> I am unable to duplicate any slowdown on this test case.

> [ Kevin can ]

It'd be useful first off to figure out if it's a CPU or I/O issue.
Is there any visible difference in vmstat output?  Also, try turning
off autovacuum in both cases, just to see if that's related.
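
For example, roughly (assuming a source install with $PGDATA pointing at
the data directory, and the same sort of throwaway test table sketched
upthread):

    echo "autovacuum = off" >> $PGDATA/postgresql.conf
    pg_ctl -D $PGDATA restart
    vmstat 5 > vmstat.out &                # sample system activity during the load
    time psql -c "COPY copy_test FROM '/tmp/copy_test.dat'"
    kill %1                                # stop the background vmstat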
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> It'd be useful first off to figure out if it's a CPU or I/O issue.
> Is there any visible difference in vmstat output?  Also, try turning
> off autovacuum in both cases, just to see if that's related.

Both took slightly longer with autovacuum off, but probably just
within the noise.

It's hard to interpret the vmstat output, largely because I chose to
run this on one of our biggest servers, which is not currently serving
an application, per se, but as a replication target, and this being
"off hours" is busy running the sync process to the source machines.
This involves generating md5 sums on both sides for blocks of rows,
which is pretty CPU-intensive.  There is very little disk output from
that right now, pretty light on the disk reads, but keeping a few CPUs
pretty busy generating those md5 sums.

I've got to go keep an appointment, but I'll come back and see if I
can do more.  For now, here's the raw vmstat, in case someone can pick
out info I'm missing:

-Kevin


[Attachment: vmstat output]

Re: 8.4 open item: copy performance regression?

From
Kenneth Marshall
Date:
On Thu, Jun 18, 2009 at 05:20:08PM -0400, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> > Any objections if I add:
> > http://archives.postgresql.org/pgsql-performance/2009-06/msg00215.php
> > to the (currently empty) list of open items for 8.4?
> 
> I am unable to duplicate any slowdown on this test case.  AFAICT
> 8.4 and 8.3 branch tip are about the same speed; if anything 8.4
> is faster.  Testing on x86_64 Fedora 10 ...
> 
>             regards, tom lane
> 

What is not clear from Stefan's function listing is how the 8.4
server could issue 33% more XLogInsert() and CopyReadLine()
calls than the 8.3.7 server using the same input file. That would
account for the slowdown but not why it is happening.

Cheers,
Ken


Re: 8.4 open item: copy performance regression?

From
"Kevin Grittner"
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: 
> I've got to go keep an appointment
Sorry about that.  Back now.  Anything else I can do to help with
this?
-Kevin


Re: 8.4 open item: copy performance regression?

From
"Kevin Grittner"
Date:
Kenneth Marshall <ktm@rice.edu> wrote: 
> What is not clear from Stefen's function listing is how the 8.4
> server could issue 33% more XLogInsert() and CopyReadLine()
> calls than the 8.3.7 server using the same input file.
I thought those were profiling numbers -- the number of times a timer
checked what was executing and found it in that method.  Which
suggests that those two methods are probably slower now than in 8.3.7,
at least in some environments.
-Kevin


Re: 8.4 open item: copy performance regression?

From
Alan Li
Date:
It doesn't look like it's related to autovacuum.  I re-ran the test against the two Solaris boxes with autovacuum turned off and the results look about the same.

8.3.7 - Solaris 10 11/06 s10x_u3wos_10 X86
real    0m43.662s
user    0m0.001s
sys     0m0.003s
real    0m43.565s
user    0m0.001s
sys     0m0.003s
real    0m43.742s
user    0m0.001s
sys     0m0.003s

8.4rc1 - Solaris 10 11/06 s10x_u3wos_10 X86
real    0m59.304s
user    0m0.001s
sys     0m0.003s
real    0m58.653s
user    0m0.001s
sys     0m0.003s
real    1m0.253s
user    0m0.001s
sys     0m0.003s

8.3.7 - Solaris 10 8/07 s10x_u4wos_12b X86
real    0m38.981s
user    0m0.002s
sys     0m0.004s
real    0m39.879s
user    0m0.002s
sys     0m0.004s
real    0m39.111s
user    0m0.002s
sys     0m0.004s

8.4rc1 - Solaris 10 8/07 s10x_u4wos_12b X86
real    0m50.647s
user    0m0.002s
sys     0m0.004s
real    0m49.453s
user    0m0.002s
sys     0m0.004s
real    0m49.725s
user    0m0.002s
sys     0m0.004s

Alan


On Thu, Jun 18, 2009 at 3:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I am unable to duplicate any slowdown on this test case.

> [ Kevin can ]

It'd be useful first off to figure out if it's a CPU or I/O issue.
Is there any visible difference in vmstat output?  Also, try turning
off autovacuum in both cases, just to see if that's related.

                       regards, tom lane




Re: 8.4 open item: copy performance regression?

From
Kenneth Marshall
Date:
Yes, you are right. I thought that they were absolute function
counts. The data makes more sense now.

Regards,
Ken

On Thu, Jun 18, 2009 at 07:03:34PM -0500, Kevin Grittner wrote:
> Kenneth Marshall <ktm@rice.edu> wrote: 
>  
> > What is not clear from Stefen's function listing is how the 8.4
> > server could issue 33% more XLogInsert() and CopyReadLine()
> > calls than the 8.3.7 server using the same input file.
>  
> I thought those were profiling numbers -- the number of times a timer
> checked what was executing and found it in that method.  Which
> suggests that those two methods are probably slower now than in 8.3.7,
> at least in some environments.
>  
> -Kevin
> 


Re: 8.4 open item: copy performance regression?

From
Andrew Dunstan
Date:

Kevin Grittner wrote:
>  
> 8.3.7
> real    0m24.249s
> real    0m24.054s
> real    0m24.361s
>  
> 8.4rc1
> real    0m33.503s
> real    0m34.198s
> real    0m33.931s
>  
>
>   

Ugh. This looks like a poster child case for a benchfarm ...

Is there any chance you guys could triangulate this a bit? Good initial 
triangulation points might be the end of each commitfest. (I have a 
vested interest in making sure COPY performance doesn't regress, since 
it will affect parallel restore's speed in spades.)

cheers

andrew


Re: 8.4 open item: copy performance regression?

From
Marko Kreen
Date:
On 6/19/09, Andrew Dunstan <andrew@dunslane.net> wrote:
>  Kevin Grittner wrote:
> >  8.3.7
> > real    0m24.249s
> > real    0m24.054s
> > real    0m24.361s
> >  8.4rc1
> > real    0m33.503s
> > real    0m34.198s
> > real    0m33.931s
>
>  Ugh. This looks like a poster child case for a benchfarm ...
>
>  Is there any chance you guys could triangulate this a bit? Good initial
> triangulation points might be the end of each commitfest. (I have a vested
> interest in making sure COPY performance doesn't regress, since it will
> affect parallel restore's speed in spades.)

git bisect?
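
For instance, against the git mirror, something like the following (the
URL is from memory, and copy_test.sh is an assumed wrapper that rebuilds,
reinstalls, and exits non-zero when the timed COPY is slow):

    git clone git://git.postgresql.org/git/postgresql.git
    cd postgresql
    git bisect start
    git bisect bad master                                           # 8.4 development tip: regression present
    git bisect good $(git merge-base master origin/REL8_3_STABLE)   # where 8.3 branched off: known fast
    git bisect run ../copy_test.sh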

-- 
marko


Re: 8.4 open item: copy performance regression?

From
Stefan Kaltenbrunner
Date:
Andrew Dunstan wrote:
> 
> 
> Kevin Grittner wrote:
>>  
>> 8.3.7
>> real    0m24.249s
>> real    0m24.054s
>> real    0m24.361s
>>  
>> 8.4rc1
>> real    0m33.503s
>> real    0m34.198s
>> real    0m33.931s
>>  
>>
>>   
> 
> Ugh. This looks like a poster child case for a benchfarm ...

indeed...

> 
> Is there any chance you guys could triangulate this a bit? Good initial 
> triangulation points might be the end of each commitfest. (I have a 
> vested interest in making sure COPY performance doesn't regress, since 
> it will affect parallel restore's speed in spades.)

Maybe parallel restore is the reason why we haven't noticed this earlier. 
The case that regressed this way is WAL-logged COPY; COPY that can 
bypass WAL (which typically happens in parallel restore now) is actually 
a bit faster in my testing in 8.4.

I will try and see if I can figure out what caused the regression...


Stefan


Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
Just eyeing the code ... another thing we changed since 8.3 is to enable
posix_fadvise() calls for WAL.  Any of the complainants want to try diking
out this bit of code (near line 2580 in src/backend/access/transam/xlog.c)?

#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
    if (!XLogArchivingActive() &&
        (get_sync_bit(sync_method) & PG_O_DIRECT) == 0)
        (void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
#endif
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Just eyeing the code ... another thing we changed since 8.3 is to enable
> posix_fadvise() calls for WAL.  Any of the complaints want to try diking
> out this bit of code (near line 2580 in src/backend/access/transam/xlog.c)?
> 
> #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
>     if (!XLogArchivingActive() &&
>         (get_sync_bit(sync_method) & PG_O_DIRECT) == 0)
>         (void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
> #endif

doesn't seem to cause any noticable difference for me...


Stefan


Re: 8.4 open item: copy performance regression?

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Just eyeing the code ... another thing we changed since 8.3 is to enable
> posix_fadvise() calls for WAL.  Any of the complaints want to try diking
> out this bit of code (near line 2580 in src/backend/access/transam/xlog.c)?
> 
> #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
>     if (!XLogArchivingActive() &&
>         (get_sync_bit(sync_method) & PG_O_DIRECT) == 0)
>         (void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
> #endif

ok after a bit of bisecting I'm happy to announce the winner of the contest:

http://archives.postgresql.org/pgsql-committers/2008-11/msg00054.php

this patch causes a 25-30% performance regression for WAL logged copy, 
however in the WAL bypass case (maybe that was what got tested?) it 
results in a 20% performance increase.

the raw numbers using the upthread posted minimal postgresql.conf are:

post patch/wal logged: 4m10s/4m19s/4m12s
post patch/wal bypass: 1m55s/1m58s/2m00s
prepatch/wal logged: 2m55s/3m00s/2m59s
prepatch/wal bypass: 2m22s/2m18s/2m20s


Stefan


Re: 8.4 open item: copy performance regression?

From
Kenneth Marshall
Date:
On Fri, Jun 19, 2009 at 07:49:31PM +0200, Stefan Kaltenbrunner wrote:
> Tom Lane wrote:
>> Just eyeing the code ... another thing we changed since 8.3 is to enable
>> posix_fadvise() calls for WAL.  Any of the complaints want to try diking
>> out this bit of code (near line 2580 in 
>> src/backend/access/transam/xlog.c)?
>> #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
>>     if (!XLogArchivingActive() &&
>>         (get_sync_bit(sync_method) & PG_O_DIRECT) == 0)
>>         (void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
>> #endif
>
> ok after a bit of bisecting I'm happy to announce the winner of the 
> contest:
>
> http://archives.postgresql.org/pgsql-committers/2008-11/msg00054.php
>
> this patch causes a 25-30% performance regression for WAL logged copy, 
> however in the WAL bypass case (maybe that was what got tested?) it results 
> in a 20% performance increase.
>
> the raw numbers using the upthread posted minimal postgresql.conf are:
>
> post patch/wal logged: 4min10s/4min19/4min12
> post patch/wal bypass: 1m55s/1m58s/2m00
> prepatch/wal logged: 2m55s/3min00/2m59
> prepatch/wal bypass: 2m22s/2m18s/2m20s
>
>
> Stefan
>

Great! Maybe just increasing the size of the BULKWRITE ring,
possibly as a function of the shared memory size, is all that is
needed. 256kB is the currently coded ring_size in src/backend/storage/buffer/freelist.c.

Ken


Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> ok after a bit of bisecting I'm happy to announce the winner of the contest:
> http://archives.postgresql.org/pgsql-committers/2008-11/msg00054.php

> this patch causes a 25-30% performance regression for WAL logged copy, 
> however in the WAL bypass case (maybe that was what got tested?) it 
> results in a 20% performance increase.

Hmm.  What that patch actually changes is that it prevents a bulk insert
(ie COPY in) from trashing the entire shared-buffers arena.  I think the
reason for the WAL correlation is that once it's filled the ring buffer,
creating new pages requires writing out old ones, and the
WAL-before-data rule means that the copy process has to block waiting
for WAL to go down to disk before it can write.  When it's allowed to
use the whole arena there is more chance for some of that writing to be
done by the walwriter or bgwriter.  But the details are going to depend
on the platform's CPU vs I/O balance, which no doubt explains why some
of us don't see it.

I don't think we want to revert that patch --- not trashing the whole
buffer arena seems like a Good Thing from a system-wide point of view,
even if it makes individual COPY operations go slower.  However, we
could maybe play around with the tradeoffs a bit.  In particular it
seems like it would be useful to experiment with different ring buffer
sizes.  Could you try increasing the ring size allowed in
src/backend/storage/buffer/freelist.c for the BULKWRITE case

***************
*** 384,389 ****
--- 384,392 ----
          case BAS_BULKREAD:
              ring_size = 256 * 1024 / BLCKSZ;
              break;
+         case BAS_BULKWRITE:
+             ring_size = 256 * 1024 / BLCKSZ;
+             break;
          case BAS_VACUUM:
              ring_size = 256 * 1024 / BLCKSZ;
              break;


and see if maybe we can buy back most of the loss with not too much
of a ring size increase?
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> ok after a bit of bisecting I'm happy to announce the winner of the contest:
>> http://archives.postgresql.org/pgsql-committers/2008-11/msg00054.php
> 
>> this patch causes a 25-30% performance regression for WAL logged copy, 
>> however in the WAL bypass case (maybe that was what got tested?) it 
>> results in a 20% performance increase.
> 
> Hmm.  What that patch actually changes is that it prevents a bulk insert
> (ie COPY in) from trashing the entire shared-buffers arena.  I think the
> reason for the WAL correlation is that once it's filled the ring buffer,
> creating new pages requires writing out old ones, and the
> WAL-before-data rule means that the copy process has to block waiting
> for WAL to go down to disk before it can write.  When it's allowed to
> use the whole arena there is more chance for some of that writing to be
> done by the walwriter or bgwriter.  But the details are going to depend
> on the platform's CPU vs I/O balance, which no doubt explains why some
> of us don't see it.

hmm - In my case both the CPU (an Intel E5530 Nehalem) and the IO 
subsystem (8GB Fibre Channel connected NetApp with 4GB cache) are pretty 
fast, and even with, say, fsync=off, 8.4RC1 is only slightly faster than 
8.3 with the same config and fsync=on, so maybe there is a secondary 
effect at play too.

> 
> I don't think we want to revert that patch --- not trashing the whole
> buffer arena seems like a Good Thing from a system-wide point of view,
> even if it makes individual COPY operations go slower.  However, we
> could maybe play around with the tradeoffs a bit.  In particular it
> seems like it would be useful to experiment with different ring buffer
> sizes.  Could you try increasing the ring size allowed in
> src/backend/storage/buffer/freelist.c for the BULKWRITE case
> 
> ***************
> *** 384,389 ****
> --- 384,392 ----
>           case BAS_BULKREAD:
>               ring_size = 256 * 1024 / BLCKSZ;
>               break;
> +         case BAS_BULKWRITE:
> +             ring_size = 256 * 1024 / BLCKSZ;
> +             break;
>           case BAS_VACUUM:
>               ring_size = 256 * 1024 / BLCKSZ;
>               break;
> 
> 
> and see if maybe we can buy back most of the loss with not too much
> of a ring size increase?

already started testing that once I found the offending commit.

256 * 1024 / BLCKSZ
4m10s/4m19s/4m12s

512 * 1024 / BLCKSZ
3m27s/3m32s

1024 * 1024 / BLCKSZ
3m14s/3m12s

2048 * 1024 / BLCKSZ
3m02s/3m02s

4096 * 1024 / BLCKSZ
2m59s/2m58s

8192 * 1024 / BLCKSZ
2m59s/2m59s

so 4096 * 1024 / BLCKSZ (512 buffers, ie a 4MB ring at the default 8kB 
block size) seems to be the sweet spot and also results in more or less 
the same performance that 8.3 had.



Stefan


Re: 8.4 open item: copy performance regression?

From
Josh Berkus
Date:
> so 4096 * 1024 / BLCKSZ seems to be the sweet spot and also results in
> more or less the same performance that 8.3 had.

Can some folks test this with different size COPYs?  That's both 
larger/smaller tables, and larger/smaller rows.  We should also test 
copy with large blob data.
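
Something along these lines, for instance (table names, row counts, and
row widths here are just illustrative):

    # narrow rows (two small integers) vs wide rows (~1KB of text each)
    seq 1 10000000 | awk '{printf "%d\t%d\n", $1, $1*2}' > /tmp/narrow.dat
    seq 1 100000   | awk '{s=""; for(i=0;i<64;i++) s=s "0123456789abcdef"; printf "%d\t%s\n", $1, s}' > /tmp/wide.dat
    time psql -c "COPY narrow_test FROM '/tmp/narrow.dat'"
    time psql -c "COPY wide_test FROM '/tmp/wide.dat'"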


-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


Re: 8.4 open item: copy performance regression?

From
Simon Riggs
Date:
On Fri, 2009-06-19 at 14:11 -0400, Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> > ok after a bit of bisecting I'm happy to announce the winner of the contest:
> > http://archives.postgresql.org/pgsql-committers/2008-11/msg00054.php
> 
> > this patch causes a 25-30% performance regression for WAL logged copy, 
> > however in the WAL bypass case (maybe that was what got tested?) it 
> > results in a 20% performance increase.
> 
> Hmm.  What that patch actually changes is that it prevents a bulk insert
> (ie COPY in) from trashing the entire shared-buffers arena.  I think the
> reason for the WAL correlation is that once it's filled the ring buffer,
> creating new pages requires writing out old ones, and the
> WAL-before-data rule means that the copy process has to block waiting
> for WAL to go down to disk before it can write.  When it's allowed to
> use the whole arena there is more chance for some of that writing to be
> done by the walwriter or bgwriter.  But the details are going to depend
> on the platform's CPU vs I/O balance, which no doubt explains why some
> of us don't see it.
> 
> I don't think we want to revert that patch --- not trashing the whole
> buffer arena seems like a Good Thing from a system-wide point of view,
> even if it makes individual COPY operations go slower.  However, we
> could maybe play around with the tradeoffs a bit.  In particular it
> seems like it would be useful to experiment with different ring buffer
> sizes.  Could you try increasing the ring size allowed in
> src/backend/storage/buffer/freelist.c for the BULKWRITE case

Yes, that's definitely the cause. The ring buffer technique was
prototyped in 8.3 and a ring of 32 blocks was found not to be
sufficient, which was one reason we didn't try to commit that then.
At the time, I also proposed a "filled buffer list" change to bufmgr to
allow bgwriter to preferentially target COPY's filled blocks, which
would also help with this effect.

You'll get better benefit from decreasing wal_writer_delay also.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: 8.4 open item: copy performance regression?

From
Greg Smith
Date:
On Fri, 19 Jun 2009, Stefan Kaltenbrunner wrote:

> In my case both the CPU (an Intel E5530 Nehalem) and the IO subsystem 
> (8GB Fiberchannel connected NetApp with 4GB cache) are pretty fast.

The server Alan identified as "Solaris 10 8/07 s10x_u4wos_12b X86" has a 
Xeon E5320 (1.86GHz) and a single boring SAS drive in it (Sun X4150). 
The filesystem involved in that particular case is UFS, which I am 
suspicious of as being part of why the problem is so pronounced there--the 
default UFS tuning is pretty lightweight in terms of how much caching it 
does.  Not sure if Alan ran any tests against the big ZFS volume on the 
other server; I think all the results he posted were from the UFS boot 
drive there too.

> so 4096 * 1024 / BLCKSZ seems to be the sweet spot and also results in more 
> or less the same performance that 8.3 had.

It looks like it's a touch higher on our 8/07 system; it levels out at 
8192 * (haven't checked the other one yet).  I'm seeing this, using Alan's 
original test set size (to make sure I was running exactly the same test) 
and just grabbing the low/high from a set of 3 runs:

8.3.7:  0m39.266s   0m43.269s (alan:  36.2 - 39.2)

256:    0m50.435s   0m51.944s (alan:  48.1 - 50.6)
1024:   0m47.299s   0m49.346s
4096:   0m43.725s   0m46.116s
8192:   0m40.715s   0m42.480s
16384:  0m41.318s   0m42.118s
65536:  0m41.675s   0m42.955s

I collected some iostat data here as well for some of the runs (the vmstat 
data was harder to read, this being Solaris, and didn't seem to add 
anything).  I'm seeing lines like this with the default ring buffer of 256 
*:
   tty        sd1           sd2           nfs1           cpu
 tin tout kps tps serv  kps tps serv  kps tps serv   us sy wt id
   0  322  12   1    0  41371 2754   0    0   0    0   12 11  0 78
   0  166   0   0    0  46246 3380   0    0   0    0   14 10  0 76
   0  164   0   0    0  44874 3068   1    0   0    0   13  9  0 78

Obviously sd2 is where the database and source file are at.  Basically, 
about one core (out of four) tied up with a pretty even split of 
user/system time.  Using the highest ring size I tried, 65536 *, gives 
lines that look like this:
   tty        sd1           sd2           nfs1           cpu
 tin tout kps tps serv  kps tps serv  kps tps serv   us sy wt id
   0  163   0   0    0  56696 4291   0    0   0    0   20 12  0 68
   0  166   0   0    0  58554 4542   0    0   0    0   21 12  0 67
   0  168   0   0    0  56057 4308   0    0   0    0   21 12  0 67
 

So it seems like increasing the ring size helps saturate the disks better: 
throughput went from ~45MB/s to 57MB/s.  What's kind of interesting is to compare 
this against the 8.3.7 run, which is the fastest of them all and which I was 
expecting to find had the highest write rate of them all:
   tty        sd1           sd2           nfs1           cpu
 tin tout kps tps serv  kps tps serv  kps tps serv   us sy wt id
   0   83   0   0    0  47654 2121   0    0   0    0   23  8  0 69
   0  240   0   0    0  44198 2150   1    0   0    0   19  8  0 73
   0   83   0   0    0  37750 1110   1    0   0    0   21  6  0 72
 

That's actually doing less I/O per capita, which is why it's also got less 
waiting for I/O%, but it's completing the most work.  This makes me wonder 
if in addition to the ring buffering issue, there isn't just plain more 
writing per average completed transaction in 8.4 with this type of COPY. 
This might explain why even with the expanded ring buffer, both Stefan's 
and my test runs still showed a bit of a regression against 8.3.  I'm 
guessing we have a second, smaller shooter here involved as well.

In any case, a bump of the ring multiplier to either 4096 or 8192 
eliminates the worst of the regression here, good improvement so far.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: 8.4 open item: copy performance regression?

From
Greg Smith
Date:
On Sat, 20 Jun 2009, Simon Riggs wrote:

> At the time, I also proposed a "filled buffer list" change to bufmgr to
> allow bgwriter to preferentially target COPY's filled blocks, which
> would also help with this effect.

One of the things I keep meaning to investigate is whether there's any 
benefit to pushing buffers the background writer cleans onto the free 
list, to speed up their subsequent allocation to needy backends.  Both 
this and the sequential scan ring buffer implementation might both benefit 
from an approach where buffers are similarly pushed onto the free list 
when they're no longer needed by the process that pulled them in. 
Buffers could move from allocated->used->filled buffer list->free list in 
the COPY buffer case, and allocated->used->free list when executing a 
sequential scan.

That would seem to me to be a more robust general approach for solving 
this class of problem than the whole ring buffer idea, which is a great 
start but bound to run into situations where the size of the buffer just 
isn't right anymore a few hardware generations down the road.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: 8.4 open item: copy performance regression?

From
Simon Riggs
Date:
On Sat, 2009-06-20 at 02:53 -0400, Greg Smith wrote:
> On Sat, 20 Jun 2009, Simon Riggs wrote:
> 
> > At the time, I also proposed a "filled buffer list" change to bufmgr to
> > allow bgwriter to preferentially target COPY's filled blocks, which
> > would also help with this effect.
> 
> One of the things I keep meaning to investigate is whether there's any 
> benefit to pushing buffers the background writer cleans onto the free 
> list, to speed up their subsequent allocation to needy backends.  Both 
> this and the sequential scan ring buffer implementation might both benefit 
> from an approach where buffers are similarly pushed onto the free list 
> when they're no longer needed by the process that pulled them in. 
> Buffers could move from allocated->used->filled buffer list->free list in 
> the COPY buffer case, and allocated->used->free list when executing a 
> sequential scan.

The reason for not doing that would be that we don't know that the
blocks are free to use; we know very little about them. The longer we
leave them the more likely they are to be reused, so putting buffers
onto the freelist when they aren't actually free would likely make the
cache less effective, ISTM.

With filled buffers from COPY we have a very good clue that the buffers
will no longer be needed and can treat them differently from others.
Also, if we can get the bgwriter to do some of the work then we would
have the COPY process, bgwriter and WALwriter all working together on
the data loading. (We need to make WALwriter a little smarter also, so
that it can respond to changing WAL write workloads).

> That would seem to me to be a more robust general approach for solving 
> this class of problem than the whole ring buffer idea, which is a great 
> start but bound to run into situations where the size of the buffer just 
> isn't right anymore a few hardware generations down the road.

The ring buffer optimises L2 cache, not the buffer cache in general. If
we put buffers back on the freelist that is the same as having one
global ring buffer, which would then spoil the benefit for L2 on
multi-CPU systems.

We don't see any L2 benefit with COPY yet, but it is potentially there
if we can overcome the stronger effect of the WAL costs. When the ring
buffer expands to minimise WAL overheads we also minimise benefit for
L2. 

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: 8.4 open item: copy performance regression?

From
Simon Riggs
Date:
On Fri, 2009-06-19 at 22:03 -0400, Greg Smith wrote:

> This makes me wonder if in addition to the ring buffering issue, there
> isn't just plain more writing per average completed transaction in 8.4
> with this type of COPY.

I would suggest that we check how much WAL has been written. There may
be a secondary effect or a different regression hidden in these results.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: 8.4 open item: copy performance regression?

From
Greg Stark
Date:
On Sat, Jun 20, 2009 at 9:22 AM, Simon Riggs<simon@2ndquadrant.com> wrote:
>> That would seem to me to be a more robust general approach for solving
>> this class of problem than the whole ring buffer idea, which is a great
>> start but bound to run into situations where the size of the buffer just
>> isn't right anymore a few hardware generations down the road.
>
> The ring buffer optimises L2 cache, not the buffer cache in general. If
> we put buffers back on the freelist that is the same as having one
> global ring buffer, which would then spoil the benefit for L2 on
> multi-CPU systems.

Incidentally a "ring buffer" is something else. We're talking about a
clock-sweep algorithm.

The whole point of the clock-sweep algorithm is that it's an
approximated LRU but with no single point of contention like a list
pointer. Doing this would be undermining that advantage.

I don't understand what you mean by "size of the buffer" either. The
only real parameter in the clock sweep algorithm is how many distinct
counter values there are. The more values the closer to an LRU the
result is. it isn't really tied to hardware in any way, just the usage
pattern.


-- 
greg
http://mit.edu/~gsstark/resume.pdf


Re: 8.4 open item: copy performance regression?

From
Stefan Kaltenbrunner
Date:
Greg Smith wrote:
> On Fri, 19 Jun 2009, Stefan Kaltenbrunner wrote:
> 
>> In my case both the CPU (an Intel E5530 Nehalem) and the IO subsystem 
>> (8GB Fiberchannel connected NetApp with 4GB cache) are pretty fast.
> 
> The server Alan identified as "Solaris 10 8/07 s10x_u4wos_12b X86" has a 
> Xeon E5320 (1.86GHz) and a single boring SAS drive in it (Sun X4150). 
> The filesystem involved in that particular case is UFS, which I am 
> suspicious of as being part of why the problem is so pronounced 
> there--the default UFS tuning is pretty lightweight in terms of how much 
> caching it does.  Not sure if Alan ran any tests against the big ZFS 
> volume on the other sever, I think all the results he posted were from 
> the UFS boot drive there too.
> 
>> so 4096 * 1024 / BLCKSZ seems to be the sweet spot and also results in 
>> more or less the same performance that 8.3 had.
> 
> It looks like it's a touch higher on our 8/07 system, it levels out at 
> 8192 * (haven't checked the other one yet).  I'm seeing this, using 
> Alan's original test set size (to make sure I was running exactly the 
> same test) and just grabbing the low/high from a set of 3 runs:
> 
> 8.3.7:  0m39.266s   0m43.269s (alan:  36.2 - 39.2)
> 
> 256:    0m50.435s   0m51.944s (alan:  48.1 - 50.6)
> 1024:   0m47.299s   0m49.346s
> 4096:   0m43.725s   0m46.116s
> 8192:   0m40.715s   0m42.480s
> 16384:  0m41.318s   0m42.118s
> 65536:  0m41.675s   0m42.955s

hmm interesting - I just did a bunch of runs using the lineitem table 
from the DBT3 tests (loading 60M rows in each run) and the same config 
Alan used.

8.4 (post-patch - not RC1, but that one seems to behave exactly the same way)

lineitem1
256 9min38s
512 9min20s
1024 7m44.667s/7m45.342s
2048 7m15.500s/7m17.910s
4096 7m11.424s/7m13.276s
8192 6m43.203s/6m48.293s
16384 6m24.980s/6m24.116s
32768 6m20.753s/6m22.083s
65536 6m22.913s/6m22.449s
1048576 6m23.765s/6m24.645s


8.3

6m45.650s/6m44.781s


so on this workload the sweetspot seems to be much higher than on the 
one with the narrower rows.

[...]
> That's actually doing less I/O per capita, which is why it's also got 
> less waiting for I/O%, but it's completing the most work.  This makes me 
> wonder if in addition to the ring buffering issue, there isn't just 
> plain more writing per average completed transaction in 8.4 with this 
> type of COPY. This might explain why even with the expanded ring buffer, 
> both Stephan and my test runs still showed a bit of a regression against 
> 8.3.  I'm guessing we have a second, smaller shooter here involved as well.

well yes I also suspect that there is some secondary effect at play here 
and I believe I have seen the "more IO with 8.4" thing here too but I 
have not actually paid enough attention yet to be sure.


> 
> In any case, a bump of the ring multiplier to either 4096 or 8192 
> eliminates the worst of the regression here, good improvement so far.

yeah with the above numbers I would say that 8192 should remove most if 
not all of the regression. However it seems that we might have to make 
this more dynamic in the future since the behaviour seems to depend on a 
number of variables...


Stefan


Re: 8.4 open item: copy performance regression?

From
Greg Stark
Date:
On Sat, Jun 20, 2009 at 12:10 PM, Greg Stark<gsstark@mit.edu> wrote:
>
> I don't understand what you mean by "size of the buffer" either.

Ok, having gone back and read the whole thread I understand the
context for that statement. Nevermind.



-- 
greg
http://mit.edu/~gsstark/resume.pdf


Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>    and I believe I have seen the "more IO with 8.4" thing here too but I 
> have not actually paid enough attention yet to be sure.

FSM/VM overhead maybe?  I think COPY IN is setting the SKIP_FSM bit,
but I wonder if there's some vestigial overhead anyway.
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
Simon Riggs
Date:
On Sat, 2009-06-20 at 13:15 +0200, Stefan Kaltenbrunner wrote:
> > 
> > 8.3.7:  0m39.266s   0m43.269s (alan:  36.2 - 39.2)
> > 
> > 8192:   0m40.715s   0m42.480s
> > 16384:  0m41.318s   0m42.118s
> > 65536:  0m41.675s   0m42.955s
> 
> hmm interesting - I just did a bunch of runs using the lineitem table 
> from the DBT3 tests (loading 60M rows in each run) and the same config 
> Alan used.
> 
> 8.4(postpatch - not RC1 but that one seems to behave exactly the same way)
> 
> lineitem1
> 8192 6m43.203s/6m48.293s
> 16384 6m24.980s/6m24.116s
> 32768 6m20.753s/6m22.083s
> 65536 6m22.913s/6m22.449s
> 1048576 6m23.765s/6m24.645s

> 8.3
> 
> 6m45.650s/6m44.781s

> so on this workload the sweetspot seems to be much higher than on the 
> one with the narrower rows.

The rest of the patch should have had a greater effect on tables with
thinner rows. Your results match my expectations, though I read from
them that we should use 16384, since that provides some gain, not just a
cancellation of the regression. 

I would suggest that we leave it as a tunable parameter in this release
and remove it again once we have clear evidence of how to set it. We are
unlikely to discover conclusively how to set this before we release 8.4.

> > 
> > In any case, a bump of the ring multiplier to either 4096 or 8192 
> > eliminates the worst of the regression here, good improvement so far.
> 
> yeah with the above numbers I would say that 8192 should remove most if 
> not all of the regression. However it seems that we might have to make 
> this more dynamic in the future since the behaviour seems to depend on a 
> number of variables...

I would be inclined to repeat this with multiple concurrent COPYs. We
may be able to improve on the numbers there, as well as get a
perspective on how to set the parameter in the real world.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: 8.4 open item: copy performance regression?

From
Greg Smith
Date:
On Sat, 20 Jun 2009, Simon Riggs wrote:

> The reason for not doing that would be that we don't know that the
> blocks are free to use; we know very little about them. The longer we
> leave them the more likely they are to be reused, so putting buffers
> onto the freelist when they aren't actually free would likely make the
> cache less effective, ISTM.

This is the prevailing wisdom.  The current design of the background 
writer tries to make a reasonable guess as to how many buffers are going 
to be allocated in the next bgwriter_delay period, and it 
locates/generates clean buffers to fill that expected need.  The idea 
would be to put those buffers, which have a good chance of being allocated 
by somebody else and therefore invalidated shortly, onto the free list.

There's a certain amount of overhead the backends themselves wait for in 
order to advance the clock sweep position to find the buffers they need, 
and that adds some latency to them.  I would guess there's some potential 
for reducing that latency if the freelist was sometimes populated, which 
right now it rarely is.  While that would add contention onto the free 
list, I wonder if it might still be less than the contention on advancing 
the current buffer pool strategy point.

I don't want to go through the whole thing in this thread, just pointing 
out some similarity with an item I'd already proposed onto the TODO 
list--and that a good solution there might get rid of this whole "how big 
do I make the ring buffer?" situation.  Obviously actual measurements here 
would trump any theorizing as to what works better, it's hard to get any 
intuition about low-level optimizing given how complicated CPU caches are 
nowadays.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Sat, 2009-06-20 at 13:15 +0200, Stefan Kaltenbrunner wrote:
>> 8192 6m43.203s/6m48.293s
>> 16384 6m24.980s/6m24.116s
>> 32768 6m20.753s/6m22.083s
>> 65536 6m22.913s/6m22.449s
>> 1048576 6m23.765s/6m24.645s

> The rest of the patch should have had a greater effect on tables with
> thinner rows. Your results match my expectations, though I read from
> them that we should use 16384, since that provides some gain, not just a
> cancellation of the regression. 

+1 for using 16384 (ie, max ring buffer size 16MB).  Maybe even more.
It seems likely that other cases might have an even bigger issue than
is exhibited in the couple of test cases we have here, so we should
leave some margin for error.  Also, there's code in there to limit the
ring buffer to 1/8th of shared buffers, so we don't have to worry about
trashing the whole buffer arena in small configurations.  Any limitation
at all is still a step forward over previous releases as far as not
trashing the arena is concerned.

> I would suggest that we leave it as a tunable parameter in this release
> and remove it again once we have clear evidence of how to set it.

It's not going to become an exposed tunable.  There is no evidence to
justify that, and we are down to the point of the release cycle where
any more than one-liner changes have to be looked at with considerable
reluctance.
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
Greg Smith
Date:
On Sat, 20 Jun 2009, Simon Riggs wrote:

> I would suggest that we check how much WAL has been written. There may
> be a secondary effect or a different regression hidden in these results.

What's the easiest way to do that?  My first thought was to issue a 
checkpoint before the test (which is a good idea to make each test 
consistent anyway), save the output from pg_controldata, test, checkpoint, 
and look at the control data again.  This seems kind of clunky though, but 
still better than trolling through the OS statistics for the data.  Any 
clever ideas for a better way to measure bytes of WAL written during a 
particular chunk of code?  We may need some sort of checkpoint/sync after 
the test to get correct results, because I've noticed that the tests I run 
sometimes continue writing out buffers for a few seconds after the test 
time is finished.
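
The clunky version I had in mind looks something like this (paths and
table name are placeholders):

    psql -c "CHECKPOINT"
    pg_controldata $PGDATA | grep "REDO location" > before.txt
    time psql -c "COPY copy_test FROM '/tmp/copy_test.dat'"
    psql -c "CHECKPOINT"
    pg_controldata $PGDATA | grep "REDO location" > after.txt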

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: 8.4 open item: copy performance regression?

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On Sat, 2009-06-20 at 13:15 +0200, Stefan Kaltenbrunner wrote:
>>> 8192 6m43.203s/6m48.293s
>>> 16384 6m24.980s/6m24.116s
>>> 32768 6m20.753s/6m22.083s
>>> 65536 6m22.913s/6m22.449s
>>> 1048576 6m23.765s/6m24.645s
> 
>> The rest of the patch should have had a greater effect on tables with
>> thinner rows. Your results match my expectations, though I read from
>> them that we should use 16384, since that provides some gain, not just a
>> cancellation of the regression. 
> 
> +1 for using 16384 (ie, max ring buffer size 16MB).  Maybe even more.
> It seems likely that other cases might have an even bigger issue than
> is exhibited in the couple of test cases we have here, so we should
> leave some margin for error.  Also, there's code in there to limit the
> ring buffer to 1/8th of shared buffers, so we don't have to worry about
> trashing the whole buffer arena in small configurations.  Any limitation
> at all is still a step forward over previous releases as far as not
> trashing the arena is concerned.

+1. You might get away with a smaller ring with narrow tables, where 
writing 16MB of data produces more than 16MB of WAL, but I don't think 
it can ever be the other way round. Leaving a little bit of room for 
error doesn't seem like a bad idea, though.

IIRC we experimented with an auto-tuning ring size when we worked on the 
original ring buffer patch. The idea is that you start with a small 
ring, and enlarge it in StrategyRejectBuffer. But that seems too risky 
for 8.4.

I wonder if using the small ring showed any benefit when the COPY is not 
WAL-logged? In that scenario block-on-WAL-flush behavior doesn't happen, 
so the small ring might have some L2 cache benefits.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: 8.4 open item: copy performance regression?

From
Stefan Kaltenbrunner
Date:
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> On Sat, 2009-06-20 at 13:15 +0200, Stefan Kaltenbrunner wrote:
>>>> 8192 6m43.203s/6m48.293s
>>>> 16384 6m24.980s/6m24.116s
>>>> 32768 6m20.753s/6m22.083s
>>>> 65536 6m22.913s/6m22.449s
>>>> 1048576 6m23.765s/6m24.645s
>>
>>> The rest of the patch should have had a greater effect on tables with
>>> thinner rows. Your results match my expectations, though I read from
>>> them that we should use 16384, since that provides some gain, not just a
>>> cancellation of the regression. 
>>
>> +1 for using 16384 (ie, max ring buffer size 16MB).  Maybe even more.
>> It seems likely that other cases might have an even bigger issue than
>> is exhibited in the couple of test cases we have here, so we should
>> leave some margin for error.  Also, there's code in there to limit the
>> ring buffer to 1/8th of shared buffers, so we don't have to worry about
>> trashing the whole buffer arena in small configurations.  Any limitation
>> at all is still a step forward over previous releases as far as not
>> trashing the arena is concerned.
> 
> +1. You might get away with a smaller ring with narrow tables, where 
> writing 16MB of data produces more than 16MB of WAL, but I don't think 
> it can ever be the other way round. Leaving a little bit of room for 
> error doesn't seem like a bad idea, though.

yeah 16MB seems like the best choice given the available data and how 
far we are into the release cycle.

> 
> IIRC we experimented with an auto-tuning ring size when we worked on the 
> original ring buffer patch. The idea is that you start with a small 
> ring, and enlarge it in StrategyRejectBuffer. But that seems too risky 
> for 8.4.

agreed.

> 
> I wonder if using the small ring showed any benefit when the COPY is not 
> WAL-logged? In that scenario block-on-WAL-flush behavior doesn't happen, 
> so the small ring might have some L2 cache benefits.

I did some limited testing on that but I was unable to measure any 
significant effect - especially since the difference between wal-logged 
and not is rather small for a non-parallel COPY (ie in the above example 
you get around 6m20s runtime for wal-logged and ~5m40s in the other case).


Stefan


Re: 8.4 open item: copy performance regression?

From
Simon Riggs
Date:
On Sun, 2009-06-21 at 10:28 +0200, Stefan Kaltenbrunner wrote:
> I did some limited testing on that but I was unable to measure any 
> significant effect - especially since the difference between
> wal-logged and not is rather small for a non-parallel COPY (ie in the
> above example you get around 6m20s runtime for wal-logged and ~5m40s
> in the other case).

This is a common confusion for small tests.

Non-WAL logged case causes all buffers to be written to disk at end of
COPY. This is roughly the same size as the volume of WAL written. In
the logged case we do not write data blocks; they get written at the next
checkpoint. So the reduction in I/O is not apparent, since during the
period of the test the I/O is about the same in both cases and less I/O
in the non-WAL logged case. On longer tests the difference shows more
clearly because the data blocks start to migrate out of shared buffers
while the COPY is still running, affecting the test results.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: 8.4 open item: copy performance regression?

From
Simon Riggs
Date:
On Sun, 2009-06-21 at 02:45 -0400, Greg Smith wrote:
> On Sat, 20 Jun 2009, Simon Riggs wrote:
> 
> > I would suggest that we check how much WAL has been written. There may
> > be a secondary effect or a different regression hidden in these results.
> 
> What's the easiest way to do that? 

pg_current_xlog_insert_location()
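
For example (a sketch; the positions come back as hex pairs, so the byte
difference has to be worked out by hand, and the table/file names are
placeholders):

    psql -c "CHECKPOINT"
    psql -At -c "SELECT pg_current_xlog_insert_location()"   # record the starting position
    time psql -c "COPY copy_test FROM '/tmp/copy_test.dat'"
    psql -At -c "SELECT pg_current_xlog_insert_location()"   # subtract the first value from this one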

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: 8.4 open item: copy performance regression?

From
Stefan Kaltenbrunner
Date:
Simon Riggs wrote:
> On Sun, 2009-06-21 at 10:28 +0200, Stefan Kaltenbrunner wrote:
>> I did some limited testing on that but I was unable to measure any 
>> significant effect - especially since the difference between
>> wal-logged and not is rather small for a non-parallel COPY (ie in the
>> above example you get around 6m20s runtime for wal-logged and ~5m40s
>> in the other case).
> 
> This is a common confusion for small tests.
> 
> Non-WAL logged case causes all buffers to be written to disk at end of
> COPY. This is roughly the same size as the volume of WAL written. In
> logged case we do not write data blocks, they get written at next
> checkpoint. So the reduction in I/O is not apparent, since during the
> period of the test the I/O is about the same in both cases and less I/O
> in the non-WAL logged case. On longer tests the difference shows more
> clearly because the data blocks start to migrate out of shared buffers
> while the COPY is still running, effecting the test results. 

I was actually testing with and without explicit CHECKPOINTing 
before/after the load (and also with longer runs) too - the difference is 
negligible, especially with only one process involved.
I think the difference is simply not that large because we are still 
mostly CPU bound within COPY on reasonably fast IO-subsystems.



Stefan


Re: 8.4 open item: copy performance regression?

From
Stefan Kaltenbrunner
Date:
Stefan Kaltenbrunner wrote:
> Simon Riggs wrote:
>> On Sun, 2009-06-21 at 10:28 +0200, Stefan Kaltenbrunner wrote:
>>> I did some limited testing on that but I was unable to measure any 
>>> significant effect - especially since the difference between
>>> wal-logged and not is rather small for a non-parallel COPY (ie in the
>>> above example you get around 6m20s runtime for wal-logged and ~5m40s
>>> in the other case).
>>
>> This is a common confusion for small tests.
>>
>> Non-WAL logged case causes all buffers to be written to disk at end of
>> COPY. This is roughly the same size as the volume of WAL written. In
>> logged case we do not write data blocks, they get written at next
>> checkpoint. So the reduction in I/O is not apparent, since during the
>> period of the test the I/O is about the same in both cases and less I/O
>> in the non-WAL logged case. On longer tests the difference shows more
>> clearly because the data blocks start to migrate out of shared buffers
>> while the COPY is still running, effecting the test results. 
> 
> I was actually testing with and without explicit CHECKPOINTing 
> before/after the load(and also with longer runs) too - the difference is 
> negligible especially with only one process involved.
> I think the difference is simply not that large because we are still 
> mostly CPU bound within COPY on reasonably fast IO-subsystems.

hmm to further demonstrate that I just did some testing (same config as 
before and 16MB for the ring buffer) by loading those 60M rows into a 
20GB ramdisk instead of the SAN (with a CHECKPOINT before and after).

this results in the following "improvements":

16384:
wal bypass: 5min40s -> 5min10s (~9%)
wal logged: 6min20s -> 6min8s (~3%)


vmstat 5 output shows that the system is in fact CPU bound (ie using ~6% 
overall, which is more or less a full core on a 16-core box) and not doing 
anything IO-wise.

 r  b   swpd     free   buff    cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 19010444 118648 15415684    0    0     0     6 1068 2151  6  0 94  0  0
 1  0      0 18870448 118652 15555204    0    0     0     6 1069 2142  6  0 94  0  0
 1  0      0 18730568 118684 15694872    0    0     0   185 1080 2151  6  0 94  0  0
 1  0      0 18591236 118692 15834516    0    0     0    19 1072 2144  6  0 94  0  0
 1  0      0 18451472 118696 15973532    0    0     0    46 1073 2152  6  0 94  0  0
 1  0      0 18311720 118704 16113204    0    0     0     7 1059 2136  6  0 94  0  0
 1  0      0 18171968 118704 16252944    0    0     0     0 1077 2171  6  0 94  0  0
 1  0      0 18032088 118712 16392300    0    0     0    54 1062 2138  6  0 94  0  0
 1  0      0 17891716 118720 16532060    0    0     0     8 1078 2176  6  0 94  0  0


So I do think that IO is in fact not too significant for this kind of 
testing and we still have ways to go in terms of CPU efficiency in COPY.



Stefan


Re: 8.4 open item: copy performance regression?

From
Robert Haas
Date:
On Sun, Jun 21, 2009 at 6:48 AM, Stefan
Kaltenbrunner<stefan@kaltenbrunner.cc> wrote:
> So I do think that IO is in fact not too significant for this kind of
> testing and we still have ways to go in terms of CPU efficiency in COPY.

It would be interesting to see some gprof or oprofile output from that
test.  I went back and dug up the results that I got when I profiled
this patch during initial development; with my version of the patch
applied, the profile looked like this on my system:
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
14.48      0.85     0.85        1     0.85     5.47  DoCopy
10.05      1.44     0.59 10000001     0.00     0.00  CopyReadLine
 5.62      1.77     0.33 10000039     0.00     0.00  PageAddItem
 5.11      2.07     0.30 10400378     0.00     0.00  LWLockRelease
 4.68      2.35     0.28 10000013     0.00     0.00  heap_insert
 4.34      2.60     0.26 10000012     0.00     0.00  heap_formtuple
 3.83      2.83     0.23 10356158     0.00     0.00  LWLockAcquire
 3.83      3.05     0.23 10000054     0.00     0.00  MarkBufferDirty
 3.32      3.25     0.20 10000013     0.00     0.00  RelationGetBufferForTuple
 3.07      3.43     0.18 10000005     0.00     0.00  pg_verify_mbstr_len
 2.90      3.60     0.17 10000002     0.00     0.00  CopyGetData
 2.73      3.76     0.16 20000030     0.00     0.00  enlargeStringInfo
 2.73      3.92     0.16 20000014     0.00     0.00  pq_getbytes
 2.04      4.04     0.12 10000000     0.00     0.00  InputFunctionCall

...but this might not be very representative, since I think I may have
tested it on a single-column table.  It would be interesting to see
some other results.

Simon had the idea of further improving performance by keeping the
current buffer locked (this patch just kept it pinned, but not
locked), but I didn't see an obvious clean design for that.  Heikki
also had a patch for speeding up copy, but it got dropped for 8.4 due
to time constraints.

...Robert


Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> It would be interesting to see some gprof or oprofile output from that
> test.   I went back and dug up the results that I got when I profiled
> this patch during initial development, and my version of the patch
> applied, the profile looked like this on my system:

Were you testing with a temp table?  The lack of XLogInsert in your
profile is striking.  Stefan's results upthread had it at the top,
and I got more or less the same thing here (didn't keep my numbers
unfortunately).

> Simon had the idea of further improving performance by keeping the
> current buffer locked (this patch just kept it pinned, but not
> locked), but I didn't see an obvious clean design for that.

The potential for deadlock seems sufficient reason not to do that.
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> It would be interesting to see some gprof or oprofile output from that
>> test.   I went back and dug up the results that I got when I profiled
>> this patch during initial development, and my version of the patch
>> applied, the profile looked like this on my system:
> 
> Were you testing with a temp table?  The lack of XLogInsert in your
> profile is striking.  Stefan's results upthread had it at the top,
> and I got more or less the same thing here (didn't keep my numbers
> unfortunately).

I guess that profile was for the WAL bypass case and it looks fairly 
similar to what I get here (lineitem table into tmpfs - though only 30M 
rows this time to keep VM pressure low):

samples  %        symbol name
286197   17.1997  DoCopy
232958   14.0002  CopyReadLine
99762     5.9954  DecodeNumber
92751     5.5741  heap_fill_tuple
84439     5.0746  pg_verify_mbstr_len
65421     3.9316  InputFunctionCall
62502     3.7562  DecodeDate
53565     3.2191  heap_form_tuple
47731     2.8685  ParseDateTime
41206     2.4764  DecodeDateTime
39936     2.4001  pg_next_dst_boundary
36093     2.1691  AllocSetAlloc
33967     2.0413  heap_compute_data_size
29921     1.7982  float4in
27227     1.6363  DetermineTimeZoneOffset
25622     1.5398  pg_atoi
24703     1.4846  pg_mblen
24495     1.4721  .plt
23912     1.4371  pg_mbstrlen_with_len
23448     1.4092  bpchar_input
20033     1.2039  date2j
16331     0.9815  date_in
15684     0.9426  bpcharin
14819     0.8906  PageAddItem
14261     0.8571  ValidateDate


Stefan


Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> I wonder if using the small ring showed any benefit when the COPY is not 
> WAL-logged? In that scenario block-on-WAL-flush behavior doesn't happen, 
> so the small ring might have some L2 cache benefits.

I think the notion that we might get a cache win from a smaller ring
is an illusion.  We're not expecting to go back and re-read from a
previously filled page in this scenario.  In any case, all of the
profiling results so far show that the CPU bottlenecks are elsewhere.
Until we can squeeze an order of magnitude out of COPY's data parsing
and/or XLogInsert, any possible cache effects will be down in the noise.

So to my mind, the only question left to answer (at least for the 8.4
cycle) is "is 16MB enough, or do we want to make the ring even bigger?".
Right at the moment I'd be satisfied with 16, but I wonder whether there
are scenarios where 32MB would show a significant advantage.
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
Robert Haas
Date:
On Sun, Jun 21, 2009 at 11:31 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> It would be interesting to see some gprof or oprofile output from that
>> test.   I went back and dug up the results that I got when I profiled
>> this patch during initial development, and my version of the patch
>> applied, the profile looked like this on my system:
>
> Were you testing with a temp table?

No.

> The lack of XLogInsert in your
> profile is striking.  Stefan's results upthread had it at the top,
> and I got more or less the same thing here (didn't keep my numbers
> unfortunately).

As Stephen guessed, I created the table in the same transaction that I
inserted into it...

>> Simon had the idea of further improving performance by keeping the
>> current buffer locked (this patch just kept it pinned, but not
>> locked), but I didn't see an obvious clean design for that.
>
> The potential for deadlock seems sufficient reason not to do that.

Yep, that was the problem.  *thinks*

I think we had the idea of buffering up enough tuples to fill a page
(estimating conservatively so as to make sure we actually fill it),
and then inserting them all at once.  But I'm not sure how much
trouble that causes in terms of the timing of inserting index entries
and firing of after row insert triggers.  If the command ID doesn't
change, it seems like it might be OK.  Or at worst, even if the
optimization would only work in cases where there are no after row
triggers, that would still be useful to a lot of people, I think.

I haven't really spent much time on this angle of attack and
completely confess to not understanding all of the issues...

...Robert


Re: 8.4 open item: copy performance regression?

From
Robert Haas
Date:
On Sun, Jun 21, 2009 at 11:52 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> I wonder if using the small ring showed any benefit when the COPY is not
>> WAL-logged? In that scenario block-on-WAL-flush behavior doesn't happen,
>> so the small ring might have some L2 cache benefits.
>
> I think the notion that we might get a cache win from a smaller ring
> is an illusion.  We're not expecting to go back and re-read from a
> previously filled page in this scenario.  In any case, all of the
> profiling results so far show that the CPU bottlenecks are elsewhere.
> Until we can squeeze an order of magnitude out of COPY's data parsing
> and/or XLogInsert, any possible cache effects will be down in the noise.
>
> So to my mind, the only question left to answer (at least for the 8.4
> cycle) is "is 16MB enough, or do we want to make the ring even bigger?".
> Right at the moment I'd be satisfied with 16, but I wonder whether there
> are scenarios where 32MB would show a significant advantage.

Even 32MB is not that much.  It seems to me that in any realistic
production scenario you're going to have at least half a gig of shared
buffers, so we're really talking about at most one-sixteenth of the
shared buffer arena, and possibly quite a bit less.  I think that's
pretty conservative.

...Robert


Re: 8.4 open item: copy performance regression?

From
Greg Stark
Date:
On Sun, Jun 21, 2009 at 5:07 PM, Robert Haas<robertmhaas@gmail.com> wrote:
>
> I think we had the idea of buffering up enough tuples to fill a page
> (estimating conservatively so as to make sure we actually fill it),
> and then inserting them all at once.  But I'm not sure how much
> trouble that causes in terms of the timing of inserting index entries
> and firing of after row insert triggers.  If the command ID doesn't
> change, it seems like it might be OK.  Or at worst, even if the
> optimization would only work in cases where there are no after row
> triggers, that would still be useful to a lot of people, I think.
>
> I haven't really spent much time on this angle of attack and
> completely confess to not understanding all of the issues...

There was some discussion of doing this in general for all inserts
inside the indexam. The btree indexam could buffer up any inserts done
within the transaction and keep them in an in-memory btree. Any
lookups done within the transaction first look up in the in-memory
tree then the disk. If the in-memory buffer fills up then we flush
them to the index.

The reason this is tempting is that we could then insert them all in a
single index-merge operation which would often be more efficient than
retail inserts.

--
greg
http://mit.edu/~gsstark/resume.pdf


Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> There was some discussion of doing this in general for all inserts
> inside the indexam. The btree indexam could buffer up any inserts done
> within the transaction and keep them in an in-memory btree. Any
> lookups done within the transaction first look up in the in-memory
> tree then the disk. If the in-memory buffer fills up then we flush
> them to the index.

> The reason this is tempting is that we could then insert them all in a
> single index-merge operation which would often be more efficient than
> retail inserts.

That's not gonna work for a unique index, which unfortunately is a
pretty common case ...
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> I wonder if using the small ring showed any benefit when the COPY is not 
>> WAL-logged? In that scenario block-on-WAL-flush behavior doesn't happen, 
>> so the small ring might have some L2 cache benefits.
> 
> I think the notion that we might get a cache win from a smaller ring
> is an illusion.  We're not expecting to go back and re-read from a
> previously filled page in this scenario.  In any case, all of the
> profiling results so far show that the CPU bottlenecks are elsewhere.
> Until we can squeeze an order of magnitude out of COPY's data parsing
> and/or XLogInsert, any possible cache effects will be down in the noise.

we also need to take a serious look at our locking overhead - WAL-logged 
COPY is already taking a significant performance hit with just a second 
process running in parallel (into a separate table).
I just did some testing using the 16MB ring buffer, the postgresql.conf 
mentioned upthread and a 20GB tmpfs.

The following shows copying 3M rows (each) into a separate table of the same 
database.

processes    total time(s)    rows/s    rows/s - per core

1    17.5    171428.57    171428.57
2    20.8    288461.54    144230.77
4    25.5    470588.24    117647.06
6    31.1    578778.14    96463.02
8    41.4    579710.14    72463.77
10    63    476190.48    47619.05
12    89    404494.38    33707.87
14    116    362068.97    25862.07
16    151    317880.79    19867.55



The higher the process count, the more erratically the box behaves - it 
shows a very high context switch rate (between 300000 and 400000/s) and a 
large amount of idle time (>60%!).

example vmstat 5 output for the 12 process test:
 7  0      0 21654500  45436 12932516    0    0     0     3 1079 336941 34  7 59  0  0
 6  0      0 21354044  45444 13232444    0    0     0    52 1068 341836 35  7 59  0  0
 4  0      0 21053832  45452 13531472    0    0     0    23 1082 341672 35  7 59  0  0
 9  0      0 20751136  45460 13833336    0    0     0    41 1063 344117 35  7 59  0  0
 6  0      0 20443856  45468 14138116    0    0     0    14 1079 349398 35  7 58  0  0
 8  0      0 20136592  45476 14444644    0    0     0     8 1060 351569 35  7 58  0  0
10  0      0 19836600  45484 14743320    0    0     0   144 1086 341533 35  7 58  0  0
 7  0      0 19540472  45492 15039616    0    0     0    94 1067 337731 36  7 58  0  0
 2  0      0 19258244  45500 15321156    0    0     0    15 1079 311394 34  6 60  0  0



Stefan


Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> The following shows copying 3M rows (each) into a separate table of the same
> database.

Is this with WAL, or bypassing WAL?  Given what we've already seen,
a lot of contention for WALInsertLock wouldn't surprise me much here.
It should be possible to bypass that though.
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
Heikki Linnakangas
Date:
Robert Haas wrote:
> On Sun, Jun 21, 2009 at 11:52 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
>> So to my mind, the only question left to answer (at least for the 8.4
>> cycle) is "is 16MB enough, or do we want to make the ring even bigger?".
>> Right at the moment I'd be satisfied with 16, but I wonder whether there
>> are scenarios where 32MB would show a significant advantage.
> 
> Even 32MB is not that much.  It seems to me that in any realistic
> production scenario you're going to have at least half a gig of shared
> buffers, so we're really talking about at most one-sixteenth of the
> shared buffer arena, and possibly quite a bit less.  I think that's
> pretty conservative.

I was going to say that since we flush the WAL every 16MB anyway (at 
every XLOG file switch), you shouldn't see any benefit with larger ring 
buffers, since to fill 16MB of data you're not going to write more than 
16MB WAL. But then I realized that that's not true if you have an 
unusually low fillfactor. If you only fill each page say 50% full, 
you're going to use 32MB worth of data pages but only write 16MB of WAL. 
And maybe you could have a situation like that with very wide rows as 
well, with wasted space on each page that's not enough to store one more  row.

Could you test that scenario?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> I was going to say that since we flush the WAL every 16MB anyway (at 
> every XLOG file switch), you shouldn't see any benefit with larger ring 
> buffers, since to fill 16MB of data you're not going to write more than 
> 16MB WAL.

I'm not convinced that WAL segment boundaries are particularly relevant
to this.  The unit of flushing is an 8K page, not a segment.

I wonder though whether the wal_buffers setting interacts with the
ring size.  Has everyone who's tested this used the same 16MB
wal_buffers setting as in Alan's original scenario?
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> The following shows copying 3M rows (each) into a separate table of the same
>> database.
> 
> Is this with WAL, or bypassing WAL?  Given what we've already seen,
> a lot of contention for WALInsertLock wouldn't surprise me much here.
> It should be possible to bypass that though.

This was with WAL. Here are the numbers for the WAL-bypass case:

processes    total time(s)    rows/s    rows/s - per core

1    15    200000    200000
2    15.1    397350.99    198675.5
4    15.2    789473.68    197368.42
6    15.3    1176470.59    196078.43
8    16.2    1481481.48    185185.19
10    21.9    1369863.01    136986.3
12    22.7    1585903.08    132158.59
14    25.2    1666666.67    119047.62
16    27.9    1720430.11    107526.88


Runtimes grew very short here, but the numbers still seem sane, and if you 
compare them to what I did on real storage (though without the 16MB 
ring buffer fix!):

http://www.kaltenbrunner.cc/blog/index.php?/archives/27-Benchmarking-8.4-Chapter-2bulk-loading.html

you will see that for a single core there is almost no performance 
difference between ramdisk and real disk; at 8 cores the gap is largest, 
at around 45%, but at 16 cores we are down to a mere 20% difference.
All in all it seems that we have a big locking issue with WALInsertLock, 
and even with that removed we are mostly CPU-limited rather than 
I/O-limited for COPY.



Stefan




Re: 8.4 open item: copy performance regression?

From
Simon Riggs
Date:
On Sun, 2009-06-21 at 12:38 -0400, Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > There was some discussion of doing this in general for all inserts
> > inside the indexam. The btree indexam could buffer up any inserts done
> > within the transaction and keep them in an in-memory btree. Any
> > lookups done within the transaction first look up in the in-memory
> > tree then the disk. If the in-memory buffer fills up then we flush
> > them to the index.
> 
> > The reason this is tempting is that we could then insert them all in a
> > single index-merge operation which would often be more efficient than
> > retail inserts.
> 
> That's not gonna work for a unique index, which unfortunately is a
> pretty common case ...

I think it can. If we fail on a unique index, we fail. We aren't
expecting that, else we wouldn't be using COPY. So I reckon it's
acceptable to load a whole block of rows and then load a whole block's
worth of index entries. The worst thing that can happen is that we insert
a few extra heap rows that get aborted, which is a small cost compared to
the potential gains from buffering.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: 8.4 open item: copy performance regression?

From
Simon Riggs
Date:
On Sun, 2009-06-21 at 20:37 +0300, Heikki Linnakangas wrote:
> Robert Haas wrote:
> > On Sun, Jun 21, 2009 at 11:52 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> >> So to my mind, the only question left to answer (at least for the 8.4
> >> cycle) is "is 16MB enough, or do we want to make the ring even bigger?".
> >> Right at the moment I'd be satisfied with 16, but I wonder whether there
> >> are scenarios where 32MB would show a significant advantage.
> > 
> > Even 32MB is not that much.  It seems to me that in any realistic
> > production scenario you're going to have at least half a gig of shared
> > buffers, so we're really talking about at most one-sixteenth of the
> > shared buffer arena, and possibly quite a bit less.  I think that's
> > pretty conservative.
> 
> I was going to say that since we flush the WAL every 16MB anyway (at 
> every XLOG file switch), you shouldn't see any benefit with larger ring 
> buffers, since to fill 16MB of data you're not going to write more than 
> 16MB WAL. But then I realized that that's not true if you have an 
> unusually low fillfactor. If you only fill each page say 50% full, 
> you're going to use 32MB worth of data pages but only write 16MB of WAL. 
> And maybe you could have a situation like that with very wide rows as 
> well, with wasted space on each page that's not enough to store one more 
>   row.

If walwriter is working correctly then it should be writing and fsyncing
WAL, while the COPY process just inserts WAL. I don't see that as an
argument to limit us to 16MB. But I take your point as being an argument
in favour of that as a consensus value for us to choose.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: 8.4 open item: copy performance regression?

From
Alan Li
Date:
Hi Tom,

How much concern is there for the contention for use cases where the WAL
can't be bypassed?

Thanks, Alan

On Sun, Jun 21, 2009 at 10:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> > The following shows copying 3M rows (each) into a separate table of the same
> > database.
>
> Is this with WAL, or bypassing WAL?  Given what we've already seen,
> a lot of contention for WALInsertLock wouldn't surprise me much here.
> It should be possible to bypass that though.
>
>                         regards, tom lane

Re: 8.4 open item: copy performance regression?

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> I was going to say that since we flush the WAL every 16MB anyway (at 
>> every XLOG file switch), you shouldn't see any benefit with larger ring 
>> buffers, since to fill 16MB of data you're not going to write more than 
>> 16MB WAL.
> 
> I'm not convinced that WAL segment boundaries are particularly relevant
> to this.  The unit of flushing is an 8K page, not a segment.

We fsync() the old WAL segment every time we switch to a new WAL 
segment. That's what I meant by "flush".

If the walwriter is keeping up, it will fsync() the WAL more often, but 
16MB is the maximum distance between fsync()s.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: 8.4 open item: copy performance regression?

From
Simon Riggs
Date:
On Mon, 2009-06-22 at 10:52 +0300, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> >> I was going to say that since we flush the WAL every 16MB anyway (at 
> >> every XLOG file switch), you shouldn't see any benefit with larger ring 
> >> buffers, since to fill 16MB of data you're not going to write more than 
> >> 16MB WAL.
> > 
> > I'm not convinced that WAL segment boundaries are particularly relevant
> > to this.  The unit of flushing is an 8K page, not a segment.
> 
> We fsync() the old WAL segment every time we switch to a new WAL 
> segment. That's what I meant by "flush".
> 
> If the walwriter is keeping up, it will fsync() the WAL more often, but 
> 16MB is the maximum distance between fsync()s.

Yes, but the fsync is performed by the process that writes the WAL, not
necessarily by the process that inserts the WAL. In perfect balance, an
inserter-of-WAL could insert an infinite amount of WAL and never need to
fsync the WAL. So the question is are we in balance between WALWriter
and COPY?

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
Alan Li <ali@truviso.com> writes:
> How much concern is there for the contention for use cases where the WAL
> can't be bypassed?

If you mean "is something going to be done about it in 8.4", the
answer is "no".  This is a pre-existing issue that there is no simple
fix for.
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Alan Li <ali@truviso.com> writes:
>   
>> How much concern is there for the contention for use cases where the WAL
>> can't be bypassed?
>>     
>
> If you mean "is something going to be done about it in 8.4", the
> answer is "no".  This is a pre-existing issue that there is no simple
> fix for.
>
>             
>   

I thought he was asking if we intend to provide for WAL bypass on a 
table by table basis in future.

cheers

andrew


Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> Alan Li <ali@truviso.com> writes:
>>> How much concern is there for the contention for use cases where the WAL
>>> can't be bypassed?
>> 
>> If you mean "is something going to be done about it in 8.4", the
>> answer is "no".  This is a pre-existing issue that there is no simple
>> fix for.

> I thought he was asking if we intend to provide for WAL bypass on a 
> table by table basis in future.

I thought he was asking for a solution to the problem of WALInsertLock
contention.  In any case, we have "WAL bypass on a table by table basis"
now, don't we?
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
Andrew Dunstan
Date:

Tom Lane wrote:
> I thought he was asking for a solution to the problem of WALInsertLock
> contention.  In any case, we have "WAL bypass on a table by table basis"
> now, don't we?
>
>             
>   

If we do I'm ignorant of it ;-) How do we say "Never WAL this table"?

cheers

andrew


Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> I thought he was asking for a solution to the problem of WALInsertLock
>> contention.  In any case, we have "WAL bypass on a table by table basis"
>> now, don't we?

> If we do I'm ignorant of it ;-) How do we say "Never WAL this table"?

Make it a temporary table.

The more useful case for data load is "create or truncate it in the
same transaction", of course.
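
For concreteness, the load pattern being referred to is roughly this (the
table name and file path are placeholders, and the WAL skip only applies
when WAL archiving is off):

    BEGIN;
    TRUNCATE lineitem;    -- or CREATE TABLE lineitem (...) in this same transaction
    COPY lineitem FROM '/tmp/lineitem.tbl' WITH DELIMITER '|';
    COMMIT;

Because the table's current contents are guaranteed to disappear if the
transaction aborts, COPY can skip WAL for the data and fsync the heap
instead.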
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
Aidan Van Dyk
Date:
* Andrew Dunstan <andrew@dunslane.net> [090622 10:47]:

> If we do I'm ignorant of it ;-) How do we say "Never WAL this table"?

CREATE TEMPORARY TABLE ...

a.


-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: 8.4 open item: copy performance regression?

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> The more useful case for data load is "create or truncate it in the
> same transaction", of course.

Unfortunately, WAL bypass also requires not being in archive mode with
no way to turn that off w/o a server restart, aiui.
Thanks,
    Stephen

Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> The more useful case for data load is "create or truncate it in the
>> same transaction", of course.

> Unfortunately, WAL bypass also requires not being in archive mode with
> no way to turn that off w/o a server restart, aiui.

Well, if you're trying to archive then you certainly wouldn't want WAL
off, so I'm failing to see where this thread is going ...
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > Unfortunately, WAL bypass also requires not being in archive mode with
> > no way to turn that off w/o a server restart, aiui.
>
> Well, if you're trying to archive then you certainly wouldn't want WAL
> off, so I'm failing to see where this thread is going ...

I disagree.  I'd love to be able to say "please bypass WAL logging for
this bulk load" because I know that I'll pick up the data during my next
full dump and I can reload it from original if I get disrupted before
then.  This is especially true when you're doing bulk loads of static or
reference data from another data source.
Thanks,
    Stephen

Re: 8.4 open item: copy performance regression?

From
Simon Riggs
Date:
On Mon, 2009-06-22 at 11:14 -0400, Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> >> The more useful case for data load is "create or truncate it in the
> >> same transaction", of course.
> 
> > Unfortunately, WAL bypass also requires not being in archive mode with
> > no way to turn that off w/o a server restart, aiui.
> 
> Well, if you're trying to archive then you certainly wouldn't want WAL
> off, so I'm failing to see where this thread is going ...

I was thinking it might be beneficial to be able to defer writing WAL
until COPY is complete, so heap_sync would either fsync the whole heap
file or copy the whole file to WAL.

That would avoid writing WAL piecemeal because we could just backup the
whole block, plus we wouldn't write anything at all if the COPY failed.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> I was thinking it might be beneficial to be able to defer writing WAL
> until COPY is complete, so heap_sync would either fsync the whole heap
> file or copy the whole file to WAL.

What about indexes?
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
Simon Riggs
Date:
On Mon, 2009-06-22 at 11:24 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > I was thinking it might be beneficial to be able to defer writing WAL
> > until COPY is complete, so heap_sync would either fsync the whole heap
> > file or copy the whole file to WAL.
> 
> What about indexes?

I was thinking we could do exactly as stated for the cases that would be
WAL-bypass now, but need to write WAL because XLogArchivingActive().
i.e. improve the exact case we are measuring here.

Yes, it is more complex than that for loading to existing tables.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: 8.4 open item: copy performance regression?

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> Tom Lane wrote:
>>     
>>> I thought he was asking for a solution to the problem of WALInsertLock
>>> contention.  In any case, we have "WAL bypass on a table by table basis"
>>> now, don't we?
>>>       
>
>   
>> If we do I'm ignorant of it ;-) How do we say "Never WAL this table"?
>>     
>
> Make it a temporary table.
>
>
>   

That doesn't help if you need the data visible in multiple sessions. But 
we're digressing from the original topic. Sorry.

cheers

andrew


Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> I'm not convinced that WAL segment boundaries are particularly relevant
>> to this.  The unit of flushing is an 8K page, not a segment.

> We fsync() the old WAL segment every time we switch to a new WAL 
> segment. That's what I meant by "flush".

> If the walwriter is keeping up, it will fsync() the WAL more often, but 
> 16MB is the maximum distance between fsync()s.

I'm still not convinced --- to my mind the issue is not whether fsyncs
happen but whether the COPY process has to wait for 'em, and I don't
think that segment boundaries directly affect that.  I'd still be
interested to see similar measurements done with different wal_buffers
settings.

However, in the interests of getting this resolved in time for 8.4.0,
I propose that we just settle on 16MB as the bulkwrite ring buffer size.
There doesn't seem to be any evidence that a larger size will make for
a significant improvement, and we shouldn't allow COPY to trash a bigger
fraction of the arena than it really has to.
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
Dimitri Fontaine
Date:
On 22 June 2009, at 17:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Simon Riggs <simon@2ndQuadrant.com> writes:
>> I was thinking it might be beneficial to be able to defer writing WAL
>> until COPY is complete, so heap_sync would either fsync the whole
>> heap
>> file or copy the whole file to WAL.
>
> What about indexes?

Skip this optimisation if there are any.
It's already common practice, when possible, to create indexes only after
the copy has succeeded, for better bulk loading performance.

Then there's also the copy + Insert ... Select ... technique.
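
As a rough sketch of that staging approach (table and file names are just
placeholders):

    -- Load into an unindexed staging table, so COPY does no index maintenance.
    CREATE TABLE lineitem_stage (LIKE lineitem INCLUDING DEFAULTS);
    COPY lineitem_stage FROM '/tmp/lineitem.tbl' WITH DELIMITER '|';

    -- Then move the rows into the real (indexed) table in one pass.
    INSERT INTO lineitem SELECT * FROM lineitem_stage;
    DROP TABLE lineitem_stage;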

--
dim

Re: 8.4 open item: copy performance regression?

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> I wonder though whether the wal_buffers setting interacts with the
> ring size.  Has everyone who's tested this used the same 16MB
> wal_buffers setting as in Alan's original scenario?
I had been using his postgresql.conf file, then added autovacuum =
off.  When I tried setting the ring size to 16MB, I accidentally
left out the step of copying in his postgresql.conf file, and got better
performance.  I alternated between the postgresql.conf file from the
earlier tests and the default file left there by initdb, and got
this:
8.4rc1 with 16MB ring, default postgresql.conf
0m23.223s
0m23.489s
0m23.921s

8.4rc1 with 16MB ring, Alan's postgresql.conf
0m28.678s
0m26.171s
0m27.513s

default postgresql.conf (comments stripped)
max_connections = 100
shared_buffers = 32MB
datestyle = 'iso, mdy'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
default_text_search_config = 'pg_catalog.english'

Alan's postgresql.conf (comments stripped)
shared_buffers = 256MB
wal_buffers = 16MB
checkpoint_segments = 100
autovacuum = off

I'm not going to claim I know why, but I thought I should report it.
Oh, and the 8.3.7 numbers and pre-patch numbers were averaging the
same under the day-time load as the replication sync mode.  So, with
the ring size at 16MB this load is faster under 8.4 than 8.3.
-Kevin


Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
>> I wonder though whether the wal_buffers setting interacts with the
>> ring size.  Has everyone who's tested this used the same 16MB
>> wal_buffers setting as in Alan's original scenario?
> I had been using his postgresql.conf file, then added autovacuum =
> off.  When I tried with setting the ring size to 16MB, I accidentally
> left off the step to copy the postgresql.conf file, and got better
> performance.

Huh, that's bizarre.  I can see that increasing shared_buffers should
make no difference in this test case (we're not using them all anyway).
But why should increasing wal_buffers make it slower?  I forget the
walwriter's control algorithm at the moment ... maybe it works harder
when wal buffers are full?

BTW, I committed the change to use 16MB; that will be in RC2.
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
Greg Smith
Date:
On Mon, 22 Jun 2009, Kevin Grittner wrote:

> When I tried with setting the ring size to 16MB, I accidentally left off 
> the step to copy the postgresql.conf file, and got better performance.

Do you happen to have a build with assertions turned on?  That is one
common cause, which people tend to run into during beta, of performance
going down as shared_buffers is increased.

You can check via psql with
  show debug_assertions;

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: 8.4 open item: copy performance regression?

From
"Kevin Grittner"
Date:
Greg Smith <gsmith@gregsmith.com> wrote: 
> Do you have happen to have a build with assertions turned on?
Nope.  I showed my ./configure options upthread, but can confirm with
pg_config:
BINDIR = /usr/local/pgsql-8.4rc1/bin
DOCDIR = /usr/local/pgsql-8.4rc1/share/doc
HTMLDIR = /usr/local/pgsql-8.4rc1/share/doc
INCLUDEDIR = /usr/local/pgsql-8.4rc1/include
PKGINCLUDEDIR = /usr/local/pgsql-8.4rc1/include
INCLUDEDIR-SERVER = /usr/local/pgsql-8.4rc1/include/server
LIBDIR = /usr/local/pgsql-8.4rc1/lib
PKGLIBDIR = /usr/local/pgsql-8.4rc1/lib
LOCALEDIR = /usr/local/pgsql-8.4rc1/share/locale
MANDIR = /usr/local/pgsql-8.4rc1/share/man
SHAREDIR = /usr/local/pgsql-8.4rc1/share
SYSCONFDIR = /usr/local/pgsql-8.4rc1/etc
PGXS = /usr/local/pgsql-8.4rc1/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--prefix=/usr/local/pgsql-8.4rc1'
'--enable-integer-datetimes' '--enable-debug' '--disable-nls'
'--with-libxml'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing
-fwrapv -g
CFLAGS_SL = -fpic
LDFLAGS = -Wl,--as-needed -Wl,-rpath,'/usr/local/pgsql-8.4rc1/lib'
LDFLAGS_SL =
LIBS = -lpgport -lxml2 -lz -lreadline -lcrypt -ldl -lm
VERSION = PostgreSQL 8.4rc1
> You can check via psql with
> 
>    show debug_assertions;
OK, we'll do it your way.  :-)
kgrittn@ATHENA:~/postgresql-8.4rc1> psql postgres
psql (8.4rc1)
Type "help" for help.

postgres=# show debug_assertions;
 debug_assertions
------------------
 off
(1 row)
-Kevin


Re: 8.4 open item: copy performance regression?

From
"Kevin Grittner"
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: 
> A 25-30% performance regression in our main bulk loading mechanism 
> should at least be explained before the release...
I think a performance regression of that magnitude merits holding up
a release to resolve.
Note that in a follow-up post showing the slow 8.4 copying on Linux,
the copy was 11.7% to 19.3% *faster* on 8.4 when the WAL writing was
suppressed:
http://archives.postgresql.org/pgsql-performance/2009-06/msg00219.php
Extracting from that post:
# I can reproduce that on Linux (CentOS 5.3/x86_64, Nehalem Xeon E5530)
# on 8.4 I get:
# 
# 3m59/4m01/3m56s runtime and a profile of
# 
# samples  %        symbol name
# 636302   19.6577  XLogInsert
# 415510   12.8366  CopyReadLine
# on 8.3.7 I get 2m58s,2m54s,2m55s
# 
# and a profile of:
# 
# samples  %        symbol name
# 460966   16.2924  XLogInsert
# 307386   10.8643  CopyReadLine
# If I do the same test utilizing WAL bypass the picture changes:
# 
# 8.3 runtimes: 2m16s, 2m14s, 2m22s
# 8.4 runtimes: 2m1s, 2m0s, 1m59s
Is there a reason to believe that the XLogInsert part of this *only*
affects bulk loads?
-Kevin


Re: 8.4 open item: copy performance regression?

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> Huh, that's bizarre.  I can see that increasing shared_buffers
> should make no difference in this test case (we're not using them
> all anyway). But why should increasing wal_buffers make it slower? 
> I forget the walwriter's control algorithm at the moment ... maybe
> it works harder when wal buffers are full?
I created a postgresql.conf file with the options from the default
file, and then tried that by itself again, and with each of three
other options:
<none>
0m24.540s
0m24.630s
0m23.778s

checkpoint_segments = 100
0m30.251s
0m29.474s
0m26.604s

wal_buffers = 16MB
0m24.487s
0m23.939s
0m23.557s

shared_buffers = 256MB
0m25.885s
0m25.654s
0m24.025s
So the big hit seems to come from boosting checkpoint_segments,
although boosting shared_buffers seems to cause a slight slowdown. 
Boosting wal_buffers seemed to help a little.  Both of these last two,
though, are within the noise, so low confidence on those without a lot
more tests.
The checkpoint_segments seems dramatic enough to be real.  I wonder if
the test is short enough that it never got around to re-using any of
them, so it was doing extra writes for the initial creation during the
test?
-Kevin


Re: 8.4 open item: copy performance regression?

From
"Kevin Grittner"
Date:
I wrote: 
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: 
>  
>> A 25-30% performance regression in our main bulk loading mechanism 
>> should at least be explained before the release...
> 
> I think a performance regression of that magnitude merits holding
> up a release to resolve.
Wow.  That sure took a long time to come through.  I posted that
days ago....
-Kevin



Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> The checkpoint_segments seems dramatic enough to be real.  I wonder if
> the test is short enough that it never got around to re-using any of
> them, so it was doing extra writes for the initial creation during the
> test?

That's exactly what I was about to suggest.  Are you starting each run
from a fresh initdb?  If so, try running the load long enough that the
number of WAL files stabilizes (should happen at 2x checkpoint_segments)
and then start the test measurement.
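
If it helps, a rough way to watch the segment count from psql (pg_ls_dir()
requires superuser, and this assumes the 8.x pg_xlog directory name) is:

    SELECT count(*) AS wal_segments
      FROM pg_ls_dir('pg_xlog') AS f
     WHERE f ~ '^[0-9A-F]{24}$';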
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
Alan Li
Date:
<div class="gmail_quote">On Mon, Jun 22, 2009 at 7:16 AM, Tom Lane <span dir="ltr"><<a
href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>></span>wrote:<br /><blockquote class="gmail_quote"
style="border-left:1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="im">Alan Li
<<ahref="mailto:ali@truviso.com">ali@truviso.com</a>> writes:<br /> > How much concern is there for the
contentionfor use cases where the WAL<br /> > can't be bypassed?<br /><br /></div>If you mean "is something going to
bedone about it in 8.4", the<br /> answer is "no".  This is a pre-existing issue that there is no simple<br /> fix
for.<br/><br />                        regards, tom lane<br /><br /><br /></blockquote></div>No no, I am certainly not
implyinganything for the 8.4 timeframe.<br /><br />Moving forward, I imagine this being more of a problem for data
warehouseapplications, where bulk inserts occur on existing fact tables.  In this case, the WAL cannot be bypassed
(unlessthe bulk insert occurs on a newly created partition).  And since COPY is cpu-bound, it would perhaps be
advantageousto do parallel COPY's on the same table on multi-core systems, which won't work with WAL bypassing
either.<br/><br />Thanks, Alan<br /> 

Re: 8.4 open item: copy performance regression?

From
Simon Riggs
Date:
On Mon, 2009-06-22 at 15:18 -0500, Kevin Grittner wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
>  
>  
> default postgresql.conf (comments stripped)
> max_connections = 100
> shared_buffers = 32MB

This forces the ring size down to 4MB, since the ring is capped at
min(shared_buffers/8, ring size) = min(32MB/8, 16MB) = 4MB.

Please re-run the tests with your config but a 4MB ring size, which should
give the same answer.

--
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: 8.4 open item: copy performance regression?

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> The checkpoint_segments seems dramatic enough to be real.  I wonder
>> if the test is short enough that it never got around to re-using
>> any of them, so it was doing extra writes for the initial creation
>> during the test?
> 
> That's exactly what I was about to suggest.  Are you starting each
> run from a fresh initdb?  If so, try running the load long enough
> that the number of WAL files stabilizes (should happen at 2x
> checkpoint_segments) and then start the test measurement.
default conf (xlogs not populated)
real    3m49.604s
real    3m47.225s
real    3m45.831s

default conf (xlogs populated)
real    3m45.603s
real    3m45.284s
real    3m45.906s

default conf + checkpoint_segments = 100 (xlogs not populated)
real    4m27.629s
real    4m24.496s
real    4m22.832s

default conf + checkpoint_segments = 100 (xlogs populated)
real    3m52.746s
real    3m52.619s
real    3m50.418s
I used ten times the number of rows, to get more meaningful results. 
To get the "populated" times, I just dropped the target table and
created it again; otherwise identical runs.  Clearly, pre-populating
the xlog files reduces run time, especially for a large number of xlog
files; however, I still got better performance with a smaller set of
xlog files.
Regarding the fact that even with the xlog files pre-populated, the
smaller set of xlog files is faster: I'm only guessing, but I suspect
the battery backed RAID controller is what's defeating conventional
wisdom here.  By writing to the same, relatively small, set of xlog
files repeatedly, some of the actual disk writes probably evaporate in
the BBU cache.  More frequent checkpoints from the smaller number of
xlog files might also have caused data to start streaming to the disk
a little sooner, minimizing write gluts later.
I've often seen similar benefits from the BBU cache, which cause some of
the frequently-given advice here to have no discernible effect, or to be
counter-productive, in our environment.  (I know some doubted my claim
that my aggressive background writer settings didn't increase disk writes,
but I couldn't even measure a difference in the writes from OS cache to
the controller cache, much less anything indicating that they actually
increased physical disk writes.)
By the way, the number of xlog files seemed to always go to two above
2x checkpoint_segments.
-Kevin


Re: 8.4 open item: copy performance regression?

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Regarding the fact that even with the xlog files pre-populated, the
> smaller set of xlog files is faster: I'm only guessing, but I suspect
> the battery backed RAID controller is what's defeating conventional
> wisdom here.  By writing to the same, relatively small, set of xlog
> files repeatedly, some of the actual disk writes probably evaporate in
> the BBU cache.

Yeah, sounds plausible.  How big is your BBU cache?
        regards, tom lane


Re: 8.4 open item: copy performance regression?

From
"Andrew Dunstan"
Date:
On Fri, June 26, 2009 4:13 pm, Kevin Grittner wrote:
> By the way, the number of xlog files seemed to always go to two above
> 2x checkpoint_segments.


The docs say:

"There will always be at least one WAL segment file, and will normally not
be more than (2 + checkpoint_completion_target) * checkpoint_segments + 1
files."


cheers

andrew



Re: 8.4 open item: copy performance regression?

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> How big is your BBU cache?
On this machine, I guess it is 512MB.  (Possibly 1GB, but I'm having
trouble finding the right incantation to check it at the moment, so
I'm going by what the hardware tech remembers.)
-Kevin