Thread: testing HS/SR - 1 vs 2 performance

testing HS/SR - 1 vs 2 performance

From
"Erik Rijkers"
Date:
Using 9.0devel cvs HEAD, 2010.04.08.

I am trying to understand the performance difference
between primary and standby under a standard pgbench
read-only test.

server has 32 GB, 2 quadcores.

primary:
  tps = 34606.747930 (including connections establishing)
  tps = 34527.078068 (including connections establishing)
  tps = 34654.297319 (including connections establishing)

standby:
  tps = 700.346283 (including connections establishing)
  tps = 717.576886 (including connections establishing)
  tps = 740.522472 (including connections establishing)

transaction type: SELECT only
scaling factor: 1000
query mode: simple
number of clients: 20
number of threads: 1
duration: 900 s

both instances have
  max_connections = 100
  shared_buffers = 256MB
  checkpoint_segments = 50
  effective_cache_size = 16GB

See also:

http://archives.postgresql.org/pgsql-testers/2010-04/msg00005.php    (differences with scale 10_000)

I understand that in the scale=1000 case, there is a huge
cache effect, but why doesn't that apply to the pgbench runs
against the standby?  (and for the scale=10_000 case the
differences are still rather large)

Maybe these differences are as expected.  I don't find
any explanation in the documentation.


thanks,

Erik Rijkers




Re: testing HS/SR - 1 vs 2 performance

From
Fujii Masao
Date:
On Sat, Apr 10, 2010 at 8:23 AM, Erik Rijkers <er@xs4all.nl> wrote:
> I understand that in the scale=1000 case, there is a huge
> cache effect, but why doesn't that apply to the pgbench runs
> against the standby?  (and for the scale=10_000 case the
> differences are still rather large)

I guess that this performance degradation happened because the large number of
buffer replacements caused UpdateMinRecoveryPoint() to be called often. So I
think increasing shared_buffers would improve the performance significantly.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
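
As a side note on the mechanism behind that guess: if I read the buffer manager
right, every time a backend on the standby has to evict a dirty buffer it goes
through XLogFlush(), which during recovery advances the minimum recovery point
via UpdateMinRecoveryPoint() (a control file update). So what matters is how
often buffer replacement happens at all. The following is only a toy,
self-contained model (not PostgreSQL code; the FIFO policy, page counts and
buffer counts are illustrative guesses) showing how the replacement count falls
as the buffer pool grows:

#include <stdio.h>
#include <stdlib.h>

#define PAGES    1600000L   /* roughly a pgbench scale 1000 accounts heap in 8kB pages (a guess) */
#define ACCESSES 5000000L   /* random page fetches, standing in for the SELECT-only run */

static long
simulate(long pool_size)
{
	long	   *slot_page = malloc(sizeof(long) * pool_size);	/* page held by each buffer */
	char	   *is_cached = calloc(PAGES, 1);					/* page residency flags */
	long		replacements = 0;
	long		victim = 0;										/* simple FIFO replacement */

	for (long i = 0; i < pool_size; i++)
		slot_page[i] = -1;

	for (long a = 0; a < ACCESSES; a++)
	{
		long		page = rand() % PAGES;

		if (is_cached[page])
			continue;				/* buffer hit, nothing to do */

		if (slot_page[victim] >= 0)
		{
			is_cached[slot_page[victim]] = 0;
			replacements++;			/* evicting a page; if it were dirty, the
									 * standby would have to flush it first */
		}
		slot_page[victim] = page;
		is_cached[page] = 1;
		victim = (victim + 1) % pool_size;
	}

	free(slot_page);
	free(is_cached);
	return replacements;
}

int
main(void)
{
	/* 8kB buffers: 32768 is 256MB (Erik's setting); 1GB and 4GB for comparison */
	long		pools[] = {32768, 131072, 524288};

	for (int i = 0; i < 3; i++)
		printf("%7ld buffers: %ld buffer replacements\n",
			   pools[i], simulate(pools[i]));
	return 0;
}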


Re: testing HS/SR - 1 vs 2 performance

From
Robert Haas
Date:
On Mon, Apr 12, 2010 at 5:06 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Sat, Apr 10, 2010 at 8:23 AM, Erik Rijkers <er@xs4all.nl> wrote:
>> I understand that in the scale=1000 case, there is a huge
>> cache effect, but why doesn't that apply to the pgbench runs
>> against the standby?  (and for the scale=10_000 case the
>> differences are still rather large)
>
> I guess that this performance degradation happened because a number of
> buffer replacements caused UpdateMinRecoveryPoint() often. So I think
> increasing shared_buffers would improve the performance significantly.

I think we need to investigate this more.  It's not going to look good
for the project if people find that a hot standby server runs two
orders of magnitude slower than the primary.

...Robert


Re: testing HS/SR - 1 vs 2 performance

From
"Erik Rijkers"
Date:
On Sat, April 10, 2010 01:23, Erik Rijkers wrote:
> Using 9.0devel cvs HEAD, 2010.04.08.
>
> I am trying to understand the performance difference
> between primary and standby under a standard pgbench
> read-only test.
>
> server has 32 GB, 2 quadcores.
>
> primary:
>   tps = 34606.747930 (including connections establishing)
>   tps = 34527.078068 (including connections establishing)
>   tps = 34654.297319 (including connections establishing)
>
> standby:
>   tps = 700.346283 (including connections establishing)
>   tps = 717.576886 (including connections establishing)
>   tps = 740.522472 (including connections establishing)
>
> transaction type: SELECT only
> scaling factor: 1000
> query mode: simple
> number of clients: 20
> number of threads: 1
> duration: 900 s
>
> both instances have
>   max_connections = 100
>   shared_buffers = 256MB
>   checkpoint_segments = 50
>   effective_cache_size= 16GB
>
> See also:
>
> http://archives.postgresql.org/pgsql-testers/2010-04/msg00005.php
>      (differences with scale 10_000)
>

To my surprise, I have since seen the opposite behaviour, with the standby
giving fast runs and the primary slow ones.

FWIW, I've run a larger set of tests overnight (against the same 9.0devel
instances as in the earlier email).

These results are generally more balanced.

for scale in 10 100 500 1000
    for clients in 1 5 10 20
        for port in 6565 6566   --> primaryport standbyport
            for run in `seq 1 3`
                pgbench ...
                sleep ((scale / 10) * 60)
            done
        done
    done
done

(so below, alternating 3 primary, followed by 3 standby runs)

scale: 10      clients:  1  tps = 15219.019272  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10      clients:  1  tps = 15301.847615  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10      clients:  1  tps = 15238.907436  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10      clients:  1  tps = 12129.928289  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10      clients:  1  tps = 12151.711589  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10      clients:  1  tps = 12203.494512  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10      clients:  5  tps = 60248.120599  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 10      clients:  5  tps = 60827.949875  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 10      clients:  5  tps = 61167.447476  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 10      clients:  5  tps = 50750.385403  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 10      clients:  5  tps = 50600.891436  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 10      clients:  5  tps = 50486.857610  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 10      clients: 10  tps = 60307.739327  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 60264.230349  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 60146.370598  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 50455.537671  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 49877.000813  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 50097.949766  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 20  tps = 43355.220657  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 43352.725422  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 43496.085623  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 37169.126299  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 37100.260450  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 37342.758507  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100     clients:  1  tps = 12514.185089  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100     clients:  1  tps = 12542.842198  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100     clients:  1  tps = 12595.688640  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100     clients:  1  tps = 10435.681851  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100     clients:  1  tps = 10456.983353  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100     clients:  1  tps = 10434.213044  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100     clients:  5  tps = 48682.166988  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 100     clients:  5  tps = 48656.883485  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 100     clients:  5  tps = 48687.894655  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 100     clients:  5  tps = 41901.629933  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 100     clients:  5  tps = 41953.386791  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 100     clients:  5  tps = 41787.962712  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 100     clients: 10  tps = 48704.247239  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100     clients: 10  tps = 48941.190050  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100     clients: 10  tps = 48603.077936  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100     clients: 10  tps = 42948.666272  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100     clients: 10  tps = 42767.793899  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100     clients: 10  tps = 42612.670983  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100     clients: 20  tps = 36350.454258  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100     clients: 20  tps = 36373.088111  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100     clients: 20  tps = 36490.886781  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100     clients: 20  tps = 32235.811228  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100     clients: 20  tps = 32253.837906  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100     clients: 20  tps = 32144.189047  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500     clients:  1  tps = 11733.254970  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps = 11726.665739  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps = 11617.622548  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps =  9769.861175  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps =  9878.465752  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps =  9808.236216  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  5  tps = 45185.900553  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 500     clients:  5  tps = 45170.334037  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 500     clients:  5  tps = 45136.596374  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 500     clients:  5  tps = 39231.863815  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 500     clients:  5  tps = 39336.889619  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 500     clients:  5  tps = 39269.483772  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 500     clients: 10  tps = 45468.080680  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500     clients: 10  tps = 45727.159963  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500     clients: 10  tps = 45399.241367  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500     clients: 10  tps = 40759.108042  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500     clients: 10  tps = 40783.287718  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500     clients: 10  tps = 40858.007847  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500     clients: 20  tps = 34729.742313  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps = 34705.119029  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps = 34617.517224  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps = 31252.355034  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps = 31234.885791  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps = 31273.307637  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000    clients:  1  tps =   220.024691  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000    clients:  1  tps =   294.855794  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000    clients:  1  tps =   375.152757  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000    clients:  1  tps =   295.965959  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000    clients:  1  tps =  1036.517110  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000    clients:  1  tps =  9167.012603  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000    clients:  5  tps =  1241.224282  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 1000    clients:  5  tps =  1894.806301  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 1000    clients:  5  tps = 18532.885549  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 1000    clients:  5  tps =  1497.491279  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 1000    clients:  5  tps =  1480.164166  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 1000    clients:  5  tps =  3470.769236  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 1000    clients: 10  tps =  2414.552333  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000    clients: 10  tps = 19248.609443  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000    clients: 10  tps = 45059.231609  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000    clients: 10  tps =  1648.526373  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000    clients: 10  tps =  3659.800008  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000    clients: 10  tps = 35900.769857  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000    clients: 20  tps =  2462.855864  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000    clients: 20  tps = 27168.407568  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000    clients: 20  tps = 34438.802096  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000    clients: 20  tps =  2933.220489  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000    clients: 20  tps = 25586.972428  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000    clients: 20  tps = 30926.189621  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1






Re: testing HS/SR - 1 vs 2 performance

From
Jim Mlodgenski
Date:
On Mon, Apr 12, 2010 at 7:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Apr 12, 2010 at 5:06 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Sat, Apr 10, 2010 at 8:23 AM, Erik Rijkers <er@xs4all.nl> wrote:
>>> I understand that in the scale=1000 case, there is a huge
>>> cache effect, but why doesn't that apply to the pgbench runs
>>> against the standby?  (and for the scale=10_000 case the
>>> differences are still rather large)
>>
>> I guess that this performance degradation happened because a number of
>> buffer replacements caused UpdateMinRecoveryPoint() often. So I think
>> increasing shared_buffers would improve the performance significantly.
>
> I think we need to investigate this more.  It's not going to look good
> for the project if people find that a hot standby server runs two
> orders of magnitude slower than the primary.

As a data point, I did a read-only pgbench test and found that the
standby runs about 15% slower than the primary with identical hardware
and configs.
>
> ...Robert
>



--
Jim Mlodgenski
EnterpriseDB (http://www.enterprisedb.com)


Re: testing HS/SR - 1 vs 2 performance

From
"Erik Rijkers"
Date:
On Mon, April 12, 2010 14:22, Erik Rijkers wrote:
> On Sat, April 10, 2010 01:23, Erik Rijkers wrote:

Oops, typos in that pseudo loop:
of course there was a pgbench init step after that first line.

> for scale in 10 100 500 1000
>     pgbench ...   # initialise
>     sleep ((scale / 10) * 60)
>     for clients in 1 5 10 20
>         for port in 6565 6566 --> primaryport standbyport
>             for run in `seq 1 3`
>                 pgbench ...                 sleep 120
>             done
>         done
>     done
> done






Re: testing HS/SR - 1 vs 2 performance

From
Robert Haas
Date:
On Mon, Apr 12, 2010 at 8:32 AM, Jim Mlodgenski <jimmy76@gmail.com> wrote:
>> I think we need to investigate this more.  It's not going to look good
>> for the project if people find that a hot standby server runs two
>> orders of magnitude slower than the primary.
> As a data point, I did a read only pgbench test and found that the
> standby runs about 15% slower than the primary with identical hardware
> and configs.

Hmm.  That's not great, but it's a lot better than 50x.  I wonder what
was different in Erik's environment.  Does running in standby mode use
more memory, such that it might have pushed the machine over the line
into swap?

Or if it's CPU load, maybe Erik could gprof it?

...Robert


Re: testing HS/SR - 1 vs 2 performance

From
Aidan Van Dyk
Date:
* Robert Haas <robertmhaas@gmail.com> [100412 07:10]:
> I think we need to investigate this more.  It's not going to look good
> for the project if people find that a hot standby server runs two
> orders of magnitude slower than the primary.

Yes, it's not "good", but it's a known problem.  We've had people
complaining that wal-replay can't keep up with a wal stream from a heavy
server.

The master producing the WAL stream has $XXX separate read/modify
processes working over the data dir, and is bottlenecked by the
serialized WAL stream.  All the seek+read delays are parallelized and
overlapping.

But the slave (traditionally a PITR slave, now also HS/SR) has all
that read-modify-write happening in single-threaded fashion, meaning
that WAL record $X+1 waits until the buffer $X needs to modify is read
in.  All the seek+read delays are serialized.

You can optimize that by keeping more of them in buffers (shared, or OS
cache), but the WAL producer, by its very nature being a
multi-task-IO-load producing random reads/writes, is always going to go
quicker than a single-stream random-IO WAL consumer...

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: testing HS/SR - 1 vs 2 performance

From
Aidan Van Dyk
Date:
And I see now that he's doing a stream of read-only queries on a slave,
presumably with no WAL even being replayed...

Sorry for the noise....

a.

* Aidan Van Dyk <aidan@highrise.ca> [100412 09:40]:
> * Robert Haas <robertmhaas@gmail.com> [100412 07:10]:
>  
> > I think we need to investigate this more.  It's not going to look good
> > for the project if people find that a hot standby server runs two
> > orders of magnitude slower than the primary.
> 
> Yes, it's not "good", but it's a known problem.  We've had people
> complaining that wal-replay can't keep up with a wal stream from a heavy
> server.
> 
> The master producing the wal stream has $XXX seperate read/modify
> processes working over the data dir, and is bottle-necked by the
> serialized WAL stream.  All the seek+read delays are parallized and
> overlapping.
> 
> But on the slave (traditionally PITR slave, now also HS/SR), has al
> lthat read-modify-write happening in a single thread fasion, meaning
> that WAL record $X+1 waits until the buffer $X needs to modify is read
> in.  All the seek+read delays are serialized.
> 
> You can optimize that by keepdng more of them in buffers (shared, or OS
> cache), but the WAL producer, by it's very nature being a
> multi-task-io-load producing random read/write is always going to go
> quicker than single-stream random-io WAL consumer...
> 
> a.
> 
> -- 
> Aidan Van Dyk                                             Create like a god,
> aidan@highrise.ca                                       command like a king,
> http://www.highrise.ca/                                   work like a slave.



-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: testing HS/SR - 1 vs 2 performance

From
"Kevin Grittner"
Date:
> Aidan Van Dyk <aidan@highrise.ca> wrote:
> We've had people complaining that wal-replay can't keep up with a
> wal stream from a heavy server.

I thought this thread was about the slow performance running a mix
of read-only queries on the slave versus the master, which doesn't
seem to have anything to do with the old issue you're describing.

-Kevin


Re: testing HS/SR - 1 vs 2 performance

From
"Erik Rijkers"
Date:
resending this message, as it seems to have bounced.

(below, I did fix the typo in the pseudocode loop)

---------------------------------------- Original Message ----------------------------------------
Subject: Re: [HACKERS] testing HS/SR - 1 vs 2 performance
From:    "Erik Rijkers" <er@xs4all.nl>
Date:    Mon, April 12, 2010 14:22
To:      pgsql-hackers@postgresql.org
--------------------------------------------------------------------------------------------------

On Sat, April 10, 2010 01:23, Erik Rijkers wrote:
> Using 9.0devel cvs HEAD, 2010.04.08.
>
> I am trying to understand the performance difference
> between primary and standby under a standard pgbench
> read-only test.
>
> server has 32 GB, 2 quadcores.
>
> primary:
>   tps = 34606.747930 (including connections establishing)
>   tps = 34527.078068 (including connections establishing)
>   tps = 34654.297319 (including connections establishing)
>
> standby:
>   tps = 700.346283 (including connections establishing)
>   tps = 717.576886 (including connections establishing)
>   tps = 740.522472 (including connections establishing)
>
> transaction type: SELECT only
> scaling factor: 1000
> query mode: simple
> number of clients: 20
> number of threads: 1
> duration: 900 s
>
> both instances have
>   max_connections = 100
>   shared_buffers = 256MB
>   checkpoint_segments = 50
>   effective_cache_size= 16GB
>
> See also:
>
> http://archives.postgresql.org/pgsql-testers/2010-04/msg00005.php
>      (differences with scale 10_000)
>

To my surprise, I have since seen the opposite behaviour, with the standby
giving fast runs and the primary slow ones.

FWIW, I've run a larger set of tests overnight (against the same 9.0devel
instances as in the earlier email).

These results are generally more balanced.

for scale in 10 100 500 1000
    pgbench ...   # initialise
    sleep ((scale / 10) * 60)
    for clients in 1 5 10 20
        for port in 6565 6566   --> primaryport standbyport
            for run in `seq 1 3`
                pgbench ...
                sleep ((scale / 10) * 60)
            done
        done
    done
done

(so below, alternating 3 primary, followed by 3 standby runs)

scale: 10      clients:  1  tps = 15219.019272  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10      clients:  1  tps = 15301.847615  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10      clients:  1  tps = 15238.907436  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10      clients:  1  tps = 12129.928289  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10      clients:  1  tps = 12151.711589  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10      clients:  1  tps = 12203.494512  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10      clients:  5  tps = 60248.120599  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 10      clients:  5  tps = 60827.949875  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 10      clients:  5  tps = 61167.447476  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 10      clients:  5  tps = 50750.385403  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 10      clients:  5  tps = 50600.891436  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 10      clients:  5  tps = 50486.857610  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 10      clients: 10  tps = 60307.739327  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 60264.230349  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 60146.370598  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 50455.537671  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 49877.000813  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 50097.949766  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 20  tps = 43355.220657  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 43352.725422  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 43496.085623  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 37169.126299  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 37100.260450  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 37342.758507  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100     clients:  1  tps = 12514.185089  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100     clients:  1  tps = 12542.842198  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100     clients:  1  tps = 12595.688640  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100     clients:  1  tps = 10435.681851  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100     clients:  1  tps = 10456.983353  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100     clients:  1  tps = 10434.213044  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100     clients:  5  tps = 48682.166988  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 100     clients:  5  tps = 48656.883485  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 100     clients:  5  tps = 48687.894655  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 100     clients:  5  tps = 41901.629933  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 100     clients:  5  tps = 41953.386791  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 100     clients:  5  tps = 41787.962712  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 100     clients: 10  tps = 48704.247239  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100     clients: 10  tps = 48941.190050  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100     clients: 10  tps = 48603.077936  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100     clients: 10  tps = 42948.666272  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100     clients: 10  tps = 42767.793899  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100     clients: 10  tps = 42612.670983  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100     clients: 20  tps = 36350.454258  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100     clients: 20  tps = 36373.088111  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100     clients: 20  tps = 36490.886781  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100     clients: 20  tps = 32235.811228  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100     clients: 20  tps = 32253.837906  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100     clients: 20  tps = 32144.189047  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500     clients:  1  tps = 11733.254970  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps = 11726.665739  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps = 11617.622548  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps =  9769.861175  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps =  9878.465752  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps =  9808.236216  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  5  tps = 45185.900553  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 500     clients:  5  tps = 45170.334037  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 500     clients:  5  tps = 45136.596374  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 500     clients:  5  tps = 39231.863815  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 500     clients:  5  tps = 39336.889619  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 500     clients:  5  tps = 39269.483772  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 500     clients: 10  tps = 45468.080680  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500     clients: 10  tps = 45727.159963  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500     clients: 10  tps = 45399.241367  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500     clients: 10  tps = 40759.108042  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500     clients: 10  tps = 40783.287718  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500     clients: 10  tps = 40858.007847  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500     clients: 20  tps = 34729.742313  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps = 34705.119029  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps = 34617.517224  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps = 31252.355034  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps = 31234.885791  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps = 31273.307637  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000    clients:  1  tps =   220.024691  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000    clients:  1  tps =   294.855794  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000    clients:  1  tps =   375.152757  pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000    clients:  1  tps =   295.965959  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000    clients:  1  tps =  1036.517110  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000    clients:  1  tps =  9167.012603  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000    clients:  5  tps =  1241.224282  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 1000    clients:  5  tps =  1894.806301  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 1000    clients:  5  tps = 18532.885549  pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 1000    clients:  5  tps =  1497.491279  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 1000    clients:  5  tps =  1480.164166  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 1000    clients:  5  tps =  3470.769236  pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 1000    clients: 10  tps =  2414.552333  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000    clients: 10  tps = 19248.609443  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000    clients: 10  tps = 45059.231609  pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000    clients: 10  tps =  1648.526373  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000    clients: 10  tps =  3659.800008  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000    clients: 10  tps = 35900.769857  pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000    clients: 20  tps =  2462.855864  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000    clients: 20  tps = 27168.407568  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000    clients: 20  tps = 34438.802096  pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000    clients: 20  tps =  2933.220489  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000    clients: 20  tps = 25586.972428  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000    clients: 20  tps = 30926.189621  pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1







Re: testing HS/SR - 1 vs 2 performance

From
Heikki Linnakangas
Date:
I could reproduce this on my laptop, standby is about 20% slower. I ran
oprofile, and what stands out as the difference between the master and
standby is that on standby about 20% of the CPU time is spent in
hash_seq_search(). The callpath is GetSnapshotData() ->
KnownAssignedXidsGetAndSetXmin() -> hash_seq_search(). That explains the
difference in performance.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: testing HS/SR - 1 vs 2 performance

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> I could reproduce this on my laptop, standby is about 20% slower. I ran
> oprofile, and what stands out as the difference between the master and
> standby is that on standby about 20% of the CPU time is spent in
> hash_seq_search(). The callpath is GetSnapshotData() ->
> KnownAssignedXidsGetAndSetXmin() -> hash_seq_search(). That explains the
> difference in performance.

The slowdown is proportional to the max_connections setting in the
standby. A 20% slowdown might still be acceptable, but if you increase
max_connections to, say, 1000, things get really slow. I wouldn't
recommend max_connections=1000, of course, but I think we need to do
something about this: change the KnownAssignedXids data structure from
a hash table into something that's quicker to scan. Preferably something
that's O(N), where N is the number of entries in the data structure, not
the maximum number of entries it can hold, as it is with the hash table
currently.

A quick fix would be to check if there's any entries in the hash table
before scanning it. That would eliminate the overhead when there's no
in-progress transactions in the master. But as soon as there's even one,
the overhead comes back.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
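
To put a rough number on that proportionality, here is a small standalone
comparison (my own toy code, not PostgreSQL's hash implementation): it collects
a handful of live xids by walking every slot of a fixed-capacity table, sized
at roughly 65 slots per allowed connection as I read TOTAL_MAX_CACHED_SUBXIDS,
versus walking a dense array that holds only the live entries.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef unsigned int TransactionId;

/* Walk every slot of an array, counting the non-empty ones (zero = empty). */
static double
scan_all_slots(const TransactionId *slots, int capacity, int rounds)
{
	clock_t		start = clock();
	volatile long found = 0;	/* volatile so the loop isn't optimized away */

	for (int r = 0; r < rounds; r++)
		for (int i = 0; i < capacity; i++)
			if (slots[i] != 0)
				found++;
	return (double) (clock() - start) / CLOCKS_PER_SEC;
}

int
main(void)
{
	int			live = 10;					/* xids actually in progress on the master */
	int			rounds = 20000;				/* number of "snapshots" taken */
	int			conn_settings[] = {100, 1000};	/* max_connections on the standby */

	for (int c = 0; c < 2; c++)
	{
		int			capacity = 65 * conn_settings[c];	/* rough TOTAL_MAX_CACHED_SUBXIDS */
		TransactionId *table = calloc(capacity, sizeof(TransactionId));
		TransactionId dense[10];

		for (int i = 0; i < live; i++)
		{
			table[rand() % capacity] = 1000 + i;	/* scatter the live xids */
			dense[i] = 1000 + i;
		}

		printf("max_connections=%d: scan %d slots: %.3fs, scan %d live xids: %.3fs\n",
			   conn_settings[c], capacity,
			   scan_all_slots(table, capacity, rounds),
			   scan_all_slots(dense, live, rounds));
		free(table);
	}
	return 0;
}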


Re: testing HS/SR - 1 vs 2 performance

From
Dimitri Fontaine
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>  Changing the KnownAssignedXids data structure from
> hash table into something that's quicker to scan. Preferably something
> with O(N), where N is the number of entries in the data structure, not
> the maximum number of entries it can hold as it is with the hash table
> currently.

So it's pretty good news that Red-Black trees made it into 9.0, isn't it? :)

> A quick fix would be to check if there's any entries in the hash table
> before scanning it. That would eliminate the overhead when there's no
> in-progress transactions in the master. But as soon as there's even one,
> the overhead comes back.

That does not sound like the typical case, does it?
-- 
dim


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Tue, 2010-04-13 at 21:09 +0300, Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
> > I could reproduce this on my laptop, standby is about 20% slower. I ran
> > oprofile, and what stands out as the difference between the master and
> > standby is that on standby about 20% of the CPU time is spent in
> > hash_seq_search(). The callpath is GetSnapshotData() ->
> > KnownAssignedXidsGetAndSetXmin() -> hash_seq_search(). That explains the
> > difference in performance.
> 
> The slowdown is proportional to the max_connections setting in the
> standby. 20% slowdown might still be acceptable, but if you increase
> max_connections to say 1000, things get really slow. I wouldn't
> recommend max_connections=1000, of course, but I think we need to do
> something about this. Changing the KnownAssignedXids data structure from
> hash table into something that's quicker to scan. Preferably something
> with O(N), where N is the number of entries in the data structure, not
> the maximum number of entries it can hold as it is with the hash table
> currently.

There's a tradeoff here to consider. KnownAssignedXids faces two
workloads: one for each WAL record where we check if the xid is already
known assigned, one for snapshots. The current implementation is
optimised towards recovery performance, not snapshot performance.

> A quick fix would be to check if there's any entries in the hash table
> before scanning it. That would eliminate the overhead when there's no
> in-progress transactions in the master. But as soon as there's even one,
> the overhead comes back.

Any fix should be fairly quick because of the way it's modularised - with
something like this in mind.

I'll try a circular buffer implementation, with fastpath.

Have something in a few hours.

-- Simon Riggs           www.2ndQuadrant.com
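
For illustration only, a minimal sketch of what a "circular buffer
implementation, with fastpath" could look like; the kax_* names, the fixed
size and the tail-pruning removal are my own guesses, not Simon's actual
patch, and the plain < comparison ignores xid wraparound (discussed later in
the thread).

#include <stdio.h>
#include <stdbool.h>

typedef unsigned int TransactionId;

#define KAX_SIZE 8192				/* fixed capacity; must be a power of two here */

static TransactionId kax[KAX_SIZE];
static unsigned int kax_tail = 0;	/* index of the oldest tracked xid */
static unsigned int kax_head = 0;	/* one past the newest tracked xid */

/* Startup process: xids arrive in ascending order, so append at the head. */
static bool
kax_add(TransactionId xid)
{
	if (kax_head - kax_tail == KAX_SIZE)
		return false;				/* full; the real code would have to error out */
	kax[kax_head++ % KAX_SIZE] = xid;
	return true;
}

/* Backends taking a snapshot: copy everything that is still tracked. */
static int
kax_snapshot(TransactionId *xarray)
{
	int			count = 0;

	if (kax_head == kax_tail)
		return 0;					/* fastpath: no in-progress xids at all */

	for (unsigned int i = kax_tail; i != kax_head; i++)
		xarray[count++] = kax[i % KAX_SIZE];
	return count;
}

/* Startup process: drop everything older than the cutoff as transactions end.
 * Plain < is used for brevity; real TransactionIds wrap around. */
static void
kax_prune(TransactionId cutoff)
{
	while (kax_tail != kax_head && kax[kax_tail % KAX_SIZE] < cutoff)
		kax_tail++;
}

int
main(void)
{
	TransactionId snap[KAX_SIZE];

	for (TransactionId xid = 100; xid < 110; xid++)
		kax_add(xid);
	kax_prune(105);
	printf("xids copied into snapshot: %d\n", kax_snapshot(snap));	/* expect 5 */
	return 0;
}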



Re: testing HS/SR - 1 vs 2 performance

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Tue, 2010-04-13 at 21:09 +0300, Heikki Linnakangas wrote:
>> A quick fix would be to check if there's any entries in the hash table
>> before scanning it. That would eliminate the overhead when there's no
>> in-progress transactions in the master. But as soon as there's even one,
>> the overhead comes back.
>
> Any fix should be fairly quick because of the way its modularised - with
> something like this in mind.
>
> I'll try a circular buffer implementation, with fastpath.

I started experimenting with a sorted array based implementation on
Tuesday but got carried away with other stuff. I now got back to that
and cleaned it up.

How does the attached patch look? It's probably similar to what you
had in mind.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 88169b4..4a04051 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -64,8 +64,10 @@ typedef struct ProcArrayStruct
     int            numProcs;        /* number of valid procs entries */
     int            maxProcs;        /* allocated size of procs array */

-    int            numKnownAssignedXids;    /* current number of known assigned
-                                         * xids */
+    int            firstKnownAssignedXid;    /* location of first valid known
+                                         * assigned xid in the array */
+    int            lastKnownAssignedXid;    /* location of last valid known
+                                         * assigned xid in the array */
     int            maxKnownAssignedXids;    /* allocated size of known assigned
                                          * xids */

@@ -87,7 +89,8 @@ static ProcArrayStruct *procArray;
 /*
  * Bookkeeping for tracking emulated transactions in recovery
  */
-static HTAB *KnownAssignedXidsHash;
+static TransactionId *KnownAssignedXidsArray;
+static bool *KnownAssignedXidsValidArray;
 static TransactionId latestObservedXid = InvalidTransactionId;

 /*
@@ -201,28 +204,33 @@ CreateSharedProcArray(void)
         /* Normal processing */
         procArray->numProcs = 0;
         procArray->maxProcs = PROCARRAY_MAXPROCS;
-        procArray->numKnownAssignedXids = 0;
         procArray->maxKnownAssignedXids = TOTAL_MAX_CACHED_SUBXIDS;
+        procArray->firstKnownAssignedXid = 0;
+        procArray->lastKnownAssignedXid = 0;
         procArray->lastOverflowedXid = InvalidTransactionId;
     }

     if (XLogRequestRecoveryConnections)
     {
-        /* Create or attach to the KnownAssignedXids hash table */
-        HASHCTL        info;
-
-        MemSet(&info, 0, sizeof(info));
-        info.keysize = sizeof(TransactionId);
-        info.entrysize = sizeof(TransactionId);
-        info.hash = tag_hash;
-
-        KnownAssignedXidsHash = ShmemInitHash("KnownAssignedXids Hash",
-                                              TOTAL_MAX_CACHED_SUBXIDS,
-                                              TOTAL_MAX_CACHED_SUBXIDS,
-                                              &info,
-                                              HASH_ELEM | HASH_FUNCTION);
-        if (!KnownAssignedXidsHash)
-            elog(FATAL, "could not initialize known assigned xids hash table");
+        Size size;
+        /* Create or attach to the KnownAssignedXids arrays */
+        size = mul_size(sizeof(TransactionId), TOTAL_MAX_CACHED_SUBXIDS);
+        KnownAssignedXidsArray = ShmemInitStruct("KnownAssignedXidsArray",
+                                                 size,
+                                                 &found);
+        if (!KnownAssignedXidsArray)
+            elog(FATAL, "could not initialize known assigned xids array");
+        if (!found)
+            MemSet(KnownAssignedXidsArray, 0, size);
+
+        size = mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS);
+        KnownAssignedXidsValidArray = ShmemInitStruct("KnownAssignedXidsValidArray",
+                                                      size,
+                                                      &found);
+        if (!KnownAssignedXidsValidArray)
+            elog(FATAL, "could not initialize known assigned xids used array");
+        if (!found)
+            MemSet(KnownAssignedXidsValidArray, 0, size);
     }
 }

@@ -2162,7 +2170,7 @@ DisplayXidCache(void)
  *
  * During recovery we do not fret too much about the distinction between
  * top-level xids and subtransaction xids. We hold both together in
- * a hash table called KnownAssignedXids. In backends, this is copied into
+ * an array called KnownAssignedXids. In backends, this is copied into
  * snapshots in GetSnapshotData(), taking advantage
  * of the fact that XidInMVCCSnapshot() doesn't care about the distinction
  * either. Subtransaction xids are effectively treated as top-level xids
@@ -2207,7 +2215,7 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
      * We can see WAL records before the running-xacts snapshot that contain
      * XIDs that are not in the running-xacts snapshot, but that we know to
      * have finished before the running-xacts snapshot was taken. Don't waste
-     * precious shared memory by keeping them in the hash table.
+     * precious shared memory by keeping them in the array.
      *
      * We can also see WAL records before the running-xacts snapshot that
      * contain XIDs that are not in the running-xacts snapshot for a different
@@ -2330,24 +2338,30 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
 /*
  * Private module functions to manipulate KnownAssignedXids
  *
- * There are 3 main users of the KnownAssignedXids data structure:
+ * We use a fixed-size sorted array in shared memory to keep track of XIDs
+ * that we consider as in-progress in the master at this time. This data
+ * structure is called KnownAssignedXids, and it has 4 main users:
  *
- *     * backends taking snapshots
+ *     * backends taking snapshots, copying all XIDs in the array
+ *     * backends checking if an XID exists in the array
  *     * startup process adding new knownassigned xids
  *     * startup process removing xids as transactions end
  *
- * If we make KnownAssignedXids a simple sorted array then the first two
- * operations are fast, but the last one is at least O(N). If we make
- * KnownAssignedXids a hash table then the last two operations are fast,
- * though we have to do more work at snapshot time. Doing more work at
- * commit could slow down taking snapshots anyway because of lwlock
- * contention. Scanning the hash table is O(N) on the max size of the array,
- * so performs poorly in comparison when we have very low numbers of
- * write transactions to process. But at least it is constant overhead
- * and a sequential memory scan will utilise hardware memory readahead
- * to give much improved performance. In any case the emphasis must be on
- * having the standby process changes quickly so that it can provide
- * high availability. So we choose to implement as a hash table.
+ * Typically, there is only a few entries in the array, compared to the
+ * maximum size, so we keep track of the first and last valid entry in the
+ * array to make scanning it quick. That also makes it quick to add entries
+ * to the end. XIDs are assigned in monotonically increasing order, so new
+ * entries always go to the end.
+ *
+ * When an entry is removed, it's merely marked as invalid, but left in
+ * place in the array. There is a second array of booleans,
+ * KnownAssignedXidsValidArray, which keeps track of which entries in the
+ * KnownAssignedXidsArray are valid.
+ *
+ * Because we leave the entry in place when an XID is marked as removed, the
+ * array will eventually fill up. When an entry needs to be added and there
+ * is no room for it, the array is compacted by copying all valid entries to
+ * the beginning of the array, removing all invalid entries.
  */

 /*
@@ -2358,41 +2372,49 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
 static void
 KnownAssignedXidsAdd(TransactionId *xids, int nxids)
 {
-    TransactionId *result;
-    bool        found;
     int            i;

     for (i = 0; i < nxids; i++)
     {
-        Assert(TransactionIdIsValid(xids[i]));
+        TransactionId xid = xids[i];

-        elog(trace_recovery(DEBUG4), "adding KnownAssignedXid %u", xids[i]);
+        Assert(TransactionIdIsValid(xid));

-        procArray->numKnownAssignedXids++;
-        if (procArray->numKnownAssignedXids > procArray->maxKnownAssignedXids)
-        {
-            KnownAssignedXidsDisplay(LOG);
-            LWLockRelease(ProcArrayLock);
-            elog(ERROR, "too many KnownAssignedXids (%u)", procArray->maxKnownAssignedXids);
-        }
+        elog(trace_recovery(DEBUG4), "adding KnownAssignedXid %u", xid);

-        result = (TransactionId *) hash_search(KnownAssignedXidsHash, &xids[i], HASH_ENTER,
-                                               &found);
+        Assert(procArray->lastKnownAssignedXid == 0 ||
+               TransactionIdFollows(xid, KnownAssignedXidsArray[procArray->lastKnownAssignedXid - 1]));

-        if (!result)
+        /* Compact if necessary */
+        if (procArray->lastKnownAssignedXid == procArray->maxKnownAssignedXids)
         {
-            LWLockRelease(ProcArrayLock);
-            ereport(ERROR,
-                    (errcode(ERRCODE_OUT_OF_MEMORY),
-                     errmsg("out of shared memory")));
-        }
+            int j;
+            int k;

-        if (found)
-        {
-            KnownAssignedXidsDisplay(LOG);
-            LWLockRelease(ProcArrayLock);
-            elog(ERROR, "found duplicate KnownAssignedXid %u", xids[i]);
+            k = 0;
+            for (j = procArray->firstKnownAssignedXid; j < procArray->lastKnownAssignedXid; j++)
+            {
+                if (KnownAssignedXidsValidArray[j])
+                {
+                    KnownAssignedXidsArray[k] = KnownAssignedXidsArray[j];
+                    KnownAssignedXidsValidArray[k] = true;
+                    k++;
+                }
+            }
+            procArray->firstKnownAssignedXid = 0;
+            procArray->lastKnownAssignedXid = k;
+
+            if (procArray->lastKnownAssignedXid == procArray->maxKnownAssignedXids)
+            {
+                KnownAssignedXidsDisplay(LOG);
+                LWLockRelease(ProcArrayLock);
+                elog(ERROR, "too many KnownAssignedXids (%u)", procArray->maxKnownAssignedXids);
+            }
         }
+
+        KnownAssignedXidsArray[procArray->lastKnownAssignedXid] = xid;
+        KnownAssignedXidsValidArray[procArray->lastKnownAssignedXid] = true;
+        procArray->lastKnownAssignedXid++;
     }
 }

@@ -2404,10 +2426,21 @@ KnownAssignedXidsAdd(TransactionId *xids, int nxids)
 static bool
 KnownAssignedXidsExist(TransactionId xid)
 {
-    bool        found;
+    int low = procArray->firstKnownAssignedXid;
+    int high = procArray->lastKnownAssignedXid - 1;

-    (void) hash_search(KnownAssignedXidsHash, &xid, HASH_FIND, &found);
-    return found;
+    while (low <= high)
+    {
+        int middle = low + (high - low) / 2;
+
+        if (xid == KnownAssignedXidsArray[middle])
+            return true;
+        else if (xid > KnownAssignedXidsArray[middle])
+            low = middle + 1;
+        else
+            high = middle - 1;
+    }
+    return false;
 }

 /*
@@ -2418,17 +2451,37 @@ KnownAssignedXidsExist(TransactionId xid)
 static void
 KnownAssignedXidsRemove(TransactionId xid)
 {
-    bool        found;
+    int low = procArray->firstKnownAssignedXid;
+    int high = procArray->lastKnownAssignedXid - 1;

     Assert(TransactionIdIsValid(xid));

     elog(trace_recovery(DEBUG4), "remove KnownAssignedXid %u", xid);

-    (void) hash_search(KnownAssignedXidsHash, &xid, HASH_REMOVE, &found);
-
-    if (found)
-        procArray->numKnownAssignedXids--;
-    Assert(procArray->numKnownAssignedXids >= 0);
+    while (low <= high)
+    {
+        int middle = low + (high - low) / 2;
+
+        if (xid == KnownAssignedXidsArray[middle])
+        {
+            KnownAssignedXidsValidArray[middle] = false;
+            if (middle == procArray->lastKnownAssignedXid - 1)
+            {
+                while (procArray->lastKnownAssignedXid >= 0 && !KnownAssignedXidsValidArray[procArray->lastKnownAssignedXid - 1])
+                    procArray->lastKnownAssignedXid--;
+            }
+            if (middle == procArray->firstKnownAssignedXid)
+            {
+                while (procArray->firstKnownAssignedXid < procArray->lastKnownAssignedXid && !KnownAssignedXidsValidArray[procArray->firstKnownAssignedXid])
+                    procArray->firstKnownAssignedXid++;
+            }
+            return;
+        }
+        else if (xid > KnownAssignedXidsArray[middle])
+            low = middle + 1;
+        else
+            high = middle - 1;
+    }

     /*
      * We can fail to find an xid if the xid came from a subtransaction that
@@ -2466,26 +2519,30 @@ static int
 KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
                                TransactionId xmax)
 {
-    HASH_SEQ_STATUS status;
-    TransactionId *knownXid;
     int            count = 0;
+    int            i;

-    hash_seq_init(&status, KnownAssignedXidsHash);
-    while ((knownXid = (TransactionId *) hash_seq_search(&status)) != NULL)
+    for (i = procArray->firstKnownAssignedXid; i < procArray->lastKnownAssignedXid; i++)
     {
+        TransactionId xid = KnownAssignedXidsArray[i];
+
+        if (!KnownAssignedXidsValidArray[i])
+            continue;
+
         /*
-         * Filter out anything higher than xmax
+         * Filter out anything higher than xmax. The array is sorted so
+         * we can stop as soon as we find one that's too big
          */
-        if (TransactionIdPrecedes(xmax, *knownXid))
-            continue;
+        if (TransactionIdPrecedes(xmax, xid))
+            break;

-        *xarray = *knownXid;
+        *xarray = xid;
         xarray++;
         count++;

         /* update xmin if required */
-        if (TransactionIdPrecedes(*knownXid, *xmin))
-            *xmin = *knownXid;
+        if (TransactionIdPrecedes(xid, *xmin))
+            *xmin = xid;
     }

     return count;
@@ -2500,34 +2557,34 @@ KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
 static void
 KnownAssignedXidsRemoveMany(TransactionId xid, bool keepPreparedXacts)
 {
-    TransactionId *knownXid;
-    HASH_SEQ_STATUS status;
+    int i;

-    if (TransactionIdIsValid(xid))
-        elog(trace_recovery(DEBUG4), "prune KnownAssignedXids to %u", xid);
-    else
+    if (!TransactionIdIsValid(xid))
+    {
         elog(trace_recovery(DEBUG4), "removing all KnownAssignedXids");
+        procArray->firstKnownAssignedXid = procArray->lastKnownAssignedXid = 0;
+        return;
+    }
+    elog(trace_recovery(DEBUG4), "prune KnownAssignedXids to %u", xid);

-    hash_seq_init(&status, KnownAssignedXidsHash);
-    while ((knownXid = (TransactionId *) hash_seq_search(&status)) != NULL)
+    for (i = procArray->firstKnownAssignedXid; i < procArray->lastKnownAssignedXid; i++)
     {
-        TransactionId removeXid = *knownXid;
-        bool        found;
+        TransactionId removeXid = KnownAssignedXidsArray[i];
+        if (!KnownAssignedXidsValidArray[i])
+            continue;

         if (!TransactionIdIsValid(xid) || TransactionIdPrecedes(removeXid, xid))
         {
             if (keepPreparedXacts && StandbyTransactionIdIsPrepared(removeXid))
                 continue;
-            else
-            {
-                (void) hash_search(KnownAssignedXidsHash, &removeXid,
-                                   HASH_REMOVE, &found);
-                if (found)
-                    procArray->numKnownAssignedXids--;
-                Assert(procArray->numKnownAssignedXids >= 0);
-            }
+            KnownAssignedXidsValidArray[i] = false;
         }
     }
+    while (procArray->lastKnownAssignedXid >= 0 && !KnownAssignedXidsValidArray[procArray->lastKnownAssignedXid - 1])
+        procArray->lastKnownAssignedXid--;
+
+    while (procArray->firstKnownAssignedXid < procArray->lastKnownAssignedXid && !KnownAssignedXidsValidArray[procArray->firstKnownAssignedXid])
+        procArray->firstKnownAssignedXid++;
 }

 /*
@@ -2538,26 +2595,21 @@ KnownAssignedXidsRemoveMany(TransactionId xid, bool keepPreparedXacts)
 static void
 KnownAssignedXidsDisplay(int trace_level)
 {
-    HASH_SEQ_STATUS status;
-    TransactionId *knownXid;
     StringInfoData buf;
-    TransactionId *xids;
     int            nxids;
     int            i;

-    xids = palloc(sizeof(TransactionId) * TOTAL_MAX_CACHED_SUBXIDS);
-    nxids = 0;
-
-    hash_seq_init(&status, KnownAssignedXidsHash);
-    while ((knownXid = (TransactionId *) hash_seq_search(&status)) != NULL)
-        xids[nxids++] = *knownXid;
-
-    qsort(xids, nxids, sizeof(TransactionId), xidComparator);
-
     initStringInfo(&buf);

-    for (i = 0; i < nxids; i++)
-        appendStringInfo(&buf, "%u ", xids[i]);
+    nxids = 0;
+    for (i = procArray->firstKnownAssignedXid; i < procArray->lastKnownAssignedXid; i++)
+    {
+        if (KnownAssignedXidsValidArray[i])
+        {
+            appendStringInfo(&buf, "%u ", KnownAssignedXidsArray[i]);
+            nxids++;
+        }
+    }

     elog(trace_level, "%d KnownAssignedXids %s", nxids, buf.data);

diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index c5c1228..eee0d58 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -203,7 +203,7 @@
  * Enable debugging print statements for WAL-related operations; see
  * also the wal_debug GUC var.
  */
-/* #define WAL_DEBUG */
+#define WAL_DEBUG

 /*
  * Enable tracing of resource consumption during sort operations;
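
For readers who want to experiment with the idea outside the tree, here is a
small self-contained sketch of the data structure the patch above implements:
a sorted xid array with a parallel valid[] array, lazy removal, compaction
when the array fills up, and binary-search lookup. It is a simplification with
my own names and toy sizes, not the patch itself, and it uses plain unsigned
comparison; the xid wraparound caveat raised later in the thread applies here
too.

#include <stdio.h>
#include <stdbool.h>
#include <string.h>

typedef unsigned int TransactionId;

#define MAX_XIDS 16						/* toy capacity; the real array is much larger */

static TransactionId xids[MAX_XIDS];	/* sorted, possibly with invalid holes */
static bool valid[MAX_XIDS];
static int first = 0;					/* first potentially valid slot */
static int last = 0;					/* one past the last used slot */

/* Copy the valid entries to the front, dropping the invalid holes. */
static void
compact(void)
{
	int			k = 0;

	for (int i = first; i < last; i++)
		if (valid[i])
		{
			xids[k] = xids[i];
			valid[k] = true;
			k++;
		}
	memset(valid + k, 0, sizeof(bool) * (MAX_XIDS - k));
	first = 0;
	last = k;
}

/* Startup process: xids arrive in increasing order, so appending keeps order. */
static bool
add_xid(TransactionId xid)
{
	if (last == MAX_XIDS)
		compact();
	if (last == MAX_XIDS)
		return false;					/* genuinely full */
	xids[last] = xid;
	valid[last] = true;
	last++;
	return true;
}

/* Startup process: a transaction ended; find its slot and mark it invalid. */
static void
remove_xid(TransactionId xid)
{
	int			low = first,
				high = last - 1;

	while (low <= high)
	{
		int			mid = low + (high - low) / 2;

		if (xids[mid] == xid)
		{
			valid[mid] = false;
			return;
		}
		else if (xid > xids[mid])		/* plain >; real xids wrap around */
			low = mid + 1;
		else
			high = mid - 1;
	}
}

/* Backends taking a snapshot: scan only first..last, skipping the holes. */
static int
snapshot(TransactionId *out)
{
	int			count = 0;

	for (int i = first; i < last; i++)
		if (valid[i])
			out[count++] = xids[i];
	return count;
}

int
main(void)
{
	TransactionId snap[MAX_XIDS];

	for (TransactionId xid = 1; xid <= 20; xid++)	/* 20 adds force a compaction */
	{
		add_xid(xid);
		if (xid % 2 == 0)
			remove_xid(xid - 1);		/* odd xids finish quickly */
	}
	printf("xids still in progress: %d\n", snapshot(snap));	/* expect 10 */
	return 0;
}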

Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Fri, 2010-04-16 at 11:29 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Tue, 2010-04-13 at 21:09 +0300, Heikki Linnakangas wrote:
> >> A quick fix would be to check if there's any entries in the hash table
> >> before scanning it. That would eliminate the overhead when there's no
> >> in-progress transactions in the master. But as soon as there's even one,
> >> the overhead comes back.
> >
> > Any fix should be fairly quick because of the way its modularised - with
> > something like this in mind.
> >
> > I'll try a circular buffer implementation, with fastpath.
>
> I started experimenting with a sorted array based implementation on
> Tuesday but got carried away with other stuff. I now got back to that
> and cleaned it up.
>
> How does the attached patch look like? It's probably similar to what you
> had in mind.

It looks like a second version of what I'm working on and about to
publish. I'll take that as a compliment!

My patch is attached here also, for discussion.

The two patches look the same in their main parts, though I have quite a few
extra tweaks in there, which you can read about in comments. One tweak I
don't have is the use of the presence array that allows a sensible
bsearch, so I'll alter my patch to use that idea but keep the rest of
my code.

--
 Simon Riggs           www.2ndQuadrant.com

Attachment

Re: testing HS/SR - 1 vs 2 performance

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Fri, 2010-04-16 at 11:29 +0300, Heikki Linnakangas wrote:
>> How does the attached patch look like? It's probably similar to what you
>> had in mind.
> 
> It looks like a second version of what I'm working on and about to
> publish. I'll take that as a compliment!
> 
> My patch is attached here also, for discussion.
> 
> The two patches look same in their main parts, though I have quite a few
> extra tweaks in there, which you can read about in comments.

Yeah. Yours seems a lot more complex with all those extra tweaks; I
would suggest keeping it simple. I did realize one bug in my patch: I
didn't handle xid wraparound correctly in the binary search, and need to use
TransactionIdFollows instead of plain >.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Fri, 2010-04-16 at 14:47 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Fri, 2010-04-16 at 11:29 +0300, Heikki Linnakangas wrote:
> >> How does the attached patch look like? It's probably similar to what you
> >> had in mind.
> > 
> > It looks like a second version of what I'm working on and about to
> > publish. I'll take that as a compliment!
> > 
> > My patch is attached here also, for discussion.
> > 
> > The two patches look same in their main parts, though I have quite a few
> > extra tweaks in there, which you can read about in comments.
> 
> Yeah. Yours seems a lot more complex with all those extra tweaks, I
> would suggest to keep it simple. I did realize one bug in my patch: I
> didn't handle xid wraparound correctly in the binary search, need to use
> TransactionIdFollows instead of plan >.

Almost done, yes, much simpler. I wrote a lot of that in the wee small
hours last night, so the difference is amusing.

And I spotted that bug, plus the off-by-one error also. I've just rewritten
all the other parts, so no worries.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> I didn't handle xid wraparound correctly in the binary search, need to use
> TransactionIdFollows instead of plan >.

I think you're outsmarting yourself there.  A binary search will in fact
*not work* with circular xid comparison (this is exactly why there's no
btree opclass for XID).  You need to use plain >, and make sure the
array you're searching is ordered that way too.  The other way might
accidentally fail to malfunction if you only tested ranges of XIDs that
weren't long enough to wrap around, but that doesn't make it right.
        regards, tom lane


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Fri, 2010-04-16 at 10:39 -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> > I didn't handle xid wraparound correctly in the binary search, need to use
> > TransactionIdFollows instead of plan >.
> 
> I think you're outsmarting yourself there.  A binary search will in fact
> *not work* with circular xid comparison (this is exactly why there's no
> btree opclass for XID).  You need to use plain >, and make sure the
> array you're searching is ordered that way too.  The other way might
> accidentally fail to malfunction if you only tested ranges of XIDs that
> weren't long enough to wrap around, but that doesn't make it right.

I don't understand the exact problem; please explain more.

I'm not using bsearch(), just a quick chop based upon xid comparison,
which looks to me like it will work.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Fri, 2010-04-16 at 10:39 -0400, Tom Lane wrote:
>> I think you're outsmarting yourself there.  A binary search will in fact
>> *not work* with circular xid comparison (this is exactly why there's no
>> btree opclass for XID).

> I don't understand the exact, please explain more.
> I'm not using bsearch() just a quick chop based upon xid comparison,
> which looks to me like it will work.

Implementing it yourself doesn't get you out of the fact that it won't
work.  Consider
1 2 3 ... 3000000000 ... 3999999999

and suppose that 3000000000 is the array's middle element.  If you
search for 100, your first probe will conclude that it is to the right
of 3000000000, which is the wrong direction.  Binary search, or indeed
the entire concept that the array is "sorted" in the first place,
depends on a transitive comparison operator.  TransactionIdFollows does
not satisfy the transitive law.
        regards, tom lane
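
For concreteness, here is a small standalone C sketch (not taken from any patch in this
thread) that models the circular comparison the same way TransactionIdFollows is defined,
as a signed test on the modulo-2^32 difference, and applies it to the xids in Tom's example:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/* Circular comparison, defined the way TransactionIdFollows is:
 * a signed test on the modulo-2^32 difference. */
static bool
xid_follows(TransactionId id1, TransactionId id2)
{
    int32_t diff = (int32_t) (id1 - id2);

    return (diff > 0);
}

int
main(void)
{
    TransactionId target   = 100;           /* sits near the left end of the array */
    TransactionId midpoint = 3000000000u;   /* the array's middle element */

    /* 100 - 3000000000 wraps to a value below 2^31, so the circular test
     * claims 100 "follows" 3000000000, and a binary search probing the
     * midpoint would continue to the right, which is the wrong direction. */
    printf("xid_follows(100, 3000000000) = %d\n", xid_follows(target, midpoint));

    /* Plain < sends the probe left, which is what the search needs. */
    printf("100 < 3000000000             = %d\n", target < midpoint);

    return 0;
}

Run, the first line prints 1: the circular test says 100 belongs to the right of the
3000000000 midpoint, while plain < correctly places it to the left.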


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Fri, 2010-04-16 at 13:00 +0100, Simon Riggs wrote:
> Almost done

Here's the finished patch.

Internal changes only, no functional or user visible changes, so no
docs, just detailed explanatory comments.

Expectation is
* performance no longer depends upon max_connections
* recovery will be about same or slightly faster
* snapshots should be about equivalent primary/standby

Main changes are
* lock free add to sorted array (equivalent to GetNewTransactionId)
* bsearch for xid removal at commit/abort/TransactionIdIsInProgress
* faster, more compact approach to snapshots, with self-trimming
* some minor API changes to facilitate above
* new code to ignore deletion failures only when !consistent

Erik,

Could you retest please?

--
 Simon Riggs           www.2ndQuadrant.com

Attachment
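
The attached patch itself is not reproduced in the archive. Purely to illustrate the shape
described by the "Main changes" list above (a sorted xid array with presence flags,
appended at the head and searched with a plain binary chop), here is a much-simplified
sketch; every name, size and detail below is invented for illustration and none of it is
the actual patch. Locking, memory ordering and xid wraparound are deliberately left out,
since they are what the rest of the thread is about.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define KAX_SIZE 1024                       /* illustrative only */

static TransactionId kax_xids[KAX_SIZE];
static bool          kax_valid[KAX_SIZE];   /* the "presence" flags */
static int           kax_tail = 0;          /* index of oldest entry */
static int           kax_head = 0;          /* one past newest entry */

/* Append a new xid.  The caller guarantees xids arrive in ascending order,
 * so the array stays sorted without any explicit sort.  (No overflow or
 * wraparound handling here; real code needs both.) */
static void
kax_add(TransactionId xid)
{
    kax_xids[kax_head] = xid;
    kax_valid[kax_head] = true;
    kax_head++;
}

/* Binary search (plain comparison) for a still-present xid; -1 if absent. */
static int
kax_find(TransactionId xid)
{
    int lo = kax_tail;
    int hi = kax_head - 1;

    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (kax_xids[mid] == xid)
            return kax_valid[mid] ? mid : -1;
        else if (kax_xids[mid] < xid)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;
}

/* Commit/abort: clear the presence flag rather than shuffling the array. */
static void
kax_remove(TransactionId xid)
{
    int idx = kax_find(xid);

    if (idx >= 0)
        kax_valid[idx] = false;
}

Removal only clears the presence flag, which is what lets the array stay sorted without
compaction; the tail can then be advanced lazily past cleared entries.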

Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Fri, 2010-04-16 at 11:10 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Fri, 2010-04-16 at 10:39 -0400, Tom Lane wrote:
> >> I think you're outsmarting yourself there.  A binary search will in fact
> >> *not work* with circular xid comparison (this is exactly why there's no
> >> btree opclass for XID).
> 
> > I don't understand the exact, please explain more.
> > I'm not using bsearch() just a quick chop based upon xid comparison,
> > which looks to me like it will work.
> 
> Implementing it yourself doesn't get you out of the fact that it won't
> work.  Consider
> 
>     1 2 3 ... 3000000000 ... 3999999999
> 
> and suppose that 3000000000 is the array's middle element.  If you
> search for 100, your first probe will conclude that it is to the right
> of 3000000000, which is the wrong direction.  Binary search, or indeed
> the entire concept that the array is "sorted" in the first place,
> depends on a transitive comparison operator.  TransactionIdFollows does
> not satisfy the transitive law.

AFAICS the example you give isn't correct.

We would lay out the values like this

W-3 W-2 W-1 W 3 4 5

where W is the wrap value and in this example we have 7 values in the
current array, with tail at W-3 and head at 5. Note the gap between W
and 3 where we would skip the values 1 and 2 because they are special.
Each element's xid is TransactionIdAdvanced(previous element).

So when we search for value 3 we would start from W, then decide it is
to the right, which is correct and continue from there.

The values are laid out in TransactionIdFollows order, not in numeric
order, hence we need to use TransactionIdFollows to decide which way to
branch.

As long as it works I'm not worried if the array is not technically
"sorted".

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> AFAICS the example you give isn't correct.

> We would lay out the values like this

> W-3 W-2 W-1 W 3 4 5

> where W is the wrap value

Stop right there, you're already failing to think clearly.  There is no
unique "wrap value", all values act the same in circular XID space.

The fundamental problem here is that if the set of XIDs involved spans a
sufficiently long range, your array will have a[0] < a[1] < ... < a[n]
but it will fail to be true that a[0] < a[n].  If that's also true for,
say, a[0] vs the midpoint element, then a binary search for a[0] will
fail because it will make the wrong decision while probing the midpoint.

> The values are laid out in TransactionIdFollows order,

They *cannot* be "laid out in TransactionIdFollows order".  It's not
possible, because that relationship isn't a total ordering.

Now it might be the case that this is OK for HS purposes because the set
of XIDs that are relevant at any one instant should never span more than
half of the XID space.  But I'd just as soon not use that assumption
here if it's unnecessary.  It'd be cheaper anyway to sort and search the
array using plain <, so why are you so eager to use
TransactionIdFollows?
        regards, tom lane


Re: testing HS/SR - 1 vs 2 performance

From
Robert Haas
Date:
On Sat, Apr 17, 2010 at 11:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> AFAICS the example you give isn't correct.
>
>> We would lay out the values like this
>
>> W-3 W-2 W-1 W 3 4 5
>
>> where W is the wrap value
>
> Stop right there, you're already failing to think clearly.  There is no
> unique "wrap value", all values act the same in circular XID space.

Or to put it a different way, what happens when the "wrap value"
changes?  An array that was in order under one wrap value can cease to
be in order under another.

...Robert


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Sat, 2010-04-17 at 11:13 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > AFAICS the example you give isn't correct.
> 
> > We would lay out the values like this
> 
> > W-3 W-2 W-1 W 3 4 5
> 
> > where W is the wrap value
> 
> Stop right there, you're already failing to think clearly.  There is no
> unique "wrap value", all values act the same in circular XID space.
> 
> The fundamental problem here is that if the set of XIDs involved spans a
> sufficiently long range, your array will have a[0] < a[1] < ... < a[n]
> but it will fail to be true that a[0] < a[n].  If that's also true for,
> say, a[0] vs the midpoint element, then a binary search for a[0] will
> fail because it will make the wrong decision while probing the midpoint.

I understand that the xids are circular, but there is only one point
where the next xid has a lower integer value but is still "later" from
the perspective of TransactionIdFollows. Apart from that single
discontinuity all other values are monotonic. From your insistence I
presume I must be missing something, but I just don't see it. Perhaps we
are just misunderstanding each other about the "sufficiently long
range".

> > The values are laid out in TransactionIdFollows order,
> 
> They *cannot* be "laid out in TransactionIdFollows order".  It's not
> possible, because that relationship isn't a total ordering.
> 
> Now it might be the case that this is OK for HS purposes because the set
> of XIDs that are relevant at any one instant should never span more than
> half of the XID space.

Agree that is true.

>  But I'd just as soon not use that assumption
> here if it's unnecessary. 

I understand what you're saying. I think it is necessary here.

>  It'd be cheaper anyway to sort and search the
> array using plain <, so why are you so eager to use
> TransactionIdFollows?

The array grows to the right and is laid out one xid per element,
resulting in a sequence of values that are transactionid-monotonic. As
the values increase there will eventually reach the discontinuity where
they cease being normally monotonic. Doing it this way means that we can
add rows past the head of the array and then move the head atomically,
so that we can make adding xids lock-free.

If we try to actually sort the values then the algorithm is both more
complex and requires locking. It would be easier to just remember where
the discontinuity is and act accordingly.

So I'm not eager to use either way, but I only currently see one way
that would work.

If there's a different way, that gives the same or better algorithmic
characteristics, I will be happy to code it.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Sat, 2010-04-17 at 11:13 -0400, Tom Lane wrote:
>> It'd be cheaper anyway to sort and search the
>> array using plain <, so why are you so eager to use
>> TransactionIdFollows?

> The array grows to the right and is laid out one xid per element,
> resulting in a sequence of values that are transactionid-monotonic.

How do you know that just adding items at the right will produce a
sorted array?  It seems quite unlikely to me that transactions can be
guaranteed to arrive at this code in XID order.  I think you need to do
an explicitly sorted insertion.

> ... Doing it this way means that we can
> add rows past the head of the array and then move the head atomically,
> so that we can make adding xids lock-free.

... and even without that issue, this seems like utter fantasy.  How
are you going to do that "atomically"?  Have you considered what will
happen on weak-memory-ordering machines like PPC, in particular?
        regards, tom lane


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Sat, 2010-04-17 at 15:45 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Sat, 2010-04-17 at 11:13 -0400, Tom Lane wrote:
> >> It'd be cheaper anyway to sort and search the
> >> array using plain <, so why are you so eager to use
> >> TransactionIdFollows?
> 
> > The array grows to the right and is laid out one xid per element,
> > resulting in a sequence of values that are transactionid-monotonic.
> 
> How do you know that just adding items at the right will produce a
> sorted array?  It seems quite unlikely to me that transactions can be
> guaranteed to arrive at this code in XID order.  I think you need to do
> an explicitly sorted insertion.

Xids don't arrive in sequence, but "known assigned xids" are added in
sequence because we infer the existence of the intermediate xids and
assume they are running for the snapshot.

> > ... Doing it this way means that we can
> > add rows past the head of the array and then move the head atomically,
> > so that we can make adding xids lock-free.
> 
> ... and even without that issue, this seems like utter fantasy.  How
> are you going to do that "atomically"?  Have you considered what will
> happen on weak-memory-ordering machines like PPC, in particular?

We search the array between tail and head. If the head moves by integer
overwrite just as already happens for xid assignment, then we would use
the new head for the search. The code is careful to fetch only once.

I would freely admit I know absolutely nothing about details of
weak-memory-ordering machines and have not considered them at all. How
would what I have proposed fail to work, yet what we already rely on
work correctly? Do the circumstances differ?

-- Simon Riggs           www.2ndQuadrant.com
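
As a rough illustration of the point above about inferring intermediate xids (the reason
additions arrive in ascending order even though WAL records do not), here is a simplified
sketch. kax_add() is the hypothetical append from the earlier sketch, and the real logic
lives in the standby's procarray code, so treat this as an outline only:

#include "postgres.h"
#include "access/transam.h"   /* TransactionIdAdvance, TransactionIdFollows, ... */

extern void kax_add(TransactionId xid);     /* hypothetical append, sketched earlier */

static TransactionId latestObservedXid = FirstNormalTransactionId;

/* For each xid seen in a WAL record, also add every unobserved xid between
 * the previous one and it, so entries always go in in ascending order even
 * though WAL records do not arrive in xid order. */
void
record_known_assigned_xids_sketch(TransactionId xid)
{
    TransactionId next;

    if (!TransactionIdFollows(xid, latestObservedXid))
        return;                             /* nothing new to add */

    next = latestObservedXid;
    TransactionIdAdvance(next);
    while (TransactionIdPrecedesOrEquals(next, xid))
    {
        kax_add(next);
        TransactionIdAdvance(next);
    }
    latestObservedXid = xid;
}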



Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Sat, 2010-04-17 at 15:45 -0400, Tom Lane wrote:
>> How do you know that just adding items at the right will produce a
>> sorted array?

> Xids don't arrive in sequence, but "known assigned xids" are added in
> sequence because we infer the existence of the intermediate xids and
> assuming they are running for the snapshot.

Hm.  Okay, maybe that will work.

>> ... and even without that issue, this seems like utter fantasy.  How
>> are you going to do that "atomically"?  Have you considered what will
>> happen on weak-memory-ordering machines like PPC, in particular?

> We search the array between tail and head. If the head moves by integer
> overwrite just as already happens for xid assignment, then we would use
> the new head for the search. The code is careful to fetch only once.

... but this will not.  You need to use a lock, because there is
otherwise no guarantee that other processors see the write into the
array element before they see the change in the head pointer.

> I would freely admit I know absolutely nothing about details of
> weak-memory-ordering machines and have not considered them at all. How
> would what I have proposed fail to work, yet what we already rely on
> work correctly? Do the circumstances differ?

Yes.  We have memory ordering instructions inserted in the lock
acquisition/release code.  Trying to access and modify a shared-memory
data structure without any locking will not work.

There are some places where we suppose that a *single* write into shared
memory can safely be done without a lock, if we're not too concerned
about how soon other transactions will see the effects.  But what you
are proposing here requires more than one related write.

I've been burnt by this myself:
http://archives.postgresql.org/pgsql-committers/2008-06/msg00228.php
        regards, tom lane
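
A minimal sketch of the hazard Tom describes, together with the spinlock-based variant the
thread settles on below. The kax_* names are hypothetical stand-ins for the patch's
shared-memory array and head index, not actual server symbols; only slock_t and
SpinLockAcquire/SpinLockRelease are the server's own primitives:

#include "postgres.h"
#include "storage/spin.h"

extern TransactionId kax_xids[];
extern int           kax_head;
static slock_t       kax_lock;      /* assume SpinLockInit(&kax_lock) at startup */

/* Broken on weak-memory-ordering CPUs: another backend can observe the new
 * head value before it observes the array element written just before it. */
void
kax_add_unlocked(TransactionId xid)
{
    kax_xids[kax_head] = xid;       /* write 1: the element */
    kax_head++;                     /* write 2: publish it  */
}

/* Safe variant: the spinlock acquire/release code carries the memory
 * ordering instructions Tom mentions, so readers that synchronize on the
 * same lock see the element before they see the advanced head. */
void
kax_add_spinlocked(TransactionId xid)
{
    SpinLockAcquire(&kax_lock);
    kax_xids[kax_head] = xid;
    kax_head++;
    SpinLockRelease(&kax_lock);
}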


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Sat, 2010-04-17 at 16:48 -0400, Tom Lane wrote:

> > We search the array between tail and head. If the head moves by integer
> > overwrite just as already happens for xid assignment, then we would use
> > the new head for the search. The code is careful to fetch only once.
> 
> ... but this will not.  You need to use a lock, because there is
> otherwise no guarantee that other processors see the write into the
> array element before they see the change in the head pointer.
> 
> > I would freely admit I know absolutely nothing about details of
> > weak-memory-ordering machines and have not considered them at all. How
> > would what I have proposed fail to work, yet what we already rely on
> > work correctly? Do the circumstances differ?
> 
> Yes.  We have memory ordering instructions inserted in the lock
> acquisition/release code.  Trying to access and modify a shared-memory
> data structure without any locking will not work.
> 
> There are some places where we suppose that a *single* write into shared
> memory can safely be done without a lock, if we're not too concerned
> about how soon other transactions will see the effects.  But what you
> are proposing here requires more than one related write.
> 
> I've been burnt by this myself:
> http://archives.postgresql.org/pgsql-committers/2008-06/msg00228.php

W O W - thank you for sharing.

What I'm not clear on is why you've used a spinlock everywhere when only
weak-memory thang CPUs are a problem. Why not have a weak-memory-protect
macro that does nada when the hardware already protects us? (i.e. a
spinlock only for the hardware that needs it).

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> What I'm not clear on is why you've used a spinlock everywhere when only
> weak-memory thang CPUs are a problem. Why not have a weak-memory-protect
> macro that does does nada when the hardware already protects us? (i.e. a
> spinlock only for the hardware that needs it).

Well, we could certainly consider that, if we had enough places where
there was a demonstrable benefit from it.  I couldn't measure any real
slowdown from adding a spinlock in that sinval code, so I didn't propose
doing so at the time --- and I'm pretty dubious that this code is
sufficiently performance-critical to justify the work, either.
        regards, tom lane


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > What I'm not clear on is why you've used a spinlock everywhere when only
> > weak-memory thang CPUs are a problem. Why not have a weak-memory-protect
> > macro that does does nada when the hardware already protects us? (i.e. a
> > spinlock only for the hardware that needs it).
> 
> Well, we could certainly consider that, if we had enough places where
> there was a demonstrable benefit from it.  I couldn't measure any real
> slowdown from adding a spinlock in that sinval code, so I didn't propose
> doing so at the time --- and I'm pretty dubious that this code is
> sufficiently performance-critical to justify the work, either.

OK, I'll put a spinlock around access to the head of the array.

Thanks for your input.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote:
> On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote:
> > Simon Riggs <simon@2ndQuadrant.com> writes:
> > > What I'm not clear on is why you've used a spinlock everywhere when only
> > > weak-memory thang CPUs are a problem. Why not have a weak-memory-protect
> > > macro that does does nada when the hardware already protects us? (i.e. a
> > > spinlock only for the hardware that needs it).
> >
> > Well, we could certainly consider that, if we had enough places where
> > there was a demonstrable benefit from it.  I couldn't measure any real
> > slowdown from adding a spinlock in that sinval code, so I didn't propose
> > doing so at the time --- and I'm pretty dubious that this code is
> > sufficiently performance-critical to justify the work, either.
>
> OK, I'll put a spinlock around access to the head of the array.

v2 patch attached

--
 Simon Riggs           www.2ndQuadrant.com

Attachment

Re: testing HS/SR - 1 vs 2 performance

From
David Fetter
Date:
On Sun, Apr 18, 2010 at 12:01:05PM +0100, Simon Riggs wrote:
> On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote:
> > On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote:
> > > Simon Riggs <simon@2ndQuadrant.com> writes:
> > > > What I'm not clear on is why you've used a spinlock everywhere
> > > > when only weak-memory thang CPUs are a problem. Why not have a
> > > > weak-memory-protect macro that does does nada when the
> > > > hardware already protects us? (i.e. a spinlock only for the
> > > > hardware that needs it).
> > > 
> > > Well, we could certainly consider that, if we had enough places
> > > where there was a demonstrable benefit from it.  I couldn't
> > > measure any real slowdown from adding a spinlock in that sinval
> > > code, so I didn't propose doing so at the time --- and I'm
> > > pretty dubious that this code is sufficiently
> > > performance-critical to justify the work, either.
> > 
> > OK, I'll put a spinlock around access to the head of the array.
> 
> v2 patch attached

If you've committed this, or any other patch you've sent here,
*please* mention so on the same thread.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Sun, 2010-04-18 at 13:16 -0700, David Fetter wrote:
> On Sun, Apr 18, 2010 at 12:01:05PM +0100, Simon Riggs wrote:
> > On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote:
> > > On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote:
> > > > Simon Riggs <simon@2ndQuadrant.com> writes:
> > > > > What I'm not clear on is why you've used a spinlock everywhere
> > > > > when only weak-memory thang CPUs are a problem. Why not have a
> > > > > weak-memory-protect macro that does does nada when the
> > > > > hardware already protects us? (i.e. a spinlock only for the
> > > > > hardware that needs it).
> > > > 
> > > > Well, we could certainly consider that, if we had enough places
> > > > where there was a demonstrable benefit from it.  I couldn't
> > > > measure any real slowdown from adding a spinlock in that sinval
> > > > code, so I didn't propose doing so at the time --- and I'm
> > > > pretty dubious that this code is sufficiently
> > > > performance-critical to justify the work, either.
> > > 
> > > OK, I'll put a spinlock around access to the head of the array.
> > 
> > v2 patch attached
> 
> If you've committed this, or any other patch you've sent here,
> *please* mention so on the same thread.

I haven't yet. I've written two patches - this is a major module rewrite
and is still under discussion. The other patch has nothing to do with
this (though I did accidentally include a couple of changes from this
patch and immediately revoked them).

I will wait awhile to see if anybody has some independent test results.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
David Fetter
Date:
On Sun, Apr 18, 2010 at 09:22:21PM +0100, Simon Riggs wrote:
> On Sun, 2010-04-18 at 13:16 -0700, David Fetter wrote:
> > On Sun, Apr 18, 2010 at 12:01:05PM +0100, Simon Riggs wrote:
> > > On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote:
> > > > On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote:
> > > > > Simon Riggs <simon@2ndQuadrant.com> writes:
> > > > > > What I'm not clear on is why you've used a spinlock
> > > > > > everywhere when only weak-memory thang CPUs are a problem.
> > > > > > Why not have a weak-memory-protect macro that does does
> > > > > > nada when the hardware already protects us? (i.e. a
> > > > > > spinlock only for the hardware that needs it).
> > > > > 
> > > > > Well, we could certainly consider that, if we had enough
> > > > > places where there was a demonstrable benefit from it.  I
> > > > > couldn't measure any real slowdown from adding a spinlock in
> > > > > that sinval code, so I didn't propose doing so at the time
> > > > > --- and I'm pretty dubious that this code is sufficiently
> > > > > performance-critical to justify the work, either.
> > > > 
> > > > OK, I'll put a spinlock around access to the head of the
> > > > array.
> > > 
> > > v2 patch attached
> > 
> > If you've committed this, or any other patch you've sent here,
> > *please* mention so on the same thread.
> 
> I haven't yet. I've written two patches - this is a major module
> rewrite and is still under discussion. The other patch has nothing
> to do with this (though I did accidentally include a couple of
> changes from this patch and immediately revoked them).
> 
> I will wait awhile to see if anybody has some independent test
> results.

Thanks :)

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: testing HS/SR - 1 vs 2 performance

From
Robert Haas
Date:
On Sun, Apr 18, 2010 at 4:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sun, 2010-04-18 at 13:16 -0700, David Fetter wrote:
>> On Sun, Apr 18, 2010 at 12:01:05PM +0100, Simon Riggs wrote:
>> > On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote:
>> > > On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote:
>> > > > Simon Riggs <simon@2ndQuadrant.com> writes:
>> > > > > What I'm not clear on is why you've used a spinlock everywhere
>> > > > > when only weak-memory thang CPUs are a problem. Why not have a
>> > > > > weak-memory-protect macro that does does nada when the
>> > > > > hardware already protects us? (i.e. a spinlock only for the
>> > > > > hardware that needs it).
>> > > >
>> > > > Well, we could certainly consider that, if we had enough places
>> > > > where there was a demonstrable benefit from it.  I couldn't
>> > > > measure any real slowdown from adding a spinlock in that sinval
>> > > > code, so I didn't propose doing so at the time --- and I'm
>> > > > pretty dubious that this code is sufficiently
>> > > > performance-critical to justify the work, either.
>> > >
>> > > OK, I'll put a spinlock around access to the head of the array.
>> >
>> > v2 patch attached
>>
>> If you've committed this, or any other patch you've sent here,
>> *please* mention so on the same thread.
>
> I haven't yet. I've written two patches - this is a major module rewrite
> and is still under discussion. The other patch has nothing to do with
> this (though I did accidentally include a couple of changes from this
> patch and immediately revoked them).
>
> I will wait awhile to see if anybody has some independent test results.

So, does anyone have a few cycles to test this out?  We are down to a
handful of remaining open items, so getting this tested and committed
sooner = beta sooner.

...Robert


Re: testing HS/SR - 1 vs 2 performance

From
Mark Kirkwood
Date:
Robert Haas wrote:
>
> So, does anyone have a few cycles to test this out?  We are down to
> handful of remaining open items, so getting this tested and committed
> sooner = beta sooner.
>
>
>   

I did some testing of this patch (v2). Unfortunately I don't have access 
to hardware capable of doing tests at the same scale as Erik used. 
However I was still able to show a consistent difference (I think) 
between standby performance with and without the patch applied.

Setup:

host: 2.7 Ghz dual core amd64 with 4G ram and 1 sata drive,
code: cvs head from 2010-04-14.
pgbench:  scale=100, 4 clients, 10000 (select) transactions each.

Results:

Master performance (with and without patch applied ):
tps = 10903.612340 - 14070.109951 (including connections establishing)

Standby performance without patch:
tps = 8288.119913 - 9722.245178 (including connections establishing)

Standby performance with patch applied:
tps = 11592.922637 - 14065.553214 (including connections establishing)

I performed 8 runs of each, and results would start at the low range and 
climb up to the high one, where they would stabilize. In between runs I 
cleared the os buffer cache and (partially) reloaded it by selecting 
counts from the pgbench tables (i.e I was trying to ensure each run had 
the same or similar os cache contents).

Overall looks like the patch gets standby read only performance close to 
the master - at least in the case where there are minimal master 
transactions being tracked by the standby (I had to leave the master 
idle whilst running the standby case, as they shared the machine). Hope 
this info is useful.

regards

Mark



Re: testing HS/SR - 1 vs 2 performance

From
Robert Haas
Date:
On Tue, Apr 20, 2010 at 11:09 PM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
> Robert Haas wrote:
>>
>> So, does anyone have a few cycles to test this out?  We are down to
>> handful of remaining open items, so getting this tested and committed
>> sooner = beta sooner.
>>
>>
>>
>
> I did some testing of this patch (v2). Unfortunately I don't have access to
> hardware capable of doing tests at the same scale as Erik used. However I
> was still able to show a consistent difference (I think) between standby
> performance with and without the patch applied.
>
> Setup:
>
> host: 2.7 Ghz dual core amd64 with 4G ram and 1 sata drive,
> code: cvs head from 2010-04-14.
> pgbench:  scale=100, 4 clients, 10000 (select) transactions each.
>
> Results:
>
> Master performance (with and without patch applied ):
> tps = 10903.612340 - 14070.109951 (including connections establishing)
>
> Standby performance without patch (:
> tps = 8288.119913 - 9722.245178 (including connections establishing)
>
> Standby performance with patch applied:
> tps = 11592.922637 - 14065.553214 (including connections establishing)
>
> I performed 8 runs of each, and results would start at the low range and
> climb up to the high one, where they would stabilize. In between runs I
> cleared the os buffer cache and (partially) reloaded it by selecting counts
> from the pgbench tables (i.e I was trying to ensure each run had the same or
> similar os cache contents).
>
> Overall looks like the patch gets standby read only performance close to the
> master - at least in the case where there are minimal master transactions
> being tracked by the standby (I had to leave the master idle whilst running
> the standby case, as they shared the machine). Hope this info is useful.

Thanks, that sounds promising.

...Robert


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Wed, 2010-04-21 at 15:09 +1200, Mark Kirkwood wrote:

> I did some testing of this patch (v2). Unfortunately I don't have access 
> to hardware capable of doing tests at the same scale as Erik used. 
> However I was still able to show a consistent difference (I think) 
> between standby performance with and without the patch applied.

...

> Overall looks like the patch gets standby read only performance close to 
> the master - at least in the case where there are minimal master 
> transactions being tracked by the standby (I had to leave the master 
> idle whilst running the standby case, as they shared the machine). Hope 
> this info is useful.

Thanks very much for the report; always good to get confirmation.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote:
>> On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote:
>>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>>> What I'm not clear on is why you've used a spinlock everywhere when only
>>>> weak-memory thang CPUs are a problem. Why not have a weak-memory-protect
>>>> macro that does does nada when the hardware already protects us? (i.e. a
>>>> spinlock only for the hardware that needs it).
>>> Well, we could certainly consider that, if we had enough places where
>>> there was a demonstrable benefit from it.  I couldn't measure any real
>>> slowdown from adding a spinlock in that sinval code, so I didn't propose
>>> doing so at the time --- and I'm pretty dubious that this code is
>>> sufficiently performance-critical to justify the work, either.
>> OK, I'll put a spinlock around access to the head of the array.
> 
> v2 patch attached

The locking seems overly complex to me. Looking at
KnownAssignedXidsAdd(), isn't it possible for two processes to call it
concurrently in exclusive_lock==false mode and get the same 'head'
value, and step on each others toes? I guess KnownAssignedXidsAdd() is
only called by the startup process, but it seems like an accident
waiting to happen.

Spinlocks are fast, if you have to add an if-branch to decide whether to
acquire it, I suspect you've lost any performance gain to be had
already. Let's keep it simple. And acquiring ProcArrayLock in exclusive
mode while adding entries to the array seems OK to me as well. It only
needs to be held very briefly, and getting this correct and keeping it
simple is much more important at this point than squeezing out every
possible CPU cycle from the system.

Just require that the caller holds ProcArrayLock in exclusive mode when
calling an operation that modifies the array, and in shared mode when
reading it.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: testing HS/SR - 1 vs 2 performance

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote:
>> On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote:
>>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>>> What I'm not clear on is why you've used a spinlock everywhere when only
>>>> weak-memory thang CPUs are a problem. Why not have a weak-memory-protect
>>>> macro that does does nada when the hardware already protects us? (i.e. a
>>>> spinlock only for the hardware that needs it).
>>> Well, we could certainly consider that, if we had enough places where
>>> there was a demonstrable benefit from it.  I couldn't measure any real
>>> slowdown from adding a spinlock in that sinval code, so I didn't propose
>>> doing so at the time --- and I'm pretty dubious that this code is
>>> sufficiently performance-critical to justify the work, either.
>> OK, I'll put a spinlock around access to the head of the array.
> 
> v2 patch attached

Given the discussion about the cyclic nature of XIDs, it would be good
to add an assertion that when a new XID is added to the array, it is

a) larger than the biggest value already in the array
(TransactionIdFollows(new, head)), and
b) not too far from the smallest value in the array to confuse binary
search (TransactionIdFollows(new, tail)).

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com
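
A sketch of how those two assertions might look inside a hypothetical
KnownAssignedXidsAdd(); the function signature and names are illustrative only, not the
patch:

#include "postgres.h"
#include "access/transam.h"          /* TransactionIdFollows() */

/* head_xid/tail_xid stand for the newest and oldest xids currently tracked;
 * a real KnownAssignedXidsAdd() would take them from the array itself. */
static void
known_assigned_xids_add_sketch(TransactionId new_xid,
                               TransactionId head_xid,
                               TransactionId tail_xid)
{
    /* (a) additions must arrive in ascending order, or the array is no
     * longer searchable */
    Assert(TransactionIdFollows(new_xid, head_xid));

    /* (b) the tracked range must stay well within half the XID space, or
     * the circular comparisons used by the search lose their meaning */
    Assert(TransactionIdFollows(new_xid, tail_xid));

    /* ... append new_xid at the head as before ... */
}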


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Wed, 2010-04-21 at 15:20 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote:
> >> On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote:
> >>> Simon Riggs <simon@2ndQuadrant.com> writes:
> >>>> What I'm not clear on is why you've used a spinlock everywhere when only
> >>>> weak-memory thang CPUs are a problem. Why not have a weak-memory-protect
> >>>> macro that does does nada when the hardware already protects us? (i.e. a
> >>>> spinlock only for the hardware that needs it).
> >>> Well, we could certainly consider that, if we had enough places where
> >>> there was a demonstrable benefit from it.  I couldn't measure any real
> >>> slowdown from adding a spinlock in that sinval code, so I didn't propose
> >>> doing so at the time --- and I'm pretty dubious that this code is
> >>> sufficiently performance-critical to justify the work, either.
> >> OK, I'll put a spinlock around access to the head of the array.
> > 
> > v2 patch attached
> 
> The locking seems overly complex to me. 

> Looking at
> KnownAssignedXidsAdd(), isn't it possible for two processes to call it
> concurrently in exclusive_lock==false mode and get the same 'head'
> value, and step on each others toes? I guess KnownAssignedXidsAdd() is
> only called by the startup process, but it seems like an accident
> waiting to happen.

Not at all. That assumption is also used elsewhere, so it is safe to use
here.

> Spinlocks are fast, if you have to add an if-branch to decide whether to
> acquire it, I suspect you've lost any performance gain to be had
> already.

I think you're misreading the code. One caller already holds exclusive
lock, one does not. The if test is to determine whether to acquire the
lock or not.

> Let's keep it simple. And acquiring ProcArrayLock in exclusive
> mode while adding entries to the array seems OK to me as well. It only
> needs to be held very briefly, and getting this correct and keeping it
> simple is much more important at this point than squeezing out every
> possible CPU cycle from the system.

I don't understand what you're saying: you say I'm wasting a few cycles
on an if test and should change that, but at the same time you say I
shouldn't worry about a few cycles.

> Just require that the caller holds ProcArrayLock in exclusive mode when
> calling an operation that modifies the array, and in shared mode when
> reading it.

The point of the code you're discussing is to remove the exclusive lock
requests, not to save a few cycles. Spinlocks are fast, as you say.

Exclusive lock requests block under a heavy load of shared lock holders,
we know that already. It is worth removing the contention that can occur
by minimising the number of exclusive locks required. This patch shows
how and I don't see any reason from what you have said to avoid
committing it. I'm willing to hear some sound reasons, if any exist.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Wed, 2010-04-21 at 15:27 +0300, Heikki Linnakangas wrote:

> Given the discussion about the cyclic nature of XIDs, it would be good
> to add an assertion that when a new XID is added to the array, it is
> 
> a) larger than the biggest value already in the array
> (TransactionIdFollows(new, head)), and
> b) not too far from the smallest value in the array to confuse binary
> search (TransactionIdFollows(new, tail)).

We discussed this in November. You convinced me it isn't possible for
older xids to stay in the standby because anti-wraparound vacuums would
conflict and kick them out. The primary can't run with wrapped xids and
neither can the standby. I think that is correct.

Adding an assertion isn't going to do much because it's unlikely anybody
is going to be running for 2^31 transactions with asserts enabled.

Worrying about things like this seems strange when real and negative
behaviours are right in our faces elsewhere. Performance and scalability
are real world concerns.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Robert Haas
Date:
On Wed, Apr 21, 2010 at 9:37 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, 2010-04-21 at 15:27 +0300, Heikki Linnakangas wrote:
>
>> Given the discussion about the cyclic nature of XIDs, it would be good
>> to add an assertion that when a new XID is added to the array, it is
>>
>> a) larger than the biggest value already in the array
>> (TransactionIdFollows(new, head)), and
>> b) not too far from the smallest value in the array to confuse binary
>> search (TransactionIdFollows(new, tail)).
>
> We discussed this in November. You convinced me it isn't possible for
> older xids to stay in the standby because anti-wraparound vacuums would
> conflict and kick them out. The primary can't run with wrapped xids and
> neither can the standby. I think that is correct.
>
> Adding an assertion isn't going to do much because it's unlikely anybody
> is going to be running for 2^31 transactions with asserts enabled.
>
> Worrying about things like this seems strange when real and negative
> behaviours are right in our faces elsewhere. Performance and scalability
> are real world concerns.

I think the assert is a good idea.  If there's no real problem here,
the assert won't trip.  It's just a safety precaution.

...Robert


Re: testing HS/SR - 1 vs 2 performance

From
Robert Haas
Date:
On Wed, Apr 21, 2010 at 8:20 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> The locking seems overly complex to me.

I tend to agree.

!     /*
!      * Callers must hold either ProcArrayLock in Exclusive mode or
!      * ProcArrayLock in Shared mode *and* known_assigned_xids_lck
!      * to update these values.
!      */

I'm not convinced that this is either (a) correct or (b) performant.

...Robert


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Wed, 2010-04-21 at 09:51 -0400, Robert Haas wrote:
> >
> > Adding an assertion isn't going to do much because it's unlikely anybody
> > is going to be running for 2^31 transactions with asserts enabled.
> >

> I think the assert is a good idea.  If there's no real problem here,
> the assert won't trip.  It's just a safety precaution.

If you believe that, then I think you should add this to all the other
places in the current server where that assumption is made without an
assertion being added. As a safety precaution.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
marcin mank
Date:
On Wed, Apr 21, 2010 at 4:12 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, 2010-04-21 at 09:51 -0400, Robert Haas wrote:
>> >
>> > Adding an assertion isn't going to do much because it's unlikely anybody
>> > is going to be running for 2^31 transactions with asserts enabled.
>> >
>
>> I think the assert is a good idea.  If there's no real problem here,
>> the assert won't trip.  It's just a safety precaution.
>
> If you believe that, then I think you should add this to all the other
> places in the current server where that assumption is made without
> assertion being added. As a safety precaution.
>

Would it not be a good idea (at least for dev-builds, like with
enable-cassert) for the xid counter to start at something like 2^31 - 1000? It could
help catch some bugs.

Greetings
Marcin Mańk


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Wed, 2010-04-21 at 16:22 +0200, marcin mank wrote:

> Is that not a good idea that (at least for dev-builds, like with
> enable-cassert) the xid counter start at like 2^31 - 1000 ? It could
> help catch some bugs.

It is a good idea, I'm sure that would help catch bugs.

It wouldn't help here because the case in doubt is whether it's possible
to have an xid still showing in memory arrays from the last time the
cycle wrapped. It isn't. These things aren't random. These numbers are
extracted directly from activity that was occurring on the primary and
regularly checked and cleaned as the standby runs.

So you'll need to do 2^31 transactions to prove this isn't true, which
isn't ever going to happen in testing with an assert build and nobody
with that many transactions would run an assert build anyway.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Robert Haas
Date:
On Wed, Apr 21, 2010 at 10:12 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, 2010-04-21 at 09:51 -0400, Robert Haas wrote:
>> >
>> > Adding an assertion isn't going to do much because it's unlikely anybody
>> > is going to be running for 2^31 transactions with asserts enabled.
>> >
>
>> I think the assert is a good idea.  If there's no real problem here,
>> the assert won't trip.  It's just a safety precaution.
>
> If you believe that, then I think you should add this to all the other
> places in the current server where that assumption is made without
> assertion being added. As a safety precaution.

I feel like this conversation is getting a little heated.  We are just
trying to solve a technical problem here.  Perhaps I am misreading -
tone doesn't come through very well in email.

I think the assumptions that are being made in this particular case
are different from the ones made elsewhere in the server.  Generally,
we don't assume transaction IDs are arriving in ascending order - in
fact, we usually explicitly have to deal with the fact that they might
not be.   So if we have a situation where we ARE relying on them
arriving in order, because we have extrinsic reasons why we know it
has to happen that way, adding an assertion to make sure that things
are happening the way we expect doesn't seem out of line.  This code
is fairly complex.

There is arguably less value in asserting that the newly added xid
follows the tail as well as the head, but I still like the idea.  Not
sure whether that's rational or not.

...Robert


Re: testing HS/SR - 1 vs 2 performance

From
Florian Pflug
Date:
On Apr 21, 2010, at 16:49 , Simon Riggs wrote:
> On Wed, 2010-04-21 at 16:22 +0200, marcin mank wrote:
>
>> Is that not a good idea that (at least for dev-builds, like with
>> enable-cassert) the xid counter start at like 2^31 - 1000 ? It could
>> help catch some bugs.
>
> It is a good idea, I'm sure that would help catch bugs.
>
> It wouldn't help here because the case in doubt is whether it's possible
> to have an xid still showing in memory arrays from the last time the
> cycle wrapped. It isn't. These things aren't random. These numbers are
> extracted directly from activity that was occurring on the primary and
> regularly checked and cleaned as the standby runs.
>
> So you'll need to do 2^31 transactions to prove this isn't true, which
> isn't ever going to happen in testing with an assert build and nobody
> with that many transactions would run an assert build anyway.


ISTM that there's no need to actually execute 2^31 transactions to trigger this bug (if it actually exists), it'd be
sufficient to increment the xid counter by more than one each time a xid is assigned, no?

Or would that trip snapshot creation on the standby?

best regards,
Florian Pflug


Re: testing HS/SR - 1 vs 2 performance

From
Heikki Linnakangas
Date:
Robert Haas wrote:
> On Wed, Apr 21, 2010 at 9:37 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On Wed, 2010-04-21 at 15:27 +0300, Heikki Linnakangas wrote:
>>
>>> Given the discussion about the cyclic nature of XIDs, it would be good
>>> to add an assertion that when a new XID is added to the array, it is
>>>
>>> a) larger than the biggest value already in the array
>>> (TransactionIdFollows(new, head)), and
>>> b) not too far from the smallest value in the array to confuse binary
>>> search (TransactionIdFollows(new, tail)).
>> We discussed this in November. You convinced me it isn't possible for
>> older xids to stay in the standby because anti-wraparound vacuums would
>> conflict and kick them out. The primary can't run with wrapped xids and
>> neither can the standby. I think that is correct.
>>
>> Adding an assertion isn't going to do much because it's unlikely anybody
>> is going to be running for 2^31 transactions with asserts enabled.
>>
>> Worrying about things like this seems strange when real and negative
>> behaviours are right in our faces elsewhere. Performance and scalability
>> are real world concerns.
> 
> I think the assert is a good idea.  If there's no real problem here,
> the assert won't trip.  It's just a safety precaution.

Right. And assertions also act as documentation, they are a precise and
compact way to document invariants we assume to hold. A comment
explaining why the cyclic nature of XIDs is not a problem would be nice
too, in addition or instead of the assertions.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Thu, 2010-04-22 at 08:57 +0300, Heikki Linnakangas wrote:
> > 
> > I think the assert is a good idea.  If there's no real problem here,
> > the assert won't trip.  It's just a safety precaution.
> 
> Right. And assertions also act as documentation, they are a precise and
> compact way to document invariants we assume to hold. A comment
> explaining why the cyclic nature of XIDs is not a problem would be nice
> too, in addition or instead of the assertions.

Agreed. I was going to reply with just that earlier but have been distracted
by other things.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
"Erik Rijkers"
Date:
On Sun, April 18, 2010 13:01, Simon Riggs wrote:
>>
>> OK, I'll put a spinlock around access to the head of the array.
>
> v2 patch attached
>

knownassigned_sortedarray.v2.diff applied to cvs HEAD (2010.04.21 22:36)

I have done a few smaller tests (scale 500, clients 1, 20):

init: pgbench -h /tmp -p 6565 -U rijkers -i -s 500 replicas


4x primary, clients 1:
scale: 500     clients:  1  tps = 11496.372655  pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps = 11580.141685  pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps = 11478.294747  pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps = 11741.432016  pgbench -p 6565 -n -S -c 1 -T 900 -j 1

4x standby, clients 1:
scale: 500     clients:  1  tps =   727.217672  pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps =   785.431011  pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps =   825.291817  pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps =   868.107638  pgbench -p 6566 -n -S -c 1 -T 900 -j 1


4x primary, clients 20:
scale: 500     clients: 20  tps = 34963.054102  pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps = 34818.985407  pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps = 34964.545013  pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps = 34959.210687  pgbench -p 6565 -n -S -c 20 -T 900 -j 1

4x standby, clients 20:
scale: 500     clients: 20  tps =  1099.808192  pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps =   905.926703  pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps =   943.531989  pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps =  1082.215913  pgbench -p 6566 -n -S -c 20 -T 900 -j 1


This is the same behaviour (i.e. extremely slow standby) that I saw earlier (and which caused the
original post, btw).  In that earlier instance, the extreme slowness disappeared later, after many
hours, maybe even days (without bouncing either primary or standby).

I have no idea what could cause this; is no one else seeing this?

(if I have time I'll repeat this on other hardware over the weekend)

any comment is welcome...


Erik Rijkers





Re: testing HS/SR - 1 vs 2 performance

From
Greg Smith
Date:
Erik Rijkers wrote:
> This is the same behaviour (i.e. extreme slow standby) that I saw earlier (and which caused the
> original post, btw).  In that earlier instance, the extreme slowness disappeared later, after many
> hours maybe even days (without bouncing either primary or standby).
>   

Any possibility the standby is built with assertions turned on?  That's
often the cause of this type of difference between pgbench results on
two systems, which is easy to introduce when everyone is building from
source.  You should try this on both systems:

psql -c "show debug_assertions"


just to rule that out.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us



Re: testing HS/SR - 1 vs 2 performance

From
Mark Kirkwood
Date:
Greg Smith wrote:
> Erik Rijkers wrote:
>> This is the same behaviour (i.e. extreme slow standby) that I saw 
>> earlier (and which caused the
>> original post, btw).  In that earlier instance, the extreme slowness 
>> disappeared later, after many
>> hours maybe even days (without bouncing either primary or standby).
>>   
>
> Any possibility the standby is built with assertions turned out?  
> That's often the cause of this type of difference between pgbench 
> results on two systems, which easy to introduce when everyone is 
> building from source.  You should try this on both systems:
>
> psql -c "show debug_assertions"
>
>
>
Or even:

pg_config --configure

on both systems might be worth checking.


regards

Mark



Re: testing HS/SR - 1 vs 2 performance

From
"Erik Rijkers"
Date:
On Thu, April 22, 2010 23:54, Mark Kirkwood wrote:
> Greg Smith wrote:
>> Erik Rijkers wrote:
>>> This is the same behaviour (i.e. extreme slow standby) that I saw
>>> earlier (and which caused the
>>> original post, btw).  In that earlier instance, the extreme slowness
>>> disappeared later, after many
>>> hours maybe even days (without bouncing either primary or standby).
>>>
>>
>> Any possibility the standby is built with assertions turned out?
>> That's often the cause of this type of difference between pgbench
>> results on two systems, which easy to introduce when everyone is
>> building from source.  You should try this on both systems:
>>
>> psql -c "show debug_assertions"
>>
>>
>>
> Or even:
>
> pg_config --configure
>
> on both systems might be worth checking.

(these instances are on a single server, btw)

primary:

$ pg_config
BINDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/bin
DOCDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/doc
HTMLDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/doc
INCLUDEDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/include
PKGINCLUDEDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/include
INCLUDEDIR-SERVER = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/include/server
LIBDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib
PKGLIBDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib
LOCALEDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/locale
MANDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/man
SHAREDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share
SYSCONFDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/etc
PGXS = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--prefix=/var/data1/pg_stuff/pg_installations/pgsql.sr_primary' '--with-pgport=6565'
'--enable-depend' '--with-openssl' '--with-perl' '--with-libxml' '--with-libxslt'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement
-Wendif-labels -fno-strict-aliasing -fwrapv
CFLAGS_SL = -fpic
LDFLAGS = -Wl,-rpath,'/var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib'
LDFLAGS_SL =
LIBS = -lpgport -lxslt -lxml2 -lssl -lcrypto -lz -lreadline -ltermcap -lcrypt -ldl -lm
VERSION = PostgreSQL 9.0devel-sr_primary
[data:port:db   PGPORT=6565   PGDATA=/var/data1/pg_stuff/pg_installations/pgsql.sr_primary/data  
PGDATABASE=replicas]
2010.04.22 20:55:28 rijkers@denkraam:~/src/perl/85devel [0]
$ time ./run_test_suite.sh
[data:port:db   PGPORT=6565   PGDATA=/var/data1/pg_stuff/pg_installations/pgsql.sr_primary/data  
PGDATABASE=replicas]
2010.04.22 21:00:26 rijkers@denkraam:~/src/perl/85devel [1]

standby:

$ pg_config
BINDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/bin
DOCDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/doc
HTMLDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/doc
INCLUDEDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/include
PKGINCLUDEDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/include
INCLUDEDIR-SERVER = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/include/server
LIBDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib
PKGLIBDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib
LOCALEDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/locale
MANDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/man
SHAREDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share
SYSCONFDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/etc
PGXS = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--prefix=/var/data1/pg_stuff/pg_installations/pgsql.sr_primary' '--with-pgport=6565'
'--enable-depend' '--with-openssl' '--with-perl' '--with-libxml' '--with-libxslt'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement
-Wendif-labels -fno-strict-aliasing -fwrapv
CFLAGS_SL = -fpic
LDFLAGS = -Wl,-rpath,'/var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib'
LDFLAGS_SL =
LIBS = -lpgport -lxslt -lxml2 -lssl -lcrypto -lz -lreadline -ltermcap -lcrypt -ldl -lm
VERSION = PostgreSQL 9.0devel-sr_primary



$ grep -Ev '(^[[:space:]]*#)|(^$)' pgsql.sr_*ry/data/postgresql.conf
pgsql.sr_primary/data/postgresql.conf:data_directory = '/var/data1/pg_stuff/pg_installations/pgsql.sr_primary/data'
pgsql.sr_primary/data/postgresql.conf:port = 6565
pgsql.sr_primary/data/postgresql.conf:max_connections = 100
pgsql.sr_primary/data/postgresql.conf:shared_buffers = 256MB
pgsql.sr_primary/data/postgresql.conf:checkpoint_segments = 50
pgsql.sr_primary/data/postgresql.conf:archive_mode = 'on'
pgsql.sr_primary/data/postgresql.conf:archive_command= 'cp %p /var/data1/pg_stuff/dump/replication_archive/%f'
pgsql.sr_primary/data/postgresql.conf:max_wal_senders = 5
pgsql.sr_primary/data/postgresql.conf:effective_cache_size= 16GB
pgsql.sr_primary/data/postgresql.conf:datestyle = 'iso, mdy'
pgsql.sr_primary/data/postgresql.conf:lc_messages = 'en_US.UTF-8'
pgsql.sr_primary/data/postgresql.conf:lc_monetary = 'en_US.UTF-8'
pgsql.sr_primary/data/postgresql.conf:lc_numeric = 'en_US.UTF-8'
pgsql.sr_primary/data/postgresql.conf:lc_time = 'en_US.UTF-8'
pgsql.sr_primary/data/postgresql.conf:default_text_search_config = 'pg_catalog.english'

pgsql.sr_slavery/data/postgresql.conf:data_directory = '/var/data1/pg_stuff/pg_installations/pgsql.sr_slavery/data'
pgsql.sr_slavery/data/postgresql.conf:port = 6566
pgsql.sr_slavery/data/postgresql.conf:max_connections = 100
pgsql.sr_slavery/data/postgresql.conf:shared_buffers = 256MB
pgsql.sr_slavery/data/postgresql.conf:checkpoint_segments = 50
pgsql.sr_slavery/data/postgresql.conf:archive_mode = 'on'
pgsql.sr_slavery/data/postgresql.conf:archive_command= 'cp %p /var/data1/pg_stuff/dump/replication_archive/%f'
pgsql.sr_slavery/data/postgresql.conf:max_wal_senders = 5
pgsql.sr_slavery/data/postgresql.conf:effective_cache_size= 16GB
pgsql.sr_slavery/data/postgresql.conf:datestyle = 'iso, mdy'
pgsql.sr_slavery/data/postgresql.conf:lc_messages = 'en_US.UTF-8'
pgsql.sr_slavery/data/postgresql.conf:lc_monetary = 'en_US.UTF-8'
pgsql.sr_slavery/data/postgresql.conf:lc_numeric = 'en_US.UTF-8'
pgsql.sr_slavery/data/postgresql.conf:lc_time = 'en_US.UTF-8'
pgsql.sr_slavery/data/postgresql.conf:default_text_search_config = 'pg_catalog.english'

I only now notice  archive_mode = 'on'  on the standby.  That doesn't make sense of course - I'll
remove that.





Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Thu, 2010-04-22 at 20:39 +0200, Erik Rijkers wrote:
> On Sun, April 18, 2010 13:01, Simon Riggs wrote:

> any comment is welcome...

Please can you re-run with -l and post me the file of times.

Please also rebuild using --enable-profile so we can see what's
happening.

Can you also try the enclosed patch which implements prefetching during
replay of btree delete records. (Need to set effective_io_concurrency)

Thanks for your further help.

--
 Simon Riggs           www.2ndQuadrant.com

Attachment
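
The attached patch itself is not reproduced in the archive, but the
mechanism it relies on is easy to illustrate: on Linux, PostgreSQL's
buffer prefetching (the machinery that effective_io_concurrency enables)
comes down to posix_fadvise(POSIX_FADV_WILLNEED), which asks the kernel
to start reading a block in the background so that a later read finds it
already in the page cache. A minimal standalone sketch of that mechanism
(not of the patch); BLCKSZ and the command-line interface are just
illustrative choices:

/*
 * prefetch.c - standalone illustration of the posix_fadvise(WILLNEED)
 * prefetch mechanism behind effective_io_concurrency.  Not PostgreSQL
 * code and not the attached patch.
 *
 * Build: gcc -O2 prefetch.c -o prefetch
 */
#define _XOPEN_SOURCE 600

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(int argc, char **argv)
{
    int         fd;
    long        blkno;
    char        buf[BLCKSZ];

    if (argc != 3)
    {
        fprintf(stderr, "usage: %s datafile blockno\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    blkno = atol(argv[2]);

    /* hint to the kernel: we will need this 8kB block soon */
    (void) posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);

    /* ... do other useful work while the kernel reads in parallel ... */

    /* by the time we actually read the block it is (hopefully) cached */
    if (pread(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ) != BLCKSZ)
        perror("pread");

    close(fd);
    return 0;
}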

Re: testing HS/SR - 1 vs 2 performance

From
Mark Kirkwood
Date:
Erik Rijkers wrote:
>
> This is the same behaviour (i.e. extreme slow standby) that I saw earlier (and which caused the
> original post, btw).  In that earlier instance, the extreme slowness disappeared later, after many
> hours maybe even days (without bouncing either primary or standby).
>
> I have no idea what could cause this; is no one else is seeing this ?
>
> (if I have time I'll repeat on other hardware in the weekend)
>
> any comment is welcome...
>
>
>   

I wonder if what you are seeing is perhaps due to the tables on the
primary being almost completely cached (from the initial create) and
those on the standby being at best partially so. That would explain why
the standby performance catches up after a while (when its tables are
equivalently cached).

One way to test this is to 'pre-cache' the standby by selecting every 
row from its tables before running the pgbench test.

regards

Mark
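
Mark's pre-caching suggestion is easy to script. Below is a minimal
libpq sketch along those lines; the pgbench table names and the
connection parameters (port 6566, database "replicas") are assumptions
based on Erik's setup, so adjust them to taste.

/*
 * warm_standby.c - run a sequential scan over each pgbench table on the
 * standby before benchmarking, so its pages get pulled into cache much
 * as the primary's were during the initial data load.  A sketch, not a
 * polished tool.
 *
 * Build: gcc warm_standby.c -o warm_standby -I$(pg_config --includedir) -lpq
 */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    const char *tables[] = {
        "pgbench_accounts", "pgbench_branches",
        "pgbench_tellers", "pgbench_history"
    };
    PGconn     *conn;
    int         i;

    conn = PQconnectdb("port=6566 dbname=replicas");
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    for (i = 0; i < 4; i++)
    {
        char        sql[128];
        PGresult   *res;

        /* count(*) forces a full read of every heap page in the table */
        snprintf(sql, sizeof(sql), "SELECT count(*) FROM %s", tables[i]);
        res = PQexec(conn, sql);
        if (PQresultStatus(res) != PGRES_TUPLES_OK)
            fprintf(stderr, "%s failed: %s", sql, PQerrorMessage(conn));
        else
            printf("%s: %s rows\n", tables[i], PQgetvalue(res, 0, 0));
        PQclear(res);
    }

    PQfinish(conn);
    return 0;
}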



Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Thu, 2010-04-22 at 23:45 +0100, Simon Riggs wrote:
> On Thu, 2010-04-22 at 20:39 +0200, Erik Rijkers wrote:
> > On Sun, April 18, 2010 13:01, Simon Riggs wrote:
> 
> > any comment is welcome...
> 
> Please can you re-run with -l and post me the file of times

Erik has sent me details of a test run. My analysis of that is:

I'm seeing the response time profile on the standby as
99% <110us
99.9% <639us
99.99% <615ms

0.052% (52 samples) are >5ms elapsed and account for 24 s, which is
about 45% of elapsed time. 

Of the 52 samples >5ms, 50 of them are >100ms and 2 >1s. 

99% of transactions happen in similar times between primary and standby,
everything dragged down by rare but severe spikes.

We're looking for something that would delay something that normally
takes <0.1ms into something that takes >100ms, yet does eventually
return. That looks like a severe resource contention issue.

This effect happens when running just a single read-only session on
standby from pgbench. No confirmation as yet as to whether recovery is
active or dormant, and what other activity, if any, occurs on the
standby server at the same time. So no other clues as yet as to what the
contention might be, except that we note the standby is writing data and
the database is large.

> Please also rebuild using --enable-profile so we can see what's
> happening.
> 
> Can you also try the enclosed patch which implements prefetching during
> replay of btree delete records. (Need to set effective_io_concurrency)

As yet, no confirmation that the attached patch is even relevant. It was
just a wild guess at some tuning, while we wait for further info.

> Thanks for your further help.

"Some kind of contention" is best we can say at present.

-- Simon Riggs           www.2ndQuadrant.com
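
For anyone who wants to reproduce this kind of response-time profile:
pgbench -l writes one log line per transaction, and in the 9.0-era
format the third whitespace-separated field is the transaction latency
in microseconds (an assumption worth checking against the pgbench
documentation for your version). A small standalone sketch that sorts
those latencies and prints percentiles like the ones quoted above:

/*
 * latprof.c - compute a latency profile from "pgbench -l" logs.
 * Assumes the third field of each line is latency in microseconds.
 *
 * Build: gcc -O2 latprof.c -o latprof
 * Usage: ./latprof pgbench_log.*
 */
#include <stdio.h>
#include <stdlib.h>

static int
cmp_long(const void *a, const void *b)
{
    long        la = *(const long *) a;
    long        lb = *(const long *) b;

    return (la > lb) - (la < lb);
}

int
main(int argc, char **argv)
{
    long       *lat = NULL;
    size_t      n = 0,
                cap = 0;
    double      pcts[] = {0.50, 0.90, 0.99, 0.999, 0.9999};
    int         i;

    for (i = 1; i < argc; i++)
    {
        FILE       *fp = fopen(argv[i], "r");
        int         client,
                    xact;
        long        usec;

        if (!fp)
        {
            perror(argv[i]);
            continue;
        }
        /* only the first three fields matter for the profile */
        while (fscanf(fp, "%d %d %ld%*[^\n]", &client, &xact, &usec) == 3)
        {
            if (n == cap)
            {
                long       *tmp;

                cap = cap ? cap * 2 : 65536;
                tmp = realloc(lat, cap * sizeof(long));
                if (!tmp)
                {
                    perror("realloc");
                    return 1;
                }
                lat = tmp;
            }
            lat[n++] = usec;
        }
        fclose(fp);
    }

    if (n == 0)
    {
        fprintf(stderr, "no samples found\n");
        return 1;
    }

    qsort(lat, n, sizeof(long), cmp_long);

    for (i = 0; i < 5; i++)
        printf("%7.2f%%  < %ld us\n", pcts[i] * 100.0,
               lat[(size_t) (pcts[i] * (n - 1))]);
    printf("max        %ld us  (%zu samples)\n", lat[n - 1], n);
    return 0;
}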



Re: testing HS/SR - 1 vs 2 performance

From
Robert Haas
Date:
On Fri, Apr 23, 2010 at 11:14 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, 2010-04-22 at 23:45 +0100, Simon Riggs wrote:
>> On Thu, 2010-04-22 at 20:39 +0200, Erik Rijkers wrote:
>> > On Sun, April 18, 2010 13:01, Simon Riggs wrote:
>>
>> > any comment is welcome...
>>
>> Please can you re-run with -l and post me the file of times
>
> Erik has sent me details of a test run. My analysis of that is:
>
> I'm seeing the response time profile on the standby as
> 99% <110us
> 99.9% <639us
> 99.99% <615ms
>
> 0.052% (52 samples) are >5ms elapsed and account for 24 s, which is
> about 45% of elapsed time.
>
> Of the 52 samples >5ms, 50 of them are >100ms and 2 >1s.
>
> 99% of transactions happen in similar times between primary and standby,
> everything dragged down by rare but severe spikes.
>
> We're looking for something that would delay something that normally
> takes <0.1ms into something that takes >100ms, yet does eventually
> return. That looks like a severe resource contention issue.

Wow.  Good detective work.

...Robert


Re: testing HS/SR - 1 vs 2 performance

From
Marko Kreen
Date:
On 4/18/10, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sat, 2010-04-17 at 16:48 -0400, Tom Lane wrote:
>  > There are some places where we suppose that a *single* write into shared
>  > memory can safely be done without a lock, if we're not too concerned
>  > about how soon other transactions will see the effects.  But what you
>  > are proposing here requires more than one related write.
>  >
>  > I've been burnt by this myself:
>  > http://archives.postgresql.org/pgsql-committers/2008-06/msg00228.php
>
>
> W O W - thank you for sharing.
>
>  What I'm not clear on is why you've used a spinlock everywhere when only
>  weak-memory thang CPUs are a problem. Why not have a weak-memory-protect
>  macro that does nada when the hardware already protects us? (i.e. a
>  spinlock only for the hardware that needs it).

Um, you have been burned by exactly this on x86 also:
 http://archives.postgresql.org/pgsql-hackers/2009-03/msg01265.php

-- 
marko


Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Marko Kreen <markokr@gmail.com> writes:
> Um, you have been burned by exactly this on x86 also:
>   http://archives.postgresql.org/pgsql-hackers/2009-03/msg01265.php

Yeah, we never did figure out exactly how come you were observing that
failure on Intel-ish hardware.  I was under the impression that Intel
machines didn't have weak-memory-ordering behavior.

I wonder whether your compiler had rearranged the code in ProcArrayAdd
so that the increment happened before the array element store at the
machine-code level.  I think it would be entitled to do that under
standard C semantics, since that ProcArrayStruct pointer isn't marked
volatile.
        regards, tom lane
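
The hazard Tom describes can be sketched outside PostgreSQL. The toy
pthreads program below is not the ProcArray code; it just mimics the
"store the array element, then bump the counter" pattern. The barriers
make the publication safe; the comments describe what can go wrong
without them (the compiler may reorder the two stores, and a weak-memory
CPU may let the counter store become visible first even when the
compiler does not), and why volatile alone only restrains the compiler.

/*
 * reorder.c - standalone sketch of the store-ordering hazard in a
 * "publish element, then increment count" pattern.  Not PostgreSQL code.
 *
 * Build: gcc -O2 -pthread reorder.c -o reorder
 */
#include <pthread.h>
#include <stdio.h>

#define NENTRIES_MAX 64

static int entries[NENTRIES_MAX];

/*
 * volatile so the reader's spin loop really re-reads the counter; note
 * that volatile by itself says nothing about the order in which another
 * CPU sees this store relative to the store into entries[].
 */
static volatile int nentries = 0;

static void *
writer(void *arg)
{
    (void) arg;

    /* step 1: fill in the new element */
    entries[0] = 42;

    /*
     * step 2: publish it by bumping the counter.  Without the barrier,
     * the compiler may emit the counter store first (entries[] is not
     * volatile), and a weak-memory CPU (POWER, ARM) may make it visible
     * first even if the compiler does not; x86 happens to preserve
     * store order, which is why such bugs tend to hide there.  The
     * server-side equivalent is doing both stores under a lock.
     */
    __sync_synchronize();       /* full memory barrier (GCC builtin) */
    nentries = 1;
    return NULL;
}

static void *
reader(void *arg)
{
    (void) arg;

    /* wait until the counter claims an element has been published */
    while (nentries == 0)
        ;

    /* matching read-side barrier before looking at the element */
    __sync_synchronize();

    if (entries[0] != 42)
        printf("saw nentries=1 but a stale array element!\n");
    else
        printf("entry published correctly: %d\n", entries[0]);
    return NULL;
}

int
main(void)
{
    pthread_t   w,
                r;

    pthread_create(&r, NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}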


Re: testing HS/SR - 1 vs 2 performance

From
Marko Kreen
Date:
On 4/23/10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Marko Kreen <markokr@gmail.com> writes:
>  > Um, you have been burned by exactly this on x86 also:
>  >   http://archives.postgresql.org/pgsql-hackers/2009-03/msg01265.php
>
>
>  Yeah, we never did figure out exactly how come you were observing that
>  failure on Intel-ish hardware.  I was under the impression that Intel
>  machines didn't have weak-memory-ordering behavior.
>
>  I wonder whether your compiler had rearranged the code in ProcArrayAdd
>  so that the increment happened before the array element store at the
>  machine-code level.  I think it would be entitled to do that under
>  standard C semantics, since that ProcArrayStruct pointer isn't marked
>  volatile.

Sounds likely.

Which seems to hint it's better to handle all processors as weakly
ordered and work with explicit locks/memory barriers, than to sprinkle
the code with 'volatile' to suppress optimizations on Intel and then
still fail on non-Intel.

-- 
marko


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Fri, 2010-04-23 at 11:32 -0400, Robert Haas wrote:
> >
> > 99% of transactions happen in similar times between primary and standby,
> > everything dragged down by rare but severe spikes.
> >
> > We're looking for something that would delay something that normally
> > takes <0.1ms into something that takes >100ms, yet does eventually
> > return. That looks like a severe resource contention issue.
> 
> Wow.  Good detective work.

While we haven't fully established the source of those problems, I am
now happy that these test results don't present any reason to avoid
committing the main patch tested by Erik (not the smaller additional one
I sent). I expect to commit that on Sunday.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Robert Haas
Date:
On Fri, Apr 23, 2010 at 6:39 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Fri, 2010-04-23 at 11:32 -0400, Robert Haas wrote:
>> >
>> > 99% of transactions happen in similar times between primary and standby,
>> > everything dragged down by rare but severe spikes.
>> >
>> > We're looking for something that would delay something that normally
>> > takes <0.1ms into something that takes >100ms, yet does eventually
>> > return. That looks like a severe resource contention issue.
>>
>> Wow.  Good detective work.
>
> While we haven't fully established the source of those problems, I am
> now happy that these test results don't present any reason to avoid
> committing the main patch tested by Erik (not the smaller additional one
> I sent). I expect to commit that on Sunday.

Both Heikki and I objected to that patch.  And apparently it doesn't
fix the problem, either.  So, -1 from me.

...Robert


Re: testing HS/SR - 1 vs 2 performance

From
"Erik Rijkers"
Date:
On Sat, April 24, 2010 00:39, Simon Riggs wrote:
> On Fri, 2010-04-23 at 11:32 -0400, Robert Haas wrote:
>> >
>> > 99% of transactions happen in similar times between primary and standby,
>> > everything dragged down by rare but severe spikes.
>> >
>> > We're looking for something that would delay something that normally
>> > takes <0.1ms into something that takes >100ms, yet does eventually
>> > return. That looks like a severe resource contention issue.
>>
>> Wow.  Good detective work.
>
> While we haven't fully established the source of those problems, I am
> now happy that these test results don't present any reason to avoid
>> committing the main patch tested by Erik (not the smaller additional one
> I sent). I expect to commit that on Sunday.
>

yes, that (main) patch seems to have largely
closed the gap between primary and standby; here
are some results from a lower scale (10):
 scale: 10
clients: 10, 20, 40, 60, 90
for each: 4x primary, 4x standby:  (6565=primary, 6566=standby)
-----
scale: 10      clients: 10  tps = 27624.339871  pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 27604.261750  pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 28015.093466  pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 28422.561280  pgbench -p 6565 -n -S -c 10 -T 900 -j 1

scale: 10      clients: 10  tps = 27254.806526  pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 27686.470866  pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 28078.904035  pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 27101.622337  pgbench -p 6566 -n -S -c 10 -T 900 -j 1

-----
scale: 10      clients: 20  tps = 23106.795587  pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 23101.681155  pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 22893.364004  pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 23038.577109  pgbench -p 6565 -n -S -c 20 -T 900 -j 1

scale: 10      clients: 20  tps = 22903.578552  pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 22970.691946  pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 22999.473318  pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 22884.854749  pgbench -p 6566 -n -S -c 20 -T 900 -j 1

-----
scale: 10      clients: 40  tps = 23522.499429  pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 10      clients: 40  tps = 23611.319191  pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 10      clients: 40  tps = 23616.905302  pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 10      clients: 40  tps = 23572.213990  pgbench -p 6565 -n -S -c 40 -T 900 -j 1

scale: 10      clients: 40  tps = 23714.721220  pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 10      clients: 40  tps = 23711.781175  pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 10      clients: 40  tps = 23691.867023  pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 10      clients: 40  tps = 23691.699231  pgbench -p 6566 -n -S -c 40 -T 900 -j 1

-----
scale: 10      clients: 60  tps = 21987.497095  pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 10      clients: 60  tps = 21950.344204  pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 10      clients: 60  tps = 22006.461447  pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 10      clients: 60  tps = 21824.071303  pgbench -p 6565 -n -S -c 60 -T 900 -j 1

scale: 10      clients: 60  tps = 22149.415231  pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 10      clients: 60  tps = 22211.064402  pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 10      clients: 60  tps = 22164.238081  pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 10      clients: 60  tps = 22174.585736  pgbench -p 6566 -n -S -c 60 -T 900 -j 1

-----
scale: 10      clients: 90  tps = 18751.213002  pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 10      clients: 90  tps = 18757.115811  pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 10      clients: 90  tps = 18692.942329  pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 10      clients: 90  tps = 18765.390154  pgbench -p 6565 -n -S -c 90 -T 900 -j 1

scale: 10      clients: 90  tps = 18929.462104  pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 10      clients: 90  tps = 18999.851184  pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 10      clients: 90  tps = 18972.321607  pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 10      clients: 90  tps = 18924.058827  pgbench -p 6566 -n -S -c 90 -T 900 -j 1


The higher scales still show that other standby slowness.  It may be a
caching effect (as Mark Kirkwood suggested): the primary's data is
already cached because of the initial create, while the standby's data
has to be read from disk for the first time.

Does that make sense?

I will try to confirm this.







Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Fri, 2010-04-23 at 19:07 -0400, Robert Haas wrote:
> On Fri, Apr 23, 2010 at 6:39 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On Fri, 2010-04-23 at 11:32 -0400, Robert Haas wrote:
> >> >
> >> > 99% of transactions happen in similar times between primary and standby,
> >> > everything dragged down by rare but severe spikes.
> >> >
> >> > We're looking for something that would delay something that normally
> >> > takes <0.1ms into something that takes >100ms, yet does eventually
> >> > return. That looks like a severe resource contention issue.
> >>
> >> Wow.  Good detective work.
> >
> > While we haven't fully established the source of those problems, I am
> > now happy that these test results don't present any reason to avoid
> > committing the main patch tested by Erik (not the smaller additional one
> > I sent). I expect to commit that on Sunday.
> 
> Both Heikki and I objected to that patch.  

Please explain your objection, based upon the patch and my explanations.

> And apparently it doesn't
> fix the problem, either.  So, -1 from me.

There is an issue observed in Erik's later tests, but my interpretation
of the results so far is that the sorted array patch successfully
removes the initially reported loss of performance.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Robert Haas
Date:
On Sat, Apr 24, 2010 at 5:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Both Heikki and I objected to that patch.
>
> Please explain your objection, based upon the patch and my explanations.

Well, we objected to the locking.  Having reread the patch a few times
though, I think I'm starting to wrap my head around it so, I don't
know, maybe it's OK.  Have you tested grabbing the ProcArrayLock in
exclusive mode instead of having a separate spinlock, to see how that
performs?

>> And apparently it doesn't
>> fix the problem, either.  So, -1 from me.
>
> There is an issue observed in Erik's later tests, but my interpretation
> of the results so far is that the sorted array patch successfully
> removes the initially reported loss of performance.

Is it possible the remaining spikes are due to fights over the spinlock?

...Robert


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Sun, 2010-04-25 at 06:53 -0400, Robert Haas wrote:
> On Sat, Apr 24, 2010 at 5:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >> Both Heikki and I objected to that patch.
> >
> > Please explain your objection, based upon the patch and my explanations.
> 
> Well, we objected to the locking.  Having reread the patch a few times
> though, I think I'm starting to wrap my head around it so, I don't
> know, maybe it's OK.  Have you tested grabbing the ProcArrayLock in
> exclusive mode instead of having a separate spinlock, to see how that
> performs?

They both perform well. Heikki and I agree that spinlocks perform well.

The point of taking the ProcArrayLock in Shared rather than Exclusive
mode is the reduction in contention that gives. Adding a
KnownAssignedXid is exactly analogous to starting a transaction. We go
to some trouble to avoid taking a lock when we start a transaction and I
want to do a similar thing on the standby. I think it's even more
important we reduce contention on the standby than it is on the primary
because if the Startup process is delayed then it slows down recovery.

> >> And apparently it doesn't
> >> fix the problem, either.  So, -1 from me.
> >
> > There is an issue observed in Erik's later tests, but my interpretation
> > of the results so far is that the sorted array patch successfully
> > removes the initially reported loss of performance.
> 
> Is it possible the remaining spikes are due to fights over the spinlock?

That one specific spinlock? I considered it. I think it's doubtful.

The spikes all involved a jump from <5ms to much more than that, often
as long as a second. That looks like a pure I/O problem or an LWlock
held across a slow I/O or platform specific issues of some kind.

Erik's tests were with 1 backend, so that would mean that at most 2
processes might want the spinlock you mention. The spinlock is only
taken at snapshot time, so once per transaction in Erik's tests. So
if the spinlock were the problem then one or other of the processes
would need to grab and hold the spinlock for a very long time and then
release it sometime later. Since the spinlock code has been with us for
some time, I doubt there is anything wrong there.

I don't have any evidence as to whether the problem is solely on Erik's
system, nor whether it is an issue already in the codebase or introduced
by this patch. It seems unlikely to me that this patch introduces the
problem, and even more unlikely that the issue is caused by two
processes fighting over that particular spinlock. 

The patch is very effective at reducing overall cost of taking snapshots
and so I think it needs to be applied to allow more performance data to
be collected. I don't suppose this is the last performance tuning we
will need to do.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Robert Haas
Date:
On Sun, Apr 25, 2010 at 8:50 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sun, 2010-04-25 at 06:53 -0400, Robert Haas wrote:
>> On Sat, Apr 24, 2010 at 5:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> >> Both Heikki and I objected to that patch.
>> >
>> > Please explain your objection, based upon the patch and my explanations.
>>
>> Well, we objected to the locking.  Having reread the patch a few times
>> though, I think I'm starting to wrap my head around it so, I don't
>> know, maybe it's OK.  Have you tested grabbing the ProcArrayLock in
>> exclusive mode instead of having a separate spinlock, to see how that
>> performs?
>
> They both perform well. Heikki and I agree that spinlocks perform well.
>
> The point of taking the ProcArrayLock in Shared rather than Exclusive
> mode is the reduction in contention that gives. Adding a
> KnownAssignedXid is exactly analogous to starting a transaction. We go
> to some trouble to avoid taking a lock when we start a transaction and I
> want to do a similar thing on the standby. I think it's even more
> important we reduce contention on the standby than it is on the primary
> because if the Startup process is delayed then it slows down recovery.
>
>> >> And apparently it doesn't
>> >> fix the problem, either.  So, -1 from me.
>> >
>> > There is an issue observed in Erik's later tests, but my interpretation
>> > of the results so far is that the sorted array patch successfully
>> > removes the initially reported loss of performance.
>>
>> Is it possible the remaining spikes are due to fights over the spinlock?
>
> That one specific spinlock? I considered it. I think it's doubtful.
>
> The spikes all involved a jump from <5ms to much more than that, often
> as long as a second. That looks like a pure I/O problem or an LWlock
> held across a slow I/O or platform specific issues of some kind.
>
> Erik's tests were with 1 backend, so that would mean that at most 2
> processes might want the spinlock you mention. The spinlock is only
> taken at snapshot time, so once per transaction in Erik's tests. So
> if the spinlock were the problem then one or other of the processes
> would need to grab and hold the spinlock for a very long time and then
> release it sometime later. Since the spinlock code has been with us for
> some time, I doubt there is anything wrong there.
>
> I don't have any evidence as to whether the problem is solely on Erik's
> system, nor whether it is an issue already in the codebase or introduced
> by this patch. It seems unlikely to me that this patch introduces the
> problem, and even more unlikely that the issue is caused by two
> processes fighting over that particular spinlock.
>
> The patch is very effective at reducing overall cost of taking snapshots
> and so I think it needs to be applied to allow more performance data to
> be collected. I don't suppose this is the last performance tuning we
> will need to do.

OK, fair enough.  I withdraw my objections.

...Robert


Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Apr 24, 2010 at 5:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Both Heikki and I objected to that patch.
>> 
>> Please explain your objection, based upon the patch and my explanations.

> Well, we objected to the locking.

Not only is the locking overly complex, but it's outright wrong;
or at least the comments are.  KnownAssignedXidsAdd in particular
has a comment that is 100% misleading, and an actual behavior that
I find extremely likely to provoke bugs (eg, consider caller calling
it twice under the illusion it still has only shared lock).  I'm not
even convinced that it works correctly, since it will happily write
into KnownAssignedXids[] while holding only shared ProcArrayLock.
What happens when two processes are doing that concurrently?

This needs a redesign before it can be considered committable.  I don't
really care whether it makes things faster; it's too complicated and too
poorly documented to be maintainable.
        regards, tom lane


Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> [ v2 patch ]

BTW, while I'm looking at this, why bother with the separate
KnownAssignedXidsValid[] array?  Wouldn't it be cheaper
(certainly so in storage, probably so in access/update times)
to have just the KnownAssignedXids[] array and store
InvalidTransactionId in unused entries?
        regards, tom lane


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Sun, 2010-04-25 at 11:50 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Sat, Apr 24, 2010 at 5:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > Both Heikki and I objected to that patch.
> >> 
> >> Please explain your objection, based upon the patch and my explanations.
> 
> > Well, we objected to the locking.
> 
> Not only is the locking overly complex, but it's outright wrong;
> or at least the comments are.  KnownAssignedXidsAdd in particular
> has a comment that is 100% misleading, and an actual behavior that
> I find extremely likely to provoke bugs (eg, consider caller calling
> it twice under the illusion it stlll has only shared lock).  

I will reword that comment.

> I'm not
> even convinced that it works correctly, since it will happily write
> into KnownAssignedXids[] while holding only shared ProcArrayLock.
> What happens when two processes are doing that concurrently?

Heikki and I discussed this upthread. Only the Startup process ever
calls KnownAssignedXids, so the situation you describe does not happen.
We currently rely on the fact that we have only a single writer in other
places in Hot Standby, so this is not a restriction we are imposing on
the system just to support this change.

Both you and Heikki have previously spoken against having recovery run
in parallel, so the likelihood of having multiple readers in the future
seems low.

If we were to allow multiple callers of the routine in future, the use
of spinlocks to protect the head of the queue means we can safely
reserve slots and so KnownAssignedXidsAdd can easily be made to accept
multiple writers in the future. I think it's possible to future-proof
that now without significant change or performance drop, so I'll do
that.

> This needs a redesign before it can be considered committable.  I don't
> really care whether it makes things faster; it's too complicated and too
> poorly documented to be maintainable.

It is sufficiently complex to achieve its objective, which is keeping
track of xids during recovery, using the same locking as we do during
normal running. This is the third re-write of the code, developed over
more than 20 months and specifically modularised by me to make future
changes drop-in replacements. Initially we had an insert-sorted array,
then a hash table and now this current proposal. Recently, Heikki and I
independently came up with almost exactly the same design, with the
algorithmic characteristics we need for performance. The call points,
frequency and workload of calls are documented and well understood by
multiple people.

The only remaining point of discussion has been the code that takes
ProcArrayLock in shared mode. The difference being discussed here is
whether we access the head and tail of the array with Shared mode plus
the spinlock, or with just Exclusive mode. The patch would be almost
identical with that change, but would not meet its locking objectives.
So making the locking stricter isn't going to make it more maintainable
by a measurable amount.

There are more than 60 lines of header comment explaining in detail how
this works, with a full algorithmic analysis. The remaining code is
commented to project standards, with easily more than 100 lines of
comments. 

I will recheck every comment and see if other comments can be usefully
added, and further review the code to see if there's anything that can
be done to improve it further. I was going to do this anyway, since I
will be adding an agreed Assert() into the patch.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Sun, 2010-04-25 at 11:50 -0400, Tom Lane wrote:
>> This needs a redesign before it can be considered committable.  I don't
>> really care whether it makes things faster; it's too complicated and too
>> poorly documented to be maintainable.

> There are more than 60 lines of header comment explaining in detail how
> this works, with a full algorithmic analysis. The remaining code is
> commented to project standards, with easily more than 100 lines of
> comments. 

If the comments were correct, I wouldn't be complaining.  They're
misleading or outright wrong on many points.  In particular, I don't
think you actually understand the weak-memory-ordering issue, because
the comments about that are entirely wrong.  I don't think they are
adequate on other points either --- for example, I now understand why my
complaint of a few minutes ago about KnownAssignedXidsValid is wrong,
but the comments certainly didn't help me on that.
        regards, tom lane


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Sun, 2010-04-25 at 12:43 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > [ v2 patch ]
> 
> BTW, while I'm looking at this, why bother with the separate
> KnownAssignedXidsValid[] array?  Wouldn't it be cheaper
> (certainly so in storage, probably so in access/update times)
> to have just the KnownAssignedXids[] array and store
> InvalidTransactionId in unused entries?

Well, that was exactly my first design.

Heikki came up with the additional array and I think it is an inspired
suggestion, because it allows the search() to use a simple binary
search. I attempted to write the binary-search-with-holes and it was
going to be very ugly code, so I went for another algorithm which was
also quite cool, but not as cool as Heikki's idea - which I was going to
give credit for.

The array adds 520*max_connections bytes of shared memory, but seems a
good tradeoff for the additional performance it allows. I guess we could
use a bit array, but that's slower to access. The bit array would be
9*max_connections bytes of shared memory, so a 40kB saving.

-- Simon Riggs           www.2ndQuadrant.com
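
The shape of the structure under discussion is easy to sketch outside
the server. The toy code below is not the actual KnownAssignedXids
implementation (names and sizes are invented, and all locking is
omitted); it only shows why a parallel valid[] array keeps the lookup
simple: removed entries keep their slots, so the xid array stays dense
and ordered, a plain binary search still works, and the flag is checked
after the search.

/*
 * kax_sketch.c - toy model of an ordered xid array with head/tail
 * offsets and a parallel validity array.  Illustration only.
 */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TransactionId;

#define KAX_SIZE 1024

static TransactionId kax_xids[KAX_SIZE];
static bool kax_valid[KAX_SIZE];
static int  kax_tail = 0;       /* oldest slot still in use */
static int  kax_head = 0;       /* next free slot */

/* append a new xid; the caller guarantees xids arrive in ascending order */
static bool
kax_add(TransactionId xid)
{
    if (kax_head >= KAX_SIZE)
        return false;           /* the real code compresses the array */
    kax_xids[kax_head] = xid;
    kax_valid[kax_head] = true;
    kax_head++;
    return true;
}

/* binary search over [tail, head); returns the slot index or -1 */
static int
kax_find(TransactionId xid)
{
    int         lo = kax_tail;
    int         hi = kax_head - 1;

    while (lo <= hi)
    {
        int         mid = (lo + hi) / 2;

        if (kax_xids[mid] == xid)
            return mid;
        if (kax_xids[mid] < xid)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;
}

static bool
kax_is_running(TransactionId xid)
{
    int         slot = kax_find(xid);

    return slot >= 0 && kax_valid[slot];
}

/* removal just clears the flag; the slot keeps its place in the order */
static void
kax_remove(TransactionId xid)
{
    int         slot = kax_find(xid);

    if (slot >= 0)
        kax_valid[slot] = false;
}

int
main(void)
{
    kax_add(100);
    kax_add(101);
    kax_add(105);
    kax_remove(101);
    printf("100 running? %d\n", kax_is_running(100));   /* 1 */
    printf("101 running? %d\n", kax_is_running(101));   /* 0 */
    printf("103 running? %d\n", kax_is_running(103));   /* 0 */
    return 0;
}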



Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Sun, 2010-04-25 at 12:51 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Sun, 2010-04-25 at 11:50 -0400, Tom Lane wrote:
> >> This needs a redesign before it can be considered committable.  I don't
> >> really care whether it makes things faster; it's too complicated and too
> >> poorly documented to be maintainable.
> 
> > There are more than 60 lines of header comment explaining in detail how
> > this works, with a full algorithmic analysis. The remaining code is
> > commented to project standards, with easily more than 100 lines of
> > comments. 
> 
> If the comments were correct, I wouldn't be complaining.  They're
> misleading or outright wrong on many points.  In particular, I don't
> think you actually understand the weak-memory-ordering issue, because
> the comments about that are entirely wrong.  

The comment says: "on CPUs with weak-memory ordering we can't reliably
move pointers atomically, so the rule is that updates of head and tail
of the array require ProcArrayLock in exclusive mode or (shared mode and
known_assigned_xids_lck spinlock)"

I will reword this, so it is clear that I'm talking about the head and
tail of the array, not pointers in general.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Sun, 2010-04-25 at 12:51 -0400, Tom Lane wrote:
>> If the comments were correct, I wouldn't be complaining.  They're
>> misleading or outright wrong on many points.  In particular, I don't
>> think you actually understand the weak-memory-ordering issue, because
>> the comments about that are entirely wrong.  

> The comment says: "on CPUs with weak-memory ordering we can't reliably
> move pointers atomically, so the rule is that updates of head and tail
> of the array require ProcArrayLock in exclusive mode or (shared mode and
> known_assigned_xids_lck spinlock)"

> I will reword this, so it is clear that I'm talking about the head and
> tail of the array, not pointers in general.

It's not about whether the pointers can be assigned atomically; on most
hardware they can.  It's about whether other processors will see that
happen in the correct sequence relative to the changes in the array
elements.

If you like I'll have a go at rewriting the comments for this patch,
because I am currently thinking that the problem is not so much with
the code as with the poor explanation of what it's doing.  Sometimes
the author is too close to the code to understand why other people
have a hard time understanding it.
        regards, tom lane


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Sun, 2010-04-25 at 13:33 -0400, Tom Lane wrote:

> If you like I'll have a go at rewriting the comments for this patch,
> because I am currently thinking that the problem is not so much with
> the code as with the poor explanation of what it's doing.  Sometimes
> the author is too close to the code to understand why other people
> have a hard time understanding it.

That would help me, thank you.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Sun, 2010-04-25 at 13:33 -0400, Tom Lane wrote:
>> If you like I'll have a go at rewriting the comments for this patch,
>> because I am currently thinking that the problem is not so much with
>> the code as with the poor explanation of what it's doing.  Sometimes
>> the author is too close to the code to understand why other people
>> have a hard time understanding it.

> That would help me, thank you.

OK.  You said you were currently working some more on the patch, so
I'll wait for v3 and then work on it.
        regards, tom lane


Re: testing HS/SR - 1 vs 2 performance

From
"Erik Rijkers"
Date:
On Sat, April 24, 2010 01:17, Erik Rijkers wrote:
> On Sat, April 24, 2010 00:39, Simon Riggs wrote:
>> On Fri, 2010-04-23 at 11:32 -0400, Robert Haas wrote:
>>> >
>>> > 99% of transactions happen in similar times between primary and standby,
>>> > everything dragged down by rare but severe spikes.
>>> >
>>> > We're looking for something that would delay something that normally
>>> > takes <0.1ms into something that takes >100ms, yet does eventually
>>> > return. That looks like a severe resource contention issue.
>>>
>>> Wow.  Good detective work.
>>
>> While we haven't fully established the source of those problems, I am
>> now happy that these test results don't present any reason to avoid
>> committing the main patch tested by Erik (not the smaller additional one
>> I sent). I expect to commit that on Sunday.
>>
>
> yes, that (main) patch seems to have largely
> closed the gap between primary and standby; here
> are some results from a lower scale (10):

FWIW, here are some more results from pgbench comparing
primary and standby (both with Simon's patch).

Sorry if it's too much data, but to me at least it was illuminating;
I now understand the effects of the different parameters better.



pgbench results, all -S (SELECT-only).
 scale: 10, 100, 500, 1000
clients: 1, 10, 20, 40, 60, 90

for each: 4x primary, 4x standby, for 15 minutes:

(port 6565=primary, 6566=standby)

scale: 10      clients:  1  tps =  6404.768132  pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10      clients:  1  tps =  6435.072983  pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10      clients:  1  tps =  6458.958549  pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10      clients:  1  tps =  6432.088099  pgbench -p 6565 -n -S -c 1 -T 900 -j 1

scale: 10      clients:  1  tps =  6442.519035  pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10      clients:  1  tps =  6449.837931  pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10      clients:  1  tps =  6419.879026  pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10      clients:  1  tps =  6428.290175  pgbench -p 6566 -n -S -c 1 -T 900 -j 1

---
scale: 10      clients: 10  tps = 27624.339871  pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 27604.261750  pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 28015.093466  pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 28422.561280  pgbench -p 6565 -n -S -c 10 -T 900 -j 1

scale: 10      clients: 10  tps = 27254.806526  pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 27686.470866  pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 28078.904035  pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10      clients: 10  tps = 27101.622337  pgbench -p 6566 -n -S -c 10 -T 900 -j 1

----
scale: 10      clients: 20  tps = 23106.795587  pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 23101.681155  pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 22893.364004  pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 23038.577109  pgbench -p 6565 -n -S -c 20 -T 900 -j 1

scale: 10      clients: 20  tps = 22903.578552  pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 22970.691946  pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 22999.473318  pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10      clients: 20  tps = 22884.854749  pgbench -p 6566 -n -S -c 20 -T 900 -j 1

----
scale: 10      clients: 40  tps = 23522.499429  pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 10      clients: 40  tps = 23611.319191  pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 10      clients: 40  tps = 23616.905302  pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 10      clients: 40  tps = 23572.213990  pgbench -p 6565 -n -S -c 40 -T 900 -j 1

scale: 10      clients: 40  tps = 23714.721220  pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 10      clients: 40  tps = 23711.781175  pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 10      clients: 40  tps = 23691.867023  pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 10      clients: 40  tps = 23691.699231  pgbench -p 6566 -n -S -c 40 -T 900 -j 1

----
scale: 10      clients: 60  tps = 21987.497095  pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 10      clients: 60  tps = 21950.344204  pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 10      clients: 60  tps = 22006.461447  pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 10      clients: 60  tps = 21824.071303  pgbench -p 6565 -n -S -c 60 -T 900 -j 1

scale: 10      clients: 60  tps = 22149.415231  pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 10      clients: 60  tps = 22211.064402  pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 10      clients: 60  tps = 22164.238081  pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 10      clients: 60  tps = 22174.585736  pgbench -p 6566 -n -S -c 60 -T 900 -j 1

----
scale: 10      clients: 90  tps = 18751.213002  pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 10      clients: 90  tps = 18757.115811  pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 10      clients: 90  tps = 18692.942329  pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 10      clients: 90  tps = 18765.390154  pgbench -p 6565 -n -S -c 90 -T 900 -j 1

scale: 10      clients: 90  tps = 18929.462104  pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 10      clients: 90  tps = 18999.851184  pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 10      clients: 90  tps = 18972.321607  pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 10      clients: 90  tps = 18924.058827  pgbench -p 6566 -n -S -c 90 -T 900 -j 1

----
scale: 100     clients:  1  tps =  5854.489684  pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100     clients:  1  tps =  5852.679267  pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100     clients:  1  tps =  5844.005437  pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100     clients:  1  tps =  5867.593004  pgbench -p 6565 -n -S -c 1 -T 900 -j 1

scale: 100     clients:  1  tps =  1111.368215  pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100     clients:  1  tps =  1187.678163  pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100     clients:  1  tps =  1332.544440  pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100     clients:  1  tps =  1488.574446  pgbench -p 6566 -n -S -c 1 -T 900 -j 1

----
scale: 100     clients: 10  tps = 28948.718439  pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100     clients: 10  tps = 28865.280357  pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100     clients: 10  tps = 28819.426232  pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100     clients: 10  tps = 28868.070716  pgbench -p 6565 -n -S -c 10 -T 900 -j 1

scale: 100     clients: 10  tps =  1839.477133  pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100     clients: 10  tps =  2056.574852  pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100     clients: 10  tps =  2501.715270  pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100     clients: 10  tps =  3114.288871  pgbench -p 6566 -n -S -c 10 -T 900 -j 1

----
scale: 100     clients: 20  tps = 21561.395664  pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100     clients: 20  tps = 21482.860950  pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100     clients: 20  tps = 21447.896216  pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100     clients: 20  tps = 21450.846824  pgbench -p 6565 -n -S -c 20 -T 900 -j 1

scale: 100     clients: 20  tps =  4309.312692  pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100     clients: 20  tps =  6786.460948  pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100     clients: 20  tps = 16552.535443  pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100     clients: 20  tps = 21513.519274  pgbench -p 6566 -n -S -c 20 -T 900 -j 1

----
scale: 100     clients: 40  tps = 20812.709326  pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 100     clients: 40  tps = 20750.911233  pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 100     clients: 40  tps = 20764.921432  pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 100     clients: 40  tps = 20877.705726  pgbench -p 6565 -n -S -c 40 -T 900 -j 1

scale: 100     clients: 40  tps = 20818.028682  pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 100     clients: 40  tps = 20876.921640  pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 100     clients: 40  tps = 20912.855749  pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 100     clients: 40  tps = 20821.271974  pgbench -p 6566 -n -S -c 40 -T 900 -j 1

----
scale: 100     clients: 60  tps = 18746.870683  pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 100     clients: 60  tps = 18752.631011  pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 100     clients: 60  tps = 18717.634243  pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 100     clients: 60  tps = 18776.138286  pgbench -p 6565 -n -S -c 60 -T 900 -j 1

scale: 100     clients: 60  tps = 18850.133061  pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 100     clients: 60  tps = 18893.463427  pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 100     clients: 60  tps = 18918.599124  pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 100     clients: 60  tps = 18903.389147  pgbench -p 6566 -n -S -c 60 -T 900 -j 1


----
scale: 100     clients: 90  tps = 15915.637099  pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 100     clients: 90  tps = 15945.810305  pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 100     clients: 90  tps = 15894.238941  pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 100     clients: 90  tps = 15926.422587  pgbench -p 6565 -n -S -c 90 -T 900 -j 1

scale: 100     clients: 90  tps = 16025.802748  pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 100     clients: 90  tps = 16049.977788  pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 100     clients: 90  tps = 16053.612396  pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 100     clients: 90  tps = 16040.376994  pgbench -p 6566 -n -S -c 90 -T 900 -j 1

----
scale: 500     clients:  1  tps =  5646.892935  pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps =  5642.544001  pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps =  5633.282536  pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps =  5627.330477  pgbench -p 6565 -n -S -c 1 -T 900 -j 1

scale: 500     clients:  1  tps =  5625.659921  pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps =  5645.898120  pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps =  5644.897739  pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500     clients:  1  tps =  5640.990172  pgbench -p 6566 -n -S -c 1 -T 900 -j 1

----
scale: 500     clients: 10  tps = 28123.467099  pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500     clients: 10  tps = 28230.031309  pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500     clients: 10  tps = 28054.072374  pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500     clients: 10  tps = 28129.229553  pgbench -p 6565 -n -S -c 10 -T 900 -j 1

scale: 500     clients: 10  tps = 28187.938585  pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500     clients: 10  tps = 28372.950289  pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500     clients: 10  tps = 28161.909825  pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500     clients: 10  tps = 28224.336096  pgbench -p 6566 -n -S -c 10 -T 900 -j 1

----
scale: 500     clients: 20  tps = 20932.340860  pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps = 20990.560386  pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps = 20926.001588  pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps = 20903.012936  pgbench -p 6565 -n -S -c 20 -T 900 -j 1

scale: 500     clients: 20  tps = 20956.917213  pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps = 21007.634021  pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps = 20982.386759  pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500     clients: 20  tps = 21002.399320  pgbench -p 6566 -n -S -c 20 -T 900 -j 1

----
scale: 500     clients: 40  tps = 19829.483373  pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 500     clients: 40  tps = 19830.187082  pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 500     clients: 40  tps = 19757.072979  pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 500     clients: 40  tps = 19734.710111  pgbench -p 6565 -n -S -c 40 -T 900 -j 1

scale: 500     clients: 40  tps = 19862.051804  pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 500     clients: 40  tps = 19834.071543  pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 500     clients: 40  tps = 19885.788466  pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 500     clients: 40  tps = 19961.407411  pgbench -p 6566 -n -S -c 40 -T 900 -j 1

----
scale: 500     clients: 60  tps = 17829.256242  pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 500     clients: 60  tps = 17816.152137  pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 500     clients: 60  tps = 17828.589859  pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 500     clients: 60  tps = 17748.044912  pgbench -p 6565 -n -S -c 60 -T 900 -j 1

scale: 500     clients: 60  tps = 17926.708964  pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 500     clients: 60  tps = 17943.758910  pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 500     clients: 60  tps = 17917.713556  pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 500     clients: 60  tps = 17932.597396  pgbench -p 6566 -n -S -c 60 -T 900 -j 1

----
scale: 500     clients: 90  tps = 15044.660043  pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 500     clients: 90  tps = 15010.685976  pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 500     clients: 90  tps = 15045.828142  pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 500     clients: 90  tps = 14976.636159  pgbench -p 6565 -n -S -c 90 -T 900 -j 1

scale: 500     clients: 90  tps = 15154.402300  pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 500     clients: 90  tps = 15131.532084  pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 500     clients: 90  tps = 15154.148091  pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 500     clients: 90  tps = 15130.337148  pgbench -p 6566 -n -S -c 90 -T 900 -j 1

----
scale: 1000    clients:  1  tps =   222.307686  pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000    clients:  1  tps =   261.163445  pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000    clients:  1  tps =   307.343862  pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000    clients:  1  tps =   367.285821  pgbench -p 6565 -n -S -c 1 -T 900 -j 1

scale: 1000    clients:  1  tps =   550.457516  pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000    clients:  1  tps =  4347.697078  pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000    clients:  1  tps =  5601.036491  pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000    clients:  1  tps =  5604.021799  pgbench -p 6566 -n -S -c 1 -T 900 -j 1

----
scale: 1000    clients: 10  tps =  3370.534977  pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000    clients: 10  tps = 23156.038162  pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000    clients: 10  tps = 28029.755911  pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000    clients: 10  tps = 28179.609982  pgbench -p 6565 -n -S -c 10 -T 900 -j 1

scale: 1000    clients: 10  tps =  2880.404087  pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000    clients: 10  tps = 13593.595875  pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000    clients: 10  tps = 28033.649640  pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000    clients: 10  tps = 28115.126500  pgbench -p 6566 -n -S -c 10 -T 900 -j 1

----
scale: 1000    clients: 20  tps =  4876.882880  pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000    clients: 20  tps = 21262.273826  pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000    clients: 20  tps = 20403.533099  pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000    clients: 20  tps = 20753.634216  pgbench -p 6565 -n -S -c 20 -T 900 -j 1

scale: 1000    clients: 20  tps =  3024.129307  pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000    clients: 20  tps = 18786.862049  pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000    clients: 20  tps = 20807.372965  pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000    clients: 20  tps = 20896.373925  pgbench -p 6566 -n -S -c 20 -T 900 -j 1

----
scale: 1000    clients: 40  tps =  5910.319107  pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 1000    clients: 40  tps = 19705.088545  pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 1000    clients: 40  tps = 19658.237802  pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 1000    clients: 40  tps = 19593.955166  pgbench -p 6565 -n -S -c 40 -T 900 -j 1

scale: 1000    clients: 40  tps =  6555.937102  pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 1000    clients: 40  tps = 19845.691237  pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 1000    clients: 40  tps = 19755.112514  pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 1000    clients: 40  tps = 19765.648969  pgbench -p 6566 -n -S -c 40 -T 900 -j 1

----
scale: 1000    clients: 60  tps =  7104.451848  pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 1000    clients: 60  tps = 17638.393625  pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 1000    clients: 60  tps = 17621.464080  pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 1000    clients: 60  tps = 17538.189171  pgbench -p 6565 -n -S -c 60 -T 900 -j 1

scale: 1000    clients: 60  tps =  6721.909857  pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 1000    clients: 60  tps = 17798.542731  pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 1000    clients: 60  tps = 17797.145061  pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 1000    clients: 60  tps = 17739.437864  pgbench -p 6566 -n -S -c 60 -T 900 -j 1

----
scale: 1000    clients: 90  tps =  6682.605788  pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 1000    clients: 90  tps = 14853.076617  pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 1000    clients: 90  tps = 14851.040207  pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 1000    clients: 90  tps = 14859.575164  pgbench -p 6565 -n -S -c 90 -T 900 -j 1

scale: 1000    clients: 90  tps =  6208.586575  pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 1000    clients: 90  tps = 15065.536476  pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 1000    clients: 90  tps = 15033.388676  pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 1000    clients: 90  tps = 15030.808207  pgbench -p 6566 -n -S -c 90 -T 900 -j 1

Hardware: 2x Xeon (quadcore) 5482; 32 GB; Areca 1280ML. CentOS release 5.4 (Final), x86_64 Linux, 2.6.18-164.el5

Both instances on a raid volume of 12 disks (sata 7200)
(and of course, server otherwise idle)



hth,

Erik Rijkers






Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
"Erik Rijkers" <er@xs4all.nl> writes:
> FWIW, here are some more results from pgbench comparing
> primary and standby (both with Simon's patch).

That seems weird.  Why do most of the runs show primary and standby
as having comparable speed, but a few show the standby as much slower?
The parameters for those runs don't seem obviously different from cases
where it's fast.  I think there might have been something else going on
on the standby during those runs.  Or do you think those represent
cases where the mystery slowdown event happened?
        regards, tom lane


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Sun, 2010-04-25 at 20:25 +0200, Erik Rijkers wrote:

> Sorry if it's too much data, but to me at least it was illuminating;
> I now understand the effects of the different parameters better.

That's great, many thanks.

A few observations

* Standby performance is actually slightly above normal running. This is
credible because of the way snapshots are now taken. We don't need to
scan the procarray looking for write transactions, since we know
everything is read only. So we scan just the KnownAssignedXids array,
which, if there is no activity from the primary, will be zero-length, so
snapshots are actually taken much faster in this case on the standby.
The snapshot performance on the standby is O(n), where n is the number
of write transactions "currently" on the primary (transfer delays blur
the word "currently").

* The results for scale factor < 100 are fine, while the results for >100
with few connections get thrown off by long transaction times. With
larger numbers of connections the wait problems seem to go away. It looks
like Erik (and possibly Hot Standby in general) has an I/O problem,
though "from what" is not yet determined. It could be just hardware, or
might be hardware plus other factors.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
"Erik Rijkers"
Date:
On Sun, April 25, 2010 20:55, Tom Lane wrote:
>
> That seems weird.  Why do most of the runs show primary and standby
> as having comparable speed, but a few show the standby as much slower?
> The parameters for those runs don't seem obviously different from cases
> where it's fast.  I think there might have been something else going on
> on the standby during those runs.  Or do you think those represent
> cases where the mystery slowdown event happened?
>

The strange case is the scale 100 standby's slow start, followed by
a steady increase during -c 1, then -c 10, and finally getting up to speed
with -c 20 (and up).  And these slow-but-growing standby series are interspersed
with normal (high-speed) primary series.

I'll try to repeat this pattern on other hardware; although
if my tests were run with faulty hardware I wouldn't know how/why
that would give the above effect (such a 'regular aberration').


testing is more difficult than I thought...


Erik Rijkers



Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> [ v2 patch ]

I've been studying this some more while making notes for improved
comments, and I've about come to the conclusion that having readers
move the tail pointer (at the end of KnownAssignedXidsGetAndSetXmin)
is overly tricky and probably not a performance improvement anyway.
The code is in fact wrong as it stands: it's off-by-one about setting
the new tail value.  And there's potential for contention with multiple
readers all wanting to move the tail pointer at once.  And most
importantly, KnownAssignedXidsSearch can't move the tail pointer so
we might expend many inefficient searches while never moving the tail
pointer.

I think we should get rid of that and just have the two functions that
can mark entries invalid (which they must do with exclusive lock)
advance the tail pointer when they invalidate the current tail element.
Then we have the very simple rule that only the startup process ever
changes this data structure.
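
To make that rule concrete, a rough sketch (simplified, hypothetical names
and no real locking shown; the actual code works on the KnownAssignedXids
array under ProcArrayLock):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct
{
    TransactionId xid;
    bool          valid;        /* false once the xid is known complete */
} KnownXidSlot;

/*
 * Called only by the startup process, which already holds the exclusive
 * lock it needs to mark an entry invalid.  Readers never move the tail,
 * so there is no contention on it.
 */
void
known_xids_invalidate(KnownXidSlot *slots, int *tail, int head, int target)
{
    slots[target].valid = false;

    /* if we just invalidated the current tail element, advance over dead slots */
    while (*tail < head && !slots[*tail].valid)
        (*tail)++;
}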
        regards, tom lane


Re: testing HS/SR - 1 vs 2 performance

From
Fujii Masao
Date:
On Mon, Apr 26, 2010 at 3:25 AM, Erik Rijkers <er@xs4all.nl> wrote:
> FWIW, here are some more results from pgbench comparing
> primary and standby (both with Simon's patch).

Was there a difference in CPU  utilization between the primary
and standby?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Sun, 2010-04-25 at 19:18 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > [ v2 patch ]
> 
> I've been studying this some more while making notes for improved
> comments, and I've about come to the conclusion that having readers
> move the tail pointer (at the end of KnownAssignedXidsGetAndSetXmin)
> is overly tricky and probably not a performance improvement anyway.
> The code is in fact wrong as it stands: it's off-by-one about setting
> the new tail value.  And there's potential for contention with multiple
> readers all wanting to move the tail pointer at once.  

OK, since contention was my concern, I want to avoid that.

> And most
> importantly, KnownAssignedXidsSearch can't move the tail pointer so
> we might expend many inefficient searches while never moving the tail
> pointer.

> I think we should get rid of that and just have the two functions that
> can mark entries invalid (which they must do with exclusive lock)
> advance the tail pointer when they invalidate the current tail element.

OK

> Then we have the very simple rule that only the startup process ever
> changes this data structure.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Sun, 2010-04-25 at 23:52 +0200, Erik Rijkers wrote:

> I'll try to repeat this pattern on other hardware; although
> if my tests were run with faulty hardware I wouldn't know how/why
> that would give the above effect (such a 'regular aberration').

> testing is more difficult than I thought...

Thanks again for your help.

Please can you confirm:
* Are the standby tests run while the primary is completely quiet?
* What OS is this? Can we use dtrace scripts?

Can anyone else confirm these test results: large scale factor and small
number of sessions?

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
"Erik Rijkers"
Date:
On Mon, April 26, 2010 08:52, Fujii Masao wrote:
> On Mon, Apr 26, 2010 at 3:25 AM, Erik Rijkers <er@xs4all.nl> wrote:
>> FWIW, here are some more results from pgbench comparing
>> primary and standby (both with Simon's patch).
>
> Was there a difference in CPU  utilization between the primary
> and standby?
>

I haven't monitored it.



Re: testing HS/SR - 1 vs 2 performance

From
"Erik Rijkers"
Date:
On Mon, April 26, 2010 09:43, Simon Riggs wrote:
> On Sun, 2010-04-25 at 23:52 +0200, Erik Rijkers wrote:
>
>> I'll try to repeat this pattern on other hardware; although
>> if my tests were run with faulty hardware I wouldn't know how/why
>> that would give the above effect (such a 'regular aberration').
>
>> testing is more difficult than I thought...
>
> Thanks again for your help.
>
> Please can you confirm:
> * Are the standby tests run while the primary is completely quiet?

Autovacuum was on, which is probably not a good idea - I'll try a few runs without it.

> * What OS is this? Can we use dtrace scripts?

Centos 5.4.




Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Sun, 2010-04-25 at 13:51 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Sun, 2010-04-25 at 13:33 -0400, Tom Lane wrote:
> >> If you like I'll have a go at rewriting the comments for this patch,
> >> because I am currently thinking that the problem is not so much with
> >> the code as with the poor explanation of what it's doing.  Sometimes
> >> the author is too close to the code to understand why other people
> >> have a hard time understanding it.
>
> > That would help me, thank you.
>
> OK.  You said you were currently working some more on the patch, so
> I'll wait for v3 and then work on it.

v3 attached

Changes:
* Strange locking in KnownAssignedXidsAdd() moved to RecordKnown...
* KnownAssignedXidsAdd() reordered, assert-ish code added
* Tail movement during snapshots no longer possible
* Tail movement during xid removal added to KnownAssignedXidsSearch()
* Major comment hacking

It's a little bit rough and definitely needs a re-read of all the comments,
so it's a good time to send it over.

--
 Simon Riggs           www.2ndQuadrant.com

Attachment

Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> v3 attached

Thanks, will work on this tomorrow.
        regards, tom lane


Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> v3 attached

This patch changes KnownAssignedXidsRemove() so that failure to find
the target XID is elog(ERROR) (ie, a PANIC, since this is in the
startup process).  However, this comment is still there:
    /*
     * We can fail to find an xid if the xid came from a subtransaction that
     * aborts, though the xid hadn't yet been reported and no WAL records have
     * been written using the subxid. In that case the abort record will
     * contain that subxid and we haven't seen it before.
     */

WTF?  Either the comment is wrong or this should not be an elog
condition.
        regards, tom lane


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Tue, 2010-04-27 at 13:52 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > v3 attached
> 
> This patch changes KnownAssignedXidsRemove() so that failure to find
> the target XID is elog(ERROR) (ie, a PANIC, since this is in the
> startup process).  

Not in all cases. The code is correct, as far as I am aware from
testing.

> However, this comment is still there:
>     /*
>      * We can fail to find an xid if the xid came from a subtransaction that
>      * aborts, though the xid hadn't yet been reported and no WAL records have
>      * been written using the subxid. In that case the abort record will
>      * contain that subxid and we haven't seen it before.
>      */
> 
> WTF?  Either the comment is wrong or this should not be an elog
> condition.

That section of code has been rewritten many times. I think it is now
inaccurate and should be removed. I left it there because the
unfortunate history of the project has been the removal of comments and
then later rediscovery of the truth, sometimes more than once. I could
no longer reproduce that error; someone else may know differently.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Hmm ... there's another point here, which is that the array size creates
a hard maximum on the number of entries, whereas the hash table was a
bit more forgiving.  What is the proof that the array won't overflow?
The fact that the equivalent data structure on the master can't hold
more than this many entries doesn't seem to me to prove that, because
we will add intermediate not-observed XIDs to the array.
        regards, tom lane


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Tue, 2010-04-27 at 14:53 -0400, Tom Lane wrote:
> Hmm ... there's another point here, which is that the array size
> creates
> a hard maximum on the number of entries, whereas the hash table was a
> bit more forgiving.  What is the proof that the array won't overflow?
> The fact that the equivalent data structure on the master can't hold
> more than this many entries doesn't seem to me to prove that, because
> we will add intermediate not-observed XIDs to the array.

We know that not-observed xids have actually been allocated on the
primary. We log an assignment record every 64 subtransactions, so that
the peak size of the array is 65 xids per connection.

It's possible for xids to stay in the array for longer, in the event of
a FATAL error that doesn't log an abort record. We clean those up every
checkpoint, if they exist. The potential number of them is unbounded, so
making special allowance for them doesn't remove the theoretical risk.
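
As a back-of-the-envelope check of that argument (the constants below are
assumptions for illustration only; in the real tree the per-backend subxid
cache size is a compile-time constant and the slot count also has to cover
more than plain client backends):

#include <stdio.h>

#define SUBXIDS_PER_ASSIGNMENT  64   /* assignment record logged every 64 subxids */
#define MAX_BACKENDS            100  /* e.g. max_connections */

int
main(void)
{
    /* each backend contributes its top-level xid plus at most 64 pending subxids */
    int per_backend = SUBXIDS_PER_ASSIGNMENT + 1;
    int array_slots = per_backend * MAX_BACKENDS;

    printf("worst-case KnownAssignedXids entries: %d\n", array_slots);
    return 0;
}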

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Tue, 2010-04-27 at 13:52 -0400, Tom Lane wrote:
>> WTF?  Either the comment is wrong or this should not be an elog
>> condition.

> That section of code has been rewritten many times. I think it is now
> inaccurate and should be removed. I left it there because the
> unfortunate history of the project has been the removal of comments and
> then later rediscovery of the truth, sometimes more than once. I could
> no longer reproduce that error; someone else may know differently.

I haven't tested this, but it appears to me that the failure would occur
in overflow situations.  If we have too many subxacts, we'll generate
XLOG_XACT_ASSIGNMENT, which will cause the subxids to be removed from
KnownAssignedXids[].  Then later when the top-level xact commits or
aborts we'll try to remove them again as a consequence of processing
the top-level's commit/abort record.  No?
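
A self-contained illustration of that sequence (hypothetical names and a toy
array; the point is only that "not found" is an expected outcome here, not
an error):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

#define NSLOTS 8
static TransactionId known[NSLOTS] = {100, 101, 0, 0, 0, 0, 0, 0};

/* remove xid if present; return false (instead of erroring) when absent */
static bool
known_xids_remove(TransactionId xid)
{
    for (int i = 0; i < NSLOTS; i++)
    {
        if (known[i] == xid)
        {
            known[i] = 0;
            return true;
        }
    }
    return false;
}

int
main(void)
{
    /*
     * Suppose subxid 102 was already pruned when an XLOG_XACT_ASSIGNMENT
     * record overflowed the per-backend subxid cache.  When the top-level
     * commit record later removes its subxids again, "not found" is the
     * expected outcome for 102 and must not be treated as an error.
     */
    TransactionId at_commit[] = {100, 101, 102};

    for (int i = 0; i < 3; i++)
    {
        if (!known_xids_remove(at_commit[i]))
            printf("xid %u already removed (overflowed subxid) - ignore\n",
                   (unsigned) at_commit[i]);
    }
    return 0;
}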
        regards, tom lane


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Tue, 2010-04-27 at 16:18 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Tue, 2010-04-27 at 13:52 -0400, Tom Lane wrote:
> >> WTF?  Either the comment is wrong or this should not be an elog
> >> condition.
> 
> > That section of code has been rewritten many times. I think it is now
> > inaccurate and should be removed. I left it there because the
> > unfortunate history of the project has been the removal of comments and
> > then later rediscovery of the truth, sometimes more than once. I could
> > no longer reproduce that error; someone else may know differently.
> 
> I haven't tested this, but it appears to me that the failure would occur
> in overflow situations.  If we have too many subxacts, we'll generate
> XLOG_XACT_ASSIGNMENT, which will cause the subxids to be removed from
> KnownAssignedXids[].  Then later when the top-level xact commits or
> aborts we'll try to remove them again as a consequence of processing
> the top-level's commit/abort record.  No?

Yes, thank you for clear thinking. Anyway, looks like the comment was
right after all and the new code to throw an error is wrong in some
cases. It was useful for testing, at least. The comment was slightly
misleading, which is a good reason to rewrite it.

It seems like it might be possible to identify which xids could cause an
error and which won't, but it's harder than that: we still have the possible
case where >64 subtransactions are allocated but many of them abort and we
are left with a final commit of <64 subtransactions.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Isn't the snapshotOldestActiveXid filter in
RecordKnownAssignedTransactionIds completely wrong/useless/bogus?

AFAICS, snapshotOldestActiveXid is only set once at the start of
recovery.  This means it will soon be too old to provide any useful
filtering.  But what's far worse is that the XID space will eventually
wrap around, and that test will start filtering *everything*.

I think we should just lose that test, as well as the variable.
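
To illustrate the wraparound hazard, a simplified sketch -- the circular
test mirrors the modulo-2^31 arithmetic of TransactionIdPrecedes() for
normal xids, minus its special cases:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* circular, modulo-2^31 comparison: is a older than b? */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

int
main(void)
{
    TransactionId old_cutoff = 4000000000u;  /* captured once, long ago */
    TransactionId current    = 5000;         /* xid counter has since wrapped */

    /* naive test: "current < cutoff" claims current is old, so it gets filtered */
    printf("naive filter would drop it: %d\n", current < old_cutoff);

    /* circular test: current is actually newer than the stale cutoff */
    printf("circular comparison says cutoff is older: %d\n",
           xid_precedes(old_cutoff, current));
    return 0;
}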
        regards, tom lane


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Tue, 2010-04-27 at 17:24 -0400, Tom Lane wrote:

> Isn't the snapshotOldestActiveXid filter in
> RecordKnownAssignedTransactionIds completely wrong/useless/bogus?
> 
> AFAICS, snapshotOldestActiveXid is only set once at the start of
> recovery.  This means it will soon be too old to provide any useful
> filtering.  But what's far worse is that the XID space will eventually
> wrap around, and that test will start filtering *everything*.
> 
> I think we should just lose that test, as well as the variable.

Yes, though it looks like it is still necessary for creating a valid
initial state, because otherwise we may have xids in the KnownAssignedXids
array that are already complete. The comment there talks about wasting
memory, though it appears to be a correctness issue.

So perhaps a similar test is required in ProcArrayApplyRecoveryInfo()
but not in RecordKnownAssignedTransactionIds(). That way it is applied,
but only once at initialisation.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Tue, 2010-04-27 at 17:24 -0400, Tom Lane wrote:
>> I think we should just lose that test, as well as the variable.

> Yes, though it looks like it is still necessary in creating a valid
> initial state because otherwise we may have xids in KnownAssigned array
> that are already complete.

Huh?  How is a filter as coarse as an oldest-running-XID filter going
to prevent that?  And aren't we initializing from trustworthy data in
ProcArrayApplyRecoveryInfo, anyway?

I still say it's useless.
        regards, tom lane


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Tue, 2010-04-27 at 18:08 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Tue, 2010-04-27 at 17:24 -0400, Tom Lane wrote:
> >> I think we should just lose that test, as well as the variable.
> 
> > Yes, though it looks like it is still necessary in creating a valid
> > initial state because otherwise we may have xids in KnownAssigned array
> > that are already complete.
> 
> Huh?  How is a filter as coarse as an oldest-running-XID filter going
> to prevent that?  And aren't we initializing from trustworthy data in
> ProcArrayApplyRecoveryInfo, anyway?
> 
> I still say it's useless.

Quite possibly. You're looking at other code outside of this patch. I'm
happy that you do so, but is it immediately related? I can have another
look when we finish this.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Tue, 2010-04-27 at 18:08 -0400, Tom Lane wrote:
>> Huh?  How is a filter as coarse as an oldest-running-XID filter going
>> to prevent that?  And aren't we initializing from trustworthy data in
>> ProcArrayApplyRecoveryInfo, anyway?
>> 
>> I still say it's useless.

> Quite possibly. You're looking at other code outside of this patch. I'm
> happy that you do so, but is it immediately related? I can have another
> look when we finish this.

Well, it's nearby anyway.  I've committed the present patch (with a
number of fixes).  While I was looking at it I came across several
things in the existing code that I think are either wrong or at least
inadequately documented --- the above complaint is just the tip of the
iceberg.  I'm going to make another pass over it to see if I'm just
missing things, and then report back.
        regards, tom lane


Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Tue, 2010-04-27 at 20:13 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Tue, 2010-04-27 at 18:08 -0400, Tom Lane wrote:
> >> Huh?  How is a filter as coarse as an oldest-running-XID filter going
> >> to prevent that?  And aren't we initializing from trustworthy data in
> >> ProcArrayApplyRecoveryInfo, anyway?
> >> 
> >> I still say it's useless.
> 
> > Quite possibly. You're looking at other code outside of this patch. I'm
> > happy that you do so, but is it immediately related? I can have another
> > look when we finish this.
> 
> Well, it's nearby anyway.  I've committed the present patch (with a
> number of fixes).  While I was looking at it I came across several
> things in the existing code that I think are either wrong or at least
> inadequately documented --- the above complaint is just the tip of the
> iceberg.  I'm going to make another pass over it to see if I'm just
> missing things, and then report back.

Thank you for your input and changes.

You're welcome to share my iceberg.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
"Erik Rijkers"
Date:
Hi Simon,

In another thread you mentioned you were lacking information from me:

On Tue, May 4, 2010 17:10, Simon Riggs wrote:
>
> There is no evidence that Erik's strange performance has anything to do
> with HS; it hasn't been seen elsewhere and he didn't respond to
> questions about the test setup to provide background. The profile didn't
> fit any software problem I can see.
>

I'm sorry if I missed requests for things that were not already mentioned.

Let me repeat:
  OS: CentOS 5.4
  2 quadcores: Intel(R) Xeon(R) CPU X5482 @ 3.20GHz
  Areca 1280ML
  primary and standby db both on a 12 disk array (sata 7200rpm, Seagate Barracuda ES.2)
 

It goes without saying (I hope) that apart from the pgbench tests
and a few ssh sessions (myself), the machine was idle.

It would be interesting if anyone repeated these simple tests and produced
evidence that these slowdowns are not HS-related.

(Unfortunately, I have at the moment not much time for more testing)


thanks,

Erik Rijkers



On Sun, April 25, 2010 21:07, Simon Riggs wrote:
> On Sun, 2010-04-25 at 20:25 +0200, Erik Rijkers wrote:
>
>> Sorry if it's too much data, but to me at least it was illuminating;
>> I now understand the effects of the different parameters better.
>
> That's great, many thanks.
>
> A few observations
>
> * Standby performance is actually slightly above normal running. This is
> credible because of the way snapshots are now taken. We don't need to
> scan the procarray looking for write transactions, since we know
> everything is read only. So we scan just the knownassignedxids, which if
> no activity from primary will be zero-length, so snapshots will actually
> get taken much faster in this case on standby. The snapshot performance
> on standby is O(n) where n is the number of write transactions
> "currently" on primary (transfer delays blur the word "currently").
>
> * The results for scale factor < 100 are fine, and the results for >100
> with few connections get thrown out by long transaction times. With
> larger numbers of connections the wait problems seem to go away. Looks
> like Erik (and possibly Hot Standby in general) has an I/O problem,
> though "from what" is not yet determined. It could be just hardware, or
> might be hardware plus other factors.
>
> --
>  Simon Riggs           www.2ndQuadrant.com
>
>




Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Tue, 2010-05-04 at 18:10 +0200, Erik Rijkers wrote:
> It would be interesting if anyone repeated these simple tests and
> produced
> evidence that these slowdowns are not HS-related.
> 
> (Unfortunately, I have at the moment not much time for more testing)

Would you be able to make those systems available for further testing?

First, I'd perform the same test with the systems swapped, so we know
more about the symmetry of the issue. After that, I would like to look
more into the internals.

Is it possible to set up SystemTap and dtrace on these systems?

Thanks, either way.

-- Simon Riggs           www.2ndQuadrant.com



Re: testing HS/SR - 1 vs 2 performance

From
"Erik Rijkers"
Date:
On Tue, May 4, 2010 18:19, Simon Riggs wrote:
> On Tue, 2010-05-04 at 18:10 +0200, Erik Rijkers wrote:
>> It would be interesting if anyone repeated these simple tests and
>> produced evidence that these slowdowns are not HS-related.
>>
>> (Unfortunately, I have at the moment not much time for more testing)
>
> Would you be able to make those systems available for further testing?

No. sorry.

> First, I'd perform the same test with the systems swapped, so we know
> more about the symmetry of the issue. After that, would like to look
> more into internals.

you mean "systems swapped", primary and standby?  primary and standby were on the same machine in
these tests (even the same raid).

I can eventually move the standby (the 'slow' side, as it stands) to another, quite similar
machine.  Not in the coming days though...


> Is it possible to setup SytemTap and dtrace on these systems?

I did install SystemTap last week. dtrace is not installed, I think (I've never used either).



Erik Rijkers



Re: testing HS/SR - 1 vs 2 performance

From
Greg Smith
Date:
Erik Rijkers wrote:
>   OS: Centos 5.4
>   2 quadcores: Intel(R) Xeon(R) CPU X5482 @ 3.20GHz
>   Areca 1280ML
>   primary and standby db both on a 12 disk array (sata 7200rpm, Seagat Barracuda ES.2)
>   

To fill in from data you already mentioned upthread:
32 GB RAM
CentOS release 5.4 (Final), x86_64 Linux, 2.6.18-164.el5

Thanks for all the reporting you've done here, really helpful.  
Questions to make sure I'm trying to duplicate the right thing here:

Is your disk array all configured as one big RAID10 volume, so 
essentially a 6-disk stripe with redundancy, or something else?  In 
particular I want to know whether the WAL/database/archives are split onto 
separate volumes or all on one big one when you were testing. 

Is this on ext3 with standard mount parameters?

Also, can you confirm that every test you ran only had a single pgbench 
worker thread (-j 1 or not specified)?  That looked to be the case from 
the ones I saw where you posted the whole command used.  It would not 
surprise me to find that the CPU usage profile of a standby is just 
different enough from the primary that it results in the pgbench program 
not being scheduled enough time, due to the known Linux issues in that 
area.  Not going to assume that, of course, just one thing I want to 
check when trying to replicate what you've run into. 

I didn't see any glaring HS performance issues like the ones you've been 
reporting the last time I tried performance testing in this area, just a 
small percentage drop.  But I didn't specifically go looking for it 
either.  With your testing rig out of service, we're going to try and 
replicate that on a system here.  My home server is like a scaled down 
version of yours (single quad-core, 8GB RAM, smaller Areca controller, 5 
disks instead of 12) and it's running the same CentOS version.  If the 
problem is really universal I should see it here too.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us



Re: testing HS/SR - 1 vs 2 performance

From
Stefan Kaltenbrunner
Date:
Erik Rijkers wrote:
> Hi Simon,
> 
> In another thread you mentioned you were lacking information from me:
> 
> On Tue, May 4, 2010 17:10, Simon Riggs wrote:
>> There is no evidence that Erik's strange performance has anything to do
>> with HS; it hasn't been seen elsewhere and he didn't respond to
>> questions about the test setup to provide background. The profile didn't
>> fit any software problem I can see.
>>
> 
> I'm sorry if I missed requests for things that where not already mentioned.
> 
> Let me repeat:
>   OS: Centos 5.4
>   2 quadcores: Intel(R) Xeon(R) CPU X5482 @ 3.20GHz
>   Areca 1280ML
>   primary and standby db both on a 12 disk array (sata 7200rpm, Seagat Barracuda ES.2)
> 
> It goes without saying (I hope) that apart from the pgbench tests
> and a few ssh sessions (myself), the machine was idle.
> 
> It would be interesting if anyone repeated these simple tests and produced
> evidence that these slowdowns are not HS-related.
> 
> (Unfortunately, I have at the moment not much time for more testing)

FWIW - I'm seeing a behaviour here under pgbench -S workloads that looks 
kinda related.

using -j 16 -c 16 -T 120 I get either 100000tps and around 660000 
contextswitches per second or on some runs I end up with 150000tps and 
around 1M contextswitches/s sustained. I mostly get the 100k result but 
once in a while I get the 150k one. And one can even anticipate the 
final transaction rate from watching "vmstat 1"...

I'm not sure yet on what is causing that behaviour but that is with 
9.0B1 on a Dual Quadcore Nehalem box with 16 cpu threads (8+HT) on a 
pure in-memory workload (scale = 20 with 48GB RAM).


Stefan


Re: testing HS/SR - 1 vs 2 performance

From
"Erik Rijkers"
Date:
On Tue, May 4, 2010 20:26, Greg Smith wrote:
> Erik Rijkers wrote:
>>   OS: Centos 5.4
>>   2 quadcores: Intel(R) Xeon(R) CPU X5482 @ 3.20GHz
>>   Areca 1280ML
>>   primary and standby db both on a 12 disk array (sata 7200rpm, Seagat Barracuda ES.2)
>>
>
> To fill in from data you already mentioned upthread:
> 32 GB RAM
> CentOS release 5.4 (Final), x86_64 Linux, 2.6.18-164.el5
>
> Thanks for the all the reporting you've done here, really helpful.
> Questions to make sure I'm trying to duplicate the right thing here:
>
> Is your disk array all configured as one big RAID10 volume, so
> essentially a 6-disk stripe with redundancy, or something else?  In
> particular I want know whether the WAL/database/archives are split onto
> separate volumes or all on one big one when you were testing.

Everything together: the raid is what Areca calls 'raid10(1E)'.
(To be honest I don't remember exactly what that 1E means -
extra flexibility in the number of disks, I think).

Btw, some of my emails contained the postgresql.conf of both instances.

>
> Is this is on ext3 with standard mount parameters?

ext3 noatime

> Also, can you confirm that every test you ran only had a single pgbench
> worker thread (-j 1 or not specified)?  That looked to be the case from
> the ones I saw where you posted the whole command used.  It would not

yes; the literal cmd:
time /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/bin/pgbench -h /tmp -p 6565 -U rijkers
-n -S -c 20 -T 900 -j 1 replicas

To avoid wrapping in the emails I just removed '-h /tmp', '-U rijkers', and 'replicas'.

(I may have run the primary's pgbench binary also against the slave - don't think
that should make any difference)

> surprise me to find that the CPU usage profile of a standby is just
> different enough from the primary that it results in the pgbench program
> not being scheduled enough time, due to the known Linux issues in that
> area.  Not going to assume that, of course, just one thing I want to
> check when trying to replicate what you've run into.
>
> I didn't see any glaring HS performance issues like you've been
> reporting on last time I tried performance testing in this area, just a
> small percentage drop.  But I didn't specifically go looking for it

Here, it seems repeatable, but does not occur with all scales.

Hm, maybe I should just dump *all* of my results on the wiki for reference.  (I'll look at that
later).

> either.  With your testing rig out of service, we're going to try and
> replicate that on a system here.  My home server is like a scaled down
> version of yours (single quad-core, 8GB RAM, smaller Areca controller, 5
> disks instead of 12) and it's running the same CentOS version.  If the
> problems really universal I should see it here too.
>

Thanks,


Erik Rijkers



Re: testing HS/SR - 1 vs 2 performance

From
Simon Riggs
Date:
On Tue, 2010-05-04 at 21:34 +0200, Stefan Kaltenbrunner wrote:

> FWIW - I'm seeing a behaviour here under pgbench -S workloads that looks
> kinda related.
>
> using -j 16 -c 16 -T 120 I get either 100000tps and around 660000
> contextswitches per second or on some runs I end up with 150000tps and
> around 1M contextswitches/s sustained. I mostly get the 100k result but
> once in a while I get the 150k one. And one even can anticipate the
> final transaction rate from watching "vmstat 1"...
>
> I'm not sure yet on what is causing that behaviour but that is with
> 9.0B1 on a Dual Quadcore Nehalem box with 16 cpu threads (8+HT) on a
> pure in-memory workload (scale = 20 with 48GB RAM).

Educated guess at a fix: please test this patch. It's good for
performance testing, but doesn't work correctly at failover, which would
obviously be addressed prior to any commit.

--
 Simon Riggs           www.2ndQuadrant.com

Attachment

Re: testing HS/SR - 1 vs 2 performance

From
Greg Smith
Date:
Erik Rijkers wrote:
> Everything together: the raid is what Areca call 'raid10(1E)'.
> (to be honest I don't remember what that 1E exactly means -
> extra flexibility in the number of disks, I think).
>   

Standard RAID10 only supports an even number of disks.  The 1E variants 
also allow putting an odd number in.  If you're using an even number, 
like in your case, the results are the same until you start losing 
drives, at which point the performance degradation pattern changes a bit 
due to differences in how things are striped.  See these for more info:

http://bytepile.com/raid_class.php and 
http://en.wikipedia.org/wiki/Non-standard_RAID_levels#IBM_ServeRAID_1E
http://publib.boulder.ibm.com/infocenter/eserver/v1r2/index.jsp?topic=/diricinfo/fqy0_craid1e.html

I don't think using RAID1E has anything to do with your results, but it 
is possible given your test configuration that part of the difference 
you're seeing relates to where on disk blocks are stored.  If you take a 
hard drive and write two copies of something onto it, the second copy 
will be a little slower than the first.  That's just because of how drive 
speed drops over the surface as you move further along it.  There's some 
examples of what that looks like at 
http://projects.2ndquadrant.it/sites/default/files/pg-hw-bench-2010.pdf 
on pages 21-23.

Returning to your higher level results, one of the variables you weren't 
sure how to account for was caching effects on the standby--the 
possibility that it just didn't have the data cached the same way as the 
master, impacting results.  The usual way I look for that is by graphing 
the pgbench TPS over time.  In that situation, you can see it shoot 
upwards over time, very obvious pattern.  Example at 
http://projects.2ndquadrant.it/sites/default/files/pgbench-intro.pdf on 
pages 36,37.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us