Thread: testing HS/SR - 1 vs 2 performance
Using 9.0devel cvs HEAD, 2010.04.08.

I am trying to understand the performance difference between primary and
standby under a standard pgbench read-only test.

The server has 32 GB and 2 quad-core CPUs.

primary:
tps = 34606.747930 (including connections establishing)
tps = 34527.078068 (including connections establishing)
tps = 34654.297319 (including connections establishing)

standby:
tps = 700.346283 (including connections establishing)
tps = 717.576886 (including connections establishing)
tps = 740.522472 (including connections establishing)

transaction type: SELECT only
scaling factor: 1000
query mode: simple
number of clients: 20
number of threads: 1
duration: 900 s

Both instances have:
max_connections = 100
shared_buffers = 256MB
checkpoint_segments = 50
effective_cache_size = 16GB

See also:
http://archives.postgresql.org/pgsql-testers/2010-04/msg00005.php
(differences with scale 10_000)

I understand that in the scale=1000 case there is a huge cache effect, but
why doesn't that apply to the pgbench runs against the standby? (And for
the scale=10_000 case the differences are still rather large.)

Maybe these differences are as expected; I don't find any explanation in
the documentation.

thanks,

Erik Rijkers
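For reference, the runs above boil down to command lines like the following
(the /tmp socket directory and ports 6565/6566 for primary/standby are taken
from later messages in this thread; the database name is an assumption):

  pgbench -i -s 1000 -h /tmp -p 6565 postgres                # initialise on the primary
  pgbench -h /tmp -p 6565 -n -S -c 20 -j 1 -T 900 postgres   # read-only run, primary
  pgbench -h /tmp -p 6566 -n -S -c 20 -j 1 -T 900 postgres   # read-only run, standby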
On Sat, Apr 10, 2010 at 8:23 AM, Erik Rijkers <er@xs4all.nl> wrote:
> I understand that in the scale=1000 case, there is a huge
> cache effect, but why doesn't that apply to the pgbench runs
> against the standby? (and for the scale=10_000 case the
> differences are still rather large)

I guess that this performance degradation happened because the large
number of buffer replacements caused UpdateMinRecoveryPoint() to be
called often. So I think increasing shared_buffers would improve the
performance significantly.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
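If that guess is right, the tuning side of the fix is just a
postgresql.conf change on both instances; the figure below is an
illustrative value for a 32 GB machine, not one suggested in the thread:

  shared_buffers = 4GB            # up from 256MB, to cut buffer replacements
  effective_cache_size = 16GB     # unchanged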
On Mon, Apr 12, 2010 at 5:06 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Sat, Apr 10, 2010 at 8:23 AM, Erik Rijkers <er@xs4all.nl> wrote:
>> I understand that in the scale=1000 case, there is a huge
>> cache effect, but why doesn't that apply to the pgbench runs
>> against the standby? (and for the scale=10_000 case the
>> differences are still rather large)
>
> I guess that this performance degradation happened because the large
> number of buffer replacements caused UpdateMinRecoveryPoint() to be
> called often. So I think increasing shared_buffers would improve the
> performance significantly.

I think we need to investigate this more. It's not going to look good
for the project if people find that a hot standby server runs two orders
of magnitude slower than the primary.

...Robert
On Sat, April 10, 2010 01:23, Erik Rijkers wrote:
> Using 9.0devel cvs HEAD, 2010.04.08.
>
> I am trying to understand the performance difference between primary and
> standby under a standard pgbench read-only test.
>
> The server has 32 GB and 2 quad-core CPUs.
>
> primary:
> tps = 34606.747930 (including connections establishing)
> tps = 34527.078068 (including connections establishing)
> tps = 34654.297319 (including connections establishing)
>
> standby:
> tps = 700.346283 (including connections establishing)
> tps = 717.576886 (including connections establishing)
> tps = 740.522472 (including connections establishing)
>
> transaction type: SELECT only
> scaling factor: 1000
> query mode: simple
> number of clients: 20
> number of threads: 1
> duration: 900 s
>
> Both instances have:
> max_connections = 100
> shared_buffers = 256MB
> checkpoint_segments = 50
> effective_cache_size = 16GB
>
> See also:
> http://archives.postgresql.org/pgsql-testers/2010-04/msg00005.php
> (differences with scale 10_000)

To my surprise, I have later seen the opposite behaviour, with the standby
giving fast runs and the primary slow.

FWIW, I have overnight run a larger set of tests (against the same 9.0devel
instances as the ones from the earlier email). These results are generally
more balanced.

for scale in 10 100 500 1000
  for clients in 1 5 10 20
    for port in 6565 6566          --> primary port, standby port
      for run in `seq 1 3`
        pgbench ...
        sleep ((scale / 10) * 60)
      done
    done
  done
done

(So below: alternating 3 primary runs, followed by 3 standby runs.)

scale: 10 clients: 1 tps = 15219.019272 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 15301.847615 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 15238.907436 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 12129.928289 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 12151.711589 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 12203.494512 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 5 tps = 60248.120599 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 10 clients: 5 tps = 60827.949875 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 10 clients: 5 tps = 61167.447476 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 10 clients: 5 tps = 50750.385403 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 10 clients: 5 tps = 50600.891436 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 10 clients: 5 tps = 50486.857610 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 10 clients: 10 tps = 60307.739327 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 60264.230349 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 60146.370598 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 50455.537671 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 49877.000813 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 50097.949766 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 20 tps = 43355.220657 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 43352.725422 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 43496.085623 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 37169.126299 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 37100.260450 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 37342.758507 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 1 tps = 12514.185089 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 12542.842198 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 12595.688640 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 10435.681851 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 10456.983353 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 10434.213044 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 5 tps = 48682.166988 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 100 clients: 5 tps = 48656.883485 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 100 clients: 5 tps = 48687.894655 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 100 clients: 5 tps = 41901.629933 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 100 clients: 5 tps = 41953.386791 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 100 clients: 5 tps = 41787.962712 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 100 clients: 10 tps = 48704.247239 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 48941.190050 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 48603.077936 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 42948.666272 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 42767.793899 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 42612.670983 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 20 tps = 36350.454258 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 36373.088111 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 36490.886781 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 32235.811228 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 32253.837906 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 32144.189047 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 1 tps = 11733.254970 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 11726.665739 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 11617.622548 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 9769.861175 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 9878.465752 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 9808.236216 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 5 tps = 45185.900553 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 500 clients: 5 tps = 45170.334037 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 500 clients: 5 tps = 45136.596374 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 500 clients: 5 tps = 39231.863815 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 500 clients: 5 tps = 39336.889619 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 500 clients: 5 tps = 39269.483772 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 500 clients: 10 tps = 45468.080680 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 45727.159963 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 45399.241367 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 40759.108042 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 40783.287718 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 40858.007847 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 20 tps = 34729.742313 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 34705.119029 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 34617.517224 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 31252.355034 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 31234.885791 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 31273.307637 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 1 tps = 220.024691 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 294.855794 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 375.152757 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 295.965959 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 1036.517110 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 9167.012603 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 5 tps = 1241.224282 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 1000 clients: 5 tps = 1894.806301 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 1000 clients: 5 tps = 18532.885549 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 1000 clients: 5 tps = 1497.491279 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 1000 clients: 5 tps = 1480.164166 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 1000 clients: 5 tps = 3470.769236 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 1000 clients: 10 tps = 2414.552333 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 19248.609443 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 45059.231609 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 1648.526373 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 3659.800008 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 35900.769857 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 20 tps = 2462.855864 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 27168.407568 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 34438.802096 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 2933.220489 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 25586.972428 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 30926.189621 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
On Mon, Apr 12, 2010 at 7:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Apr 12, 2010 at 5:06 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Sat, Apr 10, 2010 at 8:23 AM, Erik Rijkers <er@xs4all.nl> wrote:
>>> I understand that in the scale=1000 case, there is a huge
>>> cache effect, but why doesn't that apply to the pgbench runs
>>> against the standby? (and for the scale=10_000 case the
>>> differences are still rather large)
>>
>> I guess that this performance degradation happened because the large
>> number of buffer replacements caused UpdateMinRecoveryPoint() to be
>> called often. So I think increasing shared_buffers would improve the
>> performance significantly.
>
> I think we need to investigate this more. It's not going to look good
> for the project if people find that a hot standby server runs two orders
> of magnitude slower than the primary.

As a data point, I did a read-only pgbench test and found that the standby
runs about 15% slower than the primary with identical hardware and configs.

--
Jim Mlodgenski
EnterpriseDB (http://www.enterprisedb.com)
On Mon, April 12, 2010 14:22, Erik Rijkers wrote:
> On Sat, April 10, 2010 01:23, Erik Rijkers wrote:

Oops, typos in that pseudo loop: of course there was a pgbench init step
after that first line.

for scale in 10 100 500 1000
   pgbench ...                  # initialise
   sleep ((scale / 10) * 60)
>  for clients in 1 5 10 20
>    for port in 6565 6566          --> primary port, standby port
>      for run in `seq 1 3`
>        pgbench ...
>        sleep 120
>      done
>    done
>  done
> done
On Mon, Apr 12, 2010 at 8:32 AM, Jim Mlodgenski <jimmy76@gmail.com> wrote:
>> I think we need to investigate this more. It's not going to look good
>> for the project if people find that a hot standby server runs two orders
>> of magnitude slower than the primary.
> As a data point, I did a read-only pgbench test and found that the
> standby runs about 15% slower than the primary with identical hardware
> and configs.

Hmm. That's not great, but it's a lot better than 50x. I wonder what was
different in Erik's environment. Does running in standby mode use more
memory, such that it might have pushed the machine over the line into swap?
Or if it's CPU load, maybe Erik could gprof it?

...Robert
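For the profiling suggestion, something along these lines would do; this is
a sketch using the legacy oprofile opcontrol interface (the tool Heikki
uses in a later message), with the postgres binary path as a placeholder:

  opcontrol --init
  opcontrol --start --no-vmlinux
  pgbench -h /tmp -p 6566 -n -S -c 20 -j 1 -T 60      # drive the standby
  opcontrol --dump
  opreport -l /path/to/bin/postgres | head -20        # top symbols by samples
  opcontrol --shutdown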
* Robert Haas <robertmhaas@gmail.com> [100412 07:10]:

> I think we need to investigate this more. It's not going to look good
> for the project if people find that a hot standby server runs two orders
> of magnitude slower than the primary.

Yes, it's not "good", but it's a known problem. We've had people
complaining that WAL replay can't keep up with the WAL stream from a heavy
server.

The master producing the WAL stream has $XXX separate read/modify
processes working over the data dir, and is bottlenecked by the serialized
WAL stream. All the seek+read delays are parallelized and overlapping.

But the slave (traditionally a PITR slave, now also HS/SR) has all that
read-modify-write happening in single-threaded fashion, meaning that WAL
record $X+1 waits until the buffer that record $X needs to modify is read
in. All the seek+read delays are serialized.

You can mitigate that by keeping more of the buffers in memory (shared
buffers, or OS cache), but the WAL producer, by its very nature generating
a multi-task random read/write I/O load, is always going to go quicker
than the single-stream random-I/O WAL consumer...

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
And I see now that he's doing a stream of read-only queries on a slave,
presumably with no WAL even being replayed... Sorry for the noise....

a.

* Aidan Van Dyk <aidan@highrise.ca> [100412 09:40]:
> * Robert Haas <robertmhaas@gmail.com> [100412 07:10]:
>
>> I think we need to investigate this more. It's not going to look good
>> for the project if people find that a hot standby server runs two orders
>> of magnitude slower than the primary.
>
> Yes, it's not "good", but it's a known problem. We've had people
> complaining that WAL replay can't keep up with the WAL stream from a
> heavy server.
>
> The master producing the WAL stream has $XXX separate read/modify
> processes working over the data dir, and is bottlenecked by the
> serialized WAL stream. All the seek+read delays are parallelized and
> overlapping.
>
> But the slave (traditionally a PITR slave, now also HS/SR) has all that
> read-modify-write happening in single-threaded fashion, meaning that WAL
> record $X+1 waits until the buffer that record $X needs to modify is
> read in. All the seek+read delays are serialized.
>
> You can mitigate that by keeping more of the buffers in memory (shared
> buffers, or OS cache), but the WAL producer, by its very nature
> generating a multi-task random read/write I/O load, is always going to
> go quicker than the single-stream random-I/O WAL consumer...

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
Aidan Van Dyk <aidan@highrise.ca> wrote:
> We've had people complaining that WAL replay can't keep up with the
> WAL stream from a heavy server.

I thought this thread was about the slow performance running a mix of
read-only queries on the slave versus the master, which doesn't seem to
have anything to do with the old issue you're describing.

-Kevin
I could reproduce this on my laptop: the standby is about 20% slower. I ran
oprofile, and what stands out as the difference between the master and the
standby is that on the standby about 20% of the CPU time is spent in
hash_seq_search(). The call path is GetSnapshotData() ->
KnownAssignedXidsGetAndSetXmin() -> hash_seq_search(). That explains the
difference in performance.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
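To make the cost model concrete: a sequential scan of a hash table visits
every bucket, so its cost tracks the table's capacity, while a dense array
scan tracks only the number of live entries. A toy illustration in
standalone C (not PostgreSQL code; the capacity figure is only a rough
stand-in for TOTAL_MAX_CACHED_SUBXIDS at max_connections=100):

  #include <stdio.h>

  #define CAPACITY 6500    /* rough stand-in for TOTAL_MAX_CACHED_SUBXIDS */

  int main(void)
  {
      static unsigned buckets[CAPACITY];  /* 0 means "empty bucket" */
      unsigned dense[3] = {100, 101, 102};
      int      nlive = 3, i, found = 0;

      /* three live xids, placed arbitrarily, as during light write load */
      buckets[17] = 100; buckets[1234] = 101; buckets[6001] = 102;

      /* hash_seq_search-style scan: every snapshot walks all buckets */
      for (i = 0; i < CAPACITY; i++)
          if (buckets[i] != 0)
              found++;                    /* a live entry */
      printf("hash-style scan: %d slots visited for %d entries\n",
             CAPACITY, found);

      /* array-style scan: work proportional to live entries only */
      for (i = 0; i < nlive; i++)
          (void) dense[i];
      printf("array-style scan: %d slots visited\n", nlive);
      return 0;
  }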
Heikki Linnakangas wrote:
> I could reproduce this on my laptop: the standby is about 20% slower. I
> ran oprofile, and what stands out as the difference between the master
> and the standby is that on the standby about 20% of the CPU time is spent
> in hash_seq_search(). The call path is GetSnapshotData() ->
> KnownAssignedXidsGetAndSetXmin() -> hash_seq_search(). That explains the
> difference in performance.

The slowdown is proportional to the max_connections setting in the standby.
A 20% slowdown might still be acceptable, but if you increase
max_connections to, say, 1000, things get really slow. I wouldn't recommend
max_connections=1000, of course, but I think we need to do something about
this: change the KnownAssignedXids data structure from a hash table into
something that's quicker to scan. Preferably something that is O(N), where
N is the number of entries in the data structure, not the maximum number of
entries it can hold, as it is with the hash table currently.

A quick fix would be to check if there are any entries in the hash table
before scanning it. That would eliminate the overhead when there are no
in-progress transactions in the master. But as soon as there's even one,
the overhead comes back.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
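A minimal sketch of that quick fix, assuming the numKnownAssignedXids
counter that already exists in ProcArrayStruct at this point in 9.0devel
is kept accurate:

  static int
  KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
                                 TransactionId xmax)
  {
      /*
       * Fastpath: if the startup process is tracking no in-progress xids,
       * there is nothing to copy and xmin needs no adjustment, so skip the
       * expensive hash_seq_search() walk entirely.
       */
      if (procArray->numKnownAssignedXids == 0)
          return 0;

      /* ... otherwise fall through to the existing bucket scan ... */
  }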
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> I think we need to do something about this: change the KnownAssignedXids
> data structure from a hash table into something that's quicker to scan.
> Preferably something that is O(N), where N is the number of entries in
> the data structure, not the maximum number of entries it can hold, as it
> is with the hash table currently.

So it's pretty good news that Red-Black Trees made it into 9.0, isn't
it? :)

> A quick fix would be to check if there are any entries in the hash table
> before scanning it. That would eliminate the overhead when there are no
> in-progress transactions in the master. But as soon as there's even one,
> the overhead comes back.

Having no in-progress transactions at all on the master doesn't sound like
the typical case, does it?

--
dim
On Tue, 2010-04-13 at 21:09 +0300, Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
>> I could reproduce this on my laptop: the standby is about 20% slower. I
>> ran oprofile, and what stands out as the difference between the master
>> and the standby is that on the standby about 20% of the CPU time is
>> spent in hash_seq_search(). The call path is GetSnapshotData() ->
>> KnownAssignedXidsGetAndSetXmin() -> hash_seq_search(). That explains
>> the difference in performance.
>
> The slowdown is proportional to the max_connections setting in the
> standby. A 20% slowdown might still be acceptable, but if you increase
> max_connections to, say, 1000, things get really slow. I wouldn't
> recommend max_connections=1000, of course, but I think we need to do
> something about this: change the KnownAssignedXids data structure from a
> hash table into something that's quicker to scan. Preferably something
> that is O(N), where N is the number of entries in the data structure,
> not the maximum number of entries it can hold, as it is with the hash
> table currently.

There's a tradeoff here to consider. KnownAssignedXids faces two
workloads: one for each WAL record, where we check if the xid is already
known assigned, and one for snapshots. The current implementation is
optimised towards recovery performance, not snapshot performance.

> A quick fix would be to check if there are any entries in the hash table
> before scanning it. That would eliminate the overhead when there are no
> in-progress transactions in the master. But as soon as there's even one,
> the overhead comes back.

Any fix should be fairly quick because of the way it's modularised - with
something like this in mind.

I'll try a circular buffer implementation, with fastpath. I'll have
something in a few hours.

--
Simon Riggs           www.2ndQuadrant.com
Simon Riggs wrote:
> On Tue, 2010-04-13 at 21:09 +0300, Heikki Linnakangas wrote:
>> A quick fix would be to check if there are any entries in the hash table
>> before scanning it. That would eliminate the overhead when there are no
>> in-progress transactions in the master. But as soon as there's even one,
>> the overhead comes back.
>
> Any fix should be fairly quick because of the way it's modularised - with
> something like this in mind.
>
> I'll try a circular buffer implementation, with fastpath.

I started experimenting with a sorted-array-based implementation on
Tuesday but got carried away with other stuff. I now got back to that and
cleaned it up.

How does the attached patch look? It's probably similar to what you had in
mind.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 88169b4..4a04051 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -64,8 +64,10 @@ typedef struct ProcArrayStruct
 	int			numProcs;		/* number of valid procs entries */
 	int			maxProcs;		/* allocated size of procs array */
 
-	int			numKnownAssignedXids;	/* current number of known assigned
-										 * xids */
+	int			firstKnownAssignedXid;	/* location of first valid known
+										 * assigned xid in the array */
+	int			lastKnownAssignedXid;	/* location of last valid known
+										 * assigned xid in the array */
 	int			maxKnownAssignedXids;	/* allocated size of known assigned
 										 * xids */
@@ -87,7 +89,8 @@ static ProcArrayStruct *procArray;
 /*
  * Bookkeeping for tracking emulated transactions in recovery
  */
-static HTAB *KnownAssignedXidsHash;
+static TransactionId *KnownAssignedXidsArray;
+static bool *KnownAssignedXidsValidArray;
 static TransactionId latestObservedXid = InvalidTransactionId;
 
 /*
@@ -201,28 +204,33 @@ CreateSharedProcArray(void)
 		/* Normal processing */
 		procArray->numProcs = 0;
 		procArray->maxProcs = PROCARRAY_MAXPROCS;
-		procArray->numKnownAssignedXids = 0;
 		procArray->maxKnownAssignedXids = TOTAL_MAX_CACHED_SUBXIDS;
+		procArray->firstKnownAssignedXid = 0;
+		procArray->lastKnownAssignedXid = 0;
 		procArray->lastOverflowedXid = InvalidTransactionId;
 	}
 
 	if (XLogRequestRecoveryConnections)
 	{
-		/* Create or attach to the KnownAssignedXids hash table */
-		HASHCTL		info;
-
-		MemSet(&info, 0, sizeof(info));
-		info.keysize = sizeof(TransactionId);
-		info.entrysize = sizeof(TransactionId);
-		info.hash = tag_hash;
-
-		KnownAssignedXidsHash = ShmemInitHash("KnownAssignedXids Hash",
-											  TOTAL_MAX_CACHED_SUBXIDS,
-											  TOTAL_MAX_CACHED_SUBXIDS,
-											  &info,
-											  HASH_ELEM | HASH_FUNCTION);
-		if (!KnownAssignedXidsHash)
-			elog(FATAL, "could not initialize known assigned xids hash table");
+		Size		size;
+
+		/* Create or attach to the KnownAssignedXids arrays */
+		size = mul_size(sizeof(TransactionId), TOTAL_MAX_CACHED_SUBXIDS);
+		KnownAssignedXidsArray = ShmemInitStruct("KnownAssignedXidsArray",
+												 size,
+												 &found);
+		if (!KnownAssignedXidsArray)
+			elog(FATAL, "could not initialize known assigned xids array");
+		if (!found)
+			MemSet(KnownAssignedXidsArray, 0, size);
+
+		size = mul_size(sizeof(bool), TOTAL_MAX_CACHED_SUBXIDS);
+		KnownAssignedXidsValidArray = ShmemInitStruct("KnownAssignedXidsValidArray",
+													  size,
+													  &found);
+		if (!KnownAssignedXidsValidArray)
+			elog(FATAL, "could not initialize known assigned xids used array");
+		if (!found)
+			MemSet(KnownAssignedXidsValidArray, 0, size);
 	}
 }
@@ -2162,7 +2170,7 @@ DisplayXidCache(void)
  *
  * During recovery we do not fret too much about the distinction between
  * top-level xids and subtransaction xids. We hold both together in
- * a hash table called KnownAssignedXids. In backends, this is copied into
+ * an array called KnownAssignedXids. In backends, this is copied into
  * snapshots in GetSnapshotData(), taking advantage
  * of the fact that XidInMVCCSnapshot() doesn't care about the distinction
  * either. Subtransaction xids are effectively treated as top-level xids
@@ -2207,7 +2215,7 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
 	 * We can see WAL records before the running-xacts snapshot that contain
 	 * XIDs that are not in the running-xacts snapshot, but that we know to
 	 * have finished before the running-xacts snapshot was taken. Don't waste
-	 * precious shared memory by keeping them in the hash table.
+	 * precious shared memory by keeping them in the array.
 	 *
 	 * We can also see WAL records before the running-xacts snapshot that
 	 * contain XIDs that are not in the running-xacts snapshot for a different
@@ -2330,24 +2338,30 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
 /*
  * Private module functions to manipulate KnownAssignedXids
  *
- * There are 3 main users of the KnownAssignedXids data structure:
+ * We use a fixed-size sorted array in shared memory to keep track of XIDs
+ * that we consider as in-progress in the master at this time. This data
+ * structure is called KnownAssignedXids, and it has 4 main users:
  *
- * * backends taking snapshots
+ * * backends taking snapshots, copying all XIDs in the array
+ * * backends checking if an XID exists in the array
 * * startup process adding new knownassigned xids
 * * startup process removing xids as transactions end
 *
- * If we make KnownAssignedXids a simple sorted array then the first two
- * operations are fast, but the last one is at least O(N). If we make
- * KnownAssignedXids a hash table then the last two operations are fast,
- * though we have to do more work at snapshot time. Doing more work at
- * commit could slow down taking snapshots anyway because of lwlock
- * contention. Scanning the hash table is O(N) on the max size of the array,
- * so performs poorly in comparison when we have very low numbers of
- * write transactions to process. But at least it is constant overhead
- * and a sequential memory scan will utilise hardware memory readahead
- * to give much improved performance. In any case the emphasis must be on
- * having the standby process changes quickly so that it can provide
- * high availability. So we choose to implement as a hash table.
+ * Typically, there is only a few entries in the array, compared to the
+ * maximum size, so we keep track of the first and last valid entry in the
+ * array to make scanning it quick. That also makes it quick to add entries
+ * to the end. XIDs are assigned in monotonically increasing order, so new
+ * entries always go to the end.
+ *
+ * When an entry is removed, it's merely marked as invalid, but left in
+ * place in the array. There is a second array of booleans,
+ * KnownAssignedXidsValidArray, which keeps track of which entries in the
+ * KnownAssignedXidsArray are valid.
+ *
+ * Because we leave the entry in place when an XID is marked as removed, the
+ * array will eventually fill up. When an entry needs to be added and there
+ * is no room for it, the array is compacted by copying all valid entries to
+ * the beginning of the array, removing all invalid entries.
  */

 /*
@@ -2358,41 +2372,49 @@ ExpireOldKnownAssignedTransactionIds(TransactionId xid)
 static void
 KnownAssignedXidsAdd(TransactionId *xids, int nxids)
 {
-	TransactionId *result;
-	bool		found;
 	int			i;
 
 	for (i = 0; i < nxids; i++)
 	{
-		Assert(TransactionIdIsValid(xids[i]));
+		TransactionId xid = xids[i];
 
-		elog(trace_recovery(DEBUG4), "adding KnownAssignedXid %u", xids[i]);
+		Assert(TransactionIdIsValid(xid));
 
-		procArray->numKnownAssignedXids++;
-		if (procArray->numKnownAssignedXids > procArray->maxKnownAssignedXids)
-		{
-			KnownAssignedXidsDisplay(LOG);
-			LWLockRelease(ProcArrayLock);
-			elog(ERROR, "too many KnownAssignedXids (%u)", procArray->maxKnownAssignedXids);
-		}
+		elog(trace_recovery(DEBUG4), "adding KnownAssignedXid %u", xid);
 
-		result = (TransactionId *) hash_search(KnownAssignedXidsHash, &xids[i], HASH_ENTER,
-											   &found);
+		Assert(procArray->lastKnownAssignedXid == 0 ||
+			   TransactionIdFollows(xid, KnownAssignedXidsArray[procArray->lastKnownAssignedXid - 1]));
 
-		if (!result)
+		/* Compact if necessary */
+		if (procArray->lastKnownAssignedXid == procArray->maxKnownAssignedXids)
 		{
-			LWLockRelease(ProcArrayLock);
-			ereport(ERROR,
-					(errcode(ERRCODE_OUT_OF_MEMORY),
-					 errmsg("out of shared memory")));
-		}
+			int			j;
+			int			k;
 
-		if (found)
-		{
-			KnownAssignedXidsDisplay(LOG);
-			LWLockRelease(ProcArrayLock);
-			elog(ERROR, "found duplicate KnownAssignedXid %u", xids[i]);
+			k = 0;
+			for (j = procArray->firstKnownAssignedXid; j < procArray->lastKnownAssignedXid; j++)
+			{
+				if (KnownAssignedXidsValidArray[j])
+				{
+					KnownAssignedXidsArray[k] = KnownAssignedXidsArray[j];
+					KnownAssignedXidsValidArray[k] = true;
+					k++;
+				}
+			}
+			procArray->firstKnownAssignedXid = 0;
+			procArray->lastKnownAssignedXid = k;
+
+			if (procArray->lastKnownAssignedXid == procArray->maxKnownAssignedXids)
+			{
+				KnownAssignedXidsDisplay(LOG);
+				LWLockRelease(ProcArrayLock);
+				elog(ERROR, "too many KnownAssignedXids (%u)", procArray->maxKnownAssignedXids);
+			}
 		}
+
+		KnownAssignedXidsArray[procArray->lastKnownAssignedXid] = xid;
+		KnownAssignedXidsValidArray[procArray->lastKnownAssignedXid] = true;
+		procArray->lastKnownAssignedXid++;
 	}
 }
@@ -2404,10 +2426,21 @@ KnownAssignedXidsAdd(TransactionId *xids, int nxids)
 static bool
 KnownAssignedXidsExist(TransactionId xid)
 {
-	bool		found;
+	int			low = procArray->firstKnownAssignedXid;
+	int			high = procArray->lastKnownAssignedXid - 1;
 
-	(void) hash_search(KnownAssignedXidsHash, &xid, HASH_FIND, &found);
-	return found;
+	while (low <= high)
+	{
+		int			middle = low + (high - low) / 2;
+
+		if (xid == KnownAssignedXidsArray[middle])
+			return true;
+		else if (xid > KnownAssignedXidsArray[middle])
+			low = middle + 1;
+		else
+			high = middle - 1;
+	}
+	return false;
 }
@@ -2418,17 +2451,37 @@ KnownAssignedXidsExist(TransactionId xid)
 static void
 KnownAssignedXidsRemove(TransactionId xid)
 {
-	bool		found;
+	int			low = procArray->firstKnownAssignedXid;
+	int			high = procArray->lastKnownAssignedXid - 1;
 
 	Assert(TransactionIdIsValid(xid));
 
 	elog(trace_recovery(DEBUG4), "remove KnownAssignedXid %u", xid);
 
-	(void) hash_search(KnownAssignedXidsHash, &xid, HASH_REMOVE, &found);
-
-	if (found)
-		procArray->numKnownAssignedXids--;
-	Assert(procArray->numKnownAssignedXids >= 0);
+	while (low <= high)
+	{
+		int			middle = low + (high - low) / 2;
+
+		if (xid == KnownAssignedXidsArray[middle])
+		{
+			KnownAssignedXidsValidArray[middle] = false;
+			if (middle == procArray->lastKnownAssignedXid - 1)
+			{
+				while (procArray->lastKnownAssignedXid >= 0 && !KnownAssignedXidsValidArray[procArray->lastKnownAssignedXid - 1])
+					procArray->lastKnownAssignedXid--;
+			}
+			if (middle == procArray->firstKnownAssignedXid)
+			{
+				while (procArray->firstKnownAssignedXid < procArray->lastKnownAssignedXid && !KnownAssignedXidsValidArray[procArray->firstKnownAssignedXid])
+					procArray->firstKnownAssignedXid++;
+			}
+			return;
+		}
+		else if (xid > KnownAssignedXidsArray[middle])
+			low = middle + 1;
+		else
+			high = middle - 1;
+	}
 
 	/*
 	 * We can fail to find an xid if the xid came from a subtransaction that
@@ -2466,26 +2519,30 @@ static int
 KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
 							   TransactionId xmax)
 {
-	HASH_SEQ_STATUS status;
-	TransactionId *knownXid;
 	int			count = 0;
+	int			i;
 
-	hash_seq_init(&status, KnownAssignedXidsHash);
-	while ((knownXid = (TransactionId *) hash_seq_search(&status)) != NULL)
+	for (i = procArray->firstKnownAssignedXid; i < procArray->lastKnownAssignedXid; i++)
 	{
+		TransactionId xid = KnownAssignedXidsArray[i];
+
+		if (!KnownAssignedXidsValidArray[i])
+			continue;
+
 		/*
-		 * Filter out anything higher than xmax
+		 * Filter out anything higher than xmax. The array is sorted so
+		 * we can stop as soon as we find one that's too big
 		 */
-		if (TransactionIdPrecedes(xmax, *knownXid))
-			continue;
+		if (TransactionIdPrecedes(xmax, xid))
+			break;
 
-		*xarray = *knownXid;
+		*xarray = xid;
 		xarray++;
 		count++;
 
 		/* update xmin if required */
-		if (TransactionIdPrecedes(*knownXid, *xmin))
-			*xmin = *knownXid;
+		if (TransactionIdPrecedes(xid, *xmin))
+			*xmin = xid;
 	}
 
 	return count;
@@ -2500,34 +2557,34 @@ KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin,
 static void
 KnownAssignedXidsRemoveMany(TransactionId xid, bool keepPreparedXacts)
 {
-	TransactionId *knownXid;
-	HASH_SEQ_STATUS status;
+	int			i;
 
-	if (TransactionIdIsValid(xid))
-		elog(trace_recovery(DEBUG4), "prune KnownAssignedXids to %u", xid);
-	else
+	if (!TransactionIdIsValid(xid))
+	{
 		elog(trace_recovery(DEBUG4), "removing all KnownAssignedXids");
+		procArray->firstKnownAssignedXid = procArray->lastKnownAssignedXid = 0;
+		return;
+	}
+	elog(trace_recovery(DEBUG4), "prune KnownAssignedXids to %u", xid);
 
-	hash_seq_init(&status, KnownAssignedXidsHash);
-	while ((knownXid = (TransactionId *) hash_seq_search(&status)) != NULL)
+	for (i = procArray->firstKnownAssignedXid; i < procArray->lastKnownAssignedXid; i++)
 	{
-		TransactionId removeXid = *knownXid;
-		bool		found;
+		TransactionId removeXid = KnownAssignedXidsArray[i];
 
+		if (!KnownAssignedXidsValidArray[i])
+			continue;
 		if (!TransactionIdIsValid(xid) || TransactionIdPrecedes(removeXid, xid))
 		{
 			if (keepPreparedXacts && StandbyTransactionIdIsPrepared(removeXid))
 				continue;
-			else
-			{
-				(void) hash_search(KnownAssignedXidsHash, &removeXid,
-								   HASH_REMOVE, &found);
-				if (found)
-					procArray->numKnownAssignedXids--;
-				Assert(procArray->numKnownAssignedXids >= 0);
-			}
+			KnownAssignedXidsValidArray[i] = false;
 		}
 	}
+
+	while (procArray->lastKnownAssignedXid >= 0 && !KnownAssignedXidsValidArray[procArray->lastKnownAssignedXid - 1])
+		procArray->lastKnownAssignedXid--;
+
+	while (procArray->firstKnownAssignedXid < procArray->lastKnownAssignedXid && !KnownAssignedXidsValidArray[procArray->firstKnownAssignedXid])
+		procArray->firstKnownAssignedXid++;
 }
@@ -2538,26 +2595,21 @@ KnownAssignedXidsRemoveMany(TransactionId xid, bool keepPreparedXacts)
 static void
 KnownAssignedXidsDisplay(int trace_level)
 {
-	HASH_SEQ_STATUS status;
-	TransactionId *knownXid;
 	StringInfoData buf;
-	TransactionId *xids;
 	int			nxids;
 	int			i;
 
-	xids = palloc(sizeof(TransactionId) * TOTAL_MAX_CACHED_SUBXIDS);
-	nxids = 0;
-
-	hash_seq_init(&status, KnownAssignedXidsHash);
-	while ((knownXid = (TransactionId *) hash_seq_search(&status)) != NULL)
-		xids[nxids++] = *knownXid;
-
-	qsort(xids, nxids, sizeof(TransactionId), xidComparator);
-
 	initStringInfo(&buf);
 
-	for (i = 0; i < nxids; i++)
-		appendStringInfo(&buf, "%u ", xids[i]);
+	nxids = 0;
+	for (i = procArray->firstKnownAssignedXid; i < procArray->lastKnownAssignedXid; i++)
+	{
+		if (KnownAssignedXidsValidArray[i])
+		{
+			appendStringInfo(&buf, "%u ", KnownAssignedXidsArray[i]);
+			nxids++;
+		}
+	}
 
 	elog(trace_level, "%d KnownAssignedXids %s", nxids, buf.data);
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index c5c1228..eee0d58 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -203,7 +203,7 @@
  * Enable debugging print statements for WAL-related operations; see
  * also the wal_debug GUC var.
  */
-/* #define WAL_DEBUG */
+#define WAL_DEBUG
 
 /*
  * Enable tracing of resource consumption during sort operations;
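To see the shape of the data structure without the surrounding PostgreSQL
machinery, here is a self-contained toy model of the sorted-array-plus-
valid-flags idea from the patch (plain unsigned integers instead of
TransactionIds, no locking, no wraparound handling; an illustration only,
not the patch itself):

  #include <stdbool.h>
  #include <stdio.h>

  #define MAX_XIDS 8

  static unsigned xids[MAX_XIDS];  /* sorted, append-only until compaction */
  static bool     valid[MAX_XIDS];
  static int      first, last;     /* valid window: [first, last) */

  static void add(unsigned xid)
  {
      if (last == MAX_XIDS)        /* full: compact valid entries leftwards */
      {
          int j, k = 0;
          for (j = first; j < last; j++)
              if (valid[j])
              {
                  xids[k] = xids[j];
                  valid[k] = true;
                  k++;
              }
          first = 0;
          last = k;
      }
      xids[last] = xid;            /* xids arrive in increasing order */
      valid[last] = true;
      last++;
  }

  static void remove_xid(unsigned xid)
  {
      int low = first, high = last - 1;
      while (low <= high)          /* binary search, then mark invalid */
      {
          int mid = low + (high - low) / 2;
          if (xids[mid] == xid) { valid[mid] = false; return; }
          if (xids[mid] < xid) low = mid + 1; else high = mid - 1;
      }
  }

  int main(void)
  {
      int i;
      for (i = 0; i < 12; i++)     /* forces one compaction along the way */
      {
          add(100 + i);
          if (i % 2 == 0)
              remove_xid(100 + i); /* half the entries "commit" right away */
      }
      for (i = first; i < last; i++)
          if (valid[i])
              printf("%u ", xids[i]);  /* the still-in-progress xids */
      printf("\n");
      return 0;
  }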
On Fri, 2010-04-16 at 11:29 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> On Tue, 2010-04-13 at 21:09 +0300, Heikki Linnakangas wrote:
>>> A quick fix would be to check if there are any entries in the hash
>>> table before scanning it. That would eliminate the overhead when there
>>> are no in-progress transactions in the master. But as soon as there's
>>> even one, the overhead comes back.
>>
>> Any fix should be fairly quick because of the way it's modularised -
>> with something like this in mind.
>>
>> I'll try a circular buffer implementation, with fastpath.
>
> I started experimenting with a sorted-array-based implementation on
> Tuesday but got carried away with other stuff. I now got back to that
> and cleaned it up.
>
> How does the attached patch look? It's probably similar to what you had
> in mind.

It looks like a second version of what I'm working on and am about to
publish. I'll take that as a compliment!

My patch is attached here also, for discussion.

The two patches look the same in their main parts, though I have quite a
few extra tweaks in there, which you can read about in the comments. One
tweak I don't have is the use of the presence array that allows a sensible
bsearch, so I'll alter my patch to use that idea but keep the rest of my
code.

--
Simon Riggs           www.2ndQuadrant.com
Simon Riggs wrote:
> On Fri, 2010-04-16 at 11:29 +0300, Heikki Linnakangas wrote:
>> How does the attached patch look? It's probably similar to what you had
>> in mind.
>
> It looks like a second version of what I'm working on and am about to
> publish. I'll take that as a compliment!
>
> My patch is attached here also, for discussion.
>
> The two patches look the same in their main parts, though I have quite a
> few extra tweaks in there, which you can read about in the comments.

Yeah. Yours seems a lot more complex with all those extra tweaks; I would
suggest keeping it simple. I did realize one bug in my patch: I didn't
handle xid wraparound correctly in the binary search; I need to use
TransactionIdFollows instead of plain >.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Fri, 2010-04-16 at 14:47 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> On Fri, 2010-04-16 at 11:29 +0300, Heikki Linnakangas wrote:
>>> How does the attached patch look? It's probably similar to what you
>>> had in mind.
>>
>> It looks like a second version of what I'm working on and am about to
>> publish. I'll take that as a compliment!
>>
>> My patch is attached here also, for discussion.
>>
>> The two patches look the same in their main parts, though I have quite
>> a few extra tweaks in there, which you can read about in the comments.
>
> Yeah. Yours seems a lot more complex with all those extra tweaks; I
> would suggest keeping it simple. I did realize one bug in my patch: I
> didn't handle xid wraparound correctly in the binary search; I need to
> use TransactionIdFollows instead of plain >.

Almost done; yes, much simpler. I wrote a lot of that in the wee small
hours last night, so the difference is amusing.

And I spotted that bug, plus the off-by-one error also. I've just
rewritten all the other parts, so no worries.

--
Simon Riggs           www.2ndQuadrant.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> I didn't handle xid wraparound correctly in the binary search; I need to
> use TransactionIdFollows instead of plain >.

I think you're outsmarting yourself there. A binary search will in fact
*not work* with circular xid comparison (this is exactly why there's no
btree opclass for XID). You need to use plain >, and make sure the array
you're searching is ordered that way too. The other way might accidentally
fail to malfunction if you only tested ranges of XIDs that weren't long
enough to wrap around, but that doesn't make it right.

			regards, tom lane
On Fri, 2010-04-16 at 10:39 -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> I didn't handle xid wraparound correctly in the binary search; I need
>> to use TransactionIdFollows instead of plain >.
>
> I think you're outsmarting yourself there. A binary search will in fact
> *not work* with circular xid comparison (this is exactly why there's no
> btree opclass for XID). You need to use plain >, and make sure the array
> you're searching is ordered that way too. The other way might
> accidentally fail to malfunction if you only tested ranges of XIDs that
> weren't long enough to wrap around, but that doesn't make it right.

I don't understand the exact problem; please explain more. I'm not using
bsearch(), just a quick chop based upon xid comparison, which looks to me
like it will work.

--
Simon Riggs           www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Fri, 2010-04-16 at 10:39 -0400, Tom Lane wrote:
>> I think you're outsmarting yourself there. A binary search will in fact
>> *not work* with circular xid comparison (this is exactly why there's no
>> btree opclass for XID).

> I don't understand the exact problem; please explain more.
> I'm not using bsearch(), just a quick chop based upon xid comparison,
> which looks to me like it will work.

Implementing it yourself doesn't get you out of the fact that it won't
work. Consider

	1 2 3 ... 3000000000 ... 3999999999

and suppose that 3000000000 is the array's middle element. If you search
for 100, your first probe will conclude that it is to the right of
3000000000, which is the wrong direction. Binary search, or indeed the
entire concept that the array is "sorted" in the first place, depends on a
transitive comparison operator. TransactionIdFollows does not satisfy the
transitive law.

			regards, tom lane
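Tom's point can be checked directly. The comparison below has the same
shape as PostgreSQL's TransactionIdPrecedes for normal xids (a signed
32-bit difference); the three sample values are chosen to show that the
relation is not transitive, so it cannot order an array:

  #include <stdio.h>
  #include <stdint.h>

  /* "xid1 is earlier than xid2" in modulo-2^32 space */
  static int precedes(uint32_t id1, uint32_t id2)
  {
      int32_t diff = (int32_t) (id1 - id2);
      return diff < 0;
  }

  int main(void)
  {
      uint32_t a = 1, b = 1500000000u, c = 3000000000u;

      printf("a<b: %d\n", precedes(a, b));  /* 1 */
      printf("b<c: %d\n", precedes(b, c));  /* 1 */
      printf("a<c: %d\n", precedes(a, c));  /* 0 -- transitivity fails */
      return 0;
  }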
On Fri, 2010-04-16 at 13:00 +0100, Simon Riggs wrote:
> Almost done

Here's the finished patch. Internal changes only, no functional or user
visible changes, so no docs, just detailed explanatory comments.

Expectation is
* performance no longer depends upon max_connections
* recovery will be about same or slightly faster
* snapshots should be about equivalent primary/standby

Main changes are
* lock free add to sorted array (equivalent to GetNewTransactionId)
* bsearch for xid removal at commit/abort/TransactionIdIsInProgress
* faster, more compact approach to snapshots, with self-trimming
* some minor API changes to facilitate above
* new code to ignore deletion failures only when !consistent

Erik,

Could you retest please?

--
Simon Riggs           www.2ndQuadrant.com
On Fri, 2010-04-16 at 11:10 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On Fri, 2010-04-16 at 10:39 -0400, Tom Lane wrote:
>>> I think you're outsmarting yourself there. A binary search will in
>>> fact *not work* with circular xid comparison (this is exactly why
>>> there's no btree opclass for XID). You need to use plain >, and make
>>> sure the array you're searching is ordered that way too. The other way
>>> might accidentally fail to malfunction if you only tested ranges of
>>> XIDs that weren't long enough to wrap around, but that doesn't make it
>>> right.
>
>> I don't understand the exact problem; please explain more.
>> I'm not using bsearch(), just a quick chop based upon xid comparison,
>> which looks to me like it will work.
>
> Implementing it yourself doesn't get you out of the fact that it won't
> work. Consider
>
>	1 2 3 ... 3000000000 ... 3999999999
>
> and suppose that 3000000000 is the array's middle element. If you search
> for 100, your first probe will conclude that it is to the right of
> 3000000000, which is the wrong direction. Binary search, or indeed the
> entire concept that the array is "sorted" in the first place, depends on
> a transitive comparison operator. TransactionIdFollows does not satisfy
> the transitive law.

AFAICS the example you give isn't correct. We would lay out the values
like this

	W-3 W-2 W-1 W 3 4 5

where W is the wrap value, and in this example we have 7 values in the
current array, with the tail at W-3 and the head at 5. Note the gap
between W and 3, where we would skip the values 1 and 2 because they are
special. Each element's xid is TransactionIdAdvance(previous element).

So when we search for value 3 we would start from W, then decide it is to
the right, which is correct, and continue from there.

The values are laid out in TransactionIdFollows order, not in numeric
order, hence we need to use TransactionIdFollows to decide which way to
branch. As long as it works I'm not worried if the array is not
technically "sorted".

--
Simon Riggs           www.2ndQuadrant.com
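For readers following along, the "special" values and the skip behaviour
come from the xid advancement macro; this is its 9.0-era shape as best
recalled from src/include/access/transam.h (quoted from memory, so treat
it as illustrative):

  #define InvalidTransactionId		((TransactionId) 0)
  #define BootstrapTransactionId	((TransactionId) 1)
  #define FrozenTransactionId		((TransactionId) 2)
  #define FirstNormalTransactionId	((TransactionId) 3)

  /* advance a transaction ID variable, handling wraparound correctly */
  #define TransactionIdAdvance(dest) \
  	do { \
  		(dest)++; \
  		if ((dest) < FirstNormalTransactionId) \
  			(dest) = FirstNormalTransactionId; \
  	} while (0)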
Simon Riggs <simon@2ndQuadrant.com> writes:
> AFAICS the example you give isn't correct.

> We would lay out the values like this

>	W-3 W-2 W-1 W 3 4 5

> where W is the wrap value

Stop right there, you're already failing to think clearly. There is no
unique "wrap value"; all values act the same in circular XID space.

The fundamental problem here is that if the set of XIDs involved spans a
sufficiently long range, your array will have a[0] < a[1] < ... < a[n],
but it will fail to be true that a[0] < a[n]. If that's also true for,
say, a[0] vs the midpoint element, then a binary search for a[0] will fail
because it will make the wrong decision while probing the midpoint.

> The values are laid out in TransactionIdFollows order,

They *cannot* be "laid out in TransactionIdFollows order". It's not
possible, because that relationship isn't a total ordering.

Now it might be the case that this is OK for HS purposes because the set
of XIDs that are relevant at any one instant should never span more than
half of the XID space. But I'd just as soon not use that assumption here
if it's unnecessary. It'd be cheaper anyway to sort and search the array
using plain <, so why are you so eager to use TransactionIdFollows?

			regards, tom lane
On Sat, Apr 17, 2010 at 11:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: >> AFAICS the example you give isn't correct. > >> We would lay out the values like this > >> W-3 W-2 W-1 W 3 4 5 > >> where W is the wrap value > > Stop right there, you're already failing to think clearly. There is no > unique "wrap value", all values act the same in circular XID space. Or to put it a different way, what happens when the "wrap value" changes? An array that was in order under one wrap value can cease to be in order under another. ...Robert
On Sat, 2010-04-17 at 11:13 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > AFAICS the example you give isn't correct. > > > We would lay out the values like this > > > W-3 W-2 W-1 W 3 4 5 > > > where W is the wrap value > > Stop right there, you're already failing to think clearly. There is no > unique "wrap value", all values act the same in circular XID space. > > The fundamental problem here is that if the set of XIDs involved spans a > sufficiently long range, your array will have a[0] < a[1] < ... < a[n] > but it will fail to be true that a[0] < a[n]. If that's also true for, > say, a[0] vs the midpoint element, then a binary search for a[0] will > fail because it will make the wrong decision while probing the midpoint. I understand that the xids are circular, but there is only one point where the next xid has a lower integer value but is still "later" from the perspective of TransactionIdFollows. Apart from that single discontinuity all other values are monotonic. From your insistence I presume I must be missing something, but I just don't see it. Perhaps we are just misunderstanding each other about the "sufficiently long range". > > The values are laid out in TransactionIdFollows order, > > They *cannot* be "laid out in TransactionIdFollows order". It's not > possible, because that relationship isn't a total ordering. > > Now it might be the case that this is OK for HS purposes because the set > of XIDs that are relevant at any one instant should never span more than > half of the XID space. Agree that is true. > But I'd just as soon not use that assumption > here if it's unnecessary. I understand what you're saying. I think it is necessary here. > It'd be cheaper anyway to sort and search the > array using plain <, so why are you so eager to use > TransactionIdFollows? The array grows to the right and is laid out one xid per element, resulting in a sequence of values that are transactionid-monotonic. As the values increase they will eventually reach the discontinuity where they cease being normally monotonic. Doing it this way means that we can add rows past the head of the array and then move the head atomically, so that we can make adding xids lock-free. If we try to actually sort the values then the algorithm is both more complex and requires locking. It would be easier to just remember where the discontinuity is and act accordingly. So I'm not eager to use either way, but I only currently see one way that would work. If there's a different way, that gives the same or better algorithmic characteristics, I will be happy to code it. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes: > On Sat, 2010-04-17 at 11:13 -0400, Tom Lane wrote: >> It'd be cheaper anyway to sort and search the >> array using plain <, so why are you so eager to use >> TransactionIdFollows? > The array grows to the right and is laid out one xid per element, > resulting in a sequence of values that are transactionid-monotonic. How do you know that just adding items at the right will produce a sorted array? It seems quite unlikely to me that transactions can be guaranteed to arrive at this code in XID order. I think you need to do an explicitly sorted insertion. > ... Doing it this way means that we can > add rows past the head of the array and then move the head atomically, > so that we can make adding xids lock-free. ... and even without that issue, this seems like utter fantasy. How are you going to do that "atomically"? Have you considered what will happen on weak-memory-ordering machines like PPC, in particular? regards, tom lane
On Sat, 2010-04-17 at 15:45 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Sat, 2010-04-17 at 11:13 -0400, Tom Lane wrote: > >> It'd be cheaper anyway to sort and search the > >> array using plain <, so why are you so eager to use > >> TransactionIdFollows? > > > The array grows to the right and is laid out one xid per element, > > resulting in a sequence of values that are transactionid-monotonic. > > How do you know that just adding items at the right will produce a > sorted array? It seems quite unlikely to me that transactions can be > guaranteed to arrive at this code in XID order. I think you need to do > an explicitly sorted insertion. Xids don't arrive in sequence, but "known assigned xids" are added in sequence because we infer the existence of the intermediate xids and assume they are running for the snapshot. > > ... Doing it this way means that we can > > add rows past the head of the array and then move the head atomically, > > so that we can make adding xids lock-free. > > ... and even without that issue, this seems like utter fantasy. How > are you going to do that "atomically"? Have you considered what will > happen on weak-memory-ordering machines like PPC, in particular? We search the array between tail and head. If the head moves by integer overwrite just as already happens for xid assignment, then we would use the new head for the search. The code is careful to fetch only once. I would freely admit I know absolutely nothing about details of weak-memory-ordering machines and have not considered them at all. How would what I have proposed fail to work, yet what we already rely on work correctly? Do the circumstances differ? -- Simon Riggs www.2ndQuadrant.com
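A hedged sketch of the "added in sequence" mechanism Simon describes. TransactionIdAdvance and TransactionIdPrecedes are the real macros; the function body, and names like latestObservedXid and KnownAssignedXidsAdd, are illustrative rather than the patch text:

    /* Called by the startup process for each xid observed in WAL.
     * Any xids in the gap since the last observed one are inferred
     * to exist and recorded as still running, so entries always
     * arrive in ascending (circular) xid order. */
    static void
    record_known_assigned_xids(TransactionId xid)
    {
        TransactionId next = latestObservedXid;

        TransactionIdAdvance(next);         /* skips the special low xids */
        while (TransactionIdPrecedes(next, xid))
        {
            KnownAssignedXidsAdd(next);     /* inferred, assumed running */
            TransactionIdAdvance(next);
        }
        KnownAssignedXidsAdd(xid);          /* the xid actually observed */
        latestObservedXid = xid;
    }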
Simon Riggs <simon@2ndQuadrant.com> writes: > On Sat, 2010-04-17 at 15:45 -0400, Tom Lane wrote: >> How do you know that just adding items at the right will produce a >> sorted array? > Xids don't arrive in sequence, but "known assigned xids" are added in > sequence because we infer the existence of the intermediate xids and > assuming they are running for the snapshot. Hm. Okay, maybe that will work. >> ... and even without that issue, this seems like utter fantasy. How >> are you going to do that "atomically"? Have you considered what will >> happen on weak-memory-ordering machines like PPC, in particular? > We search the array between tail and head. If the head moves by integer > overwrite just as already happens for xid assignment, then we would use > the new head for the search. The code is careful to fetch only once. ... but this will not. You need to use a lock, because there is otherwise no guarantee that other processors see the write into the array element before they see the change in the head pointer. > I would freely admit I know absolutely nothing about details of > weak-memory-ordering machines and have not considered them at all. How > would what I have proposed fail to work, yet what we already rely on > work correctly? Do the circumstances differ? Yes. We have memory ordering instructions inserted in the lock acquisition/release code. Trying to access and modify a shared-memory data structure without any locking will not work. There are some places where we suppose that a *single* write into shared memory can safely be done without a lock, if we're not too concerned about how soon other transactions will see the effects. But what you are proposing here requires more than one related write. I've been burnt by this myself: http://archives.postgresql.org/pgsql-committers/2008-06/msg00228.php regards, tom lane
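A sketch of the failure mode Tom describes, with made-up variable names; the spinlock fix shown is what the thread converges on below (SpinLockAcquire/SpinLockRelease are the real primitives):

    /* Lock-free publication -- UNSAFE on weak memory ordering: */
    KnownAssignedXids[head] = xid;      /* (1) store the element    */
    headPointer = head + 1;             /* (2) publish the new head */

    /* A reader on another CPU may observe (2) before (1), and then
     * search over a slot that still contains a stale value.
     * Wrapping the related writes in a spinlock fixes this, because
     * the lock acquire/release code includes the necessary memory
     * ordering instructions: */
    SpinLockAcquire(&known_assigned_xids_lck);
    KnownAssignedXids[head] = xid;
    headPointer = head + 1;
    SpinLockRelease(&known_assigned_xids_lck);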
On Sat, 2010-04-17 at 16:48 -0400, Tom Lane wrote: > > We search the array between tail and head. If the head moves by integer > > overwrite just as already happens for xid assignment, then we would use > > the new head for the search. The code is careful to fetch only once. > > ... but this will not. You need to use a lock, because there is > otherwise no guarantee that other processors see the write into the > array element before they see the change in the head pointer. > > > I would freely admit I know absolutely nothing about details of > > weak-memory-ordering machines and have not considered them at all. How > > would what I have proposed fail to work, yet what we already rely on > > work correctly? Do the circumstances differ? > > Yes. We have memory ordering instructions inserted in the lock > acquisition/release code. Trying to access and modify a shared-memory > data structure without any locking will not work. > > There are some places where we suppose that a *single* write into shared > memory can safely be done without a lock, if we're not too concerned > about how soon other transactions will see the effects. But what you > are proposing here requires more than one related write. > > I've been burnt by this myself: > http://archives.postgresql.org/pgsql-committers/2008-06/msg00228.php W O W - thank you for sharing. What I'm not clear on is why you've used a spinlock everywhere when only weak-memory thang CPUs are a problem. Why not have a weak-memory-protect macro that does nada when the hardware already protects us? (i.e. a spinlock only for the hardware that needs it). -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes: > What I'm not clear on is why you've used a spinlock everywhere when only > weak-memory thang CPUs are a problem. Why not have a weak-memory-protect > macro that does does nada when the hardware already protects us? (i.e. a > spinlock only for the hardware that needs it). Well, we could certainly consider that, if we had enough places where there was a demonstrable benefit from it. I couldn't measure any real slowdown from adding a spinlock in that sinval code, so I didn't propose doing so at the time --- and I'm pretty dubious that this code is sufficiently performance-critical to justify the work, either. regards, tom lane
On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > What I'm not clear on is why you've used a spinlock everywhere when only > > weak-memory thang CPUs are a problem. Why not have a weak-memory-protect > > macro that does does nada when the hardware already protects us? (i.e. a > > spinlock only for the hardware that needs it). > > Well, we could certainly consider that, if we had enough places where > there was a demonstrable benefit from it. I couldn't measure any real > slowdown from adding a spinlock in that sinval code, so I didn't propose > doing so at the time --- and I'm pretty dubious that this code is > sufficiently performance-critical to justify the work, either. OK, I'll put a spinlock around access to the head of the array. Thanks for your input. -- Simon Riggs www.2ndQuadrant.com
On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote: > On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote: > > Simon Riggs <simon@2ndQuadrant.com> writes: > > > What I'm not clear on is why you've used a spinlock everywhere when only > > > weak-memory thang CPUs are a problem. Why not have a weak-memory-protect > > > macro that does does nada when the hardware already protects us? (i.e. a > > > spinlock only for the hardware that needs it). > > > > Well, we could certainly consider that, if we had enough places where > > there was a demonstrable benefit from it. I couldn't measure any real > > slowdown from adding a spinlock in that sinval code, so I didn't propose > > doing so at the time --- and I'm pretty dubious that this code is > > sufficiently performance-critical to justify the work, either. > > OK, I'll put a spinlock around access to the head of the array. v2 patch attached -- Simon Riggs www.2ndQuadrant.com
On Sun, Apr 18, 2010 at 12:01:05PM +0100, Simon Riggs wrote: > On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote: > > On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote: > > > Simon Riggs <simon@2ndQuadrant.com> writes: > > > > What I'm not clear on is why you've used a spinlock everywhere > > > > when only weak-memory thang CPUs are a problem. Why not have a > > > > weak-memory-protect macro that does does nada when the > > > > hardware already protects us? (i.e. a spinlock only for the > > > > hardware that needs it). > > > > > > Well, we could certainly consider that, if we had enough places > > > where there was a demonstrable benefit from it. I couldn't > > > measure any real slowdown from adding a spinlock in that sinval > > > code, so I didn't propose doing so at the time --- and I'm > > > pretty dubious that this code is sufficiently > > > performance-critical to justify the work, either. > > > > OK, I'll put a spinlock around access to the head of the array. > > v2 patch attached If you've committed this, or any other patch you've sent here, *please* mention so on the same thread. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Sun, 2010-04-18 at 13:16 -0700, David Fetter wrote: > On Sun, Apr 18, 2010 at 12:01:05PM +0100, Simon Riggs wrote: > > On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote: > > > On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote: > > > > Simon Riggs <simon@2ndQuadrant.com> writes: > > > > > What I'm not clear on is why you've used a spinlock everywhere > > > > > when only weak-memory thang CPUs are a problem. Why not have a > > > > > weak-memory-protect macro that does does nada when the > > > > > hardware already protects us? (i.e. a spinlock only for the > > > > > hardware that needs it). > > > > > > > > Well, we could certainly consider that, if we had enough places > > > > where there was a demonstrable benefit from it. I couldn't > > > > measure any real slowdown from adding a spinlock in that sinval > > > > code, so I didn't propose doing so at the time --- and I'm > > > > pretty dubious that this code is sufficiently > > > > performance-critical to justify the work, either. > > > > > > OK, I'll put a spinlock around access to the head of the array. > > > > v2 patch attached > > If you've committed this, or any other patch you've sent here, > *please* mention so on the same thread. I haven't yet. I've written two patches - this is a major module rewrite and is still under discussion. The other patch has nothing to do with this (though I did accidentally include a couple of changes from this patch and immediately reverted them). I will wait awhile to see if anybody has some independent test results. -- Simon Riggs www.2ndQuadrant.com
On Sun, Apr 18, 2010 at 09:22:21PM +0100, Simon Riggs wrote: > On Sun, 2010-04-18 at 13:16 -0700, David Fetter wrote: > > On Sun, Apr 18, 2010 at 12:01:05PM +0100, Simon Riggs wrote: > > > On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote: > > > > On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote: > > > > > Simon Riggs <simon@2ndQuadrant.com> writes: > > > > > > What I'm not clear on is why you've used a spinlock > > > > > > everywhere when only weak-memory thang CPUs are a problem. > > > > > > Why not have a weak-memory-protect macro that does does > > > > > > nada when the hardware already protects us? (i.e. a > > > > > > spinlock only for the hardware that needs it). > > > > > > > > > > Well, we could certainly consider that, if we had enough > > > > > places where there was a demonstrable benefit from it. I > > > > > couldn't measure any real slowdown from adding a spinlock in > > > > > that sinval code, so I didn't propose doing so at the time > > > > > --- and I'm pretty dubious that this code is sufficiently > > > > > performance-critical to justify the work, either. > > > > > > > > OK, I'll put a spinlock around access to the head of the > > > > array. > > > > > > v2 patch attached > > > > If you've committed this, or any other patch you've sent here, > > *please* mention so on the same thread. > > I haven't yet. I've written two patches - this is a major module > rewrite and is still under discussion. The other patch has nothing > to do with this (though I did accidentally include a couple of > changes from this patch and immediately revoked them). > > I will wait awhile to see if anybody has some independent test > results. Thanks :) Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Sun, Apr 18, 2010 at 4:22 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sun, 2010-04-18 at 13:16 -0700, David Fetter wrote: >> On Sun, Apr 18, 2010 at 12:01:05PM +0100, Simon Riggs wrote: >> > On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote: >> > > On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote: >> > > > Simon Riggs <simon@2ndQuadrant.com> writes: >> > > > > What I'm not clear on is why you've used a spinlock everywhere >> > > > > when only weak-memory thang CPUs are a problem. Why not have a >> > > > > weak-memory-protect macro that does does nada when the >> > > > > hardware already protects us? (i.e. a spinlock only for the >> > > > > hardware that needs it). >> > > > >> > > > Well, we could certainly consider that, if we had enough places >> > > > where there was a demonstrable benefit from it. I couldn't >> > > > measure any real slowdown from adding a spinlock in that sinval >> > > > code, so I didn't propose doing so at the time --- and I'm >> > > > pretty dubious that this code is sufficiently >> > > > performance-critical to justify the work, either. >> > > >> > > OK, I'll put a spinlock around access to the head of the array. >> > >> > v2 patch attached >> >> If you've committed this, or any other patch you've sent here, >> *please* mention so on the same thread. > > I haven't yet. I've written two patches - this is a major module rewrite > and is still under discussion. The other patch has nothing to do with > this (though I did accidentally include a couple of changes from this > patch and immediately revoked them). > > I will wait awhile to see if anybody has some independent test results. So, does anyone have a few cycles to test this out? We are down to a handful of remaining open items, so getting this tested and committed sooner = beta sooner. ...Robert
Robert Haas wrote: > > So, does anyone have a few cycles to test this out? We are down to > handful of remaining open items, so getting this tested and committed > sooner = beta sooner. > > > I did some testing of this patch (v2). Unfortunately I don't have access to hardware capable of doing tests at the same scale as Erik used. However I was still able to show a consistent difference (I think) between standby performance with and without the patch applied. Setup: host: 2.7 GHz dual-core amd64 with 4G ram and 1 sata drive, code: cvs head from 2010-04-14. pgbench: scale=100, 4 clients, 10000 (select) transactions each. Results: Master performance (with and without patch applied): tps = 10903.612340 - 14070.109951 (including connections establishing) Standby performance without patch: tps = 8288.119913 - 9722.245178 (including connections establishing) Standby performance with patch applied: tps = 11592.922637 - 14065.553214 (including connections establishing) I performed 8 runs of each, and results would start at the low range and climb up to the high one, where they would stabilize. In between runs I cleared the OS buffer cache and (partially) reloaded it by selecting counts from the pgbench tables (i.e. I was trying to ensure each run had the same or similar OS cache contents). Overall it looks like the patch gets standby read-only performance close to the master - at least in the case where there are minimal master transactions being tracked by the standby (I had to leave the master idle whilst running the standby case, as they shared the machine). Hope this info is useful. regards Mark
On Tue, Apr 20, 2010 at 11:09 PM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote: > Robert Haas wrote: >> >> So, does anyone have a few cycles to test this out? We are down to >> handful of remaining open items, so getting this tested and committed >> sooner = beta sooner. >> >> >> > > I did some testing of this patch (v2). Unfortunately I don't have access to > hardware capable of doing tests at the same scale as Erik used. However I > was still able to show a consistent difference (I think) between standby > performance with and without the patch applied. > > Setup: > > host: 2.7 Ghz dual core amd64 with 4G ram and 1 sata drive, > code: cvs head from 2010-04-14. > pgbench: scale=100, 4 clients, 10000 (select) transactions each. > > Results: > > Master performance (with and without patch applied ): > tps = 10903.612340 - 14070.109951 (including connections establishing) > > Standby performance without patch (: > tps = 8288.119913 - 9722.245178 (including connections establishing) > > Standby performance with patch applied: > tps = 11592.922637 - 14065.553214 (including connections establishing) > > I performed 8 runs of each, and results would start at the low range and > climb up to the high one, where they would stabilize. In between runs I > cleared the os buffer cache and (partially) reloaded it by selecting counts > from the pgbench tables (i.e I was trying to ensure each run had the same or > similar os cache contents). > > Overall looks like the patch gets standby read only performance close to the > master - at least in the case where there are minimal master transactions > being tracked by the standby (I had to leave the master idle whilst running > the standby case, as they shared the machine). Hope this info is useful. Thanks, that sounds promising. ...Robert
On Wed, 2010-04-21 at 15:09 +1200, Mark Kirkwood wrote: > I did some testing of this patch (v2). Unfortunately I don't have access > to hardware capable of doing tests at the same scale as Erik used. > However I was still able to show a consistent difference (I think) > between standby performance with and without the patch applied. ... > Overall looks like the patch gets standby read only performance close to > the master - at least in the case where there are minimal master > transactions being tracked by the standby (I had to leave the master > idle whilst running the standby case, as they shared the machine). Hope > this info is useful. Thanks very much for the report; always good to get confirmation. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote: > On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote: >> On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote: >>> Simon Riggs <simon@2ndQuadrant.com> writes: >>>> What I'm not clear on is why you've used a spinlock everywhere when only >>>> weak-memory thang CPUs are a problem. Why not have a weak-memory-protect >>>> macro that does does nada when the hardware already protects us? (i.e. a >>>> spinlock only for the hardware that needs it). >>> Well, we could certainly consider that, if we had enough places where >>> there was a demonstrable benefit from it. I couldn't measure any real >>> slowdown from adding a spinlock in that sinval code, so I didn't propose >>> doing so at the time --- and I'm pretty dubious that this code is >>> sufficiently performance-critical to justify the work, either. >> OK, I'll put a spinlock around access to the head of the array. > > v2 patch attached The locking seems overly complex to me. Looking at KnownAssignedXidsAdd(), isn't it possible for two processes to call it concurrently in exclusive_lock==false mode and get the same 'head' value, and step on each other's toes? I guess KnownAssignedXidsAdd() is only called by the startup process, but it seems like an accident waiting to happen. Spinlocks are fast; if you have to add an if-branch to decide whether to acquire one, I suspect you've lost any performance gain to be had already. Let's keep it simple. And acquiring ProcArrayLock in exclusive mode while adding entries to the array seems OK to me as well. It only needs to be held very briefly, and getting this correct and keeping it simple is much more important at this point than squeezing out every possible CPU cycle from the system. Just require that the caller holds ProcArrayLock in exclusive mode when calling an operation that modifies the array, and in shared mode when reading it. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote: > On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote: >> On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote: >>> Simon Riggs <simon@2ndQuadrant.com> writes: >>>> What I'm not clear on is why you've used a spinlock everywhere when only >>>> weak-memory thang CPUs are a problem. Why not have a weak-memory-protect >>>> macro that does does nada when the hardware already protects us? (i.e. a >>>> spinlock only for the hardware that needs it). >>> Well, we could certainly consider that, if we had enough places where >>> there was a demonstrable benefit from it. I couldn't measure any real >>> slowdown from adding a spinlock in that sinval code, so I didn't propose >>> doing so at the time --- and I'm pretty dubious that this code is >>> sufficiently performance-critical to justify the work, either. >> OK, I'll put a spinlock around access to the head of the array. > > v2 patch attached Given the discussion about the cyclic nature of XIDs, it would be good to add an assertion that when a new XID is added to the array, it is a) larger than the biggest value already in the array (TransactionIdFollows(new, head)), and b) not too far from the smallest value in the array to confuse binary search (TransactionIdFollows(new, tail)). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
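A hedged sketch of the two assertions Heikki suggests; the array and index names are illustrative, not the patch's:

    /* On adding new_xid to the known-assigned-xids array: */
    if (head > tail)    /* array non-empty */
    {
        /* a) the new entry sorts after everything already present */
        Assert(TransactionIdFollows(new_xid, KnownAssignedXids[head - 1]));

        /* b) the new entry also follows the oldest entry, i.e. the
         * array spans less than half the XID space; because the
         * comparison is not transitive, (a) alone does not imply this. */
        Assert(TransactionIdFollows(new_xid, KnownAssignedXids[tail]));
    }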
On Wed, 2010-04-21 at 15:20 +0300, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote: > >> On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote: > >>> Simon Riggs <simon@2ndQuadrant.com> writes: > >>>> What I'm not clear on is why you've used a spinlock everywhere when only > >>>> weak-memory thang CPUs are a problem. Why not have a weak-memory-protect > >>>> macro that does does nada when the hardware already protects us? (i.e. a > >>>> spinlock only for the hardware that needs it). > >>> Well, we could certainly consider that, if we had enough places where > >>> there was a demonstrable benefit from it. I couldn't measure any real > >>> slowdown from adding a spinlock in that sinval code, so I didn't propose > >>> doing so at the time --- and I'm pretty dubious that this code is > >>> sufficiently performance-critical to justify the work, either. > >> OK, I'll put a spinlock around access to the head of the array. > > > > v2 patch attached > > The locking seems overly complex to me. > Looking at > KnownAssignedXidsAdd(), isn't it possible for two processes to call it > concurrently in exclusive_lock==false mode and get the same 'head' > value, and step on each others toes? I guess KnownAssignedXidsAdd() is > only called by the startup process, but it seems like an accident > waiting to happen. Not at all. That assumption is also used elsewhere, so it is safe to use here. > Spinlocks are fast, if you have to add an if-branch to decide whether to > acquire it, I suspect you've lost any performance gain to be had > already. I think you're misreading the code. One caller already holds exclusive lock, one does not. The if test is to determine whether to acquire the lock or not. > Let's keep it simple. And acquiring ProcArrayLock in exclusive > mode while adding entries to the array seems OK to me as well. It only > needs to be held very briefly, and getting this correct and keeping it > simple is much more important at this point than squeezing out every > possible CPU cycle from the system. I don't understand what you're saying: you say I'm wasting a few cycles on an if test and should change that, but at the same time you say I shouldn't worry about a few cycles. > Just require that the caller holds ProcArrayLock in exclusive mode when > calling an operation that modifies the array, and in shared mode when > reading it. The point of the code you're discussing is to remove the exclusive lock requests, not to save a few cycles. Spinlocks are fast, as you say. Exclusive lock requests block under a heavy load of shared lock holders, we know that already. It is worth removing the contention that can occur by minimising the number of exclusive locks required. This patch shows how and I don't see any reason from what you have said to avoid committing it. I'm willing to hear some sound reasons, if any exist. -- Simon Riggs www.2ndQuadrant.com
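For readers without the patch at hand, a rough sketch of the locking shape being debated here, using the names from the quoted comment (illustrative only, not the patch text). The if-branch Heikki objects to, and Simon defends, is the conditional spinlock:

    static void
    KnownAssignedXidsAdd(TransactionId xid, bool exclusive_lock)
    {
        /* The startup process is the only writer.  If it already holds
         * ProcArrayLock exclusively, readers are locked out and the
         * spinlock is unnecessary; if it holds only a shared lock, the
         * head update is serialized against readers by the spinlock. */
        if (!exclusive_lock)
            SpinLockAcquire(&procArray->known_assigned_xids_lck);

        KnownAssignedXids[head] = xid;
        head++;             /* publish the new entry */

        if (!exclusive_lock)
            SpinLockRelease(&procArray->known_assigned_xids_lck);
    }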
On Wed, 2010-04-21 at 15:27 +0300, Heikki Linnakangas wrote: > Given the discussion about the cyclic nature of XIDs, it would be good > to add an assertion that when a new XID is added to the array, it is > > a) larger than the biggest value already in the array > (TransactionIdFollows(new, head)), and > b) not too far from the smallest value in the array to confuse binary > search (TransactionIdFollows(new, tail)). We discussed this in November. You convinced me it isn't possible for older xids to stay in the standby because anti-wraparound vacuums would conflict and kick them out. The primary can't run with wrapped xids and neither can the standby. I think that is correct. Adding an assertion isn't going to do much because it's unlikely anybody is going to be running for 2^31 transactions with asserts enabled. Worrying about things like this seems strange when real and negative behaviours are right in our faces elsewhere. Performance and scalability are real world concerns. -- Simon Riggs www.2ndQuadrant.com
On Wed, Apr 21, 2010 at 9:37 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, 2010-04-21 at 15:27 +0300, Heikki Linnakangas wrote: > >> Given the discussion about the cyclic nature of XIDs, it would be good >> to add an assertion that when a new XID is added to the array, it is >> >> a) larger than the biggest value already in the array >> (TransactionIdFollows(new, head)), and >> b) not too far from the smallest value in the array to confuse binary >> search (TransactionIdFollows(new, tail)). > > We discussed this in November. You convinced me it isn't possible for > older xids to stay in the standby because anti-wraparound vacuums would > conflict and kick them out. The primary can't run with wrapped xids and > neither can the standby. I think that is correct. > > Adding an assertion isn't going to do much because it's unlikely anybody > is going to be running for 2^31 transactions with asserts enabled. > > Worrying about things like this seems strange when real and negative > behaviours are right in our faces elsewhere. Performance and scalability > are real world concerns. I think the assert is a good idea. If there's no real problem here, the assert won't trip. It's just a safety precaution. ...Robert
On Wed, Apr 21, 2010 at 8:20 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > The locking seems overly complex to me. I tend to agree.

!    /*
!     * Callers must hold either ProcArrayLock in Exclusive mode or
!     * ProcArrayLock in Shared mode *and* known_assigned_xids_lck
!     * to update these values.
!     */

I'm not convinced that this is either (a) correct or (b) performant. ...Robert
On Wed, 2010-04-21 at 09:51 -0400, Robert Haas wrote: > > > > Adding an assertion isn't going to do much because it's unlikely anybody > > is going to be running for 2^31 transactions with asserts enabled. > > > I think the assert is a good idea. If there's no real problem here, > the assert won't trip. It's just a safety precaution. If you believe that, then I think you should add this to all the other places in the current server where that assumption is made without assertion being added. As a safety precaution. -- Simon Riggs www.2ndQuadrant.com
On Wed, Apr 21, 2010 at 4:12 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, 2010-04-21 at 09:51 -0400, Robert Haas wrote: >> > >> > Adding an assertion isn't going to do much because it's unlikely anybody >> > is going to be running for 2^31 transactions with asserts enabled. >> > > >> I think the assert is a good idea. If there's no real problem here, >> the assert won't trip. It's just a safety precaution. > > If you believe that, then I think you should add this to all the other > places in the current server where that assumption is made without > assertion being added. As a safety precaution. > Is it not a good idea that (at least for dev builds, like with enable-cassert) the xid counter start at something like 2^31 - 1000? It could help catch some bugs. Greetings Marcin Mańk
On Wed, 2010-04-21 at 16:22 +0200, marcin mank wrote: > Is that not a good idea that (at least for dev-builds, like with > enable-cassert) the xid counter start at like 2^31 - 1000 ? It could > help catch some bugs. It is a good idea, I'm sure that would help catch bugs. It wouldn't help here because the case in doubt is whether it's possible to have an xid still showing in memory arrays from the last time the cycle wrapped. It isn't. These things aren't random. These numbers are extracted directly from activity that was occurring on the primary and regularly checked and cleaned as the standby runs. So you'll need to do 2^31 transactions to prove this isn't true, which isn't ever going to happen in testing with an assert build and nobody with that many transactions would run an assert build anyway. -- Simon Riggs www.2ndQuadrant.com
On Wed, Apr 21, 2010 at 10:12 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Wed, 2010-04-21 at 09:51 -0400, Robert Haas wrote: >> > >> > Adding an assertion isn't going to do much because it's unlikely anybody >> > is going to be running for 2^31 transactions with asserts enabled. >> > > >> I think the assert is a good idea. If there's no real problem here, >> the assert won't trip. It's just a safety precaution. > > If you believe that, then I think you should add this to all the other > places in the current server where that assumption is made without > assertion being added. As a safety precaution. I feel like this conversation is getting a little heated. We are just trying to solve a technical problem here. Perhaps I am misreading - tone doesn't come through very well in email. I think the assumptions that are being made in this particular case are different from the ones made elsewhere in the server. Generally, we don't assume transaction IDs are arriving in ascending order - in fact, we usually explicitly have to deal with the fact that they might not be. So if we have a situation where we ARE relying on them arriving in order, because we have extrinsic reasons why we know it has to happen that way, adding an assertion to make sure that things are happening the way we expect doesn't seem out of line. This code is fairly complex. There is arguably less value in asserting that the newly added xid follows the tail as well as the head, but I still like the idea. Not sure whether that's rational or not. ...Robert
On Apr 21, 2010, at 16:49 , Simon Riggs wrote: > On Wed, 2010-04-21 at 16:22 +0200, marcin mank wrote: > >> Is that not a good idea that (at least for dev-builds, like with >> enable-cassert) the xid counter start at like 2^31 - 1000 ? It could >> help catch some bugs. > > It is a good idea, I'm sure that would help catch bugs. > > It wouldn't help here because the case in doubt is whether it's possible > to have an xid still showing in memory arrays from the last time the > cycle wrapped. It isn't. These things aren't random. These numbers are > extracted directly from activity that was occurring on the primary and > regularly checked and cleaned as the standby runs. > > So you'll need to do 2^31 transactions to prove this isn't true, which > isn't ever going to happen in testing with an assert build and nobody > with that many transactions would run an assert build anyway. ISTM that there's no need to actually execute 2^31 transactions to trigger this bug (if it actually exists); it'd be sufficient to increment the xid counter by more than one each time an xid is assigned, no? Or would that trip snapshot creation on the standby? best regards, Florian Pflug
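One hypothetical way to do what Florian suggests in a dev build, purely illustrative (the stride and the guard macro are invented; ShmemVariableCache->nextXid and FirstNormalTransactionId are the real names):

    /* In GetNewTransactionId(), dev builds only: jump the counter by a
     * large stride so the 2^32 space wraps after a few thousand
     * transactions instead of billions. */
    #ifdef XID_WRAP_STRESS_TEST
        ShmemVariableCache->nextXid += 1000000;
        /* crude skip over the special low xids; a real version would
         * follow TransactionIdAdvance's handling exactly */
        if (ShmemVariableCache->nextXid < FirstNormalTransactionId)
            ShmemVariableCache->nextXid = FirstNormalTransactionId;
    #endif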
Robert Haas wrote: > On Wed, Apr 21, 2010 at 9:37 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On Wed, 2010-04-21 at 15:27 +0300, Heikki Linnakangas wrote: >> >>> Given the discussion about the cyclic nature of XIDs, it would be good >>> to add an assertion that when a new XID is added to the array, it is >>> >>> a) larger than the biggest value already in the array >>> (TransactionIdFollows(new, head)), and >>> b) not too far from the smallest value in the array to confuse binary >>> search (TransactionIdFollows(new, tail)). >> We discussed this in November. You convinced me it isn't possible for >> older xids to stay in the standby because anti-wraparound vacuums would >> conflict and kick them out. The primary can't run with wrapped xids and >> neither can the standby. I think that is correct. >> >> Adding an assertion isn't going to do much because it's unlikely anybody >> is going to be running for 2^31 transactions with asserts enabled. >> >> Worrying about things like this seems strange when real and negative >> behaviours are right in our faces elsewhere. Performance and scalability >> are real world concerns. > > I think the assert is a good idea. If there's no real problem here, > the assert won't trip. It's just a safety precaution. Right. And assertions also act as documentation, they are a precise and compact way to document invariants we assume to hold. A comment explaining why the cyclic nature of XIDs is not a problem would be nice too, in addition or instead of the assertions. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2010-04-22 at 08:57 +0300, Heikki Linnakangas wrote: > > > > I think the assert is a good idea. If there's no real problem here, > > the assert won't trip. It's just a safety precaution. > > Right. And assertions also act as documentation, they are a precise and > compact way to document invariants we assume to hold. A comment > explaining why the cyclic nature of XIDs is not a problem would be nice > too, in addition or instead of the assertions. Agreed. I was going to reply just that earlier but have been distracted on other things. -- Simon Riggs www.2ndQuadrant.com
On Sun, April 18, 2010 13:01, Simon Riggs wrote: >> >> OK, I'll put a spinlock around access to the head of the array. > > v2 patch attached > knownassigned_sortedarray.v2.diff applied to cvs HEAD (2010.04.21 22:36) I have done a few smaller tests (scale 500, clients 1, 20): init: pgbench -h /tmp -p 6565 -U rijkers -i -s 500 replicas 4x primary, clients 1: scale: 500 clients: 1 tps = 11496.372655 pgbench -p 6565 -n -S -c 1 -T 900 -j 1 scale: 500 clients: 1 tps = 11580.141685 pgbench -p 6565 -n -S -c 1 -T 900 -j 1 scale: 500 clients: 1 tps = 11478.294747 pgbench -p 6565 -n -S -c 1 -T 900 -j 1 scale: 500 clients: 1 tps = 11741.432016 pgbench -p 6565 -n -S -c 1 -T 900 -j 1 4x standby, clients 1: scale: 500 clients: 1 tps = 727.217672 pgbench -p 6566 -n -S -c 1 -T 900 -j 1 scale: 500 clients: 1 tps = 785.431011 pgbench -p 6566 -n -S -c 1 -T 900 -j 1 scale: 500 clients: 1 tps = 825.291817 pgbench -p 6566 -n -S -c 1 -T 900 -j 1 scale: 500 clients: 1 tps = 868.107638 pgbench -p 6566 -n -S -c 1 -T 900 -j 1 4x primary, clients 20: scale: 500 clients: 20 tps = 34963.054102 pgbench -p 6565 -n -S -c 20 -T 900 -j 1 scale: 500 clients: 20 tps = 34818.985407 pgbench -p 6565 -n -S -c 20 -T 900 -j 1 scale: 500 clients: 20 tps = 34964.545013 pgbench -p 6565 -n -S -c 20 -T 900 -j 1 scale: 500 clients: 20 tps = 34959.210687 pgbench -p 6565 -n -S -c 20 -T 900 -j 1 4x standby, clients 20: scale: 500 clients: 20 tps = 1099.808192 pgbench -p 6566 -n -S -c 20 -T 900 -j 1 scale: 500 clients: 20 tps = 905.926703 pgbench -p 6566 -n -S -c 20 -T 900 -j 1 scale: 500 clients: 20 tps = 943.531989 pgbench -p 6566 -n -S -c 20 -T 900 -j 1 scale: 500 clients: 20 tps = 1082.215913 pgbench -p 6566 -n -S -c 20 -T 900 -j 1 This is the same behaviour (i.e. extremely slow standby) that I saw earlier (and which caused the original post, btw). In that earlier instance, the extreme slowness disappeared later, after many hours, maybe even days (without bouncing either primary or standby). I have no idea what could cause this; is no one else seeing this? (if I have time I'll repeat on other hardware over the weekend) any comment is welcome... Erik Rijkers
Erik Rijkers wrote: > This is the same behaviour (i.e. extreme slow standby) that I saw earlier (and which caused the > original post, btw). In that earlier instance, the extreme slowness disappeared later, after many > hours maybe even days (without bouncing either primary or standby). > Any possibility the standby is built with assertions turned on? That's often the cause of this type of difference between pgbench results on two systems, which is easy to introduce when everyone is building from source. You should try this on both systems: psql -c "show debug_assertions" just to rule that out. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Greg Smith wrote: > Erik Rijkers wrote: >> This is the same behaviour (i.e. extreme slow standby) that I saw >> earlier (and which caused the >> original post, btw). In that earlier instance, the extreme slowness >> disappeared later, after many >> hours maybe even days (without bouncing either primary or standby). >> > > Any possibility the standby is built with assertions turned out? > That's often the cause of this type of difference between pgbench > results on two systems, which easy to introduce when everyone is > building from source. You should try this on both systems: > > psql -c "show debug_assertions" > > > Or even: pg_config --configure on both systems might be worth checking. regards Mark
On Thu, April 22, 2010 23:54, Mark Kirkwood wrote: > Greg Smith wrote: >> Erik Rijkers wrote: >>> This is the same behaviour (i.e. extreme slow standby) that I saw >>> earlier (and which caused the >>> original post, btw). In that earlier instance, the extreme slowness >>> disappeared later, after many >>> hours maybe even days (without bouncing either primary or standby). >>> >> >> Any possibility the standby is built with assertions turned out? >> That's often the cause of this type of difference between pgbench >> results on two systems, which easy to introduce when everyone is >> building from source. You should try this on both systems: >> >> psql -c "show debug_assertions" >> >> >> > Or even: > > pg_config --configure > > on both systems might be worth checking. (these instances are on a single server, btw) primary: $ pg_config BINDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/bin DOCDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/doc HTMLDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/doc INCLUDEDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/include PKGINCLUDEDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/include INCLUDEDIR-SERVER = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/include/server LIBDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib PKGLIBDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib LOCALEDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/locale MANDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/man SHAREDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share SYSCONFDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/etc PGXS = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib/pgxs/src/makefiles/pgxs.mk CONFIGURE = '--prefix=/var/data1/pg_stuff/pg_installations/pgsql.sr_primary' '--with-pgport=6565' '--enable-depend' '--with-openssl' '--with-perl' '--with-libxml' '--with-libxslt' CC = gcc CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2 CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing -fwrapv CFLAGS_SL = -fpic LDFLAGS = -Wl,-rpath,'/var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib' LDFLAGS_SL = LIBS = -lpgport -lxslt -lxml2 -lssl -lcrypto -lz -lreadline -ltermcap -lcrypt -ldl -lm VERSION = PostgreSQL 9.0devel-sr_primary [data:port:db PGPORT=6565 PGDATA=/var/data1/pg_stuff/pg_installations/pgsql.sr_primary/data PGDATABASE=replicas] 2010.04.22 20:55:28 rijkers@denkraam:~/src/perl/85devel [0] $ time ./run_test_suite.sh [data:port:db PGPORT=6565 PGDATA=/var/data1/pg_stuff/pg_installations/pgsql.sr_primary/data PGDATABASE=replicas] 2010.04.22 21:00:26 rijkers@denkraam:~/src/perl/85devel [1] standby: $ pg_config BINDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/bin DOCDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/doc HTMLDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/doc INCLUDEDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/include PKGINCLUDEDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/include INCLUDEDIR-SERVER = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/include/server LIBDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib PKGLIBDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib LOCALEDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/locale MANDIR = 
/var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/man SHAREDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share SYSCONFDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/etc PGXS = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib/pgxs/src/makefiles/pgxs.mk CONFIGURE = '--prefix=/var/data1/pg_stuff/pg_installations/pgsql.sr_primary' '--with-pgport=6565' '--enable-depend' '--with-openssl' '--with-perl' '--with-libxml' '--with-libxslt' CC = gcc CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2 CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing -fwrapv CFLAGS_SL = -fpic LDFLAGS = -Wl,-rpath,'/var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib' LDFLAGS_SL = LIBS = -lpgport -lxslt -lxml2 -lssl -lcrypto -lz -lreadline -ltermcap -lcrypt -ldl -lm VERSION = PostgreSQL 9.0devel-sr_primary $ grep -Ev '(^[[:space:]]*#)|(^$)' pgsql.sr_*ry/data/postgresql.conf pgsql.sr_primary/data/postgresql.conf:data_directory = '/var/data1/pg_stuff/pg_installations/pgsql.sr_primary/data' pgsql.sr_primary/data/postgresql.conf:port = 6565 pgsql.sr_primary/data/postgresql.conf:max_connections = 100 pgsql.sr_primary/data/postgresql.conf:shared_buffers = 256MB pgsql.sr_primary/data/postgresql.conf:checkpoint_segments = 50 pgsql.sr_primary/data/postgresql.conf:archive_mode = 'on' pgsql.sr_primary/data/postgresql.conf:archive_command= 'cp %p /var/data1/pg_stuff/dump/replication_archive/%f' pgsql.sr_primary/data/postgresql.conf:max_wal_senders = 5 pgsql.sr_primary/data/postgresql.conf:effective_cache_size= 16GB pgsql.sr_primary/data/postgresql.conf:datestyle = 'iso, mdy' pgsql.sr_primary/data/postgresql.conf:lc_messages = 'en_US.UTF-8' pgsql.sr_primary/data/postgresql.conf:lc_monetary = 'en_US.UTF-8' pgsql.sr_primary/data/postgresql.conf:lc_numeric = 'en_US.UTF-8' pgsql.sr_primary/data/postgresql.conf:lc_time = 'en_US.UTF-8' pgsql.sr_primary/data/postgresql.conf:default_text_search_config = 'pg_catalog.english' pgsql.sr_slavery/data/postgresql.conf:data_directory = '/var/data1/pg_stuff/pg_installations/pgsql.sr_slavery/data' pgsql.sr_slavery/data/postgresql.conf:port = 6566 pgsql.sr_slavery/data/postgresql.conf:max_connections = 100 pgsql.sr_slavery/data/postgresql.conf:shared_buffers = 256MB pgsql.sr_slavery/data/postgresql.conf:checkpoint_segments = 50 pgsql.sr_slavery/data/postgresql.conf:archive_mode = 'on' pgsql.sr_slavery/data/postgresql.conf:archive_command= 'cp %p /var/data1/pg_stuff/dump/replication_archive/%f' pgsql.sr_slavery/data/postgresql.conf:max_wal_senders = 5 pgsql.sr_slavery/data/postgresql.conf:effective_cache_size= 16GB pgsql.sr_slavery/data/postgresql.conf:datestyle = 'iso, mdy' pgsql.sr_slavery/data/postgresql.conf:lc_messages = 'en_US.UTF-8' pgsql.sr_slavery/data/postgresql.conf:lc_monetary = 'en_US.UTF-8' pgsql.sr_slavery/data/postgresql.conf:lc_numeric = 'en_US.UTF-8' pgsql.sr_slavery/data/postgresql.conf:lc_time = 'en_US.UTF-8' pgsql.sr_slavery/data/postgresql.conf:default_text_search_config = 'pg_catalog.english' I only now notice archive_mode = 'on' on the standby. That doesn't make sense of course - I'll remove that.
On Thu, 2010-04-22 at 20:39 +0200, Erik Rijkers wrote: > On Sun, April 18, 2010 13:01, Simon Riggs wrote: > any comment is welcome... Please can you re-run with -l and post me the file of times. Please also rebuild using --enable-profile so we can see what's happening. Can you also try the enclosed patch, which implements prefetching during replay of btree delete records. (Need to set effective_io_concurrency) Thanks for your further help. -- Simon Riggs www.2ndQuadrant.com
Erik Rijkers wrote: > > This is the same behaviour (i.e. extreme slow standby) that I saw earlier (and which caused the > original post, btw). In that earlier instance, the extreme slowness disappeared later, after many > hours maybe even days (without bouncing either primary or standby). > > I have no idea what could cause this; is no one else is seeing this ? > > (if I have time I'll repeat on other hardware in the weekend) > > any comment is welcome... > > > I wonder if what you are seeing is perhaps due to the tables on the primary being almost completely cached (from the initial create) and those on the standby being at best partially so. That would explain why the standby performance catches up after a while ( when its tables are equivalently cached). One way to test this is to 'pre-cache' the standby by selecting every row from its tables before running the pgbench test. regards Mark
On Thu, 2010-04-22 at 23:45 +0100, Simon Riggs wrote: > On Thu, 2010-04-22 at 20:39 +0200, Erik Rijkers wrote: > > On Sun, April 18, 2010 13:01, Simon Riggs wrote: > > > any comment is welcome... > > Please can you re-run with -l and post me the file of times Erik has sent me details of a test run. My analysis of that is: I'm seeing the response time profile on the standby as 99% <110us 99.9% <639us 99.99% <615ms 0.052% (52 samples) are >5ms elapsed and account for 24 s, which is about 45% of elapsed time. Of the 52 samples >5ms, 50 of them are >100ms and 2 >1s. 99% of transactions happen in similar times between primary and standby, everything dragged down by rare but severe spikes. We're looking for something that would delay something that normally takes <0.1ms into something that takes >100ms, yet does eventually return. That looks like a severe resource contention issue. This effect happens when running just a single read-only session on the standby from pgbench. No confirmation as yet as to whether recovery is active or dormant, and what other activity, if any, occurs on the standby server at the same time. So no other clues as yet as to what the contention might be, except that we note the standby is writing data and the database is large. > Please also rebuild using --enable-profile so we can see what's > happening. > > Can you also try the enclosed patch which implements prefetching during > replay of btree delete records. (Need to set effective_io_concurrency) As yet, no confirmation that the attached patch is even relevant. It was just a wild guess at some tuning, while we wait for further info. > Thanks for your further help. "Some kind of contention" is the best we can say at present. -- Simon Riggs www.2ndQuadrant.com
On Fri, Apr 23, 2010 at 11:14 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Thu, 2010-04-22 at 23:45 +0100, Simon Riggs wrote: >> On Thu, 2010-04-22 at 20:39 +0200, Erik Rijkers wrote: >> > On Sun, April 18, 2010 13:01, Simon Riggs wrote: >> >> > any comment is welcome... >> >> Please can you re-run with -l and post me the file of times > > Erik has sent me details of a test run. My analysis of that is: > > I'm seeing the response time profile on the standby as > 99% <110us > 99.9% <639us > 99.99% <615ms > > 0.052% (52 samples) are >5ms elapsed and account for 24 s, which is > about 45% of elapsed time. > > Of the 52 samples >5ms, 50 of them are >100ms and 2 >1s. > > 99% of transactions happen in similar times between primary and standby, > everything dragged down by rare but severe spikes. > > We're looking for something that would delay something that normally > takes <0.1ms into something that takes >100ms, yet does eventually > return. That looks like a severe resource contention issue. Wow. Good detective work. ...Robert
On 4/18/10, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sat, 2010-04-17 at 16:48 -0400, Tom Lane wrote: > > There are some places where we suppose that a *single* write into shared > > memory can safely be done without a lock, if we're not too concerned > > about how soon other transactions will see the effects. But what you > > are proposing here requires more than one related write. > > > > I've been burnt by this myself: > > http://archives.postgresql.org/pgsql-committers/2008-06/msg00228.php > > > W O W - thank you for sharing. > > What I'm not clear on is why you've used a spinlock everywhere when only > weak-memory thang CPUs are a problem. Why not have a weak-memory-protect > macro that does does nada when the hardware already protects us? (i.e. a > spinlock only for the hardware that needs it). Um, you have been burned by exactly this on x86 also: http://archives.postgresql.org/pgsql-hackers/2009-03/msg01265.php -- marko
Marko Kreen <markokr@gmail.com> writes: > Um, you have been burned by exactly this on x86 also: > http://archives.postgresql.org/pgsql-hackers/2009-03/msg01265.php Yeah, we never did figure out exactly how come you were observing that failure on Intel-ish hardware. I was under the impression that Intel machines didn't have weak-memory-ordering behavior. I wonder whether your compiler had rearranged the code in ProcArrayAdd so that the increment happened before the array element store at the machine-code level. I think it would be entitled to do that under standard C semantics, since that ProcArrayStruct pointer isn't marked volatile. regards, tom lane
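A sketch of the compiler reordering Tom speculates about, in the shape of the ProcArrayAdd code he mentions (illustrative; the struct layout is simplified and the variable names are assumptions):

    /* Without 'volatile', the compiler may legally emit the counter
     * increment before the array element store, recreating the
     * weak-memory bug on any CPU: */
    arrayP->procs[arrayP->numProcs] = proc;   /* store the element */
    arrayP->numProcs++;                       /* publish the count */

    /* Going through a volatile-qualified pointer forces the stores to
     * be emitted in program order.  It does nothing about hardware
     * reordering, though -- hence the spinlock discussed above: */
    volatile ProcArrayStruct *vArrayP = arrayP;
    vArrayP->procs[vArrayP->numProcs] = proc;
    vArrayP->numProcs++;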
On 4/23/10, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Marko Kreen <markokr@gmail.com> writes: > > Um, you have been burned by exactly this on x86 also: > > http://archives.postgresql.org/pgsql-hackers/2009-03/msg01265.php > > > Yeah, we never did figure out exactly how come you were observing that > failure on Intel-ish hardware. I was under the impression that Intel > machines didn't have weak-memory-ordering behavior. > > I wonder whether your compiler had rearranged the code in ProcArrayAdd > so that the increment happened before the array element store at the > machine-code level. I think it would be entitled to do that under > standard C semantics, since that ProcArrayStruct pointer isn't marked > volatile. Sounds likely. Which seems to hint it's better to handle all processors as weakly ordered and work with explicit locks/memory barriers than to sprinkle code with 'volatile' to suppress optimizations on Intel and then still fail on non-Intel. -- marko
On Fri, 2010-04-23 at 11:32 -0400, Robert Haas wrote: > > > > 99% of transactions happen in similar times between primary and standby, > > everything dragged down by rare but severe spikes. > > > > We're looking for something that would delay something that normally > > takes <0.1ms into something that takes >100ms, yet does eventually > > return. That looks like a severe resource contention issue. > > Wow. Good detective work. While we haven't fully established the source of those problems, I am now happy that these test results don't present any reason to avoid committing the main patch tested by Erik (not the smaller additional one I sent). I expect to commit that on Sunday. -- Simon Riggs www.2ndQuadrant.com
On Fri, Apr 23, 2010 at 6:39 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Fri, 2010-04-23 at 11:32 -0400, Robert Haas wrote: >> > >> > 99% of transactions happen in similar times between primary and standby, >> > everything dragged down by rare but severe spikes. >> > >> > We're looking for something that would delay something that normally >> > takes <0.1ms into something that takes >100ms, yet does eventually >> > return. That looks like a severe resource contention issue. >> >> Wow. Good detective work. > > While we haven't fully established the source of those problems, I am > now happy that these test results don't present any reason to avoid > commiting the main patch tested by Erik (not the smaller additional one > I sent). I expect to commit that on Sunday. Both Heikki and I objected to that patch. And apparently it doesn't fix the problem, either. So, -1 from me. ...Robert
On Sat, April 24, 2010 00:39, Simon Riggs wrote: > On Fri, 2010-04-23 at 11:32 -0400, Robert Haas wrote: >> > >> > 99% of transactions happen in similar times between primary and standby, >> > everything dragged down by rare but severe spikes. >> > >> > We're looking for something that would delay something that normally >> > takes <0.1ms into something that takes >100ms, yet does eventually >> > return. That looks like a severe resource contention issue. >> >> Wow. Good detective work. > > While we haven't fully established the source of those problems, I am > now happy that these test results don't present any reason to avoid > committing the main patch tested by Erik (not the smaller additional one > I sent). I expect to commit that on Sunday. >

yes, that (main) patch seems to have largely closed the gap between primary and standby; here are some results from a lower scale (10):

scale: 10; clients: 10, 20, 40, 60, 90; for each: 4x primary, 4x standby. All runs: pgbench -p <port> -n -S -c <clients> -T 900 -j 1 (port 6565 = primary, 6566 = standby).

clients: 10   primary: tps = 27624.339871, 27604.261750, 28015.093466, 28422.561280
              standby: tps = 27254.806526, 27686.470866, 28078.904035, 27101.622337
clients: 20   primary: tps = 23106.795587, 23101.681155, 22893.364004, 23038.577109
              standby: tps = 22903.578552, 22970.691946, 22999.473318, 22884.854749
clients: 40   primary: tps = 23522.499429, 23611.319191, 23616.905302, 23572.213990
              standby: tps = 23714.721220, 23711.781175, 23691.867023, 23691.699231
clients: 60   primary: tps = 21987.497095, 21950.344204, 22006.461447, 21824.071303
              standby: tps = 22149.415231, 22211.064402, 22164.238081, 22174.585736
clients: 90   primary: tps = 18751.213002, 18757.115811, 18692.942329, 18765.390154
              standby: tps = 18929.462104, 18999.851184, 18972.321607, 18924.058827

The higher scales still have that other standby-slowness. It may be caching effects (as Mark Kirkwood suggested): the idea being that the primary data is pre-cached because of the initial create, while standby data needs to be read from disk for the first time. Does that make sense? I will try to confirm this.
On Fri, 2010-04-23 at 19:07 -0400, Robert Haas wrote: > On Fri, Apr 23, 2010 at 6:39 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Fri, 2010-04-23 at 11:32 -0400, Robert Haas wrote: > >> > > >> > 99% of transactions happen in similar times between primary and standby, > >> > everything dragged down by rare but severe spikes. > >> > > >> > We're looking for something that would delay something that normally > >> > takes <0.1ms into something that takes >100ms, yet does eventually > >> > return. That looks like a severe resource contention issue. > >> > >> Wow. Good detective work. > > > > While we haven't fully established the source of those problems, I am > > now happy that these test results don't present any reason to avoid > > committing the main patch tested by Erik (not the smaller additional one > > I sent). I expect to commit that on Sunday. > > Both Heikki and I objected to that patch. Please explain your objection, based upon the patch and my explanations. > And apparently it doesn't > fix the problem, either. So, -1 from me. There is an issue observed in Erik's later tests, but my interpretation of the results so far is that the sorted array patch successfully removes the initially reported loss of performance. -- Simon Riggs www.2ndQuadrant.com
On Sat, Apr 24, 2010 at 5:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> Both Heikki and I objected to that patch. > > Please explain your objection, based upon the patch and my explanations. Well, we objected to the locking. Having reread the patch a few times though, I think I'm starting to wrap my head around it so, I don't know, maybe it's OK. Have you tested grabbing the ProcArrayLock in exclusive mode instead of having a separate spinlock, to see how that performs? >> And apparently it doesn't >> fix the problem, either. So, -1 from me. > > There is an issue observed in Erik's later tests, but my interpretation > of the results so far is that the sorted array patch successfully > removes the initially reported loss of performance. Is it possible the remaining spikes are due to fights over the spinlock? ...Robert
On Sun, 2010-04-25 at 06:53 -0400, Robert Haas wrote: > On Sat, Apr 24, 2010 at 5:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > >> Both Heikki and I objected to that patch. > > > > Please explain your objection, based upon the patch and my explanations. > > Well, we objected to the locking. Having reread the patch a few times > though, I think I'm starting to wrap my head around it so, I don't > know, maybe it's OK. Have you tested grabbing the ProcArrayLock in > exclusive mode instead of having a separate spinlock, to see how that > performs? They both perform well. Heikki and I agree that spinlocks perform well. The point of taking the ProcArrayLock in Shared rather than Exclusive mode is the reduction in contention that gives. Adding a KnownAssignedXid is exactly analogous to starting a transaction. We go to some trouble to avoid taking a lock when we start a transaction and I want to do a similar thing on the standby. I think it's even more important we reduce contention on the standby than it is on the primary because if the Startup process is delayed then it slows down recovery. > >> And apparently it doesn't > >> fix the problem, either. So, -1 from me. > > > > There is an issue observed in Erik's later tests, but my interpretation > > of the results so far is that the sorted array patch successfully > > removes the initially reported loss of performance. > > Is it possible the remaining spikes are due to fights over the spinlock? That one specific spinlock? I considered it. I think it's doubtful. The spikes all involved a jump from <5ms to much more than that, often as long as a second. That looks like a pure I/O problem or an LWlock held across a slow I/O or platform specific issues of some kind. Erik's tests were with 1 backend, so that would mean that at most 2 processes might want the spinlock you mention. The spinlock is only taken at snapshot time, so once per transaction in Erik's tests. So if the spinlock were the problem then one or other of the processes would need to grab and hold the spinlock for a very long time and then release it sometime later. Since the spinlock code has been with us for some time, I doubt there is anything wrong there. I don't have any evidence as to whether the problem is solely on Erik's system, nor whether it is an issue already in the codebase or introduced by this patch. It seems unlikely to me that this patch introduces the problem, and even more unlikely that the issue is caused by two processes fighting over that particular spinlock. The patch is very effective at reducing overall cost of taking snapshots and so I think it needs to be applied to allow more performance data to be collected. I don't suppose this is the last performance tuning we will need to do. -- Simon Riggs www.2ndQuadrant.com
On Sun, Apr 25, 2010 at 8:50 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sun, 2010-04-25 at 06:53 -0400, Robert Haas wrote: >> On Sat, Apr 24, 2010 at 5:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> >> Both Heikki and I objected to that patch. >> > >> > Please explain your objection, based upon the patch and my explanations. >> >> Well, we objected to the locking. Having reread the patch a few times >> though, I think I'm starting to wrap my head around it so, I don't >> know, maybe it's OK. Have you tested grabbing the ProcArrayLock in >> exclusive mode instead of having a separate spinlock, to see how that >> performs? > > They both perform well. Heikki and I agree that spinlocks perform well. > > The point of taking the ProcArrayLock in Shared rather than Exclusive > mode is the reduction in contention that gives. Adding a > KnownAssignedXid is exactly analogous to starting a transaction. We go > to some trouble to avoid taking a lock when we start a transaction and I > want to do a similar thing on the standby. I think it's even more > important we reduce contention on the standby than it is on the primary > because if the Startup process is delayed then it slows down recovery. > >> >> And apparently it doesn't >> >> fix the problem, either. So, -1 from me. >> > >> > There is an issue observed in Erik's later tests, but my interpretation >> > of the results so far is that the sorted array patch successfully >> > removes the initially reported loss of performance. >> >> Is it possible the remaining spikes are due to fights over the spinlock? > > That one specific spinlock? I considered it. I think it's doubtful. > > The spikes all involved a jump from <5ms to much more than that, often > as long as a second. That looks like a pure I/O problem or an LWlock > held across a slow I/O or platform specific issues of some kind. > > Erik's tests were with 1 backend, so that would mean that at most 2 > processes might want the spinlock you mention. The spinlock is only > taken at snapshot time, so once per transaction in Erik's tests. So > if the spinlock were the problem then one or other of the processes > would need to grab and hold the spinlock for a very long time and then > release it sometime later. Since the spinlock code has been with us for > some time, I doubt there is anything wrong there. > > I don't have any evidence as to whether the problem is solely on Erik's > system, nor whether it is an issue already in the codebase or introduced > by this patch. It seems unlikely to me that this patch introduces the > problem, and even more unlikely that the issue is caused by two > processes fighting over that particular spinlock. > > The patch is very effective at reducing overall cost of taking snapshots > and so I think it needs to be applied to allow more performance data to > be collected. I don't suppose this is the last performance tuning we > will need to do. OK, fair enough. I withdraw my objections. ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > On Sat, Apr 24, 2010 at 5:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Both Heikki and I objected to that patch. >> >> Please explain your objection, based upon the patch and my explanations. > Well, we objected to the locking. Not only is the locking overly complex, but it's outright wrong; or at least the comments are. KnownAssignedXidsAdd in particular has a comment that is 100% misleading, and an actual behavior that I find extremely likely to provoke bugs (eg, consider caller calling it twice under the illusion it still has only shared lock). I'm not even convinced that it works correctly, since it will happily write into KnownAssignedXids[] while holding only shared ProcArrayLock. What happens when two processes are doing that concurrently? This needs a redesign before it can be considered committable. I don't really care whether it makes things faster; it's too complicated and too poorly documented to be maintainable. regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes: > [ v2 patch ] BTW, while I'm looking at this, why bother with the separate KnownAssignedXidsValid[] array? Wouldn't it be cheaper (certainly so in storage, probably so in access/update times) to have just the KnownAssignedXids[] array and store InvalidTransactionId in unused entries? regards, tom lane
On Sun, 2010-04-25 at 11:50 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Sat, Apr 24, 2010 at 5:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > Both Heikki and I objected to that patch. > >> > >> Please explain your objection, based upon the patch and my explanations. > > > Well, we objected to the locking. > > Not only is the locking overly complex, but it's outright wrong; > or at least the comments are. KnownAssignedXidsAdd in particular > has a comment that is 100% misleading, and an actual behavior that > I find extremely likely to provoke bugs (eg, consider caller calling > it twice under the illusion it still has only shared lock). I will reword that comment. > I'm not > even convinced that it works correctly, since it will happily write > into KnownAssignedXids[] while holding only shared ProcArrayLock. > What happens when two processes are doing that concurrently? Heikki and I discussed this upthread. Only the Startup process ever calls KnownAssignedXidsAdd(), so the situation you describe does not happen. We currently rely on the fact that we have only a single writer in other places in Hot Standby, so this is not a restriction we are imposing on the system just to support this change. Both you and Heikki have previously spoken against having recovery run in parallel, so the likelihood of having multiple readers in the future seems low. If we were to allow multiple callers of the routine in future, the use of spinlocks to protect the head of the queue means we can safely reserve slots and so KnownAssignedXidsAdd can easily be made to accept multiple writers in the future. I think it's possible to future-proof that now without significant change or performance drop, so I'll do that. > This needs a redesign before it can be considered committable. I don't > really care whether it makes things faster; it's too complicated and too > poorly documented to be maintainable. It is sufficiently complex to achieve its objective, which is keeping track of xids during recovery, using the same locking as we do during normal running. This is the third re-write of the code, developed over more than 20 months and specifically modularised by me to make future changes drop-in replacements. Initially we had an insert-sorted array, then a hash table and now this current proposal. Recently, Heikki and I independently came up with almost exactly the same design, with the algorithmic characteristics we need for performance. The call points, frequency and workload of calls are documented and well understood by multiple people. The only remaining point of discussion has been the code that takes ProcArrayLock in shared mode. The difference being discussed here is whether we access the head and tail of the array with both Shared+spinlock or just Exclusive. The patch would be almost identical with that change, but would not meet its locking objectives. So making the locking stricter isn't going to make it more maintainable by a measurable amount. There are more than 60 lines of header comment explaining in detail how this works, with a full algorithmic analysis. The remaining code is commented to project standards, with easily more than 100 lines of comments. I will recheck every comment and see if other comments can be usefully added, and further review the code to see if there's anything that can be done to improve it further. I was going to do this anyway, since I will be adding an agreed Assert() into the patch. -- Simon Riggs www.2ndQuadrant.com
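For readers following the locking argument, a rough sketch of the add path as described (names and details simplified from the patch; bounds checks, wraparound, and sorted-order maintenance are omitted):

    /* Assumed shared state from the patch (declarations simplified): */
    extern TransactionId KnownAssignedXids[];
    extern bool KnownAssignedXidsValid[];
    extern int  head;
    extern slock_t known_assigned_xids_lck;

    static void
    known_assigned_xids_add(TransactionId xid)
    {
        LWLockAcquire(ProcArrayLock, LW_SHARED);

        /* Only the startup process writes entries, so these stores race
         * with nobody else's writes. */
        KnownAssignedXids[head] = xid;
        KnownAssignedXidsValid[head] = true;

        /* Rule under discussion: moving head requires either exclusive
         * ProcArrayLock, or shared ProcArrayLock plus this spinlock. */
        SpinLockAcquire(&known_assigned_xids_lck);
        head++;
        SpinLockRelease(&known_assigned_xids_lck);

        LWLockRelease(ProcArrayLock);
    }

The design goal is the same as on the primary: an add should not need the exclusive lock that would stall every concurrent snapshot reader.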
Simon Riggs <simon@2ndQuadrant.com> writes: > On Sun, 2010-04-25 at 11:50 -0400, Tom Lane wrote: >> This needs a redesign before it can be considered committable. I don't >> really care whether it makes things faster; it's too complicated and too >> poorly documented to be maintainable. > There are more than 60 lines of header comment explaining in detail how > this works, with a full algorithmic analysis. The remaining code is > commented to project standards, with easily more than 100 lines of > comments. If the comments were correct, I wouldn't be complaining. They're misleading or outright wrong on many points. In particular, I don't think you actually understand the weak-memory-ordering issue, because the comments about that are entirely wrong. I don't think they are adequate on other points either --- for example, I now understand why my complaint of a few minutes ago about KnownAssignedXidsValid is wrong, but the comments certainly didn't help me on that. regards, tom lane
On Sun, 2010-04-25 at 12:43 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > [ v2 patch ] > > BTW, while I'm looking at this, why bother with the separate > KnownAssignedXidsValid[] array? Wouldn't it be cheaper > (certainly so in storage, probably so in access/update times) > to have just the KnownAssignedXids[] array and store > InvalidTransactionId in unused entries? Well, that was exactly my first design. Heikki came up with the additional array and I think it is an inspired suggestion, because it allows the search() to use a simple binary search. I attempted to write the binary-search-with-holes and it was going to be very ugly code, so I went for another algorithm which was also quite cool, but not as cool as Heikki's idea - which I was going to give credit for. The array adds 520*max_connections bytes of shared memory, but seems a good tradeoff for the additional performance it allows. I guess we could use a bit array, but that's slower to access. The bit array would be 9*max_connections bytes of shared memory, so a 40kB saving. -- Simon Riggs www.2ndQuadrant.com
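The payoff of the second array is easiest to see in a sketch (simplified; plain integer comparison stands in for the circular XID comparison real code must use): the xid array stays sorted and densely packed so an ordinary binary search works, while the valid flags record deletions without punching holes in the ordering:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* True if xid is present and still valid; entries occupy [tail, head). */
    static bool
    known_assigned_xid_exists(TransactionId xid,
                              const TransactionId *xids, const bool *valid,
                              int tail, int head)
    {
        int         lo = tail;
        int         hi = head;

        while (lo < hi)
        {
            int         mid = lo + (hi - lo) / 2;

            if (xids[mid] == xid)
                return valid[mid];  /* found the slot; has it been deleted? */
            else if (xids[mid] < xid)
                lo = mid + 1;
            else
                hi = mid;
        }
        return false;
    }

Deleting an entry is then just clearing its flag: no compaction, and the search path never has to cope with holes.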
On Sun, 2010-04-25 at 12:51 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Sun, 2010-04-25 at 11:50 -0400, Tom Lane wrote: > >> This needs a redesign before it can be considered committable. I don't > >> really care whether it makes things faster; it's too complicated and too > >> poorly documented to be maintainable. > > > There are more than 60 lines of header comment explaining in detail how > > this works, with a full algorithmic analysis. The remaining code is > > commented to project standards, with easily more than 100 lines of > > comments. > > If the comments were correct, I wouldn't be complaining. They're > misleading or outright wrong on many points. In particular, I don't > think you actually understand the weak-memory-ordering issue, because > the comments about that are entirely wrong. The comment says: "on CPUs with weak-memory ordering we can't reliably move pointers atomically, so the rule is that updates of head and tail of the array require ProcArrayLock in exclusive mode or (shared mode and known_assigned_xids_lck spinlock)". I will reword this, so it is clear that I'm talking about the head and tail of the array, not pointers in general. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes: > On Sun, 2010-04-25 at 12:51 -0400, Tom Lane wrote: >> If the comments were correct, I wouldn't be complaining. They're >> misleading or outright wrong on many points. In particular, I don't >> think you actually understand the weak-memory-ordering issue, because >> the comments about that are entirely wrong. > The comments says "on CPUs with > + * weak-memory ordering we can't reliably move pointers atomically, so > the > + * rule is that updates of head and tail of the array require > ProcArrayLock > + * in exclusive mode or (shared mode and known_assigned_xids_lck > spinlock)" > I will reword this, so it is clear that I'm talking about the head and > tail of the array, not pointers in general. It's not about whether the pointers can be assigned atomically; on most hardware they can. It's about whether other processors will see that happen in the correct sequence relative to the changes in the array elements. If you like I'll have a go at rewriting the comments for this patch, because I am currently thinking that the problem is not so much with the code as with the poor explanation of what it's doing. Sometimes the author is too close to the code to understand why other people have a hard time understanding it. regards, tom lane
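The distinction is between atomicity and publication order; a hedged sketch of the pattern Tom is describing (write_barrier()/read_barrier() are stand-ins for whatever the platform provides, not existing PostgreSQL macros, and the names are simplified):

    #include <stdint.h>

    typedef uint32_t TransactionId;

    extern void write_barrier(void);        /* assumed platform primitive */
    extern void read_barrier(void);         /* assumed platform primitive */
    extern void process_xid(TransactionId xid);

    static TransactionId KnownAssignedXids[6500];
    static volatile int head = 0;
    static volatile int tail = 0;

    /* Writer (startup process): fill the slot, then publish it. */
    static void
    publish_xid(TransactionId xid)
    {
        KnownAssignedXids[head] = xid;      /* store the element... */
        write_barrier();                    /* ...and force it to be visible */
        head = head + 1;                    /* ...before the head advance */
    }

    /* Reader (backend taking a snapshot): observe head, then read below it. */
    static void
    scan_xids(void)
    {
        int         h = head;

        read_barrier();                     /* pairs with the write barrier */
        for (int i = tail; i < h; i++)
            process_xid(KnownAssignedXids[i]);
    }

Each store here is individually atomic on common hardware; what the barriers add is the guarantee that no reader sees the new head while the element store is still sitting in the writer's store buffer.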
On Sun, 2010-04-25 at 13:33 -0400, Tom Lane wrote: > If you like I'll have a go at rewriting the comments for this patch, > because I am currently thinking that the problem is not so much with > the code as with the poor explanation of what it's doing. Sometimes > the author is too close to the code to understand why other people > have a hard time understanding it. That would help me, thank you. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes: > On Sun, 2010-04-25 at 13:33 -0400, Tom Lane wrote: >> If you like I'll have a go at rewriting the comments for this patch, >> because I am currently thinking that the problem is not so much with >> the code as with the poor explanation of what it's doing. Sometimes >> the author is too close to the code to understand why other people >> have a hard time understanding it. > That would help me, thank you. OK. You said you were currently working some more on the patch, so I'll wait for v3 and then work on it. regards, tom lane
On Sat, April 24, 2010 01:17, Erik Rijkers wrote: > On Sat, April 24, 2010 00:39, Simon Riggs wrote: >> On Fri, 2010-04-23 at 11:32 -0400, Robert Haas wrote: >>> > >>> > 99% of transactions happen in similar times between primary and standby, >>> > everything dragged down by rare but severe spikes. >>> > >>> > We're looking for something that would delay something that normally >>> > takes <0.1ms into something that takes >100ms, yet does eventually >>> > return. That looks like a severe resource contention issue. >>> >>> Wow. Good detective work. >> >> While we haven't fully established the source of those problems, I am >> now happy that these test results don't present any reason to avoid >> committing the main patch tested by Erik (not the smaller additional one >> I sent). I expect to commit that on Sunday. >> > > yes, that (main) patch seems to have largely > closed the gap between primary and standby; here > are some results from a lower scale (10):

FWIW, here are some more results from pgbench comparing primary and standby (both with Simon's patch). Sorry if it's too much data, but to me at least it was illuminating; I now understand the effects of the different parameters better.

pgbench results, all -S (SELECT-only). Scales: 10, 100, 500, 1000; clients: 1, 10, 20, 40, 60, 90; for each: 4x primary, 4x standby, for 15 minutes. All runs: pgbench -p <port> -n -S -c <clients> -T 900 -j 1 (port 6565 = primary, 6566 = standby).

scale: 10
clients: 1    primary: tps = 6404.768132, 6435.072983, 6458.958549, 6432.088099
              standby: tps = 6442.519035, 6449.837931, 6419.879026, 6428.290175
clients: 10   primary: tps = 27624.339871, 27604.261750, 28015.093466, 28422.561280
              standby: tps = 27254.806526, 27686.470866, 28078.904035, 27101.622337
clients: 20   primary: tps = 23106.795587, 23101.681155, 22893.364004, 23038.577109
              standby: tps = 22903.578552, 22970.691946, 22999.473318, 22884.854749
clients: 40   primary: tps = 23522.499429, 23611.319191, 23616.905302, 23572.213990
              standby: tps = 23714.721220, 23711.781175, 23691.867023, 23691.699231
clients: 60   primary: tps = 21987.497095, 21950.344204, 22006.461447, 21824.071303
              standby: tps = 22149.415231, 22211.064402, 22164.238081, 22174.585736
clients: 90   primary: tps = 18751.213002, 18757.115811, 18692.942329, 18765.390154
              standby: tps = 18929.462104, 18999.851184, 18972.321607, 18924.058827

scale: 100
clients: 1    primary: tps = 5854.489684, 5852.679267, 5844.005437, 5867.593004
              standby: tps = 1111.368215, 1187.678163, 1332.544440, 1488.574446
clients: 10   primary: tps = 28948.718439, 28865.280357, 28819.426232, 28868.070716
              standby: tps = 1839.477133, 2056.574852, 2501.715270, 3114.288871
clients: 20   primary: tps = 21561.395664, 21482.860950, 21447.896216, 21450.846824
              standby: tps = 4309.312692, 6786.460948, 16552.535443, 21513.519274
clients: 40   primary: tps = 20812.709326, 20750.911233, 20764.921432, 20877.705726
              standby: tps = 20818.028682, 20876.921640, 20912.855749, 20821.271974
clients: 60   primary: tps = 18746.870683, 18752.631011, 18717.634243, 18776.138286
              standby: tps = 18850.133061, 18893.463427, 18918.599124, 18903.389147
clients: 90   primary: tps = 15915.637099, 15945.810305, 15894.238941, 15926.422587
              standby: tps = 16025.802748, 16049.977788, 16053.612396, 16040.376994

scale: 500
clients: 1    primary: tps = 5646.892935, 5642.544001, 5633.282536, 5627.330477
              standby: tps = 5625.659921, 5645.898120, 5644.897739, 5640.990172
clients: 10   primary: tps = 28123.467099, 28230.031309, 28054.072374, 28129.229553
              standby: tps = 28187.938585, 28372.950289, 28161.909825, 28224.336096
clients: 20   primary: tps = 20932.340860, 20990.560386, 20926.001588, 20903.012936
              standby: tps = 20956.917213, 21007.634021, 20982.386759, 21002.399320
clients: 40   primary: tps = 19829.483373, 19830.187082, 19757.072979, 19734.710111
              standby: tps = 19862.051804, 19834.071543, 19885.788466, 19961.407411
clients: 60   primary: tps = 17829.256242, 17816.152137, 17828.589859, 17748.044912
              standby: tps = 17926.708964, 17943.758910, 17917.713556, 17932.597396
clients: 90   primary: tps = 15044.660043, 15010.685976, 15045.828142, 14976.636159
              standby: tps = 15154.402300, 15131.532084, 15154.148091, 15130.337148

scale: 1000
clients: 1    primary: tps = 222.307686, 261.163445, 307.343862, 367.285821
              standby: tps = 550.457516, 4347.697078, 5601.036491, 5604.021799
clients: 10   primary: tps = 3370.534977, 23156.038162, 28029.755911, 28179.609982
              standby: tps = 2880.404087, 13593.595875, 28033.649640, 28115.126500
clients: 20   primary: tps = 4876.882880, 21262.273826, 20403.533099, 20753.634216
              standby: tps = 3024.129307, 18786.862049, 20807.372965, 20896.373925
clients: 40   primary: tps = 5910.319107, 19705.088545, 19658.237802, 19593.955166
              standby: tps = 6555.937102, 19845.691237, 19755.112514, 19765.648969
clients: 60   primary: tps = 7104.451848, 17638.393625, 17621.464080, 17538.189171
              standby: tps = 6721.909857, 17798.542731, 17797.145061, 17739.437864
clients: 90   primary: tps = 6682.605788, 14853.076617, 14851.040207, 14859.575164
              standby: tps = 6208.586575, 15065.536476, 15033.388676, 15030.808207

Hardware: 2x Xeon (quadcore) 5482; 32 GB; Areca 1280ML. CentOS release 5.4 (Final), x86_64 Linux, 2.6.18-164.el5. Both instances on a raid volume of 12 disks (sata 7200). (And of course, the server was otherwise idle.)

hth,

Erik Rijkers
"Erik Rijkers" <er@xs4all.nl> writes: > FWIW, here are some more results from pgbench comparing > primary and standby (both with Simon's patch). That seems weird. Why do most of the runs show primary and standby as having comparable speed, but a few show the standby as much slower? The parameters for those runs don't seem obviously different from cases where it's fast. I think there might have been something else going on on the standby during those runs. Or do you think those represent cases where the mystery slowdown event happened? regards, tom lane
On Sun, 2010-04-25 at 20:25 +0200, Erik Rijkers wrote: > Sorry if it's too much data, but to me at least it was illuminating; > I now understand the effects of the different parameters better. That's great, many thanks.

A few observations:

* Standby performance is actually slightly above normal running. This is credible because of the way snapshots are now taken. We don't need to scan the procarray looking for write transactions, since we know everything is read only. So we scan just the knownassignedxids, which, if there is no activity from the primary, will be zero-length, so snapshots will actually get taken much faster in this case on the standby. The snapshot performance on the standby is O(n) where n is the number of write transactions "currently" on the primary (transfer delays blur the word "currently").

* The results for scale factor < 100 are fine, and the results for >100 with few connections get thrown out by long transaction times. With larger numbers of connections the wait problems seem to go away. Looks like Erik (and possibly Hot Standby in general) has an I/O problem, though "from what" is not yet determined. It could be just hardware, or might be hardware plus other factors.

-- Simon Riggs www.2ndQuadrant.com
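A sketch of why that scan is cheap on an idle standby (illustrative only, not the actual GetSnapshotData() code):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    /*
     * Collect the running-xacts portion of a standby snapshot.  O(n) in
     * the number of write transactions tracked from the primary; with an
     * idle primary, head == tail and the loop body never runs.
     */
    static int
    collect_standby_snapshot(TransactionId *dest,
                             const TransactionId *xids, const bool *valid,
                             int tail, int head)
    {
        int         count = 0;

        for (int i = tail; i < head; i++)
        {
            if (valid[i])
                dest[count++] = xids[i];
        }
        return count;
    }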
On Sun, April 25, 2010 20:55, Tom Lane wrote: > > That seems weird. Why do most of the runs show primary and standby > as having comparable speed, but a few show the standby as much slower? > The parameters for those runs don't seem obviously different from cases > where it's fast. I think there might have been something else going on > on the standby during those runs. Or do you think those represent > cases where the mystery slowdown event happened? > The strange case is the scale 100 standby's slow start, followed by a steady increase during -c 1, then -c 10, and finally getting up to speed with -c 20 (and up). And these slow-but-growing standby series are interspersed with normal (high-speed) primary series. I'll try to repeat this pattern on other hardware; although if my tests were run with faulty hardware I wouldn't know how/why that would give the above effect (such a 'regular aberration'). Testing is more difficult than I thought... Erik Rijkers
Simon Riggs <simon@2ndQuadrant.com> writes: > [ v2 patch ] I've been studying this some more while making notes for improved comments, and I've about come to the conclusion that having readers move the tail pointer (at the end of KnownAssignedXidsGetAndSetXmin) is overly tricky and probably not a performance improvement anyway. The code is in fact wrong as it stands: it's off-by-one about setting the new tail value. And there's potential for contention with multiple readers all wanting to move the tail pointer at once. And most importantly, KnownAssignedXidsSearch can't move the tail pointer so we might expend many inefficient searches while never moving the tail pointer. I think we should get rid of that and just have the two functions that can mark entries invalid (which they must do with exclusive lock) advance the tail pointer when they invalidate the current tail element. Then we have the very simple rule that only the startup process ever changes this data structure. regards, tom lane
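The resulting rule is small in code terms; a sketch under Tom's proposal (names invented for illustration):

    #include <stdbool.h>

    static bool KnownAssignedXidsValid[6500];
    static int  tail = 0;           /* oldest possibly-valid slot */
    static int  head = 0;           /* next free slot */

    /*
     * Called only by the startup process, holding ProcArrayLock in
     * exclusive mode.  Invalidating an entry is also the only moment the
     * tail pointer moves, sliding past any dead leading slots.
     */
    static void
    known_assigned_xids_invalidate(int slot)
    {
        KnownAssignedXidsValid[slot] = false;

        while (tail < head && !KnownAssignedXidsValid[tail])
            tail++;
    }

With readers never writing, the structure has a single mutator and the tricky reader-side tail advancement disappears.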
On Mon, Apr 26, 2010 at 3:25 AM, Erik Rijkers <er@xs4all.nl> wrote: > FWIW, here are some more results from pgbench comparing > primary and standby (both with Simon's patch). Was there a difference in CPU utilization between the primary and standby? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Sun, 2010-04-25 at 19:18 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > [ v2 patch ] > > I've been studying this some more while making notes for improved > comments, and I've about come to the conclusion that having readers > move the tail pointer (at the end of KnownAssignedXidsGetAndSetXmin) > is overly tricky and probably not a performance improvement anyway. > The code is in fact wrong as it stands: it's off-by-one about setting > the new tail value. And there's potential for contention with multiple > readers all wanting to move the tail pointer at once. OK, since contention was my concern, I want to avoid that. > And most > importantly, KnownAssignedXidsSearch can't move the tail pointer so > we might expend many inefficient searches while never moving the tail > pointer. > I think we should get rid of that and just have the two functions that > can mark entries invalid (which they must do with exclusive lock) > advance the tail pointer when they invalidate the current tail element. OK > Then we have the very simple rule that only the startup process ever > changes this data structure. -- Simon Riggs www.2ndQuadrant.com
On Sun, 2010-04-25 at 23:52 +0200, Erik Rijkers wrote: > I'll try to repeat this pattern on other hardware; although > if my tests were run with faulty hardware I wouldn't know how/why > that would give the above effect (such a 'regular aberration'). > testing is more difficult than I thought... Thanks again for your help. Please can you confirm: * Are the standby tests run while the primary is completely quiet? * What OS is this? Can we use dtrace scripts? Can anyone else confirm these test results: large scale factor and small number of sessions? -- Simon Riggs www.2ndQuadrant.com
On Mon, April 26, 2010 08:52, Fujii Masao wrote: > On Mon, Apr 26, 2010 at 3:25 AM, Erik Rijkers <er@xs4all.nl> wrote: >> FWIW, here are some more results from pgbench comparing >> primary and standby (both with Simon's patch). > > Was there a difference in CPU utilization between the primary > and standby? > I haven't monitored it.
On Mon, April 26, 2010 09:43, Simon Riggs wrote: > On Sun, 2010-04-25 at 23:52 +0200, Erik Rijkers wrote: > >> I'll try to repeat this pattern on other hardware; although >> if my tests were run with faulty hardware I wouldn't know how/why >> that would give the above effect (such a 'regular aberration'). > >> testing is more difficult than I thought... > > Thanks again for your help. > > Please can you confirm: > * Are the standby tests run while the primary is completely quiet? Autovacuum was on, which is probably not a good idea - I'll try a few runs without it. > * What OS is this? Can we use dtrace scripts? CentOS 5.4.
On Sun, 2010-04-25 at 13:51 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Sun, 2010-04-25 at 13:33 -0400, Tom Lane wrote: > >> If you like I'll have a go at rewriting the comments for this patch, > >> because I am currently thinking that the problem is not so much with > >> the code as with the poor explanation of what it's doing. Sometimes > >> the author is too close to the code to understand why other people > >> have a hard time understanding it. > > > That would help me, thank you. > > OK. You said you were currently working some more on the patch, so > I'll wait for v3 and then work on it. v3 attached

Changes:
* Strange locking in KnownAssignedXidsAdd() moved to RecordKnown...
* KnownAssignedXidsAdd() reordered, assert-ish code added
* Tail movement during snapshots no longer possible
* Tail movement during xid removal added to KnownAssignedXidsSearch()
* Major comment hacking

Still a little rough, and it definitely needs a re-read of all comments, so this is a good time to send it over. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes: > v3 attached Thanks, will work on this tomorrow. regards, tom lane
Simon Riggs <simon@2ndQuadrant.com> writes: > v3 attached This patch changes KnownAssignedXidsRemove() so that failure to find the target XID is elog(ERROR) (ie, a PANIC, since this is in the startup process). However, this comment is still there:

    /*
     * We can fail to find an xid if the xid came from a subtransaction that
     * aborts, though the xid hadn't yet been reported and no WAL records have
     * been written using the subxid.  In that case the abort record will
     * contain that subxid and we haven't seen it before.
     */

WTF? Either the comment is wrong or this should not be an elog condition. regards, tom lane
On Tue, 2010-04-27 at 13:52 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > v3 attached > > This patch changes KnownAssignedXidsRemove() so that failure to find > the target XID is elog(ERROR) (ie, a PANIC, since this is in the > startup process). Not in all cases. The code is correct, as far as I am aware from testing. > However, this comment is still there: > /* > * We can fail to find an xid if the xid came from a subtransaction that > * aborts, though the xid hadn't yet been reported and no WAL records have > * been written using the subxid. In that case the abort record will > * contain that subxid and we haven't seen it before. > */ > > WTF? Either the comment is wrong or this should not be an elog > condition. That section of code has been rewritten many times. I think it is now inaccurate and should be removed. I left it there because the unfortunate history of the project has been the removal of comments and then later rediscovery of the truth, sometimes more than once. I could no longer reproduce that error; someone else may know differently. -- Simon Riggs www.2ndQuadrant.com
Hmm ... there's another point here, which is that the array size creates a hard maximum on the number of entries, whereas the hash table was a bit more forgiving. What is the proof that the array won't overflow? The fact that the equivalent data structure on the master can't hold more than this many entries doesn't seem to me to prove that, because we will add intermediate not-observed XIDs to the array. regards, tom lane
On Tue, 2010-04-27 at 14:53 -0400, Tom Lane wrote: > Hmm ... there's another point here, which is that the array size > creates > a hard maximum on the number of entries, whereas the hash table was a > bit more forgiving. What is the proof that the array won't overflow? > The fact that the equivalent data structure on the master can't hold > more than this many entries doesn't seem to me to prove that, because > we will add intermediate not-observed XIDs to the array. We know that not-observed xids have actually been allocated on the primary. We log an assignment record every 64 subtransactions, so that the peak size of the array is 65 xids per connection. It's possible for xids to stay in the array for longer, in the event of a FATAL error that doesn't log an abort record. We clean those up every checkpoint, if they exist. The potential number of them is unbounded, so making special allowance for them doesn't remove the theoretical risk. -- Simon Riggs www.2ndQuadrant.com
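A back-of-the-envelope check of that bound, using only figures already in this thread (an assignment record every 64 subtransactions, plus the top-level xid, and the max_connections = 100 setting from the tests):

    64 unreported subxids + 1 top-level xid  = 65 tracked xids per connection
    65 xids x 100 connections                = 6,500 array slots at peak

The FATAL-without-abort-record leftovers Simon mentions sit outside this bound, which is why they are swept up at checkpoints instead.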
Simon Riggs <simon@2ndQuadrant.com> writes: > On Tue, 2010-04-27 at 13:52 -0400, Tom Lane wrote: >> WTF? Either the comment is wrong or this should not be an elog >> condition. > That section of code has been rewritten many times. I think it is now > inaccurate and should be removed. I left it there because the > unfortunate history of the project has been the removal of comments and > then later rediscovery of the truth, sometimes more than once. I could > no longer reproduce that error; someone else may know differently. I haven't tested this, but it appears to me that the failure would occur in overflow situations. If we have too many subxacts, we'll generate XLOG_XACT_ASSIGNMENT, which will cause the subxids to be removed from KnownAssignedXids[]. Then later when the top-level xact commits or aborts we'll try to remove them again as a consequence of processing the top-level's commit/abort record. No? regards, tom lane
On Tue, 2010-04-27 at 16:18 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Tue, 2010-04-27 at 13:52 -0400, Tom Lane wrote: > >> WTF? Either the comment is wrong or this should not be an elog > >> condition. > > > That section of code has been rewritten many times. I think it is now > > inaccurate and should be removed. I left it there because the > > unfortunate history of the project has been the removal of comments and > > then later rediscovery of the truth, sometimes more than once. I could > > no longer reproduce that error; someone else may know differently. > > I haven't tested this, but it appears to me that the failure would occur > in overflow situations. If we have too many subxacts, we'll generate > XLOG_XACT_ASSIGNMENT, which will cause the subxids to be removed from > KnownAssignedXids[]. Then later when the top-level xact commits or > aborts we'll try to remove them again as a consequence of processing > the top-level's commit/abort record. No? Yes, thank you for clear thinking. Anyway, looks like the comment was right after all and the new code to throw an error is wrong in some cases. It was useful for testing, at least. The comment was slightly misleading, which is a good reason to rewrite it. It seems like it might be possible to identify which xids could cause an error and which won't. Much harder than that. We still have the possible case where we have >64 subtransactions allocated but many of them abort and we are left with a final commit of <64 subtransactions. -- Simon Riggs www.2ndQuadrant.com
Isn't the snapshotOldestActiveXid filter in RecordKnownAssignedTransactionIds completely wrong/useless/bogus? AFAICS, snapshotOldestActiveXid is only set once at the start of recovery. This means it will soon be too old to provide any useful filtering. But what's far worse is that the XID space will eventually wrap around, and that test will start filtering *everything*. I think we should just lose that test, as well as the variable. regards, tom lane
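To see why wraparound makes the test filter everything, recall the circular XID comparison rule (the sketch mirrors the logic of PostgreSQL's TransactionIdPrecedes(); the function name here is invented):

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint32_t TransactionId;

    /* True if id1 logically precedes id2 in the circular XID space. */
    static bool
    xid_precedes(TransactionId id1, TransactionId id2)
    {
        int32_t     diff = (int32_t) (id1 - id2);

        return diff < 0;
    }

Once the advancing XID counter gets roughly 2^31 past the frozen snapshotOldestActiveXid, new xids start comparing as "preceding" the stale cutoff, so a filter that skips xids older than the cutoff ends up skipping every xid it sees.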
On Tue, 2010-04-27 at 17:24 -0400, Tom Lane wrote: > Isn't the snapshotOldestActiveXid filter in > RecordKnownAssignedTransactionIds completely wrong/useless/bogus? > > AFAICS, snapshotOldestActiveXid is only set once at the start of > recovery. This means it will soon be too old to provide any useful > filtering. But what's far worse is that the XID space will eventually > wrap around, and that test will start filtering *everything*. > > I think we should just lose that test, as well as the variable. Yes, though it looks like it is still necessary in creating a valid initial state because otherwise we may have xids in the KnownAssignedXids array that are already complete. The comment there talks about wasting memory, though it appears to be a correctness issue. So perhaps a similar test is required in ProcArrayApplyRecoveryInfo() but not in RecordKnownAssignedTransactionIds(). That way it is applied, but only once at initialisation. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes: > On Tue, 2010-04-27 at 17:24 -0400, Tom Lane wrote: >> I think we should just lose that test, as well as the variable. > Yes, though it looks like it is still necessary in creating a valid > initial state because otherwise we may have xids in KnownAssigned array > that are already complete. Huh? How is a filter as coarse as an oldest-running-XID filter going to prevent that? And aren't we initializing from trustworthy data in ProcArrayApplyRecoveryInfo, anyway? I still say it's useless. regards, tom lane
On Tue, 2010-04-27 at 18:08 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Tue, 2010-04-27 at 17:24 -0400, Tom Lane wrote: > >> I think we should just lose that test, as well as the variable. > > > Yes, though it looks like it is still necessary in creating a valid > > initial state because otherwise we may have xids in the KnownAssignedXids > > array that are already complete. > > Huh? How is a filter as coarse as an oldest-running-XID filter going > to prevent that? And aren't we initializing from trustworthy data in > ProcArrayApplyRecoveryInfo, anyway? > > I still say it's useless. Quite possibly. You're looking at other code outside of this patch. I'm happy that you do so, but is it immediately related? I can have another look when we finish this. -- Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes: > On Tue, 2010-04-27 at 18:08 -0400, Tom Lane wrote: >> Huh? How is a filter as coarse as an oldest-running-XID filter going >> to prevent that? And aren't we initializing from trustworthy data in >> ProcArrayApplyRecoveryInfo, anyway? >> >> I still say it's useless. > Quite possibly. You're looking at other code outside of this patch. I'm > happy that you do so, but is it immediately related? I can have another > look when we finish this. Well, it's nearby anyway. I've committed the present patch (with a number of fixes). While I was looking at it I came across several things in the existing code that I think are either wrong or at least inadequately documented --- the above complaint is just the tip of the iceberg. I'm going to make another pass over it to see if I'm just missing things, and then report back. regards, tom lane
On Tue, 2010-04-27 at 20:13 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > On Tue, 2010-04-27 at 18:08 -0400, Tom Lane wrote: > >> Huh? How is a filter as coarse as an oldest-running-XID filter going > >> to prevent that? And aren't we initializing from trustworthy data in > >> ProcArrayApplyRecoveryInfo, anyway? > >> > >> I still say it's useless. > > > Quite possibly. You're looking at other code outside of this patch. I'm > > happy that you do so, but is it immediately related? I can have another > > look when we finish this. > > Well, it's nearby anyway. I've committed the present patch (with a > number of fixes). While I was looking at it I came across several > things in the existing code that I think are either wrong or at least > inadequately documented --- the above complaint is just the tip of the > iceberg. I'm going to make another pass over it to see if I'm just > missing things, and then report back. Thank you for your input and changes. You're welcome to share my iceberg. -- Simon Riggs www.2ndQuadrant.com
Hi Simon, In another thread you mentioned you were lacking information from me: On Tue, May 4, 2010 17:10, Simon Riggs wrote: > > There is no evidence that Erik's strange performance has anything to do > with HS; it hasn't been seen elsewhere and he didn't respond to > questions about the test setup to provide background. The profile didn't > fit any software problem I can see. > I'm sorry if I missed requests for things that were not already mentioned. Let me repeat: OS: CentOS 5.4 2 quadcores: Intel(R) Xeon(R) CPU X5482 @ 3.20GHz Areca 1280ML primary and standby db both on a 12 disk array (SATA 7200rpm, Seagate Barracuda ES.2) It goes without saying (I hope) that apart from the pgbench tests and a few ssh sessions (myself), the machine was idle. It would be interesting if anyone repeated these simple tests and produced evidence that these results are not HS-related. (Unfortunately, I don't have much time for more testing at the moment.) thanks, Erik Rijkers On Sun, April 25, 2010 21:07, Simon Riggs wrote: > On Sun, 2010-04-25 at 20:25 +0200, Erik Rijkers wrote: > >> Sorry if it's too much data, but to me at least it was illuminating; >> I now understand the effects of the different parameters better. > > That's great, many thanks. > > A few observations: > > * Standby performance is actually slightly above normal running. This is > credible because of the way snapshots are now taken. We don't need to > scan the procarray looking for write transactions, since we know > everything is read only. So we scan just the knownassignedxids, which, if > there is no activity from the primary, will be zero-length, so snapshots will > actually get taken much faster in this case on the standby. The snapshot > performance on the standby is O(n) where n is the number of write transactions > "currently" on the primary (transfer delays blur the word "currently"). > > * The results for scale factor < 100 are fine, and the results for >100 > with few connections get thrown out by long transaction times. With > larger numbers of connections the wait problems seem to go away. Looks > like Erik (and possibly Hot Standby in general) has an I/O problem, > though "from what" is not yet determined. It could be just hardware, or > might be hardware plus other factors. > > -- > Simon Riggs www.2ndQuadrant.com > >
On Tue, 2010-05-04 at 18:10 +0200, Erik Rijkers wrote: > It would be interesting if anyone repeated these simple tests and > produced evidence that these results are not HS-related. > > (Unfortunately, I don't have much time for more testing at the moment.) Would you be able to make those systems available for further testing? First, I'd perform the same test with the systems swapped, so we know more about the symmetry of the issue. After that, I'd like to look more into the internals. Is it possible to set up SystemTap and DTrace on these systems? Thanks, either way. -- Simon Riggs www.2ndQuadrant.com
On Tue, May 4, 2010 18:19, Simon Riggs wrote: > On Tue, 2010-05-04 at 18:10 +0200, Erik Rijkers wrote: >> It would be interesting if anyone repeated these simple tests and >> produced evidence that these results are not HS-related. >> >> (Unfortunately, I don't have much time for more testing at the moment.) > > Would you be able to make those systems available for further testing? No, sorry. > First, I'd perform the same test with the systems swapped, so we know > more about the symmetry of the issue. After that, I'd like to look > more into the internals. You mean "systems swapped", as in primary and standby? Primary and standby were on the same machine in these tests (even the same RAID). I can eventually move the standby (the 'slow' side, as it stands) to another, quite similar machine. Not in the coming days though... > Is it possible to set up SystemTap and DTrace on these systems? I did install SystemTap last week; DTrace is not installed, I think. (I've never used either.) Erik Rijkers
Erik Rijkers wrote: > OS: CentOS 5.4 > 2 quadcores: Intel(R) Xeon(R) CPU X5482 @ 3.20GHz > Areca 1280ML > primary and standby db both on a 12 disk array (SATA 7200rpm, Seagate Barracuda ES.2) > To fill in from data you already mentioned upthread: 32 GB RAM CentOS release 5.4 (Final), x86_64 Linux, 2.6.18-164.el5 Thanks for all the reporting you've done here; really helpful. Questions to make sure I'm trying to duplicate the right thing here: Is your disk array all configured as one big RAID10 volume, so essentially a 6-disk stripe with redundancy, or something else? In particular I want to know whether the WAL/database/archives are split onto separate volumes or all on one big one when you were testing. Is this on ext3 with standard mount parameters? Also, can you confirm that every test you ran only had a single pgbench worker thread (-j 1 or not specified)? That looked to be the case from the ones I saw where you posted the whole command used. It would not surprise me to find that the CPU usage profile of a standby is just different enough from the primary that it results in the pgbench program not being scheduled enough time, due to the known Linux issues in that area. Not going to assume that, of course, just one thing I want to check when trying to replicate what you've run into. I didn't see any glaring HS performance issues like the ones you've been reporting the last time I tried performance testing in this area, just a small percentage drop. But I didn't specifically go looking for it either. With your testing rig out of service, we're going to try and replicate that on a system here. My home server is like a scaled-down version of yours (single quad-core, 8GB RAM, smaller Areca controller, 5 disks instead of 12) and it's running the same CentOS version. If the problem is really universal, I should see it here too. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Erik Rijkers wrote: > Hi Simon, > > In another thread you mentioned you were lacking information from me: > > On Tue, May 4, 2010 17:10, Simon Riggs wrote: >> There is no evidence that Erik's strange performance has anything to do >> with HS; it hasn't been seen elsewhere and he didn't respond to >> questions about the test setup to provide background. The profile didn't >> fit any software problem I can see. >> > > I'm sorry if I missed requests for things that were not already mentioned. > > Let me repeat: > OS: CentOS 5.4 > 2 quadcores: Intel(R) Xeon(R) CPU X5482 @ 3.20GHz > Areca 1280ML > primary and standby db both on a 12 disk array (SATA 7200rpm, Seagate Barracuda ES.2) > > It goes without saying (I hope) that apart from the pgbench tests > and a few ssh sessions (myself), the machine was idle. > > It would be interesting if anyone repeated these simple tests and produced > evidence that these results are not HS-related. > > (Unfortunately, I don't have much time for more testing at the moment.) FWIW - I'm seeing a behaviour here under pgbench -S workloads that looks kinda related. Using -j 16 -c 16 -T 120 I get either 100000 tps and around 660000 context switches per second, or on some runs I end up with 150000 tps and around 1M context switches/s sustained. I mostly get the 100k result, but once in a while I get the 150k one. And one can even anticipate the final transaction rate from watching "vmstat 1"... I'm not sure yet what is causing that behaviour, but that is with 9.0B1 on a Dual Quadcore Nehalem box with 16 CPU threads (8+HT) on a pure in-memory workload (scale = 20 with 48GB RAM). Stefan
On Tue, May 4, 2010 20:26, Greg Smith wrote: > Erik Rijkers wrote: >> OS: CentOS 5.4 >> 2 quadcores: Intel(R) Xeon(R) CPU X5482 @ 3.20GHz >> Areca 1280ML >> primary and standby db both on a 12 disk array (SATA 7200rpm, Seagate Barracuda ES.2) >> > > To fill in from data you already mentioned upthread: > 32 GB RAM > CentOS release 5.4 (Final), x86_64 Linux, 2.6.18-164.el5 > > Thanks for all the reporting you've done here; really helpful. > Questions to make sure I'm trying to duplicate the right thing here: > > Is your disk array all configured as one big RAID10 volume, so > essentially a 6-disk stripe with redundancy, or something else? In > particular I want to know whether the WAL/database/archives are split onto > separate volumes or all on one big one when you were testing. Everything together: the RAID is what Areca calls 'raid10(1E)'. (To be honest I don't remember exactly what that 1E means - extra flexibility in the number of disks, I think.) Btw, some of my emails contained the postgresql.conf of both instances. > > Is this on ext3 with standard mount parameters? ext3 noatime > Also, can you confirm that every test you ran only had a single pgbench > worker thread (-j 1 or not specified)? That looked to be the case from > the ones I saw where you posted the whole command used. It would not yes; the literal cmd: time /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/bin/pgbench -h /tmp -p 6565 -U rijkers -n -S -c 20 -T 900 -j 1 replicas To avoid wrapping in the emails I just removed '-h /tmp', '-U rijkers', and 'replicas'. (I may have run the primary's pgbench binary also against the slave - I don't think that should make any difference.) > surprise me to find that the CPU usage profile of a standby is just > different enough from the primary that it results in the pgbench program > not being scheduled enough time, due to the known Linux issues in that > area. Not going to assume that, of course, just one thing I want to > check when trying to replicate what you've run into. > > I didn't see any glaring HS performance issues like the ones you've been > reporting the last time I tried performance testing in this area, just a > small percentage drop. But I didn't specifically go looking for it Here, it seems repeatable, but does not occur with all scales. Hm, maybe I should just dump *all* of my results on the wiki for reference. (I'll look at that later.) > either. With your testing rig out of service, we're going to try and > replicate that on a system here. My home server is like a scaled-down > version of yours (single quad-core, 8GB RAM, smaller Areca controller, 5 > disks instead of 12) and it's running the same CentOS version. If the > problem is really universal, I should see it here too. > Thanks, Erik Rijkers
On Tue, 2010-05-04 at 21:34 +0200, Stefan Kaltenbrunner wrote: > FWIW - I'm seeing a behaviour here under pgbench -S workloads that looks > kinda related. > > Using -j 16 -c 16 -T 120 I get either 100000 tps and around 660000 > context switches per second, or on some runs I end up with 150000 tps and > around 1M context switches/s sustained. I mostly get the 100k result, but > once in a while I get the 150k one. And one can even anticipate the > final transaction rate from watching "vmstat 1"... > > I'm not sure yet what is causing that behaviour, but that is with > 9.0B1 on a Dual Quadcore Nehalem box with 16 CPU threads (8+HT) on a > pure in-memory workload (scale = 20 with 48GB RAM). Educated guess at a fix: please test this patch. It's good for performance testing, but doesn't work correctly at failover, which would obviously be addressed prior to any commit. -- Simon Riggs www.2ndQuadrant.com
Erik Rijkers wrote: > Everything together: the RAID is what Areca calls 'raid10(1E)'. > (To be honest I don't remember exactly what that 1E means - > extra flexibility in the number of disks, I think.) Standard RAID10 only supports an even number of disks. The 1E variants also allow putting an odd number in. If you're using an even number, like in your case, the results are the same until you start losing drives, at which point the degraded-performance pattern changes a bit due to differences in how things are striped. See these for more info: http://bytepile.com/raid_class.php and http://en.wikipedia.org/wiki/Non-standard_RAID_levels#IBM_ServeRAID_1E http://publib.boulder.ibm.com/infocenter/eserver/v1r2/index.jsp?topic=/diricinfo/fqy0_craid1e.html I don't think using RAID1E has anything to do with your results, but it is possible, given your test configuration, that part of the difference you're seeing relates to where on the disk the blocks are stored. If you take a hard drive and write two copies of something onto it, the second copy will be a little slower than the first. That's just because of how drive speed drops across the surface as you move further along it. There are some examples of what that looks like at http://projects.2ndquadrant.it/sites/default/files/pg-hw-bench-2010.pdf on pages 21-23. Returning to your higher-level results, one of the variables you weren't sure how to account for was caching effects on the standby: the possibility that it simply didn't have the data cached the same way as the master, which would skew the results. The usual way I look for that is by graphing the pgbench TPS over time. In that situation, you can see throughput shoot upwards over the course of the run, a very obvious pattern. Example at http://projects.2ndquadrant.it/sites/default/files/pgbench-intro.pdf on pages 36-37. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
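For anyone who wants to draw that TPS-over-time graph, here is a small sketch that buckets a pgbench per-transaction log into per-second counts. It assumes the 9.0-era log line format produced by pgbench -l (client_id transaction_no time file_no time_epoch time_us) and the default log file name pattern pgbench_log.<pid>; treat both as assumptions and check your pgbench version's documentation:

#include <stdio.h>

/* Read a pgbench -l transaction log on stdin and print one line per
 * second of the run: "epoch_second transaction_count".  A cache that
 * is still warming up shows as a steadily rising count. */
int
main(void)
{
    int client, txno, filenum;
    double txtime;                 /* per-transaction latency field */
    long epoch, usec;
    long cur = -1, count = 0;

    while (scanf("%d %d %lf %d %ld %ld",
                 &client, &txno, &txtime, &filenum, &epoch, &usec) == 6)
    {
        if (epoch != cur)
        {
            if (cur >= 0)
                printf("%ld %ld\n", cur, count);   /* flush previous second */
            cur = epoch;
            count = 0;
        }
        count++;
    }
    if (cur >= 0)
        printf("%ld %ld\n", cur, count);
    return 0;
}

Compile and run as, e.g., cc -o tpsbucket tpsbucket.c && ./tpsbucket < pgbench_log.12345, then plot the two columns; a flat line means the cache was already warm, a rising curve means it wasn't.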