On Thu, 2010-04-22 at 23:45 +0100, Simon Riggs wrote:
> On Thu, 2010-04-22 at 20:39 +0200, Erik Rijkers wrote:
> > On Sun, April 18, 2010 13:01, Simon Riggs wrote:
>
> > any comment is welcome...
>
> Please can you re-run with -l and post me the file of times

Erik has sent me details of a test run. My analysis of that is:

I'm seeing the response time profile on the standby as

99% <110us
99.9% <639us
99.99% <615ms

0.052% (52 samples) are >5ms elapsed and account for 24 s, which is
about 45% of elapsed time.
Of the 52 samples >5ms, 50 of them are >100ms and 2 >1s.
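
For anyone wanting to reproduce this kind of breakdown from a pgbench -l log, here is a minimal Python sketch. It assumes the third whitespace-separated field of each log line is the transaction's elapsed time in microseconds. The function names and the 5ms threshold parameter are illustrative only; this is not the script used for the analysis above.

```python
import math


def parse_latencies_us(lines):
    """Pull the per-transaction latency out of pgbench -l style log lines.

    Assumption: field 3 (index 2) is elapsed time in microseconds.
    Lines with fewer than three fields are skipped.
    """
    latencies = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            latencies.append(int(fields[2]))
    return latencies


def percentile(sorted_vals, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def tail_summary(latencies_us, threshold_us=5000):
    """Count samples above the threshold and their share of total time."""
    slow = [t for t in latencies_us if t > threshold_us]
    return {
        "count": len(slow),
        "fraction": len(slow) / len(latencies_us),
        "time_share": sum(slow) / sum(latencies_us),
    }
```

Typical use would be to read the log file, sort the result of parse_latencies_us(), call percentile() for 99, 99.9 and 99.99, and call tail_summary() for the share of elapsed time spent in samples >5ms. As a sanity check on the figures above, 52 samples being 0.052% implies roughly 100,000 logged transactions.
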

99% of transactions take similar times on primary and standby; the
overall result is dragged down by rare but severe spikes.

We're looking for something that would turn an operation that normally
takes <0.1ms into one that takes >100ms, yet does eventually
return. That looks like a severe resource contention issue.

This effect happens when running just a single read-only session on
the standby from pgbench. No confirmation as yet as to whether recovery
is active or dormant, or what other activity, if any, occurs on the
standby server at the same time. So no other clues as yet as to what the
contention might be, except that we note the standby is writing data and
the database is large.

> Please also rebuild using --enable-profile so we can see what's
> happening.
>
> Can you also try the enclosed patch which implements prefetching during
> replay of btree delete records. (Need to set effective_io_concurrency)
As yet, no confirmation that the attached patch is even relevant. It was
just a wild guess at some tuning, while we wait for further info.
> Thanks for your further help.
"Some kind of contention" is the best we can say at present.
-- Simon Riggs www.2ndQuadrant.com