Thread: some longer, larger pgbench tests with various performance-related patches

some longer, larger pgbench tests with various performance-related patches

From
Robert Haas
Date:
Early yesterday morning, I was able to use Nate Boley's test machine to
do a single 30-minute pgbench run at scale factor 300 using a variety
of trees built with various patches, and with the -l option added to
track latency on a per-transaction basis.  All tests were done using
32 clients and permanent tables.  The configuration was otherwise
identical to that described here:

http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6dw@mail.gmail.com
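
For reference, a minimal sketch of the kind of invocation described
above; the exact flags and database name here are assumptions, not a
transcript of the actual commands:

    pgbench -i -s 300 pgbench                # initialize at scale factor 300
    pgbench -c 32 -j 32 -T 1800 -l pgbench   # 32 clients, 30 minutes, -l latency log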

By doing this, I hoped to get a better understanding of (1) the
effects of a scale factor too large to fit in shared_buffers, (2) what
happens on a longer test run, and (3) how response time varies
throughout the test.  First, here are the raw tps numbers:

background-clean-slru-v2: tps = 2027.282539 (including connections establishing)
buffreelistlock-reduction-v1: tps = 2625.155348 (including connections
establishing)
buffreelistlock-reduction-v1-freelist-ok-v2: tps = 2468.638149
(including connections establishing)
freelist-ok-v2: tps = 2467.065010 (including connections establishing)
group-commit-2012-01-21: tps = 2205.128609 (including connections establishing)
master: tps = 2200.848350 (including connections establishing)
removebufmgrfreelist-v1: tps = 2679.453056 (including connections establishing)
xloginsert-scale-6: tps = 3675.312202 (including connections establishing)

Obviously these numbers are fairly noisy, especially since this is
just one run, so the increases and decreases might not be all that
meaningful.  Time permitting, I'll try to run some more tests to get
my hands around that situation a little better.

Graphs are here:

http://wiki.postgresql.org/wiki/Robert_Haas_9.2CF4_Benchmark_Results

There are two graphs for each branch.  The first is a scatter plot of
latency vs. transaction time.  I found that graph hard to understand,
though; I couldn't really tell what I was looking at.  So I made a
second set of graphs which graph number of completed transactions in a
given second of the test against time.  The results are also included
on the previous page, below the latency graphs, and I find them much
more informative.
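
Counting completed transactions per second can be done straight from
the -l output.  A sketch, assuming the per-transaction log format of
this era (client_id, transaction_no, latency_us, file_no, epoch_sec,
epoch_usec), so that field 5 is the second in which each transaction
completed:

    awk '{print $5}' pgbench_log* | sort -n | uniq -c > tps_per_second.txt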

A couple of things stand out at me from these graphs.  First, some of
these transactions had really long latency.  Second, there are a
remarkable number of seconds throughout the test during which no
transactions at all manage to complete, sometimes several seconds in a
row.  I'm not sure why.  Third, all of the tests initially start off
processing transactions very quickly, and get slammed down very hard,
probably because the very high rate of transaction processing early on
causes a checkpoint to occur around 200 s.  I didn't actually log when
the checkpoints were occurring, but it seems like a good guess.  It's
also interesting to wonder whether the checkpoint I/O itself causes
the drop-off, or the ensuing full page writes.  Fourth,
xloginsert-scale-6 helps quite a bit; in fact, it's the only patch
that actually changes the whole shape of the tps graph.  I'm
speculating here, but that may be because it blunts the impact of full
page writes by allowing backends to copy their full page images into
the write-ahead log in parallel.

One thing I also noticed while running the tests is that the system
was really not using much CPU time.  It was mostly idle.  That could
be because waiting for I/O leads to waiting for locks, or it could be
fundamental lock contention.  I don't know which.

A couple of obvious further tests suggest themselves: (1) rerun some
of the tests with full_page_writes=off, and (2) repeat this test with
the remaining performance-related patches.  It would be especially
interesting, I think, to see what effect the checkpoint-related
patches have on these graphs.  But I plan to drop
buffreelistlock-reduction-v1 and freelist-ok-v2 from future test runs
based on Simon's comments elsewhere.  I'm including the results here
just because these tests were already running when he made those
comments.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: some longer, larger pgbench tests with various performance-related patches

From
Simon Riggs
Date:
On Tue, Jan 24, 2012 at 8:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> do a single 30-minute pgbench run at scale factor 300 using a variety

Nice

A minor but necessary point: repeated testing of the Group commit
patch with synchronous_commit off is clearly pointless, so publishing
numbers for it without saying very clearly that that's what's
happening doesn't really help.

> Graphs are here:
>
> http://wiki.postgresql.org/wiki/Robert_Haas_9.2CF4_Benchmark_Results

Nicer

> There are two graphs for each branch.  The first is a scatter plot of
> latency vs. transaction time.  I found that graph hard to understand,
> though; I couldn't really tell what I was looking at.  So I made a
> second set of graphs which graph number of completed transactions in a
> given second of the test against time.  The results are also included
> on the previous page, below the latency graphs, and I find them much
> more informative.
>
> A couple of things stand out at me from these graphs.  First, some of
> these transactions had really long latency.  Second, there are a
> remarkable number of seconds throughout the test during which no
> transactions at all manage to complete, sometimes several seconds in a
> row.  I'm not sure why.  Third, all of the tests initially start off
> processing transactions very quickly, and get slammed down very hard,
> probably because the very high rate of transaction processing early on
> causes a checkpoint to occur around 200 s.

Check

I'm happy that this exposes characteristics I've seen and have been
discussing for a while now.

It would be useful to calculate the slow-down contribution of the
longer transactions: what % of total time is given over to the slowest
10% of transactions?  Looking at that, I think you'll see why we
should care about sorting out what happens in the worst cases.
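
Something like this sketch would compute that from the -l logs,
assuming per-transaction latency in microseconds in field 3:

    awk '{print $3}' pgbench_log* | sort -n > lat.txt
    n=$(wc -l < lat.txt)
    awk -v n="$n" '{ total += $1; if (NR > 0.9 * n) slow += $1 }
        END { printf "slowest 10%% of txns = %.1f%% of total time\n",
                     100 * slow / total }' lat.txt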

>  I didn't actually log when
> the checkpoints were occurring, but it seems like a good guess.  It's
> also interesting to wonder whether the checkpoint I/O itself causes
> the drop-off, or the ensuing full page writes.  Fourth,
> xloginsert-scale-6 helps quite a bit; in fact, it's the only patch
> that actually changes the whole shape of the tps graph.  I'm
> speculating here, but that may be because it blunts the impact of full
> page writes by allowing backends to copy their full page images into
> the write-ahead log in parallel.

I think we should be working to commit XLogInsert and then Group
Commit, then come back to the discussion.

There's no way we're committing other patches but not those, so
everything needs to be viewed with those two in, i.e. commit and then
re-baseline.

So I'd say no more tests just yet, but then lots of testing next week++.

> But I plan to drop
> buffreelistlock-reduction-v1 and freelist-ok-v2 from future test runs
> based on Simon's comments elsewhere.  I'm including the results here
> just because these tests were already running when he made those
> comments.

Yep

One patch you aren't testing is clog_history, which is designed to work in
conjunction with background_clean_slru. That last one clearly needs a
little tuning though...

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: some longer, larger pgbench tests with various performance-related patches

From
Pavan Deolasee
Date:
On Wed, Jan 25, 2012 at 2:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Early yesterday morning, I was able to use Nate Boley's test machine to
> do a single 30-minute pgbench run at scale factor 300 using a variety
> of trees built with various patches, and with the -l option added to
> track latency on a per-transaction basis.  All tests were done using
> 32 clients and permanent tables.  The configuration was otherwise
> identical to that described here:
>
> http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6dw@mail.gmail.com
>
> By doing this, I hoped to get a better understanding of (1) the
> effects of a scale factor too large to fit in shared_buffers, (2) what
> happens on a longer test run, and (3) how response time varies
> throughout the test.  First, here are the raw tps numbers:
>
> background-clean-slru-v2: tps = 2027.282539 (including connections establishing)
> buffreelistlock-reduction-v1: tps = 2625.155348 (including connections
> establishing)
> buffreelistlock-reduction-v1-freelist-ok-v2: tps = 2468.638149
> (including connections establishing)
> freelist-ok-v2: tps = 2467.065010 (including connections establishing)
> group-commit-2012-01-21: tps = 2205.128609 (including connections establishing)
> master: tps = 2200.848350 (including connections establishing)
> removebufmgrfreelist-v1: tps = 2679.453056 (including connections establishing)
> xloginsert-scale-6: tps = 3675.312202 (including connections establishing)
>
> Obviously these numbers are fairly noisy, especially since this is
> just one run, so the increases and decreases might not be all that
> meaningful.  Time permitting, I'll try to run some more tests to get
> my hands around that situation a little better.
>

This is nice. I am sure long-running tests will point out many more
issues. If we are doing these tests, it might be more effective to run
even longer, so that we get at least 3-4 checkpoints/vacuums/analyzes
(and other such events which can impact the final numbers either way)
per test. Otherwise, one patch may stand out simply because it avoids,
say, one checkpoint.

It would definitely help to log the checkpoint and
autovacuum/autoanalyze details and plot them on the graph to see if
the drop in performance has anything to do with these activities. It
might also be a good idea to collect pg statistics such as relation
sizes at the end of the run.
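
A minimal sketch of the logging setup that would capture those events
(both settings already exist; a threshold of 0 logs every autovacuum
and autoanalyze run):

    # postgresql.conf
    log_checkpoints = on
    log_autovacuum_min_duration = 0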

Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


Re: some longer, larger pgbench tests with various performance-related patches

From
Robert Haas
Date:
On Tue, Jan 24, 2012 at 4:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I think we should be working to commit XLogInsert and then Group
> Commit, then come back to the discussion.

I definitely agree that those two have way more promise than anything
else on the table.  However, now that I understand how badly we're
getting screwed by checkpoints, I'm curious to do some more
investigation of what's going on there.  I can't help thinking that in
these particular cases the full page writes may be a bigger issue than
the checkpoint itself.  If that turns out to be the case it's not
likely to be something we're able to address for 9.2, but I'd like to
at least characterize it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Tue, Jan 24, 2012 at 12:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Early yesterday morning, I was able to use Nate Boley's test machine to
> do a single 30-minute pgbench run at scale factor 300 using a variety
> of trees built with various patches, and with the -l option added to
> track latency on a per-transaction basis.  All tests were done using
> 32 clients and permanent tables.  The configuration was otherwise
> identical to that described here:
>
> http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6dw@mail.gmail.com

Previously we mostly used this machine for CPU benchmarking.  Have you
previously described the IO subsystem?  That is becoming relevant for
these types of benchmarks.   For example, is WAL separated from data?

>
> By doing this, I hoped to get a better understanding of (1) the
> effects of a scale factor too large to fit in shared_buffers,

In my hands, the active part of data at scale of 300 fits very
comfortably into 8GB shared_buffers.

Indeed, until you have run a very long time so that pgbench_history
gets large, everything fits in 8GB.

Cheers,

Jeff


Re: some longer, larger pgbench tests with various performance-related patches

From
Robert Haas
Date:
On Wed, Jan 25, 2012 at 12:00 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Jan 24, 2012 at 12:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Early yesterday morning, I was able to use Nate Boley's test machine to
>> do a single 30-minute pgbench run at scale factor 300 using a variety
>> of trees built with various patches, and with the -l option added to
>> track latency on a per-transaction basis.  All tests were done using
>> 32 clients and permanent tables.  The configuration was otherwise
>> identical to that described here:
>>
>> http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6dw@mail.gmail.com
>
> Previously we mostly used this machine for CPU benchmarking.  Have you
> previously described the IO subsystem?  That is becoming relevant for
> these types of benchmarks.   For example, is WAL separated from data?

I actually don't know much about the I/O subsystem, but, no, WAL is
not separated from data.  I believe $PGDATA is on a SAN, but I don't
know anything about its characteristics.

>> By doing this, I hoped to get a better understanding of (1) the
>> effects of a scale factor too large to fit in shared_buffers,
>
> In my hands, the active part of data at scale of 300 fits very
> comfortably into 8GB shared_buffers.
>
> Indeed, until you have run a very long time so that pgbench_history
> gets large, everything fits in 8GB.

Hmm, my mistake.  Maybe I need to crank up the scale factor some more.
This is why good benchmarking is hard.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Wed, Jan 25, 2012 at 9:09 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 25, 2012 at 12:00 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Tue, Jan 24, 2012 at 12:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Early yesterday morning, I was able to use Nate Boley's test machine to
>>> do a single 30-minute pgbench run at scale factor 300 using a variety
>>> of trees built with various patches, and with the -l option added to
>>> track latency on a per-transaction basis.  All tests were done using
>>> 32 clients and permanent tables.  The configuration was otherwise
>>> identical to that described here:
>>>
>>> http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6dw@mail.gmail.com
>>
>> Previously we mostly used this machine for CPU benchmarking.  Have you
>> previously described the IO subsystem?  That is becoming relevant for
>> these types of benchmarks.   For example, is WAL separated from data?
>
> I actually don't know much about the I/O subsystem, but, no, WAL is
> not separated from data.  I believe $PGDATA is on a SAN, but I don't
> know anything about its characteristics.

I think the checkpointing issues that Greg is exploring are important,
and I'm pretty sure that that is what is limiting your TPS in this
case.  However, I don't think we can make much progress on that front
using a machine whose IO subsystem is largely unknown, and not
tweakable.

So I think the best use of this machine would be to explore the purely
CPU based scaling issues, like freelist, CLOG, and XLogInsert.

To do that, I'd set the scale factor small enough that the entire data
set can be cached dirty by the kernel, based on the kernel
parameters:

egrep .  /proc/sys/vm/dirty_*

Then set shared_buffers to be less than the needs for that scale
factor, so freelist gets exercised.

And neutralize checkpoints, by setting fsync=off, so they don't
generate physical IO that we can't control for, given the current
constraints on the machine's setup.
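
Putting that together, a sketch of the configuration being described;
the specific values are illustrative assumptions, not recommendations:

    # postgresql.conf -- CPU-scaling-focused test setup
    shared_buffers = 2GB   # deliberately smaller than the data set, to exercise the freelist
    fsync = off            # neutralize checkpoint I/O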

Cheers,

Jeff


Re: some longer, larger pgbench tests with various performance-related patches

From
Nathan Boley
Date:
> I actually don't know much about the I/O subsystem, but, no, WAL is
> not separated from data.  I believe $PGDATA is on a SAN, but I don't
> know anything about its characteristics.

The computer is here:
http://www.supermicro.nl/Aplus/system/2U/2042/AS-2042G-6RF.cfm

$PGDATA is on a 5-disk SATA software RAID 5.

Is there anything else that would be useful to know?  (Sorry, this
isn't a subject that I'm very familiar with.)

-Nathan


Re: some longer, larger pgbench tests with various performance-related patches

From
Robert Haas
Date:
On Wed, Jan 25, 2012 at 8:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 24, 2012 at 4:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> I think we should be working to commit XLogInsert and then Group
>> Commit, then come back to the discussion.
>
> I definitely agree that those two have way more promise than anything
> else on the table.  However, now that I understand how badly we're
> getting screwed by checkpoints, I'm curious to do some more
> investigation of what's going on there.  I can't help thinking that in
> these particular cases the full page writes may be a bigger issue than
> the checkpoint itself.  If that turns out to be the case it's not
> likely to be something we're able to address for 9.2, but I'd like to
> at least characterize it.

I did another run with full_page_writes=off:

http://wiki.postgresql.org/wiki/File:Tps-master-nofpw.png

This is run with the master branch as of commit
4f42b546fd87a80be30c53a0f2c897acb826ad52, same as the previous tests,
so that the results are comparable.

The graph is pretty dramatically different from the earlier one:

http://wiki.postgresql.org/wiki/File:Tps-master.png

So are the TPS numbers:

full page writes on: tps = 2200.848350 (including connections establishing)
full page writes off: tps = 9494.757916 (including connections establishing)

It seems pretty clear that, even with full_page_writes=off, the
checkpoint is affecting things negatively: the first 700 seconds or so
show much better and more consistent performance than the remaining
portion of the test.  I'm not sure why that is, but I'm guessing there
was a checkpoint around that time.  But the effect is much worse with
full_page_writes=on: the distinctive parabolic shape of those graphs
is apparently caused by the gradually decreasing frequency of full
page writes as the number of transactions processed since the last
checkpoint grows.
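
One cheap way to test the checkpoint guess, without any log-file
parsing, is to snapshot pg_stat_bgwriter before and after the run and
difference the counters; a sketch:

    psql -c "SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint
             FROM pg_stat_bgwriter;"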

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: some longer, larger pgbench tests with various performance-related patches

From
Simon Riggs
Date:
On Wed, Feb 1, 2012 at 5:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 25, 2012 at 8:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Jan 24, 2012 at 4:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> I think we should be working to commit XLogInsert and then Group
>>> Commit, then come back to the discussion.
>>
>> I definitely agree that those two have way more promise than anything
>> else on the table.  However, now that I understand how badly we're
>> getting screwed by checkpoints, I'm curious to do some more
>> investigation of what's going on there.  I can't help thinking that in
>> these particular cases the full page writes may be a bigger issue than
>> the checkpoint itself.  If that turns out to be the case it's not
>> likely to be something we're able to address for 9.2, but I'd like to
>> at least characterize it.
>
> I did another run with full_page_writes=off:
>
> http://wiki.postgresql.org/wiki/File:Tps-master-nofpw.png
>
> This is run with the master branch as of commit
> 4f42b546fd87a80be30c53a0f2c897acb826ad52, same as the previous tests,
> so that the results are comparable.
>
> The graph is pretty dramatically different from the earlier one:
>
> http://wiki.postgresql.org/wiki/File:Tps-master.png
>
> So are the TPS numbers:
>
> full page writes on: tps = 2200.848350 (including connections establishing)
> full page writes off: tps = 9494.757916 (including connections establishing)
>
> It seems pretty clear that, even with full_page_writes=off, the
> checkpoint is affecting things negatively: the first 700 seconds or so
> show much better and more consistent performance than the remaining
> portion of the test.  I'm not sure why that is, but I'm guessing there
> was a checkpoint around that time.  But the effect is much worse with
> full_page_writes=on: the distinctive parabolic shape of those graphs
> is apparently caused by the gradually decreasing frequency of full
> page writes as the number of transactions processed since the last
> checkpoint grows.

Sounds like time to test the checkpoint smoothing patch.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: some longer, larger pgbench tests with various performance-related patches

From
Jeff Janes
Date:
On Wed, Feb 1, 2012 at 9:47 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 25, 2012 at 8:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Jan 24, 2012 at 4:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> I think we should be working to commit XLogInsert and then Group
>>> Commit, then come back to the discussion.
>>
>> I definitely agree that those two have way more promise than anything
>> else on the table.  However, now that I understand how badly we're
>> getting screwed by checkpoints, I'm curious to do some more
>> investigation of what's going on there.  I can't help thinking that in
>> these particular cases the full page writes may be a bigger issue than
>> the checkpoint itself.  If that turns out to be the case it's not
>> likely to be something we're able to address for 9.2, but I'd like to
>> at least characterize it.
>
> I did another run with full_page_writes=off:
>
> http://wiki.postgresql.org/wiki/File:Tps-master-nofpw.png
>
> This is run with the master branch as of commit
> 4f42b546fd87a80be30c53a0f2c897acb826ad52, same as the previous tests,
> so that the results are comparable.
>
> The graph is pretty dramatically different from the earlier one:
>
> http://wiki.postgresql.org/wiki/File:Tps-master.png
>
> So are the TPS numbers:
>
> full page writes on: tps = 2200.848350 (including connections establishing)
> full page writes off: tps = 9494.757916 (including connections establishing)
>
> It seems pretty clear that, even with full_page_writes=off, the
> checkpoint is affecting things negatively: the first 700 seconds or so
> show much better and more consistent performance than the remaining
> portion of the test.  I'm not sure why that is, but I'm guessing there
> was a checkpoint around that time.

We really need to nail that down.  Could you post the scripts (on the
wiki) you use for running the benchmark and making the graph?  I'd
like to see how much work it would be for me to change it to detect
checkpoints and do something like color the markers blue during
checkpoints and red elsewhen.

Also, I'm not sure how bad that graph really is.  The overall
throughput is more variable, and there are latency spikes, but
they are few.  The dominant feature is simply that the long-term
average is less than the initial burst.  Of course the goal is to have
a high level of throughput with a smooth latency under sustained
conditions.  But to expect that that long-sustained smooth level of
throughput be identical to the "initial burst throughput" sounds like
more of a fantasy than a goal.  If we want to accept the lowered
throughput and work on whatever variability/spikes are there, I think
a good approach would be to take the long-term TPS average, and dial
the number of clients back until the initial burst TPS matches that
long-term average.  Then see if the spikes still exist over the long
term using that dialed-back number of clients.



> But the effect is much worse with
> full_page_writes=on: the distinctive parabolic shape of those graphs
> is apparently caused by the gradually decreasing frequency of full
> page writes as the number of transactions processed since the last
> checkpoint grows.

The dilution out of not-yet-full-page-written buffers as the
checkpoint ages is probably the ultimate cause of that pattern, but I
suspect there are intermediate causes that we could tackle.

I don't think the full-page-writes are leading to WALInsert
contention, for example, because that would probably lead to smooth
throughput decline, but not those latency spikes in which those entire
seconds go by without transactions.  I doubt it is leading to general
IO compaction, as the IO at that point should be pretty much
sequential (the checkpoint has not yet reached the sync stage, the WAL
is sequential).  So I bet that that is caused by fsyncs occurring at
xlog segment switches, and the locking that that entails.  If I
recall, we can have a segment which is completely written to OS and in
the process of being fsynced, and we can have another segment which is
in some state of partially in wal_buffers and partly written out to OS
cache, but that we can't start reusing the wal_buffers that were
already written to OS for that segment (and therefore are
theoretically available for reuse by the upcoming 3rd segment)  until
the previous segment's fsync has completed.  So all WALInserts freeze.
Or something like that.  This code has changed a bit since last time
I studied it.

Cheers,

Jeff


Re: some longer, larger pgbench tests with various performance-related patches

From
Robert Haas
Date:
On Sat, Feb 4, 2012 at 2:13 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> We really need to nail that down.  Could you post the scripts (on the
> wiki) you use for running the benchmark and making the graph?  I'd
> like to see how much work it would be for me to change it to detect
> checkpoints and do something like color the markers blue during
> checkpoints and red elsewhen.

They're pretty crude - I've attached them here.

> Also, I'm not sure how bad that graph really is.  The overall
> throughput is more variable, and there are latency spikes, but
> they are few.  The dominant feature is simply that the long-term
> average is less than the initial burst.  Of course the goal is to have
> a high level of throughput with a smooth latency under sustained
> conditions.  But to expect that that long-sustained smooth level of
> throughput be identical to the "initial burst throughput" sounds like
> more of a fantasy than a goal.

That's probably true, but the drop-off is currently quite extreme.
The fact that disabling full_page_writes causes throughput to increase
by >4x is dismaying, at least to me.

> If we want to accept the lowered
> throughput and work on whatever variability/spikes are there, I think
> a good approach would be to take the long term TPS average, and dial
> the number of clients back until the initial burst TPS matches that
> long term average.  Then see if the spikes still exist over the long
> term using that dialed back number of clients.

Hmm, I might be able to do that.

> I don't think the full-page-writes are leading to WALInsert
> contention, for example, because that would probably lead to smooth
> throughput decline, but not those latency spikes in which those entire
> seconds go by without transactions.

Right.

> I doubt it is leading to general
> IO compaction, as the IO at that point should be pretty much
> sequential (the checkpoint has not yet reached the sync stage, the WAL
> is sequential).  So I bet that that is caused by fsyncs occurring at
> xlog segment switches, and the locking that that entails.

That's definitely possible.

> If I
> recall, we can have a segment which is completely written to OS and in
> the process of being fsynced, and we can have another segment which is
> in some state of partially in wal_buffers and partly written out to OS
> cache, but that we can't start reusing the wal_buffers that were
> already written to OS for that segment (and therefore are
> theoretically available for reuse by the upcoming 3rd segment)  until
> the previous segment's fsync has completed.  So all WALInserts freeze.
>  Or something like that.  This code has changed a bit since last time
> I studied it.

Yeah, I need to better-characterize where the pauses are coming from,
but I'm reluctant to invest too much effort in it until Heikki's xlog
scaling patch goes in, because I think that's going to change things
enough that any work done now will mostly be wasted.

It might be worth trying a run with wal_buffers=32MB or something like
that, just to see whether that mitigates any of the locking pile-ups.
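
That would be a one-line configuration change; the value is just the
one floated above:

    # postgresql.conf
    wal_buffers = 32MB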

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: some longer, larger pgbench tests with various performance-related patches

From
Greg Smith
Date:
On 01/24/2012 03:53 PM, Robert Haas wrote:
> There are two graphs for each branch.  The first is a scatter plot of
> latency vs. transaction time.  I found that graph hard to understand,
> though; I couldn't really tell what I was looking at.  So I made a
> second set of graphs which graph number of completed transactions in a
> given second of the test against time.

Note that you're now reinventing parts of pgbench-tools; the main two 
graphs it gives are the latency scatter plot and TPS per second.  The 
things you're likely to find interesting next are maximum latency, 90th 
percentile latency, and a delta for what changed in pg_stat_bgwriter 
during the test; those are the other things I track in that program.
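
For example, maximum and 90th-percentile latency can be pulled from the
per-transaction logs with something like this sketch (assuming latency
in microseconds in field 3):

    awk '{print $3}' pgbench_log* | sort -n |
        awk '{ a[NR] = $1 }
             END { print "max:", a[NR], "us; 90th pct:", a[int(NR*0.9)], "us" }'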

I'm working toward publishing my own tests of the performance patches 
still considered useful by the end of the week.  Murphy's Law has been
active on that project since it started though--the server crashed the
day I left on a week-long trip, and I've been sick ever since getting back.

> First, some of
> these transactions had really long latency.  Second, there are a
> remarkable number of seconds throughout the test during which no
> transactions at all manage to complete, sometimes several seconds in a
> row.

These periods have in my tests always been associated with Linux turning 
aggressive about cleaning out its write cache, either due to an fsync
request or simply crossing one of its thresholds for doing so.  My
current record is an 80 second pause with no transactions completing.

One of the things I expect to add to pgbench-tools within the next week is
tracking how much dirty memory is accumulating during each test.  Seeing 
that graph overlaid on top of the rest makes a lot of what's happening 
at any time more obvious.  Noting when the checkpoints happen is a bit 
less interesting, because once the first one happens, they happen almost 
continuously.
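
A sketch of the kind of sampling loop that could collect that alongside
a run (the interval and file name are arbitrary):

    while sleep 5; do
        echo "$(date +%s) $(awk '/^Dirty:/ {print $2}' /proc/meminfo) kB"
    done >> dirty_memory.log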

You really need to track when the write and sync phases are happening 
for that to be really useful.  This circles back to why I proposed 
exposing those timing bits in pg_stat_bgwriter.  pgbench-tools already 
grabs data from it, which avoids all the mess around log file parsing.  
If I could do that more often and extract checkpoint timing from that 
data, it would make labelling graphs like these much easier to do, even
from the client that's running the benchmark.

> Third, all of the tests initially start off
> processing transactions very quickly, and get slammed down very hard,
> probably because the very high rate of transaction processing early on
> causes a checkpoint to occur around 200 s.

At the beginning of a write-heavy pgbench run, rate is high until one of 
these two things happen:

1) A checkpoint begins
2) Linux's write cache threshold (typically 
/proc/sys/vm/dirty_background_ratio) worth of dirty memory accumulates.

Note that (1) on its own isn't necessarily the problem; it's sometimes
the case that it just makes (2) happen much faster.

Basically, the first 30 to 150 seconds of any write-heavy test always 
have an inflated speed.  You're writing into the OS cache at maximum 
speed, and none of those writes are making it to physical disk--except 
perhaps for the WAL, which is all fast and sequential.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: some longer, larger pgbench tests with various performance-related patches

From
Jeff Janes
Date:
On Mon, Feb 6, 2012 at 6:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Feb 4, 2012 at 2:13 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> We really need to nail that down.  Could you post the scripts (on the
>> wiki) you use for running the benchmark and making the graph?  I'd
>> like to see how much work it would be for me to change it to detect
>> checkpoints and do something like color the markers blue during
>> checkpoints and red elsewhen.
>
> They're pretty crude - I've attached them here.

Thanks.  But given Greg's comments about pgbench-tools, I'll probably
work on learning that instead.

>
>> Also, I'm not sure how bad that graph really is.  The overall
>> throughput is more variable, and there are latency spikes, but
>> they are few.  The dominant feature is simply that the long-term
>> average is less than the initial burst.  Of course the goal is to have
>> a high level of throughput with a smooth latency under sustained
>> conditions.  But to expect that that long-sustained smooth level of
>> throughput be identical to the "initial burst throughput" sounds like
>> more of a fantasy than a goal.
>
> That's probably true, but the drop-off is currently quite extreme.
> The fact that disabling full_page_writes causes throughput to increase
> by >4x is dismaying, at least to me.

I just meant that the latest graph, with full page writes off, is not that bad.

The ones with fpw on are definitely bad, more due to the latency
spikes than the throughput.


>
>> If we want to accept the lowered
>> throughput and work on whatever variability/spikes are there, I think
>> a good approach would be to take the long term TPS average, and dial
>> the number of clients back until the initial burst TPS matches that
>> long term average.  Then see if the spikes still exist over the long
>> term using that dialed back number of clients.
>
> Hmm, I might be able to do that.
>
>> I don't think the full-page-writes are leading to WALInsert
>> contention, for example, because that would probably lead to smooth
>> throughput decline, but not those latency spikes in which those entire
>> seconds go by without transactions.
>
> Right.
>
>> I doubt it is leading to general
>> IO compaction, as the IO at that point should be pretty much
>> sequential (the checkpoint has not yet reached the sync stage, the WAL
>> is sequential).  So I bet that that is caused by fsyncs occurring at
>> xlog segment switches, and the locking that that entails.
>
> That's definitely possible.
>
>> If I
>> recall, we can have a segment which is completely written to OS and in
>> the process of being fsynced, and we can have another segment which is
>> in some state of partially in wal_buffers and partly written out to OS
>> cache, but that we can't start reusing the wal_buffers that were
>> already written to OS for that segment (and therefore are
>> theoretically available for reuse by the upcoming 3rd segment)  until
>> the previous segment's fsync has completed.  So all WALInserts freeze.
>>  Or something like that.  This code has changed a bit since last time
>> I studied it.
>
> Yeah, I need to better-characterize where the pauses are coming from,
> but I'm reluctant to invest too much effort in until Heikki's xlog
> scaling patch goes in, because I think that's going to change things
> enough that any work done now will mostly be wasted.

OK, you've scared me off from digging into the locking at WAL switch
for now.  So instead I've spent some time today trying to unbreak the
xlog scaling patch, and haven't had any luck.  Does anyone know if any
combination of that patch's history + git master history has been
tested and verified to produce a recoverable WAL stream?  It is a
shame that "make check" doesn't test for that, but I don't know how to
make it do so.

Cheers,

Jeff