Thread: Doc patch making firm recommendation for setting the value of commit_delay
Doc patch making firm recommendation for setting the value of commit_delay
From
Peter Geoghegan
Date:
Some of you will be aware that I've tried to formulate a good general recommendation for setting the value of commit_delay, in light of the improvements made in 9.3. I previously asked for help finding a methodology for optimising commit_delay [1]. The documentation related to commit_delay still says that we don't know what might work well, though I don't think that's true any more.

I found the time to do some benchmarks of my own - Greg Smith made available a server that he frequently uses for benchmarks. It was previously my observation that "half of raw single-page sync time as reported by pg_test_fsync for your wal_sync_method" worked best for commit_delay. I went so far as to modify pg_test_fsync to output average time per op in microseconds for each operation with commit_delay in mind, which is a patch that has already been committed [2]. This general approach worked really well on my laptop, which has a slowish fsync of about 8 milliseconds (8,000 microseconds), for which a commit_delay setting of 4,000 (microseconds) seemed to clearly work best, on both insert and tpc-b benchmarks [3].

This server has an Intel 320 series SSD. Greg has already written quite a bit about this drive [4] for unrelated reasons. For the SSD, results of an insert-based pgbench-tools run are shown, with commit_delay at 0, 20 and 59 (determined by following my general advice above):

http://commit-delay-results-ssd-insert.staticloud.com

I also looked at a 3-disk RAID0 of 7200RPM drives connected through a 512MB battery-backed write cache (Areca controller), again using insert.sql:

http://commit-delay-stripe-insert.staticloud.com

In addition to a tpc-b benchmark with the same data directory on the same 3-disk stripe:

http://commit-delay-results-stripe-tpcb.staticloud.com

I used the same postgresql.conf in all cases, which you can see for each report, and did the usual median-of-three thing in all cases (though each run lasted 120 seconds, not the default 60 seconds). Settings used for one particular pgbench run can be viewed here (though they're all exactly the same anyway):

http://commit-delay-results-ssd-insert.staticloud.com/19/pg_settings.txt

Attached is a doc-patch that makes recommendations that are consistent with my observations about what works best here. I'd like to see us making *some* recommendation - for sympathetic cases, setting commit_delay appropriately can make a very large difference to transaction throughput. Such sympathetic cases - many small write transactions - are something that tends to be seen relatively frequently with web applications, which disproportionately use cloud hosting. It isn't at all uncommon for these cases to be highly bound by their commit rate, and so it is compelling to try to amortize the cost of a flush as effectively as possible there. It would be unfortunate if no one was even aware that commit_delay is now useful for these cases, since the setting allows cloud hosting providers to help these cases quite a bit, without having to do something like compromise durability, which in general isn't acceptable.

The point of all these benchmarks isn't to show how effective commit_delay now is, or can be - we already had that discussion months ago, before the alteration to its behaviour was committed. The point is to put my proposed doc changes in the appropriate context, so that I can build confidence that they're balanced and helpful, by showing cases that are not so sympathetic.

Thoughts?
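For concreteness, the procedure described above boils down to something like this (a sketch only; the run length, the reported figure, and the resulting settings are illustrative, not prescriptive):

    # Report average time per sync op, in microseconds, for each method
    pg_test_fsync --secs-per-test=5

    # If the configured wal_sync_method reports ~8000 usecs/op for a single
    # 8kB write, try half of that in postgresql.conf:
    #
    #   commit_delay = 4000        # microseconds
    #   commit_siblings = 5        # default; higher values suit fast storage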
[1] http://archives.postgresql.org/pgsql-performance/2012-08/msg00003.php
[2] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=82e429794b348cd80c1d1b011e21ffac98bc6e88
[3] http://pgeoghegan.blogspot.com/2012/06/towards-14000-write-transactions-on-my.html
[4] http://blog.2ndquadrant.com/intel_ssds_lifetime_and_the_32/

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachment
Re: Doc patch making firm recommendation for setting the value of commit_delay
From
"David Rowley"
Date:
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Peter Geoghegan
> Sent: 15 November 2012 09:44
> To: PG Hackers
> Subject: [HACKERS] Doc patch making firm recommendation for setting the value of commit_delay
>
> Some of you will be aware that I've tried to formulate a good general
> recommendation for setting the value of commit_delay, in light of the
> improvements made in 9.3. I previously asked for help finding a methodology
> for optimising commit_delay [1]. The documentation related to
> commit_delay still says that we don't know what might work well, though I
> don't think that's true any more.
>
> I found the time to do some benchmarks of my own - Greg Smith made
> available a server that he frequently uses for benchmarks. It was previously
> my observation that "half of raw single-page sync time as reported by
> pg_test_fsync for your wal_sync_method" worked best for commit_delay. I
> went so far as to modify pg_test_fsync to output average time per op in
> microseconds for each operation with commit_delay in mind, which is a patch
> that has already been committed [2]. This general approach worked really
> well on my laptop, which has a slowish fsync of about 8 milliseconds (8,000
> microseconds), for which a commit_delay setting of 4,000 (microseconds)
> seemed to clearly work best, on both insert and tpc-b benchmarks [3].

<snip>

> Thoughts?

I think for sure, since the GUC kept its original name, that the docs need to be updated to let people know the background behaviour has changed and may now be far more useful.

I've read through the patch. The only thing I see out of place is a small typo:

! values of <varname>commit_siblings</varname> should be used is such cases,

Should probably read:

! values of <varname>commit_siblings</varname> should be used *in* such cases,

Thanks for doing all the benchmarks too. Good to see such a range of different hardware tested.

Regards

David Rowley
Re: Doc patch making firm recommendation for setting the value of commit_delay
From
"Albe Laurenz"
Date:
Peter Geoghegan wrote:
> Some of you will be aware that I've tried to formulate a good general
> recommendation for setting the value of commit_delay, in light of the
> improvements made in 9.3. I previously asked for help finding a
> methodology for optimising commit_delay [1]. The documentation related
> to commit_delay still says that we don't know what might work well,
> though I don't think that's true any more.
>
> I found the time to do some benchmarks of my own - Greg Smith made
> available a server that he frequently uses for benchmarks. It was
> previously my observation that "half of raw single-page sync time as
> reported by pg_test_fsync for your wal_sync_method" worked best for
> commit_delay. I went so far as to modify pg_test_fsync to output
> average time per op in microseconds for each operation with
> commit_delay in mind, which is a patch that has already been committed
> [2].

[...]

> Attached is a doc-patch that makes recommendations that are consistent
> with my observations about what works best here. I'd like to see us
> making *some* recommendation - for sympathetic cases, setting
> commit_delay appropriately can make a very large difference to
> transaction throughput. Such sympathetic cases - many small write
> transactions - are something that tends to be seen relatively
> frequently with web applications, which disproportionately use cloud
> hosting. It isn't at all uncommon for these cases to be highly bound
> by their commit rate, and so it is compelling to try to amortize the
> cost of a flush as effectively as possible there. It would be
> unfortunate if no one was even aware that commit_delay is now useful
> for these cases, since the setting allows cloud hosting providers to
> help these cases quite a bit, without having to do something like
> compromise durability, which in general isn't acceptable.
>
> The point of all these benchmarks isn't to show how effective
> commit_delay now is, or can be - we already had that discussion months
> ago, before the alteration to its behaviour was committed. The point
> is to put my proposed doc changes in the appropriate context, so that
> I can build confidence that they're balanced and helpful, by showing
> cases that are not so sympathetic.
>
> Thoughts?

If there is an agreement that half the sync time as reported by pg_test_fsync is a good value, would it make sense to have initdb test sync time and preset commit_delay?

Yours,
Laurenz Albe
Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Greg Smith
Date:
On 11/15/12 12:19 AM, Albe Laurenz wrote:
> If there is an agreement that half the sync time as reported by
> pg_test_fsync is a good value, would it make sense to have initdb test
> sync time and preset commit_delay?

Peter has validated this against a good range of systems, but it would be optimistic to say it's a universal idea.

My main concern with this would be the relatively common practice of moving the pg_xlog directory after initdb time. Sometimes people don't know about the option to have initdb move it. Sometimes the drive that will hold pg_xlog isn't yet available when the database is originally created. And the camp I fall into (which admittedly is the group who can take care of this on their own) will move pg_xlog manually and symlink it (sketched below), because that's what we're used to.

I would rather see this just turn into one of the things a more general tuning tool knew how to do, executing against a fully set-up system. Having a useful implementation of commit_delay and useful docs on it seems like enough of a jump forward for one release. Moving fully into auto-tuning before getting more field feedback on how that works out is pretty aggressive.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
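The relocate-and-symlink procedure mentioned above typically looks something like this (a sketch; the paths are illustrative, and the cluster must be shut down first):

    # Move pg_xlog to a dedicated device after initdb (illustrative paths)
    pg_ctl -D /var/lib/pgsql/data stop
    mv /var/lib/pgsql/data/pg_xlog /mnt/fastdisk/pg_xlog
    ln -s /mnt/fastdisk/pg_xlog /var/lib/pgsql/data/pg_xlog
    pg_ctl -D /var/lib/pgsql/data start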
Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Craig Ringer
Date:
On 11/15/2012 04:56 PM, Greg Smith wrote:
>
> I would rather see this just turn into one of the things a more
> general tuning tool knew how to do, executing against a fully set-up
> system. Having a useful implementation of commit_delay and useful docs
> on it seems like enough of a jump forward for one release. Moving
> fully into auto-tuning before getting more field feedback on how that
> works out is pretty aggressive.

It'll also potentially make it harder to get reproducible results in benchmarking and testing across repeated runs, cause confusion when someone relocates a DB or changes hardware, and slow down initdb (and thus testing).

I'd be all for making it part of a "test my hardware and tune my DB" tool, but not such a fan of doing it at initdb time. Making initdb less predictable and more complicated sounds like asking for trouble.

--
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Magnus Hagander
Date:
On Thu, Nov 15, 2012 at 9:56 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 11/15/12 12:19 AM, Albe Laurenz wrote:
>>
>> If there is an agreement that half the sync time as reported by
>> pg_test_fsync is a good value, would it make sense to have initdb test
>> sync time and preset commit_delay?
>
> Peter has validated this against a good range of systems, but it would be
> optimistic to say it's a universal idea.
>
> My main concern with this would be the relatively common practice of moving
> the pg_xlog directory after initdb time. Sometimes people don't know about
> the option to have initdb move it. Sometimes the drive that will hold pg_xlog
> isn't yet available when the database is originally created. And the camp I
> fall into (which admittedly is the group who can take care of this on their
> own) will move pg_xlog manually and symlink it on their own, because that's
> what we're used to.

An even more common use case for this, I think, is "I installed from a package that ran initdb for me". I still think manual moving of pg_xlog is a lot more common than using the initdb option in general.

> I would rather see this just turn into one of the things a more general
> tuning tool knew how to do, executing against a fully set-up system. Having a
> useful implementation of commit_delay and useful docs on it seems like
> enough of a jump forward for one release. Moving fully into auto-tuning
> before getting more field feedback on how that works out is pretty
> aggressive.

+1.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Peter Geoghegan
Date:
On 15 November 2012 10:04, Magnus Hagander <magnus@hagander.net> wrote:
>> I would rather see this just turn into one of the things a more general
>> tuning tool knew how to do, executing against a fully set-up system. Having a
>> useful implementation of commit_delay and useful docs on it seems like
>> enough of a jump forward for one release. Moving fully into auto-tuning
>> before getting more field feedback on how that works out is pretty
>> aggressive.
>
> +1.

I am inclined to agree. I did attempt to use the instrumentation macros to have commit_delay set adaptively at runtime, which would have at least addressed this concern, but that just didn't work as well as this.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Peter Geoghegan
Date:
On 15 November 2012 08:56, Greg Smith <greg@2ndquadrant.com> wrote:
> My main concern with this would be the relatively common practice of moving
> the pg_xlog directory after initdb time.

I probably should have increased the default number of seconds that pg_test_fsync runs for, in light of the fact that it can make an appreciable difference when following this method. I'd suggest that a value of 5 be used there. I think we should just change that, since everyone will just use the default anyway (or, if anything, the more cautious ones will use a higher value than the default).

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Stephen Frost
Date:
* Peter Geoghegan (peter@2ndquadrant.com) wrote:
> I am inclined to agree. I did attempt to use the instrumentation
> macros to have commit_delay set adaptively at runtime, which would
> have at least addressed this concern, but that just didn't work as
> well as this.

Setting it at run-time was actually my first thought on this. I'm disappointed to hear that it didn't work out well.

Thanks,

Stephen
Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Peter Geoghegan
Date:
On 16 November 2012 01:20, Stephen Frost <sfrost@snowman.net> wrote:
> Setting it at run-time was actually my first thought on this. I'm
> disappointed to hear that it didn't work out well.

I actually did quite a lot of research on it; it was interesting, but ultimately didn't lead to any breakthroughs. There were a couple of research papers written on this subject in the late 80s and early 90s that are cited in the transaction processing literature relatively frequently, but the models involved aren't terribly applicable to modern hardware.

This is the main paper:

http://www.hpl.hp.com/techreports/tandem/TR-88.1.pdf

This later 1991 paper reads kind of like a critique of the 1988 paper:

http://www.linux-mips.org/pub/linux/mips/people/macro/DEC/DTJ/DTJ107/DTJ107PF.PDF

There doesn't appear to be much of anything in the public domain after this. I suppose that I could have spent more time pursuing this, but I very much had the sense of pursuing diminishing returns. If the whole idea of making the thing adaptive is to make it more accessible to users, well, they'll still have to turn it on themselves (presumably it will never be on by default), which is only slightly less effort than following my advice here.

I suspect my tools - in particular, pgbench - just aren't sophisticated enough to evaluate the effectiveness of any particular model. I might have used something like Tsung, which models workloads stochastically, instead. However, I just didn't have the time.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Peter Geoghegan
Date:
I've added this to the open commitfest. I think that a formal review is probably required here.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Noah Misch
Date:
On Wed, Nov 14, 2012 at 08:44:26PM +0000, Peter Geoghegan wrote:
> http://commit-delay-results-ssd-insert.staticloud.com
> http://commit-delay-stripe-insert.staticloud.com
> http://commit-delay-results-stripe-tpcb.staticloud.com
> http://commit-delay-results-ssd-insert.staticloud.com/19/pg_settings.txt

staticloud.com seems to be gone. Would you repost these?
Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Peter Geoghegan
Date:
On 19 January 2013 20:38, Noah Misch <noah@leadboat.com> wrote:
> staticloud.com seems to be gone. Would you repost these?

I've pushed these to a git repo, hosted on github:

https://github.com/petergeoghegan/commit_delay_benchmarks

I'm sorry that I didn't take the time to make the html benchmarks easily viewable within a browser on this occasion.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Noah Misch
Date:
On Mon, Jan 21, 2013 at 12:23:21AM +0000, Peter Geoghegan wrote:
> On 19 January 2013 20:38, Noah Misch <noah@leadboat.com> wrote:
> > staticloud.com seems to be gone. Would you repost these?
>
> I've pushed these to a git repo, hosted on github:
>
> https://github.com/petergeoghegan/commit_delay_benchmarks
>
> I'm sorry that I didn't take the time to make the html benchmarks
> easily viewable within a browser on this occasion.

That's plenty convenient; thanks.

What filesystem did you use for testing? Would you also provide /proc/cpuinfo or a rough description of the system's CPUs?
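For what it's worth, the rough description being asked for here can be captured with one-liners along these lines (Linux-specific; the data directory path is an assumption):

    # CPU model and logical core count
    grep -m1 'model name' /proc/cpuinfo
    grep -c '^processor' /proc/cpuinfo
    # Filesystem type backing the data directory
    df -T /var/lib/pgsql/data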
Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Peter Geoghegan
Date:
On 21 January 2013 13:10, Noah Misch <noah@leadboat.com> wrote:
> What filesystem did you use for testing? Would you also provide /proc/cpuinfo
> or a rough description of the system's CPUs?

Unfortunately, I don't have access to that server at the moment; it's under Greg Smith's control. I believe you yourself had an account on this server at one point. I do know that the CPU is an Intel Core i7-870:

http://ark.intel.com/products/41315/Intel-Core-i7-870-Processor-8M-Cache-2_93-GHz

I am pretty sure that the filesystem the various block devices were mounted with was ext4 (without LVM), but it might have been XFS; I don't recall. The operating system was Debian Lenny.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Noah Misch
Date:
Hi Peter,

I took a look at this patch and the benchmarks you've furnished:

https://github.com/petergeoghegan/commit_delay_benchmarks
http://pgeoghegan.blogspot.com/2012/06/towards-14000-write-transactions-on-my.html

On Wed, Nov 14, 2012 at 08:44:26PM +0000, Peter Geoghegan wrote:
> Attached is a doc-patch that makes recommendations that are consistent
> with my observations about what works best here. I'd like to see us
> making *some* recommendation - for sympathetic cases, setting
> commit_delay appropriately can make a very large difference to
> transaction throughput. Such sympathetic cases - many small write
> transactions - are something that tends to be seen relatively
> frequently with web applications, which disproportionately use cloud
> hosting. It isn't at all uncommon for these cases to be highly bound
> by their commit rate, and so it is compelling to try to amortize the
> cost of a flush as effectively as possible there. It would be
> unfortunate if no one was even aware that commit_delay is now useful
> for these cases, since the setting allows cloud hosting providers to
> help these cases quite a bit, without having to do something like
> compromise durability, which in general isn't acceptable.

Your fast-fsync (SSD, BBWC) benchmarks show a small loss up to 8 clients and a 10-20% improvement at 32 clients. That's on a 4-core/8-thread CPU, assuming HT was left enabled. Your slow-fsync (laptop) benchmarks show a 40-100% improvement in the 16-64 client range.

I did a few more benchmarks along the spectrum. First, I used a Mac, also 4-core/8-thread, with fsync_writethrough; half of fsync time gave commit_delay = 35000. I used pgbench, scale factor 100, 4-minute runs, three trials each:

-clients-  -tps@commit_delay=0-  -tps@commit_delay=35000-
 8          51,55,63              82,84,86
 16         98,100,107            130,134,143
 32         137,148,157           192,200,201
 64         199,201,214           249,256,258

So that's a nice 27-53% improvement, fairly similar to the pattern for your laptop pgbench numbers.

Next, based on your comment about the possible value for cloud-hosted applications, I tried a cc2.8xlarge (16 core, 32 thread), GNU/Linux EC2 instance with a data directory on a standard EBS volume, ext4 filesystem. Several 15s pg_test_fsync runs could not agree on an fsync time; I saw results from 694us to 1904us. Ultimately I settled on trying commit_delay=500, scale factor 300:

-clients-  -tps@commit_delay=0-  -tps@commit_delay=500-
 32         1224,1391,1584        1175,1229,1394
 64         1553,1647,1673        1544,1546,1632
 128        1717,1833,1900        1621,1720,1951
 256        1664,1717,1918        1734,1832,1918

The numbers are all over the place, but there's more loss than gain. Amit Kapila also measured small losses in tps@-c16:

http://www.postgresql.org/message-id/000701cd6ff0$013a6210$03af2630$@kapila@huawei.com

I was curious about the cost of the MinimumActiveBackends() call when relying on commit_siblings to skip the delay. I ran a similar test with an extra 500 idle backends, clients=8, commit_siblings=20 (so the delay would never be used), and either a zero or nonzero commit_delay. There was no appreciable performance advantage from setting commit_delay=0 as opposed to relying on commit_siblings to suppress the delay. That's good news.

On the GNU/Linux VM, pg_sleep() achieves precision on the order of 10us. However, the sleep was consistently around 70us longer than requested. A 300us request yielded a 370us sleep, and a 3000us request gave a 3080us sleep. Mac OS X was similarly precise for short sleeps, but it could oversleep a full 1000us on a 35000us sleep.
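For anyone wanting to reproduce runs like those above, the pgbench invocations were presumably along these lines (a sketch; the thread count and database name are assumptions, not taken from the message):

    # Initialize at scale factor 100, then run 4-minute tpc-b style tests,
    # repeating three trials per (clients, commit_delay) combination
    pgbench -i -s 100 bench
    pgbench -c 32 -j 4 -T 240 bench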
> diff doc/src/sgml/wal.sgml
> index fc5c3b2..92619dd
> *** a/doc/src/sgml/wal.sgml
> --- b/doc/src/sgml/wal.sgml
> ***************
> *** 375,382 ****
>   just before a synchronous commit attempts to flush
>   <acronym>WAL</acronym> to disk, in the hope that a single flush
>   executed by one such transaction can also serve other transactions
> ! committing at about the same time. Setting <varname>commit_delay</varname>
> ! can only help when there are many concurrently committing transactions.
>   </para>
>
>   </sect1>
> --- 375,397 ----

The beginning of this paragraph still says "commit_delay causes a delay just before a synchronous commit attempts to flush WAL to disk". Since it now applies to every WAL flush, that should be updated.

> ***************
> *** 560,570 ****
>   is not enabled, or if fewer than <xref linkend="guc-commit-siblings">
>   other sessions are currently in active transactions; this avoids
>   sleeping when it's unlikely that any other session will commit soon.
> ! Note that on most platforms, the resolution of a sleep request is
>   ten milliseconds, so that any nonzero <varname>commit_delay</varname>
>   setting between 1 and 10000 microseconds would have the same effect.
> ! Good values for these parameters are not yet clear; experimentation
> ! is encouraged.
>   </para>
>
>   <para>
> --- 575,607 ----

There's a similar problem at the beginning of this paragraph; it says specifically, "The commit_delay parameter defines for how many microseconds the server process will sleep after writing a commit record to the log with LogInsert but before performing a LogFlush."

>   is not enabled, or if fewer than <xref linkend="guc-commit-siblings">
>   other sessions are currently in active transactions; this avoids
>   sleeping when it's unlikely that any other session will commit soon.
> ! Note that on some platforms, the resolution of a sleep request is
>   ten milliseconds, so that any nonzero <varname>commit_delay</varname>
>   setting between 1 and 10000 microseconds would have the same effect.
> ! </para>

As a side note, if we're ever going to recommend a fire-and-forget method for setting commit_delay, it may be worth detecting whether the host sleep granularity is limited like this. Setting commit_delay = 20 for your SSD and silently getting commit_delay = 10000 would make for an unpleasant surprise.

> !
> ! <para>
> ! Since the purpose of <varname>commit_delay</varname> is to allow
> ! the cost of each flush operation to be more effectively amortized
> ! across concurrently committing transactions (potentially at the
> ! expense of transaction latency), it is necessary to quantify that
> ! cost when altering the setting. The higher that cost is, the more
> ! effective <varname>commit_delay</varname> is expected to be in
> ! increasing transaction throughput.

That's true for spinning disks, but I suspect it does not hold for storage with internal parallelism, notably virtualized storage. Consider an iSCSI configuration with high bandwidth and high latency. When network latency is the limiting factor, will sending larger requests less often still help?

> ! The <xref linkend="pgtestfsync"> module can be used to measure the
> ! average time in microseconds that a single WAL flush operation
> ! takes. A value of half of the average time the module reports it
> ! takes to flush after a single 8kB write operation is often the most
> ! effective setting for <varname>commit_delay</varname>. The benefit
> ! of tuning <varname>commit_delay</varname> can even be pronounced on
> ! storage media with very fast sync times, such as solid-state drives
> ! or RAID arrays with a battery-backed write cache. However, higher
> ! values of <varname>commit_siblings</varname> should be used is such

Typo: "is such" -> "in such"

> ! cases, whereas smaller <varname>commit_siblings</varname> values
> ! can be helpful on higher latency media. Note that it is quite
> ! possible that a setting of <varname>commit_delay</varname> that is
> ! too high can increase transaction latency by so much that total
> ! transaction throughput suffers.
>   </para>

One would be foolish to run a performance-sensitive workload like those in question, including the choice to have synchronous_commit=on, on spinning disks with no battery-backed write cache. A cloud environment is more credible, but my benchmark showed no gain there. Overall, I still won't personally recommend changing commit_delay without measuring the performance change for one's particular workload and storage environment. commit_delay can now bring some impressive gains in the right situations, but I doubt those are common enough for a fire-and-forget setting to do more good than harm.

I suggest having the documentation recommend half of the fsync time as a starting point for benchmarking different commit_delay settings against your own workload. Indicate that it's more likely to help for direct use of spinning disks than for BBWC/solid state/virtualized storage. Not sure what else can be credibly given as general advice for PostgreSQL DBAs.

Thanks,
nm
Re: Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Peter Geoghegan
Date:
Hi Noah,

On 27 January 2013 02:31, Noah Misch <noah@leadboat.com> wrote:
> I did a few more benchmarks along the spectrum.
>
> So that's a nice 27-53% improvement, fairly similar to the pattern for your
> laptop pgbench numbers.

I presume that this applies to a tpc-b benchmark (the pgbench default). Note that the really compelling numbers that I reported in that blog post (where there is an increase of over 80% in transaction throughput at lower client counts) occur with an insert-based benchmark (i.e. a maximally commit-bound workload).

> Next, based on your comment about the possible value
> for cloud-hosted applications
>
> -clients-  -tps@commit_delay=0-  -tps@commit_delay=500-
>  32         1224,1391,1584        1175,1229,1394
>  64         1553,1647,1673        1544,1546,1632
>  128        1717,1833,1900        1621,1720,1951
>  256        1664,1717,1918        1734,1832,1918
>
> The numbers are all over the place, but there's more loss than gain.

I suspected that the latency of cloud storage might be relatively poor. Since that is evidently not actually the case with Amazon EBS, it makes sense that commit_delay isn't compelling there. I am not disputing whether or not Amazon EBS should be considered representative of such systems in general - I'm sure that it should be.

> There was no appreciable
> performance advantage from setting commit_delay=0 as opposed to relying on
> commit_siblings to suppress the delay. That's good news.

Thank you for doing that research; I had myself investigated that the fastpath in MinimumActiveBackends() works well, but it's useful to have my findings verified.

> On the GNU/Linux VM, pg_sleep() achieves precision on the order of 10us.
> However, the sleep was consistently around 70us longer than requested. A
> 300us request yielded a 370us sleep, and a 3000us request gave a 3080us sleep.
> Mac OS X was similarly precise for short sleeps, but it could oversleep a full
> 1000us on a 35000us sleep.

Ugh.

> The beginning of this paragraph still says "commit_delay causes a delay just
> before a synchronous commit attempts to flush WAL to disk". Since it now
> applies to every WAL flush, that should be updated.

Agreed.

> There's a similar problem at the beginning of this paragraph; it says
> specifically, "The commit_delay parameter defines for how many microseconds
> the server process will sleep after writing a commit record to the log with
> LogInsert but before performing a LogFlush."

Right.

> As a side note, if we're ever going to recommend a fire-and-forget method for
> setting commit_delay, it may be worth detecting whether the host sleep
> granularity is limited like this. Setting commit_delay = 20 for your SSD and
> silently getting commit_delay = 10000 would make for an unpleasant surprise.

Yes, it would. Note on possible oversleeping added.

>> ! <para>
>> ! Since the purpose of <varname>commit_delay</varname> is to allow
>> ! the cost of each flush operation to be more effectively amortized
>> ! across concurrently committing transactions (potentially at the
>> ! expense of transaction latency), it is necessary to quantify that
>> ! cost when altering the setting. The higher that cost is, the more
>> ! effective <varname>commit_delay</varname> is expected to be in
>> ! increasing transaction throughput.
>
> That's true for spinning disks, but I suspect it does not hold for storage
> with internal parallelism, notably virtualized storage. Consider an iSCSI
> configuration with high bandwidth and high latency. When network latency is
> the limiting factor, will sending larger requests less often still help?

Well, I don't like to speculate about things like that, because it's just too easy to be wrong. That said, it doesn't immediately occur to me why the statement that you've highlighted wouldn't be true of virtualised storage that has the characteristics you describe. Any kind of latency at flush time means that clients idle, which means that the CPU is potentially not kept fully busy for a greater amount of wall time, where it might otherwise be kept more busy.

> One would be foolish to run a performance-sensitive workload like those in
> question, including the choice to have synchronous_commit=on, on spinning
> disks with no battery-backed write cache. A cloud environment is more
> credible, but my benchmark showed no gain there.

In an everyday sense you are correct. It would typically be fairly senseless to run an application that was severely limited by transaction throughput like this, when a battery-backed cache could be used at the cost of a couple of hundred dollars. However, it's quite possible to imagine a scenario in which the economics favoured using commit_delay instead. For example, I am aware that at Facebook, a similar Facebook-flavoured-MySQL setting (sync_binlog_timeout_usecs) is used. Furthermore, it might not be obvious that fsync speed is an issue in practice. Setting commit_delay to 4,000 has seemingly no downside on my laptop - it *positively* affects both average and worst-case transaction latency - so with spinning disks, it probably would actually be sensible to set it and forget it, regardless of workload.

When Robert committed this feature, he added an additional check when WALWriteLock is acquired, which could see the lock acquired in a way that turned out to be needless, but also prevented a flush that was technically needless from the group commit leader/lock holder backend's own selfish perspective. I never got around to satisfying myself that that change helped more than it hurt, if in fact it had any measurable impact either way. Perhaps I should. The benchmark that appears on my blog was actually produced with the slightly different, original version.

> Overall, I still won't
> personally recommend changing commit_delay without measuring the performance
> change for one's particular workload and storage environment. commit_delay
> can now bring some impressive gains in the right situations, but I doubt those
> are common enough for a fire-and-forget setting to do more good than harm.

I agree.

> I suggest having the documentation recommend half of the fsync time as a
> starting point for benchmarking different commit_delay settings against your
> own workload. Indicate that it's more likely to help for direct use of
> spinning disks than for BBWC/solid state/virtualized storage. Not sure what
> else can be credibly given as general advice for PostgreSQL DBAs.

That all seems reasonable. The really important thing is that we don't state that we don't have a clue what helps - that inspires no confidence, could turn someone off what would be a really useful feature for them, and just isn't accurate. I also think it's important that we don't say "Setting commit_delay can only help when there are many concurrently committing transactions", because roughly the opposite is true. With many connections, there are already enough committing transactions to effectively amortize the cost of a flush, and commit_delay is then only very slightly helpful. Lower client counts are where commit_delay actually helps (at least with slow fsync times, which are the compelling case).

I attach a revision that I think addresses your concerns. I've polished it a bit further too - in particular, my elaborations about commit_delay have been concentrated at the end of wal.sgml, where they belong. I've also removed the reference to XLogInsert, because, since all XLogFlush call sites are now covered by commit_delay, XLogInsert isn't particularly relevant.

I have also increased the default time that pg_test_fsync runs - I think that the kind of variability commonly seen in its output, which you yourself have reported, justifies doing so in passing.

--
Regards,
Peter Geoghegan
Attachment
Re: Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Noah Misch
Date:
On Mon, Jan 28, 2013 at 12:16:24AM +0000, Peter Geoghegan wrote:
> On 27 January 2013 02:31, Noah Misch <noah@leadboat.com> wrote:
> > I did a few more benchmarks along the spectrum.
> >
> > So that's a nice 27-53% improvement, fairly similar to the pattern for your
> > laptop pgbench numbers.
>
> I presume that this applies to a tpc-b benchmark (the pgbench
> default). Note that the really compelling numbers that I reported in
> that blog post (where there is an increase of over 80% in transaction
> throughput at lower client counts) occur with an insert-based
> benchmark (i.e. a maximally commit-bound workload).

Correct. The pgbench default workload is already rather friendly toward commit_delay, so I wanted to stay away from even-friendlier tests.

Would you commit to the same git repository the pgbench-tools data for the graphs appearing in that blog post? I couldn't readily tell what was happening below 16 clients due to the graphed data points blending together.

> >> ! <para>
> >> ! Since the purpose of <varname>commit_delay</varname> is to allow
> >> ! the cost of each flush operation to be more effectively amortized
> >> ! across concurrently committing transactions (potentially at the
> >> ! expense of transaction latency), it is necessary to quantify that
> >> ! cost when altering the setting. The higher that cost is, the more
> >> ! effective <varname>commit_delay</varname> is expected to be in
> >> ! increasing transaction throughput.
> >
> > That's true for spinning disks, but I suspect it does not hold for storage
> > with internal parallelism, notably virtualized storage. Consider an iSCSI
> > configuration with high bandwidth and high latency. When network latency is
> > the limiting factor, will sending larger requests less often still help?
>
> Well, I don't like to speculate about things like that, because it's
> just too easy to be wrong. That said, it doesn't immediately occur to
> me why the statement that you've highlighted wouldn't be true of
> virtualised storage that has the characteristics you describe. Any
> kind of latency at flush time means that clients idle, which means
> that the CPU is potentially not kept fully busy for a greater amount
> of wall time, where it might otherwise be kept more busy.

On further reflection, I retract the comment. Regardless of internal parallelism of the storage, PostgreSQL issues WAL fsyncs serially.

> > One would be foolish to run a performance-sensitive workload like those in
> > question, including the choice to have synchronous_commit=on, on spinning
> > disks with no battery-backed write cache. A cloud environment is more
> > credible, but my benchmark showed no gain there.
>
> In an everyday sense you are correct. It would typically be fairly
> senseless to run an application that was severely limited by
> transaction throughput like this, when a battery-backed cache could be
> used at the cost of a couple of hundred dollars. However, it's quite
> possible to imagine a scenario in which the economics favoured using
> commit_delay instead. For example, I am aware that at Facebook, a
> similar Facebook-flavoured-MySQL setting (sync_binlog_timeout_usecs)
> is used. Furthermore, it might not be obvious that fsync speed is an
> issue in practice. Setting commit_delay to 4,000 has seemingly no
> downside on my laptop - it *positively* affects both average and
> worst-case transaction latency - so with spinning disks, it probably
> would actually be sensible to set it and forget it, regardless of
> workload.

I agree that commit_delay is looking like a safe bet for spinning disks.

> I attach a revision that I think addresses your concerns. I've
> polished it a bit further too - in particular, my elaborations about
> commit_delay have been concentrated at the end of wal.sgml, where they
> belong. I've also removed the reference to XLogInsert, because, since
> all XLogFlush call sites are now covered by commit_delay, XLogInsert
> isn't particularly relevant.

I'm happy with this formulation.

> I have also increased the default time that pg_test_fsync runs - I
> think that the kind of variability commonly seen in its output, which
> you yourself have reported, justifies doing so in passing.

On the EBS configuration with volatile fsync timings, the variability didn't go away with 15s runs. On systems with stable fsync times, 15s was no better than 2s. Absent some particular reason to believe 5s is better than 2s, I would leave it alone.

I'm marking this patch Ready for Committer, qualified with a recommendation to adopt only the wal.sgml changes.

Thanks,
nm
Re: Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Peter Geoghegan
Date:
On 28 January 2013 03:34, Noah Misch <noah@leadboat.com> wrote:
> On the EBS configuration with volatile fsync timings, the variability didn't
> go away with 15s runs. On systems with stable fsync times, 15s was no better
> than 2s. Absent some particular reason to believe 5s is better than 2s, I
> would leave it alone.

I'm not recommending doing so because I thought you'd be likely to get better numbers on EBS; obviously the variability you saw there likely had a lot to do with the fact that the underlying physical machines have multiple tenants. It has just been my observation that more consistent figures can be obtained (on my laptop) by using a pg_test_fsync --secs-per-test of about 5. That being the case, why take the chance with 2 seconds? It isn't as if people run pg_test_fsync every day, or that they cannot set --secs-per-test to whatever they like themselves. On the other hand, the cost of setting it too low could be quite high now, because the absolute values (and not just how different wal_sync_methods compare) are now important.

--
Regards,
Peter Geoghegan
Re: Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Peter Geoghegan
Date:
On 28 January 2013 03:34, Noah Misch <noah@leadboat.com> wrote:
> Would you commit to the same git repository the pgbench-tools data for the
> graphs appearing in that blog post? I couldn't readily tell what was
> happening below 16 clients due to the graphed data points blending together.

I'm afraid that I no longer have that data. Of course, I could fairly easily recreate it, but I don't think I'll have time tomorrow. Is it important? Are you interested in both the insert and tpc-b cases?

--
Regards,
Peter Geoghegan
Re: Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Noah Misch
Date:
On Mon, Jan 28, 2013 at 04:48:56AM +0000, Peter Geoghegan wrote:
> On 28 January 2013 03:34, Noah Misch <noah@leadboat.com> wrote:
> > Would you commit to the same git repository the pgbench-tools data for the
> > graphs appearing in that blog post? I couldn't readily tell what was
> > happening below 16 clients due to the graphed data points blending together.
>
> I'm afraid that I no longer have that data. Of course, I could fairly
> easily recreate it, but I don't think I'll have time tomorrow. Is it
> important? Are you interested in both the insert and tpc-b cases?

No need, then.
Re: Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Noah Misch
Date:
On Mon, Jan 28, 2013 at 04:29:12AM +0000, Peter Geoghegan wrote:
> On 28 January 2013 03:34, Noah Misch <noah@leadboat.com> wrote:
> > On the EBS configuration with volatile fsync timings, the variability didn't
> > go away with 15s runs. On systems with stable fsync times, 15s was no better
> > than 2s. Absent some particular reason to believe 5s is better than 2s, I
> > would leave it alone.
>
> I'm not recommending doing so because I thought you'd be likely to get
> better numbers on EBS; obviously the variability you saw there likely
> had a lot to do with the fact that the underlying physical machines
> have multiple tenants. It has just been my observation that more
> consistent figures can be obtained (on my laptop) by using a
> pg_test_fsync --secs-per-test of about 5. That being the case, why
> take the chance with 2 seconds?

I can't get too excited about it either way.

> It isn't as if people run
> pg_test_fsync every day, or that they cannot set --secs-per-test to
> whatever they like themselves. On the other hand, the cost of setting
> it too low could be quite high now, because the absolute values (and
> not just how different wal_sync_methods compare) are now important.

True. You'd actually want to run the tool with a short interval to select a wal_sync_method, then test the chosen method for a longer period to get an accurate reading for commit_delay.
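That two-step approach might look like the following (run lengths are illustrative):

    # Step 1: short runs suffice to rank the wal_sync_method candidates
    pg_test_fsync --secs-per-test=2
    # Step 2: a longer run gives a more stable absolute figure for the chosen
    # method; commit_delay then starts at roughly half the reported usecs/op
    pg_test_fsync --secs-per-test=15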
Re: Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> I'm marking this patch Ready for Committer, qualified with a recommendation to
> adopt only the wal.sgml changes.

I've committed this along with some further wordsmithing. I kept Peter's change to pg_test_fsync's default -s value; I've always felt that 2 seconds was laughably small. It might be all right for very quick-and-dirty tests, but as a default value, it seems like a poor choice, because it's at the very bottom of the credible range of choices.

regards, tom lane
Re: Re: Doc patch making firm recommendation for setting the value of commit_delay
From
Bruce Momjian
Date:
On Fri, Mar 15, 2013 at 05:47:30PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > I'm marking this patch Ready for Committer, qualified with a recommendation to
> > adopt only the wal.sgml changes.
>
> I've committed this along with some further wordsmithing. I kept
> Peter's change to pg_test_fsync's default -s value; I've always felt
> that 2 seconds was laughably small. It might be all right for very
> quick-and-dirty tests, but as a default value, it seems like a poor
> choice, because it's at the very bottom of the credible range of
> choices.

Agreed, 2 seconds was at the bottom. The old behavior was very slow, so I went low. Now that we are using it, 5 secs makes sense.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +