Thread: Perf Benchmarking and regression.

Perf Benchmarking and regression.

From
Mithun Cy
Date:
I tried to do some benchmarking on postgres master head
commit 72a98a639574d2e25ed94652848555900c81a799
Author: Andres Freund <andres@anarazel.de>
Date:   Tue Apr 26 20:32:51 2016 -0700

CASE : Read-Write Tests when data exceeds shared buffers.

Non Default settings and test
./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &

./pgbench -i -s 1000 postgres

./pgbench -c $threads -j $threads -T 1800 -M prepared postgres
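A minimal sketch of a driver loop for these runs (the actual script was not posted; the result file names here are made up):

for threads in 1 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128
do
    # one 30-minute prepared read-write run per client count
    ./pgbench -c $threads -j $threads -T 1800 -M prepared postgres \
        > rw_scale1000_${threads}clients.txt
done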


Machine : "cthulhu" 8 node numa machine with 128 hyper threads.
>numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 65 66 67 68 69 70 71 96 97 98 99 100 101 102 103
node 0 size: 65498 MB
node 0 free: 37885 MB
node 1 cpus: 72 73 74 75 76 77 78 79 104 105 106 107 108 109 110 111
node 1 size: 65536 MB
node 1 free: 31215 MB
node 2 cpus: 80 81 82 83 84 85 86 87 112 113 114 115 116 117 118 119
node 2 size: 65536 MB
node 2 free: 15331 MB
node 3 cpus: 88 89 90 91 92 93 94 95 120 121 122 123 124 125 126 127
node 3 size: 65536 MB
node 3 free: 36774 MB
node 4 cpus: 1 2 3 4 5 6 7 8 33 34 35 36 37 38 39 40
node 4 size: 65536 MB
node 4 free: 62 MB
node 5 cpus: 9 10 11 12 13 14 15 16 41 42 43 44 45 46 47 48
node 5 size: 65536 MB
node 5 free: 9653 MB
node 6 cpus: 17 18 19 20 21 22 23 24 49 50 51 52 53 54 55 56
node 6 size: 65536 MB
node 6 free: 50209 MB
node 7 cpus: 25 26 27 28 29 30 31 32 57 58 59 60 61 62 63 64
node 7 size: 65536 MB
node 7 free: 43966 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  21  21  21  21  21  21  21
  1:  21  10  21  21  21  21  21  21
  2:  21  21  10  21  21  21  21  21
  3:  21  21  21  10  21  21  21  21
  4:  21  21  21  21  10  21  21  21
  5:  21  21  21  21  21  10  21  21
  6:  21  21  21  21  21  21  10  21
  7:  21  21  21  21  21  21  21  10


I see some regression when compared to 9.5

Sessions   PostgreSQL-9.5 tps (scale 1000)   PostgreSQL-9.6 tps (scale 1000)   %diff
1          747.367249                        892.149891                        19.3723557185
8          5281.282799                       4941.905008                       -6.4260484416
16         9000.915419                       8695.396233                       -3.3943123758
24         11852.839627                      10843.328776                      -8.5170379653
32         14323.048334                      11977.505153                      -16.3760054864
40         16098.926583                      12195.447024                      -24.2468312336
48         16959.646965                      12639.951087                      -25.4704351271
56         17157.737762                      12543.212929                      -26.894715941
64         17201.914922                      12628.002422                      -26.5895542487
72         16956.994835                      11280.870599                      -33.4736448954
80         16775.954896                      11348.830603                      -32.3506132834
88         16609.137558                      10823.465121                      -34.834273705
96         16510.099404                      11091.757753                      -32.8183466278
104        16275.724927                      10665.743275                      -34.4683980416
112        16141.815128                      10977.84664                       -31.9912503461
120        15904.086614                      10716.17755                       -32.6199749153
128        15738.391503                      10962.333439                      -30.3465450271
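For reference, the %diff column above is simply the relative change of 9.6 over 9.5; for example, for 128 clients:

# %diff = (tps_9.6 - tps_9.5) / tps_9.5 * 100
awk 'BEGIN { printf "%.2f\n", (10962.333439 - 15738.391503) / 15738.391503 * 100 }'
# prints -30.35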


When I ran git bisect on master (this was for 128 clients), two commit IDs affected performance:

1. # first bad commit: [ac1d7945f866b1928c2554c0f80fd52d7f977772] Make idle backends exit if the postmaster dies.

This made performance drop from 15947.21546 (~15K) to 13409.758510 (~13K).

2. # first bad commit: [428b1d6b29ca599c5700d4bc4f4ce4c5880369bf] Allow to trigger kernel writeback after a configurable number of writes.

This made performance drop further to 10962.333439 (~10K).

I think it did not recover afterwards.

--
Thanks and Regards
Mithun C Y

Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
Hi,

Thanks for benchmarking!

On 2016-05-06 19:43:52 +0530, Mithun Cy wrote:
> 1. # first bad commit: [ac1d7945f866b1928c2554c0f80fd52d7f977772] Make idle
> backends exit if the postmaster dies.
> this made performance to drop from
> 
> 15947.21546 (15K +) to 13409.758510 (arround 13K+).

Let's debug this one first, it's a lot more local.  I'm rather surprised
that you're seeing a big effect with that "few" TPS/socket operations;
and even more that our efforts to address that problem haven't been
fruitful (given we've verified the fix on a number of machines).

Can you verify that removing
    AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL, NULL);
in src/backend/libpq/pqcomm.c : pq_init() restores performance?

I think it'd be best to test the back/forth on master with
bgwriter_flush_after = 0
checkpointer_flush_after = 0
backend_flush_after = 0
to isolate the issue.
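For example, a sketch of that configuration appended to the server command line used at the top of the thread (note the committed GUC name is checkpoint_flush_after):

./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c max_wal_size=20GB \
    -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 \
    -c bgwriter_flush_after=0 -c checkpoint_flush_after=0 -c backend_flush_after=0 &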

Also, do you see read-only workloads to be affected too?

> 2. # first bad commit: [428b1d6b29ca599c5700d4bc4f4ce4c5880369bf] Allow to
> trigger kernel writeback after a configurable number of writes.

FWIW, it'd be very interesting to test again with a bigger
backend_flush_after setting.


Greetings,

Andres Freund



Re: Perf Benchmarking and regression.

From
Mithun Cy
Date:

On Fri, May 6, 2016 at 8:35 PM, Andres Freund <andres@anarazel.de> wrote:
> Also, do you see read-only workloads to be affected too?
Thanks. I have not yet tested with the specific commit id reported above for the performance issue, but at HEAD commit 72a98a639574d2e25ed94652848555900c81a799
Author: Andres Freund <andres@anarazel.de>
Date:   Tue Apr 26 20:32:51 2016 -0700

For READ-ONLY (prepared) tests (both when data fits into shared buffers and when it exceeds shared_buffers=8GB), the performance of master has improved over 9.5:

Sessions   PostgreSQL-9.5 tps (scale 300)   PostgreSQL-9.6 tps (scale 300)   %diff
1          5287.561594                      5213.723197                      -1.396454598
8          84265.389083                     84871.305689                     0.719057507
16         148330.4155                      158661.128315                    6.9646624936
24         207062.803697                    219958.12974                     6.2277366155
32         265145.089888                    290190.501443                    9.4459269699
40         311688.752973                    340000.551772                    9.0833559212
48         327169.9673                      372408.073033                    13.8270960829
56         274426.530496                    390629.24948                     42.3438356248
64         261777.692042                    384613.9666                      46.9238893505
72         210747.55937                     376390.162022                    78.5976374517
80         220192.818648                    398128.779329                    80.8091570713
88         185176.91888                     423906.711882                    128.9198429512
96         161579.719039                    421541.656474                    160.8877271115
104        146935.568434                    450672.740567                    206.7145316618
112        136605.466232                    432047.309248                    216.2738074582
120        127687.175016                    455458.086889                    256.6983816753
128        120413.936453                    428127.879242                    255.5467845776
 
Sessions   PostgreSQL-9.5 tps (scale 1000)   PostgreSQL-9.6 tps (scale 1000)   %diff
1          5103.812202                       5155.434808                       1.01145191
8          47741.9041                        53117.805096                      11.2603405694
16         89722.57031                       86965.10079                       -3.0733287182
24         130914.537373                     153849.634245                     17.5191367836
32         197125.725706                     212454.474264                     7.7761279017
40         248489.551052                     270304.093767                     8.7788571482
48         291884.652232                     317257.836746                     8.6928806705
56         304526.216047                     359676.785476                     18.1102862489
64         301440.463174                     388324.710185                     28.8230206709
72         194239.941979                     393676.628802                     102.6754254511
80         144879.527847                     383365.678053                     164.6099719885
88         122894.325326                     372905.436117                     203.4358463076
96         109836.31148                      362208.867756                     229.7715144249
104        103791.981583                     352330.402278                     239.4582094921
112        105189.206682                     345722.499429                     228.6672752217
120        108095.811432                     342597.969088                     216.939171416
128        113242.59492                      333821.98763                      194.7848270925

Even for READ-WRITE, when data fits into shared buffers (scale_factor=300 and shared_buffers=8GB), performance has improved.
The only case where I see some regression is when data exceeds shared_buffers (scale_factor=1000 and shared_buffers=8GB).

I will try to run the tests as you have suggested and will report the same.
 

Thanks and Regards
Mithun C Y

Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
Hi,

On 2016-05-06 21:21:11 +0530, Mithun Cy wrote:
> I will try to run the tests as you have suggested and will report the same.

Any news on that front?

Regards,

Andres



Re: Perf Benchmarking and regression.

From
Ashutosh Sharma
Date:
Hi Andres,

I am extremely sorry for the delayed response.  As suggested by you, I have taken the performance readings at 128 client counts after making the following two changes:

1). Removed AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL, NULL); from pq_init(). Below is the git diff for the same.

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 8d6eb0b..399d54b 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -206,7 +206,9 @@ pq_init(void)
        AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
                                          NULL, NULL);
        AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
+#if 0
        AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL, NULL);
+#endif

2). Disabled the GUC variables "bgwriter_flush_after", "checkpointer_flush_after" and "backend_flush_after" by setting them to zero.

After doing the above two changes, below are the readings I got for 128 client counts:

CASE : Read-Write Tests when data exceeds shared buffers.

Non Default settings and test
./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &

./pgbench -i -s 1000 postgres

./pgbench -c 128 -j 128 -T 1800 -M prepared postgres

Run1 : tps = 9690.678225
Run2 : tps = 9904.320645
Run3 : tps = 9943.547176

Please let me know if I need to take readings with other client counts as well.

Note: I have taken these readings on the postgres master head at:

commit 91fd1df4aad2141859310564b498a3e28055ee28
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Sun May 8 16:53:55 2016 -0400

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Wed, May 11, 2016 at 3:53 AM, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2016-05-06 21:21:11 +0530, Mithun Cy wrote:
> I will try to run the tests as you have suggested and will report the same.

Any news on that front?

Regards,

Andres


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Perf Benchmarking and regression.

From
Robert Haas
Date:
On Wed, May 11, 2016 at 12:51 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> I am extremely sorry for the delayed response.  As suggested by you, I have
> taken the performance readings at 128 client counts after making the
> following two changes:
>
> 1). Removed AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL,
> NULL); from pq_init(). Below is the git diff for the same.
>
> diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
> index 8d6eb0b..399d54b 100644
> --- a/src/backend/libpq/pqcomm.c
> +++ b/src/backend/libpq/pqcomm.c
> @@ -206,7 +206,9 @@ pq_init(void)
>         AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE,
> MyProcPort->sock,
>                                           NULL, NULL);
>         AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
> +#if 0
>         AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL, NULL);
> +#endif
>
> 2). Disabled the guc vars "bgwriter_flush_after", "checkpointer_flush_after"
> and "backend_flush_after" by setting them to zero.
>
> After doing the above two changes below are the readings i got for 128
> client counts:
>
> CASE : Read-Write Tests when data exceeds shared buffers.
>
> Non Default settings and test
> ./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c
> max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c
> checkpoint_completion_target=0.9 &
>
> ./pgbench -i -s 1000 postgres
>
> ./pgbench -c 128 -j 128 -T 1800 -M prepared postgres
>
> Run1 : tps = 9690.678225
> Run2 : tps = 9904.320645
> Run3 : tps = 9943.547176
>
> Please let me know if i need to take readings with other client counts as
> well.

Can you please take four new sets of readings, like this:

- Unpatched master, default *_flush_after
- Unpatched master, *_flush_after=0
- That line removed with #if 0, default *_flush_after
- That line removed with #if 0, *_flush_after=0

128 clients is fine.  But I want to see four sets of numbers that were
all taken by the same person at the same time using the same script.
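A minimal sketch of such a script (paths, data directories and file names are made up; it assumes two builds, one unpatched and one with the AddWaitEventToSet() call under #if 0):

for build in master master_no_death_event; do
    for flush in default zero; do
        opts=""
        if [ "$flush" = "zero" ]; then
            opts="-c bgwriter_flush_after=0 -c checkpoint_flush_after=0 -c backend_flush_after=0 -c wal_writer_flush_after=0"
        fi
        # start the server for this combination, run the benchmark, stop it
        /opt/$build/bin/pg_ctl -D /data/$build -o "-c shared_buffers=8GB $opts" -w start
        /opt/$build/bin/pgbench -c 128 -j 128 -T 1800 -M prepared postgres \
            > result_${build}_${flush}.txt
        /opt/$build/bin/pg_ctl -D /data/$build stop
    done
done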

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Perf Benchmarking and regression.

From
Ashutosh Sharma
Date:
Hi,

Please find the test results for the following set of combinations taken at 128 client counts:

1) Unpatched master, default *_flush_after :  TPS = 10925.882396

2) Unpatched master, *_flush_after=0 :  TPS = 18613.343529     

3) That line removed with #if 0, default *_flush_after :  TPS = 9856.809278

4) That line removed with #if 0, *_flush_after=0 :  TPS = 18158.648023

Here, That line points to "AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL, NULL); in pq_init()."

Please note that earlier I had taken readings with the data directory and pg_xlog directory at the same location on HDD. But this time I have changed the location of pg_xlog to SSD and taken the readings. With pg_xlog and the data directory at the same location on HDD I was seeing much lower performance; for example, for the "That line removed with #if 0, *_flush_after=0" case I was getting 7367.709378 tps.


Also, the commit id on which I have taken the above readings, along with the pgbench commands used, is mentioned below:

commit 8a13d5e6d1bb9ff9460c72992657077e57e30c32
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Wed May 11 17:06:53 2016 -0400

    Fix infer_arbiter_indexes() to not barf on system columns.

Non Default settings and test:
./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &

./pgbench -i -s 1000 postgres

./pgbench -c 128 -j 128 -T 1800 -M prepared postgres

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Thu, May 12, 2016 at 9:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, May 11, 2016 at 12:51 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> I am extremely sorry for the delayed response.  As suggested by you, I have
> taken the performance readings at 128 client counts after making the
> following two changes:
>
> 1). Removed AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL,
> NULL); from pq_init(). Below is the git diff for the same.
>
> diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
> index 8d6eb0b..399d54b 100644
> --- a/src/backend/libpq/pqcomm.c
> +++ b/src/backend/libpq/pqcomm.c
> @@ -206,7 +206,9 @@ pq_init(void)
>         AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE,
> MyProcPort->sock,
>                                           NULL, NULL);
>         AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
> +#if 0
>         AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL, NULL);
> +#endif
>
> 2). Disabled the guc vars "bgwriter_flush_after", "checkpointer_flush_after"
> and "backend_flush_after" by setting them to zero.
>
> After doing the above two changes below are the readings i got for 128
> client counts:
>
> CASE : Read-Write Tests when data exceeds shared buffers.
>
> Non Default settings and test
> ./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c
> max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c
> checkpoint_completion_target=0.9 &
>
> ./pgbench -i -s 1000 postgres
>
> ./pgbench -c 128 -j 128 -T 1800 -M prepared postgres
>
> Run1 : tps = 9690.678225
> Run2 : tps = 9904.320645
> Run3 : tps = 9943.547176
>
> Please let me know if i need to take readings with other client counts as
> well.

Can you please take four new sets of readings, like this:

- Unpatched master, default *_flush_after
- Unpatched master, *_flush_after=0
- That line removed with #if 0, default *_flush_after
- That line removed with #if 0, *_flush_after=0

128 clients is fine.  But I want to see four sets of numbers that were
all taken by the same person at the same time using the same script.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Perf Benchmarking and regression.

From
Robert Haas
Date:
On Thu, May 12, 2016 at 8:39 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Please find the test results for the following set of combinations taken at
> 128 client counts:
>
> 1) Unpatched master, default *_flush_after :  TPS = 10925.882396
>
> 2) Unpatched master, *_flush_after=0 :  TPS = 18613.343529
>
> 3) That line removed with #if 0, default *_flush_after :  TPS = 9856.809278
>
> 4) That line removed with #if 0, *_flush_after=0 :  TPS = 18158.648023

I'm getting increasingly unhappy about the checkpoint flush control.
I saw major regressions on my parallel COPY test, too:

http://www.postgresql.org/message-id/CA+TgmoYoUQf9cGcpgyGNgZQHcY-gCcKRyAqQtDU8KFE4N6HVkA@mail.gmail.com

That was a completely different machine (POWER7 instead of Intel,
lousy disks instead of good ones) and a completely different workload.
Considering these results, I think there's now plenty of evidence to
suggest that this feature is going to be horrible for a large number
of users.  A 45% regression on pgbench is horrible.  (Nobody wants to
take even a 1% hit for snapshot too old, right?)  Sure, it might not
be that way for every user on every Linux system, and I'm sure it
performed well on the systems where Andres benchmarked it, or he
wouldn't have committed it.  But our goal can't be to run well only on
the newest hardware with the least-buggy kernel...

> Here, That line points to "AddWaitEventToSet(FeBeWaitSet,
> WL_POSTMASTER_DEATH, -1, NULL, NULL); in pq_init()."

Given the above results, it's not clear whether that is making things
better or worse.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
Hi,

On 2016-05-12 18:09:07 +0530, Ashutosh Sharma wrote:
> Please find the test results for the following set of combinations taken at
> 128 client counts:

Thanks.


> 1) Unpatched master, default *_flush_after :  TPS = 10925.882396

Could you run this one with a number of different backend_flush_after
settings?  I'm suspecting the primary issue is that the default is too low.

Greetings,

Andres Freund



Re: Perf Benchmarking and regression.

From
Robert Haas
Date:
On Thu, May 12, 2016 at 11:13 AM, Andres Freund <andres@anarazel.de> wrote:
> Could you run this one with a number of different backend_flush_after
> settings?  I'm suspecting the primary issue is that the default is too low.

What values do you think would be good to test?  Maybe provide 3 or 4
suggested values to try?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-05-12 11:27:31 -0400, Robert Haas wrote:
> On Thu, May 12, 2016 at 11:13 AM, Andres Freund <andres@anarazel.de> wrote:
> > Could you run this one with a number of different backend_flush_after
> > settings?  I'm suspecting the primary issue is that the default is too low.
> 
> What values do you think would be good to test?  Maybe provide 3 or 4
> suggested values to try?

0 (disabled), 16 (current default), 32, 64, 128, 256?

I'm suspecting that only backend_flush_after_* has these negative
performance implications at this point.  One path is to increase that
option's default value, another is to disable only backend guided
flushing. And add a strong hint that if you care about predictable
throughput you might want to enable it.
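A sweep over those values could be scripted roughly like this (a sketch; the setting is in 8kB pages, so 16 corresponds to the 128kB default, and each new pgbench run picks up the reloaded value):

for n in 0 16 32 64 128 256; do
    psql -d postgres -c "ALTER SYSTEM SET backend_flush_after = $n"
    psql -d postgres -c "SELECT pg_reload_conf()"
    ./pgbench -c 128 -j 128 -T 1800 -M prepared postgres > rw_bfa_${n}.txt
done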

Greetings,

Andres Freund



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-05-12 10:49:06 -0400, Robert Haas wrote:
> On Thu, May 12, 2016 at 8:39 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > Please find the test results for the following set of combinations taken at
> > 128 client counts:
> >
> > 1) Unpatched master, default *_flush_after :  TPS = 10925.882396
> >
> > 2) Unpatched master, *_flush_after=0 :  TPS = 18613.343529
> >
> > 3) That line removed with #if 0, default *_flush_after :  TPS = 9856.809278
> >
> > 4) That line removed with #if 0, *_flush_after=0 :  TPS = 18158.648023
> 
> I'm getting increasingly unhappy about the checkpoint flush control.
> I saw major regressions on my parallel COPY test, too:

Yes, I'm concerned too.

The workload in this thread is a bit of an "artificial" workload (all
data is constantly updated, doesn't fit into shared_buffers, fits into
the OS page cache), and only measures throughput not latency.  But I
agree that that's way too large a regression to accept, and that there's
a significant number of machines with way undersized shared_buffers
values.


> http://www.postgresql.org/message-id/CA+TgmoYoUQf9cGcpgyGNgZQHcY-gCcKRyAqQtDU8KFE4N6HVkA@mail.gmail.com
> 
> That was a completely different machine (POWER7 instead of Intel,
> lousy disks instead of good ones) and a completely different workload.
> Considering these results, I think there's now plenty of evidence to
> suggest that this feature is going to be horrible for a large number
> of users.  A 45% regression on pgbench is horrible.

I asked you over there whether you could benchmark with just different
values for backend_flush_after... I chose the current value because it
gives the best latency / most consistent throughput numbers, but 128kb
isn't a large window.  I suspect we might need to disable backend guided
flushing if that's not sufficient :(


> > Here, That line points to "AddWaitEventToSet(FeBeWaitSet,
> > WL_POSTMASTER_DEATH, -1, NULL, NULL); in pq_init()."
> 
> Given the above results, it's not clear whether that is making things
> better or worse.

Yea, me neither. I think it's doubtful that you'd see a performance
difference due to the original ac1d7945f866b1928c2554c0f80fd52d7f977772,
independent of the WaitEventSet stuff, at these throughput rates.

Greetings,

Andres Freund



Re: Perf Benchmarking and regression.

From
Fabien COELHO
Date:
>> I'm getting increasingly unhappy about the checkpoint flush control.
>> I saw major regressions on my parallel COPY test, too:
>
> Yes, I'm concerned too.

A few thoughts:
 - focussing on raw tps is not a good idea, because it may be a lot of tps
   followed by a sync panic, with an unresponsive database. I wish the
   performance reports would include some indication of the distribution
   (eg min/q1/median/q3/max tps per second seen, standard deviation), not
   just the final "tps" figure.

 - checkpoint flush control (checkpoint_flush_after) should be mostly
   always beneficial because it flushes sorted data. I would be surprised
   to see significant regressions with this on. A lot of tests showed
   maybe improved tps, but mostly greatly improved performance stability,
   where a database that was unresponsive 60% of the time (60% of the
   per-second tps samples show very low or zero tps) becomes always
   responsive.

 - other flush controls ({backend,bgwriter}_flush_after) may just increase
   random writes, so are more risky in nature because the data is not
   sorted, and it may or may not be a good idea depending on detailed
   conditions. A "parallel copy" would be just such a special IO load
   which degrades performance under these settings.

   Maybe these two should be disabled by default because they lead to
   possibly surprising regressions?

 - for any particular load, the admin can decide to disable these if
   they think it is better not to flush. Also, as suggested by Andres,
   with 128 parallel queries the default value may not be appropriate
   at all.

-- 
Fabien.



Re: Perf Benchmarking and regression.

From
Ashutosh Sharma
Date:
<div dir="ltr">Hi,<br /><br />Following are the performance results for read write test observed with different numbers
of"<b>backend_flush_after</b>".<br /><br />1) backend_flush_after = <b>256kb</b> (32*8kb), tps = <b>10841.178815</b><br
/>2)backend_flush_after = <b>512kb</b> (64*8kb), tps = <b>11098.702707</b><br />3) backend_flush_after = <b>1MB</b>
(128*8kb),tps = <b>11434.964545</b><br />4) backend_flush_after = <b>2MB</b> (256*8kb), tps = <b>13477.089417</b><br
/><br/><br /><b>Note:</b> Above test has been performed on Unpatched master with default values for
checkpoint_flush_after,bgwriter_flush_after<br />and wal_writer_flush_after. <br /><br />With Regards,<br />Ashutosh
Sharma<br/>EnterpriseDB:<u> <a href="http://www.enterprisedb.com">http://www.enterprisedb.com</a></u><br /></div><div
class="gmail_extra"><br/><div class="gmail_quote">On Thu, May 12, 2016 at 9:20 PM, Andres Freund <span dir="ltr"><<a
href="mailto:andres@anarazel.de"target="_blank">andres@anarazel.de</a>></span> wrote:<br /><blockquote
class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 2016-05-12
11:27:31-0400, Robert Haas wrote:<br /> > On Thu, May 12, 2016 at 11:13 AM, Andres Freund <<a
href="mailto:andres@anarazel.de">andres@anarazel.de</a>>wrote:<br /> > > Could you run this one with a number
ofdifferent backend_flush_after<br /> > > settings?  I'm suspsecting the primary issue is that the default is too
low.<br/> ><br /> > What values do you think would be good to test?  Maybe provide 3 or 4<br /> > suggested
valuesto try?<br /><br /></span>0 (disabled), 16 (current default), 32, 64, 128, 256?<br /><br /> I'm suspecting that
onlybackend_flush_after_* has these negative<br /> performance implications at this point.  One path is to increase
that<br/> option's default value, another is to disable only backend guided<br /> flushing. And add a strong hint that
ifyou care about predictable<br /> throughput you might want to enable it.<br /><br /> Greetings,<br /><br /> Andres
Freund<br/></blockquote></div><br /></div> 

Re: Perf Benchmarking and regression.

From
Robert Haas
Date:
On Fri, May 13, 2016 at 7:08 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Following are the performance results for read write test observed with
> different numbers of "backend_flush_after".
>
> 1) backend_flush_after = 256kb (32*8kb), tps = 10841.178815
> 2) backend_flush_after = 512kb (64*8kb), tps = 11098.702707
> 3) backend_flush_after = 1MB (128*8kb), tps = 11434.964545
> 4) backend_flush_after = 2MB (256*8kb), tps = 13477.089417

So even at 2MB we don't come close to recovering all of the lost
performance.  Can you please test these three scenarios?

1. Default settings for *_flush_after
2. backend_flush_after=0, rest defaults
3. backend_flush_after=0, bgwriter_flush_after=0,
wal_writer_flush_after=0, checkpoint_flush_after=0

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-05-13 10:20:04 -0400, Robert Haas wrote:
> On Fri, May 13, 2016 at 7:08 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > Following are the performance results for read write test observed with
> > different numbers of "backend_flush_after".
> >
> > 1) backend_flush_after = 256kb (32*8kb), tps = 10841.178815
> > 2) backend_flush_after = 512kb (64*8kb), tps = 11098.702707
> > 3) backend_flush_after = 1MB (128*8kb), tps = 11434.964545
> > 4) backend_flush_after = 2MB (256*8kb), tps = 13477.089417
> 
> So even at 2MB we don't come close to recovering all of the lost
> performance.  Can you please test these three scenarios?
>
> 1. Default settings for *_flush_after
> 2. backend_flush_after=0, rest defaults
> 3. backend_flush_after=0, bgwriter_flush_after=0,
> wal_writer_flush_after=0, checkpoint_flush_after=0

4) 1) + a shared_buffers setting appropriate to the workload.


I just want to emphasize what we're discussing here is a bit of an
extreme setup. A workload that's bigger than shared buffers, but smaller
than the OS's cache size; with a noticeable likelihood of rewriting
individual OS page cache pages within 30s.

Greetings,

Andres Freund



Re: Perf Benchmarking and regression.

From
Robert Haas
Date:
On Fri, May 13, 2016 at 1:43 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-05-13 10:20:04 -0400, Robert Haas wrote:
>> On Fri, May 13, 2016 at 7:08 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> > Following are the performance results for read write test observed with
>> > different numbers of "backend_flush_after".
>> >
>> > 1) backend_flush_after = 256kb (32*8kb), tps = 10841.178815
>> > 2) backend_flush_after = 512kb (64*8kb), tps = 11098.702707
>> > 3) backend_flush_after = 1MB (128*8kb), tps = 11434.964545
>> > 4) backend_flush_after = 2MB (256*8kb), tps = 13477.089417
>>
>> So even at 2MB we don't come close to recovering all of the lost
>> performance.  Can you please test these three scenarios?
>>
>> 1. Default settings for *_flush_after
>> 2. backend_flush_after=0, rest defaults
>> 3. backend_flush_after=0, bgwriter_flush_after=0,
>> wal_writer_flush_after=0, checkpoint_flush_after=0
>
> 4) 1) + a shared_buffers setting appropriate to the workload.
>
>
> I just want to emphasize what we're discussing here is a bit of an
> extreme setup. A workload that's bigger than shared buffers, but smaller
> than the OS's cache size; with a noticeable likelihood of rewriting
> individual OS page cache pages within 30s.

You're just describing pgbench with a scale factor too large to fit in
shared_buffers.  I think it's unfair to paint that as some kind of
niche use case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-05-13 14:43:15 -0400, Robert Haas wrote:
> On Fri, May 13, 2016 at 1:43 PM, Andres Freund <andres@anarazel.de> wrote:
> > I just want to emphasize what we're discussing here is a bit of an
> > extreme setup. A workload that's bigger than shared buffers, but smaller
> > than the OS's cache size; with a noticeable likelihood of rewriting
> > individual OS page cache pages within 30s.
>
> You're just describing pgbench with a scale factor too large to fit in
> shared_buffers.

Well, that *and* a scale factor smaller than 20% of the memory
available, *and* a scale factor small enough that re-dirtying of already
written out pages is likely.


> I think it's unfair to paint that as some kind of niche use case.

I'm not saying we don't need to do something about it. Just that it's a
hard tradeoff to make. The massive performance / latency issues we've observed
originate from the kernel caching too much dirty IO. The fix is making it
cache fewer dirty pages.  But there are workloads where the kernel's
buffer cache works as an extension of our page cache.



Re: Perf Benchmarking and regression.

From
Amit Kapila
Date:
On Fri, May 13, 2016 at 11:13 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-05-13 10:20:04 -0400, Robert Haas wrote:
> > On Fri, May 13, 2016 at 7:08 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > Following are the performance results for read write test observed with
> > > different numbers of "backend_flush_after".
> > >
> > > 1) backend_flush_after = 256kb (32*8kb), tps = 10841.178815
> > > 2) backend_flush_after = 512kb (64*8kb), tps = 11098.702707
> > > 3) backend_flush_after = 1MB (128*8kb), tps = 11434.964545
> > > 4) backend_flush_after = 2MB (256*8kb), tps = 13477.089417
> >
> > So even at 2MB we don't come close to recovering all of the lost
> > performance.  Can you please test these three scenarios?
> >
> > 1. Default settings for *_flush_after
> > 2. backend_flush_after=0, rest defaults
> > 3. backend_flush_after=0, bgwriter_flush_after=0,
> > wal_writer_flush_after=0, checkpoint_flush_after=0
>
> 4) 1) + a shared_buffers setting appropriate to the workload.
>

If by the 4th point you mean testing the case when data fits in shared buffers, then Mithun has already reported above [1] that he didn't see any regression for that case.


The relevant line: "Even for READ-WRITE, when data fits into shared buffers (scale_factor=300 and shared_buffers=8GB), performance has improved."


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Perf Benchmarking and regression.

From
Ashutosh Sharma
Date:
<div dir="ltr">Hi,<br /><br />Please find the results for the following 3 scenarios with unpatched master:<br /><br
/>1.Default settings for *_flush_after : TPS = <b>10677.662356</b><br />2. backend_flush_after=0, rest defaults : TPS =
<b>18452.655936</b><br/>3. backend_flush_after=0, bgwriter_flush_after=0,<br />wal_writer_flush_after=0,
checkpoint_flush_after=0: TPS = <b>18614.479962</b><br /><br />With Regards,<br />Ashutosh Sharma<br />EnterpriseDB: <a
href="http://www.enterprisedb.com">http://www.enterprisedb.com</a><br/></div><div class="gmail_extra"><br /><div
class="gmail_quote">OnFri, May 13, 2016 at 7:50 PM, Robert Haas <span dir="ltr"><<a
href="mailto:robertmhaas@gmail.com"target="_blank">robertmhaas@gmail.com</a>></span> wrote:<br /><blockquote
class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Fri, May 13,
2016at 7:08 AM, Ashutosh Sharma <<a href="mailto:ashu.coek88@gmail.com">ashu.coek88@gmail.com</a>> wrote:<br />
>Following are the performance results for read write test observed with<br /> > different numbers of
"backend_flush_after".<br/> ><br /> > 1) backend_flush_after = 256kb (32*8kb), tps = 10841.178815<br /> > 2)
backend_flush_after= 512kb (64*8kb), tps = 11098.702707<br /> > 3) backend_flush_after = 1MB (128*8kb), tps =
11434.964545<br/> > 4) backend_flush_after = 2MB (256*8kb), tps = 13477.089417<br /><br /></span>So even at 2MB we
don'tcome close to recovering all of the lost<br /> performance.  Can you please test these three scenarios?<br /><br
/>1. Default settings for *_flush_after<br /> 2. backend_flush_after=0, rest defaults<br /> 3. backend_flush_after=0,
bgwriter_flush_after=0,<br/> wal_writer_flush_after=0, checkpoint_flush_after=0<br /><div class="HOEnZb"><div
class="h5"><br/> --<br /> Robert Haas<br /> EnterpriseDB: <a href="http://www.enterprisedb.com" rel="noreferrer"
target="_blank">http://www.enterprisedb.com</a><br/> The Enterprise PostgreSQL Company<br
/></div></div></blockquote></div><br/></div> 

Re: Perf Benchmarking and regression.

From
Fabien COELHO
Date:
Hello,

> Please find the results for the following 3 scenarios with unpatched master:
>
> 1. Default settings for *_flush_after : TPS = 10677.662356
> 2. backend_flush_after=0, rest defaults : TPS = 18452.655936
> 3. backend_flush_after=0, bgwriter_flush_after=0,
> wal_writer_flush_after=0, checkpoint_flush_after=0 : TPS = 18614.479962

Thanks for these runs.

These raw tps suggest that {backend,bgwriter}_flush_after should better be
zero for this kind of load. Whether it should be the default is unclear
yet, because as Andres pointed out this is one kind of load.

Note: these options have been added to smooth ios over time and to help 
avoid "io panics" on sync, especially with HDDs without a large BBU cache 
in front. The real benefit is that performance is much more constant
over time, and pg is much more responsive.

If you do other runs, it would be nice to report some stats about tps 
variability (eg latency & latency stddev which should be in the report). 
For experiments I did I used to log "-P 1" output (tps every second) and 
to compute stats on these tps (avg, stddev, min, q1, median, q3, max, pc 
of time with tps below a low threshold...), which provides some indication 
of the overall tps distribution.
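As an illustration of that approach (a sketch only; the 100 tps threshold and the log file name are made up), the per-second tps can be pulled out of the progress lines and summarized like this:

# pgbench ... -P 1 ... 2>&1 | tee run.log
grep '^progress:' run.log | awk '
    { tps = $4; sum += tps; sumsq += tps * tps; n++
      if (tps < 100) low++ }   # count seconds below a "low tps" threshold
    END { avg = sum / n
          printf "avg=%.1f stddev=%.1f pct_below_100tps=%.1f%%\n",
                 avg, sqrt(sumsq / n - avg * avg), 100 * low / n }'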

-- 
Fabien



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-05-14 18:49:27 +0200, Fabien COELHO wrote:
> 
> Hello,
> 
> > Please find the results for the following 3 scenarios with unpatched master:
> > 
> > 1. Default settings for *_flush_after : TPS = 10677.662356
> > 2. backend_flush_after=0, rest defaults : TPS = 18452.655936
> > 3. backend_flush_after=0, bgwriter_flush_after=0,
> > wal_writer_flush_after=0, checkpoint_flush_after=0 : TPS = 18614.479962
> 
> Thanks for these runs.

Yes!

> These raw tps suggest that {backend,bgwriter}_flush_after should better be
> zero for this kind of load. Whether it should be the default is unclear yet,
> because as Andres pointed out this is one kind of load.

FWIW, I don't think {backend,bgwriter} are the same here. It's primarily
backend that matters.  This is treating the os page cache as an
extension of postgres' buffer cache. That really primarily matters for
backend_, because otherwise backends spend time waiting for IO.


Andres



Re: Perf Benchmarking and regression.

From
Fabien COELHO
Date:
>> These raw tps suggest that {backend,bgwriter}_flush_after should better be
>> zero for this kind of load. Whether it should be the default is unclear yet,
>> because as Andres pointed out this is one kind of load.
>
> FWIW, I don't think {backend,bgwriter} are the same here. It's primarily
> backend that matters.

Indeed, I was a little hasty to put bgwriter together based on this 
report.

I'm a little wary of "bgwriter_flush_after" though, I would not be 
surprised if someone reports some regressions, although probably not with 
a pgbench tpcb kind of load.

-- 
Fabien.



Re: Perf Benchmarking and regression.

From
Ashutosh Sharma
Date:
Hi All,

As we have seen a regression of more than 45% with "backend_flush_after" enabled and set to its default value (i.e. 128KB), and even when it is set to some higher value like 2MB, I think we should disable it so that it does not impact read-write performance. Here is the attached patch for the same.  Please have a look and let me know your thoughts on this. Thanks!
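Independently of the attached patch (not reproduced here), a minimal sketch of how an individual installation can already get that effect, assuming superuser access:

psql -d postgres -c "ALTER SYSTEM SET backend_flush_after = 0"
psql -d postgres -c "SELECT pg_reload_conf()"

New sessions then run with backend-guided flushing disabled.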

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Sun, May 15, 2016 at 1:26 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> > > These raw tps suggest that {backend,bgwriter}_flush_after should better be
> > > zero for this kind of load. Whether it should be the default is unclear yet,
> > > because as Andres pointed out this is one kind of load.
> >
> > FWIW, I don't think {backend,bgwriter} are the same here. It's primarily
> > backend that matters.
>
> Indeed, I was a little hasty to put bgwriter together based on this report.
>
> I'm a little wary of "bgwriter_flush_after" though, I would not be surprised if someone reports some regressions, although probably not with a pgbench tpcb kind of load.
>
> --
> Fabien.

Attachment

Re: Perf Benchmarking and regression.

From
Andres Freund
Date:

Hi,

On May 26, 2016 9:29:51 PM PDT, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>Hi All,
>
>As we have seen a regression of more than 45% with "backend_flush_after"
>enabled and set to its default value (i.e. 128KB), and even when it is set
>to some higher value like 2MB, I think we should disable it so that it does
>not impact read-write performance. Here is the attached patch for the same.
>Please have a look and let me know your thoughts on this. Thanks!

I don't think the situation is quite that simple. By *disabling* backend flushing it's also easy to see massive
performance regressions.  In situations where shared buffers was configured appropriately for the workload (not the
case here IIRC).

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Perf Benchmarking and regression.

From
Noah Misch
Date:
On Thu, May 12, 2016 at 10:49:06AM -0400, Robert Haas wrote:
> On Thu, May 12, 2016 at 8:39 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > Please find the test results for the following set of combinations taken at
> > 128 client counts:
> >
> > 1) Unpatched master, default *_flush_after :  TPS = 10925.882396
> >
> > 2) Unpatched master, *_flush_after=0 :  TPS = 18613.343529
> >
> > 3) That line removed with #if 0, default *_flush_after :  TPS = 9856.809278
> >
> > 4) That line removed with #if 0, *_flush_after=0 :  TPS = 18158.648023
> 
> I'm getting increasingly unhappy about the checkpoint flush control.
> I saw major regressions on my parallel COPY test, too:
> 
> http://www.postgresql.org/message-id/CA+TgmoYoUQf9cGcpgyGNgZQHcY-gCcKRyAqQtDU8KFE4N6HVkA@mail.gmail.com
> 
> That was a completely different machine (POWER7 instead of Intel,
> lousy disks instead of good ones) and a completely different workload.
> Considering these results, I think there's now plenty of evidence to
> suggest that this feature is going to be horrible for a large number
> of users.  A 45% regression on pgbench is horrible.  (Nobody wants to
> take even a 1% hit for snapshot too old, right?)  Sure, it might not
> be that way for every user on every Linux system, and I'm sure it
> performed well on the systems where Andres benchmarked it, or he
> wouldn't have committed it.  But our goal can't be to run well only on
> the newest hardware with the least-buggy kernel...

[This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item.  Andres,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
9.6 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within 72 hours of this
message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping 9.6rc1.  Consequently, I will appreciate your
efforts toward speedy resolution.  Thanks.

[1] http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com



Re: Perf Benchmarking and regression.

From
Robert Haas
Date:
On Fri, May 27, 2016 at 12:37 AM, Andres Freund <andres@anarazel.de> wrote:
> I don't think the situation is quite that simple. By *disabling* backend flushing it's also easy to see massive
> performance regressions.  In situations where shared buffers was configured appropriately for the workload (not the
> case here IIRC).

On what kind of workload does setting backend_flush_after=0 represent
a large regression vs. the default settings?

I think we have to consider that pgbench and parallel copy are pretty
common things to want to do, and a non-zero default setting hurts
those workloads a LOT.  I have a really hard time believing that the
benefits on other workloads are large enough to compensate for the
slowdowns we're seeing here.  We have nobody writing in to say that
backend_flush_after>0 is making things way better for them, and
Ashutosh and I have independently hit massive slowdowns on unrelated
workloads.  We weren't looking for slowdowns in this patch.  We were
trying to measure other stuff, and ended up tracing the behavior back
to this patch.  That really, really suggests that other people will
have similar experiences.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-05-31 16:03:46 -0400, Robert Haas wrote:
> On Fri, May 27, 2016 at 12:37 AM, Andres Freund <andres@anarazel.de> wrote:
> > I don't think the situation is quite that simple. By *disabling* backend flushing it's also easy to see massive
> > performance regressions.  In situations where shared buffers was configured appropriately for the workload (not the
> > case here IIRC).
> 
> On what kind of workload does setting backend_flush_after=0 represent
> a large regression vs. the default settings?
> 
> I think we have to consider that pgbench and parallel copy are pretty
> common things to want to do, and a non-zero default setting hurts
> those workloads a LOT.

I don't think pgbench's workload has much to do with reality. Even less
so in the setup presented here.

The slowdown comes from the fact that default pgbench randomly, but
uniformly, updates a large table. Which is slower with
backend_flush_after if the workload is considerably bigger than
shared_buffers, but, and that's a very important restriction, the
workload at the same time largely fits in to less than
/proc/sys/vm/dirty_ratio / 20% (probably even 10% /
/proc/sys/vm/dirty_background_ratio) of the free os memory.  The "trick"
in that case is that very often, before a buffer has been written back
to storage by the OS, it'll be re-dirtied by postgres.  Which means
triggering flushing by postgres increases the total amount of writes.
That only matters if the kernel doesn't trigger writeback because of the
above ratios, or because of time limits (30s /
dirty_writeback_centisecs).
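For anyone reproducing this, the kernel thresholds referred to here can be inspected directly (defaults vary by kernel and distribution):

cat /proc/sys/vm/dirty_ratio               # % of memory at which writers are throttled
cat /proc/sys/vm/dirty_background_ratio    # % at which background writeback starts
cat /proc/sys/vm/dirty_writeback_centisecs # writeback thread wakeup interval
cat /proc/sys/vm/dirty_expire_centisecs    # age after which dirty data must be written (e.g. 3000 = 30s)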


> I have a really hard time believing that the benefits on other
> workloads are large enough to compensate for the slowdowns we're
> seeing here.

As a random example, without looking for good parameters, on my laptop:
pgbench -i -q -s 1000

Cpu: i7-6820HQ
Ram: 24GB of memory
Storage: Samsung SSD 850 PRO 1TB, encrypted
postgres -c shared_buffers=6GB -c backend_flush_after=128 -c max_wal_size=100GB -c fsync=on -c synchronous_commit=off
pgbench -M prepared -c 16 -j 16 -T 520 -P 1 -n -N
(note the -N)
disabled:
latency average = 2.774 ms
latency stddev = 10.388 ms
tps = 5761.883323 (including connections establishing)
tps = 5762.027278 (excluding connections establishing)

128:
latency average = 2.543 ms
latency stddev = 3.554 ms
tps = 6284.069846 (including connections establishing)
tps = 6284.184570 (excluding connections establishing)

Note the latency dev which is 3x better. And the improved throughput.

That's for a workload which even fits into the OS memory. Without
backend flushing there's several periods looking like

progress: 249.0 s, 7237.6 tps, lat 1.997 ms stddev 4.365
progress: 250.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 251.0 s, 1880.6 tps, lat 17.761 ms stddev 169.682
progress: 252.0 s, 6904.4 tps, lat 2.328 ms stddev 3.256

i.e. moments in which no transactions are executed. And that's on
storage that can do 500MB/sec, and tens of thousand IOPs.


If you change to a workload that uses synchronous_commit, is
bigger than OS memory and/or doesn't have very fast storage, the
differences can be a *LOT* bigger.


In general, any workload which doesn't fit a) the above criteria of
likely re-dirtying blocks it already dirtied, before kernel triggered
writeback happens, or b) concurrently COPYs into an individual file, is
likely to be faster (or unchanged if within s_b) with backend flushing.


Which means that transactional workloads that are bigger than the OS
memory, or which have a non-uniform distribution leading to some
locality, are likely to be faster. In practice those are *hugely* more
likely than the uniform distribution that pgbench has.

Similarly, this *considerably* reduces the impact a concurrently running
vacuum or COPY has on concurrent queries. Because suddenly VACUUM/COPY
can't create a couple gigabytes of dirty buffers which will be written
back at some random point in time later, stalling everything.


I think the benefits of a more predictable (and often faster!)
performance in a bunch of actual real-world-ish workloads are higher than
optimizing for benchmarks.



> We have nobody writing in to say that
> backend_flush_after>0 is making things way better for them, and
> Ashutosh and I have independently hit massive slowdowns on unrelated
> workloads.

Actually, we do have some evidence of that; just so far not in this
thread, which I don't find particularly surprising.


- Andres



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-06-01 15:33:18 -0700, Andres Freund wrote:
> Cpu: i7-6820HQ
> Ram: 24GB of memory
> Storage: Samsung SSD 850 PRO 1TB, encrypted
> postgres -c shared_buffers=6GB -c backend_flush_after=128 -c max_wal_size=100GB -c fsync=on -c
synchronous_commit=off
> pgbench -M prepared -c 16 -j 16 -T 520 -P 1 -n -N

Using a scale 5000 database, with WAL compression enabled (otherwise the
whole thing is too slow in both cases), and 64 clients gives:

disabled:
latency average = 11.896 ms
latency stddev = 42.187 ms
tps = 5378.025369 (including connections establishing)
tps = 5378.248569 (excluding connections establishing)

128:
latency average = 11.002 ms
latency stddev = 10.621 ms
tps = 5814.586813 (including connections establishing)
tps = 5814.840249 (excluding connections establishing)


With flushing disabled, roughly every 30s you see:
progress: 150.0 s, 6223.3 tps, lat 10.036 ms stddev 9.521
progress: 151.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 152.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 153.0 s, 4952.9 tps, lat 39.050 ms stddev 249.839

progress: 172.0 s, 4888.0 tps, lat 12.851 ms stddev 11.507
progress: 173.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 174.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 175.0 s, 4636.8 tps, lat 41.421 ms stddev 268.416

progress: 196.0 s, 1119.2 tps, lat 9.618 ms stddev 8.321
progress: 197.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 198.0 s, 1920.9 tps, lat 94.375 ms stddev 429.756
progress: 199.0 s, 5260.8 tps, lat 12.087 ms stddev 11.595


With backend flushing enabled there's not a single such pause.


If you use spinning rust instead of SSDs, the pauses aren't 1-2s
anymore, but easily 10-30s.

Andres



Re: Perf Benchmarking and regression.

From
Noah Misch
Date:
On Wed, Jun 01, 2016 at 03:33:18PM -0700, Andres Freund wrote:
> On 2016-05-31 16:03:46 -0400, Robert Haas wrote:
> > On Fri, May 27, 2016 at 12:37 AM, Andres Freund <andres@anarazel.de> wrote:
> > > I don't think the situation is quite that simple. By *disabling* backend flushing it's also easy to see massive
> > > performance regressions.  In situations where shared buffers was configured appropriately for the workload (not the
> > > case here IIRC).
> > 
> > On what kind of workload does setting backend_flush_after=0 represent
> > a large regression vs. the default settings?
> > 
> > I think we have to consider that pgbench and parallel copy are pretty
> > common things to want to do, and a non-zero default setting hurts
> > those workloads a LOT.
> 
> I don't think pgbench's workload has much to do with reality. Even less
> so in the setup presented here.
> 
> The slowdown comes from the fact that default pgbench randomly, but
> uniformly, updates a large table. Which is slower with
> backend_flush_after if the workload is considerably bigger than
> shared_buffers, but, and that's a very important restriction, the
> workload at the same time largely fits in to less than
> /proc/sys/vm/dirty_ratio / 20% (probably even 10% /
> /proc/sys/vm/dirty_background_ratio) of the free os memory.

Looking at some of the top hits for 'postgresql shared_buffers':

https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
https://www.postgresql.org/docs/current/static/runtime-config-resource.html
http://rhaas.blogspot.com/2012/03/tuning-sharedbuffers-and-walbuffers.html
https://www.keithf4.com/a-large-database-does-not-mean-large-shared_buffers/
http://www.cybertec.at/2014/02/postgresql-9-3-shared-buffers-performance-1/

Choices mentioned (some in comments on a main post):

1. .25 * RAM
2. min(8GB, .25 * RAM)
3. Sizing procedure that arrived at 4GB for 900GB of data
4. Equal to data size

Thus, it is not outlandish to have the write portion of a working set exceed
shared_buffers while remaining under 10-20% of system RAM.  Choice (4) won't
achieve that, but (2) and (3) may achieve it given a mere 64 GiB of RAM.
Choice (1) can go either way; if read-mostly data occupies half of
shared_buffers, then writes passing through the other 12.5% of system RAM may
exhibit the property you describe.

Incidentally, a typical reason for a site to use low shared_buffers is to
avoid the latency spikes that *_flush_after combat:
https://www.postgresql.org/message-id/flat/4DDE2705020000250003DD4F%40gw.wicourts.gov

> > I have a really hard time believing that the benefits on other
> > workloads are large enough to compensate for the slowdowns we're
> > seeing here.
> 
> As a random example, without looking for good parameters, on my laptop:
> pgbench -i -q -s 1000
> 
> Cpu: i7-6820HQ
> Ram: 24GB of memory
> Storage: Samsung SSD 850 PRO 1TB, encrypted
> postgres -c shared_buffers=6GB -c backend_flush_after=128 -c max_wal_size=100GB -c fsync=on -c
synchronous_commit=off
> pgbench -M prepared -c 16 -j 16 -T 520 -P 1 -n -N
> (note the -N)
> disabled:
> latency average = 2.774 ms
> latency stddev = 10.388 ms
> tps = 5761.883323 (including connections establishing)
> tps = 5762.027278 (excluding connections establishing)
> 
> 128:
> latency average = 2.543 ms
> latency stddev = 3.554 ms
> tps = 6284.069846 (including connections establishing)
> tps = 6284.184570 (excluding connections establishing)
> 
> Note the latency dev which is 3x better. And the improved throughput.

That is an improvement.  The workload is no less realistic than the ones
having shown regressions.

> Which means that transactional workloads that are bigger than the OS
> memory, or which have a non-uniform distribution leading to some
> locality, are likely to be faster. In practice those are *hugely* more
> likely than the uniform distribution that pgbench has.

That is formally true; non-benchmark workloads rarely issue uniform writes.
However, enough non-benchmark workloads have too little locality to benefit
from caches.  Those will struggle against *_flush_after like uniform writes
do, so discounting uniform writes wouldn't simplify this project.


Today's defaults for *_flush_after greatly smooth and accelerate performance
for one class of plausible workloads while greatly slowing a different class
of plausible workloads.

nm



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-06-03 01:57:33 -0400, Noah Misch wrote:
> > Which means that transactional workloads that are bigger than the OS
> > memory, or which have a non-uniform distribution leading to some
> > locality, are likely to be faster. In practice those are *hugely* more
> > likely than the uniform distribution that pgbench has.
> 
> That is formally true; non-benchmark workloads rarely issue uniform writes.
> However, enough non-benchmark workloads have too little locality to benefit
> from caches.  Those will struggle against *_flush_after like uniform writes
> do, so discounting uniform writes wouldn't simplify this project.

But such workloads rarely will hit the point of constantly re-dirtying
already dirty pages in kernel memory within 30s.


> Today's defaults for *_flush_after greatly smooth and accelerate performance
> for one class of plausible workloads while greatly slowing a different class
> of plausible workloads.

I don't think checkpoint_flush_after is in that class, due to the
fsync()s we already emit at the end of checkpoints.

Greetings,

Andres Freund



Re: Perf Benchmarking and regression.

From
Noah Misch
Date:
On Thu, Jun 02, 2016 at 11:09:22PM -0700, Andres Freund wrote:
> On 2016-06-03 01:57:33 -0400, Noah Misch wrote:
> > > Which means that transactional workloads that are bigger than the OS
> > > memory, or which have a non-uniform distribution leading to some
> > > locality, are likely to be faster. In practice those are *hugely* more
> > > likely than the uniform distribution that pgbench has.
> > 
> > That is formally true; non-benchmark workloads rarely issue uniform writes.
> > However, enough non-benchmark workloads have too little locality to benefit
> > from caches.  Those will struggle against *_flush_after like uniform writes
> > do, so discounting uniform writes wouldn't simplify this project.
> 
> But such workloads rarely will hit the point of constantly re-dirtying
> already dirty pages in kernel memory within 30s.

Rarely, yes.  Not rarely enough to discount.

> > Today's defaults for *_flush_after greatly smooth and accelerate performance
> > for one class of plausible workloads while greatly slowing a different class
> > of plausible workloads.

The usual PostgreSQL handling of a deeply workload-dependent performance
feature is to disable it by default.  That's what I'm inclined to do here, for
every GUC the feature added.  Sophisticated users will nonetheless fully
exploit this valuable mechanism in 9.6.

> I don't think checkpoint_flush_after is in that class, due to the
> fsync()s we already emit at the end of checkpoints.

That's a promising hypothesis.  Some future project could impose a nonzero
default checkpoint_flush_after, having demonstrated that it imposes negligible
harm in the plausible cases it does not help.



Re: Perf Benchmarking and regression.

From
Fabien COELHO
Date:
Hello Noah,

> The usual PostgreSQL handling of a deeply workload-dependent performance
> feature is to disable it by default.  That's what I'm inclined to do here, for
> every GUC the feature added.  Sophisticated users will nonetheless fully
> exploit this valuable mechanism in 9.6.

>> I don't think checkpoint_flush_after is in that class, due to the
>> fsync()s we already emit at the end of checkpoints.

I agree with Andres that checkpoint_flush_after should not be treated like
the other _flush_after settings.

> That's a promising hypothesis.

This is not a hypothesis but a proven fact. There have been hundreds of
hours of pgbench runs to test and demonstrate the positive impact in
various reasonable configurations.

> Some future project could impose a nonzero default 
> checkpoint_flush_after, having demonstrated that it imposes negligible 
> harm in the plausible cases it does not help.

I think that the significant and general benefit of checkpoint_flush_after
has been largely demonstrated and reported on the hackers thread at various
points during the development of the feature, and that it is safe, and even
highly advisable, to keep it on by default.

The key point is that it flushes sorted buffers, so it mostly results in
sequential writes. In a lot of cases it avoids the situation where the final
sync at the end of the checkpoint generates so many I/Os that PostgreSQL is
effectively offline until the fsync completes, for seconds to minutes at a
time.
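
As a rough sketch of how such a stall can be made visible (the flags and
settings below are only one illustrative combination, not the exact runs
referred to above):

# log each checkpoint, including its sync phase, in the server log
psql -c "ALTER SYSTEM SET log_checkpoints = on;"
psql -c "SELECT pg_reload_conf();"

# report throughput every 5 seconds; a stall shows up as intervals where
# tps collapses towards zero around the end of a checkpoint
pgbench -c 32 -j 32 -T 1800 -M prepared -P 5 postgres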

The other *_flush_after settings do not benefit from any buffer reordering,
so their positive impact is perhaps more questionable; I would be okay if
those are disabled by default.

-- 
Fabien.



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-06-03 10:48:18 -0400, Noah Misch wrote:
> On Thu, Jun 02, 2016 at 11:09:22PM -0700, Andres Freund wrote:
> > > Today's defaults for *_flush_after greatly smooth and accelerate performance
> > > for one class of plausible workloads while greatly slowing a different class
> > > of plausible workloads.
> 
> The usual PostgreSQL handling of a deeply workload-dependent performance
> feature is to disable it by default.

Meh. That's not actually all that often the case.  This unstable
performance issue, with the minute-long stalls, is the worst and most
frequent production problem people hit with postgres in my experience,
besides issues with autovacuum.  Ignoring that is just hurting our
users.


> > I don't think checkpoint_flush_after is in that class, due to the
> > fsync()s we already emit at the end of checkpoints.
> 
> That's a promising hypothesis.  Some future project could impose a nonzero
> default checkpoint_flush_after, having demonstrated that it imposes negligible
> harm in the plausible cases it does not help.

Have you actually looked at the thread with all the numbers? This isn't
an issue that has been decided willy-nilly. It's been discussed *over
months*.

Greetings,

Andres Freund



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-06-03 09:24:28 -0700, Andres Freund wrote:
> This unstable performance issue, with the minute-long stalls, is the
> worst and most frequent production problem people hit with postgres in
> my experience, besides issues with autovacuum.  Ignoring that is just
> hurting our users.

Oh, and a good proportion of the "autovacuum causes my overall system
to slow down unacceptably" issues come from exactly this.



Re: Perf Benchmarking and regression.

From
Robert Haas
Date:
On Fri, Jun 3, 2016 at 2:09 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-03 01:57:33 -0400, Noah Misch wrote:
>> > Which means that transactional workloads that are bigger than the OS
>> > memory, or which have a non-uniform distribution leading to some
>> > locality, are likely to be faster. In practice those are *hugely* more
>> > likely than the uniform distribution that pgbench has.
>>
>> That is formally true; non-benchmark workloads rarely issue uniform writes.
>> However, enough non-benchmark workloads have too little locality to benefit
>> from caches.  Those will struggle against *_flush_after like uniform writes
>> do, so discounting uniform writes wouldn't simplify this project.
>
> But such workloads rarely will hit the point of constantly re-dirtying
> already dirty pages in kernel memory within 30s.

I don't know why not.  It's not exactly uncommon to update the same
data frequently, nor is it uncommon for the hot data set to be larger
than shared_buffers and smaller than the OS cache, even significantly
smaller.  Any workload of that type is going to have this problem
regardless of whether the access pattern is uniform.  If you have a
highly non-uniform access pattern then you just have this problem on
the small subset of the data that is hot.  I think that asserting that
there's something wrong with this test is just wrong.  Many people
have done many tests very similar to this one on Linux systems over
many years to assess PostgreSQL performance.  It's a totally
legitimate test configuration.

Indeed, I'd argue that this is actually a pretty common real-world
scenario.  Most people's hot data fits in memory, because if it
doesn't, their performance sucks so badly that they either redesign
something or buy more memory until it does.  Also, most people have
more hot data than shared_buffers.  There are some who don't because
their data set is very small, and that's nice when it happens; and
there are others who don't because they carefully crank shared_buffers
up high enough that everything fits, but most don't, either because it
causes other problems, or because they just don't think to tinker
with it, or because they set it up that way initially but then the
data grows over time.  There are a LOT of people running with 8GB or
less of shared_buffers and a working set that is in the tens of GB.

Now, what varies IME is how much total RAM there is in the system and
how frequently they write that data, as opposed to reading it.  If
they are on a tightly RAM-constrained system, then this situation
won't arise because they won't be under the dirty background limit.
And if they aren't writing that much data then they'll be fine too.
But even putting all of that together I really don't see why you're
trying to suggest that this is some bizarre set of circumstances that
should only rarely happen in the real world.  I think it clearly does
happen, and I doubt it's particularly uncommon.  If your testing
didn't discover this scenario, I feel rather strongly that that's an
oversight in your testing rather than a problem with the scenario.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-06-03 12:31:58 -0400, Robert Haas wrote:
> Now, what varies IME is how much total RAM there is in the system and
> how frequently they write that data, as opposed to reading it.  If
> they are on a tightly RAM-constrained system, then this situation
> won't arise because they won't be under the dirty background limit.
> And if they aren't writing that much data then they'll be fine too.
> But even putting all of that together I really don't see why you're
> trying to suggest that this is some bizarre set of circumstances that
> should only rarely happen in the real world.

I'm saying that if that happens constantly, you're better off adjusting
shared_buffers, because you're likely already suffering from latency
spikes and other issues. Optimizing for massive random write throughput
in a system that's not configured appropriately, at the cost of making
well-configured systems suffer, doesn't seem like a good tradeoff to me.

Note that other operating systems like Windows and FreeBSD *already*
write back much more aggressively (independent of this change). I seem
to recall you yourself arguing quite passionately that the Linux
behaviour around this is broken.

Greetings,

Andres Freund



Re: Perf Benchmarking and regression.

From
Robert Haas
Date:
On Fri, Jun 3, 2016 at 12:39 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-03 12:31:58 -0400, Robert Haas wrote:
>> Now, what varies IME is how much total RAM there is in the system and
>> how frequently they write that data, as opposed to reading it.  If
>> they are on a tightly RAM-constrained system, then this situation
>> won't arise because they won't be under the dirty background limit.
>> And if they aren't writing that much data then they'll be fine too.
>> But even putting all of that together I really don't see why you're
>> trying to suggest that this is some bizarre set of circumstances that
>> should only rarely happen in the real world.
>
> I'm saying that if that happens constantly, you're better off adjusting
> shared_buffers, because you're likely already suffering from latency
> spikes and other issues. Optimizing for massive random write throughput
> in a system that's not configured appropriately, at the cost of making
> well-configured systems suffer, doesn't seem like a good tradeoff to me.

I really don't get it.  There's nothing in any set of guidelines for
setting shared_buffers that I've ever seen which would cause people to
avoid this scenario.  You're the first person I've ever heard describe
this as a misconfiguration.

> Note that other operating systems like Windows and FreeBSD *already*
> write back much more aggressively (independent of this change). I seem
> to recall you yourself arguing quite passionately that the Linux
> behaviour around this is broken.

Sure, but being unhappy about the Linux behavior doesn't mean that I
want our TPS on Linux to go down.  Whether I like the behavior or not,
we have to live with it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Perf Benchmarking and regression.

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jun 3, 2016 at 12:39 PM, Andres Freund <andres@anarazel.de> wrote:
>> Note that other operating systems like Windows and FreeBSD *already*
>> write back much more aggressively (independent of this change). I seem
>> to recall you yourself arguing quite passionately that the Linux
>> behaviour around this is broken.

> Sure, but being unhappy about the Linux behavior doesn't mean that I
> want our TPS on Linux to go down.  Whether I like the behavior or not,
> we have to live with it.

Yeah.  Bug or not, it's reality for lots of our users.
        regards, tom lane



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-06-03 13:33:31 -0400, Robert Haas wrote:
> On Fri, Jun 3, 2016 at 12:39 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-06-03 12:31:58 -0400, Robert Haas wrote:
> >> Now, what varies IME is how much total RAM there is in the system and
> >> how frequently they write that data, as opposed to reading it.  If
> >> they are on a tightly RAM-constrained system, then this situation
> >> won't arise because they won't be under the dirty background limit.
> >> And if they aren't writing that much data then they'll be fine too.
> >> But even putting all of that together I really don't see why you're
> >> trying to suggest that this is some bizarre set of circumstances that
> >> should only rarely happen in the real world.
> >
> > I'm saying that if that happens constantly, you're better off adjusting
> > shared_buffers, because you're likely already suffering from latency
> > spikes and other issues. Optimizing for massive random write throughput
> > in a system that's not configured appropriately, at the cost of making
> > well-configured systems suffer, doesn't seem like a good tradeoff to me.
> 
> I really don't get it.  There's nothing in any set of guidelines for
> setting shared_buffers that I've ever seen which would cause people to
> avoid this scenario.

The "roughly 1/4" of memory guideline already mostly avoids it? It's
hard to constantly re-dirty a written-back page within 30s, before the
10% (background)/20% (foreground) limits apply; if your shared buffers
are larger than the 10%/20% limits (which only apply to *available* not
total memory btw).
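
For concreteness, those limits are the Linux VM writeback knobs; a quick way
to inspect them (10, 20 and 3000 being the usual stock defaults):

# background writeback starts at dirty_background_ratio percent of reclaimable
# memory, foreground throttling at dirty_ratio percent; dirty pages are
# expired after dirty_expire_centisecs (3000 = 30s)
sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs
# how much dirty / under-writeback page cache there is right now
grep -E '^(Dirty|Writeback):' /proc/meminfo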


> You're the first person I've ever heard describe this as a
> misconfiguration.

Huh? People tried addressing this problem for *years* with bigger /
smaller shared buffers, but couldn't easily.

I'm inclined to give up and disable backend_flush_after (not the rest),
because it's new and by far the "riskiest". But I do think it's a
disservice for the majority of our users.

Greetings,

Andres Freund



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-06-03 13:42:09 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Fri, Jun 3, 2016 at 12:39 PM, Andres Freund <andres@anarazel.de> wrote:
> >> Note that other operating systems like Windows and FreeBSD *already*
> >> write back much more aggressively (independent of this change). I seem
> >> to recall you yourself arguing quite passionately that the Linux
> >> behaviour around this is broken.
> 
> > Sure, but being unhappy about the Linux behavior doesn't mean that I
> > want our TPS on Linux to go down.  Whether I like the behavior or not,
> > we have to live with it.
> 
> Yeah.  Bug or not, it's reality for lots of our users.

That means we need to address it. Which is what the feature does. So
yes, some Linux-specific tuning might need to be tweaked in the more
extreme cases. But that's better than relying on Linux's extreme
writeback behaviour, which changes every few releases to boot. From the
tuning side this makes shared_buffers sizing more consistent across
unixoid OSes.

Andres



Re: Perf Benchmarking and regression.

From
Robert Haas
Date:
On Fri, Jun 3, 2016 at 1:43 PM, Andres Freund <andres@anarazel.de> wrote:
>> I really don't get it.  There's nothing in any set of guidelines for
>> setting shared_buffers that I've ever seen which would cause people to
>> avoid this scenario.
>
> The "roughly 1/4" of memory guideline already mostly avoids it? It's
> hard to constantly re-dirty a written-back page within 30s, before the
> 10% (background)/20% (foreground) limits apply; if your shared buffers
> are larger than the 10%/20% limits (which only apply to *available* not
> total memory btw).

I've always heard that guideline as "roughly 1/4, but not more than
about 8GB" - and the number of people with more than 32GB of RAM is
going to just keep going up.

>> You're the first person I've ever heard describe this as a
>> misconfiguration.
>
> Huh? People tried addressing this problem for *years* with bigger /
> smaller shared buffers, but couldn't easily.

I'm saying that setting 8GB of shared_buffers on a system with
lotsamem is not widely regarded as misconfiguration.

> I'm inclined to give up and disable backend_flush_after (not the rest),
> because it's new and by far the "riskiest". But I do think it's a
> disservice for the majority of our users.

I think that's the right course of action.  I wasn't arguing for
disabling either of the other two.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-06-03 13:47:58 -0400, Robert Haas wrote:
> On Fri, Jun 3, 2016 at 1:43 PM, Andres Freund <andres@anarazel.de> wrote:
> >> I really don't get it.  There's nothing in any set of guidelines for
> >> setting shared_buffers that I've ever seen which would cause people to
> >> avoid this scenario.
> >
> > The "roughly 1/4" of memory guideline already mostly avoids it? It's
> > hard to constantly re-dirty a written-back page within 30s, before the
> > 10% (background)/20% (foreground) limits apply; if your shared buffers
> > are larger than the 10%/20% limits (which only apply to *available* not
> > total memory btw).
> 
> I've always heard that guideline as "roughly 1/4, but not more than
> about 8GB" - and the number of people with more than 32GB of RAM is
> going to just keep going up.

I think that upper limit is wrong.  But even disregarding that:

To hit the issue in that case you have to access more data than
shared_buffers (8GB), and very frequently re-dirty already dirtied
data. So you're basically (on a very rough approximation) going to have
to write more than 8GB within 30s (256MB/s).  Unless your hardware can
handle that many mostly random writes, you are likely to hit the worst
case behaviour of pending writeback piling up and stalls.


> > I'm inclined to give up and disable backend_flush_after (not the rest),
> > because it's new and by far the "riskiest". But I do think it's a
> > disservice for the majority of our users.
> 
> I think that's the right course of action.  I wasn't arguing for
> disabling either of the other two.

Noah was...

Greetings,

Andres Freund



Re: Perf Benchmarking and regression.

From
Robert Haas
Date:
On Fri, Jun 3, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote:
>> I've always heard that guideline as "roughly 1/4, but not more than
>> about 8GB" - and the number of people with more than 32GB of RAM is
>> going to just keep going up.
>
> I think that upper limit is wrong.  But even disregarding that:

Many people think the upper limit should be even lower, based on good,
practical experience.  Like I've seen plenty of people recommend
2-2.5GB.

> To hit the issue in that case you have to access more data than
> shared_buffers (8GB), and very frequently re-dirty already dirtied
> data. So you're basically (on a very rough approximation) going to have
> to write more than 8GB within 30s (256MB/s).  Unless your hardware can
> handle that many mostly random writes, you are likely to hit the worst
> case behaviour of pending writeback piling up and stalls.

I'm not entirely sure that this is true, because my experience is that
the background writing behavior under Linux is not very aggressive.  I
agree you need a working set >8GB, but I think if you have that you
might not actually need to write data this quickly, because if Linux
decides to only do background writing (as opposed to blocking
processes) it may not actually keep up.

Also, 256MB/s is not actually all that crazy a write rate.  I mean, it's
a lot, but even if each random UPDATE touched only one 8kB block, that
would be about 32k TPS.  When you add in index updates and TOAST
traffic, the actual number of block writes per transaction could be
considerably higher, so we might be talking about something <10k TPS.
That's well within the range of what people try to do with PostgreSQL,
at least IME.
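
Spelling that arithmetic out (the blocks-per-transaction figure is only an
assumption for illustration):

# 256 MB/s expressed as 8kB block writes per second
echo $(( 256 * 1024 / 8 ))   # 32768
# if each transaction dirties several blocks (heap + indexes + TOAST), say 4,
# the matching transaction rate is well under 10k TPS
echo $(( 32768 / 4 ))        # 8192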

>> > I'm inclined to give up and disable backend_flush_after (not the rest),
>> > because it's new and by far the "riskiest". But I do think it's a
>> > disservice for the majority of our users.
>>
>> I think that's the right course of action.  I wasn't arguing for
>> disabling either of the other two.
>
> Noah was...

I know, but I'm not Noah.  :-)

We have no evidence of the other settings causing any problems yet, so
I see no reason to second-guess the decision to leave them on by
default at this stage.  Other people may disagree with that analysis,
and that's fine, but my analysis is that the case for
disable-by-default has been made for backend_flush_after but not the
others.  I also agree that backend_flush_after is much more dangerous
on theoretical grounds; the checkpointer is in a good position to sort
the requests to achieve locality, but backends are not.  And in fact I
think what the testing shows so far is that when they can't achieve
locality, backend flush control sucks.  When it can, it's neutral or
positive.  But I really see no reason to believe that that's likely to
be true on general workloads.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-06-03 15:17:06 -0400, Robert Haas wrote:
> On Fri, Jun 3, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote:
> >> I've always heard that guideline as "roughly 1/4, but not more than
> >> about 8GB" - and the number of people with more than 32GB of RAM is
> >> going to just keep going up.
> >
> > I think that upper limit is wrong.  But even disregarding that:
>
> Many people think the upper limit should be even lower, based on good,
> practical experience.  Like I've seen plenty of people recommend
> 2-2.5GB.

Which largely imo is because of the writeback issue. And the locking
around buffer replacement, if you're doing it highly concurrently (which
is now mostly solved).


> > To hit the issue in that case you have to access more data than
> > shared_buffers (8GB), and very frequently re-dirty already dirtied
> > data. So you're basically (on a very rough approximation) going to have
> > to write more than 8GB within 30s (256MB/s).  Unless your hardware can
> > handle that many mostly random writes, you are likely to hit the worst
> > case behaviour of pending writeback piling up and stalls.
>
> I'm not entirely sure that this is true, because my experience is that
> the background writing behavior under Linux is not very aggressive.  I
> agree you need a working set >8GB, but I think if you have that you
> might not actually need to write data this quickly, because if Linux
> decides to only do background writing (as opposed to blocking
> processes) it may not actually keep up.

But that's *bad*. Then a checkpoint comes around and latency and
throughput are shot to hell while the writeback from the fsyncs is
preventing any concurrent write activity. And if it's not keeping up
before, it's now really bad.


> And in fact I
> think what the testing shows so far is that when they can't achieve
> locality, backend flush control sucks.

FWIW, I don't think that's true generally enough. For pgbench runs bigger
than 20% of available memory there's pretty much no locality, and backend
flushing helps considerably.
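
To put the "bigger than 20% of available memory" case in concrete terms, a
rough sizing sketch (the ~15MB per scale-factor unit is an approximation for
pgbench_accounts plus its index):

scale=1000
echo "approx pgbench database size: $(( scale * 15 )) MB"   # ~15GB at scale 1000
# 20% of what the kernel currently reports as available (kernels >= 3.14)
awk '/MemAvailable/ {printf "20%% of MemAvailable: %d MB\n", $2 / 5 / 1024}' /proc/meminfo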

Andres



Re: Perf Benchmarking and regression.

From
Noah Misch
Date:
On Fri, Jun 03, 2016 at 03:17:06PM -0400, Robert Haas wrote:
> On Fri, Jun 3, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote:

> >> > I'm inclined to give up and disable backend_flush_after (not the rest),
> >> > because it's new and by far the "riskiest". But I do think it's a
> >> > disservice for the majority of our users.
> >>
> >> I think that's the right course of action.  I wasn't arguing for
> >> disabling either of the other two.
> >
> > Noah was...
> 
> I know, but I'm not Noah.  :-)
> 
> We have no evidence of the other settings causing any problems yet, so
> I see no reason to second-guess the decision to leave them on by
> default at this stage.  Other people may disagree with that analysis,
> and that's fine, but my analysis is that the case for
> disable-by-default has been made for backend_flush_after but not the
> others.  I also agree that backend_flush_after is much more dangerous
> on theoretical grounds; the checkpointer is in a good position to sort
> the requests to achieve locality, but backends are not.

Disabling just backend_flush_after by default works for me, so let's do that.
Though I would not elect, on behalf of PostgreSQL, the risk of enabling
{bgwriter,checkpoint,wal_writer}_flush_after by default, a reasonable person
may choose to do so.  I doubt the community could acquire the data necessary
to ascertain which choice has more utility.
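
Whichever way the defaults land, the values a given build ships with can be
checked directly from pg_settings; a small sketch:

psql -c "SELECT name, setting, unit, boot_val FROM pg_settings WHERE name LIKE '%flush_after';"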



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-06-03 20:41:33 -0400, Noah Misch wrote:
> Disabling just backend_flush_after by default works for me, so let's do that.
> Though I would not elect, on behalf of PostgreSQL, the risk of enabling
> {bgwriter,checkpoint,wal_writer}_flush_after by default, a reasonable person
> may choose to do so.  I doubt the community could acquire the data necessary
> to ascertain which choice has more utility.

Note that wal_writer_flush_after was essentially already enabled before,
just a lot more *aggressively*.



Re: Perf Benchmarking and regression.

From
Noah Misch
Date:
On Sun, May 29, 2016 at 01:26:03AM -0400, Noah Misch wrote:
> On Thu, May 12, 2016 at 10:49:06AM -0400, Robert Haas wrote:
> > On Thu, May 12, 2016 at 8:39 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > Please find the test results for the following set of combinations taken at
> > > 128 client counts:
> > >
> > > 1) Unpatched master, default *_flush_after :  TPS = 10925.882396
> > >
> > > 2) Unpatched master, *_flush_after=0 :  TPS = 18613.343529
> > >
> > > 3) That line removed with #if 0, default *_flush_after :  TPS = 9856.809278
> > >
> > > 4) That line removed with #if 0, *_flush_after=0 :  TPS = 18158.648023
> > 
> > I'm getting increasingly unhappy about the checkpoint flush control.
> > I saw major regressions on my parallel COPY test, too:
> > 
> > http://www.postgresql.org/message-id/CA+TgmoYoUQf9cGcpgyGNgZQHcY-gCcKRyAqQtDU8KFE4N6HVkA@mail.gmail.com
> > 
> > That was a completely different machine (POWER7 instead of Intel,
> > lousy disks instead of good ones) and a completely different workload.
> > Considering these results, I think there's now plenty of evidence to
> > suggest that this feature is going to be horrible for a large number
> > of users.  A 45% regression on pgbench is horrible.  (Nobody wants to
> > take even a 1% hit for snapshot too old, right?)  Sure, it might not
> > be that way for every user on every Linux system, and I'm sure it
> > performed well on the systems where Andres benchmarked it, or he
> > wouldn't have committed it.  But our goal can't be to run well only on
> > the newest hardware with the least-buggy kernel...
> 
> [This is a generic notification.]
> 
> The above-described topic is currently a PostgreSQL 9.6 open item.  Andres,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> 9.6 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within 72 hours of this
> message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping 9.6rc1.  Consequently, I will appreciate your
> efforts toward speedy resolution.  Thanks.
> 
> [1] http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com

This PostgreSQL 9.6 open item is past due for your status update.  Kindly send
a status update within 24 hours, and include a date for your subsequent status
update.  Refer to the policy on open item ownership:
http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-06-08 23:00:15 -0400, Noah Misch wrote:
> On Sun, May 29, 2016 at 01:26:03AM -0400, Noah Misch wrote:
> > On Thu, May 12, 2016 at 10:49:06AM -0400, Robert Haas wrote:
> > > On Thu, May 12, 2016 at 8:39 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > > Please find the test results for the following set of combinations taken at
> > > > 128 client counts:
> > > >
> > > > 1) Unpatched master, default *_flush_after :  TPS = 10925.882396
> > > >
> > > > 2) Unpatched master, *_flush_after=0 :  TPS = 18613.343529
> > > >
> > > > 3) That line removed with #if 0, default *_flush_after :  TPS = 9856.809278
> > > >
> > > > 4) That line removed with #if 0, *_flush_after=0 :  TPS = 18158.648023
> > > 
> > > I'm getting increasingly unhappy about the checkpoint flush control.
> > > I saw major regressions on my parallel COPY test, too:
> > > 
> > > http://www.postgresql.org/message-id/CA+TgmoYoUQf9cGcpgyGNgZQHcY-gCcKRyAqQtDU8KFE4N6HVkA@mail.gmail.com
> > > 
> > > That was a completely different machine (POWER7 instead of Intel,
> > > lousy disks instead of good ones) and a completely different workload.
> > > Considering these results, I think there's now plenty of evidence to
> > > suggest that this feature is going to be horrible for a large number
> > > of users.  A 45% regression on pgbench is horrible.  (Nobody wants to
> > > take even a 1% hit for snapshot too old, right?)  Sure, it might not
> > > be that way for every user on every Linux system, and I'm sure it
> > > performed well on the systems where Andres benchmarked it, or he
> > > wouldn't have committed it.  But our goal can't be to run well only on
> > > the newest hardware with the least-buggy kernel...
> > 
> > [This is a generic notification.]
> > 
> > The above-described topic is currently a PostgreSQL 9.6 open item.  Andres,
> > since you committed the patch believed to have created it, you own this open
> > item.  If some other commit is more relevant or if this does not belong as a
> > 9.6 open item, please let us know.  Otherwise, please observe the policy on
> > open item ownership[1] and send a status update within 72 hours of this
> > message.  Include a date for your subsequent status update.  Testers may
> > discover new open items at any time, and I want to plan to get them all fixed
> > well in advance of shipping 9.6rc1.  Consequently, I will appreciate your
> > efforts toward speedy resolution.  Thanks.
> > 
> > [1] http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com
> 
> This PostgreSQL 9.6 open item is past due for your status update.  Kindly send
> a status update within 24 hours, and include a date for your subsequent status
> update.  Refer to the policy on open item ownership:
> http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com

I'm writing a patch right now, planning to post it later today, commit
it tomorrow.

Greetings,

Andres Freund



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-06-09 14:37:31 -0700, Andres Freund wrote:
> I'm writing a patch right now, planning to post it later today, commit
> it tomorrow.

Attached.


Re: Perf Benchmarking and regression.

From
Michael Paquier
Date:
On Fri, Jun 10, 2016 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-09 14:37:31 -0700, Andres Freund wrote:
>> I'm writing a patch right now, planning to post it later today, commit
>> it tomorrow.
>
> Attached.

-        /* see bufmgr.h: OS dependent default */
-        DEFAULT_BACKEND_FLUSH_AFTER, 0, WRITEBACK_MAX_PENDING_FLUSHES,
+        0, 0, WRITEBACK_MAX_PENDING_FLUSHES,
Wouldn't it be better to still use DEFAULT_BACKEND_FLUSH_AFTER here, and
just enforce it to 0 for all the OSes at the top of bufmgr.h?
-- 
Michael



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-06-10 09:34:33 +0900, Michael Paquier wrote:
> On Fri, Jun 10, 2016 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-06-09 14:37:31 -0700, Andres Freund wrote:
> >> I'm writing a patch right now, planning to post it later today, commit
> >> it tomorrow.
> >
> > Attached.
> 
> -        /* see bufmgr.h: OS dependent default */
> -        DEFAULT_BACKEND_FLUSH_AFTER, 0, WRITEBACK_MAX_PENDING_FLUSHES,
> +        0, 0, WRITEBACK_MAX_PENDING_FLUSHES,
> Wouldn't it be better to still use DEFAULT_BACKEND_FLUSH_AFTER here, and
> just enforce it to 0 for all the OSes at the top of bufmgr.h?

What would be the point? The only reason for DEFAULT_BACKEND_FLUSH_AFTER
was that it differed between operating systems. Now it doesn't anymore.

Andres



Re: Perf Benchmarking and regression.

From
Michael Paquier
Date:
On Fri, Jun 10, 2016 at 9:37 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-10 09:34:33 +0900, Michael Paquier wrote:
>> On Fri, Jun 10, 2016 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:
>> > On 2016-06-09 14:37:31 -0700, Andres Freund wrote:
>> >> I'm writing a patch right now, planning to post it later today, commit
>> >> it tomorrow.
>> >
>> > Attached.
>>
>> -        /* see bufmgr.h: OS dependent default */
>> -        DEFAULT_BACKEND_FLUSH_AFTER, 0, WRITEBACK_MAX_PENDING_FLUSHES,
>> +        0, 0, WRITEBACK_MAX_PENDING_FLUSHES,
>> Wouldn't it be better to still use DEFAULT_BACKEND_FLUSH_AFTER here, and
>> just enforce it to 0 for all the OSes at the top of bufmgr.h?
>
> What would be the point? The only reason for DEFAULT_BACKEND_FLUSH_AFTER
> was that it differed between operating systems. Now it doesn't anymore.

Then why do you keep it defined?
-- 
Michael



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-06-10 09:41:09 +0900, Michael Paquier wrote:
> On Fri, Jun 10, 2016 at 9:37 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-06-10 09:34:33 +0900, Michael Paquier wrote:
> >> On Fri, Jun 10, 2016 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:
> >> > On 2016-06-09 14:37:31 -0700, Andres Freund wrote:
> >> >> I'm writing a patch right now, planning to post it later today, commit
> >> >> it tomorrow.
> >> >
> >> > Attached.
> >>
> >> -        /* see bufmgr.h: OS dependent default */
> >> -        DEFAULT_BACKEND_FLUSH_AFTER, 0, WRITEBACK_MAX_PENDING_FLUSHES,
> >> +        0, 0, WRITEBACK_MAX_PENDING_FLUSHES,
> >> Wouldn't it be better to still use DEFAULT_BACKEND_FLUSH_AFTER here, and
> >> just enforce it to 0 for all the OSes at the top of bufmgr.h?
> >
> > What would be the point? The only reason for DEFAULT_BACKEND_FLUSH_AFTER
> > was that it differed between operating systems. Now it doesn't anymore.
> 
> Then why do you keep it defined?

Ooops. Missing git add.

Greetings,

Andres Freund



Re: Perf Benchmarking and regression.

From
Michael Paquier
Date:
On Fri, Jun 10, 2016 at 9:42 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-10 09:41:09 +0900, Michael Paquier wrote:
>> On Fri, Jun 10, 2016 at 9:37 AM, Andres Freund <andres@anarazel.de> wrote:
>> > On 2016-06-10 09:34:33 +0900, Michael Paquier wrote:
>> >> On Fri, Jun 10, 2016 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:
>> >> > On 2016-06-09 14:37:31 -0700, Andres Freund wrote:
>> >> >> I'm writing a patch right now, planning to post it later today, commit
>> >> >> it tomorrow.
>> >> >
>> >> > Attached.
>> >>
>> >> -        /* see bufmgr.h: OS dependent default */
>> >> -        DEFAULT_BACKEND_FLUSH_AFTER, 0, WRITEBACK_MAX_PENDING_FLUSHES,
>> >> +        0, 0, WRITEBACK_MAX_PENDING_FLUSHES,
>> >> Wouldn't it be better to still use DEFAULT_BACKEND_FLUSH_AFTER here, and
>> >> just enforce it to 0 for all the OSes at the top of bufmgr.h?
>> >
>> > What would be the point? The only reason for DEFAULT_BACKEND_FLUSH_AFTER
>> > was that it differed between operating systems. Now it doesn't anymore.
>>
>> Then why do you keep it defined?
>
> Ooops. Missing git add.

:)
-- 
Michael



Re: Perf Benchmarking and regression.

From
Andres Freund
Date:
On 2016-06-09 17:19:34 -0700, Andres Freund wrote:
> On 2016-06-09 14:37:31 -0700, Andres Freund wrote:
> > I'm writing a patch right now, planning to post it later today, commit
> > it tomorrow.
> 
> Attached.

And pushed. Thanks to Michael for noticing the missing addition of the
header file hunk.

Andres