Thread: Perf Benchmarking and regression.
I tried to do some benchmarking on the postgres master head:
commit 72a98a639574d2e25ed94652848555900c81a799
Author: Andres Freund <andres@anarazel.de>
Date: Tue Apr 26 20:32:51 2016 -0700
CASE : Read-Write Tests when data exceeds shared buffers.
Non Default settings and test
./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &
./pgbench -i -s 1000 postgres
./pgbench -c $threads -j $threads -T 1800 -M prepared postgres
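For reference, the per-client-count numbers in the table below can be produced with a small driver loop along these lines (the loop and the results file name are illustrative, not from the original run):

# Hypothetical driver: one 30-minute read-write run per client count
for threads in 1 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128
do
./pgbench -c $threads -j $threads -T 1800 -M prepared postgres | grep "excluding connections" >> results.txt
done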
Machine: "cthulhu", an 8-node NUMA machine with 128 hyperthreads.
>numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 65 66 67 68 69 70 71 96 97 98 99 100 101 102 103
node 0 size: 65498 MB
node 0 free: 37885 MB
node 1 cpus: 72 73 74 75 76 77 78 79 104 105 106 107 108 109 110 111
node 1 size: 65536 MB
node 1 free: 31215 MB
node 2 cpus: 80 81 82 83 84 85 86 87 112 113 114 115 116 117 118 119
node 2 size: 65536 MB
node 2 free: 15331 MB
node 3 cpus: 88 89 90 91 92 93 94 95 120 121 122 123 124 125 126 127
node 3 size: 65536 MB
node 3 free: 36774 MB
node 4 cpus: 1 2 3 4 5 6 7 8 33 34 35 36 37 38 39 40
node 4 size: 65536 MB
node 4 free: 62 MB
node 5 cpus: 9 10 11 12 13 14 15 16 41 42 43 44 45 46 47 48
node 5 size: 65536 MB
node 5 free: 9653 MB
node 6 cpus: 17 18 19 20 21 22 23 24 49 50 51 52 53 54 55 56
node 6 size: 65536 MB
node 6 free: 50209 MB
node 7 cpus: 25 26 27 28 29 30 31 32 57 58 59 60 61 62 63 64
node 7 size: 65536 MB
node 7 free: 43966 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 21 21 21 21 21 21 21
1: 21 10 21 21 21 21 21 21
2: 21 21 10 21 21 21 21 21
3: 21 21 21 10 21 21 21 21
4: 21 21 21 21 10 21 21 21
5: 21 21 21 21 21 10 21 21
6: 21 21 21 21 21 21 10 21
7: 21 21 21 21 21 21 21 10
I see some regression when compared to 9.5
Sessions | PostgreSQL-9.5 scale 1000 | PostgreSQL-9.6 scale 1000 | %diff |
1 | 747.367249 | 892.149891 | 19.3723557185 |
8 | 5281.282799 | 4941.905008 | -6.4260484416 |
16 | 9000.915419 | 8695.396233 | -3.3943123758 |
24 | 11852.839627 | 10843.328776 | -8.5170379653 |
32 | 14323.048334 | 11977.505153 | -16.3760054864 |
40 | 16098.926583 | 12195.447024 | -24.2468312336 |
48 | 16959.646965 | 12639.951087 | -25.4704351271 |
56 | 17157.737762 | 12543.212929 | -26.894715941 |
64 | 17201.914922 | 12628.002422 | -26.5895542487 |
72 | 16956.994835 | 11280.870599 | -33.4736448954 |
80 | 16775.954896 | 11348.830603 | -32.3506132834 |
88 | 16609.137558 | 10823.465121 | -34.834273705 |
96 | 16510.099404 | 11091.757753 | -32.8183466278 |
104 | 16275.724927 | 10665.743275 | -34.4683980416 |
112 | 16141.815128 | 10977.84664 | -31.9912503461 |
120 | 15904.086614 | 10716.17755 | -32.6199749153 |
128 | 15738.391503 | 10962.333439 | -30.3465450271 |
When I ran git bisect on master (and this is for 128 clients), two commit ids affected the performance:
1. # first bad commit: [ac1d7945f866b1928c2554c0f80fd52d7f977772] Make idle backends exit if the postmaster dies.
This made performance drop from 15947.21546 (15K+) to 13409.758510 (around 13K+).
2. # first bad commit: [428b1d6b29ca599c5700d4bc4f4ce4c5880369bf] Allow to trigger kernel writeback after a configurable number of writes.
This made performance drop further to 10962.333439 (10K+). I think it did not recover afterwards.
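The bisect itself can be scripted; a rough sketch follows (the run_pgbench.sh helper, the build commands and the 14000 tps threshold are assumptions for illustration, not what was actually used):

# Hypothetical bisect driver: a commit is "bad" if measured TPS falls below a chosen threshold
git bisect start
git bisect bad master
git bisect good <last-known-good-commit>    # placeholder, e.g. a commit near the 9.5 branch point
git bisect run sh -c '
./configure --quiet && make -s -j64 install &&
tps=$(./run_pgbench.sh) &&                  # assumed helper: re-initializes data, runs pgbench, prints a bare TPS number
awk -v t="$tps" "BEGIN { exit (t < 14000) }"  # non-zero exit marks the commit bad
'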
--
Hi,

Thanks for benchmarking!

On 2016-05-06 19:43:52 +0530, Mithun Cy wrote:
> 1. # first bad commit: [ac1d7945f866b1928c2554c0f80fd52d7f977772] Make idle
> backends exit if the postmaster dies.
> this made performance to drop from
> 15947.21546 (15K +) to 13409.758510 (arround 13K+).

Let's debug this one first; it's a lot more local. I'm rather surprised that you're seeing a big effect with that "few" TPS/socket operations; and even more that our efforts to address that problem haven't been fruitful (given we've verified the fix on a number of machines).

Can you verify that removing
AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL, NULL);
in src/backend/libpq/pqcomm.c : pq_init()
restores performance?

I think it'd be best to test the back/forth on master with
bgwriter_flush_after = 0
checkpoint_flush_after = 0
backend_flush_after = 0
to isolate the issue.

Also, do you see read-only workloads to be affected too?

> 2. # first bad commit: [428b1d6b29ca599c5700d4bc4f4ce4c5880369bf] Allow to
> trigger kernel writeback after a configurable number of writes.

FWIW, it'd be very interesting to test again with a bigger backend_flush_after setting.

Greetings,

Andres Freund
On Fri, May 6, 2016 at 8:35 PM, Andres Freund <andres@anarazel.de> wrote:
> Also, do you see read-only workloads to be affected too?
Thanks. I have not yet tested with the specific commit id that introduced the performance issue, but
At HEAD commit 72a98a639574d2e25ed94652848555900c81a799
Author: Andres Freund <andres@anarazel.de>
Date: Tue Apr 26 20:32:51 2016 -0700
for READ-only (prepared) tests (both when data fits in shared buffers and when it exceeds shared_buffers=8GB), the performance of master has improved over 9.5:
Sessions | PostgreSQL-9.5 scale 300 | PostgreSQL-9.6 scale 300 | %diff |
1 | 5287.561594 | 5213.723197 | -1.396454598 |
8 | 84265.389083 | 84871.305689 | 0.719057507 |
16 | 148330.4155 | 158661.128315 | 6.9646624936 |
24 | 207062.803697 | 219958.12974 | 6.2277366155 |
32 | 265145.089888 | 290190.501443 | 9.4459269699 |
40 | 311688.752973 | 340000.551772 | 9.0833559212 |
48 | 327169.9673 | 372408.073033 | 13.8270960829 |
56 | 274426.530496 | 390629.24948 | 42.3438356248 |
64 | 261777.692042 | 384613.9666 | 46.9238893505 |
72 | 210747.55937 | 376390.162022 | 78.5976374517 |
80 | 220192.818648 | 398128.779329 | 80.8091570713 |
88 | 185176.91888 | 423906.711882 | 128.9198429512 |
96 | 161579.719039 | 421541.656474 | 160.8877271115 |
104 | 146935.568434 | 450672.740567 | 206.7145316618 |
112 | 136605.466232 | 432047.309248 | 216.2738074582 |
120 | 127687.175016 | 455458.086889 | 256.6983816753 |
128 | 120413.936453 | 428127.879242 | 255.5467845776 |
Sessions | PostgreSQL-9.5 scale 1000 | PostgreSQL-9.6 scale 1000 | %diff |
1 | 5103.812202 | 5155.434808 | 1.01145191 |
8 | 47741.9041 | 53117.805096 | 11.2603405694 |
16 | 89722.57031 | 86965.10079 | -3.0733287182 |
24 | 130914.537373 | 153849.634245 | 17.5191367836 |
32 | 197125.725706 | 212454.474264 | 7.7761279017 |
40 | 248489.551052 | 270304.093767 | 8.7788571482 |
48 | 291884.652232 | 317257.836746 | 8.6928806705 |
56 | 304526.216047 | 359676.785476 | 18.1102862489 |
64 | 301440.463174 | 388324.710185 | 28.8230206709 |
72 | 194239.941979 | 393676.628802 | 102.6754254511 |
80 | 144879.527847 | 383365.678053 | 164.6099719885 |
88 | 122894.325326 | 372905.436117 | 203.4358463076 |
96 | 109836.31148 | 362208.867756 | 229.7715144249 |
104 | 103791.981583 | 352330.402278 | 239.4582094921 |
112 | 105189.206682 | 345722.499429 | 228.6672752217 |
120 | 108095.811432 | 342597.969088 | 216.939171416 |
128 | 113242.59492 | 333821.98763 | 194.7848270925 |
Even for READ-WRITE, when data fits into shared buffers (scale_factor=300 and shared_buffers=8GB), performance has improved.
The only case where I see some regression is when data exceeds shared_buffers (scale_factor=1000 and shared_buffers=8GB).
I will try to run the tests as you have suggested and will report the same.
Hi,

On 2016-05-06 21:21:11 +0530, Mithun Cy wrote:
> I will try to run the tests as you have suggested and will report the same.

Any news on that front?

Regards,

Andres
Hi Andres,
I am extremely sorry for the delayed response. As suggested by you, I have taken the performance readings at 128 client counts after making the following two changes:
1). Removed AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL, NULL); from pq_init(). Below is the git diff for the same.
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 8d6eb0b..399d54b 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -206,7 +206,9 @@ pq_init(void)
AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
NULL, NULL);
AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
+#if 0
AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL, NULL);
+#endif
2). Disabled the GUC variables "bgwriter_flush_after", "checkpoint_flush_after" and "backend_flush_after" by setting them to zero.
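(For reference, the postgresql.conf equivalent of step 2 would be the following; the same values can instead be passed with -c on the server command line.)

# Disable the writeback-flush GUCs
bgwriter_flush_after = 0
checkpoint_flush_after = 0
backend_flush_after = 0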
After making the above two changes, below are the readings I got for 128 client counts:
CASE : Read-Write Tests when data exceeds shared buffers.
Non Default settings and test
./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &
./pgbench -i -s 1000 postgres
./pgbench -c 128 -j 128 -T 1800 -M prepared postgres
Run1 : tps = 9690.678225
Run2 : tps = 9904.320645
Run3 : tps = 9943.547176
Please let me know if I need to take readings with other client counts as well.
Note: I have taken these readings on postgres master head at,
commit 91fd1df4aad2141859310564b498a3e28055ee28
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sun May 8 16:53:55 2016 -0400
With Regards,
Ashutosh Sharma
On Wed, May 11, 2016 at 3:53 AM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2016-05-06 21:21:11 +0530, Mithun Cy wrote:
> I will try to run the tests as you have suggested and will report the same.
Any news on that front?
Regards,
Andres
On Wed, May 11, 2016 at 12:51 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> I am extremely sorry for the delayed response. As suggested by you, I have
> taken the performance readings at 128 client counts after making the
> following two changes:
>
> 1). Removed AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL,
> NULL); from pq_init(). Below is the git diff for the same.
>
> diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
> index 8d6eb0b..399d54b 100644
> --- a/src/backend/libpq/pqcomm.c
> +++ b/src/backend/libpq/pqcomm.c
> @@ -206,7 +206,9 @@ pq_init(void)
> AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
> NULL, NULL);
> AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
> +#if 0
> AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL, NULL);
> +#endif
>
> 2). Disabled the guc vars "bgwriter_flush_after", "checkpointer_flush_after"
> and "backend_flush_after" by setting them to zero.
>
> After doing the above two changes below are the readings i got for 128
> client counts:
>
> CASE : Read-Write Tests when data exceeds shared buffers.
>
> Non Default settings and test
> ./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c
> max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c
> checkpoint_completion_target=0.9 &
>
> ./pgbench -i -s 1000 postgres
>
> ./pgbench -c 128 -j 128 -T 1800 -M prepared postgres
>
> Run1 : tps = 9690.678225
> Run2 : tps = 9904.320645
> Run3 : tps = 9943.547176
>
> Please let me know if i need to take readings with other client counts as
> well.

Can you please take four new sets of readings, like this:

- Unpatched master, default *_flush_after
- Unpatched master, *_flush_after=0
- That line removed with #if 0, default *_flush_after
- That line removed with #if 0, *_flush_after=0

128 clients is fine. But I want to see four sets of numbers that were all taken by the same person at the same time using the same script.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
Please find the test results for the following set of combinations taken at 128 client counts:
1) Unpatched master, default *_flush_after : TPS = 10925.882396
2) Unpatched master, *_flush_after=0 : TPS = 18613.343529
3) That line removed with #if 0, default *_flush_after : TPS = 9856.809278
4) That line removed with #if 0, *_flush_after=0 : TPS = 18158.648023
Here, That line points to "AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL, NULL); in pq_init()."
Please note that earlier I had taken readings with the data directory and the pg_xlog directory at the same location on HDD. This time I have moved pg_xlog to an SSD and taken the readings. With pg_xlog and the data directory at the same location on HDD, I was seeing much lower performance; for example, for the "That line removed with #if 0, *_flush_after=0" case I was getting 7367.709378 tps.
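(For reference, relocating pg_xlog in this way is typically done with the server stopped, by moving the directory and leaving a symlink behind; the paths below are placeholders, not the ones actually used.)

# Hypothetical example of moving pg_xlog onto an SSD; paths are placeholders
pg_ctl -D /data/pgdata stop
mv /data/pgdata/pg_xlog /ssd/pg_xlog
ln -s /ssd/pg_xlog /data/pgdata/pg_xlog
pg_ctl -D /data/pgdata start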
Also, the commit id on which I have taken the above readings, along with the pgbench commands used, is mentioned below:
commit 8a13d5e6d1bb9ff9460c72992657077e57e30c32
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed May 11 17:06:53 2016 -0400
Fix infer_arbiter_indexes() to not barf on system columns.
Non Default settings and test:
./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &
./pgbench -i -s 1000 postgres
./pgbench -c 128 -j 128 -T 1800 -M prepared postgres
On Thu, May 12, 2016 at 9:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, May 11, 2016 at 12:51 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> I am extremely sorry for the delayed response. As suggested by you, I have
> taken the performance readings at 128 client counts after making the
> following two changes:
>
> 1). Removed AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL,
> NULL); from pq_init(). Below is the git diff for the same.
>
> diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
> index 8d6eb0b..399d54b 100644
> --- a/src/backend/libpq/pqcomm.c
> +++ b/src/backend/libpq/pqcomm.c
> @@ -206,7 +206,9 @@ pq_init(void)
> AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE,
> MyProcPort->sock,
> NULL, NULL);
> AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
> +#if 0
> AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL, NULL);
> +#endif
>
> 2). Disabled the guc vars "bgwriter_flush_after", "checkpointer_flush_after"
> and "backend_flush_after" by setting them to zero.
>
> After doing the above two changes below are the readings i got for 128
> client counts:
>
> CASE : Read-Write Tests when data exceeds shared buffers.
>
> Non Default settings and test
> ./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c
> max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c
> checkpoint_completion_target=0.9 &
>
> ./pgbench -i -s 1000 postgres
>
> ./pgbench -c 128 -j 128 -T 1800 -M prepared postgres
>
> Run1 : tps = 9690.678225
> Run2 : tps = 9904.320645
> Run3 : tps = 9943.547176
>
> Please let me know if i need to take readings with other client counts as
> well.
Can you please take four new sets of readings, like this:
- Unpatched master, default *_flush_after
- Unpatched master, *_flush_after=0
- That line removed with #if 0, default *_flush_after
- That line removed with #if 0, *_flush_after=0
128 clients is fine. But I want to see four sets of numbers that were
all taken by the same person at the same time using the same script.
Thanks,
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 12, 2016 at 8:39 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Please find the test results for the following set of combinations taken at
> 128 client counts:
>
> 1) Unpatched master, default *_flush_after : TPS = 10925.882396
> 2) Unpatched master, *_flush_after=0 : TPS = 18613.343529
> 3) That line removed with #if 0, default *_flush_after : TPS = 9856.809278
> 4) That line removed with #if 0, *_flush_after=0 : TPS = 18158.648023

I'm getting increasingly unhappy about the checkpoint flush control. I saw major regressions on my parallel COPY test, too:

http://www.postgresql.org/message-id/CA+TgmoYoUQf9cGcpgyGNgZQHcY-gCcKRyAqQtDU8KFE4N6HVkA@mail.gmail.com

That was a completely different machine (POWER7 instead of Intel, lousy disks instead of good ones) and a completely different workload. Considering these results, I think there's now plenty of evidence to suggest that this feature is going to be horrible for a large number of users. A 45% regression on pgbench is horrible. (Nobody wants to take even a 1% hit for snapshot too old, right?) Sure, it might not be that way for every user on every Linux system, and I'm sure it performed well on the systems where Andres benchmarked it, or he wouldn't have committed it. But our goal can't be to run well only on the newest hardware with the least-buggy kernel...

> Here, That line points to "AddWaitEventToSet(FeBeWaitSet,
> WL_POSTMASTER_DEATH, -1, NULL, NULL); in pq_init()."

Given the above results, it's not clear whether that is making things better or worse.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2016-05-12 18:09:07 +0530, Ashutosh Sharma wrote:
> Please find the test results for the following set of combinations taken at
> 128 client counts:

Thanks.

> 1) Unpatched master, default *_flush_after : TPS = 10925.882396

Could you run this one with a number of different backend_flush_after settings? I'm suspecting the primary issue is that the default is too low.

Greetings,

Andres Freund
On Thu, May 12, 2016 at 11:13 AM, Andres Freund <andres@anarazel.de> wrote:
> Could you run this one with a number of different backend_flush_after
> settings? I'm suspsecting the primary issue is that the default is too low.

What values do you think would be good to test? Maybe provide 3 or 4 suggested values to try?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-05-12 11:27:31 -0400, Robert Haas wrote:
> On Thu, May 12, 2016 at 11:13 AM, Andres Freund <andres@anarazel.de> wrote:
> > Could you run this one with a number of different backend_flush_after
> > settings? I'm suspsecting the primary issue is that the default is too low.
>
> What values do you think would be good to test? Maybe provide 3 or 4
> suggested values to try?

0 (disabled), 16 (current default), 32, 64, 128, 256?

I'm suspecting that only backend_flush_after has these negative performance implications at this point. One path is to increase that option's default value, another is to disable only backend-guided flushing. And add a strong hint that if you care about predictable throughput you might want to enable it.

Greetings,

Andres Freund
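A sweep over those values could look roughly like the following, reusing the recipe from earlier in the thread (the sleep, the stop command and the per-run re-initialization are illustrative assumptions, not from the original runs):

# Hypothetical sweep over backend_flush_after (values are in 8kB pages, so 16 = 128kB)
for n in 0 16 32 64 128 256
do
./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 -c backend_flush_after=$n &
sleep 30                          # crude wait for startup; illustrative only
./pgbench -i -s 1000 postgres     # reload data so each run starts from the same state
./pgbench -c 128 -j 128 -T 1800 -M prepared postgres
./pg_ctl -D "$PGDATA" stop
done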
On 2016-05-12 10:49:06 -0400, Robert Haas wrote:
> On Thu, May 12, 2016 at 8:39 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > Please find the test results for the following set of combinations taken at
> > 128 client counts:
> >
> > 1) Unpatched master, default *_flush_after : TPS = 10925.882396
> > 2) Unpatched master, *_flush_after=0 : TPS = 18613.343529
> > 3) That line removed with #if 0, default *_flush_after : TPS = 9856.809278
> > 4) That line removed with #if 0, *_flush_after=0 : TPS = 18158.648023
>
> I'm getting increasingly unhappy about the checkpoint flush control.
> I saw major regressions on my parallel COPY test, too:

Yes, I'm concerned too. The workload in this thread is a bit of an "artificial" workload (all data is constantly updated, doesn't fit into shared_buffers, fits into the OS page cache), and only measures throughput, not latency. But I agree that that's way too large a regression to accept, and that there's a significant number of machines with way undersized shared_buffers values.

> http://www.postgresql.org/message-id/CA+TgmoYoUQf9cGcpgyGNgZQHcY-gCcKRyAqQtDU8KFE4N6HVkA@mail.gmail.com
>
> That was a completely different machine (POWER7 instead of Intel,
> lousy disks instead of good ones) and a completely different workload.
> Considering these results, I think there's now plenty of evidence to
> suggest that this feature is going to be horrible for a large number
> of users. A 45% regression on pgbench is horrible.

I asked you over there whether you could benchmark with just different values for backend_flush_after... I chose the current value because it gives the best latency / most consistent throughput numbers, but 128kB isn't a large window. I suspect we might need to disable backend-guided flushing if that's not sufficient :(

> > Here, That line points to "AddWaitEventToSet(FeBeWaitSet,
> > WL_POSTMASTER_DEATH, -1, NULL, NULL); in pq_init()."
>
> Given the above results, it's not clear whether that is making things
> better or worse.

Yea, me neither. I think it's doubtful that you'd see a performance difference due to the original ac1d7945f866b1928c2554c0f80fd52d7f977772, independent of the WaitEventSet stuff, at these throughput rates.

Greetings,

Andres Freund
>> I'm getting increasingly unhappy about the checkpoint flush control.
>> I saw major regressions on my parallel COPY test, too:
>
> Yes, I'm concerned too.

A few thoughts:

- Focussing on raw tps is not a good idea, because it may be a lot of tps followed by a sync panic, with an unresponsive database. I wish the performance reports would include some indication of the distribution (eg min/q1/median/q3/max tps per second seen, standard deviation), not just the final "tps" figure.

- Checkpoint flush control (checkpoint_flush_after) should mostly always be beneficial because it flushes sorted data. I would be surprised to see significant regressions with this on. A lot of tests showed maybe improved tps, but mostly greatly improved performance stability, where a database unresponsive 60% of the time (60% of the seconds in the tps output show very low or zero tps) becomes always responsive.

- The other flush controls ({backend,bgwriter}_flush_after) may just increase random writes, so they are more risky in nature because the data is not sorted, and they may or may not be a good idea depending on detailed conditions. A "parallel copy" would be just such a special IO load which degrades performance under these settings. Maybe these two should be disabled by default because they lead to possibly surprising regressions?

- For any particular load, the admin can decide to disable these if they think it is better not to flush. Also, as suggested by Andres, with 128 parallel queries the default value may not be appropriate at all.

--
Fabien.
Hi,

Following are the performance results for the read write test observed with different numbers of "backend_flush_after":

1) backend_flush_after = 256kb (32*8kb), tps = 10841.178815
2) backend_flush_after = 512kb (64*8kb), tps = 11098.702707
3) backend_flush_after = 1MB (128*8kb), tps = 11434.964545
4) backend_flush_after = 2MB (256*8kb), tps = 13477.089417

Note: The above test has been performed on unpatched master with default values for checkpoint_flush_after, bgwriter_flush_after and wal_writer_flush_after.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Thu, May 12, 2016 at 9:20 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-05-12 11:27:31 -0400, Robert Haas wrote:
> > On Thu, May 12, 2016 at 11:13 AM, Andres Freund <andres@anarazel.de> wrote:
> > > Could you run this one with a number of different backend_flush_after
> > > settings? I'm suspsecting the primary issue is that the default is too low.
> >
> > What values do you think would be good to test? Maybe provide 3 or 4
> > suggested values to try?
>
> 0 (disabled), 16 (current default), 32, 64, 128, 256?
>
> I'm suspecting that only backend_flush_after_* has these negative
> performance implications at this point. One path is to increase that
> option's default value, another is to disable only backend guided
> flushing. And add a strong hint that if you care about predictable
> throughput you might want to enable it.
>
> Greetings,
>
> Andres Freund
On Fri, May 13, 2016 at 7:08 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Following are the performance results for read write test observed with
> different numbers of "backend_flush_after".
>
> 1) backend_flush_after = 256kb (32*8kb), tps = 10841.178815
> 2) backend_flush_after = 512kb (64*8kb), tps = 11098.702707
> 3) backend_flush_after = 1MB (128*8kb), tps = 11434.964545
> 4) backend_flush_after = 2MB (256*8kb), tps = 13477.089417

So even at 2MB we don't come close to recovering all of the lost performance. Can you please test these three scenarios?

1. Default settings for *_flush_after
2. backend_flush_after=0, rest defaults
3. backend_flush_after=0, bgwriter_flush_after=0, wal_writer_flush_after=0, checkpoint_flush_after=0

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-05-13 10:20:04 -0400, Robert Haas wrote:
> On Fri, May 13, 2016 at 7:08 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > Following are the performance results for read write test observed with
> > different numbers of "backend_flush_after".
> >
> > 1) backend_flush_after = 256kb (32*8kb), tps = 10841.178815
> > 2) backend_flush_after = 512kb (64*8kb), tps = 11098.702707
> > 3) backend_flush_after = 1MB (128*8kb), tps = 11434.964545
> > 4) backend_flush_after = 2MB (256*8kb), tps = 13477.089417
>
> So even at 2MB we don't come close to recovering all of the lost
> performance. Can you please test these three scenarios?
>
> 1. Default settings for *_flush_after
> 2. backend_flush_after=0, rest defaults
> 3. backend_flush_after=0, bgwriter_flush_after=0,
> wal_writer_flush_after=0, checkpoint_flush_after=0

4) 1) + a shared_buffers setting appropriate to the workload.

I just want to emphasize what we're discussing here is a bit of an extreme setup. A workload that's bigger than shared buffers, but smaller than the OS's cache size; with a noticeable likelihood of rewriting individual OS page cache pages within 30s.

Greetings,

Andres Freund
On Fri, May 13, 2016 at 1:43 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-05-13 10:20:04 -0400, Robert Haas wrote:
>> On Fri, May 13, 2016 at 7:08 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>> > Following are the performance results for read write test observed with
>> > different numbers of "backend_flush_after".
>> >
>> > 1) backend_flush_after = 256kb (32*8kb), tps = 10841.178815
>> > 2) backend_flush_after = 512kb (64*8kb), tps = 11098.702707
>> > 3) backend_flush_after = 1MB (128*8kb), tps = 11434.964545
>> > 4) backend_flush_after = 2MB (256*8kb), tps = 13477.089417
>>
>> So even at 2MB we don't come close to recovering all of the lost
>> performance. Can you please test these three scenarios?
>>
>> 1. Default settings for *_flush_after
>> 2. backend_flush_after=0, rest defaults
>> 3. backend_flush_after=0, bgwriter_flush_after=0,
>> wal_writer_flush_after=0, checkpoint_flush_after=0
>
> 4) 1) + a shared_buffers setting appropriate to the workload.
>
> I just want to emphasize what we're discussing here is a bit of an
> extreme setup. A workload that's bigger than shared buffers, but smaller
> than the OS's cache size; with a noticeable likelihood of rewriting
> individual OS page cache pages within 30s.

You're just describing pgbench with a scale factor too large to fit in shared_buffers. I think it's unfair to paint that as some kind of niche use case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-05-13 14:43:15 -0400, Robert Haas wrote:
> On Fri, May 13, 2016 at 1:43 PM, Andres Freund <andres@anarazel.de> wrote:
> > I just want to emphasize what we're discussing here is a bit of an
> > extreme setup. A workload that's bigger than shared buffers, but smaller
> > than the OS's cache size; with a noticeable likelihood of rewriting
> > individual OS page cache pages within 30s.
>
> You're just describing pgbench with a scale factor too large to fit in
> shared_buffers.

Well, that *and* a scale factor smaller than 20% of the memory available, *and* a scale factor small enough to make re-dirtying of already written out pages likely.

> I think it's unfair to paint that as some kind of niche use case.

I'm not saying we don't need to do something about it. Just that it's a hard tradeoff to make. The massive performance / latency issues we've observed originate from the kernel caching too much dirty IO. The fix is making it cache fewer dirty pages. But there's workloads where the kernel's buffer cache works as an extension of our page cache.
On Fri, May 13, 2016 at 11:13 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-05-13 10:20:04 -0400, Robert Haas wrote:
> > On Fri, May 13, 2016 at 7:08 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > Following are the performance results for read write test observed with
> > > different numbers of "backend_flush_after".
> > >
> > > 1) backend_flush_after = 256kb (32*8kb), tps = 10841.178815
> > > 2) backend_flush_after = 512kb (64*8kb), tps = 11098.702707
> > > 3) backend_flush_after = 1MB (128*8kb), tps = 11434.964545
> > > 4) backend_flush_after = 2MB (256*8kb), tps = 13477.089417
> >
> > So even at 2MB we don't come close to recovering all of the lost
> > performance. Can you please test these three scenarios?
> >
> > 1. Default settings for *_flush_after
> > 2. backend_flush_after=0, rest defaults
> > 3. backend_flush_after=0, bgwriter_flush_after=0,
> > wal_writer_flush_after=0, checkpoint_flush_after=0
>
> 4) 1) + a shared_buffers setting appropriate to the workload.
>
If by the 4th point you mean testing the case when data fits in shared buffers, then Mithun has already reported above [1] that he did not see any regression for that case.
Read the line: "Even for READ-WRITE when data fits into shared buffer (scale_factor=300 and shared_buffers=8GB) performance has improved."
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi,

Please find the results for the following 3 scenarios with unpatched master:

1. Default settings for *_flush_after : TPS = 10677.662356
2. backend_flush_after=0, rest defaults : TPS = 18452.655936
3. backend_flush_after=0, bgwriter_flush_after=0, wal_writer_flush_after=0, checkpoint_flush_after=0 : TPS = 18614.479962

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Fri, May 13, 2016 at 7:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 13, 2016 at 7:08 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > Following are the performance results for read write test observed with
> > different numbers of "backend_flush_after".
> >
> > 1) backend_flush_after = 256kb (32*8kb), tps = 10841.178815
> > 2) backend_flush_after = 512kb (64*8kb), tps = 11098.702707
> > 3) backend_flush_after = 1MB (128*8kb), tps = 11434.964545
> > 4) backend_flush_after = 2MB (256*8kb), tps = 13477.089417
>
> So even at 2MB we don't come close to recovering all of the lost
> performance. Can you please test these three scenarios?
>
> 1. Default settings for *_flush_after
> 2. backend_flush_after=0, rest defaults
> 3. backend_flush_after=0, bgwriter_flush_after=0,
> wal_writer_flush_after=0, checkpoint_flush_after=0
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Hello,

> Please find the results for the following 3 scenarios with unpatched master:
>
> 1. Default settings for *_flush_after : TPS = 10677.662356
> 2. backend_flush_after=0, rest defaults : TPS = 18452.655936
> 3. backend_flush_after=0, bgwriter_flush_after=0,
> wal_writer_flush_after=0, checkpoint_flush_after=0 : TPS = 18614.479962

Thanks for these runs. These raw tps suggest that {backend,bgwriter}_flush_after should better be zero for this kind of load. Whether it should be the default is unclear yet, because as Andres pointed out this is one kind of load.

Note: these options have been added to smooth IOs over time and to help avoid "IO panics" on sync, especially with HDDs without a large BBU cache in front. The real benefit is that the performance is much more constant over time, and pg is much more responsive.

If you do other runs, it would be nice to report some stats about tps variability (eg latency & latency stddev, which should be in the report). For the experiments I did, I used to log "-P 1" output (tps every second) and to compute stats on these tps (avg, stddev, min, q1, median, q3, max, percentage of time with tps below a low threshold...), which provides some indication of the overall tps distribution.

--
Fabien
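As a concrete illustration of the kind of post-processing described above, a small awk pass over the "-P 1" progress lines can produce per-second tps statistics (the log file name is a placeholder):

# Capture per-second progress first, e.g.: ./pgbench -c 128 -j 128 -T 1800 -M prepared -P 1 postgres > pgbench.log
# Then summarise the per-second tps values
grep '^progress:' pgbench.log | awk '
{ tps = $4; sum += tps; sumsq += tps*tps; n++
  if (n == 1 || tps < min) min = tps
  if (tps > max) max = tps }
END { avg = sum/n
      printf "n=%d avg=%.1f min=%.1f max=%.1f stddev=%.1f\n", n, avg, min, max, sqrt(sumsq/n - avg*avg) }'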
On 2016-05-14 18:49:27 +0200, Fabien COELHO wrote:
> Hello,
>
> > Please find the results for the following 3 scenarios with unpatched master:
> >
> > 1. Default settings for *_flush_after : TPS = 10677.662356
> > 2. backend_flush_after=0, rest defaults : TPS = 18452.655936
> > 3. backend_flush_after=0, bgwriter_flush_after=0,
> > wal_writer_flush_after=0, checkpoint_flush_after=0 : TPS = 18614.479962
>
> Thanks for these runs.

Yes!

> These raw tps suggest that {backend,bgwriter}_flush_after should better be
> zero for this kind of load. Whether it should be the default is unclear yet,
> because as Andres pointed out this is one kind of load.

FWIW, I don't think {backend,bgwriter} are the same here. It's primarily backend that matters. This is treating the OS page cache as an extension of postgres' buffer cache. That really primarily matters for backend_, because otherwise backends spend time waiting for IO.

Andres
>> These raw tps suggest that {backend,bgwriter}_flush_after should better be
>> zero for this kind of load. Whether it should be the default is unclear yet,
>> because as Andres pointed out this is one kind of load.
>
> FWIW, I don't think {backend,bgwriter} are the same here. It's primarily
> backend that matters.

Indeed, I was a little hasty to put bgwriter together based on this report. I'm a little wary of "bgwriter_flush_after" though; I would not be surprised if someone reports some regressions, although probably not with a pgbench tpcb kind of load.

--
Fabien.
Hi All,
As we have seen a regression of more than 45% with "backend_flush_after" enabled and set to its default value (i.e. 128KB), or even when it is set to some higher value like 2MB, I think we should disable it so that it does not impact the read-write performance. Here is the attached patch for the same. Please have a look and let me know your thoughts on this. Thanks!
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Sun, May 15, 2016 at 1:26 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
These raw tps suggest that {backend,bgwriter}_flush_after should better be
zero for this kind of load. Whether it should be the default is unclear yet,
because as Andres pointed out this is one kind of load.
FWIW, I don't think {backend,bgwriter} are the same here. It's primarily
backend that matters.
Indeed, I was a little hasty to put bgwriter together based on this report.
I'm a little wary of "bgwriter_flush_after" though, I would not be surprised if someone reports some regressions, although probably not with a pgbench tpcb kind of load.
--
Fabien.
Hi,

On May 26, 2016 9:29:51 PM PDT, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Hi All,
>
> As we have seen the regression of more than 45% with "backend_flush_after"
> enabled and set to its default value i.e. 128KB or even when it is set to
> some higher value like 2MB, i think we should disable it such that it does
> not impact the read write performance and here is the attached patch for
> the same. Please have a look and let me know your thoughts on this. Thanks!

I don't think the situation is quite that simple. By *disabling* backend flushing it's also easy to see massive performance regressions. In situations where shared buffers was configured appropriately for the workload (not the case here IIRC).

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, May 12, 2016 at 10:49:06AM -0400, Robert Haas wrote:
> On Thu, May 12, 2016 at 8:39 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > Please find the test results for the following set of combinations taken at
> > 128 client counts:
> >
> > 1) Unpatched master, default *_flush_after : TPS = 10925.882396
> > 2) Unpatched master, *_flush_after=0 : TPS = 18613.343529
> > 3) That line removed with #if 0, default *_flush_after : TPS = 9856.809278
> > 4) That line removed with #if 0, *_flush_after=0 : TPS = 18158.648023
>
> I'm getting increasingly unhappy about the checkpoint flush control.
> I saw major regressions on my parallel COPY test, too:
>
> http://www.postgresql.org/message-id/CA+TgmoYoUQf9cGcpgyGNgZQHcY-gCcKRyAqQtDU8KFE4N6HVkA@mail.gmail.com
>
> That was a completely different machine (POWER7 instead of Intel,
> lousy disks instead of good ones) and a completely different workload.
> Considering these results, I think there's now plenty of evidence to
> suggest that this feature is going to be horrible for a large number
> of users. A 45% regression on pgbench is horrible. (Nobody wants to
> take even a 1% hit for snapshot too old, right?) Sure, it might not
> be that way for every user on every Linux system, and I'm sure it
> performed well on the systems where Andres benchmarked it, or he
> wouldn't have committed it. But our goal can't be to run well only on
> the newest hardware with the least-buggy kernel...

[This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item. Andres, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a 9.6 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within 72 hours of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping 9.6rc1. Consequently, I will appreciate your efforts toward speedy resolution. Thanks.

[1] http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com
On Fri, May 27, 2016 at 12:37 AM, Andres Freund <andres@anarazel.de> wrote:
> I don't think the situation is quite that simple. By *disabling* backend flushing it's also easy to see massive performance regressions. In situations where shared buffers was configured appropriately for the workload (not the case here IIRC).

On what kind of workload does setting backend_flush_after=0 represent a large regression vs. the default settings?

I think we have to consider that pgbench and parallel copy are pretty common things to want to do, and a non-zero default setting hurts those workloads a LOT. I have a really hard time believing that the benefits on other workloads are large enough to compensate for the slowdowns we're seeing here. We have nobody writing in to say that backend_flush_after>0 is making things way better for them, and Ashutosh and I have independently hit massive slowdowns on unrelated workloads. We weren't looking for slowdowns in this patch. We were trying to measure other stuff, and ended up tracing the behavior back to this patch. That really, really suggests that other people will have similar experiences.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-05-31 16:03:46 -0400, Robert Haas wrote:
> On Fri, May 27, 2016 at 12:37 AM, Andres Freund <andres@anarazel.de> wrote:
> > I don't think the situation is quite that simple. By *disabling* backend flushing it's also easy to see massive performance regressions. In situations where shared buffers was configured appropriately for the workload (not the case here IIRC).
>
> On what kind of workload does setting backend_flush_after=0 represent
> a large regression vs. the default settings?
>
> I think we have to consider that pgbench and parallel copy are pretty
> common things to want to do, and a non-zero default setting hurts
> those workloads a LOT.

I don't think pgbench's workload has much to do with reality. Even less so in the setup presented here.

The slowdown comes from the fact that default pgbench randomly, but uniformly, updates a large table. Which is slower with backend_flush_after if the workload is considerably bigger than shared_buffers, but, and that's a very important restriction, the workload at the same time largely fits into less than /proc/sys/vm/dirty_ratio / 20% (probably even 10% / /proc/sys/vm/dirty_background_ratio) of the free OS memory. The "trick" in that case is that very often, before a buffer has been written back to storage by the OS, it'll be re-dirtied by postgres. Which means triggering flushing by postgres increases the total amount of writes. That only matters if the kernel doesn't trigger writeback because of the above ratios, or because of time limits (30s / dirty_writeback_centisecs).

> I have a really hard time believing that the benefits on other
> workloads are large enough to compensate for the slowdowns we're
> seeing here.

As a random example, without looking for good parameters, on my laptop:
pgbench -i -q -s 1000

Cpu: i7-6820HQ
Ram: 24GB of memory
Storage: Samsung SSD 850 PRO 1TB, encrypted
postgres -c shared_buffers=6GB -c backend_flush_after=128 -c max_wal_size=100GB -c fsync=on -c synchronous_commit=off
pgbench -M prepared -c 16 -j 16 -T 520 -P 1 -n -N
(note the -N)

disabled:
latency average = 2.774 ms
latency stddev = 10.388 ms
tps = 5761.883323 (including connections establishing)
tps = 5762.027278 (excluding connections establishing)

128:
latency average = 2.543 ms
latency stddev = 3.554 ms
tps = 6284.069846 (including connections establishing)
tps = 6284.184570 (excluding connections establishing)

Note the latency stddev which is 3x better. And the improved throughput.

That's for a workload which even fits into the OS memory. Without backend flushing there's several periods looking like:

progress: 249.0 s, 7237.6 tps, lat 1.997 ms stddev 4.365
progress: 250.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 251.0 s, 1880.6 tps, lat 17.761 ms stddev 169.682
progress: 252.0 s, 6904.4 tps, lat 2.328 ms stddev 3.256

i.e. moments in which no transactions are executed. And that's on storage that can do 500MB/sec and tens of thousands of IOPS.

If you change to a workload that uses synchronous_commit, is bigger than OS memory and/or doesn't have very fast storage, the differences can be a *LOT* bigger.

In general, any workload which doesn't a) fit the above criteria of likely re-dirtying blocks it already dirtied before kernel-triggered writeback happens, or b) concurrently COPY into an individual file, is likely to be faster (or unchanged if within s_b) with backend flushing. Which means that transactional workloads that are bigger than the OS memory, or which have a non-uniform distribution leading to some locality, are likely to be faster. In practice those are *hugely* more likely than the uniform distribution that pgbench has.

Similarly, this *considerably* reduces the impact a concurrently running VACUUM or COPY has on concurrent queries. Because suddenly VACUUM/COPY can't create a couple gigabytes of dirty buffers which will be written back at some random point in time later, stalling everything.

I think the benefits of a more predictable (and often faster!) performance in a bunch of actual real-world-ish workloads are higher than optimizing for benchmarks.

> We have nobody writing in to say that
> backend_flush_after>0 is making things way better for them, and
> Ashutosh and I have independently hit massive slowdowns on unrelated
> workloads.

Actually, we have some evidence of that? Just so far not in this thread; which I don't find particularly surprising.

- Andres
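For readers who want to check the kernel-side thresholds referred to above, they are readable under /proc/sys/vm (or via sysctl); the commands below only inspect them:

# Inspect the Linux writeback thresholds mentioned above
cat /proc/sys/vm/dirty_ratio
cat /proc/sys/vm/dirty_background_ratio
cat /proc/sys/vm/dirty_writeback_centisecs
# or, equivalently:
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_writeback_centisecs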
On 2016-06-01 15:33:18 -0700, Andres Freund wrote:
> Cpu: i7-6820HQ
> Ram: 24GB of memory
> Storage: Samsung SSD 850 PRO 1TB, encrypted
> postgres -c shared_buffers=6GB -c backend_flush_after=128 -c max_wal_size=100GB -c fsync=on -c synchronous_commit=off
> pgbench -M prepared -c 16 -j 16 -T 520 -P 1 -n -N

Using a scale 5000 database, with WAL compression enabled (otherwise the whole thing is too slow in both cases), and 64 clients gives:

disabled:
latency average = 11.896 ms
latency stddev = 42.187 ms
tps = 5378.025369 (including connections establishing)
tps = 5378.248569 (excluding connections establishing)

128:
latency average = 11.002 ms
latency stddev = 10.621 ms
tps = 5814.586813 (including connections establishing)
tps = 5814.840249 (excluding connections establishing)

With flushing disabled, roughly every 30s you see:

progress: 150.0 s, 6223.3 tps, lat 10.036 ms stddev 9.521
progress: 151.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 152.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 153.0 s, 4952.9 tps, lat 39.050 ms stddev 249.839
progress: 172.0 s, 4888.0 tps, lat 12.851 ms stddev 11.507
progress: 173.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 174.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 175.0 s, 4636.8 tps, lat 41.421 ms stddev 268.416
progress: 196.0 s, 1119.2 tps, lat 9.618 ms stddev 8.321
progress: 197.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 198.0 s, 1920.9 tps, lat 94.375 ms stddev 429.756
progress: 199.0 s, 5260.8 tps, lat 12.087 ms stddev 11.595

With backend flushing enabled there's not a single such pause. If you use spinning rust instead of SSDs, the pauses aren't 1-2s anymore, but easily 10-30s.

Andres
On Wed, Jun 01, 2016 at 03:33:18PM -0700, Andres Freund wrote:
> On 2016-05-31 16:03:46 -0400, Robert Haas wrote:
> > On Fri, May 27, 2016 at 12:37 AM, Andres Freund <andres@anarazel.de> wrote:
> > > I don't think the situation is quite that simple. By *disabling* backend flushing it's also easy to see massive performance regressions. In situations where shared buffers was configured appropriately for the workload (not the case here IIRC).
> >
> > On what kind of workload does setting backend_flush_after=0 represent
> > a large regression vs. the default settings?
> >
> > I think we have to consider that pgbench and parallel copy are pretty
> > common things to want to do, and a non-zero default setting hurts
> > those workloads a LOT.
>
> I don't think pgbench's workload has much to do with reality. Even less
> so in the setup presented here.
>
> The slowdown comes from the fact that default pgbench randomly, but
> uniformly, updates a large table. Which is slower with
> backend_flush_after if the workload is considerably bigger than
> shared_buffers, but, and that's a very important restriction, the
> workload at the same time largely fits in to less than
> /proc/sys/vm/dirty_ratio / 20% (probably even 10% /
> /proc/sys/vm/dirty_background_ratio) of the free os memory.

Looking at some of the top hits for 'postgresql shared_buffers':

https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
https://www.postgresql.org/docs/current/static/runtime-config-resource.html
http://rhaas.blogspot.com/2012/03/tuning-sharedbuffers-and-walbuffers.html
https://www.keithf4.com/a-large-database-does-not-mean-large-shared_buffers/
http://www.cybertec.at/2014/02/postgresql-9-3-shared-buffers-performance-1/

Choices mentioned (some in comments on a main post):

1. .25 * RAM
2. min(8GB, .25 * RAM)
3. Sizing procedure that arrived at 4GB for 900GB of data
4. Equal to data size

Thus, it is not outlandish to have the write portion of a working set exceed shared_buffers while remaining under 10-20% of system RAM. Choice (4) won't achieve that, but (2) and (3) may achieve it given a mere 64 GiB of RAM. Choice (1) can go either way; if read-mostly data occupies half of shared_buffers, then writes passing through the other 12.5% of system RAM may exhibit the property you describe.

Incidentally, a typical reason for a site to use low shared_buffers is to avoid the latency spikes that *_flush_after combat:
https://www.postgresql.org/message-id/flat/4DDE2705020000250003DD4F%40gw.wicourts.gov

> > I have a really hard time believing that the benefits on other
> > workloads are large enough to compensate for the slowdowns we're
> > seeing here.
>
> As a random example, without looking for good parameters, on my laptop:
> pgbench -i -q -s 1000
>
> Cpu: i7-6820HQ
> Ram: 24GB of memory
> Storage: Samsung SSD 850 PRO 1TB, encrypted
> postgres -c shared_buffers=6GB -c backend_flush_after=128 -c max_wal_size=100GB -c fsync=on -c synchronous_commit=off
> pgbench -M prepared -c 16 -j 16 -T 520 -P 1 -n -N
> (note the -N)
>
> disabled:
> latency average = 2.774 ms
> latency stddev = 10.388 ms
> tps = 5761.883323 (including connections establishing)
> tps = 5762.027278 (excluding connections establishing)
>
> 128:
> latency average = 2.543 ms
> latency stddev = 3.554 ms
> tps = 6284.069846 (including connections establishing)
> tps = 6284.184570 (excluding connections establishing)
>
> Note the latency dev which is 3x better. And the improved throughput.

That is an improvement. The workload is no less realistic than the ones having shown regressions.

> Which means that transactional workloads that are bigger than the OS
> memory, or which have a non-uniform distribution leading to some
> locality, are likely to be faster. In practice those are *hugely* more
> likely than the uniform distribution that pgbench has.

That is formally true; non-benchmark workloads rarely issue uniform writes. However, enough non-benchmark workloads have too little locality to benefit from caches. Those will struggle against *_flush_after like uniform writes do, so discounting uniform writes wouldn't simplify this project.

Today's defaults for *_flush_after greatly smooth and accelerate performance for one class of plausible workloads while greatly slowing a different class of plausible workloads.

nm
On 2016-06-03 01:57:33 -0400, Noah Misch wrote:
> > Which means that transactional workloads that are bigger than the OS
> > memory, or which have a non-uniform distribution leading to some
> > locality, are likely to be faster. In practice those are *hugely* more
> > likely than the uniform distribution that pgbench has.
>
> That is formally true; non-benchmark workloads rarely issue uniform writes.
> However, enough non-benchmark workloads have too little locality to benefit
> from caches. Those will struggle against *_flush_after like uniform writes
> do, so discounting uniform writes wouldn't simplify this project.

But such workloads rarely will hit the point of constantly re-dirtying already dirty pages in kernel memory within 30s.

> Today's defaults for *_flush_after greatly smooth and accelerate performance
> for one class of plausible workloads while greatly slowing a different class
> of plausible workloads.

I don't think checkpoint_flush_after is in that class, due to the fsync()s we already emit at the end of checkpoints.

Greetings,

Andres Freund
On Thu, Jun 02, 2016 at 11:09:22PM -0700, Andres Freund wrote: > On 2016-06-03 01:57:33 -0400, Noah Misch wrote: > > > Which means that transactional workloads that are bigger than the OS > > > memory, or which have a non-uniform distribution leading to some > > > locality, are likely to be faster. In practice those are *hugely* more > > > likely than the uniform distribution that pgbench has. > > > > That is formally true; non-benchmark workloads rarely issue uniform writes. > > However, enough non-benchmark workloads have too little locality to benefit > > from caches. Those will struggle against *_flush_after like uniform writes > > do, so discounting uniform writes wouldn't simplify this project. > > But such workloads rarely will hit the point of constantly re-dirtying > already dirty pages in kernel memory within 30s. Rarely, yes. Not rarely enough to discount. > > Today's defaults for *_flush_after greatly smooth and accelerate performance > > for one class of plausible workloads while greatly slowing a different class > > of plausible workloads. The usual PostgreSQL handling of a deeply workload-dependent performance feature is to disable it by default. That's what I'm inclined to do here, for every GUC the feature added. Sophisticated users will nonetheless fully exploit this valuable mechanism in 9.6. > I don't think checkpoint_flush_after is in that class, due to the > fsync()s we already emit at the end of checkpoints. That's a promising hypothesis. Some future project could impose a nonzero default checkpoint_flush_after, having demonstrated that it imposes negligible harm in the plausible cases it does not help.
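[Editor's sketch, assuming a 9.6 build and a database named postgres: a quick way to list the four *_flush_after GUCs under discussion, with their built-in defaults and change context.]

psql postgres -c "SELECT name, setting, unit, boot_val, context
                  FROM pg_settings
                  WHERE name LIKE '%flush_after'
                  ORDER BY name;"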
Hello Noah,

> The usual PostgreSQL handling of a deeply workload-dependent performance
> feature is to disable it by default. That's what I'm inclined to do here, for
> every GUC the feature added. Sophisticated users will nonetheless fully
> exploit this valuable mechanism in 9.6.

>> I don't think checkpoint_flush_after is in that class, due to the
>> fsync()s we already emit at the end of checkpoints.

I agree with Andres that checkpoint_flush_after should not be treated like the other *_flush_after settings.

> That's a promising hypothesis.

This is not a hypothesis but a proven fact. There have been hundreds of hours of pgbench runs to test and demonstrate the positive impact in various reasonable configurations.

> Some future project could impose a nonzero default
> checkpoint_flush_after, having demonstrated that it imposes negligible
> harm in the plausible cases it does not help.

I think that the significant and general benefit of checkpoint_flush_after has been amply demonstrated and reported on the hackers thread at various points during the development of the feature, and that it is safe, and even highly advisable, to keep it on by default.

The key point is that it flushes sorted buffers, so it mostly results in sequential writes. In a lot of cases it avoids the situation where the final sync at the end of the checkpoint generates so many IOs that PostgreSQL is effectively offline until the fsync completes, from seconds to minutes at a time.

The other *_flush_after settings do not benefit from any buffer reordering, so their positive impact is perhaps more questionable, and I would be okay if those are disabled by default.

--
Fabien.
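[Editor's sketch, not from the thread: one way to watch for the checkpoint-fsync stalls described above, using the standard pg_stat_bgwriter view; the database name is an assumption.]

# Both *_time columns are in milliseconds; a checkpoint_sync_time that
# dwarfs checkpoint_write_time is the stall pattern described above.
psql postgres -c "SELECT checkpoints_timed, checkpoints_req,
                         checkpoint_write_time, checkpoint_sync_time
                  FROM pg_stat_bgwriter;"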
On 2016-06-03 10:48:18 -0400, Noah Misch wrote: > On Thu, Jun 02, 2016 at 11:09:22PM -0700, Andres Freund wrote: > > > Today's defaults for *_flush_after greatly smooth and accelerate performance > > > for one class of plausible workloads while greatly slowing a different class > > > of plausible workloads. > > The usual PostgreSQL handling of a deeply workload-dependent performance > feature is to disable it by default. Meh. That's not actually all that often the case. This unstable performance issue, with the minute-long stalls, is the worst and most frequent production problem people hit with postgres in my experience, besides issues with autovacuum. Ignoring that is just hurting our users. > > I don't think checkpoint_flush_after is in that class, due to the > > fsync()s we already emit at the end of checkpoints. > > That's a promising hypothesis. Some future project could impose a nonzero > default checkpoint_flush_after, having demonstrated that it imposes negligible > harm in the plausible cases it does not help. Have you actually looked at the thread with all the numbers? This isn't an issue that has been decided willy-nilly. It's been discussed *over months*. Greetings, Andres Freund
On 2016-06-03 09:24:28 -0700, Andres Freund wrote: > This unstable performance issue, with the minute-long stalls, is the > worst and most frequent production problem people hit with postgres in > my experience, besides issues with autovacuum. Ignoring that is just > hurting our users. Oh, and a good proportion of the "autovacuum causes my overall systems to slow down unacceptably" issues come from exactly this.
On Fri, Jun 3, 2016 at 2:09 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-03 01:57:33 -0400, Noah Misch wrote:
>> > Which means that transactional workloads that are bigger than the OS
>> > memory, or which have a non-uniform distribution leading to some
>> > locality, are likely to be faster. In practice those are *hugely* more
>> > likely than the uniform distribution that pgbench has.
>>
>> That is formally true; non-benchmark workloads rarely issue uniform writes.
>> However, enough non-benchmark workloads have too little locality to benefit
>> from caches. Those will struggle against *_flush_after like uniform writes
>> do, so discounting uniform writes wouldn't simplify this project.
>
> But such workloads rarely will hit the point of constantly re-dirtying
> already dirty pages in kernel memory within 30s.

I don't know why not. It's not exactly uncommon to update the same data frequently, nor is it uncommon for the hot data set to be larger than shared_buffers and smaller than the OS cache, even significantly smaller. Any workload of that type is going to have this problem regardless of whether the access pattern is uniform. If you have a highly non-uniform access pattern then you just have this problem on the small subset of the data that is hot.

I think that asserting that there's something wrong with this test is just wrong. Many people have done many tests very similar to this one on Linux systems over many years to assess PostgreSQL performance. It's a totally legitimate test configuration. Indeed, I'd argue that this is actually a pretty common real-world scenario. Most people's hot data fits in memory, because if it doesn't, their performance sucks so badly that they either redesign something or buy more memory until it does. Also, most people have more hot data than shared_buffers. There are some who don't because their data set is very small, and that's nice when it happens; and there are others who don't because they carefully crank shared_buffers up high enough that everything fits, but most don't, either because it causes other problems, or because they just don't think to tinker with it, or because they set it up that way initially but then the data grows over time. There are a LOT of people running with 8GB or less of shared_buffers and a working set that is in the tens of GB.

Now, what varies IME is how much total RAM there is in the system and how frequently they write that data, as opposed to reading it. If they are on a tightly RAM-constrained system, then this situation won't arise because they won't be under the dirty background limit. And if they aren't writing that much data then they'll be fine too. But even putting all of that together I really don't see why you're trying to suggest that this is some bizarre set of circumstances that should only rarely happen in the real world. I think it clearly does happen, and I doubt it's particularly uncommon. If your testing didn't discover this scenario, I feel rather strongly that that's an oversight in your testing rather than a problem with the scenario.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-06-03 12:31:58 -0400, Robert Haas wrote:
> Now, what varies IME is how much total RAM there is in the system and
> how frequently they write that data, as opposed to reading it. If
> they are on a tightly RAM-constrained system, then this situation
> won't arise because they won't be under the dirty background limit.
> And if they aren't writing that much data then they'll be fine too.
> But even putting all of that together I really don't see why you're
> trying to suggest that this is some bizarre set of circumstances that
> should only rarely happen in the real world.

I'm saying that if that happens constantly, you're better off adjusting shared_buffers, because you're likely already suffering from latency spikes and other issues. Optimizing for massive random write throughput in a system that's not configured appropriately, at the cost of making well-configured systems suffer, doesn't seem like a good tradeoff to me.

Note that other operating systems like Windows and FreeBSD *already* write back much more aggressively (independent of this change). I seem to recall you yourself quite passionately arguing that the Linux behaviour around this is broken.

Greetings,

Andres Freund
On Fri, Jun 3, 2016 at 12:39 PM, Andres Freund <andres@anarazel.de> wrote: > On 2016-06-03 12:31:58 -0400, Robert Haas wrote: >> Now, what varies IME is how much total RAM there is in the system and >> how frequently they write that data, as opposed to reading it. If >> they are on a tightly RAM-constrained system, then this situation >> won't arise because they won't be under the dirty background limit. >> And if they aren't writing that much data then they'll be fine too. >> But even putting all of that together I really don't see why you're >> trying to suggest that this is some bizarre set of circumstances that >> should only rarely happen in the real world. > > I'm saying that if that happens constantly, you're better off adjusting > shared_buffers, because you're likely already suffering from latency > spikes and other issues. Optimizing for massive random write throughput > in a system that's not configured appropriately, at the cost of well > configured systems to suffer, doesn't seem like a good tradeoff to me. I really don't get it. There's nothing in any set of guidelines for setting shared_buffers that I've ever seen which would cause people to avoid this scenario. You're the first person I've ever heard describe this as a misconfiguration. > Note that other operating systems like windows and freebsd *alreaddy* > write back much more aggressively (independent of this change). I seem > to recall you yourself being quite passionately arguing that the linux > behaviour around this is broken. Sure, but being unhappy about the Linux behavior doesn't mean that I want our TPS on Linux to go down. Whether I like the behavior or not, we have to live with it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Jun 3, 2016 at 12:39 PM, Andres Freund <andres@anarazel.de> wrote: >> Note that other operating systems like windows and freebsd *alreaddy* >> write back much more aggressively (independent of this change). I seem >> to recall you yourself being quite passionately arguing that the linux >> behaviour around this is broken. > Sure, but being unhappy about the Linux behavior doesn't mean that I > want our TPS on Linux to go down. Whether I like the behavior or not, > we have to live with it. Yeah. Bug or not, it's reality for lots of our users. regards, tom lane
On 2016-06-03 13:33:31 -0400, Robert Haas wrote:
> On Fri, Jun 3, 2016 at 12:39 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-06-03 12:31:58 -0400, Robert Haas wrote:
> >> Now, what varies IME is how much total RAM there is in the system and
> >> how frequently they write that data, as opposed to reading it. If
> >> they are on a tightly RAM-constrained system, then this situation
> >> won't arise because they won't be under the dirty background limit.
> >> And if they aren't writing that much data then they'll be fine too.
> >> But even putting all of that together I really don't see why you're
> >> trying to suggest that this is some bizarre set of circumstances that
> >> should only rarely happen in the real world.
> >
> > I'm saying that if that happens constantly, you're better off adjusting
> > shared_buffers, because you're likely already suffering from latency
> > spikes and other issues. Optimizing for massive random write throughput
> > in a system that's not configured appropriately, at the cost of well
> > configured systems to suffer, doesn't seem like a good tradeoff to me.
>
> I really don't get it. There's nothing in any set of guidelines for
> setting shared_buffers that I've ever seen which would cause people to
> avoid this scenario.

The "roughly 1/4 of memory" guideline already mostly avoids it? It's hard to constantly re-dirty a written-back page within 30s, before the 10% (background) / 20% (foreground) limits apply, if your shared buffers are larger than those limits (which only apply to *available*, not total, memory btw).

> You're the first person I've ever heard describe this as a
> misconfiguration.

Huh? People tried addressing this problem for *years* with bigger / smaller shared buffers, but couldn't easily.

I'm inclined to give up and disable backend_flush_after (not the rest), because it's new and by far the "riskiest". But I do think it's a disservice to the majority of our users.

Greetings,

Andres Freund
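[Editor's sketch, not from the thread: a rough look at the kernel writeback thresholds the 10%/20% figures above refer to. The /proc paths and sysctls are standard Linux; using MemAvailable as a stand-in for the kernel's notion of dirtyable memory is an approximation, and the printed values are illustrative only.]

cat /proc/sys/vm/dirty_background_ratio   # commonly 10
cat /proc/sys/vm/dirty_ratio              # commonly 20

# Approximate size of those thresholds on this box (MemAvailable needs kernel >= 3.14)
avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
echo "background writeback starts around $((avail_kb / 1024 * 10 / 100)) MB of dirty data"
echo "writers start blocking around $((avail_kb / 1024 * 20 / 100)) MB of dirty data"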
On 2016-06-03 13:42:09 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Fri, Jun 3, 2016 at 12:39 PM, Andres Freund <andres@anarazel.de> wrote:
> >> Note that other operating systems like windows and freebsd *alreaddy*
> >> write back much more aggressively (independent of this change). I seem
> >> to recall you yourself being quite passionately arguing that the linux
> >> behaviour around this is broken.
>
> > Sure, but being unhappy about the Linux behavior doesn't mean that I
> > want our TPS on Linux to go down. Whether I like the behavior or not,
> > we have to live with it.
>
> Yeah. Bug or not, it's reality for lots of our users.

That means we need to address it. Which is what the feature does. So yes, some Linux-specific tuning might need to be tweaked in the more extreme cases. But that's better than relying on Linux's extreme writeback behaviour, which changes every few releases to boot. From the tuning side this makes shared_buffers sizing more consistent across unixoid OSs.

Andres
On Fri, Jun 3, 2016 at 1:43 PM, Andres Freund <andres@anarazel.de> wrote: >> I really don't get it. There's nothing in any set of guidelines for >> setting shared_buffers that I've ever seen which would cause people to >> avoid this scenario. > > The "roughly 1/4" of memory guideline already mostly avoids it? It's > hard to constantly re-dirty a written-back page within 30s, before the > 10% (background)/20% (foreground) limits apply; if your shared buffers > are larger than the 10%/20% limits (which only apply to *available* not > total memory btw). I've always heard that guideline as "roughly 1/4, but not more than about 8GB" - and the number of people with more than 32GB of RAM is going to just keep going up. >> You're the first person I've ever heard describe this as a >> misconfiguration. > > Huh? People tried addressing this problem for *years* with bigger / > smaller shared buffers, but couldn't easily. I'm saying that setting 8GB of shared_buffers on a system with lotsamem is not widely regarded as misconfiguration. > I'm inclined to give up and disable backend_flush_after (not the rest), > because it's new and by far the "riskiest". But I do think it's a > disservice for the majority of our users. I think that's the right course of action. I wasn't arguing for disabling either of the other two. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-06-03 13:47:58 -0400, Robert Haas wrote: > On Fri, Jun 3, 2016 at 1:43 PM, Andres Freund <andres@anarazel.de> wrote: > >> I really don't get it. There's nothing in any set of guidelines for > >> setting shared_buffers that I've ever seen which would cause people to > >> avoid this scenario. > > > > The "roughly 1/4" of memory guideline already mostly avoids it? It's > > hard to constantly re-dirty a written-back page within 30s, before the > > 10% (background)/20% (foreground) limits apply; if your shared buffers > > are larger than the 10%/20% limits (which only apply to *available* not > > total memory btw). > > I've always heard that guideline as "roughly 1/4, but not more than > about 8GB" - and the number of people with more than 32GB of RAM is > going to just keep going up. I think that upper limit is wrong. But even disregarding that: To hit the issue in that case you have to access more data than shared_buffers (8GB), and very frequently re-dirty already dirtied data. So you're basically (on a very rough approximation) going to have to write more than 8GB within 30s (256MB/s). Unless your hardware can handle that many mostly random writes, you are likely to hit the worst case behaviour of pending writeback piling up and stalls. > > I'm inclined to give up and disable backend_flush_after (not the rest), > > because it's new and by far the "riskiest". But I do think it's a > > disservice for the majority of our users. > > I think that's the right course of action. I wasn't arguing for > disabling either of the other two. Noah was... Greetings, Andres Freund
On Fri, Jun 3, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote:
>> I've always heard that guideline as "roughly 1/4, but not more than
>> about 8GB" - and the number of people with more than 32GB of RAM is
>> going to just keep going up.
>
> I think that upper limit is wrong. But even disregarding that:

Many people think the upper limit should be even lower, based on good, practical experience. Like I've seen plenty of people recommend 2-2.5GB.

> To hit the issue in that case you have to access more data than
> shared_buffers (8GB), and very frequently re-dirty already dirtied
> data. So you're basically (on a very rough approximation) going to have
> to write more than 8GB within 30s (256MB/s). Unless your hardware can
> handle that many mostly random writes, you are likely to hit the worst
> case behaviour of pending writeback piling up and stalls.

I'm not entirely sure that this is true, because my experience is that the background writing behavior under Linux is not very aggressive. I agree you need a working set >8GB, but I think if you have that you might not actually need to write data this quickly, because if Linux decides to only do background writing (as opposed to blocking processes) it may not actually keep up.

Also, 256MB/s is not actually all that crazy a write rate. I mean, it's a lot, but even if each random UPDATE touched only 1 8kB block, that would be about 32k TPS. When you add in index updates and TOAST traffic, the actual number of block writes per TPS could be considerably higher, so we might be talking about something <10k TPS. That's well within the range of what people try to do with PostgreSQL, at least IME.

>> > I'm inclined to give up and disable backend_flush_after (not the rest),
>> > because it's new and by far the "riskiest". But I do think it's a
>> > disservice for the majority of our users.
>>
>> I think that's the right course of action. I wasn't arguing for
>> disabling either of the other two.
>
> Noah was...

I know, but I'm not Noah. :-)

We have no evidence of the other settings causing any problems yet, so I see no reason to second-guess the decision to leave them on by default at this stage. Other people may disagree with that analysis, and that's fine, but my analysis is that the case for disable-by-default has been made for backend_flush_after but not the others. I also agree that backend_flush_after is much more dangerous on theoretical grounds; the checkpointer is in a good position to sort the requests to achieve locality, but backends are not. And in fact I think what the testing shows so far is that when they can't achieve locality, backend flush control sucks. When it can, it's neutral or positive. But I really see no reason to believe that that's likely to be true on general workloads.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
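[Editor's sketch, spelling out the back-of-envelope arithmetic in the last two messages; the inputs are the 8GB, 30s, and 8kB figures quoted above.]

# ~8GB of buffers re-dirtied inside a ~30s writeback window:
echo "$(( 8 * 1024 / 30 )) MB/s"        # ~273 MB/s, quoted above as roughly 256MB/s
# at one 8kB heap block dirtied per transaction, 256MB/s corresponds to:
echo "$(( 256 * 1024 / 8 )) writes/s"   # 32768, i.e. the ~32k TPS figure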
On 2016-06-03 15:17:06 -0400, Robert Haas wrote:
> On Fri, Jun 3, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote:
> >> I've always heard that guideline as "roughly 1/4, but not more than
> >> about 8GB" - and the number of people with more than 32GB of RAM is
> >> going to just keep going up.
> >
> > I think that upper limit is wrong. But even disregarding that:
>
> Many people think the upper limit should be even lower, based on good,
> practical experience. Like I've seen plenty of people recommend
> 2-2.5GB.

Which, imo, is largely because of the writeback issue. And the locking around buffer replacement, if you're doing it highly concurrently (which is now mostly solved).

> > To hit the issue in that case you have to access more data than
> > shared_buffers (8GB), and very frequently re-dirty already dirtied
> > data. So you're basically (on a very rough approximation) going to have
> > to write more than 8GB within 30s (256MB/s). Unless your hardware can
> > handle that many mostly random writes, you are likely to hit the worst
> > case behaviour of pending writeback piling up and stalls.
>
> I'm not entire sure that this is true, because my experience is that
> the background writing behavior under Linux is not very aggressive. I
> agree you need a working set >8GB, but I think if you have that you
> might not actually need to write data this quickly, because if Linux
> decides to only do background writing (as opposed to blocking
> processes) it may not actually keep up.

But that's *bad*. Then a checkpoint comes around and latency and throughput are shot to hell while the writeback from the fsyncs is preventing any concurrent write activity. And if it's not keeping up before, it's now really bad.

> And in fact I
> think what the testing shows so far is that when they can't achieve
> locality, backend flush control sucks.

FWIW, I don't think that's true generally enough. For pgbench bigger-than-20%-of-avail-memory there's pretty much no locality, and backend flushing helps considerably.

Andres
On Fri, Jun 03, 2016 at 03:17:06PM -0400, Robert Haas wrote: > On Fri, Jun 3, 2016 at 2:20 PM, Andres Freund <andres@anarazel.de> wrote: > >> > I'm inclined to give up and disable backend_flush_after (not the rest), > >> > because it's new and by far the "riskiest". But I do think it's a > >> > disservice for the majority of our users. > >> > >> I think that's the right course of action. I wasn't arguing for > >> disabling either of the other two. > > > > Noah was... > > I know, but I'm not Noah. :-) > > We have no evidence of the other settings causing any problems yet, so > I see no reason to second-guess the decision to leave them on by > default at this stage. Other people may disagree with that analysis, > and that's fine, but my analysis is that the case for > disable-by-default has been made for backend_flush_after but not the > others. I also agree that backend_flush_after is much more dangerous > on theoretical grounds; the checkpointer is in a good position to sort > the requests to achieve locality, but backends are not. Disabling just backend_flush_after by default works for me, so let's do that. Though I would not elect, on behalf of PostgreSQL, the risk of enabling {bgwriter,checkpoint,wal_writer}_flush_after by default, a reasonable person may choose to do so. I doubt the community could acquire the data necessary to ascertain which choice has more utility.
On 2016-06-03 20:41:33 -0400, Noah Misch wrote: > Disabling just backend_flush_after by default works for me, so let's do that. > Though I would not elect, on behalf of PostgreSQL, the risk of enabling > {bgwriter,checkpoint,wal_writer}_flush_after by default, a reasonable person > may choose to do so. I doubt the community could acquire the data necessary > to ascertain which choice has more utility. Note that wal_writer_flush_after was essentially already enabled before, just a lot more *aggressively*.
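[Editor's sketch, assuming the default does end up as 0 and that you have superuser access and a database named postgres: a site that wants the smoother-latency behaviour back can still opt in per installation. The 512kB value (64 x 8kB pages) is purely illustrative, not a recommendation from the thread.]

psql postgres -c "ALTER SYSTEM SET backend_flush_after = '512kB';"
psql postgres -c "SELECT pg_reload_conf();"
psql postgres -c "SHOW backend_flush_after;"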
On Sun, May 29, 2016 at 01:26:03AM -0400, Noah Misch wrote: > On Thu, May 12, 2016 at 10:49:06AM -0400, Robert Haas wrote: > > On Thu, May 12, 2016 at 8:39 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > Please find the test results for the following set of combinations taken at > > > 128 client counts: > > > > > > 1) Unpatched master, default *_flush_after : TPS = 10925.882396 > > > > > > 2) Unpatched master, *_flush_after=0 : TPS = 18613.343529 > > > > > > 3) That line removed with #if 0, default *_flush_after : TPS = 9856.809278 > > > > > > 4) That line removed with #if 0, *_flush_after=0 : TPS = 18158.648023 > > > > I'm getting increasingly unhappy about the checkpoint flush control. > > I saw major regressions on my parallel COPY test, too: > > > > http://www.postgresql.org/message-id/CA+TgmoYoUQf9cGcpgyGNgZQHcY-gCcKRyAqQtDU8KFE4N6HVkA@mail.gmail.com > > > > That was a completely different machine (POWER7 instead of Intel, > > lousy disks instead of good ones) and a completely different workload. > > Considering these results, I think there's now plenty of evidence to > > suggest that this feature is going to be horrible for a large number > > of users. A 45% regression on pgbench is horrible. (Nobody wants to > > take even a 1% hit for snapshot too old, right?) Sure, it might not > > be that way for every user on every Linux system, and I'm sure it > > performed well on the systems where Andres benchmarked it, or he > > wouldn't have committed it. But our goal can't be to run well only on > > the newest hardware with the least-buggy kernel... > > [This is a generic notification.] > > The above-described topic is currently a PostgreSQL 9.6 open item. Andres, > since you committed the patch believed to have created it, you own this open > item. If some other commit is more relevant or if this does not belong as a > 9.6 open item, please let us know. Otherwise, please observe the policy on > open item ownership[1] and send a status update within 72 hours of this > message. Include a date for your subsequent status update. Testers may > discover new open items at any time, and I want to plan to get them all fixed > well in advance of shipping 9.6rc1. Consequently, I will appreciate your > efforts toward speedy resolution. Thanks. > > [1] http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com This PostgreSQL 9.6 open item is past due for your status update. Kindly send a status update within 24 hours, and include a date for your subsequent status update. Refer to the policy on open item ownership: http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com
On 2016-06-08 23:00:15 -0400, Noah Misch wrote: > On Sun, May 29, 2016 at 01:26:03AM -0400, Noah Misch wrote: > > On Thu, May 12, 2016 at 10:49:06AM -0400, Robert Haas wrote: > > > On Thu, May 12, 2016 at 8:39 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > Please find the test results for the following set of combinations taken at > > > > 128 client counts: > > > > > > > > 1) Unpatched master, default *_flush_after : TPS = 10925.882396 > > > > > > > > 2) Unpatched master, *_flush_after=0 : TPS = 18613.343529 > > > > > > > > 3) That line removed with #if 0, default *_flush_after : TPS = 9856.809278 > > > > > > > > 4) That line removed with #if 0, *_flush_after=0 : TPS = 18158.648023 > > > > > > I'm getting increasingly unhappy about the checkpoint flush control. > > > I saw major regressions on my parallel COPY test, too: > > > > > > http://www.postgresql.org/message-id/CA+TgmoYoUQf9cGcpgyGNgZQHcY-gCcKRyAqQtDU8KFE4N6HVkA@mail.gmail.com > > > > > > That was a completely different machine (POWER7 instead of Intel, > > > lousy disks instead of good ones) and a completely different workload. > > > Considering these results, I think there's now plenty of evidence to > > > suggest that this feature is going to be horrible for a large number > > > of users. A 45% regression on pgbench is horrible. (Nobody wants to > > > take even a 1% hit for snapshot too old, right?) Sure, it might not > > > be that way for every user on every Linux system, and I'm sure it > > > performed well on the systems where Andres benchmarked it, or he > > > wouldn't have committed it. But our goal can't be to run well only on > > > the newest hardware with the least-buggy kernel... > > > > [This is a generic notification.] > > > > The above-described topic is currently a PostgreSQL 9.6 open item. Andres, > > since you committed the patch believed to have created it, you own this open > > item. If some other commit is more relevant or if this does not belong as a > > 9.6 open item, please let us know. Otherwise, please observe the policy on > > open item ownership[1] and send a status update within 72 hours of this > > message. Include a date for your subsequent status update. Testers may > > discover new open items at any time, and I want to plan to get them all fixed > > well in advance of shipping 9.6rc1. Consequently, I will appreciate your > > efforts toward speedy resolution. Thanks. > > > > [1] http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com > > This PostgreSQL 9.6 open item is past due for your status update. Kindly send > a status update within 24 hours, and include a date for your subsequent status > update. Refer to the policy on open item ownership: > http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com I'm writing a patch right now, planning to post it later today, commit it tomorrow. Greetings, Andres Freund
On 2016-06-09 14:37:31 -0700, Andres Freund wrote: > I'm writing a patch right now, planning to post it later today, commit > it tomorrow. Attached.
On Fri, Jun 10, 2016 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-09 14:37:31 -0700, Andres Freund wrote:
>> I'm writing a patch right now, planning to post it later today, commit
>> it tomorrow.
>
> Attached.

-       /* see bufmgr.h: OS dependent default */
-       DEFAULT_BACKEND_FLUSH_AFTER, 0, WRITEBACK_MAX_PENDING_FLUSHES,
+       0, 0, WRITEBACK_MAX_PENDING_FLUSHES,

Wouldn't it be better to still use DEFAULT_BACKEND_FLUSH_AFTER here, and just enforce it to 0 for all the OSes at the top of bufmgr.h?

--
Michael
On 2016-06-10 09:34:33 +0900, Michael Paquier wrote: > On Fri, Jun 10, 2016 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote: > > On 2016-06-09 14:37:31 -0700, Andres Freund wrote: > >> I'm writing a patch right now, planning to post it later today, commit > >> it tomorrow. > > > > Attached. > > - /* see bufmgr.h: OS dependent default */ > - DEFAULT_BACKEND_FLUSH_AFTER, 0, WRITEBACK_MAX_PENDING_FLUSHES, > + 0, 0, WRITEBACK_MAX_PENDING_FLUSHES, > Wouldn't it be better to still use LT_BACKEND_FLUSH_AFTER here, and > just enforce it to 0 for all the OSes at the top of bufmgr.h? What would be the point? The only reason for DEFAULT_BACKEND_FLUSH_AFTER was that it differed between operating systems. Now it doesn't anymore. Andres
On Fri, Jun 10, 2016 at 9:37 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-06-10 09:34:33 +0900, Michael Paquier wrote: >> On Fri, Jun 10, 2016 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote: >> > On 2016-06-09 14:37:31 -0700, Andres Freund wrote: >> >> I'm writing a patch right now, planning to post it later today, commit >> >> it tomorrow. >> > >> > Attached. >> >> - /* see bufmgr.h: OS dependent default */ >> - DEFAULT_BACKEND_FLUSH_AFTER, 0, WRITEBACK_MAX_PENDING_FLUSHES, >> + 0, 0, WRITEBACK_MAX_PENDING_FLUSHES, >> Wouldn't it be better to still use LT_BACKEND_FLUSH_AFTER here, and >> just enforce it to 0 for all the OSes at the top of bufmgr.h? > > What would be the point? The only reason for DEFAULT_BACKEND_FLUSH_AFTER > was that it differed between operating systems. Now it doesn't anymore. Then why do you keep it defined? -- Michael
On 2016-06-10 09:41:09 +0900, Michael Paquier wrote: > On Fri, Jun 10, 2016 at 9:37 AM, Andres Freund <andres@anarazel.de> wrote: > > On 2016-06-10 09:34:33 +0900, Michael Paquier wrote: > >> On Fri, Jun 10, 2016 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote: > >> > On 2016-06-09 14:37:31 -0700, Andres Freund wrote: > >> >> I'm writing a patch right now, planning to post it later today, commit > >> >> it tomorrow. > >> > > >> > Attached. > >> > >> - /* see bufmgr.h: OS dependent default */ > >> - DEFAULT_BACKEND_FLUSH_AFTER, 0, WRITEBACK_MAX_PENDING_FLUSHES, > >> + 0, 0, WRITEBACK_MAX_PENDING_FLUSHES, > >> Wouldn't it be better to still use LT_BACKEND_FLUSH_AFTER here, and > >> just enforce it to 0 for all the OSes at the top of bufmgr.h? > > > > What would be the point? The only reason for DEFAULT_BACKEND_FLUSH_AFTER > > was that it differed between operating systems. Now it doesn't anymore. > > Then why do you keep it defined? Ooops. Missing git add. Greetings, Andres Freund
On Fri, Jun 10, 2016 at 9:42 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-06-10 09:41:09 +0900, Michael Paquier wrote: >> On Fri, Jun 10, 2016 at 9:37 AM, Andres Freund <andres@anarazel.de> wrote: >> > On 2016-06-10 09:34:33 +0900, Michael Paquier wrote: >> >> On Fri, Jun 10, 2016 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote: >> >> > On 2016-06-09 14:37:31 -0700, Andres Freund wrote: >> >> >> I'm writing a patch right now, planning to post it later today, commit >> >> >> it tomorrow. >> >> > >> >> > Attached. >> >> >> >> - /* see bufmgr.h: OS dependent default */ >> >> - DEFAULT_BACKEND_FLUSH_AFTER, 0, WRITEBACK_MAX_PENDING_FLUSHES, >> >> + 0, 0, WRITEBACK_MAX_PENDING_FLUSHES, >> >> Wouldn't it be better to still use LT_BACKEND_FLUSH_AFTER here, and >> >> just enforce it to 0 for all the OSes at the top of bufmgr.h? >> > >> > What would be the point? The only reason for DEFAULT_BACKEND_FLUSH_AFTER >> > was that it differed between operating systems. Now it doesn't anymore. >> >> Then why do you keep it defined? > > Ooops. Missing git add. :) -- Michael
On 2016-06-09 17:19:34 -0700, Andres Freund wrote: > On 2016-06-09 14:37:31 -0700, Andres Freund wrote: > > I'm writing a patch right now, planning to post it later today, commit > > it tomorrow. > > Attached. And pushed. Thanks to Michael for noticing the missing addition of header file hunk. Andres