Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: checkpointer continuous flushing
Date
Msg-id alpine.DEB.2.10.1508310743250.30124@sto
In response to Re: checkpointer continuous flushing  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: checkpointer continuous flushing  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
Hello Amit,

> IBM POWER-8 24 cores, 192 hardware threads
> RAM = 492GB

Wow! Thanks for trying the patch on such high-end hardware!

About the disks: what kind of HDD (RAID? speed?)? HDD write cache?

What is the OS? The FS?

> warmup=60

Quite short, but probably ok.

> scale=300

Means about 4-4.5 GB base.
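For reference, one quick way to check the actual base size after the pgbench
initialization (the database name "bench" is just an example, not from your setup):

  psql -d bench -c "SELECT pg_size_pretty(pg_database_size(current_database()));"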

> time=7200
> synchronous_commit=on

> shared_buffers=8GB

This is small wrt hardware, but given the scale setup I think that it 
should not matter much.

> max_wal_size=5GB

Hmmm... Maybe quite small given the average performance?

> checkpoint_timeout=2min

This seems rather small. Are the checkpoints xlog or time triggered?

You did not update checkpoint_completion_target, which defaults to 0.5, so
the checkpoint is scheduled to complete in at most 1 minute. Assuming most of
the 8 GB of shared_buffers is dirty, that suggests the checkpoint needs at
least about 130 MB/s of write performance (roughly 8 GB / 60 s).

> parallelism - 128 clients, 128 threads

Given 192 hw threads, I would have tried 128 clients & 64 pgbench threads, so 
that each client's postgres backend gets its own hardware thread and the 
postgres processes are not competing with pgbench. Now, as pgbench is mostly 
sleeping while waiting for results, that probably does not matter much... I 
may also be totally wrong:-)
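
For clarity, the kind of invocation I have in mind would look something like
this (the database name and -M prepared are just examples; -P 1 reports the
per-second progress from which figures like those below can be derived):

  pgbench -c 128 -j 64 -T 7200 -P 1 -M prepared bench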

> Sort - off
> avg over 7200: 8256.382528 ± 6218.769282 [0.000000, 76.050000,
> 10975.500000, 13105.950000, 21729.000000]
> percent of values below 10.0: 19.5%

The maximum performance is consistent with 128 clients each doing about 200 
(random) writes per second (128 * 200 = 25600, in the same ballpark as the 
~21700 tps maximum).

> Sort - on
> avg over 7200: 8375.930639 ± 6148.747366 [0.000000, 84.000000,
> 10946.000000, 13084.000000, 20289.900000]
> percent of values below 10.0: 18.6%

This is a really small improvement, probably within the measurement's margin 
of error. I would not put much trust in a 1.5% tps or 0.9% availability 
improvement.

I think we can conclude that on your (great) setup, with these configuration 
parameters, this patch does not harm performance. This is a good thing, even 
if I would have hoped to see better performance.

> Before going to conclusion, let me try to explain above data (I am
> explaining again even though Fabien has explained, to make it clear
> if someone has not read his mail)
>
> Let's try to understand with data for sorting - off option
>
> avg over 7200: 8256.382528 ± 6218.769282
>
> 8256.382528 - average tps for 7200s pgbench run
> 6218.769282 - standard deviation on per second figures
>
> [0.000000, 84.000000, 10946.000000, 13084.000000, 20289.900000]
>
> These 5 values can be read as minimum TPS, q1, median TPS, q3,
> maximum TPS over the 7200s pgbench run.  As far as I understand, q1
> and q3 are the medians of subsets of the values, which I didn't focus
> on much.

q1 = 84 means that 25% of the time the performance was below 84 tps, about 
1% of the average performance, which I would translate as "pg is pretty 
unresponsive 25% of the time".

This is the kind of issue I really want to address; the possible tps 
improvements are just a side effect.

> percent of values below 10.0: 19.5%
>
> Above means percent of time the result is below 10 tps.

Which means "postgres is really unresponsive 19.5% of the time".

If you count zeros, you will get "postgres was totally unresponsive X% of 
the time".
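
As an illustration, that kind of figure can be computed from the per-second
progress lines of a pgbench run, along these lines (the file name is just an
example, and the actual scripts used for these numbers may differ):

  # run.log holds the per-second "progress:" lines of a pgbench -P 1 run
  grep '^progress:' run.log | \
    awk '{ n++; if ($4 + 0 < 10.0) low++ }
         END { printf "%.1f%% of seconds below 10 tps\n", 100 * low / n }'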

> Now about test results, these tests are done for pgbench full speed runs
> and the above results indicate that there is approximately 1.5%
> improvement in avg. TPS and ~1% improvement in tps values which are
> below 10 with sorting on, and there is almost no improvement in median or
> maximum TPS values; instead they are slightly less when sorting is
> on, which could be due to run-to-run variation.

Yes, I agree.

> I have done more tests as well by varying time and number of clients
> keeping other configuration same as above, but the results are quite
> similar.

Given the hardware, I would suggest raising checkpoint_timeout, 
shared_buffers and max_wal_size, and using checkpoint_completion_target = 0.8. 
I would expect that to improve performance both with and without sorting.
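
For instance, something along these lines in postgresql.conf (the exact
values are only suggestions of mine and would need adapting to the box):

  shared_buffers = 32GB
  max_wal_size = 50GB          # large enough that checkpoints stay time triggered
  checkpoint_timeout = 15min
  checkpoint_completion_target = 0.8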

It would be interesting to have information from the checkpoint logs 
(especially how many buffers were written in how long, and whether checkpoints 
are time or xlog triggered...).
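
That information can be collected by turning on log_checkpoints; the log
lines below only illustrate the kind of output to look for, they are not
actual output from this run:

  # in postgresql.conf
  log_checkpoints = on

  # the server log then shows, for each checkpoint, something like:
  #   LOG:  checkpoint starting: time        (or "xlog" when WAL triggered)
  #   LOG:  checkpoint complete: wrote 123456 buffers (11.8%); ...
  #         write=53.2 s, sync=4.1 s, total=57.9 s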

> The results of sorting patch for the tests done indicate that the win is 
> not big enough with just doing sorting during checkpoints,

ISTM that you are generalizing too much: the win is not big "under this 
configuration and hardware".

I think that the patch may have only a very small influence under some 
conditions, but it should not degrade performance significantly, and under 
other conditions it should provide great improvements.

So having no performance degradation is a good result, even if I would have 
hoped for better. It would be interesting to understand why random disk 
writes do not perform too poorly on this box: size of the I/O queue, kind of 
(expensive:-) disks, write caches, file system, RAID level...
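
If it is a Linux box, something along these lines would already tell a lot
(device and mount point names are just placeholders):

  lsblk -o NAME,MODEL,ROTA,SIZE          # disks, and whether they rotate
  cat /sys/block/sdX/queue/nr_requests   # I/O queue depth
  cat /sys/block/sdX/queue/scheduler     # I/O scheduler in use
  hdparm -W /dev/sdX                     # is the drive write cache enabled?
  mount | grep /path/to/pgdata           # file system and mount options
  cat /proc/mdstat                       # software RAID layout, if any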

> we should consider flush patch along with sorting.

I also think that it would be interesting.

> I would like to perform some tests with both the patches together (sort 
> + flush) unless somebody else thinks that sorting patch alone is 
> beneficial and we should test some other kind of scenarios to see it's 
> benefit.

Yep. Is it a Linux box? If not, does it support posix_fadvise()?

>> The reason for the tablespace balancing is [...]
>
> What if tablespaces are not on separate disks

I would expect that it might degrade performance, but only marginally.

> or not enough hardware support to make Writes parallel?

I'm not sure that balancing writes over tablespaces (or not) would change 
anything when the I/O bottleneck is not the disks' write performance, so I 
would say "no impact" in that case.

> I think for such cases it might be better to do it sequentially.

Writing sequentially to different disks would be a bug: it would degrade 
performance significantly on a setup with several disks, up to dividing the 
performance by the number of disks... so I do not think that a patch which 
predictably and significantly degrades performance on high-end hardware is a 
reasonable option.

If you want to be able to deactivate balancing, it could be done with a GUC, 
but I cannot see a good reason to want that: it would complicate the code, and 
it does not make much sense to use many tablespaces on one disk, while anyone 
who uses several tablespaces on several disks probably expects to see her 
expensive disks actually used in parallel.

-- 
Fabien.
