Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: checkpointer continuous flushing
Date
Msg-id alpine.DEB.2.10.1603171813430.28507@sto
In response to Re: checkpointer continuous flushing  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: checkpointer continuous flushing  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
Hello Tomas,

Thanks for these great measurements.

> * 4 x CPU E5-4620 (2.2GHz)

4*8 = 32 cores / 64 threads.

> * 256GB of RAM

Wow!

> * 24x SSD on LSI 2208 controller (with 1GB BBWC)

Wow! What RAID configuration? The patch is designed to fix very big issues 
on HDD, but it is good to see that the impact is positive on SSD as well.

Is it possible to run tests with distinct tablespaces spread over those 
many disks?

> * shared_buffers=64GB

1/4 of the available memory.

> The pgbench was scale 60000, so ~750GB of data on disk,

About 3× the available memory, so mostly on disk.
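These ratios are easy to double-check; a quick back-of-the-envelope sketch 
using the figures quoted above:

```python
# Figures taken from the post above.
ram_gb = 256            # total RAM on the test machine
shared_buffers_gb = 64  # shared_buffers setting
data_gb = 750           # on-disk size at pgbench scale 60000

buffers_fraction = shared_buffers_gb / ram_gb  # fraction of RAM in shared_buffers
data_to_ram = data_gb / ram_gb                 # dataset size relative to RAM

print(buffers_fraction)  # 0.25, i.e. 1/4 of the available memory
print(data_to_ram)       # ~2.9, roughly 3x the available memory
```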

> or like this ("throttled"):
>
> pgbench -c 32 -j 8 -T 86400 -R 5000 -l --aggregate-interval=1 pgbench
>
> The reason for the throttling is that people generally don't run production 
> databases 100% saturated, so it'd be sad to improve the 100% saturated case 
> and hurt the common case by increasing latency.

Sure.

> The machine does ~8000 tps, so 5000 tps is ~60% of that.

Ok.

I would have suggested using the --latency-limit option to filter out very 
slow transactions: otherwise, if the system gets stuck, it may catch up 
later, but then the results are not representative of "sustainable" 
performance.
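Concretely, the throttled run above could be rerun with a limit added; the 
100 ms value here is just an illustrative choice, not a recommendation:

```shell
# Hypothetical variant of the throttled run: with -L (--latency-limit),
# transactions that are already more than 100 ms behind schedule are
# skipped and counted separately, instead of inflating the per-second
# tps figures when the system catches up later.
pgbench -c 32 -j 8 -T 86400 -R 5000 -L 100 -l --aggregate-interval=1 pgbench
```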

When pgbench is running with a target rate, the transaction distribution is 
expected to be the same in both runs, around 5000 tps, and the green run 
looks pretty good in that respect. The magenta one shows that about 25% of 
the time things are not good at all, and the higher figures just show the 
catching up, which is not really interesting if you asked for a web page 
and it is finally delivered one minute later.
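To make the accounting concrete, here is a minimal sketch (with made-up 
latency values, not from these runs) of how a latency limit would 
reclassify "catch-up" transactions as unresponsive rather than counting 
them toward throughput:

```python
def classify(latencies_ms, limit_ms):
    """Split transactions into on-time ones and late ones.

    Under a latency limit, only on-time transactions count toward
    "sustainable" throughput; late ones are "sorry, too late".
    """
    on_time = [l for l in latencies_ms if l <= limit_ms]
    late = [l for l in latencies_ms if l > limit_ms]
    return on_time, late

# Hypothetical per-transaction latencies (ms): two transactions stalled
# behind a checkpoint-induced I/O spike, then completed much later.
latencies = [12, 8, 15, 4000, 65000, 9, 11]
on_time, late = classify(latencies, limit_ms=100)

# Fraction of requests served within the limit, i.e. responsiveness:
responsiveness = len(on_time) / len(latencies)
print(len(late))        # 2 late transactions
print(responsiveness)   # ~0.71
```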

> * regular-tps.png (per-second TPS) [...]

Great curves!

> consistent. Originally there was ~10% of samples with ~2000 tps, but with the 
> flushing you'd have to go to ~4600 tps. It's actually pretty difficult to 
> determine this from the chart, because the curve got so steep and I had to 
> check the data used to generate the charts.
>
> Similarly for the upper end, but I assume that's a consequence of the 
> throttling not having to compensate for the "slow" seconds anymore.

Yep, but they should be filtered out ("sorry, too late"), so that would 
count as unresponsiveness, at least for a large class of applications.

Thanks a lot for these interesting tests!

-- 
Fabien.


