Hello Tomas,
Thanks for these great measurements.
> * 4 x CPU E5-4620 (2.2GHz)
4*8 = 32 cores / 64 threads.
> * 256GB of RAM
Wow!
> * 24x SSD on LSI 2208 controller (with 1GB BBWC)
Wow! What is the RAID configuration? The patch is designed to fix very big
issues on HDD, but it is good to see that the impact is positive on SSD as
well.
Is it possible to run tests with distinct tablespaces spread over those
many disks?
> * shared_buffers=64GB
1/4 of the available memory.
> The pgbench was scale 60000, so ~750GB of data on disk,
About 3 times the available memory, so the data is mostly on disk.
> or like this ("throttled"):
>
> pgbench -c 32 -j 8 -T 86400 -R 5000 -l --aggregate-interval=1 pgbench
>
> The reason for the throttling is that people generally don't run production
> databases 100% saturated, so it'd be sad to improve the 100% saturated case
> and hurt the common case by increasing latency.
Sure.
> The machine does ~8000 tps, so 5000 tps is ~60% of that.
Ok.
I would have suggested using the --latency-limit option to filter out very
slow transactions. Otherwise, if the system gets stuck it may catch up
later, but that is not representative of "sustainable" performance.
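To make the "filter out" idea concrete, here is a minimal sketch of the accounting I have in mind. It is not what pgbench does internally: the function name, the input list and the 100 ms limit are all assumptions for illustration, and the latencies are made up.

```python
def latency_limit_report(latencies_ms, limit_ms):
    """Count transactions answered within limit_ms versus too late.

    latencies_ms: per-transaction latencies in milliseconds (hypothetical
    input, e.g. extracted from a per-transaction log).
    limit_ms: the cut-off; an assumed value, not one from the tests above.
    """
    total = len(latencies_ms)
    on_time = sum(1 for lat in latencies_ms if lat <= limit_ms)
    too_late = total - on_time
    # Share of transactions that were actually useful to the client.
    share_on_time = on_time / total if total else 0.0
    return {
        "total": total,
        "on_time": on_time,
        "too_late": too_late,
        "share_on_time": share_on_time,
    }


if __name__ == "__main__":
    # Made-up sample: two fast answers, one slow, one after a long stall.
    sample = [5.0, 20.0, 150.0, 60000.0]
    print(latency_limit_report(sample, limit_ms=100.0))
```

The point is that a transaction finished during a later "catch up" phase still lands in the too_late bucket, so it cannot inflate the apparent throughput.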
When pgbench is running under a target rate, in both runs the transaction
distribution is expected to be the same, around 5000 tps, and the green
run looks pretty ok with respect to that. The magenta one shows that about
25% of the time, things are not good at all, and the higher figures just
show the catching up, which is not really interesting if you asked for a
web page and it is finally delivered 1 minute later.
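One way to put a number on this from the per-second data is to classify each second against the target rate. A rough sketch follows, assuming the per-second transaction counts have already been extracted into a list; the function name and the 10% tolerance band are my own arbitrary choices, and the sample values are invented.

```python
def classify_seconds(tps_samples, target_tps, tolerance=0.10):
    """Return the share of seconds below, around, and above the target rate.

    tps_samples: per-second transaction counts (hypothetical input).
    target_tps: the throttling target, e.g. 5000 for -R 5000.
    tolerance: assumed width of the "on target" band, as a fraction.
    """
    n = len(tps_samples)
    if n == 0:
        return {"below": 0.0, "on_target": 0.0, "above": 0.0}
    low = target_tps * (1 - tolerance)
    high = target_tps * (1 + tolerance)
    # Seconds where the system fell behind the requested rate.
    below = sum(1 for tps in tps_samples if tps < low)
    # Seconds where it was catching up on delayed transactions.
    above = sum(1 for tps in tps_samples if tps > high)
    return {
        "below": below / n,
        "on_target": (n - below - above) / n,
        "above": above / n,
    }


if __name__ == "__main__":
    # Invented samples: two good seconds, one stall, one catch-up burst.
    print(classify_seconds([5000, 5010, 2000, 7900], target_tps=5000))
```

With such a breakdown, the "below" share is the interesting figure, and the "above" share is only its mirror image.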
> * regular-tps.png (per-second TPS) [...]
Great curves!
> consistent. Originally there was ~10% of samples with ~2000 tps, but with the
> flushing you'd have to go to ~4600 tps. It's actually pretty difficult to
> determine this from the chart, because the curve got so steep and I had to
> check the data used to generate the charts.
>
> Similarly for the upper end, but I assume that's a consequence of the
> throttling not having to compensate for the "slow" seconds anymore.
Yep, but they should be filtered out, "sorry, too late", so that would
count as unresponsiveness, at least for a large class of applications.
Thanks a lot for these interesting tests!
--
Fabien.