postgresql latency & bgwriter not doing its job

Hello pgdevs,

I've been playing with pg for some time now to try to reduce the maximum 
latency of simple requests, to have a responsive server under small to 
medium load.

On an old computer with a software RAID5 HDD array attached, running the 
pgbench simple-update script for some time (scale 100, fillfactor 95)
    pgbench -M prepared -N -c 2 -T 500 -P 1 ...

gives about 300 tps. However, this figure really means +1000 tps for a few 
seconds followed by about 16 seconds at roughly 0 tps during the 
checkpoint-induced IO storm. The server is totally unresponsive 75% of the 
time. That's bandwidth optimization for you. Hmmm... why not.

Now, given this setup, if pgbench is throttled at 50 tps (1/6 of the above 
maximum):
    pgbench -M prepared -N -c 2 -R 50.0 -T 500 -P 1 ...

The same thing more or less happens, just delayed... You get 50 tps for some 
time, followed by stretches of about 15 seconds at 0 tps for the checkpoint 
when the WAL segments are full... the server is unresponsive about 10% of 
the time (one in ten transactions is late by more than 200 ms).

This is not satisfying: pg should be able to handle that load easily.

The culprit I found is "bgwriter", which is basically doing nothing to 
prevent the coming checkpoint IO storm, even though there would be ample 
time to write the accumulating dirty pages so that the checkpoint would find 
a mostly clean field and pass in a blink. Indeed, at the end of the 
500-second throttled test, "pg_stat_bgwriter" says:
  buffers_checkpoint = 19046  buffers_clean = 2995

which suggests that bgwriter took it upon itself to write only about 6 pages 
per second (2995 buffers over 500 seconds), where at least 50 per second 
would have been needed for the load, and could have been handled by the 
hardware without any problem.
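
(For reference, those counters come from the pg_stat_bgwriter view and can 
be read with something along these lines, after a 
pg_stat_reset_shared('bgwriter') at the start of the run:)

    psql -c "SELECT buffers_checkpoint, buffers_clean FROM pg_stat_bgwriter;"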

I have not found any means to force bgwriter to issue writes when it can. 
(Well, I have: create a process which sends "CHECKPOINT" every 0.2 
seconds... it works more or less, but this is not my point:-)
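
(For the record, that hack is nothing more than a shell loop along these 
lines:)

    while true; do psql -c "CHECKPOINT;" > /dev/null; sleep 0.2; done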

The bgwriter control parameters allow controlling the maximum number of 
pages (bgwriter_lru_maxpages) written per round (bgwriter_delay), and a 
multiplier (bgwriter_lru_multiplier) applied to a heuristic estimate of how 
many pages will be needed, so as to make them available in advance. This may 
be nice in some settings, but is not adapted to the write-oriented OLTP load 
tested with pgbench.
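
(For reference, the relevant knobs and their defaults, as I understand the 
documentation:)

    # postgresql.conf defaults
    bgwriter_delay = 200ms          # time between bgwriter rounds
    bgwriter_lru_maxpages = 100     # max buffers written per round
    bgwriter_lru_multiplier = 2.0   # multiplier on the estimated buffer demand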

The problem is that with this update load on a database that fits in memory 
there is not much need for "new" pages, even though pages are being dirtied 
(about 50 per second), so the heuristic decides not to write much. The net 
result of all this cleverness is that when the checkpoint arrives, several 
thousand pages have to be written and the server is offline for some time.

ISTM that bgwriter lacks at least some "min pages" setting, so that it could 
be induced to write (and thus free) this many pages per round if it can. 
That would be a start.
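
To make that concrete, I imagine something like the sketch below (the name 
and exact semantics are made up, this setting does not exist today):

    # hypothetical addition to postgresql.conf
    bgwriter_lru_minpages = 50      # write at least this many dirty buffers
                                    # per round when possible, even if the
                                    # LRU heuristic estimates no demand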

A better feature would be for it to adapt itself to take advantage of the 
available IOPS, depending on the load induced by other tasks (vacuum, 
queries...), in a preventive manner, so as to avoid delaying what can be 
done right now under a small load, and thus avoid later IO storms. This 
would suggest a setting providing the expected IOPS capability of the 
underlying hardware, much as an existing setting already hints at the 
available memory.
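
Again as a purely hypothetical illustration, by analogy with 
effective_cache_size:

    # hypothetical addition to postgresql.conf
    effective_io_capacity = 100     # rough write IOPS the hardware can sustain,
                                    # to be consumed preventively by bgwriter
                                    # when the system is otherwise mostly idle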

Note that this preventive approach could also improve the "bandwidth" 
measure: currently, when pgbench is running at maximum speed before the 
checkpoint storm, nothing is written to disk but WAL, although the server 
could probably also write some dirty pages. When the checkpoint arrives, 
fewer pages would need to be written, so the storm would be shorter.

Any thoughts on this latency issue? Am I missing something?

-- 
Fabien.


