Thread: postgresql latency & bgwriter not doing its job
Hello pgdevs,

I've been playing with pg for some time now to try to reduce the maximum latency of simple requests, so as to have a responsive server under small to medium load.

On an old computer with a software RAID5 HDD attached, the pgbench simple update script run for some time (scale 100, fillfactor 95)

  pgbench -M prepared -N -c 2 -T 500 -P 1 ...

gives 300 tps. However this performance is really +1000 tps for a few seconds, followed by 16 seconds at about 0 tps for the checkpoint-induced IO storm. The server is totally unresponsive 75% of the time. That's bandwidth optimization for you. Hmmm... why not.

Now, given this setup, if pgbench is throttled at 50 tps (1/6 of the above maximum):

  pgbench -M prepared -N -c 2 -R 50.0 -T 500 -P 1 ...

The same thing more or less happens in a delayed fashion... You get 50 tps for some time, followed by stretches of 15 seconds at 0 tps for the checkpoint when the segments are full... the server is unresponsive about 10% of the time (one in ten transactions is late by more than 200 ms). It is not satisfying; pg should be able to handle that load easily.

The culprit I found is "bgwriter", which is basically doing nothing to prevent the coming checkpoint IO storm, even though there would be ample time to write the accumulating dirty pages so that the checkpoint would find a clean field and pass in a blink. Indeed, at the end of the 500 second throttled test, "pg_stat_bgwriter" says:

  buffers_checkpoint = 19046
  buffers_clean = 2995

which suggests that bgwriter took it upon itself to write only about 6 pages per second, where at least 50 would have been needed for the load, and could have been handled by the hardware without any problem. I have not found any means to force bgwriter to send writes when it can. (Well, I have: create a process which sends "CHECKPOINT" every 0.2 seconds... it works more or less, but this is not my point:-)

The bgwriter control parameters allow controlling the maximum number of pages (bgwriter_lru_maxpages) written per round (bgwriter_delay), and a multiplier (bgwriter_lru_multiplier) for a heuristic that estimates how many pages will soon be needed, so as to make them available. This may be nice in some settings, but it is not adapted to the write-oriented OLTP load tested with pgbench. The problem is that with an update load on a database that fits in memory there is not that much need for "new" pages, even if pages are being dirtied (about 50 per second), so the heuristic decides not to write much. The net result of all this cleverness is that when the checkpoint arrives, several thousand pages have to be written and the server is offline for some time.

ISTM that bgwriter lacks at least some "min pages" setting, so that it could be induced to clean at least this many pages per round if it can. That would be a start. A better feature would be for it to adapt itself to take advantage of the available IOPS, depending on the load induced by other tasks (vacuum, queries...), in a preventive manner, so as to avoid delaying what can be done right now under a small load, and thus avoid later IO storms. This would suggest a setting providing the expected IOPS capability of the underlying hardware, much as some settings already describe the expected available memory.

Note that this preventive approach could also improve the "bandwidth" measure: currently, when pgbench is running at maximum speed before the checkpoint storm, nothing is written to disk but WAL, although some dirty pages could probably be written as well.
When the checkpoint arrives, fewer pages would need to be written, so the storm would be shorter.

Any thoughts on this latency issue? Am I missing something?

-- Fabien.
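[For illustration only, here is a minimal standalone C sketch of the kind of per-round pacing being discussed: write at most lru_maxpages per delay round, scaled by a multiplier on recent buffer allocations, plus the hypothetical "min pages" floor proposed above. This is not PostgreSQL source; all names and numbers are made up for the example.]

    #include <stdio.h>

    /* Illustrative pacing knobs, loosely modelled on the bgwriter GUCs. */
    #define LRU_MAXPAGES   100   /* hard cap on writes per round */
    #define LRU_MULTIPLIER 2.0   /* scale factor on recent buffer allocations */
    #define MIN_PAGES       50   /* hypothetical floor proposed in this thread */

    /* How many dirty pages to clean this round, given how many buffers
     * were allocated (demanded by backends) since the previous round. */
    static int pages_to_clean(int recent_allocs)
    {
        int target = (int) (recent_allocs * LRU_MULTIPLIER);

        if (target < MIN_PAGES)     /* the proposed change: never do less than this */
            target = MIN_PAGES;
        if (target > LRU_MAXPAGES)  /* existing behaviour: never do more than this */
            target = LRU_MAXPAGES;
        return target;
    }

    int main(void)
    {
        /* With an in-memory working set, backends allocate almost no new
         * buffers, so a purely multiplier-based target collapses to ~0 and
         * the floor is what keeps dirty pages from piling up for the
         * checkpoint. */
        int allocs[] = {0, 1, 3, 0, 40, 200};
        for (int i = 0; i < 6; i++)
            printf("recent_allocs=%3d -> clean %d pages\n",
                   allocs[i], pages_to_clean(allocs[i]));
        return 0;
    }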
On 08/25/2014 01:23 PM, Fabien COELHO wrote:
> Hello pgdevs,
>
> I've been playing with pg for some time now to try to reduce the maximum
> latency of simple requests, to have a responsive server under small to
> medium load.
>
> On an old computer with a software RAID5 HDD attached, pgbench simple
> update script run for some time (scale 100, fillfactor 95)
>
>   pgbench -M prepared -N -c 2 -T 500 -P 1 ...
>
> gives 300 tps. However this performance is really +1000 tps for a few
> seconds followed by 16 seconds at about 0 tps for the checkpoint induced
> IO storm. The server is totally unresponsive 75% of the time. That's
> bandwidth optimization for you. Hmmm... why not.

So I think that you're confusing the roles of bgwriter vs. spread checkpoint. What you're experiencing above is pretty common for nonspread checkpoints on slow storage (and RAID5 is slow for DB updates, no matter how fast the disks are), or for attempts to do spread checkpoints on filesystems which don't support them (e.g. Ext3, HFS+). In either case, what's happening is that the *OS* is freezing all logical and physical IO while it works to write out all of RAM, which makes me suspect you're using Ext3 or HFS+.

Making the bgwriter more aggressive adds a significant risk of writing the same pages multiple times between checkpoints, so it's not a simple fix.

-- Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Hi,

On 2014-08-25 22:23:40 +0200, Fabien COELHO wrote:
> seconds followed by 16 seconds at about 0 tps for the checkpoint induced IO
> storm. The server is totally unresponsive 75% of the time. That's bandwidth
> optimization for you. Hmmm... why not.
>
> Now, given this setup, if pgbench is throttled at 50 tps (1/6 the above
> max):
>
>   pgbench -M prepared -N -c 2 -R 50.0 -T 500 -P 1 ...
>
> The same thing more or less happens in a delayed fashion... You get 50 tps
> for some time, followed by stretches of 15 seconds at 0 tps for the
> checkpoint when the segments are full... the server is unresponsive about
> 10% of the time (one in ten transactions is late by more than 200 ms).

That's ext4 I guess? Did you check whether xfs yields a, err, more predictable performance?

> It is not satisfying, pg should be able to handle that load easily.
>
> The culprit I found is "bgwriter", which is basically doing nothing to
> prevent the coming checkpoint IO storm, even though there would be ample
> time to write the accumulating dirty pages so that checkpoint would find a
> clean field and pass in a blink.

While I agree that the current bgwriter implementation is far from good, note that this isn't the bgwriter's job. Its goal is to keep backends from having to write out buffers themselves, i.e. to ensure that there are clean victim buffers when shared_buffers < working set.

Note that it would *not* be a good idea to make the bgwriter write out everything as much as possible - that'd turn sequential write IO into random write IO.

Greetings,

Andres Freund
On Monday, August 25, 2014, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
The culprit I found is "bgwriter", which is basically doing nothing to prevent the coming checkpoint IO storm, even though there would be ample time to write the accumulating dirty pages so that checkpoint would find a clean field and pass in a blink. Indeed, at the end of the 500 seconds throttled test, "pg_stat_bgwriter" says:
Are you doing pg_stat_reset_shared('bgwriter') after running pgbench -i? You don't want your steady state stats polluted by the bulk load.
buffers_checkpoint = 19046
buffers_clean = 2995
Out of curiosity, what does buffers_backend show?
In any event, this is almost certainly a red herring. Whichever of the three ways is being used to write out the buffers, it is the checkpointer that is responsible for fsyncing them, and that is where your drama is almost certainly occurring. Writing out via one path rather than another isn't going to change things, unless you change the fsync.
Also, are you familiar with checkpoint_completion_target, and what is it set to?
Cheers,
Jeff
On Tue, Aug 26, 2014 at 1:53 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> Hello pgdevs,
>
> I've been playing with pg for some time now to try to reduce the maximum latency of simple requests, to have a responsive server under small to medium load.
>
> On an old computer with a software RAID5 HDD attached, pgbench simple update script run for a some time (scale 100, fillfactor 95)
>
> pgbench -M prepared -N -c 2 -T 500 -P 1 ...
>
> gives 300 tps. However this performance is really +1000 tps for a few seconds followed by 16 seconds at about 0 tps for the checkpoint induced IO storm. The server is totally unresponsive 75% of the time. That's bandwidth optimization for you. Hmmm... why not.
>
> Now, given this setup, if pgbench is throttled at 50 tps (1/6 the above max):
>
> pgbench -M prepared -N -c 2 -R 50.0 -T 500 -P 1 ...
>
> The same thing more or less happens in a delayed fashion... You get 50 tps for some time, followed by sections of 15 seconds at 0 tps for the checkpoint when the segments are full... the server is unresponsive about 10% of the time (one in ten transaction is late by more than 200 ms).
I think another thing to know here is why exactly checkpoint
storm is causing tps to drop so steeply. One reason could be
that backends might need to write more WAL due to full_page_writes,
another could be contention around the buffer content_lock.
To dig more into the reason, the same tests can be tried
with full_page_writes = off and/or
synchronous_commit = off to see if WAL writes are causing
tps to go down.
Similarly for checkpoints, use checkpoint_completion_target to
spread the checkpoint writes as suggested by Jeff as well, to see
if that can mitigate the problem.
Hello Josh,

> So I think that you're confusing the roles of bgwriter vs. spread
> checkpoint. What you're experiencing above is pretty common for
> nonspread checkpoints on slow storage (and RAID5 is slow for DB updates,
> no matter how fast the disks are), or for attempts to do spread
> checkpoints on filesystems which don't support them (e.g. Ext3, HFS+). In
> either case, what's happening is that the *OS* is freezing all logical
> and physical IO while it works to write out all of RAM, which makes me
> suspect you're using Ext3 or HFS+.

I'm using ext4 on Debian wheezy with postgresql 9.4b2.

I agree that the OS may be able to help, but this aspect does not currently work for me at all out of the box. The "all of RAM" is really a few thousand 8 kB pages written randomly, a few dozen MB. Also, if pg needs advanced OS tweaking to handle a small load, ISTM that it fails at simplicity:-(

As for checkpoint spreading, raising checkpoint_completion_target to 0.9 degrades the situation (20% of transactions are more than 200 ms late instead of 10%, and bgwriter wrote less than 1 page per second, on a 500 s run). So maybe there is a bug here somewhere.

> Making the bgwriter more aggressive adds a significant risk of writing
> the same pages multiple times between checkpoints, so it's not a simple fix.

Hmmm... This must be balanced against the risk of being offline. Not all people are interested in throughput at the price of latency, so there could be settings that help latency, even at the price of reducing throughput (average tps). After that, it is the administrator's choice to set pg for higher throughput or lower latency.

Note that writing some "least recently used" page multiple times does not seem to be any issue at all for me under small/medium load, especially as the system has nothing else to do: if there is nothing else to do, there is no cost in writing a page, even if you may have to write it again some time later, and it helps prevent dirty page accumulation. So it seems to me that pg can help; it is not only/merely an OS issue.

-- Fabien.
On Monday, August 25, 2014, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
I have not found any mean to force bgwriter to send writes when it can. (Well, I have: create a process which sends "CHECKPOINT" every 0.2 seconds... it works more or less, but this is not my point:-)
There is scan_whole_pool_milliseconds, which currently forces bgwriter to circle the buffer pool at least once every 2 minutes. It is currently fixed, but it should be trivial to turn it into an experimental guc that you could use to test your hypothesis.
Cheers,
Jeff
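[For context, a standalone C sketch of the kind of floor such a constant imposes (illustration only; the real bgwriter logic is more involved and this is not PostgreSQL source): with a pool of N buffers, a delay of D ms per round and a "scan whole pool" budget of S ms, the writer must examine at least N * D / S buffers per round, whatever the allocation-based estimate says. Note that this forces *scanning*, not writing; buffers that are not clean victims still won't be written, which is consistent with the test result reported below. The buffer count and delay used here are assumptions for the example.]

    #include <stdio.h>

    int main(void)
    {
        const int nbuffers = 16384;          /* e.g. 128 MB of shared_buffers */
        const int delay_ms = 200;            /* a bgwriter_delay-like round length */
        int scan_whole_pool_ms = 120000;     /* the 2 minute budget Jeff mentions */

        /* Minimum buffers to examine each round so the whole pool is covered
         * within scan_whole_pool_ms, regardless of the demand-based estimate. */
        int min_scan = nbuffers / (scan_whole_pool_ms / delay_ms);
        printf("120 s budget: examine at least %d buffers per round\n", min_scan);

        scan_whole_pool_ms = 1000;           /* the experimental value tried below */
        min_scan = nbuffers / (scan_whole_pool_ms / delay_ms);
        printf("1 s budget:   examine at least %d buffers per round\n", min_scan);
        return 0;
    }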
[oops, wrong from, resent...]

Hello Jeff,

>> The culprit I found is "bgwriter", which is basically doing nothing to
>> prevent the coming checkpoint IO storm, even though there would be ample
>> time to write the accumulating dirty pages so that checkpoint would find
>> a clean field and pass in a blink. Indeed, at the end of the 500 seconds
>> throttled test, "pg_stat_bgwriter" says:
>
> Are you doing pg_stat_reset_shared('bgwriter') after running pgbench -i?

Yes, I did.

> You don't want your steady state stats polluted by the bulk load.

Sure!

>> buffers_checkpoint = 19046
>> buffers_clean = 2995
>
> Out of curiosity, what does buffers_backend show?

buffers_backend = 157

> In any event, this is almost certainly a red herring.

Possibly. It is pretty easy to reproduce, though.

> Whichever of the three ways is being used to write out the buffers, it
> is the checkpointer that is responsible for fsyncing them, and that is
> where your drama is almost certainly occurring. Writing out via one
> path rather than another isn't going to change things, unless you
> change the fsync.

Well, I agree partially. ISTM that the OS does not need to wait for the fsync to start writing pages if it is receiving one minute of buffer writes at 50 writes per second; I would have thought that some scheduler would start handling the flow before the fsync... So I thought that having bgwriter write the buffers would help, but maybe there is a better way.

> Also, are you familiar with checkpoint_completion_target, and what is it
> set to?

The default 0.5. Moving to 0.9 seems to worsen the situation.

-- Fabien.
Hello Amit,

> I think another thing to know here is why exactly checkpoint
> storm is causing tps to drop so steeply.

Yep. Actually it is not strictly 0, but a "few" tps that I rounded to 0.

  progress: 63.0 s, 47.0 tps, lat 2.810 ms stddev 5.194, lag 0.354 ms
  progress: 64.1 s, 11.9 tps, lat 81.481 ms stddev 218.026, lag 74.172 ms
  progress: 65.2 s, 1.9 tps, lat 950.455 ms stddev 125.602, lag 1421.203 ms
  progress: 66.1 s, 4.5 tps, lat 604.790 ms stddev 440.073, lag 2418.128 ms
  progress: 67.2 s, 6.0 tps, lat 322.440 ms stddev 68.276, lag 3146.302 ms
  progress: 68.0 s, 2.4 tps, lat 759.509 ms stddev 62.141, lag 4229.239 ms
  progress: 69.4 s, 3.6 tps, lat 440.335 ms stddev 369.207, lag 4842.371 ms

Transactions are 4.8 seconds behind schedule at this point.

> One reason could be that backends might need to write more WAL due to
> full_page_writes, another could be contention around the buffer
> content_lock. To dig more into the reason, the same tests can be tried
> with full_page_writes = off and/or synchronous_commit = off to see
> if WAL writes are causing tps to go down.

Given the small flow of updates, I do not think that there should be any reason to get that big a write contention between WAL & checkpoint.

If tried with "full_page_writes = off" for 500 seconds: same overall behavior, 8.5% of transactions are stuck (instead of 10%). However, in detail pg_stat_bgwriter is quite different:

  buffers_checkpoint = 13906
  buffers_clean = 20748
  buffers_backend = 472

That seems to suggest that bgwriter did some stuff for once, but that did not change the result much in the end. This would imply that my suggestion to make bgwriter write more would not fix the problem alone.

With "synchronous_commit = off", the situation is much improved, with only 0.3% of transactions stuck. Not a surprise. However, I would not recommend that as a solution:-)

Currently, the only way I was able to "solve" the issue while still writing to disk is to send "CHECKPOINT" every 0.2 s, as if I had set "checkpoint_timeout = 0.2s" (although this is not currently allowed).

> Similarly for checkpoints, use checkpoint_completion_target to
> spread the checkpoint writes as suggested by Jeff as well, to see
> if that can mitigate the problem.

I had already tried, and retried after Jeff's suggestion. This does not seem to mitigate anything, on the contrary.

-- Fabien.
Hello again,

>> I have not found any mean to force bgwriter to send writes when it can.
>> (Well, I have: create a process which sends "CHECKPOINT" every 0.2
>> seconds... it works more or less, but this is not my point:-)
>
> There is scan_whole_pool_milliseconds, which currently forces bgwriter to
> circle the buffer pool at least once every 2 minutes. It is currently
> fixed, but it should be trivial to turn it into an experimental guc that
> you could use to test your hypothesis.

I recompiled with the variable coldly set to 1000 instead of 120000. The situation is slightly degraded (15% of transactions were more than 200 ms late). However, it seems that bgwriter did not write many more pages:

  buffers_checkpoint = 26065
  buffers_clean = 5263
  buffers_backend = 367

Or I may have a problem interpreting pg_stat_bgwriter.

It seems that changing this value is not enough to persuade bgwriter to write more pages. Or I may have done something wrong, but I do not know what.

-- Fabien.
On 2014-08-26 08:12:48 +0200, Fabien COELHO wrote:
> As for checkpoint spreading, raising checkpoint_completion_target to 0.9
> degrades the situation (20% of transactions are more than 200 ms late
> instead of 10%, and bgwriter wrote less than 1 page per second, on a 500 s
> run). So maybe there is a bug here somewhere.

What are the other settings here? checkpoint_segments, checkpoint_timeout, wal_buffers? Could you show the output of log_checkpoints during that run?

Checkpoint spreading only works halfway efficiently if all checkpoints are triggered by "time" and not by "xlog".

Greetings,

Andres Freund

--
Andres Freund                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hello Andres,

>> checkpoint when the segments are full... the server is unresponsive about
>> 10% of the time (one in ten transactions is late by more than 200 ms).
>
> That's ext4 I guess?

Yes!

> Did you check whether xfs yields a, err, more predictable performance?

No. I cannot test that easily without reinstalling the box. I did some quick tests with ZFS/FreeBSD which seemed to freeze the same, but not in the very same conditions. Maybe I could try again.

> [...] Note that it would *not* be a good idea to make the bgwriter write
> out everything as much as possible - that'd turn sequential write IO
> into random write IO.

Hmmm. I'm not sure that would necessarily be the case; it depends on how bgwriter would choose the pages to write. If they are chosen randomly then indeed that could be bad. If there is a big sequential write, should not the backend do the write directly anyway? ISTM that currently the checkpoint is mostly random writes anyway, at least with the OLTP write load of pgbench. I'm just trying to be able to start them earlier so that they can be completed quickly.

So although bgwriter is not the solution, ISTM that pg has no reason to wait for minutes before starting to write dirty pages, if it has nothing else to do. If the OS does some retention later and cannot spread the load, as Josh suggests, this could also be a problem, but currently the OS seems not to have much to write (but WAL) till the checkpoint.

-- Fabien.
On 2014-08-26 10:25:29 +0200, Fabien COELHO wrote:
> > Did you check whether xfs yields a, err, more predictable performance?
>
> No. I cannot test that easily without reinstalling the box. I did some quick
> tests with ZFS/FreeBSD which seemed to freeze the same, but not in the very
> same conditions. Maybe I could try again.

After Robert and I went to LSF/MM this spring I sent a test program for precisely this problem, and while it could *crash* machines when using ext4, xfs yielded much more predictable performance. There's a problem with prioritization of write vs read IO that's apparently FS dependent.

> > [...] Note that it would *not* be a good idea to make the bgwriter write
> > out everything as much as possible - that'd turn sequential write IO into
> > random write IO.
>
> Hmmm. I'm not sure that would necessarily be the case; it depends on how
> bgwriter would choose the pages to write. If they are chosen randomly then
> indeed that could be bad.

They essentially have to be random to fulfill its role of reducing the likelihood of a backend having to write out a buffer itself. Consider how the clock sweep algorithm (not that I am happy with it) works. When looking for a new victim buffer, all backends scan the buffer cache in one continuous cycle. If they find a buffer with a usagecount == 0 they'll use that one and throw away its contents. Otherwise they reduce the usagecount by 1 and move on. What the bgwriter *tries* to do is to write out buffers with usagecount == 0 that are dirty and will soon be visited in the clock cycle, to avoid having the backends do that.

> If there is a big sequential write, should not the
> backend do the write directly anyway? ISTM that currently the checkpoint is
> mostly random writes anyway, at least with the OLTP write load of pgbench.
> I'm just trying to be able to start them earlier so that they can be
> completed quickly.

If the IO scheduling worked - which it really doesn't in many cases - there'd really be no need to make it finish fast. I think you should try to tune spread checkpoints to have less impact, not make bgwriter do something it's not written for.

> So although bgwriter is not the solution, ISTM that pg has no reason to wait
> for minutes before starting to write dirty pages, if it has nothing else to
> do.

That precisely *IS* a spread checkpoint.

> If the OS does some retention later and cannot spread the load, as Josh
> suggests, this could also be a problem, but currently the OS seems not to
> have much to write (but WAL) till the checkpoint.

The actual problem is that the writes by the checkpointer - done in the background - aren't flushed out eagerly enough from the OS's page cache. Then, when the final phase of the checkpoint comes, where relation files need to be fsynced, some filesystems essentially stall while trying to write out lots and lots of dirty buffers.

Greetings,

Andres Freund

--
Andres Freund                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
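[For readers unfamiliar with the mechanism described above, here is a minimal standalone C sketch of a clock-sweep victim search over a toy buffer array. It is an illustration only, not the actual PostgreSQL implementation; all names and sizes are made up.]

    #include <stdbool.h>
    #include <stdio.h>

    #define NBUFFERS 8

    struct buf { int usagecount; bool dirty; };

    static struct buf pool[NBUFFERS];
    static int clock_hand = 0;

    /* Find a victim buffer: decrement usage counts until one hits zero.
     * A backend that lands on a *dirty* zero-count buffer has to write it
     * out itself, which is exactly what the bgwriter tries to prevent by
     * cleaning such buffers just ahead of the sweep. */
    static int find_victim(void)
    {
        for (;;)
        {
            struct buf *b = &pool[clock_hand];
            int this_one = clock_hand;

            clock_hand = (clock_hand + 1) % NBUFFERS;
            if (b->usagecount == 0)
                return this_one;    /* reusable (caller must flush if dirty) */
            b->usagecount--;        /* give it another lap before eviction */
        }
    }

    int main(void)
    {
        for (int i = 0; i < NBUFFERS; i++)
            pool[i] = (struct buf){ .usagecount = i % 3, .dirty = (i % 2) == 0 };

        int v = find_victim();
        printf("victim buffer %d (dirty=%d)\n", v, pool[v].dirty);
        return 0;
    }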
> What are the other settings here? checkpoint_segments,
> checkpoint_timeout, wal_buffers?

They simply are the defaults:

  checkpoint_segments = 3
  checkpoint_timeout = 5min
  wal_buffers = -1

I did some tests with checkpoint_segments = 1; the problem is just more frequent but shorter. I also reduced wal_segsize down to 1 MB, which made it even more frequent but much shorter, so the overall result was an improvement, with 5% to 3% of transactions lost instead of 10-14%, if I recall correctly. I have found no solution on this path.

> Could you show the output of log_checkpoints during that run? Checkpoint
> spreading only works halfway efficiently if all checkpoints are
> triggered by "time" and not by "xlog".

I do 500 second tests, so there could be at most 2 timeout-triggered checkpoints. Given the write load it takes about 2 minutes to fill the 3 16 MB segments (8 kB * 50 tps (there is one page modified per transaction) * 120 s ~ 48 MB), so checkpoints are triggered by xlog. The maths are consistent with the logs (not sure which proves which, though:-):

  LOG: received SIGHUP, reloading configuration files
  LOG: parameter "log_checkpoints" changed to "on"
  LOG: checkpoint starting: xlog
  LOG: checkpoint complete: wrote 5713 buffers (34.9%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=51.449 s, sync=4.857 s, total=56.485 s; sync files=12, longest=2.160 s, average=0.404 s
  LOG: checkpoint starting: xlog
  LOG: checkpoint complete: wrote 6235 buffers (38.1%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=53.500 s, sync=5.102 s, total=58.670 s; sync files=8, longest=2.689 s, average=0.637 s
  LOG: checkpoint starting: xlog
  LOG: checkpoint complete: wrote 6250 buffers (38.1%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=53.888 s, sync=4.504 s, total=58.495 s; sync files=8, longest=2.627 s, average=0.563 s
  LOG: checkpoint starting: xlog
  LOG: checkpoint complete: wrote 6148 buffers (37.5%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=53.313 s, sync=6.437 s, total=59.834 s; sync files=8, longest=3.680 s, average=0.804 s
  LOG: checkpoint starting: xlog
  LOG: checkpoint complete: wrote 6240 buffers (38.1%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=149.008 s, sync=5.448 s, total=154.566 s; sync files=9, longest=3.788 s, average=0.605 s

Note that my current effective solution is to do as if "checkpoint_timeout = 0.2s": it works fine if I do my own spreading.

-- Fabien.
On 2014-08-26 10:49:31 +0200, Fabien COELHO wrote:
> > What are the other settings here? checkpoint_segments,
> > checkpoint_timeout, wal_buffers?
>
> They simply are the defaults:
>
>   checkpoint_segments = 3
>   checkpoint_timeout = 5min
>   wal_buffers = -1
>
> I did some tests with checkpoint_segments = 1; the problem is just more
> frequent but shorter. I also reduced wal_segsize down to 1 MB, which made it
> even more frequent but much shorter, so the overall result was an improvement,
> with 5% to 3% of transactions lost instead of 10-14%, if I recall correctly.
> I have found no solution on this path.

Uh. I'm not surprised you're facing utterly horrible performance with this. Did you try using a *large* checkpoint_segments setting? To achieve high performance you likely will have to make checkpoint_timeout *longer* and increase checkpoint_segments until *all* checkpoints are started because of "time". There are three reasons:

a) If checkpoint_timeout + completion_target is large and the checkpoint isn't executed prematurely, most of the dirty data has been written out by the kernel's background flush processes.

b) The amount of WAL written with less frequent checkpoints is often *significantly* lower, because fewer full page writes need to be done. I've seen reductions in production of *more* than a factor of 4.

c) If checkpoints are infrequent enough, the penalty of them causing problems, especially if not using ext4, plays less of a role overall.

Greetings,

Andres Freund

--
Andres Freund                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
> Uh. I'm not surprised you're facing utterly horrible performance with
> this. Did you try using a *large* checkpoint_segments setting? To
> achieve high performance

I do not seek "high performance" per se, I seek "lower maximum latency". I think that the current settings and parameters are designed for high throughput, but do not allow controlling the latency even with a small load.

> you likely will have to make checkpoint_timeout *longer* and increase
> checkpoint_segments until *all* checkpoints are started because of
> "time".

Well, I want to test a *small* load in a *reasonable* time, so I did not enlarge the number of segments, otherwise it would take ages. If I put "checkpoint_timeout = 1min" and "checkpoint_completion_target = 0.9" so that the checkpoints are triggered by the timeout:

  LOG: checkpoint starting: time
  LOG: checkpoint complete: wrote 4476 buffers (27.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=53.645 s, sync=5.127 s, total=58.927 s; sync files=12, longest=2.890 s, average=0.427 s
  ...

The result is basically the same (well, 18% of transactions lost, but the results do not seem to be stable from one run to the next); only there are more checkpoints.

I fail to understand how multiplying both the segments and the time would solve the latency problem. If I set 30 segments then it takes 20 minutes to fill them, and if I put the timeout at 15min then I'll have to wait for 15 minutes to test.

> There are three reasons:
> a) If checkpoint_timeout + completion_target is large and the checkpoint
> isn't executed prematurely, most of the dirty data has been written out
> by the kernel's background flush processes.

Why would they be written by the kernel if bgwriter has not sent them??

> b) The amount of WAL written with less frequent checkpoints is often
> *significantly* lower, because fewer full page writes need to be
> done. I've seen reductions in production of *more* than a factor of 4.

Sure, I understand that, but ISTM that this test does not exercise this issue; the load is small, the full page writes do not matter much.

> c) If checkpoints are infrequent enough, the penalty of them causing
> problems, especially if not using ext4, plays less of a role overall.

I think that what you suggest would only delay the issue, not solve it. I'll try to run a long test.

-- Fabien.
On 2014-08-26 11:34:36 +0200, Fabien COELHO wrote:
> > Uh. I'm not surprised you're facing utterly horrible performance with
> > this. Did you try using a *large* checkpoint_segments setting? To
> > achieve high performance
>
> I do not seek "high performance" per se, I seek "lower maximum latency".

So?

> I think that the current settings and parameters are designed for high
> throughput, but do not allow controlling the latency even with a small load.

The way you're setting them is tuned for 'basically no write activity'.

> > you likely will have to make checkpoint_timeout *longer* and increase
> > checkpoint_segments until *all* checkpoints are started because of "time".
>
> Well, I want to test a *small* load in a *reasonable* time, so I did not
> enlarge the number of segments, otherwise it would take ages.

Well, that way you're testing something basically meaningless. That's not helpful either.

> If I put "checkpoint_timeout = 1min" and "checkpoint_completion_target =
> 0.9" so that the checkpoints are triggered by the timeout:
>
>   LOG: checkpoint starting: time
>   LOG: checkpoint complete: wrote 4476 buffers (27.3%); 0 transaction log
>   file(s) added, 0 removed, 0 recycled; write=53.645 s, sync=5.127 s,
>   total=58.927 s; sync files=12, longest=2.890 s, average=0.427 s
>   ...
>
> The result is basically the same (well, 18% of transactions lost, but the
> results do not seem to be stable from one run to the next); only there are
> more checkpoints.

With these settings you're fsyncing the entire data directory once a minute, nearly entirely from the OS's buffer cache, because the OS's writeback logic didn't have time to kick in.

> I fail to understand how multiplying both the segments and the time would
> solve the latency problem. If I set 30 segments then it takes 20 minutes to
> fill them, and if I put the timeout at 15min then I'll have to wait for 15
> minutes to test.

a) The kernel's writeback logic only kicks in with a delay.

b) The amount of writes you're doing with short checkpoint intervals is overall significantly higher than with longer intervals. That obviously has an impact on latency as well as throughput.

c) The time it takes for segments to be filled is mostly irrelevant. The phase that's very likely causing troubles for you is the fsyncs issued at the end of a checkpoint.

> > There are three reasons:
> > a) If checkpoint_timeout + completion_target is large and the checkpoint
> > isn't executed prematurely, most of the dirty data has been written out
> > by the kernel's background flush processes.
>
> Why would they be written by the kernel if bgwriter has not sent them??

I think you're misunderstanding how spread checkpoints work. When the checkpointer process starts a spread checkpoint, it first writes all buffers to the kernel in a paced manner. That pace is determined by checkpoint_completion_target and checkpoint_timeout. Once all buffers that are old enough to need to be checkpointed are written out, the checkpointer fsync()s all the on-disk files. That part is *NOT* paced. Then it can go on to remove old WAL files.

The latency problem is almost guaranteed to be created by the fsync()s mentioned above. When they're executed, the kernel starts flushing out a lot of dirty buffers at once - creating very deep IO queues, which makes it take a long time to process synchronous additions (WAL flushes, reads) to that queue.

> > c) If checkpoints are infrequent enough, the penalty of them causing
> > problems, especially if not using ext4, plays less of a role overall.
>
> I think that what you suggest would only delay the issue, not solve it.

The amount of dirty data that needs to be flushed is essentially bounded. If you have a stall of roughly the same magnitude either way (say within a factor of two), the smaller one once a minute and the larger one once an hour, then obviously the once-an-hour one will give better latency to many, many more transactions.

Greetings,

Andres Freund

--
Andres Freund                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
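[To make the shape of the problem concrete, here is a small standalone C sketch; it is an illustration only, not the checkpointer's actual code, and the file name and sizes are invented. The write() calls can be paced as slowly as you like, but if the kernel keeps the pages dirty in its cache, all of the physical IO is forced by the single unpaced fsync() at the end.]

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define PAGE_SZ 8192
    #define NPAGES  1024            /* pretend these are the dirty buffers */

    int main(void)
    {
        char page[PAGE_SZ];
        memset(page, 'x', sizeof(page));

        int fd = open("fake_relation", O_CREAT | O_WRONLY | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* Paced phase: spread the writes over the checkpoint interval.
         * These only dirty the OS page cache; little may reach the disk yet. */
        struct timespec pace = {0, 10 * 1000 * 1000};   /* 10 ms between pages */
        for (int i = 0; i < NPAGES; i++)
        {
            if (pwrite(fd, page, PAGE_SZ, (off_t) i * PAGE_SZ) != PAGE_SZ)
            { perror("pwrite"); return 1; }
            nanosleep(&pace, NULL);
        }

        /* Final phase: one unpaced fsync(). If the kernel deferred the
         * writeback, this is where the IO storm (and the latency spike) hits. */
        if (fsync(fd) != 0) { perror("fsync"); return 1; }
        close(fd);
        return 0;
    }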
On Tue, Aug 26, 2014 at 12:53 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> Given the small flow of updates, I do not think that there should be reason to get that big a write contention between WAL & checkpoint.
>
> If tried with "full_page_write = off" for 500 seconds: same overall behavior, 8.5% of transactions are stuck (instead of 10%). However in details pg_stat_bgwriter is quite different:
>
> buffers_checkpoint = 13906
> buffers_clean = 20748
> buffers_backend = 472
>
> That seems to suggest that bgwriter did some stuff for once, but that did not change the result much in the end. This would imply that my suggestion to make bgwriter write more would not fix the problem alone.

I think the reason could be that in most cases bgwriter
passes the sync responsibility to the checkpointer rather than
doing it by itself, which causes an IO storm during checkpoint,
and another thing is that it will not proceed to even write
a dirty buffer unless the refcount and usage_count of the
buffer are zero. I see there is some merit in your point, which
is to make bgwriter more useful than its current form.
I could see 3 top-level points to think about, where improvement
in any of them might improve the current situation:
a. Scanning of the buffer pool to find the dirty buffers that can
be flushed.
b. Deciding on the criteria for flushing a buffer.
c. Sync of buffers.
> With "synchronous_commit = off", the situation is much improved, with only 0.3% of transactions stuck. Not a surprise. However, I would not
> recommand that as a solution:-)
Yeah, actually it was just to test what is the actual problem,
Yeah, actually it was just to test what is the actual problem,
I will also not recommend such a solution, however it gives
us atleast the indication that due to IO in checkpoint, backends
are not even able to flush comparatively small WAL data.
How about if you keep WAL (pg_xlog) on a separate filesystem
may be via creating symlink or if possible with -X option of initdb?
Hello Andres,

> [...]
> I think you're misunderstanding how spread checkpoints work.

Yep, definitely:-) On the other hand I thought I was seeking something "simple", namely correct latency under a small load, which I would expect out of the box.

What you describe is reasonable, and is more or less what I was hoping for, although I thought that bgwriter was involved from the start and the checkpoint would only do what is needed at the end. My mistake.

> When the checkpointer process starts a spread checkpoint, it first writes
> all buffers to the kernel in a paced manner. That pace is determined by
> checkpoint_completion_target and checkpoint_timeout.

This pacing does not seem to work, even at a slow pace.

> If you have a stall of roughly the same magnitude either way (say within a
> factor of two), the smaller one once a minute and the larger one once an
> hour, then obviously the once-an-hour one will give better latency to many,
> many more transactions.

I do not believe in delaying writing to disk as much as possible as a viable strategy to handle a small load. However, to show my good will, I have tried to follow your advice: I've launched a 5000 second test with 50 segments, 30 min timeout, 0.9 completion target, at 25 tps, which is less than 1/10 of the maximum throughput.

There are only two time-triggered checkpoints:

  LOG: checkpoint starting: time
  LOG: checkpoint complete: wrote 48725 buffers (47.6%); 1 transaction log file(s) added, 0 removed, 0 recycled; write=1619.750 s, sync=27.675 s, total=1647.932 s; sync files=14, longest=27.593 s, average=1.976 s
  LOG: checkpoint starting: time
  LOG: checkpoint complete: wrote 22533 buffers (22.0%); 0 transaction log file(s) added, 0 removed, 23 recycled; write=826.919 s, sync=9.989 s, total=837.023 s; sync files=8, longest=6.742 s, average=1.248 s

For the first one, 48725 buffers is 380 MB. 1800 * 0.9 = 1620 seconds to complete, so that means 30 buffer writes per second... should be ok. However the sync costs 27 seconds nevertheless, and the server was more or less offline for about 30 seconds flat. For the second one, 180 MB to write, 10 seconds offline. For some reason the target time is reduced.

I have also tried with the "deadline" IO scheduler, which makes more sense than the default "cfq", but the result was similar. Not sure how software RAID interacts with IO scheduling, though.

Overall result: over the 5000 s test, I have lost (i.e. more than 200 ms behind schedule) more than 2.5% of transactions (1/40). Due to the unfinished cycle, the long-term average is probably about 3%. Although it is better than 10%, it is not good. I would expect/hope for something pretty close to 0, even with ext4 on Linux, for a dedicated host which has nothing else to do but handle two dozen transactions per second.

Current conclusion: I have not found any way to improve the situation to "good" with configuration parameters alone. Currently a small load results in periodic offline time, which can be delayed but not avoided. The delaying tactic results in less frequent but longer downtime. I prefer frequent, very short downtime instead. I really think that something is amiss. Maybe pg does not handle pacing as it should.

For the record, a 25 tps bench with a "small" config (default 3 segments, 5 min timeout, 0.5 completion target) and with a parallel

  while true ; do echo "CHECKPOINT;"; sleep 0.2s; done | psql

results in "losing" only 0.01% of transactions (12 transactions out of 125893 were behind by more than 200 ms over the 5000 seconds).
Although you may think it stupid, from my point of view it shows that it is possible to coerce pg to behave.

With respect to the current status:

(1) The ability to set checkpoint_timeout to values smaller than 30 s could help, although obviously there would be other consequences. But the ability to avoid periodic offline time looks like a desirable objective.

(2) I still think that a parameter to force bgwriter to write more stuff could help, but this is not tested.

(3) Any other effective idea to configure for responsiveness is welcome!

If someone wants to repeat these tests, it is easy and only takes a few minutes:

  sh> createdb test
  sh> pgbench -i -s 100 -F 95 test
  sh> pgbench -M prepared -N -R 25 -L 200 -c 2 -T 5000 -P 1 test > pgb.out

Note: the -L option to limit latency is a submitted patch. Without it, unresponsiveness shows up as increasing lag time.

-- Fabien.
On 2014-08-27 09:32:16 +0200, Fabien COELHO wrote:
> Hello Andres,
>
> > [...]
> > I think you're misunderstanding how spread checkpoints work.
>
> Yep, definitely:-) On the other hand I thought I was seeking something
> "simple", namely correct latency under a small load, which I would expect
> out of the box.

Yea. The current situation *sucks*. Both from the utterly borked behaviour of ext4 and other filesystems, and from the lack of a workaround in postgres.

> > When the checkpointer process starts a spread checkpoint, it first writes
> > all buffers to the kernel in a paced manner. That pace is determined by
> > checkpoint_completion_target and checkpoint_timeout.
>
> This pacing does not seem to work, even at a slow pace.

It definitely does in some cases. What's your evidence the pacing doesn't work? Afaik it's the fsync that causes the problem, not the writes themselves.

> I do not believe in delaying writing to disk as much as possible as a viable
> strategy to handle a small load. However, to show my good will, I have
> tried to follow your advice: I've launched a 5000 second test with 50
> segments, 30 min timeout, 0.9 completion target, at 25 tps, which is less
> than 1/10 of the maximum throughput.
>
> There are only two time-triggered checkpoints:
>
>   LOG: checkpoint starting: time
>   LOG: checkpoint complete: wrote 48725 buffers (47.6%);
>   1 transaction log file(s) added, 0 removed, 0 recycled;
>   write=1619.750 s, sync=27.675 s, total=1647.932 s;
>   sync files=14, longest=27.593 s, average=1.976 s
>
>   LOG: checkpoint starting: time
>   LOG: checkpoint complete: wrote 22533 buffers (22.0%);
>   0 transaction log file(s) added, 0 removed, 23 recycled;
>   write=826.919 s, sync=9.989 s, total=837.023 s;
>   sync files=8, longest=6.742 s, average=1.248 s

The write pacing itself doesn't seem to be bad. The bad thing is the 'sync' times here. Those are *NOT* paced, and the kernel has probably delayed flushing out much of the writes...

> (1) The ability to set checkpoint_timeout to values smaller than 30 s could
> help, although obviously there would be other consequences. But the ability
> to avoid periodic offline time looks like a desirable objective.

I'd rather not do that. It's an utterly horrible hack to go this way.

> (2) I still think that a parameter to force bgwriter to write more stuff
> could help, but this is not tested.

It's going to be random writes. That's not going to be helpful.

> (3) Any other effective idea to configure for responsiveness is welcome!

I've a couple of ideas how to improve the situation, but so far I've not had the time to investigate them properly. Would you be willing to test a couple of simple patches? Did you test xfs already?

Greetings,

Andres Freund

--
Andres Freund                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
> [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
> that causes the problem, not the writes themselves.

Hmmm. My (poor) understanding is that fsync would work fine if everything had already been written beforehand:-) that is, it would have nothing to do but check that all is already written. If there is remaining write work, it starts doing it "now", with the disastrous effects I'm complaining about.

When I say "pacing does not work", I mean that things were not written out to disk by the OS; it does not mean that pg did not ask for it. However, it does not make much sense for an OS scheduler to sit on tens of thousands of pages to write for several minutes and do nothing about it... So I'm wondering.

> [...]
>> (1) The ability to set checkpoint_timeout to values smaller than 30 s could
>> help, although obviously there would be other consequences. But the ability
>> to avoid periodic offline time looks like a desirable objective.
>
> I'd rather not do that. It's an utterly horrible hack to go this way.

Hmmm. It does solve the issue, though:-) It would be the administrator's choice. It is better than nothing, which is the current status.

>> (2) I still think that a parameter to force bgwriter to write more stuff
>> could help, but this is not tested.
>
> It's going to be random writes. That's not going to be helpful.

The -N small OLTP load on a large (GB) table *is* random writes anyway, whether they occur at checkpoint time or at any other time. Random writes are fine in this case; the load is small, there should be no problem.

>> (3) Any other effective idea to configure for responsiveness is welcome!
>
> I've a couple of ideas how to improve the situation, but so far I've not
> had the time to investigate them properly. Would you be willing to test
> a couple of simple patches?

I can test a couple of patches. I already did one on someone's advice (make bgwriter scan all of shared buffers in 1 s instead of 120 s), without positive effect.

> Did you test xfs already?

No. I cannot without reinstalling, which I cannot do on a remote host, and I will probably not have time to do it when I have physical access. Only one partition on the host. My mistake. Will not do it again. Shame on me.

If someone out there has an XFS setup, it is very easy to test and only takes a couple of minutes, really. It takes less time to do it than to write a mail about it afterwards:-)

I have tested FreeBSD/UFS with similar results, a few periodic offline periods. A UFS journaled file system is probably not ideal for database work, but yet again the load is small; it should be able to cope without going offline.

-- Fabien.
On 2014-08-27 11:05:52 +0200, Fabien COELHO wrote:
> I can test a couple of patches. I already did one on someone's advice (make
> bgwriter scan all of shared buffers in 1 s instead of 120 s), without
> positive effect.

I've quickly cobbled together the attached patch (which at least doesn't seem to crash & burn). It tries to trigger pages being flushed out during the paced phase of checkpoints instead of the fsync phase. The sync_on_checkpoint_flush setting can be used to enable/disable that behaviour.

I'd be interested to hear whether that improves your latency numbers. I unfortunately don't have more time to spend on this right now :(.

Greetings,

Andres Freund

--
Andres Freund                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2014-08-27 11:14:46 +0200, Andres Freund wrote: > On 2014-08-27 11:05:52 +0200, Fabien COELHO wrote: > > I can test a couple of patches. I already did one on someone advice (make > > bgwriter round all stuff in 1s instead of 120s, without positive effect. > > I've quickly cobbled together the attached patch (which at least doesn't > seem to crash & burn). It tries to trigger pages being flushed out > during the paced phase of checkpoints instead of the fsync phase. The > sync_on_checkpoint_flush can be used to enable/disable that behaviour. > > I'd be interested to hear whether that improves your latency numbers. I > unfortunately don't have more time to spend on this right now :(. And actually attached. Note that it's linux only... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
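[The patch itself is only available as a list attachment and is not reproduced here. As a rough, hedged illustration of the general idea described above - flushing during the paced write phase so the final fsync has little left to do - here is the toy write loop from the earlier sketch extended with sync_file_range() (Linux only). The file name and sizes are invented, and this standalone sketch is a guess at the approach, not the actual patch.]

    #define _GNU_SOURCE             /* for sync_file_range() */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define PAGE_SZ 8192
    #define NPAGES  1024

    int main(void)
    {
        char page[PAGE_SZ];
        memset(page, 'x', sizeof(page));

        int fd = open("fake_relation", O_CREAT | O_WRONLY | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        for (int i = 0; i < NPAGES; i++)
        {
            if (pwrite(fd, page, PAGE_SZ, (off_t) i * PAGE_SZ) != PAGE_SZ)
            { perror("pwrite"); return 1; }

            /* Ask the kernel to start writeback of this range now, without
             * waiting for completion; dirty data then trickles out during
             * the paced phase instead of piling up for the final fsync(). */
            if (sync_file_range(fd, (off_t) i * PAGE_SZ, PAGE_SZ,
                                SYNC_FILE_RANGE_WRITE) != 0)
                perror("sync_file_range");
        }

        /* The fsync() still guarantees durability, but it should now find
         * little or nothing left to flush. */
        if (fsync(fd) != 0) { perror("fsync"); return 1; }
        close(fd);
        return 0;
    }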
On 2014-08-27 11:19:22 +0200, Andres Freund wrote:
> On 2014-08-27 11:14:46 +0200, Andres Freund wrote:
> > I've quickly cobbled together the attached patch (which at least doesn't
> > seem to crash & burn). It tries to trigger pages being flushed out
> > during the paced phase of checkpoints instead of the fsync phase. The
> > sync_on_checkpoint_flush setting can be used to enable/disable that
> > behaviour.
> >
> > I'd be interested to hear whether that improves your latency numbers. I
> > unfortunately don't have more time to spend on this right now :(.
>
> And actually attached. Note that it's Linux only...

I got curious and ran a quick test.

config:
  log_checkpoints=on
  checkpoint_timeout=1min
  checkpoint_completion_target=0.95
  checkpoint_segments=100
  synchronous_commit=on
  fsync=on
  huge_pages=on
  max_connections=200
  shared_buffers=6GB
  wal_level=hot_standby

off:
  $ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
  starting vacuum...end.
  transaction type: TPC-B (sort of)
  scaling factor: 150
  query mode: prepared
  number of clients: 16
  number of threads: 16
  duration: 120 s
  number of transactions actually processed: 20189
  latency average: 23.136 ms
  latency stddev: 59.044 ms
  rate limit schedule lag: avg 4.599 (max 199.975) ms
  number of skipped transactions: 1345 (6.246 %)
  tps = 167.664290 (including connections establishing)
  tps = 167.675679 (excluding connections establishing)

  LOG: checkpoint starting: time
  LOG: checkpoint complete: wrote 12754 buffers (1.6%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=56.928 s, sync=3.639 s, total=60.749 s; sync files=20, longest=2.741 s, average=0.181 s
  LOG: checkpoint starting: time
  LOG: checkpoint complete: wrote 12269 buffers (1.6%); 0 transaction log file(s) added, 0 removed, 6 recycled; write=20.701 s, sync=8.568 s, total=29.444 s; sync files=10, longest=3.568 s, average=0.856 s

on:
  $ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
  starting vacuum...end.
  transaction type: TPC-B (sort of)
  scaling factor: 150
  query mode: prepared
  number of clients: 16
  number of threads: 16
  duration: 120 s
  number of transactions actually processed: 21327
  latency average: 20.735 ms
  latency stddev: 14.643 ms
  rate limit schedule lag: avg 4.965 (max 185.003) ms
  number of skipped transactions: 1 (0.005 %)
  tps = 177.214391 (including connections establishing)
  tps = 177.225476 (excluding connections establishing)

  LOG: checkpoint starting: time
  LOG: checkpoint complete: wrote 12217 buffers (1.6%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=57.022 s, sync=0.203 s, total=57.377 s; sync files=19, longest=0.033 s, average=0.010 s
  LOG: checkpoint starting: time
  LOG: checkpoint complete: wrote 13185 buffers (1.7%); 0 transaction log file(s) added, 0 removed, 6 recycled; write=56.628 s, sync=0.019 s, total=56.803 s; sync files=11, longest=0.017 s, average=0.001 s

That machine is far from idle right now, so the noise is pretty high. But rather nice initial results.

Greetings,

Andres Freund

--
Andres Freund                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hello Amit,

> I see there is some merit in your point, which is to make bgwriter more
> useful than its current form. I could see 3 top-level points to think
> about, where improvement in any of them might improve the current
> situation:
> a. Scanning of the buffer pool to find the dirty buffers that can
> be flushed.

Yep, that could be more aggressive, or the aggressiveness could be made into an adjustable parameter. There is a comment about that in the source. However, I tested setting the round to 1 s instead of 120 s, and it had no positive effect on my test.

> b. Deciding on the criteria for flushing a buffer.

Indeed. For instance, when the IO load is low, there is no reason not to send out some buffers; it is a free lunch even if it must be done again later. If the load is high, that is another matter. So it would mean being able to decide whether the current IO load is low or not. It could be different on different disks... Hmmm, not that simple.

> c. Sync of buffers.

Yes. That is more or less the effect of my rough approach of calling CHECKPOINT every 0.2 seconds. Andres Freund just implemented a quick Linux-specific patch which does that within the checkpoint pacing, and which greatly improves the situation, although it could still be a little better.

> [separate xlog test]

I do not have a setup available for that right now. Not sure this would be an issue for low loads.

-- Fabien.
On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>> [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
>> that causes the problem, not the writes themselves.
>
> Hmmm. My (poor) understanding is that fsync would work fine if everything
> had already been written beforehand:-) that is, it would have nothing to do
> but check that all is already written. If there is remaining write work, it
> starts doing it "now", with the disastrous effects I'm complaining about.
>
> When I say "pacing does not work", I mean that things were not written out
> to disk by the OS; it does not mean that pg did not ask for it.
>
> However, it does not make much sense for an OS scheduler to sit on tens of
> thousands of pages to write for several minutes and do nothing about it...
> So I'm wondering.

Maybe what's needed is to slightly tweak the checkpoint logic to give the kernel some time to flush buffers.

Correct me if I'm wrong, but the checkpointer does the sync right after the reads. Of course there will be about 30s-worth of accumulated writes (it's the default amount of time the kernel holds on to dirty buffers).

Perhaps it should be delayed a small time, say 30s, to let the kernel do the writing on its own.
On Wed, Aug 27, 2014 at 10:10 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> [...]
>
> Maybe what's needed is to slightly tweak the checkpoint logic to give the
> kernel some time to flush buffers.
>
> Correct me if I'm wrong, but the checkpointer does the sync right
> after the reads. Of course there will be about 30s-worth of
> accumulated writes (it's the default amount of time the kernel holds
> on to dirty buffers).
>
> Perhaps it should be delayed a small time, say 30s, to let the kernel
> do the writing on its own.

Errata: just after the writes :-p
On 2014-08-27 10:10:49 -0300, Claudio Freire wrote: > On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote: > >> [...] What's your evidence the pacing doesn't work? Afaik it's the fsync > >> that causes the problem, not the the writes themselves. > > > > > > Hmmm. My (poor) understanding is that fsync would work fine if everything > > was already written beforehand:-) that is it has nothing to do but assess > > that all is already written. If there is remaining write work, it starts > > doing it "now" with the disastrous effects I'm complaining about. > > > > When I say "pacing does not work", I mean that things where not written out > > to disk by the OS, it does not mean that pg did not ask for it. > > > > However it does not make much sense for an OS scheduler to wait several > > minutes with tens of thousands of pages to write and do nothing about it... > > So I'm wondering. > > Maybe what's needed, is to slightly tweak checkpoint logic to give the > kernel some time to flush buffers. > > Correct me if I'm wrong, but the checkpointer does the sync right > after the reads. Of course there will be about 30s-worth of > accumulated writes (it's the default amount of time the kernel holds > on to dirty buffers). > > Perhaps it should be delayed a small time, say 30s, to let the kernel > do the writing on its own. The kernel *starts* to write out pages after 30s, it doesn't finish doing so. So I don't think that's going to work. I think a somewhat smarter version of the explicit flushes in the hack^Wpatch I posted nearby is going to more likely to be successful. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Aug 27, 2014 at 10:15 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-08-27 10:10:49 -0300, Claudio Freire wrote: >> On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote: >> >> [...] What's your evidence the pacing doesn't work? Afaik it's the fsync >> >> that causes the problem, not the the writes themselves. >> > >> > >> > Hmmm. My (poor) understanding is that fsync would work fine if everything >> > was already written beforehand:-) that is it has nothing to do but assess >> > that all is already written. If there is remaining write work, it starts >> > doing it "now" with the disastrous effects I'm complaining about. >> > >> > When I say "pacing does not work", I mean that things where not written out >> > to disk by the OS, it does not mean that pg did not ask for it. >> > >> > However it does not make much sense for an OS scheduler to wait several >> > minutes with tens of thousands of pages to write and do nothing about it... >> > So I'm wondering. >> >> Maybe what's needed, is to slightly tweak checkpoint logic to give the >> kernel some time to flush buffers. >> >> Correct me if I'm wrong, but the checkpointer does the sync right >> after the reads. Of course there will be about 30s-worth of >> accumulated writes (it's the default amount of time the kernel holds >> on to dirty buffers). >> >> Perhaps it should be delayed a small time, say 30s, to let the kernel >> do the writing on its own. > > The kernel *starts* to write out pages after 30s, it doesn't finish > doing so. So I don't think that's going to work. > > I think a somewhat smarter version of the explicit flushes in the > hack^Wpatch I posted nearby is going to more likely to be successful. That path is "dangerous" (as in, may not work as intended) if the filesystem doesn't properly understand range flushes (ehem, like ext3).
On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
> > I think a somewhat smarter version of the explicit flushes in the
> > hack^Wpatch I posted nearby is going to more likely to be successful.
>
> That path is "dangerous" (as in, may not work as intended) if the
> filesystem doesn't properly understand range flushes (ehem, like
> ext3).

The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't an operation
guaranteeing durability. And - afaik - it is not implemented in a
filesystem-specific manner. It just initiates writeback for individual
pages. It doesn't cause barriers, journal flushes or anything else to be
issued. That's still done by the fsync() later.

The big disadvantage is that it's an OS-specific solution, but I don't think
we're going to find anything that isn't in this area.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
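To make the mechanism concrete, here is a minimal Linux-only C sketch of the
kind of writeback hint being discussed; it is not Andres's patch, just the
bare system call with an assumed 8 kB block size and a hypothetical file
name:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

#define BLCKSZ 8192             /* assumed page size, as in a default build */

/* Ask the kernel to start writing back a range of pages.  No durability is
 * promised: no barriers, no journal flush; a later fsync() still does that. */
static void
hint_writeback(int fd, long first_block, long nblocks)
{
    if (sync_file_range(fd, (off_t) first_block * BLCKSZ,
                        (off_t) nblocks * BLCKSZ,
                        SYNC_FILE_RANGE_WRITE) != 0)
        perror("sync_file_range");      /* purely advisory, so just report */
}

int
main(void)
{
    int fd = open("some_relation_segment", O_RDWR);   /* hypothetical file */

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    hint_writeback(fd, 0, 128);     /* initiate writeback of the first 1MB */
    return 0;
}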
On Wed, Aug 27, 2014 at 10:20 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-08-27 10:17:06 -0300, Claudio Freire wrote: >> > I think a somewhat smarter version of the explicit flushes in the >> > hack^Wpatch I posted nearby is going to more likely to be successful. >> >> >> That path is "dangerous" (as in, may not work as intended) if the >> filesystem doesn't properly understand range flushes (ehem, like >> ext3). > > The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't a operation > guaranteeing durability. And - afaik - not implemented in a file system > specific manner. It just initiates writeback for individual pages. It > doesn't cause barrier, journal flushes or anything to be issued. That's > still done by the fsync() later. > > The big disadvantage is that it's a OS specific solution, but I don't > think we're going to find anything that isn't in this area. I guess it should work then.
On Wed, Aug 27, 2014 at 3:32 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> Hello Andres,
>
> [...]
>
>> I think you're misunderstanding how spread checkpoints work.
>
> Yep, definitely:-) On the other hand I though I was seeking something
> "simple", namely correct latency under small load, that I would expect out
> of the box.
>
> What you describe is reasonable, and is more or less what I was hoping for,
> although I thought that bgwriter was involved from the start and checkpoint
> would only do what is needed in the end. My mistake.

If all you want is to avoid the write storms when fsyncs start happening on
slow storage, can you not just adjust the kernel vm.dirty* tunables to
start making the kernel write out dirty buffers much sooner instead of
letting them accumulate until fsyncs force them out all at once?

--
Aidan Van Dyk                 Create like a god,
aidan@highrise.ca             command like a king,
http://www.highrise.ca/       work like a slave.
On 2014-08-27 10:32:19 -0400, Aidan Van Dyk wrote:
> On Wed, Aug 27, 2014 at 3:32 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> > Hello Andres,
> >
> > [...]
> >
> >> I think you're misunderstanding how spread checkpoints work.
> >
> > Yep, definitely:-) On the other hand I though I was seeking something
> > "simple", namely correct latency under small load, that I would expect out
> > of the box.
> >
> > What you describe is reasonable, and is more or less what I was hoping
> > for, although I thought that bgwriter was involved from the start and
> > checkpoint would only do what is needed in the end. My mistake.
>
> If all you want is to avoid the write storms when fsyncs start happening on
> slow storage, can you not just adjust the kernel vm.dirty* tunables to
> start making the kernel write out dirty buffers much sooner instead of
> letting them accumulate until fsyncs force them out all at once?

Well. For one, that's an OS-specific global tunable requiring root to be
adjustable. For another, we actually do want some buffering: if a backend
writes out a buffer's data itself (happens very frequently) it *should* get
buffered... So I don't think a process-independent tunable is going to do
the trick.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hello, > If all you want is to avoid the write storms when fsyncs start happening on > slow storage, can you not just adjust the kernel vm.dirty* tunables to > start making the kernel write out dirty buffers much sooner instead of > letting them accumulate until fsyncs force them out all at once? I can try, when I have finished with the current round of testing. Note that, as Andres put it, it currently "sucks". Having to tinker with linux kernel parameters just to handle a small load without being offline 10% of the time does not look very good, so even if it works, ISTM that a pg side solution is desirable. -- Fabien.
On 08/27/2014 04:20 PM, Andres Freund wrote:
> On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
>>> I think a somewhat smarter version of the explicit flushes in the
>>> hack^Wpatch I posted nearby is going to more likely to be successful.
>>
>> That path is "dangerous" (as in, may not work as intended) if the
>> filesystem doesn't properly understand range flushes (ehem, like
>> ext3).
>
> The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't an operation
> guaranteeing durability. And - afaik - it is not implemented in a
> filesystem-specific manner. It just initiates writeback for individual
> pages. It doesn't cause barriers, journal flushes or anything else to be
> issued. That's still done by the fsync() later.
>
> The big disadvantage is that it's an OS-specific solution, but I don't
> think we're going to find anything that isn't in this area.

I've been thinking for a long time that we should interleave the writes and
the fsyncs. That still forces up to 1GB of dirty buffers to disk at once,
causing a spike, but at least not more than that. Also, the scheduling of a
spread checkpoint is currently a bit bogus; we don't take into account the
time needed for the fsync phase.

A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and
write them out in order
(http://www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp).
The performance impact of that was inconclusive, but one thing that it
allows nicely is to interleave the fsyncs, so that you write all the buffers
for one file, then fsync it, then the next file and so on. IIRC the biggest
worry with that patch was that sorting the buffers requires a fairly large
amount of memory, and making a large allocation in the checkpointer might
cause an out-of-memory, which would be bad.

I don't think anyone's seriously worked on this area since. If the impact on
responsiveness or performance is significant, I'm pretty sure the OOM
problem could be alleviated somehow.

For the kicks, I wrote a quick & dirty patch for interleaving the fsyncs,
see attached. It works by repeatedly scanning the buffer pool, writing
buffers belonging to a single relation segment at a time. I would be
interested to hear how this performs in your test case.

- Heikki
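As a rough illustration of the idea (not the attached patch itself), a
self-contained C sketch with stand-in types and stubbed I/O helpers: each
pass over a fake buffer pool writes the dirty pages of one file, then fsyncs
that file before moving on to the next.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct
{
    int     file_id;        /* which relation segment the page belongs to */
    long    blocknum;
    bool    dirty;
} FakeBuffer;

/* stand-in I/O helpers; a real implementation would do pwrite()/fsync() */
static void write_buffer(const FakeBuffer *buf)
{
    printf("write file %d block %ld\n", buf->file_id, buf->blocknum);
}

static void fsync_file(int file_id)
{
    printf("fsync file %d\n", file_id);
}

/* One pass per file: write only that file's dirty pages, then sync it while
 * few freshly dirtied pages have had time to pile up behind the fsync(). */
static void
checkpoint_interleaved(FakeBuffer *pool, size_t nbuffers, int nfiles)
{
    for (int file = 0; file < nfiles; file++)
    {
        bool    wrote_any = false;

        for (size_t i = 0; i < nbuffers; i++)
        {
            if (pool[i].dirty && pool[i].file_id == file)
            {
                write_buffer(&pool[i]);
                pool[i].dirty = false;
                wrote_any = true;
            }
        }
        if (wrote_any)
            fsync_file(file);
    }
}

int
main(void)
{
    FakeBuffer pool[] = {
        {1, 7, true}, {0, 3, true}, {1, 2, true}, {0, 9, false},
    };

    checkpoint_interleaved(pool, sizeof(pool) / sizeof(pool[0]), 2);
    return 0;
}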
On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote: > On 08/27/2014 04:20 PM, Andres Freund wrote: > >On 2014-08-27 10:17:06 -0300, Claudio Freire wrote: > >>>I think a somewhat smarter version of the explicit flushes in the > >>>hack^Wpatch I posted nearby is going to more likely to be successful. > >> > >> > >>That path is "dangerous" (as in, may not work as intended) if the > >>filesystem doesn't properly understand range flushes (ehem, like > >>ext3). > > > >The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't a operation > >guaranteeing durability. And - afaik - not implemented in a file system > >specific manner. It just initiates writeback for individual pages. It > >doesn't cause barrier, journal flushes or anything to be issued. That's > >still done by the fsync() later. > > > >The big disadvantage is that it's a OS specific solution, but I don't > >think we're going to find anything that isn't in this area. > > I've been thinking for a long time that we should interleave the writes and > the fsyncs. That still forces up to 1GB of dirty buffers to disk at once, > causing a spike, but at least not more than that. I think there are considerable benefits to sorting checkpoint io by file and offset in that file. Obviously the likelihood of sequential IO is higher; but there's also less chance that other processes write out dirty buffers that have to be flushed out by the fsync() in a drive by manner. I don't think it's good enough to solve the problem Fabien is talking about though. 1GB is heck of a lot of IO to submit at once. That'll cause latency issues unless you have a write back controller with more than 1GB of cache. So I think we need both, control over the amount of dirty data in the kernel *and* sorted writeouts. To the point that I've been tinkering with converting buftable.c into a radix tree. That'd allow to efficiently scan all buffers of a filenode in order. Useful for checkpoints, but also for lots of other things. Unfortunately our buffer tags are freakishly huge, making the worst case memory requirements and the depth of tree quite bad. > A long time ago, Itagaki Takahiro wrote a patch sort the buffers and write > them out in order (http://www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp). > The performance impact of that was inconclusive, but one thing that it > allows nicely is to interleave the fsyncs, so that you write all the buffers > for one file, then fsync it, then next file and so on. IIRC the biggest > worry with that patch was that sorting the buffers requires a fairly large > amount of memory, and making a large allocation in the checkpointer might > cause an out-of-memory, which would be bad. > > I don't think anyone's seriously worked on this area since. If the impact on > responsiveness or performance is significant, I'm pretty sure the OOM > problem could be alleviated somehow. It's a major problem imo. What I'd been thinking of is to checkpoint writeout in batches. Collect 100k buffers, sort them, write them out. Go to the next 100k. > For the kicks, I wrote a quick & dirty patch for interleaving the fsyncs, > see attached. It works by repeatedly scanning the buffer pool, writing > buffers belonging to a single relation segment at a time. I would be > interested to hear how this performs in your test case. I bet it's not fundamentally changing the amount of transactions that don't make the deadline - there's more than enough dirty buffers in one 1GB segment to cause issues. 
But I think it might already be a significant benefit for peak throughput
*and* latency if you combine it with my approach of initiating writeout to
disk during the whole sync. My patch can slow things down considerably in
the worst case by causing superfluous random IO, which your patch should
alleviate.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
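To make "sorted by file and offset" concrete, a small C sketch of a
comparator over checkpoint write entries; the field layout only approximates
a buffer tag and is an assumption, not PostgreSQL's actual definition.

#include <stdint.h>
#include <stdlib.h>

typedef struct
{
    uint32_t    tablespace;     /* fields approximate a buffer tag */
    uint32_t    database;
    uint32_t    relfilenode;
    int32_t     forknum;
    uint32_t    blocknum;
    int32_t     buf_id;         /* index of the dirty buffer to write */
} CkptSortEntry;

/* Order entries so that all writes for one file come together, in block
 * order, which is what gives the OS a chance at sequential I/O. */
static int
ckpt_entry_cmp(const void *a, const void *b)
{
    const CkptSortEntry *x = a;
    const CkptSortEntry *y = b;

    if (x->tablespace != y->tablespace)
        return x->tablespace < y->tablespace ? -1 : 1;
    if (x->database != y->database)
        return x->database < y->database ? -1 : 1;
    if (x->relfilenode != y->relfilenode)
        return x->relfilenode < y->relfilenode ? -1 : 1;
    if (x->forknum != y->forknum)
        return x->forknum < y->forknum ? -1 : 1;
    if (x->blocknum != y->blocknum)
        return x->blocknum < y->blocknum ? -1 : 1;
    return 0;
}

int
main(void)
{
    CkptSortEntry entries[] = {
        {1, 1, 42, 0, 10, 7}, {1, 1, 17, 0, 3, 2}, {1, 1, 42, 0, 2, 5},
    };

    qsort(entries, 3, sizeof(CkptSortEntry), ckpt_entry_cmp);
    return 0;
}

Any total order that groups a file's blocks together would do; the exact key
order here is just one reasonable choice.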
> off:
>
> $ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
> number of skipped transactions: 1345 (6.246 %)
>
> on:
>
> $ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
> number of skipped transactions: 1 (0.005 %)
>
> That machine is far from idle right now, so the noise is pretty high.

What is the OS and FS? Could it be XFS?

> But rather nice initial results.

Indeed, I can confirm. I did 5000s 25tps tests:
 - Off: 8002 transactions lost (6.3%)
 - On: 158 transactions "lost" (0.12%).

Although it is still 13 times larger than the 12 (0.01%) lost with my every
0.2s CHECKPOINT hack, it is nevertheless much much better than before!

The bad news: under an unthrottled pgbench load, the tps is divided by 2
(300 -> 150, could have been worse), *BUT* it is also much smoother; the tps
does not go to 0, but stays in the 50-100 range before the next spike.

I'm wondering about the order of operations. It seems to me that you sync
just after giving back a buffer. Maybe it would be better to pipeline it,
that is something like:

  round 0:
    send buffers 0
    sleep?

  round N:
    sync buffers N-1
    send buffers N
    sleep?

  final sync:
    sync buffers N

I have not found how to control the checkpoint pacing interval, if there is
such a thing. With a 200ms lag limit on pgbench, it would be nice if it were
less than 200ms.

I found this old thread "Add basic checkpoint sync spreading" by Greg Smith
and Simon Riggs, dating from 2010:
http://www.postgresql.org/message-id/4CE07548.4030709@2ndquadrant.com
https://commitfest.postgresql.org/action/patch_view?id=431
which ends up "returned with feedback".

-- Fabien.
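For illustration, a minimal C sketch of the pipelining suggested above, with
stubbed helpers standing in for the real write and sync steps (purely
illustrative, not checkpointer code):

#include <stdio.h>

/* stand-ins for the real work of one pacing round */
static void send_round(int round)  { printf("write batch %d\n", round); }
static void sync_round(int round)  { printf("sync batch %d\n", round); }
static void pacing_sleep(void)     { /* checkpoint pacing delay goes here */ }

/* Round N initiates the sync of the pages sent in round N-1, so writeback of
 * one batch overlaps the writes of the next; a final sync catches the last. */
static void
pipelined_checkpoint(int nrounds)
{
    for (int round = 0; round < nrounds; round++)
    {
        if (round > 0)
            sync_round(round - 1);
        send_round(round);
        pacing_sleep();
    }
    if (nrounds > 0)
        sync_round(nrounds - 1);
}

int
main(void)
{
    pipelined_checkpoint(4);
    return 0;
}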
On 2014-08-27 19:00:12 +0200, Fabien COELHO wrote:
>> off:
>>
>> $ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
>> number of skipped transactions: 1345 (6.246 %)
>>
>> on:
>>
>> $ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
>> number of skipped transactions: 1 (0.005 %)
>>
>> That machine is far from idle right now, so the noise is pretty high.
>
> What is the OS and FS? Could it be XFS?

That's linux v3.17-rc2 + ext4.

>> But rather nice initial results.
>
> Indeed, I can confirm. I did 5000s 25tps tests:
>  - Off: 8002 transactions lost (6.3%)
>  - On: 158 transactions "lost" (0.12%).
>
> Although it is still 13 times larger than the 12 (0.01%) lost with my every
> 0.2s CHECKPOINT hack, it is nevertheless much much better than before!
>
> The bad news: under an unthrottled pgbench load, the tps is divided by 2
> (300 -> 150, could have been worse), *BUT* it is also much smoother; the
> tps does not go to 0, but stays in the 50-100 range before the next spike.

Yea, I'm not surprised. With a sensible (aka larger) checkpoint_timeout the
performance penalty isn't that big, but it's there. That's why I think (as
mentioned to Heikki nearby) it needs to be combined with sorting during the
checkpoint phase.

> I'm wondering about the order of operations. It seems to me that you sync
> just after giving back a buffer.

Yep. Was just a rather quick patch...

> Maybe it would be better to pipeline it, that is something like:
>
>   round 0:
>     send buffers 0
>     sleep?
>
>   round N:
>     sync buffers N-1
>     send buffers N
>     sleep?
>
>   final sync:
>     sync buffers N

Yes, I think we're going to need to leave a bit more room for write
combining and such here. But I think it's going to be better to issue
flushes for several buffers together - just not after each write(). To be
really beneficial it needs sorted output though.

> I have not found how to control the checkpoint pacing interval, if there is
> such a thing. With a 200ms lag limit on pgbench, it would be nice if it
> were less than 200ms.

Not sure what you mean.

> I found this old thread "Add basic checkpoint sync spreading" by Greg Smith
> and Simon Riggs, dating from 2010:
> http://www.postgresql.org/message-id/4CE07548.4030709@2ndquadrant.com
> https://commitfest.postgresql.org/action/patch_view?id=431
> which ends up "returned with feedback".

I didn't really like the unapplied remainder of what was proposed in there.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hello Aidan,

> If all you want is to avoid the write storms when fsyncs start happening on
> slow storage, can you not just adjust the kernel vm.dirty* tunables to
> start making the kernel write out dirty buffers much sooner instead of
> letting them accumulate until fsyncs force them out all at once?

I tried that by setting:

  vm.dirty_expire_centisecs = 100
  vm.dirty_writeback_centisecs = 100

So it should start writing returned buffers at most 2s after they are
returned, if I understood the doc correctly, instead of at most 35s.

The result is that with a 5000s 25tps pretty small load (the system can do
300tps with the default configuration), I lost 2091 (1.7%) of transactions,
that is they were beyond the 200ms schedule limit. Another change is that
overall the lost transactions are more spread out than without this setting,
although there are still stretches of unresponsiveness.

So although the situation is significantly better, it is still far from good
with the much reduced value I tried.

-- Fabien.
On Thu, Aug 28, 2014 at 3:27 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote: > Hello Aidan, > > >> If all you want is to avoid the write storms when fsyncs start happening >> on >> slow storage, can you not just adjust the kernel vm.dirty* tunables to >> start making the kernel write out dirty buffers much sooner instead of >> letting them accumulate until fsyncs force them out all at once? > > > I tried that by setting: > vm.dirty_expire_centisecs = 100 > vm.dirty_writeback_centisecs = 100 > > So it should start writing returned buffers at most 2s after they are > returned, if I understood the doc correctly, instead of at most 35s. > > The result is that with a 5000s 25tps pretty small load (the system can do > 300tps with the default configuration), I lost 2091 (1.7%) of transactions, > that is they were beyond the 200ms schedule limit. Another change is that > overall the lost transactions are more spread than without this setting, > although there still is stretches of unresponsiveness. > > So although the situation is significantly better, it is still far from good > with the much reduced value I tried. What do the checkpoint logs look like now? (especially interested in sync times)
>> I tried that by setting:
>>   vm.dirty_expire_centisecs = 100
>>   vm.dirty_writeback_centisecs = 100
>>
>> So it should start writing returned buffers at most 2s after they are
>> returned, if I understood the doc correctly, instead of at most 35s.
>>
>> The result is that with a 5000s 25tps pretty small load (the system can do
>> 300tps with the default configuration), I lost 2091 (1.7%) of transactions,
>> that is they were beyond the 200ms schedule limit. Another change is that
>> overall the lost transactions are more spread out than without this
>> setting, although there are still stretches of unresponsiveness.
>>
>> So although the situation is significantly better, it is still far from
>> good with the much reduced value I tried.
>
> What do the checkpoint logs look like now? (especially interested in sync times)

Here is an extract:

LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 4893 buffers (4.8%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=128.430 s, sync=0.378 s, total=128.921 s; sync files=11, longest=0.375 s, average=0.034 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6200 buffers (6.1%); 0 transaction log file(s) added, 3 removed, 0 recycled; write=128.934 s, sync=0.240 s, total=129.280 s; sync files=7, longest=0.132 s, average=0.034 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6177 buffers (6.0%); 0 transaction log file(s) added, 3 removed, 0 recycled; write=130.146 s, sync=0.322 s, total=130.592 s; sync files=7, longest=0.185 s, average=0.046 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6185 buffers (6.0%); 0 transaction log file(s) added, 3 removed, 0 recycled; write=132.673 s, sync=0.143 s, total=132.909 s; sync files=5, longest=0.078 s, average=0.028 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6188 buffers (6.0%); 0 transaction log file(s) added, 3 removed, 0 recycled; write=130.415 s, sync=0.170 s, total=130.676 s; sync files=5, longest=0.132 s, average=0.034 s
LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 6203 buffers (6.1%); 0 transaction log file(s) added, 3 removed, 0 recycled; write=131.726 s, sync=0.444 s, total=132.256 s; sync files=5, longest=0.441 s, average=0.088 s
...

Basically sync is between 0.1-0.5 seconds.
I had one particularly bad stretch of unresponsiveness, which is not clearly
related to a slow checkpoint sync:

progress: 4065.0 s, 28.7 tps, lat 11.886 ms stddev 48.470, lag 18.851 ms, 0 skipped
progress: 4066.2 s, 1.6 tps, lat 952.830 ms stddev 280.673, lag 155.310 ms, 25 skipped
progress: 4067.0 s, 2.7 tps, lat 1067.730 ms stddev 316.321, lag 171.766 ms, 14 skipped
progress: 4068.4 s, 1.4 tps, lat 1147.392 ms stddev 240.337, lag 132.297 ms, 35 skipped
progress: 4069.7 s, 2.3 tps, lat 1034.543 ms stddev 213.786, lag 154.453 ms, 35 skipped
progress: 4070.2 s, 1.8 tps, lat 562.715 ms stddev 0.000, lag 182.833 ms, 12 skipped
progress: 4071.3 s, 3.6 tps, lat 600.929 ms stddev 108.232, lag 162.723 ms, 25 skipped
progress: 4072.5 s, 1.7 tps, lat 1077.187 ms stddev 77.654, lag 168.849 ms, 31 skipped
progress: 4073.3 s, 1.3 tps, lat 1298.093 ms stddev 0.000, lag 168.882 ms, 21 skipped
progress: 4074.0 s, 2.7 tps, lat 920.704 ms stddev 183.587, lag 165.333 ms, 24 skipped
progress: 4075.3 s, 2.4 tps, lat 756.051 ms stddev 118.241, lag 176.863 ms, 29 skipped
progress: 4076.1 s, 1.3 tps, lat 1424.548 ms stddev 0.000, lag 0.000 ms, 17 skipped
progress: 4077.3 s, 2.4 tps, lat 791.090 ms stddev 338.729, lag 155.403 ms, 26 skipped
progress: 4078.1 s, 27.4 tps, lat 46.834 ms stddev 198.812, lag 3.915 ms, 0 skipped

-- Fabien
On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote: > A long time ago, Itagaki Takahiro wrote a patch sort the buffers and write > them out in order (http://www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp). > The performance impact of that was inconclusive, but one thing that it > allows nicely is to interleave the fsyncs, so that you write all the buffers > for one file, then fsync it, then next file and so on. IIRC the biggest > worry with that patch was that sorting the buffers requires a fairly large > amount of memory, and making a large allocation in the checkpointer might > cause an out-of-memory, which would be bad. > > I don't think anyone's seriously worked on this area since. If the impact on > responsiveness or performance is significant, I'm pretty sure the OOM > problem could be alleviated somehow. I've dug up that patch (after a bit of fighting with the archives) and refreshed it. It's *clearly* beneficial: master: andres@awork2:~/src/postgresql$ pgbench -p 5440 -h /tmp postgres -M prepared -c 180 -j 180 -T 180 -L 100 --progress=1 starting vacuum...end. progress: 1.0 s, 2847.6 tps, lat 53.862 ms stddev 49.219 ... progress: 67.0 s, 3435.4 tps, lat 52.920 ms stddev 48.720 progress: 68.2 s, 2586.9 tps, lat 57.793 ms stddev 64.228 progress: 69.1 s, 546.5 tps, lat 294.940 ms stddev 189.546 progress: 70.0 s, 1741.3 tps, lat 134.298 ms stddev 204.740 progress: 71.0 s, 3868.8 tps, lat 48.423 ms stddev 47.934 .. progress: 89.0 s, 4022.8 tps, lat 45.601 ms stddev 40.685 progress: 90.0 s, 2463.5 tps, lat 61.907 ms stddev 64.342 progress: 91.2 s, 856.3 tps, lat 211.610 ms stddev 149.916 progress: 92.0 s, 1026.9 tps, lat 177.103 ms stddev 144.448 progress: 93.0 s, 736.9 tps, lat 254.230 ms stddev 227.339 progress: 94.1 s, 766.9 tps, lat 208.031 ms stddev 181.340 progress: 95.1 s, 979.7 tps, lat 197.014 ms stddev 193.648 progress: 96.0 s, 868.9 tps, lat 214.060 ms stddev 198.297 progress: 97.1 s, 943.4 tps, lat 178.062 ms stddev 143.224 progress: 98.0 s, 934.5 tps, lat 192.435 ms stddev 197.901 progress: 99.6 s, 623.1 tps, lat 202.954 ms stddev 165.576 progress: 100.0 s, 464.7 tps, lat 683.600 ms stddev 376.520 progress: 101.1 s, 516.0 tps, lat 395.033 ms stddev 480.346 progress: 102.0 s, 364.9 tps, lat 507.933 ms stddev 499.670 progress: 103.3 s, 592.9 tps, lat 214.123 ms stddev 273.411 progress: 104.1 s, 930.2 tps, lat 316.487 ms stddev 335.096 progress: 105.4 s, 627.6 tps, lat 262.496 ms stddev 200.690 progress: 106.1 s, 788.6 tps, lat 235.510 ms stddev 202.366 progress: 107.5 s, 644.8 tps, lat 269.020 ms stddev 223.900 progress: 108.0 s, 725.0 tps, lat 262.692 ms stddev 218.774 progress: 109.0 s, 660.0 tps, lat 272.808 ms stddev 248.501 progress: 110.0 s, 604.3 tps, lat 303.727 ms stddev 264.921 progress: 111.0 s, 723.6 tps, lat 243.224 ms stddev 229.426 progress: 112.1 s, 668.6 tps, lat 257.026 ms stddev 190.247 progress: 113.1 s, 345.0 tps, lat 492.114 ms stddev 312.082 progress: 115.4 s, 390.9 tps, lat 416.708 ms stddev 391.577 progress: 115.4 s, 14598.5 tps, lat 551.617 ms stddev 539.611 progress: 116.1 s, 161.5 tps, lat 741.611 ms stddev 485.498 progress: 117.4 s, 269.1 tps, lat 697.978 ms stddev 576.970 progress: 118.8 s, 262.3 tps, lat 674.887 ms stddev 587.848 progress: 119.1 s, 195.2 tps, lat 833.959 ms stddev 733.592 progress: 120.0 s, 3000.6 tps, lat 104.272 ms stddev 291.851 progress: 121.0 s, 3167.7 tps, lat 56.576 ms stddev 51.976 progress: 122.0 s, 3398.1 tps, lat 51.322 ms stddev 48.057 progress: 123.0 s, 3721.9 tps, lat 50.355 
ms stddev 46.994 progress: 124.0 s, 2929.3 tps, lat 50.996 ms stddev 45.553 progress: 125.0 s, 754.5 tps, lat 269.293 ms stddev 287.508 progress: 126.0 s, 3297.0 tps, lat 56.912 ms stddev 77.053 ... progress: 144.0 s, 4207.9 tps, lat 44.440 ms stddev 37.210 progress: 145.9 s, 2036.4 tps, lat 79.025 ms stddev 105.411 progress: 146.0 s, 1003.1 tps, lat 292.934 ms stddev 223.650 progress: 147.4 s, 520.8 tps, lat 318.670 ms stddev 244.596 progress: 148.0 s, 3557.3 tps, lat 71.626 ms stddev 143.174 progress: 149.0 s, 4106.1 tps, lat 43.557 ms stddev 36.444 progress: 150.0 s, 4132.3 tps, lat 43.185 ms stddev 34.611 progress: 151.0 s, 4233.3 tps, lat 43.239 ms stddev 39.121 progress: 152.0 s, 4178.2 tps, lat 43.242 ms stddev 40.377 progress: 153.0 s, 755.1 tps, lat 198.560 ms stddev 155.927 progress: 154.1 s, 773.6 tps, lat 240.044 ms stddev 192.472 progress: 155.0 s, 553.7 tps, lat 303.532 ms stddev 245.491 progress: 156.2 s, 772.7 tps, lat 242.925 ms stddev 226.754 progress: 157.1 s, 541.0 tps, lat 295.132 ms stddev 218.857 progress: 158.1 s, 716.8 tps, lat 281.823 ms stddev 227.488 progress: 159.1 s, 748.7 tps, lat 223.275 ms stddev 186.162 progress: 160.0 s, 503.0 tps, lat 311.621 ms stddev 215.952 progress: 161.1 s, 905.0 tps, lat 239.623 ms stddev 245.539 progress: 162.4 s, 360.4 tps, lat 329.583 ms stddev 250.094 progress: 163.3 s, 348.9 tps, lat 583.476 ms stddev 432.200 progress: 165.5 s, 186.1 tps, lat 765.542 ms stddev 552.133 progress: 165.5 s, 9950.2 tps, lat 1424.984 ms stddev 240.249 progress: 166.4 s, 137.3 tps, lat 1056.184 ms stddev 765.356 progress: 167.2 s, 249.7 tps, lat 793.423 ms stddev 792.666 progress: 168.6 s, 194.0 tps, lat 851.712 ms stddev 737.653 progress: 169.2 s, 136.0 tps, lat 1229.394 ms stddev 1227.723 progress: 170.6 s, 92.3 tps, lat 1569.107 ms stddev 1517.027 progress: 171.2 s, 276.5 tps, lat 1325.410 ms stddev 1252.237 progress: 172.0 s, 3881.2 tps, lat 91.122 ms stddev 366.052 progress: 174.4 s, 995.6 tps, lat 120.200 ms stddev 242.005 progress: 174.4 s, 16227.2 tps, lat 651.361 ms stddev 527.946 progress: 175.1 s, 200.0 tps, lat 808.114 ms stddev 542.989 progress: 176.1 s, 179.4 tps, lat 1024.893 ms stddev 724.887 progress: 177.4 s, 367.0 tps, lat 648.519 ms stddev 790.413 progress: 178.4 s, 2236.7 tps, lat 114.064 ms stddev 271.525 progress: 179.0 s, 3643.5 tps, lat 48.404 ms stddev 47.793 progress: 180.0 s, 3309.0 tps, lat 55.598 ms stddev 49.211 transaction type: TPC-B (sort of) scaling factor: 100 query mode: prepared number of clients: 180 number of threads: 180 duration: 180 s number of transactions actually processed: 496443 latency average: 65.245 ms latency stddev: 134.292 ms tps = 2756.518300 (including connections establishing) tps = 2756.718588 (excluding connections establishing) LOG: checkpoint starting: time LOG: checkpoint complete: wrote 139127 buffers (21.2%); 0 transaction log file(s) added, 0 removed, 8 recycled; write=53.660s, sync=6.739 s, total=60.611 s; sync files=18, longest=4.382 s, average=0.374 s LOG: checkpoint starting: time LOG: checkpoint complete: wrote 70472 buffers (10.8%); 0 transaction log file(s) added, 0 removed, 70 recycled; write=53.107s, sync=5.375 s, total=58.605 s; sync files=15, longest=3.243 s, average=0.358 s LOG: checkpoint starting: time LOG: checkpoint complete: wrote 80357 buffers (12.3%); 0 transaction log file(s) added, 0 removed, 49 recycled; write=53.526s, sync=3.896 s, total=57.573 s; sync files=15, longest=1.206 s, average=0.259 s sorted writes: starting vacuum...end. 
progress: 1.0 s, 3237.2 tps, lat 49.544 ms stddev 42.203 .. progress: 88.0 s, 3030.8 tps, lat 47.438 ms stddev 46.563 progress: 89.3 s, 919.0 tps, lat 204.152 ms stddev 172.776 progress: 90.1 s, 973.7 tps, lat 204.650 ms stddev 197.843 progress: 91.1 s, 875.5 tps, lat 200.472 ms stddev 157.295 progress: 92.0 s, 939.5 tps, lat 185.357 ms stddev 146.362 progress: 93.1 s, 975.8 tps, lat 184.445 ms stddev 167.520 progress: 94.9 s, 673.0 tps, lat 260.900 ms stddev 224.125 progress: 95.0 s, 1057.6 tps, lat 223.717 ms stddev 260.276 progress: 96.0 s, 917.7 tps, lat 201.704 ms stddev 220.395 progress: 97.0 s, 2573.0 tps, lat 78.913 ms stddev 119.458 progress: 98.0 s, 3921.5 tps, lat 44.160 ms stddev 39.147 progress: 99.0 s, 4033.3 tps, lat 43.972 ms stddev 45.921 .. progress: 110.0 s, 4057.8 tps, lat 44.283 ms stddev 47.003 progress: 111.0 s, 897.6 tps, lat 158.250 ms stddev 132.126 progress: 112.0 s, 751.6 tps, lat 249.282 ms stddev 232.964 progress: 113.0 s, 550.6 tps, lat 296.155 ms stddev 226.163 progress: 114.0 s, 1875.3 tps, lat 123.059 ms stddev 210.661 progress: 115.1 s, 4555.4 tps, lat 39.404 ms stddev 32.331 .. progress: 151.0 s, 4240.2 tps, lat 42.158 ms stddev 38.577 progress: 152.1 s, 2551.3 tps, lat 63.584 ms stddev 73.780 progress: 153.1 s, 912.5 tps, lat 189.030 ms stddev 146.658 progress: 154.2 s, 976.2 tps, lat 192.499 ms stddev 158.108 progress: 155.0 s, 1005.1 tps, lat 178.215 ms stddev 129.062 progress: 156.0 s, 1010.1 tps, lat 175.583 ms stddev 138.823 progress: 157.0 s, 1042.0 tps, lat 161.675 ms stddev 135.350 progress: 158.1 s, 766.8 tps, lat 216.367 ms stddev 234.374 progress: 159.5 s, 1025.7 tps, lat 199.366 ms stddev 228.530 progress: 160.0 s, 1178.6 tps, lat 159.156 ms stddev 130.660 progress: 161.0 s, 2904.1 tps, lat 65.751 ms stddev 79.928 progress: 162.0 s, 4189.6 tps, lat 43.492 ms stddev 38.490 .. progress: 170.0 s, 4150.9 tps, lat 41.998 ms stddev 40.931 progress: 171.1 s, 595.5 tps, lat 233.138 ms stddev 188.073 progress: 172.0 s, 622.6 tps, lat 245.023 ms stddev 279.715 progress: 173.0 s, 588.1 tps, lat 359.104 ms stddev 315.367 progress: 174.0 s, 3011.0 tps, lat 74.811 ms stddev 172.972 progress: 175.0 s, 4829.0 tps, lat 37.520 ms stddev 30.578 progress: 176.0 s, 4324.1 tps, lat 41.388 ms stddev 37.727 progress: 177.0 s, 3632.8 tps, lat 49.219 ms stddev 47.092 progress: 178.0 s, 3621.0 tps, lat 50.332 ms stddev 43.086 progress: 179.0 s, 3468.3 tps, lat 50.969 ms stddev 43.025 progress: 180.0 s, 3710.7 tps, lat 48.782 ms stddev 49.070 transaction type: TPC-B (sort of) scaling factor: 100 query mode: prepared number of clients: 180 number of threads: 180 duration: 180 s number of transactions actually processed: 634014 latency average: 51.089 ms latency stddev: 63.864 ms number of transactions above the 100.0 ms latency limit: 66921 tps = 3520.574003 (excluding connections establishing) LOG: checkpoint starting: time LOG: checkpoint complete: wrote 135453 buffers (20.7%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=53.419s, sync=3.665 s, total=57.390 s; sync files=17, longest=1.937 s, average=0.215 s LOG: checkpoint starting: time LOG: checkpoint complete: wrote 96907 buffers (14.8%); 0 transaction log file(s) added, 0 removed, 68 recycled; write=53.147s, sync=3.369 s, total=56.633 s; sync files=15, longest=1.920 s, average=0.224 s sorted writes + flush in SyncOneBuffer(): andres@awork2:~/src/postgresql$ pgbench -p 5440 -h /tmp postgres -M prepared -c 180 -j 180 -T 180 -L 100 --progress=1 starting vacuum...end. 
progress: 1.0 s, 3044.7 tps, lat 49.030 ms stddev 45.608 progress: 2.0 s, 3701.3 tps, lat 48.231 ms stddev 44.762 ... progress: 55.0 s, 4157.1 tps, lat 43.216 ms stddev 40.821 progress: 56.0 s, 4736.3 tps, lat 38.326 ms stddev 35.699 progress: 57.0 s, 2604.9 tps, lat 66.072 ms stddev 65.534 progress: 58.0 s, 1659.8 tps, lat 113.848 ms stddev 121.625 progress: 59.0 s, 2625.2 tps, lat 66.112 ms stddev 54.349 .. progress: 106.0 s, 2571.8 tps, lat 72.806 ms stddev 83.308 progress: 107.0 s, 1818.7 tps, lat 93.315 ms stddev 103.284 progress: 108.1 s, 1876.6 tps, lat 97.447 ms stddev 103.151 progress: 109.1 s, 1768.8 tps, lat 110.406 ms stddev 121.347 progress: 110.0 s, 3129.4 tps, lat 56.100 ms stddev 54.835 progress: 111.0 s, 4402.0 tps, lat 40.581 ms stddev 42.269 progress: 112.0 s, 4379.3 tps, lat 42.625 ms stddev 39.507 progress: 113.0 s, 4551.6 tps, lat 39.303 ms stddev 30.563 progress: 114.0 s, 4383.3 tps, lat 41.169 ms stddev 37.459 progress: 115.1 s, 4652.3 tps, lat 38.769 ms stddev 33.415 progress: 116.1 s, 4015.5 tps, lat 41.939 ms stddev 34.122 progress: 117.1 s, 2159.0 tps, lat 84.653 ms stddev 81.445 progress: 118.0 s, 2971.3 tps, lat 62.874 ms stddev 62.229 progress: 119.0 s, 2825.1 tps, lat 60.875 ms stddev 54.767 progress: 120.0 s, 2542.9 tps, lat 75.276 ms stddev 68.709 progress: 121.1 s, 2650.4 tps, lat 63.833 ms stddev 56.419 progress: 122.0 s, 2766.8 tps, lat 68.196 ms stddev 67.042 .. progress: 166.0 s, 2412.4 tps, lat 71.307 ms stddev 74.961 progress: 167.1 s, 1742.5 tps, lat 99.519 ms stddev 102.500 progress: 168.1 s, 1324.2 tps, lat 122.625 ms stddev 135.177 progress: 169.0 s, 2002.5 tps, lat 108.435 ms stddev 125.576 progress: 170.0 s, 2796.6 tps, lat 64.867 ms stddev 64.665 progress: 171.0 s, 3969.1 tps, lat 45.520 ms stddev 41.656 ... progress: 180.0 s, 3003.9 tps, lat 59.134 ms stddev 51.508 transaction type: TPC-B (sort of) scaling factor: 100 query mode: prepared number of clients: 180 number of threads: 180 duration: 180 s number of transactions actually processed: 584725 latency average: 55.375 ms latency stddev: 56.202 ms number of transactions above the 100.0 ms latency limit: 76644 tps = 3246.299654 (including connections establishing) tps = 3246.607040 (excluding connections establishing) LOG: checkpoint starting: time LOG: checkpoint complete: wrote 135474 buffers (20.7%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=53.260s, sync=0.079 s, total=53.577 s; sync files=19, longest=0.032 s, average=0.004 s LOG: checkpoint starting: time LOG: checkpoint complete: wrote 92770 buffers (14.2%); 0 transaction log file(s) added, 0 removed, 68 recycled; write=53.677s, sync=0.239 s, total=54.026 s; sync files=16, longest=0.093 s, average=0.014 s So, *very* clearly sorting is a benefit. Both the dips in performance are lower and the aggregate throughput is higher. Additionally flushing buffers during the flush is a clear win from the perspective of reliable performance, but noticeable performancewise. Although much less than without sorting. Now, I've used a absurdly small checkpoint_timeout here to have the time to run these benchmarks. But I'm damn sure this similar with real settings. So, what I think we need to do is: 1) Sort writes on checkpoints. It's a clear win. Personally I'm perfectly happy waving away the concerns about OOM. 
It's not that much memory:

  # SELECT pg_size_pretty(24 * ((64::int8 * 1024 * 1024 * 1024) / 8192::int8));
  ┌────────────────┐
  │ pg_size_pretty │
  ├────────────────┤
  │ 192 MB         │
  └────────────────┘

If you can afford 64GB shared buffers, 192MB memory allocated in the
checkpointer hardly is a problem. (24 is sizeof(BufferTag) + sizeof(bufid))

2) At least optionally flush dirty buffers during the buffer sync scan. The
improvements in steady performance are too big to ignore. I think we might
be able to make it an unconditional win by flushing in batches instead of
doing so for each buffer like I've done.

3) Possibly do the mdsync() after & for each filenode (or segment?) like
Heikki basically just prototyped. That'll reduce the number of dirty buffers
our fsync() will flush out unnecessarily (which will create stalls because
they'll sit in the kernel's buffer cache).

> For the kicks, I wrote a quick & dirty patch for interleaving the fsyncs,
> see attached. It works by repeatedly scanning the buffer pool, writing
> buffers belonging to a single relation segment at a time. I would be
> interested to hear how this performs in your test case.

I think that'll perform horribly on large shared buffers with many
relations. Not an uncommon scenario...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
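As a quick self-contained check of the arithmetic above, under the same
assumptions (8 kB pages, 24 bytes per sort entry):

#include <stdint.h>
#include <stdio.h>

#define BLCKSZ           8192
#define SORT_ENTRY_BYTES 24   /* sizeof(BufferTag) + sizeof(bufid), per the mail */

int
main(void)
{
    uint64_t shared_buffers = 64ULL * 1024 * 1024 * 1024;   /* 64 GB */
    uint64_t nbuffers = shared_buffers / BLCKSZ;
    uint64_t sort_bytes = nbuffers * SORT_ENTRY_BYTES;

    /* 8388608 buffers * 24 bytes = 192 MB */
    printf("%llu buffers, %.0f MB sort array\n",
           (unsigned long long) nbuffers, sort_bytes / (1024.0 * 1024.0));
    return 0;
}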
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote: >> A long time ago, Itagaki Takahiro wrote a patch sort the buffers and write >> them out in order (http://www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp). >> The performance impact of that was inconclusive, but one thing that it >> allows nicely is to interleave the fsyncs, so that you write all the buffers >> for one file, then fsync it, then next file and so on. > ... > So, *very* clearly sorting is a benefit. pg_bench alone doesn't convince me on this. The original thread found cases where it was a loss, IIRC; you will need to test many more than one scenario to prove the point. Also, it does not matter how good it looks in test cases if it causes outright failures due to OOM; unlike you, I am not prepared to just "wave away" that risk. A possible compromise is to sort a limited number of buffers ---- say, collect a few thousand dirty buffers then sort, dump and fsync them, repeat as needed. regards, tom lane
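To illustrate the compromise, a C sketch of batching with stand-in types and
a toy pool (not a patch): collect a bounded batch of dirty pages, sort it,
write it, fsync the files it touched, and repeat.

#include <stdio.h>
#include <stdlib.h>

#define BATCH_SIZE 4            /* "a few thousand" in real life; tiny here */

typedef struct { int file_id; long blocknum; } Entry;

/* fake pool of dirty pages, in no particular order */
static const Entry pool[] = {
    {2, 9}, {1, 4}, {2, 1}, {1, 7}, {3, 5}, {1, 1}, {3, 2},
};
static const size_t pool_size = sizeof(pool) / sizeof(pool[0]);

static int
entry_cmp(const void *a, const void *b)
{
    const Entry *x = a, *y = b;

    if (x->file_id != y->file_id)
        return x->file_id < y->file_id ? -1 : 1;
    return (x->blocknum > y->blocknum) - (x->blocknum < y->blocknum);
}

static void
checkpoint_in_batches(void)
{
    size_t  scan = 0;

    while (scan < pool_size)
    {
        Entry   batch[BATCH_SIZE];
        size_t  n = 0;

        /* collect the next bounded batch of dirty pages */
        while (scan < pool_size && n < BATCH_SIZE)
            batch[n++] = pool[scan++];

        /* the sort stays small no matter how big shared_buffers is */
        qsort(batch, n, sizeof(Entry), entry_cmp);

        for (size_t i = 0; i < n; i++)
            printf("write file %d block %ld\n",
                   batch[i].file_id, batch[i].blocknum);

        /* fsync each file touched by this batch (stub: just report it) */
        for (size_t i = 0; i < n; i++)
            if (i == 0 || batch[i].file_id != batch[i - 1].file_id)
                printf("fsync file %d\n", batch[i].file_id);
    }
}

int
main(void)
{
    checkpoint_in_batches();
    return 0;
}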
On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
> >> A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and
> >> write them out in order
> >> (http://www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp).
> >> The performance impact of that was inconclusive, but one thing that it
> >> allows nicely is to interleave the fsyncs, so that you write all the
> >> buffers for one file, then fsync it, then the next file and so on.
>
> > ...
> > So, *very* clearly sorting is a benefit.
>
> pg_bench alone doesn't convince me on this. The original thread found
> cases where it was a loss, IIRC; you will need to test many more than
> one scenario to prove the point.

Sure. And I'm not claiming Itagaki/your old patch is immediately going to be
ready for commit. But our checkpoint performance has sucked for years in the
field. I don't think we can wave that away.

I think the primary reason it wasn't easily visible as being beneficial back
then was that only the throughput, not the latency and such, was looked at.

> Also, it does not matter how good it looks in test cases if it causes
> outright failures due to OOM; unlike you, I am not prepared to just "wave
> away" that risk.

I'm not "waving away" any risks. If the sort buffer is allocated when the
checkpointer is started, not every time we sort, as you've done in your
version of the patch, I think that risk is pretty manageable. If we really
want to be sure nothing is happening at runtime, even if the checkpointer
was restarted, we can put the sort array in shared memory.

We're talking about (sizeof(BufferTag) + sizeof(int))/8192 ~= 0.3 % overhead
over shared_buffers here. If we decide to go that way, it's a pretty darn
small price to pay to not regularly have stalls that last minutes.

> A possible compromise is to sort a limited number of buffers --- say,
> collect a few thousand dirty buffers then sort, dump and fsync them,
> repeat as needed.

Yea, that's what I suggested nearby. But I don't really like it, because it
robs us of the chance to fsync() a relfilenode immediately after having
synced all its buffers. Doing so is rather beneficial because then fewer
independently dirtied pages have to be flushed out - reducing the impact of
the fsync().

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-08-30 13:50:40 -0400, Tom Lane wrote: >> A possible compromise is to sort a limited number of >> buffers ---- say, collect a few thousand dirty buffers then sort, dump and >> fsync them, repeat as needed. > Yea, that's what I suggested nearby. But I don't really like it, because > it robs us of the the chance to fsync() a relfilenode immediately after > having synced all its buffers. Uh, how so exactly? You could still do that. Yeah, you might fsync a rel once per sort-group and not just once per checkpoint, but it's not clear that that's a loss as long as the group size isn't tiny. regards, tom lane
On 2014-08-30 14:16:10 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
> >> A possible compromise is to sort a limited number of buffers --- say,
> >> collect a few thousand dirty buffers then sort, dump and fsync them,
> >> repeat as needed.
>
> > Yea, that's what I suggested nearby. But I don't really like it, because
> > it robs us of the chance to fsync() a relfilenode immediately after
> > having synced all its buffers.
>
> Uh, how so exactly? You could still do that. Yeah, you might fsync a rel
> once per sort-group and not just once per checkpoint, but it's not clear
> that that's a loss as long as the group size isn't tiny.

Because it wouldn't have the benefit of syncing the minimal amount of data
anymore. If lots of other relfilenodes have been synced in between, the
amount of newly dirtied pages in the OS's buffer cache (written by backends,
bgwriter) for an individual relfilenode is much higher. A fsync() on a file
with dirty data often causes *serious* latency spikes - we should try hard
to avoid superfluous calls.

As an example: Calling fsync() on pgbench_accounts's underlying files, from
outside postgres, *before* postgres even started its first checkpoint does
this:

progress: 72.0 s, 4324.9 tps, lat 41.481 ms stddev 40.567
progress: 73.0 s, 4704.9 tps, lat 38.465 ms stddev 35.436
progress: 74.0 s, 4448.5 tps, lat 40.058 ms stddev 32.634
progress: 75.0 s, 4634.5 tps, lat 39.229 ms stddev 33.463
progress: 76.8 s, 2753.1 tps, lat 48.693 ms stddev 75.309
progress: 77.1 s, 126.6 tps, lat 773.433 ms stddev 222.667
progress: 78.0 s, 183.7 tps, lat 786.401 ms stddev 395.954
progress: 79.1 s, 170.3 tps, lat 975.949 ms stddev 596.751
progress: 80.0 s, 2116.6 tps, lat 168.608 ms stddev 398.933
progress: 81.0 s, 4436.1 tps, lat 40.313 ms stddev 34.198
progress: 82.0 s, 4383.9 tps, lat 41.811 ms stddev 37.241

Note the dip from 4k tps to 130 tps. We can get a handle on that (on some
platforms at least) for writes issued during the buffer sync by forcing the
kernel to write out the pages in small increments; but I doubt we want to do
that for writes by backends themselves.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 08/30/2014 09:45 PM, Andres Freund wrote: > On 2014-08-30 14:16:10 -0400, Tom Lane wrote: >> Andres Freund <andres@2ndquadrant.com> writes: >>> On 2014-08-30 13:50:40 -0400, Tom Lane wrote: >>>> A possible compromise is to sort a limited number of >>>> buffers ---- say, collect a few thousand dirty buffers then sort, dump and >>>> fsync them, repeat as needed. >> >>> Yea, that's what I suggested nearby. But I don't really like it, because >>> it robs us of the the chance to fsync() a relfilenode immediately after >>> having synced all its buffers. >> >> Uh, how so exactly? You could still do that. Yeah, you might fsync a rel >> once per sort-group and not just once per checkpoint, but it's not clear >> that that's a loss as long as the group size isn't tiny. > > Because it wouldn't have the benefit of sycing the minimal amount of > data anymore. If lots of other relfilenodes have been synced inbetween > the amount of newly dirtied pages in the os' buffercache (written by > backends, bgwriter) for a individual relfilenode is much higher. I wonder how much of the benefit from sorting comes from sorting the pages within each file, and how much just from grouping all the writes of each file together. In other words, how much difference is there between sorting, and fsyncing between each file, or the crude patch I posted earlier. If we're going to fsync between each file, there's no need to sort all the buffers at once. It's enough to pick one file as the target - like in my crude patch - and sort only the buffers for that file. Then fsync that file and move on to the next file. That requires scanning the buffers multiple times, but I think that's OK. - Heikki
On 2014-08-31 01:50:48 +0300, Heikki Linnakangas wrote: > On 08/30/2014 09:45 PM, Andres Freund wrote: > >On 2014-08-30 14:16:10 -0400, Tom Lane wrote: > >>Andres Freund <andres@2ndquadrant.com> writes: > >>>On 2014-08-30 13:50:40 -0400, Tom Lane wrote: > >>>>A possible compromise is to sort a limited number of > >>>>buffers ---- say, collect a few thousand dirty buffers then sort, dump and > >>>>fsync them, repeat as needed. > >> > >>>Yea, that's what I suggested nearby. But I don't really like it, because > >>>it robs us of the the chance to fsync() a relfilenode immediately after > >>>having synced all its buffers. > >> > >>Uh, how so exactly? You could still do that. Yeah, you might fsync a rel > >>once per sort-group and not just once per checkpoint, but it's not clear > >>that that's a loss as long as the group size isn't tiny. > > > >Because it wouldn't have the benefit of sycing the minimal amount of > >data anymore. If lots of other relfilenodes have been synced inbetween > >the amount of newly dirtied pages in the os' buffercache (written by > >backends, bgwriter) for a individual relfilenode is much higher. > > I wonder how much of the benefit from sorting comes from sorting the pages > within each file, and how much just from grouping all the writes of each > file together. In other words, how much difference is there between sorting, > and fsyncing between each file, or the crude patch I posted earlier. I haven't implemented fsync()ing between files so far. From the io stats I'm seing the performance improvements come from the OS being able to write data back in bigger chunks. Which seems entirely reasonable. If the database and the write load are big enough, so writeback will be triggered repeatedly during one checkpoint, the OS's buffercache will have lots of nonsequential data to flush. Leading to much smaller IOs, more seeks and deeper queues (=> latency increases). > If we're going to fsync between each file, there's no need to sort all the > buffers at once. It's enough to pick one file as the target - like in my > crude patch - and sort only the buffers for that file. Then fsync that file > and move on to the next file. That requires scanning the buffers multiple > times, but I think that's OK. I really can't see that working out. Production instances of postgres with large shared_buffers settings (say 96GB in one case) have tens of thousands of relations (~34500 in the same case). And that's a database with a relatively simple schema. I've seen much worse. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi,

2014-08-31 8:10 GMT+09:00 Andres Freund <andres@2ndquadrant.com>:
> On 2014-08-31 01:50:48 +0300, Heikki Linnakangas wrote:
>> If we're going to fsync between each file, there's no need to sort all the
>> buffers at once. It's enough to pick one file as the target - like in my
>> crude patch - and sort only the buffers for that file. Then fsync that file
>> and move on to the next file. That requires scanning the buffers multiple
>> times, but I think that's OK.
>
> I really can't see that working out. Production instances of postgres
> with large shared_buffers settings (say 96GB in one case) have tens of
> thousands of relations (~34500 in the same case). And that's a database
> with a relatively simple schema. I've seen much worse.

Yeah, it is impossible in one checkpointer process. The buffer search cost
is higher than we expect. We would need a clever algorithm for efficient and
distributed buffer search using multiple processes or threads.

Regards,
--
Mitsumasa KONDO
Hello Heikki,

> For the kicks, I wrote a quick & dirty patch for interleaving the fsyncs,
> see attached. It works by repeatedly scanning the buffer pool, writing
> buffers belonging to a single relation segment at a time.

I tried this patch on the same host I used with the same "-R 25 -L 200 -T
5000", alas without significant positive effects: about 6% of transactions
were "lost", including stretches of several seconds of unresponsiveness.

Maybe this is because "pgbench -N" basically only touches 2 tables (accounts
& history), so there are few file syncs involved, thus it does not help much
with spreading a lot of writes.

2014-08-30 23:52:24.167 CEST: LOG: checkpoint starting: xlog
2014-08-30 23:54:09.239 CEST: LOG: checkpoint sync: number=1 file=base/16384/11902 time=24.529 msec
2014-08-30 23:54:23.812 CEST: LOG: checkpoint sync: number=1 file=base/16384/16397_vm time=32.547 msec
2014-08-30 23:54:24.771 CEST: LOG: checkpoint sync: number=1 file=base/16384/11873 time=557.470 msec
2014-08-30 23:54:37.419 CEST: LOG: checkpoint complete: wrote 4931 buffers (30.1%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=122.946 s, sync=10.129 s, total=133.252 s; sync files=6, longest=9.854 s, average=1.790 s

Note that given the small load and table size, pgbench implies random I/Os
and basically nothing else.

-- Fabien.
On Tue, Aug 26, 2014 at 1:02 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> Hello again,
>
>>> I have not found any mean to force bgwriter to send writes when it can.
>>> (Well, I have: create a process which sends "CHECKPOINT" every 0.2
>>> seconds... it works more or less, but this is not my point:-)
>>
>> There is scan_whole_pool_milliseconds, which currently forces bgwriter to
>> circle the buffer pool at least once every 2 minutes. It is currently
>> fixed, but it should be trivial to turn it into an experimental guc that
>> you could use to test your hypothesis.
>
> I recompiled with the variable coldly set to 1000 instead of 120000. The
> situation is slightly degraded (15% of transactions were above 200 ms
> late). However it seems that bgwriter did not write much more pages:
>
> buffers_checkpoint = 26065
> buffers_clean = 5263
> buffers_backend = 367
>
> Or I may have a problem interpreting pg_stat_bgwriter.

You should probably try it set to 200 rather than 1000, to put it on an
equal footing with the checkpoint_timeout of 0.2 seconds you reported on.

Not that I think this will improve the situation. Afterall, my theory is
that it does not matter who *writes* the pages, it only matters how they get
fsynced.

For this experiment, what was checkpoint_timeout set to?

Cheers,

Jeff
>>> There is scan_whole_pool_milliseconds, which currently forces bgwriter to
>>> circle the buffer pool at least once every 2 minutes. It is currently
>>> fixed, but it should be trivial to turn it into an experimental guc that
>>> you could use to test your hypothesis.
>>
>> I recompiled with the variable coldly set to 1000 instead of 120000. The
>> situation is slightly degraded (15% of transactions were above 200 ms
>> late). However it seems that bgwriter did not write much more pages:
>
> You should probably try it set to 200 rather than 1000, to put it on an
> equal footing with the checkpoint_timeout of 0.2 seconds you reported on.

As I understand it, the setting makes the bgwriter process scan all of
shared_buffers every this amount of time... but ISTM that the key point is
that bgwriter has no incentive to start writing out buffers anyway with its
current decision rules, and that should not change with the frequency at
which they are scanned (?)

> For this experiment, what was checkpoint_timeout set to?

AFAICR, the default, 5min.

-- Fabien.
On Tue, Sep 2, 2014 at 8:14 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>>>> There is scan_whole_pool_milliseconds, which currently forces bgwriter to
>>>> circle the buffer pool at least once every 2 minutes. It is currently
>>>> fixed, but it should be trivial to turn it into an experimental guc that
>>>> you could use to test your hypothesis.
>>>
>>> I recompiled with the variable coldly set to 1000 instead of 120000. The
>>> situation is slightly degraded (15% of transactions were above 200 ms
>>> late). However it seems that bgwriter did not write much more pages:
>>
>> You should probably try it set to 200 rather than 1000, to put it on an
>> equal footing with the checkpoint_timeout of 0.2 seconds you reported on.
>
> As I understand it, the setting makes the bgwriter process scan all of
> shared_buffers every this amount of time... but ISTM that the key point is
> that bgwriter has no incentive to start writing out buffers anyway with its
> current decision rules, and that should not change with the frequency at
> which they are scanned (?)

Ah, I see now. The usage counts are not zero, so it visits the buffer and
then leaves it alone.

Cheers,

Jeff
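For illustration, a tiny C sketch of the skip rule being described, using a
stand-in struct rather than PostgreSQL's buffer descriptor: a buffer with a
non-zero usage count is visited but left alone, so under this workload
almost nothing gets written.

#include <stdbool.h>
#include <stdio.h>

typedef struct { int usage_count; bool dirty; } FakeBufferDesc;

/* Returns true if the buffer was written, false if it was left alone. */
static bool
sweep_one_buffer(FakeBufferDesc *buf)
{
    if (buf->usage_count > 0)
        return false;           /* recently used: visit it and move on */
    if (buf->dirty)
    {
        printf("would write this buffer\n");    /* stand-in for the write */
        buf->dirty = false;
        return true;
    }
    return false;
}

int
main(void)
{
    FakeBufferDesc busy = {3, true};
    FakeBufferDesc idle = {0, true};

    printf("busy written: %d, idle written: %d\n",
           sweep_one_buffer(&busy), sweep_one_buffer(&idle));
    return 0;
}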
On Sat, Aug 30, 2014 at 2:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> If the sort buffer is allocated when the checkpointer is started, not
> every time we sort, as you've done in your version of the patch, I think
> that risk is pretty manageable. If we really want to be sure nothing is
> happening at runtime, even if the checkpointer was restarted, we can put
> the sort array in shared memory.

I don't think that allocating the array at checkpointer start time helps.
If it works, then you're strictly worse off than if you allocate it at every
checkpoint, because you're holding onto the memory all the time instead of
only when it's being used. And if it fails, what then? Sure, you can have
that copy of the checkpointer process exit, but that does nothing good. The
postmaster will keep on restarting it and it will keep on dying for lack of
memory, and no checkpoints will complete. Oops.

So it seems to me that the possibly-sensible approaches are:

1. Allocate an array when we need to sort, and if the allocation fails, have
some kind of fallback strategy, like logging a WARNING and writing the
buffers out without sorting them. If it succeeds, do the checkpoint and then
free the memory until we need it again.

2. Putting the array in shared memory, so that once the server is started,
we can be sure the memory is allocated and the sort will work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
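A C sketch of option 1 with stand-in helpers (nothing here is actual
checkpointer code): try to allocate the sort array for this cycle only, and
on failure warn and fall back to the unsorted write path.

#include <stdio.h>
#include <stdlib.h>

typedef struct { int file_id; long blocknum; } SortEntry;

/* stand-ins for collecting and writing the dirty buffers */
static size_t collect_dirty(SortEntry *out, size_t max)  { (void) out; (void) max; return 0; }
static void   write_sorted(SortEntry *entries, size_t n) { (void) entries; (void) n; }
static void   write_unsorted(void)                       { }

static void
checkpoint_write_phase(size_t nbuffers)
{
    SortEntry  *entries = malloc(nbuffers * sizeof(SortEntry));

    if (entries == NULL)
    {
        /* degrade gracefully instead of failing the checkpoint */
        fprintf(stderr, "WARNING: could not allocate sort array, "
                "writing checkpoint buffers unsorted\n");
        write_unsorted();
        return;
    }

    write_sorted(entries, collect_dirty(entries, nbuffers));

    free(entries);              /* hold the memory only while it is in use */
}

int
main(void)
{
    checkpoint_write_phase(1024);
    return 0;
}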
On 2014-09-03 17:08:12 -0400, Robert Haas wrote:
> On Sat, Aug 30, 2014 at 2:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > If the sort buffer is allocated when the checkpointer is started, not
> > every time we sort, as you've done in your version of the patch, I think
> > that risk is pretty manageable. If we really want to be sure nothing is
> > happening at runtime, even if the checkpointer was restarted, we can put
> > the sort array in shared memory.
>
> I don't think that allocating the array at checkpointer start time
> helps. If it works, then you're strictly worse off than if you
> allocate it at every checkpoint, because you're holding onto the
> memory all the time instead of only when it's being used. And if it
> fails, what then? Sure, you can have that copy of the checkpointer
> process exit, but that does nothing good. The postmaster will keep on
> restarting it and it will keep on dying for lack of memory, and no
> checkpoints will complete. Oops.

It's imo quite clearly better to keep it allocated. For one, once the
postmaster has started the checkpointer successfully you don't need to
worry about later failures to allocate memory if you allocate it once
(unless the checkpointer FATALs out, which should be exceedingly rare -
we're catching ERRORs). It's much, much more likely to succeed initially.
Secondly, it's not like there's really that much time where no
checkpointer is running.

> So it seems to me that the possibly-sensible approaches are:
>
> 1. Allocate an array when we need to sort, and if the allocation fails,
> have some kind of fallback strategy, like logging a WARNING and writing
> the buffers out without sorting them. If it succeeds, do the checkpoint
> and then free the memory until we need it again.

I think if we want to go that way, I vote for keeping the array allocated
and continuing to try to allocate it after allocation failures. And, as
you suggest, fall back to a simple sequential pass through all buffers.

> 2. Put the array in shared memory, so that once the server is started, we
> can be sure the memory is allocated and the sort will work.

But I prefer this approach. If we ever want to have more than one process
writing out data for checkpoints we're going to need it anyway. And that's
something not that far away for large setups imo, especially due to
checksums.

Greetings,

Andres Freund

--
Andres Freund                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
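For a rough sense of the memory at stake (my own arithmetic, assuming a
sort entry of a couple of dozen bytes: buffer id plus relation, fork and
block number): shared_buffers = 8GB is roughly one million 8kB buffers, so
a permanently allocated array would be on the order of 20-25MB, well under
1% of shared_buffers itself.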
On Sat, Aug 30, 2014 at 8:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
>>> A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and
>>> write them out in order
>>> (http://www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp).
>>> The performance impact of that was inconclusive, but one thing that it
>>> allows nicely is to interleave the fsyncs, so that you write all the
>>> buffers for one file, then fsync it, then the next file, and so on.
>
>> ...
>> So, *very* clearly sorting is a benefit.
>
> pg_bench alone doesn't convince me on this. The original thread found
> cases where it was a loss, IIRC; you will need to test many more than
> one scenario to prove the point.

The same objection came up last time I tried to push for sorted
checkpoints. I did not find any reference to where it caused a loss, nor
was I able to come up with a case where writing out in arbitrary order
would be better than writing out in file-sequential order. In fact, if we
ask for low latency, the OS must keep the write backlog small, which
eliminates any chance of combining writes that arrive out of order.

I have a use case where the system continuously loads data into
time-partitioned indexed tables; at every checkpoint all of the indexes of
the latest partition need to be written out. The only way I could get the
write-out to happen with sequential I/O was to set
checkpoint_completion_target to zero and ensure the OS cache allows for
enough dirty pages to absorb the whole checkpoint. The fsync that followed
did obviously nasty things to latency. Sorted checkpoints, on the other
hand, were able to do sequential I/O with checkpoint spreading and
low-latency-tuned OS virtual memory settings. I can create a benchmark
that shows this behavior if you need additional data points beyond
pgbench's OLTP workload to convince you that sorting checkpoint writes is
a good idea.

I did just come up with a case where plain sorting might cause an issue.
If the writes go to different I/O devices, then naive sorting will first
use one device and then the other, whereas writing in arbitrary order will
load-balance between the devices. Assuming that separate tablespaces are
used for separate I/O devices, it should be enough to just interleave the
writes of each tablespace, weighted by the amount of writes per
tablespace.

Regards,
Ants Aasma

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
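That interleaving idea can be sketched in a few lines (a standalone toy
with made-up tablespace names, not a patch): keep each tablespace's buffer
list sorted, and at every step issue the next write from whichever
tablespace is furthest behind its proportional share.

/* Sketch of weighted interleaving of checkpoint writes across tablespaces,
 * assuming each tablespace's buffers are already sorted. Illustrative only. */
#include <stdio.h>

#define NTABLESPACES 2

typedef struct
{
    const char *name;
    int         total;     /* writes queued for this tablespace */
    int         emitted;   /* writes issued so far */
} TsQueue;

/* Pick the tablespace that is furthest behind its fair share. */
static int
next_tablespace(TsQueue *ts, int n)
{
    int    best = -1;
    double best_frac = 2.0;

    for (int i = 0; i < n; i++)
    {
        if (ts[i].emitted >= ts[i].total)
            continue;                      /* this one is finished */
        double frac = (double) ts[i].emitted / ts[i].total;
        if (frac < best_frac)
        {
            best_frac = frac;
            best = i;
        }
    }
    return best;
}

int
main(void)
{
    /* One small and one large tablespace: the big one gets proportionally
     * more of the write slots, but both stay busy the whole time. */
    TsQueue ts[NTABLESPACES] = { { "ts_small", 3, 0 }, { "ts_big", 9, 0 } };
    int     i;

    while ((i = next_tablespace(ts, NTABLESPACES)) >= 0)
    {
        printf("write next sorted buffer from %s\n", ts[i].name);
        ts[i].emitted++;
    }
    return 0;
}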
On Thu, Sep 4, 2014 at 12:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> It's imo quite clearly better to keep it allocated. For one, once the
> postmaster has started the checkpointer successfully you don't need to
> worry about later failures to allocate memory if you allocate it once
> (unless the checkpointer FATALs out, which should be exceedingly rare -
> we're catching ERRORs). It's much, much more likely to succeed initially.
> Secondly, it's not like there's really that much time where no
> checkpointer is running.

In principle you could do the sort with the full-sized array and then
compress it to a list of buffer IDs that need to be written out. This way
most of the time you only need a small array and the large array is only
needed for a fraction of a second.

Regards,
Ants Aasma

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
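A minimal sketch of that scheme, with illustrative types: sort a full-sized
scratch array, copy out just the dirty buffer IDs in write order, and let
the scratch array go immediately afterwards.

/* Sketch of "sort a full-sized scratch array, then keep only a compact
 * list of buffer IDs to write". Types and names are illustrative. */
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    int  buf_id;
    int  dirty;
    long sort_key;   /* stands in for the on-disk ordering key */
} FullEntry;

static int
cmp_entry(const void *a, const void *b)
{
    const FullEntry *ea = a;
    const FullEntry *eb = b;

    if (ea->dirty != eb->dirty)
        return eb->dirty - ea->dirty;            /* dirty buffers first */
    return (ea->sort_key > eb->sort_key) - (ea->sort_key < eb->sort_key);
}

/* Returns a freshly allocated list of dirty buffer IDs in write order;
 * the big scratch array is no longer needed once this returns. */
static int *
compress_to_id_list(FullEntry *scratch, int nbuffers, int *n_dirty)
{
    qsort(scratch, (size_t) nbuffers, sizeof(FullEntry), cmp_entry);

    int count = 0;
    while (count < nbuffers && scratch[count].dirty)
        count++;

    int *ids = malloc((size_t) count * sizeof(int));
    if (ids == NULL)
        return NULL;                             /* caller falls back */
    for (int i = 0; i < count; i++)
        ids[i] = scratch[i].buf_id;

    *n_dirty = count;
    return ids;
}

int
main(void)
{
    FullEntry scratch[4] = {
        { 0, 1, 40 }, { 1, 0, 10 }, { 2, 1, 20 }, { 3, 1, 30 }
    };
    int  n = 0;
    int *ids = compress_to_id_list(scratch, 4, &n);

    for (int i = 0; i < n; i++)
        printf("write buffer %d\n", ids[i]);
    free(ids);
    return 0;
}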
On Thu, Sep 4, 2014 at 3:09 AM, Ants Aasma <ants@cybertec.at> wrote:
> On Thu, Sep 4, 2014 at 12:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> It's imo quite clearly better to keep it allocated. For one, once the
>> postmaster has started the checkpointer successfully you don't need to
>> worry about later failures to allocate memory if you allocate it once
>> (unless the checkpointer FATALs out, which should be exceedingly rare -
>> we're catching ERRORs). It's much, much more likely to succeed initially.
>> Secondly, it's not like there's really that much time where no
>> checkpointer is running.
>
> In principle you could do the sort with the full-sized array and then
> compress it to a list of buffer IDs that need to be written out. This
> way most of the time you only need a small array and the large array
> is only needed for a fraction of a second.

It's not the size of the array that's the problem; it's the size of the
detonation when the allocation fails.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On Thu, Sep 4, 2014 at 7:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 4, 2014 at 3:09 AM, Ants Aasma <ants@cybertec.at> wrote:
>> On Thu, Sep 4, 2014 at 12:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> It's imo quite clearly better to keep it allocated. For one, once the
>>> postmaster has started the checkpointer successfully you don't need to
>>> worry about later failures to allocate memory if you allocate it once
>>> (unless the checkpointer FATALs out, which should be exceedingly rare -
>>> we're catching ERRORs). It's much, much more likely to succeed initially.
>>> Secondly, it's not like there's really that much time where no
>>> checkpointer is running.
>>
>> In principle you could do the sort with the full-sized array and then
>> compress it to a list of buffer IDs that need to be written out. This
>> way most of the time you only need a small array and the large array
>> is only needed for a fraction of a second.
>
> It's not the size of the array that's the problem; it's the size of
> the detonation when the allocation fails.

You could use a file-backed memory array.

Or, since it's only a hint, and:
- the keys are in the buffers themselves (BufferTag), right?
- the only transition is from 'data I care about' to 'data I don't care
  about', if a buffer is concurrently evicted while sorting,
you could use a pre-allocated buffer index array and read the keys from
the buffers while sorting, without memory barriers, spinlocks, or anything
else.
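A toy version of that idea, with stand-ins for the shared buffer table (in
the real proposal the tags could change under the sort, which is tolerated
precisely because the resulting order is only a hint; in this standalone
toy they are static, and all types are hypothetical):

/* Sort an array of buffer indexes while the comparator reads the keys
 * straight out of a (here, static) shared table. Hypothetical types. */
#include <stdio.h>
#include <stdlib.h>

#define NBUF 8

typedef struct
{
    long rel;     /* stands in for the relation part of a BufferTag */
    long block;   /* block number */
} ToyTag;

static ToyTag shared_tags[NBUF];   /* stands in for shared buffer headers */

static int
cmp_buffer_index(const void *a, const void *b)
{
    const ToyTag *ta = &shared_tags[*(const int *) a];
    const ToyTag *tb = &shared_tags[*(const int *) b];

    if (ta->rel != tb->rel)
        return (ta->rel > tb->rel) - (ta->rel < tb->rel);
    return (ta->block > tb->block) - (ta->block < tb->block);
}

int
main(void)
{
    /* The index array is allocated up front (statically here), so the
     * checkpoint-time sort itself can never fail for lack of memory. */
    static int buf_index[NBUF];

    for (int i = 0; i < NBUF; i++)
    {
        buf_index[i] = i;
        shared_tags[i].rel = i % 2;
        shared_tags[i].block = NBUF - i;
    }

    qsort(buf_index, NBUF, sizeof(int), cmp_buffer_index);

    for (int i = 0; i < NBUF; i++)
        printf("buffer %d (rel %ld, block %ld)\n", buf_index[i],
               shared_tags[buf_index[i]].rel, shared_tags[buf_index[i]].block);
    return 0;
}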
On Fri, Sep 5, 2014 at 4:17 AM, didier <did447@gmail.com> wrote:
>> It's not the size of the array that's the problem; it's the size of
>> the detonation when the allocation fails.
>
> You could use a file-backed memory array.
>
> Or, since it's only a hint, and:
> - the keys are in the buffers themselves (BufferTag), right?
> - the only transition is from 'data I care about' to 'data I don't care
>   about', if a buffer is concurrently evicted while sorting,
> you could use a pre-allocated buffer index array and read the keys from
> the buffers while sorting, without memory barriers, spinlocks, or
> anything else.

If by a file-backed memory array you mean mmap() something, that doesn't
do anything to address the possibility of failure. mmap(), like malloc(),
can fail.

But I think we don't really need to have this argument. Andres's proposal
to put this in the main shared memory segment sounds fine to me. If
allocating that fails, you'll know to reduce shared_buffers and try again.
If it succeeds, you should be safe after that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Sep 5, 2014 at 3:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 5, 2014 at 4:17 AM, didier <did447@gmail.com> wrote:
>>> It's not the size of the array that's the problem; it's the size of
>>> the detonation when the allocation fails.
>>
>> You could use a file-backed memory array.
>>
>> Or, since it's only a hint, and:
>> - the keys are in the buffers themselves (BufferTag), right?
>> - the only transition is from 'data I care about' to 'data I don't care
>>   about', if a buffer is concurrently evicted while sorting,
>> you could use a pre-allocated buffer index array and read the keys from
>> the buffers while sorting, without memory barriers, spinlocks, or
>> anything else.
>
> If by a file-backed memory array you mean mmap() something, that
> doesn't do anything to address the possibility of failure. mmap(),
> like malloc(), can fail.

It's worse. If your process needs to page from disk when touching a byte
in the mmap'd array, and the disk somehow fails to work, you get a
segmentation fault.
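For anyone unfamiliar with that failure mode, a minimal sketch of such a
file-backed array (the file name is made up for illustration; on Linux the
signal raised when an I/O error hits the page fault is typically SIGBUS):

/* Minimal sketch of a file-backed array via mmap(). The mapping itself can
 * fail, and even after it succeeds, touching a page whose backing storage
 * cannot be read or written turns into a signal rather than an error
 * return the checkpointer could handle gracefully. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

int
main(void)
{
    const size_t nbytes = 1024 * 1024;
    int fd = open("sort_array.tmp", O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0 || ftruncate(fd, (off_t) nbytes) != 0)
    {
        perror("creating backing file");
        return 1;
    }

    int *array = mmap(NULL, nbytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (array == MAP_FAILED)
    {
        perror("mmap");       /* same failure possibility as malloc() */
        return 1;
    }

    /* Each store may fault a page in; if the underlying device errors out
     * at that moment, the process gets a signal, not a return code. */
    for (size_t i = 0; i < nbytes / sizeof(int); i++)
        array[i] = (int) i;

    munmap(array, nbytes);
    close(fd);
    unlink("sort_array.tmp");
    return 0;
}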