Thread: postgresql latency & bgwriter not doing its job

postgresql latency & bgwriter not doing its job

From
Fabien COELHO
Date:
Hello pgdevs,

I've been playing with pg for some time now to try to reduce the maximum 
latency of simple requests, to have a responsive server under small to 
medium load.

On an old computer with a software RAID5 HDD attached, a pgbench 
simple-update script run for some time (scale 100, fillfactor 95)
    pgbench -M prepared -N -c 2 -T 500 -P 1 ...

gives 300 tps. However this performance is really +1000 tps for a few 
seconds followed by 16 seconds at about 0 tps for the checkpoint induced 
IO storm. The server is totally unresponsive 75% of the time. That's 
bandwidth optimization for you. Hmmm... why not.

Now, given this setup, if pgbench is throttled at 50 tps (1/6 the above 
max):
    pgbench -M prepared -N -c 2 -R 50.0 -T 500 -P 1 ...

The same thing more or less happens in a delayed fashion... You get 50 tps 
for some time, followed by sections of 15 seconds at 0 tps for the 
checkpoint when the segments are full... the server is unresponsive about 
10% of the time (one in ten transactions is late by more than 200 ms).

This is not satisfying; pg should be able to handle that load easily.

The culprit I found is "bgwriter", which is basically doing nothing to 
prevent the coming checkpoint IO storm, even though there would be ample 
time to write the accumulating dirty pages so that checkpoint would find a 
clean field and pass in a blink. Indeed, at the end of the 500 seconds 
throttled test, "pg_stat_bgwriter" says:
  buffers_checkpoint = 19046
  buffers_clean = 2995

This suggests that bgwriter took it upon itself to write only about 6 pages 
per second (2995 buffers over 500 seconds), where at least 50 per second 
would have been needed for the load, a rate the hardware could have handled 
without any problem.

I have not found any means to force bgwriter to send writes when it can. 
(Well, I have: create a process which sends "CHECKPOINT" every 0.2 
seconds... it works more or less, but this is not my point:-)

Bgwriter control parameters allow controlling the maximum number of pages 
(bgwriter_lru_maxpages) written per round (bgwriter_delay), and a 
multiplier (bgwriter_lru_multiplier) used by a heuristic that estimates how 
many pages will soon be needed so as to make them available. This may be 
nice in some settings, but it is not adapted to the write-oriented OLTP 
load tested with pgbench.
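For reference, the defaults for those knobs (9.4-era values, listed here 
only to make the discussion concrete) are roughly:

  bgwriter_delay = 200ms           # sleep between bgwriter rounds
  bgwriter_lru_maxpages = 100      # max pages written per round
  bgwriter_lru_multiplier = 2.0    # multiplier on recent allocation needs

Even at the cap, 100 pages per 200 ms round would be plenty for a load that 
dirties about 50 pages per second; the point below is that the LRU heuristic 
decides almost nothing needs writing, not that the cap is too low.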

The problem is that with this update load on a database that fits in memory 
there is not much need for "new" pages, even though pages are being dirtied 
(about 50 per second), so the heuristic decides not to write much. The net 
result of all this cleverness is that when the checkpoint arrives, several 
thousand pages have to be written and the server is offline for some time.

ISTM that bgwriter lacks at least some "min pages" setting, so that it could 
be induced to clean at least that many pages per round if it can. That would 
be a start.

A better feature would be for it to adapt itself so as to take advantage of 
the available IOPS, depending on the load induced by other tasks (vacuum, 
queries...), in a preventive manner, so as to avoid delaying what can be 
done right now under a small load, and thus avoid later IO storms. This 
suggests a setting providing the expected IOPS capability of the underlying 
hardware, just as some settings already describe the expected available 
memory.

Note that this preventive approach could also improve the "bandwidth" 
measure: currently when pgbench is running at maximum speed before the 
checkpoint storm, nothing is written to disk but WAL, although some dirty 
pages could probably also be written. When the checkpoint arrives, fewer 
pages would need to be written, so the storm would be shorter.

Any thoughts on this latency issue? Am I missing something?

-- 
Fabien.



Re: postgresql latency & bgwriter not doing its job

From
Josh Berkus
Date:
On 08/25/2014 01:23 PM, Fabien COELHO wrote:
> 
> Hello pgdevs,
> 
> I've been playing with pg for some time now to try to reduce the maximum
> latency of simple requests, to have a responsive server under small to
> medium load.
> 
> On an old computer with a software RAID5 HDD attached, pgbench simple
> update script run for a some time (scale 100, fillfactor 95)
> 
>     pgbench -M prepared -N -c 2 -T 500 -P 1 ...
> 
> gives 300 tps. However this performance is really +1000 tps for a few
> seconds followed by 16 seconds at about 0 tps for the checkpoint induced
> IO storm. The server is totally unresponsive 75% of the time. That's
> bandwidth optimization for you. Hmmm... why not.

So I think that you're confusing the roles of bgwriter vs. spread
checkpoint.  What you're experiencing above is pretty common for
nonspread checkpoints on slow storage (and RAID5 is slow for DB updates,
no matter how fast the disks are), or for attempts to do spread
checkpoint on filesystems which don't support it (e.g. Ext3, HFS+).  In
either case, what's happening is that the *OS* is freezing all logical
and physical IO while it works to write out all of RAM, which makes me
suspect you're using Ext3 or HFS+.

Making the bgwriter more aggressive adds a significant risk of writing
the same pages multiple times between checkpoints, so it's not a simple fix.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: postgresql latency & bgwriter not doing its job

From
Andres Freund
Date:
Hi,

On 2014-08-25 22:23:40 +0200, Fabien COELHO wrote:
> seconds followed by 16 seconds at about 0 tps for the checkpoint induced IO
> storm. The server is totally unresponsive 75% of the time. That's bandwidth
> optimization for you. Hmmm... why not.
> 
> Now, given this setup, if pgbench is throttled at 50 tps (1/6 the above
> max):
> 
>     pgbench -M prepared -N -c 2 -R 50.0 -T 500 -P 1 ...
> 
> The same thing more or less happens in a delayed fashion... You get 50 tps
> for some time, followed by sections of 15 seconds at 0 tps for the
> checkpoint when the segments are full... the server is unresponsive about
> 10% of the time (one in ten transaction is late by more than 200 ms).

That's ext4 I guess? Did you check whether xfs yields a, err, more
predictable performance?

> It is not satisfying, pg should be able to handle that load easily.
> 
> The culprit I found is "bgwriter", which is basically doing nothing to
> prevent the coming checkpoint IO storm, even though there would be ample
> time to write the accumulating dirty pages so that checkpoint would find a
> clean field and pass in a blink.

While I agree that the current bgwriter implementation is far from good,
note that this isn't the bgwriter's job. Its goal is to prevent backends
from having to write out buffers themselves, i.e. to ensure there are clean
victim buffers when shared_buffers < working set.

Note that it would *not* be a good idea to make the bgwriter write out
everything, as much as possible - that'd turn sequential write io into
random write io.

Greetings,

Andres Freund



Re: postgresql latency & bgwriter not doing its job

From
Jeff Janes
Date:
On Monday, August 25, 2014, Fabien COELHO <coelho@cri.ensmp.fr> wrote:


The culprit I found is "bgwriter", which is basically doing nothing to prevent the coming checkpoint IO storm, even though there would be ample time to write the accumulating dirty pages so that checkpoint would find a clean field and pass in a blink. Indeed, at the end of the 500 seconds throttled test, "pg_stat_bgwriter" says:

Are you doing pg_stat_reset_shared('bgwriter') after running pgbench -i?  You don't want your steady state stats polluted by the bulk load.
 

  buffers_checkpoint = 19046
  buffers_clean = 2995

Out of curiosity, what does buffers_backend show?

In any event, this almost certainly is a red herring.  Whichever of the three ways are being used to write out the buffers, it is the checkpointer that is responsible for fsyncing them, and that is where your drama is almost certainly occurring. Writing out with one path rather than a different isn't going to change things, unless you change the fsync.
 
Also, are you familiar with checkpoint_completion_target, and what is it set to?

Cheers,

Jeff

Re: postgresql latency & bgwriter not doing its job

From
Amit Kapila
Date:
On Tue, Aug 26, 2014 at 1:53 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> Hello pgdevs,
>
> I've been playing with pg for some time now to try to reduce the maximum latency of simple requests, to have a responsive server under small to medium load.
>
> On an old computer with a software RAID5 HDD attached, pgbench simple update script run for a some time (scale 100, fillfactor 95)
>
>         pgbench -M prepared -N -c 2 -T 500 -P 1 ...
>
> gives 300 tps. However this performance is really +1000 tps for a few seconds followed by 16 seconds at about 0 tps for the checkpoint induced IO storm. The server is totally unresponsive 75% of the time. That's bandwidth optimization for you. Hmmm... why not.
>
> Now, given this setup, if pgbench is throttled at 50 tps (1/6 the above max):
>
>         pgbench -M prepared -N -c 2 -R 50.0 -T 500 -P 1 ...
>
> The same thing more or less happens in a delayed fashion... You get 50 tps for some time, followed by sections of 15 seconds at 0 tps for the checkpoint when the segments are full... the server is unresponsive about 10% of the time (one in ten transaction is late by more than 200 ms).

I think another thing to know here is why exactly the checkpoint
storm causes tps to drop so steeply.  One reason could be
that backends might need to write more WAL due to full_page_writes,
another could be contention around the buffer content_lock.

To dig more into the reason, the same tests can be tried
with full_page_writes = off and/or
synchronous_commit = off to see if WAL writes are causing
tps to go down.

Similarly for checkpoints, use checkpoint_completion_target to
spread the checkpoint writes, as suggested by Jeff as well, to see
if that can mitigate the problem.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: postgresql latency & bgwriter not doing its job

From
Fabien COELHO
Date:
Hello Josh,

> So I think that you're confusing the roles of bgwriter vs. spread
> checkpoint.  What you're experiencing above is pretty common for
> nonspread checkpoints on slow storage (and RAID5 is slow for DB updates,
> no matter how fast the disks are), or for attempts to do spread
> checkpoint on filesystems which don't support it (e.g. Ext3, HFS+).  In
> either case, what's happening is that the *OS* is freezing all logical
> and physical IO while it works to write out all of RAM, which makes me
> suspect you're using Ext3 or HFS+.

I'm using ext4 on debian wheezy with postgresql 9.4b2.

I agree that the OS may be able to help, but this aspect does not 
currently work for me at all out of the box. The "all of RAM" is really a 
few thousand 8 kB pages written randomly, a few dozen MB.

Also, if pg needs advanced OS tweaking to handle a small load, ISTM that 
it fails at simplicity:-(

As for checkpoint spreading, raising checkpoint_completion_target to 0.9 
degrades the situation (20% of transactions are more than 200 ms late 
instead of 10%, and bgwriter wrote less than 1 page per second over a 500 s 
run). So maybe there is a bug here somewhere.

> Making the bgwriter more aggressive adds a significant risk of writing
> the same pages multiple times between checkpoints, so it's not a simple fix.

Hmmm... This must be balanced with the risk of being offline. Not all 
people are interested in throughput at the price of latency, so there 
could be settings that help latency, even at the price of reducing 
throughput (average tps). After that, it is the administrator choice to 
set pg for higher throughput or lower latency.

Note that writing some "least recently used" page multiple times does not 
seem to be an issue at all for me under small/medium load, especially as 
the system has nothing else to do: if you have nothing else to do, there 
is no cost in writing a page, even if you may have to write it again some 
time later, and it helps prevent dirty pages from accumulating. So it seems 
to me that pg can help; it is not only/merely an OS issue.

-- 
Fabien.



Re: postgresql latency & bgwriter not doing its job

From
Jeff Janes
Date:
On Monday, August 25, 2014, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

I have not found any mean to force bgwriter to send writes when it can. (Well, I have: create a process which sends "CHECKPOINT" every 0.2 seconds... it works more or less, but this is not my point:-)

There is scan_whole_pool_milliseconds, which currently forces bgwriter to circle the buffer pool at least once every 2 minutes.  It is currently fixed, but it should be trivial to turn it into an experimental guc that you could use to test your hypothesis.
 
Cheers,

Jeff

Re: postgresql latency & bgwriter not doing its job

From
Fabien COELHO
Date:
[oops, wrong from, resent...]

Hello Jeff,

>> The culprit I found is "bgwriter", which is basically doing nothing to
>> prevent the coming checkpoint IO storm, even though there would be ample
>> time to write the accumulating dirty pages so that checkpoint would find a
>> clean field and pass in a blink. Indeed, at the end of the 500 seconds
>> throttled test, "pg_stat_bgwriter" says:
>
> Are you doing pg_stat_reset_shared('bgwriter') after running pgbench -i?

Yes, I did.

> You don't want your steady state stats polluted by the bulk load.

Sure!

>>   buffers_checkpoint = 19046
>>   buffers_clean = 2995
>
> Out of curiosity, what does buffers_backend show?
    buffers_backend = 157

> In any event, this almost certainly is a red herring.

Possibly. It is pretty easy to reproduce, though.

> Whichever of the three ways are being used to write out the buffers, it 
> is the checkpointer that is responsible for fsyncing them, and that is 
> where your drama is almost certainly occurring. Writing out with one 
> path rather than a different isn't going to change things, unless you 
> change the fsync.

Well, I agree partially. ISTM that the OS does not need to wait for fsync 
to start writing pages if it is receiving one minute of buffer writes at 
50 writes per second; I would have thought that some scheduler would 
start handling the flow before the fsync... So I thought that if bgwriter 
were to write the buffers it would help, but maybe there is a better way.

> Also, are you familiar with checkpoint_completion_target, and what is it
> set to?

The default 0.5. Moving to 0.9 seems to worsen the situation.

-- 
Fabien.



Re: postgresql latency & bgwriter not doing its job

From
Fabien COELHO
Date:
Hello Amit,

> I think another thing to know here is why exactly checkpoint
> storm is causing tps to drop so steeply.

Yep. Actually it is not strictly 0, but a "few" tps that I rounded to 0.
  progress: 63.0 s, 47.0 tps, lat 2.810 ms stddev 5.194, lag 0.354 ms
  progress: 64.1 s, 11.9 tps, lat 81.481 ms stddev 218.026, lag 74.172 ms
  progress: 65.2 s, 1.9 tps, lat 950.455 ms stddev 125.602, lag 1421.203 ms
  progress: 66.1 s, 4.5 tps, lat 604.790 ms stddev 440.073, lag 2418.128 ms
  progress: 67.2 s, 6.0 tps, lat 322.440 ms stddev 68.276, lag 3146.302 ms
  progress: 68.0 s, 2.4 tps, lat 759.509 ms stddev 62.141, lag 4229.239 ms
  progress: 69.4 s, 3.6 tps, lat 440.335 ms stddev 369.207, lag 4842.371 ms

Transactions are 4.8 seconds behind schedule at this point.

> One reason could be that backends might need to write more WAL due 
> Full_Page_Writes, another could be contention around buffer 
> content_lock. To dig more about the reason, the same tests can be tried 
> by making Full_Page_Writes = off and/or synchronous_commit = off to see 
> if WAL writes are causing tps to go down.

Given the small flow of updates, I do not think that there should be 
reason to get that big a write contention between WAL & checkpoint.

If tried with "full_page_write = off" for 500 seconds: same overall 
behavior, 8.5% of transactions are stuck (instead of 10%). However in 
details pg_stat_bgwriter is quite different:
  buffers_checkpoint = 13906  buffers_clean = 20748  buffers_backend = 472

That seems to suggest that bgwriter did some stuff for once, but that it did 
not change the result much in the end. This would imply that my suggestion 
to make bgwriter write more would not fix the problem alone.

With "synchronous_commit = off", the situation is much improved, with only 
0.3% of transactions stuck. Not a surprise. However, I would not recommand 
that as a solution:-)

Currently, the only way I was able to "solve" the issue while still 
writing to disk is to send "CHECKPOINT" every 0.2s, as if I had set 
"checkpoint_timeout = 0.2s" (although this is not currently allowed).

> Similarly for checkpoints, use  checkpoint_completion_target to
> spread the checkpoint_writes as suggested by Jeff as well to see
> if that can mitigate the problem.

I had already tried, and retried after Jeff suggestion. This does not seem 
to mitigate anything, on the contrary.

-- 
Fabien.



Re: postgresql latency & bgwriter not doing its job

From
Fabien COELHO
Date:
Hello again,

>> I have not found any mean to force bgwriter to send writes when it can.
>> (Well, I have: create a process which sends "CHECKPOINT" every 0.2
>> seconds... it works more or less, but this is not my point:-)
>
> There is scan_whole_pool_milliseconds, which currently forces bgwriter to
> circle the buffer pool at least once every 2 minutes.  It is currently
> fixed, but it should be trivial to turn it into an experimental guc that
> you could use to test your hypothesis.

I recompiled with the variable coldly set to 1000 instead of 120000. The 
situation is slightly degraded (15% of transactions were more than 200 ms 
late). However it seems that bgwriter did not write many more pages:
  buffers_checkpoint = 26065
  buffers_clean = 5263
  buffers_backend = 367

Or I may have a problem interpreting pg_stat_bgwriter.

It seems that changing this value is not enough to persuade bgwriter to 
write more pages. Or I may have done something wrong, but I do not know 
what.

-- 
Fabien.



Re: postgresql latency & bgwriter not doing its job

From
Andres Freund
Date:
On 2014-08-26 08:12:48 +0200, Fabien COELHO wrote:
> As for checkpoint spreading, raising checkpoint_completion_target to 0.9
> degrades the situation (20% of transactions are more than 200 ms late
> instead of 10%, bgwriter wrote less that 1 page per second, on on 500s run).
> So maybe there is a bug here somewhere.

What are the other settings here? checkpoint_segments,
checkpoint_timeout, wal_buffers?

Could you show the output of log_checkpoints during that run? Checkpoint
spreading only works halfway efficiently if all checkpoints are
triggered by "time" and not by "xlog".

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: postgresql latency & bgwriter not doing its job

From
Fabien COELHO
Date:
Hello Andres,

>> checkpoint when the segments are full... the server is unresponsive about
>> 10% of the time (one in ten transaction is late by more than 200 ms).
>
> That's ext4 I guess?

Yes!

> Did you check whether xfs yields a, err, more predictable performance?

No. I cannot test that easily without reinstalling the box. I did some 
quick tests with ZFS/FreeBSD which seemed to freeze the same, but not in 
the very same conditions. Maybe I could try again.

> [...] Note that it would *not* be a good idea to make the bgwriter write 
> out everything, as much as possible - that'd turn sequential write io 
> into random write io.

Hmmm. I'm not sure it would necessarily be the case; it depends on how 
bgwriter would choose the pages to write. If they are chosen randomly then 
indeed that could be bad. If there is a big sequential write, should not 
the backend do the write directly anyway? ISTM that currently checkpoint 
is mostly random writes anyway, at least with the OLTP write load of 
pgbench. I'm just trying to be able to start them earlier so that they can 
be completed quickly.

So although bgwriter is not the solution, ISTM that pg has no reason to 
wait for minutes before starting to write dirty pages, if it has nothing 
else to do. If the OS does some retention later and cannot spread the 
load, as Josh suggests, this could also be a problem, but currently the OS 
seems not to have much to write (except WAL) till the checkpoint.

-- 
Fabien.



Re: postgresql latency & bgwriter not doing its job

From
Andres Freund
Date:
On 2014-08-26 10:25:29 +0200, Fabien COELHO wrote:
> >Did you check whether xfs yields a, err, more predictable performance?
> 
> No. I cannot test that easily without reinstalling the box. I did some quick
> tests with ZFS/FreeBSD which seemed to freeze the same, but not in the very
> same conditions. Maybe I could try again.

After Robert and I went to LSF/MM this spring I sent a test program for
precisely this problem, and while it could *crash* machines when using
ext4, xfs yielded much more predictable performance. There's a problem
with prioritization of write vs read IO that's apparently FS dependent.

> >[...] Note that it would *not* be a good idea to make the bgwriter write
> >out everything, as much as possible - that'd turn sequential write io into
> >random write io.
> 
> Hmmm. I'm not sure it would be necessary the case, it depends on how
> bgwriter would choose the pages to write? If they are chosen randomly then
> indeed that could be bad.

They essentially have to be random to fulfil the bgwriter's role of reducing
the likelihood of a backend having to write out a buffer itself. Consider
how the clock sweep algorithm (not that I am happy with it) works. When
looking for a new victim buffer all backends scan the buffer cache in
one continuous cycle. If they find a buffer with a usagecount==0 they'll
use that one and throw away its contents. Otherwise they reduce the
usagecount by 1 and move on. What the bgwriter *tries* to do is to write
out buffers with usagecount==0 that are dirty and will soon be visited
in the clock cycle, to avoid having the backends do that.
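
A rough C sketch of that sweep (illustrative only -- the types and helper
names below are made up, this is not the actual bufmgr.c code):

#include <stdbool.h>

/* Illustrative types only -- not PostgreSQL's real structures. */
typedef struct
{
    int  usage_count;   /* bumped on access, decayed by the sweep */
    bool is_dirty;
} Buffer;

/* Stand-in for the expensive step the bgwriter tries to spare backends. */
static void write_buffer(Buffer *buf) { buf->is_dirty = false; }

/* One continuous clock sweep over the pool, shared by all backends. */
static Buffer *
get_victim_buffer(Buffer *pool, int nbuffers, int *clock_hand)
{
    for (;;)
    {
        Buffer *buf = &pool[*clock_hand];

        *clock_hand = (*clock_hand + 1) % nbuffers;

        if (buf->usage_count == 0)
        {
            /* Victim found.  If it is still dirty the backend has to
             * write it out itself -- unless the bgwriter already cleaned
             * the buffers just ahead of the clock hand, which is its job. */
            if (buf->is_dirty)
                write_buffer(buf);
            return buf;
        }

        /* Give the buffer another chance and move on. */
        buf->usage_count--;
    }
}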

> If there is a big sequential write, should not the
> backend do the write directly anyway? ISTM that currently checkpoint is
> mostly random writes anyway, at least with the OLTP write load of pgbench.
> I'm just trying to be able to start them ealier so that they can be
> completed quickly.

If the IO scheduling worked - which it really doesn't in many cases -
there'd really be no need to make it finish fast. I think you should try
to tune spread checkpoints to have less impact, not make bgwriter do
something it's not written for.

> So although bgwriter is not the solution, ISTM that pg has no reason to wait
> for minutes before starting to write dirty pages, if it has nothing else to
> do.

That precisely *IS* a spread checkpoint.

> If the OS does some retention later and cannot spread the load, as Josh
> suggest, this could also be a problem, but currently the OS seems not to
> have much to write (but WAL) till the checkpoint.

The actual problem is that the writes by the checkpointer - done in the
background - aren't flushed out eagerly enough from the OS's page
cache. Then, when the final phase of the checkpoint comes, where
relation files need to be fsynced, some filesystems essentially stall
while trying to write out lots and lots of dirty buffers.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: postgresql latency & bgwriter not doing its job

From
Fabien COELHO
Date:
> What are the other settings here? checkpoint_segments,
> checkpoint_timeout, wal_buffers?

They simply are the defaults:
  checkpoint_segments = 3
  checkpoint_timeout = 5min
  wal_buffers = -1

I did some tests with checkpoint_segments = 1: the problem is just more 
frequent but shorter. I also reduced wal_segsize down to 1MB, which also 
made it even more frequent but much shorter, so the overall result was an 
improvement, with 3-5% of transactions lost instead of 10-14%, if I 
recall correctly. I have found no solution on this path.

> Could you show the output of log_checkpoints during that run? Checkpoint
> spreading only works halfway efficiently if all checkpoints are
> triggered by "time" and not by "xlog".

I do 500-second tests, so there could be at most 2 timeout-triggered 
checkpoints. Given the write load it takes about 2 minutes to fill the three 
16 MB segments (8 kB * 50 tps (there is one page modified per transaction) 
* 120 s ~ 48 MB), so checkpoints are triggered by xlog. The maths are 
consistent with the logs (not sure which proves which, though:-):
  LOG:  received SIGHUP, reloading configuration files
  LOG:  parameter "log_checkpoints" changed to "on"
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 5713 buffers (34.9%); 0 transaction log
    file(s) added, 0 removed, 0 recycled; write=51.449 s, sync=4.857 s,
    total=56.485 s; sync files=12, longest=2.160 s, average=0.404 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6235 buffers (38.1%); 0 transaction log
    file(s) added, 0 removed, 3 recycled; write=53.500 s, sync=5.102 s,
    total=58.670 s; sync files=8, longest=2.689 s, average=0.637 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6250 buffers (38.1%); 0 transaction log
    file(s) added, 0 removed, 3 recycled; write=53.888 s, sync=4.504 s,
    total=58.495 s; sync files=8, longest=2.627 s, average=0.563 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6148 buffers (37.5%); 0 transaction log
    file(s) added, 0 removed, 3 recycled; write=53.313 s, sync=6.437 s,
    total=59.834 s; sync files=8, longest=3.680 s, average=0.804 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6240 buffers (38.1%); 0 transaction log
    file(s) added, 0 removed, 3 recycled; write=149.008 s, sync=5.448 s,
    total=154.566 s; sync files=9, longest=3.788 s, average=0.605 s

Note that my current effective solution is to act as if 
"checkpoint_timeout = 0.2s": it works fine if I do my own spreading.

-- 
Fabien.



Re: postgresql latency & bgwriter not doing its job

From
Andres Freund
Date:
On 2014-08-26 10:49:31 +0200, Fabien COELHO wrote:
> 
> >What are the other settings here? checkpoint_segments,
> >checkpoint_timeout, wal_buffers?
> 
> They simply are the defaults:
> 
>   checkpoint_segments = 3
>   checkpoint_timeout = 5min
>   wal_buffers = -1
> 
> I did some test checkpoint_segments = 1, the problem is just more frequent
> but shorter. I also reduced wal_segsize down to 1MB, which also made it even
> more frequent but much shorter, so the overall result was an improvement
> with 5% to 3% of transactions lost instead of 10-14%, if I recall correctly.
> I have found no solution on this path.

Uh. I'm not surprised you're facing utterly horrible performance with
this. Did you try using a *large* checkpoint_segments setting? To
achieve high performance you will likely have to make checkpoint_timeout
*longer* and increase checkpoint_segments until *all* checkpoints are
started because of "time".

There's three reasons:
a) if checkpoint_timeout + completion_target is large and the checkpoint
isn't executed prematurely, most of the dirty data has been written out
by the kernel's background flush processes.

b) The amount of WAL written with less frequent checkpoints is often
*significantly* lower because fewer full page writes need to be
done. I've seen production reductions of *more* than a factor of 4.

c) If checkpoints are infrequent enough, the penalty of them causing
problems, especially if not using ext4, plays less of a role overall.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: postgresql latency & bgwriter not doing its job

From
Fabien COELHO
Date:
> Uh. I'm not surprised you're facing utterly horrible performance with
> this. Did you try using a *large* checkpoints_segments setting? To
> achieve high performance

I do not seek "high performance" per se, I seek "lower maximum latency".

I think that the current settings and parameters are designed for high 
throughput, but do not allow controlling the latency, even with a small 
load.

> you likely will have to make checkpoint_timeout *longer* and increase 
> checkpoint_segments until *all* checkpoints are started because of 
> "time".

Well, since I want to test a *small* load in a *reasonable* time, I did 
not enlarge the number of segments, otherwise it would take ages.

If I put a "checkpoint_timeout = 1min" and "checkpoint_completion_target = 
0.9" so that the checkpoints are triggered by the timeout,
  LOG:  checkpoint starting: time
  LOG:  checkpoint complete: wrote 4476 buffers (27.3%); 0 transaction log
    file(s) added, 0 removed, 0 recycled; write=53.645 s, sync=5.127 s,
    total=58.927 s; sync files=12, longest=2.890 s, average=0.427 s
  ...

The result is basically the same (well, 18% of transactions lost, but the 
results do not seem to be stable from one run to the next), only there are 
more checkpoints.

I fail to understand how multiplying both the segments and the timeout would 
solve the latency problem. If I set 30 segments then it takes 20 minutes 
to fill them, and if I put the timeout at 15min then I'll have to wait for 
15 minutes to test.

> There's three reasons:
> a) if checkpoint_timeout + completion_target is large and the checkpoint
> isn't executed prematurely, most of the dirty data has been written out
> by the kernel's background flush processes.

Why would they be written by the kernel if bgwriter has not sent them??

> b) The amount of WAL written with less frequent checkpoints is often
> *significantly* lower because fewer full page writes need to be
> done. I've seen production reduction of *more* than a factor of 4.

Sure, I understand that, but ISTM that this test does not exercise this 
issue, the load is small, the full page writes do not matter much.

> c) If checkpoint's are infrequent enough, the penalty of them causing
> problems, especially if not using ext4, plays less of a role overall.

I think that what you suggest would only delay the issue, not solve it.

I'll try to run a long test.

-- 
Fabien.



Re: postgresql latency & bgwriter not doing its job

From
Andres Freund
Date:
On 2014-08-26 11:34:36 +0200, Fabien COELHO wrote:
> 
> >Uh. I'm not surprised you're facing utterly horrible performance with
> >this. Did you try using a *large* checkpoints_segments setting? To
> >achieve high performance
> 
> I do not seek "high performance" per se, I seek "lower maximum latency".

So?

> I think that the current settings and parameters are designed for high
> throughput, but do not allow to control the latency even with a small load.

The way you're setting them is tuned for 'basically no write
activity'.

> >you likely will have to make checkpoint_timeout *longer* and increase
> >checkpoint_segments until *all* checkpoints are started because of "time".
> 
> Well, as I want to test a *small* load in a *reasonable* time, so I did not
> enlarge the number of segments, otherwise it would take ages.

Well, that way you're testing something basically meaningless. That's
not helpful either.

> If I put a "checkpoint_timeout = 1min" and "checkpoint_completion_target =
> 0.9" so that the checkpoints are triggered by the timeout,
> 
>   LOG:  checkpoint starting: time
>   LOG:  checkpoint complete: wrote 4476 buffers (27.3%); 0 transaction log
>     file(s) added, 0 removed, 0 recycled; write=53.645 s, sync=5.127 s,
>     total=58.927 s; sync files=12, longest=2.890 s, average=0.427 s
>   ...
>
> The result is basically the same (well 18% transactions lost, but the result
> do not seem to be stable one run after the other), only there are more
> checkpoints.

With these settings you're fsyncing the entire data directory once a
minute. Nearly entirely from the OS's buffer cache, because the OS's
writeback logic didn't have time to kick in.

> I fail to understand how multiplying both the segments and time would solve
> the latency problem. If I set 30 segments than it takes 20 minutes to fill
> them, and if I put timeout to 15min then I'll have to wait for 15 minutes to
> test.

a) The kernel's writeback logic only kicks in with a delay. b) The amount
of writes you're doing with short checkpoint intervals is overall
significantly higher than with longer intervals. That obviously has an
impact on latency as well as throughput. c) The time it takes for
segments to be filled is mostly irrelevant. The phase that's very likely
causing trouble for you is the fsyncs issued at the end of a
checkpoint.

> >There's three reasons:
> >a) if checkpoint_timeout + completion_target is large and the checkpoint
> >isn't executed prematurely, most of the dirty data has been written out
> >by the kernel's background flush processes.
> 
> Why would they be written by the kernel if bgwriter has not sent them??

I think you're misunderstanding how spread checkpoints work. When the
checkpointer process starts a spread checkpoint it first writes all
buffers to the kernel in a paced manner. That pace is determined by
checkpoint_completion_target and checkpoint_timeout. Once all buffers
that are old enough to need to be checkpointed have been written out, the
checkpointer fsync()s all the on-disk files. That part is *NOT*
paced. Then it can go on to remove old WAL files.
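
A schematic C sketch of those two phases (all names below are invented for
illustration and do not correspond to the actual checkpointer code):

#include <stddef.h>

typedef struct DirtyBuffer DirtyBuffer;

/* Hypothetical helpers standing in for the real machinery. */
extern size_t collect_buffers_to_checkpoint(DirtyBuffer ***bufs);
extern void   write_buffer_to_kernel(DirtyBuffer *buf);  /* write(), no fsync */
extern void   sleep_to_match_pace(size_t done, size_t total,
                                  double completion_target, int timeout_s);
extern void   fsync_all_touched_relation_files(void);
extern void   recycle_old_wal(void);

void
spread_checkpoint(double completion_target, int timeout_s)
{
    DirtyBuffer **bufs;
    size_t        total = collect_buffers_to_checkpoint(&bufs);

    /*
     * Phase 1: paced.  Buffers are handed to the kernel gradually so the
     * writes are spread over roughly completion_target * timeout_s.
     */
    for (size_t i = 0; i < total; i++)
    {
        write_buffer_to_kernel(bufs[i]);
        sleep_to_match_pace(i + 1, total, completion_target, timeout_s);
    }

    /*
     * Phase 2: NOT paced.  Every touched relation file is fsync()ed in one
     * go; if the kernel has been sitting on the phase-1 writes, this is
     * where the IO storm -- and the latency stall -- happens.
     */
    fsync_all_touched_relation_files();

    /* Only then can old WAL segments be removed or recycled. */
    recycle_old_wal();
}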

The latency problem is almost guaranteedly created by the fsync()s
mentioned above. When they're executed the kernel starts flushing out a
lot of dirty buffers at once - creating very deep IO queues, which makes
it take long to process synchronous additions (WAL flushes, reads) to
that queue.

> >c) If checkpoint's are infrequent enough, the penalty of them causing
> >problems, especially if not using ext4, plays less of a role overall.
> 
> I think that what you suggest would only delay the issue, not solve it.

The amount of dirty data that needs to be flushed is essentially
bounded. If you have stalls of roughly the same magnitude (say within a
factor of two), the smaller one occurs once a minute, the larger once an
hour. Obviously the once-an-hour one will give better latency in many,
many more transactions.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: postgresql latency & bgwriter not doing its job

From
Fabien COELHO
Date:
Hello Jeff,

>> The culprit I found is "bgwriter", which is basically doing nothing to
>> prevent the coming checkpoint IO storm, even though there would be ample
>> time to write the accumulating dirty pages so that checkpoint would find a
>> clean field and pass in a blink. Indeed, at the end of the 500 seconds
>> throttled test, "pg_stat_bgwriter" says:
>
> Are you doing pg_stat_reset_shared('bgwriter') after running pgbench -i?

Yes, I did.

> You don't want your steady state stats polluted by the bulk load.

Sure!

>>   buffers_checkpoint = 19046
>>   buffers_clean = 2995
>
> Out of curiosity, what does buffers_backend show?
   buffers_backend = 157

> In any event, this almost certainly is a red herring.

Possibly! It is pretty easy to reproduce... can anyone try?

> Whichever of the three ways are being used to write out the buffers, it 
> is the checkpointer that is responsible for fsyncing them, and that is 
> where your drama is almost certainly occurring. Writing out with one 
> path rather than a different isn't going to change things, unless you 
> change the fsync.

Well, ISTM that the OS does not need to wait for fsync to start writing 
pages if it has received one minute of buffer writes at 50 writes per 
second; some scheduler should start handling the flow somewhere... So if 
bgwriter were to write the buffers it would help, but maybe there is a 
better way.

> Also, are you familiar with checkpoint_completion_target, and what is it
> set to?

The default 0.5. Moving to 0.9 seems to worsen the situation.

-- 
Fabien.



Re: postgresql latency & bgwriter not doing its job

From
Amit Kapila
Date:
On Tue, Aug 26, 2014 at 12:53 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> Given the small flow of updates, I do not think that there should be reason to get that big a write contention between WAL & checkpoint.
>
> If tried with "full_page_write = off" for 500 seconds: same overall behavior, 8.5% of transactions are stuck (instead of 10%). However in details pg_stat_bgwriter is quite different:
>
>   buffers_checkpoint = 13906
>   buffers_clean = 20748
>   buffers_backend = 472
>
> That seems to suggest that bgwriter did some stuff for once, but that did not change much the result in the end. This would imply that my suggestion to make bgwriter write more would not fix the problem alone.

I think the reason could be that in most cases bgwriter
passes the sync responsibility to the checkpointer rather than
doing it itself, which causes the IO storm during checkpoint;
another thing is that it will not even proceed to write
a dirty buffer unless the refcount and usage_count of the
buffer are zero.  I see there is some merit in your point, which
is to make bgwriter more useful than its current form.
I could see 3 top-level points to think about, where improvement
in any of them could improve the current situation:
a. Scanning of the buffer pool to find the dirty buffers that can
be flushed.
b. Deciding on the criteria for flushing a buffer.
c. Sync of buffers.



> With "synchronous_commit = off", the situation is much improved, with only 0.3% of transactions stuck. Not a surprise. However, I would not 
> recommand that as a solution:-)

Yeah, actually it was just to test what the actual problem is;
I will also not recommend such a solution, however it gives
us at least an indication that due to the IO in the checkpoint,
backends are not even able to flush comparatively small WAL data.
How about keeping WAL (pg_xlog) on a separate filesystem,
maybe via creating a symlink or, if possible, with the -X option of initdb?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: postgresql latency & bgwriter not doing its job

From
Fabien COELHO
Date:
Hello Andres,

> [...]
> I think you're misunderstanding how spread checkpoints work.

Yep, definitely:-) On the other hand I thought I was seeking something 
"simple", namely correct latency under small load, that I would expect out 
of the box.

What you describe is reasonable, and is more or less what I was hoping 
for, although I thought that bgwriter was involved from the start and 
checkpoint would only do what is needed in the end. My mistake.

> When the checkpointer process starts a spread checkpoint it first writes 
> all buffers to the kernel in a paced manner.
> That pace is determined by checkpoint_completion_target and 
> checkpoint_timeout.

This pacing does not seem to work, even at a slow pace.

> If you have a stall of roughly the same magnitude (say a factor
> of two different), the smaller once a minute, the larger once an
> hour. Obviously the once-an-hour one will have a better latency in many,
> many more transactions.

I do not believe that delaying writes to disk as much as possible is a 
viable strategy for handling a small load.  However, to show my good will, 
I have tried to follow your advice: I've launched a 5000 seconds test with 
50 segments, 30 min timeout, 0.9 completion target, at 25 tps, which is 
less than 1/10 of the maximum throughput.

There are only two time-triggered checkpoints:
  LOG:  checkpoint starting: time
  LOG:  checkpoint complete: wrote 48725 buffers (47.6%);
      1 transaction log file(s) added, 0 removed, 0 recycled;
      write=1619.750 s, sync=27.675 s, total=1647.932 s;
      sync files=14, longest=27.593 s, average=1.976 s

  LOG:  checkpoint starting: time
  LOG:  checkpoint complete: wrote 22533 buffers (22.0%);
      0 transaction log file(s) added, 0 removed, 23 recycled;
      write=826.919 s, sync=9.989 s, total=837.023 s;
      sync files=8, longest=6.742 s, average=1.248 s

For the first one, 48725 buffers is 380MB. 1800 * 0.9 = 1620 seconds to 
complete, so it means 30 buffer writes per second... should be ok. However 
the sync costs 27 seconds nevertheless, and the server was more or less 
offline for about 30 seconds flat. For the second one, 180 MB to write, 10 
seconds offline. For some reason the target time is reduced. I have also 
tried with the "deadline" IO scheduler, which makes more sense than the 
default "cfq", but the result was similar. Not sure how software RAID 
interacts with IO scheduling, though.

Overall result: over the 5000s test, I have lost (i.e. more than 200ms 
behind schedule) more than 2.5% of transactions (1/40). Due to the 
unfinished cycle, the long term average is probably about 3%. Although it 
is better than 10%, it is not good. I would expect/hope for something 
pretty close to 0, even with ext4 on Linux, for a dedicated host which has 
nothing else to do but handle two dozen transactions per second.

Current conclusion: I have not found any way to improve the situation to 
"good" with parameters from the configuration. Currently a small load 
results in periodic offline time, that can be delayed but not avoided. The 
delaying tactic results in less frequent but longer downtime. I prefer 
frequent very short downtime instead.

I really think that something is amiss. Maybe pg does not handle pacing as 
it should.

For the record, a 25tps bench with a "small" config (default 3 segments, 
5min timeout, 0.5 completion target) and with this running in parallel:
    while true ; do echo "CHECKPOINT;"; sleep 0.2s; done | psql

results in "losing" only 0.01% of transactions (12 transactions out of 
125893 where behind more than 200ms in 5000 seconds). Although you may 
think it stupid, from my point of view it shows that it is possible to 
coerce pg to behave.

With respect to the current status:

(1) the ability to put checkpoint_timeout to values smaller than 30s could 
help, although obviously there would be other consequences. But the 
ability to avoid periodic offline time looks like a desirable objective.

(2) I still think that a parameter to force bgwriter to write more stuff 
could help, but this is not tested.

(3) Any other effective idea to configure for responsiveness is welcome!

If someone wants to repeat these tests, it is easy and only takes a few 
minutes:
  sh> createdb test
  sh> pgbench -i -s 100 -F 95 test
  sh> pgbench -M prepared -N -R 25 -L 200 -c 2 -T 5000 -P 1 test > pgb.out

Note: the -L option to limit latency is a submitted patch. Without it, 
unresponsiveness shows up as an increasing lag time.

-- 
Fabien.



Re: postgresql latency & bgwriter not doing its job

From
Andres Freund
Date:
On 2014-08-27 09:32:16 +0200, Fabien COELHO wrote:
> 
> Hello Andres,
> 
> >[...]
> >I think you're misunderstanding how spread checkpoints work.
> 
> Yep, definitely:-) On the other hand I though I was seeking something
> "simple", namely correct latency under small load, that I would expect out
> of the box.

Yea. The current situation *sucks*. Both from the utterly borked
behaviour of ext4 and other filesystems and the lack of workaround from postgres.

> >When the checkpointer process starts a spread checkpoint it first writes
> >all buffers to the kernel in a paced manner.
> >That pace is determined by checkpoint_completion_target and
> >checkpoint_timeout.
> 
> This pacing does not seem to work, even at slow pace.

It definitely does in some cases. What's your evidence the pacing
doesn't work? Afaik it's the fsync that causes the problem, not the
writes themselves.

> >If you have a stall of roughly the same magnitude (say a factor
> >of two different), the smaller once a minute, the larger once an
> >hour. Obviously the once-an-hour one will have a better latency in many,
> >many more transactions.
> 
> I do not believe in delaying as much as possible writing do disk to handle a
> small load as a viable strategy.  However, to show my good will, I have
> tried to follow your advices: I've launched a 5000 seconds test with 50
> segments, 30 min timeout, 0.9 completion target, at 25 tps, which is less
> than 1/10 of the maximum throughput.
> 
> There are only two time-triggered checkpoints:
> 
>   LOG:  checkpoint starting: time
>   LOG:  checkpoint complete: wrote 48725 buffers (47.6%);
>       1 transaction log file(s) added, 0 removed, 0 recycled;
>       write=1619.750 s, sync=27.675 s, total=1647.932 s;
>       sync files=14, longest=27.593 s, average=1.976 s
> 
>   LOG:  checkpoint starting: time
>   LOG:  checkpoint complete: wrote 22533 buffers (22.0%);
>       0 transaction log file(s) added, 0 removed, 23 recycled;
>       write=826.919 s, sync=9.989 s, total=837.023 s;
>       sync files=8, longest=6.742 s, average=1.248 s

The write pacing itself doesn't seem to be bad. The bad thing is the
'sync' times here. Those are *NOT* paced, and the kernel probably has
delayed flushing out much of the writes...

> (1) the ability to put checkpoint_timeout to values smaller than 30s could
> help, although obviously there would be other consequences. But the ability
> to avoid periodic offline time looks like a desirable objective.

I'd rather not do that. It's an utterly horrible hack to go this way.

> (2) I still think that a parameter to force bgwriter to write more stuff
> could help, but this is not tested.

It's going to be random writes. That's not going to be helpful.

> (3) Any other effective idea to configure for responsiveness is welcome!

I've a couple of ideas how to improve the situation, but so far I've not
had the time to investigate them properly. Would you be willing to test
a couple of simple patches?

Did you test xfs already?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: postgresql latency & bgwriter not doing its job

From
Fabien COELHO
Date:
> [...] What's your evidence the pacing doesn't work? Afaik it's the fsync 
> that causes the problem, not the the writes themselves.

Hmmm. My (poor) understanding is that fsync would work fine if everything 
was already written beforehand:-) that is, it has nothing to do but confirm 
that all is already written. If there is remaining write work, it starts 
doing it "now", with the disastrous effects I'm complaining about.

When I say "pacing does not work", I mean that things where not written 
out to disk by the OS, it does not mean that pg did not ask for it.

However it does not make much sense for an OS scheduler to wait several 
minutes with tens of thousands of pages to write and do nothing about 
it... So I'm wondering.

> [...]
>> (1) the ability to put checkpoint_timeout to values smaller than 30s could
>> help, although obviously there would be other consequences. But the ability
>> to avoid periodic offline time looks like a desirable objective.
>
> I'd rather not do that. It's a utterly horrible hack to go this write.

Hmmm. It does solve the issue, though:-) It would be the administrator's 
choice. It is better than nothing, which is the current status.

>> (2) I still think that a parameter to force bgwriter to write more stuff
>> could help, but this is not tested.
>
> It's going to be random writes. That's not going to be helpful.

The -N small OLTP load on a large (GB) table *is* random writes anyway, 
whether they occur at checkpoint or at any other time. Random writes are 
fine in this case, the load is small, there should be no problem.

>> (3) Any other effective idea to configure for responsiveness is 
>> welcome!
>
> I've a couple of ideas how to improve the situation, but so far I've not
> had the time to investigate them properly. Would you be willing to test
> a couple of simple patches?

I can test a couple of patches. I already did one on someone's advice 
(making bgwriter cycle through everything in 1 s instead of 120 s), without 
positive effect.

> Did you test xfs already?

No. I cannot without reinstalling, which I cannot do on a remote host, and 
I will probably not have time to do it when I have physical access. 
There is only one partition on the host. My mistake. Will not do it again. 
Shame on me.

If someone out there has an XFS setup, it is very easy to test and only 
takes a couple of minutes, really. It takes less time to do it than to 
write a mail about it afterwards:-)

I have tested FreeBSD/UFS with similar results, a few periodic offline 
periods. A UFS journaled file system is probably not ideal for database 
work, but then again the load is small; it should be able to cope without 
going offline.

-- 
Fabien.



Re: postgresql latency & bgwriter not doing its job

From
Andres Freund
Date:
On 2014-08-27 11:05:52 +0200, Fabien COELHO wrote:
> I can test a couple of patches. I already did one on someone advice (make
> bgwriter round all stuff in 1s instead of 120s, without positive effect.

I've quickly cobbled together the attached patch (which at least doesn't
seem to crash & burn). It tries to trigger pages being flushed out
during the paced phase of checkpoints instead of the fsync phase. The
sync_on_checkpoint_flush can be used to enable/disable that behaviour.
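
For readers without the attachment, the rough idea is presumably something
like the sketch below; sync_file_range() is a real (Linux-only) call, but
the surrounding function and GUC handling are illustrative guesses from the
description above, not the patch itself:

#define _GNU_SOURCE
#include <fcntl.h>

static int sync_on_checkpoint_flush = 1;    /* the GUC mentioned above */

/*
 * Called after the checkpointer has write()n one buffer during the paced
 * phase: ask the kernel to start writeback of that range immediately,
 * instead of letting dirty pages pile up until the final fsync().
 */
static void
flush_after_checkpoint_write(int fd, off_t offset, off_t nbytes)
{
    if (!sync_on_checkpoint_flush)
        return;                             /* old behaviour */

    /* Initiates writeback but does not wait for it, so pacing is kept. */
    (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}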

I'd be interested to hear whether that improves your latency numbers. I
unfortunately don't have more time to spend on this right now :(.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: postgresql latency & bgwriter not doing its job

From
Andres Freund
Date:
On 2014-08-27 11:14:46 +0200, Andres Freund wrote:
> On 2014-08-27 11:05:52 +0200, Fabien COELHO wrote:
> > I can test a couple of patches. I already did one on someone advice (make
> > bgwriter round all stuff in 1s instead of 120s, without positive effect.
>
> I've quickly cobbled together the attached patch (which at least doesn't
> seem to crash & burn). It tries to trigger pages being flushed out
> during the paced phase of checkpoints instead of the fsync phase. The
> sync_on_checkpoint_flush can be used to enable/disable that behaviour.
>
> I'd be interested to hear whether that improves your latency numbers. I
> unfortunately don't have more time to spend on this right now :(.

And actually attached. Note that it's linux only...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: postgresql latency & bgwriter not doing its job

From
Andres Freund
Date:
On 2014-08-27 11:19:22 +0200, Andres Freund wrote:
> On 2014-08-27 11:14:46 +0200, Andres Freund wrote:
> > On 2014-08-27 11:05:52 +0200, Fabien COELHO wrote:
> > > I can test a couple of patches. I already did one on someone advice (make
> > > bgwriter round all stuff in 1s instead of 120s, without positive effect.
> > 
> > I've quickly cobbled together the attached patch (which at least doesn't
> > seem to crash & burn). It tries to trigger pages being flushed out
> > during the paced phase of checkpoints instead of the fsync phase. The
> > sync_on_checkpoint_flush can be used to enable/disable that behaviour.
> > 
> > I'd be interested to hear whether that improves your latency numbers. I
> > unfortunately don't have more time to spend on this right now :(.
> 
> And actually attached. Note that it's linux only...

I got curious and ran a quick test:

config:
log_checkpoints=on
checkpoint_timeout=1min
checkpoint_completion_target=0.95
checkpoint_segments=100
synchronous_commit=on
fsync=on
huge_pages=on
max_connections=200
shared_buffers=6GB
wal_level=hot_standby

off:

$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 150
query mode: prepared
number of clients: 16
number of threads: 16
duration: 120 s
number of transactions actually processed: 20189
latency average: 23.136 ms
latency stddev: 59.044 ms
rate limit schedule lag: avg 4.599 (max 199.975) ms
number of skipped transactions: 1345 (6.246 %)
tps = 167.664290 (including connections establishing)
tps = 167.675679 (excluding connections establishing)

LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 12754 buffers (1.6%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=56.928 s, sync=3.639 s, total=60.749 s; sync files=20, longest=2.741 s, average=0.181 s
 
LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 12269 buffers (1.6%); 0 transaction log file(s) added, 0 removed, 6 recycled; write=20.701 s, sync=8.568 s, total=29.444 s; sync files=10, longest=3.568 s, average=0.856 s
 


on:

$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 150
query mode: prepared
number of clients: 16
number of threads: 16
duration: 120 s
number of transactions actually processed: 21327
latency average: 20.735 ms
latency stddev: 14.643 ms
rate limit schedule lag: avg 4.965 (max 185.003) ms
number of skipped transactions: 1 (0.005 %)
tps = 177.214391 (including connections establishing)
tps = 177.225476 (excluding connections establishing)

LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 12217 buffers (1.6%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=57.022 s, sync=0.203 s, total=57.377 s; sync files=19, longest=0.033 s, average=0.010 s
 
LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 13185 buffers (1.7%); 0 transaction log file(s) added, 0 removed, 6 recycled; write=56.628 s, sync=0.019 s, total=56.803 s; sync files=11, longest=0.017 s, average=0.001 s
 


That machine is far from idle right now, so the noise is pretty
high. But rather nice initial results.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: postgresql latency & bgwriter not doing its job

From
Fabien COELHO
Date:
Hello Amit,

> I see there is some merit in your point which is to make bgwriter more 
> useful than its current form. I could see 3 top level points to think 
> about whether improvement in any of those can improve the current 
> situation:

> a. Scanning of buffer pool to find the dirty buffers that can
> be flushed.

Yep, that could be more aggressive, or the aggressiveness could be made 
into an adjustable parameter. There is a comment about that in the source. 
However I tested setting the round to 1 s instead of 120 s, and it had no 
positive effect on my test.

> b. Deciding on what is criteria to flush a buffer

Indeed.

For instance when the IO load is low, there is no reason not to send out 
some buffers; it is a free lunch even if it must be done again later. If 
the load is high, this is another matter. So it would mean being able to 
decide whether the current IO load is low or not. It could be different on 
different disks... Hmmm, not that simple.

> c. Sync of buffers

Yes. That is more or less the effect of my rough approach of calling 
CHECKPOINT every 0.2 seconds.  Andres Freund just implemented a quick 
linux-specific patch which does that within CHECKPOINT pacing, and which 
greatly improves the situation, although it could still be a little 
better.

> [separate xlog test]

I do not have a setup available for that right now. Not sure this would be 
an issue for low loads.

-- 
Fabien.



Re: postgresql latency & bgwriter not doing its job

From
Claudio Freire
Date:
On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>> [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
>> that causes the problem, not the the writes themselves.
>
>
> Hmmm. My (poor) understanding is that fsync would work fine if everything
> was already written beforehand:-) that is it has nothing to do but assess
> that all is already written. If there is remaining write work, it starts
> doing it "now" with the disastrous effects I'm complaining about.
>
> When I say "pacing does not work", I mean that things where not written out
> to disk by the OS, it does not mean that pg did not ask for it.
>
> However it does not make much sense for an OS scheduler to wait several
> minutes with tens of thousands of pages to write and do nothing about it...
> So I'm wondering.

Maybe what's needed is to slightly tweak the checkpoint logic to give the
kernel some time to flush buffers.

Correct me if I'm wrong, but the checkpointer does the sync right
after the reads. Of course there will be about 30s-worth of
accumulated writes (it's the default amount of time the kernel holds
on to dirty buffers).

Perhaps it should be delayed a small time, say 30s, to let the kernel
do the writing on its own.



Re: postgresql latency & bgwriter not doing its job

From
Claudio Freire
Date:
On Wed, Aug 27, 2014 at 10:10 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>>> [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
>>> that causes the problem, not the the writes themselves.
>>
>>
>> Hmmm. My (poor) understanding is that fsync would work fine if everything
>> was already written beforehand:-) that is it has nothing to do but assess
>> that all is already written. If there is remaining write work, it starts
>> doing it "now" with the disastrous effects I'm complaining about.
>>
>> When I say "pacing does not work", I mean that things where not written out
>> to disk by the OS, it does not mean that pg did not ask for it.
>>
>> However it does not make much sense for an OS scheduler to wait several
>> minutes with tens of thousands of pages to write and do nothing about it...
>> So I'm wondering.
>
> Maybe what's needed, is to slightly tweak checkpoint logic to give the
> kernel some time to flush buffers.
>
> Correct me if I'm wrong, but the checkpointer does the sync right
> after the reads. Of course there will be about 30s-worth of
> accumulated writes (it's the default amount of time the kernel holds
> on to dirty buffers).
>
> Perhaps it should be delayed a small time, say 30s, to let the kernel
> do the writing on its own.


Errata: just after the writes :-p



Re: postgresql latency & bgwriter not doing its job

From
Andres Freund
Date:
On 2014-08-27 10:10:49 -0300, Claudio Freire wrote:
> On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> >> [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
> >> that causes the problem, not the the writes themselves.
> >
> >
> > Hmmm. My (poor) understanding is that fsync would work fine if everything
> > was already written beforehand:-) that is it has nothing to do but assess
> > that all is already written. If there is remaining write work, it starts
> > doing it "now" with the disastrous effects I'm complaining about.
> >
> > When I say "pacing does not work", I mean that things where not written out
> > to disk by the OS, it does not mean that pg did not ask for it.
> >
> > However it does not make much sense for an OS scheduler to wait several
> > minutes with tens of thousands of pages to write and do nothing about it...
> > So I'm wondering.
> 
> Maybe what's needed, is to slightly tweak checkpoint logic to give the
> kernel some time to flush buffers.
> 
> Correct me if I'm wrong, but the checkpointer does the sync right
> after the reads. Of course there will be about 30s-worth of
> accumulated writes (it's the default amount of time the kernel holds
> on to dirty buffers).
> 
> Perhaps it should be delayed a small time, say 30s, to let the kernel
> do the writing on its own.

The kernel *starts* to write out pages after 30s, it doesn't finish
doing so. So I don't think that's going to work.

I think a somewhat smarter version of the explicit flushes in the
hack^Wpatch I posted nearby is more likely to be successful.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: postgresql latency & bgwriter not doing its job

From
Claudio Freire
Date:
On Wed, Aug 27, 2014 at 10:15 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-08-27 10:10:49 -0300, Claudio Freire wrote:
>> On Wed, Aug 27, 2014 at 6:05 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>> >> [...] What's your evidence the pacing doesn't work? Afaik it's the fsync
>> >> that causes the problem, not the the writes themselves.
>> >
>> >
>> > Hmmm. My (poor) understanding is that fsync would work fine if everything
>> > was already written beforehand:-) that is it has nothing to do but assess
>> > that all is already written. If there is remaining write work, it starts
>> > doing it "now" with the disastrous effects I'm complaining about.
>> >
>> > When I say "pacing does not work", I mean that things where not written out
>> > to disk by the OS, it does not mean that pg did not ask for it.
>> >
>> > However it does not make much sense for an OS scheduler to wait several
>> > minutes with tens of thousands of pages to write and do nothing about it...
>> > So I'm wondering.
>>
>> Maybe what's needed, is to slightly tweak checkpoint logic to give the
>> kernel some time to flush buffers.
>>
>> Correct me if I'm wrong, but the checkpointer does the sync right
>> after the reads. Of course there will be about 30s-worth of
>> accumulated writes (it's the default amount of time the kernel holds
>> on to dirty buffers).
>>
>> Perhaps it should be delayed a small time, say 30s, to let the kernel
>> do the writing on its own.
>
> The kernel *starts* to write out pages after 30s, it doesn't finish
> doing so. So I don't think that's going to work.
>
> I think a somewhat smarter version of the explicit flushes in the
> hack^Wpatch I posted nearby is going to more likely to be successful.


That path is "dangerous" (as in, may not work as intended) if the
filesystem doesn't properly understand range flushes (ehem, like
ext3).



Re: postgresql latency & bgwriter not doing its job

From
Andres Freund
Date:
On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
> > I think a somewhat smarter version of the explicit flushes in the
> > hack^Wpatch I posted nearby is going to more likely to be successful.
> 
> 
> That path is "dangerous" (as in, may not work as intended) if the
> filesystem doesn't properly understand range flushes (ehem, like
> ext3).

The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't a operation
guaranteeing durability. And - afaik - not implemented in a file system
specific manner. It just initiates writeback for individual pages. It
doesn't cause barrier, journal flushes or anything to be issued. That's
still done by the fsync() later.
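
For illustration, a minimal Linux-only sketch of that kind of call on a scratch
file (not the actual patch; file name and sizes are made up):

/* sync_file_range_demo.c: write an 8 kB page, then ask the kernel to start
 * writing it back without waiting and without any durability guarantee.
 * The later fsync() is still what makes the data durable. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

int main(void)
{
    char page[BLCKSZ];
    int  fd = open("scratch.dat", O_RDWR | O_CREAT, 0600);   /* made-up file */

    if (fd < 0) { perror("open"); return 1; }

    memset(page, 'x', sizeof(page));
    if (pwrite(fd, page, BLCKSZ, 0) != BLCKSZ) { perror("pwrite"); return 1; }

    /* Initiate writeback of just this page: no barrier, no journal flush. */
    if (sync_file_range(fd, 0, BLCKSZ, SYNC_FILE_RANGE_WRITE) != 0)
        perror("sync_file_range");

    /* ... more writes would go here ... */

    if (fsync(fd) != 0)        /* durability still comes from fsync() */
        perror("fsync");

    close(fd);
    return 0;
}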

The big disadvantage is that it's an OS-specific solution, but I don't
think we're going to find anything that isn't in this area.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: postgresql latency & bgwriter not doing its job

From
Claudio Freire
Date:
On Wed, Aug 27, 2014 at 10:20 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
>> > I think a somewhat smarter version of the explicit flushes in the
>> > hack^Wpatch I posted nearby is going to more likely to be successful.
>>
>>
>> That path is "dangerous" (as in, may not work as intended) if the
>> filesystem doesn't properly understand range flushes (ehem, like
>> ext3).
>
> The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't a operation
> guaranteeing durability. And - afaik - not implemented in a file system
> specific manner. It just initiates writeback for individual pages. It
> doesn't cause barrier, journal flushes or anything to be issued. That's
> still done by the fsync() later.
>
> The big disadvantage is that it's a OS specific solution, but I don't
> think we're going to find anything that isn't in this area.

I guess it should work then.



Re: postgresql latency & bgwriter not doing its job

From
Aidan Van Dyk
Date:
On Wed, Aug 27, 2014 at 3:32 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> Hello Andres,
>
> [...]
>
>> I think you're misunderstanding how spread checkpoints work.
>
> Yep, definitely:-) On the other hand I thought I was seeking something "simple", namely correct latency under small load, that I would expect out of the box.
>
> What you describe is reasonable, and is more or less what I was hoping for, although I thought that bgwriter was involved from the start and checkpoint would only do what is needed in the end. My mistake.


If all you want is to avoid the write storms when fsyncs start happening on slow storage, can you not just adjust the kernel vm.dirty* tunables to start making the kernel write out dirty buffers much sooner instead of letting them accumulate until fsyncs force them out all at once?

a.


--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: postgresql latency & bgwriter not doing its job

From
Andres Freund
Date:
On 2014-08-27 10:32:19 -0400, Aidan Van Dyk wrote:
> On Wed, Aug 27, 2014 at 3:32 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> 
> >
> > Hello Andres,
> >
> >  [...]
> >>
> >> I think you're misunderstanding how spread checkpoints work.
> >>
> >
> > Yep, definitely:-) On the other hand I though I was seeking something
> > "simple", namely correct latency under small load, that I would expect out
> > of the box.
> >
> > What you describe is reasonable, and is more or less what I was hoping
> > for, although I thought that bgwriter was involved from the start and
> > checkpoint would only do what is needed in the end. My mistake.
> >
> >
> If all you want is to avoid the write storms when fsyncs start happening on
> slow storage, can you not just adjust the kernel vm.dirty* tunables to
> start making the kernel write out dirty buffers much sooner instead of
> letting them accumulate until fsyncs force them out all at once?

Well. For one that's a os specific global tunable requiring root to be
adjustable. For another we actually do want some buffering: If a backend
writes out a buffer's data itself (happens very frequently) it *should*
get buffered... So I don't think a process independent tunable is going
to do the trick.


Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: postgresql latency & bgwriter not doing its job

From
Fabien COELHO
Date:
Hello,

> If all you want is to avoid the write storms when fsyncs start happening on
> slow storage, can you not just adjust the kernel vm.dirty* tunables to
> start making the kernel write out dirty buffers much sooner instead of
> letting them accumulate until fsyncs force them out all at once?

I can try, when I have finished with the current round of testing.

Note that, as Andres put it, it currently "sucks". Having to tinker with 
linux kernel parameters just to handle a small load without being offline 
10% of the time does not look very good, so even if it works, ISTM that a 
pg side solution is desirable.

-- 
Fabien.



Re: postgresql latency & bgwriter not doing its job

From
Heikki Linnakangas
Date:
On 08/27/2014 04:20 PM, Andres Freund wrote:
> On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
>>> I think a somewhat smarter version of the explicit flushes in the
>>> hack^Wpatch I posted nearby is going to more likely to be successful.
>>
>>
>> That path is "dangerous" (as in, may not work as intended) if the
>> filesystem doesn't properly understand range flushes (ehem, like
>> ext3).
>
> The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't a operation
> guaranteeing durability. And - afaik - not implemented in a file system
> specific manner. It just initiates writeback for individual pages. It
> doesn't cause barrier, journal flushes or anything to be issued. That's
> still done by the fsync() later.
>
> The big disadvantage is that it's a OS specific solution, but I don't
> think we're going to find anything that isn't in this area.

I've been thinking for a long time that we should interleave the writes
and the fsyncs. That still forces up to 1GB of dirty buffers to disk at
once, causing a spike, but at least not more than that. Also, the
scheduling of a spread checkpoint is currently a bit bogus; we don't
take into account the time needed for the fsync phase.

A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and
write them out in order
(http://www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp).
The performance impact of that was inconclusive, but one thing that it
allows nicely is to interleave the fsyncs, so that you write all the
buffers for one file, then fsync it, then next file and so on. IIRC the
biggest worry with that patch was that sorting the buffers requires a
fairly large amount of memory, and making a large allocation in the
checkpointer might cause an out-of-memory, which would be bad.

I don't think anyone's seriously worked on this area since. If the
impact on responsiveness or performance is significant, I'm pretty sure
the OOM problem could be alleviated somehow.
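
As a rough illustration of the sort-and-interleave idea (not the original
patch), a toy program that sorts an invented list of dirty buffers by file and
block, writes them in order, and fsyncs each file as soon as its last buffer
has been written:

/* sorted_ckpt_demo.c: sort "dirty buffers" by (file, block), write them in
 * that order, and fsync each file right after its last buffer is written,
 * instead of one big write pass followed by all the fsyncs. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

typedef struct
{
    int      fileno;     /* stands in for the relfilenode/segment */
    unsigned blocknum;   /* block within that file */
} DirtyTag;

static int
tag_cmp(const void *a, const void *b)
{
    const DirtyTag *ta = a, *tb = b;

    if (ta->fileno != tb->fileno)
        return ta->fileno - tb->fileno;
    if (ta->blocknum != tb->blocknum)
        return ta->blocknum < tb->blocknum ? -1 : 1;
    return 0;
}

int main(void)
{
    /* a made-up, unsorted set of dirty buffers spread over two files */
    DirtyTag dirty[] = { {1, 7}, {0, 3}, {1, 0}, {0, 0}, {1, 2}, {0, 9} };
    int      ndirty = sizeof(dirty) / sizeof(dirty[0]);
    int      fds[2];
    char     page[BLCKSZ];
    int      i;

    fds[0] = open("rel0.seg", O_RDWR | O_CREAT, 0600);   /* invented files */
    fds[1] = open("rel1.seg", O_RDWR | O_CREAT, 0600);
    if (fds[0] < 0 || fds[1] < 0) { perror("open"); return 1; }

    memset(page, 'x', sizeof(page));
    qsort(dirty, ndirty, sizeof(DirtyTag), tag_cmp);

    for (i = 0; i < ndirty; i++)
    {
        int fd = fds[dirty[i].fileno];

        /* write this buffer at its (now sequential) position */
        if (pwrite(fd, page, BLCKSZ, (off_t) dirty[i].blocknum * BLCKSZ) != BLCKSZ)
        { perror("pwrite"); return 1; }

        /* last buffer of this file?  fsync it now, before moving on */
        if (i == ndirty - 1 || dirty[i + 1].fileno != dirty[i].fileno)
        {
            if (fsync(fd) != 0) { perror("fsync"); return 1; }
            printf("synced file %d\n", dirty[i].fileno);
        }
    }

    close(fds[0]);
    close(fds[1]);
    return 0;
}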

For the kicks, I wrote a quick & dirty patch for interleaving the
fsyncs, see attached. It works by repeatedly scanning the buffer pool,
writing buffers belonging to a single relation segment at a time. I
would be interested to hear how this performs in your test case.

- Heikki


Attachment

Re: postgresql latency & bgwriter not doing its job

From
Andres Freund
Date:
On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
> On 08/27/2014 04:20 PM, Andres Freund wrote:
> >On 2014-08-27 10:17:06 -0300, Claudio Freire wrote:
> >>>I think a somewhat smarter version of the explicit flushes in the
> >>>hack^Wpatch I posted nearby is going to more likely to be successful.
> >>
> >>
> >>That path is "dangerous" (as in, may not work as intended) if the
> >>filesystem doesn't properly understand range flushes (ehem, like
> >>ext3).
> >
> >The sync_file_range(SYNC_FILE_RANGE_WRITE) I used isn't a operation
> >guaranteeing durability. And - afaik - not implemented in a file system
> >specific manner. It just initiates writeback for individual pages. It
> >doesn't cause barrier, journal flushes or anything to be issued. That's
> >still done by the fsync() later.
> >
> >The big disadvantage is that it's a OS specific solution, but I don't
> >think we're going to find anything that isn't in this area.
> 
> I've been thinking for a long time that we should interleave the writes and
> the fsyncs. That still forces up to 1GB of dirty buffers to disk at once,
> causing a spike, but at least not more than that.

I think there are considerable benefits to sorting checkpoint io by file
and offset in that file. Obviously the likelihood of sequential IO is
higher; but there's also less chance that other processes write out
dirty buffers that have to be flushed out by the fsync() in a drive by
manner.

I don't think it's good enough to solve the problem Fabien is talking
about though. 1GB is heck of a lot of IO to submit at once. That'll
cause latency issues unless you have a write back controller with more
than 1GB of cache.

So I think we need both, control over the amount of dirty data in the
kernel *and* sorted writeouts. To the point that I've been tinkering
with converting buftable.c into a radix tree. That'd allow to
efficiently scan all buffers of a filenode in order. Useful for
checkpoints, but also for lots of other things. Unfortunately our buffer
tags are freakishly huge, making the worst case memory requirements and
the depth of tree quite bad.

> A long time ago, Itagaki Takahiro wrote a patch sort the buffers and write
> them out in order (http://www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp).
> The performance impact of that was inconclusive, but one thing that it
> allows nicely is to interleave the fsyncs, so that you write all the buffers
> for one file, then fsync it, then next file and so on. IIRC the biggest
> worry with that patch was that sorting the buffers requires a fairly large
> amount of memory, and making a large allocation in the checkpointer might
> cause an out-of-memory, which would be bad.
> 
> I don't think anyone's seriously worked on this area since. If the impact on
> responsiveness or performance is significant, I'm pretty sure the OOM
> problem could be alleviated somehow.

It's a major problem imo.

What I'd been thinking of is to checkpoint writeout in batches. Collect
100k buffers, sort them, write them out. Go to the next 100k.

> For the kicks, I wrote a quick & dirty patch for interleaving the fsyncs,
> see attached. It works by repeatedly scanning the buffer pool, writing
> buffers belonging to a single relation segment at a time. I would be
> interested to hear how this performs in your test case.

I bet it's not fundamentally changing the amount of transactions that
don't make the deadline - there's more than enough dirty buffers in one
1GB segment to cause issues. But I think it might already be a
significant benefit for peak throughput *and* latency if you combine it
with my approach of initiating writeout to disk during the whole
sync. My patch can slow things down considerably in the worst case by
causing superflous random IO, which your patch should alleviate.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: postgresql latency & bgwriter not doing its job

From
Fabien COELHO
Date:
> off:
>
> $ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
> number of skipped transactions: 1345 (6.246 %)
>
> on:
>
> $ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
> number of skipped transactions: 1 (0.005 %)

> That machine is far from idle right now, so the noise is pretty high.

What is the OS and FS? Could it be XFS?

> But rather nice initial results.

Indeed, I can confirm:

I did 5000s 25tps tests:
 - Off: 8002 transactions lost (6.3%)
 - On: 158 transactions "lost" (0.12%).

Although it is still 13 times larger than the 12 (0.01%) lost with my 
every 0.2s CHECKPOINT hack, it is nevertheless much much better than 
before!

The bad news: under unthrottled pgbench load, the tps is divided by 2 (300 
-> 150, could have been worse), *BUT* it is also much smoother; the tps does 
not drop to 0, but stays in the 50-100 range before the next spike.

I'm wondering about the order of operations. It seems to me that you sync 
just after giving back a buffer. Maybe it would be better to pipeline it, 
that is something like this (a rough sketch in code follows below):

  round 0:
    send buffers 0
    sleep?

  round N:
    sync buffers N-1
    send buffers N
    sleep?

  final N sync:
    sync buffer N
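
A toy version of that pipelining over a single scratch file could look like
this (Linux-only; the round size, file name and sleep are made up):

/* paced_rounds.c: pipelined pacing sketch: in round N, first push round
 * N-1's pages to the kernel's writeback, then write round N's pages, then
 * sleep; the last round is pushed at the end and made durable by fsync(). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ          8192
#define PAGES_PER_ROUND 16
#define NROUNDS         8
#define ROUND_BYTES     ((off_t) PAGES_PER_ROUND * BLCKSZ)

int main(void)
{
    char page[BLCKSZ];
    int  fd = open("scratch.dat", O_RDWR | O_CREAT, 0600);
    int  r, p;

    if (fd < 0) { perror("open"); return 1; }
    memset(page, 'x', sizeof(page));

    for (r = 0; r < NROUNDS; r++)
    {
        /* round N: first start writeback of round N-1 */
        if (r > 0)
            sync_file_range(fd, (off_t) (r - 1) * ROUND_BYTES, ROUND_BYTES,
                            SYNC_FILE_RANGE_WRITE);

        /* then send round N's buffers */
        for (p = 0; p < PAGES_PER_ROUND; p++)
            pwrite(fd, page, BLCKSZ,
                   (off_t) r * ROUND_BYTES + (off_t) p * BLCKSZ);

        usleep(100000);        /* checkpoint pacing sleep, 100 ms here */
    }

    /* final round: push the last round's pages, then make it all durable */
    sync_file_range(fd, (off_t) (NROUNDS - 1) * ROUND_BYTES, ROUND_BYTES,
                    SYNC_FILE_RANGE_WRITE);
    fsync(fd);

    close(fd);
    return 0;
}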

I have not found how to control the checkpoint pacing interval, if there 
is such a thing. With a 200ms lag limit on pgbench, it would be nice if it 
is less than 200ms.

I found this old thread "Add basic checkpoint sync spreading" by Greg 
Smith and Simon Riggs, dating from 2010: 
http://www.postgresql.org/message-id/4CE07548.4030709@2ndquadrant.com 
https://commitfest.postgresql.org/action/patch_view?id=431 which ends up 
"returned with feedback".

-- 
Fabien.



Re: postgresql latency & bgwriter not doing its job

From
Andres Freund
Date:
On 2014-08-27 19:00:12 +0200, Fabien COELHO wrote:
> 
> >off:
> >
> >$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
> >number of skipped transactions: 1345 (6.246 %)
> >
> >on:
> >
> >$ pgbench -p 5440 -h /tmp postgres -M prepared -c 16 -j16 -T 120 -R 180 -L 200
> >number of skipped transactions: 1 (0.005 %)
> 
> >That machine is far from idle right now, so the noise is pretty high.
> 
> What is the OS and FS? Could it be XFS?

That's linux v3.17-rc2 + ext4.

> >But rather nice initial results.
> 
> Indeed, I can confirm:
> 
> I did 5000s 25tps tests:
>  - Off: 8002 transactions lost (6.3%)
>  - On: 158 transactions "lost" (0.12%).
> 
> Although it is still 13 times larger than the 12 (0.01%) lost with my every
> 0.2s CHECKPOINT hack, it is nevertheless much much better than before!
> 
> The bad news, under pgbench unthrottled load, the tps is divided by 2 (300
> -> 150, could have been worse), *BUT* is also much smoother, the tps is not
> going to 0, but stay in 50-100 range before the next spike.

Yea, I'm not surprised. With a sensible (aka larger) checkpoint_timeout
the performance penalty isn't that big, but it's there. That's why I
think (as mentioned to Heikki nearby) it needs to be combined with
sorting during the checkpoint phase.

> I'm wondering about he order of operations. It seems to me that you sync
> just after giving back a buffer.

Yep. Was just a rather quick patch...

> Maybe it would be better to pipeline it,
> that is something like:
> 
>   round 0:
>     send buffers 0
>     sleep?
> 
>   round N:
>     sync buffers N-1
>     send buffers N
>     sleep?
> 
>   final N sync:
>     sync buffer N

Yes, I think we're going to need to leave a bit more room for write
combining and such here. But I think it's going to be better to issue
flushes for several buffers together - just not after each write(). To
be really beneficial it needs sorted output though.

> I have not found how to control the checkpoint pacing interval, if there is
> such a thing. With a 200ms lag limit on pgbench, it would be nice if it is
> less than 200ms.

Not sure what you mean.

> I found this old thread "Add basic checkpoint sync spreading" by Greg Smith
> and Simons Riggs, dating from 2010:
> http://www.postgresql.org/message-id/4CE07548.4030709@2ndquadrant.com
> https://commitfest.postgresql.org/action/patch_view?id=431 which ends up
> "returned with feedback".

I didn't really like the unapplied remainder of what was proposed in
there.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: postgresql latency & bgwriter not doing its job

From
Fabien COELHO
Date:
Hello Aidan,

> If all you want is to avoid the write storms when fsyncs start happening on
> slow storage, can you not just adjust the kernel vm.dirty* tunables to
> start making the kernel write out dirty buffers much sooner instead of
> letting them accumulate until fsyncs force them out all at once?

I tried that by setting:
  vm.dirty_expire_centisecs = 100
  vm.dirty_writeback_centisecs = 100

So it should start writing returned buffers at most 2s after they are 
returned, if I understood the doc correctly, instead of at most 35s.

The result is that with a 5000s 25tps pretty small load (the system can do 
300tps with the default configuration), I lost 2091 (1.7%) of 
transactions, that is they were beyond the 200ms schedule limit. Another 
change is that overall the lost transactions are more spread than without 
this setting, although there still is stretches of unresponsiveness.

So although the situation is significantly better, it is still far from 
good with the much reduced value I tried.

-- 
Fabien.



Re: postgresql latency & bgwriter not doing its job

From
Claudio Freire
Date:
On Thu, Aug 28, 2014 at 3:27 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> Hello Aidan,
>
>
>> If all you want is to avoid the write storms when fsyncs start happening
>> on
>> slow storage, can you not just adjust the kernel vm.dirty* tunables to
>> start making the kernel write out dirty buffers much sooner instead of
>> letting them accumulate until fsyncs force them out all at once?
>
>
> I tried that by setting:
>   vm.dirty_expire_centisecs = 100
>   vm.dirty_writeback_centisecs = 100
>
> So it should start writing returned buffers at most 2s after they are
> returned, if I understood the doc correctly, instead of at most 35s.
>
> The result is that with a 5000s 25tps pretty small load (the system can do
> 300tps with the default configuration), I lost 2091 (1.7%) of transactions,
> that is they were beyond the 200ms schedule limit. Another change is that
> overall the lost transactions are more spread than without this setting,
> although there still is stretches of unresponsiveness.
>
> So although the situation is significantly better, it is still far from good
> with the much reduced value I tried.


What do the checkpoint logs look like now? (especially interested in sync times)



Re: postgresql latency & bgwriter not doing its job

From
Fabien COELHO
Date:
>> I tried that by setting:
>>   vm.dirty_expire_centisecs = 100
>>   vm.dirty_writeback_centisecs = 100
>>
>> So it should start writing returned buffers at most 2s after they are
>> returned, if I understood the doc correctly, instead of at most 35s.
>>
>> The result is that with a 5000s 25tps pretty small load (the system can do
>> 300tps with the default configuration), I lost 2091 (1.7%) of transactions,
>> that is they were beyond the 200ms schedule limit. Another change is that
>> overall the lost transactions are more spread than without this setting,
>> although there still is stretches of unresponsiveness.
>>
>> So although the situation is significantly better, it is still far from good
>> with the much reduced value I tried.
>
> What do the checkpoint logs look like now? (especially interested in sync times)

Here is an extract:
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 4893 buffers (4.8%); 0 transaction log file(s) added, 0 removed, 0 recycled;
    write=128.430 s, sync=0.378 s, total=128.921 s; sync files=11, longest=0.375 s, average=0.034 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6200 buffers (6.1%); 0 transaction log file(s) added, 3 removed, 0 recycled;
    write=128.934 s, sync=0.240 s, total=129.280 s; sync files=7, longest=0.132 s, average=0.034 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6177 buffers (6.0%); 0 transaction log file(s) added, 3 removed, 0 recycled;
    write=130.146 s, sync=0.322 s, total=130.592 s; sync files=7, longest=0.185 s, average=0.046 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6185 buffers (6.0%); 0 transaction log file(s) added, 3 removed, 0 recycled;
    write=132.673 s, sync=0.143 s, total=132.909 s; sync files=5, longest=0.078 s, average=0.028 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6188 buffers (6.0%); 0 transaction log file(s) added, 3 removed, 0 recycled;
    write=130.415 s, sync=0.170 s, total=130.676 s; sync files=5, longest=0.132 s, average=0.034 s
  LOG:  checkpoint starting: xlog
  LOG:  checkpoint complete: wrote 6203 buffers (6.1%); 0 transaction log file(s) added, 3 removed, 0 recycled;
    write=131.726 s, sync=0.444 s, total=132.256 s; sync files=5, longest=0.441 s, average=0.088 s
  ...
 

Basically sync is between 0.1-0.5 seconds.

I had one particularly bad stretch of unresponsiveness, which is not 
clearly related to a slow checkpoint sync:
  progress: 4065.0 s, 28.7 tps, lat 11.886 ms stddev 48.470, lag 18.851 ms, 0 skipped
  progress: 4066.2 s, 1.6 tps, lat 952.830 ms stddev 280.673, lag 155.310 ms, 25 skipped
  progress: 4067.0 s, 2.7 tps, lat 1067.730 ms stddev 316.321, lag 171.766 ms, 14 skipped
  progress: 4068.4 s, 1.4 tps, lat 1147.392 ms stddev 240.337, lag 132.297 ms, 35 skipped
  progress: 4069.7 s, 2.3 tps, lat 1034.543 ms stddev 213.786, lag 154.453 ms, 35 skipped
  progress: 4070.2 s, 1.8 tps, lat 562.715 ms stddev 0.000, lag 182.833 ms, 12 skipped
  progress: 4071.3 s, 3.6 tps, lat 600.929 ms stddev 108.232, lag 162.723 ms, 25 skipped
  progress: 4072.5 s, 1.7 tps, lat 1077.187 ms stddev 77.654, lag 168.849 ms, 31 skipped
  progress: 4073.3 s, 1.3 tps, lat 1298.093 ms stddev 0.000, lag 168.882 ms, 21 skipped
  progress: 4074.0 s, 2.7 tps, lat 920.704 ms stddev 183.587, lag 165.333 ms, 24 skipped
  progress: 4075.3 s, 2.4 tps, lat 756.051 ms stddev 118.241, lag 176.863 ms, 29 skipped
  progress: 4076.1 s, 1.3 tps, lat 1424.548 ms stddev 0.000, lag 0.000 ms, 17 skipped
  progress: 4077.3 s, 2.4 tps, lat 791.090 ms stddev 338.729, lag 155.403 ms, 26 skipped
  progress: 4078.1 s, 27.4 tps, lat 46.834 ms stddev 198.812, lag 3.915 ms, 0 skipped
 



-- 
Fabien



Re: postgresql latency & bgwriter not doing its job

From
Andres Freund
Date:
On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
> A long time ago, Itagaki Takahiro wrote a patch sort the buffers and write
> them out in order (http://www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp).
> The performance impact of that was inconclusive, but one thing that it
> allows nicely is to interleave the fsyncs, so that you write all the buffers
> for one file, then fsync it, then next file and so on. IIRC the biggest
> worry with that patch was that sorting the buffers requires a fairly large
> amount of memory, and making a large allocation in the checkpointer might
> cause an out-of-memory, which would be bad.
> 
> I don't think anyone's seriously worked on this area since. If the impact on
> responsiveness or performance is significant, I'm pretty sure the OOM
> problem could be alleviated somehow.

I've dug up that patch (after a bit of fighting with the archives) and
refreshed it. It's *clearly* beneficial:

master:
andres@awork2:~/src/postgresql$ pgbench -p 5440 -h /tmp postgres -M prepared -c 180 -j 180 -T 180 -L 100 --progress=1
starting vacuum...end.
progress: 1.0 s, 2847.6 tps, lat 53.862 ms stddev 49.219
...
progress: 67.0 s, 3435.4 tps, lat 52.920 ms stddev 48.720
progress: 68.2 s, 2586.9 tps, lat 57.793 ms stddev 64.228
progress: 69.1 s, 546.5 tps, lat 294.940 ms stddev 189.546
progress: 70.0 s, 1741.3 tps, lat 134.298 ms stddev 204.740
progress: 71.0 s, 3868.8 tps, lat 48.423 ms stddev 47.934
..
progress: 89.0 s, 4022.8 tps, lat 45.601 ms stddev 40.685
progress: 90.0 s, 2463.5 tps, lat 61.907 ms stddev 64.342
progress: 91.2 s, 856.3 tps, lat 211.610 ms stddev 149.916
progress: 92.0 s, 1026.9 tps, lat 177.103 ms stddev 144.448
progress: 93.0 s, 736.9 tps, lat 254.230 ms stddev 227.339
progress: 94.1 s, 766.9 tps, lat 208.031 ms stddev 181.340
progress: 95.1 s, 979.7 tps, lat 197.014 ms stddev 193.648
progress: 96.0 s, 868.9 tps, lat 214.060 ms stddev 198.297
progress: 97.1 s, 943.4 tps, lat 178.062 ms stddev 143.224
progress: 98.0 s, 934.5 tps, lat 192.435 ms stddev 197.901
progress: 99.6 s, 623.1 tps, lat 202.954 ms stddev 165.576
progress: 100.0 s, 464.7 tps, lat 683.600 ms stddev 376.520
progress: 101.1 s, 516.0 tps, lat 395.033 ms stddev 480.346
progress: 102.0 s, 364.9 tps, lat 507.933 ms stddev 499.670
progress: 103.3 s, 592.9 tps, lat 214.123 ms stddev 273.411
progress: 104.1 s, 930.2 tps, lat 316.487 ms stddev 335.096
progress: 105.4 s, 627.6 tps, lat 262.496 ms stddev 200.690
progress: 106.1 s, 788.6 tps, lat 235.510 ms stddev 202.366
progress: 107.5 s, 644.8 tps, lat 269.020 ms stddev 223.900
progress: 108.0 s, 725.0 tps, lat 262.692 ms stddev 218.774
progress: 109.0 s, 660.0 tps, lat 272.808 ms stddev 248.501
progress: 110.0 s, 604.3 tps, lat 303.727 ms stddev 264.921
progress: 111.0 s, 723.6 tps, lat 243.224 ms stddev 229.426
progress: 112.1 s, 668.6 tps, lat 257.026 ms stddev 190.247
progress: 113.1 s, 345.0 tps, lat 492.114 ms stddev 312.082
progress: 115.4 s, 390.9 tps, lat 416.708 ms stddev 391.577
progress: 115.4 s, 14598.5 tps, lat 551.617 ms stddev 539.611
progress: 116.1 s, 161.5 tps, lat 741.611 ms stddev 485.498
progress: 117.4 s, 269.1 tps, lat 697.978 ms stddev 576.970
progress: 118.8 s, 262.3 tps, lat 674.887 ms stddev 587.848
progress: 119.1 s, 195.2 tps, lat 833.959 ms stddev 733.592
progress: 120.0 s, 3000.6 tps, lat 104.272 ms stddev 291.851
progress: 121.0 s, 3167.7 tps, lat 56.576 ms stddev 51.976
progress: 122.0 s, 3398.1 tps, lat 51.322 ms stddev 48.057
progress: 123.0 s, 3721.9 tps, lat 50.355 ms stddev 46.994
progress: 124.0 s, 2929.3 tps, lat 50.996 ms stddev 45.553
progress: 125.0 s, 754.5 tps, lat 269.293 ms stddev 287.508
progress: 126.0 s, 3297.0 tps, lat 56.912 ms stddev 77.053
...
progress: 144.0 s, 4207.9 tps, lat 44.440 ms stddev 37.210
progress: 145.9 s, 2036.4 tps, lat 79.025 ms stddev 105.411
progress: 146.0 s, 1003.1 tps, lat 292.934 ms stddev 223.650
progress: 147.4 s, 520.8 tps, lat 318.670 ms stddev 244.596
progress: 148.0 s, 3557.3 tps, lat 71.626 ms stddev 143.174
progress: 149.0 s, 4106.1 tps, lat 43.557 ms stddev 36.444
progress: 150.0 s, 4132.3 tps, lat 43.185 ms stddev 34.611
progress: 151.0 s, 4233.3 tps, lat 43.239 ms stddev 39.121
progress: 152.0 s, 4178.2 tps, lat 43.242 ms stddev 40.377
progress: 153.0 s, 755.1 tps, lat 198.560 ms stddev 155.927
progress: 154.1 s, 773.6 tps, lat 240.044 ms stddev 192.472
progress: 155.0 s, 553.7 tps, lat 303.532 ms stddev 245.491
progress: 156.2 s, 772.7 tps, lat 242.925 ms stddev 226.754
progress: 157.1 s, 541.0 tps, lat 295.132 ms stddev 218.857
progress: 158.1 s, 716.8 tps, lat 281.823 ms stddev 227.488
progress: 159.1 s, 748.7 tps, lat 223.275 ms stddev 186.162
progress: 160.0 s, 503.0 tps, lat 311.621 ms stddev 215.952
progress: 161.1 s, 905.0 tps, lat 239.623 ms stddev 245.539
progress: 162.4 s, 360.4 tps, lat 329.583 ms stddev 250.094
progress: 163.3 s, 348.9 tps, lat 583.476 ms stddev 432.200
progress: 165.5 s, 186.1 tps, lat 765.542 ms stddev 552.133
progress: 165.5 s, 9950.2 tps, lat 1424.984 ms stddev 240.249
progress: 166.4 s, 137.3 tps, lat 1056.184 ms stddev 765.356
progress: 167.2 s, 249.7 tps, lat 793.423 ms stddev 792.666
progress: 168.6 s, 194.0 tps, lat 851.712 ms stddev 737.653
progress: 169.2 s, 136.0 tps, lat 1229.394 ms stddev 1227.723
progress: 170.6 s, 92.3 tps, lat 1569.107 ms stddev 1517.027
progress: 171.2 s, 276.5 tps, lat 1325.410 ms stddev 1252.237
progress: 172.0 s, 3881.2 tps, lat 91.122 ms stddev 366.052
progress: 174.4 s, 995.6 tps, lat 120.200 ms stddev 242.005
progress: 174.4 s, 16227.2 tps, lat 651.361 ms stddev 527.946
progress: 175.1 s, 200.0 tps, lat 808.114 ms stddev 542.989
progress: 176.1 s, 179.4 tps, lat 1024.893 ms stddev 724.887
progress: 177.4 s, 367.0 tps, lat 648.519 ms stddev 790.413
progress: 178.4 s, 2236.7 tps, lat 114.064 ms stddev 271.525
progress: 179.0 s, 3643.5 tps, lat 48.404 ms stddev 47.793
progress: 180.0 s, 3309.0 tps, lat 55.598 ms stddev 49.211
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: prepared
number of clients: 180
number of threads: 180
duration: 180 s
number of transactions actually processed: 496443
latency average: 65.245 ms
latency stddev: 134.292 ms
tps = 2756.518300 (including connections establishing)
tps = 2756.718588 (excluding connections establishing)


LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 139127 buffers (21.2%); 0 transaction log file(s) added, 0 removed, 8 recycled;
write=53.660s, sync=6.739 s, total=60.611 s; sync files=18, longest=4.382 s, average=0.374 s
 
LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 70472 buffers (10.8%); 0 transaction log file(s) added, 0 removed, 70 recycled;
write=53.107s, sync=5.375 s, total=58.605 s; sync files=15, longest=3.243 s, average=0.358 s
 
LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 80357 buffers (12.3%); 0 transaction log file(s) added, 0 removed, 49 recycled;
write=53.526s, sync=3.896 s, total=57.573 s; sync files=15, longest=1.206 s, average=0.259 s
 

sorted writes:
starting vacuum...end.
progress: 1.0 s, 3237.2 tps, lat 49.544 ms stddev 42.203
..
progress: 88.0 s, 3030.8 tps, lat 47.438 ms stddev 46.563
progress: 89.3 s, 919.0 tps, lat 204.152 ms stddev 172.776
progress: 90.1 s, 973.7 tps, lat 204.650 ms stddev 197.843
progress: 91.1 s, 875.5 tps, lat 200.472 ms stddev 157.295
progress: 92.0 s, 939.5 tps, lat 185.357 ms stddev 146.362
progress: 93.1 s, 975.8 tps, lat 184.445 ms stddev 167.520
progress: 94.9 s, 673.0 tps, lat 260.900 ms stddev 224.125
progress: 95.0 s, 1057.6 tps, lat 223.717 ms stddev 260.276
progress: 96.0 s, 917.7 tps, lat 201.704 ms stddev 220.395
progress: 97.0 s, 2573.0 tps, lat 78.913 ms stddev 119.458
progress: 98.0 s, 3921.5 tps, lat 44.160 ms stddev 39.147
progress: 99.0 s, 4033.3 tps, lat 43.972 ms stddev 45.921
..
progress: 110.0 s, 4057.8 tps, lat 44.283 ms stddev 47.003
progress: 111.0 s, 897.6 tps, lat 158.250 ms stddev 132.126
progress: 112.0 s, 751.6 tps, lat 249.282 ms stddev 232.964
progress: 113.0 s, 550.6 tps, lat 296.155 ms stddev 226.163
progress: 114.0 s, 1875.3 tps, lat 123.059 ms stddev 210.661
progress: 115.1 s, 4555.4 tps, lat 39.404 ms stddev 32.331
..
progress: 151.0 s, 4240.2 tps, lat 42.158 ms stddev 38.577
progress: 152.1 s, 2551.3 tps, lat 63.584 ms stddev 73.780
progress: 153.1 s, 912.5 tps, lat 189.030 ms stddev 146.658
progress: 154.2 s, 976.2 tps, lat 192.499 ms stddev 158.108
progress: 155.0 s, 1005.1 tps, lat 178.215 ms stddev 129.062
progress: 156.0 s, 1010.1 tps, lat 175.583 ms stddev 138.823
progress: 157.0 s, 1042.0 tps, lat 161.675 ms stddev 135.350
progress: 158.1 s, 766.8 tps, lat 216.367 ms stddev 234.374
progress: 159.5 s, 1025.7 tps, lat 199.366 ms stddev 228.530
progress: 160.0 s, 1178.6 tps, lat 159.156 ms stddev 130.660
progress: 161.0 s, 2904.1 tps, lat 65.751 ms stddev 79.928
progress: 162.0 s, 4189.6 tps, lat 43.492 ms stddev 38.490
..
progress: 170.0 s, 4150.9 tps, lat 41.998 ms stddev 40.931
progress: 171.1 s, 595.5 tps, lat 233.138 ms stddev 188.073
progress: 172.0 s, 622.6 tps, lat 245.023 ms stddev 279.715
progress: 173.0 s, 588.1 tps, lat 359.104 ms stddev 315.367
progress: 174.0 s, 3011.0 tps, lat 74.811 ms stddev 172.972
progress: 175.0 s, 4829.0 tps, lat 37.520 ms stddev 30.578
progress: 176.0 s, 4324.1 tps, lat 41.388 ms stddev 37.727
progress: 177.0 s, 3632.8 tps, lat 49.219 ms stddev 47.092
progress: 178.0 s, 3621.0 tps, lat 50.332 ms stddev 43.086
progress: 179.0 s, 3468.3 tps, lat 50.969 ms stddev 43.025
progress: 180.0 s, 3710.7 tps, lat 48.782 ms stddev 49.070
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: prepared
number of clients: 180
number of threads: 180
duration: 180 s
number of transactions actually processed: 634014
latency average: 51.089 ms
latency stddev: 63.864 ms
number of transactions above the 100.0 ms latency limit: 66921
tps = 3520.574003 (excluding connections establishing)

LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 135453 buffers (20.7%); 0 transaction log file(s) added, 0 removed, 2 recycled;
write=53.419s, sync=3.665 s, total=57.390 s; sync files=17, longest=1.937 s, average=0.215 s
 
LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 96907 buffers (14.8%); 0 transaction log file(s) added, 0 removed, 68 recycled;
write=53.147s, sync=3.369 s, total=56.633 s; sync files=15, longest=1.920 s, average=0.224 s
 


sorted writes + flush in SyncOneBuffer():


andres@awork2:~/src/postgresql$ pgbench -p 5440 -h /tmp postgres -M prepared -c 180 -j 180 -T 180 -L 100 --progress=1
starting vacuum...end.
progress: 1.0 s, 3044.7 tps, lat 49.030 ms stddev 45.608
progress: 2.0 s, 3701.3 tps, lat 48.231 ms stddev 44.762
...
progress: 55.0 s, 4157.1 tps, lat 43.216 ms stddev 40.821
progress: 56.0 s, 4736.3 tps, lat 38.326 ms stddev 35.699
progress: 57.0 s, 2604.9 tps, lat 66.072 ms stddev 65.534
progress: 58.0 s, 1659.8 tps, lat 113.848 ms stddev 121.625
progress: 59.0 s, 2625.2 tps, lat 66.112 ms stddev 54.349
..
progress: 106.0 s, 2571.8 tps, lat 72.806 ms stddev 83.308
progress: 107.0 s, 1818.7 tps, lat 93.315 ms stddev 103.284
progress: 108.1 s, 1876.6 tps, lat 97.447 ms stddev 103.151
progress: 109.1 s, 1768.8 tps, lat 110.406 ms stddev 121.347
progress: 110.0 s, 3129.4 tps, lat 56.100 ms stddev 54.835
progress: 111.0 s, 4402.0 tps, lat 40.581 ms stddev 42.269
progress: 112.0 s, 4379.3 tps, lat 42.625 ms stddev 39.507
progress: 113.0 s, 4551.6 tps, lat 39.303 ms stddev 30.563
progress: 114.0 s, 4383.3 tps, lat 41.169 ms stddev 37.459
progress: 115.1 s, 4652.3 tps, lat 38.769 ms stddev 33.415
progress: 116.1 s, 4015.5 tps, lat 41.939 ms stddev 34.122
progress: 117.1 s, 2159.0 tps, lat 84.653 ms stddev 81.445
progress: 118.0 s, 2971.3 tps, lat 62.874 ms stddev 62.229
progress: 119.0 s, 2825.1 tps, lat 60.875 ms stddev 54.767
progress: 120.0 s, 2542.9 tps, lat 75.276 ms stddev 68.709
progress: 121.1 s, 2650.4 tps, lat 63.833 ms stddev 56.419
progress: 122.0 s, 2766.8 tps, lat 68.196 ms stddev 67.042
..
progress: 166.0 s, 2412.4 tps, lat 71.307 ms stddev 74.961
progress: 167.1 s, 1742.5 tps, lat 99.519 ms stddev 102.500
progress: 168.1 s, 1324.2 tps, lat 122.625 ms stddev 135.177
progress: 169.0 s, 2002.5 tps, lat 108.435 ms stddev 125.576
progress: 170.0 s, 2796.6 tps, lat 64.867 ms stddev 64.665
progress: 171.0 s, 3969.1 tps, lat 45.520 ms stddev 41.656
...
progress: 180.0 s, 3003.9 tps, lat 59.134 ms stddev 51.508
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: prepared
number of clients: 180
number of threads: 180
duration: 180 s
number of transactions actually processed: 584725
latency average: 55.375 ms
latency stddev: 56.202 ms
number of transactions above the 100.0 ms latency limit: 76644
tps = 3246.299654 (including connections establishing)
tps = 3246.607040 (excluding connections establishing)

LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 135474 buffers (20.7%); 0 transaction log file(s) added, 0 removed, 0 recycled;
write=53.260s, sync=0.079 s, total=53.577 s; sync files=19, longest=0.032 s, average=0.004 s
 
LOG:  checkpoint starting: time
LOG:  checkpoint complete: wrote 92770 buffers (14.2%); 0 transaction log file(s) added, 0 removed, 68 recycled;
write=53.677s, sync=0.239 s, total=54.026 s; sync files=16, longest=0.093 s, average=0.014 s
 


So, *very* clearly sorting is a benefit. Both the dips in performance
are lower and the aggregate throughput is higher.

Additionally, flushing buffers during the checkpoint's buffer sync scan is a
clear win from the perspective of reliable performance, but it has a
noticeable throughput cost, although much less than without sorting.


Now, I've used an absurdly small checkpoint_timeout here to have the time
to run these benchmarks. But I'm damn sure this is similar with real
settings.

So, what I think we need to do is:

1) Sort writes on checkpoints. It's a clear win. Personally I'm
perfectly happy waving away the concerns about OOM. It's not that much
memory:
# SELECT pg_size_pretty(24 * ((64::int8 * 1024 * 1024 * 1024) /
8192::int8));
┌────────────────┐
│ pg_size_pretty │
├────────────────┤
│ 192 MB         │
└────────────────┘

If you can afford 64GB shared buffers, 192MB memory allocated in the
checkpointer hardly is a problem. (24 is sizeof(BufferTag) + sizeof(bufid))

2) At least optionally flush dirty buffers during the buffer sync
scan. The improvements in steady performance are too big to ignore. I
think we might be able to make it an unconditional win by flushing in
batches instead of doing so for each buffer like I've done.

3) Possibly do the mdsync() after & for each filenode (or segment?) like
Heikki basically just prototyped. That'll reduce the number of dirty
buffers our fsync() will flush out unnecessarily (which will create
stalls because they'll sit in the kernel's buffer cache).

> For the kicks, I wrote a quick & dirty patch for interleaving the fsyncs,
> see attached. It works by repeatedly scanning the buffer pool, writing
> buffers belonging to a single relation segment at a time. I would be
> interested to hear how this performs in your test case.

I think that'll perform horribly on large shared buffers with many
relations. Not an uncommon scenario...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: postgresql latency & bgwriter not doing its job

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
>> A long time ago, Itagaki Takahiro wrote a patch sort the buffers and write
>> them out in order (http://www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp).
>> The performance impact of that was inconclusive, but one thing that it
>> allows nicely is to interleave the fsyncs, so that you write all the buffers
>> for one file, then fsync it, then next file and so on.

> ...
> So, *very* clearly sorting is a benefit.

pg_bench alone doesn't convince me on this.  The original thread found
cases where it was a loss, IIRC; you will need to test many more than
one scenario to prove the point.

Also, it does not matter how good it looks in test cases if it causes
outright failures due to OOM; unlike you, I am not prepared to just "wave
away" that risk.  A possible compromise is to sort a limited number of
buffers ---- say, collect a few thousand dirty buffers then sort, dump and
fsync them, repeat as needed.
        regards, tom lane



Re: postgresql latency & bgwriter not doing its job

From
Andres Freund
Date:
On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
> >> A long time ago, Itagaki Takahiro wrote a patch sort the buffers and write
> >> them out in order (http://www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp).
> >> The performance impact of that was inconclusive, but one thing that it
> >> allows nicely is to interleave the fsyncs, so that you write all the buffers
> >> for one file, then fsync it, then next file and so on.
> 
> > ...
> > So, *very* clearly sorting is a benefit.
> 
> pg_bench alone doesn't convince me on this.  The original thread found
> cases where it was a loss, IIRC; you will need to test many more than
> one scenario to prove the point.

Sure. And I'm not claiming Itagaki/your old patch is immediately going
to be ready for commit. But our checkpoint performance has sucked for
years in the field. I don't think we can wave that away.

I think the primary reason it wasn't easily visible as being beneficial
back then was that only the throughput, not the latency and such were
looked at.

> Also, it does not matter how good it looks in test cases if it causes
> outright failures due to OOM; unlike you, I am not prepared to just "wave
> away" that risk.

I'm not "waving away" any risks.

If the sort buffer is allocated when the checkpointer is started, not
every time we sort, as you've done in your version of the patch, I think
that risk is pretty manageable. If we really want to be sure nothing is
happening at runtime, even if the checkpointer was restarted, we can put
the sort array in shared memory.

We're talking about (sizeof(BufferTag) + sizeof(int))/8192 ~= 0.3 %
overhead over shared_buffers here. If we decide to go that way, it's a
pretty darn small price to pay to not regularly have stalls that last
minutes.

> A possible compromise is to sort a limited number of
> buffers ---- say, collect a few thousand dirty buffers then sort, dump and
> fsync them, repeat as needed.

Yea, that's what I suggested nearby. But I don't really like it, because
it robs us of the chance to fsync() a relfilenode immediately after 
having synced all its buffers. Doing so is rather beneficial because
then fewer independently dirtied pages have to be flushed out - reducing
the impact of the fsync().

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: postgresql latency & bgwriter not doing its job

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
>> A possible compromise is to sort a limited number of
>> buffers ---- say, collect a few thousand dirty buffers then sort, dump and
>> fsync them, repeat as needed.

> Yea, that's what I suggested nearby. But I don't really like it, because
> it robs us of the the chance to fsync() a relfilenode immediately after
> having synced all its buffers.

Uh, how so exactly?  You could still do that.  Yeah, you might fsync a rel
once per sort-group and not just once per checkpoint, but it's not clear
that that's a loss as long as the group size isn't tiny.
        regards, tom lane



Re: postgresql latency & bgwriter not doing its job

From
Andres Freund
Date:
On 2014-08-30 14:16:10 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
> >> A possible compromise is to sort a limited number of
> >> buffers ---- say, collect a few thousand dirty buffers then sort, dump and
> >> fsync them, repeat as needed.
> 
> > Yea, that's what I suggested nearby. But I don't really like it, because
> > it robs us of the the chance to fsync() a relfilenode immediately after
> > having synced all its buffers.
> 
> Uh, how so exactly?  You could still do that.  Yeah, you might fsync a rel
> once per sort-group and not just once per checkpoint, but it's not clear
> that that's a loss as long as the group size isn't tiny.

Because it wouldn't have the benefit of syncing the minimal amount of
data anymore. If lots of other relfilenodes have been synced in between,
the amount of newly dirtied pages in the os' buffercache (written by
backends, bgwriter) for an individual relfilenode is much higher.

A fsync() on a file with dirty data often causes *serious* latency
spikes - we should try hard to avoid superflous calls.

As an example: Calling fsync() on pgbench_accounts's underlying files,
from outside postgres, *before* postgres even started its first
checkpoint does this:
progress: 72.0 s, 4324.9 tps, lat 41.481 ms stddev 40.567
progress: 73.0 s, 4704.9 tps, lat 38.465 ms stddev 35.436
progress: 74.0 s, 4448.5 tps, lat 40.058 ms stddev 32.634
progress: 75.0 s, 4634.5 tps, lat 39.229 ms stddev 33.463
progress: 76.8 s, 2753.1 tps, lat 48.693 ms stddev 75.309
progress: 77.1 s, 126.6 tps, lat 773.433 ms stddev 222.667
progress: 78.0 s, 183.7 tps, lat 786.401 ms stddev 395.954
progress: 79.1 s, 170.3 tps, lat 975.949 ms stddev 596.751
progress: 80.0 s, 2116.6 tps, lat 168.608 ms stddev 398.933
progress: 81.0 s, 4436.1 tps, lat 40.313 ms stddev 34.198
progress: 82.0 s, 4383.9 tps, lat 41.811 ms stddev 37.241

Note the dip from 4k tps to 130 tps.

We can get a handle on that (on some platforms at least) for writes
issued during the buffer sync by forcing the kernel to write out the
pages in small increments; but I doubt we want to do that for writes by
backends themselves.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: postgresql latency & bgwriter not doing its job

From
Heikki Linnakangas
Date:
On 08/30/2014 09:45 PM, Andres Freund wrote:
> On 2014-08-30 14:16:10 -0400, Tom Lane wrote:
>> Andres Freund <andres@2ndquadrant.com> writes:
>>> On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
>>>> A possible compromise is to sort a limited number of
>>>> buffers ---- say, collect a few thousand dirty buffers then sort, dump and
>>>> fsync them, repeat as needed.
>>
>>> Yea, that's what I suggested nearby. But I don't really like it, because
>>> it robs us of the the chance to fsync() a relfilenode immediately after
>>> having synced all its buffers.
>>
>> Uh, how so exactly?  You could still do that.  Yeah, you might fsync a rel
>> once per sort-group and not just once per checkpoint, but it's not clear
>> that that's a loss as long as the group size isn't tiny.
>
> Because it wouldn't have the benefit of sycing the minimal amount of
> data anymore. If lots of other relfilenodes have been synced inbetween
> the amount of newly dirtied pages in the os' buffercache (written by
> backends, bgwriter) for a individual relfilenode is much higher.

I wonder how much of the benefit from sorting comes from sorting the 
pages within each file, and how much just from grouping all the writes 
of each file together. In other words, how much difference is there 
between sorting, and fsyncing between each file, or the crude patch I 
posted earlier.

If we're going to fsync between each file, there's no need to sort all 
the buffers at once. It's enough to pick one file as the target - like 
in my crude patch - and sort only the buffers for that file. Then fsync 
that file and move on to the next file. That requires scanning the 
buffers multiple times, but I think that's OK.
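
A rough sketch of that per-file variant (the "buffer pool" and file names are
invented for the example; each pass scans the whole pool, collects and sorts
one file's dirty blocks, writes them, and fsyncs before moving on):

/* per_file_sync_demo.c: instead of sorting the whole buffer pool at once,
 * repeatedly scan it, each time handling only the dirty blocks of one
 * target file, and fsync that file before moving to the next one. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192
#define NFILES 2

typedef struct { int fileno; unsigned blocknum; } PoolEntry;

static int blk_cmp(const void *a, const void *b)
{
    unsigned ba = ((const PoolEntry *) a)->blocknum;
    unsigned bb = ((const PoolEntry *) b)->blocknum;
    return (ba > bb) - (ba < bb);
}

int main(void)
{
    /* invented pool of dirty buffers spread over two files */
    PoolEntry pool[] = { {1, 5}, {0, 2}, {1, 1}, {0, 7}, {1, 3}, {0, 0} };
    int       npool = sizeof(pool) / sizeof(pool[0]);
    const char *names[NFILES] = { "rel0.seg", "rel1.seg" };
    char      page[BLCKSZ];
    int       target, i;

    memset(page, 'x', sizeof(page));

    for (target = 0; target < NFILES; target++)   /* one pass per file */
    {
        PoolEntry mine[8];
        int nmine = 0;
        int fd = open(names[target], O_RDWR | O_CREAT, 0600);

        if (fd < 0) { perror("open"); return 1; }

        for (i = 0; i < npool; i++)               /* scan the whole pool */
            if (pool[i].fileno == target)
                mine[nmine++] = pool[i];

        qsort(mine, nmine, sizeof(PoolEntry), blk_cmp);

        for (i = 0; i < nmine; i++)               /* write this file's blocks */
            pwrite(fd, page, BLCKSZ, (off_t) mine[i].blocknum * BLCKSZ);

        fsync(fd);                                /* sync before the next file */
        close(fd);
    }
    return 0;
}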

- Heikki




Re: postgresql latency & bgwriter not doing its job

From
Andres Freund
Date:
On 2014-08-31 01:50:48 +0300, Heikki Linnakangas wrote:
> On 08/30/2014 09:45 PM, Andres Freund wrote:
> >On 2014-08-30 14:16:10 -0400, Tom Lane wrote:
> >>Andres Freund <andres@2ndquadrant.com> writes:
> >>>On 2014-08-30 13:50:40 -0400, Tom Lane wrote:
> >>>>A possible compromise is to sort a limited number of
> >>>>buffers ---- say, collect a few thousand dirty buffers then sort, dump and
> >>>>fsync them, repeat as needed.
> >>
> >>>Yea, that's what I suggested nearby. But I don't really like it, because
> >>>it robs us of the the chance to fsync() a relfilenode immediately after
> >>>having synced all its buffers.
> >>
> >>Uh, how so exactly?  You could still do that.  Yeah, you might fsync a rel
> >>once per sort-group and not just once per checkpoint, but it's not clear
> >>that that's a loss as long as the group size isn't tiny.
> >
> >Because it wouldn't have the benefit of sycing the minimal amount of
> >data anymore. If lots of other relfilenodes have been synced inbetween
> >the amount of newly dirtied pages in the os' buffercache (written by
> >backends, bgwriter) for a individual relfilenode is much higher.
> 
> I wonder how much of the benefit from sorting comes from sorting the pages
> within each file, and how much just from grouping all the writes of each
> file together. In other words, how much difference is there between sorting,
> and fsyncing between each file, or the crude patch I posted earlier.

I haven't implemented fsync()ing between files so far. From the io stats
I'm seeing the performance improvements come from the OS being able to
write data back in bigger chunks. Which seems entirely reasonable. If
the database and the write load are big enough, so writeback will be
triggered repeatedly during one checkpoint, the OS's buffercache will
have lots of nonsequential data to flush. Leading to much smaller IOs,
more seeks and deeper queues (=> latency increases).

> If we're going to fsync between each file, there's no need to sort all the
> buffers at once. It's enough to pick one file as the target - like in my
> crude patch - and sort only the buffers for that file. Then fsync that file
> and move on to the next file. That requires scanning the buffers multiple
> times, but I think that's OK.

I really can't see that working out. Production instances of postgres
with large shared_buffers settings (say 96GB in one case) have tens of
thousands of relations (~34500 in the same case). And that's a database
with a relatively simple schema. I've seen much worse.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: postgresql latency & bgwriter not doing its job

From
Mitsumasa KONDO
Date:
Hi,

2014-08-31 8:10 GMT+09:00 Andres Freund <andres@2ndquadrant.com>:
> On 2014-08-31 01:50:48 +0300, Heikki Linnakangas wrote:
>> If we're going to fsync between each file, there's no need to sort all the
>> buffers at once. It's enough to pick one file as the target - like in my
>> crude patch - and sort only the buffers for that file. Then fsync that file
>> and move on to the next file. That requires scanning the buffers multiple
>> times, but I think that's OK.
>
> I really can't see that working out. Production instances of postgres
> with large shared_buffers settings (say 96GB in one case) have tens of
> thousands of relations (~34500 in the same case). And that's a database
> with a relatively simple schema. I've seen much worse.

Yeah, it is impossible in one checkpointer process. The cost of scanning
all buffers is higher than we expect. We would need a clever algorithm for
efficient, distributed buffer search using multiple processes or threads.

Regards,
--
Mitsumasa KONDO

Re: postgresql latency & bgwriter not doing its job

From
Fabien COELHO
Date:
Hello Heikki,

> For the kicks, I wrote a quick & dirty patch for interleaving the fsyncs, see 
> attached. It works by repeatedly scanning the buffer pool, writing buffers 
> belonging to a single relation segment at a time.

I tried this patch on the same host I used with the same "-R 25 -L 200 -T 
5000", alas without significant positive effects, about 6% of transactions 
were "lost", including stretches of several seconds of unresponsiveness.

Maybe this is because "pgbench -N" only basically touches 2 tables 
(accounts & history), so there are few file syncs involved, thus it does 
not help much with spreading a lot of writes.
  2014-08-30 23:52:24.167 CEST:   LOG:  checkpoint starting: xlog
  2014-08-30 23:54:09.239 CEST:   LOG:  checkpoint sync: number=1 file=base/16384/11902 time=24.529 msec
  2014-08-30 23:54:23.812 CEST:   LOG:  checkpoint sync: number=1 file=base/16384/16397_vm time=32.547 msec
  2014-08-30 23:54:24.771 CEST:   LOG:  checkpoint sync: number=1 file=base/16384/11873 time=557.470 msec
  2014-08-30 23:54:37.419 CEST:   LOG:  checkpoint complete: wrote 4931 buffers (30.1%); 0 transaction log file(s) added, 0 removed, 3 recycled;
    write=122.946 s, sync=10.129 s, total=133.252 s; sync files=6, longest=9.854 s, average=1.790 s
 

Note that given the small load and table size, pgbench implies random 
I/Os and basically nothing else.

-- 
Fabien.



Re: postgresql latency & bgwriter not doing its job

From
Jeff Janes
Date:
On Tue, Aug 26, 2014 at 1:02 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Hello again,


I have not found any mean to force bgwriter to send writes when it can.
(Well, I have: create a process which sends "CHECKPOINT" every 0.2
seconds... it works more or less, but this is not my point:-)

There is scan_whole_pool_milliseconds, which currently forces bgwriter to
circle the buffer pool at least once every 2 minutes.  It is currently
fixed, but it should be trivial to turn it into an experimental guc that
you could use to test your hypothesis.
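
For reference, a rough sketch of what such an experimental GUC could
look like, assuming the constant is hoisted out of BgBufferSync() into a
file-scope int variable; the GUC name and bounds below are made up, and
the entry just follows the usual guc.c pattern:

    /* bufmgr.c (sketch): previously a local constant in BgBufferSync() */
    int         scan_whole_pool_milliseconds = 120000;

    /* guc.c (sketch): added to the ConfigureNamesInt[] table */
    {
        {"bgwriter_scan_whole_pool_time", PGC_SIGHUP, RESOURCES_BGWRITER,
            gettext_noop("Time over which the bgwriter tries to cycle the whole buffer pool."),
            NULL,
            GUC_UNIT_MS
        },
        &scan_whole_pool_milliseconds,
        120000, 10, INT_MAX,
        NULL, NULL, NULL
    },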

I recompiled with the variable crudely set to 1000 instead of 120000. The situation is slightly degraded (15% of transactions were more than 200 ms late). However, it seems that bgwriter did not write many more pages:


You should probably try it set to 200 rather than 1000, to put it on an equal footing with the checkpoint_timeout of 0.2 seconds you reported on.

Not that I think this will improve the situation.  After all, my theory is that it does not matter who *writes* the pages, it only matters how they get fsynced.
 

  buffers_checkpoint = 26065
  buffers_clean = 5263
  buffers_backend = 367

Or I may have a problem interpreting pg_stat_bgwriter.


For this experiment, what was checkpoint_timeout set to?
 
Cheers,

Jeff

Re: postgresql latency & bgwriter not doing its job

From
Fabien COELHO
Date:
>>> There is scan_whole_pool_milliseconds, which currently forces bgwriter to
>>> circle the buffer pool at least once every 2 minutes.  It is currently
>>> fixed, but it should be trivial to turn it into an experimental guc that
>>> you could use to test your hypothesis.
>>
>> I recompiled with the variable crudely set to 1000 instead of 120000. The
>> situation is slightly degraded (15% of transactions were more than 200 ms
>> late). However, it seems that bgwriter did not write many more pages:
>
>
> You should probably try it set to 200 rather than 1000, to put it on an
> equal footing with the checkpoint_timeout of 0.2 seconds you reported on.

As I understand it, the setting makes the bgwriter process scan all of
shared_buffers within this amount of time... but ISTM that the key point
is that bgwriter has no incentive to start writing out buffers anyway
with its current decision rules, and that should not change with the
frequency at which they are scanned (?)

> For this experiment, what was checkpoint_timeout set to?

AFAICR, the default, 5min.

-- 
Fabien.



Re: postgresql latency & bgwriter not doing its job

From
Jeff Janes
Date:
On Tue, Sep 2, 2014 at 8:14 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

There is scan_whole_pool_milliseconds, which currently forces bgwriter to
circle the buffer pool at least once every 2 minutes.  It is currently
fixed, but it should be trivial to turn it into an experimental guc that
you could use to test your hypothesis.

I recompiled with the variable crudely set to 1000 instead of 120000. The
situation is slightly degraded (15% of transactions were more than 200 ms
late). However, it seems that bgwriter did not write many more pages:


You should probably try it set to 200 rather than 1000, to put it on an
equal footing with the checkpoint_timeout of 0.2 seconds you reported on.

As I understand it, the setting makes the bgwriter process scan all of shared_buffers within this amount of time... but ISTM that the key point is that bgwriter has no incentive to start writing out buffers anyway with its current decision rules, and that should not change with the frequency at which they are scanned (?)

Ah, I see now.  The usage counts are not zero, so it visits the buffer and then leaves it alone.
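
The relevant check is roughly the following (paraphrased, not verbatim,
from SyncOneBuffer() in bufmgr.c, with 9.4-era field names assumed); the
bgwriter calls it with skip_recently_used = true:

    if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)
        result |= BUF_REUSABLE;
    else if (skip_recently_used)
    {
        /* bgwriter path: pinned or recently used, so leave it alone */
        UnlockBufHdr(bufHdr);
        return result;
    }

So a buffer that keeps getting touched between bgwriter rounds is never
written out by the bgwriter, no matter how often the pool is scanned.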

Cheers,

Jeff

Re: postgresql latency & bgwriter not doing its job

From
Robert Haas
Date:
On Sat, Aug 30, 2014 at 2:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> If the sort buffer is allocated when the checkpointer is started, not
> everytime we sort, as you've done in your version of the patch I think
> that risk is pretty manageable. If we really want to be sure nothing is
> happening at runtime, even if the checkpointer was restarted, we can put
> the sort array in shared memory.

I don't think that allocating the array at checkpointer start time
helps.  If it works, then you're strictly worse off than if you
allocate it at every checkpoint, because you're holding onto the
memory all the time instead of only when it's being used.  And if it
fails, what then?  Sure, you can have that copy of the checkpointer
process exit, but that does nothing good.  The postmaster will keep on
restarting it and it will keep on dying for lack of memory, and no
checkpoints will complete.  Oops.

So it seems to me that the possibly-sensible approaches are:

1. Allocate an array when we need to sort, and if the allocation
fails, have some kind of fallback strategy, like logging a WARNING and
writing the buffers out without sorting them (a rough sketch of that
fallback follows this list).  If it succeeds, do the checkpoint and
then free the memory until we need it again.

2. Put the array in shared memory, so that once the server is
started, we can be sure the memory is allocated and the sort will
work.
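
A sketch of option 1's failure path, with hypothetical names (palloc()
would throw on failure, so a non-throwing allocation is assumed here):

    CkptSortItem *items;

    items = malloc((size_t) num_to_scan * sizeof(CkptSortItem));
    if (items != NULL)
    {
        /* collect the dirty buffers, qsort() them, write in that order */
        /* ... */
        free(items);
    }
    else
    {
        ereport(WARNING,
                (errmsg("could not allocate memory for checkpoint sorting, "
                        "writing buffers in buffer-pool order")));
        /* fall back to the existing unsorted write loop */
    }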

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: postgresql latency & bgwriter not doing its job

From
Andres Freund
Date:
On 2014-09-03 17:08:12 -0400, Robert Haas wrote:
> On Sat, Aug 30, 2014 at 2:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > If the sort buffer is allocated when the checkpointer is started, not
> > everytime we sort, as you've done in your version of the patch I think
> > that risk is pretty manageable. If we really want to be sure nothing is
> > happening at runtime, even if the checkpointer was restarted, we can put
> > the sort array in shared memory.
> 
> I don't think that allocating the array at checkpointer start time
> helps.  If it works, then you're strictly worse off than if you
> allocate it at every checkpoint, because you're holding onto the
> memory all the time instead of only when it's being used.  And if it
> fails, what then?  Sure, you can have that copy of the checkpointer
> process exit, but that does nothing good.  The postmaster will keep on
> restarting it and it will keep on dying for lack of memory, and no
> checkpoints will complete.  Oops.

It's imo quite clearly better to keep it allocated. For one, after the
postmaster has started the checkpointer successfully, you don't need to
worry about later failures to allocate memory if you allocate it once
(unless the checkpointer FATALs out, which should be exceedingly rare -
we're catching ERRORs). It's much, much more likely to succeed
initially. Secondly, it's not like there's really that much time during
which no checkpointer is running.

> So it seems to me that the possibly-sensible approaches are:
> 
> 1. Allocate an array when we need to sort, and if the allocation
> fails, have some kind of fallback strategy, like logging a WARNING and
> writing the buffers out without sorting them.  If it succeeds, do the
> checkpoint and then free the memory until we need it again.

I think if we want to go that way I vote for keeping the array allocated
and continuing to try to allocate it after allocation failures. And, as
you suggest, fall back to a simple sequential search through all
buffers.

> 2. Put the array in shared memory, so that once the server is
> started, we can be sure the memory is allocated and the sort will
> work.

But I prefer this approach. If we ever want to have more than one
process writing out data for checkpoints we're going to need it
anyway. And that's something not that far away for large setups
imo. Especially due to checksums.
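
For concreteness, reserving the array in the main shared memory segment
would look roughly like the sketch below; the struct and names are
placeholders (CkptSortItem as in the earlier sketch), piggy-backing on
the checkpointer's existing shmem setup:

    static CkptSortItem *CkptBufferIds;

    /* add one entry per shared buffer to the shmem size estimate */
    Size
    CheckpointerShmemSize(void)
    {
        Size        size = 0;   /* plus whatever is already reserved here */

        size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
        return size;
    }

    void
    CheckpointerShmemInit(void)
    {
        bool        found;

        CkptBufferIds = (CkptSortItem *)
            ShmemInitStruct("Checkpoint BufferIds",
                            mul_size(NBuffers, sizeof(CkptSortItem)),
                            &found);
        /* contents are rebuilt from scratch at each checkpoint */
    }

That way an allocation failure can only happen at server start, where it
is reported cleanly, and a second checkpoint-writing process could share
the same array later on.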

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: postgresql latency & bgwriter not doing its job

From
Ants Aasma
Date:
On Sat, Aug 30, 2014 at 8:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
>>> A long time ago, Itagaki Takahiro wrote a patch sort the buffers and write
>>> them out in order (http://www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp).
>>> The performance impact of that was inconclusive, but one thing that it
>>> allows nicely is to interleave the fsyncs, so that you write all the buffers
>>> for one file, then fsync it, then next file and so on.
>
>> ...
>> So, *very* clearly sorting is a benefit.
>
> pg_bench alone doesn't convince me on this.  The original thread found
> cases where it was a loss, IIRC; you will need to test many more than
> one scenario to prove the point.

The same objection came up last time I tried to push for sorted
checkpoints. I did not find any reference to where it caused a loss,
nor was I able to come up with a case where writing out in arbitrary
order would be better than writing out in file-sequential order. In
fact, if we ask for low latency, the OS must keep its writeback backlog
small, which eliminates any chance of combining writes that arrive out
of order.

I have a use case where the system continuously loads data into time
partitioned indexed tables, at every checkpoint all of the indexes of
the latest partition need to be written out. The only way I could get
the write out to happen with sequential I/O was to set
checkpoint_completion_target to zero and ensure OS cache allows for
enough dirty pages to absorb the whole checkpoint. The fsync that
followed did obviously nasty things to latency. Whereas sorted
checkpoints were able to do sequential I/O with checkpoint spreading
and low-latency-tuned OS virtual memory settings. I can create a
benchmark that shows this behavior if you need data points in addition
to pgbench's OLTP workload to convince you that sorting checkpoint
writes is a good idea.

I did just come up with a case where plain sorting might cause an
issue. If the writes go to different I/O devices then naive sorting
will first use one device then the other, whereas arbitrary writing
will load balance between the devices. Assuming that separate
tablespaces are used for separate I/O devices, it should be enough to
interleave the writes for each tablespace, weighted by the amount of
writes per tablespace.
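
A minimal standalone sketch of that interleaving idea (toy code, not
Postgres internals): track per-tablespace progress and always issue the
next write for the tablespace that is proportionally furthest behind.

#include <stdio.h>

typedef struct TsProgress
{
    int         num_to_write;   /* buffers scheduled for this tablespace */
    int         num_written;    /* buffers written so far */
} TsProgress;

/*
 * Pick the tablespace that is proportionally furthest behind, so each
 * device receives writes at a rate matching its share of the work.
 */
static int
next_tablespace(const TsProgress *ts, int nts)
{
    int         best = -1;
    double      best_frac = 2.0;    /* progress is always <= 1.0 */

    for (int i = 0; i < nts; i++)
    {
        double      frac;

        if (ts[i].num_written >= ts[i].num_to_write)
            continue;               /* this tablespace is finished */
        frac = (double) ts[i].num_written / ts[i].num_to_write;
        if (frac < best_frac)
        {
            best_frac = frac;
            best = i;
        }
    }
    return best;                    /* -1 once everything is written */
}

int
main(void)
{
    TsProgress  ts[2] = {{30, 0}, {10, 0}};
    int         i;

    /* the two devices end up interleaved roughly 3:1 */
    while ((i = next_tablespace(ts, 2)) >= 0)
    {
        ts[i].num_written++;
        printf("%d", i);
    }
    printf("\n");
    return 0;
}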

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: postgresql latency & bgwriter not doing its job

From
Ants Aasma
Date:
On Thu, Sep 4, 2014 at 12:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> It's imo quite clearly better to keep it allocated. For one, after the
> postmaster has started the checkpointer successfully, you don't need to
> worry about later failures to allocate memory if you allocate it once
> (unless the checkpointer FATALs out, which should be exceedingly rare -
> we're catching ERRORs). It's much, much more likely to succeed
> initially. Secondly, it's not like there's really that much time during
> which no checkpointer is running.

In principle you could do the sort with the full sized array and then
compress it to a list of buffer IDs that need to be written out. This
way most of the time you only need a small array and the large array
is only needed for a fraction of a second.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: postgresql latency & bgwriter not doing its job

From
Robert Haas
Date:
On Thu, Sep 4, 2014 at 3:09 AM, Ants Aasma <ants@cybertec.at> wrote:
> On Thu, Sep 4, 2014 at 12:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> It's imo quite clearly better to keep it allocated. For one, after the
>> postmaster has started the checkpointer successfully, you don't need to
>> worry about later failures to allocate memory if you allocate it once
>> (unless the checkpointer FATALs out, which should be exceedingly rare -
>> we're catching ERRORs). It's much, much more likely to succeed
>> initially. Secondly, it's not like there's really that much time during
>> which no checkpointer is running.
>
> In principle you could do the sort with the full sized array and then
> compress it to a list of buffer IDs that need to be written out. This
> way most of the time you only need a small array and the large array
> is only needed for a fraction of a second.

It's not the size of the array that's the problem; it's the size of
the detonation when the allocation fails.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: postgresql latency & bgwriter not doing its job

From
didier
Date:
hi

On Thu, Sep 4, 2014 at 7:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 4, 2014 at 3:09 AM, Ants Aasma <ants@cybertec.at> wrote:
>> On Thu, Sep 4, 2014 at 12:36 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> It's imo quite clearly better to keep it allocated. For one, after the
>>> postmaster has started the checkpointer successfully, you don't need to
>>> worry about later failures to allocate memory if you allocate it once
>>> (unless the checkpointer FATALs out, which should be exceedingly rare -
>>> we're catching ERRORs). It's much, much more likely to succeed
>>> initially. Secondly, it's not like there's really that much time during
>>> which no checkpointer is running.
>>
>> In principle you could do the sort with the full sized array and then
>> compress it to a list of buffer IDs that need to be written out. This
>> way most of the time you only need a small array and the large array
>> is only needed for a fraction of a second.
>
> It's not the size of the array that's the problem; it's the size of
> the detonation when the allocation fails.
>
You could use a file-backed memory array.
Or, because it's only a hint, and
- the keys are in the buffer headers (BufferTag), right?
- the only transition is from 'data I care about' to 'data I don't care
about' when a buffer is concurrently evicted while sorting,

you could use a pre-allocated buffer index array and read the keys from
the buffers while sorting, without memory barriers, spinlocks, or
anything.



Re: postgresql latency & bgwriter not doing its job

From
Robert Haas
Date:
On Fri, Sep 5, 2014 at 4:17 AM, didier <did447@gmail.com> wrote:
>> It's not the size of the array that's the problem; it's the size of
>> the detonation when the allocation fails.
>>
> You could use a file-backed memory array.
> Or, because it's only a hint, and
> - the keys are in the buffer headers (BufferTag), right?
> - the only transition is from 'data I care about' to 'data I don't care
> about' when a buffer is concurrently evicted while sorting,
>
> you could use a pre-allocated buffer index array and read the keys from
> the buffers while sorting, without memory barriers, spinlocks, or
> anything.

If by a file-backed memory array you mean mmap() something, that
doesn't do anything to address the possibility of failure.  mmap(),
like malloc(), can fail.

But I think we don't really need to have this argument.  Andres's
proposal to put this in the main shared memory segment sounds fine to
me.  If allocating that fails, you'll know to reduce shared_buffers
and try again.  If it succeeds, you should be safe after that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: postgresql latency & bgwriter not doing its job

From
Claudio Freire
Date:
On Fri, Sep 5, 2014 at 3:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Sep 5, 2014 at 4:17 AM, didier <did447@gmail.com> wrote:
>>> It's not the size of the array that's the problem; it's the size of
>>> the detonation when the allocation fails.
>>>
>> You could use a file-backed memory array.
>> Or, because it's only a hint, and
>> - the keys are in the buffer headers (BufferTag), right?
>> - the only transition is from 'data I care about' to 'data I don't care
>> about' when a buffer is concurrently evicted while sorting,
>>
>> you could use a pre-allocated buffer index array and read the keys from
>> the buffers while sorting, without memory barriers, spinlocks, or
>> anything.
>
> If by a file-backed memory array you mean mmap() something, that
> doesn't do anything to address the possibility of failure.  mmap(),
> like malloc(), can fail.

It's worse. If your process needs to page from disk when touching a
byte in the mmap'd array, and the disk somehow fails to work, you get
a segmentation fault.