Thread: checkpointer continuous flushing

checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello pg-devs,

This patch is a simplified and generalized version of Andres Freund's 
August 2014 patch for flushing while writing during checkpoints, with some 
documentation and configuration warnings added.

For the initial patch, see:
  http://www.postgresql.org/message-id/20140827091922.GD21544@awork2.anarazel.de

For the whole thread:
  http://www.postgresql.org/message-id/alpine.DEB.2.10.1408251900211.11151@sto

The objective is to help avoid PG stalling when fsyncing on checkpoints, 
and in general to get better latency-bound performance.

Flushes are issued along with pg's throttled writes instead of waiting for
the checkpointer's final "fsync", which induces occasional stalls. From
"pgbench -P 1 ...", such stalls look like this:
  progress: 35.0 s, 615.9 tps, lat 1.344 ms stddev 4.043    # ok
  progress: 36.0 s, 3.0 tps, lat 346.111 ms stddev 123.828  # stalled
  progress: 37.0 s, 4.0 tps, lat 252.462 ms stddev 29.346   # ...
  progress: 38.0 s, 161.0 tps, lat 6.968 ms stddev 32.964   # restart
  progress: 39.0 s, 701.0 tps, lat 1.421 ms stddev 3.326    # ok
 

I've seen similar behavior on FreeBSD with its native FS, so it is not a
Linux-specific or ext4-specific issue, even if both factors may contribute.

There are two implementations: the first one, based on "sync_file_range", is
Linux-specific, while the other relies on "posix_fadvise". The tests below ran
on Linux. If someone could test the posix_fadvise version on relevant
platforms, that would be great...
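
For illustration, a minimal sketch (not the patch's actual code; the helper
name and the configure-style guards are made up) of how a just-written file
range can be hinted to the OS on both kinds of platforms:

  #define _GNU_SOURCE             /* for sync_file_range() on Linux */
  #include <sys/types.h>
  #include <fcntl.h>

  /*
   * Ask the OS to start writing a given file range to disk now.
   * Both calls only *initiate* writeback; neither waits for completion.
   */
  static void
  flush_hint(int fd, off_t offset, off_t nbytes)
  {
  #if defined(USE_SYNC_FILE_RANGE)
      /* Linux: start writeback of the range, do not wait for it */
      (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
  #elif defined(USE_POSIX_FADVISE)
      /* elsewhere: advise that the range is not needed soon, which
       * typically encourages the kernel to write it out */
      (void) posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
  #endif
  }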

The Linux-specific "sync_file_range" approach was suggested, among other ideas,
by Theodore Ts'o on Robert Haas's blog in March 2014:
  http://rhaas.blogspot.fr/2014/03/linuxs-fsync-woes-are-getting-some.html

Two guc variables control whether the feature is activated for writes of
dirty pages issued by the checkpointer and bgwriter. Given that the settings
may improve or degrade performance, having GUCs seems justified.  In
particular the stalling issue disappears with SSDs.

The effect is significant on a series of tests shown below with scale 10 
pgbench on an (old) dedicated host (8 GB memory, 8 cores, ext4 over hw 
RAID), with shared_buffers=1GB checkpoint_completion_target=0.8
checkpoint_timeout=30s, unless stated otherwise.

Note: I know that this checkpoint_timeout is too small for a normal
config, but the point is to test how checkpoints behave, so the test
triggers as many checkpoints as possible, hence the minimum timeout
setting. I have also done some tests with a larger timeout.
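
For reference, a postgresql.conf sketch of the settings involved (the two
*_flush_to_disk parameters are the GUCs added by the patch; the values shown
are just this test setup, not recommendations):

  # test setup used for the runs below
  shared_buffers = 1GB
  checkpoint_completion_target = 0.8
  checkpoint_timeout = 30s      # deliberately small, to trigger many checkpoints

  # GUCs added by the patch
  checkpoint_flush_to_disk = on
  bgwriter_flush_to_disk = on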


(1) THROTTLED PGBENCH

The objective of the patch is to be able to reduce the latency of transactions
under a moderate load. This first series of tests focuses on this point with
the help of pgbench -R (rate) and -L (skip/count late transactions).
The measure counts transactions which were skipped or beyond the expected
latency limit while targeting a transaction rate.

* "pgbench -M prepared -N -T 100 -P 1 -R 100 -L 100" (100 tps targeted during  100 seconds, and latency limit is 100
ms),over 256 runs, 7 hours per case:
 
  flush     | percent of skipped  cp  | bgw | & out of latency limit transactions  off | off | 6.5 %  off |  on | 6.1 %
 on | off | 0.4 %   on |  on | 0.4 %
 

* Same as above (100 tps target) over one run of 4000 seconds with
  shared_buffers=256MB and checkpoint_timeout=10min:

   flush     | percent of skipped
   cp  | bgw | & out of latency limit transactions
   off | off | 1.3 %
   off |  on | 1.5 %
    on | off | 0.6 %
    on |  on | 0.6 %
 

* Same as first one but with "-R 150", i.e. targeting 150 tps, 256 runs:

   flush     | percent of skipped
   cp  | bgw | & out of latency limit transactions
   off | off | 8.0 %
   off |  on | 8.0 %
    on | off | 0.4 %
    on |  on | 0.4 %
 

* Same as above (150 tps target) over one run of 4000 seconds with
  shared_buffers=256MB and checkpoint_timeout=10min:

   flush     | percent of skipped
   cp  | bgw | & out of latency limit transactions
   off | off | 1.7 %
   off |  on | 1.9 %
    on | off | 0.7 %
    on |  on | 0.6 %
 

Turning "checkpoint_flush_to_disk = on" reduces significantly the number
of late transactions. These late transactions are not uniformly distributed,
but are rather clustered around times when pg is stalled, i.e. more or less
unresponsive.

bgwriter_flush_to_disk does not seem to have a significant impact on these 
tests, maybe because pg shared_buffers size is much larger than the 
database, so the bgwriter is seldom active.


(2) FULL SPEED PGBENCH

This is not the target use case, but it seems necessary to assess the 
impact of these options on tps figures and their variability.

* "pgbench -M prepared -N -T 100 -P 1" over 512 runs, 14 hours per case.
      flush   | performance on ...
   cp  | bgw  | 512 100-seconds runs | 1s intervals (over 51200 seconds)
   off | off  | 691 +- 36 tps        | 691 +- 236 tps
   off |  on  | 677 +- 29 tps        | 677 +- 230 tps
    on | off  | 655 +- 23 tps        | 655 +- 130 tps
    on |  on  | 657 +- 22 tps        | 657 +- 130 tps
 

On this first test, setting checkpoint_flush_to_disk reduces the performance by
5%, but the per-second standard deviation is nearly halved; that is, the
performance is more stable over the runs, although lower.
The effect of bgwriter_flush_to_disk is inconclusive.

* "pgbench -M prepared -N -T 4000 -P 1" on only 1 (long) run, with  checkpoint_timeout=10mn and shared_buffers=256MB
(atleast 6 checkpoints  during the run, probably more because segments are filled more often than  every 10mn):
 
       flush   | performance ... (stddev over per second tps)     off | off | 877 +- 179 tps     off |  on | 880 +- 183
tps     on | off | 896 +- 131 tps      on |  on | 888 +- 132 tps
 

On this second test, setting checkpoint_flush_to_disk seems to
slightly improve performance (maybe 2%?) and significantly reduces
variability, so it looks like a good move.

* "pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1" over 32 runs (4 clients)
      flush   | performance on ...
   cp  | bgw  | 32 100-seconds runs | 1s intervals (over 3200 seconds)
   off | off  | 1970 +- 60 tps      | 1970 +- 783 tps
   off |  on  | 1928 +- 61 tps      | 1928 +- 813 tps
    on | off  | 1578 +- 45 tps      | 1578 +- 631 tps
    on |  on  | 1594 +- 47 tps      | 1594 +- 618 tps
 

On this test, both the average and the standard deviation are reduced by 20%.
This does not look like a win.


CONCLUSION

This approach is simple and significantly improves pg fsync behavior under
moderate load, where the database stays mostly responsive.  Under full load,
the situation may be improved or degraded, depending on the workload.


OTHER OPTIONS

Another idea suggested by Theodore Ts'o seems impractical: playing with the
Linux io-scheduler priority (ioprio_set) is only relevant with the
"cfq" scheduler on actual hard disks, but does not work with other
schedulers, especially "deadline", which seems more advisable for Pg, nor
with hardware RAID, which is a common setting.

Also, Theodore Ts'o suggested to use "sync_file_range" to check whether 
the writes have reached the disk, and possibly to delay the actual 
fsync/checkpoint conclusion if not... I have not tried that, the 
implementation is not as trivial, and I'm not sure what to do when the 
completion target is coming, but possibly that could be an interesting 
option to investigate. Preliminary tests by adding a sleep between the 
writes and the final fsync did not yield very good results.

I've also played with numerous other options (changing checkpointer 
throttling parameters, reducing checkpoint timeout to 1 second, playing 
around with various kernel settings), but that did not seem to be very 
effective for the problem at hand.


I have also attached the test script I used, which can be adapted if someone
wants to collect some performance data. I also have some basic scripts to
extract and compute stats; ask if needed.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Amit Langote
Date:
Hi Fabien,

On 2015-06-01 PM 08:40, Fabien COELHO wrote:
> 
> Turning "checkpoint_flush_to_disk = on" reduces significantly the number
> of late transactions. These late transactions are not uniformly distributed,
> but are rather clustered around times when pg is stalled, i.e. more or less
> unresponsive.
> 
> bgwriter_flush_to_disk does not seem to have a significant impact on these
> tests, maybe because pg shared_buffers size is much larger than the database,
> so the bgwriter is seldom active.
> 

Not that the GUC naming is the most pressing issue here, but do you think
"*_flush_on_write" describes what the patch does?

Thanks,
Amit




Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Amit,

> Not that the GUC naming is the most pressing issue here, but do you think
> "*_flush_on_write" describes what the patch does?

It is currently "*_flush_to_disk". In Andres Freund version the name is 
"sync_on_checkpoint_flush", but I did not found it very clear. Using 
"*_flush_on_write" instead as your suggest, would be fine as well, it 
emphasizes the "when/how" it occurs instead of the final "destination", 
why not...

About words: a checkpoint "write"s pages, but this really means passing the
pages to the memory manager, which will think about it... "flush" seems to
suggest a more effective write, but really it may mean the same: the page
is just passed to the OS. So "write/flush" is really "to OS" and not "to
disk". I like the data to be on "disk" in the end, and as soon as
possible, hence the choice to emphasize that point.

Now I would really be okay with anything that people find simple to 
understand, so any opinion is welcome!

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
Hi,

It's nice to see the topic being picked up.

If I see correctly you picked up the version without sorting during
checkpoints. I think that's not going to work - there'll be too many
situations where the new behaviour will be detrimental.  Did you
consider combining both approaches?

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Mon, Jun 1, 2015 at 5:10 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Hello pg-devs,

This patch is a simplified and generalized version of Andres Freund's August 2014 patch for flushing while writing during checkpoints, with some documentation and configuration warnings added.

For the initial patch, see:

  http://www.postgresql.org/message-id/20140827091922.GD21544@awork2.anarazel.de

For the whole thread:

  http://www.postgresql.org/message-id/alpine.DEB.2.10.1408251900211.11151@sto

The objective is to help avoid PG stalling when fsyncing on checkpoints, and in general to get better latency-bound performance.


-FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
+FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln, bool flush_to_disk)
 {
  XLogRecPtr recptr;
  ErrorContextCallback errcallback;
@@ -2410,7 +2417,8 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
   buf->tag.forkNum,
   buf->tag.blockNum,
   bufToWrite,
-  false);
+  false,
+  flush_to_disk);

Won't this lead to more-unsorted writes (random I/O) as the
FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
per files or order of blocks on disk?

I remember sometime back there was some discusion regarding
sorting writes during checkpoint, one idea could be try to
check this idea along with that patch.  I just saw that Andres has
also given same suggestion which indicates that it is important
to see both the things together.

Also here another related point is that I think currently even fsync
requests are not in order of the files as they are stored on disk so
that also might cause random I/O?

Yet another idea could be to allow the BGWriter to also fsync the dirty
buffers; that may have the side effect of not being able to clear the dirty
pages at the speed required by the system, but I think if that happens one
can think of having multiple BGWriter tasks.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

> If I see correctly you picked up the version without sorting during
> checkpoints. I think that's not going to work - there'll be too many
> situations where the new behaviour will be detrimental.  Did you
> consider combining both approaches?

Yes, I thought that it was a more complex patch with uncertain/less clear
benefits, and as this simpler version was already effective enough as it
was, I decided to start with that and try to have reasonable proof of
benefits so that it could get through.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Amit,

> [...]
>> The objective is to help avoid PG stalling when fsyncing on checkpoints,
>> and in general to get better latency-bound performance.
>
> Won't this lead to more-unsorted writes (random I/O) as the
> FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
> per files or order of blocks on disk?

Yep, probably. Under "moderate load" this is not an issue. The 
io-scheduler and other hd firmware will probably reorder writes anyway. 
Also, if several data are updated together, probably they are likely to be 
already neighbours in memory as well as on disk.

> I remember sometime back there was some discusion regarding
> sorting writes during checkpoint, one idea could be try to
> check this idea along with that patch.  I just saw that Andres has
> also given same suggestion which indicates that it is important
> to see both the things together.

I would rather separate them, unless this is a blocker. This version seems 
already quite effective and very light. ISTM that adding a sort phase 
would mean reworking significantly how the checkpointer processes pages.

> Also here another related point is that I think currently even fsync
> requests are not in order of the files as they are stored on disk so
> that also might cause random I/O?

I think that currently the fsync is on the file handler, so what happens 
depends on how fsync is implemented by the system.

> Yet another idea could be to allow BGWriter to also fsync the dirty
> buffers,

ISTM That it is done with this patch with "bgwriter_flush_to_disk=on".

> that may have side impact of not able to clear the dirty pages at speed 
> required by system, but I think if that happens one can think of having 
> multiple BGwriter tasks.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-06-02 15:15:39 +0200, Fabien COELHO wrote:
> >Won't this lead to more-unsorted writes (random I/O) as the
> >FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
> >per files or order of blocks on disk?
> 
> Yep, probably. Under "moderate load" this is not an issue. The io-scheduler
> and other hd firmware will probably reorder writes anyway.

They pretty much can't if you flush things frequently. That's why I
think this won't be acceptable without the sorting in the checkpointer.

> Also, if several
> data are updated together, probably they are likely to be already neighbours
> in memory as well as on disk.

No, that's not how it'll happen outside of simplistic cases where you
start with an empty shared_buffers. Shared buffers are maintained by a
simplified LRU, so how often individual blocks are touched will define
the buffer replacement.

> >I remember sometime back there was some discusion regarding
> >sorting writes during checkpoint, one idea could be try to
> >check this idea along with that patch.  I just saw that Andres has
> >also given same suggestion which indicates that it is important
> >to see both the things together.
> 
> I would rather separate them, unless this is a blocker.

I think it is a blocker.

> This version seems
> already quite effective and very light. ISTM that adding a sort phase would
> mean reworking significantly how the checkpointer processes pages.

Meh. The patch for that wasn't that big.

The problem with doing this separately is that without the sorting this
will be slower for throughput in a good number of cases. So we'll have
yet another GUC that's very hard to tune.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

>> I would rather separate them, unless this is a blocker.
>
> I think it is a blocker.

Hmmm. This is an argument...

>> This version seems already quite effective and very light. ISTM that 
>> adding a sort phase would mean reworking significantly how the 
>> checkpointer processes pages.
>
> Meh. The patch for that wasn't that big.

Hmmm. I think it should be implemented as Tom suggested, that is per 
chunks of shared buffers, in order to avoid allocating a "large" memory.

> The problem with doing this separately is that without the sorting this
> will be slower for throughput in a good number of cases. So we'll have
> yet another GUC that's very hard to tune.

ISTM that the two aspects are orthogonal, which would suggest two gucs
anyway.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-06-02 15:42:14 +0200, Fabien COELHO wrote:
> >>This version seems already quite effective and very light. ISTM that
> >>adding a sort phase would mean reworking significantly how the
> >>checkpointer processes pages.
> >
> >Meh. The patch for that wasn't that big.
> 
> Hmmm. I think it should be implemented as Tom suggested, that is per chunks
> of shared buffers, in order to avoid allocating a "large" memory.

I don't necessarily agree. But that's really just a minor implementation
detail. The actual problem is sorting & fsyncing in a way that deals
efficiently with tablespaces, i.e. doesn't write to tablespaces
one-by-one.  Not impossible, but it requires some thought.

> >The problem with doing this separately is that without the sorting this
> >will be slower for throughput in a good number of cases. So we'll have
> >yet another GUC that's very hard to tune.
> 
> ISTM that the two aspects are orthogonal, which would suggests two gucs
> anyway.

They're pretty closely linked from their performance impact. IMO this
feature, if done correctly, should result in better performance in 95+%
of the workloads and be enabled by default. And that'll not be possible
without actually writing mostly sequentially.

It's also not just the sequential writes making this important, it's
also that it allows doing the final fsync() of the individual segments
as soon as their last buffer has been written out. That's important
because it means the file will accumulate fewer independent writes
(i.e. backends writing out dirty buffers), which would otherwise make
the final fsync more expensive.

It might be that we want two different gucs, but I don't think we can
release without both features.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>> Hmmm. I think it should be implemented as Tom suggested, that is per chunks
>> of shared buffers, in order to avoid allocating a "large" memory.
>
> I don't necessarily agree. But that's really just a minor implementation
> detail.

Probably.

> The actual problem is sorting & fsyncing in a way that deals efficiently 
> with tablespaces, i.e. doesn't write to tablespaces one-by-one.
> Not impossible, but it requires some thought.

Hmmm... I would have neglected this point in a first approximation,
but I agree that not interleaving tablespaces could indeed lose some
performance.

>> ISTM that the two aspects are orthogonal, which would suggests two gucs
>> anyway.
>
> They're pretty closely linked from their performance impact.

Sure.

> IMO this feature, if done correctly, should result in better performance 
> in 95+% of the workloads

To demonstrate that would require time...

> and be enabled by default.

I did not have such an ambition with the submitted patch :-)

> And that'll not be possible without actually writing mostly 
> sequentially.

> It's also not just the sequential writes making this important, it's 
> also that it allows to do the final fsync() of the individual segments 
> as soon as their last buffer has been written out.

Hmmm... I'm not sure this would have a large impact. The writes are 
throttled as much as possible, so fsync will catch plenty of other writes
anyway, if there are some.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-06-02 17:01:50 +0200, Fabien COELHO wrote:
> >The actual problem is sorting & fsyncing in a way that deals efficiently
> >with tablespaces, i.e. doesn't write to tablespaces one-by-one.
> >Not impossible, but it requires some thought.
> 
> Hmmm... I would have neglected this point in a first approximation,
> but I agree that not interleaving tablespaces could indeed loose some
> performance.

I think it'll be a hard-to-diagnose performance regression. So we'll
have to fix it. That argument actually was the blocker in previous
attempts...

> >IMO this feature, if done correctly, should result in better performance
> >in 95+% of the workloads
> 
> To demonstrate that would require time...

Well, that's part of the contribution process. Obviously you can't test
100% of the problems, but you can work hard at coming up with very
adversarial scenarios and evaluate performance for those.

> >and be enabled by default.
> 
> I did not had such an ambition with the submitted patch:-)

I don't think we want yet another tuning knob that's hard to tune
because it's critical for one factor (latency) but bad for another
(throughput); especially when it's completely unnecessary.

> >And that'll not be possible without actually writing mostly sequentially.
> 
> >It's also not just the sequential writes making this important, it's also
> >that it allows to do the final fsync() of the individual segments as soon
> >as their last buffer has been written out.
> 
> Hmmm... I'm not sure this would have a large impact. The writes are
> throttled as much as possible, so fsync will catch plenty other writes
> anyway, if there are some.

That might be the case in a database with a single small table;
i.e. where all the writes go to a single file. But as soon as you have
large tables (i.e. many segments) or multiple tables, a significant part
of the writes issued independently from checkpointing will be outside
the processing of the individual segment.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>>> IMO this feature, if done correctly, should result in better performance
>>> in 95+% of the workloads
>>
>> To demonstrate that would require time...
>
> Well, that's part of the contribution process. Obviously you can't test
> 100% of the problems, but you can work hard with coming up with very
> adversarial scenarios and evaluate performance for those.

I did spend time (well, a machine spent time, really) to collect some
convincing data for the simple version without sorting, to demonstrate that
it brings clear value, which seems not to be enough...

> I don't think we want yet another tuning knob that's hard to tune
> because it's critical for one factor (latency) but bad for another
> (throughput); especially when completely unnecessarily.

Hmmm.

My opinion is that throughput is given too much attention in general, but 
if both can be kept/improved, this would be easier to sell, obviously.


>>> It's also not just the sequential writes making this important, it's also
>>> that it allows to do the final fsync() of the individual segments as soon
>>> as their last buffer has been written out.
>>
>> Hmmm... I'm not sure this would have a large impact. The writes are
>> throttled as much as possible, so fsync will catch plenty other writes
>> anyway, if there are some.
>
> That might be the case in a database with a single small table;
> i.e. where all the writes go to a single file. But as soon as you have
> large tables (i.e. many segments) or multiple tables, a significant part
> of the writes issued independently from checkpointing will be outside
> the processing of the individual segment.

Statistically, I think that it would reduce the number of unrelated writes 
taken in a fsync by about half: the last table to be written on a 
tablespace, at the end of the checkpoint, will have accumulated 
checkpoint-unrelated writes (bgwriter, whatever) from the whole checkpoint 
time, while the first table will have avoided most of them.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-06-02 18:59:05 +0200, Fabien COELHO wrote:
> 
> >>>IMO this feature, if done correctly, should result in better performance
> >>>in 95+% of the workloads
> >>
> >>To demonstrate that would require time...
> >
> >Well, that's part of the contribution process. Obviously you can't test
> >100% of the problems, but you can work hard with coming up with very
> >adversarial scenarios and evaluate performance for those.
> 
> I did spent time (well, a machine spent time, really) to collect some
> convincing data for the simple version without sorting to demonstrate that
> it brings a clear value, which seems not to be enough...

"which seems not to be enough" - man. It's trivial to make things
faster/better/whatever if you don't care about regressions in other
parts. And if we'd add a guc for each of these cases we'd end up with
thousands of them.

> My opinion is that throughput is given too much attention in general, but if
> both can be kept/improved, this would be easier to sell, obviously.

Your priorities are not everyone's. That's life.


> >That might be the case in a database with a single small table;
> >i.e. where all the writes go to a single file. But as soon as you have
> >large tables (i.e. many segments) or multiple tables, a significant part
> >of the writes issued independently from checkpointing will be outside
> >the processing of the individual segment.
> 
> Statistically, I think that it would reduce the number of unrelated writes
> taken in a fsync by about half: the last table to be written on a
> tablespace, at the end of the checkpoint, will have accumulated
> checkpoint-unrelated writes (bgwriter, whatever) from the whole checkpoint
> time, while the first table will have avoided most of them.

That's disregarding that a buffer written out by a backend starts to get
written out by the kernel after ~5-30s, even without a fsync triggering
it.



Re: checkpointer continuous flushing

From
Amit Langote
Date:
Hi,

On 2015-06-02 PM 07:19, Fabien COELHO wrote:
> 
>> Not that the GUC naming is the most pressing issue here, but do you think
>> "*_flush_on_write" describes what the patch does?
> 
> It is currently "*_flush_to_disk". In Andres Freund version the name is
> "sync_on_checkpoint_flush", but I did not found it very clear. Using
> "*_flush_on_write" instead as your suggest, would be fine as well, it
> emphasizes the "when/how" it occurs instead of the final "destination", why
> not...
> 
> About words: checkpoint "write"s pages, but this really mean passing the pages
> to the memory manager, which will think about it... "flush" seems to suggest a
> more effective write, but really it may mean the same, the page is just passed
> to the OS. So "write/flush" is really "to OS" and not "to disk". I like the
> data to be on "disk" in the end, and as soon as possible, hence the choice to
> emphasize that point.
> 
> Now I would really be okay with anything that people find simple to
> understand, so any opinion is welcome!
> 

It seems 'sync' gets closer to what I really wanted 'flush' to mean. If I
understand this and the previous discussion(s) correctly, the patch tries to
alleviate the problems caused by one-big-sync-at-the-end-of-writes by doing
the sync in step with writes (which do abide by the
checkpoint_completion_target). Given that impression, it seems *_sync_on_write
may even do the job.


Again, this is a minor issue.

By the way, I tend to agree with others here that a good balance needs to be
found such that this sync-blocks-one-at-a-time-in-random-order approach
does not hurt generalized workloads too much, although it seems to help with
solving the latency problem that you set out to solve.

Thanks,
Amit




Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Tue, Jun 2, 2015 at 6:45 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> Hello Amit,
>
>> [...]
>>>
>>> The objective is to help avoid PG stalling when fsyncing on checkpoints,
>>> and in general to get better latency-bound performance.
>>
>>
>> Won't this lead to more-unsorted writes (random I/O) as the
>> FlushBuffer requests (by checkpointer or bgwriter) are not sorted as
>> per files or order of blocks on disk?
>
>
> Yep, probably. Under "moderate load" this is not an issue. The io-scheduler and other hd firmware will probably reorder writes anyway. Also, if several data are updated together, probably they are likely to be already neighbours in memory as well as on disk.
>
>> I remember sometime back there was some discusion regarding
>> sorting writes during checkpoint, one idea could be try to
>> check this idea along with that patch.  I just saw that Andres has
>> also given same suggestion which indicates that it is important
>> to see both the things together.
>
>
> I would rather separate them, unless this is a blocker. This version seems already quite effective and very light. ISTM that adding a sort phase would mean reworking significantly how the checkpointer processes pages.
>

I agree with you that if we have to add a sort phase, there is additional
work, and that work could be significant depending on the design we
choose. However, without that, this patch can have an impact on many kinds
of workloads; even in your mail, one of the tests
("pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1" over 32 runs (4 clients))
has shown a 20% degradation, which is quite significant, and the test also
seems to be representative of the workload which many users will run in the
real world.

Now one can say that for such workloads one should turn the new knob off, but
in reality it could be difficult to predict if the load is always moderate.
I think users might be able to predict that at the table level, but in spite of
that I don't think having any such knob gives us a ticket to flush the buffers
in random order.

>> Also here another related point is that I think currently even fsync
>> requests are not in order of the files as they are stored on disk so
>> that also might cause random I/O?
>
>
> I think that currently the fsync is on the file handler, so what happens depends on how fsync is implemented by the system.
>

That can also lead to random I/O if the fsync for different files is not in
order as they are actually stored on disk.


>> Yet another idea could be to allow BGWriter to also fsync the dirty
>> buffers,
>
>
> ISTM That it is done with this patch with "bgwriter_flush_to_disk=on".
>

I think the patch just issues an async operation, not the actual flush.  The
reason I have suggested this is that in your tests, when the checkpoint_timeout
is small, there seems to be a good gain in performance; that means that if we
keep on flushing dirty buffers at regular intervals, the system's performance
is good, and BGWriter is the process where that can be done conveniently
apart from checkpoint.  One might think that if the same can be achieved by
using a shorter checkpoint_timeout interval, then why do these incremental
flushes in the bgwriter; but in reality I think checkpoint is responsible for
other things as well besides dirty buffers, so we can't leave everything until
a checkpoint happens.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>>> That might be the case in a database with a single small table; i.e. 
>>> where all the writes go to a single file. But as soon as you have 
>>> large tables (i.e. many segments) or multiple tables, a significant 
>>> part of the writes issued independently from checkpointing will be 
>>> outside the processing of the individual segment.
>>
>> Statistically, I think that it would reduce the number of unrelated writes
>> taken in a fsync by about half: the last table to be written on a
>> tablespace, at the end of the checkpoint, will have accumulated
>> checkpoint-unrelated writes (bgwriter, whatever) from the whole checkpoint
>> time, while the first table will have avoided most of them.
>
> That's disregarding that a buffer written out by a backend starts to get
> written out by the kernel after ~5-30s, even without a fsync triggering
> it.

I meant my argument with "continuous flushing" activated, so there is no
up-to-30-seconds delay induced by the memory manager. Hmmm, maybe I did not
understand your argument.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Amit,

>> It is currently "*_flush_to_disk". In Andres Freund version the name is
>> "sync_on_checkpoint_flush", but I did not found it very clear. Using
>> "*_flush_on_write" instead as your suggest, would be fine as well, it
>> emphasizes the "when/how" it occurs instead of the final "destination", why
>> not...
> [...]
>
> It seems 'sync' gets closer to what I really wanted 'flush' to mean. If 
> I understand this and the previous discussion(s) correctly, the patch 
> tries to alleviate the problems caused by one-big-sync-at-the 
> end-of-writes by doing the sync in step with writes (which do abide by 
> the checkpoint_completion_target). Given that impression, it seems 
> *_sync_on_write may even do the job.

I disagree with this one, because the sync is only *initiated*, not done.
For this reason I think that "flush" seems a better word. I understand
"sync" as "committed to disk". For the data to be synced, the call should
use the "wait after" option, which is a partial "fsync", but that would
be terrible for performance as all checkpointed pages would be written one
by one, without any opportunity for reordering them.

For what it's worth and for the record, Linux sync_file_range 
documentation says "This is an asynchronous flush-to-disk operation" to 
describe the corresponding option. This is probably where I took it.
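
To illustrate that distinction with the Linux call (a hypothetical snippet,
not the patch's code), the asynchronous "flush" only asks for writeback to
start, while a real "sync" of the range would also have to wait for it:

  #define _GNU_SOURCE
  #include <sys/types.h>
  #include <fcntl.h>

  /* asynchronous: only initiate writeback of the range ("flush") */
  static void
  range_flush(int fd, off_t offset, off_t nbytes)
  {
      (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
  }

  /* synchronous: also wait for completion, i.e. a partial fsync of the
   * range -- closer to what "sync" usually suggests, but doing this page
   * by page would be terrible for checkpoint performance */
  static void
  range_sync(int fd, off_t offset, off_t nbytes)
  {
      (void) sync_file_range(fd, offset, nbytes,
                             SYNC_FILE_RANGE_WAIT_BEFORE |
                             SYNC_FILE_RANGE_WRITE |
                             SYNC_FILE_RANGE_WAIT_AFTER);
  }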

So two contenders:
  *_flush_to_disk
  *_flush_on_write

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
> I agree with you that if we have to add a sort phase, there is additional
> work and that work could be significant depending on the design we
> choose, however without that, this patch can have impact on many kind
> of workloads, even in your mail in one of the tests
> ("pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1" over 32 runs (4 clients))
> it has shown 20% degradation which is quite significant and test also
> seems to be representative of the workload which many users in real-world
> will use.

Yes, I do agree about the 4-client case, but I doubt that many users run their
application at maximum available throughput all the time (like always
driving with the foot to the floor). So for me throttled runs are more
representative of real life.

> Now one can say that for such workloads turn the new knob to off, but
> in reality it could be difficult to predict if the load is always moderate.

Hmmm. The switch says "I prefer stable (say latency-bounded) performance";
if you run a web site you probably want that.

Anyway, I'll look at sorting when I have some time.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Amit Langote
Date:
Fabien,

On 2015-06-03 PM 02:53, Fabien COELHO wrote:
> 
>>
>> It seems 'sync' gets closer to what I really wanted 'flush' to mean. If I
>> understand this and the previous discussion(s) correctly, the patch tries to
>> alleviate the problems caused by one-big-sync-at-the end-of-writes by doing
>> the sync in step with writes (which do abide by the
>> checkpoint_completion_target). Given that impression, it seems
>> *_sync_on_write may even do the job.
> 
> I desagree with this one, because the sync is only *initiated*, not done. For
> this reason I think that "flush" seems a better word. I understand "sync" as
> "committed to disk". For the data to be synced, it should call with the "wait
> after" option, which is a partial "fsync", but that would be terrible for
> performance as all checkpointed pages would be written one by one, without any
> opportunity for reordering them.
> 
> For what it's worth and for the record, Linux sync_file_range documentation
> says "This is an asynchronous flush-to-disk operation" to describe the
> corresponding option. This is probably where I took it.
> 

Ah, okay! I didn't quite think about the async aspect here. But I sure do
hope that the added mechanism turns out to be *less* async than the kernel's
own dirty cache handling, to achieve the hoped-for gain.

> So two contenders:
> 
>   *_flush_to_disk
>   *_flush_on_write
> 

Yep!

Regards,
Amit




Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

> They pretty much can't if you flush things frequently. That's why I
> think this won't be acceptable without the sorting in the checkpointer.


* VERSION 2 "WORK IN PROGRESS".

The implementation is more a proof-of-concept for having feedback than
clean code. What it does:
 - as version 1: simplified asynchronous flush based on Andres Freund's
   patch, with sync_file_range/posix_fadvise used to hint the OS that
   the buffer must be sent to disk "now".

 - added: checkpoint buffer sorting based on a 2007 patch by Takahiro
   Itagaki, but with a smaller and static buffer allocated once. Also,
   sorting is done by chunks in the current version (see the sketch
   after this list).

 - also added: sync/advise calls are now merged if possible, so fewer
   calls are used, especially when buffers are sorted, but also if there
   are few files.
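
As an illustration of the sorting step (the struct and field names here are
hypothetical, not the ones used in the patch), the dirty buffers selected for
the checkpoint can be ordered by their on-disk location, so that writes to
the same file become sequential:

  #include <stdlib.h>

  /* hypothetical record kept for each dirty buffer to be checkpointed */
  typedef struct CkptSortItem
  {
      unsigned int tablespace;   /* tablespace oid */
      unsigned int relation;     /* relation file node */
      int          fork;         /* fork number */
      unsigned int block;        /* block number within the fork */
      int          buf_id;       /* index of the buffer in shared buffers */
  } CkptSortItem;

  /* qsort comparator: order by tablespace, relation, fork, then block,
   * so that all writes to a given file end up contiguous and in order */
  static int
  ckpt_item_cmp(const void *pa, const void *pb)
  {
      const CkptSortItem *a = pa, *b = pb;

      if (a->tablespace != b->tablespace)
          return a->tablespace < b->tablespace ? -1 : 1;
      if (a->relation != b->relation)
          return a->relation < b->relation ? -1 : 1;
      if (a->fork != b->fork)
          return a->fork < b->fork ? -1 : 1;
      if (a->block != b->block)
          return a->block < b->block ? -1 : 1;
      return 0;
  }

  /* usage: qsort(items, n, sizeof(CkptSortItem), ckpt_item_cmp); */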
 


* PERFORMANCE TESTS

Impacts on "pgbench -M prepared -N -P 1" scale 10  (simple update pgbench
with a mostly-write activity),  with checkpoint_completion_target=0.8
and shared_buffers=1GB.

Contrary to v1, I have not tested bgwriter flushing, as its impact
in the first round was close to nought. This does not mean that particular
loads may not benefit from, or be harmed by, flushing from the bgwriter.

- 100 tps throttled, max 100 ms latency, over 6400 seconds
  with checkpoint_timeout=30s

  flush | sort | late transactions
    off |  off | 6.0 %
    off |   on | 6.1 %
     on |  off | 0.4 %
     on |   on | 0.4 % (93% improvement)
 

- 100 tps throttled, max 100 ms latency, over 4000 seconds
  with checkpoint_timeout=10min

  flush | sort | late transactions
    off |  off | 1.5 %
    off |   on | 0.6 % (?!)
     on |  off | 0.8 %
     on |   on | 0.6 % (60% improvement)
 

- 150 tps throttled, max 100 ms latency, over 19600 seconds (5.5 hours)
  with checkpoint_timeout=30s

  flush | sort | late transactions
    off |  off | 8.5 %
    off |   on | 8.1 %
     on |  off | 0.5 %
     on |   on | 0.4 % (95% improvement)
 

- full speed pgbench over 6400 seconds with checkpoint_timeout=30s

  flush | sort | tps performance over per second data
    off |  off | 676 +- 230
    off |   on | 683 +- 213
     on |  off | 712 +- 130
     on |   on | 725 +- 116 (7.2% avg/50% stddev improvements)
 

- full speed pgbench over 4000 seconds with checkpoint_timeout=10min

  flush | sort | tps performance over per second data
    off |  off | 885 +- 188
    off |   on | 940 +- 120 (6%/36%!)
     on |  off | 778 +- 245 (hmmm... not very consistent?)
     on |   on | 927 +- 108 (4.5% avg/43% stddev improvements)
 

- full speed bgbench "-j2 -c4" over 6400 seconds with checkpoint_timeout=30s
  flush | sort | tps performance over per second data    off |  off | 2012 +- 747    off |   on | 2086 +- 708     on |
off| 2099 +- 459     on |   on | 2114 +- 422 (5% avg/44% stddev improvements)
 


* CONCLUSION :

For all these HDD tests, when both options are activated the tps performance
is improved, the latency is reduced and the performance is more stable
(smaller standard deviation).

Overall the option effects, not surprisingly, are quite (with exceptions) 
orthogonal:
 - latency is essentially improved (60 to 95% reduction) by flushing
 - throughput is improved (4 to 7% better) thanks to sorting
 

In detail, some loads may benefit more from only one option activated.
Also on SSD probably both options would have limited benefit.

Usual caveat: these are only benches on one host at a particular time and
location, which may or may not be reproducible nor be representative
as such of any other load.  The good news is that all these tests tell
the same thing.


* LOOK FOR THOUGHTS

- The bgwriter flushing option seems ineffective, it could be removed  from the patch?

- Move fsync as early as possible, suggested by Andres Freund?

In these tests, when the flush option is activated, the fsync duration
at the end of the checkpoint is small: on more than 5525 checkpoint
fsyncs, 0.5% are above 1 second when flush is on, but the figure rises
to 24% when it is off.... This suggests that doing the fsync as soon as
possible would probably have no significant effect on these tests.

My opinion is that this should be left out for the nonce.


- Take into account tablespaces, as pointed out by Andres Freund?

The issue is that if writes are sorted, they will not be distributed
randomly over tablespaces, inducing lower performance on such systems.

How to do it: while scanning shared_buffers, count dirty buffers for each
tablespace. Then start as many threads as tablespaces, each one doing
its own independent throttling for a tablespace? For some obscure reason
there are 2 tablespaces by default (pg_global and pg_default), so that would
mean at least 2 threads.

Alternatively, maybe it can be done from one thread, but it would probably 
involve some strange hocus-pocus to switch frequently between tablespaces.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Cédric Villemain
Date:
On 07/06/2015 16:53, Fabien COELHO wrote:
> +		/* Others: say that data should not be kept in memory...
> +		 * This is not exactly what we want to say, because we want to write
> +		 * the data for durability but we may need it later nevertheless.
> +		 * It seems that Linux would free the memory *if* the data has
> +		 * already been written to disk, else it is ignored.
> +		 * For FreeBSD this may have the desired effect of moving the
> +		 * data to the io layer.
> +		 */
> +		rc = posix_fadvise(context->fd, context->offset, context->nbytes,
> +						   POSIX_FADV_DONTNEED);
> +

It looks a bit hazardous; do you have a benchmark for FreeBSD?

Sources say:

    case POSIX_FADV_DONTNEED:
        /*
         * Flush any open FS buffers and then remove pages
         * from the backing VM object.  Using vinvalbuf() here
         * is a bit heavy-handed as it flushes all buffers for
         * the given vnode, not just the buffers covering the
         * requested range.
 

-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Cédric,

> It looks a bit hazardous, do you have a benchmark for freeBSD ?

No, I just consulted the FreeBSD man page for posix_fadvise. If someone can
run tests on a non-Linux system with HDDs, that would be nice.

> Sources says:
>     case POSIX_FADV_DONTNEED:
>         /*
>          * Flush any open FS buffers and then remove pages
>          * from the backing VM object.  Using vinvalbuf() here
>          * is a bit heavy-handed as it flushes all buffers for
>          * the given vnode, not just the buffers covering the
>          * requested range.

It is indeed heavy-handed, but that would probably trigger the expected 
behavior which is to start writing to disk, so I would expect to see 
benefits similar to those of "sync_file_range" on Linux.

Buffer writes from bgwriter & checkpointer are throttled, which reduces 
the potential impact of a "heavy-handed" approach in the kernel.

Now if on some platforms the behavior is absurd, obviously it would be 
better to turn the feature off on those.

Note that this is already used by pg in "initdb", but the impact would 
probably be very small anyway.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:

Hello,

Here is version 3, including many performance tests with various settings,
representing about 100 hours of pgbench runs. This patch aims at improving
checkpoint I/O behavior so that tps throughput is improved, late
transactions are less frequent, and overall performance is more stable.


* SOLILOQUIZING

> - The bgwriter flushing option seems ineffective, it could be removed
>  from the patch?

I did that.

> - Move fsync as early as possible, suggested by Andres Freund?
>
> My opinion is that this should be left out for the nonce.

I did that.

> - Take into account tablespaces, as pointed out by Andres Freund?
>
> Alternatively, maybe it can be done from one thread, but it would probably 
> involve some strange hocus-pocus to switch frequently between tablespaces.

I did the hocus-pocus approach, including a quasi-proof (not sure what this
mathematical object is :-) in comments to show how/why it works.


* PATCH CONTENTS
 - as version 1: simplified asynchronous flush based on Andres Freund's
   patch, with sync_file_range/posix_fadvise used to hint the OS that
   the buffer must be sent to disk "now".

 - as version 2: checkpoint buffer sorting based on a 2007 patch by
   Takahiro Itagaki, but with a smaller and static buffer allocated once.
   Also, sorting is done by chunks of 131072 pages in the current version,
   with a guc to change this value.

 - as version 2: sync/advise calls are now merged if possible, so fewer
   calls will be used, especially when buffers are sorted, but also if
   there are few files written.

 - new: the checkpointer balances its page writes per tablespace. This is
   done by choosing to write pages for a tablespace whose progress ratio
   (written/to_write) lags behind the overall progress ratio for all
   tablespaces, and by doing that in a round-robin manner so that all
   tablespaces regularly get some attention. No threads. (A rough sketch
   follows this list.)

 - new: some more documentation is added.

 - removed: "bgwriter_flush_to_disk" is removed, as there was no clear
   benefit on the (simple) tests. It could be considered for another patch.

 - question: I'm not sure I understand the checkpointer memory management.
   There is some exception handling in the checkpointer main. I wonder
   whether the allocated memory would be lost in such an event and should
   be reallocated.  The patch currently assumes that the memory is kept.
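
A rough sketch of that balancing (hypothetical names, simplified from what
the patch actually does): on each step the checkpointer picks, in round-robin
order, a tablespace whose own write progress lags the overall progress, and
writes one of its pages:

  /* per-tablespace bookkeeping for the current checkpoint */
  typedef struct TsProgress
  {
      int num_to_write;   /* dirty pages of this tablespace to write */
      int num_written;    /* pages already written for it */
  } TsProgress;

  /*
   * Return the index of the next tablespace to write from, or -1.
   * A real implementation would also need a fallback for the case where
   * every remaining tablespace is already ahead of the overall progress.
   */
  static int
  next_tablespace_to_write(TsProgress *ts, int nts, int written, int to_write)
  {
      static int last = -1;
      double     overall;
      int        i;

      if (to_write == 0)
          return -1;
      overall = (double) written / to_write;

      for (i = 0; i < nts; i++)
      {
          int t = (last + 1 + i) % nts;    /* round robin */

          if (ts[t].num_written < ts[t].num_to_write &&
              (double) ts[t].num_written / ts[t].num_to_write <= overall)
          {
              last = t;
              return t;    /* this tablespace lags: write one of its pages */
          }
      }
      return -1;
  }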
 


* PERFORMANCE TESTS

Impacts on "pgbench -M prepared -N -P 1 ..." (simple update test, mostly
random write activity on one table), checkpoint_completion_target=0.8, with
different settings on a 16GB 8-core host:
 . tiny: scale=10 shared_buffers=1GB checkpoint_timeout=30s time=6400s
 . small: scale=120 shared_buffers=2GB checkpoint_timeout=300s time=4000s
 . medium: scale=250 shared_buffers=4GB checkpoint_timeout=15min time=4000s
 . large: scale=1000 shared_buffers=4GB checkpoint_timeout=40min time=7500s
 

Note: figures noted with a star (*) had various issues during their run, so
pgbench progress figures were more or less incorrect, thus the standard
deviation computation is not to be trusted beyond "pretty bad".

Caveat: these are only benches on one host at a particular time and
location, which may or may not be reproducible nor be representative
as such of any other load.  The good news is that all these tests tell
the same thing.

- full-speed 1-client
     options   | tps performance over per second data
  flush | sort |    tiny    |    small     |   medium     |    large
    off |  off | 687 +- 231 | 163 +- 280 * | 191 +- 626 * | 37.7 +- 25.6
    off |   on | 699 +- 223 | 457 +- 315   | 479 +- 319   | 48.4 +- 28.8
     on |  off | 740 +- 125 | 143 +- 387 * | 179 +- 501 * | 37.3 +- 13.3
     on |   on | 722 +- 119 | 550 +- 140   | 549 +- 180   | 47.2 +- 16.8
 

- full speed 4-clients
      options  | tps performance over per second data
  flush | sort |    tiny     |     small     |    medium
    off |  off | 2006 +- 748 | 193 +- 1898 * | 205 +- 2465 *
    off |   on | 2086 +- 673 | 819 +-  905 * | 807 +- 1029 *
     on |  off | 2212 +- 451 | 169 +- 1269 * | 160 +-  502 *
     on |   on | 2073 +- 437 | 743 +-  413   | 822 +-  467
 

- 100-tps 1-client max 100-ms latency
     options   | percent of late transactions
  flush | sort |  tiny | small | medium
    off |  off |  6.31 | 29.44 | 30.74
    off |   on |  6.23 |  8.93 |  7.12
     on |  off |  0.44 |  7.01 |  8.14
     on |   on |  0.59 |  0.83 |  1.84
 

- 200-tps 1-client max 100-ms latency
     options   | percent of late transactions
  flush | sort |  tiny | small | medium
    off |  off | 10.00 | 50.61 | 45.51
    off |   on |  8.82 | 12.75 | 12.89
     on |  off |  0.59 | 40.48 | 42.64
     on |   on |  0.53 |  1.76 |  2.59
 

- 400-tps 1-client (or 4 for medium) max 100-ms latency
     options   | percent of late transactions
  flush | sort | tiny | small | medium
    off |  off | 12.0 | 64.28 | 68.6
    off |   on | 11.3 | 22.05 | 22.6
     on |  off |  1.1 | 67.93 | 67.9
     on |   on |  0.6 |  3.24 |  3.1
 


* CONCLUSION :

For most of these HDD tests, when both options are activated the tps 
throughput is improved (+3 to +300%), late transactions are reduced (by 
91% to 97%) and overall the performance is more stable (tps standard 
deviation is typically halved).

The option effects are somehow orthogonal:
 - latency is essentially limited by flushing, although sorting also
   contributes.
 - throughput is mostly improved thanks to sorting, with some occasional
   small positive or negative effect from flushing.

In detail, some loads may benefit more from only one option activated. In 
particular, flushing may have a small adverse effect on throughput in some 
conditions, although not always. With SSDs both options would
probably have limited benefit.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Andres Freund
Date:
Hi,

On 2015-06-17 08:24:38 +0200, Fabien COELHO wrote:
> Here is version 3, including many performance tests with various settings,
> representing about 100 hours of pgbench run. This patch aims at improving
> checkpoint I/O behavior so that tps throughput is improved, late
> transactions are less frequent, and overall performances are more stable.

First off: This is pretty impressive stuff. Being at pgcon, I don't have
time to look into this in detail, but I do plan to comment more
extensively.

> >- Move fsync as early as possible, suggested by Andres Freund?
> >
> >My opinion is that this should be left out for the nonce.

"for the nonce" - what does that mean?

> I did that.

I'm doubtful that it's a good idea to separate this out, if you did.

>  - as version 2: checkpoint buffer sorting based on a 2007 patch by
>    Takahiro Itagaki but with a smaller and static buffer allocated once.
>    Also, sorting is done by chunks of 131072 pages in the current version,
>    with a guc to change this value.

I think it's a really bad idea to do this in chunks. That'll mean we'll
frequently uselessly cause repetitive random IO, often interleaved. That
pattern is horrible for SSDs too. We should always try to do this at
once, and only fail back to using less memory if we couldn't allocate
everything.

> * PERFORMANCE TESTS
> 
> Impacts on "pgbench -M prepared -N -P 1 ..." (simple update test, mostly
> random write activity on one table), checkpoint_completion_target=0.8, with
> different settings on a 16GB 8-core host:
> 
>  . tiny: scale=10 shared_buffers=1GB checkpoint_timeout=30s time=6400s
>  . small: scale=120 shared_buffers=2GB checkpoint_timeout=300s time=4000s
>  . medium: scale=250 shared_buffers=4GB checkpoint_timeout=15min time=4000s
>  . large: scale=1000 shared_buffers=4GB checkpoint_timeout=40min time=7500s

It'd be interesting to see numbers for tiny, without the overly small
checkpoint timeout value. 30s is below the OS's writeback time.

> Note: figures noted with a star (*) had various issues during their run, so
> pgbench progress figures were more or less incorrect, thus the standard
> deviation computation is not to be trusted beyond "pretty bad".
> 
> Caveat: these are only benches on one host at a particular time and
> location, which may or may not be reproducible nor be representative
> as such of any other load.  The good news is that all these tests tell
> the same thing.
> 
> - full-speed 1-client
> 
>      options   | tps performance over per second data
>   flush | sort |    tiny    |    small     |   medium     |    large
>     off |  off | 687 +- 231 | 163 +- 280 * | 191 +- 626 * | 37.7 +- 25.6
>     off |   on | 699 +- 223 | 457 +- 315   | 479 +- 319   | 48.4 +- 28.8
>      on |  off | 740 +- 125 | 143 +- 387 * | 179 +- 501 * | 37.3 +- 13.3
>      on |   on | 722 +- 119 | 550 +- 140   | 549 +- 180   | 47.2 +- 16.8
> 
> - full speed 4-clients
> 
>       options  | tps performance over per second data
>   flush | sort |    tiny     |     small     |    medium
>     off |  off | 2006 +- 748 | 193 +- 1898 * | 205 +- 2465 *
>     off |   on | 2086 +- 673 | 819 +-  905 * | 807 +- 1029 *
>      on |  off | 2212 +- 451 | 169 +- 1269 * | 160 +-  502 *
>      on |   on | 2073 +- 437 | 743 +-  413   | 822 +-  467
> 
> - 100-tps 1-client max 100-ms latency
> 
>      options   | percent of late transactions
>   flush | sort |  tiny | small | medium
>     off |  off |  6.31 | 29.44 | 30.74
>     off |   on |  6.23 |  8.93 |  7.12
>      on |  off |  0.44 |  7.01 |  8.14
>      on |   on |  0.59 |  0.83 |  1.84
> 
> - 200-tps 1-client max 100-ms latency
> 
>      options   | percent of late transactions
>   flush | sort |  tiny | small | medium
>     off |  off | 10.00 | 50.61 | 45.51
>     off |   on |  8.82 | 12.75 | 12.89
>      on |  off |  0.59 | 40.48 | 42.64
>      on |   on |  0.53 |  1.76 |  2.59
> 
> - 400-tps 1-client (or 4 for medium) max 100-ms latency
> 
>      options   | percent of late transactions
>   flush | sort | tiny | small | medium
>     off |  off | 12.0 | 64.28 | 68.6
>     off |   on | 11.3 | 22.05 | 22.6
>      on |  off |  1.1 | 67.93 | 67.9
>      on |   on |  0.6 |  3.24 |  3.1
> 

So you've not run things at more serious concurrency, that'd be
interesting to see.

I'd also like to see concurrent workloads with synchronous_commit=off -
I've seen absolutely horrible latency behaviour for that, and I'm hoping
this will help. It's also a good way to simulate faster hardware than
you have.

It's also curious that sorting is detrimental for full speed 'tiny'.

> * CONCLUSION :
> 
> For most of these HDD tests, when both options are activated the tps
> throughput is improved (+3 to +300%), late transactions are reduced (by 91%
> to 97%) and overall the performance is more stable (tps standard deviation
> is typically halved).
> 
> The option effects are somehow orthogonal:
> 
>  - latency is essentially limited by flushing, although sorting also
>    contributes.
> 
>  - throughput is mostly improved thanks to sorting, with some occasional
>    small positive or negative effect from flushing.
> 
> In detail, some loads may benefit more from only one option activated. In
> particular, flushing may have a small adverse effect on throughput in some
> conditions, although not always.

> With SSD probably both options would probably have limited benefit.

I doubt that. Small random writes have bad consequences for wear
leveling. You might not notice that with a short tests - again, I doubt
it - but it'll definitely become visible over time.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

>>> - Move fsync as early as possible, suggested by Andres Freund?
>>>
>>> My opinion is that this should be left out for the nonce.
>
> "for the nonce" - what does that mean?
  Nonce \Nonce\ (n[o^]ns), n. [For the nonce, OE. for the nones, ...
    {for the nonce}, i. e. for the present time.

> I'm doubtful that it's a good idea to separate this out, if you did.

Actually I did, because as explained in another mail the fsync time when 
the other options are activated as reported in the logs is essentially 
null, so it would not bring significant improvements on these runs,
and also the patch changes enough things as it is.

So this is an evidence-based decision.

I also agree that it seems interesting on principle and should be 
beneficial in some case, but I would rather keep that on a TODO list 
together with trying to do better things in the bgwriter and try to focus 
on the current proposal which already changes significantly the 
checkpointer throttling logic.

>>  - as version 2: checkpoint buffer sorting based on a 2007 patch by
>>    Takahiro Itagaki but with a smaller and static buffer allocated once.
>>    Also, sorting is done by chunks of 131072 pages in the current version,
>>    with a guc to change this value.
>
> I think it's a really bad idea to do this in chunks.

The small problem I see is that for a very large setting there could be 
several seconds or even minutes of sorting, which may or may not be 
desirable, so having some control on that seems a good idea.

Another argument is that Tom said he wanted that:-)

In practice the value can be set at a high value so that it is nearly 
always sorted in one go. Maybe value "0" could be made special and used to 
trigger this behavior systematically, and be the default.

> That'll mean we'll frequently uselessly cause repetitive random IO,

This is not an issue if the chunks are large enough, and anyway the guc 
allows to change the behavior as desired. As I said, keeping some control 
seems a good idea, and the "full sorting" can be made the default 
behavior.

> often interleaved. That pattern is horrible for SSDs too. We should 
> always try to do this at once, and only fail back to using less memory 
> if we couldn't allocate everything.

The memory is needed anyway in order to avoid a double or significantly 
more heavy implementation for the throttling loop. It is allocated once on 
the first checkpoint. The allocation could be moved to the checkpointer 
initialization if this is a concern. The memory needed is one int per 
buffer, which is smaller than the 2007 patch.

>>  . tiny: scale=10 shared_buffers=1GB checkpoint_timeout=30s time=6400s
>
> It'd be interesting to see numbers for tiny, without the overly small
> checkpoint timeout value. 30s is below the OS's writeback time.

The point of tiny was to trigger a lot of checkpoints. The size is pretty 
ridiculous anyway, as "tiny" implies. I think I did some tests on other 
versions of the patch with a longer checkpoint_timeout on a pretty small 
database that showed a smaller benefit from the options, as one would 
expect. I'll try to re-run some.

> So you've not run things at more serious concurrency, that'd be
> interesting to see.

I do not have a box available for "serious concurrency".

> I'd also like to see concurrent workloads with synchronous_commit=off -
> I've seen absolutely horrible latency behaviour for that, and I'm hoping
> this will help. It's also a good way to simulate faster hardware than
> you have.

> It's also curious that sorting is detrimental for full speed 'tiny'.

Yep.

>> With SSD probably both options would probably have limited benefit.
>
> I doubt that. Small random writes have bad consequences for wear
> leveling. You might not notice that with a short tests - again, I doubt
> it - but it'll definitely become visible over time.

Possibly. Testing such effects does not seem easy, though. At least I have 
not seen "write stalls" on SSD, which is my primary concern.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
Hi,

On 2015-06-20 08:57:57 +0200, Fabien COELHO wrote:
> Actually I did, because as explained in another mail the fsync time when the
> other options are activated as reported in the logs is essentially null, so
> it would not bring significant improvements on these runs,
> and also the patch changes enough things as it is.
> 
> So this is an evidence-based decision.

Meh. You're testing on low concurrency.

> >> - as version 2: checkpoint buffer sorting based on a 2007 patch by
> >>   Takahiro Itagaki but with a smaller and static buffer allocated once.
> >>   Also, sorting is done by chunks of 131072 pages in the current version,
> >>   with a guc to change this value.
> >
> >I think it's a really bad idea to do this in chunks.
> 
> The small problem I see is that for a very large setting there could be
> several seconds or even minutes of sorting, which may or may not be
> desirable, so having some control on that seems a good idea.

If the sorting of the dirty blocks alone takes minutes, it'll never
finish writing that many buffers out. That's an utterly bogus argument.

> Another argument is that Tom said he wanted that:-)

I don't think he said that when we discussed this last.

> In practice the value can be set at a high value so that it is nearly always
> sorted in one go. Maybe value "0" could be made special and used to trigger
> this behavior systematically, and be the default.

You're just making things too complicated.

> >That'll mean we'll frequently uselessly cause repetitive random IO,
> 
> This is not an issue if the chunks are large enough, and anyway the guc
> allows to change the behavior as desired.

I don't think this is true. If two consecutive blocks are dirty, but you
sync them in two different chunks, you *always* will cause additional
random IO. Either the drive will have to skip the write for that block,
or the os will prefetch the data. More importantly with SSDs it voids
the wear leveling advantages.
> >often interleaved. That pattern is horrible for SSDs too. We should always
> >try to do this at once, and only fail back to using less memory if we
> >couldn't allocate everything.
> 
> The memory is needed anyway in order to avoid a double or significantly more
> heavy implementation for the throttling loop. It is allocated once on the
> first checkpoint. The allocation could be moved to the checkpointer
> initialization if this is a concern. The memory needed is one int per
> buffer, which is smaller than the 2007 patch.

There's a reason the 2007 patch (and my revision of it last year) did
what it did. You can't just access buffer descriptors without
locking. Besides, causing additional cacheline bouncing during the
sorting process is a bad idea.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Jim Nasby
Date:
On 6/20/15 2:57 AM, Fabien COELHO wrote:
>>>  - as version 2: checkpoint buffer sorting based on a 2007 patch by
>>>    Takahiro Itagaki but with a smaller and static buffer allocated once.
>>>    Also, sorting is done by chunks of 131072 pages in the current
>>> version,
>>>    with a guc to change this value.
>>
>> I think it's a really bad idea to do this in chunks.
>
> The small problem I see is that for a very large setting there could be
> several seconds or even minutes of sorting, which may or may not be
> desirable, so having some control on that seems a good idea.

ISTM a more elegant way to handle that would be to start off with a very 
small number of buffers and sort larger and larger lists while the OS is 
busy writing/syncing.

> Another argument is that Tom said he wanted that:-)

Did he elaborate why? I don't see him on this thread (though I don't 
have all of it).

> In practice the value can be set at a high value so that it is nearly
> always sorted in one go. Maybe value "0" could be made special and used
> to trigger this behavior systematically, and be the default.

It'd be nice if it was just self-tuning, with no GUC.

It looks like it'd be much better to get this committed without more 
than we have now than to do without it though...
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

>> So this is an evidence-based decision.
>
> Meh. You're testing on low concurrency.

Well, I'm just testing on the available box.

I do not see the link between high concurrency and whether moving fsync as 
early as possible would have a large performance impact. I think it might 
be interesting if bgwriter is doing a lot of writes, but I'm not sure 
under which configuration & load that would be.

>>> I think it's a really bad idea to do this in chunks.
>>
>> The small problem I see is that for a very large setting there could be
>> several seconds or even minutes of sorting, which may or may not be
>> desirable, so having some control on that seems a good idea.
>
> If the sorting of the dirty blocks alone takes minutes, it'll never
> finish writing that many buffers out. That's a utterly bogus argument.

Well, if in the future you have 8 TB of memory (I've seen a 512GB memory 
server a few weeks ago), set shared_buffers=2TB, then if I'm not mistaken 
in the worst case you may have 256 million 8k-buffers to checkpoint. Then 
it really depends on the I/O subsystem used by the box, but if you 
bought 8 TB of RAM you would probably have nice I/O hardware attached.

>> Another argument is that Tom said he wanted that:-)
>
> I don't think he said that when we discussed this last.

That is what I was recalling when I wrote this sentence:

http://www.postgresql.org/message-id/6599.1409421040@sss.pgh.pa.us

But it had more to do with memory-allocation management.

>> In practice the value can be set at a high value so that it is nearly always
>> sorted in one go. Maybe value "0" could be made special and used to trigger
>> this behavior systematically, and be the default.
>
> You're just making things too complicated.

ISTM that it is not really complicated, but anyway it is easy to change 
the checkpoint_sort stuff to a boolean.

In the reported performance tests, there is usually just one chunk anyway, 
sometimes two, so this gives an idea of the overall performance effect.

>> This is not an issue if the chunks are large enough, and anyway the guc
>> allows to change the behavior as desired.
>
> I don't think this is true. If two consecutive blocks are dirty, but you
> sync them in two different chunks, you *always* will cause additional
> random IO.

I think that it could be a small number if the chunks are large, i.e. the 
performance benefit of sorting larger and larger chunks is decreasing.

> Either the drive will have to skip the write for that block,
> or the os will prefetch the data. More importantly with SSDs it voids
> the wear leveling advantages.

Possibly. I do not understand wear leveling done by SSD firmware.

>>> often interleaved. That pattern is horrible for SSDs too. We should always
>>> try to do this at once, and only fail back to using less memory if we
>>> couldn't allocate everything.
>>
>> The memory is needed anyway in order to avoid a double or significantly more
>> heavy implementation for the throttling loop. It is allocated once on the
>> first checkpoint. The allocation could be moved to the checkpointer
>> initialization if this is a concern. The memory needed is one int per
>> buffer, which is smaller than the 2007 patch.
>
> There's a reason the 2007 patch (and my revision of it last year) did
> what it did. You can't just access buffer descriptors without
> locking.

I really think that you can because the sorting is really "advisory", i.e. 
the checkpointer will work fine if the sorting is wrong or not done at 
all, as it is now, when the checkpointer writes buffers. The only 
condition is that the buffers must not be moved with their "to write in 
this checkpoint" flag, but this is also necessary for the current 
checkpointer stuff to work.

Moreover, this trick already exists in the patch I submitted: 
some tests are done without locking, but the actual "buffer write" does 
the locking and would skip the write if the previous test was wrong, as 
described in comments in the code.

> Besides, causing additional cacheline bouncing during the
> sorting process is a bad idea.

Hmmm. The impact would be to multiply the memory required by 3 or 4 
(buf_id, relation, forknum, offset), instead of just buf_id, and I 
understood that memory was a concern.

Moreover, once the sort process get the lines which contain the sorting 
data from the buffer descriptor in its cache, I think that it should be 
pretty much okay. Incidentally, they would probably have been brought to 
cache by the scan to collect them. Also, I do not think that the sorting 
time for 128000 buffers, and possible cache misses, was a big issue, but I 
do not have a measure to defend that. I could try to collect some data 
about that.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Jim,

>> The small problem I see is that for a very large setting there could be
>> several seconds or even minutes of sorting, which may or may not be
>> desirable, so having some control on that seems a good idea.
>
> ISTM a more elegant way to handle that would be to start off with a very 
> small number of buffers and sort larger and larger lists while the OS is busy 
> writing/syncing.

You really have to have done a significant part/most/all of sorting before 
starting to write.

>> Another argument is that Tom said he wanted that:-)
>
> Did he elaborate why? I don't see him on this thread (though I don't have all 
> of it).

http://www.postgresql.org/message-id/6599.1409421040@sss.pgh.pa.us

But it has more to do with memory management.

>> In practice the value can be set at a high value so that it is nearly
>> always sorted in one go. Maybe value "0" could be made special and used
>> to trigger this behavior systematically, and be the default.
>
> It'd be nice if it was just self-tuning, with no GUC.

Hmmm. It can easily be turned into a boolean, but otherwise I have no 
clue about how to decide whether to sort and/or flush.

> It looks like it'd be much better to get this committed without more than we 
> have now than to do without it though...

Yep, I think the figures are definitely encouraging.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
<sorry, resent stalled post, wrong from>

> It'd be interesting to see numbers for tiny, without the overly small
> checkpoint timeout value. 30s is below the OS's writeback time.

Here are some tests with longer timeout:

tiny2: scale=10 shared_buffers=1GB checkpoint_timeout=5min
         max_wal_size=1GB warmup=600 time=4000

  flsh |      full speed tps      | percent of late tx, 4 clients, for tps:
  /srt |  1 client  |  4 clients  |  100 |  200 |  400 |  800 | 1200 | 1600
  N/N  | 930 +- 124 | 2560 +- 394 | 0.70 | 1.03 | 1.27 | 1.56 | 2.02 | 2.38
  N/Y  | 924 +- 122 | 2612 +- 326 | 0.63 | 0.79 | 0.94 | 1.15 | 1.45 | 1.67
  Y/N  | 907 +- 112 | 2590 +- 315 | 0.58 | 0.83 | 0.68 | 0.71 | 0.81 | 1.26
  Y/Y  | 915 +- 114 | 2590 +- 317 | 0.60 | 0.68 | 0.70 | 0.78 | 0.88 | 1.13

There seems to be a small 1-2% performance benefit with 4 clients, this is 
reversed for 1 client, there are significantly and consistently less late 
transactions when options are activated, the performance is more stable
(standard deviation reduced by 10-18%).

The db is about 200 MB ~ 25000 pages, at 2500+ tps it is written 40 times 
over in 5 minutes, so the checkpoint basically writes everything in 220 
seconds, 0.9 MB/s. Given the preload phase the buffers may be more or less 
in order in memory, so may be written out in order anyway.


medium2: scale=300 shared_buffers=5GB checkpoint_timeout=30min
          max_wal_size=4GB warmup=1200 time=7500

  flsh |      full speed tps       | percent of late tx, 4 clients
  /srt |  1 client   |  4 clients  |   100 |   200 |   400 |
   N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
   N/Y | 458 +- 327* | 743 +- 920* |  7.05 | 14.24 | 24.07 |
   Y/N | 169 +- 166* | 187 +- 302* |  4.01 | 39.84 | 65.70 |
   Y/Y | 546 +- 143  | 681 +- 459  |  1.55 |  3.51 |  2.84 |

The effect of sorting is very positive (+150% to 270% tps). On this run, 
flushing has a positive (+20% with 1 client) or negative (-8 % with 4 
clients) on throughput, and late transactions are reduced by 92-95% when 
both options are activated.

At 550 tps checkpoints are xlog-triggered and write about 1/3 of the 
database (170000 buffers to write every 220-260 seconds, 4 MB/s).

-- 
Fabien.



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Mon, Jun 22, 2015 at 1:41 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> <sorry, resent stalled post, wrong from>
>
>> It'd be interesting to see numbers for tiny, without the overly small
>> checkpoint timeout value. 30s is below the OS's writeback time.
>
>
> Here are some tests with longer timeout:
>
> tiny2: scale=10 shared_buffers=1GB checkpoint_timeout=5min
>          max_wal_size=1GB warmup=600 time=4000
>
>   flsh |      full speed tps      | percent of late tx, 4 clients, for tps:
>   /srt |  1 client  |  4 clients  |  100 |  200 |  400 |  800 | 1200 | 1600
>   N/N  | 930 +- 124 | 2560 +- 394 | 0.70 | 1.03 | 1.27 | 1.56 | 2.02 | 2.38
>   N/Y  | 924 +- 122 | 2612 +- 326 | 0.63 | 0.79 | 0.94 | 1.15 | 1.45 | 1.67
>   Y/N  | 907 +- 112 | 2590 +- 315 | 0.58 | 0.83 | 0.68 | 0.71 | 0.81 | 1.26
>   Y/Y  | 915 +- 114 | 2590 +- 317 | 0.60 | 0.68 | 0.70 | 0.78 | 0.88 | 1.13
>
> There seems to be a small 1-2% performance benefit with 4 clients, this is reversed for 1 client, there are significantly and consistently less late transactions when options are activated, the performance is more stable
> (standard deviation reduced by 10-18%).
>
> The db is about 200 MB ~ 25000 pages, at 2500+ tps it is written 40 times over in 5 minutes, so the checkpoint basically writes everything in 220 seconds, 0.9 MB/s. Given the preload phase the buffers may be more or less in order in memory, so may be written out in order anyway.
>
>
> medium2: scale=300 shared_buffers=5GB checkpoint_timeout=30min
>           max_wal_size=4GB warmup=1200 time=7500
>
>   flsh |      full speed tps       | percent of late tx, 4 clients
>   /srt |  1 client   |  4 clients  |   100 |   200 |   400 |
>    N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
>    N/Y | 458 +- 327* | 743 +- 920* |  7.05 | 14.24 | 24.07 |
>    Y/N | 169 +- 166* | 187 +- 302* |  4.01 | 39.84 | 65.70 |
>    Y/Y | 546 +- 143  | 681 +- 459  |  1.55 |  3.51 |  2.84 |
>
> The effect of sorting is very positive (+150% to 270% tps). On this run, flushing has a positive (+20% with 1 client) or negative (-8 % with 4 clients) on throughput, and late transactions are reduced by 92-95% when both options are activated.
>

Why there is dip in performance with multiple clients, can it be
due to reason that we started doing more stuff after holding bufhdr
lock in below code?

BufferSync()
{
..
for (buf_id = 0; buf_id < NBuffers; buf_id++)
  {
  volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
@@ -1621,32 +1719,185 @@ BufferSync(int flags)
 
  if ((bufHdr->flags & mask) == mask)
  {
+ Oid spc;
+ TableSpaceCountEntry * entry;
+ bool found;
+
  bufHdr->flags |= BM_CHECKPOINT_NEEDED;
+ CheckpointBufferIds[num_to_write] = buf_id;
  num_to_write++;
+
+ /* keep track of per tablespace buffers */
+ spc = bufHdr->tag.rnode.spcNode;
+ entry = (TableSpaceCountEntry *)
+ hash_search(spcBuffers, (void *) &spc, HASH_ENTER, &found);
+
+ if (found) entry->count++;
+ else entry->count = 1;
  }
..
}


-
BufferSync()
{
..
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ active_spaces = nb_spaces;
+ space = 0;
  num_written = 0;
- while (num_to_scan-- > 0)
+
+ while (active_spaces != 0)
..
}

The changed code doesn't seems to give any consideration to
clock-sweep point which might not be helpful for cases when checkpoint
could have flushed soon-to-be-recycled buffers.  I think flushing the
sorted buffers w.r.t tablespaces is a good idea, but not giving any
preference to clock-sweep point seems to me that we would loose in
some cases by this new change.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Amit,

>> medium2: scale=300 shared_buffers=5GB checkpoint_timeout=30min
>>           max_wal_size=4GB warmup=1200 time=7500
>>
>>   flsh |      full speed tps       | percent of late tx, 4 clients
>>   /srt |  1 client   |  4 clients  |   100 |   200 |   400 |
>>    N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
>>    N/Y | 458 +- 327* | 743 +- 920* |  7.05 | 14.24 | 24.07 |
>>    Y/N | 169 +- 166* | 187 +- 302* |  4.01 | 39.84 | 65.70 |
>>    Y/Y | 546 +- 143  | 681 +- 459  |  1.55 |  3.51 |  2.84 |
>>
>> The effect of sorting is very positive (+150% to 270% tps). On this run,
> flushing has a positive (+20% with 1 client) or negative (-8 % with 4
> clients) on throughput, and late transactions are reduced by 92-95% when
> both options are activated.
>
> Why there is dip in performance with multiple clients,

I'm not sure to see the "dip". The performances are better with 4 clients 
compared to 1 client?

> can it be due to reason that we started doing more stuff after holding 
> bufhdr lock in below code?

I think it is very unlikely that the buffer being locked would be 
simultaneously requested by one of the 4 clients for an UPDATE, so I do 
not think it should have a significant impact.

> BufferSync() [...]

> BufferSync()
> {
> ..
> - buf_id = StrategySyncStart(NULL, NULL);
> - num_to_scan = NBuffers;
> + active_spaces = nb_spaces;
> + space = 0;
>  num_written = 0;
> - while (num_to_scan-- > 0)
> +
> + while (active_spaces != 0)
> ..
> }
>
> The changed code doesn't seems to give any consideration to
> clock-sweep point

Indeed.

> which might not be helpful for cases when checkpoint could have flushed 
> soon-to-be-recycled buffers. I think flushing the sorted buffers w.r.t 
> tablespaces is a good idea, but not giving any preference to clock-sweep 
> point seems to me that we would loose in some cases by this new change.

I do not see how to do both, as these two orders seem more or less 
unrelated?  The traditional assumption is that the I/O are very slow and 
they are to be optimized first, so going for buffer ordering to be nice to 
the disk looks like the priority.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
> I'd also like to see concurrent workloads with synchronous_commit=off -
> I've seen absolutely horrible latency behaviour for that, and I'm hoping
> this will help. It's also a good way to simulate faster hardware than
> you have.

It helps. I've done a few runs, where the very-very-bad situation is 
improved to... I would say very-bad:

medium3: scale=200 shared_buffers=4GB checkpoint_timeout=15min
    max_wal_size=4GB warmup=1200 time=6000 clients=4
    synchronous_commit=off

 flush sort |  tps | percent of seconds offline
 off   off  |  296 | 83% offline
 off   on   | 1496 | 33% offline
 on    off  | 1641 | 59% offline
 on    on   | 1515 | 31% offline

The offline figure is the percentage of seconds in the 6000-second run 
where 0.0 tps are reported, or where nothing is reported because pgbench 
is stuck.

It is somewhat better... on an abysmal scale: sorting and flushing reduced 
the offline time by a factor of 2.6. Too bad it is so high to begin with. 
The tps is improved by a factor of 5 with either option.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Jim Nasby
Date:
On 6/22/15 11:59 PM, Fabien COELHO wrote:
>> which might not be helpful for cases when checkpoint could have
>> flushed soon-to-be-recycled buffers. I think flushing the sorted
>> buffers w.r.t tablespaces is a good idea, but not giving any
>> preference to clock-sweep point seems to me that we would loose in
>> some cases by this new change.
>
> I do not see how to do both, as these two orders seem more or less
> unrelated?  The traditionnal assumption is that the I/O are very slow
> and they are to be optimized first, so going for buffer ordering to be
> nice to the disk looks like the priority.

The point is that it's already expensive for backends to advance the 
clock; if they then have to wait on IO as well it gets REALLY expensive. 
So we want to avoid that.

Other than that though, it is pretty orthogonal, so perhaps another 
indication that the clock should be handled separately from both 
backends and bgwriter...
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Tue, Jun 23, 2015 at 10:29 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Hello Amit,

medium2: scale=300 shared_buffers=5GB checkpoint_timeout=30min
          max_wal_size=4GB warmup=1200 time=7500

  flsh |      full speed tps       | percent of late tx, 4 clients
  /srt |  1 client   |  4 clients  |   100 |   200 |   400 |
   N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
   N/Y | 458 +- 327* | 743 +- 920* |  7.05 | 14.24 | 24.07 |
   Y/N | 169 +- 166* | 187 +- 302* |  4.01 | 39.84 | 65.70 |
   Y/Y | 546 +- 143  | 681 +- 459  |  1.55 |  3.51 |  2.84 |

The effect of sorting is very positive (+150% to 270% tps). On this run,
flushing has a positive (+20% with 1 client) or negative (-8 % with 4
clients) on throughput, and late transactions are reduced by 92-95% when
both options are activated.

Why there is dip in performance with multiple clients,

I'm not sure to see the "dip". The performances are better with 4 clients compared to 1 client?

What do you mean by "negative (-8 % with 4  clients) on throughput"
in above sentence?  I thought by that you mean that there is dip
in TPS with patch as compare to HEAD at 4 clients.

Also I am not completely sure what's +- means in your data above?
 
can it be due to reason that we started doing more stuff after holding bufhdr lock in below code?

I think it is very unlikely that the buffer being locked would be simultaneously requested by one of the 4 clients for an UPDATE, so I do not think it should have a significant impact.

BufferSync() [...]

BufferSync()
{
..
- buf_id = StrategySyncStart(NULL, NULL);
- num_to_scan = NBuffers;
+ active_spaces = nb_spaces;
+ space = 0;
 num_written = 0;
- while (num_to_scan-- > 0)
+
+ while (active_spaces != 0)
..
}

The changed code doesn't seems to give any consideration to
clock-sweep point

Indeed.

which might not be helpful for cases when checkpoint could have flushed soon-to-be-recycled buffers. I think flushing the sorted buffers w.r.t tablespaces is a good idea, but not giving any preference to clock-sweep point seems to me that we would loose in some cases by this new change.

I do not see how to do both, as these two orders seem more or less unrelated?

I understand your point and I also don't have any specific answer
for it at this moment, the point of worry is that it should not lead
to degradation of certain cases as compare to current algorithm.
The workload where it could effect is when your data doesn't fit
in shared buffers, but can fit in RAM.
 

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>>>>   flsh |      full speed tps       | percent of late tx, 4 clients
>>>>   /srt |  1 client   |  4 clients  |   100 |   200 |   400 |
>>>>    N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
>>>>    N/Y | 458 +- 327* | 743 +- 920* |  7.05 | 14.24 | 24.07 |
>>>>    Y/N | 169 +- 166* | 187 +- 302* |  4.01 | 39.84 | 65.70 |
>>>>    Y/Y | 546 +- 143  | 681 +- 459  |  1.55 |  3.51 |  2.84 |
>>>>
>>>> The effect of sorting is very positive (+150% to 270% tps). On this run,
>>>>
>>> flushing has a positive (+20% with 1 client) or negative (-8 % with 4
>>> clients) on throughput, and late transactions are reduced by 92-95% when
>>> both options are activated.
>>>
>>> Why there is dip in performance with multiple clients,
>>
>> I'm not sure to see the "dip". The performances are better with 4 clients
>> compared to 1 client?
>
> What do you mean by "negative (-8 % with 4 clients) on throughput" in 
> above sentence? I thought by that you mean that there is dip in TPS with 
> patch as compare to HEAD at 4 clients.

Ok, I misunderstood your question. I thought you meant a dip between 1 
client and 4 clients. I meant that when flush is turned on tps goes down 
by 8% (743 to 681 tps) on this particular run. Basically tps improvements 
mostly come from "sort", and "flush" has uncertain effects on tps 
(throughput), but much more on latency and performance stability (lower late 
rate, lower standard deviation).

Note that I'm not comparing to HEAD in the above tests, but with the new 
options deactivated, which should be more or less comparable to current 
HEAD, i.e. there is no sorting nor flushing done, but this is not strictly 
speaking HEAD behavior. Probably I should get some figures with HEAD as 
well to check the "more or less" assumption.

> Also I am not completely sure what's +- means in your data above?

The first figure before "+-" is the tps, the second after is its standard 
deviation computed in per-second traces. Some runs are very bad, with 
pgbench stuck at times, resulting in a stddev larger than the average; they 
are noted with "*".

> I understand your point and I also don't have any specific answer
> for it at this moment, the point of worry is that it should not lead
> to degradation of certain cases as compare to current algorithm.
> The workload where it could effect is when your data doesn't fit
> in shared buffers, but can fit in RAM.

Hmmm. My point of view is still that the logical priority is to optimize 
for disk IO first, then look for compatible RAM optimisations later.

I can run tests with a small shared_buffers, but probably it would just 
trigger a lot of checkpoints, or worse rely on the bgwriter to find space, 
which would generate random IOs.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>> I do not see how to do both, as these two orders seem more or less
>> unrelated?  The traditionnal assumption is that the I/O are very slow
>> and they are to be optimized first, so going for buffer ordering to be
>> nice to the disk looks like the priority.
>
> The point is that it's already expensive for backends to advance the clock; 
> if they then have to wait on IO as well it gets REALLY expensive. So we want 
> to avoid that.

I do not know what this clock stuff does. Note that the checkpoint buffer 
scan is done once at the beginning of the checkpoint and its time is 
relatively small compared to everything else in the checkpoint.

If this scan is an issue, it can be done in reverse order, or in some 
other order, but I think it is better to do it in order for better cache 
behavior, although the effect should be marginal.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>> Besides, causing additional cacheline bouncing during the
>> sorting process is a bad idea.
>
> Hmmm. The impact would be to multiply the memory required by 3 or 4 (buf_id, 
> relation, forknum, offset), instead of just buf_id, and I understood that 
> memory was a concern.
>
> Moreover, once the sort process get the lines which contain the sorting data 
> from the buffer descriptor in its cache, I think that it should be pretty 
> much okay. Incidentally, they would probably have been brought to cache by 
> the scan to collect them. Also, I do not think that the sorting time for 
> 128000 buffers, and possible cache misses, was a big issue, but I do not have 
> a measure to defend that. I could try to collect some data about that.

I've collected some data by adding a "sort time" measure, with 
checkpoint_sort_size=10000000 so that sorting is in one chunk, and done 
some large checkpoints:

LOG:  checkpoint complete: wrote 41091 buffers (6.3%);  0 transaction log file(s) added, 0 removed, 0 recycled;
sort=0.024s, write=0.488 s, sync=8.790 s, total=9.837 s;  sync files=41, longest=8.717 s, average=0.214 s;
distance=404972kB, estimate=404972 kB
 

LOG:  checkpoint complete: wrote 212124 buffers (32.4%);  0 transaction log file(s) added, 0 removed, 0 recycled;
sort=0.078s, write=128.885 s, sync=1.269 s, total=131.646 s;  sync files=43, longest=1.155 s, average=0.029 s;
distance=2102950kB, estimate=2102950 kB
 

LOG:  checkpoint complete: wrote 384427 buffers (36.7%);  0 transaction log file(s) added, 0 removed, 1 recycled;
sort=0.120s, write=83.995 s, sync=13.944 s, total=98.035 s;  sync files=9, longest=13.724 s, average=1.549 s;
distance=3783305kB, estimate=3783305 kB
 

LOG:  checkpoint complete: wrote 809211 buffers (77.2%);  0 transaction log file(s) added, 0 removed, 1 recycled;
sort=0.358s, write=138.146 s, sync=14.943 s, total=153.124 s;  sync files=13, longest=14.871 s, average=1.149 s;
distance=8075338kB, estimate=8075338 kB
 

Summary of these checkpoints:
  #buffers   size   sort
     41091  328MB  0.024
    212124  1.7GB  0.078
    384427  2.9GB  0.120
    809211  6.2GB  0.358

Sort times are pretty negligible compared to the whole checkpoint time,
and under 0.1 s/GB of buffers sorted.
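
For reference, the measure itself is nothing fancy; it is something like 
the following around the qsort call (an illustrative sketch using the 
instr_time macros from portability/instr_time.h, with made-up variable 
names, not the exact patch code):

   instr_time  sort_start, sort_duration;

   INSTR_TIME_SET_CURRENT(sort_start);
   qsort(CheckpointBufferIds, num_to_write, sizeof(int),
         (int (*)(const void *, const void *)) bufcmp);
   INSTR_TIME_SET_CURRENT(sort_duration);
   INSTR_TIME_SUBTRACT(sort_duration, sort_start);
   elog(LOG, "checkpoint sort time: %.3f s",
        INSTR_TIME_GET_DOUBLE(sort_duration));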

On a 512 GB server with shared_buffers=128GB (25%), this suggests a 
worst-case checkpoint sort of a few seconds, and then you have a hundred GB 
to write anyway. If we project a 1 TB checkpoint for the next decade, the 
sort would take under a minute... But then you have 1 TB of data to dump.

As a comparison point, I've done the large checkpoint with the default 
sort size of 131072:

LOG:  checkpoint complete: wrote 809211 buffers (77.2%);  0 transaction log file(s) added, 0 removed, 1 recycled;
sort=0.251s, write=152.377 s, sync=15.062 s, total=167.453 s;  sync files=13, longest=14.974 s, average=1.158 s;
distance=8075338kB, estimate=8075338 kB
 

The 0.251 sort time is to be compared to 0.358. Well, n.log(n) is not too 
bad, as expected.


These figures suggest that sorting time and associated cache misses are 
not a significant issue and thus are not worth bothering much about, and 
also that probably a simple boolean option would be quite acceptable 
instead of the chunk approach.

Attached is an updated version of the patch which turns the sort option 
into a boolean, and also include the sort time in the checkpoint log.

There is still an open question about whether the sorting buffer 
allocation is lost on some signals and should be reallocated in such 
event.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Wed, Jun 24, 2015 at 9:50 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

  flsh |      full speed tps       | percent of late tx, 4 clients
  /srt |  1 client   |  4 clients  |   100 |   200 |   400 |
   N/N | 173 +- 289* | 198 +- 531* | 27.61 | 43.92 | 61.16 |
   N/Y | 458 +- 327* | 743 +- 920* |  7.05 | 14.24 | 24.07 |
   Y/N | 169 +- 166* | 187 +- 302* |  4.01 | 39.84 | 65.70 |
   Y/Y | 546 +- 143  | 681 +- 459  |  1.55 |  3.51 |  2.84 |

The effect of sorting is very positive (+150% to 270% tps). On this run,

flushing has a positive (+20% with 1 client) or negative (-8 % with 4
clients) on throughput, and late transactions are reduced by 92-95% when
both options are activated.

Why there is dip in performance with multiple clients,

I'm not sure to see the "dip". The performances are better with 4 clients
compared to 1 client?

What do you mean by "negative (-8 % with 4 clients) on throughput" in above sentence? I thought by that you mean that there is dip in TPS with patch as compare to HEAD at 4 clients.

Ok, I misunderstood your question. I thought you meant a dip between 1 client and 4 clients. I meant that when flush is turned on tps goes down by 8% (743 to 681 tps) on this particular run.

This 8% might matter if the dip is bigger with more clients and
more aggressive workload.  Do you know what could lead to this
dip, because if we know what is the reason than it will be more
predictable to know if this is the max dip that could happen or it
could lead to bigger dip in other cases.
 
Basically tps improvements mostly come from "sort", and "flush" has uncertain effects on tps (throuput), but much more on latency and performance stability (lower late rate, lower standard deviation).


I agree that performance stability is important, but not sure if it
is a good idea to sacrifice the throughput for it.  If sort + flush always
gives better results, then isn't it better to perform these actions
together under one option.
 
Note that I'm not comparing to HEAD in the above tests, but with the new options desactivated, which should be more or less comparable to current HEAD, i.e. there is no sorting nor flushing done, but this is not strictly speaking HEAD behavior. Probably I should get some figures with HEAD as well to check the "more or less" assumption.

Also I am not completely sure what's +- means in your data above?

The first figure before "+-" is the tps, the second after is its standard deviation computed in per-second traces. Some runs are very bad, with pgbench stuck at times, and result on stddev larger than the average, they ere noted with "*".

I understand your point and I also don't have any specific answer
for it at this moment, the point of worry is that it should not lead
to degradation of certain cases as compare to current algorithm.
The workload where it could effect is when your data doesn't fit
in shared buffers, but can fit in RAM.

Hmmm. My point of view is still that the logical priority is to optimize for disk IO first, then look for compatible RAM optimisations later.

It is not only about RAM optimisation which we can do later, but also
about avoiding regression in existing use-cases.


With Regards,
Amit Kapila.

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Amit,

>> [...]
>> Ok, I misunderstood your question. I thought you meant a dip between 1
>> client and 4 clients. I meant that when flush is turned on tps goes down by
>> 8% (743 to 681 tps) on this particular run.
>
> This 8% might matter if the dip is bigger with more clients and
> more aggressive workload.  Do you know what could lead to this
> dip, because if we know what is the reason than it will be more
> predictable to know if this is the max dip that could happen or it
> could lead to bigger dip in other cases.

I do not know the cause of the dip, and whether it would increase with 
more clients. I do not have a box for such tests. If someone can provided 
the box, I can provide test scripts:-)

The first, although higher, measure is really very unstable, with pg 
totally unresponsive (offline, really) at times.

I think that the flush option may always have a risk of (small) 
detrimental effects on tps, because there are two steady states: one with 
pg only doing wal-logged transactions with great tps, and one with pg 
doing the checkpoint at nought tps. If this is on the same disk, even at 
best the combination means that probably each operation will hamper the 
other one a little bit, so the combined tps performance would/could be 
lower than doing one after the other and having pg offline 50% of the 
time...

Please also note that this 8% "dip" is on 681 tps (with the dip) vs 198 tps 
(no options at all), a 3.4x improvement compared to pg's current behavior.

>> Basically tps improvements mostly come from "sort", and "flush" has
>> uncertain effects on tps (throuput), but much more on latency and
>> performance stability (lower late rate, lower standard deviation).
>
> I agree that performance stability is important, but not sure if it
> is good idea to sacrifice the throuput for it.

See discussion above. I think better stability may imply slightly lower 
throughput on some load. That is why there are options and DBA to choose 
them:-)

> If sort + flush always gives better results, then isn't it better to 
> perform these actions together under one option.

Sure, but that is not currently the case. Also what is done is very 
orthogonal, so I would tend to keep these separate. If one is always 
beneficial and it is wished that it should be always activated, then the 
option could be removed.

>> Hmmm. My point of view is still that the logical priority is to optimize
>> for disk IO first, then look for compatible RAM optimisations later.
>
> It is not only about RAM optimisation which we can do later, but also
> about avoiding regression in existing use-cases.

Hmmm. Currently I have not seen really significant regressions. I have 
seen some less good impact of some options on some loads.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
> Note that I'm not comparing to HEAD in the above tests, but with the new 
> options desactivated, which should be more or less comparable to current 
> HEAD, i.e. there is no sorting nor flushing done, but this is not strictly 
> speaking HEAD behavior. Probably I should get some figures with HEAD as well 
> to check the "more or less" assumption.

Just for answering myself on this point, I tried current HEAD vs patch v4 
with sort OFF + flush OFF: the figures are indeed quite comparable (see 
below), so although the internal implementation is different, the 
performance when both options are off is still a reasonable approximation 
of the performance without the patch, as I was expecting. What patch v4 
still does with OFF/OFF which is not done by HEAD is balancing writes 
among tablespaces, but there is only one disk on these tests so it does 
not matter.

tps & stddev full speed:
                            HEAD         OFF/OFF
 tiny 1 client          727 +- 227     221 +- 246
 small 1 client         158 +- 316     158 +- 325
 medium 1 client        148 +- 285     157 +- 326
 tiny 4 clients        2088 +- 786    2074 +- 699
 small 4 clients        192 +- 648     188 +- 560
 medium 4 clients       220 +- 654     220 +- 648

percent of late transactions:
                            HEAD       OFF/OFF
 tiny 4 clients 100 tps      6.31        6.67
 small 4c 100 tps           35.68       35.23
 medium 4c 100 tps          37.38       38.00
 tiny 4c 200 tps             9.06        9.10
 small 4c 200 tps           51.65       51.16
 medium 4c 200 tps          51.35       50.20
 tiny 4 clients 400 tps     11.4        10.5
 small 4 clients 400 tps    66.4        67.6

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-06-26 21:47:30 +0200, Fabien COELHO wrote:
> tps & stddev full speed:

>                             HEAD         OFF/OFF
> 
>  tiny 1 client          727 +- 227     221 +- 246

Huh?




Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

>>                             HEAD         OFF/OFF
>>
>>  tiny 1 client          727 +- 227     221 +- 246
>
> Huh?

Indeed, just to check that someone was reading this magnificent mail:-)

Just a typo because I reformatted the figures for simpler comparison. 221 
is really 721, quite close to 727.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
> Attached is an updated version of the patch which turns the sort option into 
> a boolean, and also include the sort time in the checkpoint log.
>
> There is still an open question about whether the sorting buffer allocation 
> is lost on some signals and should be reallocated in such event.

In such case, probably the allocation should be managed from 
CheckpointerMain, and the lazy allocation could remain for other callers 
(I guess just "initdb").


More open questions:
 - best name for the flush option (checkpoint_flush_to_disk,
     checkpoint_flush_on_write, checkpoint_flush, ...)

 - best name for the sort option (checkpoint_sort,
     checkpoint_sort_buffers, checkpoint_sort_ios, ...)


Other nice-to-have inputs:
 - tests on a non-linux system with posix_fadvise   (FreeBSD? others?)
 - tests on a large dedicated box


Attached are some scripts to help with testing, if someone feels like 
that:
 - cp_test.sh: run some tests, to adapt to one's setup...
 - cp_test_count.pl: show percent of late transactions
 - avg.py: show stats about stuff
   sh> grep 'progress: ' OUTPUT_FILE | cut -d' ' -f4 | avg.py

   *BEWARE* that if pgbench got stuck some "0" data are missing,
   look for the actual tps in the output file and for the line
   count to check whether it is the case... some currently submitted
   patch on pgbench helps, see https://commitfest.postgresql.org/5/199/

-- 
Fabien.

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello,

Attached is a very minor v5 update which does a rebase & completes the 
cleanup of doing a full sort instead of a chunked sort.

>> Attached is an updated version of the patch which turns the sort option 
>> into a boolean, and also include the sort time in the checkpoint log.
>> 
>> There is still an open question about whether the sorting buffer allocation 
>> is lost on some signals and should be reallocated in such event.
>
> In such case, probably the allocation should be managed from 
> CheckpointerMain, and the lazy allocation could remain for other callers (I 
> guess just "initdb").
>
>
> More open questions:
>
> - best name for the flush option (checkpoint_flush_to_disk,
>     checkpoint_flush_on_write, checkpoint_flush, ...)
>
> - best name for the sort option (checkpoint_sort,
>     checkpoint_sort_buffers, checkpoint_sort_ios, ...)
>
>
> Other nice-to-have inputs:
>
> - tests on a non-linux system with posix_fadvise
>   (FreeBSD? others?)
>
> - tests on a large dedicated box
>
>
> Attached are some scripts to help with testing, if someone's feels like that:
>
> - cp_test.sh: run some tests, to adapt to one's setup...
>
> - cp_test_count.pl: show percent of late transactions
>
> - avg.py: show stats about stuff
>
>   sh> grep 'progress: ' OUTPUT_FILE | cut -d' ' -f4 | avg.py
>
>   *BEWARE* that if pgbench got stuck some "0" data are missing,
>   look for the actual tps in the output file and for the line
>   count to check whether it is the case... some currently submitted
>   patch on pgbench helps, see https://commitfest.postgresql.org/5/199/

As this pgbench patch is now in master, pgbench is less likely to get 
stuck, but check nevertheless that the number of progress lines matches the 
expected number.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Heikki Linnakangas
Date:
On 07/26/2015 06:01 PM, Fabien COELHO wrote:
> Attached is very minor v5 update which does a rebase & completes the
> cleanup of doing a full sort instead of a chuncked sort.

Some thoughts on this:

* I think we should drop the "flush" part of this for now. It's not as
clearly beneficial as the sorting part, and adds a great deal more code
complexity. And it's orthogonal to the sorting patch, so we can deal
with it separately.

* Is it really necessary to parallelize the I/O among tablespaces? I can
see the point, but I wonder if it makes any difference in practice.

* Is there ever any harm in sorting the buffers? The GUC is useful for
benchmarking, but could we leave it out of the final patch?

* Do we need to worry about exceeding the 1 GB allocation limit in
AllocateCheckpointBufferIds? It's enough to have 2 TB of shared_buffers.
That's a lot, but it's not totally crazy these days that someone might
do that. At the very least, we need to lower the maximum of
shared_buffers so that you can't hit that limit.

I ripped out the "flushing" part, keeping only the sorting. I refactored
the logic in BufferSync() a bit. There's now a separate function,
nextCheckpointBuffer(), that returns the next buffer ID from the sorted
list. The tablespace-parallelization behaviour is encapsulated there,
keeping the code in BufferSync() much simpler. See attached. Needs some
minor cleanup and commenting still before committing, and I haven't done
any testing besides a simple "make check".

- Heikki


Attachment

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Heikki,

Thanks for having a look at the patch.

> * I think we should drop the "flush" part of this for now. It's not as 
> clearly beneficial as the sorting part, and adds a great deal more code 
> complexity. And it's orthogonal to the sorting patch, so we can deal with it 
> separately.

I agree that it is orthogonal and that the two features could be in 
distinct patches. The flush part is the first patch I really submitted 
because it has significant effect on latency, and I was told to mix it 
with sorting...

The flushing part really helps to keep "write stalls" under control in 
many cases, for instance:

- 400-tps 1-client (or 4 for medium) max 100-ms latency
     options   | percent of late transactions
  flush | sort | tiny | small | medium
    off |  off | 12.0 | 64.28 | 68.6
    off |   on | 11.3 | 22.05 | 22.6
     on |  off |  1.1 | 67.93 | 67.9
     on |   on |  0.6 |  3.24 |  3.1

The "percent of late transactions" is really the fraction of time the 
database is unreachable because of write stalls... So sort without flush 
is clearly not enough.
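
For readers not following the patch itself, the flush-on-write part 
essentially amounts to something like the sketch below. This is an 
illustration only, not the actual patch code: the helper name and its 
arguments are made up, and on Linux it needs _GNU_SOURCE and <fcntl.h>.

   /*
    * Sketch: after writing a range of a relation segment, ask the kernel
    * to start writeback for it right away, instead of letting dirty pages
    * accumulate until the final fsync() of the checkpoint.
    */
   static void
   hint_flush_range(int fd, off_t offset, off_t nbytes)
   {
   #if defined(HAVE_SYNC_FILE_RANGE)
       /* Linux: initiate asynchronous writeback of the written range */
       (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
   #else
       /* portable variant: hint that we are done with these pages for now */
       (void) posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
   #endif
   }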

Another thing suggested by Andres is to fsync as early as possible, but 
this is not a simple patch because it intermixes things which are currently 
in distinct parts of checkpoint processing, so I already decided that this 
would be for another submission.

> * Is it really necessary to parallelize the I/O among tablespaces? I can see 
> the point, but I wonder if it makes any difference in practice.

I think that if someone bothers to set up tablespaces there is no reason to 
defeat them behind her back. Without sorting you may hope that tablespaces 
will be touched randomly enough, but once buffers are sorted you can 
probably find cases where it would write on one tablespace and then on the 
other.

So I think that it really should be kept.
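
To make the balancing idea concrete, here is a tiny standalone 
illustration of writing round-robin across tablespaces. It is a toy 
sketch only: the patch's real logic works on the sorted buffer list and 
per-tablespace counts, and differs in detail.

   #include <stdio.h>

   typedef struct { const char *name; int num_left; } SpaceProgress;

   int main(void)
   {
       SpaceProgress spc[] = { { "pg_default", 5 }, { "ts1", 2 }, { "ts2", 3 } };
       int nb_spaces = 3;
       int active_spaces = nb_spaces;

       while (active_spaces != 0)
       {
           int i;
           for (i = 0; i < nb_spaces; i++)
           {
               if (spc[i].num_left == 0)
                   continue;
               /* stand-in for writing one sorted buffer of this tablespace */
               printf("write one buffer from %s\n", spc[i].name);
               if (--spc[i].num_left == 0)
                   active_spaces--;
           }
       }
       return 0;
   }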

> * Is there ever any harm in sorting the buffers? The GUC is useful for 
> benchmarking, but could we leave it out of the final patch?

I think that the performance figures show that it is basically always beneficial, 
so the guc may be left out. However on SSD it is unclear to me whether it 
is just a loss of time or whether it helps, say with wear-leveling. Maybe 
best to keep it? Anyway it is definitely needed for testing.

> * Do we need to worry about exceeding the 1 GB allocation limit in 
> AllocateCheckpointBufferIds? It's enough got 2 TB of shared_buffers. That's a 
> lot, but it's not totally crazy these days that someone might do that. At the 
> very least, we need to lower the maximum of shared_buffers so that you can't 
> hit that limit.

Yep.

> I ripped out the "flushing" part, keeping only the sorting. I refactored 
> the logic in BufferSync() a bit. There's now a separate function,
> nextCheckpointBuffer(), that returns the next buffer ID from the sorted 
> list. The tablespace-parallelization behaviour in encapsulated there,

I do not understand the new tablespace-parallelization logic: there is no 
test about the tablespace of the buffer in the selection process... Note 
that I did write a proof for the one I put in, and also did some detailed 
testing on the side because I'm always wary of proofs, especially mine:-)

I notice that you assume that table space numbers are always small and 
contiguous. Is that a fact? I was feeling more at ease with relying on a 
hash table to avoid such an assumption.

> keeping the code in BufferSync() much simpler. See attached. Needs some 
> minor cleanup and commenting still before committing, and I haven't done 
> any testing besides a simple "make check".

Hmmm..., just another detail, the patch does not sort:
  + if (checkpoint_sort && num_to_write > 1 && false)


I'll resubmit a patch with only the sorting part, and do the kind of 
restructuring you suggest which is a good thing.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-08-08 20:49:03 +0300, Heikki Linnakangas wrote:
> * I think we should drop the "flush" part of this for now. It's not as
> clearly beneficial as the sorting part, and adds a great deal more code
> complexity. And it's orthogonal to the sorting patch, so we can deal with it
> separately.

I don't agree. For one I've seen it cause rather big latency
improvements, and we're horrible at that. But more importantly I think
the requirements of the flush logic influences how exactly the sorting
is done. Splitting them will just make it harder to do the flushing in a
not too big patch.

> * Is it really necessary to parallelize the I/O among tablespaces? I can see
> the point, but I wonder if it makes any difference in practice.

Today it's somewhat common to have databases that are bottlenecked on
write IO and all those writes being done by the checkpointer. If we
suddenly do the writes to individual tablespaces separately and
sequentially we'll be bottlenecked on the peak IO of a single
tablespace.

> * Is there ever any harm in sorting the buffers? The GUC is useful for
> benchmarking, but could we leave it out of the final patch?

Agreed.

> * Do we need to worry about exceeding the 1 GB allocation limit in
> AllocateCheckpointBufferIds? It's enough got 2 TB of shared_buffers. That's
> a lot, but it's not totally crazy these days that someone might do that. At
> the very least, we need to lower the maximum of shared_buffers so that you
> can't hit that limit.

We can just use the _huge variant?
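
(For illustration, a minimal sketch of what that could look like, keeping 
the worst-case NBuffers-sized array; MemoryContextAllocHuge bypasses the 
usual 1 GB palloc limit. Not a tested change:)

    CheckpointBufferIds = (int *)
        MemoryContextAllocHuge(TopMemoryContext,
                               (Size) NBuffers * sizeof(int));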

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Andres Freund
Date:
Hi,

On 2015-08-08 20:49:03 +0300, Heikki Linnakangas wrote:
> I ripped out the "flushing" part, keeping only the sorting. I refactored the
> logic in BufferSync() a bit. There's now a separate function,
> nextCheckpointBuffer(), that returns the next buffer ID from the sorted
> list. The tablespace-parallelization behaviour in encapsulated there,
> keeping the code in BufferSync() much simpler. See attached. Needs some
> minor cleanup and commenting still before committing, and I haven't done any
> testing besides a simple "make check".

Thought it'd be useful to review the current version as well. Some of
what I'm commenting on you'll probably already have thought of under the
label of "minor cleanup".

>  /*
> + * Array of buffer ids of all buffers to checkpoint.
> + */
> +static int *CheckpointBufferIds = NULL;
> +
> +/* Compare checkpoint buffers
> + */

Should be at the beginning of the file. There's a bunch more cases of that.


> +/* Compare checkpoint buffers
> + */
> +static int bufcmp(const int * pa, const int * pb)
> +{
> +    BufferDesc
> +        *a = GetBufferDescriptor(*pa),
> +        *b = GetBufferDescriptor(*pb);
> +
> +    /* tag: rnode, forkNum (different files), blockNum
> +     * rnode: { spcNode (ignore: not really needed),
> +     *   dbNode (ignore: this is a directory), relNode }
> +     * spcNode: table space oid, not that there are at least two
> +     * (pg_global and pg_default).
> +     */
> +    /* compare relation */
> +    if (a->tag.rnode.spcNode < b->tag.rnode.spcNode)
> +        return -1;
> +    else if (a->tag.rnode.spcNode > b->tag.rnode.spcNode)
> +        return 1;
> +    if (a->tag.rnode.relNode < b->tag.rnode.relNode)
> +        return -1;
> +    else if (a->tag.rnode.relNode > b->tag.rnode.relNode)
> +        return 1;
> +    /* same relation, compare fork */
> +    else if (a->tag.forkNum < b->tag.forkNum)
> +        return -1;
> +    else if (a->tag.forkNum > b->tag.forkNum)
> +        return 1;
> +    /* same relation/fork, so same segmented "file", compare block number
> +     * which are mapped on different segments depending on the number.
> +     */
> +    else if (a->tag.blockNum < b->tag.blockNum)
> +        return -1;
> +    else /* should not be the same block anyway... */
> +        return 1;
> +}

This definitely needs comments about ignoring the normal buffer header
locking.

Why are we ignoring the database directory? I doubt it'll make a huge
difference, but grouping metadata affecting operations by directory
helps.

> +
> +static void
> +AllocateCheckpointBufferIds(void)
> +{
> +    /* Safe worst case allocation, all buffers belong to the checkpoint...
> +     * that is pretty unlikely.
> +     */
> +    CheckpointBufferIds = (int *) palloc(sizeof(int) * NBuffers);
> +}

(wrong comment style...)

Heikki, you were concerned about the size of the allocation of this,
right? I don't think it's relevant - we used to allocate an array of
that size for the backend's private buffer pin array until 9.5, so in
theory we should be safe against that. NBuffers is limited to INT_MAX/2
in guc.c, which ought to be sufficient?

> +    /*
> +     * Lazy allocation: this function is called through the checkpointer,
> +     * but also by initdb. Maybe the allocation could be moved to the callers.
> +     */
> +    if (CheckpointBufferIds == NULL)
> +        AllocateCheckpointBufferIds();
> +
> 

I don't think it's a good idea to allocate this on every round. That
just means a lot of page table entries have to be built and torn down
regularly. It's not like checkpoints only run for 1% of the time or
such.

FWIW, I still think it's a much better idea to allocate the memory once
in shared buffers. It's not like that makes us need more memory overall,
and it'll be huge page allocations if configured. I also think that
sooner rather than later we're going to need more than one process
flushing buffers, and then it'll need to be moved there.

> +    /*
> +     * Sort buffer ids to help find sequential writes.
> +     *
> +     * Note: buffers are not locked in anyway, but that does not matter,
> +     * this sorting is really advisory, if some buffer changes status during
> +     * this pass it will be filtered out later.  The only necessary property
> +     * is that marked buffers do not move elsewhere.
> +     */

That reasoning makes it impossible to move the fsyncing of files into
the loop (whenever we move to a new file). That's not nice. The
formulation with "necessary property" doesn't seem very clear to me?

How about:
/*
 * Note: Buffers are not locked in any way during sorting, but that's ok:
 * A change in the buffer header is only relevant when it changes the
 * buffer's identity. If the identity has changed it'll have been
 * written out by BufferAlloc(), so there's no need for checkpointer to
 * write it out anymore. The buffer might also get written out by a
 * backend or bgwriter, but that's equally harmless.
 */
 

>          Also, qsort implementation
> +     * should be resilient to occasional contradictions (cmp(a,b) != -cmp(b,a))
> +     * because of these possible concurrent changes.

Hm. Is that actually the case for our qsort implementation? If the pivot
element changes its identity won't the result be pretty much random?

> +
> +    if (checkpoint_sort && num_to_write > 1 && false)
> +    {

&& false - Huh?

> +        qsort(CheckpointBufferIds, num_to_write,  sizeof(int),
> +                  (int(*)(const void *, const void *)) bufcmp);
> +

Ick, I'd rather move the typecasts to the comparator.

> +        for (i = 1; i < num_to_write; i++)
> +        {
> +            bufHdr = GetBufferDescriptor(CheckpointBufferIds[i]);
> +
> +            spc = bufHdr->tag.rnode.spcNode;
> +            if (spc != lastspc && (bufHdr->flags & BM_CHECKPOINT_NEEDED) != 0)
> +            {
> +                if (allocatedSpc <= j)
> +                {
> +                    allocatedSpc = j + 5;
> +                    spcStatus = (TableSpaceCheckpointStatus *)
> +                        repalloc(spcStatus, sizeof(TableSpaceCheckpointStatus) * allocatedSpc);
> +                }
> +
> +                spcStatus[j].index_end = spcStatus[j + 1].index = i;
> +                j++;
> +                lastspc = spc;
> +            }
> +        }
> +        spcStatus[j].index_end = num_to_write;

This really deserves some explanation.

Regards,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

Thanks for your comments. Some answers and new patches included.

>> + /*
>> + * Array of buffer ids of all buffers to checkpoint.
>> + */
>> +static int *CheckpointBufferIds = NULL;
>
> Should be at the beginning of the file. There's a bunch more cases of that.

done.

>> +/* Compare checkpoint buffers
>> + */
>> +static int bufcmp(const int * pa, const int * pb)
>> +{
>> +    BufferDesc
>> +        *a = GetBufferDescriptor(*pa),
>> +        *b = GetBufferDescriptor(*pb);
>
> This definitely needs comments about ignoring the normal buffer header
> locking.

Added.

> Why are we ignoring the database directory? I doubt it'll make a huge
> difference, but grouping metadata affecting operations by directory
> helps.

I wanted to do the minimal comparisons to order buffers per file, so I 
skipped everything else. My idea of a checkpoint is a lot of data in a few 
files (at least compared to the data...), so I do not think that it is 
worth it. I may be proven wrong!

>> +static void
>> +AllocateCheckpointBufferIds(void)
>> +{
>> +    /* Safe worst case allocation, all buffers belong to the checkpoint...
>> +     * that is pretty unlikely.
>> +     */
>> +    CheckpointBufferIds = (int *) palloc(sizeof(int) * NBuffers);
>> +}
>
> (wrong comment style...)

Fixed.

> Heikki, you were concerned about the size of the allocation of this, 
> right? I don't think it's relevant - we used to allocate an array of 
> that size for the backend's private buffer pin array until 9.5, so in 
> theory we should be safe against that. NBuffers is limited to INT_MAX/2 
> in guc.c, which ought to be sufficient?

I think that there is no issue with the current shared_buffers limit. I 
could allocate and use 4 GB on my laptop without problem. I added a cast 
to ensure that unsigned int are used for the size computation.

>> +    /*
>> +     * Lazy allocation: this function is called through the checkpointer,
>> +     * but also by initdb. Maybe the allocation could be moved to the callers.
>> +     */
>> +    if (CheckpointBufferIds == NULL)
>> +        AllocateCheckpointBufferIds();
>>
>
> I don't think it's a good idea to allocate this on every round.
> That just means a lot of page table entries have to be built and torn 
> down regularly. It's not like checkpoints only run for 1% of the time or 
> such.

Sure. It is not allocated on every round, it is allocated once on the 
first checkpoint, the variable tested is static. There is no free. Maybe
the allocation could be moved to the callers, though.

> FWIW, I still think it's a much better idea to allocate the memory once
> in shared buffers.

Hmmm. The memory does not need to be shared with other processes?

> It's not like that makes us need more memory overall, and it'll be huge 
> page allocations if configured. I also think that sooner rather than 
> later we're going to need more than one process flushing buffers, and 
> then it'll need to be moved there.

That is an argument. I think that it could wait for the need to actually 
arise.

>> +    /*
>> +     * Sort buffer ids to help find sequential writes.
>> +     *
>> +     * Note: buffers are not locked in anyway, but that does not matter,
>> +     * this sorting is really advisory, if some buffer changes status during
>> +     * this pass it will be filtered out later.  The only necessary property
>> +     * is that marked buffers do not move elsewhere.
>> +     */
>
> That reasoning makes it impossible to move the fsyncing of files into 
> the loop (whenever we move to a new file). That's not nice.

I do not see why. Moving fsync ahead is definitely an idea that you 
already pointed out, I have given it some thought, and it would require 
a careful implementation and some restructuring. For instance, you do not 
want to issue fsync right after having done writes, you want to wait a 
little bit so that the system had time to write the buffers to disk.

> The formulation with "necessary property" doesn't seem very clear to me?

Removed.

> How about:
> /*
>  * Note: Buffers are not locked in any way during sorting, but that's ok:
>  * A change in the buffer header is only relevant when it changes the
>  * buffer's identity. If the identity has changed it'll have been
>  * written out by BufferAlloc(), so there's no need for checkpointer to
>  * write it out anymore. The buffer might also get written out by a
>  * backend or bgwriter, but that's equally harmless.
>  */

This new version included.

>>          Also, qsort implementation
>> +     * should be resilient to occasional contradictions (cmp(a,b) != -cmp(b,a))
>> +     * because of these possible concurrent changes.
>
> Hm. Is that actually the case for our qsort implementation?

I think that it is hard to write a qsort which would fail that. That would 
mean that it would compare the same items twice, which would be 
inefficient.

> If the pivot element changes its identity won't the result be pretty 
> much random?

That would be a very unlikely event, given the short time spent in qsort. 
Anyway, this is not a problem, and is the beauty of the "advisory" sort: 
if the sort is wrong because of any such rare event, it just means that the 
buffers would not be strictly in file order, which is currently the 
case.... Well, too bad, but the correctness of the checkpoint does not 
depend on it, that just means that the checkpointer would come back twice 
on one file, no big deal.

>> +    if (checkpoint_sort && num_to_write > 1 && false)
>> +    {
>
> && false - Huh?

Probably Heikki tests.

>> +        qsort(CheckpointBufferIds, num_to_write,  sizeof(int),
>> +                  (int(*)(const void *, const void *)) bufcmp);
>> +
>
> Ick, I'd rather move the typecasts to the comparator.

Done.

>> +        for (i = 1; i < num_to_write; i++)
>> +        { [...]
>
> This really deserves some explanation.

I think that this version does not work. I've reinstated my version and a 
lot of comments in the attached patches.

Please find attached two combined patches which provide both features one 
after the other.

(a) shared buffer sorting
 - I took Heikki's hint about restructuring the buffer selection in a
   separate function, which makes the code much more readable.
 - I also followed Heikki's intention (I think) that only active
   table spaces are considered in the switching loop.

(b) add asynchronous flushes on top of the previous sort patch



I think that the many performance results I reported show that the 
improvements need both features, and one feature without the other is much 
less effective at improving responsiveness, which is my primary concern.
The TPS improvements are just a side effect.

I did not remove the gucs: I think they could be kept so that people can 
test around with them, and they may be removed in the future? I would be 
also fine if they are removed.

There are a lot of comments in some places. I think that they should be 
kept because the code is subtle.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-08-10 19:07:12 +0200, Fabien COELHO wrote:
> I think that there is no issue with the current shared_buffers limit. I
> could allocate and use 4 GB on my laptop without problem. I added a cast to
> ensure that unsigned int are used for the size computation.

You can't allocate 4GB with palloc(), it has a builtin limit against
allocating more than 1GB.

> >>+    /*
> >>+     * Lazy allocation: this function is called through the checkpointer,
> >>+     * but also by initdb. Maybe the allocation could be moved to the callers.
> >>+     */
> >>+    if (CheckpointBufferIds == NULL)
> >>+        AllocateCheckpointBufferIds();
> >>
> >
> >I don't think it's a good idea to allocate this on every round.
> >That just means a lot of page table entries have to be built and torn down
> >regularly. It's not like checkpoints only run for 1% of the time or such.
> 
> Sure. It is not allocated on every round, it is allocated once on the first
> checkpoint, the variable tested is static. There is no free. Maybe
> the allocation could be moved to the callers, though.

Well, then every time the checkpointer is restarted.

> >FWIW, I still think it's a much better idea to allocate the memory once
> >in shared buffers.
> 
> Hmmm. The memory does not need to be shared with other processes?

The point is that it's done at postmaster startup, and we're pretty much
guaranteed that the memory will be available.

> >It's not like that makes us need more memory overall, and it'll be huge
> >page allocations if configured. I also think that sooner rather than later
> >we're going to need more than one process flushing buffers, and then it'll
> >need to be moved there.
> 
> That is an argument. I think that it could wait for the need to actually
> arise.

Huge pages are used today.

> >>+    /*
> >>+     * Sort buffer ids to help find sequential writes.
> >>+     *
> >>+     * Note: buffers are not locked in anyway, but that does not matter,
> >>+     * this sorting is really advisory, if some buffer changes status during
> >>+     * this pass it will be filtered out later.  The only necessary property
> >>+     * is that marked buffers do not move elsewhere.
> >>+     */
> >
> >That reasoning makes it impossible to move the fsyncing of files into the
> >loop (whenever we move to a new file). That's not nice.
> 
> I do not see why.

Because it means that the sorting isn't necessarily correct. I.e. we
can't rely on it to determine whether a file has already been fsynced.

> >>         Also, qsort implementation
> >>+     * should be resilient to occasional contradictions (cmp(a,b) != -cmp(b,a))
> >>+     * because of these possible concurrent changes.
> >
> >Hm. Is that actually the case for our qsort implementation?
> 
> I think that it is hard to write a qsort which would fail that. That would
> mean that it would compare the same items twice, which would be inefficient.

What? The same two elements aren't frequently compared pairwise with
each other, but of course an individual element is frequently compared
with other elements. Consider what happens when the chosen pivot element
changes its identity after having already divided half of the elements. The
two partitions will not be divided in any meaningful way anymore. I don't
see how this will result in a meaningful sort.

> >If the pivot element changes its identity won't the result be pretty much
> >random?
> 
> That would be a very unlikely event, given the short time spent in
> qsort.

Meh, we don't want to rely on "likeliness" on such things.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

> You can't allocate 4GB with palloc(), it has a builtin limit against
> allocating more than 1GB.

Argh, too bad, I assumed very naively that palloc was malloc in disguise.

>> [...]
> Well, then everytime the checkpointer is restarted.

Hm...

> The point is that it's done at postmaster startup, and we're pretty much
> guaranteed that the memory will availabl.e.

Ok ok, I stop resisting... I'll have a look.

Would it also fix the 1 GB palloc limit on the same go? I guess so...


>>> That reasoning makes it impossible to move the fsyncing of files into the
>>> loop (whenever we move to a new file). That's not nice.
>>
>> I do not see why.
>
> Because it means that the sorting isn't necessarily correct. I.e. we
> can't rely on it to determine whether a file has already been fsynced.

Ok, I understand your point.

Then the file would be fsynced twice: if the fsync is done properly (data 
have already been flushed to disk) then it would not cost much, and doing 
it sometimes twice on some file would not be a big issue. The code could 
also detect such event and log a warning, which would give a hint about 
how often it occurs in practice.

>>> Hm. Is that actually the case for our qsort implementation?
>>
>> I think that it is hard to write a qsort which would fail that. That would
>> mean that it would compare the same items twice, which would be inefficient.
>
> What? The same two elements aren't frequently compared pairwise with 
> each other, but of course an individual element is frequently compared 
> with other elements.

Sure.

> Consider what happens when the chosen pivot element changes its identity 
> after already dividing half. The two partitions will not be divided in 
> any meaning full way anymore. I don't see how this will results in a 
> meaningful sort.

It would be partly meaningful, which is enough for performance, and does 
not matter for correctness: currently buffers are not sorted at all and it 
works, even if it does not work well.

>>> If the pivot element changes its identity won't the result be pretty much
>>> random?
>>
>> That would be a very unlikely event, given the short time spent in
>> qsort.
>
> Meh, we don't want to rely on "likeliness" on such things.

My main argument is that even if it occurs, and the qsort result is partly 
wrong, it does not change correctness, it just means that the actual writes 
will be less in order than wished. If it occurs, one pivot separation 
would be quite strange, but then others would be right, so the buffers 
would be "partly sorted".

Another issue I see is that even if buffers are locked within cmp, the 
status may change between two cmp... I do not think that locking all 
buffers for sorting them is an option. So on the whole, I think that 
locking buffers for sorting is probably not possible with the simple (and 
efficient) lightweight approach used in the patch.

The good news, as I argued before, is that the order is only advisory to 
help with performance, but the correctness is really that all checkpoint 
buffers are written and fsync is called in the end, and does not depend on 
the buffer order. That is how it currently works anyway.

If you block on this then I'll put a heavy weight approach, but that would 
be a waste of memory in my opinion, hence my argumentation for the 
lightweight approach.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
> Ok ok, I stop resisting... I'll have a look.

Here is a v7 a&b version which uses shared memory instead of palloc.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Andres Freund
Date:
On August 10, 2015 8:24:21 PM GMT+02:00, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>Hello Andres,
>
>> You can't allocate 4GB with palloc(), it has a builtin limit against
>> allocating more than 1GB.
>
>Argh, too bad, I assumed very naively that palloc was malloc in
>disguise.

It is, but there's some layering (memory pools/contexts) on top. You can get huge allocations with palloc_huge.

>Then the file would be fsynced twice: if the fsync is done properly (data
>have already been flushed to disk) then it would not cost much, and doing
>it sometimes twice on some file would not be a big issue. The code could
>also detect such event and log a warning, which would give a hint about
>how often it occurs in practice.

Right. At the cost of keeping track of all files...



>>>> If the pivot element changes its identity won't the result be pretty much
>>>> random?
>>>
>>> That would be a very unlikely event, given the short time spent in
>>> qsort.
>>
>> Meh, we don't want to rely on "likeliness" on such things.
>
>My main argument is that even if it occurs, and the qsort result is partly
>wrong, it does not change correctness, it just means that the actual writes
>will be less in order than wished. If it occurs, one pivot separation
>would be quite strange, but then others would be right, so the buffers
>would be "partly sorted".

It doesn't matter for correctness today, correct. But it makes it impossible to rely on it, too.

>Another issue I see is that even if buffers are locked within cmp, the 
>status may change between two cmp...

Sure. That's not what I'm suggesting. Earlier versions of the patch kept an
array of buffer headers exactly because of that.

>I do not think that locking all
>buffers for sorting them is an option. So on the whole, I think that
>locking buffers for sorting is probably not possible with the simple
>(and efficient) lightweight approach used in the patch.

Yes, the other version has a higher space overhead. I'm not convinced that's
meaningful in comparison to shared buffers in space.

And rather doubtful it's a loss performance wise in a loaded server. All the
buffer headers are touched on other cores and doing the sort with indirection
will greatly increase bus traffic.
 

>The good news, as I argued before, is that the order is only advisory to
>help with performance, but the correctness is really that all checkpoint
>buffers are written and fsync is called in the end, and does not depend on
>the buffer order. That is how it currently works anyway.

It's not particularly desirable to have a performance feature that works less
well if the server is heavily and concurrently loaded. The likelihood of bogus
sort results will increase with the churn rate in shared buffers.
 

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: checkpointer continuous flushing

From
Michael Paquier
Date:
On Tue, Aug 11, 2015 at 4:28 AM, Andres Freund wrote:
> On August 10, 2015 8:24:21 PM GMT+02:00, Fabien COELHO wrote:
>>> You can't allocate 4GB with palloc(), it has a builtin limit against
>>> allocating more than 1GB.
>>
>>Argh, too bad, I assumed very naively that palloc was malloc in
>>disguise.
>
> It is, but there's some layering (memory pools/contexts) on top. You can get huge allocations with palloc_huge.

palloc_huge does not exist yet ;)
There is either repalloc_huge or palloc_extended now, though
implementing one would be trivial.
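
As an illustration of that route (a sketch only, since the patch later moved
to shared memory instead), the worst-case array could be requested through
the huge-allocation API along these lines:

  /* bypass the 1 GB MaxAllocSize cap with the "huge" allocation flag */
  CheckpointBufferIds = (int *)
      palloc_extended(sizeof(int) * (Size) NBuffers, MCXT_ALLOC_HUGE);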
-- 
Michael



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

> [...] Right. At the cost of keeping track of all files...

Sure. Pg already tracks all files, and probably some more tracking would 
be necessary for an early fsync feature to know which files have already 
been fsync'ed and which have not.

> Yes, the other version has a higher space overhead.

Yep, this is my concern.

> I'm not convinced that's meaningful in comparison to shared buffers in 
> space. And rather doubtful it's a loss performance wise in a loaded 
> server. All the buffer headers are touched on other cores and doing the 
> sort with indirection will greatly increase bus traffic.

The measures I collected and reported showed that the sorting time is 
basically insignificant, so bus traffic induced by sorting does not seem 
to be an issue.

> [...] It's not particularly desirable to have a performance feature that 
> works less well if the server is heavily and concurrently loaded. The 
> likelihood of bogus sort results will increase with the churn rate in 
> shared buffers.

Hm.

In conclusion I'm not convinced that it is worth the memory, but I'm also 
tired of arguing, and hopefully nobody else cares about a few more bytes 
per shared_buffers, so why should I care?

Here is a v8: I reduced the memory overhead of the "heavy weight" approach 
from 24 to 16 bytes per buffer, so it is medium weight:-). It might be 
compacted further down to 12 bytes by combining the 2 bits of forkNum 
either with relNode or blockNum, and using a uint64_t comparison field with
all data so that the comparison code would be simpler and faster.
I also fixed the computation of the shmem size which I had not updated
when switching to shmem.
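
To make that idea concrete, here is one possible sketch of such a compacted
entry. It is illustrative only, not taken from the patch: folding forkNum
into the block number caps the usable block range at 2^30, and the 12-byte
size assumes 4-byte alignment or explicit packing, since the default 8-byte
alignment of the uint64 would pad the struct back to 16 bytes:

  typedef struct CkptSortEntry
  {
      uint64      key;        /* relNode (32) | forkNum (2) | blockNum (30) */
      int         buf_id;     /* index into shared buffers */
  } CkptSortEntry;

  #define CKPT_SORT_KEY(rel, fork, blk) \
      (((uint64) (rel) << 32) | ((uint64) (fork) << 30) | \
       ((uint64) (blk) & UINT64CONST(0x3FFFFFFF)))

  static int
  ckpt_entry_cmp(const void *pa, const void *pb)
  {
      uint64      a = ((const CkptSortEntry *) pa)->key;
      uint64      b = ((const CkptSortEntry *) pb)->key;

      return (a < b) ? -1 : (a > b) ? 1 : 0;
  }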

The patches still include the two gucs, but it is easy to remove one or the 
other. They are useful if someone wants to test. The default is on for 
sort, and off for flush. Maybe it should be on for both.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
> Here is a v8,

I collected a few performance figures with this patch on an old box with 8 
cores, 16 GB, RAID 1 HDD, under Ubuntu precise.
  postgresql.conf:
    shared_buffers = 4GB
    checkpoint_timeout = 15min
    checkpoint_completion_target = 0.8
    max_wal_size = 4GB

  init> pgbench -i -s 250
  warmup> pgbench -T 1200 -M prepared -S -j 2 -c 4

  # 400 tps throttled "simple update" test
  sh> pgbench -M prepared -N -P 1 -T 4000 -R 400 -L 100 -j 2 -c 4

    sort/flush : percent of skipped/late transactions
     on   on   :  2.7
     on   off  : 16.2
     off  on   : 68.4
     off  off  : 68.7

  # 200 tps
  sh> pgbench -M prepared -N -P 1 -T 4000 -R 200 -L 100 -j 2 -c 4

    sort/flush : percent of skipped/late transactions
     on   on   :  2.7
     on   off  :  9.5
     off  on   : 47.4
     off  off  : 48.8
 

The large "percent of skipped/late transactions" is to be understood as 
"fraction of time with postgresql offline because of a write stall".
  # full speed 1 client
  sh> pgbench -M prepared -N -P 1 -T 4000

    sort/flush : tps avg & stddev (percent of time beyond 10.0 tps)
     on   on   : 631 +- 131 (0.1%)
     on   off  : 564 +- 303 (12.0%)
     off  on   : 167 +- 315 (76.8%) # stuck...
     off  off  : 177 +- 305 (71.2%) # ~ current pg

  # full speed 2 threads 4 clients
  sh> pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4

    sort/flush : tps avg & stddev (percent of time below 10.0 tps)
     on   on   : 1058 +- 455 (0.1%)
     on   off  : 1056 +- 942 (32.8%)
     off  on   :  170 +- 500 (88.3%) # stuck...
     off  off  :  209 +- 506 (82.0%) # ~ current pg
 

The combined features provide a tps speedup of 3-5 on these runs, and 
allow some control over write stalls. Flushing is not effective on 
unsorted buffers, at least in these examples.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
Hi Fabien,

On 2015-08-12 22:34:59 +0200, Fabien COELHO wrote:
>     sort/flush : tps avg & stddev (percent of time beyond 10.0 tps)
>      on   on   : 631 +- 131 (0.1%)
>      on   off  : 564 +- 303 (12.0%)
>      off  on   : 167 +- 315 (76.8%) # stuck...
>      off  off  : 177 +- 305 (71.2%) # ~ current pg

What exactly do you mean with 'stuck'?

- Andres



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-08-11 17:15:22 +0200, Fabien COELHO wrote:
> +void
> +PerformFileFlush(FileFlushContext * context)
> +{
> +    if (context->ncalls != 0)
> +    {
> +        int rc;
> +
> +#if defined(HAVE_SYNC_FILE_RANGE)
> +
> +        /* Linux: tell the memory manager to move these blocks to io so
> +         * that they are considered for being actually written to disk.
> +         */
> +        rc = sync_file_range(context->fd, context->offset, context->nbytes,
> +                             SYNC_FILE_RANGE_WRITE);
> +
> +#elif defined(HAVE_POSIX_FADVISE)
> +
> +        /* Others: say that data should not be kept in memory...
> +         * This is not exactly what we want to say, because we want to write
> +         * the data for durability but we may need it later nevertheless.
> +         * It seems that Linux would free the memory *if* the data has
> +         * already been written to disk, else the "dontneed" call is ignored.
> +         * For FreeBSD this may have the desired effect of moving the
> +         * data to the io layer, although the system does not seem to
> +         * take into account the provided offset & size, so it is rather
> +         * rough...
> +         */
> +        rc = posix_fadvise(context->fd, context->offset, context->nbytes,
> +                           POSIX_FADV_DONTNEED);
> +
> +#endif
> +
> +        if (rc < 0)
> +            ereport(ERROR,
> +                    (errcode_for_file_access(),
> +                     errmsg("could not flush block " INT64_FORMAT
> +                            " on " INT64_FORMAT " blocks in file \"%s\": %m",
> +                            context->offset / BLCKSZ,
> +                            context->nbytes / BLCKSZ,
> +                            context->filename)));
> +    }

I'm a bit wary that this might cause significant regressions on
platforms not supporting sync_file_range, but support posix_fadvise()
for workloads that are bigger than shared_buffers. Consider what happens
if the workload does *not* fit into shared_buffers but *does* fit into
the OS's buffer cache. Suddenly reads will go to disk again, no?

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

> On 2015-08-12 22:34:59 +0200, Fabien COELHO wrote:
>>     sort/flush : tps avg & stddev (percent of time beyond 10.0 tps)
>>      on   on   : 631 +- 131 (0.1%)
>>      on   off  : 564 +- 303 (12.0%)
>>      off  on   : 167 +- 315 (76.8%) # stuck...
>>      off  off  : 177 +- 305 (71.2%) # ~ current pg
>
> What exactly do you mean with 'stuck'?

I mean that during the I/O storms induced by the checkpoint, pgbench 
sometimes gets stuck, i.e. does not report its progress every second (I 
run with "-P 1"). This occurs when sort is off, either with or without 
flush; for instance an extract from the off/off medium run:
 progress: 573.0 s, 5.0 tps, lat 933.022 ms stddev 83.977
 progress: 574.0 s, 777.1 tps, lat 7.161 ms stddev 37.059
 progress: 575.0 s, 148.9 tps, lat 4.597 ms stddev 10.708
 progress: 814.4 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 815.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 816.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 817.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 818.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 819.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 820.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 821.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 822.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 823.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 824.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 825.0 s, 0.0 tps, lat -nan ms stddev -nan
 progress: 826.0 s, 0.0 tps, lat -nan ms stddev -nan
 

There is a 239.4 second gap in the pgbench output. This occurs from time to 
time and may represent a significant part of the run, and I count these 
"stuck" times as 0 tps. Sometimes pgbench is stuck performance wise but 
manages nevertheless to report a "0.0 tps" every second, as above once it 
gets unstuck.

The actual origin of the issue with a stuck client (pgbench, libpq, OS, 
postgres...) is unclear to me, but the whole system does not behave well 
under an I/O storm anyway, and I have not succeeded in understanding where 
pgbench is stuck when it does not report its progress. I tried some runs 
with gdb but it did not get stuck and reported a lot of "0.0 tps" during 
the storms.


Here are a few more figures with the v8 version of the patch, on a host 
with 8 cores, 16 GB, RAID 1 HDD, under Ubuntu precise. I already reported 
the medium case; the small case was run afterwards.
  small postgresql.conf:
    shared_buffers = 2GB
    checkpoint_timeout = 300s # this is the default
    checkpoint_completion_target = 0.8
    # initialization: pgbench -i -s 120

  medium postgresql.conf: ## ALREADY REPORTED
    shared_buffers = 4GB
    checkpoint_timeout = 15min
    checkpoint_completion_target = 0.8
    max_wal_size = 4GB
    # initialization: pgbench -i -s 250

  warmup> pgbench -T 1200 -M prepared -S -j 2 -c 4

  # 400 tps throttled test
  sh> pgbench -M prepared -N -P 1 -T 4000 -R 400 -L 100 -j 2 -c 4

      options  / percent of skipped/late transactions
    sort/flush /   small  medium
     on   on   :    3.5    2.7
     on   off  :   24.6   16.2
     off  on   :   66.1   68.4
     off  off  :   63.2   68.7

  # 200 tps throttled test
  sh> pgbench -M prepared -N -P 1 -T 4000 -R 200 -L 100 -j 2 -c 4

      options  / percent of skipped/late transactions
    sort/flush /   small  medium
     on   on   :    1.9    2.7
     on   off  :   14.3    9.5
     off  on   :   45.6   47.4
     off  off  :   47.4   48.8

  # 100 tps throttled test
  sh> pgbench -M prepared -N -P 1 -T 4000 -R 100 -L 100 -j 2 -c 4

      options  / percent of skipped/late transactions
    sort/flush /   small  medium
     on   on   :    0.9    1.8
     on   off  :    9.3    7.9
     off  on   :    5.0   13.0
     off  off  :   31.2   31.9

  # full speed 1 client
  sh> pgbench -M prepared -N -P 1 -T 4000

      options  / tps avg & stddev (percent of time below 10.0 tps)
    sort/flush /    small              medium
     on   on   : 564 +- 148 ( 0.1%)   631 +- 131 ( 0.1%)
     on   off  : 470 +- 340 (21.7%)   564 +- 303 (12.0%)
     off  on   : 157 +- 296 (66.2%)   167 +- 315 (76.8%)
     off  off  : 154 +- 251 (61.5%)   177 +- 305 (71.2%)

  # full speed 2 threads 4 clients
  sh> pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4

      options  / tps avg & stddev (percent of time below 10.0 tps)
    sort/flush /    small              medium
     on   on   : 757 +- 417 ( 0.1%)  1058 +- 455 ( 0.1%)
     on   off  : 752 +- 893 (48.4%)  1056 +- 942 (32.8%)
     off  on   : 173 +- 521 (83.0%)   170 +- 500 (88.3%)
     off  off  : 199 +- 512 (82.5%)   209 +- 506 (82.0%)
 

In all cases, the "sort on & flush on" provides the best results, with tps 
speedup from 3-5, and overall high responsiveness (& lower latency).

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:

<Oops, stalled post, sorry wrong "From", resent..>


Hello Andres,

>> +        rc = posix_fadvise(context->fd, context->offset, [...]
> 
> I'm a bit wary that this might cause significant regressions on
> platforms not supporting sync_file_range, but support posix_fadvise()
> for workloads that are bigger than shared_buffers. Consider what happens
> if the workload does *not* fit into shared_buffers but *does* fit into
> the OS's buffer cache. Suddenly reads will go to disk again, no?

That is an interesting question!

My current thinking is "maybe yes, maybe no":-), as it may depend on the OS 
implementation of posix_fadvise, so it may differ between OS.

This is a reason why I think that flushing should be kept a guc, even if the 
sort guc is removed and always on. The sync_file_range implementation is 
clearly always very beneficial for Linux, and the posix_fadvise may or may 
not induce a good behavior depending on the underlying system.

This is also a reason why the default value for the flush guc is currently 
set to false in the patch. The documentation should advise to turn it on for 
Linux and to test otherwise. Or if Linux is assumed to be often a host, then 
maybe to set the default to on and to suggest that on some systems it may be 
better to have it off. (Another reason to keep it "off" is that I'm not sure 
about what happens with such HD flushing features on virtual servers).

Overall, I'm not pessimistic, because I've seen I/O storms on a FreeBSD host 
and it was as bad as Linux (namely the database and even the box was offline 
for long minutes...), and if you can avoid that having to read back some data 
may be not that bad a down payment.

The issue is largely mitigated if the data is not removed from 
shared_buffers, because the OS buffer is just a copy of already hold data. 
What I would do on such systems is to increase shared_buffers and keep 
flushing on, that is to count less on the system cache and more on postgres 
own cache.

Overall, I'm not convinced that the practice of relying on the OS cache is a 
good one, given what it does with it, at least on Linux.

Now, if someone could provide a dedicated box with posix_fadvise (say 
FreeBSD, maybe others...) for testing that would allow to provide data 
instead of speculating... and then maybe to decide to change its default 
value.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-08-17 15:21:22 +0200, Fabien COELHO wrote:
> My current thinking is "maybe yes, maybe no":-), as it may depend on the OS
> implementation of posix_fadvise, so it may differ between OS.

As long as fadvise has no 'undirty' option, I don't see how that
problem goes away. You're telling the OS to throw the buffer away, so
unless it ignores it that'll have consequences when you read the page
back in.

> This is a reason why I think that flushing should be kept a guc, even if the
> sort guc is removed and always on. The sync_file_range implementation is
> clearly always very beneficial for Linux, and the posix_fadvise may or may
> not induce a good behavior depending on the underlying system.

That's certainly an argument.

> This is also a reason why the default value for the flush guc is currently
> set to false in the patch. The documentation should advise to turn it on for
> Linux and to test otherwise. Or if Linux is assumed to be often a host, then
> maybe to set the default to on and to suggest that on some systems it may be
> better to have it off. 

I'd say it should then be an os-specific default. No point in making
people work for it needlessly on linux and/or elsewhere.

> (Another reason to keep it "off" is that I'm not sure about what
> happens with such HD flushing features on virtual servers).

I don't see how that matters? Either the host will entirely ignore
flushing, and thus the sync_file_range and the fsync won't cost much, or
fsync will be honored, in which case the pre-flushing is helpful.


> Overall, I'm not pessimistic, because I've seen I/O storms on a FreeBSD host
> and it was as bad as Linux (namely the database and even the box was offline
> for long minutes...), and if you can avoid that having to read back some
> data may be not that bad a down payment.

I don't see how that'd alleviate my fear. Sure, the latency for many
workloads will be better, but I don't see how that argument says anything
about the reads? And we'll not just use this in cases it'd be
beneficial...

> The issue is largely mitigated if the data is not removed from
> shared_buffers, because the OS buffer is just a copy of already hold data.
> What I would do on such systems is to increase shared_buffers and keep
> flushing on, that is to count less on the system cache and more on postgres
> own cache.

That doesn't work that well for a bunch of reasons. For one it's
completely non-adaptive. With the OS's page cache you can rely on free
memory being used for caching *and* it be available should a query or
another program need lots of memory.

> Overall, I'm not convinced that the practice of relying on the OS cache is a
> good one, given what it does with it, at least on Linux.

The alternatives aren't super realistic near-term though. Using direct
IO efficiently on the set of operating systems we support is
*hard*. It's more or less trivial to hack pg up to use direct IO for
relations/shared_buffers, but it'll perform utterly horribly in many
many cases.

To pick one thing out: Without the OS buffering writes any write will
have to wait for the disks, instead being asynchronous. That'll make
writes performed by backends a massive bottleneck.

> Now, if someone could provide a dedicated box with posix_fadvise (say
> FreeBSD, maybe others...) for testing that would allow to provide data
> instead of speculating... and then maybe to decide to change its default
> value.

Testing, as an approximation, how it turns out to work on linux would be
a good step.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

>>> [...] posix_fadvise().
>>
>> My current thinking is "maybe yes, maybe no":-), as it may depend on the OS
>> implementation of posix_fadvise, so it may differ between OS.
>
> As long as fadvise has no 'undirty' option, I don't see how that
> problem goes away. You're telling the OS to throw the buffer away, so
> unless it ignores it that'll have consequences when you read the page
> back in.

Yep, probably.

Note that we are talking about checkpoints, which "write" buffers out 
*but* keep them nevertheless. As the buffer is kept, the OS page is a 
duplicate, and freeing it should not harm, at least immediately.

The situation is different if the memory is reused in between, which is 
the work of the bgwriter I think, based on LRU/LFU heuristics, but such 
writes are not flushed by the current patch.

Now, if a buffer was recently updated it should not be selected by the 
bgwriter, if the LRU/LFU heuristics work as expected, which mitigates the 
issue somewhat...

To sum up, I agree that it is indeed possible that flushing with 
posix_fadvise could reduce read OS-memory hits on some systems for some 
workloads, although not on Linux, see below.

So the option is best kept as "off" for now, without further data, I'm 
fine with that.

> [...] I'd say it should then be an os-specific default. No point in 
> making people work for it needlessly on linux and/or elsewhere.

Ok. Version 9 attached does that, "on" for Linux, "off" for others because 
of the potential issues you mentioned.

>> (Another reason to keep it "off" is that I'm not sure about what
>> happens with such HD flushing features on virtual servers).
>
> I don't see how that matters? Either the host will entirely ignore
> flushing, and thus the sync_file_range and the fsync won't cost much, or
> fsync will be honored, in which case the pre-flushing is helpful.

Possibly. I know that I do not know:-)  The distance between the database 
and the real hardware is so great in a VM that I think it may have any 
effect: good, bad or none:-)

>> Overall, I'm not pessimistic, because I've seen I/O storms on a FreeBSD host
>> and it was as bad as Linux (namely the database and even the box was offline
>> for long minutes...), and if you can avoid that having to read back some
>> data may be not that bad a down payment.
>
> I don't see how that'd alleviate my fear.

I'm trying to mitigate your fears, not to alleviate them:-)

> Sure, the latency for many workloads will be better, but I don't see how 
> that argument says anything about the reads?

It just says that there may be a compromise, better in some case, possibly 
not so in others, because posix_fadvise does not really say what the 
database would like to say to the OS, this is why I wrote such a large 
comment about it in the source file in the first place.

> And we'll not just use this in cases it'd be beneficial...

I'm fine if it is off by default for some systems. If people want to avoid 
write stalls they can use the option, but it may have an adverse effect on 
the tps in some cases, that's life? Not using the option also has adverse 
effects in some cases, because you have write stalls... and currently you 
do not have the choice, so it would be progress.

>> The issue is largely mitigated if the data is not removed from
>> shared_buffers, because the OS buffer is just a copy of already hold data.
>> What I would do on such systems is to increase shared_buffers and keep
>> flushing on, that is to count less on the system cache and more on postgres
>> own cache.
>
> That doesn't work that well for a bunch of reasons. For one it's
> completely non-adaptive. With the OS's page cache you can rely on free
> memory being used for caching *and* it be available should a query or
> another program need lots of memory.

Yep. I was thinking about a dedicated database server, not a shared one.

>> Overall, I'm not convince that the practice of relying on the OS cache is a
>> good one, given what it does with it, at least on Linux.
>
> The alternatives aren't super realistic near-term though. Using direct
> IO efficiently on the set of operating systems we support is
> *hard*. [...]

Sure.  This is not necessarily what I had in mind.

Currently pg "write"s stuff to the OS, and then suddenly calls "fsync" out 
of the blue, hoping that in between the OS will actually have done a good 
job with the underlying hardware.  This is pretty naive, the fsync 
generates write storms, and the database is offline: trying to improve 
these things is the motivation for this patch.

Now if you think of the bgwriter, it does pretty much the same, and 
probably may generate plenty of random I/Os, because the underlying 
LRU/LFU heuristics used to select buffers does not care about the file 
structures.

So I think that to get good performance the database must take some 
control over the OS. That does not mean that direct I/O needs to be 
involved, although maybe it could, but this patch shows that it is not 
needed to improve things.

>> Now, if someone could provide a dedicated box with posix_fadvise (say
>> FreeBSD, maybe others...) for testing that would allow to provide data
>> instead of speculating... and then maybe to decide to change its default
>> value.
>
> Testing, as an approximation, how it turns out to work on linux would be
> a good step.

Do you mean testing with posix_fadvise on Linux?

I did think about it, but the documented behavior of this call on Linux is 
disappointing: if the buffer has been written to disk, it is freed by the 
OS. If not, nothing is done. Given that the flush is called pretty close 
after writes, mostly the buffer will not have been written to disk yet, 
and the call would just be a no-op... So I concluded that there is no 
point in trying that on Linux because it will have no effect other than 
losing some time, IMO.

Really, a useful test would be FreeBSD, where posix_fadvise does move 
things to disk, although the actual offsets & length are ignored, but I do 
not think that it would be a problem. I do not know about other systems 
and what they do with posix_fadvise.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Tue, Aug 18, 2015 at 1:02 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> Hello Andres,
>
>>>> [...] posix_fadvise().
>>>
>>> My current thinking is "maybe yes, maybe no":-), as it may depend on the OS
>>> implementation of posix_fadvise, so it may differ between OS.
>>
>> As long as fadvise has no 'undirty' option, I don't see how that
>> problem goes away. You're telling the OS to throw the buffer away, so
>> unless it ignores it that'll have consequences when you read the page
>> back in.
>
> Yep, probably.
>
> Note that we are talking about checkpoints, which "write" buffers out
> *but* keep them nevertheless. As the buffer is kept, the OS page is a
> duplicate, and freeing it should not harm, at least immediately.


This theory could make sense if we can predict in some way that
the data we are flushing out of the OS cache won't be needed soon.
After the flush, we can only rely to an extent on the data being found in
shared_buffers if the usage_count is high; otherwise it could be
replaced at any moment by a backend needing the buffer when there is no
free buffer.  Now one way to think about it is that if the usage_count is
low, then anyway it's okay to assume that it won't be needed in the near
future; however I don't think relying only on usage_count for such a thing
is a good idea.

> To sum up, I agree that it is indeed possible that flushing with
> posix_fadvise could reduce read OS-memory hits on some systems for some
> workloads, although not on Linux, see below.
>
> So the option is best kept as "off" for now, without further data, I'm
> fine with that.


One point to think here is on what basis user can decide make
this option on, is it predictable in any way?
I think one case could be when the data set fits in shared_buffers.

In general, providing an option is a good idea if user can decide with
ease when to use that option or we can give some clear recommendation
for the same otherwise one has to recommend that test your workload
with this option and if it works then great else don't use it which might also
be okay in some cases, but it is better to be clear.


One minor point: while glancing through the patch, I noticed that a couple
of multiline comments are not written in the way which is usually used
in code (keep the first line empty).

+/* Status of buffers to checkpoint for a particular tablespace,
+ * used internally in BufferSync.
+ * - space: oid of the tablespace
+ * - num_to_write: number of checkpoint pages counted for this tablespace
+ * - num_written: number of pages actually written out

+/* entry structure for table space to count hashtable,
+ * used internally in BufferSync.
+ */



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Amit,

>> So the option is best kept as "off" for now, without further data, I'm
>> fine with that.
>
> One point to think here is on what basis user can decide make
> this option on, is it predictable in any way?
> I think one case could be when the data set fits in shared_buffers.

Yep.

> In general, providing an option is a good idea if user can decide with 
> ease when to use that option or we can give some clear recommendation 
> for the same otherwise one has to recommend that test your workload with 
> this option and if it works then great else don't use it which might 
> also be okay in some cases, but it is better to be clear.

My opinion, which is not backed by any data (anyone can feel free to 
provide a FreeBSD box for testing...) is that it would mostly be an 
improvement if you have a significant write load to have the flush option 
on when running on non-Linux systems which provide posix_fadvise.

If you have a lot of reads and few writes, then postgresql currently works 
reasonably enough, which is why people do not complain too much about 
write stalls, and I expect that the situation would not be significantly 
degraded.

Now there are competing positive and negative effects induced by using 
posix_fadvise, and moreover its implementation varies from OS to OS, so 
without running some experiments it is hard to be definite.

> One minor point, while glancing through the patch, I noticed that couple
> of multiline comments are not written in the way which is usually used
> in code (Keep the first line as empty).

Indeed.

Please find attached a v10, where I have reviewed comments for style & 
contents, and also slightly extended the documentation about the flush 
option to hint that it is essentially useful for high write loads. Without 
further data, I think it is not obvious to give more definite advices.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Tue, Aug 18, 2015 at 12:38 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> Hello Amit,
>
>>> So the option is best kept as "off" for now, without further data, I'm
>>> fine with that.
>>
>> One point to think here is on what basis user can decide make
>> this option on, is it predictable in any way?
>> I think one case could be when the data set fits in shared_buffers.
>
> Yep.
>
>> In general, providing an option is a good idea if user can decide with
>> ease when to use that option or we can give some clear recommendation
>> for the same otherwise one has to recommend that test your workload
>> with this option and if it works then great else don't use it which
>> might also be okay in some cases, but it is better to be clear.
>
> My opinion, which is not backed by any data (anyone can feel free to
> provide a FreeBSD box for testing...) is that it would mostly be an
> improvement if you have a significant write load to have the flush option
> on when running on non-Linux systems which provide posix_fadvise.
>
> If you have a lot of reads and few writes, then postgresql currently works
> reasonably enough, which is why people do not complain too much about
> write stalls, and I expect that the situation would not be significantly
> degraded.
>
> Now there are competing positive and negative effects induced by using
> posix_fadvise, and moreover its implementation varies from OS to OS, so
> without running some experiments it is hard to be definite.


Sure, I think what can help here is a testcase/'s (in form of script file
or some other form, to test this behaviour of patch) which you can write
and post here, so that others can use that to get the data and share it.
Ofcourse, that is not mandatory to proceed with this patch, but still can
help you to prove your point as you might not have access to different
kind of systems to run the tests.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
> Sure, I think what can help here is a testcase/'s (in form of script file
> or some other form, to test this behaviour of patch) which you can write
> and post here, so that others can use that to get the data and share it.

Sure... note that I already did that on this thread, without any echo... 
but I can do it again...

Tests should be run on a dedicated host. If it has n cores, I suggest 
sharing them between postgres checkpointer & workers and pgbench threads so 
as to avoid thread competition to use cores. With 8 cores I used up to 2 
threads & 4 clients, so that there are 2 cores left for the checkpointer and 
other stuff (i.e. I also run iotop & htop in parallel...). Although it may 
seem conservative to do so, I think that the point of the test is to 
exercise checkpoints and not to test the process scheduler of the OS.

Here are the latest version of my test scripts:
 (1) cp_test.sh <name> <test>

Run "test" with setup "name". Currently it runs 4000 seconds pgbench with 
the 4 possible on/off combinations for sorting & flushing, after some 
warmup. The 4000-second duration is chosen so that there are a few checkpoint 
cycles. For larger checkpoint times, I suggest to extend the run time to 
see at least 3 checkpoints during the run.

More test settings can be added to the 2 "case"s. Postgres settings,
especially shared_buffers, should be set to a pertinent value wrt the 
memory of the test host.

The tests run with the postgres version found in the PATH, so ensure that the 
right version is found!
 (2) cp_test_count.py one-test-output.log

For rate limited runs, look at the final figures and compute the number of 
late & skipped transactions. This can also be done by hand.
 (3) avg.py

For full speed runs, compute stats about per second tps:
  sh> grep 'progress:' one-test-output.log | cut -d' ' -f4 | \
        ./avg.py --limit=10 --length=4000
  warning: 633 missing data, extending with zeros
  avg over 4000: 199.290575 ± 512.114070 [0.000000, 0.000000, 4.000000, 5.000000, 2280.900000]
  percent of values below 10.0: 82.5%
 

The figures I reported are the 199 (average tps), 512 (standard deviation 
on per second figures), 82.5% (percent of time below 10 tps, aka postgres 
is basically unresponsive). In brackets, the min, q1, median, q3 and max tps 
seen in the run.

> Ofcourse, that is not mandatory to proceed with this patch, but still can
> help you to prove your point as you might not have access to different
> kind of systems to run the tests.

I agree that more tests would be useful to decide which default value for 
the flushing option is better. For Linux, all tests so far suggest 
"on" is the best choice, but for other systems that use posix_fadvise, it 
is really an open question.

Another option would be to give me temporary access to some available 
host, I'm used to running these tests...

-- 
Fabien.

Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Wed, Aug 19, 2015 at 12:13 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

>> Sure, I think what can help here is a testcase/'s (in form of script file
>> or some other form, to test this behaviour of patch) which you can write
>> and post here, so that others can use that to get the data and share it.
>
> Sure... note that I already did that on this thread, without any echo...
> but I can do it again...

 
Thanks.

I have tried your scripts and found some problem while using avg.py
script.
grep 'progress:' test_medium4_FW_off.out | cut -d' ' -f4 | ./avg.py --limit=10 --length=300
: No such file or directory

I didn't get chance to poke into avg.py script (the command without
avg.py works fine). Python version on the m/c, I planned to test is
Python 2.7.5.

Today while reading the first patch (checkpoint-continuous-flush-10-a),
I have given some thought to below part of patch which I would like
to share with you.

+static int
+NextBufferToWrite(
+    TableSpaceCheckpointStatus *spcStatus, int nb_spaces,
+    int *pspace, int num_to_write, int num_written)
+{
+    int space = *pspace, buf_id = -1, index;
+
+    /*
+     * Select a tablespace depending on the current overall progress.
+     *
+     * The progress ratio of each unfinished tablespace is compared to
+     * the overall progress ratio to find one which is not in advance
+     * (i.e. overall ratio > tablespace ratio,
+     *  i.e. tablespace written/to_write > overall written/to_write



Here, I think above calculation can go for a toss if backend or bgwriter
starts writing buffers when checkpoint is in progress.  The tablespace
written parameter won't be able to consider the ones written by backends
or bgwriter.  Now it may not be a big thing to worry about but I find
Heikki's version worth considering, he has not changed the overall idea of
this patch, but the calculations are somewhat simpler and hence less chance
of going wrong.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Amit,

> I have tried your scripts and found some problem while using avg.py
> script.
> grep 'progress:' test_medium4_FW_off.out | cut -d' ' -f4 | ./avg.py
> --limit=10 --length=300
> : No such file or directory

> I didn't get chance to poke into avg.py script (the command without
> avg.py works fine). Python version on the m/c, I planned to test is
> Python 2.7.5.

Strange... What does "/usr/bin/env python" say? Can the script be started 
on its own at all? I think that the script should work both with python2 
and python3, at least it does on my laptop...

> Today while reading the first patch (checkpoint-continuous-flush-10-a),
> I have given some thought to below part of patch which I would like
> to share with you.
>
> + * Select a tablespace depending on the current overall progress.
> + *
> + * The progress ratio of each unfinished tablespace is compared to
> + * the overall progress ratio to find one with is not in advance
> + * (i.e. overall ratio > tablespace ratio,
> + *  i.e. tablespace written/to_write > overall written/to_write

> Here, I think above calculation can go for toss if backend or bgwriter
> starts writing buffers when checkpoint is in progress.  The tablespace
> written parameter won't be able to consider the one's written by backends
> or bgwriter.

Sure... This is *already* the case with the current checkpointer: the 
schedule is performed with respect to the initial number of buffers it 
thinks it will have to write, and if someone else writes these buffers then 
the schedule is skewed a little bit, or more... I have not changed this 
logic, but I extended it to handle several tablespaces.

If this (the checkpointer progress evaluation used for its schedule is 
sometimes wrong because of other writes) is proven to be a major 
performance issue, then the processes which write the checkpointed 
buffers behind its back should tell the checkpointer about it, probably 
with some shared data structure, so that the checkpointer can adapt its 
schedule.

This is an independent issue, which may be worth addressing some day. My 
opinion is that when the bgwriter or backends kick in to write buffers, 
they are basically generating random I/Os on HDD and killing tps and 
latency, so it is a very bad time anyway; thus I'm not sure that this is 
the next problem to address to improve pg performance and responsiveness.

> Now it may not big thing to worry but I find Heikki's version worth 
> considering, he has not changed the overall idea of this patch, but the 
> calculations are somewhat simpler and hence less chance of going wrong.

I do not think that Heikki's version worked wrt balancing writes over 
tablespaces, and I'm not sure it worked at all. However I reused some of 
his ideas to simplify and improve the code.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Sun, Aug 23, 2015 at 12:33 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Hello Amit,

I have tried your scripts and found some problem while using avg.py
script.
grep 'progress:' test_medium4_FW_off.out | cut -d' ' -f4 | ./avg.py
--limit=10 --length=300
: No such file or directory

I didn't get chance to poke into avg.py script (the command without
avg.py works fine). Python version on the m/c, I planned to test is
Python 2.7.5.

Strange... What does "/usr/bin/env python" say?

Python 2.7.5 (default, Apr  9 2015, 11:07:29) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 
 
Can the script be started on its own at all?

I have tried like below, which results in the same error; I also tried a few
other variations but could not succeed.
./avg.py 
: No such file or directory
 

Here, I think above calculation can go for toss if backend or bgwriter
starts writing buffers when checkpoint is in progress.  The tablespace
written parameter won't be able to consider the one's written by backends
or bgwriter.

Sure... This is *already* the case with the current checkpointer, the schedule is performed with respect to the initial number of buffers it think it will have to write, and if someone else writes these buffers then the schedule is skewed a little bit, or more... I have not changed this logic, but I extended it to handle several tablespaces.


I don't know how good or bad it is to build further on somewhat skewed
logic, but the point is that unless it is required, why use it?
 
I do not think that Heikki version worked wrt to balancing writes over tablespaces,

I also think that it doesn't balance over tablespaces, but the question
is why we need to balance over tablespaces: can we reliably predict in
some way that balancing over tablespaces can help the workload?  I think
here we are doing more engineering than required for this patch.
 
and I'm not sure it worked at all.

Okay, his version might have some bugs, but then those could be
fixed as well.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Amit,

>> Can the script be started on its own at all?
>
> I have tried like below which results in same error, also I tried few
> other variations but could not succeed.
> ./avg.py

Hmmm... Ensure that the script is readable and executable:
  sh> chmod a+rx ./avg.py

Also check the file:
  sh> file ./avg.py
  ./avg.py: Python script, UTF-8 Unicode text executable

>> Sure... This is *already* the case with the current checkpointer, the
>> schedule is performed with respect to the initial number of buffers it
>> think it will have to write, and if someone else writes these buffers then
>> the schedule is skewed a little bit, or more... I have not changed this
>
> I don't know how good or bad it is to build  further on somewhat skewed
> logic,

The logic is no more skewed than it is with the current version: your 
remark about the estimation, which may be wrong in some cases, is clearly 
valid, but it is orthogonal (independent, unrelated, different) to what is 
addressed by this patch.

I currently have no reason to believe that the issue you raise is a major 
performance issue, but if so it may be addressed by another patch by 
whoever want to do so.

What I have done is to demonstrate that generating a lot of random I/Os is 
a major performance issue (well, sure), and this patch addresses this 
point and provides major speedups (*3-5) and latency reductions (from +60% 
unavailability to nearly full availability) for high OLTP write loads, by 
reordering and flushing checkpoint buffers in a sensible way.

> but the point is that unless it is required why to use it.

This is really required to avoid predictable performance regressions, see 
below.

>> I do not think that Heikki version worked wrt to balancing writes over
>> tablespaces,
>
> I also think that it doesn't balances over tablespaces, but the question 
> is why do we need to balance over tablespaces, can we reliably predict 
> in someway which indicates that performing balancing over tablespace can 
> help the workload.

The reason for the tablespace balancing is that in the current postgres 
buffers are written more or less randomly, so it is (probably) implicitly 
and statistically balanced over tablespaces because of this randomness, 
and indeed, AFAIK, people with multi-tablespace setups have not complained 
that postgres was using the disks sequentially.

However, once the buffers are sorted per file, the order becomes 
deterministic and there is no more implicit balancing, which means that if 
someone has a pg setup with several disks it will write sequentially on 
these instead of in parallel.

This regression was pointed out by Andres Freund, I agree that such a 
regression for high end systems must be avoided, hence the tablespace 
balancing.

> I think here we are doing more engineering than required for this patch.

I do not think so, I think that Andres remark is justified to avoid a 
performance regression on high end systems which use tablespaces, which is 
really undesirable.

About the balancing code, it is not that difficult, even if it is not 
trivial: the point is to select a tablespace for which the progress 
ratio (written/to_write) is below the overall progress ratio, so that it 
catches up, and to do so in a round-robin manner, so that all tablespaces 
get to write things. I have both written a proof of and tested the logic 
(in a separate script).
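
To make the rule concrete, here is a small standalone sketch of the selection
logic (an illustration only, not the actual patch code; the tablespace names
and sizes are made up):

  /*
   * Each tablespace keeps its own progress; the next write goes to a
   * not-yet-finished tablespace whose progress ratio does not exceed the
   * overall ratio, scanning tablespaces round-robin.
   */
  #include <stdio.h>

  typedef struct
  {
      const char *name;
      int         to_write;       /* buffers this tablespace must write */
      int         written;        /* buffers written so far */
  } SpaceProgress;

  static int
  next_tablespace(SpaceProgress *spc, int nspc, int *cursor,
                  int total_to_write, int total_written)
  {
      double overall = (double) total_written / total_to_write;

      for (int i = 0; i < nspc; i++)
      {
          int s = (*cursor + i) % nspc;

          if (spc[s].written < spc[s].to_write &&
              (double) spc[s].written / spc[s].to_write <= overall)
          {
              *cursor = (s + 1) % nspc;       /* round-robin restart point */
              return s;
          }
      }
      return -1;                              /* everything written */
  }

  int main(void)
  {
      SpaceProgress spc[] = { {"ts0", 6, 0}, {"ts1", 3, 0}, {"ts2", 1, 0} };
      int nspc = 3, total = 10, written = 0, cursor = 0, s;

      while ((s = next_tablespace(spc, nspc, &cursor, total, written)) >= 0)
      {
          spc[s].written++;                   /* "write" one buffer */
          written++;
          printf("%s ", spc[s].name);
      }
      printf("\n");
      return 0;
  }

Running it prints the writes interleaved over ts0, ts1 and ts2 instead of one
tablespace after the other, which is the point of the balancing.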

-- 
Fabien.



Re: checkpointer continuous flushing

From
Michael Paquier
Date:
On Mon, Aug 24, 2015 at 4:15 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> [stuff]

Moved to next CF 2015-09.
-- 
Michael



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-08-18 09:08:43 +0200, Fabien COELHO wrote:
> Please find attached a v10, where I have reviewed comments for style &
> contents, and also slightly extended the documentation about the flush
> option to hint that it is essentially useful for high write loads. Without
> further data, I think it is not obvious to give more definite advices.

v10b misses the checkpoint_sort part of the patch, and thus cannot be applied.

Andres



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

>> Please find attached a v10, where I have reviewed comments for style & 
>> contents, and also slightly extended the documentation about the flush 
>> option to hint that it is essentially useful for high write loads. 
>> Without further data, I think it is not obvious to give more definite 
>> advices.
>
> v10b misses the checkpoint_sort part of the patch, and thus cannot be 
> applied.

Yes, indeed, the second part is expected to be applied on top of v10a.

Please find attached the cumulated version (v10a + v10b).

-- 
Fabien.

Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-08-27 14:32:39 +0200, Fabien COELHO wrote:
> >v10b misses the checkpoint_sort part of the patch, and thus cannot be
> >applied.
> 
> Yes, indeed, the second part is expected to be applied on top of v10a.

Oh, sorry. I'd somehow assumed they were two variants of the same patch
(one with "slim" sorting and the other without).



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>>> v10b misses the checkpoint_sort part of the patch, and thus cannot be
>>> applied.
>>
>> Yes, indeed, the second part is expected to be applied on top of v10a.
>
> Oh, sorry. I'd somehow assumed they were two variants of the same patch
> (one with "slim" sorting and the other without).

The idea is that these two features could be committed separately. 
However, experiments show that flushing is really efficient when sorting 
is done first, and moreover the two features conflict, so I've made two 
dependent patches.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Mon, Aug 24, 2015 at 12:45 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> Also check the file:
>
>   sh> file ./avg.py
>   ./avg.py: Python script, UTF-8 Unicode text executable
>

There were some CRLF line terminators; after removing those, it worked
fine. Here are the results of some of the tests done for the sorting patch
(checkpoint-continuous-flush-10-a):

Config Used
----------------------
M/c details
--------------------
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB


Test details
------------------
warmup=60
scale=300
max_connections=150
shared_buffers=8GB
checkpoint_timeout=2min
time=7200
synchronous_commit=on
max_wal_size=5GB

parallelism - 128 clients, 128 threads

Sort - off
avg over 7200: 8256.382528 ± 6218.769282 [0.000000, 76.050000, 10975.500000, 13105.950000, 21729.000000]
percent of values below 10.0: 19.5%

Sort - on
avg over 7200: 8375.930639 ± 6148.747366 [0.000000, 84.000000, 10946.000000, 13084.000000, 20289.900000]
percent of values below 10.0: 18.6%

Before going to conclusion, let me try to explain above data (I am
explaining again even though Fabien has explained, to make it clear
if someone has not read his mail)

Let's try to understand with data for sorting - off option

avg over 7200: 8256.382528 ± 6218.769282

8256.382528 - average tps for 7200s pgbench run 
6218.769282 - standard deviation on per second figures

[0.000000, 84.000000, 10946.000000, 13084.000000, 20289.900000]

These 5 values can be read as minimum TPS, q1, median TPS, q3 and
maximum TPS over the 7200s pgbench run.  As far as I understand, q1
and q3 are medians of subsets of the values, which I didn't focus on much.

percent of values below 10.0: 19.5%

Above means percent of time the result is below 10 tps.

Now about the test results: these tests are done for pgbench full speed runs,
and the above results indicate that there is approximately 1.5%
improvement in avg. TPS and ~1% improvement in tps values which are
below 10 with sorting on, and there is almost no improvement in median or
maximum TPS values; instead they are slightly less when sorting is
on, which could be due to run-to-run variation.

I have done more tests as well by varying time and number of clients
keeping other configuration same as above, but the results are quite
similar.

The results of the sorting patch for the tests done indicate that the win is
not big enough with just doing sorting during checkpoints; we should consider
the flush patch along with sorting.  I would like to perform some tests with
both patches together (sort + flush), unless somebody else thinks that the
sorting patch alone is beneficial and that we should test some other kind of
scenarios to see its benefit.

>
> The reason for the tablespace balancing is that in the current postgres buffers are written more or less randomly, so it is (probably) implicitely and statistically balanced over tablespaces because of this randomness, and indeed, AFAIK, people with multi tablespace setup have not complained that postgres was using the disks sequentially.
>
> However, once the buffers are sorted per file, the order becomes deterministic and there is no more implicit balancing, which means that if someone has a pg setup with several disks it will write sequentially on these instead of in parallel.
>

What if tablespaces are not on separate disks, or there is not enough
hardware support to make writes parallel?  I think for such cases it might
be better to do it sequentially.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Amit,

> IBM POWER-8 24 cores, 192 hardware threads
> RAM = 492GB

Wow! Thanks for trying the patch on such high-end hardware!

About the disks: what kind of HDD (RAID? speed?)? HDD write cache?

What is the OS? The FS?

> warmup=60

Quite short, but probably ok.

> scale=300

Means about 4-4.5 GB base.

> time=7200
> synchronous_commit=on

> shared_buffers=8GB

This is small wrt hardware, but given the scale setup I think that it 
should not matter much.

> max_wal_size=5GB

Hmmm... Maybe quite small given the average performance?

> checkpoint_timeout=2min

This seems rather small. Are the checkpoints xlog or time triggered?

You did not update checkpoint_completion_target, which means 0.5, so that 
the checkpoint is scheduled to run in at most 1 minute, which suggests at 
least 130 MB/s write performance for the checkpoint (8 GB of shared buffers 
in at most 60 seconds is about 137 MB/s).

> parallelism - 128 clients, 128 threads

Given 192 hw threads, I would have tried using 128 clients & 64 threads, so 
that each pgbench client has its own dedicated postgres in a hardware thread, 
and the postgres processes are not competing with pgbench. Now as pgbench is 
mostly sleeping, probably that does not matter much... I may also be 
totally wrong:-)

> Sort - off
> avg over 7200: 8256.382528 ± 6218.769282 [0.000000, 76.050000,
> 10975.500000, 13105.950000, 21729.000000]
> percent of values below 10.0: 19.5%

The max performance is consistent with 128 threads * 200 (random) writes 
per second.

> Sort - on
> avg over 7200: 8375.930639 ± 6148.747366 [0.000000, 84.000000,
> 10946.000000, 13084.000000, 20289.900000]
> percent of values below 10.0: 18.6%

This is really a small improvement, probably in the error interval of the 
measure. I would not trust much 1.5% tps or 0.9% availability 
improvements.

I think that we could conclude that on your (great) setup, with these 
configuration parameters, this patch does not harm performance. This is a 
good thing, even if I would have hoped to see better performance.

> Before going to conclusion, let me try to explain above data (I am
> explaining again even though Fabien has explained, to make it clear
> if someone has not read his mail)
>
> Let's try to understand with data for sorting - off option
>
> avg over 7200: 8256.382528 ± 6218.769282
>
> 8256.382528 - average tps for 7200s pgbench run
> 6218.769282 - standard deviation on per second figures
>
> [0.000000, 84.000000, 10946.000000, 13084.000000, 20289.900000]
>
> These 5 values can be read as minimum TPS, q1, median TPS, q3,
> maximum TPS over 7200s pgbench run.  As far as I understand q1
> and q3 median of subset of values which I didn't focussed much.

q1 = 84 means that 25% of the time the performance was below 84 tps, about 
1% of the average performance, which I would translate as "pg is pretty 
unresponsive 25% of the time".

This is the kind of issue I really want to address, the eventual tps 
improvements are just a side effect.

> percent of values below 10.0: 19.5%
>
> Above means percent of time the result is below 10 tps.

Which means "postgres is really unresponsive 19.5% of the time".

If you count zeros, you will get "postgres was totally unresponsive X% of 
the time".

> Now about test results, these tests are done for pgbench full speed runs
> and the above results indicate that there is approximately 1.5%
> improvement in avg. TPS and ~1% improvement in tps values which are
> below 10 with sorting on and there is almost no improvement in median or
> maximum TPS values, instead they or slightly less when sorting is
> on which could be due to run-to-run variation.

Yes, I agree.

> I have done more tests as well by varying time and number of clients
> keeping other configuration same as above, but the results are quite
> similar.

Given the hardware, I would suggest to raise checkpoint_timeout, 
shared_buffers and max_wal_size, and use checkpoint_completion_target=0.8. 
I would expect that it should improve performance both with and without 
sorting.

It would be interesting to have information from the checkpoint logs 
(especially how many buffers were written in how long, and whether 
checkpoints are time or xlog triggered, ...).

> The results of sorting patch for the tests done indicate that the win is 
> not big enough with just doing sorting during checkpoints,

ISTM that you do too much generalization: the win is not big "under this 
configuration and hardware".

I think that the patch may have very small influence under some 
conditions, but should not degrade performance significantly, and on the 
other hand it should provide great improvements under some (other) 
conditions.

So having no performance degradation is a good result, even if I would 
hope to get better results.  It would be interesting to understand why 
random disk writes do not perform too poorly on this box: size of I/O 
queue, kind of (expensive:-) disks, write caches, file system, raid 
level...

> we should consider flush patch along with sorting.

I also think that it would be interesting.

> I would like to perform some tests with both the patches together (sort 
> + flush) unless somebody else thinks that sorting patch alone is 
> beneficial and we should test some other kind of scenarios to see it's 
> benefit.

Yep. Is it a Linux box? If not, does it support posix_fadvise()?

>> The reason for the tablespace balancing is [...]
>
> What if tablespaces are not on separate disks

I would expect that it might very slightly degrade performance, but only 
marginally.

> or not enough hardware support to make Writes parallel?

I'm not sure that balancing writes over tablespaces or not would change 
anything about an I/O bottleneck which is not the disk write performance, so 
I would say "no impact" in that case.

> I think for such cases it might be better to do it sequentially.

Writing sequentially to different disks would be a bug, and would degrade 
performance significantly on a setup with several disks, up to dividing 
the performance by the number of disks... so I do not think that a patch 
which predictably and significantly degrades performance on high-end 
hardware is a reasonable option.

If you want to be able to deactivate balancing, it could be done with a 
guc, but I cannot see good reasons to want to do that: it would complicate 
the code, and it does not make much sense to use many tablespaces on one 
disk, while anyone who uses several tablespaces on several disks is 
probably expecting to see her expensive disks actually used in parallel.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Mon, Aug 31, 2015 at 12:40 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> Hello Amit,
>
>> IBM POWER-8 24 cores, 192 hardware threads
>> RAM = 492GB
>
>
> Wow! Thanks for trying the patch on such high-end hardware!
>
> About the disks: what kind of HDD (RAID? speed?)? HDD write cache?
>

Speed of Reads -
Timing cached reads:   27790 MB in  1.98 seconds = 14001.86 MB/sec
Timing buffered disk reads: 3830 MB in  3.00 seconds = 1276.55 MB/sec

Copy speed - 

dd if=/dev/zero of=/tmp/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.30993 s, 1.6 GB/s


> What is the OS? The FS?
>

OS info -
Linux <m/c addr> 3.10.0-123.1.2.el7.ppc64 #1 SMP Wed Jun 4 15:23:17 EDT 2014 ppc64 ppc64 ppc64 GNU/Linux

FS - ext4


>> shared_buffers=8GB
>
>
> This is small wrt hardware, but given the scale setup I think that it should not matter much.
>

Yes, I was testing the case for Read-Write transactions when all the data
fits in shared_buffers, so this is okay.

>> max_wal_size=5GB
>
>
> Hmmm... Maybe quite small given the average performance?
>

We can check with larger value, but do you expect some different
results and why?

>> checkpoint_timeout=2min
>
>
> This seems rather small. Are the checkpoints xlog or time triggered?
>

I wanted to test by triggering more checkpoints, but I can test with a
larger checkpoint interval as well, like 5 or 10 mins. Any suggestions?


> You did not update checkpoint_completion_target, which means 0.5 so that the checkpoint is scheduled to run in at most 1 minute, which suggest at least 130 MB/s write performance for the checkpoint.
>

The value used in your script was 0.8 for checkpoint_completion_target
which I have not changed during tests.

>> parallelism - 128 clients, 128 threads
>
>
> Given 192 hw threads, I would have tried used 128 clients & 64 threads, so that each pgbench client has its own dedicated postgres in a thread, and that postgres processes are not competing with pgbench. Now as pgbench is mostly sleeping, probably that does not matter much... I may also be totally wrong:-)
>

In the next run, I can use 64 threads; let's settle first on the other
parameters for which you expect there could be a clear win with the first patch.

>
>
> Given the hardware, I would suggest to raise checkpoint_timeout, shared_buffers and max_wal_size, and use checkpoint_completion_target=0.8. I would expect that it should improve performance both with and without sorting.
>

I don't think increasing shared_buffers would have any impact, because
8GB is sufficient for scale factor 300 data, and checkpoint_completion_target
is already 0.8 in my previous tests.  Let's try with checkpoint_timeout = 10 min
and max_wal_size = 15GB; do you have any other suggestion?

> It would be interesting to have informations from checkpoint logs (especially how many buffers written in how long, whether checkpoints are time or xlog triggered, ...).
>
>> The results of sorting patch for the tests done indicate that the win is not big enough with just doing sorting during checkpoints,
>
>
> ISTM that you do too much generalization: The win is not big "under this configuration and harware".
>

Hmm.. nothing like that, this was based on a couple of tests done by
me, and I am open to doing some more if you or anybody feels that the
first patch (checkpoint-continuous-flush-10-a) alone can give a benefit;
in fact I have started these tests with the intention to see if the first
patch gives a benefit, so that it could be evaluated and eventually
committed separately.

> I think that the patch may have very small influence under some conditions, but should not degrade performance significantly, and on the other hand it should provide great improvements under some (other) conditions.
>

True, let us try to find conditions/scenarios where you think it can give
a big boost; suggestions are welcome.

>>
>> What if tablespaces are not on separate disks
>
>
> I would expect that it might very slightly degrade performance, but only marginally.
>
>
> If you want to be able to disactivate balancing, it could be done with a guc, but I cannot see good reasons to want to do that: it would complicate the code and it does not make much sense to use many tablespaces on one disk, while anyone who uses several tablespaces on several disks is probably expecting to see her expensive disks actually used in parallel.
>

I think we can leave this for the committer to take a call on, or for
anybody else who has an opinion, because there is nothing wrong in what you
have done, but I am not clear whether there is a real need for it.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Amit,

>> About the disks: what kind of HDD (RAID? speed?)? HDD write cache?
>
> Speed of Reads -
> Timing cached reads:   27790 MB in  1.98 seconds = 14001.86 MB/sec
> Timing buffered disk reads: 3830 MB in  3.00 seconds = 1276.55 MB/sec

Woops.... 14 GB/s and 1.2 GB/s?! Is this a *hard* disk??

> Copy speed -
>
> dd if=/dev/zero of=/tmp/output.img bs=8k count=256k
> 262144+0 records in
> 262144+0 records out
> 2147483648 bytes (2.1 GB) copied, 1.30993 s, 1.6 GB/s

Woops, 1.6 GB/s write... same questions, "rotating plates"?? Looks more 
like several SSD... Or the file is kept in memory and not committed to 
disk yet? Try a "sync" afterwards??

If these are SSD, or if there is some SSD cache on top of the HDD, I would 
not expect the patch to do much, because the SSD random I/O writes are 
pretty comparable to sequential I/O writes.

I would be curious whether flushing helps, though.

>>> max_wal_size=5GB
>>
>> Hmmm... Maybe quite small given the average performance?
>
> We can check with larger value, but do you expect some different
> results and why?

Because checkpoints are xlog triggered (which depends on max_wal_size) or 
time triggered (which depends on checkpoint_timeout). Given the large tps, 
I expect that the WAL is filled very quickly hence may trigger checkpoints 
every ... that is the question.

>>> checkpoint_timeout=2min
>>
>> This seems rather small. Are the checkpoints xlog or time triggered?
>
> I wanted to test by triggering more checkpoints, but I can test with
> larger checkpoint interval as wel like 5 or 10 mins. Any suggestions?

For a +2 hours test, I would suggest 10 or 15 minutes.

It would be useful to know about checkpoint stats before suggesting values 
for max_wal_size and checkpoint_timeout.

> [...] The value used in your script was 0.8 for 
> checkpoint_completion_target which I have not changed during tests.

Ok.

>>> parallelism - 128 clients, 128 threads [...]
> In next run, I can use it with 64 threads, lets settle on other parameters
> first for which you expect there could be a clear win with the first patch.

Ok.

>> Given the hardware, I would suggest to raise checkpoint_timeout, 
>> shared_buffers and max_wal_size, [...]. I would expect that it should 
>> improve performance both with and without sorting.
>
> I don't think increasing shared_buffers would have any impact, because
> 8GB is sufficient for 300 scale factor data,

It fits at the beginning, but when updates and inserts are performed 
postgres adds new pages (update = delete + insert), and the deleted space 
is eventually reclaimed by vacuum later on.

Now if space is available in the page it is reused, so what really happens 
is not that simple...

At 8500 tps the disk space extension for tables may be up to 3 MB/s at the 
beginning, and would evolve but should be at least about 0.6 MB/s (insert 
in history, assuming updates are performed in page), on average.

So whether the database fits in 8 GB shared buffer during the 2 hours of 
the pgbench run is an open question.

> checkpoint_completion_target is already 0.8 in my previous tests.  Lets 
> try with checkpoint_timeout = 10 min and max_wal_size = 15GB, do you 
> have any other suggestion?

Maybe shared_buffers = 32GB to ensure that it is a "in buffer" run ?

>> It would be interesting to have informations from checkpoint logs 
>> (especially how many buffers written in how long, whether checkpoints 
>> are time or xlog triggered, ...).

Information still welcome.

> Hmm.. nothing like that, this was based on couple of tests done by
> me and I am open to do some more if you or anybody feels that the
> first patch (checkpoint-continuous-flush-10-a) can alone gives benefit,
> in-fact I have started these tests with the intention to see if first
> patch gives benefit, then that could be evaluated and eventually
> committed separately.

Ok.

My initial question remains: is the setup using HDDs? For SSD there should 
be probably no significant benefit with sorting, although it should not 
harm, and I'm not sure about flushing.

> True, let us try to find conditions/scenarios where you think it can give
> big boost, suggestions are welcome.

HDDs?

> I think we can leave this for committer to take a call or if anybody
> else has any opinion, because there is nothing wrong in what you
> have done, but I am not clear if there is a clear need for the same.

I may have an old box available with two disks, so that I can run some 
tests with table spaces, but with very few cores.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Amit Kapila
Date:

On Tue, Sep 1, 2015 at 5:30 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Hello Amit,

About the disks: what kind of HDD (RAID? speed?)? HDD write cache?

Speed of Reads -
Timing cached reads:   27790 MB in  1.98 seconds = 14001.86 MB/sec
Timing buffered disk reads: 3830 MB in  3.00 seconds = 1276.55 MB/sec

Woops.... 14 GB/s and 1.2 GB/s?! Is this a *hard* disk??

Yes, there is no SSD in system. I have confirmed the same.  There are RAID
spinning drives.
 


Copy speed -

dd if=/dev/zero of=/tmp/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.30993 s, 1.6 GB/s

Woops, 1.6 GB/s write... same questions, "rotating plates"??

One thing to notice is that if I don't remove the output file (output.img) the
speed is much slower, see the below output. I think this means in our case
we will get ~320 MB/s

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.28086 s, 1.7 GB/s

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 6.72301 s, 319 MB/s

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 6.73963 s, 319 MB/s

If I remove the file each time:

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.2855 s, 1.7 GB/s

rm /data/akapila/output.img

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.27725 s, 1.7 GB/s

rm /data/akapila/output.img

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.27417 s, 1.7 GB/s

rm /data/akapila/output.img


 
Looks more like several SSD... Or the file is kept in memory and not committed to disk yet? Try a "sync" afterwards??


 
If these are SSD, or if there is some SSD cache on top of the HDD, I would not expect the patch to do much, because the SSD random I/O writes are pretty comparable to sequential I/O writes.

I would be curious whether flushing helps, though.


Yes, me too. I think we should try to reach a consensus on the exact scenarios
and configuration where this patch (or patches) can give a benefit, or where we
want to verify whether there is any regression, as I have access to this m/c for
a very limited time.  This m/c might get formatted soon for some other purpose.

 
max_wal_size=5GB

Hmmm... Maybe quite small given the average performance?

We can check with larger value, but do you expect some different
results and why?

Because checkpoints are xlog triggered (which depends on max_wal_size) or time triggered (which depends on checkpoint_timeout). Given the large tps, I expect that the WAL is filled very quickly hence may trigger checkpoints every ... that is the question.

checkpoint_timeout=2min

This seems rather small. Are the checkpoints xlog or time triggered?

I wanted to test by triggering more checkpoints, but I can test with
larger checkpoint interval as wel like 5 or 10 mins. Any suggestions?

For a +2 hours test, I would suggest 10 or 15 minutes.


Okay, lets keep it as 10 minutes.

I don't think increasing shared_buffers would have any impact, because
8GB is sufficient for 300 scale factor data,

It fits at the beginning, but when updates and inserts are performed postgres adds new pages (update = delete + insert), and the deleted space is eventually reclaimed by vacuum later on.

Now if space is available in the page it is reused, so what really happens is not that simple...

At 8500 tps the disk space extension for tables may be up to 3 MB/s at the beginning, and would evolve but should be at least about 0.6 MB/s (insert in history, assuming updates are performed in page), on average.

So whether the database fits in 8 GB shared buffer during the 2 hours of the pgbench run is an open question.


With this kind of configuration, I have noticed that more than 80%
of updates are HOT updates and there is not much bloat, so I think it won't
cross the 8GB limit, but still I can set it to 32GB if you have any doubts.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Amit,

>> Woops.... 14 GB/s and 1.2 GB/s?! Is this a *hard* disk??
>
> Yes, there is no SSD in system. I have confirmed the same.  There are RAID
> spinning drives.

Ok...

I guess that there is some kind of cache to explain these great tps 
figures, probably on the RAID controller. What does "lspci" say? Does 
"hdparm" suggest that the write cache is enabled? It would be fine if the 
I/O system has a BBU, but that could also hide some of the patch 
benefits...

A tentative explanation for the similar figures with and without sorting 
could be that, depending on the controller cache size (maybe 1GB or more) 
and firmware, the I/O system reorders disk writes so that they are 
basically sequential, and the fact that pg sorts them beforehand has little 
or no impact. This may also be helped by the fact that buffers are not 
really in random order to begin with, as the warmup phase does an initial 
"select stuff from table".

There could be other possible factors such as the file system details, 
"WAFL" hacks... the tricks are endless:-)

Checking for the right explanation would involve removing the 
unconditional select warmup to use only a long and random warmup, and 
probably trying a much larger than cache database, and/or disabling the 
write cache, reading the hardware documentation in detail... But this is 
also a lot of bother and time.

Maybe the simplest approach would be to disable the write cache for the 
test. Is that possible?

>> Woops, 1.6 GB/s write... same questions, "rotating plates"??
>
> One thing to notice is that if I don't remove the output file 
> (output.img) the speed is much slower, see the below output. I think 
> this means in our case we will get ~320 MB/s

I would say that the OS was doing something here, and 320 MB/s looks more 
like an actual HDD RAID system sequential write performance.

>> If these are SSD, or if there is some SSD cache on top of the HDD, I would
>> not expect the patch to do much, because the SSD random I/O writes are
>> pretty comparable to sequential I/O writes.
>>
>> I would be curious whether flushing helps, though.
>
> Yes, me too. I think we should try to reach on consensus for exact 
> scenarios and configuration where this patch('es) can give benefit or we 
> want to verify if there is any regression as I have access to this m/c 
> for a very-very limited time.  This m/c might get formatted soon for 
> some other purpose.

Yep, it would be great if you have time for a flush test before it 
disappears... I think it is advisable to disable the write cache as it may 
also hide the impact of flushing.

>> So whether the database fits in 8 GB shared buffer during the 2 hours of
>> the pgbench run is an open question.
>
> With this kind of configuration, I have noticed that more than 80%
> of updates are HOT updates, not much bloat, so I think it won't
> cross 8GB limit, but still I can keep it to 32GB if you have any doubts.

The problem with performance tests is that you want to test one thing, but 
there are many factors that intervene and you may end up testing something 
else, such as lock contention or process scheduler or whatever, rather 
than what you were trying to put in evidence. So I would suggest to be on 
the safe side and use the larger value.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>>> I would be curious whether flushing helps, though.
>> 
>> Yes, me too. I think we should try to reach on consensus for exact 
>> scenarios and configuration where this patch('es) can give benefit or we 
>> want to verify if there is any regression as I have access to this m/c for 
>> a very-very limited time.  This m/c might get formatted soon for some other 
>> purpose.
>
> Yep, it would be great if you have time for a flush test before it 
> disappears... I think it is advisable to disable the write cache as it may 
> also hide the impact of flushing.

Still thinking... Depending on the results, it might be interesting to 
have these tests run with the write cache enabled as well, to check how 
much it interferes positively with performance.

I would guess "quite a lot".

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
Hi,

Here's a bunch of comments on this (hopefully the latest?) version of
the patch:

* I'm not sure I like the FileWrite & FlushBuffer API changes. Do you
  foresee other callsites needing similar logic? Wouldn't it be just as
  easy to put this logic into the checkpointing code?

* We don't do one-line ifs; function parameters are always in the same line as the function name

* Wouldn't a binary heap over the tablespaces + progress be nicer? If you
  make the sorting criterion include the tablespace id you wouldn't need
  the lookahead loop in NextBufferToWrite().  Isn't the current approach
  O(NBuffers^2) in the worst case?

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

> Here's a bunch of comments on this (hopefully the latest?)

Who knows?! :-)

> version of the patch:
>
> * I'm not sure I like the FileWrite & FlushBuffer API changes. Do you
>  forsee other callsites needing similar logic?

I foresee that the bgwriter should also do something more sensible than 
generating random I/Os over HDDs, and this is also true for workers... But 
this is for another time, maybe.

> Wouldn't it be just as easy to put this logic into the checkpointing 
> code?

Not sure it would simplify anything, because the checkpointer currently 
knows about buffers but flushing is about files, which are hidden from 
view.

Doing it with this API change means that the code does not have to compute 
twice in which file a buffer is: the buffer/file boundary has to be broken 
somewhere anyway so that flushing can be done when needed, and the 
solution I took seems the simplest way to do it, without having to make 
the checkpointer too file conscious.

> * We don't do one-line ifs;

Ok, I'll return them.

> function parameters are always in the same line as the function name

Ok, I'll try to improve.

> * Wouldn't a binary heap over the tablespaces + progress be nicer?

I'm not sure where it would fit exactly.

Anyway, I think it would complicate the code significantly (compared to 
the straightforward array), so I would not do anything like that without a 
strong intensive, such as an actual failing case.

Moreover such a data structure would probably require some kind of pointer 
(probably 8 bytes added per node, maybe more), and the amount of memory is 
already a concern, at least to me, and moreover it has to reside in shared 
memory which does not simplify allocation of tree data structures.

> If you make the sorting criterion include the tablespace id you wouldn't 
> need the lookahead loop in NextBufferToWrite().

Yep, I thought of it. It would mean 4 more bytes per buffer, and a bsearch 
to find the boundaries, so significantly less simple code. I think that 
the current approach is ok as the number of tablespaces should be small.

It may be improved upon later if there is a motivation to do so.

> Isn't the current approach O(NBuffers^2) in the worst case?

ISTM that the overall lookahead complexity is Nbuffers * Ntablespace: 
buffers are scanned once for each tablespace. I assume that the number of 
tablespaces is kept low, and having simpler code which uses less memory 
seems a good idea.

ISTM that using a tablespace in the sorting would reduce the complexity 
to ln(NBuffers) * Ntablespace for finding the boundaries, and then 
Nbuffers * (Ntablespace/Ntablespace) = NBuffers for scanning, at the 
expense of more memory and code complexity.

So this is a voluntary design decision.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Here is a rebased two-part v11.

> * We don't do one-line ifs;

I've found one instance.

> function parameters are always in the same line as the function name

ISTM that I did that, or maybe I did not understand what I've done wrong.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Petr Jelinek
Date:
On 2015-09-06 19:05, Fabien COELHO wrote:
>
> Here is a rebased two-part v11.
>
>> function parameters are always in the same line as the function name
>
> ISTM that I did that, or maybe I did not understand what I've done wrong.
>

I see one instance of this issue
+static int
+NextBufferToWrite(
+    TableSpaceCheckpointStatus *spcStatus, int nb_spaces,
+    int *pspace, int num_to_write, int num_written)

Also
+static int bufcmp(const void * pa, const void * pb)
+{

should IMHO be formatted as
+static int
+bufcmp(const void * pa, const void * pb)
+{


And I think we generally put the struct typedefs at the top of the C 
file and don't mix them with function definitions (I am talking about 
the TableSpaceCheckpointStatus and TableSpaceCountEntry).

-- 
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Petr,

>>> function parameters are always in the same line as the function name
>> 
>> ISTM that I did that, or maybe I did not understand what I've done wrong.
>
> I see one instance of this issue
> +static int
> +NextBufferToWrite(
> +    TableSpaceCheckpointStatus *spcStatus, int nb_spaces,
> +    int *pspace, int num_to_write, int num_written)

Ok, I was looking for function calls.

> should IMHO be formatted as
> +static int
> +bufcmp(const void * pa, const void * pb)
> +{

Indeed.

> And I think we generally put the struct typedefs at the top of the C file and 
> don't mix them with function definitions (I am talking about the 
> TableSpaceCheckpointStatus and TableSpaceCountEntry).

Ok, moved up.

Thanks for the hints!  Two-part v12 attached fixes these.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-09-06 16:05:01 +0200, Fabien COELHO wrote:
> >Wouldn't it be just as easy to put this logic into the checkpointing code?
> 
> Not sure it would simplify anything, because the checkpointer currently
> knows about buffers but flushing is about files, which are hidden from
> view.

It'd not really simplify things, but it'd keep it local.

> >* Wouldn't a binary heap over the tablespaces + progress be nicer?
> 
> I'm not sure where it would fit exactly.

Imagine a binaryheap.h style heap over a structure like (tablespaceid,
progress, progress_inc, nextbuf) where the comparator compares the progress.

> Anyway, I think it would complicate the code significantly (compared to the
> straightforward array)

I doubt it. I mean instead of your GetNext you'd just do:
   next_tblspc = DatumGetPointer(binaryheap_first(heap));
   if (next_tblspc == 0)
       return 0;
   next_tblspc.progress += next_tblspc.progress_slice;
   binaryheap_replace_first(PointerGetDatum(next_tblspc));
   return next_tblspc.nextbuf++;


progress_slice is the number of buffers in the tablespace divided by the
number of total buffers, to avoid doing any sort of expensive math in
the more frequently executed path.
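
For illustration, here is a toy standalone sketch of this scheduling idea (it
does not use PostgreSQL's binaryheap.h: a plain minimum search stands in for
the heap since only a handful of tablespaces are involved, and progress_slice
is taken here as total buffers divided by the tablespace's buffers, so that
every tablespace reaches the same final progress value):

  #include <stdio.h>

  typedef struct
  {
      const char *name;
      int         nbuffers;       /* buffers this tablespace must write */
      int         nextbuf;        /* next buffer of this tablespace */
      double      progress;
      double      progress_slice;
  } TsSched;

  int main(void)
  {
      TsSched ts[] = { {"ts0", 6}, {"ts1", 3}, {"ts2", 1} };
      int     nts = 3, total = 6 + 3 + 1;

      for (int i = 0; i < nts; i++)
          ts[i].progress_slice = (double) total / ts[i].nbuffers;

      for (int written = 0; written < total; written++)
      {
          int best = -1;

          /* pick the least advanced unfinished tablespace (the heap's top) */
          for (int i = 0; i < nts; i++)
              if (ts[i].nextbuf < ts[i].nbuffers &&
                  (best < 0 || ts[i].progress < ts[best].progress))
                  best = i;

          printf("%s buf %d\n", ts[best].name, ts[best].nextbuf++);
          /* with a real heap this would be binaryheap_replace_first() */
          ts[best].progress += ts[best].progress_slice;
      }
      return 0;
  }

The output interleaves ts0, ts1 and ts2 roughly in proportion to their sizes,
without any lookahead loop over the buffers.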

> Moreover such a data structure would probably require some kind of pointer
> (probably 8 bytes added per node, maybe more), and the amount of memory is
> already a concern, at least to me, and moreover it has to reside in shared
> memory which does not simplify allocation of tree data structures.

I'm not seing where you'd need an extra pointer? Maybe the
misunderstanding is that I'm proposing to do a heap over the
*tablespaces* not the actual buffers.

> >If you make the sorting criterion include the tablespace id you wouldn't
> >need the lookahead loop in NextBufferToWrite().
> 
> Yep, I thought of it. It would mean 4 more bytes per buffer, and bsearch to
> find the boundaries, so significantly less simple code.

What for would you need to bsearch?


> I think that the current approach is ok as the number of tablespace
> should be small.

Right that's often the case.

> >Isn't the current approach O(NBuffers^2) in the worst case?
> 
> ISTM that the overall lookahead complexity is Nbuffers * Ntablespace:
> buffers are scanned once for each tablespace.

Which in the worst case is NBuffers * 2...

> ISTM that using a tablespace in the sorting would reduce the complexity to
> ln(NBuffers) * Ntablespace for finding the boundaries, and then Nbuffers *
> (Ntablespace/Ntablespace) = NBuffers for scanning, at the expense of more
> memory and code complexity.

Afaics finding the boundaries can be done as part of the enumeration of
tablespaces in BufferSync(). That code needs to be moved, but that's not
too bad. I don't see the code being that much more complicated?

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Mon, Sep 7, 2015 at 3:09 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2015-09-06 16:05:01 +0200, Fabien COELHO wrote:
> > >Wouldn't it be just as easy to put this logic into the checkpointing code?
> >
> > Not sure it would simplify anything, because the checkpointer currently
> > knows about buffers but flushing is about files, which are hidden from
> > view.
>
> It'd not really simplify things, but it'd keep it local.
>

How about using the value of guc (checkpoint_flush_to_disk) and
AmCheckpointerProcess to identify whether to do async flush in FileWrite?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Amit,

>> It'd not really simplify things, but it'd keep it local.
>
> How about using the value of guc (checkpoint_flush_to_disk) and 
> AmCheckpointerProcess to identify whether to do async flush in 
> FileWrite?

ISTM that what you suggest would just replace the added function arguments 
with global variables to communicate and keep the necessary data for 
managing the asynchronous flushing, which is called per tablespace
(1) on file changes (2) when the checkpointer is going to sleep.

Although it can be done obviously, I prefer to have functions arguments 
rather than global variables, on principle.

Also, because of (2) and of the dependency on the number of tablespaces 
being flushed, the flushing stuff cannot be fully hidden from the 
checkpointer anyway.

Also I think that probably the bgwriter should do something similar, so 
function parameters would be useful to drive flushing from it, rather than 
adding yet another set of global variables, or share the same variables 
for somehow different purposes.

So having these added parameters looks reasonable to me.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Sat, Sep 5, 2015 at 12:26 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
I would be curious whether flushing helps, though.

Yes, me too. I think we should try to reach on consensus for exact scenarios and configuration where this patch('es) can give benefit or we want to verify if there is any regression as I have access to this m/c for a very-very limited time.  This m/c might get formatted soon for some other purpose.

Yep, it would be great if you have time for a flush test before it disappears... I think it is advisable to disable the write cache as it may also hide the impact of flushing.

Still thinking... Depending on the results, it might be interesting to have these tests run with the write cache enabled as well, to check how much it interferes positively with performance.


I have done some tests with both the patches(sort+flush) and below
are results:

M/c details
--------------------
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB




Test - 1 (Data Fits in shared_buffers)
--------------------------------------------------------
non-default settings used in script provided by Fabien upthread

used below options for pgbench and the same is used for rest
of tests as well.

fw)  ## full speed parallel write pgbench
run="FW"
opts="-M prepared -P 1 -T $time $para"
;;

warmup=1000
scale=300
max_connections=300
shared_buffers=32GB
checkpoint_timeout=10min
time=7200
synchronous_commit=on
max_wal_size=15GB

para="-j 64 -c 128"
checkpoint_completion_target=0.8

checkpoint_flush_to_disk="on off"
checkpoint_sort="on off"

Flush - off and Sort - off
avg over 7203: 27480.350104 ± 12791.098857 [0.000000, 16009.400000, 32109.200000, 37629.000000, 51671.400000]
percent of values below 10.0: 2.8%

Flush - off and Sort - on
avg over 7200: 27482.501264 ± 12552.036065 [0.000000, 16587.250000, 31225.950000, 37516.450000, 51296.900000]
percent of values below 10.0: 2.8%

Flush - on and Sort - off
avg over 7200: 25214.757292 ± 11059.709509 [5268.000000, 14188.400000, 26472.450000, 35626.100000, 51479.000000]
percent of values below 10.0: 0.0%

Flush - on and Sort - on
avg over 7200: 26819.631722 ± 10589.745016 [5191.700000, 16825.450000, 29429.750000, 35707.950000, 51475.100000]
percent of values below 10.0: 0.0%

For this test run, the best results are when both the sort and flush options
are enabled: the value of the lowest TPS is increased substantially without
sacrificing much on average or median TPS values (though there is a ~9%
dip in the median TPS value).  When only sorting is enabled, there is neither
significant gain nor any loss.  When only flush is enabled, there is
significant degradation in both the average and median values of TPS, ~8%
and ~21% respectively.


Test - 2 (Data doesn't fit in shared_buffers, but fits in RAM)
----------------------------------------------------------------------------------------
warmup=1000
scale=3000
max_connections=300
shared_buffers=32GB
checkpoint_timeout=10min
time=7200
synchronous_commit=on
max_wal_size=25GB

para="-j 64 -c 128"
checkpoint_completion_target=0.8

checkpoint_flush_to_disk="on off"
checkpoint_sort="on off"

Flush - off and Sort - off
avg over 7200: 5050.059444 ± 4884.528702 [0.000000, 98.100000, 4699.100000, 10125.950000, 13631.000000]
percent of values below 10.0: 7.7%

Flush - off and Sort - on
avg over 7200: 6194.150264 ± 4913.525651 [0.000000, 98.100000, 8982.000000, 10558.000000, 14035.200000]
percent of values below 10.0: 11.0%

Flush - on and Sort - off
avg over 7200: 2771.327472 ± 1860.963043 [287.900000, 2038.850000, 2375.500000, 2679.000000, 12862.000000]
percent of values below 10.0: 0.0%

Flush - on and Sort - on
avg over 7200: 6110.617722 ± 1939.381029 [1652.200000, 5215.100000, 5724.000000, 6196.550000, 13828.000000]
percent of values below 10.0: 0.0%


For this test run, again the best results are when both the sort and flush
options are enabled: the value of the lowest TPS is increased substantially,
and the average and median values of TPS have also increased, by
~21% and ~22% respectively.  When only sorting is enabled, there is a
significant gain in average and median TPS values, but then there is also
an increase in the number of times when TPS is below 10, which is bad.
When only flush is enabled, there is significant degradation in both the average
and median values of TPS, by ~82% and ~97% respectively; now I am not
sure if such a big degradation could be expected for this case or if it's just
a problem in this run, I have not repeated this test.


Test - 3 (Data doesn't fit in shared_buffers, but fits in RAM)
----------------------------------------------------------------------------------------
Same configuration and settings as above, but this time, I have enforced
Flush to use posix_fadvise() rather than sync_file_range()  (basically changed
code to comment out sync_file_range() and enable posix_fadvise()).

Flush - on and Sort - on
avg over 7200: 3400.915069 ± 739.626478 [1642.100000, 2965.550000, 3271.900000, 3558.800000, 6763.000000]
percent of values below 10.0: 0.0%

On using posix_fadvise(), the results for the best case (both flush and sort
on) show significant degradation in average and median TPS values,
by ~48% and ~43%, which indicates that using posix_fadvise()
with the current options might not be the best way to achieve the flush.


Overall, I think this patch (sort+flush) brings a lot of value to the table in
terms of stabilizing the TPS during checkpoints; however some of the
cases, like the use of posix_fadvise() and the case where all data fits in
shared_buffers and the median TPS regresses, could be investigated
to see what can be done to improve them.  I think more tests can be done
to confirm the benefit or regression of this patch, but for now this is the
best I can do.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Amit,

> I have done some tests with both the patches(sort+flush) and below
> are results:

Thanks a lot for these runs on this great harware!

> Test - 1 (Data Fits in shared_buffers)

Rounded for easier comparison:
  flush/sort
  off off: 27480.4 ± 12791.1 [   0, 16009, 32109, 37629, 51671] (2.8%)
  off on : 27482.5 ± 12552.0 [   0, 16587, 31226, 37516, 51297] (2.8%)

The two cases above are pretty indistinguishable; sorting has no impact. 
The 2.8% means more than 1 minute offline per hour (2.8% of 3600 s is about 
100 s; not necessarily a contiguous minute, it may be distributed over the 
whole hour).
  on  off: 25214.8 ± 11059.7 [5268, 14188, 26472, 35626, 51479] (0.0%)
  on  on : 26819.6 ± 10589.7 [5192, 16825, 29430, 35708, 51475] (0.0%)

> For this test run, the best results are when both the sort and flush 
> options are enabled, the value of lowest TPS is increased substantially 
> without sacrificing much on average or median TPS values (though there 
> is ~9% dip in median TPS value).  When only sorting is enabled, there is 
> neither significant gain nor any loss.  When only flush is enabled, 
> there is significant degradation in both average and median value of TPS 
> ~8% and ~21% respectively.

I interpret the five numbers in brackets as an indicator of performance
stability: they should be equal for perfect stability. Once they show some
stability, the next point for me is to focus on the average performance. I
do not see a median decrease as a big issue if the average is reasonably
good.

Thus I essentially note the -2.5% dip on average for on-on vs off-on. I
would say that it is probably significant, although it might be within the
error margin of the measurement. I am not sure whether the small stddev
reduction is really significant. Anyway the benefit is clear: 100%
availability.

Flushing without sorting is a bad idea (tm), not a surprise.

> Test - 2 (Data doesn't fit in shared_buffers, but fits in RAM)
  flush/sort
  off off: 5050.1 ± 4884.5 [   0,   98, 4699, 10126, 13631] ( 7.7%)
  off on : 6194.2 ± 4913.5 [   0,   98, 8982, 10558, 14035] (11.0%)
  on  off: 2771.3 ± 1861.0 [ 288, 2039, 2375,  2679, 12862] ( 0.0%)
  on  on : 6110.6 ± 1939.3 [1652, 5215, 5724,  6196, 13828] ( 0.0%)

I'm not sure that the off-on vs on-on -1.3% avg tps dip is significant,
but it may be. With both flushing and sorting pg becomes fully available,
and the standard deviation is divided by more than 2, so the benefit is
clear.

> For this test run, again the best results are when both the sort and flush
> options are enabled, the value of lowest TPS is increased substantially
> and the average and median value of TPS has also increased to
> ~21% and ~22% respectively.  When only sorting is enabled, there is a
> significant gain in average and median TPS values, but then there is also
> an increase in number of times when TPS is below 10 which is bad.
> When only flush is enabled, there is significant degradation in both average
> and median value of TPS to ~82% and ~97% respectively, now I am not
> sure if such a big degradation could be expected for this case or it's just
> a problem in this run, I have not repeated this test.

Yes, I agree that it is strange that sorting without flushing both improves
performance (+20% tps) and yet seems to degrade availability at the same
time. A rerun would have helped to check whether it is a fluke or whether
it is reproducible.

> Test - 3 (Data doesn't fit in shared_buffers, but fits in RAM)
> ----------------------------------------------------------------------------------------
> Same configuration and settings as above, but this time, I have enforced
> Flush to use posix_fadvise() rather than sync_file_range()  (basically
> changed code to comment out sync_file_range() and enable posix_fadvise()).
>
> On using posix_fadvise(), the results for best case (both flush and sort as
> on) shows significant degradation in average and median TPS values
> by ~48% and ~43% which indicates that probably using posix_fadvise()
> with the current options might not be the best way to achieve Flush.

Yes, indeed.

The way posix_fadvise is implemented on Linux is somewhere between no effect
and a bad effect (the buffer is erased). You hit the latter quite strongly...
As you are doing a "does not fit in shared_buffers" test, it is essential
that buffers are kept in RAM, but posix_fadvise on Linux just instructs the
kernel to erase the buffer from memory if it has already been passed to the
I/O subsystem, which given the probably large I/O device cache on your host
should happen pretty quickly, so later reads must fetch the data back from
the device (either cache or disk), which means a drop in performance.
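
To make the contrast concrete, here is a minimal sketch of the two calls
being compared (illustrative placeholders for fd/offset/length, error
handling omitted; this is not the patch code itself):

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* Linux: start writeback of the range, keep the pages in the cache. */
    static void hint_write_linux(int fd, off_t offset, off_t nbytes)
    {
        (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    }

    /* Portable variant: tell the kernel the range is no longer needed.
     * On Linux this tends to drop already-written pages from the cache,
     * which is exactly what hurts the "does not fit in shared_buffers"
     * test above. */
    static void hint_write_portable(int fd, off_t offset, off_t nbytes)
    {
        (void) posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
    }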

Note that the FreeBSD implementation seems more convincing, although less
good than Linux's sync_file_range function. I have no idea about other
systems.

> Overall, I think this patch (sort+flush) brings a lot of value on table 
> in terms of stablizing the TPS during checkpoint, however some of the 
> cases like use of posix_fadvise() and the case (all data fits in 
> shared_buffers) where the value of median TPS is regressed could be 
> investigated to see what can be done to improve them.  I think more 
> tests can be done to ensure the benefit or regression of this patch, but 
> for now this is what best I can do.

Thanks a lot, again, for these tests!

I think that we may conclude, from these runs:

(1) sorting seems not to harm performance, and may help a lot.

(2) Linux flushing with sync_file_range may degrade raw tps average a
    little in some cases, but definitely improves performance stability
    (always 100% availability when on!).

(3) posix_fadvise on Linux is a bad idea... the good news is that it
    is not needed there:-) How good or bad an idea it is on other
    systems is an open question...

These results are consistent with the current default values in the patch: 
sorting is on by default, flushing is on with Linux and off otherwise 
(posix_fadvise).

Also, as the effect on other systems is unclear, I think it is best to 
keep both settings as GUCs for now.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Tue, Sep 8, 2015 at 8:09 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> Thanks a lot, again, for these tests!
>
> I think that we may conclude, on these run:
>
> (1) sorting seems not to harm performance, and may help a lot.
>

I agree with the first part, but I am not sure about "helping a lot": based
on the tests I conducted, among all the runs it has shown an improvement in
average TPS in only one case, and that too with a dip in the number of times
the TPS is below 10.

> (2) Linux flushing with sync_file_range may degrade a little raw tps
>     average in some case, but definitely improves performance stability
>     (always 100% availability when on !).
>

Agreed, I think the benefit is quite clear, but it would be better if we run
some more tests for the cases (data fits in shared_buffers) where we saw a
small regression, just to make sure that the regression stays small.

> (3) posix_fadvise on Linux is a bad idea... the good news is that it
>     is not needed there:-) How good or bad an idea it is on other system
>     is an open question...
>

I don't know the best way to verify that; if somebody else has access to
such a machine, please help to get that verified.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Amit,

>> I think that we may conclude, on these run:
>>
>> (1) sorting seems not to harm performance, and may help a lot.
>
> I agree with first part, but about helping a lot, I am not sure

I'm focusing on the "sort" dimension alone, that is, I'm comparing the
average tps performance with sorting against the same test without sorting.
There are 4 cases from your tests, if I'm not mistaken:
 - T1 flush=off  27480 -> 27482 :    +0.0%
 - T1 flush=on   25214 -> 26819 :    +6.3%
 - T2 flush=off   5050 ->  6194 :   +22.6%
 - T2 flush=on    2771 ->  6110 :  +120.4%

The average improvement induced by sort=on is +50%; if you do not agree on
"a lot", maybe we can agree on "significantly":-)

> based on the tests conducted by me, among all the runs, it has shown 
> improvement in average TPS is one case and that too with a dip in number 
> of times the TPS is below 10.

>> (2) Linux flushing with sync_file_range may degrade a little raw tps
>>     average in some case, but definitely improves performance stability
>>     (always 100% availability when on !).
>
> Agreed, I think the benefit is quite clear, but it would be better if we try
> to do some more test for the cases (data fits in shared_buffers) where
> we saw small regression just to make sure that regression is small.

I've already reported a lot of tests (several hundred hours on two
different hosts), and I did not have such a dip, but the hardware was more
"usual" or "casual", really different from your runs.

If you can run more tests, great!

I think that the main safeguard against the (small) remaining uncertainty
is to keep GUCs to control these features.

>> (3) posix_fadvise on Linux is a bad idea... the good news is that it
>>     is not needed there:-) How good or bad an idea it is on other system
>>     is an open question...
>
> I don't know what is the best way to verify that, if some body else has
> access to such a m/c, please help to get that verified.

Yep. There have been such calls on this thread, which have not been very
effective up to now.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Robert Haas
Date:
On Tue, Sep 8, 2015 at 11:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> (3) posix_fadvise on Linux is a bad idea... the good news is that it
>>     is not needed there:-) How good or bad an idea it is on other system
>>     is an open question...
>
> I don't know what is the best way to verify that, if some body else has
> access to such a m/c, please help to get that verified.

Why wouldn't we just leave it out then?  Putting it in when the one
platform we've tried it on shows a regression makes no sense.  We
shouldn't include it and then remove it if someone can prove it's bad;
we should only include it in the first place if we have good
benchmarks showing that it is good.

Does anyone have a big Windows box they can try this on?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>>> (3) posix_fadvise on Linux is a bad idea... the good news is that it
>>>     is not needed there:-) How good or bad an idea it is on other system
>>>     is an open question...
>>
>> I don't know what is the best way to verify that, if some body else has
>> access to such a m/c, please help to get that verified.
>
> Why wouldn't we just leave it out then? Putting it in when the one 
> platform we've tried it on shows a regression makes no sense.  We 
> shouldn't include it and then remove it if someone can prove it's bad; 
> we should only include it in the first place if we have good benchmarks 
> showing that it is good.
>
> Does anyone have a big Windows box they can try this on?

Just a box with a disk would be enough, it does not need to be big!

As I wrote before, FreeBSD would be a good candidate because its
posix_fadvise seems much more reasonable than on Linux, and should be
profitable, so it would be a pity to remove it.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-09-09 20:56:15 +0200, Fabien COELHO wrote:
> As I wrote before, FreeBSD would be a good candidate because the
> posix_fadvise seems much more reasonable than on Linux, and should be
> profitable, so it would be a pity to remove it.

Why do you think it's different on fbsd? Also, why is it unreasonable
that DONTNEED removes stuff from the cache?

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

>>> Wouldn't it be just as easy to put this logic into the checkpointing 
>>> code?
>>
>> Not sure it would simplify anything, because the checkpointer currently
>> knows about buffers but flushing is about files, which are hidden from
>> view.
>
> It'd not really simplify things, but it'd keep it local.

Ok, it would be local, but it would also mean that the checkpointer would
have to deal explicitly with files, whereas it currently does not have to.

I think that the current buffer/file boundary is, on engineering principle,
a good one, so I tried to break it as little as possible to enable the
feature, and I wanted to avoid having to do a buffer-to-file translation
twice, once in the checkpointer and once when writing the buffer.

>>> * Wouldn't a binary heap over the tablespaces + progress be nicer?
>>
>> I'm not sure where it would fit exactly.
>
> Imagine a binaryheap.h style heap over a structure like (tablespaceid,
> progress, progress_inc, nextbuf) where the comparator compares the progress.

It would replace what is currently an array. The balancing code needs to
enumerate all tablespaces in a round-robin way so as to ensure that all
tablespaces are given some attention, otherwise you could have a balance
between two tablespaces while others are left out. The array makes this
property straightforward.
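
For illustration, the array-based balancing amounts to something like the
following (a toy sketch with made-up names, not the patch code): each
tablespace advances a progress fraction, and the next buffer is always
taken from the least-advanced tablespace, found by a linear scan that is
cheap for the usual 1-3 tablespaces.

    typedef struct TsState
    {
        double progress;   /* fraction of this tablespace's writes done */
        double slice;      /* 1.0 / number of buffers to write in it */
        int    next_buf;   /* index of its next buffer in the sorted array */
        int    remaining;  /* buffers still to write for this tablespace */
    } TsState;

    static int next_buffer(TsState *ts, int nts)
    {
        int best = -1;

        for (int i = 0; i < nts; i++)
            if (ts[i].remaining > 0 &&
                (best < 0 || ts[i].progress < ts[best].progress))
                best = i;

        if (best < 0)
            return -1;                        /* all tablespaces are done */

        ts[best].progress += ts[best].slice;  /* advance its progress */
        ts[best].remaining--;
        return ts[best].next_buf++;           /* hand out its next buffer */
    }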


>> Anyway, I think it would complicate the code significantly (compared to the
>> straightforward array)
>
> I doubt it. I mean instead of your GetNext you'd just do:
>    next_tblspc = DatumGetPointer(binaryheap_first(heap));
>    if (next_tblspc == 0)
>        return 0;
>    next_tblspc.progress += next_tblspc.progress_slice;
>    binaryheap_replace_first(PointerGetDatum(next_tblspc));
>
>    return next_tblspc.nextbuf++;

Compared to the array, this tree approach would require ln(Ntablespace)
work to extract and reinsert the tablespace in progress, so there is no
complexity advantage.

Moreover, given that in most cases there are 1 or 2 tablespaces, a tree 
structure is really on the heavy side.

> progress_slice is the number of buffers in the tablespace divided by the
> number of total buffers, to avoid doing any sort of expensive math in
> the more frequently executed path.

If there are many buffers, I'm not too sure about rounding issues and the
like, so the current approach with a rational number seems safer.
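
To make that concrete, progress fractions can be compared exactly by
cross-multiplying in 64-bit integers instead of accumulating float slices
(a minimal illustration with made-up names, not the patch code):

    #include <stdint.h>
    #include <stdbool.h>

    /* true if a_written/a_to_write < b_written/b_to_write, computed
     * without any floating-point rounding */
    static bool less_advanced(int64_t a_written, int64_t a_to_write,
                              int64_t b_written, int64_t b_to_write)
    {
        return a_written * b_to_write < b_written * a_to_write;
    }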

> [...] I'm not seing where you'd need an extra pointer?

Indeed, I misunderstood.

> [...] What for would you need to bsearch?

To find the tablespace boundaries in the sorted buffer array in 
log(NBuffers) * Ntablespace, instead of NBuffers.

>> I think that the current approach is ok as the number of tablespace
>> should be small.
>
> Right that's often the case.

Yep.

>> ISTM that using a tablespace in the sorting would reduce the complexity to
>> ln(NBuffers) * Ntablespace for finding the boundaries, and then Nbuffers *
>> (Ntablespace/Ntablespace) = NBuffers for scanning, at the expense of more
>> memory and code complexity.
>
> Afaics finding the boundaries can be done as part of the enumeration of
> tablespaces in BufferSync(). That code needs to be moved, but that's not
> too bad. I don't see the code be that much more complicated?

Hmmm. You are proposing to replace proven and heavily tested code with a
more complex tree data structure distributed quite differently around the
source, with no very clear benefit.

So I would prefer to keep the code as is, which is pretty straightforward,
and wait for a strong incentive before doing anything fancier. ISTM that
there are other places in pg that need attention more than further tweaking
this patch.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-09-09 21:29:12 +0200, Fabien COELHO wrote:
> >Imagine a binaryheap.h style heap over a structure like (tablespaceid,
> >progress, progress_inc, nextbuf) where the comparator compares the progress.
> 
> It would replace what is currently an array.

It'd still be one afterwards.

> The balancing code needs to enumerate all tablespaces in a round-robin
> way so as to ensure that all tablespaces are given some attention,
> otherwise you could have a balance on two tablespaces but others could
> be left out. The array makes this property straightforward.

Why would a heap as I've described it require that?

> >>Anyway, I think it would complicate the code significantly (compared to the
> >>straightforward array)
> >
> >I doubt it. I mean instead of your GetNext you'd just do:
> >   next_tblspc = DatumGetPointer(binaryheap_first(heap));
> >   if (next_tblspc == 0)
> >       return 0;
> >   next_tblspc.progress += next_tblspc.progress_slice;
> >   binaryheap_replace_first(PointerGetDatum(next_tblspc));
> >
> >   return next_tblspc.nextbuf++;
> 
> Compare to the array, this tree approach would required ln(Ntablespace) work
> to extract and reinsert the tablespace under progress, so there is no
> complexity advantage.

extract/reinsert is actually O(1).


> >progress_slice is the number of buffers in the tablespace divided by the
> >number of total buffers, to avoid doing any sort of expensive math in
> >the more frequently executed path.
> 
> If there are many buffers, I'm not too sure about rounding issues and the
> like, so the current approach with a rational seems more secure.

Meh. The amount of imbalance introduced by rounding won't matter.

> >>ISTM that using a tablespace in the sorting would reduce the complexity to
> >>ln(NBuffers) * Ntablespace for finding the boundaries, and then Nbuffers *
> >>(Ntablespace/Ntablespace) = NBuffers for scanning, at the expense of more
> >>memory and code complexity.
> >
> >Afaics finding the boundaries can be done as part of the enumeration of
> >tablespaces in BufferSync(). That code needs to be moved, but that's not
> >too bad. I don't see the code be that much more complicated?
> 
> Hmmm. you are proposing to replace prooved and heavilly tested code by a
> more complex tree data structures distributed quite differently around the
> source, and no very clear benefit.

There's no "proved and heavily tested code" touched here.

> So I would prefer to keep the code as is, that is pretty straightforward,
> and wait for a strong incentive before doing anything fancier.

I find the proposed code not particularly pretty, so I don't really buy
the straightforwardness argument.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>> As I wrote before, FreeBSD would be a good candidate because the 
>> posix_fadvise seems much more reasonable than on Linux, and should be 
>> profitable, so it would be a pity to remove it.
>
> Why do you think it's different on fbsd? Also, why is it unreasonable 
> that DONNEED removes stuff from the cache?

Yep, I agree that this part is a bad point, obviously, but at least there
is also some advantage: I understood that buffers are actually pushed
towards the disk when calling posix_fadvise with DONTNEED on FreeBSD, so
in-buffer tests should see better performance, and out-of-buffer in-memory
tests would probably be degraded, as Amit's test showed on Linux. As an
admin I can choose, if I know whether I run in-buffer or not.

On Linux either the call is ignored (if the page is not written yet) or 
the page is coldly removed, so it has either no effect or a bad effect, 
basically.

So I think that the current off default when running with posix_fadvise is
reasonable, and in some cases turning it on can probably provide better
performance stability, especially for in-buffer runs.

Now, frankly, I do not care much about FreeBSD or Windows, so I'm fine
with dropping posix_fadvise if this is a blocker.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>> It would replace what is currently an array.
>
> It'd still be one afterwards.
> [...]
> extract/reinsert is actually O(1).

Hm, strange. I probably did not understand the heap structure you're
suggesting at all. No big deal.

> [...] Why would a heap as I've described it require that?

Hmmm... The heap does *not* require anything, the *balancing* requires 
this property.

> [...] There's no "proved and heavily tested code" touched here.

I've proven and heavily tested the submitted patch based on an array, which
you want to replace with some heap, so I think that my point stands.

Moreover, I do not see a clear benefit in changing the data structure.

>> So I would prefer to keep the code as is, that is pretty straightforward,
>> and wait for a strong incentive before doing anything fancier.
>
> I find the proposed code not particularly pretty, so I don't really buy
> the straightforwardness argument.

No big deal. From my point of view, the data structure change you're 
suggesting does not bring significant value, so there is no good reason to 
do it.

If you want to submit another patch, this is free software, please 
proceed.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Wed, Sep 9, 2015 at 2:31 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> Hello Amit,
>
>>> I think that we may conclude, on these run:
>>>
>>> (1) sorting seems not to harm performance, and may help a lot.
>>
>>
>> I agree with first part, but about helping a lot, I am not sure
>
>
> I'm focussing on the "sort" dimension alone, that is I'm comparing the average tps performance with sorting with the same test without sorting, : There are 4 cases from your tests, if I'm not mistaken:
>
>  - T1 flush=off  27480 -> 27482 :    +0.0%
>  - T1 flush=on   25214 -> 26819 :    +6.3%
>  - T2 flush=off   5050 ->  6194 :   +22.6%
>  - T2 flush=on    2771 ->  6110 :  +120.4%
>

There is a clear win only in cases where sort is used with flush; apart
from that, using sort alone doesn't have any clear advantage.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Jeff Janes
Date:
On Wed, Sep 9, 2015 at 12:12 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-09-09 20:56:15 +0200, Fabien COELHO wrote:
> As I wrote before, FreeBSD would be a good candidate because the
> posix_fadvise seems much more reasonable than on Linux, and should be
> profitable, so it would be a pity to remove it.

Why do you think it's different on fbsd? Also, why is it unreasonable
that DONNEED removes stuff from the cache?


It seems kind of silly that it means "No one, even people I am not aware of and have no right to speak for, needs this" as opposed to "I don't need this, don't keep it around on my behalf."

Cheers,

Jeff

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Amit,

>>  - T1 flush=off  27480 -> 27482 :    +0.0%
>>  - T1 flush=on   25214 -> 26819 :    +6.3%
>>  - T2 flush=off   5050 ->  6194 :   +22.6%
>>  - T2 flush=on    2771 ->  6110 :  +120.4%
>
> There is a clear win only in cases when sort is used with flush, apart
> from that using sort alone doesn't have any clear advantage.

Indeed, I agree that the improvement is much smaller without flushing, 
although it is there somehow (+0.0 & +22.6 => +11.3% on average).

Well, at least we may agree that it is "somehow significantly better" ?:-)

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
> Thanks for the hints!  Two-part v12 attached fixes these.

Here is a v13, which is just a rebase after 1aba62ec.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello,

> [...] If you make the sorting criterion include the tablespace id you 
> wouldn't need the lookahead loop in NextBufferToWrite().

I'm considering this precise point, i.e. including the tablespace as
a sorting criterion.

Currently the array used for sorting is 16 bytes per buffer (although I
wrote 12 in another mail, I was wrong...). The data include the bufid (4
bytes), the relation & fork num (8 bytes, of which really only 4 bytes + 2
bits are used), and the block number (4 bytes), which is the offset within
the relation. These three combined allow finding the file, and the offset
within that file, for the given buffer id.

I'm concerned that these 16 bytes are already significant and I do not 
want to extend them any more. I was already pretty happy with the previous 
version with 4 bytes per buffer.

Now, as the number of tablespaces is expected to be very small (1, 2, maybe
3), there is no problem packing it within the unused 30 bits of forknum.
That would mean some masking and casts here and there, so it would not be
very beautiful, but it would make it easy to find the buffers for a given
tablespace, and indeed remove the lookahead stuff in the next-buffer
function, as you suggest. A sketch of such a packed key is below.
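
Hypothetical sketch of such a packed sort key (field widths and layout are
illustrative only, not the patch's actual encoding): the fork number needs
only 2 bits, so a small tablespace index can share the same 32-bit word and
become part of the sort comparison.

    #include <stdint.h>

    typedef struct SortEntry
    {
        uint32_t bufid;     /* buffer id */
        uint32_t relnode;   /* relation */
        uint32_t fork_ts;   /* 2 bits of fork number + tablespace index */
        uint32_t blocknum;  /* block offset within the relation */
    } SortEntry;

    #define FORK_BITS 2
    #define FORK_MASK ((1u << FORK_BITS) - 1)

    static inline uint32_t pack_fork_ts(uint32_t forknum, uint32_t ts_index)
    {
        return (ts_index << FORK_BITS) | (forknum & FORK_MASK);
    }

    static inline uint32_t unpack_fork(uint32_t fork_ts) { return fork_ts & FORK_MASK; }
    static inline uint32_t unpack_ts(uint32_t fork_ts)   { return fork_ts >> FORK_BITS; }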


My question is: would that be acceptable, or would someone object to the 
use of masks and things like that?  The benefit would be a simpler/more 
direct next buffer function, but some more tinkering around the sorting 
criterion to use a packed representation.

Note that I do not think that it would have any actual impact on 
performance... it would only make a difference if there were really many 
tablespaces (the scanning complexity would be Nbuffer instead of 
Nbuffer*Ntablespace, but as Ntablespace is small...).  My motivation is 
rather to help the patch get through, so I'm fine if this is not needed.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
Hi,

On 2015-09-10 17:15:26 +0200, Fabien COELHO wrote:
> 
> >Thanks for the hints!  Two-part v12 attached fixes these.
> 
> Here is a v13, which is just a rebase after 1aba62ec.

I'm working on this patch, to get it into a state I think would be
committable.

My performance testing showed that calling PerformFileFlush() only at
segment boundaries and in CheckpointWriteDelay() can lead to rather spiky
IO - not that surprising. The sync in CheckpointWriteDelay() is problematic
because it is only triggered while on schedule, and not when behind. My
testing seems to show that just adding a limit of 32 buffers to
FileAsynchronousFlush() leads to markedly better results.
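
In other words, accumulate a bounded contiguous range and hand it to the
kernel once it reaches a small cap (a sketch with hypothetical names, not
the actual FileAsynchronousFlush() code):

    #define _GNU_SOURCE
    #include <fcntl.h>

    #define FLUSH_CAP_BLOCKS 32             /* ~256 KB with 8 KB blocks */

    typedef struct FlushRange { int fd; off_t start; off_t len; int nblocks; } FlushRange;

    static void flush_range(FlushRange *r)
    {
        if (r->nblocks > 0)
            (void) sync_file_range(r->fd, r->start, r->len, SYNC_FILE_RANGE_WRITE);
        r->len = 0;
        r->nblocks = 0;
    }

    static void note_write(FlushRange *r, int fd, off_t off, off_t len)
    {
        /* not contiguous with what was accumulated so far: flush it first */
        if (r->nblocks > 0 && (fd != r->fd || off != r->start + r->len))
            flush_range(r);
        if (r->nblocks == 0)
        {
            r->fd = fd;
            r->start = off;
        }
        r->len += len;
        if (++r->nblocks >= FLUSH_CAP_BLOCKS)
            flush_range(r);                 /* cap reached: hand it to the kernel */
    }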

I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
might even be possible to later approximate that on windows using
FlushViewOfFile().

As far as I can see the while (nb_spaces != 0)/NextBufferToWrite() logic
doesn't work correctly if tablespaces aren't actually sorted. I'm
actually inclined to fix this by simply removing the flag to
enable/disable sorting.

Having defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE) in
so many places looks ugly, I want to push that to the underlying
functions. If we add a different flushing approach we shouldn't have to
touch several places that don't actually really care.

I've replaced the NextBufferToWrite() logic with a binaryheap.h heap -
seems to work well, with a bit less code actually.


I'll post this after some more cleanup & testing.


I've also noticed that the sleeping logic in CheckpointWriteDelay() isn't
particularly good. In high throughput workloads the 100ms sleep is too
long, leading to bursty IO behaviour. If 1k+ buffers are written out per
second, 100ms is a rather long sleep. For another, the fact that we only
sleep 100ms when the write rate is low makes the checkpoint finish rather
quickly - on a slow disk (say a microSD) that can cause unnecessary
slowdowns for concurrent activity.  ISTM we should calculate the sleep
time in a better way. The SIGHUP behaviour is also weird.  Anyway, this
probably belongs on a new thread.


Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

>> Here is a v13, which is just a rebase after 1aba62ec.
>
> I'm working on this patch, to get it into a state I think it'd be
> commitable.

I'll review it carefully. Also, if you can include some performance
figures it would help, even if I'll do some more runs.

> In my performance testing it showed that calling PerformFileFlush() only 
> at segment boundaries and in CheckpointWriteDelay() can lead to rather 
> spikey IO - not that surprisingly. The sync in CheckpointWriteDelay() is 
> problematic because it only is triggered while on schedule, and not when 
> behind.

When behind, the PerformFileFlush should be called on segment boundaries.
The idea was not to go to sleep without flushing, and to do it as little 
as possible.

> My testing seems to show that just adding a limit of 32 buffers to
> FileAsynchronousFlush() leads to markedly better results.

Hmmm. 32 buffers means 256 KB, which is quite small. I am not sure what a
good "limit" would be. It could depend on whether pages are close together
or not.

> I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
> sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
> might even be possible to later approximate that on windows using
> FlushViewOfFile().

I'm not sure that mmap/msync can be used for this purpose, because there
seems to be no real control over where the file is mmapped.

> As far as I can see the while (nb_spaces != 0)/NextBufferToWrite() logic 
> doesn't work correctly if tablespaces aren't actually sorted. I'm 
> actually inclined to fix this by simply removing the flag to 
> enable/disable sorting.

I do not think that there is a significant downside to having sort always
on, but showing that requires being able to test, hence a GUC. The point of
the GUC is to demonstrate that the feature is harmless:-)

> Having defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE) in
> so many places looks ugly, I want to push that to the underlying
> functions. If we add a different flushing approach we shouldn't have to
> touch several places that don't actually really care.

I agree that it is pretty ugly, but I do not think that you can remove
them all. You need at least one for checking the GUC and one for enabling
the feature. Maybe their number could be reduced if the functions are
switched to do-nothing stubs which are called nevertheless, but I was not
keen on leaving unused code around when there is neither sync_file_range
nor posix_fadvise.

> I've replaced the NextBufferToWrite() logic with a binaryheap.h heap -
> seems to work well, with a bit less code actually.

Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3
element set in most cases is an improvement.

> I'll post this after some more cleanup & testing.

I'll have a look when it is ready.

> I've also noticed that sleeping logic in CheckpointWriteDelay() isn't
> particularly good. In high throughput workloads the 100ms sleep is too
> long, leading to bursty IO behaviour. If 1k+ buffers a written out a
> second 100ms is a rather long sleep. For another that we only sleep
> 100ms when the write rate is low makes the checkpoint finish rather
> quickly - on a slow disk (say microsd) that can cause unneccesary
> slowdowns for concurrent activity.  ISTM we should calculate the sleep
> time in a better way.

I also noted this point, but I'm not sure how to find a better approach,
so I left it as it is. I tried 50 ms & 200 ms on some runs, without
significant effect on performance for the tests I ran then. The point of
having a not-too-small value is that it provides some significant work to
the IO subsystem without overflowing it. On average it does not matter.
I'm unsure how it would interact with flushing. So I decided not to do
anything about it. Maybe it should be a GUC, but I would not know how to
choose its value.

> The SIGHUP behaviour is also weird.  Anyway, this probably belongs on a 
> new thread.

Probably. I did not try to look at that.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Mon, Oct 19, 2015 at 4:06 AM, Andres Freund <andres@anarazel.de> wrote:
>
>
> I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
> sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
> might even be possible to later approximate that on windows using
> FlushViewOfFile().
>

I think this idea is worth exploring, especially because we can have a
Windows equivalent for this optimisation.  Could this option by any
chance lead to an increase in memory usage, as mmap has to
map the file(s)?


With Regards,
Amit Kapila.

Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-10-19 21:14:55 +0200, Fabien COELHO wrote:
> >In my performance testing it showed that calling PerformFileFlush() only
> >at segment boundaries and in CheckpointWriteDelay() can lead to rather
> >spikey IO - not that surprisingly. The sync in CheckpointWriteDelay() is
> >problematic because it only is triggered while on schedule, and not when
> >behind.
> 
> When behind, the PerformFileFlush should be called on segment
> boundaries.

That means it's flushing up to a gigabyte of data at once. Far too much.
The implementation will pretty much always go behind schedule for some
time. Since sync_file_range() doesn't flush in the foreground I don't
think it's important to do the flushing in concert with sleeping.

> >My testing seems to show that just adding a limit of 32 buffers to
> >FileAsynchronousFlush() leads to markedly better results.
> 
> Hmmm. 32 buffers means 256 KB, which is quite small.

Why? The aim is to not overwhelm the request queue - which is where the
coalescing is done. And usually that's rather small. If you flush much more
sync_file_range starts to do work in the foreground.

> >I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
> >sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
> >might even be possible to later approximate that on windows using
> >FlushViewOfFile().
> 
> I'm not sure that mmap/msync can be used for this purpose, because there is
> no real control it seems about where the file is mmapped.

I'm not following? Why does it matter where a file is mapped?

I have had a friend (Christian Kruse, thanks!)  confirm that at least on
OSX msync(MS_ASYNC) triggers writeback. A freebsd dev confirmed that
that should be the case on freebsd too.

> >Having defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE) in
> >so many places looks ugly, I want to push that to the underlying
> >functions. If we add a different flushing approach we shouldn't have to
> >touch several places that don't actually really care.
> 
> I agree that it is pretty ugly, but I do not think that you can remove them
> all.

Sure, never said all. But most.

> >I've replaced the NextBufferToWrite() logic with a binaryheap.h heap -
> >seems to work well, with a bit less code actually.
> 
> Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3 element
> set in most case is an improvement.

Yes, it'll not matter that much in many cases. But I rather disliked the
NextBufferToWrite() implementation, especially that it walks the array
multiple times. And I did see setups with ~15 tablespaces.

> >I've also noticed that sleeping logic in CheckpointWriteDelay() isn't
> >particularly good. In high throughput workloads the 100ms sleep is too
> >long, leading to bursty IO behaviour. If 1k+ buffers a written out a
> >second 100ms is a rather long sleep. For another that we only sleep
> >100ms when the write rate is low makes the checkpoint finish rather
> >quickly - on a slow disk (say microsd) that can cause unneccesary
> >slowdowns for concurrent activity.  ISTM we should calculate the sleep
> >time in a better way.
> 
> I also noted this point, but I'm not sure how to have a better approach, so
> I let it as it is. I tried 50 ms & 200 ms on some runs, without significant
> effect on performance for the test I ran then. The point of having not too
> small a value is that it provide some significant work to the IO subsystem
> without overflowing it.

I don't think that makes much sense. All a longer sleep achieves is
creating a larger burst of writes afterwards. We should really sleep
adaptively.


Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

>>> In my performance testing it showed that calling PerformFileFlush() only
>>> at segment boundaries and in CheckpointWriteDelay() can lead to rather
>>> spikey IO - not that surprisingly. The sync in CheckpointWriteDelay() is
>>> problematic because it only is triggered while on schedule, and not when
>>> behind.
>>
>> When behind, the PerformFileFlush should be called on segment
>> boundaries.
>
> That means it's flushing up to a gigabyte of data at once. Far too
> much.

Hmmm. I do not get it. There would not be gigabytes, there would be as 
much as was written since the last sleep, about 100 ms ago, which is not 
likely to be gigabytes?

> The implementation pretty always will go behind schedule for some
> time. Since sync_file_range() doesn't flush in the foreground I don't
> think it's important to do the flushing in concert with sleeping.

For me it is important to avoid accumulating too large flushes, and that 
is the point of the call before sleeping.

>>> My testing seems to show that just adding a limit of 32 buffers to
>>> FileAsynchronousFlush() leads to markedly better results.
>>
>> Hmmm. 32 buffers means 256 KB, which is quite small.
>
> Why?

Because the point of sorting is to generate sequential writes so that the 
HDD has a lot of aligned stuff to write without moving the head, and 32 is 
rather small for that.

> The aim is to not overwhelm the request queue - which is where the
> coalescing is done. And usually that's rather small.

That is an argument. How small, though? It seems to be 128 by default, so 
I'd rather have 128? Also, it can be changed, so maybe it should really be 
a guc?

> If you flush much more sync_file_range starts to do work in the 
> foreground.

Argh, too bad. I would have hoped that it would just deal with this in an
asynchronous way; this is not an "fsync" call, just a flush advice.

>>> I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
>>> sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
>>> might even be possible to later approximate that on windows using
>>> FlushViewOfFile().
>>
>> I'm not sure that mmap/msync can be used for this purpose, because there is
>> no real control it seems about where the file is mmapped.
>
> I'm not following? Why does it matter where a file is mapped?

Because it should be in shared buffers, where pg needs it? You probably do
not want to mmap all pg data files in user space for a large database? Or
if so, currently the OS keeps the data in memory if it has enough space,
but if you go through mmap this cache management would become pg's
responsibility, if I understand mmap and your intentions correctly.

> I have had a friend (Christian Kruse, thanks!)  confirm that at least on
> OSX msync(MS_ASYNC) triggers writeback. A freebsd dev confirmed that
> that should be the case on freebsd too.

Good. My concern is how mmap could be used, though, not the flushing part.

>> Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3 element
>> set in most case is an improvement.
>
> Yes, it'll not matter that much in many cases. But I rather disliked the
> NextBufferToWrite() implementation, especially that it walkes the array
> multiple times. And I did see setups with ~15 tablespaces.

ISTM that it is rather an argument for taking the tablespace into the 
sorting, not necessarily for a binary heap.

>> I also noted this point, but I'm not sure how to have a better approach, so
>> I let it as it is. I tried 50 ms & 200 ms on some runs, without significant
>> effect on performance for the test I ran then. The point of having not too
>> small a value is that it provide some significant work to the IO subsystem
>> without overflowing it.
>
> I don't think that makes much sense. All a longer sleep achieves is
> creating a larger burst of writes afterwards. We should really sleep
> adaptively.

It sounds reasonable, but what would be the criterion?

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-10-21 07:49:23 +0200, Fabien COELHO wrote:
>
> Hello Andres,
>
> >>>In my performance testing it showed that calling PerformFileFlush() only
> >>>at segment boundaries and in CheckpointWriteDelay() can lead to rather
> >>>spikey IO - not that surprisingly. The sync in CheckpointWriteDelay() is
> >>>problematic because it only is triggered while on schedule, and not when
> >>>behind.
> >>
> >>When behind, the PerformFileFlush should be called on segment
> >>boundaries.
> >
> >That means it's flushing up to a gigabyte of data at once. Far too
> >much.
>
> Hmmm. I do not get it. There would not be gigabytes,

I said 'up to a gigabyte', not gigabytes. But it actually can be more than
one if you're unlucky.

> there would be as much as was written since the last sleep, about 100
> ms ago, which is not likely to be gigabytes?

In many cases we don't sleep all that frequently - after one 100ms sleep
we're already behind a lot. And even so, it's pretty easy to get into
checkpoint scenarios with ~500 mbyte/s as a writeout rate. Only issuing
a sync_file_range() 10 times for that is obviously problematic.

> >The implementation pretty always will go behind schedule for some
> >time. Since sync_file_range() doesn't flush in the foreground I don't
> >think it's important to do the flushing in concert with sleeping.
>
> For me it is important to avoid accumulating too large flushes, and that is
> the point of the call before sleeping.

I don't follow this argument. It's important to avoid large flushes,
therefore we potentially allow large flushes to accumulate?

> >>>My testing seems to show that just adding a limit of 32 buffers to
> >>>FileAsynchronousFlush() leads to markedly better results.
> >>
> >>Hmmm. 32 buffers means 256 KB, which is quite small.
> >
> >Why?
>
> Because the point of sorting is to generate sequential writes so that the
> HDD has a lot of aligned stuff to write without moving the head, and 32 is
> rather small for that.

A sync_file_range(SYNC_FILE_RANGE_WRITE) doesn't synchronously write
data back. It just puts it into the write queue. You can have merging
between IOs from either side.  But more importantly you can't merge that
many requests together anyway.

> >The aim is to not overwhelm the request queue - which is where the
> >coalescing is done. And usually that's rather small.
>
> That is an argument. How small, though? It seems to be 128 by default, so
> I'd rather have 128? Also, it can be changed, so maybe it should really be a
> guc?

I couldn't see any benefits above (and below) 32 on a 20 drive system,
so I doubt it's worthwhile. It's actually good for interactivity to
allow other requests into the queue concurrently - otherwise other
reads/writes will obviously have a higher latency...

> >If you flush much more sync_file_range starts to do work in the
> >foreground.
>
> Argh, too bad. I would have hoped that the would just deal with in an
> asynchronous way,

It's even in the man page:
"Note  that  even  this  may  block if you attempt to write more than
request queue size."

> this is not a "fsync" call, just a flush advise.

sync_file_range isn't fadvise().

> Because it should be in shared buffers where pg needs it?

Huh? I'm just suggesting p = mmap(fd, offset, bytes);msync(p, bytes);munmap(p);
instead of sync_file_range().
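
Spelled out with the actual POSIX calls, that suggestion would look roughly
like this (a sketch only, error handling omitted; the offset must be page
aligned for mmap):

    #include <sys/mman.h>

    static void flush_range_msync(int fd, off_t offset, size_t bytes)
    {
        void *p = mmap(NULL, bytes, PROT_READ, MAP_SHARED, fd, offset);

        if (p != MAP_FAILED)
        {
            (void) msync(p, bytes, MS_ASYNC);   /* schedule writeback, do not wait */
            (void) munmap(p, bytes);
        }
    }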

> >>Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3 element
> >>set in most case is an improvement.
> >
> >Yes, it'll not matter that much in many cases. But I rather disliked the
> >NextBufferToWrite() implementation, especially that it walkes the array
> >multiple times. And I did see setups with ~15 tablespaces.
>
> ISTM that it is rather an argument for taking the tablespace into the
> sorting, not necessarily for a binary heap.

I don't understand your problem with that. The heap specific code is
small, smaller than your NextBufferToWrite() implementation?
   ts_heap = binaryheap_allocate(nb_spaces,
                                 ts_progress_cmp,
                                 NULL);

   spcContext = (FileFlushContext *)
       palloc(sizeof(FileFlushContext) * nb_spaces);

   for (i = 0; i < nb_spaces; i++)
   {
       TableSpaceCheckpointStatus *spc = &spcStatus[i];

       spc->progress_slice = ((float8) num_to_write) / (float8) spc->num_to_write;

       ResetFileFlushContext(&spcContext[i]);
       spc->flushContext = &spcContext[i];

       binaryheap_add_unordered(ts_heap, PointerGetDatum(&spcStatus[i]));
   }
   binaryheap_build(ts_heap);

and then

   while (!binaryheap_empty(ts_heap))
   {
       TableSpaceCheckpointStatus *ts = (TableSpaceCheckpointStatus *)
           DatumGetPointer(binaryheap_first(ts_heap));
...
       ts->progress += ts->progress_slice;
       ts->num_written++;
...
       if (ts->num_written == ts->num_to_write)
       {
...
           binaryheap_remove_first(ts_heap);
       }
       else
       {
           /* update heap with the new progress */
           binaryheap_replace_first(ts_heap, PointerGetDatum(ts));
       }

> >>I also noted this point, but I'm not sure how to have a better approach, so
> >>I let it as it is. I tried 50 ms & 200 ms on some runs, without significant
> >>effect on performance for the test I ran then. The point of having not too
> >>small a value is that it provide some significant work to the IO subsystem
> >>without overflowing it.
> >
> >I don't think that makes much sense. All a longer sleep achieves is
> >creating a larger burst of writes afterwards. We should really sleep
> >adaptively.
>
> It sounds reasonable, but what would be the criterion?

What IsCheckpointOnSchedule() does is essentially to calculate progress
for two things:
1) Are we on schedule based on WAL segments until CheckPointSegments
   (computed via max_wal_size these days). I.e. is the percentage of
   used up WAL bigger than the percentage of written out buffers.

2) Are we on schedule based on checkpoint_timeout. I.e. is the
   percentage of checkpoint_timeout already passed bigger than the
   percentage of buffers written out.

So the trick is just to compute the number of work items (e.g. buffers
to write out) and divide the remaining time by it. That's how long you
can sleep.
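
The time-based part of that calculation could look roughly like this (a
sketch with hypothetical names; the WAL-based schedule would need a similar
treatment):

    static long adaptive_sleep_ms(double seconds_left_in_checkpoint,
                                  int buffers_left_to_write)
    {
        double per_item;

        if (buffers_left_to_write <= 0 || seconds_left_in_checkpoint <= 0)
            return 0;       /* nothing left, or behind schedule: do not sleep */

        /* spread the remaining time evenly over the remaining writes */
        per_item = seconds_left_in_checkpoint / buffers_left_to_write;

        return (long) (per_item * 1000.0);
    }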

It's slightly trickier for WAL and I'm not sure it's equally
important. But even there it shouldn't be too hard to calculate the
amount of time till we're behind on schedule and only sleep that long.


I'm running benchmarks right now, they'll take a bit to run to
completion.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

>> there would be as much as was written since the last sleep, about 100
>> ms ago, which is not likely to be gigabytes?
>
> In many cases we don't sleep all that frequently - after one 100ms sleep
> we're already behind a lot.

I think that "being behind" is not a problem as such, it is really the way 
the scheduler has been designed and works, by keeping pace with time & 
wall progress by little bursts of writes. If you reduce the sleep time a 
lot then it would end up having writes interleaved with small sleeps, but 
then this would be bad for performance has the OS would loose the ability 
to write much data sequentially on the disk.

It does not mean that the default 100 ms is a good figure, but the "being 
behind" is a feature, not an issue as such.

> And even so, it's pretty easy to get into checkpoint scenarios with ~500 
> mbyte/s as a writeout rate.

Hmmmm. Not with my hardware:-)

> Only issuing a sync_file_range() 10 times for that is obviously 
> problematic.

Hmmm. Then it should depend on the expected write capacity of the 
underlying disks...

>>> The implementation pretty always will go behind schedule for some
>>> time. Since sync_file_range() doesn't flush in the foreground I don't
>>> think it's important to do the flushing in concert with sleeping.
>>
>> For me it is important to avoid accumulating too large flushes, and that is
>> the point of the call before sleeping.
>
> I don't follow this argument. It's important to avoid large flushes,
> therefore we potentially allow large flushes to accumulate?

On my simple test hardware the flushes are not large, I think, so the 
problem does not arise. Maybe I should check.

>>>>> My testing seems to show that just adding a limit of 32 buffers to
>>>>> FileAsynchronousFlush() leads to markedly better results.
>>>>
>>>> Hmmm. 32 buffers means 256 KB, which is quite small.
>>>
>>> Why?
>>
>> Because the point of sorting is to generate sequential writes so that the
>> HDD has a lot of aligned stuff to write without moving the head, and 32 is
>> rather small for that.
>
> A sync_file_range(SYNC_FILE_RANGE_WRITE) doesn't synchronously write
> data back. It just puts it into the write queue.

Yes.

> You can have merging between IOs from either side.  But more importantly 
> you can't merge that many requests together anyway.

Probably.

>>> The aim is to not overwhelm the request queue - which is where the
>>> coalescing is done. And usually that's rather small.
>>
>> That is an argument. How small, though? It seems to be 128 by default, so
>> I'd rather have 128? Also, it can be changed, so maybe it should really be a
>> guc?
>
> I couldn't see any benefits above (and below) 32 on a 20 drive system,

So that is one kind of (big) hardware. Assuming that pages are contiguous,
how much is written on each disk depends on the RAID type and the stripe
size, and when it is really written depends on the various caches (in the
RAID HW card if any, on the disk, ...), so whether 32 at the OS level is
the right size is pretty unclear to me. I would have said the larger the
better, but indeed you should avoid blocking.

> so I doubt it's worthwhile. It's actually good for interactivity to
> allow other requests into the queue concurrently - otherwise other
> reads/writes will obviously have a higher latency...

Sure. Now on my tests, with my (old & little) hardware it seemed quite 
smooth. What I'm driving at is that what is good may be relative and 
depend on the underlying hardware, which makes it not obvious to choose 
the right parameter.

>>> If you flush much more sync_file_range starts to do work in the
>>> foreground.
>>
>> Argh, too bad. I would have hoped that the would just deal with in an
>> asynchronous way,
>
> It's even in the man page:
> "Note  that  even  this  may  block if you attempt to write more than
> request queue size."

Hmmm. What about choosing "request queue size * 0.5", then ?

>> Because it should be in shared buffers where pg needs it?
>
> Huh? I'm just suggesting p = mmap(fd, offset, bytes);msync(p, bytes);munmap(p);
> instead of sync_file_range().

I think that I do not really understand how it would work, but possibly it
could.

>> ISTM that it is rather an argument for taking the tablespace into the
>> sorting, not necessarily for a binary heap.
>
> I don't understand your problem with that. The heap specific code is
> small, smaller than your NextBufferToWrite() implementation?

You have not yet posted the updated version of the patch.

The complexity of the round-robin scan on the array is O(1) and very few
instructions, plus some stop condition which I think is mostly true if the
writes are balanced between tablespaces; there is no dynamic allocation in
the data structure (it is an array). The binary heap is O(log(n)); there
are probably dynamic allocations and frees when extracting/inserting
something, there are function calls to rebalance the tree, and so on. Ok,
"n" is expected to be small.

So basically, for me it is not obviously superior to the previous version. 
Now I'm also tired, so if it works reasonably I'll be fine with it.

> [... code extract ...]

>>> I don't think that makes much sense. All a longer sleep achieves is
>>> creating a larger burst of writes afterwards. We should really sleep
>>> adaptively.
>>
>> It sounds reasonable, but what would be the criterion?
>
> What IsCheckpointOnSchedule() does is essentially to calculate progress
> for two things:
> 1) Are we on schedule based on WAL segments until CheckPointSegments
>   (computed via max_wal_size these days). I.e. is the percentage of
>   used up WAL bigger than the percentage of written out buffers.
>
> 2) Are we on schedule based on checkpoint_timeout. I.e. is the
>   percentage of checkpoint_timeout already passed bigger than the
>   percentage of buffers written out.

> So the trick is just to compute the number of work items (e.g. buffers
> to write out) and divide the remaining time by it. That's how long you
> can sleep.

See discussion above. ISTM that the "bursts" are a useful feature of the
checkpoint scheduler, especially with sorted buffers & flushes. You want
to provide grouped writes that will be easily written to disk together.
You do not want page writes issued one by one and interleaved with small
sleeps.

> It's slightly trickier for WAL and I'm not sure it's equally
> important. But even there it shouldn't be too hard to calculate the
> amount of time till we're behind on schedule and only sleep that long.

The scheduler stops writing as soon as it has overtaken the progress, so 
it should be a very small time, but if you do that you would end up 
writing pages one by one, which is not desirable at all.

> I'm running benchmarks right now, they'll take a bit to run to
> completion.

Good.

I'm looking forward to have a look at the updated version of the patch.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-09-10 17:15:26 +0200, Fabien COELHO wrote:
> Here is a v13, which is just a rebase after 1aba62ec.

And here's v14. It's not something entirely ready. A lot of details have
changed, I unfortunately don't remember them all. But there are more
important things than the details of the patch.

I've played *a lot* with this patch. I found a bunch of issues:

1) The FileFlushContext infrastructure isn't actually correct. There are
   two problems: First, using the actual 'fd' number to reference a
   to-be-flushed file isn't meaningful. If there are lots of files open,
   fds get reused within fd.c. That part is easily enough fixed by
   referencing the File instead of the fd. The bigger problem is that the
   infrastructure doesn't deal with files being closed. There can be smgr
   invalidations, which aren't that hard to trigger, causing the smgr
   handle and thus the file to be closed.

   I think this means that the entire flushing infrastructure actually
   needs to be hoisted up, onto the smgr/md level.

2) I noticed that sync_file_range() blocked far more often than I'd
   expected. Reading the kernel code, that turned out to be caused by a
   pessimization introduced into the kernel years ago - in many situations
   SFR_WRITE waited for the writes. A fix for this will be in the 4.4
   kernel.

3) I found that latency wasn't improved much for workloads that are
   significantly bigger than shared buffers. The problem here is that
   neither bgwriter nor the backends have, so far, done
   sync_file_range() calls. That meant that the old problem of having
   gigabytes of dirty data that periodically get flushed out, still
   exists. Having these do flushes mostly attacks that problem.


Benchmarking revealed that for workloads where the hot data set mostly
fits into shared buffers, flushing and sorting are anywhere from a small
to a massive improvement, both in throughput and latency, even without
the patch from 2), although fixing that improves things further.



What I did not expect, and what confounded me for a long while, is that
for workloads where the hot data set does *NOT* fit into shared buffers,
sorting often led to a noticeable reduction in throughput, up to 30%. The
performance was still much more regular than before, i.e. no more
multi-second periods without any transactions happening.

By now I think I know what's going on: Before the sorting portion of the
patch, the write loop in BufferSync() starts at the current clock hand,
by using StrategySyncStart(). But after the sorting that obviously
doesn't happen anymore - buffers are accessed in their sort order. By
starting at the current clock hand and moving on from there, the
checkpointer basically makes it less likely that victim buffers need to
be written either by the backends themselves or by the bgwriter. That
means that the sorted checkpoint writes can, indirectly, increase the
number of unsorted writes by other processes :(

My benchmarking suggests that that effect is larger the shorter the
checkpoint timeout is. That seems to make intuitive sense, given the
above explanation attempt. If the checkpoint takes longer, the clock hand
will almost certainly soon overtake the checkpoint's 'implicit' hand.

I'm not sure if we can really do anything about this problem. While I'm
pretty jet lagged, I still spent a fair amount of time thinking about
it. Seems to suggest that we need to bring back the setting to
enable/disable sorting :(


What I think needs to happen next with the patch is:
1) Hoist up the FileFlushContext stuff into the smgr layer. Carefully
   handling the issue of smgr invalidations.
2) Replace the boolean checkpoint_flush_to_disk GUC with a list guc that
   later can contain multiple elements like checkpoint, bgwriter,
   backends, ddl, bulk-writes. That seems better than adding GUCs for
   these separately. Then make the flush locations in the patch
   configurable using that.
3) I think we should remove the sort timing from the checkpoint logging
   before commit. It'll always be pretty short.


Greetings,

Andres Freund

Attachment

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

> And here's v14. It's not something entirely ready.

I'm going to have a careful look at it.

> A lot of details have changed, I unfortunately don't remember them all. 
> But there are more important things than the details of the patch.
>
> I've played *a lot* with this patch. I found a bunch of issues:
>
> 1) The FileFlushContext context infrastructure isn't actually
>   correct. There's two problems: First, using the actual 'fd' number to
>   reference a to-be-flushed file isn't meaningful. If there  are lots
>   of files open, fds get reused within fd.c.

Hmm.

My assumption is that a file being used (i.e. with modified pages, being 
used for writes...) would not be closed before everything is cleared...

After some poking in the code, I think that this issue may indeed be 
there; the probability of hitting it is close to 0, but alas not 
0 :-)

To fix it, ISTM that it is enough to hold a "do not close lock" on the 
file while a flush is in progress (a short time) that would prevent 
mdclose from doing its stuff.

> That part is enough fixed by referencing File instead the fd. The bigger 
> problem is that the infrastructure doesn't deal with files being closed. 
> There can, which isn't that hard to trigger, be smgr invalidations 
> causing smgr handle and thus the file to be closed.
>
> I think this means that the entire flushing infrastructure actually
> needs to be hoisted up, onto the smgr/md level.

Hmmm. I'm not sure that it is necessary, see above my suggestion.

> 2) I noticed that sync_file_range() blocked far more often than I'd
>   expected. Reading the kernel code that turned out to be caused by a
>   pessimization in the kernel introduced years ago - in many situation
>   SFR_WRITE waited for the writes. A fix for this will be in the 4.4
>   kernel.

Alas, Pg cannot help with issues in the kernel.

> 3) I found that latency wasn't improved much for workloads that are
>   significantly bigger than shared buffers. The problem here is that
>   neither bgwriter nor the backends have, so far, done
>   sync_file_range() calls. That meant that the old problem of having
>   gigabytes of dirty data that periodically get flushed out, still
>   exists. Having these do flushes mostly attacks that problem.

I'm conscious that the patch only addresses *checkpointer* writes, not 
those from bgwriter or backend writes. I agree that these will need to 
be addressed at some point as well, but given the time it takes to get a 
patch through - the more complex, the slower (sort propositions are 10 
years old) - I think this should be postponed for later.

> Benchmarking revealed that for workloads where the hot data set mostly
> fits into shared buffers flushing and sorting is anywhere from a small
> to a massive improvement, both in throughput and latency. Even without
> the patch from 2), although fixing that improves things furhter.

This is consistent with my experiments: sorting improves things, and 
flushing on top of sorting improves things further.

> What I did not expect, and what confounded me for a long while, is that
> for workloads where the hot data set does *NOT* fit into shared buffers,
> sorting often led to be a noticeable reduction in throughput. Up to
> 30%.

I did not see such behavior in the many tests I ran. Could you share more 
precise details so that I can try to reproduce this performance 
regression? (available memory, shared buffers, db size, ...).

> The performance was still much more regular than before, i.e. no
> more multi-second periods without any transactions happening.
>
> By now I think I know what's going on: Before the sorting portion of the
> patch the write-loop in BufferSync() starts at the current clock hand,
> by using StrategySyncStart(). But after the sorting that obviously
> doesn't happen anymore - buffers are accessed in their sort order. By
> starting at the current clock hand and moving on from there the
> checkpointer basically makes it more less likely that victim buffers
> need to be written either by the backends themselves or by
> bgwriter. That means that the sorted checkpoint writes can, indirectly,
> increase the number of unsorted writes by other processes :(

I'm quite surprised at such a large effect on throughput, though.

This explanation seems to suggest that if bgwriter/worker writes are 
sorted and/or coordinated with the checkpointer somehow then all would be 
well?

ISTM that this explanation could be checked by looking at whether 
bgwriter/worker writes are especially large compared to checkpointer 
writes in those cases with reduced throughput? The data is in the log.



> My benchmarking suggest that that effect is the larger, the shorter the
> checkpoint timeout is.

Hmmm. The shorter the timeout, the more likely the sorting is NOT to be 
effective, the more likely we go back to random I/Os, and maybe the more 
likely we see some effect of the sync strategy stuff.

> That seems to intuitively make sense, give the above explanation 
> attempt. If the checkpoint takes longer the clock hand will almost 
> certainly soon overtake checkpoints 'implicit' hand.
>
> I'm not sure if we can really do anything about this problem. While I'm
> pretty jet lagged, I still spent a fair amount of time thinking about
> it. Seems to suggest that we need to bring back the setting to
> enable/disable sorting :(
>
>
> What I think needs to happen next with the patch is:
> 1) Hoist up the FileFlushContext stuff into the smgr layer. Carefully
>   handling the issue of smgr invalidations.

Not sure that much is necessary, see above.

> 2) Replace the boolean checkpoint_flush_to_disk GUC with a list guc that
>   later can contain multiple elements like checkpoint, bgwriter,
>   backends, ddl, bulk-writes. That seems better than adding GUCs for
>   these separately. Then make the flush locations in the patch
>   configurable using that.

My 0.02€ on this point: I have not seen much of this style of guc 
elsewhere. The only ones I found while scanning the postgres configuration 
file are *_path and *_libraries. It seems to me that this would depart 
significantly from the usual style, so one guc per case, or one shared guc 
but with only on/off, would blend in more cleanly with the usual style.
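
For illustration (any setting name other than checkpoint_flush_to_disk is
hypothetical here), the two styles under discussion would look roughly like
this in postgresql.conf:

    # list-style GUC, as Andres suggests
    flush_on_write = 'checkpoint, bgwriter'

    # one boolean GUC per writer, closer to the existing style
    checkpoint_flush_to_disk = on
    bgwriter_flush_to_disk = on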

> 3) I think we should remove the sort timing from the checkpoint logging
>   before commit. It'll always be pretty short.

I added it to show that it was really short, in response to concerns that 
my approach of just sorting through indexes to reduce the memory needed, 
instead of copying the data to be sorted, did not induce significant 
performance issues. I proved my point, but peer pressure made me switch 
to larger memory anyway.

I think it should be kept while the features are under testing. I do not 
think that it harms in any way.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Andres Freund
Date:
Hi,

On 2015-11-12 15:31:41 +0100, Fabien COELHO wrote:
> >A lot of details have changed, I unfortunately don't remember them all.
> >But there are more important things than the details of the patch.
> >
> >I've played *a lot* with this patch. I found a bunch of issues:
> >
> >1) The FileFlushContext context infrastructure isn't actually
> >  correct. There's two problems: First, using the actual 'fd' number to
> >  reference a to-be-flushed file isn't meaningful. If there  are lots
> >  of files open, fds get reused within fd.c.
> 
> Hmm.
> 
> My assumption is that a file being used (i.e. with modifie pages, being used
> for writes...) would not be closed before everything is cleared...

That's likely, but far from guaranteed.

> After some poking in the code, I think that this issue may indeed be there,
> although the probability of hitting it is close to 0, but alas not 0:-)

I did hit it...

> To fix it, ITSM that it is enough to hold a "do not close lock" on the file
> while a flush is in progress (a short time) that would prevent mdclose to do
> its stuff.

Could you expand a bit more on this? You're suggesting something like a
boolean in the vfd struct? If that, how would you deal with FileClose()
being called?


> >3) I found that latency wasn't improved much for workloads that are
> >  significantly bigger than shared buffers. The problem here is that
> >  neither bgwriter nor the backends have, so far, done
> >  sync_file_range() calls. That meant that the old problem of having
> >  gigabytes of dirty data that periodically get flushed out, still
> >  exists. Having these do flushes mostly attacks that problem.
> 
> I'm concious that the patch only addresses *checkpointer* writes, not those
> from bgwrither or backends writes. I agree that these should need to be
> addressed at some point as well, but given the time to get a patch through,
> the more complex the slower (sort propositions are 10 years old), I think
> this should be postponed for later.

I think we need to have at least a PoC of all of the relevant
changes. We're doing these to fix significant latency and throughput
issues, and if the approach turns out not to be suitable for
e.g. bgwriter or backends, that might have influence over checkpointer's
design as well.

> >What I did not expect, and what confounded me for a long while, is that
> >for workloads where the hot data set does *NOT* fit into shared buffers,
> >sorting often led to be a noticeable reduction in throughput. Up to
> >30%.
> 
> I did not see such behavior in the many tests I ran. Could you share more
> precise details so that I can try to reproduce this performance regression?
> (available memory, shared buffers, db size, ...).


I generally found that I needed to disable autovacuum's analyze to get
anything even close to stable numbers. The issue described in
http://archives.postgresql.org/message-id/20151031145303.GC6064%40alap3.anarazel.de
otherwise badly kicks in. I basically just set
autovacuum_analyze_threshold to INT_MAX/2147483647 to prevent that from occurring.
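
In postgresql.conf terms that amounts to something like the following (the
value is INT_MAX, which effectively disables automatic analyze):

    autovacuum_analyze_threshold = 2147483647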

I'll show actual numbers at some point yes. I tried three different systems:

* my laptop, 16 GB RAM, 840 EVO 1TB as storage. With 2GB shared_buffers.
  Tried checkpoint timeouts from 60 to 300s. I could see issues in workloads
  ranging from scale 300 to 5000. Throughput regressions are visible for both
  sync_commit on/off workloads. Here the largest regressions were visible.

* my workstation: 24GB RAM, 2x E5520, a) RAID 10 of 4 4TB, 7.2krpm devices
  b) RAID 1 of 2 m4 512GB SSDs. One of the latter was killed during the test.
  Both showed regressions, but smaller.

* EC2 d2.8xlarge, 244 GB RAM, 24 x 2000 HDD, 64GB shared_buffers. I tried
  scale 3000, 8000, 15000. Here sorting, without flushing, didn't lead to
  much regression.


I think generally the regressions were visible with a) noticeable shared
buffers, b) workload not fitting into shared buffers, c) significant
throughput, leading to high cache replacement ratios.


Another thing that's worthwhile to mention, while not surprising, is
that the benefits of this patch are massively smaller when WAL and data
are separated onto different disks.  For workloads fitting into
shared_buffers I saw no performance difference - not particularly
surprising. I guess if you'd construct a case where the data, not WAL,
is the bottleneck that'd be different.  Also worthwhile to mention is that
the separate-disk setup was noticeably faster.

> >The performance was still much more regular than before, i.e. no
> >more multi-second periods without any transactions happening.
> >
> >By now I think I know what's going on: Before the sorting portion of the
> >patch the write-loop in BufferSync() starts at the current clock hand,
> >by using StrategySyncStart(). But after the sorting that obviously
> >doesn't happen anymore - buffers are accessed in their sort order. By
> >starting at the current clock hand and moving on from there the
> >checkpointer basically makes it more less likely that victim buffers
> >need to be written either by the backends themselves or by
> >bgwriter. That means that the sorted checkpoint writes can, indirectly,
> >increase the number of unsorted writes by other processes :(
> 
> I'm quite surprised at such a large effect on throughput, though.

Me too.


> This explanation seems to suggest that if bgwriter/workders write are sorted
> and/or coordinated with the checkpointer somehow then all would be well?

Well, you can't easily sort bgwriter/backend writes stemming from cache
replacement. Unless your access patterns are entirely sequential the
data in shared buffers will be laid out in a nearly entirely random
order.  We could try sorting the data, but with any reasonable window,
for many workloads the likelihood of actually achieving much with that
seems low.


> ISTM that this explanation could be checked by looking whether
> bgwriter/workers writes are especially large compared to checkpointer writes
> in those cases with reduced throughput? The data is in the log.

What do you mean with "large"? Numerous?

> >My benchmarking suggest that that effect is the larger, the shorter the
> >checkpoint timeout is.
> 
> Hmmm. The shorter the timeout, the more likely the sorting NOT to be
> effective

You mean, as evidenced by the results, or is that what you'd actually
expect?


> >2) Replace the boolean checkpoint_flush_to_disk GUC with a list guc that
> >  later can contain multiple elements like checkpoint, bgwriter,
> >  backends, ddl, bulk-writes. That seems better than adding GUCs for
> >  these separately. Then make the flush locations in the patch
> >  configurable using that.
> 
> My 0,02€ on this point: I have not seen much of this style of guc elsewhere.
> The only one I found while scanning the postgres file are *_path and
> *_libraries. It seems to me that this would depart significantly from the
> usual style, so one guc per case, or one shared guc but with only on/off,
> would blend in more cleanly with the usual style.

Such a guc would allow one 'on' and 'off' setting, and either would
hopefully be the norm. That seems advantageous to me.


> >3) I think we should remove the sort timing from the checkpoint logging
> >  before commit. It'll always be pretty short.
> 
> I added it to show that it was really short, in response to concerns that my
> approach of just sorting through indexes to reduce the memory needed instead
> of copying the data to be sorted did not induce significant performance
> issues. I prooved my point, but peer pressure made me switch to larger
> memory anyway.

Grumble. I'm getting a bit tired about this topic. This wasn't even
remotely primarily about sorting speed, and you damn well know it.


> I think it should be kept while the features are under testing. I do not
> think that it harms in anyway.

That's why I said we should remove it *before commit*.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>> To fix it, ITSM that it is enough to hold a "do not close lock" on the file
>> while a flush is in progress (a short time) that would prevent mdclose to do
>> its stuff.
>
> Could you expand a bit more on this? You're suggesting something like a
> boolean in the vfd struct?

Basically yes, I'm suggesting a mutex in the vfd struct.

> If that, how would you deal with FileClose() being called?

Just wait for the mutex, which would be held while flushes are accumulated 
into the flush context and released after the flush is performed and the 
fd is no longer necessary for this purpose. That window is expected to be 
short (at worst between the wake & sleep of the checkpointer, and just one 
file at a time).

>> I'm concious that the patch only addresses *checkpointer* writes, not those
>> from bgwrither or backends writes. I agree that these should need to be
>> addressed at some point as well, but given the time to get a patch through,
>> the more complex the slower (sort propositions are 10 years old), I think
>> this should be postponed for later.
>
> I think we need to have at least a PoC of all of the relevant
> changes. We're doing these to fix significant latency and throughput
> issues, and if the approach turns out not to be suitable for
> e.g. bgwriter or backends, that might have influence over checkpointer's
> design as well.

Hmmm. See below.

>>> What I did not expect, and what confounded me for a long while, is that
>>> for workloads where the hot data set does *NOT* fit into shared buffers,
>>> sorting often led to be a noticeable reduction in throughput. Up to
>>> 30%.
>>
>> I did not see such behavior in the many tests I ran. Could you share more
>> precise details so that I can try to reproduce this performance regression?
>> (available memory, shared buffers, db size, ...).
>
>
> I generally found that I needed to disable autovacuum's analyze to get
> anything even close to stable numbers. The issue in described in
> http://archives.postgresql.org/message-id/20151031145303.GC6064%40alap3.anarazel.de
> otherwise badly kicks in. I basically just set
> autovacuum_analyze_threshold to INT_MAX/2147483647 to prevent that from occuring.
>
> I'll show actual numbers at some point yes. I tried three different systems:
>
> * my laptop, 16 GB Ram, 840 EVO 1TB as storage. With 2GB
>  shared_buffers. Tried checkpoint timeouts from 60 to 300s.

Hmmm. This is quite short. I tend to do tests with much larger timeouts, and 
I would advise against a short timeout, esp. in a high-throughput system: the 
whole point of the checkpointer is to accumulate as many changes as 
possible.

I'll look into that.

>> This explanation seems to suggest that if bgwriter/workders write are sorted
>> and/or coordinated with the checkpointer somehow then all would be well?
>
> Well, you can't easily sort bgwriter/backend writes stemming from cache
> replacement. Unless your access patterns are entirely sequential the
> data in shared buffers will be laid out in a nearly entirely random
> order.  We could try sorting the data, but with any reasonable window,
> for many workloads the likelihood of actually achieving much with that
> seems low.

Maybe the sorting could be shared with others so that everybody uses the 
same order?

That would suggest to have one global sorting of buffers, maybe maintained 
by the checkpointer, which could be used by all processes that need to 
scan the buffers (in file order), instead of scanning them in memory 
order.

For this purpose, I think that the initial index-based sorting would 
suffice. Could be resorted periodically with some delay maintained in a 
guc, or when significant buffer changes have occurred (reads & writes).

>> ISTM that this explanation could be checked by looking whether
>> bgwriter/workers writes are especially large compared to checkpointer writes
>> in those cases with reduced throughput? The data is in the log.
>
> What do you mean with "large"? Numerous?

I mean the amount of buffers written by bgwriter/worker is greater than 
what is written by the checkpointer. If all fits in shared buffers, 
bgwriter/worker mostly do not need to write anything and the checkpointer 
does all the writes.

The larger the memory needed, the more likely workers/bgwriter will have 
to kick in and generate random I/Os because nothing sensible is currently 
done, so this is consistent with your findings, although I'm surprised 
that it would have a large effect on throughput, as already said.

>> Hmmm. The shorter the timeout, the more likely the sorting NOT to be
>> effective
>
> You mean, as evidenced by the results, or is that what you'd actually
> expect?

What I would expect...

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2015-11-12 17:44:40 +0100, Fabien COELHO wrote:
> 
> >>To fix it, ITSM that it is enough to hold a "do not close lock" on the file
> >>while a flush is in progress (a short time) that would prevent mdclose to do
> >>its stuff.
> >
> >Could you expand a bit more on this? You're suggesting something like a
> >boolean in the vfd struct?
> 
> Basically yes, I'm suggesting a mutex in the vdf struct.

I can't see that being ok. I mean what would that thing even do? VFD
isn't shared between processes, and if we get a smgr flush we have to
apply it, or risk breaking other things.

> >* my laptop, 16 GB Ram, 840 EVO 1TB as storage. With 2GB
> > shared_buffers. Tried checkpoint timeouts from 60 to 300s.
> 
> Hmmm. This is quite short.

Indeed. I'd never do that in a production scenario myself. But
nonetheless it showcases a problem.


> >Well, you can't easily sort bgwriter/backend writes stemming from cache
> >replacement. Unless your access patterns are entirely sequential the
> >data in shared buffers will be laid out in a nearly entirely random
> >order.  We could try sorting the data, but with any reasonable window,
> >for many workloads the likelihood of actually achieving much with that
> >seems low.
> 
> Maybe the sorting could be shared with others so that everybody uses the
> same order?
> 
> That would suggest to have one global sorting of buffers, maybe maintained
> by the checkpointer, which could be used by all processes that need to scan
> the buffers (in file order), instead of scanning them in memory order.

Uh. Cache replacement is based on an approximated LRU, you can't just
remove that without serious regressions.


> >>Hmmm. The shorter the timeout, the more likely the sorting NOT to be
> >>effective
> >
> >You mean, as evidenced by the results, or is that what you'd actually
> >expect?
> 
> What I would expect...

I don't see why then? If you very quickly write lots of data the OS
will continuously flush dirty data to the disk, in which case sorting is
rather important?


Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello,

>> Basically yes, I'm suggesting a mutex in the vdf struct.
>
> I can't see that being ok. I mean what would that thing even do? VFD
> isn't shared between processes, and if we get a smgr flush we have to
> apply it, or risk breaking other things.

Probably something is eluding my comprehension:-)

My basic assumption is that the fopen & fd are per process, so we just have 
to deal with the one in the checkpointer process, so it is enough that the 
checkpointer does not close the file while it is flushing things to it?

>>> * my laptop, 16 GB Ram, 840 EVO 1TB as storage. With 2GB
>>> shared_buffers. Tried checkpoint timeouts from 60 to 300s.
>>
>> Hmmm. This is quite short.
>
> Indeed. I'd never do that in a production scenario myself. But
> nonetheless it showcases a problem.

I would say that it would render sorting ineffective because all the 
rewriting is done by bgwriter or workers, but that does not totally explain 
why the throughput would be worse than before; I would expect it to be as 
bad as before...

>>> Well, you can't easily sort bgwriter/backend writes stemming from cache
>>> replacement. Unless your access patterns are entirely sequential the
>>> data in shared buffers will be laid out in a nearly entirely random
>>> order.  We could try sorting the data, but with any reasonable window,
>>> for many workloads the likelihood of actually achieving much with that
>>> seems low.
>>
>> Maybe the sorting could be shared with others so that everybody uses the
>> same order?
>>
>> That would suggest to have one global sorting of buffers, maybe maintained
>> by the checkpointer, which could be used by all processes that need to scan
>> the buffers (in file order), instead of scanning them in memory order.
>
> Uh. Cache replacement is based on an approximated LRU, you can't just
> remove that without serious regressions.

I understand that, but there is a balance to find. Generating random I/Os 
is very bad for performance, so the decision process must combine LRU/LFU 
heuristics with considering things in some order as well.

>>>> Hmmm. The shorter the timeout, the more likely the sorting NOT to be
>>>> effective
>>>
>>> You mean, as evidenced by the results, or is that what you'd actually
>>> expect?
>>
>> What I would expect...
>
> I don't see why then? If you very quickly writes lots of data the OS
> will continously flush dirty data to the disk, in which case sorting is
> rather important?

What I have in mind is: the shorter the timeout, the fewer neighboring 
buffers will be touched, so the fewer nice sequential writes will be found 
by sorting them, and so the smaller the positive impact on performance...

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>>> Basically yes, I'm suggesting a mutex in the vdf struct.
>> 
>> I can't see that being ok. I mean what would that thing even do? VFD
>> isn't shared between processes, and if we get a smgr flush we have to
>> apply it, or risk breaking other things.
>
> Probably something is eluding my comprehension:-)
>
> My basic assumption is that the fopen & fd is per process, so we just have to 
> deal with the one in the checkpointer process, so it is enough that the 
> checkpointer does not close the file while it is flushing things to it?

Hmmm...

Maybe I'm a little bit too optimistic here, because it seems that I'm 
suggesting to create a deadlock if the checkpointer has buffers waiting 
to be flushed and wishes to close the very same file that holds them.

So on wanting to close the file, the checkpointer should rather flush the 
outstanding flushes in wait and then close the fd, which suggests some 
global variable to hold the flush context so that this can be done.

Hmmm.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Wed, Nov 11, 2015 at 1:08 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2015-09-10 17:15:26 +0200, Fabien COELHO wrote:
> > Here is a v13, which is just a rebase after 1aba62ec.
>
>
> 3) I found that latency wasn't improved much for workloads that are
>    significantly bigger than shared buffers. The problem here is that
>    neither bgwriter nor the backends have, so far, done
>    sync_file_range() calls. That meant that the old problem of having
>    gigabytes of dirty data that periodically get flushed out, still
>    exists. Having these do flushes mostly attacks that problem.
>
>
> Benchmarking revealed that for workloads where the hot data set mostly
> fits into shared buffers flushing and sorting is anywhere from a small
> to a massive improvement, both in throughput and latency. Even without
> the patch from 2), although fixing that improves things furhter.
>
>
>
> What I did not expect, and what confounded me for a long while, is that
> for workloads where the hot data set does *NOT* fit into shared buffers,
> sorting often led to be a noticeable reduction in throughput. Up to
> 30%. The performance was still much more regular than before, i.e. no
> more multi-second periods without any transactions happening.
>
> By now I think I know what's going on: Before the sorting portion of the
> patch the write-loop in BufferSync() starts at the current clock hand,
> by using StrategySyncStart(). But after the sorting that obviously
> doesn't happen anymore - buffers are accessed in their sort order. By
> starting at the current clock hand and moving on from there the
> checkpointer basically makes it more less likely that victim buffers
> need to be written either by the backends themselves or by
> bgwriter. That means that the sorted checkpoint writes can, indirectly,
> increase the number of unsorted writes by other processes :(
>

That sounds like a tricky problem.  I think the way to improve the current
situation is to change the buffer allocation algorithm such that, instead of
the backend issuing the write for a dirty buffer, it just continues to look
for the next free buffer when it finds that the selected buffer is dirty;
if it cannot find a non-dirty buffer after a certain number of attempts, it
signals the bgwriter to write out some buffers.  The writing algorithm of
the bgwriter then has to be such that it picks buffers in chunks from a
checkpoint-list, sorts them and then writes them.  Checkpoint also uses the
same checkpoint-list to flush the dirty buffers.  This will ensure that the
writes are always sorted writes, irrespective of which process does them.
There could be multiple ways to form this checkpoint-list; one way could be
that MarkBufferDirty() adds the buffer to such a list.  I think following
such a mechanism could solve the problem of unsorted writes in the system,
but it raises a question: what kind of latency could such a mechanism
introduce for a backend which signals the bgwriter after not finding a
non-dirty buffer for a certain number of attempts? If we sense that this
could be a problematic case, then we can make both bgwriter and checkpoint
always start from the next victim buffer and then traverse the
checkpoint-list.
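
A rough sketch, with hypothetical names and an assumed attempt bound, of the
allocation change described above: the backend skips dirty victims for a
bounded number of attempts and then wakes the bgwriter, which writes sorted
chunks taken from the shared checkpoint-list.

    #include <stdbool.h>

    #define MAX_ALLOC_ATTEMPTS 16               /* assumed bound */

    /* hypothetical helpers, prototypes only */
    extern int  clock_sweep_next(void);         /* stand-in for the clock sweep */
    extern bool buffer_is_dirty(int buf_id);
    extern void signal_bgwriter_to_flush(void);
    extern void wait_for_clean_buffers(void);

    static int
    get_victim_buffer(void)
    {
        for (;;)
        {
            for (int i = 0; i < MAX_ALLOC_ATTEMPTS; i++)
            {
                int buf_id = clock_sweep_next();

                if (!buffer_is_dirty(buf_id))
                    return buf_id;              /* clean victim: no backend write needed */
            }

            /* only dirty candidates seen: ask bgwriter to write (sorted), then retry */
            signal_bgwriter_to_flush();
            wait_for_clean_buffers();
        }
    }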
 
>
> My benchmarking suggest that that effect is the larger, the shorter the
> checkpoint timeout is. That seems to intuitively make sense, give the
> above explanation attempt. If the checkpoint takes longer the clock hand
> will almost certainly soon overtake checkpoints 'implicit' hand.
>
> I'm not sure if we can really do anything about this problem. While I'm
> pretty jet lagged, I still spent a fair amount of time thinking about
> it. Seems to suggest that we need to bring back the setting to
> enable/disable sorting :(
>
>
> What I think needs to happen next with the patch is:
> 1) Hoist up the FileFlushContext stuff into the smgr layer. Carefully
>    handling the issue of smgr invalidations.
> 2) Replace the boolean checkpoint_flush_to_disk GUC with a list guc that
>    later can contain multiple elements like checkpoint, bgwriter,
>    backends, ddl, bulk-writes. That seems better than adding GUCs for
>    these separately. Then make the flush locations in the patch
>    configurable using that.
> 3) I think we should remove the sort timing from the checkpoint logging
>    before commit. It'll always be pretty short.
>

It seems that for now you have left out the Windows-specific implementation
in pg_flush_data().

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
> Hmmm...
>
> Maybe I'm a little bit too optimistic here, because it seems that I'm 
> suggesting to create a dead lock if the checkpointer has both buffers to 
> flush in waiting and wishes to close the very same file that holds them.
>
> So on wanting to close the file the checkpointer should rather flushes the 
> outstanding flushes in wait and then close the fd, which suggest some global 
> variable to hold flush context so that this can be done.
>
> Hmmm.

On third (fourth, fifth:-) thoughts:

The vfd (virtual file descriptor?) structure in the checkpointer could 
keep a pointer to the current flush if it concerns this fd, so that if it 
decides to close it while there is a write in progress (I'm still baffled 
at why and when the checkpointer process would take such a decision, maybe 
while responding to some signals, because it seems that there is no such 
event in the checkpointer loop itself...), then on close the process could 
flush before closing, or just close, which probably would induce flushing, 
but at least clean up the structure so that the closed fd would not be 
flushed after being closed and result in an error.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Tomas Vondra
Date:
Hi,

I'm planning to do some thorough benchmarking of the patches proposed in 
this thread, on various types of hardware (10k SAS drives and SSDs). But 
is that actually needed? I see Andres did some testing, as he posted a 
summary of the results on 11/12, but I don't see any actual results or 
even info about what benchmarks were done (pgbench?).

If yes, do we only want to compare 0001-ckpt-14-andres.patch against 
master, or do we need to test one of Fabien's previous patches?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Tomas,

> I'm planning to do some thorough benchmarking of the patches proposed in this 
> thread, on various types of hardware (10k SAS drives and SSDs). But is that 
> actually needed? I see Andres did some testing, as he posted summary of the 
> results on 11/12, but I don't see any actual results or even info about what 
> benchmarks were done (pgbench?).
>
> If yes, do we only want to compare 0001-ckpt-14-andres.patch against master, 
> or do we need to test one of the previous Fabien's patches?

My 0.02€,

Although I disagree with some aspects of Andres' patch, I'm not a committer 
and I'm tired of arguing. I'm just planning to do minor changes to Andres' 
version to fix a potential issue if the file is closed while flushing is 
in progress, but that will not change the overall shape of it.

So testing on Andres version seems relevant to me.

For SSD the performance impact should be limited. For disk it should be 
significant if there is no big cache in front of it. There were some 
concerns raised for some loads in the thread (shared memory smaller than 
needed, I think?); if you can include such cases that would be great. My 
guess is that the patch should not be very beneficial there, because the 
writing is mostly done by bgwriter & workers in that case, and these 
writes are still random.

-- 
Fabien.

Re: checkpointer continuous flushing

From
Michael Paquier
Date:
On Thu, Dec 17, 2015 at 4:27 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> Hello Tomas,
>
>> I'm planning to do some thorough benchmarking of the patches proposed in
>> this thread, on various types of hardware (10k SAS drives and SSDs). But is
>> that actually needed? I see Andres did some testing, as he posted summary of
>> the results on 11/12, but I don't see any actual results or even info about
>> what benchmarks were done (pgbench?).
>>
>> If yes, do we only want to compare 0001-ckpt-14-andres.patch against
>> master, or do we need to test one of the previous Fabien's patches?
>
>
> My 0.02€,
>
> Although I disagree with some aspects of Andres patch, I'm not a committer
> and I'm tired of arguing. I'm just planing to do minor changes to Andres
> version to fix a potential issue if the file is closed which flushing is in
> progress, but that will not change the overall shape of it.
>
> So testing on Andres version seems relevant to me.
>
> For SSD the performance impact should be limited. For disk it should be
> significant if there is no big cache in front of it. There were some
> concerns raised for some loads in the thread (shared memory smaller than
> needed I think?), if you can include such cases that would be great. My
> guess is that it should be not very beneficial in this case because the
> writing is mostly done by bgwriter & worker in this case, and these are
> still random.

As there are still plans to move on regarding tests (and because this
patch makes a difference), this is moved to next CF.
--
Michael



Re: checkpointer continuous flushing

From
Tomas Vondra
Date:
Hi,

On 12/16/2015 08:27 PM, Fabien COELHO wrote:
>
> Hello Tomas,
>
>> I'm planning to do some thorough benchmarking of the patches proposed
>> in this thread, on various types of hardware (10k SAS drives and
>> SSDs). But is that actually needed? I see Andres did some testing, as
>> he posted summary of the results on 11/12, but I don't see any actual
>> results or even info about what benchmarks were done (pgbench?).
>>
>> If yes, do we only want to compare 0001-ckpt-14-andres.patch against
>> master, or do we need to test one of the previous Fabien's patches?
>
> My 0.02€,
>
> Although I disagree with some aspects of Andres patch, I'm not a
> committer and I'm tired of arguing. I'm just planing to do minor changes
> to Andres version to fix a potential issue if the file is closed which
> flushing is in progress, but that will not change the overall shape of it.
>
> So testing on Andres version seems relevant to me.

The patch no longer applies to master. Can someone rebase it?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-06 21:01:47 +0100, Tomas Vondra wrote:
> >Although I disagree with some aspects of Andres patch, I'm not a
> >committer and I'm tired of arguing. I'm just planing to do minor changes
> >to Andres version to fix a potential issue if the file is closed which
> >flushing is in progress, but that will not change the overall shape of it.

Are you working on that aspect?

> >So testing on Andres version seems relevant to me.
> 
> The patch no longer applies to master. Can someone rebase it?

I'm working on an updated version, trying to mitigate the performance
regressions I observed.

Andres



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
<Ooops, wrong from address, resent, sorry for the noise>

Hello Andres,

>>> Although I disagree with some aspects of Andres patch, I'm not a
>>> committer and I'm tired of arguing. I'm just planing to do minor changes
>>> to Andres version to fix a potential issue if the file is closed which
>>> flushing is in progress, but that will not change the overall shape of 
>>> it.
> 
> Are you working on that aspect?

I read your patch and I know what I want to try in order to have a small 
and simple fix. I must admit that I have not really understood under which 
conditions the checkpointer would decide to close a file, but that does not 
mean that the potential issue should not be addressed.

Also, I gave some thought to what should be done for bgwriter random IOs. 
The idea is to implement some per-file sorting there and then do some LRU/LFU 
combining. It would not interact much with the checkpointer, so for me the two 
issues should be kept separate and this should not preclude changing the 
checkpointer, esp. given the significant performance benefit of the patch.

However, all this is still in my stack of things to do, and I had not much 
time in the Fall for that. I may have more time in the coming weeks. I'm fine 
if things are updated and performance figures are collected in between, I'll 
take it from where it is when I have time, if something remains to be done.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-07 11:27:13 +0100, Fabien COELHO wrote:
> I read your patch and I know what I want to try to have a small and simple
> fix. I must admit that I have not really understood in which condition the
> checkpointer would decide to close a file, but that does not mean that the
> potential issue should not be addressed.

There's a trivial example: Consider three tablespaces and
max_files_per_process = 2. The balancing can easily cause three files
being flushed at the same time.

But more importantly: You designed the API to be generic because you
wanted it to be usable for other purposes as well. And for that it
certainly needs to deal with that.

> Also, I gave some thoughts about what should be done for bgwriter random
> IOs. The idea is to implement some per-file sorting there and then do some
> LRU/LFU combing. It would not interact much with the checkpointer, so for me
> the two issues should be kept separate and this should not preclude changing
> the checkpointer, esp. given the significant performance benefit of the
> patch.

Well, the problem is that the patch significantly regresses some cases
right now. So keeping them separate isn't particularly feasible.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello,

>> I read your patch and I know what I want to try to have a small and simple
>> fix. I must admit that I have not really understood in which condition the
>> checkpointer would decide to close a file, but that does not mean that the
>> potential issue should not be addressed.
>
> There's a trivial example: Consider three tablespaces and
> max_files_per_process = 2. The balancing can easily cause three files
> being flushed at the same time.

Indeed. Thanks for this explanation!

> But more importantly: You designed the API to be generic because you
> wanted it to be usable for other purposes as well. And for that it
> certainly needs to deal with that.

Yes, I'm planning to try to do the minimum possible damage to the current 
API to fix the issue.

>> Also, I gave some thoughts about what should be done for bgwriter random
>> IOs. The idea is to implement some per-file sorting there and then do some
>> LRU/LFU combing. It would not interact much with the checkpointer, so for me
>> the two issues should be kept separate and this should not preclude changing
>> the checkpointer, esp. given the significant performance benefit of the
>> patch.
>
> Well, the problem is that the patch significantly regresses some cases
> right now. So keeping them separate isn't particularly feasible.

I have not seen significant regressions in my many test runs. In 
particular, I would not consider a tps dip in cases where postgresql is 
doing 0 tps most of the time anyway (i.e. pg is offline) because of random 
IO issues to be a blocker.

As I understood it, the regressions occur when the checkpointer is less 
used, i.e. bgwriter is doing most of the writes, but this does not change 
much whether the checkpointer sorts buffers or not, and the overall 
behavior of pg is very bad anyway in these cases.

Also I think that coupling the two issues is a recipe for never having 
anything done in the end and keeping the current awful behavior:-(

The solution on the bgwriter front is somehow similar to the checkpointer, 
but from a code point of view there is minimum interaction, so I would 
really separate them, esp. as the bgwriter part will require extensive 
testing and discussions as well.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-07 12:50:07 +0100, Fabien COELHO wrote:
> >But more importantly: You designed the API to be generic because you
> >wanted it to be usable for other purposes as well. And for that it
> >certainly needs to deal with that.
> 
> Yes, I'm planning to try to do the minimum possible damage to the current
> API to fix the issue.

What's your thought there? Afaics it's infeasible to do the flushing at
the fd.c level.

Andres



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>> Yes, I'm planning to try to do the minimum possible damage to the current
>> API to fix the issue.
>
> What's your thought there? Afaics it's infeasible to do the flushing tat
> the fd.c level.

I thought of adding a pointer to the current flush structure at the vfd 
level, so that on closing a file with a flush in progress the flush can be 
done and the structure properly cleaned up, hence later the checkpointer 
would see a clean thing and be able to skip it instead of generating 
flushes on a closed file or on a different file...

Maybe I'm missing something, but that is the plan I had in mind.
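
A minimal sketch of that plan, under assumed names (this is not the actual
fd.c layout): each vfd keeps a back-pointer to the flush context that still
references it, and closing the file first completes and detaches that pending
flush, so the checkpointer never flushes through a closed or recycled fd.

    #include <unistd.h>

    typedef struct FlushContext FlushContext;       /* as accumulated by the patch */

    /* hypothetical helper, prototype only: issue the accumulated flush now */
    extern void perform_flush(FlushContext *ctx);

    typedef struct Vfd
    {
        int           fd;                /* kernel file descriptor */
        FlushContext *pending_flush;     /* flush referencing this file, or NULL */
    } Vfd;

    static void
    vfd_close(Vfd *vfdP)
    {
        if (vfdP->pending_flush != NULL)
        {
            perform_flush(vfdP->pending_flush);      /* complete it before closing */
            vfdP->pending_flush = NULL;              /* checkpointer later skips it */
        }
        close(vfdP->fd);
        vfdP->fd = -1;
    }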

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-07 13:07:33 +0100, Fabien COELHO wrote:
> 
> >>Yes, I'm planning to try to do the minimum possible damage to the current
> >>API to fix the issue.
> >
> >What's your thought there? Afaics it's infeasible to do the flushing tat
> >the fd.c level.
> 
> I thought of adding a pointer to the current flush structure at the vfd
> level, so that on closing a file with a flush in progress the flush can be
> done and the structure properly cleaned up, hence later the checkpointer
> would see a clean thing and be able to skip it instead of generating flushes
> on a closed file or on a different file...
> 
> Maybe I'm missing something, but that is the plan I had in mind.

That might work, although it'd not be pretty (not fatally so
though). But I'm inclined to go a different way: I think it's a mistake
to do flushing based on a single file. It seems better to track a fixed
number of outstanding 'block flushes', independent of the file. Whenever
the number of outstanding blocks is exceeded, sort that list, and flush
all outstanding flush requests after merging neighbouring flushes. Imo
that means that we'd better track writes on a relfilenode + block number
level.
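
A rough sketch, with assumed names and an assumed batch size, of the tracking
described above: collect a fixed number of outstanding block writes regardless
of which file they hit, then sort the batch, merge runs of consecutive blocks
and issue one flush per merged range (Linux's sync_file_range(), as in the
patch).

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>

    #define FLUSH_AFTER 32                  /* assumed batch size */
    #define BLCKSZ      8192

    typedef struct PendingWrite
    {
        int      fd;                        /* stands in for relfilenode + segment */
        unsigned blocknum;
    } PendingWrite;

    static PendingWrite pending[FLUSH_AFTER];
    static int          npending;

    static int
    pending_cmp(const void *a, const void *b)
    {
        const PendingWrite *pa = a, *pb = b;

        if (pa->fd != pb->fd)
            return pa->fd < pb->fd ? -1 : 1;
        return (pa->blocknum > pb->blocknum) - (pa->blocknum < pb->blocknum);
    }

    /* Sort the batch, then issue one flush per run of consecutive blocks. */
    static void
    issue_pending_flushes(void)
    {
        qsort(pending, npending, sizeof(PendingWrite), pending_cmp);

        for (int i = 0; i < npending;)
        {
            int start = i;

            while (i + 1 < npending &&
                   pending[i + 1].fd == pending[start].fd &&
                   pending[i + 1].blocknum == pending[i].blocknum + 1)
                i++;

            sync_file_range(pending[start].fd,
                            (off_t) pending[start].blocknum * BLCKSZ,
                            (off_t) (i - start + 1) * BLCKSZ,
                            SYNC_FILE_RANGE_WRITE);
            i++;
        }
        npending = 0;
    }

    /* Called after every buffer write, whichever file it went to. */
    static void
    record_write(int fd, unsigned blocknum)
    {
        pending[npending].fd = fd;
        pending[npending].blocknum = blocknum;
        if (++npending == FLUSH_AFTER)
            issue_pending_flushes();
    }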

Andres



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

>> I thought of adding a pointer to the current flush structure at the vfd
>> level, so that on closing a file with a flush in progress the flush can be
>> done and the structure properly cleaned up, hence later the checkpointer
>> would see a clean thing and be able to skip it instead of generating flushes
>> on a closed file or on a different file...
>>
>> Maybe I'm missing something, but that is the plan I had in mind.
>
> That might work, although it'd not be pretty (not fatally so
> though).

Alas, any solution has to communicate somehow between the API levels, so 
it cannot be "pretty", although we should avoid the worse.

> But I'm inclined to go a different way: I think it's a mistake to do 
> flusing based on a single file. It seems better to track a fixed number 
> of outstanding 'block flushes', independent of the file. Whenever the 
> number of outstanding blocks is exceeded, sort that list, and flush all 
> outstanding flush requests after merging neighbouring flushes.

Hmmm. I'm not sure I understand your strategy.

I do not think that flushing without a prior sorting would be effective, 
because there is no clear reason why buffers written together would then 
be next to each other and thus give sequential write benefits; we would 
just get flushed random IO. I tested that and it worked badly.

One of the points of aggregating flushes is that the range flush call cost
is significant, as shown by preliminary tests I did, probably up in the 
thread, so it makes sense to limit this cost, hence the aggregation. This 
removed some performance regressions I had in some cases.

Also, the granularity of the buffer flush call is a file + offset + size, 
so necessarily it should be done this way (i.e. per file).

Once buffers are sorted per file and per offset within file, written 
buffers are as close as possible one after the other, the merging is very 
easy to compute (it is done on the fly, no need to keep the list of 
buffers, for instance), it is optimally effective, and when the 
checkpointed file changes we will never go back to it before the next 
checkpoint, so there is no reason not to flush right then.
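
For contrast, a minimal sketch (hypothetical names again, reusing BLCKSZ and
the Linux sync_file_range() call from the sketch above) of this on-the-fly
merging: with buffers already sorted by file and block, one pending range per
tablespace suffices, extended while the writes stay contiguous and flushed as
soon as the file changes or a gap appears.

    typedef struct FlushRange
    {
        int      fd;          /* -1 while nothing is pending */
        unsigned first;       /* first block of the pending range */
        unsigned count;       /* number of contiguous blocks accumulated */
    } FlushRange;

    static void
    note_checkpoint_write(FlushRange *r, int fd, unsigned blocknum)
    {
        if (r->fd == fd && blocknum == r->first + r->count)
        {
            r->count++;                           /* contiguous: just extend */
            return;
        }
        if (r->fd >= 0)                           /* file changed or gap: flush now */
            sync_file_range(r->fd, (off_t) r->first * BLCKSZ,
                            (off_t) r->count * BLCKSZ, SYNC_FILE_RANGE_WRITE);
        r->fd = fd;
        r->first = blocknum;
        r->count = 1;
    }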

So basically I do not see a clear positive advantage to your suggestion, 
especially when taking into consideration the scheduling process of the 
checkpointer:

In effect the checkpointer already works with little bursts of activity 
between sleep phases, so that it writes buffers a few at a time, so it may 
already work more or less as you expect, but not for the same reason.

The closest strategy that I experimented with, which is maybe close to your 
suggestion, was to manage a minimum number of buffers to write when awoken 
and to change the sleep delay in between, but I had no clear way to choose 
values and the experiments I did did not show significant performance 
impact from varying these parameters, so I kept that out. If you find a 
magic number of buffers which results in consistently better performance, 
fine with me, but this is independent of aggregating before or after.

> Imo that means that we'd better track writes on a relfilenode + block 
> number level.

I do not think that it is a better option. Moreover, the current approach 
has been proven to be very effective on hundreds of runs, so redoing it 
differently for the sake of it does not look like good resource 
allocation.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-07 16:05:32 +0100, Fabien COELHO wrote:
> >But I'm inclined to go a different way: I think it's a mistake to do
> >flusing based on a single file. It seems better to track a fixed number of
> >outstanding 'block flushes', independent of the file. Whenever the number
> >of outstanding blocks is exceeded, sort that list, and flush all
> >outstanding flush requests after merging neighbouring flushes.
> 
> Hmmm. I'm not sure I understand your strategy.
> 
> I do not think that flushing without a prior sorting would be effective,
> because there is no clear reason why buffers written together would then be
> next to the other and thus give sequential write benefits, we would just get
> flushed random IO, I tested that and it worked badly.

Oh, I was thinking of sorting & merging these outstanding flushes. Sorry
for not making that clear.


> One of the point of aggregating flushes is that the range flush call cost
> is significant, as shown by preliminary tests I did, probably up in the
> thread, so it makes sense to limit this cost, hence the aggregation. These
> removed some performation regression I had in some cases.

FWIW, my tests show that flushing for clean ranges is pretty cheap.


> Also, the granularity of the buffer flush call is a file + offset + size, so
> necessarily it should be done this way (i.e. per file).

What syscalls we issue, and at what level we track outstanding flushes,
doesn't have to be the same.


> Once buffers are sorted per file and offset within file, then written
> buffers are as close as possible one after the other, the merging is very
> easy to compute (it is done on the fly, no need to keep the list of buffers
> for instance), it is optimally effective, and when the checkpointed file
> changes then we will never go back to it before the next checkpoint, so
> there is no reason not to flush right then.

Well, that's true if there's only one tablespace, but e.g. not the case
with two tablespaces with about the same number of dirty buffers.


> So basically I do not see a clear positive advantage to your suggestion,
> especially when taking into consideration the scheduling process of the
> scheduler:

I don't think it makes a big difference for the checkpointer alone, but
it makes the interface much more suitable for other processes, e.g. the
bgwriter, and normal backends.



> >Imo that means that we'd better track writes on a relfilenode + block
> >number level.
> 
> I do not think that it is a better option. Moreover, the current approach
> has been proven to be very effective on hundreds of runs, so redoing it
> differently for the sake of it does not look like good resource allocation.

For a subset of workloads, yes.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

>> One of the point of aggregating flushes is that the range flush call cost
>> is significant, as shown by preliminary tests I did, probably up in the
>> thread, so it makes sense to limit this cost, hence the aggregation. These
>> removed some performation regression I had in some cases.
>
> FWIW, my tests show that flushing for clean ranges is pretty cheap.

Yes, I agree that it is quite cheap, but I had a few % tps regressions 
in some cases without aggregating, and aggregating was enough to avoid 
these small regressions.

>> Also, the granularity of the buffer flush call is a file + offset + size, so
>> necessarily it should be done this way (i.e. per file).
>
> What syscalls we issue, and at what level we track outstanding flushes,
> doesn't have to be the same.

Sure. But the current version is simple, efficient and proven by many 
runs, so there should be a very strong argument to justify a significant 
benefit to change the approach, and I see no such thing in your arguments.

For me the current approach is optimal for the checkpointer, because it 
takes advantage of all available information to perform a better job.

>> Once buffers are sorted per file and offset within file, then written
>> buffers are as close as possible one after the other, the merging is very
>> easy to compute (it is done on the fly, no need to keep the list of buffers
>> for instance), it is optimally effective, and when the checkpointed file
>> changes then we will never go back to it before the next checkpoint, so
>> there is no reason not to flush right then.
>
> Well, that's true if there's only one tablespace, but e.g. not the case
> with two tablespaces of about the same number of dirty buffers.

ISTM that in the version of the patch I sent there was one flushing 
structure per tablespace, each doing its own flushing on its files, so it 
should work the same, only the writing intensity is divided by the number 
of tablespaces? Or am I missing something?

>> So basically I do not see a clear positive advantage to your suggestion,
>> especially when taking into consideration the scheduling process of the
>> scheduler:
>
> I don't think it makes a big difference for the checkpointer alone, but
> it makes the interface much more suitable for other processes, e.g. the
> bgwriter, and normal backends.

Hmmm.

ISTM that the requirements are not exactly the same for the bgwriter and 
backends vs the checkpointer. The checkpointer has the advantage of being 
able to plan its IOs over the long term (volume & time are known...) and the 
implementation takes full benefit of this planning by sorting and 
scheduling and flushing buffers so as to generate as many sequential 
writes as possible.

The bgwriter and backends have a much shorter vision (a few seconds, or 
just one query being processed), so the solution will be less efficient and 
probably messier on the coding side. This is life. I do not see why not 
to take the benefit of full planning in the checkpointer just because 
other processes cannot do the same, especially as under plenty of loads 
the checkpointer does most of the writing and so is the limiting factor.

So I do not buy your suggestion for the checkpointer. Maybe it will be the 
way to go for bgwriter and backends, then fine for them.

>>> Imo that means that we'd better track writes on a relfilenode + block
>>> number level.
>>
>> I do not think that it is a better option. Moreover, the current approach
>> has been proven to be very effective on hundreds of runs, so redoing it
>> differently for the sake of it does not look like good resource allocation.
>
> For a subset of workloads, yes.

Hmmm. What I understood is that the workloads that show some performance 
regressions (regressions that I have *not* seen in the many tests I ran) 
are not due to checkpointer IOs, but rather occur in settings where most of 
the writes are done by backends or bgwriter.

I do not see the point of rewriting the checkpointer for them, although 
obviously I agree that something has to be done also for the other 
processes.

Maybe if all the writes (bgwriter and checkpointer) were performed by the 
same process then some dynamic mixing and sorting and aggregating would 
make sense, but this is currently not the case, and would probably have 
quite limited effect.

Basically I do not understand how changing the flushing organisation as 
you suggest would improve the checkpointer performance significantly, for 
me it should only degrade the performance compared to the current version, 
as far as the checkpointer is concerned.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-07 21:08:10 +0100, Fabien COELHO wrote:
> Hmmm. What I understood is that the workloads that have some performance
> regressions (regressions that I have *not* seen in the many tests I ran) are
> not due to checkpointer IOs, but rather in settings where most of the writes
> is done by backends or bgwriter.

As far as I can see you've not run many tests where the hot/warm data
set is larger than memory (the full machine's memory, not
shared_buffers). That quite drastically alters the performance
characteristics here, because you suddenly have lots of synchronous read
IO thrown into the mix.


Whether it's bgwriter or not I've not fully been able to establish, but
it's a working theory.


> I do not see the point of rewriting the checkpointer for them, although
> obviously I agree that something has to be done also for the other
> processes.

Rewriting the checkpointer and fixing the flush interface in a more
generic way aren't the same thing at all.


Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

>> Hmmm. What I understood is that the workloads that have some performance
>> regressions (regressions that I have *not* seen in the many tests I ran) are
>> not due to checkpointer IOs, but rather in settings where most of the writes
>> is done by backends or bgwriter.
>
> As far as I can see you've not run many tests where the hot/warm data
> set is larger than memory (the full machine's memory, not
> shared_buffers).

Indeed, I think I ran some, but not many with such characteristics.

> That quite drastically alters the performance characteristics here, 
> because you suddenly have lots of synchronous read IO thrown into the 
> mix.

If I understand this point correctly...

I would expect the overall performance to be abysmal in such a situation 
because you get only intermixed *random* reads and writes: as you point 
out, synchronous *random* reads (very slow), but on the write side the 
IOs are mostly random as well on the checkpointer side, because there is 
not much to aggregate to get sequential writes.

Now why would that degrade performance significantly? For me it should 
render the sorting/flushing less and less effective, and it would go back 
to the previous performance levels...

Or maybe it is only the flushing itself which degrades performance, as you 
point out, because then you have some synchronous (synced) writes as well 
as reads, as opposed to just the reads before, without the patch.

If this is indeed the issue, then the solution to avoid the regression is 
*not* to flush, so that the OS IO scheduler is less constrained in its job 
and can be slightly more effective (well, we are talking about abysmal random 
IO disk performance here, so "effective" means somewhere between slightly 
more or slightly less very very very bad).

Maybe a trick could be not to aggregate and flush when buffers in the same 
file are too far apart anyway, for instance based on some threshold? 
This can be implemented locally when deciding whether to merge buffer 
flushes and whether to flush at all, so it would fit the current code quite 
simply; see the rough sketch below.
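
Just to make the idea concrete, here is a minimal sketch of such a 
gap-threshold test (the names and the threshold value are hypothetical, 
not part of the patch):

#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical knob: maximum gap (in bytes) between two writes of the same
 * file below which they are still considered part of one flush range. */
#define FLUSH_MERGE_MAX_GAP (16 * 8192)

struct pending_range
{
    int   fd;       /* file the range belongs to */
    off_t start;    /* start offset of the range */
    off_t len;      /* length of the range so far */
};

/* Can a new write at 'offset' be merged into the existing pending range,
 * or is it too far apart and should be flushed separately? */
static bool
can_merge(const struct pending_range *r, int fd, off_t offset)
{
    if (r->len == 0 || r->fd != fd)
        return false;
    if (offset < r->start + r->len)
        return false;           /* out of order: handle separately */
    return (offset - (r->start + r->len)) <= FLUSH_MERGE_MAX_GAP;
}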

Now my understanding of the sync_file_range call is that it is a hint 
to flush the stuff, but it is still asynchronous in nature, so whether it 
would impact performance that badly depends on the OS IO scheduler. Also, 
I would like to check whether, under the "regressed performance" (in tps 
terms) that you observed, pg is more or less responsive. It could be that 
the average performance is better but pg is offline longer on fsync. In 
that case, I would consider it better to have lower tps in such cases 
*if* pg responsiveness is significantly improved.
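
For reference, the asynchronous call being discussed looks roughly like 
this (a minimal sketch, error handling omitted; SYNC_FILE_RANGE_WRITE only 
initiates writeback, though the call can still block briefly if the 
device's request queue is congested):

#define _GNU_SOURCE
#include <fcntl.h>

/* Hint the kernel to start writeback of a file range without waiting
 * for the data to reach disk. */
static int
hint_flush_range(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}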

Would you have these measures for the regression runs you observed?

> Whether it's bgwriter or not I've not fully been able to establish, but
> it's a working theory.

Ok, that is something to check, so as to either confirm or refute it.

Given the above discussion, I think my suggestion may be wrong: since the 
tps is low because of random read/write accesses, not many buffers are 
modified (so the bgwriter/backends won't need to make space), and the 
checkpointer does not have much to write (good), *but* all of it is random 
(bad).

>> I do not see the point of rewriting the checkpointer for them, although
>> obviously I agree that something has to be done also for the other
>> processes.
>
> Rewriting the checkpointer and fixing the flush interface in a more
> generic way aren't the same thing at all.

Hmmm, probably I misunderstood something in the discussion. It started 
with an implementation strategy, but it drifted into discussing a 
performance regression. I agree that these are two different subjects.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Thu, Jan 7, 2016 at 4:21 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-01-07 11:27:13 +0100, Fabien COELHO wrote:
> > I read your patch and I know what I want to try to have a small and simple
> > fix. I must admit that I have not really understood in which condition the
> > checkpointer would decide to close a file, but that does not mean that the
> > potential issue should not be addressed.
>
> There's a trivial example: Consider three tablespaces and
> max_files_per_process = 2. The balancing can easily cause three files
> being flushed at the same time.
>

Can't the same thing occur without the patch in mdsync(), and can't
we handle it in the same way?  In particular, I am referring to the code below:

mdsync()
{
..
	/*
	 * It is possible that the relation has been dropped or
	 * truncated since the fsync request was entered.
	 * Therefore, allow ENOENT, but only if we didn't fail
	 * already on this file.  This applies both for
	 * _mdfd_getseg() and for FileSync, since fd.c might have
	 * closed the file behind our back.
	 *
	 * XXX is there any point in allowing more than one retry?
	 * Don't see one at the moment, but easy to change the
	 * test here if so.
	 */
	if (!FILE_POSSIBLY_DELETED(errno) ||
		failures > 0)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not fsync file \"%s\": %m",
						path)));
	else
		ereport(DEBUG1,
				(errcode_for_file_access(),
				 errmsg("could not fsync file \"%s\" but retrying: %m",
						path)));
}




With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-09 18:04:39 +0530, Amit Kapila wrote:
> On Thu, Jan 7, 2016 at 4:21 PM, Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2016-01-07 11:27:13 +0100, Fabien COELHO wrote:
> > > I read your patch and I know what I want to try to have a small and
> simple
> > > fix. I must admit that I have not really understood in which condition
> the
> > > checkpointer would decide to close a file, but that does not mean that
> the
> > > potential issue should not be addressed.
> >
> > There's a trivial example: Consider three tablespaces and
> > max_files_per_process = 2. The balancing can easily cause three files
> > being flushed at the same time.
> >
> 
> Won't the same thing can occur without patch in mdsync() and can't
> we handle it in same way?  In particular, I am referring to below code:

I don't see how that's corresponding - the problem is that the currently
proposed infrastructure keeps a kernel-level (or fd.c, in my version) fd
open in its 'pending flushes' struct. But since that isn't associated
with fd.c opening/closing files, that fd isn't very meaningful.


> mdsync()

That seems to address different issues.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Sat, Jan 9, 2016 at 6:08 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-01-09 18:04:39 +0530, Amit Kapila wrote:
> > On Thu, Jan 7, 2016 at 4:21 PM, Andres Freund <andres@anarazel.de> wrote:
> > >
> > > On 2016-01-07 11:27:13 +0100, Fabien COELHO wrote:
> > > > I read your patch and I know what I want to try to have a small and
> > simple
> > > > fix. I must admit that I have not really understood in which condition
> > the
> > > > checkpointer would decide to close a file, but that does not mean that
> > the
> > > > potential issue should not be addressed.
> > >
> > > There's a trivial example: Consider three tablespaces and
> > > max_files_per_process = 2. The balancing can easily cause three files
> > > being flushed at the same time.
> > >
> >
> > Won't the same thing can occur without patch in mdsync() and can't
> > we handle it in same way?  In particular, I am referring to below code:
>
> I don't see how that's corresponding - the problem is that current
> proposed infrastructure keeps a kernel level (or fd.c in my versio) fd
> open in it's 'pending flushes' struct. But since that isn't associated
> with fd.c opening/closing files that fd isn't very meaningful.
>

Okay, but I think that is the reason why you are worried that it is possible
to issue sync_file_range() on a closed file, is that right or am I missing
something?



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-09 18:24:01 +0530, Amit Kapila wrote:
> Okay, but I think that is the reason why you are worried that it is possible
> to issue sync_file_range() on a closed file, is that right or am I missing
> something?

That's one potential issue. You can also fsync a different file, try to
print an error message containing an unallocated filename (that's how I
noticed the issue in the first place)...

I don't think it's going to be acceptable to issue operations on more or
less random fds, even if that operation is hopefully harmless.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Sat, Jan 9, 2016 at 6:26 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-01-09 18:24:01 +0530, Amit Kapila wrote:
> > Okay, but I think that is the reason why you are worried that it is possible
> > to issue sync_file_range() on a closed file, is that right or am I missing
> > something?
>
> That's one potential issue. You can also fsync a different file, try to
> print an error message containing an unallocated filename (that's how I
> noticed the issue in the first place)...
>
> I don't think it's going to be acceptable to issue operations on more or
> less random fds, even if that operation is hopefully harmless.
>

Right, that won't be acceptable; however, I think with your latest
proposal [1] we might not need to solve this problem, or do we still
need to address it?  I think that idea will help to mitigate the problem of
backend and bgwriter writes as well.  In that case, can't we do it with the
help of the existing infrastructure of *pendingOpsTable* and
*CheckpointerShmem->requests[]*?  As the flush requests are already
remembered in those structures, we can use them to apply your idea
and issue flush requests.



[1]
"It seems better to track a fixed
number of outstanding 'block flushes', independent of the file. Whenever
the number of outstanding blocks is exceeded, sort that list, and flush
all outstanding flush requests after merging neighbouring flushes."
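
A minimal sketch of the quoted idea, with hypothetical names (a fixed-size
array of outstanding writes that is sorted and merged whenever it fills
up), could look like this:

#include <stdlib.h>
#include <sys/types.h>

#define MAX_PENDING_FLUSHES 128         /* hypothetical fixed limit */

struct pending_write
{
    int   fd;
    off_t offset;
    off_t len;
};

static struct pending_write pending[MAX_PENDING_FLUSHES];
static int npending = 0;

/* Order by file, then by offset, so neighbouring writes end up adjacent. */
static int
cmp_pending(const void *a, const void *b)
{
    const struct pending_write *pa = a;
    const struct pending_write *pb = b;

    if (pa->fd != pb->fd)
        return (pa->fd < pb->fd) ? -1 : 1;
    if (pa->offset != pb->offset)
        return (pa->offset < pb->offset) ? -1 : 1;
    return 0;
}

static void
remember_write(int fd, off_t offset, off_t len)
{
    if (npending == MAX_PENDING_FLUSHES)
    {
        qsort(pending, npending, sizeof(pending[0]), cmp_pending);
        /* ... walk the sorted array, merge adjacent ranges of the same fd,
         * and issue one flush hint (sync_file_range/posix_fadvise) per
         * merged range ... */
        npending = 0;
    }
    pending[npending].fd = fd;
    pending[npending].offset = offset;
    pending[npending].len = len;
    npending++;
}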

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-09 19:05:54 +0530, Amit Kapila wrote:
> Right that won't be acceptable, however I think with your latest
> proposal [1]

Sure, that'd address that problem.


> [...] think that idea will help to mitigate the problem of backend and
> bgwriter writes as well.  In that, can't we do it with the help of
> existing infrastructure of *pendingOpsTable* and
> *CheckpointerShmem->requests[]*, as already the flush requests are
> remembered in those structures, we can use those to apply your idea to
> issue flush requests.

Hm, that might be possible. But that might have some bigger implications
- we currently can issue thousands of flush requests a second, without
much chance of merging. I'm not sure it's a good idea to overlay that
into the lower frequency pendingOpsTable. Backends having to issue
fsyncs because the pending fsync queue is full is darn expensive. In
contrast to that a 'flush hint' request getting lost doesn't cost that
much.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-07 21:17:32 +0100, Andres Freund wrote:
> On 2016-01-07 21:08:10 +0100, Fabien COELHO wrote:
> > Hmmm. What I understood is that the workloads that have some performance
> > regressions (regressions that I have *not* seen in the many tests I ran) are
> > not due to checkpointer IOs, but rather in settings where most of the writes
> > is done by backends or bgwriter.
> 
> As far as I can see you've not run many tests where the hot/warm data
> set is larger than memory (the full machine's memory, not
> shared_buffers). That quite drastically alters the performance
> characteristics here, because you suddenly have lots of synchronous read
> IO thrown into the mix.
> 
> Whether it's bgwriter or not I've not fully been able to establish, but
> it's a working theory.

Hm. New theory: The current flush interface does the flushing inside
FlushBuffer()->smgrwrite()->mdwrite()->FileWrite()->FlushContextSchedule(). The
problem with that is that at that point we (need to) hold a content lock
on the buffer!

Especially on a system that's bottlenecked on IO that means we'll
frequently hold content locks for a noticeable amount of time, while
flushing blocks, without any need to.

Even if that's not the reason for the slowdowns I observed, I think this
fact gives further credence to the current "pending flushes" tracking
residing on the wrong level.


Andres



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

> Hm. New theory: The current flush interface does the flushing inside
> FlushBuffer()->smgrwrite()->mdwrite()->FileWrite()->FlushContextSchedule(). The
> problem with that is that at that point we (need to) hold a content lock
> on the buffer!

You are worrying that FlushBuffer is holding a lock on a buffer while the 
"sync_file_range" call is issued.

Although I agree that it is not that good, I would be surprised if that was 
the explanation for a performance regression, because sync_file_range 
with the chosen parameters is an async call: it "advises" the OS to write 
out the file range, but it does not wait for that to be completed.

Moreover, for this issue to have a significant impact, it would require 
that another backend just happens to need this very buffer. But ISTM that 
the performance regression you are arguing about is on random-IO-bound 
performance, that is a few hundred tps in the best case, for very large 
databases, so a lot of buffers; the probability of such a collision is thus 
very small, so it would not explain a significant regression.

> Especially on a system that's bottlenecked on IO that means we'll
> frequently hold content locks for a noticeable amount of time, while
> flushing blocks, without any need to.

I'm not that sure it is really noticeable, because sync_file_range does 
not wait for completion.

> Even if that's not the reason for the slowdowns I observed, I think this
> fact gives further credence to the current "pending flushes" tracking
> residing on the wrong level.

ISTM that I put the tracking at the level where the information is 
available without having to recompute it several times, as the flush needs 
to know the fd and offset. Doing it differently would mean more code and 
translating buffers to file/offset several times, I think.

Also, maybe you could answer a question I had about the performance 
regression you observed; I could not find the post where you gave the 
detailed information about it, so that I could try reproducing it: what 
are the exact settings and conditions (shared_buffers, pgbench scaling, 
host memory, ...), what is the observed regression (tps? other?), and what 
is the responsiveness of the database under the regression (eg % of 
seconds with 0 tps, or something like that)?

-- 
Fabien.



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Sat, Jan 9, 2016 at 7:10 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-01-09 19:05:54 +0530, Amit Kapila wrote:
> > Right that won't be acceptable, however I think with your latest
> > proposal [1]
>
> Sure, that'd address that problem.
>
>
> > [...] think that idea will help to mitigate the problem of backend and
> > bgwriter writes as well.  In that, can't we do it with the help of
> > existing infrastructure of *pendingOpsTable* and
> > *CheckpointerShmem->requests[]*, as already the flush requests are
> > remembered in those structures, we can use those to apply your idea to
> > issue flush requests.
>
> Hm, that might be possible. But that might have some bigger implications
> - we currently can issue thousands of flush requests a second, without
> much chance of merging. I'm not sure it's a good idea to overlay that
> into the lower frequency pendingOpsTable.
>

In that case, we can have a unified structure to remember flush requests,
rather than backends and the bgwriter noting that information in
CheckpointerShmem and the checkpointer in pendingOpsTable.  I understand
there are some benefits to having pendingOpsTable, but having a
common structure seems more beneficial, in particular
because it can be used for the purpose of flush hints.

Now, I am sure we can invent a new way of tracking the flush
requests for flush hints, but I think we might want to consider whether
we can have one unified way of tracking the flush requests which
can be used both for *flushes* and *flush hints*.

> Backends having to issue
> fsyncs because the pending fsync queue is full is darn expensive. In
> contrast to that a 'flush hint' request getting lost doesn't cost that
> much.
>

In general, I think the cases where backends have to issue the fsync
themselves should be rare, as the size of the fsync queue is NBuffers and
we take care of handling duplicate fsync requests for the same buffer.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-09 16:49:56 +0100, Fabien COELHO wrote:
> 
> Hello Andres,
> 
> >Hm. New theory: The current flush interface does the flushing inside
> >FlushBuffer()->smgrwrite()->mdwrite()->FileWrite()->FlushContextSchedule(). The
> >problem with that is that at that point we (need to) hold a content lock
> >on the buffer!
> 
> You are worrying that FlushBuffer is holding a lock on a buffer and the
> "sync_file_range" call occurs is issued at that moment.
> 
> Although I agree that it is not that good, I would be surprise if that was
> the explanation for a performance regression, because the sync_file_range
> with the chosen parameters is an async call, it "advises" the OS to send the
> file, but it does not wait for it to be completed.

I frequently see sync_file_range blocking - it waits till it could
submit the writes into the io queues. On a system bottlenecked on IO
that's not always possible immediately.

> Also, maybe you could answer a question I had about the performance
> regression you observed, I could not find the post where you gave the
> detailed information about it, so that I could try reproducing it: what are
> the exact settings and conditions (shared_buffers, pgbench scaling, host
> memory, ...), what is the observed regression (tps? other?), and what is the
> responsiveness of the database under the regression (eg % of seconds with 0
> tps for instance, or something like that).

I measured it in a number of different cases, both on SSDs and spinning
rust. I just reproduced it with:

postgres-ckpt14 \
        -D /srv/temp/pgdev-dev-800/ \
        -c maintenance_work_mem=2GB \
        -c fsync=on \
        -c synchronous_commit=off \
        -c shared_buffers=2GB \
        -c wal_level=hot_standby \
        -c max_wal_senders=10 \
        -c max_wal_size=100GB \
        -c checkpoint_timeout=30s

Using a fresh cluster each time (copied from a "template" to save time)
and using
pgbench -M prepared -c 16 -j16 -T 300 -P 1
I get

My laptop 1 EVO 840, 1 i7-4800MQ, 16GB ram:
master:
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 1155733
latency average: 4.151 ms
latency stddev: 8.712 ms
tps = 3851.242965 (including connections establishing)
tps = 3851.725856 (excluding connections establishing)

ckpt-14 (flushing by backends disabled):
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 855156
latency average: 5.612 ms
latency stddev: 7.896 ms
tps = 2849.876327 (including connections establishing)
tps = 2849.912015 (excluding connections establishing)

My laptop 1 850 PRO, 1 i7-4800MQ, 16GB ram:
master:
transaction type: TPC-B (sort of)
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 2104781
latency average: 2.280 ms
latency stddev: 9.868 ms
tps = 7010.397938 (including connections establishing)
tps = 7010.475848 (excluding connections establishing)

ckpt-14 (flushing by backends disabled):
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 1930716
latency average: 2.484 ms
latency stddev: 7.303 ms
tps = 6434.785605 (including connections establishing)
tps = 6435.177773 (excluding connections establishing)

In neither case are there periods of 0 tps, but both have times of <
1000 tps with noticeably increased latency.


The end results are similar with a sane checkpoint timeout - the tests
just take much longer to give meaningful results. Constantly running
long tests on prosumer-level SSDs isn't nice - I've now killed 5 SSDs
with postgres testing...

As you can see there's roughly a 30% performance regression on the
slower SSD and a ~9% on the faster one. HDD results are similar (but I
can't repeat on the laptop right now since the 2nd hdd is now an SSD).


My working copy of checkpoint sorting & flushing currently results in:
My laptop 1 EVO 840, 1 i7-4800MQ, 16GB ram:
transaction type: TPC-B (sort of)
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 1136260
latency average: 4.223 ms
latency stddev: 8.298 ms
tps = 3786.696499 (including connections establishing)
tps = 3786.778875 (excluding connections establishing)

My laptop 1 850 PRO, 1 i7-4800MQ, 16GB ram:
transaction type: TPC-B (sort of)
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 2050661
latency average: 2.339 ms
latency stddev: 7.708 ms
tps = 6833.593170 (including connections establishing)
tps = 6833.680391 (excluding connections establishing)

My version of the patch currently addresses various points, which need
to be separated and benchmarked separately:
* Different approach to background writer, trying to make backends write less. While that proves to be beneficial in isolation, on its own it doesn't address the performance regression.
 
* Different flushing API, done outside the lock

So this partially addresses the performance problems, but not yet
completely.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-11 14:45:16 +0100, Andres Freund wrote:
> On 2016-01-09 16:49:56 +0100, Fabien COELHO wrote:
> > >Hm. New theory: The current flush interface does the flushing inside
> > >FlushBuffer()->smgrwrite()->mdwrite()->FileWrite()->FlushContextSchedule(). The
> > >problem with that is that at that point we (need to) hold a content lock
> > >on the buffer!
> > 
> > You are worrying that FlushBuffer is holding a lock on a buffer and the
> > "sync_file_range" call occurs is issued at that moment.
> > 
> > Although I agree that it is not that good, I would be surprise if that was
> > the explanation for a performance regression, because the sync_file_range
> > with the chosen parameters is an async call, it "advises" the OS to send the
> > file, but it does not wait for it to be completed.
> 
> I frequently see sync_file_range blocking - it waits till it could
> submit the writes into the io queues. On a system bottlenecked on IO
> that's not always possible immediately.
> 
> > Also, maybe you could answer a question I had about the performance
> > regression you observed, I could not find the post where you gave the
> > detailed information about it, so that I could try reproducing it: what are
> > the exact settings and conditions (shared_buffers, pgbench scaling, host
> > memory, ...), what is the observed regression (tps? other?), and what is the
> > responsiveness of the database under the regression (eg % of seconds with 0
> > tps for instance, or something like that).
> 
> I measured it in a different number of cases, both on SSDs and spinning
> rust. I just reproduced it with:
> 
> postgres-ckpt14 \
>         -D /srv/temp/pgdev-dev-800/ \
>         -c maintenance_work_mem=2GB \
>         -c fsync=on \
>         -c synchronous_commit=off \
>         -c shared_buffers=2GB \
>         -c wal_level=hot_standby \
>         -c max_wal_senders=10 \
>         -c max_wal_size=100GB \
>         -c checkpoint_timeout=30s
> 
> Using a fresh cluster each time (copied from a "template" to save time)
> and using
> pgbench -M prepared -c 16 -j16 -T 300 -P 1
> I get
> 
> My laptop 1 EVO 840, 1 i7-4800MQ, 16GB ram:
> master:
> scaling factor: 800
> query mode: prepared
> number of clients: 16
> number of threads: 16
> duration: 300 s
> number of transactions actually processed: 1155733
> latency average: 4.151 ms
> latency stddev: 8.712 ms
> tps = 3851.242965 (including connections establishing)
> tps = 3851.725856 (excluding connections establishing)
> 
> ckpt-14 (flushing by backends disabled):
> scaling factor: 800
> query mode: prepared
> number of clients: 16
> number of threads: 16
> duration: 300 s
> number of transactions actually processed: 855156
> latency average: 5.612 ms
> latency stddev: 7.896 ms
> tps = 2849.876327 (including connections establishing)
> tps = 2849.912015 (excluding connections establishing)

Hm. I think I have an entirely different theory that might explain some
of this behaviour. I instrumented lwlocks to check for additional blocking
and found some. Admittedly not exactly where I thought it might
be. Check out what you can observe when adding/enabling an elog in
FlushBuffer() (and the progress printing from BufferSync()):

(sorry, a bit long, but it's necessary to understand)

[2016-01-11 20:15:02 CET][14957] CONTEXT:  writing block 0 of relation base/13000/16387
to_scan: 131141, scanned: 6, %processed: 0.00, %writeouts: 100.00
[2016-01-11 20:15:02 CET][14957] LOG:  xlog flush request 1F/D2FD7E0; write 1F/D296000; flush 1F/D296000; insert:
1F/D33B418
[2016-01-11 20:15:02 CET][14957] CONTEXT:  writing block 2 of relation base/13000/16387
to_scan: 131141, scanned: 7, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:02 CET][14957] LOG:  xlog flush request 1F/D3B2E30; write 1F/D33C000; flush 1F/D33C000; insert:
1F/D403198
[2016-01-11 20:15:02 CET][14957] CONTEXT:  writing block 3 of relation base/13000/16387
to_scan: 131141, scanned: 9, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:02 CET][14957] LOG:  xlog flush request 1F/D469990; write 1F/D402000; flush 1F/D402000; insert:
1F/D4FDD00
[2016-01-11 20:15:02 CET][14957] CONTEXT:  writing block 5 of relation base/13000/16387
to_scan: 131141, scanned: 11, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:02 CET][14957] LOG:  xlog flush request 1F/D5663E8; write 1F/D4FC000; flush 1F/D4FC000; insert:
1F/D5D1390
[2016-01-11 20:15:02 CET][14957] CONTEXT:  writing block 7 of relation base/13000/16387
to_scan: 131141, scanned: 14, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:02 CET][14957] LOG:  xlog flush request 1F/D673700; write 1F/D5D0000; flush 1F/D5D0000; insert:
1F/D687E58
[2016-01-11 20:15:02 CET][14957] CONTEXT:  writing block 10 of relation base/13000/16387
to_scan: 131141, scanned: 15, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:02 CET][14957] LOG:  xlog flush request 1F/D76BEC8; write 1F/D686000; flush 1F/D686000; insert:
1F/D7A83A0
[2016-01-11 20:15:02 CET][14957] CONTEXT:  writing block 11 of relation base/13000/16387
to_scan: 131141, scanned: 16, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/D7AE5C0; write 1F/D7A83E8; flush 1F/D7A83E8; insert:
1F/D8B9A88
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 12 of relation base/13000/16387
to_scan: 131141, scanned: 17, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/DA08370; write 1F/D963A38; flush 1F/D963A38; insert:
1F/DA0A7D0
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 13 of relation base/13000/16387
to_scan: 131141, scanned: 18, %processed: 0.01, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/DAC09A0; write 1F/DA92250; flush 1F/DA92250; insert:
1F/DB9AAC8
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 14 of relation base/13000/16387
to_scan: 131141, scanned: 21, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/DCEFF18; write 1F/DC2AD30; flush 1F/DC2AD30; insert:
1F/DCF25B0
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 17 of relation base/13000/16387
to_scan: 131141, scanned: 23, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/DD0E9E0; write 1F/DCF25F8; flush 1F/DCF25F8; insert:
1F/DDD6198
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 19 of relation base/13000/16387
to_scan: 131141, scanned: 24, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/DED6A20; write 1F/DEC0358; flush 1F/DEC0358; insert:
1F/DFB64C8
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 20 of relation base/13000/16387
to_scan: 131141, scanned: 25, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/DFDEE90; write 1F/DFB6560; flush 1F/DFB6560; insert:
1F/E073468
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 21 of relation base/13000/16387
to_scan: 131141, scanned: 26, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/E295638; write 1F/E10B9F8; flush 1F/E10B9F8; insert:
1F/E2B40E0
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 22 of relation base/13000/16387
to_scan: 131141, scanned: 27, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/E381688; write 1F/E354BC0; flush 1F/E354BC0; insert:
1F/E459598
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 23 of relation base/13000/16387
to_scan: 131141, scanned: 28, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/E56EF70; write 1F/E4C0C98; flush 1F/E4C0C98; insert:
1F/E56F200
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 24 of relation base/13000/16387
to_scan: 131141, scanned: 29, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/E67E538; write 1F/E5DC440; flush 1F/E5DC440; insert:
1F/E6F7FF8
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 25 of relation base/13000/16387
to_scan: 131141, scanned: 31, %processed: 0.02, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/E873DD8; write 1F/E7D81F0; flush 1F/E7D81F0; insert:
1F/E8A1710
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 27 of relation base/13000/16387
to_scan: 131141, scanned: 33, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/E9E3948; write 1F/E979610; flush 1F/E979610; insert:
1F/EA27AC0
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 29 of relation base/13000/16387
to_scan: 131141, scanned: 35, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/EABDDC8; write 1F/EA6DFE0; flush 1F/EA6DFE0; insert:
1F/EB10728
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 31 of relation base/13000/16387
to_scan: 131141, scanned: 37, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/EC07328; write 1F/EBAABE0; flush 1F/EBAABE0; insert:
1F/EC9B8A8
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 33 of relation base/13000/16387
to_scan: 131141, scanned: 40, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/ED18FF8; write 1F/EC9B8A8; flush 1F/EC9B8A8; insert:
1F/ED8C2F8
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 36 of relation base/13000/16387
to_scan: 131141, scanned: 41, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/EEED640; write 1F/EE0BAD8; flush 1F/EE0BAD8; insert:
1F/EF35EA8
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 37 of relation base/13000/16387
to_scan: 131141, scanned: 42, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/EFF20B8; write 1F/EFAAE20; flush 1F/EFAAE20; insert:
1F/F06FAC0
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 38 of relation base/13000/16387
to_scan: 131141, scanned: 43, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/F1430B0; write 1F/F0DEAB8; flush 1F/F0DEAB8; insert:
1F/F265020
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 39 of relation base/13000/16387
to_scan: 131141, scanned: 45, %processed: 0.03, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/F3556C0; write 1F/F268F68; flush 1F/F268F68; insert:
1F/F3682B8
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 41 of relation base/13000/16387
to_scan: 131141, scanned: 46, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/F5005F8; write 1F/F4376F8; flush 1F/F4376F8; insert:
1F/F523838
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 42 of relation base/13000/16387
to_scan: 131141, scanned: 47, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/F6261C0; write 1F/F5A07A0; flush 1F/F5A07A0; insert:
1F/F691288
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 43 of relation base/13000/16387
to_scan: 131141, scanned: 48, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/F7CBCD0; write 1F/F719020; flush 1F/F719020; insert:
1F/F80DBB0
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 44 of relation base/13000/16387
to_scan: 131141, scanned: 49, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/F9359C8; write 1F/F874CB8; flush 1F/F874CB8; insert:
1F/F95AD58
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 45 of relation base/13000/16387
to_scan: 131141, scanned: 50, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/FA33F38; write 1F/FA03490; flush 1F/FA03490; insert:
1F/FAD4DF8
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 46 of relation base/13000/16387
to_scan: 131141, scanned: 51, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/FBDBCD8; write 1F/FB52238; flush 1F/FB52238; insert:
1F/FC54E68
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 47 of relation base/13000/16387
to_scan: 131141, scanned: 52, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/FD74B60; write 1F/FD10360; flush 1F/FD10360; insert:
1F/FDB6A88
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 48 of relation base/13000/16387
to_scan: 131141, scanned: 53, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/FE4FF60; write 1F/FDB6AD0; flush 1F/FDB6AD0; insert:
1F/FE90028
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 49 of relation base/13000/16387
to_scan: 131141, scanned: 54, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:03 CET][14957] LOG:  xlog flush request 1F/FFD6A78; write 1F/FF223F0; flush 1F/FF223F0; insert:
1F/10022F70
[2016-01-11 20:15:03 CET][14957] CONTEXT:  writing block 50 of relation base/13000/16387
to_scan: 131141, scanned: 55, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/10144C98; write 1F/10023000; flush 1F/10023000; insert:
1F/10157730
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 51 of relation base/13000/16387
to_scan: 131141, scanned: 58, %processed: 0.04, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/102AA468; write 1F/1020C600; flush 1F/1020C600; insert:
1F/102C73F0
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 54 of relation base/13000/16387
to_scan: 131141, scanned: 60, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/10313470; write 1F/102C7460; flush 1F/102C7460; insert:
1F/103D4F38
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 56 of relation base/13000/16387
to_scan: 131141, scanned: 61, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/10510CE8; write 1F/104562F0; flush 1F/104562F0; insert:
1F/105171E8
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 57 of relation base/13000/16387
to_scan: 131141, scanned: 62, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/10596B18; write 1F/105191B0; flush 1F/105191B0; insert:
1F/106076F8
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 58 of relation base/13000/16387
to_scan: 131141, scanned: 63, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/1073FB28; write 1F/10693638; flush 1F/10693638; insert:
1F/10787D40
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 59 of relation base/13000/16387
to_scan: 131141, scanned: 64, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/1088D058; write 1F/107F7068; flush 1F/107F7068; insert:
1F/10920EA0
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 60 of relation base/13000/16387
to_scan: 131141, scanned: 67, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/109D9158; write 1F/109A8458; flush 1F/109A8458; insert:
1F/10A8A240
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 63 of relation base/13000/16387
to_scan: 131141, scanned: 68, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/10BDAA38; write 1F/10B2AD48; flush 1F/10B2AD48; insert:
1F/10C16768
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 64 of relation base/13000/16387
to_scan: 131141, scanned: 69, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/10D824D0; write 1F/10C859A0; flush 1F/10C859A0; insert:
1F/10DCC860
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 65 of relation base/13000/16387
to_scan: 131141, scanned: 70, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/10E24CD8; write 1F/10DCC8A8; flush 1F/10DCC8A8; insert:
1F/10EA8588
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 66 of relation base/13000/16387
to_scan: 131141, scanned: 71, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/10FD3E90; write 1F/10F57530; flush 1F/10F57530; insert:
1F/11043A58
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 67 of relation base/13000/16387
to_scan: 131141, scanned: 72, %processed: 0.05, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/111CE4A0; write 1F/11043AC8; flush 1F/11043AC8; insert:
1F/111ED470
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 68 of relation base/13000/16387
to_scan: 131141, scanned: 73, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11338080; write 1F/112917C8; flush 1F/112917C8; insert:
1F/1135CF80
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 69 of relation base/13000/16387
to_scan: 131141, scanned: 76, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11369068; write 1F/1135CF80; flush 1F/1135CF80; insert:
1F/1140BE88
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 72 of relation base/13000/16387
to_scan: 131141, scanned: 77, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/1146A420; write 1F/1136E000; flush 1F/1136E000; insert:
1F/11483530
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 73 of relation base/13000/16387
to_scan: 131141, scanned: 78, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/1157B800; write 1F/11483530; flush 1F/11483530; insert:
1F/11583E20
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 74 of relation base/13000/16387
to_scan: 131141, scanned: 79, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/116368C0; write 1F/11583E20; flush 1F/11583E20; insert:
1F/116661A8
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 75 of relation base/13000/16387
to_scan: 131141, scanned: 81, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/116FC598; write 1F/11668178; flush 1F/11668178; insert:
1F/11716758
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 0 of relation base/13000/16393
to_scan: 131141, scanned: 82, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/117DA658; write 1F/117631F0; flush 1F/117631F0; insert:
1F/118206F0
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 1 of relation base/13000/16393
to_scan: 131141, scanned: 83, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11956320; write 1F/118E96B8; flush 1F/118E96B8; insert:
1F/1196F000
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 2 of relation base/13000/16393
to_scan: 131141, scanned: 84, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11A09B00; write 1F/1196F090; flush 1F/1196F090; insert:
1F/11A23D38
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 3 of relation base/13000/16393
to_scan: 131141, scanned: 85, %processed: 0.06, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11B43C80; write 1F/11AB2148; flush 1F/11AB2148; insert:
1F/11B502D8
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 4 of relation base/13000/16393
to_scan: 131141, scanned: 86, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11BE2610; write 1F/11B503B8; flush 1F/11B503B8; insert:
1F/11BF9068
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 5 of relation base/13000/16393
to_scan: 131141, scanned: 87, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11CB9FD8; write 1F/11BF9168; flush 1F/11BF9168; insert:
1F/11CBE1F8
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 6 of relation base/13000/16393
to_scan: 131141, scanned: 88, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11D24E10; write 1F/11CBE268; flush 1F/11CBE268; insert:
1F/11D8BC18
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 7 of relation base/13000/16393
to_scan: 131141, scanned: 89, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11E9B070; write 1F/11DEC840; flush 1F/11DEC840; insert:
1F/11EB7EC0
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 8 of relation base/13000/16393
to_scan: 131141, scanned: 90, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/11F5C3F0; write 1F/11F3FBD0; flush 1F/11F3FBD0; insert:
1F/11FE1A08
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 9 of relation base/13000/16393
to_scan: 131141, scanned: 91, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/121EDC00; write 1F/1208E838; flush 1F/1208E838; insert:
1F/121F1EF8
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 10 of relation base/13000/16393
to_scan: 131141, scanned: 92, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/122E0A70; write 1F/121F1F90; flush 1F/121F1F90; insert:
1F/122E9198
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 11 of relation base/13000/16393
to_scan: 131141, scanned: 93, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/1243B698; write 1F/123A7EC8; flush 1F/123A7EC8; insert:
1F/1245E620
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 12 of relation base/13000/16393
to_scan: 131141, scanned: 94, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/1258E7B0; write 1F/124BF6B8; flush 1F/124BF6B8; insert:
1F/1259F198
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 13 of relation base/13000/16393
to_scan: 131141, scanned: 95, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/126C8E38; write 1F/12662BA0; flush 1F/12662BA0; insert:
1F/126FE690
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 14 of relation base/13000/16393
to_scan: 131141, scanned: 96, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/127DE810; write 1F/126FE6D8; flush 1F/126FE6D8; insert:
1F/128081B0
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 15 of relation base/13000/16393
to_scan: 131141, scanned: 97, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:04 CET][14957] LOG:  xlog flush request 1F/12980108; write 1F/128A6000; flush 1F/128A6000; insert:
1F/129A8E00
[2016-01-11 20:15:04 CET][14957] CONTEXT:  writing block 16 of relation base/13000/16393
to_scan: 131141, scanned: 98, %processed: 0.07, %writeouts: 100.00
[2016-01-11 20:15:05 CET][14957] LOG:  xlog flush request 1F/12A55978; write 1F/129ACDB8; flush 1F/129ACDB8; insert:
1F/12A6A408
[2016-01-11 20:15:05 CET][14957] CONTEXT:  writing block 17 of relation base/13000/16393
to_scan: 131141, scanned: 99, %processed: 0.08, %writeouts: 100.00
[2016-01-11 20:15:05 CET][14957] LOG:  xlog flush request 1F/12BC1148; write 1F/12B12F40; flush 1F/12B12F40; insert:
1F/12BC15F8
[2016-01-11 20:15:05 CET][14957] CONTEXT:  writing block 18 of relation base/13000/16393
to_scan: 131141, scanned: 100, %processed: 0.08, %writeouts: 100.00
[2016-01-11 20:15:05 CET][14957] LOG:  xlog flush request 1F/12D36E20; write 1F/12C70120; flush 1F/12C70120; insert:
1F/12D4DC08
[2016-01-11 20:15:05 CET][14957] CONTEXT:  writing block 19 of relation base/13000/16393
to_scan: 131141, scanned: 9892, %processed: 7.54, %writeouts: 100.00
[2016-01-11 20:15:05 CET][14957] LOG:  xlog flush request 1F/13128AF8; write 1F/12DEE670; flush 1F/12DEE670; insert:
1F/1313B7D0
[2016-01-11 20:15:05 CET][14957] CONTEXT:  writing block 101960 of relation base/13000/16396
to_scan: 131141, scanned: 18221, %processed: 13.89, %writeouts: 100.00
[2016-01-11 20:15:05 CET][14957] LOG:  xlog flush request 1F/13276328; write 1F/1313A000; flush 1F/1313A000; insert:
1F/134E93A8
[2016-01-11 20:15:05 CET][14957] CONTEXT:  writing block 188242 of relation base/13000/16396
to_scan: 131141, scanned: 25857, %processed: 19.72, %writeouts: 100.00
[2016-01-11 20:15:06 CET][14957] LOG:  xlog flush request 1F/13497370; write 1F/1346E000; flush 1F/1346E000; insert:
1F/136C00F8
[2016-01-11 20:15:06 CET][14957] CONTEXT:  writing block 267003 of relation base/13000/16396
to_scan: 131141, scanned: 26859, %processed: 20.48, %writeouts: 100.00
[2016-01-11 20:15:06 CET][14957] LOG:  xlog flush request 1F/136B5BB0; write 1F/135D6000; flush 1F/135D6000; insert:
1F/136C00F8
[2016-01-11 20:15:06 CET][14957] CONTEXT:  writing block 277621 of relation base/13000/16396
to_scan: 131141, scanned: 27582, %processed: 21.03, %writeouts: 100.00
[2016-01-11 20:15:06 CET][14957] LOG:  xlog flush request 1F/138C6C38; write 1F/1375E900; flush 1F/1375E900; insert:
1F/138D5518
[2016-01-11 20:15:06 CET][14957] CONTEXT:  writing block 285176 of relation base/13000/16396
to_scan: 131141, scanned: 28943, %processed: 22.07, %writeouts: 100.00
[2016-01-11 20:15:06 CET][14957] LOG:  xlog flush request 1F/13A5B768; write 1F/138C8000; flush 1F/138C8000; insert:
1F/13AB61D0
[2016-01-11 20:15:06 CET][14957] CONTEXT:  writing block 300007 of relation base/13000/16396
to_scan: 131141, scanned: 36181, %processed: 27.59, %writeouts: 100.00
[2016-01-11 20:15:06 CET][14957] LOG:  xlog flush request 1F/13C320C8; write 1F/13A8A000; flush 1F/13A8A000; insert:
1F/13DAAB40
[2016-01-11 20:15:06 CET][14957] CONTEXT:  writing block 375983 of relation base/13000/16396
to_scan: 131141, scanned: 40044, %processed: 30.54, %writeouts: 100.00
[2016-01-11 20:15:07 CET][14957] LOG:  xlog flush request 1F/13E196C8; write 1F/13CBA000; flush 1F/13CBA000; insert:
1F/13F9E6D8
[2016-01-11 20:15:07 CET][14957] CONTEXT:  writing block 416439 of relation base/13000/16396
to_scan: 131141, scanned: 48250, %processed: 36.79, %writeouts: 100.00
[2016-01-11 20:15:07 CET][14957] LOG:  xlog flush request 1F/143F6160; write 1F/13EE8000; flush 1F/13EE8000; insert:
1F/1461BB08

You can see that initially every buffer triggers a WAL flush. That
causes a slowdown because a) we're doing significantly more WAL flushes
in that time period, slowing down both concurrent IO and
concurrent WAL insertions, and b) due to the many slow flushes we get behind
on the checkpoint schedule, triggering a rapid-fire period of writes
afterwards.
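
The mechanism behind a) is the WAL-before-data rule: a dirty page can only
be written out once the WAL up to that page's LSN has been flushed. A
minimal illustration (not the actual FlushBuffer() code; helpers are
assumptions for this sketch) of why hot, recently dirtied pages each force
their own WAL flush:

typedef unsigned long long XLogRecPtr;

/* Assumed helpers for this illustration only. */
extern XLogRecPtr page_lsn(int buf_id);          /* LSN stamped on the page */
extern XLogRecPtr wal_flushed_up_to;             /* how far WAL is on disk  */
extern void       flush_wal_up_to(XLogRecPtr p); /* fsync the WAL           */
extern void       write_page(int buf_id);        /* write out the data page */

static void
flush_one_buffer(int buf_id)
{
    XLogRecPtr lsn = page_lsn(buf_id);

    /* WAL-before-data: hot pages carry a very recent LSN, so when sorted
     * writes hit them in a row, nearly every write triggers a WAL flush. */
    if (lsn > wal_flushed_up_to)
        flush_wal_up_to(lsn);
    write_page(buf_id);
}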

My theory is that this happens due to the sorting: pgbench is an update-heavy
workload, and the first few pages of a relation are always going to be used if
there's free space, as freespacemap.c essentially prefers those. Due to
the sorting, all of a relation's early pages are going to be written "in a row".

Indeed, the behaviour is not visible in a significant manner when using
pgbench -N, where there are far fewer updated pages.

I'm not entirely sure how we can deal with that.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Tue, Jan 12, 2016 at 12:57 AM, Andres Freund <andres@anarazel.de> wrote:
>
> My theory is that this happens due to the sorting: pgbench is an update
> heavy workload, the first few pages are always going to be used if
> there's free space as freespacemap.c essentially prefers those. Due to
> the sorting all a relation's early pages are going to be in "in a row".
>

Not sure what the best way to tackle this problem is, but I think one way could
be to perform the sorting at the flush-request level rather than before writing
to OS buffers.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-12 17:50:36 +0530, Amit Kapila wrote:
> On Tue, Jan 12, 2016 at 12:57 AM, Andres Freund <andres@anarazel.de> wrote:
> >
> > My theory is that this happens due to the sorting: pgbench is an update
> > heavy workload, the first few pages are always going to be used if
> > there's free space as freespacemap.c essentially prefers those. Due to
> > the sorting all a relation's early pages are going to be in "in a row".
> >
> 
> Not sure, what is best way to tackle this problem, but I think one way could
> be to perform sorting at flush requests level rather than before writing
> to OS buffers.

I'm not following. If you just sort a couple hundred more or less random
buffers - which is what you get if you look in buf_id order through
shared_buffers - the likelihood of actually finding neighbouring writes
is pretty low.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

Thanks for the details. Many comments and some questions below.

>> Also, maybe you could answer a question I had about the performance
>> regression you observed, I could not find the post where you gave the
>> detailed information about it, so that I could try reproducing it: what are
>> the exact settings and conditions (shared_buffers, pgbench scaling, host
>> memory, ...), what is the observed regression (tps? other?), and what is the
>> responsiveness of the database under the regression (eg % of seconds with 0
>> tps for instance, or something like that).
>
> I measured it in a different number of cases, both on SSDs
> and spinning rust.

Argh! This is a key point: the sort/flush is designed to help HDDs, and 
would have limited effect on SSDs, and it seems that you are showing that 
the effect is in fact negative on SSDs, too bad:-(

The bad news is that I do not have a host with a SSD available for 
reproducing such results.

On SSDs, the linux IO scheduler works quite well, so this is a place where 
I would consider simply deactivating flushing and/or sorting.

ISTM that I would rather update the documentation to "do not activate on 
SSD" than try to find a miraculous solution which may or may not exist. 
Basically I would use your results to give better advice in the 
documentation, not as a motivation to rewrite the patch from scratch.

> postgres-ckpt14 \
>        -D /srv/temp/pgdev-dev-800/ \
>        -c maintenance_work_mem=2GB \
>        -c fsync=on \
>        -c synchronous_commit=off \

I'm not sure I like this one. I guess the intention is to focus on 
checkpointer writes and reduce the impact of WAL writes. Why not.

>        -c shared_buffers=2GB \
>        -c wal_level=hot_standby \
>        -c max_wal_senders=10 \
>        -c max_wal_size=100GB \
>        -c checkpoint_timeout=30s

That is a very short one, but the point is to exercise the checkpoint, so 
why not.

> My laptop 1 EVO 840, 1 i7-4800MQ, 16GB ram:
> master:
> scaling factor: 800

The DB is probably about 12GB, so it fits in memory in the end, meaning 
that there should be only write activity after some time? So this is not 
really the case where it does not fit in memory, but it is large enough to 
get mostly random IOs both in read & write, so why not.

> query mode: prepared
> number of clients: 16
> number of threads: 16
> duration: 300 s
> number of transactions actually processed: 1155733

Assuming one buffer accessed per transaction on average, and considering a 
uniform random distribution, this means about 50% of the pages were actually 
loaded in memory by the end of the run (1 - e^(-1155733/(800*2048)), which 
is about 0.51; there are 2048 pages per scale unit).

> latency average: 4.151 ms
> latency stddev: 8.712 ms
> tps = 3851.242965 (including connections establishing)
> tps = 3851.725856 (excluding connections establishing)

> ckpt-14 (flushing by backends disabled):

Is this comment referring to "synchronous_commit = off"?
I guess this is the same on master above, even if not written?

> [...] In neither case there are periods of 0 tps, but both have times of 
> 1000 tps with noticeably increased latency.

Ok, but we are talking SSDs, things are not too bad, even if there are ups 
and downs.

> The endresults are similar with a sane checkpoint timeout - the tests
> just take much longer to give meaningful results. Constantly running
> long tests on prosumer level SSDs isn't nice - I've now killed 5 SSDs
> with postgres testing...

Indeed. It wears out and costs, too bad:-(

> As you can see there's roughly a 30% performance regression on the
> slower SSD and a ~9% on the faster one. HDD results are similar (but I
> can't repeat on the laptop right now since the 2nd hdd is now an SSD).

Ok, that is what I would have expected: the larger the database, the 
smaller the impact of sorting & flushing on SSDs. Now I would have hoped 
that flushing would help get a more constant load even in this case, at 
least this is what I measured in my tests. The test closest to your setting 
that I ran is scale=660, and the sort/flush got 400 tps vs 100 tps 
without, with 30 minute checkpoints, but HDDs do not compare to SSDs...

My overall comment about this SSD regression is that the patch is really 
designed to make a difference for HDDs, so I would advise not activating it 
on SSDs if there is a regression in such a case.

Now this is a little disappointing as on paper sorted writes should also 
be slightly better on SSDs, but if the bench says the contrary, I have to 
believe the bench:-)

-- 
Fabien.



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Tue, Jan 12, 2016 at 5:52 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-01-12 17:50:36 +0530, Amit Kapila wrote:
> > On Tue, Jan 12, 2016 at 12:57 AM, Andres Freund <andres@anarazel.de> wrote:
> > >
> > > My theory is that this happens due to the sorting: pgbench is an update
> > > heavy workload, the first few pages are always going to be used if
> > > there's free space as freespacemap.c essentially prefers those. Due to
> > > the sorting all a relation's early pages are going to be in "in a row".
> > >
> >
> > Not sure, what is best way to tackle this problem, but I think one way could
> > be to perform sorting at flush requests level rather than before writing
> > to OS buffers.
>
> I'm not following. If you just sort a couple hundred more or less random
> buffers - which is what you get if you look in buf_id order through
> shared_buffers - the likelihood of actually finding neighbouring writes
> is pretty low.
>

Why can't we do it at larger intervals (relative to the total amount of writes)?
To explain what I have in mind: let us assume that the checkpoint interval
is longer (10 mins) and in the meantime all the writes are being done
by the bgwriter, which registers them in shared memory so that a later
checkpoint can perform the corresponding fsync's. Now, when the request
queue reaches a threshold size (let us say 1/3rd full), we can perform
sorting and merging and issue flush hints.  The checkpointer task can
also follow a somewhat similar technique, which means that once it
has written 1/3rd or so of the buffers (which we need to track), it can
issue flush hints after sort+merge.  Now, I think we can also
do it in the checkpointer alone rather than in both bgwriter and checkpointer.
Basically, I think this can lead to less merging of neighbouring
writes, but it might not hurt if the sync_file_range() API is cheap.
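
To make that concrete, a rough sketch of the triggering logic (hypothetical
names and sizes; the real request queue would be CheckpointerShmem->requests[]
or similar):

/* Hypothetical sketch: once the accumulated request queue is more than a
 * third full, sort it, merge neighbouring requests and issue flush hints,
 * instead of waiting for the checkpoint's final fsync phase. */
#define QUEUE_SIZE       4096               /* hypothetical capacity */
#define FLUSH_THRESHOLD  (QUEUE_SIZE / 3)

static int nrequests = 0;                   /* requests accumulated so far */

static void
maybe_issue_flush_hints(void)
{
    if (nrequests < FLUSH_THRESHOLD)
        return;

    /* sort the requests by (file, block), merge neighbouring blocks, and
     * issue one flush hint per merged range, then reset the counter */
    nrequests = 0;
}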



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-12 13:54:21 +0100, Fabien COELHO wrote:
> >I measured it in a different number of cases, both on SSDs
> >and spinning rust.
> 
> Argh! This is a key point: the sort/flush is designed to help HDDs, and
> would have limited effect on SSDs, and it seems that you are showing that
> the effect is in fact negative on SSDs, too bad:-(

As you quoted, I could reproduce the slowdown both with SSDs *and* with
rotating disks.

> On SSDs, the linux IO scheduler works quite well, so this is a place where I
> would consider simply deactivating flushing and/or sorting.

Not my experience. In different scenarios, primarily with a large
shared_buffers fitting the whole hot working set, the patch
significantly improves performance.

> >postgres-ckpt14 \
> >       -D /srv/temp/pgdev-dev-800/ \
> >       -c maintenance_work_mem=2GB \
> >       -c fsync=on \
> >       -c synchronous_commit=off \
> 
> I'm not sure I like this one. I guess the intention is to focus on
> checkpointer writes and reduce the impact of WAL writes. Why not.

Not sure what you mean? s_c = off is *very* frequent in the field.

> >My laptop 1 EVO 840, 1 i7-4800MQ, 16GB ram:
> >master:
> >scaling factor: 800
> 
> The DB is probably about 12GB, so it fits in memory in the end, meaning that
> there should be only write activity after some time? So this is not really
> the case where it does not fit in memory, but it is large enough to get
> mostly random IOs both in read & write, so why not.

Doesn't really fit into ram - shared buffers uses some space (which will
be double buffered) and the xlog will use some more.

> >ckpt-14 (flushing by backends disabled):
> 
> Is this comment referring to "synchronous_commit = off"?
> I guess this is the same on master above, even if not written?

No, what I mean by that is that I didn't activate flushing of writes in
backends - something I found hugely effective in reducing jitter in a
number of workloads, but it doesn't help throughput.

> >As you can see there's roughly a 30% performance regression on the
> >slower SSD and a ~9% on the faster one. HDD results are similar (but I
> >can't repeat on the laptop right now since the 2nd hdd is now an SSD).
> 
> Ok, that is what I would have expected, the larger the database, the smaller
> the impact of sorting & flushing on SSDs.

Again: "HDD results are similar". I primarily tested on a 4 disk raid10
of 4 disks, and a raid0 of 20 disks.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-12 19:17:49 +0530, Amit Kapila wrote:
> Why can't we do it at larger intervals (relative to total amount of writes)?
> To explain, what I have in mind, let us assume that checkpoint interval
> is longer (10 mins) and in the mean time all the writes are being done
> by bgwriter

But that's not the scenario with the regression here, so I'm not sure
why you're bringing it up?

And if we're flushing a significant portion of the writes, how does that
avoid the performance problem pointed out two messages upthread? Where
sorting leads to flushing highly contended buffers together, leading to
excessive wal flushing?

But more importantly, unless you also want to delay the writes
themselves, leaving that many dirty buffers in the kernel page cache
will bring back exactly the type of stalls (where the kernel flushes all
the pending dirty data in a short amount of time) we're trying to avoid
with the forced flushing. So doing flushes in large batches is
something we really fundamentally do *not* want!
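
(For illustration, this is roughly what flushing in small batches while
writing looks like - a sketch only, with a made-up batch size, assuming the
sorted pages of a single file are written in ascending order; not the
actual patch code:)

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <fcntl.h>                  /* sync_file_range(), Linux-specific */

    #define BLCKSZ       8192           /* PostgreSQL's default block size */
    #define FLUSH_AFTER  32             /* illustrative batch size, in pages */

    static int   pending = 0;           /* pages written since the last hint */
    static off_t range_start = -1;      /* where the pending range begins */
    static off_t range_end = 0;         /* one past the last written byte */

    /* Called after each page the checkpointer writes, in sorted order. */
    static void
    after_page_write(int fd, off_t offset)
    {
        if (range_start < 0)
            range_start = offset;
        range_end = offset + BLCKSZ;

        if (++pending >= FLUSH_AFTER)
        {
            /* Ask the kernel to start writeback of just this small range
             * now, instead of letting dirty data pile up until fsync(). */
            sync_file_range(fd, range_start, range_end - range_start,
                            SYNC_FILE_RANGE_WRITE);
            pending = 0;
            range_start = -1;
        }
    }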

> which it registers in shared memory so that later checkpoint
> can perform corresponding fsync's, now when the request queue
> > becomes threshold size (let us say 1/3rd) full, then we can perform
> sorting and merging and issue flush hints.

Which means that a significant portion of the writes won't be able to be
collapsed, since only a random 1/3 of the buffers is sorted together.


> Basically, I think this can lead to lesser merging of neighbouring
> writes, but might not hurt if sync_file_range() API is cheap.

The cost of writing out data does correspond heavily with the number of
random writes - which is what you get if you reduce the number of
neighbouring writes.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Tue, Jan 12, 2016 at 7:24 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-01-12 19:17:49 +0530, Amit Kapila wrote:
> > Why can't we do it at larger intervals (relative to total amount of writes)?
> > To explain, what I have in mind, let us assume that checkpoint interval
> > is longer (10 mins) and in the mean time all the writes are being done
> > by bgwriter
>
> But that's not the scenario with the regression here, so I'm not sure
> why you're bringing it up?
>
> And if we're flushing a significant portion of the writes, how does that
> avoid the performance problem pointed out two messages upthread? Where
> sorting leads to flushing highly contended buffers together, leading to
> excessive wal flushing?
>

I think it will avoid that problem, because what I am suggesting is not to sort
the buffers before writing, but rather to sort the flush requests.  If I remember
correctly, Fabien's initial patch doesn't have sorting at the buffer
level, but he is still able to see the benefits in many cases.
 
>
> But more importantly, unless you also want to delay the writes
> themselves, leaving that many dirty buffers in the kernel page cache
> will bring back exactly the type of stalls (where the kernel flushes all
> the pending dirty data in a short amount of time) we're trying to avoid
> with the forced flushing. So doing flushes in large batches is
> something we really fundamentally do *not* want!
>

Could it be because of random I/O?

> > which it registers in shared memory so that later checkpoint
> > can perform corresponding fsync's, now when the request queue
> > becomes threshold size (let us say 1/3rd) full, then we can perform
> > sorting and merging and issue flush hints.
>
> Which means that a significant portion of the writes won't be able to be
> collapsed, since only a random 1/3 of the buffers is sorted together.
>
>
> > Basically, I think this can lead to lesser merging of neighbouring
> > writes, but might not hurt if sync_file_range() API is cheap.
>
> The cost of writing out data does correspond heavily with the number of
> random writes - which is what you get if you reduce the number of
> neighbouring writes.
>

Yeah, that's right, but I am not sure how much difference it would
make if we sort everything in one shot versus doing it in
batches.  In any case, I am just trying to think out loud to see if we
can find some solution to the regression you have seen above
without disabling sorting altogether for certain cases.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

>> Argh! This is a key point: the sort/flush is designed to help HDDs, and
>> would have limited effect on SSDs, and it seems that you are showing that
>> the effect is in fact negative on SSDs, too bad:-(
>
> As you quoted, I could reproduce the slowdown both with SSDs *and* with
> rotating disks.

Ok, once again I misunderstood. So you have a regression on HDD with the 
settings you pointed out, I can try that.

>> On SSDs, the linux IO scheduler works quite well, so this is a place where I
>> would consider simply deactivating flushing and/or sorting.
>
> Not my experience. In different scenarios, primarily with a large
> shared_buffers fitting the whole hot working set, the patch
> significantly improves performance.

Good! That would be what I expected, but I have no way to test that.

>>> postgres-ckpt14 \
>>>       -D /srv/temp/pgdev-dev-800/ \
>>>       -c maintenance_work_mem=2GB \
>>>       -c fsync=on \
>>>       -c synchronous_commit=off \
>>
>> I'm not sure I like this one. I guess the intention is to focus on
>> checkpointer writes and reduce the impact of WAL writes. Why not.
>
> Not sure what you mean? s_c = off is *very* frequent in the field.

Too bad, because for me it is really deactivating the D of ACID...

I think that this setting would not issue the "sync" calls on the WAL 
file, which means that the impact of WAL writing is somehow reduced and 
random writes (more or less one for each transaction) are switched to 
sequential writes by the IO scheduler.

>>> My laptop 1 EVO 840, 1 i7-4800MQ, 16GB ram:
>>> master:
>>> scaling factor: 800
>>
>> The DB is probably about 12GB, so it fits in memory in the end, meaning that
>> there should be only write activity after some time? So this is not really
>> the case where it does not fit in memory, but it is large enough to get
>> mostly random IOs both in read & write, so why not.
>
> Doesn't really fit into ram - shared buffers uses some space (which will
> be double buffered) and the xlog will use some more.

Hmmm. My understanding is that you are really using about 6GB of shared 
buffer data in a run, plus some write only stuff...

The xlog is flushed/synced constantly and never read again, so I would be 
surprised if it had a significant memory impact.

>>> ckpt-14 (flushing by backends disabled):
>>
>> Is this comment referring to "synchronous_commit = off"?
>> I guess this is the same on master above, even if not written?
>
> No, what I mean by that is that I didn't activate flushing of writes in
> backends -

I'm not sure that I understand. What is the actual corresponding directive 
in the configuration file?

>>> As you can see there's roughly a 30% performance regression on the
>>> slower SSD and a ~9% on the faster one. HDD results are similar (but I
>>> can't repeat on the laptop right now since the 2nd hdd is now an SSD).
>>
>> Ok, that is what I would have expected, the larger the database, the smaller
>> the impact of sorting & flushing on SSDs.
>
> Again: "HDD results are similar". I primarily tested on a 4 disk raid10
> of 4 disks, and a raid0 of 20 disks.

I guess similar but with a much lower tps. Anyway I can try that.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
Hi Fabien,

On 2016-01-11 14:45:16 +0100, Andres Freund wrote:
> I measured it in a different number of cases, both on SSDs and spinning
> rust. I just reproduced it with:
> 
> postgres-ckpt14 \
>         -D /srv/temp/pgdev-dev-800/ \
>         -c maintenance_work_mem=2GB \
>         -c fsync=on \
>         -c synchronous_commit=off \
>         -c shared_buffers=2GB \
>         -c wal_level=hot_standby \
>         -c max_wal_senders=10 \
>         -c max_wal_size=100GB \
>         -c checkpoint_timeout=30s

What kernel, filesystem and filesystem option did you measure with?

I was/am using ext4, and it turns out that, when enabling flushing, the
results are hugely dependent on barriers=on/off, with the latter making
flushing rather advantageous. Additionally data=ordered/writeback makes a
measurable difference too.

Reading kernel sources trying to understand some more of the performance
impact.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
> Hi Fabien,

Hello Tomas.

> On 2016-01-11 14:45:16 +0100, Andres Freund wrote:
>> I measured it in a different number of cases, both on SSDs and spinning
>> rust. I just reproduced it with:
>>
>> postgres-ckpt14 \
>>         -D /srv/temp/pgdev-dev-800/ \
>>         -c maintenance_work_mem=2GB \
>>         -c fsync=on \
>>         -c synchronous_commit=off \
>>         -c shared_buffers=2GB \
>>         -c wal_level=hot_standby \
>>         -c max_wal_senders=10 \
>>         -c max_wal_size=100GB \
>>         -c checkpoint_timeout=30s
>
> What kernel, filesystem and filesystem option did you measure with?

Andres did these measurements, not me, so I do not know.

> I was/am using ext4, and it turns out that, when enabling flushing, the
> results are hugely dependent on barriers=on/off, with the latter making
> flushing rather advantageous. Additionally data=ordered/writeback makes a
> measurable difference too.

These are very interesting tests, I'm looking forward to having a look at 
the results.

The fact that these options change performance is expected. Personally, the 
test I submitted on the thread used ext4 with default mount options plus 
"relatime".

If I had a choice, I would tend to take the safest options, because the 
point of a database is to keep data safe. That's why I'm not found of the 
"synchronous_commit=off" chosen above.

> Reading kernel sources trying to understand some more of the performance
> impact.

Wow!

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

> Hello Tomas.

Ooops, sorry Andres, I mixed up the thread in my head, so it was not clear 
who was asking the questions to whom.

>> I was/am using ext4, and it turns out that, when enabling flushing, the
>> results are hugely dependent on barriers=on/off, with the latter making
>> flushing rather advantageous. Additionally data=ordered/writeback makes a
>> measurable difference too.
>
> These are very interesting tests, I'm looking forward to having a look at the
> results.
>
> The fact that these options change performance is expected. Personally, the 
> test I submitted on the thread used ext4 with default mount options plus 
> "relatime".

I confirm that: nothing special but "relatime" on ext4 on my test host.

> If I had a choice, I would tend to take the safest options, because the point 
> of a database is to keep data safe. That's why I'm not found of the 
> "synchronous_commit=off" chosen above.

"found" -> "fond". I confirm this opinion. If you have BBU on you 
disk/raid system probably playing with some of these options is safe, 
though. Not the case with my basic hardware.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Andres,

> I measured it in a different number of cases, both on SSDs and spinning
> rust. I just reproduced it with:
>
> postgres-ckpt14 \
>        -D /srv/temp/pgdev-dev-800/ \
>        -c maintenance_work_mem=2GB \
>        -c fsync=on \
>        -c synchronous_commit=off \
>        -c shared_buffers=2GB \
>        -c wal_level=hot_standby \
>        -c max_wal_senders=10 \
>        -c max_wal_size=100GB \
>        -c checkpoint_timeout=30s
>
> Using a fresh cluster each time (copied from a "template" to save time)
> and using
> pgbench -M prepared -c 16 -j 16 -T 300 -P 1

I'm running some tests similar to those above...

Do you do some warmup when testing? I guess the answer is "no".

I understand that you have 8 cores/16 threads on your host?

Loading scale 800 data for 300 seconds tests takes much more than 300 
seconds (init takes ~360 seconds, vacuum & index are slow). With 30 
seconds checkpoint cycles and without any warmup, I feel that these tests 
are really on the very short (too short) side, so I'm not sure how much I 
can trust such results as significant. The data I reported were with more 
real life like parameters.

Anyway, I'll have some results to show with a setting more or less similar 
to yours.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-16 10:01:25 +0100, Fabien COELHO wrote:
> 
> Hello Andres,
> 
> >I measured it in a different number of cases, both on SSDs and spinning
> >rust. I just reproduced it with:
> >
> >postgres-ckpt14 \
> >       -D /srv/temp/pgdev-dev-800/ \
> >       -c maintenance_work_mem=2GB \
> >       -c fsync=on \
> >       -c synchronous_commit=off \
> >       -c shared_buffers=2GB \
> >       -c wal_level=hot_standby \
> >       -c max_wal_senders=10 \
> >       -c max_wal_size=100GB \
> >       -c checkpoint_timeout=30s
> >
> >Using a fresh cluster each time (copied from a "template" to save time)
> >and using
> >pgbench -M prepared -c 16 -j 16 -T 300 -P 1

So, I've analyzed the problem further, and I think I found something
rather interesting. I'd profiled the kernel looking at where it blocks in
the IO request queues, and found that the wal writer was involved
surprisingly often.

So, in a workload where everything (checkpoint, bgwriter, backend
writes) is flushed:                             2995 tps
After I kill the wal writer with -STOP:         10887 tps

Stracing the wal writer shows:

17:29:02.001517 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17857, si_uid=1000} ---
17:29:02.001538 rt_sigreturn({mask=[]}) = 0
17:29:02.001582 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.001615 write(3, "\210\320\5\0\1\0\0\0\0@\330_/\0\0\0w\f\0\0\0\0\0\0\0\4\0\2\t\30\0\372"..., 49152) = 49152
17:29:02.001671 fdatasync(3)            = 0
17:29:02.005022 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17825, si_uid=1000} ---
17:29:02.005043 rt_sigreturn({mask=[]}) = 0
17:29:02.005081 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.005111 write(3, "\210\320\5\0\1\0\0\0\0\0\331_/\0\0\0\7\26\0\0\0\0\0\0T\251\0\0\0\0\0\0"..., 8192) = 8192
17:29:02.005147 fdatasync(3)            = 0
17:29:02.008688 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17866, si_uid=1000} ---
17:29:02.008705 rt_sigreturn({mask=[]}) = 0
17:29:02.008730 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.008757 write(3, "\210\320\5\0\1\0\0\0\0 \331_/\0\0\0\267\30\0\0\0\0\0\0        "..., 98304) = 98304
17:29:02.008822 fdatasync(3)            = 0
17:29:02.016125 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.016141 rt_sigreturn({mask=[]}) = 0
17:29:02.016174 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.016204 write(3, "\210\320\5\0\1\0\0\0\0\240\332_/\0\0\0s\5\0\0\0\0\0\0\t\30\0\2|8\2u"..., 57344) = 57344
17:29:02.016281 fdatasync(3)            = 0
17:29:02.019181 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.019199 rt_sigreturn({mask=[]}) = 0
17:29:02.019226 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
17:29:02.019249 write(3, "\210\320\5\0\1\0\0\0\0\200\333_/\0\0\0\307\f\0\0\0\0\0\0        "..., 73728) = 73728
17:29:02.019355 fdatasync(3)            = 0
17:29:02.022680 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
17:29:02.022696 rt_sigreturn({mask=[]}) = 0

I.e. we're fdatasync()ing small amounts of pages. Roughly 500 times a
second. As soon as the wal writer is stopped, it's much bigger chunks,
on the order of 50-130 pages. And, not that surprisingly, that improves
performance, because there are far fewer cache flushes submitted to the
hardware.


> I'm running some tests similar to those above...

> Do you do some warmup when testing? I guess the answer is "no".

Doesn't make a difference here, I tried both. As long as before/after
benchmarks start from the same state...


> I understand that you have 8 cores/16 threads on your host?

On one of them, 4 cores/8 threads on the laptop.


> Loading scale 800 data for 300 seconds tests takes much more than 300
> seconds (init takes ~360 seconds, vacuum & index are slow). With 30 seconds
> checkpoint cycles and without any warmup, I feel that these tests are really
> on the very short (too short) side, so I'm not sure how much I can trust
> such results as significant. The data I reported were with more real life
> like parameters.

I see exactly the same with 300s or 1000s checkpoint cycles, it just
takes a lot longer to repeat. They're also similar (although obviously
both before/after patch are higher) if I disable full_page_writes,
thereby eliminating a lot of other IO.

Andres



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
<Oops, wrong "From" again, resent>

>>> I measured it in a different number of cases, both on SSDs and spinning
>>> rust. I just reproduced it with:
>>>
>>> postgres-ckpt14 \
>>>       -D /srv/temp/pgdev-dev-800/ \
>>>       -c maintenance_work_mem=2GB \
>>>       -c fsync=on \
>>>       -c synchronous_commit=off \
>>>       -c shared_buffers=2GB \
>>>       -c wal_level=hot_standby \
>>>       -c max_wal_senders=10 \
>>>       -c max_wal_size=100GB \
>>>       -c checkpoint_timeout=30s
>>>
>>> Using a fresh cluster each time (copied from a "template" to save time)
>>> and using
>>> pgbench -M prepared -c 16 -j 16 -T 300 -P 1

I must say that I have not succeeded in reproducing any significant 
regression up to now on an HDD. I'm running some more tests again because 
I had left out some options above that I thought were non-essential.

I have deep problems with the 30-second checkpoint tests: basically the 
checkpoints take much more than 30 seconds to complete, the system is not 
stable, and the 300-second runs last more than 900 seconds because the 
clients are stuck for a long time. The overall behavior is appalling, as 
most of the time is spent in IO panic at 0 tps.

Also, the performance level is around 160 tps on HDDs, which makes sense to 
me for a 7200 rpm HDD capable of about x00 random writes per second. It 
seems to me that you reported much better performance on HDD, but I cannot 
really see how this would be possible if data are indeed written to disk. 
Any idea?

Also, what is the very precise postgres version & patch used in your 
tests on HDDs?

> both before/after patch are higher) if I disable full_page_writes,
> thereby eliminating a lot of other IO.

Maybe this is an explanation....

-- 
Fabien.




Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-19 10:27:31 +0100, Fabien COELHO wrote:
> Also, the performance level is around 160 tps on HDDs, which makes sense to
> me for a 7200 rpm HDD capable of about x00 random writes per second. It
> seems to me that you reported much better performance on HDD, but I cannot
> really see how this would be possible if data are indeed written to disk. Any
> idea?

synchronous_commit = off does make a significant difference.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
> synchronous_commit = off does make a significant difference.

Sure, but I had thought about that and kept this one...

I think I found one possible culprit: I automatically wrote 300 seconds 
for checkpoint_timeout, instead of 30 seconds in your settings. I'll have 
to rerun the tests with this (unreasonable) figure to check whether I 
really get a regression.

Other tests I ran with "reasonable" settings on a large (scale=800) db 
did not show any significant performance regression, up to now.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-19 13:34:14 +0100, Fabien COELHO wrote:
> 
> >synchronous_commit = off does make a significant difference.
> 
> Sure, but I had thought about that and kept this one...

But why are you then saying this is fundamentally limited to 160
xacts/sec?

> I think I found one possible culprit: I automatically wrote 300 seconds for
> checkpoint_timeout, instead of 30 seconds in your settings. I'll have to
> rerun the tests with this (unreasonable) figure to check whether I really
> get a regression.

I've not seen meaningful changes in the size of the regression between 30/300s.

> Other tests I ran with "reasonable" settings on a large (scale=800) db did
> not show any significant performance regression, up to now.

Try running it so that the data set nearly, but not entirely fit into
the OS page cache, while definitely not fitting into shared_buffers. The
scale=800 just worked for that on my hardware, no idea how it's for yours.

That seems to be the point where the effect is the worst.



Re: checkpointer continuous flushing

From
Robert Haas
Date:
On Mon, Jan 18, 2016 at 11:39 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-01-16 10:01:25 +0100, Fabien COELHO wrote:
>> Hello Andres,
>>
>> >I measured it in a different number of cases, both on SSDs and spinning
>> >rust. I just reproduced it with:
>> >
>> >postgres-ckpt14 \
>> >       -D /srv/temp/pgdev-dev-800/ \
>> >       -c maintenance_work_mem=2GB \
>> >       -c fsync=on \
>> >       -c synchronous_commit=off \
>> >       -c shared_buffers=2GB \
>> >       -c wal_level=hot_standby \
>> >       -c max_wal_senders=10 \
>> >       -c max_wal_size=100GB \
>> >       -c checkpoint_timeout=30s
>> >
>> >Using a fresh cluster each time (copied from a "template" to save time)
>> >and using
>> >pgbench -M prepared -c 16 -j 16 -T 300 -P 1
>
> So, I've analyzed the problem further, and I think I found something
> rather interesting. I'd profiled the kernel looking at where it blocks in
> the IO request queues, and found that the wal writer was involved
> surprisingly often.
>
> So, in a workload where everything (checkpoint, bgwriter, backend
> writes) is flushed:                             2995 tps
> After I kill the wal writer with -STOP:         10887 tps
>
> Stracing the wal writer shows:
>
> 17:29:02.001517 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17857, si_uid=1000} ---
> 17:29:02.001538 rt_sigreturn({mask=[]}) = 0
> 17:29:02.001582 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.001615 write(3, "\210\320\5\0\1\0\0\0\0@\330_/\0\0\0w\f\0\0\0\0\0\0\0\4\0\2\t\30\0\372"..., 49152) = 49152
> 17:29:02.001671 fdatasync(3)            = 0
> 17:29:02.005022 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17825, si_uid=1000} ---
> 17:29:02.005043 rt_sigreturn({mask=[]}) = 0
> 17:29:02.005081 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.005111 write(3, "\210\320\5\0\1\0\0\0\0\0\331_/\0\0\0\7\26\0\0\0\0\0\0T\251\0\0\0\0\0\0"..., 8192) = 8192
> 17:29:02.005147 fdatasync(3)            = 0
> 17:29:02.008688 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17866, si_uid=1000} ---
> 17:29:02.008705 rt_sigreturn({mask=[]}) = 0
> 17:29:02.008730 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.008757 write(3, "\210\320\5\0\1\0\0\0\0 \331_/\0\0\0\267\30\0\0\0\0\0\0        "..., 98304) = 98304
> 17:29:02.008822 fdatasync(3)            = 0
> 17:29:02.016125 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
> 17:29:02.016141 rt_sigreturn({mask=[]}) = 0
> 17:29:02.016174 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.016204 write(3, "\210\320\5\0\1\0\0\0\0\240\332_/\0\0\0s\5\0\0\0\0\0\0\t\30\0\2|8\2u"..., 57344) = 57344
> 17:29:02.016281 fdatasync(3)            = 0
> 17:29:02.019181 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
> 17:29:02.019199 rt_sigreturn({mask=[]}) = 0
> 17:29:02.019226 read(8, 0x7ffea6b6b200, 16) = -1 EAGAIN (Resource temporarily unavailable)
> 17:29:02.019249 write(3, "\210\320\5\0\1\0\0\0\0\200\333_/\0\0\0\307\f\0\0\0\0\0\0        "..., 73728) = 73728
> 17:29:02.019355 fdatasync(3)            = 0
> 17:29:02.022680 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=17865, si_uid=1000} ---
> 17:29:02.022696 rt_sigreturn({mask=[]}) = 0
>
> I.e. we're fdatasync()ing small amounts of pages. Roughly 500 times a
> second. As soon as the wal writer is stopped, it's much bigger chunks,
> on the order of 50-130 pages. And, not that surprisingly, that improves
> performance, because there are far fewer cache flushes submitted to the
> hardware.

This seems like a problem with the WAL writer quite independent of
anything else.  It seems likely to be inadvertent fallout from this
patch:

Author: Simon Riggs <simon@2ndQuadrant.com>
Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +0000
    Wakeup WALWriter as needed for asynchronous commit performance.
    Previously we waited for wal_writer_delay before flushing WAL. Now
    we also wake WALWriter as soon as a WAL buffer page has filled.
    Significant effect observed on performance of asynchronous commits
    by Robert Haas, attributed to the ability to set hint bits on tuples
    earlier and so reducing contention caused by clog lookups.
 

If I understand correctly, prior to that commit, WAL writer woke up 5
times per second and flushed just that often (unless you changed the
default settings).    But as the commit message explained, that turned
out to suck - you could make performance go up very significantly by
radically decreasing wal_writer_delay.  This commit basically lets it
flush at maximum velocity - as fast as we finish one flush, we can
start the next.  That must have seemed like a win at the time from the
way the commit message was written, but you seem to now be seeing the
opposite effect, where performance is suffering because flushes are
too frequent rather than too infrequent.  I wonder if there's an ideal
flush rate and what it is, and how much it depends on what hardware
you have got.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>>> synchronous_commit = off does make a significant difference.
>>
>> Sure, but I had thought about that and kept this one...
>
> But why are you then saying this is fundamentally limited to 160
> xacts/sec?

I'm just saying that the tested load generates mostly random IOs (probably 
on average over 1 page per transaction); random IOs are very slow on an 
HDD, so I do not expect great tps.

>> I think I found one possible culprit: I automatically wrote 300 seconds for
>> checkpoint_timeout, instead of 30 seconds in your settings. I'll have to
>> rerun the tests with this (unreasonable) figure to check whether I really
>> get a regression.
>
> I've not seen meaningful changes in the size of the regression between 30/300s.

At 300 seconds (5 minutes), the checkpoints of the accumulated writes take 15-25 
minutes, during which the database is mostly offline, and there is no 
clear difference with/without sort+flush.

>> Other tests I ran with "reasonable" settings on a large (scale=800) db did
>> not show any significant performance regression, up to now.
>
> Try running it so that the data set nearly, but not entirely fit into
> the OS page cache, while definitely not fitting into shared_buffers. The
> scale=800 just worked for that on my hardware, no idea how it's for yours.
> That seems to be the point where the effect is the worst.

I have 16GB memory on the tested host, same as your hardware I think, so I 
use scale 800 => 12GB at the beginning of the run. Not sure it fits the 
bill as I think it fits in memory, so the load is mostly write and no/very 
few reads. I'll also try with scale 1000.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> This seems like a problem with the WAL writer quite independent of
> anything else.  It seems likely to be inadvertent fallout from this
> patch:
> 
> Author: Simon Riggs <simon@2ndQuadrant.com>
> Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +0000
> 
>     Wakeup WALWriter as needed for asynchronous commit performance.
>     Previously we waited for wal_writer_delay before flushing WAL. Now
>     we also wake WALWriter as soon as a WAL buffer page has filled.
>     Significant effect observed on performance of asynchronous commits
>     by Robert Haas, attributed to the ability to set hint bits on tuples
>     earlier and so reducing contention caused by clog lookups.

In addition to that the "powersaving" effort also plays a role - without
the latch we'd not wake up at any meaningful rate at all atm.


> If I understand correctly, prior to that commit, WAL writer woke up 5
> times per second and flushed just that often (unless you changed the
> default settings).    But as the commit message explained, that turned
> out to suck - you could make performance go up very significantly by
> radically decreasing wal_writer_delay.  This commit basically lets it
> flush at maximum velocity - as fast as we finish one flush, we can
> start the next.  That must have seemed like a win at the time from the
> way the commit message was written, but you seem to now be seeing the
> opposite effect, where performance is suffering because flushes are
> too frequent rather than too infrequent.  I wonder if there's an ideal
> flush rate and what it is, and how much it depends on what hardware
> you have got.

I think the problem isn't really that it's flushing too much WAL in
total, it's that it's flushing WAL in a too granular fashion. I suspect
we want something where we attempt a minimum number of flushes per
second (presumably tied to wal_writer_delay) and, once exceeded, a
minimum number of pages per flush. I think we even could continue to
write() the data at the same rate as today, we just would need to reduce
the number of fdatasync()s we issue. And possibly could make the
eventual fdatasync()s cheaper by hinting the kernel to write them out
earlier.

Now the question what the minimum number of pages we want to flush for
(setting wal_writer_delay triggered ones aside) isn't easy to answer. A
simple model would be to statically tie it to the size of wal_buffers;
say, don't flush unless at least 10% of XLogBuffers have been written
since the last flush. More complex approaches would be to measure the
continuous WAL writeout rate.
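
(As a rough sketch of that rule - the constants are illustrative, not
proposed values:)

    #include <stdbool.h>
    #include <stdint.h>

    #define WAL_WRITER_DELAY_MS  200            /* cf. wal_writer_delay */
    #define FLUSH_AFTER_BYTES    (128 * 8192)   /* e.g. ~10% of wal_buffers */

    /* Decide whether the WAL writer should fdatasync() now: always flush at
     * a minimum rate, but between those ticks only once enough WAL has been
     * written since the previous flush. */
    static bool
    should_flush_wal(uint64_t bytes_since_flush, long ms_since_flush)
    {
        if (ms_since_flush >= WAL_WRITER_DELAY_MS)
            return true;
        return bytes_since_flush >= FLUSH_AFTER_BYTES;
    }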

By tying it to both a minimum rate under activity (ensuring things go to
disk fast) and a minimum number of pages to sync (ensuring a reasonable
number of cache flush operations) we should be able to mostly accommodate
the different types of workloads. I think.

Andres



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > This seems like a problem with the WAL writer quite independent of
> > anything else.  It seems likely to be inadvertent fallout from this
> > patch:
> > 
> > Author: Simon Riggs <simon@2ndQuadrant.com>
> > Branch: master Release: REL9_2_BR [4de82f7d7] 2011-11-13 09:00:57 +0000
> > 
> >     Wakeup WALWriter as needed for asynchronous commit performance.
> >     Previously we waited for wal_writer_delay before flushing WAL. Now
> >     we also wake WALWriter as soon as a WAL buffer page has filled.
> >     Significant effect observed on performance of asynchronous commits
> >     by Robert Haas, attributed to the ability to set hint bits on tuples
> >     earlier and so reducing contention caused by clog lookups.
> 
> In addition to that the "powersaving" effort also plays a role - without
> the latch we'd not wake up at any meaningful rate at all atm.

The relevant thread is at
http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
what I didn't remember is that I voiced concern back then about exactly this:
http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
;)

Simon: CCed you, as the author of the above commit. Quick summary:
The frequent wakeups of wal writer can lead to significant performance
regressions in workloads that are bigger than shared_buffers, because
the super-frequent fdatasync()s by the wal writer slow down concurrent
writes (bgwriter, checkpointer, individual backend writes)
dramatically. To the point that SIGSTOPing the wal writer gets a pgbench
workload from 2995 to 10887 tps.  The reason fdatasyncs cause a slowdown
is that they prevent real use of queuing to the storage devices.


On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > If I understand correctly, prior to that commit, WAL writer woke up 5
> > times per second and flushed just that often (unless you changed the
> > default settings).    But as the commit message explained, that turned
> > out to suck - you could make performance go up very significantly by
> > radically decreasing wal_writer_delay.  This commit basically lets it
> > flush at maximum velocity - as fast as we finish one flush, we can
> > start the next.  That must have seemed like a win at the time from the
> > way the commit message was written, but you seem to now be seeing the
> > opposite effect, where performance is suffering because flushes are
> > too frequent rather than too infrequent.  I wonder if there's an ideal
> > flush rate and what it is, and how much it depends on what hardware
> > you have got.
> 
> I think the problem isn't really that it's flushing too much WAL in
> total, it's that it's flushing WAL in a too granular fashion. I suspect
> we want something where we attempt a minimum number of flushes per
> second (presumably tied to wal_writer_delay) and, once exceeded, a
> minimum number of pages per flush. I think we even could continue to
> write() the data at the same rate as today, we just would need to reduce
> the number of fdatasync()s we issue. And possibly could make the
> eventual fdatasync()s cheaper by hinting the kernel to write them out
> earlier.
> 
> Now the question what the minimum number of pages we want to flush for
> (setting wal_writer_delay triggered ones aside) isn't easy to answer. A
> simple model would be to statically tie it to the size of wal_buffers;
> say, don't flush unless at least 10% of XLogBuffers have been written
> since the last flush. More complex approaches would be to measure the
> continuous WAL writeout rate.
> 
> By tying it to both a minimum rate under activity (ensuring things go to
> disk fast) and a minimum number of pages to sync (ensuring a reasonable
> number of cache flush operations) we should be able to mostly accommodate
> the different types of workloads. I think.

This unfortunately leaves out part of the reasoning for the above
commit: We want WAL to be flushed fast, so we immediately can set hint
bits.

One, relatively extreme, approach would be to continue *writing* WAL in
the background writer as today, but use rules like those suggested above
to guide the actual flushing, additionally using operations like
sync_file_range() (and equivalents on other OSs).  Then, to address the
regression of SetHintBits() having to bail out more often, actually
trigger a WAL flush whenever WAL is already written, but not flushed.
That has the potential to be bad in a number of other cases tho :(

Andres



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-20 11:13:26 +0100, Andres Freund wrote:
> On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> > On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > I think the problem isn't really that it's flushing too much WAL in
> > total, it's that it's flushing WAL in a too granular fashion. I suspect
> > we want something where we attempt a minimum number of flushes per
> > second (presumably tied to wal_writer_delay) and, once exceeded, a
> > minimum number of pages per flush. I think we even could continue to
> > write() the data at the same rate as today, we just would need to reduce
> > the number of fdatasync()s we issue. And possibly could make the
> > eventual fdatasync()s cheaper by hinting the kernel to write them out
> > earlier.
> >
> > Now the question what the minimum number of pages we want to flush for
> > (setting wal_writer_delay triggered ones aside) isn't easy to answer. A
> > simple model would be to statically tie it to the size of wal_buffers;
> > say, don't flush unless at least 10% of XLogBuffers have been written
> > since the last flush. More complex approaches would be to measure the
> > continuous WAL writeout rate.
> >
> > By tying it to both a minimum rate under activity (ensuring things go to
> > disk fast) and a minimum number of pages to sync (ensuring a reasonable
> > number of cache flush operations) we should be able to mostly accommodate
> > the different types of workloads. I think.
>
> This unfortunately leaves out part of the reasoning for the above
> commit: We want WAL to be flushed fast, so we immediately can set hint
> bits.
>
> One, relatively extreme, approach would be to continue *writing* WAL in
> the background writer as today, but use rules like those suggested above
> to guide the actual flushing, additionally using operations like
> sync_file_range() (and equivalents on other OSs).  Then, to address the
> regression of SetHintBits() having to bail out more often, actually
> trigger a WAL flush whenever WAL is already written, but not flushed.
> That has the potential to be bad in a number of other cases tho :(

Chatting on IM with Heikki, I noticed that we're pretty pessimistic in
SetHintBits(). Namely we don't set the bit if XLogNeedsFlush(commitLSN),
because we can't easily set the LSN. But, it's actually fairly common
that the pages LSN is already newer than the commitLSN - in which case
we, afaics, just can go ahead and set the hint bit, no?

So, instead of
    if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
        return;            /* not flushed yet, so don't set hint */
we do
    if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN)
        && BufferGetLSNAtomic(buffer) < commitLSN)
        return;            /* not flushed yet, so don't set hint */
 

In my tests with pgbench -s 100, 2GB of shared buffers, that recovers
a large portion of the hint writes that we currently skip.

Right now, on my laptop, I get (-M prepared -c 32 -j 32):
current wal-writer                              12827 tps, 95 % IO util, 93 % CPU
no flushing in wal writer *                     13185 tps, 46 % IO util, 93 % CPU
no flushing in wal writer & above change        16366 tps, 41 % IO util, 95 % CPU
flushing in wal writer & above change:          14812 tps, 94 % IO util, 95 % CPU

* sometimes the results initially were much lower, with lots of lock contention. Can't figure out why that's only
  sometimes the case. In those cases the results were more like 8967 tps.
 

these aren't meant as thorough benchmarks, just to provide some
orientation.


Now that solution won't improve every situation, e.g. for a workload
that inserts a lot of rows in one transaction, and only does inserts, it
probably won't do all that much. But it still seems like a pretty good
mitigation strategy. I hope that with a smarter write strategy (getting
that 50% reduction in IO util) and the above we should be ok.

Andres



Re: checkpointer continuous flushing

From
Alvaro Herrera
Date:
Andres Freund wrote:

> The relevant thread is at
> http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
> what I didn't remember is that I voiced concern back then about exactly this:
> http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
> ;)

Interesting.  If we consider for a minute that part of the cause for the
slowdown is slowness in pg_clog, maybe we should reconsider the initial
decision to flush as quickly as possible (i.e. adopt a strategy where
walwriter sleeps a bit between two flushes) in light of the group-update
feature for CLOG being proposed by Amit Kapila in another thread -- it
seems that these things might go hand-in-hand.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-20 12:16:24 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> 
> > The relevant thread is at
> > http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
> > what I didn't remember is that I voiced concern back then about exactly this:
> > http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
> > ;)
> 
> Interesting.  If we consider for a minute that part of the cause for the
> slowdown is slowness in pg_clog, maybe we should reconsider the initial
> decision to flush as quickly as possible (i.e. adopt a strategy where
> walwriter sleeps a bit between two flushes) in light of the group-update
> feature for CLOG being proposed by Amit Kapila in another thread -- it
> seems that these things might go hand-in-hand.

I don't think it's strongly related - the contention here is on read
access to the clog, not on write access. While Amit's patch will reduce
the impact of that a bit, I don't see it making a fundamental
difference.

Andres



Re: checkpointer continuous flushing

From
Amit Kapila
Date:
On Wed, Jan 20, 2016 at 9:07 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-01-20 12:16:24 -0300, Alvaro Herrera wrote:
> > Andres Freund wrote:
> >
> > > The relevant thread is at
> > > http://archives.postgresql.org/message-id/CA%2BTgmoaCr3kDPafK5ygYDA9mF9zhObGp_13q0XwkEWsScw6h%3Dw%40mail.gmail.com
> > > what I didn't remember is that I voiced concern back then about exactly this:
> > > http://archives.postgresql.org/message-id/201112011518.29964.andres%40anarazel.de
> > > ;)
> >
> > Interesting.  If we consider for a minute that part of the cause for the
> > slowdown is slowness in pg_clog, maybe we should reconsider the initial
> > decision to flush as quickly as possible (i.e. adopt a strategy where
> > walwriter sleeps a bit between two flushes) in light of the group-update
> > feature for CLOG being proposed by Amit Kapila in another thread -- it
> > seems that these things might go hand-in-hand.
>
> I don't think it's strongly related - the contention here is on read
> access to the clog, not on write access.

Aren't reads on clog contended with parallel writes to clog?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-01-21 11:33:15 +0530, Amit Kapila wrote:
> On Wed, Jan 20, 2016 at 9:07 PM, Andres Freund <andres@anarazel.de> wrote:
> > I don't think it's strongly related - the contention here is on read
> > access to the clog, not on write access.
> 
> Aren't reads on clog contended with parallel writes to clog?

Sure. But you're not going to beat "no access to the clog" due to hint
bits, by making parallel writes a bit better citizens.



Re: checkpointer continuous flushing

From
Robert Haas
Date:
On Wed, Jan 20, 2016 at 9:02 AM, Andres Freund <andres@anarazel.de> wrote:
> Chatting on IM with Heikki, I noticed that we're pretty pessimistic in
> SetHintBits(). Namely we don't set the bit if XLogNeedsFlush(commitLSN),
> because we can't easily set the LSN. But, it's actually fairly common
> that the pages LSN is already newer than the commitLSN - in which case
> we, afaics, just can go ahead and set the hint bit, no?
>
> So, instead of
>                 if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
>                         return;                         /* not flushed yet, so don't set hint */
> we do
>                 if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN)
>                         && BufferGetLSNAtomic(buffer) < commitLSN)
>                         return;                         /* not flushed yet, so don't set hint */
>
> In my tests with pgbench -s 100, 2GB of shared buffers, that recovers
> a large portion of the hint writes that we currently skip.

Dang.  That's a really good idea.  Although I think you'd probably
better revise the comment, since it will otherwise be false.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: checkpointer continuous flushing

From
Alvaro Herrera
Date:
This patch got its fair share of reviewer attention this commitfest.
Moving to the next one.  Andres, if you want to commit ahead of time
you're of course encouraged to do so.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: checkpointer continuous flushing - V16

From
Andres Freund
Date:
Hi,

Fabien asked me to post a new version of the checkpoint flushing patch
series. While this isn't entirely ready for commit, I think we're
getting closer.

I don't want to post a full series right now, but my working state is
available on
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

The main changes are that:
1) the significant performance regressions I saw are addressed by
   changing the wal writer flushing logic
2) The flushing API moved up a couple layers, and now deals with buffer
   tags, rather than the physical files
3) Writes from checkpoints, bgwriter and files are flushed, configurable
   by individual GUCs. Without that I still saw the spikes in a lot of
   circumstances.
 

There's also a more experimental reimplementation of bgwriter, but I'm
not sure it's realistic to polish that up within the constraints of 9.6.

Regards,

Andres 



Re: checkpointer continuous flushing - V16

From
Andres Freund
Date:
Hi Fabien,

On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
> 
> The main changes are that:
> 1) the significant performance regressions I saw are addressed by
>    changing the wal writer flushing logic
> 2) The flushing API moved up a couple layers, and now deals with buffer
>    tags, rather than the physical files
> 3) Writes from checkpoints, bgwriter and files are flushed, configurable
>    by individual GUCs. Without that I still saw the spikes in a lot of circumstances.
> 
> There's also a more experimental reimplementation of bgwriter, but I'm
> not sure it's realistic to polish that up within the constraints of 9.6.

Any comments before I spend more time polishing this? I'm currently
updating docs and comments to actually describe the current state...

Andres



Re: checkpointer continuous flushing - V16

From
Fabien COELHO
Date:
Hello Andres,

> Any comments before I spend more time polishing this?

I'm running tests on various settings, I'll send a report when it is done.
Up to now the performance seems as good as with the previous version.

> I'm currently updating docs and comments to actually describe the 
> current state...

I did notice the mismatched documentation.

I think I would appreciate comments to understand why/how the ringbuffer 
is used, and more comments in general, so it is fine if you improve this 
part.

Minor details:

"typedefs.list" should be updated to WritebackContext.

"WritebackContext" is a typedef, "struct" is not needed.


I'll look at the code more deeply probably over next weekend.

-- 
Fabien.



Re: checkpointer continuous flushing - V16

From
Andres Freund
Date:
On 2016-02-08 19:52:30 +0100, Fabien COELHO wrote:
> I think I would appreciate comments to understand why/how the ringbuffer is
> used, and more comments in general, so it is fine if you improve this part.

I'd suggest to leave out the ringbuffer/new bgwriter parts. I think
they'd be committed separately, and probably not in 9.6.

Thanks,

Andres



Re: checkpointer continuous flushing - V16

From
Fabien COELHO
Date:
>> I think I would appreciate comments to understand why/how the 
>> ringbuffer is used, and more comments in general, so it is fine if you 
>> improve this part.
>
> I'd suggest to leave out the ringbuffer/new bgwriter parts.

Ok, so the patch would only include the checkpointer stuff.

I'll look at this part in detail.

-- 
Fabien.



Re: checkpointer continuous flushing - V16

From
Andres Freund
Date:
On February 9, 2016 10:46:34 AM GMT+01:00, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>>> I think I would appreciate comments to understand why/how the 
>>> ringbuffer is used, and more comments in general, so it is fine if
>you 
>>> improve this part.
>>
>> I'd suggest to leave out the ringbuffer/new bgwriter parts.
>
>Ok, so the patch would only include the checkpointer stuff.
>
>I'll look at this part in detail.

Yes, that's the more pressing part. I've seen pretty good results with the new bgwriter, but it's not really worthwhile
until sorting and flushing are in...
 

Andres 

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Re: checkpointer continuous flushing - V16

From
Andres Freund
Date:
On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
> Fabien asked me to post a new version of the checkpoint flushing patch
> series. While this isn't entirely ready for commit, I think we're
> getting closer.
>
> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

The first two commits of the series are pretty close to being ready. I'd
welcome review of those, and I plan to commit them independently of the
rest as they're beneficial independently.  The most important bits are
the comments and docs of 0002 - they weren't particularly good
beforehand, so I had to rewrite a fair bit.

0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
      potential regressions of 0002
0002: Fix the overaggressive flushing by the wal writer, by only
      flushing every wal_writer_delay ms or wal_writer_flush_after
      bytes.

Greetings,

Andres Freund


Re: checkpointer continuous flushing - V16

From
Robert Haas
Date:
On Thu, Feb 11, 2016 at 1:44 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
>> Fabien asked me to post a new version of the checkpoint flushing patch
>> series. While this isn't entirely ready for commit, I think we're
>> getting closer.
>>
>> I don't want to post a full series right now, but my working state is
>> available on
>> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
>> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush
>
> The first two commits of the series are pretty close to being ready. I'd
> welcome review of those, and I plan to commit them independently of the
> rest as they're beneficial independently.  The most important bits are
> the comments and docs of 0002 - they weren't particularly good
> beforehand, so I had to rewrite a fair bit.
>
> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
>       potential regressions of 0002
> 0002: Fix the overaggressive flushing by the wal writer, by only
>       flushing every wal_writer_delay ms or wal_writer_flush_after
>       bytes.

I previously reviewed 0001 and I think it's fine.  I haven't reviewed
0002 in detail, but I like the concept.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: checkpointer continuous flushing - V16

From
Fabien COELHO
Date:
Hello Andres,

> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
>      potential regressions of 0002
> 0002: Fix the overaggressive flushing by the wal writer, by only
>      flushing every wal_writer_delay ms or wal_writer_flush_after
>      bytes.

I've looked at these patches, especially the whole bunch of explanations 
and comments, which is a good source for understanding what is going on in 
the WAL writer, a part of pg I'm not familiar with.

When reading the patch 0002 explanations, I had the following comments:

AFAICS, there are several levels of actions when writing things in pg:
 0: the thing is written in some internal buffer
 1: the buffer is advised to be passed to the OS (hint bits?)
 2: the buffer is actually passed to the OS (write, flush)
 3: the OS is advised to send the written data to the io subsystem
    (sync_file_range with SYNC_FILE_RANGE_WRITE)
 4: the OS is required to send the written data to the disk
    (fsync, sync_file_range with SYNC_FILE_RANGE_WAIT_AFTER)
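
(For concreteness, a minimal sketch of the Linux calls behind levels 3 and 4
- illustrative only, error handling omitted:)

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Level 3: advise the OS to start writeback of a range, without
     * waiting for it to reach the disk (Linux-specific). */
    static void
    advise_writeback(int fd, off_t offset, off_t nbytes)
    {
        sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    }

    /* Level 4: require the written data to be durably on disk before
     * returning; fsync() also covers the file's metadata. */
    static void
    force_to_disk(int fd)
    {
        fsync(fd);
    }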

It is not clear when reading the text which level is discussed. In 
particular, I'm not sure that "flush" refers to level 2, which is 
misleading. When reading the description, I'm rather under the impression 
that it is about level 4, but then if actual fsyncs are performed every 200 
ms then the tps would be very low...

After more considerations, my final understanding is that this behavior 
only occurs with "asynchronous commit", aka a situation when COMMIT does 
not wait for data to be really fsynced, but the fsync is to occur within 
some delay so it will not be too far away, some kind of compromise for 
performance where commits can be lost.

Now all this is somehow alien to me because the whole point of committing 
is having the data on disk, and I would not consider a database to be safe 
if commit does not imply fsync, but I understand that people may have to 
compromise for performance.

Is my understanding right?

-- 
Fabien.



Re: checkpointer continuous flushing - V16

From
Andres Freund
Date:
On 2016-02-11 19:44:25 +0100, Andres Freund wrote:
> The first two commits of the series are pretty close to being ready. I'd
> welcome review of those, and I plan to commit them independently of the
> rest as they're beneficial independently.  The most important bits are
> the comments and docs of 0002 - they weren't particularly good
> beforehand, so I had to rewrite a fair bit.
> 
> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
>       potential regressions of 0002
> 0002: Fix the overaggressive flushing by the wal writer, by only
>       flushing every wal_writer_delay ms or wal_writer_flush_after
>       bytes.

I've pushed these after some more polishing, now working on the next
two.

Greetings,

Andres Freund



Re: checkpointer continuous flushing - V16

From
Andres Freund
Date:
On 2016-02-18 09:51:20 +0100, Fabien COELHO wrote:
> I've looked at these patches, especially the whole bench of explanations and
> comments which is a good source for understanding what is going on in the
> WAL writer, a part of pg I'm not familiar with.
> 
> When reading the patch 0002 explanations, I had the following comments:
> 
> AFAICS, there are several levels of actions when writing things in pg:
> 
>  0: the thing is written in some internal buffer
> 
>  1: the buffer is advised to be passed to the OS (hint bits?)

Hint bits aren't related to OS writes. They're about information like
'this transaction committed' or 'all tuples on this page are visible'.


>  2: the buffer is actually passed to the OS (write, flush)
> 
>  3: the OS is advised to send the written data to the io subsystem
>     (sync_file_range with SYNC_FILE_RANGE_WRITE)
> 
>  4: the OS is required to send the written data to the disk
>     (fsync, sync_file_range with SYNC_FILE_RANGE_WAIT_AFTER)

We can't easily rely on sync_file_range(SYNC_FILE_RANGE_WAIT_AFTER) -
the guarantees it gives aren't well defined, and actually changed across
releases.


0002 is about something different, it's about the WAL writer. Which
writes WAL to disk, so individual backends don't have to. It does so in
the background every wal_writer_delay or whenever a transaction
asynchronously commits.  The reason this interacts with checkpoint
flushing is that, when we flush writes on a regular pace, the writes by
the checkpointer happen in between the very frequent writes/fdatasync()
by the WAL writer. That means the disk's caches are flushed every
fdatasync() - which causes considerable slowdowns.  On a decent SSD the
WAL writer, before this patch, often did 500-1000 fdatasync()s a second;
the regular sync_file_range calls slowed down things too much.
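
Schematically, and with made-up names rather than the actual code, the
pacing fix in 0002 amounts to something like:

  /* schematic only; every name here is made up, this is not the WAL
   * writer code, just the shape of the pacing logic */
  for (;;)
  {
      if (bytes_of_unflushed_wal() >= wal_writer_flush_after_bytes ||
          millisecs_since(last_flush) >= wal_writer_delay)
      {
          flush_wal_to_disk();             /* the fdatasync() mentioned above */
          last_flush = current_time();
      }
      wait_for_wakeup_or_timeout(wal_writer_delay);  /* async commits wake us */
  }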

That's what caused the large regression when using checkpoint
sorting/flushing with synchronous_commit=off. With that fixed - often a
performance improvement on its own - I don't see that regression anymore.


> After more considerations, my final understanding is that this behavior only
> occurs with "asynchronous commit", aka a situation when COMMIT does not wait
> for data to be really fsynced, but the fsync is to occur within some delay
> so it will not be too far away, some kind of compromise for performance
> where commits can be lost.

Right.


> Now all this is somehow alien to me because the whole point of committing is
> having the data to disk, and I would not consider a database to be safe if
> commit does not imply fsync, but I understand that people may have to
> compromise for performance.

It's obviously not applicable for every scenario, but in a *lot* of
real-world scenarios a sub-second loss window doesn't have any actual
negative implications.


Andres



Re: checkpointer continuous flushing - V16

From
Fabien COELHO
Date:
Hello Andres,

> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

Below the results of a lot of tests with pgbench to exercise checkpoints 
on the above version when fetched.

Overall comments:
 - sorting & flushing is basically always a winner
 - benchmarking with short runs on large databases is a bad idea:
   the results are very different if a longer run is used
   (see andres00b vs andres00c)

# HOST/SOFT
 16 GB 2 cpu 8 cores
 200 GB RAID1 HDD, ext4 FS
 Ubuntu 12.04 LTS (precise)

# ABOUT THE REPORTED STATISTICS
 tps: is the "excluding connection" time tps, the higher the better
 1-sec tps: average of measured per-second tps
   note - it should be the same as the previous one, but due to various
   hazards in the trace, especially when things go badly and pg gets
   stuck, it may be different. Such hazards also explain why there may
   be some non-integer tps reported for some seconds.
 stddev: standard deviation, the lower the better
 the five figures in brackets give a feel of the distribution:
 - min: minimal per-second tps seen in the trace
 - q1: first quarter per-second tps seen in the trace
 - med: median per-second tps seen in the trace
 - q3: third quarter per-second tps seen in the trace
 - max: maximal per-second tps seen in the trace
 the last percentage, dubbed "<=10.0", is the percent of seconds where
 performance is below 10 tps: this measures how unresponsive pg was
 during the run
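
As an illustrative example (numbers made up): in a 4000 s run where 1880 of
the per-second samples are below 10 tps, the run would be reported as
1880/4000 = 47% "<=10.0", i.e. pg was effectively unresponsive for nearly
half of the run.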
 

###### TINY2
 pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4   with scale = 10 (~ 200 MB)
 postgresql.conf:
   shared_buffers = 1GB
   max_wal_size = 1GB
   checkpoint_timeout = 300s
   checkpoint_completion_target = 0.8
   checkpoint_flush_after = { none, 0, 32, 64 }

 opts # |   tps / 1-sec tps ± stddev [ min q1 med q2 max ] <=10.0

 head 0 | 2574.1 / 2574.3 ± 367.4 [229.0, 2570.1, 2721.9, 2746.1, 2857.2] 0.0%
      1 | 2575.0 / 2575.1 ± 359.3 [  1.0, 2595.9, 2712.0, 2732.0, 2847.0] 0.1%
      2 | 2602.6 / 2602.7 ± 359.5 [ 54.0, 2607.1, 2735.1, 2768.1, 2908.0] 0.0%

    0 0 | 2583.2 / 2583.7 ± 296.4 [164.0, 2580.0, 2690.0, 2717.1, 2833.8] 0.0%
      1 | 2596.6 / 2596.9 ± 307.4 [296.0, 2590.5, 2707.9, 2738.0, 2847.8] 0.0%
      2 | 2604.8 / 2605.0 ± 300.5 [110.9, 2619.1, 2712.4, 2738.1, 2849.1] 0.0%

   32 0 | 2625.5 / 2625.5 ± 250.5 [  1.0, 2645.9, 2692.0, 2719.9, 2839.0] 0.1%
      1 | 2630.2 / 2630.2 ± 243.1 [301.8, 2654.9, 2697.2, 2726.0, 2837.4] 0.0%
      2 | 2648.3 / 2648.4 ± 236.7 [570.1, 2664.4, 2708.9, 2739.0, 2844.9] 0.0%

   64 0 | 2587.8 / 2587.9 ± 306.1 [ 83.0, 2610.1, 2680.0, 2731.0, 2857.1] 0.0%
      1 | 2591.1 / 2591.1 ± 305.2 [455.9, 2608.9, 2680.2, 2734.1, 2859.0] 0.0%
      2 | 2047.8 / 2046.4 ± 925.8 [  0.0, 1486.2, 2592.6, 2691.1, 3001.0] 0.2% ?
 

Pretty small setup, all data fit in buffers. Good tps performance all around
(best for 32 flushes), and flushing shows a noticeable (360 -> 240) reduction
in tps stddev.

###### SMALL
 pgbench -M prepared -N -P 1 -T 4000 -j 2 -c 4   with scale = 120 (~ 2 GB)
 postgresql.conf:
   shared_buffers = 2GB
   checkpoint_timeout = 300s
   checkpoint_completion_target = 0.8
   checkpoint_flush_after = { none, 0, 32, 64 }

 opts # |   tps / 1-sec tps ± stddev [ min q1 med q2 max ] <=10.0

 head 0 | 209.2 / 204.2 ± 516.5 [0.0,   0.0,   4.0,    5.0, 2251.0] 82.3%
      1 | 207.4 / 204.2 ± 518.7 [0.0,   0.0,   4.0,    5.0, 2245.1] 82.3%
      2 | 217.5 / 211.0 ± 530.3 [0.0,   0.0,   3.0,    5.0, 2255.0] 82.0%
      3 | 217.8 / 213.2 ± 531.7 [0.0,   0.0,   4.0,    6.0, 2261.9] 81.7%
      4 | 230.7 / 223.9 ± 542.7 [0.0,   0.0,   4.0,    7.0, 2282.0] 80.7%

    0 0 | 734.8 / 735.5 ± 879.9 [0.0,   1.0,  16.5, 1748.3, 2281.1] 47.0%
      1 | 694.9 / 693.0 ± 849.0 [0.0,   1.0,  29.5, 1545.7, 2428.0] 46.4%
      2 | 735.3 / 735.5 ± 888.4 [0.0,   0.0,  12.0, 1781.2, 2312.1] 47.9%
      3 | 736.0 / 737.5 ± 887.1 [0.0,   1.0,  16.0, 1794.3, 2317.0] 47.5%
      4 | 734.9 / 735.1 ± 885.1 [0.0,   1.0,  15.5, 1781.0, 2297.1] 47.2%

   32 0 | 738.1 / 737.9 ± 415.8 [0.0, 553.0, 679.0,  753.0, 2312.1]  0.2%
      1 | 730.5 / 730.7 ± 413.2 [0.0, 546.5, 671.0,  744.0, 2319.0]  0.1%
      2 | 741.9 / 741.9 ± 416.5 [0.0, 556.0, 682.0,  756.0, 2331.0]  0.2%
      3 | 744.1 / 744.1 ± 414.4 [0.0, 555.5, 685.2,  758.0, 2285.1]  0.1%
      4 | 746.9 / 746.9 ± 416.6 [0.0, 566.6, 685.0,  759.0, 2308.1]  0.1%

   64 0 | 743.0 / 743.1 ± 416.5 [1.0, 555.0, 683.0,  759.0, 2353.0]  0.1%
      1 | 742.5 / 742.5 ± 415.6 [0.0, 558.2, 680.0,  758.2, 2296.0]  0.1%
      2 | 742.5 / 742.5 ± 415.9 [0.0, 559.0, 681.1,  757.0, 2310.0]  0.1%
      3 | 529.0 / 526.6 ± 410.9 [0.0, 245.0, 444.0,  701.0, 2380.9]  1.5% ??
      4 | 734.8 / 735.0 ± 414.1 [0.0, 550.0, 673.0,  754.0, 2298.0]  0.1%
 

Sorting multiplies tps by about 3.3; flushing significantly reduces tps stddev.
Pg goes from 80% unresponsive to nearly always responsive.

###### MEDIUM
 pgbench:  -M prepared -N -P 1 -T 4000 -j 2 -c 4   with scale = 250 (~ 3.8 GB)
 postgresql.conf:
   shared_buffers = 4GB
   max_wal_size = 4GB
   checkpoint_timeout = 15min
   checkpoint_completion_target = 0.8
   checkpoint_flush_after = { none, 0, 32, 64 }

 opts # |   tps / 1-sec tps ± stddev [ min q1 med q2 max ] <=10.0

 head 0 |  214.8 /  211.8 ± 513.7 [0.0,   1.0,    4.0,    5.0, 2344.0] 82.4%
      1 |  219.2 /  215.0 ± 524.1 [0.0,   0.0,    4.0,    5.0, 2316.0] 82.2%
      2 |  240.9 /  234.6 ± 550.8 [0.0,   0.0,    4.0,    6.0, 2320.2] 81.0%

    0 0 | 1064.7 / 1065.3 ± 888.2 [0.0,  11.0, 1089.0, 2017.7, 2461.9] 24.7%
      1 | 1060.2 / 1061.2 ± 889.9 [0.0,  10.0, 1056.7, 2022.0, 2444.9] 25.1%
      2 | 1060.2 / 1061.4 ± 889.1 [0.0,   9.0, 1085.8, 2002.8, 2473.0] 25.6%

   32 0 | 1059.4 / 1059.4 ± 476.3 [3.0, 804.9,  980.0, 1123.0, 2448.1]  0.1%
      1 | 1062.5 / 1062.6 ± 475.6 [0.0, 807.0,  988.0, 1132.0, 2441.0]  0.1%
      2 | 1063.7 / 1063.7 ± 475.4 [0.0, 814.0,  987.0, 1131.2, 2432.1]  0.1%

   64 0 | 1052.6 / 1052.6 ± 475.3 [0.0, 793.0,  974.0, 1118.1, 2445.1]  0.1%
      1 | 1059.8 / 1059.8 ± 475.1 [0.0, 799.0,  987.5, 1131.0, 2457.1]  0.1%
      2 | 1058.5 / 1058.5 ± 472.8 [0.0, 807.0,  985.0, 1127.7, 2442.0]  0.1%
 

Sorting multiplies tps by about 4.8; flushing significantly reduces tps stddev.
Pg goes from over 80% unresponsive to nearly always responsive.

Performance is significantly better than "small" above, probably thanks to
the longer checkpoint timeout.


###### LARGE
 pgbench -M prepared -N -P 1 -T 7500 -j 2 -c 4   with scale = 1000 (~ 15 GB)
 postgresql.conf:
   shared_buffers = 4GB
   max_wal_size = 2GB
   checkpoint_timeout = 40min
   checkpoint_completion_target = 0.8
   checkpoint_flush_after = { none, 0, 32, 64 }

 opts # |   tps / 1-sec tps ± stddev [ min q1 med q2 max ] <=10.0

 head 0 |  68.7 /  65.3 ± 78.6 [0.0,  3.0,   6.0, 136.0, 291.0] 53.1%
      1 |  70.6 /  70.3 ± 80.1 [0.0,  4.0,  10.0, 151.0, 282.0] 50.1%
      2 |  74.3 /  75.8 ± 84.9 [0.0,  4.0,   9.0, 162.0, 311.2] 50.3%

    0 0 | 117.2 / 116.9 ± 83.8 [0.0, 14.0, 139.0, 193.0, 372.4] 24.0%
      1 | 117.3 / 117.8 ± 83.8 [0.0, 16.0, 140.0, 193.0, 279.0] 23.9%
      2 | 117.6 / 118.2 ± 84.1 [0.0, 16.0, 141.0, 194.0, 297.8] 23.7%

   32 0 | 114.2 / 114.2 ± 45.7 [0.0, 84.0, 100.0, 131.0, 613.6]  0.4%
      1 | 112.5 / 112.6 ± 44.0 [0.0, 83.0,  98.0, 130.0, 293.0]  0.2%
      2 | 108.0 / 108.0 ± 44.7 [0.0, 79.0,  94.0, 124.0, 303.6]  0.3%

   64 0 | 113.0 / 113.0 ± 45.5 [0.0, 83.0,  99.0, 131.0, 289.0]  0.4%
      1 |  80.0 /  80.3 ± 39.1 [0.0, 56.0,  72.0,  95.0, 281.0]  0.8% ??
      2 | 112.2 / 112.3 ± 44.5 [0.0, 82.0,  99.0, 129.0, 282.0]  0.3%
 

Data do not fit in the available memory, so plenty of read accesses.
Sorting still has some impact on tps performance (about 1.6x), and flushing
greatly improves responsiveness.


###### ANDRES00
 pgbench -M prepared -N -P 1 -T 300 -c 16 -j 16   with scale = 800 (~ 13 GB)
 postgresql.conf:
   shared_buffers = 2GB
   max_wal_size = 100GB
   wal_level = hot_standby
   maintenance_work_mem = 2GB
   checkpoint_timeout = 30s
   checkpoint_completion_target = 0.8
   synchronous_commit = off
   checkpoint_flush_after = { none, 0, 32, 64 }

 opts # |   tps / 1-sec tps ± stddev [ min q1 med q2 max ] <=10.0

 head 0 | 328.7 / 329.9 ± 716.9 [0.0, 0.0,   0.0,    0.0, 3221.2] 77.7%
      1 | 338.2 / 338.7 ± 728.6 [0.0, 0.0,   0.0,   17.0, 3296.3] 75.0%
      2 | 304.5 / 304.3 ± 705.5 [0.0, 0.0,   0.0,    0.0, 3463.4] 79.3%

    0 0 | 425.6 / 464.0 ± 724.0 [0.0, 0.0,   0.0, 1000.6, 3363.7] 61.0%
      1 | 461.5 / 463.1 ± 735.8 [0.0, 0.0,   0.0, 1011.2, 3490.9] 58.7%
      2 | 452.4 / 452.6 ± 744.3 [0.0, 0.0,   0.0, 1078.9, 3631.9] 63.3%

   32 0 | 514.4 / 515.8 ± 651.8 [0.0, 0.0, 337.4,  808.3, 2876.0] 40.7%
      1 | 512.0 / 514.6 ± 661.6 [0.0, 0.0, 317.6,  690.8, 3315.8] 35.0%
      2 | 529.5 / 530.3 ± 673.0 [0.0, 0.0, 321.1,  906.4, 3360.8] 40.3%

   64 0 | 529.6 / 530.9 ± 668.2 [0.0, 0.0, 322.1,  786.1, 3538.0] 33.3%
      1 | 496.4 / 498.0 ± 606.6 [0.0, 0.0, 321.4,  746.0, 2629.6] 36.3%
      2 | 521.0 / 521.7 ± 657.0 [0.0, 0.0, 328.4,  737.9, 3262.9] 34.3%
 

Data just about holds in memory, maybe. The run is very short and the settings
are low; this is not representative of a sane installation, but is meant to
exercise a lot of checkpoints in a difficult situation. Sorting and flushing
do bring significant benefits.


###### ANDRES00b (same as ANDRES00 but scale 800->1000)
 pgbench -M prepared -N -P 1 -T 300 -c 16 -j 16   with scale = 1000 (~ 15 GB)
 postgresql.conf:
   shared_buffers = 2GB
   max_wal_size = 100GB
   wal_level = hot_standby
   maintenance_work_mem = 2GB
   checkpoint_timeout = 30s
   checkpoint_completion_target = 0.8
   synchronous_commit = off
   checkpoint_flush_after = { none, 0, 32, 64 }

 opts # |   tps / 1-sec tps ± stddev [ min q1 med q2 max ] <=10.0

 head 0 | 150.2 / 150.3 ± 401.6 [0.0,   0.0,   0.0,   0.0, 2199.4] 75.1%
      1 | 139.2 / 139.2 ± 372.2 [0.0,   0.0,   0.0,   0.0, 2111.4] 78.3% ***
      2 | 127.3 / 127.1 ± 341.2 [0.0,   0.0,   0.0,  53.0, 2144.3] 74.7% ***

    0 0 | 199.0 / 209.2 ± 400.4 [0.0,   0.0,   0.0, 243.6, 1846.0] 65.7%
      1 | 220.4 / 226.7 ± 423.2 [0.0,   0.0,   0.0, 264.0, 1777.0] 63.5% *
      2 | 195.5 / 205.3 ± 337.9 [0.0,   0.0, 123.0, 212.0, 1721.9] 43.2%

   32 0 | 362.3 / 359.0 ± 308.4 [0.0, 200.0, 265.0, 416.4, 1816.6]  5.0%
      1 | 323.6 / 321.2 ± 327.1 [0.0, 142.9, 210.0, 353.4, 1907.0]  4.0%
      2 | 309.0 / 310.7 ± 381.3 [0.0, 122.0, 175.5, 298.0, 2090.4]  5.0%

   64 0 | 342.7 / 343.6 ± 331.1 [0.0, 143.0, 239.5, 409.9, 1623.6]  5.3%
      1 | 333.8 / 328.2 ± 356.3 [0.0, 132.9, 211.5, 358.1, 1629.1] 10.7% ??
      2 | 352.0 / 352.0 ± 332.3 [0.0, 163.5, 239.9, 400.1, 1643.4]  5.3%
 

A little bit larger than the previous one, so that it does not really fit in
memory. The performance impact is significant compared to the previous run.
Sorting and flushing bring about a 2x tps improvement, and unresponsiveness
drops from 75% to a much better 5%.

###### ANDRES00c (same as ANDRES00b but time 300 -> 4000)
 opts # |   tps / 1-sec tps ± stddev [ min q1 med q2 max ] <=10.0
 head 0 | 115.2 / 114.3 ± 256.4 [0.0,   0.0,  75.0, 131.1, 3389.0] 46.5%
      1 | 118.4 / 117.9 ± 248.3 [0.0,   0.0,  87.0, 151.0, 3603.6] 46.7%
      2 | 120.1 / 119.2 ± 254.4 [0.0,   0.0,  91.0, 143.0, 3307.8] 43.8%

    0 0 | 217.4 / 211.0 ± 237.1 [0.0, 139.0, 193.0, 239.0, 3115.4] 16.8%
      1 | 216.2 / 209.6 ± 244.9 [0.0, 138.9, 188.0, 231.0, 3331.3] 16.3%
      2 | 218.6 / 213.8 ± 246.7 [0.0, 137.0, 187.0, 232.0, 3229.6] 16.2%

   32 0 | 146.6 / 142.5 ± 234.5 [0.0,  59.0,  93.0, 151.1, 3294.7] 17.5%
      1 | 148.0 / 142.6 ± 239.2 [0.0,  64.0,  95.9, 144.0, 3361.8] 16.0%
      2 | 147.6 / 140.4 ± 233.2 [0.0,  59.4,  94.0, 148.0, 3108.4] 18.0%

   64 0 | 145.3 / 140.5 ± 233.6 [0.0,  61.0,  93.0, 147.7, 3212.6] 16.5%
      1 | 145.6 / 140.3 ± 233.3 [0.0,  58.0,  93.0, 146.0, 3351.8] 17.3%
      2 | 147.7 / 142.2 ± 233.2 [0.0,  61.0,  97.0, 148.4, 3616.3] 17.0%
 

The only difference between ANDRES00B and ANDRES00C is the duration, from
5 minutes to 66 minutes. This shows that short runs can be widely misleading:
in particular the longer run shows less than half the tps for some settings,
and the relative comparison of head vs sort vs sort+flush is different.

###### ANDRES00d (same as ANDRES00b but wal_level hot_standby->minimal)
 opts # |   tps / 1-sec tps ± stddev [ min q1 med q2 max ] <=10.0
 head 0 | 191.6 / 195.1 ± 439.3 [0.0,   0.0,   0.0,   0.0, 2540.2] 76.3%
      1 | 211.3 / 213.6 ± 461.9 [0.0,   0.0,   0.0,  13.0, 3203.7] 75.0%
      2 | 152.4 / 154.9 ± 217.6 [0.0,   0.0,  58.0, 235.6,  995.9] 39.3% ???

    0 0 | 247.2 / 251.7 ± 454.0 [0.0,   0.0,   0.0, 375.3, 2592.4] 67.7%
      1 | 215.4 / 232.7 ± 446.5 [0.0,   0.0,   0.0, 103.0, 3046.7] 72.3%
      2 | 160.6 / 160.8 ± 222.1 [0.0,   0.0,  80.0, 209.6,  885.3] 42.0% ???

   32 0 | 399.9 / 397.0 ± 356.6 [0.0,  67.0, 348.0, 572.8, 2604.2] 21.0%
      1 | 391.8 / 392.5 ± 371.7 [0.0,  85.5, 314.4, 549.3, 2590.3] 20.7%
      2 | 406.1 / 404.8 ± 380.6 [0.0,  95.0, 348.5, 569.0, 3383.7] 21.3%

   64 0 | 395.9 / 396.1 ± 352.4 [0.0,  89.5, 342.5, 556.0, 2366.9] 17.7%
      1 | 355.1 / 351.9 ± 296.7 [0.0, 172.5, 306.1, 468.1, 1663.5] 16.0%
      2 | 403.6 / 401.8 ± 390.5 [0.0,   0.0, 337.0, 636.1, 2591.3] 26.7% ???
 

###### ANDRES00e (same as ANDRES00b but maintenance_work_mem=2GB->64MB)
 opts # |   tps / 1-sec tps ± stddev [ min q1 med q2 max ] <=10.0
 head 0 | 153.5 / 161.3 ± 401.3 [0.0,   0.0,   0.0,   0.0, 2546.0] 82.0%
      1 | 170.7 / 175.9 ± 399.9 [0.0,   0.0,   0.0,  14.0, 2537.4] 74.7%
      2 | 184.7 / 190.4 ± 389.2 [0.0,   0.0,   0.0, 158.5, 2544.6] 69.3%

    0 0 | 211.2 / 227.8 ± 418.8 [0.0,   0.0,   0.0, 334.6, 2589.3] 65.7%
      1 | 221.7 / 226.0 ± 415.7 [0.0,   0.0,   0.0, 276.8, 2588.2] 68.4%
      2 | 232.5 / 233.2 ± 403.5 [0.0,   0.0,   0.0, 377.0, 2260.2] 62.0%

   32 0 | 373.2 / 374.4 ± 309.2 [0.0, 180.6, 321.8, 475.2, 2596.5] 11.3%
      1 | 348.7 / 348.1 ± 328.4 [0.0, 127.0, 284.1, 451.9, 2595.1] 17.3%
      2 | 376.3 / 375.3 ± 315.5 [0.0, 186.5, 329.6, 487.1, 2365.4] 15.3%

   64 0 | 388.9 / 387.8 ± 348.7 [0.0, 164.0, 305.9, 546.5, 2587.2] 15.0%
      1 | 380.3 / 378.7 ± 338.8 [0.0, 171.1, 317.4, 524.8, 2592.4] 16.7%
      2 | 369.8 / 367.4 ± 340.5 [0.0,  77.4, 320.6, 525.5, 2484.7] 20.7%
 

Hmmm, interesting: maintenance_work_mem seems to have some influence on
performance, although it is not too consistent between settings, probably
because as the memory is used to its limit the performance is quite
sensitive to the available memory.

-- 
Fabien.

Re: checkpointer continuous flushing - V16

From
Andres Freund
Date:
Hi,

On 2016-02-19 10:16:41 +0100, Fabien COELHO wrote:
> Below the results of a lot of tests with pgbench to exercise checkpoints on
> the above version when fetched.

Wow, that's a great test series.


> Overall comments:
>  - sorting & flushing is basically always a winner
>  - benchmarking with short runs on large databases is a bad idea
>    the results are very different if a longer run is used
>    (see andres00b vs andres00c)

Based on these results I think 32 will be a good default for
checkpoint_flush_after? There's a few cases where 64 showed to be
beneficial, and some where 32 is better. I've seen 64 perform a bit
better in some cases here, but the differences were not too big.

I gather that you didn't play with
backend_flush_after/bgwriter_flush_after, i.e. you left them at their
default values? Especially backend_flush_after can have a significant
positive and negative performance impact.


>  16 GB 2 cpu 8 cores
>  200 GB RAID1 HDD, ext4 FS
>  Ubuntu 12.04 LTS (precise)

That's with 12.04's standard kernel?



>  postgresql.conf:
>    shared_buffers = 1GB
>    max_wal_size = 1GB
>    checkpoint_timeout = 300s
>    checkpoint_completion_target = 0.8
>    checkpoint_flush_after = { none, 0, 32, 64 }

Did you re-initdb between the runs?


I've seen massively varying performance differences due to autovacuum
triggered analyzes. It's not completely deterministic when those run,
and on bigger scale clusters analyze can take ages, while holding a
snapshot.


> Hmmm, interesting: maintenance_work_mem seems to have some influence on
> performance, although it is not too consistent between settings, probably
> because as the memory is used to its limit the performance is quite
> sensitive to the available memory.

That's probably because of differing behaviour of autovacuum/vacuum,
which sometime will have to do several scans of the tables if there are
too many dead tuples.


Regards,

Andres



Re: checkpointer continuous flushing - V16

From
Fabien COELHO
Date:
Hello.

> Based on these results I think 32 will be a good default for
> checkpoint_flush_after? There's a few cases where 64 showed to be
> beneficial, and some where 32 is better. I've seen 64 perform a bit
> better in some cases here, but the differences were not too big.

Yes, these many runs show that 32 is basically as good or better than 64.

I'll do some runs with 16/48 to have some more data.

> I gather that you didn't play with 
> backend_flush_after/bgwriter_flush_after, i.e. you left them at their 
> default values? Especially backend_flush_after can have a significant 
> positive and negative performance impact.

Indeed, non reported configuration options have their default values. 
There were also minor changes in the default options for logging (prefix, 
checkpoint, ...), but nothing significant, and always the same for all 
runs.

>>  [...] Ubuntu 12.04 LTS (precise)
>
> That's with 12.04's standard kernel?

Yes.

>>    checkpoint_flush_after = { none, 0, 32, 64 }
>
> Did you re-initdb between the runs?

Yes, all runs are from scratch (initdb, pgbench -i, some warmup...).

> I've seen massively varying performance differences due to autovacuum
> triggered analyzes. It's not completely deterministic when those run,
> and on bigger scale clusters analyze can take ages, while holding a
> snapshot.

Yes, I agree that the performance changes on long vs short runs 
(andres00c vs andres00b) are probably due to autovacuum.

-- 
Fabien.



Re: checkpointer continuous flushing - V16

From
Patric Bechtel
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Fabien,

Fabien COELHO schrieb am 19.02.2016 um 16:04:
> 
>>> [...] Ubuntu 12.04 LTS (precise)
>> 
>> That's with 12.04's standard kernel?
> 
> Yes.

Kernel 3.2 is extremely bad for Postgresql, as the vm seems to amplify IO somehow. The difference
to 3.13 (the latest LTS kernel for 12.04) is huge.


https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu

You might consider upgrading your kernel to 3.13 LTS. It's quite easy normally:

https://wiki.ubuntu.com/Kernel/LTSEnablementStack

/Patric
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
Comment: GnuPT 2.5.2

iEYEARECAAYFAlbHW4AACgkQfGgGu8y7ypC1EACgy8mW6AoaWjKycbuAnCZ3CEPW
Al8AmwfF0smqmDvNsaPkq0dAtop7jP5M
=TxT+
-----END PGP SIGNATURE-----



Re: checkpointer continuous flushing - V16

From
Fabien COELHO
Date:
Hallo Patric,

> Kernel 3.2 is extremely bad for Postgresql, as the vm seems to amplify 
> IO somehow. The difference to 3.13 (the latest LTS kernel for 12.04) is 
> huge.
>
>
https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu

Interesting! To summarize it, 25% performance degradation from best kernel 
(2.6.32) to worst (3.2.0), that is indeed significant.

> You might consider upgrading your kernel to 3.13 LTS. It's quite easy 
> [...]

There are other stuff running on the hardware that I do not wish to touch, 
so upgrading the particular host is currently not an option, otherwise I 
would have switched to trusty.

Thanks for the pointer.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-02-04 16:54:58 +0100, Andres Freund wrote:
> Hi,
>
> Fabien asked me to post a new version of the checkpoint flushing patch
> series. While this isn't entirely ready for commit, I think we're
> getting closer.
>
> I don't want to post a full series right now, but my working state is
> available on
> http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/checkpoint-flush
> git://git.postgresql.org/git/users/andresfreund/postgres.git checkpoint-flush

I've updated the git tree.

Here's the next two (the most important) patches of the series:
0001: Allow to trigger kernel writeback after a configurable number of writes.
0002: Checkpoint sorting and balancing.

For 0001 I've recently changed:
* Don't schedule writeback after smgrextend() - that defeats linux
  delayed allocation mechanism, increasing fragmentation noticeably.
* Add docs for the new GUC variables
* comment polishing
* BackendWritebackContext now isn't dynamically allocated anymore


I think this patch primarily needs:
* review of the docs, not sure if they're easy enough to
  understand. Some language polishing might also be needed.
* review of the writeback API, combined with the smgr/md.c changes.
* Currently *_flush_after can be set to a nonzero value, even if there's
  no support for flushing on that platform. Imo that's ok, but perhaps
  other people's opinion differ.


For 0002 I've recently changed:
* Removed the sort timing information, we've proven sufficiently that
  it doesn't take a lot of time.
* Minor comment polishing.

I think this patch primarily needs:
* Benchmarking on FreeBSD/OSX to see whether we should enable the
  mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
  inclined to leave it off till then.


Regards,

Andres

Attachment

Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
Hello Andres,

> Here's the next two (the most important) patches of the series:
> 0001: Allow to trigger kernel writeback after a configurable number of writes.
> 0002: Checkpoint sorting and balancing.

I will look into these two in depth.

Note that I would have ordered them in reverse because sorting is nearly 
always very beneficial, and "writeback" (formerly called flushing) is then 
nearly always very beneficial on sorted buffers.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-02-19 22:46:44 +0100, Fabien COELHO wrote:
> 
> Hello Andres,
> 
> >Here's the next two (the most important) patches of the series:
> >0001: Allow to trigger kernel writeback after a configurable number of writes.
> >0002: Checkpoint sorting and balancing.
> 
> I will look into these two in depth.
> 
> Note that I would have ordered them in reverse because sorting is nearly
> always very beneficial, and "writeback" (formely called flushing) is then
> nearly always very beneficial on sorted buffers.

I had it that way earlier. I actually saw pretty large regressions from
sorting alone in some cases as well, apparently because the kernel
submits much larger IOs to disk; although that probably only shows on
SSDs.  This way the modifications imo look a trifle better ;). I'm
intending to commit both at the same time, keep them separate only
because they're easier to understand separately.

Andres



Re: checkpointer continuous flushing - V16

From
Michael Paquier
Date:
On Sat, Feb 20, 2016 at 5:08 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>> Kernel 3.2 is extremely bad for Postgresql, as the vm seems to amplify IO
>> somehow. The difference to 3.13 (the latest LTS kernel for 12.04) is huge.
>>
>>
>>
https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4#.6dx44vipu
>
>
> Interesting! To summarize it, 25% performance degradation from best kernel
> (2.6.32) to worst (3.2.0), that is indeed significant.

As far as I recall, the OS cache eviction is very aggressive in 3.2,
so it would be possible that data from the FS cache that was just read
could be evicted even if it was not used yet. This represents a large
difference when the database does not fit in RAM.
-- 
Michael



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
Hello Andres,

> For 0001 I've recently changed:
> * Don't schedule writeback after smgrextend() - that defeats linux
>  delayed allocation mechanism, increasing fragmentation noticeably.
> * Add docs for the new GUC variables
> * comment polishing
> * BackendWritebackContext now isn't dynamically allocated anymore
>
>
> I think this patch primarily needs:
> * review of the docs, not sure if they're easy enough to
>  understand. Some language polishing might also be needed.

Yep, see below.

> * review of the writeback API, combined with the smgr/md.c changes.

See various comments below.

> * Currently *_flush_after can be set to a nonzero value, even if there's
>  no support for flushing on that platform. Imo that's ok, but perhaps
>  other people's opinion differ.

In some previous version I think a warning was shown if the feature was
requested but not available.


Here are some quick comments on the patch:

Patch applies cleanly on head. Compiled and checked on Linux. Compilation 
issues on other systems, see below.

When pages are written by a process (checkpointer, bgwriter, backend worker),
the list of recently written pages is kept and every so often an advisory
fsync (sync_file_range, other options for other systems) is issued so that
the data is sent to the io system without relying on more or less
(un)controllable os policy.
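
To make the mechanism concrete, here is a minimal sketch of the idea
(hypothetical names and a fixed-size array, not the patch's actual API):

  /* Minimal sketch, hypothetical names; not the patch's actual code.
   * sync_file_range() needs _GNU_SOURCE and <fcntl.h> on Linux. */
  typedef struct SketchWritebackContext
  {
      int    npending;
      int    flush_after;                   /* mirrors a *_flush_after GUC */
      struct { int fd; off_t offset; } pending[256];
  } SketchWritebackContext;

  static void
  sketch_remember_write(SketchWritebackContext *ctx, int fd, off_t offset)
  {
      ctx->pending[ctx->npending].fd = fd;
      ctx->pending[ctx->npending].offset = offset;
      ctx->npending++;

      if (ctx->npending >= ctx->flush_after)    /* assumes flush_after <= 256 */
      {
          /* a real implementation would sort and merge neighboring blocks;
           * here one writeback request is issued per remembered page */
          for (int i = 0; i < ctx->npending; i++)
              sync_file_range(ctx->pending[i].fd, ctx->pending[i].offset,
                              8192, SYNC_FILE_RANGE_WRITE);
          ctx->npending = 0;
      }
  }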

The documentation seems to use "flush" but the code talks about "writeback"
or "flush", depending. I think one vocabulary, whichever it is, should be
chosen and everything should stick to it, otherwise everything look kind of
fuzzy and raises doubt for the reader (is it the same thing? is it something
else?). I initially used "flush", but it seems a bad idea because it has
nothing to do with the flush function, so I'm fine with writeback or anything
else, I just think that *one* word should be chosen and used everywhere.

The sgml documentation about "*_flush_after" configuration parameter talks
about bytes, but the actual unit should be buffers. I think that keeping
a number of buffers should be fine, because that is what the internal stuff
will manage, not bytes. Also, the maximum value (128 ?) should appear in
the text. In the discussion in the wal section, I'm not sure about the effect
of setting writebacks on SSD, but I think that you have made some tests so
maybe you have an answer and the corresponding section could be written with
some more definitive text than "probably brings no benefit on SSD".

A good point of the whole approach is that it is available to all kind
of pg processes. However it does not address the point that bgwriter and
backends basically issue random writes, so I would not expect much positive
effect before these writes are somehow sorted, which means doing some
compromise in the LRU/LFU logic... well, all this is best kept for later,
and I'm fine to have the logic flushing logic there. I'm wondering why you
choose 16 & 64 as default for backends & bgwriter, though.

IssuePendingWritebacks: you merge only strictly neighboring writes.
Maybe the merging strategy could be more aggressive than just strict
neighbors?
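
As I read it, "strictly neighboring" after sorting means roughly (sketch
only, with made-up names, not the actual code):

  if (same_file(cur, prev) && cur.blockno == prev.blockno + prev.nblocks)
      prev.nblocks++;                /* extend the pending writeback range */
  else
  {
      issue_writeback(prev);         /* one sync_file_range over the range */
      prev = cur;
  }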

mdwriteback: all variables could be declared within the while, I do not
understand why some are in and some are out. ISTM that putting writeback
management at the relation level does not help a lot, because you have to
translate again from relation to files. The good news is that it should work
as well, and that it does avoid the issue that the file may have been closed
in between, so why not.

The PendingWriteback struct looks useless. I think it should be removed,
and maybe put back if one day if it is needed, which I rather doubt it.

struct WritebackContext: keeping a pointer to guc variables is a kind of
trick, I think it deserves a comment.

ScheduleBufferTagForWriteback: the "pending" variable is not very useful.
Maybe consider shortening the "pending_writebacks" field name to "writebacks"?

IssuePendingWritebacks: I understand that qsort is needed "again"
because when balancing writes over tablespaces they may be intermixed.
AFAICR I used a "flush context" for each table space in some version
I submitted, because I do think that this whole writeback logic really
does make sense *per table space*, which suggest that there should be as
many writeback contexts as table spaces, otherwise the positive effect
may be totally lost if several table spaces are used. Any thoughts?

Assert(*context->max_pending <= WRITEBACK_MAX_PENDING_FLUSHES); is always
true, I think, it is already checked in the initialization and when setting
gucs.

SyncOneBuffer: I wonder why you copy the tag after releasing the lock.
I guess it is okay because it is still pinned.

pg_flush_data: in the first #elif, "context" is undeclared line 446.
Label "out" is not defined line 455. In the second #elif, "context" is
undeclared line 490 and label "out" line 500 is not defined either.

For the checkpointer, a key aspect is that the scheduling process goes
to sleep from time to time, and this sleep time looked like a great
opportunity to do this kind of flushing. You choose not to take advantage
of the behavior, why?

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
Hi,

On 2016-02-20 20:56:31 +0100, Fabien COELHO wrote:
> >* Currently *_flush_after can be set to a nonzero value, even if there's
> > no support for flushing on that platform. Imo that's ok, but perhaps
> > other people's opinion differ.
> 
> In some previous version I think a warning was shown of the feature was
> requested but not available.

I think we should either silently ignore it, or error out. Warnings
somewhere in the background aren't particularly meaningful.

> Here are some quick comments on the patch:
> 
> Patch applies cleanly on head. Compiled and checked on Linux. Compilation
> issues on other systems, see below.

For those I've already pushed a small fixup commit to git... Stupid
mistake.


> The documentation seems to use "flush" but the code talks about "writeback"
> or "flush", depending. I think one vocabulary, whichever it is, should be
> chosen and everything should stick to it, otherwise everything look kind of
> fuzzy and raises doubt for the reader (is it the same thing? is it something
> else?). I initially used "flush", but it seems a bad idea because it has
> nothing to do with the flush function, so I'm fine with writeback or anything
> else, I just think that *one* word should be chosen and used everywhere.

Hm.


> The sgml documentation about "*_flush_after" configuration parameter talks
> about bytes, but the actual unit should be buffers.

The unit actually is buffers, but you can configure it using
bytes. We've done the same for other GUCs (shared_buffers, wal_buffers,
...). Referring to bytes is easier because you don't have to explain that
how much data it actually is depends on compilation settings and
such.
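
(Illustrative arithmetic, assuming the default 8 kB block size: a value
given as 256kB would correspond to 256/8 = 32 buffers.)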

> Also, the maximum value (128 ?) should appear in the text.

Right.


> In the discussion in the wal section, I'm not sure about the effect of
> setting writebacks on SSD, but I think that you have made some tests
> so maybe you have an answer and the corresponding section could be
> written with some more definitive text than "probably brings no
> benefit on SSD".

Yea, that paragraph needs some editing. I think we should basically
remove that last sentence.


> A good point of the whole approach is that it is available to all kind
> of pg processes.

Exactly.


> However it does not address the point that bgwriter and
> backends basically issue random writes, so I would not expect much positive
> effect before these writes are somehow sorted, which means doing some
> compromise in the LRU/LFU logic...

The benefit is primarily that you don't collect large amounts of dirty
buffers in the kernel page cache. In most cases the kernel will not be
able to coalesce these writes either...  I've measured *massive*
performance latency differences for workloads that are bigger than
shared buffers - because suddenly bgwriter / backends do the majority of
the writes. Flushing in the checkpoint quite possibly makes nearly no
difference in such cases.


> well, all this is best kept for later, and I'm fine to have the logic
> flushing logic there. I'm wondering why you choose 16 & 64 as default
> for backends & bgwriter, though.

I chose a small value for backends because there often are a large
number of backends, and thus the amount of dirty data of each adds up. I
used a larger value for bgwriter because I saw that ending up using
bigger IOs.


> IssuePendingWritebacks: you merge only strictly neightboring writes.
> Maybe the merging strategy could be more aggressive than just strict
> neighbors?

I don't think so. If you flush more than neighbouring writes you'll
often end up flushing buffers dirtied by another backend, causing
additional stalls. And if the writes aren't actually neighbouring
there's not much gained from issuing them in one sync_file_range call.


> mdwriteback: all variables could be declared within the while, I do not
> understand why some are in and some are out.

Right.


> ISTM that putting writeback management at the relation level does not
> help a lot, because you have to translate again from relation to
> files.

Sure, but what's the problem with that? That's how normal read/write IO
works as well?


> struct WritebackContext: keeping a pointer to guc variables is a kind of
> trick, I think it deserves a comment.

It has, it's just in WritebackContextInit(). Can duplicate it.


> ScheduleBufferTagForWriteback: the "pending" variable is not very
> useful.

Shortens line length a good bit, at no cost.



> IssuePendingWritebacks: I understand that qsort is needed "again"
> because when balancing writes over tablespaces they may be intermixed.

Also because the infrastructure is used for more than checkpoint
writes. There's absolutely no ordering guarantees there.


> AFAICR I used a "flush context" for each table space in some version
> I submitted, because I do think that this whole writeback logic really
> does make sense *per table space*, which suggest that there should be as
> many write backs contexts as table spaces, otherwise the positive effect
> may going to be totally lost of tables spaces are used. Any thoughts?

Leads to less regular IO, because if your tablespaces are evenly sized
(somewhat common) you'll sometimes end up issuing sync_file_range's
shortly after each other.  For latency outside checkpoints it's
important to control the total amount of dirty buffers, and that's
obviously independent of tablespaces.


> SyncOneBuffer: I'm wonder why you copy the tag after releasing the lock.
> I guess it is okay because it is still pinned.

Don't do things while holding a lock that don't require said lock. A pin
prevents a buffer changing its identity.


> For the checkpointer, a key aspect is that the scheduling process goes
> to sleep from time to time, and this sleep time looked like a great
> opportunity to do this kind of flushing. You choose not to take advantage
> of the behavior, why?

Several reasons: Most importantly there's absolutely no guarantee that
you'll ever end up sleeping, it's quite common to happen only
seldomly. If you're bottlenecked on IO, you can end up being behind all
the time. But even then you don't want to cause massive latency spikes
due to gigabytes of dirty data - a slower checkpoint is a much better
choice.  It'd make the writeback infrastructure less generic.  I also
don't really believe it helps that much, although that's a complex
argument to make.


Thanks for the review!

Andres



Re: checkpointer continuous flushing - V18

From
Robert Haas
Date:
On Sun, Feb 21, 2016 at 3:37 AM, Andres Freund <andres@anarazel.de> wrote:
>> The documentation seems to use "flush" but the code talks about "writeback"
>> or "flush", depending. I think one vocabulary, whichever it is, should be
>> chosen and everything should stick to it, otherwise everything look kind of
>> fuzzy and raises doubt for the reader (is it the same thing? is it something
>> else?). I initially used "flush", but it seems a bad idea because it has
>> nothing to do with the flush function, so I'm fine with writeback or anything
>> else, I just think that *one* word should be chosen and used everywhere.
>
> Hm.

I think there might be a semantic distinction between these two terms.
Doesn't writeback mean writing pages to disk, and flushing mean making
sure that they are durably on disk?  So for example when the Linux
kernel thinks there is too much dirty data, it initiates writeback,
not a flush; on the other hand, at transaction commit, we initiate a
flush, not writeback.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
Hallo Andres,

>> In some previous version I think a warning was shown if the feature was
>> requested but not available.
>
> I think we should either silently ignore it, or error out. Warnings
> somewhere in the background aren't particularly meaningful.

I like "ignoring with a warning" in the log file, because when things do 
not behave as expected that is where I'll be looking. I do not think that 
it should error out.

>> The sgml documentation about "*_flush_after" configuration parameter 
>> talks about bytes, but the actual unit should be buffers.
>
> The unit actually is buffers, but you can configure it using
> bytes. We've done the same for other GUCs (shared_buffers, wal_buffers,
> ...). Refering to bytes is easier because you don't have to explain that
> it depends on compilation settings how many data it actually is and
> such.

So I understand that it works with kb as well. Now I do not think that it 
would need a lot of explanations if you say that it is a number of pages, 
and I think that a number of pages is significant because it is a number 
of IO requests to be coalesced, eventually.

>> In the discussion in the wal section, I'm not sure about the effect of
>> setting writebacks on SSD, [...]
>
> Yea, that paragraph needs some editing. I think we should basically
> remove that last sentence.

Ok, fine with me. Does that mean that flushing has a significant positive 
impact on SSD in your tests?

>> However it does not address the point that bgwriter and backends 
>> basically issue random writes, [...]
>
> The benefit is primarily that you don't collect large amounts of dirty
> buffers in the kernel page cache. In most cases the kernel will not be
> able to coalesce these writes either...  I've measured *massive*
> performance latency differences for workloads that are bigger than
> shared buffers - because suddenly bgwriter / backends do the majority of
> the writes. Flushing in the checkpoint quite possibly makes nearly no
> difference in such cases.

So I understand that there is a positive impact under some load. Good!

>> Maybe the merging strategy could be more aggressive than just strict
>> neighbors?
>
> I don't think so. If you flush more than neighbouring writes you'll
> often end up flushing buffers dirtied by another backend, causing
> additional stalls.

Ok. Maybe the neighbor definition could be relaxed just a little bit so 
that small holes are overtaken, but not large holes? If there are only a few 
pages in between, even if written by another process, then writing them 
together should be better? Well, this can wait for a clear case, because 
hopefully the OS will recoalesce them behind anyway.

>> struct WritebackContext: keeping a pointer to guc variables is a kind of
>> trick, I think it deserves a comment.
>
> It has, it's just in WritebackContextInit(). Can duplicateit.

I missed it, I expected something in the struct definition. Do not 
duplicate, but cross reference it?

>> IssuePendingWritebacks: I understand that qsort is needed "again"
>> because when balancing writes over tablespaces they may be intermixed.
>
> Also because the infrastructure is used for more than checkpoint
> writes. There's absolutely no ordering guarantees there.

Yep, but not much benefit to expect from a few dozen random pages either.

>> [...] I do think that this whole writeback logic really does make sense 
>> *per table space*,
>
> Leads to less regular IO, because if your tablespaces are evenly sized
> (somewhat common) you'll sometimes end up issuing sync_file_range's
> shortly after each other.  For latency outside checkpoints it's
> important to control the total amount of dirty buffers, and that's
> obviously independent of tablespaces.

I do not understand/buy this argument.

The underlying IO queue is per device, and table spaces should be per 
device as well (otherwise what's the point?), so you should want to coalesce 
and "writeback" pages per device as well. Calling sync_file_range on 
distinct devices should probably be issued more or less randomly, and 
should not interfere one with the other.

If you use just one context, the more table spaces the less performance 
gains, because there is less and less aggregation thus sequential writes 
per device.

So for me there should really be one context per tablespace. That would 
suggest a hashtable or some other structure to keep and retrieve them, 
which would not be that bad, and I think that it is what is needed.

>> For the checkpointer, a key aspect is that the scheduling process goes
>> to sleep from time to time, and this sleep time looked like a great
>> opportunity to do this kind of flushing. You choose not to take advantage
>> of the behavior, why?
>
> Several reasons: Most importantly there's absolutely no guarantee that 
> you'll ever end up sleeping, it's quite common to happen only seldomly.

Well, that would be under a situation when pg is completely unresponsive. 
More so, this behavior *makes* pg unresponsive.

> If you're bottlenecked on IO, you can end up being behind all the time.

Hopefully sorting & flushing should improve this situation a lot.

> But even then you don't want to cause massive latency spikes
> due to gigabytes of dirty data - a slower checkpoint is a much better
> choice.  It'd make the writeback infrastructure less generic.

Sure. It would be sufficient to have a call to ask for writebacks 
independently of the number of writebacks accumulated in the queue, it 
does not need to change the infrastructure.

Also, I think that such a call would make sense at the end of the 
checkpoint.

> I also don't really believe it helps that much, although that's a 
> complex argument to make.

Yep. My thinking is that doing things in the sleeping interval does not 
interfere with the checkpointer scheduling, so it is less likely to go 
wrong and falling behind.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
Hallo Andres,

>>> [...] I do think that this whole writeback logic really does make sense 
>>> *per table space*,
>> 
>> Leads to less regular IO, because if your tablespaces are evenly sized
>> (somewhat common) you'll sometimes end up issuing sync_file_range's
>> shortly after each other.  For latency outside checkpoints it's
>> important to control the total amount of dirty buffers, and that's
>> obviously independent of tablespaces.
>
> I do not understand/buy this argument.
>
> The underlying IO queue is per device, and table spaces should be per device 
> as well (otherwise what the point?), so you should want to coalesce and 
> "writeback" pages per device as wel. Calling sync_file_range on distinct 
> devices should probably be issued more or less randomly, and should not 
> interfere one with the other.
>
> If you use just one context, the more table spaces the less performance 
> gains, because there is less and less aggregation thus sequential writes per 
> device.
>
> So for me there should really be one context per tablespace. That would 
> suggest a hashtable or some other structure to keep and retrieve them, which 
> would not be that bad, and I think that it is what is needed.

Note: I think that an easy way to do that in the "checkpoint sort" patch 
is simply to keep a WritebackContext in the CkptTsStatus structure, which is 
per table space in the checkpointer.

For bgwriter & backends it can wait, there is little "writeback" coalescing 
because IO should be pretty random, so it does not matter much.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
Hallo Andres,

Here is a review for the second patch.

> For 0002 I've recently changed:
> * Removed the sort timing information, we've proven sufficiently that
>  it doesn't take a lot of time.

I put it there initially to demonstrate that there was no cache performance 
issue when sorting on just buffer indexes. As it is always small, I agree 
that it is not needed. Well, it could still be in seconds on a very 
large shared buffers setting with a very large checkpoint, but then the 
checkpoint would be tremendously huge...

> * Minor comment polishing.

Patch applies and checks on Linux.

* CkptSortItem:

I think that allocating 20 bytes per buffer in shared memory is a little 
on the heavy side. Some compression can be achieved: sizeof(ForkNumber) is 4 
bytes to hold 4 values, and could be one byte or even 2 bits somewhere. Also, 
there are very few tablespaces; they could be given a small number and 
this number could be used instead of the Oid, so the space requirement 
could be reduced to, say, 16 bytes per buffer by combining space & fork in 2 
shorts, keeping 4-byte alignment and also getting 8-byte 
alignment... If this is too much, I have shown that it can work with only 
4 bytes per buffer, as the sorting is really just a performance 
optimisation and is not broken if some stuff changes between sorting & 
writeback, but you did not like the idea. If the amount of shared memory 
required is a significant concern, it could be resurrected, though.
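
For concreteness, my reading of the current layout versus the packed variant
suggested above (sketch only, field names are mine and not necessarily the
patch's exact definition):

  /* sketch only: current ~20-byte item vs a packed ~16-byte variant */
  typedef struct CkptSortItemCurrent
  {
      Oid         tsId;         /* tablespace oid                  */
      Oid         relNode;      /* relation                        */
      ForkNumber  forkNum;      /* 4 bytes for 4 possible values   */
      BlockNumber blockNum;
      int         buf_id;
  } CkptSortItemCurrent;        /* 5 x 4 = 20 bytes */

  typedef struct CkptSortItemPacked
  {
      Oid         relNode;
      BlockNumber blockNum;
      uint16      tsNumber;     /* small per-checkpoint tablespace number */
      uint16      forkNum;      /* fork easily fits in 16 bits     */
      int         buf_id;
  } CkptSortItemPacked;         /* 16 bytes, still 4-byte aligned  */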

* CkptTsStatus:

As I suggested in the other mail, I think that this structure should also keep
a per tablespace WritebackContext so that coalescing is done per tablespace.

ISTM that "progress" and "progress_slice" only depend on num_scanned and
per-tablespace num_to_scan and total num_to_scan, so they are somehow
redundant and the progress could be recomputed from the initial figures
when needed.

If these fields are kept, I think that a comment should justify why float8 
precision is okay for the purpose. I think it is quite certainly fine in 
the worst case with 32 bits buffer_ids, but it would not be if this size 
is changed someday.

* BufferSync

After a first sweep to collect buffers to write, they are sorted, and then 
those buffers are swept again to compute some per-tablespace data 
and organise a heap.

ISTM that nearly all of the collected data on the second sweep could be 
collected on the first sweep, so that this second sweep could be avoided 
altogether. The only missing data is the index of the first buffer in the 
array, which can be computed by considering tablespaces only, sweeping 
over buffers is not needed. That would suggest creating the heap or using 
a hash in the initial buffer sweep to keep this information. This would 
also provide a point where to number tablespaces for compressing the 
CkptSortItem struct.

I'm wondering about calling CheckpointWriteDelay on each round, maybe
a minimum amount of writes would make sense. This remark is independent of 
this patch. Probably it works fine because after a sleep the checkpointer 
is behind enough so that it will write a bunch of buffers before sleeping
again.

I see a binary_heap_allocate but no corresponding deallocation, this
looks like a memory leak... or is there some magic involved?

There are some debug stuff to remove in #ifdefs.

I think that the buffer/README should be updated with explanations about
sorting in the checkpointer.

> I think this patch primarily needs:
> * Benchmarking on FreeBSD/OSX to see whether we should enable the
>  mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
>  inclined to leave it off till then.

I do not have that. As "msync" seems available on Linux, it is possible to 
force using it with a "ifdef 0" to skip sync_file_range and check whether 
it does some good there. Idem for the "posix_fadvise" stuff. I can try to 
do that, but it takes time to do so, if someone can test on other OS it 
would be much better. I think that if it works it should be kept in, so it 
is just a matter of testing it.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-02-21 08:26:28 +0100, Fabien COELHO wrote:
> >>In the discussion in the wal section, I'm not sure about the effect of
> >>setting writebacks on SSD, [...]
> >
> >Yea, that paragraph needs some editing. I think we should basically
> >remove that last sentence.
> 
> Ok, fine with me. Does that mean that flushing as a significant positive
> impact on SSD in your tests?

Yes. The reason we need flushing is that the kernel amasses dirty pages,
and then flushes them at once. That hurts for both SSDs and rotational
media. Sorting is the bigger question, but I've seen it have clearly
beneficial performance impacts. I guess if you look at devices with a
internal block size bigger than 8k, you'd even see larger differences.

> >>Maybe the merging strategy could be more aggressive than just strict
> >>neighbors?
> >
> >I don't think so. If you flush more than neighbouring writes you'll
> >often end up flushing buffers dirtied by another backend, causing
> >additional stalls.
> 
> Ok. Maybe the neightbor definition could be relaxed just a little bit so
> that small holes are overtake, but not large holes? If there is only a few
> pages in between, even if written by another process, then writing them
> together should be better? Well, this can wait for a clear case, because
> hopefully the OS will recoalesce them behind anyway.

I'm against doing so without clear measurements of a benefit.

> >Also because the infrastructure is used for more than checkpoint
> >writes. There's absolutely no ordering guarantees there.
> 
> Yep, but not much benefit to expect from a few dozens random pages either.

Actually, there's kinda frequently a benefit observable. Even if few
requests can be merged, doing IO requests in an order more likely doable
within a few rotations is beneficial. Also, the cost is marginal, so why
worry?

> >>[...] I do think that this whole writeback logic really does make
> >>sense *per table space*,
> >
> >Leads to less regular IO, because if your tablespaces are evenly sized
> >(somewhat common) you'll sometimes end up issuing sync_file_range's
> >shortly after each other.  For latency outside checkpoints it's
> >important to control the total amount of dirty buffers, and that's
> >obviously independent of tablespaces.
> 
> I do not understand/buy this argument.
> 
> The underlying IO queue is per device, and table spaces should be per device
> as well (otherwise what the point?), so you should want to coalesce and
> "writeback" pages per device as wel. Calling sync_file_range on distinct
> devices should probably be issued more or less randomly, and should not
> interfere one with the other.

The kernel's dirty buffer accounting is global, not per block device.
It's also actually rather common to have multiple tablespaces on a
single block device. Especially if SANs and such are involved; where you
don't even know which partitions are on which disks.


> If you use just one context, the more table spaces the less performance
> gains, because there is less and less aggregation thus sequential writes per
> device.
> 
> So for me there should really be one context per tablespace. That would
> suggest a hashtable or some other structure to keep and retrieve them, which
> would not be that bad, and I think that it is what is needed.

That'd be much easier to do by just keeping the context in the
per-tablespace struct. But anyway, I'm really doubtful about going for
that; I had it that way earlier, and observing IO showed it not being
beneficial.


> >>For the checkpointer, a key aspect is that the scheduling process goes
> >>to sleep from time to time, and this sleep time looked like a great
> >>opportunity to do this kind of flushing. You choose not to take advantage
> >>of the behavior, why?
> >
> >Several reasons: Most importantly there's absolutely no guarantee that
> >you'll ever end up sleeping, it's quite common to happen only seldomly.
> 
> Well, that would be under a situation when pg is completely unresponsive.
> More so, this behavior *makes* pg unresponsive.

No. The checkpointer being bottlenecked on actual IO performance doesn't
impact production that badly. It'll just sometimes block in
sync_file_range(), but the IO queues will have enough space to
frequently give way to other backends, particularly to synchronous reads
(most pg reads) and synchronous writes (fdatasync()).  So a single
checkpoint will take a bit longer, but otherwise the system will mostly
keep up the work in a regular manner.  Without the sync_file_range()
calls the kernel will amass dirty buffers until global dirty limits are
reached, which then will bring the whole system to a standstill.
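
To illustrate the mechanism only, here is a minimal Linux-only sketch, not
the patch's code: it assumes a single file written in ascending offset
order, and the helper name and FLUSH_AFTER value are purely illustrative.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    #define FLUSH_AFTER 32              /* pages, i.e. 256kB with 8kB pages */

    /*
     * Write one sorted checkpoint page; every FLUSH_AFTER pages ask the
     * kernel to start writing the accumulated range back, without waiting
     * and without flushing the drive cache (unlike fsync/fdatasync).
     */
    static void
    checkpoint_write_page(int fd, off_t offset, const void *page, size_t pagesz)
    {
        static int   pending = 0;
        static off_t range_start = 0;

        if (pending == 0)
            range_start = offset;

        pwrite(fd, page, pagesz, offset);

        if (++pending >= FLUSH_AFTER)
        {
            sync_file_range(fd, range_start,
                            offset + pagesz - range_start,
                            SYNC_FILE_RANGE_WRITE);
            pending = 0;
        }
    }

The important property is that SYNC_FILE_RANGE_WRITE only *starts*
writeback of what was just written, so dirty data never accumulates up to
the kernel's global limits.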

It's pretty common that checkpoint_timeout is too short to be able to
write all shared_buffers out, in that case it's much better to slow down
the whole checkpoint, instead of being incredibly slow at the end.

> >I also don't really believe it helps that much, although that's a complex
> >argument to make.
> 
> Yep. My thinking is that doing things in the sleeping interval does not
> interfere with the checkpointer scheduling, so it is less likely to go wrong
> and falling behind.

I don't really see why that's the case. Triggering writeback every N
writes doesn't really influence the scheduling in a bad way - the
flushing is done *before* computing the sleep time. Triggering the
writeback *after* computing the sleep time, and then sleeping for that
long in addition to the time spent in sync_file_range, skews things more.

Greetings,

Andres Freund



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
Hi,

On 2016-02-21 10:52:45 +0100, Fabien COELHO wrote:
> * CkptSortItem:
> 
> I think that allocating 20 bytes per buffer in shared memory is a little on
> the heavy side. Some compression can be achieved: sizeof(ForkNumber) is 4 bytes
> to hold 4 values, could be one byte or even 2 bits somewhere. Also, there
> are very few tablespaces, they could be given a small number and this number
> could be used instead of the Oid, so the space requirement could be reduced
> to say 16 bytes per buffer by combining space & fork in 2 shorts and keeping
> 4 byte alignment and also getting 8 byte alignment... If this is too
> much, I have shown that it can work with only 4 bytes per buffer, as the
> sorting is really just a performance optimisation and is not broken if some
> stuff changes between sorting & writeback, but you did not like the idea. If
> the amount of shared memory required is a significant concern, it could be
> resurrected, though.

This is less than 0.2 % of memory related to shared buffers. We have the
same amount of memory allocated in CheckpointerShmemSize(), and nobody
has complained so far.  And sorry, going back to the previous approach
isn't going to fly, and I've no desire to discuss that *again*.


> ISTM that "progress" and "progress_slice" only depend on num_scanned and
> per-tablespace num_to_scan and total num_to_scan, so they are somehow
> redundant and the progress could be recomputed from the initial figures
> when needed.

They don't cause much space usage, and we access the values
frequently. So why not store them?


> If these fields are kept, I think that a comment should justify why float8
> precision is okay for the purpose. I think it is quite certainly fine in the
> worst case with 32 bits buffer_ids, but it would not be if this size is
> changed someday.

That seems pretty much unrelated to having the fields - the question of
accuracy plays a role regardless, no? Given realistic amounts of memory
the max potential "skew" seems fairly small with float8. If we ever
flush one buffer "too much" for a tablespace it's pretty much harmless.

> ISTM that nearly all of the collected data on the second sweep could be
> collected on the first sweep, so that this second sweep could be avoided
> altogether. The only missing data is the index of the first buffer in the
> array, which can be computed by considering tablespaces only, sweeping over
> buffers is not needed. That would suggest creating the heap or using a hash
> in the initial buffer sweep to keep this information. This would also
> provide a point where to number tablespaces for compressing the CkptSortItem
> struct.

Doesn't seem worth the complexity to me.


> I'm wondering about calling CheckpointWriteDelay on each round, maybe
> a minimum amount of writing would make sense.

Why? There's not really much benefit of doing more work than needed. I
think we should sleep far shorter in many cases, but that's indeed a
separate issue.

> I see a binary_heap_allocate but no corresponding deallocation, this
> looks like a memory leak... or is there some magic involved?

Hm. I think we really should use a memory context for all of this - we
could after all error out somewhere in the middle...
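
Roughly along these lines (just a sketch; the context name and where it is
created are illustrative, not what the patch currently does):

    #include "postgres.h"
    #include "utils/memutils.h"

    /*
     * Sketch: do the checkpoint-local allocations (sort array, binary heap)
     * in a dedicated context, so everything can be released in one
     * MemoryContextDelete(), and a context left over after an error can be
     * reset at the start of the next checkpoint instead of leaking.
     */
    static void
    checkpoint_with_context(void)
    {
        MemoryContext ckpt_cxt, old_cxt;

        ckpt_cxt = AllocSetContextCreate(TopMemoryContext,
                                         "checkpoint flush state",
                                         ALLOCSET_DEFAULT_MINSIZE,
                                         ALLOCSET_DEFAULT_INITSIZE,
                                         ALLOCSET_DEFAULT_MAXSIZE);
        old_cxt = MemoryContextSwitchTo(ckpt_cxt);

        /* ... palloc the per-buffer sort items and the per-tablespace heap,
         * then write and flush the buffers ... */

        MemoryContextSwitchTo(old_cxt);
        MemoryContextDelete(ckpt_cxt);  /* releases everything at once */
    }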


> >I think this patch primarily needs:
> >* Benchmarking on FreeBSD/OSX to see whether we should enable the
> > mmap()/msync(MS_ASYNC) method by default. Unless somebody does so, I'm
> > inclined to leave it off till then.
> 
> I do not have that. As "msync" seems available on Linux, it is possible to
> force using it with an "#if 0" to skip sync_file_range and check whether it
> does some good there.

Unfortunately it doesn't work well on linux:

     * On many OSs msync() on a mmap'ed file triggers writeback. On linux
     * it only does so when MS_SYNC is specified, but then it does the
     * writeback synchronously. Luckily all common linux systems have
     * sync_file_range().  This is preferable over FADV_DONTNEED because
     * it doesn't flush out clean data.


I've verified beforehand, with a simple demo program, that
msync(MS_ASYNC) does something reasonable on freebsd...
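
Something along these lines (a minimal sketch of such a demo; it assumes an
existing file of at least one page and only checks that the call succeeds -
the writeback itself has to be watched with iostat/gstat):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        size_t  len = 8192;
        int     fd;
        char   *p;

        if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

        fd = open(argv[1], O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        memset(p, 'x', len);                 /* dirty the mapped page */

        /* ask the OS to start writing the range back, without waiting */
        if (msync(p, len, MS_ASYNC) != 0)
            perror("msync");

        munmap(p, len);
        close(fd);
        return 0;
    }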


> Idem for the "posix_fadvise" stuff. I can try to do
> that, but it takes time to do so; if someone can test on another OS it would
> be much better. I think that if it works it should be kept in, so it is just
> a matter of testing it.

I'm not arguing for ripping it out, what I mean is that we don't set a
nondefault value for the GUCs on platforms with just posix_fadvise
available...

Greetings,

Andres Freund



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
>>>> [...] I do think that this whole writeback logic really does make
>>>> sense *per table space*,
>>>
>>> Leads to less regular IO, because if your tablespaces are evenly sized
>>> (somewhat common) you'll sometimes end up issuing sync_file_range's
>>> shortly after each other.  For latency outside checkpoints it's
>>> important to control the total amount of dirty buffers, and that's
>>> obviously independent of tablespaces.
>>
>> I do not understand/buy this argument.
>>
>> The underlying IO queue is per device, and table spaces should be per device
>> as well (otherwise what's the point?), so you should want to coalesce and
>> "writeback" pages per device as well. Calls to sync_file_range on distinct
>> devices can probably be issued more or less independently, and should not
>> interfere with one another.
>
> The kernel's dirty buffer accounting is global, not per block device.

Sure, but this is not my point. My point is that "sync_file_range" moves 
buffers to the device io queues, which are per device. If there is one 
queue in pg and many queues on many devices, the whole point of coalescing 
to get sequential writes is somehow lost.

> It's also actually rather common to have multiple tablespaces on a
> single block device. Especially if SANs and such are involved; where you
> don't even know which partitions are on which disks.

Ok, some people would not benefit if the use many tablespaces on one 
device, too bad but that does not look like a useful very setting anyway, 
and I do not think it would harm much in this case.

>> If you use just one context, the more table spaces the smaller the performance
>> gains, because there is less and less aggregation and thus fewer sequential writes per
>> device.
>>
>> So for me there should really be one context per tablespace. That would
>> suggest a hashtable or some other structure to keep and retrieve them, which
>> would not be that bad, and I think that it is what is needed.
>
> That'd be much easier to do by just keeping the context in the
> per-tablespace struct. But anyway, I'm really doubtful about going for
> that; I had it that way earlier, and observing IO showed it not being
> beneficial.

ISTM that you would need a significant number of tablespaces to see the 
benefit. If you do not do that, the more table spaces the more random the 
IOs, which is disappointing. Also, "the cost is marginal", so I do not see 
any good argument not to do it.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
>> ISTM that "progress" and "progress_slice" only depend on num_scanned and
>> per-tablespace num_to_scan and total num_to_scan, so they are somehow
>> redundant and the progress could be recomputed from the initial figures
>> when needed.
>
> They don't cause much space usage, and we access the values frequently. 
> So why not store them?

The same question would work the other way around: these values are one 
division away, why not compute them when needed? No big deal.
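
I.e. something like (a sketch with illustrative names, just to show that it
is one division and one multiplication):

    /*
     * Sketch: recompute a tablespace's contribution to the overall progress
     * from the scan counters, instead of storing progress/progress_slice.
     */
    static double
    ts_progress(int total_num_to_scan, int ts_num_to_scan, int ts_num_scanned)
    {
        double progress_slice = (double) total_num_to_scan / ts_num_to_scan;

        return ts_num_scanned * progress_slice;
    }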

> [...] Given realistic amounts of memory the max potential "skew" seems 
> fairly small with float8. If we ever flush one buffer "too much" for a 
> tablespace it's pretty much harmless.

I do agree. I'm suggesting that a comment should be added to justify why 
float8 accuracy is okay.

>> I see a binary_heap_allocate but no corresponding deallocation, this
>> looks like a memory leak... or is there some magic involved?
>
> Hm. I think we really should use a memory context for all of this - we
> could after all error out somewhere in the middle...

I'm not sure that a memory context is justified here, there are only two 
mallocs and the checkpointer works for very long times. I think that it is 
simpler to just get the malloc/free right.
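
E.g. by simply pairing the allocation with a matching free at the end of the
checkpoint (a sketch; the function name and comparator parameter are
illustrative):

    #include "postgres.h"
    #include "lib/binaryheap.h"

    /*
     * Sketch: allocate the per-tablespace heap for one checkpoint and free
     * it once all buffers have been written, so nothing is leaked across
     * checkpoint cycles.
     */
    static void
    write_checkpoint_buffers(int num_spaces, binaryheap_comparator ts_compare)
    {
        binaryheap *ts_heap = binaryheap_allocate(num_spaces, ts_compare, NULL);

        /* ... push the tablespaces, pop/write until the heap is empty ... */

        binaryheap_free(ts_heap);
    }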

> [...] I'm not arguing for ripping it out, what I mean is that we don't 
> set a nondefault value for the GUCs on platforms with just 
> posix_fadivise available...

Ok with that.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
Hello Andres,

>> AFAICR I used a "flush context" for each table space in some version
>> I submitted, because I do think that this whole writeback logic really
>> does make sense *per table space*, which suggests that there should be as
>> many writeback contexts as table spaces, otherwise the positive effect
>> may be totally lost if table spaces are used. Any thoughts?
>
> Leads to less regular IO, because if your tablespaces are evenly sized
> (somewhat common) you'll sometimes end up issuing sync_file_range's
> shortly after each other.  For latency outside checkpoints it's
> important to control the total amount of dirty buffers, and that's
> obviously independent of tablespaces.

I did a quick & small test with random updates on 16 tables with 
checkpoint_flush_after=16 checkpoint_timeout=30

(1) with 16 tablespaces (1 per table, but same disk) :
    tps = 1100, 27% time under 100 tps

(2) with 1 tablespace :
    tps = 1200,  3% time under 100 tps

This result is logical: with one writeback context shared between 
tablespaces the sync_file_range is issued on a few buffers per file at a 
time on the 16 files, no coalescing occurs there, so this results in random 
IOs, while with one table space all writes are aggregated per file.

ISTM that this quick test shows that per-tablespace writeback contexts are 
relevant, as I expected.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
> I did a quick & small test with random updates on 16 tables with 
> checkpoint_flush_after=16 checkpoint_timeout=30

Another run with more "normal" settings and over 1000 seconds, so less 
"quick & small" than the previous one.

  checkpoint_flush_after = 16
  checkpoint_timeout = 5min    # default
  shared_buffers = 2GB         # 1/8 of available memory

Random updates on 16 tables which total to 1.1GB of data, so this is in 
buffer, no significant "read" traffic.

(1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
    per second avg, stddev [ min q1 median d3 max ] <=300tps
    679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%

(2) with 1 tablespace on 1 disk : 956.0 tps
    per second avg, stddev [ min q1 median d3 max ] <=300tps
    956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%

-- 
Fabien.

Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-02-22 14:11:05 +0100, Fabien COELHO wrote:
> 
> >I did a quick & small test with random updates on 16 tables with
> >checkpoint_flush_after=16 checkpoint_timeout=30
> 
> Another run with more "normal" settings and over 1000 seconds, so less
> "quick & small" that the previous one.
> 
>  checkpoint_flush_after = 16
>  checkpoint_timeout = 5min # default
>  shared_buffers = 2GB # 1/8 of available memory
> 
> Random updates on 16 tables which total to 1.1GB of data, so this is in
> buffer, no significant "read" traffic.
> 
> (1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
>     per second avg, stddev [ min q1 median d3 max ] <=300tps
>     679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%
> 
> (2) with 1 tablespace on 1 disk : 956.0 tps
>     per second avg, stddev [ min q1 median d3 max ] <=300tps
>     956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%

Interesting. That doesn't reflect my own tests, even on rotating media,
at all. I wonder if it's related to:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5

If you use your 12.04 kernel, that'd not be fixed. Which might be a
reason to do it as you suggest.

Could you share the exact details of that workload?

Greetings,

Andres Freund



Re: checkpointer continuous flushing - V18

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Interesting. That doesn't reflect my own tests, even on rotating media,
> at all. I wonder if it's related to:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5

> If you use your 12.04 kernel, that'd not be fixed. Which might be a
> reason to do it as you suggest.

Hmm ... that kernel commit is less than 4 months old.  Would it be
reflected in *any* production kernels yet?
        regards, tom lane



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-02-22 11:05:20 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Interesting. That doesn't reflect my own tests, even on rotating media,
> > at all. I wonder if it's related to:
> >
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5
> 
> > If you use your 12.04 kernel, that'd not be fixed. Which might be a
> > reason to do it as you suggest.
> 
> Hmm ... that kernel commit is less than 4 months old.  Would it be
> reflected in *any* production kernels yet?

Probably not - so far I thought it mainly has some performance benefits
on relatively extreme workloads; where without the patch, flushing still
is better performance-wise than not flushing. But in the scenario Fabien
has brought up it seems quite possible that sync_file_range emitting
"storage cache flush" instructions could explain the rather large
performance difference between his and my experiments.

Regards,

Andres



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
>> Random updates on 16 tables which total to 1.1GB of data, so this is in
>> buffer, no significant "read" traffic.
>>
>> (1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
>>     per second avg, stddev [ min q1 median d3 max ] <=300tps
>>     679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%
>>
>> (2) with 1 tablespace on 1 disk : 956.0 tps
>>     per second avg, stddev [ min q1 median d3 max ] <=300tps
>>     956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%
>
> Interesting. That doesn't reflect my own tests, even on rotating media,
> at all. I wonder if it's related to:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5
>
> If you use your 12.04 kernel, that'd not be fixed. Which might be a
> reason to do it as you suggest.
>
> Could you share the exact details of that workload?

See attached scripts (sh to create the 16 tables in the default or 16 
table spaces, small sql bench script, stat computation script).

The per-second stats were computed with:
  grep progress: pgbench.out | cut -d' ' -f4 | avg.py --length=1000 --limit=300

Host is 8 cpu 16 GB, 2 HDD in RAID 1.

-- 
Fabien.

Re: checkpointer continuous flushing - V16

From
Tomas Vondra
Date:
Hi,

On 02/18/2016 11:31 AM, Andres Freund wrote:
> On 2016-02-11 19:44:25 +0100, Andres Freund wrote:
>> The first two commits of the series are pretty close to being ready. I'd
>> welcome review of those, and I plan to commit them independently of the
>> rest as they're beneficial independently.  The most important bits are
>> the comments and docs of 0002 - they weren't particularly good
>> beforehand, so I had to rewrite a fair bit.
>>
>> 0001: Make SetHintBit() a bit more aggressive, afaics that fixes all the
>>        potential regressions of 0002
>> 0002: Fix the overaggressive flushing by the wal writer, by only
>>        flushing every wal_writer_delay ms or wal_writer_flush_after
>>        bytes.
>
> I've pushed these after some more polishing, now working on the next
> two.

I've finally had time to do some benchmarks on those two (already
committed) pieces. I've promised to do more testing while discussing the
patches with Andres some time ago, so here we go.

I do have two machines I use for this kind of benchmark:

1) HP DL380 G5 (old rack server)
- 2x Xeon E5450, 16GB RAM (8 cores)
- 4x 10k SAS drives in RAID-10 on H400 controller (with BBWC)
- RedHat 6
- shared_buffers = 4GB
- min_wal_size = 2GB
- max_wal_size = 6GB

2) workstation with i5 CPU
- 1x i5-2500k, 8GB RAM
- 6x Intel S3700 100GB (in RAID0 for this benchmark)
- Gentoo
- shared_buffers = 2GB
- min_wal_size = 1GB
- max_wal_size = 8GB

Both machines were using the same kernel version 4.4.2 and the default io
scheduler (cfq).

The test procedure was quite simple - pgbench with three different
scales, for each scale three runs, 1h per run (and 30 minutes of warmup
before each run).

Due to the difference in amount of RAM, each machine used different
scales - the goal is to have small, ~50% RAM, >200% RAM sizes:

1) Xeon: 100, 400, 6000
2) i5: 50, 200, 3000

The commits actually tested are

    cfafd8be  (right before the first patch)
    7975c5e0  Allow the WAL writer to flush WAL at a reduced rate.
    db76b1ef  Allow SetHintBits() to succeed if the buffer's LSN ...

For the Xeon, the total tps for each run looks like this:

     scale     commit          1        2         3
----------------------------------------------------
     100     cfafd8be       5136     5132      5144
             7975c5e0       5172     5148      5164
             db76b1ef       5131     5139      5131

     400     cfafd8be       3049     3042      2880
             7975c5e0       3038     3026      3027
             db76b1ef       2946     2940      2933

     6000    cfafd8be        394      389       391
             7975c5e0        391      479       467
             db76b1ef        443      416       481

So I'd say not much difference, except for the largest data set where
the improvement is visible (although it's a bit too noisy and additional
runs would be useful).

On the i5 workstation with SSDs, the results look like this:

    scale    commit          1         2         3
   ------------------------------------------------
     50      cfafd8be     5478      5486      5485
             7975c5e0     5473      5468      5436
             db76b1ef     5484      5453      5452

     200     cfafd8be     5169      5176      5167
             7975c5e0     5144      5151      5148
             db76b1ef     5162      5131      5131

     3000    cfafd8be     2392      2367      2359
             7975c5e0     2301      2340      2347
             db76b1ef     2277      2348      2342

So pretty much no difference, or perhaps a slight slowdown.

One of the goals of this thread (as I understand it) was to make the
overall behavior smoother - eliminate sudden drops in transaction rate
due to bursts of random I/O etc.

One way to look at this is in terms of how much the tps fluctuates, so
let's see some charts. I've collected per-second tps measurements (using
the aggregation built into pgbench) but looking at that directly is
pretty pointless because it's very difficult to compare two noisy lines
jumping up and down.

So instead let's see CDF of the per-second tps measurements. I.e. we
have 3600 tps measurements, and given a tps value the question is what
percentage of the measurements is below this value.

     y = Probability(tps <= x)

We prefer higher values, and the ideal behavior would be that we get
exactly the same tps every second. Thus an ideal CDF line would be a
step line. Of course, that's rarely the case in practice. But comparing
two CDF curves is easy - the line more to the right is better, at least
for tps measurements, where we prefer higher values.
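
For concreteness, the empirical CDF is just the fraction of samples at or
below a given value (a tiny sketch, not the actual plotting script):

    /* empirical CDF: fraction of the per-second tps samples that are <= x */
    static double
    ecdf(const double *tps, int n, double x)
    {
        int below = 0;

        for (int i = 0; i < n; i++)
            if (tps[i] <= x)
                below++;

        return (double) below / n;
    }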


1) tps-xeon.png

The original behavior (red lines) is quite consistent. The two patches
generally seem to improve the performance, although sadly it seems that
the variability of the performance actually increased quite a bit, as
the CDFs are much wider (but generally to the right of the old ones).
I'm not sure what exactly causes the volatility.


2) maxlat-xeon.png

Another view at the per-second data, this time using "max latency" from
the pgbench aggregated log. Of course, this time "lower is better" so
we'd like to move the CDF to the left (to get lower max latencies).
Sadly, it changes mostly in the other direction, i.e. the max latency
slightly increases (but the differences are not as significant as for
the tps rate, discussed in the previous paragraph). But apparently the
average latency actually improves (which gives us better tps).

Note: In this chart, x-axis is logarithmic.


3) tps-i5.png

Same chart with CDF of tps, but for the i5 workstation. This actually
shows the consistent slowdown due to the two patches, the tps
consistently shifts to the lower end (~2000tps).


I do have some more data, but those are the most interesting charts. The
rest usually shows about the same thing (or nothing).

Overall, I'm not quite sure the patches actually achieve the intended
goals. On the 10k SAS drives I got better performance, but apparently
much more variable behavior. On SSDs, I get a bit worse results.

Also, I really wonder what will happen with non-default io schedulers. I
believe all the testing so far was done with cfq, so what happens on
machines that use e.g. "deadline" (as many DB machines actually do)?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: checkpointer continuous flushing - V16

From
Fabien COELHO
Date:
Hello Tomas,

> One of the goals of this thread (as I understand it) was to make the overall 
> behavior smoother - eliminate sudden drops in transaction rate due to bursts 
> of random I/O etc.
>
> One way to look at this is in terms of how much the tps fluctuates, so let's 
> see some charts. I've collected per-second tps measurements (using the 
> aggregation built into pgbench) but looking at that directly is pretty 
> pointless because it's very difficult to compare two noisy lines jumping up 
> and down.
>
> So instead let's see CDF of the per-second tps measurements. I.e. we have 
> 3600 tps measurements, and given a tps value the question is what percentage 
> of the measurements is below this value.
>
>    y = Probability(tps <= x)
>
> We prefer higher values, and the ideal behavior would be that we get exactly 
> the same tps every second. Thus an ideal CDF line would be a step line. Of 
> course, that's rarely the case in practice. But comparing two CDF curves is 
> easy - the line more to the right is better, at least for tps measurements, 
> where we prefer higher values.

Very nice and interesting graphs!

Alas not easy to interpret for the HDD, as there are better/worse 
variations all along the distribution and the lines cross one another, so how 
it fares overall is unclear.

Maybe a simple indication would be to compute the standard deviation on 
the per-second tps? The median may be interesting as well.

> I do have some more data, but those are the most interesting charts. The rest 
> usually shows about the same thing (or nothing).
>
> Overall, I'm not quite sure the patches actually achieve the intended goals. 
> On the 10k SAS drives I got better performance, but apparently much more 
> variable behavior. On SSDs, I get a bit worse results.

Indeed.

-- 
Fabien.



Re: checkpointer continuous flushing - V16

From
Andres Freund
Date:
On 2016-03-01 16:06:47 +0100, Tomas Vondra wrote:
> 1) HP DL380 G5 (old rack server)
> - 2x Xeon E5450, 16GB RAM (8 cores)
> - 4x 10k SAS drives in RAID-10 on H400 controller (with BBWC)
> - RedHat 6
> - shared_buffers = 4GB
> - min_wal_size = 2GB
> - max_wal_size = 6GB
> 
> 2) workstation with i5 CPU
> - 1x i5-2500k, 8GB RAM
> - 6x Intel S3700 100GB (in RAID0 for this benchmark)
> - Gentoo
> - shared_buffers = 2GB
> - min_wal_size = 1GB
> - max_wal_size = 8GB


Thinking about it, with that hardware I'm not surprised if you're only seeing
small benefits. The amount of ram limits the amount of dirty data; and
you have plenty of on-storage buffering in comparison to that.


> Both machines were using the same kernel version 4.4.2 and the default io
> scheduler (cfq).
> 
> The test procedure was quite simple - pgbench with three different scales,
> for each scale three runs, 1h per run (and 30 minutes of warmup before each
> run).
> 
> Due to the difference in amount of RAM, each machine used different scales -
> the goal is to have small, ~50% RAM, >200% RAM sizes:
> 
> 1) Xeon: 100, 400, 6000
> 2) i5: 50, 200, 3000
> 
> The commits actually tested are
> 
>    cfafd8be  (right before the first patch)
>    7975c5e0  Allow the WAL writer to flush WAL at a reduced rate.
>    db76b1ef  Allow SetHintBits() to succeed if the buffer's LSN ...

Huh, now I'm a bit confused. These are the commits you tested? Those
aren't the ones doing sorting and flushing?


> Also, I really wonder what will happen with non-default io schedulers. I
> believe all the testing so far was done with cfq, so what happens on
> machines that use e.g. "deadline" (as many DB machines actually do)?

deadline and noop showed slightly bigger benefits in my testing.


Greetings,

Andres Freund



Re: checkpointer continuous flushing - V16

From
Andres Freund
Date:
On 2016-03-07 09:41:51 -0800, Andres Freund wrote:
> > Due to the difference in amount of RAM, each machine used different scales -
> > the goal is to have small, ~50% RAM, >200% RAM sizes:
> > 
> > 1) Xeon: 100, 400, 6000
> > 2) i5: 50, 200, 3000
> > 
> > The commits actually tested are
> > 
> >    cfafd8be  (right before the first patch)
> >    7975c5e0  Allow the WAL writer to flush WAL at a reduced rate.
> >    db76b1ef  Allow SetHintBits() to succeed if the buffer's LSN ...
> 
> Huh, now I'm a bit confused. These are the commits you tested? Those
> aren't the ones doing sorting and flushing?

To clarify: The reason we'd not expect to see much difference here is
that the above commits really only have any effect above noise if you
use synchronous_commit=off. Without async commit it's just one
additional gettimeofday() call and a few additional branches in the wal
writer every wal_writer_delay.

Andres



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-02-22 20:44:35 +0100, Fabien COELHO wrote:
> 
> >>Random updates on 16 tables which total to 1.1GB of data, so this is in
> >>buffer, no significant "read" traffic.
> >>
> >>(1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
> >>    per second avg, stddev [ min q1 median d3 max ] <=300tps
> >>    679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%
> >>
> >>(2) with 1 tablespace on 1 disk : 956.0 tps
> >>    per second avg, stddev [ min q1 median d3 max ] <=300tps
> >>    956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%
> >
> >Interesting. That doesn't reflect my own tests, even on rotating media,
> >at all. I wonder if it's related to:
> >https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=23d0127096cb91cb6d354bdc71bd88a7bae3a1d5
> >
> >If you use your 12.04 kernel, that'd not be fixed. Which might be a
> >reason to do it as you suggest.
> >
> >Could you share the exact details of that workload?
> 
> See attached scripts (sh to create the 16 tables in the default or 16 table
> spaces, small sql bench script, stat computation script).
> 
> The per-second stats were computed with:
> 
>   grep progress: pgbench.out | cut -d' ' -f4 | avg.py --length=1000 --limit=300
> 
> Host is 8 cpu 16 GB, 2 HDD in RAID 1.

Well, that's not a particularly meaningful workload. You increased the
number of flushes to the same number of disks considerably. For a
meaningful comparison you'd have to compare using one writeback context
for N tablespaces on N separate disks/raids, and using N writeback
contexts for the same.

Andres



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
Hello Andres,

>>>> (1) with 16 tablespaces (1 per table) on 1 disk : 680.0 tps
>>>>    per second avg, stddev [ min q1 median d3 max ] <=300tps
>>>>    679.6 ± 750.4 [0.0, 317.0, 371.0, 438.5, 2724.0] 19.5%
>>>>
>>>> (2) with 1 tablespace on 1 disk : 956.0 tps
>>>>    per second avg, stddev [ min q1 median d3 max ] <=300tps
>>>>    956.2 ± 796.5 [3.0, 488.0, 583.0, 742.0, 2774.0] 2.1%
>
> Well, that's not a particularly meaningful workload. You increased the 
> number of flushes to the same number of disks considerably.

It is just a simple workload designed to emphasize the effect of having 
one context shared for all table spaces instead of one per tablespace, 
without rewriting the patch and without a large host with multiple disks.

> For a meaningful comparison you'd have to compare using one writeback 
> context for N tablespaces on N separate disks/raids, and using N 
> writeback contexts for the same.

Sure, it would be better to do that, but that would require (1) rewriting 
the patch, which is a small amount of work, and also (2) having access to a machine 
with a number of disks/raids, which I do NOT have available.


What happens in the 16 tb workload is that much smaller flushes are 
performed on the 16 files written in parallel, so the tps performance is 
significantly degraded, despite the writes being sorted in each file. On 
one tb, all buffers flushed are in the same file, so flushes are much more 
effective.

When the context is shared and checkpointer buffer writes are balanced 
across table spaces, then when the limit is reached the flushing gets few 
buffers per tablespace, so this limits sequential writes to few buffers, 
hence the performance degradation.

So I can explain the performance degradation *because* the flush context 
is shared between the table spaces, which is a logical argument backed 
with experimental data, so it is better than handwaving. Given the 
available hardware, this is the best proof I can have that context should 
be per table space.

Now I cannot see how having one context per table space would have a 
significant negative performance impact.

So the logical conclusion for me is that without further experimental data 
it is better to have one context per table space.

If you have a hardware with plenty disks available for testing, that would 
provide better data, obviously.

-- 
Fabien.

Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-03-07 21:10:19 +0100, Fabien COELHO wrote:
> Now I cannot see how having one context per table space would have a
> significant negative performance impact.

The 'dirty data' etc. limits are global, not per block device. By having
several contexts with unflushed dirty data the total amount of dirty
data in the kernel increases. Thus you're more likely to see stalls by
the kernel moving pages into writeback.

Andres



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
Hello Andres,

>> Now I cannot see how having one context per table space would have a
>> significant negative performance impact.
>
> The 'dirty data' etc. limits are global, not per block device. By having
> several contexts with unflushed dirty data the total amount of dirty
> data in the kernel increases.

Possibly, but how much?  Do you have experimental data to back up that 
this is really an issue?

We are talking about 32 (context size) * #table spaces * 8KB buffers = 4MB 
of dirty buffers to manage for 16 table spaces, I do not see that as a 
major issue for the kernel.

> Thus you're more likely to see stalls by the kernel moving pages into 
> writeback.

I do not see the above data having a 30% negative impact on tps, given the 
quite small amount of data under discussion, and switching to random IOs 
costs so much that it must really be avoided.

Without further experimental data, I still think that one context per 
table space is the reasonable choice.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
>>> Now I cannot see how having one context per table space would have a
>>> significant negative performance impact.
>> 
>> The 'dirty data' etc. limits are global, not per block device. By having
>> several contexts with unflushed dirty data the total amount of dirty
>> data in the kernel increases.
>
> Possibly, but how much?  Do you have experimental data to back up that this 
> is really an issue?
>
> We are talking about 32 (context size) * #table spaces * 8KB buffers = 4MB of 
> dirty buffers to manage for 16 table spaces, I do not see that as a major 
> issue for the kernel.

More thoughts about your theoretical argument:

To complete the argument, the 4MB is just a worst case scenario, in 
reality flushing the different contexts would be randomized over time, so 
the frequency of flushing a context would be exactly the same in both 
cases (shared or per table space context) if the checkpoints are the same 
size, just that with a shared context each flush potentially targets 
all tablespaces with a few pages, while with the other version each 
flush targets one table space only.

So my handwaving analysis is that the flow of dirty buffers is the same 
with both approaches, but for the shared version buffers are more equally 
distributed across table spaces, hence reducing sequential write 
effectiveness, and for the other the dirty buffers are grouped more 
clearly per table space, so it should get better sequential write 
performance.


-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-03-08 09:28:15 +0100, Fabien COELHO wrote:
> 
> >>>Now I cannot see how having one context per table space would have a
> >>>significant negative performance impact.
> >>
> >>The 'dirty data' etc. limits are global, not per block device. By having
> >>several contexts with unflushed dirty data the total amount of dirty
> >>data in the kernel increases.
> >
> >Possibly, but how much?  Do you have experimental data to back up that
> >this is really an issue?
> >
> >We are talking about 32 (context size) * #table spaces * 8KB buffers = 4MB
> >of dirty buffers to manage for 16 table spaces, I do not see that as a
> >major issue for the kernel.

We flush in those increments, that doesn't mean there's only that much
dirty data. I regularly see one order of magnitude more being dirty.


I had originally kept it with one context per tablespace after
refactoring this, but found that it gave worse results in rate limited
loads even over only two tablespaces. That's on SSDs though.


> To complete the argument, the 4MB is just a worst case scenario, in reality
> flushing the different contexts would be randomized over time, so the
> frequency of flushing a context would be exactly the same in both cases
> (shared or per table space context) if the checkpoints are the same size,
> just that with a shared context each flush potentially targets all
> tablespaces with a few pages, while with the other version each flush
> targets one table space only.

The number of pages still in writeback (i.e. for which sync_file_range
has been issued, but which haven't finished running yet) at the end of
the checkpoint matters for the latency hit incurred by the fsync()s from
smgrsync(); at least by my measurement.


My current plan is to commit this with the current behaviour (as in this
week[end]), and then do some actual benchmarking on this specific
part. It's imo a relatively minor detail.

Greetings,

Andres Freund



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-02-21 09:49:53 +0530, Robert Haas wrote:
> I think there might be a semantic distinction between these two terms.
> Doesn't writeback mean writing pages to disk, and flushing mean making
> sure that they are durably on disk?  So for example when the Linux
> kernel thinks there is too much dirty data, it initiates writeback,
> not a flush; on the other hand, at transaction commit, we initiate a
> flush, not writeback.

I don't think terminology is sufficiently clear to make such a
distinction. Take e.g. our FlushBuffer()...



Re: checkpointer continuous flushing - V18

From
Robert Haas
Date:
On Thu, Mar 10, 2016 at 5:24 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-02-21 09:49:53 +0530, Robert Haas wrote:
>> I think there might be a semantic distinction between these two terms.
>> Doesn't writeback mean writing pages to disk, and flushing mean making
>> sure that they are durably on disk?  So for example when the Linux
>> kernel thinks there is too much dirty data, it initiates writeback,
>> not a flush; on the other hand, at transaction commit, we initiate a
>> flush, not writeback.
>
> I don't think terminology is sufficiently clear to make such a
> distinction. Take e.g. our FlushBuffer()...

Well then we should clarify it!

:-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-03-10 17:33:33 -0500, Robert Haas wrote:
> On Thu, Mar 10, 2016 at 5:24 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-02-21 09:49:53 +0530, Robert Haas wrote:
> >> I think there might be a semantic distinction between these two terms.
> >> Doesn't writeback mean writing pages to disk, and flushing mean making
> >> sure that they are durably on disk?  So for example when the Linux
> >> kernel thinks there is too much dirty data, it initiates writeback,
> >> not a flush; on the other hand, at transaction commit, we initiate a
> >> flush, not writeback.
> >
> > I don't think terminology is sufficiently clear to make such a
> > distinction. Take e.g. our FlushBuffer()...
> 
> Well then we should clarify it!

Trying that as we speak, err, write. How about:

       <para>
        Whenever more than <varname>bgwriter_flush_after</varname> bytes have
        been written by the bgwriter, attempt to force the OS to issue these
        writes to the underlying storage.  Doing so will limit the amount of
        dirty data in the kernel's page cache, reducing the likelihood of
        stalls when an fsync is issued at the end of a checkpoint, or when
        the OS writes data back in larger batches in the background.  Often
        that will result in greatly reduced transaction latency, but there
        also are some cases, especially with workloads that are bigger than
        <xref linkend="guc-shared-buffers">, but smaller than the OS's page
        cache, where performance might degrade.  This setting may have no
        effect on some platforms.  <literal>0</literal> disables controlled
        writeback. The default is <literal>256Kb</> on Linux, <literal>0</>
        otherwise. This parameter can only be set in the
        <filename>postgresql.conf</> file or on the server command line.
       </para>
 

(plus adjustments for the other gucs)



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
[...]

> I had originally kept it with one context per tablespace after 
> refactoring this, but found that it gave worse results in rate limited 
> loads even over only two tablespaces. That's on SSDs though.

Might just mean that a smaller context size is better on SSD, and it could 
still be better per table space.

> The number of pages still in writeback (i.e. for which sync_file_range
> has been issued, but which haven't finished running yet) at the end of
> the checkpoint matters for the latency hit incurred by the fsync()s from
> smgrsync(); at least by my measurement.

I'm not sure I've seen these performance... If you have hard evidence, 
please feel free to share it.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-03-10 23:38:38 +0100, Fabien COELHO wrote:
> I'm not sure I've seen these performance... If you have hard evidence,
> please feel free to share it.

Man, are you intentionally trying to be hard to work with?  To quote the
email you responded to:

> My current plan is to commit this with the current behaviour (as in this
> week[end]), and then do some actual benchmarking on this specific
> part. It's imo a relatively minor detail.



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
>        <para>
>         Whenever more than <varname>bgwriter_flush_after</varname> bytes have
>         been written by the bgwriter, attempt to force the OS to issue these
>         writes to the underlying storage.  Doing so will limit the amount of
>         dirty data in the kernel's page cache, reducing the likelihood of
>         stalls when an fsync is issued at the end of a checkpoint, or when
>         the OS writes data back  in larger batches in the background.  Often
>         that will result in greatly reduced transaction latency, but there
>         also are some cases, especially with workloads that are bigger than
>         <xref linkend="guc-shared-buffers">, but smaller than the OS's page
>         cache, where performance might degrade.  This setting may have no
>         effect on some platforms.  <literal>0</literal> disables controlled
>         writeback. The default is <literal>256Kb</> on Linux, <literal>0</>
>         otherwise. This parameter can only be set in the
>         <filename>postgresql.conf</> file or on the server command line.
>        </para>
>
> (plus adjustments for the other gucs)

Some suggestions:

What about the maximum value?

If the default is in pages, maybe you could state it and afterwards 
translate it in size.

"The default is 64 pages on Linux (usually 256Kb)..."

The text could say something about sequential write performance because 
pages are sorted, but also that it is lost for large bases and/or short 
checkpoints?

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-03-10 23:43:46 +0100, Fabien COELHO wrote:
> 
> >       <para>
> >        Whenever more than <varname>bgwriter_flush_after</varname> bytes have
> >        been written by the bgwriter, attempt to force the OS to issue these
> >        writes to the underlying storage.  Doing so will limit the amount of
> >        dirty data in the kernel's page cache, reducing the likelihood of
> >        stalls when an fsync is issued at the end of a checkpoint, or when
> >        the OS writes data back  in larger batches in the background.  Often
> >        that will result in greatly reduced transaction latency, but there
> >        also are some cases, especially with workloads that are bigger than
> >        <xref linkend="guc-shared-buffers">, but smaller than the OS's page
> >        cache, where performance might degrade.  This setting may have no
> >        effect on some platforms.  <literal>0</literal> disables controlled
> >        writeback. The default is <literal>256Kb</> on Linux, <literal>0</>
> >        otherwise. This parameter can only be set in the
> >        <filename>postgresql.conf</> file or on the server command line.
> >       </para>
> >
> >(plus adjustments for the other gucs)

> What about the maximum value?

Added.
     <varlistentry id="guc-bgwriter-flush-after" xreflabel="bgwriter_flush_after">
      <term><varname>bgwriter_flush_after</varname> (<type>int</type>)
      <indexterm>
       <primary><varname>bgwriter_flush_after</> configuration parameter</primary>
      </indexterm>
      </term>
      <listitem>
       <para>
        Whenever more than <varname>bgwriter_flush_after</varname> bytes have
        been written by the bgwriter, attempt to force the OS to issue these
        writes to the underlying storage.  Doing so will limit the amount of
        dirty data in the kernel's page cache, reducing the likelihood of
        stalls when an fsync is issued at the end of a checkpoint, or when
        the OS writes data back in larger batches in the background.  Often
        that will result in greatly reduced transaction latency, but there
        also are some cases, especially with workloads that are bigger than
        <xref linkend="guc-shared-buffers">, but smaller than the OS's page
        cache, where performance might degrade.  This setting may have no
        effect on some platforms.  The valid range is between
        <literal>0</literal>, which disables controlled writeback, and
        <literal>2MB</literal>.  The default is <literal>256Kb</> on Linux,
        <literal>0</> elsewhere.  (Non-default values of
        <symbol>BLCKSZ</symbol> change the default and maximum.)
        This parameter can only be set in the <filename>postgresql.conf</>
        file or on the server command line.
       </para>
      </listitem>
     </varlistentry>
    </variablelist>
 


> If the default is in pages, maybe you could state it and afterwards
> translate it in size.

Hm, I think that's more complicated for users than it's worth.


> The text could say something about sequential writes performance because
> pages are sorted.., but that it is lost for large bases and/or short
> checkpoints ?

I think that's an implementation detail.


- Andres



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
Hello Andres,

>> I'm not sure I've seen these performance... If you have hard evidence,
>> please feel free to share it.
>
> Man, are you intentionally trying to be hard to work with?

Sorry, I do not understand this remark.

You were referring to some latency measures in your answer, and I was just 
stating that I was interested in seeing these figures which were used to 
justify your choice to keep a shared writeback context.

I did not intend this wish to be an issue, I was expressing an interest.

> To quote the email you responded to:
>
>> My current plan is to commit this with the current behaviour (as in 
>> this week[end]), and then do some actual benchmarking on this specific 
>> part. It's imo a relatively minor detail.

Good.

From the evidence in the thread, I would have given the per tablespace 
context the preference, but this is just a personal opinion and I agree 
that it can work the other way around.

I look forward to seeing these benchmarks later on, when you have them.

So all is well, and hopefully it will be even better later on.

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
[...]

>> If the default is in pages, maybe you could state it and afterwards
>> translate it in size.
>
> Hm, I think that's more complicated for users than it's worth.

As you wish. I liked the number of pages you used initially because it 
really gives a hint of how many random IOs are avoided when they are 
contiguous, and I do not have the same intuition with sizes. Also it 
is related to the io queue length managed by the OS.

>> The text could say something about sequential writes performance because
>> pages are sorted.., but that it is lost for large bases and/or short
>> checkpoints ?
>
> I think that's an implementation detail.

As you wish. I thought that understanding the underlying performance model 
with sequential writes written in chunks is important for the admin, and 
as this guc would have an impact on performance it should be hinted about, 
including the limits of its effect where large bases will converge to 
random io performance. But maybe that is not the right place.

-- 
Fabien



Re: checkpointer continuous flushing - V18

From
Andres Freund
Date:
On 2016-03-11 00:23:56 +0100, Fabien COELHO wrote:
> As you wish. I thought that understanding the underlying performance model
> with sequential writes written in chunks is important for the admin, and as
> this guc would have an impact on performance it should be hinted about,
> including the limits of its effect where large bases will converge to random
> io performance. But maybe that is not the right place.

I do agree that that's something interesting to document somewhere. But
I don't think any of the current places in the documentation are a good
fit, and it's a topic much more general than the feature we're debating
here.  I'm not volunteering, but a good discussion of storage and the
interactions with postgres surely would be a significant improvement to
the postgres docs.


- Andres



Re: checkpointer continuous flushing

From
Andres Freund
Date:
Hi,

I just pushed the two major remaining patches in this thread. Let's see
what the buildfarm has to say; I'd not be surprised if there's some
lingering portability problem in the flushing code.

There's one remaining issue we definitely want to resolve before the
next release:  Right now we always use one writeback context across all
tablespaces in a checkpoint, but Fabien's testing shows that that's
likely to hurt in a number of cases. I've some data suggesting the
contrary in others.

Things that'd be good:
* Some benchmarking. Right now controlled flushing is enabled by default
  on linux, but disabled by default on other operating systems. Somebody
  running benchmarks on e.g. freebsd or OSX might be good.

* If somebody has the energy to provide a windows implementation for flush
  control, that might be worthwhile. There's several places that could
  benefit from that.

* The default values are basically based on benchmarking by me and Fabien.

Regards,

Andres



Re: checkpointer continuous flushing - V18

From
Fabien COELHO
Date:
>> As you wish. I thought that understanding the underlying performance model
>> with sequential writes written in chunks is important for the admin, and as
>> this guc would have an impact on performance it should be hinted about,
>> including the limits of its effect where large bases will converge to random
>> io performance. But maybe that is not the right place.
>
> I do agree that that's something interesting to document somewhere. But
> I don't think any of the current places in the documentation are a good
> fit, and it's a topic much more general than the feature we're debating
> here.  I'm not volunteering, but a good discussion of storage and the
> interactions with postgres surely would be a significant improvement to
> the postgres docs.

I can only concur!

The "Performance Tips" chapter (II.14) is more user/query oriented. The 
"Server Administration" bool (III) does not discuss this much.

There is a wiki about performance tuning, but it is not integrated into 
the documentation. It could be a first documentation source.

Also the READMEs in some development directories are very interesting, 
although they contain too many details about the implementation.

There has been a lot of presentations over the years, and blog posts.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
> I just pushed the two major remaining patches in this thread.

Hurray! Nine months to get this baby out :-)

-- 
Fabien.



Re: checkpointer continuous flushing - V18

From
Peter Geoghegan
Date:
On Thu, Mar 10, 2016 at 11:18 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> I can only concur!
>
> The "Performance Tips" chapter (II.14) is more user/query oriented. The
> "Server Administration" bool (III) does not discuss this much.

That's definitely one area in which the docs are lacking -- I've heard
several complaints about this myself. I think we've been hesitant to
do more in part because the docs must always be categorically correct,
and must not use weasel words. I think it's hard to talk about
performance while maintaining the general tone of the documentation. I
don't know what can be done about that.

-- 
Peter Geoghegan



Re: checkpointer continuous flushing - V18

From
Jeff Janes
Date:
On Thu, Mar 10, 2016 at 11:25 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Mar 10, 2016 at 11:18 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>> I can only concur!
>>
>> The "Performance Tips" chapter (II.14) is more user/query oriented. The
>> "Server Administration" bool (III) does not discuss this much.
>
> That's definitely one area in which the docs are lacking -- I've heard
> several complaints about this myself. I think we've been hesitant to
> do more in part because the docs must always be categorically correct,
> and must not use weasel words. I think it's hard to talk about
> performance while maintaining the general tone of the documentation. I
> don't know what can be done about that.

Would the wiki be a good place for such tips?  Not as formal as the
documentation, and more centralized (and editable) than a collection
of blog posts.

Cheers,

Jeff



Re: checkpointer continuous flushing - V18

From
Peter Geoghegan
Date:
On Sat, Mar 12, 2016 at 5:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Would the wiki be a good place for such tips?  Not as formal as the
> documentation, and more centralized (and editable) than a collection
> of blog posts.

That general direction makes sense, but I'm not sure if the Wiki is
something that this will work for. I fear that it could become
something like the TODO list page: a page that contains theoretically
accurate information, but isn't very helpful. The TODO list needs to
be heavily pruned, but that seems like something that will never
happen.

A centralized location for performance tips will probably only work
well if there are still high standards that are actively enforced.
There still needs to be tight editorial control.

-- 
Peter Geoghegan



Re: checkpointer continuous flushing - V18

From
Jim Nasby
Date:
On 3/13/16 6:30 PM, Peter Geoghegan wrote:
> On Sat, Mar 12, 2016 at 5:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> Would the wiki be a good place for such tips?  Not as formal as the
>> documentation, and more centralized (and editable) than a collection
>> of blog posts.
>
> That general direction makes sense, but I'm not sure if the Wiki is
> something that this will work for. I fear that it could become
> something like the TODO list page: a page that contains theoretically
> accurate information, but isn't very helpful. The TODO list needs to
> be heavily pruned, but that seems like something that will never
> happen.
>
> A centralized location for performance tips will probably only work
> well if there are still high standards that are actively enforced.
> There still needs to be tight editorial control.

I think there's ways to significantly restrict who can edit a page, so 
this could probably still be done via the wiki. IMO we should also be 
encouraging users to test various tips and provide feedback, so maybe a 
wiki page with a big fat request at the top asking users to submit any 
feedback about the page to -performance.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: checkpointer continuous flushing

From
Tomas Vondra
Date:
Hi,

On 03/11/2016 02:34 AM, Andres Freund wrote:
> Hi,
>
> I just pushed the two major remaining patches in this thread. Let's see
> what the buildfarm has to say; I'd not be surprised if there's some
> lingering portability problem in the flushing code.
>
> There's one remaining issue we definitely want to resolve before the
> next release:  Right now we always use one writeback context across all
> tablespaces in a checkpoint, but Fabien's testing shows that that's
> likely to hurt in a number of cases. I've some data suggesting the
> contrary in others.
>
> Things that'd be good:
> * Some benchmarking. Right now controlled flushing is enabled by default
>   on linux, but disabled by default on other operating systems. Somebody
>   running benchmarks on e.g. freebsd or OSX might be good.

So I've done some benchmarks of this, and I think the results are very
good. I've compared a298a1e06 and 23a27b039d (so the two patches
mentioned here are in-between those two), and I've done a few long
pgbench runs - 24h each:

1) master (a298a1e06), regular pgbench
2) master (a298a1e06), throttled to 5000 tps
3) patched (23a27b039), regular pgbench
4) patched (23a27b039), throttled to 5000 tps

All of this was done on a quite large machine:

* 4 x CPU E5-4620 (2.2GHz)
* 256GB of RAM
* 24x SSD on LSI 2208 controller (with 1GB BBWC)

The page cache was using the default config, although in production
setups we'd probably lower the limits (particularly the background
threshold):

* vm.dirty_background_ratio = 10
* vm.dirty_ratio = 20

The main PostgreSQL configuration changes are these:

* shared_buffers=64GB
* bgwriter_delay = 10ms
* bgwriter_lru_maxpages = 1000
* checkpoint_timeout = 30min
* max_wal_size = 64GB
* min_wal_size = 32GB

I haven't touched the flush_after values, so those are at default. Full
config in the github repo, along with all the results and scripts used
to generate the charts etc:

     https://github.com/tvondra/flushing-benchmark

I'd like to see some benchmarks on machines with regular rotational
storage, but I don't have a suitable system at hand.

The pgbench database was scale 60000, so ~750GB of data on disk, and the
benchmark was executed either like this (the "default"):

pgbench -c 32 -j 8 -T 86400 -l --aggregate-interval=1 pgbench

or like this ("throttled"):

pgbench -c 32 -j 8 -T 86400 -R 5000 -l --aggregate-interval=1 pgbench

The reason for the throttling is that people generally don't run
production databases 100% saturated, so it'd be sad to improve the 100%
saturated case and hurt the common case by increasing latency. The
machine does ~8000 tps, so 5000 tps is ~60% of that.

It's difficult to judge based on a single run (although a long one), but
it seems the throughput increased a tiny bit from 7725 to 8000. That's
~4% difference, but I guess more runs would be needed to see if this is
noise or actual improvement.

Now, let's look at the per-second results, i.e. how much the performance
fluctuates over time (due to checkpoints etc.). That's where the
aggregated (per-second) log gets useful, as it's used for generating the
various charts for tps, max latency, stddev of latency etc.

All those charts are CDFs (cumulative distribution functions), i.e. they
plot a metric on the x-axis and the probability P(X <= x) on the y-axis.

In general, the steeper the curve the better (more consistent behavior
over time). It also allows comparing two curves - e.g. for the tps metric
the "lower" curve is better, as it means higher values are more likely.

default (non-throttled) pgbench runs
------------------------------------

Let's see the regular (non-throttled) pgbench runs first:

* regular-tps.png (per-second TPS)

Clearly, the patched version is much more consistent - it's much less
"wobbly" and considerably steeper, which means the per-second throughput
fluctuates much less. That's good.

We already know the total throughput is almost exactly the same (just a ~4%
difference); this also shows that the medians are almost exactly the
same (the curves intersect at pretty much exactly 50%).

* regular-max-lat.png (per-second maximum latency)
* regular-stddev-lat.png (per-second latency stddev)

Apparently the additional processing slightly increases both the maximum
latency and standard deviation, as the green line (patched) is
consistently below the pink one (unpatched).

Notice however that x-axis is using log scale, so the differences are
actually very small, and we also know that the total throughput slightly
increased. So while those two metrics slightly increased, the overall
impact on latency has to be positive.

throttled pgbench runs
----------------------

* throttled-tps.png (per-second TPS)

OK, this is great - the chart shows that the performance is way more
consistent. Originally ~10% of the samples were at or below ~2000 tps, but
with the flushing you'd have to go up to ~4600 tps to cover the same 10%.
It's actually pretty difficult to tell from the chart, because the curve
got so steep, so I had to check the data used to generate the charts.

Similarly for the upper end, but I assume that's a consequence of the
throttling not having to compensate for the "slow" seconds anymore.

* throttled-max-lat.png (per-second maximum latency)
* throttled-stddev-lat.png (per-second latency stddev)

This time the stddev/max latency charts are actually in favor of the
patched code. It's a bit worse for the low latencies (the green
line is below the pink one, so there are fewer low values), but then it
starts winning for higher values. And that's what counts when it comes
to consistency.

Again, notice that the x-axis is log scale, so the differences for large
values are actually way more significant than it might look.


So, good work I guess!

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Tomas,

Thanks for these great measures.

> * 4 x CPU E5-4620 (2.2GHz)

4*8 = 32 cores / 64 threads.

> * 256GB of RAM

Wow!

> * 24x SSD on LSI 2208 controller (with 1GB BBWC)

Wow! RAID configuration? The patch is designed to fix very big issues on 
HDDs, but it is good to see that the impact is good on SSDs as well.

Is it possible to run tests with distinct tablespaces spread over those 
many disks?

> * shared_buffers=64GB

1/4 of the available memory.

> The pgbench was scale 60000, so ~750GB of data on disk,

3x the available memory, so mostly on disk.

> or like this ("throttled"):
>
> pgbench -c 32 -j 8 -T 86400 -R 5000 -l --aggregate-interval=1 pgbench
>
> The reason for the throttling is that people generally don't run production 
> databases 100% saturated, so it'd be sad to improve the 100% saturated case 
> and hurt the common case by increasing latency.

Sure.

> The machine does ~8000 tps, so 5000 tps is ~60% of that.

Ok.

I would have suggested using the --latency-limit option to filter out very 
slow transactions: otherwise, if the system is stuck, it may catch up later, 
but then this is not representative of "sustainable" performance.
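
Counting those is straightforward from a per-transaction log; a rough
sketch (the latency column and its unit - microseconds in the third field -
are assumptions to adapt to the actual log format):

    # sketch: fraction of transactions above a latency limit (in ms);
    # transactions skipped outright under --latency-limit are not counted here
    import sys

    limit_ms = float(sys.argv[2])
    total = late = 0
    with open(sys.argv[1]) as log:
        for line in log:
            total += 1
            if float(line.split()[2]) / 1000.0 > limit_ms:
                late += 1

    print("%.1f%% of %d transactions were late" % (100.0 * late / total, total))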

When pgbench is running under a target rate, the transaction distribution 
is expected to be the same in both runs, around 5000 tps, and the green 
run looks pretty OK with respect to that. The magenta one shows that about 
25% of the time things are not good at all, and the higher figures just 
show the catching up, which is not really interesting if you asked for a 
web page and it is finally delivered one minute later.

> * regular-tps.png (per-second TPS) [...]

Great curves!

> consistent. Originally there was ~10% of samples with ~2000 tps, but with the 
> flushing you'd have to go to ~4600 tps. It's actually pretty difficult to 
> determine this from the chart, because the curve got so steep and I had to 
> check the data used to generate the charts.
>
> Similarly for the upper end, but I assume that's a consequence of the 
> throttling not having to compensate for the "slow" seconds anymore.

Yep, but those should be filtered out ("sorry, too late"), so they would 
count as unresponsiveness, at least for a large class of applications.

Thanks a lot for these interesting tests!

-- 
Fabien.



Re: checkpointer continuous flushing

From
Tomas Vondra
Date:
Hi,

On 03/17/2016 06:36 PM, Fabien COELHO wrote:
>
> Hello Tomas,
>
> Thanks for these great measures.
>
>> * 4 x CPU E5-4620 (2.2GHz)
>
> 4*8 = 32 cores / 64 threads.

Yep. I only used 32 clients though, to keep some of the CPU available 
for the rest of the system (also, HT does not really double the number 
of cores).

>
>> * 256GB of RAM
>
> Wow!
>
>> * 24x SSD on LSI 2208 controller (with 1GB BBWC)
>
> Wow! RAID configuration ? The patch is designed to fix very big issues
> on HDD, but it is good to see that the impact is good on SSD as well.

Yep, RAID-10. I agree that doing the test on an HDD-based system would be 
useful, however (a) I don't have a comparable system at hand at the 
moment, and (b) I was a bit worried that it would hurt performance on SSDs, 
but thankfully that's not the case.

I will do the test on a much smaller system with HDDs in a few days.

>
> Is it possible to run tests with distinct table spaces on those many disks?

Nope, that'd require reconfiguring the system (and then back), and I 
only have SSH access to that system. Also, I don't quite see what that 
would tell us.

>> * shared_buffers=64GB
>
> 1/4 of the available memory.
>
>> The pgbench was scale 60000, so ~750GB of data on disk,
>
> *3 available memory, mostly on disk.
>
>> or like this ("throttled"):
>>
>> pgbench -c 32 -j 8 -T 86400 -R 5000 -l --aggregate-interval=1 pgbench
>>
>> The reason for the throttling is that people generally don't run
>> production databases 100% saturated, so it'd be sad to improve the
>> 100% saturated case and hurt the common case by increasing latency.
>
> Sure.
>
>> The machine does ~8000 tps, so 5000 tps is ~60% of that.
>
> Ok.
>
> I would have suggested using the --latency-limit option to filter out
> very slow queries, otherwise if the system is stuck it may catch up
> later, but then this is not representative of "sustainable" performance.
>
> When pgbench is running under a target rate, in both runs the
> transaction distribution is expected to be the same, around 5000 tps,
> and the green run looks pretty ok with respect to that. The magenta one
> shows that about 25% of the time, things are not good at all, and the
> higher figures just show the catching up, which is not really
> interesting if you asked for a web page and it is finally delivered 1
> minutes later.

Maybe. But that'd only increase the stress on the system, possibly 
causing more issues, no? And the magenta line is the old code, thus it 
would only increase the improvement of the new code.

Notice the max latency is in microseconds (as logged by pgbench), so 
according to the "max latency" charts the latencies are below 10 seconds 
(old) and 1 second (new) about 99% of the time. So I don't think this 
would make any measurable difference in practice.


regards


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>> Is it possible to run tests with distinct table spaces on those many disks?
>
> Nope, that'd require reconfiguring the system (and then back), and I don't 
> have access to that system (just SSH).

Ok.

> Also, I don't quite see what would that tell us?

Currently the flushing context is shared across tablespaces, but I think 
that it should be per tablespace. My tests did not manage to convince 
Andres, so getting some more figures would be great. That will be for 
another time!

>> I would have suggested using the --latency-limit option to filter out
>> very slow queries, otherwise if the system is stuck it may catch up
>> later, but then this is not representative of "sustainable" performance.
>> 
>> When pgbench is running under a target rate, in both runs the
>> transaction distribution is expected to be the same, around 5000 tps,
>> and the green run looks pretty ok with respect to that. The magenta one
>> shows that about 25% of the time, things are not good at all, and the
>> higher figures just show the catching up, which is not really
>> interesting if you asked for a web page and it is finally delivered 1
>> minutes later.
>
> Maybe. But that'd only increase the stress on the system, possibly causing 
> more issues, no? And the magenta line is the old code, thus it would only 
> increase the improvement of the new code.

Yes and no. I agree that it stresses the system a little more, but the 
fact that you get 5000 tps in the end does not show that you can really 
sustain 5000 tps with reasonable latency. I find this latter information 
more interesting than knowing that you can get 5000 tps on average, 
thanks to some catching up. Moreover, the non-throttled runs already showed 
that the system can do 8000 tps, so the bandwidth is already there.

> Notice the max latency is in microseconds (as logged by pgbench), so 
> according to the "max latency" charts the latencies are below 10 seconds 
> (old) and 1 second (new) about 99% of the time.

AFAICS, the max latency is aggregated by second, but then it does not say 
much about the distribution of individual latencies in the interval, that 
is whether they were all close to the max or not. Having the same chart 
with the median or average might help. Also, the percentiles on the stddev 
chart do not correspond to those on the latency one, so it may be that the 
latency is high but the stddev is low, i.e. all transactions are equally 
bad in the interval, or not.
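
For instance, a per-second median latency chart could be derived from a raw
per-transaction log along these lines (a sketch only; the column positions -
epoch seconds in the fifth field, latency in microseconds in the third - are
assumptions):

    # sketch: per-second median latency (ms) from a per-transaction pgbench log
    import sys
    from collections import defaultdict
    from statistics import median

    by_second = defaultdict(list)
    with open(sys.argv[1]) as log:
        for line in log:
            cols = line.split()
            by_second[int(cols[4])].append(int(cols[2]) / 1000.0)

    for second in sorted(by_second):
        print(second, median(by_second[second]))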

So I must admit that I'm not clear at all how to interpret the max latency 
& stddev charts you provided.

> So I don't think this would make any measurable difference in practice.

I think that it may show that 25% of the time the system could not match 
the target tps, even if it can handle much more on average, so the tps 
achieved when discarding late transactions would be under 4000 tps.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Tomas Vondra
Date:
Hi,

On 03/17/2016 10:14 PM, Fabien COELHO wrote:
>
...
>>> I would have suggested using the --latency-limit option to filter out
>>> very slow queries, otherwise if the system is stuck it may catch up
>>> later, but then this is not representative of "sustainable" performance.
>>>
>>> When pgbench is running under a target rate, in both runs the
>>> transaction distribution is expected to be the same, around 5000 tps,
>>> and the green run looks pretty ok with respect to that. The magenta one
>>> shows that about 25% of the time, things are not good at all, and the
>>> higher figures just show the catching up, which is not really
>>> interesting if you asked for a web page and it is finally delivered 1
>>> minutes later.
>>
>> Maybe. But that'd only increase the stress on the system, possibly
>> causing more issues, no? And the magenta line is the old code, thus it
>> would only increase the improvement of the new code.
>
> Yes and no. I agree that it stresses the system a little more, but
> the fact that you have 5000 tps in the end does not show that you can
> really sustain 5000 tps with reasonnable latency. I find this later
> information more interesting than knowing that you can get 5000 tps
> on average, thanks to some catching up. Moreover the non throttled
> runs already shown that the system could do 8000 tps, so the
> bandwidth is already  there.

Sure, but thanks to the tps charts we *do know* that for the vast majority 
of the intervals (each second) the number of completed transactions is 
very close to 5000. And that wouldn't be possible if a large part of the 
latencies were close to the maximums.

With 5000 tps and 32 clients, that means the average latency should be 
less than 6ms, otherwise the clients couldn't make ~160 tps each. But we 
do see that the maximum latency for most intervals is way higher. Only 
~10% of the intervals have max latency below 10ms, for example.
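
(As a sanity check on the arithmetic: 32 clients / 5000 tps = 6.4 ms of 
latency budget per transaction, i.e. roughly 156 tps per client.)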

>
>> Notice the max latency is in microseconds (as logged by pgbench),
>> so according to the "max latency" charts the latencies are below
>> 10 seconds (old) and 1 second (new) about 99% of the time.
>
> AFAICS, the max latency is aggregated by second, but then it does
> not say much about the distribution of individuals latencies in the
> interval, that is whether they were all close to the max or not,
> Having the same chart with median or average might help. Also, with
> the stddev chart, the percent do not correspond with the latency one,
> so it may be that the latency is high but the stddev is low, i.e. all
> transactions are equally bad on the interval, or not.
>
> So I must admit that I'm not clear at all how to interpret the max
> latency & stddev charts you provided.

You're right that those charts describe the aggregated per-interval 
metrics rather than the distribution of the latencies themselves. And it's 
not particularly simple to deduce information about the source statistics, 
for example because all the intervals have the same "weight" although the 
number of transactions that completed in each interval may differ.

But I do think it's a very useful tool when it comes to measuring the 
consistency of behavior over time, assuming you're asking questions 
about the intervals and not the original transactions.

For example, had there been intervals with vastly different transaction 
rates, we'd see that on the tps charts (i.e. the chart would be much 
more gradual or wobbly, just like the "unpatched" one). Or if there were 
intervals with much higher variance of latencies, we'd see that on the 
STDDEV chart.

I'll consider repeating the benchmark and logging some reasonable sample 
of transactions - for the 24h run the unthrottled benchmark did ~670M 
transactions. Assuming ~30 bytes per line, that's ~20GB, so a 5% sample 
should be ~1GB of data, which I think is enough.
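
The thinning can be done by pgbench itself (--sampling-rate) or after the 
fact; for an existing log, something along these lines would do (a sketch):

    # sketch: keep a ~5% random sample of an existing per-transaction log
    import random
    import sys

    random.seed(0)   # fixed seed, for reproducibility
    with open(sys.argv[1]) as log, open(sys.argv[2], "w") as out:
        for line in log:
            if random.random() < 0.05:
                out.write(line)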

But of course, that's useful for answering questions about distribution 
of the individual latencies in global, not about consistency over time.

>
>> So I don't think this would make any measurable difference in practice.
>
> I think that it may show that 25% of the time the system could not
> match the target tps, even if it can handle much more on average, so
> the tps achieved when discarding late transactions would be under
> 4000 tps.

You mean the 'throttled-tps' chart? Yes, that one shows that without the 
patches there are a lot of intervals where the tps was much lower - 
presumably due to a lot of slow transactions.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Tomas,

> But I do think it's a very useful tool when it comes to measuring the 
> consistency of behavior over time, assuming you're asking questions 
> about the intervals and not the original transactions.

For a throttled run, I think it is better to check whether or not the 
system could handle the load "as expected", i.e. with reasonable latency. 
So I'm mostly interested in the "original transactions" as scheduled by 
the client and whether they were processed efficiently, but that then has 
to be aggregated by interval to get some statistics.

> For example, had there been intervals with vastly different transaction 
> rates, we'd see that on the tps charts (i.e. the chart would be much more 
> gradual or wobbly, just like the "unpatched" one). Or if there were intervals 
> with much higher variance of latencies, we'd see that on the STDDEV chart.

On HDDs what happens is that transactions are "blocked/frozen": the tps 
is very low and the latency very high, but with very few transactions 
(even 1 or 0 at a time) whose latencies are all very bad yet close to one 
another, so the resulting stddev may be quite small anyway.

> I'll consider repeating the benchmark and logging some reasonable sample of 
> transactions

Beware that this measure is skewed, because on HDDs when the system is 
stuck, it is stuck on very few transactions which are waiting, but they 
would seldom show up in the statistics as there are very few of them. That 
is why I'm interested in those that could not make it, hence my interest 
in the --latency-limit option, which reports just that.

>>> So I don't think this would make any measurable difference in practice.
>> 
>> I think that it may show that 25% of the time the system could not
>> match the target tps, even if it can handle much more on average, so
>> the tps achieved when discarding late transactions would be under
>> 4000 tps.
>
> You mean the 'throttled-tps' chart?

Yes.

> Yes, that one shows that without the patches, there's a lot of intervals 
> where the tps was much lower - presumably due to a lot of slow 
> transactions.

Yep. That is what is measured with the latency limit option, by counting 
the dropped transactions that were not processed in a timely manner.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Tomas Vondra
Date:
Hi,

I've repeated the tests, but this time logged details for 5% of the
transactions (instead of aggregating the data for each second). I've also
made the tests shorter - just 12 hours instead of 24, to reduce the time
needed to complete the benchmark.

Overall, this means ~300M transactions in total for the un-throttled
case, so a sample of ~15M transactions was available when computing the
following charts.

I've used the same commits as during the previous testing, i.e. a298a1e0
(before patches) and 23a27b03 (with patches).

One interesting difference is that while the "patched" version resulted
in slightly better performance (8122 vs. 8000 tps), the "unpatched"
version got considerably slower (6790 vs. 7725 tps) - that's ~13%
difference, so not negligible. I'm not sure what the cause is - the
configuration was exactly the same, there's nothing in the log, and the
machine was dedicated to the testing. The only explanation I have is
that the unpatched code is a bit more unstable when it comes to this
type of stress testing.

The results (including scripts for generating the charts) are here:

     https://github.com/tvondra/flushing-benchmark-2

Attached are three charts - again, those are using CDF to illustrate the
distributions and compare them easily:

1) regular-latency.png

The two curves intersect at ~4ms, where both CDFs reach ~85%. For the
shorter transactions, the old code is slightly faster (i.e. apparently
there's some per-transaction overhead). For higher latencies though, the
patched code is clearly winning - there are far fewer transactions over
6ms, which makes a huge difference. (Notice the x-axis is log-scale, so
the tail of the old code is actually much longer than it might appear.)

2) throttled-latency.png

In the throttled case (i.e. when the system is not 100% utilized, so
it's more representative of actual production use), the difference is
quite clearly in favor of the new code.

3) throttled-schedule-lag.png

Mostly just an alternative view on the previous chart, showing how much
later the transactions were scheduled. Again, the new code is winning.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
Hello Tomas,

Thanks again for these interesting benches.

> Overall, this means ~300M transactions in total for the un-throttled case, so 
> sample with ~15M transactions available when computing the following charts.

Still a very sizable run!

> There results (including scripts for generating the charts) are here:
>
>    https://github.com/tvondra/flushing-benchmark-2

This repository seems empty.

> 1) regular-latency.png

I'm wondering whether it would be clearer if the percentiles were 
relative to the largest sample, not to each run itself, so that the figures 
from the largest one would still be between 0 and 1, but the other 
(unpatched) one would go between 0 and 0.85, i.e. would be cut short 
proportionally to the actual performance.

> The two curves intersect at ~4ms, where both CDF reach ~85%. For the 
> shorter transactions, the old code is slightly faster (i.e. apparently 
> there's some per-transaction overhead).

I'm not sure how meaningful the crossing is, because the two curves do not 
reflect the same performance. I think that they might not cross at all if 
the normalization used the same reference, i.e. the better run.

> 2) throttled-latency.png
>
> In the throttled case (i.e. when the system is not 100% utilized, so it's 
> more representative of actual production use), the difference is quite 
> clearly in favor of the new code.

Indeed, it is a no brainer.

> 3) throttled-schedule-lag.png
>
> Mostly just an alternative view on the previous chart, showing how much later 
> the transactions were scheduled. Again, the new code is winning.

No brainer again. I infer from this figure that with the initial version 
60% of transactions have trouble being processed on time, while this is 
maybe about 35% with the new version.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Tomas Vondra
Date:
Hi,

On 03/22/2016 07:35 AM, Fabien COELHO wrote:
>
> Hello Tomas,
>
> Thanks again for these interesting benches.
>
>> Overall, this means ~300M transactions in total for the un-throttled
>> case, so sample with ~15M transactions available when computing the
>> following charts.
>
> Still a very sizable run!
>
>> There results (including scripts for generating the charts) are here:
>>
>>    https://github.com/tvondra/flushing-benchmark-2
>
> This repository seems empty.

Strange. Apparently I forgot to push, or maybe it did not complete 
before I closed the terminal. Anyway, pushing now (it'll take a bit more 
time to complete).

>
>> 1) regular-latency.png
>
> I'm wondering whether it would be clearer if the percentiles where
> relative to the largest sample, not to itself, so that the figures
> from the largest one would still be between 0 and 1, but the other
> (unpatched) one would go between 0 and 0.85, that is would be cut
> short proportionnaly to the actual performance.
>

I'm not sure what you mean by 'relative to largest sample'?

>> The two curves intersect at ~4ms, where both CDF reach ~85%. For
>> the shorter transactions, the old code is slightly faster (i.e.
>> apparently there's some per-transaction overhead).
>
> I'm not sure how meaningfull is the crossing, because both curves do
> not reflect the same performance. I think that they may not cross at
> all if the normalization is with the same reference, i.e. the better
> run.

Well, I think the curves illustrate exactly the performance difference, 
because with the old code the percentiles after p=0.85 get much higher. 
Which is the point of the crossing, although I agree the exact point 
does not have a particular meaning.

>> 2) throttled-latency.png
>>
>> In the throttled case (i.e. when the system is not 100% utilized,
>> so it's more representative of actual production use), the
>> difference is quite clearly in favor of the new code.
>
> Indeed, it is a no brainer.

Yep.

>
>> 3) throttled-schedule-lag.png
>>
>> Mostly just an alternative view on the previous chart, showing how
>> much later the transactions were scheduled. Again, the new code is
>> winning.
>
> No brainer again. I infer from this figure that with the initial
> version 60% of transactions have trouble being processed on time,
> while this is maybe about 35% with the new version.

Yep.

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: checkpointer continuous flushing

From
Andres Freund
Date:
Hi,

On 2016-03-21 18:46:58 +0100, Tomas Vondra wrote:
> I've repeated the tests, but this time logged details for 5% of the
> transaction (instead of aggregating the data for each second). I've also
> made the tests shorter - just 12 hours instead of 24, to reduce the time
> needed to complete the benchmark.
> 
> Overall, this means ~300M transactions in total for the un-throttled case,
> so sample with ~15M transactions available when computing the following
> charts.
> 
> I've used the same commits as during the previous testing, i.e. a298a1e0
> (before patches) and 23a27b03 (with patches).
> 
> One interesting difference is that while the "patched" version resulted in
> slightly better performance (8122 vs. 8000 tps), the "unpatched" version got
> considerably slower (6790 vs. 7725 tps) - that's ~13% difference, so not
> negligible. Not sure what's the cause - the configuration was exactly the
> same, there's nothing in the log and the machine was dedicated to the
> testing. The only explanation I have is that the unpatched code is a bit
> more unstable when it comes to this type of stress testing.
> 
> There results (including scripts for generating the charts) are here:
> 
>     https://github.com/tvondra/flushing-benchmark-2
> 
> Attached are three charts - again, those are using CDF to illustrate the
> distributions and compare them easily:
> 
> 1) regular-latency.png
> 
> The two curves intersect at ~4ms, where both CDF reach ~85%. For the shorter
> transactions, the old code is slightly faster (i.e. apparently there's some
> per-transaction overhead). For higher latencies though, the patched code is
> clearly winning - there are far fewer transactions over 6ms, which makes a
> huge difference. (Notice the x-axis is actually log-scale, so the tail on
> the old code is actually much longer than it might appear.)
> 
> 2) throttled-latency.png
> 
> In the throttled case (i.e. when the system is not 100% utilized, so it's
> more representative of actual production use), the difference is quite
> clearly in favor of the new code.
> 
> 3) throttled-schedule-lag.png
> 
> Mostly just an alternative view on the previous chart, showing how much
> later the transactions were scheduled. Again, the new code is winning.

Thanks for running these tests!

I think this shows that we're in good shape, and that the commits
succeeded in what they were attempting. Very glad to hear that.


WRT tablespaces: What I'm planning to do, unless somebody has a better
proposal, is to basically rent two big amazon instances, and run pgbench
in parallel over N tablespaces. Once with local SSD and once with local
HDD storage.

Greetings,

Andres Freund



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:

>>> 1) regular-latency.png
>> 
>> I'm wondering whether it would be clearer if the percentiles where
>> relative to the largest sample, not to itself, so that the figures
>> from the largest one would still be between 0 and 1, but the other
>> (unpatched) one would go between 0 and 0.85, that is would be cut
>> short proportionnaly to the actual performance.
>
> I'm not sure what you mean by 'relative to largest sample'?

You took 5% of the tx from two 12-hour runs, totaling say 85M tx on one 
and 100M tx on the other, so you get 4.25M tx from the first and 5M from 
the second.

I'm saying that the percentile should be computed against the largest one 
(5M), so that you get a curve like the following, with both curves having 
the same transaction density on the y-axis, so the second one does not go 
up to the top, reflecting that in this case fewer transactions were 
processed.
  A
  +    ____----- # up to 100%
  |   /  ___---- # cut short
  |   | /
  |   | |
  | _/ /
  |/__/
  +------------->
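
In code, the proposed normalization amounts to something like this rough
sketch (toy numbers and illustrative names only):

    # rank each run's sorted latencies against the size of the *larger* sample,
    # so the run that completed fewer transactions is "cut short" on the y-axis
    def normalized_curve(latencies, reference_count):
        xs = sorted(latencies)
        return [(x, (i + 1) / reference_count) for i, x in enumerate(xs)]

    patched = [1.2, 1.3, 1.5, 1.8, 2.5]   # toy latency samples (ms)
    unpatched = [1.1, 1.4, 2.0, 9.0]      # fewer transactions completed
    ref = max(len(patched), len(unpatched))

    print(normalized_curve(patched, ref))     # reaches 1.0
    print(normalized_curve(unpatched, ref))   # tops out at 0.8 here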

-- 
Fabien.



Re: checkpointer continuous flushing

From
Tomas Vondra
Date:
Hi,

On 03/22/2016 10:44 AM, Fabien COELHO wrote:
>
>
>>>> 1) regular-latency.png
>>>
>>> I'm wondering whether it would be clearer if the percentiles
>>> where relative to the largest sample, not to itself, so that the
>>> figures from the largest one would still be between 0 and 1, but
>>> the other (unpatched) one would go between 0 and 0.85, that is
>>> would be cut short proportionnaly to the actual performance.
>>
>> I'm not sure what you mean by 'relative to largest sample'?
>
> You took 5% of the tx on two 12 hours runs, totaling say 85M tx on
> one and 100M tx on the other, so you get 4.25M tx from the first and
> 5M from the second.

OK

> I'm saying that the percentile should be computed on the largest one
> (5M), so that you get a curve like the following, with both curve
> having the same transaction density on the y axis, so the second one
> does not go up to the top, reflecting that in this case less
> transactions where processed.

Huh, that seems weird. That's not how percentiles or CDFs work, and I 
don't quite understand what that would tell us.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
> WRT tablespaces: What I'm planning to do, unless somebody has a better
> proposal, is to basically rent two big amazon instances, and run pgbench
> in parallel over N tablespaces. Once with local SSD and once with local
> HDD storage.

Ok.

Not sure how to ensure that tablespaces actually end up on distinct 
dedicated disks with VMs, but that is the idea.

To emphasize potential bad effects without having to build too large a 
host and involve too many tablespaces, I would suggest significantly 
reducing the "checkpoint_flush_after" setting while running these 
tests.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-03-22 10:48:20 +0100, Tomas Vondra wrote:
> Hi,
> 
> On 03/22/2016 10:44 AM, Fabien COELHO wrote:
> >
> >
> >>>>1) regular-latency.png
> >>>
> >>>I'm wondering whether it would be clearer if the percentiles
> >>>where relative to the largest sample, not to itself, so that the
> >>>figures from the largest one would still be between 0 and 1, but
> >>>the other (unpatched) one would go between 0 and 0.85, that is
> >>>would be cut short proportionnaly to the actual performance.
> >>
> >>I'm not sure what you mean by 'relative to largest sample'?
> >
> >You took 5% of the tx on two 12 hours runs, totaling say 85M tx on
> >one and 100M tx on the other, so you get 4.25M tx from the first and
> >5M from the second.
> 
> OK
> 
> >I'm saying that the percentile should be computed on the largest one
> >(5M), so that you get a curve like the following, with both curve
> >having the same transaction density on the y axis, so the second one
> >does not go up to the top, reflecting that in this case less
> >transactions where processed.
> 
> Huh, that seems weird. That's not how percentiles or CDFs work, and I don't
> quite understand what would that tell us.

My impression is that we actually know what we need to know anyway?



Re: checkpointer continuous flushing

From
Andres Freund
Date:
On 2016-03-22 10:52:55 +0100, Fabien COELHO wrote:
> To emphasize potential bad effects without having to build too large a host
> and involve too many table spaces, I would suggest to reduce significantly
> the "checkpoint_flush_after" setting while running these tests.

Meh, that completely distorts the test.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>> You took 5% of the tx on two 12 hours runs, totaling say 85M tx on
>> one and 100M tx on the other, so you get 4.25M tx from the first and
>> 5M from the second.
>
> OK
>
>> I'm saying that the percentile should be computed on the largest one
>> (5M), so that you get a curve like the following, with both curve
>> having the same transaction density on the y axis, so the second one
>> does not go up to the top, reflecting that in this case less
>> transactions where processed.
>
> Huh, that seems weird. That's not how percentiles or CDFs work, and I don't 
> quite understand what would that tell us.

It would tell us, for a given transaction number (in the latency-ordered 
list), whether its latency is above or below that of the other run.

I think it would probably show that the latency is always better for the 
patched version, by getting rid of the crossing, which has no meaning and 
wrongly seems to suggest that in some cases the other version is better. 
As the y-axes of the two curves are not in the same unit (not the same 
transaction density), the crossing is just an illusion caused by a 
misplaced normalization.

So I'm basically saying that the y-axis should be just the transaction 
number, not a percentage.

Anyway, these are just details; your figures show that the patch is a very 
significant win on SSDs, so all is well!

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
> My impression is that we actually know what we need to know anyway?

Sure, the overall summary is "it is much better with the patch" on this 
large SSD test, which is good news because the patch was really designed 
to help with HDDs.

-- 
Fabien.



Re: checkpointer continuous flushing

From
Fabien COELHO
Date:
>> To emphasize potential bad effects without having to build too large a host
>> and involve too many table spaces, I would suggest to reduce significantly
>> the "checkpoint_flush_after" setting while running these tests.
>
> Meh, that completely distorts the test.

Yep, I agree.

The point would be to show whether there is a significant impact, or not, 
with less hardware & cost involved in the test.

Now if you can put 16 disks with 16 tablespaces and 16 buffers per 
bucket, that is good, fine with me! I'm just trying to point out that you 
could probably get comparable relative results with 4 disks, 4 tablespaces 
and 4 buffers per bucket, so it is an alternative and less expensive 
testing strategy.

This just shows that I usually work on a tight (negligible?) budget :-)

-- 
Fabien.