Thread: double writes using "double-write buffer" approach [WIP]
I've been prototyping the double-write buffer idea that Heikki and Simon had proposed (as an alternative to a previous patch that only batched up writes by the checkpointer). I think it is a good idea, and can help double-writes perform better in the case of lots of backend evictions. It also centralizes most of the code change in smgr.c. However, it is trickier to reason about.

The idea is that all page writes are generally copied to a double-write buffer, rather than being immediately written. Note that a full copy of the page is required, but it can be folded in with a checksum calculation. Periodically (e.g. every time a certain-size batch of writes has been added), some writes are pushed out using double writes -- the pages are first written and fsynced to a double-write file, then written to the data files, which are then fsynced. Double writes then allow for fixing torn pages, so full_page_writes can be turned off (thus greatly reducing the size of the WAL log).

The key changes are conceptually simple:

1. In smgrwrite(), copy the page to the double-write buffer. If a big enough batch has accumulated, then flush the batch using double writes. [I don't think I need to intercept calls to smgrextend(), but I am not totally sure.]

2. In smgrread(), always look first in the double-write buffer for a particular page, before going to disk.

3. At the end of a checkpoint and on shutdown, always make sure that the current contents of the double-write buffer are flushed.

4. Pass flags around in some cases to indicate whether a page buffer needs a double write or not. (I think eventually this would be an attribute of the buffer, set when the page is WAL-logged, rather than a flag passed around.)

5. Deal with duplicates in the double-write buffer appropriately (this very rarely happens).

To get good performance, I needed to have two double-write buffers, one for the checkpointer and one for all other processes. The double-write buffers are circular buffers.
The checkpointer double-write buffer is just a single batch of 64 pages; the non-checkpointer double-write buffer is 128 pages, 2 batches of 64 pages each. Each batch goes to a different double-write file, so that they can be issued independently as soon as each batch is completed. Also, I need to sort the buffers being checkpointed by file/offset (see ioseq.c), so that the checkpointer batches will most likely only have to write and fsync one data file.

Interestingly, I find that the plot of tpm for DBT2 is much smoother (though it still has wiggles) with double writes enabled, since there are no unpredictable long fsyncs at the end of (or during) a checkpoint.

Here are performance numbers for the double-write buffer (same configs as previous numbers), for 2-processor, 60-minute, 50-warehouse DBT2. On the right is the size of shared_buffers and the size of the RAM in the virtual machine. FPW stands for full_page_writes, DW for double_writes. 'two disk' means the WAL log is on a separate ext3 filesystem from the data files.

             FPW off   FPW on   DW on, FPW off
one disk:      15488    13146    11713          [5G buffers, 8G VM]
two disk:      18833    16703    18013

one disk:      12908    11159     9758          [3G buffers, 6G VM]
two disk:      14258    12694    11229

one disk:      10829     9865     5806          [1G buffers, 8G VM]
two disk:      13605    12694     5682

one disk:       6752     6129     4878          [1G buffers, 2G VM]
two disk:       7253     6677     5239

The performance of DW in the small cache cases (1G shared_buffers) is now much better, though still not as good as FPW on. In the medium cache case (3G buffers), where there are significant backend dirty evictions, the performance of DW is close to that of FPW on. In the large cache (5G buffers), where the checkpointer can do all the work and there are minimal dirty evictions, DW is much better than FPW in the two disk case. In the one disk case, it is somewhat worse than FPW.
However, interestingly, if you just move the double-write files to a separate ext3 filesystem on the same disk as the data files, the performance goes to 13107 -- now on par with FPW on. We are obviously getting hit by the ext3 fsync slowness issues. (I believe that an fsync on a filesystem can stall on other unrelated writes to the same filesystem.)

Let me know if you have any thoughts/comments, etc. The patch is enclosed, and the README.doublewrites is updated a fair bit.

Thanks,

Dan
On Fri, Jan 27, 2012 at 5:31 PM, Dan Scales <scales@vmware.com> wrote:
> I've been prototyping the double-write buffer idea that Heikki and Simon
> had proposed (as an alternative to a previous patch that only batched up
> writes by the checkpointer). I think it is a good idea, and can help
> double-writes perform better in the case of lots of backend evictions.
> It also centralizes most of the code change in smgr.c. However, it is
> trickier to reason about.

This doesn't compile on MacOS X, because there's no writev().

I don't understand how you can possibly get away with such small buffers. AIUI, you must retain every page in the double-write buffer until it's been written and fsync'd to disk. That means the most dirty data you'll ever be able to have in the operating system cache with this implementation is (128 + 64) * 8kB = 1.5MB. Granted, we currently have occasional problems with the OS caching too *much* dirty data, but that seems like it's going way, way too far in the opposite direction. That's barely enough for the system to do any write reordering at all. I am particularly worried about what happens when a ring buffer is in use.

I tried running "pgbench -i -s 10" with this patch applied, full_page_writes=off, double_writes=on. It took 41.2 seconds to complete. The same test with the stock code takes 14.3 seconds; and the actual situation is worse for double-writes than those numbers might imply, because the index build time doesn't seem to be much affected, while the COPY takes a small eternity with the patch compared to the usual way of doing things. I think the slowdown on COPY once the double-write buffer fills is on the order of 10x.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Robert,

Thanks for the feedback! I think you make a good point about the small size of dirty data in the OS cache. I think what you can say about this double-write patch is that it will not work well for configurations that have a small Postgres cache and a large OS cache, since every write from the Postgres cache requires double-writes and an fsync. However, it should work much better for configurations with a much larger Postgres cache and relatively smaller OS cache (including the configurations that I've given performance results for). In that case, there is a lot more capacity for dirty pages in the Postgres cache, and you won't have nearly as many dirty evictions. The checkpointer is doing a good number of the writes, and this patch sorts the checkpointer's buffers so its IO is efficient.

Of course, I can also increase the size of the non-checkpointer ring buffer to be much larger, though I wouldn't want to make it too large, since it is consuming memory. If I increase the size of the ring buffers significantly, I will probably need to add some data structures so that the ring buffer lookups in smgrread() and smgrwrite() are more efficient.

Can you let me know what the shared_buffers and RAM sizes were for your pgbench run? I can try running the same workload. If the size of shared_buffers is especially small compared to RAM, then we should increase the size of shared_buffers when using double_writes.

Thanks,

Dan

----- Original Message -----
From: "Robert Haas" <robertmhaas@gmail.com>
To: "Dan Scales" <scales@vmware.com>
Cc: "PG Hackers" <pgsql-hackers@postgresql.org>
Sent: Thursday, February 2, 2012 7:19:47 AM
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]
On Fri, Feb 3, 2012 at 3:14 PM, Dan Scales <scales@vmware.com> wrote:
> Thanks for the feedback! I think you make a good point about the small
> size of dirty data in the OS cache. I think what you can say about this
> double-write patch is that it will not work well for configurations that
> have a small Postgres cache and a large OS cache, since every write from
> the Postgres cache requires double-writes and an fsync.

The general guidance for setting shared_buffers these days is 25% of RAM up to a maximum of 8GB, so the configuration that you're describing as not optimal for this patch is the one normally used when running PostgreSQL. I've run across several cases where larger values of shared_buffers are a huge win, because the entire working set can then be accommodated in shared_buffers. But it's certainly not the case that all working sets fit.

And in this case, I think that's beside the point anyway. I had shared_buffers set to 8GB on a machine with much more memory than that, but the database created by pgbench -i -s 10 is about 156 MB, so the problem isn't that there is too little PostgreSQL cache available. The entire database fits in shared_buffers, with most of it left over. However, because of the BufferAccessStrategy stuff, pages start to get forced out to the OS pretty quickly. Of course, we could disable the BufferAccessStrategy stuff when double_writes is in use, but bear in mind that the reason we have it in the first place is to prevent cache thrashing effects. It would be imprudent of us to throw that out the window without replacing it with something else that would provide similar protection. And even if we did, that would just delay the day of reckoning. You'd be able to blast through and dirty the entirety of shared_buffers at top speed, but then as soon as you started replacing pages performance would slow to an utter crawl, just as it did here, only you'd need a bigger scale factor to trigger the problem.
The more general point here is that there are MANY aspects of PostgreSQL's design that assume that shared_buffers accounts for a relatively small percentage of system memory. Here's another one: we assume that backends that need temporary memory for sorts and hashes (i.e. work_mem) can just allocate it from the OS. If we were to start recommending setting shared_buffers to large percentages of the available memory, we'd probably have to rethink that. Most likely, we'd need some kind of in-core mechanism for allocating temporary memory from the shared memory segment.

And here's yet another one: we assume that it is better to recycle old WAL files and overwrite the contents rather than create new, empty ones, because we assume that the pages from the old files may still be present in the OS cache. We also rely on the fact that an evicted CLOG page can be pulled back in quickly without (in most cases) a disk access. We also rely on shared_buffers not being too large to avoid walloping the I/O controller too hard at checkpoint time - which is forcing some people to set shared_buffers much smaller than would otherwise be ideal.

In other words, even if setting shared_buffers to most of the available system memory would fix the problem I mentioned, it would create a whole bunch of new ones, many of them non-trivial. It may be a good idea to think about what we'd need to do to work efficiently in that sort of configuration, but there is going to be a very large amount of thinking, testing, and engineering that has to be done to make it a reality.

There's another issue here, too. The idea that we're going to write data to the double-write buffer only when we decide to evict the pages strikes me as a bad one.
We ought to proactively start dumping pages to the double-write area as soon as they're dirtied, and fsync them after every N pages, so that by the time we need to evict some page that requires a double-write, it's already durably on disk in the double-write buffer, and we can do the real write without having to wait.

It's likely that, to make this perform acceptably for bulk loads, you'll need the writes to the double-write buffer and the fsyncs of that buffer to be done by separate processes, so that one backend (the background writer, perhaps) can continue spooling additional pages to the double-write files while some other process (a new auxiliary process?) fsyncs the ones that are already full. Along with that, the page replacement algorithm probably needs to be adjusted to avoid evicting pages that need an as-yet-unfinished double-write like the plague, even to the extent of allowing the BufferAccessStrategy rings to grow if the double-writes can't be finished before the ring wraps around.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thanks for the detailed followup. I do see how Postgres is tuned for having a bunch of memory available that is not in shared_buffers, both for the OS buffer cache and other memory allocations. However, Postgres seems to run fine in many "large shared_memory" configurations that I gave performance numbers for, including 5G shared_buffers for an 8G machine, 3G shared_buffers for a 6G machine, etc. There just has to be sufficient extra memory beyond the shared_buffers cache.

I think the pgbench run is pointing out a problem that this double_writes implementation has with BULK_WRITEs. As you point out, the BufferAccessStrategy for BULK_WRITEs will cause lots of dirty evictions. I'm not sure if there is a great solution that always works for that issue. However, I do notice that BULK_WRITE data isn't WAL-logged unless archiving/replication is happening. As I understand it, if the BULK_WRITE data isn't being WAL-logged, then it doesn't have to be double-written either. The BULK_WRITE data is not officially synced and committed until it is all written, so there doesn't have to be any torn-page protection for that data, which is why the WAL logging can be omitted.

The double-write implementation can be improved by marking each buffer if it doesn't need torn-page protection. These buffers would be those new pages that are explicitly not WAL-logged, even when full_page_writes is enabled. When such a buffer is eventually synced (perhaps because of an eviction), it would not be double-written. This would often avoid double-writes for BULK_WRITE, etc., especially since the administrator is often not archiving or doing replication when doing bulk loads.

However, overall, I think the idea is that double writes are an optional optimization. The user would only turn it on in existing configurations where it helps or only slightly hurts performance, and where greatly reducing the size of the WAL logs is beneficial.
It might also be especially beneficial when there is a small amount of FLASH or other kind of fast storage that the double-write files can be stored on.

Thanks,

Dan

----- Original Message -----
From: "Robert Haas" <robertmhaas@gmail.com>
To: "Dan Scales" <scales@vmware.com>
Cc: "PG Hackers" <pgsql-hackers@postgresql.org>
Sent: Friday, February 3, 2012 1:48:54 PM
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]
On Sun, Feb 5, 2012 at 4:17 PM, Dan Scales <scales@vmware.com> wrote:
> Thanks for the detailed followup. I do see how Postgres is tuned for
> having a bunch of memory available that is not in shared_buffers, both
> for the OS buffer cache and other memory allocations. However, Postgres
> seems to run fine in many "large shared_memory" configurations that I
> gave performance numbers for, including 5G shared_buffers for an 8G
> machine, 3G shared_buffers for a 6G machine, etc. There just has to be
> sufficient extra memory beyond the shared_buffers cache.

I agree that you could probably set shared_buffers to 3GB on a 6GB machine and get decent performance - but would it be the optimal performance, and for what workload? To really figure out whether this patch is a win, you need to get the system optimally tuned for the unpatched sources (which we can't tell whether you've done, since you haven't posted the configuration settings or any comparative figures for different settings, or any details on which commit you tested against) and then get the system optimally tuned for the patched sources with double_writes=on, and then see whether there's a gain.

> I think the pgbench run is pointing out a problem that this double_writes
> implementation has with BULK_WRITEs. As you point out, the
> BufferAccessStrategy for BULK_WRITEs will cause lots of dirty evictions.

Bulk reads will have the same problem. Consider loading a bunch of data into a new table with COPY, and then scanning the table. The table scan will be a "bulk read" and every page will be dirtied setting hint bits. Another thing to worry about is vacuum, which also uses a BufferAccessStrategy. Greg Smith has done some previous benchmarking showing that when the kernel is too aggressive about flushing dirty data to disk, vacuum becomes painfully slow. I suspect this patch is going to have that problem in spades (but it would be good to test that).
Checkpoints might be a problem, too, since they flush a lot of dirty data, and that's going to require a lot of extra fsyncing with this implementation. It certainly seems that unless you have pg_xlog and the data separated and a battery-backed write cache for each, checkpoints might be really slow. I'm not entirely convinced they'll be fast even if you have all that (but it would be good to test that, too).

> I'm not sure if there is a great solution that always works for that
> issue. However, I do notice that BULK_WRITE data isn't WAL-logged unless
> archiving/replication is happening. As I understand it, if the
> BULK_WRITE data isn't being WAL-logged, then it doesn't have to be
> double-written either. The BULK_WRITE data is not officially synced and
> committed until it is all written, so there doesn't have to be any
> torn-page protection for that data, which is why the WAL logging can be
> omitted. The double-write implementation can be improved by marking each
> buffer if it doesn't need torn-page protection. These buffers would be
> those new pages that are explicitly not WAL-logged, even when
> full_page_writes is enabled. When such a buffer is eventually synced
> (perhaps because of an eviction), it would not be double-written. This
> would often avoid double-writes for BULK_WRITE, etc., especially since
> the administrator is often not archiving or doing replication when doing
> bulk loads.

I agree - this optimization seems like a must. I'm not sure that it's sufficient, but it certainly seems necessary. It's not going to help with VACUUM, though, so I think that case needs some careful looking at to determine how bad the regression is and what can be done to mitigate it. In particular, I note that I suggested an idea that might help in the final paragraph of my last email.

My general feeling about this patch is that it needs a lot more work before we should consider committing it.
Your tests so far overlook quite a few important problem cases (bulk loads, SELECT on large unhinted tables, vacuum speed, checkpoint duration, and others) and still mostly show it losing to full_page_writes, sometimes by large margins. Even in the one case where you got an 8% speedup, it's not really clear that the same speedup (or an even bigger one) couldn't have been gotten by some other kind of tuning. I think you really need to spend some more time thinking about how to blunt the negative impact on the cases where it hurts, and increase the benefit in the cases where it helps. The approach seems to have potential, but it seems way too immature to think about shipping it at this point. (You may have been thinking along similar lines, since I note that the patch is marked "WIP".)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jan 28, 2012 at 7:31 AM, Dan Scales <scales@vmware.com> wrote:
> Let me know if you have any thoughts/comments, etc. The patch is
> enclosed, and the README.doublewrites is updated a fair bit.

ISTM that the double-write can prevent torn pages in neither the double-write file nor the data file during *base backup*, because both the double-write file and the data file can be backed up while being written. Is this right? To avoid the torn-page problem, we should write FPI to WAL during online backup even if the double-write has been committed?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
I don't know a lot about base backup, but it sounds like full_page_writes must be turned on for base backup, in order to deal with the inconsistent reads of pages (which you might call torn pages) that can happen when you back up the data files while the database is running. The relevant parts of the WAL log are then copied separately (and consistently) once the backup of the data files is done, and used to "recover" the database into a consistent state later.

So, yes, good point -- double writes cannot replace the functionality of full_page_writes for base backup. If double writes were in use, they might be automatically switched over to full page writes for the duration of the base backup. And the double-write file should not be part of the base backup.

Dan

----- Original Message -----
From: "Fujii Masao" <masao.fujii@gmail.com>
To: "Dan Scales" <scales@vmware.com>
Cc: "PG Hackers" <pgsql-hackers@postgresql.org>
Sent: Monday, February 6, 2012 3:08:15 AM
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]
>> I think it is a good idea, and can help double-writes perform better in the case of lots of backend evictions. I don'tunderstand this point, because from the data in your mail, it appears that when shared buffers are less means when moreevictions can happen, the performance is less. ISTM that the performance is less incase shared buffers size is less because I/O might happen by the backend process which can degrade performance. Is there any problem if the double-write happens only by bgwriter or checkpoint. Something like whenever backend process has to evict the buffer, it will do same as you have described that write in a double-writebuffer, but bgwriter will check this double-buffer and flush from it. Also whenever any backend will see that the double buffer is more than 2/3rd or some threshhold value full it will tell bgwriterto flush from double-write buffer. This can ensure very less I/O by any backend. -----Original Message----- From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Dan Scales Sent: Saturday, January 28, 2012 4:02 AM To: PG Hackers Subject: [HACKERS] double writes using "double-write buffer" approach [WIP] I've been prototyping the double-write buffer idea that Heikki and Simon had proposed (as an alternative to a previous patchthat only batched up writes by the checkpointer). I think it is a good idea, and can help double-writes perform betterin the case of lots of backend evictions. It also centralizes most of the code change in smgr.c. However, it is trickier to reason about. The idea is that all page writes generally are copied to a double-write buffer, rather than being immediately written. Notethat a full copy of the page is required, but can folded in with a checksum calculation. Periodically (e.g. 
every time a certain-size batch of writes have been added), some writes are pushed out using double writes-- the pages are first written and fsynced to a double-write file, then written to the data files, which are then fsynced. Then double writes allow for fixing torn pages, so full_page_writes can be turned off (thus greatly reducing thesize of the WAL log). The key changes are conceptually simple: 1. In smgrwrite(), copy the page to the double-write buffer. If a big enough batch has accumulated, then flush the batchusing double writes. [I don't think I need to intercept calls to smgrextend(), but I am not totally sure.] 2. In smgrread(), always look first in the double-write buffer for a particular page, before going to disk. 3. At the end of a checkpoint and on shutdown, always make sure that the current contents of the double-write buffer areflushed. 4. Pass flags around in some cases to indicate whether a page buffer needs a double write or not. (I think eventuallythis would be an attribute of the buffer, set when the page is WAL-logged, rather than a flag passed around.) 5. Deal with duplicates in the double-write buffer appropriately (very rarely happens). To get good performance, I needed to have two double-write buffers, one for the checkpointer and one for all other processes. The double-write buffers are circular buffers. The checkpointer double-write buffer is just a single batch of64 pages; the non-checkpointer double-write buffer is 128 pages, 2 batches of 64 pages each. Each batch goes to a differentdouble-write file, so that they can be issued independently as soon as each batch is completed. Also, I need tosort the buffers being checkpointed by file/offset (see ioseq.c), so that the checkpointer batches will most likely onlyhave to write and fsync one data file. 
Interestingly, I find that the plot of tpm for DBT2 is much smoother (though it still has wiggles) with double writes enabled, since there are no unpredictable long fsyncs at the end of (or during) a checkpoint.

Here are performance numbers for the double-write buffer (same configs as previous numbers), for a 2-processor, 60-minute, 50-warehouse DBT2 run. On the right are the size of shared_buffers and the size of the RAM in the virtual machine. FPW stands for full_page_writes, DW for double_writes. 'two disk' means the WAL log is on a separate ext3 filesystem from the data files.

             FPW off   FPW on   DW on, FPW off
one disk:     15488    13146       11713         [5G buffers, 8G VM]
two disk:     18833    16703       18013

one disk:     12908    11159        9758         [3G buffers, 6G VM]
two disk:     14258    12694       11229

one disk:     10829     9865        5806         [1G buffers, 8G VM]
two disk:     13605    12694        5682

one disk:      6752     6129        4878         [1G buffers, 2G VM]
two disk:      7253     6677        5239

The performance of DW in the small-cache cases (1G shared_buffers) is now much better, though still not as good as with FPW on. In the medium-cache case (3G buffers), where there are significant backend dirty evictions, the performance of DW is close to that of FPW on. In the large-cache case (5G buffers), where the checkpointer can do all the work and there are minimal dirty evictions, DW is much better than FPW in the two-disk case.

In the one-disk case, it is somewhat worse than FPW. However, interestingly, if you just move the double-write files to a separate ext3 filesystem on the same disk as the data files, the performance goes to 13107 -- now on par with FPW on. We are obviously getting hit by the ext3 fsync slowness issues. (I believe that an fsync on a filesystem can stall on other unrelated writes to the same filesystem.)

Let me know if you have any thoughts/comments, etc. The patch is enclosed, and README.doublewrites is updated a fair bit.

Thanks,

Dan
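The read-side interception (step 2 in Dan's list) can be pictured as a small lookup that smgrread() tries before going to disk: scan the in-memory double-write buffer for a newer copy of the requested page. The structure and names below are hypothetical; with only 128 slots, a linear scan is cheap:

```c
/* Hypothetical sketch of checking the double-write buffer before a disk read. */
#include <assert.h>
#include <string.h>

#define DW_SLOTS 128
#define BLCKSZ 8192

typedef struct DwSlot
{
    int  valid;                /* slot holds a page not yet on disk */
    int  relfile;
    long offset;
    char page[BLCKSZ];
} DwSlot;

static DwSlot dw_ring[DW_SLOTS];   /* the circular double-write buffer */

/* Return 1 and copy the buffered page into dest if present; return 0 so the
 * caller falls through to a normal disk read otherwise. */
int
dw_lookup(int relfile, long offset, char *dest)
{
    for (int i = 0; i < DW_SLOTS; i++)
    {
        if (dw_ring[i].valid &&
            dw_ring[i].relfile == relfile &&
            dw_ring[i].offset == offset)
        {
            memcpy(dest, dw_ring[i].page, BLCKSZ);
            return 1;
        }
    }
    return 0;
}
```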
On 02/07/2012 12:09 AM, Dan Scales wrote:
> So, yes, good point -- double writes cannot replace the functionality of full_page_writes for base backup. If double writes were in use, they might be automatically switched over to full page writes for the duration of the base backup. And the double-write file should not be part of the base backup.

There is already a check for this sort of problem during the base backup. It forces full_page_writes on for the backup, even if the running configuration has it off. So long as double writes can be smoothly turned off and back on again, that same section of code can easily be made to handle this, too.

As for not making the double-write file part of the base backup, I was assuming that it would go into a subdirectory under pg_xlog by default. I would think that people who relocate pg_xlog using one of the methods for doing that would want the double-write buffer to move as well. And if it's inside pg_xlog, existing base backup scripts won't need to be changed -- the correct ones already exclude pg_xlog files.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
> Is there any problem if the double-write happens only in the bgwriter or checkpointer?
> Something like: whenever a backend process has to evict a buffer, it writes into the double-write buffer as you have described, but
> the bgwriter checks this buffer and flushes from it.
> Also, whenever any backend sees that the double-write buffer is more than 2/3rds full (or some threshold value), it tells the bgwriter to flush
> the double-write buffer.
> This can ensure very little I/O by any backend.

Yes, I think this is a good idea. I could make changes so that the backends hand off the responsibility of flushing batches of the double-write buffer to the bgwriter whenever possible. This would avoid some long I/O waits in the backends, though the backends may of course eventually wait anyway for the bgwriter if I/O is not fast enough. I did write the code so that any process can write a completed batch if that batch is not currently being flushed (so as to deal with crashes by backends). Having the backends flush the batches as they fill them up was just simpler for a first prototype.

Dan

----- Original Message -----
From: "Amit Kapila" <amit.kapila@huawei.com>
To: "Dan Scales" <scales@vmware.com>, "PG Hackers" <pgsql-hackers@postgresql.org>
Sent: Tuesday, February 7, 2012 1:08:49 AM
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]
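The handoff policy discussed above -- a backend appends its evicted page to the shared double-write buffer and, once the buffer passes a fill threshold (2/3rds, per Amit's suggestion), asks the bgwriter to flush instead of doing the I/O itself -- reduces to a simple check. The names, counters, and in-process "signal" below are stand-ins for what would really live in shared memory with a latch or signal:

```c
/* Sketch of threshold-triggered bgwriter wakeup; all names hypothetical. */
#include <assert.h>

#define DW_SLOTS 128
#define DW_WAKE_THRESHOLD ((DW_SLOTS * 2) / 3)   /* 85 of 128 slots */

static int dw_fill;             /* occupied slots (would be in shared memory) */
static int bgwriter_requests;   /* stand-in for waking the bgwriter */

/* Called by a backend after it copies an evicted page into the buffer.
 * Returns 1 if the bgwriter was asked to flush a batch. */
int
dw_append_and_maybe_wake(void)
{
    dw_fill++;
    if (dw_fill >= DW_WAKE_THRESHOLD)
    {
        bgwriter_requests++;    /* real code would signal/SetLatch here */
        dw_fill = 0;            /* pretend the bgwriter drained the batch */
        return 1;
    }
    return 0;
}
```

Dan's point about backend crashes still applies: any process must be able to flush a completed batch if the bgwriter has not gotten to it.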
Dan,

I believe your double-write buffer approach is right, as it has the potential to avoid the latency that backends incur from full-page writes after a checkpoint. Although there is a chance that overall I/O will be higher in this case, if we can make sure that in most scenarios a backend never has to do I/O, it can show a performance improvement compared to full-page writes as well.

-----Original Message-----
From: Dan Scales [mailto:scales@vmware.com]
Sent: Thursday, February 09, 2012 5:30 AM
To: Amit Kapila
Cc: PG Hackers
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers