Re: double writes using "double-write buffer" approach [WIP] - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: double writes using "double-write buffer" approach [WIP]
Msg-id: CA+TgmoaMZTOveaZ+3DraYNcBzsWHdEkhsnbLRmfXAyriWyBqsQ@mail.gmail.com
In response to: Re: double writes using "double-write buffer" approach [WIP] (Dan Scales <scales@vmware.com>)
List: pgsql-hackers
On Sun, Feb 5, 2012 at 4:17 PM, Dan Scales <scales@vmware.com> wrote:
> Thanks for the detailed followup.  I do see how Postgres is tuned for
> having a bunch of memory available that is not in shared_buffers, both
> for the OS buffer cache and other memory allocations.  However, Postgres
> seems to run fine in many "large shared_memory" configurations that I
> gave performance numbers for, including 5G shared_buffers for an 8G
> machine, 3G shared_buffers for a 6G machine, etc.  There just has to be
> sufficient extra memory beyond the shared_buffers cache.

I agree that you could probably set shared_buffers to 3GB on a 6GB
machine and get decent performance - but would it be the optimal
performance, and for what workload?  To really figure out whether this
patch is a win, you need to get the system optimally tuned for the
unpatched sources (which we can't tell whether you've done, since you
haven't posted the configuration settings or any comparative figures
for different settings, or any details on which commit you tested
against) and then get the system optimally tuned for the patched
sources with double_writes=on, and then see whether there's a gain.

> I think the pgbench run is pointing out a problem that this double_writes
> implementation has with BULK_WRITEs.  As you point out, the
> BufferAccessStrategy for BULK_WRITEs will cause lots of dirty evictions.

Bulk reads will have the same problem.  Consider loading a bunch of
data into a new table with COPY and then scanning it.  The table scan
will be a "bulk read", and every page will be dirtied as hint bits are
set.  Another thing to worry about is vacuum, which also
uses a BufferAccessStrategy.  Greg Smith has done some previous
benchmarking showing that when the kernel is too aggressive about
flushing dirty data to disk, vacuum becomes painfully slow.  I suspect
this patch is going to have that problem in spades (but it would be
good to test that).  Checkpoints might be a problem, too, since they
flush a lot of dirty data, and that's going to require a lot of extra
fsyncing with this implementation.  It certainly seems that unless you
have pg_xlog and the data directory on separate devices, each with a
battery-backed write cache, checkpoints might be really slow.  I'm not entirely
convinced they'll be fast even if you have all that (but it would be
good to test that, too).
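
To make that cost concrete, here is a rough standalone sketch (plain C,
not PostgreSQL code; every name in it is invented for illustration) of
what each dirty-buffer flush turns into once a double-write file is
involved.  The actual patch presumably batches pages and fsyncs the
double-write buffer once per batch rather than per page, but the
ordering constraint (the copy must be durable before the in-place
write) is what drives the extra write and fsync traffic:

/*
 * Sketch only: the extra I/O a double-write scheme pays on every
 * dirty-buffer flush.  Not PostgreSQL code; all names are made up.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 8192

/* Flush one dirty page with torn-page protection via a double-write file. */
static int
flush_with_double_write(int dw_fd, int data_fd, const char *page, off_t offset)
{
    /* 1. Write the full page image to the double-write file. */
    if (pwrite(dw_fd, page, PAGE_SIZE, 0) != PAGE_SIZE)
        return -1;

    /*
     * 2. fsync the double-write file before touching the data file, so a
     * crash in the middle of step 3 can be repaired from this copy.
     */
    if (fsync(dw_fd) != 0)
        return -1;

    /* 3. Only now is the in-place write of the data page safe. */
    if (pwrite(data_fd, page, PAGE_SIZE, offset) != PAGE_SIZE)
        return -1;

    return 0;
}

int
main(void)
{
    char    page[PAGE_SIZE];
    int     dw_fd = open("dw_buffer.tmp", O_RDWR | O_CREAT, 0600);
    int     data_fd = open("datafile.tmp", O_RDWR | O_CREAT, 0600);
    int     i;

    if (dw_fd < 0 || data_fd < 0)
    {
        perror("open");
        return 1;
    }

    memset(page, 'x', sizeof(page));

    /*
     * Simulate a ring-buffer scan or checkpoint evicting 100 dirty pages:
     * under full_page_writes the data-file side of each eviction is a
     * single write(); here each one also costs a double-write-file write
     * plus an fsync.
     */
    for (i = 0; i < 100; i++)
    {
        if (flush_with_double_write(dw_fd, data_fd, page,
                                    (off_t) i * PAGE_SIZE) != 0)
        {
            perror("flush_with_double_write");
            return 1;
        }
    }

    close(dw_fd);
    close(data_fd);
    return 0;
}

Multiply that pattern by every dirty eviction out of a bulk-read or
vacuum ring, and by every page a checkpoint flushes, and the extra
fsync load is the thing that needs measuring.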

> I'm not sure if there is a great solution that always works for that
> issue.  However, I do notice that BULK_WRITE data isn't WAL-logged unless
> archiving/replication is happening.  As I understand it, if the
> BULK_WRITE data isn't being WAL-logged, then it doesn't have to be
> double-written either.  The BULK_WRITE data is not officially synced and
> committed until it is all written, so there doesn't have to be any
> torn-page protection for that data, which is why the WAL logging can be
> omitted.  The double-write implementation can be improved by marking each
> buffer if it doesn't need torn-page protection.  These buffers would be
> those new pages that are explicitly not WAL-logged, even when
> full_page_writes is enabled.  When such a buffer is eventually synced
> (perhaps because of an eviction), it would not be double-written.  This
> would often avoid double-writes for BULK_WRITE, etc., especially since
> the administrator is often not archiving or doing replication when doing
> bulk loads.

I agree - this optimization seems like a must.  I'm not sure that it's
sufficient, but it certainly seems necessary.  It's not going to help
with VACUUM, though, so I think that case needs some careful looking
at to determine how bad the regression is and what can be done to
mitigate it.  In particular, I note that I suggested an idea that
might help in the final paragraph of my last email.
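
As a reference point for that optimization, here is a minimal sketch of
the idea (again plain C with invented names and types, not the actual
patch): track, per buffer, whether the page's contents were ever
WAL-logged, and let the flush path skip the double-write when they were
not:

/*
 * Sketch only: skip double writes for pages that never needed torn-page
 * protection.  Not the actual patch; all names are made up.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct BufferSketch
{
    int     block_num;
    bool    is_dirty;

    /*
     * Cleared when the page was filled by a bulk, non-WAL-logged path
     * (for example COPY into a table created in the same transaction
     * with wal_level = minimal).  Such pages only become durable via
     * the relation-level sync at the end of the load, so a torn page
     * in the meantime cannot corrupt anything that counts.
     */
    bool    needs_torn_page_protection;
} BufferSketch;

/* Decide how to flush a dirty buffer when double writes are enabled. */
static void
flush_buffer(BufferSketch *buf)
{
    if (!buf->is_dirty)
        return;

    if (buf->needs_torn_page_protection)
    {
        /* Normal path: page image goes through the double-write buffer
         * (write plus fsync there) before the in-place write. */
        printf("block %d: double-write, then write in place\n",
               buf->block_num);
    }
    else
    {
        /* Proposed optimization: no WAL record depends on this page,
         * so write it in place directly. */
        printf("block %d: write in place only\n", buf->block_num);
    }

    buf->is_dirty = false;
}

int
main(void)
{
    BufferSketch    wal_logged = {1, true, true};
    BufferSketch    bulk_loaded = {2, true, false};

    flush_buffer(&wal_logged);
    flush_buffer(&bulk_loaded);
    return 0;
}

The interesting questions are where that flag gets cleared and whether
it stays correct once archiving or replication forces the bulk path to
be WAL-logged after all; vacuum, as noted above, gets no help from it.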

My general feeling about this patch is that it needs a lot more work
before we should consider committing it.  Your tests so far overlook
quite a few important problem cases (bulk loads, SELECT on large
unhinted tables, vacuum speed, checkpoint duration, and others) and
still mostly show it losing to full_page_writes, sometimes by large
margins.  Even in the one case where you got an 8% speedup, it's not
really clear that the same speedup (or an even bigger one) couldn't
have been gotten by some other kind of tuning.  I think you really
need to spend some more time thinking about how to blunt the negative
impact on the cases where it hurts, and increase the benefit in the
cases where it helps.  The approach seems to have potential, but it
seems way too immature to think about shipping it at this point.  (You
may have been thinking along similar lines since I note that the patch
is marked "WIP".)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

