Re: Page Checksums + Double Writes - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: Page Checksums + Double Writes
Msg-id: CA+Tgmobo8o-_r1Vdc6kWxRSPWCwpjbquB2ww4epi0dNzQPTFwQ@mail.gmail.com
In response to: Re: Page Checksums + Double Writes ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Responses: Re: Page Checksums + Double Writes (Tom Lane <tgl@sss.pgh.pa.us>)
           Re: Page Checksums + Double Writes (Jeff Janes <jeff.janes@gmail.com>)
List: pgsql-hackers
On Fri, Dec 23, 2011 at 11:14 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Thoughts?

Those are good thoughts.

Here's another random idea, which might be completely nuts.  Maybe we
could consider some kind of summarization of CLOG data, based on the
idea that most transactions commit.  We introduce the idea of a CLOG
rollup page.  On a CLOG rollup page, each bit represents the status of
N consecutive XIDs.  If the bit is set, that means all XIDs in that
group are known to have committed.  If it's clear, then we don't know,
and must fall through to a regular CLOG lookup.
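
To make that concrete, here's a minimal sketch of the addressing
arithmetic, assuming N = 1024 and 8K rollup pages.  All of the names
here are invented for illustration, not existing PostgreSQL APIs:

#include <stdbool.h>
#include <stdint.h>

#define XIDS_PER_ROLLUP_BIT   1024
#define ROLLUP_PAGE_BYTES     8192
#define BITS_PER_ROLLUP_PAGE  (ROLLUP_PAGE_BYTES * 8)

/* Which rollup page covers a given XID's group. */
static inline uint32_t
xid_to_rollup_page(uint32_t xid)
{
    return (xid / XIDS_PER_ROLLUP_BIT) / BITS_PER_ROLLUP_PAGE;
}

/* Test the rollup bit for an XID's group on an already-loaded page. */
static inline bool
rollup_bit_is_set(const unsigned char *page, uint32_t xid)
{
    uint32_t bitno = (xid / XIDS_PER_ROLLUP_BIT) % BITS_PER_ROLLUP_PAGE;

    return (page[bitno / 8] >> (bitno % 8)) & 1;
}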

If you let N = 1024, then 8K of CLOG rollup data is enough to
represent the status of 64 million transactions, which means that just
a couple of pages could cover as much of the XID space as you probably
need to care about.  Also, you would need to replace CLOG summary
pages in memory only very infrequently.  Backends could test the bit
without any lock.  If it's set, they do pg_read_barrier(), and then
check the buffer label to make sure it's still the summary page they
were expecting.  If so, no CLOG lookup is needed.  If the page has
changed under us or the bit is clear, then we fall through to a
regular CLOG lookup.
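
The lock-free fast path might look roughly like this.
pg_read_barrier() is real, but I'm standing in a C11 acquire fence for
it here, and everything else (the buffer layout, the names) is
hypothetical:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define XIDS_PER_ROLLUP_BIT   1024
#define BITS_PER_ROLLUP_PAGE  (8192 * 8)

typedef struct RollupBuffer
{
    _Atomic uint32_t page_label;   /* which rollup page this buffer holds */
    unsigned char    bits[8192];   /* the rollup bits themselves */
} RollupBuffer;

/*
 * True only if the rollup bit proves every XID in this XID's group
 * committed; false means "don't know", fall through to regular CLOG.
 */
static bool
xid_group_known_committed(RollupBuffer *buf, uint32_t xid)
{
    uint32_t group  = xid / XIDS_PER_ROLLUP_BIT;
    uint32_t pageno = group / BITS_PER_ROLLUP_PAGE;
    uint32_t bitno  = group % BITS_PER_ROLLUP_PAGE;

    if (!((buf->bits[bitno / 8] >> (bitno % 8)) & 1))
        return false;              /* bit clear: no help here */

    /* pg_read_barrier(), in PostgreSQL terms. */
    atomic_thread_fence(memory_order_acquire);

    /* If the label still matches, the bit we read belonged to our page. */
    return atomic_load_explicit(&buf->page_label,
                                memory_order_relaxed) == pageno;
}

The barrier keeps the bit read from drifting past the label re-check,
so a buffer replacement racing with us can only send us down the slow
path, never hand us a stale "committed" answer for the wrong page.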

An obvious problem is that, if the abort rate is significantly
different from zero, and especially if the aborts are randomly mixed
in with commits rather than clustered together in small portions of
the XID space, the CLOG rollup data would become useless.  On the
other hand, if you're doing 10k tps, a group of 1024 consecutive XIDs
spans only about a tenth of a second, so any window that long where
everything commits is enough to start getting some benefit, which
doesn't seem like a stretch.

Perhaps the CLOG rollup data wouldn't even need to be kept on disk.
We could simply have bgwriter (or bghinter) set the rollup bits in
shared memory for new transactions, as it becomes possible to do so,
and let lookups for XIDs prior to the last shutdown fall through to
CLOG.  Or, if that's not appealing, we could reconstruct the data in
memory by groveling through the CLOG pages - or maybe just set summary
bits only for CLOG pages that actually get faulted in.
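
For that last variant, a summarizer over a faulted-in CLOG page could
look something like this.  CLOG really does store two status bits per
XID, with 0x01 meaning committed, but the function and constants are
just a sketch:

#include <stdbool.h>
#include <stdint.h>

#define CLOG_XACTS_PER_BYTE   4      /* two status bits per XID */
#define CLOG_PAGE_BYTES       8192   /* so 32768 XIDs per CLOG page */
#define XIDS_PER_ROLLUP_BIT   1024   /* 32 rollup groups per CLOG page */

/*
 * Set rollup bits for every all-committed group of 1024 XIDs on one
 * CLOG page.  first_group is the rollup-bit index covering the page's
 * first XID.
 */
static void
summarize_clog_page(const unsigned char *clog_page,
                    unsigned char *rollup_bits,
                    uint32_t first_group)
{
    uint32_t xids_per_page = CLOG_PAGE_BYTES * CLOG_XACTS_PER_BYTE;

    for (uint32_t g = 0; g < xids_per_page / XIDS_PER_ROLLUP_BIT; g++)
    {
        bool all_committed = true;

        for (uint32_t i = 0; i < XIDS_PER_ROLLUP_BIT && all_committed; i++)
        {
            uint32_t n = g * XIDS_PER_ROLLUP_BIT + i;
            uint8_t  status = (clog_page[n / 4] >> ((n % 4) * 2)) & 0x03;

            all_committed = (status == 0x01);  /* TRANSACTION_STATUS_COMMITTED */
        }

        if (all_committed)
        {
            uint32_t bit = first_group + g;

            rollup_bits[bit / 8] |= (unsigned char) (1 << (bit % 8));
        }
    }
}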

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

