Thread: Page Checksums + Double Writes

Page Checksums + Double Writes

From
David Fetter
Date:
Folks,

One of the things VMware is working on is double writes, per previous
discussions of how, for example, InnoDB does things.   I'd initially
thought that introducing just one of the features in $Subject at a
time would help, but I'm starting to see a mutual dependency.

The issue is that double writes needs a checksum to work by itself,
and page checksums more broadly work better when there are double
writes, obviating the need to have full_page_writes on.

If submitting these things together seems like a better idea than
having them arrive separately, I'll work with my team here to make
that happen soonest.

There's a separate issue we'd like to get clear on, which is whether
it would be OK to make a new PG_PAGE_LAYOUT_VERSION.

If so, there's less to do, but pg_upgrade as it currently stands is
broken.

If not, we'll have to do some extra work on the patch as described
below.  Thanks to Kevin Grittner for coming up with this :)

- Use a header bit to say whether we've got a checksum on the page.
  We're using 3/16 of the available bits as described in
  src/include/storage/bufpage.h.

- When that bit is set, place the checksum somewhere convenient on the
  page.  One way to do this would be to have an optional field at the
  end of the special space based on the new bit.  Rows from pg_upgrade
  would have the bit clear, and would have the shorter special
  structure without the checksum.
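
For illustration, a minimal sketch of how such a check might look,
assuming a hypothetical PD_PAGE_CHECKSUM flag bit and the checksum
stored as the last 4 bytes of the page (i.e. appended to the special
space); names and placement here are illustrative, not from the patch:

#include "postgres.h"
#include "storage/bufpage.h"

#define PD_PAGE_CHECKSUM 0x0008         /* hypothetical pd_flags bit */

/* Does this page carry a checksum?  Pages from pg_upgrade will not. */
static bool
page_has_checksum(Page page)
{
    return (((PageHeader) page)->pd_flags & PD_PAGE_CHECKSUM) != 0;
}

/* Locate the optional checksum, assumed here to occupy the last 4 bytes
 * of the page, i.e. the end of the special space. */
static uint32 *
page_checksum_location(Page page)
{
    Assert(page_has_checksum(page));
    return (uint32 *) ((char *) page + BLCKSZ - sizeof(uint32));
}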
 

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Page Checksums + Double Writes

From
Alvaro Herrera
Date:
Excerpts from David Fetter's message of Wed Dec 21 18:59:13 -0300 2011:

> If not, we'll have to do some extra work on the patch as described
> below.  Thanks to Kevin Grittner for coming up with this :)
>
> - Use a header bit to say whether we've got a checksum on the page.
>   We're using 3/16 of the available bits as described in
>   src/include/storage/bufpage.h.
>
> - When that bit is set, place the checksum somewhere convenient on the
>   page.  One way to do this would be to have an optional field at the
>   end of the special space based on the new bit.  Rows from pg_upgrade
>   would have the bit clear, and would have the shorter special
>   structure without the checksum.

If you get away with a new page format, let's make sure and coordinate
so that we can add more info into the header.  One thing I wanted was to
have an ID struct on each file, so that you know what
DB/relation/segment the file corresponds to.  So the first page's
special space would be a bit larger than the others.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Page Checksums + Double Writes

From
"Kevin Grittner"
Date:
Alvaro Herrera <alvherre@commandprompt.com> wrote:
> If you get away with a new page format, let's make sure and
> coordinate so that we can add more info into the header.  One
> thing I wanted was to have an ID struct on each file, so that you
> know what DB/relation/segment the file corresponds to.  So the
> first page's special space would be a bit larger than the others.
Couldn't that also be done by burning a bit in the page header
flags, without a page layout version bump?  If that were done, you
wouldn't have the additional information on tables converted by
pg_upgrade, but you would get them on new tables, including those
created by pg_dump/psql conversions.  Adding them could even be made
conditional, although I don't know whether that's a good idea....
-Kevin


Re: Page Checksums + Double Writes

From
Simon Riggs
Date:
On Wed, Dec 21, 2011 at 10:19 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Alvaro Herrera <alvherre@commandprompt.com> wrote:
>
>> If you get away with a new page format, let's make sure and
>> coordinate so that we can add more info into the header.  One
>> thing I wanted was to have an ID struct on each file, so that you
>> know what DB/relation/segment the file corresponds to.  So the
>> first page's special space would be a bit larger than the others.
>
> Couldn't that also be done by burning a bit in the page header
> flags, without a page layout version bump?  If that were done, you
> wouldn't have the additional information on tables converted by
> pg_upgrade, but you would get them on new tables, including those
> created by pg_dump/psql conversions.  Adding them could even be made
> conditional, although I don't know whether that's a good idea....

These are good thoughts because they overcome the major objection to
doing *anything* here for 9.2.

We don't need to use any flag bits at all. We add
PG_PAGE_LAYOUT_VERSION to the control file, so that CRC checking
becomes an initdb option. All new pages can be created with
PG_PAGE_LAYOUT_VERSION from the control file. All existing pages must
be either the layout version from this release (4) or the next version
(5). Page validity then becomes version dependent.
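
As a rough sketch of what version-dependent validity checking could
mean in code (the layout-5 helper is hypothetical; this is not a patch):

#include "postgres.h"
#include "storage/bufpage.h"

extern bool page_header_is_valid_v5(PageHeader ph);   /* hypothetical layout-5 check */

static bool
page_header_is_valid_any_version(PageHeader ph)
{
    switch (PageGetPageLayoutVersion(ph))
    {
        case 4:
            return PageHeaderIsValid(ph);        /* today's check, unchanged */
        case 5:
            return page_header_is_valid_v5(ph);  /* new layout carrying the CRC */
        default:
            return false;                        /* unknown layout: reject */
    }
}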

pg_upgrade still works.

Layout 5 is where we add CRCs, so it's basically optional.

We can also have a utility that allows you to bump the page version
for all new pages, even after you've upgraded, so we may end up with a
mix of page layout versions in the same relation. That's more
questionable but I see no problem with it.

Do we need CRCs as a table level option? I hope not. That complicates
many things.

All of this allows us to have another more efficient page version (6)
in future without problems, so it's good infrastructure.

I'm now personally game on to make something work here for 9.2.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Page Checksums + Double Writes

From
Tom Lane
Date:
David Fetter <david@fetter.org> writes:
> There's a separate issue we'd like to get clear on, which is whether
> it would be OK to make a new PG_PAGE_LAYOUT_VERSION.

If you're not going to provide pg_upgrade support, I think there is no
chance of getting a new page layout accepted.  The people who might want
CRC support are pretty much exactly the same people who would find lack
of pg_upgrade a showstopper.

Now, given the hint bit issues, I rather doubt that you can make this
work without a page format change anyway.  So maybe you ought to just
bite the bullet and start working on the pg_upgrade problem, rather than
imagining you will find an end-run around it.

> The issue is that double writes needs a checksum to work by itself,
> and page checksums more broadly work better when there are double
> writes, obviating the need to have full_page_writes on.

Um.  So how is that going to work if checksums are optional?
        regards, tom lane


Re: Page Checksums + Double Writes

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> We don't need to use any flag bits at all. We add
> PG_PAGE_LAYOUT_VERSION to the control file, so that CRC checking
> becomes an initdb option. All new pages can be created with
> PG_PAGE_LAYOUT_VERSION from the control file. All existing pages must
> be either the layout version from this release (4) or the next version
> (5). Page validity then becomes version dependent.

> We can also have a utility that allows you to bump the page version
> for all new pages, even after you've upgraded, so we may end with a
> mix of page layout versions in the same relation. That's more
> questionable but I see no problem with it.

It seems like you've forgotten all of the previous discussion of how
we'd manage a page format version change.

Having two different page formats running around in the system at the
same time is far from free; in the worst case it means that every single
piece of code that touches pages has to know about and be prepared to
cope with both versions.  That's a rather daunting prospect, from a
coding perspective and even more from a testing perspective.  Maybe
the issues can be kept localized, but I've seen no analysis done of
what the impact would be or how we could minimize it.  I do know that
we considered the idea and mostly rejected it a year or two back.

A "utility to bump the page version" is equally a whole lot easier said
than done, given that the new version has more overhead space and thus
less payload space than the old.  What does it do when the old page is
too full to be converted?  "Move some data somewhere else" might be
workable for heap pages, but I'm less sanguine about rearranging indexes
like that.  At the very least it would imply that the utility has full
knowledge about every index type in the system.

> I'm now personally game on to make something work here for 9.2.

If we're going to freeze 9.2 in the spring, I think it's a bit late
for this sort of work to be just starting.  What you've just described
sounds to me like possibly a year's worth of work.
        regards, tom lane


Re: Page Checksums + Double Writes

From
Simon Riggs
Date:
On Wed, Dec 21, 2011 at 11:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> It seems like you've forgotten all of the previous discussion of how
> we'd manage a page format version change.

Maybe I've had too much caffeine. It's certainly late here.

> Having two different page formats running around in the system at the
> same time is far from free; in the worst case it means that every single
> piece of code that touches pages has to know about and be prepared to
> cope with both versions.  That's a rather daunting prospect, from a
> coding perspective and even more from a testing perspective.  Maybe
> the issues can be kept localized, but I've seen no analysis done of
> what the impact would be or how we could minimize it.  I do know that
> we considered the idea and mostly rejected it a year or two back.

I'm looking at that now.

My feeling is it probably depends upon how different the formats are,
so given we are discussing a 4 byte addition to the header, it might
be doable.

I'm investing some time on the required analysis.

> A "utility to bump the page version" is equally a whole lot easier said
> than done, given that the new version has more overhead space and thus
> less payload space than the old.  What does it do when the old page is
> too full to be converted?  "Move some data somewhere else" might be
> workable for heap pages, but I'm less sanguine about rearranging indexes
> like that.  At the very least it would imply that the utility has full
> knowledge about every index type in the system.

I agree, rewriting every page is completely out and I never even considered it.

>> I'm now personally game on to make something work here for 9.2.
>
> If we're going to freeze 9.2 in the spring, I think it's a bit late
> for this sort of work to be just starting.

I agree with that. If this goes adrift it will have to be killed for 9.2.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Page Checksums + Double Writes

From
Rob Wultsch
Date:
On Wed, Dec 21, 2011 at 1:59 PM, David Fetter <david@fetter.org> wrote:
> One of the things VMware is working on is double writes, per previous
> discussions of how, for example, InnoDB does things.

The world is moving to flash, and the lifetime of flash is measured in
writes. Potentially doubling the number of writes is potentially
halving the life of the flash.

Something to think about...

-- 
Rob Wultsch
wultsch@gmail.com


Re: Page Checksums + Double Writes

From
Robert Haas
Date:
On Wed, Dec 21, 2011 at 7:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> My feeling is it probably depends upon how different the formats are,
> so given we are discussing a 4 byte addition to the header, it might
> be doable.

I agree.  When thinking back on Zoltan's patches, it's worth
remembering that he had a number of pretty bad ideas mixed in with the
good stuff - such as taking a bunch of things that are written as
macros for speed, and converting them to function calls.  Also, he
didn't make any attempt to isolate the places that needed to know
about both page versions; everybody knew about everything, everywhere,
and so everything needed to branch in places where it had not needed
to do so before.  I don't think we should infer from the failure of
those patches that no one can do any better.

On the other hand, I also agree with Tom that the chances of getting
this done in time for 9.2 are virtually zero, assuming that (1) we
wish to ship 9.2 in 2012 and (2) we don't wish to be making
destabilizing changes beyond the end of the last CommitFest.  There is
a lot of work here, and I would be astonished if we could wrap it all
up in the next month.  Or even the next four months.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Page Checksums + Double Writes

From
David Fetter
Date:
On Wed, Dec 21, 2011 at 04:18:33PM -0800, Rob Wultsch wrote:
> On Wed, Dec 21, 2011 at 1:59 PM, David Fetter <david@fetter.org> wrote:
> > One of the things VMware is working on is double writes, per
> > previous discussions of how, for example, InnoDB does things.
> 
> The world is moving to flash, and the lifetime of flash is measured in
> writes.  Potentially doubling the number of writes is potentially
> halving the life of the flash.
> 
> Something to think about...

Modern flash drives let you have more write cycles than modern
spinning rust, so while yes, there is something happening, it's also
happening to spinning rust, too.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Page Checksums + Double Writes

From
Simon Riggs
Date:
On Thu, Dec 22, 2011 at 12:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

>> Having two different page formats running around in the system at the
>> same time is far from free; in the worst case it means that every single
>> piece of code that touches pages has to know about and be prepared to
>> cope with both versions.  That's a rather daunting prospect, from a
>> coding perspective and even more from a testing perspective.  Maybe
>> the issues can be kept localized, but I've seen no analysis done of
>> what the impact would be or how we could minimize it.  I do know that
>> we considered the idea and mostly rejected it a year or two back.
>
> I'm looking at that now.
>
> My feeling is it probably depends upon how different the formats are,
> so given we are discussing a 4 byte addition to the header, it might
> be doable.
>
> I'm investing some time on the required analysis.

We've assumed up to now that adding a CRC to the Page Header would add
4 bytes, i.e. that we would use a CRC-32 check field. This will change
the size of the header and thus break
pg_upgrade in a straightforward implementation. Breaking pg_upgrade is
not acceptable. We can get around this by making code dependent upon
page version, allowing mixed page versions in one executable. That
causes the PageGetItemId() macro to be page version dependent. After
review, altering the speed of PageGetItemId() is not acceptable either
(show me microbenchmarks if you doubt that). In a large minority of
cases the line pointer and the page header will be in separate cache
lines.

As Kevin points out, we have 13 bits spare on the pd_flags of
PageHeader, so we have a little wiggle room there. In addition to that,
I notice that the version part of pd_pagesize_version is 8 bits (the
page size is packed into the other 8 bits), yet we currently use just
one bit of that,
since version is 4. Version 3 was last seen in Postgres 8.2, now
de-supported.

Since we don't care too much about backwards compatibility with data
in Postgres 8.2 and below, we can just assume that all pages are
version 4, unless marked otherwise with additional flags. We then use
two separate bits in pd_flags to show PD_HAS_CRC (0x0008 and 0x8000).
We then completely replace the 16 bit version field with a 16-bit CRC
value, rather than a 32-bit value. Why two flag bits? If either CRC
bit is set we assume the page's CRC is supposed to be valid. This
ensures that a single bit error doesn't switch off CRC checking when
it was supposed to be active. I suggest we remove the page size data
completely; if we need to keep that we should mark 8192 bytes as the
default and set bits for 16kB and 32 kB respectively.
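
For concreteness, a sketch of the check (function names are invented,
the fletcher16() helper is sketched further down, and zeroing the
reused word before computing the checksum is just one possible
convention):

#include "postgres.h"
#include "storage/bufpage.h"
#include <string.h>

#define PD_HAS_CRC_LO  0x0008   /* "checksum present" bit */
#define PD_HAS_CRC_HI  0x8000   /* redundant copy, so a single flipped bit
                                 * cannot silently disable checking */

extern uint16 fletcher16(const unsigned char *buf, size_t len);

/* Is this page supposed to carry a checksum? */
static bool
page_expects_checksum(PageHeader ph)
{
    return (ph->pd_flags & (PD_HAS_CRC_LO | PD_HAS_CRC_HI)) != 0;
}

/* Verify the 16-bit checksum that would live in the old pd_pagesize_version word. */
static bool
page_checksum_ok(Page page)
{
    char        copy[BLCKSZ];
    PageHeader  ph = (PageHeader) copy;
    uint16      stored;

    /* Work on a copy so we never scribble on a shared buffer. */
    memcpy(copy, page, BLCKSZ);
    stored = ph->pd_pagesize_version;   /* word reused to hold the checksum */
    ph->pd_pagesize_version = 0;        /* checksum is computed with it zeroed */

    return fletcher16((unsigned char *) copy, BLCKSZ) == stored;
}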

With those changes, we are able to re-organise the page header so that
we can add a 16 bit checksum (CRC), yet retain the same size of
header. Thus, we don't need to change PageGetItemId(). We would
require changes to PageHeaderIsValid() and PageInit() only. Making
these changes means we are reducing the number of bits used to
validate the page header, though we are providing a much better way of
detecting page validity, so the change is of positive benefit.

Adding a CRC was a performance concern because of the hint bit
problem, so making the value 16 bits long gives performance where it
is needed. Note that we do now have a separation of bgwriter and
checkpointer, so we have more CPU bandwidth to address the problem.
Adding multiple bgwriters is also possible.

Notably, this proposal makes CRC checking optional, so if performance
is a concern it can be disabled completely.

Which CRC algorithm to choose?
"A study of error detection capabilities for random independent bit
errors and burst errors reveals that XOR, two's complement addition,
and Adler checksums are suboptimal for typical network use. Instead,
one's complement addition should be used for networks willing to
sacrifice error detection effectiveness to reduce compute cost,
Fletcher checksum for networks looking for a balance of error
detection and compute cost, and CRCs for networks willing to pay a
higher compute cost for significantly improved error detection."
Maxino, T.C. and Koopman, P.J., "The Effectiveness of Checksums for
Embedded Control Networks", IEEE Transactions on Dependable and Secure
Computing, Jan.-March 2009.
Available here: http://www.ece.cmu.edu/~koopman/pubs/maxino09_checksums.pdf

Based upon that paper, I suggest we use Fletcher-16. The overall
concept is not sensitive to the choice of checksum algorithm, however,
and the algorithm itself could be another option: F16 or CRC. My poor
understanding of the difference is that F16 is about 20 times cheaper
to calculate, at the expense of about 1000 times worse error detection
(but still pretty good).
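
For reference, a plain, unoptimized Fletcher-16 over a byte buffer
looks like this; a production version would defer the modulo
reductions and process larger chunks per pass:

#include "postgres.h"

uint16
fletcher16(const unsigned char *buf, size_t len)
{
    uint16 sum1 = 0;        /* running byte sum, reduced mod 255 */
    uint16 sum2 = 0;        /* sum of sums: makes the result order-sensitive */
    size_t i;

    for (i = 0; i < len; i++)
    {
        sum1 = (uint16) ((sum1 + buf[i]) % 255);
        sum2 = (uint16) ((sum2 + sum1) % 255);
    }
    return (uint16) ((sum2 << 8) | sum1);
}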

16 bit CRCs are not the strongest available, but still support
excellent error detection rates - better than 1 failure in a million,
possibly much better depending on which algorithm and block size.
That's easily good enough to detect our kind of errors.

This idea doesn't rule out the possibility of a 4 byte CRC-32 added in
the future, since we still have 11 bits spare for use as future page
version indicators. (If we did that, it is clear that we should add
the checksum as a *trailer* not as part of the header.)

So overall, I do now think it's still possible to add an optional
checksum in the 9.2 release and am willing to pursue it unless there
are technical objections.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Page Checksums + Double Writes

From
Heikki Linnakangas
Date:
On 22.12.2011 01:43, Tom Lane wrote:
> A "utility to bump the page version" is equally a whole lot easier said
> than done, given that the new version has more overhead space and thus
> less payload space than the old.  What does it do when the old page is
> too full to be converted?  "Move some data somewhere else" might be
> workable for heap pages, but I'm less sanguine about rearranging indexes
> like that.  At the very least it would imply that the utility has full
> knowledge about every index type in the system.

Remembering back the old discussions, my favorite scheme was to have an 
online pre-upgrade utility that runs on the old cluster, moving things 
around so that there is enough spare room on every page. It would do 
normal heap updates to make room on heap pages (possibly causing 
transient serialization failures, like all updates do), and split index 
pages to make room on them. Yes, it would need to know about all index 
types. And it would set a global variable to indicate that X bytes must 
be kept free on all future updates, too.

Once the pre-upgrade utility has scanned through the whole cluster, you 
can run pg_upgrade. After the upgrade, old page versions are converted 
to the new format as pages are read in. The conversion is
straightforward, as the pre-upgrade utility ensured that there is
enough spare room on
every page.
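
A conceptual sketch of that read-time conversion, assuming a
hypothetical layout 5 whose header is 4 bytes larger and relying on the
pre-upgrade utility's guarantee of free space (WAL-logging, locking and
initialization of the new field are ignored here):

#include "postgres.h"
#include "storage/bufpage.h"
#include <string.h>

#define HEADER_SIZE_V4  24      /* illustrative sizes, not actual constants */
#define HEADER_SIZE_V5  28

static void
convert_page_to_v5(Page page)
{
    PageHeader ph = (PageHeader) page;
    int        grow = HEADER_SIZE_V5 - HEADER_SIZE_V4;

    if (PageGetPageLayoutVersion(page) != 4)
        return;                              /* nothing to do */

    /* Guaranteed by the pre-upgrade utility. */
    Assert(ph->pd_upper - ph->pd_lower >= grow);

    /* Slide the line pointer array to make room for the bigger header.
     * Tuple data is addressed by absolute offsets and does not move. */
    memmove((char *) page + HEADER_SIZE_V5,
            (char *) page + HEADER_SIZE_V4,
            ph->pd_lower - HEADER_SIZE_V4);
    ph->pd_lower += grow;

    PageSetPageSizeAndVersion(page, BLCKSZ, 5);
    /* ...initialize the new header field(s) here... */
}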

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: Page Checksums + Double Writes

From
Florian Weimer
Date:
* David Fetter:

> The issue is that double writes needs a checksum to work by itself,
> and page checksums more broadly work better when there are double
> writes, obviating the need to have full_page_writes on.

How desirable is it to disable full_page_writes?  Doesn't it cut down
recovery time significantly because it avoids read-modify-write cycles
with a cold cache?

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: Page Checksums + Double Writes

From
Simon Riggs
Date:
On Thu, Dec 22, 2011 at 7:44 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 22.12.2011 01:43, Tom Lane wrote:
>>
>> A "utility to bump the page version" is equally a whole lot easier said
>> than done, given that the new version has more overhead space and thus
>> less payload space than the old.  What does it do when the old page is
>> too full to be converted?  "Move some data somewhere else" might be
>> workable for heap pages, but I'm less sanguine about rearranging indexes
>> like that.  At the very least it would imply that the utility has full
>> knowledge about every index type in the system.
>
>
> Remembering back the old discussions, my favorite scheme was to have an
> online pre-upgrade utility that runs on the old cluster, moving things
> around so that there is enough spare room on every page. It would do normal
> heap updates to make room on heap pages (possibly causing transient
> serialization failures, like all updates do), and split index pages to make
> room on them. Yes, it would need to know about all index types. And it would
> set a global variable to indicate that X bytes must be kept free on all
> future updates, too.
>
> Once the pre-upgrade utility has scanned through the whole cluster, you can
> run pg_upgrade. After the upgrade, old page versions are converted to new
> format as pages are read in. The conversion is straightforward, as the
> pre-upgrade utility ensured that there is enough spare room on every page.

That certainly works, but we're still faced with pg_upgrade rewriting
every page, which will take a significant amount of time, with no
backout plan or rollback facility. I don't like that at all, which is
why I think we need an online upgrade facility if we do have to alter
page headers.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Page Checksums + Double Writes

From
Simon Riggs
Date:
On Thu, Dec 22, 2011 at 8:42 AM, Florian Weimer <fweimer@bfk.de> wrote:
> * David Fetter:
>
>> The issue is that double writes needs a checksum to work by itself,
>> and page checksums more broadly work better when there are double
>> writes, obviating the need to have full_page_writes on.
>
> How desirable is it to disable full_page_writes?  Doesn't it cut down
> recovery time significantly because it avoids read-modify-write cycles
> with a cold cache?

It's way too late in the cycle to suggest removing full page writes or
the code behind them. We're looking to add protection, not swap out
existing protections.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Page Checksums + Double Writes

From
Jesper Krogh
Date:
On 2011-12-22 09:42, Florian Weimer wrote:
> * David Fetter:
>
>> The issue is that double writes needs a checksum to work by itself,
>> and page checksums more broadly work better when there are double
>> writes, obviating the need to have full_page_writes on.
> How desirable is it to disable full_page_writes?  Doesn't it cut down
> recovery time significantly because it avoids read-modify-write cycles
> with a cold cache
What are the downsides of having full_page_writes enabled, apart from
log volume? The manual mentions something about speed, but it is
a bit unclear where that would come from, since the full pages must
be somewhere in memory when being worked on anyway.

Anyway, I have an archive_command that looks like:
archive_command = 'test ! -f /data/wal/%f.gz && gzip --fast < %p > /data/wal/%f.gz'

It brings somewhere between a 50 and 75% reduction in log volume
with "no cost" on the production system (since gzip just occupies one
of the many cores on the system) and can easily keep up even during
quite heavy writes.

Recovery is a bit more tricky, because hooking gunzip into the command
there will make the system serialize the replay-log, gunzip, read-data,
replay-log cycle, whereas the gunzip could easily be done on the other
logfiles while replay is being done on one.

So a "straightforward" recovery will cost some recovery time, but that
can be dealt with.
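
The straightforward restore_command counterpart (paths assumed to match
the archive_command above) would be something like:

restore_command = 'gunzip < /data/wal/%f.gz > %p'

which is exactly that inline decompression during replay.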

Jesper
-- 
Jesper


Re: Page Checksums + Double Writes

From
"Kevin Grittner"
Date:
Simon Riggs  wrote:
> So overall, I do now think its still possible to add an optional
> checksum in the 9.2 release and am willing to pursue it unless
> there are technical objections.
Just to restate Simon's proposal, to make sure I'm understanding it,
we would support a new page header format number and the old one in
9.2, both to be the same size and carefully engineered to minimize
what code would need to be aware of the version.  PageHeaderIsValid()
and PageInit() certainly would, and we would need some way to set,
clear (maybe), and validate a CRC.  We would need a GUC to indicate
whether to write the CRC, and if present we would always test it on
read and treat it as a damaged page if it didn't match.  (Perhaps
other options could be added later, to support recovery attempts, but
let's not complicate a first cut.)  This whole idea would depend on
either (1) trusting your storage system never to tear a page on write
or (2) getting the double-write feature added, too.

I see some big advantages to this over what I suggested to David.
For starters, using a flag bit and putting the CRC somewhere other
than the page header would require that each AM deal with the CRC,
exposing some function(s) for that.  Simon's idea doesn't require
that.  I was also a bit concerned about shifting tuple images to
convert non-protected pages to protected pages.  No need to do that,
either.  With the bit flags, I think there might be some cases where
we would be unable to add a CRC to a converted page because space was
too tight; that's not an issue with Simon's proposal.

Heikki was talking about a pre-convert tool.  Neither approach really
needs that, although with Simon's approach it would be possible to
have a background *post*-conversion tool to add CRCs, if desired. 
Things would continue to function if it wasn't run; you just wouldn't
have CRC protection on pages not updated since pg_upgrade was run.

Simon, does it sound like I understand your proposal?

Now, on to the separate-but-related topic of double-write.  That
absolutely requires some form of checksum or CRC to detect torn
pages, in order for the technique to work at all.  Adding a CRC
without double-write would work fine if you have a storage stack
which prevents torn pages in the file system or hardware driver.  If
you don't have that, it could create a damaged page indication after
a hardware or OS crash, although I suspect that would be the
exception, not the typical case.  Given all that, and the fact that
it would be cleaner to deal with these as two separate patches, it
seems the CRC patch should go in first.  (And, if this is headed for
9.2, *very soon*, so there is time for the double-write patch to
follow.)

It seems to me that the full_page_writes GUC could become an
enumeration, with "off" having the current meaning, "wal" meaning
what "on" now does, and "double" meaning that the new double-write
technique would be used.  (It doesn't seem to make any sense to do
both at the same time.)  I don't think we need a separate GUC to tell
us *what* to protect against torn pages -- if not "off" we should
always protect the first write of a page after checkpoint, and if
"double" and write_page_crc (or whatever we call it) is "on", then we
protect hint-bit-only writes.  I think.  I can see room to argue that
with CRCs on we should do a full-page write to the WAL for a
hint-bit-only change, or that we should add another GUC to control
when we do this.
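
In postgresql.conf terms that might look something like this (these are
hypothetical settings to make the idea concrete, not existing GUCs):

full_page_writes = double      # off | wal (today's "on") | double
write_page_crc = on            # per the placeholder name above
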
I'm going to take a shot at writing a patch for background hinting
over the holidays, which I think has benefit alone but also boosts
the value of these patches, since it would reduce double-write
activity otherwise needed to prevent spurious errors when using CRCs.

This whole area has some overlap with spreading writes, I think.  The
double-write approach seems to count on writing a bunch of pages
(potentially from different disk files) sequentially to the
double-write buffer, fsyncing that, and then writing the actual pages
-- which must be fsynced before the related portion of the
double-write buffer can be reused.  The simple implementation would
be to simply fsync the files just written to if they required a prior
write to the double-write buffer, although fancier techniques could
be used to try to optimize that.  Again, setting hint bits before
the write when possible would help reduce the impact of that.
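
In rough C-style pseudocode, the ordering constraint is this (all
function names below are invented for illustration):

#include "postgres.h"
#include "storage/buf.h"

extern void dw_append_page(Buffer buf);
extern void dw_fsync(void);
extern void write_buffer_to_datafile(Buffer buf);
extern void fsync_datafiles_touched_by(Buffer *batch, int nbuffers);

static void
flush_batch_with_double_writes(Buffer *batch, int nbuffers)
{
    int i;

    /* 1. Append the page images sequentially to the double-write buffer. */
    for (i = 0; i < nbuffers; i++)
        dw_append_page(batch[i]);

    /* 2. Make the double-write buffer durable before touching the data files. */
    dw_fsync();

    /* 3. Now write the pages to their home locations. */
    for (i = 0; i < nbuffers; i++)
        write_buffer_to_datafile(batch[i]);

    /* 4. These data files must be fsync'd before this portion of the
     *    double-write buffer can be reused. */
    fsync_datafiles_touched_by(batch, nbuffers);
}
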
-Kevin


Re: Page Checksums + Double Writes

From
Jignesh Shah
Date:
On Thu, Dec 22, 2011 at 4:00 AM, Jesper Krogh <jesper@krogh.cc> wrote:
> On 2011-12-22 09:42, Florian Weimer wrote:
>>
>> * David Fetter:
>>
>>> The issue is that double writes needs a checksum to work by itself,
>>> and page checksums more broadly work better when there are double
>>> writes, obviating the need to have full_page_writes on.
>>
>> How desirable is it to disable full_page_writes?  Doesn't it cut down
>> recovery time significantly because it avoids read-modify-write cycles
>> with a cold cache
>
> What are the downsides of having full_page_writes enabled, apart from
> log volume? The manual mentions something about speed, but it is
> a bit unclear where that would come from, since the full pages must
> be somewhere in memory when being worked on anyway.
>


I thought I would share some of my perspective on this checksum +
doublewrite work from a performance point of view.

Currently, what I see in our tests based on dbt2, DVDStore, etc. is
that checksums do not impact scalability or total measured throughput.
They do increase CPU cycles, depending on the algorithm used, but not
by anything that really causes problems. The doublewrite change will
be the big win for performance compared to full_page_write.  For
example, compared to other databases our WAL traffic is one of the
highest, and most of it is attributed to full_page_write. The reason
full_page_write is necessary in production (at least without worrying
about replication impact) is that if a write fails, we can recover
that whole page from the WAL logs as-is and just put it back out there
(in fact I believe that's what recovery does). However, the net impact
during high OLTP load is that the runtime load on WAL is high due to
that traffic, and compared to other databases the utilization is
correspondingly high. This also has a huge impact on transaction
response time the first time a page is changed, which matters a great
deal in all OLTP environments because by nature the transactions are
all on random pages.

When we use Doublewrite with checksums, we can safely disable
full_page_write, causing a HUGE reduction in WAL traffic without
loss of reliability due to a write fault, since there are two writes
always. (Implementation detail discussable.) Since the double writes
themselves are sequential, bundling multiple such writes further
reduces the write time. The biggest improvement is that these writes
are not done during TRANSACTION COMMIT but during CHECKPOINT WRITES,
which drastically improves transaction performance for OLTP
applications, and you still get the reliability that is needed.

Typically, performance in terms of throughput (tps) looks like:
  tps(full_page_write on) << tps(full_page_write off)
With the double write and CRC we see:
  tps(full_page_write on) << tps(doublewrite) < tps(full_page_write off)
which is a big win for production systems that want the reliability of
full_page_write.

Also, a side effect is that response times are more level, unlike with
full page writes, where the response time varies from roughly 0.5ms to
5ms depending on whether the transaction needs to write a full page
to WAL or not.  With doublewrite it can always be around 0.5ms rather
than showing a huge deviation in transaction performance. With this,
folks measuring the 90th-percentile response time will see a huge
relief in trying to meet their SLAs.

Also, from a WAL perspective, I like to put the WAL on its own
LUN/spindle/VMDK etc. The net result is that with the reduced WAL
traffic my utilization drops, which means the same hardware can now
handle higher WAL traffic in terms of IOPS, so the WAL itself becomes
less of a bottleneck. Typically this is observed as a reduction in
transaction response times and an increase in tps until some other
bottleneck becomes the gating factor.

So overall this is a big win.

Regards,
Jignesh


Re: Page Checksums + Double Writes

From
"Kevin Grittner"
Date:
Jignesh Shah <jkshah@gmail.com> wrote:
> When we use Doublewrite with checksums, we can safely disable
> full_page_write causing a HUGE reduction to the WAL traffic
> without loss of reliability due to a write fault since there are
> two writes always. (Implementation detail discussable).
The "always" there surprised me.  It seemed to me that we only need
to do the double-write where we currently do full page writes or
unlogged writes.  In thinking about your message, it finally struck
me that this might require a WAL record to be written with the
checksum (or CRC; whatever we use).  Still, writing a WAL record
with a CRC prior to the page write would be less data than the full
page.  Doing double-writes instead for situations without the torn
page risk seems likely to be a net performance loss, although I have
no benchmarks to back that up (not having a double-write
implementation to test).  And if we can get correct behavior without
doing either (the checksum WAL record or the double-write), that's
got to be a clear win.
-Kevin


Re: Page Checksums + Double Writes

From
Jignesh Shah
Date:
On Thu, Dec 22, 2011 at 11:16 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Jignesh Shah <jkshah@gmail.com> wrote:
>
>> When we use Doublewrite with checksums, we can safely disable
>> full_page_write causing a HUGE reduction to the WAL traffic
>> without loss of reliability due to a write fault since there are
>> two writes always. (Implementation detail discussable).
>
> The "always" there surprised me.  It seemed to me that we only need
> to do the double-write where we currently do full page writes or
> unlogged writes.  In thinking about your message, it finally struck

Currently PG only does a full page write for the first change that
makes the page dirty after a checkpoint. This scheme works because all
later changes are relative to that first full-page image, so when a
checkpoint write fails, recovery can recreate the page by using the
full page write + all the delta changes from WAL.

In the double write implementation, every checkpoint write is double
written, so if the first write (to the doublewrite area) fails then the
original page is not corrupted, and if the second write (to the actual
data page) fails, then one can recover it from the earlier write. Now,
while it is true that there are 2X writes during checkpoint, I can
argue that there are the same 2X writes right now, except that 1X of
the writes goes to WAL DURING TRANSACTION COMMIT.  Also, since the
doublewrite area is generally written in its own file it is essentially
sequential, so it doesn't have the same write latencies as the actual
checkpoint write. So if you look at the net amount of writes, it is the
same. For unlogged tables, even if you do doublewrite it is not much of
a penalty, even though nothing was being logged in WAL for them before.
By doing the double write for them, they are still safe and gain
resilience even though it is not required. The net result is that the
underlying page is never "irrecoverable" due to failed writes.
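
To make that concrete, here is a sketch of the crash-recovery pass over
the doublewrite area (types and helper functions are invented for
illustration, not the actual implementation):

#include "postgres.h"
#include "storage/block.h"
#include "storage/relfilenode.h"

typedef struct
{
    RelFileNode rnode;
    BlockNumber blkno;
    char        image[BLCKSZ];
} DoubleWriteEntry;

extern bool dw_read_next_entry(DoubleWriteEntry *entry);
extern bool page_checksum_ok(const char *image);
extern void read_data_page(RelFileNode rnode, BlockNumber blkno, char *buf);
extern void write_data_page(RelFileNode rnode, BlockNumber blkno, const char *image);

static void
recover_from_doublewrite_area(void)
{
    DoubleWriteEntry entry;
    char             current[BLCKSZ];

    while (dw_read_next_entry(&entry))
    {
        /* Torn copy in the doublewrite area: the data page was never
         * overwritten, so there is nothing to repair. */
        if (!page_checksum_ok(entry.image))
            continue;

        /* Good copy in the doublewrite area: if the data page itself is
         * torn, the earlier write is the good one -- put it back. */
        read_data_page(entry.rnode, entry.blkno, current);
        if (!page_checksum_ok(current))
            write_data_page(entry.rnode, entry.blkno, entry.image);
    }
}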


> me that this might require a WAL record to be written with the
> checksum (or CRC; whatever we use).  Still, writing a WAL record
> with a CRC prior to the page write would be less data than the full
> page.  Doing double-writes instead for situations without the torn
> page risk seems likely to be a net performance loss, although I have
> no benchmarks to back that up (not having a double-write
> implementation to test).  And if we can get correct behavior without
> doing either (the checksum WAL record or the double-write), that's
> got to be a clear win.

I am not sure why one would want to write the checksum to WAL.
As for the double writes, in fact there is not a net loss, because
(a) the writes to the doublewrite area are sequential, so the write
calls are relatively very fast and in fact do not cause any latency
increase to transactions, unlike full_page_write; and
(b) it can be moved to a different location so there is no stress on
the default tablespace, if you are worried about that spindle handling
2X writes -- much as full_page_writes is mitigated by moving pg_xlog to
a different spindle.

And my own tests support that the net result is almost as fast as
full_page_write=off -- not quite the same, due to the extra write
(which gives you the desired reliability), but way better than
full_page_write=on.


Regards,
Jignesh






> -Kevin


Re: Page Checksums + Double Writes

From
Robert Haas
Date:
On Thu, Dec 22, 2011 at 1:50 PM, Jignesh Shah <jkshah@gmail.com> wrote:
> In the double write implementation, every checkpoint write is double
> written,

Unless I'm quite thoroughly confused, which is possible, the double
write will need to happen the first time a buffer is written following
each checkpoint.  Which might mean the next checkpoint, but it could
also be sooner if the background writer kicks in, or in the worst case
a buffer has to do its own write.

Furthermore, we can't *actually* write any pages until they are
written *and fsync'd* to the double-write buffer.  So the penalty for
the background writer failing to do the right thing is going to go up
enormously.  Think about VACUUM or COPY IN, using a ring buffer and
kicking out its own pages.  Every time it evicts a page, it is going
to have to doublewrite the buffer, fsync it, and then write it for
real.  That is going to make PostgreSQL 6.5 look like a speed demon.
The background writer or checkpointer can conceivably dump a bunch of
pages into the doublewrite area and then fsync the whole thing in
bulk, but a backend that needs to evict a page only wants one page, so
it's pretty much screwed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Page Checksums + Double Writes

From
Jignesh Shah
Date:
On Thu, Dec 22, 2011 at 3:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Dec 22, 2011 at 1:50 PM, Jignesh Shah <jkshah@gmail.com> wrote:
>> In the double write implementation, every checkpoint write is double
>> written,
>
> Unless I'm quite thoroughly confused, which is possible, the double
> write will need to happen the first time a buffer is written following
> each checkpoint.  Which might mean the next checkpoint, but it could
> also be sooner if the background writer kicks in, or in the worst case
> a buffer has to do its own write.
>


Logically the double write happens for every checkpoint write and it
gets fsynced. Implementation-wise you can do a chunk of those pages,
like we do in sets of pages, and sync them once, and yes it still
performs better than full_page_write. As long as you compare with
full_page_write=on, the scheme is always much better. If you compare
it with the performance of full_page_write=off it is slightly slower,
but then you lose the reliability. So performance testers like me, who
always turn off full_page_write anyway during benchmark runs, will not
see any impact. However, folks in production who are rightly scared to
turn off full_page_write will have a way to increase performance
without being scared of failed writes.

> Furthermore, we can't *actually* write any pages until they are
> written *and fsync'd* to the double-write buffer.  So the penalty for
> the background writer failing to do the right thing is going to go up
> enormously.  Think about VACUUM or COPY IN, using a ring buffer and
> kicking out its own pages.  Every time it evicts a page, it is going
> to have to doublewrite the buffer, fsync it, and then write it for
> real.  That is going to make PostgreSQL 6.5 look like a speed demon.

Like I said, implementation-wise it depends on how many such pages
you sync simultaneously, and the real tests show that it is actually
much faster than one expects.

> The background writer or checkpointer can conceivably dump a bunch of
> pages into the doublewrite area and then fsync the whole thing in
> bulk, but a backend that needs to evict a page only wants one page, so
> it's pretty much screwed.
>

Generally, at what point you pay the penalty is a trade-off. I would
argue that you are making me pay for the full page write on my first
transaction commit that changes the page, which I can never avoid, and
the result is a transaction response time that is unacceptable, since
the deviation from a similar transaction that modifies an already-dirty
page is a lot less. However, I can avoid page evictions if I select a
bigger buffer pool (not that I necessarily want to do that, but I have
a choice without losing reliability).

Regards,
Jignesh



> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


Re: Page Checksums + Double Writes

From
Simon Riggs
Date:
On Thu, Dec 22, 2011 at 9:50 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

> Simon, does it sound like I understand your proposal?

Yes, thanks for restating.

> Now, on to the separate-but-related topic of double-write.  That
> absolutely requires some form of checksum or CRC to detect torn
> pages, in order for the technique to work at all.  Adding a CRC
> without double-write would work fine if you have a storage stack
> which prevents torn pages in the file system or hardware driver.  If
> you don't have that, it could create a damaged page indication after
> a hardware or OS crash, although I suspect that would be the
> exception, not the typical case.  Given all that, and the fact that
> it would be cleaner to deal with these as two separate patches, it
> seems the CRC patch should go in first.  (And, if this is headed for
> 9.2, *very soon*, so there is time for the double-write patch to
> follow.)

It could work that way, but I seriously doubt that a technique only
mentioned in dispatches one month before the last CF is likely to
become trustable code within one month. We've been discussing CRCs for
years, so assembling the puzzle seems much easier, when all the parts
are available.

> It seems to me that the full_page_writes GUC could become an
> enumeration, with "off" having the current meaning, "wal" meaning
> what "on" now does, and "double" meaning that the new double-write
> technique would be used.  (It doesn't seem to make any sense to do
> both at the same time.)  I don't think we need a separate GUC to tell
> us *what* to protect against torn pages -- if not "off" we should
> always protect the first write of a page after checkpoint, and if
> "double" and write_page_crc (or whatever we call it) is "on", then we
> protect hint-bit-only writes.  I think.  I can see room to argue that
> with CRCs on we should do a full-page write to the WAL for a
> hint-bit-only change, or that we should add another GUC to control
> when we do this.
>
> I'm going to take a shot at writing a patch for background hinting
> over the holidays, which I think has benefit alone but also boosts
> the value of these patches, since it would reduce double-write
> activity otherwise needed to prevent spurious error when using CRCs.

I would suggest you examine how to have an array of N bgwriters, then
just slot the code for hinting into the bgwriter. That way a bgwriter
can set hints, calc CRC and write pages in sequence on a particular
block. The hinting needs to be synchronised with the writing to give
good benefit.

If we want page checksums in 9.2, I'll need your help, so the hinting
may be a sidetrack.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Page Checksums + Double Writes

From
"Kevin Grittner"
Date:
Simon Riggs <simon@2ndQuadrant.com> wrote:
> It could work that way, but I seriously doubt that a technique
> only mentioned in dispatches one month before the last CF is
> likely to become trustable code within one month. We've been
> discussing CRCs for years, so assembling the puzzle seems much
> easier, when all the parts are available.
Well, double-write has been mentioned on the lists for years,
sometimes in conjunction with CRCs, and I get the impression this is
one of those things which has been worked on out of the community's
view for a while and is just being posted now.  That's often not
viewed as the ideal way for development to proceed from a community
standpoint, but it's been done before with some degree of success --
particularly when a feature has been bikeshedded to a standstill. 
;-)
> I would suggest you examine how to have an array of N bgwriters,
> then just slot the code for hinting into the bgwriter. That way a
> bgwriter can set hints, calc CRC and write pages in sequence on a
> particular block. The hinting needs to be synchronised with the
> writing to give good benefit.
I'll think about that.  I see pros and cons, and I'll have to see
how those balance out after I mull them over.
> If we want page checksums in 9.2, I'll need your help, so the
> hinting may be a sidetrack.
Well, VMware posted the initial patch, and that was the first I
heard of it.  I just had some off-line discussions with them after
they posted it.  Perhaps the engineers who wrote it should take your
comments as a review an post a modified patch?  It didn't seem like
that pot of broth needed any more cooks, so I was going to go work
on a nice dessert; but I agree that any way I can help along the
either of the $Subject patches should take priority.
-Kevin


Re: Page Checksums + Double Writes

From
"Kevin Grittner"
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
>> I would suggest you examine how to have an array of N bgwriters,
>> then just slot the code for hinting into the bgwriter. That way a
>> bgwriter can set hints, calc CRC and write pages in sequence on a
>> particular block. The hinting needs to be synchronised with the
>> writing to give good benefit.
> 
> I'll think about that.  I see pros and cons, and I'll have to see
how those balance out after I mull them over.

I think maybe the best solution is to create some common code to use
from both.  The problem with *just* doing it in bgwriter is that it
would not help much with workloads like Robert has been using for
most of his performance testing -- a database which fits entirely in
shared buffers and starts thrashing on CLOG.  For a background
hinter process my goal would be to deal with xids as they are passed
by the global xmin value, so that you have a cheap way to know that
they are ripe for hinting, and you can frequently hint a bunch of
transactions that are all in the same CLOG page which is recent
enough to likely be already loaded.

Now, a background hinter isn't going to be a net win if it has to
grovel through every tuple on every dirty page every time it sweeps
through the buffers, so the idea depends on having a sufficiently
efficient way to identify interesting buffers.  I'm hoping to
improve on this, but my best idea so far is to add a field to the
buffer header for "earliest unhinted xid" for the page.  Whenever
this background process wakes up and is scanning through the buffers
(probably just in buffer number order), it does a quick check,
without any pin or lock, to see if the buffer is dirty and the
earliest unhinted xid is below the global xmin.  If it passes both
of those tests, there is definitely useful work which can be done if
the page doesn't get evicted before we can do it.  We pin the page,
recheck those conditions, and then we look at each tuple and hint
where possible.  As we go, we remember the earliest xid that we see
which is *not* being hinted, to store back into the buffer header
when we're done.  Of course, we would also update the buffer header
for new tuples or when an xmax is set if the xid involved precedes
what we have in the buffer header.
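
In rough code form, the sweep would look something like this
(earliest_unhinted_xid and the hinting helper are the hypothetical
additions; the flag test is just the cheap unlocked pre-check):

#include "postgres.h"
#include "access/transam.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"

extern void hint_page_and_update_header(volatile BufferDesc *bufHdr,
                                        TransactionId global_xmin); /* hypothetical */

static void
bghinter_sweep(TransactionId global_xmin)
{
    int buf;

    for (buf = 0; buf < NBuffers; buf++)
    {
        volatile BufferDesc *bufHdr = &BufferDescriptors[buf];

        /* Cheap unlocked pre-checks; false hits are harmless, we recheck later. */
        if (!(bufHdr->flags & BM_DIRTY))
            continue;
        if (!TransactionIdPrecedes(bufHdr->earliest_unhinted_xid, global_xmin))
            continue;       /* earliest_unhinted_xid is the proposed new field */

        /* Pin, recheck under the usual rules, hint what we can, and store
         * back the earliest xid we could NOT hint. */
        hint_page_and_update_header(bufHdr, global_xmin);
    }
}
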
This would not only help avoid multiple page writes as unhinted
tuples on the page are read, it would minimize thrashing on CLOG and
move some of the hinting work from the critical path of reading a
tuple into a background process.
Thoughts?
-Kevin


Re: Page Checksums + Double Writes

From
Robert Haas
Date:
On Fri, Dec 23, 2011 at 11:14 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Thoughts?

Those are good thoughts.

Here's another random idea, which might be completely nuts.  Maybe we
could consider some kind of summarization of CLOG data, based on the
idea that most transactions commit.  We introduce the idea of a CLOG
rollup page.  On a CLOG rollup page, each bit represents the status of
N consecutive XIDs.  If the bit is set, that means all XIDs in that
group are known to have committed.  If it's clear, then we don't know,
and must fall through to a regular CLOG lookup.

If you let N = 1024, then 8K of CLOG rollup data is enough to
represent the status of 64 million transactions, which means that just
a couple of pages could cover as much of the XID space as you probably
need to care about.  Also, you would need to replace CLOG summary
pages in memory only very infrequently.  Backends could test the bit
without any lock.  If it's set, they do pg_read_barrier(), and then
check the buffer label to make sure it's still the summary page they
were expecting.  If so, no CLOG lookup is needed.  If the page has
changed under us or the bit is clear, then we fall through to a
regular CLOG lookup.
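
As a strawman, the lock-free fast path could look about like this
(structure, sizing and the label scheme are invented for illustration):

#include "postgres.h"
#include "storage/barrier.h"

#define XIDS_PER_ROLLUP_BIT   1024
#define ROLLUP_BITS_PER_PAGE  (BLCKSZ * 8)    /* an 8 kB page covers 64M xids */

typedef struct
{
    uint32 label;               /* which chunk of the XID space this page covers */
    uint8  bits[BLCKSZ];        /* bit set => all 1024 xids in that group committed */
} ClogRollupPage;

extern ClogRollupPage *rollup_pages;    /* assumed: a few of these in shared memory */
extern int num_rollup_pages;

static bool
xid_known_committed_fast(TransactionId xid)
{
    uint32 group = xid / XIDS_PER_ROLLUP_BIT;
    uint32 label = group / ROLLUP_BITS_PER_PAGE;
    uint32 bit   = group % ROLLUP_BITS_PER_PAGE;
    ClogRollupPage *pg = &rollup_pages[label % num_rollup_pages];

    if ((pg->bits[bit / 8] & (1 << (bit % 8))) == 0)
        return false;           /* not known committed: do a normal CLOG lookup */

    pg_read_barrier();

    /* Only trust the bit if the page still covers the range we think it does. */
    return pg->label == label;
}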

An obvious problem is that, if the abort rate is significantly
different from zero, and especially if the aborts are randomly mixed
in with commits rather than clustered together in small portions of
the XID space, the CLOG rollup data would become useless.  On the
other hand, if you're doing 10k tps, you only need to have a window of
a tenth of a second or so where everything commits in order to start
getting some benefit, which doesn't seem like a stretch.

Perhaps the CLOG rollup data wouldn't even need to be kept on disk.
We could simply have bgwriter (or bghinter) set the rollup bits in
shared memory for new transactions, as it becomes possible to do so,
and let lookups for XIDs prior to the last shutdown fall through to
CLOG.  Or, if that's not appealing, we could reconstruct the data in
memory by groveling through the CLOG pages - or maybe just set summary
bits only for CLOG pages that actually get faulted in.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Page Checksums + Double Writes

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> An obvious problem is that, if the abort rate is significantly
> different from zero, and especially if the aborts are randomly mixed
> in with commits rather than clustered together in small portions of
> the XID space, the CLOG rollup data would become useless.

Yeah, I'm afraid that with N large enough to provide useful
acceleration, the cases where you'd actually get a win would be too thin
on the ground to make it worth the trouble.
        regards, tom lane


Re: Page Checksums + Double Writes

From
Robert Haas
Date:
On Fri, Dec 23, 2011 at 12:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> An obvious problem is that, if the abort rate is significantly
>> different from zero, and especially if the aborts are randomly mixed
>> in with commits rather than clustered together in small portions of
>> the XID space, the CLOG rollup data would become useless.
>
> Yeah, I'm afraid that with N large enough to provide useful
> acceleration, the cases where you'd actually get a win would be too thin
> on the ground to make it worth the trouble.

Well, I don't know: something like pgbench is certainly going to
benefit, because all the transactions commit.  I suspect that's true
for many benchmarks.  Whether it's true of real-life workloads is more
arguable, of course, but if the benchmarks aren't measuring things
that people really do with the database, then why are they designed
the way they are?

I've certainly written applications that relied on the database for
integrity checking, so rollbacks were an expected occurrence, but then
again those were very low-velocity systems where there wasn't going to
be enough CLOG contention to matter anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Page Checksums + Double Writes

From
Jeff Janes
Date:
On 12/23/11, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 23, 2011 at 11:14 AM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
>> Thoughts?
>
> Those are good thoughts.
>
> Here's another random idea, which might be completely nuts.  Maybe we
> could consider some kind of summarization of CLOG data, based on the
> idea that most transactions commit.

I had a perhaps crazier idea. Aren't CLOG pages older than global xmin
effectively read only?  Could backends that need these bypass locking
and shared memory altogether?

> An obvious problem is that, if the abort rate is significantly
> different from zero, and especially if the aborts are randomly mixed
> in with commits rather than clustered together in small portions of
> the XID space, the CLOG rollup data would become useless.  On the
> other hand, if you're doing 10k tps, you only need to have a window of
> a tenth of a second or so where everything commits in order to start
> getting some benefit, which doesn't seem like a stretch.

Could we get some major OLTP users to post their CLOG for analysis?  I
wouldn't think there would be many security/proprietary issues with
CLOG data.

Cheers,

Jeff


Re: Page Checksums + Double Writes

From
"Kevin Grittner"
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> An obvious problem is that, if the abort rate is significantly
>> different from zero, and especially if the aborts are randomly
>> mixed in with commits rather than clustered together in small
>> portions of the XID space, the CLOG rollup data would become
>> useless.
> 
> Yeah, I'm afraid that with N large enough to provide useful
> acceleration, the cases where you'd actually get a win would be
> too thin on the ground to make it worth the trouble.
Just to get a real-life data point, I checked the pg_clog directory
for Milwaukee County Circuit Courts.  They have about 300 OLTP
users, plus replication feeds to the central servers.  Looking at
the now-present files, there are 19,104 blocks of 256 bytes (which
should support N of 1024, per Robert's example).  Of those, 12,644
(just over 66%) contain 256 bytes of hex 55.
"Last modified" dates on the files go back to the 4th of October, so
this represents roughly three months worth of real-life
transactions.
-Kevin


Re: Page Checksums + Double Writes

From
"Kevin Grittner"
Date:
Jeff Janes <jeff.janes@gmail.com> wrote:
> Could we get some major OLTP users to post their CLOG for
> analysis?  I wouldn't think there would be many
> security/proprietary issues with CLOG data.
FWIW, I got the raw numbers to do my quick check using this Ruby
script (put together for me by Peter Brant).  If it is of any use to
anyone else, feel free to use it and/or post any enhanced versions
of it.
#!/usr/bin/env ruby

# Count, per CLOG segment file, the 256-byte blocks that are entirely 0x55
# (i.e. blocks in which all 1,024 transactions committed).
Dir.glob("*") do |file_name|
  contents = File.read(file_name)
  total =
    contents.enum_for(:each_byte).enum_for(:each_slice, 256).inject(0) do |count, chunk|
      if chunk.all? { |b| b == 0x55 }
        count + 1
      else
        count
      end
    end
  printf "%s %d\n", file_name, total
end
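
If you want to turn those per-file counts into a percentage, here is a
trivial filter along these lines (illustrative only; it assumes every
segment file is a full 256 kB, i.e. 1,024 blocks, which the newest file
usually isn't, so treat the result as approximate):

/*
 * Illustrative helper: read the "filename count" lines produced by the
 * Ruby script above on stdin and report what fraction of 256-byte CLOG
 * blocks (1,024 xacts each) were entirely commits.
 */
#include <stdio.h>

#define BLOCKS_PER_SEGMENT 1024        /* 256 kB segment / 256-byte blocks */

int
main(void)
{
    char name[64];
    long count, files = 0, all_commit = 0;

    while (scanf("%63s %ld", name, &count) == 2)
    {
        files++;
        all_commit += count;
    }

    if (files > 0)
        printf("%ld/%ld blocks all-commit (%.1f%%)\n",
               all_commit, files * BLOCKS_PER_SEGMENT,
               100.0 * all_commit / (files * BLOCKS_PER_SEGMENT));
    return 0;
}

Run the Ruby script from inside the pg_clog directory and pipe its output
into the compiled filter.
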
-Kevin


Re: Page Checksums + Double Writes

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> I had a perhaps crazier idea. Aren't CLOG pages older than global xmin
> effectively read only?  Could backends that need these bypass locking
> and shared memory altogether?

Hmm ... once they've been written out from the SLRU arena, yes.  In fact
you don't need to go back as far as global xmin --- *any* valid xmin is
a sufficient boundary point.  The only real problem is to know whether
the data's been written out from the shared area yet.

This idea has potential.  I like it better than Robert's, mainly because
I do not want to see us put something in place that would lead people to
try to avoid rollbacks.
        regards, tom lane


Re: Page Checksums + Double Writes

From
Simon Riggs
Date:
On Thu, Dec 22, 2011 at 9:58 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, Dec 22, 2011 at 9:50 AM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
>
>> Simon, does it sound like I understand your proposal?
>
> Yes, thanks for restating.

I've implemented that proposal, posting patch on a separate thread.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Page Checksums + Double Writes

From
Jeff Davis
Date:
On Thu, 2011-12-22 at 03:50 -0600, Kevin Grittner wrote:
> Now, on to the separate-but-related topic of double-write.  That
> absolutely requires some form of checksum or CRC to detect torn
> pages, in order for the technique to work at all.  Adding a CRC
> without double-write would work fine if you have a storage stack
> which prevents torn pages in the file system or hardware driver.  If
> you don't have that, it could create a damaged page indication after
> a hardware or OS crash, although I suspect that would be the
> exception, not the typical case.  Given all that, and the fact that
> it would be cleaner to deal with these as two separate patches, it
> seems the CRC patch should go in first.

I think it could be broken down further.

Taking a step back, there are several types of HW-induced corruption,
and checksums only catch some of them. For instance, the disk losing
data completely and just returning zeros won't be caught, because we
assume that a zero page is just fine.

From a development standpoint, I think a better approach would be:

1. Investigate if there are reasonable ways to ensure that (outside of
recovery) pages are always initialized; and therefore zero pages can be
treated as corruption.

2. Make some room in the page header for checksums and maybe some other
simple sanity information (like file and page number). It will be a big
project to sort out the pg_upgrade issues (as Tom and others have
pointed out).

3. Attack hint bits problem.

If (1) and (2) were complete, we would catch many common types of
corruption, and we'd be in a much better position to think clearly about
hint bits, double writes, etc.

Regards,Jeff Davis




Re: Page Checksums + Double Writes

From
Merlin Moncure
Date:
On Tue, Dec 27, 2011 at 1:24 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> 3. Attack hint bits problem.

A large number of problems would go away if the current hint bit
system could be replaced with something that did not require writing
to the tuple itself.  FWIW, moving the bits around seems like a
non-starter -- you're trading one problem for a much bigger one
(locking, WAL logging, etc.).  But perhaps a CLOG caching strategy
would be a win.  You get a full nibble back in the tuple header,
significant i/o reduction for some workloads, crc becomes relatively
trivial, etc etc.

My first attempt at a process local cache for hint bits wasn't perfect
but proved (at least to me) that you can sneak a tight cache in there
without significantly impacting the general case.  Maybe the angle of
attack was wrong anyways -- I bet if you kept a judicious number of
clog pages in each local process with some smart invalidation you
could cover enough cases that scribbling the bits down would become
unnecessary.  Proving that is a tall order of course, but IMO merits
another attempt.

merlin


Re: Page Checksums + Double Writes

From
Jeff Davis
Date:
On Tue, 2011-12-27 at 16:43 -0600, Merlin Moncure wrote:
> On Tue, Dec 27, 2011 at 1:24 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> > 3. Attack hint bits problem.
> 
> A large number of problems would go away if the current hint bit
> system could be replaced with something that did not require writing
> to the tuple itself.

My point was that neither the zero page problem nor the upgrade problem
are solved by addressing the hint bits problem. They can be solved
independently, and in my opinion, it seems to make sense to solve those
problems before the hint bits problem (in the context of detecting
hardware corruption).

Of course, don't let that stop you from trying to get rid of hint bits,
that has numerous potential benefits.

Regards,Jeff Davis




Re: Page Checksums + Double Writes

From
Greg Stark
Date:
On Tue, Dec 27, 2011 at 10:43 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>  I bet if you kept a judicious number of
> clog pages in each local process with some smart invalidation you
> could cover enough cases that scribbling the bits down would become
> unnecessary.

I don't understand how any cache can completely remove the need for
hint bits. Without hint bits the xids in the tuples will be "in-doubt"
forever. No matter how large your cache you'll always come across
tuples that are arbitrarily old and are from an unbounded size set of
xids.

We could replace the xids with a frozen xid sooner but that just
amounts to nearly the same thing as the hint bits only with page
locking and wal records.


-- 
greg


Re: Page Checksums + Double Writes

From
Merlin Moncure
Date:
On Wed, Dec 28, 2011 at 8:45 AM, Greg Stark <stark@mit.edu> wrote:
> On Tue, Dec 27, 2011 at 10:43 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>  I bet if you kept a judicious number of
>> clog pages in each local process with some smart invalidation you
>> could cover enough cases that scribbling the bits down would become
>> unnecessary.
>
> I don't understand how any cache can completely remove the need for
> hint bits. Without hint bits the xids in the tuples will be "in-doubt"
> forever. No matter how large your cache you'll always come across
> tuples that are arbitrarily old and are from an unbounded size set of
> xids.

Well, hint bits aren't needed, strictly speaking; they are an
optimization to guard against CLOG lookups.  But is marking bits on
the tuple the only way to get that effect?

I'm conjecturing that some process-local memory could be laid on top
of the CLOG SLRU that would be fast enough to take the place of the
tuple bits in the visibility check.  Maybe this could reduce CLOG
contention as well -- or maybe the idea is unworkable.
That said, it shouldn't be that much work to make a proof of concept
to test the idea.
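
Just to make that concrete, here is the rough shape of what I'm imagining,
as an untested sketch in plain C.  Everything in it (the names, the
constants, the stubbed-out shared read) is invented for illustration; it is
not the backend's actual SLRU API, and it ignores XID wraparound entirely:

/*
 * Illustrative sketch only: a per-backend, direct-mapped cache of CLOG
 * pages.  Pages wholly below a "stable horizon" XID never change again,
 * so a private copy can answer status lookups with no locking.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CLOG_PAGE_SIZE     8192
#define XACTS_PER_PAGE     (CLOG_PAGE_SIZE * 4)   /* 2 status bits per xact */
#define LOCAL_CACHE_PAGES  16                     /* deliberately tiny */

typedef uint32_t TransactionId;

typedef struct
{
    bool     valid;
    int64_t  page_no;
    uint8_t  data[CLOG_PAGE_SIZE];
} LocalClogPage;

static LocalClogPage local_cache[LOCAL_CACHE_PAGES];

/* Stand-in for the real shared-memory SLRU read (which takes locks). */
static void
shared_clog_read_page(int64_t page_no, uint8_t *dest)
{
    (void) page_no;
    memset(dest, 0x55, CLOG_PAGE_SIZE);           /* pretend: all committed */
}

/* Return the 2-bit status of xid, using the local cache when we can. */
static int
local_clog_status(TransactionId xid, TransactionId stable_horizon)
{
    int64_t        page_no = xid / XACTS_PER_PAGE;
    LocalClogPage *p = &local_cache[page_no % LOCAL_CACHE_PAGES];

    if (!p->valid || p->page_no != page_no)
    {
        bool stable = ((page_no + 1) * (int64_t) XACTS_PER_PAGE
                       <= (int64_t) stable_horizon);

        shared_clog_read_page(page_no, p->data);
        p->page_no = page_no;
        p->valid = stable;    /* only retain pages that can never change */
    }

    int byte_in_page = (int) ((xid % XACTS_PER_PAGE) / 4);
    int shift = (int) (xid % 4) * 2;
    return (p->data[byte_in_page] >> shift) & 0x03;
}

int
main(void)
{
    /* xid 1000 is below an (assumed) stable horizon of 2,000,000. */
    printf("status bits for xid 1000: %d\n", local_clog_status(1000, 2000000));
    return 0;
}

The point is only that the fast path is a couple of arithmetic ops and one
array lookup, with no lock traffic for pages that can never change again.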

> We could replace the xids with a frozen xid sooner but that just
> amounts to nearly the same thing as the hint bits only with page
> locking and wal records.

right -- I don't think that helps.

merlin


Re: Page Checksums + Double Writes

From
Jim Nasby
Date:
On Dec 23, 2011, at 2:23 PM, Kevin Grittner wrote:

> Jeff Janes <jeff.janes@gmail.com> wrote:
>
>> Could we get some major OLTP users to post their CLOG for
>> analysis?  I wouldn't think there would be many
>> security/proprietary issues with CLOG data.
>
> FWIW, I got the raw numbers to do my quick check using this Ruby
> script (put together for me by Peter Brant).  If it is of any use to
> anyone else, feel free to use it and/or post any enhanced versions
> of it.

Here's output from our largest OLTP system... not sure exactly how to interpret it, so I'm just providing the raw data.
This spans almost exactly 1 month.

I have a number of other systems I can profile if anyone's interested.

063A 379
063B 143
063C 94
063D 94
063E 326
063F 113
0640 122
0641 270
0642 81
0643 390
0644 183
0645 76
0646 61
0647 50
0648 275
0649 288
064A 126
064B 53
064C 59
064D 125
064E 357
064F 92
0650 54
0651 83
0652 267
0653 328
0654 118
0655 75
0656 104
0657 280
0658 414
0659 105
065A 74
065B 153
065C 303
065D 63
065E 216
065F 169
0660 113
0661 405
0662 85
0663 52
0664 44
0665 78
0666 412
0667 116
0668 48
0669 61
066A 66
066B 364
066C 104
066D 48
066E 68
066F 104
0670 465
0671 158
0672 64
0673 62
0674 115
0675 452
0676 296
0677 65
0678 80
0679 177
067A 316
067B 86
067C 87
067D 270
067E 84
067F 295
0680 299
0681 88
0682 35
0683 67
0684 66
0685 456
0686 146
0687 52
0688 33
0689 73
068A 147
068B 345
068C 107
068D 67
068E 50
068F 97
0690 473
0691 156
0692 47
0693 57
0694 97
0695 550
0696 224
0697 51
0698 80
0699 280
069A 115
069B 426
069C 241
069D 395
069E 98
069F 130
06A0 523
06A1 296
06A2 92
06A3 97
06A4 122
06A5 524
06A6 256
06A7 118
06A8 111
06A9 157
06AA 553
06AB 166
06AC 106
06AD 103
06AE 200
06AF 621
06B0 288
06B1 95
06B2 107
06B3 227
06B4 92
06B5 447
06B6 210
06B7 364
06B8 119
06B9 113
06BA 384
06BB 319
06BC 45
06BD 68
06BE 2
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: Page Checksums + Double Writes

From
"Kevin Grittner"
Date:
Jim Nasby <jim@nasby.net> wrote:
> Here's output from our largest OLTP system... not sure exactly how
> to interpret it, so I'm just providing the raw data. This spans
> almost exactly 1 month.
Those numbers wind up meaning that 18% of the 256-byte blocks (1024
transactions each) were all commits.  Yikes.  That pretty much
shoots down Robert's idea of summarized CLOG data, I think.
-Kevin


Re: Page Checksums + Double Writes

From
Robert Haas
Date:
On Wed, Jan 4, 2012 at 3:02 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Jim Nasby <jim@nasby.net> wrote:
>> Here's output from our largest OLTP system... not sure exactly how
>> to interpret it, so I'm just providing the raw data. This spans
>> almost exactly 1 month.
>
> Those numbers wind up meaning that 18% of the 256-byte blocks (1024
> transactions each) were all commits.  Yikes.  That pretty much
> shoots down Robert's idea of summarized CLOG data, I think.

I'm not *totally* certain of that... another way to look at it is that
I have to be able to show a win even if only 18% of the probes into
the summarized data are successful, which doesn't seem totally out of
the question given how cheap I think lookups could be.  But I'll admit
it's not real encouraging.

I think the first thing we need to look at is increasing the number of
CLOG buffers.  Even if hypothetical summarized CLOG data had a 60% hit
rate rather than 18%, 8 CLOG buffers is probably still not going to be
enough for a 32-core system, let alone anything larger.  I am aware of
two concerns here:

1. Unconditionally adding more CLOG buffers will increase PostgreSQL's
minimum memory footprint, which is bad for people suffering under
default shared memory limits or running a database on a device with
less memory than a low-end cell phone.

2. The CLOG code isn't designed to manage a large number of buffers,
so adding more might cause a performance regression on small systems.

On Nate Boley's 32-core system, running pgbench at scale factor 100,
the optimal number of buffers seems to be around 32.  I'd like to get
some test results from smaller systems - any chance you (or anyone)
have, say, an 8-core box you could test on?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Page Checksums + Double Writes

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> 2. The CLOG code isn't designed to manage a large number of
> buffers, so adding more might cause a performance regression on
> small systems.
> 
> On Nate Boley's 32-core system, running pgbench at scale factor
> 100, the optimal number of buffers seems to be around 32.  I'd
> like to get some test results from smaller systems - any chance
> you (or anyone) have, say, an 8-core box you could test on?
Hmm.  I can think of a lot of 4-core servers I could test on.  (We
have a few poised to go into production where it would be relatively
easy to do benchmarking without distorting factors right now.) 
After that we jump to 16 cores, unless I'm forgetting something. 
These are currently all in production, but some of them are
redundant machines which could be pulled for a few hours here and
there for benchmarks.  If either of those seem worthwhile, please
spec the useful tests so I can capture the right information.
-Kevin


Re: Page Checksums + Double Writes

From
Jim Nasby
Date:
On Jan 4, 2012, at 2:02 PM, Kevin Grittner wrote:
> Jim Nasby <jim@nasby.net> wrote:
>> Here's output from our largest OLTP system... not sure exactly how
>> to interpret it, so I'm just providing the raw data. This spans
>> almost exactly 1 month.
>
> Those numbers wind up meaning that 18% of the 256-byte blocks (1024
> transactions each) were all commits.  Yikes.  That pretty much
> shoots down Robert's idea of summarized CLOG data, I think.

Here's another data point. This is for a londiste slave of what I posted earlier. Note that this slave has no users on
it.
054A 654
054B 835
054C 973
054D 1020
054E 1012
054F 1022
0550 284


And these clog files are from Sep 15-30... I believe that's the period when we were building this slave, but I'm not
100% certain.

04F0 194
04F1 253
04F2 585
04F3 243
04F4 176
04F5 164
04F6 358
04F7 505
04F8 168
04F9 180
04FA 369
04FB 318
04FC 236
04FD 437
04FE 242
04FF 625
0500 222
0501 139
0502 174
0503 91
0504 546
0505 220
0506 187
0507 151
0508 199
0509 491
050A 232
050B 170
050C 191
050D 414
050E 557
050F 231
0510 173
0511 159
0512 436
0513 789
0514 354
0515 157
0516 187
0517 333
0518 599
0519 483
051A 300
051B 512
051C 713
051D 422
051E 291
051F 596
0520 785
0521 825
0522 484
0523 238
0524 151
0525 190
0526 256
0527 403
0528 551
0529 757
052A 837
052B 418
052C 256
052D 161
052E 254
052F 423
0530 469
0531 757
0532 627
0533 325
0534 224
0535 295
0536 290
0537 352
0538 561
0539 565
053A 833
053B 756
053C 485
053D 276
053E 241
053F 270
0540 334
0541 306
0542 700
0543 821
0544 402
0545 199
0546 226
0547 250
0548 354
0549 587


This is for a slave of that database that does have user activity:

054A 654
054B 835
054C 420
054D 432
054E 852
054F 666
0550 302
0551 243
0552 600
0553 295
0554 617
0555 504
0556 232
0557 304
0558 580
0559 156

--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: Page Checksums + Double Writes

From
Robert Haas
Date:
On Wed, Jan 4, 2012 at 4:02 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>
>> 2. The CLOG code isn't designed to manage a large number of
>> buffers, so adding more might cause a performance regression on
>> small systems.
>>
>> On Nate Boley's 32-core system, running pgbench at scale factor
>> 100, the optimal number of buffers seems to be around 32.  I'd
>> like to get some test results from smaller systems - any chance
>> you (or anyone) have, say, an 8-core box you could test on?
>
> Hmm.  I can think of a lot of 4-core servers I could test on.  (We
> have a few poised to go into production where it would be relatively
> easy to do benchmarking without distorting factors right now.)
> After that we jump to 16 cores, unless I'm forgetting something.
> These are currently all in production, but some of them are
> redundant machines which could be pulled for a few hours here and
> there for benchmarks.  If either of those seem worthwhile, please
> spec the useful tests so I can capture the right information.

Yes, both of those seem useful.  To compile, I do this:

./configure --prefix=$HOME/install/$BRANCHNAME --enable-depend
--enable-debug ${EXTRA_OPTIONS}
make
make -C contrib/pgbench
make check
make install
make -C contrib/pgbench install

In this case, the relevant builds would probably be (1) master, (2)
master with NUM_CLOG_BUFFERS = 16, (3) master with NUM_CLOG_BUFFERS =
32, and (4) master with NUM_CLOG_BUFFERS = 48.  (You could also try
intermediate numbers if it seems warranted.)

Basic test setup:

rm -rf $PGDATA
~/install/master/bin/initdb
cat >> $PGDATA/postgresql.conf <<EOM;
shared_buffers = 8GB
maintenance_work_mem = 1GB
synchronous_commit = off
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
wal_writer_delay = 20ms
EOM

I'm attaching a driver script you can modify to taste.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: Page Checksums + Double Writes

From
Florian Pflug
Date:
On Jan4, 2012, at 21:27 , Robert Haas wrote:
> I think the first thing we need to look at is increasing the number of
> CLOG buffers.

What became of the idea to treat the stable (i.e. earlier than the oldest
active xid) and the unstable (i.e. the rest) parts of the CLOG differently?

On 64-bit machines at least, we could simply mmap() the stable parts of the
CLOG into the backend address space, and access it without any locking at all.

I believe that we could also compress the stable part by 50% if we use one
instead of two bits per txid. AFAIK, we need two bits because we
 a) Distinguish between transactions which were ABORTED and those which
    never completed (due to, e.g., a backend crash), and

 b) Mark transactions as SUBCOMMITTED to achieve atomic commits.

Neither of which is strictly necessary for the stable parts of the CLOG.  Note that
we could still keep the uncompressed CLOG around for debugging purposes - the
additional compressed version would require only 2^32/8 bytes = 512 MB in the
worst case, which people who're serious about performance can very probably
spare.

The fly in the ointment are 32-bit machines, of course - but then, those could
still fall back to the current way of doing things.
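
To make the mmap() half of that concrete, here is a rough standalone sketch
(purely illustrative: it is not backend code, does no SLRU interaction or
Windows handling, and assumes you hand it a complete 256 kB pg_clog segment
file that already contains the xid in question):

/*
 * Illustrative sketch only: map one on-disk CLOG segment (32 x 8 kB pages
 * = 256 kB, 4 xacts per byte) read-only and read a status with no locking.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define CLOG_SEGMENT_BYTES  (256 * 1024)
#define XACTS_PER_BYTE      4                 /* 2 status bits per xact */
#define XACTS_PER_SEGMENT   (CLOG_SEGMENT_BYTES * XACTS_PER_BYTE)

int
main(int argc, char **argv)
{
    if (argc != 3)
    {
        fprintf(stderr, "usage: %s <pg_clog segment file> <xid>\n", argv[0]);
        return 1;
    }

    uint32_t xid = (uint32_t) strtoul(argv[2], NULL, 0);

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    const uint8_t *seg = mmap(NULL, CLOG_SEGMENT_BYTES, PROT_READ,
                              MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }

    /* Locate the xid within this segment and extract its two bits. */
    uint32_t off    = xid % XACTS_PER_SEGMENT;
    uint8_t  byte   = seg[off / XACTS_PER_BYTE];
    int      status = (byte >> ((off % XACTS_PER_BYTE) * 2)) & 0x03;

    printf("xid %u: status bits %d (1 = committed, 2 = aborted)\n",
           xid, status);

    munmap((void *) seg, CLOG_SEGMENT_BYTES);
    close(fd);
    return 0;
}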

best regards,
Florian Pflug



Re: Page Checksums + Double Writes

From
Merlin Moncure
Date:
On Thu, Jan 5, 2012 at 5:15 AM, Florian Pflug <fgp@phlo.org> wrote:
> On Jan4, 2012, at 21:27 , Robert Haas wrote:
>> I think the first thing we need to look at is increasing the number of
>> CLOG buffers.
>
> What became of the idea to treat the stable (i.e. earlier than the oldest
> active xid) and the unstable (i.e. the rest) parts of the CLOG differently.


I'm curious -- does anyone happen to have an idea how big the unstable CLOG
xid space is in the "typical" case?  What would be the main driver
of making it bigger?  What are the main tradeoffs in terms of trying
to keep the unstable area compact?

merlin


Re: Page Checksums + Double Writes

From
Robert Haas
Date:
On Thu, Jan 5, 2012 at 6:15 AM, Florian Pflug <fgp@phlo.org> wrote:
> On 64-bit machines at least, we could simply mmap() the stable parts of the
> CLOG into the backend address space, and access it without any locking at all.

True.  I think this could be done, but it would take some fairly
careful thought and testing because (1) we don't currently use mmap()
anywhere else in the backend AFAIK, so we might run into portability
issues (think: Windows) and perhaps unexpected failure modes (e.g.
mmap() fails because there are too many mappings already).  Also, it's
not completely guaranteed to be a win.  Sure, you save on locking, but
now you are doing an mmap() call in every backend instead of just one
read() into shared memory.  If concurrency isn't a problem that might
be more expensive on net.  Or maybe not, but I'm kind of inclined to
steer clear of this whole area at least for 9.2.  So far, the only
test result I have supports the notion that we run into trouble
when NUM_CPUS > NUM_CLOG_BUFFERS, and people have to wait before they can
even start their I/Os.  That can be fixed with a pretty modest
reengineering.  I'm sure there is a second-order effect from the cost
of repeated I/Os per se, which a backend-private cache of one form or
another might well help with, but it may not be very big.  Test
results are welcome, of course.

> I believe that we could also compress the stable part by 50% if we use one
> instead of two bits per txid. AFAIK, we need two bits because we
>
>  a) Distinguish between transactions which were ABORTED and those which
>     never completed (due to, e.g., a backend crash), and
>
>  b) Mark transactions as SUBCOMMITTED to achieve atomic commits.
>
> Neither of which is strictly necessary for the stable parts of the CLOG.

Well, if we're going to do compression at all, I'm inclined to think
that we should compress by more than a factor of two.  Jim Nasby's
numbers (the worst we've seen so far) show that 18% of 1k blocks of
XIDs were all commits.  Presumably if we reduced the chunk size to,
say, 8 transactions, that percentage would go up, and even that would
be enough to get 16x compression rather than 2x.  Of course, then
keeping the uncompressed CLOG files becomes required rather than
optional, but that's OK.  What bothers me about compressing by only 2x
is that the act of compressing is not free.  You have to read all the
chunks and then write out new chunks, and those chunks then compete
for each other in cache.  Who is to say that we're not better off just
reading the uncompressed data at that point?  At least then we have
only one copy of it.
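
Just to illustrate the arithmetic, here is a rough sketch (not a patch; the
names and layout are invented) of a one-summary-bit-per-8-XIDs scheme.
Eight XIDs occupy two CLOG bytes, and a byte of four commits is 0x55, so
the "all committed" test is simply that both bytes equal 0x55.  Sixteen
status bits collapse into one summary bit, i.e. 16x, and a clear bit just
means "consult the real CLOG":

/*
 * Illustrative sketch only: summarize a raw CLOG page (2 bits per xact)
 * into one bit per chunk of 8 xacts, set only when all 8 committed.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CLOG_PAGE_SIZE  8192                  /* 32,768 xacts per page */
#define CHUNKS_PER_PAGE (CLOG_PAGE_SIZE / 2)  /* 8 xacts = 2 CLOG bytes */
#define SUMMARY_BYTES   (CHUNKS_PER_PAGE / 8) /* 512 summary bytes per page */

/* Build the summary bitmap for one raw CLOG page. */
static void
summarize_clog_page(const uint8_t *clog_page, uint8_t *summary)
{
    memset(summary, 0, SUMMARY_BYTES);
    for (int chunk = 0; chunk < CHUNKS_PER_PAGE; chunk++)
    {
        /* 0x55 = four committed xacts, so two such bytes = 8 commits. */
        if (clog_page[2 * chunk] == 0x55 && clog_page[2 * chunk + 1] == 0x55)
            summary[chunk / 8] |= (uint8_t) (1 << (chunk % 8));
    }
}

/* Returns 1 only when the summary proves the xid committed. */
static int
summary_says_committed(const uint8_t *summary, uint32_t xid_offset_in_page)
{
    uint32_t chunk = xid_offset_in_page / 8;

    return (summary[chunk / 8] >> (chunk % 8)) & 1;
}

int
main(void)
{
    static uint8_t page[CLOG_PAGE_SIZE];
    static uint8_t summary[SUMMARY_BYTES];

    memset(page, 0x55, sizeof(page));   /* pretend everything committed */
    page[100] = 0x59;                   /* ...then abort one xact in byte 100 */

    summarize_clog_page(page, summary);

    /* Offset 16 sits in an all-committed chunk; offset 400 does not. */
    printf("offset 16:  %d\n", summary_says_committed(summary, 16));
    printf("offset 400: %d\n", summary_says_committed(summary, 400));
    return 0;
}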

> Note that
> we could still keep the uncompressed CLOG around for debugging purposes - the
> additional compressed version would require only 2^32/8 bytes = 512 MB in the
> worst case, which people who're serious about performance can very probably
> spare.

I don't think it'd be even that much, because we only ever use half
the XID space at a time, and often probably much less: the default
value of vacuum_freeze_table_age is only 150 million transactions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Page Checksums + Double Writes

From
Benedikt Grundmann
Date:
For what it's worth, here are the numbers on one of our biggest databases
(same system as I posted about separately wrt seq_scan_cost vs
random_page_cost).


0053 1001
00BA 1009
0055 1001
00B9 1020
0054 983
00BB 1010
0056 1001
00BC 1019
0069 0
00BD 1009
006A 224
00BE 1018
006B 1009
00BF 1008
006C 1008
00C0 1006
006D 1004
00C1 1014
006E 1016
00C2 1023
006F 1003
00C3 1012
0070 1011
00C4 1000
0071 1011
00C5 1002
0072 1005
00C6 982
0073 1009
00C7 996
0074 1013
00C8 973
0075 1002
00D1 987
0076 997
00D2 968
0077 1007
00D3 974
0078 1012
00D4 964
0079 994
00D5 981
007A 1013
00D6 964
007B 999
00D7 966
007C 1000
00D8 971
007D 1000
00D9 956
007E 1008
00DA 976
007F 1010
00DB 950
0080 1001
00DC 967
0081 1009
00DD 983
0082 1008
00DE 970
0083 988
00DF 965
0084 1007
00E0 984
0085 1012
00E1 1004
0086 1004
00E2 976
0087 996
00E3 941
0088 1008
00E4 960
0089 1003
00E5 948
008A 995
00E6 851
008B 1001
00E7 971
008C 1003
00E8 954
008D 982
00E9 938
008E 1000
00EA 931
008F 1008
00EB 956
0090 1009
00EC 960
0091 1013
00ED 962
0092 1006
00EE 933
0093 1012
00EF 956
0094 994
00F0 978
0095 1017
00F1 292
0096 1004
0097 1005
0098 1014
0099 1012
009A 994
0035 1003
009B 1007
0036 1004
009C 1010
0037 981
009D 1024
0038 1002
009E 1009
0039 998
009F 1011
003A 995
00A0 1015
003B 996
00A1 1018
003C 1013
00A5 1007
003D 1008
00A3 1016
003E 1007
00A4 1020
003F 989
00A7 375
0040 989
00A6 1010
0041 975
00A9 3
0042 994
00A8 0
0043 1010
00AA 1
0044 1007
00AB 1
0045 1008
00AC 0
0046 991
00AF 4
0047 1010
00AD 0
0048 997
00AE 0
0049 1002
00B0 5
004A 1004
00B1 0
004B 1012
00B2 0
004C 999
00B3 0
004D 1008
00B4 0
004E 1007
00B5 807
004F 1010
00B6 1007
0050 1004
00B7 1007
0051 1009
00B8 1006
0052 1005
0057 1008
00C9 994
0058 991
00CA 977
0059 1000
00CB 978
005A 998
00CD 944
005B 971
00CC 972
005C 1005
00CF 969
005D 1010
00CE 988
005E 1006
00D0 975
005F 1015
0060 989
0061 998
0062 1014
0063 1000
0064 991
0065 990
0066 1000
0067 947
0068 377
00A2 1011


On 23/12/11 14:23, Kevin Grittner wrote:
> Jeff Janes <jeff.janes@gmail.com> wrote:
>  
> > Could we get some major OLTP users to post their CLOG for
> > analysis?  I wouldn't think there would be many
> > security/proprietary issues with CLOG data.
>  
> FWIW, I got the raw numbers to do my quick check using this Ruby
> script (put together for me by Peter Brant).  If it is of any use to
> anyone else, feel free to use it and/or post any enhanced versions
> of it.
>  
> #!/usr/bin/env ruby
> 
> Dir.glob("*") do |file_name|
>   contents = File.read(file_name)
>   total = 
>     contents.enum_for(:each_byte).enum_for(:each_slice,
> 256).inject(0) do |count, chunk|
>       if chunk.all? { |b| b == 0x55 }
>         count + 1
>       else
>         count
>       end
>     end
>   printf "%s %d\n", file_name, total
> end
>  
> -Kevin
> 


Re: Page Checksums + Double Writes

From
"Kevin Grittner"
Date:
Benedikt Grundmann <bgrundmann@janestreet.com> wrote:
> For what's worth here are the numbers on one of our biggest
> databases (same system as I posted about separately wrt
> seq_scan_cost vs random_page_cost).
That would be an 88.4% hit rate on the summarized data.
-Kevin