Thread: 16-bit page checksums for 9.2

16-bit page checksums for 9.2

From
Simon Riggs
Date:
After the various recent discussions on list, I present what I believe
to be a working patch implementing 16-bit checksums on all buffer
pages.

page_checksums = on | off (default)

No block format changes are required; checksums are optional, so some
blocks may have a checksum and others not. This means that the patch
will allow pg_upgrade.

That capability also limits us to 16-bit checksums. Fletcher's 16 is
used in this patch and seems rather quick, though that is easily
replaceable/tuneable if desired, perhaps even as a parameter enum.
This patch is a step on the way to 32-bit checksums in a future
redesign of the page layout, though that is not a required future
change, nor does this prevent that.

Checksum is set whenever the buffer is flushed to disk, and checked
when the page is read in from disk. It is not set at other times, and
for much of the time may not be accurate. This follows earlier
discussions from 2010-12-22, and is discussed in detail in patch
comments.
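
As a minimal sketch of where those two hooks sit (page_checksums and
PageSetVerificationInfo() are from the patch discussion; the read-side
helper names here are placeholders, not the patch's actual code):

#include <stdbool.h>

typedef char *Page;

extern bool page_checksums;                      /* the GUC described above */
extern void PageSetVerificationInfo(Page page);  /* set-side routine from the patch */
extern bool PageVerificationInfoOK(Page page);   /* hypothetical read-side counterpart */
extern void report_corruption(Page page);        /* stand-in for elog(ERROR, ...) */

/* Write side: stamp the checksum only when the buffer goes to disk. */
static void
on_buffer_flush(Page page)
{
    if (page_checksums)
        PageSetVerificationInfo(page);
    /* ... then smgrwrite() the page ... */
}

/* Read side: verify as soon as the page comes back from disk. */
static void
on_buffer_read(Page page)
{
    /* ... smgrread() the page, then ... */
    if (page_checksums && !PageVerificationInfoOK(page))
        report_corruption(page);
}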

Note it works with buffer manager pages, which includes shared and
local data buffers, but not SLRU pages (yet? an easy addition but
needs other discussion around contention).

Note that all this does is detect bit errors on the page; it doesn't
identify where the error is or how bad it is, and certainly not what
caused it or when it happened.

The main body of the patch involves changes to bufpage.c/.h, so this
differs completely from the VMware patch, for technical reasons. Also
included is a LockBufferForHints() facility, used in various AMs, to
avoid the case where hints are set during calculation of the
checksum.

In my view this is a fully working, committable patch, but I'm not in
a hurry to commit it given the holiday season.

Hopefully it's a gift, not a turkey, and therefore a challenge for
some to prove that wrong. Enjoy either way,

Merry Christmas,

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: 16-bit page checksums for 9.2

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> After the various recent discussions on list, I present what I believe
> to be a working patch implementing 16-bit checksums on all buffer
> pages.

I think locking around hint-bit-setting is likely to be unworkable from
a performance standpoint.  I also wonder whether it might not result in
deadlocks.

Also, as far as I can see this patch usurps the page version field,
which I find unacceptably short-sighted.  Do you really think this is
the last page layout change we'll ever make?
        regards, tom lane


Re: 16-bit page checksums for 9.2

From
Andres Freund
Date:
On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > After the various recent discussions on list, I present what I believe
> > to be a working patch implementing 16-bit checksums on all buffer
> > pages.
> 
> I think locking around hint-bit-setting is likely to be unworkable from
> a performance standpoint.  I also wonder whether it might not result in
> deadlocks.
Why don't you use the same tricks as the former patch and copy the buffer,
compute the checksum on that, and then write out that copy (you can even do
both at the same time)? I have a hard time believing that the additional copy
is more expensive than the locking.
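
For what it's worth, a rough sketch of that idea (all helper names here
are made up; how the checksum field itself is excluded or zeroed during
the calculation is a separate detail, discussed further down the thread):

#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

extern uint16_t page_checksum(const char *page);             /* whatever algorithm is chosen */
extern void page_store_checksum(char *page, uint16_t sum);   /* hypothetical header update */

/*
 * Snapshot the shared buffer into a private copy, checksum and write the
 * copy.  Hint bits may keep changing in the shared buffer while this runs;
 * that is fine, because only the self-consistent copy reaches disk.
 */
static ssize_t
write_block_copy(int fd, off_t offset, const char *shared_buf)
{
    char copy[BLCKSZ];

    memcpy(copy, shared_buf, BLCKSZ);
    page_store_checksum(copy, page_checksum(copy));
    return pwrite(fd, copy, BLCKSZ, offset);
}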


Andres


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Sat, Dec 24, 2011 at 2:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> After the various recent discussions on list, I present what I believe
>> to be a working patch implementing 16-bit checksums on all buffer
>> pages.
>
> I think locking around hint-bit-setting is likely to be unworkable from
> a performance standpoint.

Anyone choosing page_checksums = on has already made a
performance-reducing decision in favour of reliability. So they
understand and accept the impact. There is no locking when the
parameter is off.

A safe alternative is to use LockBuffer, which has a much greater
performance impact.

I did think about optimistically checking after the write, but if we
crash at that point we will then see a block that has an invalid
checksum. It's faster, but you may get a checksum failure if you crash
- and since one important aspect of this is to spot problems in case
of a crash, that seems unacceptable.

> I also wonder whether it might not result in
> deadlocks.

If you can see how, please say. I can't see any ways for that myself.

> Also, as far as I can see this patch usurps the page version field,
> which I find unacceptably short-sighted.  Do you really think this is
> the last page layout change we'll ever make?

No, I don't. I hope and expect the next page layout change to
reintroduce such a field.

But since we're agreed now that upgrading is important, changing page
format isn't likely to be happening until we get an online upgrade
process. So future changes are much less likely. If they do happen, we
have some flag bits spare that can be used to indicate later versions.
It's not the prettiest thing in the world, but it's a small ugliness
in return for an important feature. If there was a way without that, I
would have chosen it.

pg_filedump will need to be changed more than normal, but the version
isn't used anywhere else in the server code.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <andres@anarazel.de> wrote:
> On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>> > After the various recent discussions on list, I present what I believe
>> > to be a working patch implementing 16-bit checksums on all buffer
>> > pages.
>>
>> I think locking around hint-bit-setting is likely to be unworkable from
>> a performance standpoint.  I also wonder whether it might not result in
>> deadlocks.

> Why don't you use the same tricks as the former patch and copy the buffer,
> compute the checksum on that, and then write out that copy (you can even do
> both at the same time). I have a hard time believing that the additional copy
> is more expensive than the locking.

We would copy every time we write, yet lock only every time we set hint bits.

If that option is favoured, I'll write another version after Christmas.

ISTM we can't write and copy at the same time because the checksum is
not a trailer field.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Sat, Dec 24, 2011 at 3:51 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:

> Not an expert here, but after reading through the patch quickly, I
> don't see anything that changes the torn-page problem though, right?
>
> Hint bits aren't wal-logged, and FPW isn't forced on the hint-bit-only
> dirty, right?

Checksums merely detect a problem, whereas FPWs correct a problem if
it happens, but only in crash situations.

So this does nothing to remove the need for FPWs, though checksum
detection could be used for double write buffers also.

Checksums work even when there is no crash, so if your disk goes bad
and corrupts data then you'll know about it as soon as it happens.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Andres Freund
Date:
On Saturday, December 24, 2011 05:01:02 PM Simon Riggs wrote:
> On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <andres@anarazel.de> wrote:
> > On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote:
> >> Simon Riggs <simon@2ndQuadrant.com> writes:
> >> > After the various recent discussions on list, I present what I believe
> >> > to be a working patch implementing 16-bit checksums on all buffer
> >> > pages.
> >> 
> >> I think locking around hint-bit-setting is likely to be unworkable from
> >> a performance standpoint.  I also wonder whether it might not result in
> >> deadlocks.
> > 
> > Why don't you use the same tricks as the former patch and copy the
> > buffer, compute the checksum on that, and then write out that copy (you
> > can even do both at the same time). I have a hard time believing that
> > the additional copy is more expensive than the locking.
> 
> We would copy every time we write, yet lock only every time we set hint
> bits.
Isn't setting hint bits also a rather frequent operation? At least in a well-
cached workload where most writeout happens due to checkpoints.

> If that option is favoured, I'll write another version after Christmas.
Seems less complicated (wrt deadlocking et al) to me. But I haven't read your
patch, so I will shut up now ;)

Andres


Re: 16-bit page checksums for 9.2

From
Greg Stark
Date:
On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Checksums merely detect a problem, whereas FPWs correct a problem if
> it happens, but only in crash situations.
>
> So this does nothing to remove the need for FPWs, though checksum
> detection could be used for double write buffers also.

This is missing the point. If you have a torn page on a page that is
only dirty due to hint bits then the checksum will show a spurious
checksum failure. It will "detect" a problem that isn't there.

The problem is that there is no WAL indicating the hint bit change.
And if the torn page includes the new checksum but not the new hint
bit or vice versa it will be a checksum mismatch.

The strategy discussed in the past was moving all the hint bits to a
common area and skipping them in the checksum. No amount of double
writing or buffering or locking will avoid this problem.

-- 
greg


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark <stark@mit.edu> wrote:
> On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Checksums merely detect a problem, whereas FPWs correct a problem if
>> it happens, but only in crash situations.
>>
>> So this does nothing to remove the need for FPWs, though checksum
>> detection could be used for double write buffers also.
>
> This is missing the point. If you have a torn page on a page that is
> only dirty due to hint bits then the checksum will show a spurious
> checksum failure. It will "detect" a problem that isn't there.

It will detect a problem that *is* there, but one you are classifying
as a non-problem because it is a correctable or acceptable bit
error. Given that acceptable bit errors on hints cover no more than 1%
of a block, the great likelihood is that the bit error is unacceptable
in any case, so false-positive page errors are in fact very rare.

Any bit error is an indicator of problems on the external device, so
many would regard any bit error as unacceptable.

> The problem is that there is no WAL indicating the hint bit change.
> And if the torn page includes the new checksum but not the new hint
> bit or vice versa it will be a checksum mismatch.
>
> The strategy discussed in the past was moving all the hint bits to a
> common area and skipping them in the checksum. No amount of double
> writing or buffering or locking will avoid this problem.

I completely agree we should do this, but we are unable to do it now,
so this patch is a stop-gap and provides a much requested feature
*now*.

In the future, we will be able to tell the difference between an
acceptable and an unacceptable bit error. Right now, all we have is
the ability to detect a bit error, and as I point out above that
solves 99% of the problem, at least.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
"Kevin Grittner"
Date:
> Simon Riggs  wrote:
> On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark  wrote:
>> The problem is that there is no WAL indicating the hint bit
>> change. And if the torn page includes the new checksum but not the
>> new hint bit or vice versa it will be a checksum mismatch.

With *just* this patch, true.  An OS crash or hardware failure could
sometimes create an invalid page.

>> The strategy discussed in the past was moving all the hint bits to
>> a common area and skipping them in the checksum. No amount of
>> double writing or buffering or locking will avoid this problem.

I don't believe that.  Double-writing is a technique to avoid torn
pages, but it requires a checksum to work.  This chicken-and-egg
problem requires the checksum to be implemented first.

> I completely agree we should do this, but we are unable to do it
> now, so this patch is a stop-gap and provides a much requested
> feature *now*.

Yes, for people who trust their environment to prevent torn pages, or
who are willing to tolerate one bad page per OS crash in return for
quick reporting of data corruption from unreliable file systems, this
is a good feature even without double-writes.

> In the future, we will be able to tell the difference between an
> acceptable and an unacceptable bit error.

A double-write patch would provide that, and it sounds like VMware
has a working patch for that which is being polished for submission.
It would need to wait until we have some consensus on the checksum
patch before it can be finalized.  I'll try to review the patch from
this thread today, to do what I can to move that along.
-Kevin


Re: 16-bit page checksums for 9.2

From
Martijn van Oosterhout
Date:
On Sat, Dec 24, 2011 at 04:01:02PM +0000, Simon Riggs wrote:
> On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <andres@anarazel.de> wrote:
> > Why don't you use the same tricks as the former patch and copy the buffer,
> > compute the checksum on that, and then write out that copy (you can even do
> > both at the same time). I have a hard time believing that the additional copy
> > is more expensive than the locking.
>
> ISTM we can't write and copy at the same time because the cheksum is
> not a trailer field.

Of course you can. If the checksum is in the trailer field you get the
nice property that the whole block has a constant checksum. However, if
you store the checksum elsewhere you just need to change the checking
algorithm to copy the checksum out, zero those bytes and run the
checksum and compare with the extracted checksum.

Not pretty, but I don't think it makes a difference in performance.
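
Something along these lines, say (the field offset and the helper name
are made up purely for illustration):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ           8192
#define PD_CHECKSUM_OFF  8        /* hypothetical offset of the checksum in the header */

extern uint16_t page_checksum(const char *page);   /* whatever algorithm is chosen */

/* Extract the stored checksum, zero those bytes, recompute, compare. */
static bool
page_checksum_matches(const char *page)
{
    char     copy[BLCKSZ];
    uint16_t stored;

    memcpy(copy, page, BLCKSZ);
    memcpy(&stored, copy + PD_CHECKSUM_OFF, sizeof(stored));
    memset(copy + PD_CHECKSUM_OFF, 0, sizeof(stored));
    return page_checksum(copy) == stored;
}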

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.  -- Arthur Schopenhauer

Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Sun, Dec 25, 2011 at 5:08 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark <stark@mit.edu> wrote:
>> On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> Checksums merely detect a problem, whereas FPWs correct a problem if
>>> it happens, but only in crash situations.
>>>
>>> So this does nothing to remove the need for FPWs, though checksum
>>> detection could be used for double write buffers also.
>>
>> This is missing the point. If you have a torn page on a page that is
>> only dirty due to hint bits then the checksum will show a spurious
>> checksum failure. It will "detect" a problem that isn't there.
>
> It will detect a problem that *is* there, but one you are classifying
> it as a non-problem because it is a correctable or acceptable bit
> error.

I don't agree with this.  We don't WAL-log hint bit changes precisely
because it's OK if they make it to disk and it's OK if they don't.
Given that, I don't see how we can say that writing out only half of a
page that has had hint bit changes is a problem.  It's not.

(And if it is, then we ought to WAL-log all such changes regardless of
whether CRCs are in use.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Heikki Linnakangas
Date:
On 25.12.2011 15:01, Kevin Grittner wrote:
> I don't believe that.  Double-writing is a technique to avoid torn
> pages, but it requires a checksum to work.  This chicken-and-egg
> problem requires the checksum to be implemented first.

I don't think double-writes require checksums on the data pages 
themselves, just on the copies in the double-write buffers. In the 
double-write buffer, you'll need some extra information per-page anyway, 
like a relfilenode and block number that indicates which page it is in 
the buffer.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 25.12.2011 15:01, Kevin Grittner wrote:
>>
>> I don't believe that.  Double-writing is a technique to avoid torn
>> pages, but it requires a checksum to work.  This chicken-and-egg
>> problem requires the checksum to be implemented first.
>
>
> I don't think double-writes require checksums on the data pages themselves,
> just on the copies in the double-write buffers. In the double-write buffer,
> you'll need some extra information per-page anyway, like a relfilenode and
> block number that indicates which page it is in the buffer.

How would you know when to look in the double write buffer?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Sun, Dec 25, 2011 at 1:01 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

> This chicken-and-egg
> problem requires the checksum to be implemented first.

v2 of checksum patch, using a conditional copy if checksumming is
enabled, so locking is removed.

Thanks to Andres for thwacking me with the cluestick, though I have
used a simple copy rather than a copy & calc.

Tested using make installcheck with parameter on/off, then restart and
vacuumdb to validate all pages.

Reviews, objections, user interface tweaks all welcome.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: 16-bit page checksums for 9.2

From
Heikki Linnakangas
Date:
On 28.12.2011 01:39, Simon Riggs wrote:
> On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> On 25.12.2011 15:01, Kevin Grittner wrote:
>>>
>>> I don't believe that.  Double-writing is a technique to avoid torn
>>> pages, but it requires a checksum to work.  This chicken-and-egg
>>> problem requires the checksum to be implemented first.
>>
>>
>> I don't think double-writes require checksums on the data pages themselves,
>> just on the copies in the double-write buffers. In the double-write buffer,
>> you'll need some extra information per-page anyway, like a relfilenode and
>> block number that indicates which page it is in the buffer.
>
> How would you know when to look in the double write buffer?

You scan the double-write buffer, and every page in the double write 
buffer that has a valid checksum, you copy to the main storage. There's 
no need to check validity of pages in the main storage.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Wed, Dec 28, 2011 at 7:42 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

>> How would you know when to look in the double write buffer?
>
>
> You scan the double-write buffer, and every page in the double write buffer
> that has a valid checksum, you copy to the main storage. There's no need to
> check validity of pages in the main storage.

OK, then we are talking at cross purposes. Double write buffers, in
the way you explain them, allow us to remove full page writes. They
clearly don't do anything to check page validity on read. Torn pages
are not the only fault we wish to protect against... and the double
writes idea is orthogonal to the idea of checksums.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Heikki Linnakangas
Date:
On 28.12.2011 11:22, Simon Riggs wrote:
> On Wed, Dec 28, 2011 at 7:42 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>
>>> How would you know when to look in the double write buffer?
>>
>>
>> You scan the double-write buffer, and every page in the double write buffer
>> that has a valid checksum, you copy to the main storage. There's no need to
>> check validity of pages in the main storage.
>
> OK, then we are talking at cross purposes. Double write buffers, in
> the way you explain them allow us to remove full page writes. They
> clearly don't do anything to check page validity on read. Torn pages
> are not the only fault we wish to correct against... and the double
> writes idea is orthogonal to the idea of checksums.

The reason we're talking about double write buffers in this thread is 
that double write buffers can be used to solve the problem with hint 
bits and checksums.

You're right, though, that it's academic whether double write buffers
can be used without checksums on data pages, if the whole point of the
exercise is to make it possible to have checksums on data pages.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Wed, Dec 28, 2011 at 5:45 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 28.12.2011 11:22, Simon Riggs wrote:
>>
>> On Wed, Dec 28, 2011 at 7:42 AM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com>  wrote:
>>
>>>> How would you know when to look in the double write buffer?
>>>
>>>
>>>
>>> You scan the double-write buffer, and every page in the double write
>>> buffer
>>> that has a valid checksum, you copy to the main storage. There's no need
>>> to
>>> check validity of pages in the main storage.
>>
>>
>> OK, then we are talking at cross purposes. Double write buffers, in
>> the way you explain them allow us to remove full page writes. They
>> clearly don't do anything to check page validity on read. Torn pages
>> are not the only fault we wish to correct against... and the double
>> writes idea is orthogonal to the idea of checksums.
>
>
> The reason we're talking about double write buffers in this thread is that
> double write buffers can be used to solve the problem with hint bits and
> checksums.

Torn pages are not the only problem we need to detect.

You said "You scan the double write buffer...". When exactly would you do that?

Please explain how a double write buffer detects problems that do not
occur as the result of a crash.

We don't have much time, so please be clear and lucid.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
"Kevin Grittner"
Date:
> Heikki Linnakangas  wrote:
> On 28.12.2011 01:39, Simon Riggs wrote:
>> On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas
>>  wrote:
>>> On 25.12.2011 15:01, Kevin Grittner wrote:
>>>>
>>>> I don't believe that. Double-writing is a technique to avoid
>>>> torn pages, but it requires a checksum to work. This chicken-
>>>> and-egg problem requires the checksum to be implemented first.
>>>
>>> I don't think double-writes require checksums on the data pages
>>> themselves, just on the copies in the double-write buffers. In
>>> the double-write buffer, you'll need some extra information per-
>>> page anyway, like a relfilenode and block number that indicates
>>> which page it is in the buffer.

You are clearly right -- if there is no checksum in the page itself,
you can put one in the double-write metadata.  I've never seen that
discussed before, but I'm embarrassed that it never occurred to me.

>> How would you know when to look in the double write buffer?
>
> You scan the double-write buffer, and every page in the double
> write buffer that has a valid checksum, you copy to the main
> storage. There's no need to check validity of pages in the main
> storage.

Right.  I'll recap my understanding of double-write (from memory --
if there's a material error or omission, I hope someone will correct
me).

The write-ups I've seen on double-write techniques have all the
writes go first to the double-write buffer (a single, sequential file
that stays around).  This is done as sequential writing to a file
which is overwritten pretty frequently, making the writes to a
controller very fast, and a BBU write-back cache unlikely to actually
write to disk very often.  On good server-quality hardware, it should
be blasting RAM-to-RAM very efficiently.  The file is fsync'd (like I
said, hopefully to BBU cache), then each page in the double-write
buffer is written to the normal page location, and that is fsync'd.
Once that is done, the database writes have no risk of being torn,
and the double-write buffer is marked as empty.  This all happens at
the point when you would be writing the page to the database, after
the WAL-logging.

On crash recovery you read through the double-write buffer from the
start and write the pages which look good (including a good checksum)
to the database before replaying WAL.  If you find a checksum error
in processing the double-write buffer, you assume that you never got
as far as the fsync of the double-write buffer, which means you never
started writing the buffer contents to the database, which means
there can't be any torn pages there.  If you get to the end and
fsync, you can be sure any torn pages from a previous attempt to
write to the database itself have been overwritten with the good copy
in the double-write buffer.  Either way, you move on to WAL
processing.

You wind up with a database free of torn pages before you apply WAL.
full_page_writes to the WAL are not needed as long as double-write is
used for any pages which would have been written to the WAL.  If
checksums were written to the double-write buffer metadata instead of
adding them to the page itself, this could be implemented alone.  It
would probably allow a modest speed improvement over using
full_page_writes and would eliminate those full-page images from the
WAL files, making them smaller.

If we do add a checksum to the page header, that could be used for
testing for torn pages in the double-write buffer without needing a
redundant calculation for double-write.  With no torn pages in the
actual database, checksum failures there would never be false
positives.  To get this right for a checksum in the page header,
double-write would need to be used for all cases where
full_page_writes now are used (i.e., the first write of a page after
a checkpoint), and for all unlogged writes (e.g., hint-bit-only
writes).  There would be no correctness problem for always using
double-write, but it would be unnecessary overhead for other page
writes, which I think we can avoid.
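
To put that recap in rough code form (everything below is illustrative
only; none of these are real PostgreSQL functions or structures):

#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins, not real PostgreSQL APIs. */
typedef struct { uint32_t relfilenode; uint32_t blockno; } BlockTag;
typedef struct { BlockTag tag; char page[8192]; uint16_t stored_checksum; } DwEntry;

extern uint16_t checksum(const char *page);
extern void dw_file_append(BlockTag tag, const char *page, uint16_t sum);
extern void dw_file_fsync(void);
extern void dw_file_reset(void);
extern bool dw_file_next(DwEntry *e);
extern void heap_write(BlockTag tag, const char *page);
extern void heap_fsync(void);

/* Write path: called where the pages would otherwise just be written out. */
void
double_write_batch(BlockTag *tags, char **pages, int n)
{
    for (int i = 0; i < n; i++)                     /* 1. sequential writes to the DW file */
        dw_file_append(tags[i], pages[i], checksum(pages[i]));
    dw_file_fsync();                                /* hopefully absorbed by BBU cache */

    for (int i = 0; i < n; i++)                     /* 2. writes to the normal page locations */
        heap_write(tags[i], pages[i]);
    heap_fsync();                                   /* after this, no torn page can remain */

    dw_file_reset();                                /* 3. mark the DW buffer empty */
}

/* Crash recovery: runs before WAL replay. */
void
double_write_recover(void)
{
    DwEntry e;

    while (dw_file_next(&e))
    {
        if (checksum(e.page) != e.stored_checksum)
            break;                  /* DW file torn => heap writes never started, nothing to fix */
        heap_write(e.tag, e.page);  /* overwrite any possibly-torn heap copy */
    }
    heap_fsync();                   /* then move on to WAL replay */
}
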
-Kevin


Re: 16-bit page checksums for 9.2

From
"Kevin Grittner"
Date:
> Heikki Linnakangas  wrote:
> Simon Riggs wrote:
>> OK, then we are talking at cross purposes. Double write buffers,
>> in the way you explain them allow us to remove full page writes.
>> They clearly don't do anything to check page validity on read.
>> Torn pages are not the only fault we wish to correct against...
>> and the double writes idea is orthogonal to the idea of checksums.
> 
> The reason we're talking about double write buffers in this thread
> is that double write buffers can be used to solve the problem with
> hint bits and checksums.

Exactly.  Every time the issue of page checksums is raised, there are
objections because OS or hardware crashes could cause torn pages for
hint-bit-only writes which would be treated as serious errors
(potentially indicating hardware failure) when they are in fact
expected and benign.  Some time before the thread dies, someone
generally points out that double-write technology would be a graceful
way to handle that, with the side benefit of smaller WAL files.  All
available evidence suggests it would also allow a small performance
improvement, although I hesitate to emphasize that aspect of it; the
other benefits fully justify the effort without that.

I do feel there is value in a page checksum patch even without torn
page protection.  The discussion on the list has convinced me that a
failed checksum should be treated as seriously as other page format
errors, rather than as a warning, even though (in the absence of torn
page protection) torn hint-bit-only page writes would be benign.

As an example of how this might be useful, consider our central
databases which contain all the detail replicated from the circuit
court databases in all the counties.  These are mission-critical, so
we have redundant servers in separate buildings.  At one point, one
of them experienced hardware problems and we started seeing invalid
pages.  Since we can shift the load between these servers without
down time, we moved all applications to other servers, and
investigated.  Now, it's possible that for some time before we got
errors on the bad pages, there could have been subtle corruption
which didn't generate errors but presented bad data on our web site.
A page checksum would help prevent that sort of problem, and a
post-crash false positive might waste a little time in investigation,
but that cost would be far outweighed by the benefit of better
accuracy guarantees.

Of course, it will be a big plus if we can roll this out in 9.2 in
conjunction with a double-write feature.  Not only will double-write
probably be a bit faster than full_page_writes in the WAL log, but it
will allow protection against torn pages on hint-bit-only writes
without adding those writes to the WAL or doing any major
rearrangement of where they sit that would break pg_upgrade.  It
would be nice not to have to put all sorts of caveats and
explanations into the docs about how a checksum error might be benign
due to hint bit writes.
-Kevin


Re: 16-bit page checksums for 9.2

From
Noah Misch
Date:
On Thu, Dec 29, 2011 at 11:08:43AM -0600, Kevin Grittner wrote:
> > Heikki Linnakangas  wrote:
> > Simon Riggs wrote:
>  
> >> OK, then we are talking at cross purposes. Double write buffers,
> >> in the way you explain them allow us to remove full page writes.
> >> They clearly don't do anything to check page validity on read.
> >> Torn pages are not the only fault we wish to correct against...
> >> and the double writes idea is orthogonal to the idea of checksums.
> > 
> > The reason we're talking about double write buffers in this thread
> > is that double write buffers can be used to solve the problem with
> > hint bits and checksums.

> Of course, it will be a big plus if we can roll this out in 9.2 in
> conjunction with a double-write feature.  Not only will double-write
> probably be a bit faster than full_page_writes in the WAL log, but it
> will allow protection against torn pages on hint-bit-only writes
> without adding those writes to the WAL or doing any major
> rearrangement of where they sit that would break pg_upgrade.

[Thanks for your recent thread summaries.]

A double-write buffer, like a WAL-logged full-page image, is a technique for
performing atomic writes wider than those automatically provided by components
further down the storage stack.  The two strategies have different performance
characteristics, and we're told that a double-write buffer would better serve us
overall.  However, its benefits would not be *greater* for hint-only writes than
for any other write.  For that reason, I think we should consider these changes
independently.  With page checksums enabled, remove the hazard of torn hint-only
writes by ensuring that a WAL FPI has flushed since the last checkpoint.  When
necessary, emit an FPI-only record.  Separately, optimize first-since-checkpoint
writes by replacing FPIs with double-write buffers.  The double-write patch will
reduce the added WAL of the checksum/safe-hint-updates patch to zero.  If the
double-writes patch founders, we'll just have more-costly, yet equally reliable,
page checksums.
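
A rough sketch of that hint-bit path (every name here is invented for
illustration, not an existing PostgreSQL function):

#include <stdbool.h>

typedef int Buffer;                                     /* simplified stand-in */

extern bool page_checksums;
extern bool fpi_emitted_since_checkpoint(Buffer buf);   /* hypothetical bookkeeping */
extern void xlog_insert_fpi_only(Buffer buf);           /* hypothetical FPI-only record */
extern void set_hint_bits(Buffer buf);
extern void mark_buffer_dirty(Buffer buf);

/* Before a hint-only dirtying counts, make sure a full-page image covers it. */
static void
set_hint_bits_safely(Buffer buf)
{
    if (page_checksums && !fpi_emitted_since_checkpoint(buf))
        xlog_insert_fpi_only(buf);
    set_hint_bits(buf);
    mark_buffer_dirty(buf);
}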

nm


Re: 16-bit page checksums for 9.2

From
Ants Aasma
Date:
On Thu, Dec 29, 2011 at 6:44 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> positives.  To get this right for a checksum in the page header,
> double-write would need to be used for all cases where
> full_page_writes now are used (i.e., the first write of a page after
> a checkpoint), and for all unlogged writes (e.g., hint-bit-only
> writes).  There would be no correctness problem for always using
> double-write, but it would be unnecessary overhead for other page
> writes, which I think we can avoid.

Unless I'm missing something, double-writes are needed for all writes,
not only the first page after a checkpoint. Consider this sequence of
events:

1. Checkpoint
2. Double-write of page A (DW buffer write, sync, heap write)
3. Sync of heap, releasing DW buffer for new writes.
 ... some time goes by
4. Regular write of page A
5. OS writes one part of page A
6. Crash!

Now recovery comes along, page A is broken in the heap with no
double-write buffer backup nor anything to recover it by in the WAL.

--
Ants Aasma


Re: 16-bit page checksums for 9.2

From
Nicolas Barbier
Date:
2011/12/30 Ants Aasma <ants.aasma@eesti.ee>:

> On Thu, Dec 29, 2011 at 6:44 PM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
>
>> positives.  To get this right for a checksum in the page header,
>> double-write would need to be used for all cases where
>> full_page_writes now are used (i.e., the first write of a page after
>> a checkpoint), and for all unlogged writes (e.g., hint-bit-only
>> writes).  There would be no correctness problem for always using
>> double-write, but it would be unnecessary overhead for other page
>> writes, which I think we can avoid.
>
> Unless I'm missing something, double-writes are needed for all writes,
> not only the first page after a checkpoint. Consider this sequence of
> events:
>
> 1. Checkpoint
> 2. Double-write of page A (DW buffer write, sync, heap write)
> 3. Sync of heap, releasing DW buffer for new writes.
>  ... some time goes by
> 4. Regular write of page A
> 5. OS writes one part of page A
> 6. Crash!
>
> Now recovery comes along, page A is broken in the heap with no
> double-write buffer backup nor anything to recover it by in the WAL.

I guess the assumption is that the write in (4) is either backed by
the WAL, or made safe by double writing. ISTM that such reasoning is
only correct if the change that is expressed by the WAL record can be
applied in the context of inconsistent (i.e., partially written)
pages, which I assume is not the case (excuse my ignorance regarding
such basic facts).

So I think you are right.

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Thu, Dec 29, 2011 at 4:44 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
>> Heikki Linnakangas  wrote:
>> On 28.12.2011 01:39, Simon Riggs wrote:
>>> On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas
>>>  wrote:
>>>> On 25.12.2011 15:01, Kevin Grittner wrote:
>>>>>
>>>>> I don't believe that. Double-writing is a technique to avoid
>>>>> torn pages, but it requires a checksum to work. This chicken-
>>>>> and-egg problem requires the checksum to be implemented first.
>>>>
>>>> I don't think double-writes require checksums on the data pages
>>>> themselves, just on the copies in the double-write buffers. In
>>>> the double-write buffer, you'll need some extra information per-
>>>> page anyway, like a relfilenode and block number that indicates
>>>> which page it is in the buffer.
>
> You are clearly right -- if there is no checksum in the page itself,
> you can put one in the double-write metadata.  I've never seen that
> discussed before, but I'm embarrassed that it never occurred to me.

Heikki's idea for double writes works well. It solves the problems of
torn pages in a way that would make FPW redundant.

However, I don't see that it provides protection across non-crash
write problems. We know we have these since many systems have run
without a crash for years and yet still experience corrupt data.

Double writes do not require page checksums but neither do they
replace page checksums.

So I think we need page checksums plus either FPWs or double writes.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
"Kevin Grittner"
Date:
> Nicolas Barbier  wrote:
> 2011/12/30 Ants Aasma :
>> Kevin Grittner  wrote:
>>
>>> positives. To get this right for a checksum in the page header,
>>> double-write would need to be used for all cases where
>>> full_page_writes now are used (i.e., the first write of a page
>>> after a checkpoint), and for all unlogged writes (e.g.,
>>> hint-bit-only writes). There would be no correctness problem for
>>> always using double-write, but it would be unnecessary overhead
>>> for other page writes, which I think we can avoid.
>>
>> Unless I'm missing something, double-writes are needed for all
>> writes, not only the first page after a checkpoint. Consider this
>> sequence of events:
>>
>> 1. Checkpoint
>> 2. Double-write of page A (DW buffer write, sync, heap write)
>> 3. Sync of heap, releasing DW buffer for new writes.
>> ... some time goes by
>> 4. Regular write of page A
>> 5. OS writes one part of page A
>> 6. Crash!
>>
>> Now recovery comes along, page A is broken in the heap with no
>> double-write buffer backup nor anything to recover it by in the
>> WAL.
> 
> I guess the assumption is that the write in (4) is either backed by
> the WAL, or made safe by double writing. ISTM that such reasoning
> is only correct if the change that is expressed by the WAL record
> can be applied in the context of inconsistent (i.e., partially
> written) pages, which I assume is not the case (excuse my ignorance
> regarding such basic facts).
> 
> So I think you are right.

Hmm.  It appears that I didn't think that through all the way.  I see
two alternatives.

(1)  We don't eliminate full_page_writes and we only need to use
double-writes for unlogged writes.

(2)  We double-write all writes and on recovery we only apply WAL to
a page from pd_lsn onward.  We would start from the same point and
follow the same rules except that when we read a page and find a
pd_lsn past the location of the record we are applying, we do nothing
because we are 100% sure everything to that point is safely written
and not torn.  full_page_writes to WAL would not be needed.
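
For (2), the redo-side rule would be roughly the following (names
simplified; not actual server code):

#include <stdint.h>

typedef char *Page;
typedef struct { uint64_t lsn; /* ... decoded record ... */ } RedoRecord;   /* simplified */

extern uint64_t PageGetLSN(Page page);
extern void     PageSetLSN(Page page, uint64_t lsn);
extern void     apply_record(Page page, RedoRecord *rec);

/* Apply a WAL record only if the page is older than the record. */
static void
redo_apply(Page page, RedoRecord *rec)
{
    if (PageGetLSN(page) >= rec->lsn)
        return;                 /* page already holds this change, and is known not torn */
    apply_record(page, rec);
    PageSetLSN(page, rec->lsn);
}
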
-Kevin


Re: 16-bit page checksums for 9.2

From
"Kevin Grittner"
Date:
> Simon Riggs  wrote:
> Kevin Grittner  wrote:
>> if there is no checksum in the page itself, you can put one in the
>> double-write metadata.
> However, I don't see that it provides protection across non-crash
> write problems. We know we have these since many systems have run
> without a crash for years and yet still experience corrupt data.

Agreed.  I don't think anyone has tried to assert it solves the same
problems that checksums solve -- it is a high-performance way to
solve some of the problems that an in-page checksum *creates* without
breaking pg_upgrade.

> Double writes do not require page checksums but neither do they
> replace page checksums.

To nit-pick: double writes require a page checksum, but (as Heikki
pointed out) they don't require it to be stored in the page.  If
there *is* one stored in the page, it probably makes sense to use it.

> So I think we need page checksums plus either FPWs or double
> writes.

Adding checksums by themselves creates a risk of false positive
corrupted page indications following an OS or hardware crash.
Additional FPWs or a new double-write mechanism are two of myriad
possible solutions to that.  If it is going to be addressed for 9.2,
I believe they're the two most reasonable, especially from the POV of
pg_upgrade.

So, while they should be separate patches, they complement each other;
each makes the other perform better, and they should share some code.
-Kevin


Re: 16-bit page checksums for 9.2

From
Aidan Van Dyk
Date:
On Thu, Dec 29, 2011 at 11:44 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

> You wind up with a database free of torn pages before you apply WAL.
> full_page_writes to the WAL are not needed as long as double-write is
> used for any pages which would have been written to the WAL.  If
> checksums were written to the double-buffer metadata instead of
> adding them to the page itself, this could be implemented alone.  It
> would probably allow a modest speed improvement over using
> full_page_writes and would eliminate those full-page images from the
> WAL files, making them smaller.

Correct.  So now lots of people seem to be jumping on the double-write
bandwagon and looking at some of the things it promises: all writes are
durable.

This solves 2 big issues:
 - Remove torn-page problem
 - Remove FPW from WAL

That up front looks pretty attractive.  But we need to look at the
tradeoffs, and then decide (benchmark, anyone?).

Remember, postgresql is a double-write system right now.  The 1st,
checksummed write is the FPW in WAL.  It's fsynced.  And the 2nd synced
write is when the file is synced during checkpoint.

So, postgresql currently has an optimization now that not every write
has *requirements* for atomic, instant durability.  And so postgresql
gets to do lots of writes to the OS cache and *not* request them to
be instantly synced.  And then at some point, when it's ready to clear
the 1st checksummed write, make sure every write is synced.  And lots
of work went into PG recently to get even better at the collection of
writes/syncs that happen at checkpoint time to take even bigger
advantage of the fact that it's better to write everything in a file
first, then call a single sync.

So moving to this new double-write-area bandwagon, we move from a "WAL
FPW synced at the commit, collect as many other writes, then final
sync" type system to a system where *EVERY* write requires syncs of 2
separate 8K writes at buffer write-out time.  So we avoid the FPW at
commit (yes, that's nice for latency), and we guarantee every buffer
written is consistent (that fixes our hint-bit-only dirty writes from
being torn).  And we do that at a cost of every buffer write requiring
2 fsyncs, in a serial fashion.  Come checkpoint, I'm wondering....

Again, all that to avoid a single "optimization" that postgresql currently has:
1) writes for hint-bit only buffers don't need to be durable

And the problem that optimization introduces:
1) Since they aren't guaranteed durable, we can't believe a checksum



--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Re: 16-bit page checksums for 9.2

From
Jeff Janes
Date:
On 12/29/11, Ants Aasma <ants.aasma@eesti.ee> wrote:

> Unless I'm missing something, double-writes are needed for all writes,
> not only the first page after a checkpoint. Consider this sequence of
> events:
>
> 1. Checkpoint
> 2. Double-write of page A (DW buffer write, sync, heap write)
> 3. Sync of heap, releasing DW buffer for new writes.
>  ... some time goes by
> 4. Regular write of page A
> 5. OS writes one part of page A
> 6. Crash!
>
> Now recovery comes along, page A is broken in the heap with no
> double-write buffer backup nor anything to recover it by in the WAL.

Isn't 3 the very definition of a checkpoint, meaning that 4 is not
really a regular write as it is the first one after a checkpoint?

But it doesn't seem safe to me to replace a page from the DW buffer
and then apply WAL to that replaced page when the WAL precedes the
age of the page in the buffer.

Cheers,

Jeff


Re: 16-bit page checksums for 9.2

From
"Kevin Grittner"
Date:
Simon Riggs <simon@2ndQuadrant.com> wrote:
> v2 of checksum patch, using a conditional copy if checksumming is
> enabled, so locking is removed.
> 
> Thanks to Andres for thwacking me with the cluestick, though I
> have used a simple copy rather than a copy & calc.
> 
> Tested using make installcheck with parameter on/off, then restart
> and vacuumdb to validate all pages.
> 
> Reviews, objections, user interface tweaks all welcome.

I'm happy with how this looks, except (as noted in a code comment)
that there seems to be room for optimization of the calculation
itself.  Details below:

(1)  I like the choice of Fletcher-16.  It should be very good at
detecting problems while being a lot less expensive than an official
CRC calculation.  The fact that it was developed at Lawrence
Livermore Labs and has been subjected to peer review in the IEEE
Transactions on Communications generates confidence in the
technique.  According to what I've read, though, the technique is
conceptually based around the modulus of the sums.  Doing the
modulus calculation for each addition is often done to prevent
overflow, but if we define sum1 and sum2 as uint I don't see how we
can get an overflow with 8k bytes, so I suggest we declare both of
these local variables as uint and leave the modulus to the end (a
quick sketch of that loop follows point (3) below).  It can't hurt
to leave off 16000 division operations per checksum generation.

(2)  I'm not sure about doing this in three parts, to skip the
checksum itself and the hole in the middle of the page.  Is this
because the hole might not have predictable data?  Why would that
matter, as long as it is read back the same?  Is it expected that
this will reduce the cost by checking fewer bytes?  That we could
tolerate the read of an actual corrupted disk sector in the middle
of a page if it didn't contain any data?  If we can set the checksum
to zero before starting, might it be faster to load four bytes at a
time, and iterate fewer times?  (Like I said, I'm not sure about any
of this, but it seemed worth asking the questions.)

(3)  Rather than having PageSetVerificationInfo() use memcpy,
followed by a pass through the copied data to calculate the checksum,
might it be better to have a "copy and calculate" version of the
function (as in VMware's original patch) to save an extra pass over
the page image?
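
To make (1) concrete, here is a self-contained sketch of the
deferred-modulus loop, not the patch's actual code.  It uses 64-bit
accumulators so the deferral cannot overflow, and feeds the whole block
in one pass rather than the three ranges mentioned in (2):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

/* Fletcher-16 over a block with the "% 255" folded in only at the end. */
static uint16_t
fletcher16_block(const uint8_t *data, size_t len)
{
    uint64_t sum1 = 0;
    uint64_t sum2 = 0;

    for (size_t i = 0; i < len; i++)
    {
        sum1 += data[i];        /* no modulus inside the loop */
        sum2 += sum1;
    }
    sum1 %= 255;                /* fold once, at the end */
    sum2 %= 255;
    return (uint16_t) ((sum2 << 8) | sum1);
}

int
main(void)
{
    uint8_t page[BLCKSZ];

    memset(page, 0xAB, sizeof(page));   /* stand-in for a real page image */
    printf("checksum = 0x%04x\n", (unsigned) fletcher16_block(page, sizeof(page)));
    return 0;
}
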
Other than these performance tweaks around the calculation phase, I
didn't spot any problems.  I beat up on it a bit on a couple
machines without hitting any bugs or seeing any regression test
failures.
-Kevin


Re: 16-bit page checksums for 9.2

From
Jim Nasby
Date:
On Jan 3, 2012, at 4:21 PM, Kevin Grittner wrote:
> (2)  I'm not sure about doing this in three parts, to skip the
> checksum itself and the hole in the middle of the page.  Is this
> because the hole might not have predictable data?  Why would that
> matter, as long as it is read back the same?

IMO not checksumming the free space would be a really bad idea. It's
entirely possible to have your hardware crapping on your free space,
and I'd still want to know that that was happening. Now, it would be
interesting if the free space could be checksummed separately, since
there's no reason to refuse to read the page if only the free space is
screwed up... But given the choice, I'd rather get an error when the
free space is "corrupted" and be forced to look into things rather
than blissfully ignore corrupted free space only to be hit later with
real data loss.
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Fri, Dec 30, 2011 at 11:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On 12/29/11, Ants Aasma <ants.aasma@eesti.ee> wrote:
>> Unless I'm missing something, double-writes are needed for all writes,
>> not only the first page after a checkpoint. Consider this sequence of
>> events:
>>
>> 1. Checkpoint
>> 2. Double-write of page A (DW buffer write, sync, heap write)
>> 3. Sync of heap, releasing DW buffer for new writes.
>>  ... some time goes by
>> 4. Regular write of page A
>> 5. OS writes one part of page A
>> 6. Crash!
>>
>> Now recovery comes along, page A is broken in the heap with no
>> double-write buffer backup nor anything to recover it by in the WAL.
>
> Isn't 3 the very definition of a checkpoint, meaning that 4 is not
> really a regular write as it is the first one after a checkpoint?

I think you nailed it.

> But it doesn't seem safe to me replace a page from the DW buffer and
> then apply WAL to that replaced page which preceded the age of the
> page in the buffer.

That's what LSNs are for.

If we write the page to the checkpoint buffer just once per
checkpoint, recovery can restore the double-written versions of the
pages and then begin WAL replay, which will restore all the subsequent
changes made to the page.  Recovery may also need to do additional
double-writes if it encounters pages for which we wrote WAL but
never flushed the buffer, because a crash during recovery can also
create torn pages.  When we reach a restartpoint, we fsync everything
down to disk and then nuke the double-write buffer.  Similarly, in
normal running, we can nuke the double-write buffer at checkpoint
time, once the fsyncs are complete.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
"Kevin Grittner"
Date:
"Kevin Grittner"  wrote:
> if we define sum1 and sum2 as uint I don't see how we can get an
> overflow with 8k byes
I feel the need to amend that opinion.
While sum1 only needs to hold a maximum of (BLCKSZ * 255), which
would be adequate for a BLCKSZ up to 16 MB, sum2 needs to hold up to
a maximum of ((BLCKSZ * (BLCKSZ + 1) / 2) * 255) so a 32-bit unsigned
isn't even good for an 8 KB block size.  A uint64 for sum2 can handle
BLCKSZ up to 256 MB.  So we could define sum1 as uint (which is
presumably either uint32 or uint64) and sum2 as uint64 and for a
BLCKSZ up toe 16 MB we still don't need to find the modulo 255
numbers until we're done adding things up.
Does anyone think we need to support anything bigger than that? 
Making them both uint64 would support a BLCKSZ up to 256 MB without
needing to do the division inside the loop. but it might be slightly
slower on 32-bit builds.
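
A trivial standalone check of those bounds for the default 8 KB block
size (worst case: every byte is 0xFF, so each addition contributes 255):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    const uint64_t blcksz   = 8192;
    const uint64_t max_sum1 = 255 * blcksz;
    const uint64_t max_sum2 = 255 * blcksz * (blcksz + 1) / 2;

    printf("max sum1   = %llu\n", (unsigned long long) max_sum1);    /* 2088960, fits in uint32 */
    printf("max sum2   = %llu\n", (unsigned long long) max_sum2);    /* 8557424640, exceeds uint32 */
    printf("uint32 max = %llu\n", (unsigned long long) UINT32_MAX);  /* 4294967295 */
    return 0;
}
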
-Kevin


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Tue, Jan 3, 2012 at 11:00 PM, Jim Nasby <jim@nasby.net> wrote:
> On Jan 3, 2012, at 4:21 PM, Kevin Grittner wrote:
>> (2)  I'm not sure about doing this in three parts, to skip the
>> checksum itself and the hole in the middle of the page.  Is this
>> because the hole might not have predictable data?  Why would that
>> matter, as long as it is read back the same?
>
> IMO not checksumming the free space would be a really bad idea. It's
> entirely possible to have your hardware crapping on your free space,
> and I'd still want to know that that was happening. Now, it would be
> interesting if the free space could be checksummed separately, since
> there's no reason to refuse to read the page if only the free space
> is screwed up... But given the choice, I'd rather get an error when
> the free space is "corrupted" and be forced to look into things
> rather than blissfully ignore corrupted free space only to be hit
> later with real data loss.

I see that argument. We don't have space for 2 checksums.

We can either

(1) report all errors on a page, including errors that don't change
PostgreSQL data. This involves checksumming long strings of zeroes,
which the checksum algorithm can't tell apart from long strings of
ones.

(2) report only errors that changed PostgreSQL data.

We already do (1) for WAL CRCs so doing the same thing for page
checksums makes sense and is much faster.

If enough people think we should do (2) then it's a simple change to the patch.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Tue, Jan 3, 2012 at 10:21 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

> I'm happy with how this looks, except (as noted in a code comment)
> that there seems to be room for optimization of the calculation
> itself.  Details below:

...

> (3)  Rather than having PageSetVerificationInfo() use memcpy,
> followed by pass through the copied data to calculate the checksum,
> might it be better to have a "copy and calculate" version of the
> function (as in VMware's original patch) to save an extra pass over
> the page image?

> Other than these performance tweaks around the calculation phase, I
> didn't spot any problems.  I beat up on it a bit on a couple
> machines without hitting any bugs or seeing any regression test
> failures.

My focus was on getting something working first, then tuning. If we're
agreed that we have everything apart from the tuning then we can
proceed with tests to see which works better.

The copy and calculate approach might get in the way of hardware
prefetch since in my understanding the memory fetch time exceeds the
calculation time. As discussed elsewhere using that code or not would
not stop that work being credited.

David, please can you rework the VMware calc patch to produce an
additional 16-bit checksum mechanism in a way compatible with the
16bit patch, so we can test the two versions of the calculation? We
can make the GUC an enum so that the page checksum is selectable (for
testing).

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Andres Freund
Date:
On Tuesday, January 03, 2012 11:21:42 PM Kevin Grittner wrote:
> (1)  I like the choice of Fletcher-16.  It should be very good at
> detecting problems while being a lot less expensive that an official
> CRC calculation. 
I wonder if CRC32c wouldn't be a good alternative given that more and more
cpus (it's in SSE 4.2) support calculating it in silicon.

Andres


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Wed, Jan 4, 2012 at 9:20 AM, Andres Freund <andres@anarazel.de> wrote:
> On Tuesday, January 03, 2012 11:21:42 PM Kevin Grittner wrote:
>> (1)  I like the choice of Fletcher-16.  It should be very good at
>> detecting problems while being a lot less expensive that an official
>> CRC calculation.
> I wonder if CRC32c wouldn't be a good alternative given more and more cpus
> (its in SSE 4.2) support calculating it in silicon.

We're trying to get something that fits in 16 bits for this release.
I'm guessing CRC32c doesn't?


--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Ants Aasma
Date:
On Wed, Jan 4, 2012 at 3:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 30, 2011 at 11:58 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On 12/29/11, Ants Aasma <ants.aasma@eesti.ee> wrote:
>>> Unless I'm missing something, double-writes are needed for all writes,
>>> not only the first page after a checkpoint. Consider this sequence of
>>> events:
>>>
>>> 1. Checkpoint
>>> 2. Double-write of page A (DW buffer write, sync, heap write)
>>> 3. Sync of heap, releasing DW buffer for new writes.
>>>  ... some time goes by
>>> 4. Regular write of page A
>>> 5. OS writes one part of page A
>>> 6. Crash!
>>>
>>> Now recovery comes along, page A is broken in the heap with no
>>> double-write buffer backup nor anything to recover it by in the WAL.
>>
>> Isn't 3 the very definition of a checkpoint, meaning that 4 is not
>> really a regular write as it is the first one after a checkpoint?
>
> I think you nailed it.

No, I should have explicitly stated that no checkpoint happens in
between. I think the confusion here is because I assumed Kevin
described a fixed size d-w buffer in this message:

On Thu, Dec 29, 2011 at 6:44 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> ...  The file is fsync'd (like I said,
> hopefully to BBU cache), then each page in the double-write buffer is
> written to the normal page location, and that is fsync'd.  Once that
> is done, the database writes have no risk of being torn, and the
> double-write buffer is marked as empty.  ...

If the double-write buffer survives until the next checkpoint,
double-writing only the first write should work just fine. The
advantage over current full-page writes is that the write is not into
the WAL stream and is done (hopefully) by the bgwriter/checkpointer in
the background.

--
Ants Aasma


Re: 16-bit page checksums for 9.2

From
Nicolas Barbier
Date:
2012/1/4 Simon Riggs <simon@2ndquadrant.com>:

> On Wed, Jan 4, 2012 at 9:20 AM, Andres Freund <andres@anarazel.de> wrote:
>
>> I wonder if CRC32c wouldn't be a good alternative given more and more cpus
>> (its in SSE 4.2) support calculating it in silicon.
>
> We're trying to get something that fits in 16 bits for this release.
> I'm guessing CRC32c doesn't?

What happens to the problem-detecting performance of a 16 bit part of
a CRC32c vs. a real 16 bit checksum? If it is still as good, it might
make sense to use the former, assuming that there is a way to easily
trigger the silicon support and enough CPUs support it.

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?
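
For illustration, a minimal sketch of what Nicolas is asking about: computing
CRC32c with the SSE 4.2 instruction and keeping only the low 16 bits. The
intrinsics come from <nmmintrin.h> and need -msse4.2; whether a truncated
CRC32c detects errors as well as a native 16-bit checksum is exactly the open
question, and a real implementation would also need a software fallback for
CPUs without SSE 4.2.

    #include <stdint.h>
    #include <stddef.h>
    #include <nmmintrin.h>      /* SSE 4.2 CRC32 intrinsics; build with -msse4.2 */

    /* Sketch only: hardware CRC32c over a buffer, truncated to 16 bits. */
    static uint16_t
    crc32c_low16(const unsigned char *buf, size_t len)
    {
        uint32_t    crc = 0xFFFFFFFF;
        size_t      i;

        for (i = 0; i < len; i++)
            crc = _mm_crc32_u8(crc, buf[i]);

        crc ^= 0xFFFFFFFF;

        return (uint16_t) (crc & 0xFFFF);   /* keep only the low 16 bits */
    }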


Re: 16-bit page checksums for 9.2

From
"Kevin Grittner"
Date:
Simon Riggs  wrote:
> We can either
> 
> (1) report all errors on a page, including errors that don't change
> PostgreSQL data. This involves checksumming long strings of zeroes,
> which the checksum algorithm can't tell apart from long strings of
> ones.
What do you mean?  Each byte which goes into the checksum, and the
position of that byte, influences the outcome once you've got a
non-zero value in sum1.  The number of leading NIL bytes would not
affect the outcome unless you seed the calculation with something
non-zero, but including the page header in the calculation seems to
cover that OK.
> (2) report only errors that changed PostgreSQL data.
> 
> We already do (1) for WAL CRCs so doing the same thing for page
> checksums makes sense and is much faster.
> 
> If enough people think we should do (2) then its a simple change to
> the patch.
To me, (1) makes more sense, but it seems to me you're currently
doing (2) because you check in three parts, skipping the free space
in the middle of the page.
-Kevin
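
For reference, a minimal sketch of the option (2) coverage described above:
checksum the page header plus line pointers up to pd_lower, then the tuple
data from pd_upper onwards, skipping the free-space hole in between. The
fletcher16_range() helper and the explicit pd_lower/pd_upper arguments are
simplifications for illustration, not the patch's actual interface.

    #include <stdint.h>
    #include <stddef.h>

    #define BLCKSZ 8192

    /* fold one byte range into a running Fletcher-16 state */
    static void
    fletcher16_range(const unsigned char *buf, size_t len,
                     uint32_t *sum1, uint64_t *sum2)
    {
        size_t      i;

        for (i = 0; i < len; i++)
        {
            *sum1 += buf[i];
            *sum2 += *sum1;
        }
    }

    static uint16_t
    checksum_skipping_hole(const unsigned char *page,
                           uint16_t pd_lower, uint16_t pd_upper)
    {
        uint32_t    sum1 = 0;
        uint64_t    sum2 = 0;

        /* page header and line pointer array */
        fletcher16_range(page, pd_lower, &sum1, &sum2);
        /* tuple data; the hole between pd_lower and pd_upper is not covered */
        fletcher16_range(page + pd_upper, BLCKSZ - pd_upper, &sum1, &sum2);

        return (uint16_t) (((sum2 % 255) << 8) | (sum1 % 255));
    }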


Re: 16-bit page checksums for 9.2

From
Alvaro Herrera
Date:
Excerpts from Kevin Grittner's message of Wed Jan 04 04:12:43 -0300 2012:
> "Kevin Grittner"  wrote:
>
> > if we define sum1 and sum2 as uint I don't see how we can get an
> > overflow with 8k byes
>
> I feel the need to amend that opinion.
>
> While sum1 only needs to hold a maximum of (BLCKSZ * 255), which
> would be adequate for a BLCKSZ up to 16 MB, sum2 needs to hold up to
> a maximum of ((BLCKSZ * (BLCKSZ + 1) / 2) * 255) so a 32-bit unsigned
> isn't even good for an 8 KB block size.  A uint64 for sum2 can handle
> BLCKSZ up to 256 MB.  So we could define sum1 as uint (which is
> presumably either uint32 or uint64) and sum2 as uint64 and for a
> BLCKSZ up to 16 MB we still don't need to find the modulo 255
> numbers until we're done adding things up.
>
> Does anyone think we need to support anything bigger than that?
> Making them both uint64 would support a BLCKSZ up to 256 MB without
> needing to do the division inside the loop, but it might be slightly
> slower on 32-bit builds.

We don't support BLCKSZ higher than 32k anyway ... saith pg_config.h:

/* Size of a disk block --- this also limits the size of a tuple. You can set
   it bigger if you need bigger tuples (although TOAST should reduce the need
   to have large tuples, since fields can be spread across multiple tuples).
   BLCKSZ must be a power of 2. The maximum possible value of BLCKSZ is
   currently 2^15 (32768). This is determined by the 15-bit widths of the
   lp_off and lp_len fields in ItemIdData (see include/storage/itemid.h).
   Changing BLCKSZ requires an initdb. */
#define BLCKSZ 8192

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: 16-bit page checksums for 9.2

From
"Kevin Grittner"
Date:
Robert Haas  wrote:
> Jeff Janes  wrote:
>> But it doesn't seem safe to me to replace a page from the DW buffer
>> and then apply WAL to that replaced page which preceded the age of
>> the page in the buffer.
>
> That's what LSNs are for.
Agreed.
> If we write the page to the checkpoint buffer just once per
> checkpoint, recovery can restore the double-written versions of the
> pages and then begin WAL replay, which will restore all the
> subsequent changes made to the page. Recovery may also need to do
> additional double-writes if it encounters pages that for which we
> wrote WAL but never flushed the buffer, because a crash during
> recovery can also create torn pages.
That's a good point.  I think WAL application does need to use
double-write.  As usual, it doesn't affect *when* a page must be
written, but *how*.
> When we reach a restartpoint, we fsync everything down to disk and
> then nuke the double-write buffer.
I think we add to the double-write buffer as we write pages from the
buffer to disk.  I don't think it makes sense to do potentially
repeated writes of the same page with different contents to the
double-write buffer as we go; nor is it a good idea to leave the page
unsynced and let the double-write buffer grow for a long time.
> Similarly, in normal running, we can nuke the double-write buffer
> at checkpoint time, once the fsyncs are complete.
Well, we should nuke it for re-use as soon as all pages in the buffer
are written and fsynced.  I'm not at all sure that the best
performance is hit by waiting for checkpoint for that versus doing it
at page eviction time.
The whole reason that double-write techniques don't double the write
time is that it is relatively small and the multiple writes to the
same disk sectors get absorbed by the BBU write-back without actually
hitting the disk all the time.  Letting the double-write buffer grow
to a large size seems likely to me to be a performance killer.  The
whole double-write, including fsyncs to buffer and the actual page
location should just be considered part of the page write process, I
think.
-Kevin
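
For illustration, a minimal sketch of the per-page write path described above,
with the double-write treated simply as part of writing the page. File
descriptors, offsets and error handling are simplified; real code would batch
pages into the double-write area and track slot reuse.

    #include <sys/types.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Sketch only: push one page through a small, reused double-write area.
     * After a crash, any intact page image found in the double-write area is
     * copied back over its heap location before WAL replay begins.
     */
    static void
    write_page_with_double_write(int dw_fd, off_t dw_offset,
                                 int rel_fd, off_t rel_offset,
                                 const char *page, size_t page_size)
    {
        /* 1. write the page image into the double-write area and sync it */
        if (pwrite(dw_fd, page, page_size, dw_offset) != (ssize_t) page_size ||
            fsync(dw_fd) != 0)
        {
            perror("double-write area");
            exit(1);
        }

        /* 2. write the page to its real location and sync that too */
        if (pwrite(rel_fd, page, page_size, rel_offset) != (ssize_t) page_size ||
            fsync(rel_fd) != 0)
        {
            perror("relation file");
            exit(1);
        }

        /* 3. the double-write slot can now be reused for another page */
    }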


Re: 16-bit page checksums for 9.2

From
Stephen Frost
Date:
Simon, all,

* Simon Riggs (simon@2ndQuadrant.com) wrote:
> (1) report all errors on a page, including errors that don't change
> PostgreSQL data. This involves checksumming long strings of zeroes,
> which the checksum algorithm can't tell apart from long strings of
> ones.

Do we actually know when/where it's supposed to be all zeros, and hence
could we check for that explicitly?  If we know what it's supposed to
be, in order to be consistent with other information, I could certainly
see value in actually checking that.

I don't think that's valuable enough to go breaking abstraction layers
or bending over backwards to do it though.  If we don't have the
knowledge, at the right level, that the data should all be zeros then
including those pieces in the CRC certainly makes sense to me.

Just my 2c.
Thanks,
    Stephen

Re: 16-bit page checksums for 9.2

From
"Kevin Grittner"
Date:
Simon Riggs  wrote:
> My focus was on getting something working first, then tuning. If
> we're agreed that we have everything apart from the tuning then we
> can proceed with tests to see which works better.
Sure.  I just think you are there already except for what I got into.
FWIW, moving the modulus application out of the loop is a very
trivial change and has no effect on the results; it's strictly a
performance issue.
-Kevin


Re: 16-bit page checksums for 9.2

From
"Kevin Grittner"
Date:
Alvaro Herrera  wrote:
> We don't support BLCKSZ higher than 32k anyway
Thanks for pointing that out.  Then I think we should declare sum1 to
be uint and sum2 to be uint64.  We can take the "% 255" out from
where it sits in the v2 patch, and just add something like this after
the sums are generated:

    sum1 %= 255;
    sum2 %= 255;
Or, of course, we can just do it on the line where we combine the two
sums for the final result, if that's not too hard to read.
-Kevin
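
Putting this subthread together, a minimal sketch of the calculation being
described, with sum1 held in 32 bits, sum2 in 64 bits, and the modulo applied
only once after the loop (safe for the block sizes worked out above):

    #include <stdint.h>
    #include <stddef.h>

    #define BLCKSZ 8192

    /*
     * Sketch only: Fletcher-16 over a whole block.  With BLCKSZ = 8192,
     * sum1 stays under 2^21 and sum2 under 2^34, so the accumulators never
     * overflow and "% 255" can wait until the loop is done.
     */
    static uint16_t
    fletcher16_block(const unsigned char *page)
    {
        uint32_t    sum1 = 0;
        uint64_t    sum2 = 0;
        size_t      i;

        for (i = 0; i < BLCKSZ; i++)
        {
            sum1 += page[i];
            sum2 += sum1;
        }

        sum1 %= 255;
        sum2 %= 255;

        return (uint16_t) ((sum2 << 8) | sum1);
    }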


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Wed, Jan 4, 2012 at 1:31 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Simon, all,
>
> * Simon Riggs (simon@2ndQuadrant.com) wrote:
>> (1) report all errors on a page, including errors that don't change
>> PostgreSQL data. This involves checksumming long strings of zeroes,
>> which the checksum algorithm can't tell apart from long strings of
>> ones.
>
> Do we actually know when/where it's supposed to be all zeros, and hence
> could we check for that explicitly?  If we know what it's supposed to
> be, in order to be consistent with other information, I could certainly
> see value in actually checking that.

Yes, we can. Excellent suggestion, will implement.

That means we can keep the CRC calc fast as well as check the whole of
the page inbound.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Wed, Jan 4, 2012 at 8:31 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
>> When we reach a restartpoint, we fsync everything down to disk and
>> then nuke the double-write buffer.
>
> I think we add to the double-write buffer as we write pages from the
> buffer to disk.  I don't think it makes sense to do potentially
> repeated writes of the same page with different contents to the
> double-write buffer as we go; nor is it a good idea to leave the page
> unsynced and let the double-write buffer grow for a long time.

You may be right.  Currently, though, we only fsync() at
end-of-checkpoint.  So we'd have to think about what to fsync, and how
often, to keep the double-write buffer to a manageable size.  I can't
help thinking that any extra fsyncs are pretty expensive, though,
especially if you have to fsync() every file that's been
double-written before clearing the buffer.  Possibly we could have 2^N
separate buffers based on an N-bit hash of the relfilenode and segment
number, so that we could just fsync 1/(2^N)-th of the open files at a
time.  But even that sounds expensive: writing back lots of dirty data
isn't cheap.  One of the systems I've been doing performance testing
on can sometimes take >15 seconds to write a shutdown checkpoint, and
I'm sure that other people have similar (and worse) problems.
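
For illustration, a minimal sketch of the bucketing idea, with an invented
mixing hash; the point is only that a flush then needs to fsync the files
routed to one bucket rather than everything written since the last checkpoint.

    #include <stdint.h>

    #define DW_BUCKET_BITS      3                   /* 2^3 = 8 double-write buffers */
    #define DW_NUM_BUCKETS      (1 << DW_BUCKET_BITS)

    /* Sketch only: route (relfilenode, segment number) to one of 2^N buckets. */
    static int
    dw_bucket_for(uint32_t relfilenode, uint32_t segno)
    {
        uint32_t    h = relfilenode * 2654435761u + segno;     /* Knuth-style mix */

        return (int) (h & (DW_NUM_BUCKETS - 1));
    }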

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> we only fsync() at end-of-checkpoint.  So we'd have to think about
> what to fsync, and how often, to keep the double-write buffer to a
> manageable size.
I think this is the big tuning challenge with this technology.
> I can't help thinking that any extra fsyncs are pretty expensive,
> though, especially if you have to fsync() every file that's been
> double-written before clearing the buffer. Possibly we could have
> 2^N separate buffers based on an N-bit hash of the relfilenode and
> segment number, so that we could just fsync 1/(2^N)-th of the open
> files at a time.
I'm not sure I'm following -- we would just be fsyncing those files
we actually wrote pages into, right?  Not all segments for the table
involved?
> But even that sounds expensive: writing back lots of dirty data
> isn't cheap.  One of the systems I've been doing performance
> testing on can sometimes take >15 seconds to write a shutdown
> checkpoint,
Consider the relation-file fsyncs for double-write as a form of
checkpoint spreading, and maybe it won't seem so bad.  It should
make that shutdown checkpoint less painful.  Now, I have been
thinking that on a write-heavy system you had better have a BBU
write-back cache, but that's my recommendation, anyway.
> and I'm sure that other people have similar (and worse) problems.
Well, I have no doubt that this feature should be optional.  Those
who prefer can continue to do full-page writes to the WAL, instead. 
Or take the "running with scissors" approach.
-Kevin


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Wed, Jan 4, 2012 at 1:32 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>> we only fsync() at end-of-checkpoint.  So we'd have to think about
>> what to fsync, and how often, to keep the double-write buffer to a
>> manageable size.
>
> I think this is the big tuning challenge with this technology.

One of them, anyway.  I think it may also be tricky to make sure that
a backend that needs to write a dirty buffer doesn't end up having to
wait for a double-write to be fsync'd.

>> I can't help thinking that any extra fsyncs are pretty expensive,
>> though, especially if you have to fsync() every file that's been
>> double-written before clearing the buffer. Possibly we could have
>> 2^N separate buffers based on an N-bit hash of the relfilenode and
>> segment number, so that we could just fsync 1/(2^N)-th of the open
>> files at a time.
>
> I'm not sure I'm following -- we would just be fsyncing those files
> we actually wrote pages into, right?  Not all segments for the table
> involved?

Yes.

>> But even that sounds expensive: writing back lots of dirty data
>> isn't cheap.  One of the systems I've been doing performance
>> testing on can sometimes take >15 seconds to write a shutdown
>> checkpoint,
>
> Consider the relation-file fsyncs for double-write as a form of
> checkpoint spreading, and maybe it won't seem so bad.  It should
> make that shutdown checkpoint less painful.  Now, I have been
> thinking that on a write-heavy system you had better have a BBU
> write-back cache, but that's my recommendation, anyway.

I think this point has possibly been beaten to death, but at the risk
of belaboring the point I'll bring it up again: the frequency with
which we fsync() is basically a trade-off between latency and
throughput.  If you fsync a lot, then each one will be small, so you
shouldn't experience much latency, but throughput might suck.  If you
don't fsync very much, then you maximize the chances for
write-combining (because inserting an fsync between two writes to the
same block forces that block to be physically written twice rather
than just once) thus improving throughput, but when you do get around
to calling fsync() there may be a lot of data to write all at once,
and you may get a gigantic latency spike.

As far as I can tell, one fsync per checkpoint is the theoretical
minimum, and that's what we do now.  So our current system is
optimized for throughput.  The decision to put full-page images into
WAL rather than a separate buffer is essentially turning the dial in
the same direction, because, in effect, the double-write fsync
piggybacks on the WAL fsync which we must do anyway.  So both the
decision to use a double-write buffer AT ALL and the decision to fsync
more frequently to keep that buffer to a manageable size are going to
result in turning that dial in the opposite direction.  It seems to me
inevitable that, even with the best possible implementation,
throughput will get worse.  With a good implementation but not a bad
one, latency should improve.

Now, this is not necessarily a reason to reject the idea.  I believe
that several people have proposed that our current implementation is
*overly* optimized for throughput *at the expense of* latency, and
that we might want to provide some options that, in one way or
another, fsync more frequently, so that checkpoint spikes aren't as
bad.  But when it comes time to benchmark, we might need to think
somewhat carefully about what we're testing...

Another thought here is that double-writes may not be the best
solution, and are almost certainly not the easiest-to-implement
solution.  We could instead do something like this: when an unlogged
change is made to a buffer (e.g. a hint bit is set), we set a flag on
the buffer header.  When we evict such a buffer, we emit a WAL record
that just overwrites the whole buffer with a new FPI.  There are some
pretty obvious usage patterns where this is likely to be painful (e.g.
load a big table without setting hint bits, and then seq-scan it).
But there are also many use cases where the working set fits inside
shared buffers and data pages don't get written very often, apart from
checkpoint time, and those cases might work just fine.  Also, the
cases that are problems for this implementation are likely to also be
problems for a double-write based implementation, for exactly the same
reasons: if you discover at buffer eviction time that you need to
fsync something (whether it's WAL or DW), it's going to hurt.
Checksums aren't free even when using double-writes: if you don't have
checksums, pages that have only hint bit-changes don't need to be
double-written.  If double writes aren't going to give us anything
"for free", maybe that's not the right place to be focusing our
efforts...
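
For illustration, a minimal sketch of the flag-on-eviction alternative just
described, with invented flag and helper names (this is not the buffer-manager
API, just the shape of the idea):

    /* Sketch only: mark unlogged (hint-only) changes, log an FPI at eviction. */

    #define BUF_UNLOGGED_CHANGE     0x01

    typedef struct
    {
        int         flags;
        /* ... other buffer header fields ... */
    } SketchBufferHeader;

    /* assumed helpers, not real backend functions */
    extern void emit_full_page_image(const char *page);
    extern void write_page_to_disk(const char *page);

    static void
    set_hint_bit(SketchBufferHeader *buf)
    {
        buf->flags |= BUF_UNLOGGED_CHANGE;      /* no WAL is written here */
    }

    static void
    evict_buffer(SketchBufferHeader *buf, const char *page)
    {
        if (buf->flags & BUF_UNLOGGED_CHANGE)
            emit_full_page_image(page);         /* protects against torn writes */

        write_page_to_disk(page);
        buf->flags &= ~BUF_UNLOGGED_CHANGE;
    }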

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> I think it may also be tricky to make sure that a backend that
> needs to write a dirty buffer doesn't end up having to wait for a
> double-write to be fsync'd.
This and other parts of your post seem to ignore the BBU write-back
cache.  Multiple fsyncs of a single page can still be collapsed at
that level to a single actual disk write.  In fact, I rather doubt
this technology will look very good on machines without write-back
caching.  I'm not as sure as you are that this is a net loss in
throughput, either.  When the fsync storm clogs the RAID controller,
even reads stall, so something which more evenly pushes writes to
disk might avoid these non-productive pauses.  I think that could
improve throughput enough to balance or exceed the other effects. 
Maybe.  I agree we need to be careful to craft a good set of
benchmarks here.
> Checksums aren't free even when using double-writes: if you don't
> have checksums, pages that have only hint bit-changes don't need
> to be double-written.
Agreed.  Checksums aren't expected to be free under any
circumstances.  I'm expecting DW to be slightly faster than FPW in
general, with or without in-page checksums.
> If double writes aren't going to give us anything "for free",
> maybe that's not the right place to be focusing our
> efforts...
I'm not sure why it's not enough that they improve performance over
the alternative.  Making some other feature with obvious overhead
"free" seems an odd requirement to hang on this.  (Maybe I'm
misunderstanding you on that point?)
-Kevin


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Wed, Jan 4, 2012 at 3:51 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
>> If double writes aren't going to give us anything "for free",
>> maybe that's not the right place to be focusing our
>> efforts...
>
> I'm not sure why it's not enough that they improve performance over
> the alternative.  Making some other feature with obvious overhead
> "free" seems an odd requirement to hang on this.  (Maybe I'm
> misunderstanding you on that point?)

Well, this thread is nominally about checksums, but here we are
talking about double writes, so I thought we were connecting those
features in some way?

Certainly, life is easier if we can develop them completely separately
- but checksums really ought to come with some sort of solution to the
problem of a torn-page with hint bit changes, IMO, and I thought
that's why we were thinking so hard about DW just now.

Maybe I'm confused.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Wed, Jan 4, 2012 at 3:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, Jan 4, 2012 at 1:31 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> Simon, all,
>>
>> * Simon Riggs (simon@2ndQuadrant.com) wrote:
>>> (1) report all errors on a page, including errors that don't change
>>> PostgreSQL data. This involves checksumming long strings of zeroes,
>>> which the checksum algorithm can't tell apart from long strings of
>>> ones.
>>
>> Do we actually know when/where it's supposed to be all zeros, and hence
>> could we check for that explicitly?  If we know what it's supposed to
>> be, in order to be consistent with other information, I could certainly
>> see value in actually checking that.
>
> Yes, we can. Excellent suggestion, will implement.

No, we can't.

I discover that non-all-zeroes holes are fairly common, just not very frequent.

That may or may not be a problem, but not something to be dealt with
here and now.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Wed, Jan 4, 2012 at 1:35 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Simon Riggs  wrote:
>
>> My focus was on getting something working first, then tuning. If
>> we're agreed that we have everything apart from the tuning then we
>> can proceed with tests to see which works better.
>
> Sure.  I just think you are there already except for what I got into.
>
> FWIW, moving the modulus application out of the loop is a very
> trivial change and has no effect on the results; it's strictly a
> performance issue.

New version attached, with your suggested changes included. Hole check
code is there as well, but ifdef'd out since it isn't a valid check in
all cases.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: 16-bit page checksums for 9.2

From
Stephen Frost
Date:
* Simon Riggs (simon@2ndQuadrant.com) wrote:
> I discover that non-all-zeroes holes are fairly common, just not very frequent.

Curious, might be interesting to find out why.

> That may or may not be a problem, but not something to be dealt with
> here and now.

But I agree that it's not the job of this patch/effort.  It sounds like
we have clear indication, however, that those areas, as they are not
necessarily all zeros, should be included in the checksum.
Thanks,
    Stephen

Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Fri, Jan 6, 2012 at 1:10 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Simon Riggs (simon@2ndQuadrant.com) wrote:
>> I discover that non-all-zeroes holes are fairly common, just not very frequent.
>
> Curious, might be interesting to find out why.
>
>> That may or may not be a problem, but not something to be dealt with
>> here and now.
>
> But I agree that it's not the job of this patch/effort.  It sounds like
> we have clear indication, however, that those areas, as they are not
> necessarily all zeros, should be included in the checksum.

Disagree. Full page writes ignore the hole, so its appropriate to do
so here also.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Andres Freund
Date:
On Friday, January 06, 2012 11:30:53 AM Simon Riggs wrote:
> On Fri, Jan 6, 2012 at 1:10 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > * Simon Riggs (simon@2ndQuadrant.com) wrote:
> >> I discover that non-all-zeroes holes are fairly common, just not very
> >> frequent.
> > 
> > Curious, might be interesting to find out why.
> > 
> >> That may or may not be a problem, but not something to be dealt with
> >> here and now.
> > 
> > But I agree that it's not the job of this patch/effort.  It sounds like
> > we have clear indication, however, that those areas, as they are not
> > necessairly all zeros, should be included in the checksum.
> 
> Disagree. Full page writes ignore the hole, so its appropriate to do
> so here also.
Well, ignoring them in FPWs has clear space benefits. Ignoring them while
checksumming doesn't have that much of a benefit.

Andres


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Thu, Jan 5, 2012 at 3:46 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, Jan 4, 2012 at 1:35 PM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:

>> Sure.  I just think you are there already except for what I got into.

> New version attached, with your suggested changes included. Hole check
> code is there as well, but ifdef'd out since it isn't a valid check in
> all cases.

Following private discussions, Kevin showed me the patch at v3 was
inadequate because it didn't cater for torn pages as a result of hint
bit writes.

The following patch (v4) introduces a new WAL record type that writes
backup blocks for the first hint on a block in any checkpoint that has
not previously been changed. IMHO this fixes the torn page problem
correctly, though at some additional loss of performance but not the
total catastrophe some people had imagined. Specifically we don't need
to log anywhere near 100% of hint bit settings, much more like 20-30%
(estimated not measured).

Dan Scales also observed some cases where private writes aren't
checksummed, e.g. index build. Those suggestions are still outstanding
from me, but the v4 is worthy of more serious review.

I'll now turn my attention more fully onto the DW patch.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: 16-bit page checksums for 9.2

From
Andres Freund
Date:
On Friday, January 06, 2012 07:26:14 PM Simon Riggs wrote:
> The following patch (v4) introduces a new WAL record type that writes
> backup blocks for the first hint on a block in any checkpoint that has
> not previously been changed. IMHO this fixes the torn page problem
> correctly, though at some additional loss of performance but not the
> total catastrophe some people had imagined. Specifically we don't need
> to log anywhere near 100% of hint bit settings, much more like 20-30%
> (estimated not measured).
Well, but it will hurt in those cases where hint bits already hurt hugely. 
Which is for example when accessing huge amounts of data the first time after 
loading it. Which is not exactly uncommon...

Andres


Re: 16-bit page checksums for 9.2

From
Heikki Linnakangas
Date:
On 06.01.2012 20:26, Simon Riggs wrote:
> The following patch (v4) introduces a new WAL record type that writes
> backup blocks for the first hint on a block in any checkpoint that has
> not previously been changed. IMHO this fixes the torn page problem
> correctly, though at some additional loss of performance but not the
> total catastrophe some people had imagined. Specifically we don't need
> to log anywhere near 100% of hint bit settings, much more like 20-30%
> (estimated not measured).

How's that going to work during recovery? Like in hot standby.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: 16-bit page checksums for 9.2

From
Andres Freund
Date:
On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote:
> On 06.01.2012 20:26, Simon Riggs wrote:
> > The following patch (v4) introduces a new WAL record type that writes
> > backup blocks for the first hint on a block in any checkpoint that has
> > not previously been changed. IMHO this fixes the torn page problem
> > correctly, though at some additional loss of performance but not the
> > total catastrophe some people had imagined. Specifically we don't need
> > to log anywhere near 100% of hint bit settings, much more like 20-30%
> > (estimated not measured).
> 
> How's that going to work during recovery? Like in hot standby.
How's recovery a problem? Unless I miss something that doesn't actually 
introduce a new possibility to transport hint bits to the standby (think 
fpw's). A new transport will obviously increase traffic but ...

Andres


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Fri, Jan 6, 2012 at 2:35 PM, Andres Freund <andres@anarazel.de> wrote:
> On Friday, January 06, 2012 07:26:14 PM Simon Riggs wrote:
>> The following patch (v4) introduces a new WAL record type that writes
>> backup blocks for the first hint on a block in any checkpoint that has
>> not previously been changed. IMHO this fixes the torn page problem
>> correctly, though at some additional loss of performance but not the
>> total catastrophe some people had imagined. Specifically we don't need
>> to log anywhere near 100% of hint bit settings, much more like 20-30%
>> (estimated not measured).
> Well, but it will hurt in those cases where hint bits already hurt hugely.
> Which is for example when accessing huge amounts of data the first time after
> loading it. Which is not exactly uncommon...

No, but I'd rather see us commit a checksum patch that sometimes
carries a major performance penalty and then work to reduce that
penalty later than never commit anything at all.  I think it's
unrealistic to assume we're going to get this perfectly right in one
try.  I am not sure whether this particular patch is close enough for
government work or still needs more hacking, and it may well be the
latter, but there is no use holding our breath and waiting for
absolute perfection.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Fri, Jan 6, 2012 at 2:48 PM, Andres Freund <andres@anarazel.de> wrote:
> On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote:
>> On 06.01.2012 20:26, Simon Riggs wrote:
>> > The following patch (v4) introduces a new WAL record type that writes
>> > backup blocks for the first hint on a block in any checkpoint that has
>> > not previously been changed. IMHO this fixes the torn page problem
>> > correctly, though at some additional loss of performance but not the
>> > total catastrophe some people had imagined. Specifically we don't need
>> > to log anywhere near 100% of hint bit settings, much more like 20-30%
>> > (estimated not measured).
>>
>> How's that going to work during recovery? Like in hot standby.
> How's recovery a problem? Unless I miss something that doesn't actually
> introduce a new possibility to transport hint bits to the standby (think
> fpw's). A new transport will obviously increase traffic but ...

The standby can set hint bits locally that weren't set on the data it
received from the master.  This will require rechecksumming and
rewriting the page, but obviously we can't write the WAL records
needed to protect those writes during recovery.  So a crash could
create a torn page, invalidating the checksum.

Ignoring checksum errors during Hot Standby operation doesn't fix it,
either, because eventually you might want to promote the standby, and
the checksum will still be invalid.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Andres Freund
Date:
On Friday, January 06, 2012 08:53:38 PM Robert Haas wrote:
> On Fri, Jan 6, 2012 at 2:48 PM, Andres Freund <andres@anarazel.de> wrote:
> > On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote:
> >> On 06.01.2012 20:26, Simon Riggs wrote:
> >> > The following patch (v4) introduces a new WAL record type that writes
> >> > backup blocks for the first hint on a block in any checkpoint that has
> >> > not previously been changed. IMHO this fixes the torn page problem
> >> > correctly, though at some additional loss of performance but not the
> >> > total catastrophe some people had imagined. Specifically we don't need
> >> > to log anywhere near 100% of hint bit settings, much more like 20-30%
> >> > (estimated not measured).
> >> 
> >> How's that going to work during recovery? Like in hot standby.
> > 
> > How's recovery a problem? Unless I miss something that doesn't actually
> > introduce a new possibility to transport hint bits to the standby (think
> > fpw's). A new transport will obviously increase traffic but ...
> 
> The standby can set hint bits locally that weren't set on the data it
> received from the master.  This will require rechecksumming and
> rewriting the page, but obviously we can't write the WAL records
> needed to protect those writes during recovery.  So a crash could
> create a torn page, invalidating the checksum.
Err. Stupid me, thanks.

> Ignoring checksum errors during Hot Standby operation doesn't fix it,
> either, because eventually you might want to promote the standby, and
> the checksum will still be invalid.
It's funny. I have the feeling we all are missing a very obvious brilliant
solution to this...

Andres


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Fri, Jan 6, 2012 at 7:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> No, but I'd rather see us commit a checksum patch that sometimes
> carries a major performance penalty and then work to reduce that
> penalty later than never commit anything at all.  I think it's
> unrealistic to assume we're going to get this perfectly right in one
> try.  I am not sure whether this particular patch is close enough for
> government work or still needs more hacking, and it may well be the
> latter, but there is no use holding our breath and waiting for
> absolute perfection.

Well said. My view completely.

I do want jam tomorrow, I just want bread and butter today as well.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Fri, Jan 6, 2012 at 3:47 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Since we ignore hints during HS anyway,

No, we don't.  We need to ignore visibility map bits, but we need not
and do not ignore HEAP_XMIN_COMMITTED, HEAP_XMAX_COMMITTED, etc.

> not setting them seems OK if
> checksums defined. Or we can recommend that you don't use checksums on
> a standby. Whichever fits.
>
> Writing pages during recovery doesn't need WAL. If we crash, we replay
> using the already generated WAL.

Which is all fine, except when you start making changes that are not
WAL-logged.  Then, you have the same torn page problem that exists
when you do it in normal running.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Merlin Moncure
Date:
On Fri, Jan 6, 2012 at 2:03 PM, Andres Freund <andres@anarazel.de> wrote:
>> The standby can set hint bits locally that weren't set on the data it
>> received from the master.  This will require rechecksumming and
>> rewriting the page, but obviously we can't write the WAL records
>> needed to protect those writes during recovery.  So a crash could
>> create a torn page, invalidating the checksum.
> Err. Stupid me, thanks.
>
>> Ignoring checksum errors during Hot Standby operation doesn't fix it,
>> either, because eventually you might want to promote the standby, and
>> the checksum will still be invalid.
> It's funny. I have the feeling we all are missing a very obvious brilliant
> solution to this...

Like getting rid of hint bits?

merlin


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Fri, Jan 6, 2012 at 7:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 6, 2012 at 2:48 PM, Andres Freund <andres@anarazel.de> wrote:
>> On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote:
>>> On 06.01.2012 20:26, Simon Riggs wrote:
>>> > The following patch (v4) introduces a new WAL record type that writes
>>> > backup blocks for the first hint on a block in any checkpoint that has
>>> > not previously been changed. IMHO this fixes the torn page problem
>>> > correctly, though at some additional loss of performance but not the
>>> > total catastrophe some people had imagined. Specifically we don't need
>>> > to log anywhere near 100% of hint bit settings, much more like 20-30%
>>> > (estimated not measured).
>>>
>>> How's that going to work during recovery? Like in hot standby.
>> How's recovery a problem? Unless I miss something that doesn't actually
>> introduce a new possibility to transport hint bits to the standby (think
>> fpw's). A new transport will obviously increase traffic but ...
>
> The standby can set hint bits locally that weren't set on the data it
> received from the master.  This will require rechecksumming and
> rewriting the page, but obviously we can't write the WAL records
> needed to protect those writes during recovery.  So a crash could
> create a torn page, invalidating the checksum.
>
> Ignoring checksum errors during Hot Standby operation doesn't fix it,
> either, because eventually you might want to promote the standby, and
> the checksum will still be invalid.

Since we ignore hints during HS anyway, not setting them seems OK if
checksums defined. Or we can recommend that you don't use checksums on
a standby. Whichever fits.

Writing pages during recovery doesn't need WAL. If we crash, we replay
using the already generated WAL.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Aidan Van Dyk
Date:
On Fri, Jan 6, 2012 at 5:17 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Fri, Jan 6, 2012 at 2:03 PM, Andres Freund <andres@anarazel.de> wrote:
>>> The standby can set hint bits locally that weren't set on the data it
>>> received from the master.  This will require rechecksumming and
>>> rewriting the page, but obviously we can't write the WAL records
>>> needed to protect those writes during recovery.  So a crash could
>>> create a torn page, invalidating the checksum.
>> Err. Stupid me, thanks.
>>
>>> Ignoring checksum errors during Hot Standby operation doesn't fix it,
>>> either, because eventually you might want to promote the standby, and
>>> the checksum will still be invalid.
>> Its funny. I have the feeling we all are missing a very obvious brilliant
>> solution to this...
>
> Like getting rid of hint bits?

Or even just not bothering to consider them as making buffers dirty,
so the only writes are already protected by the double-write (WAL, or
if they get some DW outside of WAL).

I think I've said it before, but I'm guessing OLTP-style databases
rarely have pages written that are dirty but aren't covered by real
changes (so they have the FPW anyway), and OLAP types generally freeze after
loads to avoid the hint-bit-write penalty too...

a.


--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Fri, Jan 6, 2012 at 5:17 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> Its funny. I have the feeling we all are missing a very obvious brilliant
>> solution to this...
>
> Like getting rid of hint bits?

No.

First, that solution has been proposed before, and it crashed and
burned before.  You may as well propose getting rid of VACUUM.  In
each case, the supposed problem is in fact a cure for a much more
severe problem that would quickly make you long for the days when you
had the first one.  I think you (or someone else?) had a fairly
well-developed patch for blunting the impact of hint bit setting, and
that we might be able to do if you (or someone) wants to finish it.
But getting rid of them altogether isn't likely, not because
PostgreSQL developers are an intractable herd of mule-headed idiots
(although I have been known to be all of those things on occasion) but
because the performance characteristics demonstrably suck, as mostly
recently demonstrated by commit
4de82f7d7c50a81ec8e70e2cb0ab413ab9134c0b, which significantly improved
performance with synchronous_commit=off by shaving an average of about
one-tenth of a second off the time before hint bits could be set on
any given tuple.

Second, even if you did it, it wouldn't fix this problem, because hint
bits aren't the only changes we make without WAL logging.  In
particular, when an index scan runs across a tuple that turns out to
be dead, we kill the index tuple so that future scans don't need to
revisit it.  I haven't actually done any testing to measure how
significant this is, but I wouldn't bet on it not mattering.

I think it would be wonderful to come up with a design that either
blunted the effect of hint bits or eliminated them altogether, but the
idea that there's an obvious way forward there that we've simply
refused to pursue is, AFAICT, flat wrong.  Let's not get sidetracked
into talking about things that aren't going to change in time for 9.2
(and probably not 9.3, either, given that no one seems to be putting
any time into it) and aren't even feasible solutions to the problem at
hand (checksums) anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Aidan Van Dyk
Date:
On Fri, Jan 6, 2012 at 6:48 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:

> I think I've said it before, but I'm guessing OLTP-style databases
> rarely have pages written that are dirty but aren't covered by real
> changes (so they have the FPW anyway), and OLAP types generally freeze after
> loads to avoid the hint-bit-write penalty too...

But ya, again, I've never measured ;-)


--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Fri, Jan 6, 2012 at 9:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:

>> Writing pages during recovery doesn't need WAL. If we crash, we replay
>> using the already generated WAL.
>
> Which is all fine, except when you start making changes that are not
> WAL-logged.  Then, you have the same torn page problem that exists
> when you do it in normal running.

Yes, of course. My point is that this is not a blocker to using
checksums, only that some actions cannot occur on the standby but the
replay of changes is not in danger.

page_checksums is an optional parameter, so you can turn it on or off
on the standby as you wish. People frequently have a standby dedicated
to HA and other standbys for queries. So this is all normal and
natural.

page_checksums will default to 'off' in the final patch anyway, in my
understanding.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Heikki Linnakangas
Date:
On 07.01.2012 12:14, Simon Riggs wrote:
> page_checksums is an optional parameter, so you can turn it on or off
> on the standby as you wish. People frequently have a standby dedicated
> to HA and other standbys for queries. So this is all normal and
> natural.

There needs to be a well-documented way of turning it on/off. In 
particular, from off to on. If you ever flip it off even for a minute, 
pages with invalid checksums start to appear in the data directory, and 
if you then turn it on again, you start getting errors. Perhaps there 
needs to be a third setting, calculate-but-dont-verify, where CRCs are 
updated but existing CRCs are not expected to be correct. And a utility 
to scan through your database and fix any incorrect CRCs, so that after 
that's run in calculate-but-dont-verify mode, you can safely turn 
checksums on.

Even better would be a way to make that process robust so that you can't,
through pilot error, turn page_checksums 'on' when it's not safe to do
so. Otherwise when you get a checksum error, your first thought is going
to be "hmm, I wonder if anyone ever turned page_checksums 'off' by 
accident anytime in the past, or if this is a genuine error".

> page_checksums will default to 'off' in the final patch anyway, in my
> understanding.

That makes the need for such a facility even more important. Otherwise 
there's no safe way to ever switch it from off to on.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com
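
A minimal sketch of the third setting being suggested, treating the GUC as a
three-valued mode rather than a boolean (names are illustrative, not from the
patch):

    /* Sketch only: possible states for a page_checksums-style setting. */
    typedef enum
    {
        PAGE_CHECKSUMS_OFF,             /* neither write nor verify checksums */
        PAGE_CHECKSUMS_WRITE_ONLY,      /* set checksums on write, don't verify reads */
        PAGE_CHECKSUMS_ON               /* write and verify */
    } PageChecksumMode;

    static PageChecksumMode page_checksum_mode = PAGE_CHECKSUMS_OFF;

    /* on read: only complain about a bad checksum in the fully-on state */
    static int
    should_verify_checksum(void)
    {
        return page_checksum_mode == PAGE_CHECKSUMS_ON;
    }

    /* on write: stamp a checksum in either of the two writing states */
    static int
    should_set_checksum(void)
    {
        return page_checksum_mode != PAGE_CHECKSUMS_OFF;
    }

After a scrubbing pass has run in the write-only state, every page carries a
valid checksum, and it becomes safe to move to the fully-on state.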


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Sat, Jan 7, 2012 at 10:26 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

> There needs to be a well-documented way of turning it on/off. In particular,
> from off to on.

There is.... in the patch.

The checksum field is optional, as is the parameter.

If page_checksums is on, we write a checksum and it is correct. We
also validate any checksums we see.

If page_checksums is off we clear the checksum on write, so an
incorrect checksum is never written.

So there isn't any problem with there being incorrect checksums on
blocks and you can turn the parameter on and off as often and as
easily as you want. I think it can be USERSET but I wouldn't want to
encourage users to see turning it off as a performance tuning feature.
If the admin turns it on for the server, it's on, so it's SIGHUP.

Any holes in that I haven't noticed?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Sat, Jan 7, 2012 at 10:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

> So there isn't any problem with there being incorrect checksums on
> blocks and you can turn the parameter on and off as often and as
> easily as you want. I think it can be USERSET but I wouldn't want to
> encourage users to see turning it off as a performance tuning feature.
> If the admin turns it on for the server, it's on, so it's SIGHUP.
>
> Any holes in that I haven't noticed?

And of course, as soon as I wrote that I thought of the problem. We
mustn't make a write that hasn't been covered by a FPW, so we must
know ahead of time whether to WAL log hints or not. We can't simply
turn it on/off any longer, now that we have to WAL log hint bits also.
So thanks for making me think of that.

We *could* make it turn on/off at each checkpoint, but its easier just
to say that it can be turned on/off at server start.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Sat, Jan 7, 2012 at 11:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sat, Jan 7, 2012 at 10:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
>> So there isn't any problem with there being incorrect checksums on
>> blocks and you can turn the parameter on and off as often and as
>> easily as you want. I think it can be USERSET but I wouldn't want to
>> encourage users to see turning it off as a performance tuning feature.
>> If the admin turns it on for the server, its on, so its SIGHUP.
>>
>> Any holes in that I haven't noticed?
>
> And of course, as soon as I wrote that I thought of the problem. We
> mustn't make a write that hasn't been covered by a FPW, so we must
> know ahead of time whether to WAL log hints or not. We can't simply
> turn it on/off any longer, now that we have to WAL log hint bits also.
> So thanks for making me think of that.
>
> We *could* make it turn on/off at each checkpoint, but its easier just
> to say that it can be turned on/off at server start.

Attached patch v6 now handles hint bits and checksums correctly,
following Heikki's comments.

In recovery, setting a hint doesn't dirty a block if it wasn't already
dirty. So we can write some hints, and we can set others but not write
them.

Lots of comments in the code.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: 16-bit page checksums for 9.2

From
Jim Nasby
Date:
On Jan 6, 2012, at 4:36 AM, Andres Freund wrote:
> On Friday, January 06, 2012 11:30:53 AM Simon Riggs wrote:
>> On Fri, Jan 6, 2012 at 1:10 AM, Stephen Frost <sfrost@snowman.net> wrote:
>>> * Simon Riggs (simon@2ndQuadrant.com) wrote:
>>>> I discover that non-all-zeroes holes are fairly common, just not very
>>>> frequent.
>>>
>>> Curious, might be interesting to find out why.
>>>
>>>> That may or may not be a problem, but not something to be dealt with
>>>> here and now.
>>>
>>> But I agree that it's not the job of this patch/effort.  It sounds like
>>> we have clear indication, however, that those areas, as they are not
>>> necessairly all zeros, should be included in the checksum.
>>
>> Disagree. Full page writes ignore the hole, so its appropriate to do
>> so here also.
> Well, ignoring them in FPWs has clear space benefits. Ignoring them while
> checksumming doesn't have that much of a benefit.

I agree with Andres... we should checksum zero bytes, because if they're screwed up then something is wrong with your
system, even if you got lucky with what data got trashed.

As I mentioned before, 2 separate checksums would be nice, but if we can't have that I think we need to fail on any
checksum error.
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: 16-bit page checksums for 9.2

From
Greg Smith
Date:
On 1/7/12 5:26 AM, Heikki Linnakangas wrote:
> Perhaps there
> needs to be a third setting, calculate-but-dont-verify, where CRCs are
> updated but existing CRCs are not expected to be correct. And a utility
> to scan through your database and fix any incorrect CRCs, so that after
> that's run in calculate-but-dont-verify mode, you can safely turn
> checksums on.

The users of any feature like this are eventually going to demand a 
"scrubbing" utility that validates the checksums in a background task. 
Doing something like a pg_dump just to validate that your copy of the data is
still clean is not a great approach.  If you accept that sort of utility
is inevitable, you might as well presume it could handle this sort of 
task eventually too.


-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: 16-bit page checksums for 9.2

From
Greg Smith
Date:
On 12/30/11 9:44 AM, Aidan Van Dyk wrote:

> So moving to this new double-write-area bandwagon, we move from a "WAL
> FPW synced at the commit, collect as many other writes, then final
> sync" type system to a system where *EVERY* write requires syncs of 2
> separate 8K writes at buffer write-out time.

It's not quite that bad.  The double-write area is going to be a small 
chunk of re-used sequential I/O, like the current WAL.  And if this 
approach shifts some of the full-page writes out of the WAL and toward 
the new area instead, that's not a real doubling either.  Could probably 
put both on the same disk, and in situations where you don't have a 
battery-backed write cache it's possible to get a write to both per 
rotation.

This idea has been tested pretty extensively as part of MySQL's Innodb 
engine.  Results there suggest the overhead is in the 5% to 30% range; 
some examples mentioning both extremes of that:

http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/
http://www.bigdbahead.com/?p=392

Makes me wish I knew off the top of my head how expensive WAL logging 
hint bits would be, for comparison's sake.


-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Sun, Jan 8, 2012 at 2:03 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sat, Jan 7, 2012 at 11:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On Sat, Jan 7, 2012 at 10:55 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>>> So there isn't any problem with there being incorrect checksums on
>>> blocks and you can turn the parameter on and off as often and as
>>> easily as you want. I think it can be USERSET but I wouldn't want to
>>> encourage users to see turning it off as a performance tuning feature.
>>> If the admin turns it on for the server, its on, so its SIGHUP.
>>>
>>> Any holes in that I haven't noticed?
>>
>> And of course, as soon as I wrote that I thought of the problem. We
>> mustn't make a write that hasn't been covered by a FPW, so we must
>> know ahead of time whether to WAL log hints or not. We can't simply
>> turn it on/off any longer, now that we have to WAL log hint bits also.
>> So thanks for making me think of that.
>>
>> We *could* make it turn on/off at each checkpoint, but its easier just
>> to say that it can be turned on/off at server start.
>
> Attached patch v6 now handles hint bits and checksums correctly,
> following Heikki's comments.
>
> In recovery, setting a hint doesn't dirty a block if it wasn't already
> dirty. So we can write some hints, and we can set others but not write
> them.
>
> Lots of comments in the code.


v7

* Fixes merge conflict
* Minor patch cleanups
* Adds checksum of complete page including hole
* Calcs checksum in mdwrite() so we pick up all non-shared buffer writes also

Robert mentioned to me there were outstanding concerns on this patch.
I know of none, and have double checked the thread to confirm all
concerns are fully addressed. Adding to CF.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: 16-bit page checksums for 9.2

From
Noah Misch
Date:
On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:
> v7

This patch uses FPIs to guard against torn hint writes, even when the
checksums are disabled.  One could not simply condition them on the
page_checksums setting, though.  Suppose page_checksums=off and we're hinting
a page previously written with page_checksums=on.  If that write tears,
leaving the checksum intact, that block will now fail verification.  A couple
of ideas come to mind.  (a) Read the on-disk page and emit an FPI only if the
old page had a checksum.  (b) Add a checksumEnableLSN field to pg_control.
Whenever the cluster starts with checksums disabled, set the field to
InvalidXLogRecPtr.  Whenever the cluster starts with checksums enabled and the
field is InvalidXLogRecPtr, set the field to the next LSN.  When a checksum
failure occurs in a page with LSN older than the stored one, emit either a
softer warning or no message at all.
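
For illustration, a minimal sketch of idea (b), with invented names standing
in for the pg_control field and XLogRecPtr; the point is only that mismatches
on pages last written before checksums became reliable are reported less
loudly.

    #include <stdint.h>

    typedef uint64_t SketchLSN;             /* stand-in for XLogRecPtr */
    #define INVALID_LSN ((SketchLSN) 0)

    /* stand-in for a checksumEnableLSN field kept in pg_control */
    static SketchLSN checksum_enable_lsn = INVALID_LSN;

    /* at startup with checksums enabled and no recorded LSN: record the next LSN */
    static void
    note_checksums_enabled(SketchLSN next_lsn)
    {
        if (checksum_enable_lsn == INVALID_LSN)
            checksum_enable_lsn = next_lsn;
    }

    /*
     * Sketch only: a mismatch on a page whose LSN predates the recorded
     * enable point may be a stale torn write from the checksum-off era,
     * so it warrants a softer warning rather than a hard error.
     */
    static int
    checksum_failure_is_error(SketchLSN page_lsn)
    {
        return checksum_enable_lsn != INVALID_LSN &&
               page_lsn >= checksum_enable_lsn;
    }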

Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty
the buffer.  Here are other sites I noticed that do MarkBufferDirty() without
bumping the page LSN:

    heap_xlog_visible()
    visibilitymap_clear()
    visibilitymap_truncate()
    lazy_scan_heap()
    XLogRecordPageWithFreeSpace()
    FreeSpaceMapTruncateRel()
    fsm_set_and_search()
    fsm_vacuum_page()
    fsm_search_avail()

Though I have not investigated each of these in detail, I suspect most or all
represent continued torn-page hazards.  Since the FSM is just a hint, we do
not WAL-log it at all; it would be nicest to always skip checksums for it,
too.  The visibility map shortcuts are more nuanced.

When page_checksums=off and we read a page last written by page_checksums=on,
we still verify its checksum.  If a mostly-insert/read site uses checksums for
some time and eventually wishes to shed the overhead, disabling the feature
will not stop the costs for reads of old data.  They would need a dump/reload
to get the performance of a never-checksummed database.  Let's either
unconditionally skip checksum validation under page_checksums=off or add a new
state like page_checksums=ignore for that even-weaker condition.

> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml

> +       <para>
> +        Turning this parameter off speeds normal operation, but
> +        might allow data corruption to go unnoticed. The checksum uses
> +        16-bit checksums, using the fast Fletcher 16 algorithm. With this
> +        parameter enabled there is still a non-zero probability that an error
> +        could go undetected, as well as a non-zero probability of false
> +        positives.
> +       </para>

What sources of false positives do you intend to retain?

> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c
> @@ -20,6 +20,7 @@
>  #include "postgres.h"
>  
>  #include "access/visibilitymap.h"
> +#include "access/transam.h"
>  #include "access/xact.h"
>  #include "access/xlogutils.h"
>  #include "catalog/catalog.h"
> @@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
>  /* XLOG gives us high 4 bits */
>  #define XLOG_SMGR_CREATE    0x10
>  #define XLOG_SMGR_TRUNCATE    0x20
> +#define XLOG_SMGR_HINT        0x40
>  
>  typedef struct xl_smgr_create
>  {
> @@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
>      smgrDoPendingDeletes(false);
>  }
>  
> +/*
> + * Write a backup block if needed when we are setting a hint.
> + *
> + * Deciding the "if needed" bit is delicate and requires us to either
> + * grab WALInsertLock or check the info_lck spinlock. If we check the
> + * spinlock and it says Yes then we will need to get WALInsertLock as well,
> + * so the design choice here is to just go straight for the WALInsertLock
> + * and trust that calls to this function are minimised elsewhere.
> + *
> + * Callable while holding share lock on the buffer content.
> + *
> + * Possible that multiple concurrent backends could attempt to write
> + * WAL records. In that case, more than one backup block may be recorded
> + * though that isn't important to the outcome and the backup blocks are
> + * likely to be identical anyway.
> + */
> +#define    SMGR_HINT_WATERMARK        13579
> +void
> +smgr_buffer_hint(Buffer buffer)
> +{
> +    /*
> +     * Make an XLOG entry reporting the hint
> +     */
> +    XLogRecPtr    lsn;
> +    XLogRecData rdata[2];
> +    int            watermark = SMGR_HINT_WATERMARK;
> +
> +    /*
> +     * Not allowed to have zero-length records, so use a small watermark
> +     */
> +    rdata[0].data = (char *) (&watermark);
> +    rdata[0].len = sizeof(int);
> +    rdata[0].buffer = InvalidBuffer;
> +    rdata[0].buffer_std = false;
> +    rdata[0].next = &(rdata[1]);
> +
> +    rdata[1].data = NULL;
> +    rdata[1].len = 0;
> +    rdata[1].buffer = buffer;
> +    rdata[1].buffer_std = true;
> +    rdata[1].next = NULL;
> +
> +    lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata);
> +
> +    /*
> +     * Set the page LSN if we wrote a backup block.
> +     */
> +    if (!XLByteEQ(InvalidXLogRecPtr, lsn))
> +    {
> +        Page     page = BufferGetPage(buffer);
> +        PageSetLSN(page, lsn);

PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
is insufficient.

Need a PageSetTLI() here, too.

> +        elog(LOG, "inserted backup block for hint bit");
> +    }
> +}

> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c

> @@ -2341,6 +2350,41 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
>      if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
>          (BM_DIRTY | BM_JUST_DIRTIED))
>      {
> +        /*
> +         * If we're writing checksums and we care about torn pages then we
> +         * cannot dirty a page during recovery as a result of a hint.
> +         * We can set the hint, just not dirty the page as a result.
> +         *
> +         * See long discussion in bufpage.c
> +         */
> +        if (HintsMustNotDirtyPage())
> +            return;

This expands to
        if (page_checksums && fullPageWrites && RecoveryInProgress())
If only page_checksums is false, smgr_buffer_hint() can attempt to insert a
WAL record during recovery.

> +
> +        /*
> +         * Write a full page into WAL iff this is the first change on the
> +         * block since the last checkpoint. That will never be the case
> +         * if the block is already dirty because we either made a change
> +         * or set a hint already. Note that aggressive cleaning of blocks
> +         * dirtied by hint bit setting would increase the call rate.
> +         * Bulk setting of hint bits would reduce the call rate...
> +         *
> +         * We must issue the WAL record before we mark the buffer dirty.
> +         * Otherwise we might write the page before we write the WAL.
> +         * That causes a race condition, since a checkpoint might
> +         * occur between writing the WAL record and marking the buffer dirty.
> +         * We solve that with a kluge, but one that is already in use
> +         * during transaction commit to prevent race conditions.
> +         * Basically, we simply prevent the checkpoint WAL record from
> +         * being written until we have marked the buffer dirty. We don't
> +         * start the checkpoint flush until we have marked dirty, so our
> +         * checkpoint must flush the change to disk successfully or the
> +         * checkpoint never gets written, so crash recovery will set us right.
> +         *
> +         * XXX rename PGPROC variable later; keep it same now for clarity
> +         */
> +        MyPgXact->inCommit = true;
> +        smgr_buffer_hint(buffer);
> +
>          LockBufHdr(bufHdr);
>          Assert(bufHdr->refcount > 0);
>          if (!(bufHdr->flags & BM_DIRTY))
> @@ -2351,6 +2395,7 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
>          }
>          bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
>          UnlockBufHdr(bufHdr);
> +        MyPgXact->inCommit = false;
>      }
>  }
>  
> diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
> index 096d36a..a220310 100644
> --- a/src/backend/storage/buffer/localbuf.c
> +++ b/src/backend/storage/buffer/localbuf.c
> @@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
>          /* Find smgr relation for buffer */
>          oreln = smgropen(bufHdr->tag.rnode, MyBackendId);
>  
> +        /* XXX do we want to write checksums for local buffers? An option? */
> +

Don't we have that, now that it happens in mdwrite()?

I can see value in an option to exclude local buffers, since corruption there
may be less exciting.  It doesn't seem important for an initial patch, though.

> --- a/src/backend/storage/page/bufpage.c
> +++ b/src/backend/storage/page/bufpage.c

> + * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22.

I get a 404 for that link.

> --- a/src/include/storage/bufpage.h
> +++ b/src/include/storage/bufpage.h

> -#define PD_VALID_FLAG_BITS    0x0007        /* OR of all valid pd_flags bits */
> +#define PD_VALID_FLAG_BITS    0x800F        /* OR of all non-checksum pd_flags bits */

The comment is not consistent with the character of the value.

Thanks,
nm


Re: 16-bit page checksums for 9.2

From
Dan Scales
Date:
Some other comments on the checksum patch:

I'm not sure why you moved the checksum calculation (PageSetVerificationInfo) to mdwrite() rather than smgrwrite().  If
there were ever another storage backend, it would have to duplicate the checksum check, right?  Is there a disadvantage
to putting it in smgrwrite()?
 

You may have already noticed this, but a bunch of the comments are incorrect, now that you have moved the checksum
calculation to mdwrite().

 config.sgml - says writes via temp_buffers (which I think means local buffers) are not checksummed -- that's no longer true, right?
 bufmgr.c, line 1914 - bufToWrite no longer exists.  You could revert changes from 1912-1920
 localbuf.c, line 203 - as Noah mentioned, this comment is no longer relevant, you are checksumming local buffers now
 


Dan


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Thu, Jan 26, 2012 at 7:01 PM, Dan Scales <scales@vmware.com> wrote:
> I'm not sure why you moved the checksum calculation (PageSetVerificationInfo) to mdwrite() rather than smgrwrite().
> If there were ever another storage backend, it would have to duplicate the checksum check, right?  Is there a
> disadvantage to putting it in smgrwrite()?

The smgr and md layers don't currently know anything about the page
format, and I imagine we want to keep it that way.  It seems like the
right place for this is in some higher layer, like bufmgr.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Dan Scales
Date:
The advantage of putting the checksum calculation in smgrwrite() (or mdwrite()) is that it catches a bunch of page
writes that don't go through the buffer pool (see calls to smgrwrite() in nbtree.c, nbtsort.c, spginsert.c)

Also, I missed this before:  don't you want to add the checksum calculation (PageSetVerificationInfo) to mdextend() (or
preferably smgrextend()) as well?  Otherwise, you won't be checksumming a bunch of the new pages.

Dan


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Fri, Jan 27, 2012 at 9:07 PM, Dan Scales <scales@vmware.com> wrote:

> The advantage of putting the checksum calculation in smgrwrite() (or mdwrite()) is that it catches a bunch of page
> writes that don't go through the buffer pool (see calls to smgrwrite() in nbtree.c, nbtsort.c, spginsert.c)

I'll have another look at that. Seems like we can make it various
ways, we just need to decide the code placement.

> Also, I missed this before:  don't you want to add the checksum calculation (PageSetVerificationInfo) to mdextend()
> (or preferably smgrextend()) as well?  Otherwise, you won't be checksumming a bunch of the new pages.

You don't need to checksum the extend because no data is written at
that point. It creates a new block that will become dirty at some point
and then be written out, which will trigger the checksum.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Heikki Linnakangas
Date:
On 28.01.2012 15:49, Simon Riggs wrote:
> On Fri, Jan 27, 2012 at 9:07 PM, Dan Scales<scales@vmware.com>  wrote:
>
>> Also, I missed this before:  don't you want to add the checksum calculation (PageSetVerificationInfo) to mdextend()
>> (or preferably smgrextend()) as well?  Otherwise, you won't be checksumming a bunch of the new pages.
>
> You don't need to checksum the extend because no data is written at
> that point.

That's not correct. smgrextend writes a block of data just like 
smgrwrite does. When a relation is extended by the buffer manager, it 
calls smgrextend with an all-zeros block, but not all callers do that. 
See rewriteheap.c and nbtsort.c for counter-examples.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Fri, Jan 27, 2012 at 4:07 PM, Dan Scales <scales@vmware.com> wrote:
> The advantage of putting the checksum calculation in smgrwrite() (or mdwrite()) is that it catches a bunch of page
> writes that don't go through the buffer pool (see calls to smgrwrite() in nbtree.c, nbtsort.c, spginsert.c)

Maybe we should have some sort of wrapper function, then.  I really
dislike the idea that the smgr layer knows anything about the page
format, and if md has to know that's even worse.
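
One shape such a wrapper could take, as a sketch only; the wrapper name is
invented and the exact signature of the patch's PageSetVerificationInfo() is
an assumption here:

    /*
     * Hypothetical wrapper: compute the verification info at a layer that
     * understands the page format, then hand the buffer to smgr.
     */
    static void
    WritePageThroughSmgr(SMgrRelation reln, ForkNumber forknum,
                         BlockNumber blocknum, Page page, bool isExtend)
    {
        PageSetVerificationInfo(page);          /* set checksum before writing */

        if (isExtend)
            smgrextend(reln, forknum, blocknum, (char *) page, false);
        else
            smgrwrite(reln, forknum, blocknum, (char *) page, false);
    }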

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Bruce Momjian
Date:
On Sun, Dec 25, 2011 at 04:25:19PM +0100, Martijn van Oosterhout wrote:
> On Sat, Dec 24, 2011 at 04:01:02PM +0000, Simon Riggs wrote:
> > On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <andres@anarazel.de> wrote:
> > > Why don't you use the same tricks as the former patch and copy the buffer,
> > > compute the checksum on that, and then write out that copy (you can even do
> > > both at the same time). I have a hard time believing that the additional copy
> > > is more expensive than the locking.
> > 
> > ISTM we can't write and copy at the same time because the checksum is
> > not a trailer field.
> 
> Of course you can. If the checksum is in the trailer field you get the
> nice property that the whole block has a constant checksum. However, if
> you store the checksum elsewhere you just need to change the checking
> algorithm to copy the checksum out, zero those bytes and run the
> checksum and compare with the extracted checksum.
> 
> Not pretty, but I don't think it makes a difference in performance.
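
A sketch of the check described above, with invented names and using a
Fletcher-16 routine like the one sketched earlier:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /*
     * Verify a checksum stored mid-page rather than in a trailer: copy the
     * stored value out, zero those bytes, checksum the whole block, compare.
     */
    static bool
    checksum_matches(uint8_t *page, size_t pagesize, size_t csum_off)
    {
        uint16_t    stored;
        uint16_t    computed;

        memcpy(&stored, page + csum_off, sizeof(stored));
        memset(page + csum_off, 0, sizeof(stored));
        computed = fletcher16(page, pagesize);
        memcpy(page + csum_off, &stored, sizeof(stored));   /* restore */

        return stored == computed;
    }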

Sorry to be late replying to this, but an interesting idea would be to
zero the _hint_ bits before doing the CRC checksum.  That would avoid
the problem of WAL-logging the hint bits.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +


Re: 16-bit page checksums for 9.2

From
Bruce Momjian
Date:
On Sat, Dec 24, 2011 at 03:56:58PM +0000, Simon Riggs wrote:
> > Also, as far as I can see this patch usurps the page version field,
> > which I find unacceptably short-sighted.  Do you really think this is
> > the last page layout change we'll ever make?
> 
> No, I don't. I hope and expect the next page layout change to
> reintroduce such a field.
> 
> But since we're agreed now that upgrading is important, changing page
> format isn't likely to be happening until we get an online upgrade
> process. So future changes are much less likely. If they do happen, we
> have some flag bits spare that can be used to indicate later versions.
> It's not the prettiest thing in the world, but it's a small ugliness
> in return for an important feature. If there was a way without that, I
> would have chosen it.

Have you considered the CRC might match a valid page version number?
Is that safe?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Sat, Dec 24, 2011 at 03:56:58PM +0000, Simon Riggs wrote:
>> > Also, as far as I can see this patch usurps the page version field,
>> > which I find unacceptably short-sighted.  Do you really think this is
>> > the last page layout change we'll ever make?
>>
>> No, I don't. I hope and expect the next page layout change to
>> reintroduce such a field.
>>
>> But since we're agreed now that upgrading is important, changing page
>> format isn't likely to be happening until we get an online upgrade
>> process. So future changes are much less likely. If they do happen, we
>> have some flag bits spare that can be used to indicate later versions.
>> It's not the prettiest thing in the world, but it's a small ugliness
>> in return for an important feature. If there was a way without that, I
>> would have chosen it.
>
> Have you considered the CRC might match a valuid page version number?
> Is that safe?

In the proposed scheme there are two flag bits set on the page to
indicate whether the field is used as a checksum rather than a version
number. So it's possible the checksum could look like a valid page
version number, but we'd still be able to tell the difference.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Bruce Momjian
Date:
On Sun, Feb 05, 2012 at 08:40:09PM +0000, Simon Riggs wrote:
> On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > On Sat, Dec 24, 2011 at 03:56:58PM +0000, Simon Riggs wrote:
> >> > Also, as far as I can see this patch usurps the page version field,
> >> > which I find unacceptably short-sighted.  Do you really think this is
> >> > the last page layout change we'll ever make?
> >>
> >> No, I don't. I hope and expect the next page layout change to
> >> reintroduce such a field.
> >>
> >> But since we're agreed now that upgrading is important, changing page
> >> format isn't likely to be happening until we get an online upgrade
> >> process. So future changes are much less likely. If they do happen, we
> >> have some flag bits spare that can be used to indicate later versions.
> >> It's not the prettiest thing in the world, but it's a small ugliness
> >> in return for an important feature. If there was a way without that, I
> >> would have chosen it.
> >
> > Have you considered the CRC might match a valuid page version number?
> > Is that safe?
> 
> In the proposed scheme there are two flag bits set on the page to
> indicate whether the field is used as a checksum rather than a version
> number. So its possible the checksum could look like a valid page
> version number, but we'd still be able to tell the difference.

Thanks.  Clearly we don't need 16 bits to represent our page version
number because we rarely change it.  The other good thing is I don't
remember anyone wanting additional per-page storage in the past few
years except for a checksum.  

I wonder if we should just dedicate 3 page header bits, call that the
page version number, and set this new version number to 1, and assume
all previous versions were zero, and have them look in the old page
version location if the new version number is zero.  I am basically
thinking of how we can plan ahead to move the version number to a new
location and have a defined way of finding the page version number using
old and new schemes.

I don't want to get to a point where we need a bit per page version
number.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +


Re: 16-bit page checksums for 9.2

From
Heikki Linnakangas
Date:
On 06.02.2012 05:48, Bruce Momjian wrote:
> On Sun, Feb 05, 2012 at 08:40:09PM +0000, Simon Riggs wrote:
>> In the proposed scheme there are two flag bits set on the page to
>> indicate whether the field is used as a checksum rather than a version
>> number. So its possible the checksum could look like a valid page
>> version number, but we'd still be able to tell the difference.
>
> Thanks.  Clearly we don't need 16 bits to represent our page version
> number because we rarely change it. The other good thing is I don't
> remember anyone wanting additional per-page storage in the past few
> years except for a checksum.

There's this idea that I'd like to see implemented: 
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php

In any case, I think it's a very bad idea to remove the page version 
field. How are we supposed to ever be able to change the page format 
again if we don't even have a version number on the page? I strongly 
oppose removing it.

I'm also not very comfortable with the idea of having flags on the page 
indicating whether it has a checksum or not. It's not hard to imagine a 
software or firmware bug or hardware failure that would cause pd_flags 
field to be zeroed out altogether. It would be more robust if the 
expected bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to 
deal with that at upgrade somehow. And it still feels a bit whacky anyway.

> I wonder if we should just dedicate 3 page header bits, call that the
> page version number, and set this new version number to 1, and assume
> all previous versions were zero, and have them look in the old page
> version location if the new version number is zero.  I am basically
> thinking of how we can plan ahead to move the version number to a new
> location and have a defined way of finding the page version number using
> old and new schemes.

Three bits seems short-sighted, but yeah, something like 6-8 bits should 
be enough. On the whole, though, I think we should bite the bullet and 
invent a way to extend the page header at upgrade.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Mon, Feb 6, 2012 at 4:48 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Sun, Feb 05, 2012 at 08:40:09PM +0000, Simon Riggs wrote:
>> On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian <bruce@momjian.us> wrote:
>> > On Sat, Dec 24, 2011 at 03:56:58PM +0000, Simon Riggs wrote:
>> >> > Also, as far as I can see this patch usurps the page version field,
>> >> > which I find unacceptably short-sighted.  Do you really think this is
>> >> > the last page layout change we'll ever make?
>> >>
>> >> No, I don't. I hope and expect the next page layout change to
>> >> reintroduce such a field.
>> >>
>> >> But since we're agreed now that upgrading is important, changing page
>> >> format isn't likely to be happening until we get an online upgrade
>> >> process. So future changes are much less likely. If they do happen, we
>> >> have some flag bits spare that can be used to indicate later versions.
>> >> It's not the prettiest thing in the world, but it's a small ugliness
>> >> in return for an important feature. If there was a way without that, I
>> >> would have chosen it.
>> >
>> > Have you considered the CRC might match a valuid page version number?
>> > Is that safe?
>>
>> In the proposed scheme there are two flag bits set on the page to
>> indicate whether the field is used as a checksum rather than a version
>> number. So its possible the checksum could look like a valid page
>> version number, but we'd still be able to tell the difference.
>
> Thanks.  Clearly we don't need 16 bits to represent our page version
> number because we rarely change it.  The other good thing is I don't
> remember anyone wanting additional per-page storage in the past few
> years except for a checksum.
>
> I wonder if we should just dedicate 3 page header bits, call that the
> page version number, and set this new version number to 1, and assume
> all previous versions were zero, and have them look in the old page
> version location if the new version number is zero.  I am basically
> thinking of how we can plan ahead to move the version number to a new
> location and have a defined way of finding the page version number using
> old and new schemes.
>
> I don't want to get to a point where we need a bit per page number
> version.

Agreed, though I think we need at least 2 bits set if we are using the
checksum, so we should start at version 2 to ensure that.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 06.02.2012 05:48, Bruce Momjian wrote:
>>
>> On Sun, Feb 05, 2012 at 08:40:09PM +0000, Simon Riggs wrote:
>>>
>>> In the proposed scheme there are two flag bits set on the page to
>>> indicate whether the field is used as a checksum rather than a version
>>> number. So its possible the checksum could look like a valid page
>>> version number, but we'd still be able to tell the difference.
>>
>>
>> Thanks.  Clearly we don't need 16 bits to represent our page version
>> number because we rarely change it. The other good thing is I don't
>> remember anyone wanting additional per-page storage in the past few
>> years except for a checksum.
>
>
> There's this idea that I'd like to see implemented:
> http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php

As you'll note, adding that field will change the page format and is
therefore ruled out for 9.2.

ISTM there is a different way to handle that anyway. All we need to do
is to record the LSN of the last wraparound in shared memory/control
file. Any block with page LSN older than that has all-frozen rows.
That doesn't use any space nor does it require another field to be
set.

That leaves the same problem that if someone writes to the page you
need to freeze the rows first.
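
A sketch of how the test might read at vacuum time; lastWraparoundLSN is an
invented name, nothing like it exists in pg_control or shared memory today:

    /*
     * Hypothetical: pages last written before the most recent wraparound
     * point cannot contain unfrozen tuples, so skip the per-tuple work.
     */
    if (XLByteLT(PageGetLSN(page), ShmemVariableCache->lastWraparoundLSN))
        continue;               /* nothing on this page needs freezing */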

> In any case, I think it's a very bad idea to remove the page version field.
> How are we supposed to ever be able to change the page format again if we
> don't even have a version number on the page? I strongly oppose removing it.

Nobody is removing the version field, nor is anybody suggesting not
being able to tell which page version we are looking at.

> I'm also not very comfortable with the idea of having flags on the page
> indicating whether it has a checksum or not. It's not hard to imagine a
> software of firmware bug or hardware failure that would cause pd_flags field
> to be zeroed out altogether. It would be more robust if the expected
> bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that
> at upgrade somehow. And it still feels a bit whacky anyway.

Good idea. Let's use

0-0-0 to represent upgraded from previous version, needs a bit set
0-0-1 to represent new version number of page, no checksum
1-1-1 to represent new version number of page, with checksum

So we have 1 bit dedicated to the page version, 2 bits to the checksum indicator
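
Expressed as pd_flags bits, that scheme might look like the following; the
bit positions and macro names are illustrative only:

    /* Hypothetical pd_flags bits for the scheme above */
    #define PD_PAGE_VERSION_PLUS    0x0008      /* page uses the new layout */
    #define PD_CHECKSUM1            0x0010      /* both checksum bits must be */
    #define PD_CHECKSUM2            0x0020      /* set for the field to be a checksum */

    #define PageHasChecksum(page) \
        ((((PageHeader) (page))->pd_flags & (PD_CHECKSUM1 | PD_CHECKSUM2)) == \
         (PD_CHECKSUM1 | PD_CHECKSUM2))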

>> I wonder if we should just dedicate 3 page header bits, call that the
>> page version number, and set this new version number to 1, and assume
>> all previous versions were zero, and have them look in the old page
>> version location if the new version number is zero.  I am basically
>> thinking of how we can plan ahead to move the version number to a new
>> location and have a defined way of finding the page version number using
>> old and new schemes.
>
>
> Three bits seems short-sighted, but yeah, something like 6-8 bits should be
> enough. On the whole, though. I think we should bite the bullet and invent a
> way to extend the page header at upgrade.

There are currently many spare bits. I don't see any need to allocate
them to this specific use ahead of time - especially since that is the
exact decision we took last time when we reserved 16 bits for the
version.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Heikki Linnakangas
Date:
On 06.02.2012 10:05, Simon Riggs wrote:
> On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> On 06.02.2012 05:48, Bruce Momjian wrote:
>>>
>>> Thanks.  Clearly we don't need 16 bits to represent our page version
>>> number because we rarely change it. The other good thing is I don't
>>> remember anyone wanting additional per-page storage in the past few
>>> years except for a checksum.
>>
>> There's this idea that I'd like to see implemented:
>> http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php
>
> As you'll note, adding that field will change the page format and is
> therefore ruled out for 9.2.
>
> ISTM there is a different way to handle that anyway. All we need to do
> is to record the LSN of the last wraparound in shared memory/control
> file. Any block with page LSN older than that has all-frozen rows.
> That doesn't use any space nor does it require another field to be
> set.

Good idea. However, the followup idea to that discussion was to not only 
avoid the I/O needed to mark tuples as frozen, but to avoid xid 
wraparound altogether, by allowing clog to grow indefinitely. You do 
want to freeze at some point of course, to truncate the clog, but it 
would be nice to not have a hard limit. The way to do that is to store 
an xid "epoch" in the page header, so that Xids are effectively 64-bits 
wide, even though the xid fields on the tuple header are only 32-bits 
wide. That does require a new field in the page header.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Mon, Feb 6, 2012 at 10:02 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 06.02.2012 10:05, Simon Riggs wrote:
>>
>> On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com>  wrote:
>>>
>>> On 06.02.2012 05:48, Bruce Momjian wrote:
>>>>
>>>>
>>>> Thanks.  Clearly we don't need 16 bits to represent our page version
>>>> number because we rarely change it. The other good thing is I don't
>>>> remember anyone wanting additional per-page storage in the past few
>>>> years except for a checksum.
>>>
>>>
>>> There's this idea that I'd like to see implemented:
>>> http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php
>>
>>
>> As you'll note, adding that field will change the page format and is
>> therefore ruled out for 9.2.
>>
>> ISTM there is a different way to handle that anyway. All we need to do
>> is to record the LSN of the last wraparound in shared memory/control
>> file. Any block with page LSN older than that has all-frozen rows.
>> That doesn't use any space nor does it require another field to be
>> set.
>
>
> Good idea. However, the followup idea to that discussion was to not only
> avoid the I/O needed to mark tuples as frozen, but to avoid xid wraparound
> altogether, by allowing clog to grow indefinitely. You do want to freeze at
> some point of course, to truncate the clog, but it would be nice to not have
> a hard limit. The way to do that is to store an xid "epoch" in the page
> header, so that Xids are effectively 64-bits wide, even though the xid
> fields on the tuple header are only 32-bits wide. That does require a new
> field in the page header.

We wouldn't need to do that, would we?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Heikki Linnakangas
Date:
On 06.02.2012 11:25, Simon Riggs wrote:
> On Mon, Feb 6, 2012 at 10:02 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> Good idea. However, the followup idea to that discussion was to not only
>> avoid the I/O needed to mark tuples as frozen, but to avoid xid wraparound
>> altogether, by allowing clog to grow indefinitely. You do want to freeze at
>> some point of course, to truncate the clog, but it would be nice to not have
>> a hard limit. The way to do that is to store an xid "epoch" in the page
>> header, so that Xids are effectively 64-bits wide, even though the xid
>> fields on the tuple header are only 32-bits wide. That does require a new
>> field in the page header.
>
> We wouldn't need to do that would we?

Huh? Do you mean that we wouldn't need to implement that feature?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: 16-bit page checksums for 9.2

From
Bruce Momjian
Date:
On Mon, Feb 06, 2012 at 09:05:19AM +0000, Simon Riggs wrote:
> > In any case, I think it's a very bad idea to remove the page version field.
> > How are we supposed to ever be able to change the page format again if we
> > don't even have a version number on the page? I strongly oppose removing it.
> 
> Nobody is removing the version field, nor is anybody suggesting not
> being able to tell which page version we are looking at.

Agreed.  I thought the idea was that we have a 16-bit page _version_
number and 8+ free page _flag_ bits, which are all currently zero.  The
idea was to move the version number from 16-bit field into the unused
flag bits, and use the 16-bit field for the checksum.  I would like to
have some logic that would allow tools inspecting the page to tell if
they should look for the version number in the bits at the beginning of
the page or at the end.

Specifically, this becomes the checksum:
uint16          pd_pagesize_version;

and this holds the page version, if we have updated the page to the new
format:
       uint16          pd_flags;               /* flag bits, see below */

Of the 16 bits of pd_flags, these are the only ones used:

#define PD_HAS_FREE_LINES       0x0001          /* are there any unused line pointers? */
#define PD_PAGE_FULL            0x0002          /* not enough free space for new
                                                 * tuple? */
#define PD_ALL_VISIBLE          0x0004          /* all tuples on page are visible to
                                                 * everyone */

#define PD_VALID_FLAG_BITS      0x0007          /* OR of all valid pd_flags bits */


> > I'm also not very comfortable with the idea of having flags on the page
> > indicating whether it has a checksum or not. It's not hard to imagine a
> > software of firmware bug or hardware failure that would cause pd_flags field
> > to be zeroed out altogether. It would be more robust if the expected
> > bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that
> > at upgrade somehow. And it still feels a bit whacky anyway.

> Good idea. Lets use
> 
> 0-0-0 to represent upgraded from previous version, needs a bit set
> 0-0-1 to represent new version number of page, no checksum
> 1-1-1 to represent new version number of page, with checksum
> 
> So we have 1 bit dedicated to the page version, 2 bits to the checksum indicator

Interesting point that we would not be guarding against a bit flip from
1 to 0 for the checksum bit; I agree using two bits is the way to go.  I
don't see how upgrade figures into this.

However, I am unclear how Simon's idea above actually works.  We need
two bits for redundancy, both 1, to mark a page as having a checksum.  I
don't think mixing the idea of a new page version and checksum enabled
really makes sense, especially since we have to plan for future page
version changes.

I think we dedicate 2 bits to say we have computed a checksum, and 3
bits to mark up to 8 possible page versions, so the logic is, in
pd_flags, we use bits 0x0008 and 0x0010 to indicate that a checksum is stored
on the page, and we use 0x0020 and later for the page version number.  We
can assume all the remaining bits are for the page version number until
we need to define new bits, and we can start storing them at the end
first, and work forward.  If all the version bits are zero, it means the
page version number is still stored in pd_pagesize_version.
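
A sketch of the resulting lookup rule; the mask and shift are invented, and
only the fallback to pd_pagesize_version reflects the current on-disk format:

    /*
     * Hypothetical accessor: version bits of zero in pd_flags mean the page
     * still stores its layout version in the low byte of pd_pagesize_version.
     */
    static uint16
    PageGetLayoutVersion(PageHeader ph)
    {
        uint16      vbits = (ph->pd_flags & PD_VERSION_MASK) >> PD_VERSION_SHIFT;

        if (vbits != 0)
            return vbits;
        return ph->pd_pagesize_version & 0x00FF;    /* pre-checksum pages */
    }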

> >> I wonder if we should just dedicate 3 page header bits, call that the
> >> page version number, and set this new version number to 1, and assume
> >> all previous versions were zero, and have them look in the old page
> >> version location if the new version number is zero.  I am basically
> >> thinking of how we can plan ahead to move the version number to a new
> >> location and have a defined way of finding the page version number using
> >> old and new schemes.
> >
> >
> > Three bits seems short-sighted, but yeah, something like 6-8 bits should be
> > enough. On the whole, though. I think we should bite the bullet and invent a
> > way to extend the page header at upgrade.
> 
> There are currently many spare bits. I don't see any need to allocate
> them to this specific use ahead of time - especially since that is the
> exact decision we took last time when we reserved 16 bits for the
> version.

Right, but I am thinking we should set things up so we can grow the page
version number into the unused bit, rather than box it between bits we
are already using.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +


Re: 16-bit page checksums for 9.2

From
Bruce Momjian
Date:
On Mon, Feb 06, 2012 at 08:51:34AM +0100, Heikki Linnakangas wrote:
> >I wonder if we should just dedicate 3 page header bits, call that the
> >page version number, and set this new version number to 1, and assume
> >all previous versions were zero, and have them look in the old page
> >version location if the new version number is zero.  I am basically
> >thinking of how we can plan ahead to move the version number to a new
> >location and have a defined way of finding the page version number using
> >old and new schemes.
> 
> Three bits seems short-sighted, but yeah, something like 6-8 bits
> should be enough. On the whole, though. I think we should bite the
> bullet and invent a way to extend the page header at upgrade.

I just emailed a possible layout that allows for future page version
number expansion.  

I don't think there is any magic bullet that will allow for page header
extension by pg_upgrade.  If it is going to be done, it would have to
happen in the backend while the system is running.  Anything that
requires pg_upgrade to do lots of reads or writes basically makes
pg_upgrade useless.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +


Re: 16-bit page checksums for 9.2

From
Bruce Momjian
Date:
On Mon, Feb 06, 2012 at 12:59:59PM -0500, Bruce Momjian wrote:
> > > I'm also not very comfortable with the idea of having flags on the page
> > > indicating whether it has a checksum or not. It's not hard to imagine a
> > > software of firmware bug or hardware failure that would cause pd_flags field
> > > to be zeroed out altogether. It would be more robust if the expected
> > > bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that
> > > at upgrade somehow. And it still feels a bit whacky anyway.
> 
> > Good idea. Lets use
> > 
> > 0-0-0 to represent upgraded from previous version, needs a bit set
> > 0-0-1 to represent new version number of page, no checksum
> > 1-1-1 to represent new version number of page, with checksum
> > 
> > So we have 1 bit dedicated to the page version, 2 bits to the checksum indicator
> 
> Interesting point that we would not be guarding against a bit flip from
> 1 to 0 for the checksum bit; I agree using two bits is the way to go.  I
> don't see how upgrade figures into this.
> 
> However, I am unclear how Simon's idea above actually works.  We need
> two bits for redundancy, both 1, to mark a page as having a checksum.  I
> don't think mixing the idea of a new page version and checksum enabled
> really makes sense, especially since we have to plan for future page
> version changes.
> 
> I think we dedicate 2 bits to say we have computed a checksum, and 3
> bits to mark up to 8 possible page versions, so the logic is, in
> pd_flags, we use bits 0x8 and 0x16 to indicate that a checksum is stored
> on the page, and we use 0x32 and later for the page version number.  We
> can assume all the remaining bits are for the page version number until
> we need to define new bits, and we can start storing them at the end
> first, and work forward.  If all the version bits are zero, it means the
> page version number is still stored in pd_pagesize_version.

A simpler solution would be to place two bits for checksum after the
existing page bit usage, and place the page version on the last four
bits of the 16-bit field --- that still leaves us with 7 unused bits.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah@leadboat.com> wrote:
> On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:
>> v7

Thanks very much for the review. Just realised I hadn't actually replied...

> This patch uses FPIs to guard against torn hint writes, even when the
> checksums are disabled.  One could not simply condition them on the
> page_checksums setting, though.  Suppose page_checksums=off and we're hinting
> a page previously written with page_checksums=on.  If that write tears,
> leaving the checksum intact, that block will now fail verification.  A couple
> of ideas come to mind.  (a) Read the on-disk page and emit an FPI only if the
> old page had a checksum.  (b) Add a checksumEnableLSN field to pg_control.
> Whenever the cluster starts with checksums disabled, set the field to
> InvalidXLogRecPtr.  Whenever the cluster starts with checksums enabled and the
> field is InvalidXLogRecPtr, set the field to the next LSN.  When a checksum
> failure occurs in a page with LSN older than the stored one, emit either a
> softer warning or no message at all.

We can only change page_checksums at restart (now) so the above seems moot.

If we crash with FPWs enabled we repair any torn pages.

> Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty
> the buffer.  Here are other sites I noticed that do MarkBufferDirty() without
> bumping the page LSN:
>        heap_xlog_visible()
>        visibilitymap_clear()
>        visibilitymap_truncate()
>        lazy_scan_heap()
>        XLogRecordPageWithFreeSpace()
>        FreeSpaceMapTruncateRel()
>        fsm_set_and_search()
>        fsm_vacuum_page()
>        fsm_search_avail()
> Though I have not investigated each of these in detail, I suspect most or all
> represent continued torn-page hazards.  Since the FSM is just a hint, we do
> not WAL-log it at all; it would be nicest to always skip checksums for it,
> too.  The visibility map shortcuts are more nuanced.

Still checking all, but not as bad as the list looks.

I'm removing the checksum code from smgr and putting it back into bufmgr
and other locations.

Patch footprint now looks like this.
doc/src/sgml/config.sgml                       |   40 +++
src/backend/access/hash/hashpage.c             |    1
src/backend/access/heap/rewriteheap.c          |    4
src/backend/access/heap/visibilitymap.c        |    1
src/backend/access/nbtree/nbtree.c             |    1
src/backend/access/nbtree/nbtsort.c            |    3
src/backend/access/spgist/spginsert.c          |    2
src/backend/access/transam/twophase.c          |   16 -
src/backend/access/transam/xact.c              |    8
src/backend/access/transam/xlog.c              |  114 ++++++++
src/backend/commands/tablecmds.c               |    2
src/backend/storage/buffer/bufmgr.c            |   75 +++++
src/backend/storage/buffer/localbuf.c          |    2
src/backend/storage/ipc/procarray.c            |   63 +++-
src/backend/storage/lmgr/proc.c                |    4
src/backend/storage/page/bufpage.c             |  331 +++++++++++++++++++++++++-
src/backend/utils/misc/guc.c                   |   14 +
src/backend/utils/misc/postgresql.conf.sample  |   15 -
src/include/access/xlog.h                      |    1
src/include/catalog/pg_control.h               |    3
src/include/catalog/storage.h                  |    1
src/include/storage/bufpage.h                  |  107 ++++++--
src/include/storage/proc.h                     |    3
src/include/storage/procarray.h                |    4
24 files changed, 743 insertions(+), 72 deletions(-)

Patch is *not* attached here, yet. Still working on it.

> When page_checksums=off and we read a page last written by page_checksums=on,
> we still verify its checksum.  If a mostly-insert/read site uses checksums for
> some time and eventually wishes to shed the overhead, disabling the feature
> will not stop the costs for reads of old data.  They would need a dump/reload
> to get the performance of a never-checksummed database.  Let's either
> unconditionally skip checksum validation under page_checksums=off or add a new
> state like page_checksums=ignore for that even-weaker condition.

Agreed, changed.

>> --- a/doc/src/sgml/config.sgml
>> +++ b/doc/src/sgml/config.sgml
>
>> +       <para>
>> +        Turning this parameter off speeds normal operation, but
>> +        might allow data corruption to go unnoticed. The checksum uses
>> +        16-bit checksums, using the fast Fletcher 16 algorithm. With this
>> +        parameter enabled there is still a non-zero probability that an error
>> +        could go undetected, as well as a non-zero probability of false
>> +        positives.
>> +       </para>
>
> What sources of false positives do you intend to retain?

I see none. Will reword.

>> --- a/src/backend/catalog/storage.c
>> +++ b/src/backend/catalog/storage.c
>> @@ -20,6 +20,7 @@
>>  #include "postgres.h"
>>
>>  #include "access/visibilitymap.h"
>> +#include "access/transam.h"
>>  #include "access/xact.h"
>>  #include "access/xlogutils.h"
>>  #include "catalog/catalog.h"
>> @@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
>>  /* XLOG gives us high 4 bits */
>>  #define XLOG_SMGR_CREATE     0x10
>>  #define XLOG_SMGR_TRUNCATE   0x20
>> +#define XLOG_SMGR_HINT               0x40
>>
>>  typedef struct xl_smgr_create
>>  {
>> @@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
>>       smgrDoPendingDeletes(false);
>>  }
>>
>> +/*
>> + * Write a backup block if needed when we are setting a hint.
>> + *
>> + * Deciding the "if needed" bit is delicate and requires us to either
>> + * grab WALInsertLock or check the info_lck spinlock. If we check the
>> + * spinlock and it says Yes then we will need to get WALInsertLock as well,
>> + * so the design choice here is to just go straight for the WALInsertLock
>> + * and trust that calls to this function are minimised elsewhere.
>> + *
>> + * Callable while holding share lock on the buffer content.
>> + *
>> + * Possible that multiple concurrent backends could attempt to write
>> + * WAL records. In that case, more than one backup block may be recorded
>> + * though that isn't important to the outcome and the backup blocks are
>> + * likely to be identical anyway.
>> + */
>> +#define      SMGR_HINT_WATERMARK             13579
>> +void
>> +smgr_buffer_hint(Buffer buffer)
>> +{
>> +     /*
>> +      * Make an XLOG entry reporting the hint
>> +      */
>> +     XLogRecPtr      lsn;
>> +     XLogRecData rdata[2];
>> +     int                     watermark = SMGR_HINT_WATERMARK;
>> +
>> +     /*
>> +      * Not allowed to have zero-length records, so use a small watermark
>> +      */
>> +     rdata[0].data = (char *) (&watermark);
>> +     rdata[0].len = sizeof(int);
>> +     rdata[0].buffer = InvalidBuffer;
>> +     rdata[0].buffer_std = false;
>> +     rdata[0].next = &(rdata[1]);
>> +
>> +     rdata[1].data = NULL;
>> +     rdata[1].len = 0;
>> +     rdata[1].buffer = buffer;
>> +     rdata[1].buffer_std = true;
>> +     rdata[1].next = NULL;
>> +
>> +     lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata);
>> +
>> +     /*
>> +      * Set the page LSN if we wrote a backup block.
>> +      */
>> +     if (!XLByteEQ(InvalidXLogRecPtr, lsn))
>> +     {
>> +             Page    page = BufferGetPage(buffer);
>> +             PageSetLSN(page, lsn);
>
> PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
> is insufficient.

Am serialising this by only writing PageLSN while holding buf hdr lock.

> Need a PageSetTLI() here, too.

Agreed

>> +             elog(LOG, "inserted backup block for hint bit");
>> +     }
>> +}
>
>> --- a/src/backend/storage/buffer/bufmgr.c
>> +++ b/src/backend/storage/buffer/bufmgr.c
>
>> @@ -2341,6 +2350,41 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
>>       if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
>>               (BM_DIRTY | BM_JUST_DIRTIED))
>>       {
>> +             /*
>> +              * If we're writing checksums and we care about torn pages then we
>> +              * cannot dirty a page during recovery as a result of a hint.
>> +              * We can set the hint, just not dirty the page as a result.
>> +              *
>> +              * See long discussion in bufpage.c
>> +              */
>> +             if (HintsMustNotDirtyPage())
>> +                     return;
>
> This expands to
>        if (page_checksums && fullPageWrites && RecoveryInProgress())
> If only page_checksums is false, smgr_buffer_hint() can attempt to insert a
> WAL record during recovery.

Yes, logic corrected.

>> +
>> +             /*
>> +              * Write a full page into WAL iff this is the first change on the
>> +              * block since the last checkpoint. That will never be the case
>> +              * if the block is already dirty because we either made a change
>> +              * or set a hint already. Note that aggressive cleaning of blocks
>> +              * dirtied by hint bit setting would increase the call rate.
>> +              * Bulk setting of hint bits would reduce the call rate...
>> +              *
>> +              * We must issue the WAL record before we mark the buffer dirty.
>> +              * Otherwise we might write the page before we write the WAL.
>> +              * That causes a race condition, since a checkpoint might
>> +              * occur between writing the WAL record and marking the buffer dirty.
>> +              * We solve that with a kluge, but one that is already in use
>> +              * during transaction commit to prevent race conditions.
>> +              * Basically, we simply prevent the checkpoint WAL record from
>> +              * being written until we have marked the buffer dirty. We don't
>> +              * start the checkpoint flush until we have marked dirty, so our
>> +              * checkpoint must flush the change to disk successfully or the
>> +              * checkpoint never gets written, so crash recovery will set us right.
>> +              *
>> +              * XXX rename PGPROC variable later; keep it same now for clarity
>> +              */
>> +             MyPgXact->inCommit = true;
>> +             smgr_buffer_hint(buffer);
>> +
>>               LockBufHdr(bufHdr);
>>               Assert(bufHdr->refcount > 0);
>>               if (!(bufHdr->flags & BM_DIRTY))
>> @@ -2351,6 +2395,7 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
>>               }
>>               bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
>>               UnlockBufHdr(bufHdr);
>> +             MyPgXact->inCommit = false;
>>       }
>>  }


Found another bug in that section also, relating to inCommit handling.

>> diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
>> index 096d36a..a220310 100644
>> --- a/src/backend/storage/buffer/localbuf.c
>> +++ b/src/backend/storage/buffer/localbuf.c
>> @@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
>>               /* Find smgr relation for buffer */
>>               oreln = smgropen(bufHdr->tag.rnode, MyBackendId);
>>
>> +             /* XXX do we want to write checksums for local buffers? An option? */
>> +
>
> Don't we have that, now that it happens in mdwrite()?
>
> I can see value in an option to exclude local buffers, since corruption there
> may be less exciting.  It doesn't seem important for an initial patch, though.

I'm continuing to exclude local buffers. Let me know if that should change.

>> --- a/src/backend/storage/page/bufpage.c
>> +++ b/src/backend/storage/page/bufpage.c
>
>> + * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22.
>
> I get a 404 for that link.

Changed

>> --- a/src/include/storage/bufpage.h
>> +++ b/src/include/storage/bufpage.h
>
>> -#define PD_VALID_FLAG_BITS   0x0007          /* OR of all valid pd_flags bits */
>> +#define PD_VALID_FLAG_BITS   0x800F          /* OR of all non-checksum pd_flags bits */
>
> The comment is not consistent with the character of the value.

OK


--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Noah Misch
Date:
On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote:
> On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah@leadboat.com> wrote:
> > On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:

> > This patch uses FPIs to guard against torn hint writes, even when the
> > checksums are disabled.  One could not simply condition them on the
> > page_checksums setting, though.  Suppose page_checksums=off and we're hinting
> > a page previously written with page_checksums=on.  If that write tears,
> > leaving the checksum intact, that block will now fail verification.  A couple
> > of ideas come to mind.  (a) Read the on-disk page and emit an FPI only if the
> > old page had a checksum.  (b) Add a checksumEnableLSN field to pg_control.
> > Whenever the cluster starts with checksums disabled, set the field to
> > InvalidXLogRecPtr.  Whenever the cluster starts with checksums enabled and the
> > field is InvalidXLogRecPtr, set the field to the next LSN.  When a checksum
> > failure occurs in a page with LSN older than the stored one, emit either a
> > softer warning or no message at all.
> 
> We can only change page_checksums at restart (now) so the above seems moot.
> 
> If we crash with FPWs enabled we repair any torn pages.

There's no live bug, but that comes at a high cost: the patch has us emit
full-page images for hint bit writes regardless of the page_checksums setting.

> > PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
> > is insufficient.
> 
> Am serialising this by only writing PageLSN while holding buf hdr lock.

That means also taking the buffer header spinlock in every PageGetLSN() caller
holding only a shared buffer content lock.  Do you think that will pay off,
versus the settled pattern of trading here your shared buffer content lock for
an exclusive one?

> > I can see value in an option to exclude local buffers, since corruption there
> > may be less exciting.  It doesn't seem important for an initial patch, though.
> 
> I'm continuing to exclude local buffers. Let me know if that should change.

Seems reasonable.

Thanks,
nm


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch <noah@leadboat.com> wrote:
> On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote:
>> On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah@leadboat.com> wrote:
>> > On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:
>
>> > This patch uses FPIs to guard against torn hint writes, even when the
>> > checksums are disabled.  One could not simply condition them on the
>> > page_checksums setting, though.  Suppose page_checksums=off and we're hinting
>> > a page previously written with page_checksums=on.  If that write tears,
>> > leaving the checksum intact, that block will now fail verification.  A couple
>> > of ideas come to mind.  (a) Read the on-disk page and emit an FPI only if the
>> > old page had a checksum.  (b) Add a checksumEnableLSN field to pg_control.
>> > Whenever the cluster starts with checksums disabled, set the field to
>> > InvalidXLogRecPtr.  Whenever the cluster starts with checksums enabled and the
>> > field is InvalidXLogRecPtr, set the field to the next LSN.  When a checksum
>> > failure occurs in a page with LSN older than the stored one, emit either a
>> > softer warning or no message at all.
>>
>> We can only change page_checksums at restart (now) so the above seems moot.
>>
>> If we crash with FPWs enabled we repair any torn pages.
>
> There's no live bug, but that comes at a high cost: the patch has us emit
> full-page images for hint bit writes regardless of the page_checksums setting.

Sorry, I don't understand what you mean. I don't see any failure cases
that require that.

page_checksums can only change at a shutdown checkpoint,

The statement "If that write tears,
>> > leaving the checksum intact, that block will now fail verification."
cannot happen, ISTM.

If we write out a block we update the checksum if page_checksums is
set, or we clear it. If we experience a torn page at crash, the FPI
corrects that, so the checksum never does fail verification. We only
need to write a FPI when we write with checksums.

If that's wrong, please explain a failure case in detail.
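
For reference, the write-out path boils down to this (a sketch of the relevant
piece of the buffer write path; the set/clear macros are spelled here as
PageSetChecksum/PageResetVersion, and the local variable names are illustrative):

	if (page_checksums)
		PageSetChecksum(page);		/* sets the flag bits and pd_checksum16 */
	else
		PageResetVersion(page);		/* clears the checksum bits, restores the version */

	smgrwrite(reln, forknum, blocknum, (char *) page, false);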


>> > PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
>> > is insufficient.
>>
>> Am serialising this by only writing PageLSN while holding buf hdr lock.
>
> That means also taking the buffer header spinlock in every PageGetLSN() caller
> holding only a shared buffer content lock.  Do you think that will pay off,
> versus the settled pattern of trading here your shared buffer content lock for
> an exclusive one?

Yes, I think it will pay off. This is the only code location where we
do that, and we are already taking the buffer header lock, so there is
effectively no overhead.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Noah Misch
Date:
On Wed, Feb 08, 2012 at 11:40:34AM +0000, Simon Riggs wrote:
> On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch <noah@leadboat.com> wrote:
> > On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote:
> >> On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah@leadboat.com> wrote:
> >> > This patch uses FPIs to guard against torn hint writes, even when the
> >> > checksums are disabled.  One could not simply condition them on the
> >> > page_checksums setting, though.  Suppose page_checksums=off and we're hinting
> >> > a page previously written with page_checksums=on.  If that write tears,
> >> > leaving the checksum intact, that block will now fail verification.  A couple
> >> > of ideas come to mind.  (a) Read the on-disk page and emit an FPI only if the
> >> > old page had a checksum.  (b) Add a checksumEnableLSN field to pg_control.
> >> > Whenever the cluster starts with checksums disabled, set the field to
> >> > InvalidXLogRecPtr.  Whenever the cluster starts with checksums enabled and the
> >> > field is InvalidXLogRecPtr, set the field to the next LSN.  When a checksum
> >> > failure occurs in a page with LSN older than the stored one, emit either a
> >> > softer warning or no message at all.
> >>
> >> We can only change page_checksums at restart (now) so the above seems moot.
> >>
> >> If we crash with FPWs enabled we repair any torn pages.
> >
> > There's no live bug, but that comes at a high cost: the patch has us emit
> > full-page images for hint bit writes regardless of the page_checksums setting.
> 
> Sorry, I don't understand what you mean. I don't see any failure cases
> that require that.
> 
> page_checksums can only change at a shutdown checkpoint,
> 
> The statement "If that write tears,
> >> > leaving the checksum intact, that block will now fail verification."
> cannot happen, ISTM.
> 
> If we write out a block we update the checksum if page_checksums is
> set, or we clear it. If we experience a torn page at crash, the FPI
> corrects that, so the checksum never does fail verification. We only
> need to write a FPI when we write with checksums.
> 
> If that's wrong, please explain a failure case in detail.

In the following description, I will treat a heap page as having two facts.
The first is either CHKSUM or !CHKSUM, and the second is either HINT or !HINT.
A torn page write lacking the protection of a full-page image can update one
or both facts despite the transaction having logically updated both.  (This is
just the normal torn-page scenario.)  Given that, consider the following
sequence of events:

1. startup with page_checksums = on
2. update page P with facts CHKSUM, !HINT
3. clean write of P
4. clean shutdown

5. startup with page_checksums = off
6. update P with facts !CHKSUM, HINT.  no WAL since page_checksums=off
7. begin write of P
8. crash.  torn write of P only updates HINT.  disk state now CHKSUM, HINT

9. startup with page_checksums = off
10. crash recovery.  does not touch P, because step 6 generated no WAL
11. clean shutdown

12. startup with page_checksums = on
13. read P.  checksum failure

Again, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch,
because step 6 _does_ write WAL.  That ought to change for the sake of
page_checksums=off performance, at which point we must protect the above
scenario by other means.
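
To make option (b) concrete, the read-side reaction to a mismatch could be as
simple as this (a sketch only; GetChecksumEnableLSN() stands in for however we
would expose the proposed pg_control field, and checksum_failed/blocknum are
placeholders for the surrounding verification code):

	if (checksum_failed)
	{
		XLogRecPtr	enableLSN = GetChecksumEnableLSN();	/* hypothetical accessor */

		if (XLByteEQ(enableLSN, InvalidXLogRecPtr) ||
			XLByteLT(PageGetLSN(page), enableLSN))
			ereport(WARNING,
					(errmsg("page verification failed in block %u; page may predate checksum enablement",
							blocknum)));
		else
			ereport(ERROR,
					(errmsg("invalid page checksum in block %u", blocknum)));
	}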

> >> > PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
> >> > is insufficient.
> >>
> >> Am serialising this by only writing PageLSN while holding buf hdr lock.
> >
> > That means also taking the buffer header spinlock in every PageGetLSN() caller
> > holding only a shared buffer content lock.  Do you think that will pay off,
> > versus the settled pattern of trading here your shared buffer content lock for
> > an exclusive one?
> 
> Yes, I think it will pay off. This is the only code location where we
> do that, and we are already taking the buffer header lock, so there is
> effectively no overhead.

The sites I had in the mind are the calls to PageGetLSN() in FlushBuffer()
(via BufferGetLSN()) and XLogCheckBuffer().  We don't take the buffer header
spinlock at either of those locations.  If they remain safe under the new
rules, we'll need comments explaining why.  I think they may indeed be safe,
but it's far from obvious.

Thanks,
nm


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Wed, Feb 8, 2012 at 2:05 PM, Noah Misch <noah@leadboat.com> wrote:
> On Wed, Feb 08, 2012 at 11:40:34AM +0000, Simon Riggs wrote:
>> On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch <noah@leadboat.com> wrote:
>> > On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote:
>> >> On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah@leadboat.com> wrote:
>> >> > This patch uses FPIs to guard against torn hint writes, even when the
>> >> > checksums are disabled.  One could not simply condition them on the
>> >> > page_checksums setting, though.  Suppose page_checksums=off and we're hinting
>> >> > a page previously written with page_checksums=on.  If that write tears,
>> >> > leaving the checksum intact, that block will now fail verification.  A couple
>> >> > of ideas come to mind.  (a) Read the on-disk page and emit an FPI only if the
>> >> > old page had a checksum.  (b) Add a checksumEnableLSN field to pg_control.
>> >> > Whenever the cluster starts with checksums disabled, set the field to
>> >> > InvalidXLogRecPtr.  Whenever the cluster starts with checksums enabled and the
>> >> > field is InvalidXLogRecPtr, set the field to the next LSN.  When a checksum
>> >> > failure occurs in a page with LSN older than the stored one, emit either a
>> >> > softer warning or no message at all.
>> >>
>> >> We can only change page_checksums at restart (now) so the above seems moot.
>> >>
>> >> If we crash with FPWs enabled we repair any torn pages.
>> >
>> > There's no live bug, but that comes at a high cost: the patch has us emit
>> > full-page images for hint bit writes regardless of the page_checksums setting.
>>
>> Sorry, I don't understand what you mean. I don't see any failure cases
>> that require that.
>>
>> page_checksums can only change at a shutdown checkpoint,
>>
>> The statement "If that write tears,
>> >> > leaving the checksum intact, that block will now fail verification."
>> cannot happen, ISTM.
>>
>> If we write out a block we update the checksum if page_checksums is
>> set, or we clear it. If we experience a torn page at crash, the FPI
>> corrects that, so the checksum never does fail verification. We only
>> need to write a FPI when we write with checksums.
>>
>> If that's wrong, please explain a failure case in detail.
>
> In the following description, I will treat a heap page as having two facts.
> The first is either CHKSUM or !CHKSUM, and the second is either HINT or !HINT.
> A torn page write lacking the protection of a full-page image can update one
> or both facts despite the transaction having logically updated both.  (This is
> just the normal torn-page scenario.)  Given that, consider the following
> sequence of events:
>
> 1. startup with page_checksums = on
> 2. update page P with facts CHKSUM, !HINT
> 3. clean write of P
> 4. clean shutdown
>
> 5. startup with page_checksums = off
> 6. update P with facts !CHKSUM, HINT.  no WAL since page_checksums=off
> 7. begin write of P
> 8. crash.  torn write of P only updates HINT.  disk state now CHKSUM, HINT
>
> 9. startup with page_checksums = off
> 10. crash recovery.  does not touch P, because step 6 generated no WAL
> 11. clean shutdown
>
> 12. startup with page_checksums = on
> 13. read P.  checksum failure
>
> Again, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch,
> because step 6 _does_ write WAL.  That ought to change for the sake of
> page_checksums=off performance, at which point we must protect the above
> scenario by other means.

Thanks for explaining.

Logic fixed.

>> >> > PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
>> >> > is insufficient.
>> >>
>> >> Am serialising this by only writing PageLSN while holding buf hdr lock.
>> >
>> > That means also taking the buffer header spinlock in every PageGetLSN() caller
>> > holding only a shared buffer content lock.  Do you think that will pay off,
>> > versus the settled pattern of trading here your shared buffer content lock for
>> > an exclusive one?
>>
>> Yes, I think it will pay off. This is the only code location where we
>> do that, and we are already taking the buffer header lock, so there is
>> effectively no overhead.
>
> The sites I had in the mind are the calls to PageGetLSN() in FlushBuffer()
> (via BufferGetLSN()) and XLogCheckBuffer().  We don't take the buffer header
> spinlock at either of those locations.  If they remain safe under the new
> rules, we'll need comments explaining why.  I think they may indeed be safe,
> but it's far from obvious.

OK, some slight rework required to do that, no problems.

I checked all other call sites and have updated README to explain.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Thu, Feb 9, 2012 at 3:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, Feb 8, 2012 at 2:05 PM, Noah Misch <noah@leadboat.com> wrote:
>> On Wed, Feb 08, 2012 at 11:40:34AM +0000, Simon Riggs wrote:
>>> On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch <noah@leadboat.com> wrote:
>>> > On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote:
>>> >> On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah@leadboat.com> wrote:
>>> >> > This patch uses FPIs to guard against torn hint writes, even when the
>>> >> > checksums are disabled.  One could not simply condition them on the
>>> >> > page_checksums setting, though.  Suppose page_checksums=off and we're hinting
>>> >> > a page previously written with page_checksums=on.  If that write tears,
>>> >> > leaving the checksum intact, that block will now fail verification.  A couple
>>> >> > of ideas come to mind.  (a) Read the on-disk page and emit an FPI only if the
>>> >> > old page had a checksum.  (b) Add a checksumEnableLSN field to pg_control.
>>> >> > Whenever the cluster starts with checksums disabled, set the field to
>>> >> > InvalidXLogRecPtr.  Whenever the cluster starts with checksums enabled and the
>>> >> > field is InvalidXLogRecPtr, set the field to the next LSN.  When a checksum
>>> >> > failure occurs in a page with LSN older than the stored one, emit either a
>>> >> > softer warning or no message at all.
>>> >>
>>> >> We can only change page_checksums at restart (now) so the above seems moot.
>>> >>
>>> >> If we crash with FPWs enabled we repair any torn pages.
>>> >
>>> > There's no live bug, but that comes at a high cost: the patch has us emit
>>> > full-page images for hint bit writes regardless of the page_checksums setting.
>>>
>>> Sorry, I don't understand what you mean. I don't see any failure cases
>>> that require that.
>>>
>>> page_checksums can only change at a shutdown checkpoint,
>>>
>>> The statement "If that write tears,
>>> >> > leaving the checksum intact, that block will now fail verification."
>>> cannot happen, ISTM.
>>>
>>> If we write out a block we update the checksum if page_checksums is
>>> set, or we clear it. If we experience a torn page at crash, the FPI
>>> corrects that, so the checksum never does fail verification. We only
>>> need to write a FPI when we write with checksums.
>>>
>>> If that's wrong, please explain a failure case in detail.
>>
>> In the following description, I will treat a heap page as having two facts.
>> The first is either CHKSUM or !CHKSUM, and the second is either HINT or !HINT.
>> A torn page write lacking the protection of a full-page image can update one
>> or both facts despite the transaction having logically updated both.  (This is
>> just the normal torn-page scenario.)  Given that, consider the following
>> sequence of events:
>>
>> 1. startup with page_checksums = on
>> 2. update page P with facts CHKSUM, !HINT
>> 3. clean write of P
>> 4. clean shutdown
>>
>> 5. startup with page_checksums = off
>> 6. update P with facts !CHKSUM, HINT.  no WAL since page_checksums=off
>> 7. begin write of P
>> 8. crash.  torn write of P only updates HINT.  disk state now CHKSUM, HINT
>>
>> 9. startup with page_checksums = off
>> 10. crash recovery.  does not touch P, because step 6 generated no WAL
>> 11. clean shutdown
>>
>> 12. startup with page_checksums = on
>> 13. read P.  checksum failure
>>
>> Again, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch,
>> because step 6 _does_ write WAL.  That ought to change for the sake of
>> page_checksums=off performance, at which point we must protect the above
>> scenario by other means.
>
> Thanks for explaining.
>
> Logic fixed.
>
>>> >> > PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
>>> >> > is insufficient.
>>> >>
>>> >> Am serialising this by only writing PageLSN while holding buf hdr lock.
>>> >
>>> > That means also taking the buffer header spinlock in every PageGetLSN() caller
>>> > holding only a shared buffer content lock.  Do you think that will pay off,
>>> > versus the settled pattern of trading here your shared buffer content lock for
>>> > an exclusive one?
>>>
>>> Yes, I think it will pay off. This is the only code location where we
>>> do that, and we are already taking the buffer header lock, so there is
>>> effectively no overhead.
>>
>> The sites I had in the mind are the calls to PageGetLSN() in FlushBuffer()
>> (via BufferGetLSN()) and XLogCheckBuffer().  We don't take the buffer header
>> spinlock at either of those locations.  If they remain safe under the new
>> rules, we'll need comments explaining why.  I think they may indeed be safe,
>> but it's far from obvious.
>
> OK, some slight rework required to do that, no problems.
>
> I checked all other call sites and have updated README to explain.

v8 attached

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: 16-bit page checksums for 9.2

From
Albert Cervera i Areny
Date:
On Thursday, 16 February 2012 12:16:31, Simon Riggs wrote:
> v8 attached

Maybe the docs should include what will happen if the checksum is not correct?

--
Albert Cervera i Areny
http://www.NaN-tic.com
Tel: +34 93 553 18 03

http://twitter.com/albertnan
http://www.nan-tic.com/blog

Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Thu, Feb 16, 2012 at 6:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> v8 attached

It's hard to believe that this version has been tested terribly
thoroughly, because it doesn't compile.

+     LockBufHdr(buf);
+
+     /*
+      * Run PageGetLSN while holding header lock.
+      */
+     recptr = BufferGetLSN(buf);
+
+     /* To check if block content changes while flushing. - vadim 01/17/97 */
+     buf->flags &= ~BM_JUST_DIRTIED;
+     UnlockBufHdr(buf);
+

This doesn't seem very helpful.  It's obvious even without the comment
that we're running PageGetLSN while holding the header lock.  What's
not obvious, and what the comment should be explaining, is why we're
doing that.
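
Something along these lines would actually tell the reader why (my wording,
purely to illustrate the point):

	/*
	 * Read the page LSN while holding the buffer header spinlock.  With
	 * checksums enabled, a backend holding only a shared content lock can
	 * advance the page LSN when it WAL-logs a hint, and the LSN is two
	 * 32-bit words, so an unlocked read here could see a torn value and
	 * let us flush the page ahead of the WAL that covers it.
	 */
	LockBufHdr(buf);
	recptr = BufferGetLSN(buf);
	buf->flags &= ~BM_JUST_DIRTIED;
	UnlockBufHdr(buf);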

+             /*
+              * If we're in recovery we cannot dirty a page because of a hint.
+              * We can set the hint, just not dirty the page as a result so
+              * the hint is lost when we evict the page or shutdown.
+              *
+              * See long discussion in bufpage.c
+              */
+             if (RecoveryInProgress())
+                 return;

Doesn't this seem awfully bad for performance on Hot Standby servers?
I agree that it fixes the problem with un-WAL-logged pages there, but
I seem to recall some recent complaining about performance features
that work on the master but not the standby.  Durable hint bits are
one such feature.

+              * Basically, we simply prevent the checkpoint WAL record from
+              * being written until we have marked the buffer dirty. We don't
+              * start the checkpoint flush until we have marked dirty, so our
+              * checkpoint must flush the change to disk successfully or the
+              * checkpoint never gets written, so crash recovery will fix.
+              *
+              * It's possible we may enter here without an xid, so it is
+              * essential that CreateCheckpoint waits for virtual transactions
+              * rather than full transactionids.
+              */
+             MyPgXact->delayChkpt = delayChkpt = true;

I am slightly worried that this expansion in the use of this mechanism
(formerly called inCommit, for those following along at home) could
lead to checkpoint starvation.  Suppose we've got one or two large
table scans wandering along, setting hint bits, and now suddenly it's
time to checkpoint.  How long will it take the checkpoint process to
find a time when nobody's got delayChkpt set?

+ #define PageSetChecksum(page) \
+ do \
+ { \
+     PageHeader    p = (PageHeader) page; \
+     p->pd_flags |= PD_PAGE_VERSION_PLUS1; \
+     p->pd_flags |= PD_CHECKSUM1; \
+     p->pd_flags &= ~PD_CHECKSUM2; \
+     p->pd_verify.pd_checksum16 = PageCalcChecksum16(page); \
+ } while (0);
+
+ /* ensure any older checksum info is overwritten with watermark */
+ #define PageResetVersion(page) \
+ do \
+ { \
+     PageHeader    p = (PageHeader) page; \
+     if (!PageHasNoChecksum(p)) \
+     { \
+         p->pd_flags &= ~PD_PAGE_VERSION_PLUS1; \
+         p->pd_flags &= ~PD_CHECKSUM1; \
+         p->pd_flags &= ~PD_CHECKSUM2; \
+         PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION); \
+     } \
+ } while (0);

So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
doesn't have a checksum, PD_CHECKSUM2 is not set?  What good does that
do?
  * PageGetPageSize
  *		Returns the page size of a page.
  *
!  * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ.
!  * This can be called on any page, initialised or not, in or out of buffers.
!  * You might think this can vary at runtime but you'd be wrong, since pages
!  * frequently need to occupy buffers and pages are copied from one to another
!  * so there are many hidden assumptions that this simple definition is true.
  */
! #define PageGetPageSize(page) (BLCKSZ)

I think I agree that the current definition of PageGetPageSize seems
unlikely to come anywhere close to allowing us to cope with multiple
page sizes, but I think this method of disabling it is a hack.  The
callers that want to know how big the page really is should just use
BLCKSZ instead of this macro, and those that want to know how big the
page THINKS it is (notably contrib/pageinspect) need a way to get that
information.
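
For instance (name purely illustrative), pageinspect could keep the old
header-reading definition under a different name, while everyone else just
says BLCKSZ:

	/* how big the page CLAIMS to be, for inspection tools only */
	#define PageGetRawPageSize(page) \
		((Size) (((PageHeader) (page))->pd_pagesize_version & (uint16) 0xFF00))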

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Thu, Feb 16, 2012 at 1:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 16, 2012 at 6:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> v8 attached
>
> It's hard to believe that this version has been tested terribly
> thoroughly, because it doesn't compile.

I'm just back home from a few days off grid.

It's possible it doesn't compile against current HEAD, though it
certainly does compile and work against my last git pull.

I will look into your comments in detail tomorrow morning. Thank you
for looking at the patch.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Fri, Feb 17, 2012 at 10:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, Feb 16, 2012 at 1:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Feb 16, 2012 at 6:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> v8 attached
>>
>> It's hard to believe that this version has been tested terribly
>> thoroughly, because it doesn't compile.
>
> I'm just back home from a few days off grid.
>
> It's possible it doesn't compile against current HEAD, though it
> certainly does compile and work against my last git pull.
>
> I will look into your comments in detail tomorrow morning. Thank you
> for looking at the patch.

It does appear there is a chunk of code introduced somewhere between
my testing and sending the patch, so you're right. Looking at the
chunk below it looks like something hit some keys, adding a "]" and
CR, which are adjacent on my keyboard. Trying to develop with 3 kids
and a dog around isn't easy, I guess.

With this chunk removed we're back to the version I've tested - I
think - will carry on checking.

@@ -465,7 +466,9 @@ ReadBuffer_common(SMgrRelation smgr, char
relpersistence, ForkNumber forkNum,
        {
                /* Only need to adjust flags */
                bufHdr->flags |= BM_VALID;
-       }
+       [
+
+}
        else
        {
                /* Set BM_VALID, terminate IO, and wake up any waiters */

For testing, I've been running the server with shared_buffers = 16, to
force blocks in and out of cache frequently to stress test the
checksumming.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Thu, Feb 16, 2012 at 1:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> +                       /*
> +                        * If we're in recovery we cannot dirty a page because of a hint.
> +                        * We can set the hint, just not dirty the page as a result so
> +                        * the hint is lost when we evict the page or shutdown.
> +                        *
> +                        * See long discussion in bufpage.c
> +                        */
> +                       if (RecoveryInProgress())
> +                               return;
>
> Doesn't this seem awfully bad for performance on Hot Standby servers?
> I agree that it fixes the problem with un-WAL-logged pages there, but
> I seem to recall some recent complaining about performance features
> that work on the master but not the standby.  Durable hint bits are
> one such feature.

It's impossible for it to work, in this case, since we cannot write
new WAL to prevent torn pages.

Note that hint bit setting on a dirty block is allowed, so many hints
will still be set in Hot Standby.

> +                        * Basically, we simply prevent the checkpoint WAL record from
> +                        * being written until we have marked the buffer dirty. We don't
> +                        * start the checkpoint flush until we have marked dirty, so our
> +                        * checkpoint must flush the change to disk successfully or the
> +                        * checkpoint never gets written, so crash recovery will fix.
> +                        *
> +                        * It's possible we may enter here without an xid, so it is
> +                        * essential that CreateCheckpoint waits for virtual transactions
> +                        * rather than full transactionids.
> +                        */
> +                       MyPgXact->delayChkpt = delayChkpt = true;
>
> I am slightly worried that this expansion in the use of this mechanism
> (formerly called inCommit, for those following along at home) could
> lead to checkpoint starvation.  Suppose we've got one or two large
> table scans wandering along, setting hint bits, and now suddenly it's
> time to checkpoint.  How long will it take the checkpoint process to
> find a time when nobody's got delayChkpt set?

We don't need to wait until nobody has it set, we just need to wait
for the people that had it set when we first checked to be out of that
state momentarily.
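
In other words the checkpointer does roughly this (a sketch; it assumes
virtual-xid variants of the existing GetTransactionsInCommit() /
HaveTransactionsInCommit() helpers, which the patch needs anyway since
hint-setters may have no xid):

	VirtualTransactionId *vxids;
	int			nvxids;

	/* snapshot of the backends delaying the checkpoint right now */
	vxids = GetVirtualXIDsDelayingChkpt(&nvxids);

	/* wait for just those; anyone who sets the flag later is ignored */
	if (nvxids > 0)
	{
		do
		{
			pg_usleep(10000L);	/* 10ms */
		} while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
	}
	pfree(vxids);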

> + #define PageSetChecksum(page) \
> + do \
> + { \
> +       PageHeader      p = (PageHeader) page; \
> +       p->pd_flags |= PD_PAGE_VERSION_PLUS1; \
> +       p->pd_flags |= PD_CHECKSUM1; \
> +       p->pd_flags &= ~PD_CHECKSUM2; \
> +       p->pd_verify.pd_checksum16 = PageCalcChecksum16(page); \
> + } while (0);
> +
> + /* ensure any older checksum info is overwritten with watermark */
> + #define PageResetVersion(page) \
> + do \
> + { \
> +       PageHeader      p = (PageHeader) page; \
> +       if (!PageHasNoChecksum(p)) \
> +       { \
> +               p->pd_flags &= ~PD_PAGE_VERSION_PLUS1; \
> +               p->pd_flags &= ~PD_CHECKSUM1; \
> +               p->pd_flags &= ~PD_CHECKSUM2; \
> +               PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION); \
> +       } \
> + } while (0);
>
> So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
> doesn't have a checksum, PD_CHECKSUM2 is not set?  What good does that
> do?

As explained in detailed comments, the purpose of this is to implement
Heikki's suggestion that we have a bit set to zero so we can detect
failures that cause a run of 1s.

>   * PageGetPageSize
>   *            Returns the page size of a page.
>   *
> !  * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ.
> !  * This can be called on any page, initialised or not, in or out of buffers.
> !  * You might think this can vary at runtime but you'd be wrong, since pages
> !  * frequently need to occupy buffers and pages are copied from one to another
> !  * so there are many hidden assumptions that this simple definition is true.
>   */
> ! #define PageGetPageSize(page) (BLCKSZ)
>
> I think I agree that the current definition of PageGetPageSize seems
> unlikely to come anywhere close to allowing us to cope with multiple
> page sizes, but I think this method of disabling it is a hack.  The
> callers that want to know how big the page really is should just use
> BLCKSZ instead of this macro, and those that want to know how big the
> page THINKS it is (notably contrib/pageinspect) need a way to get that
> information.

Fair comment.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Sun, Feb 19, 2012 at 4:35 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

> We don't need to wait until nobody has it set, we just need to wait
> for the people that had it set when we first checked to be out of that
> state momentarily.

I've just finished doing some performance analysis on various aspects
of hint bit setting for this patch.

I've seen as high as 14% of transactions writing full pages during a
pgbench run. That sounds quite bad, but on pgbench at least all of
those are associated with UPDATEs, which would dirty the page anyway,
so there aren't any more full page writes overall.

Checkpoints would be delayed only until a virtual transaction ends or
a virtual transaction comes out of DelayCkpt state. If a virtual
transaction was long running it wouldn't spend much time in the
delaying state, especially if we take into account I/O requirements.

So although I'm not exactly happy with the overheads, they don't seem
to be as big a problem as they sound. Plus this is all optional and
avoidable.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

> v8 attached

v10 attached.

This patch covers all the valid concerns discussed and has been
extensively tested.

I expect to commit this in about 24 hours, so last call for comments
please or say if you need more time to review.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Sun, Feb 19, 2012 at 11:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Doesn't this seem awfully bad for performance on Hot Standby servers?
>> I agree that it fixes the problem with un-WAL-logged pages there, but
>> I seem to recall some recent complaining about performance features
>> that work on the master but not the standby.  Durable hint bits are
>> one such feature.
>
> It's impossible for it to work, in this case, since we cannot write
> new WAL to prevent torn pages.
>
> Note that hint bit setting on a dirty block is allowed, so many hints
> will still be set in Hot Standby.

To me, it seems that you are applying a double standard.  You have
twice attempted to insist that I do extra work to make major features
that I worked on - unlogged tables and index-only scans - work in Hot
Standby mode, despite the existence of significant technological
obstacles.  But when it comes to your own feature, you simply state
that it cannot be done, and therefore we need not do it.   Of course,
this feature, like those, CAN be made to work.  It just involves
solving difficult problems that have little to do with the primary
purpose of the patch.  To be honest, I don't use Hot Standby enough to
care very much about this particular issue, and I'm not saying we
should reject it on these grounds.  But I do think it's a mistake to
dismiss it entirely, since every limitation of Hot Standby narrows the
set of cases in which it can be deployed.  And at any rate, I want the
same standard applied to my work as to yours.

>> I am slightly worried that this expansion in the use of this mechanism
>> (formerly called inCommit, for those following along at home) could
>> lead to checkpoint starvation.  Suppose we've got one or two large
>> table scans wandering along, setting hint bits, and now suddenly it's
>> time to checkpoint.  How long will it take the checkpoint process to
>> find a time when nobody's got delayChkpt set?
>
> We don't need to wait until nobody has it set, we just need to wait
> for the people that had it set when we first checked to be out of that
> state momentarily.

Ah... good point.

>> So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
>> doesn't have a checksum, PD_CHECKSUM2 is not set?  What good does that
>> do?
>
> As explained in detailed comments, the purpose of this is to implement
> Heikki's suggestion that we have a bit set to zero so we can detect
> failures that cause a run of 1s.

I think it's nonsensical to pretend that there's anything special
about that particular bit.  If we want to validate the page header
before trusting the lack of a checksum bit, we can do that far more
thoroughly than just checking that one bit.  There are a whole bunch
of bits that ought to always be zero, and there are other things we
can validate as well (e.g. LSN not in future).  If we're concerned
about the checksum-enabled bit getting flipped (and I agree that we
should be), we can check some combination of that stuff in the hope of
catching it, and that'll be a far better guard than just checking one
arbitrarily selected bit.
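
Something like this would be a far stronger backstop than any single bit
(illustrative only; I'm not proposing these exact checks):

	static bool
	PageHeaderLooksSane(PageHeader p)
	{
		/* no flag bits outside the defined set */
		if ((p->pd_flags & ~PD_VALID_FLAG_BITS) != 0)
			return false;
		/* pointers must be ordered and within the block */
		if (p->pd_lower < SizeOfPageHeaderData ||
			p->pd_lower > p->pd_upper ||
			p->pd_upper > p->pd_special ||
			p->pd_special > BLCKSZ)
			return false;
		/* the page LSN must not be in the future */
		if (XLByteLT(GetInsertRecPtr(), PageGetLSN(p)))
			return false;
		return true;
	}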

That having been said, I don't feel very good about the idea of
relying on the contents of the page to tell us whether or not the page
has a checksum.  There's no guarantee that an error that flips the
has-checksum bit will flip any other bit on the page, or that it won't
flip everything else we're relying on as a backstop in exactly the
manner that foils whatever algorithm we put in place.  Random
corruption is, perhaps, unlikely to do that, but somehow I feel like
it'll happen more often than random chance suggests.  Things never
fail the way you want them to.

Another disadvantage of the current scheme is that there's no
particularly easy way to know that your whole cluster has checksums.
No matter how we implement checksums, you'll have to rewrite every
table in the cluster in order to get them fully turned on.  But with
the current design, there's no easy way to know how much of the
cluster is actually checksummed.  If you shut checksums off, they'll
linger until those pages are rewritten, and there's no easy way to
find the relations from which they need to be removed, either.

I'm tempted to suggest a relation-level switch: when you want
checksums, you use ALTER TABLE to turn them on, and when you don't
want them any more you use ALTER TABLE to shut them off again, in each
case rewriting the table.  That way, there's never any ambiguity about
what's in the data pages in a given relation: either they're either
all checksummed, or none of them are.  This moves the decision about
whether checksums are enabled or disabled quite a bit further away
from the data itself, and also allows you to determine (by catalog
inspection) which parts of the cluster do in fact have checksums.  It
might be kind of a pain to implement, though: you'd have to pass the
information about how any given relation was configured down to the
place where we validate page sanity.  I'm not sure whether that's
practical.

I also think that the question of where exactly the checksum ought to
be put might bear some more thought.  Tom expressed some concern about
stealing the page version field, and it seems to me we could work
around that by instead stealing something less valuable.  The top
eight bits of the pd_pagesize_version field probably meet that
criteria, since in practice basically everybody uses 8K blocks, and
the number of errors that will be caught by checksums is probably much
larger than the number of errors that will be caught by the page-size
cross-check.  But maybe the other 8 bits should come from somewhere
else, maybe pd_tli or pd_flags.  For almost all practical purposes,
storing only the low-order 8 bits of the TLI ought to provide just as
much of a safety check as storing the low-order 16 bits, so I think
that might be the way to go.  We could set it up so that we always
check only the low 8 bits against the TLI, regardless of whether
checksums are enabled; then we basically free up that bit space at no
cost in code complexity.
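
Concretely, I'm imagining something along these lines (macro name and exact
layout are illustrative, not a worked-out proposal):

	/*
	 * High byte of the checksum goes where the page-size byte used to live;
	 * low byte takes the high byte of pd_tli, keeping the low 8 bits of the
	 * TLI for the existing cross-check.
	 */
	#define PageSetChecksum16(page, sum, tli) \
	do { \
		PageHeader	ph = (PageHeader) (page); \
		ph->pd_pagesize_version = (uint16) (((sum) & 0xFF00) | PG_PAGE_LAYOUT_VERSION); \
		ph->pd_tli = (uint16) ((((sum) & 0x00FF) << 8) | ((tli) & 0x00FF)); \
	} while (0)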

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Sun, Feb 19, 2012 at 10:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> To me, it seems that you are applying a double standard.  You have
> twice attempted to insist that I do extra work to make major features
> that I worked on - unlogged tables and index-only scans - work in Hot
> Standby mode, despite the existence of significant technological
> obstacles.  But when it comes to your own feature, you simply state
> that it cannot be done, and therefore we need not do it.   Of course,
> this feature, like those, CAN be made to work.

Vitriol aside, If you would be so kind as to explain how it is
possible, as you claim, I'll look into making it work.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Sun, Feb 19, 2012 at 10:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

>> As explained in detailed comments, the purpose of this is to implement
>> Heikki's suggestion that we have a bit set to zero so we can detect
>> failures that cause a run of 1s.
>
> I think it's nonsensical to pretend that there's anything special
> about that particular bit.  If we want to validate the page header
> before trusting the lack of a checksum bit, we can do that far more
> thoroughly than just checking that one bit.  There are a whole bunch
> of bits that ought to always be zero, and there are other things we
> can validate as well (e.g. LSN not in future).  If we're concerned
> about the checksum-enabled bit getting flipped (and I agree that we
> should be), we can check some combination of that stuff in the hope of
> catching it, and that'll be a far better guard than just checking one
> arbitrarily selected bit.

I thought it was a reasonable and practical idea from Heikki. The bit
is not selected arbitrarily, it is by design adjacent to one of the
other bits. So overall, 3 bits need to be set to a precise value and a
run of 1s or 0s will throw an error.
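
In code terms, the "page carries a checksum" test is a three-bit pattern
rather than a single flag, so an all-ones or all-zeros flag word fails it
(sketch; PageHasChecksum is just my shorthand here, the patch spells it via
PageHasNoChecksum):

	/* PD_PAGE_VERSION_PLUS1 and PD_CHECKSUM1 must be set, PD_CHECKSUM2 clear */
	#define PageHasChecksum(page) \
		((((PageHeader) (page))->pd_flags & \
		  (PD_PAGE_VERSION_PLUS1 | PD_CHECKSUM1 | PD_CHECKSUM2)) == \
		 (PD_PAGE_VERSION_PLUS1 | PD_CHECKSUM1))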

> That having been said, I don't feel very good about the idea of
> relying on the contents of the page to tell us whether or not the page
> has a checksum.  There's no guarantee that an error that flips the
> has-checksum bit will flip any other bit on the page, or that it won't
> flip everything else we're relying on as a backstop in exactly the
> manner that foils whatever algorithm we put in place.  Random
> corruption is, perhaps, unlikely to do that, but somehow I feel like
> it'll happen more often than random chance suggests.  Things never
> fail the way you want them to.

You're right. This patch is not the best possible world, given a clean
slate. But we don't have a clean slate.

The fact is this patch will detect corruptions pretty well and that's
what Postgres users want.

While developing this, many obstacles could have been blockers. I
think we're fairly lucky that I managed to find a way through the
minefield of obstacles.

> Another disadvantage of the current scheme is that there's no
> particularly easy way to know that your whole cluster has checksums.
> No matter how we implement checksums, you'll have to rewrite every
> table in the cluster in order to get them fully turned on.  But with
> the current design, there's no easy way to know how much of the
> cluster is actually checksummed.

You can read every block and check.

> If you shut checksums off, they'll
> linger until those pages are rewritten, and there's no easy way to
> find the relations from which they need to be removed, either.

We can't have it both ways. Either we have an easy upgrade, or we
don't. I'm told that an easy upgrade is essential.

> I'm tempted to suggest a relation-level switch: when you want
> checksums, you use ALTER TABLE to turn them on, and when you don't
> want them any more you use ALTER TABLE to shut them off again, in each
> case rewriting the table.  That way, there's never any ambiguity about
> what's in the data pages in a given relation: either they're either
> all checksummed, or none of them are.  This moves the decision about
> whether checksums are enabled or disabled quite a but further away
> from the data itself, and also allows you to determine (by catalog
> inspection) which parts of the cluster do in fact have checksums.  It
> might be kind of a pain to implement, though: you'd have to pass the
> information about how any given relation was configured down to the
> place where we validate page sanity.  I'm not sure whether that's
> practical.

It's not practical as the only mechanism, given the requirement for upgrade.

As I mention in the docs, if you want that, use VACUUM FULL. So there
is a mechanism if you want it.

> I also think that the question of where exactly the checksum ought to
> be put might bear some more thought.  Tom expressed some concern about
> stealing the page version field, and it seems to me we could work
> around that by instead stealing something less valuable.  The top
> eight bits of the pd_pagesize_version field probably meet that
> criteria, since in practice basically everybody uses 8K blocks, and
> the number of errors that will be caught by checksums is probably much
> larger than the number of errors that will be caught by the page-size
> cross-check.  But maybe the other 8 bits should come from somewhere
> else, maybe pd_tli or pd_flags.  For almost all practical purposes,
> storing only the low-order 8 bits of the TLI ought to provide just as
> much of a safety check as storing the low-order 16 bits, so I think
> that might be the way to go.  We could set it up so that we always
> check only the low 8 bits against the TLI, regardless of whether
> checksums are enabled; then we basically free up that bit space at no
> cost in code complexity.

The version stuff is catered for by the current patch.

TLI isn't something I want to mess with. It does actually mean
something, unlike the page version field which carries no appreciable
information.

People want checksums. This is the best way I've come up with of
giving it to them, given all of the constraints and requests imposed
upon me. I have no doubt that a later release will redesign the page
format and do a better job, while at the same time allowing an online
upgrade mechanism to allow block changes. But that is at least 1 year
away and experience shows that is often much longer than that. I'll
buy you a $DRINK if that happens in the next release.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Sun, Feb 19, 2012 at 6:33 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sun, Feb 19, 2012 at 10:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> To me, it seems that you are applying a double standard.  You have
>> twice attempted to insist that I do extra work to make major features
>> that I worked on - unlogged tables and index-only scans - work in Hot
>> Standby mode, despite the existence of significant technological
>> obstacles.  But when it comes to your own feature, you simply state
>> that it cannot be done, and therefore we need not do it.   Of course,
>> this feature, like those, CAN be made to work.
>
> Vitriol aside, If you would be so kind as to explain how it is
> possible, as you claim, I'll look into making it work.

It would require a double-write buffer of some kind.
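
Roughly: before writing a dirty page in place, first write the image to a
small dedicated double-write file and fsync that, then write the page to its
home location; after a crash, any page whose home copy fails its checksum can
be repaired from the double-write file.  A sketch, with all the double-write
names being illustrative rather than anything that exists today:

	/* 1. make a durable copy of the page image first */
	slot = DoubleWriteReserveSlot();			/* illustrative */
	pwrite(dw_fd, bufBlock, BLCKSZ, (off_t) slot * BLCKSZ);
	pg_fsync(dw_fd);

	/* 2. only now write the page to its home location */
	smgrwrite(reln, forkNum, blockNum, (char *) bufBlock, false);

	/*
	 * Recovery: for each slot in the double-write file, if the home copy
	 * fails its checksum and the slot copy passes, restore from the slot.
	 */

That sidesteps the need to write new WAL, which is what blocks this on a
standby as the patch stands.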

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Sun, Feb 19, 2012 at 6:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I thought it was a reasonable and practical idea from Heikki. The bit
> is not selected arbitrarily, it is by design adjacent to one of the
> other bits. So overall, 3 bits need to be set to a precise value and a
> run of 1s or 0s will throw an error.

Sure, but who is to say that a typical error will look anything like
that?  Anyway, you could check even more bits just as easily; we know
exactly which ones have a plausible reason for being non-zero.

>> That having been said, I don't feel very good about the idea of
>> relying on the contents of the page to tell us whether or not the page
>> has a checksum.  There's no guarantee that an error that flips the
>> has-checksum bit will flip any other bit on the page, or that it won't
>> flip everything else we're relying on as a backstop in exactly the
>> manner that foils whatever algorithm we put in place.  Random
>> corruption is, perhaps, unlikely to do that, but somehow I feel like
>> it'll happen more often than random chance suggests.  Things never
>> fail the way you want them to.
>
> You're right. This patch is not the best possible world, given a clean
> slate. But we don't have a clean slate.
>
> The fact is this patch will detect corruptions pretty well and that's
> what Postgres users want.
>
> While developing this, many obstacles could have been blockers. I
> think we're fairly lucky that I managed to find a way through the
> minefield of obstacles.

I think we could do worse than this patch, but I don't really believe
it's ready for commit.  We don't have a single performance number
showing how much of a performance regression this causes, either on
the master or on the standby, on any workload, much less those where a
problem is reasonably foreseeable from the design you've chosen.  Many
controversial aspects of the patch, such as the way you're using the
buffer header spinlocks as a surrogate for x-locking the buffer, or
the right way of obtaining the bit-space the patch requires, haven't
really been discussed, and to the extent that they have been
discussed, they have not been agreed.

On the former point, you haven't updated
src/backend/storage/buffer/README at all; but updating is not by
itself sufficient.  Before we change the buffer-locking rules, we
ought to have some discussion of whether it's OK to do that, and why
it's necessary.  "Why it's necessary" would presumably include
demonstrating that the performance of the straightforward
implementation stinks, and that with this change is better.  You
haven't made any effort to do that whatsoever, or if you have, you
haven't posted the results here.  I'm pretty un-excited by that
change, first because I think it's a modularity violation and possibly
unsafe, and second because I believe we've already got a problem with
buffer header spinlock contention which this might exacerbate.

>> Another disadvantage of the current scheme is that there's no
>> particularly easy way to know that your whole cluster has checksums.
>> No matter how we implement checksums, you'll have to rewrite every
>> table in the cluster in order to get them fully turned on.  But with
>> the current design, there's no easy way to know how much of the
>> cluster is actually checksummed.
>
> You can read every block and check.

Using what tool?

>> If you shut checksums off, they'll
>> linger until those pages are rewritten, and there's no easy way to
>> find the relations from which they need to be removed, either.
>
> We can't have it both ways. Either we have an easy upgrade, or we
> don't. I'm told that an easy upgrade is essential.

Easy upgrade and removal of checksums are unrelated issues AFAICS.

>> I'm tempted to suggest a relation-level switch: when you want
>> checksums, you use ALTER TABLE to turn them on, and when you don't
>> want them any more you use ALTER TABLE to shut them off again, in each
>> case rewriting the table.  That way, there's never any ambiguity about
>> what's in the data pages in a given relation: either they're either
>> all checksummed, or none of them are.  This moves the decision about
>> whether checksums are enabled or disabled quite a bit further away
>> from the data itself, and also allows you to determine (by catalog
>> inspection) which parts of the cluster do in fact have checksums.  It
>> might be kind of a pain to implement, though: you'd have to pass the
>> information about how any given relation was configured down to the
>> place where we validate page sanity.  I'm not sure whether that's
>> practical.
>
> It's not practical as the only mechanism, given the requirement for upgrade.

Why not?

> The version stuff is catered for by the current patch.

Your patch reuses the version number field for an unrelated purpose
and includes some vague hints of how we might do versioning using some
of the page-level flag bits.  Since bit-space is fungible, I think it
makes more sense to keep the version number where it is and carve the
checksum out of whatever bit space you would have moved the version
number into.  Whether that's pd_tli or pd_flags is somewhat secondary;
the point is that you can certainly arrange to leave the 8 bits that
currently contain the version number as they are.
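
To make that concrete, here is roughly the layout I have in mind, with the
checksum simply taking over the pd_tli bytes and the version byte left alone
in pd_pagesize_version.  This is only an illustration of the suggestion, not
code from the patch; pd_checksum is an invented name and the other fields are
the existing PageHeaderData fields:

typedef struct PageHeaderData
{
    XLogRecPtr    pd_lsn;              /* LSN of last change to this page */
    uint16        pd_checksum;         /* was pd_tli; 16-bit page checksum */
    uint16        pd_flags;            /* flag bits */
    LocationIndex pd_lower;            /* offset to start of free space */
    LocationIndex pd_upper;            /* offset to end of free space */
    LocationIndex pd_special;          /* offset to start of special space */
    uint16        pd_pagesize_version; /* page size and layout version, unchanged */
    TransactionId pd_prune_xid;        /* oldest prunable XID, or zero */
    ItemIdData    pd_linp[1];          /* line pointer array */
} PageHeaderData;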

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Mon, Feb 20, 2012 at 12:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> I think we could do worse than this patch, but I don't really believe
> it's ready for commit.  We don't have a single performance number
> showing how much of a performance regression this causes, either on
> the master or on the standby, on any workload, much less those where a
> problem is reasonably foreseeable from the design you've chosen.  Many
> controversial aspects of the patch, such as the way you're using the
> buffer header spinlocks as a surrogate for x-locking the buffer, or
> the right way of obtaining the bit-space the patch requires, haven't
> really been discussed, and to the extent that they have been
> discussed, they have not been agreed.

These points are being discussed here. If you have rational
objections, say what they are.

> On the former point, you haven't updated
> src/backend/storage/buffer/README at all;

I've updated the appropriate README file, the one relating to page LSNs, to explain this.

> but updating is not by
> itself sufficient.  Before we change the buffer-locking rules, we
> ought to have some discussion of whether it's OK to do that, and why
> it's necessary.  "Why it's necessary" would presumably include
> demonstrating that the performance of the straightforward
> implementation stinks, and that with this change is better.

What straightforward implementation is that?? This *is* the straightforward one.

God knows what else we'd break if we drop the lock, reacquire as an
exclusive, then drop it again and reacquire in shared mode. Code tends
to assume that when you take a lock you hold it until you release;
doing otherwise would allow all manner of race conditions to emerge.

And do you really think that would be faster?

> You
> haven't made any effort to do that whatsoever, or if you have, you
> haven't posted the results here.  I'm pretty un-excited by that
> change, first because I think it's a modularity violation and possibly
> unsafe, and second because I believe we've already got a problem with
> buffer header spinlock contention which this might exacerbate.
>
>>> Another disadvantage of the current scheme is that there's no
>>> particularly easy way to know that your whole cluster has checksums.
>>> No matter how we implement checksums, you'll have to rewrite every
>>> table in the cluster in order to get them fully turned on.  But with
>>> the current design, there's no easy way to know how much of the
>>> cluster is actually checksummed.
>>
>> You can read every block and check.
>
> Using what tool?

Pageinspect?

>>> If you shut checksums off, they'll
>>> linger until those pages are rewritten, and there's no easy way to
>>> find the relations from which they need to be removed, either.
>>
>> We can't have it both ways. Either we have an easy upgrade, or we
>> don't. I'm told that an easy upgrade is essential.
>
> Easy upgrade and removal of checksums are unrelated issues AFAICS.
>
>>> I'm tempted to suggest a relation-level switch: when you want
>>> checksums, you use ALTER TABLE to turn them on, and when you don't
>>> want them any more you use ALTER TABLE to shut them off again, in each
>>> case rewriting the table.  That way, there's never any ambiguity about
>>> what's in the data pages in a given relation: either they're either
>>> all checksummed, or none of them are.  This moves the decision about
>>> whether checksums are enabled or disabled quite a bit further away
>>> from the data itself, and also allows you to determine (by catalog
>>> inspection) which parts of the cluster do in fact have checksums.  It
>>> might be kind of a pain to implement, though: you'd have to pass the
>>> information about how any given relation was configured down to the
>>> place where we validate page sanity.  I'm not sure whether that's
>>> practical.
>>
>> It's not practical as the only mechanism, given the requirement for upgrade.
>
> Why not?
>
>> The version stuff is catered for by the current patch.
>
> Your patch reuses the version number field for an unrelated purpose
> and includes some vague hints of how we might do versioning using some
> of the page-level flag bits.  Since bit-space is fungible, I think it
> makes more sense to keep the version number where it is and carve the
> checksum out of whatever bit space you would have moved the version
> number into.  Whether that's pd_tli or pd_flags is somewhat secondary;
> the point is that you can certainly arrange to leave the 8 bits that
> currently contain the version number as they are.

If you've got some rational objections, please raise them.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Mon, Feb 20, 2012 at 4:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> What straightforward implementation is that?? This *is* the straightforward one.
>
> God knows what else we'd break if we drop the lock, reacquire as an
> exclusive, then drop it again and reacquire in shared mode. Code tends
> to assume that when you take a lock you hold it until you release;
> doing otherwise would allow all manner of race conditions to emerge.
>
> And do you really think that would be faster?

I don't know, but neither do you, because you apparently haven't tried
it.   Games where we drop the shared lock and get an exclusive lock
are used in numerous places in the source code; see, for example,
heap_update(), so I don't believe that there is any reason to reject
that a priori.  Indeed, I can think of at least one pretty good reason
to do it exactly that way: it's the way we've handled all previous
problems of this type, and in general it's better to make new code
look like existing code than to invent something new, especially when
you haven't made any effort to quantify the problem or the extent to
which the proposed solution mitigates it.

> If you've got some rational objections, please raise them.

Look, I think that the objections I have been raising are rational.
YMMV, and obviously does.  The bottom line here is that I can't stop
you from committing this patch, and we both know that.  And, really,
at the end of the day, I have no desire to keep you from committing
this patch, even if I had the power to do so, because I agree that the
feature has merit.  But I can object to it and, as the patch stands
now, I do object to it, for the following specific reasons:

1. You haven't posted any meaningful performance test results, of
either normal cases or worst cases.
2. You're fiddling with the buffer locking in a way that I'm very
doubtful about without having tried any of the alternatives.
3. You're rearranging the page header in a way that I find ugly and baroque.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Josh Berkus
Date:
On 2/20/12 5:57 AM, Robert Haas wrote:
> 3. You're rearranging the page header in a way that I find ugly and baroque.

Guys, are we talking about an on-disk format change?  If so, this may
need to be kicked out of 9.2 ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Mon, Feb 20, 2012 at 12:53 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 2/20/12 5:57 AM, Robert Haas wrote:
>> 3. You're rearranging the page header in a way that I find ugly and baroque.
>
> Guys, are we talking about an on-disk format change?  If so, this may
> need to be kicked out of 9.2 ...

Yes, we are.  Simon's gone to some pains to make it backward
compatible, but IMHO it's a lot messier and less future-proof than it
could be with some more work, and if we commit this patch then we WILL
be stuck with this for a very long time.

The fact is that, thus far, no real effort has been made by anyone to
evict anything from the current CommitFest.  I did that for the last
two cycles, but as the minutes for last year's development meeting
succinctly observed "Haas ... doesn't like being the schedule jerk".
My resolve to be less of a schedule jerk is being sorely tested,
though: aside from this patch, which has its share of issues, there's
also Alvaro's foreign key stuff, which probably needs a lot more
review than it has so far gotten, and several large patches from
Dimitri, which do cool stuff but need a lot more work, and Peter
Geoghegan's pg_stat_statements patch, which is awesome functionality
but sounds like it's still a little green, and parallel pg_dump, which
is waiting on a rebase after some refactoring I did to simplify
things, and pgsql_fdw, and Heikki's xlog insert scaling patch, which
fortunately seems to be in the capable hands of Fujii Masao and Jeff
Janes, but certainly isn't ready to go today.  Any of these
individually could probably eat up a solid week of my time, and then
there are the other 40 as-yet-undealt-with patches.

I said five weeks ago that it probably wouldn't hurt anything to let
the CommitFest run six weeks, and Tom opined that it would probably
require two months.  But the sad reality is that, after five weeks,
we're not even half done with this CommitFest.  That's partly because
most people did not review as much code as they submitted, partly
because a bunch of people played fast and loose with the submission
deadline, and partly because we didn't return any patches that still
needed major rehashing after the first round of review, leading to a
situation where supposedly code-complete patches are showing up for
the first time four or five weeks after the submission deadline.  Now
none of that is the fault of this patch specifically: honestly, if I
had to pick just two more patches to get committed before beta, this
would probably be one of them.  But that doesn't make me happy with
where it's at today, not to mention the fact that there are people who
submitted their code a lot earlier who still haven't gotten the
attention this patch has, which is still less than the attention a
patch of this magnitude probably needs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Josh Berkus
Date:
>> Guys, are we talking about an on-disk format change?  If so, this may
>> need to be kicked out of 9.2 ...
> 
> Yes, we are.  Simon's gone to some pains to make it backward
> compatible, but IMHO it's a lot messier and less future-proof than it
> could be with some more work, and if we commit this patch then we WILL
> be stuck with this for a very long time.

Yeah.  I'd personally prefer to boot this to 9.3, then.  It's not like
there's not enough features for 9.2, and I really don't want this
feature to cause 5 others to be delayed.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: 16-bit page checksums for 9.2

From
Bruce Momjian
Date:
On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
> Another disadvantage of the current scheme is that there's no
> particularly easy way to know that your whole cluster has checksums.
> No matter how we implement checksums, you'll have to rewrite every
> table in the cluster in order to get them fully turned on.  But with
> the current design, there's no easy way to know how much of the
> cluster is actually checksummed.  If you shut checksums off, they'll
> linger until those pages are rewritten, and there's no easy way to
> find the relations from which they need to be removed, either.

Yes, pg_upgrade makes this hard.  If you are using pg_dump to restore,
and set the checksum GUC before you do the restore, and never turn it
off, then you will have a fully checksum'ed database.

If you use pg_upgrade, and enable the checksum GUC, your database will
become progressively checksummed, and as Simon pointed out, the only
clean way is VACUUM FULL.  It is quite hard to estimate the checksum
coverage of a database with mixed checksumming --- one cool idea would
be for ANALYZE to report how many of the pages it saw were checksummed. 
Yeah, crazy, but it might be enough.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Mon, Feb 20, 2012 at 6:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Feb 20, 2012 at 12:53 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> On 2/20/12 5:57 AM, Robert Haas wrote:
>>> 3. You're rearranging the page header in a way that I find ugly and baroque.
>>
>> Guys, are we talking about an on-disk format change?  If so, this may
>> need to be kicked out of 9.2 ...
>
> Yes, we are.  Simon's gone to some pains to make it backward
> compatible, but IMHO it's a lot messier and less future-proof than it
> could be with some more work, and if we commit this patch then we WILL
> be stuck with this for a very long time.

No, we aren't. The format is not changed, though there are 3 new bits added.

The format is designed to be extensible and we have room for 10 more
bits, which allows 1024 new page formats, which at current usage rate
should last us for approximately 5000 years of further development.
Previously, we were limited to 251 more page formats, so this new
mechanism is more future proof than the last. Alternatively you might
say that we have dropped from 1271 page formats to only 1024, which is
hardly limiting.

This has already been discussed and includes points made by Bruce and
Heikki in the final design. The "more work" required is not described
clearly, nor has anyone but RH requested it. As for "messier", RH's
only alternate suggestion is to split the checksum into multiple
pieces, which I don't think yer average reader would call less ugly,
less messy or more future proof.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Mon, Feb 20, 2012 at 11:09 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
>> Another disadvantage of the current scheme is that there's no
>> particularly easy way to know that your whole cluster has checksums.
>> No matter how we implement checksums, you'll have to rewrite every
>> table in the cluster in order to get them fully turned on.  But with
>> the current design, there's no easy way to know how much of the
>> cluster is actually checksummed.  If you shut checksums off, they'll
>> linger until those pages are rewritten, and there's no easy way to
>> find the relations from which they need to be removed, either.
>
> Yes, pg_upgrade makes this hard.  If you are using pg_dump to restore,
> and set the checksum GUC before you do the restore, and never turn it
> off, then you will have a fully checksum'ed database.
>
> If you use pg_upgrade, and enable the checksum GUC, your database will
> become progressively checksummed, and as Simon pointed out, the only
> clean way is VACUUM FULL.  It is quite hard to estimate the checksum
> coverage of a database with mixed checksumming --- one cool idea would
> be for ANALYZE to report how many of the pages it saw were checksummed.
> Yeah, crazy, but it might be enough.

Well, I didn't say VACUUM FULL was the only clean way of knowing
whether every block is checksummed; it's a very intrusive way.

If you want a fast upgrade with pg_upgrade, rewriting every block is
not really a grand plan, but if you want it...

If we did that, I think I would prefer to do it with these commands

  VACUUM ENABLE CHECKSUM;   //whole database only
  VACUUM DISABLE CHECKSUM;

rather than use a GUC. We can then add an option to pg_upgrade.

That way, we scan the whole database, adding checksums, and then record it
in pg_database

When we remove it, we do same thing in reverse then record it.

So there's no worries about turning on/off GUCs and we know for
certain where our towel is.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Bruce Momjian
Date:
On Mon, Feb 20, 2012 at 11:23:42PM +0000, Simon Riggs wrote:
> > If you use pg_upgrade, and enable the checksum GUC, your database will
> > become progressively checksummed, and as Simon pointed out, the only
> > clean way is VACUUM FULL.  It is quite hard to estimate the checksum
> > coverage of a database with mixed checksumming --- one cool idea would
> > be for ANALYZE to report how many of the pages it saw were checksummed.
> > Yeah, crazy, but it might be enough.
> 
> Well, I didn't say VACUUM FULL was the only clean way of knowing
> whether every block is checksummed, its a very intrusive way.
> 
> If you want a fast upgrade with pg_upgrade, rewriting every block is
> not really a grand plan, but if you want it...
> 
> If we did that, I think I would prefer to do it with these commands
> 
>   VACUUM ENABLE CHECKSUM;   //whole database only
>   VACUUM DISABLE CHECKSUM;
> 
> rather than use a GUC. We can then add an option to pg_upgrade.
> 
> That way, we scan whole database, adding checksums and then record it
> in pg_database
> 
> When we remove it, we do same thing in reverse then record it.
> 
> So there's no worries about turning on/off GUCs and we know for
> certain where our towel is.

Yes, that would work, but I suspect most users are going to want to
enable checksums when they are seeing some odd behavior and want to test
the hardware.  Requiring them to rewrite the database to do that is
making the feature less useful, and once they have the problem fixed,
they might want to turn it off again.

Now, one argument is that they could enable checksums, see no checksum
errors, but still have undetected corruption on pages that were
written before checksums were enabled.  I think we just need to document
that checksums only cover writes performed _after_ the feature is
enabled.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: 16-bit page checksums for 9.2

From
Noah Misch
Date:
On Mon, Feb 20, 2012 at 08:57:08AM -0500, Robert Haas wrote:
> On Mon, Feb 20, 2012 at 4:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > What straightforward implementation is that?? This *is* the straightforward one.
> >
> > God knows what else we'd break if we drop the lock, reacquire as an
> > exclusive, then drop it again and reacquire in shared mode. Code tends
> > to assume that when you take a lock you hold it until you release;
> > doing otherwise would allow all manner of race conditions to emerge.
> >
> > And do you really think that would be faster?
> 
> I don't know, but neither do you, because you apparently haven't tried
> it.   Games where we drop the shared lock and get an exclusive lock
> are used in numerous places in the source code; see, for example,
> heap_update(), so I don't believe that there is any reason to reject
> that a priori.  Indeed, I can think of at least one pretty good reason
> to do it exactly that way: it's the way we've handled all previous
> problems of this type, and in general it's better to make new code
> look like existing code than to invent something new, especially when
> you haven't made any effort to quantify the problem or the extent to
> which the proposed solution mitigates it.

We do, in numerous places, drop a shared buffer content lock and reacquire it
in exclusive mode.  However, the existing users of that pattern tend to trade
the lock, complete subsequent work, and unlock the buffer all within the same
function.  So they must, because several of them recheck facts that can change
during the period of holding no lock.  SetBufferCommitInfoNeedsSave() callers
do not adhere to that pattern; they can be quite distant from the original
lock acquisition and eventual lock release.  Take the prototypical case of
SetHintBits().  Our shared lock originated in a place like heapgettup(), which
expects to hold it continually until finished.  We would need to somehow push
up through HeapTupleSatisfiesVisibility() the fact that we traded the buffer
lock, then have heapgettup() reorient itself as needed.  Every caller of code
that can reach SetBufferCommitInfoNeedsSave() would need to do the same.
Perhaps there's a different way to apply the lock-trade technique that avoids
this mess, but I'm not seeing it.  Consequently, I find the idea of requiring
a spinlock acquisition to read or write pd_lsn/pd_tli under BUFFER_LOCK_SHARE
to be a sensible one.  Within that umbrella, some details need attention:

- Not all BUFFER_LOCK_SHARE callers of PageGetLSN() live in bufmgr.c.  I note
  gistScanPage() and XLogCheckBuffer()[1].  (Perhaps we'll only require the
  spinlock for heap buffers, letting gistScanPage() off the hook.)  We need a
  public API, perhaps LockBufferForLSN().
 

- The use of some spinlock need not imply using the buffer header spinlock.
  We could add a dedicated pd_lsn_tli_lock to BufferDesc.  That has the usual
  trade-off of splitting a lock: less contention at the cost of more
  acquisitions.  I have no intuition on which approach would perform better.
 

I agree that src/backend/storage/buffer/README is the place to document the
new locking rules.

I do share your general unease about adding new twists to the buffer access
rules.  Some of our hairiest code is also the code manipulating buffer locks
most extensively, and I would not wish to see that code get even more
difficult to understand.  However, I'm not seeing a credible alternative that
retains the same high-level behaviors.

A possible compromise is to leave the page clean after setting a hint bit,
much like the patch already has us do under hot standby.  Then there's no new
WAL and no new rules around pd_lsn.  Wasn't that one of the things Merlin
benchmarked when he was looking at hint bits?  Does anyone recall the result?

Thanks,
nm

[1] Patch v10 adds a comment to XLogCheckBuffer() saying otherwise.  The
comment is true today, but the same patch makes it false by having
XLogSaveBufferForHint() call XLogInsert() under a share lock.


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Tue, Feb 21, 2012 at 10:07 AM, Noah Misch <noah@leadboat.com> wrote:

> Consequently, I find the idea of requiring
> a spinlock acquisition to read or write pd_lsn/pd_tli under BUFFER_LOCK_SHARE
> to be a sensible one.

OK

> Within that umbrella, some details need attention:
>
> - Not all BUFFER_LOCK_SHARE callers of PageGetLSN() live in bufmgr.c.  I note
>  gistScanPage() and XLogCheckBuffer()[1].  (Perhaps we'll only require the
>  spinlock for heap buffers, letting gistScanPage() off the hook.)  We need a
>  public API, perhaps LockBufferForLSN().

Yep, I checked all the call points previously. gistScanPage() and also
gistbulkdelete() would need changing, but since GIST doesn't use
hints, there is no need for locking. I'll document that better.

> - The use of some spinlock need not imply using the buffer header spinlock.
>  We could add a dedicated pd_lsn_tli_lock to BufferDesc.  That has the usual
>  trade-off of splitting a lock: less contention at the cost of more
>  acquisitions.  I have no intuition on which approach would perform better.

Will think about that and try a few ideas.

> I agree that src/backend/storage/buffer/README is the place to document the
> new locking rules.

OK, I'll move the comments.

> I do share your general unease about adding new twists to the buffer access
> rules.  Some of our hairiest code is also the code manipulating buffer locks
> most extensively, and I would not wish to see that code get even more
> difficult to understand.  However, I'm not seeing a credible alternative that
> retains the same high-level behaviors.

Yes, those changes were not made lightly, and only after much checking.

> A possible compromise is to leave the page clean after setting a hint bit,
> much like the patch already has us do under hot standby.  Then there's no new
> WAL and no new rules around pd_lsn.  Wasn't that one of the things Merlin
> benchmarked when he was looking at hint bits?  Does anyone recall the result?

I was thinking exactly that myself. I'll add a GUC to test.

> [1] Patch v10 adds a comment to XLogCheckBuffer() saying otherwise.  The
> comment is true today, but the same patch makes it false by having
> XLogSaveBufferForHint() call XLogInsert() under a share lock.

OK, thats an issue, well spotted. Will fix.

Thanks for thorough review.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Noah Misch
Date:
On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
> On Sun, Feb 19, 2012 at 11:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >> So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
> >> doesn't have a checksum, PD_CHECKSUM2 is not set?  What good does that
> >> do?
> >
> > As explained in detailed comments, the purpose of this is to implement
> > Heikki's suggestion that we have a bit set to zero so we can detect
> > failures that cause a run of 1s.
> 
> I think it's nonsensical to pretend that there's anything special
> about that particular bit.  If we want to validate the page header
> before trusting the lack of a checksum bit, we can do that far more
> thoroughly than just checking that one bit.  There are a whole bunch
> of bits that ought to always be zero, and there are other things we
> can validate as well (e.g. LSN not in future).  If we're concerned
> about the checksum-enabled bit getting flipped (and I agree that we
> should be), we can check some combination of that stuff in the hope of
> catching it, and that'll be a far better guard than just checking one
> arbitrarily selected bit.

PageHeaderIsValid() (being renamed to PageIsVerified()) already checks
"(page->pd_flags & ~PD_VALID_FLAG_BITS) == 0".  Explicitly naming another bit
and keeping it unset is redundant with that existing check.  It would cease to
be redundant if we ever allocate all the flag bits, but then we also wouldn't
have a bit to spare as PD_CHECKSUM2.  I agree with you on this point.

> That having been said, I don't feel very good about the idea of
> relying on the contents of the page to tell us whether or not the page
> has a checksum.  There's no guarantee that an error that flips the
> has-checksum bit will flip any other bit on the page, or that it won't
> flip everything else we're relying on as a backstop in exactly the
> manner that foils whatever algorithm we put in place.  Random
> corruption is, perhaps, unlikely to do that, but somehow I feel like
> it'll happen more often than random chance suggests.  Things never
> fail the way you want them to.
> 
> Another disadvantage of the current scheme is that there's no
> particularly easy way to know that your whole cluster has checksums.
> No matter how we implement checksums, you'll have to rewrite every
> table in the cluster in order to get them fully turned on.  But with
> the current design, there's no easy way to know how much of the
> cluster is actually checksummed.  If you shut checksums off, they'll
> linger until those pages are rewritten, and there's no easy way to
> find the relations from which they need to be removed, either.

I'm not seeing value in rewriting pages to remove checksums, as opposed to
just ignoring those checksums going forward.  Did you have a particular
scenario in mind?

> I'm tempted to suggest a relation-level switch: when you want
> checksums, you use ALTER TABLE to turn them on, and when you don't
> want them any more you use ALTER TABLE to shut them off again, in each
> case rewriting the table.  That way, there's never any ambiguity about
> what's in the data pages in a given relation: either they're either
> all checksummed, or none of them are.  This moves the decision about
> whether checksums are enabled or disabled quite a bit further away
> from the data itself, and also allows you to determine (by catalog
> inspection) which parts of the cluster do in fact have checksums.  It
> might be kind of a pain to implement, though: you'd have to pass the
> information about how any given relation was configured down to the
> place where we validate page sanity.  I'm not sure whether that's
> practical.

This patch implies future opportunities to flesh out its UI, and I don't see
it locking us out of implementing the above.  We'll want a weak-lock command
that adds checksums to pages lacking them.  We'll want to expose whether a
given relation has full checksum coverage.  With that, we could produce an
error when an apparently-non-checksummed page appears in a relation previously
known to have full checksum coverage.

Even supposing an "ALTER TABLE t SET {WITH | WITHOUT} CHECKSUMS" as the only
tool for enabling or disabling them, it's helpful to have each page declare
whether it has a checksum.  That way, the rewrite can take as little as an
AccessShareLock, and any failure need not lose work already done.  If you rely
on anticipating the presence of a checksum based on a pg_class column, that
ALTER needs to perform an atomic rewrite.


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Wed, Feb 22, 2012 at 7:06 AM, Noah Misch <noah@leadboat.com> wrote:
> On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
>> On Sun, Feb 19, 2012 at 11:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> >> So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
>> >> doesn't have a checksum, PD_CHECKSUM2 is not set?  What good does that
>> >> do?
>> >
>> > As explained in detailed comments, the purpose of this is to implement
>> > Heikki's suggestion that we have a bit set to zero so we can detect
>> > failures that cause a run of 1s.
>>
>> I think it's nonsensical to pretend that there's anything special
>> about that particular bit.  If we want to validate the page header
>> before trusting the lack of a checksum bit, we can do that far more
>> thoroughly than just checking that one bit.  There are a whole bunch
>> of bits that ought to always be zero, and there are other things we
>> can validate as well (e.g. LSN not in future).  If we're concerned
>> about the checksum-enabled bit getting flipped (and I agree that we
>> should be), we can check some combination of that stuff in the hope of
>> catching it, and that'll be a far better guard than just checking one
>> arbitrarily selected bit.
>
> PageHeaderIsValid() (being renamed to PageIsVerified()) already checks
> "(page->pd_flags & ~PD_VALID_FLAG_BITS) == 0".  Explicitly naming another bit
> and keeping it unset is redundant with that existing check.  It would cease to
> be redundant if we ever allocate all the flag bits, but then we also wouldn't
> have a bit to spare as PD_CHECKSUM2.  I agree with you on this point.

No problem to remove that. It was there at Heikki's request.

>> That having been said, I don't feel very good about the idea of
>> relying on the contents of the page to tell us whether or not the page
>> has a checksum.  There's no guarantee that an error that flips the
>> has-checksum bit will flip any other bit on the page, or that it won't
>> flip everything else we're relying on as a backstop in exactly the
>> manner that foils whatever algorithm we put in place.  Random
>> corruption is, perhaps, unlikely to do that, but somehow I feel like
>> it'll happen more often than random chance suggests.  Things never
>> fail the way you want them to.
>>
>> Another disadvantage of the current scheme is that there's no
>> particularly easy way to know that your whole cluster has checksums.
>> No matter how we implement checksums, you'll have to rewrite every
>> table in the cluster in order to get them fully turned on.  But with
>> the current design, there's no easy way to know how much of the
>> cluster is actually checksummed.  If you shut checksums off, they'll
>> linger until those pages are rewritten, and there's no easy way to
>> find the relations from which they need to be removed, either.
>
> I'm not seeing value in rewriting pages to remove checksums, as opposed to
> just ignoring those checksums going forward.  Did you have a particular
> scenario in mind?

Agreed. No reason to change a checksum unless we rewrite the block, no
matter whether page_checksums is on or off.

>> I'm tempted to suggest a relation-level switch: when you want
>> checksums, you use ALTER TABLE to turn them on, and when you don't
>> want them any more you use ALTER TABLE to shut them off again, in each
>> case rewriting the table.  That way, there's never any ambiguity about
>> what's in the data pages in a given relation: either they're either
>> all checksummed, or none of them are.  This moves the decision about
>> whether checksums are enabled or disabled quite a bit further away
>> from the data itself, and also allows you to determine (by catalog
>> inspection) which parts of the cluster do in fact have checksums.  It
>> might be kind of a pain to implement, though: you'd have to pass the
>> information about how any given relation was configured down to the
>> place where we validate page sanity.  I'm not sure whether that's
>> practical.
>
> This patch implies future opportunities to flesh out its UI, and I don't see
> it locking us out of implementing the above.

Agreed

> We'll want a weak-lock command
> that adds checksums to pages lacking them.

VACUUM FREEZE will do this.

> We'll want to expose whether a
> given relation has full checksum coverage.

Not sure I understand why we'd want that. If you really want that, it's
a simple function to do it.

> With that, we could produce an
> error when an apparently-non-checksummed page appears in a relation previously
> known to have full checksum coverage.

Additional checking at relation level is just going to slow things
down. Don't see the value of having a relation-level checksum option.

It would be an easy thing to skip checksums on UNLOGGED relations, and
possibly desirable, but I don't see the utility of a per relation
checksum setting in other cases. There's just no need for fine tuning
like that.

> Even supposing an "ALTER TABLE t SET {WITH | WITHOUT} CHECKSUMS" as the only
> tool for enabling or disabling them, it's helpful to have each page declare
> whether it has a checksum.  That way, the rewrite can take as little as an
> AccessShareLock, and any failure need not lose work already done.  If you rely
> on anticipating the presence of a checksum based on a pg_class column, that
> ALTER needs to perform an atomic rewrite.

That will pretty much kill the usability of this feature. We must be
able to add checksums without taking a full lock on the table and/or
database. Making it an ALTER TABLE would require a huge maintenance
effort to add checksums to a database.

-1 to the idea of having ALTER TABLE variant to enable/disable checksums

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Tue, Feb 21, 2012 at 5:07 AM, Noah Misch <noah@leadboat.com> wrote:
> We do, in numerous places, drop a shared buffer content lock and reacquire it
> in exclusive mode.  However, the existing users of that pattern tend to trade
> the lock, complete subsequent work, and unlock the buffer all within the same
> function.  So they must, because several of them recheck facts that can change
> during the period of holding no lock.  SetBufferCommitInfoNeedsSave() callers
> do not adhere to that pattern; they can be quite distant from the original
> lock acquisition and eventual lock release.  Take the prototypical case of
> SetHintBits().  Our shared lock originated in a place like heapgettup(), which
> expects to hold it continually until finished.  We would need to somehow push
> up through HeapTupleSatisfiesVisibility() the fact that we traded the buffer
> lock, then have heapgettup() reorient itself as needed.  Every caller of code
> that can reach SetBufferCommitInfoNeedsSave() would need to do the same.
> Perhaps there's a different way to apply the lock-trade technique that avoids
> this mess, but I'm not seeing it.  Consequently, I find the idea of requiring
> a spinlock acquisition to read or write pd_lsn/pd_tli under BUFFER_LOCK_SHARE
> to be a sensible one.  Within that umbrella, some details need attention:

Fair analysis.

> - Not all BUFFER_LOCK_SHARE callers of PageGetLSN() live in bufmgr.c.  I note
>  gistScanPage() and XLogCheckBuffer()[1].  (Perhaps we'll only require the
>  spinlock for heap buffers, letting gistScanPage() off the hook.)  We need a
>  public API, perhaps LockBufferForLSN().

That seems like a good idea.  I am a little worried about Simon's plan
to do the additional locking only for AMs that have no unlogged-type
updates.  It's a reasonable performance optimization, and skipping the
locking when checksums are shut off also seems reasonable, but it
seems a bit fragile: suppose that, in the future, someone fixes GiST
to do something intelligent with kill_prior_tuple.  Now suddenly it
needs LSN locking that it didn't need before, but this fact might not
be very obvious.  It might be a good idea to design LockBufferForLSN
to take an AM OID as an argument, and use the AM OID to decide whether
the locking is required.  That way, we can call it from everywhere
that reads an LSN, and it can simply skip the locking in places where
it isn't presently needed.

Or maybe there's a better design, but I agree with you that some kind
of public API is essential.
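
Just to sketch the shape such an API could take -- invented names, shared
buffers only, and the AM-specific test left out for brevity; this is an
illustration, not code from the patch:

/*
 * Read a page's LSN while holding only a share lock on the buffer content.
 * The buffer header spinlock (or a dedicated pd_lsn_tli_lock) keeps a
 * concurrent hint-bit writer from moving pd_lsn underneath us.
 */
XLogRecPtr
BufferGetLSNAtomic(Buffer buffer)
{
    volatile BufferDesc *bufHdr = &BufferDescriptors[buffer - 1];
    Page        page = BufferGetPage(buffer);
    XLogRecPtr  lsn;

    LockBufHdr(bufHdr);
    lsn = PageGetLSN(page);
    UnlockBufHdr(bufHdr);

    return lsn;
}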

> - The use of some spinlock need not imply using the buffer header spinlock.
>  We could add a dedicated pd_lsn_tli_lock to BufferDesc.  That has the usual
>  trade-off of splitting a lock: less contention at the cost of more
>  acquisitions.  I have no intuition on which approach would perform better.

I think I like the idea of a separate lock, but I agree it could stand
to be tested both ways.  Another thought is that we might add
SpinLockConditionalAcquire(), that just tries to TAS the lock and
returns false if the lock is already held.  Then, we could grab the
spinlock while writing the page, in lieu of copying it, and anyone who
wants to set a hint bit can conditionally acquire the spinlock long
enough to set the bits.  If the spinlock isn't immediately free then
instead of spinning they just don't set the hint bits after all.  That
way you can be sure that no hint bits are set during the write, but
any attempt to set a hint bit only costs you a single TAS - no loop,
as with a regular spinlock acquisition - and of course the hint
itself.
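
In rough pseudo-C, something like the following, where
SpinLockConditionalAcquire() and the per-buffer hint_lock are both
hypothetical and only illustrate the idea:

/*
 * Sketch only.  FlushBuffer() would hold bufHdr->hint_lock while it
 * computes the checksum and writes the page out; hint-bit setters try
 * the lock once and simply skip the hint if they lose the race.
 */
static void
SetHintBitGuarded(volatile BufferDesc *bufHdr, HeapTupleHeader tuple,
                  uint16 infomask)
{
    if (!SpinLockConditionalAcquire(&bufHdr->hint_lock))
        return;                 /* buffer is being written out; skip the hint */

    tuple->t_infomask |= infomask;
    SetBufferCommitInfoNeedsSave(BufferDescriptorGetBuffer(bufHdr));

    SpinLockRelease(&bufHdr->hint_lock);
}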

> A possible compromise is to leave the page clean after setting a hint bit,
> much like the patch already has us do under hot standby.  Then there's no new
> WAL and no new rules around pd_lsn.  Wasn't that one of the things Merlin
> benchmarked when he was looking at hint bits?  Does anyone recall the result?

It slows things down substantially in the case of repeated scans of
the same unchanging table. In fact, I tried a much more nuanced
approach: set BM_UNTIDY on the page when
SetBufferCommitInfoNeedsSave() is called.  BM_UNTIDY pages are written
by the background writer and during checkpoints, and 5% of the time by
backends.  All of that has the salutary effect of making the first
sequential scan less ungodly slow, but it also has the less-desirable
effect of making the next 19 of them still kinda-sorta slow.  It was
unclear to me (and perhaps to others) whether it really worked out to
a win.

However, we might need/want to revisit some of that logic in
connection with this patch, because it seems to me that a large
sequential scan of an unhinted table could be ferociously slow, with
the backend writing out its own pages and WAL-logging them as it goes.
It would be good to test that to figure out whether some mitigation
measures are needed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Peter Geoghegan
Date:
I decided that it would be worth benchmarking this patch.
Specifically, I tested:

master, as a basis of comparison

checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'on'

checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'off'

This test was performed using pgbench-tools. At different client
counts and scaling factors "1,10,100", performance of an update.sql
workload was tested.

This benchmark used Greg Smith's "toy" server. As he put it recently:

"The main change to the 8 hyperthreaded core test server (Intel
i7-870) for this year is bumping it from 8GB to 16GB of RAM, which
effectively doubles the scale I can reach before things slow
dramatically." However, while Greg used scientific Linux for his
recent batch of performance numbers, the box was booted into Debian
for this, which used Kernel 2.6.32, FWIW. Didn't bother with a
separate disk for WAL.

I put shared_buffers at 1GB, and checkpoint_segments at 50. I took the
additional precaution of initdb'ing for each set, lest there be some
kind of contamination between sets, which necessitated doing some
additional work since I couldn't very well expect the "results"
database to persist. Different sets of figures from different runs
were dumped and restored, before finally being combined so that
pgbench-tools could produce its regular report.

I have attached a config for pgbench-tools, so that others may
recreate my work here. I also attach the most relevant image,
comparing each test set across scaling levels. I'll make a pdf of the
full report available if that would be useful.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Attachment

Re: 16-bit page checksums for 9.2

From
David Fetter
Date:
On Wed, Feb 22, 2012 at 11:17:53PM +0000, Peter Geoghegan wrote:
> I decided that it would be worth benchmarking this patch.
> Specifically, I tested:
> 
> master, as a basis of comparison
> 
> checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'on'
> 
> checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'off'
> 
> This test was performed using pgbench-tools. At different client
> counts and scaling factors "1,10,100", performance of an update.sql
> workload was tested.

Looks interesting.  Could you get some error bars around the numbers
plotted, and possibly some scaling factors between 10 and 100?

For the former, I'm looking for whether those changes are within
ordinary variation, and in the latter, some better idea of what the
curve looks like.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Wed, Feb 22, 2012 at 6:17 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> I decided that it would be worth benchmarking this patch.
> Specifically, I tested:
>
> master, as a basis of comparison
>
> checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'on'
>
> checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'off'
>
> This test was performed using pgbench-tools. At different client
> counts and scaling factors "1,10,100", performance of an update.sql
> workload was tested.
>
> This benchmark used Greg Smith's "toy" server. As he put it recently:
>
> "The main change to the 8 hyperthreaded core test server (Intel
> i7-870) for this year is bumping it from 8GB to 16GB of RAM, which
> effectively doubles the scale I can reach before things slow
> dramatically." However, while Greg used scientific Linux for his
> recent batch of performance numbers, the box was booted into Debian
> for this, which used Kernel 2.6.32, FWIW. Didn't bother with a
> separate disk for WAL.
>
> I put shared_buffers at 1GB, and checkpoint_segments at 50. I took the
> additional precaution of initdb'ing for each set, lest there be some
> kind of contamination between sets, which necessitated doing some
> additional work since I couldn't very well expect the "results"
> database to persist. Different sets of figures from different runs
> were dumped and restored, before finally being combined so that
> pgbench-tools could produce its regular report.
>
> I have attached a config for pgbench-tools, so that others may
> recreate my work here. I also attach the most relevant image,
> comparing each test set across scaling levels. I'll make a pdf of the
> full report available if that would be useful.

Thanks for testing this.  The graph obscures a bit how much percentage
change we're talking about here - could you post the raw tps numbers?

I think we also need to test the case of seq-scanning a large, unhinted table.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Heikki Linnakangas
Date:
On 22.02.2012 14:30, Simon Riggs wrote:
> On Wed, Feb 22, 2012 at 7:06 AM, Noah Misch<noah@leadboat.com>  wrote:
>> On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
>>> Another disadvantage of the current scheme is that there's no
>>> particularly easy way to know that your whole cluster has checksums.
>>> No matter how we implement checksums, you'll have to rewrite every
>>> table in the cluster in order to get them fully turned on.  But with
>>> the current design, there's no easy way to know how much of the
>>> cluster is actually checksummed.  If you shut checksums off, they'll
>>> linger until those pages are rewritten, and there's no easy way to
>>> find the relations from which they need to be removed, either.
>>
>> I'm not seeing value in rewriting pages to remove checksums, as opposed to
>> just ignoring those checksums going forward.  Did you have a particular
>> scenario in mind?
>
> Agreed. No reason to change a checksum unless we rewrite the block, no
> matter whether page_checksums is on or off.

This can happen:

1. checksums are initially enabled. A page is written, with a correct 
checksum.
2. checksums turned off.
3. A hint bit is set on the page.
4. While the page is being written out, someone pulls the power cord, 
and you get a torn write. The hint bit change made it to disk, but the 
clearing of the checksum in the page header did not.
5. Sometime after restart, checksums are turned back on.

The page now has an incorrect checksum on it. The next time it's read, 
you get a checksum error.

I'm pretty uncomfortable with this idea of having a flag on the page 
itself to indicate whether it has a checksum or not. No matter how many 
bits we use for that flag. You can never be quite sure that all your 
data is covered by the checksum, and there's a lot of room for subtle 
bugs like the above, where a page is reported as corrupt when it isn't, 
or vice versa.

This thing needs to be reliable and robust. The purpose of a checksum is 
to have an extra sanity check, to detect faulty hardware. If it's 
complicated, whenever you get a checksum mismatch, you'll be wondering 
if you have broken hardware or if you just bumped on a PostgreSQL bug. I 
think you need a flag in pg_control or somewhere to indicate whether 
checksums are currently enabled or disabled, and a mechanism to scan and 
rewrite all the pages with checksums, before they are verified.

I've said this before, but I still don't like the hacks with the version 
number in the page header. Even if it works, I would much prefer the 
straightforward option of extending the page header for the new field. 
Yes, it means you have to deal with pg_upgrade, but it's a hurdle we'll 
have to jump at some point anyway.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Wed, Feb 29, 2012 at 2:40 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 22.02.2012 14:30, Simon Riggs wrote:
>>
>> On Wed, Feb 22, 2012 at 7:06 AM, Noah Misch<noah@leadboat.com>  wrote:
>>>
>>> On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
>>>>
>>>> Another disadvantage of the current scheme is that there's no
>>>> particularly easy way to know that your whole cluster has checksums.
>>>> No matter how we implement checksums, you'll have to rewrite every
>>>> table in the cluster in order to get them fully turned on.  But with
>>>> the current design, there's no easy way to know how much of the
>>>> cluster is actually checksummed.  If you shut checksums off, they'll
>>>> linger until those pages are rewritten, and there's no easy way to
>>>> find the relations from which they need to be removed, either.
>>>
>>>
>>> I'm not seeing value in rewriting pages to remove checksums, as opposed
>>> to
>>> just ignoring those checksums going forward.  Did you have a particular
>>> scenario in mind?
>>
>>
>> Agreed. No reason to change a checksum unless we rewrite the block, no
>> matter whether page_checksums is on or off.
>
>
> This can happen:
>
> 1. checksums are initially enabled. A page is written, with a correct
> checksum.
> 2. checksums turned off.
> 3. A hint bit is set on the page.
> 4. While the page is being written out, someone pulls the power cord, and
> you get a torn write. The hint bit change made it to disk, but the clearing
> of the checksum in the page header did not.
> 5. Sometime after restart, checksums are turned back on.
>
> The page now has an incorrect checksum on it. The next time it's read, you
> get a checksum error.

Yes, you will. And you'll get a checksum error because the block no
longer passes verification. So an error should be reported.

We can and should document that turning this on/off/on can cause
problems. Hopefully crashing isn't that common a situation.

The production default would be "off". The default in the patch is
"on" only for testing.

> I'm pretty uncomfortable with this idea of having a flag on the page itself
> to indicate whether it has a checksum or not. No matter how many bits we use
> for that flag. You can never be quite sure that all your data is covered by
> the checksum, and there's a lot of room for subtle bugs like the above,
> where a page is reported as corrupt when it isn't, or vice versa.

That is necessary to allow upgrade. It's not there for any other reason.

> This thing needs to be reliable and robust. The purpose of a checksum is to
> have an extra sanity check, to detect faulty hardware. If it's complicated,
> whenever you get a checksum mismatch, you'll be wondering if you have broken
> hardware or if you just bumped on a PostgreSQL bug. I think you need a flag
> in pg_control or somewhere to indicate whether checksums are currently
> enabled or disabled, and a mechanism to scan and rewrite all the pages with
> checksums, before they are verified.

That would require massive downtime, so again, it has been ruled out
for practicality.

> I've said this before, but I still don't like the hacks with the version
> number in the page header. Even if it works, I would much prefer the
> straightforward option of extending the page header for the new field. Yes,
> it means you have to deal with pg_upgrade, but it's a hurdle we'll have to
> jump at some point anyway.

What you suggest might happen in the next release, or maybe longer.
There may be things that block it completely, so it might never
happen. My personal opinion is that it is not possible to make further
block format changes until we have a fully online upgrade process,
otherwise we block people from upgrading - not everybody can take
their site down to run pg_upgrade. I plan to work on that, but it may
not happen for 9.3; perhaps you will object to that also when it
comes.

So we simply cannot rely on this "jam tomorrow" vision.

This patch is very specifically something that makes the best of the
situation, now, for those that want and need it. If you don't want it,
you don't have to use it. But that shouldn't stop us giving it to the
people that do want it.

I'm hearing general interest and support for this feature from people
that run their business on PostgreSQL.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Heikki Linnakangas
Date:
On 29.02.2012 17:01, Simon Riggs wrote:
> On Wed, Feb 29, 2012 at 2:40 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> On 22.02.2012 14:30, Simon Riggs wrote:
>>>
>>> Agreed. No reason to change a checksum unless we rewrite the block, no
>>> matter whether page_checksums is on or off.
>>
>> This can happen:
>>
>> 1. checksums are initially enabled. A page is written, with a correct
>> checksum.
>> 2. checksums turned off.
>> 3. A hint bit is set on the page.
>> 4. While the page is being written out, someone pulls the power cord, and
>> you get a torn write. The hint bit change made it to disk, but the clearing
>> of the checksum in the page header did not.
>> 5. Sometime after restart, checksums are turned back on.
>>
>> The page now has an incorrect checksum on it. The next time it's read, you
>> get a checksum error.
>
> Yes, you will. And you'll get a checksum error because the block no
> longer passes. So an error should be reported.
>
> We can and should document that turning this on/off/on can cause
> problems. Hopefully crashing isn't that common a situation.

Hopefully not, but then again, you get interested in fiddling with this 
setting precisely when you do experience crashes. This feature needs to be
trustworthy in the face of crashes.

>> I'm pretty uncomfortable with this idea of having a flag on the page itself
>> to indicate whether it has a checksum or not. No matter how many bits we use
>> for that flag. You can never be quite sure that all your data is covered by
>> the checksum, and there's a lot of room for subtle bugs like the above,
>> where a page is reported as corrupt when it isn't, or vice versa.
>
> That is necessary to allow upgrade. It's not their for any other reason.

Understood. I'm uncomfortable with it regardless of why it's there.

>> This thing needs to be reliable and robust. The purpose of a checksum is to
>> have an extra sanity check, to detect faulty hardware. If it's complicated,
>> whenever you get a checksum mismatch, you'll be wondering if you have broken
>> hardware or if you just bumped on a PostgreSQL bug. I think you need a flag
>> in pg_control or somewhere to indicate whether checksums are currently
>> enabled or disabled, and a mechanism to scan and rewrite all the pages with
>> checksums, before they are verified.
>
> That would require massive downtime, so again, it has been ruled out
> for practicality.

Surely it can be done online. You'll just need a third state between off 
and on, where checksums are written but not verified, while the cluster 
is scanned.
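
Something along these lines, in other words -- the names are invented purely
for illustration, and the switch from the middle state to "on" would be
recorded in pg_control once the scan of the cluster completes:

/* Sketch of the three-state setting; not proposed code. */
typedef enum
{
    PAGE_CHECKSUMS_OFF,         /* neither written nor verified */
    PAGE_CHECKSUMS_WRITE_ONLY,  /* written at flush time but not verified,
                                 * while a scan rewrites every page */
    PAGE_CHECKSUMS_ON           /* written and verified */
} PageChecksumsMode;

static bool
checksum_write_enabled(PageChecksumsMode mode)
{
    return mode != PAGE_CHECKSUMS_OFF;
}

static bool
checksum_verify_enabled(PageChecksumsMode mode)
{
    return mode == PAGE_CHECKSUMS_ON;
}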

>> I've said this before, but I still don't like the hacks with the version
>> number in the page header. Even if it works, I would much prefer the
>> straightforward option of extending the page header for the new field. Yes,
>> it means you have to deal with pg_upgrade, but it's a hurdle we'll have to
>> jump at some point anyway.
>
> What you suggest might happen in the next release, or maybe longer.
> There may be things that block it completely, so it might never
> happen. My personal opinion is that it is not possible to make further
> block format changes until we have a fully online upgrade process,
> otherwise we block people from upgrading - not everybody can take
> their site down to run pg_upgrade. I plan to work on that, but it may
> not happen for 9.3;

Yep, we should bite the bullet and work on that.

> perhaps you will object to that also when it comes.

Heh, quite possible :-). But only if there's a reason. I do want to see 
a solution to this, I have a few page-format changing ideas myself that 
I'd like to implement at some point.

> This patch is very specifically something that makes the best of the
> situation, now, for those that want and need it. If you don't want it,
> you don't have to use it.

Whether I use it or not, I'll have to live with it in the source tree.

> But that shouldn't stop us giving it to the people that do want it.
>
> I'm hearing general interest and support for this feature from people
> that run their business on PostgreSQL.

If you ask someone "would you like to have checksums in PostgreSQL?", 
he'll say "sure". If you ask him "would you like to keep the PostgreSQL 
source tree as simple as possible, to make it easy for new developers to 
join the effort?", he'll say "yes". It's all about how you frame the 
question. Even if you want to have checksums on data pages, it doesn't 
necessarily mean you want them so badly you can't wait another release 
or two for a cleaner solution, or that you'd be satisfied with the 
implementation proposed here.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Wed, Feb 29, 2012 at 3:30 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

> Surely it can be done online. You'll just need a third state between off and
> on, where checksums are written but not verified, while the cluster is
> scanned.

Are you saying you would accept the patch if we had this?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Heikki Linnakangas
Date:
On 29.02.2012 17:42, Simon Riggs wrote:
> On Wed, Feb 29, 2012 at 3:30 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>
>> Surely it can be done online. You'll just need a third state between off and
>> on, where checksums are written but not verified, while the cluster is
>> scanned.
>
> Are you saying you would accept the patch if we had this?

I think I would still be uncomfortable with the hacks in the page 
header. Less so than in the current form - you wouldn't need a flag to 
indicate whether the page has a valid checksum or not, which would clean 
it up quite a bit - but still.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Wed, Feb 29, 2012 at 3:46 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

>> Are you saying you would accept the patch if we had this?

> I think I would still be uncomfortable with the hacks in the page header.

There are no "hacks". There are some carefully designed changes with
input from multiple people, including yourself, and it copes as
gracefully as it can with backwards compatibility requirements.

> Less so than in the current form - you wouldn't need a flag to indicate
> whether the page has a valid checksum or not, which would clean it up quite
> a bit - but still.

I agree this would remove the reliance on a bit in the page header and
so make it even more robust.

I'll add the 2 phase enabling feature, making it happen at database level.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Wed, Feb 29, 2012 at 11:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, Feb 29, 2012 at 3:46 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>
>>> Are you saying you would accept the patch if we had this?
>
>> I think I would still be uncomfortable with the hacks in the page header.
>
> There are no "hacks". There are some carefully designed changes with
> input from multiple people, including yourself, and it copes as
> gracefully as it can with backwards compatibility requirements.

You have comments from three different people, all experienced
hackers, disagreeing with this position; Heikki and I have both
proposed alternate approaches.  I'm not sure that we're at a point
where we can say that we know what the best solution is, but I think
it is clear that there's enough concern about this that you ought not
to be denying that there is a problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Wed, Feb 29, 2012 at 12:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, Feb 29, 2012 at 4:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Feb 29, 2012 at 11:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On Wed, Feb 29, 2012 at 3:46 PM, Heikki Linnakangas
>>> <heikki.linnakangas@enterprisedb.com> wrote:
>>>
>>>>> Are you saying you would accept the patch if we had this?
>>>
>>>> I think I would still be uncomfortable with the hacks in the page header.
>>>
>>> There are no "hacks". There are some carefully designed changes with
>>> input from multiple people, including yourself, and it copes as
>>> gracefully as it can with backwards compatibility requirements.
>>
>> You have comments from three different people, all experienced
>> hackers, disagreeing with this position;
>
> Who is the third person you speak of? Perhaps they will speak again if
> they wish to be heard.

Tom Lane.  It was the very first email posted in response to the very
first version of this patch you ever posted.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Wed, Feb 29, 2012 at 5:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:

>>> You have comments from three different people, all experienced
>>> hackers, disagreeing with this position;
>>
>> Who is the third person you speak of? Perhaps they will speak again if
>> they wish to be heard.
>
> Tom Lane.  It was the very first email posted in response to the very
> first version of this patch you ever posted.

Tom objected to not being able to tell which version a data block was.
At length, we have discussed this on list and there is no issue. It is
clear what page format a block has.

Please ping him if you believe he has other rational objections to
committing something and he isn't listening.

I'm beginning to lose faith that objections are being raised at a
rational level. It's not a panel game with points for clever answers,
it's an engineering debate about how to add features real users want.
And they do want, so let me solve the problems by agreeing something
early enough to allow it to be implemented, rather than just
discussing it until we run out of time.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Wed, Feb 29, 2012 at 4:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 29, 2012 at 11:24 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On Wed, Feb 29, 2012 at 3:46 PM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com> wrote:
>>
>>>> Are you saying you would accept the patch if we had this?
>>
>>> I think I would still be uncomfortable with the hacks in the page header.
>>
>> There are no "hacks". There are some carefully designed changes with
>> input from multiple people, including yourself, and it copes as
>> gracefully as it can with backwards compatibility requirements.
>
> You have comments from three different people, all experienced
> hackers, disagreeing with this position;

Who is the third person you speak of? Perhaps they will speak again if
they wish to be heard.

> Heikki and I have both
> proposed alternate approaches.

No, you've proposed rejecting the patch in favour of some future
dream. We all agree on the future dream but it clearly happens in the
future and that could easily be more than 1 release ahead.

If you have comments that would allow a patch in this release, I am
happy to hear them. I hear only two people seeking to reject a patch
that adds value for users. Quite frankly, the comments about flag bits
are ridiculous and bogus.


> I'm not sure that we're at a point
> where we can say that we know what the best solution is, but I think
> it is clear that there's enough concern about this that you ought not
> to be denying that there is a problem.

There are some weird cases that can cause problems. My wish is to
resolve those so everyone is happy. If those weird cases are simply
used as an excuse to reject, then I don't accept that and nor will our
users. Of course, if there was a significant issue, it would be
immediate rejection but there isn't.

I'm trying to commit a useful patch in this release. If you'd like to
work with me to find a solution acceptable to all, I'd be happy to
include suggestions, just as I have already included comments from
Heikki, Bruce, Noah and others. I do accept that Heikki now says that
the code I added at his request is just a hack.

Assuming I'm going to commit something in this release, what should it
look like?

At Heikki's request/idea I will be working on a 2-phase database level
parameter that will give us checksumming on the whole database after a
scan to enable it. That sounds like it will resolve the corner cases
relatively cleanly.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Heikki Linnakangas
Date:
On 29.02.2012 19:54, Simon Riggs wrote:
> I'm beginning to lose faith that objections are being raised at a
> rational level. It's not a panel game with points for clever answers,
> it's an engineering debate about how to add features real users want.
> And they do want, so let me solve the problems by agreeing something
> early enough to allow it to be implemented, rather than just
> discussing it until we run out of time.

I thought my view on how this should be done was already clear, but just 
in case it isn't, let me restate: Enlarge the page header to make room 
for the checksum. To handle upgrades, put code in the backend to change 
the page format from old version to new one on-the-fly, as pages are 
read in. Because we're making the header larger, we need to ensure that 
there's room on every page. To do that, write a utility that you run on 
the cluster before running pg_upgrade, which moves tuples to ensure 
that. To ensure that the space doesn't get used again before upgrading, 
change the old version so that it reserves those N bytes in all new 
insertions and updates (I believe that approach has been discussed 
before and everyone is comfortable with backpatching such a change). All 
of this in 9.3.
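
[Editor's sketch: the "reserve those N bytes" step could look roughly like the
following in the old release. PageGetFreeSpace() is the existing bufpage.c
function; the reservation constant and the wrapper function are hypothetical
names invented for this illustration.]

    #include "postgres.h"
    #include "storage/bufpage.h"

    /* Bytes kept free on every page so a future, larger page header fits. */
    #define PRE_UPGRADE_RESERVED_BYTES  4

    /*
     * Hypothetical wrapper: code that places new tuples would call this
     * instead of PageGetFreeSpace(), so the reserved bytes are never consumed
     * by ordinary inserts and updates.
     */
    static Size
    PageGetFreeSpaceReserved(Page page)
    {
        Size        space = PageGetFreeSpace(page);

        if (space <= PRE_UPGRADE_RESERVED_BYTES)
            return 0;
        return space - PRE_UPGRADE_RESERVED_BYTES;
    }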

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: 16-bit page checksums for 9.2

From
Alvaro Herrera
Date:
Excerpts from Heikki Linnakangas's message of mié feb 29 16:09:02 -0300 2012:
> On 29.02.2012 19:54, Simon Riggs wrote:
> > I'm beginning to lose faith that objections are being raised at a
> > rational level. It's not a panel game with points for clever answers,
> > it's an engineering debate about how to add features real users want.
> > And they do want, so let me solve the problems by agreeing something
> > early enough to allow it to be implemented, rather than just
> > discussing it until we run out of time.
>
> I thought my view on how this should be done was already clear, but just
> in case it isn't, let me restate: Enlarge the page header to make room
> for the checksum. To handle upgrades, put code in the backend to change
> the page format from old version to new one on-the-fly, as pages are
> read in. Because we're making the header larger, we need to ensure that
> there's room on every page. To do that, write a utility that you run on
> the cluster before running pg_upgrade, which moves tuples to ensure
> that. To ensure that the space doesn't get used again before upgrading,
> change the old version so that it reserves those N bytes in all new
> insertions and updates (I believe that approach has been discussed
> before and everyone is comfortable with backpatching such a change). All
> of this in 9.3.

Note that if we want such a utility to walk and transform pages, we
probably need a marker in the catalogs somewhere so that pg_upgrade can
make sure that it was done in all candidate tables -- which is something
that we should get in 9.2 so that it can be used in 9.3.  Such a marker
would also allow us get rid of HEAP_MOVED_IN and HEAP_MOVED_OUT.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Wed, Feb 29, 2012 at 2:09 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 29.02.2012 19:54, Simon Riggs wrote:
>> I'm beginning to lose faith that objections are being raised at a
>> rational level. It's not a panel game with points for clever answers,
>> it's an engineering debate about how to add features real users want.
>> And they do want, so let me solve the problems by agreeing something
>> early enough to allow it to be implemented, rather than just
>> discussing it until we run out of time.
>
> I thought my view on how this should be done was already clear, but just in
> case it isn't, let me restate: Enlarge the page header to make room for the
> checksum. To handle upgrades, put code in the backend to change the page
> format from old version to new one on-the-fly, as pages are read in. Because
> we're making the header larger, we need to ensure that there's room on every
> page. To do that, write a utility that you run on the cluster before running
> pg_upgrade, which moves tuples to ensure that. To ensure that the space
> doesn't get used again before upgrading, change the old version so that it
> reserves those N bytes in all new insertions and updates (I believe that
> approach has been discussed before and everyone is comfortable with
> backpatching such a change). All of this in 9.3.

This could all be done on-line if we had a per-table flag indicating
which page version is in use.  You can use some variant of CLUSTER or
VACUUM or ALTER TABLE to rewrite the table using the page version of
your choice.  This is also more granular, in that it allows checksums
(or other features) to be turned on and off on a per-table basis.
When you read the table, the same flag tells you which page version to
expect, and you can error out if you haven't got that (or if the CRCs
don't match).
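
[Editor's sketch: the per-table flag could be consulted roughly as below when
a page is read in. The expected version would come from a new pg_class column;
that column and this function are hypothetical names for illustration, while
PageGetPageLayoutVersion() is the existing macro from storage/bufpage.h.]

    #include "postgres.h"
    #include "storage/bufpage.h"
    #include "utils/rel.h"

    /*
     * Hypothetical check run after a buffer is read from disk.  The
     * expected_version argument stands in for the value of a new pg_class
     * column recording which page layout the relation uses.
     */
    static void
    CheckExpectedPageVersion(Relation rel, Page page, BlockNumber blkno,
                             uint16 expected_version)
    {
        uint16      actual = PageGetPageLayoutVersion(page);

        if (actual != expected_version)
            ereport(ERROR,
                    (errmsg("block %u of relation \"%s\" has page layout version %u, expected %u",
                            blkno, RelationGetRelationName(rel),
                            actual, expected_version)));
    }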

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Heikki Linnakangas
Date:
On 29.02.2012 21:18, Alvaro Herrera wrote:
> Excerpts from Heikki Linnakangas's message of mié feb 29 16:09:02 -0300 2012:
>> I thought my view on how this should be done was already clear, but just
>> in case it isn't, let me restate: Enlarge the page header to make room
>> for the checksum. To handle upgrades, put code in the backend to change
>> the page format from old version to new one on-the-fly, as pages are
>> read in. Because we're making the header larger, we need to ensure that
>> there's room on every page. To do that, write a utility that you run on
>> the cluster before running pg_upgrade, which moves tuples to ensure
>> that. To ensure that the space doesn't get used again before upgrading,
>> change the old version so that it reserves those N bytes in all new
>> insertions and updates (I believe that approach has been discussed
>> before and everyone is comfortable with backpatching such a change). All
>> of this in 9.3.
>
> Note that if we want such a utility to walk and transform pages, we
> probably need a marker in the catalogs somewhere so that pg_upgrade can
> make sure that it was done in all candidate tables -- which is something
> that we should get in 9.2 so that it can be used in 9.3.

In the simplest form, the utility could just create a magic file in the 
data directory to indicate that it has run. All we need is a boolean 
flag, unless you want to be fancy and make the utility restartable. 
Implemented that way, there's no need to have anything in the catalogs.

> Such a marker would also allow us get rid of HEAP_MOVED_IN and
> HEAP_MOVED_OUT.

True.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Wed, Feb 29, 2012 at 2:18 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Note that if we want such a utility to walk and transform pages, we
> probably need a marker in the catalogs somewhere so that pg_upgrade can
> make sure that it was done in all candidate tables -- which is something
> that we should get in 9.2 so that it can be used in 9.3.  Such a marker
> would also allow us get rid of HEAP_MOVED_IN and HEAP_MOVED_OUT.

Getting rid of HEAP_MOVED_IN and HEAP_MOVED_OUT would be really nice,
but I don't see why we need to squeeze anything into 9.2 for any of
this.  pg_upgrade can certainly handle the addition of a new pg_class
column, and can arrange for in-place upgrades from pre-9.3 versions to
9.3 to set the flag to the appropriate value.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Heikki Linnakangas
Date:
On 29.02.2012 21:30, Robert Haas wrote:
> On Wed, Feb 29, 2012 at 2:18 PM, Alvaro Herrera
> <alvherre@commandprompt.com>  wrote:
>> Note that if we want such a utility to walk and transform pages, we
>> probably need a marker in the catalogs somewhere so that pg_upgrade can
>> make sure that it was done in all candidate tables -- which is something
>> that we should get in 9.2 so that it can be used in 9.3.  Such a marker
>> would also allow us get rid of HEAP_MOVED_IN and HEAP_MOVED_OUT.
>
> Getting rid of HEAP_MOVED_IN and HEAP_MOVED_OUT would be really nice,
> but I don't see why we need to squeeze anything into 9.2 for any of
> this.  pg_upgrade can certainly handle the addition of a new pg_class
> column, and can arrange for in-place upgrades from pre-9.3 versions to
> 9.3 to set the flag to the appropriate value.

The utility would run in the old cluster before upgrading, so the
flag would have to be present in the old version. pg_upgrade would check 
that the flag is set, refusing to upgrade if it isn't, with an error 
like "please run pre-upgrade utility first".

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: 16-bit page checksums for 9.2

From
Alvaro Herrera
Date:
Excerpts from Heikki Linnakangas's message of mié feb 29 16:29:26 -0300 2012:
> On 29.02.2012 21:18, Alvaro Herrera wrote:
> > Excerpts from Heikki Linnakangas's message of mié feb 29 16:09:02 -0300 2012:
> >> I thought my view on how this should be done was already clear, but just
> >> in case it isn't, let me restate: Enlarge the page header to make room
> >> for the checksum. To handle upgrades, put code in the backend to change
> >> the page format from old version to new one on-the-fly, as pages are
> >> read in. Because we're making the header larger, we need to ensure that
> >> there's room on every page. To do that, write a utility that you run on
> >> the cluster before running pg_upgrade, which moves tuples to ensure
> >> that. To ensure that the space doesn't get used again before upgrading,
> >> change the old version so that it reserves those N bytes in all new
> >> insertions and updates (I believe that approach has been discussed
> >> before and everyone is comfortable with backpatching such a change). All
> >> of this in 9.3.
> >
> > Note that if we want such a utility to walk and transform pages, we
> > probably need a marker in the catalogs somewhere so that pg_upgrade can
> > make sure that it was done in all candidate tables -- which is something
> > that we should get in 9.2 so that it can be used in 9.3.
>
> In the simplest form, the utility could just create a magic file in the
> data directory to indicate that it has run. All we need is a boolean
> flag, unless you want to be fancy and make the utility restartable.
> Implemented that way, there's no need to have anything in the catalogs.

Well, I find it likely that people with huge and/or high velocity
databases would want to get fancy about it, so that they can schedule it
as they see fit.

What I wouldn't like is a utility that is optional, so that we end up
with situations after upgrade that do have the new page format, others
that don't.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 29.02.2012 21:30, Robert Haas wrote:
>>
>> On Wed, Feb 29, 2012 at 2:18 PM, Alvaro Herrera
>> <alvherre@commandprompt.com>  wrote:
>>>
>>> Note that if we want such a utility to walk and transform pages, we
>>> probably need a marker in the catalogs somewhere so that pg_upgrade can
>>> make sure that it was done in all candidate tables -- which is something
>>> that we should get in 9.2 so that it can be used in 9.3.  Such a marker
>>> would also allow us get rid of HEAP_MOVED_IN and HEAP_MOVED_OUT.
>>
>>
>> Getting rid of HEAP_MOVED_IN and HEAP_MOVED_OUT would be really nice,
>> but I don't see why we need to squeeze anything into 9.2 for any of
>> this.  pg_upgrade can certainly handle the addition of a new pg_class
>> column, and can arrange for in-place upgrades from pre-9.3 versions to
>> 9.3 to set the flag to the appropriate value.
>
> The utility would run in the old cluster before upgrading, so the flag
> would have to be present in the old version. pg_upgrade would check that the
> flag is set, refusing to upgrade if it isn't, with an error like "please run
> pre-upgrade utility first".

I find that a pretty unappealing design; it seems to me it'd be much
easier to make the new cluster cope with everything.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> The utility would run in the old cluster before upgrading, so the flag
>> would have to be present in the old version. pg_upgrade would check that the
>> flag is set, refusing to upgrade if it isn't, with an error like "please run
>> pre-upgrade utility first".

> I find that a pretty unappealing design; it seems to me it'd be much
> easier to make the new cluster cope with everything.

Easier for who?  I don't care for the idea of code that has to cope with
two page formats, or before long N page formats, because if we don't
have some mechanism like this then we will never be able to decide that
an old data format is safely dead.
        regards, tom lane


Re: 16-bit page checksums for 9.2

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of mié feb 29 18:34:27 -0300 2012:
>
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas
> > <heikki.linnakangas@enterprisedb.com> wrote:
> >> The utility would run in the old cluster before upgrading, so the flag
> >> would have to be present in the old version. pg_upgrade would check that the
> >> flag is set, refusing to upgrade if it isn't, with an error like "please run
> >> pre-upgrade utility first".
>
> > I find that a pretty unappealing design; it seems to me it'd be much
> > easier to make the new cluster cope with everything.
>
> Easier for who?  I don't care for the idea of code that has to cope with
> two page formats, or before long N page formats, because if we don't
> have some mechanism like this then we will never be able to decide that
> an old data format is safely dead.

.. in fact this is precisely what killed Zdenek Kotala's idea of
upgrading.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Wed, Feb 29, 2012 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com> wrote:
>>> The utility would run in the old cluster before upgrading, so the flag
>>> would have to be present in the old version. pg_upgrade would check that the
>>> flag is set, refusing to upgrade if it isn't, with an error like "please run
>>> pre-upgrade utility first".
>
>> I find that a pretty unappealing design; it seems to me it'd be much
>> easier to make the new cluster cope with everything.
>
> Easier for who?  I don't care for the idea of code that has to cope with
> two page formats, or before long N page formats, because if we don't
> have some mechanism like this then we will never be able to decide that
> an old data format is safely dead.

Huh?  You can drop support for an old page format any time you like.
You just decree that version X+1 no longer supports it, and you can't
pg_upgrade to it until all traces of the old page format are gone.

If you're going to require an offline rewrite of the whole old cluster
before doing the upgrade, how much better is it than just breaking the
page format and telling pg_upgrade users they're out of luck?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Feb 29, 2012 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Easier for who?  I don't care for the idea of code that has to cope with
>> two page formats, or before long N page formats, because if we don't
>> have some mechanism like this then we will never be able to decide that
>> an old data format is safely dead.

> Huh?  You can drop support for an old page format any time you like.
> You just decree that version X+1 no longer supports it, and you can't
> pg_upgrade to it until all traces of the old page format are gone.

And how would a DBA know that?

> If you're going to require an offline rewrite of the whole old cluster
> before doing the upgrade, how much better is it than just breaking the
> page format and telling pg_upgrade users they're out of luck?

The difference is that people aren't playing russian roulette with their
data when they upgrade.  The point of the mechanisms being discussed
here is to know, for certain, that a large database no longer contains
pages of the old format.
        regards, tom lane


Re: 16-bit page checksums for 9.2

From
Jim Nasby
Date:
On 2/29/12 3:53 PM, Alvaro Herrera wrote:
>
> Excerpts from Tom Lane's message of mié feb 29 18:34:27 -0300 2012:
>>
>> Robert Haas<robertmhaas@gmail.com>  writes:
>>> On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas
>>> <heikki.linnakangas@enterprisedb.com>  wrote:
>>>> The utility would run in the old cluster before upgrading, so the the flag
>>>> would have to be present in the old version. pg_upgrade would check that the
>>>> flag is set, refusing to upgrade if it isn't, with an error like "please run
>>>> pre-upgrade utility first".
>>
>>> I find that a pretty unappealing design; it seems to me it'd be much
>>> easier to make the new cluster cope with everything.
>>
>> Easier for who?  I don't care for the idea of code that has to cope with
>> two page formats, or before long N page formats, because if we don't
>> have some mechanism like this then we will never be able to decide that
>> an old data format is safely dead.
>
> .. in fact this is precisely what killed Zdenek Kotala's idea of
> upgrading.

This is also why Simon has avoided the whole upgrade thing with his 16 bit
checksum idea (otherwise presumably we'd be looking at bigger checksums anyway).

I get that fussing around with the version field is ugly. If there was another
way to do this without breaking pg_upgrade then it would be silly to mess with
the version field. Unfortunately, there is no other way.

Page checksumming is something a lot of people (myself included) want to see.
Being able to get it in 9.2 would be a big win over crossing our fingers that
at some point in the future (who knows when) we'll maybe figure out the page
format upgrade issue. While we should definitely be looking into that issue,
it's definitely not going to get fixed in 9.2. ISTM that checksums are actually
ready to go if people can just swallow the bitter pill of a screwed-up page
version field until we can actually get an upgrade utility in place (and until
we get that utility in place I don't see that the version field would do us
any good anyway...)
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net


Re: 16-bit page checksums for 9.2

From
Tom Lane
Date:
Jim Nasby <jim@nasby.net> writes:
> On 2/29/12 3:53 PM, Alvaro Herrera wrote:
>> .. in fact this is precisely what killed Zdenek Kotala's idea of
>> upgrading.

> This is also why Simon has avoided the whole upgrade thing with his 16 bit
> checksum idea (otherwise presumably we'd be looking at bigger checksums anyway).

> I get that fussing around with the version field is ugly. If there was another
> way to do this without breaking pg_upgrade then it would be silly to mess with
> the version field. Unfortunately, there is no other way.

Fundamentally, what is going on here is that several of us think that we
need a page format upgrade infrastructure first, and then we can think
about adding checksums.  Simon would like to cram checksums in without
building such infrastructure, regardless of the consequences in code
ugliness or future maintainability.  Personally, I think that is a bad
tradeoff.  Eventually we are going to have to build that infrastructure,
and the more we've kluged the page header format in the meanwhile, the
more unpleasant it's going to be.
        regards, tom lane


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Wed, Feb 29, 2012 at 5:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Feb 29, 2012 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Easier for who?  I don't care for the idea of code that has to cope with
>>> two page formats, or before long N page formats, because if we don't
>>> have some mechanism like this then we will never be able to decide that
>>> an old data format is safely dead.
>
>> Huh?  You can drop support for a new page format any time you like.
>> You just decree that version X+1 no longer supports it, and you can't
>> pg_upgrade to it until all traces of the old page format are gone.
>
> And how would a DBA know that?

We'd add a column to pg_class that tracks which page version is in use
for each relation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Josh Berkus
Date:
>> And how would a DBA know that?
> 
> We'd add a column to pg_class that tracks which page version is in use
> for each relation.

So a relation can't have some pages in Version 9.2, and other pages in
version 9.3?  How will this work for 2TB tables?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Thu, Mar 1, 2012 at 12:42 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>> And how would a DBA know that?
>>
>> We'd add a column to pg_class that tracks which page version is in use
>> for each relation.
>
> So a relation can't have some pages in Version 9.2, and other pages in
> version 9.3?  How will this work for 2TB tables?

Not very well, but better than Tom's proposal to require upgrading the
entire cluster in a single off-line operation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Josh Berkus
Date:
>> So a relation can't have some pages in Version 9.2, and other pages in
>> version 9.3?  How will this work for 2TB tables?
> 
> Not very well, but better than Tom's proposal to require upgrading the
> entire cluster in a single off-line operation.

Yes, but the result will be that anyone with a 2TB table will *never*
convert it to the new format.  Which means we can never deprecate that
format, because lots of people will still be using it.

I continue to assert that all of this sounds like 9.3 work to me.  I'm
really not keen on pushing through a hack which will make pushing in a
long-term solution harder.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: 16-bit page checksums for 9.2

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
>> So a relation can't have some pages in Version 9.2, and other pages in
>> version 9.3?  How will this work for 2TB tables?

> Not very well, but better than Tom's proposal to require upgrading the
> entire cluster in a single off-line operation.

WTF?  That was most certainly not what *I* was proposing; it's obviously
unworkable.  We need a process that can incrementally up-version a live
database and keep track of the minimum version present, at some
granularity smaller than "whole database".

All of this was discussed and hashed out about two years ago, IIRC.
We just haven't made any progress towards actually implementing those
concepts.
        regards, tom lane


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Thu, Mar 1, 2012 at 4:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>>> So a relation can't have some pages in Version 9.2, and other pages in
>>> version 9.3?  How will this work for 2TB tables?
>
>> Not very well, but better than Tom's proposal to require upgrading the
>> entire cluster in a single off-line operation.
>
> WTF?  That was most certainly not what *I* was proposing; it's obviously
> unworkable.  We need a process that can incrementally up-version a live
> database and keep track of the minimum version present, at some
> granularity smaller than "whole database".
>
> All of this was discussed and hashed out about two years ago, IIRC.
> We just haven't made any progress towards actually implementing those
> concepts.

I am now officially confused.  I thought you and Heikki were arguing
about 24 hours ago that we needed to shut down the old database, run a
pre-upgrade utility to convert all the pages, run pg_upgrade, and then
bring the new database on-line; and that, further, you were arguing
that we should not support multiple page versions.  Now you seem to be
arguing the exact opposite - namely, that we should support multiple
page versions, and that the conversion should be done on-line.

So, to reiterate, I'm lost.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Alvaro Herrera
Date:
Excerpts from Robert Haas's message of jue mar 01 21:23:06 -0300 2012:
> On Thu, Mar 1, 2012 at 4:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >>> So a relation can't have some pages in Version 9.2, and other pages in
> >>> version 9.3?  How will this work for 2TB tables?
> >
> >> Not very well, but better than Tom's proposal to require upgrading the
> >> entire cluster in a single off-line operation.
> >
> > WTF?  That was most certainly not what *I* was proposing; it's obviously
> > unworkable.  We need a process that can incrementally up-version a live
> > database and keep track of the minimum version present, at some
> > granularity smaller than "whole database".
> >
> > All of this was discussed and hashed out about two years ago, IIRC.
> > We just haven't made any progress towards actually implementing those
> > concepts.
>
> I am now officially confused.  I thought you and Heikki were arguing
> about 24 hours ago that we needed to shut down the old database, run a
> pre-upgrade utility to convert all the pages, run pg_upgrade, and then
> bring the new database on-line;

Actually, nobody mentioned shutting the database down.  This utility
does not necessarily have to be something that runs independently from a
running postmaster; in fact what I envisioned was that we would have it
be run in a backend -- I've always imagined it's just a function to
which you give a table OID, and it converts pages of that table into the
new format.  Maybe a user-friendly utility would connect to each
database and call this function for each table.

> and that, further, you were arguing that we should not support
> multiple page versions.

I don't think we need to support multiple page versions, if multiple
means > 2.  Obviously we need the server to support reading *two* page
versions; though we can probably get away with having support for
*writing* only the newest page version.

(The pg_class column would mean "the oldest page version to be found in
this table", so the table can actually contain a mixture of old pages
and new pages.)

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: 16-bit page checksums for 9.2

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Robert Haas's message of jue mar 01 21:23:06 -0300 2012:
>> and that, further, you were arguing that we should not support
>> multiple page versions.

> I don't think we need to support multiple page versions, if multiple
> means > 2.

That's exactly the point here.  We clearly cannot support on-line
upgrade unless, somewhere along the line, we are willing to cope with
two page formats more or less concurrently.  What I don't want is for
that requirement to balloon to supporting N formats forever.  If we
do not have a mechanism that allows certifying that you have no
remaining pages of format N-1 before you upgrade to a server that
supports (only) versions N and N+1, then we're going to be in the
business of indefinite backwards compatibility instead.

I'm not entirely sure, but I think we may all be in violent agreement
about where this needs to end up.
        regards, tom lane


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Thu, Mar 1, 2012 at 8:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Excerpts from Robert Haas's message of jue mar 01 21:23:06 -0300 2012:
>>> and that, further, you were arguing that we should not support
>>> multiple page versions.
>
>> I don't think we need to support multiple page versions, if multiple
>> means > 2.
>
> That's exactly the point here.  We clearly cannot support on-line
> upgrade unless, somewhere along the line, we are willing to cope with
> two page formats more or less concurrently.  What I don't want is for
> that requirement to balloon to supporting N formats forever.  If we
> do not have a mechanism that allows certifying that you have no
> remaining pages of format N-1 before you upgrade to a server that
> supports (only) versions N and N+1, then we're going to be in the
> business of indefinite backwards compatibility instead.
>
> I'm not entirely sure, but I think we may all be in violent agreement
> about where this needs to end up.

I was using multiple to mean N>1, so yeah, it's sounding like we're
more or less on the same page here.

One thing I'm not too sure about is how to extend the page format to
handle optional features.  For example, suppose we want to add 2 bytes
to the page header for a checksum (or 4 bytes, or any other number).
Ideally, I'd like to not use up those 2 bytes on pages that aren't
checksummed.  What do we do about that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> One thing I'm not too sure about is how to extend the page format to
> handle optional features.  For example, suppose we want to add 2 bytes
> to the page header for a checksum (or 4 bytes, or any other number).
> Ideally, I'd like to not use up those 2 bytes on pages that aren't
> checksummed.  What do we do about that?

I'm having a hard time getting excited about adding that requirement on
top of all the others that we have for this feature.  In particular, if
we insist on that, there is exactly zero chance of turning checksums on
on-the-fly, because you won't be able to do it if the page is full.

The scheme that seems to me to make the most sense for checksums,
if we grant Simon's postulate that a 2-byte checksum is enough, is
(1) repurpose the top N bits of pd_flags as a page version number,
    for some N;
(2) repurpose pd_pagesize_version as the checksum, with the
    understanding that you can't have a checksum in version-0 pages.

(Simon's current patch seems to be an overengineered version of that.
Possibly we should also ask ourselves how much we really need pd_tli
and whether that couldn't be commandeered as the checksum field.)
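
[Editor's sketch: at the header level, the repurposing described above could
look roughly like this, assuming a single high bit of pd_flags (N = 1) marks
the new layout. The macro names and the bit chosen are invented for this
illustration; pd_flags and pd_pagesize_version are the real PageHeaderData
fields.]

    #include "postgres.h"
    #include "storage/bufpage.h"

    /*
     * Hypothetical: when this pd_flags bit is set, the page uses the new
     * layout and pd_pagesize_version holds a 16-bit checksum rather than the
     * page size and layout version.
     */
    #define PD_CHECKSUM_LAYOUT      0x8000

    #define PageUsesChecksumLayout(page) \
        ((((PageHeader) (page))->pd_flags & PD_CHECKSUM_LAYOUT) != 0)

    #define PageGetChecksum16(page) \
        (((PageHeader) (page))->pd_pagesize_version)

    #define PageSetChecksum16(page, sum) \
        (((PageHeader) (page))->pd_pagesize_version = (uint16) (sum))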

I see the page versioning stuff as mainly being of interest for changes
that are more invasive than this, for instance tuple-header or data
changes.
        regards, tom lane


Re: 16-bit page checksums for 9.2

From
Robert Haas
Date:
On Thu, Mar 1, 2012 at 9:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> One thing I'm not too sure about is how to extend the page format to
>> handle optional features.  For example, suppose we want to add 2 bytes
>> to the page header for a checksum (or 4 bytes, or any other number).
>> Ideally, I'd like to not use up those 2 bytes on pages that aren't
>> checksummed.  What do we do about that?
>
> I'm having a hard time getting excited about adding that requirement on
> top of all the others that we have for this feature.  In particular, if
> we insist on that, there is exactly zero chance of turning checksums on
> on-the-fly, because you won't be able to do it if the page is full.
>
> The scheme that seems to me to make the most sense for checksums,
> if we grant Simon's postulate that a 2-byte checksum is enough, is
> (1) repurpose the top N bits of pd_flags as a page version number,
>    for some N;
> (2) repurpose pd_pagesize_version as the checksum, with the
>    understanding that you can't have a checksum in version-0 pages.
>
> (Simon's current patch seems to be an overengineered version of that.
> Possibly we should also ask ourselves how much we really need pd_tli
> and whether that couldn't be commandeered as the checksum field.)
>
> I see the page versioning stuff as mainly being of interest for changes
> that are more invasive than this, for instance tuple-header or data
> changes.

Well, OK, so...  it wouldn't bother me a bit to steal pd_tli for this,
although Simon previously objected to stealing even half of it, when I
suggested that upthread.

But I don't see the point of your first proposal: keeping the page
version right where it is is a good idea, and we should do it.  We
could steal some *high* order bits from that field without breaking
anything, but moving it around seems like it will complicate life to
no good purpose.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 16-bit page checksums for 9.2

From
Peter Geoghegan
Date:
On 23 February 2012 01:39, Robert Haas <robertmhaas@gmail.com> wrote:
> Thanks for testing this.  The graph obscures a bit how much percentage
> change we're talking about here - could you post the raw tps numbers?

Sorry for not following up on this until now. The report is available from:

http://pagechecksum.staticloud.com/

Note that detailed latency figures for each test are not available in
this report, so only the main page is available, but that gets you the
raw tps numbers.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


Re: 16-bit page checksums for 9.2

From
Simon Riggs
Date:
On Fri, Mar 2, 2012 at 2:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> One thing I'm not too sure about is how to extend the page format to
>> handle optional features.  For example, suppose we want to add 2 bytes
>> to the page header for a checksum (or 4 bytes, or any other number).
>> Ideally, I'd like to not use up those 2 bytes on pages that aren't
>> checksummed.  What do we do about that?
>
> I'm having a hard time getting excited about adding that requirement on
> top of all the others that we have for this feature.  In particular, if
> we insist on that, there is exactly zero chance of turning checksums on
> on-the-fly, because you won't be able to do it if the page is full.
>
> The scheme that seems to me to make the most sense for checksums,
> if we grant Simon's postulate that a 2-byte checksum is enough, is
> (1) repurpose the top N bits of pd_flags as a page version number,
>    for some N;
> (2) repurpose pd_pagesize_version as the checksum, with the
>    understanding that you can't have a checksum in version-0 pages.

I'd say N = 1, for now. N > 1 in future, as needed. We can document the
intention to reserve the high-order bits for that purpose. We can defer the
decision about which bits are used until they are needed.


> (Simon's current patch seems to be an overengineered version of that.

Happy to make changes. Having 3 bits instead of 1 is just for robustness,
but I would accept the opinion that this is not worth having.

One suggestion from Heikki is that we don't change the bits at all;
we just have a database level setting that says whether checksums are
enabled. We use VACUUM FREEZE to enable checksumming, rewrite the blocks
and then enable at db level. Once fully enabled we check the checksums
on read. If we do that, we don't need to record that the page format
has changed and we could dispense with page-level flags entirely, but
of course that then means you can't inspect a block and know whether it's
actually got a checksum on it or maybe it's just a version field.

If we want to know the difference between page formats we need to use
flag bits to indicate that. Which in some ways could be called fragile.

Personally, don't mind what we do there.


> Possibly we should also ask ourselves how much we really need pd_tli
> and whether that couldn't be commandeered as the checksum field.)

pd_tli has slightly more utility than pd_pagesize_version, but we could
easily live without either.

The question is whether a 32-bit value, split into 2 pieces, is a
faster and better way than the current proposal. Splitting the value in two
doesn't sound like it would lead to good performance or sensible code,
but someone may have a reason why 32-bit checksums are important.

IMHO we have a viable mechanism for checksums, its all down to what user
interface we would prefer and related details.

I'm happy to implement whatever we decide.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: 16-bit page checksums for 9.2

From
Jeff Janes
Date:
On Sun, Feb 19, 2012 at 1:49 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
>> v8 attached
>
> v10 attached.
>
> This patch covers all the valid concerns discussed and has been
> extensively tested.

If I turn wal_level to anything other than minimal, this fails to pass
make installcheck with multiple instances of

ERROR:  invalid page header in block 0 of relation....

I've looked around in the patch docs and this thread, and this doesn't
seem to be either expected or already known.

This occurs even when page_checksums is off.

Cheers,

Jeff


Re: 16-bit page checksums for 9.2

From
Jeff Davis
Date:
On Sun, 2012-02-19 at 21:49 +0000, Simon Riggs wrote:
> On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> > v8 attached
>
> v10 attached.
>
> This patch covers all the valid concerns discussed and has been
> extensively tested.
>
> I expect to commit this in about 24 hours, so last call for comments
> please or say if you need more time to review.

I have made an attempt to merge this with master. This patch is not
meant to be the next version, and I have not made a real effort to be
sure this was a safe merge. I am just hoping this saves Simon some of
the tedious portions of the merge, and offers a reference point to see
where his merge differs from mine.

In the meantime, I'm looking at the original v10 patch against master as
it existed when Simon posted the patch.

I'd like to see checksums take part in the June commitfest.

Regards,
    Jeff Davis

Attachment

Re: 16-bit page checksums for 9.2

From
Jeff Davis
Date:
On Sun, 2012-02-19 at 21:49 +0000, Simon Riggs wrote:
> On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> 
> > v8 attached
> 
> v10 attached.
> 
> This patch covers all the valid concerns discussed and has been
> extensively tested.

Is there something I can do to help get this ready for the next
commitfest? I am willing to rebase it, but I was worried that might
cause confusion. I am also willing to review it after the rebase, of
course.

There are still a couple open issues, including:

* Store the checksum in the page version field or the TLI field?

* What mechanism to guarantee to the user that all pages are properly
protected by checksums (rather than just new pages)? In other words,
there are more than the two states represented by the GUC.

* What specific data is included in the checksum? I suggested that we
include the block number, and maybe the OID.
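
[Editor's sketch: a standalone illustration of a Fletcher-style 16-bit page
checksum that folds in the block number, along the lines of the last point
above. The constants, the way the block number is mixed in, and the handling
of the on-page checksum field are assumptions for illustration, not the code
from the posted patch; a real implementation would also defer the modulo
operations for speed.]

    #include <stdint.h>
    #include <stddef.h>

    static uint16_t
    page_checksum16(const uint8_t *page, size_t len, uint32_t blkno)
    {
        uint32_t    sum1 = 0xFF;
        uint32_t    sum2 = 0xFF;
        size_t      i;

        /* Mix in the block number first, so identical page images stored at
         * different block numbers produce different checksums. */
        for (i = 0; i < 4; i++)
        {
            sum1 = (sum1 + ((blkno >> (8 * i)) & 0xFF)) % 255;
            sum2 = (sum2 + sum1) % 255;
        }

        /* Fletcher-16 over the page contents; the caller is assumed to zero
         * or skip the on-page checksum field before calling this. */
        for (i = 0; i < len; i++)
        {
            sum1 = (sum1 + page[i]) % 255;
            sum2 = (sum2 + sum1) % 255;
        }

        return (uint16_t) ((sum2 << 8) | sum1);
    }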

Regards,
    Jeff Davis