Thread: Block-level CRC checks

Block-level CRC checks

From
Alvaro Herrera
Date:
A customer of ours has been having trouble with corrupted data for some
time.  Of course, we've almost always blamed hardware (and we've seen
RAID controllers have their firmware upgraded, among other actions), but
the useful thing to know is when corruption has happened, and where.

So we've been tasked with adding CRCs to data files.

The idea is that these CRCs are going to be checked just after reading
files from disk, and calculated just before writing them.  They are
just a protection against the storage layer going mad; they are not
intended to protect against faulty RAM, CPU or kernel.

This code would be run-time or compile-time configurable.  I'm not
absolutely sure which yet; the problem with run-time is what to do if
the user restarts the server with the setting flipped.  It would have
almost no impact on users who don't enable it.

The implementation I'm envisioning requires the use of a new relation
fork to store the per-block CRCs.  Initially I'm aiming at a CRC32 sum
for each block.  FlushBuffer would calculate the checksum and store it
in the CRC fork; ReadBuffer_common would read the page, calculate the
checksum, and compare it to the one stored in the CRC fork.

A buffer's io_in_progress lock protects the buffer's CRC.  We read and
pin the CRC page before acquiring the lock, to avoid having two buffer
IO operations in flight.

I'd like to submit this for 8.4, but I want to ensure that -hackers at
large approve of this feature before starting serious coding.

Opinions?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Tue, Sep 30, 2008 at 2:02 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> A customer of ours has been having trouble with corrupted data for some
> time.  Of course, we've almost always blamed hardware (and we've seen
> RAID controllers have their firmware upgraded, among other actions), but
> the useful thing to know is when corruption has happened, and where.

Agreed.

> So we've been tasked with adding CRCs to data files.

Awesome.

> The idea is that these CRCs are going to be checked just after reading
> files from disk, and calculated just before writing them.  They are
> just a protection against the storage layer going mad; they are not
> intended to protect against faulty RAM, CPU or kernel.

This is the common case.

> This code would be run-time or compile-time configurable.  I'm not
> absolutely sure which yet; the problem with run-time is what to do if
> the user restarts the server with the setting flipped.  It would have
> almost no impact on users who don't enable it.

I've supported this forever!

> The implementation I'm envisioning requires the use of a new relation
> fork to store the per-block CRCs.  Initially I'm aiming at a CRC32 sum
> for each block.  FlushBuffer would calculate the checksum and store it
> in the CRC fork; ReadBuffer_common would read the page, calculate the
> checksum, and compare it to the one stored in the CRC fork.
>
> A buffer's io_in_progress lock protects the buffer's CRC.  We read and
> pin the CRC page before acquiring the lock, to avoid having two buffer
> IO operations in flight.

If the CRC gets written before the block, how is recovery going to
handle it?  I'm not too familiar with the new forks stuff, but
recovery will pull the old block, compare it against the checksum, and
consider the block invalid, correct?

> I'd like to submit this for 8.4, but I want to ensure that -hackers at
> large approve of this feature before starting serious coding.

IMHO, this is a functionality that should be enabled by default (as it
is on most other RDBMS).  It would've prevented severe corruption in
the 20 or so databases I've had to fix, and other than making it
optional, I don't see the reasoning for a separate relation fork
rather than storing it directly on the block (as everyone else does).
Similarly, I think Greg Stark was playing with a patch for it
(http://archives.postgresql.org/pgsql-hackers/2007-02/msg01850.php).

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> The implementation I'm envisioning requires the use of a new relation
> fork to store the per-block CRCs.

That seems bizarre, and expensive, and if you lose one block of the CRC
fork you lose confidence in a LOT of data.  Why not keep the CRCs in the
page headers?

> A buffer's io_in_progress lock protects the buffer's CRC.

Unfortunately, it doesn't.  See hint bits.
        regards, tom lane


Re: Block-level CRC checks

From
Heikki Linnakangas
Date:
Alvaro Herrera wrote:
> A customer of ours has been having trouble with corrupted data for some
> time.  Of course, we've almost always blamed hardware (and we've seen
> RAID controllers have their firmware upgraded, among other actions), but
> the useful thing to know is when corruption has happened, and where.
> 
> So we've been tasked with adding CRCs to data files.
> 
> The idea is that these CRCs are going to be checked just after reading
> files from disk, and calculated just before writing them.  They are
> just a protection against the storage layer going mad; they are not
> intended to protect against faulty RAM, CPU or kernel.

This has been suggested before, and the usual objection is precisely 
that it only protects from errors in the storage layer, giving a false 
sense of security.

Don't some filesystems include a per-block CRC, which would achieve 
the same thing? ZFS?

> This code would be run-time or compile-time configurable.  I'm not
> absolutely sure which yet; the problem with run-time is what to do if
> the user restarts the server with the setting flipped.  It would have
> almost no impact on users who don't enable it.

Yeah, seems like it would need to be compile-time or initdb-time 
configurable.

> The implementation I'm envisioning requires the use of a new relation
> fork to store the per-block CRCs.  Initially I'm aiming at a CRC32 sum
> for each block.  FlushBuffer would calculate the checksum and store it
> in the CRC fork; ReadBuffer_common would read the page, calculate the
> checksum, and compare it to the one stored in the CRC fork.

Surely it would be much simpler to just add a field to the page header.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Block-level CRC checks

From
Joshua Drake
Date:
On Tue, 30 Sep 2008 14:33:04 -0400
"Jonah H. Harris" <jonah.harris@gmail.com> wrote:

> > I'd like to submit this for 8.4, but I want to ensure that -hackers
> > at large approve of this feature before starting serious coding.
> 
> IMHO, this is a functionality that should be enabled by default (as it
> is on most other RDBMS).  It would've prevented severe corruption in

What other RDBMS have it enabled by default?

Sincerely,

Joshua D. Drake



-- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/




Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Tue, Sep 30, 2008 at 2:49 PM, Joshua Drake <jd@commandprompt.com> wrote:
> On Tue, 30 Sep 2008 14:33:04 -0400
> "Jonah H. Harris" <jonah.harris@gmail.com> wrote:
>
>> > I'd like to submit this for 8.4, but I want to ensure that -hackers
>> > at large approve of this feature before starting serious coding.
>>
>> IMHO, this is a functionality that should be enabled by default (as it
>> is on most other RDBMS).  It would've prevented severe corruption in
>
> What other RDBMS have it enabled by default?

Oracle and (I believe) SQL Server >= 2005

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Markus Wanner
Date:
Hello Alvaro,

some random thoughts while reading your proposal follow...

Alvaro Herrera wrote:
> So we've been tasked with adding CRCs to data files.

Disks get larger and relative reliability shrinks, it seems. So I agree
that this is a worthwhile thing to have. But shouldn't that be the job
of the filesystem? Think of ZFS or the upcoming BTRFS.

> The idea is that these CRCs are going to be checked just after reading
> files from disk, and calculated just before writing them.  They are
> just a protection against the storage layer going mad; they are not
> intended to protect against faulty RAM, CPU or kernel.

That sounds reasonable if we do it from Postgres.

> This code would be run-time or compile-time configurable.  I'm not
> absolutely sure which yet; the problem with run-time is what to do if
> the user restarts the server with the setting flipped.  It would have
> almost no impact on users who don't enable it.

I'd say calculating a CRC is close enough to be considered "no impact".
A single core of a modern CPU easily reaches way above 200 MiB/s
throughput for CRC32 today. See [1].

Maybe consider Adler-32, which is 3-4x faster [2], is also part of zlib,
and AFAIK is about equally safe for 8k blocks and above.

> The implementation I'm envisioning requires the use of a new relation
> fork to store the per-block CRCs.  Initially I'm aiming at a CRC32 sum
> for each block.  FlushBuffer would calculate the checksum and store it
> in the CRC fork; ReadBuffer_common would read the page, calculate the
> checksum, and compare it to the one stored in the CRC fork.

Huh? Aren't CRCs normally stored as part of the block they are supposed
to protect? Or how do you expect to ensure the data from the CRC
relation fork is correct? How about crash safety (a data block written,
but not its CRC block or vice versa)?

Wouldn't that double the amount of seeking required for writes?

> I'd like to submit this for 8.4, but I want to ensure that -hackers at
> large approve of this feature before starting serious coding.

Very cool!

Regards

Markus Wanner

[1]: Crypto++ benchmarks:
http://www.cryptopp.com/benchmarks.html

[2]: Wikipedia about hash functions:
http://en.wikipedia.org/wiki/List_of_hash_functions#Computational_costs_of_CRCs_vs_Hashes


Re: Block-level CRC checks

From
Heikki Linnakangas
Date:
Alvaro Herrera wrote:
> Initially I'm aiming at a CRC32 sum
> for each block.  FlushBuffer would calculate the checksum and store it
> in the CRC fork; ReadBuffer_common would read the page, calculate the
> checksum, and compare it to the one stored in the CRC fork.

There's one fundamental problem with that, related to the way our hint 
bits are written.

Currently, hint bit updates are not WAL-logged, and thus no full page 
write is done when only hint bits are changed. Imagine what happens if 
hint bits are updated on a page, but there's no other changes, and we 
crash so that only one half of the new page version makes it to disk (= 
torn page). The CRC would not match, even though the page is actually valid.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Block-level CRC checks

From
Greg Smith
Date:
On Tue, 30 Sep 2008, Heikki Linnakangas wrote:

> Don't some filesystems include a per-block CRC, which would achieve the 
> same thing? ZFS?

Yes, there is a popular advocacy piece for ZFS with a high-level view of 
why and how they implement that at 
http://blogs.sun.com/bonwick/entry/zfs_end_to_end_data
The guarantees are stronger than what you can get if you just put a CRC 
in the block itself. 
I'd never really thought too hard about putting this in the database 
knowing that ZFS is available for environments where this is a concern, 
but it certainly would be a nice addition.

The best analysis I've ever seen that makes a case for OS or higher level 
disk checksums of some sort, by looking at the myriad ways that disks and 
disk arrays fail in the real world, is in 
http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf 
(there is a shorter version that hits the high points of that at 
http://www.usenix.org/publications/login/2008-06/openpdfs/bairavasundaram.pdf 
)

One really interesting bit in there I'd never seen before is that they 
find real data that supports the stance that enterprise drives are 
significantly more reliable than consumer ones.  While general failure 
rates aren't that different, "SATA disks have an order of magnitude higher 
probability of developing checksum mismatches than Fibre Channel disks. We 
find that 0.66% of SATA disks develop at least one mismatch during the 
first 17 months in the field, whereas only 0.06% of Fibre Channel disks 
develop a mismatch during that time."

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Block-level CRC checks

From
pgsql@mohawksoft.com
Date:
> A customer of ours has been having trouble with corrupted data for some
> time.  Of course, we've almost always blamed hardware (and we've seen
> RAID controllers have their firmware upgraded, among other actions), but
> the useful thing to know is when corruption has happened, and where.

That is an important statement: the goal is to know when corruption has
happened, not necessarily to be able to recover the block or pinpoint
where in the block it is corrupt. Is that correct?

>
> So we've been tasked with adding CRCs to data files.

CRC or checksum? If the objective is merely general "detection" there
should be some latitude in choosing the methodology for performance.

>
> The idea is that these CRCs are going to be checked just after reading
> files from disk, and calculated just before writing them.  They are
> just a protection against the storage layer going mad; they are not
> intended to protect against faulty RAM, CPU or kernel.

It will actually find faults in all of it. If the CPU can't add and/or a
RAM location lost a bit, this will blow up just as easily as a bad block.
It may cause "false identification" of an error, but it will keep a bad
system from hiding.

>
> This code would be run-time or compile-time configurable.  I'm not
> absolutely sure which yet; the problem with run-time is what to do if
> the user restarts the server with the setting flipped.  It would have
> almost no impact on users who don't enable it.

CPU capacity on modern hardware, working within a small area of RAM, is
practically infinite when compared to any sort of I/O.
>
> The implementation I'm envisioning requires the use of a new relation
> fork to store the per-block CRCs.  Initially I'm aiming at a CRC32 sum
> for each block.  FlushBuffer would calculate the checksum and store it
> in the CRC fork; ReadBuffer_common would read the page, calculate the
> checksum, and compare it to the one stored in the CRC fork.

Hell, all that is needed is a long or a short checksum value in the block.
I mean, if you just want a sanity test, it doesn't take much. Using a
second relation creates confusion. If there is a CRC discrepancy between
two different blocks, who's wrong? You need a third "control" to know. If
the block knows its CRC or checksum and that is in error, the block is
bad.

>
> A buffer's io_in_progress lock protects the buffer's CRC.  We read and
> pin the CRC page before acquiring the lock, to avoid having two buffer
> IO operations in flight.
>
> I'd like to submit this for 8.4, but I want to ensure that -hackers at
> large approve of this feature before starting serious coding.
>
> Opinions?

If its fast enough, its a good idea. It could be very helpful in
protecting users data.

>
> --
> Alvaro Herrera
> http://www.CommandPrompt.com/
> PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: Block-level CRC checks

From
Bruce Momjian
Date:
Alvaro Herrera wrote:
> A customer of ours has been having trouble with corrupted data for some
> time.  Of course, we've almost always blamed hardware (and we've seen
> RAID controllers have their firmware upgraded, among other actions), but
> the useful thing to know is when corruption has happened, and where.
> 
> So we've been tasked with adding CRCs to data files.

Maybe a stupid question, but what I/O subsystems corrupt data and fail
to report it?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
"Jeffrey Baker"
Date:
On Tue, Sep 30, 2008 at 1:41 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Alvaro Herrera wrote:
>> A customer of ours has been having trouble with corrupted data for some
>> time.  Of course, we've almost always blamed hardware (and we've seen
>> RAID controllers have their firmware upgraded, among other actions), but
>> the useful thing to know is when corruption has happened, and where.
>>
>> So we've been tasked with adding CRCs to data files.
>
> Maybe a stupid question, but what I/O subsystems corrupt data and fail
> to report it?

Practically all of them.  Here is a good paper on various checksums, their failure rates, and practical applications.

"Parity Lost and Parity Regained"
http://www.usenix.org/event/fast08/tech/full_papers/krioukov/krioukov_html/index.html

-jwb

Re: Block-level CRC checks

From
Decibel!
Date:
On Sep 30, 2008, at 2:17 PM, pgsql@mohawksoft.com wrote:
>> A customer of ours has been having trouble with corrupted data for  
>> some
>> time.  Of course, we've almost always blamed hardware (and we've seen
>> RAID controllers have their firmware upgraded, among other  
>> actions), but
>> the useful thing to know is when corruption has happened, and where.
>
> That is an important statement, to know when it happens not  
> necessarily to
> be able to recover the block or where in the block it is corrupt.  
> Is that
> correct?

Oh, correcting the corruption would be AWESOME beyond belief! But at  
this point I'd settle for just knowing it had happened.

>> So we've been tasked with adding CRCs to data files.
>
> CRC or checksum? If the objective is merely general "detection" there
> should be some latitude in choosing the methodology for performance.

See above. Perhaps the best win would be a case where you could  
choose which method you wanted. We generally have extra CPU on the  
servers, so we could afford to burn some cycles with more complex  
algorithms.

>> The idea is that these CRCs are going to be checked just after  
>> reading
>> files from disk, and calculated just before writing them.  They are
>> just a protection against the storage layer going mad; they are not
>> intended to protect against faulty RAM, CPU or kernel.
>
> It will actually find faults in all of it. If the CPU can't add and/ 
> or a
> RAM location lost a bit, this will blow up just as easily as a bad  
> block.
> It may cause "false identification" of an error, but it will keep a  
> bad
> system from hiding.

Well, very likely not, since the intention is to only compute the CRC  
when we write the block out, at least for now. In the future I would  
like to be able to detect when a CPU or memory goes bonkers and poops  
on something, because that's actually happened to us as well.

>> The implementation I'm envisioning requires the use of a new relation
>> fork to store the per-block CRCs.  Initially I'm aiming at a CRC32  
>> sum
>> for each block.  FlushBuffer would calculate the checksum and  
>> store it
>> in the CRC fork; ReadBuffer_common would read the page, calculate the
>> checksum, and compare it to the one stored in the CRC fork.
>
> Hell, all that is needed is a long or a short checksum value in the  
> block.
> I mean, if you just want a sanity test, it doesn't take much. Using a
> second relation creates confusion. If there is a CRC discrepancy  
> between
> two different blocks, who's wrong? You need a third "control" to  
> know. If
> the block knows its CRC or checksum and that is in error, the block is
> bad.

I believe the idea was to make this as non-invasive as possible. And  
it would be really nice if this could be enabled without a dump/ 
reload (maybe the upgrade stuff would make this possible?)
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Re: Block-level CRC checks

From
Joshua Drake
Date:
On Tue, 30 Sep 2008 13:48:52 -0700
"Jeffrey Baker" <jwbaker@gmail.com> wrote:


> 
> Practically all of them.  Here is a good paper on various checksums,
> their failure rates, and practical applications.
> 
> "Parity Lost and Parity Regained"
> http://www.usenix.org/event/fast08/tech/full_papers/krioukov/krioukov_html/index.html
> 

In a related article published in Login called "Data Corruption in the
storage stack: a closer look" they say:

During a 41-month period we observed more than 400,000 instances of
checksum mismatches, 8% of which were discovered during RAID
reconstruction, creating the possibility of real data loss.

They also have a wonderful term they mention, "Silent Data corruptions".

Joshua D. Drake

[1] Login June 2008

> -jwb


-- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/




Re: Block-level CRC checks

From
Decibel!
Date:
On Sep 30, 2008, at 1:48 PM, Heikki Linnakangas wrote:
> This has been suggested before, and the usual objection is  
> precisely that it only protects from errors in the storage layer,  
> giving a false sense of security.

If you can come up with a mechanism for detecting non-storage errors  
as well, I'm all ears. :)

In the meantime, you're way, way more likely to experience corruption  
at the storage layer than anywhere else. We've had several corruption  
events, only one of which was memory related... and we *know* it was  
memory related because we actually got logs saying so. But with a SAN  
environment there's a lot of moving parts, all waiting to screw up  
your data:

filesystem
SAN device driver
SAN network
SAN BIOS
drive BIOS
drive

That's on top of the things that could hose your data outside of storage:
kernel
CPU
memory
motherboard

>> Don't some filesystems include a per-block CRC, which would  
> achieve the same thing? ZFS?


Sure, some do. We're on linux and can't run ZFS. And I'll argue that  
no linux FS is anywhere near as tested as ext3 is, which means that  
going to some other FS that offers you CRC means you're now exposing  
yourself to the possibility of issues with the FS itself. Not to  
mention that changing filesystems on a large production system is  
very painful.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Re: Block-level CRC checks

From
pgsql@mohawksoft.com
Date:
>
> I believe the idea was to make this as non-invasive as possible. And
> it would be really nice if this could be enabled without a dump/
> reload (maybe the upgrade stuff would make this possible?)
> --

It's all about the probability of a duplicate check being generated. If
you use a 32 bit checksum, then you have a theoretical probability of 1 in
4 billion that a corrupt block will be missed (probably much lower
depending on your algorithm). If you use a short, then a 1 in 65 thousand
probability. If you use an 8 bit number, then 1 in 256.

Why am I going on? Well, if there are any spare bits in a block header,
they could be used for the check value.


Re: Block-level CRC checks

From
Greg Stark
Date:

On 30 Sep 2008, at 10:17 PM, Decibel! <decibel@decibel.org> wrote:

> On Sep 30, 2008, at 1:48 PM, Heikki Linnakangas wrote:
>> This has been suggested before, and the usual objection is  
>> precisely that it only protects from errors in the storage layer,  
>> giving a false sense of security.
>
> If you can come up with a mechanism for detecting non-storage errors  
> as well, I'm all ears. :)
>
> In the meantime, you're way, way more likely to experience  
> corruption at the storage layer than anywhere else.

Fwiw this hasn't been my experience. Bad memory is extremely common  
and even the storage failures I've seen (excluding the drive crashes)  
turned out to actually be caused by bad memory.

That said I've always been interested in doing this. The main use case  
in my mind has actually been for data that's been restored from old  
backups which have been lying round and floating between machines for  
a while with many opportunities for bit errors to show up.


The main stumbling block I ran into was how to deal with turning the  
option off and on. I wanted it to be possible to turn the option off,  
having the database ignore any errors, and to avoid the overhead.

But that means including an escape-hatch value which is always  
considered to be correct, and that dramatically reduces the  
effectiveness of the scheme.

Another issue is that it will make the space available on each page  
smaller, making it harder to do in-place upgrades.


If you can deal with those issues and carefully deal with the  
contingencies, so it's clear to people what to do when errors occur or  
they want to turn the feature on or off, then I'm all for it. That  
despite my experience of memory errors being a lot more common than  
undetected storage errors. 


Re: Block-level CRC checks

From
Andrew Chernow
Date:
Joshua Drake wrote:
> During a 41-month period we observed more than 400,000 instances of
> checksum mismatches, 8% of which were discovered during RAID
> reconstruction, creating the possibility of real data loss.
> 
> They also have a wonderful term they mention, "Silent Data corruptions".
> 
> 
> 

Exactly!
From my experience, the only assumption to be made about storage is that it can 
and will fail ... frequently!  It is unreliable (not to mention slooow) and 
should not be trusted; regardless of the price tag or brand.

This could help detect:

- fs corruption
- vfs bug
- raid controller firmware bug
- bad disk sector
- power crash
- weird martian-like raid rebuilds

Although, this idea won't prevent anything.  Everything would still sinisterly 
fail on you.  The difference is, no more silence.

-- 
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/


Re: Block-level CRC checks

From
Paul Schlie
Date:
If you are concerned with data integrity (not caused by bugs in the code
itself), you may be interested in utilizing ZFS; however, be aware that I
found and reported a bug in their implementation of the Fletcher checksum
algorithm they use by default to verify the integrity of the data stored
in their file system. Be aware, too, that checksums/CRCs do not enable the
correction of errors in general, so be prepared to decide "what should be
done in the event of a failure"; ZFS effectively locks up in certain
circumstances rather than risk silently using suspect data without some
form of persistent indication that the result may be corrupted. (Strong
CRCs and FECs are relatively inexpensive to compute.)

So in summary, my two cents: a properly implemented 32/64-bit Fletcher
checksum is likely adequate to detect most errors (and to correct them if
presumed to be the result of a single flipped bit within 128KB or so, as
such a Fletcher checksum has a Hamming distance of 3 within blocks of this
size, albeit fairly expensive to correct by trial and error). Further
presuming that this cannot be relied upon, a strategy potentially
utilizing the suspect data as if it were good likely needs to be adopted,
accompanied somehow by a persistent indication that the query results (or
specific sub-results) are themselves suspect, as that may often be a
lesser evil than the alternative (but not always). Or use a file system
like ZFS, let it do its thing, and hope for the best.





Re: Block-level CRC checks

From
"Joshua D. Drake"
Date:
Paul Schlie wrote:
> this can not be relied upon, a strategy potentially utilizing the suspect
> data as if it were good likely needs to be adopted, accompanied somehow with
> a persistent indication that the query results (or specific sub-results) are
> themselves suspect, as it may often be a lesser evil than the alternative
> (but not always). Or use a file system like ZFS, and let it do its thing,
> and hope for the best.

Problem is, most people are running PostgreSQL on one of two operating 
systems. Linux or Win32. (/me has been amazed at how many people are 
running win32)


ZFS is not an option; generally speaking.

Joshua D. Drake




Re: Block-level CRC checks

From
Paul Schlie
Date:
> Joshua D. Drake wrote:
> ...
> ZFS is not an option; generally speaking.

Then in general, if the corruption occurred within the:

- read path, try again and hope it takes care of itself.

- write path, the best that can be hoped for is a single bit error within
the data itself, which can be both detected and corrected with a
sufficiently strong checksum; or, worst case, if address or control
information was corrupted, god knows what happened to the data, and what
other data may have been destroyed by having the data written to the
wrong blocks, and it is typically unrecoverable.

- drive itself, this is most typically very unlikely, as strong FEC codes
typically prevent the misidentification of unrecoverable data as being
otherwise.

The simplest thing to do would seem to be: upon reading blocks, check
the checksum; if bad, try the read again; if that doesn't fix the
problem, assume a single bit error, and iteratively flip single bits
until the checksum matches (hopefully not making the problem worse, as
may be the case if many bits were actually already in error), write the
data back, and proceed as normal, possibly logging the action; otherwise
presume the data is unrecoverable and in error, and somehow mark it as
being so, such that subsequent queries which may utilize any portion of
it know it may be corrupt (which I suspect is best done not on
file-system blocks, but actually on logical rows or even individual
entries if very large, as my best initial guess; it is likely to
measurably affect performance when enabled, and I haven't a clue how the
resulting query should/could be identified as being potentially corrupt
without confusing the client which requested it).




Re: Block-level CRC checks

From
"Albe Laurenz"
Date:
Jonah H. Harris wrote:
>>>> I'd like to submit this for 8.4, but I want to ensure that -hackers
>>>> at large approve of this feature before starting serious coding.
>>>
>>> IMHO, this is a functionality that should be enabled by default (as it
>>> is on most other RDBMS).  It would've prevented severe corruption in
>>
>> What other RDBMS have it enabled by default?
>
> Oracle and (I believe) SQL Server >= 2005

References:
http://download-uk.oracle.com/docs/cd/B19306_01/server.102/b14237/initparams040.htm#CHDDCEIC
http://download.oracle.com/docs/cd/B28359_01/server.111/b28320/initparams046.htm#sthref130

Oracle claims that it introduces only 1% to 2% overhead.

Actually I suggested the same feature for 8.3:
http://archives.postgresql.org/pgsql-hackers/2007-08/msg01003.php

It was eventually rejected because the majority felt that it
would be a small benefit (only detects disk corruption and not
software bugs) that would not justify the overhead and the
additional code.

Incidentally, Oracle also has a parameter DB_BLOCK_CHECKING
that checks blocks for logical consistency. This is OFF by default,
but Oracle recommends that you activate it if you can live with
the performance impact.

Yours,
Laurenz Albe


Re: Block-level CRC checks

From
Zdenek Kotala
Date:
Alvaro Herrera wrote:

> This code would be run-time or compile-time configurable.  I'm not
> absolutely sure which yet; the problem with run-time is what to do if
> the user restarts the server with the setting flipped.  It would have
> almost no impact on users who don't enable it.

I prefer runtime configuration. Solaris has two filesystems, UFS and ZFS. 
ZFS offers strong error detection, so another CRC is overhead. You 
need a mechanism to enable this feature on UFS and disable it on ZFS. I 
suggest having it as a tablespace property; for the default tablespace 
you can set it up in the initdb phase.

    Zdenek


Re: Block-level CRC checks

From
"Harald Armin Massa"
Date:
CRC-checks will help to detect corrupt data.

my question:

WHAT should happen when corrupted data is detected?

a) PostgreSQL can end with some panic code
b) a log entry can be written, at some rather high level

a) has the benefit that it surely will be noticed, which is a real
benefit, as I suppose that many users of PostgreSQL on the low end do
not have dedicated DBA-admins who read the *.log every day

b) has the benefit that work goes on


BUT:
What can somebody do AFTER PostgreSQL has detected "data is corrupted,
CRC error in block xxxx" ?

My first course of action would be: get a new drive, pg_dumpall to a
safe place, install the new drive, pg_restore.
BUT: how will pg_dumpall be enabled? PostgreSQL must be forced to
accept the corrupted data to do a pg_dump...

Next step: for huge databases it is tempting not to pg_dump but to
filecopy, with a shut-down database or pg_start_backup() etc. How can
that be supported while data is corrupted?

Please do not misunderstand my question ... I really look forward to
getting this kind of information; especially since corrupted data on hard
drives is an early sign of BIG trouble to come. I am just asking to
start thinking about "what to do after corruption has been detected".


Harald

--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
EuroPython 2009 will take place in Birmingham - Stay tuned!


Re: Block-level CRC checks

From
Hannu Krosing
Date:
On Tue, 2008-09-30 at 17:13 -0400, pgsql@mohawksoft.com wrote:
> >
> > I believe the idea was to make this as non-invasive as possible. And
> > it would be really nice if this could be enabled without a dump/
> > reload (maybe the upgrade stuff would make this possible?)
> > --
> 
> It's all about the probability of a duplicate check being generated. If
> you use a 32 bit checksum, then you have a theoretical probability of 1 in
> 4 billion that a corrupt block will be missed (probably much lower
> depending on your algorithm). If you use a short, then a 1 in 65 thousand
> probability. If you use an 8 bit number, then 1 in 256.
> 
> Why am I going on? Well, if there are any spare bits in a block header,
> they could be used for the check value.

Even a 64-bit integer is just 0.1% of the 8k page size, and it is even
less than 0.1% likely that a page will be 100% full and thus that the 64
bits waste any real space at all.

So I don't think that this is a space issue.

---------------
Hannu




Re: Block-level CRC checks

From
Tom Lane
Date:
Hannu Krosing <hannu@2ndQuadrant.com> writes:
> So I don't think that this is a space issue.

No, it's all about time penalties and loss of concurrency.
        regards, tom lane


Re: Block-level CRC checks

From
Tom Lane
Date:
"Harald Armin Massa" <haraldarminmassa@gmail.com> writes:
> WHAT should happen when corrupted data is detected?

Same thing that happens now, ie, query fails with an error.  This would
just be an extension of the existing validity checks done at page read
time.
        regards, tom lane


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Wed, Oct 1, 2008 at 9:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Harald Armin Massa" <haraldarminmassa@gmail.com> writes:
>> WHAT should happen when corrupted data is detected?
>
> Same thing that happens now, ie, query fails with an error.  This would
> just be an extension of the existing validity checks done at page read
> time.

Agreed.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
pgsql@mohawksoft.com
Date:
> On Tue, 2008-09-30 at 17:13 -0400, pgsql@mohawksoft.com wrote:
>> >
>> > I believe the idea was to make this as non-invasive as possible. And
>> > it would be really nice if this could be enabled without a dump/
>> > reload (maybe the upgrade stuff would make this possible?)
>> > --
>>
>> It's all about the probability of a duplicate check being generated. If
>> you use a 32 bit checksum, then you have a theoretical probability of 1
>> in
>> 4 billion that a corrupt block will be missed (probably much lower
>> depending on your algorithm). If you use a short, then a 1 in 65
>> thousand
>> probability. If you use an 8 bit number, then 1 in 256.
>>
>> Why am I going on? Well, if there are any spare bits in a block header,
>> they could be used for the check value.
>
> Even a 64-bit integer is just 0.1% of the 8k page size, and it is even
> less than 0.1% likely that a page will be 100% full and thus that the 64
> bits waste any real space at all.
>
> So I don't think that this is a space issue.
>

Oh, I don't think it is a space issue either; the question was whether
there could be a way to do it without a dump and restore. The only way
that occurs to me is if there are some unused bits in the block header. I
haven't looked at that code in years. The numerics of it are just a
description of the probability of a duplicate sum or CRC, meaning a false
OK.

Also, regardless of whether or not the block is full, the block is read
and written as a block, so the underlying data is unimportant.


Re: Block-level CRC checks

From
Brian Hurt
Date:
Paul Schlie wrote:
>
> ... if that doesn't fix
> the problem, assume a single bit error, and iteratively flip
> single bits until the check sum matches ...
This can actually be done much faster, if you're doing a CRC checksum 
(aka modulo over GF(2^n)). Basically, an error flipping bit n will 
always create the same xor between the computed CRC and the stored CRC. 
So you can just store a table: for each n, the xor value that an error 
in bit n creates. Sort the table in order of xor values, and then you 
can do a binary search on the table and get exactly which bit was wrong.

This is actually probably fairly safe: for an 8K page, there are only 
65536 possible bit positions. Assuming a 32-bit CRC, that means that 
larger corruptions are much more likely to hit one of the other 
4,294,901,760 (2^32 - 2^16) CRC values: 99.998% likely, in fact.

Brian



> (hopefully not making the
> problem worse as may be the case if many bits were actually already
> in error) and write the data back, and proceed as normal, possibly
> logging the action; otherwise presume the data is unrecoverable and
> in error, somehow mark it as being so such that subsequent queries
> which may utilize any portion of it knows it may be corrupt (which
> I suspect may be best done not on file-system blocks, but actually
> on a logical rows or even individual entries if very large, as my
> best initial guess, and likely to measurably affect performance
> when enabled, and haven't a clue how resulting query should/could
> be identified as being potentially corrupt without confusing the
> client which requested it).
>
>
>
>   



Re: Block-level CRC checks

From
pgsql@mohawksoft.com
Date:
> Hannu Krosing <hannu@2ndQuadrant.com> writes:
>> So I don't think that this is a space issue.
>
> No, it's all about time penalties and loss of concurrency.

I don't think that the amount of time it would take to calculate and test
the sum is even important. It may have been on older CPUs, but these days
CPUs are so fast on in-memory data, and a block is very small. On x86
systems, depending on page alignment, we are talking about two or three
pages that will be "in memory" (they were used to read the block from
disk or were previously accessed).




Re: Block-level CRC checks

From
Brian Hurt
Date:
Brian Hurt wrote:
> Paul Schlie wrote:
>>
>> ... if that doesn't fix
>> the problem, assume a single bit error, and iteratively flip
>> single bits until the check sum matches ...
> This can actually be done much faster, if you're doing a CRC checksum 
> (aka modulo over GF(2^n)). Basically, an error flipping bit n will 
> always create the same xor between the computed CRC and the stored 
> CRC. So you can just store a table- for all n, an error in bit n will 
> create an xor of this value, sort the table in order of xor values, 
> and then you can do a binary search on the table, and get exactly what 
> bit was wrong.
>
> This is actually probably fairly safe- for an 8K page, there are only 
> 65536 possible bit positions. Assuming a 32-bit CRC, that means that 
> larger corruptions are much more likely to hit one of the other 
> 4,294,901,760 (2^32 - 2^16) CRC values- 99.998% likely, in fact.
>

Actually, I think I'm going to take this back. Thinking about it, the 
table is going to be large-ish (~512K) and it assumes a fixed 8K page 
size. I think a better solution would be a tight loop, something like:
r = 1u;
for (i = 0; i < max_bits_per_page; ++i) {
    if (r == xor_difference) {
        return i;                    /* bit i is the flipped one */
    } else if ((r & 1u) == 1u) {
        r = (r >> 1) ^ CRC_POLY;     /* one reflected-CRC shift step */
    } else {
        r >>= 1;
    }
}

Brian



Re: Block-level CRC checks

From
Tom Lane
Date:
pgsql@mohawksoft.com writes:
>> No, it's all about time penalties and loss of concurrency.

> I don't think that the amount of time it would take to calculate and test
> the sum is even important. It may be in older CPUs, but these days CPUs
> are so fast in RAM and a block is very small. On x86 systems, depending on
> page alignment, we are talking about two or three pages that will be "in
> memory" (They were used to read the block from disk or previously
> accessed).

Your optimism is showing ;-).  XLogInsert routinely shows up as a major
CPU hog in any update-intensive test, and AFAICT that's mostly from the
CRC calculation for WAL records.

We could possibly use something cheaper than a real CRC, though.  A
word-wide XOR (ie, effectively a parity calculation) would be sufficient
to detect most problems.
        regards, tom lane


Re: Block-level CRC checks

From
Paul Schlie
Date:
Brian Hurt wrote:
> Paul Schlie wrote:
>> 
>> ... if that doesn't fix
>> the problem, assume a single bit error, and iteratively flip
>> single bits until the check sum matches ...
> This can actually be done much faster, if you're doing a CRC checksum
> (aka modulo over GF(2^n)). Basically, an error flipping bit n will
> always create the same xor between the computed CRC and the stored CRC.
> So you can just store a table- for all n, an error in bit n will create
> an xor of this value, sort the table in order of xor values, and then
> you can do a binary search on the table, and get exactly what bit was
> wrong.
> 
> This is actually probably fairly safe- for an 8K page, there are only
> 65536 possible bit positions. Assuming a 32-bit CRC, that means that
> larger corruptions are much more likely to hit one of the other
> 4,294,901,760 (2^32 - 2^16) CRC values- 99.998% likely, in fact.

- yes, if you're willing to compute true CRCs as opposed to simpler
checksums, which may be worth the price if in fact many/most data
check failures are truly caused by single bit errors somewhere in the
chain. I'd guess that is typically the case in the absence of an error
manifesting itself in the data's storage control structure, which would
likely result in the catastrophic loss of the data itself, along with
potential corruption of other previously stored data if it occurs in
the write chain and is not otherwise detected. (And all presuming the
hopeful presence of reasonably healthy ECC main memory and processor, as
otherwise data corruption may easily go unnoticed between its
check/use/storage, obviously.)

- however, if such a storage block integrity check/correction mechanism
is to be developed, I can't help but wonder whether such a facility is
truly best architected as a shell around the file system calls in general
(thereby usable by arbitrary programs, as opposed to being specific to any
single one, and thereby arguably predominantly outside the scope of the db
itself)?

>> (hopefully not making the
>> problem worse as may be the case if many bits were actually already
>> in error) and write the data back, and proceed as normal, possibly
>> logging the action; otherwise presume the data is unrecoverable and
>> in error, somehow mark it as being so such that subsequent queries
>> which may utilize any portion of it knows it may be corrupt (which
>> I suspect may be best done not on file-system blocks, but actually
>> on a logical rows or even individual entries if very large, as my
>> best initial guess, and likely to measurably affect performance
>> when enabled, and haven't a clue how resulting query should/could
>> be identified as being potentially corrupt without confusing the
>> client which requested it).
>> 




Re: Block-level CRC checks

From
Paul Schlie
Date:
Brian Hurt wrote:
> Brian Hurt wrote:
>> Paul Schlie wrote:
>>> 
>>> ... if that doesn't fix
>>> the problem, assume a single bit error, and iteratively flip
>>> single bits until the check sum matches ...
>> This can actually be done much faster, if you're doing a CRC checksum
>> (aka modulo over GF(2^n)). Basically, an error flipping bit n will
>> always create the same xor between the computed CRC and the stored
>> CRC. So you can just store a table- for all n, an error in bit n will
>> create an xor of this value, sort the table in order of xor values,
>> and then you can do a binary search on the table, and get exactly what
>> bit was wrong.
>> 
>> This is actually probably fairly safe- for an 8K page, there are only
>> 65536 possible bit positions. Assuming a 32-bit CRC, that means that
>> larger corruptions are much more likely to hit one of the other
>> 4,294,901,760 (2^32 - 2^16) CRC values- 99.998% likely, in fact.
>> 
> 
> Actually, I think I'm going to take this back. Thinking about it, the
> table is going to be large-ish (~512K) and it assumes a fixed 8K page
> size. I think a better solution would be a tight loop, something like:
> r = 1u;
> for (i = 0; i < max_bits_per_page; ++i) {
> if (r == xor_difference) {
> return i;
> } else if ((r & 1u) == 1u) {
> r = (r >> 1) ^ CRC_POLY;
> } else {
> r >>= 1;
> }
> }

- or use a hash table indexed by the xor diff between the computed and
stored CRC (stored separately, CRC'd itself, and possibly stored
redundantly), which I thought was what you meant; either way, correction
performance isn't likely that important, as its use should hopefully be rare.




Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Wed, Oct 1, 2008 at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I don't think that the amount of time it would take to calculate and test
>> the sum is even important. It may be in older CPUs, but these days CPUs
>> are so fast in RAM and a block is very small. On x86 systems, depending on
>> page alignment, we are talking about two or three pages that will be "in
>> memory" (They were used to read the block from disk or previously
>> accessed).
>
> Your optimism is showing ;-).  XLogInsert routinely shows up as a major
> CPU hog in any update-intensive test, and AFAICT that's mostly from the
> CRC calculation for WAL records.

I probably wouldn't compare checksumming *every* WAL record to a
single block-level checksum.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:

> > A buffer's io_in_progress lock protects the buffer's CRC.
> 
> Unfortunately, it doesn't.  See hint bits.

Hmm, so it seems we need to keep hold of the buffer header's spinlock while
calculating the checksum, just after resetting BM_JUST_DIRTIED.  Yuck.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> Unfortunately, it doesn't.  See hint bits.

> Hmm, so it seems we need to keep hold of the buffer header's spinlock while
> calculating the checksum, just after resetting BM_JUST_DIRTIED.  Yuck.

No, holding a spinlock that long is entirely unacceptable, and it's the
wrong thing anyway, because we don't hold the header lock while
manipulating hint bits.

What this would *actually* mean is that we'd need to hold exclusive not
shared buffer lock on a buffer we are about to write, and that would
have to be maintained while computing the checksum and until the write
is completed.  The JUST_DIRTIED business could go away, in fact.

(Thinks for a bit...)  I wonder if that could induce any deadlock
problems?  The concurrency hit might be the least of our worries.
        regards, tom lane


Re: Block-level CRC checks

From
Tom Lane
Date:
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
> On Wed, Oct 1, 2008 at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Your optimism is showing ;-).  XLogInsert routinely shows up as a major
>> CPU hog in any update-intensive test, and AFAICT that's mostly from the
>> CRC calculation for WAL records.

> I probably wouldn't compare checksumming *every* WAL record to a
> single block-level checksum.

No, not at all.  Block-level checksums would be an order of magnitude
more expensive: they're on bigger chunks of data and they'd be done more
often.
        regards, tom lane


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Wed, Oct 1, 2008 at 11:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I probably wouldn't compare checksumming *every* WAL record to a
>> single block-level checksum.
>
> No, not at all.  Block-level checksums would be an order of magnitude
> more expensive: they're on bigger chunks of data and they'd be done more
> often.

That's debatable and would be dependent on cache and the workload.

In our case however, because shared buffers doesn't scale, we would
end up doing a lot more block-level checksums than the other vendors
just pushing the block to/from the OS cache.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Tom Lane
Date:
Paul Schlie <schlie@comcast.net> writes:
> - yes, if you're willing to compute true CRC's as opposed to simpler
> checksums, which may be worth the price if in fact many/most data
> check failures are truly caused by single bit errors somewhere in the
> chain,

FWIW, not one of the corrupted-data problems I've investigated has ever
looked like a single-bit error.  So the theoretical basis for using a
CRC here seems pretty weak.  I doubt we'd even consider automatic repair
attempts anyway.
        regards, tom lane


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Tom Lane wrote:
> "Jonah H. Harris" <jonah.harris@gmail.com> writes:

> > I probably wouldn't compare checksumming *every* WAL record to a
> > single block-level checksum.
> 
> No, not at all.  Block-level checksums would be an order of magnitude
> more expensive: they're on bigger chunks of data and they'd be done more
> often.

More often?  My intention is that they are checked when the buffer is
read in, and calculated/stored when the buffer is written out.
In-memory changers of the block do not check nor recalculate the sum.

Is this not OK?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
Gregory Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> "Jonah H. Harris" <jonah.harris@gmail.com> writes:
>> On Wed, Oct 1, 2008 at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Your optimism is showing ;-).  XLogInsert routinely shows up as a major
>>> CPU hog in any update-intensive test, and AFAICT that's mostly from the
>>> CRC calculation for WAL records.
>
>> I probably wouldn't compare checksumming *every* WAL record to a
>> single block-level checksum.
>
> No, not at all.  Block-level checksums would be an order of magnitude
> more expensive: they're on bigger chunks of data and they'd be done more
> often.

Yeah, it's not a single block, it's the total amount that matters, and that's
going to add up to the entire I/O bandwidth of the database. That said, I think
the reason WAL checksums are so expensive is the startup and finishing cost.

I wonder if we could do something clever here though. Only one process is busy
calculating the checksum -- it just has to know if anyone fiddles the hint
bits while it's busy.

If setting a hint bit cleared a flag on the buffer header then the
checksumming process could set that flag, begin checksumming, and check that
the flag is still set when he's finished.

Actually I suppose that wouldn't actually be good enough. He would have to do
the i/o and check that the checksum was still valid after the i/o. If not then
he would have to recalculate the checksum and repeat the i/o. That might make
the idea a loser since I think the only way it wins is if you rarely actually
get someone setting the hint bits during i/o anyways.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
 


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Wed, Oct 1, 2008 at 11:57 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Tom Lane wrote:
>> No, not at all.  Block-level checksums would be an order of magnitude
>> more expensive: they're on bigger chunks of data and they'd be done more
>> often.
>
> More often?  My intention is that they are checked when the buffer is
> read in, and calculated/stored when the buffer is written out.
> In-memory changers of the block do not check nor recalculate the sum.
>
> Is this not OK?

That is the way it should work, only on read-in/write-out.


--
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Paul Schlie
Date:
> Jonah H. Harris wrote:
>> Tom Lane wrote:
>> "Harald Armin Massa" writes:
>>> WHAT should happen when corrupted data is detected?
>>
>> Same thing that happens now, ie, query fails with an error.  This would
>> just be an extension of the existing validity checks done at page read
>> time.
>
> Agreed.

- however it must be understood that this may prevent large portions of
the database (not otherwise corrupted) from being legitimately queried,
which may in practice simply not be acceptable; unless I misunderstand its
significance. (So I wonder if it may be necessary to identify only the
logical data within the corrupted block as being in potential error, and
attempt to limit its effect as much as possible?)

- if ZFS's effective behavior is any indication, although folks strongly
favor strong data integrity, many aren't ready to accept the consequences of
only being able to access known good data, especially if it inhibits access
to the remaining, most likely good, data (while lacking any reasonable
mechanism to otherwise recover access to it). Sometimes ignorance is bliss.




Re: Block-level CRC checks

From
pgsql@mohawksoft.com
Date:
> pgsql@mohawksoft.com writes:
>>> No, it's all about time penalties and loss of concurrency.
>
>> I don't think that the amount of time it would take to calculate and
>> test
>> the sum is even important. It may be in older CPUs, but these days CPUs
>> are so fast in RAM and a block is very small. On x86 systems, depending
>> on
>> page alignment, we are talking about two or three pages that will be "in
>> memory" (They were used to read the block from disk or previously
>> accessed).
>
> Your optimism is showing ;-).  XLogInsert routinely shows up as a major
> CPU hog in any update-intensive test, and AFAICT that's mostly from the
> CRC calculation for WAL records.
>
> We could possibly use something cheaper than a real CRC, though.  A
> word-wide XOR (ie, effectively a parity calculation) would be sufficient
> to detect most problems.

That was something that I mentioned in my first response. if the *only*
purpose of the check is to generate a "pass" or "fail" status, and not
something to be used to find where in the block it is corrupted or attempt
to regenerate the data, then we could certainly optimize the check
algorithm. A simple checksum may be good enough.


Re: Block-level CRC checks

From
Florian Weimer
Date:
* Heikki Linnakangas:

> Currently, hint bit updates are not WAL-logged, and thus no full page
> write is done when only hint bits are changed. Imagine what happens if
> hint bits are updated on a page, but there's no other changes, and we
> crash so that only one half of the new page version makes it to disk
> (= torn page). The CRC would not match, even though the page is
> actually valid.

The non-logged hint bit writes are somewhat dangerous anyway.  Maybe
it's time to get rid of this peculiarity, despite the performance
impact?

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: Block-level CRC checks

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane escribió:
>> No, not at all.  Block-level checksums would be an order of magnitude
>> more expensive: they're on bigger chunks of data and they'd be done more
>> often.

> More often?  My intention is that they are checked when the buffer is
> read in, and calculated/stored when the buffer is written out.

Right.  My point is that the volume of data involved is more than the
WAL traffic.  Example: you update one tuple on a page, your WAL record
is that tuple, but you had to checksum the whole page when you read it
in and you'll have to do it again when you write it out.  Not to mention
that in a read-mostly query mix there's not much WAL traffic at all,
but plenty of page reading (and maybe some writing too, if hint bit
updates happen).

"Order of magnitude" might be an overstatement, but I don't believe
for a moment that the cost will be negligible.  That's why I'm thinking
about something cheaper than a full-blown CRC calculation.
        regards, tom lane


Re: Block-level CRC checks

From
Csaba Nagy
Date:
On Wed, 2008-10-01 at 16:57 +0100, Gregory Stark wrote:
> I wonder if we could do something clever here though. Only one process
> is busy calculating the checksum -- it just has to know if anyone
> fiddles the hint bits while it's busy.

What if the hint bits are added to the checksum at the very end, with an
exclusive lock held on them?  Then the exclusive lock should be short
enough... only it might be deadlock-prone, as any lock upgrade is...

Cheers,
Csaba.




Re: Block-level CRC checks

From
Florian Weimer
Date:
* Tom Lane:

> No, not at all.  Block-level checksums would be an order of magnitude
> more expensive: they're on bigger chunks of data and they'd be done more
> often.

For larger blocks, checksumming can be parallelized at the instruction
level, especially if the block size is statically known.  And for
large blocks, Adler32 isn't that bad compared to CRC32 from a error
detection POV, so maybe you could use that.
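
For reference, the textbook Adler-32 (as defined in RFC 1950) is just two
running sums.  A sketch, with the caveat that real implementations such
as zlib's defer the expensive modulo rather than applying it per byte:

#include <stdint.h>
#include <stddef.h>

#define ADLER_MOD 65521   /* largest prime below 2^16 */

static uint32_t
adler32_simple(const uint8_t *buf, size_t len)
{
    uint32_t a = 1, b = 0;
    size_t   i;

    for (i = 0; i < len; i++)
    {
        a = (a + buf[i]) % ADLER_MOD;   /* running sum of bytes */
        b = (b + a) % ADLER_MOD;        /* running sum of the sums */
    }
    return (b << 16) | a;
}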

I've seen faults which were uncovered by page-level checksumming, so
I'd be willing to pay the performance cost. 8-/

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Gregory Stark <stark@enterprisedb.com> [081001 11:59]:
> If setting a hint bit cleared a flag on the buffer header then the
> checksumming process could set that flag, begin checksumming, and check that
> the flag is still set when he's finished.
> 
> Actually I suppose that wouldn't actually be good enough. He would have to do
> the i/o and check that the checksum was still valid after the i/o. If not then
> he would have to recalculate the checksum and repeat the i/o. That might make
> the idea a loser since I think the only way it wins is if you rarely actually
> get someone setting the hint bits during i/o anyways.

A doubled-write is essentially "free" with PostgreSQL because it's not
doing direct IO, rather relying on the OS page cache to be efficient.
So if you write block A and then call write on block A immediately (or,
realistically, after a redo of the checksum), the first write is almost
*never* going to take IO bandwidth to your spindles...

But the problem is if something crashes (or interrupts PG) between those
two writes, you've got a block of data into the pagecache (and possibly
to the disks) that PG will no longer read in, because the CRC/checksum
fails despite the actual content being valid...

So if you're going to be making PG refuse to read in blocks with bad
CRC/csum, you need to guarantee that nothing fiddles with the block
between the start of the CRC and the completion of the write().

One possibility would be to "double-buffer" the write... i.e. as you
calculate your CRC, you're doing it on a local copy of the block, which
you hand to the OS to write...  If you're touching the whole block of
memory to CRC it, it isn't *ridiculously* more expensive to copy the
memory somewhere else as you do it...

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
Paul Schlie
Date:
Tom Lane wrote:
> Paul Schlie writes:
>> - yes, if you're willing to compute true CRC's as opposed to simpler
>> checksums, which may be worth the price if in fact many/most data
>> check failures are truly caused by single bit errors somewhere in the
>> chain,
> 
> FWIW, not one of the corrupted-data problems I've investigated has ever
> looked like a single-bit error.  So the theoretical basis for using a
> CRC here seems pretty weak.  I doubt we'd even consider automatic repair
> attempts anyway.

- although I accept that you may be correct in your assessment that most
errors are in fact multi-bit, I've never seen any hard data to corroborate
either this or my suspicion that most errors are in fact single-bit in
nature (if occurring within the read/processing/write paths from storage),
though I agree that errors occurring within an otherwise ECC'd memory
subsystem would have to be multi-bit in nature.  However, in systems which
record very few corrected single-bit errors, and little if any
uncorrectable double-bit errors, it seems unlikely that multi-bit errors
resulting from memory failure can account for the number of integrity
check failures for data stored in file systems; so I strongly suspect that
the failures you've had occasion to investigate were predominantly so
catastrophic that they were sufficiently obvious to catch your attention,
with most of the subtler integrity errors simply sneaking below the radar.
(It seems clear that, statistically, hardware failure will most likely
inject single-bit errors into data with greater frequency than multi-bit
ones, and these will go undetected unless detection, if not correction, is
provisioned at each communication boundary the data traverses.)




Re: Block-level CRC checks

From
Mark Mielke
Date:
Aidan Van Dyk wrote:
> One possibility would be to "double-buffer" the write... i.e. as you
> calculate your CRC, you're doing it on a local copy of the block, which
> you hand to the OS to write...  If you're touching the whole block of
> memory to CRC it, it isn't *ridiculously* more expensive to copy the
> memory somewhere else as you do it...
>   

Coming into this late - so apologies if this makes no sense - but 
doesn't writev() provide the required capability?

Also, what is the difference between the OS not writing the block at 
all, and writing the block but missing the checksum? This window seems 
to be small enough (most of the time being faster than the system can 
schedule the buffer to be dumped?) that the "additional risk" seems 
theoretical rather than real. Unless there is evidence that writev() 
performs poorly, I'd suggest that avoiding double-buffering by using 
writev() would be preferred.
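
Very roughly, and purely as a sketch (the crc32_buffer() helper is an
assumption, as is the layout placing the checksum in the first four
bytes of the page), the writev() approach would be:

#include <stdint.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

#define BLCKSZ 8192

extern uint32_t crc32_buffer(const void *data, size_t len);  /* assumed helper */

static ssize_t
gather_write(int fd, const char *page)
{
    /* checksum everything except the 4-byte checksum field itself */
    uint32_t     crc = crc32_buffer(page + 4, BLCKSZ - 4);
    struct iovec iov[2];

    iov[0].iov_base = &crc;                 /* bytes 0-3: the checksum */
    iov[0].iov_len  = sizeof(crc);
    iov[1].iov_base = (void *) (page + 4);  /* rest straight from shared memory */
    iov[1].iov_len  = BLCKSZ - 4;

    return writev(fd, iov, 2);
}

Note this still reads the shared page twice -- once to checksum it and
once when the kernel gathers it -- so by itself it doesn't close the
hint-bit race discussed elsewhere in this thread.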

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>



Re: Block-level CRC checks

From
Mark Mielke
Date:
Tom Lane wrote:
> Paul Schlie <schlie@comcast.net> writes:
>> - yes, if you're willing to compute true CRC's as opposed to simpler
>> checksums, which may be worth the price if in fact many/most data
>> check failures are truly caused by single bit errors somewhere in the
>> chain,
>
> FWIW, not one of the corrupted-data problems I've investigated has ever
> looked like a single-bit error.  So the theoretical basis for using a
> CRC here seems pretty weak.  I doubt we'd even consider automatic repair
> attempts anyway.

Single bit failures are probably the most common, but they are probably
already handled by the hardware. I don't think I've ever seen a modern
hard drive return a wrong bit - I get short reads first. By the time
somebody notices a problem, it's probably more than a few bits that have
accumulated. For example, if memory has a faulty cell in it, it will
create a fault a percentage of the times it is accessed. One bit error
easily turns into two, three, ... Then there is the fact that no hardware
is perfect, and every single component in the computer has a chance,
however small, of introducing bit errors... :-(

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>

Re: Block-level CRC checks

From
Gregory Stark
Date:
Aidan Van Dyk <aidan@highrise.ca> writes:

> * Gregory Stark <stark@enterprisedb.com> [081001 11:59]:
>  
>> If setting a hint bit cleared a flag on the buffer header then the
>> checksumming process could set that flag, begin checksumming, and check that
>> the flag is still set when he's finished.
>> 
>> Actually I suppose that wouldn't actually be good enough. He would have to do
>> the i/o and check that the checksum was still valid after the i/o. If not then
>> he would have to recalculate the checksum and repeat the i/o. That might make
>> the idea a loser since I think the only way it wins is if you rarely actually
>> get someone setting the hint bits during i/o anyways.
>
> A doubled-write is essentially "free" with PostgreSQL because it's not
> doing direct IO, rather relying on the OS page cache to be efficient.

All things are relative. What we're talking about here is all cpu and
memory-bandwidth costs anyways so, yes, it'll be cheap compared to the disk
i/o but it'll still represent doubling the memory bandwidth and cpu cost of
these routines.

That said you would only have to do it in cases where the hint bits actually
get twiddled. That might not actually happen often.

> But the problem is if something crashes (or interrupts PG) between those
> two writes, you've got a block of data into the pagecache (and possibly
> to the disks) that PG will no longer read in, because the CRC/checksum
> fails despite the actual content being valid...

I don't think this is a problem because we're still doing WAL logging. The i/o
isn't allowed to happen until the page has been WAL logged and fsynced
anyways.


Incidentally I think the JUST_DIRTIED bit might actually be sufficient here.
Hint bits already cause the buffer to be marked dirty. So the only case I see
a real problem for is when we're writing a block as part of a checkpoint and
find it's JUST_DIRTIED after writing it. In that case we would have to start
over and write it again rather than leave it marked dirty.

If we're writing the block as part of normal i/o then we could just decide to
leave the possibly-bogus checksum in the table since it'll be overwritten by a
full page write anyways. It'll be overwritten in normal use when the newly
dirty buffer is eventually written out again.


If you're not doing full page writes then you would have to restore from
backup in cases where previously the page might actually have been
valid, though.  That's kind of unfortunate.  In theory it hasn't actually
changed the risks of running without full page writes, but it has
certainly increased the likelihood of actually having to deal with
"corruption" in the form of a gratuitously invalid checksum.  (Of course,
without checksums you don't ever actually know whether you have
corruption -- including real corruption.)

> One possibility would be to "double-buffer" the write... i.e. as you
> calculate your CRC, you're doing it on a local copy of the block, which
> you hand to the OS to write...  If you're touching the whole block of
> memory to CRC it, it isn't *ridiculously* more expensive to copy the
> memory somewhere else as you do it...

Hm. Well that might actually work. You can do the CRC at the same time as
copying to the buffer, effectively doing it for the same cost as the CRC
alone.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning


Re: Block-level CRC checks

From
Gregory Stark
Date:
Paul Schlie <schlie@comcast.net> writes:

> Tom Lane wrote:
>> Paul Schlie writes:
>>> - yes, if you're willing to compute true CRC's as opposed to simpler
>>> checksums, which may be worth the price if in fact many/most data
>>> check failures are truly caused by single bit errors somewhere in the
>>> chain,
>> 
>> FWIW, not one of the corrupted-data problems I've investigated has ever
>> looked like a single-bit error.  So the theoretical basis for using a
>> CRC here seems pretty weak.  I doubt we'd even consider automatic repair
>> attempts anyway.
>
> - although I accept that you may be correct in your assessment that most
> errors are in fact multi-bit; 

I've seen bad memory in a SCSI controller cause single-bit errors in storage.
It was quite confusing since the symptom was syntax errors in the C code we
were compiling on the server. The sysadmin actually caught it reliably
corrupting a block of source text written out and read back.

I've also seen single-bit errors caused by bad memory in a network interface.
*Twice*. Particularly nasty since the CRC on TCP/IP packets is only 16-bit so
a large enough ftp transfer would eventually finish despite the packet loss
but with the occasional bits flipped. In these days of SAN/NAS and SCSI over
IP that's pretty scary...

Several cases on list have come down to "filesystem secretly replaces entire
block of data with Folger's Crystals(tm) -- let's see if the database
notices". Any checksum would help in that case but I wouldn't discount single
bit errors either.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!


Re: Block-level CRC checks

From
Sam Mason
Date:
On Wed, Oct 01, 2008 at 11:57:31AM -0400, Alvaro Herrera wrote:
> Tom Lane escribió:
> > "Jonah H. Harris" <jonah.harris@gmail.com> writes:
> 
> > > I probably wouldn't compare checksumming *every* WAL record to a
> > > single block-level checksum.
> > 
> > No, not at all.  Block-level checksums would be an order of magnitude
> > more expensive: they're on bigger chunks of data and they'd be done more
> > often.
> 
> More often?  My intention is that they are checked when the buffer is
> read in, and calculated/stored when the buffer is written out.
> In-memory changers of the block do not check nor recalculate the sum.

I know you said detecting memory errors wasn't being attempted, but
bad memory accounts for a reasonable number of reports of database
corruption on -general so I was wondering if moving the checks around
could catch some of these.

How about updating the block's checksum immediately whenever a tuple is
modified within it?  Checksums would be checked whenever data is read in
and just before it's written out.  Checksum failure on write would cause
PG to abort noisily with complaints about bad memory?

If some simple checksum could be found (probably a parity check) that
would allow partial updates, PG wouldn't need to scan too much data when
regenerating it.  It would also cause the checksum to stay bad if data
goes bad in memory between reading from disk and its eventual write out;
a block remaining in cache for a long time makes this case appear
possible.

 Sam


Re: Block-level CRC checks

From
"Kevin Grittner"
Date:
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Paul Schlie <schlie@comcast.net> writes:
>> - yes, if you're willing to compute true CRC's as opposed to simpler
>> checksums, which may be worth the price if in fact many/most data
>> check failures are truly caused by single bit errors somewhere in the
>> chain,
>
> FWIW, not one of the corrupted-data problems I've investigated has ever
> looked like a single-bit error.  So the theoretical basis for using a
> CRC here seems pretty weak.  I doubt we'd even consider automatic repair
> attempts anyway.

+1

The only single-bit errors I've seen have been the result of a buggy
driver for a particular model of network card.  The problem went away
with the next update of the driver.  I've never encountered a
single-bit error in a disk sector.

-Kevin


Re: Block-level CRC checks

From
Paul Schlie
Date:
Kevin Grittner wrote:
>>>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Paul Schlie <schlie@comcast.net> writes:
>>> - yes, if you're willing to compute true CRC's as opposed to
>>> simpler checksums, which may be worth the price if in fact many/most
>>> data check failures are truly caused by single bit errors somewhere
>>> in the chain,
>> 
>> FWIW, not one of the corrupted-data problems I've investigated has
>> ever looked like a single-bit error.  So the theoretical basis for
>> using a CRC here seems pretty weak.  I doubt we'd even consider
>> automatic repair attempts anyway.
>  
> +1
>  
> The only single-bit errors I've seen have been the result of a buggy
> driver for a particular model of network card.  The problem went away
> with the next update of the driver.  I've never encountered a
> single-bit error in a disk sector.

- although I personally don't see how a buggy driver could ever likely
generate single bit errors within the data stored/retrieved, as most
drivers have no business mucking with data beyond breaking it up or
collating it into larger chunks, typically on octet boundaries, unless
implementing a soft USART or something like that for some odd reason.

- however regardless, if some form of error detection ends up being
implemented, it might be nice to actually log corrupted blocks of data
along with their previously computed checksums for subsequent analysis
in an effort to ascertain if there's an opportunity to improve its
implementation based on this more concrete real-world information.




Re: Block-level CRC checks

From
Tom Lane
Date:
Paul Schlie <schlie@comcast.net> writes:
> - however regardless, if some form of error detection ends up being
> implemented, it might be nice to actually log corrupted blocks of data
> along with their previously computed checksums for subsequent analysis
> in an effort to ascertain if there's an opportunity to improve its
> implementation based on this more concrete real-world information.

This feature is getting overdesigned, I think.  It's already the case
that we log an error complaining that thus-and-such a page is corrupt.
Once PG has decided that, it won't have anything to do with the page at
all --- it can't load it into shared buffers, so it won't write it
either.  So the user can go inspect the page at leisure with whatever
tools seem handy.  I don't see a need for more verbose logging.
        regards, tom lane


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Wed, Oct 1, 2008 at 4:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Paul Schlie <schlie@comcast.net> writes:
>> - however regardless, if some form of error detection ends up being
>> implemented, it might be nice to actually log corrupted blocks of data
>> along with their previously computed checksums for subsequent analysis
>> in an effort to ascertain if there's an opportunity to improve its
>> implementation based on this more concrete real-world information.
>
> This feature is getting overdesigned, I think.  It's already the case
> that we log an error complaining that thus-and-such a page is corrupt.
> Once PG has decided that it won't have anything to do with the page at
> all --- it can't load it into shared buffers, so it won't write it
> either.  So the user can go inspect the page at leisure with whatever
> tools seem handy.  I don't see a need for more verbose logging.

Agreed!

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Tom Lane
Date:
Aidan Van Dyk <aidan@highrise.ca> writes:
> One possibility would be to "double-buffer" the write... i.e. as you
> calculate your CRC, you're doing it on a local copy of the block, which
> you hand to the OS to write...  If you're touching the whole block of
> memory to CRC it, it isn't *ridiculously* more expensive to copy the
> memory somewhere else as you do it...

That actually seems like a really good idea.  We don't have to increase
the buffer locking requirements, or make much of any change at all in
the existing logic.  +1, especially if this is intended to be an
optional feature (which I agree with).
        regards, tom lane


Re: Block-level CRC checks

From
pgsql@mohawksoft.com
Date:
> Aidan Van Dyk <aidan@highrise.ca> writes:
>> One possibility would be to "double-buffer" the write... i.e. as you
>> calculate your CRC, you're doing it on a local copy of the block, which
>> you hand to the OS to write...  If you're touching the whole block of
>> memory to CRC it, it isn't *ridiculously* more expensive to copy the
>> memory somewhere else as you do it...
>
> That actually seems like a really good idea.  We don't have to increase
> the buffer locking requirements, or make much of any change at all in
> the existing logic.  +1, especially if this is intended to be an
> optional feature (which I agree with).
>
I don't think it makes sense at all!!!

If you are going to double buffer, one presumes that for some non-zero
period of time, the block must be locked during which it is copied. You
wouldn't want it changing "mid-copy" would you? How is this any less of a
hit than just calculating the checksum?


Re: Block-level CRC checks

From
Tom Lane
Date:
pgsql@mohawksoft.com writes:
>> That actually seems like a really good idea.

> I don't think it makes sense at all!!!

> If you are going to double buffer, one presumes that for some non-zero
> period of time, the block must be locked during which it is copied. You
> wouldn't want it changing "mid-copy" would you? How is this any less of a
> hit than just calculating the checksum?

It only has to be share-locked.  That locks out every change *but* hint
bit updates, and we don't really care whether we catch a hint bit update
or not, so long as it doesn't break our CRC.  This is really exactly the
same way it works now: there is no way to know whether the page image
sent to the kernel includes hint bit updates made after the write() call
starts.  But we don't care.  (The JUST_DIRTIED business ensures that
we'll catch any such hint bit update next time.)

Thought experiment: suppose that write() had some magic option that made
it calculate a CRC on the data after it was pulled from userspace, while
it's sitting in a kernel buffer.  Then we'd have exactly the same
guarantees as we do now about the consistency of data written to disk,
except that the CRC magically got in there too.  The double-buffer idea
implements exactly that behavior, without any magic write option.
        regards, tom lane


Re: Block-level CRC checks

From
Gregory Stark
Date:
pgsql@mohawksoft.com writes:

> If you are going to double buffer, one presumes that for some non-zero
> period of time, the block must be locked during which it is copied. You
> wouldn't want it changing "mid-copy" would you? How is this any less of a
> hit than just calculating the checksum?

a) You wouldn't have to keep the lock while doing the I/O. Once the CRC+copy
is done you can release the lock secure in the knowledge that nobody is going
to modify your buffered copy before the kernel can grab its copy.

b) You don't have to worry about hint bits being modified underneath you. As
long as the CRC+copy is carefully written to copy whole atomic-sized
words/bytes and only read the original once then it won't matter if it catches
the hint bit before or after it's set. The CRC will reflect the value buffered
and eventually written.
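
In sketch form, point (b) is just a fused copy-and-checksum loop over
the shared page; crc32_word() here is a hypothetical per-word CRC step,
not an existing function, and BLCKSZ stands in for the real block size:

#include <stdint.h>
#include <stddef.h>

#define BLCKSZ 8192

extern uint32_t crc32_word(uint32_t crc, uint32_t w);  /* hypothetical helper */

static uint32_t
crc_and_copy(uint32_t *dst, const volatile uint32_t *src)
{
    uint32_t crc = 0xFFFFFFFFu;
    size_t   i;

    for (i = 0; i < BLCKSZ / sizeof(uint32_t); i++)
    {
        uint32_t w = src[i];        /* exactly one read of the shared page */

        dst[i] = w;                 /* the copy and the CRC ...            */
        crc = crc32_word(crc, w);   /* ... see the same word value         */
    }
    return crc ^ 0xFFFFFFFFu;
}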

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


Re: Block-level CRC checks

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> a) You wouldn't have to keep the lock while doing the I/O.

Hoo, yeah, so the period of holding the share-lock could well be
*shorter* than it is now.  Most especially so if the write() blocks
instead of just transferring the data to kernel space and returning.

I wonder whether that could mean that it's a win to double-buffer
even if we aren't computing a checksum?  Nah, probably not.
        regards, tom lane


Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Tom Lane <tgl@sss.pgh.pa.us> [081001 19:42]:
> Gregory Stark <stark@enterprisedb.com> writes:
> > a) You wouldn't have to keep the lock while doing the I/O.
> 
> Hoo, yeah, so the period of holding the share-lock could well be
> *shorter* than it is now.  Most especially so if the write() blocks
> instead of just transferring the data to kernel space and returning.
> 
> I wonder whether that could mean that it's a win to double-buffer
> even if we aren't computing a checksum?  Nah, probably not.

That all depends on what you think is longer: copy 8K & release, or
syscall (which obviously does a copy) + return & release...  And whether
you want to shorten the lock hold time (but it's only shared), or the
time until the write is done (isn't the goal to have writes done in the
background during checkpoint so write latency isn't a problem?)...  This
might be an interesting experiment for someone to do with a very high
concurrency, high-write load...

a.


-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
Rather than potentially letting this slide past 8.4, I threw together
an extremely quick-hack patch at the smgr-layer for block-level
checksums.

There are some nasties in that the CRC is the first member of
PageHeaderData (in order to guarantee inclusion of the LSN), and that
it bumps the size of the page header from 24 to 32 bytes on MAXALIGN=8.
I really think most people should use checksums and so I didn't make
it optional storage within the page (which would of course increase
the complexity).

Second, I didn't bump page version numbers.

Third, rather than zero (zero-filled pages having been commonplace for
as long as I can remember), I used a fixed magic number (0xdeadbeef)
for pd_checksum as the default CRC; that way, if someone
enables/disables it at runtime, they won't get invalid checksums for
blocks which hadn't been checksummed previously.  This may as well be
zero (which would mean PageInit/PageHeaderIsValid wouldn't have to be
touched), but I used it as a test.

I ran the regressions and several concurrent benchmark tests which
passed successfully, but I'm sure I'm missing quite a bit due to the
fact that it's late, it's just a quick hack, and I haven't gone
through the buffer manager locking code in a while.

I'll be happy to work on this or let Alvaro take it; just as long as
it gets done for 8.4.

--
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Thu, Oct 2, 2008 at 1:29 AM, Jonah H. Harris <jonah.harris@gmail.com> wrote:
> I ran the regressions and several concurrent benchmark tests which
> passed successfully, but I'm sure I'm missing quite a bit due to the
> fact that it's late, it's just a quick hack, and I haven't gone
> through the buffer manager locking code in a while.

Don't know how I missed this obvious one... should not be coding this
late @ night :(

Patch updated.

--
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Heikki Linnakangas
Date:
Jonah H. Harris wrote:
> Rather than potentially letting this slide past 8.4, I threw together
> an extremely quick-hack patch at the smgr-layer for block-level
> checksums.

One hard problem is how to deal with torn pages with non-WAL-logged 
changes. Like heap hint bit updates, killing tuples in index pages 
during a scan, and the new FSM pages.

Currently, a torn page when writing a hint-bit-updated page doesn't 
matter, but it would break the checksum.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: Block-level CRC checks

From
Florian Weimer
Date:
* Gregory Stark:

> I've also seen single-bit errors caused by bad memory in a network interface.
> *Twice*. Particularly nasty since the CRC on TCP/IP packets is only 16-bit so
> a large enough ftp transfer would eventually finish despite the packet loss
> but with the occasional bits flipped. In these days of SAN/NAS and SCSI over
> IP that's pretty scary...

I've seen double-bit errors in Internet routing which canceled each
other out (the Internet checksum is just a sum, not a CRC, so this
happens with some probability once you've got bit errors with a
multiple-of-16 periodicity).

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: [SPAM?]: Re: Block-level CRC checks

From
Hannu Krosing
Date:
On Thu, 2008-10-02 at 09:35 +0300, Heikki Linnakangas wrote:
> Jonah H. Harris wrote:
> > Rather than potentially letting this slide past 8.4, I threw together
> > an extremely quick-hack patch at the smgr-layer for block-level
> > checksums.
> 
> One hard problem is how to deal with torn pages with non-WAL-logged 
> changes. Like heap hint bit updates, killing tuples in index pages 
> during a scan, and the new FSM pages.

Hint bit updates and killing tuples in index pages during a scan can
probably be brute-forced quite cheaply after we find a CRC mismatch.
Not sure about new FSM pages.

> Currently, a torn page when writing a hint-bit-updated page doesn't 
> matter, but it would break the checksum.

Another idea is to just ignore non-WAL-logged bits when calculating
CRC-s, by masking them out before adding corresponding bytes to CRC.

This requires page-type aware CRC functions and is more expensive to
calculate. How much more expensive is something that only testing can
tell. Probably not very much, as everything needed should be in L1
caches already.
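
In other words, something like the following sketch, where
mask_non_logged_bits() is hypothetical and is where all the
page-type-aware knowledge (and cost) would live:

#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

extern uint32_t crc32_buffer(const void *data, size_t len);  /* assumed */
extern void     mask_non_logged_bits(char *page);  /* hypothetical, page-type aware */

static uint32_t
crc_ignoring_hint_bits(const char *page)
{
    char copy[BLCKSZ];

    memcpy(copy, page, BLCKSZ);     /* work on a private copy */
    mask_non_logged_bits(copy);     /* clear hint bits, index kill flags, ... */
    return crc32_buffer(copy, BLCKSZ);
}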

---------------
Hannu




Re: Block-level CRC checks

From
Brian Hurt
Date:
I have a stupid question wrt hint bits and CRC checksums - it seems to me
that, if you change the hint bits, it should be possible to very easily
calculate what the change in the CRC checksum should be.

The basic idea of the CRC checksum is that, given a message x, the 
checksum is x mod p where p is some constant polynomial (all operations 
are in GF(2^n)).  Now, the interesting thing about modulo is that it's 
distributable - that is to say, (x ^ y) mod p = (x mod p) ^ (y mod p), 
and that (x * y) mod p = ((x mod p) * (y mod p)) mod p (I'm using ^ 
instead of the more traditional + here to emphasize that it's xor, not 
addition).  So let's assume we're updating a word a known n bytes from 
the end of the message - we calculate y = old_value ^ new_value, so our 
change is equivalent to changing the original block m to (m ^ (y * 
x^{8n})).  The new checksum is then (m ^ (y * x^{8n})) mod p =
(m mod p) ^ (((y mod p) * (x^{8n} mod p)) mod p).  Now, m mod p is the 
original checksum, and (x^{8n} mod p) is a constant for a given n, and 
the multiplication modulo p can be implemented as a set of table 
lookups, one per byte.

The take away here is that, if we know ahead of time where the 
modifications are going to be, we can make updating the CRC checksum 
(relatively) cheap.  So, for example, a change of the hint bits would 
only need 4 table lookups and a couple of xors to update the block's 
CRC checksum.  We could extend this idea - break the 8K page up into, 
say, 32 256-byte "subblocks".  Changing any given subblock would require 
only re-checksumming that subblock and then updating the CRC checksum.  
The reason for the subblocks would be to limit the number of tables 
necessary - each subblock requires its own set of 4 256-word tables, so 
having 32 subblocks means that the tables involved would be 32*4*256*4 = 
128K in size.  Going to, say, 64-byte subblocks means needing 128 sets 
of tables, or 512K of tables.
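
To make the linearity concrete, here is a self-contained demonstration
against the standard reflected CRC-32 (polynomial 0xEDB88320).  For
brevity it advances the delta past the trailing zero bytes one table
lookup per byte, rather than with the precomputed x^{8n} tables
described above, but the identity it verifies -- crc(m ^ d) =
crc(m) ^ rawcrc(d) -- is the same one those tables would exploit:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t crc_table[256];

static void
init_crc_table(void)
{
    uint32_t i, c;
    int      k;

    for (i = 0; i < 256; i++)
    {
        for (c = i, k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        crc_table[i] = c;
    }
}

/* Standard CRC-32 of a whole buffer. */
static uint32_t
crc32_buf(const uint8_t *buf, size_t len)
{
    uint32_t c = 0xFFFFFFFFu;
    size_t   i;

    for (i = 0; i < len; i++)
        c = (c >> 8) ^ crc_table[(c ^ buf[i]) & 0xFF];
    return c ^ 0xFFFFFFFFu;
}

/*
 * Update an existing block CRC given that the 4 bytes at 'offset' were
 * XORed with 'delta'.  The raw CRC of the delta message (zeroes except
 * for the changed word) is folded into the old CRC; the init value and
 * final inversion cancel out in the XOR of the two full CRCs.
 */
static uint32_t
crc32_update_word(uint32_t old_crc, size_t blocklen,
                  size_t offset, uint32_t delta)
{
    uint32_t c = 0;     /* raw state: the leading zero bytes leave it 0 */
    int      i;
    size_t   n;

    for (i = 0; i < 4; i++)         /* fold in the 4 changed bytes */
    {
        c = (c >> 8) ^ crc_table[(c ^ (delta & 0xFF)) & 0xFF];
        delta >>= 8;
    }
    for (n = offset + 4; n < blocklen; n++)  /* advance past trailing zeroes */
        c = (c >> 8) ^ crc_table[c & 0xFF];
    return old_crc ^ c;
}

int
main(void)
{
    uint8_t  page[8192];
    uint32_t before, delta = 0x00010004u;    /* pretend hint-bit flips */
    int      i;

    init_crc_table();
    memset(page, 0xAB, sizeof page);
    before = crc32_buf(page, sizeof page);

    for (i = 0; i < 4; i++)                  /* apply the change */
        page[100 + i] ^= (delta >> (8 * i)) & 0xFF;

    assert(crc32_update_word(before, sizeof page, 100, delta)
           == crc32_buf(page, sizeof page));
    printf("incremental CRC update matches full recomputation\n");
    return 0;
}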

If people are interested, I could bang out the code tonight, and post it 
to the list tomorrow.

Brian



Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Thu, Oct 2, 2008 at 9:07 AM, Brian Hurt <bhurt@janestcapital.com> wrote:
> I have a stupid question wrt hint bits and CRC checksums- it seems to me
> that it should be possible, if you change the hint bits, to be able to very
> easily calculate what the change in the CRC checksum should be.

Doesn't the problem still remain?  The problem being that the buffer
can be changed as it's written, yes?

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Brian Hurt
Date:
Jonah H. Harris wrote:
> On Thu, Oct 2, 2008 at 9:07 AM, Brian Hurt <bhurt@janestcapital.com> wrote:
>   
>> I have a stupid question wrt hint bits and CRC checksums- it seems to me
>> that it should be possible, if you change the hint bits, to be able to very
>> easily calculate what the change in the CRC checksum should be.
>>     
>
> Doesn't the problem still remain?  The problem being that the buffer
> can be changed as it's written, yes?
>
>   
Another possibility is to just not checksum the hint bits...

Brian




Re: Block-level CRC checks

From
Gregory Stark
Date:
"Jonah H. Harris" <jonah.harris@gmail.com> writes:

> On Thu, Oct 2, 2008 at 9:07 AM, Brian Hurt <bhurt@janestcapital.com> wrote:
>> I have a stupid question wrt hint bits and CRC checksums- it seems to me
>> that it should be possible, if you change the hint bits, to be able to very
>> easily calculate what the change in the CRC checksum should be.
>
> Doesn't the problem still remain?  The problem being that the buffer
> can be changed as it's written, yes?

It's even worse than that. Two processes can both be fiddling hint bits on
different tuples (or even the same tuple) at the same time.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Thu, Oct 2, 2008 at 9:36 AM, Brian Hurt <bhurt@janestcapital.com> wrote:
> Another possibility is to just not checksum the hint bits...

Seems like that would just complicate matters and prevent a viable checksum.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Thu, Oct 2, 2008 at 9:42 AM, Gregory Stark <stark@enterprisedb.com> wrote:
> It's even worse than that. Two processes can both be fiddling hint bits on
> different tuples (or even the same tuple) at the same time.

Agreed.  Back to the double-buffer idea, we could have a temporary
BLCKSZ-sized buffer we could use immediately before write(), which we
could copy the block to, perform the checksum on, and write out... is
that what you were thinking, Tom?

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Heikki Linnakangas
Date:
Brian Hurt wrote:
> Jonah H. Harris wrote:
>> On Thu, Oct 2, 2008 at 9:07 AM, Brian Hurt <bhurt@janestcapital.com> 
>> wrote:
>>> I have a stupid question wrt hint bits and CRC checksums- it seems to me
>>> that it should be possible, if you change the hint bits, to be able 
>>> to very
>>> easily calculate what the change in the CRC checksum should be.
>>
>> Doesn't the problem still remain?  The problem being that the buffer
>> can be changed as it's written, yes?
>>
> Another possibility is to just not checksum the hint bits...

That would work. But I'm afraid it'd make the implementation a lot more 
invasive, and also slower. The buffer manager would have to know what 
kind of a page it's dealing with, heap or index or FSM or what, to know 
where the hint bits are. Then it would have to follow the line pointers 
to locate the hint bits, and mask them out for the CRC calculation.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: Block-level CRC checks

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Brian Hurt wrote:
>> Another possibility is to just not checksum the hint bits...

> That would work. But I'm afraid it'd make the implementation a lot more 
> invasive, and also slower. The buffer manager would have to know what 
> kind of a page it's dealing with, heap or index or FSM or what, to know 
> where the hint bits are. Then it would have to follow the line pointers 
> to locate the hint bits, and mask them out for the CRC calculation.

Right.  The odds are that this'd actually be slower than the
double-buffer method, because of all the added complexity.  And it would
really suck from a modularity standpoint to have bufmgr know about all
that.

The problem we still have to solve is torn pages when writing back a
hint-bit update ...
        regards, tom lane


Re: Block-level CRC checks

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Brian Hurt wrote:
>>> Another possibility is to just not checksum the hint bits...
> 
>> That would work. But I'm afraid it'd make the implementation a lot more 
>> invasive, and also slower. The buffer manager would have to know what 
>> kind of a page it's dealing with, heap or index or FSM or what, to know 
>> where the hint bits are. Then it would have to follow the line pointers 
>> to locate the hint bits, and mask them out for the CRC calculation.
> 
> Right.  The odds are that this'd actually be slower than the
> double-buffer method, because of all the added complexity.

I was thinking that masking out the hint bits would be implemented by 
copying the page to the temporary buffer, ANDing out the hint bits 
there, and then calculating the CRC and issuing the write. So we'd still 
need to double-buffer.

> The problem we still have to solve is torn pages when writing back a
> hint-bit update ...

Not checksumming the hint bits *is* a solution to the torn page problem.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: Block-level CRC checks

From
Andrew Chernow
Date:
Jonah H. Harris wrote:
> On Thu, Oct 2, 2008 at 1:29 AM, Jonah H. Harris <jonah.harris@gmail.com> wrote:
>> I ran the regressions and several concurrent benchmark tests which
>> passed successfully, but I'm sure I'm missing quite a bit due to the
>> fact that it's late, it's just a quick hack, and I haven't gone
>> through the buffer manager locking code in a while.
> 
> Don't know how I missed this obvious one... should not be coding this
> late @ night :(
> 
> Patch updated.
> 

I read through this patch and am curious why 0xdeadbeef was used as an 
uninitialized value for the page crc.  Is this value somehow less likely 
to have collisions than zero (or any other arbitrary value)?

Would it not be better to add a boolean bit or byte to indicate the crc 
state?

-- 
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Thu, Oct 2, 2008 at 10:09 AM, Andrew Chernow <ac@esilo.com> wrote:
> I read through this patch and am curious why 0xdeadbeef was used as an
> uninitialized value for the page crc.  Is this value somehow less likely to
> have collisions than zero (or any other arbitrary value)?

It was just an arbitrary value I chose to identify non-checksummed
pages; I believe it would have the same collision rate as anything else.

> Would it not be better to add a boolean bit or byte to indicate the crc
> state?

Ideally, though we don't have any spare bits to play with in MAXALIGN=4.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Heikki Linnakangas
Date:
Jonah H. Harris wrote:
> On Thu, Oct 2, 2008 at 10:09 AM, Andrew Chernow <ac@esilo.com> wrote:
>> Would it not be better to add a boolean bit or byte to indicate the crc
>> state?
> 
> Ideally, though we don't have any spare bits to play with in MAXALIGN=4.

In the page header? There's plenty of free bits in pd_flags.

But isn't it a bit dangerous to have a single flag on the page 
indicating whether the CRC is valid or not? Any corruption that flips 
that bit would cause the CRC check to be skipped.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Thu, Oct 2, 2008 at 10:27 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>> Ideally, though we don't have any spare bits to play with in MAXALIGN=4.
>
> In the page header? There's plenty of free bits in pd_flags.

Ahh, didn't see that.  Good catch!

> But isn't it a bit dangerous to have a single flag on the page indicating
> whether the CRC is valid or not? Any corruption that flips that bit would
> cause the CRC check to be skipped.

Agreed.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> The problem we still have to solve is torn pages when writing back a
>> hint-bit update ...

> Not checksumming the hint bits *is* a solution to the torn page problem.

Yeah, but it has enough drawbacks that I'd like to keep looking for
alternatives.

One argument that I've not seen raised is that not checksumming the hint
bits leaves you open to a single-bit error that incorrectly sets a hint
bit.
        regards, tom lane


Re: Block-level CRC checks

From
Tom Lane
Date:
Andrew Chernow <ac@esilo.com> writes:
> I read through this patch and am curious why 0xdeadbeef was used as an 
> uninitialized value for the page crc.  Is this value somehow less likely 
> to have collisions than zero (or any other arbitrary value)?

Actually, because that's a favorite bit pattern for programs to fill
unused memory with, I'd venture that it has measurably HIGHER odds
of being bogus than any other bit pattern.  Consider the possibility
that a database page got overwritten with someone's core dump.

> Would it not be better to add a boolean bit or byte to indicate the crc 
> state?

No, as noted that would give you a one-in-two chance of incorrectly
skipping the CRC check, not one-in-2^32 or so.  If we're going to allow
a silent skip of the CRC check then a special value of CRC is a good way
to do it ... just not this particular one.
        regards, tom lane


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Thu, Oct 2, 2008 at 10:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Not checksumming the hint bits *is* a solution to the torn page problem.
>
> Yeah, but it has enough drawbacks that I'd like to keep looking for
> alternatives.

Agreed.

> One argument that I've not seen raised is that not checksumming the hint
> bits leaves you open to a single-bit error that incorrectly sets a hint
> bit.

Agreed.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
So, it comes down to two possible designs, each with its own set of challenges.

Just to see where to go from here... I want to make sure the options
I've seen in this thread are laid out clearly:

1. Hold an exclusive lock on the buffer during the call to smgrwrite
OR
2. Doublebuffer the write
OR
3. Do some crufty magic to ignore hint-bit updates

Because option 3 not only complicates the entire thing, but also makes
corruption more difficult to detect, I don't consider it viable.  Can
anyone provide a reason that makes this option viable?

Option 1 will prevent hint-bit updates during write, which means we
can checksum the buffer and not worry about it.  Also, is only the
buffer content lock required?  This could potentially slow down
concurrent transactions reading the block and/or writing hint bits.

Option #2 consists of copying the block to a temporary buffer,
checksumming it, and pushing the checksummed block down to write() (at
smgr/md/fd depending on where we want to perform the checksum).
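
Concretely, option #2 would look something like the sketch below (the
crc32_buffer() helper and the assumption that the checksum occupies the
first four bytes of the page are mine, not from the quick-hack patch):

#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

extern uint32_t crc32_buffer(const void *data, size_t len);  /* assumed helper */

static ssize_t
checksummed_write(int fd, const char *shared_page, off_t offset)
{
    static char local[BLCKSZ];      /* per-backend bounce buffer */
    uint32_t    crc;

    /*
     * Snapshot the page while the caller holds the buffer's share lock;
     * from here on, concurrent hint-bit twiddling in shared memory can
     * no longer race against the checksum we are about to compute.
     */
    memcpy(local, shared_page, BLCKSZ);

    /* checksum the private copy, skipping the checksum field itself */
    crc = crc32_buffer(local + sizeof(crc), BLCKSZ - sizeof(crc));
    memcpy(local, &crc, sizeof(crc));

    return pwrite(fd, local, BLCKSZ, offset);
}

The share lock can be released as soon as the memcpy() and checksum are
done, before the write() syscall itself.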

From my perspective, I prefer #2 and performing it at the smgr layer,
but I am open to suggestions.  Tom, what are your thoughts?  #1 isn't
very difficult, but I can see it potentially causing a number of
side-problems and it would require a fair amount of testing.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Tom Lane
Date:
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
> Just to see where to go from here... I want to make sure the options
> I've seen in this thread are laid out clearly:

> 1. Hold an exclusive lock on the buffer during the call to smgrwrite
> OR
> 2. Doublebuffer the write
> OR
> 3. Do some crufty magic to ignore hint-bit updates

Right, I think everyone agrees now that #2 seems like the most
preferable option for writing the checksum.  However, that still
leaves us lacking a solution for torn pages during a write that
follows a hint bit update.  We may end up with some "crufty
magic" anyway for dealing with that.
        regards, tom lane


Re: Block-level CRC checks

From
Robert Treat
Date:
On Tuesday 30 September 2008 17:17:10 Decibel! wrote:
> On Sep 30, 2008, at 1:48 PM, Heikki Linnakangas wrote:
> > Doesn't some filesystems include a per-block CRC, which would
> > achieve the same thing? ZFS?
>
> Sure, some do. We're on linux and can't run ZFS. And I'll argue that
> no linux FS is anywhere near as tested as ext3 is, which means that
> going to some other FS that offers you CRC means you're now exposing
> yourself to the possibility of issues with the FS itself. Not to
> mention that changing filesystems on a large production system is
> very painful.

Actually we had someone on irc yesterday explaining how they were able to run 
zfs on debian linux, so that option might be closer than you think. 

On a side note, I believe there are a couple of companies that do postgresql 
consulting that have pretty good experience running it atop solaris... just 
in case you guys ever do want to do a migration or something ;-)

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Tom Lane <tgl@sss.pgh.pa.us> [081002 11:40]:
> "Jonah H. Harris" <jonah.harris@gmail.com> writes:
> > Just to see where to go from here... I want to make sure the options
> > I've seen in this thread are laid out clearly:
> 
> > 1. Hold an exclusive lock on the buffer during the call to smgrwrite
> > OR
> > 2. Doublebuffer the write
> > OR
> > 3. Do some crufty magic to ignore hint-bit updates
> 
> Right, I think everyone agrees now that #2 seems like the most
> preferable option for writing the checksum.  However, that still
> leaves us lacking a solution for torn pages during a write that
> follows a hint bit update.  We may end up with some "crufty
> magic" anyway for dealing with that.

How does your current "write" strategy handle this situation?  I mean,
how do you currently guarantee that between when you call write() and
the kernel copies the buffer internally, no hint bits are updated?

#define write(fd, buf, count) buffer_crc_write(fd, buf, count)

Whatever protection you have on the regular write is sufficient.  The
time of the protection will need to start before the "buffer" period
instead of just the write (and maybe not the write syscall anymore), but
with CPU caches and speed, the buffer period should be <= the time of
the write() syscall...  Your fsync is your "on disk guarantee", not the
write, and that won't change.

But I thought you didn't really care about hint-bit updates, even in the
current strategy... but I'm fully ignorant about the code, sorry...

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
Joshua Drake
Date:
On Thu, 02 Oct 2008 11:57:30 -0400
Robert Treat <xzilla@users.sourceforge.net> wrote:

> Actually we had someone on irc yesterday explaining how they were
> able to run zfs on debian linux, so that option might be closer than
> you think. 

It's user mode. Not sure I would suggest that from a production server
perspective.

> 
> On a side note, I believe there are a couple of companies that do
> postgresql consulting that have pretty good experience running it
> atop solaris... just in case you guys ever do want to do a migration
> or something ;-)

:P

Joshua D. Drake 


-- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/




Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Thu, Oct 2, 2008 at 12:05 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
> How does your current "write" strategy handle this situation.  I mean,
> how do you currently guarnetee that between when you call write() and
> the kernel copies the buffer internally, no hint-bit are updated?

Working on the exact double-buffering technique now.

> #define write(fd, buf, count) buffer_crc_write(fd, buf, count)

I certainly wouldn't interpose the write() call itself; that's just
asking for trouble.

> whatever protection you have on the regular write is sufficient.  The
> time of the protection will need to start before the "buffer" period
> instead of just the write, (and maybe not the write syscall anymore) but
> with CPU caches and speed, the buffer period should be <= the time of
> the write() syscall...  Your fsync is your "on disk guarentee", not the
> write, and that won't change.

Agreed.

> But I thought you didn't really care about hint-bit updates, even in the
> current strategy... but I'm fully ignorant about the code, sorry...

The current implementation does not take it into account.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Jonah H. Harris <jonah.harris@gmail.com> [081002 12:43]:
> > #define write(fd, buf, count) buffer_crc_write(fd, buf, count)
>
> I certainly wouldn't interpose the write() call itself; that's just
> asking for trouble.

Of course not, that was only to show that whatever you currently protect
"write()" with is valid for protecting the buffer+write.

> > But I thought you didn't really care about hint-bit updates, even in the
> > current strategy... but I'm fully ignorant about the code, sorry...
>
> The current implementation does not take it into account.

So if PG currently doesn't care about the hint bits being updated during
the write, then why should introducing a double-buffer introduce the
torn-page problem Tom mentions?  I admit, I'm fishing for information
from those in the know, because I haven't been looking at the code long
enough (or all of it enough) to know all the ins-and-outs...

a.
--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Thu, Oct 2, 2008 at 12:51 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
>> > But I thought you didn't really care about hint-bit updates, even in the
>> > current strategy... but I'm fully ignorant about the code, sorry...
>>
>> The current implementation does not take it into account.
>
> So if PG currently doesn't care about the hint bits being updated during
> the write, then why should introducing a double-buffer introduce the
> torn-page problem Tom mentions?  I admit, I'm fishing for information
> from those in the know, because I haven't been looking at the code long
> enough (or all of it enough) to know all the ins-and-outs...

PG doesn't care because hint bits aren't logged, and during normal WAL
replay the old page will be pulled from the WAL.  I believe what Tom is
referring to is that the buffer PG sends to write() can still be
modified by way of SetHintBits between the time smgrwrite is called and
the time the actual write takes place, which is why we can't rely on a
checksum of the buffer pointer passed to smgrwrite and friends.

If we're double-buffering the write, I don't see where we could be
introducing a torn-page, as we'd actually be writing a copied version
of the buffer.  Will look into this.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Bruce Momjian
Date:
Jonah H. Harris wrote:
> PG doesn't care because hint bits aren't logged, and during normal
> WAL replay the old page will be pulled from the WAL.  I
> believe what Tom is referring to is that the buffer PG sends to
> write() can still be modified by way of SetHintBits between the time
> smgrwrite is called and the time the actual write takes place, which
> is why we can't rely on a checksum of the buffer pointer passed to
> smgrwrite and friends.
> 
> If we're double-buffering the write, I don't see where we could be
> introducing a torn-page, as we'd actually be writing a copied version
> of the buffer.  Will look into this.

The torn page is during kernel write to disk, I assume, so it is still
possible.

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

 + If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Bruce Momjian <bruce@momjian.us> [081002 13:07]:
> Jonah H. Harris wrote:
> > PG doesn't care because hint bits aren't logged, and during normal
> > WAL replay the old page will be pulled from the WAL.  I
> > believe what Tom is referring to is that the buffer PG sends to
> > write() can still be modified by way of SetHintBits between the time
> > smgrwrite is called and the time the actual write takes place, which
> > is why we can't rely on a checksum of the buffer pointer passed to
> > smgrwrite and friends.
> > 
> > If we're double-buffering the write, I don't see where we could be
> > introducing a torn-page, as we'd actually be writing a copied version
> > of the buffer.  Will look into this.
> 
> The torn page is during kernel write to disk, I assume, so it is still
> possible.

Ah, I see...  So your full-page-write in the WAL, protecting the torn
page, has to be aware of the need for a valid CRC32 as well...


-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Thu, Oct 2, 2008 at 1:07 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> If we're double-buffering the write, I don't see where we could be
>> introducing a torn-page, as we'd actually be writing a copied version
>> of the buffer.  Will look into this.
>
> The torn page is during kernel write to disk, I assume, so it is still
> possible.

Well, we can't really control too much of that.  The most common
solution to that I've seen is to double-write the page (which some
OSes already do regardless).  Or, are you meaning something else?

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Greg Stark
Date:

On 2 Oct 2008, at 05:51 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:

> So if PG currently doesn't care about the hint bits being updated
> during the write, then why should introducing a double-buffer
> introduce the torn-page problem Tom mentions?  I admit, I'm fishing
> for information from those in the know, because I haven't been looking
> at the code long enough (or all of it enough) to know all the
> ins-and-outs...

It's not the buffering, it's the checksum. The problem arises if a page
is read in but no WAL-logged modifications are done against it. If a
hint bit is modified it won't be WAL-logged, but the page is marked
dirty.

When we write the page there's a chance only part of the page actually
makes it to disk if the system crashes before the whole page is flushed.

WAL-logged changes are safe because of full_page_writes. Hint bits are
safe because either the old or the new value will be on disk and we
don't care which. It doesn't matter if some hint bits are set and some
aren't.

However the checksum won't match, because the checksum will have been
calculated on the whole block and part of it was never written.
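
(A toy illustration of this, in plain C with a stand-in checksum; any
real CRC behaves the same way.  One "hint bit" is set, the whole page
is checksummed, and then only the first 4k "reaches disk":)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

/* stand-in page checksum; a real CRC32 shows the same effect */
static uint32_t
sum32(const uint8_t *p, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s = s * 31 + p[i];
    return s;
}

int
main(void)
{
    uint8_t oldpg[BLCKSZ] = {0};
    uint8_t newpg[BLCKSZ];
    uint8_t diskpg[BLCKSZ];
    uint32_t crc;

    memcpy(newpg, oldpg, BLCKSZ);
    newpg[7000] |= 0x01;           /* "hint bit" set in the second half */
    crc = sum32(newpg, BLCKSZ);    /* checksum of the whole new page */

    /* crash mid-write: only the first 4096 bytes reach disk */
    memcpy(diskpg, newpg, 4096);
    memcpy(diskpg + 4096, oldpg + 4096, BLCKSZ - 4096);

    /* the data is logically fine (the old hint bit is OK), but... */
    printf("checksum matches: %s\n",
           sum32(diskpg, BLCKSZ) == crc ? "yes" : "no");
    return 0;
}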


Writing this explanation did bring to mind one solution which we had  
already discussed for other reasons: not marking blocks dirty after  
hint bit setting.

Alternatively, if we detect that a block is dirty but its LSN is older
than the last checkpoint, is that the only time we need to worry? Then
we could either discard the write or generate a no-op WAL record just
for the full-page write in that case.


Re: Block-level CRC checks

From
Bruce Momjian
Date:
Jonah H. Harris wrote:
> On Thu, Oct 2, 2008 at 1:07 PM, Bruce Momjian <bruce@momjian.us> wrote:
> >> If we're double-buffering the write, I don't see where we could be
> >> introducing a torn-page, as we'd actually be writing a copied version
> >> of the buffer.  Will look into this.
> >
> > The torn page is during kernel write to disk, I assume, so it is still
> > possible.
> 
> Well, we can't really control too much of that.  The most common
> solution to that I've seen is to double-write the page (which some
> OSes already do regardless).  Or, are you meaning something else?

I just don't see how writing a copy of the page (rather than the
original) to the kernel affects issues about torn pages.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
> It's not the buffering, it's the checksum. The problem arises if a page is
> read in but no WAL-logged modifications are done against it. If a hint bit
> is modified it won't be WAL-logged but the page is marked dirty.

Ahhhhh.  Thanks Greg.  Let me look into this a bit before I respond :)

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Greg Stark wrote:

> Writing this explanation did bring to mind one solution which we had  
> already discussed for other reasons: not marking blocks dirty after hint 
> bit setting.

How about when a hint bit is set and the page is not already dirty, set
the checksum to the "always valid" value?  The problem I have with this
idea is that there would be lots of pages excluded from the CRC checks,
a non-trivial percentage of the time.

Maybe we could mix this with Simon's approach to counting hint bit
setting, and calculate a valid CRC on the page every n-th non-logged
change.
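
(A rough sketch of the first idea; CRC_ALWAYS_VALID, PageSetCRC and
BufferIsDirty are invented names, not existing PostgreSQL calls:)

#define CRC_ALWAYS_VALID 0xFFFFFFFF    /* sentinel: "do not check me" */

static void
SetHintBits(Page page, Buffer buffer, uint16 infomask)
{
    if (!BufferIsDirty(buffer))
    {
        /* The page may now be written with no WAL backup behind it,
         * and a torn write would invalidate a real CRC -- so mark the
         * page as unchecked rather than risk a false alarm. */
        PageSetCRC(page, CRC_ALWAYS_VALID);
    }
    /* ... set the infomask bits and mark the buffer as before ... */
}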

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Greg Stark <greg.stark@enterprisedb.com> [081002 13:37]:
> It's not the buffering, it's the checksum. The problem arises if a page
> is read in but no WAL-logged modifications are done against it. If a
> hint bit is modified it won't be WAL-logged, but the page is marked
> dirty.
> 
> When we write the page there's a chance only part of the page actually  
> makes it to disk if the system crashes before the whole page is flushed.

Yup, Bruce's message pointed me in the right direction...

> Wal logged changes are safe because of full_page_writes. Hint bits are  
> safe because either the old or the new value will be on disk and we  
> don't care which. It doesn't matter if some hint bits are set and some  
> aren't.
> 
> However the checksum won't match because the checksum will have been  
> calculated on the whole block and part of it was never written.

Correct.  But doesn't full_page_writes give us the same protection
here against a half-write as it did in the previous case?

On recovery after a torn-page write, won't the recovery of the
full_page_write WAL + WAL changes get us back to the page as it was
before the buffer+checksum+write?  The checksum on that block *now in
memory* is irrelevant, because it wasn't "read from disk"; it was
completely reconstructed from WAL records, which are individually
protected by checksums themselves, and it is now ready to be "written
to disk", which will put a valid checksum on it.

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
Gregory Stark
Date:
Aidan Van Dyk <aidan@highrise.ca> writes:

>> Wal logged changes are safe because of full_page_writes. Hint bits are  
>> safe because either the old or the new value will be on disk and we  
>> don't care which. It doesn't matter if some hint bits are set and some  
>> aren't.
>> 
>> However the checksum won't match because the checksum will have been  
>> calculated on the whole block and part of it was never written.
>
> Correct.  But doesn't full_page_writes give us the same protection
> here against a half-write as it did in the previous case?
>
> On recovery after a torn-page write, won't the recovery of the
> full_page_write WAL + WAL changes get us back to the page as it was
> before the buffer+checksum+write?  

Hint bit setting doesn't trigger a WAL record.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's PostGIS support!


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Thu, Oct 2, 2008 at 1:58 PM, Gregory Stark <stark@enterprisedb.com> wrote:
>> On recovery after a torn-page write, won't the recovery of the
>> full_page_write WAL + WAL changes get us back to the page as it was
>> before the buffer+checksum+write?
>
> Hint bit setting doesn't trigger a WAL record.

Hence, no page image is written to WAL for later use in recovery.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Thu, Oct 2, 2008 at 1:44 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> How about when a hint bit is set and the page is not already dirty, set
> the checksum to the "always valid" value?  The problem I have with this
> idea is that there would be lots of pages excluded from the CRC checks,
> a non-trivial percentage of the time.

I don't like that because it trades off corruption detection (the
whole point of this feature) for a slight performance improvement.

> Maybe we could mix this with Simon's approach to counting hint bit
> setting, and calculate a valid CRC on the page every n-th non-logged
> change.

I still think we should only calculate checksums on the actual write.
And, this still seems to have an issue with WAL, unless Simon's
original idea somehow included recording hint bit settings/dirtying
the page in WAL.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Jonah H. Harris <jonah.harris@gmail.com> [081002 14:01]:
> On Thu, Oct 2, 2008 at 1:58 PM, Gregory Stark <stark@enterprisedb.com> wrote:
> >> On recovery after a torn-page write, won't the recovery of the
> >> full_page_write WAL + WAL changes get us back to the page as it was
> >> before the buffer+checksum+write?
> >
> > Hint bit setting doesn't trigger a WAL record.
> 
> Hence, no page image is written to WAL for later use in recovery.

OK.  Got it...  The block is dirty (only because of hint bits).  Write
starts, crash, torn page; recovery doesn't "fix" the torn page,
because it's never been changed (according to WAL), so on next read...

Without the CRC it doesn't matter, because the only change was
hint-bits, so the page is half-old + half-new, but new == old + only
hint-bits...

Because there's no WAL record, the torn page will be read next time
that buffer is needed...

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
Robert Treat
Date:
On Wednesday 01 October 2008 10:27:52 Tom Lane wrote:
> pgsql@mohawksoft.com writes:
> >> No, it's all about time penalties and loss of concurrency.
> >
> > I don't think that the amount of time it would take to calculate and test
> > the sum is even important. It may be in older CPUs, but these days CPUs
> > are so fast in RAM and a block is very small. On x86 systems, depending
> > on page alignment, we are talking about two or three pages that will be
> > "in memory" (They were used to read the block from disk or previously
> > accessed).
>
> Your optimism is showing ;-).  XLogInsert routinely shows up as a major
> CPU hog in any update-intensive test, and AFAICT that's mostly from the
> CRC calculation for WAL records.
>

Yeah... for those who run on filesystems that do checksumming for you, I'd
bet they'd much rather see time spent on making it possible to turn that
off than on checksumming everything else.  (Just guessing.)

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Robert Treat wrote:
> On Wednesday 01 October 2008 10:27:52 Tom Lane wrote:

> > Your optimism is showing ;-).  XLogInsert routinely shows up as a major
> > CPU hog in any update-intensive test, and AFAICT that's mostly from the
> > CRC calculation for WAL records.
> 
> Yeah... for those who run on filesystems that do checksumming for you, I'd
> bet they'd much rather see time spent on making it possible to turn that
> off than on checksumming everything else.  (Just guessing.)

I don't think it can be turned off, because ISTR a failed checksum is
used to detect end of the WAL stream to be recovered.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Jonah H. Harris wrote:
> On Thu, Oct 2, 2008 at 1:44 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
> > How about when a hint bit is set and the page is not already dirty, set
> > the checksum to the "always valid" value?  The problem I have with this
> > idea is that there would be lots of pages excluded from the CRC checks,
> > a non-trivial percentage of the time.
> 
> I don't like that because it trades off corruption detection (the
> whole point of this feature) for a slight performance improvement.

I agree that giving up corruption detection is not such a hot idea, but
what I'm intending to get back is not performance but correctness (in
this case, protection from the torn-page problem).

> > Maybe we could mix this with Simon's approach to counting hint bit
> > setting, and calculate a valid CRC on the page every n-th non-logged
> > change.
> 
> I still think we should only calculate checksums on the actual write.

Well, if we could trade off a bit of performance for correctness, I
would give up on that :-)  However, you're right that this tradeoff is
not the one we're facing here.

> And, this still seems to have an issue with WAL, unless Simon's
> original idea somehow included recording hint bit settings/dirtying
> the page in WAL.

I have to admit I don't remember exactly how it worked :-)  I think the
idea was avoiding setting the page dirty until a certain number of hint
bit setting operations had been done (which I think means it's not
useful for the present purpose).

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Alvaro Herrera <alvherre@commandprompt.com> [081002 16:18]:
> > And, this still seems to have an issue with WAL, unless Simon's
> > original idea somehow included recording hint bit settings/dirtying
> > the page in WAL.
> 
> I have to admit I don't remember exactly how it worked :-)  I think the
> idea was avoiding setting the page dirty until a certain number of hint
> bit setting operations had been done (which I think means it's not
> useful for the present purpose).

How crazy would I be to wonder about the performance impact of doing the
full_page_write xlog block backups at buffer dirtying time
(MarkBufferDirty and SetBufferCommitInfoNeedsSave), instead of at
XLogInsert?

A few thoughts, quite possibly not true:

* The xlog backup block records don't need to be synced to disk at the
  time of the dirtying; they can be synced along with stuff behind
  them, although they *need* to be synced by the time the buffer
  write() comes along, otherwise we haven't fixed our torn-page
  problem.  So practically, we may need to sync them for guarantees.

* OLAP workloads that handle bulk insert/update/delete are probably
  running with full_page_writes off, so they don't pay the penalty of
  extra xlog writing on all the hint bits being set.

* OLTP workloads with full_page_writes on would have some extra
  full-page writes, but I have a feeling that most dirtied buffers in
  OLTP systems are going to get changed by more than just hint bits
  soon enough anyway, so it's not a huge net increase.

* Slow systems that aren't high-velocity can probably spare a bit more
  xlog bandwidth anyway...


But my experience is limited to my small-scale databases...
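
(Concretely, something like this sketch is what I'm wondering about;
XLogBackupBlock() is a made-up helper, and the surrounding details are
elided:)

void
MarkBufferDirty(Buffer buffer)
{
    BufferDesc *bufHdr = &BufferDescriptors[buffer - 1];
    Page        page = BufferGetPage(buffer);

    if (!(bufHdr->flags & BM_DIRTY) && fullPageWrites &&
        XLByteLE(PageGetLSN(page), GetRedoRecPtr()))
    {
        /* first dirtying since the last checkpoint: back up the block
         * now, so that even hint-bit-only writes are covered against
         * torn pages */
        XLogBackupBlock(buffer);        /* made-up helper */
    }
    /* ... existing locking and flag-setting logic ... */
}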


-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
Simon Riggs
Date:
On Thu, 2008-10-02 at 16:18 -0400, Alvaro Herrera wrote:

> > > Maybe we could mix this with Simon's approach to counting hint bit
> > > setting, and calculate a valid CRC on the page every n-th non-logged
> > > change.
> > 
> > I still think we should only calculate checksums on the actual write.
> 
> Well, if we could trade off a bit of performance for correctness, I
> would give up on that :-)  However, you're right that this tradeoff is
> not what we're having here.
> 
> > And, this still seems to have an issue with WAL, unless Simon's
> > original idea somehow included recording hint bit settings/dirtying
> > the page in WAL.
> 
> I have to admit I don't remember exactly how it worked :-)  I think the
> idea was avoiding setting the page dirty until a certain number of hint
> bit setting operations had been done (which I think means it's not
> useful for the present purpose).

Having read everything so far, the only way I can see to solve the
problem does seem to be to make hint bit setting write WAL. When is the
question. Every time is definitely the wrong answer.

The hint bit setting approach so far is in two parts: we add code to
separate the concept of "dirty" from "has hint bits set". We already
have some logic that says when to write dirty pages. So we just add some
slightly different logic that says when to write hinted pages.

The main correctness of the idea has been worked out. The difficult part
is the "when to write hinted pages" decision, because it's just a
heuristic, subject to argument and lots of performance testing.
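
(In buffer-flag terms, a rough sketch of the separation; BM_HINTED is a
made-up name and the values are only illustrative:)

#define BM_DIRTY    (1 << 0)    /* WAL-covered changes: must be written */
#define BM_HINTED   (1 << 7)    /* made up: only hint bits changed, so
                                 * writing is optional and needs a
                                 * backup block logged first if CRCs
                                 * are enabled */

/* The writer keeps its existing rule for BM_DIRTY pages and applies a
 * separate heuristic to BM_HINTED-only pages, e.g. write them every
 * n-th opportunity, or when they are about to be evicted anyway. */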

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Block-level CRC checks

From
"Dawid Kuroczko"
Date:
On Thu, Oct 2, 2008 at 7:42 PM, Jonah H. Harris <jonah.harris@gmail.com> wrote:
>> It's not the buffering, it's the checksum. The problem arises if a page is
>> read in but no WAL-logged modifications are done against it. If a hint bit
>> is modified it won't be WAL-logged but the page is marked dirty.
>
> Ahhhhh.  Thanks Greg.  Let me look into this a bit before I respond :)

Hmm, how about, when reading a page:

    read the page
    if checksum mismatch {
        flip the hint bits [1]
        if checksum mismatch {
            ERROR
        } else {
            emit a warning, 'found a torn page'
        }
    }

...that is assuming we know which bits to flip
and that we accept the check will be a bit
weaker. :)  OTOH this shouldn't happen too
often, so performance shouldn't matter much.

My 0.02

Best regards,
    Dawid Kuroczko

[1]: Of course it would be more efficient to flip
the checksum, but it would be tricky. :)
-- 
Dawid Kuroczko <qnex42@gmail.com>
``The essence of real creativity is a certain playfulness, a flitting
from idea to idea without getting bogged down by fixated demands.''
                -- Sherkaner Underhill, A Deepness in the Sky, V. Vinge


Re: Block-level CRC checks

From
Decibel!
Date:
On Oct 1, 2008, at 2:03 PM, Sam Mason wrote:
> I know you said detecting memory errors wasn't being attempted, but
> bad memory accounts for a reasonable number of reports of database
> corruption on -general so I was wondering if moving the checks around
> could catch some of these.

Down the road, I would also like to have a sanity check for data
modifications that occur while the data is in a buffer, to guard
against memory or CPU errors. But the huge issue there is how to do
it without killing performance. Because there's no obvious solution
to that, I don't want to try to get it in for 8.4.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Re: Block-level CRC checks

From
Gregory Stark
Date:
"Dawid Kuroczko" <qnex42@gmail.com> writes:

> On Thu, Oct 2, 2008 at 7:42 PM, Jonah H. Harris <jonah.harris@gmail.com> wrote:
>
> if checksum mismatch {
>     flip the hint bits [1]

I did try to make something like that work. But I didn't get anywhere. There
could easily be dozens of bits to flip. The MaxHeapTuplesPerPage is over 200
and you could easily have half of them that don't match the checksum if the
writes happen in 4k chunks. If they happen in 512b chunks then you could have
a lot more. And yes they could easily have all been set at around the same
time because that's often just what a sequential scan does.

And you can't even just set the bits to their "correct" values either before
the checksum or before checking the checksum since the "correct" value changes
over time. By the time you compare the checksum more bits will be settable
than when the page was stored.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's PostGIS support!


Re: Block-level CRC checks

From
Decibel!
Date:
On Oct 2, 2008, at 3:18 PM, Alvaro Herrera wrote:
> I have to admit I don't remember exactly how it worked :-)  I think  
> the
> idea was avoiding setting the page dirty until a certain number of  
> hint
> bit setting operations had been done (which I think means it's not
> useful for the present purpose).


Well, it would be useful if whenever we magically decided it was time  
to write out a page that had only hint-bit updates we generated WAL,  
right? Even if it was just a no-op WAL record to ensure we had the  
page image in the WAL.

BTW, speaking of torn pages... I've heard that there are some serious
gains to be had by turning full_page_writes off, but I've never even
dreamed of doing that, because I've never seen any sure-fire way to
check that your hardware can't write torn pages. But if we had
checksums enabled and checked the checksums on a block the first time
we touched it during recovery, we'd be able to detect torn pages, yet
still recover. That would help show that torn pages aren't possible in
a particular environment (though unfortunately I don't think there's
any way to actually prove that they're not).
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Re: Block-level CRC checks

From
Brian Hurt
Date:
OK, I have a stupid question- torn pages are a problem, but only during 
recovery.  Recovery is (I assume) a fairly rare condition- if data 
corruption is going to happen, it's most likely to happen during normal 
operation.  So why not just turn off CRC checksumming during recovery, 
or at least treat it as a much less critical error?  During recovery, if 
the CRC checksum matches, we can assume the page is good- not only not 
corrupt, but not torn either.  If the CRC checksum doesn't match, we 
don't panic, but maybe we do more careful analysis of the page to make 
sure that only the hint bits are wrong.  Or maybe not.  It's only during 
normal operation that a CRC checksum failure would be considered critical.


Feel free to explain to me why I'm an idiot.

Brian



Re: Block-level CRC checks

From
"Dawid Kuroczko"
Date:
On Fri, Oct 3, 2008 at 3:36 PM, Brian Hurt <bhurt@janestcapital.com> wrote:
> OK, I have a stupid question- torn pages are a problem, but only during
> recovery.  Recovery is (I assume) a fairly rare condition- if data
> corruption is going to happen, it's most likely to happen during normal
> operation.  So why not just turn off CRC checksumming during recovery, or at
> least treat it as a much less critical error?  During recovery, if the CRC
> checksum matches, we can assume the page is good- not only not corrupt, but
> not torn either.  If the CRC checksum doesn't match, we don't panic, but
> maybe we do more careful analysis of the page to make sure that only the
> hint bits are wrong.  Or maybe not.  It's only during normal operation that
> a CRC checksum failure would be considered critical.

Well:
1. database half-writes the page X to disk, and there is a power outage.
2. we regain the power.
3. during recovery the database replays all WAL-logged pages.  The X page
   was not WAL-logged, thus it is not replayed.
4. when replaying is finished, everything looks OK at this point.
5. user runs a SELECT which hits page X.  Oops, we have a checksum
   mismatch.

Best regards,
    Dawid Kuroczko
-- 
Dawid Kuroczko <qnex42@gmail.com>
``The essence of real creativity is a certain playfulness, a flitting
from idea to idea without getting bogged down by fixated demands.''
                -- Sherkaner Underhill, A Deepness in the Sky, V. Vinge


Re: Block-level CRC checks

From
Bruce Momjian
Date:
Brian Hurt wrote:
> OK, I have a stupid question- torn pages are a problem, but only during 
> recovery.  Recovery is (I assume) a fairly rare condition- if data 
> corruption is going to happen, it's most likely to happen during normal 
> operation.  So why not just turn off CRC checksumming during recovery, 
> or at least treat it as a much less critical error?  During recovery, if 
> the CRC checksum matches, we can assume the page is good- not only not 
> corrupt, but not torn either.  If the CRC checksum doesn't match, we 
> don't panic, but maybe we do more careful analysis of the page to make 
> sure that only the hint bits are wrong.  Or maybe not.  It's only during 
> normal operation that a CRC checksum failure would be considered critical.

Interesting question.  The problem is that we don't read all pages in
during recovery.  One idea would be to WAL log the page numbers that
might be torn and recompute the checksums on those pages during
recovery.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Decibel! <decibel@decibel.org> [081002 19:18]:
> Well, it would be useful if whenever we magically decided it was time  
> to write out a page that had only hint-bit updates we generated WAL,  
> right? Even if it was just a no-op WAL record to ensure we had the  
> page image in the WAL.

Well, I'm by no means an expert in the code, but from my looking around
bufmgr and transam yesterday, it really looks like it would be a
modularity nightmare...

But I think that would have the same "total IO" effect as a no-op WAL
record being generated at the time the page is dirtied, which would seem
to fit the code a bit better...

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
Alvaro Herrera
Date:
So this discussion died with no solution arising to the
hint-bit-setting-invalidates-the-CRC problem.

Apparently the only solution in sight is to WAL-log hint bits.  Simon
opines it would be horrible from a performance standpoint to WAL-log
every hint bit set, and I think we all agree with that.  So we need to
find an alternative mechanism to WAL log hint bits.

I thought about making a process that's about to write a page check a
flag that says "this page has been dirtied by someone who didn't bother
to generate WAL".  If the flag is set, then the writer process is forced
to write a WAL record containing all hint bits in the page, and only
then is it allowed to write the page (and thus calculate the new CRC).
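
(A sketch of the write path that this implies; PD_UNLOGGED_CHANGE,
log_hint_bits(), PageSetCRC() and the flag accessors are invented
names:)

static void
FlushBuffer(BufferDesc *buf, SMgrRelation reln)
{
    Page        page = (Page) BufHdrGetBlock(buf);

    if (PageHasFlag(page, PD_UNLOGGED_CHANGE))
    {
        /* someone set hint bits with no WAL behind them: log them
         * (forcing a backup block if needed) before the page can hit
         * disk with a real CRC on it */
        XLogRecPtr  recptr = log_hint_bits(page);

        XLogFlush(recptr);
        PageClearFlag(page, PD_UNLOGGED_CHANGE);
    }

    PageSetCRC(page);           /* now safe to stamp the checksum */
    /* ... existing smgrwrite() path ... */
}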

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Fri, Oct 17, 2008 at 11:26 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> So this discussion died with no solution arising to the
> hint-bit-setting-invalidates-the-CRC problem.

I've been busy.

> Apparently the only solution in sight is to WAL-log hint bits.  Simon
> opines it would be horrible from a performance standpoint to WAL-log
> every hint bit set, and I think we all agree with that.  So we need to
> find an alternative mechanism to WAL log hint bits.

Agreed.

> I thought about making a process that's about to write a page check a
> flag that says "this page has been dirtied by someone who didn't bother
> to generate WAL".  If the flag is set, then the writer process is forced
> to write a WAL record containing all hint bits in the page, and only
> then is it allowed to write the page (and thus calculate the new CRC).

Interesting idea... let me ponder it for a bit.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Greg Stark
Date:
I'm far from convinced WAL-logging hint bits is a non-starter. In fact
I doubt the WAL logging itself is a problem. Having to take the buffer
lock does suck, though.

Heikki had a clever idea earlier, which was to have two CRC checks: one
which skips the hint bits and one dedicated to the hint bits. If the
second doesn't match, we clear all the hint bits.

The problem with that is that skipping the hint bits for the main CRC
would slow it down severely. It would make a lot of sense if the hint
bits were all in a contiguous block of memory, but I can't see how to
make that add up.
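
(To put the cost in concrete terms, a sketch; hint_mask[] is a
hypothetical per-page map of which bits are hint bits, and it would
have to be rebuilt from the line pointers, since the hint bits are not
at fixed offsets:)

/* Every byte needs a mask lookup and an AND before it can enter the
 * CRC, instead of a straight table-driven run over the raw buffer. */
uint32      crc = 0xFFFFFFFF;
int         i;

for (i = 0; i < BLCKSZ; i++)
{
    uint8   b = page[i] & ~hint_mask[i];

    crc = crc_table[(crc ^ b) & 0xFF] ^ (crc >> 8);
}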

greg

On 17 Oct 2008, at 05:42 PM, "Jonah H. Harris"  
<jonah.harris@gmail.com> wrote:

> On Fri, Oct 17, 2008 at 11:26 AM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
>> So this discussion died with no solution arising to the
>> hint-bit-setting-invalidates-the-CRC problem.
>
> I've been busy.
>
>> Apparently the only solution in sight is to WAL-log hint bits.  Simon
>> opines it would be horrible from a performance standpoint to WAL-log
>> every hint bit set, and I think we all agree with that.  So we need  
>> to
>> find an alternative mechanism to WAL log hint bits.
>
> Agreed.
>
>> I thought about causing a process that's about to write a page  
>> check a
>> flag that says "this page has been dirtied by someone who didn't  
>> bother
>> to generate WAL".  If the flag is set, then the writer process is  
>> forced
>> to write a WAL record containing all hint bits in the page, and only
>> then it is allowed to write the page (and thus calculate the new  
>> CRC).
>
> Interesting idea... let me ponder it for a bit.
>
> -- 
> Jonah H. Harris, Senior DBA
> myYearbook.com


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Fri, Oct 17, 2008 at 12:05 PM, Greg Stark
<greg.stark@enterprisedb.com> wrote:
> Heikki had a clever idea earlier, which was to have two CRC checks: one
> which skips the hint bits and one dedicated to the hint bits. If the
> second doesn't match, we clear all the hint bits.

Sounds overcomplicated to me.

> The problem with that is that skipping the hint bits for the main CRC
> would slow it down severely. It would make a lot of sense if the hint
> bits were all in a contiguous block of memory, but I can't see how to
> make that add up.

Agreed.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Greg Stark <greg.stark@enterprisedb.com> [081017 12:05]:
> I'm far from convinced wal logging hint bits is a non starter. In fact  
> I doubt the wal log itself I a problem. Having to take the buffer lock  
> does suck though.

And remember, you don't even need to WAL-log all the hint-bit sets...
You only *need* to get a WAL backup of the block at *any* point before
the block is written, to keep a torn page on recovery from leading to
an inconsistent CRC.

a.
-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
Simon Riggs
Date:
On Fri, 2008-10-17 at 12:26 -0300, Alvaro Herrera wrote:

> Apparently the only solution in sight is to WAL-log hint bits.  Simon
> opines it would be horrible from a performance standpoint to WAL-log
> every hint bit set, and I think we all agree with that.  So we need to
> find an alternative mechanism to WAL log hint bits.

Yes, it's clearly not acceptable bit by bit.

But perhaps writing a single WAL record if you scan whole page and set
all bits at once. Then it makes sense in some cases.

It might be possible to have a partial solution where some blocks have
CRC checks, some not. Most databases have static portions. Any block not
touched for X amount of time (~= the distance between the current LSN
and the LSN on the block) could have CRC checks added.

Or maybe just make it a table-level option and let users choose if they
want the hit or not.

Or maybe have a new command that you can run whenever you want to set
CRC checks. That way you get to choose. CHECK TABLE?
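
(The LSN-distance idea in sketch form; crc_min_age is a hypothetical
GUC, and the linearization of xlogid/xrecoff is deliberately crude --
real code would account for the actual bytes per xlogid:)

static uint64 crc_min_age = 16 * 1024 * 1024;   /* illustrative: 16MB */

static bool
block_wants_crc(Page page, XLogRecPtr current)
{
    XLogRecPtr  lsn = PageGetLSN(page);
    uint64      cur = ((uint64) current.xlogid << 32) | current.xrecoff;
    uint64      old = ((uint64) lsn.xlogid << 32) | lsn.xrecoff;

    /* block untouched for "long enough": start checksumming it */
    return (cur - old) > crc_min_age;
}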

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Simon Riggs wrote:
> 
> On Fri, 2008-10-17 at 12:26 -0300, Alvaro Herrera wrote:
> 
> > Apparently the only solution in sight is to WAL-log hint bits.  Simon
> > opines it would be horrible from a performance standpoint to WAL-log
> > every hint bit set, and I think we all agree with that.  So we need to
> > find an alternative mechanism to WAL log hint bits.
> 
> Yes, it's clearly not acceptable bit by bit.
> 
> But perhaps writing a single WAL record if you scan whole page and set
> all bits at once. Then it makes sense in some cases.

Yeah, I thought about that too -- and perhaps give the scan some slop,
so that it will also update some more hint bits that would otherwise be
updated in the next, say, 100 transactions.  However, this seems messier
than the other idea.

> It might be possible to have a partial solution where some blocks have
> CRC checks, some not.

That's another idea but it reduces the effectiveness of the check.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
Simon Riggs
Date:
On Fri, 2008-10-17 at 13:59 -0300, Alvaro Herrera wrote:

> > It might be possible to have a partial solution where some blocks have
> > CRC checks, some not.
> 
> That's another idea but it reduces the effectiveness of the check.

You could put in a GUC to control the check, block by block: 0 = check
every time, with full impact; other values delay the use of CRC checks.
Kind of like the freezing parameters.  Let people choose.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Block-level CRC checks

From
Greg Stark
Date:
I don't think that works anyway. No matter how thoroughly you update
all the hint bits, there's still a chance someone else comes along and
sets one you missed, or is setting hint bits on the same tuple at the
same time so that your update gets lost.

greg

On 17 Oct 2008, at 06:59 PM, Alvaro Herrera  
<alvherre@commandprompt.com> wrote:

> Simon Riggs wrote:
>>
>> On Fri, 2008-10-17 at 12:26 -0300, Alvaro Herrera wrote:
>>
>>> Apparently the only solution in sight is to WAL-log hint bits.   
>>> Simon
>>> opines it would be horrible from a performance standpoint to WAL-log
>>> every hint bit set, and I think we all agree with that.  So we  
>>> need to
>>> find an alternative mechanism to WAL log hint bits.
>>
>> Yes, it's clearly not acceptable bit by bit.
>>
>> But perhaps writing a single WAL record if you scan whole page and  
>> set
>> all bits at once. Then it makes sense in some cases.
>
> Yeah, I thought about that too -- and perhaps give the scan some slop,
> so that it will also update some more hint bits that would otherwise be
> updated in the next, say, 100 transactions.  However, this seems messier
> than the other idea.
>
>> It might be possible to have a partial solution where some blocks  
>> have
>> CRC checks, some not.
>
> That's another idea but it reduces the effectiveness of the check.
>
> -- 
> Alvaro Herrera                                http://www.CommandPrompt.com/
> The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
Markus Wanner
Date:
Hi,

Alvaro Herrera wrote:
> So this discussion died with no solution arising to the
> hint-bit-setting-invalidates-the-CRC problem.

Isn't double-buffering solving this issue? Has somebody checked if it
even helps performance due to being able to release the lock on the
buffer *before* the syscall?

Regards

Markus Wanner



Re: Block-level CRC checks

From
Heikki Linnakangas
Date:
Markus Wanner wrote:
> Alvaro Herrera wrote:
>> So this discussion died with no solution arising to the
>> hint-bit-setting-invalidates-the-CRC problem.
> 
> Isn't double-buffering solving this issue? Has somebody checked if it
> even helps performance due to being able to release the lock on the
> buffer *before* the syscall?

Double-buffering helps with the hint bit issues within shared buffers, 
but doesn't help with the torn page and hint bits problem. The torn page 
problem seems to be the show-stopper.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Block-level CRC checks

From
Paul Schlie
Date:
Alvaro Herrera wrote:
> So this discussion died with no solution arising to the
> hint-bit-setting-invalidates-the-CRC problem.

Is there no point at which a page is logically committed to
storage, past which no mutating access may be performed?




Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Simon Riggs wrote:

> But perhaps writing a single WAL record if you scan whole page and set
> all bits at once. Then it makes sense in some cases.

So this is what I ended up doing; attached.

There are some gotchas in this patch:

1. it does not consider hint bits other than the ones defined in htup.h.
Some index AMs use hint bits to "kill" tuples (LP_DEAD mostly, I think).
This means that CRCs will be broken for such pages when pages are torn.

2. some parts of the code could be considered modularity violations.
For example, tqual.c is setting a bit in a Page structure; bufmgr.c is
later checking that bit to determine when to log.

3. the bgwriter is seen writing WAL entries at checkpoint.  At shutdown,
this might cause an error to be reported on how there was not supposed
to be activity on the log.  I didn't save the exact error report and I
can't find it in the source :-(


So it "mostly works" at this time.  I very much welcome opinions to
improve the weak points.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Attachment

Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> So this is what I ended up doing; attached.

Oh, another thing.  The contents of the WAL message here are very
simplistic: just store all the relevant t_infomask and t_infomask2 bits,
for all the tuples in the page.  A possible optimization to reduce the
WAL traffic is to add another infomask bit which indicates whether a
hint bit has been set since the last time we visited the page.  I'm
unsure if this is worth the pain.  (Another possibility, even more
painful, is to choose at runtime between the two formats, depending on
the number of tuples that need hint bits logged.)

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> There are some gotchas in this patch:
> 
> 1. it does not consider hint bits other than the ones defined in htup.h.
> Some index AMs use hint bits to "kill" tuples (LP_DEAD mostly, I think).
> This means that CRCs will be broken for such pages when pages are torn.

The "other hint bits" are:

- LP_DEAD as used by the various callers of ItemIdMarkDead.
- PD_PAGE_FULL
- BTPageOpaque->btpo_flags and btpo_cycleid

All of them are changed with only SetBufferCommitInfoNeedsSave being
called afterwards.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> 3. the bgwriter is seen writing WAL entries at checkpoint.  At shutdown,
> this might cause an error to be reported on how there was not supposed
> to be activity on the log.  I didn't save the exact error report and I
> can't find it in the source :-(

LOG:  received fast shutdown request
LOG:  aborting any active transactions
FATAL:  terminating connection due to administrator command
LOG:  autovacuum launcher shutting down
LOG:  shutting down
LOG:  INSERT @ 0/67F05F0: prev 0/67F05C0; xid 0: Heap2 - hintbits: rel 1663/16384/1259; blk 4
CONTEXT:  writing block 4 of relation 1663/16384/1259
LOG:  xlog flush request 0/67F06C0; write 0/0; flush 0/0
CONTEXT:  writing block 4 of relation 1663/16384/1259
LOG:  INSERT @ 0/67F06C0: prev 0/67F05F0; xid 0: Heap2 - hintbits: rel 1663/16384/2608; blk 40
CONTEXT:  writing block 40 of relation 1663/16384/2608
LOG:  xlog flush request 0/67F0708; write 0/67F06C0; flush 0/67F06C0
CONTEXT:  writing block 40 of relation 1663/16384/2608
LOG:  INSERT @ 0/67F0708: prev 0/67F06C0; xid 0: Heap2 - hintbits: rel 1663/16384/1249; blk 29
CONTEXT:  writing block 29 of relation 1663/16384/1249
LOG:  xlog flush request 0/67F0808; write 0/67F0708; flush 0/67F0708
CONTEXT:  writing block 29 of relation 1663/16384/1249
LOG:  INSERT @ 0/67F0808: prev 0/67F0708; xid 0: XLOG - checkpoint: redo 0/67F05F0; tli 1; xid 0/9093; oid 90132; multi 1; offset 0; shutdown
LOG:  xlog flush request 0/67F0850; write 0/67F0808; flush 0/67F0808
PANIC:  concurrent transaction log activity while database system is shutting down
LOG:  background writer process (PID 17411) was terminated by signal 6: Aborted

I am completely at a loss what to do here.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> The "other hint bits" are:
> 
> - LP_DEAD as used by the various callers of ItemIdMarkDead.
> - PD_PAGE_FULL
> - BTPageOpaque->btpo_flags and btpo_cycleid
> 
> All of them are changed with only SetBufferCommitInfoNeedsSave being
> called afterwards.

I think we could get away with WAL-logging LP_DEAD via ItemIdMarkDead
similar to what is done to SetHintBits in the posted patch, and cope
with the rest by marking the page with the invalid checksum; they are
not so frequent anyway so the reliability loss is low.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
Gregory Stark
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:

> Alvaro Herrera wrote:
>
>> The "other hint bits" are:
>> 
>> - LP_DEAD as used by the various callers of ItemIdMarkDead.
>> - PD_PAGE_FULL
>> - BTPageOpaque->btpo_flags and btpo_cycleid
>> 
>> All of them are changed with only SetBufferCommitInfoNeedsSave being
>> called afterwards.
>
> I think we could get away with WAL-logging LP_DEAD via ItemIdMarkDead
> similar to what is done to SetHintBits in the posted patch, and cope
> with the rest by marking the page with the invalid checksum; they are
> not so frequent anyway so the reliability loss is low.

If PD_PAGE_FULL is set, and that causes the CRC to be set to the invalid
sum, will we ever get another chance to set it?

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!


Re: Block-level CRC checks

From
Zdenek Kotala
Date:
Alvaro Herrera wrote:
> Simon Riggs wrote:
> 
>> But perhaps writing a single WAL record if you scan whole page and set
>> all bits at once. Then it makes sense in some cases.
> 
> So this is what I ended up doing; attached.
> 
> There are some gotchas in this patch:
> 

Please, DO NOT MOVE position of page version in PageHeader structure! And
PG_PAGE_LAYOUT_VERSION should be bumped to 5.

    Thanks,
    Zdenek


-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Thu, Oct 30, 2008 at 10:33 AM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote:
> Please, DO NOT MOVE position of page version in PageHeader structure! And
> PG_PAGE_LAYOUT_VERSION should be bumped to 5.

Umm, any in-place upgrade should be capable of handling changes to the
page header.  Or, did I miss something significant in the in-place
upgrade design?

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
"Joshua D. Drake"
Date:
Jonah H. Harris wrote:
> On Thu, Oct 30, 2008 at 10:33 AM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote:
>> Please, DO NOT MOVE position of page version in PageHeader structure! And
>> PG_PAGE_LAYOUT_VERSION should be bumped to 5.
> 
> Umm, any in-place upgrade should be capable of handling changes to the
> page header.  Or, did I miss something significant in the in-place

I thought that was kind of the point of in-place upgrade.

Joshua D. Drake


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Zdenek Kotala wrote:
> Alvaro Herrera wrote:
>> Simon Riggs wrote:
>>
>>> But perhaps writing a single WAL record if you scan whole page and set
>>> all bits at once. Then it makes sense in some cases.
>>
>> So this is what I ended up doing; attached.
>
> Please, DO NOT MOVE position of page version in PageHeader structure!

Hmm.  The only way I see we could do that is to modify the checksum
struct member to a predefined value before calculating the page's
checksum.

Ah, actually there's another alternative -- leave the checksum on its
current position (start of struct) and move other members below
pg_pagesize_version (leaning towards pd_tli and pd_flags).  That'd leave
the page version in the same position.

(Hmm, maybe it's better to move pd_lower and pd_upper?)

> And PG_PAGE_LAYOUT_VERSION should be bumped to 5.

Easily done; thanks for the reminder.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Attachment

Re: Block-level CRC checks

From
Gregory Stark
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:

> Jonah H. Harris wrote:
>> On Thu, Oct 30, 2008 at 10:33 AM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote:
>>> Please, DO NOT MOVE position of page version in PageHeader structure! And
>>> PG_PAGE_LAYOUT_VERSION should be bumped to 5.
>>
>> Umm, any in-place upgrade should be capable of handling changes to the
>> page header.  Or, did I miss something significant in the in-place
>
> I thought that was kind of the point of in place upgrade.

Sure, but he has to have a reliable way to tell what version of the page
header he's looking at...

What I'm wondering though -- are we going to make CRCs mandatory? Or set aside
the 4 bytes even if you're not using them? Because if the size of the page
header varies depending on whether you're using CRCs that sounds like it would
be quite a pain.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!


Re: Block-level CRC checks

From
Tom Lane
Date:
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
> On Thu, Oct 30, 2008 at 10:33 AM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote:
>> Please, DO NOT MOVE position of page version in PageHeader structure! And
>> PG_PAGE_LAYOUT_VERSION should be bumped to 5.

> Umm, any in-place upgrade should be capable of handling changes to the
> page header.

Well, yeah, but it has to be able to tell which version it's dealing
with.  I quite agree with Zdenek that keeping the version indicator
in a fixed location is appropriate.
        regards, tom lane


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Thu, Oct 30, 2008 at 11:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Jonah H. Harris" <jonah.harris@gmail.com> writes:
>> On Thu, Oct 30, 2008 at 10:33 AM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote:
>>> Please, DO NOT MOVE position of page version in PageHeader structure! And
>>> PG_PAGE_LAYOUT_VERSION should be bumped to 5.
>
>> Umm, any in-place upgrade should be capable of handling changes to the
>> page header.
>
> Well, yeah, but it has to be able to tell which version it's dealing
> with.  I quite agree with Zdenek that keeping the version indicator
> in a fixed location is appropriate.

Most of the other databases I've worked on, which don't have different
types of pages, put the page version as the first element of the page.
That would let us put the CRC right after it.  Thoughts?

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Gregory Stark wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> 
> > Alvaro Herrera wrote:
> >
> >> The "other hint bits" are:
> >> 
> >> - LP_DEAD as used by the various callers of ItemIdMarkDead.
> >> - PD_PAGE_FULL
> >> - BTPageOpaque->btpo_flags and btpo_cycleid
> >> 
> >> All of them are changed with only SetBufferCommitInfoNeedsSave being
> >> called afterwards.
> >
> > I think we could get away with WAL-logging LP_DEAD via ItemIdMarkDead
> > similar to what is done to SetHintBits in the posted patch, and cope
> > with the rest by marking the page with the invalid checksum; they are
> > not so frequent anyway so the reliability loss is low.
> 
> If PD_PAGE_FULL is set and that causes the crc to be set to the invalid sum
> will we ever get another chance to set it?

I should have qualified that a bit more.  It's not setting PD_PAGE_FULL
that's not logged, but clearing it (heap_prune_page, line 282).  It's
set in heap_update.

Hmm, oh I see another problem here -- the bit is not restored when
replaying heap_update's WAL record.  I'm now wondering what other bits
are set without much care about correctly restoring them during replay.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
Tom Lane
Date:
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
> On Thu, Oct 30, 2008 at 11:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, yeah, but it has to be able to tell which version it's dealing
>> with.  I quite agree with Zdenek that keeping the version indicator
>> in a fixed location is appropriate.

> Most of the other databases I've worked on, which don't have different
> types of pages, put the page version as the first element of the page.
> That would let us put the CRC right after it.  Thoughts?

"Fixed location" does not mean "let's move it".
        regards, tom lane


Re: Block-level CRC checks

From
Zdenek Kotala
Date:
Jonah H. Harris wrote:
> On Thu, Oct 30, 2008 at 10:33 AM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote:
>> Please, DO NOT MOVE position of page version in PageHeader structure! And
>> PG_PAGE_LAYOUT_VERSION should be bumped to 5.
> 
> Umm, any in-place upgrade should be capable of handling changes to the
> page header.  Or, did I miss something significant in the in-place
> upgrade design?

Not just any change. If you move the page header version field to another
position, it will require some kind of magic to detect which version it
is. Other fields you can place anywhere :-), but do not touch the page
version. It would bring a lot of problems...

Zdenek


-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Thu, Oct 30, 2008 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Jonah H. Harris" <jonah.harris@gmail.com> writes:
>> On Thu, Oct 30, 2008 at 11:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Well, yeah, but it has to be able to tell which version it's dealing
>>> with.  I quite agree with Zdenek that keeping the version indicator
>>> in a fixed location is appropriate.
>
>> Most of the other databases I've worked on, which don't have different
>> types of pages, put the page version as the first element of the page.
>>  That would let us put the crc right after it.  Thoughts?
>
> "Fixed location" does not mean "let's move it".

Just trying to be helpful.  Just thought I might give some insight as
to what others, who had implemented in-place upgrade functionality
years before Postgres' existence, had done.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Gregory Stark
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:

> Ah, actually there's another alternative -- leave the checksum on its
> current position (start of struct) and move other members below
> pg_pagesize_version (leaning towards pd_tli and pd_flags).  That'd leave
> the page version in the same position.

I don't understand why the position of anything matters here. Look at TCP
packets, for instance: the checksum is not at the beginning or end of
anything.

The CRC is chosen such that if you CRC the resulting packet, including the
CRC, you get a CRC of 0. That can be done for whatever offset the CRC
appears at, I believe.
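
(The property is easy to demonstrate with a plain, non-reflected CRC-32
using a zero initial value and no final XOR.  Note that popular
variants such as zlib's crc32() reflect bits and XOR the result, so
for those the check value is a fixed nonzero residue rather than 0.
Toy program:)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define POLY 0x04C11DB7u        /* the CRC-32 generator polynomial */

/* plain MSB-first CRC-32: zero init, no reflection, no final XOR --
 * the variant for which "data plus its own CRC checks to zero" */
static uint32_t
crc32_plain(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0;
    for (size_t i = 0; i < len; i++)
    {
        crc ^= (uint32_t) buf[i] << 24;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ POLY : crc << 1;
    }
    return crc;
}

int
main(void)
{
    uint8_t  page[16] = "fifteen bytes..";
    uint8_t  buf[sizeof(page) + 4];
    uint32_t crc = crc32_plain(page, sizeof(page));

    memcpy(buf, page, sizeof(page));
    buf[sizeof(page) + 0] = (uint8_t) (crc >> 24);  /* append big-endian */
    buf[sizeof(page) + 1] = (uint8_t) (crc >> 16);
    buf[sizeof(page) + 2] = (uint8_t) (crc >> 8);
    buf[sizeof(page) + 3] = (uint8_t) crc;

    /* prints residue = 0x00000000: data plus its CRC checks to zero */
    printf("residue = 0x%08X\n", crc32_plain(buf, sizeof(buf)));
    return 0;
}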


-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!


Re: Block-level CRC checks

From
Zdenek Kotala
Date:
Alvaro Herrera wrote:
> Zdenek Kotala wrote:
>> Alvaro Herrera wrote:
>>> Simon Riggs wrote:
>>>
>>>> But perhaps writing a single WAL record if you scan whole page and set
>>>> all bits at once. Then it makes sense in some cases.
>>> So this is what I ended up doing; attached.
>> Please, DO NOT MOVE position of page version in PageHeader structure!
> 
> Hmm.  The only way I see we could do that is to modify the checksum
> struct member to a predefined value before calculating the page's
> checksum.
> 
> Ah, actually there's another alternative -- leave the checksum on its
> current position (start of struct) and move other members below
> pg_pagesize_version (leaning towards pd_tli and pd_flags).  That'd leave
> the page version in the same position.
> 
> (Hmm, maybe it's better to move pd_lower and pd_upper?)

No, please, keep pd_lower and pd_upper in the same position. They are
accessed more often than pd_tli and pd_flags. It is better for
optimization.

By the way, do you need the CRC as the first page member? Is it for future
development, like CLOG integration into buffers? Why not put it at the end
and mark it as a special? It would reduce the space requirement when CRC
is not enabled.

    Zdenek


-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: Block-level CRC checks

From
Tom Lane
Date:
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> By the way, do you need CRC as a first page member? Is it for future development 
> like CLOG integration into buffers? Why not put it on the end as and mark it as 
> a special? It will reduce space requirement when CRC is not enabled.

... and make life tremendously more complex for indexes, plus turning
CRC checking on or off on-the-fly would be problematic.  I think Alvaro
has the right idea: just put the field there all the time.
        regards, tom lane


Re: Block-level CRC checks

From
Zdenek Kotala
Date:
Tom Lane wrote:
> Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
>> By the way, do you need CRC as a first page member? Is it for future development 
>> like CLOG integration into buffers? Why not put it on the end as and mark it as 
>> a special? It will reduce space requirement when CRC is not enabled.
> 
> ... and make life tremendously more complex for indexes, 

Indexes use the PageGetSpecial macro and live with it, and PageInit could
do the correct placement. The only problems are the assert macros and the
extra check which verifies the correct size of the special space.

> plus turning
> CRC checking on or off on-the-fly would be problematic.  

Yeah, that is a problem.

> I think Alvaro
> has the right idea: just put the field there all the time.

Agreed.

Zdenek



-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Gregory Stark wrote:

> What I'm wondering though -- are we going to make CRCs mandatory? Or set aside
> the 4 bytes even if you're not using them? Because if the size of the page
> header varies depending on whether you're using CRCs that sounds like it would
> be quite a pain.

Not mandatory, but the space needs to be set aside.  (Otherwise you
couldn't turn it on after running with it turned off, which would rule
out using the database after initdb).

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Thu, Oct 30, 2008 at 12:14 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Gregory Stark escribió:
>
>> What I'm wondering though -- are we going to make CRCs mandatory? Or set aside
>> the 4 bytes even if you're not using them? Because if the size of the page
>> header varies depending on whether you're using CRCs that sounds like it would
>> be quite a pain.
>
> Not mandatory, but the space needs to be set aside.  (Otherwise you
> couldn't turn it on after running with it turned off, which would rule
> out using the database after initdb).

Agreed.

--
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> Hmm, oh I see another problem here -- the bit is not restored when
> replayed heap_update's WAL record.  I'm now wondering what other bits
> are set without much care about correctly restoring them during replay.

I'm now wondering whether it'd be easier to just ignore pd_flags in
calculating the checksum.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
Martijn van Oosterhout
Date:
On Thu, Oct 30, 2008 at 03:41:17PM +0000, Gregory Stark wrote:
> The CRC is chosen such that if you CRC the resulting packet, including the CRC
> you get a CRC of 0. That can be done for whatever offset the CRC appears at I
> believe.

IIRC, you calculate the CRC-32 of the page, then XOR it over where it's
supposed to end up. No need to preseed (or more accurately, it doesn't
matter how you preseed, the result is the same).

For checking it doesn't matter either, just checksum the page and if
you get zero it's correct.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Block-level CRC checks

From
Florian Weimer
Date:
* Greg Stark:

> Wal logged changes are safe because of full_page_writes. Hint bits are
> safe because either the old or the new value will be on disk and we
> don't care which.

Is this really true with all disks?  IBM's DTLA disks didn't behave
that way (an interrupted write could zero a sector), and I think the
text book algorithms don't assume this behavior, either.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> Alvaro Herrera wrote:
>
> > Hmm, oh I see another problem here -- the bit is not restored when
> > replaying heap_update's WAL record.  I'm now wondering what other bits
> > are set without much care about correctly restoring them during replay.
>
> I'm now wondering whether it'd be easier to just ignore pd_flags in
> calculating the checksum.

Okay, so this is what I've done.  pd_flags is skipped.  Also the WAL
routine logs both HeapTupleHeader infomasks and ItemId->lp_flags.  On
the latter point I'm not 100% sure of the cases where lp_flags must be
logged; right now I'm only logging if the item is marked as "having
storage" (the logic being that if an item does not have storage, then
making it have storage requires a WAL entry, and vice versa).

(This version has some debugging log entries which are obviously only
WIP material.)

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Attachment

Re: Block-level CRC checks

From
Paul Schlie
Date:
Alvaro Herrera wrote:
> Alvaro Herrera wrote:
> > Alvaro Herrera wrote:
> > >
> > > Hmm, oh I see another problem here -- the bit is not restored when
> > > replaying heap_update's WAL record.  I'm now wondering what other bits
> > > are set without much care about correctly restoring them during replay.
> >
> > I'm now wondering whether it'd be easier to just ignore pd_flags in
> > calculating the checksum.
>
> Okay, so this is what I've done.  pd_flags is skipped.  Also the WAL
> routine logs both HeapTupleHeader infomasks and ItemId->lp_flags.  On
> the latter point I'm not 100% sure of the cases where lp_flags must be
> logged; right now I'm only logging if the item is marked as "having
> storage" (the logic being that if an item does not have storage, then
> making it have storage requires a WAL entry, and vice versa).

Might it make sense to move such flags to another data structure which
may or may not need to be logged, thereby maintaining the crc integrity
of the data pages themselves?

(I pre-apologize if this is silly, as I honestly don't understand how, once
a page has been logically committed to storage, it can ever be subsequently
validly modified unless it is first marked as no longer committed.  If its
write were interrupted prior to being completed, it seems most correct to
simply consider the page as not having been stored and to resume the process
from the beginning whenever a partial store is suspected.  This implies that
any buffers storing the logical page are not released until the page as a
whole is known to have been successfully stored; the entire page then either
remains committed to storage, or is alternatively made re-available for
mutation with its CRC marked as invalid if it is ever mutated prior to being
re-committed to storage, it seems.)




Re: Block-level CRC checks

From
Martijn van Oosterhout
Date:
On Fri, Oct 17, 2008 at 12:26:11PM -0300, Alvaro Herrera wrote:
> So this discussion died with no solution arising to the
> hint-bit-setting-invalidates-the-CRC problem.
>
> Apparently the only solution in sight is to WAL-log hint bits.  Simon
> opines it would be horrible from a performance standpoint to WAL-log
> every hint bit set, and I think we all agree with that.  So we need to
> find an alternative mechanism to WAL log hint bits.

There is another option I haven't seen mentioned anywhere yet: a single
bit change in a page has a predictable change on the CRC, dependent
only on the position of the bit. So in theory it would be possible for
the process changing the hint bit to update the CRC with a single XOR
operation. Working out what to XOR it with is the hard part.

Worst case you're talking about a BLOCK_SIZE*8*4 byte = 256K lookup
table, but CRC has nice mathematical properties which could probably
get that down to a few KB.
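
For illustration, a toy sketch of that delta-table idea (not from any
posted patch). It assumes a pure CRC-32 with zero preset and no final
XOR; the same deltas also work with a preset/final-XOR variant, since
those constants cancel when two sums are XORed together:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define BLKSZ 64			/* toy block; the 256K table above is for 8K pages */

static uint32_t
crc32_pure(const uint8_t *buf, size_t len)
{
	uint32_t	crc = 0;	/* linearity needs preset 0 and no final XOR */

	for (size_t i = 0; i < len; i++)
	{
		crc ^= buf[i];
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return crc;
}

int
main(void)
{
	static uint32_t delta[BLKSZ * 8];	/* CRC contribution of each bit position */
	uint8_t		e[BLKSZ] = {0};
	uint8_t		page[BLKSZ] = "some page contents ...";

	/* precompute: delta[i] is the CRC of a block with only bit i set */
	for (int bit = 0; bit < BLKSZ * 8; bit++)
	{
		e[bit / 8] = (uint8_t) (1u << (bit % 8));
		delta[bit] = crc32_pure(e, BLKSZ);
		e[bit / 8] = 0;
	}

	uint32_t	crc = crc32_pure(page, BLKSZ);
	int			hintbit = 137;	/* pretend this is a hint bit */

	page[hintbit / 8] ^= (uint8_t) (1u << (hintbit % 8));	/* flip it */
	crc ^= delta[hintbit];		/* one XOR keeps the sum in step */

	printf("%s\n", crc == crc32_pure(page, BLKSZ) ? "match" : "MISMATCH");
	return 0;
}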

Although, maybe locking of the hint bits would be a problem?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Block-level CRC checks

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> There is another option I haven't seen mentioned anywhere yet: a single
> bit change in a page has a predictable change on the CRC, dependent
> only on the position of the bit. So in theory it would be possible for
> the process changing the hint bit to update the CRC with a single XOR
> operation. Working out what to XOR it with is the hard part.

> Although, maybe locking of the hint bits would be a problem?

Yes it would :-(.  Also, this scheme would point us towards maintaining
the CRCs *continually* while the page is in memory, rather than only
recalculating them upon write.  So every tuple insert/update/delete
would require a recalculation of the entire page CRC.

What happened to the plan to double-buffer the writes to avoid this
issue?
        regards, tom lane


Re: Block-level CRC checks

From
Martijn van Oosterhout
Date:
On Sun, Nov 09, 2008 at 11:02:32AM -0500, Tom Lane wrote:
> Yes it would :-(.  Also, this scheme would point us towards maintaining
> the CRCs *continually* while the page is in memory, rather than only
> recalculating them upon write.  So every tuple insert/update/delete
> would require a recalculation of the entire page CRC.

I wasn't thinking of that. I was thinking more of the situation where a
seq scan reads in a page, updates a few hint bits and then goes on to
the next page. For these just doing a few XORs might be cheaper.

> What happened to the plan to double-buffer the writes to avoid this
> issue?

Might be better anyway. A single copy-and-checksum would probably be
quite cheap (pulling the page into L2 cache).

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Block-level CRC checks

From
Greg Stark
Date:
I think double buffering solves the torn page problem but not the lack
of wal logging. Alvaro solved the wal logging by deferring the wal
logs. But I'm not sure how confident we are that it's logging enough.

I'm beginning to think just excluding the hint bits would be simpler
and safer. If we're double buffering then it might be possible to do
that pretty cheaply. Copy the whole buffer with memcpy then loop
through the line pointers unsetting the hint bits. Then do the crc.
Though that would prevent us from doing "zero-copy" crc by doing it in
the copy.

greg



Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Greg Stark wrote:
> I think double buffering solves the torn page problem but not the lack
> of wal logging. Alvaro solved the wal logging by deferring the wal
> logs. But I'm not sure how confident we are that it's logging enough.
>

Right now, it's WAL-logging HeapTupleHeader hint bits (infomask and
infomask2), and ItemId (line pointer) flags.  Page pd_flags are skipped
in the CRC checksum -- this is easy to do because they are at a constant
offset in the page and I'm just skipping those bytes in CRC_COMP().
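
A sketch of what that skipping might look like with the existing
pg_crc.h macros (the function name is hypothetical, and this ignores
that the real patch selectively sums other regions as well):

#include "postgres.h"
#include "storage/bufpage.h"
#include "utils/pg_crc.h"

static pg_crc32
page_checksum_skip_pd_flags(Page page)		/* hypothetical name */
{
	pg_crc32	crc;
	Size		off = offsetof(PageHeaderData, pd_flags);

	INIT_CRC32(crc);
	COMP_CRC32(crc, (char *) page, off);	/* everything before pd_flags */
	COMP_CRC32(crc, (char *) page + off + sizeof(uint16),
			   BLCKSZ - off - sizeof(uint16));	/* everything after it */
	FIN_CRC32(crc);
	return crc;
}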

So what I'm missing is:
- btree hint bits
- bgwriter calls XLogInsert during shutdown, to WAL-log the hint bits
of unwritten pages.  This triggers a PANIC about concurrent WAL activity
during the checkpoint.  (The easy solution to this problem is just
to remove the check; another idea is to flush the buffers before
grabbing the final address to watch for at shutdown.)

> I'm beginning to think just excluding the hint bits would be simpler and 
> safer. If we're double buffering then it might be possible to do that 
> pretty cheaply. Copy the whole buffer with memcpy then loop through the 
> line pointers unsetting the hint bits. Then do the crc. Though that would 
> prevent us from doing "zero-copy" crc by doing it in the copy.

This can probably be made to work, and it solves the problem that
bgwriter calls XLogInsert during shutdown.  I would create new routines
to clear hint bits in all involved modules (heap_resethintbits, btree_%,
item_%, page_%), and call them on a copy of the page.

The downside to this idea is that we need to create a copy of the page
and call those routines when we read the page in, too.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Greg Stark wrote:
>> I'm beginning to think just excluding the hint bits would be simpler and 
>> safer. If we're double buffering then it might be possible to do that 
>> pretty cheaply. Copy the whole buffer with memcpy then loop through the 
>> line pointers unsetting the hint bits. Then do the crc. Though that would 
>> prevent us from doing "zero-copy" crc by doing it in the copy.

> The downside to this idea is that we need to create a copy of the page
> and call those routines when we read the page in, too.

Ugh.  The cost on write was bad enough, but paying it on read is a lot
worse ...
        regards, tom lane


Re: Block-level CRC checks

From
Gregory Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Greg Stark wrote:
>>> I'm beginning to think just excluding the hint bits would be simpler and 
>>> safer. If we're double buffering then it might be possible to do that 
>>> pretty cheaply. Copy the whole buffer with memcpy then loop through the 
>>> line pointers unsetting the hint bits. Then do the crc. Though that would 
>>> prevent us from doing "zero-copy" crc by doing it in the copy.
>
>> The downside to this idea is that we need to create a copy of the page
>> and call those routines when we read the page in, too.

oh, good point.

> Ugh.  The cost on write was bad enough, but paying it on read is a lot
> worse ...

I think you could checksum the block including the hint bits then go back and
remove them from the checksum. I didn't realize you were handling more than
just the heap transaction hint bits though. It would be hard to do it in any
kind of abstract away like you were describing.

How happy are you with the wal logging entries? Have you done any tests to see
how much extra wal traffic it is? Are you sure you always generate enough logs
soon enough?

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Gregory Stark wrote:

> I think you could checksum the block including the hint bits then go back and
> remove them from the checksum.

I'm not sure what you're proposing here.  It sounds to me like you are
saying that we can read the page, make it available to other users, and
then check the CRC.  I don't think this works though, because if you do
that the possibly-invalid buffer is available to the other readers.

> I didn't realize you were handling more than just the heap transaction
> hint bits though. It would be hard to do it in any kind of abstract
> away like you were describing.

Yeah, I also initially thought that there was only a single set of hint
bits, but that turned out not to be the case.  Right now the nbtree hint
bits are the ones missing :-(  It's hard to see how to handle those.

> How happy are you with the wal logging entries? Have you done any tests to see
> how much extra wal traffic it is? Are you sure you always generate enough logs
> soon enough?

I haven't measured the amount of traffic.  They are always generated
"soon enough": just before calling smgrwrite on the page in FlushBuffer,
i.e. just before the page hits disk.  I admit it feels a bit dirty to be
calling XLogInsert at such a low level.

Right now we log all bits for all tuples, even if a single bit changed.
It could be more efficient if I could only log tuples whose hint bits
had changed since the last write.  This would require setting a bit on
every tuple ("this tuple has an unlogged hint bit"; right now there's a
bit at the page level).  I haven't tried implementing that.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
Gregory Stark
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:

> Gregory Stark wrote:
>
>> I think you could checksum the block including the hint bits then go back and
>> remove them from the checksum.
>
> I'm not sure what you're proposing here.  It sounds to me like you are
> saying that we can read the page, make it available to other users, and
> then check the CRC.  I don't think this works though, because if you do
> that the possibly-invalid buffer is available to the other readers.

No, I just meant that you could calculate the CRC by scanning the whole buffer
efficiently using one of the good word-wise CRC algorithms, then look at the
line pointers to find the hint bits and subtract them out of the CRC. The
result should be zero after adjusting for the hint bits.

It doesn't solve much though.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!


Re: Block-level CRC checks

From
Martijn van Oosterhout
Date:
On Mon, Nov 10, 2008 at 11:31:33PM +0000, Gregory Stark wrote:
> No, I just meant that you could calculate the CRC by scanning the whole buffer
> efficiently using one of the good word-wise CRC algorithms, then look at the
> line pointers to find the hint bits and subtract them out of the CRC. The
> result should be zero after adjusting for the hint bits.

If you're going to look at the line pointers anyway, couldn't you just
do it in one pass, like:

n = 0
next = &tuple[n].hintbits
pos = 0
while pos < BLOCK_SIZE:
  if pos == next:
    CRC_ADD( block[pos] & mask )
    n++
    next = &tuple[n].hintbits  # If n == numtups, next = BLOCK_SIZE
  else:
    CRC_ADD( block[pos] )
  pos++

This only handles one byte of hintbits but can easily be extended. No
need to actually *store* the hintbit-free version anywhere...
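
Rendered in C, the same one-pass loop might look like the sketch below
(HintByte and the hint list are hypothetical: one entry per hint-bit
byte, sorted by offset, built from the line pointers beforehand; a pure
bitwise CRC-32 loop stands in for the real macros):

#include <stddef.h>
#include <stdint.h>

typedef struct HintByte { size_t off; uint8_t mask; } HintByte;

static uint32_t
crc32_masked(const uint8_t *page, size_t len,
             const HintByte *hints, int nhints)
{
	uint32_t	crc = 0;
	int			h = 0;

	for (size_t pos = 0; pos < len; pos++)
	{
		uint8_t		b = page[pos];

		/* mask out the hint bits instead of skipping the byte */
		if (h < nhints && hints[h].off == pos)
			b &= hints[h++].mask;

		crc ^= b;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return crc;
}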

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Martijn van Oosterhout wrote:

> If you're going to look at the line pointers anyway, couldn't you just
> do it in one pass, like:
> 
> n = 0
> next = &tuple[n].hintbits
> pos = 0
> while pos < BLOCK_SIZE:
>   if pos == next: 
>     CRC_ADD( block[pos] & mask )
>     n++
>     next = &tuple[n].hintbits  # If n == numtups, next = BLOCK_SIZE
>   else:
>     CRC_ADD( block[pos] )
>   pos++

For this to work, we would have to create two (or more) versions of the
calculate-checksum macro, one for heap pages and another for other pages.
I'm not sure how bad that is.  The bit that's worse is that we'd need to
have external knowledge of what kind of page we're talking about (i.e.
FlushBuffer would need to know whether a page is heap or another kind).

However, your idea suggests something else that we could do to improve
the patch: skip the ItemId->lp_flags during the CRC calculation.  This
would mean we wouldn't need to WAL-log those.  The problem with that is
that lp_flags are only 2 bits, so we would need to iterate, zeroing them
and restoring them after CRC_COMP(), instead of simply skipping.

The immediately useful information arising from your note is that I
noticed I'm calling a heap routine on non-heap pages, because of setting
PD_UNLOGGED_CHANGE for ItemId flags on index pages.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> However, your idea suggests something else that we could do to improve
> the patch: skip the ItemId->lp_flags during the CRC calculation.  This
> would mean we wouldn't need to WAL-log those.

What!?  In most cases those bits are critical data, not hints.
        regards, tom lane


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > However, your idea suggests something else that we could do to improve
> > the patch: skip the ItemId->lp_flags during the CRC calculation.  This
> > would mean we wouldn't need to WAL-log those.
> 
> What!?  In most cases those bits are critical data, not hints.

In most cases; but LP_DEAD is used as a hint sometimes which is causing
me some grief ...

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Hmm, I can get around the btree problem by not summing the "special
space".  This loses a bit of reliability because some of the most
critical bits of the page would not be protected by the CRC, but the
bulk of the data would be.  And this allows me to get away from page
type specific tricks (like btpo_cycleid which is used as a hint bit).

The reason I noticed this is that I started wondering about only summing
the part of the page that's actually used, i.e. the header, the line
pointers, and the area beyond pd_upper.  I then noticed that if I only
include the area between pd_upper and pd_special then I don't need to
care about those bits.

So far, the only other idea I've had is to keep a list of page types
(gin, gist, btree, hash, heap; am I missing something else?) and each
module would provide a routine to do the summing.  (Or perhaps better:
the routine they provide says how to sum the special area of the page.
That would allow having a single routine to check the bulk of the page,
and the type-specific routine sums the summable parts of the special
area.)
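
As a sketch, that dispatch could be a small table of per-AM callbacks
(everything here is hypothetical, names included):

#include "postgres.h"
#include "storage/bufpage.h"
#include "utils/pg_crc.h"

typedef enum PageKind { PAGE_HEAP, PAGE_BTREE, PAGE_HASH, PAGE_GIN, PAGE_GIST } PageKind;
typedef void (*special_sum_fn) (pg_crc32 *crc, Page page);

/* one such routine per access method; the btree one would skip btpo_cycleid */
extern void btree_sum_special(pg_crc32 *crc, Page page);

static const special_sum_fn special_summers[] = {
	NULL,				/* PAGE_HEAP: nothing summable in the special area */
	btree_sum_special,
	NULL,				/* PAGE_HASH, PAGE_GIN, PAGE_GIST: to be filled in */
	NULL,
	NULL,
};

static void
sum_special_area(pg_crc32 *crc, PageKind kind, Page page)
{
	if (special_summers[kind] != NULL)
		special_summers[kind](crc, page);
}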

Thoughts?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
Martijn van Oosterhout
Date:
On Wed, Nov 12, 2008 at 11:08:13AM -0300, Alvaro Herrera wrote:
> For this to work, we would have to create two (or more) versions of the
> calculate-checksum macro, one for heap pages and another for other pages.
> I'm not sure how bad that is.  The bit that's worse is that we'd need to
> have external knowledge of what kind of page we're talking about (i.e.
> FlushBuffer would need to know whether a page is heap or another kind).

I think all you need is a macro (say COMP_CRC32_ONE) that adds a single
byte to the checksum, then use COMP_CRC32 for the bulk of the work.
Yes, you'd need to distinguish different kinds of pages.
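
Such a macro could simply be a one-byte version of the existing loop
body; a sketch, assuming the reflected table step COMP_CRC32 uses:

/* hypothetical single-byte variant of COMP_CRC32 */
#define COMP_CRC32_ONE(crc, byte) \
	((crc) = pg_crc32_table[((int) (crc) ^ (unsigned char) (byte)) & 0xFF] ^ \
			 ((crc) >> 8))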

Seems to me the xlog code already has plenty of examples on how to do
acrobatics with CRCs.

> However, your idea suggests something else that we could do to improve
> the patch: skip the ItemId->lp_flags during the CRC calculation.  This
> would mean we wouldn't need to WAL-log those.  The problem with that is
> that lp_flags are only 2 bits, so we would need to iterate, zeroing them
> and restoring them after CRC_COMP(), instead of simply skipping.

Not sure why you're so intent on actually changing memory just so you can use
COMP_CRC32, which is just a for loop around the COMP_CRC32_ONE I
mentioned. Actually changing the memory probably means locking so why
bother.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Martijn van Oosterhout wrote:
> On Wed, Nov 12, 2008 at 11:08:13AM -0300, Alvaro Herrera wrote:

> > However, your idea suggests something else that we could do to improve
> > the patch: skip the ItemId->lp_flags during the CRC calculation.  This
> > would mean we wouldn't need to WAL-log those.  The problem with that is
> > that lp_flags are only 2 bits, so we would need to iterate, zeroing them
> > and restoring them after CRC_COMP(), instead of simply skipping.
> 
> Not sure why you're so intent on actually changing memory just so you can use
> COMP_CRC32, which is just a for loop around the COMP_CRC32_ONE I
> mentioned. Actually changing the memory probably means locking so why
> bother.

Well, that's one of the problems -- memory is being changed without
holding a lock.  The other problem is that of pages being changed, their
CRCs calculated, and then a crash occurring.  On recovery, the CRC is
restored but some of those changed bits are not.

The other thing that maybe you didn't notice is that lp_flags are 2
bits, not a full byte.  A byte-at-a-time CRC calculation is no help
there.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
Gregory Stark
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:

> The other thing that maybe you didn't notice is that lp_flags are 2
> bits, not a full byte.  A byte-at-a-time CRC calculation is no help
> there.

I think we're talking past each other. Martijn and I are talking about doing
something like:

for (...)
  ...
  crc(word including hint bits)
  ...
for (each line pointer)
  crc-negated(word & LP_DEAD<<15)

Because CRC is a cyclic checksum it's possible to add or remove bits
incrementally. This only works if the data is already copied to someplace so
you can be sure nobody will set or clear the bits behind your back. But when
you're reading the data back in you don't have to worry about that.

I'm a bit surprised to hear our CRC implementation is a bytewise loop. I
thought it was much faster to process CRC checks word-wise.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's PostGIS support!


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Gregory Stark wrote:

> I think we're talking past each other. Martijn and I are talking about doing
> something like:
>
> for (...)
>   ...
>   crc(word including hint bits)
>   ...
> for (each line pointer)
>   crc-negated(word & LP_DEAD<<15)
>
> Because CRC is a cyclic checksum it's possible to add or remove bits
> incrementally.

I see.

Since our CRC implementation is a simple byte loop, and since ItemIdData
fits in a uint32, the attached patch should do mostly the same by
copying the line pointer into a uint32, turning off the lp_flags, and
summing the modified copy.
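
The per-line-pointer part of that could look roughly like this (a
sketch, not the patch itself; the helper name is made up):

#include "postgres.h"
#include "storage/bufpage.h"
#include "utils/pg_crc.h"

/* feed one line pointer into the running sum with lp_flags cleared,
 * working on a private copy so the shared buffer is never modified */
static void
comp_crc_itemid(pg_crc32 *crc, Page page, OffsetNumber offnum)
{
	ItemIdData	itemcopy;

	memcpy(&itemcopy, PageGetItemId(page, offnum), sizeof(ItemIdData));
	itemcopy.lp_flags = 0;		/* the two flag bits are not summed */
	COMP_CRC32(*crc, (char *) &itemcopy, sizeof(ItemIdData));
}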

This patch is also skipping pd_special and the unused area of the page.

I'm still testing this; please beware that this likely has an even
higher bug density than my regular patches (and some debugging printouts
as well).

While reading the pg_filedump code I noticed that there's a way to tell
the different index pages apart, so perhaps we can use that to be able
to checksum the special space as well.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Attachment

Re: Block-level CRC checks

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I'm still testing this; please beware that this likely has an even
> higher bug density than my regular patches (and some debugging printouts
> as well).

This seems impossibly fragile ... and the non-modular assumptions about
what is in a disk page aren't even the worst part :-(.  The worst part
is the race conditions.

In particular, the code added to FlushBuffer effectively assumes that
the PD_UNLOGGED_CHANGE bit is set sooner than the actual hint bit change
occurs.  Even if the tqual.c code did that in the right order, which it
doesn't, you can't assume that the updates will become visible to other
CPUs in the expected order.  This might be fixable with introduction of
some memory barrier operations but it's certainly broken as-is.

Also, if you do make tqual.c set the bits in that order, it's not clear
how you can ever *clear* PD_UNLOGGED_CHANGE without introducing a race
condition at that end.  (The patch actually neglects to do this anywhere,
which means that it won't be long till every page in the DB has got that
bit set all the time, which I don't think we want.)

I also don't like that you've got different CPUs trying to set or clear
the same PD_UNLOGGED_CHANGE bit with no locking.  We can tolerate that
for ordinary hint bits because it's not critical if an update gets lost.
But in this scheme PD_UNLOGGED_CHANGE is not an optional hint bit: you
*will* mess up if it fails to get set.  Even more insidiously, the
scheme will certainly fail if someone ever tries to add another
asynchronously-updated hint bit in pd_flags, since an update of one of
the bits might overwrite a concurrent update of the other.  Also, it's
not inconceivable (depending on how wide the processor/memory bus is)
that one processor updating PD_UNLOGGED_CHANGE could overwrite some
other processor's change to the nearby pd_checksum or pd_lsn or pd_tli
fields.

Basically, you can't make any critical changes to a shared buffer
if you haven't got exclusive lock on it.  But that's exactly what
this patch is assuming it can do.
        regards, tom lane


Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
I think I'm missing something...

In this patch, I see you writing WAL records for hint-bits (bufmgr.c
FlushBuffer).  But doesn't XLogInsert then make a "backup block" record (unless
it's already got one since last checkpoint)?

Once there's a backup block record, the torn-page problem that causes the
CRCs to not validate is gone, isn't it?  On crash/recovery, you won't read
this torn block because the WAL log will have the old backup + any possible
updates to it...

Sorry if I'm missing something very obvious...

a.

* Alvaro Herrera <alvherre@commandprompt.com> [081113 13:08]:
> I see.
> 
> Since our CRC implementation is a simple byte loop, and since ItemIdData
> fits in a uint32, the attached patch should do mostly the same by
> copying the line pointer into a uint32, turning off the lp_flags, and
> summing the modified copy.
> 
> This patch is also skipping pd_special and the unused area of the page.
> 
> I'm still testing this; please beware that this likely has an even
> higher bug density than my regular patches (and some debugging printouts
> as well).
> 
> While reading the pg_filedump code I noticed that there's a way to tell
> the different index pages apart, so perhaps we can use that to be able
> to checksum the special space as well.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Tom Lane wrote:

> Basically, you can't make any critical changes to a shared buffer
> if you haven't got exclusive lock on it.  But that's exactly what
> this patch is assuming it can do.

It seems to me that the only possible way to close this hole is to
acquire an exclusive lock before calling FlushBuffer, not shared.
This lock would be held until the flag has been examined and reset; the
actual WAL record and write would continue with a shared lock, as now.

I'm wary of this "solution" because it's likely to reduce concurrency
tremendously ... thoughts?

(The alternative seems to be to abandon this idea for hint bit logging;
we'll need something else.)

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Aidan Van Dyk wrote:
> 
> I think I'm missing something...
> 
> In this patch, I see you writing WAL records for hint-bits (bufmgr.c
> FlushBuffer).  But doesn't XLogInsert then make a "backup block" record (unless
> it's already got one since last checkpoint)?

I'm not causing a backup block to be written with that WAL record.  The
rationale is that it's not needed -- if there was a critical write to
the page, then there's already a backup block.  If the only write was a
hint bit being set, then the page cannot possibly be torn.

Now that I think about this, I wonder if this can cause problems in some
filesystems.  XFS, for example, zeroes out during recovery any block
that was written to but not fsync'ed before a crash.  This means that if
we change a hint bit after a checkpoint and mark the page dirty, the
system can write the page.  Suppose we crash at this point.  On
recovery, XFS will zero out the block, but there will be nothing with
which to recover it, because there's no backup block ...

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> Basically, you can't make any critical changes to a shared buffer
>> if you haven't got exclusive lock on it.  But that's exactly what
>> this patch is assuming it can do.

> It seems to me that the only possible way to close this hole is to
> acquire an exclusive lock before calling FlushBuffer, not shared.
> This lock would be held until the flag has been examined and reset; the
> actual WAL record and write would continue with a shared lock, as now.

Well, if we adopt the double buffering approach then the ex-lock would
only need to be held for long enough to copy the page contents to local
memory.  So maybe this would be acceptable.  It would certainly be a
heck of a lot simpler than any workable variant of the current patch
is likely to be; and we could simplify some existing code too (no more
need for the BM_JUST_DIRTIED flag for instance).
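
In outline, the flush might then look like the sketch below
(compute_page_crc() and write_page_copy() are hypothetical stand-ins,
and pd_checksum is the header field this patch would add):

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

static void
flush_buffer_double_buffered(Buffer buffer)
{
	char		pagecopy[BLCKSZ];

	/* hold the exclusive lock only long enough to take the copy */
	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
	memcpy(pagecopy, BufferGetPage(buffer), BLCKSZ);
	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

	/* hint bits can no longer change under us; sum and write the copy */
	((PageHeader) pagecopy)->pd_checksum = compute_page_crc(pagecopy);
	write_page_copy(buffer, pagecopy);
}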

> (The alternative seems to be to abandon this idea for hint bit logging;
> we'll need something else.)

I'm feeling dissatisfied too --- seems like we're one idea short of a
good solution.

In the larger scheme of things, this patch shouldn't go in anyway as
long as there is some chance that we could have upgrade-in-place for
8.4 at the price of not increasing the page header size.  So I think
there's time to keep thinking about it.
        regards, tom lane


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> Tom Lane wrote:
> 
> > Basically, you can't make any critical changes to a shared buffer
> > if you haven't got exclusive lock on it.  But that's exactly what
> > this patch is assuming it can do.
> 
> It seems to me that the only possible way to close this hole is to
> acquire an exclusive lock before calling FlushBuffer, not shared.
> This lock would be held until the flag has been examined and reset; the
> actual WAL record and write would continue with a shared lock, as now.

We don't seem to have an API for reducing LWLock strength though ...

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
Heikki Linnakangas
Date:
Alvaro Herrera wrote:
> XFS, for example, zeroes out during recovery any block
> that was written to but not fsync'ed before a crash.  This means that if
> we change a hint bit after a checkpoint and mark the page dirty, the
> system can write the page.  Suppose we crash at this point.  On
> recovery, XFS will zero out the block, but there will be nothing with
> which to recover it, because there's no backup block ...

Really? That would mean that you're prone to lose data if you run 
PostgreSQL on XFS, even without the CRC patch.

I doubt that's true, though. Google found this:

http://marc.info/?l=linux-xfs&m=122549156102504&w=2

See the bottom of that mail.

Although, Florian Weimer suggested earlier in this thread that IBM DTLA 
disks have exactly that problem; a sector could be zero-filled if the 
write is interrupted.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:
> Alvaro Herrera wrote:
>> XFS, for example, zeroes out during recovery any block
>> that was written to but not fsync'ed before a crash.  This means that if
>> we change a hint bit after a checkpoint and mark the page dirty, the
>> system can write the page.  Suppose we crash at this point.  On
>> recovery, XFS will zero out the block, but there will be nothing with
>> which to recover it, because there's no backup block ...
>
> Really? That would mean that you're prone to lose data if you run  
> PostgreSQL on XFS, even without the CRC patch.
>
> I doubt that's true, though. Google found this:
>
> http://marc.info/?l=linux-xfs&m=122549156102504&w=2

Ah, there's no problem here then.  This email mentions another one by
"Eric" which is this one:
http://marc.info/?l=linux-xfs&m=122546510218150&w=2
It contains more information about the problem.


> Although, Florian Weimer suggested earlier in this thread that IBM DTLA  
> disks have exactly that problem; a sector could be zero-filled if the  
> write is interrupted.

Hmm.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Tom Lane <tgl@sss.pgh.pa.us> [081113 14:43]:
> Well, if we adopt the double buffering approach then the ex-lock would
> only need to be held for long enough to copy the page contents to local
> memory.  So maybe this would be acceptable.  It would certainly be a
> heck of a lot simpler than any workable variant of the current patch
> is likely to be; and we could simplify some existing code too (no more
> need for the BM_JUST_DIRTIED flag for instance).

Well, can we get rid of PD_UNLOGGED_CHANGE completely?

I think that if the buffer is dirty (FlushBuffer was called, and you've gotten
through the StartBufferIO and gotten the lock), you can just WAL log the hint
bits from the *local double-buffered* "page" (don't know if the current code
allows it easily)

If I understand Tom's objections, it's that with the shared lock, other hint
bits may still change... But we don't really care if we get all the hint bits
to WAL in our write; what we care about is that we get the hint bits *that we
checksummed* to WAL.  You'll need to throw the CRC in the WAL as well for the
really paranoid.  That way, if the write is torn, on recovery, the correct hint
bits and matching CRC will be available.

This means you're chewing up more WAL.  You get the WAL record for all the hint
bits on every page write.  For that you get:
1) Simplified locking (and maybe, with releasing the lock before the write,
   shorter lock hold-times)
2) Simplified CRC/checksum (don't have to try and skip hint-bits)
3) Hint bits WAL-logged even for blocks written that aren't hint-bit only

You trade WAL and simplicity for verifiable integrity.

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
Martijn van Oosterhout
Date:
On Thu, Nov 13, 2008 at 01:45:52PM -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > I'm still testing this; please beware that this likely has an even
> > higher bug density than my regular patches (and some debugging printouts
> > as well).
>
> This seems impossibly fragile ... and the non-modular assumptions about
> what is in a disk page aren't even the worst part :-(.  The worst part
> is the race conditions.

Actually, the real problem to me seems to be that to check the checksum
when you read the page in, you need to look at the contents of the page
and "assume" some of the values in there are correct, before you can
even calculate the checksum. If the page really is corrupted, chances
are the item pointers are going to be bogus, but you need to read them
to calculate the checksum...

Double-buffering allows you to simply checksum the whole page, so
creating a COMP_CRC32_WITH_COPY() macro would do it. Just allocate a
block on the stack, copy/checksum it there, do the write() syscall and
forget it.
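
Such a macro (hypothetical, mirroring the byte step of COMP_CRC32)
could fuse the two loops:

#define COMP_CRC32_WITH_COPY(crc, dst, src, len) \
do { \
	unsigned char *__src = (unsigned char *) (src); \
	unsigned char *__dst = (unsigned char *) (dst); \
	uint32		__len = (len); \
\
	while (__len-- > 0) \
	{ \
		*__dst = *__src++; \
		(crc) = pg_crc32_table[((int) (crc) ^ *__dst++) & 0xFF] ^ ((crc) >> 8); \
	} \
} while (0)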

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Block-level CRC checks

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> Actually, the real problem to me seems to be that to check the checksum
> when you read the page in, you need to look at the contents of the page
> and "assume" some of the values in there are correct, before you can
> even calculate the checksum. If the page really is corrupted, chances
> are the item pointers are going to be bogus, but you need to read them
> to calculate the checksum...

Hmm.  You could verify the values closely enough to ensure you don't
crash while redoing the CRC calculation, which ought to be sufficient.
Still, I agree that the whole thing looks too Rube Goldbergian to count
as a reliability enhancer, which is what the point is after all.

> Double-buffering allows you to simply checksum the whole page, so
> creating a COMP_CRC32_WITH_COPY() macro would do it. Just allocate a
> block on the stack, copy/checksum it there, do the write() syscall and
> forget it.

I think the argument is about whether we increase our vulnerability to
torn-page problems if we just add a CRC and don't do anything else to
the overall writing process.  Right now, a partial write on a
hint-bit-only update merely results in some hint bits getting lost
(as long as you discount the scenario where the disk fails to read a
partially-written sector at all --- maybe we're fooling ourselves to
ignore that?).  With a CRC added, that suddenly becomes a corrupted-page
situation, and it's not easy to tell that no real harm was done.

Again, the real bottom line here is whether there will be a *net*
gain in reliability.  If a CRC adds too many false-positive
reports of bad data, it's not going to be a win.
        regards, tom lane


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Tom Lane wrote:

> Still, I agree that the whole thing looks too Rube Goldbergian to count
> as a reliability enhancer, which is what the point is after all.

Agreed.

> I think the argument is about whether we increase our vulnerability to
> torn-page problems if we just add a CRC and don't do anything else to
> the overall writing process.  Right now, a partial write on a
> hint-bit-only update merely results in some hint bits getting lost
> (as long as you discount the scenario where the disk fails to read a
> partially-written sector at all --- maybe we're fooling ourselves to
> ignore that?).  With a CRC added, that suddenly becomes a corrupted-page
> situation, and it's not easy to tell that no real harm was done.

The first idea that comes to mind is skipping hint bits in the CRC too.
That does away with a lot of the trouble (PD_UNLOGGED_CHANGE, the
necessity of WAL-logging hint bits, etc).  The problem, again, is that
the checksumming process becomes page type-specific; but then maybe
that's the only workable approach.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
Martijn van Oosterhout
Date:
On Thu, Nov 13, 2008 at 09:03:41PM -0300, Alvaro Herrera wrote:
> The first idea that comes to mind is skipping hint bits in the CRC too.
> That does away with a lot of the trouble (PD_UNLOGGED_CHANGE, the
> necessity of WAL-logging hint bits, etc).  The problem, again, is that
> the checksumming process becomes page type-specific; but then maybe
> that's the only workable approach.

Which brings back the problem of having to decode the page to checksum
it, so your checksumming code needs to have all sorts of failsafes in
it to stop it going crazy on bad data.

But I understand the problem is that you want to continue in the face
of torn pages, something which is AFAICS ambitious. At least MS-SQL
just blows up on a torn page; haven't found results for other
databases...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Block-level CRC checks

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> But I understand the problem is that you want to continue in the face
> of torn pages, something which is AFAICS ambitious. At least MS-SQL
> just blows up on a torn page; haven't found results for other
> databases...

I don't think it's too "ambitious" to demand that this patch preserve
a behavior we have today.

In fact, if the patch were to break torn-page handling, it would be
100% likely to be a net *decrease* in system reliability.  It would add
detection of a situation that is not supposed to happen (ie, storage
system fails to return the same data it stored) at the cost of breaking
one's database when the storage system acts as it's expected and
documented to in a routine power-loss situation.

So no, I don't care that MSSQL is unable to handle this.  This patch
must, or it doesn't go in.
        regards, tom lane


Re: Block-level CRC checks

From
Martijn van Oosterhout
Date:
On Fri, Nov 14, 2008 at 10:51:57AM -0500, Tom Lane wrote:
> In fact, if the patch were to break torn-page handling, it would be
> 100% likely to be a net *decrease* in system reliability.  It would add
> detection of a situation that is not supposed to happen (ie, storage
> system fails to return the same data it stored) at the cost of breaking
> one's database when the storage system acts as it's expected and
> documented to in a routine power-loss situation.

Ok, I see it's a problem because the hint changes are not WAL logged,
so torn pages are expected to work in normal operation. But simply
skipping the hint bits during checksumming is a terrible solution,
since then any errors in those bits will go undetected. To not be able
to say in the documentation that you'll detect 100% of single-bit
errors is pretty darn terrible, since that's kind of the goal of the
exercise.

Unfortunately, there's not a lot of easy solutions here. You could do
two checksums, one with and one without hint bits. The overall checksum
tells you if there's a problem. If it doesn't match, the second checksum
will tell you if it's the hint bits or not (torn page problem). If it's
the hint bits you can reset them all and continue. The checksums need
not be of equal strength.
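
In outline, the read-side check would then be something like this
sketch (all helper names hypothetical):

#include "postgres.h"
#include "storage/bufpage.h"

static void
verify_page(Page page, uint32 stored_full, uint32 stored_hintfree)
{
	if (stored_full == compute_crc_all_bits(page))
		return;					/* page intact, hint bits and all */

	if (stored_hintfree == compute_crc_without_hints(page))
	{
		/* only hint bits were torn; they can be recomputed, so clear them */
		reset_all_hint_bits(page);
		return;
	}

	elog(ERROR, "block-level checksum mismatch");	/* real corruption */
}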

The extreme case is an ECC where you explicitly can set it so you can
alter N bits before you need to recalculate the checksum.
Computationally though, that sucks.

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Block-level CRC checks

From
Heikki Linnakangas
Date:
Martijn van Oosterhout wrote:
> On Fri, Nov 14, 2008 at 10:51:57AM -0500, Tom Lane wrote:
>> In fact, if the patch were to break torn-page handling, it would be
>> 100% likely to be a net *decrease* in system reliability.  It would add
>> detection of a situation that is not supposed to happen (ie, storage
>> system fails to return the same data it stored) at the cost of breaking
>> one's database when the storage system acts as it's expected and
>> documented to in a routine power-loss situation.
> 
> Ok, I see it's a problem because the hint changes are not WAL logged,
> so torn pages are expected to work in normal operation. But simply
> skipping the hint bits during checksumming is a terrible solution,
> since then any errors in those bits will go undetected. To not be able
> to say in the documentation that you'll detect 100% of single-bit
> errors is pretty darn terrible, since that's kind of the goal of the
> exercise.

Agreed, trying to explain that in the documentation would look like 
making excuses.

The requirement that all hint bit changes are WAL-logged seems like a 
pretty big change. I don't like doing that, just for CRCing.

There has been discussion before about not writing out pages to disk 
that only have hint-bit updates on them. That means that the next time 
the page is read, the reader needs to do the clog lookups and set the 
hint bits again. It's a tradeoff, making the first SELECT after 
modifying a page cheaper, I/O-wise, at the cost of making all subsequent 
SELECTs that need to read the page from disk or kernel cache more 
expensive, CPU-wise.

I'm not sure if I like that idea or not, but it would also solve the CRC 
problem with torn pages. FWIW, it would also solve the problem suggested 
with IBM DTLA disks and others that might zero-out a sector in case of 
an interrupted write. I'm not totally convinced that's a problem, as 
there's apparently other software that make the same assumption as we 
do, and we haven't heard of any torn-page corruption in real life, but 
still.

If we made the behavior configurable, that would be pretty hard to 
explain in the docs. We'd have three options with dependencies

- CRC on/off
- write pages with only hint bit changes on/off
- full_page_writes on/off

If you disable full_page_writes, you're vulnerable to torn pages. If you 
enable it, you're not. Except if you also turn CRC on. Except if you 
also turn "write pages with only hint bit changes" off.

> Unfortunately, there's not a lot of easy solutions here. You could do
> two checksums, one with and one without hint bits. The overall checksum
> tells you if there's a problem. If it doesn't match, the second checksum
> will tell you if it's the hint bits or not (torn page problem). If it's
> the hint bits you can reset them all and continue. The checksums need
> not be of equal strength.

Hmm, that would work I guess.

> The extreme case is an ECC where you explicitly can set it so you can
> alter N bits before you need to recalculate the checksum.
> Computationally though, that sucks.

Yep. Also, in case of a torn page, you're very likely going to have 
several hint bits from the old image and several from the new image. An 
error-correcting code would need to be unfeasibly long to cope with that.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Block-level CRC checks

From
Greg Stark
Date:
[sorry for top-posting - damn phone]

I thought of saying that too but it doesn't really solve the problem.  
Think of what happens if someone sets a hint bit on a dirty page.

greg



Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Greg Stark <greg.stark@enterprisedb.com> [081117 03:54]:
> [sorry for top-posting - damn phone]
>
> I thought of saying that too but it doesn't really solve the problem.  
> Think of what happens if someone sets a hint bit on a dirty page.

If the page is dirty from a "real change", then it has a WAL backup block
record already, so the torn-page on disk is going to be fixed with the wal
replay ... *because* of the torn-page problem already being "solved" in PG.
You don't get the hint-bits back, but that's no different from the current
state.  But nobody's previously cared if hint-bits weren't set on WAL replay.

The tradeoff for CRC is:

1) Are hint-bits "worth saving"? (noting that with CRC the goal is the
   ability of detecting blocks that aren't *exactly* as we wrote them)

2) Are hint-bits "nice, but not worth IO"? (noting that this case can be
   mitigated to only pages which are hint-bit *only* changed, not dirty
   with already-wal-logged changes)

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
Martijn van Oosterhout
Date:
On Mon, Nov 17, 2008 at 08:41:20AM -0500, Aidan Van Dyk wrote:
> * Greg Stark <greg.stark@enterprisedb.com> [081117 03:54]:
> > [sorry for top-posting - damn phone]
> >
> > I thought of saying that too but it doesn't really solve the problem.
> > Think of what happens if someone sets a hint bit on a dirty page.
>
> If the page is dirty from a "real change", then it has a WAL backup block
> record already, so the torn-page on disk is going to be fixed with the wal
> replay ... *because* of the torn-page problem already being "solved" in PG.

Aah, I thought the problem was that someone updating a tuple won't
write out the whole page to WAL, only a delta. Then again, maybe I
understood it wrong.

> The tradeoff for CRC is:
>
> 1) Are hint-bits "worth saving"? (noting that with CRC the goal is the
>    ability of detecting blocks that aren't *exactly* as we wrote them)

Worth saving? No. Does it matter if they're wrong? Yes. If the
XMIN_COMMITTED bit gets set incorrectly the tuple may appear even when
it shouldn't. Put another way, accidentally having a one converted to a
zero is not a problem. Having a zero become a one though, is probably
bad.

> 2) Are hint-bits "nice, but not worth IO" (noting that this case can be
>    mitigated to only pages which are hint-bit *only* changed, not dirty with
>    already-wal-logged changes)

That's a long running debate. Hint bits do save I/O, the question is
the tradeoff.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Martijn van Oosterhout <kleptog@svana.org> [081117 10:15]:
> Aah, I thought the problem was that someone updating a tuple won't
> write out the whole page to WAL, only a delta. Then again, maybe I
> understood it wrong.

I'm no expert, but from my understanding of xlog.c, the WAL record is a "delta"
type record (i.e. with MVCC, a new tuple, but HOT complicated this), and the
WAL makes a "backup block" record if the block the WAL record modifies hasn't
been written since the last checkpoint.  This "backup block" is what saves
postgres from the torn-page problem on crash/recovery.

On pages with "only hint-bit updates", the torn-page problem was never an issue
because either the old or the new parts of the page are valid on recovery; you
just may need to re-calculate a hint-bit again.

> > The tradeoff for CRC is:
> > 
> > 1) Are hint-bits "worth saving"? (noting that with CRC the goal is the
> >    ability of detecting blocks that aren't *exactly* as we wrote them)
> 
> Worth saving? No. Does it matter if they're wrong? Yes. If the
> XMIN_COMMITTED bit gets set incorrectly the tuple may appear even when
> it shouldn't. Put another way, accidentally having a one converted to a
> zero is not a problem. Having a zero become a one though, is probably
> bad.

Yes, and this difference is why not WAL-logging hint bits (and allowing
torn pages to possibly appear) has been safe and never been a problem.

But if you're doing a CRC on the page, then they are suddenly just as
important as a "real change".

> > 2) Are hint-bits "nice, but not worth IO" (noting that this case can be
> >    mitigated to only pages which are hint-bit *only* changed, not dirty with 
> >    already-wal-logged changes)
> 
> That's a long running debate. Hint bits do save I/O, the question is
> the tradeoff.

And I don't think anyone's going to have a good answer either way unless we get
real numbers.  But I don't know of any way to get at these numbers right now.

1) How many writes happen on buffer pages that are "hint dirty" but not
   "really dirty"?  (A counting sketch for this follows below.)

2) How much IO would writing WAL records for hint bits on every page write
   take up?
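
Purely as a sketch of how one might count (1), with entirely hypothetical
names hung off the buffer manager's write-out path:

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Sketch: counters for page write-outs, assuming the buffer manager
     * can tell a page dirtied solely by hint bits from one dirtied by a
     * WAL-logged change.
     */
    static uint64_t writes_hint_dirty_only;
    static uint64_t writes_really_dirty;

    static void
    count_page_write(bool hint_dirty_only)
    {
        if (hint_dirty_only)
            writes_hint_dirty_only++;
        else
            writes_really_dirty++;
    }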

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
Gregory Stark
Date:
Aidan Van Dyk <aidan@highrise.ca> writes:

> * Greg Stark <greg.stark@enterprisedb.com> [081117 03:54]:
>> [sorry for top-posting - damn phone]
>>
>> I thought of saying that too but it doesn't really solve the problem.  
>> Think of what happens if someone sets a hint bit on a dirty page.
>
> If the page is dirty from a "real change", then it has a WAL backup block
> record already, so the torn-page on disk is going to be fixed with the wal
> replay ... *because* of the torn-page problem already being "solved" in PG.
> You don't get the hint-bits back, but that's no different from the current
> state.  But nobody's previously cared if hint-bits weren't set on WAL replay.

Hum. Actually I think you're right.

However you still have a problem that someone could come along and set the
hint bit between calculating the CRC and actually calling write.



--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


Re: Block-level CRC checks

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> However you still have a problem that someone could come along and set the
> hint bit between calculating the CRC and actually calling write.

The double-buffering will solve that.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Block-level CRC checks

From
"Matthew T. O'Connor"
Date:
Aidan Van Dyk wrote:
> * Greg Stark <greg.stark@enterprisedb.com> [081117 03:54]:
>> I thought of saying that too but it doesn't really solve the problem.  
>> Think of what happens if someone sets a hint bit on a dirty page.
> 
> If the page is dirty from a "real change", then it has a WAL backup block
> record already, so the torn-page on disk is going to be fixed with the wal
> replay ... *because* of the torn-page problem already being "solved" in PG.
> You don't get the hint-bits back, but that's no different from the current
> state.  But nobody's previously cared if hint-bits weren't set on WAL replay.


What if all changes to a page (even hint bits) are WAL-logged when
running with block-level CRC checks enabled?  Would that make things
easier?  I'm sure it would result in some performance loss, but anyone
enabling block-level CRCs is already trading some performance for safety.

Thoughts?


Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Matthew T. O'Connor <matthew@zeut.net> [081117 15:19]:
> Aidan Van Dyk wrote:
>> * Greg Stark <greg.stark@enterprisedb.com> [081117 03:54]:
>>> I thought of saying that too but it doesn't really solve the problem. 
>>>  Think of what happens if someone sets a hint bit on a dirty page.
>>
>> If the page is dirty from a "real change", then it has a WAL backup block
>> record already, so the torn-page on disk is going to be fixed with the wal
>> replay ... *because* of the torn-page problem already being "solved" in PG.
>> You don't get the hint-bits back, but that's no different from the current
>> state.  But nobody's previously cared if hint-bits weren't set on WAL replay.
>
>
> What if all changes to a page (even hint bits) are WAL-logged when
> running with block-level CRC checks enabled?  Would that make things
> easier?  I'm sure it would result in some performance loss, but anyone
> enabling block-level CRCs is already trading some performance for safety.
>
> Thoughts?

*I'd* be more than happy with that trade-off, because:

1) I run PostgreSQL on old, crappy hardware
2) I run small databases
3) I've never had a situation where PG was already too slow
4) I'd like to know when I really *should* dump old hardware...

But I'm not going to lose money if my DB is down either, so I'd hardly
consider myself one to cater to ;-)

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
Paul Schlie
Date:
Heikki Linnakangas wrote:
>> Gregory Stark wrote:
>> However you still have a problem that someone could come along and set the
>> hint bit between calculating the CRC and actually calling write.
>
> The double-buffering will solve that.

Or simply require that hint bit writes acquire a write lock on the page
(which should be available if it's not being critically updated or flushed);
then to write/flush, simply do the same: calculate its CRC, write/flush to
disk, then release its lock.  That seems like the most reliable thing
to do (although I'm no expert on pg's implementation by any means).

I would guess that although there may be occasional lock contention, it's
likely minor, and it would greatly simplify the whole process, it would seem?

(Unless I misunderstand, even double buffering requires a lock, since if
multiple hint bits may be updated during the copy, the resulting copy may be
inconsistent if it only partially reflects the updates in progress.)




Re: Block-level CRC checks

From
Gregory Stark
Date:
Paul Schlie <schlie@comcast.net> writes:

> Heikki Linnakangas wrote:
>>> Gregory Stark wrote:
>>> However you still have a problem that someone could come along and set the
>>> hint bit between calculating the CRC and actually calling write.
>>
>> The double-buffering will solve that.
>
> Or simply require that hint bit writes acquire a write lock on the page
> (which should be available if it's not being critically updated or flushed);
> then to write/flush, simply do the same: calculate its CRC, write/flush to
> disk, then release its lock.  That seems like the most reliable thing
> to do (although I'm no expert on pg's implementation by any means).
>
> I would guess that although there may be occasional lock contention, it's
> likely minor, and it would greatly simplify the whole process, it would seem?

Well, it would be a lot more locking than now. You're talking about locking
potentially hundreds of times per page scanned, as well as locking when doing a
write, which is potentially a long time since the write can block.

It would be the simplest option. Perhaps we should test whether it's actually
a problem.

> (Unless I misunderstand, even double buffering requires a lock, since if
> multiple hint bits may be updated during the copy, the resulting copy may be
> inconsistent if it only partially reflects the updates in progress.)

No, you only need a share lock to do the copy, since there's nothing wrong with
"inconsistent" sets of hint bits as long as you're checksumming the same copy
you're putting on disk.
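
Roughly like this (a sketch only; the locking and write calls are stand-ins,
not the real buffer-manager API):

    #include <stdint.h>
    #include <string.h>

    #define BLCKSZ 8192

    /* Stand-ins for the real buffer-manager and CRC routines. */
    extern void     lock_buffer_shared(void *page);
    extern void     unlock_buffer(void *page);
    extern uint32_t crc32_block(const void *data, size_t len);
    extern void     write_block_with_crc(const void *copy, uint32_t crc);

    /*
     * Double-buffering sketch: copy the page while holding only a share
     * lock, then checksum and write the private copy.  Hint bits set
     * concurrently may or may not be captured by the copy, but the CRC
     * always matches the exact bytes handed to the kernel.
     */
    void
    flush_with_crc(void *shared_page)
    {
        char copy[BLCKSZ];

        lock_buffer_shared(shared_page);
        memcpy(copy, shared_page, BLCKSZ);
        unlock_buffer(shared_page);

        write_block_with_crc(copy, crc32_block(copy, BLCKSZ));
    }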

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning


Re: Block-level CRC checks

From
Paul Schlie
Date:
Gregory Stark wrote:
> Paul Schlie writes:
> 
>> Heikki Linnakangas wrote:
>>>> Gregory Stark wrote:
>>>> However you still have a problem that someone could come along and set the
>>>> hint bit between calculating the CRC and actually calling write.
>>> 
>>> The double-buffering will solve that.
>> 
>> Or simply require that hint bit writes acquire a write lock on the page
>> (which should be available if it's not being critically updated or flushed);
>> then to write/flush, simply do the same: calculate its CRC, write/flush to
>> disk, then release its lock.  That seems like the most reliable thing
>> to do (although I'm no expert on pg's implementation by any means).
>> 
>> I would guess that although there may be occasional lock contention, it's
>> likely minor, and it would greatly simplify the whole process, it would seem?
> 
> Well, it would be a lot more locking than now. You're talking about locking
> potentially hundreds of times per page scanned, as well as locking when doing
> a write, which is potentially a long time since the write can block.

- I guess one could define another lock, specifically for hint bits; thereby
a page scan would not need that lock, but hint bit updates could use it in
combination with a page flush requiring both a share lock and the hint bit
lock. (But I don't know if it's overall better than copying.)

> It would be the simplest option. Perhaps we should test whether it's actually
> a problem.
> 
>> (Unless I misunderstand, even double buffering requires a lock, since if
>> multiple hint bits may be updated during the copy, the resulting copy may be
>> inconsistent if it only partially reflects the updates in progress.)
> 
> No, you only need a share lock to do the copy, since there's nothing wrong with
> "inconsistent" sets of hint bits as long as you're checksumming the same copy
> you're putting on disk.

Understood, thanks.




Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Aidan Van Dyk wrote:

> And I don't think anyone's going to have a good answer either way unless we get
> real numbers.  But I don't know of any way to get at these numbers right now.
> 
> 1) How many writes happen on buffer pages that are "hint dirty" but not "really
>    dirty"?
> 
> 2) How much IO would writing WAL records for hint bits on every page write
>    take up?

I don't think it's a matter of how many writes or how much IO.  The
question is locks.  Right now we flip hint bits without taking any kind
of lock on the page.  If we're going to WAL-log each hint bit change,
then we will need to lock the page to update the LSN.  This will make
changing a hint bit a very expensive operation, and maybe a possible
cause for deadlocks.
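
That is, every setter would need to do something like this (sketch only,
all names hypothetical):

    #include <stdint.h>

    /* Hypothetical stand-ins, not the real API. */
    extern void     lock_page_exclusive(void *page);
    extern void     unlock_page(void *page);
    extern uint64_t wal_log_hint_bit(void *page, void *tuple, uint16_t bit);
    extern void     page_set_lsn(void *page, uint64_t lsn);
    extern void     set_infomask_bit(void *tuple, uint16_t bit);

    /*
     * Sketch of the expensive path: with per-hint-bit WAL logging, what
     * is today a lock-free bit flip becomes an exclusive page lock plus
     * a WAL insert, because the page LSN must be updated safely.
     */
    void
    set_hint_bit_wal_logged(void *page, void *tuple, uint16_t bit)
    {
        lock_page_exclusive(page);
        set_infomask_bit(tuple, bit);
        page_set_lsn(page, wal_log_hint_bit(page, tuple, bit));
        unlock_page(page);
    }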

What my patch did was log hint bits in bulk.  The problem with that
approach was precisely that it was not locking the logged page enough
(locking before setting the "this page needs hint bits logged" bit).  Of
course, the trivial solution is just to lock the page before flipping
hint bits, but I don't know (and I doubt) whether it would really work
at all.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
Heikki Linnakangas
Date:
Alvaro Herrera wrote:
> Right now we flip hint bits without taking any kind
> of lock on the page.

That's not quite true. You need to hold a shared lock on heap page to 
examine the visibility of a tuple, and that's when the hint bits are 
set. So we always hold at least a shared lock on the page while hint 
bits are set.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Alvaro Herrera <alvherre@commandprompt.com> [081118 12:25]:
> I don't think it's a matter of how many writes or how much IO.  The
> question is locks.  Right now we flip hint bits without taking any kind
> of lock on the page.  If we're going to WAL-log each hint bit change,
> then we will need to lock the page to update the LSN.  This will make
> changing a hint bit a very expensive operation, and maybe a possible
> cause for deadlocks.

Ya, that's obviously the worst option.

> What my patch did was log hint bits in bulk.  The problem with that
> approach was precisely that it was not locking the logged page enough
> (locking before setting the "this page needs hint bits logged" bit).  Of
> course, the trivial solution is just to lock the page before flipping
> hint bits, but I don't know (and I doubt) whether it would really work
> at all.

But why can't you WAL-log the hint bits from the "buffered" page?  Then you're
consistent.  At least as consistent as the original write was.

So your CRC ends up being:
   Buffer the page
   Calculate CRC on the buffered page
   WAL (in bulk) the hint bits (and maybe CRC?)
   Write the buffered page
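
In other words (sketch only, hypothetical names):

    #include <stdint.h>

    #define BLCKSZ 8192

    /* Hypothetical stand-ins for the steps above. */
    extern void     copy_page_under_share_lock(void *shared, void *copy);
    extern uint32_t crc32_block(const void *data, unsigned len);
    extern void     wal_log_hint_bits_bulk(const void *copy); /* maybe CRC too */
    extern void     write_block(const void *copy, uint32_t crc);

    /*
     * Sketch of the flush sequence above.  The bulk WAL record has to be
     * flushed before the data block hits disk (the usual log-before-data
     * rule), so recovery can restore the hint bits the CRC was computed
     * over.
     */
    void
    flush_sequence(void *shared_page, void *copy)
    {
        uint32_t crc;

        copy_page_under_share_lock(shared_page, copy);
        crc = crc32_block(copy, BLCKSZ);
        wal_log_hint_bits_bulk(copy);
        write_block(copy, crc);
    }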
 

-- 
aidan van dyk                                             create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Alvaro Herrera wrote:
>> Right now we flip hint bits without taking any kind
>> of lock on the page.

> That's not quite true. You need to hold a shared lock on heap page to 
> examine the visibility of a tuple, and that's when the hint bits are 
> set. So we always hold at least a shared lock on the page while hint 
> bits are set.

Right, but we couldn't let hint-bit-setters update the page LSN with
only shared lock.  Too much chance of ending up with a scrambled LSN
value.

Could we arrange for the actual LSN-updating to be done while still
holding WALInsertLock?  Then we'd be depending on that lock, not the
page-level locks, to serialize.  It's not great to be pushing more work
inside that global lock, but it's not very much more work ...
        regards, tom lane


Re: Block-level CRC checks

From
Tom Lane
Date:
Aidan Van Dyk <aidan@highrise.ca> writes:
> But why can't you WAL-log the hint bits from the "buffered" page?  Then you're
> consistent.  At least as consistent as the original write was.

> So your CRC ends up being:
>    Buffer the page
>    Calculate CRC on the buffered page
>    WAL (in bulk) the hint bits (and maybe CRC?)
>    Write the buffered page

The trouble here is to avoid repeated WAL-logging of the same hint bits.

(Alvaro's patch tried to do that by depending on another hint bit in the
page header, but that seems unsafe if hint bit setters aren't taking
exclusive lock.)
        regards, tom lane


Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Tom Lane <tgl@sss.pgh.pa.us> [081118 12:43]:
> Aidan Van Dyk <aidan@highrise.ca> writes:
> > But why can't you WAL-log the hint bits from the "buffered" page?  Then you're
> > consistent.  At least as consistent as the original write was.
> 
> > So your CRC ends up being:
> >    Buffer the page
> >    Calculate CRC on the buffered page
> >    WAL (in bulk) the hint bits (and maybe CRC?)
> >    Write the buffered page
> 
> The trouble here is to avoid repeated WAL-logging of the same hint bits.
> 
> (Alvaro's patch tried to do that by depending on another hint bit in the
> page header, but that seems unsafe if hint bit setters aren't taking
> exclusive lock.)

And I know it's extra IO.  That's why I started the whole thing with a question
along the lines of "how much extra IO are people going to take" for the sake of
"guarenteeing" we read exactly what we wrote.

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
"Joshua D. Drake"
Date:
On Tue, 2008-11-18 at 12:54 -0500, Aidan Van Dyk wrote:
> * Tom Lane <tgl@sss.pgh.pa.us> [081118 12:43]:
> > Aidan Van Dyk <aidan@highrise.ca> writes:

> > The trouble here is to avoid repeated WAL-logging of the same hint bits.
> > 
> > (Alvaro's patch tried to do that by depending on another hint bit in the
> > page header, but that seems unsafe if hint bit setters aren't taking
> > exclusive lock.)
> 
> And I know it's extra IO.  That's why I started the whole thing with a question
> along the lines of "how much extra IO are people going to take" for the sake of
> "guarenteeing" we read exactly what we wrote.

Those that need it will turn it on, those that don't won't.

IO is cheap for those that are going to actually need this feature.

Joshua D. Drake

> 
> a.
> 
-- 



Re: Block-level CRC checks

From
"Jaime Casanova"
Date:
On Thu, Nov 13, 2008 at 1:00 PM, Alvaro Herrera <alvherre@commandprompt.com>
> This patch is also skipping pd_special and the unused area of the page.
>

v11 doesn't apply to cvs head anymore

--
Regards,
Jaime Casanova
PostgreSQL support and training
Systems consulting and development
Guayaquil - Ecuador
Cel. +59387171157


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Jaime Casanova wrote:
> On Thu, Nov 13, 2008 at 1:00 PM, Alvaro Herrera <alvherre@commandprompt.com>
> > This patch is also skipping pd_special and the unused area of the page.
> 
> v11 doesn't apply to cvs head anymore

I'm not currently working on this patch, sorry.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
Josh Berkus
Date:
Alvaro Herrera wrote:
> Jaime Casanova wrote:
>> On Thu, Nov 13, 2008 at 1:00 PM, Alvaro Herrera <alvherre@commandprompt.com>
>>> This patch is also skipping pd_special and the unused area of the page.
>> v11 doesn't apply to cvs head anymore
> 
> I'm not currently working on this patch, sorry.
> 

Should we pull it from 8.4, then?

--Josh


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Sun, Dec 14, 2008 at 4:51 PM, Josh Berkus <josh@agliodbs.com> wrote:
>>> v11 doesn't apply to cvs head anymore
>>
>> I'm not currently working on this patch, sorry.
>>
>
> Should we pull it from 8.4, then?

Here's an updated patch against head.

NOTE: it appears that this (and the previous) patch PANICs with
"concurrent transaction log activity while database system is shutting
down" on shutdown if checksumming is enabled.  This appears to be due
to FlushBuffer (lines 1821-1828) during the checkpoint-at-shutdown.
Other than that, I haven't looked into what needs to be done to fix
it.

Similarly, I ran pgbench, performed a manual checkpoint, and
corrupted the tellers table myself using hexedit, but the system didn't
pick up the corruption at all :(

Alvaro, have you given up on the patch or are you just busy on
something else at the moment?

--
Jonah H. Harris, Senior DBA
myYearbook.com

Attachment

Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Jonah H. Harris wrote:
> On Sun, Dec 14, 2008 at 4:51 PM, Josh Berkus <josh@agliodbs.com> wrote:
> >>> v11 doesn't apply to cvs head anymore
> >>
> >> I'm not currently working on this patch, sorry.
> >>
> >
> > Should we pull it from 8.4, then?
> 
> Here's an updated patch against head.

Thanks.

> NOTE: it appears that this (and the previous) patch PANICs with
> "concurrent transaction log activity while database system is shutting
> down" on shutdown if checksumming is enabled.  This appears to be due
> to FlushBuffer (lines 1821-1828) during the checkpoint-at-shutdown.

Yeah, I reported this issue several times.

> Similarly, I ran pgbench, performed a manual checkpoint, and
> corrupted the tellers table myself using hexedit, but the system didn't
> pick up the corruption at all :(

Heh :-)

> Alvaro, have you given up on the patch or are you just busy on
> something else at the moment?

I've given up until we find a good way to handle hint bits.  Various
schemes have been proposed but they all have more or less fatal flaws.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Mon, Dec 15, 2008 at 7:24 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
>> Here's an updated patch against head.
>
> Thanks.

No problemo.

>> NOTE: it appears that this (and the previous) patch PANICs with
>> "concurrent transaction log activity while database system is shutting
>> down" on shutdown if checksumming is enabled.  This appears to be due
>> to FlushBuffer (lines 1821-1828) during the checkpoint-at-shutdown.
>
> Yeah, I reported this issue several times.

Hmm.  Well, the easiest thing would be to add a !shutdown check for
logging the hint bits during the shutdown checkpoint :)  Of course,
that would break the page for recovery, which was the whole point of
putting that in place.  I'd have to look at xlog and see whether that
check can be deferred or changed.  Or, did you already research this
issue?

>> Similarly, I ran pgbench, performed a manual checkpoint, and
>> corrupted the tellers table myself using hexedit, but the system didn't
>> pick up the corruption at all :(
>
> Heh :-)

:(

>> Alvaro, have you given up on the patch or are you just busy on
>> something else at the moment?
>
> I've given up until we find a good way to handle hint bits.  Various
> schemes have been proposed but they all have more or less fatal flaws.

Agreed.  Though, I don't want to see this patch get dropped from 8.4.

ALL, Alvaro has tried a couple different methods, does anyone have any
other ideas?

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Bruce Momjian
Date:
Jonah H. Harris wrote:
> >> Alvaro, have you given up on the patch or are you just busy on
> >> something else at the moment?
> >
> > I've given up until we find a good way to handle hint bits.  Various
> > schemes have been proposed but they all have more or less fatal flaws.
> 
> Agreed.  Though, I don't want to see this patch get dropped from 8.4.
> 
> ALL, Alvaro has tried a couple different methods, does anyone have any
> other ideas?

Feature freeze is not the time to be looking for new ideas.  I suggest
we save this for 8.5.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Mon, Dec 15, 2008 at 10:13 AM, Bruce Momjian <bruce@momjian.us> wrote:
> Jonah H. Harris wrote:
>> >> Alvaro, have you given up on the patch or are you just busy on
>> >> something else at the moment?
>> >
>> > I've given up until we find a good way to handle hint bits.  Various
>> > schemes have been proposed but they all have more or less fatal flaws.
>>
>> Agreed.  Though, I don't want to see this patch get dropped from 8.4.
>>
>> ALL, Alvaro has tried a couple different methods, does anyone have any
>> other ideas?
>
> Feature freeze is not the time to be looking for new ideas.  I suggest
> we save this for 8.5.

Well, we may not need a new idea.  Currently, the problem I see with
the checkpoint-at-shutdown looks like it could be solved fairly easily.
Though, there may be other issues I'm not familiar with.  Has anyone
reviewed this yet?

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Tom Lane
Date:
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
> On Mon, Dec 15, 2008 at 10:13 AM, Bruce Momjian <bruce@momjian.us> wrote:
>> Feature freeze is not the time to be looking for new ideas.  I suggest
>> we save this for 8.5.

> Well, we may not need a new idea.

We don't really have an acceptable solution for the conflict with hint
bit behavior.  The shutdown issue is minor, agreed, but that's not the
stumbling block.
        regards, tom lane


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Jonah H. Harris wrote:

> Well, we may not need a new idea.  Currently, the problem I see with
> the checkpoint-at-shutdown looks like it could possibly be easily
> solved.  Though, there may be other issues I'm not familiar with.  Has
> anyone reviewed this yet?

I didn't investigate the shutdown checkpoint issue a lot (I was aware of
it), because the really hard problem is hint bits.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Mon, Dec 15, 2008 at 11:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Jonah H. Harris" <jonah.harris@gmail.com> writes:
>> On Mon, Dec 15, 2008 at 10:13 AM, Bruce Momjian <bruce@momjian.us> wrote:
>>> Feature freeze is not the time to be looking for new ideas.  I suggest
>>> we save this for 8.5.
>
>> Well, we may not need a new idea.
>
> We don't really have an acceptable solution for the conflict with hint
> bit behavior.  The shutdown issue is minor, agreed, but that's not the
> stumbling block.

Agreed on the shutdown issue.  But, didn't this patch address the hint
bit setting as discussed?  After performing a cursory look at the
patch, it appears that hint-bit changes are detected and a WAL entry
is written on buffer flush if hint bits had been changed.  I don't see
anything wrong with this in theory.  Am I missing something?

Now, in the case where hint bits have been updated and a WAL record is
required because the buffer is being flushed, requiring the WAL to be
flushed up to that point may be a killer on performance.  Has anyone
tested it?

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Jonah H. Harris wrote:
> On Mon, Dec 15, 2008 at 11:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> > We don't really have an acceptable solution for the conflict with hint
> > bit behavior.  The shutdown issue is minor, agreed, but that's not the
> > stumbling block.
> 
> Agreed on the shutdown issue.  But, didn't this patch address the hint
> bit setting as discussed?  After performing a cursory look at the
> patch, it appears that hint-bit changes are detected and a WAL entry
> is written on buffer flush if hint bits had been changed.  I don't see
> anything wrong with this in theory.  Am I missing something?

That only does heap hint bits, but it does nothing about pd_flags, the
btree flags (btpo_cycleid I think), and something else I don't recall at
the moment.  This was all solvable however.  The big problem with it was
that it was using a new bit in pd_flags in unsafe ways.  To make it safe
you'd have to grab a lock on the page, which is very probably problematic.

> Now, in the case where hint bits have been updated and a WAL record is
> required because the buffer is being flushed, requiring the WAL to be
> flushed up to that point may be a killer on performance.  Has anyone
> tested it?

I didn't measure it but I'm sure it'll be plenty slow.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Mon, Dec 15, 2008 at 11:50 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> That only does heap hint bits, but it does nothing about pd_flags, the
> btree flags (btpo_cycleid I think), and something else I don't recall at
> the moment.  This was all solvable however.  The big problem with it was
> that it was using a new bit in pd_flags in unsafe ways.  To make it safe
> you'd have to grab a lock on the page, which is very probably problematic.

:(

>> Now, in the case where hint bits have been updated and a WAL record is
>> required because the buffer is being flushed, requiring the WAL to be
>> flushed up to that point may be a killer on performance.  Has anyone
>> tested it?
>
> I didn't measure it but I'm sure it'll be plenty slow.

Yeah.  What really sucks is that it would be fairly unpredictable and
could easily result in unexpected production performance issues.

It is pretty late in the process to continue with this design-related
discussion, but I really wanted to see it in 8.4.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Jonah H. Harris wrote:

> It is pretty late in the process to continue with this design-related
> discussion, but I really wanted to see it in 8.4.

Well, it's hard to blame anyone but me, because I started working on
this barely two weeks before the final commitfest IIRC.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
Gregory Stark
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:

> Jonah H. Harris wrote:
>> Now, in the case where hint bits have been updated and a WAL record is
>> required because the buffer is being flushed, requiring the WAL to be
>> flushed up to that point may be a killer on performance.  Has anyone
>> tested it?
>
> I didn't measure it but I'm sure it'll be plenty slow.

How hard would it be to just take an exclusive lock on the page when setting
all these hint bits? It might be a big performance hit but it would only
affect running with CRC enabled and we can document that. And it wouldn't
involve contorting the existing code much.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning


Re: Block-level CRC checks

From
"Joshua D. Drake"
Date:
On Mon, 2008-12-15 at 10:13 -0500, Bruce Momjian wrote:
> Jonah H. Harris wrote:
> > >> Alvaro, have you given up on the patch or are you just busy on
> > >> something else at the moment?
> > >
> > > I've given up until we find a good way to handle hint bits.  Various
> > > schemes have been proposed but they all have more or less fatal flaws.
> > 
> > Agreed.  Though, I don't want to see this patch get dropped from 8.4.
> > 
> > ALL, Alvaro has tried a couple different methods, does anyone have any
> > other ideas?
> 
> Feature freeze is not the time to be looking for new ideas.  I suggest
> we save this for 8.5.

Agreed, shall we remove the replication and SE-PostgreSQL patches too :P.

If we can't fix the issue, then yeah let's rip it out but as it sits we
have a hurdle that needs to be overcome not a new feature that needs to
be implemented.

Sincerely,

Joshua D. Drake

> 
> -- 
>   Bruce Momjian  <bruce@momjian.us>        http://momjian.us
>   EnterpriseDB                             http://enterprisedb.com
> 
>   + If your life is a hard drive, Christ can be your backup. +
> 
-- 
PostgreSQL Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997



Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Joshua D. Drake wrote:

> If we can't fix the issue, then yeah let's rip it out but as it sits we
> have a hurdle that needs to be overcome not a new feature that needs to
> be implemented.

Ideas for solving the hurdle are welcome.

> Agreed, shall we remove the replication and SE-PostgreSQL patches too :P.

There are plenty of ideas for those patches, and lively discussion.
They look successful to me, which this patch does not.

Please do not troll.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Gregory Stark wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> 
> > Jonah H. Harris wrote:
> >> Now, in the case where hint bits have been updated and a WAL record is
> >> required because the buffer is being flushed, requiring the WAL to be
> >> flushed up to that point may be a killer on performance.  Has anyone
> >> tested it?
> >
> > I didn't measure it but I'm sure it'll be plenty slow.
> 
> How hard would it be to just take an exclusive lock on the page when setting
> all these hint bits?

I guess it will be intolerably slow then.  If we were to say "we have
CRC now, but if you enable it you have 1% of the performance" we will
get laughed at.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
"Joshua D. Drake"
Date:
On Mon, 2008-12-15 at 14:29 -0300, Alvaro Herrera wrote:
> Joshua D. Drake wrote:
> 
> > If we can't fix the issue, then yeah let's rip it out but as it sits we
> > have a hurdle that needs to be overcome not a new feature that needs to
> > be implemented.
> 
> Ideas for solving the hurdle are welcome.
> 
> > Agreed, shall we remove the replication and SE-PostgreSQL patches too :P.
> 
> There are plenty of ideas for those patches, and lively discussion.
> They look successful to me, which this patch does not.
> 
> Please do not troll.

I wasn't trolling. I was making a point.

Joshua D. Drake


> 
-- 
PostgreSQL Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997



Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Mon, Dec 15, 2008 at 12:30 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
>> How hard would it be to just take an exclusive lock on the page when setting
>> all these hint bits?
>
> I guess it will be intolerably slow then.  If we were to say "we have
> CRC now, but if you enable it you have 1% of the performance" we will
> get laughed at.

Well, Oracle does tell users that enabling full CRC checking will cost
~5% performance overhead, which is reasonable to me.  I'm not
pessimistic enough to think we'd be down to 1% the performance of a
non-CRC enabled system, but the locking overhead would probably be
fairly high.  The problem is, at this point, we don't really know what
the impact would be either way :(

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Joshua D. Drake wrote:
> On Mon, 2008-12-15 at 14:29 -0300, Alvaro Herrera wrote:
> > Joshua D. Drake wrote:
> > 
> > > If we can't fix the issue, then yeah let's rip it out but as it sits we
> > > have a hurdle that needs to be overcome not a new feature that needs to
> > > be implemented.
> > 
> > Ideas for solving the hurdle are welcome.
> > 
> > > Agreed, shall we remove the replication and SE-PostgreSQL patches too :P.
> > 
> > There are plenty of ideas for those patches, and lively discussion.
> > They look successful to me, which this patch does not.
> > 
> > Please do not troll.
> 
> I wasn't trolling. I was making a point.

Okay.  Sorry.  I was defeating your point.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
Simon Riggs
Date:
On Fri, 2008-10-17 at 12:26 -0300, Alvaro Herrera wrote:
> So this discussion died with no solution arising to the
> hint-bit-setting-invalidates-the-CRC problem.
> 
> Apparently the only solution in sight is to WAL-log hint bits.  Simon
> opines it would be horrible from a performance standpoint to WAL-log
> every hint bit set, and I think we all agree with that.  So we need to
> find an alternative mechanism to WAL log hint bits.

It occurred to me that maybe we don't need to WAL-log the CRC checks.

Proposal

* We reserve enough space on a disk block for a CRC check. When a dirty
block is written to disk we calculate and annotate the CRC value, though
this is *not* WAL logged.

* In normal running we re-check the CRC when we read the block back into
shared_buffers.

* In recovery we will overwrite the last image of a block from WAL, so
we ignore the block CRC check, since the WAL record was already CRC
checked. If full_page_writes = off, we ignore and zero the block's CRC
for any block touched during recovery. We do those things because the
block CRC in the WAL is likely to be different to that on disk, due to
hints.

* We also re-check the CRC on a block immediately before we dirty the
block (for any reason). This minimises the possibility of in-memory data
corruption for blocks.

So in the typical case all blocks moving from disk <-> memory and from
clean -> dirty are CRC checked. So in the case where we have
full_page_writes = on then we have a good CRC every time. In the
full_page_writes = off case we are exposed only on the blocks that
changed during last checkpoint cycle and only if we crash. That seems
good because most databases are up 99% of the time, so any corruptions
are likely to occur in normal running, not as a result of crashes.

This would be a run-time option.

Like it?
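
As a sketch of the read-side check (hypothetical names; assumes space
reserved in the page for the CRC):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical stand-ins for the page/CRC accessors. */
    extern uint32_t stored_block_crc(const void *page);
    extern uint32_t computed_block_crc(const void *page); /* CRC field zeroed */

    /*
     * Sketch: verify the block CRC as it is read into shared_buffers.
     * During WAL replay the check is skipped, since the block will be
     * overwritten from an already-CRC-checked WAL image (and with
     * full_page_writes = off the stored CRC is ignored and zeroed).
     */
    bool
    verify_block_crc(const void *page, bool in_recovery)
    {
        if (in_recovery)
            return true;

        return stored_block_crc(page) == computed_block_crc(page);
    }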

--
Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
"Joshua D. Drake"
Date:
On Mon, 2009-11-30 at 13:21 +0000, Simon Riggs wrote:
> On Fri, 2008-10-17 at 12:26 -0300, Alvaro Herrera wrote:
> > So this discussion died with no solution arising to the
> > hint-bit-setting-invalidates-the-CRC problem.
> >
> > Apparently the only solution in sight is to WAL-log hint bits.  Simon
> > opines it would be horrible from a performance standpoint to WAL-log
> > every hint bit set, and I think we all agree with that.  So we need to
> > find an alternative mechanism to WAL log hint bits.
>
> It occurred to me that maybe we don't need to WAL-log the CRC checks.
>
> Proposal
>
> * We reserve enough space on a disk block for a CRC check. When a dirty
> block is written to disk we calculate and annotate the CRC value, though
> this is *not* WAL logged.
>
> * In normal running we re-check the CRC when we read the block back into
> shared_buffers.
>
> * In recovery we will overwrite the last image of a block from WAL, so
> we ignore the block CRC check, since the WAL record was already CRC
> checked. If full_page_writes = off, we ignore and zero the block's CRC
> for any block touched during recovery. We do those things because the
> block CRC in the WAL is likely to be different to that on disk, due to
> hints.
>
> * We also re-check the CRC on a block immediately before we dirty the
> block (for any reason). This minimises the possibility of in-memory data
> corruption for blocks.
>
> So in the typical case all blocks moving from disk <-> memory and from
> clean -> dirty are CRC checked. So in the case where we have
> full_page_writes = on then we have a good CRC every time. In the
> full_page_writes = off case we are exposed only on the blocks that
> changed during last checkpoint cycle and only if we crash. That seems
> good because most databases are up 99% of the time, so any corruptions
> are likely to occur in normal running, not as a result of crashes.
>
> This would be a run-time option.
>
> Like it?
>

Just FYI, Alvaro is out of town and out of email access (almost
exclusively). It may take him another week or so to get back to this.

Joshua D. Drake



> --
>  Simon Riggs           www.2ndQuadrant.com
>
>


--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - Salamander

Re: Block-level CRC checks

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> Proposal
> 
> * We reserve enough space on a disk block for a CRC check. When a dirty
> block is written to disk we calculate and annotate the CRC value, though
> this is *not* WAL logged.

Imagine this:
1. A hint bit is set. It is not WAL-logged, but the page is dirtied.
2. The buffer is flushed out of the buffer cache to the OS. A new CRC is
calculated and stored on the page.
3. Half of the page is flushed to disk (aka torn page problem). The CRC
made it to disk but the flipped hint bit didn't.

You now have a page with incorrect CRC on disk.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Block-level CRC checks

From
Simon Riggs
Date:
On Mon, 2009-11-30 at 22:27 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > Proposal
> > 
> > * We reserve enough space on a disk block for a CRC check. When a dirty
> > block is written to disk we calculate and annotate the CRC value, though
> > this is *not* WAL logged.
> 
> Imagine this:
> 1. A hint bit is set. It is not WAL-logged, but the page is dirtied.
> 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is
> calculated and stored on the page.
> 3. Half of the page is flushed to disk (aka torn page problem). The CRC
> made it to disk but the flipped hint bit didn't.
> 
> You now have a page with incorrect CRC on disk.

You've written that as if you are spotting a problem. It sounds to me
that this is exactly the situation we would like to detect and this is a
perfect way of doing that.

What do you see as the purpose here, apart from spotting corruptions?

Do we think error rates are so low we can recover the corruption by
doing something clever with the CRC? I envisage most corruptions as
being unrecoverable except from backup/WAL/replicated servers. 

It's been a long day, so perhaps I've misunderstood.

--
Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Simon Riggs <simon@2ndQuadrant.com> [091130 16:28]:
> 
> You've written that as if you are spotting a problem. It sounds to me
> that this is exactly the situation we would like to detect and this is a
> perfect way of doing that.
> 
> What do you see as the purpose here, apart from spotting corruptions?
> 
> Do we think error rates are so low we can recover the corruption by
> doing something clever with the CRC? I envisage most corruptions as
> being unrecoverable except from backup/WAL/replicated servers. 
> 
> It's been a long day, so perhaps I've misunderstood.

No, I believe the torn-page problem is exactly the thing that made the
checksum talks stall out last time...  The torn page isn't currently a
problem on only-hint-bit-dirty writes, because if you get
half-old/half-new, the only change is the hint bit - no big loss, the
data is still the same.

But, with a form of check-sums, when you read it the next time, is it
corrupt?  According to the check-sum, yes, but in reality, the *data* is
still valid, just that the check sum isn't correctly matching the
half-changed hint bits...

And then many not-so-really-attractive workarounds were thrown around,
with nothing nice falling into place...

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
Simon Riggs
Date:
On Mon, 2009-11-30 at 16:49 -0500, Aidan Van Dyk wrote:
> * Simon Riggs <simon@2ndQuadrant.com> [091130 16:28]:
> > 
> > You've written that as if you are spotting a problem. It sounds to me
> > that this is exactly the situation we would like to detect and this is a
> > perfect way of doing that.
> > 
> > What do you see as the purpose here, apart from spotting corruptions?
> > 
> > Do we think error rates are so low we can recover the corruption by
> > doing something clever with the CRC? I envisage most corruptions as
> > being unrecoverable except from backup/WAL/replicated servers. 
> > 
> > It's been a long day, so perhaps I've misunderstood.
> 
> No, I believe the torn-page problem is exactly the thing that made the
> checksum talks stall out last time...  The torn page isn't currently a
> problem on only-hint-bit-dirty writes, because if you get
> half-old/half-new, the only change is the hint bit - no big loss, the
> data is still the same.
> 
> But, with a form of check-sums, when you read it the next time, is it
> corrupt?  According to the check-sum, yes, but in reality, the *data* is
> still valid, just that the check sum isn't correctly matching the
> half-changed hint bits...

A good argument, but we're missing a sense of proportion.

There are at most 240 hint bits in an 8192-byte block. So that is less
than 0.5% of the data block where a single-bit error would not corrupt
data, and 0% of the data block where a 2+ bit error would not corrupt
data. Put it another way, more than 99.5% of possible errors would cause
data loss, so I would at least like the option of being told about them.
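
(For the arithmetic: 240 bits out of 8192 * 8 = 65536 bits is 240/65536,
or roughly 0.37%.)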

The other perspective is that these errors are unlikely to be caused by
cosmic rays and other quantum effects; they are more likely to be caused
by hardware errors. Hardware errors are frequently repeatable, so one
bank of memory or one section of DRAM is damaged and will give errors.
If we don't report an error, the next error from that piece of hardware
is almost certain to cause data loss, so even a false positive result
should be treated as a good indicator of a true positive detection
result in the future.

If protection against data loss really does need to be so invasive that
we need to WAL-log all changes, then let's make it a table-level option.
If people want to pay the price, we should at least give them the option
of doing so. We can think of ways of optimising it later. Since I was
the one who opposed this on the basis of performance, I want to rescind
that objection and say let's make it an option for those that wish to
trade performance for some visibility of possible data loss errors.

--
Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Mon, 2009-11-30 at 16:49 -0500, Aidan Van Dyk wrote:
>> No, I believe the torn-page problem is exactly the thing that made the
>> checksum talks stall out last time...  The torn page isn't currently a
>> problem on only-hint-bit-dirty writes, because if you get
>> half-old/half-new, the only change is the hint bit - no big loss, the
>> data is still the same.

> A good argument, but we're missing a sense of proportion.

No, I think you are.  The problem with the described behavior is exactly
that it converts a non-problem into a problem --- a big problem, in
fact: uncorrectable data loss.  Loss of hint bits is expected and
tolerated in the current system design.  But a block with bad CRC is not
going to have any automated recovery path.

So the difficulty is that in the name of improving system reliability
by detecting infrequent corruption events, we'd be decreasing system
reliability by *creating* infrequent corruption events, added onto
whatever events we were hoping to detect.  There is no strong argument
you can make that this isn't a net loss --- you'd need to pull some
error-rate numbers out of the air to even try to make the argument,
and in any case the fact remains that more data gets lost with the CRC
than without it.  The only thing the CRC is really buying is giving
the PG project a more plausible argument for blaming data loss on
somebody else; it's not helping the user whose data got lost.

It's hard to justify the amount of work and performance hit we'd take
to obtain a "feature" like that.
        regards, tom lane


Re: Block-level CRC checks

From
Simon Riggs
Date:
On Mon, 2009-11-30 at 20:02 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Mon, 2009-11-30 at 16:49 -0500, Aidan Van Dyk wrote:
> >> No, I believe the torn-page problem is exactly the thing that made the
> >> checksum talks stall out last time...  The torn page isn't currently a
> >> problem on only-hint-bit-dirty writes, because if you get
> >> half-old/half-new, the only change is the hint bit - no big loss, the
> >> data is still the same.
> 
> > A good argument, but we're missing a sense of proportion.
> 
> No, I think you are.  The problem with the described behavior is exactly
> that it converts a non-problem into a problem --- a big problem, in
> fact: uncorrectable data loss.  Loss of hint bits is expected and
> tolerated in the current system design.  But a block with bad CRC is not
> going to have any automated recovery path.
> 
> So the difficulty is that in the name of improving system reliability
> by detecting infrequent corruption events, we'd be decreasing system
> reliability by *creating* infrequent corruption events, added onto
> whatever events we were hoping to detect.  There is no strong argument
> you can make that this isn't a net loss --- you'd need to pull some
> error-rate numbers out of the air to even try to make the argument,
> and in any case the fact remains that more data gets lost with the CRC
> than without it.  The only thing the CRC is really buying is giving
> the PG project a more plausible argument for blaming data loss on
> somebody else; it's not helping the user whose data got lost.
> 
> It's hard to justify the amount of work and performance hit we'd take
> to obtain a "feature" like that.

I think there is a clear justification for an additional option.

There is no "creation" of corruption events. This scheme detects
corruption events that *have* occurred. Now I understand that we
previously would have recovered seamlessly from such events, but they
were corruption events nonetheless and I think they need to be reported.
(For why, see Conclusion #2, below).

The frequency of such events against other corruption events is
important here. You are right that there is effectively one new *type*
of corruption event but without error-rate numbers you can't say that
this shows substantially "more data gets lost with the CRC than without
it".

So let me say this again: the argument that inaction is a safe response
here relies upon error-rate numbers going in your favour. You don't
persuade us of one argument purely by observing that the alternate
proposition requires a certain threshold error-rate - both propositions
do. So it's a straight "what is the error-rate?" discussion, and ISTM
that there is good evidence of what that is.

---

So, what is the probability of single-bit errors affecting hint bits?
The hint bits can occupy any portion of the block, so their positions
are random. They occupy less than 0.5% of the block, so they must
account for a very small proportion of hardware-induced errors.

Since most reasonable servers use Error Correcting Memory, I would
expect not to see a high level of single bit errors, even though we know
they are occurring in the underlying hardware (Conclusion #1, Schroeder
et al, 2009).

What is the chance that a correctable corruption event is in no way
linked to another non-correctable event later? We would need to argue
that corruptions are a purely stochastic process in all cases; yet,
again, there is evidence of a clear and strong linkage from
correctable to non-correctable errors.  (Conclusion #2 and Conclusion
#7, Schroeder et al, 2009).

Schroeder et al 
http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
(thanks Greg!)

Based on that paper, ISTM that ignorable hint bit corruptions are likely
to account for a very small proportion of all corruptions, and of those,
"70-80%" would show up as a non-ignorable corruptions within a month
anyway. So the immediate effect on reliability is tiny, if any. The
effect on detection is huge, which eventually produces significantly
higher reliability overall.

> The only thing the CRC is really buying is giving
> the PG project a more plausible argument for blaming data loss on
> somebody else; it's not helping the user whose data got lost.

This isn't about blame, its about detection. If we know something has
happened we can do something about it. Experienced people know that
hardware goes wrong, they just want to be told so they can fix it. 

I blocked development of a particular proposal earlier for performance
reasons, but did not intend to block progress completely. It seems
likely the checks will cause a performance hit. So make them an option.

--
Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> There is no "creation" of corruption events. This scheme detects
> corruption events that *have* occurred. Now I understand that we
> previously would have recovered seamlessly from such events, but they
> were corruption events nonetheless and I think they need to be reported.
> (For why, see Conclusion #2, below).

No, you're still missing the point. The point is *not* random bit errors
affecting hint bits, but the torn page problem. Today, a torn page is a
completely valid and expected behavior from the OS and storage
subsystem. We handle it with full_page_writes, and by relying on the
fact that it's OK for a hint bit set to get lost. With your scheme, a
torn page would become a corrupt page.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Block-level CRC checks

From
Simon Riggs
Date:
On Tue, 2009-12-01 at 10:04 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > There is no "creation" of corruption events. This scheme detects
> > corruption events that *have* occurred. Now I understand that we
> > previously would have recovered seamlessly from such events, but they
> > were corruption events nonetheless and I think they need to be reported.
> > (For why, see Conclusion #2, below).
> 
> No, you're still missing the point. The point is *not* random bit errors
> affecting hint bits, but the torn page problem. Today, a torn page is a
> completely valid and expected behavior from the OS and storage
> subsystem. We handle it with full_page_writes, and by relying on the
> fact that it's OK for a hint bit set to get lost. With your scheme, a
> torn page would become a corrupt page.

Well, it's easy to keep going on about how much you think I
misunderstand. But I think that's just misdirection.

The way we handle torn page corruptions *hides* actual corruptions from
us. The frequency of true positives and false positives is important
here. If the false positive ratio is very small, then reporting them is
not a problem because of the benefit we get from having spotted the true
positives. Some convicted murderers didn't do it, but that is not an
argument for letting them all go free (without knowing the details). So
we need to know what the false positive ratio is before we evaluate the
benefit of either reporting or non-reporting possible corruption events.

When do you think torn pages happen? Only at crash, or other times also?
Do they always happen at crash? Are there ways to re-check a block that
has suffered a hint-related torn page issue? Are there ways to isolate
and minimise the reporting of false positives? Those are important
questions and this is not black and white.

If the *only* answer really is we-must-WAL-log everything, then that is
the answer, as an option. I suspect that there is a less strict
possibility, if we question our assumptions and look at the frequencies.

We know that I have no time to work on this; I am just trying to hold
open the door to a few possibilities that we have not fully considered
in a balanced way. And I myself am guilty of having slammed the door
previously. I encourage development of a way forward based upon a
balance of utility.

-- Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
Bruce Momjian
Date:
Simon Riggs wrote:
> The way we handle torn page corruptions *hides* actual corruptions from
> us. The frequency of true positives and false positives is important
> here. If the false positive ratio is very small, then reporting them is
> not a problem because of the benefit we get from having spotted the true
> positives. Some convicted murderers didn't do it, but that is not an
> argument for letting them all go free (without knowing the details). So
> we need to know what the false positive ratio is before we evaluate the
> benefit of either reporting or non-reporting possible corruption events.
> 
> When do you think torn pages happen? Only at crash, or other times also?
> Do they always happen at crash? Are there ways to re-check a block that
> has suffered a hint-related torn page issue? Are there ways to isolate
> and minimise the reporting of false positives? Those are important
> questions and this is not black and white.
> 
> If the *only* answer really is we-must-WAL-log everything, then that is
> the answer, as an option. I suspect that there is a less strict
> possibility, if we question our assumptions and look at the frequencies.
> 
> We know that I have no time to work on this; I am just trying to hold
> open the door to a few possibilities that we have not fully considered
> in a balanced way. And I myself am guilty of having slammed the door
> previously. I encourage development of a way forward based upon a
> balance of utility.

I think the problem boils down to what the user response should be to a
corruption report.  If it is a torn page, it would be corrected and the
user doesn't have to do anything.  If it is something that is not
correctable, then the user has corruption and/or bad hardware. I think
the problem is that the existing proposal can't distinguish between
these two cases so the user has no idea how to respond to the report.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
Simon Riggs
Date:
On Tue, 2009-12-01 at 06:35 -0500, Bruce Momjian wrote:
> Simon Riggs wrote:
> > The way we handle torn page corruptions *hides* actual corruptions from
> > us. The frequency of true positives and false positives is important
> > here. If the false positive ratio is very small, then reporting them is
> > not a problem because of the benefit we get from having spotted the true
> > positives. Some convicted murderers didn't do it, but that is not an
> > argument for letting them all go free (without knowing the details). So
> > we need to know what the false positive ratio is before we evaluate the
> > benefit of either reporting or non-reporting possible corruption events.
> > 
> > When do you think torn pages happen? Only at crash, or other times also?
> > Do they always happen at crash? Are there ways to re-check a block that
> > has suffered a hint-related torn page issue? Are there ways to isolate
> > and minimise the reporting of false positives? Those are important
> > questions and this is not black and white.
> > 
> > If the *only* answer really is we-must-WAL-log everything, then that is
> > the answer, as an option. I suspect that there is a less strict
> > possibility, if we question our assumptions and look at the frequencies.
> > 
> > We know that I have no time to work on this; I am just trying to hold
> > open the door to a few possibilities that we have not fully considered
> > in a balanced way. And I myself am guilty of having slammed the door
> > previously. I encourage development of a way forward based upon a
> > balance of utility.
> 
> I think the problem boils down to what the user response should be to a
> corruption report.  If it is a torn page, it would be corrected and the
> user doesn't have to do anything.  If it is something that is not
> correctable, then the user has corruption and/or bad hardware. 

> I think
> the problem is that the existing proposal can't distinguish between
> these two cases so the user has no idea how to respond to the report.

If 99.5% of cases are real corruption then there is little need to
distinguish between the cases, nor much value in doing so. The
prevalence of the different error types is critical to understanding how
to respond.

If a man pulls a gun on you, your first thought isn't "some people
remove guns from their jacket to polish them, so perhaps he intends to
polish it now" because the prevalence of shootings is high, when faced
by people with guns, and the risk of dying is also high. You make a
judgement based upon the prevalence and the risk. 

That is all I am asking for us to do here, make a balanced call. These
recent comments are a change in my own position, based upon evaluating
the prevalence and the risk. I ask others to consider the same line of
thought rather than a black/white assessment.

All useful detection mechanisms have non-zero false positives because we
would rather sometimes ring the bell for no reason than to let bad
things through silently, as we do now.

-- Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
Bruce Momjian
Date:
Simon Riggs wrote:
> > I think
> > the problem is that the existing proposal can't distinguish between
> > these two cases so the user has no idea how to respond to the report.
> 
> If 99.5% of cases are real corruption then there is little need to
> distinguish between the cases, nor much value in doing so. The
> prevalence of the different error types is critical to understanding how
> to respond.
> 
> If a man pulls a gun on you, your first thought isn't "some people
> remove guns from their jacket to polish them, so perhaps he intends to
> polish it now" because the prevalence of shootings is high, when faced
> by people with guns, and the risk of dying is also high. You make a
> judgement based upon the prevalence and the risk. 
> 
> That is all I am asking for us to do here, make a balanced call. These
> recent comments are a change in my own position, based upon evaluating
> the prevalence and the risk. I ask others to consider the same line of
> thought rather than a black/white assessment.
> 
> All useful detection mechanisms have non-zero false positives because we
> would rather sometimes ring the bell for no reason than to let bad
> things through silently, as we do now.

OK, but what happens if someone gets the failure report, assumes their
hardware is faulty and replaces it, and then gets a failure report
again?  I assume torn pages are 99% of the reported problem, which are
expected and are fixed, and bad hardware 1%, quite the opposite of your
numbers above.

What might be interesting is to report CRC mismatches if the database
was shut down cleanly previously;  I think in those cases we shouldn't
have torn pages.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
Robert Haas
Date:
On Mon, Nov 30, 2009 at 3:27 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Simon Riggs wrote:
>> Proposal
>>
>> * We reserve enough space on a disk block for a CRC check. When a dirty
>> block is written to disk we calculate and annotate the CRC value, though
>> this is *not* WAL logged.
>
> Imagine this:
> 1. A hint bit is set. It is not WAL-logged, but the page is dirtied.
> 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is
> calculated and stored on the page.
> 3. Half of the page is flushed to disk (aka torn page problem). The CRC
> made it to disk but the flipped hint bit didn't.
>
> You now have a page with incorrect CRC on disk.

This is probably a stupid question, but why doesn't the other half of
the page make it to disk?  Somebody pulls the plug first?

...Robert


Re: Block-level CRC checks

From
Bruce Momjian
Date:
Robert Haas wrote:
> On Mon, Nov 30, 2009 at 3:27 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
> > Simon Riggs wrote:
> >> Proposal
> >>
> >> * We reserve enough space on a disk block for a CRC check. When a dirty
> >> block is written to disk we calculate and annotate the CRC value, though
> >> this is *not* WAL logged.
> >
> > Imagine this:
> > 1. A hint bit is set. It is not WAL-logged, but the page is dirtied.
> > 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is
> > calculated and stored on the page.
> > 3. Half of the page is flushed to disk (aka torn page problem). The CRC
> > made it to disk but the flipped hint bit didn't.
> >
> > You now have a page with incorrect CRC on disk.
> 
> This is probably a stupid question, but why doesn't the other half of
> the page make it to disk?  Somebody pulls the plug first?

Yep, disk sectors are 512 bytes, so you might get only some of the
16 512-byte sectors of an 8kB page to disk, or a 512-byte sector might
be partially written.  Full page writes fix these on recovery.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
Simon Riggs
Date:
On Tue, 2009-12-01 at 07:05 -0500, Bruce Momjian wrote:

> I assume torn pages are 99% of the reported problem, which are
> expected and are fixed, and bad hardware 1%, quite the opposite of your
> numbers above.

On what basis do you make that assumption?

-- Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
Bruce Momjian
Date:
Simon Riggs wrote:
> On Tue, 2009-12-01 at 07:05 -0500, Bruce Momjian wrote:
> 
> > I assume torn pages are 99% of the reported problem, which are
> > expected and are fixed, and bad hardware 1%, quite the opposite of your
> > numbers above.
> 
> On what basis do you make that assumption?

Because we added full page write protection to fix the reported problem
of torn pages, which we had on occasion;  now we don't.  Bad hardware
reports are less frequent.

And we know we can reproduce torn pages by shutting off power to a server
without battery-backed cache.  We don't know how to produce I/O failures
on demand.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
Bruce Momjian
Date:
bruce wrote:
> What might be interesting is to report CRC mismatches if the database
> was shut down cleanly previously;  I think in those cases we shouldn't
> have torn pages.

Sorry, stupid idea on my part.  We don't WAL-log hint bit changes so
there is no guarantee the page is in WAL on recovery.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
Simon Riggs
Date:
On Tue, 2009-12-01 at 07:58 -0500, Bruce Momjian wrote:
> bruce wrote:
> > What might be interesting is to report CRC mismatches if the database
> > was shut down cleanly previously;  I think in those cases we shouldn't
> > have torn pages.
> 
> Sorry, stupid idea on my part.  We don't WAL-log hint bit changes so
> there is no guarantee the page is in WAL on recovery.

I thought it was a reasonable idea. We would need to re-check CRCs after
a crash and zero any that mismatched. Then we can start checking them
again as we run.

In any case, it seems strange to do nothing to protect the database in
normal running just because there is one type of problem that occurs
when we crash.

-- Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
Simon Riggs
Date:
On Tue, 2009-12-01 at 07:42 -0500, Bruce Momjian wrote:
> Simon Riggs wrote:
> > On Tue, 2009-12-01 at 07:05 -0500, Bruce Momjian wrote:
> > 
> > > I assume torn pages are 99% of the reported problem, which are
> > > expected and are fixed, and bad hardware 1%, quite the opposite of your
> > > numbers above.
> > 
> > On what basis do you make that assumption?
> 
> Because we added full page write protection to fix the reported problem
> of torn pages, which we had on occasion;  now we don't.  Bad hardware
> reports are less frequent.

Bad hardware reports are infrequent because we lack a detection system
for them, which is the topic of this thread. It would be circular to
argue that as a case against.

It's also an argument that only applies to crashes.

-- Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
Heikki Linnakangas
Date:
Bruce Momjian wrote:
> What might be interesting is to report CRC mismatches if the database
> was shut down cleanly previously;  I think in those cases we shouldn't
> have torn pages.

Unfortunately that's not true. You can crash, leading to a torn page,
and then start up the database and shut it down cleanly. The torn page
is still there, even though the last shutdown was a clean one.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Block-level CRC checks

From
marcin mank
Date:
On Mon, Nov 30, 2009 at 9:27 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Simon Riggs wrote:
>> Proposal
>>
>> * We reserve enough space on a disk block for a CRC check. When a dirty
>> block is written to disk we calculate and annotate the CRC value, though
>> this is *not* WAL logged.
>
> Imagine this:
> 1. A hint bit is set. It is not WAL-logged, but the page is dirtied.
> 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is
> calculated and stored on the page.
> 3. Half of the page is flushed to disk (aka torn page problem). The CRC
> made it to disk but the flipped hint bit didn't.
>
> You now have a page with incorrect CRC on disk.
>

What if we treated the hint bits as all-zeros for the purpose of CRC
calculation? This would exclude them from the checksum.


Greetings
Marcin Mańk


Re: Block-level CRC checks

From
Andres Freund
Date:
On Tuesday 01 December 2009 14:38:26 marcin mank wrote:
> On Mon, Nov 30, 2009 at 9:27 PM, Heikki Linnakangas
> 
> <heikki.linnakangas@enterprisedb.com> wrote:
> > Simon Riggs wrote:
> >> Proposal
> >>
> >> * We reserve enough space on a disk block for a CRC check. When a dirty
> >> block is written to disk we calculate and annotate the CRC value, though
> >> this is *not* WAL logged.
> >
> > Imagine this:
> > 1. A hint bit is set. It is not WAL-logged, but the page is dirtied.
> > 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is
> > calculated and stored on the page.
> > 3. Half of the page is flushed to disk (aka torn page problem). The CRC
> > made it to disk but the flipped hint bit didn't.
> >
> > You now have a page with incorrect CRC on disk.
> 
> What if we treated the hint bits as all-zeros for the purpose of CRC
> calculation? This would exclude them from the checksum.
That sounds like doing a complete copy of the wal page zeroing specific fields 
and then doing wal - rather expensive I would say. Both, during computing the 
checksum and checking it...

Andres


Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Andres Freund <andres@anarazel.de> [091201 08:42]:
> On Tuesday 01 December 2009 14:38:26 marcin mank wrote:
> > On Mon, Nov 30, 2009 at 9:27 PM, Heikki Linnakangas
> > 
> > <heikki.linnakangas@enterprisedb.com> wrote:
> > > Simon Riggs wrote:
> > >> Proposal
> > >>
> > >> * We reserve enough space on a disk block for a CRC check. When a dirty
> > >> block is written to disk we calculate and annotate the CRC value, though
> > >> this is *not* WAL logged.
> > >
> > > Imagine this:
> > > 1. A hint bit is set. It is not WAL-logged, but the page is dirtied.
> > > 2. The buffer is flushed out of the buffer cache to the OS. A new CRC is
> > > calculated and stored on the page.
> > > 3. Half of the page is flushed to disk (aka torn page problem). The CRC
> > > made it to disk but the flipped hint bit didn't.
> > >
> > > You now have a page with incorrect CRC on disk.
> > 
> > What if we treated the hint bits as all-zeros for the purpose of CRC
> > calculation? This would exclude them from the checksum.
> That sounds like doing a complete copy of the wal page zeroing specific fields 
> and then doing wal - rather expensive I would say. Both, during computing the 
> checksum and checking it...

No, it has nothing to do with WAL, it has to do with when writing
"pages" out... You already double-buffer them (to avoid the page
changing while you checksum it) before calling write, but the code
writing (and then reading) pages doesn't currently have to know all the
internal "stuff" needed decide what's a hint bit and what's not...

And adding that information into the buffer in/out would be a huge wart
on the modularity of the PG code...

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
Robert Haas
Date:
On Tue, Dec 1, 2009 at 8:30 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Bruce Momjian wrote:
>> What might be interesting is to report CRC mismatches if the database
>> was shut down cleanly previously;  I think in those cases we shouldn't
>> have torn pages.
>
> Unfortunately that's not true. You can crash, leading to a torn page,
> and then start up the database and shut it down cleanly. The torn page
> is still there, even though the last shutdown was a clean one.

Thinking through this, as I understand it, in order to prevent this
problem, you'd need to be able to predict at recovery time which pages
might have been torn by the unclean shutdown.  In order to do that,
you'd need to know which pages were waiting to be written to disk at
the time of the shutdown.  For ordinary page modifications, that's not
a problem, because there will be WAL records for those pages that need
to be replayed, and we could recompute the CRC at the same time.  But
for hint bit changes, there's no persistent state that would tell us
which hint bits were in the midst of being flipped when the system
went down, so the only way to make sure all the CRCs are correct would
be to rescan every page in the entire cluster and recompute every CRC.

Is that right?

...Robert


Re: Block-level CRC checks

From
Andres Freund
Date:
On Tuesday 01 December 2009 15:26:21 Aidan Van Dyk wrote:
> * Andres Freund <andres@anarazel.de> [091201 08:42]:
> > On Tuesday 01 December 2009 14:38:26 marcin mank wrote:
> > > On Mon, Nov 30, 2009 at 9:27 PM, Heikki Linnakangas
> > >
> > > <heikki.linnakangas@enterprisedb.com> wrote:
> > > > Simon Riggs wrote:
> > > >> Proposal
> > > >>
> > > >> * We reserve enough space on a disk block for a CRC check. When a
> > > >> dirty block is written to disk we calculate and annotate the CRC
> > > >> value, though this is *not* WAL logged.
> > > >
> > > > Imagine this:
> > > > 1. A hint bit is set. It is not WAL-logged, but the page is dirtied.
> > > > 2. The buffer is flushed out of the buffer cache to the OS. A new CRC
> > > > is calculated and stored on the page.
> > > > 3. Half of the page is flushed to disk (aka torn page problem). The
> > > > CRC made it to disk but the flipped hint bit didn't.
> > > >
> > > > You now have a page with incorrect CRC on disk.
> > >
> > > What if we treated the hint bits as all-zeros for the purpose of CRC
> > > calculation? This would exclude them from the checksum.
> >
> > That sounds like doing a complete copy of the wal page zeroing specific
> > fields and then doing wal - rather expensive I would say. Both, during
> > computing the checksum and checking it...

> No, it has nothing to do with WAL, it has to do with when writing
> "pages" out... You already double-buffer them (to avoid the page
> changing while you checksum it) before calling write, but the code
> writing (and then reading) pages doesn't currently have to know all the
> internal "stuff" needed decide what's a hint bit and what's not...
err, yes. That "WAL" slipped in, sorry. But it would still either mean a third 
copy of the page or a rather complex jumping around on the page...

Andres


Re: Block-level CRC checks

From
Heikki Linnakangas
Date:
Robert Haas wrote:
> On Tue, Dec 1, 2009 at 8:30 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> Bruce Momjian wrote:
>>> What might be interesting is to report CRC mismatches if the database
>>> was shut down cleanly previously;  I think in those cases we shouldn't
>>> have torn pages.
>> Unfortunately that's not true. You can crash, leading to a torn page,
>> and then start up the database and shut it down cleanly. The torn page
>> is still there, even though the last shutdown was a clean one.
> 
> Thinking through this, as I understand it, in order to prevent this
> problem, you'd need to be able to predict at recovery time which pages
> might have been torn by the unclean shutdown.  In order to do that,
> you'd need to know which pages were waiting to be written to disk at
> the time of the shutdown.  For ordinary page modifications, that's not
> a problem, because there will be WAL records for those pages that need
> to be replayed, and we could recompute the CRC at the same time.  But
> for hint bit changes, there's no persistent state that would tell us
> which hint bits were in the midst of being flipped when the system
> went down, so the only way to make sure all the CRCs are correct would
> be to rescan every page in the entire cluster and recompute every CRC.
> 
> Is that right?

Yep.

Even if rescanning every page in the cluster was feasible from a
performance point-of-view, it would make the CRC checking a lot less
useful. It's not hard to imagine that when a hardware glitch happens
causing corruption, it also causes the system to crash. Recalculating
the CRCs after crash would mask the corruption.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Block-level CRC checks

From
Simon Riggs
Date:
On Tue, 2009-12-01 at 15:30 +0200, Heikki Linnakangas wrote:
> Bruce Momjian wrote:
> > What might be interesting is to report CRC mismatches if the database
> > was shut down cleanly previously;  I think in those cases we shouldn't
> > have torn pages.
> 
> Unfortunately that's not true. You can crash, leading to a torn page,
> and then start up the database and shut it down cleanly. The torn page
> is still there, even though the last shutdown was a clean one.

There seem to be two ways forward: journalling or fsck.

We can either

* WAL-log all changes to a page (journalling) (8-byte overhead)

* After a crash, disable CRC checks until a full database scan has
either re-checked each CRC or, on finding a mismatch, reported it in the
LOG and then reset the CRC. (fsck) (8-byte overhead)

Both of which can be optimised in various ways.

Also, we might

* Put all hint bits in the block header to allow them to be excluded
more easily from CRC checking. If we used 3 more bits from
ItemIdData.lp_len (limiting tuple length to 4096) then we could store
some hints in the item pointer. HEAP_XMIN_INVALID can be stored as
LP_DEAD, since that will happen very quickly anyway. 

-- Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
Robert Haas
Date:
On Tue, Dec 1, 2009 at 9:40 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Even if rescanning every page in the cluster was feasible from a
> performance point-of-view, it would make the CRC checking a lot less
> useful. It's not hard to imagine that when a hardware glitch happens
> causing corruption, it also causes the system to crash. Recalculating
> the CRCs after crash would mask the corruption.

Yeah.  Thanks for the explanation - I think I understand the problem now.

...Robert


Re: Block-level CRC checks

From
Florian Weimer
Date:
* Simon Riggs:

> * Put all hint bits in the block header to allow them to be excluded
> more easily from CRC checking. If we used 3 more bits from
> ItemIdData.lp_len (limiting tuple length to 4096) then we could store
> some hints in the item pointer. HEAP_XMIN_INVALID can be stored as
> LP_DEAD, since that will happen very quickly anyway.

What about putting the whole visibility information out-of-line, into
its own B-tree, indexed by page number?

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: Block-level CRC checks

From
Tom Lane
Date:
Florian Weimer <fweimer@bfk.de> writes:
> What about putting the whole visibility information out-of-line, into
> its own B-tree, indexed by page number?

Hint bits need to be *cheap* to examine.  Otherwise there's little
point in having them at all.
        regards, tom lane


Re: Block-level CRC checks

From
Simon Riggs
Date:
On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote:

> It's not hard to imagine that when a hardware glitch happens
> causing corruption, it also causes the system to crash. Recalculating
> the CRCs after crash would mask the corruption.

They are already masked from us, so continuing to mask those errors
would not put us in a worse position.

If we are saying that 99% of page corruptions are caused at crash time
because of torn pages on hint bits, then only WAL logging can help us
find the 1%. I'm not convinced that is an accurate or safe assumption
and I'd at least like to see LOG entries showing what happened.

ISTM we could go for two levels of protection: CRC checks and a scanner
for Level 1 protection, then full WAL logging for Level 2 protection.

-- Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote:
>> It's not hard to imagine that when a hardware glitch happens
>> causing corruption, it also causes the system to crash. Recalculating
>> the CRCs after crash would mask the corruption.

> They are already masked from us, so continuing to mask those errors
> would not put us in a worse position.

No, it would just destroy a large part of the argument for why this
is worth doing.  "We detect disk errors ... except for ones that happen
during a database crash."  "Say what?"

The fundamental problem with this is the same as it's been all along:
the tradeoff between implementation work expended, performance overhead
added, and net number of real problems detected (with a suitably large
demerit for actually *introducing* problems) just doesn't look
attractive.  You can make various compromises that improve one or two of
these factors at the cost of making the others worse, but at the end of
the day I've still not seen a combination that seems worth doing.
        regards, tom lane


Re: Block-level CRC checks

From
Robert Haas
Date:
On Tue, Dec 1, 2009 at 10:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote:
>
>> It's not hard to imagine that when a hardware glitch happens
>> causing corruption, it also causes the system to crash. Recalculating
>> the CRCs after crash would mask the corruption.
>
> They are already masked from us, so continuing to mask those errors
> would not put us in a worse position.
>
> If we are saying that 99% of page corruptions are caused at crash time
> because of torn pages on hint bits, then only WAL logging can help us
> find the 1%. I'm not convinced that is an accurate or safe assumption
> and I'd at least like to see LOG entries showing what happened.

It may or may not be true that most page corruptions happen at crash
time, but it's certainly false that they are caused at crash time
*because of torn pages on hint bits*.   If only part of a block is
written to disk and the unwritten parts contain hint-bit changes -
that's not corruption.  That's design behavior.  Any CRC system needs
to avoid complaining about errors when that happens because otherwise
people will think that their database is corrupted and their hardware
is faulty when in reality it is not.

If we could find a way to put the hint bits in the same 512-byte block
as the CRC, that might do it, but I'm not sure whether that is
possible.

Ignoring CRC errors after a crash until we've re-CRC'd the entire
database will certainly eliminate the bogus error reports, but it
seems likely to mask a large percentage of legitimate errors.  For
example, suppose that I write 1MB of data out to disk and then don't
access it for a year.   During that time the data is corrupted.  Then
the system crashes.  Upon recovery, since there's no way of knowing
whether hint bits on those pages were being updated at the time of the
crash, the system re-CRC's the corrupted data and declares it known
good.  Six months later, I try to access the data and find out that
it's bad.  Sucks to be me.

Now consider the following alternative scenario: I write the block to
disk.  Five minutes later, without an intervening crash, I read it
back in and it's bad.  Yeah, the system detects it.

Which is more likely?  I'm not an expert on disk failure modes, but my
intuition is that the first one will happen often enough to make us
look silly.  Is it 10%?  20%?  50%?  I don't know.  But ISTM that a
CRC system that has no ability to determine whether a system is still
"ok" post-crash is not a compelling proposition, even though it might
still be able to detect some problems.

...Robert


Re: Block-level CRC checks

From
"Joshua D. Drake"
Date:
On Tue, 2009-12-01 at 07:05 -0500, Bruce Momjian wrote:
>
> > All useful detection mechanisms have non-zero false positives because we
> > would rather sometimes ring the bell for no reason than to let bad
> > things through silently, as we do now.
>
> OK, but what happens if someone gets the failure report, assumes their
> hardware is faulty and replaces it, and then gets a failure report
> again?

They are stupid? Nobody just replaces hardware. You test it.

We can't fix stupid.

Joshua D. Drake

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - Salamander

Re: Block-level CRC checks

From
Simon Riggs
Date:
On Tue, 2009-12-01 at 10:55 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote:
> >> It's not hard to imagine that when a hardware glitch happens
> >> causing corruption, it also causes the system to crash. Recalculating
> >> the CRCs after crash would mask the corruption.
> 
> > They are already masked from us, so continuing to mask those errors
> > would not put us in a worse position.
> 
> No, it would just destroy a large part of the argument for why this
> is worth doing.  "We detect disk errors ... except for ones that happen
> during a database crash."  "Say what?"

I know what I said sounds ridiculous, I'm just trying to keep my mind
open about the tradeoffs. The way to detect 100% of corruptions is to
WAL-log 100% of writes to blocks and we know that sucks performance -
twas me that said it in the original discussion. I'm trying to explore
whether we can detect <100% of other errors at some intermediate
percentage of WAL-logging. If we decide that there isn't an intermediate
position worth taking, I'm happy, as long as it was a fact-based decision.

> The fundamental problem with this is the same as it's been all along:
> the tradeoff between implementation work expended, performance overhead
> added, and net number of real problems detected (with a suitably large
> demerit for actually *introducing* problems) just doesn't look
> attractive.  You can make various compromises that improve one or two of
> these factors at the cost of making the others worse, but at the end of
> the day I've still not seen a combination that seems worth doing.

I agree. But also I do believe there are people that care enough about
this to absorb a performance hit, and the new features in 8.5 will bring
in a new crop of people that care about those things very much.

-- Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
"Joshua D. Drake"
Date:
On Tue, 2009-12-01 at 10:55 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote:
> >> It's not hard to imagine that when a hardware glitch happens
> >> causing corruption, it also causes the system to crash. Recalculating
> >> the CRCs after crash would mask the corruption.
>
> > They are already masked from us, so continuing to mask those errors
> > would not put us in a worse position.
>
> No, it would just destroy a large part of the argument for why this
> is worth doing.  "We detect disk errors ... except for ones that happen
> during a database crash."  "Say what?"
>
> The fundamental problem with this is the same as it's been all along:
> the tradeoff between implementation work expended, performance overhead
> added, and net number of real problems detected (with a suitably large
> demerit for actually *introducing* problems) just doesn't look
> attractive.  You can make various compromises that improve one or two of
> these factors at the cost of making the others worse, but at the end of
> the day I've still not seen a combination that seems worth doing.

Let me try a different but similar perspective. The problem we are
trying to solve here only matters to a very small subset of the people
actually using PostgreSQL. Specifically, a percentage that is using
PostgreSQL in a situation where they can lose many thousands of dollars
per minute or hour should an outage occur.

On the other hand it is those very people that are *paying* people to
try and implement these features. Kind of a catch-22.

The hard core reality is this. *IF* it is one of the goals of this
project to insure that the software can be safely, effectively, and
responsibly operated in a manner that is acceptable to C* level people
in a Fortune level company then we *must* solve this problem.

If it is not the goal of the project, leave it to EDB/CMD/2ndQuadrant
to fork it because it will eventually happen. Our customers are
demanding these features.

Sincerely,

Joshua D. Drake


--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - Salamander

Re: Block-level CRC checks

From
Bruce Momjian
Date:
Simon Riggs wrote:
> Also, we might
> 
> * Put all hint bits in the block header to allow them to be excluded
> more easily from CRC checking. If we used 3 more bits from
> ItemIdData.lp_len (limiting tuple length to 4096) then we could store
> some hints in the item pointer. HEAP_XMIN_INVALID can be stored as
> LP_DEAD, since that will happen very quickly anyway. 

OK, here is another idea, maybe crazy:

When we read in a page that has an invalid CRC, we check the page to see
which hint bits are _not_ set, and we try setting them to see if we can
get a matching CRC.  If there are no missing hint bits and the CRC
doesn't match, we know the page is corrupt.  If two hint bits are
missing, we can try setting one or both of them and see if we can get a
matching CRC.  If we can, the page is OK; if not, it is corrupt.

Now if 32 hint bits are missing, but could be based on transaction
status, then we would need 2^32 possible hint bit combinations, so we
can't do the test and we just assume the page is valid.

I have no idea what percentage of corruption this would detect, but it
might have minimal overhead because the overhead only happens when we
detect a non-matching CRC due to a crash of some sort.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
Robert Haas
Date:
On Tue, Dec 1, 2009 at 1:02 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> The hard core reality is this. *IF* it is one of the goals of this
> project to insure that the software can be safely, effectively, and
> responsibly operated in a manner that is acceptable to C* level people
> in a Fortune level company then we *must* solve this problem.
>
> If it is not the goal of the project, leave it to EDB/CMD/2ndQuadrant
> to fork it because it will eventually happen. Our customers are
> demanding these features.

OK, and when you fork it, how do you plan to implement it?  The
problem AFAICS is not that anyone hugely dislikes the feature; it's
that nobody is really clear on how to implement it in a way that's
actually useful.

So far the only somewhat reasonable suggestions I've seen seem to be:

1. WAL-log setting the hint bits.  If you don't like the resulting
performance, shut off the feature.

2. Rearrange the page so that all the hint bits are in the first 512
bytes along with the CRC, so that there can be no torn pages.  AFAICS,
no one has rendered judgment on whether this is a feasible solution.

Does $COMPETITOR offer this feature?

...Robert


Re: Block-level CRC checks

From
Simon Riggs
Date:
On Tue, 2009-12-01 at 13:05 -0500, Bruce Momjian wrote:
> Simon Riggs wrote:
> > Also, we might
> > 
> > * Put all hint bits in the block header to allow them to be excluded
> > more easily from CRC checking. If we used 3 more bits from
> > ItemIdData.lp_len (limiting tuple length to 4096) then we could store
> > some hints in the item pointer. HEAP_XMIN_INVALID can be stored as
> > LP_DEAD, since that will happen very quickly anyway. 
> 
> OK, here is another idea, maybe crazy:

When there's nothing else left, crazy wins.

> When we read in a page that has an invalid CRC, we check the page to see
> which hint bits are _not_ set, and we try setting them to see if we can
> get a matching CRC.  If there are no missing hint bits and the CRC
> doesn't match, we know the page is corrupt.  If two hint bits are
> missing, we can try setting one or both of them and see if we can get a
> matching CRC.  If we can, the page is OK; if not, it is corrupt.
> 
> Now if 32 hint bits are missing, but could be based on transaction
> status, then we would need 2^32 possible hint bit combinations, so we
> can't do the test and we just assume the page is valid.
> 
> I have no idea what percentage of corruption this would detect, but it
> might have minimal overhead because the overhead only happens when we
> detect a non-matching CRC due to a crash of some sort.

Perhaps we could store a sector-based parity bit for each 512 bytes in
the block: set the parity bit if an even number of hint bits is set in
that sector, unset it if odd. So whenever we set a hint bit we flip the
parity bit for that sector. That way we could detect which sectors are
potentially missing hint-bit updates, in an effort to minimize the
number of combinations we need to test. That would require only 16 bits
for an 8192-byte block; we store them next to the CRC, so we know they
were never altered separately. So 6 bytes of total overhead.

-- Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
"Joshua D. Drake"
Date:
On Tue, 2009-12-01 at 13:20 -0500, Robert Haas wrote:
> On Tue, Dec 1, 2009 at 1:02 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> > The hard core reality is this. *IF* it is one of the goals of this
> > project to insure that the software can be safely, effectively, and
> > responsibly operated in a manner that is acceptable to C* level people
> > in a Fortune level company then we *must* solve this problem.
> >
> > If it is not the goal of the project, leave it to EDB/CMD/2ndQuadrant
> > to fork it because it will eventually happen. Our customers are
> > demanding these features.
>
> OK, and when you fork it, how do you plan to implement it?

Hey man, I am not an engineer :P. You know that. I am just speaking to
the pressures that some of us are facing in the marketplace about these
types of features.

> no one has rendered judgment on whether this is a feasible solution.
>
> Does $COMPETITOR offer this feature?
>

My understanding is that MSSQL does. I am not sure about Oracle. Those
are the only two I run into (I don't run into MySQL at all). I know
others likely compete in the DB2 space.

Sincerely,

Joshua D. Drake



--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - Salamander

Re: Block-level CRC checks

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> OK, here is another idea, maybe crazy:

> When we read in a page that has an invalid CRC, we check the page to see
> which hint bits are _not_ set, and we try setting them to see if we can
> get a matching CRC.  If there are no missing hint bits and the CRC
> doesn't match, we know the page is corrupt.  If two hint bits are
> missing, we can try setting one or both of them and see if we can get a
> matching CRC.  If we can, the page is OK; if not, it is corrupt.

> Now if 32 hint bits are missing, but could be based on transaction
> status, then we would need 2^32 possible hint bit combinations, so we
> can't do the test and we just assume the page is valid.

A typical page is going to have something like 100 tuples, so
potentially 2^400 combinations to try.  I don't see this being
realistic from that standpoint.  What's much worse is that to even
find the potentially missing hint bits, you need to make very strong
assumptions about the validity of the rest of the page.

The suggestions that were made upthread about moving the hint bits
could resolve the second objection, but once you do that you might
as well just exclude them from the CRC and eliminate the guessing.
        regards, tom lane


Re: Block-level CRC checks

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Tue, 2009-12-01 at 13:05 -0500, Bruce Momjian wrote:
>> When we read in a page that has an invalid CRC, we check the page to see
>> which hint bits are _not_ set, and we try setting them to see if we can
>> get a matching CRC.

> Perhaps we could store a sector-based parity bit for each 512 bytes in
> the block: set the parity bit if an even number of hint bits is set in
> that sector, unset it if odd. So whenever we set a hint bit we flip the
> parity bit for that sector. That way we could detect which sectors are
> potentially missing hint-bit updates, in an effort to minimize the
> number of combinations we need to test.

Actually, the killer problem with *any* scheme involving "guessing"
is that each bit you guess translates directly to removing one bit
of confidence from the CRC value.  If you try to guess at as many
as 32 bits, it is practically guaranteed that you will find a
combination that makes a 32-bit CRC appear to match.  Well before
that, you have degraded the reliability of the error detection to
the point that there's no point.

The bottom line here seems to be that the only practical way to do
anything like this is to move the hint bits into their own area of
the page, and then exclude them from the CRC.  Are we prepared to
once again blow off any hope of in-place update for another release
cycle?
        regards, tom lane


Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Tom Lane <tgl@sss.pgh.pa.us> [091201 13:58]:

> Actually, the killer problem with *any* scheme involving "guessing"
> is that each bit you guess translates directly to removing one bit
> of confidence from the CRC value.  If you try to guess at as many
> as 32 bits, it is practically guaranteed that you will find a
> combination that makes a 32-bit CRC appear to match.  Well before
> that, you have degraded the reliability of the error detection to
> the point that there's no point.

Exactly.

> The bottom line here seems to be that the only practical way to do
> anything like this is to move the hint bits into their own area of
> the page, and then exclude them from the CRC.  Are we prepared to
> once again blow off any hope of in-place update for another release
> cycle?

Well, *I* think if we're ever going to have really reliable "in-place
upgrades" that we can expect to function release after release, we're
going to need to be able to read in "old version" pages, and convert
them to current version pages, for some set of "old version" (I'd be
happy with $VERSION-1)...  But I don't see that happening any time
soon...

But I'm not loading TB of data either; my largest clusters are a couple
of gigs, so I acknowledge my priorities are probably quite different
then some of the companies driving a lot of the heavy development.

a.
-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Block-level CRC checks

From
Greg Stark
Date:
On Tue, Dec 1, 2009 at 6:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> OK, here is another idea, maybe crazy:
>
>> When we read in a page that has an invalid CRC, we check the page to see
>> which hint bits are _not_ set, and we try setting them to see if we can
>> get a matching CRC.

Unfortunately you would also have to try *unsetting* every hint bit as
well, since the updated hint bits might have made it to disk but not
the CRC, leaving the old CRC for the block with the unset bits.

I actually independently had the same thought today that Simon had of
moving the hint bits to the line pointer. We can obtain more free bits
in the line pointers by dividing the item offsets and sizes by
maxalign if we need it. That should give at least 4 spare bits which
is all we need for the four VALID/INVALID hint bits.

It should be relatively cheap to skip the hint bits in the line
pointers since they'll be the same bits of every 16-bit value for a
whole range. Alternatively we could just CRC the tuples and assume a
corrupted line pointer will show itself quickly. That would actually
make it faster than a straight CRC of the whole block -- making
lemonade out of lemons as it were.

There's still the all-tuples-in-page-are-visible hint bit and the hint
bits in btree pages. I'm not sure if those are easier or harder to
solve. We might be able to assume the all-visible flag will not be
torn away from the CRC as long as they're within the same 512-byte sector.
And iirc the btree hint bits are in the line pointers themselves as
well?

Another thought is that would could use the MSSQL-style torn page
detection of including a counter (or even a bit?) in every 512-byte
chunk which gets incremented every time the page is written. If they
don't all match when read in then the page was torn and we can't check
the CRC. That gets us the advantage that we can inform the user that a
torn page was detected so they know that they must absolutely use
full_page_writes on their system. Currently users are in the dark
whether their system is susceptible to them or not and have now idea
with what frequency. Even here there are quite divergent opinions
about their frequency and which systems are susceptible to them or
immune.

-- 
greg


Re: Block-level CRC checks

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Another thought is that would could use the MSSQL-style torn page
> detection of including a counter (or even a bit?) in every 512-byte
> chunk which gets incremented every time the page is written.

I think we can dismiss that idea, or any idea involving a per-page
status value, out of hand.  The implications for tuple layout are
just too messy.  I'm not especially thrilled with making assumptions
about the underlying device block size anyway.
        regards, tom lane


Re: Block-level CRC checks

From
Josh Berkus
Date:
All,

I feel strongly that we should be verifying pages on write, or at least
providing the option to do so, because hardware is simply not reliable.
And a lot of our biggest users are having issues; it seems pretty much
guaranteed that if you have more than 20 postgres servers, at least one
of them will have bad memory, bad RAID and/or a bad driver.

(and yes, InnoDB, DB2 and Oracle do provide tools to detect hardware
corruption when it happens. Oracle even provides correction tools.  We
are *way* behind them in this regard)

There are two primary conditions we are testing for:

(a) bad RAM, which happens as frequently as 8% of the time on commodity
servers, and given a sufficient amount of RAM happens 99% of the time
due to quantum effects, and
(b) bad I/O, in the form of bad drivers, bad RAID, and/or bad disks.

Our users want to potentially take two degrees of action on this:

1. detect the corruption immediately when it happens, so that they can
effectively troubleshoot the cause of the corruption, and potentially
shut down the database before further corruption occurs and while they
still have clean backups.

2. make an attempt to fix the corrupted page before/immediately after it
is written.

Further, based on talking to some of these users who are having chronic
and not-debuggable issues on their sets of 100's of PostgreSQL servers,
there are some other specs:

-- Many users would be willing to sacrifice significant performance (up
to 20%) as a start-time option in order to be "corruption-proof".
-- Even more users would only be interested in using the anti-corruption
options after they know they have a problem, in order to troubleshoot
it, and would then turn the corruption detection back off.

So, based on my conversations with users, what we really want is a
solution which does (1) for both (a) and (b) as a start-time option, and
having significant performance overhead for this option is OK.

Now, does block-level CRCs qualify?

The problem I have with CRC checks is that they only detect bad I/O, and
are completely unable to detect data corruption due to bad memory.  This
means that really we want a different solution which can detect both bad
RAM and bad I/O, and should only fall back on CRC checks if we're unable
to devise one.

One of the things Simon and I talked about in Japan is that most of the
time, data corruption makes the data page and/or tuple unreadable.  So,
checking data format for readable pages and tuples (and index nodes)
both before and after write to disk (the latter would presumably be
handled by the bgwriter and/or checkpointer) would catch a lot of kinds
of corruption before they had a chance to spread.

However, that solution would not detect subtle corruption, like
single-bit-flipping issues caused by quantum errors.  Also, it would
require reading back each page as it's written to disk, which is OK for
a bunch of single-row writes, but is a significant problem for bulk data
loads.

So, what I'm saying is that I think we really want a better solution,
and am throwing this out there to see if anyone is clever enough.

--Josh Berkus







Re: Block-level CRC checks

From
"Kevin Grittner"
Date:
Josh Berkus <josh@agliodbs.com> wrote:

> And a lot of our biggest users are having issues; it seems pretty
> much guaranteed that if you have more than 20 postgres servers, at
> least one of them will have bad memory, bad RAID and/or a bad
> driver.

Huh?!?  We have about 200 clusters running on about 100 boxes, and
we see that very rarely.  On about 100 older boxes, relegated to
less critical tasks, we see a failure maybe three or four times per
year.  It's usually not subtle, and a sane backup and redundant
server policy has kept us from suffering much pain from these.  I'm
not questioning the value of adding features to detect corruption,
but your numbers are hard to believe.

> The problem I have with CRC checks is that they only detect bad
> I/O, and are completely unable to detect data corruption due to
> bad memory.  This means that really we want a different solution
> which can detect both bad RAM and bad I/O, and should only fall
> back on CRC checks if we're unable to devise one.

md5sum of each tuple?  As an optional system column (a la oid)?

> checking data format for readable pages and tuples (and index
> nodes) both before and after write to disk

Given that PostgreSQL goes through the OS, and many of us are using
RAID controllers with BBU RAM, how do you do a read with any
confidence that it came from the disk?  (I mean, I know how to do
that for a performance test, but as a routine step during production
use?)

-Kevin


Re: Block-level CRC checks

From
Greg Stark
Date:
On Tue, Dec 1, 2009 at 7:19 PM, Josh Berkus <josh@agliodbs.com> wrote:
> However, that solution would not detect subtle corruption, like
> single-bit-flipping issues caused by quantum errors.

Well there is a solution for this, ECC RAM. There's *no* software
solution for it. The corruption can just as easily happen the moment
you write the value before you calculate any checksum or in the
register holding the value before you even write it. Or it could occur
the moment after you finish checking the checksum. Also you're not
going to be able to be sure you're checking the actual DRAM and not
the L2 cache or the processor's L1/L0 caches.

ECC RAM solves this problem properly and it does work. There's not
much point in paying a much bigger cost for an ineffective solution.

> Also, it would
> require reading back each page as it's written to disk, which is OK for
> a bunch of single-row writes, but for bulk data loads a significant problem.

Not sure what that really means for Postgres. It would just mean
reading back the same page of memory from the filesystem cache that we
just wrote.

It sounds like you're describing fsyncing every single page to disk
and then waiting 1min/7200 or even 1min/15k to do a direct read for every
single page -- that's not a 20% performance hit though. We would have
to change our mascot from the elephant to a snail I think.

You could imagine a more complex solution where you have a separate
process wait until the next checkpoint then do direct reads for all
the blocks written since the previous checkpoint (which have now been
fsynced) and verify that the block on disk has a verifiable CRC. I'm
not sure even direct reads let you get the block on disk if someone
else has written the block into cache though. If you could then this
sounds like it could be made to work efficiently (with sequential
bitmap-style scans) and could be quite handy. What I like about that
is you could deprioritize this process's i/o so that it didn't impact
the main processing. As things stand this wouldn't detect pages
written because they were dirtied by hint bit updates, but that could
be addressed in a few different ways.

-- 
greg


Re: Block-level CRC checks

From
Robert Haas
Date:
On Tue, Dec 1, 2009 at 2:06 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
> Well, *I* think if we're ever going to have really reliable "in-place
> upgrades" that we can expect to function release after release, we're
> going to need to be able to read in "old version" pages, and convert
>> them to current version pages, for some set of "old version" (I'd be
> happy with $VERSION-1)...  But I don't see that happening any time
> soon...

I agree.  I've attempted to make this point before - as has Zdenek -
and been scorned for it, but I don't think it's become any less true
for all of that.  I don't think you have to look much further than the
limitations on upgrading from 8.3 to 8.4 to conclude that the current
strategy is always going to be pretty hit or miss.

http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/~checkout~/pg-migrator/pg_migrator/README?rev=1.59&content-type=text/plain

...Robert


Re: Block-level CRC checks

From
Greg Stark
Date:
On Tue, Dec 1, 2009 at 7:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Dec 1, 2009 at 2:06 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
>> Well, *I* think if we're ever going to have really reliable "in-place
>> upgrades" that we can expect to function release after release, we're
>> going to need to be able to read in "old version" pages, and convert
>> them to current version pages, for some set of "old version" (I'd be
>> happy with $VERSION-1)...  But I don't see that happening any time
>> soon...
>
> I agree.  I've attempted to make this point before - as has Zdenek -
> and been scorned for it, but I don't think it's become any less true
> for all of that.  I don't think you have to look much further than the
> limitations on upgrading from 8.3 to 8.4 to conclude that the current
> strategy is always going to be pretty hit or miss.

I find that hard to understand. I believe the consensus is that an
on-demand page-level migration strategy like Aidan described is
precisely the plan when it's necessary to handle page format changes.
There were no page format changes for 8.3->8.4, however, so there's no
point writing dead code until it actually has anything to do. And
there was no point writing it for previous releases because there
was pg_migrator anyways. Zdenek's plan was basically the same, but he
wanted the backend to be able to handle pages of any version directly,
without conversion, at any time.

Pointing at the 8.3 pg_migrator limitations is irrelevant -- not a
single one of those limitations would be addressed by a page-level
migration code path. They are all data-type redefinitions that can't
be fixed without understanding the table structure and definition.
These limitations would all require adding code to the new version to
handle the old data types and their behaviour and to convert them to
the new datatypes when a tuple is rewritten. In some cases this is
really not easy at all.

--
greg


Re: Block-level CRC checks

From
Greg Stark
Date:
On Tue, Dec 1, 2009 at 8:04 PM, Greg Stark <gsstark@mit.edu> wrote:
> And there was no point writing it for previous releases because there
> was **no** pg_migrator anyways.

oops

-- 
greg


Re: Block-level CRC checks

From
Robert Haas
Date:
On Tue, Dec 1, 2009 at 3:04 PM, Greg Stark <gsstark@mit.edu> wrote:
> I find that hard to understand. I believe the consensus is that an
> on-demand page-level migration strategy like Aidan described is
> precisely the plan when it's necessary to handle page format changes.
> There were no page format changes for 8.3->8.4, however, so there's no
> point writing dead code until it actually has anything to do. And
> there was no point writing it for previous releases because there
> was pg_migrator anyways. Zdenek's plan was basically the same, but he
> wanted the backend to be able to handle pages of any version directly,
> without conversion, at any time.
>
> Pointing at the 8.3 pg_migrator limitations is irrelevant -- not a
> single one of those limitations would be addressed by a page-level
> migration code path. They are all data-type redefinitions that can't
> be fixed without understanding the table structure and definition.
> These limitations would all require adding code to the new version to
> handle the old data types and their behaviour and to convert them to
> the new datatypes when a tuple is rewritten. In some cases this is
> really not easy at all.

OK, fair enough.  My implication that only page formats were at issue
was off-base.  My underlying point was that I think we have to be
prepared to write code that can understand old binary formats (on the
tuple, page, or relation level) if we want this to work and work
reliably.  I believe that there has been much resistance to that idea.
If I am wrong, great!

...Robert


Re: Block-level CRC checks

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> OK, fair enough.  My implication that only page formats were at issue
> was off-base.  My underlying point was that I think we have to be
> prepared to write code that can understand old binary formats (on the
> tuple, page, or relation level) if we want this to work and work
> reliably.  I believe that there has been much resistance to that idea.

We keep looking for cheaper alternatives.  There may not be any...
        regards, tom lane


Re: Block-level CRC checks

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>   
>> OK, fair enough.  My implication that only page formats were at issue
>> was off-base.  My underlying point was that I think we have to be
>> prepared to write code that can understand old binary formats (on the
>> tuple, page, or relation level) if we want this to work and work
>> reliably.  I believe that there has been much resistance to that idea.
>>     
>
> We keep looking for cheaper alternatives.  There may not be any...
>
>             
>   

Yeah. I think we might need to bite the bullet on that and start 
thinking more about different strategies for handling page versioning to 
satisfy various needs. I've been convinced for a while that some sort of 
versioning scheme is inevitable, but I do understand the reluctance.

cheers

andrew


Re: Block-level CRC checks

From
Bruce Momjian
Date:
Andrew Dunstan wrote:
> 
> 
> Tom Lane wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >   
> >> OK, fair enough.  My implication that only page formats were at issue
> >> was off-base.  My underlying point was that I think we have to be
> >> prepared to write code that can understand old binary formats (on the
> >> tuple, page, or relation level) if we want this to work and work
> >> reliably.  I believe that there has been much resistance to that idea.
> >>     
> >
> > We keep looking for cheaper alternatives.  There may not be any...
> >
> >             
> >   
> 
> Yeah. I think we might need to bite the bullet on that and start 
> thinking more about different strategies for handling page versioning to 
> satisfy various needs. I've been convinced for a while that some sort of 
> versioning scheme is inevitable, but I do understand the reluctance.

I always felt our final solution would be a combination of pg_migrator
for system catalog changes and page format conversion for page changes.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > OK, here is another idea, maybe crazy:
> 
> > When we read in a page that has an invalid CRC, we check the page to see
> > which hint bits are _not_ set, and we try setting them to see if can get
> > a matching CRC.  If there no missing hint bits and the CRC doesn't
> > match, we know the page is corrupt.  If two hint bits are missing, we
> > can try setting one and both of them and see if can get a matching CRC. 
> > If we can, the page is OK, if not, it is corrupt.
> 
> > Now if 32 hint bits are missing, but could be set based on transaction
> > status, then we would need to test 2^32 possible hint bit combinations,
> > so we can't do the test and we just assume the page is valid.
> 
> A typical page is going to have something like 100 tuples, so
> potentially 2^400 combinations to try.  I don't see this being
> realistic from that standpoint.  What's much worse is that to even
> find the potentially missing hint bits, you need to make very strong
> assumptions about the validity of the rest of the page.
> 
> The suggestions that were made upthread about moving the hint bits
> could resolve the second objection, but once you do that you might
> as well just exclude them from the CRC and eliminate the guessing.

OK, crazy idea #3.  What if we had a per-page counter of the number of
hint bits set --- that way, we would only consider a CRC check failure
to be corruption if the count matched the hint bit count on the page.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> OK, crazy idea #3.  What if we had a per-page counter of the number of
> hint bits set --- that way, we would only consider a CRC check failure
> to be corruption if the count matched the hint bit count on the page.

Seems like rather a large hole in the ability to detect corruption.
In particular, this again assumes that you can accurately locate all
the hint bits in a page whose condition is questionable.  Pick up the
wrong bits, you'll come to the wrong conclusion --- and the default
behavior you propose here is the wrong result.
        regards, tom lane


Re: Block-level CRC checks

From
Richard Huxton
Date:
Bruce Momjian wrote:
> Tom Lane wrote:
>>
>> The suggestions that were made upthread about moving the hint bits
>> could resolve the second objection, but once you do that you might
>> as well just exclude them from the CRC and eliminate the guessing.
> 
> OK, crazy idea #3.  What if we had a per-page counter of the number of
> hint bits set --- that way, we would only consider a CRC check failure
> to be corruption if the count matched the hint bit count on the page.

Can I piggy-back on Bruce's crazy idea and ask a stupid question?

Why are we writing out the hint bits to disk anyway? Is it really so
slow to calculate them on read + cache them that it's worth all this
trouble? Are they not also to blame for the "write my import data twice"
feature?

--
  Richard Huxton
  Archonet Ltd


Page-level version upgrade (was: Block-level CRC checks)

From
decibel
Date:
On Dec 1, 2009, at 12:58 PM, Tom Lane wrote:
> The bottom line here seems to be that the only practical way to do
> anything like this is to move the hint bits into their own area of
> the page, and then exclude them from the CRC.  Are we prepared to
> once again blow off any hope of in-place update for another release
> cycle?


What happened to the work that was being done to allow a page to be  
upgraded on the fly when it was read in from disk?
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: Block-level CRC checks

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > OK, crazy idea #3.  What if we had a per-page counter of the number of
> > hint bits set --- that way, we would only consider a CRC check failure
> > to be corruption if the count matched the hint bit count on the page.
> 
> Seems like rather a large hole in the ability to detect corruption.
> In particular, this again assumes that you can accurately locate all
> the hint bits in a page whose condition is questionable.  Pick up the
> wrong bits, you'll come to the wrong conclusion --- and the default
> behavior you propose here is the wrong result.

I was assuming any update of hint bits would update the per-page counter
so it would always be accurate.  However, I seem to remember we don't
lock the page when updating hint bits, so that wouldn't work.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
Greg Stark
Date:
On Tue, Dec 1, 2009 at 9:57 PM, Richard Huxton <dev@archonet.com> wrote:
> Why are we writing out the hint bits to disk anyway? Is it really so
> slow to calculate them on read + cache them that it's worth all this
> trouble? Are they not also to blame for the "write my import data twice"
> feature?

It would be interesting to experiment with different strategies. But
the results would depend a lot on workloads and I doubt one strategy
is best for everyone.

It has often been suggested that we could set the hint bits but not
dirty the page, so they would never be written out unless some other
update hit the page. In most use cases that would probably result in
the right thing happening where we avoid half the writes but still
stop doing transaction status lookups relatively promptly. The scary
thing is that there might be use cases such as static data loaded
where the hint bits never get set and every scan of the page has to
recheck those statuses until the tuples are frozen.

(Not dirtying the page almost gets us out of the CRC problems -- it
doesn't in our current setup because we don't take a lock when setting
the hint bits, so you could set it on a page someone is in the middle
of CRC checking and writing. There were other solutions proposed for
that, including just making hint bits require locking the page or
double buffering the write.)
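
The double-buffering variant is cheap to sketch: snapshot the page into
a private buffer, checksum the snapshot, and write the snapshot, so a
concurrently-set hint bit can never make the written bytes disagree
with the CRC we computed (names invented, zlib's crc32() standing in
for whatever checksum we'd actually use):

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <zlib.h>

#define BLCKSZ 8192

/* Copy the shared page, checksum the copy, write the copy.  Hint-bit
 * setters may still scribble on the shared page meanwhile, but the
 * bytes that reach disk are guaranteed to match *crc_out. */
static ssize_t
write_page_double_buffered(int fd, off_t offset,
                           const unsigned char *shared_page,
                           uint32_t *crc_out)
{
    unsigned char copy[BLCKSZ];

    memcpy(copy, shared_page, BLCKSZ);
    *crc_out = (uint32_t) crc32(0L, copy, BLCKSZ);
    return pwrite(fd, copy, BLCKSZ, offset);
}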

There does need to be something like the hint bits which does
eventually have to be set because we can't keep transaction
information around forever. Even if you keep the transaction
information all the way back to the last freeze date (up to about 1GB
and change I think) then the data has to be written twice, the second
time is to freeze the transactions. In the worst case then reading a
page requires a random page access (or two) from anywhere in that 1GB+
file for each tuple on the page (whether visible to us or not).
-- 
greg


Re: Block-level CRC checks

From
decibel
Date:
On Dec 1, 2009, at 1:39 PM, Kevin Grittner wrote:
> Josh Berkus <josh@agliodbs.com> wrote:
>
>> And a lot of our biggest users are having issues; it seems pretty
>> much guaranteed that if you have more than 20 postgres servers, at
>> least one of them will have bad memory, bad RAID and/or a bad
>> driver.
>
> Huh?!?  We have about 200 clusters running on about 100 boxes, and
> we see that very rarely.  On about 100 older boxes, relegated to
> less critical tasks, we see a failure maybe three or four times per
> year.  It's usually not subtle, and a sane backup and redundant
> server policy has kept us from suffering much pain from these.  I'm
> not questioning the value of adding features to detect corruption,
> but your numbers are hard to believe.

That's just your experience. Others have had different experiences.

And honestly, bickering about exact numbers misses Josh's point  
completely. Postgres is seriously lacking in its ability to detect
hardware problems, and hardware *does fail*. And you can't just  
assume that when it fails it blows up completely.

We really do need some capability for detecting errors.

>> The problem I have with CRC checks is that it only detects bad
>> I/O, and is completely unable to detect data corruption due to bad
>> memory. This means that really we want a different solution which
>> can detect both bad RAM and bad I/O, and should only fall back on
>> CRC checks if we're unable to devise one.
>
> md5sum of each tuple?  As an optional system column (a la oid)

That's a possibility.

As Josh mentioned, some people will pay a serious performance hit to  
ensure that their data is safe and correct. The CRC proposal was  
intended as a middle of the road approach that would at least tell  
you that your hardware was probably OK. There's certainly more that  
could be done.

Also, I think some means of detecting torn pages would be very  
welcome. If this was done at the storage manager level it would  
probably be fairly transparent to the rest of the code.

>> checking data format for readable pages and tuples (and index
>> nodes) both before and after write to disk
>
> Given that PostgreSQL goes through the OS, and many of us are using
> RAID controllers with BBU RAM, how do you do a read with any
> confidence that it came from the disk?  (I mean, I know how to do
> that for a performance test, but as a routine step during production
> use?)


You'd probably need to go to some kind of stand-alone or background  
process that slowly reads and verifies the entire database.  
Unfortunately at that point you could only detect corruption and not  
correct it, but it'd still be better than nothing.
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: Page-level version upgrade (was: Block-level CRC checks)

From
Greg Stark
Date:
On Tue, Dec 1, 2009 at 9:58 PM, decibel <decibel@decibel.org> wrote:
> What happened to the work that was being done to allow a page to be upgraded
> on the fly when it was read in from disk?

There were no page level changes between 8.3 and 8.4.


-- 
greg


Re: Block-level CRC checks

From
Bruce Momjian
Date:
Greg Stark wrote:
> On Tue, Dec 1, 2009 at 6:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> >> OK, here is another idea, maybe crazy:
> >
> >> When we read in a page that has an invalid CRC, we check the page to see
> >> which hint bits are _not_ set, and we try setting them to see if can get
> >> a matching CRC.
> 
> Unfortunately you would also have to try *unsetting* every hint bit as
> well since the updated hint bits might have made it to disk but not
> the CRC leaving the old CRC for the block with the unset bits.
> 
> I actually independently had the same thought today that Simon had of
> moving the hint bits to the line pointer. We can obtain more free bits
> in the line pointers by dividing the item offsets and sizes by
> maxalign if we need it. That should give at least 4 spare bits which
> is all we need for the four VALID/INVALID hint bits.
> 
> It should be relatively cheap to skip the hint bits in the line
> pointers since they'll be the same bits of every 16-bit value for a
> whole range. Alternatively we could just CRC the tuples and assume a
> corrupted line pointer will show itself quickly. That would actually
> make it faster than a straight CRC of the whole block -- making
> lemonade out of lemons as it were.

Yea, I am thinking we would have to have the hint bits in the line
pointers --- if not, we would have to reserve a lot of free space to
hold the maximum number of tuple hint bits --- seems like a waste.
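
To make the bit arithmetic concrete, here is one hypothetical carving-up
of a 32-bit line pointer with offsets and lengths stored in MAXALIGN
units (this is not the current ItemIdData layout, and it assumes lengths
get rounded up to MAXALIGN):

#include <stdint.h>

#define MAXALIGN_UNIT 8

/* 12 bits of MAXALIGN-ed offset covers pages up to 32KB, which leaves
 * room for the four visibility hint bits with two bits still spare. */
typedef struct HintedItemId
{
    uint32_t lp_off_units : 12;   /* item offset / MAXALIGN */
    uint32_t lp_flags     : 2;    /* unused/normal/redirect/dead */
    uint32_t lp_len_units : 12;   /* item length / MAXALIGN, rounded up */
    uint32_t hint_xmin_committed : 1;
    uint32_t hint_xmin_invalid   : 1;
    uint32_t hint_xmax_committed : 1;
    uint32_t hint_xmax_invalid   : 1;
} HintedItemId;

static inline uint32_t
hinted_item_offset(const HintedItemId *lp)
{
    return (uint32_t) lp->lp_off_units * MAXALIGN_UNIT;
}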

I also like the idea that we don't need to CRC check the line pointers
because any corruption there is going to appear immediately.  However,
the bad news is that we wouldn't find the corruption until we try to
access bad data and might crash.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Page-level version upgrade (was: Block-level CRC checks)

From
Bruce Momjian
Date:
Greg Stark wrote:
> On Tue, Dec 1, 2009 at 9:58 PM, decibel <decibel@decibel.org> wrote:
> > What happened to the work that was being done to allow a page to be upgraded
> > on the fly when it was read in from disk?
> 
> There were no page level changes between 8.3 and 8.4.

Yea, we have the idea of how to do it (in cases where the page size
doesn't increase), but no need to implement it in 8.3 to 8.4.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
Richard Huxton
Date:
Greg Stark wrote:
> On Tue, Dec 1, 2009 at 9:57 PM, Richard Huxton <dev@archonet.com> wrote:
>> Why are we writing out the hint bits to disk anyway? Is it really so
>> slow to calculate them on read + cache them that it's worth all this
>> trouble? Are they not also to blame for the "write my import data twice"
>> feature?
> 
> It would be interesting to experiment with different strategies. But
> the results would depend a lot on workloads and I doubt one strategy
> is best for everyone.
> 
> It has often been suggested that we could set the hint bits but not
> dirty the page, so they would never be written out unless some other
> update hit the page. In most use cases that would probably result in
> the right thing happening where we avoid half the writes but still
> stop doing transaction status lookups relatively promptly. The scary
> thing is that there might be use cases such as static data loaded
> where the hint bits never get set and every scan of the page has to
> recheck those statuses until the tuples are frozen.

And how scary is that? Assuming we cache the hints...
1. With the page itself, so same lifespan
2. Separately, perhaps with a different (longer) lifespan.

Separately would then let you trade complexity for compactness - "all of
block B is deleted", "all of table T is visible".

So what is the cost of calculating the hint-bits for a whole block of
tuples in one go vs reading that block from actual spinning disk?

> There does need to be something like the hint bits which does
> eventually have to be set because we can't keep transaction
> information around forever. Even if you keep the transaction
> information all the way back to the last freeze date (up to about 1GB
> and change I think) then the data has to be written twice, the second
> time is to freeze the transactions. In the worst case then reading a
> page requires a random page access (or two) from anywhere in that 1GB+
> file for each tuple on the page (whether visible to us or not).

While on that topic - I'm assuming freezing requires substantially more
effort than updating hint bits?

--
  Richard Huxton
  Archonet Ltd


Re: Block-level CRC checks

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Greg Stark wrote:
>> It should be relatively cheap to skip the hint bits in the line
>> pointers since they'll be the same bits of every 16-bit value for a
>> whole range. Alternatively we could just CRC the tuples and assume a
>> corrupted line pointer will show itself quickly. That would actually
>> make it faster than a straight CRC of the whole block -- making
>> lemonade out of lemons as it were.

I don't think "relatively cheap" is the right criterion here --- the
question to me is how many assumptions are you making in order to
compute the page's CRC.  Each assumption degrades the reliability
of the check, not to mention creating another maintenance hazard.

> Yea, I am thinking we would have to have the hint bits in the line
> pointers --- if not, we would have to reserve a lot of free space to
> hold the maximum number of tuple hint bits --- seems like a waste.

Not if you're willing to move the line pointers around.  I'd envision
an extra pointer in the page header, with a layout along the lines of
	fixed-size page header
	hint bits
	line pointers
	free space
	tuples proper
	special space

with the CRC covering everything except the hint bits and perhaps the
free space (depending on whether you wanted to depend on two more
pointers to be right).  We would have to move the line pointers anytime
we needed to grow the hint-bit space, and there would be a
straightforward tradeoff between how often to move the pointers versus
how much potentially-wasted space we leave at the end of the hint area.

Or we could put the hint bits after the pointers, which might be better
because the hints would be smaller == cheaper to move.
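
A rough structural sketch of that layout, with invented field names
(the real PageHeaderData differs) and zlib's crc32() standing in for
whatever checksum we'd use:

#include <stdint.h>
#include <zlib.h>

#define BLCKSZ 8192

typedef struct SketchPageHeader
{
    uint32_t pd_crc;        /* covers everything except the hint area */
    uint16_t pd_hints_end;  /* offset just past the hint-bit area */
    uint16_t pd_lower;      /* end of line pointers */
    uint16_t pd_upper;      /* start of tuple data */
    uint16_t pd_special;    /* start of special space */
} SketchPageHeader;

/* Checksum the header fields after pd_crc itself, then everything past
 * the hint-bit area; the hints (and, if we chose to, the free space
 * between pd_lower and pd_upper) are simply skipped. */
static uint32_t
sketch_page_crc(const unsigned char *page)
{
    const SketchPageHeader *hdr = (const SketchPageHeader *) page;
    uLong crc = crc32(0L, Z_NULL, 0);

    crc = crc32(crc, page + sizeof(uint32_t),
                sizeof(SketchPageHeader) - sizeof(uint32_t));
    crc = crc32(crc, page + hdr->pd_hints_end, BLCKSZ - hdr->pd_hints_end);
    return (uint32_t) crc;
}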

> I also like the idea that we don't need to CRC check the line pointers
> because any corruption there is going to appear immediately.  However,
> the bad news is that we wouldn't find the corruption until we try to
> access bad data and might crash.

That sounds exactly like the corruption detection system we have now.
If you think that behavior is acceptable, we can skip this whole
discussion.
        regards, tom lane


Re: Block-level CRC checks

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Greg Stark wrote:
> >> It should be relatively cheap to skip the hint bits in the line
> >> pointers since they'll be the same bits of every 16-bit value for a
> >> whole range. Alternatively we could just CRC the tuples and assume a
> >> corrupted line pointer will show itself quickly. That would actually
> >> make it faster than a straight CRC of the whole block -- making
> >> lemonade out of lemons as it were.
> 
> I don't think "relatively cheap" is the right criterion here --- the
> question to me is how many assumptions are you making in order to
> compute the page's CRC.  Each assumption degrades the reliability
> of the check, not to mention creating another maintenance hazard.
> 
> > Yea, I am thinking we would have to have the hint bits in the line
> > pointers --- if not, we would have to reserve a lot of free space to
> > hold the maximum number of tuple hint bits --- seems like a waste.
> 
> Not if you're willing to move the line pointers around.  I'd envision
> an extra pointer in the page header, with a layout along the lines of
> 
>     fixed-size page header
>     hint bits
>     line pointers
>     free space
>     tuples proper
>     special space
> 
> with the CRC covering everything except the hint bits and perhaps the
> free space (depending on whether you wanted to depend on two more
> pointers to be right).  We would have to move the line pointers anytime
> we needed to grow the hint-bit space, and there would be a
> straightforward tradeoff between how often to move the pointers versus
> how much potentially-wasted space we leave at the end of the hint area.

I assume you don't want the hint bits in the line pointers because we
would need to lock the page?

> Or we could put the hint bits after the pointers, which might be better
> because the hints would be smaller == cheaper to move.

I don't see the value there because you would need to move the hint bits
every time you added a new line pointer.  The bigger problem is that you
would need to lock the page to update the hint bits if they move around
on the page.

> > I also like the idea that we don't need to CRC check the line pointers
> > because any corruption there is going to appear immediately.  However,
> > the bad news is that we wouldn't find the corruption until we try to
> > access bad data and might crash.
> 
> That sounds exactly like the corruption detection system we have now.
> If you think that behavior is acceptable, we can skip this whole
> discussion.

Agreed, hence the "bad" part.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
Tom Lane
Date:
Richard Huxton <dev@archonet.com> writes:
> So what is the cost of calculating the hint-bits for a whole block of
> tuples in one go vs reading that block from actual spinning disk?

Potentially a couple of hundred times worse, if you're unlucky and each
XID on the page requires visiting a different block of clog that's also
not in memory.  The average case probably isn't that bad, but I think
we'd likely be talking at least a factor of two penalty --- you'd be
hopelessly optimistic to assume you didn't need at least one clog visit
per page.

Also, if you want to assume that you're lucky and the XIDs mostly fall
within a fairly recent range of clog pages, you're still not out of the
woods.  In that situation what you are talking about is a spectacular
increase in the hit rate for cached clog pages --- which are already a
known contention bottleneck in many scenarios.

> While on that topic - I'm assuming freezing requires substantially more
> effort than updating hint bits?

It's a WAL-logged page change, so at minimum double the cost.
        regards, tom lane


Re: Block-level CRC checks

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> I don't think "relatively cheap" is the right criterion here --- the
>> question to me is how many assumptions are you making in order to
>> compute the page's CRC.  Each assumption degrades the reliability
>> of the check, not to mention creating another maintenance hazard.

> I assume you don't want the hint bits in the line pointers because we
> would need to lock the page?

No, I don't want them there because I don't want the CRC check to know
so much about the page layout.

>> Or we could put the hint bits after the pointers, which might be better
>> because the hints would be smaller == cheaper to move.

> I don't see the value there because you would need to move the hint bits
> every time you added a new line pointer.

No, we could add unused line pointers in multiples, exactly the same as
we would add unused hint bits in multiples if we did it the other way.
I don't know offhand which would be more efficient, but you can't just
dismiss one without analysis.

> The bigger problem is that you
> would need to lock the page to update the hint bits if they move around
> on the page.

We are already assuming that things aren't moving around when we update
a hint bit now.  That's what the requirement of shared buffer lock when
calling tqual.c is for.
        regards, tom lane


Re: Block-level CRC checks

From
Greg Stark
Date:
On Tue, Dec 1, 2009 at 10:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> Greg Stark wrote:
>>> It should be relatively cheap to skip the hint bits in the line
>>> pointers since they'll be the same bits of every 16-bit value for a
>>> whole range. Alternatively we could just CRC the tuples and assume a
>>> corrupted line pointer will show itself quickly. That would actually
>>> make it faster than a straight CRC of the whole block -- making
>>> lemonade out of lemons as it were.
>
> I don't think "relatively cheap" is the right criterion here --- the
> question to me is how many assumptions are you making in order to
> compute the page's CRC.  Each assumption degrades the reliability
> of the check, not to mention creating another maintenance hazard.

Well the only assumption here is that we know where the line pointers
start and end. That sounds like the same level of assumption as your
structure with the line pointers moving around. I agree with your
general point though -- trying to skip the hint bits strewn around in
the tuples means that every line pointer had better be correct or
you'll be in trouble before you even get to the CRC check. Skipping
them in the line pointers just means applying a hard-coded mask
against each word in that region.

It seems to me adding a third structure on the page and then requiring
tqual to be able to find that doesn't significantly reduce the
complexity over having tqual be able to find the line pointers. And it
significantly increases the complexity of every other part of the
system which has to deal with a third structure on the page. And
adding and compacting the page becomes a lot more complex.  I'm also
a bit leery about adding more line pointers than necessary because
even a small number of extra line pointers will mean you're likely to
often fit one fewer tuple on the page.

--
greg


Re: Block-level CRC checks

From
decibel
Date:
On Dec 1, 2009, at 4:13 PM, Greg Stark wrote:
> On Tue, Dec 1, 2009 at 9:57 PM, Richard Huxton <dev@archonet.com>  
> wrote:
>> Why are we writing out the hint bits to disk anyway? Is it really so
>> slow to calculate them on read + cache them that it's worth all this
>> trouble? Are they not also to blame for the "write my import data  
>> twice"
>> feature?
>
> It would be interesting to experiment with different strategies. But
> the results would depend a lot on workloads and I doubt one strategy
> is best for everyone.

I agree that we'll always have the issue with freezing. But I also
think it's time to revisit the whole idea of hint bits. AFAIK we only
keep at maximum 2B transactions, and each one takes 2 bits in CLOG.
So worst-case scenario, we're looking at 512MB of clog. On modern
hardware, that's not a lot. And that's also assuming that we don't do
any kind of compression on that data (obviously we couldn't use just
any old compression algorithm, but there's certainly tricks that
could be used to reduce the size of this information).
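
(Spelling out the arithmetic, assuming two status bits per xid: 2^31
xids * 2 bits = 2^32 bits = 512MB. Even the full 2^32 xid space only
needs 1GB, which matches Greg's "1GB and change" figure upthread.)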

I know this is something that folks at EnterpriseDB have looked at;
perhaps there's data they can share.

> It has often been suggested that we could set the hint bits but not
> dirty the page, so they would never be written out unless some other
> update hit the page. In most use cases that would probably result in
> the right thing happening where we avoid half the writes but still
> stop doing transaction status lookups relatively promptly. The scary
> thing is that there might be use cases such as static data loaded
> where the hint bits never get set and every scan of the page has to
> recheck those statuses until the tuples are frozen.
>
> (Not dirtying the page almost gets us out of the CRC problems -- it
> doesn't in our current setup because we don't take a lock when setting
> the hint bits, so you could set it on a page someone is in the middle
> of CRC checking and writing. There were other solutions proposed for
> that, including just making hint bits require locking the page or
> double buffering the write.)
>
> There does need to be something like the hint bits which does
> eventually have to be set because we can't keep transaction
> information around forever. Even if you keep the transaction
> information all the way back to the last freeze date (up to about 1GB
> and change I think) then the data has to be written twice, the second
> time is to freeze the transactions. In the worst case then reading a
> page requires a random page access (or two) from anywhere in that 1GB+
> file for each tuple on the page (whether visible to us or not).
> -- 
> greg
>

--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: Block-level CRC checks

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> On Tue, Dec 1, 2009 at 10:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I don't think "relatively cheap" is the right criterion here --- the
>> question to me is how many assumptions are you making in order to
>> compute the page's CRC.  Each assumption degrades the reliability
>> of the check, not to mention creating another maintenance hazard.

> Well the only assumption here is that we know where the line pointers
> start and end.

... and what they contain.  To CRC a subset of the page at all, we have
to put some amount of faith into the page header's pointers.  We can do
weak checks on those, but only weak ones.  If we process different parts
of the page differently, we're increasing our trust in those pointers
and reducing the quality of the CRC check.

> It seems to me adding a third structure on the page and then requiring
> tqual to be able to find that doesn't significantly reduce the
> complexity over having tqual be able to find the line pointers. And it
> significantly increases the complexity of every other part of the
> system which has to deal with a third structure on the page. And
> adding and compacting the page becomes a lot more complex.

The page compaction logic amounts to a grand total of two not-very-long
routines.  The vast majority of the code impact from this would be from
the problem of finding the out-of-line hint bits for a tuple, which as
you say appears about equivalently hard either way.  So I think keeping
the CRC logic as simple as possible is good from both a reliability and
performance standpoint.
        regards, tom lane


Re: Block-level CRC checks

From
"Joshua D. Drake"
Date:
On Tue, 2009-12-01 at 07:05 -0500, Bruce Momjian wrote:
>  
> > All useful detection mechanisms have non-zero false positives because we
> > would rather sometimes ring the bell for no reason than to let bad
> > things through silently, as we do now.
> 
> OK, but what happens if someone gets the failure report, assumes their
> hardware is faulty and replaces it, and then gets a failure report
> again? 

They are stupid? Nobody just replaces hardware. You test it.

We can't fix stupid.

Joshua D. Drake

-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - Salamander



Re: Block-level CRC checks

From
"Joshua D. Drake"
Date:
On Tue, 2009-12-01 at 10:55 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Tue, 2009-12-01 at 16:40 +0200, Heikki Linnakangas wrote:
> >> It's not hard to imagine that when a hardware glitch happens
> >> causing corruption, it also causes the system to crash. Recalculating
> >> the CRCs after crash would mask the corruption.
> 
> > They are already masked from us, so continuing to mask those errors
> > would not put us in a worse position.
> 
> No, it would just destroy a large part of the argument for why this
> is worth doing.  "We detect disk errors ... except for ones that happen
> during a database crash."  "Say what?"
> 
> The fundamental problem with this is the same as it's been all along:
> the tradeoff between implementation work expended, performance overhead
> added, and net number of real problems detected (with a suitably large
> demerit for actually *introducing* problems) just doesn't look
> attractive.  You can make various compromises that improve one or two of
> these factors at the cost of making the others worse, but at the end of
> the day I've still not seen a combination that seems worth doing.

Let me try a different but similar perspective. The problem we are
trying to solve here only matters to a very small subset of the people
actually using PostgreSQL. Specifically, a percentage that is using
PostgreSQL in a situation where they can lose many thousands of dollars
per minute or hour should an outage occur.

On the other hand it is those very people that are *paying* people to
try and implement these features. Kind of a catch-22.

The hard core reality is this. *IF* it is one of the goals of this
project to ensure that the software can be safely, effectively, and
responsibly operated in a manner that is acceptable to C* level people
in a Fortune level company then we *must* solve this problem.

If it is not the goal of the project, leave it to EDB/CMD/2ndQuadrant
to fork it because it will eventually happen. Our customers are
demanding these features.

Sincerely,

Joshua D. Drake


-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - Salamander



Re: Block-level CRC checks

From
Greg Stark
Date:
On Wed, Dec 2, 2009 at 12:03 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <gsstark@mit.edu> writes:
>> On Tue, Dec 1, 2009 at 10:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I don't think "relatively cheap" is the right criterion here --- the
>>> question to me is how many assumptions are you making in order to
>>> compute the page's CRC.  Each assumption degrades the reliability
>>> of the check, not to mention creating another maintenance hazard.
>
>> Well the only assumption here is that we know where the line pointers
>> start and end.
>
> ... and what they contain.  To CRC a subset of the page at all, we have
> to put some amount of faith into the page header's pointers.  We can do
> weak checks on those, but only weak ones.  If we process different parts
> of the page differently, we're increasing our trust in those pointers
> and reducing the quality of the CRC check.

I'm not sure we're on the same page.  As I understand it there are
three proposals on the table now:

1) set aside a section of the page to contain only non-checksummed
hint bits. That section has to be relocatable so the crc check would
have to read the start and end address of it from the page header.

2) store the hint bits in the line pointers and skip checking the line
pointers. In that case the crc check would skip any bytes between the
start of the line pointer array and pd_lower (or pd_upper? no point in
crc checking unused bytes is there?)

3) store the hint bits in the line pointers and apply a mask which
masks out the 4 hint bits in each 32-bit word in the region between
the start of the line pointers and pd_lower (or pd_upper again)

These three options all seem to have the same level of interdependence
for the crc check, namely they all depend on one or two values in the
page header to specify a range of bytes in the block. None of them
depend on the contents of the line pointers themselves being correct,
only the one or two fields in the header specifying which range of
bytes the hint bits lie in.

For what it's worth, I don't think "decreasing the quality of the crc
check" is actually valid. The bottom line is that in all of the above
options, if any pointer is invalid we'll be CRCing a different set of
data from the set that originally went into calculating the stored CRC,
so we'll be effectively computing a random value which will have a
1/2^32 chance of being the value stored in the CRC field regardless of
anything else.
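
For illustration, option 3 comes down to something like this (the
header size and the positions of the four hypothetical hint bits are
invented; zlib's crc32() stands in for whatever checksum we'd use):

#include <stdint.h>
#include <string.h>
#include <zlib.h>

#define BLCKSZ      8192
#define HEADER_SIZE 24              /* assumed fixed header size */
#define HINT_MASK   0x3C000000u     /* wherever the 4 hint bits end up */

/* Checksum the page with the per-line-pointer hint bits masked out.
 * Works on a private copy so the shared page is never modified. */
static uint32_t
masked_page_crc(const unsigned char *page, uint16_t pd_lower)
{
    unsigned char copy[BLCKSZ];
    size_t i, nwords = (size_t) (pd_lower - HEADER_SIZE) / sizeof(uint32_t);

    memcpy(copy, page, BLCKSZ);
    for (i = 0; i < nwords; i++)
    {
        uint32_t word;

        /* memcpy in and out to sidestep alignment concerns */
        memcpy(&word, copy + HEADER_SIZE + i * sizeof(uint32_t), sizeof(word));
        word &= ~HINT_MASK;
        memcpy(copy + HEADER_SIZE + i * sizeof(uint32_t), &word, sizeof(word));
    }
    return (uint32_t) crc32(0L, copy, BLCKSZ);
}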

--
greg


Re: Block-level CRC checks

From
"Joshua D. Drake"
Date:
On Tue, 2009-12-01 at 13:20 -0500, Robert Haas wrote:
> On Tue, Dec 1, 2009 at 1:02 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> > The hard core reality is this. *IF* it is one of the goals of this
> > project to ensure that the software can be safely, effectively, and
> > responsibly operated in a manner that is acceptable to C* level people
> > in a Fortune level company then we *must* solve this problem.
> >
> > If it is not the goal of the project, leave it to EDB/CMD/2ndQuadrant
> > to fork it because it will eventually happen. Our customers are
> > demanding these features.
> 
> OK, and when you fork it, how do you plan to implement it? 

Hey man, I am not an engineer :P. You know that. I am just speaking to the
pressures that some of us are having in the marketplace about these
types of features.

> red judgment on whether this is a feasible solution.
> 
> Does $COMPETITOR offer this feature?
> 

My understanding is that MSSQL does. I am not sure about Oracle. Those
are the only two I run into (I don't run into MySQL at all). I know
others likely compete in the DB2 space.

Sincerely,

Joshua D. Drake



-- 
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
If the world pushes look it in the eye and GRR. Then push back harder. - Salamander



Re: Page-level version upgrade (was: Block-level CRC checks)

From
Robert Haas
Date:
On Tue, Dec 1, 2009 at 5:15 PM, Greg Stark <gsstark@mit.edu> wrote:
> On Tue, Dec 1, 2009 at 9:58 PM, decibel <decibel@decibel.org> wrote:
>> What happened to the work that was being done to allow a page to be upgraded
>> on the fly when it was read in from disk?
>
> There were no page level changes between 8.3 and 8.4.

That's true, but I don't think it's the full and complete answer to
the question.  Zdenek submitted a patch for CF 2008-11 which attempted
to add support for multiple page versions.  I guess we're on v4 right
now, and he was attempting to add support for v3 pages, which would
have allowed reading in pages from old PG versions.  To put it
bluntly, the code wasn't anything I would have wanted to deploy, but
the reason why Zdenek gave up on fixing it was because several
community members considerably senior to myself provided negative
feedback on the concept.

...Robert


Re: Page-level version upgrade (was: Block-level CRC checks)

From
Bruce Momjian
Date:
Robert Haas wrote:
> On Tue, Dec 1, 2009 at 5:15 PM, Greg Stark <gsstark@mit.edu> wrote:
> > On Tue, Dec 1, 2009 at 9:58 PM, decibel <decibel@decibel.org> wrote:
> >> What happened to the work that was being done to allow a page to be upgraded
> >> on the fly when it was read in from disk?
> >
> > There were no page level changes between 8.3 and 8.4.
> 
> That's true, but I don't think it's the full and complete answer to
> the question.  Zdenek submitted a patch for CF 2008-11 which attempted
> to add support for multiple page versions.  I guess we're on v4 right
> now, and he was attempting to add support for v3 pages, which would
> have allowed reading in pages from old PG versions.  To put it
> bluntly, the code wasn't anything I would have wanted to deploy, but
> the reason why Zdenek gave up on fixing it was because several
> community members considerably senior to myself provided negative
> feedback on the concept.

Well, there were quite a number of open issues relating to page
conversion:
	o  Do we write the old version or just convert on read?
	o  How do we write pages that get larger on conversion to the
	   new format?

As I remember, the patch allowed read/write of old versions, which greatly
increased its code impact.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Page-level version upgrade (was: Block-level CRC checks)

From
Robert Haas
Date:
On Tue, Dec 1, 2009 at 9:31 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Robert Haas wrote:
>> On Tue, Dec 1, 2009 at 5:15 PM, Greg Stark <gsstark@mit.edu> wrote:
>> > On Tue, Dec 1, 2009 at 9:58 PM, decibel <decibel@decibel.org> wrote:
>> >> What happened to the work that was being done to allow a page to be upgraded
>> >> on the fly when it was read in from disk?
>> >
>> > There were no page level changes between 8.3 and 8.4.
>>
>> That's true, but I don't think it's the full and complete answer to
>> the question.  Zdenek submitted a patch for CF 2008-11 which attempted
>> to add support for multiple page versions.  I guess we're on v4 right
>> now, and he was attempting to add support for v3 pages, which would
>> have allowed reading in pages from old PG versions.  To put it
>> bluntly, the code wasn't anything I would have wanted to deploy, but
>> the reason why Zdenek gave up on fixing it was because several
>> community members considerably senior to myself provided negative
>> feedback on the concept.
>
> Well, there were quite a number of open issues relating to page
> conversion:
>
>        o  Do we write the old version or just convert on read?
>        o  How do we write pages that get larger on conversion to the
>           new format?
>
> As I remember, the patch allowed read/write of old versions, which greatly
> increased its code impact.

Oh, for sure there were plenty of issues with the patch, starting with
the fact that the way it was set up led to unacceptable performance
and code complexity trade-offs.  Some of my comments from the time:

http://archives.postgresql.org/pgsql-hackers/2008-11/msg00149.php
http://archives.postgresql.org/pgsql-hackers/2008-11/msg00152.php

But the point is that the concept, I think, is basically the right
one: you have to be able to read and make sense of the contents of old
page versions.  There is room, at least in my book, for debate about
which operations we should support on old pages.  Totally read only?
Set hint bits?  Kill old tuples?  Add new tuples?

The key issue, as I think Heikki identified at the time, is to figure
out how you're eventually going to get rid of the old pages.  He
proposed running a pre-upgrade utility on each page to reserve the
right amount of free space.

http://archives.postgresql.org/pgsql-hackers/2008-11/msg00208.php

I don't like that solution.  If the pre-upgrade utility is something
that has to be run while the database is off-line, then it defeats the
point of an in-place upgrade.  If it can be run while the database is
up, I fear it will need to be deeply integrated into the server.  And
since we can't know the requirements for how much space to reserve
(and it needn't be a constant) until we design the new feature, this
will likely mean backpatching a rather large chunk of complex code,
which to put it mildly, is not the sort of thing we normally would
even consider.  I think a better approach is to support reading tuples
from old pages, but to write all new tuples into new pages.  A
full-table rewrite (like UPDATE foo SET x = x, CLUSTER, etc.) can be
used to propel everything to the new version, with the usual tricks
for people who need to rewrite the table a piece at a time.  But, this
is not religion for me.  I'm fine with some other design; I just can't
presently see how to make it work.

I think the present discussion of CRC checks is an excellent test-case
for any and all ideas about how to solve this problem.  If someone can
get a patch committed that can convert the 8.4 page format to an 8.5
format with the hint bits shuffled around and a (hopefully optional) CRC
added, I think that'll become the de facto standard for how to handle
page format upgrades.

...Robert


Re: Page-level version upgrade (was: Block-level CRC checks)

From
Bruce Momjian
Date:
Robert Haas wrote:
> > Well, there were quite a number of open issues relating to page
> > conversion:
> >
> >        o  Do we write the old version or just convert on read?
> >        o  How do we write pages that get larger on conversion to the
> >           new format?
> >
> > As I remember, the patch allowed read/write of old versions, which greatly
> > increased its code impact.
> 
> Oh, for sure there were plenty of issues with the patch, starting with
> the fact that the way it was set up led to unacceptable performance
> and code complexity trade-offs.  Some of my comments from the time:
> 
> http://archives.postgresql.org/pgsql-hackers/2008-11/msg00149.php
> http://archives.postgresql.org/pgsql-hackers/2008-11/msg00152.php
> 
> But the point is that the concept, I think, is basically the right
> one: you have to be able to read and make sense of the contents of old
> page versions.  There is room, at least in my book, for debate about
> which operations we should support on old pages.  Totally read only?
> Set hint bits?  Kill old tuples?  Add new tuples?

I think part of the problem is there was no agreement before the patch
was coded and submitted, and there didn't seem to be much desire from
the patch author to adjust it, nor demand from the community because we
didn't need it yet.

> The key issue, as I think Heikki identified at the time, is to figure
> out how you're eventually going to get rid of the old pages.  He
> proposed running a pre-upgrade utility on each page to reserve the
> right amount of free space.
> 
> http://archives.postgresql.org/pgsql-hackers/2008-11/msg00208.php

Right.  There were two basic approaches to handling a page that would
expand when upgraded to the new version --- either allow the system to
write the old format, or have a pre-upgrade script that moved tuples so
there was guaranteed enough free space in every page for the new format.
I think we agreed that the latter was better than the former, and it was
easy because we don't have any need for that at this time.  Plus the
script would not rewrite every page, just certain pages that required
it.

> I don't like that solution.  If the pre-upgrade utility is something
> that has to be run while the database is off-line, then it defeats the
> point of an in-place upgrade.  If it can be run while the database is
> up, I fear it will need to be deeply integrated into the server.  And
> since we can't know the requirements for how much space to reserve
> (and it needn't be a constant) until we design the new feature, this
> will likely mean backpatching a rather large chunk of complex code,
> which to put it mildly, is not the sort of thing we normally would
> even consider.  I think a better approach is to support reading tuples
> from old pages, but to write all new tuples into new pages.  A
> full-table rewrite (like UPDATE foo SET x = x, CLUSTER, etc.) can be
> used to propel everything to the new version, with the usual tricks
> for people who need to rewrite the table a piece at a time.  But, this
> is not religion for me.  I'm fine with some other design; I just can't
> presently see how to make it work.

Well, perhaps the text I wrote above will clarify that the upgrade
script is only for page expansion --- it is not to rewrite every page
into the new format.

> I think the present discussion of CRC checks is an excellent test-case
> for any and all ideas about how to solve this problem.  If someone can
> get a patch committed that can convert the 8.4 page format to an 8.5
> format with the hint bits shuffled around and a (hopefully optional) CRC
> added, I think that'll become the de facto standard for how to handle
> page format upgrades.

Well, yea, the idea would be that the 8.5 server would either convert
the page to the new format on read (assuming there is enough free space,
perhaps requiring a pre-upgrade script), or have the server write the
page in the old 8.4 format and not do CRC checks on the page.  My guess
is the former.
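
In code terms the read-side hook could be as small as this sketch
(version numbers and names invented; the converter is stubbed, and in
reality it would fail if the page lacks the free space the new format
needs):

#include <stdbool.h>
#include <stdint.h>

#define CURRENT_PAGE_VERSION 5      /* hypothetical 8.5 layout version */

typedef struct
{
    uint16_t pd_pagesize_version;   /* assumed: version in the low byte */
} VersionedPageHeader;

/* Stub: shuffle the hint bits aside, recompute the CRC, and so on. */
static bool
convert_v4_page_to_v5(unsigned char *page)
{
    (void) page;
    return true;
}

/* Called right after a block is read in, before anyone looks at it. */
static bool
maybe_convert_page(unsigned char *page)
{
    uint16_t version =
        ((VersionedPageHeader *) page)->pd_pagesize_version & 0x00FF;

    if (version == CURRENT_PAGE_VERSION)
        return true;                /* already current */
    if (version == CURRENT_PAGE_VERSION - 1)
        return convert_v4_page_to_v5(page);
    return false;                   /* too old: refuse rather than guess */
}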

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Page-level version upgrade (was: Block-level CRC checks)

From
Greg Smith
Date:
Robert Haas wrote:
> If the pre-upgrade utility is something
> that has to be run while the database is off-line, then it defeats the
> point of an in-place upgrade.  If it can be run while the database is
> up, I fear it will need to be deeply integrated into the server.  And
> since we can't know the requirements for how much space to reserve
> (and it needn't be a constant) until we design the new feature, this
> will likely mean backpatching a rather large chunk of complex code,
> which to put it mildly, is not the sort of thing we normally would
> even consider.
You're wandering into the sort of overdesign that isn't really needed 
yet.  For now, presume it's a constant amount of overhead, and that the 
release notes for the new version will say "configure the pre-upgrade 
utility and tell it you need <x> bytes of space reserved".  That's 
sufficient for the CRC case, right?  Needs a few more bytes per page, 
8.5 release notes could say exactly how much.  Solve that before making 
things more complicated by presuming you need to solve the variable-size 
increase problem, too.  We'll be lucky to get the absolute simplest 
approach committed, you really need to have a big smoking gun to justify 
feature creep in this area.

(If I had to shoot from the hip and design for the variable case, why 
not just make the thing that determines how much space a given page 
needs reserved a function the user can re-install with a smarter version?)
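
Something along these lines, say (interface entirely invented):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The pre-upgrade tool calls through this pointer to learn how many
 * bytes to keep free on a given page; a user could swap in a smarter
 * version. */
typedef size_t (*reserve_space_fn)(const unsigned char *page);

/* Default: constant per-page overhead, e.g. a 4-byte CRC plus slack. */
static size_t
reserve_constant(const unsigned char *page)
{
    (void) page;
    return 8;
}

/* A smarter drop-in might size the reservation from the page contents,
 * e.g. the CRC plus one hint byte per eight tuples. */
static size_t
reserve_per_tuple(const unsigned char *page)
{
    uint16_t ntuples;

    memcpy(&ntuples, page + 2, sizeof(ntuples));    /* assumed header field */
    return 4 + ((size_t) ntuples + 7) / 8;
}

static reserve_space_fn page_reserve_hook = reserve_constant;
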
> I think a better approach is to support reading tuples
> from old pages, but to write all new tuples into new pages.  A
> full-table rewrite (like UPDATE foo SET x = x, CLUSTER, etc.) can be
> used to propel everything to the new version, with the usual tricks
> for people who need to rewrite the table a piece at a time.
I think you're oversimplifying the operational difficulty of "the usual 
tricks".  This is a painful approach for the exact people who need this 
the most:  people with a live multi-TB installation they can't really 
afford to add too much load to.  The beauty of the in-place upgrade tool 
just converting pages as it scans through looking for them is that you 
can dial up its intensity to exactly how much overhead you can stand, 
and let it loose until it's done.

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com



Re: Page-level version upgrade (was: Block-level CRC checks)

From
Robert Haas
Date:
On Tue, Dec 1, 2009 at 10:34 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Robert Haas wrote:
>> > Well, there were quite a number of open issues relating to page
>> > conversion:
>> >
>> >        o  Do we write the old version or just convert on read?
>> >        o  How do we write pages that get larger on conversion to the
>> >           new format?
>> >
>> > As I remember, the patch allowed read/write of old versions, which greatly
>> > increased its code impact.
>>
>> Oh, for sure there were plenty of issues with the patch, starting with
>> the fact that the way it was set up led to unacceptable performance
>> and code complexity trade-offs.  Some of my comments from the time:
>>
>> http://archives.postgresql.org/pgsql-hackers/2008-11/msg00149.php
>> http://archives.postgresql.org/pgsql-hackers/2008-11/msg00152.php
>>
>> But the point is that the concept, I think, is basically the right
>> one: you have to be able to read and make sense of the contents of old
>> page versions.  There is room, at least in my book, for debate about
>> which operations we should support on old pages.  Totally read only?
>> Set hint bits?  Kill old tuples?  Add new tuples?
>
> I think part of the problem is there was no agreement before the patch
> was coded and submitted, and there didn't seem to be much desire from
> the patch author to adjust it, nor demand from the community because we
> didn't need it yet.

Could be.  It's water under the bridge at this point.

>> The key issue, as I think Heikki identified at the time, is to figure
>> out how you're eventually going to get rid of the old pages.  He
>> proposed running a pre-upgrade utility on each page to reserve the
>> right amount of free space.
>>
>> http://archives.postgresql.org/pgsql-hackers/2008-11/msg00208.php
>
> Right.  There were two basic approaches to handling a patch that would
> expand when upgraded to the new version --- either allow the system to
> write the old format, or have a pre-upgrade script that moved tuples so
> there was guaranteed enough free space in every page for the new format.
I think we agreed that the latter was better than the former, and it was
> easy because we don't have any need for that at this time.  Plus the
> script would not rewrite every page, just certain pages that required
> it.

While I'm always willing to be proven wrong, I think it's a complete
dead-end to believe that it's going to be easier to reserve space for
page expansion using the upgrade-from version rather than the
upgrade-to version.  I am firmly of the belief that the NEW pg version
must be able to operate on an unmodified heap migrated from the OLD pg
version.  After this set of patches was rejected, Zdenek actually
proposed an alternate patch that would have allowed space reservation,
and it was rejected precisely because there was no clear certainty
that it would solve any hypothetical future problem.

...Robert


Re: Page-level version upgrade (was: Block-level CRC checks)

From
Robert Haas
Date:
On Tue, Dec 1, 2009 at 10:45 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Robert Haas wrote:
>>
>> If the pre-upgrade utility is something
>> that has to be run while the database is off-line, then it defeats the
>> point of an in-place upgrade.  If it can be run while the database is
>> up, I fear it will need to be deeply integrated into the server.  And
>> since we can't know the requirements for how much space to reserve
>> (and it needn't be a constant) until we design the new feature, this
>> will likely mean backpatching a rather large chunk of complex code,
>> which to put it mildly, is not the sort of thing we normally would
>> even consider.
>
> You're wandering into the sort of overdesign that isn't really needed yet.
>  For now, presume it's a constant amount of overhead, and that the release
> notes for the new version will say "configure the pre-upgrade utility and
> tell it you need <x> bytes of space reserved".  That's sufficient for the
> CRC case, right?  Needs a few more bytes per page, 8.5 release notes could
> say exactly how much.  Solve that before making things more complicated by
> presuming you need to solve the variable-size increase problem, too.  We'll
> be lucky to get the absolute simplest approach committed; you really need to
> have a big smoking gun to justify feature creep in this area.

Well, I think the best way to solve the problem is to design the
system in a way that makes it unnecessary to have a pre-upgrade tool
at all, by making the new PG version capable of handling page
expansion where needed.  I don't understand how putting that
functionality into the OLD PG version can be better.  But I may be
misunderstanding something.

> (If I had to shoot from the hip and design for the variable case, why not
> just make the thing that determines how much space a given page needs
> reserved a function the user can re-install with a smarter version?)

That's a pretty good idea.   I have no love of this pre-upgrade
concept, but if we're going to do it that way, then allowing someone
to load in a function to compute the required amount of free space to
reserve is a good thought.

>> I think a better approach is to support reading tuples
>> from old pages, but to write all new tuples into new pages.  A
>> full-table rewrite (like UPDATE foo SET x = x, CLUSTER, etc.) can be
>> used to propel everything to the new version, with the usual tricks
>> for people who need to rewrite the table a piece at a time.
>
> I think you're oversimplifying the operational difficulty of "the usual
> tricks".  This is a painful approach for the exact people who need this the
> most:  people with a live multi-TB installation they can't really afford to
> add too much load to.  The beauty of the in-place upgrade tool just
> converting pages as it scans through looking for them is that you can dial
> up its intensity to exactly how much overhead you can stand, and let it
> loose until it's done.

Fair enough.

...Robert


Re: Block-level CRC checks

From
Aidan Van Dyk
Date:
* Greg Stark <gsstark@mit.edu> [091201 20:14]:
> I'm not sure we're on the same page.  As I understand it there are
> three proposals on the table now:
> 
> 1) set aside a section of the page to contain only non-checksummed
> hint bits. That section has to be relocatable so the crc check would
> have to read the start and end address of it from the page header.
> 
> 2) store the hint bits in the line pointers and skip checking the line
> pointers. In that case the crc check would skip any bytes between the
> start of the line pointer array and pd_lower (or pd_upper? no point in
> crc checking unused bytes is there?)
> 
> 3) store the hint bits in the line pointers and apply a mask which
> masks out the 4 hint bits in each 32-bit word in the region between
> the start of the line pointers and pd_lower (or pd_upper again)
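
A sketch of what (3) could look like -- INIT/COMP/FIN_CRC32 are the
real macros from pg_crc.h, but LP_HINT_MASK is invented here, standing
in for whichever bits end up declared as hints:

    #include "postgres.h"
    #include "utils/pg_crc.h"

    /* CRC the whole page, masking out the assumed hint bits in each
     * 32-bit word of the line pointer region. */
    #define LP_HINT_MASK ((uint32) 0x0000000F)  /* illustrative only */

    static pg_crc32
    page_crc_skip_hints(const char *page, int lp_start, int lp_end, int len)
    {
        pg_crc32    crc;
        int         off;

        INIT_CRC32(crc);
        for (off = 0; off < len; off += sizeof(uint32))
        {
            uint32      word;

            memcpy(&word, page + off, sizeof(word));
            if (off >= lp_start && off < lp_end)
                word &= ~LP_HINT_MASK;  /* skip hints in line pointers */
            COMP_CRC32(crc, &word, sizeof(word));
        }
        FIN_CRC32(crc);
        return crc;
    }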

I'm not intimately familiar with the innards of the pages, but I had
*thought* that the original suggestion of moving the hint bits was
purely to make sure that they are in the same filesystem block/disk
sector as the CRC.  That may not be possible, but *if* that's the case,
you avoid the torn-page problem, with only one minimal assumption:

- the FS-block/disk-sector will write whole "blocks" at a time, or
  likely be corrupt anyways

With my understanding of disks and platters, I'd assume that if you got
a partial sector written, and something prevented it from being
completely written, I'd guess the part missing would be smeared with
corruption...  And that would seem to hold with flash/SSDs too...

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Page-level version upgrade (was: Block-level CRC checks)

From
Bruce Momjian
Date:
Robert Haas wrote:
> >> The key issue, as I think Heikki identified at the time, is to figure
> >> out how you're eventually going to get rid of the old pages.  He
> >> proposed running a pre-upgrade utility on each page to reserve the
> >> right amount of free space.
> >>
> >> http://archives.postgresql.org/pgsql-hackers/2008-11/msg00208.php
> >
> > Right.  There were two basic approaches to handling a patch that would
> > expand when upgraded to the new version --- either allow the system to
> > write the old format, or have a pre-upgrade script that moved tuples so
> > there was guaranteed enough free space in every page for the new format.
> > I think we agreed that the latter was better than the former, and it was
> > easy because we don't have any need for that at this time.  Plus the
> > script would not rewrite every page, just certain pages that required
> > it.
> 
> While I'm always willing to be proven wrong, I think it's a complete
> dead-end to believe that it's going to be easier to reserve space for
> page expansion using the upgrade-from version rather than the
> upgrade-to version.  I am firmly of the belief that the NEW pg version
> must be able to operate on an unmodified heap migrated from the OLD pg
> version.  After this set of patches was rejected, Zdenek actually

Does it need to write the old version, and if it does, it has to carry
around the old format structures all over the backend?  That was the
unclear part.

> proposed an alternate patch that would have allowed space reservation,
> and it was rejected precisely because there was no clear certainty
> that it would solve any hypothetical future problem.

True.  It was solving a problem we didn't have, yet.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  + If your life is a hard drive, Christ can be your backup. +


Re: Page-level version upgrade (was: Block-level CRC checks)

From
David Fetter
Date:
On Tue, Dec 01, 2009 at 10:34:11PM -0500, Bruce Momjian wrote:
> Robert Haas wrote:
> > The key issue, as I think Heikki identified at the time, is to
> > figure out how you're eventually going to get rid of the old
> > pages.  He proposed running a pre-upgrade utility on each page to
> > reserve the right amount of free space.
> > 
> > http://archives.postgresql.org/pgsql-hackers/2008-11/msg00208.php
> 
> Right.  There were two basic approaches to handling a patch that
> would expand when upgraded to the new version --- either allow the
> system to write the old format, or have a pre-upgrade script that
> moved tuples so there was guaranteed enough free space in every page
> for the new format.  I think we agreed that the latter was better
> than the former, and it was easy because we don't have any need for
> that at this time.  Plus the script would not rewrite every page,
> just certain pages that required it.

Please forgive me for barging in here, but that approach simply is
untenable if it requires that the database be down while those pages
are being found, marked, moved around, etc.

The data volumes that really concern people who need an in-place
upgrade are such that even 
   dd if=$PGDATA of=/dev/null bs=8192 # (or whatever the optimal block size would be)

would require *much* more time than such people would accept as a down
time window, and that's only a lower bound; the real conversion work
would take far longer.

If this re-jiggering could kick off in the background at start and
work on a running PostgreSQL, the whole objection goes away.

A problem that arises for any in-place upgrade system we do is that if
someone's at 99% storage capacity, we can pretty well guarantee some
kind of catastrophic failure.  Could we create some way to get an
estimate of space needed, given that the system needs to stay up while
that's happening?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Page-level version upgrade

From
Dimitri Fontaine
Date:
Hi,

As we're talking about crazy ideas...

Bruce Momjian <bruce@momjian.us> writes:
> Well, yea, the idea would be that the 8.5 server would either convert
> the page to the new format on read (assuming there is enough free space,
> perhaps requiring a pre-upgrade script), or have the server write the
> page in the old 8.4 format and not do CRC checks on the page.  My guess
> is the former.

We already have had demand for read only tables (some on-disk format
optimisation would then be possible). What about having page level
read-only restriction, thus allowing the newer server version to operate
in read-only mode on the older server version pages, and convert on
write by allocating whole new page(s)?

Then we go even crazier, with a special recovery mode on the new version
able to read the older version's WAL format, producing older version
pages. That sounds like code maintenance hell, but would allow a
$new WAL standby to restore from a $old WAL stream, and be read
only. Then you switch over to the slave and it goes out of recovery and
creates new pages on writes.

How about going this crazy?

Regards,
-- 
dim


Re: Page-level version upgrade

From
Greg Stark
Date:
On Wed, Dec 2, 2009 at 11:26 AM, Dimitri Fontaine
<dfontaine@hi-media.com> wrote:
> We already have had demand for read only tables (some on-disk format
> optimisation would then be possible). What about having page level
> read-only restriction, thus allowing the newer server version to operate
> in read-only mode on the older server version pages, and convert on
> write by allocating whole new page(s)?

I'm a bit confused. Read-only tables are tables that the user has said
they don't intend to modify.  We can throw an error if they try. What
you're proposing are pages that the system treats as read-only but
what do you propose to do if the user actually does try to update or
delete (or lock) a record in those pages? If we want to avoid
converting them to new pages we need to be able to at least store an
xmax and set the ctid on those tuples. And probably we would need to
do other things like set hint bits or set fields in the page header.


-- 
greg


Re: Page-level version upgrade (was: Block-level CRC checks)

From
Bruce Momjian
Date:
David Fetter wrote:
> > Right.  There were two basic approaches to handling a patch that
> > would expand when upgraded to the new version --- either allow the
> > system to write the old format, or have a pre-upgrade script that
> > moved tuples so there was guaranteed enough free space in every page
> > for the new format.  I think we agreed that the latter was better
> > than the former, and it was easy because we don't have any need for
> > that at this time.  Plus the script would not rewrite every page,
> > just certain pages that required it.
> 
> Please forgive me for barging in here, but that approach simply is
> untenable if it requires that the database be down while those pages
> are being found, marked, moved around, etc.
> 
> The data volumes that really concern people who need an in-place
> upgrade are such that even 
> 
>     dd if=$PGDATA of=/dev/null bs=8192 # (or whatever the optimal block size would be)
> 
> would require *much* more time than such people would accept as a down
> time window, and while that's a lower bound, it's not a reasonable
> lower bound on the time.

Well, you can say it is unacceptable, but if there are no other options
then that is all we can offer.  My main point is that we should consider
writing old format pages only when we have no choice (page size might
expand), and even then, we might decide to have a pre-migration script
because the code impact of writing the old format would be too great. 
This is all hypothetical until we have a real use-case.

> If this re-jiggering could kick off in the background at start and
> work on a running PostgreSQL, the whole objection goes away.
> 
> A problem that arises for any in-place upgrade system we do is that if
> someone's at 99% storage capacity, we can pretty well guarantee some
> kind of catastrophic failure.  Could we create some way to get an
> estimate of space needed, given that the system needs to stay up while
> that's happening?

Yea, the database would expand and hopefully have full transaction
semantics.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  + If your life is a hard drive, Christ can be your backup. +


Re: Page-level version upgrade

From
Dimitri Fontaine
Date:
Greg Stark <gsstark@mit.edu> writes:
> On Wed, Dec 2, 2009 at 11:26 AM, Dimitri Fontaine
> <dfontaine@hi-media.com> wrote:
>> We already have had demand for read only tables (some on-disk format
>> optimisation would then be possible). What about having page level
>> read-only restriction, thus allowing the newer server version to operate
>> in read-only mode on the older server version pages, and convert on
>> write by allocating whole new page(s)?
>
> I'm a bit confused. Read-only tables are tables that the user has said
> they don't intend to modify.  We can throw an error if they try. What
> you're proposing are pages that the system treats as read-only but
> what do you propose to do if the user actually does try to update or
> delete (or lock) a record in those pages? 

Well it's still a pretty rough idea, so I'll need help from this forum
to get to something concrete enough for someone to be able to implement
it... and there you go:

> If we want to avoid
> converting them to new pages we need to be able to at least store an
> xmax and set the ctid on those tuples. And probably we would need to
> do other things like set hint bits or set fields in the page header.

My idea was more that any non read-only access to the page forces a
rewrite in the new format, and a deprecation of the ancient page. Maybe
like what vacuum would be doing on it as soon as it realises the page
contains no visible tuples anymore, but done by the backend at the time
of the modification.
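
As a sketch of that in code (PageGetPageLayoutVersion() and
PG_PAGE_LAYOUT_VERSION are real, from bufpage.h; the converter is
invented):

    /* On any modifying access, first rewrite an old-format page in the
     * current format, possibly relocating tuples to fresh pages if the
     * new layout needs more room. */
    if (PageGetPageLayoutVersion(page) < PG_PAGE_LAYOUT_VERSION)
        convert_page_to_current(rel, buffer);   /* hypothetical */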

That makes the first modifications of the page quite costly, but it
allows you to somewhat choose when that happens. And you would still
have read-only access, so you could test parts of your application on
a hot standby running the next version.

Maybe there's just too much craziness in there now.
-- 
dim


Re: Page-level version upgrade (was: Block-level CRC checks)

From
Robert Haas
Date:
On Tue, Dec 1, 2009 at 11:45 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Robert Haas wrote:
>> >> The key issue, as I think Heikki identified at the time, is to figure
>> >> out how you're eventually going to get rid of the old pages.  He
>> >> proposed running a pre-upgrade utility on each page to reserve the
>> >> right amount of free space.
>> >>
>> >> http://archives.postgresql.org/pgsql-hackers/2008-11/msg00208.php
>> >
>> > Right.  There were two basic approaches to handling a patch that would
>> > expand when upgraded to the new version --- either allow the system to
>> > write the old format, or have a pre-upgrade script that moved tuples so
>> > there was guaranteed enough free space in every page for the new format.
>> > I think we agreed that the latter was better than the former, and it was
>> > easy because we don't have any need for that at this time.  Plus the
>> > script would not rewrite every page, just certain pages that required
>> > it.
>>
>> While I'm always willing to be proven wrong, I think it's a complete
>> dead-end to believe that it's going to be easier to reserve space for
>> page expansion using the upgrade-from version rather than the
>> upgrade-to version.  I am firmly of the belief that the NEW pg version
>> must be able to operate on an unmodified heap migrated from the OLD pg
>> version.  After this set of patches was rejected, Zdenek actually
>
> Does it need to write the old version, and if it does, it has to carry
> around the old format structures all over the backend?  That was the
> unclear part.

I think it needs partial write support for the old version.  If the
page is not expanding, then you can probably just replace pages in
place.  But if the page is expanding, then you need to be able to move
individual tuples[1].  Since you want to be up and running while
that's happening, I think you probably need to be able to update xmax
and probably set hint bits.  But you don't need to be able to add
tuples to the old page format, and I don't think you need complete
vacuum support, since you don't plan to reuse the dead space - you'll
just recycle the whole page once the tuples are all dead.
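
In code terms the old-format path would whitelist just a handful of
operations; everything below is invented, only to show the shape of it:

    /* Operations permitted on a not-yet-converted old-format page.
     * Nothing that adds data or reuses freed space is allowed; the
     * page is drained of live tuples and then recycled wholesale. */
    typedef enum OldPageOp
    {
        OLDPAGE_SET_HINT_BITS,      /* mark tuples committed/aborted */
        OLDPAGE_SET_XMAX,           /* delete, or update away, a tuple */
        OLDPAGE_SET_CTID,           /* forwarding pointer to moved tuple */
        OLDPAGE_ADD_TUPLE,          /* anything that adds data... */
        OLDPAGE_PRUNE               /* ...or reuses dead space */
    } OldPageOp;

    static bool
    old_page_op_allowed(OldPageOp op)
    {
        switch (op)
        {
            case OLDPAGE_SET_HINT_BITS:
            case OLDPAGE_SET_XMAX:
            case OLDPAGE_SET_CTID:
                return true;
            default:
                return false;
        }
    }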

As for carrying it around the whole backend, I'm not sure how much of
the backend really needs to know.  It would only be anything that
looks at pages, rather than, say, tuples, but I don't really know how
much code that touches.  I suppose that's one of the things we need to
figure out.

[1] Unless, of course, you use a pre-upgrade utility.  But this is
about how to make it work WITHOUT a pre-upgrade utility.

>> proposed an alternate patch that would have allowed space reservation,
>> and it was rejected precisely because there was no clear certainty
>> that it would solve any hypothetical future problem.
>
> True.  It was solving a problem we didn't have, yet.

Well, that's sort of a circular argument.  If you're going to reserve
space with a pre-upgrade utility, you're going to need to put the
pre-upgrade utility into the version you want to upgrade FROM.  If we
wanted to be able to use a pre-upgrade utility to upgrade to 8.5, we
would have had to put the utility into 8.4.

The problem I'm referring to is that there is no guarantee that you
would be able predict how much space to reserve.  In a case like CRCs,
it may be as simple as "4 bytes".  But what if, say, we switch to a
different compression algorithm for inline toast?  Some pages will
contract, others will expand, but there's no saying by how much - and
therefore no fixed amount of reserved space is guaranteed to be
adequate.  It's true that we might never want to do that particular
thing, but I don't think we can say categorically that we'll NEVER
want to do anything that expands pages by an unpredictable amount.  So
it might be quite complex to figure out how much space to reserve on
any given page.  If we can find a way to make that the NEW PG
version's problem, it's still complicated, but at least it's not
complicated stuff that has to be backpatched.

Another problem with a pre-upgrade utility is - how do you verify,
when you fire up the new cluster, that the pre-upgrade utility has
done its thing?  If the new PG version requires 4 bytes of space
reserved on each page, what happens when you get halfway through
upgrading your 1TB database and find a page with only 2 bytes
available?  There aren't a lot of good options.  The old PG version
could try to mark the DB in some way to indicate whether it
successfully completed, but what if there's a bug and something was
missed?  Then you have this scenario:

1. Run the pre-upgrade script.
2. pg_migrator.
3. Fire up new version.
4. Discover that pre-upgrade script forgot to reserve enough space on some page.
5. Report a bug.
6. Bug fixed, new version of pre-upgrade script is now available.
7. ???

If all the logic is in the new server, you may still be in hot water
when you discover that it can't deal with a particular case.  But
hopefully the problem would be confined to that page, or that
relation, and you could use the rest of your database.  And even if
not, when the bug is fixed, you are patching the version that you're
still running and not the version that you've already left behind and
can't easily go back to.  Of course if the bug is bad enough it can
fry your database under any design, but in the pre-upgrade script
design you have to be really, really confident that the pre-upgrade
script doesn't have any holes that will only be found after it's too
late.

...Robert


Re: Page-level version upgrade (was: Block-level CRC checks)

From
Simon Riggs
Date:
On Wed, 2009-12-02 at 10:48 -0500, Robert Haas wrote:
> Well, that's sort of a circular argument.  If you're going to reserve
> space with a pre-upgrade utility, you're going to need to put the
> pre-upgrade utility into the version you want to upgrade FROM.  If we
> wanted to be able to use a pre-upgrade utility to upgrade to 8.5, we
> would have had to put the utility into 8.4.

Don't see any need to reserve space at all.

If this is really needed, we first run a script to prepare the 8.4
database for conversion to 8.5. The script would move things around if
it finds a block that would have difficulty after upgrade. We may be
able to do that simply, using fillfactor, or it may need to be more
complex. Either way, it's still easy to do this when required.

-- Simon Riggs           www.2ndQuadrant.com



Re: Page-level version upgrade (was: Block-level CRC checks)

From
Robert Haas
Date:
On Wed, Dec 2, 2009 at 11:08 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, 2009-12-02 at 10:48 -0500, Robert Haas wrote:
>> Well, that's sort of a circular argument.  If you're going to reserve
>> space with a pre-upgrade utility, you're going to need to put the
>> pre-upgrade utility into the version you want to upgrade FROM.  If we
>> wanted to be able to use a pre-upgrade utility to upgrade to 8.5, we
>> would have had to put the utility into 8.4.
>
> Don't see any need to reserve space at all.
>
> If this is really needed, we first run a script to prepare the 8.4
> database for conversion to 8.5. The script would move things around if
> it finds a block that would have difficulty after upgrade. We may be
> able to do that simply, using fillfactor, or it may need to be more
> complex. Either way, it's still easy to do this when required.

I discussed the problems with this, as I see them, in the same email
you just quoted.  You don't have to agree with my analysis, of course.

...Robert


Re: Page-level version upgrade (was: Block-level CRC checks)

From
Greg Smith
Date:
Robert Haas wrote:
> The problem I'm referring to is that there is no guarantee that you
> would be able predict how much space to reserve.  In a case like CRCs,
> it may be as simple as "4 bytes".  But what if, say, we switch to a
> different compression algorithm for inline toast?
Upthread, you made a perfectly sensible suggestion:  use the CRC 
addition as a test case to confirm you can build something useful that 
allowed slightly more complicated in-place upgrades than are supported 
now.  This requires some new code to do tuple shuffling, communicate 
reserved space, etc.  All things that seem quite sensible to have 
available, useful steps toward a more comprehensive solution, and an 
achievable goal you wouldn't even have to argue about.

Now, you're wandering us back down the path where we have to solve a 
"migrate TOAST changes" level problem in order to make progress.  
Starting with presuming you have to solve the hardest possible issue 
around is the documented path to failure here.  We've seen multiple such 
solutions before, and they all had trade-offs deemed unacceptable:  
either a performance loss for everyone (not just people upgrading), or 
unbearable code complexity.  There's every reason to believe your 
reinvention of the same techniques will suffer the same fate.

When someone has such a change to be made, maybe you could bring this 
back up again and gain some traction.  One of the big lessons I took 
from the 8.4 development's lack of progress on this class of problem:  
no work to make upgrades easier will get accepted unless there is such 
an upgrade on the table that requires it.  You need a test case to make 
sure the upgrade approach a) works as expected, and b) is code you must 
commit now or in-place upgrade is lost.  Anything else will be deferred; 
I don't think there's any interest in solving a speculative future 
problem left at this point, given that it will be code we can't even 
prove will work.

> Another problem with a pre-upgrade utility is - how do you verify,
> when you fire up the new cluster, that the pre-upgrade utility has
> done its thing?
Some additional catalog support was suggested to mark what the 
pre-upgrade utility had processed.  I'm sure I could find the messages 
about it again if I had to.

> If all the logic is in the new server, you may still be in hot water
> when you discover that it can't deal with a particular case.
If you can't design a pre-upgrade script without showstopper bugs, what 
makes you think the much more complicated code in the new server (which 
will be carrying around an ugly mess of old and new engine parts) will 
work as advertised?  I think we'll be lucky to get the simplest possible 
scheme implemented, and that any of these more complicated ones will die 
under the weight of their own complexity.

Also, your logic seems to presume that no backports are possible to the 
old server.  A bug-fix to the pre-upgrade script is a completely 
reasonable and expected candidate for backporting, because it will be 
such a targeted  piece of code that adjusting it shouldn't impact 
anything else.  The same will not be even remotely true if there's a bug 
fix needed in a more complicated system that lives in a regularly 
traversed code path.  Having such a tightly targeted chunk of code makes 
pre-upgrade *more* likely to get bug-fix backports, because you won't be 
touching code executed by regular users at all.

The potential code impact of backporting fixes to the more complicated 
approaches here is another major obstacle to adopting one of them.  
That's an issue that we didn't even get to the last time, because 
showstopper issues popped up first.  That problem would have loomed had 
work continued down that path, though.

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com



Re: Page-level version upgrade (was: Block-level CRC checks)

From
Robert Haas
Date:
On Wed, Dec 2, 2009 at 1:08 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Robert Haas wrote:
>>
>> The problem I'm referring to is that there is no guarantee that you
>> would be able predict how much space to reserve.  In a case like CRCs,
>> it may be as simple as "4 bytes".  But what if, say, we switch to a
>> different compression algorithm for inline toast?
>
> Upthread, you made a perfectly sensible suggestion:  use the CRC addition as
> a test case to confirm you can build something useful that allowed slightly
> more complicated in-place upgrades than are supported now.  This requires
> some new code to do tuple shuffling, communicate reserved space, etc.  All
> things that seem quite sensible to have available, useful steps toward a
> more comprehensive solution, and an achievable goal you wouldn't even have
> to argue about.
>
> Now, you're wandering us back down the path where we have to solve a
> "migrate TOAST changes" level problem in order to make progress.  Starting
> with presuming you have to solve the hardest possible issue around is the
> documented path to failure here.  We've seen multiple such solutions before,
> and they all had trade-offs deemed unacceptable:  either a performance loss
> for everyone (not just people upgrading), or unbearable code complexity.
>  There's every reason to believe your reinvention of the same techniques
> will suffer the same fate.

Just to set the record straight, I don't intend to work on this
problem at all (unless paid, of course).  And I'm perfectly happy to
go with whatever workable solution someone else comes up with.  I'm
just offering opinions on what I see as the advantages and
disadvantages of different approaches, and anyone who is working on
this is more than free to ignore them.

> Some additional catalog support was suggested to mark what the pre-upgrade
> utility had processed.  I'm sure I could find the messages about it
> again if I had to.

And that's a perfectly sensible solution, except that adding a catalog
column to 8.4 at this point would force initdb, so that's a
non-starter.  I suppose we could shoehorn it into the reloptions.

> Also, your logic seems to presume that no backports are possible to the old
> server.

The problem on the table at the moment is that the proposed CRC
feature will expand every page by a uniform amount - so in this case a
fixed-space-per-page reservation utility would be completely adequate.
Does anyone think this is a realistic thing to backport to 8.4?

...Robert


Re: Block-level CRC checks

From
Peter Eisentraut
Date:
On Tue, 2009-12-01 at 19:41 +0000, Greg Stark wrote:
> > Also, it would require reading back each page as it's written to
> > disk, which is OK for a bunch of single-row writes, but for bulk
> > data loads a significant problem.
> 
> Not sure what that really means for Postgres. It would just mean
> reading back the same page of memory from the filesystem cache that we
> just read.

Surely the file system ought to be the place to solve this.  After
all, we don't put link-level corruption detection into the libpq
protocol either.



Re: Block-level CRC checks

From
Peter Eisentraut
Date:
On Tue, 2009-12-01 at 17:47 -0500, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > I also like the idea that we don't need to CRC check the line pointers
> > because any corruption there is going to appear immediately.  However,
> > the bad news is that we wouldn't find the corruption until we try to
> > access bad data and might crash.
> 
> That sounds exactly like the corruption detection system we have now.
> If you think that behavior is acceptable, we can skip this whole
> discussion.

I think one of the motivations for this CRC business was to detect
corruption in the user data.  As you say, we already handle corruption
in the metadata.



Re: Page-level version upgrade (was: Block-level CRC checks)

From
Greg Smith
Date:
Robert Haas wrote:
>> Some additional catalog support was suggested to mark what the pre-upgrade
>> utility had processed.  I'm sure I could find the messages about it
>> again if I had to.
>>     
> And that's a perfectly sensible solution, except that adding a catalog
> column to 8.4 at this point would force initdb, so that's a
> non-starter.  I suppose we could shoehorn it into the reloptions.
>   
There's no reason the associated catalog support had to ship with the 
old version.  You can always modify the catalog after initdb, but before 
running the pre-upgrade utility.  pg_migrator might make that change for 
you.

> The problem on the table at the moment is that the proposed CRC
> feature will expand every page by a uniform amount - so in this case a
> fixed-space-per-page reservation utility would be completely adequate.
>  Does anyone think this is a realistic thing to backport to 8.4?
>   
I believe the main problem here is making sure that the server doesn't 
turn around and fill pages right back up again.  The logic that needs to 
show up here has two parts:

1) Don't fill new pages completely up, save the space that will be 
needed in the new version
2) Find old pages that are filled and free some space on them

The pre-upgrade utility we've been talking about does (2), and that's 
easy to imagine implementing as an add-on module rather than a 
backport.  I don't know how (1) can be done in a way such that it's 
easily backported to 8.4. 

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com



Re: Page-level version upgrade (was: Block-level CRC checks)

From
Robert Haas
Date:
On Wed, Dec 2, 2009 at 1:56 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Robert Haas wrote:
>>> Some additional catalog support was suggested to mark what the
>>> pre-upgrade utility had processed.  I'm sure I could find the
>>> messages about it again if I had to.
>> And that's a perfectly sensible solution, except that adding a catalog
>> column to 8.4 at this point would force initdb, so that's a
>> non-starter.  I suppose we could shoehorn it into the reloptions.
> There's no reason the associated catalog support had to ship with the old
> version.  You can always modify the catalog after initdb, but before running
> the pre-upgrade utility.  pg_migrator might make that change for you.

Uh, really?  I don't think that's possible at all.

>> The problem on the table at the moment is that the proposed CRC
>> feature will expand every page by a uniform amount - so in this case a
>> fixed-space-per-page reservation utility would be completely adequate.
>>  Does anyone think this is a realistic thing to backport to 8.4?
>
> I believe the main problem here is making sure that the server doesn't turn
> around and fill pages right back up again.  The logic that needs to show up
> here has two parts:
>
> 1) Don't fill new pages completely up, save the space that will be needed in
> the new version
> 2) Find old pages that are filled and free some space on them
>
> The pre-upgrade utility we've been talking about does (2), and that's easy
> to imagine implementing as an add-on module rather than a backport.  I don't
> know how (1) can be done in a way such that it's easily backported to 8.4.

Me neither.

...Robert


Re: Page-level version upgrade (was: Block-level CRC checks)

From
Greg Smith
Date:
Robert Haas wrote:
> On Wed, Dec 2, 2009 at 1:56 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> There's no reason the associated catalog support had to ship with the old
>> version.  You can always modify the catalog after initdb, but before
>> running the pre-upgrade utility.  pg_migrator might make that change for
>> you.
>
> Uh, really?  I don't think that's possible at all.

Worst case just to get this bootstrapped:  you install a new table with
the added bits.  Old version page upgrader accounts for itself there.
pg_migrator dumps that data and then loads it into its new, correct home
on the newer version.  There's already stuff like that being done
anyway--dumping things from the old catalog and inserting into the new
one--and if the origin is actually an add-on rather than an original
catalog page it doesn't really matter.  As long as the new version can
see the info it needs in its catalog, it doesn't matter how it got
there; that's the one that needs to check the migration status before it
can access things outside of the catalog.

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com

Re: Page-level version upgrade (was: Block-level CRC checks)

From
Robert Haas
Date:
On Wed, Dec 2, 2009 at 2:27 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> On Wed, Dec 2, 2009 at 1:56 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>>> There's no reason the associated catalog support had to ship with the old
>>> version.  You can always modify the catalog after initdb, but before
>>> running the pre-upgrade utility.  pg_migrator might make that change for
>>> you.
>>
>> Uh, really?  I don't think that's possible at all.
>
> Worst case just to get this bootstrapped:  you install a new table with
> the added bits.  Old version page upgrader accounts for itself there.
> pg_migrator dumps that data and then loads it into its new, correct home
> on the newer version.  There's already stuff like that being done
> anyway--dumping things from the old catalog and inserting into the new
> one--and if the origin is actually an add-on rather than an original
> catalog page it doesn't really matter.  As long as the new version can
> see the info it needs in its catalog, it doesn't matter how it got
> there; that's the one that needs to check the migration status before
> it can access things outside of the catalog.

That might work.  I think that in order to get a fixed OID for the new
catalog you would need to run a backend in bootstrap mode, which might
(not sure) require shutting down the database first.  But it sounds
doable.

There remains the issue of whether it is reasonable to think about
backpatching such a thing, and whether doing so is easier/better than
dealing with page expansion in the new server.

...Robert


Re: Page-level version upgrade (was: Block-level CRC checks)

From
Greg Stark
Date:
On Wed, Dec 2, 2009 at 6:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Also, your logic seems to presume that no backports are possible to the old
>> server.
>
> The problem on the table at the moment is that the proposed CRC
> feature will expand every page by a uniform amount - so in this case a
> fixed-space-per-page reservation utility would be completely adequate.
>  Does anyone think this is a realistic thing to backport to 8.4?

This whole discussion is based on assumptions which do not match my
recollection of the old discussion. I would suggest people go back and
read the emails, but it's clear at least some people have, and it seems
people get different things out of those old emails. My recollection
of Tom and Heikki's suggestions for Zdenek were as follows:

1) When 8.9.0 comes out we also release an 8.8.x which contains a new
guc which says to prepare for an 8.9 update. If that guc is set then
any new pages are guaranteed to have enough space for 8.9.0, which
could be as simple as guaranteeing there are x bytes of free space.
In the case of the CRC it's actually *not* a uniform amount of free
space if we go with Tom's design of having a variable chunk which moves
around, but it's still just simple arithmetic to determine if there's
enough free space on the page for a new tuple, so it would be simple
enough to backport.

2) When you want to prepare a database for upgrade you run the
precheck script which first of all makes sure you're running 8.8.x and
that the flag is set. Then it checks the free space on every page to
ensure it's satisfactory. If not then it can do a noop update to any
tuple on the page which the new free space calculation would guarantee
would go to a new page. Then you have to wait long enough and vacuum.

3) Then you run pg_migrator which swaps in the new catalog files.

4) Then you shut down and bring up 8.9.0 which on reading any page
*immediately* converts it to 8.9.0 format.

5) You would eventually also need some program which processes every
page and guarantees to write it back out in the new format. Otherwise
there will be pages that you never stop reconverting every time
they're read.
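
The guarantee in step (1) could be as small as this sketch, with
upgrade_reserve_space standing in for the hypothetical 8.8.x GUC
(PageGetFreeSpace() and MAXALIGN() are the real server macros):

    /* Refuse to place a new tuple on a page unless the headroom the
     * next major version needs would still be left over afterwards. */
    extern int  upgrade_reserve_space;  /* invented GUC, bytes; 0 = off */

    static bool
    page_leaves_upgrade_headroom(Page page, Size tuplen)
    {
        return PageGetFreeSpace(page) >=
            MAXALIGN(tuplen) + (Size) upgrade_reserve_space;
    }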

--
greg


Re: Page-level version upgrade (was: Block-level CRC checks)

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> This whole discussion is based on assumptions which do not match my
> recollection of the old discussion. I would suggest people go back and
> read the emails but it's clear at least some people have so it seems
> people get different things out of those old emails. My recollection
> of Tom and Heikki's suggestions for Zdenek were as follows:

> 1) When 8.9.0 comes out we also release an 8.8.x which contains a new
> guc which says to prepare for an 8.9 update.

Yeah, I think the critical point is not to assume that the behavior of
the old system is completely set in stone.  We can insist that you must
update to at least point release .N before beginning the migration
process.  That gives us a chance to backpatch code that makes
adjustments to the behavior of the old server, so long as the backpatch
isn't invasive enough to raise stability concerns.
        regards, tom lane


Re: Page-level version upgrade (was: Block-level CRC checks)

From
Robert Haas
Date:
On Wed, Dec 2, 2009 at 3:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <gsstark@mit.edu> writes:
>> This whole discussion is based on assumptions which do not match my
>> recollection of the old discussion. I would suggest people go back and
>> read the emails but it's clear at least some people have so it seems
>> people get different things out of those old emails. My recollection
>> of Tom and Heikki's suggestions for Zdenek were as follows:
>
>> 1) When 8.9.0 comes out we also release an 8.8.x which contains a new
>> guc which says to prepare for an 8.9 update.
>
> Yeah, I think the critical point is not to assume that the behavior of
> the old system is completely set in stone.  We can insist that you must
> update to at least point release .N before beginning the migration
> process.  That gives us a chance to backpatch code that makes
> adjustments to the behavior of the old server, so long as the backpatch
> isn't invasive enough to raise stability concerns.

If we have consensus on that approach, I'm fine with it.  I just don't
want one of the people who wants this CRC feature to go to a lot of
trouble to develop a space reservation system that has to be
backpatched to 8.4, and then have the patch rejected as too
potentially destabilizing.

...Robert


Re: Block-level CRC checks

From
"Jonah H. Harris"
Date:
On Tue, Dec 1, 2009 at 1:27 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> On Tue, 2009-12-01 at 13:20 -0500, Robert Haas wrote:
> > Does $COMPETITOR offer this feature?
>
> My understanding is that MSSQL does. I am not sure about Oracle. Those
> are the only two I run into (I don't run into MySQL at all). I know
> others likely compete in the DB2 space.

To my knowledge, MySQL, InnoDB, BerkeleyDB, solidDB, Oracle, SQL Server, Sybase, DB2, eXtremeDB, RDB, and Teradata all checksum pages.

--
Jonah H. Harris, Senior DBA
myYearbook.com

Re: Block-level CRC checks

From
decibel
Date:
On Dec 3, 2009, at 1:53 PM, Jonah H. Harris wrote:
> On Tue, Dec 1, 2009 at 1:27 PM, Joshua D. Drake  
> <jd@commandprompt.com> wrote:
> On Tue, 2009-12-01 at 13:20 -0500, Robert Haas wrote:
> > Does $COMPETITOR offer this feature?
> >
>
> My understanding is that MSSQL does. I am not sure about Oracle. Those
> are the only two I run into (I don't run into MySQL at all). I know
> others likely compete in the DB2 space.
>
> To my knowledge, MySQL, InnoDB, BerkeleyDB, solidDB, Oracle, SQL  
> Server, Sybase, DB2, eXtremeDB, RDB, and Teradata all checksum pages.


So... now that the upgrade discussion seems to have died down... was  
any consensus reached on how to do said checksumming?
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: Block-level CRC checks

From
Simon Riggs
Date:
On Fri, 2009-12-04 at 03:32 -0600, decibel wrote:

> So... now that the upgrade discussion seems to have died down... was  
> any consensus reached on how to do said checksumming?

Possibly. Please can you go through the discussion and pull out a
balanced summary of how to proceed? I lost track a while back and I'm
sure many others did also.

-- Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
"Massa, Harald Armin"
Date:
Kevin,

> md5sum of each tuple?  As an optional system column (a la oid)?

I am mainly an application programmer working with PostgreSQL. And I
want to point out an additional usefulness of an md5sum of each
tuple: it makes comparing table-contents in replicated / related
databases MUCH more feasible.

I am in the process of adding a user-space "myhash" column to all my
applications' tables, filled by a trigger on insert / update. It really
speeds up table comparison across databases, and it is very helpful
in debugging replication.

Harald


--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
%s is too gigantic of an industry to bend to the whims of reality


Re: Block-level CRC checks

From
Greg Stark
Date:
On Fri, Dec 4, 2009 at 9:34 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

> Possibly. Please can you go through the discussion and pull out a
> balanced summary of how to proceed? I lost track a while back and I'm
> sure many others did also.

I summarized the three feasible plans I think I saw;
<407d949e0912011713j63045989j67b7b343ef00c192@mail.gmail.com>

-- 
greg


Re: Block-level CRC checks

From
Bruce Momjian
Date:
decibel wrote:
> On Dec 3, 2009, at 1:53 PM, Jonah H. Harris wrote:
> > On Tue, Dec 1, 2009 at 1:27 PM, Joshua D. Drake  
> > <jd@commandprompt.com> wrote:
> > On Tue, 2009-12-01 at 13:20 -0500, Robert Haas wrote:
> > > Does $COMPETITOR offer this feature?
> > >
> >
> > My understanding is that MSSQL does. I am not sure about Oracle. Those
> > are the only two I run into (I don't run into MySQL at all). I know
> > others likely compete in the DB2 space.
> >
> > To my knowledge, MySQL, InnoDB, BerkeleyDB, solidDB, Oracle, SQL  
> > Server, Sybase, DB2, eXtremeDB, RDB, and Teradata all checksum pages.
> 
> 
> So... now that the upgrade discussion seems to have died down... was  
> any consensus reached on how to do said checksumming?

I think the hint bits have to be added to the item pointer, by using the
offset bits that are already zero, according to Greg Stark.  That
solution leads to easy programming, no expanding hint bit array, and it
is backward compatible so it doesn't cause problems for pg_migrator.
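
For reference, the line pointer being talked about is the real struct
from storage/itemid.h; with MAXALIGN at 8, tuple offsets are always
multiples of eight, so the low three bits of lp_off are currently
always zero.  The accessor macros below are invented, just to sketch
how those bits could be reclaimed:

    typedef struct ItemIdData
    {
        unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                    lp_flags:2,     /* state of item pointer */
                    lp_len:15;      /* byte length of tuple */
    } ItemIdData;

    /* Hypothetical replacements for the stock ItemIdGetOffset(): */
    #define ItemIdGetHintBits(itemId)    ((itemId)->lp_off & 0x07)
    #define ItemIdGetRealOffset(itemId)  ((itemId)->lp_off & ~0x07)
    #define ItemIdSetHintBits(itemId, h) \
        ((itemId)->lp_off = ItemIdGetRealOffset(itemId) | ((h) & 0x07))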

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  + If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
Simon Riggs
Date:
On Fri, 2009-12-04 at 07:12 -0500, Bruce Momjian wrote:

> I think the hint bits have to be added to the item pointer, by using the
> offset bits that are already zero, according to Greg Stark.  That
> solution leads to easy programming, no expanding hint bit array, and it
> is backward compatible so it doesn't cause problems for pg_migrator.

Seems like a reasonable way forward. 

As I pointed out here
http://archives.postgresql.org/pgsql-hackers/2009-12/msg00056.php
we only need to use 3 bits not 4, but it does limit tuple length to 4096
for all block sizes. (Two different options there for doing that).

An added advantage of this approach is that the cachelines for the item
pointer array will already be in CPU cache, so there is no additional
access time when we set the hint bits in their new position.

I should also point out that removing 4 bits from the tuple header would
allow us to get rid of t_infomask2, reducing tuple length by a further 2
bytes.

-- Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
Alvaro Herrera
Date:
BTW with VACUUM FULL removed I assume we're going to get rid of
HEAP_MOVED_IN and HEAP_MOVED_OFF too, right?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
Bruce Momjian
Date:
Simon Riggs wrote:
> On Fri, 2009-12-04 at 07:12 -0500, Bruce Momjian wrote:
> 
> > I think the hint bits have to be added to the item pointer, by using the
> > offset bits that are already zero, according to Greg Stark.  That
> > solution leads to easy programming, no expanding hint bit array, and it
> > is backward compatible so it doesn't cause problems for pg_migrator.
> 
> Seems like a reasonable way forward. 
> 
> As I pointed out here
> http://archives.postgresql.org/pgsql-hackers/2009-12/msg00056.php
> we only need to use 3 bits not 4, but it does limit tuple length to 4096
> for all block sizes. (Two different options there for doing that).
> 
> An added advantage of this approach is that the cachelines for the item
> pointer array will already be in CPU cache, so there is no additional
> access time when we set the hint bits when they are moved to their new
> position.
> 
> I should also point out that removing 4 bits from the tuple header would
> allow us to get rid of t_infomask2, reducing tuple length by a further 2
> bytes.

Wow, that is a nice win.  Does alignment allow us to actually use that
space?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  + If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
Simon Riggs
Date:
On Fri, 2009-12-04 at 07:54 -0500, Bruce Momjian wrote:

> > I should also point out that removing 4 bits from the tuple header would
> > allow us to get rid of t_infomask2, reducing tuple length by a further 2
> > bytes.
> 
> Wow, that is a nice win.  Does alignment allow us to actually use that
> space?

It would mean that tables up to 24 columns wide would still be 24 bytes
wide, whereas >8 columns now has to fit in 32 bytes. So in practical
terms most tables would benefit in your average database.
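
Assuming the usual numbers -- a 23-byte heap tuple header today, 21
bytes with t_infomask2 gone, one null-bitmap bit per column, and
MAXALIGN of 8 -- the arithmetic for a tuple that has nulls would be:

    today:     23 + 1 (<= 8 cols)    = 24  -> MAXALIGN -> 24 bytes
               23 + 2 (9-16 cols)    = 25  -> MAXALIGN -> 32 bytes
    proposed:  21 + 3 (<= 24 cols)   = 24  -> MAXALIGN -> 24 bytes

(Note Greg Stark's caveat downthread: natts also lives in t_infomask2,
so those two bytes cannot vanish quite that easily.)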

-- Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
Simon Riggs
Date:
On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote:

> BTW with VACUUM FULL removed I assume we're going to get rid of
> HEAP_MOVED_IN and HEAP_MOVED_OFF too, right?

Much as I would like to see those go, no. VF code should remain for some
time yet, IMHO. We could remove it, but doing so is not a priority
because it buys us nothing in terms of features, and it's the type of
thing we should do at the start of a release cycle, not the end. I
certainly don't have time to do it, at least.

-- Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote:
> 
>> BTW with VACUUM FULL removed I assume we're going to get rid of
>> HEAP_MOVED_IN and HEAP_MOVED_OFF too, right?
> 
> Much as I would like to see those go, no. VF code should remain for some
> time yet, IMHO.

I don't think we need to keep VF code otherwise, but I would leave
HEAP_MOVED_IN/OFF support alone for now for in-place upgrade. Otherwise
we need a pre-upgrade script or something to scrub them off.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Block-level CRC checks

From
Greg Stark
Date:
On Fri, Dec 4, 2009 at 12:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Fri, 2009-12-04 at 07:54 -0500, Bruce Momjian wrote:
>
>> > I should also point out that removing 4 bits from the tuple header would
>> > allow us to get rid of t_infomask2, reducing tuple length by a further 2
>> > bytes.
>>
>> Wow, that is a nice win.  Does alignment allow us to actually use that
>> space?
>
> It would mean that tables up to 24 columns wide would still be 24 bytes
> wide, whereas >8 columns now has to fit in 32 bytes. So in practical
> terms most tables would benefit in your average database.

I don't think getting rid of infomask2 wins us 2 bytes so fast. The
rest of those two bytes is natts which of course we still need.

If we lose vacuum full then the table's open for reducing the width of
command id too if we need more bits.  If we do that and we moved
everything we could to the line pointers including ctid we might just
be able to squeeze the tuple overhead down to 16 bytes. That would win
8 bytes per tuple for people with no null columns or with nulls and a
total of 9-64 columns but if they have 1-8 columns and any are null it
would actually consume more space. But it looks to me like it would be
very very tight and require drastic measures -- I think we would be
left with something like 11 bits for commandid and no spare bits in
the tuple header at all.

--
greg


Re: Block-level CRC checks

From
Greg Stark
Date:
On Fri, Dec 4, 2009 at 1:35 PM, Greg Stark <gsstark@mit.edu> wrote:
> If we lose vacuum full then the table's open for reducing the width of
> command id too if we need more bits.  If we do that and we moved
> everything we could to the line pointers including ctid we might just
> be able to squeeze the tuple overhead down to 16 bytes.

I'm not sure why I said "including ctid". We would have to move
everything transactional to the line pointer, including xmin, xmax,
ctid, all the hint bits, the updated flags, hot flags, etc. The only
things left in the tuple header would be things that have to be there
such as HAS_OIDS, HAS_NULLS, natts, hoff, etc. It would be a pretty
drastic change, though a fairly logical one. I recall someone actually
submitted a patch to separate out the transactional bits anyways a
while back, just to save a few bytes in in-memory tuples. If we could
save on disk-space usage it would be a lot more compelling. But it
doesn't look to me like it really saves enough often enough to be
worth so much code churn.

--
greg


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Greg Stark wrote:
> On Fri, Dec 4, 2009 at 1:35 PM, Greg Stark <gsstark@mit.edu> wrote:
> > If we lose vacuum full then the table's open for reducing the width of
> > command id too if we need more bits.  If we do that and we moved
> > everything we could to the line pointers including ctid we might just
> > be able to squeeze the tuple overhead down to 16 bytes.
> 
> I'm not sure why I said "including ctid". We would have to move
> everything transactional to the line pointer, including xmin, xmax,
> ctid, all the hint bits, the updated flags, hot flags, etc. The only
> things left in the tuple header would be things that have to be there
> such as HAS_OIDS, HAS_NULLS, natts, hoff, etc. It would be a pretty
> drastic change, though a fairly logical one.

Do we need XMAX_EXCL_LOCK and XMAX_SHARED_LOCK to be moved?  It seems to
me that they can stay with the tuple header because they are set by
wal-logged operations.  Same for XMAX_IS_MULTI.  The HASfoo bits are all
set on tuple creation, never touched later, so they can stay in the
header too.  We only need XMIN_COMMITTED, XMIN_INVALID, XMAX_COMMITTED,
XMAX_INVALID, HEAP_COMBOCID on the line pointer AFAICS ... oh, and
HEAP_HOT_UPDATED and HEAP_ONLY_TUPLE, not sure.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote:
> > 
> >> BTW with VACUUM FULL removed I assume we're going to get rid of
> >> HEAP_MOVED_IN and HEAP_MOVED_OFF too, right?
> > 
> > Much as I would like to see those go, no. VF code should remain for some
> > time yet, IMHO.
> 
> I don't think we need to keep VF code otherwise, but I would leave
> HEAP_MOVED_IN/OFF support alone for now for in-place upgrade. Otherwise
> we need a pre-upgrade script or something to scrub them off.

CRCs are going to need scrubbing anyway, no?  Oh, but you're assuming
that CRCs are optional, so not everybody would need that, right?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
Robert Haas
Date:
On Fri, Dec 4, 2009 at 9:48 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Heikki Linnakangas escribió:
>> Simon Riggs wrote:
>> > On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote:
>> >
>> >> BTW with VACUUM FULL removed I assume we're going to get rid of
>> >> HEAP_MOVED_IN and HEAP_MOVED_OFF too, right?
>> >
>> > Much as I would like to see those go, no. VF code should remain for some
>> > time yet, IMHO.
>>
>> I don't think we need to keep VF code otherwise, but I would leave
>> HEAP_MOVED_IN/OFF support alone for now for in-place upgrade. Otherwise
>> we need a pre-upgrade script or something to scrub them off.
>
> CRCs are going to need scrubbing anyway, no?  Oh, but you're assuming
> that CRCs are optional, so not everybody would need that, right?

If we can make not only the validity but also the presence of the CRC
field optional, it will simplify things greatly for in-place upgrade,
I think, because the upgrade won't itself require expanding the page.
Turning on the CRC functionality for a particular table may require
expanding the page, but that's a different problem.  :-)
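
As a sketch of what "presence is optional" could look like at read
time -- PD_PAGE_HAS_CRC, pd_crc and crc32_page are invented names, and
nothing about the layout is settled:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PD_PAGE_HAS_CRC     0x0008  /* hypothetical pd_flags bit */

typedef struct PageHeaderSketch
{
    uint64_t    pd_lsn;         /* stand-in for the real XLogRecPtr */
    uint16_t    pd_tli;
    uint16_t    pd_flags;
    /* ... pd_lower, pd_upper, pd_special, ... */
    uint32_t    pd_crc;         /* meaningful only if the flag is set */
} PageHeaderSketch;

extern uint32_t crc32_page(const char *page, size_t len);

/*
 * Pages migrated in place from 8.3/8.4 simply never have the flag set,
 * so they verify trivially and need no rewrite at upgrade time.
 */
static bool
page_crc_ok(const PageHeaderSketch *hdr, const char *page, size_t len)
{
    if ((hdr->pd_flags & PD_PAGE_HAS_CRC) == 0)
        return true;            /* no CRC present: nothing to check */
    return hdr->pd_crc == crc32_page(page, len);
}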

Have we thought about what other things have changed between 8.4 and
8.5 that might cause problems for in-place upgrade?

...Robert


Re: Block-level CRC checks

From
Tom Lane
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:
> As I pointed out here
> http://archives.postgresql.org/pgsql-hackers/2009-12/msg00056.php
> we only need to use 3 bits not 4, but it does limit tuple length to 4096
> for all block sizes. (Two different options there for doing that).

Limiting the tuple length is a deal-breaker.
        regards, tom lane


Re: Block-level CRC checks

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> I'm not sure why I said "including ctid". We would have to move
> everything transactional to the line pointer, including xmin, xmax,
> ctid, all the hint bits, the updated flags, hot flags, etc. The only
> things left in the tuple header would be things that have to be there
> such as HAS_OIDS, HAS_NULLS, natts, hoff, etc. It would be a pretty
> drastic change, though a fairly logical one. I recall someone actually
> submitted a patch to separate out the transactional bits anyways a
> while back, just to save a few bytes in in-memory tuples. If we could
> save on disk-space usage it would be a lot more compelling. But it
> doesn't look to me like it really saves enough often enough to be
> worth so much code churn.

It would also break things for indexes, which don't need all that stuff
in their line pointers.

More to the point, moving the same bits to someplace else on the page
doesn't save anything at all.
        regards, tom lane


Re: Block-level CRC checks

From
Simon Riggs
Date:
On Fri, 2009-12-04 at 10:43 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > As I pointed out here
> > http://archives.postgresql.org/pgsql-hackers/2009-12/msg00056.php
> > we only need to use 3 bits not 4, but it does limit tuple length to 4096
> > for all block sizes. (Two different options there for doing that).
> 
> Limiting the tuple length is a deal-breaker.

If people who use 32kB block sizes exist in practice, I note that,
because tuples are at least 4-byte aligned, the low 2 bits of the
length are always unused. So they're available for those with strangely
long tuples, and can be used to signify high-order bytes, so the max
tuple length could be 16384. With tuples that long, it would be better
to assume 8-byte minimum alignment, which would put the max tuple
length back up to 32KB again. None of that need affect people with a
standard 8192-byte block size.
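
A sketch of the encoding, with invented names: round the stored length
up to the alignment unit and store it pre-shifted, so a 12-bit field
reaches ~16KB at 4-byte alignment (shift 2) and ~32KB at 8-byte
alignment (shift 3):

#include <assert.h>
#include <stdint.h>

#define ALIGN_UP(len, a)    (((len) + (a) - 1) & ~((uint32_t) (a) - 1))
#define LEN_FIELD_BITS      12      /* 15 bits minus the 3 borrowed */

static inline uint16_t
pack_len(uint32_t len, unsigned shift)      /* shift = 2 or 3 */
{
    uint32_t    aligned = ALIGN_UP(len, 1U << shift);

    assert((aligned >> shift) < (1U << LEN_FIELD_BITS));
    return (uint16_t) (aligned >> shift);
}

static inline uint32_t
unpack_len(uint16_t packed, unsigned shift)
{
    /* Recovers the aligned length, which is what the tuple occupies. */
    return (uint32_t) packed << shift;
}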

-- Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Have we thought about what other things have changed between 8.4 and
> 8.5 that might cause problems for in-place upgrade?

So far, nothing.  We even made Andrew Gierth jump through hoops to
keep hstore's on-disk representation upwards compatible.
        regards, tom lane


Re: Block-level CRC checks

From
Simon Riggs
Date:
On Fri, 2009-12-04 at 13:35 +0000, Greg Stark wrote:

> I don't think getting rid of infomask2 wins us 2 bytes so fast. The
> rest of those two bytes is natts which of course we still need.

err, yes, OK.

-- Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
Bruce Momjian
Date:
Robert Haas wrote:
> On Fri, Dec 4, 2009 at 9:48 AM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
> > Heikki Linnakangas escribió:
> >> Simon Riggs wrote:
> >> > On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote:
> >> >
> >> >> BTW with VACUUM FULL removed I assume we're going to get rid of
> >> >> HEAP_MOVED_IN and HEAP_MOVED_OFF too, right?
> >> >
> >> > Much as I would like to see those go, no. VF code should remain for some
> >> > time yet, IMHO.
> >>
> >> I don't think we need to keep VF code otherwise, but I would leave
> >> HEAP_MOVED_IN/OFF support alone for now for in-place upgrade. Otherwise
> >> we need a pre-upgrade script or something to scrub them off.
> >
> > CRCs are going to need scrubbing anyway, no?  Oh, but you're assuming
> > that CRCs are optional, so not everybody would need that, right?
> 
> If we can make not only the validity but also the presence of the CRC
> field optional, it will simplify things greatly for in-place upgrade,
> I think, because the upgrade won't itself require expanding the page.
> Turning on the CRC functionality for a particular table may require
> expanding the page, but that's a different problem.  :-)

Well, I am not sure how we would turn the _space_ used for CRC on and
off because you would have to rewrite the entire table/database to turn
it on, which seems unfortunate.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
Robert Haas
Date:
On Fri, Dec 4, 2009 at 2:04 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Robert Haas wrote:
>> On Fri, Dec 4, 2009 at 9:48 AM, Alvaro Herrera
>> <alvherre@commandprompt.com> wrote:
>> > Heikki Linnakangas escribió:
>> >> Simon Riggs wrote:
>> >> > On Fri, 2009-12-04 at 09:52 -0300, Alvaro Herrera wrote:
>> >> >
>> >> >> BTW with VACUUM FULL removed I assume we're going to get rid of
>> >> >> HEAP_MOVED_IN and HEAP_MOVED_OFF too, right?
>> >> >
>> >> > Much as I would like to see those go, no. VF code should remain for some
>> >> > time yet, IMHO.
>> >>
>> >> I don't think we need to keep VF code otherwise, but I would leave
>> >> HEAP_MOVED_IN/OFF support alone for now for in-place upgrade. Otherwise
>> >> we need a pre-upgrade script or something to scrub them off.
>> >
>> > CRCs are going to need scrubbing anyway, no?  Oh, but you're assuming
>> > that CRCs are optional, so not everybody would need that, right?
>>
>> If we can make not only the validity but also the presence of the CRC
>> field optional, it will simplify things greatly for in-place upgrade,
>> I think, because the upgrade won't itself require expanding the page.
>> Turning on the CRC functionality for a particular table may require
>> expanding the page, but that's a different problem.  :-)
>
> Well, I am not sure how we would turn the _space_ used for CRC on and
> off because you would have to rewrite the entire table/database to turn
> it on, which seems unfortunate.

Well, presumably you're going to have to do some of that work anyway,
because even if the space is set aside you're still going to have to
read the page in, CRC it, and write it back out.  However if the space
is not pre-allocated then you also have to deal with moving tuples to
other pages.  But that problem is going to have to be dealt with
somewhere along the line no matter what we do, because if you're
upgrading an 8.3 or 8.4 system to 8.5, you need to add that space
sometime: either before migration (with a pre-upgrade utility), or
after migration (by some sort of page converter/tuple mover), or only
when/if enabling the CRC feature.

One nice thing about making it the CRC feature's problem to make space
on each page is that people who don't want to use CRCs can still use
those extra 4 bytes/page for data.  That might not be worth the code
complexity if we were starting from scratch, but I'm thinking that
most of the code complexity is a given if we want to also support
in-place upgrade.

...Robert


Re: Block-level CRC checks

From
Alvaro Herrera
Date:
Massa, Harald Armin wrote:

> I am in the process of adding a user-space "myhash" column to all my
> applications' tables, filled by a trigger on insert / update. It really
> speeds up table comparison across databases; and it is very helpful
> in debugging replications.

Have you seen pg_comparator?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Block-level CRC checks

From
Bruce Momjian
Date:
Robert Haas wrote:
> > Well, I am not sure how we would turn the _space_ used for CRC on and
> > off because you would have to rewrite the entire table/database to turn
> > it on, which seems unfortunate.
> 
> Well, presumably you're going to have to do some of that work anyway,
> because even if the space is set aside you're still going to have to
> read the page in, CRC it, and write it back out.  However if the space
> is not pre-allocated then you also have to deal with moving tuples to
> other pages.  But that problem is going to have to be dealt with
> somewhere along the line no matter what we do, because if you're
> upgrading an 8.3 or 8.4 system to 8.5, you need to add that space
> sometime: either before migration (with a pre-upgrade utility), or
> after migration (by some sort of page converter/tuple mover), or only
> when/if enabling the CRC feature.
> 
> One nice thing about making it the CRC feature's problem to make space
> on each page is that people who don't want to use CRCs can still use
> those extra 4 bytes/page for data.  That might not be worth the code
> complexity if we were starting from scratch, but I'm thinking that
> most of the code complexity is a given if we want to also support
> in-place upgrade.

My guess is we can find somewhere on an 8.4 heap/index page to add four
bytes.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Block-level CRC checks

From
Chuck McDevitt
Date:
A curiosity question regarding torn pages:  How does this work on file
systems that don't write in-place, but instead always do copy-on-write?

My example would be Sun's ZFS file system (In Solaris & BSD).  Because of
its "snapshot & rollback" functionality, it never writes a page in-place,
but instead always copies it to another place on disk.  How does this
affect the corruption caused by a torn write?

Can we end up with horrible corruption on this type of filesystem where
we wouldn't on normal file systems, where we are writing to a previously
zeroed area on disk?

Sorry if this is a stupid question... Hopefully somebody can reassure me that this isn't an issue.


Re: Block-level CRC checks

From
Simon Riggs
Date:
On Fri, 2009-12-04 at 14:47 -0800, Chuck McDevitt wrote:
> A curiosity question regarding torn pages:  How does this work on file
> systems that don't write in-place, but instead always do
> copy-on-write?
> 
> My example would be Sun's ZFS file system (In Solaris & BSD).  Because
> of its "snapshot & rollback" functionality, it never writes a page
> in-place, but instead always copies it to another place on disk.  How
> does this affect the corruption caused by a torn write?
> 
> Can we end up with horrible corruption on this type of filesystem
> where we wouldn't on normal file systems, where we are writing to a
> previously zeroed area on disk?
> 
> Sorry if this is a stupid question... Hopefully somebody can reassure
> me that this isn't an issue.

Think we're still good. Not a stupid question.

Hint bits are set while the block is in shared_buffers and setting a
hint bit dirties the page, but does not write WAL.

Because the page is dirty, we re-write the whole block at checkpoint,
via bgwriter cleaning, or via dirty-page eviction. So ZFS is OK, but we
sometimes do more writing than we want to.
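
Schematically, the path is something like this -- a trimmed-down sketch
of 8.4's tqual.c hint-bit path; SetBufferCommitInfoNeedsSave() is the
real buffer-manager entry point, the types here are simplified
stand-ins:

#include <stdint.h>

typedef int Buffer;
typedef struct
{
    uint16_t    t_infomask;     /* hint bits live here */
    /* ... */
} HeapTupleHeaderSketch;

/* Real 8.4 bufmgr call: marks the buffer dirty WITHOUT writing WAL. */
extern void SetBufferCommitInfoNeedsSave(Buffer buffer);

/*
 * Simplified sketch: the hint bit is pure cache (recomputable from
 * pg_clog), so it is safe to lose -- but setting it dirties the whole
 * block, which later gets rewritten in full and COW'd again by ZFS.
 */
static void
set_hint_bits(HeapTupleHeaderSketch *tuple, Buffer buf, uint16_t infomask)
{
    tuple->t_infomask |= infomask;
    SetBufferCommitInfoNeedsSave(buf);
}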

-- Simon Riggs           www.2ndQuadrant.com



Re: Block-level CRC checks

From
"Massa, Harald Armin"
Date:
>> I am in the process of adding a user-space "myhash" column to all my
>> applications' tables, filled by a trigger on insert / update. It really
>> speeds up table comparison across databases; and it is very helpful
>> in debugging replications.
>
> Have you seen pg_comparator?

Yes, I saw the lightning talk at pgday.eu.
It also uses MD5 hashes, just in a schema of its own. I guess
pg_comparator would profit from an integrated MD5 hash.

Harald


--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
%s is too gigantic of an industry to bend to the whims of reality


Re: Block-level CRC checks

From
Greg Stark
Date:
It can save space because the line pointers have looser alignment
requirements, but in the current state I don't see any point.

-- 
Greg

On 2009-12-04, at 3:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Greg Stark <gsstark@mit.edu> writes:
>> I'm not sure why I said "including ctid". We would have to move
>> everything transactional to the line pointer, including xmin, xmax,
>> ctid, all the hint bits, the updated flags, hot flags, etc. The only
>> things left in the tuple header would be things that have to be there
>> such as HAS_OIDS, HAS_NULLS, natts, hoff, etc. It would be a pretty
>> drastic change, though a fairly logical one. I recall someone actually
>> submitted a patch to separate out the transactional bits anyways a
>> while back, just to save a few bytes in in-memory tuples. If we could
>> save on disk-space usage it would be a lot more compelling. But it
>> doesn't look to me like it really saves enough often enough to be
>> worth so much code churn.
>
> It would also break things for indexes, which don't need all that stuff
> in their line pointers.
>
> More to the point, moving the same bits to someplace else on the page
> doesn't save anything at all.
>
>            regards, tom lane


Re: Block-level CRC checks

From
Greg Stark
Date:
On Fri, Dec 4, 2009 at 10:47 PM, Chuck McDevitt <cmcdevitt@greenplum.com> wrote:
> A curiosity question regarding torn pages:  How does this work on file
> systems that don't write in-place, but instead always do copy-on-write?
>
> My example would be Sun's ZFS file system (In Solaris & BSD).  Because of
> its "snapshot & rollback" functionality, it never writes a page in-place,
> but instead always copies it to another place on disk.  How does this
> affect the corruption caused by a torn write?
>
> Can we end up with horrible corruption on this type of filesystem where
> we wouldn't on normal file systems, where we are writing to a previously
> zeroed area on disk?
>
> Sorry if this is a stupid question... Hopefully somebody can reassure me
> that this isn't an issue.

It's not a stupid question. We're not 100% sure, but we believe ZFS
doesn't need full-page writes because it's immune to torn pages.

I think the idea of ZFS is that the new partially written page isn't
visible because it's not linked into the tree until it's been
completely written. To me it appears this would depend on the drive
system ordering writes very strictly, which seems hard to be sure is
actually happening. Perhaps this is tied to the tricks they do to avoid
contention on the root: if they do a write barrier before every root
update, that seems like it should be sufficient to me, but I don't know
it at that level of detail.
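
My mental model, as pure pseudocode -- this is just how I read the
claim, not ZFS source; every name here is an invented stand-in:

typedef struct Block Block;

extern Block *alloc_new_location(void);
extern void write_block(Block *where, const Block *data);
extern void write_updated_indirect_blocks(void);
extern void write_barrier(void);                /* drain device caches */
extern void write_uberblock_for_new_tree(void); /* the atomic root flip */

static void
cow_update(const Block *new_data)
{
    write_block(alloc_new_location(), new_data); /* old copy untouched */
    write_updated_indirect_blocks();             /* also copy-on-write */
    write_barrier();        /* everything below must be durable first */
    write_uberblock_for_new_tree();     /* a torn write here loses the
                                         * update, but never exposes a
                                         * half-written page */
}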

--
greg