Thread: AW: beta testing version

AW: beta testing version

From
Zeugswetter Andreas SB
Date:
> > > Sounds great! We can follow this way: when the first update to a
> > > page after the last checkpoint is being logged, the XLOG code can
> > > log not the AM-specific update record but the entire page (creating
> > > a backup "physical log"). During after-crash recovery such pages
> > > will be redone first, ensuring page consistency for further redo
> > > ops. This means a bigger log, of course.
> >
> > Be sure to include a CRC of each part of the block that you hope
> > to replay individually.
>
> Why should we do this? I'm not going to replay parts individually,
> I'm going to write entire pages to OS cache and then apply changes to
> them. Recovery is considered successful once the server has ensured
> that all applied changes are on disk. In the case of a crash during
> recovery we'll replay the entire game.

Yes, but there would need to be a way to verify the last page or record from
txlog when running on crap hardware. The point was that crap hardware writes
our 8k pages in any order (e.g. 512 bytes from the end, then 512 bytes from
the front ...), and does not even notice, when reading it back after a crash,
that it only wrote part of one such 512-byte block. But I actually doubt that
this is true for all but the most crappy hardware.
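
For illustration, here is a minimal C sketch of the full-page-image idea
quoted above (all names are invented; this is not the real XLOG code): on
the first change to a page after a checkpoint, the whole 8k image goes into
the log, so a torn page can be restored to a consistent state before the
ordinary redo records are applied.

/*
 * Sketch only: log a full page image on the first modification of a page
 * after a checkpoint, so that a torn 8k write can be repaired from the log
 * before the smaller record-level changes are replayed.  All names here
 * (Page, xlog_append, ...) are invented for illustration.
 */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192

static uint64_t next_lsn = 1;
static uint64_t last_checkpoint_lsn = 0;

typedef struct {
    uint64_t page_lsn;          /* LSN of the last record that touched this page */
    char     data[PAGE_SIZE - sizeof(uint64_t)];
} Page;

/* Stub: append a record to the log and return the LSN assigned to it. */
static uint64_t
xlog_append(const void *rec, size_t len)
{
    (void) rec;
    (void) len;
    return next_lsn++;
}

/*
 * Log a change to a page.  On the first change since the last checkpoint,
 * also log a backup copy of the entire page (the "physical log"), so that
 * recovery can restore a consistent image before replaying the compact
 * AM-specific records against it.
 */
static void
log_page_update(Page *page, const void *am_record, size_t am_len)
{
    if (page->page_lsn <= last_checkpoint_lsn)
        page->page_lsn = xlog_append(page, PAGE_SIZE);   /* full backup image */
    page->page_lsn = xlog_append(am_record, am_len);     /* normal logical record */
}

int main(void)
{
    static Page p;                          /* zero LSN: untouched since startup */
    const char rec[] = "heap update";

    log_page_update(&p, rec, sizeof rec);   /* logs full image + record */
    log_page_update(&p, rec, sizeof rec);   /* logs only the record */
    return 0;
}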

Andreas


Re: AW: beta testing version

From
Tom Lane
Date:
Zeugswetter Andreas SB <ZeugswetterA@Wien.Spardat.at> writes:
> Yes, but there would need to be a way to verify the last page or
> record from txlog when running on crap hardware.

How exactly *do* we determine where the end of the valid log data is,
anyway?
        regards, tom lane


Re: AW: beta testing version

From
Daniele Orlandi
Date:
Tom Lane wrote:
> 
> Zeugswetter Andreas SB <ZeugswetterA@Wien.Spardat.at> writes:
> > Yes, but there would need to be a way to verify the last page or
> > record from txlog when running on crap hardware.
> 
> How exactly *do* we determine where the end of the valid log data is,
> anyway?

Couldn't you use a CRC ?

Anyway... may I suggest adding CRCs to the data ? I just discovered that
I had a faulty HD controller and I fear that something could have been
written erroneously (this could also help to detect faulty memory,
though only in certain cases).

Bye!

--
Daniele Orlandi
Planet Srl


Re: AW: beta testing version

From
Bruce Guenter
Date:
On Wed, Dec 06, 2000 at 11:15:26AM -0500, Tom Lane wrote:
> Zeugswetter Andreas SB <ZeugswetterA@Wien.Spardat.at> writes:
> > Yes, but there would need to be a way to verify the last page or
> > record from txlog when running on crap hardware.
> How exactly *do* we determine where the end of the valid log data is,
> anyway?

I don't know how pgsql does it, but the only safe way I know of is to
include an "end" marker after each record.  When writing to the log,
append the records after the last end marker, ending with another end
marker, and fdatasync the log.  Then overwrite the previous end marker
to indicate it's not the end of the log any more and fdatasync again.

To ensure that it is written atomically, the end marker must not cross a
hardware sector boundary (typically 512 bytes).  This can be trivially
guaranteed by making the marker a single byte.
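
As a rough sketch of that sequence in C (the file layout, marker values and
function names are invented for this example, and error handling is omitted):

/*
 * Sketch of the end-marker protocol described above.  The marker is a
 * single byte, so it can never straddle a 512-byte sector boundary.
 */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define END_MARK  0xFF            /* "this is the end of the valid log" */
#define MORE_MARK 0x00            /* "valid records continue past this point" */

/*
 * Append a batch of records after the current end marker at offset end_pos.
 * Returns the offset of the new end marker.
 */
static off_t
log_append(int fd, off_t end_pos, const void *records, size_t len)
{
    uint8_t mark;

    /* 1. Write the records after the old end marker, terminate them with a
     *    fresh end marker, and force everything to disk. */
    pwrite(fd, records, len, end_pos + 1);
    mark = END_MARK;
    pwrite(fd, &mark, 1, end_pos + 1 + (off_t) len);
    fdatasync(fd);

    /* 2. Only now flip the old end marker, telling readers that the data
     *    following it is valid, and force that single byte to disk too. */
    mark = MORE_MARK;
    pwrite(fd, &mark, 1, end_pos);
    fdatasync(fd);

    return end_pos + 1 + (off_t) len;
}

int main(void)
{
    int     fd = open("example.log", O_RDWR | O_CREAT | O_TRUNC, 0600);
    uint8_t mark = END_MARK;
    off_t   end;

    pwrite(fd, &mark, 1, 0);                 /* empty log: just an end marker */
    fdatasync(fd);
    end = log_append(fd, 0, "record-1", 8);
    log_append(fd, end, "record-2", 8);
    close(fd);
    return 0;
}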

Any other way I've seen discussed (here and elsewhere) either
- Requires atomic multi-sector writes, which are possible only if all
  the sectors are sequential on disk, the kernel issues one large write
  for all of them, and you don't powerfail in the middle of the write.
- Assume that a CRC is a guarantee.  A CRC would be a good addition to
  help ensure the data wasn't broken by flakey drive firmware, but
  doesn't guarantee consistency.

--
Bruce Guenter <bruceg@em.ca>                       http://em.ca/~bruceg/

CRCs (was: beta testing version)

From
ncm@zembu.com (Nathan Myers)
Date:
On Wed, Dec 06, 2000 at 12:29:00PM +0100, Zeugswetter Andreas SB wrote:
> 
> > Why should we do this? I'm not going to replay parts individually,
> > I'm going to write entire pages to OS cache and then apply changes
> > to them. Recovery is considered successful once the server has
> > ensured that all applied changes are on disk. In the case of a crash
> > during recovery we'll replay the entire game.
>
> Yes, but there would need to be a way to verify the last page or
> record from txlog when running on crap hardware. The point was that
> crap hardware writes our 8k pages in any order (e.g. 512 bytes from
> the end, then 512 bytes from the front ...), and does not even notice,
> when reading it back after a crash, that it only wrote part of one
> such 512-byte block. But I actually doubt that this is true for all
> but the most crappy hardware.

By this standard all hardware is crap.  The behavior Andreas describes 
as "crappy" is the normal behavior of almost all drives in production, 
including the ones in your machine.

Furthermore, OSes re-order "atomic" writes into file systems (i.e.  
not raw partitions) to match partition block order, which often doesn't 
match the file block order.  Hence, the OSes are "crappy" too.

Wishful thinking is a poor substitute for real atomicity.  Block
CRCs can at least verify complete writes to reasonable confidence, 
if not ensure them.

Nathan Myers
ncm@zembu.com



CRCs (was: beta testing version)

From
ncm@zembu.com (Nathan Myers)
Date:
On Wed, Dec 06, 2000 at 11:49:10AM -0600, Bruce Guenter wrote:
> On Wed, Dec 06, 2000 at 11:15:26AM -0500, Tom Lane wrote:
> > Zeugswetter Andreas SB <ZeugswetterA@Wien.Spardat.at> writes:
> > > Yes, but there would need to be a way to verify the last page or
> > > record from txlog when running on crap hardware.
> >
> > How exactly *do* we determine where the end of the valid log data is,
> > anyway?
> 
> I don't know how pgsql does it, but the only safe way I know of is to
> include an "end" marker after each record.  When writing to the log,
> append the records after the last end marker, ending with another end
> marker, and fdatasync the log.  Then overwrite the previous end marker
> to indicate it's not the end of the log any more and fdatasync again.
>
> To ensure that it is written atomically, the end marker must not cross a
> hardware sector boundary (typically 512 bytes).  This can be trivially
> guaranteed by making the marker a single byte.

An "end" marker is not sufficient, unless all writes are done in
one-sector units with an fsync between, and the drive buffering 
is turned off.  For larger writes the OS will re-order the writes.  
Most drives will re-order them too, even if the OS doesn't.

> Any other way I've seen discussed (here and elsewhere) either
> - Requires atomic multi-sector writes, which are possible only if all
>   the sectors are sequential on disk, the kernel issues one large write
>   for all of them, and you don't powerfail in the middle of the write.
> - Assume that a CRC is a guarantee.  

We are already assuming a CRC is a guarantee.  

The drive computes a CRC for each sector, and if the CRC is OK the 
drive is happy.  CRC errors within the drive are quite frequent, and 
the drive re-reads when a bad CRC comes up.  (If it sees errors too 
frequently on a sector, it rewrites it; if it sees persistent errors 
on a sector, it marks that one bad and relocates it.)  You can expect 
to experience, in production, about the error rate that the drive 
manufacturer specifies as "maximum".

>   ... A CRC would be a good addition to
>   help ensure the data wasn't broken by flakey drive firmware, but
>   doesn't guarantee consistency.

No, a CRC would be a good addition to compensate for sector write
reordering, which is done both by the OS and by the drive, even for 
"atomic" writes.

It is not only "flaky" or "cheap" drives that re-order writes, or
acknowledge writes as complete that are not yet on disk.  You
can generally assume that *any* drive does it unless you have 
specifically turned that off.  The assumption is that if you care,
you have a UPS, or at least have configured the hardware yourself
to meet your needs.

It is purely wishful thinking to believe otherwise.

Nathan Myers
ncm@zembu.com


Re: AW: beta testing version

From
Daniele Orlandi
Date:
Bruce Guenter wrote:
> 
> - Assume that a CRC is a guarantee.  A CRC would be a good addition to
>   help ensure the data wasn't broken by flakey drive firmware, but
>   doesn't guarantee consistency.

Even a CRC per transaction (it could be a nice END record) ?

Bye!

-- Daniele

-------------------------------------------------------------------------------
Daniele Orlandi - Utility Line Italia - http://www.orlandi.com
Via Mezzera 29/A - 20030 - Seveso (MI) - Italy
-------------------------------------------------------------------------------


Re: AW: beta testing version

From
Bruce Guenter
Date:
On Wed, Dec 06, 2000 at 11:13:33PM +0000, Daniele Orlandi wrote:
> Bruce Guenter wrote:
> > - Assume that a CRC is a guarantee.  A CRC would be a good addition to
> >   help ensure the data wasn't broken by flakey drive firmware, but
> >   doesn't guarantee consistency.
> Even a CRC per transaction (it could be a nice END record) ?

CRCs are designed to catch N-bit errors (ie N bits in a row with their
values flipped).  N is (IIRC) the number of bits in the CRC minus one.
So, a 32-bit CRC can catch all 31-bit errors.  That's the only guarantee
a CRC gives.  Everything else has a 1 in 2^32-1 chance of producing the
same CRC as the original data.  That's pretty good odds, but not a
guarantee.
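
For reference, a bit-at-a-time CRC-32 is only a few lines of C (shown here
with the common reflected polynomial 0xEDB88320); the point is that matching
CRCs make undetected corruption very unlikely, not impossible:

/*
 * Minimal bit-at-a-time CRC-32, just to show the computation; real
 * implementations are table-driven or use hardware support.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t
crc32(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t       crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
    }
    return ~crc;
}

int main(void)
{
    const char *msg = "123456789";

    /* The standard check value for CRC-32 over "123456789" is 0xCBF43926. */
    printf("crc32(\"%s\") = %08X\n", msg, (unsigned) crc32(msg, strlen(msg)));
    return 0;
}
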
--
Bruce Guenter <bruceg@em.ca>                       http://em.ca/~bruceg/

Re: CRCs (was: beta testing version)

From
Bruce Guenter
Date:
On Wed, Dec 06, 2000 at 11:08:00AM -0800, Nathan Myers wrote:
> On Wed, Dec 06, 2000 at 11:49:10AM -0600, Bruce Guenter wrote:
> > On Wed, Dec 06, 2000 at 11:15:26AM -0500, Tom Lane wrote:
> > > How exactly *do* we determine where the end of the valid log data is,
> > > anyway?
> >
> > I don't know how pgsql does it, but the only safe way I know of is to
> > include an "end" marker after each record.  When writing to the log,
> > append the records after the last end marker, ending with another end
> > marker, and fdatasync the log.  Then overwrite the previous end marker
> > to indicate it's not the end of the log any more and fdatasync again.
> >
> > To ensure that it is written atomically, the end marker must not cross a
> > hardware sector boundary (typically 512 bytes).  This can be trivially
> > guaranteed by making the marker a single byte.
>
> An "end" marker is not sufficient, unless all writes are done in
> one-sector units with an fsync between, and the drive buffering
> is turned off.

That's why an end marker must follow all valid records.  When you write
records, you don't touch the marker, and add an end marker to the end of
the records you've written.  After writing and syncing the records, you
rewrite the end marker to indicate that the data following it is valid,
and sync again.  There is no state in that sequence in which partially-
written data could be confused as real data, assuming either your drives
aren't doing write-back caching or you have a UPS, and fsync doesn't
return until the drives return success.

> For larger writes the OS will re-order the writes.
> Most drives will re-order them too, even if the OS doesn't.

I'm well aware of that.

> > Any other way I've seen discussed (here and elsewhere) either
> > - Assume that a CRC is a guarantee.
>
> We are already assuming a CRC is a guarantee.
>
> The drive computes a CRC for each sector, and if the CRC is OK the
> drive is happy.  CRC errors within the drive are quite frequent, and
> the drive re-reads when a bad CRC comes up.

The kind of data failures that a CRC is guaranteed to catch (N-bit
errors) are almost precisely those that a mis-read on a hardware sector
would cause.

> >   ... A CRC would be a good addition to
> >   help ensure the data wasn't broken by flakey drive firmware, but
> >   doesn't guarantee consistency.
> No, a CRC would be a good addition to compensate for sector write
> reordering, which is done both by the OS and by the drive, even for
> "atomic" writes.

But it doesn't guarantee consistency, even in that case.  There is a
possibility (however small) that the random data that was located in the
sectors before the write will match the CRC.
--
Bruce Guenter <bruceg@em.ca>                       http://em.ca/~bruceg/

RE: AW: beta testing version

From
"Christopher Kings-Lynne"
Date:
> CRCs are designed to catch N-bit errors (ie N bits in a row with their
> values flipped).  N is (IIRC) the number of bits in the CRC minus one.
> So, a 32-bit CRC can catch all 31-bit errors.  That's the only guarantee
> a CRC gives.  Everything else has a 1 in 2^32-1 chance of producing the
> same CRC as the original data.  That's pretty good odds, but not a
> guarantee.

You've got a higher chance of undetected hard drive errors, memory errors,
solar flares, etc. than a CRC of that quality failing...

Chris



RE: AW: beta testing version

From
"Mikheev, Vadim"
Date:
> > CRCs are designed to catch N-bit errors (ie N bits in a row with
> > their values flipped).  N is (IIRC) the number of bits in the CRC
> > minus one.  So, a 32-bit CRC can catch all 31-bit errors.  That's
> > the only guarantee a CRC gives.  Everything else has a 1 in 2^32-1
> > chance of producing the same CRC as the original data.  That's
> > pretty good odds, but not a guarantee.
> 
> You've got a higher chance of undetected hard drive errors, memory
> errors, solar flares, etc. than a CRC of that quality failing...

Also, how long is the CRC in TCP/IP packets? => there is always a
risk that the backend will commit something other than what you sent
to it.
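
(For the record, TCP doesn't use a CRC at all, only a 16-bit
ones'-complement checksum; the Ethernet layer below adds a CRC-32. A
minimal sketch of that checksum computation, pseudo-header omitted:)

/* The 16-bit ones'-complement Internet checksum used by TCP and UDP. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint16_t
inet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t       sum = 0;

    while (len > 1) {                          /* add up 16-bit words */
        sum += ((uint32_t) p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len)                                   /* pad an odd trailing byte */
        sum += (uint32_t) p[0] << 8;
    while (sum >> 16)                          /* fold carries (ones' complement) */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t) ~sum;
}

int main(void)
{
    const char *segment = "example TCP payload";

    printf("checksum = %04X\n", (unsigned) inet_checksum(segment, strlen(segment)));
    return 0;
}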

Vadim


Re: CRCs (was: beta testing version)

From
ncm@zembu.com (Nathan Myers)
Date:
On Wed, Dec 06, 2000 at 06:53:37PM -0600, Bruce Guenter wrote:
> On Wed, Dec 06, 2000 at 11:08:00AM -0800, Nathan Myers wrote:
> > On Wed, Dec 06, 2000 at 11:49:10AM -0600, Bruce Guenter wrote:
> > > 
> > > I don't know how pgsql does it, but the only safe way I know of
> > > is to include an "end" marker after each record.
> > 
> > An "end" marker is not sufficient, unless all writes are done in
> > one-sector units with an fsync between, and the drive buffering 
> > is turned off.
> 
> That's why an end marker must follow all valid records.  When you write
> records, you don't touch the marker, and add an end marker to the end of
> the records you've written.  After writing and syncing the records, you
> rewrite the end marker to indicate that the data following it is valid,
> and sync again.  There is no state in that sequence in which partially-
> written data could be confused as real data, assuming either your drives
> aren't doing write-back caching or you have a UPS, and fsync doesn't
> return until the drives return success.

That requires an extra out-of-sequence write. 

> > > Any other way I've seen discussed (here and elsewhere) either
> > > - Assume that a CRC is a guarantee.  
> > 
> > We are already assuming a CRC is a guarantee.  
> >
> > The drive computes a CRC for each sector, and if the CRC is OK the 
> > drive is happy.  CRC errors within the drive are quite frequent, and 
> > the drive re-reads when a bad CRC comes up.
> 
> The kind of data failures that a CRC is guaranteed to catch (N-bit
> errors) are almost precisely those that a mis-read on a hardware sector
> would cause.

They catch a single mis-read, but not necessarily the quite likely
double mis-read.

> > >   ... A CRC would be a good addition to
> > >   help ensure the data wasn't broken by flakey drive firmware, but
> > >   doesn't guarantee consistency.
> > No, a CRC would be a good addition to compensate for sector write
> > reordering, which is done both by the OS and by the drive, even for 
> > "atomic" writes.
> 
> But it doesn't guarantee consistency, even in that case.  There is a
> possibility (however small) that the random data that was located in 
> the sectors before the write will match the CRC.

Generally, there are no guarantees, only reasonable expectations.  A 
64-bit CRC would give sufficient confidence without the out-of-sequence
write, and also detect corruption from any source including power outage.
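
A sketch of how per-record CRCs could locate the end of the valid log
without that extra write (the record layout is invented and a 32-bit CRC is
used for brevity; this is not the actual pgsql WAL format):

/*
 * Sketch: give every log record its own CRC, and at recovery time treat the
 * first record that is truncated or fails its CRC as the end of the valid
 * log.  Everything before that offset is replayed, the rest is discarded.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t len;                 /* length of the payload that follows */
    uint32_t crc;                 /* CRC of the payload */
} RecHeader;

/* Same bit-at-a-time CRC-32 as in the earlier sketch. */
static uint32_t
crc32(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t       crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
    }
    return ~crc;
}

/*
 * Scan the log from the start; the valid log ends at the first record that
 * is truncated or whose CRC does not match (e.g. a torn or reordered write).
 */
static size_t
find_end_of_valid_log(const uint8_t *log, size_t log_size)
{
    size_t pos = 0;

    while (pos + sizeof(RecHeader) <= log_size) {
        RecHeader hdr;

        memcpy(&hdr, log + pos, sizeof hdr);
        if (hdr.len == 0 || hdr.len > log_size - pos - sizeof hdr)
            break;                                     /* truncated record */
        if (crc32(log + pos + sizeof hdr, hdr.len) != hdr.crc)
            break;                                     /* corrupted record */
        pos += sizeof hdr + hdr.len;                   /* record is good */
    }
    return pos;
}

int main(void)
{
    uint8_t   buf[256] = {0};                 /* zeroed "garbage" past the record */
    RecHeader hdr;

    hdr.len = 5;
    hdr.crc = crc32("hello", 5);
    memcpy(buf, &hdr, sizeof hdr);
    memcpy(buf + sizeof hdr, "hello", 5);
    return find_end_of_valid_log(buf, sizeof buf) == sizeof hdr + 5 ? 0 : 1;
}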

(I'd also like to see CRCs on all the table blocks as well; is there
a place to put them?)

Nathan Myers
ncm@zembu.com



Re: CRCs (was: beta testing version)

From
Bruce Guenter
Date:
On Thu, Dec 07, 2000 at 12:25:41PM -0800, Nathan Myers wrote:
> That requires an extra out-of-sequence write.

Ayup!

> Generally, there are no guarantees, only reasonable expectations.

I would differ, but that's irrelevant.

> A 64-bit CRC would give sufficient confidence...

This is part of what I was getting at, in a roundabout way.  If you use
a CRC, hash, or any other kind of non-trivial check code, you have a
certain level of confidence in the data, but not a guarantee.  If you
decide, based on your expert opinions, that a 32 or 64 bit CRC or hash
gives you an adequate level of confidence in the event of a crash, then
I'll be satisfied, but don't call it a guarantee.

Them's small nits we're picking at, though.
--
Bruce Guenter <bruceg@em.ca>                       http://em.ca/~bruceg/

Re: AW: beta testing version

From
Daniele Orlandi
Date:
Bruce Guenter wrote:
> 
> CRCs are designed to catch N-bit errors (ie N bits in a row with their
> values flipped).  N is (IIRC) the number of bits in the CRC minus one.
> So, a 32-bit CRC can catch all 31-bit errors.  That's the only guarantee
> a CRC gives.  Everything else has a 1 in 2^32-1 chance of producing the
> same CRC as the original data.  That's pretty good odds, but not a
> guarantee.

Nothing is a guarantee. Everywhere you have a non-null probability of
failure. Memory of any kind doesn't give you a *guarantee* that the
data you read is exactly what you wrote. CPUs and transmission lines
are subject to errors too.

You may only be guaranteed that the overall probability of failure of
your system is below a specified level. When that level is low enough,
you usually consider the absence of errors guaranteed.

With CRC32 you considerably reduce that probability, and given how
rarely the CRC would actually need to reveal an error, I would
consider it enough.

Bye!

-- Daniele

-------------------------------------------------------------------------------
Daniele Orlandi - Utility Line Italia - http://www.orlandi.com
Via Mezzera 29/A - 20030 - Seveso (MI) - Italy
-------------------------------------------------------------------------------