Thread: Checkpoint cost, looks like it is WAL/CRC
Tom, All: Ok, finally managed through the persistent efforts of Mark Wong to get some tests through. Here are two tests with the CRC and WAL buffer checking completely cut out of the code, as Tom suggested: 5-min checkpoint: http://khack.osdl.org/stp/302738/results/0/ http://khack.osdl.org/stp/302706/results/0/ 60-min checkpoint: http://khack.osdl.org/stp/302739/results/0/ (please note that OSDL is having technical difficulties and some links may not work) This is the performance profile I'd expect and want to see, and the frequency of checkpoints doesn't affect the overall performance at all. Contrast it with these: 5-min checkpoint: http://khack.osdl.org/stp/302671/results/0/ I don't have a 60-minute checkpoint for comparison because of failures on the STP :-( So, now that we know what the performance bottleneck is, how do we fix it? -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: > Ok, finally managed through the persistent efforts of Mark Wong to get some > tests through. Here are two tests with the CRC and WAL buffer checking > completely cut out of the code, as Tom suggested: Uh, what exactly did you cut out? I suggested dropping the dumping of full page images, but not removing CRCs altogether ... regards, tom lane
Tom, > Uh, what exactly did you cut out? I suggested dropping the dumping of > full page images, but not removing CRCs altogether ... Attached is the patch I used. (it's a -Urn patch 'cause that's what STP takes) -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: >> Uh, what exactly did you cut out? I suggested dropping the dumping of >> full page images, but not removing CRCs altogether ... > Attached is the patch I used. OK, thanks for the clarification. So it does seem that dumping full page images is a pretty big hit these days. (In defense of the original idea, I believe it was not such a hit at the time --- but as we continue to improve performance, things that weren't originally at the top of the profile become significant.) It seems like we have two basic alternatives: 1. Offer a GUC to turn off full-page-image dumping, which you'd use only if you really trust your hardware :-( 2. Think of a better defense against partial-page writes. I like #2, or would if I could think of a better defense. Ideas anyone? regards, tom lane
Tom, > 1. Offer a GUC to turn off full-page-image dumping, which you'd use only > if you really trust your hardware :-( Are these just WAL pages? Or database pages as well? -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: >> 1. Offer a GUC to turn off full-page-image dumping, which you'd use only >> if you really trust your hardware :-( > Are these just WAL pages? Or database pages as well? Database pages. The current theory is that we can completely reconstruct from WAL data every page that's been modified since the last checkpoint. So the first write of any page after a checkpoint dumps a full image of the page into WAL; subsequent writes only write differences. This is nice and secure ... at least when you are using hardware that guarantees write ordering ... otherwise it's probably mostly useless overhead. Still, I'd not like to abandon the contract that if the disk does what it is supposed to do then we will do what we are supposed to. regards, tom lane
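To make the mechanism concrete, here is a minimal sketch of the decision Tom describes, with invented type and function names rather than the real XLogCheckBuffer() code: a page needs a full image in WAL only if its LSN shows it has not been written to WAL since the last checkpoint's redo point.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t Lsn;               /* hypothetical WAL-position type */

    typedef struct Page
    {
        Lsn  lsn;                       /* LSN of the last WAL record that touched this page */
        char data[8192 - sizeof(Lsn)];
    } Page;

    /*
     * First modification of a page after a checkpoint: if the page's LSN is
     * older than the checkpoint's redo pointer, WAL holds no base image for
     * it yet, so a full page image must be dumped.  Later modifications
     * (page->lsn >= redo_ptr) log only the differences.
     */
    static bool
    needs_full_page_image(const Page *page, Lsn redo_ptr)
    {
        return page->lsn < redo_ptr;
    }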
> 2. Think of a better defense against partial-page writes. > > I like #2, or would if I could think of a better defense. > Ideas anyone? FWIW, MSSQL deals with this using "Torn Page Detection". This is off by default (no check at all!), but can be enabled on a per-database level. Note that it only *detects* torn pages. If it finds one, it won't start and will tell you to recover from backup. It can't automatically recover. I would assume this greatly decreases the amount of data you have to save... From the BOL: " This option allows SQL Server to detect incomplete I/O operations caused by power failures or other system outages. When true, it causes a bit to be flipped for each 512-byte sector in an 8-kilobyte (KB) database page whenever the page is written to disk. If a bit is in the wrong state when the page is later read by SQL Server, then the page was written incorrectly; a torn page is detected. Torn pages are usually detected during recovery because any page that was written incorrectly is likely to be read by recovery. Although SQL Server database pages are 8 KB, disks perform I/O operations using a 512-byte sector. Therefore, 16 sectors are written per database page. A torn page can occur if the system fails (for example, due to power failure) between the time the operating system writes the first 512-byte sector to disk and the completion of the 8 KB I/O operation. If the first sector of a database page is successfully written before the failure, the database page on disk will appear as updated, although it may not have succeeded. Using battery-backed disk controller caches can ensure that data is successfully written to disk or not written at all. In this case, do not set torn page detection to true, for it is not needed. " and some FAQs: " Q. Does enabling the "torn page" database option add any measurable performance overhead to a server? A. The torn page option does not add much CPU cost at all, but it can increase contention on "hot" pages. With torn page detection off, a page can be accessed while it is being written to disk. This is not true if torn page detection is on. Q. When does SQL Server check for torn pages? At startup? Any time it reads a page from disk? A. Torn page detection is done whenever a page is read from disk. In practice, this is likely to be during recovery, because any page on which the write did not complete during normal operations is very likely to be read by recovery (except for non-logged operations, such as index creation, bcp, and so on). Q. What happens when SQL Server detects a torn page? A. When a torn page is detected, a severe I/O error is raised. This error will close the connection. The database is only marked suspect if the torn page is detected during recovery. Q. How can I recover from torn pages? A. Restoring the database from a backup and rolling the transaction log forward should correct the problem with no data loss. Q. What situations are most likely to cause torn pages? A. Lab tests have shown that torn pages are quite likely to happen when disk writes are occurring and power is lost. If you do not have a battery backup or uninterruptible power supply (UPS), you should consider enabling this option. " Perhaps a GUC that would let you choose between "full recovery, detection or none at all"? Though that might complicate the code... //Magnus
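For illustration, a toy sketch of the per-sector scheme quoted above -- not actual SQL Server or PostgreSQL code, and it glosses over the space-management problem of reserving a byte in every sector, which comes up later in the thread. Each 512-byte sector of an 8 KB page is stamped with the same toggle value just before the write; a mismatch on read means the write did not complete.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE   8192
    #define SECTOR_SIZE 512
    #define SECTORS     (PAGE_SIZE / SECTOR_SIZE)

    /* Stamp the same toggle value into the last byte of every sector just before writing. */
    static void
    stamp_torn_bits(uint8_t *page, uint8_t toggle)
    {
        for (size_t s = 0; s < SECTORS; s++)
            page[(s + 1) * SECTOR_SIZE - 1] = toggle;
    }

    /* On read (in practice, during recovery): all sectors must agree. */
    static bool
    page_is_torn(const uint8_t *page)
    {
        uint8_t expected = page[SECTOR_SIZE - 1];

        for (size_t s = 1; s < SECTORS; s++)
            if (page[(s + 1) * SECTOR_SIZE - 1] != expected)
                return true;        /* sectors disagree: the 8 KB write was incomplete */
        return false;
    }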
Tom, > Database pages. The current theory is that we can completely > reconstruct from WAL data every page that's been modified since the > last checkpoint. So the first write of any page after a checkpoint > dumps a full image of the page into WAL; subsequent writes only write > differences. What I'm confused about is that this shouldn't be anything new for 8.1. Yet 8.1 has *worse* performance on the STP machines than 8.0 does, and it's pretty much entirely due to this check. > This is nice and secure ... at least when you are using hardware that > guarantees write ordering ... otherwise it's probably mostly useless > overhead. Still, I'd not like to abandon the contract that if the disk > does what it is supposed to do then we will do what we are supposed to. Given the huge performance difference (30%), I think we have to give an option to turn it off. So DBAs whose machines are in danger of being shut off a lot can have it on, and the more performance-sensitive can turn it off. One thing I am confused about, though: if the whole pages are in the database, why do we need a full copy in WAL instead of just the diffs? -- Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: > What I'm confused about is that this shouldn't be anything new for 8.1. Yet > 8.1 has *worse* performance on the STP machines than 8.0 does, and it's > pretty much entirely due to this check. That's simply not believable --- better recheck your analysis. If 8.1 is worse it's not because of page-dumping, because we are more efficient on that than before not less so. Perhaps there's another issue? regards, tom lane
Tom, > > What I'm confused about is that this shouldn't be anything new for > > 8.1. Yet 8.1 has *worse* performance on the STP machines than 8.0 > > does, and it's pretty much entirely due to this check. > > That's simply not believable --- better recheck your analysis. If 8.1 > is worse it's not because of page-dumping, because we are more efficient > on that than before not less so. Perhaps there's another issue? Ach, I want to run head-to-head tests, but the system is down again. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
""Magnus Hagander"" <mha@sollentuna.net> writes > > FWIW, MSSQL deals with this using "Torn Page Detection". This is off by > default (no check at all!), but can be abled on a per-database level. > Note that it only *detects* torn pages. If it finds one, it won't start > and tell you to recover from backup. It can't automatically recover. I > would assume this greatly decreases the amount of data you have to > save... > After reading the long discussion on torn page detection and many related issues in doc/TODO.details/wal, I believe we need carefully analysis what the sequence of a partial write. case 1: if a partial write just write some sectors of a 8K page, then it is ok - assuming we will replay xlog in any ways without comparing page header LSN and xlog LSN (currently we do not do so); case 2: if a patial write destroied some old data of a page, which are not in our xlog, we will worry about this; So the main issue is case2. If OS could schedule atomic sector-wise write , so case2 could never happen? Regards, Qingqing
Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > >> Uh, what exactly did you cut out? I suggested dropping the dumping of > >> full page images, but not removing CRCs altogether ... > > > Attached is the patch I used. > > OK, thanks for the clarification. So it does seem that dumping full > page images is a pretty big hit these days. (In defense of the original > idea, I believe it was not such a hit at the time --- but as we continue > to improve performance, things that weren't originally at the top of the > profile become significant.) > > It seems like we have two basic alternatives: > > 1. Offer a GUC to turn off full-page-image dumping, which you'd use only > if you really trust your hardware :-( > > 2. Think of a better defense against partial-page writes. > > I like #2, or would if I could think of a better defense. Ideas anyone? I have an idea! Currently we write the backup pages (copies of pages modified since the last checkpoint) when we write the WAL changes as part of the commit. See the XLogCheckBuffer() call in XLogInsert(). However, it seems that this is not the most efficient time to be writing the backup pages. It would be more efficient to write these just before the page is written to disk (not when it is modified in the buffer cache). This would allow the background writer to be writing most of the backup pages, rather than the transaction committer. A further optimization would be to write the backup pages to the same 8k file over and over again rather than adding I/O to pg_xlog WAL files. Comments? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I have an idea! Currently we write the backup pages (copies of pages > modified since the last checkpoint) when we write the WAL changes as > part of the commit. See the XLogCheckBuffer() call in XLogInsert(). Can someone explain exactly what problem is being defeated by writing whole pages to the WAL log? Like, if a page is half flushed to disk and contains half the old data and half of the new data, what exactly would go wrong with recovery? When postgres sees the write, why would it have trouble recovering the correct contents for the page based on the half-written page and the regular WAL entry? > A further optimization would be to write the backup pages to the same 8k > file over and over again rather than adding I/O to pg_xlog WAL files. This is an interesting idea. But it's not necessarily universally a win. For users with a battery-backed cache it would be absolutely ideal and a huge win. It would avoid pushing other pages out of cache and avoid doing extra i/o syncing the no longer relevant data to the actual disk. But for users without a battery-backed cache I don't think it would be a win. It would mean another data file that ideally would be on a segregated set of spindles to avoid any unnecessary seeks. Currently the ideal postgres setup has WAL on one set of spindles, data on another set of spindles, and OS writable partitions like /var on an entirely separate set of spindles. That takes at least 6 drives if you want mirrored drives. This would raise the minimum to 8 drives. -- greg
On Sun, 3 Jul 2005 04:47 pm, Greg Stark wrote: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > I have an idea! Currently we write the backup pages (copies of pages > > modified since the last checkpoint) when we write the WAL changes as > > part of the commit. See the XLogCheckBuffer() call in XLogInsert(). > > Can someone explain exactly what problem is being defeated by writing whole > pages to the WAL log? Like, if a page is half flushed to disk and contains half > the old data and half of the new data, what exactly would go wrong with > recovery? When postgres sees the write, why would it have trouble recovering > the correct contents for the page based on the half-written page and the > regular WAL entry? The problem we are solving: a checkpoint happens (ensuring that ALL data makes it to the disk) - this means we don't need the information in WAL before the checkpoint. We write some updates to the db, changing some pages. The power is pulled and one page is half written. There is no way to know if the page is in a valid state. There is no way to recover the page from WAL, because unless we wrote the full page into WAL, we don't have a "starting point" for modification. That's probably very unclear but that is the idea. Regards Russell Smith
Greg Stark <gsstark@mit.edu> writes: > Can someone explain exactly what problem is being defeated by writing whole > pages to the WAL log? Partial writes. Without the full-page image, we do not have enough information in WAL to reconstruct the correct page contents. >> A further optimization would be to write the backup pages to the same 8k >> file over and over again rather than adding I/O to pg_xlog WAL files. How does that work, and why is it a win compared to doing the same amount of I/O to WAL? regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Greg Stark <gsstark@mit.edu> writes: > > Can someone explain exactly what problem is being defeated by writing whole > > pages to the WAL log? > > Partial writes. Without the full-page image, we do not have enough > information in WAL to reconstruct the correct page contents. Sure, but why not? If an 8k page contains 16 low-level segments on disk and the old data is AAAAAAAAAAAAAAAA and the new data is AAABAAACAAADAAAE then the WAL would contain the B, C, D, and E. Shouldn't that be enough to reconstruct the page? If the actual write only got out AAABAAACAAAAAAAA the resulting page is garbage but why isn't what the WAL has enough information to reconstruct the new version? I do worry that a drive could write AAABAAACXYZKWBFH if it loses power in the middle of the write, though. That would be bad. -- greg
Greg Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> Partial writes. Without the full-page image, we do not have enough >> information in WAL to reconstruct the correct page contents. > Sure, but why not? > If a 8k page contains 16 low level segments on disk and the old data is > AAAAAAAAAAAAAAAA and the new data is AAABAAACAAADAAAE then the WAL would > contain the B, C, D, and E. Shouldn't that be enough to reconstruct the page? It might contain parts of it ... scattered across a large number of WAL entries ... but I don't think that's enough to reconstruct the page. As an example, a btree insert WAL record will say "insert this tuple at position N, shifting the other entries accordingly"; that does not give you the ability to reconstruct entries that shifted across sector boundaries, as they may not be present in the on-disk data of either sector. You're also going to have fairly serious problems interpreting the page contents if what's on disk includes the effects of multiple WAL records beyond the record you are currently looking at. We could possibly do it if we added more information to the WAL records, but that strikes me as a net loss: essentially it would pay the penalty all the time instead of only on the first write after a checkpoint. Also, you are assuming that the content of each sector is uniquely --- and determinably --- either "old data" or "new data", not for example "unreadable because partially written". regards, tom lane
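To make the objection concrete, here is a toy redo routine (invented names, not real btree code) for a record like "insert this item at position N, shifting the rest": the shift moves whatever bytes are currently on the page, so replaying it against a half-old, half-new torn page silently produces garbage instead of failing.

    #include <stdint.h>
    #include <string.h>

    #define MAX_ITEMS 256

    typedef struct ToyPage
    {
        uint16_t nitems;
        uint32_t items[MAX_ITEMS];  /* stand-in for a line pointer array */
    } ToyPage;

    /*
     * Redo of "insert item at position pos" (assumes there is room on the
     * page).  It depends entirely on the page contents being a coherent
     * starting point; there is no way to tell from the record alone whether
     * the bytes being shifted are old, new, or a torn mixture of both.
     */
    static void
    redo_insert(ToyPage *page, uint16_t pos, uint32_t item)
    {
        memmove(&page->items[pos + 1], &page->items[pos],
                (page->nitems - pos) * sizeof(uint32_t));
        page->items[pos] = item;
        page->nitems++;
    }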
On Wed, 2005-06-29 at 23:23 -0400, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > >> Uh, what exactly did you cut out? I suggested dropping the dumping of > >> full page images, but not removing CRCs altogether ... > > > Attached is the patch I used. > > OK, thanks for the clarification. So it does seem that dumping full > page images is a pretty big hit these days. Yes the performance results are fairly damning. That's a shame, I convinced myself that the CRC32 and block-hole compression was enough. The 50% performance gain isn't the main thing for me. The 10 sec drop in response time immediately after checkpoint is the real issue. Most sites are looking for good response as an imperative, rather than throughput. Overall, IMHO we must do something about this for 8.1. Possibly something more for 8.2 also, but definitely *something* now. > (In defense of the original > idea, I believe it was not such a hit at the time --- but as we continue > to improve performance, things that weren't originally at the top of the > profile become significant.) No defense required. As you say, it was the best idea at the time. > It seems like we have two basic alternatives: > > 1. Offer a GUC to turn off full-page-image dumping, which you'd use only > if you really trust your hardware :-( > > 2. Think of a better defense against partial-page writes. > > I like #2, or would if I could think of a better defense. Ideas anyone? Well, I'm all for #2 if we can think of one that will work. I can't. Option #1 seems like the way forward, but I don't think it is sufficiently safe just to have the option to turn things off. With wal_changed_pages= off *any* crash would possibly require an archive recovery, or a replication rebuild. It's good that we now have PITR, but we do also have other options for availability. Users of replication could well be amongst the first to try out this option. The problem is that you just wouldn't *know* whether that "possibly" was a yes or a no. The temptation would be to assume "no" and just continue, which could lead to data loss. And that would lead to a lack of trust in PostgreSQL and eventual reputational loss. Would I do an archive recovery, or would I trust that the RAID array had written everything properly? With an irate Web Site Manager saying "you think? it might? maybe? You mean you don't know???" If we pick option #1, it *must* also include a way of deciding whether a crash recovery has succeeded, or not. Other commentators have mentioned the torn_pages option for other DBMSs. It seems we also need an indicator. That should be a non-optional feature of 8.1 and then perhaps an optional feature in later releases when we have more experience to say that turning it off is OK in some circumstances. We could implement the torn-pages option, but that seems a lot of work. Another way of implementing a tell-tale would be to append the LSN again as a data page trailer as the last 4 bytes of the page. Thus the LSN would be both the first and last item on the data page. Any partial write would fail to update the LSN trailer and we would be able to see that a page was torn. That's considerably easier than trying to write a torn page tell-tale to each 512 byte sector of a page as SQLServer does. During recovery, if a full page image is not available, we would read the page from the database and check that the first and last LSNs match. If they do, then the page is not torn and recovery can be successful.
If they do not match, then we attempt to continue recovery, but issue a warning that torn page has been detected and a full archive recovery is recommended. It is likely that the recovery itself will fail almost immediately following this, since changes will try to be made to a page in the wrong state to receive it, but there's no harm in trying.... Like this specific idea or not, I'm saying that we need a tell-tale: a way of knowing whether we have a torn page, or not. That way we can safely continue to rely upon crash recovery. Tom, I think you're the only person that could or would be trusted to make such a change. Even past the 8.1 freeze, I say we need to do something now on this issue. Best Regards, Simon Riggs
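A minimal sketch of the head/tail LSN tell-tale Simon describes, with made-up names: the page's LSN is duplicated into the last 4 bytes just before the page is written, and recovery treats a mismatch as a torn page. (As later replies point out, sectors can be written out of order, so a matching pair does not prove the middle of the page is intact.)

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 8192

    /* Just before the page is written out, copy the header LSN into a 4-byte trailer. */
    static void
    stamp_lsn_trailer(uint8_t *page)
    {
        uint32_t lsn;

        memcpy(&lsn, page, sizeof(lsn));                        /* LSN at the start of the page */
        memcpy(page + PAGE_SIZE - sizeof(lsn), &lsn, sizeof(lsn));
    }

    /* Recovery-time check, used when no full page image is available in WAL. */
    static bool
    page_looks_torn(const uint8_t *page)
    {
        uint32_t head, tail;

        memcpy(&head, page, sizeof(head));
        memcpy(&tail, page + PAGE_SIZE - sizeof(tail), sizeof(tail));
        return head != tail;        /* trailer never got updated: partial write */
    }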
Simon Riggs wrote: > On Wed, 2005-06-29 at 23:23 -0400, Tom Lane wrote: > > Josh Berkus <josh@agliodbs.com> writes: > > >> Uh, what exactly did you cut out? I suggested dropping the dumping of > > >> full page images, but not removing CRCs altogether ... > > > > > Attached is the patch I used. > > > > OK, thanks for the clarification. So it does seem that dumping full > > page images is a pretty big hit these days. > > Yes the performance results are fairly damning. That's a shame, I > convinced myself that the CRC32 and block-hole compression was enough. > > The 50% performance gain isn't the main thing for me. The 10 sec drop in > response time immediately after checkpoint is the real issue. Most sites > are looking for good response as an imperative, rather than throughput. Yep. > No defense required. As you say, it was the best idea at the time. > > > It seems like we have two basic alternatives: > > > > 1. Offer a GUC to turn off full-page-image dumping, which you'd use only > > if you really trust your hardware :-( > > > > 2. Think of a better defense against partial-page writes. > > > > I like #2, or would if I could think of a better defense. Ideas anyone? > > Well, I'm all for #2 if we can think of one that will work. I can't. > > Option #1 seems like the way forward, but I don't think it is > sufficiently safe just to have the option to turn things off. Well, I added #1 yesterday as 'full_page_writes', and it has the same warnings as fsync (namely, on crash, be prepared to recover or check your system thoroughly). As far as #2, my posted proposal was to write the full pages to WAL when they are written to the file system, and not when they are first modified in the shared buffers --- the goal being that it will even out the load, and it will happen in a non-critical path, hopefully by the background writer or at checkpoint time. > With wal_changed_pages= off *any* crash would possibly require an > archive recovery, or a replication rebuild. It's good that we now have > PITR, but we do also have other options for availability. Users of > replication could well be amongst the first to try out this option. Seems it is similar to fsync in risk, which is not a new option. > The problem is that you just wouldn't *know* whether that "possibly" was > a yes or a no. The temptation would be to assume "no" and just continue, > which could lead to data loss. And that would lead to a lack of trust in > PostgreSQL and eventual reputational loss. Would I do an archive > recovery, or would I trust that the RAID array had written everything > properly? With an irate Web Site Manager saying "you think? it might? > maybe? You mean you don't know???" That is a serious problem, but the same problem we have in turning off fsync. > During recovery, if a full page image is not available, we would read > the page from the database and check that the first and last LSNs match. > If they do, then the page is not torn and recovery can be successful. If > they do not match, then we attempt to continue recovery, but issue a > warning that torn page has been detected and a full archive recovery is > recommended. It is likely that the recovery itself will fail almost > immediately following this, since changes will try to be made to a page > in the wrong state to receive it, but there's no harm in trying.... I like the idea of checking the page during recovery so we don't have to check all the pages, just certain pages.
> Like this specific idea or not, I'm saying that we need a tell-tale: a > way of knowing whether we have a torn page, or not. That way we can > safely continue to rely upon crash recovery. > > Tom, I think you're the only person that could or would be trusted to > make such a change. Even past the 8.1 freeze, I say we need to do > something now on this issue. I think if we document full_page_writes as similar to fsync in risk, we are OK for 8.1, but if something can be done easily, it sounds good. Now that we have a GUC we can experiment with the full page write load and see how it can be improved. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Wed, 2005-07-06 at 18:22 -0400, Bruce Momjian wrote: > Well, I added #1 yesterday as 'full_page_writes', and it has the same > warnings as fsync (namely, on crash, be prepared to recovery or check > your system thoroughly. Yes, which is why I comment now that the GUC alone is not enough. There is no way to "check your system thoroughly". If there is a certain way of knowing torn pages had *not* occurred, then I would be happy. > As far as #2, my posted proposal was to write the full pages to WAL when > they are written to the file system, and not when they are first > modified in the shared buffers --- the goal being that it will even out > the load, and it will happen in a non-critical path, hopefully by the > background writer or at checkpoint time. The page must be written before the changes to the page are written, so that they are available sequentially in the log for replay. The log and the database are not connected, so we cannot do it that way. If the page is written out of sequence from the changes to it, how would recovery know where to get the page from? ISTM there is mileage in your idea of trying to shift the work to another time. My thought is "which blocks exactly are the ones being changed?". Maybe that would lead to a reduction. > > With wal_changed_pages= off *any* crash would possibly require an > > archive recovery, or a replication rebuild. It's good that we now have > > PITR, but we do also have other options for availability. Users of > > replication could well be amongst the first to try out this option. > > Seems it is similar to fsync in risk, which is not a new option. Risk is not acceptable. We must have certainty, either way. Why have two GUCs? Why not just have one GUC that does both at the same time? When would you want one but not the other? risk_data_loss_to_gain_performance = true > I think if we document full_page_writes as similar to fsync in risk, we > are OK for 8.1, but if something can be done easily, it sounds good. Documenting something simply isn't enough. I simply cannot advise anybody ever to use the new GUC. If their data was low value, they wouldn't even be using PostgreSQL, they'd use a non-transactional DBMS. I agree we *must* have the GUC, but we also *must* have a way for crash recovery to tell us for certain that it has definitely worked, not just maybe worked. Best regards, Simon Riggs
>>Tom, I think you're the only person that could or would be trusted to >>make such a change. Even past the 8.1 freeze, I say we need to do >>something now on this issue. > > > I think if we document full_page_writes as similar to fsync in risk, we > are OK for 8.1, but if something can be done easily, it sounds good. > > Now that we have a GUC we can experiment with the full page write load > and see how it can be improved. Question: with this option, if the power goes out, will I just roll through the transaction logs like normal? Or are we talking about the potential of having to use something like pg_resetxlog or similar? If it just means rolling through the transaction logs then I have no problem with it; let the user decide the level of reliance they have. If it can cause actual, need-to-restore-from-backup-level damage then it is a literal no-go IMHO. Sincerely, Joshua D. Drake > -- Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240 PostgreSQL Replication, Consulting, Custom Programming, 24x7 support Managed Services, Shared and Dedicated Hosting Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/
Simon Riggs wrote: > I agree we *must* have the GUC, but we also *must* have a way for crash > recovery to tell us for certain that it has definitely worked, not just > maybe worked. Doesn't the same argument apply to the existing fsync = off case? i.e. we already have a case where we don't provide a crash-recovery guarantee. -O
Simon Riggs <simon@2ndquadrant.com> writes: > On Wed, 2005-07-06 at 18:22 -0400, Bruce Momjian wrote: >> Well, I added #1 yesterday as 'full_page_writes', and it has the same >> warnings as fsync (namely, on crash, be prepared to recovery or check >> your system thoroughly. > Yes, which is why I comment now that the GUC alone is not enough. > There is no way to "check your system thoroughly". If there is a certain > way of knowing torn pages had *not* occurred, then I would be happy. I agree with Simon that this isn't much of a solution: no one who cares about their data will dare turn off the option, and therefore the possible performance gain is just hot air. I do not see a better alternative at the moment :-( but we should keep thinking about it. regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes: > As far as #2, my posted proposal was to write the full pages to WAL when > they are written to the file system, and not when they are first > modified in the shared buffers --- That is *completely* unworkable. Or were you planning to abandon the promise that a transaction is committed when we have flushed its WAL commit record? > Seems it is similar to fsync in risk, which is not a new option. The point here is that fsync-off is only realistic for development or playpen installations. You don't turn it off in a production machine, and I can't see that you'd turn off the full-page-write option either. So we have not solved anyone's performance problem. regards, tom lane
On Wed, Jul 06, 2005 at 21:48:44 +0100, Simon Riggs <simon@2ndquadrant.com> wrote: > > We could implement the torn-pages option, but that seems a lot of work. > Another way of implementing a tell-tale would be to append the LSN again > as a data page trailer as the last 4 bytes of the page. Thus the LSN > would be both the first and last item on the data page. Any partial > write would fail to update the LSN trailer and we would be able to see > that a page was torn. That's considerably easier than trying to write a > torn page tell-tale to each 512 byte sector of a page as SQLServer does. Are you sure about that? That would probably be the normal case, but are you promised that the hardware will write all of the sectors of a block in order?
Bruno Wolff III <bruno@wolff.to> writes: > Are you sure about that? That would probably be the normal case, but are > you promised that the hardware will write all of the sectors of a block > in order? I don't think you can possibly assume that. If the block crosses a cylinder boundary then it's certainly an unsafe assumption, and even within a cylinder (no seek required) I'm pretty sure that disk drives have understood "write the next sector that passes under the heads" for decades. regards, tom lane
Tom Lane wrote: > Bruno Wolff III <bruno@wolff.to> writes: > > Are you sure about that? That would probably be the normal case, but are > > you promised that the hardware will write all of the sectors of a block > > in order? > > I don't think you can possibly assume that. If the block crosses a > cylinder boundary then it's certainly an unsafe assumption, and even > within a cylinder (no seek required) I'm pretty sure that disk drives > have understood "write the next sector that passes under the heads" > for decades. SCSI tagged queueing certainly allows 512-byte blocks to be reordered during writes. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Wed, 2005-07-06 at 17:17 -0700, Joshua D. Drake wrote: > >>Tom, I think you're the only person that could or would be trusted to > >>make such a change. Even past the 8.1 freeze, I say we need to do > >>something now on this issue. > > > > > > I think if we document full_page_writes as similar to fsync in risk, we > > are OK for 8.1, but if something can be done easily, it sounds good. > > > > Now that we have a GUC we can experiment with the full page write load > > and see how it can be improved. > > Question, with this option if the power goes out will I just roll > through the transaction logs like normal? Most probably, yes. But: > Or are we talking the > potential to have to use something like pg_resetxlog or similar? Potentially. Just depends on what sort of crash occurred... > If it is just roll through the transaction logs then I have no problem > with it, let the user decide the level of reliance they have. If it can > cause actual, need to restore from backup level damage then it is a > literall no go IMHO. Well, it can't *cause* problems, but it doesn't solve them when they occur, as the current design does. If crash recovery fails, and it may do (and worst of all it might not fail when it should have) then you will need to recover the database using PITR or a replication mechanism. Best Regards, Simon Riggs
On Thu, 2005-07-07 at 00:29 -0400, Bruce Momjian wrote: > Tom Lane wrote: > > Bruno Wolff III <bruno@wolff.to> writes: > > > Are you sure about that? That would probably be the normal case, but are > > > you promised that the hardware will write all of the sectors of a block > > > in order? > > > > I don't think you can possibly assume that. If the block crosses a > > cylinder boundary then it's certainly an unsafe assumption, and even > > within a cylinder (no seek required) I'm pretty sure that disk drives > > have understood "write the next sector that passes under the heads" > > for decades. > > SCSI tagged queueing certainly allows 512-byte blocks to be reordered > during writes. Then a torn-page tell-tale is required that will tell us of any change to any of the 512-byte sectors that make up a block/page. Here's an idea: We read the page that we would have backed up, calc the CRC and write a short WAL record with just the CRC, not the block. When we recover we re-read the database page, calc its CRC and compare it with the CRC from the transaction log. If they differ, we know that the page was torn and we know the database needs recovery. (So we calc the CRC when we log AND when we recover). This avoids the need to write full pages, though slightly slows down recovery. Best Regards, Simon Riggs
>> Are you sure about that? That would probably be the normal case, but >> are you promised that the hardware will write all of the sectors of a >> block in order? > > I don't think you can possibly assume that. If the block > crosses a cylinder boundary then it's certainly an unsafe > assumption, and even within a cylinder (no seek required) I'm > pretty sure that disk drives have understood "write the next > sector that passes under the heads" > for decades. A lot of hardware exists, that guards against partial writes of single IO requests (a persistent write cache for a HP raid controller for intel servers costs ~500$ extra). But, the OS usually has 4k (some 8k) filesystem buffer size, and since we do not use direct io for datafiles, the OS might decide to schedule two 4k writes differently for one 8k page. If you do not build pg to match your fs buffer size you cannot guard against partial writes with hardware :-( We could alleviate that problem with direct io for datafiles. Andreas
> Here's an idea: > > We read the page that we would have backed up, calc the CRC and > write a short WAL record with just the CRC, not the block. When > we recover we re-read the database page, calc its CRC and > compare it with the CRC from the transaction log. If they > differ, we know that the page was torn and we know the database > needs recovery. (So we calc the CRC when we log AND when we recover). Won't work, since the page on disk may have x different contents between 2 checkpoints (bgwriter from lru). Only workable solution would imho be to write the LSN to each 512 byte block (not that I am propagating that idea). Andreas
"Zeugswetter Andreas DAZ SD" <ZeugswetterA@spardat.at> writes: > Only workable solution would imho be to write the LSN to each 512 > byte block (not that I am propagating that idea). We're not doing anything like that, as it would create an impossible space-management problem (or are you happy with limiting tuples to 500 bytes?). What we *could* do is calculate a page-level CRC and store it in the page header just before writing out. Torn pages would then manifest as a wrong CRC on read. No correction ability, but at least a reliable detection ability. However, this will do nothing to solve the performance problem if the core of that problem is the cost of computing page-sized CRCs :-( We still don't know enough about the situation to know what a solution might look like. Is the slowdown Josh is seeing due to the extra CPU cost of the CRCs, or the extra I/O cost, or excessive locking of the WAL-related data structures while we do this stuff, or ???. Need more data. regards, tom lane
I wrote: > We still don't know enough about the situation to know what a solution > might look like. Is the slowdown Josh is seeing due to the extra CPU > cost of the CRCs, or the extra I/O cost, or excessive locking of the > WAL-related data structures while we do this stuff, or ???. Need more > data. Josh, is OSDL up enough that you can try another comparison run? If so, please undo the previous patch (which disabled page dumping entirely) and instead try removing this block of code, starting at about xlog.c line 620 in CVS tip:

    /*
     * Now add the backup block headers and data into the CRC
     */
    for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
    {
        if (dtbuf_bkp[i])
        {
            BkpBlock   *bkpb = &(dtbuf_xlg[i]);
            char       *page;

            COMP_CRC32(rdata_crc, (char *) bkpb, sizeof(BkpBlock));
            page = (char *) BufferGetBlock(dtbuf[i]);
            if (bkpb->hole_length == 0)
            {
                COMP_CRC32(rdata_crc, page, BLCKSZ);
            }
            else
            {
                /* must skip the hole */
                COMP_CRC32(rdata_crc, page, bkpb->hole_offset);
                COMP_CRC32(rdata_crc,
                           page + (bkpb->hole_offset + bkpb->hole_length),
                           BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
            }
        }
    }

This will remove just the CRC calculation work associated with backed-up pages. Note that any attempt to recover from the WAL will fail, but I assume you don't need that for the purposes of the test run. regards, tom lane
>> Only workable solution would imho be to write the LSN to each 512 byte >> block (not that I am propagating that idea). "Only workable" was a stupid formulation; I meant a solution that works with an LSN. > We're not doing anything like that, as it would create an > impossible space-management problem (or are you happy with > limiting tuples to 500 bytes?). To do it, a layer between physical storage and row work memory would need to be inserted; of course that would add a lot of overhead. I guess more overhead than computing a page CRC. > We still don't know enough about the situation to know what a solution might look like. > Is the slowdown Josh is seeing due to the extra CPU cost of the CRCs, or the extra I/O cost, > or excessive locking of the WAL-related data structures while we do this stuff, or ???. > Need more data. Yes, especially the 10 sec instead of 1 sec response times look very suspicious. Andreas
Simon Riggs wrote: > > SCSI tagged queueing certainly allows 512-byte blocks to be reordered > > during writes. > > Then a torn-page tell-tale is required that will tell us of any change > to any of the 512-byte sectors that make up a block/page. > > Here's an idea: > > We read the page that we would have backed up, calc the CRC and write a > short WAL record with just the CRC, not the block. When we recover we > re-read the database page, calc its CRC and compare it with the CRC from > the transaction log. If they differ, we know that the page was torn and > we know the database needs recovery. (So we calc the CRC when we log AND > when we recover). > > This avoids the need to write full pages, though slightly slows down > recovery. Yes, that is a good idea! That torn page thing sounded like a mess, and I love that we can check them on recovery rather than whenever you happen to access the page. What would be great would be to implement this when full_page_writes is off, _and_ have the page writes happen when the page is written to disk rather than modified in the shared buffers. I will add those to the TODO list now. Updated item:

* Eliminate need to write full pages to WAL before page modification [wal]
  Currently, to protect against partial disk page writes, we write full page images to WAL before they are modified so we can correct any partial page writes during recovery. These pages can also be eliminated from point-in-time archive files.
  o -Add ability to turn off full page writes
  o When off, write CRC to WAL and check file system blocks on recovery
  o Write full pages during file system write and not when the page is modified in the buffer cache
    This allows most full page writes to happen in the background writer.

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Yes, that is a good idea! ... which was shot down in the very next message. regards, tom lane
Zeugswetter Andreas DAZ SD wrote: > > >> Are you sure about that? That would probably be the normal case, but > >> are you promised that the hardware will write all of the sectors of a > > >> block in order? > > > > I don't think you can possibly assume that. If the block > > crosses a cylinder boundary then it's certainly an unsafe > > assumption, and even within a cylinder (no seek required) I'm > > pretty sure that disk drives have understood "write the next > > sector that passes under the heads" > > for decades. > > A lot of hardware exists, that guards against partial writes > of single IO requests (a persistent write cache for a HP raid > controller for intel servers costs ~500$ extra). > > But, the OS usually has 4k (some 8k) filesystem buffer size, > and since we do not use direct io for datafiles, the OS might decide > to schedule two 4k writes differently for one 8k page. > > If you do not build pg to match your fs buffer size you cannot > guard against partial writes with hardware :-( > > We could alleviate that problem with direct io for datafiles. Now that is an interesting analysis. I thought people who used battery-backed drive cache wouldn't have partial page write problems, but I now see it is certainly possible. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Tom Lane <tgl@sss.pgh.pa.us> writes: > "Zeugswetter Andreas DAZ SD" <ZeugswetterA@spardat.at> writes: > > Only workable solution would imho be to write the LSN to each 512 > > byte block (not that I am propagating that idea). > > We're not doing anything like that, as it would create an impossible > space-management problem (or are you happy with limiting tuples to > 500 bytes?). What we *could* do is calculate a page-level CRC and > store it in the page header just before writing out. Torn pages > would then manifest as a wrong CRC on read. No correction ability, > but at least a reliable detection ability. At the same time as you do the CRC you can copy the bytes to a fresh page skipping the LSNs. Likewise, when writing out the page you have to calculate the CRC; at the same time as you calculate the CRC you write out the bytes to a temporary buffer adding LSNs and write that to disk. This would be "zero-copy" if you're already scanning the bytes to calculate the CRC since you can add and remove LSNs at the same time. It does require an extra buffer to store the page in before writing and that entails some amount of cache thrashing. But maybe you could reuse the same buffer over and over again for every read/write. -- greg
Greg Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> What we *could* do is calculate a page-level CRC and >> store it in the page header just before writing out. Torn pages >> would then manifest as a wrong CRC on read. No correction ability, >> but at least a reliable detection ability. > At the same time as you do the CRC you can copy the bytes to a fresh page > skipping the LSNs. Likewise, when writing out the page you have to calculate > the CRC; at the same time as you calculate the CRC you write out the bytes to > a temporary buffer adding LSNs and write that to disk. Huh? You seem to be proposing doing *both* things, which sounds entirely pointless. BTW, I was envisioning the page CRCs as something we'd only check during crash recovery, not normal-operation reads. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > As far as #2, my posted proposal was to write the full pages to WAL when > > they are written to the file system, and not when they are first > > modified in the shared buffers --- > > That is *completely* unworkable. Or were you planning to abandon the > promise that a transaction is committed when we have flushed its WAL > commit record? "completely" is a strong word. What is on disk at the time the page is modified in the shared buffer is just fine for recovery (it is the same as what we write to WAL anyway). It is just when the page gets written to disk that it changes for recovery, so potentially during recovery you could take what is on disk, modify it by reading WAL, then replace it later with the image from WAL. The only problem I see is that the page might be partially written and WAL modifications to the page might fail but later the page will be replaced anyway. Perhaps we could record pages that are corrupt (hopefully only one) and make sure later page images replace them, or we fail on recovery. > > Seems it is similar to fsync in risk, which is not a new option. > > The point here is that fsync-off is only realistic for development > or playpen installations. You don't turn it off in a production > machine, and I can't see that you'd turn off the full-page-write > option either. So we have not solved anyone's performance problem. Yes, this is basically another fsync-like option that isn't for production usage in most cases. Sad but true. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tom Lane wrote: >> The point here is that fsync-off is only realistic for development >> or playpen installations. You don't turn it off in a production >> machine, and I can't see that you'd turn off the full-page-write >> option either. So we have not solved anyone's performance problem. > Yes, this is basically another fsync-like option that isn't for > production usage in most cases. Sad but true. Just to make my position perfectly clear: I don't want to see this option shipped in 8.1. It's reasonable to have it in there for now as an aid to our performance investigations, but I don't see that it has any value for production. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Tom Lane wrote: > >> The point here is that fsync-off is only realistic for development > >> or playpen installations. You don't turn it off in a production > >> machine, and I can't see that you'd turn off the full-page-write > >> option either. So we have not solved anyone's performance problem. > > > Yes, this is basically another fsync-like option that isn't for > > production usage in most cases. Sad but true. > > Just to make my position perfectly clear: I don't want to see this > option shipped in 8.1. It's reasonable to have it in there for now > as an aid to our performance investigations, but I don't see that it > has any value for production. Well, this is the first I am hearing that, and of course your position is just one vote. One idea would be to just tie its behavior directly to fsync and remove the option completely (that was the original TODO), or we can adjust it so it doesn't have the same risks as fsync, or the same lack of failure reporting as fsync. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
>>Just to make my position perfectly clear: I don't want to see this >>option shipped in 8.1. It's reasonable to have it in there for now >>as an aid to our performance investigations, but I don't see that it >>has any value for production. > > > Well, this is the first I am hearing that, and of course your position > is just one vote. True but your "feature" was added after feature freeze ;). I don't see this as a good thing overall. We should be looking for a solution not a band-aid that if you tear it off will pull the skin. Sincerely, Joshua D. Drake -- Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240 PostgreSQL Replication, Consulting, Custom Programming, 24x7 support Managed Services, Shared and Dedicated Hosting Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/
Joshua D. Drake wrote: > > >>Just to make my position perfectly clear: I don't want to see this > >>option shipped in 8.1. It's reasonable to have it in there for now > >>as an aid to our performance investigations, but I don't see that it > >>has any value for production. > > > > > > Well, this is the first I am hearing that, and of course your position > > is just one vote. > > True but your "feature" was added after feature freeze ;). I don't see My patch was posted days before the feature freeze. > this as a good thing overall. We should be looking for a solution not a > band-aid that if you tear it off will pull the skin. Sure, having it be _like_ fsync is not a good thing. Seems we can roll it into the fsync option, improve it, or remove it. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Simon Riggs wrote: > On Wed, 2005-07-06 at 18:22 -0400, Bruce Momjian wrote: > > Well, I added #1 yesterday as 'full_page_writes', and it has the same > > warnings as fsync (namely, on crash, be prepared to recovery or check > > your system thoroughly. > > Yes, which is why I comment now that the GUC alone is not enough. > > There is no way to "check your system thoroughly". If there is a certain > way of knowing torn pages had *not* occurred, then I would be happy. Yep, it is a pain, and like fsync. > > As far as #2, my posted proposal was to write the full pages to WAL when > > they are written to the file system, and not when they are first > > modified in the shared buffers --- the goal being that it will even out > > the load, and it will happen in a non-critical path, hopefully by the > > background writer or at checkpoint time. > > The page must be written before the changes to the page are written, so > that they are available sequentially in the log for replay. The log and > the database are not connected, so we cannot do it that way. If the page > is written out of sequence from the changes to it, how would recovery > know where to get the page from? See my later email --- the full page will be restored later from WAL, so our changes don't have to be made at that point. > ISTM there is mileage in your idea of trying to shift the work to > another time. My thought is "which blocks exactly are the ones being > changed?". Maybe that would lead to a reduction. > > > > With wal_changed_pages= off *any* crash would possibly require an > > > archive recovery, or a replication rebuild. It's good that we now have > > > PITR, but we do also have other options for availability. Users of > > > replication could well be amongst the first to try out this option. > > > > Seems it is similar to fsync in risk, which is not a new option. > > Risk is not acceptable. We must have certainty, either way. > > Why have two GUCs? Why not just have one GUC that does both at the same > time? When would you want one but not the other? > risk_data_loss_to_gain_performance = true Yep, one new one might make sense. > > I think if we document full_page_writes as similar to fsync in risk, we > > are OK for 8.1, but if something can be done easily, it sounds good. > > Documenting something simply isn't enough. I simply cannot advise > anybody ever to use the new GUC. If their data was low value, they > wouldn't even be using PostgreSQL, they'd use a non-transactional DBMS. > > I agree we *must* have the GUC, but we also *must* have a way for crash > recovery to tell us for certain that it has definitely worked, not just > maybe worked. Right. I am thinking your CRC write to WAL might do that. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Tom, > Josh, is OSDL up enough that you can try another comparison run? Thankfully, yes. > If so, please undo the previous patch (which disabled page dumping > entirely) and instead try removing this block of code, starting > at about xlog.c line 620 in CVS tip: Will do. Results in a few days. -- Josh Berkus Aglio Database Solutions San Francisco
On Thu, Jul 07, 2005 at 11:36:40AM -0400, Tom Lane wrote: > Greg Stark <gsstark@mit.edu> writes: > > Tom Lane <tgl@sss.pgh.pa.us> writes: > >> What we *could* do is calculate a page-level CRC and > >> store it in the page header just before writing out. Torn pages > >> would then manifest as a wrong CRC on read. No correction ability, > >> but at least a reliable detection ability. > > > At the same time as you do the CRC you can copy the bytes to a fresh page > > skipping the LSNs. Likewise, when writing out the page you have to calculate > > the CRC; at the same time as you calculate the CRC you write out the bytes to > > a temporary buffer adding LSNs and write that to disk. > > Huh? You seem to be proposing doing *both* things, which sounds entirely > pointless. > > BTW, I was envisioning the page CRCs as something we'd only check during > crash recovery, not normal-operation reads. > > regards, tom lane > Does the DB page on disk have a version number? If so, maybe we could update the WAL with the CRC+version anytime the page is updated. You may need to check the log for multiple CRC+version entries to determine the torn-page status. Ken
Josh Berkus <josh@agliodbs.com> writes: >> If so, please undo the previous patch (which disabled page dumping >> entirely) and instead try removing this block of code, starting >> at about xlog.c line 620 in CVS tip: > Will do. Results in a few days. Great. BTW, don't bother testing snapshots between 2005/07/05 2300 EDT and just now --- Bruce's full_page_writes patch introduced a large random negative component into the timing ... regards, tom lane
>>> The point here is that fsync-off is only realistic for development or >>> playpen installations. You don't turn it off in a production >>> machine, and I can't see that you'd turn off the full-page-write >>> option either. So we have not solved anyone's performance problem. > >> Yes, this is basically another fsync-like option that isn't for >> production usage in most cases. Sad but true. > > Just to make my position perfectly clear: I don't want to see > this option shipped in 8.1. Why not ? If your filesystem buffer size matches your pg page size, and you have a persistent write cache, the option makes perfect sense. Andreas
On Thu, 2005-07-07 at 11:59 -0400, Bruce Momjian wrote: > Tom Lane wrote: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > Tom Lane wrote: > > >> The point here is that fsync-off is only realistic for development > > >> or playpen installations. You don't turn it off in a production > > >> machine, and I can't see that you'd turn off the full-page-write > > >> option either. So we have not solved anyone's performance problem. > > > > > Yes, this is basically another fsync-like option that isn't for > > > production usage in most cases. Sad but true. > > > > Just to make my position perfectly clear: I don't want to see this > > option shipped in 8.1. It's reasonable to have it in there for now > > as an aid to our performance investigations, but I don't see that it > > has any value for production. > > Well, this is the first I am hearing that, and of course your position > is just one vote. > > One idea would be to just tie its behavior directly to fsync and remove > the option completely (that was the original TODO), or we can adjust it > so it doesn't have the same risks as fsync, or the same lack of failure > reporting as fsync. I second Tom's objection, until we agree either: - a conclusive physical test that shows that specific hardware *never* causes torn pages - a national/international standard name/number for everybody to ask their manufacturer whether or not they comply with that (I doubt that exists...) - a conclusive check for torn pages that can be added to the recovery code to show whether or not they have occurred. Is there also a potential showstopper in the redo machinery? We work on the assumption that the post-checkpoint block is available in WAL as a before image. Redo for all actions merely replay the write action again onto the block. If we must reapply the write action onto the block, the redo machinery must check to see whether the write action has already been successfully applied before it decides to redo. I'm not sure that the current code does that. Having raised that objection, ISTM that checking for torn pages can be accomplished reasonably well using a few rules... These are simple because we do not update in place for MVCC. Since inserts and vacuums alter the pd_upper and pd_lower, we should be able to do a self-consistency check that shows that all items are correctly placed. If there is non-zero data higher than the pd_higher pointer, then we know that the first sector is torn. If a pointer doesn't match with a row version, then the page is torn. It is possible that the first sector of a page could be undetectably torn if it was nearly full and the item pointer pointed to the first sector. However, for every page touched, the last WAL record to touch that page should have an LSN that matches the database page. In most cases they would match, proving the page was not torn. If they did not match we would have no proof either way, so we would be advised to act as if the page were torn for that situation. Possibly, we could reinstate the idea of putting the LSN at the beginning and end of every page, since that would help prove the first sector (only) was not torn. It is possible that a page could be torn and yet still be consistent, but this could only occur for a delete. Reapplying the delete, whether or not it is visible on the page would overcome that without problem. 
It is possible that one or more sectors of empty space in the middle of a block could be torn, but their contents would still be identical, so that is irrelevant and can be ignored. Best Regards, Simon Riggs
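For concreteness, a rough sketch of the kind of self-consistency check Simon describes, using a simplified stand-in for the page header rather than the real PageHeaderData, and interpreting his rules loosely (Tom notes below that such heuristics are incomplete). Illustration only, not a proposed patch:

    /*
     * Illustrative sketch of a torn-page self-consistency check.
     * The struct is a simplified stand-in, not the real page header,
     * and the rules are loose readings of the heuristics above.
     */
    #include <stdint.h>
    #include <stdbool.h>

    #define BLCKSZ 8192

    typedef struct
    {
        uint64_t pd_lsn;      /* LSN of last WAL record touching this page */
        uint16_t pd_lower;    /* end of item pointer array */
        uint16_t pd_upper;    /* start of row-version space */
        uint16_t nitems;      /* number of item pointers (simplified) */
        uint16_t itemoff[1];  /* offsets of row versions (simplified) */
    } FakePageHeader;

    static bool
    page_looks_consistent(const char *page, uint64_t last_lsn_for_page)
    {
        const FakePageHeader *hdr = (const FakePageHeader *) page;
        uint16_t i;

        /* Rule 1: the hole between pd_lower and pd_upper should be all
         * zeroes; non-zero bytes there suggest the header sector is stale
         * relative to the data sectors, i.e. the first sector is torn. */
        for (i = hdr->pd_lower; i < hdr->pd_upper; i++)
            if (page[i] != 0)
                return false;

        /* Rule 2: every item pointer must land inside the row-version area. */
        for (i = 0; i < hdr->nitems; i++)
            if (hdr->itemoff[i] < hdr->pd_upper || hdr->itemoff[i] >= BLCKSZ)
                return false;

        /* Rule 3: if the page LSN matches the last WAL record known to have
         * touched this page, at least the header sector is current. */
        return hdr->pd_lsn == last_lsn_for_page;
    }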
On 7/7/05, Bruce Momjian <pgman@candle.pha.pa.us> wrote: > One idea would be to just tie its behavior directly to fsync and remove > the option completely (that was the original TODO), or we can adjust it > so it doesn't have the same risks as fsync, or the same lack of failure > reporting as fsync. I wonder about one thing -- how much impact does the underlying filesystem have? I mean, the problem with "partial writes" to pages is how to handle a situation where the machine loses power and we are not sure whether the write was completed or not. But then again, imagine the data is on a filesystem with data journaling (like ext3 with data=journal). There, to my understanding, the data is first written into the journal prior to being written to the disk drive. Assuming the drive loses power during the process, I guess there would be two possible situations:
1) the modification was committed to the journal completely, so we can replay the journal and we are sure the 8kb block is fine. (*)
2) the modification in the journal is not complete. It has not been fully committed to the filesystem journal. And we are safe to assume the drive has the old data.
(*) I am not sure if this is true for 8kb blocks, and of course, I haven't got good knowledge about ext3's journalling and its atomicity... Assuming the above is true, it would be interesting to see how ext3 with data=journal and partial writes competes with ext3 with some other data= setting and without them. I don't have extensive knowledge of journalling internals, but I thought I would mention it, so people with wider knowledge could put their input here. Regards, Dawid
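For reference, the comparison Dawid suggests comes down to the ext3 data journaling mount option; something like these /etc/fstab entries, where the device and mount point are placeholders:

    /dev/sdb1  /var/lib/pgsql  ext3  data=journal    0 2
    /dev/sdb1  /var/lib/pgsql  ext3  data=ordered    0 2   # the ext3 default
    /dev/sdb1  /var/lib/pgsql  ext3  data=writeback  0 2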
Simon Riggs <simon@2ndquadrant.com> writes: > Is there also a potential showstopper in the redo machinery? We work on > the assumption that the post-checkpoint block is available in WAL as a > before image. Redo for all actions merely replay the write action again > onto the block. If we must reapply the write action onto the block, the > redo machinery must check to see whether the write action has already > been successfully applied before it decides to redo. I'm not sure that > the current code does that. The redo machinery relies on the page LSN to tell whether the update has occurred. In the presence of torn pages, that's of course unreliable. > Having raised that objection, ISTM that checking for torn pages can be > accomplished reasonably well using a few rules... I have zero confidence in this; the fact that you can think of (incomplete, inaccurate) heuristics for heap-page operations doesn't mean you can make it work for indexes. regards, tom lane
On Thu, 7 Jul 2005, Tom Lane wrote: > We still don't know enough about the situation to know what a solution > might look like. Is the slowdown Josh is seeing due to the extra CPU > cost of the CRCs, or the extra I/O cost, or excessive locking of the > WAL-related data structures while we do this stuff, or ???. Need more > data. I wonder if a different BLCKSZ would make a difference either way. Say, 1024 bytes instead of the default 8192. - Heikki
Tom, > Great. BTW, don't bother testing snapshots between 2005/07/05 2300 EDT > and just now --- Bruce's full_page_writes patch introduced a large > random negative component into the timing ... Ach. Starting over, then. --Josh -- Josh Berkus Aglio Database Solutions San Francisco
On Fri, 2005-07-08 at 09:47 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > Having raised that objection, ISTM that checking for torn pages can be > > accomplished reasonably well using a few rules... > > I have zero confidence in this; the fact that you can think of > (incomplete, inaccurate) heuristics for heap-page operations doesn't > mean you can make it work for indexes. If we can find heuristics that cover some common cases, then I would be happy. Anything that allows us to prove that we don't need to recover is good. If we reduce the unknown state to an acceptable risk, then we are more likely to make use of the performance gain in the real world. Of course, they need to be accurate. Let's not get hung up on my error rate. I don't think we should care too much about indexes. We can rebuild them...but losing heap sectors means *data loss*. Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > I don't think we should care too much about indexes. We can rebuild > them...but losing heap sectors means *data loss*. If you're so concerned about *data loss* then none of this will be acceptable to you at all. We are talking about going from a system that can actually survive torn-page cases to one that can only tell you whether you've lost data to such a case. Arguing about the probability with which we can detect the loss seems beside the point. regards, tom lane
On R, 2005-07-08 at 14:45 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > I don't think we should care too much about indexes. We can rebuild > > them...but losing heap sectors means *data loss*. There might be some merit in the idea of disabling WAL/PITR for indexes, where one can accept some (and possibly a lot of) extra time when recovering. > If you're so concerned about *data loss* then none of this will be > acceptable to you at all. We are talking about going from a system > that can actually survive torn-page cases to one that can only tell > you whether you've lost data to such a case. Arguing about the > probability with which we can detect the loss seems beside the point. -- Hannu Krosing <hannu@skype.net>
On Fri, 2005-07-08 at 14:45 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > I don't think we should care too much about indexes. We can rebuild > > them...but losing heap sectors means *data loss*. > > If you're so concerned about *data loss* then none of this will be > acceptable to you at all. We are talking about going from a system > that can actually survive torn-page cases to one that can only tell > you whether you've lost data to such a case. Arguing about the > probability with which we can detect the loss seems beside the point. In all of this, I see that turning off full page images would be an option that defaults to "yes, take page images". PITR was originally discussed (in 2002, see the archives) as a mechanism that would allow full page images to be avoided. Since we now have PITR, we can discuss more sensibly taking that option. If there are some circumstances where we don't know the state of the server and need to recover, that is OK, as long as we *can* recover. BUT only if we have a fairly low chance of needing to use it. (Rebuilding an index is preferable to a full system recovery.) So I am interested in the probability of us knowing whether the system is damaged or not. It may then become an acceptable risk for a production system to take in order to gain 50% performance. To that end, I am willing to consider various heuristics that would allow us to reduce the risk. I have suggested some, but am happy to hear others (or, as you say, corrections to them) to make that idea more viable. ISTM that Recovery could tell us:
1. Fully recovered, provably correct state of all data blocks
2. Fully recovered, unknown data correctness of some data blocks
3. Fully recovered, provably incorrect state of some data blocks
as well as:
a) no indexes require rebuilding
b) the following indexes require an immediate REINDEX...
Result 1a requires no further action. Result 1b requires some index rebuild after the system becomes operational. Results 2 and 3 would require some form of system recovery. Since currently there are no tests performed to show correctness, we won't ever know we're in state 1 and so would need to presume we are in state 2 and recover. My view is that if enough heuristics can be found to increase the potential for ending a recovery in state 1, then turning off full page images may become viable as a realistic cost/benefit. Though turning off full page images is never something I would suggest as the default. Best Regards, Simon Riggs
On Fri, 2005-07-08 at 09:34 +0200, Zeugswetter Andreas DAZ SD wrote: > >>> The point here is that fsync-off is only realistic for development or > >>> playpen installations. You don't turn it off in a production > >>> machine, and I can't see that you'd turn off the full-page-write > >>> option either. So we have not solved anyone's performance problem. > > > >> Yes, this is basically another fsync-like option that isn't for > >> production usage in most cases. Sad but true. > > > > Just to make my position perfectly clear: I don't want to see > > this option shipped in 8.1. > > Why not ? If your filesystem buffer size matches your pg page size, > and you have a persistent write cache, the option makes perfect sense. I think this point needs expansion: It is possible to run with matching page sizes, in which case the option to not-log full page images is desirable. The only type of crash that can cause torn pages is an OS crash, such as a power outage. If you have an expensive server with battery backup, persistent write cache etc then this reduces that risk considerably. However, as long as PostgreSQL can't tell the difference between any crash and an OS crash, we must assume the worst. Best Regards, Simon Riggs
Simon, Tom, > > Will do. Results in a few days. Actually, between the bad patch on the 5th and ongoing STP issues, I don't think I will have results before I leave town. Will e-mail you offlist to give you info to retrieve results. > Any chance you'd be able to do this with > > ext3 and a filesystem blocksize of 4096 bytes > > PostgreSQL built with a blocksize of 4096 Hmmm ... I'm not sure the current test supports these kinds of options. Will ask Mark. -- Josh Berkus Aglio Database Solutions San Francisco
Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > I don't think we should care too much about indexes. We can rebuild > > them...but losing heap sectors means *data loss*. > > If you're so concerned about *data loss* then none of this will be > acceptable to you at all. We are talking about going from a system > that can actually survive torn-page cases to one that can only tell > you whether you've lost data to such a case. Arguing about the > probability with which we can detect the loss seems beside the > point. I realize I'm coming into this discussion a bit late, and perhaps my thinking on this is simplistically naive. That said, I think I have an idea of how to solve the torn page problem. If the hardware lies to you about the data being written to the disk, then no amount of work on our part can guarantee data integrity. So the below assumes that the hardware doesn't ever lie about this. If you want to prevent a torn page, you have to make the last synchronized write to the disk as part of the checkpoint process a write that *cannot* result in a torn page. So it has to be a write of a buffer that is no larger than the sector size of the disk. I'd make it 256 bytes, to be sure of accommodating pretty much any disk hardware out there. In any case, the modified sequence would go something like:
1. write the WAL entry, and encode in it a unique magic number
2. sync()
3. append the unique magic number to the WAL again (or to a separate file if you like, it doesn't matter as long as you know where to look for it during recovery), using a 256 byte (at most) write buffer.
4. sync()
After the first sync(), the OS guarantees that the data you've written so far is committed to the platters, with the possible exception of a torn page during the write operation, which will only happen during a crash during step 2. But if a crash happens here, then the second occurrence of the unique magic number will not appear in the WAL (or separate file, if that's the mechanism chosen), and you'll *know* that you can't trust that the WAL entry was completely committed to the platter. If a crash happens during step 4, then either the appended magic number won't appear during recovery, in which case the recovery process can assume that the WAL entry is incomplete, or it will appear, in which case it's *guaranteed by the hardware* that the WAL entry is complete, because you'll know for sure that the previous sync() completed successfully. The amount of time between steps 2 and 4 should be small enough that there should be no significant performance penalty involved, relative to the time it takes for the first sync() to complete. Thoughts? -- Kevin Brown kevin@sysexperts.com
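A minimal sketch of the sequence Kevin outlines, using fsync() as the flush primitive; the record layout, the 256-byte trailer, and the function itself are invented here for illustration, not proposed code:

    /*
     * Sketch of the "write, flush, confirm, flush" idea described above.
     * Everything here (record format, trailer size) is illustrative only.
     */
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    int
    write_wal_record_with_marker(int wal_fd, const void *rec, size_t reclen,
                                 uint64_t magic)
    {
        char trailer[256];

        /* Step 1: write the WAL entry, which embeds the magic number. */
        if (write(wal_fd, &magic, sizeof(magic)) != sizeof(magic) ||
            write(wal_fd, rec, reclen) != (ssize_t) reclen)
            return -1;

        /* Step 2: force it to the platters. */
        if (fsync(wal_fd) != 0)
            return -1;

        /*
         * Step 3: append the magic number again as one small write
         * (the proposal assumes a write this small cannot itself be torn).
         */
        memset(trailer, 0, sizeof(trailer));
        memcpy(trailer, &magic, sizeof(magic));
        if (write(wal_fd, trailer, sizeof(trailer)) != sizeof(trailer))
            return -1;

        /* Step 4: flush again.  On recovery, a record whose trailing magic
         * is present and matches is known to have been fully written. */
        return fsync(wal_fd);
    }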
I don't think our problem is partial writes of WAL, which we already check, but heap/index page writes, which we currently do not check for partial writes. --------------------------------------------------------------------------- Kevin Brown wrote: > Tom Lane wrote: > > Simon Riggs <simon@2ndquadrant.com> writes: > > > I don't think we should care too much about indexes. We can rebuild > > > them...but losing heap sectors means *data loss*. > > > > If you're so concerned about *data loss* then none of this will be > > acceptable to you at all. We are talking about going from a system > > that can actually survive torn-page cases to one that can only tell > > you whether you've lost data to such a case. Arguing about the > > probability with which we can detect the loss seems beside the > > point. > > I realize I'm coming into this discussion a bit late, and perhaps my > thinking on this is simplistically naive. That said, I think I have > an idea of how to solve the torn page problem. > > If the hardware lies to you about the data being written to the disk, > then no amount of work on our part can guarantee data integrity. So > the below assumes that the hardware doesn't ever lie about this. > > If you want to prevent a torn page, you have to make the last > synchronized write to the disk as part of the checkpoint process a > write that *cannot* result in a torn page. So it has to be a write of > a buffer that is no larger than the sector size of the disk. I'd make > it 256 bytes, to be sure of accomodating pretty much any disk hardware > out there. > > In any case, the modified sequence would go something like: > > 1. write the WAL entry, and encode in it a unique magic number > 2. sync() > 3. append the unique magic number to the WAL again (or to a separate > file if you like, it doesn't matter as long as you know where to > look for it during recovery), using a 256 byte (at most) write > buffer. > 4. sync() > > > After the first sync(), the OS guarantees that the data you've written > so far is committed to the platters, with the possible exception of a > torn page during the write operation, which will only happen during a > crash during step 2. But if a crash happens here, then the second > occurrance of the unique magic number will not appear in the WAL (or > separate file, if that's the mechanism chosen), and you'll *know* that > you can't trust that the WAL entry was completely committed to the > platter. > > If a crash happens during step 4, then either the appended magic > number won't appear during recovery, in which case the recovery > process can assume that the WAL entry is incomplete, or it will > appear, in which case it's *guaranteed by the hardware* that the WAL > entry is complete, because you'll know for sure that the previous > sync() completed successfully. > > > The amount of time between steps 2 and 4 should be small enough that > there should be no significant performance penalty involved, relative > to the time it takes for the first sync() to complete. > > > Thoughts? > > > > -- > Kevin Brown kevin@sysexperts.com > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian wrote: > > I don't think our problem is partial writes of WAL, which we already > check, but heap/index page writes, which we currently do not check for > partial writes. Hmm...I've read through the thread again and thought about the problem further, and now think I understand what you're dealing with. Ultimately, the problem is that you're storing diffs in the WAL, so you have to be able to guarantee that every data/index page has been completely written, right? There's no way to detect a torn page without some sort of marker in each disk-indivisible segment of the page, unless you're willing to checksum the entire page. With that in mind, the method Microsoft uses for SQL Server is probably about as simple as it gets. In our case, I suppose we may as well allocate one byte per 256-byte segment for the torn page marker. Just increment the marker value each time you write the page (you'll have to read it from the page prior to incrementing it, of course). Other than that, torn page detection is really just a special case of page corruption detection. -- Kevin Brown kevin@sysexperts.com
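Roughly how the MSSQL-style marker Kevin mentions might look. Reserving the last byte of every 512-byte sector for the marker is an assumption made purely for illustration; nothing like this exists in the current page layout:

    /*
     * Illustration of torn-page detection via per-sector markers.
     * Before a page is written, stamp the last byte of every sector
     * with a counter; on read, all stamps must match.
     */
    #include <stdint.h>
    #include <stdbool.h>

    #define BLCKSZ      8192
    #define SECTOR_SIZE 512

    static void
    stamp_torn_page_marker(char *page, uint8_t *counter)
    {
        int sector;

        (*counter)++;                  /* changes on every write of the page */
        for (sector = 0; sector < BLCKSZ / SECTOR_SIZE; sector++)
            page[(sector + 1) * SECTOR_SIZE - 1] = (char) *counter;
    }

    static bool
    torn_page_detected(const char *page)
    {
        char expected = page[SECTOR_SIZE - 1];
        int  sector;

        for (sector = 1; sector < BLCKSZ / SECTOR_SIZE; sector++)
            if (page[(sector + 1) * SECTOR_SIZE - 1] != expected)
                return true;           /* sectors came from different writes */
        return false;
    }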
Josh Berkus <josh@agliodbs.com> writes: > So, now that we know what the performance bottleneck is, how do we fix it? Josh, I see that all of those runs seem to be using wal_buffers = 8. Have you tried materially increasing wal_buffers (say to 100 or 1000) and/or experimenting with different wal_sync_method options since we fixed the bufmgrlock problem? I am wondering if the real issue is WAL buffer contention or something like that. It would also be useful to compare these runs to runs with fsync = off, just to see how the performance changes. regards, tom lane
Tom, > Josh, I see that all of those runs seem to be using wal_buffers = 8. > Have you tried materially increasing wal_buffers (say to 100 or 1000) > and/or experimenting with different wal_sync_method options since we > fixed the bufmgrlock problem? I am wondering if the real issue is > WAL buffer contention or something like that. > > It would also be useful to compare these runs to runs with fsync = off, > just to see how the performance changes. As you know, I've been out of town. I'll be running more tests, and collating my existing test results over the next few days. -- Josh Berkus Aglio Database Solutions San Francisco
Tom, > This will remove just the CRC calculation work associated with backed-up > pages. Note that any attempt to recover from the WAL will fail, but I > assume you don't need that for the purposes of the test run. Looks like the CRC calculation work isn't the issue. I did test runs of no-CRC vs. regular DBT2 with different checkpoint timeouts, and didn't discern any statistical difference. See attached spreadsheet chart (the two different runs are on two different machines). I think this test run http://khack.osdl.org/stp/302903/results/0/, with a 30-min checkpoint, shows pretty clearly that the behavior of the performance drop is consistent with needing to "re-prime" the WAL with full page images. Each checkpoint drops performance abruptly, and then slowly recovers until the next checkpoint. Do note that there is a significant statistical variation in individual runs. It's only the overall trend which is significant. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Did you test with full_page_writes on and off? --------------------------------------------------------------------------- Josh Berkus wrote: > Tom, > > > This will remove just the CRC calculation work associated with backed-up > > pages. ?Note that any attempt to recover from the WAL will fail, but I > > assume you don't need that for the purposes of the test run. > > Looks like the CRC calculation work isn't the issue. I did test runs of > no-CRC vs. regular DBT2 with different checkpoint timeouts, and didn't > discern any statistical difference. See attached spreadsheet chart (the > two different runs are on two different machines). > > I think this test run http://khack.osdl.org/stp/302903/results/0/, with a > 30-min checkpoint shows pretty clearly that the behavior of the > performance drop is consistent with needing to "re-prime" the WAL will > full page images. Each checkpoint drops performance abruptly, and then > slowly recovers until the next checkpoint. > > Do note that there is a significant statistical variation in individual > runs. It's only the overall trend which is significant. > > -- > --Josh > > Josh Berkus > Aglio Database Solutions > San Francisco [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce, > Did you test with full_page_writes on and off? I didn't use your full_page_writes version because Tom said it was problematic. This is CVS from July 3rd. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus wrote: > Bruce, > > > Did you test with full_page_writes on and off? > > I didn't use your full_page_writes version because Tom said it was > problematic. This is CVS from July 3rd. I think we need those tests run. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce, > I think we need those tests run. Sure. What CVS day should I grab? What's the option syntax? ( -c full_page_writes=false)? I have about 20 tests in queue right now but can stack yours up behind them. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus wrote: > Bruce, > > > I think we need those tests run. > > Sure. What CVS day should I grab? What's the option syntax? ( -c > full_page_writes=false)? Yes. You can grab any from the day Tom fixed it, which was I think two weeks ago. > I have about 20 tests in queue right now but can stack yours up behind > them. Great. That would seem to be the definative way to test it. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Josh Berkus <josh@agliodbs.com> writes: > Bruce, >> Did you test with full_page_writes on and off? > I didn't use your full_page_writes version because Tom said it was > problematic. This is CVS from July 3rd. We already know the results: should be equivalent to the hack Josh tried first. So what we know at this point is that dumping full pages into WAL is expensive, and that the CRC calculation cost is not the main expense. (I suppose that this indicates the reduction to 32-bit CRC was helpful, because previous measurements sure suggested that CRC costs were important ...) What we still don't know is exactly where the main expense *is*. Is it I/O, WAL buffer lock contention, or what? I think the next step is to vary the WAL-related parameters (wal_buffers, wal_sync_method, and fsync) and see if we can learn anything that way. It's entirely plausible that the optimal values for those have changed due to our recent hacking. (Note: turning off fsync is of course not a production option, but it would be helpful to try it here --- it should give us a reading on whether disk I/O wait is the culprit or not.) I'd recommend that you *not* update to CVS tip, as that wouldn't accomplish much except call into question any comparisons to the runs you've already done. There are a number of unrelated changes this month that could have side-effects on performance. regards, tom lane
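The combinations Tom suggests, expressed as postgresql.conf settings; the particular values are only examples for a test matrix, and fsync = false is diagnostic only, never for production:

    wal_buffers = 8                  # baseline used in the runs so far
    #wal_buffers = 100               # or 1000, per the suggestion above
    #wal_sync_method = open_sync     # also try fsync, fdatasync, open_datasync
    #fsync = false                   # test only: isolates disk I/O wait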
Josh Berkus <josh@agliodbs.com> writes: > I think this test run http://khack.osdl.org/stp/302903/results/0/, with a > 30-min checkpoint shows pretty clearly that the behavior of the > performance drop is consistent with needing to "re-prime" the WAL with > full page images. Each checkpoint drops performance abruptly, and then > slowly recovers until the next checkpoint. A 30-min checkpoint means that fsyncs will be happening on up to 30 minutes of i/o on each database file. It could be the full page images that are slowing it down, or it could just be that the system is swamped with i/o that's been put off for the last 30 minutes. It's not nearly as bad in this case as it has been in the previous samples, since at least this test runs for 5 full checkpoint cycles, but it's still fairly unrealistic. And that last checkpoint, which accounts for almost 20% of the i/o of the test, is only partially included in the timing data. For any benchmarking to be meaningful you have to set the checkpoint interval to something more realistic. Something like 5 minutes. That way when the final checkpoint cycle isn't completely included in the timing data you'll at least be missing a statistically insignificant portion of the work. -- greg
Josh Berkus <josh@agliodbs.com> writes: > Looks like the CRC calculation work isn't the issue. I did test runs of > no-CRC vs. regular DBT2 with different checkpoint timeouts, and didn't > discern any statistical difference. See attached spreadsheet chart (the > two different runs are on two different machines). Um, where are the test runs underlying this spreadsheet? I don't have a whole lot of confidence in looking at full-run average TPM numbers to discern whether transient dropoffs in TPM are significant or not. regards, tom lane
Greg, > For any benchmarking to be meaningful you have to set the checkpoint > interval to something more realistic. Something like 5 minutes. That way > when the final checkpoint cycle isn't completely included in the timing > data you'll at least be missing a statistically insignificant portion of > the work. Look at the PDF I sent. This was run with 5, 10, and 30 minutes. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Tom, > Um, where are the test runs underlying this spreadsheet? I don't have a > whole lot of confidence in looking at full-run average TPM numbers to > discern whether transient dropoffs in TPM are significant or not. Web in the form of: http://khack.osdl.org/stp/#test_number#/ Where #test_number# is: Machine0, no patch: 302904 302905 302906 Machine0, patch: 301901 302902 302903 Machine2, no patch: 302910 302911 302912 Machine2, patch: 301907 302908 302909 BTW, I am currently doing a wal_buffers scalability run. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Greg Stark <gsstark@mit.edu> writes: > For any benchmarking to be meaningful you have to set the checkpoint interval > to something more realistic. Something like 5 minutes. That way when the final > checkpoint cycle isn't completely included in the timing data you'll at least > be missing a statistically insignificant portion of the work. This isn't about benchmarking --- or at least, I don't put any stock in the average NOTPM values for the long-checkpoint-interval runs. What we want to understand is why there's a checkpoint-triggered performance dropoff that (appears to) last longer than the checkpoint itself. If we can fix that, it should have beneficial impact on real-world cases. But we do not have to, and should not, restrict ourselves to real-world test cases while trying to figure out what's going on. regards, tom lane
Josh Berkus <josh@agliodbs.com> writes: >> Um, where are the test runs underlying this spreadsheet? I don't have a >> whole lot of confidence in looking at full-run average TPM numbers to >> discern whether transient dropoffs in TPM are significant or not. > Web in the form of: > http://khack.osdl.org/stp/#test_number#/ > Where #test_number# is: > Machine0, no patch: > 302904 > 302905 > 302906 > Machine0, patch: > 301901 > 302902 > 302903 > Machine2, no patch: > 302910 > 302911 > 302912 > Machine2, patch: > 301907 > 302908 > 302909 Hmm. Eyeballing the NOTPM trace for cases 302912 and 302909, it sure looks like the post-checkpoint performance recovery is *slower* in the latter. And why is 302902 visibly slower overall than 302905? I thought for a bit that you had gotten "patch" vs "no patch" backwards, but the oprofile results linked to these pages look right: XLogInsert takes significantly more time in the "no patch" cases. There's something awfully weird going on here. I was prepared to see no statistically-significant differences, but not multiple cases that seem to be going the "wrong direction". BTW, I'd like to look at 302906, but its [Details] link is broken. regards, tom lane
Tom, > There's something awfully weird going on here. I was prepared to see > no statistically-significant differences, but not multiple cases that > seem to be going the "wrong direction". There's a lot of variance in the tests. I'm currently running a variance test battery on one machine to figure out why the results become so variable when checkpointing is < 1 hour. Actually, to cover all which is currently running: machine1 & 3 wal_buffers scaling test machine2 STP variance test machine4 full_page_writes=false test All of these should wind up in about 5 days, provided the STP doesn't crash (not an insignificant concern). -- --Josh Josh Berkus Aglio Database Solutions San Francisco
On Fri, 22 Jul 2005 19:11:36 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote: > BTW, I'd like to look at 302906, but its [Details] link is broken. Ugh, I tried digging into the internal systems and it looks like they were destroyed (or not saved) somehow. It'll have to be rerun. Sorry... Mark
On Fri, 2005-07-22 at 19:11 -0400, Tom Lane wrote: > Hmm. Eyeballing the NOTPM trace for cases 302912 and 302909, it sure > looks like the post-checkpoint performance recovery is *slower* in > the latter. And why is 302902 visibly slower overall than 302905? > I thought for a bit that you had gotten "patch" vs "no patch" backwards, > but the oprofile results linked to these pages look right: XLogInsert > takes significantly more time in the "no patch" cases. > > There's something awfully weird going on here. I was prepared to see > no statistically-significant differences, but not multiple cases that > seem to be going the "wrong direction". All of the tests have been performed with wal_buffers = 8, so there will be massive contention for those buffers, leading to increased I/O... All of the tests show that there is a CPU utilisation drop, and an I/O wait increase immediately following checkpoints. When we advance the insert pointer and a wal_buffer still needs writing, we clean it by attempting to perform an I/O while holding WALInsertLock. Very probably the WALWriteLock is currently held, so we wait on the WALWriteLock and everybody else waits on us. Normally, it's fairly hard for that to occur since we attempt to XLogWrite when wal_buffers are more than half full, but we do this with a conditional acquire, so when we're busy we just keep filling up wal_buffers. Normally, that's OK. When we have a checkpoint, almost every xlog write has at least a whole block appended to it. So we can easily fill up wal_buffers very quickly while WALWriteLock is held. Once there is no space available, we then effectively halt all transactions while we write out that buffer. My conjecture is that the removal of the CPU bottleneck has merely moved the problem by allowing users to fill wal buffers faster and go into a wait state quicker than they did before. The beneficial effect of the conditional acquire when wal_buffers is full never occurs, and performance drops. We should run tests with much higher wal_buffers numbers to nullify the effect described above and reduce contention. That way we will move towards the log disk speed being the limiting factor, patch or no patch. So, I think Tom's improvement of CRC/hole compression will prove itself when we have higher values of wal_buffers. Best Regards, Simon Riggs
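A toy model of the policy Simon is describing, with pthread mutexes standing in for WALInsertLock and WALWriteLock; this is a paraphrase for illustration, not the actual xlog.c code:

    /*
     * Toy model: try a non-blocking write-out when the WAL buffers are
     * half full, but block and do the I/O ourselves, still holding the
     * insert lock, once they are completely full.
     */
    #include <pthread.h>

    static pthread_mutex_t insert_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t write_lock  = PTHREAD_MUTEX_INITIALIZER;

    static int  used_buffers, total_buffers = 8;
    static void write_out_buffers(void) { used_buffers = 0; }  /* stand-in for I/O */

    void
    xlog_insert_model(void)
    {
        pthread_mutex_lock(&insert_lock);
        used_buffers++;                      /* the record being inserted */

        if (used_buffers >= total_buffers)
        {
            /* No free buffer: everyone now waits on us while we do I/O. */
            pthread_mutex_lock(&write_lock);
            write_out_buffers();
            pthread_mutex_unlock(&write_lock);
        }
        else if (used_buffers > total_buffers / 2 &&
                 pthread_mutex_trylock(&write_lock) == 0)
        {
            /* Half full: opportunistic write-out, but only if nobody
             * else already holds the write lock. */
            write_out_buffers();
            pthread_mutex_unlock(&write_lock);
        }

        pthread_mutex_unlock(&insert_lock);
    }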
Simon, > We should run tests with much higher wal_buffers numbers to nullify the > effect described above and reduce contention. That way we will move > towards the log disk speed being the limiting factor, patch or no patch. I've run such tests, at a glance they do seem to improve performance. I need some time to collate the results. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: >> We should run tests with much higher wal_buffers numbers to nullify the >> effect described above and reduce contention. That way we will move >> towards the log disk speed being the limiting factor, patch or no patch. > I've run such tests, at a glance they do seem to improve performance. I > need some time to collate the results. With larger wal_buffers values it might also be interesting to take some measures to put a larger share of the WAL writing burden on the bgwriter. Currently the bgwriter only writes out WAL buffers in two scenarios: 1. It wants to write a dirty shared buffer that has LSN beyond the current WAL flush marker. Just like any backend, the bgwriter must flush WAL as far as the LSN before writing the buffer. 2. The bgwriter is completing a checkpoint. It must flush WAL as far as the checkpoint record before updating pg_control. It might be interesting to add some logic to explicitly check for and write out any full-but-unwritten WAL buffers during the bgwriter's main loop. In a scenario with many small transactions, this is probably a waste of effort since backends will be forcing WAL write/flush any time they commit. (This is why I haven't pursued the idea already.) However, given a large transaction and adequate wal_buffer space, such a tactic should offload WAL writing work nicely. I have no idea whether the DBT benchmarks would benefit at all, but given that they are affected positively by increasing wal_buffers, they must have a fair percentage of not-small transactions. regards, tom lane
Tom, > I have no idea whether the DBT benchmarks would benefit at all, but > given that they are affected positively by increasing wal_buffers, > they must have a fair percentage of not-small transactions. Even if they don't, we'll have series tests for DW here at GreenPlum soon, and I'll bet they'd affect *those*. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
On Tue, 2005-07-26 at 19:15 -0400, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > >> We should run tests with much higher wal_buffers numbers to nullify the > >> effect described above and reduce contention. That way we will move > >> towards the log disk speed being the limiting factor, patch or no patch. > > > I've run such tests, at a glance they do seem to improve performance. I > > need some time to collate the results. > > With larger wal_buffers values it might also be interesting to take some > measures to put a larger share of the WAL writing burden on the bgwriter. > > Currently the bgwriter only writes out WAL buffers in two scenarios: > > 1. It wants to write a dirty shared buffer that has LSN beyond the > current WAL flush marker. Just like any backend, the bgwriter must > flush WAL as far as the LSN before writing the buffer. > > 2. The bgwriter is completing a checkpoint. It must flush WAL as far as > the checkpoint record before updating pg_control. > > It might be interesting to add some logic to explicitly check for and > write out any full-but-unwritten WAL buffers during the bgwriter's > main loop. > > In a scenario with many small transactions, this is probably a waste of > effort since backends will be forcing WAL write/flush any time they > commit. (This is why I haven't pursued the idea already.) However, > given a large transaction and adequate wal_buffer space, such a tactic > should offload WAL writing work nicely. > > I have no idea whether the DBT benchmarks would benefit at all, but > given that they are affected positively by increasing wal_buffers, > they must have a fair percentage of not-small transactions. Yes, I was musing on that also. I think it would help keep response time even, which seems to be the route to higher performance anyway. This is more important in real world than in benchmarks, where a nice even stream of commits arrives to save the day... I guess I'd be concerned that the poor bgwriter can't do all of this work. I was thinking about a separate log writer, so we could have both bgwriter and logwriter active simultaneously on I/O. It has taken a while to get bgwriter to perform its duties efficiently, so I'd rather not give it so many that it performs them all badly. The logwriter would be more of a helper, using LWLockConditionalAcquire to see if the WALWriteLock was kept active. Each backend would still perform its own commit write. (We could change that in the future, but thats a lot more work.) We would only need one new GUC log_writer_delay, defaulting to 50 ms (??) - if set to zero, the default, then we don't spawn a logwriter daemon at all. (Perhaps we also need another one to say how many blocks get written each time its active... but I'm not hugely in favour of more parameters to get wrong). That way we could take the LWLockConditionalAcquire on WALWriteLock out of the top of XLogInsert, which was effectively doing that work. I think this would also reduce the apparent need for high wal_buffer settings - probably could get away with a lot less than the 2048 recent performance results would suggest. Best Regards, Simon Riggs
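A sketch of what the helper's main loop might look like under this proposal. The stub functions stand in for backend calls (the conditional acquire of WALWriteLock, writing out completed WAL buffers), and log_writer_delay is the hypothetical GUC; none of this is existing code:

    /* Sketch only: the stubs below stand in for backend internals. */
    #include <stdbool.h>
    #include <unistd.h>

    static int  log_writer_delay = 50;            /* ms; 0 = no daemon */
    static bool try_lock_walwrite(void) { return true; }  /* stand-in for
                                              LWLockConditionalAcquire */
    static void unlock_walwrite(void)   { }
    static void write_completed_wal_buffers(void) { }     /* stand-in for
                                              writing full, unwritten buffers */

    void
    log_writer_main_loop(void)
    {
        for (;;)
        {
            usleep(log_writer_delay * 1000);

            /* Never fight a backend that is already flushing WAL;
             * just help out when the lock happens to be free. */
            if (try_lock_walwrite())
            {
                write_completed_wal_buffers();
                unlock_walwrite();
            }
        }
    }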
Simon, > I guess I'd be concerned that the poor bgwriter can't do all of this > work. I was thinking about a separate log writer, so we could have both > bgwriter and logwriter active simultaneously on I/O. It has taken a > while to get bgwriter to perform its duties efficiently, so I'd rather > not give it so many that it performs them all badly. Speaking of which, now that I have a target value for wal_buffers, I want to re-test the bgwriter. In my previous tests, varying the bgwriter settings on DBT2 had no discernable effect. -- --Josh Josh Berkus Aglio Database Solutions San Francisco