Thread: Checkpoint cost, looks like it is WAL/CRC
Tom, All: Ok, finally managed through the persistent efforts of Mark Wong to get some tests through. Here are two tests with the CRC and WAL buffer checking completely cut out of the code, as Tom suggested: 5-min checkpoint: http://khack.osdl.org/stp/302738/results/0/ http://khack.osdl.org/stp/302706/results/0/ 60-min checkpoint: http://khack.osdl.org/stp/302739/results/0/ (please note that OSDL is having technical difficulties and some links may not work) This is the performance profile I'd expect and want to see, and the frequency of checkpoints doesn't affect the overall performance at all. Contrast it with these: 5-min checkpoint: http://khack.osdl.org/stp/302671/results/0/ I don't have a 60-minute checkpoint for comparison because of failures on the STP :-( So, now that we know what the performance bottleneck is, how do we fix it? -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: > Ok, finally managed through the persistent efforts of Mark Wong to get some > tests through. Here are two tests with the CRC and WAL buffer checking > completely cut out of the code, as Tom suggested: Uh, what exactly did you cut out? I suggested dropping the dumping of full page images, but not removing CRCs altogether ... regards, tom lane
Tom, > Uh, what exactly did you cut out? I suggested dropping the dumping of > full page images, but not removing CRCs altogether ... Attached is the patch I used. (it's a -Urn patch 'cause that's what STP takes) -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: >> Uh, what exactly did you cut out? I suggested dropping the dumping of >> full page images, but not removing CRCs altogether ... > Attached is the patch I used. OK, thanks for the clarification. So it does seem that dumping full page images is a pretty big hit these days. (In defense of the original idea, I believe it was not such a hit at the time --- but as we continue to improve performance, things that weren't originally at the top of the profile become significant.) It seems like we have two basic alternatives: 1. Offer a GUC to turn off full-page-image dumping, which you'd use only if you really trust your hardware :-( 2. Think of a better defense against partial-page writes. I like #2, or would if I could think of a better defense. Ideas anyone? regards, tom lane
Tom, > 1. Offer a GUC to turn off full-page-image dumping, which you'd use only > if you really trust your hardware :-( Are these just WAL pages? Or database pages as well? -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: >> 1. Offer a GUC to turn off full-page-image dumping, which you'd use only >> if you really trust your hardware :-( > Are these just WAL pages? Or database pages as well? Database pages. The current theory is that we can completely reconstruct from WAL data every page that's been modified since the last checkpoint. So the first write of any page after a checkpoint dumps a full image of the page into WAL; subsequent writes only write differences. This is nice and secure ... at least when you are using hardware that guarantees write ordering ... otherwise it's probably mostly useless overhead. Still, I'd not like to abandon the contract that if the disk does what it is supposed to do then we will do what we are supposed to. regards, tom lane
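To make the mechanism concrete, here is a minimal sketch of the decision Tom describes, with invented type and function names rather than the real XLogCheckBuffer() code: a page needs a full image in WAL only if its LSN shows it has not been written to WAL since the last checkpoint's redo point.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t Lsn;               /* hypothetical WAL-position type */

    typedef struct Page
    {
        Lsn  lsn;                       /* LSN of the last WAL record that touched this page */
        char data[8192 - sizeof(Lsn)];
    } Page;

    /*
     * First modification of a page after a checkpoint: if the page's LSN is
     * older than the checkpoint's redo pointer, WAL holds no base image for
     * it yet, so a full page image must be dumped.  Later modifications
     * (page->lsn >= redo_ptr) log only the differences.
     */
    static bool
    needs_full_page_image(const Page *page, Lsn redo_ptr)
    {
        return page->lsn < redo_ptr;
    }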
> 2. Think of a better defense against partial-page writes. > > I like #2, or would if I could think of a better defense. > Ideas anyone? FWIW, MSSQL deals with this using "Torn Page Detection". This is off by default (no check at all!), but can be enabled on a per-database level. Note that it only *detects* torn pages. If it finds one, it won't start and will tell you to recover from backup. It can't automatically recover. I would assume this greatly decreases the amount of data you have to save... From the BOL: " This option allows SQL Server to detect incomplete I/O operations caused by power failures or other system outages. When true, it causes a bit to be flipped for each 512-byte sector in an 8-kilobyte (KB) database page whenever the page is written to disk. If a bit is in the wrong state when the page is later read by SQL Server, then the page was written incorrectly; a torn page is detected. Torn pages are usually detected during recovery because any page that was written incorrectly is likely to be read by recovery. Although SQL Server database pages are 8 KB, disks perform I/O operations using a 512-byte sector. Therefore, 16 sectors are written per database page. A torn page can occur if the system fails (for example, due to power failure) between the time the operating system writes the first 512-byte sector to disk and the completion of the 8 KB I/O operation. If the first sector of a database page is successfully written before the failure, the database page on disk will appear as updated, although it may not have succeeded. Using battery-backed disk controller caches can ensure that data is successfully written to disk or not written at all. In this case, do not set torn page detection to true, for it is not needed. " and some FAQs: " Q. Does enabling the "torn page" database option add any measurable performance overhead to a server? A. The torn page option does not add much CPU cost at all, but it can increase contention on "hot" pages. With torn page detection off, a page can be accessed while it is being written to disk. This is not true if torn page detection is on. Q. When does SQL Server check for torn pages? At startup? Any time it reads a page from disk? A. Torn page detection is done whenever a page is read from disk. In practice, this is likely to be during recovery, because any page on which the write did not complete during normal operations is very likely to be read by recovery (except for non-logged operations, such as index creation, bcp, and so on). Q. What happens when SQL Server detects a torn page? A. When a torn page is detected, a severe I/O error is raised. This error will close the connection. The database is only marked suspect if the torn page is detected during recovery. Q. How can I recover from torn pages? A. Restoring the database from a backup and rolling the transaction log forward should correct the problem with no data loss. Q. What situations are most likely to cause torn pages? A. Lab tests have shown that torn pages are quite likely to happen when disk writes are occurring and power is lost. If you do not have a battery backup or uninterruptible power supply (UPS), you should consider enabling this option. " Perhaps a GUC that would let you choose between "full recovery, detection or none at all"? Though that might complicate the code... //Magnus
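For illustration, a toy sketch of the per-sector scheme quoted above -- not actual SQL Server or PostgreSQL code, and it glosses over the space-management problem of reserving a byte in every sector, which comes up later in the thread. Each 512-byte sector of an 8 KB page is stamped with the same toggle value just before the write; a mismatch on read means the write did not complete.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE   8192
    #define SECTOR_SIZE 512
    #define SECTORS     (PAGE_SIZE / SECTOR_SIZE)

    /* Stamp the same toggle value into the last byte of every sector just before writing. */
    static void
    stamp_torn_bits(uint8_t *page, uint8_t toggle)
    {
        for (size_t s = 0; s < SECTORS; s++)
            page[(s + 1) * SECTOR_SIZE - 1] = toggle;
    }

    /* On read (in practice, during recovery): all sectors must agree. */
    static bool
    page_is_torn(const uint8_t *page)
    {
        uint8_t expected = page[SECTOR_SIZE - 1];

        for (size_t s = 1; s < SECTORS; s++)
            if (page[(s + 1) * SECTOR_SIZE - 1] != expected)
                return true;        /* sectors disagree: the 8 KB write was incomplete */
        return false;
    }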
Tom, > Database pages. The current theory is that we can completely > reconstruct from WAL data every page that's been modified since the > last checkpoint. So the first write of any page after a checkpoint > dumps a full image of the page into WAL; subsequent writes only write > differences. What I'm confused about is that this shouldn't be anything new for 8.1. Yet 8.1 has *worse* performance on the STP machines than 8.0 does, and it's pretty much entirely due to this check. > This is nice and secure ... at least when you are using hardware that > guarantees write ordering ... otherwise it's probably mostly useless > overhead. Still, I'd not like to abandon the contract that if the disk > does what it is supposed to do then we will do what we are supposed to. Given the huge performance difference (30%), I think we have to give an option to turn it off. So DBAs whose machines are in danger of being shut off a lot can have it on, and the more performance-sensitive can turn it off. One thing I am confused about, though: if the whole pages are in the database, why do we need a full copy in WAL instead of just the diffs? -- Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: > What I'm confused about is that this shouldn't be anything new for 8.1. Yet > 8.1 has *worse* performance on the STP machines than 8.0 does, and it's > pretty much entirely due to this check. That's simply not believable --- better recheck your analysis. If 8.1 is worse it's not because of page-dumping, because we are more efficient on that than before not less so. Perhaps there's another issue? regards, tom lane
Tom, > > What I'm confused about is that this shouldn't be anything new for > > 8.1. Yet 8.1 has *worse* performance on the STP machines than 8.0 > > does, and it's pretty much entirely due to this check. > > That's simply not believable --- better recheck your analysis. If 8.1 > is worse it's not because of page-dumping, because we are more efficient > on that than before not less so. Perhaps there's another issue? Ach, I want to run head-to-head tests, but the system is down again. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
""Magnus Hagander"" <mha@sollentuna.net> writes > > FWIW, MSSQL deals with this using "Torn Page Detection". This is off by > default (no check at all!), but can be abled on a per-database level. > Note that it only *detects* torn pages. If it finds one, it won't start > and tell you to recover from backup. It can't automatically recover. I > would assume this greatly decreases the amount of data you have to > save... > After reading the long discussion on torn page detection and many related issues in doc/TODO.details/wal, I believe we need carefully analysis what the sequence of a partial write. case 1: if a partial write just write some sectors of a 8K page, then it is ok - assuming we will replay xlog in any ways without comparing page header LSN and xlog LSN (currently we do not do so); case 2: if a patial write destroied some old data of a page, which are not in our xlog, we will worry about this; So the main issue is case2. If OS could schedule atomic sector-wise write , so case2 could never happen? Regards, Qingqing
Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > >> Uh, what exactly did you cut out? I suggested dropping the dumping of > >> full page images, but not removing CRCs altogether ... > > > Attached is the patch I used. > > OK, thanks for the clarification. So it does seem that dumping full > page images is a pretty big hit these days. (In defense of the original > idea, I believe it was not such a hit at the time --- but as we continue > to improve performance, things that weren't originally at the top of the > profile become significant.) > > It seems like we have two basic alternatives: > > 1. Offer a GUC to turn off full-page-image dumping, which you'd use only > if you really trust your hardware :-( > > 2. Think of a better defense against partial-page writes. > > I like #2, or would if I could think of a better defense. Ideas anyone? I have an idea! Currently we write the backup pages (copies of pages modified since the last checkpoint) when we write the WAL changes as part of the commit. See the XLogCheckBuffer() call in XLogInsert(). However, it seems that this is not the most efficient time to be writing the backup pages. It would be more efficient to write these just before the page is written to disk (not when it is modified in the buffer cache). This would allow the background writer to be writing most of the backup pages, rather than the transaction committer. A further optimization would be to write the backup pages to the same 8k file over and over again rather than adding I/O to pg_xlog WAL files. Comments? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > I have an idea! Currently we write the backup pages (copies of pages > modified since the last checkpoint) when we write the WAL changes as > part of the commit. See the XLogCheckBuffer() call in XLogInsert(). Can someone explain exactly what problem is being defeated by writing whole pages to the WAL log? Like, if a page is half flushed to disk and contains half the old data and half of the new data, what exactly would go wrong with recovery? When postgres sees the write, why would it have trouble recovering the correct contents for the page based on the half-written page and the regular WAL entry? > A further optimization would be to write the backup pages to the same 8k > file over and over again rather than adding I/O to pg_xlog WAL files. This is an interesting idea. But it's not necessarily universally a win. For users with a battery-backed cache it would be absolutely ideal and a huge win. It would avoid pushing other pages out of cache and avoid doing extra i/o syncing the no longer relevant data to the actual disk. But for users without a battery-backed cache I don't think it would be a win. It would mean another data file that ideally would be on a segregated set of spindles to avoid any unnecessary seeks. Currently the ideal postgres setup has WAL on one set of spindles, data on another set of spindles, and OS writable partitions like /var on an entirely separate set of spindles. That takes at least 6 drives if you want mirrored drives. This would raise the minimum to 8 drives. -- greg
On Sun, 3 Jul 2005 04:47 pm, Greg Stark wrote: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > I have an idea! Currently we write the backup pages (copies of pages > > modified since the last checkpoint) when we write the WAL changes as > > part of the commit. See the XLogCheckBuffer() call in XLogInsert(). > > Can someone explain exactly what problem is being defeated by writing whole > pages to the WAL log? Like, if a page is half flushed to disk and contains half > the old data and half of the new data, what exactly would go wrong with > recovery? When postgres sees the write, why would it have trouble recovering > the correct contents for the page based on the half-written page and the > regular WAL entry? The problem we are solving: a checkpoint happens (ensuring that ALL data makes it to the disk) - this means we don't need the information in WAL before the checkpoint. We write some updates to the db, changing some pages. The power is pulled and one page is half written. There is no way to know if the page is in a valid state. There is no way to recover the page from WAL, because unless we wrote the full page into WAL, we don't have a "starting point" for modification. That's probably very unclear but that is the idea. Regards Russell Smith
Greg Stark <gsstark@mit.edu> writes: > Can someone explain exactly what problem is being defeated by writing whole > pages to the WAL log? Partial writes. Without the full-page image, we do not have enough information in WAL to reconstruct the correct page contents. >> A further optimization would be to write the backup pages to the same 8k >> file over and over again rather than adding I/O to pg_xlog WAL files. How does that work, and why is it a win compared to doing the same amount of I/O to WAL? regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Greg Stark <gsstark@mit.edu> writes: > > Can someone explain exactly what problem is being defeated by writing whole > > pages to the WAL log? > > Partial writes. Without the full-page image, we do not have enough > information in WAL to reconstruct the correct page contents. Sure, but why not? If an 8k page contains 16 low-level segments on disk and the old data is AAAAAAAAAAAAAAAA and the new data is AAABAAACAAADAAAE then the WAL would contain the B, C, D, and E. Shouldn't that be enough to reconstruct the page? If the actual write only got out AAABAAACAAAAAAAA the resulting page is garbage but why isn't what the WAL has enough information to reconstruct the new version? I do worry that a drive could write AAABAAACXYZKWBFH if it loses power in the middle of the write, though. That would be bad. -- greg
Greg Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> Partial writes. Without the full-page image, we do not have enough >> information in WAL to reconstruct the correct page contents. > Sure, but why not? > If a 8k page contains 16 low level segments on disk and the old data is > AAAAAAAAAAAAAAAA and the new data is AAABAAACAAADAAAE then the WAL would > contain the B, C, D, and E. Shouldn't that be enough to reconstruct the page? It might contain parts of it ... scattered across a large number of WAL entries ... but I don't think that's enough to reconstruct the page. As an example, a btree insert WAL record will say "insert this tuple at position N, shifting the other entries accordingly"; that does not give you the ability to reconstruct entries that shifted across sector boundaries, as they may not be present in the on-disk data of either sector. You're also going to have fairly serious problems interpreting the page contents if what's on disk includes the effects of multiple WAL records beyond the record you are currently looking at. We could possibly do it if we added more information to the WAL records, but that strikes me as a net loss: essentially it would pay the penalty all the time instead of only on the first write after a checkpoint. Also, you are assuming that the content of each sector is uniquely --- and determinably --- either "old data" or "new data", not for example "unreadable because partially written". regards, tom lane
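To make the objection concrete, here is a toy redo routine (invented names, not real btree code) for a record like "insert this item at position N, shifting the rest": the shift moves whatever bytes are currently on the page, so replaying it against a half-old, half-new torn page silently produces garbage instead of failing.

    #include <stdint.h>
    #include <string.h>

    #define MAX_ITEMS 256

    typedef struct ToyPage
    {
        uint16_t nitems;
        uint32_t items[MAX_ITEMS];  /* stand-in for a line pointer array */
    } ToyPage;

    /*
     * Redo of "insert item at position pos" (assumes there is room on the
     * page).  It depends entirely on the page contents being a coherent
     * starting point; there is no way to tell from the record alone whether
     * the bytes being shifted are old, new, or a torn mixture of both.
     */
    static void
    redo_insert(ToyPage *page, uint16_t pos, uint32_t item)
    {
        memmove(&page->items[pos + 1], &page->items[pos],
                (page->nitems - pos) * sizeof(uint32_t));
        page->items[pos] = item;
        page->nitems++;
    }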
On Wed, 2005-06-29 at 23:23 -0400, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > >> Uh, what exactly did you cut out? I suggested dropping the dumping of > >> full page images, but not removing CRCs altogether ... > > > Attached is the patch I used. > > OK, thanks for the clarification. So it does seem that dumping full > page images is a pretty big hit these days. Yes the performance results are fairly damning. That's a shame, I convinced myself that the CRC32 and block-hole compression was enough. The 50% performance gain isn't the main thing for me. The 10 sec drop in response time immediately after checkpoint is the real issue. Most sites are looking for good response as an imperative, rather than throughput. Overall, IMHO we must do something about this for 8.1. Possibly something more for 8.2 also, but definitely *something* now. > (In defense of the original > idea, I believe it was not such a hit at the time --- but as we continue > to improve performance, things that weren't originally at the top of the > profile become significant.) No defense required. As you say, it was the best idea at the time. > It seems like we have two basic alternatives: > > 1. Offer a GUC to turn off full-page-image dumping, which you'd use only > if you really trust your hardware :-( > > 2. Think of a better defense against partial-page writes. > > I like #2, or would if I could think of a better defense. Ideas anyone? Well, I'm all for #2 if we can think of one that will work. I can't. Option #1 seems like the way forward, but I don't think it is sufficiently safe just to have the option to turn things off. With wal_changed_pages= off *any* crash would possibly require an archive recovery, or a replication rebuild. It's good that we now have PITR, but we do also have other options for availability. Users of replication could well be amongst the first to try out this option. The problem is that you just wouldn't *know* whether that "possibly" was a yes or a no. The temptation would be to assume "no" and just continue, which could lead to data loss. And that would lead to a lack of trust in PostgreSQL and eventual reputational loss. Would I do an archive recovery, or would I trust that the RAID array had written everything properly? With an irate Web Site Manager saying "you think? it might? maybe? You mean you don't know???" If we pick option #1, it *must* also include a way of deciding whether a crash recovery has succeeded, or not. Other commentators have mentioned the torn_pages option for other DBMSs. It seems we also need an indicator. That should be a non-optional feature of 8.1 and then perhaps an optional feature in later releases when we have more experience to say that turning it off is OK in some circumstances. We could implement the torn-pages option, but that seems a lot of work. Another way of implementing a tell-tale would be to append the LSN again as a data page trailer as the last 4 bytes of the page. Thus the LSN would be both the first and last item on the data page. Any partial write would fail to update the LSN trailer and we would be able to see that a page was torn. That's considerably easier than trying to write a torn page tell-tale to each 512 byte sector of a page as SQLServer does. During recovery, if a full page image is not available, we would read the page from the database and check that the first and last LSNs match. If they do, then the page is not torn and recovery can be successful.
If they do not match, then we attempt to continue recovery, but issue a warning that torn page has been detected and a full archive recovery is recommended. It is likely that the recovery itself will fail almost immediately following this, since changes will try to be made to a page in the wrong state to receive it, but there's no harm in trying.... Like this specific idea or not, I'm saying that we need a tell-tale: a way of knowing whether we have a torn page, or not. That way we can safely continue to rely upon crash recovery. Tom, I think you're the only person that could or would be trusted to make such a change. Even past the 8.1 freeze, I say we need to do something now on this issue. Best Regards, Simon Riggs
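A minimal sketch of the head/tail LSN tell-tale Simon describes, with made-up names: the page's LSN is duplicated into the last 4 bytes just before the page is written, and recovery treats a mismatch as a torn page. (As later replies point out, sectors can be written out of order, so a matching pair does not prove the middle of the page is intact.)

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 8192

    /* Just before the page is written out, copy the header LSN into a 4-byte trailer. */
    static void
    stamp_lsn_trailer(uint8_t *page)
    {
        uint32_t lsn;

        memcpy(&lsn, page, sizeof(lsn));                        /* LSN at the start of the page */
        memcpy(page + PAGE_SIZE - sizeof(lsn), &lsn, sizeof(lsn));
    }

    /* Recovery-time check, used when no full page image is available in WAL. */
    static bool
    page_looks_torn(const uint8_t *page)
    {
        uint32_t head, tail;

        memcpy(&head, page, sizeof(head));
        memcpy(&tail, page + PAGE_SIZE - sizeof(tail), sizeof(tail));
        return head != tail;        /* trailer never got updated: partial write */
    }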
Simon Riggs wrote: > On Wed, 2005-06-29 at 23:23 -0400, Tom Lane wrote: > > Josh Berkus <josh@agliodbs.com> writes: > > >> Uh, what exactly did you cut out? I suggested dropping the dumping of > > >> full page images, but not removing CRCs altogether ... > > > > > Attached is the patch I used. > > > > OK, thanks for the clarification. So it does seem that dumping full > > page images is a pretty big hit these days. > > Yes the performance results are fairly damning. That's a shame, I > convinced myself that the CRC32 and block-hole compression was enough. > > The 50% performance gain isn't the main thing for me. The 10 sec drop in > response time immediately after checkpoint is the real issue. Most sites > are looking for good response as an imperative, rather than throughput. Yep. > No defense required. As you say, it was the best idea at the time. > > > It seems like we have two basic alternatives: > > > > 1. Offer a GUC to turn off full-page-image dumping, which you'd use only > > if you really trust your hardware :-( > > > > 2. Think of a better defense against partial-page writes. > > > > I like #2, or would if I could think of a better defense. Ideas anyone? > > Well, I'm all for #2 if we can think of one that will work. I can't. > > Option #1 seems like the way forward, but I don't think it is > sufficiently safe just to have the option to turn things off. Well, I added #1 yesterday as 'full_page_writes', and it has the same warnings as fsync (namely, on crash, be prepared to recover or check your system thoroughly). As far as #2, my posted proposal was to write the full pages to WAL when they are written to the file system, and not when they are first modified in the shared buffers --- the goal being that it will even out the load, and it will happen in a non-critical path, hopefully by the background writer or at checkpoint time. > With wal_changed_pages= off *any* crash would possibly require an > archive recovery, or a replication rebuild. It's good that we now have > PITR, but we do also have other options for availability. Users of > replication could well be amongst the first to try out this option. Seems it is similar to fsync in risk, which is not a new option. > The problem is that you just wouldn't *know* whether that "possibly" was > a yes or a no. The temptation would be to assume "no" and just continue, > which could lead to data loss. And that would lead to a lack of trust in > PostgreSQL and eventual reputational loss. Would I do an archive > recovery, or would I trust that the RAID array had written everything > properly? With an irate Web Site Manager saying "you think? it might? > maybe? You mean you don't know???" That is a serious problem, but the same problem we have in turning off fsync. > During recovery, if a full page image is not available, we would read > the page from the database and check that the first and last LSNs match. > If they do, then the page is not torn and recovery can be successful. If > they do not match, then we attempt to continue recovery, but issue a > warning that torn page has been detected and a full archive recovery is > recommended. It is likely that the recovery itself will fail almost > immediately following this, since changes will try to be made to a page > in the wrong state to receive it, but there's no harm in trying.... I like the idea of checking the page during recovery so we don't have to check all the pages, just certain pages.
> Like this specific idea or not, I'm saying that we need a tell-tale: a > way of knowing whether we have a torn page, or not. That way we can > safely continue to rely upon crash recovery. > > Tom, I think you're the only person that could or would be trusted to > make such a change. Even past the 8.1 freeze, I say we need to do > something now on this issue. I think if we document full_page_writes as similar to fsync in risk, we are OK for 8.1, but if something can be done easily, it sounds good. Now that we have a GUC we can experiment with the full page write load and see how it can be improved. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Wed, 2005-07-06 at 18:22 -0400, Bruce Momjian wrote: > Well, I added #1 yesterday as 'full_page_writes', and it has the same > warnings as fsync (namely, on crash, be prepared to recovery or check > your system thoroughly. Yes, which is why I comment now that the GUC alone is not enough. There is no way to "check your system thoroughly". If there is a certain way of knowing torn pages had *not* occurred, then I would be happy. > As far as #2, my posted proposal was to write the full pages to WAL when > they are written to the file system, and not when they are first > modified in the shared buffers --- the goal being that it will even out > the load, and it will happen in a non-critical path, hopefully by the > background writer or at checkpoint time. The page must be written before the changes to the page are written, so that they are available sequentially in the log for replay. The log and the database are not connected, so we cannot do it that way. If the page is written out of sequence from the changes to it, how would recovery know where to get the page from? ISTM there is mileage in your idea of trying to shift the work to another time. My thought is "which blocks exactly are the ones being changed?". Maybe that would lead to a reduction. > > With wal_changed_pages= off *any* crash would possibly require an > > archive recovery, or a replication rebuild. It's good that we now have > > PITR, but we do also have other options for availability. Users of > > replication could well be amongst the first to try out this option. > > Seems it is similar to fsync in risk, which is not a new option. Risk is not acceptable. We must have certainty, either way. Why have two GUCs? Why not just have one GUC that does both at the same time? When would you want one but not the other? risk_data_loss_to_gain_performance = true > I think if we document full_page_writes as similar to fsync in risk, we > are OK for 8.1, but if something can be done easily, it sounds good. Documenting something simply isn't enough. I simply cannot advise anybody ever to use the new GUC. If their data was low value, they wouldn't even be using PostgreSQL, they'd use a non-transactional DBMS. I agree we *must* have the GUC, but we also *must* have a way for crash recovery to tell us for certain that it has definitely worked, not just maybe worked. Best regards, Simon Riggs
>>Tom, I think you're the only person that could or would be trusted to >>make such a change. Even past the 8.1 freeze, I say we need to do >>something now on this issue. > > > I think if we document full_page_writes as similar to fsync in risk, we > are OK for 8.1, but if something can be done easily, it sounds good. > > Now that we have a GUC we can experiment with the full page write load > and see how it can be improved. Question: with this option, if the power goes out, will I just roll through the transaction logs like normal? Or are we talking about the potential of having to use something like pg_resetxlog or similar? If it just means rolling through the transaction logs then I have no problem with it; let the user decide the level of reliance they have. If it can cause actual, need-to-restore-from-backup-level damage then it is a literal no-go IMHO. Sincerely, Joshua D. Drake > -- Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240 PostgreSQL Replication, Consulting, Custom Programming, 24x7 support Managed Services, Shared and Dedicated Hosting Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/
Simon Riggs wrote: > I agree we *must* have the GUC, but we also *must* have a way for crash > recovery to tell us for certain that it has definitely worked, not just > maybe worked. Doesn't the same argument apply to the existing fsync = off case? i.e. we already have a case where we don't provide a crash-recovery guarantee. -O
Simon Riggs <simon@2ndquadrant.com> writes: > On Wed, 2005-07-06 at 18:22 -0400, Bruce Momjian wrote: >> Well, I added #1 yesterday as 'full_page_writes', and it has the same >> warnings as fsync (namely, on crash, be prepared to recovery or check >> your system thoroughly. > Yes, which is why I comment now that the GUC alone is not enough. > There is no way to "check your system thoroughly". If there is a certain > way of knowing torn pages had *not* occurred, then I would be happy. I agree with Simon that this isn't much of a solution: no one who cares about their data will dare turn off the option, and therefore the possible performance gain is just hot air. I do not see a better alternative at the moment :-( but we should keep thinking about it. regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes: > As far as #2, my posted proposal was to write the full pages to WAL when > they are written to the file system, and not when they are first > modified in the shared buffers --- That is *completely* unworkable. Or were you planning to abandon the promise that a transaction is committed when we have flushed its WAL commit record? > Seems it is similar to fsync in risk, which is not a new option. The point here is that fsync-off is only realistic for development or playpen installations. You don't turn it off in a production machine, and I can't see that you'd turn off the full-page-write option either. So we have not solved anyone's performance problem. regards, tom lane
On Wed, Jul 06, 2005 at 21:48:44 +0100, Simon Riggs <simon@2ndquadrant.com> wrote: > > We could implement the torn-pages option, but that seems a lot of work. > Another way of implementing a tell-tale would be to append the LSN again > as a data page trailer as the last 4 bytes of the page. Thus the LSN > would be both the first and last item on the data page. Any partial > write would fail to update the LSN trailer and we would be able to see > that a page was torn. That's considerably easier than trying to write a > torn page tell-tale to each 512 byte sector of a page as SQLServer does. Are you sure about that? That would probably be the normal case, but are you promised that the hardware will write all of the sectors of a block in order?
Bruno Wolff III <bruno@wolff.to> writes: > Are you sure about that? That would probably be the normal case, but are > you promised that the hardware will write all of the sectors of a block > in order? I don't think you can possibly assume that. If the block crosses a cylinder boundary then it's certainly an unsafe assumption, and even within a cylinder (no seek required) I'm pretty sure that disk drives have understood "write the next sector that passes under the heads" for decades. regards, tom lane
Tom Lane wrote: > Bruno Wolff III <bruno@wolff.to> writes: > > Are you sure about that? That would probably be the normal case, but are > > you promised that the hardware will write all of the sectors of a block > > in order? > > I don't think you can possibly assume that. If the block crosses a > cylinder boundary then it's certainly an unsafe assumption, and even > within a cylinder (no seek required) I'm pretty sure that disk drives > have understood "write the next sector that passes under the heads" > for decades. SCSI tagged queueing certainly allows 512-byte blocks to be reordered during writes. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Wed, 2005-07-06 at 17:17 -0700, Joshua D. Drake wrote: > >>Tom, I think you're the only person that could or would be trusted to > >>make such a change. Even past the 8.1 freeze, I say we need to do > >>something now on this issue. > > > > > > I think if we document full_page_writes as similar to fsync in risk, we > > are OK for 8.1, but if something can be done easily, it sounds good. > > > > Now that we have a GUC we can experiment with the full page write load > > and see how it can be improved. > > Question, with this option if the power goes out will I just roll > through the transaction logs like normal? Most probably, yes. But: > Or are we talking the > potential to have to use something like pg_resetxlog or similar? Potentially. Just depends on what sort of crash occurred... > If it is just roll through the transaction logs then I have no problem > with it, let the user decide the level of reliance they have. If it can > cause actual, need to restore from backup level damage then it is a > literall no go IMHO. Well, it can't *cause* problems, but it doesn't solve them when they occur, as the current design does. If crash recovery fails, and it may do (and worst of all it might not fail when it should have) then you will need to recover the database using PITR or a replication mechanism. Best Regards, Simon Riggs
On Thu, 2005-07-07 at 00:29 -0400, Bruce Momjian wrote: > Tom Lane wrote: > > Bruno Wolff III <bruno@wolff.to> writes: > > > Are you sure about that? That would probably be the normal case, but are > > > you promised that the hardware will write all of the sectors of a block > > > in order? > > > > I don't think you can possibly assume that. If the block crosses a > > cylinder boundary then it's certainly an unsafe assumption, and even > > within a cylinder (no seek required) I'm pretty sure that disk drives > > have understood "write the next sector that passes under the heads" > > for decades. > > SCSI tagged queueing certainly allows 512-byte blocks to be reordered > during writes. Then a torn-page tell-tale is required that will tell us of any change to any of the 512-byte sectors that make up a block/page. Here's an idea: We read the page that we would have backed up, calc the CRC and write a short WAL record with just the CRC, not the block. When we recover we re-read the database page, calc its CRC and compare it with the CRC from the transaction log. If they differ, we know that the page was torn and we know the database needs recovery. (So we calc the CRC when we log AND when we recover). This avoids the need to write full pages, though slightly slows down recovery. Best Regards, Simon Riggs
>> Are you sure about that? That would probably be the normal case, but >> are you promised that the hardware will write all of the sectors of a >> block in order? > > I don't think you can possibly assume that. If the block > crosses a cylinder boundary then it's certainly an unsafe > assumption, and even within a cylinder (no seek required) I'm > pretty sure that disk drives have understood "write the next > sector that passes under the heads" > for decades. A lot of hardware exists, that guards against partial writes of single IO requests (a persistent write cache for a HP raid controller for intel servers costs ~500$ extra). But, the OS usually has 4k (some 8k) filesystem buffer size, and since we do not use direct io for datafiles, the OS might decide to schedule two 4k writes differently for one 8k page. If you do not build pg to match your fs buffer size you cannot guard against partial writes with hardware :-( We could alleviate that problem with direct io for datafiles. Andreas
> Here's an idea: > > We read the page that we would have backed up, calc the CRC and > write a short WAL record with just the CRC, not the block. When > we recover we re-read the database page, calc its CRC and > compare it with the CRC from the transaction log. If they > differ, we know that the page was torn and we know the database > needs recovery. (So we calc the CRC when we log AND when we recover). Won't work, since the page on disk may have x different contents between 2 checkpoints (bgwriter from lru). Only workable solution would imho be to write the LSN to each 512 byte block (not that I am propagating that idea). Andreas
"Zeugswetter Andreas DAZ SD" <ZeugswetterA@spardat.at> writes: > Only workable solution would imho be to write the LSN to each 512 > byte block (not that I am propagating that idea). We're not doing anything like that, as it would create an impossible space-management problem (or are you happy with limiting tuples to 500 bytes?). What we *could* do is calculate a page-level CRC and store it in the page header just before writing out. Torn pages would then manifest as a wrong CRC on read. No correction ability, but at least a reliable detection ability. However, this will do nothing to solve the performance problem if the core of that problem is the cost of computing page-sized CRCs :-( We still don't know enough about the situation to know what a solution might look like. Is the slowdown Josh is seeing due to the extra CPU cost of the CRCs, or the extra I/O cost, or excessive locking of the WAL-related data structures while we do this stuff, or ???. Need more data. regards, tom lane
I wrote: > We still don't know enough about the situation to know what a solution > might look like. Is the slowdown Josh is seeing due to the extra CPU > cost of the CRCs, or the extra I/O cost, or excessive locking of the > WAL-related data structures while we do this stuff, or ???. Need more > data. Josh, is OSDL up enough that you can try another comparison run? If so, please undo the previous patch (which disabled page dumping entirely) and instead try removing this block of code, starting at about xlog.c line 620 in CVS tip:

    /*
     * Now add the backup block headers and data into the CRC
     */
    for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
    {
        if (dtbuf_bkp[i])
        {
            BkpBlock   *bkpb = &(dtbuf_xlg[i]);
            char       *page;

            COMP_CRC32(rdata_crc, (char *) bkpb, sizeof(BkpBlock));
            page = (char *) BufferGetBlock(dtbuf[i]);
            if (bkpb->hole_length == 0)
            {
                COMP_CRC32(rdata_crc, page, BLCKSZ);
            }
            else
            {
                /* must skip the hole */
                COMP_CRC32(rdata_crc, page, bkpb->hole_offset);
                COMP_CRC32(rdata_crc,
                           page + (bkpb->hole_offset + bkpb->hole_length),
                           BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
            }
        }
    }

This will remove just the CRC calculation work associated with backed-up pages. Note that any attempt to recover from the WAL will fail, but I assume you don't need that for the purposes of the test run. regards, tom lane
>> Only workable solution would imho be to write the LSN to each 512 byte >> block (not that I am propagating that idea). "Only workable" was a stupid formulation; I meant a solution that works with an LSN. > We're not doing anything like that, as it would create an > impossible space-management problem (or are you happy with > limiting tuples to 500 bytes?). To do it, a layer between physical storage and row work memory would need to be inserted; of course that would add a lot of overhead. I guess more overhead than computing a page CRC. > We still don't know enough about the situation to know what a solution might look like. > Is the slowdown Josh is seeing due to the extra CPU cost of the CRCs, or the extra I/O cost, > or excessive locking of the WAL-related data structures while we do this stuff, or ???. > Need more data. Yes, especially the 10 sec instead of 1 sec response times look very suspicious. Andreas
Simon Riggs wrote: > > SCSI tagged queueing certainly allows 512-byte blocks to be reordered > > during writes. > > Then a torn-page tell-tale is required that will tell us of any change > to any of the 512-byte sectors that make up a block/page. > > Here's an idea: > > We read the page that we would have backed up, calc the CRC and write a > short WAL record with just the CRC, not the block. When we recover we > re-read the database page, calc its CRC and compare it with the CRC from > the transaction log. If they differ, we know that the page was torn and > we know the database needs recovery. (So we calc the CRC when we log AND > when we recover). > > This avoids the need to write full pages, though slightly slows down > recovery. Yes, that is a good idea! That torn page thing sounded like a mess, and I love that we can check them on recovery rather than whenever you happen to access the page. What would be great would be to implement this when full_page_writes is off, _and_ have the page writes happen when the page is written to disk rather than modified in the shared buffers. I will add those to the TODO list now. Updated item:

* Eliminate need to write full pages to WAL before page modification [wal]
  Currently, to protect against partial disk page writes, we write full page images to WAL before they are modified so we can correct any partial page writes during recovery. These pages can also be eliminated from point-in-time archive files.
  o -Add ability to turn off full page writes
  o When off, write CRC to WAL and check file system blocks on recovery
  o Write full pages during file system write and not when the page is modified in the buffer cache
    This allows most full page writes to happen in the background writer.

-- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Yes, that is a good idea! ... which was shot down in the very next message. regards, tom lane
Zeugswetter Andreas DAZ SD wrote: > > >> Are you sure about that? That would probably be the normal case, but > >> are you promised that the hardware will write all of the sectors of a > > >> block in order? > > > > I don't think you can possibly assume that. If the block > > crosses a cylinder boundary then it's certainly an unsafe > > assumption, and even within a cylinder (no seek required) I'm > > pretty sure that disk drives have understood "write the next > > sector that passes under the heads" > > for decades. > > A lot of hardware exists, that guards against partial writes > of single IO requests (a persistent write cache for a HP raid > controller for intel servers costs ~500$ extra). > > But, the OS usually has 4k (some 8k) filesystem buffer size, > and since we do not use direct io for datafiles, the OS might decide > to schedule two 4k writes differently for one 8k page. > > If you do not build pg to match your fs buffer size you cannot > guard against partial writes with hardware :-( > > We could alleviate that problem with direct io for datafiles. Now that is an interesting analysis. I thought people who used battery-backed drive cache wouldn't have partial page write problems, but I now see it is certainly possible. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Tom Lane <tgl@sss.pgh.pa.us> writes: > "Zeugswetter Andreas DAZ SD" <ZeugswetterA@spardat.at> writes: > > Only workable solution would imho be to write the LSN to each 512 > > byte block (not that I am propagating that idea). > > We're not doing anything like that, as it would create an impossible > space-management problem (or are you happy with limiting tuples to > 500 bytes?). What we *could* do is calculate a page-level CRC and > store it in the page header just before writing out. Torn pages > would then manifest as a wrong CRC on read. No correction ability, > but at least a reliable detection ability. At the same time as you do the CRC you can copy the bytes to a fresh page skipping the LSNs. Likewise, when writing out the page you have to calculate the CRC; at the same time as you calculate the CRC you write out the bytes to a temporary buffer adding LSNs and write that to disk. This would be "zero-copy" if you're already scanning the bytes to calculate the CRC since you can add and remove LSNs at the same time. It does require an extra buffer to store the page in before writing and that entails some amount of cache thrashing. But maybe you could reuse the same buffer over and over again for every read/write. -- greg
Greg Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> What we *could* do is calculate a page-level CRC and >> store it in the page header just before writing out. Torn pages >> would then manifest as a wrong CRC on read. No correction ability, >> but at least a reliable detection ability. > At the same time as you do the CRC you can copy the bytes to a fresh page > skipping the LSNs. Likewise, when writing out the page you have to calculate > the CRC; at the same time as you calculate the CRC you write out the bytes to > a temporary buffer adding LSNs and write that to disk. Huh? You seem to be proposing doing *both* things, which sounds entirely pointless. BTW, I was envisioning the page CRCs as something we'd only check during crash recovery, not normal-operation reads. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > As far as #2, my posted proposal was to write the full pages to WAL when > > they are written to the file system, and not when they are first > > modified in the shared buffers --- > > That is *completely* unworkable. Or were you planning to abandon the > promise that a transaction is committed when we have flushed its WAL > commit record? "completely" is a strong word. What is on disk at the time the page is modified in the shared buffer is just fine for recovery (it is the same as what we write to WAL anyway). It is just when the page gets written to disk that it changes for recovery, so potentially during recovery you could take what is on disk, modify it by reading WAL, then replace it later with the image from WAL. The only problem I see is that the page might be partially written and WAL modifications to the page might fail but later the page will be replaced anyway. Perhaps we could record pages that are corrupt (hopefully only one) and make sure later page images replace them, or we fail on recovery. > > Seems it is similar to fsync in risk, which is not a new option. > > The point here is that fsync-off is only realistic for development > or playpen installations. You don't turn it off in a production > machine, and I can't see that you'd turn off the full-page-write > option either. So we have not solved anyone's performance problem. Yes, this is basically another fsync-like option that isn't for production usage in most cases. Sad but true. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tom Lane wrote: >> The point here is that fsync-off is only realistic for development >> or playpen installations. You don't turn it off in a production >> machine, and I can't see that you'd turn off the full-page-write >> option either. So we have not solved anyone's performance problem. > Yes, this is basically another fsync-like option that isn't for > production usage in most cases. Sad but true. Just to make my position perfectly clear: I don't want to see this option shipped in 8.1. It's reasonable to have it in there for now as an aid to our performance investigations, but I don't see that it has any value for production. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Tom Lane wrote: > >> The point here is that fsync-off is only realistic for development > >> or playpen installations. You don't turn it off in a production > >> machine, and I can't see that you'd turn off the full-page-write > >> option either. So we have not solved anyone's performance problem. > > > Yes, this is basically another fsync-like option that isn't for > > production usage in most cases. Sad but true. > > Just to make my position perfectly clear: I don't want to see this > option shipped in 8.1. It's reasonable to have it in there for now > as an aid to our performance investigations, but I don't see that it > has any value for production. Well, this is the first I am hearing that, and of course your position is just one vote. One idea would be to just tie its behavior directly to fsync and remove the option completely (that was the original TODO), or we can adjust it so it doesn't have the same risks as fsync, or the same lack of failure reporting as fsync. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
>>Just to make my position perfectly clear: I don't want to see this >>option shipped in 8.1. It's reasonable to have it in there for now >>as an aid to our performance investigations, but I don't see that it >>has any value for production. > > > Well, this is the first I am hearing that, and of course your position > is just one vote. True but your "feature" was added after feature freeze ;). I don't see this as a good thing overall. We should be looking for a solution not a band-aid that if you tear it off will pull the skin. Sincerely, Joshua D. Drake -- Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240 PostgreSQL Replication, Consulting, Custom Programming, 24x7 support Managed Services, Shared and Dedicated Hosting Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/
Joshua D. Drake wrote: > > >>Just to make my position perfectly clear: I don't want to see this > >>option shipped in 8.1. It's reasonable to have it in there for now > >>as an aid to our performance investigations, but I don't see that it > >>has any value for production. > > > > > > Well, this is the first I am hearing that, and of course your position > > is just one vote. > > True but your "feature" was added after feature freeze ;). I don't see My patch was posted days before the feature freeze. > this as a good thing overall. We should be looking for a solution not a > band-aid that if you tear it off will pull the skin. Sure, having it be _like_ fsync is not a good thing. Seems we can roll it into the fsync option, improve it, or remove it. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Simon Riggs wrote: > On Wed, 2005-07-06 at 18:22 -0400, Bruce Momjian wrote: > > Well, I added #1 yesterday as 'full_page_writes', and it has the same > > warnings as fsync (namely, on crash, be prepared to recovery or check > > your system thoroughly. > > Yes, which is why I comment now that the GUC alone is not enough. > > There is no way to "check your system thoroughly". If there is a certain > way of knowing torn pages had *not* occurred, then I would be happy. Yep, it is a pain, and like fsync. > > As far as #2, my posted proposal was to write the full pages to WAL when > > they are written to the file system, and not when they are first > > modified in the shared buffers --- the goal being that it will even out > > the load, and it will happen in a non-critical path, hopefully by the > > background writer or at checkpoint time. > > The page must be written before the changes to the page are written, so > that they are available sequentially in the log for replay. The log and > the database are not connected, so we cannot do it that way. If the page > is written out of sequence from the changes to it, how would recovery > know where to get the page from? See my later email --- the full page will be restored later from WAL, so our changes don't have to be made at that point. > ISTM there is mileage in your idea of trying to shift the work to > another time. My thought is "which blocks exactly are the ones being > changed?". Maybe that would lead to a reduction. > > > > With wal_changed_pages= off *any* crash would possibly require an > > > archive recovery, or a replication rebuild. It's good that we now have > > > PITR, but we do also have other options for availability. Users of > > > replication could well be amongst the first to try out this option. > > > > Seems it is similar to fsync in risk, which is not a new option. > > Risk is not acceptable. We must have certainty, either way. > > Why have two GUCs? Why not just have one GUC that does both at the same > time? When would you want one but not the other? > risk_data_loss_to_gain_performance = true Yep, one new one might make sense. > > I think if we document full_page_writes as similar to fsync in risk, we > > are OK for 8.1, but if something can be done easily, it sounds good. > > Documenting something simply isn't enough. I simply cannot advise > anybody ever to use the new GUC. If their data was low value, they > wouldn't even be using PostgreSQL, they'd use a non-transactional DBMS. > > I agree we *must* have the GUC, but we also *must* have a way for crash > recovery to tell us for certain that it has definitely worked, not just > maybe worked. Right. I am thinking your CRC write to WAL might do that. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Tom, > Josh, is OSDL up enough that you can try another comparison run? Thankfully, yes. > If so, please undo the previous patch (which disabled page dumping > entirely) and instead try removing this block of code, starting > at about xlog.c line 620 in CVS tip: Will do. Results in a few days. -- Josh Berkus Aglio Database Solutions San Francisco
On Thu, Jul 07, 2005 at 11:36:40AM -0400, Tom Lane wrote: > Greg Stark <gsstark@mit.edu> writes: > > Tom Lane <tgl@sss.pgh.pa.us> writes: > >> What we *could* do is calculate a page-level CRC and > >> store it in the page header just before writing out. Torn pages > >> would then manifest as a wrong CRC on read. No correction ability, > >> but at least a reliable detection ability. > > > At the same time as you do the CRC you can copy the bytes to a fresh page > > skipping the LSNs. Likewise, when writing out the page you have to calculate > > the CRC; at the same time as you calculate the CRC you write out the bytes to > > a temporary buffer adding LSNs and write that to disk. > > Huh? You seem to be proposing doing *both* things, which sounds entirely > pointless. > > BTW, I was envisioning the page CRCs as something we'd only check during > crash recovery, not normal-operation reads. > > regards, tom lane > Does the DB page on disk have a version number? If so, maybe we could update the WAL with the CRC+version anytime the page is updated. You may need to check the log for multiple CRC+version entries to determine the torn-page status. Ken
Josh Berkus <josh@agliodbs.com> writes: >> If so, please undo the previous patch (which disabled page dumping >> entirely) and instead try removing this block of code, starting >> at about xlog.c line 620 in CVS tip: > Will do. Results in a few days. Great. BTW, don't bother testing snapshots between 2005/07/05 2300 EDT and just now --- Bruce's full_page_writes patch introduced a large random negative component into the timing ... regards, tom lane
>>> The point here is that fsync-off is only realistic for development or >>> playpen installations. You don't turn it off in a production >>> machine, and I can't see that you'd turn off the full-page-write >>> option either. So we have not solved anyone's performance problem. > >> Yes, this is basically another fsync-like option that isn't for >> production usage in most cases. Sad but true. > > Just to make my position perfectly clear: I don't want to see > this option shipped in 8.1. Why not ? If your filesystem buffer size matches your pg page size, and you have a persistent write cache, the option makes perfect sense. Andreas
On Thu, 2005-07-07 at 11:59 -0400, Bruce Momjian wrote: > Tom Lane wrote: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > Tom Lane wrote: > > >> The point here is that fsync-off is only realistic for development > > >> or playpen installations. You don't turn it off in a production > > >> machine, and I can't see that you'd turn off the full-page-write > > >> option either. So we have not solved anyone's performance problem. > > > > > Yes, this is basically another fsync-like option that isn't for > > > production usage in most cases. Sad but true. > > > > Just to make my position perfectly clear: I don't want to see this > > option shipped in 8.1. It's reasonable to have it in there for now > > as an aid to our performance investigations, but I don't see that it > > has any value for production. > > Well, this is the first I am hearing that, and of course your position > is just one vote. > > One idea would be to just tie its behavior directly to fsync and remove > the option completely (that was the original TODO), or we can adjust it > so it doesn't have the same risks as fsync, or the same lack of failure > reporting as fsync. I second Tom's objection, until we agree either: - a conclusive physical test that shows that specific hardware *never* causes torn pages - a national/international standard name/number for everybody to ask their manufacturer whether or not they comply with that (I doubt that exists...) - a conclusive check for torn pages that can be added to the recovery code to show whether or not they have occurred. Is there also a potential showstopper in the redo machinery? We work on the assumption that the post-checkpoint block is available in WAL as a before image. Redo for all actions merely replay the write action again onto the block. If we must reapply the write action onto the block, the redo machinery must check to see whether the write action has already been successfully applied before it decides to redo. I'm not sure that the current code does that. Having raised that objection, ISTM that checking for torn pages can be accomplished reasonably well using a few rules... These are simple because we do not update in place for MVCC. Since inserts and vacuums alter the pd_upper and pd_lower, we should be able to do a self-consistency check that shows that all items are correctly placed. If there is non-zero data higher than the pd_higher pointer, then we know that the first sector is torn. If a pointer doesn't match with a row version, then the page is torn. It is possible that the first sector of a page could be undetectably torn if it was nearly full and the item pointer pointed to the first sector. However, for every page touched, the last WAL record to touch that page should have an LSN that matches the database page. In most cases they would match, proving the page was not torn. If they did not match we would have no proof either way, so we would be advised to act as if the page were torn for that situation. Possibly, we could reinstate the idea of putting the LSN at the beginning and end of every page, since that would help prove the first sector (only) was not torn. It is possible that a page could be torn and yet still be consistent, but this could only occur for a delete. Reapplying the delete, whether or not it is visible on the page would overcome that without problem. 
It is possible that one or more sectors of empty space in the middle of a block could be torn, but their contents would still be identical, so that is irrelevant and can be ignored. Best Regards, Simon Riggs
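For concreteness, a rough sketch of the kind of self-consistency check Simon describes, using a simplified stand-in for the page header rather than the real PageHeaderData, and interpreting his rules loosely (Tom notes below that such heuristics are incomplete). Illustration only, not a proposed patch:

    /*
     * Illustrative sketch of a torn-page self-consistency check.
     * The struct is a simplified stand-in, not the real page header,
     * and the rules are loose readings of the heuristics above.
     */
    #include <stdint.h>
    #include <stdbool.h>

    #define BLCKSZ 8192

    typedef struct
    {
        uint64_t pd_lsn;      /* LSN of last WAL record touching this page */
        uint16_t pd_lower;    /* end of item pointer array */
        uint16_t pd_upper;    /* start of row-version space */
        uint16_t nitems;      /* number of item pointers (simplified) */
        uint16_t itemoff[1];  /* offsets of row versions (simplified) */
    } FakePageHeader;

    static bool
    page_looks_consistent(const char *page, uint64_t last_lsn_for_page)
    {
        const FakePageHeader *hdr = (const FakePageHeader *) page;
        uint16_t i;

        /* Rule 1: the hole between pd_lower and pd_upper should be all
         * zeroes; non-zero bytes there suggest the header sector is stale
         * relative to the data sectors, i.e. the first sector is torn. */
        for (i = hdr->pd_lower; i < hdr->pd_upper; i++)
            if (page[i] != 0)
                return false;

        /* Rule 2: every item pointer must land inside the row-version area. */
        for (i = 0; i < hdr->nitems; i++)
            if (hdr->itemoff[i] < hdr->pd_upper || hdr->itemoff[i] >= BLCKSZ)
                return false;

        /* Rule 3: if the page LSN matches the last WAL record known to have
         * touched this page, at least the header sector is current. */
        return hdr->pd_lsn == last_lsn_for_page;
    }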
On 7/7/05, Bruce Momjian <pgman@candle.pha.pa.us> wrote: > One idea would be to just tie its behavior directly to fsync and remove > the option completely (that was the original TODO), or we can adjust it > so it doesn't have the same risks as fsync, or the same lack of failure > reporting as fsync. I wonder about one thing -- how much impact does the underlying filesystem have? I mean, the problem with "partial writes" to pages is how to handle a situation where the machine loses power and we are not sure whether the write was completed or not. But then again, imagine the data is on a filesystem with data journaling (like ext3 with data=journal). There, to my understanding, the data is first written into the journal prior to being written to the disk drive. Assuming the drive loses power during the process, I guess there would be two possible situations:
1) the modification was committed to the journal completely, so we can replay the journal and we are sure the 8kb block is fine. (*)
2) the modification in the journal is not complete. It has not been fully committed to the filesystem journal. And we are safe to assume the drive has the old data.
(*) I am not sure if this is true for 8kb blocks, and of course, I haven't got good knowledge about ext3's journalling and its atomicity... Assuming the above is true, it would be interesting to see how ext3 with data=journal and partial writes competes with ext3 with some other data= setting and without them. I don't have extensive knowledge of journalling internals, but I thought I would mention it, so people with wider knowledge could put their input here. Regards, Dawid
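For reference, the comparison Dawid suggests comes down to the ext3 data journaling mount option; something like these /etc/fstab entries, where the device and mount point are placeholders:

    /dev/sdb1  /var/lib/pgsql  ext3  data=journal    0 2
    /dev/sdb1  /var/lib/pgsql  ext3  data=ordered    0 2   # the ext3 default
    /dev/sdb1  /var/lib/pgsql  ext3  data=writeback  0 2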
Simon Riggs <simon@2ndquadrant.com> writes: > Is there also a potential showstopper in the redo machinery? We work on > the assumption that the post-checkpoint block is available in WAL as a > before image. Redo for all actions merely replay the write action again > onto the block. If we must reapply the write action onto the block, the > redo machinery must check to see whether the write action has already > been successfully applied before it decides to redo. I'm not sure that > the current code does that. The redo machinery relies on the page LSN to tell whether the update has occurred. In the presence of torn pages, that's of course unreliable. > Having raised that objection, ISTM that checking for torn pages can be > accomplished reasonably well using a few rules... I have zero confidence in this; the fact that you can think of (incomplete, inaccurate) heuristics for heap-page operations doesn't mean you can make it work for indexes. regards, tom lane
On Thu, 7 Jul 2005, Tom Lane wrote: > We still don't know enough about the situation to know what a solution > might look like. Is the slowdown Josh is seeing due to the extra CPU > cost of the CRCs, or the extra I/O cost, or excessive locking of the > WAL-related data structures while we do this stuff, or ???. Need more > data. I wonder if a different BLCKSZ would make a difference either way. Say, 1024 bytes instead of the default 8192. - Heikki
Tom, > Great. BTW, don't bother testing snapshots between 2005/07/05 2300 EDT > and just now --- Bruce's full_page_writes patch introduced a large > random negative component into the timing ... Ach. Starting over, then. --Josh -- Josh Berkus Aglio Database Solutions San Francisco
On Fri, 2005-07-08 at 09:47 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > Having raised that objection, ISTM that checking for torn pages can be > > accomplished reasonably well using a few rules... > > I have zero confidence in this; the fact that you can think of > (incomplete, inaccurate) heuristics for heap-page operations doesn't > mean you can make it work for indexes. If we can find heuristics that cover some common cases, then I would be happy. Anything that allows us to prove that we don't need to recover is good. If we reduce the unknown state to an acceptable risk, then we are more likely to make use of the performance gain in the real world. Of course, they need to be accurate. Let's not get hung up on my error rate. I don't think we should care too much about indexes. We can rebuild them...but losing heap sectors means *data loss*. Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > I don't think we should care too much about indexes. We can rebuild > them...but losing heap sectors means *data loss*. If you're so concerned about *data loss* then none of this will be acceptable to you at all. We are talking about going from a system that can actually survive torn-page cases to one that can only tell you whether you've lost data to such a case. Arguing about the probability with which we can detect the loss seems beside the point. regards, tom lane
On R, 2005-07-08 at 14:45 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > I don't think we should care too much about indexes. We can rebuild > > them...but losing heap sectors means *data loss*. There might be some merit in the idea of disabling WAL/PITR for indexes, where one can accept some (and possibly a lot of) extra time when recovering. > If you're so concerned about *data loss* then none of this will be > acceptable to you at all. We are talking about going from a system > that can actually survive torn-page cases to one that can only tell > you whether you've lost data to such a case. Arguing about the > probability with which we can detect the loss seems beside the point. -- Hannu Krosing <hannu@skype.net>
On Fri, 2005-07-08 at 14:45 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > I don't think we should care too much about indexes. We can rebuild > > them...but losing heap sectors means *data loss*. > > If you're so concerned about *data loss* then none of this will be > acceptable to you at all. We are talking about going from a system > that can actually survive torn-page cases to one that can only tell > you whether you've lost data to such a case. Arguing about the > probability with which we can detect the loss seems beside the point. In all of this, I see that turning off full page images would be an option that defaults to "yes, take page images". PITR was originally discussed (in 2002, see the archives) as a mechanism that would allow full page images to be avoided. Since we now have PITR, we can discuss more sensibly taking that option. If there are some circumstances where we don't know the state of the server and need to recover, that is OK, as long as we *can* recover. BUT only if we have a fairly low chance of needing to use it. (Rebuilding an index is preferable to a full system recovery.) So I am interested in the probability of us knowing whether the system is damaged or not. It may then become an acceptable risk for a production system to take in order to gain 50% performance. To that end, I am willing to consider various heuristics that would allow us to reduce the risk. I have suggested some, but am happy to hear others (or, as you say, corrections to them) to make that idea more viable. ISTM that Recovery could tell us:
1. Fully recovered, provably correct state of all data blocks
2. Fully recovered, unknown data correctness of some data blocks
3. Fully recovered, provably incorrect state of some data blocks
as well as:
a) no indexes require rebuilding
b) the following indexes require an immediate REINDEX...
Result 1a requires no further action. Result 1b requires some index rebuild after the system becomes operational. Results 2 and 3 would require some form of system recovery. Since currently there are no tests performed to show correctness, we won't ever know we're in state 1 and so would need to presume we are in state 2 and recover. My view is that if enough heuristics can be found to increase the potential for ending a recovery in state 1, then turning off full page images may become viable as a realistic cost/benefit. Though turning off full page images is never something I would suggest as the default. Best Regards, Simon Riggs
On Fri, 2005-07-08 at 09:34 +0200, Zeugswetter Andreas DAZ SD wrote: > >>> The point here is that fsync-off is only realistic for development or > >>> playpen installations. You don't turn it off in a production > >>> machine, and I can't see that you'd turn off the full-page-write > >>> option either. So we have not solved anyone's performance problem. > > > >> Yes, this is basically another fsync-like option that isn't for > >> production usage in most cases. Sad but true. > > > > Just to make my position perfectly clear: I don't want to see > > this option shipped in 8.1. > > Why not ? If your filesystem buffer size matches your pg page size, > and you have a persistent write cache, the option makes perfect sense. I think this point needs expansion: It is possible to run with matching page sizes, in which case the option to not-log full page images is desirable. The only type of crash that can cause torn pages is an OS crash, such as a power outage. If you have an expensive server with battery backup, persistent write cache etc then this reduces that risk considerably. However, as long as PostgreSQL can't tell the difference between any crash and an OS crash, we must assume the worst. Best Regards, Simon Riggs
Simon, Tom, > > Will do. Results in a few days. Actually, between the bad patch on the 5th and ongoing STP issues, I don't think I will have results before I leave town. Will e-mail you offlist to give you info to retrieve results. > Any chance you'd be able to do this with > > ext3 and a filesystem blocksize of 4096 bytes > > PostgreSQL built with a blocksize of 4096 Hmmm ... I'm not sure the current test supports these kinds of options. Will ask Mark. -- Josh Berkus Aglio Database Solutions San Francisco
Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > I don't think we should care too much about indexes. We can rebuild > > them...but losing heap sectors means *data loss*. > > If you're so concerned about *data loss* then none of this will be > acceptable to you at all. We are talking about going from a system > that can actually survive torn-page cases to one that can only tell > you whether you've lost data to such a case. Arguing about the > probability with which we can detect the loss seems beside the > point. I realize I'm coming into this discussion a bit late, and perhaps my thinking on this is simplistically naive. That said, I think I have an idea of how to solve the torn page problem. If the hardware lies to you about the data being written to the disk, then no amount of work on our part can guarantee data integrity. So the below assumes that the hardware doesn't ever lie about this. If you want to prevent a torn page, you have to make the last synchronized write to the disk as part of the checkpoint process a write that *cannot* result in a torn page. So it has to be a write of a buffer that is no larger than the sector size of the disk. I'd make it 256 bytes, to be sure of accommodating pretty much any disk hardware out there. In any case, the modified sequence would go something like:
1. write the WAL entry, and encode in it a unique magic number
2. sync()
3. append the unique magic number to the WAL again (or to a separate file if you like, it doesn't matter as long as you know where to look for it during recovery), using a 256 byte (at most) write buffer.
4. sync()
After the first sync(), the OS guarantees that the data you've written so far is committed to the platters, with the possible exception of a torn page during the write operation, which will only happen during a crash during step 2. But if a crash happens here, then the second occurrence of the unique magic number will not appear in the WAL (or separate file, if that's the mechanism chosen), and you'll *know* that you can't trust that the WAL entry was completely committed to the platter. If a crash happens during step 4, then either the appended magic number won't appear during recovery, in which case the recovery process can assume that the WAL entry is incomplete, or it will appear, in which case it's *guaranteed by the hardware* that the WAL entry is complete, because you'll know for sure that the previous sync() completed successfully. The amount of time between steps 2 and 4 should be small enough that there should be no significant performance penalty involved, relative to the time it takes for the first sync() to complete. Thoughts? -- Kevin Brown kevin@sysexperts.com
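A minimal sketch of the sequence Kevin outlines, using fsync() as the flush primitive; the record layout, the 256-byte trailer, and the function itself are invented here for illustration, not proposed code:

    /*
     * Sketch of the "write, flush, confirm, flush" idea described above.
     * Everything here (record format, trailer size) is illustrative only.
     */
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    int
    write_wal_record_with_marker(int wal_fd, const void *rec, size_t reclen,
                                 uint64_t magic)
    {
        char trailer[256];

        /* Step 1: write the WAL entry, which embeds the magic number. */
        if (write(wal_fd, &magic, sizeof(magic)) != sizeof(magic) ||
            write(wal_fd, rec, reclen) != (ssize_t) reclen)
            return -1;

        /* Step 2: force it to the platters. */
        if (fsync(wal_fd) != 0)
            return -1;

        /*
         * Step 3: append the magic number again as one small write
         * (the proposal assumes a write this small cannot itself be torn).
         */
        memset(trailer, 0, sizeof(trailer));
        memcpy(trailer, &magic, sizeof(magic));
        if (write(wal_fd, trailer, sizeof(trailer)) != sizeof(trailer))
            return -1;

        /* Step 4: flush again.  On recovery, a record whose trailing magic
         * is present and matches is known to have been fully written. */
        return fsync(wal_fd);
    }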
I don't think our problem is partial writes of WAL, which we already check, but heap/index page writes, which we currently do not check for partial writes. --------------------------------------------------------------------------- Kevin Brown wrote: > Tom Lane wrote: > > Simon Riggs <simon@2ndquadrant.com> writes: > > > I don't think we should care too much about indexes. We can rebuild > > > them...but losing heap sectors means *data loss*. > > > > If you're so concerned about *data loss* then none of this will be > > acceptable to you at all. We are talking about going from a system > > that can actually survive torn-page cases to one that can only tell > > you whether you've lost data to such a case. Arguing about the > > probability with which we can detect the loss seems beside the > > point. > > I realize I'm coming into this discussion a bit late, and perhaps my > thinking on this is simplistically naive. That said, I think I have > an idea of how to solve the torn page problem. > > If the hardware lies to you about the data being written to the disk, > then no amount of work on our part can guarantee data integrity. So > the below assumes that the hardware doesn't ever lie about this. > > If you want to prevent a torn page, you have to make the last > synchronized write to the disk as part of the checkpoint process a > write that *cannot* result in a torn page. So it has to be a write of > a buffer that is no larger than the sector size of the disk. I'd make > it 256 bytes, to be sure of accomodating pretty much any disk hardware > out there. > > In any case, the modified sequence would go something like: > > 1. write the WAL entry, and encode in it a unique magic number > 2. sync() > 3. append the unique magic number to the WAL again (or to a separate > file if you like, it doesn't matter as long as you know where to > look for it during recovery), using a 256 byte (at most) write > buffer. > 4. sync() > > > After the first sync(), the OS guarantees that the data you've written > so far is committed to the platters, with the possible exception of a > torn page during the write operation, which will only happen during a > crash during step 2. But if a crash happens here, then the second > occurrance of the unique magic number will not appear in the WAL (or > separate file, if that's the mechanism chosen), and you'll *know* that > you can't trust that the WAL entry was completely committed to the > platter. > > If a crash happens during step 4, then either the appended magic > number won't appear during recovery, in which case the recovery > process can assume that the WAL entry is incomplete, or it will > appear, in which case it's *guaranteed by the hardware* that the WAL > entry is complete, because you'll know for sure that the previous > sync() completed successfully. > > > The amount of time between steps 2 and 4 should be small enough that > there should be no significant performance penalty involved, relative > to the time it takes for the first sync() to complete. > > > Thoughts? > > > > -- > Kevin Brown kevin@sysexperts.com > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian wrote: > > I don't think our problem is partial writes of WAL, which we already > check, but heap/index page writes, which we currently do not check for > partial writes. Hmm...I've read through the thread again and thought about the problem further, and now think I understand what you're dealing with. Ultimately, the problem is that you're storing diffs in the WAL, so you have to be able to guarantee that every data/index page has been completely written, right? There's no way to detect a torn page without some sort of marker in each disk-indivisible segment of the page, unless you're willing to checksum the entire page. With that in mind, the method Microsoft uses for SQL Server is probably about as simple as it gets. In our case, I suppose we may as well allocate one byte per 256-byte segment for the torn page marker. Just increment the marker value each time you write the page (you'll have to read it from the page prior to incrementing it, of course). Other than that, torn page detection is really just a special case of page corruption detection. -- Kevin Brown kevin@sysexperts.com
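Roughly how the MSSQL-style marker Kevin mentions might look. Reserving the last byte of every 512-byte sector for the marker is an assumption made purely for illustration; nothing like this exists in the current page layout:

    /*
     * Illustration of torn-page detection via per-sector markers.
     * Before a page is written, stamp the last byte of every sector
     * with a counter; on read, all stamps must match.
     */
    #include <stdint.h>
    #include <stdbool.h>

    #define BLCKSZ      8192
    #define SECTOR_SIZE 512

    static void
    stamp_torn_page_marker(char *page, uint8_t *counter)
    {
        int sector;

        (*counter)++;                  /* changes on every write of the page */
        for (sector = 0; sector < BLCKSZ / SECTOR_SIZE; sector++)
            page[(sector + 1) * SECTOR_SIZE - 1] = (char) *counter;
    }

    static bool
    torn_page_detected(const char *page)
    {
        char expected = page[SECTOR_SIZE - 1];
        int  sector;

        for (sector = 1; sector < BLCKSZ / SECTOR_SIZE; sector++)
            if (page[(sector + 1) * SECTOR_SIZE - 1] != expected)
                return true;           /* sectors came from different writes */
        return false;
    }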
Josh Berkus <josh@agliodbs.com> writes: > So, now that we know what the performance bottleneck is, how do we fix it? Josh, I see that all of those runs seem to be using wal_buffers = 8. Have you tried materially increasing wal_buffers (say to 100 or 1000) and/or experimenting with different wal_sync_method options since we fixed the bufmgrlock problem? I am wondering if the real issue is WAL buffer contention or something like that. It would also be useful to compare these runs to runs with fsync = off, just to see how the performance changes. regards, tom lane
Tom, > Josh, I see that all of those runs seem to be using wal_buffers = 8. > Have you tried materially increasing wal_buffers (say to 100 or 1000) > and/or experimenting with different wal_sync_method options since we > fixed the bufmgrlock problem? I am wondering if the real issue is > WAL buffer contention or something like that. > > It would also be useful to compare these runs to runs with fsync = off, > just to see how the performance changes. As you know, I've been out of town. I'll be running more tests, and collating my existing test results over the next few days. -- Josh Berkus Aglio Database Solutions San Francisco
Tom, > This will remove just the CRC calculation work associated with backed-up > pages. Note that any attempt to recover from the WAL will fail, but I > assume you don't need that for the purposes of the test run. Looks like the CRC calculation work isn't the issue. I did test runs of no-CRC vs. regular DBT2 with different checkpoint timeouts, and didn't discern any statistical difference. See attached spreadsheet chart (the two different runs are on two different machines). I think this test run http://khack.osdl.org/stp/302903/results/0/, with a 30-min checkpoint, shows pretty clearly that the behavior of the performance drop is consistent with needing to "re-prime" the WAL with full page images. Each checkpoint drops performance abruptly, and then slowly recovers until the next checkpoint. Do note that there is a significant statistical variation in individual runs. It's only the overall trend which is significant. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Did you test with full_page_writes on and off? --------------------------------------------------------------------------- Josh Berkus wrote: > Tom, > > > This will remove just the CRC calculation work associated with backed-up > > pages. ?Note that any attempt to recover from the WAL will fail, but I > > assume you don't need that for the purposes of the test run. > > Looks like the CRC calculation work isn't the issue. I did test runs of > no-CRC vs. regular DBT2 with different checkpoint timeouts, and didn't > discern any statistical difference. See attached spreadsheet chart (the > two different runs are on two different machines). > > I think this test run http://khack.osdl.org/stp/302903/results/0/, with a > 30-min checkpoint shows pretty clearly that the behavior of the > performance drop is consistent with needing to "re-prime" the WAL will > full page images. Each checkpoint drops performance abruptly, and then > slowly recovers until the next checkpoint. > > Do note that there is a significant statistical variation in individual > runs. It's only the overall trend which is significant. > > -- > --Josh > > Josh Berkus > Aglio Database Solutions > San Francisco [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce, > Did you test with full_page_writes on and off? I didn't use your full_page_writes version because Tom said it was problematic. This is CVS from July 3rd. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus wrote: > Bruce, > > > Did you test with full_page_writes on and off? > > I didn't use your full_page_writes version because Tom said it was > problematic. This is CVS from July 3rd. I think we need those tests run. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce, > I think we need those tests run. Sure. What CVS day should I grab? What's the option syntax? ( -c full_page_writes=false)? I have about 20 tests in queue right now but can stack yours up behind them. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus wrote: > Bruce, > > > I think we need those tests run. > > Sure. What CVS day should I grab? What's the option syntax? ( -c > full_page_writes=false)? Yes. You can grab any from the day Tom fixed it, which was I think two weeks ago. > I have about 20 tests in queue right now but can stack yours up behind > them. Great. That would seem to be the definative way to test it. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Josh Berkus <josh@agliodbs.com> writes: > Bruce, >> Did you test with full_page_writes on and off? > I didn't use your full_page_writes version because Tom said it was > problematic. This is CVS from July 3rd. We already know the results: should be equivalent to the hack Josh tried first. So what we know at this point is that dumping full pages into WAL is expensive, and that the CRC calculation cost is not the main expense. (I suppose that this indicates the reduction to 32-bit CRC was helpful, because previous measurements sure suggested that CRC costs were important ...) What we still don't know is exactly where the main expense *is*. Is it I/O, WAL buffer lock contention, or what? I think the next step is to vary the WAL-related parameters (wal_buffers, wal_sync_method, and fsync) and see if we can learn anything that way. It's entirely plausible that the optimal values for those have changed due to our recent hacking. (Note: turning off fsync is of course not a production option, but it would be helpful to try it here --- it should give us a reading on whether disk I/O wait is the culprit or not.) I'd recommend that you *not* update to CVS tip, as that wouldn't accomplish much except call into question any comparisons to the runs you've already done. There are a number of unrelated changes this month that could have side-effects on performance. regards, tom lane
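The combinations Tom suggests, expressed as postgresql.conf settings; the particular values are only examples for a test matrix, and fsync = false is diagnostic only, never for production:

    wal_buffers = 8                  # baseline used in the runs so far
    #wal_buffers = 100               # or 1000, per the suggestion above
    #wal_sync_method = open_sync     # also try fsync, fdatasync, open_datasync
    #fsync = false                   # test only: isolates disk I/O wait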
Josh Berkus <josh@agliodbs.com> writes: > I think this test run http://khack.osdl.org/stp/302903/results/0/, with a > 30-min checkpoint shows pretty clearly that the behavior of the > performance drop is consistent with needing to "re-prime" the WAL with > full page images. Each checkpoint drops performance abruptly, and then > slowly recovers until the next checkpoint. A 30-min checkpoint means that fsyncs will be happening on up to 30 minutes of i/o on each database file. It could be the full page images that are slowing it down, or it could just be that the system is swamped with i/o that's been put off for the last 30 minutes. It's not nearly as bad in this case as it has been in the previous samples, since at least this test runs for 5 full checkpoint cycles, but it's still fairly unrealistic. And that last checkpoint, which accounts for almost 20% of the i/o of the test, is only partially included in the timing data. For any benchmarking to be meaningful you have to set the checkpoint interval to something more realistic. Something like 5 minutes. That way when the final checkpoint cycle isn't completely included in the timing data you'll at least be missing a statistically insignificant portion of the work. -- greg
Josh Berkus <josh@agliodbs.com> writes: > Looks like the CRC calculation work isn't the issue. I did test runs of > no-CRC vs. regular DBT2 with different checkpoint timeouts, and didn't > discern any statistical difference. See attached spreadsheet chart (the > two different runs are on two different machines). Um, where are the test runs underlying this spreadsheet? I don't have a whole lot of confidence in looking at full-run average TPM numbers to discern whether transient dropoffs in TPM are significant or not. regards, tom lane
Greg, > For any benchmarking to be meaningful you have to set the checkpoint > interval to something more realistic. Something like 5 minutes. That way > when the final checkpoint cycle isn't completely included in the timing > data you'll at least be missing a statistically insignificant portion of > the work. Look at the PDF I sent. This was run with 5, 10, and 30 minutes. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Tom, > Um, where are the test runs underlying this spreadsheet? I don't have a > whole lot of confidence in looking at full-run average TPM numbers to > discern whether transient dropoffs in TPM are significant or not. Web in the form of: http://khack.osdl.org/stp/#test_number#/ Where #test_number# is: Machine0, no patch: 302904 302905 302906 Machine0, patch: 301901 302902 302903 Machine2, no patch: 302910 302911 302912 Machine2, patch: 301907 302908 302909 BTW, I am currently doing a wal_buffers scalability run. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Greg Stark <gsstark@mit.edu> writes: > For any benchmarking to be meaningful you have to set the checkpoint interval > to something more realistic. Something like 5 minutes. That way when the final > checkpoint cycle isn't completely included in the timing data you'll at least > be missing a statistically insignificant portion of the work. This isn't about benchmarking --- or at least, I don't put any stock in the average NOTPM values for the long-checkpoint-interval runs. What we want to understand is why there's a checkpoint-triggered performance dropoff that (appears to) last longer than the checkpoint itself. If we can fix that, it should have beneficial impact on real-world cases. But we do not have to, and should not, restrict ourselves to real-world test cases while trying to figure out what's going on. regards, tom lane
Josh Berkus <josh@agliodbs.com> writes: >> Um, where are the test runs underlying this spreadsheet? I don't have a >> whole lot of confidence in looking at full-run average TPM numbers to >> discern whether transient dropoffs in TPM are significant or not. > Web in the form of: > http://khack.osdl.org/stp/#test_number#/ > Where #test_number# is: > Machine0, no patch: > 302904 > 302905 > 302906 > Machine0, patch: > 301901 > 302902 > 302903 > Machine2, no patch: > 302910 > 302911 > 302912 > Machine2, patch: > 301907 > 302908 > 302909 Hmm. Eyeballing the NOTPM trace for cases 302912 and 302909, it sure looks like the post-checkpoint performance recovery is *slower* in the latter. And why is 302902 visibly slower overall than 302905? I thought for a bit that you had gotten "patch" vs "no patch" backwards, but the oprofile results linked to these pages look right: XLogInsert takes significantly more time in the "no patch" cases. There's something awfully weird going on here. I was prepared to see no statistically-significant differences, but not multiple cases that seem to be going the "wrong direction". BTW, I'd like to look at 302906, but its [Details] link is broken. regards, tom lane
Tom, > There's something awfully weird going on here. I was prepared to see > no statistically-significant differences, but not multiple cases that > seem to be going the "wrong direction". There's a lot of variance in the tests. I'm currently running a variance test battery on one machine to figure out why the results become so variable when checkpointing is < 1 hour. Actually, to cover all which is currently running: machine1 & 3 wal_buffers scaling test machine2 STP variance test machine4 full_page_writes=false test All of these should wind up in about 5 days, provided the STP doesn't crash (not an insignificant concern). -- --Josh Josh Berkus Aglio Database Solutions San Francisco
On Fri, 22 Jul 2005 19:11:36 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote: > BTW, I'd like to look at 302906, but its [Details] link is broken. Ugh, I tried digging into the internal systems and it looks like they were destroyed (or not saved) somehow. It'll have to be rerun. Sorry... Mark
On Fri, 2005-07-22 at 19:11 -0400, Tom Lane wrote: > Hmm. Eyeballing the NOTPM trace for cases 302912 and 302909, it sure > looks like the post-checkpoint performance recovery is *slower* in > the latter. And why is 302902 visibly slower overall than 302905? > I thought for a bit that you had gotten "patch" vs "no patch" backwards, > but the oprofile results linked to these pages look right: XLogInsert > takes significantly more time in the "no patch" cases. > > There's something awfully weird going on here. I was prepared to see > no statistically-significant differences, but not multiple cases that > seem to be going the "wrong direction". All of the tests have been performed with wal_buffers = 8, so there will be massive contention for those buffers, leading to increased I/O... All of the tests show that there is a CPU utilisation drop, and an I/O wait increase immediately following checkpoints. When we advance the insert pointer and a wal_buffer still needs writing, we clean it by attempting to perform an I/O while holding WALInsertLock. Very probably the WALWriteLock is currently held, so we wait on the WALWriteLock and everybody else waits on us. Normally, it's fairly hard for that to occur since we attempt to XLogWrite when wal_buffers are more than half full, but we do this with a conditional acquire, so when we're busy we just keep filling up wal_buffers. Normally, that's OK. When we have a checkpoint, almost every xlog write has at least a whole block appended to it. So we can easily fill up wal_buffers very quickly while WALWriteLock is held. Once there is no space available, we then effectively halt all transactions while we write out that buffer. My conjecture is that the removal of the CPU bottleneck has merely moved the problem by allowing users to fill wal buffers faster and go into a wait state quicker than they did before. The beneficial effect of the conditional acquire when wal_buffers is full never occurs, and performance drops. We should run tests with much higher wal_buffers numbers to nullify the effect described above and reduce contention. That way we will move towards the log disk speed being the limiting factor, patch or no patch. So, I think Tom's improvement of CRC/hole compression will prove itself when we have higher values of wal_buffers. Best Regards, Simon Riggs
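A toy model of the policy Simon is describing, with pthread mutexes standing in for WALInsertLock and WALWriteLock; this is a paraphrase for illustration, not the actual xlog.c code:

    /*
     * Toy model: try a non-blocking write-out when the WAL buffers are
     * half full, but block and do the I/O ourselves, still holding the
     * insert lock, once they are completely full.
     */
    #include <pthread.h>

    static pthread_mutex_t insert_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t write_lock  = PTHREAD_MUTEX_INITIALIZER;

    static int  used_buffers, total_buffers = 8;
    static void write_out_buffers(void) { used_buffers = 0; }  /* stand-in for I/O */

    void
    xlog_insert_model(void)
    {
        pthread_mutex_lock(&insert_lock);
        used_buffers++;                      /* the record being inserted */

        if (used_buffers >= total_buffers)
        {
            /* No free buffer: everyone now waits on us while we do I/O. */
            pthread_mutex_lock(&write_lock);
            write_out_buffers();
            pthread_mutex_unlock(&write_lock);
        }
        else if (used_buffers > total_buffers / 2 &&
                 pthread_mutex_trylock(&write_lock) == 0)
        {
            /* Half full: opportunistic write-out, but only if nobody
             * else already holds the write lock. */
            write_out_buffers();
            pthread_mutex_unlock(&write_lock);
        }

        pthread_mutex_unlock(&insert_lock);
    }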
Simon, > We should run tests with much higher wal_buffers numbers to nullify the > effect described above and reduce contention. That way we will move > towards the log disk speed being the limiting factor, patch or no patch. I've run such tests, at a glance they do seem to improve performance. I need some time to collate the results. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: >> We should run tests with much higher wal_buffers numbers to nullify the >> effect described above and reduce contention. That way we will move >> towards the log disk speed being the limiting factor, patch or no patch. > I've run such tests, at a glance they do seem to improve performance. I > need some time to collate the results. With larger wal_buffers values it might also be interesting to take some measures to put a larger share of the WAL writing burden on the bgwriter. Currently the bgwriter only writes out WAL buffers in two scenarios: 1. It wants to write a dirty shared buffer that has LSN beyond the current WAL flush marker. Just like any backend, the bgwriter must flush WAL as far as the LSN before writing the buffer. 2. The bgwriter is completing a checkpoint. It must flush WAL as far as the checkpoint record before updating pg_control. It might be interesting to add some logic to explicitly check for and write out any full-but-unwritten WAL buffers during the bgwriter's main loop. In a scenario with many small transactions, this is probably a waste of effort since backends will be forcing WAL write/flush any time they commit. (This is why I haven't pursued the idea already.) However, given a large transaction and adequate wal_buffer space, such a tactic should offload WAL writing work nicely. I have no idea whether the DBT benchmarks would benefit at all, but given that they are affected positively by increasing wal_buffers, they must have a fair percentage of not-small transactions. regards, tom lane
Tom, > I have no idea whether the DBT benchmarks would benefit at all, but > given that they are affected positively by increasing wal_buffers, > they must have a fair percentage of not-small transactions. Even if they don't, we'll have series tests for DW here at GreenPlum soon, and I'll bet they'd affect *those*. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
On Tue, 2005-07-26 at 19:15 -0400, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > >> We should run tests with much higher wal_buffers numbers to nullify the > >> effect described above and reduce contention. That way we will move > >> towards the log disk speed being the limiting factor, patch or no patch. > > > I've run such tests, at a glance they do seem to improve performance. I > > need some time to collate the results. > > With larger wal_buffers values it might also be interesting to take some > measures to put a larger share of the WAL writing burden on the bgwriter. > > Currently the bgwriter only writes out WAL buffers in two scenarios: > > 1. It wants to write a dirty shared buffer that has LSN beyond the > current WAL flush marker. Just like any backend, the bgwriter must > flush WAL as far as the LSN before writing the buffer. > > 2. The bgwriter is completing a checkpoint. It must flush WAL as far as > the checkpoint record before updating pg_control. > > It might be interesting to add some logic to explicitly check for and > write out any full-but-unwritten WAL buffers during the bgwriter's > main loop. > > In a scenario with many small transactions, this is probably a waste of > effort since backends will be forcing WAL write/flush any time they > commit. (This is why I haven't pursued the idea already.) However, > given a large transaction and adequate wal_buffer space, such a tactic > should offload WAL writing work nicely. > > I have no idea whether the DBT benchmarks would benefit at all, but > given that they are affected positively by increasing wal_buffers, > they must have a fair percentage of not-small transactions. Yes, I was musing on that also. I think it would help keep response time even, which seems to be the route to higher performance anyway. This is more important in real world than in benchmarks, where a nice even stream of commits arrives to save the day... I guess I'd be concerned that the poor bgwriter can't do all of this work. I was thinking about a separate log writer, so we could have both bgwriter and logwriter active simultaneously on I/O. It has taken a while to get bgwriter to perform its duties efficiently, so I'd rather not give it so many that it performs them all badly. The logwriter would be more of a helper, using LWLockConditionalAcquire to see if the WALWriteLock was kept active. Each backend would still perform its own commit write. (We could change that in the future, but thats a lot more work.) We would only need one new GUC log_writer_delay, defaulting to 50 ms (??) - if set to zero, the default, then we don't spawn a logwriter daemon at all. (Perhaps we also need another one to say how many blocks get written each time its active... but I'm not hugely in favour of more parameters to get wrong). That way we could take the LWLockConditionalAcquire on WALWriteLock out of the top of XLogInsert, which was effectively doing that work. I think this would also reduce the apparent need for high wal_buffer settings - probably could get away with a lot less than the 2048 recent performance results would suggest. Best Regards, Simon Riggs
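A sketch of what the helper's main loop might look like under this proposal. The stub functions stand in for backend calls (the conditional acquire of WALWriteLock, writing out completed WAL buffers), and log_writer_delay is the hypothetical GUC; none of this is existing code:

    /* Sketch only: the stubs below stand in for backend internals. */
    #include <stdbool.h>
    #include <unistd.h>

    static int  log_writer_delay = 50;            /* ms; 0 = no daemon */
    static bool try_lock_walwrite(void) { return true; }  /* stand-in for
                                              LWLockConditionalAcquire */
    static void unlock_walwrite(void)   { }
    static void write_completed_wal_buffers(void) { }     /* stand-in for
                                              writing full, unwritten buffers */

    void
    log_writer_main_loop(void)
    {
        for (;;)
        {
            usleep(log_writer_delay * 1000);

            /* Never fight a backend that is already flushing WAL;
             * just help out when the lock happens to be free. */
            if (try_lock_walwrite())
            {
                write_completed_wal_buffers();
                unlock_walwrite();
            }
        }
    }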
Simon, > I guess I'd be concerned that the poor bgwriter can't do all of this > work. I was thinking about a separate log writer, so we could have both > bgwriter and logwriter active simultaneously on I/O. It has taken a > while to get bgwriter to perform its duties efficiently, so I'd rather > not give it so many that it performs them all badly. Speaking of which, now that I have a target value for wal_buffers, I want to re-test the bgwriter. In my previous tests, varying the bgwriter settings on DBT2 had no discernable effect. -- --Josh Josh Berkus Aglio Database Solutions San Francisco