Thread: Avoiding unnecessary reads in recovery
In recovery, with full_page_writes=on, we read in each page only to overwrite the contents with a full page image. That's a waste of time, and can have a surprisingly large effect on recovery time.

As a quick test on my laptop, I initialized a DBT-2 test with 5 warehouses, and let it run for 2 minutes without think-times to generate some WAL. Then I did a "kill -9 postmaster", and took a copy of the data directory to use for testing recovery.

With CVS HEAD, the recovery took ~2 minutes. With the attached patch, it took 5 seconds. (Yes, I used the same not-yet-recovered data directory in both tests, and cleared the OS cache with "echo 1 > /proc/sys/vm/drop_caches".)

I was surprised how big a difference it makes, but when you think about it, it's logical. Without the patch, recovery does roughly the same I/O as the test itself: reading in pages, modifying them, and writing them back. With the patch, all the reads are done sequentially from the WAL, and the pages are written back in a batch at the end of the WAL replay, which is a lot more efficient.

It's interesting that (with the patch) full_page_writes can *shorten* your recovery time. I've always thought it had a purely negative effect on performance.

I'll leave it up to the jury whether this tiny little change is appropriate after feature freeze...

While working on this, this comment in ReadBuffer caught my eye:

>	/*
>	 * During WAL recovery, the first access to any data page should
>	 * overwrite the whole page from the WAL; so a clobbered page
>	 * header is not reason to fail. Hence, when InRecovery we may
>	 * always act as though zero_damaged_pages is ON.
>	 */
>	if (zero_damaged_pages || InRecovery)
>	{

But that assumption only holds if full_page_writes is enabled, right? I changed that in the attached patch as well, but if the main patch isn't accepted, that part of it should still be applied, I think.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

Index: src/backend/access/transam/xlogutils.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlogutils.c,v
retrieving revision 1.49
diff -c -r1.49 xlogutils.c
*** src/backend/access/transam/xlogutils.c	5 Jan 2007 22:19:24 -0000	1.49
--- src/backend/access/transam/xlogutils.c	25 Apr 2007 11:40:09 -0000
***************
*** 226,232 ****
  	if (blkno < lastblock)
  	{
  		/* page exists in file */
! 		buffer = ReadBuffer(reln, blkno);
  	}
  	else
  	{
--- 226,235 ----
  	if (blkno < lastblock)
  	{
  		/* page exists in file */
! 		if (init)
! 			buffer = ZapBuffer(reln, blkno);
! 		else
! 			buffer = ReadBuffer(reln, blkno);
  	}
  	else
  	{
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.216
diff -c -r1.216 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c	30 Mar 2007 18:34:55 -0000	1.216
--- src/backend/storage/buffer/bufmgr.c	25 Apr 2007 11:44:27 -0000
***************
*** 97,102 ****
--- 97,103 ----
  static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
  				  int set_flag_bits);
  static void buffer_write_error_callback(void *arg);
+ static Buffer ReadBuffer_common(Relation reln, BlockNumber blockNum, bool alloc_only);
  static volatile BufferDesc *BufferAlloc(Relation reln, BlockNumber blockNum,
  			bool *foundPtr);
  static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
***************
*** 121,126 ****
--- 122,148 ----
  Buffer
  ReadBuffer(Relation reln, BlockNumber blockNum)
  {
+ 	return ReadBuffer_common(reln, blockNum, false);
+ }
+ 
+ /*
+  * ZapBuffer -- like ReadBuffer, but doesn't read the contents of the page
+  * from disk. The caller is expected to completely rewrite the page,
+  * regardless of the current contents. This should only be used in
+  * recovery where there's no concurrent readers that might see the
+  * contents of the page before the caller rewrites it.
+  */
+ Buffer
+ ZapBuffer(Relation reln, BlockNumber blockNum)
+ {
+ 	Assert(InRecovery);
+ 
+ 	return ReadBuffer_common(reln, blockNum, true);
+ }
+ 
+ static Buffer
+ ReadBuffer_common(Relation reln, BlockNumber blockNum, bool alloc_only)
+ {
  	volatile BufferDesc *bufHdr;
  	Block		bufBlock;
  	bool		found;
***************
*** 253,269 ****
  	}
  	else
  	{
  		smgrread(reln->rd_smgr, blockNum, (char *) bufBlock);
  		/* check for garbage data */
  		if (!PageHeaderIsValid((PageHeader) bufBlock))
  		{
  			/*
! 			 * During WAL recovery, the first access to any data page should
! 			 * overwrite the whole page from the WAL; so a clobbered page
! 			 * header is not reason to fail. Hence, when InRecovery we may
! 			 * always act as though zero_damaged_pages is ON.
  			 */
! 			if (zero_damaged_pages || InRecovery)
  			{
  				ereport(WARNING,
  						(errcode(ERRCODE_DATA_CORRUPTED),
--- 275,293 ----
  	}
  	else
  	{
+ 		if (!alloc_only)
+ 		{
  		smgrread(reln->rd_smgr, blockNum, (char *) bufBlock);
  		/* check for garbage data */
  		if (!PageHeaderIsValid((PageHeader) bufBlock))
  		{
  			/*
! 			 * When full_page_writes is enabled, the first access to any data page should
! 			 * overwrite the whole page from the WAL during recovery; so a clobbered page
! 			 * header is not reason to fail. Hence, we may
! 			 * act as though zero_damaged_pages is ON.
  			 */
! 			if (zero_damaged_pages || (InRecovery && fullPageWrites))
  			{
  				ereport(WARNING,
  						(errcode(ERRCODE_DATA_CORRUPTED),
***************
*** 277,282 ****
--- 301,307 ----
  				 errmsg("invalid page header in block %u of relation \"%s\"",
  						blockNum, RelationGetRelationName(reln))));
  		}
+ 		}
  	}
  
  	if (isLocalBuf)
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.384
diff -c -r1.384 guc.c
*** src/backend/utils/misc/guc.c	12 Apr 2007 06:53:47 -0000	1.384
--- src/backend/utils/misc/guc.c	25 Apr 2007 11:19:52 -0000
***************
*** 103,109 ****
  extern int	CommitDelay;
  extern int	CommitSiblings;
  extern char *default_tablespace;
- extern bool fullPageWrites;
  
  #ifdef TRACE_SORT
  extern bool trace_sort;
--- 103,108 ----
Index: src/include/access/xlog.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v
retrieving revision 1.76
diff -c -r1.76 xlog.h
*** src/include/access/xlog.h	5 Jan 2007 22:19:51 -0000	1.76
--- src/include/access/xlog.h	25 Apr 2007 11:19:46 -0000
***************
*** 142,147 ****
--- 142,148 ----
  extern int	XLogArchiveTimeout;
  extern char *XLOG_sync_method;
  extern const char XLOG_sync_method_default[];
+ extern bool fullPageWrites;
  
  #define XLogArchivingActive()	(XLogArchiveCommand[0] != '\0')
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.102
diff -c -r1.102 bufmgr.h
*** src/include/storage/bufmgr.h	5 Jan 2007 22:19:57 -0000	1.102
--- src/include/storage/bufmgr.h	25 Apr 2007 11:39:35 -0000
***************
*** 111,116 ****
--- 111,117 ----
   * prototypes for functions in bufmgr.c
   */
  extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
+ extern Buffer ZapBuffer(Relation reln, BlockNumber blockNum);
  extern void ReleaseBuffer(Buffer buffer);
  extern void UnlockReleaseBuffer(Buffer buffer);
  extern void MarkBufferDirty(Buffer buffer);
Heikki Linnakangas wrote: > While working on this, this comment in ReadBuffer caught my eye: > >> /* >> * During WAL recovery, the first access to any data page should >> * overwrite the whole page from the WAL; so a clobbered page >> * header is not reason to fail. Hence, when InRecovery we may >> * always act as though zero_damaged_pages is ON. >> */ >> if (zero_damaged_pages || InRecovery) >> { > > But that assumption only holds if full_page_writes is enabled, right? I > changed that in the attached patch as well, but if it isn't accepted > that part of it should still be applied, I think. On second thought, my fix still isn't 100% right because one could turn full_page_writes on before starting replay. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > While working on this, this comment in ReadBuffer caught my eye: > >> /* >> * During WAL recovery, the first access to any data page should >> * overwrite the whole page from the WAL; so a clobbered page >> * header is not reason to fail. Hence, when InRecovery we may >> * always act as though zero_damaged_pages is ON. >> */ >> if (zero_damaged_pages || InRecovery) >> { > > But that assumption only holds if full_page_writes is enabled, right? I changed > that in the attached patch as well, but if it isn't accepted that part of it > should still be applied, I think. Well it's only true if full_page_writes was on when the WAL was written. Which isn't necessarily the same as saying it's enabled during recovery... As long as there's a backup block in the log we can use it to clobber pages in the heap -- which is what your patch effectively does anyways. If we're replaying a log entry where there isn't a backup block and we find a damaged page then we're in trouble. Either the damaged page was in a previous backup block or it's the recovery itself that's damaging it. In the latter case it would be pretty useful to abort the recovery so the user doesn't lose his backup and has a chance to recovery properly (possibly after reporting and fixing the bug). So in short I think with your patch this piece of code no longer has a role. Either your patch kicks in and we never even look at the damaged page at all, or we should be treating it as corrupt data and just check zero_damaged_pages alone and not do anything special in recovery. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark wrote:
> So in short I think with your patch this piece of code no longer has a role.
> Either your patch kicks in and we never even look at the damaged page at all,
> or we should be treating it as corrupt data and just check zero_damaged_pages
> alone and not do anything special in recovery.

Good point. Adjusted patch attached. I added some comments as well.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

Index: src/backend/access/transam/xlogutils.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlogutils.c,v
retrieving revision 1.49
diff -c -r1.49 xlogutils.c
*** src/backend/access/transam/xlogutils.c	5 Jan 2007 22:19:24 -0000	1.49
--- src/backend/access/transam/xlogutils.c	25 Apr 2007 14:27:05 -0000
***************
*** 226,232 ****
  	if (blkno < lastblock)
  	{
  		/* page exists in file */
! 		buffer = ReadBuffer(reln, blkno);
  	}
  	else
  	{
--- 226,235 ----
  	if (blkno < lastblock)
  	{
  		/* page exists in file */
! 		if (init)
! 			buffer = ZapBuffer(reln, blkno);
! 		else
! 			buffer = ReadBuffer(reln, blkno);
  	}
  	else
  	{
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.216
diff -c -r1.216 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c	30 Mar 2007 18:34:55 -0000	1.216
--- src/backend/storage/buffer/bufmgr.c	25 Apr 2007 14:36:15 -0000
***************
*** 17,22 ****
--- 17,26 ----
   *		and pin it so that no one can destroy it while this process
   *		is using it.
   *
+  * ZapBuffer() -- like ReadBuffer, but destroys the contents of the
+  *		page. Used in recovery when the page is completely overwritten
+  *		from WAL.
+  *
   * ReleaseBuffer() -- unpin a buffer
   *
   * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
***************
*** 97,102 ****
--- 101,107 ----
  static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
  				  int set_flag_bits);
  static void buffer_write_error_callback(void *arg);
+ static Buffer ReadBuffer_common(Relation reln, BlockNumber blockNum, bool alloc_only);
  static volatile BufferDesc *BufferAlloc(Relation reln, BlockNumber blockNum,
  			bool *foundPtr);
  static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
***************
*** 121,126 ****
--- 126,155 ----
  Buffer
  ReadBuffer(Relation reln, BlockNumber blockNum)
  {
+ 	return ReadBuffer_common(reln, blockNum, false);
+ }
+ 
+ /*
+  * ZapBuffer -- like ReadBuffer, but doesn't read the contents of the page
+  * from disk. The caller is expected to completely rewrite the page,
+  * regardless of the current contents. This should only be used in
+  * recovery where there's no concurrent readers that might see the
+  * contents of the page before the caller rewrites it.
+  */
+ Buffer
+ ZapBuffer(Relation reln, BlockNumber blockNum)
+ {
+ 	Assert(InRecovery);
+ 
+ 	return ReadBuffer_common(reln, blockNum, true);
+ }
+ 
+ /*
+  * ReadBuffer_common -- common logic of ReadBuffer and ZapBuffer.
+  */
+ static Buffer
+ ReadBuffer_common(Relation reln, BlockNumber blockNum, bool alloc_only)
+ {
  	volatile BufferDesc *bufHdr;
  	Block		bufBlock;
  	bool		found;
***************
*** 253,269 ****
  	}
  	else
  	{
  		smgrread(reln->rd_smgr, blockNum, (char *) bufBlock);
  		/* check for garbage data */
  		if (!PageHeaderIsValid((PageHeader) bufBlock))
  		{
  			/*
! 			 * During WAL recovery, the first access to any data page should
! 			 * overwrite the whole page from the WAL; so a clobbered page
! 			 * header is not reason to fail. Hence, when InRecovery we may
! 			 * always act as though zero_damaged_pages is ON.
  			 */
! 			if (zero_damaged_pages || InRecovery)
  			{
  				ereport(WARNING,
  						(errcode(ERRCODE_DATA_CORRUPTED),
--- 282,304 ----
  	}
  	else
  	{
+ 		/*
+ 		 * Read in the page, unless the caller intends to overwrite it
+ 		 * and just wants us to allocate a buffer.
+ 		 */
+ 		if (!alloc_only)
+ 		{
  		smgrread(reln->rd_smgr, blockNum, (char *) bufBlock);
  		/* check for garbage data */
  		if (!PageHeaderIsValid((PageHeader) bufBlock))
  		{
  			/*
! 			 * We used to ignore corrupt pages in WAL recovery, but
! 			 * that was only ever safe if full_page_writes was enabled.
! 			 * Now the caller sets alloc_only to true if he intends to
! 			 * overwrite the whole page, and that case is already handled above.
  			 */
! 			if (zero_damaged_pages)
  			{
  				ereport(WARNING,
  						(errcode(ERRCODE_DATA_CORRUPTED),
***************
*** 277,282 ****
--- 312,318 ----
  				 errmsg("invalid page header in block %u of relation \"%s\"",
  						blockNum, RelationGetRelationName(reln))));
  		}
+ 		}
  	}
  
  	if (isLocalBuf)
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.102
diff -c -r1.102 bufmgr.h
*** src/include/storage/bufmgr.h	5 Jan 2007 22:19:57 -0000	1.102
--- src/include/storage/bufmgr.h	25 Apr 2007 14:14:28 -0000
***************
*** 111,116 ****
--- 111,117 ----
   * prototypes for functions in bufmgr.c
   */
  extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
+ extern Buffer ZapBuffer(Relation reln, BlockNumber blockNum);
  extern void ReleaseBuffer(Buffer buffer);
  extern void UnlockReleaseBuffer(Buffer buffer);
  extern void MarkBufferDirty(Buffer buffer);
On Wed, 2007-04-25 at 13:48 +0100, Heikki Linnakangas wrote: > I was surprised how big a difference it makes, but when you think about > it it's logical. Without the patch, it's doing roughly the same I/O as > the test itself, reading in pages, modifying them, and writing them > back. With the patch, all the reads are done sequentially from the WAL, > and then written back in a batch at the end of the WAL replay which is a > lot more efficient. Interesting patch. It would be good to see a longer term test. I'd expect things to fall back to about 50% of the time on a longer recovery where the writes need to be written out of cache. I was thinking of getting the bgwriter to help out to improve matters there. Patch-wise, I love the name ZapBuffer() but would think that OverwriteBuffer() would be more descriptive. As regards the zero_damaged_pages question, I raised that some time ago but we didn't arrive at an explicit answer. All I would say is we can't allow invalid pages in the buffer manager at any time, whatever options we have requested, otherwise other code will fail almost immediately. I'm not sure there's another option? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > As regards the zero_damaged_pages question, I raised that some time ago > but we didn't arrive at an explicit answer. All I would say is we can't > allow invalid pages in the buffer manager at any time, whatever options > we have requested, otherwise other code will fail almost immediately. Yeah --- the proposed new bufmgr routine should probably explicitly zero the content of the buffer. It doesn't really matter in the context of WAL recovery, since there can't be any concurrent access to the buffer, but it'd make it safe to use in non-WAL contexts (I think there are other places where we know we are going to init the page and so a physical read is a waste of time). Also, this would let the patch be + if (alloc_only) + MemSet... + else smgrread... and you don't need to hack out the PageHeaderIsValid test, since it will allow zeroed pages. Possibly ReadZeroedBuffer would be a better name? regards, tom lane
Tom Lane wrote: > "Simon Riggs" <simon@2ndquadrant.com> writes: >> As regards the zero_damaged_pages question, I raised that some time ago >> but we didn't arrive at an explicit answer. All I would say is we can't >> allow invalid pages in the buffer manager at any time, whatever options >> we have requested, otherwise other code will fail almost immediately. > > Yeah --- the proposed new bufmgr routine should probably explicitly zero > the content of the buffer. It doesn't really matter in the context of > WAL recovery, since there can't be any concurrent access to the buffer, > but it'd make it safe to use in non-WAL contexts (I think there are > other places where we know we are going to init the page and so a > physical read is a waste of time). Is there? I can't think of any. Extending a relation doesn't count. Zeroing the buffer explicitly might be a good idea anyway. I agree it feels a bit dangerous to have a moment when there's a garbled page in buffer cache, even if only in recovery. > Also, this would let the patch be > > + if (alloc_only) > + MemSet... > + else > smgrread... > > and you don't need to hack out the PageHeaderIsValid test, since it will > allow zeroed pages. Well, I'd still put the PageHeaderIsValid test in the else-branch. Not like the few cycles spent on the test matter, but I don't see a reason not to. > Possibly ReadZeroedBuffer would be a better name? Yeah, sounds good. "Read" is a bit misleading, since it doesn't actually read in the buffer, but it's good that the name is closely related to ReadBuffer. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> but it'd make it safe to use in non-WAL contexts (I think there are >> other places where we know we are going to init the page and so a >> physical read is a waste of time). > Is there? I can't think of any. Extending a relation doesn't count. No, but re-using a free page in an index does. I'm not sure which index AMs know for sure the page is free, and which have to read it and check, but I think there's at least some scope for that. regards, tom lane
On Apr 25, 2007, at 2:48 PM, Heikki Linnakangas wrote: > In recovery, with full_pages_writes=on, we read in each page only > to overwrite the contents with a full page image. That's a waste of > time, and can have a surprisingly large effect on recovery time. > > As a quick test on my laptop, I initialized a DBT-2 test with 5 > warehouses, and let it run for 2 minutes without think-times to > generate some WAL. Then I did a "kill -9 postmaster", and took a > copy of the data directory to use for testing recovery. > > With CVS HEAD, the recovery took ~ 2 minutes. With the attached > patch, it took 5 seconds. (yes, I used the same not-yet-recovered > data directory in both tests, and cleared the os cache with "echo 1 > > /proc/sys/vm/drop_caches"). > > I was surprised how big a difference it makes, but when you think > about it it's logical. Without the patch, it's doing roughly the > same I/O as the test itself, reading in pages, modifying them, and > writing them back. With the patch, all the reads are done > sequentially from the WAL, and then written back in a batch at the > end of the WAL replay which is a lot more efficient. > > It's interesting that (with the patch) full_page_writes can > *shorten* your recovery time. I've always thought it to have a > purely negative effect on performance. > > I'll leave it up to the jury if this tiny little change is > appropriate after feature freeze... > > While working on this, this comment in ReadBuffer caught my eye: > >> /* >> * During WAL recovery, the first access to any data page should >> * overwrite the whole page from the WAL; so a clobbered page >> * header is not reason to fail. Hence, when InRecovery we may >> * always act as though zero_damaged_pages is ON. >> */ >> if (zero_damaged_pages || InRecovery) >> { > > But that assumption only holds if full_page_writes is enabled, > right? I changed that in the attached patch as well, but if it > isn't accepted that part of it should still be applied, I think. So what happens if a backend is running with full_page_writes = off, someone edits postgresql.conf to turns it on and forgets to reload/ restart, and then we crash? You'll come up in recovery mode thinking that f_p_w was turned on, when in fact it wasn't. ISTM that we need to somehow log what the status of full_page_writes is, if it's going to affect how recovery works. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> but it'd make it safe to use in non-WAL contexts (I think there are
>>> other places where we know we are going to init the page and so a
>>> physical read is a waste of time).
>
>> Is there? I can't think of any. Extending a relation doesn't count.
>
> No, but re-using a free page in an index does. I'm not sure which index
> AMs know for sure the page is free, and which have to read it and check,
> but I think there's at least some scope for that.

B-tree, GIN and GiST read and check. I'm not sure how hash works. I think the latest bitmap index patch doesn't support reusing empty pages at all.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
> So what happens if a backend is running with full_page_writes > = off, someone edits postgresql.conf to turns it on and > forgets to reload/ restart, and then we crash? You'll come up > in recovery mode thinking that f_p_w was turned on, when in > fact it wasn't. > > ISTM that we need to somehow log what the status of > full_page_writes is, if it's going to affect how recovery works. Optimally recovery should do this when confronted with a full page image only. The full page is in the same WAL record that first touches a page, so this should not need to depend on a setting. Andreas
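A sketch of the per-record test Andreas describes: whether a page may be clobbered blindly should follow from the WAL record itself, not from the current GUC setting. The XLR_BKP_BLOCK flag bits in xl_info are taken from the xlog.h of this era; the exact macro name should be treated as an assumption of this sketch:

	/*
	 * A page can be overwritten without reading it only if this record
	 * carries a backup block (full page image) for it.
	 */
	static bool
	record_restores_full_page(XLogRecord *record, int block_index)
	{
		return (record->xl_info & XLR_BKP_BLOCK(block_index)) != 0;
	}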
Jim Nasby <decibel@decibel.org> writes: > So what happens if a backend is running with full_page_writes = off, > someone edits postgresql.conf to turns it on and forgets to reload/ > restart, and then we crash? You'll come up in recovery mode thinking > that f_p_w was turned on, when in fact it wasn't. One of the advantages of the proposed patch is that it avoids having to make any assumptions like that. regards, tom lane
Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> As regards the zero_damaged_pages question, I raised that some time ago
>> but we didn't arrive at an explicit answer. All I would say is we can't
>> allow invalid pages in the buffer manager at any time, whatever options
>> we have requested, otherwise other code will fail almost immediately.
>
> Yeah --- the proposed new bufmgr routine should probably explicitly zero
> the content of the buffer. It doesn't really matter in the context of
> WAL recovery, since there can't be any concurrent access to the buffer,
> but it'd make it safe to use in non-WAL contexts (I think there are
> other places where we know we are going to init the page and so a
> physical read is a waste of time).

To implement that correctly, I think we'd need to take the content lock to clear the buffer if it's already found in the cache. It doesn't seem right to me for the buffer manager to do that; in the worst case it could lead to deadlocks, if that function was ever used while holding another buffer locked.

What we could have is the semantics of "return a buffer, with either correct contents or completely zeroed out". It would act just like ReadBuffer if the buffer was already in memory, and zero out the page otherwise. That's a bit strange semantics to have, but it's simple to implement and works for the use cases we've been talking about.

Patch implementing that attached. I named the function "ReadOrZeroBuffer".

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

Index: src/backend/access/transam/xlogutils.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlogutils.c,v
retrieving revision 1.49
diff -c -r1.49 xlogutils.c
*** src/backend/access/transam/xlogutils.c	5 Jan 2007 22:19:24 -0000	1.49
--- src/backend/access/transam/xlogutils.c	27 Apr 2007 10:44:37 -0000
***************
*** 226,232 ****
  	if (blkno < lastblock)
  	{
  		/* page exists in file */
! 		buffer = ReadBuffer(reln, blkno);
  	}
  	else
  	{
--- 226,235 ----
  	if (blkno < lastblock)
  	{
  		/* page exists in file */
! 		if (init)
! 			buffer = ReadOrZeroBuffer(reln, blkno);
! 		else
! 			buffer = ReadBuffer(reln, blkno);
  	}
  	else
  	{
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.216
diff -c -r1.216 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c	30 Mar 2007 18:34:55 -0000	1.216
--- src/backend/storage/buffer/bufmgr.c	27 Apr 2007 10:49:08 -0000
***************
*** 17,22 ****
--- 17,26 ----
   *		and pin it so that no one can destroy it while this process
   *		is using it.
   *
+  * ReadOrZeroBuffer() -- like ReadBuffer, but clears out the contents of the
+  *		page if it's not in cache already. Used in recovery when the page is
+  *		completely overwritten from WAL.
+  *
   * ReleaseBuffer() -- unpin a buffer
   *
   * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
***************
*** 97,102 ****
--- 101,107 ----
  static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
  				  int set_flag_bits);
  static void buffer_write_error_callback(void *arg);
+ static Buffer ReadBuffer_common(Relation reln, BlockNumber blockNum, bool zeroPage);
  static volatile BufferDesc *BufferAlloc(Relation reln, BlockNumber blockNum,
  			bool *foundPtr);
  static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
***************
*** 121,126 ****
--- 126,152 ----
  Buffer
  ReadBuffer(Relation reln, BlockNumber blockNum)
  {
+ 	return ReadBuffer_common(reln, blockNum, false);
+ }
+ 
+ /*
+  * ReadOrZeroBuffer -- like ReadBuffer, but if the page isn't in buffer
+  *		cache already, it's filled with zeros instead of reading it from
+  *		disk. The caller is expected to rewrite it with real data,
+  *		regardless of the current contents.
+  */
+ Buffer
+ ReadOrZeroBuffer(Relation reln, BlockNumber blockNum)
+ {
+ 	return ReadBuffer_common(reln, blockNum, true);
+ }
+ 
+ /*
+  * ReadBuffer_common -- common logic of ReadBuffer and ReadOrZeroBuffer
+  */
+ static Buffer
+ ReadBuffer_common(Relation reln, BlockNumber blockNum, bool zeroPage)
+ {
  	volatile BufferDesc *bufHdr;
  	Block		bufBlock;
  	bool		found;
***************
*** 253,269 ****
  	}
  	else
  	{
  		smgrread(reln->rd_smgr, blockNum, (char *) bufBlock);
  		/* check for garbage data */
  		if (!PageHeaderIsValid((PageHeader) bufBlock))
  		{
  			/*
! 			 * During WAL recovery, the first access to any data page should
! 			 * overwrite the whole page from the WAL; so a clobbered page
! 			 * header is not reason to fail. Hence, when InRecovery we may
! 			 * always act as though zero_damaged_pages is ON.
  			 */
! 			if (zero_damaged_pages || InRecovery)
  			{
  				ereport(WARNING,
  						(errcode(ERRCODE_DATA_CORRUPTED),
--- 279,304 ----
  	}
  	else
  	{
+ 		/*
+ 		 * Read in the page, unless the caller intends to overwrite it
+ 		 * and just wants us to allocate a buffer.
+ 		 */
+ 		if (zeroPage)
+ 			MemSet((char *) bufBlock, 0, BLCKSZ);
+ 		else
+ 		{
  		smgrread(reln->rd_smgr, blockNum, (char *) bufBlock);
  		/* check for garbage data */
  		if (!PageHeaderIsValid((PageHeader) bufBlock))
  		{
  			/*
! 			 * We used to ignore corrupt pages in WAL recovery, but that
! 			 * was only ever safe if full_page_writes was enabled. Now that
! 			 * the caller sets zeroPage to true if he intends to overwrite
! 			 * the whole page, if we get here the page had better be valid or
! 			 * we'll lose data.
  			 */
! 			if (zero_damaged_pages)
  			{
  				ereport(WARNING,
  						(errcode(ERRCODE_DATA_CORRUPTED),
***************
*** 277,282 ****
--- 312,318 ----
  				 errmsg("invalid page header in block %u of relation \"%s\"",
  						blockNum, RelationGetRelationName(reln))));
  		}
+ 		}
  	}
  
  	if (isLocalBuf)
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.102
diff -c -r1.102 bufmgr.h
*** src/include/storage/bufmgr.h	5 Jan 2007 22:19:57 -0000	1.102
--- src/include/storage/bufmgr.h	27 Apr 2007 10:44:20 -0000
***************
*** 111,116 ****
--- 111,117 ----
   * prototypes for functions in bufmgr.c
   */
  extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
+ extern Buffer ReadOrZeroBuffer(Relation reln, BlockNumber blockNum);
  extern void ReleaseBuffer(Buffer buffer);
  extern void UnlockReleaseBuffer(Buffer buffer);
  extern void MarkBufferDirty(Buffer buffer);
Heikki Linnakangas wrote: > What we could have is the semantics of "Return a buffer, with either > correct contents or completely zeroed out". It would act just like > ReadBuffer if the buffer was already in memory, and zero out the page > otherwise. That's a bit strange semantics to have, but is simple to > implement and works for the use-cases we've been talking about. Huh, why does that work in the case where the recovery code reads a page, then evicts it because of memory pressure, and later needs to read it again? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote: > Heikki Linnakangas wrote: > >> What we could have is the semantics of "Return a buffer, with either >> correct contents or completely zeroed out". It would act just like >> ReadBuffer if the buffer was already in memory, and zero out the page >> otherwise. That's a bit strange semantics to have, but is simple to >> implement and works for the use-cases we've been talking about. > > Huh, why does that work in the case where the recovery code reads a > page, then evicts it because of memory pressure, and later needs to read > it again? I don't understand the problem. You only use ReadOrZeroBuffer when you're going to replace the contents entirely, and don't care about the old contents. If you want to read something in, you use ReadBuffer. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
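For concreteness, this is the caller-side rule as it appears in the xlogutils.c hunk of the patch; the comments here are added to spell out when each call is appropriate:

	Buffer		buffer;

	if (init)
	{
		/*
		 * The WAL record re-initializes the whole page (e.g. restores a full
		 * page image), so the old contents are irrelevant: no physical read.
		 */
		buffer = ReadOrZeroBuffer(reln, blkno);
	}
	else
	{
		/* The record only changes part of the page, so the old contents are needed. */
		buffer = ReadBuffer(reln, blkno);
	}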
Heikki Linnakangas wrote: > Alvaro Herrera wrote: > >Heikki Linnakangas wrote: > > > >>What we could have is the semantics of "Return a buffer, with either > >>correct contents or completely zeroed out". It would act just like > >>ReadBuffer if the buffer was already in memory, and zero out the page > >>otherwise. That's a bit strange semantics to have, but is simple to > >>implement and works for the use-cases we've been talking about. > > > >Huh, why does that work in the case where the recovery code reads a > >page, then evicts it because of memory pressure, and later needs to read > >it again? > > I don't understand the problem. You only use ReadOrZeroBuffer when > you're going to replace the contents entirely, and don't care about the > old contents. If you want to read something in, you use ReadBuffer. Oh, the recovery code selects which one to call based on the "init" param, which is on the first hunk of the diff :-) I forgot that, I was just thinking in your description above. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
I assume this is 8.4 material. --------------------------------------------------------------------------- Heikki Linnakangas wrote: > Tom Lane wrote: > > "Simon Riggs" <simon@2ndquadrant.com> writes: > >> As regards the zero_damaged_pages question, I raised that some time ago > >> but we didn't arrive at an explicit answer. All I would say is we can't > >> allow invalid pages in the buffer manager at any time, whatever options > >> we have requested, otherwise other code will fail almost immediately. > > > > Yeah --- the proposed new bufmgr routine should probably explicitly zero > > the content of the buffer. It doesn't really matter in the context of > > WAL recovery, since there can't be any concurrent access to the buffer, > > but it'd make it safe to use in non-WAL contexts (I think there are > > other places where we know we are going to init the page and so a > > physical read is a waste of time). > > To implement that correctly, I think we'd need to take the content lock > to clear the buffer if it's already found in the cache. It doesn't seem > right to me for the buffer manager to do that, in the worst case it > could lead to deadlocks if that function was ever used while holding > another buffer locked. > > What we could have is the semantics of "Return a buffer, with either > correct contents or completely zeroed out". It would act just like > ReadBuffer if the buffer was already in memory, and zero out the page > otherwise. That's a bit strange semantics to have, but is simple to > implement and works for the use-cases we've been talking about. > > Patch implementing that attached. I named the function "ReadOrZeroBuffer". > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Fri, 2007-04-27 at 10:37 -0400, Bruce Momjian wrote:
> I assume this is 8.4 material.

I think it's a small enough, performance-only change to allow it at this time. It will provide considerable additional benefit for Warm Standby servers.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On Apr 26, 2007, at 3:39 PM, Tom Lane wrote: > Jim Nasby <decibel@decibel.org> writes: >> So what happens if a backend is running with full_page_writes = off, >> someone edits postgresql.conf to turns it on and forgets to reload/ >> restart, and then we crash? You'll come up in recovery mode thinking >> that f_p_w was turned on, when in fact it wasn't. > > One of the advantages of the proposed patch is that it avoids > having to > make any assumptions like that. Yeah, the only email in the thread when I wrote that was Heikki's original email (I wrote the email on a plane, but did note it had been some time after the original email with no replies). Sorry for the noise. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Fri, 2007-04-27 at 12:22 +0100, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > "Simon Riggs" <simon@2ndquadrant.com> writes:
> >> As regards the zero_damaged_pages question, I raised that some time ago
> >> but we didn't arrive at an explicit answer. All I would say is we can't
> >> allow invalid pages in the buffer manager at any time, whatever options
> >> we have requested, otherwise other code will fail almost immediately.
> >
> > Yeah --- the proposed new bufmgr routine should probably explicitly zero
> > the content of the buffer. It doesn't really matter in the context of
> > WAL recovery, since there can't be any concurrent access to the buffer,
> > but it'd make it safe to use in non-WAL contexts (I think there are
> > other places where we know we are going to init the page and so a
> > physical read is a waste of time).
>
> To implement that correctly, I think we'd need to take the content lock
> to clear the buffer if it's already found in the cache. It doesn't seem
> right to me for the buffer manager to do that, in the worst case it
> could lead to deadlocks if that function was ever used while holding
> another buffer locked.
>
> What we could have is the semantics of "Return a buffer, with either
> correct contents or completely zeroed out". It would act just like
> ReadBuffer if the buffer was already in memory, and zero out the page
> otherwise. That's a bit strange semantics to have, but is simple to
> implement and works for the use-cases we've been talking about.

Sounds good.

> Patch implementing that attached. I named the function "ReadOrZeroBuffer".

We already have an API quirk similar to this: relation extension. It seems strange to have two different kinds of special case API that are used alongside each other in XLogReadBuffer().

Currently if we extend by a block we say

	buffer = ReadBuffer(reln, P_NEW);

Why not just add another option, so where you use ReadOrZeroBuffer we just say

	buffer = ReadBuffer(reln, P_INIT);

which we then check for on entry by saying

	isInit = (blockNum == P_INIT);

just as we already do for P_NEW. That way you can do the code like this:

	if (isExtend || isInit)
	{
		/* new or initialised buffers are zero-filled */
		MemSet((char *) bufBlock, 0, BLCKSZ);
		if (isExtend)
			smgrextend(reln->rd_smgr, blockNum, (char *) bufBlock,
					   reln->rd_istemp);
	}

That way we don't have to have ReadBuffer_common etc.

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Simon Riggs wrote: > On Fri, 2007-04-27 at 12:22 +0100, Heikki Linnakangas wrote: >> Patch implementing that attached. I named the function "ReadOrZeroBuffer". > > We already have an API quirk similar to this: relation extension. It > seems strange to have two different kinds of special case API that are > used alongside each other in XLogReadBuffer() > > Currently if we extend by a block we say > buffer = ReadBuffer(reln, P_NEW); > > Why not just add another option, so where you use ReadOrZeroBuffer we > just say > buffer = ReadBuffer(reln, P_INIT); Because ReadOrZeroBuffer needs the block number as an argument. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
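Put another way: P_NEW works because extension lets the system pick the block, so the sentinel can stand in for the block number, while the recovery case must name a specific existing block, so a second sentinel has nowhere to carry it. A minimal sketch of the mismatch (illustrative only):

	/* Extension: the block number is chosen by the call, so a sentinel fits. */
	buffer = ReadBuffer(reln, P_NEW);

	/*
	 * Recovery: the caller must name the exact block that is about to be
	 * overwritten, so the block number has to be passed explicitly.
	 */
	buffer = ReadOrZeroBuffer(reln, blkno);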
I was actually thinking that we could slip this in 8.3. It's a simple, well-understood patch, which fixes a little data integrity quirk as well as gives a nice recovery speed up. Bruce Momjian wrote: > I assume this is 8.4 material. > > --------------------------------------------------------------------------- > > Heikki Linnakangas wrote: >> Tom Lane wrote: >>> "Simon Riggs" <simon@2ndquadrant.com> writes: >>>> As regards the zero_damaged_pages question, I raised that some time ago >>>> but we didn't arrive at an explicit answer. All I would say is we can't >>>> allow invalid pages in the buffer manager at any time, whatever options >>>> we have requested, otherwise other code will fail almost immediately. >>> Yeah --- the proposed new bufmgr routine should probably explicitly zero >>> the content of the buffer. It doesn't really matter in the context of >>> WAL recovery, since there can't be any concurrent access to the buffer, >>> but it'd make it safe to use in non-WAL contexts (I think there are >>> other places where we know we are going to init the page and so a >>> physical read is a waste of time). >> To implement that correctly, I think we'd need to take the content lock >> to clear the buffer if it's already found in the cache. It doesn't seem >> right to me for the buffer manager to do that, in the worst case it >> could lead to deadlocks if that function was ever used while holding >> another buffer locked. >> >> What we could have is the semantics of "Return a buffer, with either >> correct contents or completely zeroed out". It would act just like >> ReadBuffer if the buffer was already in memory, and zero out the page >> otherwise. That's a bit strange semantics to have, but is simple to >> implement and works for the use-cases we've been talking about. >> >> Patch implementing that attached. I named the function "ReadOrZeroBuffer". >> >> -- >> Heikki Linnakangas >> EnterpriseDB http://www.enterprisedb.com > > >> ---------------------------(end of broadcast)--------------------------- >> TIP 6: explain analyze is your friend > -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > I was actually thinking that we could slip this in 8.3. It's a simple, > well-understood patch, which fixes a little data integrity quirk as well > as gives a nice recovery speed up. Yeah. It's arguably a bug fix, in fact, since it eliminates the issue that the recovery behavior is wrong if full-page-writes had been off when the WAL log was made. regards, tom lane
Heikki Linnakangas <heikki@enterprisedb.com> writes: > What we could have is the semantics of "Return a buffer, with either > correct contents or completely zeroed out". It would act just like > ReadBuffer if the buffer was already in memory, and zero out the page > otherwise. That's a bit strange semantics to have, but is simple to > implement and works for the use-cases we've been talking about. > Patch implementing that attached. I named the function "ReadOrZeroBuffer". Applied. BTW, I realized that there is a potential issue created by this, which is that the smgr level might see a write for a page that it never saw a read for. I don't think there are any bad consequences of this ATM, but it is skating around the edges of some bugs we've had previously with relation extension. In particular ReadOrZeroBuffer avoids the error that would normally occur if one tries to read a page that's beyond the logical EOF; and if the page is subsequently modified and written, md.c is likely to get confused/unhappy, particularly if the page is beyond the next segment boundary. This isn't a problem in XLogReadBuffer's usage because it carefully checks the EOF position before trying to use ReadOrZeroBuffer, but it's a limitation other callers will need to think about. regards, tom lane
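A sketch of the caller-side guard Tom describes, modelled on the XLogReadBuffer logic shown earlier in the thread; error handling and the bookkeeping around the extension loop are simplified here:

	lastblock = smgrnblocks(reln->rd_smgr);

	if (blkno < lastblock)
	{
		/* The block already exists on disk, so skipping the read is safe. */
		buffer = init ? ReadOrZeroBuffer(reln, blkno) : ReadBuffer(reln, blkno);
	}
	else
	{
		/*
		 * Beyond EOF: never hand ReadOrZeroBuffer a block that the smgr layer
		 * has not created yet. Extend the relation block by block instead, so
		 * a later write of this page cannot confuse md.c about segment sizes.
		 */
		for (;;)
		{
			buffer = ReadBuffer(reln, P_NEW);
			if (BufferGetBlockNumber(buffer) == blkno)
				break;
			ReleaseBuffer(buffer);		/* not yet the block we need */
		}
	}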