Thread: Avoiding unnecessary reads in recovery

Avoiding unnecessary reads in recovery

From
Heikki Linnakangas
Date:
In recovery, with full_pages_writes=on, we read in each page only to
overwrite the contents with a full page image. That's a waste of time,
and can have a surprisingly large effect on recovery time.

As a quick test on my laptop, I initialized a DBT-2 test with 5
warehouses, and let it run for 2 minutes without think-times to generate
some WAL. Then I did a "kill -9 postmaster", and took a copy of the data
directory to use for testing recovery.

With CVS HEAD, the recovery took ~ 2 minutes. With the attached patch,
it took 5 seconds. (yes, I used the same not-yet-recovered data
directory in both tests, and cleared the os cache with "echo 1 >
/proc/sys/vm/drop_caches").

I was surprised how big a difference it makes, but when you think about
it it's logical. Without the patch, it's doing roughly the same I/O as
the test itself, reading in pages, modifying them, and writing them
back. With the patch, all the reads are done sequentially from the WAL,
and then written back in a batch at the end of the WAL replay which is a
lot more efficient.

It's interesting that (with the patch) full_page_writes can *shorten*
your recovery time. I've always thought it to have a purely negative
effect on performance.

I'll leave it up to the jury if this tiny little change is appropriate
after feature freeze...

While working on this, this comment in ReadBuffer caught my eye:

>     /*
>      * During WAL recovery, the first access to any data page should
>      * overwrite the whole page from the WAL; so a clobbered page
>      * header is not reason to fail.  Hence, when InRecovery we may
>      * always act as though zero_damaged_pages is ON.
>      */
>     if (zero_damaged_pages || InRecovery)
>     {

But that assumption only holds if full_page_writes is enabled, right? I
changed that in the attached patch as well, but if it isn't accepted
that part of it should still be applied, I think.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/transam/xlogutils.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlogutils.c,v
retrieving revision 1.49
diff -c -r1.49 xlogutils.c
*** src/backend/access/transam/xlogutils.c    5 Jan 2007 22:19:24 -0000    1.49
--- src/backend/access/transam/xlogutils.c    25 Apr 2007 11:40:09 -0000
***************
*** 226,232 ****
      if (blkno < lastblock)
      {
          /* page exists in file */
!         buffer = ReadBuffer(reln, blkno);
      }
      else
      {
--- 226,235 ----
      if (blkno < lastblock)
      {
          /* page exists in file */
!         if(init)
!             buffer = ZapBuffer(reln, blkno);
!         else
!             buffer = ReadBuffer(reln, blkno);
      }
      else
      {
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.216
diff -c -r1.216 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c    30 Mar 2007 18:34:55 -0000    1.216
--- src/backend/storage/buffer/bufmgr.c    25 Apr 2007 11:44:27 -0000
***************
*** 97,102 ****
--- 97,103 ----
  static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
                    int set_flag_bits);
  static void buffer_write_error_callback(void *arg);
+ static Buffer ReadBuffer_common(Relation reln, BlockNumber blockNum, bool alloc_only);
  static volatile BufferDesc *BufferAlloc(Relation reln, BlockNumber blockNum,
              bool *foundPtr);
  static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
***************
*** 121,126 ****
--- 122,148 ----
  Buffer
  ReadBuffer(Relation reln, BlockNumber blockNum)
  {
+     return ReadBuffer_common(reln, blockNum, false);
+ }
+
+ /*
+  * ZapBuffer -- like ReadBuffer, but doesn't read the contents of the page
+  *        from disk. The caller is expected to completely rewrite the page,
+  *        regardless of the current contents. This should only be used in
+  *        recovery where there's no concurrent readers that might see the
+  *        contents of the page before the caller rewrites it.
+  */
+ Buffer
+ ZapBuffer(Relation reln, BlockNumber blockNum)
+ {
+     Assert(InRecovery);
+
+     return ReadBuffer_common(reln, blockNum, true);
+ }
+
+ static Buffer
+ ReadBuffer_common(Relation reln, BlockNumber blockNum, bool alloc_only)
+ {
      volatile BufferDesc *bufHdr;
      Block        bufBlock;
      bool        found;
***************
*** 253,269 ****
      }
      else
      {
          smgrread(reln->rd_smgr, blockNum, (char *) bufBlock);
          /* check for garbage data */
          if (!PageHeaderIsValid((PageHeader) bufBlock))
          {
              /*
!              * During WAL recovery, the first access to any data page should
!              * overwrite the whole page from the WAL; so a clobbered page
!              * header is not reason to fail.  Hence, when InRecovery we may
!              * always act as though zero_damaged_pages is ON.
               */
!             if (zero_damaged_pages || InRecovery)
              {
                  ereport(WARNING,
                          (errcode(ERRCODE_DATA_CORRUPTED),
--- 275,293 ----
      }
      else
      {
+         if(!alloc_only)
+         {
          smgrread(reln->rd_smgr, blockNum, (char *) bufBlock);
          /* check for garbage data */
          if (!PageHeaderIsValid((PageHeader) bufBlock))
          {
              /*
!              * When full_pages_writes is enabled, the first access to any data page should
!              * overwrite the whole page from the WAL during recovery; so a clobbered page
!              * header is not reason to fail.  Hence, we may
!              * act as though zero_damaged_pages is ON.
               */
!             if (zero_damaged_pages || (InRecovery && fullPageWrites))
              {
                  ereport(WARNING,
                          (errcode(ERRCODE_DATA_CORRUPTED),
***************
*** 277,282 ****
--- 301,307 ----
                   errmsg("invalid page header in block %u of relation \"%s\"",
                          blockNum, RelationGetRelationName(reln))));
          }
+         }
      }

      if (isLocalBuf)
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.384
diff -c -r1.384 guc.c
*** src/backend/utils/misc/guc.c    12 Apr 2007 06:53:47 -0000    1.384
--- src/backend/utils/misc/guc.c    25 Apr 2007 11:19:52 -0000
***************
*** 103,109 ****
  extern int    CommitDelay;
  extern int    CommitSiblings;
  extern char *default_tablespace;
- extern bool fullPageWrites;

  #ifdef TRACE_SORT
  extern bool trace_sort;
--- 103,108 ----
Index: src/include/access/xlog.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/access/xlog.h,v
retrieving revision 1.76
diff -c -r1.76 xlog.h
*** src/include/access/xlog.h    5 Jan 2007 22:19:51 -0000    1.76
--- src/include/access/xlog.h    25 Apr 2007 11:19:46 -0000
***************
*** 142,147 ****
--- 142,148 ----
  extern int    XLogArchiveTimeout;
  extern char *XLOG_sync_method;
  extern const char XLOG_sync_method_default[];
+ extern bool fullPageWrites;

  #define XLogArchivingActive()    (XLogArchiveCommand[0] != '\0')

Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.102
diff -c -r1.102 bufmgr.h
*** src/include/storage/bufmgr.h    5 Jan 2007 22:19:57 -0000    1.102
--- src/include/storage/bufmgr.h    25 Apr 2007 11:39:35 -0000
***************
*** 111,116 ****
--- 111,117 ----
   * prototypes for functions in bufmgr.c
   */
  extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
+ extern Buffer ZapBuffer(Relation reln, BlockNumber blockNum);
  extern void ReleaseBuffer(Buffer buffer);
  extern void UnlockReleaseBuffer(Buffer buffer);
  extern void MarkBufferDirty(Buffer buffer);

Re: Avoiding unnecessary reads in recovery

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> While working on this, this comment in ReadBuffer caught my eye:
> 
>>     /*
>>      * During WAL recovery, the first access to any data page should
>>      * overwrite the whole page from the WAL; so a clobbered page
>>      * header is not reason to fail.  Hence, when InRecovery we may
>>      * always act as though zero_damaged_pages is ON.
>>      */
>>     if (zero_damaged_pages || InRecovery)
>>     {
> 
> But that assumption only holds if full_page_writes is enabled, right? I 
> changed that in the attached patch as well, but if it isn't accepted 
> that part of it should still be applied, I think.

On second thought, my fix still isn't 100% right because one could turn 
full_page_writes on before starting replay.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Avoiding unnecessary reads in recovery

From
Gregory Stark
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

> While working on this, this comment in ReadBuffer caught my eye:
>
>>     /*
>>      * During WAL recovery, the first access to any data page should
>>      * overwrite the whole page from the WAL; so a clobbered page
>>      * header is not reason to fail.  Hence, when InRecovery we may
>>      * always act as though zero_damaged_pages is ON.
>>      */
>>     if (zero_damaged_pages || InRecovery)
>>     {
>
> But that assumption only holds if full_page_writes is enabled, right? I changed
> that in the attached patch as well, but if it isn't accepted that part of it
> should still be applied, I think.

Well it's only true if full_page_writes was on when the WAL was written. Which
isn't necessarily the same as saying it's enabled during recovery...

As long as there's a backup block in the log we can use it to clobber pages in
the heap -- which is what your patch effectively does anyways. If we're
replaying a log entry where there isn't a backup block and we find a damaged
page then we're in trouble. Either the damaged page was in a previous backup
block or it's the recovery itself that's damaging it. 

In the latter case it would be pretty useful to abort the recovery so the user
doesn't lose his backup and has a chance to recovery properly (possibly after
reporting and fixing the bug).

So in short I think with your patch this piece of code no longer has a role.
Either your patch kicks in and we never even look at the damaged page at all,
or we should be treating it as corrupt data and just check zero_damaged_pages
alone and not do anything special in recovery.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: Avoiding unnecessary reads in recovery

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> So in short I think with your patch this piece of code no longer has a role.
> Either your patch kicks in and we never even look at the damaged page at all,
> or we should be treating it as corrupt data and just check zero_damaged_pages
> alone and not do anything special in recovery.

Good point. Adjusted patch attached. I added some comments as well.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/transam/xlogutils.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlogutils.c,v
retrieving revision 1.49
diff -c -r1.49 xlogutils.c
*** src/backend/access/transam/xlogutils.c    5 Jan 2007 22:19:24 -0000    1.49
--- src/backend/access/transam/xlogutils.c    25 Apr 2007 14:27:05 -0000
***************
*** 226,232 ****
      if (blkno < lastblock)
      {
          /* page exists in file */
!         buffer = ReadBuffer(reln, blkno);
      }
      else
      {
--- 226,235 ----
      if (blkno < lastblock)
      {
          /* page exists in file */
!         if (init)
!             buffer = ZapBuffer(reln, blkno);
!         else
!             buffer = ReadBuffer(reln, blkno);
      }
      else
      {
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.216
diff -c -r1.216 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c    30 Mar 2007 18:34:55 -0000    1.216
--- src/backend/storage/buffer/bufmgr.c    25 Apr 2007 14:36:15 -0000
***************
*** 17,22 ****
--- 17,26 ----
   *        and pin it so that no one can destroy it while this process
   *        is using it.
   *
+  * ZapBuffer() -- like ReadBuffer, but destroys the contents of the
+  *        page. Used in recovery when the page is completely overwritten
+  *        from WAL.
+  *
   * ReleaseBuffer() -- unpin a buffer
   *
   * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
***************
*** 97,102 ****
--- 101,107 ----
  static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
                    int set_flag_bits);
  static void buffer_write_error_callback(void *arg);
+ static Buffer ReadBuffer_common(Relation reln, BlockNumber blockNum, bool alloc_only);
  static volatile BufferDesc *BufferAlloc(Relation reln, BlockNumber blockNum,
              bool *foundPtr);
  static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
***************
*** 121,126 ****
--- 126,155 ----
  Buffer
  ReadBuffer(Relation reln, BlockNumber blockNum)
  {
+     return ReadBuffer_common(reln, blockNum, false);
+ }
+
+ /*
+  * ZapBuffer -- like ReadBuffer, but doesn't read the contents of the page
+  *        from disk. The caller is expected to completely rewrite the page,
+  *        regardless of the current contents. This should only be used in
+  *        recovery where there's no concurrent readers that might see the
+  *        contents of the page before the caller rewrites it.
+  */
+ Buffer
+ ZapBuffer(Relation reln, BlockNumber blockNum)
+ {
+     Assert(InRecovery);
+
+     return ReadBuffer_common(reln, blockNum, true);
+ }
+
+ /*
+  * ReadBuffer_common -- common logic of ReadBuffer and ZapBuffer.
+  */
+ static Buffer
+ ReadBuffer_common(Relation reln, BlockNumber blockNum, bool alloc_only)
+ {
      volatile BufferDesc *bufHdr;
      Block        bufBlock;
      bool        found;
***************
*** 253,269 ****
      }
      else
      {
          smgrread(reln->rd_smgr, blockNum, (char *) bufBlock);
          /* check for garbage data */
          if (!PageHeaderIsValid((PageHeader) bufBlock))
          {
              /*
!              * During WAL recovery, the first access to any data page should
!              * overwrite the whole page from the WAL; so a clobbered page
!              * header is not reason to fail.  Hence, when InRecovery we may
!              * always act as though zero_damaged_pages is ON.
               */
!             if (zero_damaged_pages || InRecovery)
              {
                  ereport(WARNING,
                          (errcode(ERRCODE_DATA_CORRUPTED),
--- 282,304 ----
      }
      else
      {
+         /*
+          * Read in the page, unless the caller intends to overwrite it
+          * and just wants us to allocate a buffer.
+          */
+         if (!alloc_only)
+         {
          smgrread(reln->rd_smgr, blockNum, (char *) bufBlock);
          /* check for garbage data */
          if (!PageHeaderIsValid((PageHeader) bufBlock))
          {
              /*
!              * We used to ignore corrupt pages in WAL recovery, but
!              * that was only ever safe if full_page_writes was enabled.
!              * Now the caller sets alloc_only to false if he intends to
!              * overwrite the whole page, which is already checked above.
               */
!             if (zero_damaged_pages)
              {
                  ereport(WARNING,
                          (errcode(ERRCODE_DATA_CORRUPTED),
***************
*** 277,282 ****
--- 312,318 ----
                   errmsg("invalid page header in block %u of relation \"%s\"",
                          blockNum, RelationGetRelationName(reln))));
          }
+         }
      }

      if (isLocalBuf)
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.102
diff -c -r1.102 bufmgr.h
*** src/include/storage/bufmgr.h    5 Jan 2007 22:19:57 -0000    1.102
--- src/include/storage/bufmgr.h    25 Apr 2007 14:14:28 -0000
***************
*** 111,116 ****
--- 111,117 ----
   * prototypes for functions in bufmgr.c
   */
  extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
+ extern Buffer ZapBuffer(Relation reln, BlockNumber blockNum);
  extern void ReleaseBuffer(Buffer buffer);
  extern void UnlockReleaseBuffer(Buffer buffer);
  extern void MarkBufferDirty(Buffer buffer);

Re: Avoiding unnecessary reads in recovery

From
"Simon Riggs"
Date:
On Wed, 2007-04-25 at 13:48 +0100, Heikki Linnakangas wrote:

> I was surprised how big a difference it makes, but when you think about 
> it it's logical. Without the patch, it's doing roughly the same I/O as 
> the test itself, reading in pages, modifying them, and writing them 
> back. With the patch, all the reads are done sequentially from the WAL, 
> and then written back in a batch at the end of the WAL replay which is a 
> lot more efficient.

Interesting patch.

It would be good to see a longer term test. I'd expect things to fall
back to about 50% of the time on a longer recovery where the writes need
to be written out of cache.

I was thinking of getting the bgwriter to help out to improve matters
there.


Patch-wise, I love the name ZapBuffer() but would think that
OverwriteBuffer() would be more descriptive.

As regards the zero_damaged_pages question, I raised that some time ago
but we didn't arrive at an explicit answer. All I would say is we can't
allow invalid pages in the buffer manager at any time, whatever options
we have requested, otherwise other code will fail almost immediately.
I'm not sure there's another option?

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Avoiding unnecessary reads in recovery

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> As regards the zero_damaged_pages question, I raised that some time ago
> but we didn't arrive at an explicit answer. All I would say is we can't
> allow invalid pages in the buffer manager at any time, whatever options
> we have requested, otherwise other code will fail almost immediately.

Yeah --- the proposed new bufmgr routine should probably explicitly zero
the content of the buffer.  It doesn't really matter in the context of
WAL recovery, since there can't be any concurrent access to the buffer,
but it'd make it safe to use in non-WAL contexts (I think there are
other places where we know we are going to init the page and so a
physical read is a waste of time).  Also, this would let the patch be

+    if (alloc_only)
+        MemSet...
+    else    smgrread...

and you don't need to hack out the PageHeaderIsValid test, since it will
allow zeroed pages.

Possibly ReadZeroedBuffer would be a better name?
        regards, tom lane


Re: Avoiding unnecessary reads in recovery

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> As regards the zero_damaged_pages question, I raised that some time ago
>> but we didn't arrive at an explicit answer. All I would say is we can't
>> allow invalid pages in the buffer manager at any time, whatever options
>> we have requested, otherwise other code will fail almost immediately.
> 
> Yeah --- the proposed new bufmgr routine should probably explicitly zero
> the content of the buffer.  It doesn't really matter in the context of
> WAL recovery, since there can't be any concurrent access to the buffer,
> but it'd make it safe to use in non-WAL contexts (I think there are
> other places where we know we are going to init the page and so a
> physical read is a waste of time).

Is there? I can't think of any. Extending a relation doesn't count.

Zeroing the buffer explicitly might be a good idea anyway. I agree it 
feels a bit dangerous to have a moment when there's a garbled page in 
buffer cache, even if only in recovery.

> Also, this would let the patch be
> 
> +    if (alloc_only)
> +        MemSet...
> +    else
>         smgrread...
> 
> and you don't need to hack out the PageHeaderIsValid test, since it will
> allow zeroed pages.

Well, I'd still put the PageHeaderIsValid test in the else-branch. Not 
like the few cycles spent on the test matter, but I don't see a reason 
not to.

> Possibly ReadZeroedBuffer would be a better name?

Yeah, sounds good. "Read" is a bit misleading, since it doesn't actually 
read in the buffer, but it's good that the name is closely related to 
ReadBuffer.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Avoiding unnecessary reads in recovery

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> but it'd make it safe to use in non-WAL contexts (I think there are
>> other places where we know we are going to init the page and so a
>> physical read is a waste of time).

> Is there? I can't think of any. Extending a relation doesn't count.

No, but re-using a free page in an index does.  I'm not sure which index
AMs know for sure the page is free, and which have to read it and check,
but I think there's at least some scope for that.
        regards, tom lane


Re: Avoiding unnecessary reads in recovery

From
Jim Nasby
Date:
On Apr 25, 2007, at 2:48 PM, Heikki Linnakangas wrote:
> In recovery, with full_pages_writes=on, we read in each page only  
> to overwrite the contents with a full page image. That's a waste of  
> time, and can have a surprisingly large effect on recovery time.
>
> As a quick test on my laptop, I initialized a DBT-2 test with 5  
> warehouses, and let it run for 2 minutes without think-times to  
> generate some WAL. Then I did a "kill -9 postmaster", and took a  
> copy of the data directory to use for testing recovery.
>
> With CVS HEAD, the recovery took ~ 2 minutes. With the attached  
> patch, it took 5 seconds. (yes, I used the same not-yet-recovered  
> data directory in both tests, and cleared the os cache with "echo 1  
> > /proc/sys/vm/drop_caches").
>
> I was surprised how big a difference it makes, but when you think  
> about it it's logical. Without the patch, it's doing roughly the  
> same I/O as the test itself, reading in pages, modifying them, and  
> writing them back. With the patch, all the reads are done  
> sequentially from the WAL, and then written back in a batch at the  
> end of the WAL replay which is a lot more efficient.
>
> It's interesting that (with the patch) full_page_writes can  
> *shorten* your recovery time. I've always thought it to have a  
> purely negative effect on performance.
>
> I'll leave it up to the jury if this tiny little change is  
> appropriate after feature freeze...
>
> While working on this, this comment in ReadBuffer caught my eye:
>
>>     /*
>>      * During WAL recovery, the first access to any data page should
>>      * overwrite the whole page from the WAL; so a clobbered page
>>      * header is not reason to fail.  Hence, when InRecovery we may
>>      * always act as though zero_damaged_pages is ON.
>>      */
>>     if (zero_damaged_pages || InRecovery)
>>     {
>
> But that assumption only holds if full_page_writes is enabled,  
> right? I changed that in the attached patch as well, but if it  
> isn't accepted that part of it should still be applied, I think.

So what happens if a backend is running with full_page_writes = off,  
someone edits postgresql.conf to turns it on and forgets to reload/ 
restart, and then we crash? You'll come up in recovery mode thinking  
that f_p_w was turned on, when in fact it wasn't.

ISTM that we need to somehow log what the status of full_page_writes  
is, if it's going to affect how recovery works.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)




Re: Avoiding unnecessary reads in recovery

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> but it'd make it safe to use in non-WAL contexts (I think there are
>>> other places where we know we are going to init the page and so a
>>> physical read is a waste of time).
> 
>> Is there? I can't think of any. Extending a relation doesn't count.
> 
> No, but re-using a free page in an index does.  I'm not sure which index
> AMs know for sure the page is free, and which have to read it and check,
> but I think there's at least some scope for that.

B-tree, GIN ans GiST read and check. I'm not sure how hash works. I 
think the latest bitmap index patch doesn't support reusing empty pages 
at all.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Avoiding unnecessary reads in recovery

From
"Zeugswetter Andreas ADI SD"
Date:
> So what happens if a backend is running with full_page_writes
> = off, someone edits postgresql.conf to turns it on and
> forgets to reload/ restart, and then we crash? You'll come up
> in recovery mode thinking that f_p_w was turned on, when in
> fact it wasn't.
>
> ISTM that we need to somehow log what the status of
> full_page_writes is, if it's going to affect how recovery works.

Optimally recovery should do this when confronted with a full page image
only. The full page is in the same WAL record that first touches a page,
so this should not need to depend on a setting.

Andreas


Re: Avoiding unnecessary reads in recovery

From
Tom Lane
Date:
Jim Nasby <decibel@decibel.org> writes:
> So what happens if a backend is running with full_page_writes = off,  
> someone edits postgresql.conf to turns it on and forgets to reload/ 
> restart, and then we crash? You'll come up in recovery mode thinking  
> that f_p_w was turned on, when in fact it wasn't.

One of the advantages of the proposed patch is that it avoids having to
make any assumptions like that.
        regards, tom lane


Re: Avoiding unnecessary reads in recovery

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> As regards the zero_damaged_pages question, I raised that some time ago
>> but we didn't arrive at an explicit answer. All I would say is we can't
>> allow invalid pages in the buffer manager at any time, whatever options
>> we have requested, otherwise other code will fail almost immediately.
>
> Yeah --- the proposed new bufmgr routine should probably explicitly zero
> the content of the buffer.  It doesn't really matter in the context of
> WAL recovery, since there can't be any concurrent access to the buffer,
> but it'd make it safe to use in non-WAL contexts (I think there are
> other places where we know we are going to init the page and so a
> physical read is a waste of time).

To implement that correctly, I think we'd need to take the content lock
to clear the buffer if it's already found in the cache. It doesn't seem
right to me for the buffer manager to do that, in the worst case it
could lead to deadlocks if that function was ever used while holding
another buffer locked.

What we could have is the semantics of "Return a buffer, with either
correct contents or completely zeroed out". It would act just like
ReadBuffer if the buffer was already in memory, and zero out the page
otherwise. That's a bit strange semantics to have, but is simple to
implement and works for the use-cases we've been talking about.

Patch implementing that attached. I named the function "ReadOrZeroBuffer".

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/transam/xlogutils.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlogutils.c,v
retrieving revision 1.49
diff -c -r1.49 xlogutils.c
*** src/backend/access/transam/xlogutils.c    5 Jan 2007 22:19:24 -0000    1.49
--- src/backend/access/transam/xlogutils.c    27 Apr 2007 10:44:37 -0000
***************
*** 226,232 ****
      if (blkno < lastblock)
      {
          /* page exists in file */
!         buffer = ReadBuffer(reln, blkno);
      }
      else
      {
--- 226,235 ----
      if (blkno < lastblock)
      {
          /* page exists in file */
!         if (init)
!             buffer = ReadOrZeroBuffer(reln, blkno);
!         else
!             buffer = ReadBuffer(reln, blkno);
      }
      else
      {
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.216
diff -c -r1.216 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c    30 Mar 2007 18:34:55 -0000    1.216
--- src/backend/storage/buffer/bufmgr.c    27 Apr 2007 10:49:08 -0000
***************
*** 17,22 ****
--- 17,26 ----
   *        and pin it so that no one can destroy it while this process
   *        is using it.
   *
+  * ReadOrZeroBuffer() -- like ReadBuffer, but clears out the contents of the
+  *        page if it's not in cache already. Used in recovery when the page is
+  *        completely overwritten from WAL.
+  *
   * ReleaseBuffer() -- unpin a buffer
   *
   * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
***************
*** 97,102 ****
--- 101,107 ----
  static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
                    int set_flag_bits);
  static void buffer_write_error_callback(void *arg);
+ static Buffer ReadBuffer_common(Relation reln, BlockNumber blockNum, bool alloc_only);
  static volatile BufferDesc *BufferAlloc(Relation reln, BlockNumber blockNum,
              bool *foundPtr);
  static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
***************
*** 121,126 ****
--- 126,152 ----
  Buffer
  ReadBuffer(Relation reln, BlockNumber blockNum)
  {
+     return ReadBuffer_common(reln, blockNum, false);
+ }
+
+ /*
+  * ReadOrZeroBuffer -- like ReadBuffer, but if the page isn't in buffer
+  *        cache already, it's filled with zeros instead of reading it from
+  *        disk. The caller is expected to rewrite it with real data,
+  *        regardless of the current contents.
+  */
+ Buffer
+ ReadOrZeroBuffer(Relation reln, BlockNumber blockNum)
+ {
+     return ReadBuffer_common(reln, blockNum, true);
+ }
+
+ /*
+  * ReadBuffer_common -- common logic of ReadBuffer and ReadZeroedBuffer
+  */
+ static Buffer
+ ReadBuffer_common(Relation reln, BlockNumber blockNum, bool zeroPage)
+ {
      volatile BufferDesc *bufHdr;
      Block        bufBlock;
      bool        found;
***************
*** 253,269 ****
      }
      else
      {
          smgrread(reln->rd_smgr, blockNum, (char *) bufBlock);
          /* check for garbage data */
          if (!PageHeaderIsValid((PageHeader) bufBlock))
          {
              /*
!              * During WAL recovery, the first access to any data page should
!              * overwrite the whole page from the WAL; so a clobbered page
!              * header is not reason to fail.  Hence, when InRecovery we may
!              * always act as though zero_damaged_pages is ON.
               */
!             if (zero_damaged_pages || InRecovery)
              {
                  ereport(WARNING,
                          (errcode(ERRCODE_DATA_CORRUPTED),
--- 279,304 ----
      }
      else
      {
+         /*
+          * Read in the page, unless the caller intends to overwrite it
+          * and just wants us to allocate a buffer.
+          */
+         if (zeroPage)
+             MemSet((char *) bufBlock, 0, BLCKSZ);
+         else
+         {
          smgrread(reln->rd_smgr, blockNum, (char *) bufBlock);
          /* check for garbage data */
          if (!PageHeaderIsValid((PageHeader) bufBlock))
          {
              /*
!              * We used to ignore corrupt pages in WAL recovery, but that
!              * was only ever safe if full_page_writes was enabled. Now that
!              * the caller sets zeroPage to true if he intends to overwrite
!              * the whole page, if we get here the page better be valid or
!              * we'll lose data.
               */
!             if (zero_damaged_pages)
              {
                  ereport(WARNING,
                          (errcode(ERRCODE_DATA_CORRUPTED),
***************
*** 277,282 ****
--- 312,318 ----
                   errmsg("invalid page header in block %u of relation \"%s\"",
                          blockNum, RelationGetRelationName(reln))));
          }
+         }
      }

      if (isLocalBuf)
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.102
diff -c -r1.102 bufmgr.h
*** src/include/storage/bufmgr.h    5 Jan 2007 22:19:57 -0000    1.102
--- src/include/storage/bufmgr.h    27 Apr 2007 10:44:20 -0000
***************
*** 111,116 ****
--- 111,117 ----
   * prototypes for functions in bufmgr.c
   */
  extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
+ extern Buffer ReadOrZeroBuffer(Relation reln, BlockNumber blockNum);
  extern void ReleaseBuffer(Buffer buffer);
  extern void UnlockReleaseBuffer(Buffer buffer);
  extern void MarkBufferDirty(Buffer buffer);

Re: Avoiding unnecessary reads in recovery

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:

> What we could have is the semantics of "Return a buffer, with either 
> correct contents or completely zeroed out". It would act just like 
> ReadBuffer if the buffer was already in memory, and zero out the page 
> otherwise. That's a bit strange semantics to have, but is simple to 
> implement and works for the use-cases we've been talking about.

Huh, why does that work in the case where the recovery code reads a
page, then evicts it because of memory pressure, and later needs to read
it again?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Avoiding unnecessary reads in recovery

From
Heikki Linnakangas
Date:
Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
> 
>> What we could have is the semantics of "Return a buffer, with either 
>> correct contents or completely zeroed out". It would act just like 
>> ReadBuffer if the buffer was already in memory, and zero out the page 
>> otherwise. That's a bit strange semantics to have, but is simple to 
>> implement and works for the use-cases we've been talking about.
> 
> Huh, why does that work in the case where the recovery code reads a
> page, then evicts it because of memory pressure, and later needs to read
> it again?

I don't understand the problem. You only use ReadOrZeroBuffer when 
you're going to replace the contents entirely, and don't care about the 
old contents. If you want to read something in, you use ReadBuffer.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Avoiding unnecessary reads in recovery

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:
> Alvaro Herrera wrote:
> >Heikki Linnakangas wrote:
> >
> >>What we could have is the semantics of "Return a buffer, with either 
> >>correct contents or completely zeroed out". It would act just like 
> >>ReadBuffer if the buffer was already in memory, and zero out the page 
> >>otherwise. That's a bit strange semantics to have, but is simple to 
> >>implement and works for the use-cases we've been talking about.
> >
> >Huh, why does that work in the case where the recovery code reads a
> >page, then evicts it because of memory pressure, and later needs to read
> >it again?
> 
> I don't understand the problem. You only use ReadOrZeroBuffer when 
> you're going to replace the contents entirely, and don't care about the 
> old contents. If you want to read something in, you use ReadBuffer.

Oh, the recovery code selects which one to call based on the "init"
param, which is on the first hunk of the diff :-)  I forgot that, I was
just thinking in your description above.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Avoiding unnecessary reads in recovery

From
Bruce Momjian
Date:
I assume this is 8.4 material.

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> Tom Lane wrote:
> > "Simon Riggs" <simon@2ndquadrant.com> writes:
> >> As regards the zero_damaged_pages question, I raised that some time ago
> >> but we didn't arrive at an explicit answer. All I would say is we can't
> >> allow invalid pages in the buffer manager at any time, whatever options
> >> we have requested, otherwise other code will fail almost immediately.
> > 
> > Yeah --- the proposed new bufmgr routine should probably explicitly zero
> > the content of the buffer.  It doesn't really matter in the context of
> > WAL recovery, since there can't be any concurrent access to the buffer,
> > but it'd make it safe to use in non-WAL contexts (I think there are
> > other places where we know we are going to init the page and so a
> > physical read is a waste of time).  
> 
> To implement that correctly, I think we'd need to take the content lock 
> to clear the buffer if it's already found in the cache. It doesn't seem 
> right to me for the buffer manager to do that, in the worst case it 
> could lead to deadlocks if that function was ever used while holding 
> another buffer locked.
> 
> What we could have is the semantics of "Return a buffer, with either 
> correct contents or completely zeroed out". It would act just like 
> ReadBuffer if the buffer was already in memory, and zero out the page 
> otherwise. That's a bit strange semantics to have, but is simple to 
> implement and works for the use-cases we've been talking about.
> 
> Patch implementing that attached. I named the function "ReadOrZeroBuffer".
> 
> -- 
>    Heikki Linnakangas
>    EnterpriseDB   http://www.enterprisedb.com


> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Avoiding unnecessary reads in recovery

From
"Simon Riggs"
Date:
On Fri, 2007-04-27 at 10:37 -0400, Bruce Momjian wrote:
> I assume this is 8.4 material.

I think its a small enough, performance-only change to allow it at this
time. It will provide considerable additional benefit for Warm Standby
servers.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Avoiding unnecessary reads in recovery

From
Jim Nasby
Date:
On Apr 26, 2007, at 3:39 PM, Tom Lane wrote:
> Jim Nasby <decibel@decibel.org> writes:
>> So what happens if a backend is running with full_page_writes = off,
>> someone edits postgresql.conf to turns it on and forgets to reload/
>> restart, and then we crash? You'll come up in recovery mode thinking
>> that f_p_w was turned on, when in fact it wasn't.
>
> One of the advantages of the proposed patch is that it avoids  
> having to
> make any assumptions like that.

Yeah, the only email in the thread when I wrote that was Heikki's  
original email (I wrote the email on a plane, but did note it had  
been some time after the original email with no replies). Sorry for  
the noise.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)




Re: Avoiding unnecessary reads in recovery

From
"Simon Riggs"
Date:
On Fri, 2007-04-27 at 12:22 +0100, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > "Simon Riggs" <simon@2ndquadrant.com> writes:
> >> As regards the zero_damaged_pages question, I raised that some time ago
> >> but we didn't arrive at an explicit answer. All I would say is we can't
> >> allow invalid pages in the buffer manager at any time, whatever options
> >> we have requested, otherwise other code will fail almost immediately.
> > 
> > Yeah --- the proposed new bufmgr routine should probably explicitly zero
> > the content of the buffer.  It doesn't really matter in the context of
> > WAL recovery, since there can't be any concurrent access to the buffer,
> > but it'd make it safe to use in non-WAL contexts (I think there are
> > other places where we know we are going to init the page and so a
> > physical read is a waste of time).  
> 
> To implement that correctly, I think we'd need to take the content lock 
> to clear the buffer if it's already found in the cache. It doesn't seem 
> right to me for the buffer manager to do that, in the worst case it 
> could lead to deadlocks if that function was ever used while holding 
> another buffer locked.
> 
> What we could have is the semantics of "Return a buffer, with either 
> correct contents or completely zeroed out". It would act just like 
> ReadBuffer if the buffer was already in memory, and zero out the page 
> otherwise. That's a bit strange semantics to have, but is simple to 
> implement and works for the use-cases we've been talking about.

Sounds good.

> Patch implementing that attached. I named the function "ReadOrZeroBuffer".

We already have an API quirk similar to this: relation extension. It
seems strange to have two different kinds of special case API that are
used alongside each other in XLogReadBuffer()

Currently if we extend by a block we saybuffer = ReadBuffer(reln, P_NEW);

Why not just add another option, so where you use ReadOrZeroBuffer we
just saybuffer = ReadBuffer(reln, P_INIT);

which we then check for on entry by sayingisInit = (blockNum == P_INIT);
just as we already do for P_NEW

That way you can do the code like thisif (isExtend || isInit){    /* new or initialised buffers are zero-filled */
MemSet((char*) bufBlock, 0, BLCKSZ);    if (isExtend)        smgrextend(reln->rd_smgr, blockNum,                (char
*)bufBlock,               reln->rd_istemp);}
 

That way we don't have to have ReadBuffer_common etc..

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Avoiding unnecessary reads in recovery

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Fri, 2007-04-27 at 12:22 +0100, Heikki Linnakangas wrote:
>> Patch implementing that attached. I named the function "ReadOrZeroBuffer".
> 
> We already have an API quirk similar to this: relation extension. It
> seems strange to have two different kinds of special case API that are
> used alongside each other in XLogReadBuffer()
> 
> Currently if we extend by a block we say
>     buffer = ReadBuffer(reln, P_NEW);
> 
> Why not just add another option, so where you use ReadOrZeroBuffer we
> just say
>     buffer = ReadBuffer(reln, P_INIT);

Because ReadOrZeroBuffer needs the block number as an argument.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Avoiding unnecessary reads in recovery

From
Heikki Linnakangas
Date:
I was actually thinking that we could slip this in 8.3. It's a simple, 
well-understood patch, which fixes a little data integrity quirk as well 
as gives a nice recovery speed up.

Bruce Momjian wrote:
> I assume this is 8.4 material.
> 
> ---------------------------------------------------------------------------
> 
> Heikki Linnakangas wrote:
>> Tom Lane wrote:
>>> "Simon Riggs" <simon@2ndquadrant.com> writes:
>>>> As regards the zero_damaged_pages question, I raised that some time ago
>>>> but we didn't arrive at an explicit answer. All I would say is we can't
>>>> allow invalid pages in the buffer manager at any time, whatever options
>>>> we have requested, otherwise other code will fail almost immediately.
>>> Yeah --- the proposed new bufmgr routine should probably explicitly zero
>>> the content of the buffer.  It doesn't really matter in the context of
>>> WAL recovery, since there can't be any concurrent access to the buffer,
>>> but it'd make it safe to use in non-WAL contexts (I think there are
>>> other places where we know we are going to init the page and so a
>>> physical read is a waste of time).  
>> To implement that correctly, I think we'd need to take the content lock 
>> to clear the buffer if it's already found in the cache. It doesn't seem 
>> right to me for the buffer manager to do that, in the worst case it 
>> could lead to deadlocks if that function was ever used while holding 
>> another buffer locked.
>>
>> What we could have is the semantics of "Return a buffer, with either 
>> correct contents or completely zeroed out". It would act just like 
>> ReadBuffer if the buffer was already in memory, and zero out the page 
>> otherwise. That's a bit strange semantics to have, but is simple to 
>> implement and works for the use-cases we've been talking about.
>>
>> Patch implementing that attached. I named the function "ReadOrZeroBuffer".
>>
>> -- 
>>    Heikki Linnakangas
>>    EnterpriseDB   http://www.enterprisedb.com
> 
> 
>> ---------------------------(end of broadcast)---------------------------
>> TIP 6: explain analyze is your friend
> 


--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Avoiding unnecessary reads in recovery

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I was actually thinking that we could slip this in 8.3. It's a simple, 
> well-understood patch, which fixes a little data integrity quirk as well 
> as gives a nice recovery speed up.

Yeah.  It's arguably a bug fix, in fact, since it eliminates the issue
that the recovery behavior is wrong if full-page-writes had been off
when the WAL log was made.
        regards, tom lane


Re: Avoiding unnecessary reads in recovery

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> What we could have is the semantics of "Return a buffer, with either 
> correct contents or completely zeroed out". It would act just like 
> ReadBuffer if the buffer was already in memory, and zero out the page 
> otherwise. That's a bit strange semantics to have, but is simple to 
> implement and works for the use-cases we've been talking about.

> Patch implementing that attached. I named the function "ReadOrZeroBuffer".

Applied.  BTW, I realized that there is a potential issue created by
this, which is that the smgr level might see a write for a page that it
never saw a read for.  I don't think there are any bad consequences of
this ATM, but it is skating around the edges of some bugs we've had
previously with relation extension.  In particular ReadOrZeroBuffer
avoids the error that would normally occur if one tries to read a page
that's beyond the logical EOF; and if the page is subsequently modified
and written, md.c is likely to get confused/unhappy, particularly if the
page is beyond the next segment boundary.  This isn't a problem in
XLogReadBuffer's usage because it carefully checks the EOF position
before trying to use ReadOrZeroBuffer, but it's a limitation other
callers will need to think about.
        regards, tom lane