Re: 16-bit page checksums for 9.2 - Mailing list pgsql-hackers

From Dan Scales
Subject Re: 16-bit page checksums for 9.2
Date
Msg-id 825374483.586078.1327622481067.JavaMail.root@zimbra-prod-mbox-4.vmware.com
Whole thread Raw
In response to Re: 16-bit page checksums for 9.2  (Noah Misch <noah@leadboat.com>)
Responses Re: 16-bit page checksums for 9.2
List pgsql-hackers
Some other comments on the checksum patch:

I'm not sure why you moved the checksum calculation (PageSetVerificationInfo) to mdwrite() rather than smgrwrite().  If
therewere every another storage backend, it would have to duplicate the checksum check, right?  Is there a disadvantage
toputting it in smgrwrite()?
 

You may have already noticed this, but a bunch of the comments are incorrect, now that you have moved the checksum
calculationto mdwrite().
 
 config.sgml - says writes via temp_buffers (which I think means local buffers) are not checksummed -- that's no longer
true,right? bufmgr.c, line 1914 - bufToWrite no longer exists.  You could revert changes from 1912-1920 localbuf.c,
line203 - as mentioned below, this comment is no longer relevant, you are checksumming local buffers now
 


Dan

----- Original Message -----
From: "Noah Misch" <noah@leadboat.com>
To: "Simon Riggs" <simon@2ndQuadrant.com>
Cc: "Heikki Linnakangas" <heikki.linnakangas@enterprisedb.com>, "Robert Haas" <robertmhaas@gmail.com>, "Andres Freund"
<andres@anarazel.de>,"Kevin Grittner" <Kevin.Grittner@wicourts.gov>, david@fetter.org, aidan@highrise.ca,
stark@mit.edu,pgsql-hackers@postgresql.org
 
Sent: Thursday, January 26, 2012 12:20:39 PM
Subject: Re: [HACKERS] 16-bit page checksums for 9.2

On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:
> v7

This patch uses FPIs to guard against torn hint writes, even when the
checksums are disabled.  One could not simply condition them on the
page_checksums setting, though.  Suppose page_checksums=off and we're hinting
a page previously written with page_checksums=on.  If that write tears,
leaving the checksum intact, that block will now fail verification.  A couple
of ideas come to mind.  (a) Read the on-disk page and emit an FPI only if the
old page had a checksum.  (b) Add a checksumEnableLSN field to pg_control.
Whenever the cluster starts with checksums disabled, set the field to
InvalidXLogRecPtr.  Whenever the cluster starts with checksums enabled and the
field is InvalidXLogRecPtr, set the field to the next LSN.  When a checksum
failure occurs in a page with LSN older than the stored one, emit either a
softer warning or no message at all.

Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty
the buffer.  Here are other sites I noticed that do MarkBufferDirty() without
bumping the page
LSN:heap_xlog_visible()visibilitymap_clear()visibilitymap_truncate()lazy_scan_heap()XLogRecordPageWithFreeSpace()FreeSpaceMapTruncateRel()fsm_set_and_search()fsm_vacuum_page()fsm_search_avail()
Though I have not investigated each of these in detail, I suspect most or all
represent continued torn-page hazards.  Since the FSM is just a hint, we do
not WAL-log it at all; it would be nicest to always skip checksums for it,
too.  The visibility map shortcuts are more nuanced.

When page_checksums=off and we read a page last written by page_checksums=on,
we still verify its checksum.  If a mostly-insert/read site uses checksums for
some time and eventually wishes to shed the overhead, disabling the feature
will not stop the costs for reads of old data.  They would need a dump/reload
to get the performance of a never-checksummed database.  Let's either
unconditionally skip checksum validation under page_checksums=off or add a new
state like page_checksums=ignore for that even-weaker condition.

> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml

> +       <para>
> +        Turning this parameter off speeds normal operation, but
> +        might allow data corruption to go unnoticed. The checksum uses
> +        16-bit checksums, using the fast Fletcher 16 algorithm. With this
> +        parameter enabled there is still a non-zero probability that an error
> +        could go undetected, as well as a non-zero probability of false
> +        positives.
> +       </para>

What sources of false positives do you intend to retain?

> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c
> @@ -20,6 +20,7 @@
>  #include "postgres.h"
>  
>  #include "access/visibilitymap.h"
> +#include "access/transam.h"
>  #include "access/xact.h"
>  #include "access/xlogutils.h"
>  #include "catalog/catalog.h"
> @@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
>  /* XLOG gives us high 4 bits */
>  #define XLOG_SMGR_CREATE    0x10
>  #define XLOG_SMGR_TRUNCATE    0x20
> +#define XLOG_SMGR_HINT        0x40
>  
>  typedef struct xl_smgr_create
>  {
> @@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
>      smgrDoPendingDeletes(false);
>  }
>  
> +/*
> + * Write a backup block if needed when we are setting a hint.
> + *
> + * Deciding the "if needed" bit is delicate and requires us to either
> + * grab WALInsertLock or check the info_lck spinlock. If we check the
> + * spinlock and it says Yes then we will need to get WALInsertLock as well,
> + * so the design choice here is to just go straight for the WALInsertLock
> + * and trust that calls to this function are minimised elsewhere.
> + *
> + * Callable while holding share lock on the buffer content.
> + *
> + * Possible that multiple concurrent backends could attempt to write
> + * WAL records. In that case, more than one backup block may be recorded
> + * though that isn't important to the outcome and the backup blocks are
> + * likely to be identical anyway.
> + */
> +#define    SMGR_HINT_WATERMARK        13579
> +void
> +smgr_buffer_hint(Buffer buffer)
> +{
> +    /*
> +     * Make an XLOG entry reporting the hint
> +     */
> +    XLogRecPtr    lsn;
> +    XLogRecData rdata[2];
> +    int            watermark = SMGR_HINT_WATERMARK;
> +
> +    /*
> +     * Not allowed to have zero-length records, so use a small watermark
> +     */
> +    rdata[0].data = (char *) (&watermark);
> +    rdata[0].len = sizeof(int);
> +    rdata[0].buffer = InvalidBuffer;
> +    rdata[0].buffer_std = false;
> +    rdata[0].next = &(rdata[1]);
> +
> +    rdata[1].data = NULL;
> +    rdata[1].len = 0;
> +    rdata[1].buffer = buffer;
> +    rdata[1].buffer_std = true;
> +    rdata[1].next = NULL;
> +
> +    lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata);
> +
> +    /*
> +     * Set the page LSN if we wrote a backup block.
> +     */
> +    if (!XLByteEQ(InvalidXLogRecPtr, lsn))
> +    {
> +        Page     page = BufferGetPage(buffer);
> +        PageSetLSN(page, lsn);

PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
is insufficient.

Need a PageSetTLI() here, too.

> +        elog(LOG, "inserted backup block for hint bit");
> +    }
> +}

> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c

> @@ -2341,6 +2350,41 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
>      if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
>          (BM_DIRTY | BM_JUST_DIRTIED))
>      {
> +        /*
> +         * If we're writing checksums and we care about torn pages then we
> +         * cannot dirty a page during recovery as a result of a hint.
> +         * We can set the hint, just not dirty the page as a result.
> +         *
> +         * See long discussion in bufpage.c
> +         */
> +        if (HintsMustNotDirtyPage())
> +            return;

This expands toif (page_checksums && fullPageWrites && RecoveryInProgress())
If only page_checksums is false, smgr_buffer_hint() can attempt to insert a
WAL record during recovery.

> +
> +        /*
> +         * Write a full page into WAL iff this is the first change on the
> +         * block since the last checkpoint. That will never be the case
> +         * if the block is already dirty because we either made a change
> +         * or set a hint already. Note that aggressive cleaning of blocks
> +         * dirtied by hint bit setting would increase the call rate.
> +         * Bulk setting of hint bits would reduce the call rate...
> +         *
> +         * We must issue the WAL record before we mark the buffer dirty.
> +         * Otherwise we might write the page before we write the WAL.
> +         * That causes a race condition, since a checkpoint might
> +         * occur between writing the WAL record and marking the buffer dirty.
> +         * We solve that with a kluge, but one that is already in use
> +         * during transaction commit to prevent race conditions.
> +         * Basically, we simply prevent the checkpoint WAL record from
> +         * being written until we have marked the buffer dirty. We don't
> +         * start the checkpoint flush until we have marked dirty, so our
> +         * checkpoint must flush the change to disk successfully or the
> +         * checkpoint never gets written, so crash recovery will set us right.
> +         *
> +         * XXX rename PGPROC variable later; keep it same now for clarity
> +         */
> +        MyPgXact->inCommit = true;
> +        smgr_buffer_hint(buffer);
> +
>          LockBufHdr(bufHdr);
>          Assert(bufHdr->refcount > 0);
>          if (!(bufHdr->flags & BM_DIRTY))
> @@ -2351,6 +2395,7 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
>          }
>          bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
>          UnlockBufHdr(bufHdr);
> +        MyPgXact->inCommit = false;
>      }
>  }
>  
> diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
> index 096d36a..a220310 100644
> --- a/src/backend/storage/buffer/localbuf.c
> +++ b/src/backend/storage/buffer/localbuf.c
> @@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
>          /* Find smgr relation for buffer */
>          oreln = smgropen(bufHdr->tag.rnode, MyBackendId);
>  
> +        /* XXX do we want to write checksums for local buffers? An option? */
> +

Don't we have that, now that it happens in mdwrite()?

I can see value in an option to exclude local buffers, since corruption there
may be less exciting.  It doesn't seem important for an initial patch, though.

> --- a/src/backend/storage/page/bufpage.c
> +++ b/src/backend/storage/page/bufpage.c

> + * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22.

I get a 404 for that link.

> --- a/src/include/storage/bufpage.h
> +++ b/src/include/storage/bufpage.h

> -#define PD_VALID_FLAG_BITS    0x0007        /* OR of all valid pd_flags bits */
> +#define PD_VALID_FLAG_BITS    0x800F        /* OR of all non-checksum pd_flags bits */

The comment is not consistent with the character of the value.

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


pgsql-hackers by date:

Previous
From: "David E. Wheeler"
Date:
Subject: Re: Inline Extension
Next
From: Dan Langille
Date:
Subject: PGCon 2012 Call for Papers - reminder