Re: corrupt pages detected by enabling checksums - Mailing list pgsql-hackers

From Andres Freund
Subject Re: corrupt pages detected by enabling checksums
Date
Msg-id 20130404013900.GE19178@awork2.anarazel.de
Whole thread Raw
In response to Re: corrupt pages detected by enabling checksums  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: corrupt pages detected by enabling checksums
List pgsql-hackers
On 2013-04-04 02:58:43 +0200, Andres Freund wrote:
> On 2013-04-03 20:45:51 -0400, Tom Lane wrote:
> > andres@anarazel.de (Andres Freund) writes:
> > > Looking at the page lsn's with dd I noticed something peculiar:
> >
> > > page 0:
> > > 01 00 00 00 18 c2 00 31 => 1/3100C218
> > > page 1:
> > > 01 00 00 00 80 44 01 31 => 1/31014480
> > > page 10:
> > > 01 00 00 00 60 ce 05 31 => 1/3105ce60
> > > page 43:
> > > 01 00 00 00 58 7a 16 31 => 1/31167a58
> > > page 44:
> > > 01 00 00 00 f0 99 16 31 => 1/311699f0
> > > page 45:
> > > 00 00 00 00 00 00 00 00 => 0/0
> > > page 90:
> > > 01 00 00 00 90 17 1d 32 => 1/321d1790
> > > page 91:
> > > 01 00 00 00 38 ef 1b 32 => 1/321bef38
> >
> > > So we have written out pages that are after pages without a LSN that
> > > have an LSN thats *beyond* the point XLOG has successfully been written
> > > to disk (1/31169A38)?
> >
> > If you're looking into the FPIs, those would contain the page's older
> > LSN, not the one assigned by the current WAL record.
>
> Nope, thats from the heap, and the LSNs are *newer* than what startup
> recovered to. I am pretty sure by now we are missing out on valid WAL, I
> am just not sure why.

Ok, I think I see the bug. And I think its been introduced in the
checkpoints patch.

Remember, as explained in other mails, I am pretty sure the end of wal
was errorneously detected, resulting in the following error:
27692  2013-04-03 10:09:15.647 PDT:LOG:  incorrect resource manager data checksum in record at 1/31169A68

Not that without a *hardware* crash end of wal is normally found first
by the checks for wrong LSNs or wrong record lengths. Not here.

My theory is that a page that was full page written changed while it was
inserted into the WAL which caused the error. A possibly scenario for
that would e.g. two concurrent calls to MarkBufferDirtyHint().
Note:* 2. The caller might have only share-lock instead of exclusive-lock on the*    buffer's content lock.
if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=    (BM_DIRTY | BM_JUST_DIRTIED)){    if (DataChecksumsEnabled()
&&(bufHdr->flags & BM_PERMANENT))    {        MyPgXact->delayChkpt = delayChkpt = true;        lsn =
XLogSaveBufferForHint(buffer);/* PART 1 */    }    }
 
    LockBufHdr(bufHdr);    Assert(bufHdr->refcount > 0);
    if (!(bufHdr->flags & BM_DIRTY))    {        dirtied = true;        /* Means "will be dirtied by this action" */
        /*         * Set the page LSN if we wrote a backup block. We aren't         * supposed to set this when only
holdinga share lock but         * as long as we serialise it somehow we're OK. We choose to         * set LSN while
holdingthe buffer header lock, which causes         * any reader of an LSN who holds only a share lock to also
*obtain a buffer header lock before using PageGetLSN().         * Fortunately, thats not too many places.         *
   * If checksums are enabled, you might think we should reset the         * checksum here. That will happen when the
pageis written         * sometime later in this checkpoint cycle.         */        if (!XLogRecPtrIsInvalid(lsn))
     PageSetLSN(page, lsn); /* PART 2 */    }
 

So XLogSaveBufferForHint(buffer) is called for all buffers which are not
yet dirty. Without any locks. It just does:

XLogSaveBufferForHint(Buffer buffer)
{rdata[0].data = (char *) (&watermark);rdata[0].len = sizeof(int);rdata[0].next = &(rdata[1]);
rdata[1].data = NULL;rdata[1].len = 0;rdata[1].buffer = buffer;rdata[1].buffer_std = true;rdata[1].next = NULL;
return XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
}

I think what happens is that that one backend grabs WALInsertLock first,
it computes a full page write for the buffer, including a CRC. And
returns a valid LSN. Then it continues towards the PageSetLSN() after
LockBufHdr().
Allthewhile the second backend notices that WALInsertLock is free again
and continues towards the WALInsertLock protected parts of XLogInsert().

Only that it already has done that part:/* * Calculate CRC of the data, including all the backup blocks * * Note that
therecord header isn't added into the CRC initially since we * don't know the prev-link yet.  Thus, the CRC will
representthe CRC of * the whole record in the order: rdata, then backup blocks, then record * header.
*/INIT_CRC32(rdata_crc);for(rdt = rdata; rdt != NULL; rdt = rdt->next)    COMP_CRC32(rdata_crc, rdt->data, rdt->len);
 

It goes on to write all the data to the WAL after successfully getting
WALInsertLock:while (write_len){    while (rdata->data == NULL)        rdata = rdata->next;
    if (freespace > 0)    {        if (rdata->len > freespace)        {            memcpy(Insert->currpos, rdata->data,
freespace);           rdata->data += freespace;            rdata->len -= freespace;            write_len -= freespace;
     }        else        {            memcpy(Insert->currpos, rdata->data, rdata->len);            freespace -=
rdata->len;           write_len -= rdata->len;            Insert->currpos += rdata->len;            rdata =
rdata->next;           continue;        }    }
 
    /* Use next buffer */    updrqst = AdvanceXLInsertBuffer(false);    curridx = Insert->curridx;    /* Mark page
headerto indicate this record continues on the page */    Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
Insert->currpage->xlp_rem_len= write_len;    freespace = INSERT_FREESPACE(Insert);}
 

If by now the first backend has proceeded to PageSetLSN() we are writing
different data to disk than the one we computed the checksum of
before. Boom.

I think the whole locking interactions in MarkBufferDirtyHint() need to
be thought over pretty carefully.

Does this explanation make sense?

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



pgsql-hackers by date:

Previous
From: Ants Aasma
Date:
Subject: Re: Page replacement algorithm in buffer cache
Next
From: Greg Stark
Date:
Subject: Re: Multi-pass planner