Re: corrupt pages detected by enabling checksums - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: corrupt pages detected by enabling checksums |
Date | |
Msg-id | 20130404013900.GE19178@awork2.anarazel.de Whole thread Raw |
In response to | Re: corrupt pages detected by enabling checksums (Andres Freund <andres@2ndquadrant.com>) |
Responses |
Re: corrupt pages detected by enabling checksums
|
List | pgsql-hackers |
On 2013-04-04 02:58:43 +0200, Andres Freund wrote: > On 2013-04-03 20:45:51 -0400, Tom Lane wrote: > > andres@anarazel.de (Andres Freund) writes: > > > Looking at the page lsn's with dd I noticed something peculiar: > > > > > page 0: > > > 01 00 00 00 18 c2 00 31 => 1/3100C218 > > > page 1: > > > 01 00 00 00 80 44 01 31 => 1/31014480 > > > page 10: > > > 01 00 00 00 60 ce 05 31 => 1/3105ce60 > > > page 43: > > > 01 00 00 00 58 7a 16 31 => 1/31167a58 > > > page 44: > > > 01 00 00 00 f0 99 16 31 => 1/311699f0 > > > page 45: > > > 00 00 00 00 00 00 00 00 => 0/0 > > > page 90: > > > 01 00 00 00 90 17 1d 32 => 1/321d1790 > > > page 91: > > > 01 00 00 00 38 ef 1b 32 => 1/321bef38 > > > > > So we have written out pages that are after pages without a LSN that > > > have an LSN thats *beyond* the point XLOG has successfully been written > > > to disk (1/31169A38)? > > > > If you're looking into the FPIs, those would contain the page's older > > LSN, not the one assigned by the current WAL record. > > Nope, thats from the heap, and the LSNs are *newer* than what startup > recovered to. I am pretty sure by now we are missing out on valid WAL, I > am just not sure why. Ok, I think I see the bug. And I think its been introduced in the checkpoints patch. Remember, as explained in other mails, I am pretty sure the end of wal was errorneously detected, resulting in the following error: 27692 2013-04-03 10:09:15.647 PDT:LOG: incorrect resource manager data checksum in record at 1/31169A68 Not that without a *hardware* crash end of wal is normally found first by the checks for wrong LSNs or wrong record lengths. Not here. My theory is that a page that was full page written changed while it was inserted into the WAL which caused the error. A possibly scenario for that would e.g. two concurrent calls to MarkBufferDirtyHint(). Note:* 2. The caller might have only share-lock instead of exclusive-lock on the* buffer's content lock. if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) != (BM_DIRTY | BM_JUST_DIRTIED)){ if (DataChecksumsEnabled() &&(bufHdr->flags & BM_PERMANENT)) { MyPgXact->delayChkpt = delayChkpt = true; lsn = XLogSaveBufferForHint(buffer);/* PART 1 */ } } LockBufHdr(bufHdr); Assert(bufHdr->refcount > 0); if (!(bufHdr->flags & BM_DIRTY)) { dirtied = true; /* Means "will be dirtied by this action" */ /* * Set the page LSN if we wrote a backup block. We aren't * supposed to set this when only holdinga share lock but * as long as we serialise it somehow we're OK. We choose to * set LSN while holdingthe buffer header lock, which causes * any reader of an LSN who holds only a share lock to also *obtain a buffer header lock before using PageGetLSN(). * Fortunately, thats not too many places. * * If checksums are enabled, you might think we should reset the * checksum here. That will happen when the pageis written * sometime later in this checkpoint cycle. */ if (!XLogRecPtrIsInvalid(lsn)) PageSetLSN(page, lsn); /* PART 2 */ } So XLogSaveBufferForHint(buffer) is called for all buffers which are not yet dirty. Without any locks. It just does: XLogSaveBufferForHint(Buffer buffer) {rdata[0].data = (char *) (&watermark);rdata[0].len = sizeof(int);rdata[0].next = &(rdata[1]); rdata[1].data = NULL;rdata[1].len = 0;rdata[1].buffer = buffer;rdata[1].buffer_std = true;rdata[1].next = NULL; return XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata); } I think what happens is that that one backend grabs WALInsertLock first, it computes a full page write for the buffer, including a CRC. And returns a valid LSN. Then it continues towards the PageSetLSN() after LockBufHdr(). Allthewhile the second backend notices that WALInsertLock is free again and continues towards the WALInsertLock protected parts of XLogInsert(). Only that it already has done that part:/* * Calculate CRC of the data, including all the backup blocks * * Note that therecord header isn't added into the CRC initially since we * don't know the prev-link yet. Thus, the CRC will representthe CRC of * the whole record in the order: rdata, then backup blocks, then record * header. */INIT_CRC32(rdata_crc);for(rdt = rdata; rdt != NULL; rdt = rdt->next) COMP_CRC32(rdata_crc, rdt->data, rdt->len); It goes on to write all the data to the WAL after successfully getting WALInsertLock:while (write_len){ while (rdata->data == NULL) rdata = rdata->next; if (freespace > 0) { if (rdata->len > freespace) { memcpy(Insert->currpos, rdata->data, freespace); rdata->data += freespace; rdata->len -= freespace; write_len -= freespace; } else { memcpy(Insert->currpos, rdata->data, rdata->len); freespace -= rdata->len; write_len -= rdata->len; Insert->currpos += rdata->len; rdata = rdata->next; continue; } } /* Use next buffer */ updrqst = AdvanceXLInsertBuffer(false); curridx = Insert->curridx; /* Mark page headerto indicate this record continues on the page */ Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD; Insert->currpage->xlp_rem_len= write_len; freespace = INSERT_FREESPACE(Insert);} If by now the first backend has proceeded to PageSetLSN() we are writing different data to disk than the one we computed the checksum of before. Boom. I think the whole locking interactions in MarkBufferDirtyHint() need to be thought over pretty carefully. Does this explanation make sense? Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
pgsql-hackers by date: