Re: corrupt pages detected by enabling checksums - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: corrupt pages detected by enabling checksums
Date
Msg-id CA+U5nMKwguCMugQEDzoyx__Ug3TUcRKA_MejkkoJLkc0hS78gg@mail.gmail.com
Whole thread Raw
In response to Re: corrupt pages detected by enabling checksums  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: corrupt pages detected by enabling checksums  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 1 May 2013 16:33, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 1, 2013 at 11:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I was worried because SyncOneBuffer checks whether it needs writing
>>> without taking a content lock, so the exclusive lock doesn't help. That
>>> makes sense, because you don't want a checkpoint to have to get a
>>> content lock on every buffer in the buffer pool. But it also means we
>>> need to follow the rules laid out in transam/README and dirty the pages
>>> before writing WAL.
>>
>> On further review, I think you're right about this.  We'd have a
>> problem if the following sequence of events were to occur:
>>
>> 1. vacuum writes the WAL record, but does not yet mark the buffer
>> dirty, and then goes into the tank
>> 2. checkpointer decides where the checkpoint will begin
>> 3. buffer pool is scanned as part of the checkpoint process, observing
>> target buffer not to be dirty
>> 4. vacuum finally wakes up and marks the buffer dirty
>> 5. crash, after WAL is flushed but before buffer is written
>>
>> However, on looking at this a little more, shouldn't we be doing
>> log_heap_clean() *before* we do visibilitymap_set()?
>
> Hit send too soon.
>
> To finish that thought: right now the order of operations is...
>
> 1. Perform the cleanup operations on the buffer.
> 2. Set the visibility map bit.
> 3. Log the visibility map change.
> 4. Log the cleanup operations.
>
> It seems to me that it would be better to do (4) as close to (1) as
> possible.  Isn't it bad if the operations are replayed in an order
> that differs from the order in which they were performed - or if, say,
> the first WAL record were replayed but the second were not, for
> whatever reason?

I agree, but that was in the original coding wasn't it?

Why aren't we writing just one WAL record for this action? We use a
single WAL record in other places where we make changes to multiple
blocks with multiple full page writes, e.g. index block split. That
would make the action atomic and we'd just have this...

1. Perform the cleanup operations on the buffer.
2. Set the visibility map bit.
3. Log the cleanup operations and visibility map change.

which can then be replayed with correct sequence, locking etc.
and AFAICS would likely be faster also.

--Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



pgsql-hackers by date:

Previous
From: Greg Smith
Date:
Subject: Re: [PATCH] add --throttle to pgbench (submission 3)
Next
From: Stefan Kaltenbrunner
Date:
Subject: Re: Re: [BUGS] BUG #8128: pg_dump (>= 9.1) failed while dumping a scheme named "old" from PostgreSQL 8.4