Re: WAL logging problem in 9.4.3? - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: WAL logging problem in 9.4.3?
Date
Msg-id CANP8+jLWXr+uskrHOSdcKXj+rEDWFrKgfYuC4TNhCuv7Po+jbA@mail.gmail.com
Whole thread Raw
In response to Re: WAL logging problem in 9.4.3?  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
On 22 July 2015 at 17:21, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

When a WAL-skipping COPY begins, we add an entry for that relation in a "pending-fsyncs" hash table. Whenever we perform any action on a heap that would normally be WAL-logged, we check if the relation is in the hash table, and skip WAL-logging if so.

That was a simplified explanation. In reality, when WAL-skipping COPY begins, we also memorize the current size of the relation. Any actions on blocks greater than the old size are not WAL-logged, and any actions on smaller-numbered blocks are. This ensures that if you did any INSERTs on the table before the COPY, any new actions on the blocks that were already WAL-logged by the INSERT are also WAL-logged. And likewise if you perform any INSERTs after (or during, by trigger) the COPY, and they modify the new pages, those actions are not WAL-logged. So starting a WAL-skipping COPY splits the relation into two parts: the first part that is WAL-logged as usual, and the later part that is not WAL-logged. (there is one loose end marked with XXX in the patch on this, when one of the pages involved in a cold UPDATE is before the watermark and the other is after)

The actual fsync() has been moved to the end of transaction, as we are now skipping WAL-logging of any actions after the COPY as well.

And truncations complicate things further. If we emit a truncation WAL record in the transaction, we also make an entry in the hash table to record that. All operations on a relation that has been truncated must be WAL-logged as usual, because replaying the truncate record will destroy all data even if we fsync later. But we still optimize for "BEGIN; CREATE; COPY; TRUNCATE; COPY;" style patterns, because if we truncate a relation that has already been marked for fsync-at-COMMIT, we don't need to WAL-log the truncation either.


This is more invasive than I'd like to backpatch, but I think it's the simplest approach that works, and doesn't disable any of the important optimizations we have.

I didn't like it when I first read this, but I do now. As a by product of fixing the bug it actually extends the optimization.

You can optimize this approach so we always write WAL unless one of the two subid fields are set, so there is no need to call smgrIsSyncPending() every time. I couldn't see where this depended upon wal_level, but I guess its there somewhere.

I'm unhappy about the call during MarkBufferDirtyHint() which is just too costly. The only way to do this cheaply is to specifically mark buffers as being BM_WAL_SKIPPED, so they do not need to be hinted. That flag would be removed when we flush the buffers for the relation.
 

And what reason is there to think that this would fix all the problems?

I don't think either suggested fix could be claimed to be a great solution,
since there is little principle here, only heuristic. Heikki's solution
would be the only safe way, but is not backpatchable.

I can't get too excited about a half-fix that leaves you with data corruption in some scenarios.

On further consideration, it seems obvious that Andres' suggestion would not work for UPDATE or DELETE, so I now agree.

It does seem a big thing to backpatch; alternative suggestions?

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: extend pgbench expressions with functions
Next
From: Ildus Kurbangaliev
Date:
Subject: Re: RFC: replace pg_stat_activity.waiting with something more descriptive