Re: [HACKERS] WAL logging problem in 9.4.3? - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [HACKERS] WAL logging problem in 9.4.3?
Date
Msg-id CA+TgmoYzTPpxvPUUs_ALW1YYNxV6mCrj6zwqnaN6FqdNc1JUzw@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] WAL logging problem in 9.4.3?  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: [HACKERS] WAL logging problem in 9.4.3?  (Noah Misch <noah@leadboat.com>)
Re: [HACKERS] WAL logging problem in 9.4.3?  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Re: [HACKERS] WAL logging problem in 9.4.3?  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List pgsql-hackers
On Fri, Oct 25, 2019 at 9:21 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> This is the fixed verison v22.

I'd like to offer a few thoughts on this thread and on these patches,
which is now more than 4 years old and more than 150 messages in
length.

First, I'd like to restate my understanding of the problem just to see
whether I've got the right idea and whether we're all on the same
page. When wal_level=minimal, we sometimes try to skip WAL logging on
newly-created relations in favor of fsync-ing the relation at commit
time. The idea is that if the transaction aborts or is aborted by a
crash, the contents of the relation don't need to be reproduced
because they are irrelevant, so no WAL is needed, and if the
transaction commits we can't lose any data on a crash because we've
already fsync'd, and standbys don't matter because wal_level=minimal
precludes having any. However, we're not entirely consistent about
skipping WAL-logging: some operations do and others don't, and this
causes confusion if a crash occurs, because we might try to replay
some of the things that happened to that relation but not all of them.
For example, the original poster complained about a sequence of steps
where an index truncation was logged but subsequent index insertions
were not; a badly-timed crash will replay the truncation but can't
replay the index insertions because they weren't logged in the first
place; consequently, while the state was actually OK at the beginning
of replay, it's no longer OK by the end. Replaying nothing would've
been OK, but replaying some things and not others isn't.

Second, for anyone who is not following this thread closely but is
interested in a summary, I'd like to summarize how I believe that the
current patch proposes to solve the problem. As I understand it, the
approach taken by the patch is to try to change things so that we log
nothing at all for relations created or truncated in the current
top-level transaction, and everything for others. To achieve this, the
patch makes a number of changes, three of which seem to me to be
particularly key. One, the patch changes the relcache infrastructure
with the goal of making it possible to reliably identify whether a
relation has been created or truncated in the current toplevel
transaction; our current code does have tracking for this, but it's
not 100% accurate. Two, the patch changes the definition of
RelationNeedsWAL() so that it not only checks that the relation is a
permanent one, but also that either wal_level != minimal or the
relation was not created in the current transaction. It seems to me
that if RelationNeedsWAL() is used to gate every test for whether or
not to write WAL pertaining to a particular relation, this ought to
achieve the desired behavior of logging either everything or nothing.
It is not quite clear to me how we can be sure that we use that in
every relevant place. Three, the patch replaces the various ad-hoc
bits of code which fsync relations which perform unlogged operations
on permanent relations with a new tracking mechanism that arranges to
perform all of the relevant fsync() calls at commit time. This is
further augmented with a mechanism that instead logs all the relation
pages in lieu of fsync()ing if the relation is very small, on the
theory that logging a few FPIs will be cheaper than an fsync(). I view
this additional mechanism as perhaps a bit much for a bug fix patch,
but I understand that the goal is to prevent a performance regression,
and it's not really over the top, so I think it's probably OK.

Third, I'd like to offer a couple of general comments on the state of
these patches. Broadly, I think they look pretty good. They seem quite
well-engineered to me and as far as I can see the overall approach is
sound. I think there are a number of places where the comments could
be better; I'll include a few points on that further down. I also
think that the code in swap_relation_files() which takes ExclusiveLock
on the relations looks quite strange. It's hard to understand why it's
needed at all, or why that lock level is used. On the flip side, I
think that the test suite looks really impressive and should be of
considerable help not only in making sure that this is fixed but
detecting if it gets broken again in the future. Perhaps it doesn't
cover every scenario we care about, but if that turns out to be the
case, it seems like it would be easily to further generalize. I really
like the idea of this *kind* of test framework.

Comments on comments, and other nitpicking:

- in-trasaction is mis-spelled in the doc patch. accidentially is
mis-spelled in the 0002 patch.
- I think the header comment for the new TAP test could do a far
better job explaining the overall goal of this testing than it
actually does.
- I think somewhere in relcache.c or rel.h there ought to be comments
explaining the precise degree to which rd_createSubid,
rd_newRelfilenodeSubid, and rd_firstRelfilenodeSubid are reliable,
including problem scenarios. This patch removes some language of this
sort from CopyFrom(), which was a funny place to have that information
in the first place, but I don't see that it adds anything to replace
it. I also think that we ought to explain - for the fields that are
reliable - that they need to be reliable precisely for the purpose of
not breaking this stuff. There's a bit of this right now:

+ * rd_firstRelfilenodeSubid is the ID of the first subtransaction the
+ * relfilenode change has took place in the current transaction. Unlike
+ * newRelfilenodeSubid, this won't be accidentially forgotten. A valid OID
+ * means that the currently active relfilenode is transaction-local and we
+ * sync the relation at commit instead of WAL-logging.

...but I think that needs to be somewhat expanded and clarified.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: Why overhead of SPI is so large?
Next
From: Fabien COELHO
Date:
Subject: Re: pgbench - extend initialization phase control