From: Andres Freund
Subject: Re: First draft of snapshot snapshot building design document
Msg-id: 201210200013.42694.andres@2ndquadrant.com
In response to: Re: First draft of snapshot snapshot building design document (Peter Geoghegan <peter@2ndquadrant.com>)
List: pgsql-hackers
Hi,

On Friday, October 19, 2012 10:53:06 PM Peter Geoghegan wrote:
> On 16 October 2012 12:30, Andres Freund <andres@2ndquadrant.com> wrote:
> > Here's the first version of the promised document. I hope it answers most
> > of the questions.
>
> This makes for interesting reading.

Thanks.


> Step 14 *is* kind-of relevant, because this is one place where we
> resolve relation OID based on relfilenode, which is part of
> snapbuild.c, since it has a LookupTableByRelFileNode() call (the only
> other such call is within snapbuild.c itself, as part of checking if a
> particular relation + xid have had catalogue changes, which can be a
> concern due to DDL, which is the basic problem that snapbuild.c seeks
> to solve). Assuming that it truly is necessary to have a
> LookupTableByRelFileNode() function, I don't think your plugin (which
> will soon actually be a contrib module, right?) has any business
> calling it.

That shouldn't be needed anymore; it was only there because I hadn't tied up
some loose ends quickly enough. The callback now gets passed the TupleDesc.
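
To illustrate, a purely hypothetical sketch of what a plugin callback can then
do with it; the callback name and argument list are made up for illustration
and are not the patchset's actual API:

    /* Hypothetical callback shape -- names and signature are illustrative only. */
    static void
    my_decode_change(StringInfo out, TupleDesc desc, HeapTuple newtuple)
    {
        Datum      *values = palloc(sizeof(Datum) * desc->natts);
        bool       *nulls = palloc(sizeof(bool) * desc->natts);

        /*
         * With the TupleDesc already provided, the plugin never has to resolve
         * a relfilenode back to a relation itself.
         */
        heap_deform_tuple(newtuple, desc, values, nulls);

        appendStringInfo(out, "change with %d columns", desc->natts);

        pfree(values);
        pfree(nulls);
    }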


> Thankfully, I *think* that this implies that
> plugins don't need to directly concern themselves with cache
> invalidation.

Correct. They don't need to know anything about it; it's all handled
transparently now.


> By the way, shouldn't this use InvalidOid?:

Yes. Fixed.


> allay some of those concerns.  It says:
> > == xmin Horizon Handling ==
> >
> > Reusing MVCC for timetravel access has one obvious major problem:
> > VACUUM. Obviously we cannot keep data in the catalog indefinitely. Also
> > obviously, we want autovacuum/manual vacuum to work as before.
> >
> > The idea here is to reuse the infrastructure built for
> > hot_standby_feedback, which allows us to keep the xmin horizon of a
> > walsender backend artificially low. We keep it low enough so we can
> > restart decoding from the last location the client has confirmed to be
> > safely received. That means that we keep it low enough to contain the
> > last checkpoint's oldestXid value.
>
> These ideas are still underdeveloped. For one thing, it seems kind of
> weak to me that we're obliged to have vacuum grind to a halt across
> the cluster because some speculative plugin has its proc's xmin pegged
> to some value due to a logical standby still needing it for reading
> catalogues only.   Think of the failure modes – what happens when the
> standby participating in a plugin-based logical replication system
> croaks for indeterminate reasons? Doing this with the WAL sender as
> part of hot_standby_feedback is clearly much less hazardous, because
> there *isn't* a WAL sender when streaming replication isn't active in
> respect of some corresponding standby, and hot_standby_feedback need
> only support deferring vacuum for the streaming replication standby
> whose active snapshot's xmin is furthest in the past.

> If a standby is taken out of commission for an hour, it can probably catch
> up without difficulty (difficulty like needing a base-backup), and it has no
> repercussions for the master or anyone else.

Only if you have set up an archive_command with sufficient retention. Otherwise
it all depends on the wal_keep_segments value.

> However, I believe that logical change records would be rendered meaningless
> in the same scenario, because VACUUM, having not seen a “pegged” xmin
> horizon due to the standby's unavailability, goes ahead and vacuums past a
> point still needed to keep the standby in a consistent state.

Well, that's why I want to persist them, similar to what twophase.c does.

> Maybe you can invent a new type of xmin horizon that applies only to
> catalogues so this isn't so bad

Yes, I thought about this. It seems like a very sensible optimization, but I
really, really would like to get the simpler version finished.

> and I see you've suggested as much in follow-up mail to Robert

> but it might be unsatisfactory to have that impact the PGXACT performance
> optimisation introduced in commit ed0b409d, or obfuscate that code.

Hm. I don't foresee any need to pessimize that. But I guess the burden of proof
lies on me to write up a patch for this.

> You could have the xmin be a special xmin, say through PGXACT.vacuumFlags
> indicating this, but wouldn't that necessarily preclude the backend from
> having a non-special notion of its xmin? How does that bode for this
> actually becoming maximally generic infrastructure?

Uhm. Why should the decoding backend *ever* need its own xmin? It will *never*
be allowed to do writes itself or anything similar.
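
For clarity, the variant being discussed would look roughly like the following;
the flag bit, the function name, and the catalog_xmin argument are invented for
illustration, and GetOldestXmin() would additionally have to learn to ignore
such backends except when computing the horizon for catalog tables. None of
this is in the patchset:

    /* Invented flag value, for illustration only -- not an existing PROC_* flag. */
    #define PROC_XMIN_CATALOG_ONLY      0x10

    /*
     * Pin this backend's xmin much like hot_standby_feedback does, but mark
     * it as only being relevant for catalog tables.
     */
    static void
    PinCatalogXmin(TransactionId catalog_xmin)
    {
        LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
        MyPgXact->xmin = catalog_xmin;                      /* oldest catalog rows still needed */
        MyPgXact->vacuumFlags |= PROC_XMIN_CATALOG_ONLY;    /* hypothetical marker */
        LWLockRelease(ProcArrayLock);
    }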

> ...
> You go on to say:
> > That also means we need to make that value persist across
> > restarts/crashes in a very similar manner to twophase.c's. That
> > infrastructure is actually also useful to make hot_standby_feedback work
> > properly across primary restarts.
>
> So here you're anticipating my criticism above about needing to peg the
> xmin horizon. That's fine, but what I still don't know is how you
> propose to make this work in a reasonable way, free from all the usual
> caveats about leaving prepared transactions open for an indefinitely
> long time (what happens when we approach XID wraparound?, how does the
> need to hold a transaction open interfere with a given plugin
> backend's ability to execute regular queries?, etc).

Well, it's not like it's introducing a problem that wasn't there before. I
really don't see this as something all that problematic.

If people protest we can add some tiny bit of code to autovacuum that throws
everything away that's older than some autovacuum_vacuum_freeze_max_age or
such. We could even do the same for prepared transactions.

But in the end the answer is that you need to monitor *any* replication system
carefully.

> Furthermore, I don't know how it's going to be independently useful to make
> hot_standby_feedback work across primary restarts, because the primary
> cannot then generate VACUUM cleanup WAL records that the standby might
> replay, resulting in a hard conflict. Maybe there's something
> incredibly obvious that I'm missing, but doesn't that problem almost
> take care of itself?

Autovacuum starts immediately after a restart. Once it has started some
workers, a reconnecting standby cannot lower the xmin again, so you have a high
likelihood of conflicts. I have seen that problem in the field (ironically
"fixed" by creating a prepared xact before restarting ...).

> You talk about the relfilenode/oid mapping problem some:
> > == Catalog/User Table Detection ==
> >
> > To detect whether a record/transaction does catalog modifications - which
> > we need to do for memory/performance reasons - we need to resolve the
> > RelFileNodes in xlog records back to the original tables. Unfortunately
> > RelFileNodes only contain the table's relfilenode, not its table oid.
> > We can only do catalog access once we have reached FULL_SNAPSHOT; before
> > that we can use some heuristics, but otherwise we have to assume that
> > every record changes the catalog.
>
> What exactly are the implications of having to assume catalogue
> changes?

Higher CPU/storage overhead, nothing else.

> I see that right now, you're just skipping actual decoding
> once you've taken care of snapshot building chores within
> SnapBuildCallback():
>
> +     if (snapstate->state == SNAPBUILD_START)
> +         return SNAPBUILD_SKIP;

When we're in SNAPBUILD_START state we don't have *any* knowledge about the
system yet, so we can't do anything with collected records anyway (very likely
the record we just read was part of an already started transaction). Once we're
in SNAPBUILD_FULL_SNAPSHOT state we can start collecting changes, but only for
transactions that started *after* we reached SNAPBUILD_FULL_SNAPSHOT.
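
As a rough sketch (not the patch's actual control flow; the xid comparison
field and the SNAPBUILD_DECODE return value are invented here, only the
SNAPBUILD_* states and SNAPBUILD_SKIP are taken from the quoted code), the
per-record decision looks something like this:

    if (snapstate->state == SNAPBUILD_START)
        return SNAPBUILD_SKIP;      /* no knowledge about running xacts yet */

    if (snapstate->state == SNAPBUILD_FULL_SNAPSHOT &&
        TransactionIdPrecedes(xid, snapstate->full_snapshot_xmin))
        return SNAPBUILD_SKIP;      /* xact began before the full snapshot existed */

    return SNAPBUILD_DECODE;        /* safe to hand the change to decoding */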

> > The heuristics we can use are:
> > * relfilenode->spcNode == GLOBALTABLESPACE_OID
> > * relfilenode->relNode <= FirstNormalObjectId
> > * RelationMapFilenodeToOid(relfilenode->relNode, false) != InvalidOid
> >
> > Those detect some catalog tables but not all (think VACUUM FULL), but if
> > they detect one they are correct.
>
> If the heuristics aren't going to be completely reliable, why is that
> acceptable? You've said it "seems to work fine", but I don't quite
> follow.

I should have left that out; it's a small internal optimization... If the above
heuristics detect that a relfilenode belongs to a catalog table, that detection
is guaranteed to be correct. They cannot detect all catalog changes though, so
you can only use them to skip work for catalog tables, not to skip work for
non-catalog tables.
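
Spelled out as code, the one-directional heuristic could look roughly like this
(the function name is made up; the three checks are the ones listed in the
quoted section):

    /*
     * Returns true only if the relfilenode provably belongs to a catalog
     * relation; false means "don't know", not "user table" (think of VACUUM
     * FULL having assigned a catalog a new, normal-range relfilenode).
     */
    static bool
    RelFileNodeIsDefinitelyCatalog(const RelFileNode *rnode)
    {
        /* shared catalogs live in the global tablespace */
        if (rnode->spcNode == GLOBALTABLESPACE_OID)
            return true;

        /* initdb-assigned relfilenodes belong to system objects */
        if (rnode->relNode <= FirstNormalObjectId)
            return true;

        /* mapped catalogs are known to the relation mapper */
        if (RelationMapFilenodeToOid(rnode->relNode, false) != InvalidOid)
            return true;

        return false;
    }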

> > After reaching FULL_SNAPSHOT we can do catalog access if our heuristics
> > tell us a table might not be a catalog table. For that we use the new
> > RELFILENODE syscache with (spcNode, relNode).
>
> I share some of Robert's concerns here. The fact that relfilenode
> mappings have to be time-relativised may tip the scales against this
> approach.

What's the problem with the time-relativized access?

> As Robert has said, it may be that simply adding the table
> OID to the WAL header is the way forward. It's not as if we can't
> optimise that later. One compromise might be to only do that when we
> haven't yet reached FULL_SNAPSHOT

When writing the WAL we don't have any knowledge about what state some
potential decoding process could be in, so it's an all-or-nothing thing.

I don't have a problem with writing the table oid into the records somewhere,
I just think it's not required. One reason for storing it in there, independent
of this patchset/feature, is debugging. I have wished for that in the past.

We need to build the snapshot to access the catalog anyway, so it's not like
doing the relfilenode lookup time-relativized incurs any additional cost. Also,
we need to do the table-oid lookup time-relativized as well, because table oids
can be reused.
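
For illustration, a sketch of what such a time-relativized lookup could look
like once the proposed RELFILENODE syscache exists; the snapshot setup/teardown
helpers are placeholders, since that interface isn't spelled out in the doc:

    /*
     * SetupDecodingSnapshot/TeardownDecodingSnapshot are placeholders for
     * "make catalog scans use the timetravel snapshot built by snapbuild.c".
     */
    Oid         reloid = InvalidOid;
    HeapTuple   tup;

    SetupDecodingSnapshot(snapshot);

    tup = SearchSysCache2(RELFILENODE,
                          ObjectIdGetDatum(rnode->spcNode),
                          ObjectIdGetDatum(rnode->relNode));
    if (HeapTupleIsValid(tup))
    {
        reloid = HeapTupleGetOid(tup);      /* pg_class rows carry their OID */
        ReleaseSysCache(tup);
    }

    TeardownDecodingSnapshot();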

> > To make that easier, every time a decoding process finds an online
> > checkpoint record it exclusively takes a global lwlock and checks whether
> > visibility information has already been written out for that
> > checkpoint, and writes it out if not. We only need to do that once, as
> > visibility information is the same between all decoding backends.
>
> Where and how would that visibility information be represented?

Some extra pg_* directory, like pg_decode/$LSN_OF_CHECKPOINT.

> So, typically you'd expect it to say "no catalogue changes for entire
> checkpoint" much of the time?

No, not really. It will probably look similar to the files ExportSnapshot
currently produces. Even if no catalog changes happened we still need to keep
knowledge about committed transactions and such.

Btw, I doubt all that many checkpoint<->checkpoint windows will have
absolutely no catalog changes. At least some pg_class.relfrozenxid,
pg_class.reltuples changes are to be expected.
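
To make the pg_decode/$LSN_OF_CHECKPOINT idea a bit more concrete, here is a
purely illustrative guess at what such a serialized state file might contain;
the struct and field names are invented, and the real format would presumably
mirror what ExportSnapshot writes:

    /* Invented layout -- illustration only. */
    typedef struct SnapBuildOnDisk
    {
        uint32          magic;              /* file format identifier */
        XLogRecPtr      checkpoint_lsn;     /* checkpoint this state was written for */
        TransactionId   xmin;               /* oldest xid whose catalog rows we still need */
        TransactionId   xmax;               /* first xid not yet known to have completed */
        uint32          xcnt;               /* number of committed catalog-modifying xids */
        /* TransactionId xip[xcnt] follows, then a CRC over the whole file */
    } SnapBuildOnDisk;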

> The patch we've seen
> (0008-Introduce-wal-decoding-via-catalog-timetravel.patch [4]) doesn't
> address the question of transactions that contain DDL:
> ...
> but you do address this question (or a closely related question, which, I
> gather, is the crux of the issue: how to mix DDL and DML in transactions?)
> in the new doc (README.SNAPBUILD.txt [6]):
> > == mixed DDL/DML transaction handling  ==
> >
> > When a transaction uses DDL and DML in the same transaction things get a
> > bit more complicated because we need to handle CommandIds and ComboCids
> > as we need to use the correct version of the catalog when decoding the
> > individual tuples.
>
> Right, so it becomes necessary to think about time-travelling not just
> to a particular transaction, but to a particular point in a particular
> transaction – the exact point at which the catalogue showed a
> structure consistent with sanely interpreting logical WAL records
> created during that window after the last DDL command (if any), but
> before the next (if any). This intelligence is only actually needed
> when decoding tuples created in that actual transaction, because only
> those tuples have their format change throughout a single transaction.

Exactly.

> > CommandId handling itself is relatively simple, we can figure out the
> > current CommandId relatively easily by looking at the currently used one
> > in changes. The problematic part is that those CommandIds frequently will
> > not be actual cmin or cmax values but ComboCids. Those are used to
> > minimize space in the heap. During normal operation cmin/cmax values are
> > only used within the backend emitting those rows and only during one
> > toplevel transaction, so instead of storing cmin/cmax only a reference
> > to an in-memory value is stored that contains both. Whenever we see a
> > new CommandId we call
> > ApplyCacheAddNewCommandId.
>
> Right. So in general, transaction A doesn't have to concern itself
> with the order that other transactions had tuples become visible or
> invisible (cmin and cmax); transaction A need only concern itself with
> whether they're visible or invisible based on if relevant transactions
> (xids) committed, its own xid, plus each tuple's xmin and xmax.  It is
> this property of cmin/cmax that enabled the combocid optimisation in
> 8.3, which introduces an array in *backend local* memory, to map a
> single HeapTupleHeader field (where previously there were 2 – cmin and
> cmax) to an entry in that array, under the theory that it's unusual
> for a tuple to be created and then deleted in the same transaction.
> Most of the time, that one HeapTupleHeader field wouldn't have a
> mapping to the local array – rather, it would simply have a cmin or a
> cmax. That's how we save heap space.

Yes. The whole handling here is nearly completely analogous to the normal
handling of CommandIds.

> > To resolve this problem during heap_* whenever we generate a new combocid
> > (detected via a new parameter to HeapTupleHeaderAdjustCmax) in a catalog
> > table we log the new XLOG_HEAP2_NEW_COMBOCID record containing the
> > mapping. During decoding this ComboCid is added to the applycache
> > (ApplyCacheAddNewComboCid). They are only guaranteed to be visible within
> > a single transaction, so we cannot simply set up all of them globally.
>
> This seems more or less reasonable. The fact that the combocid
> optimisation uses a local memory array isn't actually an important
> property of combocids as a performance optimisation

It is an important property here because ComboCids from different toplevel
transactions conflict with each other, which means we have to deal with them
on a per-toplevel-xid basis.

> For ease of implementation, but also because real combocids are
> expected to be needed infrequently, I suggest that rather than logging
> the mapping, you log the values directly in a record (i.e. The full
> cmin and cmax mapped to the catalogue + catalogue tuple's ctid). You
> could easily exhaust the combocid space otherwise, and besides, you
> cannot do anything with the mapping from outside of the backend that
> originated the combocid for that transaction (you don't have the local
> array, or the local hashtable used for combocids).

I can't really follow you here. Obviously we need to generate the
XLOG_HEAP2_NEW_COMBOCID record locally, in the transaction/backend that
generated the change?
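
For concreteness, a hedged guess at what the payload of such an
XLOG_HEAP2_NEW_COMBOCID record might carry, based purely on the description
quoted above (struct and field names are invented, not taken from the patch):

    typedef struct xl_heap_new_combocid
    {
        TransactionId   top_xid;        /* toplevel transaction the mapping belongs to */
        CommandId       combocid;       /* value actually stored in the tuple header */
        CommandId       cmin;           /* real cmin behind the combocid */
        CommandId       cmax;           /* real cmax behind the combocid */
        RelFileNode     node;           /* catalog relation containing the tuple */
        ItemPointerData tid;            /* which tuple the mapping applies to */
    } xl_heap_new_combocid;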

> > Before calling the output plugin ComboCids are temporarily setup and torn
> > down afterwards.
>
> How? You're using a combocid-like array + hashtable local to the
> plugin backend?

I added

    extern void PutComboCommandId(CommandId combocid, CommandId cmin, CommandId cmax);

which, in combination with the existing

    extern void AtEOXact_ComboCid(void);

is enough.
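
In other words, around each output-plugin invocation something along these
lines happens; the list of collected mappings and its element type are
hypothetical, only PutComboCommandId and AtEOXact_ComboCid are the entry points
mentioned above:

    ListCell   *lc;

    /* txn->combocids: mappings collected earlier via ApplyCacheAddNewComboCid */
    foreach(lc, txn->combocids)
    {
        DecodedComboCid *cc = (DecodedComboCid *) lfirst(lc);

        /* recreate the backend-local entry under its original combocid value */
        PutComboCommandId(cc->combocid, cc->cmin, cc->cmax);
    }

    /* ... invoke the output plugin; HEAP_COMBOCID tuples now resolve ... */

    AtEOXact_ComboCid();        /* throw the temporary mappings away again */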


> Anyway, for now, this is unimplemented, which is perhaps the biggest
> concern about it:
>
> +     /* check if its one of our txids, toplevel is also in there */
> +     else if (TransactionIdInArray(xmin, snapshot->subxip,
> +                                   snapshot->subxcnt))
> +     {
> +         CommandId cmin = HeapTupleHeaderGetRawCommandId(tuple);
> +         /* no support for that yet */
> +         if (tuple->t_infomask & HEAP_COMBOCID){
> +             elog(WARNING, "combocids not yet supported");
> +             return false;
> +         }
> +         if (cmin >= snapshot->curcid)
> +             return false;    /* inserted after scan started */
> +     }
>
> Above, you aren't taking this into account (code from
> HeapTupleHeaderGetCmax()):
>
> /* We do not store cmax when locking a tuple */
> Assert(!(tup->t_infomask & (HEAP_MOVED | HEAP_IS_LOCKED)));
>
> Sure, you're only interested in cmin, so this doesn't look like it
> applies (isn't this just a sanity check?), but actually, based on this
> it seems to me that the current sane representation of cmin needs to
> be obtained in a more concurrency aware fashion - having the backend
> local data-structures that originate the tuple isn't even good enough.

You completely lost me here and in the following paragraphs. The infomask is
available to everyone, and we only read/write cmin|cmax|combocid when we're
inside the transaction or when we have already logged a HEAP2_NEW_COMBOCID
record and thus have the necessary information.
Which concurrency concerns are you referring to?

>  I have an idea that the  HeapTupleHeaderGetRawCommandId(tuple) call in your
>  code could well be bogus even when (t_infomask & HEAP_COMBOCID) == 0.

Ah? The other .satisfies routines do HeapTupleHeaderGetCmin(tuple), which
returns exactly that if !(tup->t_infomask & HEAP_COMBOCID).

But anyway, with the new combocid handling the code uses the usual
HeapTupleHeaderGetCmin/Cmax calls, so it looks even more like the normal
routines.
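
For reference, this is roughly what combocid.c's HeapTupleHeaderGetCmin does
(paraphrased and slightly simplified; see the source for the exact assertions):

    CommandId
    HeapTupleHeaderGetCmin(HeapTupleHeader tup)
    {
        CommandId   cid = HeapTupleHeaderGetRawCommandId(tup);

        if (tup->t_infomask & HEAP_COMBOCID)
            return GetRealCmin(cid);    /* backend-local combocid -> cmin lookup */

        return cid;                     /* the raw command id already is cmin */
    }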

Thanks for the extensive review! I am pretty sure this is a lot to take in ;).

Greetings,

Andres

--
Andres Freund        http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


