Re: First draft of snapshot snapshot building design document - Mailing list pgsql-hackers

From Andres Freund
Subject Re: First draft of snapshot snapshot building design document
Date
Msg-id 201210181720.27616.andres@2ndquadrant.com
Whole thread Raw
In response to Re: First draft of snapshot snapshot building design document  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: First draft of snapshot snapshot building design document  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Thursday, October 18, 2012 04:47:12 PM Robert Haas wrote:
> On Tue, Oct 16, 2012 at 7:30 AM, Andres Freund <andres@2ndquadrant.com> 
wrote:
> > On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote:
> >> The design document [2] really just explains the problem (which is the
> >> need for catalog metadata at a point in time to make sense of heap
> >> tuples), without describing the solution that this patch offers with
> >> any degree of detail. Rather, [2] says "How we build snapshots is
> >> somewhat intricate and complicated and seems to be out of scope for
> >> this document", which is unsatisfactory. I look forward to reading the
> >> promised document that describes this mechanism in more detail.
> > 
> > Here's the first version of the promised document. I hope it answers most
> > of the questions.
> > 
> > Input welcome!
> 
> I haven't grokked all of this in its entirety, but I'm kind of
> uncomfortable with the relfilenode -> OID mapping stuff.  I'm
> wondering if we should, when logical replication is enabled, find a
> way to cram the table OID into the XLOG record.  It seems like that
> would simplify things.
> 
> If we don't choose to do that, it's worth noting that you actually
> need 16 bytes of data to generate a unique identifier for a relation,
> as in database OID + tablespace OID + relfilenode# + backend ID.
> Backend ID might be ignorable because WAL-based logical replication is
> going to ignore temporary relations anyway, but you definitely need
> the other two.  ...

Hm. I should take look at the way temporary tables are represented. As you say 
I is not going to matter for WAL decoding, but still...

> Another thing to think about is that, like catalog snapshots,
> relfilenode mappings have to be time-relativized; that is, you need to
> know what the mapping was at the proper point in the WAL sequence, not
> what it is now.  In practice, the risk here seems to be minimal,
> because it takes a while to churn through 4 billion OIDs.  However, I
> suspect it pays to think about this fairly carefully because if we do
> ever run into a situation where the OID counter wraps during a time
> period comparable to the replication lag, the bugs will be extremely
> difficult to debug.

I think with a rollbacks + restarts we might even be able to see the same 
relfilenode earlier.

> Anyhow, adding the table OID to the WAL header would chew up a few
> more bytes of WAL space, but it seems like it might be worth it to
> avoid having to think very hard about all of these issues.

I don't think its necessary to change wal logging here. The relfilenode mapping 
is now looked up using the timetravel snapshot we've built using (spcNode, 
relNode) as the key, so the time-relativized lookup is "builtin". If we screw 
that up way much more is broken anyway.

Two problems are left:

1) (reltablespace, relfilenode) is not unique in pg_class because InvalidOid is 
stored for relfilenode if its a shared or nailed table. That not a problem for 
the lookup because weve already checked the relmapper before that, so we never 
look those up anyway. But it violates documented requirements of syscache.c. 
Even after some looking I haven't found any problem that that could cause.

2) We need to decide whether a HEAP[1-2]_* record did catalog changes when 
building/updating snapshots. Unfortunately we also need to do this *before* we 
built the first snapshot. For now treating all tables as catalog modifying 
before we built the snapshot seems to work fine.
I think encoding the oid in the xlog header wouln't help all that much here, 
because I am pretty sure we want to have the set of "catalog tables" to be 
extensible at some point...


Greetings,

Andres
-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Bug in -c CLI option of pg_dump/pg_restore
Next
From: Peter Geoghegan
Date:
Subject: Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)