Re: logical changeset generation v6.2 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: logical changeset generation v6.2
Date
Msg-id 20131018182616.GC16188@awork2.anarazel.de
Whole thread Raw
In response to Re: logical changeset generation v6.2  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: logical changeset generation v6.2
Re: logical changeset generation v6.2
Re: logical changeset generation v6.2
List pgsql-hackers
On 2013-10-14 09:36:03 -0400, Robert Haas wrote:
> > I thought and implemented that in the beginning. Unfortunately it's not
> > enough :(. That's probably the issue that took me longest to understand
> > in this patchseries...
> >
> > Combocids can only fix the case where a transaction actually has create
> > a combocid:
> >
> > 1) TX1: INSERT id = 1 at 0/1: (xmin = 1, xmax=Invalid, cmin = 55, cmax = Invalid)
> > 2) TX2: DELETE id = 1 at 0/1: (xmin = 1, xmax=2, cmin = Invalid, cmax = 1)
> >
> > So, if we're decoding data that needs to lookup those rows in TX1 or TX2
> > we both times need access to cmin and cmax, but neither transaction will
> > have created a multixact. That can only be an issue in transaction with
> > catalog modifications.
> 
> Oh, yuck.  So that means you have to write an extra WAL record for
> EVERY heap insert, update, or delete to a catalog table?  OUCH.

So. As it turns out that solution isn't sufficient in the face of VACUUM
FULL and mixed DML/DDL transaction that have not yet been decoded.

To reiterate, as published it works like:
For every modification of catalog tuple (insert, multi_insert, update,
delete) that has influence over visibility issue a record that contains:
* filenode
* ctid
* (cmin, cmax)

When doing a visibility check on a catalog row during decoding of mixed
DML/DDL transaction lookup (cmin, cmax) for that row since we don't
store both for the tuple.

That mostly works great.

The problematic scenario is decoding a transaction that has done mixed
DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The VACUUM
FULL obviously changes the filenode and the ctid of a tuple, so we
cannot successfully do a lookup based on what we logged before.

I know of the following solutions:
1) Don't allow VACUUM FULL on catalog tables if wal_level = logical.
2) Make VACUUM FULL prevent DDL and then wait till all changestreams  have decoded up to the current point.
3) don't delete the old relfilenode for VACUUM/CLUSTERs of system tables  if there are life decoding slots around,
insteaddelegate that  responsibility to the slot management.
 
4) Store both (cmin, cmax) for catalog tuples.

I bascially think only 1) and 4) are realistic. And 1) sucks.

I've developed a prototype for 4) and except currently being incredibly
ugly, it seems to be the most promising approach by far. My trick to
store both cmin and cmax is to store cmax in t_hoff managed space when
wal_level = logical.
That even works when changing wal_level from < logical to logical
because only ever need to store both cmin and cmax for transactions that
have decodeable content - which they cannot yet have before wal_level =
logical.

This requires some not so nice things:
* A way to declare we're storing both. I've currently chosen HEAP_MOVED_OFF | HEAP_MOVED_IN. That sucks.
* A way for heap_form_tuple to know it should add the necessary space to t_hoff. I've added TupleDesc->tdhaswidecid for
it.
* Fiddling with existing checks for HEAP_MOVED{,OFF,IN} to check for both set at the same time.
* Changing the WAL logging to (optionally?) transport the current CommandId instead of always resetting it
InvalidCommandId.

The benefits are:
* Working VACUUM FULL
* Much simpler tqual.c logic, everything is stored in the row itself. No hash or something like that built.
* No more need to log (relfilenode, cmin, cmax) separately from heap changes itself anymore.

In the end, the costs are that individual catalog rows are 4 bytes
bigger iff wal_level = logical. That seems acceptable.

Some questions remain:
* Better idea for a flag than HEAP_MOVED_OFF | HEAP_MOVED_IN
* Should we just unconditionally log the current CommandId or make it conditional. We have plenty of flag space to
signalwhether it's present, but it's just 4 bytes.
 

Comments?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: libpgport vs libpgcommon
Next
From: Andres Freund
Date:
Subject: Re: logical changeset generation v6.4