Re: logical changeset generation v6.2 - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: logical changeset generation v6.2
Msg-id: CA+TgmoZOiUiPcKsrgiftEqsqN0bKuZLVrsLyAwkcwv5je3_SQQ@mail.gmail.com
In response to: Re: logical changeset generation v6.2 (Andres Freund <andres@2ndquadrant.com>)
List: pgsql-hackers
On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> I know of the following solutions:
> 1) Don't allow VACUUM FULL on catalog tables if wal_level = logical.
> 2) Make VACUUM FULL prevent DDL and then wait till all changestreams
>    have decoded up to the current point.
> 3) Don't delete the old relfilenode for VACUUM/CLUSTERs of system tables
>    if there are live decoding slots around; instead, delegate that
>    responsibility to the slot management.
> 4) Store both (cmin, cmax) for catalog tuples.
>
> I basically think only 1) and 4) are realistic. And 1) sucks.
>
> I've developed a prototype for 4) and except currently being incredibly
> ugly, it seems to be the most promising approach by far. My trick to
> store both cmin and cmax is to store cmax in t_hoff managed space when
> wal_level = logical.
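
Concretely, that scheme might look something like the sketch below (all names here are invented for illustration; the real HeapTupleHeaderData layout and t_hoff handling differ):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t CommandId;

/* Simplified stand-in for PostgreSQL's tuple header; every name is
 * invented.  Normally cmin and cmax share the single t_cid slot
 * (disambiguated via combo CIDs).  The proposal: when wal_level =
 * logical, reserve extra header-managed space (reflected in t_hoff)
 * so catalog tuples can carry a separate cmax as well. */
typedef struct
{
	CommandId	t_cid;			/* normally cmin OR cmax, never both */
	uint8_t		t_hoff;			/* offset to user data in payload[] */
	char		payload[32];	/* header-managed space, then data */
} SketchTuple;

/* With the extra space reserved, cmax lives just before user data. */
static void
sketch_set_cmax(SketchTuple *tup, CommandId cmax)
{
	memcpy(tup->payload + tup->t_hoff - sizeof(CommandId),
		   &cmax, sizeof(CommandId));
}

static CommandId
sketch_get_cmax(const SketchTuple *tup)
{
	CommandId	cmax;

	memcpy(&cmax, tup->payload + tup->t_hoff - sizeof(CommandId),
		   sizeof(CommandId));
	return cmax;
}
```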

In my opinion, (4) is too ugly to consider.  I think that if we start
playing games like this, we're opening up the doors to lots of subtle
bugs and future architectural pain that will be with us for many, many
years to come.  I believe we will bitterly regret any foray into this
area.

It has long seemed to me to be a shame that we don't have some system
for allowing old relfilenodes to stick around until they are no longer
in use.  If we had that, we might be able to allow utilities like
CLUSTER or VACUUM FULL to permit concurrent read access to the table.
I realize that what people really want is to let those things run
while allowing concurrent *write* access to the table, but a bird in
the hand is worth two in the bush.  What we're really talking about
here is applying MVCC to filesystem actions: instead of removing the
old relfilenode(s) immediately, we do it when they're no longer
referenced by anyone, just as we don't remove old tuples immediately,
but rather when they are no longer referenced by anyone.  The details
are tricky, though: we can allow write access to the *new* heap just
as soon as the rewrite is finished, but anyone who is still looking at
the *old* heap can't ever upgrade their AccessShareLock to anything
higher, or hilarity will ensue.  Also, if they lock some *other*
relation and call AcceptInvalidationMessages(), their relcache entry for
the rewritten relation will get rebuilt, and that's bound to work out
poorly.  The net-net here is that I think (3) is an attractive
solution, but I don't know that we can make it work in a reasonable
amount of time.
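
To make the analogy concrete, a minimal sketch of that deferred-removal idea (all names invented; the real bookkeeping would live in shared memory with proper locking):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of "MVCC for filesystem actions".  Instead of
 * unlinking the old relfilenode when the rewrite commits, keep a
 * reference count of backends still reading the old heap and unlink
 * only when the last one lets go, mirroring how old tuple versions
 * are reclaimed only once no snapshot can see them. */
typedef struct
{
	int		refcount;		/* backends still scanning the old file */
	bool	unlinked;		/* has the old file been removed yet? */
} OldRelFileNode;

static void
old_node_pin(OldRelFileNode *node)
{
	node->refcount++;
}

static void
old_node_unpin(OldRelFileNode *node)
{
	if (--node->refcount == 0 && !node->unlinked)
		node->unlinked = true;	/* stands in for unlink()ing the file */
}
```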

I don't think I understand exactly what you have in mind for (2); can
you elaborate?  I have always thought that having a
WaitForDecodingToCatchUp() primitive was a good way of handling
changes that were otherwise too difficult to track our way through.  I
am not sure you're doing that at all right now, which in some sense I
guess is fine, but I haven't really understood your aversion to this
solution.  There are some locking issues to be worked out here, but
the problems don't seem altogether intractable.
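
The core check such a primitive needs is simple enough; a sketch (names invented, and a real implementation would sleep on a latch rather than poll):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Hypothetical heart of a WaitForDecodingToCatchUp() primitive:
 * report whether every live decoding slot has confirmed replay of
 * WAL up to the given target position.  The caller would block DDL,
 * then loop (or wait on a latch) until this returns true. */
static int
all_slots_caught_up(const XLogRecPtr *slot_lsns, int nslots,
					XLogRecPtr target)
{
	for (int i = 0; i < nslots; i++)
	{
		if (slot_lsns[i] < target)
			return 0;			/* this slot still lags the target */
	}
	return 1;
}
```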

(1) is basically deciding not to fix the problem.  I don't think
that's acceptable.

I don't have another idea right at the moment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


