Re: logical changeset generation v6.2 - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: logical changeset generation v6.2
Msg-id: 20131021141558.GC2968@awork2.anarazel.de
In response to: Re: logical changeset generation v6.2 (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: logical changeset generation v6.2 (Andres Freund <andres@2ndquadrant.com>)
           Re: logical changeset generation v6.2 (Andres Freund <andres@2ndquadrant.com>)
List: pgsql-hackers
On 2013-10-21 09:32:12 -0400, Robert Haas wrote:
> On Fri, Oct 18, 2013 at 2:26 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I know of the following solutions:
> > 1) Don't allow VACUUM FULL on catalog tables if wal_level = logical.
> > 2) Make VACUUM FULL prevent DDL and then wait till all changestreams
> >    have decoded up to the current point.
> > 3) Don't delete the old relfilenode for VACUUM/CLUSTERs of system tables
> >    if there are live decoding slots around; instead delegate that
> >    responsibility to the slot management.
> > 4) Store both (cmin, cmax) for catalog tuples.
> >
> > I basically think only 1) and 4) are realistic. And 1) sucks.
> >
> > I've developed a prototype for 4) and except currently being incredibly
> > ugly, it seems to be the most promising approach by far. My trick to
> > store both cmin and cmax is to store cmax in t_hoff managed space when
> > wal_level = logical.
> 
> In my opinion, (4) is too ugly to consider.  I think that if we start
> playing games like this, we're opening up the doors to lots of subtle
> bugs and future architectural pain that will be with us for many, many
> years to come.  I believe we will bitterly regret any foray into this
> area.

Hm. After looking at the required code - which you obviously cannot have
seen yet - it's not actually too bad. Will post a patch implementing it later.

I don't really buy the architectural argument since originally cmin/cmax
*were* both stored. It's not something we're just inventing now. We just
optimized that away but now have discovered that's not always a good
idea and thus don't always use the optimization.

The actual decoding code shrinks by about 200 lines using this logic
which is a hint that it's not a bad idea.
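For illustration, the widened-header trick might look roughly like the
following. This is a minimal, self-contained sketch with invented names
(DemoTupleHeader, DEMO_WIDE_HEADER, etc.); the real patch would work on
HeapTupleHeaderData and only widen the header when wal_level = logical:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified tuple header: a fixed part plus t_hoff, the
 * offset at which user data begins.  Everything between the fixed part
 * and t_hoff is header-managed space (in PostgreSQL, normally the null
 * bitmap and alignment padding live there). */
typedef struct DemoTupleHeader
{
    uint32_t t_cid;      /* cmin (or combo id), as in the real header */
    uint8_t  t_infomask; /* flag bits */
    uint8_t  t_hoff;     /* offset to user data */
} DemoTupleHeader;

#define DEMO_WIDE_HEADER 0x01   /* invented flag: "cmax stored in header" */

/* Form a tuple, optionally reserving 4 extra header bytes for cmax. */
static unsigned char *
demo_form_tuple(uint32_t cmin, int store_cmax, uint32_t cmax,
                const void *data, size_t len)
{
    size_t hoff = sizeof(DemoTupleHeader) +
                  (store_cmax ? sizeof(uint32_t) : 0);
    unsigned char *tup = calloc(1, hoff + len);
    DemoTupleHeader *hdr = (DemoTupleHeader *) tup;

    hdr->t_cid = cmin;
    hdr->t_hoff = (uint8_t) hoff;
    if (store_cmax)
    {
        hdr->t_infomask |= DEMO_WIDE_HEADER;
        memcpy(tup + sizeof(DemoTupleHeader), &cmax, sizeof(uint32_t));
    }
    memcpy(tup + hoff, data, len);
    return tup;
}

/* Decoding side: fetch cmax from the widened header if present. */
static int
demo_get_cmax(const unsigned char *tup, uint32_t *cmax)
{
    const DemoTupleHeader *hdr = (const DemoTupleHeader *) tup;

    if (!(hdr->t_infomask & DEMO_WIDE_HEADER))
        return 0;               /* no cmax stored */
    memcpy(cmax, tup + sizeof(DemoTupleHeader), sizeof(uint32_t));
    return 1;
}
```

The point is that readers which don't know about the flag still find
their data at t_hoff as usual; only decoding needs to look inside.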

> It has long seemed to me to be a shame that we don't have some system
> for allowing old relfilenodes to stick around until they are no longer
> in use.  If we had that, we might be able to allow utilities like
> CLUSTER or VACUUM FULL to permit concurrent read access to the table.
> I realize that what people really want is to let those things run
> while allowing concurrent *write* access to the table, but a bird in
> the hand is worth two in the bush.  What we're really talking about
> here is applying MVCC to filesystem actions: instead of removing the
> old relfilenode(s) immediately, we do it when they're no longer
> referenced by anyone, just as we don't remove old tuples immediately,
> but rather when they are no longer referenced by anyone.  The details
> are tricky, though: we can allow write access to the *new* heap just
> as soon as the rewrite is finished, but anyone who is still looking at
> the *old* heap can't ever upgrade their AccessShareLock to anything
> higher, or hilarity will ensue.  Also, if they lock some *other*
> relation and AcceptInvalidationMessages(), their relcache entry for
> the rewritten relation will get rebuilt, and that's bound to work out
> poorly.  The net-net here is that I think (3) is an attractive
> solution, but I don't know that we can make it work in a reasonable
> amount of time.

I've looked at it before, and I honestly don't have a real clue how to
do it robustly.
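The mechanical core of (3) - deferring the unlink until the last user
lets go - is simple enough to sketch; it's everything around it (lock
upgrades, invalidations) that's hard. A toy sketch with invented names,
using a flag in place of the actual unlink():

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 16

typedef struct PinnedNode
{
    uint32_t relfilenode;   /* which physical file */
    int      pins;          /* live users: decoding slots, scans, ... */
    int      dropped;       /* a rewrite has superseded this file */
    int      unlinked;      /* stand-in for the actual unlink() */
} PinnedNode;

static PinnedNode nodes[MAX_NODES];
static int nnodes;

static PinnedNode *
node_register(uint32_t relfilenode)
{
    PinnedNode *n = &nodes[nnodes++];
    n->relfilenode = relfilenode;
    return n;
}

static void
node_pin(PinnedNode *n)
{
    n->pins++;
}

/* Called by slot management when a decoding slot no longer needs the
 * old file: the last unpin of a dropped node performs the unlink. */
static void
node_unpin(PinnedNode *n)
{
    assert(n->pins > 0);
    if (--n->pins == 0 && n->dropped)
        n->unlinked = 1;
}

/* Called by VACUUM FULL/CLUSTER instead of unlinking right away. */
static void
node_drop(PinnedNode *n)
{
    n->dropped = 1;
    if (n->pins == 0)
        n->unlinked = 1;
}
```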

> I don't think I understand exactly what you have in mind for (2); can
> you elaborate?  I have always thought that having a
> WaitForDecodingToCatchUp() primitive was a good way of handling
> changes that were otherwise too difficult to track our way through.  I
> am not sure you're doing that at all right now, which in some sense I
> guess is fine, but I haven't really understood your aversion to this
> solution.  There are some locking issues to be worked out here, but
> the problems don't seem altogether intractable.

So, what we need to do for rewriting catalog tables would be:
1) lock table against writes
2) wait for all in-progress xacts to finish, since they could have
   modified the table in question (we don't keep locks on system tables)
3) acquire xlog insert pointer
4) wait for all logical decoding actions to read past that pointer
5) upgrade the lock to an access exclusive one
6) perform vacuum full as usual
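The catch-up test in step 4 could be sketched like this. All names are
invented; the real check would inspect each replication slot's decoding
progress (its restart/confirmed LSN) in shared memory, and the waiting
would have to loop with sleeps or latches rather than test once:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t DemoLSN;

#define DEMO_NSLOTS 3
static DemoLSN slot_confirmed[DEMO_NSLOTS]; /* per-slot decode progress */

/* Step 4 above: true once every slot has decoded up to at least
 * 'target', the xlog insert pointer captured in step 3. */
static int
demo_all_slots_caught_up(DemoLSN target)
{
    for (int i = 0; i < DEMO_NSLOTS; i++)
        if (slot_confirmed[i] < target)
            return 0;
    return 1;
}
```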

The lock upgrade hazards in here are the reason I am averse to the
solution. And I don't see how we can avoid them, since in order for
decoding to catch up it has to be able to read from the
catalog... Otherwise it's easy enough to implement.

> (1) is basically deciding not to fix the problem.  I don't think
> that's acceptable.

I'd like to argue against this, but unfortunately I agree.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


