Thread: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
Hi,

When upgrading a < 9.3 cluster pg_upgrade doesn't bother to keep the old
multixacts around because they won't be read after the upgrade (and
aren't compatible). It just resets the new cluster's nextMulti to the
old + 1.
Unfortunately that means that there'll be an offsets/0000 file created by
initdb lying around. Sounds harmless enough, but that'll actually cause
problems if the old cluster had a nextMulti that's bigger than that
page.

When vac_truncate_clog() calls TruncateMultiXact() that'll scan
pg_multixact/offsets to find the earliest existing segment. That'll be
0000. If the to-be-truncated data is older than the last existing
segment it returns. Then it'll try to determine the last required data
in members/ by accessing the oldest data in offsets/.

Unfortunately, due to the existing 0000 segment, that means it'll
sometimes try to access a nonexistent offsets/ file, preventing vacuum
et al. from succeeding.


It seems to me the fix for this is to a) rmtree("pg_multixact/offsets",
false) in copy_clog_xlog_xid() in the old cluster < 9.3 case b) add a
warning to the release notes that everyone who used pg_upgrade and has
a 0000 file lying around should report to the mailing list.

b) is a bit unsatisfactory, but I don't want to suggest removing the
file. In too many situations it might actually still be needed.
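
Roughly, for (a), something like this inside copy_clog_xlog_xid()'s
old-cluster < 9.3 branch (untested sketch, not a patch; rmtree() is the
src/port helper, pg_log()/new_cluster are pg_upgrade's usual facilities):

    /*
     * Sketch of (a): the pre-9.3 multixact files are never copied over, so
     * also clear out the offsets segments that initdb created in the new
     * cluster; otherwise a stale 0000 file survives the upgrade.
     */
    char        mxoff_path[MAXPGPATH];

    snprintf(mxoff_path, sizeof(mxoff_path), "%s/pg_multixact/offsets",
             new_cluster.pgdata);
    if (!rmtree(mxoff_path, false))     /* remove contents, keep the dir */
        pg_log(PG_FATAL, "could not remove files in \"%s\"\n", mxoff_path);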

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
Hi,

On 2014-05-30 14:16:31 +0200, Andres Freund wrote:
> When upgrading a < 9.3 cluster pg_upgrade doesn't bother to keep the old
> multixacts around because they won't be read after the upgrade (and
> aren't compatible). It just resets the new cluster's nextMulti to the
> old + 1.
> Unfortunately that means that there'll be an offsets/0000 file created by
> initdb lying around. Sounds harmless enough, but that'll actually cause
> problems if the old cluster had a nextMulti that's bigger than that
> page.
>
> When vac_truncate_clog() calls TruncateMultiXact() that'll scan
> pg_multixact/offsets to find the earliest existing segment. That'll be
> 0000. If the to-be-truncated data is older than the last existing
> segment it returns. Then it'll try to determine the last required data
> in members/ by accessing the oldest data in offsets/.
>
> Unfortunately, due to the existing 0000 segment, that means it'll
> sometimes try to access a nonexistent offsets/ file, preventing vacuum
> et al. from succeeding.

For the sake of search engines, the error that triggers is:
ERROR: could not access status of transaction 2072053907
DETAIL: Could not open file "pg_multixact/offsets/7B81": No such file or directory.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Fri, May 30, 2014 at 02:16:31PM +0200, Andres Freund wrote:
> Hi,
>
> When upgrading a < 9.3 cluster pg_upgrade doesn't bother to keep the old
> multixacts around because they won't be read after the upgrade (and
> aren't compatible). It just resets the new cluster's nextMulti to the
> old + 1.
> Unfortunately that means that there'll be an offsets/0000 file created by
> initdb lying around. Sounds harmless enough, but that'll actually cause
> problems if the old cluster had a nextMulti that's bigger than that
> page.
>
> When vac_truncate_clog() calls TruncateMultiXact() that'll scan
> pg_multixact/offsets to find the earliest existing segment. That'll be
> 0000. If the to-be-truncated data is older than the last existing
> segment it returns. Then it'll try to determine the last required data
> in members/ by accessing the oldest data in offsets/.
>
> Unfortunately, due to the existing 0000 segment, that means it'll
> sometimes try to access a nonexistent offsets/ file, preventing vacuum
> et al. from succeeding.
>
>
> It seems to me the fix for this is to a) rmtree("pg_multixact/offsets",
> false) in copy_clog_xlog_xid() in the old cluster < 9.3 case b) add a
> warning to the release notes that everyone who used pg_upgrade and has
> a 0000 file lying around should report to the mailing list.
>
> b) is a bit unsatisfactory, but I don't want to suggest removing the
> file. In too many situations it might actually still be needed.

This is a bug in 9.3 pg_upgrade as well?  Why has no one reported it
before?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-05-30 09:29:16 -0400, Bruce Momjian wrote:
> This is a bug in 9.3 pg_upgrade as well?

Yes.

>  Why has no one reported it before?

My guess is that it wasn't attributed to pg_upgrade in the
past. Typically the error will only occur a fair amount of time
later. You'll just see vacuums randomly erroring out with slru.c errors
about nonexistent files :(.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Fri, May 30, 2014 at 03:32:44PM +0200, Andres Freund wrote:
> On 2014-05-30 09:29:16 -0400, Bruce Momjian wrote:
> > This is a bug in 9.3 pg_upgrade as well?
>
> Yes.
>
> >  Why has no one reported it before?
>
> My guess is that it wasn't attributed to pg_upgrade in the
> past. Typically the error will only occur a fair amount of time
> later. You'll just see vacuums randomly erroring out with slru.c errors
> about nonexistent files :(.

But how much later?  pg_upgrade is pretty popular now but I am just not
seeing the number of errors as I would expect:

    ERROR: could not access status of transaction 2072053907
    DETAIL: Could not open file "pg_multixact/offsets/7B81": No such file or directory.

I am not saying there is no bug, but from your analysis it would seem to
be 100% of pg_upgrade'ed clusters that use multi-xacts.  Is that true?
If so, it seems we would need to tell everyone to remove the 0000 files
if there are higher numbered ones with numbering gaps.  Is this
something our next minor release should fix in the multi-xacts code?
Fixing pg_upgrade seems like the easy part.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-05-30 11:00:06 -0400, Bruce Momjian wrote:
> On Fri, May 30, 2014 at 03:32:44PM +0200, Andres Freund wrote:
> > On 2014-05-30 09:29:16 -0400, Bruce Momjian wrote:
> > > This is a bug in 9.3 pg_upgrade as well?
> >
> > Yes.
> >
> > >  Why has no one reported it before?
> >
> > My guess is that it wasn't attributed to pg_upgrade in the
> > past. Typically the error will only occur a fair amount of time
> > later. You'll just see vacuums randomly erroring out with slru.c errors
> > about nonexistent files :(.
>
> But how much later?  pg_upgrade is pretty popular now but I am just not
> seeing the number of errors as I would expect:
>
>     ERROR: could not access status of transaction 2072053907
>     DETAIL: Could not open file "pg_multixact/offsets/7B81": No such file or directory.

> I am not saying there is no bug, but from your analysis it would seem to
> be 100% of pg_upgrade'ed clusters that use multi-xacts.

It'd need to be clusters that used more multixacts than fit onto two
slru offsets/ segments in < 9.2. Otherwise things will probably just
continue to work because there's no hole.
Also the new cluster needs to have used more than
vacuum_multixact_freeze_min_age multis after the upgrade to make the
problem visible. I think.
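
For scale: with the stock build constants one offsets/ segment covers 64k
multis, so "more than two segments" means the old cluster burned through
north of ~131k multixacts.  Quick back-of-the-envelope, assuming a default
build (BLCKSZ 8192, 4-byte MultiXactOffset, 32 SLRU pages per segment):

    #include <stdio.h>

    #define BLCKSZ                      8192
    #define MULTIXACT_OFFSETS_PER_PAGE  (BLCKSZ / 4)    /* 4-byte MultiXactOffset */
    #define SLRU_PAGES_PER_SEGMENT      32

    int
    main(void)
    {
        int         per_segment = SLRU_PAGES_PER_SEGMENT * MULTIXACT_OFFSETS_PER_PAGE;

        printf("multis per offsets segment: %d\n", per_segment);       /* 65536 */
        printf("two segments:               %d\n", 2 * per_segment);   /* 131072 */
        return 0;
    }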

> If so, it seems we would need to tell everyone to remove the 0000 files
> if there are higher numbered ones with numbering gaps.

The problem is that it's not actually that easy to define whether
there's a gap - after a multixact id wraparound the 0000 file might
exist again. So I'd rather not give a simple instruction that might
delete critical data.

> Is this something our next minor release should fix in the multi-xacts
> code?

I wondered whether we could detect that case and deal with it
transparently, but haven't come up with something smart. Alvaro?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Fri, May 30, 2014 at 05:13:17PM +0200, Andres Freund wrote:
> On 2014-05-30 11:00:06 -0400, Bruce Momjian wrote:
> > On Fri, May 30, 2014 at 03:32:44PM +0200, Andres Freund wrote:
> > > On 2014-05-30 09:29:16 -0400, Bruce Momjian wrote:
> > > > This is a bug in 9.3 pg_upgrade as well?
> > >
> > > Yes.
> > >
> > > >  Why has no one reported it before?
> > >
> > > My guess is that it wasn't attributed to pg_upgrade in the
> > > past. Typically the error will only occur a fair amount of time
> > > later. You'll just see vacuums randomly erroring out with slru.c errors
> > > about nonexistent files :(.
> >
> > But how much later?  pg_upgrade is pretty popular now but I am just not
> > seeing the number of errors as I would expect:
> >
> >     ERROR: could not access status of transaction 2072053907
> >     DETAIL: Could not open file "pg_multixact/offsets/7B81": No such file or directory.
>
> > I am not saying there is no bug, but from your analysis it would seem to
> > be 100% of pg_upgrade'ed clusters that use multi-xacts.
>
> It'd need to be clusters that used more multixacts than fit onto two
> slru offsets/ segments in < 9.2. Otherwise things will probably just
> continue to work because there's no hole.

Do you mean <= 9.2 here for the old cluster?  Multi-xacts were added in
9.3:

        commit 0ac5ad5134f2769ccbaefec73844f8504c4d6182
        Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
        Date:   Wed Jan 23 12:04:59 2013 -0300

            Improve concurrency of foreign key locking

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-05-30 16:35:15 -0400, Bruce Momjian wrote:
> On Fri, May 30, 2014 at 05:13:17PM +0200, Andres Freund wrote:
> > It'd need to be clusters that used more multixacts than fit onto two
> > slru offsets/ segments in < 9.2. Otherwise things will probably just
> > continue to work because there's no hole.
>
> Do you mean <= 9.2 here for the old cluster?

yes.

> Multi-xacts were added in 9.3:

That's not really correct. They were added in 8.2 or something...

>         commit 0ac5ad5134f2769ccbaefec73844f8504c4d6182
>         Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
>         Date:   Wed Jan 23 12:04:59 2013 -0300
>
>             Improve concurrency of foreign key locking

That just expanded the usage to also be able to track updated rows.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Fri, May 30, 2014 at 11:03:53PM +0200, Andres Freund wrote:
> On 2014-05-30 16:35:15 -0400, Bruce Momjian wrote:
> > On Fri, May 30, 2014 at 05:13:17PM +0200, Andres Freund wrote:
> > > It'd need to be clusters that used more multixacts than fit onto two
> > > slru offsets/ segments in < 9.2. Otherwise things will probably just
> > > continue to work because there's no hole.
> >
> > Do you mean <= 9.2 here for the old cluster?
>
> yes.
>
> > Multi-xacts were added in 9.3:
>
> That's not really correct. They were added in 8.2 or something...
>
> >         commit 0ac5ad5134f2769ccbaefec73844f8504c4d6182
> >         Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
> >         Date:   Wed Jan 23 12:04:59 2013 -0300
> >
> >             Improve concurrency of foreign key locking
>
> That just expanded the usage to also be able to track updated rows.

I guess I meant the new multixacts file format was in 9.3.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Fri, May 30, 2014 at 05:25:26PM -0400, Bruce Momjian wrote:
> On Fri, May 30, 2014 at 11:03:53PM +0200, Andres Freund wrote:
> > On 2014-05-30 16:35:15 -0400, Bruce Momjian wrote:
> > > On Fri, May 30, 2014 at 05:13:17PM +0200, Andres Freund wrote:
> > > > It'd need to be clusters that used more multixacts than fit onto two
> > > > slru offsets/ segments in < 9.2. Otherwise things will probably just
> > > > continue to work because there's no hole.
> > >
> > > Do you mean <= 9.2 here for the old cluster?
> >
> > yes.
> >
> > > Multi-xacts were added in 9.3:
> >
> > That's not really correct. They were added in 8.2 or something...
> >
> > >         commit 0ac5ad5134f2769ccbaefec73844f8504c4d6182
> > >         Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
> > >         Date:   Wed Jan 23 12:04:59 2013 -0300
> > >
> > >             Improve concurrency of foreign key locking
> >
> > That just expanded the usage to also be able to track updated rows.
>
> I guess I meant the new multixacts file format was in 9.3.

It appears this item is on hold until Alvaro can comment.  I can
probably find the impact and correct the problem for previous upgrades,
but it might be 1-2 months until I can get to it.  The fix for future
upgrades is simple.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Alvaro Herrera
Date:
Andres Freund wrote:
> Hi,
>
> When upgrading a < 9.3 cluster pg_upgrade doesn't bother to keep the old
> multixacts around because they won't be read after the upgrade (and
> aren't compatible). It just resets the new cluster's nextMulti to the
> old + 1.
> Unfortunately that means that there'll be an offsets/0000 file created by
> initdb lying around. Sounds harmless enough, but that'll actually cause
> problems if the old cluster had a nextMulti that's bigger than that
> page.
>
> When vac_truncate_clog() calls TruncateMultiXact() that'll scan
> pg_multixact/offsets to find the earliest existing segment. That'll be
> 0000. If the to-be-truncated data is older than the last existing
> segment it returns. Then it'll try to determine the last required data
> in members/ by accessing the oldest data in offsets/.

I'm trying to understand the mechanism of this bug, and I'm not
succeeding.  If the offsets/0000 was created by initdb, how come we try
to delete a file that's not also members/0000?  I mean, surely the file
as created by initdb is empty (zeroed).  In your sample error message
downthread,

ERROR: could not access status of transaction 2072053907
DETAIL: Could not open file "pg_multixact/offsets/7B81": No such file or directory.

what prompted the status of that multixid to be sought?  I see one
possible path to this error message, which is SlruPhysicalReadPage().
(There are other paths that lead to similar errors, but they use
"transaction 0" instead, so we can rule those out; and we can rule out
anything that uses MultiXactMemberCtl because of the path given in
DETAIL.)

There are four callsites that lead to that:

RecordNewMultiXact
GetMultiXactIdMembers (2x)
TrimMultiXact

Of those, only GetMultiXactIdMembers is likely to be called from vacuum
(actually RecordNewMultiXact can too, in a few cases, if it happens to
freeze a multi by creating another multi; should be pretty rare).
But you were talking about vacuum truncating pg_multixact -- and I don't
see how that's related to these functions.

Is it possible that you pasted the wrong error message?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-06-13 13:51:51 -0400, Alvaro Herrera wrote:
> Andres Freund wrote:
> > Hi,
> >
> > When upgrading a < 9.3 cluster pg_upgrade doesn't bother to keep the old
> > multixacts around because they won't be read after the upgrade (and
> > aren't compatible). It just resets the new cluster's nextMulti to the
> > old + 1.
> > Unfortunately that means that there'll be an offsets/0000 file created by
> > initdb lying around. Sounds harmless enough, but that'll actually cause
> > problems if the old cluster had a nextMulti that's bigger than that
> > page.
> >
> > When vac_truncate_clog() calls TruncateMultiXact() that'll scan
> > pg_multixact/offsets to find the earliest existing segment. That'll be
> > 0000. If the to-be-truncated data is older than the last existing
> > segment it returns. Then it'll try to determine the last required data
> > in members/ by accessing the oldest data in offsets/.
>
> I'm trying to understand the mechanism of this bug, and I'm not
> succeeding.  If the offset/0000 was created by initdb, how come we try
> to delete a file that's not also members/0000?  I mean, surely the file
> as created by initdb is empty (zeroed).  In your sample error message
> downthread,

The bit you're missing is essentially the following in TruncateMultiXact():
    /*
     * Note we can't just plow ahead with the truncation; it's possible that
     * there are no segments to truncate, which is a problem because we are
     * going to attempt to read the offsets page to determine where to
     * truncate the members SLRU.  So we first scan the directory to determine
     * the earliest offsets page number that we can read without error.
     */
That check is thwarted due to the 0000 segment. So the segment
containing the oldestMXact has already been removed, but we don't notice
that that's the case.

> ERROR: could not access status of transaction 2072053907
> DETAIL: Could not open file "pg_multixact/offsets/7B81": No such file or directory.
>
> what prompted the status of that multixid to be sought?  I see one
> possible path to this error message, which is SlruPhysicalReadPage().
> (There are other paths that lead to similar errors, but they use
> "transaction 0" instead, so we can rule those out; and we can rule out
> anything that uses MultiXactMemberCtl because of the path given in
> DETAIL.)

Same function:
    /*
     * First, compute the safe truncation point for MultiXactMember. This is
     * the starting offset of the multixact we were passed as MultiXactOffset
     * cutoff.
     */
    {
        int            pageno;
        int            slotno;
        int            entryno;
        MultiXactOffset *offptr;

        /* lock is acquired by SimpleLruReadPage_ReadOnly */

        pageno = MultiXactIdToOffsetPage(oldestMXact);
        entryno = MultiXactIdToOffsetEntry(oldestMXact);

        slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno,
                                            oldestMXact);
        offptr = (MultiXactOffset *)
            MultiXactOffsetCtl->shared->page_buffer[slotno];
        offptr += entryno;
        oldestOffset = *offptr;

        LWLockRelease(MultiXactOffsetControlLock);
    }

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Alvaro Herrera
Date:
Bruce Momjian wrote:

> It appears this item is on hold until Alvaro can comment.  I can
> probably find the impact and correct the problem for previous upgrades,
> but it might be 1-2 months until I can get to it.  The fix for future
> upgrades is simple.

TBH I think future upgrades is what we need to fix now, so that people
don't do broken upgrades from 9.2 to 9.4 once the latter is out.  (I
would assume that upgrades from 9.3 will not have a problem, unless the
9.3 setup itself is already broken due to a prior pg_upgrade).  Also, we
need to fix pg_upgrade to protect people upgrading to 9.3.5 from <= 9.2.

As for detecting the case in existing installs, I think we can supply a
query in the next minor release notes that uses pg_ls_dir() to see
whether there is a 0000 file outside the working range of
pg_multixact/offsets, whose result if nonempty would indicate the need to
delete (or more likely rename) a file.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Alvaro Herrera
Date:
Bruce Momjian wrote:
> On Fri, May 30, 2014 at 02:16:31PM +0200, Andres Freund wrote:
> > Hi,
> >
> > When upgrading a < 9.3 cluster pg_upgrade doesn't bother to keep the old
> > multixacts around because they won't be read after the upgrade (and
> > aren't compatible). It just resets the new cluster's nextMulti to the
> > old + 1.
> > Unfortunately that means that there'll be an offsets/0000 file created by
> > initdb lying around. Sounds harmless enough, but that'll actually cause
> > problems if the old cluster had a nextMulti that's bigger than that
> > page.

I've been playing with this a bit.  Here's a patch that just does
rmtree() of the problematic file during pg_upgrade, as proposed by
Andres, which solves the problem.  Note that this patch removes the 0000
file in both cases: when upgrading from 9.2, and when upgrading from
9.3+.  The former case is the bug that Andres reported.  In the second
case, we overwrite the files with the ones from the old cluster; if
there's a lingering 0000 file in the new cluster, it would cause
problems as well.  (Really, I don't see any reason to think that these
two cases are any different.)

> This is a bug in 9.3 pg_upgrade as well?  Why has no one reported it
> before?

I think one reason is that not all upgrades see an issue here; for old
clusters that haven't gone beyond the 0000 offset file, there is no
problem.  For clusters that have gone beyond 0000 but not by much, the
file will be deleted during the first truncation.  It only becomes a
problem if the cluster is close enough to 2^31.  Another thing to keep
in consideration is that initdb initializes all databases' datminmxid to
1.  If the old cluster was past the 2^31 point, it means the datminmxid
doesn't move from 1 until the actual wraparound.


I noticed another problem while trying to reproduce it.  It only happens
in assert-enabled builds: when FreezeMultiXactId() sees an old multi, it
asserts that it mustn't be running anymore, which is fine except that if
the multi value is one that survived a pg_upgrade cycle from 9.2 or
older (i.e. it was a share-locked tuple in the old cluster), an error is
raised when the assertion is checked.  Since it doesn't make sense to
examine its running status at that point, the attached patch simply
disables the assertion in that case.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Wed, Jun 18, 2014 at 06:51:31PM -0400, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> > On Fri, May 30, 2014 at 02:16:31PM +0200, Andres Freund wrote:
> > > Hi,
> > >
> > > When upgrading a < 9.3 cluster pg_upgrade doesn't bother to keep the old
> > > multixacts around because they won't be read after the upgrade (and
> > > aren't compatible). It just resets the new cluster's nextMulti to the
> > > old + 1.
> > > Unfortunately that means that there'll be an offsets/0000 file created by
> > > initdb lying around. Sounds harmless enough, but that'll actually cause
> > > problems if the old cluster had a nextMulti that's bigger than that
> > > page.
>
> I've been playing with this a bit.  Here's a patch that just does
> rmtree() of the problematic file during pg_upgrade, as proposed by
> Andres, which solves the problem.  Note that this patch removes the 0000
> file in both cases: when upgrading from 9.2, and when upgrading from
> 9.3+.  The former case is the bug that Andres reported.  In the second
> case, we overwrite the files with the ones from the old cluster; if
> there's a lingering 0000 file in the new cluster, it would cause
> problems as well.  (Really, I don't see any reason to think that these
> two cases are any different.)

I wasn't happy with having that delete code added there when we do
directory delete in the function above.  I instead broke apart the
delete and copy code and called the delete code where needed, in the
attached patch.
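
Roughly, the idea is a standalone delete helper that both the copy path and
the pre-9.3 branch can call (sketch with illustrative names, not a quote of
the patch):

    static void
    remove_new_subdir(const char *subdir, bool rmtopdir)
    {
        char        new_path[MAXPGPATH];

        snprintf(new_path, sizeof(new_path), "%s/%s",
                 new_cluster.pgdata, subdir);
        if (!rmtree(new_path, rmtopdir))
            pg_log(PG_FATAL, "could not delete directory \"%s\"\n", new_path);
    }

copy_subdir_files() then becomes "delete, then copy", and the old-cluster
< 9.3 branch of copy_clog_xlog_xid() calls only the delete part.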

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Attachment

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Wed, Jun 18, 2014 at 06:51:31PM -0400, Alvaro Herrera wrote:
> > This is a bug in 9.3 pg_upgrade as well?  Why has no one reported it
> > before?
>
> I think one reason is that not all upgrades see an issue here; for old
> clusters that haven't gone beyond the 0000 offset file, there is no
> problem.  For clusters that have gone beyond 0000 but not by much, the
> file will be deleted during the first truncation.  It only becomes a
> problem if the cluster is close enough to 2^31.  Another thing to keep
> in consideration is that initdb initializes all databases' datminmxid to
> 1.  If the old cluster was past the 2^31 point, it means the datminmxid
> doesn't move from 1 until the actual wraparound.

OK, so the xid has to be beyond 2^31 during pg_upgrade to trigger a
problem?  That might explain the rare reporting of this bug.  What would
the test query look like so we can tell people when to remove the '0000'
files?  Would we need to see the existence of '0000' and high-numbered
files?  How high?  What does a 2^31 file look like?

Also, is there a reason you didn't remove the 'members/0000' file in your
patch?  I have removed it in my version.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Wed, Jun 18, 2014 at 09:52:05PM -0400, Bruce Momjian wrote:
> On Wed, Jun 18, 2014 at 06:51:31PM -0400, Alvaro Herrera wrote:
> > > This is a bug in 9.3 pg_upgrade as well?  Why has no one reported it
> > > before?
> >
> > I think one reason is that not all upgrades see an issue here; for old
> > clusters that haven't gone beyond the 0000 offset file, there is no
> > problem.  For clusters that have gone beyond 0000 but not by much, the
> > file will be deleted during the first truncation.  It only becomes a
> > problem if the cluster is close enough to 2^31.  Another thing to keep
> > in consideration is that initdb initializes all databases' datminmxid to
> > 1.  If the old cluster was past the 2^31 point, it means the datminmxid
> > doesn't move from 1 until the actual wraparound.
>
> OK, so the xid has to be beyond 2^31 during pg_upgrade to trigger a
> problem?  That might explain the rare reporting of this bug.  What would
> the test query look like so we can tell people when to remove the '0000'
> files?  Would we need to see the existence of '0000' and high-numbered
> files?  How high?  What does a 2^31 file look like?

Also, what would a legitimate 0000 file at wrap-around time look like?
Would there have to be an 'ffff' or 'ffffff' file?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Alvaro Herrera
Date:
Bruce Momjian wrote:

> I wasn't happy with having that delete code added there when we do
> directory delete in the function above.  I instead broke apart the
> delete and copy code and called the delete code where needed, in the
> attached patch.

Makes sense, yeah.  I didn't look closely enough to realize that the
function that does the copying also does the rmtree part.

I also now realize why the other case (upgrade from 9.3 to 9.4) does not
have a bug: we are already deleting the files in that path.


Bruce Momjian wrote:

> > I think one reason is that not all upgrades see an issue here; for old
> > clusters that haven't gone beyond the 0000 offset file, there is no
> > problem.  For clusters that have gone beyond 0000 but not by much, the
> > file will be deleted during the first truncation.  It only becomes a
> > problem if the cluster is close enough to 2^31.  Another thing to keep
> > in consideration is that initdb initializes all databases' datminmxid to
> > 1.  If the old cluster was past the 2^31 point, it means the datminmxid
> > doesn't move from 1 until the actual wraparound.
>
> OK, so the xid has to be beyond 2^31 during pg_upgrade to trigger a
> problem?  That might explain the rare reporting of this bug.  What would
> the test query look like so we can tell people when to remove the '0000'
> files?  Would we need to see the existence of '0000' and high-numbered
> files?  How high?  What does a 2^31 file look like?

I misspoke.

I ran a few more upgrades, and then tried vacuuming all databases, which
is when the truncate code is run.  Say the original cluster had an
oldestmulti of 10 million.  If you just run VACUUM in the new cluster
after the upgrade, the 0000 file is not deleted: it's not yet old enough
in terms of multixact age.  An error is not thrown, because we're still
not attempting a truncate.  But if you lower the
vacuum_multixact_freeze_table_age to 10 million minus one, then we will
try the deletion and that will raise the error.

I think (didn't actually try) if you just let 150 million multixacts be
generated, that's the first time you will get the error.

Now if you run a VACUUM FREEZE after the upgrade, the file will be
deleted with no error.

I now think that the reason most people haven't hit the problem is that
they don't generate enough multis after upgrading a database that had
enough multis in the old database.  This seems a bit curious.

> Also, is there a reason you didn't remove the 'members/0000' file in your
> patch?  I have removed it in my version.

There's no point.  That file is the starting point for new multis
anyway, and it's compatible with the new format (because it's all
zeroes).

I guess you could get in trouble if you initdb the 9.3 database,
generate a few multis there, and then drop all tables and use that
modified cluster as a "new" cluster for pg_upgrade.  I would question
the sanity of a person doing something like that, however.


Bruce Momjian wrote:

> > OK, so the xid has to be beyond 2^31 during pg_upgrade to trigger a
> > problem?  That might explain the rare reporting of this bug.  What would
> > the test query look like so we can tell people when to remove the '0000'
> > files?  Would we need to see the existence of '0000' and high-numbered
> > files?  How high?  What does a 2^31 file look like?
>
> Also, what would a legitimate 0000 file at wrap-around time look like?
> Would there have to be an 'ffff' or 'ffffff' file?

Since I was wrong, there is no point in further research here.  Anyway
the last file before wrapping around in pg_multixact/members is FFFF.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Alvaro Herrera
Date:
BTW I hacked up pg_resetxlog a bit to make it generate the necessary
pg_multixact/offsets file when -m is given.  This is not acceptable for
commit because it duplicates the #defines from multixact.c, but maybe
we want this functionality enough that we're interested in a more
complete version of this patch; also it unconditionally writes one zero
byte to the file, which is of course wrong if the file exists and
already contains data.
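
A more complete version would presumably need to look something like the
following (sketch only, not the attached hack; constants duplicated from
multixact.c/slru.h, and it still clobbers a pre-existing segment, which a
real patch would have to handle):

    /*
     * Create the pg_multixact/offsets segment covering the nextMulti given
     * to -m, zero-filled up to and including that page, so the server can
     * start after the counter is moved.  Assumes pg_resetxlog.c's usual
     * includes (uint32 and MAXPGPATH come from c.h); cwd is the data dir.
     */
    #define BLCKSZ                      8192
    #define MULTIXACT_OFFSETS_PER_PAGE  (BLCKSZ / sizeof(uint32))
    #define SLRU_PAGES_PER_SEGMENT      32

    static void
    WriteEmptyMultiXactOffsetsSegment(uint32 next_multi)
    {
        char        path[MAXPGPATH];
        char        zeroes[BLCKSZ] = {0};
        int         pageno = next_multi / MULTIXACT_OFFSETS_PER_PAGE;
        int         segno = pageno / SLRU_PAGES_PER_SEGMENT;
        int         i;
        FILE       *fp;

        snprintf(path, sizeof(path), "pg_multixact/offsets/%04X", segno);
        if ((fp = fopen(path, "wb")) == NULL)   /* XXX overwrites existing data */
        {
            fprintf(stderr, "could not create \"%s\": %s\n",
                    path, strerror(errno));
            exit(1);
        }
        for (i = 0; i <= pageno % SLRU_PAGES_PER_SEGMENT; i++)
            fwrite(zeroes, 1, BLCKSZ, fp);
        fclose(fp);
    }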

It'd be nice to be able to write this in a way that lets it work for all
existing SLRU users (pg_clog is the most common, but
pg_multixact/members and the predicate locking stuff might also be
useful).  Not sure what that would look like.

Another consideration is that it doesn't remove existing files that are
outside of the new valid range according to freezing parameters and
such, but I'm not sure this is really doable considering that we might
need to change the relminmxid and datminmxid values, etc.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> Bruce Momjian wrote:

> Bruce Momjian wrote:
>
> > > OK, so the xid has to be beyond 2^31 during pg_upgrade to trigger a
> > > problem?  That might explain the rare reporting of this bug.  What would
> > > the test query look like so we can tell people when to remove the '0000'
> > > files?  Would we need to see the existence of '0000' and high-numbered
> > > files?  How high?  What does a 2^31 file look like?
> >
> > Also, what would a legitimate 0000 file at wrap-around time look like?
> > Would there have to be an 'ffff' or 'ffffff' file?
>
> Since I was wrong, there is no point in further research here.  Anyway
> the last file before wrapping around in pg_multixact/members is FFFF.

Oops, I meant the last file before wrap in pg_multixact/offsets is FFFF,
which is what we're talking about in this thread.

For members it's 14078, but it's not relevant here.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Thu, Jun 19, 2014 at 06:04:25PM -0400, Alvaro Herrera wrote:
> Bruce Momjian wrote:
>
> > I wasn't happy with having that delete code added there when we do
> > directory delete in the function above.  I instead broke apart the
> > delete and copy code and called the delete code where needed, in the
> > attached patch.
>
> Makes sense, yeah.  I didn't look closely enough to realize that the
> function that does the copying also does the rmtree part.

OK.  Should I apply my patch so at least pg_upgrade is good going
forward?

> I also now realize why the other case (upgrade from 9.3 to 9.4) does not
> have a bug: we are already deleting the files in that path.

Right, and I think the patch makes it clearer why we need those 'rm'
function calls because they mirror the 'copy' ones.

> > OK, so the xid has to be beyond 2^31 during pg_upgrade to trigger a
> > problem?  That might explain the rare reporting of this bug.  What would
> > the test query look like so we can tell people when to remove the '0000'
> > files?  Would we need to see the existence of '0000' and high-numbered
> > files?  How high?  What does a 2^31 file look like?
>
> I misspoke.
>
> I ran a few more upgrades, and then tried vacuuming all databases, which
> is when the truncate code is run.  Say the original cluster had an
> oldestmulti of 10 million.  If you just run VACUUM in the new cluster
> after the upgrade, the 0000 file is not deleted: it's not yet old enough
> in terms of multixact age.  An error is not thrown, because we're still
> not attempting a truncate.  But if you lower the
> vacuum_multixact_freeze_table_age to 10 million minus one, then we will
> try the deletion and that will raise the error.
>
> I think (didn't actually try) if you just let 150 million multixacts be
> generated, that's the first time you will get the error.
>
> Now if you run a VACUUM FREEZE after the upgrade, the file will be
> deleted with no error.
>
> I now think that the reason most people haven't hit the problem is that
> they don't generate enough multis after upgrading a database that had
> enough multis in the old database.  This seems a bit curious

OK, that does make more sense.  A user would need to have exceeded 0000
to the point where it was removed from their old cluster, and _then_ run
far enough past the freeze horizon to again require file removal.  This
explains why we are seeing the bug only now, and a quick minor release
with a query to fix this will get us out of the problem with minimal
impact.

> > Also, is there a reason you didn't remove the 'members/0000' file in your
> > patch?  I have removed it in my version.
>
> There's no point.  That file is the starting point for new multis
> anyway, and it's compatible with the new format (because it's all
> zeroes).

I think it should be done for consistency with the 'copy' function calls
above.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Thu, Jun 19, 2014 at 06:12:41PM -0400, Alvaro Herrera wrote:
> BTW I hacked up pg_resetxlog a bit to make it generate the necessary
> pg_multixact/offset file when -m is given.  This is not acceptable for
> commit because it duplicates the #defines from pg_multixact.c, but maybe
> we want this functionality enough that we're interested in a more
> complete version of this patch; also it unconditionally writes one zero
> byte to the file, which is of course wrong if the file exists and
> already contains data.

Why would we want this if the system functions fine without those files
being created?  I don't think we want pg_resetxlog to be doing anything
except changing pg_controldata.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Alvaro Herrera
Date:
Bruce Momjian wrote:
> On Thu, Jun 19, 2014 at 06:12:41PM -0400, Alvaro Herrera wrote:
> > BTW I hacked up pg_resetxlog a bit to make it generate the necessary
> > pg_multixact/offset file when -m is given.  This is not acceptable for
> > commit because it duplicates the #defines from pg_multixact.c, but maybe
> > we want this functionality enough that we're interested in a more
> > complete version of this patch; also it unconditionally writes one zero
> > byte to the file, which is of course wrong if the file exists and
> > already contains data.
>
> Why would we want this if the system functions fine without those files
> being created?  I don't think we want pg_resetxlog to be doing anything
> except changing pg_controldata.

I don't understand why you say the system functions fine.  If you move
the multixactid with pg_resetxlog to a point which doesn't have the
necessary file and page, the system will not start.  Only pg_upgrade
knows to create the file appropriately, but normal operation doesn't.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Bruce Momjian wrote:
>> Why would we want this if the system functions fine without those files
>> being created?  I don't think we want pg_resetxlog to be doing anything
>> except changing pg_controldata.

> I don't understand why you say the system functions fine.  If you move
> the multixactid with pg_resetxlog to a point which doesn't have the
> necessary file and page, the system will not start.  Only pg_upgrade
> knows to create the file appropriately, but normal operation doesn't.

Yeah.  In my mind, at least, there's been a TODO for some time for
pg_resetxlog to make sure that pg_clog and the other SLRU-like files
get populated appropriately when you tell it to change those counters.
You have to have a segment file at the right place or startup fails.

I'm not sure why Alvaro chose to fix this only for pg_multixact, but
fixing that case is better than not fixing anything.

            regards, tom lane

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Thu, Jun 19, 2014 at 09:06:54PM -0400, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
> > Bruce Momjian wrote:
>
> > Bruce Momjian wrote:
> >
> > > > OK, so the xid has to be beyond 2^31 during pg_upgrade to trigger a
> > > > problem?  That might explain the rare reporting of this bug.  What would
> > > > the test query look like so we can tell people when to remove the '0000'
> > > > files?  Would we need to see the existence of '0000' and high-numbered
> > > > files?  How high?  What does a 2^31 file look like?
> > >
> > > Also, what would a legitimate 0000 file at wrap-around time look like?
> > > Would there have to be an 'ffff' or 'ffffff' file?
> >
> > Since I was wrong, there is no point in further research here.  Anyway
> > the last file before wrapping around in pg_multixact/members is FFFF.
>
> Oops, I meant the last file before wrap in pg_multixact/offsets is FFFF,
> which is what we're talking about in this thread.
>
> For members it's 14078, but it's not relevant here.

OK, so the next question is, what will the minor-release-note query we
give users to test this look like?  Do we expect no gaps in numbering?
Is it enough to test for the existence of '0000' and lack of '0001' and
'FFFF'?  Basically, if we expect no gaps in normal numbering, then a
'0000' with no number after it and no wrap-around number before it means
the '0000' is left over from initdb and can be removed.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Fri, Jun 20, 2014 at 01:33:52PM -0400, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> > On Thu, Jun 19, 2014 at 06:12:41PM -0400, Alvaro Herrera wrote:
> > > BTW I hacked up pg_resetxlog a bit to make it generate the necessary
> > > pg_multixact/offset file when -m is given.  This is not acceptable for
> > > commit because it duplicates the #defines from pg_multixact.c, but maybe
> > > we want this functionality enough that we're interested in a more
> > > complete version of this patch; also it unconditionally writes one zero
> > > byte to the file, which is of course wrong if the file exists and
> > > already contains data.
> >
> > Why would we want this if the system functions fine without those files
> > being created?  I don't think we want pg_resetxlog to be doing anything
> > except changing pg_controldata.
>
> I don't understand why you say the system functions fine.  If you move
> the multixactid with pg_resetxlog to a point which doesn't have the
> necessary file and page, the system will not start.  Only pg_upgrade
> knows to create the file appropriately, but normal operation doesn't.

True, but who would change the multi-xact counter except pg_upgrade, and would
you want pg_resetxlog to be changing other files in the file system,
perhaps as part of some disaster recovery operation?

The larger question is whether the backend code should more cleanly
handle such cases.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Fri, Jun 20, 2014 at 01:41:42PM -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Bruce Momjian wrote:
> >> Why would we want this if the system functions fine without those files
> >> being created?  I don't think we want pg_resetxlog to be doing anything
> >> except changing pg_controldata.
>
> > I don't understand why you say the system functions fine.  If you move
> > the multixactid with pg_resetxlog to a point which doesn't have the
> > necessary file and page, the system will not start.  Only pg_upgrade
> > knows to create the file appropriately, but normal operation doesn't.
>
> Yeah.  In my mind, at least, there's been a TODO for some time for
> pg_resetxlog to make sure that pg_clog and the other SLRU-like files
> get populated appropriately when you tell it to change those counters.
> You have to have a segment file at the right place or startup fails.
>
> I'm not sure why Alvaro chose to fix this only for pg_multixact, but
> fixing that case is better than not fixing anything.

Well, if we are going to do this, we had better do all cases or the
behavior will be quite surprising, and when doing disaster recovery,
surprises are not good.  I am a little worried that removing files as
part of pg_resetxlog might cause data to be lost as users try to figure
things out.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> On Fri, Jun 20, 2014 at 01:41:42PM -0400, Tom Lane wrote:
>> Yeah.  In my mind, at least, there's been a TODO for some time for
>> pg_resetxlog to make sure that pg_clog and the other SLRU-like files
>> get populated appropriately when you tell it to change those counters.
>> You have to have a segment file at the right place or startup fails.

> Well, if we are going to do this, we had better do all cases or the
> behavior will be quite surprising, and when doing disaster recovery,
> surprises are not good.  I am a little worried that removing files as
> part of pg_resetxlog might cause data to be lost as users try to figure
> things out.

Huh?  The basic charter of pg_resetxlog is to throw away files, ie, all
of your WAL.

But in this case what we're talking about is a secondary function, namely
the ability to move the XID, MXID, etc counter values.  Up to now, if you
did that, it was on your head to manually create appropriate SLRU segment
files.  It's certainly less error-prone, not more so, if the code were
to take care of that for you.

            regards, tom lane

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Fri, Jun 20, 2014 at 02:24:55PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > On Fri, Jun 20, 2014 at 01:41:42PM -0400, Tom Lane wrote:
> >> Yeah.  In my mind, at least, there's been a TODO for some time for
> >> pg_resetxlog to make sure that pg_clog and the other SLRU-like files
> >> get populated appropriately when you tell it to change those counters.
> >> You have to have a segment file at the right place or startup fails.
>
> > Well, if we are going to do this, we had better do all cases or the
> > behavior will be quite surprising, and when doing disaster recovery,
> > surprises are not good.  I am a little worried that removing files as
> > part of pg_resetxlog might cause data to be lost as users try to figure
> > things out.
>
> Huh?  The basic charter of pg_resetxlog is to throw away files, ie, all
> of your WAL.
>
> But in this case what we're talking about is a secondary function, namely
> the ability to move the XID, MXID, etc counter values.  Up to now, if you
> did that, it was on your head to manually create appropriate SLRU segment
> files.  It's certainly less error-prone, not more so, if the code were
> to take care of that for you.

I am fine with that, but let's do a complete job so the user API is not
confusing.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Thu, Jun 19, 2014 at 06:04:25PM -0400, Alvaro Herrera wrote:
> Bruce Momjian wrote:
>
> > I wasn't happy with having that delete code added there when we do
> > directory delete in the function above.  I instead broke apart the
> > delete and copy code and called the delete code where needed, in the
> > attached patch.
>
> Makes sense, yeah.  I didn't look closely enough to realize that the
> function that does the copying also does the rmtree part.

Hearing nothing, patch applied back through 9.3.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Fri, Jun 20, 2014 at 02:07:25PM -0400, Bruce Momjian wrote:
> On Thu, Jun 19, 2014 at 09:06:54PM -0400, Alvaro Herrera wrote:
> > Alvaro Herrera wrote:
> > > Bruce Momjian wrote:
> >
> > > Bruce Momjian wrote:
> > >
> > > > > OK, so the xid has to be beyond 2^31 during pg_upgrade to trigger a
> > > > > problem?  That might explain the rare reporting of this bug.  What would
> > > > > the test query look like so we can tell people when to remove the '0000'
> > > > > files?  Would we need to see the existence of '0000' and high-numbered
> > > > > files?  How high?  What does a 2^31 file look like?
> > > >
> > > > Also, what would a legitimate 0000 file at wrap-around time look like?
> > > > Would there have to be an 'ffff' or 'ffffff' file?
> > >
> > > Since I was wrong, there is no point in further research here.  Anyway
> > > the last file before wrapping around in pg_multixact/members is FFFF.
> >
> > Oops, I meant the last file before wrap in pg_multixact/offsets is FFFF,
> > which is what we're talking about in this thread.
> >
> > For members it's 14078, but it's not relevant here.
>
> OK, so the next question is, what will the minor-release-note query we
> give users to test this look like?  Do we expect no gaps in numbering?
> Is it enough to test for the existence of '0000' and lack of '0001' and
> 'FFFF'?  Basically, if we expect no gaps in normal numbering, then a
> '0000' with no number after it and no wrap-around number before it means
> the '0000' is left over from initdb and can be removed.

Assuming this is true, here is a query we can put in the next 9.3 minor
release notes to tell users if they need to remove the '0000' file:

    WITH list(file) AS
    (
            SELECT * FROM pg_ls_dir('pg_multixact/offsets')
    )
    SELECT  EXISTS (SELECT * FROM list WHERE file = '0000') AND
            NOT EXISTS (SELECT * FROM list WHERE file = '0001') AND
            NOT EXISTS (SELECT * FROM list WHERE file = 'FFFF') AND
            EXISTS (SELECT * FROM list WHERE file != '0000')
            AS file_removal_needed;

Do they need to remove the members/0000 file too?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Alvaro Herrera
Date:
Bruce Momjian wrote:

> Assuming this is true, here is a query we can put in the next 9.3 minor
> release notes to tell users if they need to remove the '0000' file:
>
>     WITH list(file) AS
>     (
>             SELECT * FROM pg_ls_dir('pg_multixact/offsets')
>     )
>     SELECT  EXISTS (SELECT * FROM list WHERE file = '0000') AND
>             NOT EXISTS (SELECT * FROM list WHERE file = '0001') AND
>             NOT EXISTS (SELECT * FROM list WHERE file = 'FFFF') AND
>             EXISTS (SELECT * FROM list WHERE file != '0000')
>             AS file_removal_needed;
>
> Do they need to remove the members/0000 file too?

No, the members file shouldn't be removed; if they have used multixacts
at all since upgrading, the members/0000 file contains the members of
the first bunch of multis.  Removing it would remove valid data.  (Also
I still think that removing the members/0000 file that initdb creates is
a mistake.)

I think the query above is correct.  One possible quibble is a FFFF
segment that isn't allocated in full.  I don't think this is a problem,
because although the 0000 file would survive deletion, it would be
overwritten with zeroes as soon as the counter wraps around.  Since 0000
is seen as "in the future" as soon as file 8000 starts being used, it
shouldn't be a problem there.

Another thing probably worth doing is have pg_upgrade set
pg_database.datminmxid to the minimum of pg_class.relminmxid on each
database.  Right now these values are all initialized to 1, which is a
problem.  We have set_frozenxids() which appears to be related, although
I don't readily understand what that is doing.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> I think the query above is correct.  One possible quibble is a FFFF
> segment that isn't allocated in full.  I don't think this is a problem,
> because although the 0000 file would survive deletion, it would be
> overwritten with zeroes as soon as the counter wraps around.  Since 0000
> is seen as "in the future" as soon as file 8000 starts being used, it
> shouldn't be a problem there.

... that said, there is a wraparound check in SimpleLruTruncate: if you
are using file 8000 and above, slru.c refuses to remove the old files
because it believes you may have already suffered wraparound, and you
would get a WARNING message every time the truncation was attempted.

The warnings would probably be pretty much invisible because of their
low frequency.  This might be improved by my commit this afternoon,
which should cause multixact truncations to be attempted much more
often.  In any case, just removing the offending offsets/0000 file
should clear it up.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Fri, Jun 27, 2014 at 05:05:21PM -0400, Alvaro Herrera wrote:
> Bruce Momjian wrote:
>
> > Assuming this is true, here is a query we can put in the next 9.3 minor
> > release notes to tell users if they need to remove the '0000' file:
> >
> >     WITH list(file) AS
> >     (
> >             SELECT * FROM pg_ls_dir('pg_multixact/offsets')
> >     )
> >     SELECT  EXISTS (SELECT * FROM list WHERE file = '0000') AND
> >             NOT EXISTS (SELECT * FROM list WHERE file = '0001') AND
> >             NOT EXISTS (SELECT * FROM list WHERE file = 'FFFF') AND
> >             EXISTS (SELECT * FROM list WHERE file != '0000')
> >             AS file_removal_needed;
> >
> > Do they need to remove the members/0000 file too?
>
> No, the members file shouldn't be removed; if they have used multixacts
> at all since upgrading, the members/0000 file contains the members of
> the first bunch of multis.  Removing it would remove valid data.  (Also
> I still think that removing the members/0000 file that initdb creates is
> a mistake.)

We never remove offsets or members files with user data; we are only
discussing the removal of empty files created by initdb.  Can you
explain why we would remove the offsets file but retain the members file
created by initdb when we are _not_ preserving files from the old
cluster?  We are certainly setting the offset and member value from the
old cluster:

        exec_prog(UTILITY_LOG_FILE, NULL, true,
                  "\"%s/pg_resetxlog\" -m %u,%u \"%s\"",
                  new_cluster.bindir,
                  old_cluster.controldata.chkpnt_nxtmulti + 1,
                  old_cluster.controldata.chkpnt_nxtmulti,
                  new_cluster.pgdata);

How are we not creating a gap in members files by setting the members
value?  Why will that not cause problems?

> I think the query above is correct.  One possible quibble is a FFFF
> segment that isn't allocated in full.  I don't think this is a problem,
> because although the 0000 file would survive deletion, it would be
> overwritten with zeroes as soon as the counter wraps around.  Since 0000
> is seen as "in the future" as soon as file 8000 starts being used, it
> shouldn't be a problem there.

Understood.   Here is an updated query with a better column label --- I
think this is ready for the 9.3.X release notes, along with some text:

    WITH list(file) AS
    (
        SELECT * FROM pg_ls_dir('pg_multixact/offsets')
    )
    SELECT     EXISTS (SELECT * FROM list WHERE file = '0000') AND
        NOT EXISTS (SELECT * FROM list WHERE file = '0001') AND
        NOT EXISTS (SELECT * FROM list WHERE file = 'FFFF') AND
        EXISTS (SELECT * FROM list WHERE file != '0000')
        AS file_0000_removal_required;

> Another thing probably worth doing is have pg_upgrade set
> pg_database.datminmxid to the minimum of pg_class.relminmxid on each
> database.  Right now these values are all initialized to 1, which is a
> problem.  We have set_frozenxids() which appears to be related, although
> I don't readily understand what that is doing.

I will address this in a separate email.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Fri, Jun 27, 2014 at 05:05:21PM -0400, Alvaro Herrera wrote:
> Another thing probably worth doing is have pg_upgrade set
> pg_database.datminmxid to the minimum of pg_class.relminmxid on each
> database.  Right now these values are all initialized to 1, which is a
> problem.  We have set_frozenxids() which appears to be related, although
> I don't readily understand what that is doing.

Yikes, OK.  What pg_upgrade's set_frozenxids() does is to set the
database and relation frozen xids to be the current xid, particularly
template0 and the system tables, then pg_dump --binary-upgrade adjusts
many of them to match the old cluster.  What isn't happening, as you
pointed out, is any handling of the new database- and relation-level
minmxid values in pg_upgrade or pg_dump --binary-upgrade.
The attached 1k-line patch fixes that.

What I am not sure about is how to set values from pre-9.3 clusters, and
whether 9.3 pg_upgrade upgrades from pre-9.3 clusters are a problem.
Are users who used pg_upgrade to go to 9.4 beta in trouble?

I also have no way to know what value to use for pre-9.3 clusters --- I
used controldata.chkpnt_nxtmulti in pg_upgrade (because the value was
accessible), but 0 in pg_dump/pg_dumpall, like we already do for frozen
xid values, but that usage is for major versions that pg_upgrade doesn't
support, so it might be the wrong default.  I am thinking that should be
using controldata.chkpnt_nxtmulti, which exists back to 8.4, but I have
no access to that value from pg_dump.  In fact, the patch as it exists
is flawed because it uses controldata.chkpnt_nxtmulti to set values from
set_frozenxids(), because the value is accessible, but uses zero in
pg_dump and pg_dumpall for pre-9.3 old clusters.  :-(

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Attachment

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Alvaro Herrera
Date:
Bruce Momjian wrote:

> What I am not sure about is how to set values from pre-9.3 clusters, and
> whether 9.3 pg_upgrade upgrades from pre-9.3 clusters are a problem.
> Are users who used pg_upgrade to go to 9.4 beta in trouble?
>
> I also have no way to know what value to use for pre-9.3 clusters --- I
> used controldata.chkpnt_nxtmulti in pg_upgrade (because the value was
> accessible), but 0 in pg_dump/pg_dumpall, like we already do for frozen
> xid values, but that usage is for major versions that pg_upgrade doesn't
> support, so it might be the wrong default.  I am thinking that should be
> using controldata.chkpnt_nxtmulti, which exists back to 8.4, but I have
> no access to that value from pg_dump.  In fact, the patch as it exists
> is flawed because it uses controldata.chkpnt_nxtmulti to set values from
> set_frozenxids(), because the value is accessible, but uses zero in
> pg_dump and pg_dumpall for pre-9.3 old clusters.  :-(

Bruce and I discussed this on IM and I think we have reached a
conclusion on what needs to be done:

* When upgrading from 9.2 or older, all tables need to have relminmxid
  set to oldestMulti.  However, since pg_dump --binary-upgrade cannot
  extract useful values from the catalog, we will need to have the
  schema load create all tables with relminmxid=0.  A subsequent UPDATE
  will fix the values.

  In this case, each database's datminmxid value is going to be set to
  pg_control's oldestMulti.

  If I recall correctly, oldestMulti is computed as nextMulti-1.


* When upgrading from 9.3 or newer, the relminmxid values from the old
  cluster must be preserved.  Also, datminmxid is going to be preserved.


Finally, there is the question of what to do if the database has already
been upgraded and thus the tables are all at relminmxid=1.  As far as I
can tell, if the original value of nextMulti was below 2^31, there
should be no issue because vacuuming would advance the value normally.
If the original value was beyond that point, then vacuum would have been
bleating all along about the wraparound point.  In this case, I think it
should be enough to UPDATE the pg_class values to the current
oldestMulti value from pg_control, but I haven't tested this.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Alvaro Herrera
Date:
I forgot to add:  Many thanks, Bruce, for working on this patch.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Tue, Jul  1, 2014 at 03:01:06PM -0400, Alvaro Herrera wrote:
> Bruce Momjian wrote:
>
> > What I am not sure about is how to set values from pre-9.3 clusters, and
> > whether 9.3 pg_upgrade upgrades from pre-9.3 clusters are a problem.
> > Are users who used pg_upgrade to go to 9.4 beta in trouble?
> >
> > I also have no way to know what value to use for pre-9.3 clusters --- I
> > used controldata.chkpnt_nxtmulti in pg_upgrade (because the value was
> > accessible), but 0 in pg_dump/pg_dumpall, like we already do for frozen
> > xid values, but that usage is for major versions that pg_upgrade doesn't
> > support, so it might be the wrong default.  I am thinking that should be
> > using controldata.chkpnt_nxtmulti, which exists back to 8.4, but I have
> > no access to that value from pg_dump.  In fact, the patch as it exists
> > is flawed because it uses controldata.chkpnt_nxtmulti to set values from
> > set_frozenxids(), because the value is accessible, but uses zero in
> > pg_dump and pg_dumpall for pre-9.3 old clusters.  :-(
>
> Bruce and I discussed this on IM and I think we have reached a
> conclusion on what needs to be done:
>
> * When upgrading from 9.2 or older, all tables need to have relminmxid
>   set to oldestMulti.  However, since pg_dump --binary-upgrade cannot
>   extract useful values from the catalog, we will need to have the
>   schema load create all tables with relminmxid=0.  A subsequent UPDATE
>   will fix the values.

OK, the updated attached patch does this.  It repurposes
set_frozenxids().

>   In this case, each database's datminmxid value is going to be set to
>   pg_control's oldestMulti.
>
>   If I recall correctly, oldestMulti is computed as nextMulti-1.

We don't have oldestMulti in pg_controldata in pre-9.3, so we have to
use nextMulti-1.

> * When upgrading from 9.3 or newer, the relminmxid values from the old
>   cluster must be preserved.  Also, datminmxid is going to be preserved.

Yes, that was already in the patch.

> Finally, there is the question of what to do if the database has already
> been upgraded and thus the tables are all at relminmxid=1.  As far as I
> can tell, if the original value of nextMulti was below 2^31, there
> should be no issue because vacuuming would advance the value normally.
> If the original value was beyond that point, then vacuum would have been
> bleating all along about the wraparound point.  In this case, I think it
> should be enough the UPDATE the pg_class values to the current
> oldestMulti value from pg_control, but I haven't tested this.

Well, we are already having users run a query for the 9.3.X minor
version upgrade to optionally remove the 0000 file.  Is there something
else they should run to test for this?  We certainly could check for
files >= 8000, but I am not sure that is sufficient.  We would then need
them to somehow update all the database/relation minmxid fields, and I
am not even sure what value we should set it to.  Is that something we
want to publish?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Attachment

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Wed, Jul  2, 2014 at 11:56:34AM -0400, Alvaro Herrera wrote:
> > How are we not creating a gap in members files by setting the members
> > value?  Why will that not cause problems?
>
> We're not setting the offset and member value here; we're setting the
> "nextMulti" and "oldestMulti" values.  Both affect the offsets counter,
> not the members counter.  The members counter is kept at zero, so the
> next multi to be allocated will create members starting from the first
> members position in page zero.  initdb of the new cluster created the
> first members page, which corresponds to the first members value that
> will be used.
>
> Note pg_resetxlog --help says:
>
>   -m MXID,MXID     set next and oldest multitransaction ID
>
> I think you're confusing the fact that we pass two values here (next and
> oldest, both apply to offset counters) with offsets vs. members.
>
> To affect the members counter you would use this other pg_resetxlog switch:
>   -O OFFSET        set next multitransaction offset
> which, AFAICS, we only use when upgrading from a 9.3+ cluster to a newer
> 9.3+ cluster (and we also copy the files from old cluster to the new
> one).

OK, thanks for the analysis.  Attached patch applied back through 9.3.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Attachment

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Alvaro Herrera
Date:
Bruce Momjian wrote:
> On Fri, Jun 27, 2014 at 05:05:21PM -0400, Alvaro Herrera wrote:

> > > Do they need to remove the members/0000 file too?
> >
> > No, the members file shouldn't be removed; if they have used multixacts
> > at all since upgrading, the members/0000 file contains the members of
> > the first bunch of multis.  Removing it would remove valid data.  (Also
> > I still think that removing the members/0000 file that initdb creates is
> > a mistake.)
>
> We never remove offset or members files with user data;  we are only
> discussing the removal of empty files created by initdb.  Can you
> explain why we would remove the offsets file but retain the members file
> created by initdb when we are _not_ preserving files from the old
> cluster?  We are certainly setting the offset and member value from the
> old cluster:
>
>         exec_prog(UTILITY_LOG_FILE, NULL, true,
>                   "\"%s/pg_resetxlog\" -m %u,%u \"%s\"",
>                   new_cluster.bindir,
>                   old_cluster.controldata.chkpnt_nxtmulti + 1,
>                   old_cluster.controldata.chkpnt_nxtmulti,
>                   new_cluster.pgdata);
>
> How are we not creating a gap in members files by setting the members
> value?  Why will that not cause problems?

We're not setting the offset and member value here; we're setting the
"nextMulti" and "oldestMulti" values.  Both affect the offsets counter,
not the members counter.  The members counter is kept at zero, so the
next multi to be allocated will create members starting from the first
members position in page zero.  initdb of the new cluster created the
first members page, which corresponds to the first members value that
will be used.

Note pg_resetxlog --help says:

  -m MXID,MXID     set next and oldest multitransaction ID

I think you're confusing the fact that we pass two values here (next and
oldest, both apply to offset counters) with offsets vs. members.

To affect the members counter you would use this other pg_resetxlog switch:
  -O OFFSET        set next multitransaction offset
which, AFAICS, we only use when upgrading from a 9.3+ cluster to a newer
9.3+ cluster (and we also copy the files from old cluster to the new
one).


--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Tue, Jul  1, 2014 at 03:30:47PM -0400, Bruce Momjian wrote:
> > * When upgrading from 9.2 or older, all tables need to have relminmxid
> >   set to oldestMulti.  However, since pg_dump --binary-upgrade cannot
> >   extract useful values from the catalog, we will need to have the
> >   schema load create all tables with relminmxid=0.  A subsequent UPDATE
> >   will fix the values.
>
> OK, the updated attached patch does this.  It repurposes
> set_frozenxids().

OK, patch applied back through 9.3.  This leaves the two queries users
are going to have to run in 9.3.X.  The first will tell users if they
should remove offsets/0000:

    WITH list(file) AS
    (
        SELECT * FROM pg_ls_dir('pg_multixact/offsets')
    )
    SELECT     EXISTS (SELECT * FROM list WHERE file = '0000') AND
        NOT EXISTS (SELECT * FROM list WHERE file = '0001') AND
        NOT EXISTS (SELECT * FROM list WHERE file = 'FFFF') AND
        EXISTS (SELECT * FROM list WHERE file != '0000')
        AS file_0000_removal_required;

The second will tell them if they should go to a wiki page that will
show them how to set database and relation minmxid values:

    WITH list(file) AS
    (
        SELECT * FROM pg_ls_dir('pg_multixact/offsets')
    )
    SELECT     EXISTS (SELECT * FROM list WHERE file ~ '^[8-9A-F]') AND
        EXISTS (SELECT * FROM pg_database WHERE datminmxid = 1)
        AS update_database_and_relation_minmxid;

I will work on the wiki page now.

I think this issue is now ready for 9.3.X, once we write the wiki page
and release note text.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Wed, Jul  2, 2014 at 03:32:55PM -0400, Bruce Momjian wrote:
> The second will tell them if they should go to a wiki page that will
> show them how to set database and relation minmxid values:
>
>     WITH list(file) AS
>     (
>         SELECT * FROM pg_ls_dir('pg_multixact/offsets')
>     )
>     SELECT     EXISTS (SELECT * FROM list WHERE file ~ '^[8-9A-F]') AND
>         EXISTS (SELECT * FROM pg_database WHERE datminmxid = 1)
>         AS update_database_and_relation_minmxid;
>
> I will work on the wiki page now.
>
> I think this issue is now ready for 9.3.X, once we write the wiki page
> and release note text.

I have completed the wiki page for the second bug fix:

    https://wiki.postgresql.org/wiki/20140702pg_upgrade_fix

The first bug fix just requires removal of the '0000' file, and I think
can be fully explained in the release notes.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> On Tue, Jul  1, 2014 at 03:01:06PM -0400, Alvaro Herrera wrote:
>> Finally, there is the question of what to do if the database has already
>> been upgraded and thus the tables are all at relminmxid=1.  As far as I
>> can tell, if the original value of nextMulti was below 2^31, there
>> should be no issue because vacuuming would advance the value normally.
>> If the original value was beyond that point, then vacuum would have been
>> bleating all along about the wraparound point.  In this case, I think it
>> should be enough the UPDATE the pg_class values to the current
>> oldestMulti value from pg_control, but I haven't tested this.

> Well, we are already having users run a query for the 9.3.X minor
> version upgrade to optionally remove the 0000 file.  Is there something
> else they should run to test for this?  We certainly could check for
> files >= 8000, but I am not sure that is sufficient.  We would then need
> them to somehow update all the database/relation minmxid fields, and I
> am not even sure what value we should set it to.  Is that something we
> want to publish?

I started transcribing Bruce's proposed fix procedure at
https://wiki.postgresql.org/wiki/20140702pg_upgrade_fix
into the release notes, but I'm afraid it's all wet.

He's suggesting copying the last checkpoint's NextMultiXactId into
datminmxid/relminmxid, which is surely the wrong thing: that's likely to
be newer than all mxids in the tables, not older than them.  I thought at
first that this was a simple thinko and he meant to write oldestMultiXid,
but here's the thing: if we're in the situation where we've got
wraparound, isn't oldestMultiXid going to be 1?  The value recorded in the
checkpoint isn't magic, it's just going to be extracted from whatever's in
pg_database; and the whole problem here is that we can't trust that data.
Where can we get a useful lower bound from?

I'm a bit inclined to not say anything about fix procedures in the release
notes, because I'm not sure that this is a problem in the field.  If
anybody did have a wraparound they'd be getting bleats from VACUUM, and no
one has reported any such thing that I've heard.

            regards, tom lane

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Tom Lane
Date:
I wrote:
> I started transcribing Bruce's proposed fix procedure at
> https://wiki.postgresql.org/wiki/20140702pg_upgrade_fix
> into the release notes, but I'm afraid it's all wet.

> He's suggesting copying the last checkpoint's NextMultiXactId into
> datminmxid/relminmxid, which is surely the wrong thing: that's likely to
> be newer than all mxids in the tables, not older than them.  I thought at
> first that this was a simple thinko and he meant to write oldestMultiXid,
> but here's the thing: if we're in the situation where we've got
> wraparound, isn't oldestMultiXid going to be 1?  The value recorded in the
> checkpoint isn't magic, it's just going to be extracted from whatever's in
> pg_database; and the whole problem here is that we can't trust that data.
> Where can we get a useful lower bound from?

Ugh: it's worse than that.  pg_upgrade itself is using this utterly
nonsensical logic to set datminmxid/relminmxid.  This is a stop-ship
issue for 9.3.5.

After some reflection it seems to me that we could estimate oldestmxid for
a pre-9.3 source cluster as the NextMultiXactId from its pg_control less
2000000000 or so.  This will nearly always be much older than the actual
oldest mxid, but that's okay --- the next vacuuming cycle will advance the
datminmxid/relminmxid values to match reality, so long as they aren't
wrapped around already.

Note that there's already an assumption baked into pg_upgrade that 2E9
xids or mxids back is safely past the oldest actual data; see where it
sets autovacuum_freeze_max_age and autovacuum_multixact_freeze_max_age
while starting the new cluster.

(Hm ... I guess "2000000000 or so" actually needs to be a bit less than
that, otherwise autovacuum might kick off while we're munging the new
cluster.)
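
For illustration, the arithmetic being proposed could look something like
the standalone sketch below; the 1.9-billion constant and the helper name
are placeholders, not committed pg_upgrade code:

    #include <stdio.h>

    typedef unsigned int MultiXactId;       /* 32 bits, wraps modulo 2^32 */
    #define FirstMultiXactId ((MultiXactId) 1)

    /*
     * Hypothetical helper: estimate a safe "oldest" multixact for a pre-9.3
     * source cluster as a bit under 2 billion behind its NextMultiXactId,
     * staying inside the autovacuum multixact age limits.  The subtraction
     * deliberately wraps; only the invalid value 0 needs special-casing.
     */
    static MultiXactId
    estimate_oldest_mxid(MultiXactId next_mxid)
    {
        MultiXactId oldest = next_mxid - 1900000000u;

        if (oldest == 0)
            oldest = FirstMultiXactId;
        return oldest;
    }

    int
    main(void)
    {
        /* ~1.9 billion behind a large counter value */
        printf("%u\n", estimate_oldest_mxid(2100000000u)); /* 200000000 */
        /* with a small counter the subtraction wraps; the modulo-2^32
         * comparison still sees the result as ~1.9 billion in the past */
        printf("%u\n", estimate_oldest_mxid(5u));           /* 2394967301 */
        return 0;
    }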

We could recommend the same estimate in the instructions about cleaning
up a previous pg_upgrade by hand.

            regards, tom lane

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-07-20 13:37:01 -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > On Tue, Jul  1, 2014 at 03:01:06PM -0400, Alvaro Herrera wrote:
> >> Finally, there is the question of what to do if the database has already
> >> been upgraded and thus the tables are all at relminmxid=1.  As far as I
> >> can tell, if the original value of nextMulti was below 2^31, there
> >> should be no issue because vacuuming would advance the value normally.
> >> If the original value was beyond that point, then vacuum would have been
> >> bleating all along about the wraparound point.  In this case, I think it
> >> should be enough the UPDATE the pg_class values to the current
> >> oldestMulti value from pg_control, but I haven't tested this.
>
> > Well, we are already having users run a query for the 9.3.X minor
> > version upgrade to optionally remove the 0000 file.  Is there something
> > else they should run to test for this?  We certainly could check for
> > files >= 8000, but I am not sure that is sufficient.  We would then need
> > them to somehow update all the database/relation minmxid fields, and I
> > am not even sure what value we should set it to.  Is that something we
> > want to publish?
>
> I started transcribing Bruce's proposed fix procedure at
> https://wiki.postgresql.org/wiki/20140702pg_upgrade_fix
> into the release notes, but I'm afraid it's all wet.

I don't understand why we should do anything but remove the 0000 file if
it's all zeroes? This seems far too complicated. Beside the fact that I
doubt it's actually achieving anything reliably?

> I'm a bit inclined to not say anything about fix procedures in the release
> notes, because I'm not sure that this is a problem in the field.  If
> anybody did have a wraparound they'd be getting bleats from VACUUM, and no
> one has reported any such thing that I've heard.

There actually have been a couple reports about the general problem I
think - reacting to one was how I noticed the bug.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-07-20 13:37:01 -0400, Tom Lane wrote:
>> I started transcribing Bruce's proposed fix procedure at
>> https://wiki.postgresql.org/wiki/20140702pg_upgrade_fix
>> into the release notes, but I'm afraid it's all wet.

> I don't understand why we should do anything but remove the 0000 file if
> it's all zeroes? This seems far too complicated. Beside the fact that I
> doubt it's actually achieving anything reliably?

This one is not about the extra 0000 file.  It's about whether datminmxid
and relminmxid are wrong.  In the previous coding of pg_upgrade, they'd
have been left at "1" even if that value has wrapped around into the
future.

Now, if you were lucky, you'd have no bad side-effects even if they had
wrapped around ... but if you're not so lucky, you'd get failures due to
accesses to nonexistent pg_multixact files, after vacuum has truncated
pg_multixact in the mistaken impression that there are no remaining
references to old mxids.

The solution I just proposed of forcing the minmxid values down to
"now - 2E9" is not that great btw; I think it will result in autovacuum
deciding it has to scan everything in sight, which is the sort of
intensive operation that pg_upgrade was intended to avoid.  I'm not sure
we have any alternatives though.

            regards, tom lane

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-07-20 16:16:18 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-07-20 13:37:01 -0400, Tom Lane wrote:
> >> I started transcribing Bruce's proposed fix procedure at
> >> https://wiki.postgresql.org/wiki/20140702pg_upgrade_fix
> >> into the release notes, but I'm afraid it's all wet.
>
> > I don't understand why we should do anything but remove the 0000 file if
> > it's all zeroes? This seems far too complicated. Beside the fact that I
> > doubt it's actually achieving anything reliably?
>
> This one is not about the extra 0000 file.  It's about whether datminmxid
> and relminmxid are wrong.  In the previous coding of pg_upgrade, they'd
> have been left at "1" even if that value has wrapped around into the
> future.

Yea, I'm rereading the thread atm. I'd stopped following it after a
while and didn't notice the drift into a second problem.

> Now, if you were lucky, you'd have no bad side-effects even if they had
> wrapped around ...

Hm. If relminmxid = 1 is still in the past everything should be fine,
right? The next vacuum (which will run soon) will just set it anew. But
if not
    /* relminmxid must never go backward, either */
    if (MultiXactIdIsValid(minmulti) &&
        MultiXactIdPrecedes(pgcform->relminmxid, minmulti))
    {
        pgcform->relminmxid = minmulti;
        dirty = true;
    }
will prevent it from being changed. Similarly
    /* ditto */
    if (MultiXactIdPrecedes(dbform->datminmxid, newMinMulti))
    {
        dbform->datminmxid = newMinMulti;
        dirty = true;
    }
will prevent datminmxid from becoming sane.

vac_truncate_clog() will be called with ReadNextMultiXactId() -
autovacuum_multixact_freeze_max_age as the truncation for multis. That
itself would be fine since there won't be any actual multis in that
range. The problem is that we might miss having to do a full table
vacuum because we check relminmxid to determine that:
    scan_all |= MultiXactIdPrecedesOrEquals(onerel->rd_rel->relminmxid,
                                            mxactFullScanLimit);
which then could cause problems in the future.

Imo that means we need to fix this :( for existing clusters.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-07-20 15:55:25 -0400, Tom Lane wrote:
> I wrote:
> > I started transcribing Bruce's proposed fix procedure at
> > https://wiki.postgresql.org/wiki/20140702pg_upgrade_fix
> > into the release notes, but I'm afraid it's all wet.
>
> > He's suggesting copying the last checkpoint's NextMultiXactId into
> > datminmxid/relminmxid, which is surely the wrong thing: that's likely to
> > be newer than all mxids in the tables, not older than them.  I thought at
> > first that this was a simple thinko and he meant to write oldestMultiXid,
> > but here's the thing: if we're in the situation where we've got
> > wraparound, isn't oldestMultiXid going to be 1?  The value recorded in the
> > checkpoint isn't magic, it's just going to be extracted from whatever's in
> > pg_database; and the whole problem here is that we can't trust that data.
> > Where can we get a useful lower bound from?
>
> Ugh: it's worse than that.  pg_upgrade itself is using this utterly
> nonsensical logic to set datminmxid/relminmxid.  This is a stop-ship
> issue for 9.3.5.
>
> After some reflection it seems to me that we could estimate oldestmxid for
> a pre-9.3 source cluster as the NextMultiXactId from its pg_control less
> 2000000000 or so.  This will nearly always be much older than the actual
> oldest mxid, but that's okay --- the next vacuuming cycle will advance the
> datminmxid/relminmxid values to match reality, so long as they aren't
> wrapped around already.

I think NextMultiXactId should be fine when upgrading from pre 9.3
clusters. Pre 9.3 multis that have been written before a shutdown don't
really have a meaning: To the point that we actually never look at
them. MultiXactIdSetOldestVisible() would have set oldestMXact to
MultiXactState->nextMXact if no backend has any active
ones. GetMultiXactIdMembers et al will deal with that.

So I think for pg_upgrade itself using NextMulti as a replacement for
oldestMXact is fine.

Now that obviously does *not* mean it's safe to do that to after the
cluster has already been running 9.3+.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-07-20 16:16:18 -0400, Tom Lane wrote:
>> This one is not about the extra 0000 file.  It's about whether datminmxid
>> and relminmxid are wrong.  In the previous coding of pg_upgrade, they'd
>> have been left at "1" even if that value has wrapped around into the
>> future.

> Yea, I'm rereading the thread atm. I'd stopped following it after a
> while and didn't notice the drift into a second problem.

I've been doing more analysis on this.  It looks to me like most people
would not notice a problem, but ...

1. When updating from a pre-9.3 version, pg_upgrade sets pg_control's
oldestMultiXid to equal the old cluster's NextMultiXactId (with a useless
increment of the new NextMultiXactId, but that's not too relevant).
This might seem silly, because there are very probably mxids out there
that are before this value, but the multixact code is designed to assume
that LOCKED_ONLY mxids before oldestMultiXid are not running --- without
ever going to pg_multixact to check.  So in fact, there will be no
attempts to access on-disk data about old mxids; at least as long as the
mxid counters haven't wrapped around.

2. However, pg_upgrade also sets datminmxid/relminmxid to equal the old
cluster's NextMultiXactId.  The trouble with this is that it might fool
(auto)vacuum into never seeing and freezing the pre-upgrade mxids; they're
in the table but the metadata says not, so we'd not force a full table
scan to find them.  If such a mxid manages to survive past wraparound,
it'll be "in the future" and the multixact code will start complaining
about it, like this:

    if (!MultiXactIdPrecedes(multi, nextMXact))
        ereport(ERROR,
                (errcode(ERRCODE_INTERNAL_ERROR),
                 errmsg("MultiXactId %u has not been created yet -- apparent wraparound",
                        multi)));

3. In practice, because full-table vacuum scans get forced periodically
anyway to advance relminxid, it's entirely likely that old mxids would get
frozen before there was a problem.  vacuum will freeze an old mxid on
sight, whatever the reason for visiting it.

4. The patch Bruce applied to initialize datminmxid/relminmxid to the old
NextMultiXactId rather than 1 does not fundamentally change anything here.
It narrows the window in which wraparound can cause problems, but only by
the distance that "1" is in-the-future at the time of upgrade.  Indeed,
you could argue that it makes things worse, because it *guarantees* that
vacuum is unaware of old mxids existing in the table, whereas with the "1"
you had no such problem unless you were more than halfway to the wrap point.


What we've effectively got ATM, with either the patched or unpatched code,
is that you're safe to the extent that relminxid-driven freezing happens
often enough to get rid of mxids before they wrap.  I note that that's
exactly the situation we had pre-9.3.  That being the case, and in view
of the lack of complaints from the field about it, I wonder whether we
aren't greatly overreacting.  If the alternative is that pg_upgrade forces
cluster-wide freezing activity to occur immediately upon system startup,
I'd definitely say that the cure is worse than the disease.

            regards, tom lane

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-07-20 17:22:48 -0400, Tom Lane wrote:
> 2. However, pg_upgrade also sets datminmxid/relminmxid to equal the old
> cluster's NextMultiXactId.  The trouble with this is that it might fool
> (auto)vacuum into never seeing and freezing the pre-upgrade mxids; they're
> in the table but the metadata says not, so we'd not force a full table
> scan to find them.

We effectively can't actually assume those are going to be vacuumed away
anyway. There might have been 2**32 mxids in the older cluster already -
no value of datminmxid/relminmxid can protect us against that.

> 4. The patch Bruce applied to initialize datminmxid/relminmxid to the old
> NextMultiXactId rather than 1 does not fundamentally change anything here.
> It narrows the window in which wraparound can cause problems, but only by
> the distance that "1" is in-the-future at the time of upgrade.

I think it's actually more than that. Consider what happens if
pg_upgrade has used pg_resetxlog to set nextMulti to > 2^31. If
rel/datminmxid are set to 1 regardless vac_update_relstats() and
vac_update_datfrozenxid() won't increase them anymore because of:
    /* relminmxid must never go backward, either */
    if (MultiXactIdIsValid(minmulti) &&
        MultiXactIdPrecedes(pgcform->relminmxid, minmulti))
    {
        pgcform->relminmxid = minmulti;
        dirty = true;
    }

And that can actually cause significant problems once 9.3+ creates new
multis because they'll never get vacuumed away but still do get
truncated. If it's an updating multi xmax that can effectively make the
row unreadable - not just block updates.
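
To see concretely why the guard never fires there, here is a standalone
sketch of the modulo-2^32 ordering that MultiXactIdPrecedes implements
(essentially the same signed-difference trick as TransactionIdPrecedes;
this is an illustration, not the multixact.c source verbatim):

    #include <stdio.h>

    typedef unsigned int MultiXactId;   /* 32 bits */

    /* a "precedes" b iff the signed 32-bit difference is negative. */
    static int
    mxid_precedes(MultiXactId a, MultiXactId b)
    {
        return (int) (a - b) < 0;
    }

    int
    main(void)
    {
        MultiXactId stored = 1;             /* relminmxid left at 1 */
        MultiXactId cutoff = 3000000000u;   /* value VACUUM would store after
                                             * pg_resetxlog pushed nextMulti
                                             * past 2^31 */

        /* The "must never go backward" check only updates relminmxid when
         * the stored value precedes the new one.  Here it does not: 1 is
         * about 1.3 billion "ahead" of 3000000000 modulo 2^32. */
        printf("stored precedes cutoff? %d\n", mxid_precedes(stored, cutoff)); /* 0 */
        printf("cutoff precedes stored? %d\n", mxid_precedes(cutoff, stored)); /* 1 */
        return 0;
    }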

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-07-20 17:22:48 -0400, Tom Lane wrote:
>> 4. The patch Bruce applied to initialize datminmxid/relminmxid to the old
>> NextMultiXactId rather than 1 does not fundamentally change anything here.
>> It narrows the window in which wraparound can cause problems, but only by
>> the distance that "1" is in-the-future at the time of upgrade.

> I think it's actually more than that. Consider what happens if
> pg_upgrade has used pg_resetxlog to set nextMulti to > 2^31. If
> rel/datminmxid are set to 1 regardless vac_update_relstats() and
> vac_update_datfrozenxid() won't increase them anymore because of:
>     /* relminmxid must never go backward, either */
>     if (MultiXactIdIsValid(minmulti) &&
>         MultiXactIdPrecedes(pgcform->relminmxid, minmulti))
>     {
>         pgcform->relminmxid = minmulti;
>         dirty = true;
>     }

> And that can actually cause significant problems once 9.3+ creates new
> multis because they'll never get vacuumed away but still do get
> truncated. If it's an updating multi xmax that can effectively make the
> row unreadable - not just block updates.

No, I don't think so.  Truncation is driven off oldestMultiXid from
pg_control, not from relminmxid.  The only thing in-the-future values of
those will do to us is prevent autovacuum from thinking it must do a full
table scan.  (In particular, in-the-future values do not cause
oldestMultiXid to get advanced, because we're always looking for the
oldest value not the newest.)

But in any case, we both agree that setting relminmxid to equal nextMulti
is completely unsafe in a 9.3 cluster that's already been up.  So the
proposed fix instructions are certainly wrong.

            regards, tom lane

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-07-20 17:43:04 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-07-20 17:22:48 -0400, Tom Lane wrote:
> >> 4. The patch Bruce applied to initialize datminmxid/relminmxid to the old
> >> NextMultiXactId rather than 1 does not fundamentally change anything here.
> >> It narrows the window in which wraparound can cause problems, but only by
> >> the distance that "1" is in-the-future at the time of upgrade.
>
> > I think it's actually more than that. Consider what happens if
> > pg_upgrade has used pg_resetxlog to set nextMulti to > 2^31. If
> > rel/datminmxid are set to 1 regardless vac_update_relstats() and
> > vac_update_datfrozenxid() won't increase them anymore because of:
> >     /* relminmxid must never go backward, either */
> >     if (MultiXactIdIsValid(minmulti) &&
> >         MultiXactIdPrecedes(pgcform->relminmxid, minmulti))
> >     {
> >         pgcform->relminmxid = minmulti;
> >         dirty = true;
> >     }
>
> > And that can actually cause significant problems once 9.3+ creates new
> > multis because they'll never get vacuumed away but still do get
> > truncated. If it's an updating multi xmax that can effectively make the
> > row unreadable - not just block updates.
>
> No, I don't think so.  Truncation is driven off oldestMultiXid from
> pg_control, not from relminmxid. The only thing in-the-future values of
> those will do to us is prevent autovacuum from thinking it must do a full
> table scan.  (In particular, in-the-future values do not cause
> oldestMultiXid to get advanced, because we're always looking for the
> oldest value not the newest.)

Right. But that's the problem. If oldestMulti is set to, say, 3000000000
by pg_resetxlog during pg_upgrade but *minmxid = 1, those tables won't be
full-table scanned because of multixacts. But vac_truncate_clog() will
    SetMultiXactIdLimit(minMulti, minmulti_datoid);
regardless.

Note that it'll not notice the limit of other databases in this case
because vac_truncate_clog() will effectively use the in memory
GetOldestMultiXactId() and check if other databases are before that. But
there won't be any because they all appear in the future. Due to that
the next checkpoint will truncate the clog to the cutoff multi xid used
by the last vacuum.

Am I missing something?

> But in any case, we both agree that setting relminmxid to equal nextMulti
> is completely unsafe in a 9.3 cluster that's already been up.  So the
> proposed fix instructions are certainly wrong.

Right. I'm pondering what to do about it instead. The best idea I have
is something like:
1) Jot down pg_controldata|grep NextMultiXactId
2) kill/wait for all existing transactions to end
3) vacuum all databases with vacuum_multixact_freeze_min_age=0. That'll
   get rid of all old appearing multis
4) Update pg_class to set relminmxid=value from 1), same with
   pg_database

But that sucks and doesn't deal with all the problems :(

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-07-20 17:43:04 -0400, Tom Lane wrote:
>> No, I don't think so.  Truncation is driven off oldestMultiXid from
>> pg_control, not from relminmxid. The only thing in-the-future values of
>> those will do to us is prevent autovacuum from thinking it must do a full
>> table scan.  (In particular, in-the-future values do not cause
>> oldestMultiXid to get advanced, because we're always looking for the
>> oldest value not the newest.)

> Right. But that's the problem. If oldestMulti is set to, say, 3000000000
> by pg_resetxlog during pg_upgrade but *minmxid = 1 those tables won't be
> full tables scanned because of multixacts. But vac_truncate_clog() will
>     SetMultiXactIdLimit(minMulti, minmulti_datoid);
> regardless.

> Note that it'll not notice the limit of other databases in this case
> because vac_truncate_clog() will effectively use the in memory
> GetOldestMultiXactId() and check if other databases are before that. But
> there won't be any because they all appear in the future. Due to that
> the next checkpoint will truncate the clog to the cutoff multi xid used
> by the last vacuum.

Right.

> Am I missing something?

My point is that the cutoff multi xid won't be new enough to remove
non-LOCKED_ONLY (ie, post-9.3) mxids.

>> But in any case, we both agree that setting relminmxid to equal nextMulti
>> is completely unsafe in a 9.3 cluster that's already been up.  So the
>> proposed fix instructions are certainly wrong.

> Right. I'm pondering what to do about it instead. The best idea I have
> is something like:
> 1) Jot down pg_controldata|grep NextMultiXactId
> 2) kill/wait for all existing transactions to end
> 3) vacuum all databases with vacuum_multixact_freeze_min_age=0. That'll
>    get rid of all old appearing multis
> 4) Update pg_class to set relminmxid=value from 1), same with
>    pg_database

> But that sucks and doesn't deal with all the problems :(

Yeah.  At this point I'm of the opinion that we should not recommend any
manual corrective actions for this issue.  They're likely to do more harm
than good, especially if the user misses or fat-fingers any steps.

I'm also thinking that the lack of any complaints suggests there are few
or no existing installations with nextMulti past 2^31, anyhow.  If it were
even past 400000000 (default autovacuum_multixact_freeze_max_age), we'd
have been hearing howls of anguish about full-database freezing scans
occurring immediately after a pg_upgrade (thanks to minmxid = 1 being old
enough to trigger that).  So the way I've documented this patch in the
draft release notes is that it prevents the latter problem.

            regards, tom lane

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-07-20 18:16:51 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-07-20 17:43:04 -0400, Tom Lane wrote:
> >> No, I don't think so.  Truncation is driven off oldestMultiXid from
> >> pg_control, not from relminmxid. The only thing in-the-future values of
> >> those will do to us is prevent autovacuum from thinking it must do a full
> >> table scan.  (In particular, in-the-future values do not cause
> >> oldestMultiXid to get advanced, because we're always looking for the
> >> oldest value not the newest.)
>
> > Right. But that's the problem. If oldestMulti is set to, say, 3000000000
> > by pg_resetxlog during pg_upgrade but *minmxid = 1 those tables won't be
> > full tables scanned because of multixacts. But vac_truncate_clog() will
> >     SetMultiXactIdLimit(minMulti, minmulti_datoid);
> > regardless.
>
> > Note that it'll not notice the limit of other databases in this case
> > because vac_truncate_clog() will effectively use the in memory
> > GetOldestMultiXactId() and check if other databases are before that. But
> > there won't be any because they all appear in the future. Due to that
> > the next checkpoint will truncate the clog to the cutoff multi xid used
> > by the last vacuum.
>
> Right.
>
> > Am I missing something?
>
> My point is that the cutoff multi xid won't be new enough to remove
> non-LOCKED_ONLY (ie, post-9.3) mxids.

Why not? Afaics this will continue to happen until multixacts are
wrapped around once? So the cutoff multi will be new enough for that at
some point after the pg_upgrade?

Luckily in most cases full table vacuums triggered due to normal xids
will prevent bad problems though. There have been a couple reports where
people included pg_controldata output indicating crazy rates of multixid
consumption but I think none of those were crazy enough to burn multis
so fast that important ones get truncated before a full table vacuum
occurs due to normal xids.

> >> But in any case, we both agree that setting relminmxid to equal nextMulti
> >> is completely unsafe in a 9.3 cluster that's already been up.  So the
> >> proposed fix instructions are certainly wrong.
>
> > Right. I'm pondering what to do about it instead. The best idea I have
> > is something like:
> > 1) Jot down pg_controldata|grep NextMultiXactId
> > 2) kill/wait for all existing transactions to end
> > 3) vacuum all databases with vacuum_multixact_freeze_min_age=0. That'll
> >    get rid of all old appearing multis
> > 4) Update pg_class to set relminmxid=value from 1), same with
> >    pg_database
>
> > But that sucks and doesn't deal with all the problems :(
>
> Yeah.  At this point I'm of the opinion that we should not recommend any
> manual corrective actions for this issue.  They're likely to do more harm
> than good, especially if the user misses or fat-fingers any steps.

I don't really see us coming up with something robust in time :/. It's a
bit sad, but maybe we should recommend contacting the mailing list if
pg_upgrade has been used and nextMulti is above 2^31?

Btw, we really should have a txid_current() equivalent for multis...

> I'm also thinking that the lack of any complaints suggests there are few
> or no existing installations with nextMulti past 2^31, anyhow.  If it were
> even past 400000000 (default autovacuum_multixact_freeze_max_age), we'd
> have been hearing howls of anguish about full-database freezing scans
> occurring immediately after a pg_upgrade (thanks to minmxid = 1 being old
> enough to trigger that).

I think people just chalk that up to 'crazy pg vacuuming behaviour' and
don't investigate further. At least that's my practical experience :(

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-07-20 18:16:51 -0400, Tom Lane wrote:
>> My point is that the cutoff multi xid won't be new enough to remove
>> non-LOCKED_ONLY (ie, post-9.3) mxids.

> Why not? Afaics this will continue to happen until multixacts are
> wrapped around once? So the cutoff multi will be new enough for that at
> some point after the pg_upgrade?

Before that happens, nextMultiXid will catch up with the minmxid = 1
values, and they'll be in the past, and then we're at the same point we
would have been at had we used 9.3.5's pg_upgrade to begin with.

> Luckily in most cases full table vacuums triggered due to normal xids
> will prevent bad problems though.

Yeah.  While it's not that comfortable to rely on that, we were reliant
on that effect in every pre-9.3 branch, so I'm not terribly upset about
it continuing to be the case in existing 9.3 installations.  As long as
they're not consuming mxids at a rate much greater than xids, they'll be
all right; and things will be unconditionally safe once the freezing
process has advanced far enough.

            regards, tom lane

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-07-20 18:39:06 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-07-20 18:16:51 -0400, Tom Lane wrote:
> >> My point is that the cutoff multi xid won't be new enough to remove
> >> non-LOCKED_ONLY (ie, post-9.3) mxids.
>
> > Why not? Afaics this will continue to happen until multixacts are
> > wrapped around once? So the cutoff multi will be new enough for that at
> > some point after the pg_upgrade?
>
> Before that happens, nextMultiXid will catch up with the minmxid = 1
> values, and they'll be in the past, and then we're at the same point
> that we're at to begin with if we used 9.3.5 pg_upgrade.

There'll be problems earlier than that afaics. Consider what happens if
a pg_upgrade happened at NextMulti=3000000000 which will have set
minmxid=1. If, after that, a couple 100k multis have been used, but
fewer than autovacuum_freeze_max_age normal xids were consumed we're in
an unfortunate situation.

If any *single* full table vacuum after that calls
vac_update_datfrozenxid(), which just needs its datfrozenxid to advance by
one, we're in trouble: vac_truncate_clog() will be called with minMulti =
GetOldestMultiXactId(). Note that the latter only returns the oldest
*running* multi. Because vac_truncate_clog() only finds 1s in
datfrozenxid it'll ignore all of them because of:
        if (MultiXactIdPrecedes(dbform->datminmxid, minMulti))
        {
            minMulti = dbform->datminmxid;
            minmulti_datoid = HeapTupleGetOid(tuple);
        }
and calls
    SetMultiXactIdLimit(minMulti, minmulti_datoid);

If the vacuum happened without a concurrent backend using multis active
this will set the limit to the *current* NextMulti. Then
TruncateMultiXact() called after the checkpoint will truncate away
everything up to the current NextMulti essentially corrupting rows
created in 9.3+.

> > Luckily in most cases full table vacuums triggered due to normal xids
> > will prevent bad problems though.
>
> Yeah.  While it's not that comfortable to rely on that, we were reliant
> on that effect in every pre-9.3 branch, so I'm not terribly upset about
> it continuing to be the case in existing 9.3 installations.

Well, there's a pretty fundamental distinction from < 9.3: the worst that
could happen before was a transient error while *writing*. Now it can
happen during reading.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> If any *single* full table vacuum after that calls
> vac_update_datfrozenxid() which just needs its datfrozenxid advance by
> one we're in trouble: vac_truncate_clog() will be called with minMulti =
> GetOldestMultiXactId().

Uh, no, the cutoff is GetOldestMultiXactId minus
vacuum_multixact_freeze_min_age (or the autovac equivalent).
See vacuum_set_xid_limits.

            regards, tom lane

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Tom Lane
Date:
I wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> If any *single* full table vacuum after that calls
>> vac_update_datfrozenxid() which just needs its datfrozenxid advance by
>> one we're in trouble: vac_truncate_clog() will be called with minMulti =
>> GetOldestMultiXactId().

> Uh, no, the cutoff is GetOldestMultiXactId minus
> vacuum_multixact_freeze_min_age (or the autovac equivalent).

Oh, wait, I see what you're on about: that cutoff doesn't constrain
the global value computed by vac_truncate_clog.  Hmm.

I wonder whether we should change vac_update_relstats so that it only
applies the "relminxid mustn't go backwards" rule as long as relminxid
is sane, ie, not in the future.  If it is in the future, forcibly update
it to the cutoff we actually used.  Likewise for relminmxid.  And I guess
we'd need a similar rule for updating datmin(m)xid.
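
A standalone sketch of that rule (illustration only, not the eventual
patch; "next_mxid" stands in for what ReadNextMultiXactId() would return):

    #include <stdio.h>

    typedef unsigned int MultiXactId;   /* 32 bits, ordered modulo 2^32 */

    static int
    mxid_precedes(MultiXactId a, MultiXactId b)
    {
        return (int) (a - b) < 0;
    }

    /*
     * Proposed rule: advance the stored relminmxid to the new cutoff if the
     * cutoff is newer (as today), OR if the stored value is "in the future"
     * relative to next_mxid and therefore cannot be trusted.
     */
    static MultiXactId
    updated_relminmxid(MultiXactId stored, MultiXactId cutoff,
                       MultiXactId next_mxid)
    {
        if (mxid_precedes(stored, cutoff) ||    /* normal forward motion */
            mxid_precedes(next_mxid, stored))   /* stored value is insane */
            return cutoff;
        return stored;                          /* keep the work already done */
    }

    int
    main(void)
    {
        /* relminmxid stuck at 1 after an upgrade that set nextMulti high:
         * the old rule keeps 1 forever, the proposed one repairs it. */
        printf("%u\n", updated_relminmxid(1u, 2999000000u, 3000000000u)); /* 2999000000 */
        /* a sane stored value newer than the cutoff stays put */
        printf("%u\n", updated_relminmxid(500u, 400u, 1000u));            /* 500 */
        return 0;
    }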

            regards, tom lane

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-07-20 19:04:30 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > If any *single* full table vacuum after that calls
> > vac_update_datfrozenxid() which just needs its datfrozenxid advance by
> > one we're in trouble: vac_truncate_clog() will be called with minMulti =
> > GetOldestMultiXactId().
>
> Uh, no, the cutoff is GetOldestMultiXactId minus
> vacuum_multixact_freeze_min_age (or the autovac equivalent).
> See vacuum_set_xid_limits.

Unfortunately not :(. The stuff computed in vacuum_set_xid_limits isn't
used for truncation - which normally is sane because another database
might be further behind.

void
vac_update_datfrozenxid(void)
{
...
    MultiXactId newMinMulti;
...
    /*
     * Similarly, initialize the MultiXact "min" with the value that would be
     * used on pg_class for new tables.  See AddNewRelationTuple().
     */
    newMinMulti = GetOldestMultiXactId();
...
    while ((classTup = systable_getnext(scan)) != NULL)
    {
...
        if (MultiXactIdPrecedes(classForm->relminmxid, newMinMulti))
            newMinMulti = classForm->relminmxid;
    }
...
        if (dirty || ForceTransactionIdLimitUpdate())
        vac_truncate_clog(newFrozenXid, newMinMulti);
}

But even if it were, I don't really see how that changes anything?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-07-20 19:11:40 -0400, Tom Lane wrote:
> I wrote:
> > Andres Freund <andres@2ndquadrant.com> writes:
> >> If any *single* full table vacuum after that calls
> >> vac_update_datfrozenxid() which just needs its datfrozenxid advance by
> >> one we're in trouble: vac_truncate_clog() will be called with minMulti =
> >> GetOldestMultiXactId().
>
> > Uh, no, the cutoff is GetOldestMultiXactId minus
> > vacuum_multixact_freeze_min_age (or the autovac equivalent).
>
> Oh, wait, I see what you're on about: that cutoff doesn't constrain
> the global value computed by vac_truncate_clog.  Hmm.
>
> I wonder whether we should change vac_update_relstats so that it only
> applies the "relminxid mustn't go backwards" rule as long as relminxid
> is sane, ie, not in the future.  If it is in the future, forcibly update
> it to the cutoff we actually used.  Likewise for relminmxid.  And I guess
> we'd need a similar rule for updating datmin(m)xid.

I'm wondering the same. How would we do that from a concurrency POV for
the pg_database rows? I think we could just accept the race condition
that two xacts move dbform->datminmxid backwards to different values
since both have to be 'somewhat' correct?

I think this is out of the remit for 9.3.5. At least I don't have the
mental capacity to do this properly till tomorrow afternoon.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-07-20 19:11:40 -0400, Tom Lane wrote:
>> I wonder whether we should change vac_update_relstats so that it only
>> applies the "relminxid mustn't go backwards" rule as long as relminxid
>> is sane, ie, not in the future.  If it is in the future, forcibly update
>> it to the cutoff we actually used.  Likewise for relminmxid.  And I guess
>> we'd need a similar rule for updating datmin(m)xid.

> I'm wondering the same. How would we do that from a concurreny POV for
> the pg_database rows? I think we could just accept the race condition
> that two xacts move dbform->datminmxid backwards to different values
> since both have to be 'somewhat' correct?

Seems no worse than it is today.  Is it even possible for two xacts to be
trying to update that at the same time?

> I think this is out of the remit for 9.3.5. At least I don't have the
> mental capacity to do this properly till tomorrow afternoon.

Doesn't sound that hard to me (and I'm still reasonably awake, which
I bet you're not).  Will take a look.

            regards, tom lane

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-07-20 19:37:15 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-07-20 19:11:40 -0400, Tom Lane wrote:
> >> I wonder whether we should change vac_update_relstats so that it only
> >> applies the "relminxid mustn't go backwards" rule as long as relminxid
> >> is sane, ie, not in the future.  If it is in the future, forcibly update
> >> it to the cutoff we actually used.  Likewise for relminmxid.  And I guess
> >> we'd need a similar rule for updating datmin(m)xid.
>
> > I'm wondering the same. How would we do that from a concurreny POV for
> > the pg_database rows? I think we could just accept the race condition
> > that two xacts move dbform->datminmxid backwards to different values
> > since both have to be 'somewhat' correct?
>
> Seems no worse than it is today.

Well, right now it can only advance, never go backwards. I think. But I
don't see how it could matter - all the possible minimum values better
be correct.

> Is it even possible for two xacts to be trying to update that at the
> same time?

At least I don't see anything prohibiting it. vac_update_datfrozenxid()
is called straight from vacuum()/do_autovacuum(). The only thing that'll
serialize that I can see is the buffer lock for the heap_inplace_update().

Seems odd from a concurrency perspective. Worth a look someday not too
far away.

> > I think this is out of the remit for 9.3.5. At least I don't have the
> > mental capacity to do this properly till tomorrow afternoon.
>
> Doesn't sound that hard to me (and I'm still reasonably awake, which
> I bet you're not).

You're right, I'm most definitely not awake ;). Especially as it's 32 °C
in my flat - at 2 in the morning...

> Will take a look.

Cool. Will take a look at the result tomorrow.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-07-20 19:37:15 -0400, Tom Lane wrote:
>>> I wonder whether we should change vac_update_relstats so that it only
>>> applies the "relminxid mustn't go backwards" rule as long as relminxid
>>> is sane, ie, not in the future.  If it is in the future, forcibly update
>>> it to the cutoff we actually used.  Likewise for relminmxid.  And I guess
>>> we'd need a similar rule for updating datmin(m)xid.

>> I'm wondering the same.

Here's a draft patch for this.  I think this will fix all cases where
the "1" minmxid inserted by previous pg_upgrade versions is actually
in the future at the time we run VACUUM.  We would still be at risk if
it had been in the future when pg_upgrade ran but no longer is now,
since that would mean there could be non-lock-only mxids on disk that
are older than "1".  However, for the reasons discussed upthread, it
seems fairly unlikely to me that people would actually get burnt in
practice, so I'm satisfied with doing this much and no more.

I noticed while poking at it that there was an additional small logic
error in vac_update_datfrozenxid: if we decide to update only one of
datfrozenxid and datminmxid, the value passed to vac_truncate_clog for the
other field was *not* the value stored in pg_database but the value we'd
decided not to use.  That didn't seem like a good thing; it'd at least
result in possibly using MyDatabaseId as oldestxid_datoid or
minmulti_datoid for a value that doesn't actually match our database.

Barring objections I'll commit this tomorrow.

            regards, tom lane

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 8822a15..3f72d5a 100644
*** a/src/backend/commands/vacuum.c
--- b/src/backend/commands/vacuum.c
*************** vac_update_relstats(Relation relation,
*** 733,751 ****
      }

      /*
!      * relfrozenxid should never go backward.  Caller can pass
!      * InvalidTransactionId if it has no new data.
       */
      if (TransactionIdIsNormal(frozenxid) &&
!         TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid))
      {
          pgcform->relfrozenxid = frozenxid;
          dirty = true;
      }

!     /* relminmxid must never go backward, either */
      if (MultiXactIdIsValid(minmulti) &&
!         MultiXactIdPrecedes(pgcform->relminmxid, minmulti))
      {
          pgcform->relminmxid = minmulti;
          dirty = true;
--- 733,765 ----
      }

      /*
!      * Update relfrozenxid, unless caller passed InvalidTransactionId
!      * indicating it has no new data.
!      *
!      * Ordinarily, we don't let relfrozenxid go backwards: if things are
!      * working correctly, the only way the new frozenxid could be older would
!      * be if a previous VACUUM was done with a tighter freeze_min_age, in
!      * which case we don't want to forget the work it already did.  However,
!      * if the stored relfrozenxid is "in the future", then it must be corrupt
!      * and it seems best to overwrite it with the cutoff we used this time.
!      * See vac_update_datfrozenxid() concerning what we consider to be "in the
!      * future".
       */
      if (TransactionIdIsNormal(frozenxid) &&
!         pgcform->relfrozenxid != frozenxid &&
!         (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
!          TransactionIdPrecedes(GetOldestXmin(NULL, true),
!                                pgcform->relfrozenxid)))
      {
          pgcform->relfrozenxid = frozenxid;
          dirty = true;
      }

!     /* Similarly for relminmxid */
      if (MultiXactIdIsValid(minmulti) &&
!         pgcform->relminmxid != minmulti &&
!         (MultiXactIdPrecedes(pgcform->relminmxid, minmulti) ||
!          MultiXactIdPrecedes(GetOldestMultiXactId(), pgcform->relminmxid)))
      {
          pgcform->relminmxid = minmulti;
          dirty = true;
*************** vac_update_relstats(Relation relation,
*** 772,779 ****
   *        truncate pg_clog and pg_multixact.
   *
   *        We violate transaction semantics here by overwriting the database's
!  *        existing pg_database tuple with the new value.  This is reasonably
!  *        safe since the new value is correct whether or not this transaction
   *        commits.  As with vac_update_relstats, this avoids leaving dead tuples
   *        behind after a VACUUM.
   */
--- 786,793 ----
   *        truncate pg_clog and pg_multixact.
   *
   *        We violate transaction semantics here by overwriting the database's
!  *        existing pg_database tuple with the new values.  This is reasonably
!  *        safe since the new values are correct whether or not this transaction
   *        commits.  As with vac_update_relstats, this avoids leaving dead tuples
   *        behind after a VACUUM.
   */
*************** vac_update_datfrozenxid(void)
*** 786,792 ****
--- 800,808 ----
      SysScanDesc scan;
      HeapTuple    classTup;
      TransactionId newFrozenXid;
+     TransactionId lastSaneFrozenXid;
      MultiXactId newMinMulti;
+     MultiXactId lastSaneMinMulti;
      bool        dirty = false;

      /*
*************** vac_update_datfrozenxid(void)
*** 795,807 ****
       * committed pg_class entries for new tables; see AddNewRelationTuple().
       * So we cannot produce a wrong minimum by starting with this.
       */
!     newFrozenXid = GetOldestXmin(NULL, true);

      /*
       * Similarly, initialize the MultiXact "min" with the value that would be
       * used on pg_class for new tables.  See AddNewRelationTuple().
       */
!     newMinMulti = GetOldestMultiXactId();

      /*
       * We must seqscan pg_class to find the minimum Xid, because there is no
--- 811,823 ----
       * committed pg_class entries for new tables; see AddNewRelationTuple().
       * So we cannot produce a wrong minimum by starting with this.
       */
!     newFrozenXid = lastSaneFrozenXid = GetOldestXmin(NULL, true);

      /*
       * Similarly, initialize the MultiXact "min" with the value that would be
       * used on pg_class for new tables.  See AddNewRelationTuple().
       */
!     newMinMulti = lastSaneMinMulti = GetOldestMultiXactId();

      /*
       * We must seqscan pg_class to find the minimum Xid, because there is no
*************** vac_update_datfrozenxid(void)
*** 842,847 ****
--- 858,873 ----
      Assert(TransactionIdIsNormal(newFrozenXid));
      Assert(MultiXactIdIsValid(newMinMulti));

+     /*
+      * It's possible that every entry in pg_class is "in the future" (bugs in
+      * pg_upgrade have been known to produce such situations).  Don't update
+      * pg_database, nor truncate CLOG, unless we found at least one
+      * sane-looking entry.
+      */
+     if (!TransactionIdPrecedes(newFrozenXid, lastSaneFrozenXid) ||
+         !MultiXactIdPrecedes(newMinMulti, lastSaneMinMulti))
+         return;
+
      /* Now fetch the pg_database tuple we need to update. */
      relation = heap_open(DatabaseRelationId, RowExclusiveLock);

*************** vac_update_datfrozenxid(void)
*** 852,872 ****
      dbform = (Form_pg_database) GETSTRUCT(tuple);

      /*
!      * Don't allow datfrozenxid to go backward (probably can't happen anyway);
!      * and detect the common case where it doesn't go forward either.
       */
!     if (TransactionIdPrecedes(dbform->datfrozenxid, newFrozenXid))
      {
          dbform->datfrozenxid = newFrozenXid;
          dirty = true;
      }

!     /* ditto */
!     if (MultiXactIdPrecedes(dbform->datminmxid, newMinMulti))
      {
          dbform->datminmxid = newMinMulti;
          dirty = true;
      }

      if (dirty)
          heap_inplace_update(relation, tuple);
--- 878,907 ----
      dbform = (Form_pg_database) GETSTRUCT(tuple);

      /*
!      * As in vac_update_relstats(), we ordinarily don't want to let
!      * datfrozenxid go backward; but if it's "in the future" then it must be
!      * corrupt and it seems best to overwrite it.
       */
!     if (dbform->datfrozenxid != newFrozenXid &&
!         (TransactionIdPrecedes(dbform->datfrozenxid, newFrozenXid) ||
!          TransactionIdPrecedes(lastSaneFrozenXid, dbform->datfrozenxid)))
      {
          dbform->datfrozenxid = newFrozenXid;
          dirty = true;
      }
+     else
+         newFrozenXid = dbform->datfrozenxid;

!     /* Ditto for datminmxid */
!     if (dbform->datminmxid != newMinMulti &&
!         (MultiXactIdPrecedes(dbform->datminmxid, newMinMulti) ||
!          MultiXactIdPrecedes(lastSaneMinMulti, dbform->datminmxid)))
      {
          dbform->datminmxid = newMinMulti;
          dirty = true;
      }
+     else
+         newMinMulti = dbform->datminmxid;

      if (dirty)
          heap_inplace_update(relation, tuple);
*************** vac_update_datfrozenxid(void)
*** 875,883 ****
      heap_close(relation, RowExclusiveLock);

      /*
!      * If we were able to advance datfrozenxid, see if we can truncate
!      * pg_clog. Also do it if the shared XID-wrap-limit info is stale, since
!      * this action will update that too.
       */
      if (dirty || ForceTransactionIdLimitUpdate())
          vac_truncate_clog(newFrozenXid, newMinMulti);
--- 910,918 ----
      heap_close(relation, RowExclusiveLock);

      /*
!      * If we were able to advance datfrozenxid or datminmxid, see if we can
!      * truncate pg_clog and/or pg_multixact.  Also do it if the shared
!      * XID-wrap-limit info is stale, since this action will update that too.
       */
      if (dirty || ForceTransactionIdLimitUpdate())
          vac_truncate_clog(newFrozenXid, newMinMulti);
*************** vac_update_datfrozenxid(void)
*** 890,902 ****
   *        Scan pg_database to determine the system-wide oldest datfrozenxid,
   *        and use it to truncate the transaction commit log (pg_clog).
   *        Also update the XID wrap limit info maintained by varsup.c.
   *
!  *        The passed XID is simply the one I just wrote into my pg_database
!  *        entry.  It's used to initialize the "min" calculation.
   *
   *        This routine is only invoked when we've managed to change our
!  *        DB's datfrozenxid entry, or we found that the shared XID-wrap-limit
!  *        info is stale.
   */
  static void
  vac_truncate_clog(TransactionId frozenXID, MultiXactId minMulti)
--- 925,938 ----
   *        Scan pg_database to determine the system-wide oldest datfrozenxid,
   *        and use it to truncate the transaction commit log (pg_clog).
   *        Also update the XID wrap limit info maintained by varsup.c.
+  *        Likewise for datminmxid.
   *
!  *        The passed XID and minMulti are the updated values for my own
!  *        pg_database entry. They're used to initialize the "min" calculations.
   *
   *        This routine is only invoked when we've managed to change our
!  *        DB's datfrozenxid/datminmxid values, or we found that the shared
!  *        XID-wrap-limit info is stale.
   */
  static void
  vac_truncate_clog(TransactionId frozenXID, MultiXactId minMulti)
*************** vac_truncate_clog(TransactionId frozenXI
*** 909,920 ****
      Oid            minmulti_datoid;
      bool        frozenAlreadyWrapped = false;

!     /* init oldest datoids to sync with my frozen values */
      oldestxid_datoid = MyDatabaseId;
      minmulti_datoid = MyDatabaseId;

      /*
!      * Scan pg_database to compute the minimum datfrozenxid
       *
       * Note: we need not worry about a race condition with new entries being
       * inserted by CREATE DATABASE.  Any such entry will have a copy of some
--- 945,956 ----
      Oid            minmulti_datoid;
      bool        frozenAlreadyWrapped = false;

!     /* init oldest datoids to sync with my frozenXID/minMulti values */
      oldestxid_datoid = MyDatabaseId;
      minmulti_datoid = MyDatabaseId;

      /*
!      * Scan pg_database to compute the minimum datfrozenxid/datminmxid
       *
       * Note: we need not worry about a race condition with new entries being
       * inserted by CREATE DATABASE.  Any such entry will have a copy of some

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Tom Lane
Date:
I wrote:
> Here's a draft patch for this.  I think this will fix all cases where
> the "1" minmxid inserted by previous pg_upgrade versions is actually
> in the future at the time we run VACUUM.  We would still be at risk if
> it had been in the future when pg_upgrade ran but no longer is now,
> since that would mean there could be non-lock-only mxids on disk that
> are older than "1".  However, for the reasons discussed upthread, it
> seems fairly unlikely to me that people would actually get burnt in
> practice, so I'm satisfied with doing this much and no more.

Ah, belay that: as coded, that would allow truncation of clog/multixact
as soon as any one relation in any one database had sane
frozenxid/minmxid.  If we want to have any pretense of being safe, we have
to postpone truncation until *all* relations have been vacuumed.  So more
like the attached, instead.

            regards, tom lane

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 8822a15..ec9a7b6 100644
*** a/src/backend/commands/vacuum.c
--- b/src/backend/commands/vacuum.c
*************** static BufferAccessStrategy vac_strategy
*** 66,72 ****

  /* non-export function prototypes */
  static List *get_rel_oids(Oid relid, const RangeVar *vacrel);
! static void vac_truncate_clog(TransactionId frozenXID, MultiXactId minMulti);
  static bool vacuum_rel(Oid relid, VacuumStmt *vacstmt, bool do_toast,
             bool for_wraparound);

--- 66,75 ----

  /* non-export function prototypes */
  static List *get_rel_oids(Oid relid, const RangeVar *vacrel);
! static void vac_truncate_clog(TransactionId frozenXID,
!                   MultiXactId minMulti,
!                   TransactionId lastSaneFrozenXid,
!                   MultiXactId lastSaneMinMulti);
  static bool vacuum_rel(Oid relid, VacuumStmt *vacstmt, bool do_toast,
             bool for_wraparound);

*************** vac_update_relstats(Relation relation,
*** 733,751 ****
      }

      /*
!      * relfrozenxid should never go backward.  Caller can pass
!      * InvalidTransactionId if it has no new data.
       */
      if (TransactionIdIsNormal(frozenxid) &&
!         TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid))
      {
          pgcform->relfrozenxid = frozenxid;
          dirty = true;
      }

!     /* relminmxid must never go backward, either */
      if (MultiXactIdIsValid(minmulti) &&
!         MultiXactIdPrecedes(pgcform->relminmxid, minmulti))
      {
          pgcform->relminmxid = minmulti;
          dirty = true;
--- 736,768 ----
      }

      /*
!      * Update relfrozenxid, unless caller passed InvalidTransactionId
!      * indicating it has no new data.
!      *
!      * Ordinarily, we don't let relfrozenxid go backwards: if things are
!      * working correctly, the only way the new frozenxid could be older would
!      * be if a previous VACUUM was done with a tighter freeze_min_age, in
!      * which case we don't want to forget the work it already did.  However,
!      * if the stored relfrozenxid is "in the future", then it must be corrupt
!      * and it seems best to overwrite it with the cutoff we used this time.
!      * See vac_update_datfrozenxid() concerning what we consider to be "in the
!      * future".
       */
      if (TransactionIdIsNormal(frozenxid) &&
!         pgcform->relfrozenxid != frozenxid &&
!         (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
!          TransactionIdPrecedes(GetOldestXmin(NULL, true),
!                                pgcform->relfrozenxid)))
      {
          pgcform->relfrozenxid = frozenxid;
          dirty = true;
      }

!     /* Similarly for relminmxid */
      if (MultiXactIdIsValid(minmulti) &&
!         pgcform->relminmxid != minmulti &&
!         (MultiXactIdPrecedes(pgcform->relminmxid, minmulti) ||
!          MultiXactIdPrecedes(GetOldestMultiXactId(), pgcform->relminmxid)))
      {
          pgcform->relminmxid = minmulti;
          dirty = true;
*************** vac_update_relstats(Relation relation,
*** 772,779 ****
   *        truncate pg_clog and pg_multixact.
   *
   *        We violate transaction semantics here by overwriting the database's
!  *        existing pg_database tuple with the new value.  This is reasonably
!  *        safe since the new value is correct whether or not this transaction
   *        commits.  As with vac_update_relstats, this avoids leaving dead tuples
   *        behind after a VACUUM.
   */
--- 789,796 ----
   *        truncate pg_clog and pg_multixact.
   *
   *        We violate transaction semantics here by overwriting the database's
!  *        existing pg_database tuple with the new values.  This is reasonably
!  *        safe since the new values are correct whether or not this transaction
   *        commits.  As with vac_update_relstats, this avoids leaving dead tuples
   *        behind after a VACUUM.
   */
*************** vac_update_datfrozenxid(void)
*** 786,792 ****
--- 803,812 ----
      SysScanDesc scan;
      HeapTuple    classTup;
      TransactionId newFrozenXid;
+     TransactionId lastSaneFrozenXid;
      MultiXactId newMinMulti;
+     MultiXactId lastSaneMinMulti;
+     bool        bogus = false;
      bool        dirty = false;

      /*
*************** vac_update_datfrozenxid(void)
*** 795,807 ****
       * committed pg_class entries for new tables; see AddNewRelationTuple().
       * So we cannot produce a wrong minimum by starting with this.
       */
!     newFrozenXid = GetOldestXmin(NULL, true);

      /*
       * Similarly, initialize the MultiXact "min" with the value that would be
       * used on pg_class for new tables.  See AddNewRelationTuple().
       */
!     newMinMulti = GetOldestMultiXactId();

      /*
       * We must seqscan pg_class to find the minimum Xid, because there is no
--- 815,827 ----
       * committed pg_class entries for new tables; see AddNewRelationTuple().
       * So we cannot produce a wrong minimum by starting with this.
       */
!     newFrozenXid = lastSaneFrozenXid = GetOldestXmin(NULL, true);

      /*
       * Similarly, initialize the MultiXact "min" with the value that would be
       * used on pg_class for new tables.  See AddNewRelationTuple().
       */
!     newMinMulti = lastSaneMinMulti = GetOldestMultiXactId();

      /*
       * We must seqscan pg_class to find the minimum Xid, because there is no
*************** vac_update_datfrozenxid(void)
*** 828,833 ****
--- 848,868 ----
          Assert(TransactionIdIsNormal(classForm->relfrozenxid));
          Assert(MultiXactIdIsValid(classForm->relminmxid));

+         /*
+          * If things are working properly, no relation should have a
+          * relfrozenxid or relminmxid that is "in the future".  However, such
+          * cases have been known to arise due to bugs in pg_upgrade.  If we
+          * see any entries that are "in the future", chicken out and don't do
+          * anything.  This ensures we won't truncate clog before those
+          * relations have been scanned and cleaned up.
+          */
+         if (TransactionIdPrecedes(lastSaneFrozenXid, classForm->relfrozenxid) ||
+             MultiXactIdPrecedes(lastSaneMinMulti, classForm->relminmxid))
+         {
+             bogus = true;
+             break;
+         }
+
          if (TransactionIdPrecedes(classForm->relfrozenxid, newFrozenXid))
              newFrozenXid = classForm->relfrozenxid;

*************** vac_update_datfrozenxid(void)
*** 839,844 ****
--- 874,883 ----
      systable_endscan(scan);
      heap_close(relation, AccessShareLock);

+     /* chicken out if bogus data found */
+     if (bogus)
+         return;
+
      Assert(TransactionIdIsNormal(newFrozenXid));
      Assert(MultiXactIdIsValid(newMinMulti));

*************** vac_update_datfrozenxid(void)
*** 852,872 ****
      dbform = (Form_pg_database) GETSTRUCT(tuple);

      /*
!      * Don't allow datfrozenxid to go backward (probably can't happen anyway);
!      * and detect the common case where it doesn't go forward either.
       */
!     if (TransactionIdPrecedes(dbform->datfrozenxid, newFrozenXid))
      {
          dbform->datfrozenxid = newFrozenXid;
          dirty = true;
      }

!     /* ditto */
!     if (MultiXactIdPrecedes(dbform->datminmxid, newMinMulti))
      {
          dbform->datminmxid = newMinMulti;
          dirty = true;
      }

      if (dirty)
          heap_inplace_update(relation, tuple);
--- 891,920 ----
      dbform = (Form_pg_database) GETSTRUCT(tuple);

      /*
!      * As in vac_update_relstats(), we ordinarily don't want to let
!      * datfrozenxid go backward; but if it's "in the future" then it must be
!      * corrupt and it seems best to overwrite it.
       */
!     if (dbform->datfrozenxid != newFrozenXid &&
!         (TransactionIdPrecedes(dbform->datfrozenxid, newFrozenXid) ||
!          TransactionIdPrecedes(lastSaneFrozenXid, dbform->datfrozenxid)))
      {
          dbform->datfrozenxid = newFrozenXid;
          dirty = true;
      }
+     else
+         newFrozenXid = dbform->datfrozenxid;

!     /* Ditto for datminmxid */
!     if (dbform->datminmxid != newMinMulti &&
!         (MultiXactIdPrecedes(dbform->datminmxid, newMinMulti) ||
!          MultiXactIdPrecedes(lastSaneMinMulti, dbform->datminmxid)))
      {
          dbform->datminmxid = newMinMulti;
          dirty = true;
      }
+     else
+         newMinMulti = dbform->datminmxid;

      if (dirty)
          heap_inplace_update(relation, tuple);
*************** vac_update_datfrozenxid(void)
*** 875,886 ****
      heap_close(relation, RowExclusiveLock);

      /*
!      * If we were able to advance datfrozenxid, see if we can truncate
!      * pg_clog. Also do it if the shared XID-wrap-limit info is stale, since
!      * this action will update that too.
       */
      if (dirty || ForceTransactionIdLimitUpdate())
!         vac_truncate_clog(newFrozenXid, newMinMulti);
  }


--- 923,935 ----
      heap_close(relation, RowExclusiveLock);

      /*
!      * If we were able to advance datfrozenxid or datminmxid, see if we can
!      * truncate pg_clog and/or pg_multixact.  Also do it if the shared
!      * XID-wrap-limit info is stale, since this action will update that too.
       */
      if (dirty || ForceTransactionIdLimitUpdate())
!         vac_truncate_clog(newFrozenXid, newMinMulti,
!                           lastSaneFrozenXid, lastSaneMinMulti);
  }


*************** vac_update_datfrozenxid(void)
*** 890,905 ****
   *        Scan pg_database to determine the system-wide oldest datfrozenxid,
   *        and use it to truncate the transaction commit log (pg_clog).
   *        Also update the XID wrap limit info maintained by varsup.c.
   *
!  *        The passed XID is simply the one I just wrote into my pg_database
!  *        entry.  It's used to initialize the "min" calculation.
   *
   *        This routine is only invoked when we've managed to change our
!  *        DB's datfrozenxid entry, or we found that the shared XID-wrap-limit
!  *        info is stale.
   */
  static void
! vac_truncate_clog(TransactionId frozenXID, MultiXactId minMulti)
  {
      TransactionId myXID = GetCurrentTransactionId();
      Relation    relation;
--- 939,960 ----
   *        Scan pg_database to determine the system-wide oldest datfrozenxid,
   *        and use it to truncate the transaction commit log (pg_clog).
   *        Also update the XID wrap limit info maintained by varsup.c.
+  *        Likewise for datminmxid.
   *
!  *        The passed frozenXID and minMulti are the updated values for my own
!  *        pg_database entry. They're used to initialize the "min" calculations.
!  *        The caller also passes the "last sane" XID and MXID, since it has
!  *        those at hand already.
   *
   *        This routine is only invoked when we've managed to change our
!  *        DB's datfrozenxid/datminmxid values, or we found that the shared
!  *        XID-wrap-limit info is stale.
   */
  static void
! vac_truncate_clog(TransactionId frozenXID,
!                   MultiXactId minMulti,
!                   TransactionId lastSaneFrozenXid,
!                   MultiXactId lastSaneMinMulti)
  {
      TransactionId myXID = GetCurrentTransactionId();
      Relation    relation;
*************** vac_truncate_clog(TransactionId frozenXI
*** 907,920 ****
      HeapTuple    tuple;
      Oid            oldestxid_datoid;
      Oid            minmulti_datoid;
      bool        frozenAlreadyWrapped = false;

!     /* init oldest datoids to sync with my frozen values */
      oldestxid_datoid = MyDatabaseId;
      minmulti_datoid = MyDatabaseId;

      /*
!      * Scan pg_database to compute the minimum datfrozenxid
       *
       * Note: we need not worry about a race condition with new entries being
       * inserted by CREATE DATABASE.  Any such entry will have a copy of some
--- 962,976 ----
      HeapTuple    tuple;
      Oid            oldestxid_datoid;
      Oid            minmulti_datoid;
+     bool        bogus = false;
      bool        frozenAlreadyWrapped = false;

!     /* init oldest datoids to sync with my frozenXID/minMulti values */
      oldestxid_datoid = MyDatabaseId;
      minmulti_datoid = MyDatabaseId;

      /*
!      * Scan pg_database to compute the minimum datfrozenxid/datminmxid
       *
       * Note: we need not worry about a race condition with new entries being
       * inserted by CREATE DATABASE.  Any such entry will have a copy of some
*************** vac_truncate_clog(TransactionId frozenXI
*** 936,941 ****
--- 992,1010 ----
          Assert(TransactionIdIsNormal(dbform->datfrozenxid));
          Assert(MultiXactIdIsValid(dbform->datminmxid));

+         /*
+          * If things are working properly, no database should have a
+          * datfrozenxid or datminmxid that is "in the future".  However, such
+          * cases have been known to arise due to bugs in pg_upgrade.  If we
+          * see any entries that are "in the future", chicken out and don't do
+          * anything.  This ensures we won't truncate clog before those
+          * databases have been scanned and cleaned up.  (We will issue the
+          * "already wrapped" warning if appropriate, though.)
+          */
+         if (TransactionIdPrecedes(lastSaneFrozenXid, dbform->datfrozenxid) ||
+             MultiXactIdPrecedes(lastSaneMinMulti, dbform->datminmxid))
+             bogus = true;
+
          if (TransactionIdPrecedes(myXID, dbform->datfrozenxid))
              frozenAlreadyWrapped = true;
          else if (TransactionIdPrecedes(dbform->datfrozenxid, frozenXID))
*************** vac_truncate_clog(TransactionId frozenXI
*** 969,974 ****
--- 1038,1047 ----
          return;
      }

+     /* chicken out if data is bogus in any other way */
+     if (bogus)
+         return;
+
      /*
       * Truncate CLOG to the oldest computed value.  Note we don't truncate
       * multixacts; that will be done by the next checkpoint.

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Alvaro Herrera
Date:
Thanks for spending some time on this issue.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-07-20 23:19:30 -0400, Tom Lane wrote:
> I wrote:
> > Here's a draft patch for this.  I think this will fix all cases where
> > the "1" minmxid inserted by previous pg_upgrade versions is actually
> > in the future at the time we run VACUUM.  We would still be at risk if
> > it had been in the future when pg_upgrade ran but no longer is now,
> > since that would mean there could be non-lock-only mxids on disk that
> > are older than "1".  However, for the reasons discussed upthread, it
> > seems fairly unlikely to me that people would actually get burnt in
> > practice, so I'm satisfied with doing this much and no more.
>
> Ah, belay that: as coded, that would allow truncation of clog/multixact
> as soon as any one relation in any one database had sane
> frozenxid/minmxid.  If we want to have any pretense of being safe, we have
> to postpone truncation until *all* relations have been vacuumed.  So more
> like the attached, instead.

Looks basically good to me. Two minor things:
1) How about adding an elog(WARNING, "relation %u has invalid freeze
limits", ...) to vac_update_datfrozenxid()? Otherwise it's a bit hard
to understand what's going on. I wouldn't want to invest too much effort
into the precision of the message, but some log bleat seems justified
(rough sketch of both points below).

2) From a pedantic POV lastSane* should be initialized to
ReadNewTransactionId()/ReadNextMultiXactId(). On a busy system we could
otherwise consider concurrently updated values as being too new if some
long-running transaction finishes. Also GetOldestXmin() can sometimes go
backwards...
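
To make both points concrete, roughly (sketch only, untested, reusing the
names from your patch; HeapTupleGetOid() is just to get at the relation's
OID for the message):

    /* 1) some log bleat before chickening out in vac_update_datfrozenxid() */
    if (TransactionIdPrecedes(lastSaneFrozenXid, classForm->relfrozenxid) ||
        MultiXactIdPrecedes(lastSaneMinMulti, classForm->relminmxid))
    {
        elog(WARNING, "relation %u has invalid freeze limits "
             "(relfrozenxid %u, relminmxid %u)",
             HeapTupleGetOid(classTup),
             classForm->relfrozenxid, classForm->relminmxid);
        bogus = true;
        break;
    }

    /* 2) more conservative "last sane" values */
    newFrozenXid = GetOldestXmin(NULL, true);
    newMinMulti = GetOldestMultiXactId();
    lastSaneFrozenXid = ReadNewTransactionId();
    lastSaneMinMulti = ReadNextMultiXactId();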

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> 1) How about adding an elog(WARNING, "relation %u has invalid freeze
> limits", ...) to vac_update_datfrozenxid()? Otherwise it's a bit hard
> to understand what's going on. I wouldn't want to invest too much effort
> into the precision of the message, but some log bleat seems justified.

I considered that but wasn't really convinced it was a good idea.
You'd be getting a lot of log spam on a 9.3 installation that had
been affected by the pg_upgrade bug, until all the tables got cleaned
up (and we're intentionally not forcing that to happen quickly).

Anybody else have an opinion?

> 2) From a pedantic POV lastSane* should be initialized to
> ReadNewTransactionId()/ReadNextMultiXactId(). On a busy system we could
> otherwise consider concurrently updated values as being too new if some
> long-running transaction finishes. Also GetOldestXmin() can sometimes go
> backwards...

If it's wrong then that's a pre-existing bug, but I don't think it's
wrong.

            regards, tom lane

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-07-21 12:20:56 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > 2) From a pedantic POV lastSane* should be initialized to
> > ReadNewTransactionId()/ReadNextMultiXactId(). On a busy system we could
> > otherwise consider concurrently updated values as being too new if some
> > long-running transaction finishes. Also GetOldestXmin() can sometimes go
> > backwards...
>
> If it's wrong then that's a pre-existing bug, but I don't think it's
> wrong.

I'm not wondering so much about vac_update_relstats(). There, indeed, the
calls aren't new and the worst that can happen is a slightly older
freeze limit. I'm more wondering about vac_update_datfrozenxid(). AFAICS
we could very well hit
    newFrozenXid = lastSaneFrozenXid = GetOldestXmin(NULL, true);
    newMinMulti = lastSaneMinMulti = GetOldestMultiXactId();
...
        /*
         * If things are working properly, no relation should have a
         * relfrozenxid or relminmxid that is "in the future".  However, such
         * cases have been known to arise due to bugs in pg_upgrade.  If we
         * see any entries that are "in the future", chicken out and don't do
         * anything.  This ensures we won't truncate clog before those
         * relations have been scanned and cleaned up.
         */
        if (TransactionIdPrecedes(lastSaneFrozenXid, classForm->relfrozenxid) ||
            MultiXactIdPrecedes(lastSaneMinMulti, classForm->relminmxid))
        {
            bogus = true;
            break;
        }

for a relation that has just been vacuumed by another backend. In a
database with a high number of pg_class entries and/or much bloat, it's
not that unrealistic that relfrozenxid/relminmxid have been updated since
the GetOldest* calls above. Since vac_update_relstats() writes them using
heap_inplace_update(), the window for that isn't particularly small. That
could potentially lead to never updating the database freeze limits,
right?
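
Spelled out as a timeline (illustration only; assumes e.g. a concurrent
VACUUM FREEZE, or GetOldestXmin() having gone backwards, so that the
concurrently written value really is newer than our snapshot):

    /*
     *  backend A: lastSaneFrozenXid = GetOldestXmin(NULL, true)  -- say 5000
     *  backend B: finishes vacuuming relation R; vac_update_relstats()
     *             writes R.relfrozenxid = 5100 via heap_inplace_update()
     *  backend A: seqscans pg_class, sees R.relfrozenxid = 5100, which is
     *             "in the future" w.r.t. lastSaneFrozenXid
     *             -> bogus = true, and datfrozenxid is never advanced
     */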

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> I'm not wondering so much about vac_update_relstats(). There indeed the
> calls aren't new and the worst that can happen is a slightly older
> freeze limit. I'm more wondering about vac_update_datfrozenxid(). Afaics
> we very well could hit
>     newFrozenXid = lastSaneFrozenXid = GetOldestXmin(NULL, true);
>     newMinMulti = lastSaneMinMulti = GetOldestMultiXactId();
> for a relation that has just been vacuumed by another backend.

Hmm ... I see.  The issue is not what the computed minimum datfrozenxid
etc should be; it's right to err in the backwards direction there.
It's whether we want to declare that the calculation is bogus and abandon
truncation if another session manages to sneak in a very-new relfrozenxid.
Yeah, you're right, we need to be conservative about doing that.  I'd
wanted to avoid extra calls here but I guess we have to do them after all.
Will fix.

            regards, tom lane

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-07-21 12:43:24 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > I'm not wondering so much about vac_update_relstats(). There indeed the
> > calls aren't new and the worst that can happen is a slightly older
> > freeze limit. I'm more wondering about vac_update_datfrozenxid(). Afaics
> > we very well could hit
> >     newFrozenXid = lastSaneFrozenXid = GetOldestXmin(NULL, true);
> >     newMinMulti = lastSaneMinMulti = GetOldestMultiXactId();
> > for a relation that has just been vacuumed by another backend.
>
> Hmm ... I see.  The issue is not what the computed minimum datfrozenxid
> etc should be; it's right to err in the backwards direction there.
> It's whether we want to declare that the calculation is bogus and abandon
> truncation if another session manages to sneak in a very-new relfrozenxid.
> Yeah, you're right, we need to be conservative about doing that.  I'd
> wanted to avoid extra calls here but I guess we have to do them after all.
> Will fix.

I wonder if GetTopTransactionId()/MultiXactIdSetOldestMember() and using
lastSane* = ReadNew* isn't sufficient. After the xid assignment
concurrent GetOldest* can't go below the ReadNew* values anymore, right?
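
I.e. roughly this, at the top of vac_update_datfrozenxid() (sketch only,
untested):

    /*
     * Pin our own xid and multixact "oldest member" first, so that values
     * written concurrently by other backends can no longer look "in the
     * future" relative to what we read next.
     */
    (void) GetTopTransactionId();
    MultiXactIdSetOldestMember();

    newFrozenXid = GetOldestXmin(NULL, true);
    newMinMulti = GetOldestMultiXactId();
    lastSaneFrozenXid = ReadNewTransactionId();
    lastSaneMinMulti = ReadNextMultiXactId();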

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-07-21 12:43:24 -0400, Tom Lane wrote:
>> Will fix.

> I wonder if GetTopTransactionId()/MultiXactIdSetOldestMember() and using
> lastSane* = ReadNew* isn't sufficient. After the xid assignment
> concurrent GetOldest* can't go below the ReadNew* values anymore, right?

I don't see any point in being picky about it.  What we want is to reject
values that are conclusively bogus; things that are just close to bogus
are not so interesting.

            regards, tom lane

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-07-21 13:03:21 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-07-21 12:43:24 -0400, Tom Lane wrote:
> >> Will fix.
>
> > I wonder if GetTopTransactionId()/MultiXactIdSetOldestMember() and using
> > lastSane* = ReadNew* isn't sufficient. After the xid assignment
> > concurrent GetOldest* can't go below the ReadNew* values anymore, right?
>
> I don't see any point in being picky about it.  What we want is to reject
> values that are conclusively bogus; things that are just close to bogus
> are not so interesting.

It'd just have been about optimizing away repeated calls to ReadNew*...

Thanks.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-07-21 13:03:21 -0400, Tom Lane wrote:
>> I don't see any point in being picky about it.  What we want is to reject
>> values that are conclusively bogus; things that are just close to bogus
>> are not so interesting.

> It'd just have been about optimizing away repeated calls to ReadNew*...

Oh.  I think the existing code is probably efficient enough about that.
vac_update_datfrozenxid() only does it once anyway, and the calls in
vac_update_relstats() are arranged so that we don't expect them to be hit
at all in normal cases.
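
For reference, this is the shape of the test in the patch upthread; thanks
to the short-circuiting && and ||, the GetOldestXmin() call is only
evaluated when the stored relfrozenxid neither matches nor precedes the
new cutoff, which shouldn't happen in normal operation:

    if (TransactionIdIsNormal(frozenxid) &&
        pgcform->relfrozenxid != frozenxid &&
        (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) ||
         TransactionIdPrecedes(GetOldestXmin(NULL, true),
                               pgcform->relfrozenxid)))
    {
        pgcform->relfrozenxid = frozenxid;
        dirty = true;
    }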

            regards, tom lane

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Stuart Bishop
Date:
On 21 July 2014 00:37, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I'm a bit inclined to not say anything about fix procedures in the release
> notes, because I'm not sure that this is a problem in the field.  If
> anybody did have a wraparound they'd be getting bleats from VACUUM, and no
> one has reported any such thing that I've heard.

I hit the issue, but Andres helped on IRC (and so started this
thread). Yes, VACUUM failing on a production server after the 9.1->9.3
upgrade was the problem. I'd assumed there was database corruption and
I had some unreadable blocks, and probably would have just rebuilt the
table if #postgresql hadn't been so helpful.


--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-07-22 16:08:37 +0700, Stuart Bishop wrote:
> On 21 July 2014 00:37, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > I'm a bit inclined to not say anything about fix procedures in the release
> > notes, because I'm not sure that this is a problem in the field.  If
> > anybody did have a wraparound they'd be getting bleats from VACUUM, and no
> > one has reported any such thing that I've heard.
>
> I hit the issue, but Andres helped on IRC (and so started this
> thread). Yes, VACUUM failing on a production server after the 9.1->9.3
> upgrade was the problem. I'd assumed there was database corruption and
> I had some unreadable blocks, and probably would have just rebuilt the
> table if #postgresql hadn't been so helpful.

This is actually a second problem that came up during the investigation
of the one you had - with different symptoms and causes. IIRC it's
not what you've experienced.

If you look at
http://www.postgresql.org/docs/devel/static/release-9-3-5.html your
problem should be described.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Sun, Jul 20, 2014 at 11:19:30PM -0400, Tom Lane wrote:
> I wrote:
> > Here's a draft patch for this.  I think this will fix all cases where
> > the "1" minmxid inserted by previous pg_upgrade versions is actually
> > in the future at the time we run VACUUM.  We would still be at risk if
> > it had been in the future when pg_upgrade ran but no longer is now,
> > since that would mean there could be non-lock-only mxids on disk that
> > are older than "1".  However, for the reasons discussed upthread, it
> > seems fairly unlikely to me that people would actually get burnt in
> > practice, so I'm satisfied with doing this much and no more.
>
> Ah, belay that: as coded, that would allow truncation of clog/multixact
> as soon as any one relation in any one database had sane
> frozenxid/minmxid.  If we want to have any pretense of being safe, we have
> to postpone truncation until *all* relations have been vacuumed.  So more
> like the attached, instead.

Should we conclude that the multi-xact code is hopelessly complex and
needs a redesign?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Andres Freund
Date:
On 2014-07-22 09:09:31 -0400, Bruce Momjian wrote:
> On Sun, Jul 20, 2014 at 11:19:30PM -0400, Tom Lane wrote:
> > I wrote:
> > > Here's a draft patch for this.  I think this will fix all cases where
> > > the "1" minmxid inserted by previous pg_upgrade versions is actually
> > > in the future at the time we run VACUUM.  We would still be at risk if
> > > it had been in the future when pg_upgrade ran but no longer is now,
> > > since that would mean there could be non-lock-only mxids on disk that
> > > are older than "1".  However, for the reasons discussed upthread, it
> > > seems fairly unlikely to me that people would actually get burnt in
> > > practice, so I'm satisfied with doing this much and no more.
> >
> > Ah, belay that: as coded, that would allow truncation of clog/multixact
> > as soon as any one relation in any one database had sane
> > frozenxid/minmxid.  If we want to have any pretense of being safe, we have
> > to postpone truncation until *all* relations have been vacuumed.  So more
> > like the attached, instead.
>
> Should we conclude that the multi-xact code is hopelessly complex and
> needs a redesign?

That might be true, but I don't see the above as evidence of it. It's a
nontrivial workaround for a bug; it's not surprising that it needs a
couple of iterations to make sense. Without the pg_upgrade bug there'd be
no need to make those adjustments.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts

From
Bruce Momjian
Date:
On Tue, Jul 22, 2014 at 04:14:01PM +0200, Andres Freund wrote:
> On 2014-07-22 09:09:31 -0400, Bruce Momjian wrote:
> > On Sun, Jul 20, 2014 at 11:19:30PM -0400, Tom Lane wrote:
> > > I wrote:
> > > > Here's a draft patch for this.  I think this will fix all cases where
> > > > the "1" minmxid inserted by previous pg_upgrade versions is actually
> > > > in the future at the time we run VACUUM.  We would still be at risk if
> > > > it had been in the future when pg_upgrade ran but no longer is now,
> > > > since that would mean there could be non-lock-only mxids on disk that
> > > > are older than "1".  However, for the reasons discussed upthread, it
> > > > seems fairly unlikely to me that people would actually get burnt in
> > > > practice, so I'm satisfied with doing this much and no more.
> > >
> > > Ah, belay that: as coded, that would allow truncation of clog/multixact
> > > as soon as any one relation in any one database had sane
> > > frozenxid/minmxid.  If we want to have any pretense of being safe, we have
> > > to postpone truncation until *all* relations have been vacuumed.  So more
> > > like the attached, instead.
> >
> > Should we conclude that the multi-xact code is hopelessly complex and
> > needs a redesign?
>
> That might be true, but I don't see the above as evidence of it. It's a
> nontrivial workaround for a bug; it's not surprising that it needs a
> couple iterations to make sense. Without the pg_upgrade bug there'd be
> no need to make those adjustments.

Well, my point is that even Tom was confused about how things should be
handled.  Anyway, it seems no one thinks a redesign is needed, but I had
to ask.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +