Thread: Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
Hi,

Moving discussion to -hackers.  Tom, I think you worked most with this
code, your input would be appreciated.

Original discussion is around:
http://archives.postgresql.org/message-id/20180524211311.tnswfnjwnii54htx%40alvherre.pgsql

On 2018-05-24 17:13:11 -0400, Alvaro Herrera wrote:
> On 2018-May-24, Andres Freund wrote:
> > Then there's also:
> > http://archives.postgresql.org/message-id/1527193504642.36340%40amazon.com
> 
> ah, so deleting the relcache file makes the problem to go away?  That's
> definitely pretty strange.  I see no reason for the value in relcache to
> become out of step with the catalogued value in the same database ... I
> don't think we transmit in any way values of one database to another.

I can reproduce the issue. As far as I can tell we just don't ever
actually update nailed relcache entries in the normal course, leaving
the "physical address" aside.  VACUUM will, via
vac_update_relstats() -> heap_inplace_update() -> CacheInvalidateHeapTuple(),
send out an invalidation. But invalidation, in my case another session,
will essentially ignore most of that due to:

static void
RelationClearRelation(Relation relation, bool rebuild)
...
    /*
     * Never, never ever blow away a nailed-in system relation, because we'd
     * be unable to recover.  However, we must redo RelationInitPhysicalAddr
     * in case it is a mapped relation whose mapping changed.
     *
     * If it's a nailed-but-not-mapped index, then we need to re-read the
     * pg_class row to see if its relfilenode changed. We do that immediately
     * if we're inside a valid transaction and the relation is open (not
     * counting the nailed refcount).  Otherwise just mark the entry as
     * possibly invalid, and it'll be fixed when next opened.
     */
    if (relation->rd_isnailed)
    {
        RelationInitPhysicalAddr(relation);

        if (relation->rd_rel->relkind == RELKIND_INDEX ||
            relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
        {
            relation->rd_isvalid = false;    /* needs to be revalidated */
            if (relation->rd_refcnt > 1 && IsTransactionState())
                RelationReloadIndexInfo(relation);
        }
        return;
    }

Which basically means that once running we'll never update the relcache
data for nailed entries.  That's unproblematic for most relcache fields,
but not for things like RelationData->rd_rel->relfrozenxid / relminmxid.

This'll e.g. lead to lazy_vacuum_rel() wrongly not using aggressive
vacuums despite being required. And it'll lead, triggering this thread,
to wrong errors being raised during vacuum because relfrozenxid just is
some random value from the past.  I suspect this might also be
co-responsible for a bunch of planning issues for queries involving the
catalog, because the planner will use wrong relcache data until the next
time the init file is thrown away?

This looks like a very longstanding bug to me.  I'm not yet quite sure
what the best way to deal with this is.  I suspect we might get away
with just looking up a new version of the pg_class tuple and copying
rd_rel over?

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Moving discussion to -hackers.  Tom, I think you worked most with this
> code, your input would be appreciated.

Yeah, the assumption in the relcache is that the only part of a nailed
catalog's relcache entry that really needs to be updated intrasession is
the relfilenode mapping.  For nailed indexes, we allow updating of some
additional fields, and I guess what has to happen here is that we teach
the code to update some additional fields for nailed tables too.

            regards, tom lane


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
On 2018-05-25 17:47:37 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Moving discussion to -hackers.  Tom, I think you worked most with this
> > code, your input would be appreciated.
> 
> Yeah, the assumption in the relcache is that the only part of a nailed
> catalog's relcache entry that really needs to be updated intrasession is
> the relfilenode mapping.

Paging through the changes to relcache.c and vacuum[lazy].c it looks to
me like that hasn't been true in a long time, right?


> For nailed indexes, we allow updating of some additional fields, and I
> guess what has to happen here is that we teach the code to update some
> additional fields for nailed tables too.

Yea, it seems like we could just get a new version of the pg_class tuple
if in the right state, and memcpy() it into place. Not sure if there's
any other issues...


BTW, and I guess this mostly goes to Alvaro, I don't understand why that
code accepts relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX?
That seems like something we'll hopefully never support.

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
On 2018-05-25 15:05:31 -0700, Andres Freund wrote:
> On 2018-05-25 17:47:37 -0400, Tom Lane wrote:
> > For nailed indexes, we allow updating of some additional fields, and I
> > guess what has to happen here is that we teach the code to update some
> > additional fields for nailed tables too.
> 
> Yea, it seems like we could just get a new version of the pg_class tuple
> if in the right state, and memcpy() it into place. Not sure if there's
> any other issues...

That part isn't too hard. I've a patch that appears to address the
issue, and isn't *too* ugly.

We don't really have a way to force .init file removal / update for
shared relations however. Otherwise we'll just continue to read old data
from .init files at startup. And there'll commonly not be any
outstanding invalidation.  Thus it appears to me that we need to extend
RelcacheInitFileInval to also support the shared file.  That's WAL
logged, but it looks like we can just add flag like
XACT_COMPLETION_UPDATE_RELCACHE_FILE without breaking the WAL format.

Does anybody see a way to not have to remove the .init file?

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
On 2018-05-26 13:45:06 -0700, Andres Freund wrote:
> On 2018-05-25 15:05:31 -0700, Andres Freund wrote:
> > On 2018-05-25 17:47:37 -0400, Tom Lane wrote:
> > > For nailed indexes, we allow updating of some additional fields, and I
> > > guess what has to happen here is that we teach the code to update some
> > > additional fields for nailed tables too.
> > 
> > Yea, it seems like we could just get a new version of the pg_class tuple
> > if in the right state, and memcpy() it into place. Not sure if there's
> > any other issues...
> 
> That part isn't too hard. I've a patch that appears to address the
> issue, and isn't *too* ugly.
> 
> We don't really have a way to force .init file removal / update for
> shared relations however. Otherwise we'll just continue to read old data
> from .init files at startup. And there'll commonly not be any
> outstanding invalidation.  Thus it appears to me that we need to extend
> RelcacheInitFileInval to also support the shared file.  That's WAL
> logged, but it looks like we can just add flag like
> XACT_COMPLETION_UPDATE_RELCACHE_FILE without breaking the WAL format.
> 
> Does anybody see a way to not have to remove the .init file?

Just to be clear: We already remove the non-shared relcache init file
when a non-shared table in it is changed . Which I presume is the reason
this issue hasn't bitten us in a much bigger way. While the lack of
proper invalidations means that already running sessions will see the
wrong values and make wrong decisions, the fact that the non-shared file
will regularly be removed has reduced the impact quite a bit.

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
"Nasby, Jim"
Date:
On May 26, 2018, at 1:45 PM, Andres Freund <andres@anarazel.de> wrote:
> Does anybody see a way to not have to remove the .init file?

How about only keeping the critical info for being able to find relations in the .init files, and then fully populate
thecache by doing a normal lookup? Since there’s multiple entries needed to bootstrap things I guess there’d need to be
aflag in relcache to indicate that an existing entry wasn’t fully formed. 

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:

On May 27, 2018 9:39:49 AM PDT, "Nasby, Jim" <nasbyj@amazon.com> wrote:
>On May 26, 2018, at 1:45 PM, Andres Freund <andres@anarazel.de> wrote:
>> Does anybody see a way to not have to remove the .init file?
>
>How about only keeping the critical info for being able to find
>relations in the .init files, and then fully populate the cache by
>doing a normal lookup? Since there’s multiple entries needed to
>bootstrap things I guess there’d need to be a flag in relcache to
>indicate that an existing entry wasn’t fully formed.

Then the cache wouldn't have any benefits, no? It's been a while, but last time I checked it does make quite a
measurableperformance difference in a new backend. 

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On May 27, 2018 9:39:49 AM PDT, "Nasby, Jim" <nasbyj@amazon.com> wrote:
>> How about only keeping the critical info for being able to find
>> relations in the .init files, and then fully populate the cache by
>> doing a normal lookup?

> Then the cache wouldn't have any benefits, no? It's been a while, but last time I checked it does make quite a
measurableperformance difference in a new backend. 

Yeah, we don't want to lose the performance benefit.  But I don't think
there's any need for special magic here: we just have to accept the fact
that there's a need to flush that cache sometimes.  In normal use it
shouldn't happen often enough to be a performance problem.

FWIW, I'm not on board with "memcpy the whole row".  I think the right
thing is more like what we do in RelationReloadIndexInfo, ie copy over
the specific fields that we expect to be mutable.

            regards, tom lane


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
Hi,

On 2018-05-27 13:22:21 -0400, Tom Lane wrote:
> But I don't think there's any need for special magic here: we just
> have to accept the fact that there's a need to flush that cache
> sometimes.  In normal use it shouldn't happen often enough to be a
> performance problem.

Yea, it's not that problematic. We already remove the local init
file. I started out trying to write a version of invalidation that also
WAL logs shared inval, but that turns out to be hard to do without
breaking compatibilty.  So I think we should just always unlink the
shared one - that seems to work well.


> FWIW, I'm not on board with "memcpy the whole row".  I think the right
> thing is more like what we do in RelationReloadIndexInfo, ie copy over
> the specific fields that we expect to be mutable.

But that's what RelationReloadIndexInfo() etc do?
    relp = (Form_pg_class) GETSTRUCT(pg_class_tuple);
    memcpy(relation->rd_rel, relp, CLASS_TUPLE_SIZE);
I don't think we need to modify anything outside of rd_rel at this
point?

I've a patch that seems to work, that mostly needs some comment
polishing.

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
"Nishant, Fnu"
Date:
Hi,
We were working on this issue and thinking if we could actually make pg_class(rd_rel) part of recache entry
upgradable.
To achieve this we can allocate Form_pg_class structures (for shared relations… a small number) on shared memory.
We do not need global pg_internal_init file as new backend during boot up will be set to point at already stored
Form_pg_classstructure.
 

Thanks,
Nishant

On 5/27/18, 1:01 PM, "Andres Freund" <andres@anarazel.de> wrote:

    Hi,
    
    On 2018-05-27 13:22:21 -0400, Tom Lane wrote:
    > But I don't think there's any need for special magic here: we just
    > have to accept the fact that there's a need to flush that cache
    > sometimes.  In normal use it shouldn't happen often enough to be a
    > performance problem.
    
    Yea, it's not that problematic. We already remove the local init
    file. I started out trying to write a version of invalidation that also
    WAL logs shared inval, but that turns out to be hard to do without
    breaking compatibilty.  So I think we should just always unlink the
    shared one - that seems to work well.
    
    
    > FWIW, I'm not on board with "memcpy the whole row".  I think the right
    > thing is more like what we do in RelationReloadIndexInfo, ie copy over
    > the specific fields that we expect to be mutable.
    
    But that's what RelationReloadIndexInfo() etc do?
        relp = (Form_pg_class) GETSTRUCT(pg_class_tuple);
        memcpy(relation->rd_rel, relp, CLASS_TUPLE_SIZE);
    I don't think we need to modify anything outside of rd_rel at this
    point?
    
    I've a patch that seems to work, that mostly needs some comment
    polishing.
    
    Greetings,
    
    Andres Freund
    
    


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
Hi,

On 2018-05-27 13:00:06 -0700, Andres Freund wrote:
> I've a patch that seems to work, that mostly needs some comment
> polishing.

Attached is what I currently have. Still needs some more work, but I
think it's more than good enough to review the approach.  Basically the
approach consists out of two changes:

1) Send init file removals for shared nailed relations as well.

   This fixes that the shared init file contains arbitrarily outdated
   information for relfrozenxid etc. Leading to e.g. the pg_authid
   errors we've seen in some recent threads.  Only applies to
   new connections.

2) Reread RelationData->rd_rel for nailed relations when invalidated.

   This ensures that already built relcache entries for nailed relations
   are updated. Currently they never are.  This currently doesn't cause
   *that* frequently an issue for !shared entries, because for those the
   init file gets zapped regularly, and autovacuum workers usually don't
   live that long.  But it's still a significant correctness issue for
   both shared an non shared relations.

FWIW, I wonder if this isn't critical enough to make us consider having
a point release earlier..

Greetings,

Andres Freund

Attachment

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
Hi,

(please don't top post)

On 2018-05-28 15:07:52 +0000, Nishant, Fnu wrote:
> We were working on this issue and thinking if we could actually make
> pg_class(rd_rel) part of recache entry upgradable.

Right, that's necessary. See the patch I just sent.


> To achieve this we can allocate Form_pg_class structures (for shared
> relations… a small number) on shared memory.

But why would this be necessary / a good idea? Even if we decided it
were, it seems like it'd end up being quite invasive.  But I doubt it's
a good plan, because relcache entries want / need to be updated
differently in the transaction that does the changes (as it needs to see
the effect of catalog changes before commit) than other sessions (which
only should see them after commit).

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
"Nishant, Fnu"
Date:
Hi,

    > To achieve this we can allocate Form_pg_class structures (for shared
    > relations… a small number) on shared memory.
    
    But why would this be necessary / a good idea? Even if we decided it
    were, it seems like it'd end up being quite invasive.  But I doubt it's
    a good plan, because relcache entries want / need to be updated
    differently in the transaction that does the changes (as it needs to see
    the effect of catalog changes before commit) than other sessions (which
    only should see them after commit).

It will be a good idea as, we can avoid maintaining file (creation/deletion/updation) for period of engine running and
wedo not need to invalidate other backend cache.
 
For transaction(which change catalog), we can allocate a new private Form_pg_class structure and do operations on it
andupdate shared structure post commit.
 
Routine roughly will be-
1) A backend having a relcache entry for a shared rel where rd_rel part points to shared memory structure.
Rel->rd_rel is storing a shared memory pointer.
2) Wants to update the entry...allocate a new private structure and memcpy the shared content to this new memory and
pointrd_rel to private memory
 
Rel->rd_rel is storing a pointer to process specific structure.
3) Transaction committed and we copy the private memory content to shared memory area and point rd_rel again to shared
memory.
4) free private memory.
5) Other backends do not do any invalidation but still get the latest updated values.

Why is this good idea? 
Lets take an example (assuming we have 1000 postgres backends running).
With shared memory scheme-
    Operation wise, for a transaction, we allocate/free once (private memory allocation) and memcpy data to and fro
(fromshared to private and back to shared)...
 
    Overall memory footprint 1 shared copy and 1 private only when updating.
    No file creation/deletion/updation.
With current file scheme-
    Operation wise, for a transaction, we use private cache but we need to invalidate 1000 other caches( which will be
atleast1000 memcpy and allocate/free) and may involve reading back in page of pg_class.
 
    Overall memory footprint 1000 private copies.
    We have to create/delete/update init file and synchronize around it.

Having said that we may not worry about transaction for updating all values...(I think relfrozenxid can be updated by a
CASoperation ...still thinking on it).
 

-Nishant



Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
On 2018-05-29 18:06:12 +0000, Nishant, Fnu wrote:
> Hi,
> 
>     > To achieve this we can allocate Form_pg_class structures (for shared
>     > relations… a small number) on shared memory.
>     
>     But why would this be necessary / a good idea? Even if we decided it
>     were, it seems like it'd end up being quite invasive.  But I doubt it's
>     a good plan, because relcache entries want / need to be updated
>     differently in the transaction that does the changes (as it needs to see
>     the effect of catalog changes before commit) than other sessions (which
>     only should see them after commit).
> 
> It will be a good idea as, we can avoid maintaining file
> (creation/deletion/updation) for period of engine running and we do
> not need to invalidate other backend cache.

a) That's a major change. Shouldn't be considered for a bugfix.
b) This is going to have locking issues / lock contention.


> > Why is this good idea? 
> Lets take an example (assuming we have 1000 postgres backends running).
> With shared memory scheme-
>     Operation wise, for a transaction, we allocate/free once (private memory allocation) and memcpy data to and fro
(fromshared to private and back to shared)...
 
>     Overall memory footprint 1 shared copy and 1 private only when updating.
>     No file creation/deletion/updation.

I don't buy that nailed relations are a meaningful part of that
problem. They hardly ever change.  And a shared cache is a much bigger
issues.

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Alvaro Herrera
Date:
I added an Assert(DatabasePath != NULL) to
RelationCacheInitFilePreInvalidate() and didn't see a single crash when
running the tests.  I thought that adding a "VACUUM FREEZE pg_class"
in the recovery tests (where there is a standby) ought to do it, but it
doesn't.  What's the deal there?

Here are some proposed changes.  Some of these comment edits are WIP :-)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Alvaro Herrera
Date:
On 2018-May-29, Alvaro Herrera wrote:

> I added an Assert(DatabasePath != NULL) to
> RelationCacheInitFilePreInvalidate() and didn't see a single crash when
> running the tests.  I thought that adding a "VACUUM FREEZE pg_class"
> in the recovery tests (where there is a standby) ought to do it, but it
> doesn't.  What's the deal there?

Sorry, that was dumb -- the assert obviously is hit if I remove the code
to set DatabasePath beforehand :-)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
Hi,

On 2018-05-29 19:14:51 -0400, Alvaro Herrera wrote:
> I added an Assert(DatabasePath != NULL) to
> RelationCacheInitFilePreInvalidate() and didn't see a single crash when
> running the tests.  I thought that adding a "VACUUM FREEZE pg_class"
> in the recovery tests (where there is a standby) ought to do it, but it
> doesn't.  What's the deal there?
> 
> Here are some proposed changes.  Some of these comment edits are WIP :-)

I'm a bit confused by these changes - there seems to be some that look
like a borked diff?  And a number of others look pretty unrelated?

> -- 
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

> diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
> index 0d6100fb08..8a8620943f 100644
> --- a/src/backend/utils/cache/inval.c
> +++ b/src/backend/utils/cache/inval.c
> @@ -872,6 +872,8 @@ ProcessCommittedInvalidationMessages(SharedInvalidationMessage *msgs,
>                                       int nmsgs, bool RelcacheInitFileInval,
>                                       Oid dbid, Oid tsid)
>  {
> +    Assert(InRecovery);
> +

Idk, seems unrelated.

>      if (nmsgs <= 0)
>          return;
>  
> @@ -884,12 +886,13 @@ ProcessCommittedInvalidationMessages(SharedInvalidationMessage *msgs,
>               dbid);
>  
>          /*
> -         * RelationCacheInitFilePreInvalidate, when the invalidation message
> -         * is for a specific database, requires DatabasePath to be set, but we
> -         * should not use SetDatabasePath during recovery, since it is
> -         * intended to be used only once by normal backends.  Hence, a quick
> -         * hack: set DatabasePath directly then unset after use.
> +         * When the invalidation message is for a specific database,
> +         * RelationCacheInitFilePreInvalidate requires DatabasePath to be set,
> +         * but we're not allowed to use SetDatabasePath during recovery (since
> +         * it is intended to be used by normal backends).  Hence, a quick hack:
> +         * set DatabasePath directly then unset after use.
>           */
> +        Assert(!DatabasePath);    /* don't clobber an existing value */
>          if (OidIsValid(dbid))
>              DatabasePath = GetDatabasePath(dbid, tsid);

Same.


> diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
> index aa4427724d..71b2212cbd 100644
> --- a/src/backend/utils/cache/relcache.c
> +++ b/src/backend/utils/cache/relcache.c
> @@ -1934,6 +1934,11 @@ RelationIdGetRelation(Oid relationId)
>                  RelationClearRelation(rd, true);
>  
>              /*
> +             * Normal entries are valid by now -- except nailed ones when
> +             * loaded before relcache initialization.  There isn't enough
> +             * infrastructure yet to do pg_class lookups, but we need their
> +             * rd_rel entries to be updated, so we let these through.
> +             */
>               * Normally entries need to be valid here, but before the relcache
>               * has been initialized, not enough infrastructure exists to
>               * perform pg_class lookups. The structure of such entries doesn't
> @@ -2346,8 +2351,7 @@ RelationClearRelation(Relation relation, bool rebuild)
>      RelationCloseSmgr(relation);

This sure looks like it's a syntax error? So I guess you might not have
staged the removal ports of the diff?


>      /*
> -     * Treat nailed-in system relations separately, they always need to be
> -     * accessible, so we can't blow them away.
> +     * We cannot blow away nailed-in relations, so treat them especially.
>       */

The former seems just as accurate, and is basically just the already
existing comment?


>      if (relation->rd_isnailed)
>      {
> @@ -5942,7 +5946,8 @@ write_relcache_init_file(bool shared)
>       * wrote out was up-to-date.)
>       *
>       * This mustn't run concurrently with the code that unlinks an init file
> -     * and sends SI messages, so grab a serialization lock for the duration.
> +     * and sends SI messages (see RelationCacheInitFilePreInvalidate), so grab
> +     * a serialization lock for the duration.
>       */
>      LWLockAcquire(RelCacheInitLock, LW_EXCLUSIVE);

Unrelated?


> @@ -6061,6 +6066,10 @@ RelationHasUnloggedIndex(Relation rel)
>   * changed one or more of the relation cache entries that are kept in the
>   * local init file.
>   *
> + * When DatabasePath is set, both the init file for that database and the
> + * shared (global) init files are to be removed; otherwise only the latter is.
> + * This is useful during recovery (XXX really?)
> + *

I'm confused?


>   * To be safe against concurrent inspection or rewriting of the init file,
>   * we must take RelCacheInitLock, then remove the old init file, then send
>   * the SI messages that include relcache inval for such relations, and then
> @@ -6180,9 +6189,9 @@ unlink_initfile(const char *initfilename, int elevel)
>  {
>      if (unlink(initfilename) < 0)
>      {
> -        /* It might not be there, but log any error other than ENOENT */
> +        /* It might not be there, but report any error other than ENOENT */
>          if (errno != ENOENT)
> -            ereport(ERROR,
> +            ereport(elevel,
>                      (errcode_for_file_access(),
>                       errmsg("could not remove cache file \"%s\": %m",
>                              initfilename)));

Included the elevel inclusion. Can't parse the difference between log
and report here?

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
Hi,

On 2018-05-28 12:52:06 -0700, Andres Freund wrote:
> Hi,
> 
> On 2018-05-27 13:00:06 -0700, Andres Freund wrote:
> > I've a patch that seems to work, that mostly needs some comment
> > polishing.
> 
> Attached is what I currently have. Still needs some more work, but I
> think it's more than good enough to review the approach.  Basically the
> approach consists out of two changes:
> 
> 1) Send init file removals for shared nailed relations as well.
> 
>    This fixes that the shared init file contains arbitrarily outdated
>    information for relfrozenxid etc. Leading to e.g. the pg_authid
>    errors we've seen in some recent threads.  Only applies to
>    new connections.
> 
> 2) Reread RelationData->rd_rel for nailed relations when invalidated.
> 
>    This ensures that already built relcache entries for nailed relations
>    are updated. Currently they never are.  This currently doesn't cause
>    *that* frequently an issue for !shared entries, because for those the
>    init file gets zapped regularly, and autovacuum workers usually don't
>    live that long.  But it's still a significant correctness issue for
>    both shared an non shared relations.

Here's a more polished v2 version of the patch.  I, locally, did the
work to backpatch the change. Besides trivialities there's two
nontrivial changes:

- In earlier versions there's no global invalidations. I've inquired and
  complained about what exactly they're for in
  http://archives.postgresql.org/message-id/20180611231634.w2rgtlzxaw4loefk%40alap3.anarazel.de

  In earlier branches we just don't do the global thing. That seems
  unproblematic to me.

- The bigger issue is that in 9.3, and just there
  https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=8de3e410faa06ab20ec1aa6d0abb0a2c040261ba
  does not yet exist.

  That means back then we performed reloads even outside a
  transaction. I don't feel confident about invoking additional catalog
  reloads in the new situations, so I kept the IsTransactionState()
  checks in RelationReloadNailed().  That seems less risky, but possibly
  somebody wants to argue the other way round?

There's some minor other conflicts, but they're all pretty obvious.

I plan to go over the change again tomorrow, and then push. Unless
somebody has comments before then, obviously.


> FWIW, I wonder if this isn't critical enough to make us consider having
> a point release earlier..

Still think this is something we should seriously consider.

- Andres

Attachment

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
Hi,

On 2018-06-11 17:39:14 -0700, Andres Freund wrote:
> I plan to go over the change again tomorrow, and then push. Unless
> somebody has comments before then, obviously.

Done.

- Andres


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Jeremy Finzel
Date:


On Fri, May 25, 2018 at 3:37 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,

Moving discussion to -hackers.  Tom, I think you worked most with this
code, your input would be appreciated.

Original discussion is around:
http://archives.postgresql.org/message-id/20180524211311.tnswfnjwnii54htx%40alvherre.pgsql

On 2018-05-24 17:13:11 -0400, Alvaro Herrera wrote:
> On 2018-May-24, Andres Freund wrote:
> > Then there's also:
> > http://archives.postgresql.org/message-id/1527193504642.36340%40amazon.com
>
> ah, so deleting the relcache file makes the problem to go away?  That's
> definitely pretty strange.  I see no reason for the value in relcache to
> become out of step with the catalogued value in the same database ... I
> don't think we transmit in any way values of one database to another.

I can reproduce the issue. As far as I can tell we just don't ever
actually update nailed relcache entries in the normal course, leaving
the "physical address" aside.  VACUUM will, via
vac_update_relstats() -> heap_inplace_update() -> CacheInvalidateHeapTuple(),
send out an invalidation. But invalidation, in my case another session,
will essentially ignore most of that due to:

static void
RelationClearRelation(Relation relation, bool rebuild)
...
        /*
         * Never, never ever blow away a nailed-in system relation, because we'd
         * be unable to recover.  However, we must redo RelationInitPhysicalAddr
         * in case it is a mapped relation whose mapping changed.
         *
         * If it's a nailed-but-not-mapped index, then we need to re-read the
         * pg_class row to see if its relfilenode changed. We do that immediately
         * if we're inside a valid transaction and the relation is open (not
         * counting the nailed refcount).  Otherwise just mark the entry as
         * possibly invalid, and it'll be fixed when next opened.
         */
        if (relation->rd_isnailed)
        {
                RelationInitPhysicalAddr(relation);

                if (relation->rd_rel->relkind == RELKIND_INDEX ||
                        relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
                {
                        relation->rd_isvalid = false;   /* needs to be revalidated */
                        if (relation->rd_refcnt > 1 && IsTransactionState())
                                RelationReloadIndexInfo(relation);
                }
                return;
        }

Which basically means that once running we'll never update the relcache
data for nailed entries.  That's unproblematic for most relcache fields,
but not for things like RelationData->rd_rel->relfrozenxid / relminmxid.

This'll e.g. lead to lazy_vacuum_rel() wrongly not using aggressive
vacuums despite being required. And it'll lead, triggering this thread,
to wrong errors being raised during vacuum because relfrozenxid just is
some random value from the past.  I suspect this might also be
co-responsible for a bunch of planning issues for queries involving the
catalog, because the planner will use wrong relcache data until the next
time the init file is thrown away?

This looks like a very longstanding bug to me.  I'm not yet quite sure
what the best way to deal with this is.  I suspect we might get away
with just looking up a new version of the pg_class tuple and copying
rd_rel over?

Greetings,

Andres Freund

I have a question related to this - and specifically, preventing the error until we have a patch :).  We are encountering this error every few weeks on one very high transaction db, and have to restart to fix it.

If I read you correctly, the cache may never be invalidated for these catalogs even if I manually VACUUM them?  I was thinking if I routinely run VACUUM FREEZE on these tables in every database I might avoid the issue.  But given the cause of the issue, would that just make no difference and I will still hit the error eventually?

Thanks,
Jeremy

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Matheus de Oliveira
Date:
Hello friends.

On Tue, Jun 12, 2018 at 3:31 PM, Andres Freund <andres@anarazel.de> wrote:

On 2018-06-11 17:39:14 -0700, Andres Freund wrote:
> I plan to go over the change again tomorrow, and then push. Unless
> somebody has comments before then, obviously.

Done.


Sorry to bother about this, but do you have any plan to do the minor release before planned due to this bug?

There seem to have too many users affected by this. And worst is that many users may not have even noticed they have the problem, which can cause `age(datfrozenxid)` to keep increasing until reachs 2.1 billions and the system goes down.

In my case, I have a server that its `age(datfrozenxid)` is already at 1.9 billions, and I expect it to reach 2.1 billions in about 14 days. Fortunately, I have monitoring system over `age(datfrozenxid)`, that is why I found the issue in one of my servers.

I'm pondering what is the best option to avoid a forced shutdown of this server:
- should I just wait for a release (if it is soon, I would be fine)?
- build PG from the git version by myself?
- or is there a safer workaround to the problem? (not clear to me if deleting the `global/pg_internal.init` file is really the way to go, and the details, is it safe? Should I stop the server, delete, start?)

Best regards,
--
Matheus de Oliveira


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Jeremy Finzel
Date:


On Tue, Jun 19, 2018 at 8:26 AM Matheus de Oliveira <matioli.matheus@gmail.com> wrote:
Hello friends.

On Tue, Jun 12, 2018 at 3:31 PM, Andres Freund <andres@anarazel.de> wrote:

On 2018-06-11 17:39:14 -0700, Andres Freund wrote:
> I plan to go over the change again tomorrow, and then push. Unless
> somebody has comments before then, obviously.

Done.


Sorry to bother about this, but do you have any plan to do the minor release before planned due to this bug?

There seem to have too many users affected by this. And worst is that many users may not have even noticed they have the problem, which can cause `age(datfrozenxid)` to keep increasing until reachs 2.1 billions and the system goes down.

In my case, I have a server that its `age(datfrozenxid)` is already at 1.9 billions, and I expect it to reach 2.1 billions in about 14 days. Fortunately, I have monitoring system over `age(datfrozenxid)`, that is why I found the issue in one of my servers.

I'm pondering what is the best option to avoid a forced shutdown of this server:
- should I just wait for a release (if it is soon, I would be fine)?
- build PG from the git version by myself?
- or is there a safer workaround to the problem? (not clear to me if deleting the `global/pg_internal.init` file is really the way to go, and the details, is it safe? Should I stop the server, delete, start?)

Best regards,
--
Matheus de Oliveira


Restarting the database has fixed the error on these pg_catalog tables, allowing us to vacuum them and avoid wraparound. 

We first noticed a restart fixed the issue because SAN snapshots did not have the error. The only difference really being shared memory and nothing disk-level.

Jeremy 

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
Hi,

On 2018-06-19 10:26:26 -0300, Matheus de Oliveira wrote:
> Hello friends.
> 
> On Tue, Jun 12, 2018 at 3:31 PM, Andres Freund <andres@anarazel.de> wrote:
> 
> >
> > On 2018-06-11 17:39:14 -0700, Andres Freund wrote:
> > > I plan to go over the change again tomorrow, and then push. Unless
> > > somebody has comments before then, obviously.
> >
> > Done.
> >
> >
> Sorry to bother about this, but do you have any plan to do the minor
> release before planned due to this bug?

Unclear at this point.  There's been discussion about it, without coming
to a conclusion.


> I'm pondering what is the best option to avoid a forced shutdown of this
> server:
> - should I just wait for a release (if it is soon, I would be fine)?
> - build PG from the git version by myself?
> - or is there a safer workaround to the problem? (not clear to me if
> deleting the `global/pg_internal.init` file is really the way to go, and
> the details, is it safe? Should I stop the server, delete, start?)

You should first make sure it's actually this problem - which tables are
holding back the xmin horizon?  After that, yes, deleting the
global/pg_internal.init file is the way to go, and I can't think of a
case where it's problematic, even without stopping the server.

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
Hi,

On 2018-06-19 08:25:33 -0500, Jeremy Finzel wrote:
> I have a question related to this - and specifically, preventing the error
> until we have a patch :).

The patch has been committed to postgres, although no release has been
made including it yet.


> We are encountering this error every few weeks on one very high
> transaction db, and have to restart to fix it.

> If I read you correctly, the cache may never be invalidated for these
> catalogs even if I manually VACUUM them?  I was thinking if I routinely run
> VACUUM FREEZE on these tables in every database I might avoid the
> issue.

Correct, that won't help. In fact, it'll quite possibly make it worse.


> But given the cause of the issue, would that just make no difference and I
> will still hit the error eventually?

Yes. You might be able to get away with just removing the
pg_internal.init files before and after an explicit VACUUM on shared
system tables.

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Matheus de Oliveira
Date:


On Tue, Jun 19, 2018 at 12:53 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2018-06-19 10:26:26 -0300, Matheus de Oliveira wrote:
> Hello friends.
>
> On Tue, Jun 12, 2018 at 3:31 PM, Andres Freund <andres@anarazel.de> wrote:
>
> >
> > On 2018-06-11 17:39:14 -0700, Andres Freund wrote:
> > > I plan to go over the change again tomorrow, and then push. Unless
> > > somebody has comments before then, obviously.
> >
> > Done.
> >
> >
> Sorry to bother about this, but do you have any plan to do the minor
> release before planned due to this bug?

Unclear at this point.  There's been discussion about it, without coming
to a conclusion.


Ok. Thank you for the information.

I really hope you decide to release it soon, I'm very afraid about the users that have hit the bug but haven't noticed the issue.
 

> I'm pondering what is the best option to avoid a forced shutdown of this
> server:
> - should I just wait for a release (if it is soon, I would be fine)?
> - build PG from the git version by myself?
> - or is there a safer workaround to the problem? (not clear to me if
> deleting the `global/pg_internal.init` file is really the way to go, and
> the details, is it safe? Should I stop the server, delete, start?)

You should first make sure it's actually this problem - which tables are
holding back the xmin horizon?

How can I be sure? The tables are `pg_authid` and `pg_auth_members`, the following message is logged every minute (which I belive is due to `autovacuum_naptime`, which is using the default of 1 minute):

    ERROR:  found xmin 3460221635 from before relfrozenxid 1245633870
    CONTEXT:  automatic vacuum of table "template0.pg_catalog.pg_authid"
    ERROR:  found xmin 3460221635 from before relfrozenxid 1245633870
    CONTEXT:  automatic vacuum of table "template0.pg_catalog.pg_auth_members"

Do you need controldata or more info to validate it?
 
  After that, yes, deleting the
global/pg_internal.init file is the way to go, and I can't think of a
case where it's problematic, even without stopping the server.


I'll try that and get back to you if it worked or not. Thank you for the confirmation.

And thank you for all clarifications.

Best regards,
--
Matheus de Oliveira


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
Hi,

On 2018-06-19 17:05:48 -0300, Matheus de Oliveira wrote:
> > You should first make sure it's actually this problem - which tables are
> > holding back the xmin horizon?
> 
> 
> How can I be sure? The tables are `pg_authid` and `pg_auth_members`, the
> following message is logged every minute (which I belive is due to
> `autovacuum_naptime`, which is using the default of 1 minute):

Yes, that sounds like the issue. Basically just wanted the table names:

>     ERROR:  found xmin 3460221635 from before relfrozenxid 1245633870
>     CONTEXT:  automatic vacuum of table "template0.pg_catalog.pg_authid"

which indeed are shared relations.

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Sergey Burladyan
Date:
Andres Freund <andres@anarazel.de> writes:

> You should first make sure it's actually this problem - which tables are
> holding back the xmin horizon?  After that, yes, deleting the
> global/pg_internal.init file is the way to go, and I can't think of a
> case where it's problematic, even without stopping the server.

Thanks for clarification! I also have this problem, BTW, autovacuum does
not worked at all:
# select max(last_autovacuum) from pg_stat_user_tables;
              max
-------------------------------
 2018-06-06 00:48:47.813841+03

all it workers stoped with this messages:
ERROR: found xmin 982973690 from before relfrozenxid 2702858737
CONTEXT: automatic vacuum of table "avito_delta.pg_catalog.pg_authid"
ERROR: found xmin 982973690 from before relfrozenxid 2702858761
CONTEXT: automatic vacuum of table "avito_delta.pg_catalog.pg_auth_members"

and it does not try to vacuum other tables.

-- 
Sergey Burladyan


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
Hi,

On 2018-06-20 15:05:59 +0300, Sergey Burladyan wrote:
> Andres Freund <andres@anarazel.de> writes:
> 
> > You should first make sure it's actually this problem - which tables are
> > holding back the xmin horizon?  After that, yes, deleting the
> > global/pg_internal.init file is the way to go, and I can't think of a
> > case where it's problematic, even without stopping the server.
> 
> Thanks for clarification! I also have this problem, BTW, autovacuum does
> not worked at all:
> # select max(last_autovacuum) from pg_stat_user_tables;
>               max
> -------------------------------
>  2018-06-06 00:48:47.813841+03
> 
> all it workers stoped with this messages:
> ERROR: found xmin 982973690 from before relfrozenxid 2702858737
> CONTEXT: automatic vacuum of table "avito_delta.pg_catalog.pg_authid"
> ERROR: found xmin 982973690 from before relfrozenxid 2702858761
> CONTEXT: automatic vacuum of table "avito_delta.pg_catalog.pg_auth_members"
> 
> and it does not try to vacuum other tables.

Yea, that's this bug.  I'd remove global/pg_internal.init, reconnect,
and manually vacuum.

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Matheus de Oliveira
Date:
Hello again...

On Tue, Jun 19, 2018 at 12:53 PM, Andres Freund <andres@anarazel.de> wrote:
...

After that, yes, deleting the
global/pg_internal.init file is the way to go, and I can't think of a
case where it's problematic, even without stopping the server.


Just to let you know. I applied the work around in the affected server and seemed to work way, so far no error.

Thank you a lot for all the information and help.

Best regards,
--
Matheus de Oliveira


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Jerry Sievers
Date:
Hackers, felt like reporting this relevant tidbit in case of interest...

My site is among the few who've hit this bug.

We observed recently a case of pg_database having joined pg_authid and
pg_auth_members in the bad xmin, unable to vacuum shcatalog group,
however...

pg_database then started working again without any intervention on our
part so apparently due to some relevant system activity.

We did later do a restart to correct the problem for those other 2
 catalogs but , will try the global init file hack mentioned below later
 if/when we face this anew before running a fixed Pg version.

Many thanks!


Matheus de Oliveira <matioli.matheus@gmail.com> writes:

> Hello again...
>
> On Tue, Jun 19, 2018 at 12:53 PM, Andres Freund <andres@anarazel.de>
> wrote:
>
>     ...
>    
>     After that, yes, deleting the
>     global/pg_internal.init file is the way to go, and I can't think
>     of a
>     case where it's problematic, even without stopping the server.
>    

>
>
> Just to let you know. I applied the work around in the affected
> server and seemed to work way, so far no error.
>
> Thank you a lot for all the information and help.
>
> Best regards,
> --
> Matheus de Oliveira
>
>
>
>

-- 
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Kyle Samson
Date:
Hello,

We found the exact same issue on a production database at TripAdvisor. We have no new information to add for debugging.
Bumpingthis to up the priority on a patch version release getting out. 

Thanks,
-Kyle Samson

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Peter Geoghegan
Date:
On Wed, Aug 8, 2018 at 10:23 AM, Kyle Samson <kysamson@tripadvisor.com> wrote:
> We found the exact same issue on a production database at TripAdvisor. We have no new information to add for
debugging.Bumping this to up the priority on a patch version release getting out.
 

There is a batch of point releases that were just stamped + tagged.
They should be released shortly.

-- 
Peter Geoghegan


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Bruce Momjian
Date:
On Wed, Aug  8, 2018 at 10:27:59AM -0700, Peter Geoghegan wrote:
> On Wed, Aug 8, 2018 at 10:23 AM, Kyle Samson <kysamson@tripadvisor.com> wrote:
> > We found the exact same issue on a production database at TripAdvisor. We have no new information to add for
debugging.Bumping this to up the priority on a patch version release getting out.
 
> 
> There is a batch of point releases that were just stamped + tagged.
> They should be released shortly.

They have been released:

    https://www.postgresql.org/about/news/1878/

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +