Thread: found xmin from before relfrozenxid on pg_catalog.pg_authid

found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Jeremy Finzel
Date:
Getting some concerning errors in one of our databases that is on 9.5.11, on autovacuum from template0 database pg_authid and pg_auth_members.  I only saw some notes on the list about this error related to materialized views.  FWIW, we did use pg_upgrade to upgrade this database from 9.4 to 9.5.  Here is an example:

2018-03-19 12:08:33.946 CDT,,,14892,,5aafee91.3a2c,1,,2018-03-19 12:08:33 CDT,59/340951,0,ERROR,XX001,"found xmin 2906288382 from before relfrozenxid 740087784",,,,,"automatic vacuum of table ""template0.pg_catalog.pg_authid""",,,,""
2018-03-19 12:08:33.957 CDT,,,14892,,5aafee91.3a2c,2,,2018-03-19 12:08:33 CDT,59/340953,0,ERROR,XX001,"found xmin 2906288383 from before relfrozenxid 740087784",,,,,"automatic vacuum of table ""template0.pg_catalog.pg_auth_members""",,,,""


Any insight would be much appreciated.

Thanks,
Jeremy

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Alvaro Herrera
Date:
Jeremy Finzel wrote:
> Getting some concerning errors in one of our databases that is on 9.5.11,
> on autovacuum from template0 database pg_authid and pg_auth_members.  I
> only saw some notes on the list about this error related to materialized
> views.  FWIW, we did use pg_upgrade to upgrade this database from 9.4 to
> 9.5.  Here is an example:
> 
> 2018-03-19 12:08:33.946 CDT,,,14892,,5aafee91.3a2c,1,,2018-03-19 12:08:33
> CDT,59/340951,0,ERROR,XX001,"found xmin 2906288382 from before relfrozenxid
> 740087784",,,,,"automatic vacuum of table
> ""template0.pg_catalog.pg_authid""",,,,""

Can you please supply output of pg_controldata?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Jeremy Finzel
Date:
pg_control version number:            942
Catalog version number:               201510051
Database system identifier:           6351536019599012028
Database cluster state:               in production
pg_control last modified:             Mon 19 Mar 2018 12:56:10 PM CDT
Latest checkpoint location:           262BE/FE96240
Prior checkpoint location:            262BA/623D5E40
Latest checkpoint's REDO location:    262BA/F5499E98
Latest checkpoint's REDO WAL file:    00000001000262BA000000F5
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          16/3132524419
Latest checkpoint's NextOID:          1090653331
Latest checkpoint's NextMultiXactId:  2142
Latest checkpoint's NextMultiOffset:  5235
Latest checkpoint's oldestXID:        1829964553
Latest checkpoint's oldestXID's DB:   12376
Latest checkpoint's oldestActiveXID:  3131774441
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 16400
Latest checkpoint's oldestCommitTsXid:1829964553
Latest checkpoint's newestCommitTsXid:3132524418
Time of latest checkpoint:            Mon 19 Mar 2018 12:54:08 PM CDT
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    logical
wal_log_hints setting:                off
max_connections setting:              2000
max_worker_processes setting:         10
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       on
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0

On Mon, Mar 19, 2018 at 12:46 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Jeremy Finzel wrote:
> Getting some concerning errors in one of our databases that is on 9.5.11,
> on autovacuum from template0 database pg_authid and pg_auth_members.  I
> only saw some notes on the list about this error related to materialized
> views.  FWIW, we did use pg_upgrade to upgrade this database from 9.4 to
> 9.5.  Here is an example:
>
> 2018-03-19 12:08:33.946 CDT,,,14892,,5aafee91.3a2c,1,,2018-03-19 12:08:33
> CDT,59/340951,0,ERROR,XX001,"found xmin 2906288382 from before relfrozenxid
> 740087784",,,,,"automatic vacuum of table
> ""template0.pg_catalog.pg_authid""",,,,""

Can you please supply output of pg_controldata?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Jeremy Finzel
Date:


On Mon, Mar 19, 2018 at 12:46 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
Jeremy Finzel wrote:
> Getting some concerning errors in one of our databases that is on 9.5.11,
> on autovacuum from template0 database pg_authid and pg_auth_members.  I
> only saw some notes on the list about this error related to materialized
> views.  FWIW, we did use pg_upgrade to upgrade this database from 9.4 to
> 9.5.  Here is an example:
>
> 2018-03-19 12:08:33.946 CDT,,,14892,,5aafee91.3a2c,1,,2018-03-19 12:08:33
> CDT,59/340951,0,ERROR,XX001,"found xmin 2906288382 from before relfrozenxid
> 740087784",,,,,"automatic vacuum of table
> ""template0.pg_catalog.pg_authid""",,,,""

Can you please supply output of pg_controldata?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Please forgive my accidental top-post.  Here:

pg_control version number:            942
Catalog version number:               201510051
Database system identifier:           6351536019599012028
Database cluster state:               in production
pg_control last modified:             Mon 19 Mar 2018 12:56:10 PM CDT
Latest checkpoint location:           262BE/FE96240
Prior checkpoint location:            262BA/623D5E40
Latest checkpoint's REDO location:    262BA/F5499E98
Latest checkpoint's REDO WAL file:    00000001000262BA000000F5
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          16/3132524419
Latest checkpoint's NextOID:          1090653331
Latest checkpoint's NextMultiXactId:  2142
Latest checkpoint's NextMultiOffset:  5235
Latest checkpoint's oldestXID:        1829964553
Latest checkpoint's oldestXID's DB:   12376
Latest checkpoint's oldestActiveXID:  3131774441
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 16400
Latest checkpoint's oldestCommitTsXid:1829964553
Latest checkpoint's newestCommitTsXid:3132524418
Time of latest checkpoint:            Mon 19 Mar 2018 12:54:08 PM CDT
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    logical
wal_log_hints setting:                off
max_connections setting:              2000
max_worker_processes setting:         10
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       on
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0


Thanks,
Jeremy

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
Hi Jeremy, Alvaro,

On 2018-03-19 13:00:13 -0500, Jeremy Finzel wrote:
> On Mon, Mar 19, 2018 at 12:46 PM, Alvaro Herrera <alvherre@alvh.no-ip.org>
> wrote:
> 
> > Jeremy Finzel wrote:
> > > Getting some concerning errors in one of our databases that is on 9.5.11,
> > > on autovacuum from template0 database pg_authid and pg_auth_members.  I
> > > only saw some notes on the list about this error related to materialized
> > > views.  FWIW, we did use pg_upgrade to upgrade this database from 9.4 to
> > > 9.5.  Here is an example:
> > >
> > > 2018-03-19 12:08:33.946 CDT,,,14892,,5aafee91.3a2c,1,,2018-03-19
> > 12:08:33
> > > CDT,59/340951,0,ERROR,XX001,"found xmin 2906288382 from before
> > relfrozenxid
> > > 740087784",,,,,"automatic vacuum of table
> > > ""template0.pg_catalog.pg_authid""",,,,""
> >
> > Can you please supply output of pg_controldata?

> Latest checkpoint's NextXID:          16/3132524419
> Latest checkpoint's NextMultiXactId:  2142
> Latest checkpoint's NextMultiOffset:  5235
> Latest checkpoint's oldestXID:        1829964553
> Latest checkpoint's oldestXID's DB:   12376
> Latest checkpoint's oldestActiveXID:  3131774441
> Latest checkpoint's oldestMultiXid:   1
> Latest checkpoint's oldestMulti's DB: 16400

Hm, based on these it doesn't look like multixacts were involved (based
on oldestMultiXid it's highly unlikley there've multi wraparound, and
there's not much multixact usage on system tables anyway).  Which
suggests that there might have been actual corrpution here.

Jeremy:
- which version of 9.4 and 9.5 ran on this? Do you know?
- Can you install the pageinspect extension? If so, it might be a
  CREATE EXTENSION pageinspect;
  CREATE OR REPLACE FUNCTION check_rel(rel regclass, OUT blockno int8, OUT lp int2, OUT xmin xid)
RETURNS SETOF RECORD
LANGUAGE SQL
AS $$
    SELECT blockno, lp, t_xmin
    FROM
        generate_series(0, pg_relation_size($1::text) / 8192 - 1) blockno, -- every block in the relation
        heap_page_items(get_raw_page($1::text, blockno::int4)) -- every item on the page
    WHERE
        t_xmin IS NOT NULL -- filter out empty items
        AND t_xmin != 1 -- filter out bootstrap
        AND t_xmin != 2 -- filter out frozen transaction id
        AND (t_infomask & ((x'0100' | x'0200')::int)) != ((x'0100' | x'0200')::int) -- filter out frozen rows with xid
present
        AND age(t_xmin) > age((SELECT relfrozenxid FROM pg_class WHERE oid = $1)) -- xid cutoff filter
$$;
  SELECT * FROM check_rel('pg_authid') LIMIT 100;

  and then display all items for one of the affected pages like
  SELECT * FROM heap_page_items(get_raw_page('pg_authid', 34343));


Alvaro:
- Hm, we talked about code adding context for these kind of errors,
  right? Is that just skipped for csvlog?
- Alvaro, does the above check_rel() function make sense?

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Jeremy Finzel
Date:


On Mon, Mar 19, 2018 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
Hi Jeremy, Alvaro,

On 2018-03-19 13:00:13 -0500, Jeremy Finzel wrote:
> On Mon, Mar 19, 2018 at 12:46 PM, Alvaro Herrera <alvherre@alvh.no-ip.org>
> wrote:
>
> > Jeremy Finzel wrote:
> > > Getting some concerning errors in one of our databases that is on 9.5.11,
> > > on autovacuum from template0 database pg_authid and pg_auth_members.  I
> > > only saw some notes on the list about this error related to materialized
> > > views.  FWIW, we did use pg_upgrade to upgrade this database from 9.4 to
> > > 9.5.  Here is an example:
> > >
> > > 2018-03-19 12:08:33.946 CDT,,,14892,,5aafee91.3a2c,1,,2018-03-19
> > 12:08:33
> > > CDT,59/340951,0,ERROR,XX001,"found xmin 2906288382 from before
> > relfrozenxid
> > > 740087784",,,,,"automatic vacuum of table
> > > ""template0.pg_catalog.pg_authid""",,,,""
> >
> > Can you please supply output of pg_controldata?

> Latest checkpoint's NextXID:          16/3132524419
> Latest checkpoint's NextMultiXactId:  2142
> Latest checkpoint's NextMultiOffset:  5235
> Latest checkpoint's oldestXID:        1829964553
> Latest checkpoint's oldestXID's DB:   12376
> Latest checkpoint's oldestActiveXID:  3131774441
> Latest checkpoint's oldestMultiXid:   1
> Latest checkpoint's oldestMulti's DB: 16400

Hm, based on these it doesn't look like multixacts were involved (based
on oldestMultiXid it's highly unlikley there've multi wraparound, and
there's not much multixact usage on system tables anyway).  Which
suggests that there might have been actual corrpution here.

Jeremy:
- which version of 9.4 and 9.5 ran on this? Do you know?

We upgraded to 9.5.5, and today we are running 9.5.11.  And actually we upgraded from 9.3, not 9.4.  We are still trying to figure out which point release we were on at 9.3.
 
- Can you install the pageinspect extension? If so, it might be a
  CREATE EXTENSION pageinspect;
  CREATE OR REPLACE FUNCTION check_rel(rel regclass, OUT blockno int8, OUT lp int2, OUT xmin xid)
RETURNS SETOF RECORD
LANGUAGE SQL
AS $$
    SELECT blockno, lp, t_xmin
    FROM
        generate_series(0, pg_relation_size($1::text) / 8192 - 1) blockno, -- every block in the relation
        heap_page_items(get_raw_page($1::text, blockno::int4)) -- every item on the page
    WHERE
        t_xmin IS NOT NULL -- filter out empty items
        AND t_xmin != 1 -- filter out bootstrap
        AND t_xmin != 2 -- filter out frozen transaction id
        AND (t_infomask & ((x'0100' | x'0200')::int)) != ((x'0100' | x'0200')::int) -- filter out frozen rows with xid present
        AND age(t_xmin) > age((SELECT relfrozenxid FROM pg_class WHERE oid = $1)) -- xid cutoff filter
$$;
  SELECT * FROM check_rel('pg_authid') LIMIT 100;

Small note - Needs to be this because != is not supported for xid:

        AND NOT t_xmin = 1 -- filter out bootstrap
        AND NOT t_xmin = 2 -- filter out frozen transaction id
 

  and then display all items for one of the affected pages like
  SELECT * FROM heap_page_items(get_raw_page('pg_authid', 34343));


Alvaro:
- Hm, we talked about code adding context for these kind of errors,
  right? Is that just skipped for csvlog?
- Alvaro, does the above check_rel() function make sense?

Greetings,

Andres Freund

The function does NOT show any issue with either of those tables.

One very interesting thing that is puzzling us - we have taken several san snapshots of the system real time that are running on the exact same version 9.5.11, and they do NOT show the same error when we vacuum these tables.  It makes us wonder if simply a db restart would solve the issue.

We will continue to investigate but interested in your feedback about what we have seen thus far.

Thanks,
Jeremy

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
On 2018-03-19 14:37:24 -0500, Jeremy Finzel wrote:
> We upgraded to 9.5.5, and today we are running 9.5.11.  And actually we
> upgraded from 9.3, not 9.4.  We are still trying to figure out which point
> release we were on at 9.3.

Ok.  IIRC there used to be a bug a few years back that sometimes lead to
highly contended pages being skipped during vacuum, and we'd still
update relfrozenxid. IIRC it required the table to be extended at the
same time or something?


> 
> > - Can you install the pageinspect extension? If so, it might be a
> >   CREATE EXTENSION pageinspect;
> >   CREATE OR REPLACE FUNCTION check_rel(rel regclass, OUT blockno int8, OUT
> > lp int2, OUT xmin xid)
> > RETURNS SETOF RECORD
> > LANGUAGE SQL
> > AS $$
> >     SELECT blockno, lp, t_xmin
> >     FROM
> >         generate_series(0, pg_relation_size($1::text) / 8192 - 1) blockno,
> > -- every block in the relation
> >         heap_page_items(get_raw_page($1::text, blockno::int4)) -- every
> > item on the page
> >     WHERE
> >         t_xmin IS NOT NULL -- filter out empty items
> >         AND t_xmin != 1 -- filter out bootstrap
> >         AND t_xmin != 2 -- filter out frozen transaction id
> >         AND (t_infomask & ((x'0100' | x'0200')::int)) != ((x'0100' |
> > x'0200')::int) -- filter out frozen rows with xid present
> >         AND age(t_xmin) > age((SELECT relfrozenxid FROM pg_class WHERE oid
> > = $1)) -- xid cutoff filter
> > $$;
> >   SELECT * FROM check_rel('pg_authid') LIMIT 100;
> >
> 
> Small note - Needs to be this because != is not supported for xid:
> 
>         AND NOT t_xmin = 1 -- filter out bootstrap
>         AND NOT t_xmin = 2 -- filter out frozen transaction id

Only on older releases ;). But yea, that looks right.



> >   and then display all items for one of the affected pages like
> >   SELECT * FROM heap_page_items(get_raw_page('pg_authid', 34343));
> >
> >
> > Alvaro:
> > - Hm, we talked about code adding context for these kind of errors,
> >   right? Is that just skipped for csvlog?
> > - Alvaro, does the above check_rel() function make sense?
> >
> > Greetings,
> >
> > Andres Freund
> >
> 
> The function does NOT show any issue with either of those tables.

Uh, huh?  Alvaro, do you see a bug in my query?

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Jeremy Finzel
Date:


On Mon, Mar 19, 2018 at 2:41 PM, Andres Freund <andres@anarazel.de> wrote:
On 2018-03-19 14:37:24 -0500, Jeremy Finzel wrote:
> We upgraded to 9.5.5, and today we are running 9.5.11.  And actually we
> upgraded from 9.3, not 9.4.  We are still trying to figure out which point
> release we were on at 9.3.

Ok.  IIRC there used to be a bug a few years back that sometimes lead to
highly contended pages being skipped during vacuum, and we'd still
update relfrozenxid. IIRC it required the table to be extended at the
same time or something?


>
> > - Can you install the pageinspect extension? If so, it might be a
> >   CREATE EXTENSION pageinspect;
> >   CREATE OR REPLACE FUNCTION check_rel(rel regclass, OUT blockno int8, OUT
> > lp int2, OUT xmin xid)
> > RETURNS SETOF RECORD
> > LANGUAGE SQL
> > AS $$
> >     SELECT blockno, lp, t_xmin
> >     FROM
> >         generate_series(0, pg_relation_size($1::text) / 8192 - 1) blockno,
> > -- every block in the relation
> >         heap_page_items(get_raw_page($1::text, blockno::int4)) -- every
> > item on the page
> >     WHERE
> >         t_xmin IS NOT NULL -- filter out empty items
> >         AND t_xmin != 1 -- filter out bootstrap
> >         AND t_xmin != 2 -- filter out frozen transaction id
> >         AND (t_infomask & ((x'0100' | x'0200')::int)) != ((x'0100' |
> > x'0200')::int) -- filter out frozen rows with xid present
> >         AND age(t_xmin) > age((SELECT relfrozenxid FROM pg_class WHERE oid
> > = $1)) -- xid cutoff filter
> > $$;
> >   SELECT * FROM check_rel('pg_authid') LIMIT 100;
> >
>
> Small note - Needs to be this because != is not supported for xid:
>
>         AND NOT t_xmin = 1 -- filter out bootstrap
>         AND NOT t_xmin = 2 -- filter out frozen transaction id

Only on older releases ;). But yea, that looks right.



> >   and then display all items for one of the affected pages like
> >   SELECT * FROM heap_page_items(get_raw_page('pg_authid', 34343));
> >
> >
> > Alvaro:
> > - Hm, we talked about code adding context for these kind of errors,
> >   right? Is that just skipped for csvlog?
> > - Alvaro, does the above check_rel() function make sense?
> >
> > Greetings,
> >
> > Andres Freund
> >
>
> The function does NOT show any issue with either of those tables.

Uh, huh?  Alvaro, do you see a bug in my query?

Greetings,

Andres Freund

FWIW, if I remove the last filter, I get these rows and I believe row 7/57/2906288382 is the one generating error:

SELECT * FROM check_rel('pg_authid') LIMIT 100;
 blockno | lp |    xmin
---------+----+------------
       7 |  4 | 2040863716
       7 |  5 | 2040863716
       7 |  8 | 2041172882
       7 |  9 | 2041172882
       7 | 12 | 2041201779
       7 | 13 | 2041201779
       7 | 16 | 2089742733
       7 | 17 | 2090021318
       7 | 18 | 2090021318
       7 | 47 | 2090021898
       7 | 48 | 2090021898
       7 | 49 | 2102749003
       7 | 50 | 2103210571
       7 | 51 | 2103210571
       7 | 54 | 2154640913
       7 | 55 | 2163849781
       7 | 56 | 2295315714
       7 | 57 | 2906288382
       7 | 58 | 2906329443
       7 | 60 | 3131766386
       8 |  1 | 2089844462
       8 |  2 | 2089844462
       8 |  3 | 2089844463
       8 |  6 | 2089844463
       8 |  9 | 2295318868
(25 rows)

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
Hi,

On 2018-03-19 14:53:58 -0500, Jeremy Finzel wrote:
> FWIW, if I remove the last filter, I get these rows and I believe row 7/57/
> 2906288382 is the one generating error:

Oh, yea, that makes sense. It's wrapped around and looks like it's from
the future.

> SELECT * FROM check_rel('pg_authid') LIMIT 100;
>  blockno | lp |    xmin
> ---------+----+------------
>        7 |  4 | 2040863716
>        7 |  5 | 2040863716
>        7 |  8 | 2041172882
>        7 |  9 | 2041172882
>        7 | 12 | 2041201779
>        7 | 13 | 2041201779
>        7 | 16 | 2089742733
>        7 | 17 | 2090021318
>        7 | 18 | 2090021318
>        7 | 47 | 2090021898
>        7 | 48 | 2090021898
>        7 | 49 | 2102749003
>        7 | 50 | 2103210571
>        7 | 51 | 2103210571
>        7 | 54 | 2154640913
>        7 | 55 | 2163849781
>        7 | 56 | 2295315714
>        7 | 57 | 2906288382
>        7 | 58 | 2906329443
>        7 | 60 | 3131766386
>        8 |  1 | 2089844462
>        8 |  2 | 2089844462
>        8 |  3 | 2089844463
>        8 |  6 | 2089844463
>        8 |  9 | 2295318868
> (25 rows)

Could you show the contents of those two pages with a query like I had
in an earlier email?

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Jeremy Finzel
Date:


On Mon, Mar 19, 2018 at 2:56 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2018-03-19 14:53:58 -0500, Jeremy Finzel wrote:
> FWIW, if I remove the last filter, I get these rows and I believe row 7/57/
> 2906288382 is the one generating error:

Oh, yea, that makes sense. It's wrapped around and looks like it's from
the future.

> SELECT * FROM check_rel('pg_authid') LIMIT 100;
>  blockno | lp |    xmin
> ---------+----+------------
>        7 |  4 | 2040863716
>        7 |  5 | 2040863716
>        7 |  8 | 2041172882
>        7 |  9 | 2041172882
>        7 | 12 | 2041201779
>        7 | 13 | 2041201779
>        7 | 16 | 2089742733
>        7 | 17 | 2090021318
>        7 | 18 | 2090021318
>        7 | 47 | 2090021898
>        7 | 48 | 2090021898
>        7 | 49 | 2102749003
>        7 | 50 | 2103210571
>        7 | 51 | 2103210571
>        7 | 54 | 2154640913
>        7 | 55 | 2163849781
>        7 | 56 | 2295315714
>        7 | 57 | 2906288382
>        7 | 58 | 2906329443
>        7 | 60 | 3131766386
>        8 |  1 | 2089844462
>        8 |  2 | 2089844462
>        8 |  3 | 2089844463
>        8 |  6 | 2089844463
>        8 |  9 | 2295318868
> (25 rows)

Could you show the contents of those two pages with a query like I had
in an earlier email?

Greetings,

Andres Freund

SELECT heap_page_items(get_raw_page('pg_authid', 7));

                                                              heap_page_items
--------------------------------------------------------------------------------------------------------------------------------------------
 (1,4720,1,108,1897434979,0,0,"(7,1)",11,2825,32,111111111000000000000000000000000000000001011000010011111100001001111000,507769370)
 (2,4608,1,108,1897442758,0,18,"(7,2)",11,2825,32,111111111000000000000000000000000000000011000011101100000010001001111000,507776451)
 (3,4496,1,108,1897442758,0,20,"(7,3)",11,2825,32,111111111000000000000000000000000000000000100011101100000010001001111000,507776452)
 (4,4384,1,108,2040863716,0,37,"(7,4)",11,2313,32,111111111000000000000000000000000000000000110110111011100011001011111000,525105004)
 (5,4272,1,108,2040863716,0,39,"(7,5)",11,2313,32,111111111000000000000000000000000000000010110110111011100011001011111000,525105005)
 (6,0,3,0,,,,,,,,,)
 (7,0,3,0,,,,,,,,,)
 (8,4160,1,108,2041172882,0,49,"(7,8)",11,2313,32,111111111000000000000000000000000000000001110100101011000111001011111000,525219118)
 (9,4048,1,108,2041172882,0,51,"(7,9)",11,2313,32,111111111000000000000000000000000000000011110100101011000111001011111000,525219119)
 (10,0,3,0,,,,,,,,,)
 (11,0,3,0,,,,,,,,,)
 (12,3936,1,108,2041201779,0,181,"(7,12)",11,2313,32,111111111000000000000000000000000000000011010100010111100111001011111000,525236779)
 (13,3824,1,108,2041201779,0,183,"(7,13)",11,2313,32,111111111000000000000000000000000000000000110100010111100111001011111000,525236780)
 (14,0,3,0,,,,,,,,,)
 (15,0,3,0,,,,,,,,,)
 (16,3712,1,108,2089742733,0,0,"(7,16)",11,2313,32,111111111000000000000000000000000000000001000101110011100000001111111000,532706210)
 (17,3600,1,108,2090021318,0,1,"(7,17)",11,2313,32,111111111000000000000000000000000000000001001100001101001000001111111000,532753458)
 (18,3488,1,108,2090021318,0,3,"(7,18)",11,2313,32,111111111000000000000000000000000000000011001100001101001000001111111000,532753459)
 (19,8080,1,108,753125991,0,2,"(7,19)",11,2825,32,111111111000000000000000000000000000000000110001010111001011100001110000,236796556)
 (20,7968,1,108,753125991,0,4,"(7,20)",11,2825,32,111111111000000000000000000000000000000010110001010111001011100001110000,236796557)
 (21,7856,1,108,753125992,0,19,"(7,21)",11,2825,32,111111111000000000000000000000000000000011001001010111001011100001110000,236796563)
 (22,7744,1,108,753125992,0,21,"(7,22)",11,2825,32,111111111000000000000000000000000000000000101001010111001011100001110000,236796564)
 (23,7632,1,108,753125993,0,36,"(7,23)",11,2825,32,111111111000000000000000000000000000000001011001010111001011100001110000,236796570)
 (24,7520,1,108,753125993,0,38,"(7,24)",11,2825,32,111111111000000000000000000000000000000011011001010111001011100001110000,236796571)
 (25,7408,1,108,753125994,0,53,"(7,25)",11,2825,32,111111111000000000000000000000000000000010000101010111001011100001110000,236796577)
 (26,7296,1,108,753125994,0,55,"(7,26)",11,2825,32,111111111000000000000000000000000000000001000101010111001011100001110000,236796578)
 (27,7184,1,108,753125995,0,70,"(7,27)",11,2825,32,111111111000000000000000000000000000000000010101010111001011100001110000,236796584)
 (28,7072,1,108,753125995,0,72,"(7,28)",11,2825,32,111111111000000000000000000000000000000010010101010111001011100001110000,236796585)
 (29,6960,1,108,753125997,0,87,"(7,29)",11,2825,32,111111111000000000000000000000000000000011001101010111001011100001110000,236796595)
 (30,6848,1,108,753125997,0,89,"(7,30)",11,2825,32,111111111000000000000000000000000000000000101101010111001011100001110000,236796596)
 (31,6736,1,108,753125998,0,104,"(7,31)",11,2825,32,111111111000000000000000000000000000000001011101010111001011100001110000,236796602)
 (32,6624,1,108,753125998,0,106,"(7,32)",11,2825,32,111111111000000000000000000000000000000011011101010111001011100001110000,236796603)
 (33,6512,1,108,753125999,0,121,"(7,33)",11,2825,32,111111111000000000000000000000000000000010000011010111001011100001110000,236796609)
 (34,6400,1,108,753125999,0,123,"(7,34)",11,2825,32,111111111000000000000000000000000000000001000011010111001011100001110000,236796610)
 (35,6288,1,108,753126000,0,138,"(7,35)",11,2825,32,111111111000000000000000000000000000000000010011010111001011100001110000,236796616)
 (36,6176,1,108,753126000,0,140,"(7,36)",11,2825,32,111111111000000000000000000000000000000010010011010111001011100001110000,236796617)
 (37,6064,1,108,753126001,0,155,"(7,37)",11,2825,32,111111111000000000000000000000000000000011110011010111001011100001110000,236796623)
 (38,5952,1,108,753126001,0,157,"(7,38)",11,2825,32,111111111000000000000000000000000000000000001011010111001011100001110000,236796624)
 (39,5840,1,108,753126002,0,172,"(7,39)",11,2825,32,111111111000000000000000000000000000000001101011010111001011100001110000,236796630)
 (40,5728,1,108,753126002,0,174,"(7,40)",11,2825,32,111111111000000000000000000000000000000011101011010111001011100001110000,236796631)
 (41,5616,1,108,753126003,0,189,"(7,41)",11,2825,32,111111111000000000000000000000000000000010111011010111001011100001110000,236796637)
 (42,5504,1,108,753126003,0,191,"(7,42)",11,2825,32,111111111000000000000000000000000000000001111011010111001011100001110000,236796638)
 (43,5392,1,108,753126004,0,206,"(7,43)",11,2825,32,111111111000000000000000000000000000000000100111010111001011100001110000,236796644)
 (44,5280,1,108,753126004,0,208,"(7,44)",11,2825,32,111111111000000000000000000000000000000010100111010111001011100001110000,236796645)
 (45,5168,1,108,753126005,0,223,"(7,45)",11,2825,32,111111111000000000000000000000000000000011010111010111001011100001110000,236796651)
 (46,5056,1,108,753126005,0,225,"(7,46)",11,2825,32,111111111000000000000000000000000000000000110111010111001011100001110000,236796652)
 (47,3376,1,108,2090021898,0,18,"(7,47)",11,2313,32,111111111000000000000000000000000000000001000111001101001000001111111000,532753634)
 (48,3264,1,108,2090021898,0,20,"(7,48)",11,2313,32,111111111000000000000000000000000000000000100111001101001000001111111000,532753636)
 (49,3152,1,108,2102749003,0,0,"(7,49)",11,2313,32,111111111000000000000000000000000000000001111011101110010000110100000100,548445662)
 (50,3040,1,108,2103210571,0,107,"(7,50)",11,2313,32,111111111000000000000000000000000000000000010110101100100100110100000100,548556136)
 (51,2928,1,108,2103210571,0,109,"(7,51)",11,2313,32,111111111000000000000000000000000000000010010110101100100100110100000100,548556137)
 (52,0,3,0,,,,,,,,,)
 (53,0,3,0,,,,,,,,,)
 (54,2816,1,108,2154640913,0,0,"(7,54)",11,2313,32,111111111000000000000000000000000000000010111111010010000000100010000100,554701565)
 (55,2704,1,108,2163849781,0,0,"(7,55)",11,2313,32,111111111000000000000000000000000000000000110000001011101001100010000100,555316236)
 (56,2592,1,108,2295315714,0,0,"(7,56)",11,2313,32,111111111000000000000000000000000000000000100101010100001000010001000100,572590756)
 (57,2480,1,108,2906288382,0,0,"(7,57)",11,2313,32,111111111000000000000000000000000000000000001000101101000100111001111100,1047670032)
 (58,2368,1,108,2906329443,0,0,"(7,58)",11,2313,32,111111111000000000000000000000000000000001111001101101000100111001111100,1047670174)
 (59,60,2,0,,,,,,,,,)
 (60,2224,1,144,3131766386,0,0,"(7,60)",32779,10507,32,111111111100000000000000000000000000000000001101010001010011111100000010,1090298544)
 (61,0,0,0,,,,,,,,,)
 (62,0,0,0,,,,,,,,,)
 (63,0,0,0,,,,,,,,,)
 (64,0,0,0,,,,,,,,,)
 (65,0,0,0,,,,,,,,,)
 (66,0,0,0,,,,,,,,,)
 (67,0,0,0,,,,,,,,,)
 (68,0,0,0,,,,,,,,,)
 (69,4944,1,108,1034607755,0,1,"(7,69)",11,2825,32,111111111000000000000000000000000000000000010111011000010010010100001000,279217896)
 (70,4832,1,108,1034607755,0,3,"(7,70)",11,2825,32,111111111000000000000000000000000000000010010111011000010010010100001000,279217897)
(70 rows)

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Jeremy Finzel
Date:


On Mon, Mar 19, 2018 at 3:01 PM, Jeremy Finzel <finzelj@gmail.com> wrote:


On Mon, Mar 19, 2018 at 2:56 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2018-03-19 14:53:58 -0500, Jeremy Finzel wrote:
> FWIW, if I remove the last filter, I get these rows and I believe row 7/57/
> 2906288382 is the one generating error:

Oh, yea, that makes sense. It's wrapped around and looks like it's from
the future.

> SELECT * FROM check_rel('pg_authid') LIMIT 100;
>  blockno | lp |    xmin
> ---------+----+------------
>        7 |  4 | 2040863716
>        7 |  5 | 2040863716
>        7 |  8 | 2041172882
>        7 |  9 | 2041172882
>        7 | 12 | 2041201779
>        7 | 13 | 2041201779
>        7 | 16 | 2089742733
>        7 | 17 | 2090021318
>        7 | 18 | 2090021318
>        7 | 47 | 2090021898
>        7 | 48 | 2090021898
>        7 | 49 | 2102749003
>        7 | 50 | 2103210571
>        7 | 51 | 2103210571
>        7 | 54 | 2154640913
>        7 | 55 | 2163849781
>        7 | 56 | 2295315714
>        7 | 57 | 2906288382
>        7 | 58 | 2906329443
>        7 | 60 | 3131766386
>        8 |  1 | 2089844462
>        8 |  2 | 2089844462
>        8 |  3 | 2089844463
>        8 |  6 | 2089844463
>        8 |  9 | 2295318868
> (25 rows)

Could you show the contents of those two pages with a query like I had
in an earlier email?

Greetings,

Andres Freund

SELECT heap_page_items(get_raw_page('pg_authid', 7));

                                                              heap_page_items
--------------------------------------------------------------------------------------------------------------------------------------------
 (1,4720,1,108,1897434979,0,0,"(7,1)",11,2825,32,111111111000000000000000000000000000000001011000010011111100001001111000,507769370)
 (2,4608,1,108,1897442758,0,18,"(7,2)",11,2825,32,111111111000000000000000000000000000000011000011101100000010001001111000,507776451)
 (3,4496,1,108,1897442758,0,20,"(7,3)",11,2825,32,111111111000000000000000000000000000000000100011101100000010001001111000,507776452)
 (4,4384,1,108,2040863716,0,37,"(7,4)",11,2313,32,111111111000000000000000000000000000000000110110111011100011001011111000,525105004)
 (5,4272,1,108,2040863716,0,39,"(7,5)",11,2313,32,111111111000000000000000000000000000000010110110111011100011001011111000,525105005)
 (6,0,3,0,,,,,,,,,)
 (7,0,3,0,,,,,,,,,)
 (8,4160,1,108,2041172882,0,49,"(7,8)",11,2313,32,111111111000000000000000000000000000000001110100101011000111001011111000,525219118)
 (9,4048,1,108,2041172882,0,51,"(7,9)",11,2313,32,111111111000000000000000000000000000000011110100101011000111001011111000,525219119)
 (10,0,3,0,,,,,,,,,)
 (11,0,3,0,,,,,,,,,)
 (12,3936,1,108,2041201779,0,181,"(7,12)",11,2313,32,111111111000000000000000000000000000000011010100010111100111001011111000,525236779)
 (13,3824,1,108,2041201779,0,183,"(7,13)",11,2313,32,111111111000000000000000000000000000000000110100010111100111001011111000,525236780)
 (14,0,3,0,,,,,,,,,)
 (15,0,3,0,,,,,,,,,)
 (16,3712,1,108,2089742733,0,0,"(7,16)",11,2313,32,111111111000000000000000000000000000000001000101110011100000001111111000,532706210)
 (17,3600,1,108,2090021318,0,1,"(7,17)",11,2313,32,111111111000000000000000000000000000000001001100001101001000001111111000,532753458)
 (18,3488,1,108,2090021318,0,3,"(7,18)",11,2313,32,111111111000000000000000000000000000000011001100001101001000001111111000,532753459)
 (19,8080,1,108,753125991,0,2,"(7,19)",11,2825,32,111111111000000000000000000000000000000000110001010111001011100001110000,236796556)
 (20,7968,1,108,753125991,0,4,"(7,20)",11,2825,32,111111111000000000000000000000000000000010110001010111001011100001110000,236796557)
 (21,7856,1,108,753125992,0,19,"(7,21)",11,2825,32,111111111000000000000000000000000000000011001001010111001011100001110000,236796563)
 (22,7744,1,108,753125992,0,21,"(7,22)",11,2825,32,111111111000000000000000000000000000000000101001010111001011100001110000,236796564)
 (23,7632,1,108,753125993,0,36,"(7,23)",11,2825,32,111111111000000000000000000000000000000001011001010111001011100001110000,236796570)
 (24,7520,1,108,753125993,0,38,"(7,24)",11,2825,32,111111111000000000000000000000000000000011011001010111001011100001110000,236796571)
 (25,7408,1,108,753125994,0,53,"(7,25)",11,2825,32,111111111000000000000000000000000000000010000101010111001011100001110000,236796577)
 (26,7296,1,108,753125994,0,55,"(7,26)",11,2825,32,111111111000000000000000000000000000000001000101010111001011100001110000,236796578)
 (27,7184,1,108,753125995,0,70,"(7,27)",11,2825,32,111111111000000000000000000000000000000000010101010111001011100001110000,236796584)
 (28,7072,1,108,753125995,0,72,"(7,28)",11,2825,32,111111111000000000000000000000000000000010010101010111001011100001110000,236796585)
 (29,6960,1,108,753125997,0,87,"(7,29)",11,2825,32,111111111000000000000000000000000000000011001101010111001011100001110000,236796595)
 (30,6848,1,108,753125997,0,89,"(7,30)",11,2825,32,111111111000000000000000000000000000000000101101010111001011100001110000,236796596)
 (31,6736,1,108,753125998,0,104,"(7,31)",11,2825,32,111111111000000000000000000000000000000001011101010111001011100001110000,236796602)
 (32,6624,1,108,753125998,0,106,"(7,32)",11,2825,32,111111111000000000000000000000000000000011011101010111001011100001110000,236796603)
 (33,6512,1,108,753125999,0,121,"(7,33)",11,2825,32,111111111000000000000000000000000000000010000011010111001011100001110000,236796609)
 (34,6400,1,108,753125999,0,123,"(7,34)",11,2825,32,111111111000000000000000000000000000000001000011010111001011100001110000,236796610)
 (35,6288,1,108,753126000,0,138,"(7,35)",11,2825,32,111111111000000000000000000000000000000000010011010111001011100001110000,236796616)
 (36,6176,1,108,753126000,0,140,"(7,36)",11,2825,32,111111111000000000000000000000000000000010010011010111001011100001110000,236796617)
 (37,6064,1,108,753126001,0,155,"(7,37)",11,2825,32,111111111000000000000000000000000000000011110011010111001011100001110000,236796623)
 (38,5952,1,108,753126001,0,157,"(7,38)",11,2825,32,111111111000000000000000000000000000000000001011010111001011100001110000,236796624)
 (39,5840,1,108,753126002,0,172,"(7,39)",11,2825,32,111111111000000000000000000000000000000001101011010111001011100001110000,236796630)
 (40,5728,1,108,753126002,0,174,"(7,40)",11,2825,32,111111111000000000000000000000000000000011101011010111001011100001110000,236796631)
 (41,5616,1,108,753126003,0,189,"(7,41)",11,2825,32,111111111000000000000000000000000000000010111011010111001011100001110000,236796637)
 (42,5504,1,108,753126003,0,191,"(7,42)",11,2825,32,111111111000000000000000000000000000000001111011010111001011100001110000,236796638)
 (43,5392,1,108,753126004,0,206,"(7,43)",11,2825,32,111111111000000000000000000000000000000000100111010111001011100001110000,236796644)
 (44,5280,1,108,753126004,0,208,"(7,44)",11,2825,32,111111111000000000000000000000000000000010100111010111001011100001110000,236796645)
 (45,5168,1,108,753126005,0,223,"(7,45)",11,2825,32,111111111000000000000000000000000000000011010111010111001011100001110000,236796651)
 (46,5056,1,108,753126005,0,225,"(7,46)",11,2825,32,111111111000000000000000000000000000000000110111010111001011100001110000,236796652)
 (47,3376,1,108,2090021898,0,18,"(7,47)",11,2313,32,111111111000000000000000000000000000000001000111001101001000001111111000,532753634)
 (48,3264,1,108,2090021898,0,20,"(7,48)",11,2313,32,111111111000000000000000000000000000000000100111001101001000001111111000,532753636)
 (49,3152,1,108,2102749003,0,0,"(7,49)",11,2313,32,111111111000000000000000000000000000000001111011101110010000110100000100,548445662)
 (50,3040,1,108,2103210571,0,107,"(7,50)",11,2313,32,111111111000000000000000000000000000000000010110101100100100110100000100,548556136)
 (51,2928,1,108,2103210571,0,109,"(7,51)",11,2313,32,111111111000000000000000000000000000000010010110101100100100110100000100,548556137)
 (52,0,3,0,,,,,,,,,)
 (53,0,3,0,,,,,,,,,)
 (54,2816,1,108,2154640913,0,0,"(7,54)",11,2313,32,111111111000000000000000000000000000000010111111010010000000100010000100,554701565)
 (55,2704,1,108,2163849781,0,0,"(7,55)",11,2313,32,111111111000000000000000000000000000000000110000001011101001100010000100,555316236)
 (56,2592,1,108,2295315714,0,0,"(7,56)",11,2313,32,111111111000000000000000000000000000000000100101010100001000010001000100,572590756)
 (57,2480,1,108,2906288382,0,0,"(7,57)",11,2313,32,111111111000000000000000000000000000000000001000101101000100111001111100,1047670032)
 (58,2368,1,108,2906329443,0,0,"(7,58)",11,2313,32,111111111000000000000000000000000000000001111001101101000100111001111100,1047670174)
 (59,60,2,0,,,,,,,,,)
 (60,2224,1,144,3131766386,0,0,"(7,60)",32779,10507,32,111111111100000000000000000000000000000000001101010001010011111100000010,1090298544)
 (61,0,0,0,,,,,,,,,)
 (62,0,0,0,,,,,,,,,)
 (63,0,0,0,,,,,,,,,)
 (64,0,0,0,,,,,,,,,)
 (65,0,0,0,,,,,,,,,)
 (66,0,0,0,,,,,,,,,)
 (67,0,0,0,,,,,,,,,)
 (68,0,0,0,,,,,,,,,)
 (69,4944,1,108,1034607755,0,1,"(7,69)",11,2825,32,111111111000000000000000000000000000000000010111011000010010010100001000,279217896)
 (70,4832,1,108,1034607755,0,3,"(7,70)",11,2825,32,111111111000000000000000000000000000000010010111011000010010010100001000,279217897)
(70 rows)

Does the fact that a snapshot does not have this issue suggest it could be memory-related corruption and a db restart could clear it up?

Thanks,
Jeremy

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
Hi,

On 2018-03-19 15:37:51 -0500, Jeremy Finzel wrote:
> Does the fact that a snapshot does not have this issue suggest it could be
> memory-related corruption and a db restart could clear it up?

Could you show the page from the snapshot? I suspect it might just be a
problem that's temporarily not visible as corrupted.

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Peter Geoghegan
Date:
On Mon, Mar 19, 2018 at 1:01 PM, Jeremy Finzel <finzelj@gmail.com> wrote:
> SELECT heap_page_items(get_raw_page('pg_authid', 7));

Can you post this?

SELECT * FROM page_header(get_raw_page('pg_authid', 7));

-- 
Peter Geoghegan


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Jeremy Finzel
Date:


On Mon, Mar 19, 2018 at 3:53 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Mar 19, 2018 at 1:01 PM, Jeremy Finzel <finzelj@gmail.com> wrote:
> SELECT heap_page_items(get_raw_page('pg_authid', 7));

Can you post this?

SELECT * FROM page_header(get_raw_page('pg_authid', 7));

--
Peter Geoghegan

@Peter :

staging=# SELECT * FROM page_header(get_raw_page('pg_authid', 7));
      lsn       | checksum | flags | lower | upper | special | pagesize | version | prune_xid
----------------+----------+-------+-------+-------+---------+----------+---------+-----------
 262B4/10FDC478 |        0 |     1 |   304 |  2224 |    8192 |     8192 |       4 |         0
(1 row)

@Andres :

This is from snapshot (on 9.5.12, but we didn't have the error either on a 9.5.11 snap):

                                                             heap_page_items
-----------------------------------------------------------------------------------------------------------------------------------------
 (1,0,0,0,,,,,,,,,)
 (2,0,0,0,,,,,,,,,)
 (3,0,0,0,,,,,,,,,)
 (4,0,0,0,,,,,,,,,)
 (5,0,0,0,,,,,,,,,)
 (6,8080,1,108,3137434815,0,0,"(7,6)",11,10505,32,111111111000000000000000000000000000000001111110111100001000010010000001,2166427518)
 (7,7936,1,144,3137434816,0,0,"(7,7)",11,10507,32,111111111100000000000000000000000000000001110111000001110101101100001011,3504005358)
 (8,0,0,0,,,,,,,,,)
 (9,0,0,0,,,,,,,,,)
 (10,7792,1,144,3137434817,0,0,"(7,10)",11,10507,32,111111111100000000000000000000000000000001010110001101000011011111101000,401353834)
 (11,7680,1,108,3137434818,0,0,"(7,11)",11,10505,32,111111111000000000000000000000000000000001100011010010010001000001100001,2248708806)
 (12,0,0,0,,,,,,,,,)
 (13,0,0,0,,,,,,,,,)
 (14,7568,1,108,3137434819,0,0,"(7,14)",11,10505,32,111111111000000000000000000000000000000001110011110011011011100010100101,2770187214)
 (15,7456,1,108,3137434820,0,0,"(7,15)",11,10505,32,111111111000000000000000000000000000000011110001010001010011110010100001,2235343503)
 (16,0,0,0,,,,,,,,,)
 (17,0,0,0,,,,,,,,,)
 (18,0,0,0,,,,,,,,,)
 (19,0,0,0,,,,,,,,,)
 (20,0,0,0,,,,,,,,,)
 (21,0,0,0,,,,,,,,,)
 (22,0,0,0,,,,,,,,,)
 (23,0,0,0,,,,,,,,,)
 (24,0,0,0,,,,,,,,,)
 (25,0,0,0,,,,,,,,,)
 (26,0,0,0,,,,,,,,,)
 (27,0,0,0,,,,,,,,,)
 (28,0,0,0,,,,,,,,,)
 (29,0,0,0,,,,,,,,,)
 (30,0,0,0,,,,,,,,,)
 (31,0,0,0,,,,,,,,,)
 (32,0,0,0,,,,,,,,,)
 (33,0,0,0,,,,,,,,,)
 (34,0,0,0,,,,,,,,,)
 (35,0,0,0,,,,,,,,,)
 (36,0,0,0,,,,,,,,,)
 (37,0,0,0,,,,,,,,,)
 (38,0,0,0,,,,,,,,,)
 (39,0,0,0,,,,,,,,,)
 (40,0,0,0,,,,,,,,,)
 (41,0,0,0,,,,,,,,,)
 (42,0,0,0,,,,,,,,,)
 (43,0,0,0,,,,,,,,,)
 (44,0,0,0,,,,,,,,,)
 (45,0,0,0,,,,,,,,,)
 (46,0,0,0,,,,,,,,,)
 (47,0,0,0,,,,,,,,,)
 (48,0,0,0,,,,,,,,,)
 (49,0,0,0,,,,,,,,,)
 (50,0,0,0,,,,,,,,,)
 (51,0,0,0,,,,,,,,,)
 (52,7344,1,108,3137434821,0,0,"(7,52)",11,10505,32,111111111000000000000000000000000000000011011011111110001010010101000001,2191859675)
 (53,7232,1,108,3137434822,0,0,"(7,53)",11,10505,32,111111111000000000000000000000000000000001101011010111100110011011100100,661027542)
 (54,0,0,0,,,,,,,,,)
 (55,0,0,0,,,,,,,,,)
 (56,0,0,0,,,,,,,,,)
 (57,0,0,0,,,,,,,,,)
 (58,0,0,0,,,,,,,,,)
 (59,0,0,0,,,,,,,,,)
 (60,0,0,0,,,,,,,,,)
 (61,7120,1,108,3137434823,0,0,"(7,61)",11,10505,32,111111111000000000000000000000000000000000010111010110000101010111010100,732568296)
 (62,6976,1,144,3137434824,0,0,"(7,62)",11,10507,32,111111111100000000000000000000000000000010100100001001001010110000010100,674571301)
 (63,6864,1,108,3137434825,0,0,"(7,63)",11,10505,32,111111111000000000000000000000000000000010011011011111011100001001010001,2319695577)
 (64,6720,1,144,3137434826,0,0,"(7,64)",11,10507,32,111111111100000000000000000000000000000001000010011001111011100100101000,345892418)
 (65,6608,1,108,3137434827,0,0,"(7,65)",11,10505,32,111111111000000000000000000000000000000001000001010100010010101011001010,1398049410)
 (66,6496,1,108,3137434828,0,0,"(7,66)",11,10505,32,111111111000000000000000000000000000000000101000111100110101001000101010,1414188820)
 (67,6384,1,108,3137434829,0,0,"(7,67)",11,10505,32,111111111000000000000000000000000000000000110101011101111011001110101001,2513301164)
 (68,0,0,0,,,,,,,,,)
 (69,0,0,0,,,,,,,,,)
 (70,0,0,0,,,,,,,,,)
(70 rows)


Thanks,
Jeremy

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Peter Geoghegan
Date:
On Mon, Mar 19, 2018 at 1:55 PM, Jeremy Finzel <finzelj@gmail.com> wrote:
> @Peter :
>
> staging=# SELECT * FROM page_header(get_raw_page('pg_authid', 7));
>       lsn       | checksum | flags | lower | upper | special | pagesize |
> version | prune_xid
> ----------------+----------+-------+-------+-------+---------+----------+---------+-----------
>  262B4/10FDC478 |        0 |     1 |   304 |  2224 |    8192 |     8192 |
> 4 |         0
> (1 row)

Thanks.

That looks normal. I wonder if the contents of that page looks
consistent with the rest of the table following manual inspection,
though. I recently saw system catalog corruption on a 9.5 instance
where an entirely different relation's page ended up in pg_attribute
and pg_depend. They were actually pristine index pages from an
application index. I still have no idea why this happened.

This is very much a guess, but it can't hurt to check if the contents
of the tuples themselves are actually sane by inspecting them with
"SELECT * FROM pg_authid". heap_page_items() doesn't actually care
about the shape of the tuples in the page, so this might have been
missed.

-- 
Peter Geoghegan


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Jeremy Finzel
Date:


On Mon, Mar 19, 2018 at 4:12 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Mar 19, 2018 at 1:55 PM, Jeremy Finzel <finzelj@gmail.com> wrote:
> @Peter :
>
> staging=# SELECT * FROM page_header(get_raw_page('pg_authid', 7));
>       lsn       | checksum | flags | lower | upper | special | pagesize |
> version | prune_xid
> ----------------+----------+-------+-------+-------+---------+----------+---------+-----------
>  262B4/10FDC478 |        0 |     1 |   304 |  2224 |    8192 |     8192 |
> 4 |         0
> (1 row)

Thanks.

That looks normal. I wonder if the contents of that page looks
consistent with the rest of the table following manual inspection,
though. I recently saw system catalog corruption on a 9.5 instance
where an entirely different relation's page ended up in pg_attribute
and pg_depend. They were actually pristine index pages from an
application index. I still have no idea why this happened.

This is very much a guess, but it can't hurt to check if the contents
of the tuples themselves are actually sane by inspecting them with
"SELECT * FROM pg_authid". heap_page_items() doesn't actually care
about the shape of the tuples in the page, so this might have been
missed.

--
Peter Geoghegan

The data all looks fine.  I even forced the index scan on both indexes which also looks fine.

Thanks,
Jeremy

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Jeremy Finzel
Date:


On Mon, Mar 19, 2018 at 3:55 PM, Jeremy Finzel <finzelj@gmail.com> wrote:


On Mon, Mar 19, 2018 at 3:53 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Mar 19, 2018 at 1:01 PM, Jeremy Finzel <finzelj@gmail.com> wrote:
> SELECT heap_page_items(get_raw_page('pg_authid', 7));

Can you post this?

SELECT * FROM page_header(get_raw_page('pg_authid', 7));

--
Peter Geoghegan

@Peter :

staging=# SELECT * FROM page_header(get_raw_page('pg_authid', 7));
      lsn       | checksum | flags | lower | upper | special | pagesize | version | prune_xid
----------------+----------+-------+-------+-------+---------+----------+---------+-----------
 262B4/10FDC478 |        0 |     1 |   304 |  2224 |    8192 |     8192 |       4 |         0
(1 row)

@Andres :

This is from snapshot (on 9.5.12, but we didn't have the error either on a 9.5.11 snap):

                                                             heap_page_items
-----------------------------------------------------------------------------------------------------------------------------------------
 (1,0,0,0,,,,,,,,,)
 (2,0,0,0,,,,,,,,,)
 (3,0,0,0,,,,,,,,,)
 (4,0,0,0,,,,,,,,,)
 (5,0,0,0,,,,,,,,,)
 (6,8080,1,108,3137434815,0,0,"(7,6)",11,10505,32,111111111000000000000000000000000000000001111110111100001000010010000001,2166427518)
 (7,7936,1,144,3137434816,0,0,"(7,7)",11,10507,32,111111111100000000000000000000000000000001110111000001110101101100001011,3504005358)
 (8,0,0,0,,,,,,,,,)
 (9,0,0,0,,,,,,,,,)
 (10,7792,1,144,3137434817,0,0,"(7,10)",11,10507,32,111111111100000000000000000000000000000001010110001101000011011111101000,401353834)
 (11,7680,1,108,3137434818,0,0,"(7,11)",11,10505,32,111111111000000000000000000000000000000001100011010010010001000001100001,2248708806)
 (12,0,0,0,,,,,,,,,)
 (13,0,0,0,,,,,,,,,)
 (14,7568,1,108,3137434819,0,0,"(7,14)",11,10505,32,111111111000000000000000000000000000000001110011110011011011100010100101,2770187214)
 (15,7456,1,108,3137434820,0,0,"(7,15)",11,10505,32,111111111000000000000000000000000000000011110001010001010011110010100001,2235343503)
 (16,0,0,0,,,,,,,,,)
 (17,0,0,0,,,,,,,,,)
 (18,0,0,0,,,,,,,,,)
 (19,0,0,0,,,,,,,,,)
 (20,0,0,0,,,,,,,,,)
 (21,0,0,0,,,,,,,,,)
 (22,0,0,0,,,,,,,,,)
 (23,0,0,0,,,,,,,,,)
 (24,0,0,0,,,,,,,,,)
 (25,0,0,0,,,,,,,,,)
 (26,0,0,0,,,,,,,,,)
 (27,0,0,0,,,,,,,,,)
 (28,0,0,0,,,,,,,,,)
 (29,0,0,0,,,,,,,,,)
 (30,0,0,0,,,,,,,,,)
 (31,0,0,0,,,,,,,,,)
 (32,0,0,0,,,,,,,,,)
 (33,0,0,0,,,,,,,,,)
 (34,0,0,0,,,,,,,,,)
 (35,0,0,0,,,,,,,,,)
 (36,0,0,0,,,,,,,,,)
 (37,0,0,0,,,,,,,,,)
 (38,0,0,0,,,,,,,,,)
 (39,0,0,0,,,,,,,,,)
 (40,0,0,0,,,,,,,,,)
 (41,0,0,0,,,,,,,,,)
 (42,0,0,0,,,,,,,,,)
 (43,0,0,0,,,,,,,,,)
 (44,0,0,0,,,,,,,,,)
 (45,0,0,0,,,,,,,,,)
 (46,0,0,0,,,,,,,,,)
 (47,0,0,0,,,,,,,,,)
 (48,0,0,0,,,,,,,,,)
 (49,0,0,0,,,,,,,,,)
 (50,0,0,0,,,,,,,,,)
 (51,0,0,0,,,,,,,,,)
 (52,7344,1,108,3137434821,0,0,"(7,52)",11,10505,32,111111111000000000000000000000000000000011011011111110001010010101000001,2191859675)
 (53,7232,1,108,3137434822,0,0,"(7,53)",11,10505,32,111111111000000000000000000000000000000001101011010111100110011011100100,661027542)
 (54,0,0,0,,,,,,,,,)
 (55,0,0,0,,,,,,,,,)
 (56,0,0,0,,,,,,,,,)
 (57,0,0,0,,,,,,,,,)
 (58,0,0,0,,,,,,,,,)
 (59,0,0,0,,,,,,,,,)
 (60,0,0,0,,,,,,,,,)
 (61,7120,1,108,3137434823,0,0,"(7,61)",11,10505,32,111111111000000000000000000000000000000000010111010110000101010111010100,732568296)
 (62,6976,1,144,3137434824,0,0,"(7,62)",11,10507,32,111111111100000000000000000000000000000010100100001001001010110000010100,674571301)
 (63,6864,1,108,3137434825,0,0,"(7,63)",11,10505,32,111111111000000000000000000000000000000010011011011111011100001001010001,2319695577)
 (64,6720,1,144,3137434826,0,0,"(7,64)",11,10507,32,111111111100000000000000000000000000000001000010011001111011100100101000,345892418)
 (65,6608,1,108,3137434827,0,0,"(7,65)",11,10505,32,111111111000000000000000000000000000000001000001010100010010101011001010,1398049410)
 (66,6496,1,108,3137434828,0,0,"(7,66)",11,10505,32,111111111000000000000000000000000000000000101000111100110101001000101010,1414188820)
 (67,6384,1,108,3137434829,0,0,"(7,67)",11,10505,32,111111111000000000000000000000000000000000110101011101111011001110101001,2513301164)
 (68,0,0,0,,,,,,,,,)
 (69,0,0,0,,,,,,,,,)
 (70,0,0,0,,,,,,,,,)
(70 rows)


Thanks,
Jeremy

Any suggestions as to what I can do from this point?  Is it feasible that a server restart would fix this, or are you saying it would just mask the problem?  I can't reproduce it on a snapshot which is quite odd.

FWIW, maybe this is obvious, but the pages look identical on the streamers we have to the master.

Thanks,
Jeremy

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Jeremy Finzel
Date:


On Tue, Mar 20, 2018 at 11:19 AM, Jeremy Finzel <finzelj@gmail.com> wrote:


On Mon, Mar 19, 2018 at 3:55 PM, Jeremy Finzel <finzelj@gmail.com> wrote:


On Mon, Mar 19, 2018 at 3:53 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Mar 19, 2018 at 1:01 PM, Jeremy Finzel <finzelj@gmail.com> wrote:
> SELECT heap_page_items(get_raw_page('pg_authid', 7));

Can you post this?

SELECT * FROM page_header(get_raw_page('pg_authid', 7));

--
Peter Geoghegan

@Peter :

staging=# SELECT * FROM page_header(get_raw_page('pg_authid', 7));
      lsn       | checksum | flags | lower | upper | special | pagesize | version | prune_xid
----------------+----------+-------+-------+-------+---------+----------+---------+-----------
 262B4/10FDC478 |        0 |     1 |   304 |  2224 |    8192 |     8192 |       4 |         0
(1 row)

@Andres :

This is from snapshot (on 9.5.12, but we didn't have the error either on a 9.5.11 snap):

                                                             heap_page_items
-----------------------------------------------------------------------------------------------------------------------------------------
 (1,0,0,0,,,,,,,,,)
 (2,0,0,0,,,,,,,,,)
 (3,0,0,0,,,,,,,,,)
 (4,0,0,0,,,,,,,,,)
 (5,0,0,0,,,,,,,,,)
 (6,8080,1,108,3137434815,0,0,"(7,6)",11,10505,32,111111111000000000000000000000000000000001111110111100001000010010000001,2166427518)
 (7,7936,1,144,3137434816,0,0,"(7,7)",11,10507,32,111111111100000000000000000000000000000001110111000001110101101100001011,3504005358)
 (8,0,0,0,,,,,,,,,)
 (9,0,0,0,,,,,,,,,)
 (10,7792,1,144,3137434817,0,0,"(7,10)",11,10507,32,111111111100000000000000000000000000000001010110001101000011011111101000,401353834)
 (11,7680,1,108,3137434818,0,0,"(7,11)",11,10505,32,111111111000000000000000000000000000000001100011010010010001000001100001,2248708806)
 (12,0,0,0,,,,,,,,,)
 (13,0,0,0,,,,,,,,,)
 (14,7568,1,108,3137434819,0,0,"(7,14)",11,10505,32,111111111000000000000000000000000000000001110011110011011011100010100101,2770187214)
 (15,7456,1,108,3137434820,0,0,"(7,15)",11,10505,32,111111111000000000000000000000000000000011110001010001010011110010100001,2235343503)
 (16,0,0,0,,,,,,,,,)
 (17,0,0,0,,,,,,,,,)
 (18,0,0,0,,,,,,,,,)
 (19,0,0,0,,,,,,,,,)
 (20,0,0,0,,,,,,,,,)
 (21,0,0,0,,,,,,,,,)
 (22,0,0,0,,,,,,,,,)
 (23,0,0,0,,,,,,,,,)
 (24,0,0,0,,,,,,,,,)
 (25,0,0,0,,,,,,,,,)
 (26,0,0,0,,,,,,,,,)
 (27,0,0,0,,,,,,,,,)
 (28,0,0,0,,,,,,,,,)
 (29,0,0,0,,,,,,,,,)
 (30,0,0,0,,,,,,,,,)
 (31,0,0,0,,,,,,,,,)
 (32,0,0,0,,,,,,,,,)
 (33,0,0,0,,,,,,,,,)
 (34,0,0,0,,,,,,,,,)
 (35,0,0,0,,,,,,,,,)
 (36,0,0,0,,,,,,,,,)
 (37,0,0,0,,,,,,,,,)
 (38,0,0,0,,,,,,,,,)
 (39,0,0,0,,,,,,,,,)
 (40,0,0,0,,,,,,,,,)
 (41,0,0,0,,,,,,,,,)
 (42,0,0,0,,,,,,,,,)
 (43,0,0,0,,,,,,,,,)
 (44,0,0,0,,,,,,,,,)
 (45,0,0,0,,,,,,,,,)
 (46,0,0,0,,,,,,,,,)
 (47,0,0,0,,,,,,,,,)
 (48,0,0,0,,,,,,,,,)
 (49,0,0,0,,,,,,,,,)
 (50,0,0,0,,,,,,,,,)
 (51,0,0,0,,,,,,,,,)
 (52,7344,1,108,3137434821,0,0,"(7,52)",11,10505,32,111111111000000000000000000000000000000011011011111110001010010101000001,2191859675)
 (53,7232,1,108,3137434822,0,0,"(7,53)",11,10505,32,111111111000000000000000000000000000000001101011010111100110011011100100,661027542)
 (54,0,0,0,,,,,,,,,)
 (55,0,0,0,,,,,,,,,)
 (56,0,0,0,,,,,,,,,)
 (57,0,0,0,,,,,,,,,)
 (58,0,0,0,,,,,,,,,)
 (59,0,0,0,,,,,,,,,)
 (60,0,0,0,,,,,,,,,)
 (61,7120,1,108,3137434823,0,0,"(7,61)",11,10505,32,111111111000000000000000000000000000000000010111010110000101010111010100,732568296)
 (62,6976,1,144,3137434824,0,0,"(7,62)",11,10507,32,111111111100000000000000000000000000000010100100001001001010110000010100,674571301)
 (63,6864,1,108,3137434825,0,0,"(7,63)",11,10505,32,111111111000000000000000000000000000000010011011011111011100001001010001,2319695577)
 (64,6720,1,144,3137434826,0,0,"(7,64)",11,10507,32,111111111100000000000000000000000000000001000010011001111011100100101000,345892418)
 (65,6608,1,108,3137434827,0,0,"(7,65)",11,10505,32,111111111000000000000000000000000000000001000001010100010010101011001010,1398049410)
 (66,6496,1,108,3137434828,0,0,"(7,66)",11,10505,32,111111111000000000000000000000000000000000101000111100110101001000101010,1414188820)
 (67,6384,1,108,3137434829,0,0,"(7,67)",11,10505,32,111111111000000000000000000000000000000000110101011101111011001110101001,2513301164)
 (68,0,0,0,,,,,,,,,)
 (69,0,0,0,,,,,,,,,)
 (70,0,0,0,,,,,,,,,)
(70 rows)


Thanks,
Jeremy

Any suggestions as to what I can do from this point?  Is it feasible that a server restart would fix this, or are you saying it would just mask the problem?  I can't reproduce it on a snapshot which is quite odd.

FWIW, maybe this is obvious, but the pages look identical on the streamers we have to the master.

Thanks,
Jeremy

A server restart and upgrade to 9.5.12 (at the same time), as expected, made the issue go away.  Still doesn't give us any answers as to what happened or if it would happen again!  Thanks for the feeback.

Jeremy

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Peter Geoghegan
Date:
On Wed, Mar 21, 2018 at 1:38 PM, Jeremy Finzel <finzelj@gmail.com> wrote:
> A server restart and upgrade to 9.5.12 (at the same time), as expected, made
> the issue go away.  Still doesn't give us any answers as to what happened or
> if it would happen again!  Thanks for the feeback.

You may still want to use amcheck to look for problems. The version on
Github works with 9.5, and there are Redhat and Debian pgdg packages.
See:

https://github.com/petergeoghegan/amcheck

The "heapallindexed" option will be of particular interest to you -
that option verifies that the table has matching rows for a target
index (in addition to testing the structure of a target B-Tree index
itself). This is probably the best general test for corruption that is
available. There is a fair chance that this will reveal new
information.

-- 
Peter Geoghegan


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Jeremy Finzel
Date:


On Wed, Mar 21, 2018 at 4:29 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Mar 21, 2018 at 1:38 PM, Jeremy Finzel <finzelj@gmail.com> wrote:
> A server restart and upgrade to 9.5.12 (at the same time), as expected, made
> the issue go away.  Still doesn't give us any answers as to what happened or
> if it would happen again!  Thanks for the feeback.

You may still want to use amcheck to look for problems. The version on
Github works with 9.5, and there are Redhat and Debian pgdg packages.
See:

https://github.com/petergeoghegan/amcheck

The "heapallindexed" option will be of particular interest to you -
that option verifies that the table has matching rows for a target
index (in addition to testing the structure of a target B-Tree index
itself). This is probably the best general test for corruption that is
available. There is a fair chance that this will reveal new
information.

--
Peter Geoghegan

Thank you for the recommendation.  I ran both amcheck functions on all 4 indexes of those 2 tables with heapallindexed = true, but no issues were found.

Thanks,
Jeremy

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Alvaro Herrera
Date:
I admit I'm pretty surprised by this whole episode.  I have no useful
advice to offer here.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Peter Geoghegan
Date:
On Thu, Mar 22, 2018 at 12:27 PM, Jeremy Finzel <finzelj@gmail.com> wrote:
> Thank you for the recommendation.  I ran both amcheck functions on all 4
> indexes of those 2 tables with heapallindexed = true, but no issues were
> found.

Probably wouldn't hurt to run it against all indexes, if you can make
time for that. If you can generalize from the example query that calls
the bt_index_check() function, but set
"heapallindexed=>i.indisprimary" and remove "n.nspname =
'pg_catalog'", as well as "LIMIT 10".  This will test tables and
indexes from all schemas, which might be interesting.
-- 
Peter Geoghegan


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Jeremy Finzel
Date:
On Thu, Mar 22, 2018 at 3:20 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Mar 22, 2018 at 12:27 PM, Jeremy Finzel <finzelj@gmail.com> wrote:
> Thank you for the recommendation.  I ran both amcheck functions on all 4
> indexes of those 2 tables with heapallindexed = true, but no issues were
> found.

Probably wouldn't hurt to run it against all indexes, if you can make
time for that. If you can generalize from the example query that calls
the bt_index_check() function, but set
"heapallindexed=>i.indisprimary" and remove "n.nspname =
'pg_catalog'", as well as "LIMIT 10".  This will test tables and
indexes from all schemas, which might be interesting.
--
Peter Geoghegan

I am running this on a san snapshot of our production system.  I assume that this will give me a valid check for file-system-level corruption.  I am going to kick it off and see if I find anything interesting.

Thanks,
Jeremy

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Peter Geoghegan
Date:
On Thu, Mar 22, 2018 at 2:24 PM, Jeremy Finzel <finzelj@gmail.com> wrote:
> I am running this on a san snapshot of our production system.  I assume that
> this will give me a valid check for file-system-level corruption.  I am
> going to kick it off and see if I find anything interesting.

It might. Note that SAN snapshots might have corruption hazards,
though. Postgres expects crash consistency across all filesystems, so
you might run into trouble if you had a separate filesystem for WAL,
for example. I know that LVM snapshots only provide a consistent view
of a single logical volume, even though many LVM + Postgres setups
will involve multiple logical volumes. This makes it possible for a
small inconsistency across logical volumes to corrupt data.

I don't know anything about your SAN snapshotting, but this is at
least something to consider.

-- 
Peter Geoghegan


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Maxim Boguk
Date:
​Hi everyone,

I just got the same issue on 9.6.8:

2018-05-15 11:52:01 MSK 33558 @ from  [vxid:317/92895305 txid:0] [] ERROR:  found xmin 2808837517 from before relfrozenxid 248712603
2018-05-15 11:52:01 MSK 33558 @ from  [vxid:317/92895305 txid:0] [] CONTEXT:  automatic vacuum of table "template0.pg_catalog.pg_authid"

Additional details:

1) bt_index_check - no errors on both indexes.

2) SELECT pg_truncate_visibility_map('pg_authid'::regclass); + vacuum freeze - doesn't help.

3)Total 6 rows affected:
template1=# select oid, ctid, xmin, xmax, cmin, cmax from pg_authid where xmin::text::bigint > (select relfrozenxid::text::bigint from pg_class where relname='pg_authid');
    oid    |  ctid  |    xmin    | xmax | cmin | cmax
-----------+--------+------------+------+------+------
 183671986 | (0,90) | 3039161773 |    0 |    0 |    0
 183106699 | (1,48) | 2576823237 |    0 |    0 |    0
 183921770 | (1,50) | 3265971811 |    0 |    0 |    0
 183921914 | (1,52) | 3266122344 |    0 |    0 |    0
 187988173 | (1,58) | 4258893789 |    0 |    0 |    0
 182424977 | (1,62) | 2808837517 |    0 |    0 |    0
on total two pages.

4) template1=# select relfrozenxid from pg_catalog.pg_class where relname = 'pg_authid';
 relfrozenxid
--------------
   2548304492


5)Rows itself looks pretty valid and correspond to the actual database users.
7)No database/server crash happened last few years, no disk errors/problems.

I feel it could be related with vacuum skip locked pages patch + freeze + shared catalog combination, but cannot prove it yet.

Looking for possible course of action.
Probably simplest fix - drop and recreate these 6 affected users, but so far I willing spent some time research into this issue.

--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
On 2018-05-15 11:06:38 +0200, Maxim Boguk wrote:
> ​Hi everyone,
> 
> I just got the same issue on 9.6.8:
> 
> 2018-05-15 11:52:01 MSK 33558 @ from  [vxid:317/92895305 txid:0] [] ERROR:
> found xmin 2808837517 from before relfrozenxid 248712603
> 2018-05-15 11:52:01 MSK 33558 @ from  [vxid:317/92895305 txid:0] []
> CONTEXT:  automatic vacuum of table "template0.pg_catalog.pg_authid"
> 
> Additional details:
> 
> 1) bt_index_check - no errors on both indexes.
> 
> 2) SELECT pg_truncate_visibility_map('pg_authid'::regclass); + vacuum
> freeze - doesn't help.

Yea, too late.


> 3)Total 6 rows affected:
> template1=# select oid, ctid, xmin, xmax, cmin, cmax from pg_authid where
> xmin::text::bigint > (select relfrozenxid::text::bigint from pg_class where
> relname='pg_authid');
>     oid    |  ctid  |    xmin    | xmax | cmin | cmax
> -----------+--------+------------+------+------+------
>  183671986 | (0,90) | 3039161773 |    0 |    0 |    0
>  183106699 | (1,48) | 2576823237 |    0 |    0 |    0
>  183921770 | (1,50) | 3265971811 |    0 |    0 |    0
>  183921914 | (1,52) | 3266122344 |    0 |    0 |    0
>  187988173 | (1,58) | 4258893789 |    0 |    0 |    0
>  182424977 | (1,62) | 2808837517 |    0 |    0 |    0
> on total two pages.
> 
> 4) template1=# select relfrozenxid from pg_catalog.pg_class where relname =
> 'pg_authid';
>  relfrozenxid
> --------------
>    2548304492
> 
> 
> 5)Rows itself looks pretty valid and correspond to the actual database
> users.
> 7)No database/server crash happened last few years, no disk errors/problems.
> 
> I feel it could be related with vacuum skip locked pages patch + freeze +
> shared catalog combination, but cannot prove it yet.
> ​
> Looking for possible course of action.
> Probably simplest fix - drop and recreate these 6 affected users, but so
> far I willing spent some time research into this issue.

Could you use pageinspect to get the infomasks for the affected tuples?

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Maxim Boguk
Date:
Hi Andres,


> ​
> Looking for possible course of action.
> Probably simplest fix - drop and recreate these 6 affected users, but so
> far I willing spent some time research into this issue.

Could you use pageinspect to get the infomasks for the affected tuples?

Greetings,

Andres Freund


​For sample:

postgres=# vacuum pg_catalog.pg_authid;
ERROR:  found xmin 2894889518 from before relfrozenxid 248712603

select ctid, xmin, xmax, cmin, cmax from pg_catalog.pg_authid where xmin::text::bigint=2894889518;
  ctid  |    xmin    | xmax | cmin | cmax
--------+------------+------+------+------
 (1,26) | 2894889518 |    0 |    0 |    0


postgres=# SELECT * FROM heap_page_items(get_raw_page('pg_authid', 1)) where t_ctid::text='(1,26)';
-[ RECORD 1 ]-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
lp          | 26
lp_off      | 4656
lp_flags    | 1
lp_len      | 144
t_xmin      | 2894889518
t_xmax      | 0
t_field3    | 0
t_ctid      | (1,26)
t_infomask2 | 32779
t_infomask  | 10507
t_hoff      | 32
t_bits      | 1111111111000000
t_oid       | 189787727

Any new role created in DB instantly affected by this issue.


In the same time:

 select relfrozenxid from pg_class where relname='pg_authid';
 relfrozenxid
--------------
   2863429136

So it's interesting where value of " from before relfrozenxid 248712603" come from.




--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
Hi,

On 2018-05-22 21:30:43 +0300, Maxim Boguk wrote:
> ​For sample:
> 
> postgres=# vacuum pg_catalog.pg_authid;
> ERROR:  found xmin 2894889518 from before relfrozenxid 248712603
> 
> select ctid, xmin, xmax, cmin, cmax from pg_catalog.pg_authid where
> xmin::text::bigint=2894889518;
>   ctid  |    xmin    | xmax | cmin | cmax
> --------+------------+------+------+------
>  (1,26) | 2894889518 |    0 |    0 |    0
> 
> 
> postgres=# SELECT * FROM heap_page_items(get_raw_page('pg_authid', 1))
> where t_ctid::text='(1,26)';
> -[ RECORD 1
>
]-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> lp          | 26
> lp_off      | 4656
> lp_flags    | 1
> lp_len      | 144
> t_xmin      | 2894889518
> t_xmax      | 0
> t_field3    | 0
> t_ctid      | (1,26)
> t_infomask2 | 32779
> t_infomask  | 10507
> t_hoff      | 32
> t_bits      | 1111111111000000
> t_oid       | 189787727
>

So this row has, if I didn't screw up decoding the following infomask
bits set::
HEAP_HASNULL
HEAP_HASVARWIDTH
HEAP_HASOID
HEAP_XMIN_COMMITTED
HEAP_XMAX_INVALID
HEAP_UPDATED

So it's not been previously frozen, which I was wondering about.


> Which indeed makes it
> Any new role created in DB instantly affected by this issue.

What's txid_current()?


> 
> In the same time:
> 
>  select relfrozenxid from pg_class where relname='pg_authid';
>  relfrozenxid
> --------------
>    2863429136
> 
> So it's interesting where value of " from before relfrozenxid 248712603"
> come from.

Hm. That's indeed odd.  Could you get a backtrace of the error with "bt
full" of the error?

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Maxim Boguk
Date:


On Tue, May 22, 2018 at 9:47 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2018-05-22 21:30:43 +0300, Maxim Boguk wrote:
> ​For sample:
>
> postgres=# vacuum pg_catalog.pg_authid;
> ERROR:  found xmin 2894889518 from before relfrozenxid 248712603
>
> select ctid, xmin, xmax, cmin, cmax from pg_catalog.pg_authid where
> xmin::text::bigint=2894889518;
>   ctid  |    xmin    | xmax | cmin | cmax
> --------+------------+------+------+------
>  (1,26) | 2894889518 |    0 |    0 |    0
>
>
> postgres=# SELECT * FROM heap_page_items(get_raw_page('pg_authid', 1))
> where t_ctid::text='(1,26)';
> -[ RECORD 1
> ]-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> lp          | 26
> lp_off      | 4656
> lp_flags    | 1
> lp_len      | 144
> t_xmin      | 2894889518
> t_xmax      | 0
> t_field3    | 0
> t_ctid      | (1,26)
> t_infomask2 | 32779
> t_infomask  | 10507
> t_hoff      | 32
> t_bits      | 1111111111000000
> t_oid       | 189787727
>

So this row has, if I didn't screw up decoding the following infomask
bits set::
HEAP_HASNULL
HEAP_HASVARWIDTH
HEAP_HASOID
HEAP_XMIN_COMMITTED
HEAP_XMAX_INVALID
HEAP_UPDATED

So it's not been previously frozen, which I was wondering about.


> Which indeed makes it
> Any new role created in DB instantly affected by this issue.

What's txid_current()?


>
> In the same time:
>
>  select relfrozenxid from pg_class where relname='pg_authid';
>  relfrozenxid
> --------------
>    2863429136
>
> So it's interesting where value of " from before relfrozenxid 248712603"
> come from.

Hm. That's indeed odd.  Could you get a backtrace of the error with "bt
full" of the error?

Greetings,

Andres Freund

​select txid_current();
 txid_current
--------------
  41995913769
​About gdb bt - it's tricky because it is mission critical master db of huge project.
I'll will try promote backup replica and check is issue persist there and if yes - we will have our playground for a while, but it will require sometime to arrange.​


--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
Hi,

On 2018-05-22 22:18:15 +0300, Maxim Boguk wrote:
> On Tue, May 22, 2018 at 9:47 PM, Andres Freund <andres@anarazel.de> wrote:
> > >  select relfrozenxid from pg_class where relname='pg_authid';
> > >  relfrozenxid
> > > --------------
> > >    2863429136

> ​select txid_current();
>  txid_current
> --------------
>   41995913769

So that's an xid of 3341208114, if you leave the epoch out. What's
autovacuum_freeze_max_age set to in that cluster?

Can you show pg_controldata output, and relminmxid from that cluster?

I might be daft here, but it's surely curious that the relfrozenxid from
the error and pg_catalog are really different (number of digits):
catalog: 2863429136
error:    248712603


> ​About gdb bt - it's tricky because it is mission critical master db of
> huge project.
> I'll will try promote backup replica and check is issue persist there and
> if yes - we will have our playground for a while, but it will require
> sometime to arrange.​

You should be ok to just bt that in the running cluster, but I
definitely understand if you don't want to do that...  I'd appreciate if
you set up the a playground, because this seems like something that'll
reappear.

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Maxim Boguk
Date:


On Tue, May 22, 2018 at 10:30 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2018-05-22 22:18:15 +0300, Maxim Boguk wrote:
> On Tue, May 22, 2018 at 9:47 PM, Andres Freund <andres@anarazel.de> wrote:
> > >  select relfrozenxid from pg_class where relname='pg_authid';
> > >  relfrozenxid
> > > --------------
> > >    2863429136

> ​select txid_current();
>  txid_current
> --------------
>   41995913769

So that's an xid of 3341208114, if you leave the epoch out. What's
​​
autovacuum_freeze_max_age set to in that cluster?

​postgres=# show autovacuum_freeze_max_age;
 autovacuum_freeze_max_age
---------------------------
 200000000
(default value I think)​

 
Can you show pg_controldata output, and
​​
relminmxid from that cluster?

​postgres@db:~$ /usr/lib/postgresql/9.6/bin/pg_controldata  -D /var/lib/postgresql/9.6/main
pg_control version number:            960
Catalog version number:               201608131
Database system identifier:           6469368654711450114
Database cluster state:               in production
pg_control last modified:             Tue 22 May 2018 10:20:14 PM MSK
Latest checkpoint location:           CCB5/F9C37950
Prior checkpoint location:            CCB0/43F316B0
Latest checkpoint's REDO location:    CCB1/6706BD88
Latest checkpoint's REDO WAL file:    000000010000CCB100000067
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          9:3341161759
Latest checkpoint's NextOID:          190071899
Latest checkpoint's NextMultiXactId:  59416233
Latest checkpoint's NextMultiOffset:  215588532
Latest checkpoint's oldestXID:        2814548646
Latest checkpoint's oldestXID's DB:   16400
Latest checkpoint's oldestActiveXID:  3341161759
Latest checkpoint's oldestMultiXid:   54264778
Latest checkpoint's oldestMulti's DB: 16400
Latest checkpoint's oldestCommitTsXid:2814548646
Latest checkpoint's newestCommitTsXid:3341161758
Time of latest checkpoint:            Tue 22 May 2018 10:05:16 PM MSK
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    replica
wal_log_hints setting:                on
max_connections setting:              2000
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       on
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0
postgres=# select datname,datfrozenxid,datminmxid from pg_database order by datname;
  datname  | datfrozenxid | datminmxid
-----------+--------------+------------
 **        |   2815939794 |   54265194
 postgres  |   2863429136 |   54280819
 template0 |   3148297669 |   59342106
 template1 |   2816765546 |   59261794



 

I might be daft here, but it's surely curious that the relfrozenxid from
the error and pg_catalog are really different (number of digits):
catalog: 2863429136
error:    248712603


> ​About gdb bt - it's tricky because it is mission critical master db of
> huge project.
> I'll will try promote backup replica and check is issue persist there and
> if yes - we will have our playground for a while, but it will require
> sometime to arrange.​

You should be ok to just bt that in the running cluster, but I
definitely understand if you don't want to do that...  I'd appreciate if
you set up the a playground, because this seems like something that'll
reappear.

​My gdb kunf-fu exceptionally rusty now, so I better play in sandbox.​
​And if error did't exist on fresh promoted replica it will give us useful info as well.​

--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Jeremy Finzel
Date:


On Tue, May 22, 2018 at 2:42 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:


On Tue, May 22, 2018 at 10:30 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2018-05-22 22:18:15 +0300, Maxim Boguk wrote:
> On Tue, May 22, 2018 at 9:47 PM, Andres Freund <andres@anarazel.de> wrote:
> > >  select relfrozenxid from pg_class where relname='pg_authid';
> > >  relfrozenxid
> > > --------------
> > >    2863429136

> ​select txid_current();
>  txid_current
> --------------
>   41995913769

So that's an xid of 3341208114, if you leave the epoch out. What's
​​
autovacuum_freeze_max_age set to in that cluster?

​postgres=# show autovacuum_freeze_max_age;
 autovacuum_freeze_max_age
---------------------------
 200000000
(default value I think)​

 
Can you show pg_controldata output, and
​​
relminmxid from that cluster?

​postgres@db:~$ /usr/lib/postgresql/9.6/bin/pg_controldata  -D /var/lib/postgresql/9.6/main
pg_control version number:            960
Catalog version number:               201608131
Database system identifier:           6469368654711450114
Database cluster state:               in production
pg_control last modified:             Tue 22 May 2018 10:20:14 PM MSK
Latest checkpoint location:           CCB5/F9C37950
Prior checkpoint location:            CCB0/43F316B0
Latest checkpoint's REDO location:    CCB1/6706BD88
Latest checkpoint's REDO WAL file:    000000010000CCB100000067
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          9:3341161759
Latest checkpoint's NextOID:          190071899
Latest checkpoint's NextMultiXactId:  59416233
Latest checkpoint's NextMultiOffset:  215588532
Latest checkpoint's oldestXID:        2814548646
Latest checkpoint's oldestXID's DB:   16400
Latest checkpoint's oldestActiveXID:  3341161759
Latest checkpoint's oldestMultiXid:   54264778
Latest checkpoint's oldestMulti's DB: 16400
Latest checkpoint's oldestCommitTsXid:2814548646
Latest checkpoint's newestCommitTsXid:3341161758
Time of latest checkpoint:            Tue 22 May 2018 10:05:16 PM MSK
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    replica
wal_log_hints setting:                on
max_connections setting:              2000
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       on
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0
postgres=# select datname,datfrozenxid,datminmxid from pg_database order by datname;
  datname  | datfrozenxid | datminmxid
-----------+--------------+------------
 **        |   2815939794 |   54265194
 postgres  |   2863429136 |   54280819
 template0 |   3148297669 |   59342106
 template1 |   2816765546 |   59261794



 

I might be daft here, but it's surely curious that the relfrozenxid from
the error and pg_catalog are really different (number of digits):
catalog: 2863429136
error:    248712603


> ​About gdb bt - it's tricky because it is mission critical master db of
> huge project.
> I'll will try promote backup replica and check is issue persist there and
> if yes - we will have our playground for a while, but it will require
> sometime to arrange.​

You should be ok to just bt that in the running cluster, but I
definitely understand if you don't want to do that...  I'd appreciate if
you set up the a playground, because this seems like something that'll
reappear.

​My gdb kunf-fu exceptionally rusty now, so I better play in sandbox.​
​And if error did't exist on fresh promoted replica it will give us useful info as well.​

--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"


All - 

We just noticed that this exact same problem has now returned for us.  It has exactly the same symptoms as my original post (including the fact that a snapshot does not have the issue).

I am open to all suggestions as to troubleshooting this further!

Thank you,
Jeremy

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Maxim Boguk
Date:



> ​About gdb bt - it's tricky because it is mission critical master db of
> huge project.
> I'll will try promote backup replica and check is issue persist there and
> if yes - we will have our playground for a while, but it will require
> sometime to arrange.​

You should be ok to just bt that in the running cluster, but I
definitely understand if you don't want to do that...  I'd appreciate if
you set up the a playground, because this seems like something that'll
reappear.

Greetings,

Andres Freund

​Ok this issue reproducible on promoted replica.
So now I have my playground.
Will provide gdb info in few hours.​


--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Maxim Boguk
Date:


On Thu, May 24, 2018 at 12:38 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:



> ​About gdb bt - it's tricky because it is mission critical master db of
> huge project.
> I'll will try promote backup replica and check is issue persist there and
> if yes - we will have our playground for a while, but it will require
> sometime to arrange.​

You should be ok to just bt that in the running cluster, but I
definitely understand if you don't want to do that...  I'd appreciate if
you set up the a playground, because this seems like something that'll
reappear.

Greetings,

Andres Freund

​bt full :

​#0  errmsg_internal (fmt=0x555b62e6eb70 "found xmin %u from before relfrozenxid %u") at /build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/utils/error/elog.c:828
        edata = <optimized out>
        __func__ = "errmsg_internal"
#1  0x0000555b62ad1cb9 in heap_prepare_freeze_tuple (tuple=<optimized out>, relfrozenxid=relfrozenxid@entry=248720453, relminmxid=relminmxid@entry=53644256, cutoff_xid=3485221679,
    cutoff_multi=<optimized out>, frz=frz@entry=0x555b640d1988, totally_frozen_p=0x7ffca32c0e90 "\001")
    at /build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/access/heap/heapam.c:6685
        changed = <optimized out>
        freeze_xmax = 0 '\000'
        xid = <optimized out>
        totally_frozen = <optimized out>
        __func__ = "heap_prepare_freeze_tuple"
#2  0x0000555b62bfd2b5 in lazy_scan_heap (aggressive=0 '\000', nindexes=2, Irel=0x555b64095948, vacrelstats=<optimized out>, options=26, onerel=0x555b64029498)
    at /build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/commands/vacuumlazy.c:1090
        tuple_totally_frozen = 1 '\001'
        itemid = 0x7f2b7bc5d4fc
        buf = 18138842
        page = <optimized out>
        offnum = 26
        maxoff = 26
        hastup = 1 '\001'
        nfrozen = 0
        freespace = <optimized out>
        all_frozen = 1 '\001'
        tupgone = 0 '\000'
        prev_dead_count = 0
        all_visible_according_to_vm = 0 '\000'
        all_visible = 1 '\001'
        has_dead_tuples = 0 '\000'
        visibility_cutoff_xid = 3490221678
        relname = 0x555b640296a8 "pg_authid"
        relminmxid = 53644256
        tups_vacuumed = 0
        indstats = 0x555b64095928
        relfrozenxid = 248720453
        skipping_blocks = 0 '\000'
        frozen = 0x555b640d1988
        initprog_val = {1, 2, 582}
        vacuumed_pages = 0
        num_tuples = 80
        nkeep = 0
        ru0 = {tv = {tv_sec = 1527175061, tv_usec = 739743}, ru = {ru_utime = {tv_sec = 0, tv_usec = 60000}, ru_stime = {tv_sec = 1, tv_usec = 416000}, {ru_maxrss = 9704,
              __ru_maxrss_word = 9704}, {ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 33982,
              __ru_minflt_word = 33982}, {ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 0, __ru_inblock_word = 0}, {ru_oublock = 8,
              __ru_oublock_word = 8}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, {ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 0, __ru_nsignals_word = 0}, {ru_nvcsw = 38446,
              __ru_nvcsw_word = 38446}, {ru_nivcsw = 2, __ru_nivcsw_word = 2}}}
        vmbuffer = 8763411
        empty_pages = 0
        nunused = 0
        i = <optimized out>
        next_unskippable_block = <optimized out>
        buf = {data = 0x1 <error: Cannot access memory at address 0x1>, len = -1557393520, maxlen = 32764, cursor = -1557393616}
        nblocks = 2
        blkno = <optimized out>
        tuple = {t_len = 144, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 1}, ip_posid = 26}, t_tableOid = 1260, t_data = 0x7f2b7bc5e6b0}
        initprog_index = {0, 1, 5}
#3  lazy_vacuum_rel (onerel=onerel@entry=0x555b64029498, options=options@entry=1, params=params@entry=0x7ffca32c11b0, bstrategy=<optimized out>)
    at /build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/commands/vacuumlazy.c:253
        vacrelstats = <optimized out>
        Irel = 0x555b64095948
        nindexes = 2
        ru0 = {tv = {tv_sec = 93850993708032, tv_usec = 23936}, ru = {ru_utime = {tv_sec = 128, tv_usec = 93851007694584}, ru_stime = {tv_sec = 0, tv_usec = 0}, {ru_maxrss = 140723046059424,
              __ru_maxrss_word = 140723046059424}, {ru_ixrss = 93850989549501, __ru_ixrss_word = 93850989549501}, {ru_idrss = 16, __ru_idrss_word = 16}, {ru_isrss = 93851007694584,
              __ru_isrss_word = 93851007694584}, {ru_minflt = 140723046059472, __ru_minflt_word = 140723046059472}, {ru_majflt = 93850989559601, __ru_majflt_word = 93850989559601}, {
              ru_nswap = 93850994010032, __ru_nswap_word = 93850994010032}, {ru_inblock = 139833259949608, __ru_inblock_word = 139833259949608}, {ru_oublock = 93850994010032,
              __ru_oublock_word = 93850994010032}, {ru_msgsnd = 93851008239032, __ru_msgsnd_word = 93851008239032}, {ru_msgrcv = 140723046059904, __ru_msgrcv_word = 140723046059904}, {
              ru_nsignals = 93850989313526, __ru_nsignals_word = 93850989313526}, {ru_nvcsw = 345176855416, __ru_nvcsw_word = 345176855416}, {ru_nivcsw = 140723046059551,
              __ru_nivcsw_word = 140723046059551}}}
        starttime = 0
        secs = 1527175061
        usecs = 144
        read_rate = <optimized out>
        write_rate = <optimized out>
        aggressive = 0 '\000'
        scanned_all_unfrozen = <optimized out>
        xidFullScanLimit = 3440221679
        mxactFullScanLimit = 4204418425
        new_rel_pages = <optimized out>
        new_rel_tuples = <optimized out>
        new_rel_allvisible = 2737573328
        new_live_tuples = <optimized out>
        new_frozen_xid = <optimized out>
        new_min_multi = <optimized out>
        __func__ = "lazy_vacuum_rel"
#4  0x0000555b62bfa54d in vacuum_rel (relid=relid@entry=1260, relation=relation@entry=0x555b64047100, options=options@entry=1, params=params@entry=0x7ffca32c11b0)
    at /build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/commands/vacuum.c:1391
        lmode = 4
        onerel = 0x555b64029498
        onerelid = {relId = 1260, dbId = 0}
        toast_relid = 0
        save_userid = 10
        save_sec_context = 0
        save_nestlevel = 2
        __func__ = "vacuum_rel"
#5  0x0000555b62bfb545 in vacuum (options=1, relation=0x555b64047100, relid=relid@entry=0, params=params@entry=0x7ffca32c11b0, va_cols=0x0, bstrategy=<optimized out>, bstrategy@entry=0x0,
    isTopLevel=1 '\001') at /build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/commands/vacuum.c:303
        relid = 1260
        cur = 0x555b640998b8
        save_exception_stack = 0x7ffca32c1590
        save_context_stack = 0x0
        local_sigjmp_buf = {{__jmpbuf = {93851008735448, -4513934124599105488, 93850994009984, 93851008619616, 93851008619616, 93851008398480, -4513934124655728592, -7776366411743385552},
            __mask_was_saved = 0, __saved_mask = {__val = {0, 0, 0, 0, 1, 1, 1, 3473119737632784380, 139833388464563, 140723046060384, 12, 140723046060912, 93850990837571,
                4049721212982268568, 1, 140723046060432}}}}
        stmttype = <optimized out>
        in_outer_xact = 0 '\000'
        use_own_xacts = 1 '\001'
        relations = 0x555b640998d8
        in_vacuum = 1 '\001'
        __func__ = "vacuum"
#6  0x0000555b62bfb8ca in ExecVacuum (vacstmt=vacstmt@entry=0x555b64047150, isTopLevel=isTopLevel@entry=1 '\001')
    at /build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/commands/vacuum.c:121
        params = {freeze_min_age = -1, freeze_table_age = -1, multixact_freeze_min_age = -1, multixact_freeze_table_age = -1, is_wraparound = 0 '\000', log_min_duration = -1}
#7  0x0000555b62d27787 in standard_ProcessUtility (parsetree=0x555b64047150, queryString=0x555b64046688 "vacuum pg_catalog.pg_authid;", context=<optimized out>, params=0x0,
    dest=0x555b64047490, completionTag=0x7ffca32c19c0 "") at /build/postgresql-9.6-9VrjLn/postgresql-9.6-9.6.8/build/../src/backend/tcop/utility.c:655
        stmt = 0x555b64047150
        isTopLevel = <optimized out>
        __func__ = "standard_ProcessUtility"


--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"




--
Maxim Boguk
Senior Postgresql DBA
http://dataegret.com/

Phone RU: +7  985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Alvaro Herrera
Date:
Hmm .. surely

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5016181fd7..5d7fa1fb45 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6690,7 +6690,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
     xid = HeapTupleHeaderGetXmin(tuple);
     xmin_frozen = ((xid == FrozenTransactionId) ||
                    HeapTupleHeaderXminFrozen(tuple));
-    if (TransactionIdIsNormal(xid))
+    if (!xmin_frozen && TransactionIdIsNormal(xid))
     {
         if (TransactionIdPrecedes(xid, relfrozenxid))
             ereport(ERROR,


??


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
On 2018-05-24 13:08:53 -0400, Alvaro Herrera wrote:
> Hmm .. surely
> 
> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> index 5016181fd7..5d7fa1fb45 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -6690,7 +6690,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
>      xid = HeapTupleHeaderGetXmin(tuple);
>      xmin_frozen = ((xid == FrozenTransactionId) ||
>                     HeapTupleHeaderXminFrozen(tuple));
> -    if (TransactionIdIsNormal(xid))
> +    if (!xmin_frozen && TransactionIdIsNormal(xid))
>      {
>          if (TransactionIdPrecedes(xid, relfrozenxid))
>              ereport(ERROR,
> 
> 
> ??

I don't think that's necesary - HeapTupleHeaderGetXmin() returns
FrozenTransactionId if the tuple is frozen (note the
HeapTupleHeaderXminFrozen() within).

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
On 2018-05-24 13:30:54 -0700, Andres Freund wrote:
> On 2018-05-24 13:08:53 -0400, Alvaro Herrera wrote:
> > Hmm .. surely
> > 
> > diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> > index 5016181fd7..5d7fa1fb45 100644
> > --- a/src/backend/access/heap/heapam.c
> > +++ b/src/backend/access/heap/heapam.c
> > @@ -6690,7 +6690,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
> >      xid = HeapTupleHeaderGetXmin(tuple);
> >      xmin_frozen = ((xid == FrozenTransactionId) ||
> >                     HeapTupleHeaderXminFrozen(tuple));
> > -    if (TransactionIdIsNormal(xid))
> > +    if (!xmin_frozen && TransactionIdIsNormal(xid))
> >      {
> >          if (TransactionIdPrecedes(xid, relfrozenxid))
> >              ereport(ERROR,
> > 
> > 
> > ??
> 
> I don't think that's necesary - HeapTupleHeaderGetXmin() returns
> FrozenTransactionId if the tuple is frozen (note the
> HeapTupleHeaderXminFrozen() within).

FWIW, even if that weren't the case: a) there'd be a lot more wrong with
this routine imo. b) some of the tuples affected clearly weren't
frozen...

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Alvaro Herrera
Date:
On 2018-May-24, Andres Freund wrote:

> On 2018-05-24 13:08:53 -0400, Alvaro Herrera wrote:
> > Hmm .. surely

> >      xid = HeapTupleHeaderGetXmin(tuple);
> >      xmin_frozen = ((xid == FrozenTransactionId) ||
> >                     HeapTupleHeaderXminFrozen(tuple));
> > -    if (TransactionIdIsNormal(xid))
> > +    if (!xmin_frozen && TransactionIdIsNormal(xid))

> I don't think that's necesary - HeapTupleHeaderGetXmin() returns
> FrozenTransactionId if the tuple is frozen (note the
> HeapTupleHeaderXminFrozen() within).

Ah, yeah ... I probably thought about this when writing it and removed
it for that reason.

BTW I think the definition of HeapTupleHeaderXminFrozen is seriously
confusing, by failing to return true if the xmin is numerically
FrozenXid (which it'll be if the database was pg_upgraded).  I wonder
about this one in HeapTupleSatisfiesMVCC:

    else
    {
        /* xmin is committed, but maybe not according to our snapshot */
        if (!HeapTupleHeaderXminFrozen(tuple) &&
            XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
            return false;        /* treat as still in progress */
    }

I think this is not a bug only because XidInMVCCSnapshot does this

    /* Any xid < xmin is not in-progress */
    if (TransactionIdPrecedes(xid, snapshot->xmin))
        return false;

which makes it return false for FrozenXid, but seems more of an accident
than explicitly designed.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Alvaro Herrera
Date:
On 2018-May-24, Andres Freund wrote:

> FWIW, even if that weren't the case: a) there'd be a lot more wrong with
> this routine imo. b) some of the tuples affected clearly weren't
> frozen...

Right.

BTW is it just a coincidence or are all the affected tables pg_authid?
Maybe the problem is shared relations ..?  Maybe the fact that they have
separate relfrozenxid (!?) in different databases?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
On 2018-05-24 16:49:40 -0400, Alvaro Herrera wrote:
> BTW is it just a coincidence or are all the affected tables pg_authid?
> Maybe the problem is shared relations ..?  Maybe the fact that they have
> separate relfrozenxid (!?) in different databases?

Yes, that appears to be part of the problem. I've looked at a number of
shared relation related codepaths, but so far my theory is that the
relcache is wrong. Note that one of the reports in this thread clearly
had a different relcache relfrozenxid than in the catalog.

Then there's also:
http://archives.postgresql.org/message-id/1527193504642.36340%40amazon.com

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Jeremy Finzel
Date:


BTW is it just a coincidence or are all the affected tables pg_authid?
Maybe the problem is shared relations ..?  Maybe the fact that they have
separate relfrozenxid (!?) in different databases?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
 We have had this problem twice and both times on both pg_authid and pg_auth_members. Thanks,

Jeremy 

Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Alvaro Herrera
Date:
On 2018-May-24, Andres Freund wrote:

> On 2018-05-24 16:49:40 -0400, Alvaro Herrera wrote:
> > BTW is it just a coincidence or are all the affected tables pg_authid?
> > Maybe the problem is shared relations ..?  Maybe the fact that they have
> > separate relfrozenxid (!?) in different databases?
> 
> Yes, that appears to be part of the problem. I've looked at a number of
> shared relation related codepaths, but so far my theory is that the
> relcache is wrong. Note that one of the reports in this thread clearly
> had a different relcache relfrozenxid than in the catalog.

Hmm ... is that because they read the values on different databases?
Are you referring to the reports by Maxim Boguk?  I see one value from
template1, another value from template0.

> Then there's also:
> http://archives.postgresql.org/message-id/1527193504642.36340%40amazon.com

ah, so deleting the relcache file makes the problem to go away?  That's
definitely pretty strange.  I see no reason for the value in relcache to
become out of step with the catalogued value in the same database ... I
don't think we transmit in any way values of one database to another.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
On 2018-05-24 17:13:11 -0400, Alvaro Herrera wrote:
> On 2018-May-24, Andres Freund wrote:
> 
> > On 2018-05-24 16:49:40 -0400, Alvaro Herrera wrote:
> > > BTW is it just a coincidence or are all the affected tables pg_authid?
> > > Maybe the problem is shared relations ..?  Maybe the fact that they have
> > > separate relfrozenxid (!?) in different databases?
> > 
> > Yes, that appears to be part of the problem. I've looked at a number of
> > shared relation related codepaths, but so far my theory is that the
> > relcache is wrong. Note that one of the reports in this thread clearly
> > had a different relcache relfrozenxid than in the catalog.
> 
> Hmm ... is that because they read the values on different databases?
> Are you referring to the reports by Maxim Boguk?  I see one value from
> template1, another value from template0.

I was referring to
https://www.postgresql.org/message-id/20180522193007.4bi5oluqb7c72oq2@alap3.anarazel.de
but you're right, it's possible that that's just caused by time passing
or different databases.

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Andres Freund
Date:
On 2018-05-24 16:46:24 -0400, Alvaro Herrera wrote:
> On 2018-May-24, Andres Freund wrote:
> 
> > On 2018-05-24 13:08:53 -0400, Alvaro Herrera wrote:
> > > Hmm .. surely
> 
> > >      xid = HeapTupleHeaderGetXmin(tuple);
> > >      xmin_frozen = ((xid == FrozenTransactionId) ||
> > >                     HeapTupleHeaderXminFrozen(tuple));
> > > -    if (TransactionIdIsNormal(xid))
> > > +    if (!xmin_frozen && TransactionIdIsNormal(xid))
> 
> > I don't think that's necesary - HeapTupleHeaderGetXmin() returns
> > FrozenTransactionId if the tuple is frozen (note the
> > HeapTupleHeaderXminFrozen() within).
> 
> Ah, yeah ... I probably thought about this when writing it and removed
> it for that reason.
> 
> BTW I think the definition of HeapTupleHeaderXminFrozen is seriously
> confusing, by failing to return true if the xmin is numerically
> FrozenXid (which it'll be if the database was pg_upgraded).  I wonder
> about this one in HeapTupleSatisfiesMVCC:

I suggest raising this on -hackers. I agree that it's unfortunate.

Greetings,

Andres Freund


Re: found xmin from before relfrozenxid on pg_catalog.pg_authid

From
Peter Geoghegan
Date:
On Fri, May 25, 2018 at 1:38 PM, Andres Freund <andres@anarazel.de> wrote:
>> BTW I think the definition of HeapTupleHeaderXminFrozen is seriously
>> confusing, by failing to return true if the xmin is numerically
>> FrozenXid (which it'll be if the database was pg_upgraded).  I wonder
>> about this one in HeapTupleSatisfiesMVCC:
>
> I suggest raising this on -hackers. I agree that it's unfortunate.

I wonder if BootstrapTransactionId also needs to be considered here.

-- 
Peter Geoghegan