Thread: ERROR: found multixact from before relminmxid
Hi,
Some time ago, I started seeing these errors frequently in the logs after autovacuum
ran on some tables (PG 9.6). VACUUM FULL or CLUSTER on these tables shows the same error and does not complete (as shown by a table-bloat query).
Then I did a full dump/restore into a new version (10.2) and everything was OK for a couple of months. Now I have errors like this again:
db1=# cluster pc44t;
ERROR: found multixact 134100944 from before relminmxid 192042633
Like before, the only way to make the errors disappear is to dump/reload the whole table.
Thanks for any help.
Best regards,
Alexandre
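For context, the relminmxid in that error is the table's own multixact horizon from pg_class. A quick catalog query along these lines shows how it compares with the database-wide horizons (a sketch only; 'pc44t' is just the table from the error above, and mxid_age()/age() are the standard horizon-age functions):

-- Compare the table's multixact/xid horizons with the database-wide ones.
SELECT c.relname,
       c.relminmxid,   mxid_age(c.relminmxid)  AS minmxid_age,
       c.relfrozenxid, age(c.relfrozenxid)     AS frozenxid_age,
       d.datminmxid,   d.datfrozenxid
  FROM pg_class c, pg_database d
 WHERE c.relname = 'pc44t'
   AND d.datname = current_database();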
On 04/06/2018 02:09 AM, Alexandre Arruda wrote:
> db1=# cluster pc44t;
> ERROR: found multixact 134100944 from before relminmxid 192042633
>
> Like before, the only way to make the errors disappear is to
> dump/reload the whole table.

That's going to be hard, unless you still have the cluster around.

This surely seems like some sort of data corruption issue, but without
being able to inspect the data pages it's nearly impossible to determine
what went wrong.

We'd also need more information about what happened to the hardware and
cluster before the issues started to appear - crashes, hardware issues.
And so on.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
2018-04-06 9:39 GMT-03:00 Tomas Vondra <tomas.vondra@2ndquadrant.com>:
On 04/06/2018 02:09 AM, Alexandre Arruda wrote:
> Hi,
>
> Some time ago, I had this errors frequently showed in logs after some
> autovacuum in some tables(pg 9.6). VACUUM FULL or CLUSTER in this tables
> show the same and not complete the tasks (showed by some table bloat
> select).
> Then, I did a full dump/restore into a new version (10.2) and everything
> is ok for a couple of months. Now, I have errors like this again:
>
> db1=# cluster pc44t;
>
> ERROR: found multixact 134100944 from before relminmxid 192042633
>
>
>
> Like before, the only way to make the errors to desapear is by
> dump/reload the whole table.
>
>
> Thanks for any help.
>
That's going to be hard, unless you still have the cluster around.
This surely seems like some sort of data corruption issue, but without
being able to inspect the data pages it's nearly impossible to determine
what went wrong.
We'd also need more information about what happened to the hardware and
cluster before the issues started to appear - crashes, hardware issues.
And so on.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Tomas,
The old cluster is gone, unfortunately.
This server has 128 GB of ECC RAM and dual redundant HBA Fibre Channel connections to a storage array with RAID 6, and I don't have (apparently) any errors reported.
Yesterday I did a test with one table: some sum aggregates, count(*), etc., then a dump/reload, and repeated the tests; the query results are the same. As for the vacuum problem,
it disappeared.
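That kind of before/after validation can also be expressed entirely in SQL as a rough content fingerprint (a sketch; hashtext() is the built-in hash function, and 'pc44t' again stands in for whichever table is being checked):

-- Run on the production database and again on the reloaded copy;
-- matching counts and hash sums suggest the row data survived intact.
SELECT count(*)                        AS row_count,
       sum(hashtext(t::text)::bigint)  AS content_hash
  FROM pc44t AS t;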
On 04/06/2018 04:29 PM, Alexandre Arruda wrote:
> The old cluster is gone, unfortunately.
>
> This server has 128 GB of ECC RAM and dual redundant HBA Fibre Channel
> connections to a storage array with RAID 6, and I don't have
> (apparently) any errors reported.
>
> Yesterday I did a test with one table: some sum aggregates, count(*),
> etc., then a dump/reload, and repeated the tests; the query results
> are the same. As for the vacuum problem, it disappeared.

I'm not sure I understand correctly. So you can reproduce the issue? If
yes, how can you share the scripts/data you use (and how large is it)?
If we could reproduce it locally, it would make the investigation much
easier.

BTW you mentioned you're using PostgreSQL 9.6 - which minor version,
specifically?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
2018-04-06 13:11 GMT-03:00 Tomas Vondra <tomas.vondra@2ndquadrant.com>:
On 04/06/2018 04:29 PM, Alexandre Arruda wrote:
> 2018-04-06 9:39 GMT-03:00 Tomas Vondra <tomas.vondra@2ndquadrant.com
> <mailto:tomas.vondra@2ndquadrant.com>>:
>
>
> On 04/06/2018 02:09 AM, Alexandre Arruda wrote:
> > Hi,
> >
> > Some time ago, I had this errors frequently showed in logs after some
> > autovacuum in some tables(pg 9.6). VACUUM FULL or CLUSTER in this tables
> > show the same and not complete the tasks (showed by some table bloat
> > select).
> > Then, I did a full dump/restore into a new version (10.2) and everything
> > is ok for a couple of months. Now, I have errors like this again:
> >
> > db1=# cluster pc44t;
> >
> > ERROR: found multixact 134100944 from before relminmxid 192042633
> >
> >
> >
> > Like before, the only way to make the errors to desapear is by
> > dump/reload the whole table.
> >
> >
> > Thanks for any help.
> >
>
> That's going to be hard, unless you still have the cluster around.
>
> This surely seems like some sort of data corruption issue, but without
> being able to inspect the data pages it's nearly impossible to determine
> what went wrong.
>
> We'd also need more information about what happened to the hardware and
> cluster before the issues started to appear - crashes, hardware issues.
> And so on.
>
> regards
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>
> Hi Tomas,
> The old cluster are gone, unfortunatly.
>
> This server is a 128GB ECC RAM with a dual redundant hba fiber channel
> connect to a sotorage with Raid 6 and I don't have (apparently) any
> errors reported.
> Yesterday I did a test with one table: some sum aggragates, count(*),
> etc, then dump/reload and repeat the tests the results (of querys) are
> the same, regarding the vacuum problem
> thats disapeared.
>
I'm not sure I understand correctly. So you can reproduce the issue? If
yes, how can you share the scripts/data you use (and how large is it)?
If we could reproduce it locally, it would make the investigation much
easier.
BTW you mentioned you're using PostgreSQL 9.6 - which minor version,
specifically?
regards
Hi Tomas,
No, I can't reproduce it. What I did was a simple way to "validate" the current table data
and see whether a dump/reload preserves it. The old PostgreSQL was 9.6.5. The problem has now returned in a new 10.3 installation.
Is there a way to correct these tables without a dump/reload?
I'm thinking of reinstalling the cluster with initdb --data-checksums, but I'm afraid of a severe performance impact.
Best regards,
Alexandre
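On the checksums question: in 9.6/10 they can only be turned on at initdb time, but it is easy to confirm whether an existing cluster already has them (a minimal check; the same fact appears as "Data page checksum version" in pg_controldata output):

-- "on" means the cluster was initialized with --data-checksums.
SHOW data_checksums;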
On 04/09/2018 01:49 PM, Alexandre Arruda wrote:
> No, I can't reproduce it. What I did was a simple way to "validate" the
> current table data and see whether a dump/reload preserves it. The old
> PostgreSQL was 9.6.5. The problem has now returned in a new 10.3
> installation.

I'm confused. First you say you can't reproduce the issue, then you say
it got back on a new installation. So which is it?

> Is there a way to correct these tables without a dump/reload?
>
> I'm thinking of reinstalling the cluster with initdb --data-checksums,
> but I'm afraid of a severe performance impact.

The performance impact should be non-measurable for most workloads. It
might be a couple of percent on certain workloads/hardware, but that's
probably not your case.

The bigger question is whether this can actually detect the issue. If
it's due to a storage issue, then perhaps yes. But if you only see
multixact issues consistently and nothing else, it might easily be a
PostgreSQL bug (in which case the checksums will be correct).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Apr 9, 2018 at 7:01 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> The bigger question is whether this can actually detect the issue. If
> it's due to a storage issue, then perhaps yes. But if you only see
> multixact issues consistently and nothing else, it might easily be a
> PostgreSQL bug (in which case the checksums will be correct).

You can also run amcheck. Get the version targeting earlier Postgres
releases off Github (there are packages for most Linux systems). This
can verify that the heap is consistent with indexes.

--
Peter Geoghegan
2018-04-09 18:23 GMT-03:00 Peter Geoghegan <pg@bowt.ie>:
> You can also run amcheck. Get the version targeting earlier Postgres
> releases off Github (there are packages for most Linux systems). This
> can verify that the heap is consistent with indexes.

Hi Peter,

I ran amcheck on all indexes of a table and I only get empty returns.

I did some tests:

production=# reindex table fn06t;
REINDEX
production=# vacuum verbose fn06t;
INFO:  vacuuming "public.fn06t"
ERROR:  found multixact 76440919 from before relminmxid 122128619

production=# select count(*),sum(fn06vrtitu) from fn06t;
  count  |       sum
---------+-----------------
 2592363 | 4833603148.1172
(1 row)

# createdb freshdb
# pg_dump -t fn06t production | psql freshdb

freshdb=# select count(*),sum(fn06vrtitu) from fn06t;
  count  |       sum
---------+-----------------
 2592363 | 4833603148.1172
(1 row)

psql -c "select * from fn06t order by <PKEY>" production > multixact_error_fn06t.txt
psql -c "select * from fn06t order by <PKEY>" freshdb > freshdb_fn06t.txt

diff reports no differences between the files, and md5sum produces the same hash.

The question is: if a dump can read the table "correctly", why does a
CLUSTER/VACUUM FULL stop while reading it?

Best regards,

Alexandre
On Mon, Apr 9, 2018 at 5:55 PM, Alexandre Arruda <adaldeia@gmail.com> wrote:
> I ran amcheck on all indexes of a table and I only get empty returns.

Did you try doing so with the "heapallindexed" option? That's what's
really interesting here.

--
Peter Geoghegan
2018-04-09 22:16 GMT-03:00 Peter Geoghegan <pg@bowt.ie>:
> Did you try doing so with the "heapallindexed" option? That's what's
> really interesting here.

production=# select bt_index_parent_check('fn06t_pkey',true);
 bt_index_parent_check
-----------------------

(1 row)

production=# select bt_index_parent_check('ifn06t1',true);
 bt_index_parent_check
-----------------------

(1 row)

production=# select bt_index_parent_check('ifn06t4',true);
 bt_index_parent_check
-----------------------

(1 row)

(... and all other indexes return null too)

I tried with bt_index_check too. Same results.

regards,

Alexandre
On Mon, Apr 9, 2018 at 6:56 PM, Alexandre Arruda <adaldeia@gmail.com> wrote:
> (... and all other indexes return null too)
>
> I tried with bt_index_check too. Same results.

That's interesting, because it tells me that you have a table that
appears to not be corrupt, despite the CLUSTER error. Also, the error
itself comes from sanity checking added to MultiXact freezing fairly
recently, in commit 699bf7d0.

You didn't say anything about regular VACUUM being broken. Do you find
that it works without any apparent issue?

I have a suspicion that this could be a subtle bug in CLUSTER/freezing.
The only heap_freeze_tuple() caller is code used by CLUSTER, so it's not
that hard to imagine a MultiXact freezing bug that is peculiar to
CLUSTER. Though I haven't thought about it in much detail.

--
Peter Geoghegan
On April 9, 2018 7:51:19 PM PDT, Peter Geoghegan <pg@bowt.ie> wrote:
> I have a suspicion that this could be a subtle bug in CLUSTER/freezing.
> The only heap_freeze_tuple() caller is code used by CLUSTER, so it's not
> that hard to imagine a MultiXact freezing bug that is peculiar to
> CLUSTER. Though I haven't thought about it in much detail.

I've not followed this thread. Possible it's the overeager check for pg
upgraded tuples from before 9.3 that Alvaro fixed recently?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Mon, Apr 9, 2018 at 7:53 PM, Andres Freund <andres@anarazel.de> wrote:
> I've not followed this thread. Possible it's the overeager check for pg
> upgraded tuples from before 9.3 that Alvaro fixed recently?

I was aware of commit 477ad05e, which must be what you're referring to.
I don't think that that's what this is, since this error occurs within
heap_freeze_tuple() -- it's not the over-enforced
HEAP_XMAX_IS_LOCKED_ONLY() error within heap_prepare_freeze_tuple().
And, because this database wasn't pg_upgraded.

I should wait until tomorrow before doing any further analysis, though.

--
Peter Geoghegan
2018-04-09 23:51 GMT-03:00 Peter Geoghegan <pg@bowt.ie>:
> You didn't say anything about regular VACUUM being broken. Do you find
> that it works without any apparent issue?

Hi Peter,

Actually, I first noticed the problem in the logs, from autovacuum:

2018-04-10 08:22:15.385 -03 [55477] CONTEXT:  automatic vacuum of
table "production.public.fn06t"
2018-04-10 08:22:16.815 -03 [55477] ERROR:  found multixact 68834765
from before relminmxid 73262006

production=# vacuum analyze verbose fn06t;
INFO:  vacuuming "public.fn06t"
ERROR:  found multixact 76440919 from before relminmxid 122128619
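Other relations stuck in the same state can usually be spotted from the catalogs and the stats views, since autovacuum keeps retrying and failing on them (a sketch; the ordering and LIMIT are arbitrary, and old last_autovacuum timestamps are only a hint, not proof of this error):

-- Tables with the oldest multixact horizons and stalled autovacuum
-- timestamps are the likely candidates for the same failure.
SELECT c.oid::regclass          AS table_name,
       mxid_age(c.relminmxid)   AS minmxid_age,
       age(c.relfrozenxid)      AS frozenxid_age,
       s.last_autovacuum,
       s.last_vacuum
  FROM pg_class c
  LEFT JOIN pg_stat_all_tables s ON s.relid = c.oid
 WHERE c.relkind IN ('r', 't', 'm')
 ORDER BY mxid_age(c.relminmxid) DESC
 LIMIT 20;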
On Tue, Apr 10, 2018 at 4:31 AM, Alexandre Arruda <adaldeia@gmail.com> wrote:
> Actually, I first noticed the problem in the logs, from autovacuum:
>
> 2018-04-10 08:22:15.385 -03 [55477] CONTEXT:  automatic vacuum of
> table "production.public.fn06t"
> 2018-04-10 08:22:16.815 -03 [55477] ERROR:  found multixact 68834765
> from before relminmxid 73262006
>
> production=# vacuum analyze verbose fn06t;
> INFO:  vacuuming "public.fn06t"
> ERROR:  found multixact 76440919 from before relminmxid 122128619

Do you think that CLUSTER was run before regular VACUUM/autovacuum
showed this error, though?

Have you noticed any data loss? Things look okay when you do your dump
+ restore, right? The problem, as far as you know, is strictly that
CLUSTER + VACUUM refuse to finish/raise these multixactid errors?

--
Peter Geoghegan
On 2018-04-10 08:31:20 -0300, Alexandre Arruda wrote:
> Actually, I first noticed the problem in the logs, from autovacuum:
>
> 2018-04-10 08:22:15.385 -03 [55477] CONTEXT:  automatic vacuum of
> table "production.public.fn06t"
> 2018-04-10 08:22:16.815 -03 [55477] ERROR:  found multixact 68834765
> from before relminmxid 73262006
>
> production=# vacuum analyze verbose fn06t;
> INFO:  vacuuming "public.fn06t"
> ERROR:  found multixact 76440919 from before relminmxid 122128619

What does the function in
https://www.postgresql.org/message-id/20180319181723.ugaf7hfkluqyos5d@alap3.anarazel.de
say about your table?

Could you post pg_controldata output and
SELECT * FROM pg_class WHERE oid = 'public.fn06t'::regclass;
?

Greetings,

Andres Freund
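The function linked above is the authoritative check. Purely to illustrate the idea, a pageinspect-based scan for tuples whose xmax is a multixact older than relminmxid could look roughly like this (a sketch, not the linked function; 0x1000 is assumed to be the HEAP_XMAX_IS_MULTI infomask bit and 0x0800 HEAP_XMAX_INVALID, wraparound is ignored, the table name is an example, and reading every page of a large table is slow):

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Walk the table's pages and flag tuples whose xmax is a multixact id
-- (infomask bit 0x1000, not marked invalid) older than relminmxid.
SELECT blkno, h.lp, h.t_xmax
  FROM generate_series(0, pg_relation_size('fn06t') / 8192 - 1) AS blkno,
       LATERAL heap_page_items(get_raw_page('fn06t', blkno::int)) AS h,
       pg_class c
 WHERE c.oid = 'fn06t'::regclass
   AND (h.t_infomask & x'1000'::int) <> 0    -- xmax is a multixact
   AND (h.t_infomask & x'0800'::int) = 0     -- xmax not marked invalid
   AND h.t_xmax::text::bigint < c.relminmxid::text::bigint;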
2018-04-10 19:53 GMT-03:00 Andres Freund <andres@anarazel.de>:
> Could you post pg_controldata output and
> SELECT * FROM pg_class WHERE oid = 'public.fn06t'::regclass;
> ?

pg_control version number:            1002
Catalog version number:               201707211
Database system identifier:           6513284223567708119
Database cluster state:               in production
pg_control last modified:             Ter 10 Abr 2018 23:51:56 BRT
Latest checkpoint location:           183/E4849618
Prior checkpoint location:            183/E19F8A80
Latest checkpoint's REDO location:    183/E202C430
Latest checkpoint's REDO WAL file:    0000000100000183000000E2
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:494263452
Latest checkpoint's NextOID:          1816390854
Latest checkpoint's NextMultiXactId:  256298856
Latest checkpoint's NextMultiOffset:  529468697
Latest checkpoint's oldestXID:        103399998
Latest checkpoint's oldestXID's DB:   866864162
Latest checkpoint's oldestActiveXID:  494263450
Latest checkpoint's oldestMultiXid:   73262006
Latest checkpoint's oldestMulti's DB: 866864162
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint:            Ter 10 Abr 2018 23:50:14 BRT
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    replica
wal_log_hints setting:                off
max_connections setting:              600
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0
Mock authentication nonce:            a8eb07ea85c4bbeaf202020d036277b276bda47ef55c1456723ec2b3c40386b1

SELECT * FROM pg_class WHERE oid = 'public.fn06t'::regclass;
-[ RECORD 1 ]-------+------------------------------------------------------
relname             | fn06t
relnamespace        | 2200
reltype             | 866874136
reloftype           | 0
relowner            | 10
relam               | 0
relfilenode         | 1092835324
reltablespace       | 0
relpages            | 218797
reltuples           | 2.60181e+06
relallvisible       | 192741
reltoastrelid       | 866874160
relhasindex         | t
relisshared         | f
relpersistence      | p
relkind             | r
relnatts            | 154
relchecks           | 0
relhasoids          | t
relhaspkey          | t
relhasrules         | f
relhastriggers      | t
relhassubclass      | f
relrowsecurity      | f
relforcerowsecurity | f
relispopulated      | t
relreplident        | d
relispartition      | f
relfrozenxid        | 211279862
relminmxid          | 122128619
relacl              | {postgres=arwdDxt/postgres,usuarios=arwdDxt/postgres}
reloptions          |
relpartbound        |
On Tue, Apr 10, 2018 at 7:54 PM, Alexandre Arruda <adaldeia@gmail.com> wrote:
> pg_control version number:            1002

Andres was also asking about his check_rel() function, from
https://www.postgresql.org/message-id/20180319181723.ugaf7hfkluqyos5d@alap3.anarazel.de.
Can you check that out as well, please? You'll need to be able to
install the pageinspect contrib module.

--
Peter Geoghegan
2018-04-10 19:09 GMT-03:00 Peter Geoghegan <pg@bowt.ie>:
> Do you think that CLUSTER was run before regular VACUUM/autovacuum
> showed this error, though?

Yes, because the table was clustered in the old database and
reclustered when it was reloaded into version 10. But the table was not
clustered again after that.

> Have you noticed any data loss? Things look okay when you do your dump
> + restore, right? The problem, as far as you know, is strictly that
> CLUSTER + VACUUM refuse to finish/raise these multixactid errors?

I did not see, apparently, any data loss. A dump/reload into a fresh
database shows the same data as production. I verified this by running
md5sum on files generated by a SELECT statement (ordered by the primary
key) in both databases (prod and test).

The problem is that regular vacuum does not complete. CLUSTER and
VACUUM FULL were attempts to correct the problem. But I'm afraid that
some data could become corrupted over time.
On Wed, Apr 11, 2018 at 8:31 AM, Alexandre Arruda <adaldeia@gmail.com> wrote:
2018-04-10 19:09 GMT-03:00 Peter Geoghegan <pg@bowt.ie>:
> On Tue, Apr 10, 2018 at 4:31 AM, Alexandre Arruda <adaldeia@gmail.com> wrote:
>> Actualy, I first notice the problem in logs by autovacuum:
>>
>> 2018-04-10 08:22:15.385 -03 [55477] CONTEXT: automatic vacuum of
>> table "production.public.fn06t"
>> 2018-04-10 08:22:16.815 -03 [55477] ERROR: found multixact 68834765
>> from before relminmxid 73262006
>>
>> production=# vacuum analyze verbose fn06t;
>> INFO: vacuuming "public.fn06t"
>> ERROR: found multixact 76440919 from before relminmxid 122128619
>
> Do you think that CLUSTER was run before regular VACUUM/autovacuum
> showed this error, though?
Yes, because the table is clustered in the old database and
reclustered when it was reloaded in the version 10.
But the table was not clustered again.
I wonder if we're staring at some race condition in visibility map where a previous vacuum inadvertently skips a page because the visibility map bit is set, thus leaving behind an unfrozen multixid. The page then again becomes !all_visible and the next vacuum then complains about the unfrozen multixid.
Thanks,
Pavan
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On April 10, 2018 11:51:40 PM PDT, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> I wonder if we're staring at some race condition in visibility map
> where a previous vacuum inadvertently skips a page because the
> visibility map bit is set, thus leaving behind an unfrozen multixid.
> The page then again becomes !all_visible and the next vacuum then
> complains about the unfrozen multixid.

Should be possible to write a query using page inspect and freespacemap
to make sure they at least currently match.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
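A cross-check of that kind can be sketched with the pg_visibility extension shipped since 9.6 (the table name is an example; pg_check_visible()/pg_check_frozen() report TIDs that contradict the visibility map, so an empty result means map and heap currently agree):

CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- TIDs of tuples that are not actually visible/frozen although the
-- visibility map says the whole page is.
SELECT * FROM pg_check_visible('fn06t');
SELECT * FROM pg_check_frozen('fn06t');

-- Per-page view of the visibility-map bits themselves.
SELECT blkno, all_visible, all_frozen
  FROM pg_visibility_map('fn06t')
 WHERE all_visible OR all_frozen
 LIMIT 20;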
2018-04-10 19:53 GMT-03:00 Andres Freund <andres@anarazel.de>:
> What does the function in
> https://www.postgresql.org/message-id/20180319181723.ugaf7hfkluqyos5d@alap3.anarazel.de
> say about your table?

Hi Andres,

I installed pageinspect and created the function, but it returns nothing:

select * from check_rel('fn06t') limit 100;
 blockno | lp | xmin
---------+----+------
(0 rows)

Another affected table:

vacuum verbose fn06t4;
INFO:  vacuuming "public.fn06t4"
ERROR:  found multixact 68834765 from before relminmxid 73262006

production=# select * from check_rel('fn06t4') limit 100;
 blockno | lp | xmin
---------+----+------
(0 rows)

Best regards,

Alexandre
Hi,

I think I found the bug, and am about to post a fix for it below:
https://postgr.es/m/20180525203736.crkbg36muzxrjj5e@alap3.anarazel.de.

Greetings,

Andres Freund
On Mon, May 28, 2018 at 16:44, Andres Freund <andres@anarazel.de> wrote:
> I think I found the bug, and am about to post a fix for it below:
> https://postgr.es/m/20180525203736.crkbg36muzxrjj5e@alap3.anarazel.de.

Hi Andres,

At the end of April we did a complete dump/reload of the database into version 10.3.
Today, the problem returned:

production=# vacuum verbose co27t;
INFO:  vacuuming "public.co27t"
ERROR:  found multixact 81704071 from before relminmxid 107665371
production=# vacuum full verbose co27t;
INFO:  vacuuming "public.co27t"
ERROR:  found multixact 105476076 from before relminmxid 107665371
production=# cluster co27t;
ERROR:  found multixact 105476076 from before relminmxid 107665371

But this time, regular vacuum and full/cluster report different
multixact numbers.

Is your patch applicable to this issue, and is it in 10.4?

Best regards,

Alexandre
On Tue, Jun 5, 2018 at 8:42 PM, Alexandre Arruda <adaldeia@gmail.com> wrote:
Hi Andres,
On Mon, May 28, 2018 at 16:44, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> I think I found the bug, and am about to post a fix for it belo
> > > https://postgr.es/m/20180525203736.crkbg36muzxrjj5e@alap3.anarazel.de.
>
> Greetings,
>
> Andres Freund
In end of April we did a complete dump/reload in database to version 10.3.
Today, the problem returns:
production=# vacuum verbose co27t;
INFO: vacuuming "public.co27t"
ERROR: found multixact 81704071 from before relminmxid 107665371
production=# vacuum full verbose co27t;
INFO: vacuuming "public.co27t"
ERROR: found multixact 105476076 from before relminmxid 107665371
production=# cluster co27t;
ERROR: found multixact 105476076 from before relminmxid 107665371
But this time, regular vacuum versus full/cluster are different in
multixact number.
Your patch is applicable to this issue and is in 10.4 ?
Best regards,
Alexandre
We encountered this issue ourselves for the first time on a busy OLTP system. It is at 9.6.8. We found that patching to 9.6.9 on a snapshot of this system did not fix the problem, but I assume that is because the patch in 9.6.9 only prevents the problem moving forward. Is that accurate?
Before we take an outage for this patch, we want as much information as possible on whether this is indeed likely to be our issue.
Like the other people on this thread, amcheck didn't show anything on the snap:
db=# select bt_index_parent_check(indexrelid,true) FROM pg_stat_user_indexes WHERE relname = 'mytable';
bt_index_parent_check
-----------------------
(5 rows)
db=# select bt_index_check(indexrelid,true) FROM pg_stat_user_indexes WHERE relname = 'mytable';
bt_index_check
----------------
(5 rows)
Not surprisingly, I can get the problem to go away in production if I use pg_repack to rebuild the table. But we are interested of course in solving this problem permanently.
Thanks,
Jeremy
On 2018-06-08 12:38:03 -0500, Jeremy Finzel wrote:
> We encountered this issue ourselves for the first time on a busy OLTP
> system. It is at 9.6.8. We found that patching to 9.6.9 on a snapshot of
> this system did not fix the problem, but I assume that is because the
> patch in 9.6.9 only prevents the problem moving forward. Is that
> accurate?

Which patch in 9.6.9 are you referring to? The patch I linked to above
hasn't yet been merged, much less been released.

Unfortunately, on second thought, I don't quite see how it applies to
the cases here (rather than other reports about pg_authid and such). So
there might be a separate issue.

Is there any chance I could get access to a copy of the data? It's very
hard to debug something like this without something that can reproduce
the issue...

Greetings,

Andres Freund
On Fri, Jun 8, 2018 at 1:08 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-06-08 12:38:03 -0500, Jeremy Finzel wrote:
> On Tue, Jun 5, 2018 at 8:42 PM, Alexandre Arruda <adaldeia@gmail.com> wrote:
>
> > Em seg, 28 de mai de 2018 às 16:44, Andres Freund <andres@anarazel.de>
> > escreveu:
> > >
> > > Hi,
> > >
> > > I think I found the bug, and am about to post a fix for it belo
> > > https://postgr.es/m/20180525203736.crkbg36muzxrjj5e@alap3.anarazel.de.
> > >
> > > Greetings,
> > >
> > > Andres Freund
> >
> > Hi Andres,
> >
> > In end of April we did a complete dump/reload in database to version 10.3.
> > Today, the problem returns:
> >
> > production=# vacuum verbose co27t;
> > INFO: vacuuming "public.co27t"
> > ERROR: found multixact 81704071 from before relminmxid 107665371
> > production=# vacuum full verbose co27t;
> > INFO: vacuuming "public.co27t"
> > ERROR: found multixact 105476076 from before relminmxid 107665371
> > production=# cluster co27t;
> > ERROR: found multixact 105476076 from before relminmxid 107665371
> >
> > But this time, regular vacuum versus full/cluster are different in
> > multixact number.
> > Your patch is applicable to this issue and is in 10.4 ?
> >
> > Best regards,
> >
> > Alexandre
> >
> >
> We encountered this issue ourselves for the first time on a busy OLTP
> system. It is at 9.6.8. We found that patching to 9.6.9 on a snapshot of
> this system did not fix the problem, but I assume that is because the patch
> in 9.6.9 only prevents the problem moving forward. Is that accurate?
Which patch in 9.6.9 are you referring to? The patch I linked to above
hasn't yet been merged, much less been released.
No I was referring to this from the documentation:
Avoid spuriously marking pages as all-visible (Dan Wood, Pavan Deolasee, Álvaro Herrera)
This could happen if some tuples were locked (but not deleted). While queries would still function correctly, vacuum would normally ignore such pages, with the long-term effect that the tuples were never frozen. In recent releases this would eventually result in errors such as "found multixact nnnnn from before relminmxid nnnnn".
thanks,
Jeremy
Is there any chance I could get access
to a copy of the data? It's very hard to debug something like this
without something that can reproduce the issue...
It would be very difficult for us to clean and anonymize this data and provide a snapshot publicly. But I am very willing to hook up a debugger on a snapshot that reproduces the issue if you can guide me on what exactly I should look for here.
Thanks,
Jeremy
Hi,

On 2018-06-08 13:30:33 -0500, Jeremy Finzel wrote:
> Avoid spuriously marking pages as all-visible (Dan Wood, Pavan Deolasee,
> Álvaro Herrera)
>
> This could happen if some tuples were locked (but not deleted). While
> queries would still function correctly, vacuum would normally ignore such
> pages, with the long-term effect that the tuples were never frozen. In
> recent releases this would eventually result in errors such as "found
> multixact nnnnn from before relminmxid nnnnn".

Oh, I already had forgotten about that one... It does sound like a
likely explanation of your issue. Hard to tell without investigating a
*lot* more closely than I realistically can do remotely. It seems quite
possible to be the cause - I'd strongly suggest upgrading to prevent
further occurrences, or at least to exclude it as a cause.

Greetings,

Andres Freund
On 6/8/18 8:30 PM, Jeremy Finzel wrote:
> No I was referring to this from the documentation:
>
> Avoid spuriously marking pages as all-visible (Dan Wood, Pavan
> Deolasee, Álvaro Herrera)
>
> This could happen if some tuples were locked (but not deleted).
> While queries would still function correctly, vacuum would normally
> ignore such pages, with the long-term effect that the tuples were
> never frozen. In recent releases this would eventually result in
> errors such as "found multixact nnnnn from before relminmxid nnnnn".

Hello hackers,

We got the same issue on 9.6.10: autovacuum reports the same error when
it tries to freeze a table:

ERROR: found multixact 37009963 from before relminmxid 48606990

autovacuum was not able to freeze this table until Postgres reached 1
million transactions before wraparound and refused any new transactions.

We have an OLTP workload and I noticed queries like SELECT .. FOR SHARE.

I checked durability settings (fsync, fpw) and everything seems fine.
Also, I did not notice any errors on the storage (local SSD with a RAID
controller).

We "solved" it with a dump/restore onto another server; we also kept the
previous cluster to investigate. I want to be sure we encountered the
bug solved in 9.6.9 and that it is not a new one.

If we confirm it, maybe we should advise users to perform an integrity
check? I was surprised that 9.6.9 avoids new appearances of the
corruption but does nothing to check whether a database is already
corrupted.

FYI, we tried to do a pg_dump on the secondary:

pg_dump: Error message from server: ERROR: could not access status of transaction 1323321209
DETAIL: Could not open file "pg_clog/04EE": No such file or directory.

I am surprised this clog file is missing on the secondary (but present
on the primary):

[...] 04EB 04EC 04ED 0CEE 0CEF 0CF0 0CF1 [...]

I ran pg_visibility's checks, and they do not reveal any corruption:

select pg_check_visible(i) from (select nspname ||'.' || relname as i
from pg_class join pg_namespace on pg_class.relnamespace=pg_namespace.oid
WHERE relkind IN ('r','t','m')) as i;
(0 rows)

select pg_check_frozen(i) from (select nspname ||'.' || relname as i
from pg_class join pg_namespace on pg_class.relnamespace=pg_namespace.oid
WHERE relkind IN ('r','t','m')) as i;
(0 rows)

Here is pg_controldata output and information on the table where
autovacuum reports the error:

/usr/pgsql-9.6/bin/pg_controldata
pg_control version number:            960
Catalog version number:               201608131
Database system identifier:           6451990604639161176
Database cluster state:               in production
pg_control last modified:             Mon 05 Nov 2018 12:44:30 PM CET
Latest checkpoint location:           B9AF/70A4FD98
Prior checkpoint location:            B9AF/70A4B7D0
Latest checkpoint's REDO location:    B9AF/70A4FD60
Latest checkpoint's REDO WAL file:    000000010000B9AF00000070
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          1:1323325854
Latest checkpoint's NextOID:          1665327960
Latest checkpoint's NextMultiXactId:  60149587
Latest checkpoint's NextMultiOffset:  174269996
Latest checkpoint's oldestXID:        3471809503
Latest checkpoint's oldestXID's DB:   16393
Latest checkpoint's oldestActiveXID:  1323325854
Latest checkpoint's oldestMultiXid:   48606990
Latest checkpoint's oldestMulti's DB: 16393
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint:            Mon 05 Nov 2018 12:44:29 PM CET
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    replica
wal_log_hints setting:                off
max_connections setting:              750
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0

SELECT * FROM pg_class WHERE oid = 'anonymised'::regclass;
-[ RECORD 1 ]-------+---------------------------------------------------------------------------
relname             | anonymised
relnamespace        | 2200
reltype             | 16719
reloftype           | 0
relowner            | 16386
relam               | 0
relfilenode         | 438244741
reltablespace       | 0
relpages            | 1823173
reltuples           | 6.82778e+07
relallvisible       | 1822975
reltoastrelid       | 438244744
relhasindex         | t
relisshared         | f
relpersistence      | p
relkind             | r
relnatts            | 25
relchecks           | 0
relhasoids          | f
relhaspkey          | t
relhasrules         | f
relhastriggers      | t
relhassubclass      | f
relrowsecurity      | f
relforcerowsecurity | f
relispopulated      | t
relreplident        | d
relfrozenxid        | 3471809503
relminmxid          | 48606990
relacl              | {anonymised}
reloptions          | {autovacuum_vacuum_scale_factor=0.1,autovacuum_analyze_scale_factor=0.05}

Do you have any idea how to investigate? Or steps to identify previous
corruption on another instance?

Thanks!
On 11/5/18 5:41 PM, Adrien NAYRAT wrote:
> We "solved" it with a dump/restore onto another server; we also kept the
> previous cluster to investigate. I want to be sure we encountered the
> bug solved in 9.6.9 and that it is not a new one.

FYI, I am not sure I will be able to keep the previous cluster for many days.

> If we confirm it, maybe we should advise users to perform an integrity
> check? I was surprised that 9.6.9 avoids new appearances of the
> corruption but does nothing to check whether a database is already
> corrupted.

If I understand correctly, it seems related to vacuum skipping pages
that are already marked all-frozen. Maybe we should advise doing a
VACUUM FREEZE with the DISABLE_PAGE_SKIPPING option?

Regards,
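For reference, that option exists since 9.6 and forces vacuum to visit every page regardless of the visibility map; a per-table invocation would look like this (the table name is only a placeholder):

-- Revisit and freeze every page, ignoring the visibility map entirely.
VACUUM (FREEZE, DISABLE_PAGE_SKIPPING, VERBOSE) some_affected_table;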
The best solution that I have found is to kick all connections and execute a select for update, sent to /dev/null, on the affected tables, i.e.:
psql -o /dev/null -c "select * from table for update" database
After this, vacuum runs without problems again.
Best regards,
Alexandre
On Wed, Nov 7, 2018 at 10:05, Adrien NAYRAT <adrien.nayrat@anayrat.info> wrote:
On 11/5/18 5:41 PM, Adrien NAYRAT wrote:
> We "solved" with a dump/restore on another server, also we kept previous
> cluster to investigate. I want to be sure we encountered the bug solved
> in 9.6.9 and it is not a new one.
FYI, I am not sure I will be able to keep previous cluster many days.
>
> If we confirm it, maybe we should advise users to perform integrity
> check? I was surprised 9.6.9 avoid new appearance of corruption but
> nothing to ensure if database is already corrupted.
If I understand correctly, it seems related to vacuum skipping page
already frozen. Maybe we should advise to do a VACUUM FREEZE with
DISABLE_PAGE_SKIPPING option?
Regards,
On 11/7/18 1:21 PM, Alexandre Arruda wrote:
> The best solution that I have found is to kick all connections and
> execute a select for update, sent to /dev/null, on the affected
> tables, i.e.:
>
> psql -o /dev/null -c "select * from table for update" database
>
> After this, vacuum runs without problems again.

Thanks Alexandre, I will try.