Thread: Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`
Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`
From
Brian Ghidinelli
Date:
=C3=81lvaro Herrera wrote: > This smells like it's looping waiting for a multixact to be fully > written out ... except that the uninterruptibility part of that was = fixed in > time for 9.4.0, >=20 > Author: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> > Branch: master Release: REL9_5_BR [51f9ea25d] 2014-11-14 15:14:01 = -0300 > Branch: REL9_4_STABLE Release: REL9_4_0 [137e4da6d] 2014-11-14 = 15:14:02 -0300 > Branch: REL9_3_STABLE Release: REL9_3_6 [d45e8dc52] 2014-11-14 = 15:14:02 -0300 >=20 > = http://git.postgresql.org/gitweb/?p=3Dpostgresql.git;a=3Dcommit;h=3D137e4d= a6d >=20 >=20 > Can you attach to the looping process with gdb when it's doing the > select() dance, and obtain a backtrace? You need debug symbols > installed. Alvaro, thanks for the quick response. I=E2=80=99ve done some additional = research and found your involvement on other threads in 2014 and 2015 = that have allowed me to collect more supporting details. In a nutshell, = it appears to be multixid related and /maybe/ pg_upgrade related; I = don=E2=80=99t know if we used it on the 9.3 branch or not. I don=E2=80=99t= have the debug symbols but I did attach gdb and could see it was = working on multixid. I don=E2=80=99t have the ability to compile from = source on this production server unfortunately but I=E2=80=99m happy to = do whatever else I can to help the team find a solution. First, I checked the pg_upgrade changelog note in 9.4.4 and here=E2=80=99s= what I found in my install: * pg_controldata DOES report =E2=80=9CLatest checkpoint=E2=80=99s = oldestMultiXid=E2=80=9D =3D 1 * There IS a file named 0000 in PGDATA/pg_multixact/offsets pg_controldata: # /usr/pgsql-9.4/bin/pg_controldata ~postgres/9.4/data/ pg_control version number: 942 Catalog version number: 201409291 Database system identifier: 6123001933830237052 Database cluster state: in production pg_control last modified: Thu 18 Feb 2016 09:53:29 AM PST Latest checkpoint location: 130/6FBD9E58 Prior checkpoint location: 130/6F9ED950 Latest checkpoint's REDO location: 130/6FAEEFC0 Latest checkpoint's REDO WAL file: 00000001000001300000006F Latest checkpoint's TimeLineID: 1 Latest checkpoint's PrevTimeLineID: 1 Latest checkpoint's full_page_writes: on Latest checkpoint's NextXID: 0/90358137 Latest checkpoint's NextOID: 5523460 Latest checkpoint's NextMultiXactId: 84342 Latest checkpoint's NextMultiOffset: 167545 Latest checkpoint's oldestXID: 654 Latest checkpoint's oldestXID's DB: 16416 Latest checkpoint's oldestActiveXID: 90358137 Latest checkpoint's oldestMultiXid: 1 Latest checkpoint's oldestMulti's DB: 16416 Time of latest checkpoint: Thu 18 Feb 2016 09:52:31 AM PST Fake LSN counter for unlogged rels: 0/1 Minimum recovery ending location: 0/0 Min recovery ending loc's timeline: 0 Backup start location: 0/0 Backup end location: 0/0 End-of-backup record required: no Current wal_level setting: hot_standby Current wal_log_hints setting: off Current max_connections setting: 100 Current max_worker_processes setting: 8 Current max_prepared_xacts setting: 0 Current max_locks_per_xact setting: 64 Maximum data alignment: 8 Database block size: 8192 Blocks per segment of large relation: 131072 WAL block size: 8192 Bytes per WAL segment: 16777216 Maximum length of identifiers: 64 Maximum columns in an index: 32 Maximum size of a TOAST chunk: 1996 Size of a large-object chunk: 2048 Date/time type storage: 64-bit integers Float4 argument passing: by value Float8 argument passing: by value Data page checksum version: 0 the autovacuum log status: # select pg_class.relname, pg_locks.transactionid, pg_locks.mode, pg_locks.granted as "g", left(pg_stat_activity.query, 60) as query, pg_stat_activity.query_start, age(now(),pg_stat_activity.query_start) as "age", pg_stat_activity.pid=20 from pg_stat_activity,pg_locks left outer join pg_class on (pg_locks.relation =3D pg_class.oid) =20 where pg_locks.pid=3Dpg_stat_activity.pid and pg_stat_activity.pid =3D 14578 order by query_start; relname | transactionid | = mode | g | query | = query_start | age | pid =20 = -----------------------------------------+---------------+----------------= ----------+---+-----------------------------------------------+-----------= --------------------+-----------------+------- idx_clubmember_memberid | | = RowExclusiveLock | t | autovacuum: VACUUM = public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | = 06:55:04.883172 | 14578 idx_lookupclubmember_uidmember | | = RowExclusiveLock | t | autovacuum: VACUUM = public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | = 06:55:04.883172 | 14578 idx_tbllookupclubmember_uidclubofrecord | | = RowExclusiveLock | t | autovacuum: VACUUM = public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | = 06:55:04.883172 | 14578 xpktbllookupclubmember_mem | | = RowExclusiveLock | t | autovacuum: VACUUM = public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | = 06:55:04.883172 | 14578 tbllookupclubmember_pkey | | = RowExclusiveLock | t | autovacuum: VACUUM = public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | = 06:55:04.883172 | 14578 | | ExclusiveLock = | t | autovacuum: VACUUM public.tbllookupclubmember | = 2016-02-18 10:36:30.360324+00 | 06:55:04.883172 | 14578 tbllookupclubmember | | = ShareUpdateExclusiveLock | t | autovacuum: VACUUM = public.tbllookupclubmember | 2016-02-18 10:36:30.360324+00 | = 06:55:04.883172 | 14578 gdb bt for pid 14578: Program received signal SIGINT, Interrupt. 0x000000376bace123 in __select_nocancel () from /lib64/libc.so.6 (gdb) bt #0 0x000000376bace123 in __select_nocancel () from /lib64/libc.so.6 #1 0x000000000077aa9a in pg_usleep () #2 0x00000000004b66af in GetMultiXactIdMembers () #3 0x0000000000487b93 in heap_prepare_freeze_tuple () #4 0x0000000000585e07 in lazy_vacuum_rel () #5 0x0000000000583a75 in ?? () #6 0x0000000000583e37 in vacuum () #7 0x0000000000616a06 in ?? () #8 0x00000000006171db in ?? () #9 0x0000000000617289 in StartAutoVacWorker () #10 0x0000000000624cf2 in ?? () #11 <signal handler called> #12 0x000000376bace123 in __select_nocancel () from /lib64/libc.so.6 #13 0x0000000000621f76 in ?? () #14 0x000000000062619a in PostmasterMain () #15 0x00000000005bd52d in main () (gdb) cont Continuing. A couple other things I saw you request in other threads: # select oid::regclass, relfrozenxid, relminmxid from pg_class where = relname =3D 'tbllookupclubmember'; oid | relfrozenxid | relminmxid=20 ---------------------+--------------+------------ tbllookupclubmember | 22915818 | 616 (1 row) # select datname, datfrozenxid, datminmxid from pg_database where = datname =3D 'prod'; datname | datfrozenxid | datminmxid=20 --------------------+--------------+------------ prod | 654 | 1 # ls -al ~postgres/9.4/data/pg_multixact/offsets/ total 368 drwx------ 2 postgres postgres 4096 Mar 5 2015 . drwx------ 4 postgres postgres 4096 Mar 5 2015 .. -rw------- 1 postgres postgres 262144 Mar 5 2015 0000 -rw------- 1 postgres postgres 81920 Feb 18 09:47 0001 In this thread: = http://postgresql.nabble.com/BUG-11264-Auto-vacuum-wraparound-job-blocking= -everything-td5816237.html, I saw you recommend this fix: > > How do we fix the current issue with this one server? Is there an = easy=20 > >fix?=20 > > Thanks.=20 >=20 > As far as I am aware, you should=20 > UPDATE pg_database SET datminmxid=3D20783=20 >=20 > and that should fix it. The oldestMulti value in pg_control would get=20= > updated by itself some time later. If you experience stalls before=20 > oldestMulti fixes itself, you could stop the server (cleanly!) and = then=20 > pg_resetxlog -m x,y where x is the correct nextMulti value from=20 > pg_controldata and y is 20783.=20 Is that something I should/could do here? I=E2=80=99ve temporarily = disabled autovacuum on the offending table until we can figure it out. Thank you! Brian
Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`
From
Alvaro Herrera
Date:
Brian Ghidinelli wrote: > Alvaro, thanks for the quick response. Iâve done some additional research and found your involvement on other threadsin 2014 and 2015 that have allowed me to collect more supporting details. In a nutshell, it appears to be multixidrelated and /maybe/ pg_upgrade related; I donât know if we used it on the 9.3 branch or not. I donât have thedebug symbols but I did attach gdb and could see it was working on multixid. I donât have the ability to compile fromsource on this production server unfortunately but Iâm happy to do whatever else I can to help the team find a solution. Okay, great. What would be most helpful is figuring out the pg_upgrade history of this server; if you have copies of the cluster just before the upgrade, to extract the "nextMultiXactId" value, that would be useful. How large is this table? We could try to scan it to look for the values that are causing the problem, and set oldestMxact to one that would not cause a problem. How large is the cluster? For experimentation, it would be very useful if you could take a copy of it, on a server where you could recompile with debugging symbols. -- Ãlvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`
From
Brian Ghidinelli
Date:
> On Feb 18, 2016, at 12:59, Alvaro Herrera <alvherre@2ndquadrant.com> = wrote: >=20 > Okay, great. What would be most helpful is figuring out the = pg_upgrade > history of this server; if you have copies of the cluster just before > the upgrade, to extract the "nextMultiXactId" value, that would be > useful. Unfortunately we removed the 9.3 data dir for space reasons=E2=80=A6 I = may have backups from then so maybe I could spin up a docker container = and restore it but would that tell us the same thing? > How large is this table? We could try to scan it to look for the = values > that are causing the problem, and set oldestMxact to one that would = not > cause a problem. Database size is 34gb. This particular table is only 105MB. If you = account for all of the relations and indices it=E2=80=99s 239MB. There = is one table in the system which is 17GB that stores email campaigns and = deliveries. Everything else is all pretty small-ish at 2.5gb or under. How do you query for oldestMxact? > How large is the cluster? For experimentation, it would be very = useful > if you could take a copy of it, on a server where you could recompile > with debugging symbols. Is there a Docker container by chance that has symbols enabled? That = would make standing up a test environment a lot easier. Our production = infrastructure is not yet inside Docker but we run it in dev and it=E2=80=99= s easy to spin up and throw away. Brian
Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`
From
Alvaro Herrera
Date:
Brian Ghidinelli wrote: > > > On Feb 18, 2016, at 12:59, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > Okay, great. What would be most helpful is figuring out the pg_upgrade > > history of this server; if you have copies of the cluster just before > > the upgrade, to extract the "nextMultiXactId" value, that would be > > useful. > > Unfortunately we removed the 9.3 data dir for space reasons⦠I may > have backups from then so maybe I could spin up a docker container and > restore it but would that tell us the same thing? What kind of backups? If you mean pg_dump, then it's of no use. If you have physical backups, then yes, it could be useful, assuming they are not much older than the upgrade (one from immediately before the upgrade would be best.) > > How large is this table? We could try to scan it to look for the values > > that are causing the problem, and set oldestMxact to one that would not > > cause a problem. > > Database size is 34gb. This particular table is only 105MB. If you > account for all of the relations and indices itâs 239MB. There is one > table in the system which is 17GB that stores email campaigns and > deliveries. Email campaigns! I'm always happy to receive those. > Everything else is all pretty small-ish at 2.5gb or under. > > How do you query for oldestMxact? What I was thinking is scanning the table using pageinspect (get_raw_page + heap_page_items), grab the xmax value of the tuples that have the HEAP_XMAX_IS_MULTI bit set, looking for the informask combination that indicate's prior-to-pg_upgrade. The combination to find is that HEAP_XMAX_LOCK_ONLY (0x0080) must be set, and both HEAP_XMAX_EXCL_LOCK (0x0040) and HEAP_XMAX_KEYSHR_LOCK (0x0010) must be cleared. (See for instance the comments inside MultiXactIdGetUpdateXid in src/backend/access/heap/heapam.c, where this particular bit pattern is tested.) This might help: https://www.commandprompt.com/blogs/alvaro_herrera/2011/11/decoding_infomasks/ Add one to the newest multixact value you find with that combination of bits, then use that as the new cluster's "oldestMultiXact" to pg_resetxlog (make sure to shut down cleanly!) as in the previous post you found. In retrospect, it sounds like we should have made GetMultiXactIdMembers receive the infomask too and check for the pre-upgrade combination of bits. It would have saved some headaches. But I'm not sure it's worth doing now. > > How large is the cluster? For experimentation, it would be very useful > > if you could take a copy of it, on a server where you could recompile > > with debugging symbols. > > Is there a Docker container by chance that has symbols enabled? That > would make standing up a test environment a lot easier. Our production > infrastructure is not yet inside Docker but we run it in dev and itâs > easy to spin up and throw away. There may be one, but then I wouldn't know where to find it. I bet you can modify one --- it's just a "--enable-debug" option to Postgres' configure. -- Ãlvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`
From
Alvaro Herrera
Date:
Here's another idea: just try SELECT FOR UPDATE on the offending tuple. As I recall, that code path has enough protections that we wouldn't try to read the members of the multixact even if we don't see it as below the horizon, simply by checking the infomask bits as I mentioned before. You could try SELECT FOR UPDATE the complete table. If you had debug symbols you could grab the specific TID that's giving you trouble from the backtrace ... but then the table is not *that* large. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`
From
Brian Ghidinelli
Date:
Any chance that a DDL modification like adding a column would effect an upda= te of those values across the table? Brian=20 > On Feb 19, 2016, at 13:45, Alvaro Herrera <alvherre@2ndquadrant.com> wrote= : >=20 > Here's another idea: just try SELECT FOR UPDATE on the offending tuple. > As I recall, that code path has enough protections that we wouldn't try > to read the members of the multixact even if we don't see it as below > the horizon, simply by checking the infomask bits as I mentioned before. >=20 > You could try SELECT FOR UPDATE the complete table. If you had debug > symbols you could grab the specific TID that's giving you trouble from > the backtrace ... but then the table is not *that* large. >=20 > --=20 > =C3=81lvaro Herrera http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >=20 >=20 > --=20 > Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-bugs
Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`
From
Alvaro Herrera
Date:
Brian Ghidinelli wrote: > Any chance that a DDL modification like adding a column would effect an update of those values across the table? If a column is added without a default value, then the existing tuples are not touched. If a default value is added, then the complete table is rewritten. I suppose that the rewritten table would not include old Multixact values in XMAX, but I haven't checked. I wonder if it's possible for a column to be added in an operation that does not rewrite the table yet the relminmxid is advanced past the oldest multixact present. I assume, given the lack of traffic in this thread, that you have solved your problem already? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`
From
Brian Ghidinelli
Date:
> On Mar 7, 2016, at 08:21, Alvaro Herrera <alvherre@2ndquadrant.com> = wrote: >=20 > Brian Ghidinelli wrote: >> Any chance that a DDL modification like adding a column would effect = an update of those values across the table? >=20 > If a column is added without a default value, then the existing tuples > are not touched. If a default value is added, then the complete table > is rewritten. I suppose that the rewritten table would not include = old > Multixact values in XMAX, but I haven't checked. >=20 > I wonder if it's possible for a column to be added in an operation = that > does not rewrite the table yet the relminmxid is advanced past the > oldest multixact present. >=20 > I assume, given the lack of traffic in this thread, that you have = solved > your problem already? Thanks for the follow up Alvaro. I=E2=80=99m no longer on the bugs list = so don=E2=80=99t know if this will make it through, but I went the route = of recreating the table and updating all of the foreign key references = and shared that on Feb 19th.=20 This seems to have fixed the problem with vacuuming the table however it = did not update the multixact value which still reads 1. Any idea if this = will still cause problems in the future? # /usr/pgsql-9.4/bin/pg_controldata ~postgres/9.4/=20 pg_control version number: 942 Catalog version number: 201409291 Database system identifier: 6123001933830237052 Database cluster state: in production pg_control last modified: Sat 12 Mar 2016 01:57:08 PM PST Latest checkpoint location: 13A/54469E58 Prior checkpoint location: 13A/54129CC8 Latest checkpoint's REDO location: 13A/543F2908 Latest checkpoint's REDO WAL file: 000000010000013A00000054 Latest checkpoint's TimeLineID: 1 Latest checkpoint's PrevTimeLineID: 1 Latest checkpoint's full_page_writes: on Latest checkpoint's NextXID: 0/92143770 Latest checkpoint's NextOID: 5609664 Latest checkpoint's NextMultiXactId: 84435 Latest checkpoint's NextMultiOffset: 167731 Latest checkpoint's oldestXID: 654 Latest checkpoint's oldestXID's DB: 16416 Latest checkpoint's oldestActiveXID: 92143770 Latest checkpoint's oldestMultiXid: 1 Latest checkpoint's oldestMulti's DB: 16416 Time of latest checkpoint: Sat 12 Mar 2016 01:56:19 PM PST Brian
Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`
From
Alvaro Herrera
Date:
Brian Ghidinelli wrote: > Thanks for the follow up Alvaro. Iâm no longer on the bugs list so > donât know if this will make it through, but I went the route of > recreating the table and updating all of the foreign key references > and shared that on Feb 19th. > > This seems to have fixed the problem with vacuuming the table however > it did not update the multixact value which still reads 1. Any idea if > this will still cause problems in the future? Well, your use of multixacts appear low so it's normal that it hasn't moved forward. That counter is normally only moved forward as vacuum tries to clean up multixacts that are 5 million old. Yours are only ~84000 old, so you have a long way to go before the counter is updated. Your system should be fine. -- Ãlvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services