Thread: pg_upgrade - link mode and transaction-wraparound data loss
Hi I tried running pg_upgrade from the current snapshot of postgresql and upgrading from 8.4.4 to the snapshot version. Everything seem to look fine in the process and all that came out was only "ok's" but when I tried a simple query on the databse it keeps throwing these message out of the back side. DETAIL: You might have already suffered transaction-wraparound data loss. WARNING: some databases have not been vacuumed in over 2 billion transactions The database was around 600GB and it took a couple of minutes to run pg_upgrade after I had all the binaries in the correct place. It is not really an easy task to throw around 600GB of data, so I cannot gaurantee that the above is reproducible, but I'll see if I can get time and try to reproduce it. Jesper
jesper@krogh.cc wrote: > Hi > > I tried running pg_upgrade from the current snapshot of postgresql and > upgrading from 8.4.4 to the snapshot version. Everything seem to look fine > in the process and all that came out was only "ok's" but when I tried a > simple query on the databse it keeps throwing these message out of the back > side. > > DETAIL: You might have already suffered transaction-wraparound data loss. > WARNING: some databases have not been vacuumed in over 2 billion > transactions > > > The database was around 600GB and it took a couple of minutes to run > pg_upgrade after I had all the binaries in the correct place. > > It is not really an easy task to throw around 600GB of data, so I cannot > gaurantee that the above is reproducible, but I'll see if I can get time > and try to reproduce it. This certainly should never have happened, so I am guessing it is a bug. pg_upgrade tries hard to make sure all your datfrozenxid and relfrozenxid are properly migrated from the old server, and the transaction id is set properly. Unfortunately this is the first time I have heard of such a problem, so I am unclear on its cause. The warning is issued from vacuum.c::vac_truncate_clog(). Can you run this query and show us the output: SELECT datname, datfrozenxid FROM pg_database; It would be good to see these numbers on both the old and new servers. I would also like to see: SELECT txid_current(); on the old and new servers, but if you can only provide these values on one of the two servers, it is still useful. Thanks. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com
> jesper@krogh.cc wrote: >> Hi >> >> I tried running pg_upgrade from the current snapshot of postgresql and >> upgrading from 8.4.4 to the snapshot version. Everything seem to look >> fine >> in the process and all that came out was only "ok's" but when I tried a >> simple query on the databse it keeps throwing these message out of the >> back >> side. >> >> DETAIL: You might have already suffered transaction-wraparound data >> loss. >> WARNING: some databases have not been vacuumed in over 2 billion >> transactions >> >> >> The database was around 600GB and it took a couple of minutes to run >> pg_upgrade after I had all the binaries in the correct place. >> >> It is not really an easy task to throw around 600GB of data, so I cannot >> gaurantee that the above is reproducible, but I'll see if I can get time >> and try to reproduce it. > > This certainly should never have happened, so I am guessing it is a bug. > pg_upgrade tries hard to make sure all your datfrozenxid and > relfrozenxid are properly migrated from the old server, and the > transaction id is set properly. Unfortunately this is the first time I > have heard of such a problem, so I am unclear on its cause. > > The warning is issued from vacuum.c::vac_truncate_clog(). Can you run > this query and show us the output: > > SELECT datname, datfrozenxid FROM pg_database; > > It would be good to see these numbers on both the old and new servers. > I would also like to see: > > SELECT txid_current(); > > on the old and new servers, but if you can only provide these values on > one of the two servers, it is still useful. Thanks. Hi Bruce, thanks for your prompt response. First the new one.. jk@pal:~$ psql -p 5433 psql (9.0beta1) Type "help" for help. data=# SELECT datname, datfrozenxid FROM pg_database; datname | datfrozenxid -----------+--------------template0 | 654postgres | 2374592801data | 2023782337jk | 2023822188template1| 2374592801workqueue | 2023822188 (6 rows) data=# SELECT txid_current();txid_current -------------- 2375384556 (1 row) data=# \q Then the old one. jk@pal:~$ psql data psql (9.0beta1, server 8.4.1) WARNING: psql version 9.0, server version 8.4. Some psql features might not work. Type "help" for help. data# SELECT datname, datfrozenxid FROM pg_database; datname | datfrozenxid -----------+--------------template0 | 2073823552postgres | 2023820521data | 2023782337jk | 2023822188template1| 2073823552workqueue | 2023822188 (6 rows) data=# SELECT txid_current();txid_current -------------- 2390524243 (1 row) The old database has been "copied" over using rsync and pg_start_backup()/pg_stop_backup() procecures and started up using a recovery.conf file. Jesper
jesper@krogh.cc wrote: > > on the old and new servers, but if you can only provide these values on > > one of the two servers, it is still useful. Thanks. > > Hi Bruce, thanks for your prompt response. > > First the new one.. Great. > jk@pal:~$ psql -p 5433 > psql (9.0beta1) > Type "help" for help. > > data=# SELECT datname, datfrozenxid FROM pg_database; > datname | datfrozenxid > -----------+-------------- > template0 | 654 > postgres | 2374592801 > data | 2023782337 > jk | 2023822188 > template1 | 2374592801 > workqueue | 2023822188 > (6 rows) > > data=# SELECT txid_current(); > txid_current > -------------- > 2375384556 > (1 row) > > data=# \q > > Then the old one. > > jk@pal:~$ psql data > psql (9.0beta1, server 8.4.1) > WARNING: psql version 9.0, server version 8.4. > Some psql features might not work. > Type "help" for help. > > data# SELECT datname, datfrozenxid FROM pg_database; > datname | datfrozenxid > -----------+-------------- > template0 | 2073823552 This line above looks very odd because I didn't think the template0 datfrozenxid could be advanced. Can I see the output of this query: SELECT datname, datfrozenxid, datallowconn FROM pg_database; I am wondering if you set datallowconn for template0 to 'true'. > postgres | 2023820521 > data | 2023782337 > jk | 2023822188 > template1 | 2073823552 > workqueue | 2023822188 > (6 rows) > > data=# SELECT txid_current(); > txid_current > -------------- > 2390524243 > (1 row) > > > The old database has been "copied" over using rsync and > pg_start_backup()/pg_stop_backup() procecures and started up > using a recovery.conf file. My other idea is that somehow recovery touches datallowconn for template0. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com
On 2010-05-18 20:52, Bruce Momjian wrote: > This line above looks very odd because I didn't think the template0 > datfrozenxid could be advanced. Can I see the output of this query: > > SELECT datname, datfrozenxid, datallowconn FROM pg_database; > > Only from the "old" database: data=# SELECT datname, datfrozenxid, datallowconn FROM pg_database; datname | datfrozenxid | datallowconn -----------+--------------+-------------- template0 | 2073823552 | f postgres | 2023820521 | t data | 2023782337| t jk | 2023822188 | t template1 | 2073823552 | t workqueue | 2023822188 | t (6 rows) > I am wondering if you set datallowconn for template0 to 'true'. From this database, I cannot give any more results, I ran some other queries and then restarted postgres, subsequently it seemed totally broken. I'm in the process of running the test over again, but it'll take a while before data is in. I'll report back. -- Jesper
Jesper Krogh wrote: > On 2010-05-18 20:52, Bruce Momjian wrote: > > This line above looks very odd because I didn't think the template0 > > datfrozenxid could be advanced. Can I see the output of this query: > > > > SELECT datname, datfrozenxid, datallowconn FROM pg_database; > > > > > > Only from the "old" database: > data=# SELECT datname, datfrozenxid, datallowconn FROM pg_database; > datname | datfrozenxid | datallowconn > -----------+--------------+-------------- > template0 | 2073823552 | f > postgres | 2023820521 | t > data | 2023782337 | t > jk | 2023822188 | t > template1 | 2073823552 | t > workqueue | 2023822188 | t > (6 rows) OK, datallowconn = false is right for template0, but I am still confused how it got set to that high value. > > I am wondering if you set datallowconn for template0 to 'true'. > > From this database, I cannot give any more results, I ran some other > queries and then restarted postgres, subsequently it seemed > totally broken. I'm in the process of running the test over again, but > it'll take a while before data is in. I'll report back. OK, thanks. This does seem odd. Frankly, having template0's datfrozenxid be wrong would not cause any kind of instability because template0 is used only by pg_dump, so I am wondering if something else is seriously wrong. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com
jesper@krogh.cc wrote: > First the new one.. > > jk@pal:~$ psql -p 5433 > psql (9.0beta1) > Type "help" for help. > > data=# SELECT datname, datfrozenxid FROM pg_database; > datname | datfrozenxid > -----------+-------------- > template0 | 654 > postgres | 2374592801 > data | 2023782337 > jk | 2023822188 > template1 | 2374592801 > workqueue | 2023822188 > (6 rows) > > data=# SELECT txid_current(); > txid_current > -------------- > 2375384556 > (1 row) I just ran a test and all the datfrozenxids are less than the current xid, so the only database that could be generating a wraparound warning is 'template0'. But, again, I though that template0 was not touched for wraparound protection --- I am starting to think I am wrong. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com
Excerpts from jesper's message of mar may 18 13:22:12 -0400 2010: > jk@pal:~$ psql data > psql (9.0beta1, server 8.4.1) > WARNING: psql version 9.0, server version 8.4. > Some psql features might not work. > Type "help" for help. > > data# SELECT datname, datfrozenxid FROM pg_database; > datname | datfrozenxid > -----------+-------------- > template0 | 2073823552 > postgres | 2023820521 > data | 2023782337 > jk | 2023822188 > template1 | 2073823552 > workqueue | 2023822188 > (6 rows) Does the old server have pg_database.datallowconn = true for template0? --
Alvaro Herrera wrote: > Excerpts from jesper's message of mar may 18 13:22:12 -0400 2010: > > > jk@pal:~$ psql data > > psql (9.0beta1, server 8.4.1) > > WARNING: psql version 9.0, server version 8.4. > > Some psql features might not work. > > Type "help" for help. > > > > data# SELECT datname, datfrozenxid FROM pg_database; > > datname | datfrozenxid > > -----------+-------------- > > template0 | 2073823552 > > postgres | 2023820521 > > data | 2023782337 > > jk | 2023822188 > > template1 | 2073823552 > > workqueue | 2023822188 > > (6 rows) > > Does the old server have pg_database.datallowconn = true for template0? The user reported back that it did not: Only from the "old" database:data=# SELECT datname, datfrozenxid, datallowconn FROM pg_database; datname | datfrozenxid| datallowconn-----------+--------------+-------------- template0 | 2073823552 | f postgres | 2023820521| t data | 2023782337 | t jk | 2023822188 | t template1 | 2073823552 | t workqueue | 2023822188| t(6 rows) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com
On 2010-05-18 21:56, Bruce Momjian wrote: > Jesper Krogh wrote: > >> On 2010-05-18 20:52, Bruce Momjian wrote: >> >>> This line above looks very odd because I didn't think the template0 >>> datfrozenxid could be advanced. Can I see the output of this query: >>> >>> SELECT datname, datfrozenxid, datallowconn FROM pg_database; >>> >>> >>> >> Only from the "old" database: >> data=# SELECT datname, datfrozenxid, datallowconn FROM pg_database; >> datname | datfrozenxid | datallowconn >> -----------+--------------+-------------- >> template0 | 2073823552 | f >> postgres | 2023820521 | t >> data | 2023782337 | t >> jk | 2023822188 | t >> template1 | 2073823552 | t >> workqueue | 2023822188 | t >> (6 rows) >> > OK, datallowconn = false is right for template0, but I am still confused > how it got set to that high value. > This is the "production system". I have absolutely no indications that anything should be wrong in there. It has run rock-solid since it got migrated (dump/restore) to 8.4 for about 7 months now. So I am a bit scared about you telling that it seems wrong. (but that cannot be attributed to pg_upgrade) > OK, thanks. This does seem odd. Frankly, having template0's > datfrozenxid be wrong would not cause any kind of instability because > template0 is used only by pg_dump, so I am wondering if something else > is seriously wrong. > I also think that something was seriously wrong with the pg_upgrade'd version. I'll try to reproduce and be a bit more carefull in tracking the steps this time. -- Jesper
Jesper Krogh wrote: > On 2010-05-18 21:56, Bruce Momjian wrote: > > Jesper Krogh wrote: > > > >> On 2010-05-18 20:52, Bruce Momjian wrote: > >> > >>> This line above looks very odd because I didn't think the template0 > >>> datfrozenxid could be advanced. Can I see the output of this query: > >>> > >>> SELECT datname, datfrozenxid, datallowconn FROM pg_database; > >>> > >>> > >>> > >> Only from the "old" database: > >> data=# SELECT datname, datfrozenxid, datallowconn FROM pg_database; > >> datname | datfrozenxid | datallowconn > >> -----------+--------------+-------------- > >> template0 | 2073823552 | f > >> postgres | 2023820521 | t > >> data | 2023782337 | t > >> jk | 2023822188 | t > >> template1 | 2073823552 | t > >> workqueue | 2023822188 | t > >> (6 rows) > >> > > OK, datallowconn = false is right for template0, but I am still confused > > how it got set to that high value. > > > > This is the "production system". I have absolutely no indications that > anything should be wrong in there. It has run rock-solid since it got > migrated (dump/restore) to 8.4 for about 7 months now. So I am a bit > scared about you telling that it seems wrong. (but that cannot be > attributed to pg_upgrade) I am on chat with Alvaro now and it seems we do somehow connect to template0 for transaction id wraparound. I think Alvaro will post shortly on this. > > OK, thanks. This does seem odd. Frankly, having template0's > > datfrozenxid be wrong would not cause any kind of instability because > > template0 is used only by pg_dump, so I am wondering if something else > > is seriously wrong. > > > I also think that something was seriously wrong with the pg_upgrade'd > version. I'll try to reproduce and be a bit more carefull in tracking > the steps > this time. Thanks, but I think the entire problem might be this template0 xid issue that Alvaro and I are researching. I can now see how invalid template0 xids could cause the instability you saw in the new database. Odd no one has seen this bug before. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com
On 2010-05-18 18:57, Bruce Momjian wrote: > jesper@krogh.cc wrote: > >> Hi >> >> I tried running pg_upgrade from the current snapshot of postgresql and >> upgrading from 8.4.4 to the snapshot version. Everything seem to look fine >> in the process and all that came out was only "ok's" but when I tried a >> simple query on the databse it keeps throwing these message out of the back >> side. >> >> DETAIL: You might have already suffered transaction-wraparound data loss. >> WARNING: some databases have not been vacuumed in over 2 billion >> transactions >> >> >> The database was around 600GB and it took a couple of minutes to run >> pg_upgrade after I had all the binaries in the correct place. >> >> It is not really an easy task to throw around 600GB of data, so I cannot >> gaurantee that the above is reproducible, but I'll see if I can get time >> and try to reproduce it. >> > This certainly should never have happened, so I am guessing it is a bug. > pg_upgrade tries hard to make sure all your datfrozenxid and > relfrozenxid are properly migrated from the old server, and the > transaction id is set properly. Unfortunately this is the first time I > have heard of such a problem, so I am unclear on its cause. > Other people are typically way faster than I am looking into it. Depesz has produced a full trace to reproduce the problem here: http://www.depesz.com/index.php/2010/05/19/waiting-for-9-0-pg_upgrade/ Jesper -- Jesper
Bruce Momjian wrote: > > This is the "production system". I have absolutely no indications that > > anything should be wrong in there. It has run rock-solid since it got > > migrated (dump/restore) to 8.4 for about 7 months now. So I am a bit > > scared about you telling that it seems wrong. (but that cannot be > > attributed to pg_upgrade) > > I am on chat with Alvaro now and it seems we do somehow connect to > template0 for transaction id wraparound. I think Alvaro will post > shortly on this. > > > > OK, thanks. This does seem odd. Frankly, having template0's > > > datfrozenxid be wrong would not cause any kind of instability because > > > template0 is used only by pg_dump, so I am wondering if something else > > > is seriously wrong. > > > > > I also think that something was seriously wrong with the pg_upgrade'd > > version. I'll try to reproduce and be a bit more carefull in tracking > > the steps > > this time. > > Thanks, but I think the entire problem might be this template0 xid issue > that Alvaro and I are researching. I can now see how invalid template0 > xids could cause the instability you saw in the new database. Odd no > one has seen this bug before. OK, after talking to Alvaro and Heikki, the problem is that while you cannot connect to template0, it is accessed by autovacuum for vacuum freeze, even if autovacuum is turned off. I think the reason you are seeing this bug is that your xid counter is near 2 billion (50% to wraparound) and the original template0 xids are the maximum distance from your counter. I am attaching the newest patch which fixes this issue. I did modify this code yesterday with another patch, and I am unclear exactly if you need that patch as well. CVS now has all these changes. If you could test with this and the earlier patch, I think it will now work fine. Thanks for the valuable testing, and quick feedback. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com Index: contrib/pg_upgrade/pg_upgrade.c =================================================================== RCS file: /cvsroot/pgsql/contrib/pg_upgrade/pg_upgrade.c,v retrieving revision 1.3 diff -c -c -r1.3 pg_upgrade.c *** contrib/pg_upgrade/pg_upgrade.c 18 May 2010 18:40:51 -0000 1.3 --- contrib/pg_upgrade/pg_upgrade.c 19 May 2010 18:20:03 -0000 *************** *** 164,170 **** check_ok(ctx); /* ! * We do freeze after analyze so pg_statistic is also frozen */ prep_status(ctx, "Freezing all rows on the new cluster"); exec_prog(ctx, true, --- 164,173 ---- check_ok(ctx); /* ! * We do freeze after analyze so pg_statistic is also frozen. ! * template0 is not frozen here, but data rows were frozen by initdb, ! * and we set its datfrozenxid and relfrozenxids later to match the ! * new xid counter later. */ prep_status(ctx, "Freezing all rows on the new cluster"); exec_prog(ctx, true, *************** *** 292,339 **** set_frozenxids(migratorContext *ctx) { int dbnum; ! PGconn *conn; PGresult *dbres; int ntups; prep_status(ctx, "Setting frozenxid counters in new cluster"); ! conn = connectToServer(ctx, "template1", CLUSTER_NEW); /* set pg_database.datfrozenxid */ ! PQclear(executeQueryOrDie(ctx, conn, "UPDATE pg_catalog.pg_database " ! "SET datfrozenxid = '%u' " ! "WHERE datallowconn = true", ctx->old.controldata.chkpnt_nxtxid)); /* get database names */ ! dbres = executeQueryOrDie(ctx, conn, ! "SELECT datname " ! "FROM pg_catalog.pg_database " ! "WHERE datallowconn = true"); ! /* free dbres below */ ! PQfinish(conn); ntups = PQntuples(dbres); for (dbnum = 0; dbnum < ntups; dbnum++) { ! conn = connectToServer(ctx, PQgetvalue(dbres, dbnum, 0), CLUSTER_NEW); /* set pg_class.relfrozenxid */ PQclear(executeQueryOrDie(ctx, conn, "UPDATE pg_catalog.pg_class " "SET relfrozenxid = '%u' " /* only heap and TOAST are vacuumed */ ! "WHERE relkind = 'r' OR " ! " relkind = 't'", ctx->old.controldata.chkpnt_nxtxid)); PQfinish(conn); } PQclear(dbres); check_ok(ctx); } --- 295,366 ---- set_frozenxids(migratorContext *ctx) { int dbnum; ! PGconn *conn, *conn_template1; PGresult *dbres; int ntups; + int i_datname; + int i_datallowconn; prep_status(ctx, "Setting frozenxid counters in new cluster"); ! conn_template1 = connectToServer(ctx, "template1", CLUSTER_NEW); /* set pg_database.datfrozenxid */ ! PQclear(executeQueryOrDie(ctx, conn_template1, "UPDATE pg_catalog.pg_database " ! "SET datfrozenxid = '%u'", ctx->old.controldata.chkpnt_nxtxid)); /* get database names */ ! dbres = executeQueryOrDie(ctx, conn_template1, ! "SELECT datname, datallowconn " ! "FROM pg_catalog.pg_database"); ! i_datname = PQfnumber(dbres, "datname"); ! i_datallowconn = PQfnumber(dbres, "datallowconn"); ntups = PQntuples(dbres); for (dbnum = 0; dbnum < ntups; dbnum++) { ! char *datname = PQgetvalue(dbres, dbnum, i_datname); ! char *datallowconn= PQgetvalue(dbres, dbnum, i_datallowconn); ! ! /* ! * We must update databases where datallowconn = false, e.g. ! * template0, because autovacuum increments their datfrozenxids and ! * relfrozenxids even if autovacuum is turned off, and even though ! * all the data rows are already frozen To enable this, we ! * temporarily change datallowconn. ! */ ! if (strcmp(datallowconn, "f") == 0) ! PQclear(executeQueryOrDie(ctx, conn_template1, ! "UPDATE pg_catalog.pg_database " ! "SET datallowconn = true " ! "WHERE datname = '%s'", datname)); ! ! conn = connectToServer(ctx, datname, CLUSTER_NEW); /* set pg_class.relfrozenxid */ PQclear(executeQueryOrDie(ctx, conn, "UPDATE pg_catalog.pg_class " "SET relfrozenxid = '%u' " /* only heap and TOAST are vacuumed */ ! "WHERE relkind IN ('r', 't')", ctx->old.controldata.chkpnt_nxtxid)); PQfinish(conn); + + /* Reset datallowconn flag */ + if (strcmp(datallowconn, "f") == 0) + PQclear(executeQueryOrDie(ctx, conn_template1, + "UPDATE pg_catalog.pg_database " + "SET datallowconn = false " + "WHERE datname = '%s'", datname)); } PQclear(dbres); + PQfinish(conn_template1); + check_ok(ctx); }
Jesper Krogh wrote: > On 2010-05-18 18:57, Bruce Momjian wrote: > > jesper@krogh.cc wrote: > > > >> Hi > >> > >> I tried running pg_upgrade from the current snapshot of postgresql and > >> upgrading from 8.4.4 to the snapshot version. Everything seem to look fine > >> in the process and all that came out was only "ok's" but when I tried a > >> simple query on the databse it keeps throwing these message out of the back > >> side. > >> > >> DETAIL: You might have already suffered transaction-wraparound data loss. > >> WARNING: some databases have not been vacuumed in over 2 billion > >> transactions > >> > >> > >> The database was around 600GB and it took a couple of minutes to run > >> pg_upgrade after I had all the binaries in the correct place. > >> > >> It is not really an easy task to throw around 600GB of data, so I cannot > >> gaurantee that the above is reproducible, but I'll see if I can get time > >> and try to reproduce it. > >> > > This certainly should never have happened, so I am guessing it is a bug. > > pg_upgrade tries hard to make sure all your datfrozenxid and > > relfrozenxid are properly migrated from the old server, and the > > transaction id is set properly. Unfortunately this is the first time I > > have heard of such a problem, so I am unclear on its cause. > > > > Other people are typically way faster than I am looking into it. > Depesz has produced a full trace to reproduce the problem here: > http://www.depesz.com/index.php/2010/05/19/waiting-for-9-0-pg_upgrade/ Thanks. I have commented on the blog to mention that we have fixed the bug reported there. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com