Thread: getting status transaction error
around 6:30 this morning, I started getting the following messages in my log: Feb 6 06:33:34 mojo postgres[1117]: [2-1] :: ERROR: could not access status of transaction 544441911 Feb 6 06:34:35 mojo postgres[1128]: [2-1] :: ERROR: could not access status of transaction 544441911 [...] repeated roughly once a minute. I've never seen this before. this is on postgres 8.1.1 running on fedora core 4 smp. I don't have any contextual information yet but I'm getting ready to turn statement logging on. Anybody know what this is? merlin
On 2/6/07, Merlin Moncure <mmoncure@gmail.com> wrote: > around 6:30 this morning, I started getting the following messages in my log: > > Feb 6 06:33:34 mojo postgres[1117]: [2-1] :: ERROR: could not access > status of transaction 544441911 > Feb 6 06:34:35 mojo postgres[1128]: [2-1] :: ERROR: could not access > status of transaction 544441911 > [...] > > repeated roughly once a minute. I've never seen this before. this is > on postgres 8.1.1 running on fedora core 4 smp. I don't have any > contextual information yet but I'm getting ready to turn statement > logging on. > > Anybody know what this is? [x-posting to -hackers] actually, here is some more relevant bits from the log. Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum: processing database "template0" Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access status of transaction 544441911 Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open file "pg_clog/0207": No such file or directory repeated ad-naseum merlin
Merlin Moncure wrote: > On 2/6/07, Merlin Moncure <mmoncure@gmail.com> wrote: >> around 6:30 this morning, I started getting the following messages in >> my log: >> >> Feb 6 06:33:34 mojo postgres[1117]: [2-1] :: ERROR: could not access >> status of transaction 544441911 >> Feb 6 06:34:35 mojo postgres[1128]: [2-1] :: ERROR: could not access >> status of transaction 544441911 >> [...] >> >> repeated roughly once a minute. I've never seen this before. this is >> on postgres 8.1.1 running on fedora core 4 smp. I don't have any >> contextual information yet but I'm getting ready to turn statement >> logging on. >> >> Anybody know what this is? > [x-posting to -hackers] > > actually, here is some more relevant bits from the log. > Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum: > processing database "template0" > Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access > status of transaction 544441911 > Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open > file "pg_clog/0207": No such file or directory hmm I first thought it could have been http://archives.postgresql.org/pgsql-committers/2006-01/msg00288.php which affects 8.1.1 but that's not the very same error as the one created by the above bug. Stefan
On 2/6/07, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote: > Merlin Moncure wrote: > > On 2/6/07, Merlin Moncure <mmoncure@gmail.com> wrote: > >> around 6:30 this morning, I started getting the following messages in > >> my log: > >> > >> Feb 6 06:33:34 mojo postgres[1117]: [2-1] :: ERROR: could not access > >> status of transaction 544441911 > >> Feb 6 06:34:35 mojo postgres[1128]: [2-1] :: ERROR: could not access > >> status of transaction 544441911 > >> [...] > >> > >> repeated roughly once a minute. I've never seen this before. this is > >> on postgres 8.1.1 running on fedora core 4 smp. I don't have any > >> contextual information yet but I'm getting ready to turn statement > >> logging on. > >> > >> Anybody know what this is? > > [x-posting to -hackers] > > > > actually, here is some more relevant bits from the log. > > Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum: > > processing database "template0" > > Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access > > status of transaction 544441911 > > Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open > > file "pg_clog/0207": No such file or directory > > hmm I first thought it could have been > http://archives.postgresql.org/pgsql-committers/2006-01/msg00288.php > which affects 8.1.1 but that's not the very same error as the one > created by the above bug. ya, it doesn't seem to match, as this seems to be repeating quite regularly. interesting that my 'clog' files start at 06B6 and count up. 0207 is way off the charts. a lot of applications are hitting this database, and so far everything seems to be running ok (i found this log msg by accident), but I am now officially very nervous. merlin
Merlin Moncure wrote: > On 2/6/07, Merlin Moncure <mmoncure@gmail.com> wrote: > >around 6:30 this morning, I started getting the following messages in my > >log: > > > >Feb 6 06:33:34 mojo postgres[1117]: [2-1] :: ERROR: could not access > >status of transaction 544441911 > >Feb 6 06:34:35 mojo postgres[1128]: [2-1] :: ERROR: could not access > >status of transaction 544441911 > >[...] > > > >repeated roughly once a minute. I've never seen this before. this is > >on postgres 8.1.1 running on fedora core 4 smp. I don't have any > >contextual information yet but I'm getting ready to turn statement > >logging on. > > > >Anybody know what this is? > [x-posting to -hackers] > > actually, here is some more relevant bits from the log. > Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum: > processing database "template0" > Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access > status of transaction 544441911 > Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open > file "pg_clog/0207": No such file or directory I guess the problem here is that autovacuum believes that template0 needs a database-wide vacuum due to Xid wraparound getting closer. And that database seems to have Xid 544441911 somewhere, the clog bit for which was in the 0207 file which was deleted some time ago. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On 2/6/07, Alvaro Herrera <alvherre@commandprompt.com> wrote: > > actually, here is some more relevant bits from the log. > > Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum: > > processing database "template0" > > Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access > > status of transaction 544441911 > > Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open > > file "pg_clog/0207": No such file or directory > > I guess the problem here is that autovacuum believes that template0 > needs a database-wide vacuum due to Xid wraparound getting closer. And > that database seems to have Xid 544441911 somewhere, the clog bit for > which was in the 0207 file which was deleted some time ago. Latest checkpoint's NextXID: 2162841139 2^31: 2147483648 is this related? merlin
Merlin Moncure wrote: > ya, it doesn't seem to match, as this seems to be repeating quite > regularly. interesting that my 'clog' files start at 06B6 and count > up. 0207 is way off the charts. > > a lot of applications are hitting this database, and so far everything > seems to be running ok (i found this log msg by accident), but I am > now officially very nervous. I don't think there's much cause for concern here. If my theory is correct, this is an autovacuum bug which was fixed in 8.1.7. What I'd do is create a 0207 clog file, fill it with 0x55 (which is "transactions committed" for all transactions in that interval), and do a VACUUM FREEZE on that database. You'll need to set pg_database.datallowconn=true beforehand. Of course, I'd copy the files somewhere else and experiment on a scratch postmaster, running on a different port, just to be sure ... -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
n 2/6/07, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Merlin Moncure wrote: > > > ya, it doesn't seem to match, as this seems to be repeating quite > > regularly. interesting that my 'clog' files start at 06B6 and count > > up. 0207 is way off the charts. > > > > a lot of applications are hitting this database, and so far everything > > seems to be running ok (i found this log msg by accident), but I am > > now officially very nervous. > > I don't think there's much cause for concern here. If my theory is > correct, this is an autovacuum bug which was fixed in 8.1.7. > > What I'd do is create a 0207 clog file, fill it with 0x55 (which is > "transactions committed" for all transactions in that interval), and do > a VACUUM FREEZE on that database. You'll need to set > pg_database.datallowconn=true beforehand. > > Of course, I'd copy the files somewhere else and experiment on a scratch > postmaster, running on a different port, just to be sure ... thats a big help, database is actually fairly huge, so I may have to just go ahead and do it. I'm off to a meeting, but I'll check back when I'm done and assuming nobody else says 'don't do that', I'll try the fix and post back with the result. thanks all, merlin
Merlin Moncure wrote: > On 2/6/07, Alvaro Herrera <alvherre@commandprompt.com> wrote: > >> actually, here is some more relevant bits from the log. > >> Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum: > >> processing database "template0" > >> Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access > >> status of transaction 544441911 > >> Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open > >> file "pg_clog/0207": No such file or directory > > > Latest checkpoint's NextXID: 2162841139 > 2^31: 2147483648 I think the relevant arithmetic here is echo "2162841139 544441911 - p" | dc 1618399228 That's a billion and a half transactions. Autovacuum uses the formula this_whole_db = (tmp->age > (int32) ((MaxTransactionId >> 3) * 3 - 100000)); to determine whether it needs database-wide vacuum. (MaxTransactionId >> 3) is 536870911, so the calculation is 536870911 * 3 - 100000 echo "536870911 3 * 100000 - p" | dc 1610512733 which looks awfully close to the number above. About 7 million transactions must have passed since the first time the error showed up -- does that sound likely? Well, scratch that -- what's the _current_ Xid? (not lastest checkpoint's) -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Merlin Moncure wrote: > n 2/6/07, Alvaro Herrera <alvherre@commandprompt.com> wrote: > >Merlin Moncure wrote: > > > >> ya, it doesn't seem to match, as this seems to be repeating quite > >> regularly. interesting that my 'clog' files start at 06B6 and count > >> up. 0207 is way off the charts. > >> > >> a lot of applications are hitting this database, and so far everything > >> seems to be running ok (i found this log msg by accident), but I am > >> now officially very nervous. > > > >I don't think there's much cause for concern here. If my theory is > >correct, this is an autovacuum bug which was fixed in 8.1.7. > > > >What I'd do is create a 0207 clog file, fill it with 0x55 (which is > >"transactions committed" for all transactions in that interval), and do > >a VACUUM FREEZE on that database. You'll need to set > >pg_database.datallowconn=true beforehand. > > > >Of course, I'd copy the files somewhere else and experiment on a scratch > >postmaster, running on a different port, just to be sure ... > > thats a big help, database is actually fairly huge, so I may have to > just go ahead and do it. I'm off to a meeting, but I'll check back > when I'm done and assuming nobody else says 'don't do that', I'll try > the fix and post back with the result. Well, you don't need to copy all databases for the test area, just the base/<oid> dir for template0 (along with all pg_xlog and pg_clog files, etc, but these shouldn't be as big as all the other stuff in base/). -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On 2/6/07, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Merlin Moncure wrote: > > On 2/6/07, Alvaro Herrera <alvherre@commandprompt.com> wrote: > > >> actually, here is some more relevant bits from the log. > > >> Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum: > > >> processing database "template0" > > >> Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access > > >> status of transaction 544441911 > > >> Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open > > >> file "pg_clog/0207": No such file or directory > > > > > Latest checkpoint's NextXID: 2162841139 > > 2^31: 2147483648 > > I think the relevant arithmetic here is > > echo "2162841139 544441911 - p" | dc > 1618399228 > > That's a billion and a half transactions. Autovacuum uses the formula > > this_whole_db = (tmp->age > > (int32) ((MaxTransactionId >> 3) * 3 - 100000)); > > to determine whether it needs database-wide vacuum. > (MaxTransactionId >> 3) is 536870911, so the calculation is > 536870911 * 3 - 100000 > > echo "536870911 3 * 100000 - p" | dc > 1610512733 > > which looks awfully close to the number above. About 7 million > transactions must have passed since the first time the error showed up > -- does that sound likely? > > Well, scratch that -- what's the _current_ Xid? (not lastest > checkpoint's) I don't know any better way to get that than this: postgres=# insert into foo default values; INSERT 0 1 postgres=# select xmin,xmax from foo; xmin | xmax ------------+------ 2163877346 | 0 (1 row) merlin
This part... > repeated roughly once a minute. I've never seen this before. this is ...might mean some connection with autovacuum, maybe? I have no clue beyond that. Hope things go well for you. Kev