
From Ned Wolpert
Subject Database corruption event, unlockable rows, possibly bogus virtual xids? (-1/4444444444)
Msg-id CAFehBFnV3hacj8OCcTSPPwtKCyqTmURCD5GNH6Yob9-AKYOmjA@mail.gmail.com
List pgsql-admin
Folks-

  I'm doing a postmortem on a corruption event we had. I have an idea of what happened, but I'm not sure. I figured I'd share what happened and see if I'm close to right here.

  Event: Running 9.1.6 with hot-standby, archiving 4 months of WAL files, and even a nightly pg_dumpall. 50G database. Trying to update or delete a row in a small (21-row, but heavily used) table would lock up completely and never finish. Removed all clients, restarted the db instance, no joy. Checked pg_stat_activity, and nothing that wasn't idle... ran the delete, and it locked up again.
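  (For reference, the check for non-idle backends was roughly the following; these are the 9.1-era column names, 9.2+ uses pid/state/query instead:)

    -- Show any backend doing real work (9.1 columns)
    SELECT procpid, usename, waiting, current_query
    FROM pg_stat_activity
    WHERE current_query <> '<IDLE>';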

  Ran (SELECT * FROM pg_class JOIN pg_locks ON pg_locks.relation = pg_class.oid;) with no clients touching this row and a fresh restart of the db, and saw virtualtransaction values against this same table of the form -1/nnnnn, where nnnnn was a huge number. Turns out we had about 159 such entries across different tables in the database. Checked the hot-standby and, of course, no locks or anything. Switched to the hot-standby.
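  (Cleaned up, the query was essentially this; the LIKE filter on the odd vxids is just to narrow the output:)

    -- Show locks joined to the relations they are held on,
    -- keeping only the suspicious -1/nnnnn virtual transactions
    SELECT c.relname, l.locktype, l.mode, l.granted, l.virtualtransaction
    FROM pg_locks l
    JOIN pg_class c ON c.oid = l.relation
    WHERE l.virtualtransaction LIKE '-1/%';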

  The hot-standby instantly gained these locks. I also noticed that 2 prepared transactions migrated to the hot-standby. We binary-upgraded to 9.1.8; the locks still existed. We ended up doing the one thing we knew would work: take the site down and pg_dumpall into a fresh instance. Everything is fine now.
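(In case anyone wants to check the same thing: prepared transactions hold their locks across restarts and are replayed on the standby, and they show up in pg_prepared_xacts. Something like the following lists them; the gid below is hypothetical:)

    -- List any prepared (two-phase) transactions still hanging around
    SELECT gid, prepared, owner, database
    FROM pg_prepared_xacts;

    -- A genuinely orphaned one can be cleared with, e.g.:
    -- ROLLBACK PREPARED 'some-orphaned-gid';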

A little more background: We were running 9.1.4 back when 9.1.6 came out. We saw there were possible corruption issues, so we did a binary upgrade and reindexing. Everything seemed to be fine, but I wonder if we really had problems back then. We rebuilt the hot-standby after the binary upgrade via a normal restore and WAL-file replay. I should also note that the row holding the lock that would not go away was created by an app server that was killed (via kill -9) because it was non-responsive, and the row's 'create date' (in the db and the logs) is the exact time the app server was killed.

I was wondering: a) do these virtualtransactions that start with '-1/' indicate a problem, and b) could this have come from pre-9.1.6 corruption that was fixed in 9.1.6? Or could this have occurred when we killed that app server? Or... am I looking in the wrong place?

I do still have the old data directories, so I can start them up and check out the dataset. Any advice?

--
Virtually, Ned Wolpert

"Settle thy studies, Faustus, and begin..."   --Marlowe
