Re: ERROR: attempted to delete invisible tuple - Mailing list pgsql-general

From Bryan Murphy
Subject Re: ERROR: attempted to delete invisible tuple
Msg-id 7fd310d10908171432i513a4b8ck69ce7b7e89bf4b56@mail.gmail.com
In response to Re: ERROR: attempted to delete invisible tuple  (Greg Stark <gsstark@mit.edu>)
List pgsql-general


On Mon, Aug 17, 2009 at 4:02 PM, Greg Stark <gsstark@mit.edu> wrote:
> For what it's worth at EDB I dealt with another case like this and I
> imagine others have too. I think it's too easy to do things in the
> wrong order or miss a step and end up with these kinds of problems.
>
> I would really like to know what happened here which caused the
> problem. Do you have records of how you created the slave? When you
> took the initial image, did you use a cold backup or a hot backup? Did
> you use pg_start_backup()/pg_stop_backup()?

I always use log shipping. The original snapshot was taken as follows (many months ago):

start log shipping
pg_start_backup()
tar snapshot to nfs volume
pg_stop_backup()
restore snapshot on another machine
startup in recovery mode
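
For anyone who wants to script the middle three steps above, here is a minimal sketch in Python. It is not the script that was actually used. The function name, the paths, the backup label, and the use of psql with default connection settings are all assumptions for illustration. It also assumes the pg_start_backup()/pg_stop_backup() functions named above can be called from separate sessions, and that the label is a trusted string.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]


def run(cmd: Sequence[str]) -> None:
    """Run a command and raise if it exits non-zero."""
    subprocess.run(list(cmd), check=True)


def take_base_backup(data_dir: str, tar_path: str, label: str = "base",
                     runner: Runner = run) -> None:
    """Tar the data directory between pg_start_backup and pg_stop_backup.

    data_dir, tar_path and label are hypothetical values supplied by the
    caller; psql is assumed to connect with its default settings.
    """
    runner(["psql", "-c", f"SELECT pg_start_backup('{label}')"])
    try:
        # Archive the contents of the data directory to tar_path.
        runner(["tar", "-cf", tar_path, "-C", data_dir, "."])
    finally:
        # Always leave backup mode, even if tar fails.
        runner(["psql", "-c", "SELECT pg_stop_backup()"])
```

The try/finally is the point of the sketch: a failed tar should not leave the server in backup mode. Note that tar can exit with a non-zero status when files change while it reads them, which is normal on a live data directory; this sketch treats any non-zero exit as a failure for simplicity.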

The machine we restored to was not the original warm spare. It was built from a snapshot of the original warm spare. Every few weeks (or when paranoia sets in) I replace old spares with new spares to verify that the backups are working. I bring the old spares online, verify them, then throw them away.

xfs_freeze -f filesystem
file system snapshot (EBS snapshot)
xfs_freeze -u filesystem 
restore snapshot on another machine
startup in recovery mode
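
The freeze, snapshot, unfreeze sequence above can be sketched the same way. Again, this is only an illustration and not the script that was actually used. The function name and the mount point are assumptions, and the snapshot step is left as a caller-supplied function because the exact EBS snapshot command is not given here.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]


def run(cmd: Sequence[str]) -> None:
    """Run a command and raise if it exits non-zero."""
    subprocess.run(list(cmd), check=True)


def snapshot_frozen(mount_point: str, take_snapshot: Callable[[], None],
                    runner: Runner = run) -> None:
    """Freeze an XFS filesystem, take a snapshot, then unfreeze it.

    mount_point is a hypothetical path; take_snapshot is whatever starts
    the volume snapshot (for example, the EBS snapshot call).
    """
    runner(["xfs_freeze", "-f", mount_point])
    try:
        take_snapshot()
    finally:
        # Unfreeze even if the snapshot call fails, so writes are not
        # left blocked on the filesystem.
        runner(["xfs_freeze", "-u", mount_point])
```

The unfreeze sits in a finally block because a filesystem that stays frozen will block every write to it.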

> When you failed over was there anything special happening? Was it
> because of a failure on the master? Was a vacuum full running?

We never run vacuum full, but autovacuum is always turned on.  

As best I can tell, at the time of failure the hard drive that contained the root file system and our swap partition was failing. Unfortunately, I was not at the office when this occurred, and I'm the best one at diagnosing what's going on. Also, they're Amazon EC2 instances, so when it went dead, it went really dead. It's no longer available to us, and I had to quickly reclaim the EBS volumes it was using in order to get a new hot spare up and running.

So, we basically lost everything but the console output of the original database, and of course, I can't find where I put that.  I do remember it strongly indicated that we were facing imminent drive failure.

> When the slave came up do you have the log messages saying it was
> starting recovery and when it was finishing recovery and starting
> normal operations?

This is tricky as well.  We failed over to our spare, only to find out our load had grown so much the spare was no longer able to carry us the way it was configured.  We had to quickly rebuild a new spare, and then shift over to that one.  

So, we basically shut down the spare, shifted the volume over to the new machine, and then brought it back online. I then rebuilt a new warm spare from the new machine.

Which leads me to the one big flaw in all of this: the log files were all going to the local drives and not the EBS volumes, so I've lost them and am now kicking myself in the ass for it.

Bryan
 
