Thread: Database corrupted

Database corrupted

From
Yann ROBIN
Date:
Hi,

Earlier this afternoon, our database crashed with a stack trace.
We forcibly killed the remaining postgres process.
Since then it's been hell!

First Postgres told us that there was a corrupted index and we needed
to reindex it.
We couldn't do it because there were duplicate ids in the table.
So I decided to make a copy of the table and then try to remove
data/pkey constraint.

The database then crashed and couldn't restart. There was an xlog
flush request error.
Based on what we saw on the internet, we ran pg_resetxlog.

The database started nicely but the data was still corrupted. So we ran a
REINDEX command and got this warning in the log (1000 times per second):
WARNING : concurrent delete in progress within table

We waited 30 minutes, but the message was still there and the reindex had not finished.

We couldn't find any help online.
Any idea what to do to get the database back?


Thanks,

--
Yann

Re: Database corrupted

From
"Kevin Grittner"
Date:
Yann ROBIN <me.show@gmail.com> wrote:

> Earlier this afternoon, our database crashed with a stack trace.

First things first: Before you do anything else, shut down
PostgreSQL and make a copy of the data directory tree.

http://wiki.postgresql.org/wiki/Corruption
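
As a rough illustration of the "copy the data directory tree" step, here
is a minimal Python sketch. The function name and the paths are
hypothetical, and treating a leftover postmaster.pid as "not safely shut
down" is an assumption made for this sketch, not an official procedure.
It also keeps symlinks as symlinks, so tablespaces that live outside the
data directory would need to be copied separately.

```python
import shutil
from pathlib import Path


def copy_data_directory(data_dir: str, backup_dir: str) -> Path:
    """Copy a stopped cluster's data directory tree to backup_dir.

    Hypothetical helper for illustration only.
    """
    src = Path(data_dir)
    dst = Path(backup_dir)

    # Assumption: a postmaster.pid file means the server is either still
    # running or went down uncleanly. Refuse to copy until a human has
    # confirmed that no postgres process is using this directory.
    if (src / "postmaster.pid").exists():
        raise RuntimeError("postmaster.pid present: stop PostgreSQL first")

    # copytree raises if the destination already exists, which protects an
    # earlier backup from being overwritten. symlinks=True copies symbolic
    # links as links instead of following them.
    shutil.copytree(src, dst, symlinks=True)
    return dst
```

A call might look like copy_data_directory("/path/to/data",
"/path/to/backup"), with both paths replaced by the real locations.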

Second, please post information about your environment.  What
version of PostgreSQL is this?  What is your OS?  What hardware is
it running on?

Now, please copy from the log at the time of the crash and post all
messages, plus any possibly relevant messages in the clients and the
OS logs.  Is there a core file from the crash?  Can you get a
backtrace from it?

> We forcibly killed the remaining postgres process.

It's best to keep notes of exactly what was done.  What was the
process description of what you killed?  Which signal did you use?
Keep notes as you go.

> First Postgres told us that there was a corrupted index and we
> needed to reindex it.
> We couldn't do it because there were duplicate ids in the table.
> So I decided to make a copy of the table and then try to remove
> data/pkey constraint.
>
> The database then crashed and couldn't restart. There was an xlog
> flush request error.

Copy/paste it?

> Based on what we saw on the internet, we ran pg_resetxlog.
>
> The database started nicely but the data was still corrupted. So we
> ran a REINDEX command and got this warning in the log (1000 times
> per second):
> WARNING : concurrent delete in progress within table
>
> We waited 30 minutes, but the message was still there and the
> reindex had not finished.

How big is the table?  What kind of column(s) in the index?

> We couldn't find any help online.

There is this list.  For advice on how to get the most useful advice
from it, you might want to read this page:

http://wiki.postgresql.org/wiki/Guide_to_reporting_problems

You might want to consider contracting for professional support:

http://www.postgresql.org/support/professional_support/

-Kevin

Re: Database corrupted

From
Yann ROBIN
Date:
>
> First things first: Before you do anything else, shut down
> PostgreSQL and make a copy of the data directory tree.
>
> http://wiki.postgresql.org/wiki/Corruption
>

Did it.

> Second, please post information about your environment.  What
> version of PostgreSQL is this?  What is your OS?  What hardware is
> it running on?
>

Version 9.0.4 on debian squeeze 6.0 running in KVM with virtio (kernel 2.6.32)
I think we have a hard drive issue.

> Now, please copy from the log at the time of the crash and post all
> messages, plus any possibly relevant messages in the clients and the
> OS logs.  Is there a core file from the crash?  Can you get a
> backtrace from it?
>

I removed the log file, as it was taking too much space because of the
1000 warnings per second.
I feel dumb right now. Sorry.

>
> It's best to keep notes of exactly what was done.  What was the
> process description of what you killed?  Which signal did you use?
> Keep notes as you go.
>

kill -9 of the writer process

>
> Copy/paste it?
>

Sorry, I lost it, but it was: xlog flush request 0/xxxx is not satisfied
--- flushed only to xxxx

>
> How big is the table?  What kind of column(s) in the index?
>

Primary key index on an int32 column; the table has 1.2M rows.


--
Yann

Re: Database corrupted

From
Scott Marlowe
Date:
On Mon, Dec 5, 2011 at 1:18 PM, Yann ROBIN <me.show@gmail.com> wrote:
>>
>> First things first: Before you do anything else, shut down
>> PostgreSQL and make a copy of the data directory tree.
>>
>> http://wiki.postgresql.org/wiki/Corruption
>>
>
> Did it.
>
>> Second, please post information about your environment.  What
>> version of PostgreSQL is this?  What is your OS?  What hardware is
>> it running on?
>>
>
> Version 9.0.4 on debian squeeze 6.0 running in KVM with virtio (kernel 2.6.32)
> I think we have a hard drive issue.
>
>> Now, please copy from the log at the time of the crash and post all
>> messages, plus any possibly relevant messages in the clients and the
>> OS logs.  Is there a core file from the crash?  Can you get a
>> backtrace from it?
>>
>
> I removed the log file, as it was taking too much space because of the
> 1000 warnings per second.
> I feel dumb right now. Sorry.
>
>>
>> It's best to keep notes of exactly what was done.  What was the
>> process description of what you killed?  Which signal did you use?
>> Keep notes as you go.
>>
>
> kill -9 of the writer process

Are you sure you killed all the postgres backends before restarting
the server?  If other backends are still running, with a dead
postmaster, and you restart the server, instant corruption.

Re: Database corrupted

From
"Kevin Grittner"
Date:
Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Yann ROBIN <me.show@gmail.com> wrote:

>> Version 9.0.4 on debian squeeze 6.0 running in KVM with virtio
>> (kernel 2.6.32)

I don't know anything about KVM or virtio, so hopefully others will
step up.

>> I think we have a hard drive issue.

If at all possible, I would try to sort that out before continuing
recovery.  You could keep piling up one disk error on top of
another; and each can make the others harder to sort out and fix.

>> kill -9 of the writer process
>
> Are you sure you killed all the postgres backends before
> restarting the server?  If other backends are still running, with
> a dead postmaster, and you restart the server, instant corruption.

Yeah, I would make *absolutely* sure there isn't an orphaned
postgres process still running before trying any other recovery
steps.

Once you are sure that your storage system isn't nibbling away at
your data and there isn't an old postgres process running, you might
want to list the rows with the duplicate key value, and then delete
them.  The safest course, once you've rebuilt the index, is to use
pg_dump and psql (or pg_restore) to rebuild the database.
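
The "list the rows with the duplicate key value, and then delete them"
step can be sketched roughly as below. This is only an analogy: it uses
Python's built-in sqlite3 module so that it runs standalone, the table
and column names are made up, and the rule "keep the lowest rowid" is an
assumption. On PostgreSQL the physical row identifier is ctid rather
than rowid, and which copy of a duplicated row to keep is something to
decide by looking at the listed rows first.

```python
import sqlite3


def find_duplicate_ids(conn, table, key):
    """Return (key value, row count) for every key stored more than once.

    table and key are interpolated directly, so they must be trusted
    identifiers; this is a sketch, not production code.
    """
    query = (
        f"SELECT {key}, COUNT(*) FROM {table} "
        f"GROUP BY {key} HAVING COUNT(*) > 1 ORDER BY {key}"
    )
    return conn.execute(query).fetchall()


def delete_extra_copies(conn, table, key):
    """Delete all but the lowest-rowid row for each key value.

    Returns the number of rows deleted.
    """
    cur = conn.execute(
        f"DELETE FROM {table} WHERE rowid NOT IN "
        f"(SELECT MIN(rowid) FROM {table} GROUP BY {key})"
    )
    return cur.rowcount
```

With a damaged index, a query that uses the index may not see every
duplicate, so on PostgreSQL it may help to turn off index scans for the
session (SET enable_indexscan = off) before listing the rows.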

Be sure to keep that copy of the data directory tree for at *least*
a few weeks after everything seems to be running fine.  You may well
belatedly discover a reason to go back and fish for some data.

-Kevin

Re: Database corrupted

From
Tom Lane
Date:
Scott Marlowe <scott.marlowe@gmail.com> writes:
> On Mon, Dec 5, 2011 at 1:18 PM, Yann ROBIN <me.show@gmail.com> wrote:
>> kill -9 of the writer process

> Are you sure you killed all the postgres backends before restarting
> the server?  If other backends are still running, with a dead
> postmaster, and you restart the server, instant corruption.

There are interlocks against that ... although if you were foolish
enough to manually remove postmaster.pid, you could defeat them :-(

            regards, tom lane

Re: Database corrupted

From
Yann ROBIN
Date:
On Mon, Dec 5, 2011 at 9:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> There are interlocks against that ... although if you were foolish
> enough to manually remove postmaster.pid, you could defeat them :-(
>

We didn't remove postmaster.pid

--
Yann

Re: Database corrupted

From
Scott Marlowe
Date:
On Mon, Dec 5, 2011 at 1:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Scott Marlowe <scott.marlowe@gmail.com> writes:
>> On Mon, Dec 5, 2011 at 1:18 PM, Yann ROBIN <me.show@gmail.com> wrote:
>>> kill -9 of the writer process
>
>> Are you sure you killed all the postgres backends before restarting
>> the server?  If other backends are still running, with a dead
>> postmaster, and you restart the server, instant corruption.
>
> There are interlocks against that ... although if you were foolish
> enough to manually remove postmaster.pid, you could defeat them :-(

I've had to remove it once or twice in the past (has that behavior
changed in more recent versions, where it's smarter about it?) but I
knew to check for orphaned backends as well.  If someone removed it
without checking for orphans, they'd definitely be seeing odd behaviour
and a corrupted database.
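
For anyone tempted to remove postmaster.pid by hand, a small check along
these lines may help before touching it. It is a minimal sketch with
hypothetical function names, assuming a Unix-like system: the first line
of postmaster.pid holds the postmaster's PID, and sending signal 0 tests
whether that process exists without disturbing it. It does not detect
orphaned backends, which have their own PIDs, so a look at the process
list is still needed.

```python
import os
from pathlib import Path
from typing import Optional


def read_postmaster_pid(data_dir: str) -> Optional[int]:
    """Return the PID on the first line of postmaster.pid, or None if the
    file is missing or empty."""
    pid_file = Path(data_dir) / "postmaster.pid"
    try:
        first_line = pid_file.read_text().splitlines()[0]
    except (FileNotFoundError, IndexError):
        return None
    return int(first_line.strip())


def process_exists(pid: int) -> bool:
    """Report whether a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

If read_postmaster_pid returns a PID and process_exists says it is
alive, the lock file is doing its job and should be left alone.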