
From Shaun Thomas
Subject Re: Re: corruption issue after server crash - ERROR: unexpected chunk number 0
Date
Msg-id 0683F5F5A5C7FE419A752A034B4A0B97975A247F@sswchi5pmbx2.peak6.net
In response to Re: corruption issue after server crash - ERROR: unexpected chunk number 0  (Mike Broers <mbroers@gmail.com>)
List pgsql-general
> Update - I have two hot replication slaves of this db, both have the problem.
> I took one out of recovery and ran REINDEX table session_session and it
> fixed the errors about this row.  Now I'm going to run vacuum and see if
> there are other tables that complain, but I'm guessing if so I will need to see
> if there is a way to force vacuum to continue on error; worst case I might
> have to script a table-by-table vacuum, I guess.  If anyone has a better
> suggestion for determining the extent of the damage I'd appreciate it.

Oh man. I'm sorry, Mike.

One of the cardinal rules I have is to disconnect any replication following a database crash. It's just too easy for damaged replicated rows to be propagated unless you're on 9.3 and have checksums enabled. If you want to perform a table-by-table check, don't vacuum the database, but the individual tables. I'd generate the VACUUM statements with a query and capture the output in a log so you can investigate further:

COPY (
SELECT 'VACUUM ' || oid::regclass::text || ';'
  FROM pg_class
 WHERE relkind = 'r'
) to '/tmp/vac_all.sql';
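
As a sketch of running the generated script, assuming psql can reach the database (the database name here is a placeholder):

psql -d yourdb -f /tmp/vac_all.sql 2>&1 | tee /tmp/vac_all.log
grep -B1 ERROR /tmp/vac_all.log

The grep pulls out any failed VACUUM along with the statement just before it, so you get a quick list of damaged tables.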

Run the /tmp/vac_all.sql through psql and pipe the output into a log file. Any table that doesn't vacuum successfully will need to be repaired manually. One way you can do this if there are dupes is by checking the ctid value after disabling index scans:

SET enable_indexscan TO False;
SET enable_bitmapscan TO False; -- bitmap scans would also read the damaged index

SELECT ctid, * FROM [broken_table] WHERE ...;

Just construct the WHERE clause based on the error output, and you should get all rows if there are dupes. You'll need to figure out which row to keep, then delete the bad row based on the ctid. Do this as many times as it takes, then reindex to make sure the proper row versions are indexed.
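For example, with a hypothetical table name and ctid (substitute the values from your own output):

DELETE FROM broken_table WHERE ctid = '(4723,2)';
REINDEX TABLE broken_table;

A ctid is only stable until the next vacuum or update touches that page, so delete immediately after identifying the bad row version.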

It's also a good idea to dump any table that came back with an error, just in case.
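
pg_dump can grab a single table for that, something like (table and database names are placeholders):

pg_dump -t session_session -d yourdb > /tmp/session_session_backup.sql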

After you've done all of that, you should re-base your replicas once you've determined your production system is usable. In the meantime, I highly recommend you set up a VIP you can assign to one of your replicas if your production system dies again, and remove any autostart code. If your production system crashes, switch the VIP immediately to a replica, and invalidate your old production system. Data corruption is insidious when streaming replication is involved.

Look into tools like repmgr to handle managing your replicas as a cluster to make forced invalidation and re-basing
easier.

Good luck!

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com


