Thread: DATA corruption after promoting slave to master
Hello,

We are seeing some DB inconsistencies after promoting our slave to master.

[1] We have a master-slave Hot Standby setup (log shipping).

[2] We also have a daemon process which copies the latest partial WAL log file (the one postgres is currently writing to, under pg_xlog/) every 3 secs to a different location.

The master server crashed, so we decided to promote the slave to master. Before creating the trigger file, we passed the partial WAL log file [2] to the slave and made it replay that file as well, so that the data loss would be minimal.

The following link has the errors that we saw at the different stages: http://pastebin.com/raw.php?i=Ekr0VrQA

We are seeing these problems on the newly promoted master now:

1. When we run queries on the primary key, we don't get rows even though they exist in the db. However, if we force the query not to use the index, we do get those entries.

2. There are duplicate values for primary keys.

Postgres version that we use: 9.0.13
Postgres conf: http://pastebin.com/raw.php?i=B0diaitG

Any help here is appreciated. Thanks in advance.

- Karthik Iyer
On 06/25/2014 06:29 AM, Karthik Iyer wrote:

> [2] We also have a daemon process which copies the latest partial WAL
> log file (the one postgres is currently writing to, under pg_xlog/)
> every 3 secs to a different location.

No. No, no, no, no no. No.

Also, no. Partial WAL files are not valid for recovery. In fact, I'm surprised the standby node applied it at all.

> We are seeing these problems on the newly promoted master now:
>
> 1. When we run queries on the primary key, we don't get rows even
> though they exist in the db. However, if we force the query not to use
> the index, we do get those entries.
> 2. There are duplicate values for primary keys.

This is no surprise. Your slave has partial data commits, which means your table, your indexes, or possibly both, are corrupt.

The first thing you need to do is back up any tables where you've noticed this behavior. Second, try to reindex the tables that are having problems. The errors you are seeing are due to the data and indexes being out of sync. If the reindex fails due to duplicate values, you can do this:

SET enable_indexscan TO false;
SET enable_bitmapscan TO false;

SELECT primary_key, count(1)
  FROM broken_table
 GROUP BY 1
HAVING count(1) > 1;

For any ID that comes back, do this:

SELECT ctid, *
  FROM broken_table
 WHERE primary_key = [value(s) from above];

Then decide which of the bad rows to keep, and delete the others using the CTID of each row you want gone:

DELETE FROM broken_table
 WHERE ctid = 'whatever';

Then reindex the table so the correct values are properly covered. Doing this for all of your corrupt tables may take a while, depending on how many there are.

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com
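Two hedged sketches to go with the above. First, the supported way to ship WAL for this kind of setup: the archive_command hook fires only on completed WAL segments, so the archive never contains a half-written file (the archive path below is a placeholder):

archive_mode = on                           # postgresql.conf on the master
archive_command = 'cp %p /mnt/archive/%f'   # %p = source path, %f = file name

Second, the two diagnostic queries above can be collapsed into a single listing that returns every duplicated key together with the physical location of each offending row ("broken_table" and "primary_key" are placeholders; run it with the same enable_indexscan/enable_bitmapscan settings as above so the broken index is never consulted):

SELECT ctid, primary_key
  FROM broken_table
 WHERE primary_key IN (SELECT primary_key
                         FROM broken_table
                        GROUP BY primary_key
                       HAVING count(1) > 1)
 ORDER BY primary_key;

Each unwanted ctid from that listing then feeds the DELETE shown above.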
On 06/26/2014 09:44 AM, Karthik Iyer wrote:

> We reindexed all the primary and unique keys of all the tables, but we
> did not reindex the tables. You think we should do that also ?

Yes, you need to reindex. Part of the problem with this kind of corruption is that PostgreSQL has applied data and index page modifications to the binary files themselves. Because of this, there are probably index rows pointing to incorrect or invalid data pages.

Reindexing the primary and unique indexes is a very good start, but any queries that use the regular indexes can still return invalid or mismatched rows compared to the primary/unique results. Those kinds of mismatches can result in unexpected behavior from applications.

> We managed to get the old server up. We are trying to replay the
> difference in data by checking the log files (statement logs). You
> think there are any other easy alternatives ?

Probably not. Now that your slave has been promoted and has had its own write activity, the old primary is not useful as a source of binary replay. You *can* use it to verify which rows are old/new while vetting the corruption, though. It's a good source of data for checking the consistency of the new primary.

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com
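Concretely, a table-level REINDEX rebuilds every index on the table, not just the primary/unique ones. A minimal sketch, with "broken_table" and "your_database" as placeholders; note that REINDEX takes heavy locks while it runs, and 9.0 has no concurrent variant:

REINDEX TABLE broken_table;

-- or, if many tables are affected, rebuild everything in one pass:
REINDEX DATABASE your_database;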
On 06/26/2014 10:34 AM, Karthik Iyer wrote:

> Any inputs here? You think a pgdump and restore would help more ?

A full dump and restore would definitely help. I tend not to suggest it often because I work with very large databases that are usually extremely cumbersome to dump and restore.

But yeah, if you can get a successful pg_dump of your database, a restore should clean up all of your data and index inconsistencies, if you're willing to wait.

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com
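A minimal sketch of that round trip, with database names and paths as placeholders (-Fc writes a custom-format archive; pg_restore's -j flag, available since 8.4, runs the restore in parallel):

pg_dump -Fc -f yourdb.dump yourdb
createdb yourdb_clean
pg_restore -j 4 -d yourdb_clean yourdb.dump

If pg_dump itself errors out on a corrupt row, that table still needs the ctid surgery described earlier before the dump can complete.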
Thanks Shaun.

We reindexed all the primary and unique keys of all the tables, but we did not reindex the tables. You think we should do that also ?

Also, do you think we should do a clean dump and restore to eliminate all data inconsistencies ?

One more query: we managed to get the old server up. We are trying to replay the difference in data by checking the log files (statement logs). You think there are any other easy alternatives ?

Thanks in advance,

- Karthik

On Thursday 26 June 2014 01:09 AM, Shaun Thomas wrote:

> This is no surprise. Your slave has partial data commits, which means
> your table, your indexes, or possibly both, are corrupt.
>
> The first thing you need to do is back up any tables where you've
> noticed this behavior. Second, try to reindex the tables that are
> having problems. The errors you are seeing are due to the data and
> indexes being out of sync.
>> We reindexed all the primary and unique keys of all the tables, but we
>> did not reindex the tables. You think we should do that also ?
>
> Yes, you need to reindex. Part of the problem with this kind of
> corruption is that PostgreSQL has applied data and index page
> modifications to the binary files themselves. Because of this, there
> are probably index rows pointing to incorrect or invalid data pages.
>
> Reindexing the primary and unique indexes is a very good start, but any
> queries that use the regular indexes can still return invalid or
> mismatched rows compared to the primary/unique results. Those kinds of
> mismatches can result in unexpected behavior from applications.
Thanks, we will reindex the tables too.
Also, do you think we should do a clean dump and restore to eliminate all data inconsistencies ?
Any inputs here? You think a pgdump and restore would help more ?
TIA,
- Karthik
> A full dump and restore would definitely help. I tend not to suggest
> that often because I work with very large databases that are usually
> extremely cumbersome to dump and restore.
>
> But yeah, if you can get a successful pg_dump of your database, a
> restore should clean up all of your data and index inconsistencies,
> if you're willing to wait.

Thanks a lot Shaun. Appreciate the help.

- Karthik
Hi Krit,

It looks like your actual problem is here:

> Index Scan using t1_orderid_creationtime_idx on t1
>   (cost=0.43..1181104.36 rows=9879754 width=158)
>   (actual time=0.021..60830.724 rows=2416614 loops=1)

This index scan estimates 9.8M rows, and had to touch 2.4M. The issue is that your LIMIT clause makes the planner overly optimistic. The worst-case cost estimate for this part of the query is about 1.2M, which is much higher than the seq scan variation you posted. The planner must think it can get the rows without incurring the full cost, otherwise I can't see how the 1.2M cost estimate wasn't rolled into the total estimate.

Unfortunately, behavior like this is pretty common when using LIMIT clauses. Sometimes the planner thinks it can get results much faster than it actually can, and it ends up reading a much larger portion of the data than it assumed would be necessary.

Just out of curiosity, can you tell me what your default_statistics_target is?
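If it is the LIMIT path misleading the planner, one hedged way to test that, sketched against the query reproduced below and not a guaranteed win, is to hide the filter behind a CTE. In PostgreSQL versions before 12, a CTE is an optimization fence, so the planner must materialize the small filtered set first instead of walking the big orderid index hoping to satisfy the LIMIT early:

WITH filtered AS (
    SELECT *
      FROM t1
     WHERE currentstatus = E'Active'
       AND (parentkey = E'1_2_3' OR parentkey LIKE E'1\\_2\\_3\\_%')
)
SELECT *
  FROM filtered
 INNER JOIN t2 ON t2.orderid = filtered.orderid
 ORDER BY filtered.orderid
 LIMIT 500 OFFSET 2000;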
----------------------------------- (1) -----------------------------------
explain analyze select * from t1 inner join t2 on t2.orderid = t1.orderid where t1.currentstatus IN (E'Active') and (t1.parentkey = E'1_2_3' OR t1.parentkey LIKE E'1\\_2\\_3\\_%') order by t1.orderid limit 500 offset 2000;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=61151.75..74956.58 rows=500 width=316) (actual time=48066.053..62264.179 rows=500 loops=1)
   ->  Merge Join  (cost=5932.46..1212391.22 rows=43697 width=316) (actual time=417.461..62262.616 rows=2500 loops=1)
         Merge Cond: (t2.orderid = t1.orderid)
         ->  Index Scan using t1_orderid_creationtime_idx on t1  (cost=0.43..1181104.36 rows=9879754 width=158) (actual time=0.021..60830.724 rows=2416614 loops=1)
         ->  Sort  (cost=5932.02..6041.26 rows=43697 width=158) (actual time=221.333..225.101 rows=2500 loops=1)
               Sort Key: t1.orderid
               Sort Method: quicksort  Memory: 3573kB
               ->  Bitmap Heap Scan on t1  (cost=59.85..2564.02 rows=43697 width=158) (actual time=11.443..210.783 rows=12005 loops=1)
                     Recheck Cond: ((parentkey = '1_2_3'::text) OR (parentkey ~~ '1\_2\_3\_%'::text))
                     Filter: ((currentstatus = 'Active'::text) AND ((parentkey = '1_2_3'::text) OR (parentkey ~~ '1\_2\_3\_%'::text)))
                     Rows Removed by Filter: 915
                     ->  BitmapOr  (cost=59.85..59.85 rows=649 width=0) (actual time=7.220..7.220 rows=0 loops=1)
                           ->  Bitmap Index Scan on t1_parentkey_idx  (cost=0.00..33.43 rows=649 width=0) (actual time=5.625..5.625 rows=10646 loops=1)
                                 Index Cond: (parentkey = '1_2_3'::text)
                           ->  Bitmap Index Scan on t1_parentkey_idx  (cost=0.00..4.57 rows=1 width=0) (actual time=1.592..1.592 rows=2445 loops=1)
                                 Index Cond: ((parentkey ~>=~ '1_2_3_'::text) AND (parentkey ~<~ '1_2_3`'::text))
 Total runtime: 62265.568 ms
(17 rows)
----------------------------------- (2) -----------------------------------
begin ;
set enable_indexscan=false;
explain analyze select * from t1 inner join t2 on t2.orderid = t1.orderid where t1.currentstatus IN (E'Active') and (t1.parentkey = E'1_2_3' OR t1.parentkey LIKE E'1\\_2\\_3\\_%') order by t1.orderid limit 500 offset 2000;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=316628.97..316630.22 rows=500 width=316) (actual time=1182.978..1183.196 rows=500 loops=1)
   ->  Sort  (cost=316623.97..316741.70 rows=47094 width=316) (actual time=1182.409..1182.935 rows=2500 loops=1)
         Sort Key: t1.orderid
         Sort Method: top-N heapsort  Memory: 1482kB
         ->  Nested Loop  (cost=64.13..313730.58 rows=47094 width=316) (actual time=17.099..1122.316 rows=12920 loops=1)
               ->  Bitmap Heap Scan on t1  (cost=61.54..2564.10 rows=47094 width=158) (actual time=17.001..501.400 rows=12920 loops=1)
                     Recheck Cond: ((parentkey = '1_2_3'::text) OR (parentkey ~~ '1\_2\_3\_%'::text))
                     Filter: ((parentkey = '1_2_3'::text) OR (parentkey ~~ '1\_2\_3\_%'::text))
                     ->  BitmapOr  (cost=61.54..61.54 rows=649 width=0) (actual time=11.575..11.575 rows=0 loops=1)
                           ->  Bitmap Index Scan on t1_parentkey_idx  (cost=0.00..33.43 rows=649 width=0) (actual time=9.221..9.221 rows=10646 loops=1)
                                 Index Cond: (parentkey = '1_2_3'::text)
                           ->  Bitmap Index Scan on t1_parentkey_idx  (cost=0.00..4.57 rows=1 width=0) (actual time=2.352..2.352 rows=2445 loops=1)
                                 Index Cond: ((parentkey ~>=~ '1_2_3_'::text) AND (parentkey ~<~ '1_2_3`'::text))
               ->  Bitmap Heap Scan on t1  (cost=2.58..6.60 rows=1 width=158) (actual time=0.043..0.043 rows=1 loops=12920)
                     Recheck Cond: (orderid = t1.orderid)
                     ->  Bitmap Index Scan on t1_orderid_creationtime_idx  (cost=0.00..2.58 rows=1 width=0) (actual time=0.038..0.038 rows=1 loops=12920)
                           Index Cond: (orderid = t1.orderid)
 Total runtime: 1184.140 ms
(18 rows)
Krit,
It sounds like you might be hitting edge-case statistics problems on recent data. We’ve had issues with this before, ourselves. I’d personally recommend increasing your default_statistics_target to 400 for your entire database in general. But it’s possible that won’t help this. Recent data tends to have no stats at all, and if it comes in with enough volume, the planner will be wrong a good majority of the time. If that’s the data you’re trying to grab, well…
Based on that, you might also want to consider lowering your autovacuum_analyze_scale_factor setting so that busy tables are analyzed relatively frequently and the stats stay fresh through the day. Otherwise, your approach should work just fine.
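A hedged sketch of both knobs, with "t1" and "orderid" standing in for your actual hot table and column. default_statistics_target is normally raised database-wide in postgresql.conf, but the per-column form below overrides it; the storage parameter makes autovacuum analyze after roughly 2% churn instead of the 10% default:

ALTER TABLE t1 ALTER COLUMN orderid SET STATISTICS 400;

ALTER TABLE t1 SET (autovacuum_analyze_scale_factor = 0.02);

ANALYZE t1;  -- rebuild the stats right away rather than waiting for autovacuum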
Good luck!