Thread: DATA corruption after promoting slave to master

DATA corruption after promoting slave to master

From
Karthik Iyer
Date:
Hello,

We are facing some DB inconsistencies after promoting the slave to
master.


[1] We have master-slave with Hot Standby Setup (log shipping).
[2]  We also have a daemon process which copies the latest partial WAL
log file (which postgres is currently writing to, under pg_xlog/) every
3 secs to a different location.

The master server crashed. We decided to promote the slave to master.
Before creating the trigger file, we passed the partial WAL log file [2]
and made the slave replay it as well (so that data loss would be
minimal). The following link has the errors that we saw at different
stages:

http://pastebin.com/raw.php?i=Ekr0VrQA

We are seeing these problems in the newly promoted master now:

     1. When we run queries on the primary key, we don't get the rows
even though they exist in the DB. However, if we force the query not to
use the index, we do get those entries.
     2. There are duplicate values for primary keys.


Postgres version that we use:

9.0.13

Postgres conf: http://pastebin.com/raw.php?i=B0diaitG

Any help here is appreciated.

Thanks in advance.

- Karthik Iyer


Re: DATA corruption after promoting slave to master

From
Shaun Thomas
Date:
On 06/25/2014 06:29 AM, Karthik Iyer wrote:

> [2]  We also have a daemon process which copies the latest partial WAL
> log file (which postgres is currently writing to, under pg_xlog/) every
> 3 secs to a different location.

No. No, no, no, no no. No.

Also, no. Partial WAL files are not valid for recovery. In fact, I'm
surprised the standby node even applied it at all.
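
For reference, the supported way to ship WAL on 9.0 is archive_command
(or streaming replication), which only hands off completed segments.
A minimal sketch, with an illustrative archive location:

# postgresql.conf on the master; /archive is an illustrative path
wal_level = hot_standby
archive_mode = on
archive_command = 'test ! -f /archive/%f && cp %p /archive/%f'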

> We are seeing these problems in the newly promoted master now:
>
>      1. When we run queries on the primary key, we don't get the rows
> even though they exist in the DB. However, if we force the query not to
> use the index, we do get those entries.
>      2. There are duplicate values for primary keys.

This is no surprise. Your slave has partial data commits, which means
your table, index, or possibly both, are corrupt.

The first thing you need to do is back up any tables you've noticed are
having this behavior. Second, try to reindex the tables that are having
problems. The errors you are seeing are due to the data and indexes
being out of sync. If you get an error that says the reindex fails due
to duplicate values, you can do this:

-- Bypass the (possibly corrupt) indexes for this session:
SET enable_indexscan TO false;
SET enable_bitmapscan TO false;

-- Find primary key values that appear more than once:
SELECT primary_key, count(1)
   FROM broken_table
  GROUP BY 1
HAVING count(1) > 1;

For any ID that comes back, do this:

SELECT ctid, *
   FROM broken_table
  WHERE primary_key = [value(s) from above];

Then you need to delete one of the bad rows after deciding which. Use
the CTID of the row you want to delete:

DELETE FROM broken_table
  WHERE ctid = 'whatever';

Then reindex the table so the correct values are properly covered. Doing
this for all of your corrupt tables may take a while depending on how
many there are.
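
For example, using the placeholder name from above:

REINDEX TABLE broken_table;

That rebuilds every index on the table, not just the primary key.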

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com



Re: DATA corruption after promoting slave to master

From
Shaun Thomas
Date:
On 06/26/2014 09:44 AM, Karthik Iyer wrote:

> We reindexed all the primary and unique keys of all the tables, but we
> did not reindex the tables. Do you think we should do that also?

Yes, you need to reindex. Part of the problem with this kind of table
corruption is that PostgreSQL has applied data and index page
modifications to the binary files themselves. Because of this, there are
probably index rows pointing to incorrect or invalid data pages.

Reindexing the primary and unique indexes is a very good start, but any
queries that use regular indexes can still return invalid or mismatched
rows when compared to the primary/unique results. Those kinds of
mismatches can result in unexpected behavior from applications.
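
As a rough sketch of both routes (the database name is illustrative, and
REINDEX takes exclusive locks while it runs):

REINDEX TABLE broken_table;   -- one table at a time, or
REINDEX DATABASE yourdb;      -- every index, run while connected to yourdb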

> We managed to get the old server up. We are trying to replay the
> difference in data by checking the log files (statement logs). Do you
> think there are any other easy alternatives?

Probably not. Now that your slave has been promoted and had its own
write activity, the old primary is not useful as a source of binary
replay. You *can* use it to verify which rows are old/new while vetting
out corruption, though. It's a good source of data for checking the
consistency of the new primary.
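
One low-tech way to do that comparison, as a sketch (table, key, and
path are illustrative): export a sorted snapshot from each server and
diff the files offline.

COPY (SELECT * FROM broken_table ORDER BY primary_key)
  TO '/tmp/broken_table.csv' WITH CSV;
-- server-side COPY needs superuser; \copy from psql works as well

Run the same COPY on the old and new primary, then compare the two dumps.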

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com



Re: DATA corruption after promoting slave to master

From
Shaun Thomas
Date:
On 06/26/2014 10:34 AM, Karthik Iyer wrote:

> Any inputs here? Do you think a pg_dump and restore would help more?

A full dump and restore would definitely help. I tend not to suggest
that often because I work with very large databases that are usually
extremely cumbersome to dump and restore.

But yeah, if you can get a successful pg_dump from your database, a
restore should obviously clean up all of your data and index
inconsistencies if you're willing to wait.
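
As a minimal sketch (database names are illustrative), restoring into a
fresh database keeps the damaged one around for reference:

pg_dump -Fc broken_db > broken_db.dump
createdb clean_db
pg_restore -d clean_db broken_db.dump

Once you've verified the restored copy, you can swap the databases with
ALTER DATABASE ... RENAME.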

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com



Re: DATA corruption after promoting slave to master

From
Karthik Iyer
Date:
Thanks Shaun.

We reindexed all the primary and unique keys of all the tables, but we
did not reindex the tables. Do you think we should do that also?

Also, do you think we should do a clean dump and restore to eliminate
all data inconsistencies?

One more query:

We managed to get the old server up. We are trying to replay the
difference in data by checking the log files (statement logs). Do you
think there are any other easy alternatives?

Thanks in advance,

- Karthik


On Thursday 26 June 2014 01:09 AM, Shaun Thomas wrote:
> On 06/25/2014 06:29 AM, Karthik Iyer wrote:
>
>> [2]  We also have a daemon process which copies the latest partial WAL
>> log file (which postgres is currently writing to, under pg_xlog/) every
>> 3 secs to a different location.
>
> No. No, no, no, no no. No.
>
> Also, no. Partial WAL files are not valid for recovery. In fact, I'm
> surprised the standby node even applied it at all.
>
>> We are seeing these problems in the newly promoted master now:
>>
>>      1. When we run queries on the primary key, we don't get the rows
>> even though they exist in the DB. However, if we force the query not
>> to use the index, we do get those entries.
>>      2. There are duplicate values for primary keys.
>
> This is no surprise. Your slave has partial data commits, which means
> your table, index, or possibly both, are corrupt.
>
> The first thing you need to do is back up any tables you've noticed
> are having this behavior. Second, try to reindex the tables that are
> having problems. The errors you are seeing are due to the data and
> indexes being out of sync. If you get an error that says the reindex
> fails due to duplicate values, you can do this:
>
> SET enable_indexscan TO false;
> SET enable_bitmapscan TO false;
>
> SELECT primary_key, count(1)
>   FROM broken_table
>  GROUP BY 1
> HAVING count(1) > 1;
>
> For any ID that comes back, do this:
>
> SELECT ctid, *
>   FROM broken_table
>  WHERE primary_key = [value(s) from above];
>
> Then you need to delete one of the bad rows after deciding which. Use
> the CTID of the row you want to delete:
>
> DELETE FROM broken_table
>  WHERE ctid = 'whatever';
>
> Then reindex the table so the correct values are properly covered.
> Doing this for all of your corrupt tables may take a while depending
> on how many there are.
>



Re: DATA corruption after promoting slave to master

From
Karthik Iyer
Date:
Hi Shaun,


>> We reindexed all the primary and unique keys of all the tables, but
>> we did not reindex the tables. Do you think we should do that also?

> Yes, you need to reindex. Part of the problem with this kind of table
> corruption is that PostgreSQL has applied data and index page
> modifications to the binary files themselves. Because of this, there
> are probably index rows pointing to incorrect or invalid data pages.
>
> Reindexing the primary and unique indexes is a very good start, but
> any queries that use regular indexes can still return invalid or
> mismatched rows when compared to the primary/unique results. Those
> kinds of mismatches can result in unexpected behavior from applications.

Thanks, we will reindex the tables too.

>> Also, do you think we should do a clean dump and restore to eliminate
>> all data inconsistencies?

Any inputs here? Do you think a pg_dump and restore would help more?

TIA,

- Karthik

Re: DATA corruption after promoting slave to master

From
Karthik Iyer
Date:
> A full dump and restore would definitely help. I tend not to suggest
> that often because I work with very large databases that are usually
> extremely cumbersome to dump and restore.
>
> But yeah, if you can get a successful pg_dump from your database, a
> restore should obviously clean up all of your data and index
> inconsistencies if you're willing to wait.

Thanks a lot Shaun. Appreciate the help.

- Karthik


Re: DATA corruption after promoting slave to master

From
Shaun Thomas
Date:
Hi Kirit,

It looks like your actual problem is here:

>  Index Scan using t1_orderid_creationtime_idx on t1
>  (cost=0.43..1181104.36 rows=9879754 width=158)
>  (actual time=0.021..60830.724 rows=2416614 loops=1)

This index scan estimates 9.8M rows, and had to touch 2.4M. The issue is
that your LIMIT clause makes the planner overly optimistic. The worst
case cost estimate for this part of the query is about 1.2M, which is
much higher than the seq scan variation you posted. The planner must
think it can get the rows without incurring the full cost, otherwise I
can't see how the 1.2M cost estimate wasn't rolled into the total
estimate.

Unfortunately behavior like this is pretty common when using LIMIT
clauses. Sometimes the planner thinks it can get results much faster
than it actually can, and it ends up reading a much larger portion of
the data than it assumed would be necessary.
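
If you want to see what the planner thinks the full cost is, one sketch
is to run the same query without the LIMIT/OFFSET:

explain select * from t1 inner join t2 on t2.orderid = t1.orderid
 where t1.currentstatus IN (E'Active')
   and (t1.parentkey = E'1_2_3' OR t1.parentkey LIKE E'1\\_2\\_3\\_%')
 order by t1.orderid;

If the merge join path loses without the LIMIT, that supports the theory
that the early-abort assumption is what's tipping the scales.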

Just out of curiosity, can you tell me what your default_statistics_target is?



Re: DATA corruption after promoting slave to master

From
Kirit Parmar
Date:
Hey Shaun,

Hope you are keeping well. Could you please help me with this uncommon
behaviour of Postgres? When executing the query (1), it takes around 62
seconds. However, when disabling the index scan, the same query (2)
executes in around 2 seconds. Is there any reason why the query planner
prefers (1)?

----------------------------------- (1) -----------------------------------

 

explain analyze select * from t1 inner join t2 on t2.orderid = t1.orderid where t1.currentstatus IN (E'Active') and (t1.parentkey = E'1_2_3' OR t1.parentkey LIKE E'1\\_2\\_3\\_%') order by t1.orderid limit 500 offset 2000;

                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=61151.75..74956.58 rows=500 width=316) (actual time=48066.053..62264.179 rows=500 loops=1)
   ->  Merge Join  (cost=5932.46..1212391.22 rows=43697 width=316) (actual time=417.461..62262.616 rows=2500 loops=1)
         Merge Cond: (t2.orderid = t1.orderid)
         ->  Index Scan using t1_orderid_creationtime_idx on t1  (cost=0.43..1181104.36 rows=9879754 width=158) (actual time=0.021..60830.724 rows=2416614 loops=1)
         ->  Sort  (cost=5932.02..6041.26 rows=43697 width=158) (actual time=221.333..225.101 rows=2500 loops=1)
               Sort Key: t1.orderid
               Sort Method: quicksort  Memory: 3573kB
               ->  Bitmap Heap Scan on t1  (cost=59.85..2564.02 rows=43697 width=158) (actual time=11.443..210.783 rows=12005 loops=1)
                     Recheck Cond: ((parentkey = '1_2_3'::text) OR (parentkey ~~ '1\_2\_3\_%'::text))
                     Filter: ((currentstatus = 'Active'::text) AND ((parentkey = '1_2_3'::text) OR (parentkey ~~ '1\_2\_3\_%'::text)))
                     Rows Removed by Filter: 915
                     ->  BitmapOr  (cost=59.85..59.85 rows=649 width=0) (actual time=7.220..7.220 rows=0 loops=1)
                           ->  Bitmap Index Scan on t1_parentkey_idx  (cost=0.00..33.43 rows=649 width=0) (actual time=5.625..5.625 rows=10646 loops=1)
                                 Index Cond: (parentkey = '1_2_3'::text)
                           ->  Bitmap Index Scan on t1_parentkey_idx  (cost=0.00..4.57 rows=1 width=0) (actual time=1.592..1.592 rows=2445 loops=1)
                                 Index Cond: ((parentkey ~>=~ '1_2_3_'::text) AND (parentkey ~<~ '1_2_3`'::text))
 Total runtime: 62265.568 ms
(17 rows)

----------------------------------- (2) -----------------------------------

 

begin;
set enable_indexscan = false;

explain analyze select * from t1 inner join t2 on t2.orderid = t1.orderid where t1.currentstatus IN (E'Active') and (t1.parentkey = E'1_2_3' OR t1.parentkey LIKE E'1\\_2\\_3\\_%') order by t1.orderid limit 500 offset 2000;

                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=316628.97..316630.22 rows=500 width=316) (actual time=1182.978..1183.196 rows=500 loops=1)
   ->  Sort  (cost=316623.97..316741.70 rows=47094 width=316) (actual time=1182.409..1182.935 rows=2500 loops=1)
         Sort Key: t1.orderid
         Sort Method: top-N heapsort  Memory: 1482kB
         ->  Nested Loop  (cost=64.13..313730.58 rows=47094 width=316) (actual time=17.099..1122.316 rows=12920 loops=1)
               ->  Bitmap Heap Scan on t1  (cost=61.54..2564.10 rows=47094 width=158) (actual time=17.001..501.400 rows=12920 loops=1)
                     Recheck Cond: ((parentkey = '1_2_3'::text) OR (parentkey ~~ '1\_2\_3\_%'::text))
                     Filter: ((parentkey = '1_2_3'::text) OR (parentkey ~~ '1\_2\_3\_%'::text))
                     ->  BitmapOr  (cost=61.54..61.54 rows=649 width=0) (actual time=11.575..11.575 rows=0 loops=1)
                           ->  Bitmap Index Scan on t1_parentkey_idx  (cost=0.00..33.43 rows=649 width=0) (actual time=9.221..9.221 rows=10646 loops=1)
                                 Index Cond: (parentkey = '1_2_3'::text)
                           ->  Bitmap Index Scan on t1_parentkey_idx  (cost=0.00..4.57 rows=1 width=0) (actual time=2.352..2.352 rows=2445 loops=1)
                                 Index Cond: ((parentkey ~>=~ '1_2_3_'::text) AND (parentkey ~<~ '1_2_3`'::text))
               ->  Bitmap Heap Scan on t1  (cost=2.58..6.60 rows=1 width=158) (actual time=0.043..0.043 rows=1 loops=12920)
                     Recheck Cond: (orderid = t1.orderid)
                     ->  Bitmap Index Scan on t1_orderid_creationtime_idx  (cost=0.00..2.58 rows=1 width=0) (actual time=0.038..0.038 rows=1 loops=12920)
                           Index Cond: (orderid = t1.orderid)
 Total runtime: 1184.140 ms
(18 rows)


Looking forward to your help.

 

On Thu, Jun 26, 2014 at 11:07 PM, Karthik Iyer <karthik.i@directi.com> wrote:

>> A full dump and restore would definitely help. I tend not to suggest
>> that often because I work with very large databases that are usually
>> extremely cumbersome to dump and restore.
>>
>> But yeah, if you can get a successful pg_dump from your database, a
>> restore should obviously clean up all of your data and index
>> inconsistencies if you're willing to wait.
>
> Thanks a lot Shaun. Appreciate the help.
>
> - Karthik



--
Kirit Parmar

Re: DATA corruption after promoting slave to master

From
Shaun Thomas
Date:

Kirit,

It sounds like you might be hitting edge-case statistics problems on recent data. We’ve had issues with this before, ourselves. I’d personally recommend increasing your default_statistics_target to 400 for your entire database in general. But it’s possible that won’t help this. Recent data tends to have no stats at all, and if it comes in with enough volume, the planner will be wrong a good majority of the time. If that’s the data you’re trying to grab, well…

Based on that, you might want to consider tweaking your autovacuum_analyze_scale_factor setting so you get relatively frequent automatic analyzes and the stats stay current through the day. Otherwise, your approach should work just fine.
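
On 9.0, changing default_statistics_target database-wide means editing
postgresql.conf and reloading, but as a sketch you can experiment per
table first, using the t1/orderid names from your plans (the values are
starting points, not gospel):

-- Larger sample for the column driving the bad estimate:
ALTER TABLE t1 ALTER COLUMN orderid SET STATISTICS 400;
-- Analyze after roughly 2% churn instead of the 10% default:
ALTER TABLE t1 SET (autovacuum_analyze_scale_factor = 0.02);
ANALYZE t1;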

 

Good luck!



Re: DATA corruption after promoting slave to master

From
Kirit Parmar
Date:
Hey Shaun,

Thanks for the quick reply.

We have not changed the default_statistics_target, so it remains at its default value of 100.

I would like to share our observation, in case you can infer anything from it: the planner started using query plan 1 after our scheduled vacuum run, and it continued to use that plan for a day, until the next day's scheduled vacuum run. It then switched to the optimal query plan 2 and the slowness disappeared.

However, this is what we are thinking of doing in case the issue resurfaces:

1. Run VACUUM ANALYZE (assuming it will help the query planner choose the best possible path, since it updates the stats in pg_statistic).
2. If that does not work, increase the default_statistics_target for the column.

Let us know your thoughts.

On Thu, Nov 6, 2014 at 7:39 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> Hi Kirit,
>
> It looks like your actual problem is here:
>
>>  Index Scan using t1_orderid_creationtime_idx on t1
>>  (cost=0.43..1181104.36 rows=9879754 width=158)
>>  (actual time=0.021..60830.724 rows=2416614 loops=1)
>
> This index scan estimates 9.8M rows, and had to touch 2.4M. The issue is that your LIMIT clause makes the planner overly optimistic. The worst case cost estimate for this part of the query is about 1.2M, which is much higher than the seq scan variation you posted. The planner must think it can get the rows without incurring the full cost, otherwise I can't see how the 1.2M cost estimate wasn't rolled into the total estimate.
>
> Unfortunately behavior like this is pretty common when using LIMIT clauses. Sometimes the planner thinks it can get results much faster than it actually can, and it ends up reading a much larger portion of the data than it assumed would be necessary.
>
> Just out of curiosity, can you tell me what your default_statistics_target is?




--
Kirit Parmar