Thread: Index corruption
Hello postgres experts, We are running a test that periodically abruptly kills postgres process(equivalent to kill -9) and restarts it. After running this test for 24 hrs or so, we see duplicate primary key entries in postgres table. We detect this as we load internal hash-table data-structure in a separate process with primary key entries. Before hitting this issue we see following warning messages in pg_log 17365 2015-03-24 03:01:42.729 GMTWARNING: page is not marked all-visible but visibility map bit is set in relation "table_foo" page 12 17365 2015-03-24 03:01:42.729 GMTWARNING: page is not marked all-visible but visibility map bit is set in relation "table_foo" page 13 Some information about schema. - This table can contain upto 150k entries. - *IMPORTANT*: We constantly insert new entries and remove older entries from the table. Relevant columns in table_foo ----------------------------------------------------------------------------- pk_col3 | bigint | not null default 0::bigint pk_col1 | bigint | not null default 0::bigint pk_col2 | bigint | not null default 0::bigint "table_foo_pkey" PRIMARY KEY, btree (pk_col1, pk_col2, pk_col3) There are 3 other indexes on non-primary key columns in the table. Duplicate entries db=# select pk_col1, pk_col2, pk_col3, count(1) from table_foo group by pk_col1, pk_col2, pk_col3 having count(1) > 1; pk_col1 | pk_col2| pk_col3 | count --------------------+--------+----------+------- 627708949163497688 | 1 | 13467 | 2 627708949163497688 | 4 | 13566 | 2 627708949163497688 | 266 | 13565 | 2 (3 rows) Query analyzer using index only scan. sodb=# explain select pk_col1, pk_col2, pk_col3, count(1) from table_foo group by pk_col1, pk_col2, pk_col3 having count(1) > 1 order by pk_col3; QUERY PLAN ------------------------------------------------------------------------------------------------------ Sort (cost=166.25..167.97 rows=689 width=24) Sort Key: pk_col3 -> HashAggregate (cost=125.16..133.77 rows=689 width=24) Filter: (count(1) > 1) -> Index Only Scan using table_foo_pkey on table_foo (cost=0.00..113.36 rows=944 width=24) (5 rows) When non-primary key column is queried we don't get duplicate entries. Query analyzer is using sequential scan on table_foo table. sodb=# select pk_col1, pk_col2, pk_col3, creation_time, count(1) from table_foo group by pk_col1, pk_col2, pk_col3 having count(1) > 1 order by pk_col3; pk_col1 | pk_col2 | pk_col3 | creation_time | count ------------------------------------------- (0 rows) sodb=# explain select pk_col1, pk_col2, pk_col3, creation_time, count(1) from table_foo group by pk_col1, pk_col2, pk_col3 having count(1) > 1 order by pk_col3; QUERY PLAN -------------------------------------------------------------------------- Sort (cost=174.33..176.06 rows=689 width=32) Sort Key: pk_col3 -> HashAggregate (cost=133.24..141.85 rows=689 width=32) Filter: (count(1) > 1) -> Seq Scan on table_foo (cost=0.00..121.44 rows=944 width=32) (5 rows) We ran an experiment wherein we reindex the offending table on every postgres startup and we don't see the same issue after reindex. This leads us to believe that the index is corrupted but actual data on the table is fine. Some information about postgres setup. - 9.2.0 - We use standard configuration with shared_buffer setting as 32MB and checkpoint_timeout as 1 min. - In this particular case postgres replication is not enabled. Let me know if more information is needed to help understand this issue. Any help or pointers will be appreciated. Thanks, Bankim.
On 3/24/2015 11:49 AM, Bankim Bhavsar wrote: > - 9.2.0 9.2 is currently at 9.2.10. 9.2.0 was released 2.5 years ago. several of the bugs fixed in the 10 incremental updates were data corruption related. -- john, recycling bits in santa cruz
On Tuesday, March 24, 2015, Bankim Bhavsar <bankim@nimblestorage.com> wrote:
Hello postgres experts,
We are running a test that periodically abruptly kills postgres process(equivalent to kill -9) and restarts it.
After running this test for 24 hrs or so, we see duplicate primary key entries in postgres table.
We detect this as we load internal hash-table data-structure in a separate process with primary key entries.
Before hitting this issue we see following warning messages in pg_log
17365 2015-03-24 03:01:42.729 GMTWARNING: page is not marked all-visible but visibility map bit is set in relation "table_foo" page 12
17365 2015-03-24 03:01:42.729 GMTWARNING: page is not marked all-visible but visibility map bit is set in relation "table_foo" page 13
Some information about postgres setup.
- 9.2.0
- We use standard configuration with shared_buffer setting as 32MB and checkpoint_timeout as 1 min.
- In this particular case postgres replication is not enabled.
Let me know if more information is needed to help understand this issue.
Any help or pointers will be appreciated.
If you really are running 9.2.0 instead of 9.2.10 you should upgrade and reconfirm.
David J.
Thanks for the input.
We'll upgrade to 9.2.10 and attempt to reproduce the issue.
If possible, can someone point to corruption related issues fixed after 9.2.0?
Thanks,
-Bankim.
On 03/24/2015 11:56 AM, David G. Johnston wrote:
We'll upgrade to 9.2.10 and attempt to reproduce the issue.
If possible, can someone point to corruption related issues fixed after 9.2.0?
Thanks,
-Bankim.
On 03/24/2015 11:56 AM, David G. Johnston wrote:
On Tuesday, March 24, 2015, Bankim Bhavsar <bankim@nimblestorage.com> wrote:Hello postgres experts,
We are running a test that periodically abruptly kills postgres process(equivalent to kill -9) and restarts it.
After running this test for 24 hrs or so, we see duplicate primary key entries in postgres table.
We detect this as we load internal hash-table data-structure in a separate process with primary key entries.
Before hitting this issue we see following warning messages in pg_log
17365 2015-03-24 03:01:42.729 GMTWARNING: page is not marked all-visible but visibility map bit is set in relation "table_foo" page 12
17365 2015-03-24 03:01:42.729 GMTWARNING: page is not marked all-visible but visibility map bit is set in relation "table_foo" page 13Some information about postgres setup.
- 9.2.0
- We use standard configuration with shared_buffer setting as 32MB and checkpoint_timeout as 1 min.
- In this particular case postgres replication is not enabled.
Let me know if more information is needed to help understand this issue.
Any help or pointers will be appreciated.If you really are running 9.2.0 instead of 9.2.10 you should upgrade and reconfirm.David J.
On 3/24/2015 12:05 PM, Bankim Bhavsar wrote: > We'll upgrade to 9.2.10 and attempt to reproduce the issue. > > If possible, can someone point to corruption related issues fixed > after 9.2.0? in the postgres manual, see the release notes for 9.2.1 through 9.2.10 -- john, recycling bits in santa cruz
John R Pierce <pierce@hogranch.com> wrote: > On 3/24/2015 12:05 PM, Bankim Bhavsar wrote: >> If possible, can someone point to corruption related issues fixed >> after 9.2.0? > > in the postgres manual, see the release notes for 9.2.1 through 9.2.10 You can find links to those from here: http://www.postgresql.org/docs/9.2/static/release.html -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> We'll upgrade to 9.2.10 and attempt to reproduce the issue. Been running the same test with postgres 9.2.10 for over 48 hrs and we are unable to reproduce the index corruption issueobserved earlier with 9.2.0. Thanks for the help everyone!! -Bankim. ________________________________________ From: pgsql-general-owner@postgresql.org [pgsql-general-owner@postgresql.org] on behalf of Kevin Grittner [kgrittn@ymail.com] Sent: Tuesday, March 24, 2015 12:47 PM To: John R Pierce; pgsql-general@postgresql.org Subject: Re: [GENERAL] Index corruption John R Pierce <pierce@hogranch.com> wrote: > On 3/24/2015 12:05 PM, Bankim Bhavsar wrote: >> If possible, can someone point to corruption related issues fixed >> after 9.2.0? > > in the postgres manual, see the release notes for 9.2.1 through 9.2.10 You can find links to those from here: http://www.postgresql.org/docs/9.2/static/release.html -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On Tue, Mar 24, 2015 at 12:49 PM, Bankim Bhavsar <bankim@nimblestorage.com> wrote: > Hello postgres experts, > > We are running a test that periodically abruptly kills postgres > process(equivalent to kill -9) and restarts it. > After running this test for 24 hrs or so, we see duplicate primary key > entries in postgres table. > Are you killing the POSTMASTER process, deleting the pid file, and then restarting it without killing its children? That's a recipe for corruption.