Re: index contains unexpected zero page at block error - Mailing list pgsql-general

From François Beausoleil
Subject Re: index contains unexpected zero page at block error
Date
Msg-id 90DF54DB-6BDF-46E6-AB88-D82519BF8528@teksol.info
Whole thread Raw
In response to index contains unexpected zero page at block error  (François Beausoleil <francois@teksol.info>)
List pgsql-general
Le 2014-07-04 à 14:42, François Beausoleil <francois@teksol.info> a écrit :

> Hi,
>
> I ran a query on the slave and got the error message in the subject. I rebuilt the slave two days ago on ZFS (with
compression).I’m wondering how that unexpected zero page happened in the first place. 
>
> Master:
> Linux kalanchoe 3.2.0-58-generic #88-Ubuntu SMP Tue Dec 3 17:37:58 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
> PostgreSQL 9.1.11 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit
>
> Slave:
> Linux munn 3.2.0-65-generic #98-Ubuntu SMP Wed Jun 11 20:27:07 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> PostgreSQL 9.1.13 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit
>
> The slave is connected to the master through streaming replication. I know I’m not perfectly in sync, and I’ll switch
theroles in a week, after a burn-in period for the slave. 
>
> I was running the following query on the slave:
>
> select persona_followers_count, count(*)
> from show_interaction_bindings
> where market_id = 'e8639780-db2d-012e-ebb6-40400fe46aa7’
>  and interaction_created_at >= '2014-06-18’
>  and service_name = ‘Facebook'
>  and show_id = 'ed7e4800-c370-012e-eb90-40400fe46aa7’
> group by 1 order by 2 desc;
>
> and I received this error message:
>
> ERROR:  index "canada_2014_06_23_pkey" contains unexpected zero page at block 135764
> HINT:  Please REINDEX it.
>
> show_interaction_bindings is a master table with partitions by market_id and week, such that I have 500+ child
tables.bindings.2014_06_23’s schema looks like: 
>
>
>                            Table "bindings.canada_2014_06_23"
>               Column                |            Type             |      Modifiers
> -------------------------------------+-----------------------------+----------------------
> interaction_id                      | uuid                        | not null
> show_id                             | uuid                        | not null
> service_name                        | character varying(8)        | not null
> interaction_created_at              | timestamp without time zone | not null
> interaction_service_id              | text                        | not null
> interaction_text                    | text                        |
> interaction_source                  | text                        |
> interaction_platform_name           | character varying(256)      |
> interaction_app_name                | character varying(256)      |
> interaction_in_reply_to_screen_name | character varying(256)      |
> interaction_in_reply_to_status_id   | character varying(64)       |
> interaction_in_reply_to_user_id     | character varying(64)       |
> interaction_language                | character varying(12)       |
> interaction_retweet_count           | integer                     |
> interaction_retweet_id              | character varying(64)       |
> persona_service_id                  | character varying(64)       | not null
> persona_screen_name                 | character varying(256)      | not null
> persona_name                        | character varying(256)      | not null
> persona_description                 | text                        |
> persona_location                    | text                        |
> persona_lang                        | character varying(16)       |
> persona_profile_image_url           | text                        |
> persona_created_at                  | timestamp without time zone |
> persona_followers_count             | integer                     | not null default 130
> persona_friends_count               | integer                     | not null default 130
> persona_time_zone                   | text                        |
> persona_utc_offset                  | integer                     |
> persona_gender                      | character varying(12)       |
> persona_age_group                   | character varying(12)       |
> persona_country                     | character varying(40)       |
> persona_region                      | character varying(80)       |
> persona_city                        | character varying(80)       |
> market_id                           | uuid                        |
> score                               | real                        |
> positive_sentiment                  | real                        | not null default 0.0
> negative_sentiment                  | real                        | not null default 0.0
> like_or_favorite                    | boolean                     |
> binder_log                          | text                        |
> Indexes:
>    "canada_2014_06_23_pkey" PRIMARY KEY, btree (show_id, interaction_service_id, service_name)
> Check constraints:
>    "market_and_created_at_in_partition" CHECK (market_id = 'e8639780-db2d-012e-ebb6-40400fe46aa7'::uuid AND
'2014-06-2307:00:00'::timestamp without time zone <= interaction_created_at AND interaction_created_at < '2014-06-30
07:00:00'::timestampwithout time zone) 
> Inherits: show_interaction_bindings
>
> I just rebuilt the slave two days ago on ZFS, from XFS. I configured ZFS to compress PG’s data directory. PG’s logs
aresilent, except for the mentioned error: 
>
> 2014-07-04 17:09:46.286 UTC - @ 42408 (00000) 2014-07-02 14:21:40 UTC - CONTEXT:  xlog redo delete: index
1663/18332/27715584;iblk 2042654, heap 1663/18332/27715411; 
> 2014-07-04 17:10:02.323 UTC - @ 42408 (00000) 2014-07-02 14:21:40 UTC - LOG:  file "pg_clog/0636" doesn't exist,
readingas zeroes 
> 2014-07-04 17:10:02.323 UTC - @ 42408 (00000) 2014-07-02 14:21:40 UTC - CONTEXT:  xlog redo delete: index
1663/18332/27715584;iblk 2041497, heap 1663/18332/27715411; 
> 2014-07-04 17:10:32.257 UTC - @ 42408 (00000) 2014-07-02 14:21:40 UTC - LOG:  file "pg_clog/08A1" doesn't exist,
readingas zeroes 
> 2014-07-04 17:10:32.257 UTC - @ 42408 (00000) 2014-07-02 14:21:40 UTC - CONTEXT:  xlog redo delete: index
1663/18332/27715584;iblk 2042678, heap 1663/18332/27715411; 
> 2014-07-04 17:12:45.201 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - LOG:  restartpoint complete: wrote 84024
buffers(2.0%); 0 transaction log file(s) added, 32 removed, 64 recycled; write=540.052 s, sync=0.020 s, total=540.098
s;sync files=103, longest=0.009 s, average=0.000 s 
> 2014-07-04 17:12:45.201 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - LOG:  recovery restart point at 30AB/D5E94FE8
> 2014-07-04 17:12:45.201 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - DETAIL:  last completed transaction was at
logtime 2014-07-04 17:12:36.607511+00 
> 2014-07-04 17:13:02.470 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - LOG:  restartpoint starting: xlog
> 2014-07-04 17:22:02.090 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - LOG:  restartpoint complete: wrote 84292
buffers(2.0%); 0 transaction log file(s) added, 8 removed, 34 recycled; write=539.555 s, sync=0.048 s, total=539.620 s;
syncfiles=94, longest=0.006 s, average=0.000 s 
> …
> 2014-07-04 17:32:02.092 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - LOG:  restartpoint complete: wrote 64598
buffers(1.5%); 0 transaction log file(s) added, 0 removed, 41 recycled; write=540.000 s, sync=0.005 s, total=540.014 s;
syncfiles=85, longest=0.001 s, average=0.000 s 
> 2014-07-04 17:32:02.092 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - LOG:  recovery restart point at 30AC/375129E0
> 2014-07-04 17:32:02.092 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - DETAIL:  last completed transaction was at
logtime 2014-07-04 17:31:53.506748+00 
> 2014-07-04 17:33:02.079 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - LOG:  restartpoint starting: time
> 2014-07-04 17:42:02.152 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - LOG:  restartpoint complete: wrote 78599
buffers(1.9%); 0 transaction log file(s) added, 25 removed, 31 recycled; write=540.034 s, sync=0.013 s, total=540.073
s;sync files=86, longest=0.002 s, average=0.000 s 
> 2014-07-04 17:42:02.152 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - LOG:  recovery restart point at 30AC/5C0514F8
> 2014-07-04 17:42:02.152 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - DETAIL:  last completed transaction was at
logtime 2014-07-04 17:42:01.954993+00 
> 2014-07-04 17:44:47.178 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - LOG:  restartpoint starting: time
> …
> 2014-07-04 17:53:47.067 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - LOG:  restartpoint complete: wrote 60429
buffers(1.4%); 0 transaction log file(s) added, 0 removed, 37 recycled; write=539.855 s, sync=0.027 s, total=539.889 s;
syncfiles=67, longest=0.010 s, average=0.000 s 
> 2014-07-04 17:53:47.067 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - LOG:  recovery restart point at 30AC/89B87FB8
> 2014-07-04 17:53:47.067 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - DETAIL:  last completed transaction was at
logtime 2014-07-04 17:53:38.213828+00 
> 2014-07-04 17:54:47.046 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - LOG:  restartpoint starting: time
> 2014-07-04 17:58:03.242 UTC - svanreporting@svanalytics_production 58863 (XX002) 2014-07-04 17:57:20 UTC - ERROR:
index"canada_2014_06_23_pkey" contains unexpected zero page at block 135764 
> 2014-07-04 17:58:03.242 UTC - svanreporting@svanalytics_production 58863 (XX002) 2014-07-04 17:57:20 UTC - HINT:
PleaseREINDEX it. 
> 2014-07-04 17:58:03.242 UTC - svanreporting@svanalytics_production 58863 (XX002) 2014-07-04 17:57:20 UTC - STATEMENT:
select persona_followers_count, count(*) from show_interaction_bindings where market_id =
'e8639780-db2d-012e-ebb6-40400fe46aa7'and interaction_created_at >= '2014-06-18' and service_name = 'Facebook 'and
show_id= 'ed7e4800-c370-012e-eb90-40400fe46aa7' group by 1 order by 2 desc; 
> 2014-07-04 18:03:47.184 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - LOG:  restartpoint complete: wrote 86440
buffers(2.1%); 0 transaction log file(s) added, 16 removed, 29 recycled; write=540.078 s, sync=0.038 s, total=540.137
s;sync files=88, longest=0.010 s, average=0.000 s 
> 2014-07-04 18:03:47.184 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - LOG:  recovery restart point at 30AC/A5FF4D70
> 2014-07-04 18:03:47.184 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - DETAIL:  last completed transaction was at
logtime 2014-07-04 18:03:42.458954+00 
> 2014-07-04 18:05:02.004 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - LOG:  restartpoint starting: time
> 2014-07-04 18:14:02.090 UTC - @ 42453 (00000) 2014-07-02 14:21:41 UTC - LOG:  restartpoint complete: wrote 57764
buffers(1.4%); 0 transaction log file(s) added, 0 removed, 28 recycled; write=540.064 s, sync=0.015 s, total=540.085 s;
syncfiles=64, longest=0.003 s, average=0.000 s 
>
> (skipped a few syntax errors in the logs, otherwise, this is it)
>
> I found http://www.postgresql.org/message-id/9b0dbfb1d330b895475232e70cff6e02@mailserver7.nebula.fi, but this is
about9.3. The response does mention upgrading to the latest 9.3 version. 
>
> This seems to imply a data corruption problem caused by an incorrect fsync:
http://postgresql.1045698.n5.nabble.com/FATAL-index-contains-unexpected-zero-page-at-block-td5732056.html
>
> This is my first experience with ZFS. I configured two disks as mirrors. zpool status does not report any problems:
>
> $ sudo zpool status
>  pool: pgdata
> state: ONLINE
>  scan: scrub canceled on Wed Jul  2 15:41:23 2014
> config:
>
>        NAME        STATE     READ WRITE CKSUM
>        pgdata      ONLINE       0     0     0
>          mirror-0  ONLINE       0     0     0
>            sdb     ONLINE       0     0     0
>            sdc     ONLINE       0     0     0
>
> errors: No known data errors
>
> $ sudo zfs list
> NAME                           USED  AVAIL  REFER  MOUNTPOINT
> pgdata                         989G  1.71T    31K  /pgdata
> pgdata/data                    988G  1.71T    31K  /pgdata/data
> pgdata/data/dbanalytics        982G  1.71T   808G  /var/lib/postgresql
> pgdata/data/walinbox          6.66G  1.71T  6.66G  /var/backups/seevibes/wal
>
> I have a zpool scrub starting tonight, and I’ll see if there are disk errors from there.
>
> Any ideas on how that data error could slip through?

To put out more data:

* The zpool scrub found no issues;
* I successfully ran a pg_dumpall > /dev/null from the master, confirming all data files are ok.

I restored from a fresh base backup, knowing that this backup was taken while the master had not been restarted. I’m
runningpg_dumpall > /dev/null on the slave, to determine if the data errors can be reproduced. I’ll know more in 24h. 

Bye,
François Beausoleil



pgsql-general by date:

Previous
From: Martin Gudmundsson
Date:
Subject: Bi-Directional replication client awareness
Next
From: Sergey Konoplev
Date:
Subject: Re: Synonym/thesaurus dictionaries for FTS