Thread: TOAST table repeatedly corrupted

TOAST table repeatedly corrupted

From: Niles Oien

I have what I think is a bug in postgres. I'd like to help track it down, if that
is indeed what's going on.

I am running a reasonably recent version of postgres :

data=# select version();
                                                 version                                                  
----------------------------------------------------------------------------------------------------------
 PostgreSQL 9.5.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17), 64-bit
(1 row)

About every six months, the logs show that vacuuming of
one of our tables (the same table every time) starts to fail :

 grep ERROR postgresql-09.csv

2018-05-09 16:14:03.834 GMT,,,27018,,5af31e4b.698a,1,,2018-05-09 16:14:03 GMT,12/611211,0,ERROR,XX001,"invalid page in block 1374551 of relation base/16384/36298640",,,,,"automatic vacuum of table ""data.pg_toast.pg_toast_36298637""",,,,""
2018-05-09 16:15:03.865 GMT,,,27696,,5af31e87.6c30,1,,2018-05-09 16:15:03 GMT,12/611219,0,ERROR,XX001,"invalid page in block 1374551 of relation base/16384/36298640",,,,,"automatic vacuum of table ""data.pg_toast.pg_toast_36298637""",,,,""
2018-05-09 16:16:03.897 GMT,,,31666,,5af31ec3.7bb2,1,,2018-05-09 16:16:03 GMT,10/1330957,0,ERROR,XX001,"invalid page in block 1374551 of relation base/16384/36298640",,,,,"automatic vacuum of table ""data.pg_toast.pg_toast_36298637""",,,,""
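
For reference, the file node in those errors can be mapped back to a relation with pg_filenode_relation() (available from 9.4 on; a tablespace of 0 means the database's default tablespace, and this assumes the relation hasn't been rewritten since the log entry) :

data=# select pg_filenode_relation(0, 36298640);
    pg_filenode_relation
----------------------------
 pg_toast.pg_toast_36298637
(1 row)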


So it looks like a TOAST table has been corrupted. I can see which
table the TOAST table is supporting :

data=# select relname from pg_class where reltoastrelid = 'pg_toast.pg_toast_36298637'::regclass;
    relname    
---------------
 rdvtrack_fd05
(1 row)


And sure enough, I now cannot dump that table :

linux> pg_dump -p5432 -Upostgres -t hmi.rdvtrack_fd05 data > hmi_rdvtrack_fd05.sql
pg_dump: Dumping the contents of table "rdvtrack_fd05" failed: PQgetResult() failed.
pg_dump: Error message from server: ERROR:  compressed data is corrupted
pg_dump: The command was: COPY hmi.rdvtrack_fd05 (recnum, sunum, slotnum, sessionid, sessionns, cparms_sg000, v_bzero, v_bscale, carrrot, cmlon, lonhg, lathg, loncm, midtime, crval1, crval2, crval3, t_start, t_stop, lonspan, coverage, zonalvel, mai, module, source, input, created, bld_vers, backgrnd, rejectlist, cmlon_index, loncm_index, lathg_index, lonhg_index, sg_000_file, sg_001_file, history, datavals, missvals, datamin, datamax, datamean, datarms, dataskew, datakurt, meanmu, meanlat) TO stdout;
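
If it helps narrow things down, the damaged rows can presumably be pinned down by forcing each row through detoasting one at a time and noting the ones that error out. A rough sketch (slow on a table this size; it leans on recnum being the primary key) :

DO $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT recnum FROM hmi.rdvtrack_fd05 LOOP
        BEGIN
            -- casting the whole row to text forces every TOASTed column to be detoasted
            PERFORM length(t::text) FROM hmi.rdvtrack_fd05 t WHERE t.recnum = r.recnum;
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'detoast failed at recnum %: %', r.recnum, SQLERRM;
        END;
    END LOOP;
END $$;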


I think I can see the disk files where the tables are stored :

data=# select pg_relation_filepath('hmi.rdvtrack_fd05');
 pg_relation_filepath 
----------------------
 base/16384/36298637
(1 row)

data=# select pg_relation_filepath('pg_toast.pg_toast_36298637');
 pg_relation_filepath 
----------------------
 base/16384/36298640
(1 row)


I can run pg_filedump on those two files; in both cases all
the flags seem unremarkable (although I may not know
what to look for) and it runs to the end of the file, giving
me a message like :

*** End of File Encountered. Last Block Read: 131071 ***
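
For anyone wanting to reproduce, a typical invocation looks roughly like this (the path assumes the stock PGDG data directory, so adjust as needed; -f adds a formatted dump of each block) :

linux> pg_filedump -f /var/lib/pgsql/9.5/data/base/16384/36298640 | tail -n 3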

I think this is probably a bug? Every time it happens
it affects the same table, hmi.rdvtrack_fd05. The hardware
we're running on is fairly new, disks and memory are
not throwing errors, and we've never had problems with any
other tables. We're running on CentOS 6.8 - again, not brand
new but hardly decrepit, either. I've rebuilt the table
several times. This time I'd like to give the PG community
a chance to chase it down before I do that.

I've put the two data files, the postgres config and
essentially what I've said here online at :



If it helps, the TOAST table looks like this :

data=# \d pg_toast.pg_toast_36298637
TOAST table "pg_toast.pg_toast_36298637"
   Column   |  Type   
------------+---------
 chunk_id   | oid
 chunk_seq  | integer
 chunk_data | bytea


And the table the TOAST table is supporting looks like this :

data=# \d hmi.rdvtrack_fd05
          Table "hmi.rdvtrack_fd05"
    Column    |       Type       | Modifiers 
--------------+------------------+-----------
 recnum       | bigint           | not null
 sunum        | bigint           | 
 slotnum      | integer          | 
 sessionid    | bigint           | 
 sessionns    | text             | 
 cparms_sg000 | text             | 
 v_bzero      | double precision | 
 v_bscale     | double precision | 
 carrrot      | integer          | 
 cmlon        | real             | 
 lonhg        | real             | 
 lathg        | real             | 
 loncm        | real             | 
 midtime      | double precision | 
 crval1       | real             | 
 crval2       | real             | 
 crval3       | double precision | 
 t_start      | double precision | 
 t_stop       | double precision | 
 lonspan      | real             | 
 coverage     | real             | 
 zonalvel     | real             | 
 mai          | real             | 
 module       | text             | 
 source       | text             | 
 input        | text             | 
 created      | double precision | 
 bld_vers     | text             | 
 backgrnd     | text             | 
 rejectlist   | text             | 
 cmlon_index  | bigint           | 
 loncm_index  | bigint           | 
 lathg_index  | bigint           | 
 lonhg_index  | bigint           | 
 sg_000_file  | text             | 
 sg_001_file  | text             | 
 history      | text             | 
 datavals     | integer          | 
 missvals     | integer          | 
 datamin      | real             | 
 datamax      | real             | 
 datamean     | double precision | 
 datarms      | double precision | 
 dataskew     | double precision | 
 datakurt     | double precision | 
 meanmu       | real             | 
 meanlat      | real             | 
Indexes:
    "rdvtrack_fd05_pkey" PRIMARY KEY, btree (recnum)
    "rdvtrack_fd05_carrrot" btree (carrrot)
    "rdvtrack_fd05_cmlon_index" btree (cmlon_index)
    "rdvtrack_fd05_lathg_index" btree (lathg_index)
    "rdvtrack_fd05_loncm_index" btree (loncm_index)
    "rdvtrack_fd05_lonhg_index" btree (lonhg_index)

Niles Oien, noien@nso.edu




--
Niles Oien, National Solar Observatory, Boulder Colorado USA

Re: TOAST table repeatedly corrupted

From: "David G. Johnston"

On Wednesday, May 9, 2018, Niles Oien <noien@nso.edu> wrote:

I have what I think is a bug in postgres. I'd like to help track it down, if that
is indeed what's going on.

I am running a reasonably recent version of postgres :

data=# select version();
                                                 version                                                  
----------------------------------------------------------------------------------------------------------
 PostgreSQL 9.5.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17), 64-bit
(1 row)


9.5.12 is reasonably recent.  9.5.5 is not.  Even if you found a live bug you'd have to upgrade to fix it, so you should just do that now and see if that works.  You can scan the release notes for 9.5.6 and newer to see whether any vacuum-related fixes seem to apply.
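
On CentOS 6 with the PGDG repo that's normally just a package update plus a restart; something like the following (package and service names assume the standard PGDG 9.5 packaging):

linux> sudo yum update postgresql95 postgresql95-server postgresql95-libs postgresql95-contrib
linux> sudo service postgresql-9.5 restart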

David J.

Re: TOAST table repeatedly corrupted

From: Tom Lane

Niles Oien <noien@nso.edu> writes:
> I am running a reasonably recent version of postgres :
>  PostgreSQL 9.5.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7
> 20120313 (Red Hat 4.4.7-17), 64-bit

As David said, that's not terribly recent.  If you are going to upgrade,
I'd suggest waiting till tomorrow and grabbing 9.5.13, because we fixed
a pretty serious TOAST data corruption bug in this week's batch of
releases.  The expected symptoms of it don't match what you're seeing,
unfortunately, but nonetheless you ought to be using the latest, just
in case this is an already-fixed issue.

> 2018-05-09 16:14:03.834 GMT,,,27018,,5af31e4b.698a,1,,2018-05-09 16:14:03
> GMT,12/611211,0,ERROR,XX001,"invalid page in block 1374551 of relation
> base/16384/36298640",,,,,"automatic vacuum of table
> ""data.pg_toast.pg_toast_36298637""",,,,""

Block 1374551 would be well past the first segment of the file, since
in a standard build (1GB segments, 8K blocks) there are only 131072
pages per segment.  This explains why you didn't see any complaints
from pg_filedump, if you only ran it over the first segment.
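
A quick way to double-check which segment and in-segment block that works out to, assuming the default 8K blocks and 1GB segments :

data=# select 1374551 / 131072 as segment, 1374551 % 131072 as block_in_segment;
 segment | block_in_segment
---------+------------------
      10 |            63831
(1 row)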

If you've not clobbered the DB yet, file 36298640.10 would be what
to look at, I believe.
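
If pg_filedump is handy, that block can be pulled out of the segment file directly with the -R range option (the path again assumes the stock data directory) :

linux> pg_filedump -f -R 63831 63831 /var/lib/pgsql/9.5/data/base/16384/36298640.10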

> And sure enough, I now cannot dump that table :
> pg_dump: Error message from server: ERROR:  compressed data is corrupted

That's interesting, because it seems to indicate an independent problem.
The "invalid page" error indicates a bad page header, or possibly a
page checksum failure; either way the page would not have been allowed
into the buffer pool.  But "compressed data is corrupted" implies that
we did read a page but the data in it seems messed up.  So this evidence
says you have at least two different corrupted places in that table.

Do you have checksums enabled in this installation?  If you're going
to have to rebuild it, you should probably turn those on (use
initdb --data-checksums), in hopes of narrowing down what's happening.  
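
(For reference, whether checksums are already on is visible as the read-only data_checksums setting, which in this installation would presumably report off :)

data=# show data_checksums;
 data_checksums
----------------
 off
(1 row)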

> I think this is probably a bug? Every time it happens
> it affects the same table, hmi.rdvtrack_fd05.

That's mighty suggestive all right, but unfortunately it doesn't
do much to narrow down the problem :-(

            regards, tom lane


Re: TOAST table repeatedly corrupted

From: Niles Oien

Thanks. I don't have checksums on. I'll look into it on the next build.

The file 36298640.10 didn't show anything unusual under pg_filedump.

I'm betting we are suffering from some already-fixed TOAST issue, if not the one you mentioned from this week's releases. That's probably all the chasing that's worth doing here, given the dated nature of our production box. On our development box, where we have some room to move, we're running something a bit newer :

 PostgreSQL 10.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16), 64-bit

Does that need an upgrade to get this week's TOAST fixes, too? I'm not sure if CentOS's 'yum upgrade' will pick it up - I have the repo pgdg10/7/x86_64 enabled; will the update show up that way?
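
(What I'd be running is roughly the following, assuming the usual PGDG 10 package and service names on CentOS 7 :)

linux> sudo yum update postgresql10 postgresql10-server postgresql10-contrib
linux> sudo systemctl restart postgresql-10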

Thanks,

Niles.

--
Niles Oien, National Solar Observatory, Boulder Colorado USA

Re: TOAST table repeatedly corrupted

From: Peter Geoghegan

On Wed, May 9, 2018 at 11:54 AM, Niles Oien <noien@nso.edu> wrote:
> I think this is probably a bug? Every time it happens
> it affects the same table, hmi.rdvtrack_fd05. The hardware
> we're running on is fairly new, disks and memory are
> not throwing errors, and we've never had problems with any
> other tables. We're running on CentOS 6.8 - again, not brand
> new but hardly decrepit, either. I've rebuilt the table
> several times. This time I'd like to give the PG community
> a chance to chase it down before I do that.

amcheck could potentially reveal more information about the problem.
The version on Github works with 9.5, and there are Redhat/CentOS pgdg
packages you can use. See:

https://github.com/petergeoghegan/amcheck

The "heapallindexed" option may be of particular interest to you. It
verifies that the table has matching rows for a target index, in
addition to testing the structure of the target B-Tree index itself.
It doesn't do anything special with TOAST tables just yet, but it's
still the best general corruption test we have.
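
A rough illustration of a run against the primary key (the extension name amcheck_next and the heapallindexed argument are from memory of the Github README, so double-check the exact call there) :

data=# create extension amcheck_next;
data=# select bt_index_check('hmi.rdvtrack_fd05_pkey', heapallindexed => true);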

-- 
Peter Geoghegan


Re: TOAST table repeatedly corrupted

From: Tom Lane

Niles Oien <noien@nso.edu> writes:
>  PostgreSQL 10.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
> 20150623 (Red Hat 4.8.5-16), 64-bit
> 
> Does that need an upgrade to get this week's TOAST fixes, too?

Yes.

> I'm not sure
> if CentOS's 'yum upgrade' will pick it up - I have the repo pgdg10/7/x86_64
> enabled, will the update show up that way?

I believe so.

            regards, tom lane