Next steps in debugging database storage problems? - Mailing list pgsql-general

From Jacob Bunk Nielsen
Subject Next steps in debugging database storage problems?
Msg-id spamdrop+87ha31kxrc.fsf@atom.bunk.cc
Responses Re: Next steps in debugging database storage problems?  (Jacob Bunk Nielsen <jacob@bunk.cc>)
List pgsql-general
Hi

We are running PostgreSQL 9.3.4 in an LXC container on Debian
Wheezy with a Linux 3.10.43 kernel on a Dell R620 server. Data are
stored on an XFS file system. We are seeing errors such as:

unexpected data beyond EOF in block 2 of relation base/805208133/1238511128

and

could not read block 5 in file "base/805208348/1259338118": read only 0 of 8192 bytes
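
For reference, the numbers in these paths are the database OID and the
relation's filenode, so they can be mapped back to names through the
system catalogs. A minimal sketch, assuming default 8192-byte blocks
(which matches the messages above) and that the last query is run while
connected to the database found by the first one. The table name in the
last query is only an illustration:

```sql
-- Which database owns the directory base/805208133?
SELECT oid, datname
FROM pg_database
WHERE oid = 805208133;

-- Connected to that database: which relation uses filenode 1238511128?
-- relkind shows whether it is a table (r), an index (i) or a TOAST table (t).
SELECT c.oid::regclass AS relation, c.relkind, c.relfilenode
FROM pg_class c
WHERE c.relfilenode = 1238511128;

-- Compare the on-disk path and the size in blocks that PostgreSQL reports
-- with what the file system shows for the same file.
SELECT pg_relation_filepath('jms_messages') AS path,
       pg_relation_size('jms_messages')
           / current_setting('block_size')::int AS blocks;
```

A "read only 0 of 8192 bytes" for block 5 means the file ended before
byte offset 5 * 8192 = 40960 at the time of the read.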

These errors start to occur every few days once the server has been up
for 30-40 days. If we reboot the server, it takes another 30-40 days
before we see any problems again.

The server ran fine on a Dell R710 for a long time and was moved to a
Dell R620 last year, which is when the problems started. We have
tried switching to a different Dell R620, but that did not make a
difference. We've seen this with kernels 3.2, 3.4 and 3.10.

The two tables that run into these problems are very simple, but
rather busy. They are defined as follows:

CREATE TABLE jms_messages (
    messageid integer NOT NULL,
    destination text NOT NULL,
    txid integer,
    txop character(1),
    messageblob bytea
);

and

CREATE TABLE jms_transactions (
    txid integer
);

PostgreSQL does hint that this is likely due to a buggy kernel, but
in that case I would have expected to see problems on some of our other
machines running this kernel on similar hardware as well. We don't have
any systematic way of reproducing the problem at this point, other than
leaving our database server running for a month and waiting for it to
fail, so I'm hoping that someone here can help me with some next steps
in debugging this.

We have multiple other PostgreSQL servers running in a similar setup
without causing any problems, but this server is probably the busiest of
our PostgreSQL servers.

I've tried writing a program to simulate a workload that resembles the
workload on the problematic tables, but I can't get that to fail. So
what should be my next step in debugging this?

Best regards

Jacob


