Thread: Next steps in debugging database storage problems?
Hi

We have a PostgreSQL 9.3.4 running in an LXC container on Debian Wheezy
on a Linux 3.10.43 kernel on a Dell R620 server. Data are stored on an
XFS file system. We are seeing problems such as:

unexpected data beyond EOF in block 2 of relation base/805208133/1238511128

and

could not read block 5 in file "base/805208348/1259338118": read only
0 of 8192 bytes

This seems to occur every few days after the server has been up for
30-40 days. If we reboot the server it'll be another 30-40 days before
we see any problems again.

The server had been running fine on a Dell R710 for a long time, and was
upgraded to a Dell R620 last year, when the problems started. We have
tried switching to a different Dell R620, but that did not make a
difference. We've seen this with kernels 3.2, 3.4 and 3.10.

The two tables that run into these problems are very simple, but rather
busy. They are defined like:

    CREATE TABLE jms_messages (
        messageid integer NOT NULL,
        destination text NOT NULL,
        txid integer,
        txop character(1),
        messageblob bytea
    );

and

    CREATE TABLE jms_transactions (
        txid integer
    );

PostgreSQL does complain that it's likely due to a buggy kernel, but then
I would have expected to see problems with some of our other machines
running this kernel on similar hardware as well.

We don't have any systematic way of reproducing the problem at this
point, except leaving our database server running for a month and seeing
it fail, so I'm hoping that someone here can help me with some next steps
in debugging this.

We have multiple other PostgreSQL servers running in a similar setup
without causing any problems, but this server is probably the busiest of
our PostgreSQL servers. I've tried writing a program to simulate a
workload that resembles the workload on the problematic tables, but I
can't get that to fail.

So what should be my next step in debugging this?

Best regards

Jacob
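[A side note for anyone chasing similar errors in the archives: the two
numbers in a path like base/805208133/1238511128 are the database OID and
the relation's filenode, so one rough first step (assuming the filenode
hasn't changed since the error, e.g. through a rewrite or TRUNCATE) is to
map it back to a table name from within the affected database:

    -- Connect to the database whose OID appears in the error message
    -- (here 805208133), then look up which relation owns the filenode.
    -- pg_relation_filenode() has been available since PostgreSQL 9.0.
    SELECT oid::regclass AS relation
      FROM pg_class
     WHERE pg_relation_filenode(oid) = 1238511128;

That at least confirms whether the errors keep hitting the same small,
busy tables.]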
Hi

Jacob Bunk Nielsen <jacob@bunk.cc> writes:

> We have a PostgreSQL 9.3.4 running in an LXC container on Debian
> Wheezy on a Linux 3.10.43 kernel on a Dell R620 server. Data are
> stored on an XFS file system. We are seeing problems such as:
>
> unexpected data beyond EOF in block 2 of relation base/805208133/1238511128
>
> and
>
> could not read block 5 in file "base/805208348/1259338118": read only
> 0 of 8192 bytes

We use streaming replication to a different server on different
hardware. That server had been up for 300+ days and just had an incident
of:

LOG: consistent recovery state reached at 226/E7DE1680
WARNING: page 0 of relation base/805208133/1274861078 does not exist
CONTEXT: xlog redo insert: rel 1663/805208133/1274861078; tid 0/1
PANIC: WAL contains references to invalid pages
LOG: database system is ready to accept read only connections
CONTEXT: xlog redo insert: rel 1663/805208133/1274861078; tid 0/1
LOG: startup process (PID 2308) was terminated by signal 6: Aborted
LOG: terminating any other active server processes

We've rebooted that server now and restarted the replication. We'll see
how it goes in a few hours. I'm still very interested in hearing any
hints you guys may have on how I should debug these problems.

> I've tried writing a program to simulate a workload that resembles the
> workload on the problematic tables, but I can't get that to fail. So
> what should be my next step in debugging this?

That program has been running for 24+ hours now, and everything just
works as expected, so still no luck in reproducing this problem.

Best regards

Jacob

P.S. Sorry about the double post with a different subject - my initial
post was held up for several hours due to putting "Help" in the subject,
so I thought it had been discarded by a list admin.
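[As a small aside, a minimal way to confirm that a restarted standby like
this one is streaming and replaying WAL again - assuming plain streaming
replication on 9.3, as described above - would be something along these
lines:

    -- On the primary: is the standby connected, and how far behind is it?
    SELECT application_name, state, sent_location, replay_location
      FROM pg_stat_replication;

    -- On the standby: still in recovery, and when did it last replay WAL?
    SELECT pg_is_in_recovery(), pg_last_xact_replay_timestamp();
]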
Hi

On the 1st of July 2014 Jacob Bunk Nielsen <jacob@bunk.cc> wrote:

> We have a PostgreSQL 9.3.4 running in an LXC container on Debian
> Wheezy on a Linux 3.10.43 kernel on a Dell R620 server. Data are
> stored on an XFS file system. We are seeing problems such as:
>
> unexpected data beyond EOF in block 2 of relation base/805208133/1238511128
>
> and
>
> could not read block 5 in file "base/805208348/1259338118": read only
> 0 of 8192 bytes
>
> This seems to occur every few days after the server has been up for
> 30-40 days. If we reboot the server it'll be another 30-40 days before
> we see any problems again.
>
> The server had been running fine on a Dell R710 for a long time, and was
> upgraded to a Dell R620 last year, when the problems started. We have
> tried switching to a different Dell R620, but that did not make a
> difference. We've seen this with kernels 3.2, 3.4 and 3.10.

This time it took 45 days before this happened:

LOG: unexpected EOF on standby connection
ERROR: unexpected data beyond EOF in block 140 of relation base/805208885/805209852
HINT: This has been seen to occur with buggy kernels; consider updating your system.

It always happens with small tables with lots of inserts and deletes.
From previous experience we know that it's going to happen again in a few
days, so we'll probably try to schedule a reboot to give us another 30-40
days.

Is anyone else seeing problems with PostgreSQL on XFS filesystems? Any
hints on how to debug what goes wrong here would still be greatly
appreciated.

> We have multiple other PostgreSQL servers running in a similar setup
> without causing any problems, but this server is probably the busiest of
> our PostgreSQL servers.

This is still the case.

Best regards

Jacob
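[For what it's worth, one sketch of a reproduction attempt for the "small
table with lots of inserts and deletes" pattern - purely an assumption
about what the real workload looks like, and using a throwaway copy of
the table so the production data isn't touched - could be a batch like
the following:

    -- Throwaway copy of the busy table.
    CREATE TABLE jms_messages_test (LIKE jms_messages);

    -- One batch of queue-like churn: insert rows, then delete most of
    -- them again, leaving a small table with lots of dead tuples.
    DO $$
    DECLARE
        i integer;
    BEGIN
        FOR i IN 1..10000 LOOP
            INSERT INTO jms_messages_test (messageid, destination, messageblob)
            VALUES (i, 'QUEUE.test', convert_to(repeat('x', 512), 'UTF8'));
        END LOOP;
        DELETE FROM jms_messages_test WHERE messageid % 100 <> 0;
    END $$;

Each DO block runs as a single transaction, so driving many of these from
several concurrent sessions (or via pgbench with a custom script) would
be closer to the real pattern than one long loop, and gives autovacuum a
chance to run between batches.]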
I can't offer a whole lot of detail at this point, but I experienced a pretty bad caching issue about 2 years ago using XFS.
We were migrating a 1TB+ Oracle database to EDB's Advanced Server 9.1 (close enough for this discussion). I normally use ext4, but decided to try XFS for this build-out.
This was a Red Hat 6.x system using a NetApp SAN for storage. We extensively leverage FlexClones for creating production "read-only" instances as well as our development and testing environments. We take snapshots of the running database storage and create FlexClones from them. The newly cloned database does a quick recovery on startup and away it goes. This has worked perfectly with ext4 for years.
The problem I experienced with XFS appeared when I started up the new clone for the first time. We would start getting various block read errors when accessing tables and indexes, and knew the database was totally unreliable at that point. It was super painful to troubleshoot: I could recreate the problem consistently, but it took a couple of days of loading data and some creative scripts to recreate it.
NetApp snapshots are consistent and reliable. It was clear that the data on disk did not match the data cached by the OS and/or XFS. We worked with Red Hat, but never arrived at a solution. I finally gave up and switched back to ext4, and the problem went away.
T
Hi

A final follow-up from my side to this post, for anyone who may find this
thread in the archives in the future.

On the 15th of August Jacob Bunk Nielsen <jacob@bunk.cc> wrote:

> On the 1st of July 2014 Jacob Bunk Nielsen <jacob@bunk.cc> wrote:
>
>> We have a PostgreSQL 9.3.4 running in an LXC container on Debian
>> Wheezy on a Linux 3.10.43 kernel on a Dell R620 server. Data are
>> stored on an XFS file system. We are seeing problems such as:
>>
>> unexpected data beyond EOF in block 2 of relation base/805208133/1238511128
>>
>> and
>>
>> could not read block 5 in file "base/805208348/1259338118": read only
>> 0 of 8192 bytes
>>
>> This seems to occur every few days after the server has been up for
>> 30-40 days. If we reboot the server it'll be another 30-40 days before
>> we see any problems again.

[...]

> This time it took 45 days before this happened:
>
> LOG: unexpected EOF on standby connection
> ERROR: unexpected data beyond EOF in block 140 of relation base/805208885/805209852
> HINT: This has been seen to occur with buggy kernels; consider updating your system.
>
> It always happens with small tables with lots of inserts and deletes.
> From previous experience we know that it's going to happen again in
> a few days, so we'll probably try to schedule a reboot to give us
> another 30-40 days.

We have concluded that it's probably a bug in the autovacuuming. Since we
changed how often we vacuum those busy tables we haven't seen any
problems for the past 2 months. We changed:

autovacuum_vacuum_threshold = 100000 (default: 50)

and

autovacuum_vacuum_scale_factor = 0 (default 0.2; 0 disables the scale
factor component)

The default settings caused autovacuum to run every minute, and
eventually we would hit some bug that caused the problems described
above.

My colleague, who has done most of the work finding this, has promised to
try to create a working test case and file a proper bug report.

Best regards

Jacob
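[In case it's useful to someone tuning this later: the same two settings
can also be applied only to the busy tables, as per-table storage
parameters, rather than changing the cluster-wide defaults - roughly like
this, using the table names defined earlier in the thread:

    ALTER TABLE jms_messages
        SET (autovacuum_vacuum_threshold = 100000,
             autovacuum_vacuum_scale_factor = 0);

    ALTER TABLE jms_transactions
        SET (autovacuum_vacuum_threshold = 100000,
             autovacuum_vacuum_scale_factor = 0);

And to see how often autovacuum has actually been hitting those tables:

    SELECT relname, last_autovacuum, autovacuum_count, n_dead_tup
      FROM pg_stat_user_tables
     WHERE relname IN ('jms_messages', 'jms_transactions');
]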