Re: Next steps in debugging database storage problems? - Mailing list pgsql-general

From Jacob Bunk Nielsen
Subject Re: Next steps in debugging database storage problems?
Msg-id spamdrop+87lhmed1v3.fsf@atom.bunk.cc
In response to Re: Next steps in debugging database storage problems?  (Jacob Bunk Nielsen <jacob@bunk.cc>)
List pgsql-general
Hi

A final follow-up from my side, for anyone who may find this thread in
the archives in the future.

On the 15th of August Jacob Bunk Nielsen <jacob@bunk.cc> wrote:
> On the 1st of July 2014 Jacob Bunk Nielsen <jacob@bunk.cc> wrote:
>
>> We have a PostgreSQL 9.3.4 running in an LXC container on Debian
>> Wheezy on a Linux 3.10.43 kernel on a Dell R620 server. Data are
>> stored on a XFS file system. We are seeing problems such as:
>>
>> unexpected data beyond EOF in block 2 of relation base/805208133/1238511128
>>
>> and
>>
>> could not read block 5 in file "base/805208348/1259338118": read only
>> 0 of 8192 bytes
>>
>> This seems to occur every few days after the server has been up for
>> 30-40 days. If we reboot the server it'll be another 30-40 days before
>> we see any problems again. [...]
>
> This time it took 45 days before this happened:
>
> LOG:  unexpected EOF on standby connection
> ERROR:  unexpected data beyond EOF in block 140 of relation base/805208885/805209852
> HINT:  This has been seen to occur with buggy kernels; consider updating your system.
>
> It always happens with small tables with lots of inserts and deletes.
> From previous experience we know that it's now going to happen again in
> a few days, so we'll probably try to schedule a reboot to give us
> another 30-40 days.

We have concluded that it's probably a bug in autovacuum. Since we
changed how often those busy tables are vacuumed, we haven't seen any
problems for the past 2 months.

We changed:

autovacuum_vacuum_threshold = 100000 (default: 50)

and

autovacuum_vacuum_scale_factor = 0 (default: 0.2; 0 removes the table-size-dependent part of the threshold)

The default settings caused autovacuum to run every minute, and
eventually we would hit some bug that caused the problems described
above.
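
For readers of the archive, here is a rough sketch of why the two settings
matter for small, busy tables. It relies on the rule from the PostgreSQL
documentation that a table is vacuumed once its dead rows exceed
autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples.
The function name and the 1000-row table size are hypothetical, chosen
only for illustration; they are not taken from our system.

```python
def vacuum_trigger_point(reltuples, threshold=50, scale_factor=0.2):
    """Dead-row count above which autovacuum vacuums a table.

    Uses the documented rule: threshold + scale_factor * reltuples.
    The defaults mirror the PostgreSQL defaults (50 and 0.2).
    """
    return threshold + scale_factor * reltuples


# Hypothetical small, busy table with about 1000 rows (an assumption).
rows = 1000

# Trigger point with the default settings.
default_point = vacuum_trigger_point(rows)

# Trigger point with the settings described in this mail.
tuned_point = vacuum_trigger_point(rows, threshold=100000, scale_factor=0)

print("defaults:", default_point)
print("tuned:   ", tuned_point)
```

With the defaults such a table qualifies for a vacuum after about 250 dead
rows, which a table with many inserts and deletes reaches very quickly.
With the changed settings it takes 100000 dead rows regardless of table
size.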

My colleague, who has done most of the work finding this, has promised to
try to create a working test case and file a proper bug report.

Best regards

Jacob


