Re: emergency outage requiring database restart - Mailing list pgsql-hackers

From Merlin Moncure
Subject Re: emergency outage requiring database restart
Date
Msg-id CAHyXU0xZirdEMYz+PtU=dDTOsozpE8ak-RenD=--Fr8zdEWAuQ@mail.gmail.com
Whole thread Raw
In response to Re: emergency outage requiring database restart  (Merlin Moncure <mmoncure@gmail.com>)
Responses Re: emergency outage requiring database restart  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
On Mon, Oct 24, 2016 at 6:01 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Oct 20, 2016 at 1:52 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian <bruce@momjian.us> wrote:
>>>> On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:
>>>>> > Yeah.  Believe me -- I know the drill.  Most or all the damage seemed
>>>>> > to be to the system catalogs with at least two critical tables dropped
>>>>> > or inaccessible in some fashion.  A lot of the OIDs seemed to be
>>>>> > pointing at the wrong thing.  Couple more datapoints here.
>>>>> >
>>>>> > *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
>>>>> > *) Another database on the same cluster was not impacted.  However
>>>>> > it's more olap style and may not have been written to during the
>>>>> > outage
>>>>> >
>>>>> > Now, this infrastructure running this system is running maybe 100ish
>>>>> > postgres clusters and maybe 1000ish sql server instances with
>>>>> > approximately zero unexplained data corruption issues in the 5 years
>>>>> > I've been here.  Having said that, this definitely smells and feels
>>>>> > like something on the infrastructure side.  I'll follow up if I have
>>>>> > any useful info.
>>>>>
>>>>> After a thorough investigation I now have credible evidence the source
>>>>> of the damage did not originate from the database itself.
>>>>> Specifically, this database is mounted on the same volume as the
>>>>> operating system (I know, I know) and something non database driven
>>>>> sucked up disk space very rapidly and exhausted the volume -- fast
>>>>> enough that sar didn't pick it up.  Oh well :-) -- thanks for the help
>>>>
>>>> However, disk space exhaustion should not lead to corruption unless the
>>>> underlying layers lied in some way.
>>>
>>> I agree -- however I'm sufficiently separated from the things doing
>>> the things that I can't verify that in any real way.   In the meantime
>>> I'm going to take standard precautions (enable checksums/dedicated
>>> volume/replication).  Low disk space also does not explain the bizarre
>>> outage I had last friday.
>>
>> ok, data corruption struck again.  This time disk space is ruled out,
>> and access to the database is completely denied:
>> postgres=# \c castaging
>> WARNING:  leaking still-referenced relcache entry for
>> "pg_index_indexrelid_index"
>
> Corruption struck again.
> This time got another case of view busted -- attempting to create
> gives missing 'type' error.

Call it a hunch -- I think the problem is in pl/sh.

merlin



pgsql-hackers by date:

Previous
From: Merlin Moncure
Date:
Subject: Re: emergency outage requiring database restart
Next
From: Michael Paquier
Date:
Subject: Re: [COMMITTERS] pgsql: Remove extra comma at end of enum list