On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian <bruce@momjian.us> wrote:
>> On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:
>>> > Yeah. Believe me -- I know the drill. Most or all the damage seemed
>>> > to be to the system catalogs with at least two critical tables dropped
>>> > or inaccessible in some fashion. A lot of the OIDs seemed to be
>>> > pointing at the wrong thing. Couple more datapoints here.
>>> >
>>> > *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
>>> > *) Another database on the same cluster was not impacted. However
>>> > it's more olap style and may not have been written to during the
>>> > outage
>>> >
>>> > Now, this infrastructure running this system is running maybe 100ish
>>> > postgres clusters and maybe 1000ish sql server instances with
>>> > approximately zero unexplained data corruption issues in the 5 years
>>> > I've been here. Having said that, this definitely smells and feels
>>> > like something on the infrastructure side. I'll follow up if I have
>>> > any useful info.
>>>
>>> After a thorough investigation I now have credible evidence the source
>>> of the damage did not originate from the database itself.
>>> Specifically, this database is mounted on the same volume as the
>>> operating system (I know, I know) and something non database driven
>>> sucked up disk space very rapidly and exhausted the volume -- fast
>>> enough that sar didn't pick it up. Oh well :-) -- thanks for the help
>>
>> However, disk space exhaustion should not lead to corruption unless the
>> underlying layers lied in some way.
>
> I agree -- however I'm sufficiently separated from the things doing
> the things that I can't verify that in any real way. In the meantime
> I'm going to take standard precautions (enable checksums/dedicated
> volume/replication). Low disk space also does not explain the bizarre
> outage I had last friday.
ok, data corruption struck again. This time disk space is ruled out,
and access to the database is completely denied:
postgres=# \c castaging
WARNING: leaking still-referenced relcache entry for
"pg_index_indexrelid_index"
merlin