Re: emergency outage requiring database restart - Mailing list pgsql-hackers

From Merlin Moncure
Subject Re: emergency outage requiring database restart
Date
Msg-id CAHyXU0xT6tJhqouu1AAmPvc40jdvK63-OD9_QJdiRR6tbDwVtQ@mail.gmail.com
In response to Re: emergency outage requiring database restart  (Ants Aasma <ants.aasma@eesti.ee>)
Responses Re: emergency outage requiring database restart  (Greg Stark <stark@mit.edu>)
Re: emergency outage requiring database restart  (Oskari Saarenmaa <os@ohmu.fi>)
List pgsql-hackers
On Thu, Oct 27, 2016 at 2:31 AM, Ants Aasma <ants.aasma@eesti.ee> wrote:
> On Wed, Oct 26, 2016 at 8:43 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> /var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep "page
>> verification"
>> 2016-10-26 11:26:42 CDT [postgres@castaging]: WARNING:  page
>> verification failed, calculated checksum 37251 but expected 37244
>> 2016-10-26 11:27:55 CDT [postgres@castaging]: WARNING:  page
>> verification failed, calculated checksum 37249 but expected 37244
>> 2016-10-26 12:16:44 CDT [postgres@castaging]: WARNING:  page
>> verification failed, calculated checksum 44363 but expected 44364
>> 2016-10-26 12:18:58 CDT [postgres@castaging]: WARNING:  page
>> verification failed, calculated checksum 49525 but expected 49539
>> 2016-10-26 12:19:12 CDT [postgres@castaging]: WARNING:  page
>> verification failed, calculated checksum 37345 but expected 37340
>
> The checksum values are improbably close. The checksum algorithm has
> decently good mixing of all bits in the page. Having the first byte
> match in 5 checksums makes this 1:2^40 improbable. What is not mixed
> in properly is the block number, it only gets xor'ed before packing
> the value into 16bits using modulo 0xFFFF. So I'm pretty sure
> different block numbers were used for writing out and reading in the
> page. Either the blocknum gets corrupted between calculating the
> checksum and writing the page out (unlikely given the proximity), or
> the pages are somehow getting transposed in the storage.
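
If I'm reading checksum_impl.h right, that final fold is roughly

    pd_checksum = (checksum32 ^ blkno) % 65535 + 1

(paraphrasing from memory; checksum32 here stands for the internal
32-bit checksum computed over the page), which matches your
description: the block number is the only input that isn't mixed in
well.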

I think we can rule out faulty storage, and I'm reasonably sure
nothing is writing to the database cluster except postgres itself.

So far I have not executed an emergency dump/restore on this database,
nor have I seen any other log entries suggesting problems of any
kind.  In other words, from what I can tell the database is completely
healthy notwithstanding the reported errors.

This suggests (but does not prove) that the syscatalog-damaging
activity happened during execution of the attached function and healed
itself during the rollback.  Prior to turning checksums on, the
transaction did *not* roll back, so the damage was allowed to stand and
probably accumulated.

As noted earlier, I was not able to reproduce the issue with
crashme.sh, which was:

NUM_FORKS=16
do_parallel psql -p 5432  -c"select PushMarketSample('1740')" castaging_test
do_parallel psql -p 5432  -c"select PushMarketSample('4400')" castaging_test
do_parallel psql -p 5432  -c"select PushMarketSample('2160')" castaging_test
do_parallel psql -p 5432  -c"select PushMarketSample('6680')" castaging_test
<snip>

(do_parallel is a simple wrapper that runs the given command in
parallel, up to NUM_FORKS instances at a time; a rough sketch is at the
end of this mail.)  This is on the same server and cluster as above.
This kind of suggests that either
A) there is some concurrent activity from another process that is
tripping the issue, or
B) there is something particular to the session invoking the function
that is participating in the problem.  As the application is
structured, the query is issued by a single-threaded node.js app that
is high-traffic and long-lived.  It's still running, in fact, and I'm
kind of tempted to find some downtime to see if I can still reproduce
via the UI.
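
For reference, do_parallel is roughly along these lines (reconstructed
from memory, so the details may differ):

do_parallel ()
{
    # throttle: don't start another job while NUM_FORKS are already running
    while [ "$(jobs -rp | wc -l)" -ge "$NUM_FORKS" ]; do
        sleep 1
    done
    # run the given command in the background
    "$@" &
}

plus a wait at the end of the script so it doesn't exit while jobs are
still running.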

merlin


