Thread: emergency outage requiring database restart
Today I had an emergency production outage on a server. This particular server was running 9.5.2. The symptoms were interesting so I thought I'd report. Here is what I saw:

*) User CPU was pegged at 100%
*) Queries reading data would block and not respond to cancel or terminate
*) pg_stat_activity reported no waiting queries (but worked fine otherwise)

Adding all this up, it smells like processes were getting stuck on a spinlock. Connections quickly got eaten up and the situation was desperately urgent, so I punted and did an immediate restart, and things came back normally. I had a console to the database and did manage to grab the contents of pg_stat_activity; several trivial queries were running normally (according to pg_stat_activity) but were otherwise stuck. Attempting to run one of them myself, I noted the query got stuck and did not cancel. I was in a terrible rush, but I am casting around for things to gather in case this happens again -- 'perf top' would be a natural choice, I guess. Three autovacuum processes were running. Obviously I am going to do a bugfix upgrade, but I was wondering if anybody has seen anything like this. This particular server was upgraded to 9.5 somewhat recently but ran on 9.2 for years with no issues.

merlin
Merlin Moncure <mmoncure@gmail.com> writes: > Today I had an emergency production outage on a server. > ... > Adding all this up it smells like processes were getting stuck on a spinlock. Maybe. If it happens again, probably the most useful debug data would be stack traces from some of the busy processes. regards, tom lane
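[A minimal sketch of that kind of collection, assuming gdb and perf are installed and you have root on the box; <pid> is a placeholder to fill in from pg_stat_activity or top:]

    # Identify the pegged backends, then take one-shot backtraces.
    top -b -n 1 | grep postgres          # find busy backend PIDs
    gdb -p <pid> --batch -ex 'bt'        # grab a stack trace, then detach
    perf top -p <pid>                    # live CPU profile of one backend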
On Thu, Oct 13, 2016 at 4:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> Today I had an emergency production outage on a server.
>> ...
>> Adding all this up it smells like processes were getting stuck on a spinlock.
>
> Maybe.  If it happens again, probably the most useful debug data would
> be stack traces from some of the busy processes.

Another odd datapoint on this server. Things were running pretty well, but an application crashed on a missing view. Trying to recreate the view, I got:

CREATE OR REPLACE VIEW vw_ApartmentQueueLastGood AS
  SELECT ...

ERROR:  type "vw_apartmentqueuelastgood" already exists
HINT:  A relation has an associated type of the same name, so you must use a name that doesn't conflict with any existing type.

...which was pretty strange. I had to manually delete the pg_type record in order to create the view. I'm getting more reports of 'could not open relation with oid=X' errors, so I could be facing data corruption :(.

merlin
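[For anyone hitting the same symptom, a hedged sketch of how such an orphaned row might be located before resorting to a manual DELETE; the type name follows the report above, and running DML against a corrupt catalog is at your own risk:]

    -- Look for a composite type whose backing relation is missing or
    -- isn't actually a view/table -- the kind of leftover that blocks
    -- CREATE VIEW with "type ... already exists".
    SELECT t.oid, t.typname, t.typrelid
    FROM pg_type t
    LEFT JOIN pg_class c ON c.oid = t.typrelid
    WHERE t.typname = 'vw_apartmentqueuelastgood'
      AND (c.oid IS NULL OR c.relkind NOT IN ('v', 'r'));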
On Mon, Oct 17, 2016 at 1:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Oct 13, 2016 at 4:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Maybe.  If it happens again, probably the most useful debug data would
>> be stack traces from some of the busy processes.
>
> Another odd datapoint on this server. [...] I had to manually delete
> the pg_type record in order to create the view. I'm getting more
> reports of 'could not open relation with oid=X' errors, so I could be
> facing data corruption :(.

castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
castaging-#   SELECT ...
ERROR:  42809: "pg_cast_oid_index" is an index
LINE 11:  FROM ApartmentSample s
          ^
LOCATION:  heap_openrv_extended, heapam.c:1304

should I be restoring from backups?

merlin
Merlin Moncure wrote:

> castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
> castaging-#   SELECT ...
> ERROR:  42809: "pg_cast_oid_index" is an index
> LINE 11:  FROM ApartmentSample s
>           ^
> LOCATION:  heap_openrv_extended, heapam.c:1304
>
> should I be restoring from backups?

It's pretty clear to me that you've got catalog corruption here. You can try to fix things manually as they emerge, but that sounds like a fool's errand.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Merlin Moncure wrote:
>> should I be restoring from backups?
>
> It's pretty clear to me that you've got catalog corruption here.  You
> can try to fix things manually as they emerge, but that sounds like a
> fool's errand.

agreed. Current plan is to restore from backups and recover as much data as I can. Also applying the bugfix release, and going to enable checksums. We had several good backups since the previous outage, so it's not clear the events are related, but after months of smooth operation I find the coincidence highly suspicious. As always, we need to suspect hardware problems, but I'm highly abstracted from them -- using esx + san.

merlin
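[For reference: on 9.5, checksums can only be turned on at initdb time, so "enable checksums" implies standing up a fresh cluster and reloading into it -- roughly the following sketch; ports and paths are illustrative:]

    # New cluster with checksums (-k / --data-checksums), then dump/restore.
    initdb -k -D /var/lib/pgsql/9.5/data_checksum
    pg_ctl -D /var/lib/pgsql/9.5/data_checksum -o '-p 5433' start
    pg_dumpall -p 5432 | psql -p 5433 -d postgres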
Merlin Moncure wrote: > We had several good backups since the previous outage so it's not > clear the events are related but after months of smooth operation I > find that coincidence highly suspicious. As always, we need to suspect > hardware problems but I'm highly abstracted from them -- using esx + > san. Ah, so you're subject not only to hardware flaws but also to virtualization layer bugs :-) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Oct 18, 2016 at 4:21 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Merlin Moncure wrote:
>> As always, we need to suspect hardware problems but I'm highly
>> abstracted from them -- using esx + san.
>
> Ah, so you're subject not only to hardware flaws but also to
> virtualization layer bugs :-)

Wait a couple more years, and we'll get more complaints about Postgres running in containers running in VMs. Even more fun waiting ahead.
-- 
Michael
On 18/10/16 14:12, Michael Paquier wrote: > On Tue, Oct 18, 2016 at 4:21 AM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> Merlin Moncure wrote: >> >>> We had several good backups since the previous outage so it's not >>> clear the events are related but after months of smooth operation I >>> find that coincidence highly suspicious. As always, we need to suspect >>> hardware problems but I'm highly abstracted from them -- using esx + >>> san. >> Ah, so you're subject not only to hardware flaws but also to >> virtualization layer bugs :-) > Wait a couple of more years, and we'll get more complains about > Postgres running in containers running in VMs. Even more fun waiting > ahead. that started life on different hardware with a different O/S
On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> It's pretty clear to me that you've got catalog corruption here.  You
> can try to fix things manually as they emerge, but that sounds like a
> fool's errand.

Yeah. Believe me -- I know the drill. Most or all of the damage seemed to be to the system catalogs, with at least two critical tables dropped or inaccessible in some fashion. A lot of the OIDs seemed to be pointing at the wrong thing. Couple more datapoints here:

*) This database is OLTP, doing ~20 tps avg (but very bursty)
*) Another database on the same cluster was not impacted. However, it's more olap style and may not have been written to during the outage

Now, the infrastructure running this system hosts maybe 100ish postgres clusters and maybe 1000ish sql server instances, with approximately zero unexplained data corruption issues in the 5 years I've been here. Having said that, this definitely smells and feels like something on the infrastructure side. I'll follow up if I have any useful info.

merlin
On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Now, the infrastructure running this system hosts maybe 100ish
> postgres clusters and maybe 1000ish sql server instances, with
> approximately zero unexplained data corruption issues in the 5 years
> I've been here. Having said that, this definitely smells and feels
> like something on the infrastructure side. I'll follow up if I have
> any useful info.

After a thorough investigation I now have credible evidence that the source of the damage did not originate from the database itself. Specifically, this database is mounted on the same volume as the operating system (I know, I know) and something non-database-driven sucked up disk space very rapidly and exhausted the volume -- fast enough that sar didn't pick it up. Oh well :-) -- thanks for the help

merlin
On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:
> After a thorough investigation I now have credible evidence that the
> source of the damage did not originate from the database itself.
> Specifically, this database is mounted on the same volume as the
> operating system (I know, I know) and something non-database-driven
> sucked up disk space very rapidly and exhausted the volume -- fast
> enough that sar didn't pick it up. Oh well :-) -- thanks for the help

However, disk space exhaustion should not lead to corruption unless the underlying layers lied in some way.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian <bruce@momjian.us> wrote:
> However, disk space exhaustion should not lead to corruption unless the
> underlying layers lied in some way.

I agree -- however I'm sufficiently separated from the things doing the things that I can't verify that in any real way. In the meantime I'm going to take standard precautions (enable checksums / dedicated volume / replication). Low disk space also does not explain the bizarre outage I had last Friday.

merlin
On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> I agree -- however I'm sufficiently separated from the things doing
> the things that I can't verify that in any real way. In the meantime
> I'm going to take standard precautions (enable checksums / dedicated
> volume / replication). Low disk space also does not explain the
> bizarre outage I had last Friday.

ok, data corruption struck again. This time disk space is ruled out, and access to the database is completely denied:

postgres=# \c castaging
WARNING:  leaking still-referenced relcache entry for "pg_index_indexrelid_index"

merlin
On Thu, Oct 20, 2016 at 1:52 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> ok, data corruption struck again. This time disk space is ruled out,
> and access to the database is completely denied:
>
> postgres=# \c castaging
> WARNING:  leaking still-referenced relcache entry for
> "pg_index_indexrelid_index"

single user mode dumps core :(

bash-4.1$ postgres --single -D /var/lib/pgsql/9.5/data castaging
LOG:  00000: could not change directory to "/root": Permission denied
LOCATION:  resolve_symlinks, exec.c:293
Segmentation fault (core dumped)

Core was generated by `postgres --single -D /var/lib/pgsql/9.5/data castaging'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000797d6f in ?? ()
Missing separate debuginfos, use: debuginfo-install postgresql95-server-9.5.2-1PGDG.rhel6.x86_64
(gdb) bt
#0  0x0000000000797d6f in ?? ()
#1  0x000000000079acf1 in RelationCacheInitializePhase3 ()
#2  0x00000000007b35c5 in InitPostgres ()
#3  0x00000000006b9b53 in PostgresMain ()
#4  0x00000000005f30fb in main ()
(gdb)

merlin
Merlin Moncure <mmoncure@gmail.com> writes: > single user mode dumps core :( You've got a mess there :-( > Missing separate debuginfos, use: debuginfo-install > postgresql95-server-9.5.2-1PGDG.rhel6.x86_64 This backtrace would likely be much more informative if you did the above. regards, tom lane
On Thu, Oct 20, 2016 at 2:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Merlin Moncure <mmoncure@gmail.com> writes: >> single user mode dumps core :( > > You've got a mess there :-( > >> Missing separate debuginfos, use: debuginfo-install >> postgresql95-server-9.5.2-1PGDG.rhel6.x86_64 > > This backtrace would likely be much more informative if you did the above. can't; don't have the package unfortunately. merlin
Merlin Moncure wrote:

> single user mode dumps core :(
>
> (gdb) bt
> #0  0x0000000000797d6f in ?? ()
> #1  0x000000000079acf1 in RelationCacheInitializePhase3 ()
> #2  0x00000000007b35c5 in InitPostgres ()
> #3  0x00000000006b9b53 in PostgresMain ()
> #4  0x00000000005f30fb in main ()

Maybe

  rm global/pg_internal.init

and try again?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
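[A sketch of the fuller version of that cleanup, run from the data directory; both files are rebuilt automatically at the next startup, and <db_oid> is a placeholder for the damaged database's OID:]

    # Relcache init files: one for shared catalogs, one per database.
    rm global/pg_internal.init
    rm base/<db_oid>/pg_internal.init    # <db_oid> is a placeholder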
On Thu, Oct 20, 2016 at 3:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Maybe
>
>   rm global/pg_internal.init
>
> and try again?

Will do when I can. Had to do an emergency restore plus some unfun data reconstruction from the query log. Notably, there is a much larger database in the same cluster which is undamaged. This server is new to production usage, maybe 2 months. Here are the contents of pg_extension:

plpgsql
dblink
hstore
postgres_fdw
plsh              * not used
pg_trgm           * not used
plr               * not used
tablefunc         * not used
adminpack         * not used
plpythonu         * not used
postgis           * not used
postgis_topology  * not used

Short term plan is to separate the database to its own cluster, and install replication and checksums. All queries to this database are logged. Here are the contents of the log leading into and after the crash:

oct 17 crash:
2016-10-17 12:12:24 CDT [rms@castaging]: DETAIL:  parameters: $1 = '21121', $2 = '8', $3 = '2016-10-13', $4 = NULL, $5 = NULL, $6 = NULL, $7 = NULL, $8 = NULL, $9 = NULL, $10 = NULL, $11 = 't', $12
2016-10-17 12:12:24 CDT [rms@castaging]: LOG:  execute <unnamed>: SELECT NULL AS PROCEDURE_CAT, n.nspname AS PROCEDURE_SCHEM, p.proname AS PROCEDURE_NAME, NULL, NULL, NULL, d.description AS REMARKS
2016-10-17 12:12:24 CDT [rms@castaging]: LOG:  execute <unnamed>: SELECT n.nspname,p.proname,p.prorettype,p.proargtypes, t.typtype,t.typrelid , p.proargnames, p.proargmodes, p.proallargtypes , p.o
2016-10-17 12:12:24 CDT [rms@castaging]: LOG:  execute <unnamed>: select * from checkin($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12) as result
2016-10-17 12:12:24 CDT [rms@castaging]: DETAIL:  parameters: $1 = '114333', $2 = 'rrosillo', $3 = 'CALLER', $4 = 'Survey', $5 = 'Happy', $6 = 'Callback', $7 = 'OTHER', $8 = '2016-10-18 01:05:00',
2016-10-17 12:12:24 CDT [rms@castaging]: LOG:  execute S_3: COMMIT
2016-10-17 12:12:25 CDT [@]: ERROR:  could not open relation with OID 1203933   <-- first sign of damage
2016-10-17 12:12:25 CDT [@]: CONTEXT:  automatic analyze of table "castaging.public.apartment"

oct 20 crash:
2016-10-20 12:46:38 CDT [postgres@castaging]: LOG:  statement: SELECT CallsByUser() AS byuser
2016-10-20 12:46:40 CDT [postgres@castaging]: LOG:  statement: SELECT CallCenterOverviewJSON() AS overview
2016-10-20 12:46:41 CDT [postgres@castaging]: LOG:  statement: SELECT CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:41 CDT [postgres@castaging]: LOG:  statement: SELECT MarketOverviewJSON() AS market
2016-10-20 12:46:42 CDT [postgres@castaging]: LOG:  execute <unnamed>: SELECT SubMarketOverviewJSON($1::TEXT) AS submkt
2016-10-20 12:46:42 CDT [postgres@castaging]: DETAIL:  parameters: $1 = '640'
2016-10-20 12:46:44 CDT [postgres@castaging]: LOG:  statement: SELECT CallsByUser() AS byuser
2016-10-20 12:46:46 CDT [postgres@castaging]: LOG:  statement: SELECT CallCenterOverviewJSON() AS overview
2016-10-20 12:46:47 CDT [postgres@castaging]: LOG:  statement: SELECT CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:47 CDT [postgres@castaging]: ERROR:  "pg_description_o_c_o_index" is an index   <-- first sign of damage
2016-10-20 12:46:47 CDT [postgres@castaging]: CONTEXT:  SQL function "callcenterusertrackingjson" during startup
2016-10-20 12:46:47 CDT [postgres@castaging]: STATEMENT:  SELECT CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:47 CDT [postgres@castaging]: WARNING:  leaking still-referenced relcache entry for "pg_class_oid_index"

CallCenterUserTrackingJSON() and friends are not particularly interesting except that they are making use of json_agg(). They were also called basically all day long at 5 second intervals. I guess this isn't saying very much, but I'm starting to smell a rat here.

merlin
On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > Most or all the damage seemed to be to the system catalogs with > at least two critical tables dropped or inaccessible in some > fashion. A lot of the OIDs seemed to be pointing at the wrong > thing. While the oid in pg_class often matches the filename, that is not true after some operations (like CLUSTER or VACUUM FULL). It is the relfilenode column that is the definitive link to the file. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
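[A small illustration of that distinction, safe to run on any database:]

    -- oid and relfilenode start out equal but diverge after rewrites;
    -- pg_relation_filepath() follows the actual on-disk file.
    SELECT oid, relfilenode, pg_relation_filepath(oid)
    FROM pg_class
    WHERE relname = 'pg_class';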
On Fri, Oct 21, 2016 at 8:03 AM, Kevin Grittner <kgrittn@gmail.com> wrote:
> While the oid in pg_class often matches the filename, that is not
> true after some operations (like CLUSTER or VACUUM FULL).  It is
> the relfilenode column that is the definitive link to the file.

no such operations happened. In the first instance at least one table was dropped from the system catalogs. I have a hunch that the heap is fine (supported by the size of the database on disk). At this precise moment I'm restoring the database to another fileserver in order to do some forensic analysis, also in the hopes of getting the second database online in order to expedite recovery.

ah -- done. :-) Deleting the init file didn't help, but starting up single user allowed the startup to gracefully fail with a FATAL cache lookup.

merlin
On Fri, Oct 21, 2016 at 1:37 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> no such operations happened. In the first instance at least one table
> was dropped from the system catalogs. I have a hunch that the heap is
> fine (supported by the size of the database on disk).

OK, I have some good (very- in the specific case of yours truly) news to report. Doing a filesystem-level copy to a test server, I was able to relfilenode-swap one of the critical tables over the place of the relfilenode of the stored backup. Not being able to know which file to copy from, I figured out the source node by judging the size and using the 'strings' utility. Data recovery for that table at least appears to be 100%.

For those following along, this simple process is only likely to work easily if the table contains only system types: no user types, enums, composites, etc., since those have a unique ID for each data restore.

merlin
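[A sketch of that identification step; the directory OID and relfilenode are placeholders, and this belongs on a copy of the data directory, never the live one:]

    # Candidate files in the damaged database's directory, largest first.
    cd /copy/of/data/base/<db_oid>          # <db_oid> is a placeholder
    ls -lS | head

    # Look for recognizable column values inside a candidate segment.
    strings <relfilenode> | grep -i 'some known value'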
Merlin Moncure wrote: > OK, I have some good (very- in the specific case of yours truly) news > to report. Doing a filesystem level copy to a test server I was able > to relfilenode swap one of the critical tables over the place of the > refilenode of the stored backup. Not being able know the file to copy > from, I figured out the source node by judging the size and using > 'strings' utility. Data recovery for that table at least appears to > be 100%. FWIW you can use pg_filedump and match based on the number of columns. I suppose you could also use the pageinspect extension, by 'dd'ing a page from the file into the database and feeding into heap_page_items as bytea. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
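[A sketch of the pageinspect variant, pulling block 0 straight off disk with pg_read_binary_file() instead of dd; superuser only, path relative to the data directory, and 8192 assumes the default block size:]

    CREATE EXTENSION IF NOT EXISTS pageinspect;

    -- Examine raw tuple headers in block 0 of a suspect file without
    -- relying on the (possibly damaged) catalog entry for it.
    SELECT lp, t_xmin, t_xmax, t_ctid
    FROM heap_page_items(pg_read_binary_file('base/203883/1259', 0, 8192));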
On 10/21/16 2:02 PM, Alvaro Herrera wrote: > Merlin Moncure wrote: > >> OK, I have some good (very- in the specific case of yours truly) news >> to report. Doing a filesystem level copy to a test server I was able >> to relfilenode swap one of the critical tables over the place of the >> refilenode of the stored backup. Not being able know the file to copy >> from, I figured out the source node by judging the size and using >> 'strings' utility. Data recovery for that table at least appears to >> be 100%. > > FWIW you can use pg_filedump and match based on the number of columns. > I suppose you could also use the pageinspect extension, by 'dd'ing a > page from the file into the database and feeding into heap_page_items as > bytea. It occurs to me that it might be worth embedding the relation name in the free space of the first block. Most people would never notice the missing 64 bytes, but it would be incredibly helpful in cases like this... -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
Jim Nasby wrote: > On 10/21/16 2:02 PM, Alvaro Herrera wrote: > > Merlin Moncure wrote: > > > > > OK, I have some good (very- in the specific case of yours truly) news > > > to report. Doing a filesystem level copy to a test server I was able > > > to relfilenode swap one of the critical tables over the place of the > > > refilenode of the stored backup. Not being able know the file to copy > > > from, I figured out the source node by judging the size and using > > > 'strings' utility. Data recovery for that table at least appears to > > > be 100%. > > > > FWIW you can use pg_filedump and match based on the number of columns. > > I suppose you could also use the pageinspect extension, by 'dd'ing a > > page from the file into the database and feeding into heap_page_items as > > bytea. > > It occurs to me that it might be worth embedding the relation name in the > free space of the first block. Most people would never notice the missing 64 > bytes, but it would be incredibly helpful in cases like this... Agreed. The problem is how to install it without breaking pg_upgrade. Note that for DR purposes it is better to put that data in the first block of *each segment*. Otherwise if you have many >1GB tables, you have to go segment by segment ... -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Jim Nasby wrote: >> It occurs to me that it might be worth embedding the relation name in the >> free space of the first block. Most people would never notice the missing 64 >> bytes, but it would be incredibly helpful in cases like this... > Agreed. The problem is how to install it without breaking pg_upgrade. Well, that's the first problem. The second problem is how to cope with RENAME TABLE. regards, tom lane
On 10/21/16 7:43 PM, Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> Jim Nasby wrote: >>> It occurs to me that it might be worth embedding the relation name in the >>> free space of the first block. Most people would never notice the missing 64 >>> bytes, but it would be incredibly helpful in cases like this... > >> Agreed. The problem is how to install it without breaking pg_upgrade. It can't look up relation names? > Well, that's the first problem. The second problem is how to cope with > RENAME TABLE. If the name was only encoded in the first block of the relation I'm not sure how bad this would be; are there any facilities to change the name back on a rollback? -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
Jim Nasby <Jim.Nasby@BlueTreble.com> writes: > On 10/21/16 7:43 PM, Tom Lane wrote: >> Alvaro Herrera <alvherre@2ndquadrant.com> writes: >>> Agreed. The problem is how to install it without breaking pg_upgrade. > It can't look up relation names? It can't shove 64 bytes into a page that has < 64 bytes free space. >> Well, that's the first problem. The second problem is how to cope with >> RENAME TABLE. > If the name was only encoded in the first block of the relation I'm not > sure how bad this would be; are there any facilities to change the name > back on a rollback? No. How would that work in crash situations? (And keep in mind that the argument for this hinges entirely on its being 100% trustworthy after a crash.) regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > Jim Nasby wrote: > >> It occurs to me that it might be worth embedding the relation name in the > >> free space of the first block. Most people would never notice the missing 64 > >> bytes, but it would be incredibly helpful in cases like this... > > > Agreed. The problem is how to install it without breaking pg_upgrade. > > Well, that's the first problem. The second problem is how to cope with > RENAME TABLE. Uh, sorry. My proposal a couple of years back was to put the relfilenode, not the name. I didn't notice that it was the name being proposed here. However, now I notice that this idea doesn't solve the problem for mapped relations. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Uh, sorry. My proposal a couple of years back was to put the > relfilenode, not the name. I didn't notice that it was the name being > proposed here. However, now I notice that this idea doesn't solve the > problem for mapped relations. Well, as long as a catalog lookup would be required anyway, what about putting in the table OID? regards, tom lane
On October 22, 2016 11:59:15 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Uh, sorry.  My proposal a couple of years back was to put the
>> relfilenode, not the name.  I didn't notice that it was the name being
>> proposed here.  However, now I notice that this idea doesn't solve the
>> problem for mapped relations.
>
> Well, as long as a catalog lookup would be required anyway, what about
> putting in the table OID?

How about storing two table names? The old one and what the relation is being renamed to? We wouldn't be able to tell after a crash which is which, but that'll usually still be helpful. Every rename would clear out the old/wrong one and add the target filename.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, Oct 20, 2016 at 1:52 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> ok, data corruption struck again. This time disk space is ruled out,
> and access to the database is completely denied:
>
> postgres=# \c castaging
> WARNING:  leaking still-referenced relcache entry for
> "pg_index_indexrelid_index"

Corruption struck again. This time got another case of view busted -- attempting to create gives missing 'type' error.

merlin
On Mon, Oct 24, 2016 at 6:01 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Corruption struck again.
> This time got another case of view busted -- attempting to create
> gives missing 'type' error.

Call it a hunch -- I think the problem is in pl/sh.

merlin
Merlin Moncure wrote: > On Mon, Oct 24, 2016 at 6:01 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > > Corruption struck again. > > This time got another case of view busted -- attempting to create > > gives missing 'type' error. > > Call it a hunch -- I think the problem is in pl/sh. I've heard that before. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Oct 24, 2016 at 9:18 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Merlin Moncure wrote:
>> Call it a hunch -- I think the problem is in pl/sh.
>
> I've heard that before.

well, yeah, previously I had an issue where the database crashed during a heavy concurrent pl/sh-based load. However, the problems went away when I refactored the code. Anyway, I looked at the code and couldn't see anything obviously wrong, so who knows? All I know is my production database is exploding continuously and I'm looking for answers. The only other extension in heavy use on this server is postgres_fdw. The other database on the cluster is fine, which kind of suggests we are not facing clog- or WAL-type problems.

After last night, I rebuilt the cluster, turning on checksums, turning on synchronous commit (it was off) and adding a standby replica. This should help narrow the problem down should it re-occur; if storage is bad (note, the other database on the same machine is doing 10x the write activity and is fine) or something is scribbling on shared memory (my guess here) then checksums should be popped, right?

merlin
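[For reference, the 9.5-era wiring for that setup looks roughly like this -- a sketch; the connection string values are illustrative placeholders:]

    # postgresql.conf on the primary (9.5 parameter names)
    wal_level = hot_standby
    max_wal_senders = 3
    synchronous_commit = on

    # recovery.conf on the replica (these moved into postgresql.conf in v12+)
    standby_mode = 'on'
    primary_conninfo = 'host=<primary> user=<replication_user>'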
Merlin Moncure wrote: > After last night, I rebuilt the cluster, turning on checksums, turning > on synchronous commit (it was off) and added a standby replica. This > should help narrow the problem down should it re-occur; if storage is > bad (note, other database on same machine is doing 10x write activity > and is fine) or something is scribbling on shared memory (my guess > here) then checksums should be popped, right? Not really sure about that. As I recall we compute the CRC on the buffer's way out, based on the then-current contents, so if something scribbles on the buffer while it's waiting to be evicted, the CRC computation would include the new (corrupted) bytes rather than the original ones -- see FlushBuffer. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Oct 25, 2016 at 12:57 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Not really sure about that.  As I recall we compute the CRC on the
> buffer's way out, based on the then-current contents, so if something
> scribbles on the buffer while it's waiting to be evicted, the CRC
> computation would include the new (corrupted) bytes rather than the
> original ones -- see FlushBuffer.

Huh. I have a new theory on this. Dealing with the reconstituted database, I'm finding more things -- functions and such -- that are simply gone and had to be rebuilt; they escaped notice as they were not in primary code paths. Recall that the original outage manifested as queries getting stuck, possibly on a spinlock (we don't know for sure). After that, things started to randomly disappear, possibly from the system catalogs (but I now need to go back and verify older data, I think). There were three autovac processes running.

What if the subsequent data loss was in fact a symptom of the first outage? Is it in theory possible for data to appear visible but then be eaten up as the transactions making the data visible get voided out by some other mechanism? I had to pull a quick restart the first time and everything looked ok -- or so I thought. What I think was actually happening is that data started to slip into the void. It's like random sys catalogs were dropping off. I bet other data was, too. I can pull older backups and verify that. It's as if some creeping xmin was snuffing everything out.

The confirmation of this should be obvious -- if that's indeed the case, the backed-up-and-restored cluster should no longer present data loss. Given that I was getting that every 1-2 days, we should be able to figure that out pretty soon.

merlin
Merlin Moncure <mmoncure@gmail.com> writes: > What if the subsequent dataloss was in fact a symptom of the first > outage? Is in theory possible for data to appear visible but then be > eaten up as the transactions making the data visible get voided out by > some other mechanic? I had to pull a quick restart the first time and > everything looked ok -- or so I thought. What I think was actually > happening is that data started to slip into the void. It's like > randomly sys catalogs were dropping off. I bet other data was, too. I > can pull older backups and verify that. It's as if some creeping xmin > was snuffing everything out. Might be interesting to look at age(xmin) in a few different system catalogs. I think you can ignore entries with age = 2147483647; those should be frozen rows. But if you see entries with very large ages that are not that, it'd be suspicious. regards, tom lane
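[A minimal sketch of that check against pg_class; the same query pattern applies to any catalog:]

    -- Non-frozen rows with implausibly large xmin age would be suspicious;
    -- age = 2147483647 indicates a frozen row and can be ignored.
    SELECT relname, xmin, age(xmin)
    FROM pg_class
    WHERE age(xmin) <> 2147483647
    ORDER BY age(xmin) DESC
    LIMIT 20;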
On Tue, Oct 25, 2016 at 2:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Might be interesting to look at age(xmin) in a few different system
> catalogs.  I think you can ignore entries with age = 2147483647;
> those should be frozen rows.  But if you see entries with very large
> ages that are not that, it'd be suspicious.

nothing really stands out. The damage did re-occur after a dump/restore -- not sure about a cluster-level rebuild. No problems previous to that. This suggests that, if this theory holds, the damage would have had to have been below the database level -- perhaps in clog. Maybe hint bits and clog did not agree as to commit or delete status, for example. clog has plenty of history leading past the problem barrier:

-rwx------ 1 postgres postgres 256K Jul 10 16:21 0000
-rwx------ 1 postgres postgres 256K Jul 21 12:39 0001
-rwx------ 1 postgres postgres 256K Jul 21 13:19 0002
-rwx------ 1 postgres postgres 256K Jul 21 13:59 0003
<snip>

Confirmation of problem re-occurrence will come in a few days. I'm much more likely to believe a 6+ sigma occurrence (storage, freak bug, etc.) should it prove that the problem goes away post-rebuild.

merlin
On 10/22/16 12:38 PM, Tom Lane wrote:
> Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
>> On 10/21/16 7:43 PM, Tom Lane wrote:
>>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>>> Agreed.  The problem is how to install it without breaking pg_upgrade.
>> It can't look up relation names?
> It can't shove 64 bytes into a page that has < 64 bytes free space.

Yeah, that would need to be a special case, but unless the page pointers are horked you'd be able to tell if there was a name in there.

>>> Well, that's the first problem.  The second problem is how to cope with
>>> RENAME TABLE.
>> If the name was only encoded in the first block of the relation I'm not
>> sure how bad this would be; are there any facilities to change the name
>> back on a rollback?
> No.  How would that work in crash situations?  (And keep in mind that the

Well, if FPIs were enabled you'd get the old page back. Even without that, recovery could remember rename records it's seen that didn't commit.

> argument for this hinges entirely on its being 100% trustworthy after a
> crash.)

I don't think this needs to be 100% reliable to be useful, especially if we went with Andres's suggestion to store both the old and new name (and storing the OID as well isn't a bad idea). If you're ever at the point of looking at this info you're already in deep trouble and should already be taking everything with a grain of salt anyway.

-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
On Tue, Oct 25, 2016 at 3:08 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Confirmation of problem re-occurrence will come in a few days. I'm
> much more likely to believe a 6+ sigma occurrence (storage, freak bug,
> etc.) should it prove that the problem goes away post-rebuild.

ok, no major reported outage yet, but just got:

2016-10-26 11:27:55 CDT [postgres@castaging]: ERROR:  invalid page in block 12 of relation base/203883/1259

merlin
On Wed, Oct 26, 2016 at 11:35 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> ok, no major reported outage yet, but just got:
>
> 2016-10-26 11:27:55 CDT [postgres@castaging]: ERROR:  invalid page in
> block 12 of relation base/203883/1259

*) still on 9.5.2 :( not in control of the package update process

*) getting more of these:

[root@rcdylsdbmpf001 data]# cat /var/lib/pgsql/9.5/data_checksum/pg_log/postgresql-26.log | grep "invalid page"
2016-10-26 11:26:42 CDT [postgres@castaging]: ERROR:  invalid page in block 10 of relation base/203883/1259
2016-10-26 11:27:55 CDT [postgres@castaging]: ERROR:  invalid page in block 12 of relation base/203883/1259
2016-10-26 12:16:44 CDT [postgres@castaging]: ERROR:  invalid page in block 13 of relation base/203883/1259
2016-10-26 12:18:58 CDT [postgres@castaging]: ERROR:  invalid page in block 15 of relation base/203883/1259
2016-10-26 12:19:12 CDT [postgres@castaging]: ERROR:  invalid page in block 7 of relation base/203883/2662

castaging=# select relname from pg_class where oid in (1259, 2662);
       relname
────────────────────
 pg_class_oid_index
 pg_class

*) only one database in the cluster is damaged. I do not suspect storage at this point

*) Currently I can execute pg_dump with no problems. So far impact is not severe, but soon I will have to do a dump/restore

*) age(xmin) from pg_class gives all reasonable values (approx 757k) that aren't 2147483647

*) All cases of this error came from statements executing in a pl/pgsql routine that does the following:

BEGIN
  ...
  <several postgres_fdw queries against the same 1-record, 1-column table>

  DROP TABLE IF EXISTS foo;
  CREATE TEMP TABLE foo ON COMMIT DROP AS ...;

  SELECT <sqsh function pushing sql script to sql server> ... INTO Result;

  DROP TABLE IF EXISTS bar;
  CREATE TEMP TABLE bar ON COMMIT DROP AS ...;

  PERFORM <sqsh function removing script>;
EXCEPTION WHEN OTHERS THEN
  PERFORM <sqsh function removing script>;
  RAISE;
END;

*) page verification misses are numerically close:

[root@rcdylsdbmpf001 ~]# cat /var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep "page verification"
2016-10-26 11:26:42 CDT [postgres@castaging]: WARNING:  page verification failed, calculated checksum 37251 but expected 37244
2016-10-26 11:27:55 CDT [postgres@castaging]: WARNING:  page verification failed, calculated checksum 37249 but expected 37244
2016-10-26 12:16:44 CDT [postgres@castaging]: WARNING:  page verification failed, calculated checksum 44363 but expected 44364
2016-10-26 12:18:58 CDT [postgres@castaging]: WARNING:  page verification failed, calculated checksum 49525 but expected 49539
2016-10-26 12:19:12 CDT [postgres@castaging]: WARNING:  page verification failed, calculated checksum 37345 but expected 37340

*) sample log leading into the first error (all queries are logged):

2016-10-26 11:26:40 CDT [postgres@castaging]: LOG:  execute <unnamed>: SELECT PushMarketSample($1::TEXT) AS published
2016-10-26 11:26:40 CDT [postgres@castaging]: DETAIL:  parameters: $1 = '8840'
2016-10-26 11:26:40 CDT [postgres@mpf2]: LOG:  statement: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
2016-10-26 11:26:40 CDT [postgres@mpf2]: LOG:  statement: SAVEPOINT s2
2016-10-26 11:26:40 CDT [postgres@mpf2]: LOG:  execute <unnamed>: DECLARE c1 CURSOR FOR SELECT period FROM public.dataentryperiod
2016-10-26 11:26:40 CDT [postgres@mpf2]: LOG:  statement: FETCH 100 FROM c1
2016-10-26 11:26:41 CDT [postgres@mpf2]: LOG:  statement: CLOSE c1
2016-10-26 11:26:41 CDT [postgres@mpf2]: LOG:  execute <unnamed>: DECLARE c2 CURSOR FOR SELECT period FROM public.dataentryperiod
2016-10-26 11:26:41 CDT [postgres@mpf2]: LOG:  statement: FETCH 100 FROM c2
2016-10-26 11:26:41 CDT [postgres@mpf2]: LOG:  statement: CLOSE c2
2016-10-26 11:26:41 CDT [rms@mpf2]: ERROR:  relation "tblagent" does not exist at character 15
2016-10-26 11:26:41 CDT [rms@mpf2]: STATEMENT:  select * from tblagent
2016-10-26 11:26:41 CDT [rms@mpf2]: LOG:  execute <unnamed>: SELECT CURRENT_SCHEMA()
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  execute <unnamed>: DECLARE c3 CURSOR FOR SELECT period FROM public.dataentryperiod
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  statement: FETCH 100 FROM c3
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  statement: CLOSE c3
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  execute <unnamed>: DECLARE c4 CURSOR FOR SELECT period FROM public.dataentryperiod
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  statement: FETCH 100 FROM c4
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  statement: CLOSE c4
2016-10-26 11:26:42 CDT [postgres@castaging]: WARNING:  page verification failed, calculated checksum 37251 but expected 37244
2016-10-26 11:26:42 CDT [postgres@castaging]: CONTEXT:  SQL statement "CREATE TEMP TABLE ApartmentPublishBySubmkt ON COMMIT DROP AS
            SELECT ApartmentId, TotalUnits, SubmarketId,
              SUM(Rent * UnitCount) / NULLIF(SUM(UnitCount), 0) AS Rent,
              SUM(Occupancy * UnitCount) / NULLIF(SUM(UnitCount), 0) AS Occupancy
            FROM AptSample
            GROUP BY 1, 2, 3"
        PL/pgSQL function pushmarketsample(text,date,integer) line 103 at SQL statement
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  statement: ROLLBACK TO SAVEPOINT s2; RELEASE SAVEPOINT s2
2016-10-26 11:26:42 CDT [postgres@castaging]: ERROR:  invalid page in block 10 of relation base/203883/1259
2016-10-26 11:26:42 CDT [postgres@castaging]: CONTEXT:  SQL statement "CREATE TEMP TABLE ApartmentPublishBySubmkt ON COMMIT DROP AS
            SELECT ApartmentId, TotalUnits, SubmarketId,
              SUM(Rent * UnitCount) / NULLIF(SUM(UnitCount), 0) AS Rent,

merlin
On Wed, Oct 26, 2016 at 12:43 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Oct 26, 2016 at 11:35 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Tue, Oct 25, 2016 at 3:08 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> Confirmation of problem re-occurrence will come in a few days. I'm
>>> much more likely to believe 6+sigma occurrence (storage, freak bug,
>>> etc) should it prove the problem goes away post rebuild.
>>
>> ok, no major reported outage yet, but just got:
>>
>> 2016-10-26 11:27:55 CDT [postgres@castaging]: ERROR: invalid page in
>> block 12 of relation base/203883/1259

*) I've now strongly correlated this routine with the damage.

[root@rcdylsdbmpf001 ~]# cat /var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep -i pushmarketsample | head -5
2016-10-26 11:26:27 CDT [postgres@castaging]: LOG: execute <unnamed>: SELECT PushMarketSample($1::TEXT) AS published
2016-10-26 11:26:40 CDT [postgres@castaging]: LOG: execute <unnamed>: SELECT PushMarketSample($1::TEXT) AS published
PL/pgSQL function pushmarketsample(text,date,integer) line 103 at SQL statement
PL/pgSQL function pushmarketsample(text,date,integer) line 103 at SQL statement
2016-10-26 11:26:42 CDT [postgres@castaging]: STATEMENT: SELECT PushMarketSample($1::TEXT) AS published

*) First invocation was 11:26:27 CDT
*) Second invocation was 11:26:40 and gave the checksum error (as noted earlier, logged at 11:26:42)
*) Routine attached (if interested)

My next step is to set up a test environment and jam this routine aggressively to see what happens.

merlin
Merlin Moncure <mmoncure@gmail.com> writes:
> *) I've now strongly correlated this routine with the damage.

Hmm. Do you have any way to replace the non-core calls with something else? The "shexec('rm -f ' || _OutputFile)" bits could presumably be converted to use contrib/adminpack's pg_file_unlink(), or an equivalent one-liner in plperlu or plpythonu. I don't know what the sqshf incantation does, though.

This wouldn't be about fixing it so much as narrowing down where the problem is.

regards, tom lane
On Wed, Oct 26, 2016 at 1:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> *) I've now strongly correlated this routine with the damage.
>
> Hmm. Do you have any way to replace the non-core calls with something
> else? The "shexec('rm -f ' || _OutputFile)" bits could presumably be
> converted to use contrib/adminpack's pg_file_unlink(), or an equivalent
> one-liner in plperlu or plpythonu. I don't know what the sqshf
> incantation does, though.
>
> This wouldn't be about fixing it so much as narrowing down where the
> problem is.

Will do. Before you spend a lot of time, let me get full confirmation that this is the source of the issue and I'll bake a reproduction script (there is still a possibility that something *else* caused it and by happenstance we caught it here).

For posterity, sqshf does:

CREATE OR REPLACE FUNCTION sqshf(
  ConnectionString TEXT,
  FileName TEXT) RETURNS TEXT AS
$$#!/bin/bash
cat \
  $2 \
  | eval "sqsh $1 -L'datetime=%Y-%m-%d %H:%M:%S.%u' -G 7.0"

echo "Success"
$$ LANGUAGE plsh;

sqsh being the freetds wrapping linux console to sql server. So basically not much. shexec does:

CREATE OR REPLACE FUNCTION shexec(_cmd TEXT) RETURNS TEXT AS
$$#!/bin/bash
eval $1
$$ LANGUAGE plsh;

:-)

merlin
On October 26, 2016 8:57:22 PM GMT+03:00, Merlin Moncure <mmoncure@gmail.com> wrote:
>On Wed, Oct 26, 2016 at 12:43 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Wed, Oct 26, 2016 at 11:35 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> On Tue, Oct 25, 2016 at 3:08 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>>> Confirmation of problem re-occurrence will come in a few days. I'm
>>>> much more likely to believe 6+sigma occurrence (storage, freak bug,
>>>> etc) should it prove the problem goes away post rebuild.
>>>
>>> ok, no major reported outage yet, but just got:
>>>
>>> 2016-10-26 11:27:55 CDT [postgres@castaging]: ERROR: invalid page in
>>> block 12 of relation base/203883/1259
>
>*) I've now strongly correlated this routine with the damage.
>
>[root@rcdylsdbmpf001 ~]# cat /var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep -i pushmarketsample | head -5
>2016-10-26 11:26:27 CDT [postgres@castaging]: LOG: execute <unnamed>: SELECT PushMarketSample($1::TEXT) AS published
>2016-10-26 11:26:40 CDT [postgres@castaging]: LOG: execute <unnamed>: SELECT PushMarketSample($1::TEXT) AS published
>PL/pgSQL function pushmarketsample(text,date,integer) line 103 at SQL statement
>PL/pgSQL function pushmarketsample(text,date,integer) line 103 at SQL statement
>2016-10-26 11:26:42 CDT [postgres@castaging]: STATEMENT: SELECT PushMarketSample($1::TEXT) AS published
>
>*) First invocation was 11:26:27 CDT
>*) Second invocation was 11:26:40 and gave the checksum error (as noted earlier, logged at 11:26:42)
>*) Routine attached (if interested)
>
>My next step is to set up a test environment and jam this routine aggressively to see what happens.

Any chance that plsh or the script it executes does anything with the file descriptors it inherits? That'd certainly be one way to get into odd corruption issues.

We really should use O_CLOEXEC for the majority of our file handles.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Oct 26, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
> Any chance that plsh or the script it executes does anything with the file descriptors it inherits? That'd certainly be one way to get into odd corruption issues.

not sure. it's pretty small -- see
https://github.com/petere/plsh/blob/master/plsh.c

merlin
On October 26, 2016 9:38:49 PM GMT+03:00, Merlin Moncure <mmoncure@gmail.com> wrote:
>On Wed, Oct 26, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
>> Any chance that plsh or the script it executes does anything with the
>> file descriptors it inherits? That'd certainly be one way to get into odd
>> corruption issues.
>
>not sure. it's pretty small -- see
>https://github.com/petere/plsh/blob/master/plsh.c

Afaics that could also be in your script, not just plsh. The latter doesn't seem to close all file handles above stderr, which means that all handles for relations etc. will be open in your script. If you e.g. do any unusual redirections (2>&17 or such), that could end badly. But I'm just on my phone, in a taxi without seatbelts, at 60mph, so I didn't look carefully.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Oct 26, 2016 at 1:45 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On October 26, 2016 9:38:49 PM GMT+03:00, Merlin Moncure <mmoncure@gmail.com> wrote:
>>On Wed, Oct 26, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
>>> Any chance that plsh or the script it executes does anything with the
>>> file descriptors it inherits? That'd certainly be one way to get into odd
>>> corruption issues.
>>
>>not sure. it's pretty small -- see
>>https://github.com/petere/plsh/blob/master/plsh.c
>
> Afaics that could also be in your script, not just plsh. The latter doesn't seem to close all file handles above stderr, which means that all handles for relations etc. will be open in your script. If you e.g. do any unusual redirections (2>&17 or such), that could end badly. But I'm just on my phone, in a taxi without seatbelts, at 60mph, so I didn't look carefully.

gotcha :-). see above:

*) sqshf:
#!/bin/bash
cat \
  $2 \
  | eval "sqsh $1 -L'datetime=%Y-%m-%d %H:%M:%S.%u' -G 7.0"

echo "Success"

*) shexec:
#!/bin/bash

eval $1

FWICT that's all that's happening here with respect to pl/sh. I'm almost done with the reproduction environment and if i get the issue I should be able to zero in on the problem. One final thing is that concurrency is a contributing factor.

merlin
On 2016-10-26 13:49:12 -0500, Merlin Moncure wrote:
> On Wed, Oct 26, 2016 at 1:45 PM, Andres Freund <andres@anarazel.de> wrote:
> > On October 26, 2016 9:38:49 PM GMT+03:00, Merlin Moncure <mmoncure@gmail.com> wrote:
> >>On Wed, Oct 26, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
> >>> Any chance that plsh or the script it executes does anything with the
> >>> file descriptors it inherits? That'd certainly be one way to get into odd
> >>> corruption issues.
> >>
> >>not sure. it's pretty small -- see
> >>https://github.com/petere/plsh/blob/master/plsh.c
> >
> > Afaics that could also be in your script, not just plsh. The latter doesn't seem to close all file handles above stderr, which means that all handles for relations etc. will be open in your script. If you e.g. do any unusual redirections (2>&17 or such), that could end badly. But I'm just on my phone, in a taxi without seatbelts, at 60mph, so I didn't look carefully.
>
> gotcha :-). see above:
>
> *) sqshf:
> #!/bin/bash
> cat \
>   $2 \
>   | eval "sqsh $1 -L'datetime=%Y-%m-%d %H:%M:%S.%u' -G 7.0"
>
> echo "Success"
>
> *) shexec:
> #!/bin/bash
>
> eval $1
>
> FWICT that's all that's happening here with respect to pl/sh.

My point is that that doesn't mean anything. Whatever sqsh is, or whatever $1 eval's to (and $2 for that matter), could access the filehandles the backend has opened.

- Andres
On Wed, Oct 26, 2016 at 2:12 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-10-26 13:49:12 -0500, Merlin Moncure wrote:
>> On Wed, Oct 26, 2016 at 1:45 PM, Andres Freund <andres@anarazel.de> wrote:
>> > On October 26, 2016 9:38:49 PM GMT+03:00, Merlin Moncure <mmoncure@gmail.com> wrote:
>> >>On Wed, Oct 26, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
>> >>> Any chance that plsh or the script it executes does anything with the
>> >>> file descriptors it inherits? That'd certainly be one way to get into odd
>> >>> corruption issues.
>> >>
>> >>not sure. it's pretty small -- see
>> >>https://github.com/petere/plsh/blob/master/plsh.c
>> >
>> > Afaics that could also be in your script, not just plsh. The latter doesn't seem to close all file handles above stderr, which means that all handles for relations etc. will be open in your script. If you e.g. do any unusual redirections (2>&17 or such), that could end badly. But I'm just on my phone, in a taxi without seatbelts, at 60mph, so I didn't look carefully.
>>
>> gotcha :-). see above:
>> *) sqshf:
>> #!/bin/bash
>> cat \
>>   $2 \
>>   | eval "sqsh $1 -L'datetime=%Y-%m-%d %H:%M:%S.%u' -G 7.0"
>>
>> echo "Success"
>>
>> *) shexec:
>> #!/bin/bash
>>
>> eval $1
>>
>> FWICT that's all that's happening here with respect to pl/sh.
>
> My point is that that doesn't mean anything. Whatever sqsh is, or
> whatever $1 eval's to (and $2 for that matter), could access the
> filehandles the backend has opened.

I don't think that's the case. sqsh is a psql-like utility. it writes to stdout and stderr only, which is captured by plsh and sent. In this context shexec only wraps rm -f 'file' where 'file' is a file previously created with COPY in the same transaction.

In a test environment, spamming this routine with 16 threads I was not able to reproduce. Still hunting...

merlin
On 10/26/16 2:25 PM, Merlin Moncure wrote: > I don't think that's the case. sqsh is a psql-like utility. it > writes to stdout and stderr only which is captured by plsh and sent. > In this context shexec only wraps rm -f 'file' where 'file' is a file > previously created with COPY in the same transaction. How do you know that? It could potentially be doing anything it wanted with file handles. Removing the exec might "solve" the problem here, assuming that the forked process doesn't still inherit all open FH's. In a nutshell, running arbitrary shell code (let alone arbitrary executables) in a pl that's running inside the Postgres backend sounds insanely risky to me. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On Wed, Oct 26, 2016 at 3:06 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 10/26/16 2:25 PM, Merlin Moncure wrote:
>>
>> I don't think that's the case. sqsh is a psql-like utility. it
>> writes to stdout and stderr only, which is captured by plsh and sent.
>> In this context shexec only wraps rm -f 'file' where 'file' is a file
>> previously created with COPY in the same transaction.
>
> How do you know that? It could potentially be doing anything it wanted with
> file handles. Removing the exec might "solve" the problem here, assuming
> that the forked process doesn't still inherit all open FH's.
>
> In a nutshell, running arbitrary shell code (let alone arbitrary
> executables) in a pl that's running inside the Postgres backend sounds
> insanely risky to me.

Well, it is insanely risky, but the code in this context is not arbitrary; it's only:

PERFORM shexec('rm -f ' || _OutputFile);

merlin
On 2016-10-26 15:06:34 -0500, Jim Nasby wrote: > Removing the exec might "solve" the problem here, assuming that the > forked process doesn't still inherit all open FH's. Unless you explicitly close fds or use FD_CLOEXEC when opening fds they'll be inherited forever.
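To make the close-on-exec mechanics concrete, here is a minimal standalone C sketch (an illustration only, not PostgreSQL or plsh source) of the two idioms in play; without one of them, every descriptor the backend has open is inherited by the exec'd shell:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Idiom 1: set close-on-exec atomically at open time. */
    int fd1 = open("/tmp/cloexec_demo1", O_RDWR | O_CREAT | O_CLOEXEC, 0600);

    /* Idiom 2: set it after the fact with fcntl(); works on systems that
     * predate O_CLOEXEC, but leaves a race window in threaded programs
     * that fork between the open() and the fcntl(). */
    int fd2 = open("/tmp/cloexec_demo2", O_RDWR | O_CREAT, 0600);
    if (fd2 >= 0)
    {
        int flags = fcntl(fd2, F_GETFD);
        if (flags == -1 || fcntl(fd2, F_SETFD, flags | FD_CLOEXEC) == -1)
            perror("fcntl");
    }

    /* Neither descriptor survives the exec inside system(); without the
     * flag, both would be open -- and writable -- in the child shell. */
    if (fd1 >= 0 && fd2 >= 0)
        system("ls -l /proc/self/fd");  /* on Linux: fd1/fd2 are absent */
    return 0;
}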
On Wed, Oct 26, 2016 at 8:43 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> /var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep "page verification"
> 2016-10-26 11:26:42 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 37251 but expected 37244
> 2016-10-26 11:27:55 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 37249 but expected 37244
> 2016-10-26 12:16:44 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 44363 but expected 44364
> 2016-10-26 12:18:58 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 49525 but expected 49539
> 2016-10-26 12:19:12 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 37345 but expected 37340

The checksum values are improbably close. The checksum algorithm has decently good mixing of all bits in the page. Having the first byte match in 5 checksums makes this 1:2^40 improbable. What is not mixed in properly is the block number, which only gets xor'ed in before packing the value into 16 bits using modulo 0xFFFF. So I'm pretty sure different block numbers were used for writing out and reading in the page. Either the block number gets corrupted between calculating the checksum and writing the page out (unlikely given the proximity), or the pages are somehow getting transposed in the storage.

Regards,
Ants Aasma
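To make Ants' observation concrete, here is a C paraphrase of the checksum's final step as I read PostgreSQL's checksum implementation (a sketch, not a verbatim copy of the source):

#include <stdint.h>
#include <stdio.h>

/* The 32-bit FNV-style hash of the page contents mixes well, but the
 * block number is only XORed in at the end, just before the reduction
 * to 16 bits. */
static uint16_t
finish_page_checksum(uint32_t page_hash, uint32_t blkno)
{
    uint32_t checksum = page_hash ^ blkno;      /* detect transposed pages */
    return (uint16_t) ((checksum % 65535) + 1); /* +1 avoids a zero checksum */
}

int main(void)
{
    uint32_t h = 0xDEADBEEF;  /* stand-in for the hash of some page's contents */

    /* The same contents checksummed as block 10 on write but re-checksummed
     * as block 12 on read differ only via 10 ^ 12 in the low bits: */
    printf("as block 10: %u\n", finish_page_checksum(h, 10));  /* 40340 */
    printf("as block 12: %u\n", finish_page_checksum(h, 12));  /* 40338 */
    return 0;
}

That is exactly the pattern in the logged pairs above (37251 vs 37244, 44363 vs 44364, and so on).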
On Thu, Oct 27, 2016 at 2:31 AM, Ants Aasma <ants.aasma@eesti.ee> wrote:
> On Wed, Oct 26, 2016 at 8:43 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> /var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep "page verification"
>> 2016-10-26 11:26:42 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 37251 but expected 37244
>> 2016-10-26 11:27:55 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 37249 but expected 37244
>> 2016-10-26 12:16:44 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 44363 but expected 44364
>> 2016-10-26 12:18:58 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 49525 but expected 49539
>> 2016-10-26 12:19:12 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 37345 but expected 37340
>
> The checksum values are improbably close. The checksum algorithm has
> decently good mixing of all bits in the page. Having the first byte
> match in 5 checksums makes this 1:2^40 improbable. What is not mixed
> in properly is the block number, which only gets xor'ed in before packing
> the value into 16 bits using modulo 0xFFFF. So I'm pretty sure
> different block numbers were used for writing out and reading in the
> page. Either the block number gets corrupted between calculating the
> checksum and writing the page out (unlikely given the proximity), or
> the pages are somehow getting transposed in the storage.

I think we can rule out faulty storage, and I'm reasonably sure nothing is writing to the database cluster except postgres itself. So far I have not executed an emergency dump/restore on this database, nor have I received any other log entries suggesting problems of any kind. Meaning, FWICT, the database is completely healthy notwithstanding the reported errors. This suggests (but does not prove) that the system-catalog-damaging activity happened during the execution of the attached function and healed itself during the rollback. Prior to turning on checksums, the transaction did *not* roll back, and the damage was allowed to stand and probably accumulated.

As noted earlier, I was not able to reproduce the issue with crashme.sh, which was:

NUM_FORKS=16
do_parallel psql -p 5432 -c"select PushMarketSample('1740')" castaging_test
do_parallel psql -p 5432 -c"select PushMarketSample('4400')" castaging_test
do_parallel psql -p 5432 -c"select PushMarketSample('2160')" castaging_test
do_parallel psql -p 5432 -c"select PushMarketSample('6680')" castaging_test
<snip>

(do_parallel is a simple wrapper that executes the command in parallel up to NUM_FORKS). This is on the same server and cluster as above. This kind of suggests that either
A) there is some concurrent activity from another process that is tripping the issue, or
B) there is something particular to the session invoking the function that is participating in the problem.

As the application is structured, a single-threaded node.js app is issuing the query that is high traffic and long lived. It's still running in fact, and I'm kind of tempted to find some downtime to see if I can still reproduce via the UI.

merlin
On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> I think we can rule out faulty storage

Nobody ever expects the faulty storage

-- 
greg
On Thu, Oct 27, 2016 at 6:39 PM, Greg Stark <stark@mit.edu> wrote:
> On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> I think we can rule out faulty storage
>
> Nobody ever expects the faulty storage

Believe me, I know. But the evidence points elsewhere in this case; this is clearly application driven.

merlin
On 10/28/16 8:23 AM, Merlin Moncure wrote: > On Thu, Oct 27, 2016 at 6:39 PM, Greg Stark <stark@mit.edu> wrote: >> On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >>> I think we can rule out faulty storage >> >> Nobody ever expects the faulty storage LOL > Believe me, I know. But the evidence points elsewhere in this case; > this is clearly application driven. FWIW, just because it's triggered by specific application behavior doesn't mean there isn't a storage bug. That's what makes data corruption bugs such a joy to figure out. BTW, if you haven't already, I would reset all your storage related options and GUCs to safe defaults... plain old FSYNC, no cute journal / FS / mount options, etc. Maybe this is related to the app, but the most helpful thing right now is to find some kind of safe config so you can start bisecting. I would also consider alternatives to plsh, just to rule it out if nothing else. I'd certainly look at some way to get sqsh out of the loop (again, just to get something that doesn't crash). First idea that comes to mind is a stand-alone shell script that watches a named pipe for a filename; when it gets that file it runs it with sqsh and does something to signal completion. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On Fri, Oct 28, 2016 at 3:16 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 10/28/16 8:23 AM, Merlin Moncure wrote:
>> On Thu, Oct 27, 2016 at 6:39 PM, Greg Stark <stark@mit.edu> wrote:
>>> On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>>> I think we can rule out faulty storage
>>>
>>> Nobody ever expects the faulty storage
>
> LOL
>
>> Believe me, I know. But the evidence points elsewhere in this case;
>> this is clearly application driven.
>
> FWIW, just because it's triggered by specific application behavior doesn't
> mean there isn't a storage bug. That's what makes data corruption bugs such
> a joy to figure out.
>
> BTW, if you haven't already, I would reset all your storage related options
> and GUCs to safe defaults... plain old FSYNC, no cute journal / FS / mount
> options, etc. Maybe this is related to the app, but the most helpful thing
> right now is to find some kind of safe config so you can start bisecting.

Upthread, you might have noticed that I already did that. Here is the other evidence:

*) server running fine for 5+ years
*) other database on same cluster not impacted, with 10x the write activity
*) no interesting logs reported in /var/log/messages, dmesg, etc
*) san fabric turns over petabytes/day with no corruption. 100+ postgres clusters, 1000+ sql server clusters (and that's not production)
*) storage/network teams have been through everything. nothing interesting/unusual to report
*) we have an infrequently run routine (posted upthread) that, when run, crashed the database within minutes
*) after turning on checksums, 30% of invocations of the routine resulted in a checksum error
*) problem re-occurred after dump/restore and full cluster rebuild
*) checksum error caused routine rollback. FWICT this prevented the damage
*) everything is fine now that the routine is not being run anymore

You can come up with your conclusion; I've come up with mine. The only frustrating thing here is that I can't reproduce outside of the production environment. If this database goes down I have 30 people sitting around, so I can't take downtime lightly.

> I would also consider alternatives to plsh, just to rule it out if nothing
> else. I'd certainly look at some way to get sqsh out of the loop (again,
> just to get something that doesn't crash). First idea that comes to mind is
> a stand-alone shell script that watches a named pipe for a filename; when it
> gets that file it runs it with sqsh and does something to signal completion.

I do a lot of etl to/from sql server and it's all sqsh based. If I can figure out how to reproduce in a better way, I'll zero in on the problem in about 10 minutes.

merlin
27.10.2016, 21:53, Merlin Moncure kirjoitti: > As noted earlier, I was not able to reproduce the issue with > crashme.sh, which was: > > NUM_FORKS=16 > do_parallel psql -p 5432 -c"select PushMarketSample('1740')" castaging_test > do_parallel psql -p 5432 -c"select PushMarketSample('4400')" castaging_test > do_parallel psql -p 5432 -c"select PushMarketSample('2160')" castaging_test > do_parallel psql -p 5432 -c"select PushMarketSample('6680')" castaging_test > <snip> > > (do_parallel is simple wrapper to executing the command in parallel up > to NUM_FORKS). This is on the same server and cluster as above. > This kind of suggests that either > A) there is some concurrent activity from another process that is > tripping the issue > or > B) there is something particular to the session invoking the function > that is participating in the problem. As the application is > structured, a single threaded node.js app is issuing the query that is > high traffic and long lived. It's still running in fact and I'm kind > of tempted to find some downtime to see if I can still reproduce via > the UI. Your production system's postgres backends probably have a lot more open files associated with them than the simple test case does. Since Postgres likes to keep files open as long as possible and only closes them when you need to free up fds to open new files, it's possible that your production backends have almost all allowed fds used when you execute your pl/sh function. If that's the case, the sqsh process that's executed may not have enough fds to do what it wanted to do and because of busted error handling could end up writing to fds that were opened by Postgres and point to $PGDATA files. / Oskari
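A hypothetical illustration in C of the failure mode Oskari describes (invented for illustration; this is not sqsh source):

#include <string.h>
#include <unistd.h>

/* Pretend this is the exec'd child. If the backend that launched us has
 * a $PGDATA heap file open on descriptor 42, then anything that blindly
 * writes to that descriptor number -- a hardcoded fd, a "2>&42"-style
 * redirect, or an error path that reuses a stale fd variable after a
 * failed open() -- scribbles directly into the data file, bypassing the
 * buffer manager entirely. */
int main(void)
{
    static const char junk[] = "oops\n";

    if (write(42, junk, strlen(junk)) < 0)
        return 1;  /* fd 42 wasn't open after all; no harm done */
    return 0;
}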
On Mon, Oct 31, 2016 at 10:32 AM, Oskari Saarenmaa <os@ohmu.fi> wrote:
> 27.10.2016, 21:53, Merlin Moncure kirjoitti:
>> As noted earlier, I was not able to reproduce the issue with
>> crashme.sh, which was:
>>
>> NUM_FORKS=16
>> do_parallel psql -p 5432 -c"select PushMarketSample('1740')" castaging_test
>> do_parallel psql -p 5432 -c"select PushMarketSample('4400')" castaging_test
>> do_parallel psql -p 5432 -c"select PushMarketSample('2160')" castaging_test
>> do_parallel psql -p 5432 -c"select PushMarketSample('6680')" castaging_test
>> <snip>
>>
>> (do_parallel is a simple wrapper that executes the command in parallel up
>> to NUM_FORKS). This is on the same server and cluster as above.
>> This kind of suggests that either
>> A) there is some concurrent activity from another process that is
>> tripping the issue, or
>> B) there is something particular to the session invoking the function
>> that is participating in the problem. As the application is
>> structured, a single-threaded node.js app is issuing the query that is
>> high traffic and long lived. It's still running in fact, and I'm kind
>> of tempted to find some downtime to see if I can still reproduce via
>> the UI.
>
> Your production system's postgres backends probably have a lot more open
> files associated with them than the simple test case does. Since Postgres
> likes to keep files open as long as possible and only closes them when you
> need to free up fds to open new files, it's possible that your production
> backends have almost all allowed fds used when you execute your pl/sh
> function.
>
> If that's the case, the sqsh process that's executed may not have enough fds
> to do what it wanted to do and because of busted error handling could end up
> writing to fds that were opened by Postgres and point to $PGDATA files.

Does that apply? the mechanics are a sqsh function that basically does:

cat foo.sql | sqsh <args>

pipe redirection opens a new process, right?

merlin
Merlin Moncure <mmoncure@gmail.com> writes: > On Mon, Oct 31, 2016 at 10:32 AM, Oskari Saarenmaa <os@ohmu.fi> wrote: >> Your production system's postgres backends probably have a lot more open >> files associated with them than the simple test case does. Since Postgres >> likes to keep files open as long as possible and only closes them when you >> need to free up fds to open new files, it's possible that your production >> backends have almost all allowed fds used when you execute your pl/sh >> function. >> >> If that's the case, the sqsh process that's executed may not have enough fds >> to do what it wanted to do and because of busted error handling could end up >> writing to fds that were opened by Postgres and point to $PGDATA files. > Does that apply? the mechanics are a sqsh function that basically does: > cat foo.sql | sqsh <args> > pipe redirection opens a new process, right? Yeah, but I doubt that either level of the shell would attempt to close inherited file handles. The real problem with Oskari's theory is that it requires not merely busted, but positively brain-dead error handling in the shell and/or sqsh, ie ignoring open() failures altogether. That seems kind of unlikely. Still, I suspect he might be onto something --- there must be some reason you can reproduce the issue in production and not in your test bed, and number-of-open-files is as good a theory as I've heard. Maybe the issue is not with open() failures, but with the resulting FD numbers being much larger than sqsh is expecting. It would be weird to try to store an FD in something narrower than int, but I could see a use of select() being unprepared for large FDs. Still, it's hard to translate that idea into scribbling on the wrong file... regards, tom lane
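For reference, a small C sketch of the select() hazard Tom alludes to (illustration only; whether sqsh actually trips on this is unverified):

#include <stdio.h>
#include <sys/select.h>

int main(void)
{
    fd_set readfds;
    int fd = 1500;  /* plausible in an fd-hungry backend's child */

    FD_ZERO(&readfds);
    /* fd_set is a fixed-size bitmap of FD_SETSIZE bits (typically 1024);
     * calling FD_SET on a larger descriptor is undefined behavior and
     * writes past the end of the structure. */
    if (fd >= FD_SETSIZE)
        fprintf(stderr, "fd %d >= FD_SETSIZE (%d); cannot use select()\n",
                fd, FD_SETSIZE);
    else
        FD_SET(fd, &readfds);
    return 0;
}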
On 2016-11-01 09:56:45 -0400, Tom Lane wrote:
> The real problem with Oskari's theory is that it requires not merely
> busted, but positively brain-dead error handling in the shell and/or
> sqsh, ie ignoring open() failures altogether. That seems kind of
> unlikely. Still, I suspect he might be onto something --- there must
> be some reason you can reproduce the issue in production and not in
> your test bed, and number-of-open-files is as good a theory as I've
> heard.

I've seen shell code akin to

exec 16> somefile # assume fd 16 is unused

more than once :(
On Tue, Nov 1, 2016 at 8:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Merlin Moncure <mmoncure@gmail.com> writes: >> On Mon, Oct 31, 2016 at 10:32 AM, Oskari Saarenmaa <os@ohmu.fi> wrote: >>> Your production system's postgres backends probably have a lot more open >>> files associated with them than the simple test case does. Since Postgres >>> likes to keep files open as long as possible and only closes them when you >>> need to free up fds to open new files, it's possible that your production >>> backends have almost all allowed fds used when you execute your pl/sh >>> function. >>> >>> If that's the case, the sqsh process that's executed may not have enough fds >>> to do what it wanted to do and because of busted error handling could end up >>> writing to fds that were opened by Postgres and point to $PGDATA files. > >> Does that apply? the mechanics are a sqsh function that basically does: >> cat foo.sql | sqsh <args> >> pipe redirection opens a new process, right? > > Yeah, but I doubt that either level of the shell would attempt to close > inherited file handles. > > The real problem with Oskari's theory is that it requires not merely > busted, but positively brain-dead error handling in the shell and/or > sqsh, ie ignoring open() failures altogether. That seems kind of > unlikely. Still, I suspect he might be onto something --- there must > be some reason you can reproduce the issue in production and not in > your test bed, and number-of-open-files is as good a theory as I've > heard. > > Maybe the issue is not with open() failures, but with the resulting > FD numbers being much larger than sqsh is expecting. It would be > weird to try to store an FD in something narrower than int, but > I could see a use of select() being unprepared for large FDs. > Still, it's hard to translate that idea into scribbling on the > wrong file... Looking at the sqsh code, nothing really stands out. It's highly developed and all obvious errors are checked. There certainly could be a freak bug in there (or in libfreetds which sqsh links to) doing the damage though. In the meantime I'll continue to try and work a reliable reproduction. This particular routine only gets called in batches on a quarterly basis so things have settled down. Just a thought; could COPY be tricked into writing into the wrong file descriptor? For example, if a file was killed with a rm -rf and the fd pressured backend reopened the fd immediately? merlin
26.10.2016, 21:34, Andres Freund kirjoitti:
> Any chance that plsh or the script it executes does anything with the file descriptors it inherits? That'd certainly be one way to get into odd corruption issues.
>
> We really should use O_CLOEXEC for the majority of our file handles.

Attached a patch to always use O_CLOEXEC in BasicOpenFile if we're not using EXEC_BACKEND. It'd be nice to not expose all fds to most pl-languages either, but I guess there's no easy solution to that without forcibly closing all fds whenever any functions are called.

/ Oskari
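Since the patch itself travels as an attachment, here is a rough C paraphrase of the change as described, assuming the 9.5-era three-argument BasicOpenFile signature (the actual patch may differ):

#include <fcntl.h>

typedef const char *FileName;  /* stand-in for the real typedef in fd.h */

/* ORing O_CLOEXEC into every open means no descriptor acquired this way
 * survives an execve() in a child process. EXEC_BACKEND builds are the
 * exception, since there backends are launched via exec and must inherit
 * the postmaster's descriptors. */
int
BasicOpenFile(FileName fileName, int fileFlags, int fileMode)
{
#if defined(O_CLOEXEC) && !defined(EXEC_BACKEND)
    fileFlags |= O_CLOEXEC;
#endif

    /* The real function also retries after releasing an LRU-cached fd
     * when open() fails with EMFILE/ENFILE; that logic is elided here. */
    return open(fileName, fileFlags, fileMode);
}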
On Wed, Nov 2, 2016 at 10:45 AM, Oskari Saarenmaa <os@ohmu.fi> wrote:
> 26.10.2016, 21:34, Andres Freund kirjoitti:
>> Any chance that plsh or the script it executes does anything with the file
>> descriptors it inherits? That'd certainly be one way to get into odd corruption
>> issues.
>>
>> We really should use O_CLOEXEC for the majority of our file handles.
>
> Attached a patch to always use O_CLOEXEC in BasicOpenFile if we're not using
> EXEC_BACKEND. It'd be nice to not expose all fds to most pl-languages
> either, but I guess there's no easy solution to that without forcibly
> closing all fds whenever any functions are called.

FYI, this is not my first run-in with strange behavior on this thread (not necessarily worth reading):
https://www.postgresql.org/message-id/CAHyXU0x5mW-SbSuUBEshzumOaN7JPUWa7Ejza68HE-KY0Nq7Kg%40mail.gmail.com

I had a similar set of starting conditions that resulted in very strange behavior (but not data corruption AFAICT) -- the problem mysteriously disappeared when I fixed some bugs that would cause the routine to concurrently do the same operation. I would like to point out that I use both pl/sh (and via it, sqsh) very heavily, so these problems are not necessarily the norm.

Regardless, it seems like you might be on to something, and I'm inclined to patch your change, test it, and roll it out to production. If it helps or at least narrows the problem down, we ought to give it consideration for inclusion (unless someone else can think of a good reason not to do that, heh!).

merlin
On 11/7/16 5:31 PM, Merlin Moncure wrote: > Regardless, it seems like you might be on to something, and I'm > inclined to patch your change, test it, and roll it out to production. > If it helps or at least narrows the problem down, we ought to give it > consideration for inclusion (unless someone else can think of a good > reason not to do that, heh!). Any results yet? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/2/16 11:45 AM, Oskari Saarenmaa wrote:
> 26.10.2016, 21:34, Andres Freund kirjoitti:
>> Any chance that plsh or the script it executes does anything with the file descriptors it inherits? That'd certainly be one way to get into odd corruption issues.
>>
>> We really should use O_CLOEXEC for the majority of our file handles.
>
> Attached a patch to always use O_CLOEXEC in BasicOpenFile if we're not
> using EXEC_BACKEND. It'd be nice to not expose all fds to most
> pl-languages either, but I guess there's no easy solution to that
> without forcibly closing all fds whenever any functions are called.

It seems like everyone was generally in favor of this. I looked around the internet for caveats but everyone was basically saying, you should definitely do this.

Why not for EXEC_BACKEND?

O_CLOEXEC is a newer interface. There are older systems that don't have it but have FD_CLOEXEC for fcntl(). We should use that as a fallback.

Have you gone through the code and checked for other ways file descriptors might get opened? Here is a blog post that lists some candidates: http://udrepper.livejournal.com/20407.html

Ideally, we would have a test case that exec's something that lists the open file descriptors, and we check that there are only those we expect.

The comment "We don't expect execve() calls inside the postgres code" is not quite correct, as we do things like archive_command and COPY to program (see OpenPipeStream()).

-- 
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 3, 2017 at 1:05 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> On 11/7/16 5:31 PM, Merlin Moncure wrote:
>> Regardless, it seems like you might be on to something, and I'm
>> inclined to patch your change, test it, and roll it out to production.
>> If it helps or at least narrows the problem down, we ought to give it
>> consideration for inclusion (unless someone else can think of a good
>> reason not to do that, heh!).
>
> Any results yet?

Not yet unfortunately. I compiled the server with the change, but was not able to get $libdir working so that I could just do a binary swap over my pgdg compiled package. If anyone has some pointers on how to do that, I'd appreciate it.

Still getting checksum failures. Over the last 30 days, I see the following. Since enabling checksums, FWICT none of the damage is permanent, and it rolls back with the transaction. So creepy!

[root@rcdylsdbmpf001 pg_log]# cat *.log | grep "page verification failed"
2016-12-05 10:17:48 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 61797 but expected 61798
2016-12-05 11:15:31 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 37750 but expected 37749
2016-12-05 11:15:58 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 44483 but expected 44482
2016-12-05 11:16:33 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 58926 but expected 58925
2016-12-05 11:17:08 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 38527 but expected 38528
2016-12-05 11:18:34 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 61932 but expected 61933
2016-12-05 11:18:55 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 23757 but expected 23758
2016-12-05 12:13:48 CST [rms@mpf2]: WARNING: page verification failed, calculated checksum 44192 but expected 44225 at character 417
2016-12-08 14:18:37 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 36083 but expected 36082
2016-12-08 15:52:31 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 63414 but expected 63415 at character 1096
2016-12-09 09:12:21 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 25781 but expected 25780
2016-12-09 09:13:20 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 63043 but expected 63044 at character 4230
2016-12-12 08:57:45 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 31775 but expected 31771
2016-12-13 09:47:11 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 40802 but expected 40806
2016-12-15 12:49:04 CST [rms@mpf2]: WARNING: page verification failed, calculated checksum 11625 but expected 11592 at character 417
2016-12-15 12:51:08 CST [rms@mpf2]: WARNING: page verification failed, calculated checksum 51017 but expected 51018
2016-12-15 12:52:36 CST [rms@mpf2]: WARNING: page verification failed, calculated checksum 51017 but expected 51018 at character 417
2016-12-16 12:16:31 CST [rms@mpf2]: WARNING: page verification failed, calculated checksum 23580 but expected 23576
2016-12-20 13:59:33 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 45273 but expected 45285
2016-12-20 14:00:22 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 10524 but expected 10525

Note the second database, 'mpf2'. This is a new development. An example of a query that is jacking things up is this:

2016-12-15 12:51:08 CST [rms@mpf2]: WARNING: page verification failed, calculated checksum 51017 but expected 51018
2016-12-15 12:51:08 CST [rms@mpf2]: CONTEXT: SQL statement "
  COPY (
    SELECT 'DELETE FROM tblNAptCommonSample WHERE ReportPeriod = 201612;'
    UNION ALL
    SELECT format(
      'INSERT INTO tblNAptCommonSample('
        'ReportPeriod, Period, AdjustmentType, PlanType, MSA, MSASubMkt, '
        'Sample, Occupancy, OccupancyChange, AverageRent, AverageRentChange, RentSF, '
        'RentSFChange)'
      'VALUES('
        '%s, %s, ''%s'', ''%s'', ''%s'', %s,'
        '%s, %s, %s, %s, %s, %s,'
        '%s)',
      ReportPeriod, Period, AdjustmentType, PlanType, MSA, MSASubMkt,
      c(Sample), c(Occupancy), c(OccupancyChange), c(AverageRent), c(AverageRentChange), c(RentSF), c(RentSFChange))
    FROM tblNAptCommonSample
    WHERE Period = 201612 AND MSA != '5610'
    UNION ALL
    SELECT 'go'
  ) TO '/tmp/tblnaptcommonsample.sql';
  "
  PL/pgSQL function writempf1history(integer) line 75 at EXECUTE

or this:

2016-12-15 12:52:36 CST [rms@mpf2]: WARNING: page verification failed, calculated checksum 51017 but expected 51018 at character 417
2016-12-15 12:52:36 CST [rms@mpf2]: QUERY:
  COPY (
    SELECT 'DELETE FROM tbltwrexistingunits WHERE ReportPeriod = 201612;'
    UNION ALL
    SELECT format(
      'INSERT INTO tbltwrexistingunits('
        'ReportPeriod, market, submarketnum, yr_qtr, cmpltns, deletions, '
        'existing, unadjexisting)'
      'VALUES('
        '%s, ''%s'', %s, ''%s'', %s, %s,'
        '%s, %s)',
      ReportPeriod, market, submarket, yr_qtr, c(cmpltns), c(deletions), c(existing), c(unadjexisting))
    FROM tbltwrexistingunits
    WHERE ReportPeriod = 201612 AND market != '5610'
    UNION ALL
    SELECT 'go'
  ) TO '/tmp/tbltwrexistingunits.sql';
2016-12-15 12:52:36 CST [rms@mpf2]: CONTEXT: PL/pgSQL function writempf1history(integer) line 109 at EXECUTE

This is another (much simpler) routine that:
1. writes out data to a scratch file with COPY
2. uses pl/sh to pipe it to sqsh

merlin
On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > Still getting checksum failures. Over the last 30 days, I see the > following. Since enabling checksums FWICT none of the damage is > permanent and rolls back with the transaction. So creepy! The checksums still only differ in least significant digits which pretty much means that there is a block number mismatch. So if you rule out filesystem not doing its job correctly and transposing blocks, it could be something else that is resulting in blocks getting read from a location that happens to differ by a small multiple of page size. Maybe somebody is racily mucking with table fd's between seeking and reading. That would explain the issue disappearing after a retry. Maybe you can arrange for the RelFileNode and block number to be logged for the checksum failures and check what the actual checksums are in data files surrounding the failed page. If the requested block number contains something completely else, but the page that follows contains the expected checksum value, then it would support this theory. Regards, Ants Aasma
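To spell out the hypothesized race in C (a sketch of the idea, not PostgreSQL's actual fd.c/md.c code):

#include <unistd.h>

#define BLCKSZ 8192

/* A seek plus a separate read acts on the file offset stored in the
 * open file description -- state that is shared with any child process
 * that inherited the descriptor across fork(): */
ssize_t
read_block_racy(int fd, unsigned int blkno, char *buf)
{
    if (lseek(fd, (off_t) blkno * BLCKSZ, SEEK_SET) < 0)
        return -1;
    /* <-- if a child moves the shared offset here, the wrong block is read */
    return read(fd, buf, BLCKSZ);
}

/* pread() takes the offset explicitly and never touches the shared
 * offset, so it has no such window: */
ssize_t
read_block_safe(int fd, unsigned int blkno, char *buf)
{
    return pread(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ);
}

The same reasoning applies on the write side, with lseek/write versus pwrite.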
On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma <ants.aasma@eesti.ee> wrote: > On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >> Still getting checksum failures. Over the last 30 days, I see the >> following. Since enabling checksums FWICT none of the damage is >> permanent and rolls back with the transaction. So creepy! > > The checksums still only differ in least significant digits which > pretty much means that there is a block number mismatch. So if you > rule out filesystem not doing its job correctly and transposing > blocks, it could be something else that is resulting in blocks getting > read from a location that happens to differ by a small multiple of > page size. Maybe somebody is racily mucking with table fd's between > seeking and reading. That would explain the issue disappearing after a > retry. > > Maybe you can arrange for the RelFileNode and block number to be > logged for the checksum failures and check what the actual checksums > are in data files surrounding the failed page. If the requested block > number contains something completely else, but the page that follows > contains the expected checksum value, then it would support this > theory. will do. Main challenge is getting hand compiled server to swap in so that libdir continues to work. Getting access to the server is difficult as is getting a maintenance window. I'll post back ASAP. merlin
On Wed, Jan 4, 2017 at 4:17 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> It seems like everyone was generally in favor of this. I looked around
> the internet for caveats but everyone was basically saying, you should
> definitely do this.
>
> Why not for EXEC_BACKEND?
>
> O_CLOEXEC is a newer interface. There are older systems that don't have
> it but have FD_CLOEXEC for fcntl(). We should use that as a fallback.
>
> Have you gone through the code and checked for other ways file
> descriptors might get opened? Here is a blog post that lists some
> candidates: http://udrepper.livejournal.com/20407.html
>
> Ideally, we would have a test case that exec's something that lists the
> open file descriptors, and we check that there are only those we expect.
>
> The comment "We don't expect execve() calls inside the postgres code" is
> not quite correct, as we do things like archive_command and COPY to
> program (see OpenPipeStream()).

Oskari, are you planning to answer this review? As the thread has been idle for 3 weeks, I am marking this as returned with feedback. Don't hesitate to change the status of the patch if you have a new version.

-- 
Michael
On Wed, Jan 18, 2017 at 4:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma <ants.aasma@eesti.ee> wrote:
>> On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> Still getting checksum failures. Over the last 30 days, I see the
>>> following. Since enabling checksums FWICT none of the damage is
>>> permanent and rolls back with the transaction. So creepy!
>>
>> The checksums still only differ in least significant digits which
>> pretty much means that there is a block number mismatch. So if you
>> rule out filesystem not doing its job correctly and transposing
>> blocks, it could be something else that is resulting in blocks getting
>> read from a location that happens to differ by a small multiple of
>> page size. Maybe somebody is racily mucking with table fd's between
>> seeking and reading. That would explain the issue disappearing after a
>> retry.
>>
>> Maybe you can arrange for the RelFileNode and block number to be
>> logged for the checksum failures and check what the actual checksums
>> are in data files surrounding the failed page. If the requested block
>> number contains something completely else, but the page that follows
>> contains the expected checksum value, then it would support this
>> theory.
>
> will do. Main challenge is getting hand compiled server to swap in
> so that libdir continues to work. Getting access to the server is
> difficult as is getting a maintenance window. I'll post back ASAP.

As a new datapoint, we just had a customer with an issue that I think might be related. The issue was reasonably repeatable by running a report on the standby system. It manifested itself first as "could not open relation" and/or "column is not in index" errors, followed a few minutes later by a PANIC from the startup process due to "specified item offset is too large", "invalid max offset number", or "page X of relation base/16384/1259 is uninitialized". I took a look at the xlog dump and it was completely fine. For instance, in the "specified item offset is too large" case there was an INSERT_LEAF redo record inserting the preceding offset just a couple hundred kilobytes back. Restarting the server sometimes successfully applied the offending WAL, sometimes it failed with other corruption errors. The offending relations were always pg_class or pg_class_oid_index. Replacing plsh functions with dummy plpgsql functions made the problem go away; reintroducing the plsh functions made it reappear.

The only thing I came up with that is consistent with the symptoms is that a page got thrown out of shared_buffers between the two xlog records referencing it (shared_buffers was the default 128MB), and was then read back by a backend process, where in the time between the FileSeek and FileRead calls in mdread a subprocess mucked with the fd's offset so that a different page than intended got read in. Or basically the same race condition, but on the write side. Maybe somebody else has a better imagination than me...

Regards,
Ants Aasma
On Thu, Aug 10, 2017 at 12:01 PM, Ants Aasma <ants.aasma@eesti.ee> wrote:
> On Wed, Jan 18, 2017 at 4:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma <ants.aasma@eesti.ee> wrote:
>>> On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>>> Still getting checksum failures. Over the last 30 days, I see the
>>>> following. Since enabling checksums FWICT none of the damage is
>>>> permanent and rolls back with the transaction. So creepy!
>>>
>>> The checksums still only differ in least significant digits which
>>> pretty much means that there is a block number mismatch. So if you
>>> rule out filesystem not doing its job correctly and transposing
>>> blocks, it could be something else that is resulting in blocks getting
>>> read from a location that happens to differ by a small multiple of
>>> page size. Maybe somebody is racily mucking with table fd's between
>>> seeking and reading. That would explain the issue disappearing after a
>>> retry.
>>>
>>> Maybe you can arrange for the RelFileNode and block number to be
>>> logged for the checksum failures and check what the actual checksums
>>> are in data files surrounding the failed page. If the requested block
>>> number contains something completely else, but the page that follows
>>> contains the expected checksum value, then it would support this
>>> theory.
>>
>> will do. Main challenge is getting hand compiled server to swap in
>> so that libdir continues to work. Getting access to the server is
>> difficult as is getting a maintenance window. I'll post back ASAP.
>
> As a new datapoint, we just had a customer with an issue that I think
> might be related. The issue was reasonably repeatable by running a
> report on the standby system. It manifested itself first as "could
> not open relation" and/or "column is not in index" errors, followed a
> few minutes later by a PANIC from the startup process due to "specified
> item offset is too large", "invalid max offset number", or "page X of
> relation base/16384/1259 is uninitialized". I took a look at the xlog
> dump and it was completely fine. For instance, in the "specified item
> offset is too large" case there was an INSERT_LEAF redo record
> inserting the preceding offset just a couple hundred kilobytes back.
> Restarting the server sometimes successfully applied the offending
> WAL, sometimes it failed with other corruption errors. The offending
> relations were always pg_class or pg_class_oid_index. Replacing plsh
> functions with dummy plpgsql functions made the problem go away;
> reintroducing the plsh functions made it reappear.

Fantastic. I was never able to attempt to apply the O_CLOEXEC patch (see upthread) due to the fact that access to the system is highly limited and compiling a replacement binary was a bit of a headache. IIRC this was the best theory on the table as to the underlying cause, and we ought to try that first, right?

Reminder: I was able to completely eliminate all damage (but had to handle the occasional unexpected rollback) by enabling checksums.

merlin
On Tue, Jan 3, 2017 at 1:05 PM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> On 11/7/16 5:31 PM, Merlin Moncure wrote:
>> Regardless, it seems like you might be on to something, and I'm
>> inclined to patch your change, test it, and roll it out to production.
>> If it helps or at least narrows the problem down, we ought to give it
>> consideration for inclusion (unless someone else can think of a good
>> reason not to do that, heh!).
>
> Any results yet?

Not yet. But I do have some interesting findings. At this point I do not think the problem is within pl/sh itself, but that when a process invoked from pl/sh misbehaves, that misbehavior can penetrate into the database processes. I also believe this problem is fd-related, so the 'close on exec' change might reasonably fix it. All cases of database damage I have observed remain completely mitigated by enabling database checksums.

Recently, a sqsh process kicked off via pl/sh crashed with signal 11, but the database process was otherwise intact and fine. This is strong supporting evidence for my points above, I think. I've also turned up a fairly reliable reproduction case from some unrelated application changes. If I can demonstrate that the close-on-exec flag works and prevents these occurrences, we can close the book on this.

merlin