Re: emergency outage requiring database restart - Mailing list pgsql-hackers

From Merlin Moncure
Subject Re: emergency outage requiring database restart
Date
Msg-id CAHyXU0y5ukBJ8U5naYiMm5nD1QF9B=LXfKPo0Ttn2WSCsmUPRg@mail.gmail.com
Whole thread Raw
In response to Re: emergency outage requiring database restart  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
On Thu, Oct 20, 2016 at 3:16 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Merlin Moncure wrote:
>
>> single user mode dumps core :(
>>
>> bash-4.1$ postgres --single -D /var/lib/pgsql/9.5/data castaging
>> LOG:  00000: could not change directory to "/root": Permission denied
>> LOCATION:  resolve_symlinks, exec.c:293
>> Segmentation fault (core dumped)
>>
>> Core was generated by `postgres --single -D /var/lib/pgsql/9.5/data castaging'.
>> Program terminated with signal 11, Segmentation fault.
>> #0  0x0000000000797d6f in ?? ()
>> Missing separate debuginfos, use: debuginfo-install
>> postgresql95-server-9.5.2-1PGDG.rhel6.x86_64
>> (gdb) bt
>> #0  0x0000000000797d6f in ?? ()
>> #1  0x000000000079acf1 in RelationCacheInitializePhase3 ()
>> #2  0x00000000007b35c5 in InitPostgres ()
>> #3  0x00000000006b9b53 in PostgresMain ()
>> #4  0x00000000005f30fb in main ()
>
> Maybe
>   rm global/pg_internal.init
> and try again?

Will do when I can do that had to do emergency restore + some unfun
data reconstruction from the query log.

Notably there is a much larger database in the same cluster which is
undamaged.  This server is new to production usage, maybe 2 months.

Here is contents of pg_extensionplpgsqldblinkhstorepostgres_fdwplsh * not usedpg_trgm * not usedplr * not usedtablefunc
*not usedadminpack * not usedplpythonu * not usedpostgis * not usedpostgis_topology * not used
 

Short term plan is to separate the database to it's own cluster,
install replication and checksums.  All queries to this database are
logged.  Here is the contents of the log leading into and after the
the crash:

oct 17 crash:
2016-10-17 12:12:24 CDT [rms@castaging]: DETAIL:  parameters: $1 =
'21121', $2 = '8', $3 = '2016-10-13', $4 = NULL, $5 = NULL, $6 = NULL,
$7 = NULL, $8 = NULL, $9 = NULL, $10 = NULL, $11 = 't', $12
2016-10-17 12:12:24 CDT [rms@castaging]: LOG:  execute <unnamed>:
SELECT NULL AS PROCEDURE_CAT, n.nspname AS PROCEDURE_SCHEM, p.proname
AS PROCEDURE_NAME, NULL, NULL, NULL, d.description AS REMARKS
2016-10-17 12:12:24 CDT [rms@castaging]: LOG:  execute <unnamed>:
SELECT n.nspname,p.proname,p.prorettype,p.proargtypes,
t.typtype,t.typrelid , p.proargnames, p.proargmodes, p.proallargtypes
, p.o
2016-10-17 12:12:24 CDT [rms@castaging]: LOG:  execute <unnamed>:
select * from checkin($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11,
$12) as result
2016-10-17 12:12:24 CDT [rms@castaging]: DETAIL:  parameters: $1 =
'114333', $2 = 'rrosillo', $3 = 'CALLER', $4 = 'Survey', $5 = 'Happy',
$6 = 'Callback', $7 = 'OTHER', $8 = '2016-10-18 01:05:00',
2016-10-17 12:12:24 CDT [rms@castaging]: LOG:  execute S_3: COMMIT
2016-10-17 12:12:25 CDT [@]: ERROR:  could not open relation with OID
1203933 <-- first sign of damage
2016-10-17 12:12:25 CDT [@]: CONTEXT:  automatic analyze of table
"castaging.public.apartment"

oct 20 crash:
2016-10-20 12:46:38 CDT [postgres@castaging]: LOG:  statement: SELECT
CallsByUser() AS byuser
2016-10-20 12:46:40 CDT [postgres@castaging]: LOG:  statement: SELECT
CallCenterOverviewJSON() AS overview
2016-10-20 12:46:41 CDT [postgres@castaging]: LOG:  statement: SELECT
CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:41 CDT [postgres@castaging]: LOG:  statement: SELECT
MarketOverviewJSON() AS market
2016-10-20 12:46:42 CDT [postgres@castaging]: LOG:  execute <unnamed>:
SELECT SubMarketOverviewJSON($1::TEXT) AS submkt
2016-10-20 12:46:42 CDT [postgres@castaging]: DETAIL:  parameters: $1 = '640'
2016-10-20 12:46:44 CDT [postgres@castaging]: LOG:  statement: SELECT
CallsByUser() AS byuser
2016-10-20 12:46:46 CDT [postgres@castaging]: LOG:  statement: SELECT
CallCenterOverviewJSON() AS overview
2016-10-20 12:46:47 CDT [postgres@castaging]: LOG:  statement: SELECT
CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:47 CDT [postgres@castaging]: ERROR:
"pg_description_o_c_o_index" is an index <-- first sign of damage
2016-10-20 12:46:47 CDT [postgres@castaging]: CONTEXT:  SQL function
"callcenterusertrackingjson" during startup
2016-10-20 12:46:47 CDT [postgres@castaging]: STATEMENT:  SELECT
CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:47 CDT [postgres@castaging]: WARNING:  leaking
still-referenced relcache entry for "pg_class_oid_index"

CallCenterUserTrackingJSON() and friends are not particularly
interesting except that they are making use of of json_agg().  They
were also called basically all day long in 5 second intervals.   I
guess this isn't saying very much, but I'm starting to smell a rat
here.

merlin



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: WIP: Fix invalid XML explain plans for track_io_timing
Next
From: Tom Lane
Date:
Subject: Re: Improve output of BitmapAnd EXPLAIN ANALYZE