Thread: emergency outage requiring database restart
Today I had an emergency production outage on a server. This particular server was running 9.5.2. The symptoms were interesting so I thought I'd report. Here is what I saw:

*) User CPU was pegged at 100%
*) Queries reading data would block and not respond to cancel or terminate
*) pg_stat_activity reported no waiting queries (but worked fine otherwise)

Adding all this up, it smells like processes were getting stuck on a spinlock. Connections quickly got eaten up and the situation was desperately urgent, so I punted and did an immediate restart, and things came back normally. I had a console to the database and did manage to grab the contents of pg_stat_activity; several trivial queries were running normally (according to pg_stat_activity) but were otherwise stuck. Attempting to run one of them myself, I noted the query got stuck and did not cancel. I was in a terrible rush, but I am casting around for things to gather in case this happens again -- 'perf top' would be a natural choice, I guess. Three autovacuum processes were running. Obviously I am going to do a bugfix upgrade, but I was wondering if anybody has seen anything like this. This particular server was upgraded to 9.5 somewhat recently but ran on 9.2 for years with no issues.

merlin
Merlin Moncure <mmoncure@gmail.com> writes: > Today I had an emergency production outage on a server. > ... > Adding all this up it smells like processes were getting stuck on a spinlock. Maybe. If it happens again, probably the most useful debug data would be stack traces from some of the busy processes. regards, tom lane
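[A minimal sketch of that kind of collection, assuming gdb and perf are installed and you have root on the box; <pid> is a placeholder to fill in from pg_stat_activity or top:]

    # Identify the pegged backends, then take one-shot backtraces.
    top -b -n 1 | grep postgres          # find busy backend PIDs
    gdb -p <pid> --batch -ex 'bt'        # grab a stack trace, then detach
    perf top -p <pid>                    # live CPU profile of one backend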
On Thu, Oct 13, 2016 at 4:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> Today I had an emergency production outage on a server.
>> ...
>> Adding all this up it smells like processes were getting stuck on a spinlock.
>
> Maybe.  If it happens again, probably the most useful debug data would
> be stack traces from some of the busy processes.

Another odd datapoint on this server. Things were running pretty well, but an application crashed on a missing view. Trying to recreate the view, I got:

CREATE OR REPLACE VIEW vw_ApartmentQueueLastGood AS
  SELECT ...

ERROR:  type "vw_apartmentqueuelastgood" already exists
HINT:  A relation has an associated type of the same name, so you must use a name that doesn't conflict with any existing type.

...which was pretty strange. I had to manually delete the pg_type record in order to create the view. I'm getting more reports of 'could not open relation with oid=X' errors, so I could be facing data corruption :(.

merlin
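[For anyone hitting the same symptom, a hedged sketch of how such an orphaned row might be located before resorting to a manual DELETE; the type name follows the report above, and running DML against a corrupt catalog is at your own risk:]

    -- Look for a composite type whose backing relation is missing or
    -- isn't actually a view/table -- the kind of leftover that blocks
    -- CREATE VIEW with "type ... already exists".
    SELECT t.oid, t.typname, t.typrelid
    FROM pg_type t
    LEFT JOIN pg_class c ON c.oid = t.typrelid
    WHERE t.typname = 'vw_apartmentqueuelastgood'
      AND (c.oid IS NULL OR c.relkind NOT IN ('v', 'r'));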
On Mon, Oct 17, 2016 at 1:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Oct 13, 2016 at 4:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Maybe.  If it happens again, probably the most useful debug data would
>> be stack traces from some of the busy processes.
>
> Another odd datapoint on this server. [...] I had to manually delete
> the pg_type record in order to create the view. I'm getting more
> reports of 'could not open relation with oid=X' errors, so I could be
> facing data corruption :(.

castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
castaging-#   SELECT ...
ERROR:  42809: "pg_cast_oid_index" is an index
LINE 11:  FROM ApartmentSample s
          ^
LOCATION:  heap_openrv_extended, heapam.c:1304

should I be restoring from backups?

merlin
Merlin Moncure wrote:

> castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
> castaging-#   SELECT ...
> ERROR:  42809: "pg_cast_oid_index" is an index
> LINE 11:  FROM ApartmentSample s
>           ^
> LOCATION:  heap_openrv_extended, heapam.c:1304
>
> should I be restoring from backups?

It's pretty clear to me that you've got catalog corruption here. You can try to fix things manually as they emerge, but that sounds like a fool's errand.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Merlin Moncure wrote:
>> should I be restoring from backups?
>
> It's pretty clear to me that you've got catalog corruption here.  You
> can try to fix things manually as they emerge, but that sounds like a
> fool's errand.

agreed. Current plan is to restore from backups and recover as much data as I can. Also applying the bugfix release, and going to enable checksums. We had several good backups since the previous outage, so it's not clear the events are related, but after months of smooth operation I find the coincidence highly suspicious. As always, we need to suspect hardware problems, but I'm highly abstracted from them -- using esx + san.

merlin
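[For reference: on 9.5, checksums can only be turned on at initdb time, so "enable checksums" implies standing up a fresh cluster and reloading into it -- roughly the following sketch; ports and paths are illustrative:]

    # New cluster with checksums (-k / --data-checksums), then dump/restore.
    initdb -k -D /var/lib/pgsql/9.5/data_checksum
    pg_ctl -D /var/lib/pgsql/9.5/data_checksum -o '-p 5433' start
    pg_dumpall -p 5432 | psql -p 5433 -d postgres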
Merlin Moncure wrote: > We had several good backups since the previous outage so it's not > clear the events are related but after months of smooth operation I > find that coincidence highly suspicious. As always, we need to suspect > hardware problems but I'm highly abstracted from them -- using esx + > san. Ah, so you're subject not only to hardware flaws but also to virtualization layer bugs :-) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Oct 18, 2016 at 4:21 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Merlin Moncure wrote:
>> As always, we need to suspect hardware problems but I'm highly
>> abstracted from them -- using esx + san.
>
> Ah, so you're subject not only to hardware flaws but also to
> virtualization layer bugs :-)

Wait a couple more years, and we'll get more complaints about Postgres running in containers running in VMs. Even more fun waiting ahead.
-- 
Michael
On 18/10/16 14:12, Michael Paquier wrote: > On Tue, Oct 18, 2016 at 4:21 AM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> Merlin Moncure wrote: >> >>> We had several good backups since the previous outage so it's not >>> clear the events are related but after months of smooth operation I >>> find that coincidence highly suspicious. As always, we need to suspect >>> hardware problems but I'm highly abstracted from them -- using esx + >>> san. >> Ah, so you're subject not only to hardware flaws but also to >> virtualization layer bugs :-) > Wait a couple of more years, and we'll get more complains about > Postgres running in containers running in VMs. Even more fun waiting > ahead. that started life on different hardware with a different O/S
On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> It's pretty clear to me that you've got catalog corruption here.  You
> can try to fix things manually as they emerge, but that sounds like a
> fool's errand.

Yeah. Believe me -- I know the drill. Most or all of the damage seemed to be to the system catalogs, with at least two critical tables dropped or inaccessible in some fashion. A lot of the OIDs seemed to be pointing at the wrong thing. Couple more datapoints here:

*) This database is OLTP, doing ~20 tps avg (but very bursty)
*) Another database on the same cluster was not impacted. However, it's more olap style and may not have been written to during the outage

Now, the infrastructure running this system hosts maybe 100ish postgres clusters and maybe 1000ish sql server instances, with approximately zero unexplained data corruption issues in the 5 years I've been here. Having said that, this definitely smells and feels like something on the infrastructure side. I'll follow up if I have any useful info.

merlin
On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Now, the infrastructure running this system hosts maybe 100ish
> postgres clusters and maybe 1000ish sql server instances, with
> approximately zero unexplained data corruption issues in the 5 years
> I've been here. Having said that, this definitely smells and feels
> like something on the infrastructure side. I'll follow up if I have
> any useful info.

After a thorough investigation I now have credible evidence that the source of the damage did not originate from the database itself. Specifically, this database is mounted on the same volume as the operating system (I know, I know) and something non-database-driven sucked up disk space very rapidly and exhausted the volume -- fast enough that sar didn't pick it up. Oh well :-) -- thanks for the help

merlin
On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:
> After a thorough investigation I now have credible evidence that the
> source of the damage did not originate from the database itself.
> Specifically, this database is mounted on the same volume as the
> operating system (I know, I know) and something non-database-driven
> sucked up disk space very rapidly and exhausted the volume -- fast
> enough that sar didn't pick it up. Oh well :-) -- thanks for the help

However, disk space exhaustion should not lead to corruption unless the underlying layers lied in some way.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian <bruce@momjian.us> wrote:
> However, disk space exhaustion should not lead to corruption unless the
> underlying layers lied in some way.

I agree -- however I'm sufficiently separated from the things doing the things that I can't verify that in any real way. In the meantime I'm going to take standard precautions (enable checksums / dedicated volume / replication). Low disk space also does not explain the bizarre outage I had last Friday.

merlin
On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> I agree -- however I'm sufficiently separated from the things doing
> the things that I can't verify that in any real way. In the meantime
> I'm going to take standard precautions (enable checksums / dedicated
> volume / replication). Low disk space also does not explain the
> bizarre outage I had last Friday.

ok, data corruption struck again. This time disk space is ruled out, and access to the database is completely denied:

postgres=# \c castaging
WARNING:  leaking still-referenced relcache entry for "pg_index_indexrelid_index"

merlin
On Thu, Oct 20, 2016 at 1:52 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> ok, data corruption struck again. This time disk space is ruled out,
> and access to the database is completely denied:
>
> postgres=# \c castaging
> WARNING:  leaking still-referenced relcache entry for
> "pg_index_indexrelid_index"

single user mode dumps core :(

bash-4.1$ postgres --single -D /var/lib/pgsql/9.5/data castaging
LOG:  00000: could not change directory to "/root": Permission denied
LOCATION:  resolve_symlinks, exec.c:293
Segmentation fault (core dumped)

Core was generated by `postgres --single -D /var/lib/pgsql/9.5/data castaging'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000797d6f in ?? ()
Missing separate debuginfos, use: debuginfo-install postgresql95-server-9.5.2-1PGDG.rhel6.x86_64
(gdb) bt
#0  0x0000000000797d6f in ?? ()
#1  0x000000000079acf1 in RelationCacheInitializePhase3 ()
#2  0x00000000007b35c5 in InitPostgres ()
#3  0x00000000006b9b53 in PostgresMain ()
#4  0x00000000005f30fb in main ()
(gdb)

merlin
Merlin Moncure <mmoncure@gmail.com> writes: > single user mode dumps core :( You've got a mess there :-( > Missing separate debuginfos, use: debuginfo-install > postgresql95-server-9.5.2-1PGDG.rhel6.x86_64 This backtrace would likely be much more informative if you did the above. regards, tom lane
On Thu, Oct 20, 2016 at 2:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Merlin Moncure <mmoncure@gmail.com> writes: >> single user mode dumps core :( > > You've got a mess there :-( > >> Missing separate debuginfos, use: debuginfo-install >> postgresql95-server-9.5.2-1PGDG.rhel6.x86_64 > > This backtrace would likely be much more informative if you did the above. can't; don't have the package unfortunately. merlin
Merlin Moncure wrote:

> single user mode dumps core :(
>
> (gdb) bt
> #0  0x0000000000797d6f in ?? ()
> #1  0x000000000079acf1 in RelationCacheInitializePhase3 ()
> #2  0x00000000007b35c5 in InitPostgres ()
> #3  0x00000000006b9b53 in PostgresMain ()
> #4  0x00000000005f30fb in main ()

Maybe

  rm global/pg_internal.init

and try again?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
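[A sketch of the fuller version of that cleanup, run from the data directory; both files are rebuilt automatically at the next startup, and <db_oid> is a placeholder for the damaged database's OID:]

    # Relcache init files: one for shared catalogs, one per database.
    rm global/pg_internal.init
    rm base/<db_oid>/pg_internal.init    # <db_oid> is a placeholder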
On Thu, Oct 20, 2016 at 3:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Maybe
>
>   rm global/pg_internal.init
>
> and try again?

Will do when I can. Had to do an emergency restore plus some unfun data reconstruction from the query log. Notably, there is a much larger database in the same cluster which is undamaged. This server is new to production usage, maybe 2 months. Here are the contents of pg_extension:

plpgsql
dblink
hstore
postgres_fdw
plsh              * not used
pg_trgm           * not used
plr               * not used
tablefunc         * not used
adminpack         * not used
plpythonu         * not used
postgis           * not used
postgis_topology  * not used

Short term plan is to separate the database to its own cluster, and install replication and checksums. All queries to this database are logged. Here are the contents of the log leading into and after the crash:

oct 17 crash:
2016-10-17 12:12:24 CDT [rms@castaging]: DETAIL:  parameters: $1 = '21121', $2 = '8', $3 = '2016-10-13', $4 = NULL, $5 = NULL, $6 = NULL, $7 = NULL, $8 = NULL, $9 = NULL, $10 = NULL, $11 = 't', $12
2016-10-17 12:12:24 CDT [rms@castaging]: LOG:  execute <unnamed>: SELECT NULL AS PROCEDURE_CAT, n.nspname AS PROCEDURE_SCHEM, p.proname AS PROCEDURE_NAME, NULL, NULL, NULL, d.description AS REMARKS
2016-10-17 12:12:24 CDT [rms@castaging]: LOG:  execute <unnamed>: SELECT n.nspname,p.proname,p.prorettype,p.proargtypes, t.typtype,t.typrelid , p.proargnames, p.proargmodes, p.proallargtypes , p.o
2016-10-17 12:12:24 CDT [rms@castaging]: LOG:  execute <unnamed>: select * from checkin($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12) as result
2016-10-17 12:12:24 CDT [rms@castaging]: DETAIL:  parameters: $1 = '114333', $2 = 'rrosillo', $3 = 'CALLER', $4 = 'Survey', $5 = 'Happy', $6 = 'Callback', $7 = 'OTHER', $8 = '2016-10-18 01:05:00',
2016-10-17 12:12:24 CDT [rms@castaging]: LOG:  execute S_3: COMMIT
2016-10-17 12:12:25 CDT [@]: ERROR:  could not open relation with OID 1203933   <-- first sign of damage
2016-10-17 12:12:25 CDT [@]: CONTEXT:  automatic analyze of table "castaging.public.apartment"

oct 20 crash:
2016-10-20 12:46:38 CDT [postgres@castaging]: LOG:  statement: SELECT CallsByUser() AS byuser
2016-10-20 12:46:40 CDT [postgres@castaging]: LOG:  statement: SELECT CallCenterOverviewJSON() AS overview
2016-10-20 12:46:41 CDT [postgres@castaging]: LOG:  statement: SELECT CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:41 CDT [postgres@castaging]: LOG:  statement: SELECT MarketOverviewJSON() AS market
2016-10-20 12:46:42 CDT [postgres@castaging]: LOG:  execute <unnamed>: SELECT SubMarketOverviewJSON($1::TEXT) AS submkt
2016-10-20 12:46:42 CDT [postgres@castaging]: DETAIL:  parameters: $1 = '640'
2016-10-20 12:46:44 CDT [postgres@castaging]: LOG:  statement: SELECT CallsByUser() AS byuser
2016-10-20 12:46:46 CDT [postgres@castaging]: LOG:  statement: SELECT CallCenterOverviewJSON() AS overview
2016-10-20 12:46:47 CDT [postgres@castaging]: LOG:  statement: SELECT CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:47 CDT [postgres@castaging]: ERROR:  "pg_description_o_c_o_index" is an index   <-- first sign of damage
2016-10-20 12:46:47 CDT [postgres@castaging]: CONTEXT:  SQL function "callcenterusertrackingjson" during startup
2016-10-20 12:46:47 CDT [postgres@castaging]: STATEMENT:  SELECT CallCenterUserTrackingJSON() AS tracking
2016-10-20 12:46:47 CDT [postgres@castaging]: WARNING:  leaking still-referenced relcache entry for "pg_class_oid_index"

CallCenterUserTrackingJSON() and friends are not particularly interesting except that they are making use of json_agg(). They were also called basically all day long at 5 second intervals. I guess this isn't saying very much, but I'm starting to smell a rat here.

merlin
On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > Most or all the damage seemed to be to the system catalogs with > at least two critical tables dropped or inaccessible in some > fashion. A lot of the OIDs seemed to be pointing at the wrong > thing. While the oid in pg_class often matches the filename, that is not true after some operations (like CLUSTER or VACUUM FULL). It is the relfilenode column that is the definitive link to the file. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
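[A small illustration of that distinction, safe to run on any database:]

    -- oid and relfilenode start out equal but diverge after rewrites;
    -- pg_relation_filepath() follows the actual on-disk file.
    SELECT oid, relfilenode, pg_relation_filepath(oid)
    FROM pg_class
    WHERE relname = 'pg_class';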
On Fri, Oct 21, 2016 at 8:03 AM, Kevin Grittner <kgrittn@gmail.com> wrote:
> While the oid in pg_class often matches the filename, that is not
> true after some operations (like CLUSTER or VACUUM FULL).  It is
> the relfilenode column that is the definitive link to the file.

no such operations happened. In the first instance at least one table was dropped from the system catalogs. I have a hunch that the heap is fine (supported by the size of the database on disk). At this precise moment I'm restoring the database to another fileserver in order to do some forensic analysis, also in the hopes of getting the second database online in order to expedite recovery.

ah -- done. :-) Deleting the init file didn't help, but starting up single user allowed the startup to gracefully fail with a FATAL cache lookup.

merlin
On Fri, Oct 21, 2016 at 1:37 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> no such operations happened. In the first instance at least one table
> was dropped from the system catalogs. I have a hunch that the heap is
> fine (supported by the size of the database on disk).

OK, I have some good (very- in the specific case of yours truly) news to report. Doing a filesystem-level copy to a test server, I was able to relfilenode-swap one of the critical tables over the place of the relfilenode of the stored backup. Not being able to know which file to copy from, I figured out the source node by judging the size and using the 'strings' utility. Data recovery for that table at least appears to be 100%.

For those following along, this simple process is only likely to work easily if the table contains only system types: no user types, enums, composites, etc., since those have a unique ID for each data restore.

merlin
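[A sketch of that identification step; the directory OID and relfilenode are placeholders, and this belongs on a copy of the data directory, never the live one:]

    # Candidate files in the damaged database's directory, largest first.
    cd /copy/of/data/base/<db_oid>          # <db_oid> is a placeholder
    ls -lS | head

    # Look for recognizable column values inside a candidate segment.
    strings <relfilenode> | grep -i 'some known value'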
Merlin Moncure wrote: > OK, I have some good (very- in the specific case of yours truly) news > to report. Doing a filesystem level copy to a test server I was able > to relfilenode swap one of the critical tables over the place of the > refilenode of the stored backup. Not being able know the file to copy > from, I figured out the source node by judging the size and using > 'strings' utility. Data recovery for that table at least appears to > be 100%. FWIW you can use pg_filedump and match based on the number of columns. I suppose you could also use the pageinspect extension, by 'dd'ing a page from the file into the database and feeding into heap_page_items as bytea. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
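[A sketch of the pageinspect variant, pulling block 0 straight off disk with pg_read_binary_file() instead of dd; superuser only, path relative to the data directory, and 8192 assumes the default block size:]

    CREATE EXTENSION IF NOT EXISTS pageinspect;

    -- Examine raw tuple headers in block 0 of a suspect file without
    -- relying on the (possibly damaged) catalog entry for it.
    SELECT lp, t_xmin, t_xmax, t_ctid
    FROM heap_page_items(pg_read_binary_file('base/203883/1259', 0, 8192));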
On 10/21/16 2:02 PM, Alvaro Herrera wrote: > Merlin Moncure wrote: > >> OK, I have some good (very- in the specific case of yours truly) news >> to report. Doing a filesystem level copy to a test server I was able >> to relfilenode swap one of the critical tables over the place of the >> refilenode of the stored backup. Not being able know the file to copy >> from, I figured out the source node by judging the size and using >> 'strings' utility. Data recovery for that table at least appears to >> be 100%. > > FWIW you can use pg_filedump and match based on the number of columns. > I suppose you could also use the pageinspect extension, by 'dd'ing a > page from the file into the database and feeding into heap_page_items as > bytea. It occurs to me that it might be worth embedding the relation name in the free space of the first block. Most people would never notice the missing 64 bytes, but it would be incredibly helpful in cases like this... -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
Jim Nasby wrote: > On 10/21/16 2:02 PM, Alvaro Herrera wrote: > > Merlin Moncure wrote: > > > > > OK, I have some good (very- in the specific case of yours truly) news > > > to report. Doing a filesystem level copy to a test server I was able > > > to relfilenode swap one of the critical tables over the place of the > > > refilenode of the stored backup. Not being able know the file to copy > > > from, I figured out the source node by judging the size and using > > > 'strings' utility. Data recovery for that table at least appears to > > > be 100%. > > > > FWIW you can use pg_filedump and match based on the number of columns. > > I suppose you could also use the pageinspect extension, by 'dd'ing a > > page from the file into the database and feeding into heap_page_items as > > bytea. > > It occurs to me that it might be worth embedding the relation name in the > free space of the first block. Most people would never notice the missing 64 > bytes, but it would be incredibly helpful in cases like this... Agreed. The problem is how to install it without breaking pg_upgrade. Note that for DR purposes it is better to put that data in the first block of *each segment*. Otherwise if you have many >1GB tables, you have to go segment by segment ... -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Jim Nasby wrote: >> It occurs to me that it might be worth embedding the relation name in the >> free space of the first block. Most people would never notice the missing 64 >> bytes, but it would be incredibly helpful in cases like this... > Agreed. The problem is how to install it without breaking pg_upgrade. Well, that's the first problem. The second problem is how to cope with RENAME TABLE. regards, tom lane
On 10/21/16 7:43 PM, Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> Jim Nasby wrote: >>> It occurs to me that it might be worth embedding the relation name in the >>> free space of the first block. Most people would never notice the missing 64 >>> bytes, but it would be incredibly helpful in cases like this... > >> Agreed. The problem is how to install it without breaking pg_upgrade. It can't look up relation names? > Well, that's the first problem. The second problem is how to cope with > RENAME TABLE. If the name was only encoded in the first block of the relation I'm not sure how bad this would be; are there any facilities to change the name back on a rollback? -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
Jim Nasby <Jim.Nasby@BlueTreble.com> writes: > On 10/21/16 7:43 PM, Tom Lane wrote: >> Alvaro Herrera <alvherre@2ndquadrant.com> writes: >>> Agreed. The problem is how to install it without breaking pg_upgrade. > It can't look up relation names? It can't shove 64 bytes into a page that has < 64 bytes free space. >> Well, that's the first problem. The second problem is how to cope with >> RENAME TABLE. > If the name was only encoded in the first block of the relation I'm not > sure how bad this would be; are there any facilities to change the name > back on a rollback? No. How would that work in crash situations? (And keep in mind that the argument for this hinges entirely on its being 100% trustworthy after a crash.) regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > Jim Nasby wrote: > >> It occurs to me that it might be worth embedding the relation name in the > >> free space of the first block. Most people would never notice the missing 64 > >> bytes, but it would be incredibly helpful in cases like this... > > > Agreed. The problem is how to install it without breaking pg_upgrade. > > Well, that's the first problem. The second problem is how to cope with > RENAME TABLE. Uh, sorry. My proposal a couple of years back was to put the relfilenode, not the name. I didn't notice that it was the name being proposed here. However, now I notice that this idea doesn't solve the problem for mapped relations. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Uh, sorry. My proposal a couple of years back was to put the > relfilenode, not the name. I didn't notice that it was the name being > proposed here. However, now I notice that this idea doesn't solve the > problem for mapped relations. Well, as long as a catalog lookup would be required anyway, what about putting in the table OID? regards, tom lane
On October 22, 2016 11:59:15 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Uh, sorry.  My proposal a couple of years back was to put the
>> relfilenode, not the name.  I didn't notice that it was the name being
>> proposed here.  However, now I notice that this idea doesn't solve the
>> problem for mapped relations.
>
> Well, as long as a catalog lookup would be required anyway, what about
> putting in the table OID?

How about storing two table names? The old one and what the relation is being renamed to? We wouldn't be able to tell after a crash which is which, but that'll usually still be helpful. Every rename would clear out the old/wrong one and add the target filename.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, Oct 20, 2016 at 1:52 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> ok, data corruption struck again. This time disk space is ruled out,
> and access to the database is completely denied:
>
> postgres=# \c castaging
> WARNING:  leaking still-referenced relcache entry for
> "pg_index_indexrelid_index"

Corruption struck again. This time got another case of view busted -- attempting to create gives missing 'type' error.

merlin
On Mon, Oct 24, 2016 at 6:01 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Corruption struck again.
> This time got another case of view busted -- attempting to create
> gives missing 'type' error.

Call it a hunch -- I think the problem is in pl/sh.

merlin
Merlin Moncure wrote: > On Mon, Oct 24, 2016 at 6:01 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > > Corruption struck again. > > This time got another case of view busted -- attempting to create > > gives missing 'type' error. > > Call it a hunch -- I think the problem is in pl/sh. I've heard that before. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Oct 24, 2016 at 9:18 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Merlin Moncure wrote:
>> Call it a hunch -- I think the problem is in pl/sh.
>
> I've heard that before.

well, yeah, previously I had an issue where the database crashed during a heavy concurrent pl/sh-based load. However, the problems went away when I refactored the code. Anyway, I looked at the code and couldn't see anything obviously wrong, so who knows? All I know is my production database is exploding continuously and I'm looking for answers. The only other extension in heavy use on this server is postgres_fdw. The other database on the cluster is fine, which kind of suggests we are not facing clog- or WAL-type problems.

After last night, I rebuilt the cluster, turning on checksums, turning on synchronous commit (it was off) and adding a standby replica. This should help narrow the problem down should it re-occur; if storage is bad (note, the other database on the same machine is doing 10x the write activity and is fine) or something is scribbling on shared memory (my guess here) then checksums should be popped, right?

merlin
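[For reference, the 9.5-era wiring for that setup looks roughly like this -- a sketch; the connection string values are illustrative placeholders:]

    # postgresql.conf on the primary (9.5 parameter names)
    wal_level = hot_standby
    max_wal_senders = 3
    synchronous_commit = on

    # recovery.conf on the replica (these moved into postgresql.conf in v12+)
    standby_mode = 'on'
    primary_conninfo = 'host=<primary> user=<replication_user>'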
Merlin Moncure wrote: > After last night, I rebuilt the cluster, turning on checksums, turning > on synchronous commit (it was off) and added a standby replica. This > should help narrow the problem down should it re-occur; if storage is > bad (note, other database on same machine is doing 10x write activity > and is fine) or something is scribbling on shared memory (my guess > here) then checksums should be popped, right? Not really sure about that. As I recall we compute the CRC on the buffer's way out, based on the then-current contents, so if something scribbles on the buffer while it's waiting to be evicted, the CRC computation would include the new (corrupted) bytes rather than the original ones -- see FlushBuffer. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Oct 25, 2016 at 12:57 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Not really sure about that.  As I recall we compute the CRC on the
> buffer's way out, based on the then-current contents, so if something
> scribbles on the buffer while it's waiting to be evicted, the CRC
> computation would include the new (corrupted) bytes rather than the
> original ones -- see FlushBuffer.

Huh. I have a new theory on this. Dealing with the reconstituted database, I'm finding more things -- functions and such -- that are simply gone and had to be rebuilt; they escaped notice as they were not in primary code paths. Recall that the original outage manifested as queries getting stuck, possibly on a spinlock (we don't know for sure). After that, things started to randomly disappear, possibly from the system catalogs (but I now need to go back and verify older data, I think). There were three autovac processes running.

What if the subsequent data loss was in fact a symptom of the first outage? Is it in theory possible for data to appear visible but then be eaten up as the transactions making the data visible get voided out by some other mechanism? I had to pull a quick restart the first time and everything looked ok -- or so I thought. What I think was actually happening is that data started to slip into the void. It's like random sys catalogs were dropping off. I bet other data was, too. I can pull older backups and verify that. It's as if some creeping xmin was snuffing everything out.

The confirmation of this should be obvious -- if that's indeed the case, the backed-up-and-restored cluster should no longer present data loss. Given that I was getting that every 1-2 days, we should be able to figure that out pretty soon.

merlin
Merlin Moncure <mmoncure@gmail.com> writes: > What if the subsequent dataloss was in fact a symptom of the first > outage? Is in theory possible for data to appear visible but then be > eaten up as the transactions making the data visible get voided out by > some other mechanic? I had to pull a quick restart the first time and > everything looked ok -- or so I thought. What I think was actually > happening is that data started to slip into the void. It's like > randomly sys catalogs were dropping off. I bet other data was, too. I > can pull older backups and verify that. It's as if some creeping xmin > was snuffing everything out. Might be interesting to look at age(xmin) in a few different system catalogs. I think you can ignore entries with age = 2147483647; those should be frozen rows. But if you see entries with very large ages that are not that, it'd be suspicious. regards, tom lane
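[A minimal sketch of that check against pg_class; the same query pattern applies to any catalog:]

    -- Non-frozen rows with implausibly large xmin age would be suspicious;
    -- age = 2147483647 indicates a frozen row and can be ignored.
    SELECT relname, xmin, age(xmin)
    FROM pg_class
    WHERE age(xmin) <> 2147483647
    ORDER BY age(xmin) DESC
    LIMIT 20;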
On Tue, Oct 25, 2016 at 2:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Might be interesting to look at age(xmin) in a few different system
> catalogs.  I think you can ignore entries with age = 2147483647;
> those should be frozen rows.  But if you see entries with very large
> ages that are not that, it'd be suspicious.

nothing really stands out. The damage did re-occur after a dump/restore -- not sure about a cluster-level rebuild. No problems previous to that. This suggests that, if this theory holds, the damage would have had to have been below the database level -- perhaps in clog. Maybe hint bits and clog did not agree as to commit or delete status, for example. clog has plenty of history leading past the problem barrier:

-rwx------ 1 postgres postgres 256K Jul 10 16:21 0000
-rwx------ 1 postgres postgres 256K Jul 21 12:39 0001
-rwx------ 1 postgres postgres 256K Jul 21 13:19 0002
-rwx------ 1 postgres postgres 256K Jul 21 13:59 0003
<snip>

Confirmation of problem re-occurrence will come in a few days. I'm much more likely to believe a 6+ sigma occurrence (storage, freak bug, etc.) should it prove that the problem goes away post-rebuild.

merlin
On 10/22/16 12:38 PM, Tom Lane wrote:
> Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
>> On 10/21/16 7:43 PM, Tom Lane wrote:
>>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>>> Agreed.  The problem is how to install it without breaking pg_upgrade.
>> It can't look up relation names?
> It can't shove 64 bytes into a page that has < 64 bytes free space.

Yeah, that would need to be a special case, but unless the page pointers are horked you'd be able to tell if there was a name in there.

>>> Well, that's the first problem.  The second problem is how to cope with
>>> RENAME TABLE.
>> If the name was only encoded in the first block of the relation I'm not
>> sure how bad this would be; are there any facilities to change the name
>> back on a rollback?
> No.  How would that work in crash situations?  (And keep in mind that the

Well, if FPIs were enabled you'd get the old page back. Even without that, recovery could remember rename records it's seen that didn't commit.

> argument for this hinges entirely on its being 100% trustworthy after a
> crash.)

I don't think this needs to be 100% reliable to be useful, especially if we went with Andres's suggestion to store both the old and new name (and storing the OID as well isn't a bad idea). If you're ever at the point of looking at this info you're already in deep trouble and should already be taking everything with a grain of salt anyway.

-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
On Tue, Oct 25, 2016 at 3:08 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Confirmation of problem re-occurrence will come in a few days. I'm
> much more likely to believe a 6+ sigma occurrence (storage, freak bug,
> etc.) should it prove that the problem goes away post-rebuild.

ok, no major reported outage yet, but just got:

2016-10-26 11:27:55 CDT [postgres@castaging]: ERROR:  invalid page in block 12 of relation base/203883/1259

merlin
On Wed, Oct 26, 2016 at 11:35 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> ok, no major reported outage yet, but just got:
>
> 2016-10-26 11:27:55 CDT [postgres@castaging]: ERROR:  invalid page in
> block 12 of relation base/203883/1259

*) still on 9.5.2 :( not in control of the package update process

*) getting more of these:

[root@rcdylsdbmpf001 data]# cat /var/lib/pgsql/9.5/data_checksum/pg_log/postgresql-26.log | grep "invalid page"
2016-10-26 11:26:42 CDT [postgres@castaging]: ERROR:  invalid page in block 10 of relation base/203883/1259
2016-10-26 11:27:55 CDT [postgres@castaging]: ERROR:  invalid page in block 12 of relation base/203883/1259
2016-10-26 12:16:44 CDT [postgres@castaging]: ERROR:  invalid page in block 13 of relation base/203883/1259
2016-10-26 12:18:58 CDT [postgres@castaging]: ERROR:  invalid page in block 15 of relation base/203883/1259
2016-10-26 12:19:12 CDT [postgres@castaging]: ERROR:  invalid page in block 7 of relation base/203883/2662

castaging=# select relname from pg_class where oid in (1259, 2662);
       relname
────────────────────
 pg_class_oid_index
 pg_class

*) only one database in the cluster is damaged. I do not suspect storage at this point

*) Currently I can execute pg_dump with no problems. So far impact is not severe, but soon I will have to do a dump/restore

*) age(xmin) from pg_class gives all reasonable values (approx 757k) that aren't 2147483647

*) All cases of this error came from statements executing in a pl/pgsql routine that does the following:

BEGIN
  ...
  <several postgres_fdw queries against the same 1-record, 1-column table>

  DROP TABLE IF EXISTS foo;
  CREATE TEMP TABLE foo ON COMMIT DROP AS ...;

  SELECT <sqsh function pushing sql script to sql server> ... INTO Result;

  DROP TABLE IF EXISTS bar;
  CREATE TEMP TABLE bar ON COMMIT DROP AS ...;

  PERFORM <sqsh function removing script>;
EXCEPTION WHEN OTHERS THEN
  PERFORM <sqsh function removing script>;
  RAISE;
END;

*) page verification misses are numerically close:

[root@rcdylsdbmpf001 ~]# cat /var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep "page verification"
2016-10-26 11:26:42 CDT [postgres@castaging]: WARNING:  page verification failed, calculated checksum 37251 but expected 37244
2016-10-26 11:27:55 CDT [postgres@castaging]: WARNING:  page verification failed, calculated checksum 37249 but expected 37244
2016-10-26 12:16:44 CDT [postgres@castaging]: WARNING:  page verification failed, calculated checksum 44363 but expected 44364
2016-10-26 12:18:58 CDT [postgres@castaging]: WARNING:  page verification failed, calculated checksum 49525 but expected 49539
2016-10-26 12:19:12 CDT [postgres@castaging]: WARNING:  page verification failed, calculated checksum 37345 but expected 37340

*) sample log leading into the first error (all queries are logged):

2016-10-26 11:26:40 CDT [postgres@castaging]: LOG:  execute <unnamed>: SELECT PushMarketSample($1::TEXT) AS published
2016-10-26 11:26:40 CDT [postgres@castaging]: DETAIL:  parameters: $1 = '8840'
2016-10-26 11:26:40 CDT [postgres@mpf2]: LOG:  statement: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
2016-10-26 11:26:40 CDT [postgres@mpf2]: LOG:  statement: SAVEPOINT s2
2016-10-26 11:26:40 CDT [postgres@mpf2]: LOG:  execute <unnamed>: DECLARE c1 CURSOR FOR SELECT period FROM public.dataentryperiod
2016-10-26 11:26:40 CDT [postgres@mpf2]: LOG:  statement: FETCH 100 FROM c1
2016-10-26 11:26:41 CDT [postgres@mpf2]: LOG:  statement: CLOSE c1
2016-10-26 11:26:41 CDT [postgres@mpf2]: LOG:  execute <unnamed>: DECLARE c2 CURSOR FOR SELECT period FROM public.dataentryperiod
2016-10-26 11:26:41 CDT [postgres@mpf2]: LOG:  statement: FETCH 100 FROM c2
2016-10-26 11:26:41 CDT [postgres@mpf2]: LOG:  statement: CLOSE c2
2016-10-26 11:26:41 CDT [rms@mpf2]: ERROR:  relation "tblagent" does not exist at character 15
2016-10-26 11:26:41 CDT [rms@mpf2]: STATEMENT:  select * from tblagent
2016-10-26 11:26:41 CDT [rms@mpf2]: LOG:  execute <unnamed>: SELECT CURRENT_SCHEMA()
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  execute <unnamed>: DECLARE c3 CURSOR FOR SELECT period FROM public.dataentryperiod
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  statement: FETCH 100 FROM c3
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  statement: CLOSE c3
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  execute <unnamed>: DECLARE c4 CURSOR FOR SELECT period FROM public.dataentryperiod
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  statement: FETCH 100 FROM c4
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  statement: CLOSE c4
2016-10-26 11:26:42 CDT [postgres@castaging]: WARNING:  page verification failed, calculated checksum 37251 but expected 37244
2016-10-26 11:26:42 CDT [postgres@castaging]: CONTEXT:  SQL statement "CREATE TEMP TABLE ApartmentPublishBySubmkt ON COMMIT DROP AS
            SELECT ApartmentId, TotalUnits, SubmarketId,
              SUM(Rent * UnitCount) / NULLIF(SUM(UnitCount), 0) AS Rent,
              SUM(Occupancy * UnitCount) / NULLIF(SUM(UnitCount), 0) AS Occupancy
            FROM AptSample
            GROUP BY 1, 2, 3"
        PL/pgSQL function pushmarketsample(text,date,integer) line 103 at SQL statement
2016-10-26 11:26:42 CDT [postgres@mpf2]: LOG:  statement: ROLLBACK TO SAVEPOINT s2; RELEASE SAVEPOINT s2
2016-10-26 11:26:42 CDT [postgres@castaging]: ERROR:  invalid page in block 10 of relation base/203883/1259
2016-10-26 11:26:42 CDT [postgres@castaging]: CONTEXT:  SQL statement "CREATE TEMP TABLE ApartmentPublishBySubmkt ON COMMIT DROP AS
            SELECT ApartmentId, TotalUnits, SubmarketId,
              SUM(Rent * UnitCount) / NULLIF(SUM(UnitCount), 0) AS Rent,

merlin
On Wed, Oct 26, 2016 at 12:43 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Oct 26, 2016 at 11:35 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Tue, Oct 25, 2016 at 3:08 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> Confirmation of problem re-occurrence will come in a few days. I'm
>>> much more likely to believe 6+sigma occurrence (storage, freak bug,
>>> etc) should it prove the problem goes away post rebuild.
>>
>> ok, no major reported outage yet, but just got:
>>
>> 2016-10-26 11:27:55 CDT [postgres@castaging]: ERROR: invalid page in
>> block 12 of relation base/203883/1259

*) I've now strongly correlated this routine with the damage.

[root@rcdylsdbmpf001 ~]# cat /var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep -i pushmarketsample | head -5
2016-10-26 11:26:27 CDT [postgres@castaging]: LOG: execute <unnamed>: SELECT PushMarketSample($1::TEXT) AS published
2016-10-26 11:26:40 CDT [postgres@castaging]: LOG: execute <unnamed>: SELECT PushMarketSample($1::TEXT) AS published
PL/pgSQL function pushmarketsample(text,date,integer) line 103 at SQL statement
PL/pgSQL function pushmarketsample(text,date,integer) line 103 at SQL statement
2016-10-26 11:26:42 CDT [postgres@castaging]: STATEMENT: SELECT PushMarketSample($1::TEXT) AS published

*) First invocation was 11:26:27 CDT
*) Second invocation was 11:26:40 and gave the checksum error (as noted earlier, logged at 11:26:42)
*) Routine attached (if interested)

My next step is to set up a test environment and jam this routine aggressively to see what happens.

merlin
Merlin Moncure <mmoncure@gmail.com> writes:
> *) I've now strongly correlated this routine with the damage.

Hmm. Do you have any way to replace the non-core calls with something else? The "shexec('rm -f ' || _OutputFile)" bits could presumably be converted to use contrib/adminpack's pg_file_unlink(), or an equivalent one-liner in plperlu or plpythonu. I don't know what the sqshf incantation does, though.

This wouldn't be about fixing it so much as narrowing down where the problem is.

regards, tom lane
On Wed, Oct 26, 2016 at 1:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> *) I've now strongly correlated this routine with the damage.
>
> Hmm. Do you have any way to replace the non-core calls with something
> else? The "shexec('rm -f ' || _OutputFile)" bits could presumably be
> converted to use contrib/adminpack's pg_file_unlink(), or an equivalent
> one-liner in plperlu or plpythonu. I don't know what the sqshf
> incantation does, though.
>
> This wouldn't be about fixing it so much as narrowing down where the
> problem is.

Will do. Before you spend a lot of time, let me get full confirmation that this is the source of the issue and I'll bake a reproduction script (there is still a possibility that something *else* caused it and by happenstance we caught it here).

For posterity, sqshf does:

CREATE OR REPLACE FUNCTION sqshf(
  ConnectionString TEXT,
  FileName TEXT) RETURNS TEXT AS
$$#!/bin/bash
cat \
  $2 \
  | eval "sqsh $1 -L'datetime=%Y-%m-%d %H:%M:%S.%u' -G 7.0"

echo "Success"
$$ LANGUAGE plsh;

sqsh being the freetds wrapping linux console to sql server. So basically not much. shexec does:

CREATE OR REPLACE FUNCTION shexec(_cmd TEXT) RETURNS TEXT AS
$$#!/bin/bash
eval $1
$$ LANGUAGE plsh;

:-)

merlin
On October 26, 2016 8:57:22 PM GMT+03:00, Merlin Moncure <mmoncure@gmail.com> wrote:
>On Wed, Oct 26, 2016 at 12:43 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Wed, Oct 26, 2016 at 11:35 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> On Tue, Oct 25, 2016 at 3:08 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>>> Confirmation of problem re-occurrence will come in a few days. I'm
>>>> much more likely to believe 6+sigma occurrence (storage, freak bug,
>>>> etc) should it prove the problem goes away post rebuild.
>>>
>>> ok, no major reported outage yet, but just got:
>>>
>>> 2016-10-26 11:27:55 CDT [postgres@castaging]: ERROR: invalid page in
>>> block 12 of relation base/203883/1259
>
>*) I've now strongly correlated this routine with the damage.
>
>[root@rcdylsdbmpf001 ~]# cat /var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep -i pushmarketsample | head -5
>2016-10-26 11:26:27 CDT [postgres@castaging]: LOG: execute <unnamed>: SELECT PushMarketSample($1::TEXT) AS published
>2016-10-26 11:26:40 CDT [postgres@castaging]: LOG: execute <unnamed>: SELECT PushMarketSample($1::TEXT) AS published
>PL/pgSQL function pushmarketsample(text,date,integer) line 103 at SQL statement
>PL/pgSQL function pushmarketsample(text,date,integer) line 103 at SQL statement
>2016-10-26 11:26:42 CDT [postgres@castaging]: STATEMENT: SELECT PushMarketSample($1::TEXT) AS published
>
>*) First invocation was 11:26:27 CDT
>*) Second invocation was 11:26:40 and gave the checksum error (as noted earlier, logged at 11:26:42)
>*) Routine attached (if interested)
>
>My next step is to set up a test environment and jam this routine aggressively to see what happens.

Any chance that plsh or the script it executes does anything with the file descriptors it inherits? That'd certainly be one way to get into odd corruption issues.

We really should use O_CLOEXEC for the majority of our file handles.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Oct 26, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
> Any chance that plsh or the script it executes does anything with the file descriptors it inherits? That'd certainly be one way to get into odd corruption issues.

not sure. it's pretty small -- see
https://github.com/petere/plsh/blob/master/plsh.c

merlin
On October 26, 2016 9:38:49 PM GMT+03:00, Merlin Moncure <mmoncure@gmail.com> wrote:
>On Wed, Oct 26, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
>> Any chance that plsh or the script it executes does anything with the
>> file descriptors it inherits? That'd certainly be one way to get into odd
>> corruption issues.
>
>not sure. it's pretty small -- see
>https://github.com/petere/plsh/blob/master/plsh.c

Afaics that could also be in your script, not just plsh. The latter doesn't seem to close all file handles above stderr, which means that all handles for relations etc. will be open in your script. If you e.g. do any unusual redirections (2>&17 or such), that could end badly. But I'm just on my phone, in a taxi without seatbelts, at 60mph, so I didn't look carefully.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Oct 26, 2016 at 1:45 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On October 26, 2016 9:38:49 PM GMT+03:00, Merlin Moncure <mmoncure@gmail.com> wrote:
>>On Wed, Oct 26, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
>>> Any chance that plsh or the script it executes does anything with the
>>> file descriptors it inherits? That'd certainly be one way to get into odd
>>> corruption issues.
>>
>>not sure. it's pretty small -- see
>>https://github.com/petere/plsh/blob/master/plsh.c
>
> Afaics that could also be in your script, not just plsh. The latter doesn't seem to close all file handles above stderr, which means that all handles for relations etc. will be open in your script. If you e.g. do any unusual redirections (2>&17 or such), that could end badly. But I'm just on my phone, in a taxi without seatbelts, at 60mph, so I didn't look carefully.

gotcha :-). see above:

*) sqshf:
#!/bin/bash
cat \
  $2 \
  | eval "sqsh $1 -L'datetime=%Y-%m-%d %H:%M:%S.%u' -G 7.0"

echo "Success"

*) shexec:
#!/bin/bash

eval $1

FWICT that's all that's happening here with respect to pl/sh. I'm almost done with the reproduction environment and if i get the issue I should be able to zero in on the problem. One final thing is that concurrency is a contributing factor.

merlin
On 2016-10-26 13:49:12 -0500, Merlin Moncure wrote:
> On Wed, Oct 26, 2016 at 1:45 PM, Andres Freund <andres@anarazel.de> wrote:
> > On October 26, 2016 9:38:49 PM GMT+03:00, Merlin Moncure <mmoncure@gmail.com> wrote:
> >>On Wed, Oct 26, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
> >>> Any chance that plsh or the script it executes does anything with the
> >>> file descriptors it inherits? That'd certainly be one way to get into odd
> >>> corruption issues.
> >>
> >>not sure. it's pretty small -- see
> >>https://github.com/petere/plsh/blob/master/plsh.c
> >
> > Afaics that could also be in your script, not just plsh. The latter doesn't seem to close all file handles above stderr, which means that all handles for relations etc. will be open in your script. If you e.g. do any unusual redirections (2>&17 or such), that could end badly. But I'm just on my phone, in a taxi without seatbelts, at 60mph, so I didn't look carefully.
>
> gotcha :-). see above:
>
> *) sqshf:
> #!/bin/bash
> cat \
>   $2 \
>   | eval "sqsh $1 -L'datetime=%Y-%m-%d %H:%M:%S.%u' -G 7.0"
>
> echo "Success"
>
> *) shexec:
> #!/bin/bash
>
> eval $1
>
> FWICT that's all that's happening here with respect to pl/sh.

My point is that that doesn't mean anything. Whatever sqsh is, or whatever $1 eval's to (and $2 for that matter), could access the filehandles the backend has opened.

- Andres
On Wed, Oct 26, 2016 at 2:12 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-10-26 13:49:12 -0500, Merlin Moncure wrote:
>> On Wed, Oct 26, 2016 at 1:45 PM, Andres Freund <andres@anarazel.de> wrote:
>> > On October 26, 2016 9:38:49 PM GMT+03:00, Merlin Moncure <mmoncure@gmail.com> wrote:
>> >>On Wed, Oct 26, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:
>> >>> Any chance that plsh or the script it executes does anything with the
>> >>> file descriptors it inherits? That'd certainly be one way to get into odd
>> >>> corruption issues.
>> >>
>> >>not sure. it's pretty small -- see
>> >>https://github.com/petere/plsh/blob/master/plsh.c
>> >
>> > Afaics that could also be in your script, not just plsh. The latter doesn't seem to close all file handles above stderr, which means that all handles for relations etc. will be open in your script. If you e.g. do any unusual redirections (2>&17 or such), that could end badly. But I'm just on my phone, in a taxi without seatbelts, at 60mph, so I didn't look carefully.
>>
>> gotcha :-). see above:
>> *) sqshf:
>> #!/bin/bash
>> cat \
>>   $2 \
>>   | eval "sqsh $1 -L'datetime=%Y-%m-%d %H:%M:%S.%u' -G 7.0"
>>
>> echo "Success"
>>
>> *) shexec:
>> #!/bin/bash
>>
>> eval $1
>>
>> FWICT that's all that's happening here with respect to pl/sh.
>
> My point is that that doesn't mean anything. Whatever sqsh is, or
> whatever $1 eval's to (and $2 for that matter), could access the
> filehandles the backend has opened.

I don't think that's the case. sqsh is a psql-like utility. it writes to stdout and stderr only, which is captured by plsh and sent. In this context shexec only wraps rm -f 'file' where 'file' is a file previously created with COPY in the same transaction.

In a test environment, spamming this routine with 16 threads I was not able to reproduce. Still hunting...

merlin
On 10/26/16 2:25 PM, Merlin Moncure wrote: > I don't think that's the case. sqsh is a psql-like utility. it > writes to stdout and stderr only which is captured by plsh and sent. > In this context shexec only wraps rm -f 'file' where 'file' is a file > previously created with COPY in the same transaction. How do you know that? It could potentially be doing anything it wanted with file handles. Removing the exec might "solve" the problem here, assuming that the forked process doesn't still inherit all open FH's. In a nutshell, running arbitrary shell code (let alone arbitrary executables) in a pl that's running inside the Postgres backend sounds insanely risky to me. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On Wed, Oct 26, 2016 at 3:06 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 10/26/16 2:25 PM, Merlin Moncure wrote:
>>
>> I don't think that's the case. sqsh is a psql-like utility. it
>> writes to stdout and stderr only, which is captured by plsh and sent.
>> In this context shexec only wraps rm -f 'file' where 'file' is a file
>> previously created with COPY in the same transaction.
>
> How do you know that? It could potentially be doing anything it wanted with
> file handles. Removing the exec might "solve" the problem here, assuming
> that the forked process doesn't still inherit all open FH's.
>
> In a nutshell, running arbitrary shell code (let alone arbitrary
> executables) in a pl that's running inside the Postgres backend sounds
> insanely risky to me.

Well, it is insanely risky, but the code in this context is not arbitrary; it's only:

PERFORM shexec('rm -f ' || _OutputFile);

merlin
On 2016-10-26 15:06:34 -0500, Jim Nasby wrote: > Removing the exec might "solve" the problem here, assuming that the > forked process doesn't still inherit all open FH's. Unless you explicitly close fds or use FD_CLOEXEC when opening fds they'll be inherited forever.
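To make the close-on-exec mechanics concrete, here is a minimal standalone C sketch (an illustration only, not PostgreSQL or plsh source) of the two idioms in play; without one of them, every descriptor the backend has open is inherited by the exec'd shell:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Idiom 1: set close-on-exec atomically at open time. */
    int fd1 = open("/tmp/cloexec_demo1", O_RDWR | O_CREAT | O_CLOEXEC, 0600);

    /* Idiom 2: set it after the fact with fcntl(); works on systems that
     * predate O_CLOEXEC, but leaves a race window in threaded programs
     * that fork between the open() and the fcntl(). */
    int fd2 = open("/tmp/cloexec_demo2", O_RDWR | O_CREAT, 0600);
    if (fd2 >= 0)
    {
        int flags = fcntl(fd2, F_GETFD);
        if (flags == -1 || fcntl(fd2, F_SETFD, flags | FD_CLOEXEC) == -1)
            perror("fcntl");
    }

    /* Neither descriptor survives the exec inside system(); without the
     * flag, both would be open -- and writable -- in the child shell. */
    if (fd1 >= 0 && fd2 >= 0)
        system("ls -l /proc/self/fd");  /* on Linux: fd1/fd2 are absent */
    return 0;
}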
On Wed, Oct 26, 2016 at 8:43 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> /var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep "page verification"
> 2016-10-26 11:26:42 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 37251 but expected 37244
> 2016-10-26 11:27:55 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 37249 but expected 37244
> 2016-10-26 12:16:44 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 44363 but expected 44364
> 2016-10-26 12:18:58 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 49525 but expected 49539
> 2016-10-26 12:19:12 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 37345 but expected 37340

The checksum values are improbably close. The checksum algorithm has decently good mixing of all bits in the page. Having the first byte match in 5 checksums makes this 1:2^40 improbable. What is not mixed in properly is the block number, which only gets xor'ed in before packing the value into 16 bits using modulo 0xFFFF. So I'm pretty sure different block numbers were used for writing out and reading in the page. Either the block number gets corrupted between calculating the checksum and writing the page out (unlikely given the proximity), or the pages are somehow getting transposed in the storage.

Regards,
Ants Aasma
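To make Ants' observation concrete, here is a C paraphrase of the checksum's final step as I read PostgreSQL's checksum implementation (a sketch, not a verbatim copy of the source):

#include <stdint.h>
#include <stdio.h>

/* The 32-bit FNV-style hash of the page contents mixes well, but the
 * block number is only XORed in at the end, just before the reduction
 * to 16 bits. */
static uint16_t
finish_page_checksum(uint32_t page_hash, uint32_t blkno)
{
    uint32_t checksum = page_hash ^ blkno;      /* detect transposed pages */
    return (uint16_t) ((checksum % 65535) + 1); /* +1 avoids a zero checksum */
}

int main(void)
{
    uint32_t h = 0xDEADBEEF;  /* stand-in for the hash of some page's contents */

    /* The same contents checksummed as block 10 on write but re-checksummed
     * as block 12 on read differ only via 10 ^ 12 in the low bits: */
    printf("as block 10: %u\n", finish_page_checksum(h, 10));  /* 40340 */
    printf("as block 12: %u\n", finish_page_checksum(h, 12));  /* 40338 */
    return 0;
}

That is exactly the pattern in the logged pairs above (37251 vs 37244, 44363 vs 44364, and so on).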
On Thu, Oct 27, 2016 at 2:31 AM, Ants Aasma <ants.aasma@eesti.ee> wrote:
> On Wed, Oct 26, 2016 at 8:43 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> /var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep "page verification"
>> 2016-10-26 11:26:42 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 37251 but expected 37244
>> 2016-10-26 11:27:55 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 37249 but expected 37244
>> 2016-10-26 12:16:44 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 44363 but expected 44364
>> 2016-10-26 12:18:58 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 49525 but expected 49539
>> 2016-10-26 12:19:12 CDT [postgres@castaging]: WARNING: page verification failed, calculated checksum 37345 but expected 37340
>
> The checksum values are improbably close. The checksum algorithm has
> decently good mixing of all bits in the page. Having the first byte
> match in 5 checksums makes this 1:2^40 improbable. What is not mixed
> in properly is the block number, which only gets xor'ed in before packing
> the value into 16 bits using modulo 0xFFFF. So I'm pretty sure
> different block numbers were used for writing out and reading in the
> page. Either the block number gets corrupted between calculating the
> checksum and writing the page out (unlikely given the proximity), or
> the pages are somehow getting transposed in the storage.

I think we can rule out faulty storage, and I'm reasonably sure nothing is writing to the database cluster except postgres itself. So far I have not executed an emergency dump/restore on this database, nor have I received any other log entries suggesting problems of any kind. Meaning, FWICT, the database is completely healthy notwithstanding the reported errors. This suggests (but does not prove) that the system-catalog-damaging activity happened during the execution of the attached function and healed itself during the rollback. Prior to turning on checksums, the transaction did *not* roll back, and the damage was allowed to stand and probably accumulated.

As noted earlier, I was not able to reproduce the issue with crashme.sh, which was:

NUM_FORKS=16
do_parallel psql -p 5432 -c"select PushMarketSample('1740')" castaging_test
do_parallel psql -p 5432 -c"select PushMarketSample('4400')" castaging_test
do_parallel psql -p 5432 -c"select PushMarketSample('2160')" castaging_test
do_parallel psql -p 5432 -c"select PushMarketSample('6680')" castaging_test
<snip>

(do_parallel is a simple wrapper that executes the command in parallel up to NUM_FORKS). This is on the same server and cluster as above. This kind of suggests that either
A) there is some concurrent activity from another process that is tripping the issue, or
B) there is something particular to the session invoking the function that is participating in the problem.

As the application is structured, a single-threaded node.js app is issuing the query that is high traffic and long lived. It's still running in fact, and I'm kind of tempted to find some downtime to see if I can still reproduce via the UI.

merlin
On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> I think we can rule out faulty storage

Nobody ever expects the faulty storage

-- 
greg
On Thu, Oct 27, 2016 at 6:39 PM, Greg Stark <stark@mit.edu> wrote:
> On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> I think we can rule out faulty storage
>
> Nobody ever expects the faulty storage

Believe me, I know. But the evidence points elsewhere in this case; this is clearly application driven.

merlin
On 10/28/16 8:23 AM, Merlin Moncure wrote: > On Thu, Oct 27, 2016 at 6:39 PM, Greg Stark <stark@mit.edu> wrote: >> On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >>> I think we can rule out faulty storage >> >> Nobody ever expects the faulty storage LOL > Believe me, I know. But the evidence points elsewhere in this case; > this is clearly application driven. FWIW, just because it's triggered by specific application behavior doesn't mean there isn't a storage bug. That's what makes data corruption bugs such a joy to figure out. BTW, if you haven't already, I would reset all your storage related options and GUCs to safe defaults... plain old FSYNC, no cute journal / FS / mount options, etc. Maybe this is related to the app, but the most helpful thing right now is to find some kind of safe config so you can start bisecting. I would also consider alternatives to plsh, just to rule it out if nothing else. I'd certainly look at some way to get sqsh out of the loop (again, just to get something that doesn't crash). First idea that comes to mind is a stand-alone shell script that watches a named pipe for a filename; when it gets that file it runs it with sqsh and does something to signal completion. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461
On Fri, Oct 28, 2016 at 3:16 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 10/28/16 8:23 AM, Merlin Moncure wrote:
>> On Thu, Oct 27, 2016 at 6:39 PM, Greg Stark <stark@mit.edu> wrote:
>>> On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>>> I think we can rule out faulty storage
>>>
>>> Nobody ever expects the faulty storage
>
> LOL
>
>> Believe me, I know. But the evidence points elsewhere in this case;
>> this is clearly application driven.
>
> FWIW, just because it's triggered by specific application behavior doesn't
> mean there isn't a storage bug. That's what makes data corruption bugs such
> a joy to figure out.
>
> BTW, if you haven't already, I would reset all your storage related options
> and GUCs to safe defaults... plain old FSYNC, no cute journal / FS / mount
> options, etc. Maybe this is related to the app, but the most helpful thing
> right now is to find some kind of safe config so you can start bisecting.

Upthread, you might have noticed that I already did that. Here is the other evidence:

*) server running fine for 5+ years
*) other database on same cluster not impacted, with 10x the write activity
*) no interesting logs reported in /var/log/messages, dmesg, etc
*) san fabric turns over petabytes/day with no corruption. 100+ postgres clusters, 1000+ sql server clusters (and that's not production)
*) storage/network teams have been through everything. nothing interesting/unusual to report
*) we have an infrequently run routine (posted upthread) that, when run, crashed the database within minutes
*) after turning on checksums, 30% of invocations of the routine resulted in a checksum error
*) problem re-occurred after dump/restore and full cluster rebuild
*) checksum error caused routine rollback. FWICT this prevented the damage
*) everything is fine now that the routine is not being run anymore

You can come up with your conclusion; I've come up with mine. The only frustrating thing here is that I can't reproduce outside of the production environment. If this database goes down I have 30 people sitting around, so I can't take downtime lightly.

> I would also consider alternatives to plsh, just to rule it out if nothing
> else. I'd certainly look at some way to get sqsh out of the loop (again,
> just to get something that doesn't crash). First idea that comes to mind is
> a stand-alone shell script that watches a named pipe for a filename; when it
> gets that file it runs it with sqsh and does something to signal completion.

I do a lot of etl to/from sql server and it's all sqsh based. If I can figure out how to reproduce in a better way, I'll zero in on the problem in about 10 minutes.

merlin
27.10.2016, 21:53, Merlin Moncure kirjoitti: > As noted earlier, I was not able to reproduce the issue with > crashme.sh, which was: > > NUM_FORKS=16 > do_parallel psql -p 5432 -c"select PushMarketSample('1740')" castaging_test > do_parallel psql -p 5432 -c"select PushMarketSample('4400')" castaging_test > do_parallel psql -p 5432 -c"select PushMarketSample('2160')" castaging_test > do_parallel psql -p 5432 -c"select PushMarketSample('6680')" castaging_test > <snip> > > (do_parallel is simple wrapper to executing the command in parallel up > to NUM_FORKS). This is on the same server and cluster as above. > This kind of suggests that either > A) there is some concurrent activity from another process that is > tripping the issue > or > B) there is something particular to the session invoking the function > that is participating in the problem. As the application is > structured, a single threaded node.js app is issuing the query that is > high traffic and long lived. It's still running in fact and I'm kind > of tempted to find some downtime to see if I can still reproduce via > the UI. Your production system's postgres backends probably have a lot more open files associated with them than the simple test case does. Since Postgres likes to keep files open as long as possible and only closes them when you need to free up fds to open new files, it's possible that your production backends have almost all allowed fds used when you execute your pl/sh function. If that's the case, the sqsh process that's executed may not have enough fds to do what it wanted to do and because of busted error handling could end up writing to fds that were opened by Postgres and point to $PGDATA files. / Oskari
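A hypothetical illustration in C of the failure mode Oskari describes (invented for illustration; this is not sqsh source):

#include <string.h>
#include <unistd.h>

/* Pretend this is the exec'd child. If the backend that launched us has
 * a $PGDATA heap file open on descriptor 42, then anything that blindly
 * writes to that descriptor number -- a hardcoded fd, a "2>&42"-style
 * redirect, or an error path that reuses a stale fd variable after a
 * failed open() -- scribbles directly into the data file, bypassing the
 * buffer manager entirely. */
int main(void)
{
    static const char junk[] = "oops\n";

    if (write(42, junk, strlen(junk)) < 0)
        return 1;  /* fd 42 wasn't open after all; no harm done */
    return 0;
}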
On Mon, Oct 31, 2016 at 10:32 AM, Oskari Saarenmaa <os@ohmu.fi> wrote:
> 27.10.2016, 21:53, Merlin Moncure kirjoitti:
>> As noted earlier, I was not able to reproduce the issue with
>> crashme.sh, which was:
>>
>> NUM_FORKS=16
>> do_parallel psql -p 5432 -c"select PushMarketSample('1740')" castaging_test
>> do_parallel psql -p 5432 -c"select PushMarketSample('4400')" castaging_test
>> do_parallel psql -p 5432 -c"select PushMarketSample('2160')" castaging_test
>> do_parallel psql -p 5432 -c"select PushMarketSample('6680')" castaging_test
>> <snip>
>>
>> (do_parallel is a simple wrapper that executes the command in parallel up
>> to NUM_FORKS). This is on the same server and cluster as above.
>> This kind of suggests that either
>> A) there is some concurrent activity from another process that is
>> tripping the issue, or
>> B) there is something particular to the session invoking the function
>> that is participating in the problem. As the application is
>> structured, a single-threaded node.js app is issuing the query that is
>> high traffic and long lived. It's still running in fact, and I'm kind
>> of tempted to find some downtime to see if I can still reproduce via
>> the UI.
>
> Your production system's postgres backends probably have a lot more open
> files associated with them than the simple test case does. Since Postgres
> likes to keep files open as long as possible and only closes them when you
> need to free up fds to open new files, it's possible that your production
> backends have almost all allowed fds used when you execute your pl/sh
> function.
>
> If that's the case, the sqsh process that's executed may not have enough fds
> to do what it wanted to do and because of busted error handling could end up
> writing to fds that were opened by Postgres and point to $PGDATA files.

Does that apply? the mechanics are a sqsh function that basically does:

cat foo.sql | sqsh <args>

pipe redirection opens a new process, right?

merlin
Merlin Moncure <mmoncure@gmail.com> writes: > On Mon, Oct 31, 2016 at 10:32 AM, Oskari Saarenmaa <os@ohmu.fi> wrote: >> Your production system's postgres backends probably have a lot more open >> files associated with them than the simple test case does. Since Postgres >> likes to keep files open as long as possible and only closes them when you >> need to free up fds to open new files, it's possible that your production >> backends have almost all allowed fds used when you execute your pl/sh >> function. >> >> If that's the case, the sqsh process that's executed may not have enough fds >> to do what it wanted to do and because of busted error handling could end up >> writing to fds that were opened by Postgres and point to $PGDATA files. > Does that apply? the mechanics are a sqsh function that basically does: > cat foo.sql | sqsh <args> > pipe redirection opens a new process, right? Yeah, but I doubt that either level of the shell would attempt to close inherited file handles. The real problem with Oskari's theory is that it requires not merely busted, but positively brain-dead error handling in the shell and/or sqsh, ie ignoring open() failures altogether. That seems kind of unlikely. Still, I suspect he might be onto something --- there must be some reason you can reproduce the issue in production and not in your test bed, and number-of-open-files is as good a theory as I've heard. Maybe the issue is not with open() failures, but with the resulting FD numbers being much larger than sqsh is expecting. It would be weird to try to store an FD in something narrower than int, but I could see a use of select() being unprepared for large FDs. Still, it's hard to translate that idea into scribbling on the wrong file... regards, tom lane
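For reference, a small C sketch of the select() hazard Tom alludes to (illustration only; whether sqsh actually trips on this is unverified):

#include <stdio.h>
#include <sys/select.h>

int main(void)
{
    fd_set readfds;
    int fd = 1500;  /* plausible in an fd-hungry backend's child */

    FD_ZERO(&readfds);
    /* fd_set is a fixed-size bitmap of FD_SETSIZE bits (typically 1024);
     * calling FD_SET on a larger descriptor is undefined behavior and
     * writes past the end of the structure. */
    if (fd >= FD_SETSIZE)
        fprintf(stderr, "fd %d >= FD_SETSIZE (%d); cannot use select()\n",
                fd, FD_SETSIZE);
    else
        FD_SET(fd, &readfds);
    return 0;
}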
On 2016-11-01 09:56:45 -0400, Tom Lane wrote:
> The real problem with Oskari's theory is that it requires not merely
> busted, but positively brain-dead error handling in the shell and/or
> sqsh, ie ignoring open() failures altogether. That seems kind of
> unlikely. Still, I suspect he might be onto something --- there must
> be some reason you can reproduce the issue in production and not in
> your test bed, and number-of-open-files is as good a theory as I've
> heard.

I've seen shell code akin to

exec 16> somefile # assume fd 16 is unused

more than once :(
On Tue, Nov 1, 2016 at 8:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Merlin Moncure <mmoncure@gmail.com> writes: >> On Mon, Oct 31, 2016 at 10:32 AM, Oskari Saarenmaa <os@ohmu.fi> wrote: >>> Your production system's postgres backends probably have a lot more open >>> files associated with them than the simple test case does. Since Postgres >>> likes to keep files open as long as possible and only closes them when you >>> need to free up fds to open new files, it's possible that your production >>> backends have almost all allowed fds used when you execute your pl/sh >>> function. >>> >>> If that's the case, the sqsh process that's executed may not have enough fds >>> to do what it wanted to do and because of busted error handling could end up >>> writing to fds that were opened by Postgres and point to $PGDATA files. > >> Does that apply? the mechanics are a sqsh function that basically does: >> cat foo.sql | sqsh <args> >> pipe redirection opens a new process, right? > > Yeah, but I doubt that either level of the shell would attempt to close > inherited file handles. > > The real problem with Oskari's theory is that it requires not merely > busted, but positively brain-dead error handling in the shell and/or > sqsh, ie ignoring open() failures altogether. That seems kind of > unlikely. Still, I suspect he might be onto something --- there must > be some reason you can reproduce the issue in production and not in > your test bed, and number-of-open-files is as good a theory as I've > heard. > > Maybe the issue is not with open() failures, but with the resulting > FD numbers being much larger than sqsh is expecting. It would be > weird to try to store an FD in something narrower than int, but > I could see a use of select() being unprepared for large FDs. > Still, it's hard to translate that idea into scribbling on the > wrong file... Looking at the sqsh code, nothing really stands out. It's highly developed and all obvious errors are checked. There certainly could be a freak bug in there (or in libfreetds which sqsh links to) doing the damage though. In the meantime I'll continue to try and work a reliable reproduction. This particular routine only gets called in batches on a quarterly basis so things have settled down. Just a thought; could COPY be tricked into writing into the wrong file descriptor? For example, if a file was killed with a rm -rf and the fd pressured backend reopened the fd immediately? merlin
26.10.2016, 21:34, Andres Freund kirjoitti:
> Any chance that plsh or the script it executes does anything with the file descriptors it inherits? That'd certainly be one way to get into odd corruption issues.
>
> We really should use O_CLOEXEC for the majority of our file handles.

Attached a patch to always use O_CLOEXEC in BasicOpenFile if we're not using EXEC_BACKEND. It'd be nice to not expose all fds to most pl-languages either, but I guess there's no easy solution to that without forcibly closing all fds whenever any functions are called.

/ Oskari
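Since the patch itself travels as an attachment, here is a rough C paraphrase of the change as described, assuming the 9.5-era three-argument BasicOpenFile signature (the actual patch may differ):

#include <fcntl.h>

typedef const char *FileName;  /* stand-in for the real typedef in fd.h */

/* ORing O_CLOEXEC into every open means no descriptor acquired this way
 * survives an execve() in a child process. EXEC_BACKEND builds are the
 * exception, since there backends are launched via exec and must inherit
 * the postmaster's descriptors. */
int
BasicOpenFile(FileName fileName, int fileFlags, int fileMode)
{
#if defined(O_CLOEXEC) && !defined(EXEC_BACKEND)
    fileFlags |= O_CLOEXEC;
#endif

    /* The real function also retries after releasing an LRU-cached fd
     * when open() fails with EMFILE/ENFILE; that logic is elided here. */
    return open(fileName, fileFlags, fileMode);
}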
On Wed, Nov 2, 2016 at 10:45 AM, Oskari Saarenmaa <os@ohmu.fi> wrote:
> 26.10.2016, 21:34, Andres Freund kirjoitti:
>> Any chance that plsh or the script it executes does anything with the file
>> descriptors it inherits? That'd certainly be one way to get into odd corruption
>> issues.
>>
>> We really should use O_CLOEXEC for the majority of our file handles.
>
> Attached a patch to always use O_CLOEXEC in BasicOpenFile if we're not using
> EXEC_BACKEND. It'd be nice to not expose all fds to most pl-languages
> either, but I guess there's no easy solution to that without forcibly
> closing all fds whenever any functions are called.

FYI, this is not my first run-in with strange behavior on this thread (not necessarily worth reading):
https://www.postgresql.org/message-id/CAHyXU0x5mW-SbSuUBEshzumOaN7JPUWa7Ejza68HE-KY0Nq7Kg%40mail.gmail.com

I had a similar set of starting conditions that resulted in very strange behavior (but not data corruption AFAICT) -- the problem mysteriously disappeared when I fixed some bugs that would cause the routine to concurrently do the same operation. I would like to point out that I use both pl/sh (and via it, sqsh) very heavily, so these problems are not necessarily the norm.

Regardless, it seems like you might be on to something, and I'm inclined to patch your change, test it, and roll it out to production. If it helps or at least narrows the problem down, we ought to give it consideration for inclusion (unless someone else can think of a good reason not to do that, heh!).

merlin
On 11/7/16 5:31 PM, Merlin Moncure wrote: > Regardless, it seems like you might be on to something, and I'm > inclined to patch your change, test it, and roll it out to production. > If it helps or at least narrows the problem down, we ought to give it > consideration for inclusion (unless someone else can think of a good > reason not to do that, heh!). Any results yet? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/2/16 11:45 AM, Oskari Saarenmaa wrote:
> 26.10.2016, 21:34, Andres Freund kirjoitti:
>> Any chance that plsh or the script it executes does anything with the file descriptors it inherits? That'd certainly be one way to get into odd corruption issues.
>>
>> We really should use O_CLOEXEC for the majority of our file handles.
>
> Attached a patch to always use O_CLOEXEC in BasicOpenFile if we're not
> using EXEC_BACKEND. It'd be nice to not expose all fds to most
> pl-languages either, but I guess there's no easy solution to that
> without forcibly closing all fds whenever any functions are called.

It seems like everyone was generally in favor of this. I looked around the internet for caveats but everyone was basically saying, you should definitely do this.

Why not for EXEC_BACKEND?

O_CLOEXEC is a newer interface. There are older systems that don't have it but have FD_CLOEXEC for fcntl(). We should use that as a fallback.

Have you gone through the code and checked for other ways file descriptors might get opened? Here is a blog post that lists some candidates: http://udrepper.livejournal.com/20407.html

Ideally, we would have a test case that exec's something that lists the open file descriptors, and we check that there are only those we expect.

The comment "We don't expect execve() calls inside the postgres code" is not quite correct, as we do things like archive_command and COPY to program (see OpenPipeStream()).

-- 
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 3, 2017 at 1:05 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> On 11/7/16 5:31 PM, Merlin Moncure wrote:
>> Regardless, it seems like you might be on to something, and I'm
>> inclined to patch your change, test it, and roll it out to production.
>> If it helps or at least narrows the problem down, we ought to give it
>> consideration for inclusion (unless someone else can think of a good
>> reason not to do that, heh!).
>
> Any results yet?

Not yet unfortunately. I compiled the server with the change, but was not able to get $libdir working so that I could just do a binary swap over my pgdg compiled package. If anyone has some pointers on how to do that, I'd appreciate it.

Still getting checksum failures. Over the last 30 days, I see the following. Since enabling checksums, FWICT none of the damage is permanent, and it rolls back with the transaction. So creepy!

[root@rcdylsdbmpf001 pg_log]# cat *.log | grep "page verification failed"
2016-12-05 10:17:48 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 61797 but expected 61798
2016-12-05 11:15:31 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 37750 but expected 37749
2016-12-05 11:15:58 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 44483 but expected 44482
2016-12-05 11:16:33 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 58926 but expected 58925
2016-12-05 11:17:08 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 38527 but expected 38528
2016-12-05 11:18:34 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 61932 but expected 61933
2016-12-05 11:18:55 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 23757 but expected 23758
2016-12-05 12:13:48 CST [rms@mpf2]: WARNING: page verification failed, calculated checksum 44192 but expected 44225 at character 417
2016-12-08 14:18:37 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 36083 but expected 36082
2016-12-08 15:52:31 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 63414 but expected 63415 at character 1096
2016-12-09 09:12:21 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 25781 but expected 25780
2016-12-09 09:13:20 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 63043 but expected 63044 at character 4230
2016-12-12 08:57:45 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 31775 but expected 31771
2016-12-13 09:47:11 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 40802 but expected 40806
2016-12-15 12:49:04 CST [rms@mpf2]: WARNING: page verification failed, calculated checksum 11625 but expected 11592 at character 417
2016-12-15 12:51:08 CST [rms@mpf2]: WARNING: page verification failed, calculated checksum 51017 but expected 51018
2016-12-15 12:52:36 CST [rms@mpf2]: WARNING: page verification failed, calculated checksum 51017 but expected 51018 at character 417
2016-12-16 12:16:31 CST [rms@mpf2]: WARNING: page verification failed, calculated checksum 23580 but expected 23576
2016-12-20 13:59:33 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 45273 but expected 45285
2016-12-20 14:00:22 CST [postgres@castaging]: WARNING: page verification failed, calculated checksum 10524 but expected 10525

Note the second database, 'mpf2'. This is a new development. An example of a query that is jacking things up is this:

2016-12-15 12:51:08 CST [rms@mpf2]: WARNING: page verification failed, calculated checksum 51017 but expected 51018
2016-12-15 12:51:08 CST [rms@mpf2]: CONTEXT: SQL statement "
  COPY (
    SELECT 'DELETE FROM tblNAptCommonSample WHERE ReportPeriod = 201612;'
    UNION ALL
    SELECT format(
      'INSERT INTO tblNAptCommonSample('
        'ReportPeriod, Period, AdjustmentType, PlanType, MSA, MSASubMkt, '
        'Sample, Occupancy, OccupancyChange, AverageRent, AverageRentChange, RentSF, '
        'RentSFChange)'
      'VALUES('
        '%s, %s, ''%s'', ''%s'', ''%s'', %s,'
        '%s, %s, %s, %s, %s, %s,'
        '%s)',
      ReportPeriod, Period, AdjustmentType, PlanType, MSA, MSASubMkt,
      c(Sample), c(Occupancy), c(OccupancyChange), c(AverageRent), c(AverageRentChange), c(RentSF), c(RentSFChange))
    FROM tblNAptCommonSample
    WHERE Period = 201612 AND MSA != '5610'
    UNION ALL
    SELECT 'go'
  ) TO '/tmp/tblnaptcommonsample.sql';
  "
  PL/pgSQL function writempf1history(integer) line 75 at EXECUTE

or this:

2016-12-15 12:52:36 CST [rms@mpf2]: WARNING: page verification failed, calculated checksum 51017 but expected 51018 at character 417
2016-12-15 12:52:36 CST [rms@mpf2]: QUERY:
  COPY (
    SELECT 'DELETE FROM tbltwrexistingunits WHERE ReportPeriod = 201612;'
    UNION ALL
    SELECT format(
      'INSERT INTO tbltwrexistingunits('
        'ReportPeriod, market, submarketnum, yr_qtr, cmpltns, deletions, '
        'existing, unadjexisting)'
      'VALUES('
        '%s, ''%s'', %s, ''%s'', %s, %s,'
        '%s, %s)',
      ReportPeriod, market, submarket, yr_qtr, c(cmpltns), c(deletions), c(existing), c(unadjexisting))
    FROM tbltwrexistingunits
    WHERE ReportPeriod = 201612 AND market != '5610'
    UNION ALL
    SELECT 'go'
  ) TO '/tmp/tbltwrexistingunits.sql';
2016-12-15 12:52:36 CST [rms@mpf2]: CONTEXT: PL/pgSQL function writempf1history(integer) line 109 at EXECUTE

This is another (much simpler) routine that:
1. writes out data to a scratch file with COPY
2. uses pl/sh to pipe it to sqsh

merlin
On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > Still getting checksum failures. Over the last 30 days, I see the > following. Since enabling checksums FWICT none of the damage is > permanent and rolls back with the transaction. So creepy! The checksums still only differ in least significant digits which pretty much means that there is a block number mismatch. So if you rule out filesystem not doing its job correctly and transposing blocks, it could be something else that is resulting in blocks getting read from a location that happens to differ by a small multiple of page size. Maybe somebody is racily mucking with table fd's between seeking and reading. That would explain the issue disappearing after a retry. Maybe you can arrange for the RelFileNode and block number to be logged for the checksum failures and check what the actual checksums are in data files surrounding the failed page. If the requested block number contains something completely else, but the page that follows contains the expected checksum value, then it would support this theory. Regards, Ants Aasma
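To spell out the hypothesized race in C (a sketch of the idea, not PostgreSQL's actual fd.c/md.c code):

#include <unistd.h>

#define BLCKSZ 8192

/* A seek plus a separate read acts on the file offset stored in the
 * open file description -- state that is shared with any child process
 * that inherited the descriptor across fork(): */
ssize_t
read_block_racy(int fd, unsigned int blkno, char *buf)
{
    if (lseek(fd, (off_t) blkno * BLCKSZ, SEEK_SET) < 0)
        return -1;
    /* <-- if a child moves the shared offset here, the wrong block is read */
    return read(fd, buf, BLCKSZ);
}

/* pread() takes the offset explicitly and never touches the shared
 * offset, so it has no such window: */
ssize_t
read_block_safe(int fd, unsigned int blkno, char *buf)
{
    return pread(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ);
}

The same reasoning applies on the write side, with lseek/write versus pwrite.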
On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma <ants.aasma@eesti.ee> wrote: > On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >> Still getting checksum failures. Over the last 30 days, I see the >> following. Since enabling checksums FWICT none of the damage is >> permanent and rolls back with the transaction. So creepy! > > The checksums still only differ in least significant digits which > pretty much means that there is a block number mismatch. So if you > rule out filesystem not doing its job correctly and transposing > blocks, it could be something else that is resulting in blocks getting > read from a location that happens to differ by a small multiple of > page size. Maybe somebody is racily mucking with table fd's between > seeking and reading. That would explain the issue disappearing after a > retry. > > Maybe you can arrange for the RelFileNode and block number to be > logged for the checksum failures and check what the actual checksums > are in data files surrounding the failed page. If the requested block > number contains something completely else, but the page that follows > contains the expected checksum value, then it would support this > theory. will do. Main challenge is getting hand compiled server to swap in so that libdir continues to work. Getting access to the server is difficult as is getting a maintenance window. I'll post back ASAP. merlin
On Wed, Jan 4, 2017 at 4:17 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> It seems like everyone was generally in favor of this. I looked around
> the internet for caveats but everyone was basically saying, you should
> definitely do this.
>
> Why not for EXEC_BACKEND?
>
> O_CLOEXEC is a newer interface. There are older systems that don't have
> it but have FD_CLOEXEC for fcntl(). We should use that as a fallback.
>
> Have you gone through the code and checked for other ways file
> descriptors might get opened? Here is a blog post that lists some
> candidates: http://udrepper.livejournal.com/20407.html
>
> Ideally, we would have a test case that exec's something that lists the
> open file descriptors, and we check that there are only those we expect.
>
> The comment "We don't expect execve() calls inside the postgres code" is
> not quite correct, as we do things like archive_command and COPY to
> program (see OpenPipeStream()).

Oskari, are you planning to answer this review? As the thread has been idle for 3 weeks, I am marking this as returned with feedback. Don't hesitate to change the status of the patch if you have a new version.

-- 
Michael
On Wed, Jan 18, 2017 at 4:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma <ants.aasma@eesti.ee> wrote:
>> On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> Still getting checksum failures. Over the last 30 days, I see the
>>> following. Since enabling checksums FWICT none of the damage is
>>> permanent and rolls back with the transaction. So creepy!
>>
>> The checksums still only differ in least significant digits which
>> pretty much means that there is a block number mismatch. So if you
>> rule out filesystem not doing its job correctly and transposing
>> blocks, it could be something else that is resulting in blocks getting
>> read from a location that happens to differ by a small multiple of
>> page size. Maybe somebody is racily mucking with table fd's between
>> seeking and reading. That would explain the issue disappearing after a
>> retry.
>>
>> Maybe you can arrange for the RelFileNode and block number to be
>> logged for the checksum failures and check what the actual checksums
>> are in data files surrounding the failed page. If the requested block
>> number contains something completely else, but the page that follows
>> contains the expected checksum value, then it would support this
>> theory.
>
> will do. Main challenge is getting hand compiled server to swap in
> so that libdir continues to work. Getting access to the server is
> difficult as is getting a maintenance window. I'll post back ASAP.

As a new datapoint, we just had a customer with an issue that I think might be related. The issue was reasonably repeatable by running a report on the standby system. It manifested itself first as "could not open relation" and/or "column is not in index" errors, followed a few minutes later by a PANIC from the startup process due to "specified item offset is too large", "invalid max offset number", or "page X of relation base/16384/1259 is uninitialized". I took a look at the xlog dump and it was completely fine. For instance, in the "specified item offset is too large" case there was an INSERT_LEAF redo record inserting the preceding offset just a couple hundred kilobytes back. Restarting the server sometimes successfully applied the offending WAL, sometimes it failed with other corruption errors. The offending relations were always pg_class or pg_class_oid_index. Replacing plsh functions with dummy plpgsql functions made the problem go away; reintroducing the plsh functions made it reappear.

The only thing I came up with that is consistent with the symptoms is that a page got thrown out of shared_buffers between the two xlog records referencing it (shared_buffers was the default 128MB), and was then read back by a backend process, where in the time between the FileSeek and FileRead calls in mdread a subprocess mucked with the fd's offset so that a different page than intended got read in. Or basically the same race condition, but on the write side. Maybe somebody else has a better imagination than me...

Regards,
Ants Aasma
On Thu, Aug 10, 2017 at 12:01 PM, Ants Aasma <ants.aasma@eesti.ee> wrote:
> On Wed, Jan 18, 2017 at 4:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma <ants.aasma@eesti.ee> wrote:
>>> On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>>> Still getting checksum failures. Over the last 30 days, I see the
>>>> following. Since enabling checksums FWICT none of the damage is
>>>> permanent and rolls back with the transaction. So creepy!
>>>
>>> The checksums still only differ in least significant digits which
>>> pretty much means that there is a block number mismatch. So if you
>>> rule out filesystem not doing its job correctly and transposing
>>> blocks, it could be something else that is resulting in blocks getting
>>> read from a location that happens to differ by a small multiple of
>>> page size. Maybe somebody is racily mucking with table fd's between
>>> seeking and reading. That would explain the issue disappearing after a
>>> retry.
>>>
>>> Maybe you can arrange for the RelFileNode and block number to be
>>> logged for the checksum failures and check what the actual checksums
>>> are in data files surrounding the failed page. If the requested block
>>> number contains something completely else, but the page that follows
>>> contains the expected checksum value, then it would support this
>>> theory.
>>
>> will do. Main challenge is getting hand compiled server to swap in
>> so that libdir continues to work. Getting access to the server is
>> difficult as is getting a maintenance window. I'll post back ASAP.
>
> As a new datapoint, we just had a customer with an issue that I think
> might be related. The issue was reasonably repeatable by running a
> report on the standby system. It manifested itself first as "could
> not open relation" and/or "column is not in index" errors, followed a
> few minutes later by a PANIC from the startup process due to "specified
> item offset is too large", "invalid max offset number", or "page X of
> relation base/16384/1259 is uninitialized". I took a look at the xlog
> dump and it was completely fine. For instance, in the "specified item
> offset is too large" case there was an INSERT_LEAF redo record
> inserting the preceding offset just a couple hundred kilobytes back.
> Restarting the server sometimes successfully applied the offending
> WAL, sometimes it failed with other corruption errors. The offending
> relations were always pg_class or pg_class_oid_index. Replacing plsh
> functions with dummy plpgsql functions made the problem go away;
> reintroducing the plsh functions made it reappear.

Fantastic. I was never able to attempt to apply the O_CLOEXEC patch (see upthread) due to the fact that access to the system is highly limited and compiling a replacement binary was a bit of a headache. IIRC this was the best theory on the table as to the underlying cause, and we ought to try that first, right?

Reminder: I was able to completely eliminate all damage (but had to handle the occasional unexpected rollback) by enabling checksums.

merlin
On Tue, Jan 3, 2017 at 1:05 PM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> On 11/7/16 5:31 PM, Merlin Moncure wrote:
>> Regardless, it seems like you might be on to something, and I'm
>> inclined to patch your change, test it, and roll it out to production.
>> If it helps or at least narrows the problem down, we ought to give it
>> consideration for inclusion (unless someone else can think of a good
>> reason not to do that, heh!).
>
> Any results yet?

Not yet. But I do have some interesting findings. At this point I do not think the problem is within pl/sh itself, but that when a process invoked from pl/sh misbehaves, that misbehavior can penetrate into the database processes. I also believe this problem is fd-related, so the 'close on exec' change might reasonably fix it. All cases of database damage I have observed remain completely mitigated by enabling database checksums.

Recently, a sqsh process kicked off via pl/sh crashed with signal 11, but the database process was otherwise intact and fine. This is strong supporting evidence for my points above, I think. I've also turned up a fairly reliable reproduction case from some unrelated application changes. If I can demonstrate that the close-on-exec flag works and prevents these occurrences, we can close the book on this.

merlin