Thread: debian bugrept involving fast default crash in pg11.7
I happened across this bugreport, which seems to have just enough information to be interesting. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=953204 |Version: 11.7-0+deb10u1 |2020-03-05 16:55:55.511 UTC [515] LOG: background worker "parallel worker" (PID 884) was terminated by signal 11: Segmentationfault |2020-03-05 16:55:55.511 UTC [515] DETAIL: Failed process was running: |SELECT distinct student_prob.student_id, student_prob.score, student_name, v_capacity_score.capacity |FROM data JOIN model on model.id = 2 AND data_stage(data) = model.target_begin_field_id |JOIN student_prob ON data.crm_id = student_prob.student_id AND model.id = student_prob.model_id AND (student_prob.additional_aid< 1) |LEFT JOIN v_capacity_score ON data.crm_id = v_capacity_score.student_id AND student_prob.model_id = v_capacity_score.model_id |WHERE data.term_code = '202090' AND student_prob.score > 0 |ORDER BY student_prob.score DESC, student_name |LIMIT 100 OFFSET 100 ; Tim: it'd be nice to get more information, if and when possible: - "explain" plan for that query; - \d for the tables involved: constraints, inheritence, defaults; - corefile or backtrace; it looks like there's two different crashes (maybe same problem) so both would be useful; - Can you reprodue the crash if you "SET max_parallel_workers_per_gather=0" ? - Do you know if it crashed under v11.6 ? If anyone wants to hack on the .deb: https://packages.debian.org/buster/amd64/postgresql-11/download and (I couldn't find the dbg package anywhere else) https://snapshot.debian.org/package/postgresql-11/11.7-0%2Bdeb10u1/#postgresql-11-dbgsym_11.7-0:2b:deb10u1 $ mkdir pg11 $ cd pg11 $ wget -q http://security.debian.org/debian-security/pool/updates/main/p/postgresql-11/postgresql-11_11.7-0+deb10u1_amd64.deb $ ar x ./postgresql-11_11.7-0+deb10u1_amd64.deb $ tar xf ./data.tar.xz $ ar x postgresql-11-dbgsym_11.7-0+deb10u1_amd64.deb $ tar tf data.tar.xz $ gdb usr/lib/postgresql/11/bin/postgres (gdb) set debug-file-directory usr/lib/debug/ (gdb) file usr/lib/postgresql/11/bin/postmaster (gdb) info target If I repeat the process Bernhard used (thanks for that) on the first crash in libc6, I get: (gdb) find /b 0x0000000000022320, 0x000000000016839b, 0xf9, 0x20, 0x77, 0x1f, 0xc5, 0xfd, 0x74, 0x0f, 0xc5, 0xfd, 0xd7, 0xc1,0x85, 0xc0, 0x0f, 0x85, 0xdf, 0x00, 0x00, 0x00, 0x48, 0x83, 0xc7, 0x20, 0x83, 0xe1, 0x1f, 0x48, 0x83, 0xe7, 0xe0, 0xeb,0x36, 0x66, 0x90, 0x83, 0xe1, 0x1f, 0x48, 0x83, 0xe7, 0xe0, 0xc5, 0xfd, 0x74, 0x0f, 0xc5, 0xfd, 0xd7, 0xc1, 0xd3, 0xf8,0x85, 0xc0, 0x74, 0x1b, 0xf3, 0x0f, 0xbc, 0xc0, 0x48, 0x01, 0xf8, 0x48 0x15c17d <__strlen_avx2+13> warning: Unable to access 1631 bytes of target memory at 0x167d3d, halting search. 1 pattern found. I'm tentatively guessing that heap_modify_tuple() is involved, since it calls getmissingattr and (probably) fill_val. It looks like maybe some data structure is corrupted which crashed two parallel workers, one in fill_val()/strlen() and one in heap_deform_tuple()/getmissingattr(). Maybe something not initialized in parallel worker, or a use-after-free? I'll stop guessing. Justin
I've attached a file containing the \d+ for all the tables involved and the EXPLAIN ANALYZE for the query. Yes, the crash happened under v11.6. I had tried downgrading when I first encountered the problem. While trying to put together this information the crash started happening less frequently (I was only able to reproduce it it twice and not in a row) and I am unable to confirm if SET max_parallel_workers_per_gather=0 had any effect. Also since I've been able to reproduce I'm currently unable to provide a corefile or backtrace. I'll continue to try and reproduce the error so I can get one or the other. I did find a work around for the crash by making the view (v_capacity_score) a materialized view. Thanks tim On 3/28/20 6:30 PM, Justin Pryzby wrote: > I happened across this bugreport, which seems to have just enough information > to be interesting. > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=953204 > |Version: 11.7-0+deb10u1 > |2020-03-05 16:55:55.511 UTC [515] LOG: background worker "parallel worker" (PID 884) was terminated by signal 11: Segmentationfault > |2020-03-05 16:55:55.511 UTC [515] DETAIL: Failed process was running: > |SELECT distinct student_prob.student_id, student_prob.score, student_name, v_capacity_score.capacity > |FROM data JOIN model on model.id = 2 AND data_stage(data) = model.target_begin_field_id > |JOIN student_prob ON data.crm_id = student_prob.student_id AND model.id = student_prob.model_id AND (student_prob.additional_aid< 1) > |LEFT JOIN v_capacity_score ON data.crm_id = v_capacity_score.student_id AND student_prob.model_id = v_capacity_score.model_id > |WHERE data.term_code = '202090' AND student_prob.score > 0 > |ORDER BY student_prob.score DESC, student_name > |LIMIT 100 OFFSET 100 ; > > Tim: it'd be nice to get more information, if and when possible: > - "explain" plan for that query; > - \d for the tables involved: constraints, inheritence, defaults; > - corefile or backtrace; it looks like there's two different crashes (maybe same problem) so both would be useful; > - Can you reprodue the crash if you "SET max_parallel_workers_per_gather=0" ? > - Do you know if it crashed under v11.6 ? > > If anyone wants to hack on the .deb: > https://packages.debian.org/buster/amd64/postgresql-11/download and (I couldn't find the dbg package anywhere else) > https://snapshot.debian.org/package/postgresql-11/11.7-0%2Bdeb10u1/#postgresql-11-dbgsym_11.7-0:2b:deb10u1 > > $ mkdir pg11 > $ cd pg11 > $ wget -q http://security.debian.org/debian-security/pool/updates/main/p/postgresql-11/postgresql-11_11.7-0+deb10u1_amd64.deb > $ ar x ./postgresql-11_11.7-0+deb10u1_amd64.deb > $ tar xf ./data.tar.xz > $ ar x postgresql-11-dbgsym_11.7-0+deb10u1_amd64.deb > $ tar tf data.tar.xz > $ gdb usr/lib/postgresql/11/bin/postgres > > (gdb) set debug-file-directory usr/lib/debug/ > (gdb) file usr/lib/postgresql/11/bin/postmaster > (gdb) info target > > If I repeat the process Bernhard used (thanks for that) on the first crash in > libc6, I get: > > (gdb) find /b 0x0000000000022320, 0x000000000016839b, 0xf9, 0x20, 0x77, 0x1f, 0xc5, 0xfd, 0x74, 0x0f, 0xc5, 0xfd, 0xd7,0xc1, 0x85, 0xc0, 0x0f, 0x85, 0xdf, 0x00, 0x00, 0x00, 0x48, 0x83, 0xc7, 0x20, 0x83, 0xe1, 0x1f, 0x48, 0x83, 0xe7, 0xe0,0xeb, 0x36, 0x66, 0x90, 0x83, 0xe1, 0x1f, 0x48, 0x83, 0xe7, 0xe0, 0xc5, 0xfd, 0x74, 0x0f, 0xc5, 0xfd, 0xd7, 0xc1, 0xd3,0xf8, 0x85, 0xc0, 0x74, 0x1b, 0xf3, 0x0f, 0xbc, 0xc0, 0x48, 0x01, 0xf8, 0x48 > 0x15c17d <__strlen_avx2+13> > warning: Unable to access 1631 bytes of target memory at 0x167d3d, halting search. > 1 pattern found. > > I'm tentatively guessing that heap_modify_tuple() is involved, since it calls > getmissingattr and (probably) fill_val. It looks like maybe some data > structure is corrupted which crashed two parallel workers, one in > fill_val()/strlen() and one in heap_deform_tuple()/getmissingattr(). Maybe > something not initialized in parallel worker, or a use-after-free? I'll stop > guessing. > > Justin >
Attachment
On Thu, Apr 09, 2020 at 02:05:22PM -0400, Tim Bishop wrote: > I've attached a file containing the \d+ for all the tables involved and the > EXPLAIN ANALYZE for the query. Thanks for your response. Do you know if you used the "fast default feature" ? That would happen if you did "ALTER TABLE tbl ADD col DEFAULT val" I guess this is the way to tell: SELECT attrelid::regclass, * FROM pg_attribute WHERE atthasmissing; -- Justin
SELECT attrelid::regclass, * FROM pg_attribute WHERE atthasmissing; -[ RECORD 1 ]-+--------- attrelid | download attrelid | 22749 attname | filetype atttypid | 1043 attstattarget | -1 attlen | -1 attnum | 5 attndims | 0 attcacheoff | -1 atttypmod | 36 attbyval | f attstorage | x attalign | i attnotnull | t atthasdef | t atthasmissing | t attidentity | attisdropped | f attislocal | t attinhcount | 0 attcollation | 100 attacl | attoptions | attfdwoptions | attmissingval | {csv} On 4/9/20 2:31 PM, Justin Pryzby wrote: > On Thu, Apr 09, 2020 at 02:05:22PM -0400, Tim Bishop wrote: >> I've attached a file containing the \d+ for all the tables involved and the >> EXPLAIN ANALYZE for the query. > > Thanks for your response. > > Do you know if you used the "fast default feature" ? > That would happen if you did "ALTER TABLE tbl ADD col DEFAULT val" > > I guess this is the way to tell: > SELECT attrelid::regclass, * FROM pg_attribute WHERE atthasmissing; >
On Thu, Apr 09, 2020 at 02:36:26PM -0400, Tim Bishop wrote: > SELECT attrelid::regclass, * FROM pg_attribute WHERE atthasmissing; > -[ RECORD 1 ]-+--------- > attrelid | download > attrelid | 22749 > attname | filetype But that table isn't involved in the crashing query, right ? Are data_stage() and income_index() locally defined functions ? PLPGSQL ?? Do they access the download table (or view or whatever it is) ? Thanks, -- Justin
On 4/9/20 4:39 PM, Justin Pryzby wrote: > On Thu, Apr 09, 2020 at 02:36:26PM -0400, Tim Bishop wrote: >> SELECT attrelid::regclass, * FROM pg_attribute WHERE atthasmissing; >> -[ RECORD 1 ]-+--------- >> attrelid | download >> attrelid | 22749 >> attname | filetype > But that table isn't involved in the crashing query, right ? > Are data_stage() and income_index() locally defined functions ? PLPGSQL ?? > Do they access the download table (or view or whatever it is) ? > As requested I have reviewed this old thread. You are correct, this table is not involved in the query. That doesn't mean that the changes made by the fast default code haven't caused a problem, but it makes it a bit less likely. If the OP is still interested and can provide a self-contained recipe to reproduce the problem I can investigate. Without that it's difficult to know what to look at. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com