Thread: debian bugrept involving fast default crash in pg11.7

debian bugrept involving fast default crash in pg11.7

From
Justin Pryzby
Date:
I happened across this bugreport, which seems to have just enough information
to be interesting.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=953204
|Version: 11.7-0+deb10u1
|2020-03-05 16:55:55.511 UTC [515] LOG:  background worker "parallel worker" (PID 884) was terminated by signal 11:
Segmentationfault
 
|2020-03-05 16:55:55.511 UTC [515] DETAIL:  Failed process was running: 
|SELECT distinct student_prob.student_id, student_prob.score, student_name, v_capacity_score.capacity
|FROM data JOIN model on model.id = 2 AND data_stage(data) = model.target_begin_field_id
|JOIN student_prob ON data.crm_id = student_prob.student_id AND model.id = student_prob.model_id AND
(student_prob.additional_aid< 1)
 
|LEFT JOIN v_capacity_score ON data.crm_id = v_capacity_score.student_id AND student_prob.model_id =
v_capacity_score.model_id
|WHERE data.term_code = '202090' AND student_prob.score > 0
|ORDER BY student_prob.score DESC, student_name
|LIMIT 100 OFFSET 100 ;

Tim: it'd be nice to get more information, if and when possible:
 - "explain" plan for that query;
 - \d for the tables involved: constraints, inheritence, defaults;
 - corefile or backtrace; it looks like there's two different crashes (maybe same problem) so both would be useful;
 - Can you reprodue the crash if you "SET max_parallel_workers_per_gather=0" ?
 - Do you know if it crashed under v11.6 ?

If anyone wants to hack on the .deb:
https://packages.debian.org/buster/amd64/postgresql-11/download and (I couldn't find the dbg package anywhere else)
https://snapshot.debian.org/package/postgresql-11/11.7-0%2Bdeb10u1/#postgresql-11-dbgsym_11.7-0:2b:deb10u1

$ mkdir pg11
$ cd pg11
$ wget -q
http://security.debian.org/debian-security/pool/updates/main/p/postgresql-11/postgresql-11_11.7-0+deb10u1_amd64.deb
$ ar x ./postgresql-11_11.7-0+deb10u1_amd64.deb 
$ tar xf ./data.tar.xz 
$ ar x postgresql-11-dbgsym_11.7-0+deb10u1_amd64.deb
$ tar tf data.tar.xz
$ gdb usr/lib/postgresql/11/bin/postgres 

(gdb) set debug-file-directory usr/lib/debug/
(gdb) file usr/lib/postgresql/11/bin/postmaster 
(gdb) info target     

If I repeat the process Bernhard used (thanks for that) on the first crash in
libc6, I get:

(gdb) find /b 0x0000000000022320, 0x000000000016839b, 0xf9, 0x20, 0x77, 0x1f, 0xc5, 0xfd, 0x74, 0x0f, 0xc5, 0xfd, 0xd7,
0xc1,0x85, 0xc0, 0x0f, 0x85, 0xdf, 0x00, 0x00, 0x00, 0x48, 0x83, 0xc7, 0x20, 0x83, 0xe1, 0x1f, 0x48, 0x83, 0xe7, 0xe0,
0xeb,0x36, 0x66, 0x90, 0x83, 0xe1, 0x1f, 0x48, 0x83, 0xe7, 0xe0, 0xc5, 0xfd, 0x74, 0x0f, 0xc5, 0xfd, 0xd7, 0xc1, 0xd3,
0xf8,0x85, 0xc0, 0x74, 0x1b, 0xf3, 0x0f, 0xbc, 0xc0, 0x48, 0x01, 0xf8, 0x48
 
0x15c17d <__strlen_avx2+13>
warning: Unable to access 1631 bytes of target memory at 0x167d3d, halting search.
1 pattern found.

I'm tentatively guessing that heap_modify_tuple() is involved, since it calls
getmissingattr and (probably) fill_val.  It looks like maybe some data
structure is corrupted which crashed two parallel workers, one in
fill_val()/strlen() and one in heap_deform_tuple()/getmissingattr().  Maybe
something not initialized in parallel worker, or a use-after-free?  I'll stop
guessing.

Justin



Re: debian bugrept involving fast default crash in pg11.7

From
Tim Bishop
Date:
I've attached a file containing the \d+ for all the tables involved and 
the EXPLAIN ANALYZE for the query.

Yes, the crash happened under v11.6.  I had tried downgrading when I 
first encountered the problem.

While trying to put together this information the crash started 
happening less frequently (I was only able to reproduce it it twice and 
not in a row) and I am unable to confirm if  SET 
max_parallel_workers_per_gather=0 had any effect.

Also since I've been able to reproduce I'm currently unable to provide a 
corefile or backtrace.  I'll continue to try and reproduce the error so 
I can get one or the other.

I did find a work around for the crash by making the view 
(v_capacity_score) a materialized view.

Thanks
tim

On 3/28/20 6:30 PM, Justin Pryzby wrote:
> I happened across this bugreport, which seems to have just enough information
> to be interesting.
> 
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=953204
> |Version: 11.7-0+deb10u1
> |2020-03-05 16:55:55.511 UTC [515] LOG:  background worker "parallel worker" (PID 884) was terminated by signal 11:
Segmentationfault
 
> |2020-03-05 16:55:55.511 UTC [515] DETAIL:  Failed process was running:
> |SELECT distinct student_prob.student_id, student_prob.score, student_name, v_capacity_score.capacity
> |FROM data JOIN model on model.id = 2 AND data_stage(data) = model.target_begin_field_id
> |JOIN student_prob ON data.crm_id = student_prob.student_id AND model.id = student_prob.model_id AND
(student_prob.additional_aid< 1)
 
> |LEFT JOIN v_capacity_score ON data.crm_id = v_capacity_score.student_id AND student_prob.model_id =
v_capacity_score.model_id
> |WHERE data.term_code = '202090' AND student_prob.score > 0
> |ORDER BY student_prob.score DESC, student_name
> |LIMIT 100 OFFSET 100 ;
> 
> Tim: it'd be nice to get more information, if and when possible:
>   - "explain" plan for that query;
>   - \d for the tables involved: constraints, inheritence, defaults;
>   - corefile or backtrace; it looks like there's two different crashes (maybe same problem) so both would be useful;
>   - Can you reprodue the crash if you "SET max_parallel_workers_per_gather=0" ?
>   - Do you know if it crashed under v11.6 ?
> 
> If anyone wants to hack on the .deb:
> https://packages.debian.org/buster/amd64/postgresql-11/download and (I couldn't find the dbg package anywhere else)
> https://snapshot.debian.org/package/postgresql-11/11.7-0%2Bdeb10u1/#postgresql-11-dbgsym_11.7-0:2b:deb10u1
> 
> $ mkdir pg11
> $ cd pg11
> $ wget -q
http://security.debian.org/debian-security/pool/updates/main/p/postgresql-11/postgresql-11_11.7-0+deb10u1_amd64.deb
> $ ar x ./postgresql-11_11.7-0+deb10u1_amd64.deb
> $ tar xf ./data.tar.xz
> $ ar x postgresql-11-dbgsym_11.7-0+deb10u1_amd64.deb
> $ tar tf data.tar.xz
> $ gdb usr/lib/postgresql/11/bin/postgres
> 
> (gdb) set debug-file-directory usr/lib/debug/
> (gdb) file usr/lib/postgresql/11/bin/postmaster
> (gdb) info target
> 
> If I repeat the process Bernhard used (thanks for that) on the first crash in
> libc6, I get:
> 
> (gdb) find /b 0x0000000000022320, 0x000000000016839b, 0xf9, 0x20, 0x77, 0x1f, 0xc5, 0xfd, 0x74, 0x0f, 0xc5, 0xfd,
0xd7,0xc1, 0x85, 0xc0, 0x0f, 0x85, 0xdf, 0x00, 0x00, 0x00, 0x48, 0x83, 0xc7, 0x20, 0x83, 0xe1, 0x1f, 0x48, 0x83, 0xe7,
0xe0,0xeb, 0x36, 0x66, 0x90, 0x83, 0xe1, 0x1f, 0x48, 0x83, 0xe7, 0xe0, 0xc5, 0xfd, 0x74, 0x0f, 0xc5, 0xfd, 0xd7, 0xc1,
0xd3,0xf8, 0x85, 0xc0, 0x74, 0x1b, 0xf3, 0x0f, 0xbc, 0xc0, 0x48, 0x01, 0xf8, 0x48
 
> 0x15c17d <__strlen_avx2+13>
> warning: Unable to access 1631 bytes of target memory at 0x167d3d, halting search.
> 1 pattern found.
> 
> I'm tentatively guessing that heap_modify_tuple() is involved, since it calls
> getmissingattr and (probably) fill_val.  It looks like maybe some data
> structure is corrupted which crashed two parallel workers, one in
> fill_val()/strlen() and one in heap_deform_tuple()/getmissingattr().  Maybe
> something not initialized in parallel worker, or a use-after-free?  I'll stop
> guessing.
> 
> Justin
> 


Attachment

Re: debian bugrept involving fast default crash in pg11.7

From
Justin Pryzby
Date:
On Thu, Apr 09, 2020 at 02:05:22PM -0400, Tim Bishop wrote:
> I've attached a file containing the \d+ for all the tables involved and the
> EXPLAIN ANALYZE for the query.

Thanks for your response.

Do you know if you used the "fast default feature" ?
That would happen if you did "ALTER TABLE tbl ADD col DEFAULT val"

I guess this is the way to tell:
SELECT attrelid::regclass, * FROM pg_attribute WHERE atthasmissing;

-- 
Justin



Re: debian bugrept involving fast default crash in pg11.7

From
Tim Bishop
Date:
SELECT attrelid::regclass, * FROM pg_attribute WHERE atthasmissing;
-[ RECORD 1 ]-+---------
attrelid      | download
attrelid      | 22749
attname       | filetype
atttypid      | 1043
attstattarget | -1
attlen        | -1
attnum        | 5
attndims      | 0
attcacheoff   | -1
atttypmod     | 36
attbyval      | f
attstorage    | x
attalign      | i
attnotnull    | t
atthasdef     | t
atthasmissing | t
attidentity   |
attisdropped  | f
attislocal    | t
attinhcount   | 0
attcollation  | 100
attacl        |
attoptions    |
attfdwoptions |
attmissingval | {csv}


On 4/9/20 2:31 PM, Justin Pryzby wrote:
> On Thu, Apr 09, 2020 at 02:05:22PM -0400, Tim Bishop wrote:
>> I've attached a file containing the \d+ for all the tables involved and the
>> EXPLAIN ANALYZE for the query.
> 
> Thanks for your response.
> 
> Do you know if you used the "fast default feature" ?
> That would happen if you did "ALTER TABLE tbl ADD col DEFAULT val"
> 
> I guess this is the way to tell:
> SELECT attrelid::regclass, * FROM pg_attribute WHERE atthasmissing;
> 




Re: debian bugrept involving fast default crash in pg11.7

From
Justin Pryzby
Date:
On Thu, Apr 09, 2020 at 02:36:26PM -0400, Tim Bishop wrote:
> SELECT attrelid::regclass, * FROM pg_attribute WHERE atthasmissing;
> -[ RECORD 1 ]-+---------
> attrelid      | download
> attrelid      | 22749
> attname       | filetype

But that table isn't involved in the crashing query, right ?
Are data_stage() and income_index() locally defined functions ?  PLPGSQL ??
Do they access the download table (or view or whatever it is) ?

Thanks,
-- 
Justin



Re: debian bugrept involving fast default crash in pg11.7

From
Andrew Dunstan
Date:
On 4/9/20 4:39 PM, Justin Pryzby wrote:
> On Thu, Apr 09, 2020 at 02:36:26PM -0400, Tim Bishop wrote:
>> SELECT attrelid::regclass, * FROM pg_attribute WHERE atthasmissing;
>> -[ RECORD 1 ]-+---------
>> attrelid      | download
>> attrelid      | 22749
>> attname       | filetype
> But that table isn't involved in the crashing query, right ?
> Are data_stage() and income_index() locally defined functions ?  PLPGSQL ??
> Do they access the download table (or view or whatever it is) ?
>

As requested I have reviewed this old thread. You are correct, this
table is not involved in the query. That doesn't mean that the changes
made by the fast default code haven't caused a problem, but it makes it
a bit less likely. If the OP is still interested and can provide a
self-contained recipe to reproduce the problem I can investigate.
Without that it's difficult to know what to look at.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com