Thread: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16122
Logged by:          ascott
Email address:      adam.c.scott@gmail.com
PostgreSQL version: 10.11
Operating system:   Ubuntu & CentOS
Description:

Seg fault can be repeated by running this query:

select count(*) from natica_hdu where boundary is not null
and
"dec_range" <@ '[89.9999998611111,90.0000001388889)' AND "ra_range" <@
'[45.0,45.1]';

it crashes on this line:

if (VARATT_IS_EXTENDED(datum)) in fmgr.c:1833

GDB stacktrace is below.

The table definition for natica_hdu is as follows:

CREATE TABLE public.natica_hdu
(
  id integer NOT NULL,
  updated timestamp with time zone NOT NULL,
  hdu_idx smallint NOT NULL,
  ra double precision,
  "dec" double precision,
  boundary double precision[],
  extras jsonb NOT NULL,
  fitsfile_id character varying(32) NOT NULL,
  dec_range numrange,
  ra_range numrange
)
WITH (
  OIDS=FALSE
);
ALTER TABLE public.natica_hdu OWNER TO postgres;

CREATE INDEX natica_hdu_dec_range_56c7d92d
  ON public.natica_hdu USING btree (dec_range);

CREATE INDEX natica_hdu_fitsfile_id_3a3363fe
  ON public.natica_hdu USING btree (fitsfile_id COLLATE pg_catalog."default");

CREATE INDEX natica_hdu_fitsfile_id_3a3363fe_like
  ON public.natica_hdu USING btree (fitsfile_id COLLATE pg_catalog."default" varchar_pattern_ops);

CREATE INDEX natica_hdu_q3c_ang2ipix_idx
  ON public.natica_hdu USING btree (q3c_ang2ipix(ra, "dec"));
ALTER TABLE public.natica_hdu CLUSTER ON natica_hdu_q3c_ang2ipix_idx;

CREATE INDEX natica_hdu_ra_range_b9f4d3ac
  ON public.natica_hdu USING btree (ra_range);

#0  pg_detoast_datum (datum=0xffffffff) at fmgr.c:1833
#1  0x0000557a18c19545 in numeric_sub (fcinfo=fcinfo@entry=0x7ffff5795e30) at numeric.c:2288
#2  0x0000557a18c9bb44 in DirectFunctionCall2Coll (func=0x557a18c19520 <numeric_sub>, collation=collation@entry=0, arg1=<optimized out>, arg2=<optimized out>) at fmgr.c:739
#3  0x0000557a18c2752a in numrange_subdiff (fcinfo=<optimized out>) at rangetypes.c:1424
#4  0x0000557a18c9c4af in FunctionCall2Coll (flinfo=flinfo@entry=0x557a1ad85098, collation=<optimized out>, arg1=<optimized out>, arg2=<optimized out>) at fmgr.c:1059
#5  0x0000557a18c2c1ce in get_position (typcache=typcache@entry=0x557a1ad84f58, value=value@entry=0x7ffff5796690, hist2=hist2@entry=0x557a1ad76700, hist1=<optimized out>, hist1=<optimized out>) at rangetypes_selfuncs.c:696
#6  0x0000557a18c2d679 in calc_hist_selectivity_contained (length_hist_nvalues=101, length_hist_values=0x557a1ad88568, hist_nvalues=<optimized out>, hist_lower=<optimized out>, upper=0x7ffff57966a0, lower=0x7ffff5796690, typcache=0x557a1ad84f58) at rangetypes_selfuncs.c:1065
#7  calc_hist_selectivity (operator=<optimized out>, constval=<optimized out>, vardata=0x7ffff57966b0, typcache=0x557a1ad84f58) at rangetypes_selfuncs.c:561
#8  calc_rangesel (operator=<optimized out>, constval=<optimized out>, vardata=0x7ffff57966b0, typcache=0x557a1ad84f58) at rangetypes_selfuncs.c:332
#9  rangesel (fcinfo=<optimized out>) at rangetypes_selfuncs.c:216
#10 0x0000557a18c9cfa9 in OidFunctionCall4Coll (functionId=<optimized out>, collation=collation@entry=0, arg1=arg1@entry=93982924614952, arg2=arg2@entry=3892, arg3=arg3@entry=93982924616184, arg4=arg4@entry=0) at fmgr.c:1415
#11 0x0000557a18b0f9e4 in restriction_selectivity (root=root@entry=0x557a1ad62d28, operatorid=3892, args=0x557a1ad631f8, inputcollid=0, varRelid=varRelid@entry=0) at plancat.c:1670
#12 0x0000557a18ad56db in clause_selectivity (root=root@entry=0x557a1ad62d28, clause=0x557a1ad63298, clause@entry=0x557a1ad74960, varRelid=varRelid@entry=0, jointype=jointype@entry=JOIN_INNER, sjinfo=sjinfo@entry=0x0) at clausesel.c:746
#13 0x0000557a18ad594d in clauselist_selectivity (root=root@entry=0x557a1ad62d28, clauses=<optimized out>, varRelid=varRelid@entry=0, jointype=jointype@entry=JOIN_INNER, sjinfo=sjinfo@entry=0x0) at clausesel.c:168
#14 0x0000557a18adb28d in set_baserel_size_estimates (root=root@entry=0x557a1ad62d28, rel=rel@entry=0x557a1ad73568) at costsize.c:4028
#15 0x0000557a18ad3c85 in set_plain_rel_size (rte=<optimized out>, rel=0x557a1ad73568, root=0x557a1ad62d28) at allpaths.c:526
#16 set_rel_size (root=0x557a1ad62d28, rel=0x557a1ad73568, rti=1, rte=<optimized out>) at allpaths.c:368
#17 0x0000557a18ad5015 in set_base_rel_sizes (root=<optimized out>) at allpaths.c:280
#18 make_one_rel (root=root@entry=0x557a1ad62d28, joinlist=joinlist@entry=0x557a1ad73798) at allpaths.c:178
#19 0x0000557a18af22a0 in query_planner (root=root@entry=0x557a1ad62d28, tlist=tlist@entry=0x557a1ad63038, qp_callback=qp_callback@entry=0x557a18af3140 <standard_qp_callback>, qp_extra=qp_extra@entry=0x7ffff5796f60) at planmain.c:259
#20 0x0000557a18af4900 in grouping_planner (root=root@entry=0x557a1ad62d28, inheritance_update=inheritance_update@entry=0 '\000', tuple_fraction=<optimized out>, tuple_fraction@entry=0) at planner.c:1733
#21 0x0000557a18af7a2f in subquery_planner (glob=glob@entry=0x557a1ad62c98, parse=parse@entry=0x557a1ad43ef8, parent_root=parent_root@entry=0x0, hasRecursion=hasRecursion@entry=0 '\000', tuple_fraction=tuple_fraction@entry=0) at planner.c:862
#22 0x0000557a18af88b2 in standard_planner (parse=0x557a1ad43ef8, cursorOptions=256, boundParams=0x0) at planner.c:334
#23 0x0000557a18b907dc in pg_plan_query (querytree=querytree@entry=0x557a1ad43ef8, cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at postgres.c:819
#24 0x0000557a18b908b6 in pg_plan_queries (querytrees=<optimized out>, cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at postgres.c:885
#25 0x0000557a18b90cf9 in exec_simple_query (query_string=0x557a1ad429c8 "select count(*) from natica_hdu where boundary is not null\nand\n\"dec_range\" <@ '[89.9999998611111,90.0000001388889)' AND \"ra_range\" <@ '[45.0,45.1]';") at postgres.c:1050
#26 0x0000557a18b929f1 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x557a1aceee90, dbname=<optimized out>, username=<optimized out>) at postgres.c:4117
#27 0x0000557a188ea6ac in BackendRun (port=0x557a1ace77a0) at postmaster.c:4408
#28 BackendStartup (port=0x557a1ace77a0) at postmaster.c:4080
#29 ServerLoop () at postmaster.c:1756
#30 0x0000557a18b24d35 in PostmasterMain (argc=5, argv=0x557a1acbf790) at postmaster.c:1364
#31 0x0000557a188ebea9 in main (argc=5, argv=0x557a1acbf790) at main.c:228
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Michael Paquier
Date:
Hi Adam,

On Mon, Nov 18, 2019 at 01:27:22AM +0000, PG Bug reporting form wrote:
> Email address: adam.c.scott@gmail.com
> PostgreSQL version: 10.11
> Operating system: Ubuntu & CentOS
> Description:

Did you see that after updating to 10.11? If you used 10.10 or an older version, did the problem happen?

> Seg fault can be repeated by running this query:
>
> select count(*) from natica_hdu where boundary is not null
> and
> "dec_range" <@ '[89.9999998611111,90.0000001388889)' AND "ra_range" <@
> '[45.0,45.1]';

It would help if we had a sample of data here able to reproduce the problem. Something looks to be wrong with this range operator when working on numeric ranges; still, I cannot reproduce the problem with, for example, stuff like that:

CREATE TABLE aa (a numrange);
CREATE INDEX aai ON aa(a);
INSERT INTO aa
  SELECT ('['|| (90.0 - a::numeric/10000000000) || ',' ||
          (90.0 + a::numeric/10000000000) || ')')::numrange
  FROM generate_series(1,10000) as a;
SELECT count(*) FROM aa
  WHERE a <@ '[89.9999998611111,90.0000001388889)';

Seeing the plan of your query may help as well. Could you run EXPLAIN on it, or does it crash before? Perhaps a parallel plan is involved here?

> #0 pg_detoast_datum (datum=0xffffffff) at fmgr.c:1833
> #1 0x0000557a18c19545 in numeric_sub (fcinfo=fcinfo@entry=0x7ffff5795e30)
>    at numeric.c:2288

Hmm. Per the shape of this backtrace, you are indeed processing the range operator, and this portion of memory looks freed. My simplified example above basically goes through the same path when planning the query.
--
Michael
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Tomas Vondra
Date:
Hi Adam,

thanks for the report.

On Mon, Nov 18, 2019 at 01:27:22AM +0000, PG Bug reporting form wrote:
>The following bug has been logged on the website:
>
>Bug reference: 16122
>Logged by: ascott
>Email address: adam.c.scott@gmail.com
>PostgreSQL version: 10.11
>Operating system: Ubuntu & CentOS
>Description:
>
>Seg fault can be repeated by running this query:
>
>select count(*) from natica_hdu where boundary is not null
>and
>"dec_range" <@ '[89.9999998611111,90.0000001388889)' AND "ra_range" <@
>'[45.0,45.1]';
>
>it crashes on this line:
>
>if (VARATT_IS_EXTENDED(datum)) in fmgr.c:1833
>
>GDB stacktrace is below.

Hmmm, this does seem like some sort of memory/data corruption, probably. But I'm unable to reproduce the issue; it's likely data dependent. Can you try extracting a data set reproducing the issue? Not sure how large or sensitive the data is, but if needed you can share it off-list.

>The table definition for natica_hdu is as follows:
>CREATE TABLE public.natica_hdu
>(
>  id integer NOT NULL,
>  updated timestamp with time zone NOT NULL,
>  hdu_idx smallint NOT NULL,
>  ra double precision,
>  "dec" double precision,
>  boundary double precision[],
>  extras jsonb NOT NULL,
>  fitsfile_id character varying(32) NOT NULL,
>  dec_range numrange,
>  ra_range numrange
>)
>WITH (
>  OIDS=FALSE
>);

Hm, the query only really touches dec_range and ra_range. If you try just

  select count(*) from natica_hdu
  where "dec_range" <@ '[89.9999998611111,90.0000001388889)'

and

  select count(*) from natica_hdu
  where "ra_range" <@ '[45.0,45.1]'

does either of those queries crash too? That would reduce the amount of data for the reproducer.

If you try just

  COPY (SELECT ra_range, dec_range FROM public.natica_hdu) TO '/dev/null';

does that fail too? If yes, it's likely due to on-disk data corruption. If not, it seems more like a memory corruption during processing.

I think the grants and indexes are mostly irrelevant for the issue, particularly those not related to columns in the query. Did you include them just because they are defined, or because you think they're actually related?

>ALTER TABLE public.natica_hdu
>  OWNER TO postgres;
>
>CREATE INDEX natica_hdu_dec_range_56c7d92d
>  ON public.natica_hdu
>  USING btree
>  (dec_range);
>
>CREATE INDEX natica_hdu_fitsfile_id_3a3363fe
>  ON public.natica_hdu
>  USING btree
>  (fitsfile_id COLLATE pg_catalog."default");
>
>CREATE INDEX natica_hdu_fitsfile_id_3a3363fe_like
>  ON public.natica_hdu
>  USING btree
>  (fitsfile_id COLLATE pg_catalog."default" varchar_pattern_ops);
>
>CREATE INDEX natica_hdu_q3c_ang2ipix_idx
>  ON public.natica_hdu
>  USING btree
>  (q3c_ang2ipix(ra, "dec"));
>ALTER TABLE public.natica_hdu CLUSTER ON natica_hdu_q3c_ang2ipix_idx;
>
>CREATE INDEX natica_hdu_ra_range_b9f4d3ac
>  ON public.natica_hdu
>  USING btree
>  (ra_range);

...

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Adam Scott
Date:
> Did you see that after updating to 10.11? If you used 10.10 or
> an older version, did the problem happen?
Originally this was first discovered on CentOS 7 with PG 10.10; we then loaded the 40 GB table on 10.11 on Ubuntu.
>Seeing the plan of your query may help as well. Could you run EXPLAIN
>on it or does it crash before? Perhaps a parallel plan is involved
>here?
Explain plan works fine with no crash
I've done a binary search to find out where the error occurs in the data, but no luck. It seems intermittent now. Finally, I was able to reproduce the error repeatably with a blank table:
Stop and start postgres from fresh, and then run this query (notice, I removed a sarg from the originally supplied query):
select id from natica_hdu_test
WHERE
"dec_range" <@ '[88.9999998611111,90.0000001388889)';
Thank you for your help in tracking this down!
On Sun, Nov 17, 2019 at 8:50 PM Michael Paquier <michael@paquier.xyz> wrote:
Hi Adam,
On Mon, Nov 18, 2019 at 01:27:22AM +0000, PG Bug reporting form wrote:
> Email address: adam.c.scott@gmail.com
> PostgreSQL version: 10.11
> Operating system: Ubuntu & CentOS
> Description:
Did you see that after updating to 10.11? If you used 10.10 or
an older version, did the problem happen?
> Seg fault can be repeated by running this query:
>
> select count(*) from natica_hdu where boundary is not null
> and
> "dec_range" <@ '[89.9999998611111,90.0000001388889)' AND "ra_range" <@
> '[45.0,45.1]';
It would help if we had a sample of data here able to reproduce the
problem. Something looks to be wrong with this range operator when
working on numeric ranges, still I cannot reproduce the problem with
for example stuff like that:
CREATE TABLE aa (a numrange);
CREATE INDEX aai ON aa(a);
INSERT INTO aa
SELECT ('['|| (90.0 - a::numeric/10000000000) || ',' ||
(90.0 + a::numeric/10000000000) || ')')::numrange
FROM generate_series(1,10000) as a;
SELECT count(*) FROM aa
WHERE a <@ '[89.9999998611111,90.0000001388889)';
Seeing the plan of your query may help as well. Could you run EXPLAIN
on it or does it crash before? Perhaps a parallel plan is involved
here?
> #0 pg_detoast_datum (datum=0xffffffff) at fmgr.c:1833
> #1 0x0000557a18c19545 in numeric_sub (fcinfo=fcinfo@entry=0x7ffff5795e30)
> at numeric.c:2288
Hmm. Per the shape of this backtrace, you are indeed processing the
range operator, and this portion of memory looks freed. My simplified
example above basically goes through the same when planning the
query.
--
Michael
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Adam Scott
Date:
Hi Tomas, thank you for your help! @Michael, it looks like 15 repeats is not enough to say it is "reliably" reproduced; hope this does it.
To reproduce, be sure to start postgres from scratch. Load the dump file (below). And then run the following to reliably reproduce the error:
select id from natica_hdu_test
WHERE
"dec_range" <@ '[88.9999998611111,90.0000001388889)';
It was bugging out with an empty table, but it seems intermittent. A successful query followed by the one above sometimes will not fail; hence the need for a from-scratch startup instance of postgres.
Here is the backup file ~3.5 MB: https://drive.google.com/open?id=1cMc8pMT_aDAW7miplbrpDDcn-JZ0td8M
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Michael Paquier
Date:
On Mon, Nov 18, 2019 at 02:51:25PM -0800, Adam Scott wrote:
>> Did you see that after updating to 10.11? If you used 10.10 or
>> an older version, did the problem happen?
>
> Originally this first was discovered on Centos 7, PG 10.10, then loaded the
> 40gb table on 10.11 on Ubuntu.

Which means that prior to 10.9 you used this query, but did not notice anything? Or is 10.10 the first version in the 10.X series you used?

>> Seeing the plan of your query may help as well. Could you run EXPLAIN
>> on it or does it crash before? Perhaps a parallel plan is involved
>> here?
>
> Explain plan works fine with no crash

I was also wondering about the shape of the plan selected here. The crash happens when computing selectivity for the clause while building a plan, but seeing the plan could still help.

> I've done a binary search to find out where the error occurs in the data,
> but no luck. It seems intermittent now. Finally, I was able to reproduce
> the error repeatably with a blank table:
>
> Stop and start postgres from fresh, and then run this query (notice, I
> removed a sarg from the originally supplied query):
> select id from natica_hdu_test
> WHERE
> "dec_range" <@ '[88.9999998611111,90.0000001388889)';

Thanks. I can see that you have provided a dump on the other part of the thread. Let's continue from that.
--
Michael
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Michael Paquier
Date:
On Mon, Nov 18, 2019 at 03:59:14PM -0800, Adam Scott wrote:
> It was bugging out with an empty table but it seems intermittent. A
> successful query followed by the one above will not fail sometimes; hence
> the need for a from-scratch startup instance of postgres.

Okay, that has proved to help a lot. I have restored the dump, and after an ANALYZE on natica_hdu_test I can see the failure immediately.

I have also extracted a small test case able to reproduce the failure. That's actually pretty close to the scenarios I was playing with upthread; still, I missed it:

CREATE TABLE rangetab (a numrange);
INSERT INTO rangetab
  SELECT ('['|| (45.0 - a::numeric/10000000) || ',' ||
          (45.1 + a::numeric/10000000) || ')')::numrange
  FROM generate_series(1,1000) as a;
ANALYZE rangetab;
SELECT a FROM rangetab
  WHERE a <@ '[89.9999998611111,90.0000001388889)';

I can easily reproduce the failure on HEAD, REL_12_STABLE and REL_11_STABLE. On 10, I could not. But now that we have a test case, that's only a matter of time.
--
Michael
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Adam Scott
Date:
Great! Thank you Michael and Tomas!
Maybe a clue to the issue: why does a numrange need to be detoasted, since that is for large strings?
Let me know how I can further help.
Cheers,
Adam
On Mon, Nov 18, 2019 at 5:48 PM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Nov 18, 2019 at 03:59:14PM -0800, Adam Scott wrote:
> It was bugging out with an empty table but it seems intermittent. A
> successful query followed by the one above will not fail sometimes; hence
> the need for a from-scratch startup instance of postgres.
Okay, that has proved to help a lot. I have restored the dump, and
after an analyze on natica_hdu_test I can see the failure
immediately.
I have also extracted a small test case able to reproduce the failure.
That's actually pretty close to the scenarios I was playing with
upthread, still I missed it:
CREATE TABLE rangetab (a numrange);
INSERT INTO rangetab
SELECT ('['|| (45.0 - a::numeric/10000000) || ',' ||
(45.1 + a::numeric/10000000) || ')')::numrange
FROM generate_series(1,1000) as a;
ANALYZE rangetab;
SELECT a FROM rangetab
WHERE a <@ '[89.9999998611111,90.0000001388889)';
I can easily reproduce the failure on HEAD, REL_12_STABLE and
REL_11_STABLE. On 10, I could not. But now that we have a test case
that's only a matter of time.
--
Michael
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Michael Paquier
Date:
On Mon, Nov 18, 2019 at 06:22:11PM -0800, Adam Scott wrote:
> Maybe a clue to the issue, why does a numrange need to be detoasted since
> that is for large strings?

I looked at the code a bit but could not really figure out what could go wrong. Then I switched my strategy and bisected for the first commit leading to the failure, and this is pointing me to the integration of stdbool.h for my environment (Debian SID with kernel at 5.3):

commit: 9a95a77d9d5d3003d2d67121f2731b6e5fc37336
author: Peter Eisentraut <peter_e@gmx.net>
date: Thu, 22 Mar 2018 20:42:25 -0400
Use stdbool.h if suitable

Using the standard bool type provided by C allows some recent compilers and debuggers to give better diagnostics. Also, some extension code and third-party headers are increasingly pulling in stdbool.h, so it's probably saner if everyone uses the same definition.

I have also checked that manually and I am getting a match. And the thing is that this commit did not make it into 10...
--
Michael
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Tomas Vondra
Date:
On Tue, Nov 19, 2019 at 12:04:28PM +0900, Michael Paquier wrote:
>On Mon, Nov 18, 2019 at 06:22:11PM -0800, Adam Scott wrote:
>> Maybe a clue to the issue, why does a numrange need to be detoasted since
>> that is for large strings?
>
>I looked at the code a bit but could not really figure out what could
>go wrong. Then I switched my strategy and bisected for the first
>commit leading to the failure, and this is pointing me to the
>integration of stdbool.h for my environment (Debian SID with kernel at
>5.3):
>
>commit: 9a95a77d9d5d3003d2d67121f2731b6e5fc37336
>author: Peter Eisentraut <peter_e@gmx.net>
>date: Thu, 22 Mar 2018 20:42:25 -0400
>Use stdbool.h if suitable
>
>Using the standard bool type provided by C allows some recent compilers
>and debuggers to give better diagnostics. Also, some extension code and
>third-party headers are increasingly pulling in stdbool.h, so it's
>probably saner if everyone uses the same definition.
>
>I have also checked that manually and I am getting a match. And the
>thing is that this commit did not make it through 10...

Not sure I understand. How could this be the root cause when it's not present in the 10.x branch (which is what Adam is using)?

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Tomas Vondra
Date:
On Tue, Nov 19, 2019 at 10:48:27AM +0900, Michael Paquier wrote:
>On Mon, Nov 18, 2019 at 03:59:14PM -0800, Adam Scott wrote:
>> It was bugging out with an empty table but it seems intermittent. A
>> successful query followed by the one above will not fail sometimes; hence
>> the need for a from-scratch startup instance of postgres.
>
>Okay, that has proved to help a lot. I have restored the dump, and
>after an analyze on natica_hdu_test I can see the failure
>immediately.
>
>I have also extracted a small test case able to reproduce the failure.
>That's actually pretty close to the scenarios I was playing with
>upthread, still I missed it:
>
>CREATE TABLE rangetab (a numrange);
>INSERT INTO rangetab
>  SELECT ('['|| (45.0 - a::numeric/10000000) || ',' ||
>          (45.1 + a::numeric/10000000) || ')')::numrange
>  FROM generate_series(1,1000) as a;
>ANALYZE rangetab;
>SELECT a FROM rangetab
>  WHERE a <@ '[89.9999998611111,90.0000001388889)';
>
>I can easily reproduce the failure on HEAD, REL_12_STABLE and
>REL_11_STABLE. On 10, I could not. But now that we have a test case
>that's only a matter of time.

FWIW I've managed to reproduce this on 10, but I had to build without --enable-cassert. So this does trigger the issue:

$ ./configure --enable-debug --prefix=/home/user/pg-10
$ make -s clean
$ make -s -j8 install
$ pg_ctl -D /tmp/data init
$ pg_ctl -D /tmp/data -l pg.log start
$ createdb test
$ psql test < bug.sql
$ pgbench -n -c 1 -f q.sql test

where q.sql is simply

select id from natica_hdu_test WHERE "dec_range" <@ '[88.9999998611111,90.0000001388889)';

Haven't investigated further yet.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Michael Paquier
Date:
On Tue, Nov 19, 2019 at 11:38:13AM +0100, Tomas Vondra wrote:
> FWIW I've managed to reproduce this on 10, but I had to build without
> --enable-cassert. So this does trigger the issue:
>
> Haven't investigated further yet.

If you add an ANALYZE on the table natica_hdu_test after restoring, I am rather sure that you would reproduce the crash more quickly, because the handling around the stats of the column is busted here. Anyway, taking my example from upthread, I have been also able to reproduce the problem on REL_10_STABLE even with assertions enabled: the trick is that you need to leave the session once after the ANALYZE on the table. Then a SELECT within a new session is enough to crash the server. The change with stdbool.h actually makes the crash easier to reproduce, as there is no need to leave the session. I am not sure how it mattered...

[ ... And one bisect later ... ]

This looks more correct as culprit than the precedent, because it touches the area of the crash:

commit: 9aab83fc5039d83e84144b7bed3fb1d62a74ae78
author: Tom Lane <tgl@sss.pgh.pa.us>
date: Sat, 13 May 2017 15:14:39 -0400
Redesign get_attstatsslot()/free_attstatsslot() for more safety and speed.

It seems to me that we are simply freeing an area which still needs to be accessed for the stat estimations.
--
Michael
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Michael Paquier
Date:
On Tue, Nov 19, 2019 at 11:18:18AM +0100, Tomas Vondra wrote:
> Not sure I understand. How could this be the root cause when it's not
> present in the 10.x branch (which is what Adam is using)?

Sorry for the misunderstanding here. The switch to stdbool.h visibly makes it easier for the issue to appear, but with more retries of the test case while bisecting, the error originates from 9aab83f.
--
Michael
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Michael Paquier
Date:
On Tue, Nov 19, 2019 at 08:40:56PM +0900, Michael Paquier wrote:
> If you add an ANALYZE on the table natica_hdu_test after restoring, I
> am rather sure that you would reproduce the crash more quickly because
> the handling around the stats of the column are busted here. Anyway,
> taking my example of upthread, I have been also able to reproduce the
> problem on REL_10_STABLE even with assertions enabled: the trick is
> that you need to leave once the session after the analyze on the
> table. Then a SELECT within a new session is enough to crash the
> server.

So... I have looked more at this one, and from my previous example it seems that we have an off-by-one error when looking up the arrays holding the histograms for ranges (lower and upper bounds).

In my previous example, we get to build 101 RangeBounds when beginning to calculate the range operator selectivity in calc_hist_selectivity(). However, when we get to the point of calc_hist_selectivity_contained(), upper_index gets calculated as 100, which is just at the limit of the indexed bounds, and the code would happily look at the last bound as well as the one-after-the-last bound as range_cmp_bounds() sees fit, but the latter just points to the void. The code looks wrong since its introduction in 59d0bf9d, and it seems that the changes done for free_attstatsslot() in 9aab83f make the issue more easily reproducible.

A fix like the rough POC attached addresses the issue, but I think that's too naive in not accounting for the first bin in the ranges evaluated. Tomas, you may be more familiar with this area of the code than I am. What do you think?
--
Michael
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Andrey Borodin
Date:
> On 10 Dec 2019, at 10:38, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Nov 19, 2019 at 08:40:56PM +0900, Michael Paquier wrote:
>> If you add an ANALYZE on the table natica_hdu_test after restoring, I
>> am rather sure that you would reproduce the crash more quickly because
>> the handling around the stats of the column are busted here. Anyway,
>> taking my example of upthread, I have been also able to reproduce the
>> problem on REL_10_STABLE even with assertions enabled: the trick is
>> that you need to leave once the session after the analyze on the
>> table. Then a SELECT within a new session is enough to crash the
>> server.
>
> So... I have looked more at this one, and from my previous example it
> seems that we have a one-off error when looking up at the array
> holding the histograms for ranges (lower and upper bound).
>
> In my previous example, we get to build 101 RangeBounds when beginning
> to calculate the range operator selectivity in
> calc_hist_selectivity(). However, when we get to the point of
> calc_hist_selectivity_contained(), upper_index gets calculated at 100
> which is just at the limit of the indexed bounds, and the code would
> happily look at the last bound as well as the one-after-the-last bound
> as range_cmp_bounds() sees fit, but the latter just points to the
> void. The code looks wrong since its introduction in 59d0bf9d and
> it seems that the changes done for free_attstatsslot() in 9aab83f make
> the issue more easily reproducible.
>
> A fix like the rough POC attached addresses the issue, but I think
> that's too naive to not count for the first bin in the ranges
> evaluated. Tomas, you may be more familiar with this area of the code
> than I am. What do you think?

So... I was looking for some patches on CF and found this one. It's a one-liner, what could go wrong?

1. The new tests do not fail on my machine on the added test case without that line (on the master branch; from this thread I concluded that they should fail).

2. I believe the line should be not like

   + for (i = upper_index - 1; i >= 0; i--)

   but rather

   + for (i = min(upper_index, hist_nvalues - 2); i >= 0; i--)

I will dig into this during this CF. Currently, that's my 2 cents.

Best regards, Andrey Borodin.
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Andrey Borodin
Date:
> On 4 Jan 2020, at 0:05, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> I believe the line should be not like
> + for (i = upper_index - 1; i >= 0; i--)
> but rather
> + for (i = min(upper_index, hist_nvalues - 2); i >= 0; i--)
>
> I will dig into this during this CF. Currently, that's my 2 cents.

I think I have a slightly more correct fix.

As far as I understand, we have an array of bin lower bounds, hist_lower, with size hist_nvalues. There is a function get_position(.., value, lower, upper), which calculates the relative position of value between the lower and upper bound. We call get_position(typcache, lower, &hist_lower[i], &hist_lower[i + 1]) when i is the last bin, i.e. i + 1 == hist_nvalues, thus passing a bogus upper to get_position(). PFA a possible fix for this. Elsewhere in the code, the same situation where upper is undefined is treated as if get_position() returned 0.

Also, while get_position() is not prepared for a bogus upper, it is well aware of infinite bounds. We could just add a last infinite value to hist_lower and remove the upper_index < hist_nvalues - 1 and i + 1 != hist_nvalues checks entirely.

Best regards, Andrey Borodin.
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Tom Lane
Date:
Andrey Borodin <x4mmm@yandex-team.ru> writes:
> PFA possible fix for this.

I looked into this a bit too. I don't have a lot of faith in Michael's example; it took me four tries to find a machine it would fail on. Still, I did eventually get a crash, and I concur that the problem is that calc_hist_selectivity_contained tries to access the entry after the histogram end.

I believe that Michael's draft fix is actually about right, because we don't want to treat the key space above the histogram upper bound as being an actual bin: we should assume that there are no values there. So just starting the loop at the last real bin is sufficient. However, I'd prefer to handle the two edge cases (i.e., probe value below histogram lower bound or above histogram upper bound) explicitly, because that makes it more apparent what's going on.

There are a couple of other ways in which this code seems insufficiently paranoid to me:

* It seems worth spending a couple lines of code at the beginning to verify that the histogram has at least one bin, as it already did for the range-length histogram. Otherwise we've got various divide-by-zero hazards here, along with a need for extra tests in the looping functions.

* I think we'd better guard against NaNs coming back from the range subdiff function, as well as the possibility of getting a NaN from the division in get_position (compare [1]).

* I am not quite totally sure about this part, but it seems to me that calc_hist_selectivity_contains probably ought to handle the range-bound cases the same as calc_hist_selectivity_contained, even though only the latter is trying to access an unsafe array index.

That all leads me to the attached draft patch. It lacks a test case --- I wonder if we can find one that crashes more portably than Michael's? And more eyeballs on calc_hist_selectivity_contains would be good too.
regards, tom lane

[1] https://www.postgresql.org/message-id/12957.1577974305@sss.pgh.pa.us

diff --git a/src/backend/utils/adt/rangetypes_selfuncs.c b/src/backend/utils/adt/rangetypes_selfuncs.c
index a4d7a7a..0a41016 100644
--- a/src/backend/utils/adt/rangetypes_selfuncs.c
+++ b/src/backend/utils/adt/rangetypes_selfuncs.c
@@ -400,6 +400,13 @@ calc_hist_selectivity(TypeCacheEntry *typcache, VariableStatData *vardata,
                            ATTSTATSSLOT_VALUES)))
         return -1.0;

+    /* check that it's a histogram, not just a dummy entry */
+    if (hslot.nvalues < 2)
+    {
+        free_attstatsslot(&hslot);
+        return -1.0;
+    }
+
     /*
      * Convert histogram of ranges into histograms of its lower and upper
      * bounds.
@@ -696,21 +703,22 @@ get_position(TypeCacheEntry *typcache, const RangeBound *value, const RangeBound
         return 0.5;

     /* Calculate relative position using subdiff function. */
-    bin_width = DatumGetFloat8(FunctionCall2Coll(
-                                                 &typcache->rng_subdiff_finfo,
+    bin_width = DatumGetFloat8(FunctionCall2Coll(&typcache->rng_subdiff_finfo,
                                                  typcache->rng_collation,
                                                  hist2->val,
                                                  hist1->val));
-    if (bin_width <= 0.0)
+    if (isnan(bin_width) || bin_width <= 0.0)
         return 0.5;             /* zero width bin */

-    position = DatumGetFloat8(FunctionCall2Coll(
-                                                &typcache->rng_subdiff_finfo,
+    position = DatumGetFloat8(FunctionCall2Coll(&typcache->rng_subdiff_finfo,
                                                 typcache->rng_collation,
                                                 value->val,
                                                 hist1->val))
         / bin_width;

+    if (isnan(position))
+        return 0.5;             /* punt if we have any NaNs or Infs */
+
     /* Relative position must be in [0,1] range */
     position = Max(position, 0.0);
     position = Min(position, 1.0);
@@ -806,11 +814,19 @@ get_distance(TypeCacheEntry *typcache, const RangeBound *bound1, const RangeBoun
      * value of 1.0 if no subdiff is available.
      */
     if (has_subdiff)
-        return
-            DatumGetFloat8(FunctionCall2Coll(&typcache->rng_subdiff_finfo,
-                                             typcache->rng_collation,
-                                             bound2->val,
-                                             bound1->val));
+    {
+        float8      res;
+
+        res = DatumGetFloat8(FunctionCall2Coll(&typcache->rng_subdiff_finfo,
+                                               typcache->rng_collation,
+                                               bound2->val,
+                                               bound1->val));
+        /* Protect against possible NaN result */
+        if (isnan(res))
+            return 1.0;
+        else
+            return res;
+    }
     else
         return 1.0;
 }
@@ -1021,16 +1037,28 @@ calc_hist_selectivity_contained(TypeCacheEntry *typcache,
                             false);

     /*
+     * If the upper bound value is below the histogram's lower limit, there
+     * are no matches.
+     */
+    if (upper_index < 0)
+        return 0.0;
+
+    /*
+     * If the upper bound value is at or beyond the histogram's upper limit,
+     * start our loop at the last actual bin.  (This corresponds to assuming
+     * that the data population above the histogram's upper limit is empty,
+     * exactly like what we just assumed for the lower limit.)
+     */
+    upper_index = Min(upper_index, hist_nvalues - 2);
+
+    /*
      * Calculate upper_bin_width, ie. the fraction of the (upper_index,
      * upper_index + 1) bin which is greater than upper bound of query range
      * using linear interpolation of subdiff function.
      */
-    if (upper_index >= 0 && upper_index < hist_nvalues - 1)
-        upper_bin_width = get_position(typcache, upper,
-                                       &hist_lower[upper_index],
-                                       &hist_lower[upper_index + 1]);
-    else
-        upper_bin_width = 0.0;
+    upper_bin_width = get_position(typcache, upper,
+                                   &hist_lower[upper_index],
+                                   &hist_lower[upper_index + 1]);

     /*
      * In the loop, dist and prev_dist are the distance of the "current" bin's
@@ -1125,15 +1153,27 @@ calc_hist_selectivity_contains(TypeCacheEntry *typcache,
                            true);

     /*
+     * If the lower bound value is below the histogram's lower limit, there
+     * are no matches.
+     */
+    if (lower_index < 0)
+        return 0.0;
+
+    /*
+     * If the lower bound value is at or beyond the histogram's upper limit,
+     * start our loop at the last actual bin.  (This corresponds to assuming
+     * that the data population above the histogram's upper limit is empty,
+     * exactly like what we just assumed for the lower limit.)
+     */
+    lower_index = Min(lower_index, hist_nvalues - 2);
+
+    /*
      * Calculate lower_bin_width, ie. the fraction of the of (lower_index,
      * lower_index + 1) bin which is greater than lower bound of query range
      * using linear interpolation of subdiff function.
      */
-    if (lower_index >= 0 && lower_index < hist_nvalues - 1)
-        lower_bin_width = get_position(typcache, lower, &hist_lower[lower_index],
-                                       &hist_lower[lower_index + 1]);
-    else
-        lower_bin_width = 0.0;
+    lower_bin_width = get_position(typcache, lower, &hist_lower[lower_index],
+                                   &hist_lower[lower_index + 1]);

     /*
      * Loop through all the lower bound bins, smaller than the query lower
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Michael Paquier
Date:
On Mon, Jan 06, 2020 at 03:28:34PM -0500, Tom Lane wrote:
> I believe that Michael's draft fix is actually about right, because
> we don't want to treat the key space above the histogram upper bound
> as being an actual bin: we should assume that there are no values there.
> So just starting the loop at the last real bin is sufficient. However,
> I'd prefer to handle the two edge cases (i.e., probe value below histogram
> lower bound or above histogram upper bound) explicitly, because that makes
> it more apparent what's going on.

Actually, I suspect that the original error comes from a copy-pasto from
calc_hist_selectivity_contains to calc_hist_selectivity_contained when the
code was originally written.

> There are a couple of other ways in which this code seems insufficiently
> paranoid to me:
>
> * It seems worth spending a couple lines of code at the beginning to
> verify that the histogram has at least one bin, as it already did for
> the range-length histogram. Otherwise we've got various divide-by-zero
> hazards here, along with a need for extra tests in the looping functions.

Fine by me.

> * I think we'd better guard against NaNs coming back from the range
> subdiff function, as well as the possibility of getting a NaN from
> the division in get_position (compare [1]).

Good idea.

> * I am not quite totally sure about this part, but it seems to me
> that calc_hist_selectivity_contains probably ought to handle the
> range-bound cases the same as calc_hist_selectivity_contained,
> even though only the latter is trying to access an unsafe array
> index.

Both could actually be merged, no? The assumptions behind the upper/lower
bound lookups are actually just mirrors of each other based on which side of
the operators <@ or @> the code looks at. Perhaps the current code is cleaner
as done currently anyway..

> That all leads me to the attached draft patch. It lacks a test
> case --- I wonder if we can find one that crashes more portably
> than Michael's? And more eyeballs on calc_hist_selectivity_contains
> would be good too.

It would be nice to keep the test case, as it crashes immediately on my own
Debian laptop. Anyway, if we add a test, we really need something that pokes
at the histogram bin at the upper bound. Just something like the attached on
HEAD proves that we have zero coverage for it:

--- a/src/backend/utils/adt/rangetypes_selfuncs.c
+++ b/src/backend/utils/adt/rangetypes_selfuncs.c
@@ -1050,6 +1050,7 @@ calc_hist_selectivity_contained(TypeCacheEntry *typcache,
         double      dist;
         double      length_hist_frac;
         bool        final_bin = false;
+        Assert(i == upper_index);

So that's not good from the start..

-    if (bin_width <= 0.0)
+    if (isnan(bin_width) || bin_width <= 0.0)
         return 0.5;             /* zero width bin */

This comment is incorrect now?

+    if (isnan(position))
+        return 0.5;             /* punt if we have any NaNs or Infs */
+

This comment is incorrect as well, no? In this case both histograms have
finite bounds, and you just check if that's a number.

--
Michael
Attachment
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Mon, Jan 06, 2020 at 03:28:34PM -0500, Tom Lane wrote:
>> * I am not quite totally sure about this part, but it seems to me
>> that calc_hist_selectivity_contains probably ought to handle the
>> range-bound cases the same as calc_hist_selectivity_contained,
>> even though only the latter is trying to access an unsafe array
>> index.

> Both could actually be merged, no? The assumptions behind the
> upper/lower bound lookups are actually just mirrors of each other
> based on which side of the operators <@ or @> the code looks at.

Hmm ... maybe, but I'm not entirely convinced that they should be mirror
images. The endpoint handling is the same right now, but should it be?
Anyway, merging them seems out of scope for a bug fix.

>> That all leads me to the attached draft patch. It lacks a test
>> case --- I wonder if we can find one that crashes more portably
>> than Michael's? And more eyeballs on calc_hist_selectivity_contains
>> would be good too.

> It would be nice to keep the test case as it crashes immediately on
> own Debian laptop.

I wouldn't mind having a test case if it crashes reliably, but it doesn't for
me. I poked into why not, and realized that even if get_position is passed a
pointer to garbage data in hist2, it will not crash unless the
"!hist2->infinite" test succeeds, because otherwise it won't call the
rng_subdiff function. Depending on compiler and optimization level, that
usually means that any nonzero value in that byte will prevent a crash. Given
that the whole point here is we're accessing memory that hasn't been
initialized by this code, we don't have any way to make it likely that that
won't be true. (Of course, we could have calc_hist_selectivity() allocate one
extra array slot and zero it, but that's getting a bit silly.)
BTW, I noticed that at -O0, my compiler generates a code sequence that
effectively implements that test as "hist2->infinite != true", ie it will
succeed for any bit pattern that's not exactly 0x01. This greatly raises the
odds of seeing the crash of course (so maybe your compiler is doing
likewise?). I find that rather disturbing because my mental model has always
been that any nonzero bit pattern in a bool would be interpreted as true.
Apparently the gcc boys think that they're entitled to consider _Bool values
other than exactly 0 and 1 as undefined, which is scary because I'm not sure
we are totally consistent about that everywhere. It definitely makes it harder
to reason about what will happen with garbage memory contents. (This might
also explain why you saw a change in behavior at 9a95a77d9d.)

So the bottom line here is that I don't think that we can build a test case
that will crash with any large probability, especially not if it's just one
part of a test script --- the longer the session has been running, the greater
the odds are that whatever is just past the allocated histogram array contains
nonzero garbage. I'm not sure it's worth spending test cycles forevermore on a
test that most likely will fail to reveal a bug even if it's there.

> -    if (bin_width <= 0.0)
> +    if (isnan(bin_width) || bin_width <= 0.0)
>          return 0.5;             /* zero width bin */
> This comment is incorrect now?

Yeah, it could be improved. Maybe "punt for NaN or zero-width bin"?

> +    if (isnan(position))
> +        return 0.5;             /* punt if we have any NaNs or Infs */
> +
> This comment is incorrect as well, no? In this case both histograms
> have finite bounds, and you just check if that's a number.

The point is that we could get a NaN from the division either from the
rng_subdiff function returning NaN, or from an undefined case such as Inf/Inf.
Maybe write "punt for NaN input, Inf/Inf, etc"? (We don't need to test for Inf
explicitly, because the tests just below can deal with that.)

regards, tom lane
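The NaN hardening discussed in this exchange can be sketched as a standalone function. The name and shape here are illustrative assumptions, not the actual get_position() code; the point is the two punt paths: a NaN (or non-positive) bin width, and a NaN arising from the division itself (e.g. Inf/Inf).

```c
#include <math.h>

/* Sketch of the NaN guards discussed above (illustrative only, not
 * the actual PostgreSQL get_position() code).  value_diff and
 * bin_width stand in for the two subdiff-function results. */
double
safe_position(double value_diff, double bin_width)
{
    double position;

    if (isnan(bin_width) || bin_width <= 0.0)
        return 0.5;             /* punt for NaN or zero-width bin */

    position = value_diff / bin_width;
    if (isnan(position))
        return 0.5;             /* punt for NaN input, Inf/Inf, etc */

    /* Relative position must be in the [0,1] range */
    if (position < 0.0)
        position = 0.0;
    if (position > 1.0)
        position = 1.0;
    return position;
}
```

Note that an infinite value_diff with a finite bin_width needs no explicit test: the division yields +/-Inf, which the clamping at the end handles.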
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Michael Paquier
Date:
On Tue, Jan 07, 2020 at 10:08:07AM -0500, Tom Lane wrote:
> Hmm ... maybe, but I'm not entirely convinced that they should be
> mirror images. The endpoint handling is the same right now, but
> should it be? Anyway, merging them seems out of scope for a bug fix.

Sure. That was just a random thought while reading the code again.

> I wouldn't mind having a test case if it crashes reliably, but it
> doesn't for me. I poked into why not, and realized that even if
> get_position is passed a pointer to garbage data in hist2, it will
> not crash unless the "!hist2->infinite" test succeeds, because
> otherwise it won't call the rng_subdiff function. Depending
> on compiler and optimization level, that usually means that any
> nonzero value in that byte will prevent a crash. Given that the
> whole point here is we're accessing memory that hasn't been
> initialized by this code, we don't have any way to make it likely
> that that won't be true.

Ah, OK. That makes sense. If you don't wish to keep the test around in core,
that's fine by me. I may keep it in one of my own repos just for history's
sake.

> BTW, I noticed that at -O0, my compiler generates a code sequence
> that effectively implements that test as "hist2->infinite != true",
> ie it will succeed for any bit pattern that's not exactly 0x01.
> This greatly raises the odds of seeing the crash of course (so maybe
> your compiler is doing likewise?). I find that rather disturbing
> because my mental model has always been that any nonzero bit pattern
> in a bool would be interpreted as true. Apparently the gcc boys think
> that they're entitled to consider _Bool values other than exactly 0
> and 1 as undefined, which is scary because I'm not sure we are totally
> consistent about that everywhere. It definitely makes it harder to
> reason about what will happen with garbage memory contents. (This
> might also explain why you saw a change in behavior at 9a95a77d9d.)
That would not be the first time my environment behaves differently than the
others. My build scripts have been pretty good at catching issues with
compiled code, particularly when it comes to ordering.

>> -    if (bin_width <= 0.0)
>> +    if (isnan(bin_width) || bin_width <= 0.0)
>>          return 0.5;             /* zero width bin */
>> This comment is incorrect now?
>
> Yeah, it could be improved. Maybe "punt for NaN or zero-width bin"?

Sounds fine to me.

>> +    if (isnan(position))
>> +        return 0.5;             /* punt if we have any NaNs or Infs */
>> +
>> This comment is incorrect as well, no? In this case both histograms
>> have finite bounds, and you just check if that's a number.
>
> The point is that we could get a NaN from the division either from
> the rng_subdiff function returning NaN, or from an undefined case
> such as Inf/Inf. Maybe write "punt for NaN input, Inf/Inf, etc"?
> (We don't need to test for Inf explicitly, because the tests just
> below can deal with that.)

I would rewrite this comment for clarity. But I am fine to leave that to you
in the end if you feel differently.

--
Michael
Attachment
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Tue, Jan 07, 2020 at 10:08:07AM -0500, Tom Lane wrote:
>> I wouldn't mind having a test case if it crashes reliably, but it
>> doesn't for me.

> Ah, OK. That makes sense. If you don't wish to keep the test around
> in core, that's fine to me. I may keep that into one of my own repos
> just for history's sake.

OK. Hearing no further comments, I made another pass over the code (improved
some comments, added one more guard against silly subdiff results) and pushed
this.

regards, tom lane
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Adam Scott
Date:
Thank you, Michael! I know this is a late response, but I didn't want to clutter the message stream until we had confirmation. I applied this patch to our server and could not replicate the bug, so the patch is working for us presently.
Thanks again Michael,
Adam
On Mon, Dec 9, 2019 at 9:38 PM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Nov 19, 2019 at 08:40:56PM +0900, Michael Paquier wrote:
> If you add an ANALYZE on the table natica_hdu_test after restoring, I
> am rather sure that you would reproduce the crash more quickly because
> the handling around the stats of the column are busted here. Anyway,
> taking my example of upthread, I have been also able to reproduce the
> problem on REL_10_STABLE even with assertions enabled: the trick is
> that you need to leave once the session after the analyze on the
> table. Then a SELECT within a new session is enough to crash the
> server.
So... I have looked more at this one, and from my previous example it
seems that we have an off-by-one error when looking up the array
holding the histograms for ranges (lower and upper bounds).
In my previous example, we get to build 101 RangeBounds when beginning
to calculate the range operator selectivity in
calc_hist_selectivity(). However, when we get to the point of
calc_hist_selectivity_contained(), upper_index gets calculated at 100
which is just at the limit of the indexed bounds, and the code would
happily look at the last bound as well as the one-after-the-last bound
as range_cmp_bounds() sees fit, but the latter just points to the
void. The code has looked wrong since its introduction in 59d0bf9d, and
it seems that the changes done for free_attstatsslot() in 9aab83f make
the issue more easily reproducible.
A fix like the rough POC attached addresses the issue, but I think
it is too naive, as it does not account for the first bin in the ranges
evaluated. Tomas, you may be more familiar with this area of the code
than I am. What do you think?
--
Michael
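The out-of-bounds pattern Michael describes (101 RangeBounds, upper_index computed as 100) can be restated as a small predicate. This is an illustrative reconstruction, not the actual code:

```c
#include <stdbool.h>

/* Illustrative restatement of the off-by-one: the loop reads both
 * hist_lower[i] and hist_lower[i + 1], so an index is only safe while
 * i + 1 is still a valid subscript of the hist_nvalues-element array. */
bool
bin_access_is_safe(int upper_index, int hist_nvalues)
{
    return upper_index >= 0 && upper_index + 1 <= hist_nvalues - 1;
}
```

With hist_nvalues == 101, upper_index == 100 fails this check: hist_lower[101] is past the end of the array, which is exactly the read that produced the bogus RangeBound and the eventual segfault.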
Re: BUG #16122: segfault pg_detoast_datum (datum=0x0) at fmgr.c:1833 numrange query
From
Michael Paquier
Date:
Hi Adam,

On Mon, Jan 13, 2020 at 10:56:32AM -0800, Adam Scott wrote:
> Thank you Michael! I know this is a late response but didn't want to
> clutter the message stream until we had confirmation, but I applied this
> patch to our server and could not replicate the bug. So this patch is
> working for us presently.

This was broken for ages, so a lot has been accomplished thanks to you taking
the time to report the bug in the first place. Tom has committed a much more
paranoid (and better!) fix as of 652686a, so you will get the fix with the
next minor release planned for the second week of February, 10.12 in your
case. The committed patch did not include a regression test, but I am keeping
track of it.

--
Michael