Re: BUG #19443: GIN jsonb_path_ops index not used when many rows contain nested null values (PostgreSQL 18.3) - Mailing list pgsql-bugs

From Tom Lane
Subject Re: BUG #19443: GIN jsonb_path_ops index not used when many rows contain nested null values (PostgreSQL 18.3)
Date
Msg-id 2546010.1774897426@sss.pgh.pa.us
Whole thread Raw
In response to BUG #19443: GIN jsonb_path_ops index not used when many rows contain nested null values (PostgreSQL 18.3)  (PG Bug reporting form <noreply@postgresql.org>)
List pgsql-bugs
PG Bug reporting form <noreply@postgresql.org> writes:
> When a jsonb column has a high proportion of rows with {"a": null}, and a
> small
> subset with valid nested paths like {"a":{"b":{"c":1}}}, the query planner
> chooses a sequential scan instead of using a jsonb_path_ops GIN index for
> the
> path query @? '$.a.b.c' LIMIT 10.
> This leads to unexpected performance degradation.

If I force the plan selection with enable_seqscan, the query gets
slower, and not by a small amount:

regression=# EXPLAIN (ANALYZE)
SELECT * FROM test_jsonb
WHERE data @? '$.a.b.c'
LIMIT 10;
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..1.91 rows=10 width=22) (actual time=0.009..0.024 rows=10 loops=1)
   ->  Seq Scan on test_jsonb  (cost=0.00..1938.00 rows=10130 width=22) (actual time=0.008..0.022 rows=10 loops=1)
         Filter: (data @? '$."a"."b"."c"'::jsonpath)
         Rows Removed by Filter: 75
 Planning Time: 0.060 ms
 Execution Time: 0.036 ms
(6 rows)

regression=# set enable_seqscan TO 0;
SET
regression=# EXPLAIN (ANALYZE)
SELECT * FROM test_jsonb
WHERE data @? '$.a.b.c'
LIMIT 10;
                                                               QUERY PLAN
                

----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=150.51..151.31 rows=10 width=22) (actual time=4.963..4.979 rows=10 loops=1)
   ->  Bitmap Heap Scan on test_jsonb  (cost=150.51..965.13 rows=10130 width=22) (actual time=4.963..4.977 rows=10
loops=1)
         Recheck Cond: (data @? '$."a"."b"."c"'::jsonpath)
         Rows Removed by Index Recheck: 75
         Heap Blocks: exact=1
         ->  Bitmap Index Scan on idx_test_jsonb  (cost=0.00..147.98 rows=10130 width=0) (actual time=4.900..4.900
rows=100000loops=1) 
               Index Cond: (data @? '$."a"."b"."c"'::jsonpath)
 Planning Time: 0.122 ms
 Execution Time: 4.998 ms
(9 rows)

So it appears to me that the planner's decision was correct.

It looks like the index is unable to distinguish real matches
from false matches in this example, so it returns all 100000
rows and the bitmap heap scan has to filter most of them away.
So that's why it's slow.  I'm not sure offhand if the planner's
cost model is accurately modeling this situation or if it
just got lucky ... but in any case, you've not demonstrated
that there's a planner bug.

            regards, tom lane



pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #19446: Domain DEFAULT not reflected in system catalogs and information_schema (PG 18.3)
Next
From: Justin Christensen
Date:
Subject: Surprising behavior with pushing predicates down into a view