Thread: Selectivity and row count estimates for JSONB columns
Hi all
I have a Postgres/PostGIS database with two separate database tables each of which has the following column structure:
- identifier (text)
- geometry (geometryz,27700)
- properties (jsonb)
I have created a GIN index (jsonb_path_ops) over the properties column on both tables.
As I understand it, Postgres is unable to calculate statistics on the contents of JSONB columns and so should use hard-coded estimates of selectivity when planning queries. However, despite both tables having identical structures, similar row counts, identical indexes, and both having been recently ANALYZEd, I am finding that the query planner is producing different row count estimates between the tables.
To expand: both tables have a 'description' field within their JSONB columns. I am executing the following query:
EXPLAIN
SELECT identifier, geometry, properties FROM table
WHERE properties @@ '$.description == "test"'
ORDER BY identifier;
I am expecting that the selectivity value used for such queries should be 0.010 which is the default result of the matchingsel selectivity function which the @@ operator uses when operating on jsonb.
For both tables, the planner opts for a Bitmap Index Scan -> Sort -> Gather Merge. However, the estimated number of rows returned from the Index Scan node differs substantially.
- Table A (1,611,752 rows): 159 estimated rows (this is roughly 0.01% and makes sense);
- Table B (1,656,110 rows): 16566 estimated rows (roughly 1% - why??)
This difference is causing the planner to come up with some strange plans for queries on Table B which it does not do for Table A. I know that both estimates are just that and that neither is 'better', but I would really prefer it to be consistent and use a selectivity of 0.01 rather than 0.1 as it should be doing.
I know this is getting long now, so apologies. But one thing I did notice and wondered if it could be related is that the pg_stats table for Table A seems to have values for histogram_bounds and correlation, but nulls for these values for Table B.
Any ideas?
Thanks
Joel
After some further investigation I think I might have actually got this the wrong way round. I now strongly suspect that what I am seeing is related to the presence of histogram_bounds for Table A (possibly a bug? or me being silly).
Looking into the code (https://github.com/postgres/postgres/blob/8c1144ba73478b818d9cebe8ecd64a14b7d45bde/src/backend/utils/adt/selfuncs.c):
- The matchingsel() function calls the generic_restriction_selectivity() function
- generic_restriction_selectivity() returns the default value (0.01) for data types that Postgres doesn't collect standard MCV and/or histogram statistics for. I think this is what happens with Table B which (quite correctly) does not have these statistics in pg_stats
- Table A does have histogram_bounds for some reason, so I suspect it gets further through the function, operating on garbage, to this line:
if (selec < 0.0001)
selec = 0.0001;
selec = 0.0001;
which ultimate ends up getting returned.
Therefore, Table A gets the 'correct' default value of 0.01 (which is actually 1%, silly me). Table B meanwhile gets an erroneous value of 0.0001 (0.1%) .
So my question now is: why is Postgres creating histogram_bounds on some jsonb columns but not others? I've looked at my CREATE TABLE scripts and they are identical. We have the database deployed in another environment and this issue reoccurs there too.
Any help would be appreciated :)
Thanks
Joel
On Wed, Oct 20, 2021 at 3:56 PM Joel Perren <joel.perren@gmail.com> wrote:
Hi allI have a Postgres/PostGIS database with two separate database tables each of which has the following column structure:- identifier (text)- geometry (geometryz,27700)- properties (jsonb)I have created a GIN index (jsonb_path_ops) over the properties column on both tables.As I understand it, Postgres is unable to calculate statistics on the contents of JSONB columns and so should use hard-coded estimates of selectivity when planning queries. However, despite both tables having identical structures, similar row counts, identical indexes, and both having been recently ANALYZEd, I am finding that the query planner is producing different row count estimates between the tables.To expand: both tables have a 'description' field within their JSONB columns. I am executing the following query:EXPLAINSELECT identifier, geometry, properties FROM tableWHERE properties @@ '$.description == "test"'ORDER BY identifier;I am expecting that the selectivity value used for such queries should be 0.010 which is the default result of the matchingsel selectivity function which the @@ operator uses when operating on jsonb.For both tables, the planner opts for a Bitmap Index Scan -> Sort -> Gather Merge. However, the estimated number of rows returned from the Index Scan node differs substantially.- Table A (1,611,752 rows): 159 estimated rows (this is roughly 0.01% and makes sense);- Table B (1,656,110 rows): 16566 estimated rows (roughly 1% - why??)This difference is causing the planner to come up with some strange plans for queries on Table B which it does not do for Table A. I know that both estimates are just that and that neither is 'better', but I would really prefer it to be consistent and use a selectivity of 0.01 rather than 0.1 as it should be doing.I know this is getting long now, so apologies. But one thing I did notice and wondered if it could be related is that the pg_stats table for Table A seems to have values for histogram_bounds and correlation, but nulls for these values for Table B.Any ideas?ThanksJoel
Joel Perren <joel.perren@gmail.com> writes: > - generic_restriction_selectivity() returns the default value (0.01) for > data types that Postgres doesn't collect standard MCV and/or histogram > statistics for. I think this is what happens with Table B which (quite > correctly) does not have these statistics in pg_stats There's nothing "correct" about that. JSONB does have sorting support (admittedly with a pretty arbitrary sort order), so I'd expect ANALYZE to collect a histogram as well as MCV values for it. Perhaps for your Table B it's omitting the histogram because the MCV list captures the entire contents of the column? Or you've got auto-analyze disabled for some reason? Anyway, assuming you're using v14, the idea of matchingsel() is to apply the given restriction clause to all the MCV and histogram entries to see how many satisfy it [1]. In principle this'll produce a far better estimate than any fixed default could provide. If you're coming out with a crappy estimate, you might be able to improve matters by increasing the column's statistics target so that more MCV and histogram entries are collected. regards, tom lane [1] https://www.postgresql.org/message-id/flat/12237.1582833074%40sss.pgh.pa.us