Re: Inaccurate Rows estimate for "Bitmap And" causes Planner tochoose wrong join - Mailing list pgsql-performance

From Steve Pritchard
Subject Re: Inaccurate Rows estimate for "Bitmap And" causes Planner tochoose wrong join
Date
Msg-id CAF7AqmyicqVv99XjO5vgS1N4+dSUWrgB_DMp6meHLNktzBpesg@mail.gmail.com
Whole thread Raw
In response to Re: Inaccurate Rows estimate for "Bitmap And" causes Planner tochoose wrong join  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-performance
Many thanks Justin & Jeff for your replies.

Presumbly the conditions are partially redundant, so loc_id => user_id

Yes you're right. I had overlooked this.

I've done some further testing and this confirms what you say: if the WHERE columns are independent, then the Planner makes a reasonable estimate of the number of rows. irrespective of whether it uses a single index or a "Bitmap And" of two indexes.

I've also tested "create statistics" on Postgres 12:
  • gives good estimate with WHERE user_id = 'USER123' and loc_id = 'LOC12345678'
  • but Plan Rows = 5 with WHERE user_id = 'USER123'  and loc_id = ANY('{ LOC12345678 }'::text[])
  • Note: if I omit the user_id condition then it gives a good estimate, i.e. with WHERE loc_id = ANY('{ LOC12345678 }'::text[])
So statistics objects don't seem to be able to handle the combination of dependencies and arrays (at least in 12.2).

Steve

On Wed, 6 May 2020 at 19:25, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, May 6, 2020 at 12:20 PM Steve Pritchard <steve.pritchard@bto.org> wrote:
Version: Postgres 9.6.3 production system (but also tested on Postgres 12)

For my query the Planner is sometimes choosing an execution plan that uses "Bitmap And" (depending on the parameters):

->  Bitmap Heap Scan on observation  (cost=484.92..488.93 rows=1 width=203) (actual time=233.129..330.886 rows=15636 loops=1)
  Recheck Cond: (((user_id)::text = 'USER123'::text) AND ((loc_id)::text = ANY ('{LOC12345678}'::text[])))

If you change " = ANY(array_of_one)" to " = scalar", does that change anything?  You might be able to fix this (in v12) using CREATE STATISTICS, but I don't know if that mechanism can see through the ANY(array_of_one) wrapper.
 
Note that in cases where the Planner selects a single Index Scan for this query (with different parameters), the Planner makes an accurate estimate of the number of rows and then makes sensible selections of joins (i.e. quick).
i.e. the issue seems to be with the "Bitmap And".


I don't know if this nitpick matters, but I don't think that that is how the planner works.  The row estimates work from the top down, not the bottom up.  The row estimate of 1 is based on what conditions the bitmap heap scan implements, it is not arrived at by combining the estimates from the index scans below it.  If it were to change to a different type of node but implemented the same conditions, I think it would have the same row estimate. 
 

I don't have an index with both user_id & loc_id, as this is one of several different combinations that can arise (it would require quite a few indexes to cover all the possible combinations).

Are you actually experiencing problems with those other combinations as well?  If not, I wouldn't worry about solving hypothetical problems.  If those other combinations are actually problems and you go with CREATE STATISTICS, then you would have to be creating a lot of different statistics.  That would still be ugly, but at least the overhead for statistics is lower than for indexes.
 
However if I did have such an index, the planner would presumably be able to use the statistics for user_id and loc_id to estimate the number of rows.

Indexes on physical columns do not have statistics, so making that index would not help with the estimation.  (Expressional indexes do have statistics, but I don't see that helping you here).   So while this node would execute faster with that index, it would still be kicking the unshown nested loop left join 15,636 times when it thinks it will be doing it once, and so would still be slow.  The most robust solution might be to make the outer part of that nested loop left join faster, so that your system would be more tolerant of statistics problems.  
 

So why can't it make an accurate estimate of the rows with a "Bitmap And" & " Bitmap Heap Scan"? (as above)

In the absence of custom statistics, it assumes the selectivities of user_id = 'USER123', of loc_id = ANY ('{LOC12345678}'::text[]), and of taxa = 'Birds' are all independent of each other and can be multiplied to arrive at the overall selectivity.  But clearly that is not the case.  Bird watchers mostly watch near where they live, not in random other places.

Cheers,

Jeff


--
Steve Pritchard
Database Developer

British Trust for Ornithology, The Nunnery, Thetford, Norfolk IP24 2PU, UK 
Tel: +44 (0)1842 750050, fax: +44 (0)1842 750030
Registered Charity No 216652 (England & Wales) No SC039193 (Scotland)
Company Limited by Guarantee No 357284 (England & Wales)

pgsql-performance by date:

Previous
From: Arya F
Date:
Subject: Re: 600 million rows of data. Bad hardware or need partitioning?
Next
From: James Thompson
Date:
Subject: Re: Please help! Query jumps from 1s -> 4m