Re: distinct estimate of a hard-coded VALUES list - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: distinct estimate of a hard-coded VALUES list
Date
Msg-id CAMkU=1w_sLaUHrBMmhCv3fHXYmO0yv+nRzQ58KN_A+XSx=7gRQ@mail.gmail.com
In response to Re: distinct estimate of a hard-coded VALUES list  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Mon, Aug 22, 2016 at 10:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Aug 20, 2016 at 4:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> On Thu, Aug 18, 2016 at 2:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> It does know it, what it doesn't know is how many duplicates there are.
>
>> Does it know whether the count comes from a parsed query-string list/array,
>> rather than being an estimate from something else?  If it came from a join,
>> I can see why it would be dangerous to assume they are mostly distinct.
>> But if someone throws 6000 things into a query string and only 200 distinct
>> values among them, they have no one to blame but themselves when it makes
>> bad choices off of that.
>
> I am not exactly sold on this assumption that applications have
> de-duplicated the contents of a VALUES or IN list.  They haven't been
> asked to do that in the past, so why do you think they are doing it?


I think most people wouldn't go out of their way to de-duplicate simply because the way they get the list is inherently deduplicated in the first place, or at least is not going to routinely have massive redundancy.  If I thought it was likely to have massive redundancy, then I would add code to deduplicate just because I don't want to lug redundant data around inside my own app server code, regardless of what it does to PostgreSQL's planner.

We can be sure that someone somewhere is sending huge lists with only a couple hundred distinct values among them, but I don't think we should let that block changes.

I think the more serious source of regressions will be cases where the current bad distinct estimate is cancelling out some other bad estimate elsewhere in the planner, yielding an efficient plan by accident.  Improving the estimate could push the plan over to a bad choice, leading to a large regression upon upgrading.  For that reason, I think it is better not to try to get your patch (which I tested and which looks good to me, except that it still isn't picking the best plan, for unrelated reasons) into 9.6 after beta4, but to just put it into 10.0.  I don't know when most people with stringent SLAs start validating the performance of new versions, but I suspect some are already doing so before beta4, and it wouldn't be nice to change this on them.
 

> It's hard to know, but my intuition is that most people would
> deduplicate.  I mean, nobody is going to want their query generator
> to send X IN (1, 1, <repeat a zillion more times>) to the server if it
> could have just sent X IN (1).

Yes, I agree with that.  But note that this isn't currently relevant to X IN (1) anyway.  'X IN (list)' gets planned as if it were "X=ANY(ARRAY...)", which goes through different machinery than "X=ANY(VALUES...)".

There is a nine-year-old to-do item which proposes (if I understand it correctly) to change this so that "X=ANY(ARRAY...)" goes through the same planning as "X=ANY(VALUES...)".  A problem with doing that is that currently people can, in lieu of planner hints, rewrite their queries between the two ways of expressing them, in case one way gives a better plan.  If they were always planned identically, it would take away that escape hatch.
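For concreteness, the different spellings look something like this (a sketch with a made-up table t and column x; only the first two currently share the scalar-array machinery):

```sql
-- These three are semantically equivalent, but are not all planned alike.
SELECT * FROM t WHERE x IN (1, 2, 3);               -- rewritten by the parser to x = ANY ('{1,2,3}')
SELECT * FROM t WHERE x = ANY (ARRAY[1, 2, 3]);     -- scalar-array expression, array scanned per row
SELECT * FROM t WHERE x IN (VALUES (1), (2), (3));  -- planned as a semijoin against a VALUES RTE
```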

The most conservative thing, given the current state, would be to fix "X=ANY(ARRAY...)" so that it builds a hash table out of the ARRAY, and then probes that table for each X, rather than iterating over the ARRAY for each X, which is what it currently does.  That way you would get the benefit of using a hash, without the risk of tipping over into some unsuitable join order in the process.  It should have no user-facing downsides, just the ugliness of having two ways to implement fundamentally the same thing.

So basically this line of a plan:

   Filter: (aid = ANY ('{...}'))

currently glosses over a nested loop over the array.  We could make it gloss over a hash probe into the "array" instead.
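For example (a hypothetical unindexed table t; costs and row counts omitted), the current plan reads roughly like:

```sql
EXPLAIN SELECT * FROM t WHERE x = ANY ('{1,5,9}'::int[]);
--  Seq Scan on t
--    Filter: (x = ANY ('{1,5,9}'::integer[]))
```

The EXPLAIN output wouldn't need to change; only what the executor does behind that Filter line would.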

Cheers,

Jeff
