Re: Additional improvements to extended statistics - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Additional improvements to extended statistics
Date
Msg-id 56df91cd-c7dc-f46b-6bc3-cb36982ff9ae@enterprisedb.com
Whole thread Raw
In response to Re: Additional improvements to extended statistics  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: Additional improvements to extended statistics
List pgsql-hackers

On 11/17/20 4:35 PM, Dean Rasheed wrote:
> On Thu, 12 Nov 2020 at 14:18, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Here is an improved WIP version of the patch series, modified to address
>> the issue with repeatedly applying the extended statistics, as discussed
>> with Dean in this thread. It's a bit rough and not committable, but I
>> need some feedback so I'm posting it in this state.
> 
> As it stands, it doesn't compile if 0003 is applied, because it missed
> one of the callers of clauselist_selectivity_simple(), but that's
> easily fixed.
> 
>> 0001 is the original patch improving estimates of OR clauses
>>
>> 0002 adds thin wrappers for clause[list]_selectivity, with "internal"
>> functions allowing to specify whether to keep considering extended stats
>>
>> 0003 does the same for the "simple" functions
>>
>>
>> I've kept it like this to demonstrate that 0002 is not sufficient. In my
>> response from March 24 I wrote this:
>>
>>> Isn't it the case that clauselist_selectivity_simple (and the OR
>>> variant) should ignore extended stats entirely? That is, we'd need
>>> to add a flag (or _simple variant) to clause_selectivity, so that it
>>> calls causelist_selectivity_simple_or.
>> But that's actually wrong, as 0002 shows (as it breaks a couple of
>> regression tests), because of the way we handle OR clauses. At the top
>> level, an OR-clause is actually just a single clause and it may get
>> passed to clauselist_selectivity_simple. So entirely disabling extended
>> stats for the "simple" functions would also mean disabling extended
>> stats for a large number of OR clauses. Which is clearly wrong.
>>
>> So 0003 addresses that, by adding a flag to the two "simple" functions.
>> Ultimately, this should probably do the same thing as 0002 and add thin
>> wrappers, because the existing functions are part of the public API.
> 
> I agree that, taken together, these patches fix the
> multiple-extended-stats-evaluation issue. However:
> 
> I think this has ended up with too many variants of these functions,
> since we now have "_internal" and "_simple" variants, and you're
> proposing adding more. The original purpose of the "_simple" variants
> was to compute selectivities without looking at extended stats, and
> now the "_internal" variants compute selectivities with an additional
> "use_extended_stats" flag to control whether or not to look at
> extended stats. Thus they're basically the same, and could be rolled
> together.
> 

Yeah, I agree there were far too many functions. Your patch looks much
cleaner / saner than the one I shared last week.

> Additionally, it's worth noting that the "_simple" variants expose the
> "estimatedclauses" bitmap as an argument, which IMO is a bit messy as
> an API. All callers of the "_simple" functions outside of clausesel.c
> actually pass in estimatedclauses=NULL, so it's possible to refactor
> and get rid of that, turning estimatedclauses into a purely internal
> variable.
> 

Hmmm. I think there were two reasons for exposing the estimatedclauses
bitmap like that: (a) we used the function internally and (b) we wanted
to allow cases where the user code might do something with the bitmap.
The first item is not an issue - we can hide that. As for the second
item, my guess is it was unnecessary future-proofing - we don't know
about any use case that might need this, so +1 to get rid of it.

> Also, it's quite messy that clauselist_selectivity_simple_or() needs
> to be passed a Selectivity input (the final argument) that is the
> selectivity of any already-estimated clauses, or the value to return
> if no not-already-estimated clauses are found, and must be 0.0 when
> called from the extended stats code.
> 

True.

> Attached is the kind of thing I had in mind (as a single patch, since
> I don't think it's worth splitting up). This replaces the "_simple"
> and "_internal" variants of these functions with "_opt_ext_stats"
> variants whose signatures match the originals except for having the
> single extra "use_extended_stats" boolean parameter. Additionally, the
> "_simple" functions are merged into the originals (making them more
> like they were in PG11) so that the "estimatedclauses" bitmap and
> partial-OR-list Selectivity become internal details, no longer exposed
> in the API.
> 

Seems fine to me, although the "_opt_ext_stats" is rather cryptic.
AFAICS we use "_internal" for similar functions.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: Allow matching whole DN from a client certificate
Next
From: Andres Freund
Date:
Subject: Re: "as quickly as possible" (was: remove spurious CREATE INDEX CONCURRENTLY wait)