Re: PoC/WIP: Extended statistics on expressions - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PoC/WIP: Extended statistics on expressions
Date
Msg-id 4a2918bb-49b0-ccd5-3af7-f7795941a6ad@enterprisedb.com
Whole thread Raw
In response to Re: PoC/WIP: Extended statistics on expressions  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: PoC/WIP: Extended statistics on expressions
List pgsql-hackers

On 12/7/20 5:02 PM, Dean Rasheed wrote:
> On Mon, 7 Dec 2020 at 14:15, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 12/7/20 10:56 AM, Dean Rasheed wrote:
>>> it might actually be
>>> neater to have separate documented syntaxes for single- and
>>> multi-column statistics:
>>>
>>>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>>     ON (expression)
>>>     FROM table_name
>>>
>>>   CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
>>>     [ ( statistics_kind [, ... ] ) ]
>>>     ON { column_name | (expression) } , { column_name | (expression) } [, ...]
>>>     FROM table_name
>>
>> I think it makes sense in general. I see two issues with this approach,
>> though:
>>
>> * By adding expression/standard stats for individual statistics, it
>> makes the list of statistics longer - I wonder if this might have
>> measurable impact on lookups in this list.
>>
>> * I'm not sure it's a good idea that the second syntax would always
>> build the per-expression stats. Firstly, it seems a bit strange that it
>> behaves differently than the other kinds. Secondly, I wonder if there
>> are cases where it'd be desirable to explicitly disable building these
>> per-expression stats. For example, what if we have multiple extended
>> statistics objects, overlapping on a couple expressions. It seems
>> pointless to build the stats for all of them.
>>
> 
> Hmm, I'm not sure it would really be a good idea to build MCV stats on
> expressions without also building the standard stats for those
> expressions, otherwise the assumptions that
> mcv_combine_selectivities() makes about simple_sel and mcv_basesel
> wouldn't really hold. But then, if multiple MCV stats shared the same
> expression, it would be quite wasteful to build standard stats on the
> expression more than once.
> 

Yeah. You're right it'd be problematic to build MCV on expressions
without having the per-expression stats. In fact, that's exactly the
problem what forced me to add the per-expression stats to this patch.
Originally I planned to address it in a later patch, but I had to move
it forward.

So I think you're right we need to ensure we have standard stats for
each expression at least once, to make this work well.

> It feels like it should build a single extended stats object for each
> unique expression, with appropriate dependencies for any MCV stats
> that used those expressions, but I'm not sure how complex that would
> be. Dropping the last MCV stat object using a standard expression stat
> object might reasonably drop the expression stats ... except if they
> were explicitly created by the user, independently of any MCV stats.
> That could get quite messy.
> 

Possibly. But I don't think it's worth the extra complexity. I don't
expect people to have a lot of overlapping stats, so the amount of
wasted space and CPU time is expected to be fairly limited.

So I don't think it's worth spending too much time on this now. Let's
just do what you proposed, and revisit this later if needed.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Denis Smirnov
Date:
Subject: Re: PoC Refactor AM analyse API
Next
From: Tomas Vondra
Date:
Subject: Re: Additional improvements to extended statistics