Re: PoC/WIP: Extended statistics on expressions - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: PoC/WIP: Extended statistics on expressions
Date
Msg-id f865af90-c19c-769e-8c65-19692e245479@enterprisedb.com
Whole thread Raw
In response to Re: PoC/WIP: Extended statistics on expressions  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: PoC/WIP: Extended statistics on expressions
List pgsql-hackers

On 1/21/21 12:11 PM, Dean Rasheed wrote:
> On Tue, 19 Jan 2021 at 01:57, Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>>> A slightly bigger issue that I don't like is the way it assigns
>>> attribute numbers for expressions starting from
>>> MaxHeapAttributeNumber+1, so the first expression has an attnum of
>>> 1601. That leads to pretty inefficient use of Bitmapsets, since most
>>> tables only contain a handful of columns, and there's a large block of
>>> unused space in the middle the Bitmapset.
>>>
>>> An alternative approach might be to use regular attnums for columns
>>> and use negative indexes -1, -2, -3, ... for expressions in the stored
>>> stats. Then when storing and retrieving attnums from Bitmapsets, it
>>> could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values
>>> in the Bitmapsets, since there can't be more than that many
>>> expressions (just like other code stores system attributes using
>>> FirstLowInvalidHeapAttributeNumber).
>>
>> Well, I tried this but unfortunately it's not that simple. We still need
>> to build the bitmaps, so I don't think add_expression_to_attributes can
>> be just removed. I mean, we need to do the offsetting somewhere, even if
>> we change how we do it.
> 
> Hmm, I was imagining that the offsetting would happen in each place
> that adds or retrieves an attnum from a Bitmapset, much like a lot of
> other code does for system attributes, and then you'd know you had an
> expression if the resulting attnum was negative.
> 
> I was also thinking that it would be these negative attnums that would
> be stored in the stats data, so instead of something like "1, 2 =>
> 1601", it would be "1, 2 => -1", so in some sense "-1" would be the
> "real" attnum associated with the expression.
> 
>> But the main issue is that in some cases the number of expressions is
>> not really limited by STATS_MAX_DIMENSIONS - for example when applying
>> functional dependencies, we "merge" multiple statistics, so we may end
>> up with more expressions. So we can't just use STATS_MAX_DIMENSIONS.
> 
> Ah, I see. I hadn't really fully understood what that code was doing.
> 
> ISTM though that this is really an internal problem to the
> dependencies code, in that these "merged" Bitmapsets containing attrs
> from multiple different stats objects do not (and must not) ever go
> outside that local code that uses them. So that code would be free to
> use a different offset for its own purposes -- e..g., collect all the
> distinct expressions across all the stats objects and then offset by
> the number of distinct expressions.
> 

>> Also, if we offset regular attnums by STATS_MAX_DIMENSIONS, that inverts
>> the order of processing (so far we've assumed expressions are after
>> regular attnums). So the changes are more extensive - I tried doing that
>> anyway, and I'm still struggling with crashes and regression failures.
>> Of course, that doesn't mean we shouldn't do it, but it's far from
>> mechanical. (Some of that is probably a sign this code needs a bit more
>> work to polish.)
> 
> Interesting. What code assumes expressions come after attributes?
> Ideally, I think it would be cleaner if no code assumed any particular
> order, but I can believe that it might be convenient in some
> circumstances.
> 

Well, in a bunch of places we look at the index (from the bitmap) and 
use it to determine whether the value is a regular attribute or an 
expression, because the values are stored in separate arrays.

This is solvable, all I'm saying (both here and in the preceding part 
about dependencies) is that it's not entirely mechanical task. But it 
might be better to rethink the separation of simple values and 
expression, and make it more "unified" so that most of the code does not 
really need to deal with these differences.

>> But I wonder if it'd be easier to just calculate the actual max attnum
>> and then use it instead of MaxHeapAttributeNumber ...
> 
> Hmm, I'm not sure how that would work. There still needs to be an
> attnum that gets stored in the database, and it has to continue to
> work if the user adds columns to the table. That's why I was
> advocating storing negative values, though I haven't actually tried it
> to see what might go wrong.
> 

Well, yeah, we need to identify the expression in some statistics (e.g. 
in dependencies or ndistinct items). And yeah, offsetting the expression 
attnums by MaxHeapAttributeNumber may be inefficient in this case.


Attached is an updated version of the patch, hopefully addressing all 
issues pointed out by you, Justin and Zhihong, with the exception of the 
expression attnums discussed here.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

pgsql-hackers by date:

Previous
From: Jacob Champion
Date:
Subject: Re: Support for NSS as a libpq TLS backend
Next
From: Greg Nancarrow
Date:
Subject: Re: Parallel INSERT (INTO ... SELECT ...)