Thread: Extended multivariate statistics are ignored (potentially related to high null fraction, not sure)
Extended multivariate statistics are ignored (potentially related to high null fraction, not sure)
From
Danny Shemesh
Date:
Hey everyone,
I'm working on improving gathered statistics on several large tables (2TB / 500M records);
I've created extended stats on correlated columns, ran analyze, compared the pre and post explains, and it seems as though the extended statistics are ignored -
the estimation doesn't change much, and I can accurately derive the est. rows from multiplying the single-column freqs from pg_stats with the supplied values.
Context:
- pg 12.8
- relation w/ three columns of interest - user (text), row_type (text), deleted_at (timestamp).
- different users have different distributions for row_type
- deleted_at is set if said record is marked for deletion - stays marked for a long retention period, around 95% of the rows have a value, and around 5% have null
- we query and operate on the 5% of records with deleted_at = null, most of our indexes are partial on that as well
I'll give a simplified example with two users, one being extremely dominant w/ 99.9% of the data; the next holds much less, but still accounts to around 200k rows.
From pg_stats:
attname | user
n_distinct | 2.0
most_common_vals | {A,B}
most_common_freqs | [0.9996333, 0.00036666667]
n_distinct | 2.0
most_common_vals | {A,B}
most_common_freqs | [0.9996333, 0.00036666667]
attname | row_type
n_distinct | 4.0
most_common_vals | {A,B}
most_common_freqs | [0.9968, 0.0025333334]
n_distinct | 4.0
most_common_vals | {A,B}
most_common_freqs | [0.9968, 0.0025333334]
attname | deleted_at
n_distinct | 20761.0
null_frac | 0.043133333
null_frac | 0.043133333
<values / bounds are timestamps, omitted for brevity>
Querying before extended stats:
$> explain select 1 from my_rel where user ='B and row_type = 'A' and deleted_at is null;
Index Only Scan using idx_user_row_type_where_deleted_at_is_null on my_rel (cost=0.69..2851.78 rows=8213 width=4)
Index Cond: ((user = 'B') AND (row_type = 'B'))
Index Only Scan using idx_user_row_type_where_deleted_at_is_null on my_rel (cost=0.69..2851.78 rows=8213 width=4)
Index Cond: ((user = 'B') AND (row_type = 'B'))
The number is derived from: reltuples*user_b_freq*row_type_a_freq*deleted_at_null_frac = 520982816*0.00036666667*0.9968*0.043133333 = 8213.26586
When explain-analyzing, I get - (actual time=0.051..72.503 rows=174954 loops=1)
So in this specific case, the estimation is off by around 20x (note that this is a simplified case just to showcase the symptom).
I then create extended stats - I've tried to add them on all three columns in all combinations, and in pairs in all combinations - all leading to the same result,
I'll only showcase the three column variant for brevity:
$> create statistics s1 on user, row_type, deleted_at from my_rel;
$> analyze my_rel;$> explain select 1 from my_rel where user ='B' and row_type = 'A' and deleted_at is null;
Index Only Scan using idx_user_row_type_where_deleted_at_is_null on my_rel (cost=0.69..1560.01 rows=4491 width=4)After analyzing, pg_stats contain different values, as the large table is sampled - I have user_b_freq = 0.0002, row_type_a_freq = 0.99686664, deleted_at_null_frac = 0.043233335,
thus the new calculation is: 520982816*0.0002*0.99686664*0.043233335 = 4490.64987.
Now it's off by around 40x - and it seems to still only consider the single column distributions.
Is there anything I'm missing ? I thought that maybe in my case, due to the high null fractions of deleted_at, the extended stats aren't used, but couldn't find an obvious hint from the code that would suggest that.
Appreciate your time !
Danny
Re: Extended multivariate statistics are ignored (potentially related to high null fraction, not sure)
From
Bruce Momjian
Date:
On Wed, Jun 1, 2022 at 07:28:58PM +0300, Danny Shemesh wrote: > Hey everyone, > > I'm working on improving gathered statistics on several large tables (2TB / > 500M records); > I've created extended stats on correlated columns, ran analyze, compared the > pre and post explains, and it seems as though the extended statistics are > ignored - > the estimation doesn't change much, and I can accurately derive the est. rows > from multiplying the single-column freqs from pg_stats with the supplied > values. > > > Context: > - pg 12.8 You should be on the most recent version of 12.X, and I do see some extended statistics change in the later releases you are missing. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: Extended multivariate statistics are ignored (potentially related to high null fraction, not sure)
From
Danny Shemesh
Date:
Hey bruce, thanks for the prompt reply !
It reproduces in 12.11 and 13.7, we use a managed offering that is yet to include pg 14,
so sadly I can't try expressional extended statistics as of yet, nor can I attach gdb to the process
to debug the flow.
If any other directions come to mind, I'd really appreciate hearing them.
Thanks again !
On Wed, Jun 1, 2022 at 11:08 PM Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Jun 1, 2022 at 07:28:58PM +0300, Danny Shemesh wrote:
> Hey everyone,
>
> I'm working on improving gathered statistics on several large tables (2TB /
> 500M records);
> I've created extended stats on correlated columns, ran analyze, compared the
> pre and post explains, and it seems as though the extended statistics are
> ignored -
> the estimation doesn't change much, and I can accurately derive the est. rows
> from multiplying the single-column freqs from pg_stats with the supplied
> values.
>
>
> Context:
> - pg 12.8
You should be on the most recent version of 12.X, and I do see some
extended statistics change in the later releases you are missing.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Indecision is a decision. Inaction is an action. Mark Batterson