Thread: a wrong index is chosen when statistics are out of date
The issue can be reproduced with the following steps:

create table x_events (..., created_at timestamp, a int, b int);

create index idx_1 on x_events (created_at, a);
create index idx_2 on x_events (created_at, b);

query:
select * from x_events where created_at = current_timestamp and b = 1;

index (created_at, a) rather than (created_at, b) may be chosen for the
above query if the statistics say "created_at = current_timestamp"
matches no rows; in that case both indexes look equally good, and the
estimate is only wrong because the statistics are out of date.

I ran into this again recently and have two new ideas this time; I'd
like to gather some feedback on them.

1. We could let the user declare that the column's values keep
increasing day by day. The syntax might be:

ALTER TABLE x_events ALTER COLUMN created_at ALWAYS_INCREASED.

Then, for a qual like 'created_at op const', the statistics module
could treat it as 'created_at op $1', so the missing statistics would
make no difference and I think the above issue could be avoided.

This is different from asking the user to use a prepared statement
directly, because a prepared statement may always end up with a custom
plan; the easiest way for that to happen is plan-time partition pruning.

2. Use some AI approach to forecast the data that has not been gathered
yet. The training stage could happen during ANALYZE; for the above
case, it might produce a model like 'there are 100 rows per second
between 9:00 and 18:00 and 2 rows per second for the rest of the day'.

For now, I think option 1 is the easier one to make happen.

--
Best Regards
Andy Fan
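P.S. To make the reproduction self-contained, here is a rough, untested
sketch; the id column, the data volumes and the literal timestamps are
made up just for illustration:

create table x_events (id int, created_at timestamp, a int, b int);
create index idx_1 on x_events (created_at, a);
create index idx_2 on x_events (created_at, b);

-- historical data, covered by the statistics
insert into x_events
select i, timestamp '2024-03-01 00:00:00' + i * interval '1 second',
       i % 100, i % 100
from generate_series(1, 100000) i;
analyze x_events;

-- newly arrived rows, not covered by the statistics yet
insert into x_events
select i, timestamp '2024-03-03 10:00:00', i % 100, i % 100
from generate_series(1, 10000) i;

-- the planner estimates about one row for the new timestamp, so both
-- indexes look equally good and idx_1 may be picked even though idx_2
-- matches both conditions
explain analyze
select * from x_events
where created_at = timestamp '2024-03-03 10:00:00' and b = 1;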
On 3/3/2024 14:01, Andy Fan wrote:
> 1. We could let the user declare that the column's values keep
> increasing day by day. The syntax might be:
>
> ALTER TABLE x_events ALTER COLUMN created_at ALWAYS_INCREASED.
>
> Then, for a qual like 'created_at op const', the statistics module
> could treat it as 'created_at op $1', so the missing statistics would
> make no difference and I think the above issue could be avoided.

Let me write some words to support your efforts in that direction.
I also have some use cases where data is periodically inserted in large
chunks. These chunks contain 'always increased' values, and it causes
trouble each time an analytic query is started over the new data before
the ANALYZE command has run.
I have thought about that issue before but came up with nothing better
than a more aggressive analysis of such tables.
Your trick can work, but it needs a new parameter in pg_type and a lot
of additional code for such a rare case.
I'm looking forward to the demo patch.

--
regards,
Andrei Lepikhov
Postgres Professional
On Sun, 3 Mar 2024 at 20:08, Andy Fan <zhihuifan1213@163.com> wrote:
> The issue can be reproduced with the following steps:
>
> create table x_events (..., created_at timestamp, a int, b int);
>
> create index idx_1 on x_events (created_at, a);
> create index idx_2 on x_events (created_at, b);
>
> query:
> select * from x_events where created_at = current_timestamp and b = 1;
>
> index (created_at, a) rather than (created_at, b) may be chosen for the
> above query if the statistics say "created_at = current_timestamp"
> matches no rows; in that case both indexes look equally good, and the
> estimate is only wrong because the statistics are out of date.

I don't think there's really anything too special about the fact that
the created_at column is always increasing. We commonly get 1-row
estimates after multiplying the selectivities from individual stats.
Your example just seems like yet another reason that this could happen.

I've been periodically talking about introducing "risk" as a factor
that the planner should consider. I did provide some detail in [1]
about the design that was in my head at that time. I'd not previously
thought that it could also solve this problem, but after reading your
email, I think it can. I don't think it would be right to fudge the
costs in any way, but I think the risk factor for IndexPaths could take
into account the number of unmatched index clauses and increment the
risk factor, or "certainty_factor" as it is currently called in my
brain-based design. That way add_path() would be more likely to prefer
the index that matches the most conditions.

The exact maths to calculate the certainty_factor for this case I don't
quite have worked out yet. I plan to work on documenting the design of
this and try to get a prototype patch out sometime during this coming
southern hemisphere winter so that there's at least a full cycle of
feedback opportunity before the PG18 freeze.

I don't think we should do anything like add column options in the
meantime. Those are hard to remove once added.

David

[1] https://www.postgresql.org/message-id/CAApHDvo2sMPF9m%3Di%2BYPPUssfTV1GB%3DZ8nMVa%2B9Uq4RZJ8sULeQ%40mail.gmail.com
On 4/3/2024 12:33, David Rowley wrote:
> [1] https://www.postgresql.org/message-id/CAApHDvo2sMPF9m%3Di%2BYPPUssfTV1GB%3DZ8nMVa%2B9Uq4RZJ8sULeQ%40mail.gmail.com
Thanks for the link!
Could we use the trick from get_actual_variable_range() here and
extrapolate histogram data beyond its boundaries when an index shows us
that the min/max is outside the known statistics?
Because it would only be used for values outside the histogram, it
should only add overhead when there is a reason for it.

--
regards,
Andrei Lepikhov
Postgres Professional
On Mon, 4 Mar 2024 at 19:20, Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote:
> Could we use the trick from get_actual_variable_range() here and
> extrapolate histogram data beyond its boundaries when an index shows us
> that the min/max is outside the known statistics?
> Because it would only be used for values outside the histogram, it
> should only add overhead when there is a reason for it.

I think, in theory, it would be possible to add a function similar to
get_actual_variable_range() for equality clauses, but I'd be worried
about the overheads of doing so. I imagine it would be much more common
to find an equality condition with a value that does not fit in any
histogram/MCV bucket.

get_actual_variable_range() can be quite costly when there are a large
number of tuples ready to be vacuumed, and having an equivalent
function for equality conditions could appear to make the planner
"randomly" slow without much of an explanation as to why.

I think we still do get some complaints about
get_actual_variable_range() despite it now using SnapshotNonVacuumable.
It used to be much worse with the snapshot type it used prior to the
one it uses today. IIRC it took a few iterations to get the performance
of the function to a level that seems "mostly acceptable".

David
David Rowley <dgrowleyml@gmail.com> writes:

> On Sun, 3 Mar 2024 at 20:08, Andy Fan <zhihuifan1213@163.com> wrote:
>> The issue can be reproduced with the following steps:
>>
>> create table x_events (..., created_at timestamp, a int, b int);
>>
>> create index idx_1 on x_events (created_at, a);
>> create index idx_2 on x_events (created_at, b);
>>
>> query:
>> select * from x_events where created_at = current_timestamp and b = 1;
>>
>> index (created_at, a) rather than (created_at, b) may be chosen for the
>> above query if the statistics say "created_at = current_timestamp"
>> matches no rows; in that case both indexes look equally good, and the
>> estimate is only wrong because the statistics are out of date.
>
> I don't think there's really anything too special about the fact that
> the created_at column is always increasing. We commonly get 1-row
> estimates after multiplying the selectivities from individual stats.
> Your example just seems like yet another reason that this could happen.

You are right that there are more cases which can lead to this.
However, this is the only case where the created_at = $1 trick can
work, and it was the problem I wanted to resolve when I wrote the
proposal.

> I've been periodically talking about introducing "risk" as a factor
> that the planner should consider. I did provide some detail in [1]
> about the design that was in my head at that time. I'd not previously
> thought that it could also solve this problem, but after reading your
> email, I think it can.

Haha, I remember you were against a "risk factor" before at [1], and at
that time we were talking about exactly the same topic as here, where I
proposed another kind of risk factor. Without an agreement, I
implemented it in my own internal version and got hurt by it later; I
didn't pay enough attention to the interaction between Bitmap Index
Scan and Index Scan. After that I gave up on the "risk factor".

> I don't think it would be right to fudge the costs in any way, but I
> think the risk factor for IndexPaths could take into account the
> number of unmatched index clauses and increment the risk factor, or
> "certainty_factor" as it is currently called in my brain-based design.
> That way add_path() would be more likely to prefer the index that
> matches the most conditions.

Isn't this somewhat similar to my proposal at [1]? What do you think
about treating 'col op const' as 'col op $1' for the marked column?
That could only resolve a subset of the cases you have in mind, but the
method looks like it has a solid rationale.

Currently I think about the risk factor the way you did before, but
this may be another time for me to change my mind.

[1] https://www.postgresql.org/message-id/CAApHDvovVWCbeR4v%2BA4Dkwb%3DYS_GuJG9OyCm8jZu%2B%2BcP2xsY_A%40mail.gmail.com

--
Best Regards
Andy Fan
Andrei Lepikhov <a.lepikhov@postgrespro.ru> writes:

> On 3/3/2024 14:01, Andy Fan wrote:
>> 1. We could let the user declare that the column's values keep
>> increasing day by day. The syntax might be:
>>
>> ALTER TABLE x_events ALTER COLUMN created_at ALWAYS_INCREASED.
>>
>> Then, for a qual like 'created_at op const', the statistics module
>> could treat it as 'created_at op $1', so the missing statistics would
>> make no difference and I think the above issue could be avoided.
> Let me write some words to support your efforts in that direction.
> I also have some use cases where data is periodically inserted in large
> chunks. These chunks contain 'always increased' values, and it causes
> trouble each time an analytic query is started over the new data before
> the ANALYZE command has run.
> I have thought about that issue before but came up with nothing better
> than a more aggressive analysis of such tables.

I have to say we ran into exactly the same situation and used the same
trick to work around it, and we know that no matter how aggressive the
analysis is, the problem can still happen.

> Your trick can work, but it needs a new parameter in pg_type and a lot
> of additional code for such a rare case.
> I'm looking forward to the demo patch.

Maybe my keyword ALWAYS_INCREASED sounds too much like a type, but what
I actually want is to add a new attribute to pg_attribute which is tied
to one column of one relation. When we compute a selectivity on that
*column*, we apply the trick. This probably doesn't need much code.

--
Best Regards
Andy Fan
On 3/4/24 06:33, David Rowley wrote:
> On Sun, 3 Mar 2024 at 20:08, Andy Fan <zhihuifan1213@163.com> wrote:
>> The issue can be reproduced with the following steps:
>>
>> create table x_events (..., created_at timestamp, a int, b int);
>>
>> create index idx_1 on x_events (created_at, a);
>> create index idx_2 on x_events (created_at, b);
>>
>> query:
>> select * from x_events where created_at = current_timestamp and b = 1;
>>
>> index (created_at, a) rather than (created_at, b) may be chosen for the
>> above query if the statistics say "created_at = current_timestamp"
>> matches no rows; in that case both indexes look equally good, and the
>> estimate is only wrong because the statistics are out of date.
>
> I don't think there's really anything too special about the fact that
> the created_at column is always increasing. We commonly get 1-row
> estimates after multiplying the selectivities from individual stats.
> Your example just seems like yet another reason that this could happen.
>
> I've been periodically talking about introducing "risk" as a factor
> that the planner should consider. I did provide some detail in [1]
> about the design that was in my head at that time. I'd not previously
> thought that it could also solve this problem, but after reading your
> email, I think it can.
>
> I don't think it would be right to fudge the costs in any way, but I
> think the risk factor for IndexPaths could take into account the
> number of unmatched index clauses and increment the risk factor, or
> "certainty_factor" as it is currently called in my brain-based design.
> That way add_path() would be more likely to prefer the index that
> matches the most conditions.
>
> The exact maths to calculate the certainty_factor for this case I don't
> quite have worked out yet. I plan to work on documenting the design of
> this and try to get a prototype patch out sometime during this coming
> southern hemisphere winter so that there's at least a full cycle of
> feedback opportunity before the PG18 freeze.
>

I've been thinking about this stuff too, so I'm curious to hear what
kind of plan you come up with. Every time I tried to formalize a more
concrete plan, I ended up with a very complex (and possibly even more
fragile) approach.

I think we'd need to consider a couple of things:

1) reliability of cardinality estimates

I think this is pretty much the same concept as confidence intervals,
i.e. knowing not just the regular estimate, but also a range where the
actual value lies with high confidence (e.g. 90%). For a single clause
this might not be terribly difficult, but for more complex cases
(multiple conditions, ...) it seems far more complex :-( For example,
let's say we know confidence intervals for two conditions. What's the
confidence interval when they are combined using AND or OR?

2) robustness of the paths

Knowing just the confidence intervals does not seem sufficient, though.
The other thing that matters is how this affects the paths, i.e. how
robust the paths are. I mean, if we have alternative paths with costs
that flip somewhere in the confidence interval - which one should we
pick? Surely one thing to consider is how fast the costs change for
each path.

3) complexity of the model

I suppose we'd prefer a systematic approach (and not some ad hoc
solution for one particular path/plan type). So this would somehow be
integrated into the cost model, making it yet more complex. I'm quite
worried about this (not necessarily because of performance reasons).

I wonder if trying to improve the robustness solely by changes in the
planning phase is a very promising approach.
I mean - sure, we should improve that, but by definition it relies on
a priori information. And not only may the stats be stale, they are
also a very lossy approximation of the actual data. Even if the stats
are perfectly up to date / very detailed, there's still going to be a
lot of uncertainty.

I wonder if we'd be better off if we experimented with more robust
plans, like SmoothScan [1] or g-join [2].

regards

[1] https://stratos.seas.harvard.edu/sites/scholar.harvard.edu/files/stratos/files/smooth_vldbj.pdf
[2] http://wwwlgis.informatik.uni-kl.de/cms/fileadmin/users/haerder/2011/JoinAndGrouping.pdf

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 5 Mar 2024 at 00:37, Andy Fan <zhihuifan1213@163.com> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > I don't think it would be right to fudge the costs in any way, but I
> > think the risk factor for IndexPaths could take into account the
> > number of unmatched index clauses and increment the risk factor, or
> > "certainty_factor" as it is currently called in my brain-based design.
> > That way add_path() would be more likely to prefer the index that
> > matches the most conditions.
>
> Isn't this somewhat similar to my proposal at [1]? What do you think
> about treating 'col op const' as 'col op $1' for the marked column?
> That could only resolve a subset of the cases you have in mind, but the
> method looks like it has a solid rationale.

Do you mean this?

> + /*
> +  * To make the planner more robust to handle some inaccurate statistics
> +  * issue, we will add a extra cost to qpquals so that the less qpquals
> +  * the lower cost it has.
> +  */
> + cpu_run_cost += 0.01 * list_length(qpquals);

I don't think it's a good idea to add cost penalties like you proposed
there. This is what I meant by "I don't think it would be right to
fudge the costs in any way".

If you modify the costs to add some small penalty so that the planner
is more likely to favour some other plan, what happens if we then
decide the other plan has some issue and we want to penalise that for
some other reason? Adding the 2nd penalty might result in the original
plan choice again. Which one should be penalised more? I think the
uncertainty needs to be tracked separately.

Fudging the costs like this is also unlikely to play nicely with
add_path's use of STD_FUZZ_FACTOR. There'd be an incentive to do things
like total_cost *= STD_FUZZ_FACTOR; to ensure we get a large enough
penalty.

David

> [1] https://www.postgresql.org/message-id/CAApHDvovVWCbeR4v%2BA4Dkwb%3DYS_GuJG9OyCm8jZu%2B%2BcP2xsY_A%40mail.gmail.com
David Rowley <dgrowleyml@gmail.com> writes:

> On Tue, 5 Mar 2024 at 00:37, Andy Fan <zhihuifan1213@163.com> wrote:
>>
>> David Rowley <dgrowleyml@gmail.com> writes:
>> > I don't think it would be right to fudge the costs in any way, but I
>> > think the risk factor for IndexPaths could take into account the
>> > number of unmatched index clauses and increment the risk factor, or
>> > "certainty_factor" as it is currently called in my brain-based design.
>> > That way add_path() would be more likely to prefer the index that
>> > matches the most conditions.
>>
>> Isn't this somewhat similar to my proposal at [1]? What do you think
>> about treating 'col op const' as 'col op $1' for the marked column?
>> That could only resolve a subset of the cases you have in mind, but the
>> method looks like it has a solid rationale.
>
> Do you mean this?

Yes, but it was not cautious enough of me to call them "similar" so
quickly. After reading your opinion again, I think what you are trying
to do is add one more dimension to Path, beyond the existing cost and
pathkey information, and it would take effect at the add_path stage.
That is impressive, and I'm quite willing to do more testing once v1 is
done.

I also noted that you have already expressed your opinion about my
proposal 1:

> I don't think we should do anything like add column options in the
> meantime. Those are hard to remove once added.

I will try it very soon anyway.

And I'm a bit upset that no one cares about my proposal 2, the AI
method. I see many companies wanting to introduce AI into the planner,
even though I haven't seen any impressive success yet, but this use
case looks like a candidate.

>> + /*
>> +  * To make the planner more robust to handle some inaccurate statistics
>> +  * issue, we will add a extra cost to qpquals so that the less qpquals
>> +  * the lower cost it has.
>> +  */
>> + cpu_run_cost += 0.01 * list_length(qpquals);
>
> I don't think it's a good idea to add cost penalties like you proposed
> there. This is what I meant by "I don't think it would be right to
> fudge the costs in any way".
>
> If you modify the costs to add some small penalty so that the planner
> is more likely to favour some other plan, what happens if we then
> decide the other plan has some issue and we want to penalise that for
> some other reason? Adding the 2nd penalty might result in the original
> plan choice again. Which one should be penalised more? I think the
> uncertainty needs to be tracked separately.
>
> Fudging the costs like this is also unlikely to play nicely with
> add_path's use of STD_FUZZ_FACTOR. There'd be an incentive to do
> things like total_cost *= STD_FUZZ_FACTOR; to ensure we get a large
> enough penalty.

I agree; I just misunderstood your proposal yesterday.

--
Best Regards
Andy Fan
Hi,

>> I don't think we should do anything like add column options in the
>> meantime. Those are hard to remove once added.
>
> I will try it very soon anyway.

Attached is a PoC version, and here is the test case.

create table t(a int, b int, c int) with (autovacuum_enabled=off);
create index on t(a, b);
create index on t(a, c);

insert into t select floor(random() * 100 + 1)::int, i, i
from generate_series(1, 100000) i;

analyze t;

insert into t
select floor(random() * 10 + 1)::int + 100, i, i
from generate_series(1, 1000) i;

-- One of the below queries would choose a wrong index.
-- Here is the result from my test.

explain (costs off) select * from t where a = 109 and c = 8;
              QUERY PLAN
---------------------------------------
 Index Scan using t_a_c_idx on t
   Index Cond: ((a = 109) AND (c = 8))
(2 rows)

explain (costs off) select * from t where a = 109 and b = 8;
           QUERY PLAN
---------------------------------
 Index Scan using t_a_c_idx on t
   Index Cond: (a = 109)
   Filter: (b = 8)
(3 rows)

The wrong index is chosen for the second case!

-- After applying the new API:

alter table t alter column a set (force_generic=on);

explain (costs off) select * from t where a = 109 and c = 8;
              QUERY PLAN
---------------------------------------
 Index Scan using t_a_c_idx on t
   Index Cond: ((a = 109) AND (c = 8))
(2 rows)

explain (costs off) select * from t where a = 109 and b = 8;
              QUERY PLAN
---------------------------------------
 Index Scan using t_a_b_idx on t
   Index Cond: ((a = 109) AND (b = 8))
(2 rows)

Now both cases choose the correct index.

commit f8cca472479c50ba73479ec387882db43d203522 (HEAD -> shared_detoast_value)
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date:   Tue Mar 5 18:27:48 2024 +0800

    Add a "force_generic" attoption for selfunc.c

    Sometimes the user only cares about recent data, and the optimizer
    statistics for such data have not been gathered yet, so some bad
    decisions may happen. Before this patch, we had to run autoanalyze
    very frequently, which is not only expensive but may also be too late.

    This patch introduces a new attoption like this:

    ALTER TABLE t ALTER COLUMN col SET (force_generic=true);

    selfunc.c then recognizes this and ignores the particular Const value,
    so the average selectivity is chosen. This falls into the weakness of
    a generic plan, but the patch doesn't introduce any new weakness and
    we leave the decision to the user, which can resolve some problems.
    Also, this logic only applies to eqsel, since ineqsel has the
    get_actual_variable_range mechanism, which is helpful for the index
    choice case at least.

I think it is OK for a design review. On the implementation side, the
known issues include:

1. Supporting grabbing such information from the parent of a
partitioned table when the child doesn't have it.
2. Built-in documentation and tests.

Any feedback is welcome.

--
Best Regards
Andy Fan
On 5/3/2024 19:56, Andy Fan wrote:
> I think it is OK for a design review. On the implementation side, the
> known issues include:
>
> 1. Supporting grabbing such information from the parent of a
> partitioned table when the child doesn't have it.
> 2. Built-in documentation and tests.
>
> Any feedback is welcome.
Thanks for your efforts.
I was confused when you showed the problem connected to clauses like
"Var op Const" and "Var op Param".
As far as I know, the estimation logic for such clauses uses the MCV
and number-of-distinct statistics. So, once we are outside the MCV
values, it becomes totally insensitive to any internal skew in the data
and to any data outside the statistics boundaries.
Having studied the example you provided with the patch, I think it is
not a correct example: the difference between var_eq_const and
var_eq_non_const is quite obvious. In the second routine, you don't
have information about the const value and can't use the MCV for
estimation. Also, you can't exclude MCV values from the estimation. So
it is just luck that you've got the right answer. I think if you
increased the weight of the unknown part, you would get a bad result,
too.
I would like to ask David why the var_eq_const estimator doesn't have
an option for estimation with a histogram. Having that would relieve
the problem with skewed data. Detecting the situation where the
incoming const is out of the covered area would allow us to fall back
to ndistinct estimation or something else. At the very least, histogram
usage could be restricted by the reltuples value and the ratio between
the total number of MCV values and the total number of distinct values
in the table.

Just for demo purposes, an illustration of the data skew issue:

CREATE EXTENSION tablefunc;
CREATE TABLE norm_test AS
  SELECT abs(r::integer) AS val
  FROM normal_rand(1E7::integer, 5.::float8, 300.::float8) AS r;
ANALYZE norm_test;

-- The first query is estimated with the MCV quite precisely:
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 100;
-- result: planned rows=25669, actual rows=25139

-- Here we have the numdistinct estimation, mostly arbitrary:
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 200;
-- result: planned rows=8604, actual rows=21239
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 500;
-- result: planned rows=8604, actual rows=6748
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 600;
-- result: planned rows=8604, actual rows=3501
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 700;
-- result: planned rows=8604, actual rows=1705
EXPLAIN ANALYZE SELECT * FROM norm_test WHERE val = 1000;
-- result: planned rows=8604, actual rows=91

--
regards,
Andrei Lepikhov
Postgres Professional
On Wed, 6 Mar 2024 at 02:09, Andy Fan <zhihuifan1213@163.com> wrote:
> This patch introduces a new attoption like this:
>
> ALTER TABLE t ALTER COLUMN col SET (force_generic=true);
>
> selfunc.c then recognizes this and ignores the particular Const value,
> so the average selectivity is chosen. This falls into the weakness of
> a generic plan, but the patch doesn't introduce any new weakness and
> we leave the decision to the user, which can resolve some problems.
> Also, this logic only applies to eqsel, since ineqsel has the
> get_actual_variable_range mechanism, which is helpful for the index
> choice case at least.

If you don't want the planner to use the statistics for the column, why
not just do the following?

ALTER TABLE t ALTER COLUMN col SET STATISTICS 0;

ANALYZE won't delete any existing statistics, so that might need to be
done manually.

David
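P.S. Spelled out on your example table t (an untested sketch; "col" is
just the placeholder column name from above):

ALTER TABLE t ALTER COLUMN col SET STATISTICS 0;
-- with the per-column target at 0, subsequent ANALYZE runs skip this
-- column, so the planner falls back to its default selectivity
-- estimates for it; as noted above, any statistics gathered earlier
-- are kept and would need to be removed manually
ANALYZE t;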
David Rowley <dgrowleyml@gmail.com> writes:

> On Wed, 6 Mar 2024 at 02:09, Andy Fan <zhihuifan1213@163.com> wrote:
>> This patch introduces a new attoption like this:
>>
>> ALTER TABLE t ALTER COLUMN col SET (force_generic=true);
>>
>> selfunc.c then recognizes this and ignores the particular Const value,
>> so the average selectivity is chosen. This falls into the weakness of
>> a generic plan, but the patch doesn't introduce any new weakness and
>> we leave the decision to the user, which can resolve some problems.
>> Also, this logic only applies to eqsel, since ineqsel has the
>> get_actual_variable_range mechanism, which is helpful for the index
>> choice case at least.
>
> If you don't want the planner to use the statistics for the column, why
> not just do the following?

Actually, I don't want the planner to ignore the statistics totally; I
want the planner to treat the "Const" which is probably missing from
the statistics as an average value, which is just like what we do in a
generic plan for the query below.

prepare s as SELECT * FROM t WHERE a = $1 and b = $2;

explain (costs off) execute s(109, 8);
           QUERY PLAN
---------------------------------
 Index Scan using t_a_c_idx on t
   Index Cond: (a = 109)
   Filter: (b = 8)
(3 rows)

Custom plan: the wrong index is chosen because we have a bad estimate
for a = 109.

set plan_cache_mode to force_generic_plan;

explain (costs off) execute s(109, 8);
              QUERY PLAN
---------------------------------------
 Index Scan using t_a_b_idx on t
   Index Cond: ((a = $1) AND (b = $2))   -- Correct index.
(2 rows)

Generic plan: we use the average estimate for the part not covered by
the optimizer statistics, and *if the new value is not so different
from the existing ones*, we can get the desired result.

It is true that the "generic" way is not as accurate as the "custom"
way, since the latter can use the data in the MCV, but that is the cost
we have to pay to make the missing optimizer statistics less important,
and a generic plan has the same issue as well. In that respect, I think
the way you proposed probably has a wider use case.

--
Best Regards
Andy Fan
On Thu, 7 Mar 2024 at 21:17, Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote:
> I would like to ask David why the var_eq_const estimator doesn't have
> an option for estimation with a histogram. Having that would relieve
> the problem with skewed data. Detecting the situation where the
> incoming const is out of the covered area would allow us to fall back
> to ndistinct estimation or something else. At the very least, histogram
> usage could be restricted by the reltuples value and the ratio between
> the total number of MCV values and the total number of distinct values
> in the table.

If you can think of a way to calculate it, you should propose a patch.

IIRC, we try to make the histogram buckets evenly sized based on the
number of occurrences. I've not followed the code in detail, but I'd
guess that doing that allows us to just subtract off the MCV
frequencies and assume the remainder is evenly split over each
histogram bucket. So unless we had an n_distinct per histogram bucket,
or at the very least n_distinct_for_histogram_values, how would the
calculation look for what we currently record?

David
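P.S. For reference, this is the per-column detail we currently have to
work with for this kind of estimate (a quick look at the pg_stats view,
reusing your norm_test example from upthread):

SELECT n_distinct, most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'norm_test' AND attname = 'val';

-- there is a single n_distinct for the whole column, per-value
-- frequencies only for the MCVs, and the histogram is just a list of
-- boundary values with no per-bucket distinct count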
Andrei Lepikhov <a.lepikhov@postgrespro.ru> writes:

> On 5/3/2024 19:56, Andy Fan wrote:
>> I think it is OK for a design review. On the implementation side, the
>> known issues include:
>> 1. Supporting grabbing such information from the parent of a
>> partitioned table when the child doesn't have it.
>> 2. Built-in documentation and tests.
>> Any feedback is welcome.
> Thanks for your efforts.
> I was confused when you showed the problem connected to clauses like
> "Var op Const" and "Var op Param".

Hmm, then what solution did you have in mind when you mentioned the
"trick" in [1]? I think we have some communication gap here.

> As far as I know, the estimation logic for such clauses uses the MCV
> and number-of-distinct statistics. So, once we are outside the MCV
> values, it becomes totally insensitive to any internal skew in the data
> and to any data outside the statistics boundaries.
> Having studied the example you provided with the patch, I think it is
> not a correct example: the difference between var_eq_const and
> var_eq_non_const is quite obvious.

My response would be the same as what I said in [2]; let's see if we
can make the gap between us smaller.

> In the second routine, you don't have information about the const value
> and can't use the MCV for estimation. Also, you can't exclude MCV
> values from the estimation. So it is just luck that you've got the
> right answer. I think if you increased the weight of the unknown part,
> you would get a bad result, too.
> I would like to ask David why the var_eq_const estimator doesn't have
> an option for estimation with a histogram. Having that would relieve
> the problem with skewed data. Detecting the situation where the
> incoming const is out of the covered area would allow us to fall back
> to ndistinct estimation or something else. At the very least, histogram
> usage could be restricted by the reltuples value and the ratio between
> the total number of MCV values and the total number of distinct values
> in the table.

I think an example showing that your algorithm is better would be
pretty helpful for communication.

[1] https://www.postgresql.org/message-id/15381eea-cbc3-4087-9d90-ab752292bd54%40postgrespro.ru
[2] https://www.postgresql.org/message-id/87msra9vgo.fsf%40163.com

--
Best Regards
Andy Fan
On Thu, 7 Mar 2024 at 23:40, Andy Fan <zhihuifan1213@163.com> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > If you don't want the planner to use the statistics for the column, why
> > not just do the following?
>
> Actually, I don't want the planner to ignore the statistics totally; I
> want the planner to treat the "Const" which is probably missing from
> the statistics as an average value, which is just like what we do in a
> generic plan for the query below.

I'm with Andrei on this one and agree with his "it is just luck that
you've got the right answer". I think we should fix the general problem
of the planner not choosing the better index. I understand you've had a
go at that before, but I didn't think fudging the costs was the right
fix to coax the planner into the safer choice.

I'm not personally interested in any bandaid fixes for this. I'd rather
see us come up with a long-term solution that just makes things better.

I also understand you're probably frustrated and just want to make
something better. However, it's not like it's a new problem. The more
general problem of the planner making risky choices predates both of
our time spent working on PostgreSQL, so I don't think a hasty solution
that fixes some small subset of the problem is that helpful.

David
David Rowley <dgrowleyml@gmail.com> writes:

> On Thu, 7 Mar 2024 at 23:40, Andy Fan <zhihuifan1213@163.com> wrote:
>>
>> David Rowley <dgrowleyml@gmail.com> writes:
>> > If you don't want the planner to use the statistics for the column, why
>> > not just do the following?
>>
>> Actually, I don't want the planner to ignore the statistics totally; I
>> want the planner to treat the "Const" which is probably missing from
>> the statistics as an average value, which is just like what we do in a
>> generic plan for the query below.
>
> I'm with Andrei on this one and agree with his "it is just luck that
> you've got the right answer".

Is there any example to support this conclusion? And what new problem
does the approach introduce?

--
Best Regards
Andy Fan
On 7/3/2024 17:32, David Rowley wrote:
> On Thu, 7 Mar 2024 at 21:17, Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote:
>> I would like to ask David why the var_eq_const estimator doesn't have
>> an option for estimation with a histogram. Having that would relieve
>> the problem with skewed data. Detecting the situation where the
>> incoming const is out of the covered area would allow us to fall back
>> to ndistinct estimation or something else. At the very least, histogram
>> usage could be restricted by the reltuples value and the ratio between
>> the total number of MCV values and the total number of distinct values
>> in the table.
>
> If you can think of a way to calculate it, you should propose a patch.
>
> IIRC, we try to make the histogram buckets evenly sized based on the
> number of occurrences. I've not followed the code in detail, but I'd
> guess that doing that allows us to just subtract off the MCV
> frequencies and assume the remainder is evenly split over each
> histogram bucket. So unless we had an n_distinct per histogram bucket,
> or at the very least n_distinct_for_histogram_values, how would the
> calculation look for what we currently record?
Yeah, it was my mistake; I see nothing special we can do with that kind
of histogram: with a coarse histogram net, the level of uncertainty
within one bin is too high to make a better estimation.
I am just pondering how to detect the situation where the incoming
constant is simply outside the scope of the statistics, so that we
could apply alternative, more expensive logic involving the number of
index pages beyond the boundary, the index tuple width, and the number
of distinct values. The left and right boundaries of the histogram are
suitable detectors for such a situation.

--
regards,
Andrei Lepikhov
Postgres Professional
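P.S. The kind of check I have in mind, sketched at the SQL level
through the pg_stats view (inside the planner it would read the
histogram slot directly); the table and column are the ones from Andy's
first message:

SELECT histogram_bounds
FROM pg_stats
WHERE tablename = 'x_events' AND attname = 'created_at';

-- if the incoming constant is greater than the last boundary (or less
-- than the first one), the histogram can tell us nothing about it, and
-- that is the point where the alternative, more expensive estimation
-- logic could kick in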
After some more thought about the difference between the two ideas, I
find that we are resolving two different issues; it is just that in the
wrong-index-choice cases, both of them should work in general.

Your idea actually adds some rule-based logic, named certainty_factor,
and the implementation is quite graceful. For the example in this
thread, it takes effect *when both indexes have the same cost*. I
believe that can resolve the index choice here, but what about the row
estimation issue? Given that the design will not fudge the cost in any
way, I assume you will not fudge the rows or selectivity either. Then,
if the optimizer statistics are missing, what can we do for both index
choice and row estimation? I think that's where my idea comes in. Since
optimizer statistics can't always be up to date by design, and assuming
we have a situation where the customer's queries need those statistics
often, how about doing the prediction with the historical statistics?
That can cover both index choice and row estimation.

Then the following objections may be raised:
a) we can't decide *automatically* when the missing optimizer
statistics are wanted;
b) if we base the estimation on the historical statistics, the value of
the MCV information is lost.

My answers: a) it is controlled by a human with "alter table t alter
column a set (force_generic=on)"; b) that can't be resolved, I think,
and it only hurts when the real Const is very different from the ones
in the history; a generic plan has the same issue.

I just reviewed the bad query plans from the past half year internally
and found that many of them used a Nested Loop, which was the direct
cause. Now I think I have found a new reason for this: the missing
optimizer statistics make the estimated number of rows from the outer
relation 1, which makes the Nested Loop be chosen. I'm not sure whether
your idea could help with this, or could help more than mine does in
this respect.

--
Best Regards
Andy Fan
On 8/3/2024 18:53, Andy Fan wrote:
> I just reviewed the bad query plans from the past half year internally
> and found that many of them used a Nested Loop, which was the direct
> cause. Now I think I have found a new reason for this: the missing
> optimizer statistics make the estimated number of rows from the outer
> relation 1, which makes the Nested Loop be chosen. I'm not sure whether
> your idea could help with this, or could help more than mine does in
> this respect.
Having had the same problem for a long time, I've made an attempt and
written a patch that probes an index to determine whether the estimated
constant is within the statistics' scope.
I remember David's remark on the overhead problem, but I don't argue it
here. This patch is on the table to have one more solution sketch for
further discussion.
Also, Andy, if you have a specific problem with index choice, you can
try a tiny option that makes the index-picking technique less dependent
on the ordering of the index list [1].

[1] https://www.postgresql.org/message-id/9b3dbf6d-776a-4953-a5a4-60992939321d@postgrespro.ru

--
regards,
Andrei Lepikhov
Postgres Professional
> Having had the same problem for a long time, I've made an attempt and
> written a patch that probes an index to determine whether the estimated
> constant is within the statistics' scope.
> I remember David's remark on the overhead problem, but I don't argue it
> here. This patch is on the table to have one more solution sketch for
> further discussion.

I think the following code will be really horrendous from a performance
point of view; think about the cases where we have thousands of tuples:

+ index_rescan(index_scan, scankeys, 1, NULL, 0);
+ while (index_getnext_tid(index_scan, ForwardScanDirection) != NULL)
+ {
+     ntuples++;
+ }
+

> Also, Andy, if you have a specific problem with index choice, you can
> try a tiny option that makes the index-picking technique less dependent
> on the ordering of the index list [1].

Thanks, but the index choice issue is already not the only issue I want
to address now.

You said my patch was kind of lucky to work at [1]; have you figured
out an example to prove that?

[1] https://www.postgresql.org/message-id/701d2097-2c5b-41e2-8629-734e3c8ba613%40postgrespro.ru

--
Best Regards
Andy Fan
Hello everyone,

> After some more thought about the difference between the two ideas, I
> find that we are resolving two different issues; it is just that in the
> wrong-index-choice cases, both of them should work in general.

Here is the formal version for the attribute-reloptions direction.

commit 0d842e39275710a544b11033f5eec476147daf06 (HEAD -> force_generic)
Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
Date:   Sun Mar 31 11:51:28 2024 +0800

    Add an attopt to disable the MCV when estimating Var = Const

    As of the current code, when calculating the selectivity for
    Var = Const, the planner first checks whether the Const is one of the
    most common values; if it is not, it subtracts the MCVs' total
    selectivity and number of distinct values, and treats the selectivity
    for the Const as equal for each of the remaining n_distinct values.

    This logic works great when the optimizer statistics are up to date.
    However, if the known most common values took up most of the
    selectivity at the last run of ANALYZE, and the new most common value
    in the real data has not been gathered yet, the estimate for the new
    MCV will be pretty bad. A common case for this is
    created_at = {current_date}.

    To overcome this issue, we provide a new syntax:

    ALTER TABLE tablename ALTER COLUMN created_at SET (force_generic=on);

    After this, the planner ignores the MCV list for this column when
    estimating Var = Const and treats all the values equally.

    This would cause some badness if the values of a column are quite
    unevenly distributed, which is what the MCV is designed for. However,
    this patch just provides one more option to the user and lets the
    user make the decision.

Here is an example of its use case.

create table t(a int, b int, c int) with (autovacuum_enabled=off);
create index on t(a, b);
create index on t(a, c);
create table t2 (id int primary key, a int);
insert into t2 select i, i from generate_series(1, 800) i;

insert into t select floor(random() * 100 + 1)::int, i, i
from generate_series(1, 100000) i;

analyze t, t2;

insert into t
select floor(random() * 10 + 1)::int + 100, i, i
from generate_series(1, 10000) i;

explain (costs off) select * from t where a = 109 and b = 8;
explain (costs off, analyze)
select * from t join t2 on t.c = t2.id where t.a = 109;

ALTER TABLE t ALTER COLUMN a SET (force_generic=on);

-- We will see some good results now.
explain (costs off) select * from t where a = 109 and b = 8;
explain (costs off, analyze)
select * from t join t2 on t.c = t2.id where t.a = 109;

I will add this to our commitfest application; any feedback is welcome!

--
Best Regards
Andy Fan
Andy Fan <zhihuifan1213@163.com> writes:

> Hello everyone,
>
>> After some more thought about the difference between the two ideas, I
>> find that we are resolving two different issues; it is just that in the
>> wrong-index-choice cases, both of them should work in general.
>
> Here is the formal version for the attribute-reloptions direction.
>
> commit 0d842e39275710a544b11033f5eec476147daf06 (HEAD -> force_generic)
> Author: yizhi.fzh <yizhi.fzh@alibaba-inc.com>
> Date:   Sun Mar 31 11:51:28 2024 +0800
>
>     Add an attopt to disable the MCV when estimating Var = Const
>
>     As of the current code, when calculating the selectivity for
>     Var = Const, the planner first checks whether the Const is one of the
>     most common values; if it is not, it subtracts the MCVs' total
>     selectivity and number of distinct values, and treats the selectivity
>     for the Const as equal for each of the remaining n_distinct values.
>
>     This logic works great when the optimizer statistics are up to date.
>     However, if the known most common values took up most of the
>     selectivity at the last run of ANALYZE, and the new most common value
>     in the real data has not been gathered yet, the estimate for the new
>     MCV will be pretty bad. A common case for this is
>     created_at = {current_date}.
>
>     To overcome this issue, we provide a new syntax:
>
>     ALTER TABLE tablename ALTER COLUMN created_at SET (force_generic=on);
>
>     After this, the planner ignores the MCV list for this column when
>     estimating Var = Const and treats all the values equally.
>
>     This would cause some badness if the values of a column are quite
>     unevenly distributed, which is what the MCV is designed for. However,
>     this patch just provides one more option to the user and lets the
>     user make the decision.
>
> Here is an example of its use case.
...

Here are some additional notes on this feature:

- After using this feature, we still gather the MCV for these columns
  because it is still useful for the join case; see the eqjoinsel_inner
  function.

- Will this feature make some cases worse, since it relies on not using
  the MCV list for Var = Const? That is true in theory, but if users
  use the feature correctly, they will not enable it for such columns.
  The feature is designed only for the use case described in the commit
  message, and the theory is exactly the same as for a generic plan. If
  users use it correctly, they may save the effort of running ANALYZE
  very frequently and get better results for both index choice and row
  estimation. Also, the patch is not aggressive at all and is easy to
  maintain.

- Is 'force_generic' a good name for the attribute option? Probably
  not; we can find a better name once we agree on this direction.

- Will it conflict with David's idea of a certainty_factor? Probably
  not, even though both of them can handle the index-choice case. See
  my point at [1].

[1] https://www.postgresql.org/message-id/877cicao6e.fsf%40163.com

--
Best Regards
Andy Fan