Thread: Propose a new hook for mutating the query bounds

Propose a new hook for mutating the query bounds

From
Xiaozhe Yao
Date:
Hi hackers,

I am currently working on improving the cardinality estimation component in PostgreSQL with machine learning. I came up with a solution that mutates the bounds for different columns. For example, assume that we have a query

```
select * from test where X<10 and Y<20;
```

Our approach tries to learn the relation between X and Y. For example, if we have a linear relation, Y=X+10. Then Y<20 is essentially equivalent to X<10. Therefore we can mutate the Y<20 to Y<INT_MAX so that the selectivity will be 1, and we will have a more accurate estimation.

It seems to me that we can achieve something similar by mutating the pg_statistics, however, mutating the bounds is something more straightforward to me and less expensive.

I am wondering if it is possible to have such an extension? Or if there is a better solution to this? I have already implemented this stuff in a private repository, and if this is something you like, I can further propose the patch to the list.

Best regards,
Xiaozhe

Re: Propose a new hook for mutating the query bounds

From
Tomas Vondra
Date:

On 11/17/21 2:24 PM, Xiaozhe Yao wrote:
> Hi hackers,
> 
> I am currently working on improving the cardinality estimation component
> in PostgreSQL with machine learning. I came up with a solution that
> mutates the bounds for different columns. For example, assume that we
> have a query
> 
> ```
> select * from test where X<10 and Y<20;
> ```
> 
> Our approach tries to learn the relation between X and Y. For example,
> if we have a linear relation, Y=X+10. Then Y<20 is essentially
> equivalent to X<10. Therefore we can mutate the Y<20 to Y<INT_MAX so
> that the selectivity will be 1, and we will have a more accurate estimation.
> 

OK. FWIW the extended statistics patch originally included a patch for
multi-dimensional histograms, and that would have worked for this
example just fine, I guess. But yeah, there are various other
dependencies for which a histogram would not help. And ML might discover
that and help ...

> It seems to me that we can achieve something similar by mutating the
> pg_statistics, however, mutating the bounds is something more
> straightforward to me and less expensive.
> 

I don't understand how you could achieve this by mutating pg_statistic,
without also breaking estimation for queries that only have Y<20.

> I am wondering if it is possible to have such an extension? Or if there
> is a better solution to this? I have already implemented this stuff in a
> private repository, and if this is something you like, I can further
> propose the patch to the list.
> 

Maybe, but it's really hard to comment on this without seeing any PoC
patches. We don't know where you you'd like the hook called, what info
would it have access to, how would it tweak the selectivities etc.

If you think this would work, write a PoC patch and we'll see.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Propose a new hook for mutating the query bounds

From
Xiaozhe Yao
Date:
Hi Tomas and Hackers,

Thanks for your reply and feedback!

> I don't understand how you could achieve this by mutating pg_statistic,
without also breaking estimation for queries that only have Y<20.

I agree, if we mutate pg_statistics, we will break lots of stuff and the process becomes complicated. That's also why I think mutating the bounds makes more sense and is easier to achieve.

> Maybe, but it's really hard to comment on this without seeing any PoC
patches. We don't know where you you'd like the hook called, what info
would it have access to, how would it tweak the selectivities etc.

I have attached a PoC patch to this mail. Essentially in this patch, I only try to pass the pointer of the constval in ```scalarineqsql``` function. It is enough from the Postgres side. With that, I can handle other things in an independent extension.

I hope this makes sense.

Best regards,
Xiaozhe

On Wed, Nov 17, 2021 at 2:49 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


On 11/17/21 2:24 PM, Xiaozhe Yao wrote:
> Hi hackers,
>
> I am currently working on improving the cardinality estimation component
> in PostgreSQL with machine learning. I came up with a solution that
> mutates the bounds for different columns. For example, assume that we
> have a query
>
> ```
> select * from test where X<10 and Y<20;
> ```
>
> Our approach tries to learn the relation between X and Y. For example,
> if we have a linear relation, Y=X+10. Then Y<20 is essentially
> equivalent to X<10. Therefore we can mutate the Y<20 to Y<INT_MAX so
> that the selectivity will be 1, and we will have a more accurate estimation.
>

OK. FWIW the extended statistics patch originally included a patch for
multi-dimensional histograms, and that would have worked for this
example just fine, I guess. But yeah, there are various other
dependencies for which a histogram would not help. And ML might discover
that and help ...

> It seems to me that we can achieve something similar by mutating the
> pg_statistics, however, mutating the bounds is something more
> straightforward to me and less expensive.
>

I don't understand how you could achieve this by mutating pg_statistic,
without also breaking estimation for queries that only have Y<20.

> I am wondering if it is possible to have such an extension? Or if there
> is a better solution to this? I have already implemented this stuff in a
> private repository, and if this is something you like, I can further
> propose the patch to the list.
>

Maybe, but it's really hard to comment on this without seeing any PoC
patches. We don't know where you you'd like the hook called, what info
would it have access to, how would it tweak the selectivities etc.

If you think this would work, write a PoC patch and we'll see.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: Propose a new hook for mutating the query bounds

From
Tom Lane
Date:
Xiaozhe Yao <askxzyao@gmail.com> writes:

+    if (mutate_bounds_hook) {
+        mutate_bounds_hook(root, &constval, isgt, iseq);
+    }

It seems unlikely that this could do anything actually useful,
and impossible that it could do anything useful without enormous waste
of cycles along the way.  Basically, each time one calls scalarineqsel,
the hook would have to re-analyze the entire query to see if it should do
anything.  Most of the time the answer would be "no", after a lot of
cycles wasted.  It would also have to keep some state (where?) to
coordinate mutation of different Consts in a WHERE clause.  And why only
a hook in scalarineqsel?  Is that really the only context that you'd need
to adjust the results in?

Another important deficiency in this API spec is that the hook has no
idea *which* constant it's being called on, so I don't see how it could
really deliver correct answers at all.

I can buy that ML techniques might provide a way to improve selectivity
estimates overall, but I think inserting them would be better done with
a much higher-level hook, maybe about at the level of
clauselist_selectivity.

            regards, tom lane



Re: Propose a new hook for mutating the query bounds

From
Xiaozhe Yao
Date:
Hi Tom,

Thanks for your feedback. I completely agree with you that a higher-level hook is better suited for this case. I have adjusted the PoC patch to this email.

Now it is located in the clauselist_selectivity_ext function, where we first check if the hook is defined. If so, we let the hook estimate the selectivity and return the result. With this one, I can also develop extensions to better estimate the selectivity.

I hope it makes more sense. Also please forgive me if I am understanding Postgres somehow wrong, as I am quite new to this community :)

Best regards,
Xiaozhe
Attachment

Re: Propose a new hook for mutating the query bounds

From
Tomas Vondra
Date:
On 11/17/21 16:39, Xiaozhe Yao wrote:
> Hi Tom,
> 
> Thanks for your feedback. I completely agree with you that a 
> higher-level hook is better suited for this case. I have adjusted the 
> PoC patch to this email.
> 
> Now it is located in the clauselist_selectivity_ext function, where we 
> first check if the hook is defined. If so, we let the hook estimate the 
> selectivity and return the result. With this one, I can also develop 
> extensions to better estimate the selectivity.
> 

I think clauselist_selectivity is the right level, because this is 
pretty similar to what extended statistics are doing. I'm not sure if 
the hook should be called in clauselist_selectivity_ext or in the plain 
clauselist_selectivity. But it should be in clauselist_selectivity_or 
too, probably.

The way the hook is used seems pretty inconvenient, though. I mean, if 
you do this

     if (clauselist_selectivity_hook)
         return clauselist_selectivity_hook(...);

then what will happen when the ML model has no information applicable to 
a query? This is called for all relations, all conditions, etc. and 
you've short-circuited all the regular code, so the hook will have to 
copy all of that. Seems pretty silly and fragile.

IMO the right approach is what statext_clauselist_selectivity is doing, 
i.e. estimate clauses, mark them as estimated in a bitmap, and let the 
rest of the existing code take care of the remaining clauses. So more 
something like

     if (clauselist_selectivity_hook)
         s1 *= clauselist_selectivity_hook(..., &estimatedclauses);


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Propose a new hook for mutating the query bounds

From
Xiaozhe Yao
Date:
Hi,

Thanks for the previous feedbacks!

> The way the hook is used seems pretty inconvenient, though.

I see the problem, and I agree.

I looked into how other hooks work, and I am wondering if it looks ok if we: pass a pointer to the hook, and let the hook check if there is any information applicable. If there is none, the hook just returns False and we let the rest of the code handle. If it is true, we get the selectivity from the hook and return it. So something like

```
if (clauselist_selectivity_hook &&
(*clauselist_selectivity_hook) (root, clauses, varRelid, jointype, sjinfo, use_extended_stats, &s1))
{
return s1;
}
```

What I am trying to mock is the get_index_stats_hook (https://github.com/taminomara/psql-hooks/blob/master/Detailed.md#get_index_stats_hook).

Am I understanding your idea correctly and does this look somehow better?

Best regards,
Xiaozhe

On Wed, Nov 17, 2021 at 7:47 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 11/17/21 16:39, Xiaozhe Yao wrote:
> Hi Tom,
>
> Thanks for your feedback. I completely agree with you that a
> higher-level hook is better suited for this case. I have adjusted the
> PoC patch to this email.
>
> Now it is located in the clauselist_selectivity_ext function, where we
> first check if the hook is defined. If so, we let the hook estimate the
> selectivity and return the result. With this one, I can also develop
> extensions to better estimate the selectivity.
>

I think clauselist_selectivity is the right level, because this is
pretty similar to what extended statistics are doing. I'm not sure if
the hook should be called in clauselist_selectivity_ext or in the plain
clauselist_selectivity. But it should be in clauselist_selectivity_or
too, probably.

The way the hook is used seems pretty inconvenient, though. I mean, if
you do this

     if (clauselist_selectivity_hook)
         return clauselist_selectivity_hook(...);

then what will happen when the ML model has no information applicable to
a query? This is called for all relations, all conditions, etc. and
you've short-circuited all the regular code, so the hook will have to
copy all of that. Seems pretty silly and fragile.

IMO the right approach is what statext_clauselist_selectivity is doing,
i.e. estimate clauses, mark them as estimated in a bitmap, and let the
rest of the existing code take care of the remaining clauses. So more
something like

     if (clauselist_selectivity_hook)
         s1 *= clauselist_selectivity_hook(..., &estimatedclauses);


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment

Re: Propose a new hook for mutating the query bounds

From
Tomas Vondra
Date:
On 11/18/21 10:59, Xiaozhe Yao wrote:
> Hi,
> 
> Thanks for the previous feedbacks!
> 
>  > The way the hook is used seems pretty inconvenient, though.
> 
> I see the problem, and I agree.
> 
> I looked into how other hooks work, and I am wondering if it looks ok if 
> we: pass a pointer to the hook, and let the hook check if there is any 
> information applicable. If there is none, the hook just returns False 
> and we let the rest of the code handle. If it is true, we get the 
> selectivity from the hook and return it. So something like
> 
> ```
> if (clauselist_selectivity_hook &&
> (*clauselist_selectivity_hook) (root, clauses, varRelid, jointype, 
> sjinfo, use_extended_stats, &s1))
> {
> return s1;
> }
> ```
> 

No, that doesn't really solve the issue, because it's all or nothing 
approach. What if you ML can help estimating just a subset of clauses? 
IMHO the hooks should allow estimating the clauses the ML model was 
built on, and then do the usual estimation for the remaining ones. 
Otherwise you still have to copy large parts of the code.

> What I am trying to mock is the get_index_stats_hook 
> (https://github.com/taminomara/psql-hooks/blob/master/Detailed.md#get_index_stats_hook 
> <https://github.com/taminomara/psql-hooks/blob/master/Detailed.md#get_index_stats_hook>). 
> 

But that hook only deals with a single index at a time - either it finds 
stats for it or not. But this new hook deals with a list of clauses, it 
should allow processing just a subset of them, I think.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company