Thread: Selectivity estimation for intarray with @@

Selectivity estimation for intarray with @@

From
Uriy Zhuravlev
Date:
Hello.

Attached patch based on:
http://www.postgresql.org/message-id/CAPpHfdssY+qEPDCOvxx-b4LP3ybR+qS04M6-ARgGKNFk3FrFow@mail.gmail.com

and adds selectivity estimation functions to @@ (port from tsquery). Now we
support &&, @>, <@ and @@.
In addition it was written migration to version 1.1 intarray. Because of what
this patch requires my other patch:
http://www.postgresql.org/message-id/14346041.DNcb5Y1inS@dinodell

Alexander Korotkov know about this patch.

Thanks.

--
Uriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: Selectivity estimation for intarray with @@

From
Jeff Janes
Date:
On Tue, May 26, 2015 at 4:58 AM, Uriy Zhuravlev <u.zhuravlev@postgrespro.ru> wrote:
Hello.

Attached patch based on:
http://www.postgresql.org/message-id/CAPpHfdssY+qEPDCOvxx-b4LP3ybR+qS04M6-ARgGKNFk3FrFow@mail.gmail.com

and adds selectivity estimation functions to @@ (port from tsquery). Now we
support &&, @>, <@ and @@.
In addition it was written migration to version 1.1 intarray. Because of what
this patch requires my other patch:
http://www.postgresql.org/message-id/14346041.DNcb5Y1inS@dinodell

Alexander Korotkov know about this patch.


Hi Uriy,

This patch looks pretty good.

The first line of intarray--1.1.sql mis-identifies itself as "/* contrib/intarray/intarray--1.0.sql */"

The real file intarray--1.0.sql file probably should not be included in the final patch, but I like having it for testing.

It applies and builds cleanly over the alter operator patch (and now the commit as well), passes make check of the contrib module both with and without cassert.

I could succesfully upgrade from version 1.0 to 1.1 without having to drop the gin__int_ops indexes in the process.

I could do pg_upgrade from 9.2 and 9.4 to 9.6devel with large indexes in place, and then upgrade the extension to 1.1, and it worked without having to rebuild the index.

It does what it says, and I think we want this.

There were some cases where the estimates were not very good, but that seems to be limitation of what pg_stats makes available, not of this patch.  Specifically if the elements listed in the query text are not part of most_common_elems (or worse yet, most_common_elems is null) then it is left to guess with no real data, and those guesses can be pretty bad.  It is not this patches job to fix that, however.

It assumes all the stats are independent and so doesn't account for correlation between members.  This is also how the core selectivity estimates work between columns, so I can't really complain about that.  It is easy to trick it with things like @@ '(!300 & 300)'::query_int, but I don't think that is necessary to fix that.

I have not been creative enough to come up with queries for which this improvement in selectivity estimate causes the execution plan to change in important ways.  I'm sure the serious users of this module would have such cases, however.

I haven't tested gist__int_ops as I can't get those indexes to build in a feasible amount of time.  But the selectivity estimates are independent of any actual index, so testing those doesn't seem to be critical.

There is no documentation change, which makes sense as this internal stuff which isn't documented to start with.

There are no regression test changes.  Not sure about that, it does seem like regression tests would be desirable.

I haven't gone through the C code.

Cheers,

Jeff

Re: Selectivity estimation for intarray with @@

From
Alexander Korotkov
Date:
Hi!

While Uriy is on vacation, I've revised this patch a bit.

On Tue, Jul 14, 2015 at 8:55 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
Hi Uriy,

This patch looks pretty good.

The first line of intarray--1.1.sql mis-identifies itself as "/* contrib/intarray/intarray--1.0.sql */"

Fixed.
 
The real file intarray--1.0.sql file probably should not be included in the final patch, but I like having it for testing.

I've removed intarray--1.0.sql since it shouldn't be in final commit.
 
It applies and builds cleanly over the alter operator patch (and now the commit as well), passes make check of the contrib module both with and without cassert.

I could succesfully upgrade from version 1.0 to 1.1 without having to drop the gin__int_ops indexes in the process.

I could do pg_upgrade from 9.2 and 9.4 to 9.6devel with large indexes in place, and then upgrade the extension to 1.1, and it worked without having to rebuild the index.

It does what it says, and I think we want this.

Good. 
 
There were some cases where the estimates were not very good, but that seems to be limitation of what pg_stats makes available, not of this patch.  Specifically if the elements listed in the query text are not part of most_common_elems (or worse yet, most_common_elems is null) then it is left to guess with no real data, and those guesses can be pretty bad.  It is not this patches job to fix that, however.
 
It assumes all the stats are independent and so doesn't account for correlation between members.  This is also how the core selectivity estimates work between columns, so I can't really complain about that.  It is easy to trick it with things like @@ '(!300 & 300)'::query_int, but I don't think that is necessary to fix that.

Analysis of all the dependencies inside query is NP-complete task. We could try to workout simple cases, but postgres optimizer currently doesn't care about it.

# explain select * from test where val = 'val' and val != 'val';
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on test  (cost=0.00..39831.00 rows=499995 width=8)
   Filter: ((val <> 'val'::text) AND (val = 'val'::text))
(2 rows)

I think we could do the same inside intquery until we figure out some better solution for postgres optimizer in general.
 
I have not been creative enough to come up with queries for which this improvement in selectivity estimate causes the execution plan to change in important ways.  I'm sure the serious users of this module would have such cases, however.

I haven't tested gist__int_ops as I can't get those indexes to build in a feasible amount of time.  But the selectivity estimates are independent of any actual index, so testing those doesn't seem to be critical.

There is no documentation change, which makes sense as this internal stuff which isn't documented to start with.
 
Yes. For instance, tsquery make very similar selectivity estimation as intquery, but it's assumed to be internal and isn't mentioned in documentation.

There are no regression test changes.  Not sure about that, it does seem like regression tests would be desirable.

It would be nice to cover selectivity estimation with regression tests, but AFAICS we didn't do it for any selectivity estimation functions yet. Problem is that selectivity estimation is quite complex process and its result depending on random sampling during analyze, floating points operations and so on. We could make a test like "with very high level of confidence, estimate number of rows here should be in [10;100]". But it's hard to fit such assumptions into our current regression tests infrastructure.

I haven't gone through the C code.

I also did some coding style and comments modifications.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: Selectivity estimation for intarray with @@

From
Teodor Sigaev
Date:
Some notices about code:

1 Near function transformOperator() there is wrong operator name "@<"
2 int_query (and intquery) should be replaced to query_int to be consistent with 
actual type name. At least where it's used as separate lexeme.
3 For historical reasons @@ doesn't commutate with itself, it has a commutator 
~~. Patch assumes that @@ is self-commutator, but ~~ will use  'contsel' as a 
restrict estimation. Suppose, we need to declare ~~ as deprecated and introduce 
commutating @@ operator.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: Selectivity estimation for intarray with @@

From
Alexander Korotkov
Date:
On Tue, Jul 21, 2015 at 5:14 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
Some notices about code:

1 Near function transformOperator() there is wrong operator name "@<"

Fixed.
 
2 int_query (and intquery) should be replaced to query_int to be consistent with actual type name. At least where it's used as separate lexeme.

Fixed.
 
3 For historical reasons @@ doesn't commutate with itself, it has a commutator ~~. Patch assumes that @@ is self-commutator, but ~~ will use  'contsel' as a restrict estimation. Suppose, we need to declare ~~ as deprecated and introduce commutating @@ operator.

I think deprecating ~~ is a subject of separate patch. I fixed patch behavior in the different way. @@ and ~~ are now handled by the same function. The function handles "var @@ const" and "const ~~ var", but doesn't handle "const @@ var" and "var  ~~ const". It determines the case by type of variable: it should be int[].

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 
Attachment

Re: Selectivity estimation for intarray with @@

From
Heikki Linnakangas
Date:
On 07/21/2015 03:44 PM, Alexander Korotkov wrote:
> While Uriy is on vacation, I've revised this patch a bit.

I whacked this around quite a bit, and I think it's in a committable
state now. But if you could run whatever tests you were using before on
this, to make sure it still produces the same estimates, that would be
great. I didn't change the estimates it should produce, only the code
structure.

One thing that bothers me slightly with this patch is the way it peeks
into the Most-Common-Elements arrays, which is produced by the built-in
type analyze function. If we ever change what statistics are collected
for arrays, or the way they are stored, this might break. In matchsel,
why don't we just call the built-in estimator function for each element
that we need to probe, and not look into the statistics ourselves at
all? I actually experimented with that, and it did slash much of the
code, and it would be more future-proof. However, it was also a lot
slower for queries that contain multiple values. That's understandable:
the built-in estimator will fetch the statistics tuple, parse the
arrays, etc. separately for each value in the query_int, while this
patch will do it only once for the whole query, and perform a simple
binary search for each value. So overall, I think this is OK as it is.
But if we find that we need to use the MCE list in this fashion in more
places in the future, it might be worthwhile to add some support code
for this in the backend to allow extracting the stats once, and doing
multiple "lightweight estimations" using the extracted stats.

Some things I fixed/changed:

* I didn't like that transformOperator() function, which looked up the
function's name. I replaced it with separate wrapper functions for each
operator, so that the built-in operator's OID can be hardcoded into each.

* I refactored the matchsel function heavily. I think it's more readable
now.

* I got rid of the Int4Freq array. It didn't seem significantly easier
to work with than the separate values/numbers arrays, so I just used
those directly.

* Also use the matchsel estimator for ~~ (the commutator of @@)

* Also use the estimators for the obsolete @ and ~ operators. Not that I
care much about those as they are obsolete, but seems strange not to, as
it's a trivial matter of setting the right estimator function.

* I added an ANALYZE in the regression test. It still won't
systematically test all the cost estimation code, and there's nothing to
check that the estimates make sense, but at least more of the code will
now run.

- Heikki


Attachment

Re: Selectivity estimation for intarray with @@

From
Alexander Korotkov
Date:
On Tue, Jul 21, 2015 at 6:49 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 07/21/2015 03:44 PM, Alexander Korotkov wrote:
While Uriy is on vacation, I've revised this patch a bit.

I whacked this around quite a bit, and I think it's in a committable state now. But if you could run whatever tests you were using before on this, to make sure it still produces the same estimates, that would be great. I didn't change the estimates it should produce, only the code structure.

One thing that bothers me slightly with this patch is the way it peeks into the Most-Common-Elements arrays, which is produced by the built-in type analyze function. If we ever change what statistics are collected for arrays, or the way they are stored, this might break. In matchsel, why don't we just call the built-in estimator function for each element that we need to probe, and not look into the statistics ourselves at all? I actually experimented with that, and it did slash much of the code, and it would be more future-proof. However, it was also a lot slower for queries that contain multiple values. That's understandable: the built-in estimator will fetch the statistics tuple, parse the arrays, etc. separately for each value in the query_int, while this patch will do it only once for the whole query, and perform a simple binary search for each value. So overall, I think this is OK as it is. But if we find that we need to use the MCE list in this fashion in more places in the future, it might be worthwhile to add some support code for this in the backend to allow extracting the stats once, and doing multiple "lightweight estimations" using the extracted stats.

Yeah, I see. We could end up with something like this. But probably we would need something more general for extensions which wants to play with statistics.
For instance, pg_trgm could estimate selectivity for "text % text" operator. But in order to provide that it needs trigram statistics. Now it could be implemented by defining separate datatype, but it's a kluge. Probably, we would end up with custom additional statistics for datatypes.
 
Some things I fixed/changed:

* I didn't like that transformOperator() function, which looked up the function's name. I replaced it with separate wrapper functions for each operator, so that the built-in operator's OID can be hardcoded into each.

* I refactored the matchsel function heavily. I think it's more readable now.

* I got rid of the Int4Freq array. It didn't seem significantly easier to work with than the separate values/numbers arrays, so I just used those directly.

* Also use the matchsel estimator for ~~ (the commutator of @@)

In this version of patch it's not checked if variable is actually and int[] not query_int. See following test case.

# create table test2 as (select '1'::query_int val from generate_series(1,1000000));
# analyze test2;

# explain select * from test2 where '{1}'::int[] @@ val;
ERROR:  unrecognized int query item type: 0
 
I've added this check.

* Also use the estimators for the obsolete @ and ~ operators. Not that I care much about those as they are obsolete, but seems strange not to, as it's a trivial matter of setting the right estimator function.

* I added an ANALYZE in the regression test. It still won't systematically test all the cost estimation code, and there's nothing to check that the estimates make sense, but at least more of the code will now run.

You also forgot to include intarray--1.0--1.1.sql into patch. I've also added it.
 
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Attachment

Re: Selectivity estimation for intarray with @@

From
Heikki Linnakangas
Date:
On 07/21/2015 07:28 PM, Alexander Korotkov wrote:
> On Tue, Jul 21, 2015 at 6:49 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> In this version of patch it's not checked if variable is actually and int[]
> not query_int. See following test case.
>
> # create table test2 as (select '1'::query_int val from
> generate_series(1,1000000));
> # analyze test2;
>
> # explain select * from test2 where '{1}'::int[] @@ val;
> ERROR:  unrecognized int query item type: 0
>
> I've added this check.
>
> * Also use the estimators for the obsolete @ and ~ operators. Not that I
>> care much about those as they are obsolete, but seems strange not to, as
>> it's a trivial matter of setting the right estimator function.
>>
>> * I added an ANALYZE in the regression test. It still won't systematically
>> test all the cost estimation code, and there's nothing to check that the
>> estimates make sense, but at least more of the code will now run.
>
> You also forgot to include intarray--1.0--1.1.sql into patch. I've also
> added it.

Thanks, committed!

- Heikki