Thread: Geometric types row estimation

Geometric types row estimation

From
Igor ALBUQUERQUE SILVA
Date:
Hello everyone,

I'm having a problem regarding the point type/gist indexes. Here's a minimal reproduction of it:

create table test(p point);
insert into test(p) values (point(0, 0));
insert into test(p) values (point(0, 1));
insert into test(p) values (point(1, 0));
insert into test(p) values (point(1, 1));
insert into test(p) values (point(50, 0));
analyze test;
explain analyze select * from test where p <@ box '(0,0),(1,1)';
explain analyze select * from test where p <@ box '(50,0),(51,1)';

The two queries get the same cost/row estimation, of 1 row. This is the EXPLAIN ANALYZE of the first query:

Seq Scan on test  (cost=0.00..1.07 rows=1 width=16) (actual time=0.022..0.026 rows=4 loops=1)
   Filter: ((p[0] >= '0'::double precision) AND (p[0] <= '1'::double precision))
   Rows Removed by Filter: 1
 Planning Time: 0.115 ms
 Execution Time: 0.055 ms
(5 rows)

What I was expecting is the first query to estimate 4 rows and the second to estimate 1, like what I get If I try the same thing using integers.

create table test(x integer, y integer);
insert into test(x, y) values (0, 0);
insert into test(x, y) values (0, 1);
insert into test(x, y) values (1, 0);
insert into test(x, y) values (1, 1);
insert into test(x, y) values (50, 0);
analyze test;
explain analyze select * from test where x between 0 and 1 and y between 0 and 1;
explain analyze select * from test where x between 50 and 51 and y between 0 and 1;

My question is: is this expected behaviour? I actually have a much larger table with a gist index where I found this occurring, and this causes the planner to make bad decisions: every query that I do will have the same estimation, and whenever this estimation is very wrong, the planner does not take the optimal decision.

I'm using the official docker image, PostgreSQL 15.1 (Debian 15.1-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit, running everything in psql (PostgreSQL) 15.1 (Ubuntu 15.1-1.pgdg22.04+1).

Best regards,
Igor

Re: Geometric types row estimation

From
Igor ALBUQUERQUE SILVA
Date:
I'm sorry, I sent the wrong EXPLAIN ANALYZE for the first query, this is the correct one:

Seq Scan on test  (cost=0.00..1.06 rows=1 width=16) (actual time=0.018..0.022 rows=4 loops=1)
   Filter: (p <@ '(1,1),(0,0)'::box)
   Rows Removed by Filter: 1
 Planning Time: 0.211 ms
 Execution Time: 0.051 ms
(5 rows)

On Wed, 30 Nov 2022 at 17:44, Igor ALBUQUERQUE SILVA <i.albuquerque-silva@kayrros.com> wrote:
Hello everyone,

I'm having a problem regarding the point type/gist indexes. Here's a minimal reproduction of it:

create table test(p point);
insert into test(p) values (point(0, 0));
insert into test(p) values (point(0, 1));
insert into test(p) values (point(1, 0));
insert into test(p) values (point(1, 1));
insert into test(p) values (point(50, 0));
analyze test;
explain analyze select * from test where p <@ box '(0,0),(1,1)';
explain analyze select * from test where p <@ box '(50,0),(51,1)';

The two queries get the same cost/row estimation, of 1 row. This is the EXPLAIN ANALYZE of the first query:

Seq Scan on test  (cost=0.00..1.07 rows=1 width=16) (actual time=0.022..0.026 rows=4 loops=1)
   Filter: ((p[0] >= '0'::double precision) AND (p[0] <= '1'::double precision))
   Rows Removed by Filter: 1
 Planning Time: 0.115 ms
 Execution Time: 0.055 ms
(5 rows)

What I was expecting is the first query to estimate 4 rows and the second to estimate 1, like what I get If I try the same thing using integers.

create table test(x integer, y integer);
insert into test(x, y) values (0, 0);
insert into test(x, y) values (0, 1);
insert into test(x, y) values (1, 0);
insert into test(x, y) values (1, 1);
insert into test(x, y) values (50, 0);
analyze test;
explain analyze select * from test where x between 0 and 1 and y between 0 and 1;
explain analyze select * from test where x between 50 and 51 and y between 0 and 1;

My question is: is this expected behaviour? I actually have a much larger table with a gist index where I found this occurring, and this causes the planner to make bad decisions: every query that I do will have the same estimation, and whenever this estimation is very wrong, the planner does not take the optimal decision.

I'm using the official docker image, PostgreSQL 15.1 (Debian 15.1-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit, running everything in psql (PostgreSQL) 15.1 (Ubuntu 15.1-1.pgdg22.04+1).

Best regards,
Igor

Re: Geometric types row estimation

From
Tom Lane
Date:
Igor ALBUQUERQUE SILVA <i.albuquerque-silva@kayrros.com> writes:
> I'm having a problem regarding the point type/gist indexes. Here's a
> minimal reproduction of it:
> ...
> What I was expecting is the first query to estimate 4 rows and the second
> to estimate 1, like what I get If I try the same thing using integers.

Unfortunately, the selectivity estimation functions for PG's geometric
types are mostly just stubs.  The estimation function for point <@ box
in particular is contsel [1]:

/*
 *    contsel -- How likely is a box to contain (be contained by) a given box?
 *
 * This is a tighter constraint than "overlap", so produce a smaller
 * estimate than areasel does.
 */
Datum
contsel(PG_FUNCTION_ARGS)
{
    PG_RETURN_FLOAT8(0.001);
}

It's been like that (excepting notational changes) since Berkeley days,
because nobody has bothered to make it better.

In general, PG's built-in geometric types have never gotten much
beyond their origins as an academic proof-of-concept.  I think people
who are doing serious work that requires such operations mostly use
PostGIS, and I'd suggest looking into that.

Or, if you feel like doing a lot of work to make these estimators
better, have at it.

            regards, tom lane

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob_plain;f=src/backend/utils/adt/geo_selfuncs.c;hb=HEAD



Re: Geometric types row estimation

From
Igor ALBUQUERQUE SILVA
Date:
Hi Tom,

Thanks a lot for the explanation, I thought the built-in types were more standard, so I didn't mention that I was having the same thing using postgis. Here's the example (I changed the values a little bit to avoid rounding errors):

create table test(p geometry(point));
insert into test(p) values (st_makepoint(0,0));
insert into test(p) values (st_makepoint(0,1));
insert into test(p) values (st_makepoint(1,0));
insert into test(p) values (st_makepoint(1,1));
insert into test(p) values (st_makepoint(50,0));
analyze test;
explain analyze select * from test where ST_Contains(ST_GeomFromText('POLYGON((-1 -1,2 -1,2 2,-1 2,-1 -1))'), p);
explain analyze select * from test where ST_Contains(ST_GeomFromText('POLYGON((49 -1,51 -1,51 1,49 1,49 -1))'), p);

EXPLAIN ANALYZE:

 Seq Scan on test  (cost=0.00..126.05 rows=1 width=32) (actual time=0.015..0.022 rows=4 loops=1)
   Filter: st_contains('01030000000100000005000000000000000000F0BF000000000000F0BF0000000000000040000000000000F0BF00000000000000400000000000000040000000000000F0BF0000000000000040000000000000F0BF000000000000F0BF'::geometry, p)
   Rows Removed by Filter: 1
 Planning Time: 0.072 ms
 Execution Time: 0.035 ms
(5 rows)

Do you know if the functions in Postgis are also stubbed? Or maybe I'm doing something wrong with the syntax?

This time I'm using the postgis docker image, PostgreSQL 15.1 (Debian 15.1-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit

Best regards,
Igor

On Wed, 30 Nov 2022 at 18:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Igor ALBUQUERQUE SILVA <i.albuquerque-silva@kayrros.com> writes:
> I'm having a problem regarding the point type/gist indexes. Here's a
> minimal reproduction of it:
> ...
> What I was expecting is the first query to estimate 4 rows and the second
> to estimate 1, like what I get If I try the same thing using integers.

Unfortunately, the selectivity estimation functions for PG's geometric
types are mostly just stubs.  The estimation function for point <@ box
in particular is contsel [1]:

/*
 *      contsel -- How likely is a box to contain (be contained by) a given box?
 *
 * This is a tighter constraint than "overlap", so produce a smaller
 * estimate than areasel does.
 */
Datum
contsel(PG_FUNCTION_ARGS)
{
        PG_RETURN_FLOAT8(0.001);
}

It's been like that (excepting notational changes) since Berkeley days,
because nobody has bothered to make it better.

In general, PG's built-in geometric types have never gotten much
beyond their origins as an academic proof-of-concept.  I think people
who are doing serious work that requires such operations mostly use
PostGIS, and I'd suggest looking into that.

Or, if you feel like doing a lot of work to make these estimators
better, have at it.

                        regards, tom lane

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob_plain;f=src/backend/utils/adt/geo_selfuncs.c;hb=HEAD

Re: Geometric types row estimation

From
Tom Lane
Date:
Igor ALBUQUERQUE SILVA <i.albuquerque-silva@kayrros.com> writes:
> Thanks a lot for the explanation, I thought the built-in types were more
> standard, so I didn't mention that I was having the same thing using
> postgis.

Hm --- you'd have to take that up with the PostGIS people.  But they
at least would be likely to have motivation to improve things.

            regards, tom lane



Re: Geometric types row estimation

From
Igor ALBUQUERQUE SILVA
Date:
Ok I'll do that, thanks a lot!

On Wed, 30 Nov 2022 at 18:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Igor ALBUQUERQUE SILVA <i.albuquerque-silva@kayrros.com> writes:
> Thanks a lot for the explanation, I thought the built-in types were more
> standard, so I didn't mention that I was having the same thing using
> postgis.

Hm --- you'd have to take that up with the PostGIS people.  But they
at least would be likely to have motivation to improve things.

                        regards, tom lane