Thread: Proximity query with GIST and row estimation

Proximity query with GIST and row estimation

From

"Guillaume Smet"

Date:

13 February 2007, 13:09:49

Hi all,

Following the work on Mark Stosberg on this list (thanks Mark!), I
optimized our slow proximity queries by using cube, earthdistance
(shipped with contrib) and a gist index. The result is globally very
interesting apart for a specific query and we'd like to be able to fix
it too to be more consistent (it's currently faster with a basic
distance calculation based on acos, cos and so on but it's slow
anyway).

The problem is that we have sometimes very few places near a given
location (small city) and sometimes a lot of them (in Paris, Bruxelles
and so on - it's the case we have here). The gist index I created
doesn't estimate the number of rows in the area very well.

Table: lieu (100k rows) with wgslat and wgslon as numeric
Table: lieugelieu (200k rows, 1k with codegelieu = 'PKG')
Index: "idx_lieu_earth" gist (ll_to_earth(wgslat::double precision,
wgslon::double precision))

The simplified query is:
SELECT DISTINCT l.numlieu, l.nomlieu, ROUND
(earth_distance(ll_to_earth(48.85957600, 2.34860800),
ll_to_earth(l.wgslat, l.wgslon))) as dist
    FROM lieu l, lieugelieu lgl
    WHERE lgl.codegelieu = 'PKG' AND earth_box(ll_to_earth(48.85957600,
2.34860800), 1750) @ ll_to_earth(l.wgslat, l.wgslon) AND lgl.numlieu =
l.numlieu ORDER BY dist ASC LIMIT 2;
It's used to find the nearest car parks from a given location.

The plan is attached plan_earthdistance_nestedloop.txt. It uses a
nested loop because the row estimate is pretty bad: (cost=0.00..3.38
rows=106 width=0) (actual time=30.229..30.229 rows=5864 loops=1).

If I disable the nested loop, the plan is different and faster (see
plan_earthdistance_hash.txt attached).

Is there any way to improve this estimation? I tried to set the
statistics of wgslat and wgslon higher but it doesn't change anything
(I don't know if the operator is designed to use the statistics).

Any other idea to optimize this query is very welcome too.

--
Guillaume

Attachment

Re: Proximity query with GIST and row estimation

From

"Guillaume Smet"

Date:

14 February 2007, 14:13:09

Paul,

On 2/14/07, Paul Ramsey <pramsey@refractions.net> wrote:
> You'll find that PostGIS does a pretty good job of selectivity
> estimation.

PostGIS is probably what I'm going to experiment in the future. The
only problem is that it's really big for a very basic need.
With my current method, I don't even have to create a new column: I
create directly a functional index so it's really easy to use.
Using PostGIS requires to create a new column and triggers to maintain
it and install PostGIS of course. That's why it was not my first
choice.

Thanks for your answer.

--
Guillaume

Re: Proximity query with GIST and row estimation

From

Paul Ramsey

Date:

14 February 2007, 18:32:54

You'll find that PostGIS does a pretty good job of selectivity
estimation.

P

On 13-Feb-07, at 9:09 AM, Guillaume Smet wrote:

> Hi all,
>
> Following the work on Mark Stosberg on this list (thanks Mark!), I
> optimized our slow proximity queries by using cube, earthdistance
> (shipped with contrib) and a gist index. The result is globally very
> interesting apart for a specific query and we'd like to be able to fix
> it too to be more consistent (it's currently faster with a basic
> distance calculation based on acos, cos and so on but it's slow
> anyway).
>
> The problem is that we have sometimes very few places near a given
> location (small city) and sometimes a lot of them (in Paris, Bruxelles
> and so on - it's the case we have here). The gist index I created
> doesn't estimate the number of rows in the area very well.
>
> Table: lieu (100k rows) with wgslat and wgslon as numeric
> Table: lieugelieu (200k rows, 1k with codegelieu = 'PKG')
> Index: "idx_lieu_earth" gist (ll_to_earth(wgslat::double precision,
> wgslon::double precision))
>
> The simplified query is:
> SELECT DISTINCT l.numlieu, l.nomlieu, ROUND
> (earth_distance(ll_to_earth(48.85957600, 2.34860800),
> ll_to_earth(l.wgslat, l.wgslon))) as dist
>     FROM lieu l, lieugelieu lgl
>     WHERE lgl.codegelieu = 'PKG' AND earth_box(ll_to_earth(48.85957600,
> 2.34860800), 1750) @ ll_to_earth(l.wgslat, l.wgslon) AND lgl.numlieu =
> l.numlieu ORDER BY dist ASC LIMIT 2;
> It's used to find the nearest car parks from a given location.
>
> The plan is attached plan_earthdistance_nestedloop.txt. It uses a
> nested loop because the row estimate is pretty bad: (cost=0.00..3.38
> rows=106 width=0) (actual time=30.229..30.229 rows=5864 loops=1).
>
> If I disable the nested loop, the plan is different and faster (see
> plan_earthdistance_hash.txt attached).
>
> Is there any way to improve this estimation? I tried to set the
> statistics of wgslat and wgslon higher but it doesn't change anything
> (I don't know if the operator is designed to use the statistics).
>
> Any other idea to optimize this query is very welcome too.
>
> --
> Guillaume
> <plan_earthdistance_nestedloop.txt>
> <plan_earthdistance_hash.txt>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 6: explain analyze is your friend

Re: Proximity query with GIST and row estimation

From

"Guillaume Smet"

Date:

15 February 2007, 12:13:47

On 2/14/07, Paul Ramsey <pramsey@refractions.net> wrote:
> You'll find that PostGIS does a pretty good job of selectivity
> estimation.

So I finally have a working PostGIS and I fixed the query to use PostGIS.

The use of PostGIS is slower than the previous cube/earthdistance
approach (on a similar query and plan). But you're right, it does a
pretty good job to calculate the selectivity and the estimations are
really good.
It helps to select a good plan (or a bad one if the previous false
numbers led to a better plan which is my case for certain queries).
I suppose it's normal to be slower as it's more precise. I don't know
which approach is better in my case as I don't need the precision of
PostGIS.

For the record, here is what I did:
select AddGeometryColumn('lieu','earthpoint',32631,'POINT',2);
update lieu set earthpoint=Transform(SetSRID(MakePoint(wgslon,
wgslat), 4327), 32631);
create index idx_lieu_earthpoint on lieu using gist(earthpoint
gist_geometry_ops);

analyze lieu;

select numlieu, nomlieu, wgslon, wgslat, astext(earthpoint) from lieu
where earthpoint && Expand(Transform(SetSRID(MakePoint(12.49244400,
41.89103400), 4326), 32631), 3000);

(3000 is the distance in meters)

--
Guillaume

Re: Proximity query with GIST and row estimation

From

"Guillaume Smet"

Date:

16 February 2007, 14:31:52

On 2/15/07, Guillaume Smet <guillaume.smet@gmail.com> wrote:
> The use of PostGIS is slower than the previous cube/earthdistance
> approach (on a similar query and plan).

For the record, here are new information about my proximity query work.

Thanks to Tom Lane, I found the reason of the performance drop. The
problem is that the gist index for operator && is lossy (declared as
RECHECK in the op class).
AFAICS, for the && operator it's done to prevent problems when SRIDs
are not compatible: it forces the execution of the filter and so even
with a "should be non lossy" bitmap index scan, it throws an error as
if we use a seqscan (Paul, correct me if I'm wrong) because it forces
the execution of the filter.

As I'm sure I won't have this problem (I will write a wrapper stored
procedure so that the end users won't see the SRID used), I created a
different opclass without the RECHECK clause:
CREATE OPERATOR CLASS gist_geometry_ops_norecheck FOR TYPE geometry
USING gist AS
        OPERATOR        3        &&,
        FUNCTION        1        LWGEOM_gist_consistent (internal,
geometry, int4),
        FUNCTION        2        LWGEOM_gist_union (bytea, internal),
        FUNCTION        3        LWGEOM_gist_compress (internal),
        FUNCTION        4        LWGEOM_gist_decompress (internal),
        FUNCTION        5        LWGEOM_gist_penalty (internal,
internal, internal),
        FUNCTION        6        LWGEOM_gist_picksplit (internal, internal),
        FUNCTION        7        LWGEOM_gist_same (box2d, box2d, internal);

UPDATE pg_opclass
    SET opckeytype = (SELECT oid FROM pg_type
                          WHERE typname = 'box2d'
                          AND typnamespace = (SELECT oid FROM pg_namespace
                                              WHERE nspname=current_schema()))
    WHERE opcname = 'gist_geometry_ops_norecheck'
        AND opcnamespace = (SELECT oid from pg_namespace
                            WHERE nspname=current_schema());

As I use only the && operator, I put only this one.

And I recreated my index using:
CREATE INDEX idx_lieu_earthpoint ON lieu USING gist(earthpoint
gist_geometry_ops_norecheck);

In the case presented before, the bitmap index scan is then non lossy
and I have similar performances than with earthdistance method.

--
Guillaume