Thread: Slow query: Select all buildings that have >1 pharmacies and >1 schools within 1000m

Hi

I have an interesting query to be optimized related to this one [1].

The query definition is: Select all buildings that have more than 1
pharmacies and more than 1 schools within a radius of 1000m.

The problem is that I think that this query is inherently O(n^2). In
fact the solution I propose below takes forever...

My questions:

1. Any comments about the nature of this problem?

2. ... on how to speed it up ?

3. In the original query [1] there's a count which contains a
subquery. According to my tests PostgreSQL does not allow this despite
the documentation which says "count(expression)".

Remarks: I know that "count(*)" could be faster on PostgreSQL but
"count(osm_id)" does not change the query plan and this does not seem
to be the bottleneck here anyway.

Yours, S.

[1] http://gis.stackexchange.com/questions/11445/selecting-pois-around-specific-buildings-using-postgis


Here's my query:

-- Select all buildings that have >1 pharmacies and >1 schools within 1000m:
SELECT osm_id AS building_id
FROM
  (SELECT osm_id, way
   FROM osm_polygon
   WHERE tags @> hstore('building','yes')
  ) AS b
WHERE
 (SELECT count(*) > 1
  FROM osm_poi AS p
  WHERE p.tags @> hstore('amenity','pharmacy')
  AND ST_DWithin(b.way,p.way,1000)
 )
 AND
 (SELECT count(*) > 1
  FROM osm_poi AS p
  WHERE p.tags @> hstore('amenity','school')
  AND ST_DWithin(b.way,p.way,1000)
 )
-- Total query runtime: 4308488 ms. 66345 rows retrieved.

Here's the query plan (from EXPLAIN):
"Index Scan using osm_polygon_tags_idx on osm_polygon
(cost=0.00..406812.81 rows=188 width=901)"
"  Index Cond: (tags @> '"building"=>"yes"'::hstore)"
"  Filter: ((SubPlan 1) AND (SubPlan 2))"
"  SubPlan 1"
"    ->  Aggregate  (cost=269.19..269.20 rows=1 width=0)"
"          ->  Bitmap Heap Scan on osm_poi p  (cost=7.76..269.19
rows=1 width=0)"
"                Recheck Cond: (way && st_expand(osm_polygon.way,
1000::double precision))"
"                Filter: ((tags @> '"amenity"=>"pharmacy"'::hstore)
AND (osm_polygon.way && st_expand(way, 1000::double precision)) AND
_st_dwithin(osm_polygon.way, way, 1000::double precision))"
"                ->  Bitmap Index Scan on osm_poi_way_idx
(cost=0.00..7.76 rows=62 width=0)"
"                      Index Cond: (way && st_expand(osm_polygon.way,
1000::double precision))"
"  SubPlan 2"
"    ->  Aggregate  (cost=269.19..269.20 rows=1 width=0)"
"          ->  Bitmap Heap Scan on osm_poi p  (cost=7.76..269.19
rows=1 width=0)"
"                Recheck Cond: (way && st_expand(osm_polygon.way,
1000::double precision))"
"                Filter: ((tags @> '"amenity"=>"school"'::hstore) AND
(osm_polygon.way && st_expand(way, 1000::double precision)) AND
_st_dwithin(osm_polygon.way, way, 1000::double precision))"
"                ->  Bitmap Index Scan on osm_poi_way_idx
(cost=0.00..7.76 rows=62 width=0)"
"                      Index Cond: (way && st_expand(osm_polygon.way,
1000::double precision))"

***

On 7 Srpen 2012, 14:01, Stefan Keller wrote:
> Hi
>
> I have an interesting query to be optimized related to this one [1].
>
> The query definition is: Select all buildings that have more than 1
> pharmacies and more than 1 schools within a radius of 1000m.
>
> The problem is that I think that this query is inherently O(n^2). In
> fact the solution I propose below takes forever...

What about plain INTERSECT? Something like

SELECT osm_id FROM osm_poi AS p, osm_polygon b
   WHERE p.tags @> hstore('amenity','pharmacy')
   AND ST_DWithin(b.way,p.way,1000)
INTERSECT
SELECT osm_id FROM osm_poi AS p, osm_polygon b
   WHERE p.tags @> hstore('amenity','school')
   AND ST_DWithin(b.way,p.way,1000)

Or something like that. But maybe it's a complete nonsense ...

Tomas


Your proposal lacks the requirement that it's the same building from
where pharmacies and schools are reachable.
But I think about.

Yours, S.

2012/8/7 Tomas Vondra <tv@fuzzy.cz>:
> On 7 Srpen 2012, 14:01, Stefan Keller wrote:
>> Hi
>>
>> I have an interesting query to be optimized related to this one [1].
>>
>> The query definition is: Select all buildings that have more than 1
>> pharmacies and more than 1 schools within a radius of 1000m.
>>
>> The problem is that I think that this query is inherently O(n^2). In
>> fact the solution I propose below takes forever...
>
> What about plain INTERSECT? Something like
>
> SELECT osm_id FROM osm_poi AS p, osm_polygon b
>    WHERE p.tags @> hstore('amenity','pharmacy')
>    AND ST_DWithin(b.way,p.way,1000)
> INTERSECT
> SELECT osm_id FROM osm_poi AS p, osm_polygon b
>    WHERE p.tags @> hstore('amenity','school')
>    AND ST_DWithin(b.way,p.way,1000)
>
> Or something like that. But maybe it's a complete nonsense ...
>
> Tomas
>

On 7 Srpen 2012, 14:22, Stefan Keller wrote:
> Your proposal lacks the requirement that it's the same building from
> where pharmacies and schools are reachable.
> But I think about.

I don't know the dataset so I've expected the osm_id to identify the
building - then the intersect should work as AND for the conditions.

And I see I've forgot to include the 'is building' condition, so it should
be like this:

 SELECT b.osm_id FROM osm_poi AS p, osm_polygon b
    WHERE p.tags @> hstore('amenity','pharmacy')
    AND b.tags @> hstore('building','yes')
    AND ST_DWithin(b.way,p.way,1000)
 INTERSECT
 SELECT b.osm_id FROM osm_poi AS p, osm_polygon b
    WHERE p.tags @> hstore('amenity','school')
    AND b.tags @> hstore('building','yes')
    AND ST_DWithin(b.way,p.way,1000)

Tomas


On Tue, Aug 7, 2012 at 5:01 AM, Stefan Keller <sfkeller@gmail.com> wrote:
Hi

I have an interesting query to be optimized related to this one [1].

The query definition is: Select all buildings that have more than 1
pharmacies and more than 1 schools within a radius of 1000m.

The problem is that I think that this query is inherently O(n^2). In
fact the solution I propose below takes forever...

Maybe you could get rid of the O(n^2) aspect like this:

   Select all buildings that have more than 1
   pharmacies and more than 1 schools within a radius of 1000m
   from
      (Select all buildings that have more than four (pharmacy or school)
        within a radius of 1000m)

The inner select should be fast -- you could make it fast by creating a new property like "building of interest" that was "pharmacy or school" and build an index on the "building of interest" property.

The inner query would reduce your sample set to a much smaller set of buildings, and presumably the outer query could handle that pretty quickly.

Craig James
 

My questions:

1. Any comments about the nature of this problem?

2. ... on how to speed it up ?

3. In the original query [1] there's a count which contains a
subquery. According to my tests PostgreSQL does not allow this despite
the documentation which says "count(expression)".

Remarks: I know that "count(*)" could be faster on PostgreSQL but
"count(osm_id)" does not change the query plan and this does not seem
to be the bottleneck here anyway.

Yours, S.

[1] http://gis.stackexchange.com/questions/11445/selecting-pois-around-specific-buildings-using-postgis


Here's my query:

-- Select all buildings that have >1 pharmacies and >1 schools within 1000m:
SELECT osm_id AS building_id
FROM
  (SELECT osm_id, way
   FROM osm_polygon
   WHERE tags @> hstore('building','yes')
  ) AS b
WHERE
 (SELECT count(*) > 1
  FROM osm_poi AS p
  WHERE p.tags @> hstore('amenity','pharmacy')
  AND ST_DWithin(b.way,p.way,1000)
 )
 AND
 (SELECT count(*) > 1
  FROM osm_poi AS p
  WHERE p.tags @> hstore('amenity','school')
  AND ST_DWithin(b.way,p.way,1000)
 )
-- Total query runtime: 4308488 ms. 66345 rows retrieved.

Here's the query plan (from EXPLAIN):
"Index Scan using osm_polygon_tags_idx on osm_polygon
(cost=0.00..406812.81 rows=188 width=901)"
"  Index Cond: (tags @> '"building"=>"yes"'::hstore)"
"  Filter: ((SubPlan 1) AND (SubPlan 2))"
"  SubPlan 1"
"    ->  Aggregate  (cost=269.19..269.20 rows=1 width=0)"
"          ->  Bitmap Heap Scan on osm_poi p  (cost=7.76..269.19
rows=1 width=0)"
"                Recheck Cond: (way && st_expand(osm_polygon.way,
1000::double precision))"
"                Filter: ((tags @> '"amenity"=>"pharmacy"'::hstore)
AND (osm_polygon.way && st_expand(way, 1000::double precision)) AND
_st_dwithin(osm_polygon.way, way, 1000::double precision))"
"                ->  Bitmap Index Scan on osm_poi_way_idx
(cost=0.00..7.76 rows=62 width=0)"
"                      Index Cond: (way && st_expand(osm_polygon.way,
1000::double precision))"
"  SubPlan 2"
"    ->  Aggregate  (cost=269.19..269.20 rows=1 width=0)"
"          ->  Bitmap Heap Scan on osm_poi p  (cost=7.76..269.19
rows=1 width=0)"
"                Recheck Cond: (way && st_expand(osm_polygon.way,
1000::double precision))"
"                Filter: ((tags @> '"amenity"=>"school"'::hstore) AND
(osm_polygon.way && st_expand(way, 1000::double precision)) AND
_st_dwithin(osm_polygon.way, way, 1000::double precision))"
"                ->  Bitmap Index Scan on osm_poi_way_idx
(cost=0.00..7.76 rows=62 width=0)"
"                      Index Cond: (way && st_expand(osm_polygon.way,
1000::double precision))"

***

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Hi Craig

Clever proposal!
I slightly tried to adapt it to the hstore involved.
Now I'm having a weird problem that PG says that "relation 'p' does not exist".
Why does PG recognize table b in the subquery but not table p?
Any ideas?

-- Stefan


SELECT b.way AS building_geometry
FROM
 (SELECT way
  FROM osm_polygon
  WHERE tags @> hstore('building','yes')
 ) AS b,
 (SELECT way, tags->'amenity' as value
  FROM osm_poi
  WHERE tags ? 'amenity'
 ) AS p
WHERE
 (SELECT count(*) > 1
  FROM p
  WHERE p.value = 'pharmacy'
  AND ST_DWithin(b.way,p.way,1000)
 )
 AND
 (SELECT count(*) > 1
  FROM p
  WHERE p.value = 'school'
  AND ST_DWithin(b.way,p.way,1000)
 )

ERROR:  relation "p" does not exist
LINE 14:   FROM p


2012/8/7 Craig James <cjames@emolecules.com>:
> On Tue, Aug 7, 2012 at 5:01 AM, Stefan Keller <sfkeller@gmail.com> wrote:
>>
>> Hi
>>
>> I have an interesting query to be optimized related to this one [1].
>>
>> The query definition is: Select all buildings that have more than 1
>> pharmacies and more than 1 schools within a radius of 1000m.
>>
>> The problem is that I think that this query is inherently O(n^2). In
>> fact the solution I propose below takes forever...
>
>
> Maybe you could get rid of the O(n^2) aspect like this:
>
>
>    Select all buildings that have more than 1
>    pharmacies and more than 1 schools within a radius of 1000m
>    from
>       (Select all buildings that have more than four (pharmacy or school)
>         within a radius of 1000m)
>
> The inner select should be fast -- you could make it fast by creating a new
> property like "building of interest" that was "pharmacy or school" and build
> an index on the "building of interest" property.
>
> The inner query would reduce your sample set to a much smaller set of
> buildings, and presumably the outer query could handle that pretty quickly.
>
> Craig James
>
>>
>>
>> My questions:
>>
>> 1. Any comments about the nature of this problem?
>>
>> 2. ... on how to speed it up ?
>>
>> 3. In the original query [1] there's a count which contains a
>> subquery. According to my tests PostgreSQL does not allow this despite
>> the documentation which says "count(expression)".
>>
>> Remarks: I know that "count(*)" could be faster on PostgreSQL but
>> "count(osm_id)" does not change the query plan and this does not seem
>> to be the bottleneck here anyway.
>>
>> Yours, S.
>>
>> [1]
>> http://gis.stackexchange.com/questions/11445/selecting-pois-around-specific-buildings-using-postgis
>>
>>
>> Here's my query:
>>
>> -- Select all buildings that have >1 pharmacies and >1 schools within
>> 1000m:
>> SELECT osm_id AS building_id
>> FROM
>>   (SELECT osm_id, way
>>    FROM osm_polygon
>>    WHERE tags @> hstore('building','yes')
>>   ) AS b
>> WHERE
>>  (SELECT count(*) > 1
>>   FROM osm_poi AS p
>>   WHERE p.tags @> hstore('amenity','pharmacy')
>>   AND ST_DWithin(b.way,p.way,1000)
>>  )
>>  AND
>>  (SELECT count(*) > 1
>>   FROM osm_poi AS p
>>   WHERE p.tags @> hstore('amenity','school')
>>   AND ST_DWithin(b.way,p.way,1000)
>>  )
>> -- Total query runtime: 4308488 ms. 66345 rows retrieved.
>>
>> Here's the query plan (from EXPLAIN):
>> "Index Scan using osm_polygon_tags_idx on osm_polygon
>> (cost=0.00..406812.81 rows=188 width=901)"
>> "  Index Cond: (tags @> '"building"=>"yes"'::hstore)"
>> "  Filter: ((SubPlan 1) AND (SubPlan 2))"
>> "  SubPlan 1"
>> "    ->  Aggregate  (cost=269.19..269.20 rows=1 width=0)"
>> "          ->  Bitmap Heap Scan on osm_poi p  (cost=7.76..269.19
>> rows=1 width=0)"
>> "                Recheck Cond: (way && st_expand(osm_polygon.way,
>> 1000::double precision))"
>> "                Filter: ((tags @> '"amenity"=>"pharmacy"'::hstore)
>> AND (osm_polygon.way && st_expand(way, 1000::double precision)) AND
>> _st_dwithin(osm_polygon.way, way, 1000::double precision))"
>> "                ->  Bitmap Index Scan on osm_poi_way_idx
>> (cost=0.00..7.76 rows=62 width=0)"
>> "                      Index Cond: (way && st_expand(osm_polygon.way,
>> 1000::double precision))"
>> "  SubPlan 2"
>> "    ->  Aggregate  (cost=269.19..269.20 rows=1 width=0)"
>> "          ->  Bitmap Heap Scan on osm_poi p  (cost=7.76..269.19
>> rows=1 width=0)"
>> "                Recheck Cond: (way && st_expand(osm_polygon.way,
>> 1000::double precision))"
>> "                Filter: ((tags @> '"amenity"=>"school"'::hstore) AND
>> (osm_polygon.way && st_expand(way, 1000::double precision)) AND
>> _st_dwithin(osm_polygon.way, way, 1000::double precision))"
>> "                ->  Bitmap Index Scan on osm_poi_way_idx
>> (cost=0.00..7.76 rows=62 width=0)"
>> "                      Index Cond: (way && st_expand(osm_polygon.way,
>> 1000::double precision))"
>>
>> ***
>>
>> --
>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-performance
>
>

On Tue, Aug 7, 2012 at 5:07 PM, Stefan Keller <sfkeller@gmail.com> wrote:
> Hi Craig
>
> Clever proposal!
> I slightly tried to adapt it to the hstore involved.
> Now I'm having a weird problem that PG says that "relation 'p' does not exist".
> Why does PG recognize table b in the subquery but not table p?
> Any ideas?

I don't think it does recognize b, either.  It just fell over on p
before it had a chance to fall over on b.

I think you have to use WITH if you want to reference the same
subquery in multiple FROMs.

Another approach would be to add explicit conditions for there being
at least 1 school and 1 pharmacy within distance.  There can't be >1
unless there is >=1, but the join possibilities for >=1 (i.e. "where
exists" rather than "where (select count(*)...)>1" )  are much more
attractive than the ones for >1.

Cheers,

Jeff

Hi

2012/8/8 Jeff Janes <jeff.janes@gmail.com>:
> On Tue, Aug 7, 2012 at 5:07 PM, Stefan Keller <sfkeller@gmail.com> wrote:
>> Hi Craig
>>
>> Clever proposal!
>> I slightly tried to adapt it to the hstore involved.
>> Now I'm having a weird problem that PG says that "relation 'p' does not exist".
>> Why does PG recognize table b in the subquery but not table p?
>> Any ideas?
>
> I don't think it does recognize b, either.  It just fell over on p
> before it had a chance to fall over on b.

No, the b get's recognized. See my original query.
That's a strange behaviour of the SQL parser which I can't understand.

> I think you have to use WITH if you want to reference the same
> subquery in multiple FROMs.

I'll try that with CTE too.

> Another approach would be to add explicit conditions for there being
> at least 1 school and 1 pharmacy within distance.  There can't be >1
> unless there is >=1, but the join possibilities for >=1 (i.e. "where
> exists" rather than "where (select count(*)...)>1" )  are much more
> attractive than the ones for >1.
>
> Cheers,
>
> Jeff

You mean, first doing a select on existence and then apply the count
condition later?

Stefan

On Thu, Aug 9, 2012 at 4:00 AM, Stefan Keller <sfkeller@gmail.com> wrote:
> Hi
>
> 2012/8/8 Jeff Janes <jeff.janes@gmail.com>:
>> On Tue, Aug 7, 2012 at 5:07 PM, Stefan Keller <sfkeller@gmail.com> wrote:
>>> Hi Craig
>>>
>>> Clever proposal!
>>> I slightly tried to adapt it to the hstore involved.
>>> Now I'm having a weird problem that PG says that "relation 'p' does not exist".
>>> Why does PG recognize table b in the subquery but not table p?
>>> Any ideas?
>>
>> I don't think it does recognize b, either.  It just fell over on p
>> before it had a chance to fall over on b.
>
> No, the b get's recognized. See my original query.
> That's a strange behaviour of the SQL parser which I can't understand.

Oh, I see.  You are referencing b only as the qualifier for a column
name, while you are trying to reference p as a an entire query.  I
initially misread it and thought you referencing both b and p in both
ways each.

>
>> I think you have to use WITH if you want to reference the same
>> subquery in multiple FROMs.
>
> I'll try that with CTE too.
>
>> Another approach would be to add explicit conditions for there being
>> at least 1 school and 1 pharmacy within distance.  There can't be >1
>> unless there is >=1, but the join possibilities for >=1 (i.e. "where
>> exists" rather than "where (select count(*)...)>1" )  are much more
>> attractive than the ones for >1.
>>
>> Cheers,
>>
>> Jeff
>
> You mean, first doing a select on existence and then apply the count
> condition later?

Yes, exactly.

Of course this won't help if most buildings do have at least one of
each within distance, as then the prefilter is not very selective.

Cheers,

Jeff