Thread: Very specialised query

From:
Matthew Wakeling
Date:

So, I have a query that has a very great difference between the possible
performance and the achieved performance. I was wondering if there was a
possibility that Postgres would approach the possible performance by some
devious method.

The query returns a list of overlaps between objects. Each object is
defined by a start position, an end position, and a foreign key to the
thing that it is located on. It's genes on chromosomes, in case you were
wondering. The relevant parts of the location table are as follows:

release-16.0-preview-14-mar=# \d location
         Table "public.location"
      Column      |  Type   | Modifiers
-----------------+---------+-----------
  end             | integer |
  start           | integer |
  objectid        | integer |
  id              | integer | not null
Indexes:
     "location__object" btree (objectid, id)
     "location__start" btree (start)
     "location_bioseg" gist (bioseg_create(intermine_start, intermine_end))

The table has 3490079 entries with 21875 distinct objectids, although the
majority of entries are concentrated on about ten distinct objectids. The
entire table is in cache.

The query to run is:

SELECT
     l1.id AS id1,
     l2.id AS id2
FROM
     location l1,
     location l2
WHERE
         l1.objectid = l2.objectid
     AND l1.id <> l2.id
     AND l1.start < l2.end
     AND l2.start < l1.end

The EXPLAIN result:
                                         QUERY PLAN
----------------------------------------------------------------------------------------------
  Nested Loop  (cost=0.00..180896163.93 rows=54169773639 width=8)
    Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end) AND (l2.start < l1.end))
    ->  Seq Scan on location l1  (cost=0.00..78076.79 rows=3490079 width=16)
    ->  Index Scan using location__object on location l2  (cost=0.00..24.43 rows=1369 width=16)
          Index Cond: (l2.objectid = l1.objectid)
(5 rows)

I could get an EXPLAIN ANALYSE, but it would take quite a few hours.

Now, we have a spatial gist index on (start, end), using the bioseg
add-on, which works great for single overlap lookups, and does speed up
the above query if we alter it to have an equivalent meaning but use the
bioseg function:

SELECT
     l1.id AS id1,
     l2.id AS id2
FROM
     location l1,
     location l2
WHERE
         l1.objectid = l2.objectid
     AND l1.id <> l2.id
     AND bioseg_create(l1.start, l1.end) && bioseg_create(l2.start, l2.end);

The EXPLAIN result:
                                        QUERY PLAN
--------------------------------------------------------------------------------------------
  Nested Loop  (cost=0.00..99982692.10 rows=4875280 width=8)
    Join Filter: ((l1.id <> l2.id) AND (l1.objectid = l2.objectid))
    ->  Seq Scan on location l1  (cost=0.00..78076.79 rows=3490079 width=16)
    ->  Index Scan using location_bioseg on location l2  (cost=0.00..12.92 rows=698 width=16)
          Index Cond: (bioseg_create(l1.start, l1.end) && bioseg_create(l2.start, l2.end))
(5 rows)

This query takes about two hours.

Now, it happens that there is an algorithm for calculating overlaps which
is really quick. It involves iterating through the table in order of the
start variable and keeping a list of ranges which "haven't ended yet".
When you read the next range from the table, you firstly purge all the
ranges from the list that end before the beginning of the new range. Then,
you output a result row for each element in the list combined with the new
range, then you add the new range to the list.
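
For what it's worth, the sweep-line idea described above is easy to express procedurally. Here is a minimal Python sketch (identifiers invented for illustration; it handles one objectid at a time, and assumes the rows arrive sorted by start, as a query with ORDER BY start would deliver them):

```python
def overlaps(rows):
    """Sweep-line overlap join. `rows` is a list of (id, start, end)
    tuples for a single objectid, sorted by start. Yields (id1, id2)
    for every strictly overlapping pair, in both orders, matching the
    semantics of the SQL query."""
    active = []  # the ranges that "haven't ended yet"
    for rid, start, end in rows:
        # purge ranges that end before the new range begins
        active = [r for r in active if r[2] > start]
        # every surviving range overlaps the new one
        for aid, _astart, _aend in active:
            yield (aid, rid)
            yield (rid, aid)
        active.append((rid, start, end))
```

Each new range is compared only against the currently open ranges, so the work is proportional to the output size plus the cost of the sort, rather than to the square of the table size.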

This algorithm just doesn't seem to fit into SQL at all. However, it would
reduce two hours down to a few seconds. Any chance of getting it to run in
Postgres, or any other tips?

Matthew

--
 Hi! You have reached 555-0129. None of us are here to answer the phone and
 the cat doesn't have opposing thumbs, so his messages are illegible. Please
 leave your name and message after the beep ...

From:
"Kevin Grittner"
Date:

Matthew Wakeling <> wrote:
> any other tips?

I would try adding an index on (objectid, start) and another on
(objectid, end) and see how that first query does.  Also, if id is a
unique identifier, I'd recommend a unique constraint or (better) a
primary key definition.  Check the system tables to see whether all
the existing indexes are actually being used -- if not, drop them.

-Kevin

From:
Tom Lane
Date:

Matthew Wakeling <> writes:
> This query takes about two hours.

> Now, it happens that there is an algorithm for calculating overlaps which
> is really quick. It involves iterating through the table in order of the
> start variable and keeping a list of ranges which "haven't ended yet".
> When you read the next range from the table, you firstly purge all the
> ranges from the list that end before the beginning of the new range. Then,
> you output a result row for each element in the list combined with the new
> range, then you add the new range to the list.

> This algorithm just doesn't seem to fit into SQL at all.

No, it doesn't.  Have you thought about coding it in plpgsql?

I have a feeling that it might be possible to do it using SQL:2003
recursive queries, but the procedural coding is likely to be easier
to understand and better-performing.  Not to mention that you won't
have to wait for 8.4...

            regards, tom lane

From:
Matthew Wakeling
Date:

On Thu, 26 Mar 2009, Tom Lane wrote:
> No, it doesn't.  Have you thought about coding it in plpgsql?

*Looks* Nice.

So, it looks like I would be able to write a plpgsql function that returns
a table equivalent to the query I posted earlier. However, I'd like to
eat my cake *and* have it. My intention is to create a view with those
results, and then use that view in all sorts of other queries. This will
mean things like constraining the chromosome, or even constraining one of
the locations.

The algorithm I quoted will work great for the simple case of generating
*all* overlaps. However, it will not be ideal for when the chromosome is
constrained (the constraint needs to be pushed into the query that the
algorithm iterates over, rather than filtered after the algorithm runs),
and it will be very much less than ideal when one of the locations is
constrained (at which point a simple bioseg index lookup is the fastest
way).

Is there a way to define these three methods of generating the results and
get the planner to choose the fastest one?

Matthew

--
 Beware of bugs in the above code; I have only proved it correct, not
 tried it.                                               --Donald Knuth

From:
Matthew Wakeling
Date:

On Thu, 26 Mar 2009, I wrote:
> release-16.0-preview-14-mar=# \d location
>        Table "public.location"
>     Column      |  Type   | Modifiers
> -----------------+---------+-----------
> end             | integer |
> start           | integer |
> objectid        | integer |
> id              | integer | not null
> Indexes:
>    "location__object" btree (objectid, id)
>    "location__start" btree (start)
>    "location_bioseg" gist (bioseg_create(intermine_start, intermine_end))

So, it would be useful if we could make the location_bioseg index a
multi-column index, like this:

CREATE INDEX location_bioseg3 ON location USING GIST (objectid, bioseg_create(intermine_start, intermine_end));

However, I get the following error message:

ERROR:  data type integer has no default operator class for access method "gist"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.

Is there an operator class for integer for gist indexes that I can use?

Matthew

--
 And why do I do it that way? Because I wish to remain sane. Um, actually,
 maybe I should just say I don't want to be any worse than I already am.
         - Computer Science Lecturer

From:
Tom Lane
Date:

Matthew Wakeling <> writes:
> Is there an operator class for integer for gist indexes that I can use?

See contrib/btree_gist.

            regards, tom lane

From:
Віталій Тимчишин
Date:

Hello.

You could try adding "AND l2.start > l1.start" to the first query. This
will drop the symmetric half of the intersections (the ones that remain
are those where l2 starts after l1), but you can rebuild the full results
by taking id1,id2 UNION ALL id2,id1, and it may allow the start index to
be used for a "between"-style condition. For my similar test this looks
like the following:

"  ->  Index Scan using location__start on location l2  (cost=0.00..756.34 rows=37787 width=12)"
"        Index Cond: ((l2.start < l1.end) AND (l2.start > l1.start))"

also an index on (objectid, start) would help, resulting in:

"  ->  Index Scan using lt on location l2  (cost=0.00..0.84 rows=20 width=16)"
"        Index Cond: ((l2.objectid = l1.objectid) AND (l2.start < l1.end) AND (l2.start > l1.start))"

Best regards,
 Vitalii Tymchyshyn

From:
Matthew Wakeling
Date:

On Fri, 27 Mar 2009, Віталій Тимчишин wrote:
> ...an index on (objectid, start) would help...

Definitely.

> You could try  adding    "AND l2.start > l1.start" to the first query. 
> This will drop symmetric half of intersections (the ones that will
> remain are l2 inside or to the left of l1), but you can redo results by
> id1,id2 union all id2, id1 and may allow to use start index for
> "between"

That's remarkably clever, and should have been obvious. Thanks.

Adding the constraint as you say does indeed make the query fast. However,
there seems to be a problem with the planner, in that it does not
distinguish between the costs of two alternative plans, which have vastly
different real costs. Consider the following:

SELECT
     l1.id AS id1,
     l2.id AS id2
FROM
     location l1,
     location l2
WHERE
         l1.objectid = 228000093
     AND l2.objectid = 228000093
     AND l1.id <> l2.id
     AND l1.start < l2.end
     AND l1.end > l2.start
     AND l1.start < l2.start
UNION ALL
SELECT
     l1.id AS id1, l2.id AS id2
FROM
     location l1,
     location l2
WHERE
         l1.objectid = 228000093
     AND l2.objectid = 228000093
     AND l1.id <> l2.id
     AND l1.start < l2.end
     AND l1.end > l2.start
     AND l1.start >= l2.start;
                                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
  Append  (cost=0.00..20479179.74 rows=138732882 width=8)
    ->  Nested Loop  (cost=0.00..9545925.46 rows=69366441 width=8)
          Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
          ->  Index Scan using location_test_obj_end on location l1  (cost=0.00..55966.07 rows=43277 width=12)
                Index Cond: (objectid = 228000093)
          ->  Index Scan using location_test_obj_start on location l2  (cost=0.00..123.10 rows=4809 width=12)
                Index Cond: ((l2.objectid = 228000093) AND (l1.end > l2.start) AND (l1.start < l2.start))
    ->  Nested Loop  (cost=0.00..9545925.46 rows=69366441 width=8)
          Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
          ->  Index Scan using location_test_obj_end on location l1  (cost=0.00..55966.07 rows=43277 width=12)
                Index Cond: (objectid = 228000093)
          ->  Index Scan using location_test_obj_start on location l2  (cost=0.00..123.10 rows=4809 width=12)
                Index Cond: ((l2.objectid = 228000093) AND (l1.end > l2.start) AND (l1.start >= l2.start))
(13 rows)


Notice the two different index conditions:
     (l1.end > l2.start) AND (l1.start < l2.start)  - "between"
     (l1.end > l2.start) AND (l1.start >= l2.start) - open-ended
Both have a cost of (cost=0.00..123.10 rows=4809 width=12)

Postgres estimates these two index scans to be equivalent in cost, where
they are actually vastly different in real cost. Shouldn't Postgres favour
a "between" index scan over an open-ended one?

Matthew

--
 [About NP-completeness] These are the problems that make efficient use of
 the Fairy Godmother.                    -- Computer Science Lecturer

From:
Tom Lane
Date:

Matthew Wakeling <> writes:
> Notice the two different index conditions:
>      (l1.end > l2.start) AND (l1.start < l2.start)  - "between"
>      (l1.end > l2.start) AND (l1.start >= l2.start) - open-ended
> Both have a cost of (cost=0.00..123.10 rows=4809 width=12)

> Postgres estimates these two index scans to be equivalent in cost, where
> they are actually vastly different in real cost. Shouldn't Postgres favour
> a "between" index scan over an open-ended one?

Currently the planner only notices that for a range check that involves
comparisons of the same variable expression to two constants (or
pseudoconstants anyway).  In principle it might be reasonable to have a
heuristic that reduces the estimated selectivity in the example above,
but it looks to me like it'd make clauselist_selectivity() a lot slower
and more complicated.  When you see (l1.end > l2.start), how do you know
which variable to try to match up against others?  And if you try to
match both, what do you do when you get matches for both?

            regards, tom lane

From:
Dimitri Fontaine
Date:

Hi,

On 26 Mar 09, at 15:30, Matthew Wakeling wrote:
> Now, it happens that there is an algorithm for calculating overlaps
> which is really quick. It involves iterating through the table in
> order of the start variable and keeping a list of ranges which
> "haven't ended yet". When you read the next range from the table,
> you firstly purge all the ranges from the list that end before the
> beginning of the new range. Then, you output a result row for each
> element in the list combined with the new range, then you add the
> new range to the list.
>
> This algorithm just doesn't seem to fit into SQL at all.

Maybe it's just that I didn't devote enough time to reading your
detailed explanation above, but this part sounds like it could be done
in an aggregate you'd use in a correlated subquery containing the
right ORDER BY, couldn't it?
   http://www.postgresql.org/docs/8.3/interactive/sql-createaggregate.html

HTH,
--
dim




From:
"Marc Mamin"
Date:

Hello,

if your data are mostly static and you have a few main objects,
maybe you can gain something by defining conditional (partial) indexes for those, plus one for the rest,
and then slicing the query:


create index o_1x on X (start,end,id) where object_id = 1
create index o_2x on X (start,end,id) where object_id = 2
create index o_3x on X (start,end,id) where object_id = 3
create index o_4x on X (start,end,id) where object_id = 4
...
create index o_restx on X (start,end,id) where object_id not in (1,2,3,4..)


I'm not sure that putting all in one index and using the BETWEEN clause
as in my example is the best method though.

Marc Mamin


SELECT
    l1.id AS id1,
    l2.id AS id2
FROM
    location l1,
    location l2
WHERE l1.objectid = 1
    AND (l2.start BETWEEN  l1.start AND l1.end
         OR
         l1.start BETWEEN  l2.start AND l2.end
         )
    AND l1.start <> l2.start -- if required
    AND l2.start <> l2.end   -- if required
    AND l1.id <> l2.id


UNION ALL

...
        WHERE l1.objectid = 2
...    

UNION ALL

...
        WHERE l1.objectid not in (1,2,3,4..)

From:
Matthew Wakeling
Date:

On Fri, 27 Mar 2009, Tom Lane wrote:
>> Notice the two different index conditions:
>>      (l1.end > l2.start) AND (l1.start < l2.start)  - "between"
>>      (l1.end > l2.start) AND (l1.start >= l2.start) - open-ended
>> Both have a cost of (cost=0.00..123.10 rows=4809 width=12)

> Currently the planner only notices that for a range check that involves
> comparisons of the same variable expression to two constants (or
> pseudoconstants anyway).  In principle it might be reasonable to have a
> heuristic that reduces the estimated selectivity in the example above,
> but it looks to me like it'd make clauselist_selectivity() a lot slower
> and more complicated.  When you see (l1.end > l2.start), how do you know
> which variable to try to match up against others?  And if you try to
> match both, what do you do when you get matches for both?

Those two index conditions are on an index scan on the field l2.start.
Therefore, I would expect to only have to take any notice of l2.start when
working out selectivity on a range check for this particular plan. When
there is an index scan on a different field, then try and match that one
up instead.

Matthew

From:
Matthew Wakeling
Date:

On Fri, 27 Mar 2009, Dimitri Fontaine wrote:
> Maybe it's just that I didn't devote enough time to reading your detailed
> explanation above, but this part sounds like it could be done in an aggregate
> you'd use in a correlated subquery containing the right ORDER BY, couldn't
> it?
> http://www.postgresql.org/docs/8.3/interactive/sql-createaggregate.html

Can you return multiple rows from an aggregate function?

Matthew

From:
Matthew Wakeling
Date:

On Fri, 27 Mar 2009, Marc Mamin wrote:
> if your data are mostly static and you have a few mains objects,
> maybe you can have some gain while defining conditional indexes for those plus one for the rest
> and then slicing the query:

Maybe. I thought about doing that. However, I am not convinced it would
be much of a gain, and it would require major rewriting of the queries, as
you suggest.

> WHERE (l2.start BETWEEN  l1.start AND l1.end
>          OR
>          l1.start BETWEEN  l2.start AND l2.end
>          )

Yes, that's another way to calculate an overlap. However, it turns out to
not be that fast. The problem is the OR there, which causes a bitmap
index scan as the leaf of a nested loop join, which can be rather slow.

Matthew

--
 I'd try being a pessimist, but it probably wouldn't work anyway.

From:
"Marc Mamin"
Date:

>> WHERE (l2.start BETWEEN  l1.start AND l1.end
>>          OR
>>          l1.start BETWEEN  l2.start AND l2.end
>>          )

>Yes, that's another way to calculate an overlap. However, it turns out to not be that fast.
>The problem is that OR there, which causes a bitmap index scan, as the leaf of a nested loop join,
>which can be rather slow.


OK, then splitting these checks into two queries with UNION is better.
But I have often read that BETWEEN is faster than using two comparison operators.
Here I guess that a combined index on (start, end) makes sense:

..
WHERE l2.start BETWEEN  l1.start AND l1.end
..
UNION
..
WHERE l1.start BETWEEN  l2.start AND l2.end
..


The first clause being equivalent to

    AND l1.start <= l2.end
    AND l1.end   >= l2.start
    AND l1.start <= l2.start

I don't know how you need to deal with the boundary conditions...


Marc Mamin

From:
Matthew Wakeling
Date:

>> Shouldn't Postgres favour a "between" index scan over an open-ended
>> one?

On Fri, 27 Mar 2009, Tom Lane wrote:
> Currently the planner only notices that for a range check that involves
> comparisons of the same variable expression to two constants (or
> pseudoconstants anyway).  In principle it might be reasonable to have a
> heuristic that reduces the estimated selectivity in the example above,
> but it looks to me like it'd make clauselist_selectivity() a lot slower
> and more complicated.  When you see (l1.end > l2.start), how do you know
> which variable to try to match up against others?  And if you try to
> match both, what do you do when you get matches for both?

Arguably, having multiple "greater than" constraints on a field should not
affect the selectivity much, because the separate constraints will all be
throwing away the same set of rows. So having a single "greater than" will
halve the number of rows, while two "greater than"s will divide the number
of rows by three, etc. That's a vast approximation of course, and should
be adjusted by the statistics.

However, combining a "greater than" with a "less than" has a different
effect on selectivity. If the numbers were completely random with
identical value spread in each field, then a single "greater than" would
halve the number of rows and an additional "less than" would halve the
number again. However, in most real life situations the values will not be
completely random, but will be very skewed, as in our case. Unless the
planner starts collecting statistics on the distribution of the
difference between two fields, that will be hard to account for though.
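
For independent, identically distributed values, the one-half and one-third figures above can be checked numerically. A quick Monte Carlo sketch (invented data, purely illustrative):

```python
import random

random.seed(42)
N = 200_000

# one "greater than": P(a > b) should be about 1/2
single = sum(random.random() > random.random() for _ in range(N)) / N

# two "greater than"s on the same field: P(a > b AND a > c),
# i.e. a is the maximum of three iid values, should be about 1/3
double = 0
for _ in range(N):
    a, b, c = random.random(), random.random(), random.random()
    double += (a > b) and (a > c)
double /= N
```

The skewed real-life distributions are, as noted, exactly where this approximation breaks down.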

Have a look at the following EXPLAIN ANALYSE:

SELECT
     l1.id AS id1,
     l2.id AS id2
FROM
     location l1,
     location l2
WHERE
         l1.objectid = 228000093
     AND l2.objectid = 228000093
     AND l1.id <> l2.id
     AND l1.start < l2.end
     AND l1.end > l2.start
     AND l1.start < l2.start
UNION ALL
SELECT
     l1.id AS id1,
     l2.id AS id2
FROM
     location l1,
     location l2
WHERE
         l1.objectid = 228000093
     AND l2.objectid = 228000093
     AND l1.id <> l2.id
     AND l1.start < l2.end
     AND l1.end > l2.start
     AND l1.start >= l2.start;

                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------
  Append  (cost=0.00..20479179.74 rows=138732882 width=8)
          (actual time=0.055..2362748.298 rows=258210 loops=1)
    ->  Nested Loop  (cost=0.00..9545925.46 rows=69366441 width=8)
                     (actual time=0.053..627.038 rows=99436 loops=1)
          Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
          ->  Index Scan using location_test_obj_end on location l1
                          (cost=0.00..55966.07 rows=43277 width=12)
                          (actual time=0.025..58.604 rows=43534 loops=1)
                Index Cond: (objectid = 228000093)
          ->  Index Scan using location_test_obj_start on location l2
                          (cost=0.00..123.10 rows=4809 width=12)
                          (actual time=0.003..0.006 rows=2 loops=43534)
                Index Cond: ((l2.objectid = 228000093) AND (l1.end > l2.start) AND (l1.start < l2.start))
    ->  Nested Loop  (cost=0.00..9545925.46 rows=69366441 width=8)
                     (actual time=0.041..2361632.540 rows=158774 loops=1)
          Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end))
          ->  Index Scan using location_test_obj_end on location l1
                          (cost=0.00..55966.07 rows=43277 width=12)
                          (actual time=0.009..80.814 rows=43534 loops=1)
                Index Cond: (objectid = 228000093)
          ->  Index Scan using location_test_obj_start on location l2
                          (cost=0.00..123.10 rows=4809 width=12)
                          (actual time=0.012..29.823 rows=21768 loops=43534)
                Index Cond: ((l2.objectid = 228000093) AND (l1.end > l2.start) AND (l1.start >= l2.start))
  Total runtime: 2363015.959 ms
(14 rows)

Note again the two leaf index scans that really matter for performance. On
one of them, there is a difference, and the other is open ended.

                Expected rows     Seen rows
difference       4809                  2
open-ended       4809              21768

The first query fragment takes 700ms to execute, and the second takes
about forty minutes. It is effectively doing a nested loop join with
hardly any benefit gained from the indexes at all, over a simple index on
objectid. I may as well make the query a lot simpler, and do:

SELECT
     l1.id AS id1,
     l2.id AS id2
FROM
     location l1,
     location l2
WHERE
         l1.objectid = 228000093
     AND l2.objectid = 228000093
     AND l1.id <> l2.id
     AND l1.start < l2.end
     AND l1.end > l2.start

Unless this particular issue is improved in the planner, I don't think
this particular style of query is going to work for us. I know that the
database could in theory return a result in about 1400ms. I'll see how
close to that I can get with plpgsql.

Matthew

--
 First law of computing:  Anything can go wro
 sig: Segmentation fault.  core dumped.

From:
Matthew Wakeling
Date:

On Mon, 30 Mar 2009, Marc Mamin wrote:
> But I often read that BETWEEN is faster than using 2 comparison operators.

http://www.postgresql.org/docs/current/static/functions-comparison.html
says otherwise.

> a BETWEEN x AND y
>
> is equivalent to
>
> a >= x AND a <= y
>
> There is no difference between the two respective forms apart from the
> CPU cycles required to rewrite the first one into the second one
> internally.

Matthew

--
 Don't worry!  The world can't end today because it's already tomorrow
 in Australia.

From:
Matthew Wakeling
Date:

On Mon, 30 Mar 2009, Віталій Тимчишин wrote:
> select
> case when n = 1 then id1 else id2 end,
> case when n = 2 then id1 else id2 end
>
> from (
>    ... a, (values (1),(2)) b(n)

Yeah, that's nice.

However, it is still the case that we can't trust the database to choose
the correct plan. It is currently only choosing the correct plan now by
chance, and some time later it may by chance switch to one that takes 40
minutes.

I'll certainly add it to my list of possibilities.

Matthew

--
 Jadzia: Don't forget the 34th rule of acquisition: Peace is good for business.
 Quark:  That's the 35th.
 Jadzia: Oh yes, that's right. What's the 34th again?
 Quark:  War is good for business. It's easy to get them mixed up.

From:
Віталій Тимчишин
Date:

Hi.

Look, what I meant by "symmetric" is that you don't need the second part of the query, because you can get the same results simply by:

select
case when n = 1 then id1 else id2 end,
case when n = 2 then id1 else id2 end

from (
SELECT
   l1.id AS id1,
   l2.id AS id2
FROM
   location l1,
   location l2
WHERE
       l1.objectid = 228000093
   AND l2.objectid = 228000093
   AND l1.id <> l2.id
   AND l1.start < l2.end
   AND l1.end > l2.start
   AND l1.start < l2.start) a, (values (1),(2)) b(n)

(I may be missing some border cases, like when l1.start = l2.start and/or l1.end = l2.end, but this can be fixed by adding "=" to the query).

Look,  You can have 4 types of intersections:
a)  1s 2s 2e 1e - 2 inside 1
b)  2s 1s 1e 2e - 1 inside 2 (symmetric to (a), if you have 1,2 from (a) you can generate 2,1 for (b))
c)  1s 2s 1e 2e - 1 to the left of 2
d)  2s 1s 2e 1e - 2 to the left of 1 (symmetric to (c), if you have 1,2 from (c) you can generate 2,1 for (d))

The query above gives you the results for (a) and (c), and you don't need any second part - simply add the "symmetric" results.

Correct me if I'm missing something.
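
As a sanity check, the symmetry argument can be brute-forced on made-up ranges (a Python sketch; the ids and ranges are invented, and the starts are kept distinct to avoid the border case noted above):

```python
from itertools import product

def brute_pairs(rows):
    """All strictly overlapping pairs of distinct ids, in both orders.
    rows are (id, start, end) tuples."""
    return {(a[0], b[0]) for a, b in product(rows, rows)
            if a[0] != b[0] and a[1] < b[2] and b[1] < a[2]}

def half_pairs(rows):
    """Only the half where l1 starts first (cases (a) and (c) above)."""
    return {(a[0], b[0]) for a, b in product(rows, rows)
            if a[0] != b[0] and a[1] < b[2] and b[1] < a[2] and a[1] < b[1]}

# invented ranges with distinct starts
rows = [(1, 1, 5), (2, 3, 8), (3, 6, 10), (4, 2, 9)]
half = half_pairs(rows)
# mirroring the half reconstructs the full result
full = half | {(id2, id1) for id1, id2 in half}
assert full == brute_pairs(rows)
```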

Best Regards, Vitalii Tymchyshyn

From:
"Marc Mamin"
Date:

Hello Matthew,

Another idea:

Are your objects limited to some smaller ranges of your whole interval?
If so, you may be able to reduce the ranges to search by using an additional table with the min(start) and max(end) of each object...

Marc Mamin

From:
Віталій Тимчишин
Date:


> Yeah, that's nice.
>
> However, it is still the case that we can't trust the database to choose the correct plan. It is currently only choosing the correct plan now by chance, and some time later it may by chance switch to one that takes 40 minutes.

What is the bad plan? Is it like the first plan from your first message?
You can sometimes tweak the optimizer to make sure it picks the correct plan. E.g. when your database fits in memory, you can tweak the page access costs. Also don't forget to raise the statistics target.

BTW, about aggregates: they can return arrays, but I can't imagine what you could group by on... Maybe the windowing functions in 8.4 could help.

Also, if your maximum length (select max(end-start) from location) is low enough, you can try adding some more constraints to make the optimizer happy (giving it a more precise row count so it can select the correct plan).

From:
Matthew Wakeling
Date:

On Mon, 30 Mar 2009, Marc Mamin wrote:
> Are your objects limited to some smaller ranges of your whole interval ?
> If yes you may possibly reduce the ranges to search for while using an additional table with the min(start) max(end) of each object...

No, they aren't. However, even if they were, that would not actually speed
up the query at all. We are already looking up in the index by objectid,
and that would be an equivalent constraint to limiting by the available
range of start/end values.

I'm currently arguing with plpgsql over this problem, but it looks like
it will run reasonably fast.

Matthew

--
 If you're thinking "Oh no, this lecturer thinks Turing Machines are a feasible
 method of computation, where's the door?", then you are in luck. There are
 some there, there, and by the side there. Oxygen masks will not drop from the
 ceiling...                              -- Computer Science Lecturer

From:
Matthew Wakeling
Date:

On Mon, 30 Mar 2009, Віталій Тимчишин wrote:
> select
> case when n == 1 then id1 else id2 end,
> case when n == 2 then id1 else id2 end
>
> from (
> SELECT
>    l1.id AS id1,
>    l2.id AS id2
> FROM
>    location l1,
>    location l2
> WHERE
>        l1.objectid = 228000093
>    AND l2.objectid = 228000093
>    AND l1.id <> l2.id
>    AND l1.start < l2.end
>    AND l1.end > l2.start
>    AND l1.start < l2.start) a, (values (1),(2)) b(n)

It is a nice idea. However, the planner gets the join the wrong way round:

select distinct
     case when n = 1 then id1 else id2 end,
     case when n = 1 then id2 else id1 end
FROM (
     select
         l1.id AS id1,
         l2.id AS id2
     FROM
         location l1,
         location l2
     WHERE
             l1.id <> l2.id
         AND l1.objectid = l2.objectid
         AND l1.start <= l2.end
         AND l2.start <= l1.end
         AND l1.start <= l2.start
     ) AS a,
     (values (1), (2)) b(n);

                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Unique  (cost=7366497963.75..7637346831.94 rows=36113182426 width=12)
          (actual time=1642178.623..2206678.691 rows=139606782 loops=1)
    ->  Sort  (cost=7366497963.75..7456780919.81 rows=36113182426 width=12)
              (actual time=1642178.619..1899057.147 rows=166377424 loops=1)
          Sort Key: (CASE WHEN ("*VALUES*".column1 = 1) THEN l1.subjectid ELSE l2.subjectid END), (CASE WHEN ("*VALUES*".column1 = 1) THEN l2.subjectid ELSE l1.subjectid END)
          Sort Method:  external merge  Disk: 3903272kB
          ->  Nested Loop  (cost=0.00..592890483.66 rows=36113182426 width=12)
                           (actual time=85.333..984211.011 rows=166377424 loops=1)
                ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=4)
                                               (actual time=0.002..0.008 rows=2 loops=1)
                ->  Nested Loop  (cost=0.00..25596373.62 rows=18056591213 width=8)
                                 (actual time=42.684..322743.335 rows=83188712 loops=2)
                      Join Filter: ((l1.subjectid <> l2.subjectid) AND (l1.intermine_start <= l2.intermine_end))
                      ->  Seq Scan on location l1
                                 (cost=0.00..78076.79 rows=3490079 width=16)
                                 (actual time=0.008..3629.672 rows=3490079 loops=2)
                      ->  Index Scan using location_test_obj_start on location l2
                                 (cost=0.00..3.89 rows=152 width=16)
                                 (actual time=0.005..0.038 rows=25 loops=6980158)
                            Index Cond: ((l2.objectid = l1.objectid) AND (l2.intermine_start <= l1.intermine_end) AND (l1.intermine_start <= l2.intermine_start))
  Total runtime: 2339619.383 ms

The outer nested join has the VALUES as the main loop, and the complicated
join as the leaf. So, the complicated overlap-finding join gets run twice.

Oh, there's also the great big sort and unique, but I think I can get rid
of that.

Matthew

--
 Contrary to popular belief, Unix is user friendly. It just happens to be
 very selective about who its friends are.                 -- Kyle Hearn

From:
Віталій Тимчишин
Date:


> The outer nested join has the VALUES as the main loop, and the complicated join as the leaf. So, the complicated overlap-finding join gets run twice.

That's weird. What is your statistics target? The planner is off by a few orders of magnitude, so increasing it may help.
BTW: one of the constraints is redundant: l1.start <= l2.start implies l1.start < l2.end (since l2.start < l2.end), so the latter can be removed, as far as I can see.
 


> Oh, there's also the great big sort and unique, but I think I can get rid of that.

As far as I can see, duplicates will occur if and only if l1.start == l2.start && l1.end == l2.end.
That can be easily filtered by adding "where n=1 or l1.start != l2.start or l1.end != l2.end" to outer select.

From:
Matthew Wakeling
Date:

On Wed, 1 Apr 2009, Віталій Тимчишин wrote:
> The outer nested join has the VALUES as the main loop, and the complicated join as the leaf. So, the complicated
> overlap-finding join gets run twice.
>
> That's weird. What is your statistics target? The planner is off by a few orders of magnitude, so increasing it
> may help.

Unfortunately, the statistics are skewed, so increasing the statistics
target won't help. The problem is this:

select avg(end - start), stddev_pop(end - start), min(start), max(start) from location;

           avg          |   stddev_pop   | min |   max
-----------------------+----------------+-----+----------
  1716.7503512098150214 | 24935.63375733 |   1 | 61544858
(1 row)

>       Oh, there's also the great big sort and unique, but I think I can get rid of that.
>
>
> As far as I can see, duplicates will occur if and only if l1.start == l2.start && l1.end == l2.end.
> That can be easily filtered by adding "where n=1 or l1.start != l2.start or l1.end != l2.end" to outer select.

Close - duplicates will occur when l1.start == l2.start, so you filter
them out by adding "where n = 1 OR l1.start <> l2.start".
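To make the duplicate condition concrete, here is a small self-contained sketch of the two-branch UNION ALL logic visible in the plan above. It is written in Python with made-up intervals and ids (the real query runs in SQL), and it shows that the "n = 1 OR l1.start <> l2.start" filter removes exactly the doubled pairs:

```python
# Hypothetical data: (id, start, end) tuples; ids 1 and 2 share a start.
rows = [(1, 10, 20), (2, 10, 15), (3, 12, 30)]

def overlaps(a, b):
    # Same overlap test as the SQL: a.start < b.end AND b.start < a.end
    return a[1] < b[2] and b[1] < a[2]

def union_all(dedupe):
    out = []
    for n in (1, 2):                       # the VALUES (1), (2) branches
        for l1 in rows:
            for l2 in rows:
                # Asymmetric join: each unordered pair is found once,
                # then emitted both ways round via the CASE on n.
                if l1[0] != l2[0] and l1[1] <= l2[1] and overlaps(l1, l2):
                    if dedupe and not (n == 1 or l1[1] != l2[1]):
                        continue           # the "n = 1 OR l1.start <> l2.start" filter
                    out.append((l1[0], l2[0]) if n == 1 else (l2[0], l1[0]))
    return out

# Without the filter, pairs with equal starts appear twice; with it, once.
```

When two rows share a start, both orderings of the pair satisfy l1.start <= l2.start, so both branches emit the same ordered pair; keeping only the n = 1 copy in that case removes the duplicates.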

Matthew

--
 Lord grant me patience, and I want it NOW!

From:
Matthew Wakeling
Date:

On Mon, 30 Mar 2009, Віталій Тимчишин wrote:
> What is the bad plan? Is it like the first plan from your first message?

It's the plan a few messages back. The UNION ALL query I showed
effectively got the database to do it both ways round.

It's the case that a "between" index scan returns far fewer rows than an
open-ended index scan.

> BTW: about aggregates: they can return arrays, but I can't imagine what you could group by on... Maybe windowing
> functions from 8.4 could help.

A normal function seems the best way to go about this - they can return
multiple rows.

So, I have written a plpgsql function to calculate overlaps. It works
reasonably quickly where there aren't that many overlaps. However, it
seems to go very slowly when there are a large number of rows to return. I
am considering recoding it as a C function instead.
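The plpgsql function itself isn't shown in the thread, but the sorted-scan approach it apparently implements (loop over the table ordered by objectid and start, keeping the set of still-open intervals) can be sketched. This is a minimal illustration in Python with invented row tuples, not the actual function:

```python
def find_overlaps(rows):
    """Sweep rows sorted by (objectid, start): keep the intervals whose
    end we have not yet passed; each new row overlaps exactly those
    still-open intervals on the same objectid.
    rows: iterable of (id, objectid, start, end)."""
    result = []
    active = []        # (end, id) pairs still open for the current objectid
    current = None
    for rid, obj, start, end in sorted(rows, key=lambda r: (r[1], r[2], r[3])):
        if obj != current:
            active, current = [], obj      # new chromosome: reset
        # drop intervals that finish at or before this row's start
        active = [(e, i) for e, i in active if start < e]
        for _, i in active:
            result.append((i, rid))        # each overlapping pair reported once
        active.append((end, rid))
    return result
```

Because the scan touches each row once plus the size of its overlap set, the work is proportional to the output rather than to the join of every pair, which would explain why such a function is fast when there are few overlaps.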

1. The docs say that returning multiple rows from plpgsql waits until the
     whole lot are done before returning any. Does this happen with the C
     functions too?
2. What sort of speedup would I be likely to see?
3. How do I RAISE NOTICE in a C function?

> Also, if your maximum length (select max(end-start) from location) is low enough, you can try adding some more
> constraints to make the optimizer happy (give it a more precise row count so it selects the correct plan).

Alas:

select min(start), max(start), min(end), max(end), max(end - start) from location;

  min |   max    | min |   max    |   max
-----+----------+-----+----------+----------
    1 | 61544858 |   1 | 61545105 | 21512431
(1 row)

Matthew

--
 I suppose some of you have done a Continuous Maths course. Yes? Continuous
 Maths? <menacing stares from audience> Whoah, it was like that, was it!
                                        -- Computer Science Lecturer

From:
Matthew Wakeling
Date:

On Wed, 1 Apr 2009, Matthew Wakeling wrote:
> So, I have written a plpgsql function to calculate overlaps. It works
> reasonably quickly where there aren't that many overlaps. However, it seems
> to go very slowly when there are a large number of rows to return.

In plpgsql, how does memory allocation work? It's just that I'm
generating and throwing away an awful lot of arrays. When do they get
deallocated?

Also, if I assign a variable (or an element of an array) to be the
contents of another variable (which may be a primitive, array, or row),
does it copy the contents or do it by reference?

Matthew

--
 For those of you who are into writing programs that are as obscure and
 complicated as possible, there are opportunities for... real fun here
                                        -- Computer Science Lecturer

From:
Matthew Wakeling
Date:

Trying to rewrite a plpgsql function in C.

How do I do the equivalent of:

FOR loc IN SELECT * FROM location ORDER BY objectid, intermine_start, intermine_end LOOP
END LOOP;

in a C function?

Matthew

--
 I wouldn't be so paranoid if you weren't all out to get me!!

From:
Craig Ringer
Date:

Matthew Wakeling wrote:
> Trying to rewrite a plpgsql function in C.
>
> How do I do the equivalent of:
>
> FOR loc IN SELECT * FROM location ORDER BY objectid, intermine_start,
> intermine_end LOOP
> END LOOP;
>
> in a C function?

Please create a new message to the list with a new subject line for a
new question.

Thanks.

--
Craig Ringer