Home > mailing lists

Very specialised query - Mailing list pgsql-performance

From	Matthew Wakeling
Subject	Very specialised query
Date	March 26, 2009 11:31:09
Msg-id	alpine.DEB.2.00.0903261313290.21772@aragorn.flymine.org Whole thread Raw
Responses	Re: Very specialised query Re: Very specialised query Re: Very specialised query Re: Very specialised query Re: Very specialised query
List	pgsql-performance

Tree view

So, I have an query that has a very great difference between the possible
performance and the achieved performance. I was wondering if there was a
possibility that Postgres would approach the possible performance by some
devious method.

The query returns a list of overlaps between objects. Each object is
defined by a start position and end position and a foreign key to the
thing that is is located on. It's genes on chromosomes, in case you were
wondering. The relevant parts of the location table are as follows:

release-16.0-preview-14-mar=# \d location
         Table "public.location"
      Column      |  Type   | Modifiers
-----------------+---------+-----------
  end             | integer |
  start           | integer |
  objectid        | integer |
  id              | integer | not null
Indexes:
     "location__object" btree (objectid, id)
     "location__start" btree (start)
     "location_bioseg" gist (bioseg_create(intermine_start, intermine_end))

The table has 3490079 entries with 21875 distinct objectids, although the
majority of entries are concentrated on about ten distinct objectids. The
entire table is in cache.

The query to run is:

SELECT
     l1.id AS id1,
     l2.id AS id2
FROM
     location l1,
     location l2
WHERE
         l1.objectid = l2.objectid
     AND l1.id <> l2.id
     AND l1.start < l2.end
     AND l2.start < l1.end

The EXPLAIN result:
                                         QUERY PLAN
----------------------------------------------------------------------------------------------
  Nested Loop  (cost=0.00..180896163.93 rows=54169773639 width=8)
    Join Filter: ((l1.id <> l2.id) AND (l1.start < l2.end) AND (l2.start < l1.end))
    ->  Seq Scan on location l1  (cost=0.00..78076.79 rows=3490079 width=16)
    ->  Index Scan using location__object on location l2  (cost=0.00..24.43 rows=1369 width=16)
          Index Cond: (l2.objectid = l1.objectid)
(5 rows)

I could get an EXPLAIN ANALYSE, but it would take quite a few hours.

Now, we have a spacial gist index on (start, end), using the bioseg addon,
which works great for single overlap lookups, and does speed up the above
query, if we alter it to have an equivalent meaning, but use the bioseg
function:

SELECT
     l1.id AS id1,
     l2.id AS id2
FROM
     location l1,
     location l2
WHERE
         l1.objectid = l2.objectid
     AND l1.id <> l2.id
     AND bioseg_create(l1.start, l1.end) && bioseg_create(l2.start, l2.end);

The EXPLAIN result:
                                        QUERY PLAN
--------------------------------------------------------------------------------------------
  Nested Loop  (cost=0.00..99982692.10 rows=4875280 width=8)
    Join Filter: ((l1.id <> l2.id) AND (l1.objectid = l2.objectid))
    ->  Seq Scan on location l1  (cost=0.00..78076.79 rows=3490079 width=16)
    ->  Index Scan using location_bioseg on location l2  (cost=0.00..12.92 rows=698 width=16)
          Index Cond: (bioseg_create(l1.start, l1.end) && bioseg_create(l2.start, l2.end))
(5 rows)

This query takes about two hours.

Now, it happens that there is an algorithm for calculating overlaps which
is really quick. It involves iterating through the table in order of the
start variable and keeping a list of ranges which "haven't ended yet".
When you read the next range from the table, you firstly purge all the
ranges from the list that end before the beginning of the new range. Then,
you output a result row for each element in the list combined with the new
range, then you add the new range to the list.

This algorithm just doesn't seem to fit into SQL at all. However, it would
reduce two hours down to a few seconds. Any chances of it running in
Postgres, or any other tips?

Matthew

--
 Hi! You have reached 555-0129. None of us are here to answer the phone and
 the cat doesn't have opposing thumbs, so his messages are illegible. Please
 leave your name and message after the beep ...

pgsql-performance by date:

From: "Kenny Gorman"
Date: 26 March 2009, 10:51:39
Subject: Re: I have a fusion IO drive available for testing

From: "Kevin Grittner"
Date: 26 March 2009, 11:47:44
Subject: Re: Very specialised query

Very specialised query - Mailing list pgsql-performance

Previous

Next