Thread: Query plan for "heavy" SELECT with "lite" sub-SELECTs

Query plan for "heavy" SELECT with "lite" sub-SELECTs

From

"Nikolay Samokhvalov"

Date:

02 November 2006, 07:07:23

Hello,

I do not understand, why Postgres very ofter starts execution from
sub-select instead of doing main select and then proceeding to "lite"
sub-selects. For example:

(example is quite weird, but it demonstrates the problem)

1. explain analyze select * from pg_proc offset 1500 limit 1;
"Limit  (cost=116.91..116.99 rows=1 width=365) (actual
time=2.111..2.112 rows=1 loops=1)"
"  ->  Seq Scan on pg_proc  (cost=0.00..175.52 rows=2252 width=365)
(actual time=0.034..1.490 rows=1501 loops=1)"
"Total runtime: 2.156 ms"

3. explain analyze select oid,* from pg_type where oid=2277 limit 1;
"Limit  (cost=0.00..5.91 rows=1 width=816) (actual time=0.021..0.022
rows=1 loops=1)"
"  ->  Index Scan using pg_type_oid_index on pg_type  (cost=0.00..5.91
rows=1 width=816) (actual time=0.018..0.018 rows=1 loops=1)"
"        Index Cond: (oid = 2277::oid)"
"Total runtime: 0.079 ms"

2. explain analyze select
  *,
  (select typname from pg_type where pg_type.oid=pg_proc.prorettype limit 1)
from pg_proc offset 1500 limit 1;
"Limit  (cost=8983.31..8989.30 rows=1 width=365) (actual
time=17.648..17.649 rows=1 loops=1)"
"  ->  Seq Scan on pg_proc  (cost=0.00..13486.95 rows=2252 width=365)
(actual time=0.100..16.851 rows=1501 loops=1)"
"        SubPlan"
"          ->  Limit  (cost=0.00..5.91 rows=1 width=64) (actual
time=0.006..0.007 rows=1 loops=1501)"
"                ->  Index Scan using pg_type_oid_index on pg_type
(cost=0.00..5.91 rows=1 width=64) (actual time=0.004..0.004 rows=1
loops=1501)"
"                      Index Cond: (oid = $0)"
"Total runtime: 17.784 ms"

We see that in the 2nd example Postgres starts with "Index Scan using
pg_type_oid_index" (1501 iterations!). My understanding of SQL says me
that the simplest (and, in this case - and probably in *most* cases -
fastest) way to perform such queries is to start from main SELECT and
then, when we already have rows from "main" table, perform "lite"
sub-selects. So, I expected  smth near 2.156 ms + 0.079 ms, but obtain
17.784 ms... For large table this is killing behaviour.

What should I do to make Postgres work properly in such cases (I have
a lot of similar queries; surely, they are executed w/o seqscans, but
overall picture is the same - I see that starting from sub-selects
dramatically decrease performance)?

--
Best regards,
Nikolay

Re: Query plan for "heavy" SELECT with "lite" sub-SELECTs

From

Richard Huxton

Date:

02 November 2006, 08:38:50

Nikolay Samokhvalov wrote:
> 2. explain analyze select
>  *,
>  (select typname from pg_type where pg_type.oid=pg_proc.prorettype limit 1)
> from pg_proc offset 1500 limit 1;
> "Limit  (cost=8983.31..8989.30 rows=1 width=365) (actual
> time=17.648..17.649 rows=1 loops=1)"
> "  ->  Seq Scan on pg_proc  (cost=0.00..13486.95 rows=2252 width=365)
> (actual time=0.100..16.851 rows=1501 loops=1)"
> "        SubPlan"
> "          ->  Limit  (cost=0.00..5.91 rows=1 width=64) (actual
> time=0.006..0.007 rows=1 loops=1501)"
> "                ->  Index Scan using pg_type_oid_index on pg_type
> (cost=0.00..5.91 rows=1 width=64) (actual time=0.004..0.004 rows=1
> loops=1501)"
> "                      Index Cond: (oid = $0)"
> "Total runtime: 17.784 ms"
>
> We see that in the 2nd example Postgres starts with "Index Scan using
> pg_type_oid_index" (1501 iterations!).

No, what you see here is that the inner loop is the index-scan over
pg_type_oid. It's running a sequential scan on pg_proc and then runs
1501 index scans against pg_type.

> My understanding of SQL says me
> that the simplest (and, in this case - and probably in *most* cases -
> fastest) way to perform such queries is to start from main SELECT and
> then, when we already have rows from "main" table, perform "lite"
> sub-selects. So, I expected  smth near 2.156 ms + 0.079 ms, but obtain
> 17.784 ms... For large table this is killing behaviour.

You've forgotten about the cost of matching up the two sets of rows.
Now, if the first part of the query outputs only one row then you might
be right, but I'm not sure that the SQL standard allows the subquery to
be delayed to that stage without explicitly organising the query that
way. From memory, the OFFSET/LIMIT takes place at the very end of the
query processing.

> What should I do to make Postgres work properly in such cases (I have
> a lot of similar queries; surely, they are executed w/o seqscans, but
> overall picture is the same - I see that starting from sub-selects
> dramatically decrease performance)?

Do you have a real example? That might be more practical.

--
   Richard Huxton
   Archonet Ltd

Re: Query plan for "heavy" SELECT with "lite" sub-SELECTs

From

"Dave Dutcher"

Date:

02 November 2006, 10:25:50

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> Nikolay Samokhvalov
>
> What should I do to make Postgres work properly in such cases (I have
> a lot of similar queries; surely, they are executed w/o seqscans, but
> overall picture is the same - I see that starting from sub-selects
> dramatically decrease performance)?

How about this:

explain analyze
select (select typname from pg_type where pg_type.oid=mainq.prorettype limit
1)
from (select * from pg_proc offset 1500 limit 1) mainq;

                                                                QUERY PLAN
----------------------------------------------------------------------------
-------------------------------------------------
 Subquery Scan mainq  (cost=50.99..56.85 rows=1 width=4) (actual
time=13.646..13.659 rows=1 loops=1)
   ->  Limit  (cost=50.99..51.02 rows=1 width=310) (actual
time=13.575..13.579 rows=1 loops=1)
         ->  Seq Scan on pg_proc  (cost=0.00..62.34 rows=1834 width=310)
(actual time=0.014..7.297 rows=1501 loops=1)
   SubPlan
     ->  Limit  (cost=0.00..5.82 rows=1 width=64) (actual time=0.038..0.043
rows=1 loops=1)
           ->  Index Scan using pg_type_oid_index on pg_type
(cost=0.00..5.82 rows=1 width=64) (actual time=0.028..0.028 rows=1 loops=1)
                 Index Cond: (oid = $0)
 Total runtime: 13.785 ms

I would expect you to get closer to 2 ms on that query.  My machine takes 13
ms to do just the seq scan of pg_proc.

Dave

Re: Query plan for "heavy" SELECT with "lite" sub-SELECTs

From

Alvaro Herrera

Date:

02 November 2006, 15:16:11

Dave Dutcher wrote:
> > -----Original Message-----
> > From: pgsql-performance-owner@postgresql.org
> > Nikolay Samokhvalov
> >
> > What should I do to make Postgres work properly in such cases (I have
> > a lot of similar queries; surely, they are executed w/o seqscans, but
> > overall picture is the same - I see that starting from sub-selects
> > dramatically decrease performance)?
>
> How about this:
>
> explain analyze
> select (select typname from pg_type where pg_type.oid=mainq.prorettype limit
> 1)
> from (select * from pg_proc offset 1500 limit 1) mainq;

What's the use of such a query?  One would think that in the real world,
you'd at least have an ORDER BY somewhere in the subqueries.

Performance analysis of strange queries is useful, but the input queries
have to be meaningful as well.  Otherwise you end up optimizing bizarre
and useless cases.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Query plan for "heavy" SELECT with "lite" sub-SELECTs

From

Arjen van der Meijden

Date:

03 November 2006, 05:46:22

Alvaro Herrera wrote:
> Performance analysis of strange queries is useful, but the input queries
> have to be meaningful as well.  Otherwise you end up optimizing bizarre
> and useless cases.
>

I had a similar one a few weeks ago. I did some batch-processing over a
bunch of documents and discovered postgresql was faster if I let it
process just 1000 documents, in stead of all 45000 at the same time. But
with 1000 it was faster than 1000x one document.

So I started with a query like:
SELECT docid, (SELECT work to be done for each document)
FROM documents
ORDER BY docid
LIMIT 1000
OFFSET ?

And I noticed the 44th iteration was much slower than the first.

Rewriting it to something like this made the last iteration about as
fast as the first:
SELECT docid, (SELECT work to be done for each document)
FROM documents
WHERE docid IN (SELECT docid FROM documents
    ORDER BY docid
    LIMIT 1000
    OFFSET ?
)

I know something like that isn't very set-based thinking, but then again
the query's structure did come from a iterative algoritm, but turned out
to be faster (less query-overhead) and easier to scale in PostgreSQL.
I've tried a few more set-like structures, but those were all slower
than this aproach probably because they would be were a little more
complex. Some of them took more than 10x the amount of time...

Another real-life example would be to display the amount of replies to a
topic in a topic listing of a forum or the name of the author of the
last message. You probably don't want to count all the replies for each
topic if you're only going to display headings 100 - 200.
And there are a few more examples to think of where a join+group by
isn't going to work, but a subquery in the selectlist just does what you
want.
Of course most of the time you won't be using a OFFSET then.

Best regards,

Arjen

Re: Query plan for "heavy" SELECT with "lite" sub-SELECTs

From

Tom Lane

Date:

03 November 2006, 10:51:04

Arjen van der Meijden <acmmailing@tweakers.net> writes:
> ... Rewriting it to something like this made the last iteration about as
> fast as the first:

> SELECT docid, (SELECT work to be done for each document)
> FROM documents
> WHERE docid IN (SELECT docid FROM documents
>     ORDER BY docid
>     LIMIT 1000
>     OFFSET ?
> )

The reason for this, of course, is that the LIMIT/OFFSET filter is the
last step in a query plan --- it comes *after* computation of the SELECT
output list.  (So does ORDER BY, if an explicit sort step is needed.)
So if you have an expensive-to-compute output list, a trick like Arjen's
will help.  I don't think you can use an "IN" though, at least not if
you want to preserve the sort ordering in the final result.

            regards, tom lane