LIMIT causes SEQSCAN in subselect - Mailing list pgsql-performance

From Mike Rylander
Subject LIMIT causes SEQSCAN in subselect
Date
Msg-id b918cf3d04121010405e6c963c@mail.gmail.com
Whole thread Raw
Responses Re: LIMIT causes SEQSCAN in subselect
List pgsql-performance
First off, WOO-HOO!  The lists are back and I can finally get my PG
fix!!!  Now, on to the business at hand...

I have four query plans below, the first two help explain my question,
and the last two are about a speculative alternative.  The first query
use a subselects that are generated from a middleware layer and are
then joined to create the final result set.  In order to keep total
execution time down, the middleware imposes a LIMIT clause on each.

I'm using the fact that Postgres can elevate a subselect-join to a
simple join when there are no aggregates involved and I think I
remember there has been some work recently on elevating subselects
that contain a LIMIT, so I went back and ran the plans without the
LIMITs to see what would happen.  Well, the limit killed the subselect
elevation.  I'm not too worried about the relative execution times
since it's very fast, but more curious about the plan that was chosen.

It seems that the planner knows that the results from subselect 'b'
will contain just one row due to the fact that the index it is
scanning is unique.  Would it not make sense to discard the LIMIT
clause on that subselect?  That would result in the third plan, which
has better performance than the generated query, and is guaranteed to
return the same results since the index in use is unique.  Also,
wouldn't it make sense for subselect 'a' to be elevated sans LIMIT
just to see if there is a unique index it might be able to use?

I realize this is a rather specialized case and not really great form.
 But because PG can, in some cases, elevate subselects, writing
middleware to join results becomes pretty easy.  Just a matter of
defining result sets independently, and creating a simple wrapper to
join them.

In any case, I will probably end up just detecting the subselect
condition in the middleware and drop the limit when there are some
WHERE clauses on the inner query.  I just thought I'd bring up a
possible optimization for the future, and was curious what the gurus
might have to say!



--  Version info and queries in question.

oils4=# select version();

version

---------------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.0.0beta4 on x86_64-unknown-linux-gnu, compiled by GCC
gcc (GCC) 3.3.4 20040623 (Gentoo Linux 3.3.4-r1, ssp-3.3.2-2,
pie-8.7.6)
(1 row)


-- query 1:  the query generated by middleware

oils4=# EXPLAIN ANALYZE select a.record, b.control from (select * from
biblio.record where id = 100000 limit 1000) b, (select * from
biblio.metarecord_field_entry limit 1000) a where a.source = b.id;

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=3.68..44.49 rows=5 width=40) (actual
time=2.066..2.066 rows=0 loops=1)
   Hash Cond: ("outer".source = "inner".id)
   ->  Subquery Scan a  (cost=0.00..35.75 rows=1000 width=16) (actual
time=0.005..1.295 rows=1000 loops=1)
         ->  Limit  (cost=0.00..25.75 rows=1000 width=87) (actual
time=0.003..0.641 rows=1000 loops=1)
               ->  Seq Scan on metarecord_field_entry
(cost=0.00..43379.75 rows=1684575 width=87) (actual time=0.003..0.435
rows=1000 loops=1)
   ->  Hash  (cost=3.68..3.68 rows=1 width=40) (actual
time=0.039..0.039 rows=0 loops=1)
         ->  Subquery Scan b  (cost=0.00..3.68 rows=1 width=40)
(actual time=0.031..0.033 rows=1 loops=1)
               ->  Limit  (cost=0.00..3.67 rows=1 width=1070) (actual
time=0.029..0.030 rows=1 loops=1)
                     ->  Index Scan using biblio_record_pkey on record
 (cost=0.00..3.67 rows=1 width=1070) (actual time=0.027..0.028 rows=1
loops=1)
                           Index Cond: (id = 100000)
 Total runtime: 2.171 ms
(11 rows)


-- query 2: the fast query, no limit allows elevation of subselects

oils4=# EXPLAIN ANALYZE select a.record, b.control from (select * from
biblio.record where id = 100000) b, (select * from
biblio.metarecord_field_entry) a where a.source = b.id;

     QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..19.95 rows=9 width=22) (actual
time=0.043..0.055 rows=7 loops=1)
   ->  Index Scan using biblio_record_pkey on record  (cost=0.00..3.67
rows=1 width=22) (actual time=0.025..0.026 rows=1 loops=1)
         Index Cond: (id = 100000)
   ->  Index Scan using metarecord_field_entry_source_idx on
metarecord_field_entry  (cost=0.00..16.19 rows=9 width=16) (actual
time=0.011..0.018 rows=7 loops=1)
         Index Cond: (source = 100000)
 Total runtime: 0.101 ms
(6 rows)



-- query 3:  if we were to drop the limit, since we're using a unique index

oils4=# EXPLAIN ANALYZE select a.record, b.control from (select * from
biblio.record where id = 100000) b, (select * from
biblio.metarecord_field_entry limit 1000) a where a.source = b.id;

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..41.97 rows=5 width=22) (actual
time=1.169..1.169 rows=0 loops=1)
   ->  Index Scan using biblio_record_pkey on record  (cost=0.00..3.67
rows=1 width=22) (actual time=0.036..0.038 rows=1 loops=1)
         Index Cond: (id = 100000)
   ->  Subquery Scan a  (cost=0.00..38.25 rows=5 width=16) (actual
time=1.126..1.126 rows=0 loops=1)
         Filter: (source = 100000)
         ->  Limit  (cost=0.00..25.75 rows=1000 width=87) (actual
time=0.005..0.673 rows=1000 loops=1)
               ->  Seq Scan on metarecord_field_entry
(cost=0.00..43379.75 rows=1684575 width=87) (actual time=0.004..0.424
rows=1000 loops=1)
 Total runtime: 1.243 ms
(8 rows)



-- query 4:  what I would like the seqscan in query 3 to become...

oils4=# EXPLAIN ANALYZE select * from biblio.metarecord_field_entry
where source = 100000 limit 1000;

     QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..16.19 rows=9 width=87) (actual time=0.026..0.035
rows=7 loops=1)
   ->  Index Scan using metarecord_field_entry_source_idx on
metarecord_field_entry  (cost=0.00..16.19 rows=9 width=87) (actual
time=0.025..0.032 rows=7 loops=1)
         Index Cond: (source = 100000)
 Total runtime: 0.069 ms
(4 rows)

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org

pgsql-performance by date:

Previous
From: Christopher Kings-Lynne
Date:
Subject: Slow insert
Next
From: Josh Berkus
Date:
Subject: Re: scaling beyond 4 processors