Re: limit clause breaks query planner? - Mailing list pgsql-performance

From David West
Subject Re: limit clause breaks query planner?
Date
Msg-id 004101c90c41$3a754020$af5fc060$@west@cusppoint.com
In response to Re: limit clause breaks query planner?  ("Pavel Stehule" <pavel.stehule@gmail.com>)
List pgsql-performance
Thanks for your suggestion but the result is the same.

Here is the explain analyse output from different queries.
Select * from my_table where A is null and B = '21' limit 15

"Limit  (cost=0.00..3.68 rows=15 width=128) (actual time=85837.043..85896.140 rows=15 loops=1)"
"  ->  Seq Scan on my_table this_  (cost=0.00..258789.88 rows=1055580 width=128) (actual time=85837.038..85896.091
rows=15loops=1)" 
"        Filter: ((A IS NULL) AND ((B)::text = '21'::text))"
"Total runtime: 85896.214 ms"

As you can see the estimated cost was 3.68: a long way from the true value.
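If I understand the cost model right, the planner scales the seq scan's total cost by the fraction of rows the LIMIT needs, on the assumption that matching rows are spread evenly through the table.  Plugging in the numbers from the plan above (illustrative arithmetic only):

select 258789.88 * 15 / 1055580;  -- comes out at roughly 3.68, the Limit node's estimate

So on paper the limited seq scan looks almost free; it's the even-spread assumption that goes badly wrong for my data.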

Doing 'set enable_seqscan=false' and repeating the select:
"Limit  (cost=0.00..5.58 rows=15 width=128) (actual time=4426.438..4426.834 rows=15 loops=1)"
"  ->  Index Scan using idx_A on my_table this_  (cost=0.00..392956.76 rows=1055970 width=128) (actual
time=4426.433..4426.768rows=15 loops=1)" 
"        Index Cond: (A IS NULL)"
"        Filter: ((B)::text = '21'::text)"
"Total runtime: 4426.910 ms"

Probably some caching made this query faster, but it's still too slow, and it's still using the wrong index.
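(In case anyone wants to reproduce this, the toggle can be confined to a single transaction rather than the whole session; just a sketch:

begin;
set local enable_seqscan = off;
explain analyse select * from my_table where A is null and B = '21' limit 15;
rollback;

Incidentally, the 5.58 estimate looks like the same scaling at work: 392956.76 * 15 / 1055970 is about 5.58, which is why the planner thinks the index scan on A is nearly as cheap as the seq scan.)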

Deleting index A gives:
"Limit  (cost=0.00..56.47 rows=15 width=128) (actual time=10.298..10.668 rows=15 loops=1)"
"  ->  Index Scan using idx_B on my_table this_  (cost=0.00..3982709.15 rows=1057960 width=128) (actual
time=10.293..10.618rows=15 loops=1)" 
"        Index Cond: ((B)::text = '21'::text)"
"        Filter: (A IS NULL)"
"Total runtime: 10.735 ms"
Much better.  However, I need index A for another query, so I can't just delete it.
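(For what it's worth, DROP INDEX is transactional in Postgres, so the delete-and-test above can be wrapped in a transaction and rolled back, and the index never really goes away.  A rough sketch, using the index name shown in the plan:

begin;
drop index idx_A;   -- may need quoting if the index was created with a mixed-case name
explain analyse select * from my_table where A is null and B = '21' limit 15;
rollback;           -- idx_A is back as if nothing happened

Note that the drop holds an exclusive lock on the table until the rollback, so this only makes sense as a quick test.)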

Looking at the estimated costs, you can see why it's choosing the plan it does, but the estimates just don't seem to reflect reality at all.

Now here's the result for the query without the limit clause, with both indexes in place and sequential scan enabled:
Select * from my_table where A is null and B = '21'
"Bitmap Heap Scan on my_table this_  (cost=20412.89..199754.37 rows=1060529 width=128) (actual time=470.772..7432.062
rows=1020062loops=1)" 
"  Recheck Cond: ((B)::text = '21'::text)"
"  Filter: (A IS NULL)"
"  ->  Bitmap Index Scan on idx_B (cost=0.00..20147.76 rows=1089958 width=0) (actual time=466.545..466.545 rows=1020084
loops=1)"
"        Index Cond: ((B)::text = '21'::text)"
"Total runtime: 8940.119 ms"

In this case it goes for the correct index.  It appears that the query planner makes very simplistic assumptions when it comes to LIMIT?
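To put numbers on that assumption: with roughly 10 million rows in the table and about 1.06 million of them matching, the planner expects the matches to be spread evenly, so it reckons the LIMIT will be satisfied after reading only about

select 15 * 10000000.0 / 1055580;  -- roughly 142 rows

of the table.  When the matching rows actually sit much later in the table, as they seem to here, that estimate is off by orders of magnitude.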

Thanks
David

-----Original Message-----
From: Pavel Stehule [mailto:pavel.stehule@gmail.com]
Sent: 01 September 2008 13:53
To: David West
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] limit clause breaks query planner?

Hello

you should use a partial index

create index foo on mytable (b) where a is null;

regards
Pavel Stehule

2008/9/1 David West <david.west@cusppoint.com>:
> Hi,
>
>
>
> I have a single table with about 10 million rows, and two indexes.  Index A
> is on a column A with 95% null values.  Index B is on a column B with about
> 10 values, i.e. about a million rows of each value.
>
>
>
> When I do a simple query on the table (no joins) with the following
> condition:
>
> A is null AND
>
> B = '21'
>
>
>
> it uses the correct index, index B.  However, when I add a limit clause of
> 15, postgres decides to do a sequential scan :s.  Looking at the results
> from explain:
>
>
>
> "Limit  (cost=0.00..3.69 rows=15 width=128)"
>
> "  ->  Seq Scan on my_table this_  (cost=0.00..252424.24 rows=1025157
> width=128)"
>
> "        Filter: ((A IS NULL) AND ((B)::text = '21'::text))"
>
>
>
> It appears that postgres is (very incorrectly) assuming that it will only
> have to retrieve 15 rows on a sequential scan, and gives a total cost of
> 3.69.  In reality, it has to scan many rows forward until it finds the
> correct value, yielding very poor performance for my table.
>
>
>
> If I disable sequential scan (set enable_seqscan=false) it then incorrectly
> uses the index A that has 95% null values: it seems to incorrectly apply the
> same logic again that it will only have to retrieve 15 rows with the limit
> clause, and thinks that the index scan using A is faster than index scan B.
>
>
>
> Only by deleting the index on A and disabling sequential scan will it use
> the correct index, which is of course by far the fastest.
>
>
>
> Is there an assumption in the planner that a limit of 15 will mean that
> postgres will only have to read 15 rows?  If so is this a bad assumption?
> If a particular query is faster without a limit, then surely it will also be
> faster with the limit.
>
>
>
> Any workarounds for this?
>
>
>
> Thanks
>
> David

