Thread: Select performance variation based on the different combinations of using where lower(), order by, and limit

I have a table that has around 8 million rows.  The table has 71 columns and 33 indexes.

The relevant indexes are:
"callingpartynumber" btree ("CallingPartyNumber")
"callingpartynumber_lower" btree (lower("CallingPartyNumber") text_pattern_ops)

My question is about the explain analyze output, below.  In my desired query, I want to use (1) a where clause (with a call to lower() on the column name), (2) order by, and (3) limit.  When I use all three, my query is slow, but if I use 2 of the 3 clauses, the query is fast.

case 1: where clause with lower(), and order by
case 2: where clause without lower(), order by, and limit
case 3: where clause with lower(), and limit
case 4: where clause with lower(), order by, and limit

I don't understand why case 4 is so much slower than the other three cases.  It isn't using the callingpartynumber_lower index, even though the only difference between case 4 and case 1 is the limit 100 clause.  If I use limit 1, there is no difference.


case 1:
mydb=> explain analyze SELECT * FROM "cdr" WHERE lower("CallingPartyNumber") = '9725551212' order by "key";
                                                                  QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=128701.67..128804.70 rows=41212 width=757) (actual time=0.425..0.495 rows=11 loops=1)
   Sort Key: key
   Sort Method:  quicksort  Memory: 30kB
   ->  Bitmap Heap Scan on cdr  (cost=916.83..111735.11 rows=41212 width=757) (actual time=0.118..0.246 rows=11 loops=1)
         Recheck Cond: (lower("CallingPartyNumber") = '9725551212'::text)
         ->  Bitmap Index Scan on callingpartynumber_lower  (cost=0.00..906.53 rows=41212 width=0) (actual time=0.083..0.083 rows=11 loops=1)
               Index Cond: (lower("CallingPartyNumber") = '9725551212'::text)
 Total runtime: 0.830 ms
(8 rows)


case 2:
mydb=> explain analyze SELECT * FROM "cdr" WHERE "CallingPartyNumber" = '9725551212' order by "key" limit 100;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=958.12..958.37 rows=100 width=757) (actual time=3.741..4.035 rows=11 loops=1)
   ->  Sort  (cost=958.12..958.72 rows=240 width=757) (actual time=3.723..3.834 rows=11 loops=1)
         Sort Key: key
         Sort Method:  quicksort  Memory: 30kB
         ->  Bitmap Heap Scan on cdr  (cost=7.30..948.94 rows=240 width=757) (actual time=3.425..3.553 rows=11 loops=1)
               Recheck Cond: ("CallingPartyNumber" = '9725551212'::text)
               ->  Bitmap Index Scan on callingpartynumber  (cost=0.00..7.24 rows=240 width=0) (actual time=3.385..3.385 rows=11 loops=1)
                     Index Cond: ("CallingPartyNumber" = '9725551212'::text)
 Total runtime: 4.550 ms
(9 rows)


case 3:
mydb=> explain analyze SELECT * FROM "cdr" WHERE lower("CallingPartyNumber") = '9725551212' limit 100;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..385.54 rows=100 width=757) (actual time=0.079..0.343 rows=11 loops=1)
   ->  Index Scan using callingpartynumber_lower on cdr  (cost=0.00..158886.65 rows=41212 width=757) (actual time=0.059..0.177 rows=11 loops=1)
         Index Cond: (lower("CallingPartyNumber") = '9725551212'::text)
 Total runtime: 0.687 ms
(4 rows)


case 4:
mydb=> explain analyze SELECT * FROM "cdr" WHERE lower("CallingPartyNumber") = '9725551212' order by "key" limit 100;
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..72882.05 rows=100 width=757) (actual time=20481.083..30464.960 rows=11 loops=1)
   ->  Index Scan using cdr_pkey on cdr  (cost=0.00..30036152.32 rows=41212 width=757) (actual time=20481.049..30464.686 rows=11 loops=1)
         Filter: (lower("CallingPartyNumber") = '9725551212'::text)
 Total runtime: 30465.246 ms
(4 rows)

Tyler Reese <jukey91@gmail.com> wrote:

> I don't understand why the performance of case 4 is so much slower

>case 4:
>mydb=> explain analyze SELECT * FROM "cdr" WHERE lower("CallingPartyNumber") = '9725551212' order by "key" limit 100;

> Limit  (cost=0.00..72882.05 rows=100 width=757) (actual time=20481.083..30464.960 rows=11 loops=1)
>   ->  Index Scan using cdr_pkey on cdr  (cost=0.00..30036152.32 rows=41212 width=757) (actual time=20481.049..30464.686 rows=11 loops=1)
>         Filter: (lower("CallingPartyNumber") = '9725551212'::text)
> Total runtime: 30465.246 ms

It thinks that it will only need to read 1/412th of the table to
find 100 matching rows, and using that index it will be able to
skip the sort.  Since there aren't 100 matching rows, it has to
read the whole table through the index.  Raising the statistics
target and running ANALYZE might allow it to use a more accurate
estimate, and thereby make a better choice.
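
For example, a rough sketch (the 500 is an arbitrary bump, not a recommendation; since the expression index's "column" can't be addressed through ALTER TABLE, its sample size follows default_statistics_target):

ALTER TABLE "cdr" ALTER COLUMN "CallingPartyNumber" SET STATISTICS 500;
SET default_statistics_target = 500;  -- picked up by the expression index during the next ANALYZE in this session
ANALYZE "cdr";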

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


So, the fact that it thinks it needs to read 1/412th of the table is the reason why the query planner chooses to use the primary key index instead of the callingpartynumber index, like it does in the first 3 cases?  I'm curious as to why it says "rows=41212".  Is that the estimate of the number of rows that meet the filter condition?  Where does that come from?

I haven't heard of raising the statistics target, so I'll read up on that.  A few days ago, all 4 cases were responding equally fast.  I had been messing around with the postgres settings, and I went and dropped all of the indexes and recreated them just to see what would happen.  I wouldn't think that recreating the indexes would cause case 4 to go slow, but that's the symptom I am seeing now.  Should I be running analyze on a table after it has been reindexed?



Tyler Reese <jukey91@gmail.com> wrote:
> Kevin Grittner <kgrittn@ymail.com> wrote:
>> Tyler Reese <jukey91@gmail.com> wrote:

>>> mydb=> explain analyze SELECT * FROM "cdr" WHERE lower("CallingPartyNumber") = '9725551212' order by "key" limit 100;
>>>
>>> Limit  (cost=0.00..72882.05 rows=100 width=757) (actual time=20481.083..30464.960 rows=11 loops=1)
>>>   ->  Index Scan using cdr_pkey on cdr  (cost=0.00..30036152.32 rows=41212 width=757) (actual time=20481.049..30464.686 rows=11 loops=1)
>>>         Filter: (lower("CallingPartyNumber") = '9725551212'::text)
>>> Total runtime: 30465.246 ms
>>
>> It thinks that it will only need to read 1/412th of the table to
>> find 100 matching rows, and using that index it will be able to
>> skip the sort.  Since there aren't 100 matching rows, it has to
>> read the whole table through the index.
 
> So, the fact that it thinks it needs to read 1/412th of the table
> is the reason why the query planner chooses to use the primary key
> index instead of the callingpartynumber index, like it does in the
> first 3 cases?

The optimizer compares the estimated cost of reading all matching
rows (but only matching rows) out of order and then sorting them to
the estimated cost of reading them in order and filtering out the
rows that don't match (and stopping when the limit is reached).
Since it thought a lot of rows would match, that made the sort look
more expensive and also like it would not need to read a very large
percentage of the table.
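
If you want to see the plan it is weighing that against, one way (in a throwaway session, purely for diagnosis) is to take the ordered scan off the table and re-run the explain; something like:

BEGIN;
SET LOCAL enable_indexscan = off;  -- hides the "walk cdr_pkey in key order" plan; bitmap scans stay available
EXPLAIN ANALYZE SELECT * FROM "cdr"
 WHERE lower("CallingPartyNumber") = '9725551212'
 ORDER BY "key" LIMIT 100;
ROLLBACK;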

> I'm curious as to why it says "rows=41212".  Is that the estimate
> of the number of rows that meet the filter condition?  Where does
> that come from?

That's based on the distribution observed in the last random sample
when ANALYZE was run (as a command or by autovacuum).  When there
is an error that bad in the estimate, either autovacuum is not
configured to be aggressive enough in analyzing or the sample size
was not large enough.
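
You can see what the last ANALYZE recorded for that expression by querying pg_stats; the row for an expression index shows up under the index's name rather than the table's, so roughly:

SELECT tablename, attname, n_distinct, most_common_vals, most_common_freqs
  FROM pg_stats
 WHERE tablename = 'callingpartynumber_lower';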

>>  Raising the statistics target and running ANALYZE might allow
>> it to use a more accurate estimate, and thereby make a better
>> choice.
>
> I haven't heard of raising the statistics target, so I'll read up
> on that.

http://www.postgresql.org/docs/9.2/interactive/runtime-config-query.html#RUNTIME-CONFIG-QUERY-OTHER

http://www.postgresql.org/docs/9.2/interactive/sql-altertable.html

> A few days ago, all 4 cases were responding equally fast. I had
> been messing around with the postgres settings, and I went and
> dropped all of the indexes and recreated them just to see what
> would happen.  I wouldn't think that recreating the indexes would
> cause case 4 to go slow, but that's the symptom I am seeing now.
> Should I be running analyze on a table after it has been
> reindexed?

Only if you have indexes on expressions rather than simple column
names.  If you have an index on lower("CallingPartyNumber") that
would qualify as an expression.

The normal reason to need fresh statistics is a change in the
distribution of values in a column, particularly after a bulk
load.  Also, columns with a large number of distinct values tend
to benefit from a higher statistics target.  The downsides of
higher statistics targets are a longer ANALYZE time and increased
planning time, so it's generally best to use the default
except where a particular problem has been observed, like in this
case.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



I performed ANALYZE on the table and now case 4 is fast again:

mydb=> explain analyze SELECT * FROM "cdr" WHERE lower("CallingPartyNumber") = '9725551212' order by "key" limit 100;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=958.72..958.97 rows=100 width=758) (actual time=0.251..0.310 rows=11 loops=1)
   ->  Sort  (cost=958.72..959.32 rows=240 width=758) (actual time=0.244..0.266 rows=11 loops=1)
         Sort Key: key
         Sort Method:  quicksort  Memory: 30kB
         ->  Bitmap Heap Scan on cdr  (cost=7.30..949.55 rows=240 width=758) (actual time=0.105..0.162 rows=11 loops=1)
               Recheck Cond: (lower("CallingPartyNumber") = '9725551212'::text)
               ->  Bitmap Index Scan on callingpartynumber_lower  (cost=0.00..7.24 rows=240 width=0) (actual time=0.085..0.085 rows=11 loops=1)
                     Index Cond: (lower("CallingPartyNumber") = '9725551212'::text)
 Total runtime: 0.517 ms
(9 rows)

Thanks for the help, Kevin.

On Sun, Aug 18, 2013 at 4:46 PM, Tyler Reese <jukey91@gmail.com> wrote:

> I haven't heard of raising the statistics target, so I'll read up on that.
> A few days ago, all 4 cases were responding equally fast.  I had been
> messing around with the postgres settings, and I went and dropped all of the
> indexes and recreated them just to see what would happen.  I wouldn't think
> that recreating the indexes would cause case 4 to go slow, but that's the
> symptom I am seeing now.  Should I be running analyze on a table after it
> has been reindexed?

PostgreSQL keeps statistics on the table's columns with the table, and
they survive a re-index.  But the "column" used by the function-based
index is not a real table column.  Those statistics are kept with the
index, not the table, and they do not survive the re-index.  So you
should analyze the table in order to reacquire those statistics.  Since
the problem is that you no longer had statistics at all for that
"column", there is probably no need to increase the statistics target;
just doing the ANALYZE should get you back in business.

Arguably PostgreSQL's autovacuum logic should be better about dealing
with expression-based indices.  But for now, a manual analyze is
needed when a new expression-based index is created, or when an
existing one is re-indexed.
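
Concretely, that manual step is just the plain command you already ran; it recomputes statistics for the table's columns and for any expression-index "columns" such as lower("CallingPartyNumber"):

ANALYZE "cdr";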

Cheers,

Jeff