Thread: [PERFORM] Number of characters in column preventing index usage

[PERFORM] Number of characters in column preventing index usage

From
Hustler DBA
Date:
I am seeing some strange behavior, and I don't know if this is by design in Postgres.

I have an index on a column which is defined as "character varying(255)". When the value I am searching for is of a certain length, the optimizer uses the index, but when the value is long, the optimizer doesn't use the index and instead does a seq scan on the table. Is this by design? How can I make the optimizer use the index no matter what the size/length of the value being searched for?


PostgreSQL version: 9.4 


my_db=# explain (analyze, buffers) select count(*) from tab where ID = '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea' ;
                                                QUERY PLAN                                                 
-----------------------------------------------------------------------------------------------------------
 Aggregate  (cost=14.80..14.81 rows=1 width=0) (actual time=0.114..0.114 rows=1 loops=1)
   Buffers: shared hit=12
   ->  Seq Scan on tab  (cost=0.00..14.79 rows=5 width=0) (actual time=0.025..0.109 rows=5 loops=1)
         Filter: ((ID)::text = '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea'::text)
         Rows Removed by Filter: 218
         Buffers: shared hit=12
 Planning time: 0.155 ms
 Execution time: 0.167 ms
(8 rows)

my_db=# create index tab_idx1 on tab(ID);                                                               
CREATE INDEX
my_db=# explain (analyze, buffers) select count(*) from tab where ID = '01625cfa-2bf8-45cf' ;                  
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=8.29..8.30 rows=1 width=0) (actual time=0.048..0.048 rows=1 loops=1)
   Buffers: shared read=2
   ->  Index Only Scan using tab_idx1 on tab  (cost=0.27..8.29 rows=1 width=0) (actual time=0.043..0.043 rows=0 loops=1)
         Index Cond: (ID = '01625cfa-2bf8-45cf'::text)
         Heap Fetches: 0
         Buffers: shared read=2
 Planning time: 0.250 ms
 Execution time: 0.096 ms
(8 rows)

my_db=# explain (analyze, buffers) select count(*) from tab where ID = '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea' ;
                                                QUERY PLAN                                                 
-----------------------------------------------------------------------------------------------------------
 Aggregate  (cost=14.80..14.81 rows=1 width=0) (actual time=0.115..0.115 rows=1 loops=1)
   Buffers: shared hit=12
   ->  Seq Scan on tab  (cost=0.00..14.79 rows=5 width=0) (actual time=0.031..0.108 rows=5 loops=1)
         Filter: ((ID)::text = '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea'::text)
         Rows Removed by Filter: 218
         Buffers: shared hit=12
 Planning time: 0.122 ms
 Execution time: 0.180 ms
(8 rows)

my_db=# 



Re: [PERFORM] Number of characters in column preventing index usage

From
"David G. Johnston"
Date:
On Fri, Feb 17, 2017 at 3:19 PM, Hustler DBA <hustlerdba@gmail.com> wrote:

my_db=# create index tab_idx1 on tab(ID);                                                               
CREATE INDEX
my_db=# explain (analyze, buffers) select count(*) from tab where ID = '01625cfa-2bf8-45cf' ;                  
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=8.29..8.30 rows=1 width=0) (actual time=0.048..0.048 rows=1 loops=1)
   Buffers: shared read=2
   ->  Index Only Scan using tab_idx1 on tab  (cost=0.27..8.29 rows=1 width=0) (actual time=0.043..0.043 rows=0 loops=1)
         Index Cond: (ID = '01625cfa-2bf8-45cf'::text)

 
   ->  Seq Scan on tab  (cost=0.00..14.79 rows=5 width=0) (actual time=0.031..0.108 rows=5 loops=1)
         Filter: ((ID)::text = '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea'::text)
         Rows Removed by Filter: 218
         Buffers: shared hit=12
 Planning time: 0.122 ms
 Execution time: 0.180 ms
(8 rows)

IIRC the only reason the first query cares to use the index is because it can perform an Index Only Scan and thus avoid touching the heap at all.  If it cannot avoid touching the heap, the planner is going to just use a sequential scan to retrieve the records directly from the heap and save the index lookup step.
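A quick way to check this idea (a sketch only, using the table, index, and values from this thread against a scratch database) is to make sure the visibility map is current and then look at the Heap Fetches line in the plan:

```sql
-- Index-only scans rely on the visibility map; VACUUM updates it.
VACUUM ANALYZE tab;

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM tab WHERE ID = '01625cfa-2bf8-45cf';
-- "Heap Fetches: 0" in the output means the scan never touched the heap.
```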

David J.

Re: [PERFORM] Number of characters in column preventing index usage

From
Tomas Vondra
Date:
Hi,

On 02/17/2017 11:19 PM, Hustler DBA wrote:
> I am seeing this strange behavior, I don't know if this is by design by
> Postgres.
>
> I have an index on a column which is defined as "character
> varying(255)". When the value I am searching for is of a certain length,
> the optimizer uses the index but when the value is long, the optimizer
> doesn't use the index but does a seq scan on the table. Is this by
> design? How can I make the optimizer use the index no matter what the
> size/length of the value being searched for?
>

AFAIK there are no such checks, i.e. the optimizer does not consider the
length of the value when deciding between scan types.

>
> PostgreSQL version: 9.4
>

That's good to know, but we also need information about the table
involved in your queries. I'd bet the table is tiny (it seems to be just
12 pages, so ~100kB), making the indexes rather useless.
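For reference, a sketch of how to confirm the table's on-disk size and the planner's view of it (table name taken from the thread):

```sql
-- Actual size, plus the page/row counts the planner works from.
SELECT pg_size_pretty(pg_relation_size('tab')) AS table_size,
       relpages,
       reltuples
FROM pg_class
WHERE relname = 'tab';
```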

> my_db=# explain (analyze, buffers) select count(*) from tab where ID =
> '01625cfa-2bf8-45cf' ;
>                                                               QUERY PLAN
>
>
---------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=8.29..8.30 rows=1 width=0) (actual time=0.048..0.048
> rows=1 loops=1)
>    Buffers: shared read=2
>    ->  Index Only Scan using tab_idx1 on tab  (cost=0.27..8.29 rows=1
> width=0) (actual time=0.043..0.043 rows=0 loops=1)
>          Index Cond: (ID = '01625cfa-2bf8-45cf'::text)
>          Heap Fetches: 0
>          Buffers: shared read=2
>  Planning time: 0.250 ms
>  Execution time: 0.096 ms
> (8 rows)
>
> my_db=# explain (analyze, buffers) select count(*) from tab where ID =
> '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea' ;
>                                                 QUERY PLAN
>
> -----------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=14.80..14.81 rows=1 width=0) (actual time=0.115..0.115
> rows=1 loops=1)
>    Buffers: shared hit=12
>    ->  Seq Scan on tab  (cost=0.00..14.79 rows=5 width=0) (actual
> time=0.031..0.108 rows=5 loops=1)
>          Filter: ((ID)::text = '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea'::text)
>          Rows Removed by Filter: 218
>          Buffers: shared hit=12
>  Planning time: 0.122 ms
>  Execution time: 0.180 ms
> (8 rows)

The only difference I see is that for the long value the planner expects
5 rows, while for the short one it expects 1 row. That may seem a bit
strange, but I'd bet it finds the short value in some statistic (MCV,
histogram) and so can provide a very accurate estimate. While for the
longer one, it ends up using some default (0.5% for equality, IIRC) or a
value deduced from ndistinct. Or something like that.
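To illustrate the kind of arithmetic involved, here is a simplified Python sketch of PostgreSQL's equality-selectivity logic (the real code lives in src/backend/utils/adt/selfuncs.c; the statistics values below are hypothetical stand-ins, not taken from the OP's table):

```python
def estimate_rows(value, mcv_vals, mcv_freqs, n_distinct, reltuples):
    """Simplified model of var_eq_const(): estimate rows for WHERE col = value."""
    if value in mcv_vals:
        # Value is in the most-common-values list: use its sampled frequency.
        selectivity = mcv_freqs[mcv_vals.index(value)]
    else:
        # Otherwise spread the leftover frequency evenly across the
        # distinct values not covered by the MCV list.
        leftover = 1.0 - sum(mcv_freqs)
        selectivity = leftover / (n_distinct - len(mcv_vals))
    return max(1, round(selectivity * reltuples))

# Hypothetical stats: 223 rows, ~218 distinct values, one value seen 5 times.
mcv_vals = ['01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea']
mcv_freqs = [5 / 223]

print(estimate_rows('01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea',
                    mcv_vals, mcv_freqs, 218, 223))  # MCV hit: estimates 5 rows
print(estimate_rows('01625cfa-2bf8-45cf',
                    mcv_vals, mcv_freqs, 218, 223))  # miss: estimates 1 row
```

The point is only that the estimate depends on whether the value matches a statistics entry, not on its character length.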

The differences between the two plans are rather negligible, both in
terms of cost (8.30 vs. 14.81) and runtime (0.1 vs. 0.2 ms). The choice
of a sequential scan seems perfectly reasonable for such a tiny table.

FWIW it's impossible to draw conclusions from just two EXPLAIN ANALYZE
executions. The timing instrumentation in EXPLAIN ANALYZE may itself have
a significant impact (different for each plan!). You also need to test
with more values and longer runs, not just a single execution (there are
caching effects etc.)

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PERFORM] Number of characters in column preventing index usage

From
Alvaro Herrera
Date:
Hustler DBA wrote:
> I am seeing this strange behavior, I don't know if this is by design by
> Postgres.
>
> I have an index on a column which is defined as "character varying(255)".
> When the value I am searching for is of a certain length, the optimizer
> uses the index but when the value is long, the optimizer doesn't use the
> index but does a seq scan on the table. Is this by design? How can I make
> the optimizer use the index no matter what the size/length of the value
> being searched for?

As I recall, selectivity for strings is estimated based on the length of
the string.  Since your sample string looks suspiciously like a UUID,
perhaps you'd be better served by using a UUID column for it, which may
give better results.  This would prevent you from using the shortened
version for searches (which I suppose you can do with LIKE on the
varchar type), but you could replace it with something like this:

select *
from tab
where ID between '01625cfa-2bf8-45cf-0000-000000000000' and
          '01625cfa-2bf8-45cf-ffff-ffffffffffff';

Storage (both the table and indexes) is going to be more efficient this
way too.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PERFORM] Number of characters in column preventing index usage

From
Tomas Vondra
Date:
On 02/17/2017 11:42 PM, David G. Johnston wrote:
> On Fri, Feb 17, 2017 at 3:19 PM, Hustler DBA <hustlerdba@gmail.com
> <mailto:hustlerdba@gmail.com>>wrote:
>
>
>     my_db=# create index tab_idx1 on tab(ID);
>
>     CREATE INDEX
>     my_db=# explain (analyze, buffers) select count(*) from tab where ID
>     = '01625cfa-2bf8-45cf' ;
>                                                                   QUERY
>     PLAN
>
---------------------------------------------------------------------------------------------------------------------------------------
>      Aggregate  (cost=8.29..8.30 rows=1 width=0) (actual
>     time=0.048..0.048 rows=1 loops=1)
>        Buffers: shared read=2
>        ->  Index Only Scan using tab_idx1 on tab  (cost=0.27..8.29
>     rows=1 width=0) (actual time=0.043..0.043 rows=0 loops=1)
>              Index Cond: (ID = '01625cfa-2bf8-45cf'::text)
>
>
>
>        ->  Seq Scan on tab  (cost=0.00..14.79 rows=5 width=0) (actual
>     time=0.031..0.108 rows=5 loops=1)
>              Filter: ((ID)::text =
>     '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea'::text)
>              Rows Removed by Filter: 218
>              Buffers: shared hit=12
>      Planning time: 0.122 ms
>      Execution time: 0.180 ms
>     (8 rows)
>
>
> IIRC the only reason the first query cares to use the index is because
> it can perform an Index Only Scan and thus avoid touching the heap at
> all.  If it cannot avoid touching the heap the planner is going to just
> use a sequential scan to retrieve the records directly from the heap and
> save the index lookup step.
>

I don't follow - the queries are exactly the same in both cases, except
the parameter value. So both cases are eligible for index only scan.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [PERFORM] Number of characters in column preventing index usage

From
Hustler DBA
Date:
Yes, both queries are the same; I just shortened the parameter value to see what would happen. The database that I inherited has a column that stores GUIDs/UUIDs in a varchar(255), and a select on that column does a full table scan (seq scan). All the values in the column are 36 characters long. The table is 104 KB.

I realized that there was no index on that column, so I created one and tried searching on a parameter value; it doesn't use the index, but when I shorten the parameter value, the optimizer decides to use the index for the search.



On Fri, Feb 17, 2017 at 5:52 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 02/17/2017 11:42 PM, David G. Johnston wrote:
On Fri, Feb 17, 2017 at 3:19 PM, Hustler DBA <hustlerdba@gmail.com
<mailto:hustlerdba@gmail.com>>wrote:



    my_db=# create index tab_idx1 on tab(ID);

    CREATE INDEX
    my_db=# explain (analyze, buffers) select count(*) from tab where ID
    = '01625cfa-2bf8-45cf' ;
                                                                  QUERY
    PLAN
    ---------------------------------------------------------------------------------------------------------------------------------------
     Aggregate  (cost=8.29..8.30 rows=1 width=0) (actual
    time=0.048..0.048 rows=1 loops=1)
       Buffers: shared read=2
       ->  Index Only Scan using tab_idx1 on tab  (cost=0.27..8.29
    rows=1 width=0) (actual time=0.043..0.043 rows=0 loops=1)
             Index Cond: (ID = '01625cfa-2bf8-45cf'::text)



       ->  Seq Scan on tab  (cost=0.00..14.79 rows=5 width=0) (actual
    time=0.031..0.108 rows=5 loops=1)
             Filter: ((ID)::text =
    '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea'::text)
             Rows Removed by Filter: 218
             Buffers: shared hit=12
     Planning time: 0.122 ms
     Execution time: 0.180 ms
    (8 rows)


IIRC the only reason the first query cares to use the index is because
it can perform an Index Only Scan and thus avoid touching the heap at
all.  If it cannot avoid touching the heap the planner is going to just
use a sequential scan to retrieve the records directly from the heap and
save the index lookup step.


I don't follow - the queries are exactly the same in both cases, except the parameter value. So both cases are eligible for index only scan.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Number of characters in column preventing index usage

From
"David G. Johnston"
Date:
On Fri, Feb 17, 2017 at 3:49 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: 
> That may seem a bit strange, but I'd bet it finds the short value in some
> statistic (MCV, histogram) and so can provide a very accurate estimate.

>  ->  Index Only Scan using tab_idx1 on tab  (cost=0.27..8.29 rows=1
> width=0) (actual time=0.043..0.043 rows=0 loops=1)

I'm not seeing how any of the statistics columns would capture a value that doesn't actually appear in the table... (actual ... rows=0)

Unless there is some prefix matching going on here, since the short value is a substring(1, n) of the longer one, which does appear 5 times.

I guess maybe because the value doesn't appear, it uses the index (via IOS) to confirm absence (or near absence, i.e., 1 row), while, knowing the larger value appears 5 times out of 223, it decides a quick table scan is faster than any form of double lookup (whether on the visibility map or the heap).

David J.

Re: [PERFORM] Number of characters in column preventing index usage

From
Tom Lane
Date:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Fri, Feb 17, 2017 at 3:49 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
> wrote:
>> That may seem a bit strange, but I'd bet it finds the short value in some
>> statistic (MCV, histogram) and so can provide a very accurate estimate.

> I'm not seeing how any of the statistics columns would capture a value that
> doesn't actually appear in the table... (actual ... rows=0)

I think it's the other way around.  It found
'01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea' in the stats, concluded
(accurately) that there would be five matches, and on the strength of that
decided that a seqscan over this very tiny table would be faster than an
indexscan.  In the other case, the short string exists neither in the
table nor the stats, and the default estimate is turning out to be that
there's a single match, for which it likes the indexscan solution.  This
is all pretty unsurprising if '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea'
is in the most-common-values list.  Anything that's *not* in that list
is going to get a smaller rowcount estimate.  (I don't think that the
string length, per se, has anything to do with it.)
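To see which values actually made it into the most-common-values list, one can inspect pg_stats (a sketch; note that the unquoted column name ID folds to lowercase in the catalog):

```sql
SELECT most_common_vals, most_common_freqs, n_distinct
FROM pg_stats
WHERE tablename = 'tab'
  AND attname = 'id';
```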

I'm not sure what performance problem the OP was looking to solve,
but expecting experiments on toy-sized tables to give the same plans
as you get on large tables is a standard mistake when learning to work
with the PG planner.

Also, if toy-sized tables are all you've got, meaning the whole database
can be expected to stay RAM-resident at all times, it'd be a good idea
to reduce random_page_cost to reflect that.  The default planner cost
settings are meant for data that's mostly on spinning rust.
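A sketch of what that tuning might look like (the value 1.1 is just a common starting point for fully cached data, not a recommendation from this thread):

```sql
-- Try it for the current session first:
SET random_page_cost = 1.1;

-- If it helps, persist it (9.4+) and reload the configuration:
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();
```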

            regards, tom lane


Re: [PERFORM] Number of characters in column preventing index usage

From
Hustler DBA
Date:
Thanks, you guys are correct... the size of the table caused the optimizer to do a seq scan instead of using the index. I tried it on a 24 MB table and a 1 GB table, and the expected index was used.
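For anyone reproducing this, a sketch for building a non-toy test table (the md5-to-uuid cast is just a convenient way to fabricate UUID-shaped strings; the names here are made up):

```sql
-- ~500k synthetic 36-character UUID strings in a varchar(255) column.
CREATE TABLE tab_big (id varchar(255));

INSERT INTO tab_big
SELECT md5(i::text)::uuid::text
FROM generate_series(1, 500000) AS i;

CREATE INDEX tab_big_idx ON tab_big (id);
ANALYZE tab_big;
```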



On Fri, Feb 17, 2017 at 7:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Fri, Feb 17, 2017 at 3:49 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
> wrote:
>> That may seem a bit strange, but I'd bet it finds the short value in some
>> statistic (MCV, histogram) and so can provide a very accurate estimate.

> I'm not seeing how any of the statistics columns would capture a value that
> doesn't actually appear in the table... (actual ... rows=0)

I think it's the other way around.  It found
'01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea' in the stats, concluded
(accurately) that there would be five matches, and on the strength of that
decided that a seqscan over this very tiny table would be faster than an
indexscan.  In the other case, the short string exists neither in the
table nor the stats, and the default estimate is turning out to be that
there's a single match, for which it likes the indexscan solution.  This
is all pretty unsurprising if '01625cfa-2bf8-45cf-bf4c-aa5f3c6fa8ea'
is in the most-common-values list.  Anything that's *not* in that list
is going to get a smaller rowcount estimate.  (I don't think that the
string length, per se, has anything to do with it.)

I'm not sure what performance problem the OP was looking to solve,
but expecting experiments on toy-sized tables to give the same plans
as you get on large tables is a standard mistake when learning to work
with the PG planner.

Also, if toy-sized tables are all you've got, meaning the whole database
can be expected to stay RAM-resident at all times, it'd be a good idea
to reduce random_page_cost to reflect that.  The default planner cost
settings are meant for data that's mostly on spinning rust.

                        regards, tom lane

