Thread: Wrong index choosen?

Wrong index choosen?

From
Gaetano Mendola
Date:
I hall
I have a query in this form:

empdb=# explain analyze select * from v_past_connections where id_user = 26195 and login_time > '2004-07-21';
                                                   QUERY PLAN
 

-----------------------------------------------------------------------------------------------------------------------------------------
IndexScan using idx_user_logs_login_time on user_logs  (cost=0.00..14.10 rows=1 width=28) (actual time=66.890..198.998
rows=5loops=1)   Index Cond: (login_time > '2004-07-21 00:00:00+02'::timestamp with time zone)   Filter: (id_user =
26195)Total runtime: 199.083 ms
 
(4 rows)


as you see the index on the time stamp column is used

The table have indexes on both columns:

empdb=# explain analyze select * from v_past_connections where login_time > '2004-07-21';
                                  QUERY PLAN
 

----------------------------------------------------------------------------------------------------------------------------------------------
IndexScan using idx_user_logs_login_time on user_logs  (cost=0.00..12.90 rows=481 width=28) (actual time=7.338..661.300
rows=22477loops=1)   Index Cond: (login_time > '2004-07-21 00:00:00+02'::timestamp with time zone) Total runtime:
676.472ms
 
(3 rows)

empdb=# explain analyze select * from v_past_connections where id_user = 26195;
                    QUERY PLAN
 

---------------------------------------------------------------------------------------------------------------------------------------
IndexScan using idx_user_user_logs on user_logs  (cost=0.00..252.47 rows=320 width=28) (actual time=4.420..100.122
rows=221loops=1)   Index Cond: (id_user = 26195) Total runtime: 100.348 ms
 
(3 rows)


The rows filtered out with both condictions are two order of magnitude differents,
also the extimated rows are close to real numbers:


empdb=# select count(*) from v_past_connections where id_user = 26195; count
-------   221
(1 row)

empdb=# select count(*) from v_past_connections where login_time > '2004-07-21'; count
------- 22441
(1 row)


why then the planner choose to do an index scan using the filter that retrieve a bigger ammount of rows ? A bug ?





Regards
Gaetano Mendola

















Re: Wrong index choosen?

From
Dennis Bjorklund
Date:
On Fri, 23 Jul 2004, Gaetano Mendola wrote:

> empdb=# explain analyze select * from v_past_connections where login_time > '2004-07-21';
>                                                                    QUERY PLAN
>
----------------------------------------------------------------------------------------------------------------------------------------------
>   Index Scan using idx_user_logs_login_time on user_logs  (cost=0.00..12.90 rows=481 width=28) (actual
time=7.338..661.300rows=22477 loops=1)
 
>     Index Cond: (login_time > '2004-07-21 00:00:00+02'::timestamp with time zone)
>   Total runtime: 676.472 ms
> (3 rows)

In this plan it estimates to get 481 but it got 22477. So the estimation 
was very wrong. You can increase the statistics tarhet on the login_time 
and it will probably be better (after the next analyze).

> why then the planner choose to do an index scan using the filter that
> retrieve a bigger ammount of rows ? A bug ?

Because it has to decide on the plan before it knows exactly what the 
result will be. As seen above, the estimation was wrong and thus the plan 
was not as good as it could have been.

In this case you probably also want to create a combined index on both
columns:

CREATE INDEX foo ON user_log (id_user, login_time);

ps. This letter belonged to pgsql-performance and not pgsql-hackers.

-- 
/Dennis Björklund



Re: Wrong index choosen?

From
Tom Lane
Date:
Dennis Bjorklund <db@zigo.dhs.org> writes:
> In this plan it estimates to get 481 but it got 22477. So the estimation 
> was very wrong. You can increase the statistics tarhet on the login_time 
> and it will probably be better (after the next analyze).

Given the nature of the data (login times), I'd imagine that the problem
is simply that he hasn't analyzed recently enough.  A bump in stats
target may not be needed, but he's going to have to re-analyze that
column often if he wants this sort of query to be estimated accurately,
because the fraction of entries later than a given time T is *always*
going to be changing.
        regards, tom lane