Thread: Index-only scan not working when IN clause has 2 or more values

Index-only scan not working when IN clause has 2 or more values

From
Anna B.
Date:
 
Hi all,
 
CREATE TABLE transaction
(
    client_id     decimal(18, 0) NOT NULL,
    trans_dttm    timestamp,
    division_code varchar(255),
    ...
)
 
One row size = approx 2Kb.
 
Usually table is searched by client_id and trans_dttm, but sometimes also by division_code, so:
 
create index "ix-transaction-client-trans_dttm-division"
    on transaction (client_id,
                    trans_dttm desc,
                    division_code);
 
Query:
 
EXPLAIN (ANALYZE, BUFFERS)
select *
from transaction
where client_id = 123456
  and (trans_dttm between TO_DATE('01.01.2020', 'dd.mm.yyyy') and TO_DATE('31.12.2022', 'dd.mm.yyyy'))
  and (division_code in
       ('not_existing_code1', 'not_existing_code2'))
order by trans_dttm desc
limit 50 offset 0;
 
Result:
 
Limit  (cost=0.57..14874.50 rows=50 width=2675) (actual time=703291.836..703291.845 rows=0 loops=1)
  Buffers: shared hit=7981349 read=2078884
  I/O Timings: read=683180.769
"  ->  Index Scan using ""ix-transaction-client-trans_dttm-division"" on transaction  (cost=0.57..8350814.66 rows=28072 width=2675) (actual time=703291.834..703291.835 rows=0 loops=1)"
"        Index Cond: ((client_id = '123456'::numeric) AND (trans_dttm >= to_date('01.01.2020'::text, 'dd.mm.yyyy'::text)) AND (trans_dttm <= to_date('31.12.2022'::text, 'dd.mm.yyyy'::text)))"
"        Filter: ((division_code)::text = ANY ('{not_existing_code1,not_existing_code2}'::text[]))"
        Rows Removed by Filter: 10000000
        Buffers: shared hit=7981349 read=2078884
        I/O Timings: read=683180.769
Planning Time: 0.258 ms
Execution Time: 703291.901 ms
 
In case when only one division_code passed: division_code in ('not_existing_code1'), the result is much better:
 
Limit  (cost=0.57..810.34 rows=50 width=2675) (actual time=1479.254..1479.257 rows=0 loops=1)
  Buffers: shared read=60245
  I/O Timings: read=231.032
"  ->  Index Scan using ""ix-transaction-client-trans_dttm-division"" on transaction  (cost=0.57..227318.91 rows=14036 width=2675) (actual time=1479.251..1479.253 rows=0 loops=1)"
"        Index Cond: ((client_id = '123456'::numeric) AND (trans_dttm >= to_date('01.01.2020'::text, 'dd.mm.yyyy'::text)) AND (trans_dttm <= to_date('31.12.2022'::text, 'dd.mm.yyyy'::text)) AND ((division_code)::text = 'not_existing_code1'::text))"
        Buffers: shared read=60245
        I/O Timings: read=231.032
Planning Time: 1.031 ms
Execution Time: 1479.342 ms
 
Thanks in advance,
Dmitry
 
 

Re: Index-only scan not working when IN clause has 2 or more values

From
Tom Lane
Date:
=?UTF-8?B?QW5uYSBCLg==?= <terzi@bk.ru> writes:
> create index "ix-transaction-client-trans_dttm-division"
>     on transaction (client_id,
>                     trans_dttm desc,
>                     division_code);
>
> EXPLAIN (ANALYZE, BUFFERS)
> select *
> from transaction
> where client_id = 123456
>   and (trans_dttm between TO_DATE('01.01.2020', 'dd.mm.yyyy') and TO_DATE('31.12.2022', 'dd.mm.yyyy'))
>   and (division_code in
>        ('not_existing_code1', 'not_existing_code2'))
> order by trans_dttm desc
> limit 50 offset 0;

The reason you get a plan like this:

> "  ->  Index Scan using ""ix-transaction-client-trans_dttm-division"" on transaction  (cost=0.57..8350814.66
rows=28072width=2675) (actual time=703291.834..703291.835 rows=0 loops=1)" 
> "        Index Cond: ((client_id = '123456'::numeric) AND (trans_dttm >= to_date('01.01.2020'::text,
'dd.mm.yyyy'::text))AND (trans_dttm <= to_date('31.12.2022'::text, 'dd.mm.yyyy'::text)))" 
> "        Filter: ((division_code)::text = ANY ('{not_existing_code1,not_existing_code2}'::text[]))"

is that if the =ANY clause were an index condition, it would result
in multiple scans of the index, therefore the output would (in all
probability) not be sorted in index order.  To produce the demanded
result, the plan would have to read the entire index scan and sort
its output.  The planner estimates that that would be slower than
what it has done here.  In practice it looks like you're reading
the whole scan output anyway because there are less than 50
matching rows, but the planner didn't know that.

The problem with =ANY producing unordered output can be dodged if
the =ANY is on the first index column; but I suppose that does not
help you here, since making division_code the first index column
would defeat getting output that's sorted by trans_dttm anyway.

You might try making extended stats on these three columns to see
if that helps the planner to get a better rowcount estimate.
If it understood that there were fewer than 50 matching rows,
it might opt for the use-the-=ANY-and-sort plan type.

            regards, tom lane



Re[2]: Index-only scan not working when IN clause has 2 or more values

From
Anna B.
Date:
Hi Tom and community,
 
Thank you very much! 
After digging how Postgres planner uses statistics, I have increased table statistics from 100 to 1000. It was enough for planner to use multiple scans of the index and then sort!
(Also I have added dependency extended stats on the three columns as you suggested).
 
Can I ask one more question. I am also testing same index but covering version:
 
create index "ix-transaction-client-trans_dttm-include-division"
    on transaction (client_id,
                    trans_dttm desc)
    include (division_code);
 
Why tuned statistics does not improved it?
 
EXPLAIN (ANALYZE, BUFFERS)
select *
from transaction
where client_id = 123456
  and (trans_dttm between TO_DATE('01.01.2020', 'dd.mm.yyyy') and TO_DATE('31.12.2022', 'dd.mm.yyyy'))
  and (division_code in
       ('not_existing_code1', 'not_existing_code2'))
order by trans_dttm desc
 
"Index Scan using ""ix-transaction-client-trans_dttm-include-division"" on transaction  (cost=0.57..8243559.04 rows=240 width=921) (actual time=23920.988..23920.989 rows=0 loops=1)"
"  Index Cond: ((client_id = '123456'::numeric) AND (trans_dttm >= to_date('01.01.2020'::text, 'dd.mm.yyyy'::text)) AND (trans_dttm <= to_date('31.12.2022'::text, 'dd.mm.yyyy'::text)))"
"  Filter: ((division_code)::text = ANY ('{not_existing_code1,not_existing_code2}'::text[]))"
  Rows Removed by Filter: 10000000
  Buffers: shared hit=8021895 read=2038341
  I/O Timings: read=8902.706
Planning Time: 1.278 ms
Execution Time: 23921.026 ms
 
Yes, I have read about covering indexes in Postgres, about why it has to check rows visibility. But do not understand why Postgres prefers to filter 10000000 table rows instead of filtering in index + using visibility map.
Btw, visibility map is up to date:
relpages, reltuples, relallvisible
23478634, 210520464, 23478634
 
Thank you in advance,
Dmitry

Пятница, 25 ноября 2022, 18:40 +03:00 от Tom Lane <tgl@sss.pgh.pa.us>:
 
Anna B. <terzi@bk.ru> writes:
> create index "ix-transaction-client-trans_dttm-division"
> on transaction (client_id,
> trans_dttm desc,
> division_code);
>
> EXPLAIN (ANALYZE, BUFFERS)
> select *
> from transaction
> where client_id = 123456
> and (trans_dttm between TO_DATE('01.01.2020', 'dd.mm.yyyy') and TO_DATE('31.12.2022', 'dd.mm.yyyy'))
> and (division_code in
> ('not_existing_code1', 'not_existing_code2'))
> order by trans_dttm desc
> limit 50 offset 0;

The reason you get a plan like this:

> " -> Index Scan using ""ix-transaction-client-trans_dttm-division"" on transaction (cost=0.57..8350814.66 rows=28072 width=2675) (actual time=703291.834..703291.835 rows=0 loops=1)"
> " Index Cond: ((client_id = '123456'::numeric) AND (trans_dttm >= to_date('01.01.2020'::text, 'dd.mm.yyyy'::text)) AND (trans_dttm <= to_date('31.12.2022'::text, 'dd.mm.yyyy'::text)))"
> " Filter: ((division_code)::text = ANY ('{not_existing_code1,not_existing_code2}'::text[]))"

is that if the =ANY clause were an index condition, it would result
in multiple scans of the index, therefore the output would (in all
probability) not be sorted in index order. To produce the demanded
result, the plan would have to read the entire index scan and sort
its output. The planner estimates that that would be slower than
what it has done here. In practice it looks like you're reading
the whole scan output anyway because there are less than 50
matching rows, but the planner didn't know that.

The problem with =ANY producing unordered output can be dodged if
the =ANY is on the first index column; but I suppose that does not
help you here, since making division_code the first index column
would defeat getting output that's sorted by trans_dttm anyway.

You might try making extended stats on these three columns to see
if that helps the planner to get a better rowcount estimate.
If it understood that there were fewer than 50 matching rows,
it might opt for the use-the-=ANY-and-sort plan type.

regards, tom lane