Thread: Pg makes a nonoptimal choice between two multicolumn indexes with the same columns but in a different order.

It seems PostgreSQL is unable to choose the better index in such cases.
(My Pg version is 9.3.2.)

Here is an example:
create table t1 as select a.a, b.b from generate_series(1, 100) a(a), generate_series(1,500000) b(b);
create index t1_a_idx on t1(a);
create index t1_b_idx on t1(b);
create index t1_a_b_idx on t1(a,b);
create index t1_b_a_idx on t1(b,a);
alter table t1 alter a set statistics 10000;
alter table t1 alter b set statistics 10000;
analyze t1;

test=> explain select count(*) from t1 where a in (1, 9, 17, 26, 35, 41, 50) and b = 333333;
                                      QUERY PLAN                                      
--------------------------------------------------------------------------------------
 Aggregate  (cost=46.62..46.63 rows=1 width=0)
   ->  Index Only Scan using t1_a_b_idx on t1  (cost=0.57..46.60 rows=7 width=0)
         Index Cond: ((a = ANY ('{1,9,17,26,35,41,50}'::integer[])) AND (b = 333333))
(3 rows)

The row estimate is exact, but I don't think t1_a_b_idx is the best choice.
Let's check:
# drop pg and disc buffers/caches
systemctl stop postgresql.service ; echo 3 >/proc/sys/vm/drop_caches ; systemctl start postgresql.service ; sleep 2
# warm up pg and check the plan
{ echo '\\timing' && echo "explain select count(*) from t1 where a in (1, 9, 17, 26, 35, 41, 50) and b = 333333;" ; } | psql test
# do the benchmark
{ echo '\\timing' && echo "select count(*) from t1 where a in (1, 9, 17, 26, 35, 41, 50) and b = 333333;" ; } | psql test

The last query consistently takes 200-210 ms, and t1_a_b_idx is always used. I checked several times.

OK. Now run 'drop index t1_a_b_idx;' and check again.
Pg now uses t1_b_a_idx, and the same control query takes 90-100 ms. This is much better.

Using the pageinspect contrib module I studied the btree structure, and it is
clear to me why t1_b_a_idx is better. The question is: is PostgreSQL able to see that?
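For anyone who wants to reproduce that inspection: pageinspect exposes the btree meta page and individual index pages. A minimal sketch (the block number 1 is illustrative; the actual root and leaf block numbers will differ per build):

```sql
create extension pageinspect;

-- The meta page records the root block and the tree depth for each index:
select root, level from bt_metap('t1_a_b_idx');
select root, level from bt_metap('t1_b_a_idx');

-- Per-page statistics (page type, number of live items) for one block:
select type, live_items from bt_page_stats('t1_b_a_idx', 1);

-- The index tuples stored on that block, with their heap pointers:
select itemoffset, ctid, data from bt_page_items('t1_b_a_idx', 1) limit 5;
```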

Michael Kolomeitsev <mkolomeitsev@gmail.com> wrote:

> it is clear for me why t1_b_a_idx is better. The question is: Is
> postgresql able to see that?

For a number of reasons I never consider a bulk load complete until
I run VACUUM FREEZE ANALYZE on the table(s) involved.  When I try
your test case without that, I get the bad index choice.  When I
then run VACUUM FREEZE ANALYZE on the database I get the good index
choice.

There may be some lesser maintenance step which sets up visibility
information and provides the planner with enough data to make a
good choice; I just noticed that you were not following what I
consider to be rote good practice, tried it, and it solved the
problem.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
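For the test case in this thread, the practice Kevin describes comes down to one command after the bulk load:

```sql
vacuum freeze analyze t1;
```

Besides refreshing the planner statistics, this sets the visibility-map bits that let an Index Only Scan avoid heap fetches.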


On 29/12/13 10:03, Kevin Grittner wrote:
> Michael Kolomeitsev <mkolomeitsev@gmail.com> wrote:
>
>> it is clear for me why t1_b_a_idx is better. The question is: Is
>> postgresql able to see that?
> For a number of reasons I never consider a bulk load complete until
> I run VACUUM FREEZE ANALYZE on the table(s) involved.  When I try
> your test case without that, I get the bad index choice.  When I
> then run VACUUM FREEZE ANALYZE on the database I get the good index
> choice.
>
> There may be some lesser maintenance which sets up visibility
> information and provides the planner with enough data to make a
> good choice, I just noticed that you were not following what I
> consider to be rote good practice, tried it, and it solved the
> problem.
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
Curious: would it be feasible to do some kind of ANALYZE during a bulk
operation?  Say you could tell the system in advance that you expect to
change 20% of the records: then you could sample some of the changes and
blend the statistics as 0.2 times the sampled statistics plus 0.8 of the
pre-existing statistics.

BEGIN BULK OPERATION CHANGE 20%
[... several transactions ...]
END BULK OPERATION

The sampling could be done as part of the individual operations or at
the end of the bulk operation - whichever is deemed more practicable
(possibly a bit of both?).
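The blending Gavin proposes is essentially a weighted update of each statistic. A minimal sketch of the idea (the function and the reduction of a statistic to a single number are purely illustrative; nothing like this exists in PostgreSQL):

```python
def blend_statistic(old: float, sampled: float, changed_fraction: float) -> float:
    """Blend a pre-existing planner statistic with one sampled during a
    bulk operation, weighted by the declared fraction of changed rows."""
    return changed_fraction * sampled + (1.0 - changed_fraction) * old

# With 20% of rows declared changed, a sampled n_distinct of 200 and a
# pre-existing value of 100 blend to 0.2*200 + 0.8*100 = 120.
blended = blend_statistic(100.0, 200.0, 0.2)
```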


Cheers,
Gavin


Kevin Grittner <kgrittn@ymail.com> writes:
> Michael Kolomeitsev <mkolomeitsev@gmail.com> wrote:
>> it is clear for me why t1_b_a_idx is better. The question is: Is
>> postgresql able to see that?

> For a number of reasons I never consider a bulk load complete until
> I run VACUUM FREEZE ANALYZE on the table(s) involved.  When I try
> your test case without that, I get the bad index choice.  When I
> then run VACUUM FREEZE ANALYZE on the database I get the good index
> choice.

I think that's just chance, because AFAICS the cost estimates are exactly
the same for both indexes, once you've done the vacuum to make all the
heap pages all-visible.  What's more, I'm not sure that that's wrong,
because according to EXPLAIN (ANALYZE, BUFFERS) the exact same number of
index pages are touched for either index.  So I think Michael's claim that
the one index is better is at best unproven.

regression=# explain (analyze, buffers) select count(*) from t1 where a in (1, 9, 17, 26, 35, 41, 50) and b = 333333;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=32.12..32.13 rows=1 width=0) (actual time=0.097..0.098 rows=1 loops=1)
   Buffers: shared hit=30
   ->  Index Only Scan using t1_b_a_idx on t1  (cost=0.57..32.10 rows=7 width=0) (actual time=0.044..0.085 rows=7 loops=1)
         Index Cond: ((b = 333333) AND (a = ANY ('{1,9,17,26,35,41,50}'::integer[])))
         Heap Fetches: 0
         Buffers: shared hit=30
 Total runtime: 0.174 ms
(7 rows)

regression=# begin; drop index t1_b_a_idx;
BEGIN
DROP INDEX
regression=# explain (analyze, buffers) select count(*) from t1 where a in (1, 9, 17, 26, 35, 41, 50) and b = 333333;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=32.12..32.13 rows=1 width=0) (actual time=0.110..0.110 rows=1 loops=1)
   Buffers: shared hit=30
   ->  Index Only Scan using t1_a_b_idx on t1  (cost=0.57..32.10 rows=7 width=0) (actual time=0.039..0.101 rows=7 loops=1)
         Index Cond: ((a = ANY ('{1,9,17,26,35,41,50}'::integer[])) AND (b = 333333))
         Heap Fetches: 0
         Buffers: shared hit=30
 Total runtime: 0.199 ms
(7 rows)

regression=# abort;
ROLLBACK

I grant the theory that the repeated index probes in t1_b_a_idx should be
more localized than those in t1_a_b_idx, but PG's optimizer doesn't
attempt to estimate such effects, and this example isn't doing much to
convince me that it'd be worth the trouble.

            regards, tom lane


2013/12/29 Tom Lane <tgl@sss.pgh.pa.us>
> I think that's just chance, because AFAICS the cost estimates are exactly
> the same for both indexes, once you've done the vacuum to make all the
> heap pages all-visible.  What's more, I'm not sure that that's wrong,
> because according to EXPLAIN (ANALYZE, BUFFERS) the exact same number of
> index pages are touched for either index.  So I think Michael's claim that
> the one index is better is at best unproven.

Let me prove it :)

1. I benchmark after dropping the Pg and OS disk caches/buffers,
the way I posted in my first message:
sh# systemctl stop postgresql.service ; echo 3 >/proc/sys/vm/drop_caches ; systemctl start postgresql.service

And the timing results are quite stable: 200-210 ms using t1_a_b_idx and 90-100 ms using t1_b_a_idx.

Trying 'explain (analyze, buffers) ... ' I got this:
* using t1_a_b_idx:
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=46.62..46.63 rows=1 width=0) (actual time=228.853..228.854 rows=1 loops=1)
   Buffers: shared hit=12 read=23
   ->  Index Only Scan using t1_a_b_idx on t1  (cost=0.57..46.60 rows=7 width=0) (actual time=52.171..228.816 rows=7 loops=1)
         Index Cond: ((a = ANY ('{1,9,17,26,35,41,50}'::integer[])) AND (b = 333333))
         Heap Fetches: 7
         Buffers: shared hit=12 read=23
 Total runtime: 229.012 ms


* using t1_b_a_idx:
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=60.12..60.13 rows=1 width=0) (actual time=115.617..115.617 rows=1 loops=1)
   Buffers: shared hit=24 read=11
   ->  Index Only Scan using t1_b_a_idx on t1  (cost=0.57..60.10 rows=7 width=0) (actual time=80.460..115.590 rows=7 loops=1)
         Index Cond: ((b = 333333) AND (a = ANY ('{1,9,17,26,35,41,50}'::integer[])))
         Heap Fetches: 7
         Buffers: shared hit=24 read=11
 Total runtime: 116.480 ms


There is a difference in read operations, and moreover in the cost estimation.
23 - 11 = 12 excess read operations. If they are all random reads they may take ~100 ms on a typical home/workstation SATA hard drive. That's exactly the difference between the timings I showed above.
Yes, I understand that Pg doesn't know (while planning the query) how many pages will be hit in shared buffers.
But I can't see why both plans touch the same total number of buffers (12 + 23 = 24 + 11 = 35)...
And I can't see why I get different cost estimations...


> I grant the theory that the repeated index probes in t1_b_a_idx should be
> more localized than those in t1_a_b_idx, but PG's optimizer doesn't

Yes, I see both t1_a_b_idx and t1_b_a_idx have 3 levels in the btree. For t1_a_b_idx, Pg has to read 1 (the meta page) + 1 (root) + 1 (common level-1 node) + 7 * 2 = 17 pages in it, while for t1_b_a_idx it reads 1 + 1 + 3 = 5 pages (because all 7 pairs of (a, b) are located in one btree leaf node). 17 - 5 = 12, the same difference we can see in 'explain (analyze, buffers)'.
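That page arithmetic, together with the earlier rough random-read figure, can be sketched as follows (the ~8 ms average seek time is my assumption for a desktop SATA drive, not a measured value):

```python
# Index pages touched, per the btree layout described above.
# t1_a_b_idx: meta + root + shared level-1 page, then 2 pages for each of 7 probes.
pages_a_b = 1 + 1 + 1 + 7 * 2   # 17
# t1_b_a_idx: meta + root + 3 pages, since all 7 (a, b) pairs sit on one leaf.
pages_b_a = 1 + 1 + 3           # 5

excess_reads = pages_a_b - pages_b_a  # 12, matching EXPLAIN (ANALYZE, BUFFERS)

seek_ms = 8.0  # assumed average random-read latency for a desktop SATA drive
print(f"excess reads: {excess_reads}, extra cold-cache time ~ {excess_reads * seek_ms:.0f} ms")
```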


 
> attempt to estimate such effects, and this example isn't doing much to
> convince me that it'd be worth the trouble.

In a real-life situation I have two kinds of queries against that table:
* select ... from t1 where a in (...) and b = ?
* select ... from t1 where a = ? and b in (...)

I select fields from t1 that are not in the indexes, so there is no 'Index Only Scan', there are more random reads, and the performance impact of choosing t1_a_b_idx is somewhat smaller for both queries.

And I got my answer ("PG's optimizer doesn't attempt to estimate such effects") for my situation.
Thanks a lot.