Thread: Re: cannot use multicolumn index

Re: cannot use multicolumn index

From

MirrorX

Date:

14 September 2011, 13:09:19

here is the explain analyze output->
server=# explain analyze select count(*) from temp_by_hour where xid > 100
and xdate > now() - interval '1 week';
                                                                    QUERY
PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=29359311.58..29359311.59 rows=1 width=0) (actual
time=2728061.589..2728061.590 rows=1 loops=1)
   ->  Seq Scan on temp_by_hour (cost=0.00..29345234.14 rows=5630975
width=0) (actual time=560446.661..2726838.501 rows=5760724 loops=1)
         Filter: ((xid > 100) AND (xdate > (now() - '7 days'::interval)))
 Total runtime: 2728063.170 ms

--
View this message in context:
http://postgresql.1045698.n5.nabble.com/cannot-use-multicolumn-index-tp4802634p4802699.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.

Re: cannot use multicolumn index

From

Grzegorz Jaśkiewicz

Date:

14 September 2011, 13:18:14

http://wiki.postgresql.org/wiki/Guide_to_reporting_problems

Re: cannot use multicolumn index

From

MirrorX

Date:

14 September 2011, 14:02:24

-postgres version -> 8.4.4
-os -> redhat 5.6
-specs ->24 cores, 96GB ram, shared_buffers=32 GB
-postgresql.conf -> i havent made any changes as far as the query tuning
parameters are concerned.
#------------------------------------------------------------------------------
# QUERY TUNING
#------------------------------------------------------------------------------

# - Planner Method Configuration -

#enable_bitmapscan = on
#enable_hashagg = on
#enable_hashjoin = on
#enable_indexscan = on
#enable_mergejoin = on
#enable_nestloop = on
#enable_seqscan = on
#enable_sort = on
#enable_tidscan = on

# - Planner Cost Constants -

#seq_page_cost = 1.0                    # measured on an arbitrary scale
#random_page_cost = 4.0                 # same scale as above
#cpu_tuple_cost = 0.01                  # same scale as above
#cpu_index_tuple_cost = 0.005           # same scale as above
#cpu_operator_cost = 0.0025             # same scale as above
effective_cache_size = 50GB

# - Genetic Query Optimizer -

#geqo = on
#geqo_threshold = 12
#geqo_effort = 5                        # range 1-10
#geqo_pool_size = 0                     # selects default based on effort
#geqo_generations = 0                   # selects default based on effort
#geqo_selection_bias = 2.0              # range 1.5-2.0

# - Other Planner Options -

#default_statistics_target = 100        # range 1-10000
#constraint_exclusion = partition       # on, off, or partition
#cursor_tuple_fraction = 0.1            # range 0.0-1.0
#from_collapse_limit = 8
#join_collapse_limit = 8                # 1 disables collapsing of explicit
                                        # JOIN clauses
--------------------------------------------------------------------------------------------------------------


if any other parameters are relative to my question pls tell me which you
want and i can post them (i can post the whole postgresql.conf if it's
helpful). my shared buffers ar

my question apart from the specific example, is a little more general. so,
is it normal to expect such an index to be used? can i write the query in
another form so as to use this index? is it for example that the conditions
are '>' and not '=' a factor why the index is not used?

--
View this message in context:
http://postgresql.1045698.n5.nabble.com/cannot-use-multicolumn-index-tp4802634p4802871.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.

Re: cannot use multicolumn index

From

"Tomas Vondra"

Date:

14 September 2011, 14:15:07

On 14 Září 2011, 15:09, MirrorX wrote:
> here is the explain analyze output->
> server=# explain analyze select count(*) from temp_by_hour where xid > 100
> and xdate > now() - interval '1 week';
>                                                                     QUERY
> PLAN
>
---------------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=29359311.58..29359311.59 rows=1 width=0) (actual
> time=2728061.589..2728061.590 rows=1 loops=1)
>    ->  Seq Scan on temp_by_hour (cost=0.00..29345234.14 rows=5630975
> width=0) (actual time=560446.661..2726838.501 rows=5760724 loops=1)
>          Filter: ((xid > 100) AND (xdate > (now() - '7 days'::interval)))
>  Total runtime: 2728063.170 ms

Sorry, but with this amount of information, no one can actually help.

- What is the problem, i.e. what behaviour you expect?
- How much data is the table?
- What portion of it matches the conditions?
- What is the index definition?

My bet is the conditions are not selective enough and the index scan would
be less effective than reading the whole table. Try to disable seqscan or
modify the cost variables so that the index scan is used and see if it's
faster or not.

Tomas

Re: cannot use multicolumn index

From

MirrorX

Date:

14 September 2011, 15:14:11

thx for the answer.

- What is the problem, i.e. what behaviour you expect?
- How much data is the table?
- What portion of it matches the conditions?
- What is the index definition?

i think in my first post i provided most of these details but ->
1) what i expect is to be able to understand why the index is not used and
if possibly to use it somehow, or recreate it in a better way
2) the table has 115 GB and about 700 milion rows
3) the result should be less than 10 millions rows
4) the index is a btree

i tried to disable seq_scan and the query plan was changed and used another
index and not the one i wanted.

--
View this message in context:
http://postgresql.1045698.n5.nabble.com/cannot-use-multicolumn-index-tp4802634p4803198.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.

Re: cannot use multicolumn index

From

Vitalii Tymchyshyn

Date:

14 September 2011, 15:28:50

14.09.11 18:14, MirrorX написав(ла):
> i think in my first post i provided most of these details but ->
> 1) what i expect is to be able to understand why the index is not used and
> if possibly to use it somehow, or recreate it in a better way
> 2) the table has 115 GB and about 700 milion rows
> 3) the result should be less than 10 millions rows
> 4) the index is a btree
>
> i tried to disable seq_scan and the query plan was changed and used another
> index and not the one i wanted.
You has ">" check on both columns, this means that it has to scan each
subtree that satisfy one criteria to check against the other. Here index
column order is significant. E.g. if you have a lot of xid > 100 and xid
is first index column, it must check all (a lot) the index subtrees for
xid>100.
Multicolumn indexes work best when first columns are checked with "="
and only last column with range criteria.
You may still try to change order of columns in your index if this will
give best selectivity on first column.
Another option is multiple single column indexes - postgres may merge
such an indexes at runtime (don't remember since which version this
feature is available).

Best regards, Vitalii Tymchyshyn.

Re: cannot use multicolumn index

From

"Tomas Vondra"

Date:

14 September 2011, 16:05:34

On 14 Září 2011, 17:14, MirrorX wrote:
> thx for the answer.
>
> - What is the problem, i.e. what behaviour you expect?
> - How much data is the table?
> - What portion of it matches the conditions?
> - What is the index definition?
>
> i think in my first post i provided most of these details but ->

Hmmm, I haven't received that post and I don't see that in the archives:

http://archives.postgresql.org/pgsql-performance/2011-09/msg00210.php

It's displayed on nabble.com, but it's marked as 'not yet accepted'.
That's strange.

Anyway there's still a lot of missing info - what version of PostgreSQL is
this? What is the table structure, what indexes are there?

> 1) what i expect is to be able to understand why the index is not used and
> if possibly to use it somehow, or recreate it in a better way
> 2) the table has 115 GB and about 700 milion rows

Really? Because the explain analyze output you posted states there are
just 5.760.724 rows, not 700.000.000.

> 3) the result should be less than 10 millions rows

That's about 1.5% of the rows, but it may be much larger portion of the
table. The table is stored by blocks - whenever you need to read a row,
you need to read the whole block.

115GB is about 15.073.280 blocks (8kB). If each row happens to be stored
in a different block, you'll have to read about 66% of blocks (although
you need just 1.4% of rows).

Sure, in reality the assumption 'a different block for each row' is not
true, but with a table this large the block probably won't stay in the
cache (and thus will be read repeatedly from the device).

And that's just the table - you have to read the index too (which is 35GB
in this case).

So it's not just about the 'row selectivity', it's about 'block
selectivity' too.

In short - my guess is the seq scan will be more efficient in this case,
but it's hard to prove without the necessary info.

> 4) the index is a btree

Great, but what are the columns? What data types are used?

BTW I've noticed you stated this in the first post "i have read in the
manual that the multicolumn index can be used only if the clauses of the
query are in the same order as the columns of the index".

That's not true since 8.1, so unless you're using a very old version of
PostgreSQL (8.0 or older), you may use whatever columns you want although
it's not as efficient.

Do you need both columns (xid, xdate) in the WHERE condition, or have you
used one of them just to fulfill the 'leftmost columns' rule by adding a
condition that matches everything? If that's the case, it's hardly going
to improve the effectivity.

I see two possible solutions:

1) partition the table and use constraint_exclusion so that just a small
portion of the table is scanned - there are pros/cons of this solution

2) cluster the table by one of the columns, so that an index scan may be
more effective (but this might hurt other queries and you'll have to do
that repeatedly)

Tomas

Re: cannot use multicolumn index

From

MirrorX

Date:

15 September 2011, 09:43:08

thank you all for your advice. i will try the table partitioning approach to
reduce the size of the tables and to be able to handle them more efficiently

--
View this message in context:
http://postgresql.1045698.n5.nabble.com/cannot-use-multicolumn-index-tp4802634p4806239.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.