Thread: Multicolumn index - WHERE ... ORDER BY

Multicolumn index - WHERE ... ORDER BY

From
Lucas Maystre
Date:
Hi there,

I've got a small question about multicolumn indexes.

I have a table with ~5M rows (43 bytes per column -  is that relevant?)
(but eventually it could grow up to 50M rows), used to store e-mail
logs. I am trying to build a web frontend to search mails in this table.

I usually want only the last mails processed by my mail system, so
typically all my queries end with:
... ORDER BY time LIMIT 50;

Before that, I have usually have a WHERE clause on a indexed column.
Example of a query I might have:

SELECT id FROM mail WHERE from_address LIKE 'bill%'
ORDER BY time DESC LIMIT 50;

I observed that the ordering adds a significant overhead to my queries -
this seems quite logical, because of the ORDER BY which has to inspect
every row matching the WHERE clause.

The approach taken by the query planner is one of the following:

1) if it thinks there are "not so much" rows containg 'bill' as prefix
of the 'from_address' column, it performs an index scan (or a bitmap
index scan) using my index on 'from_address', then sorts all results
according to the 'time' column.

2) if it thinks there are "many" rows containing 'bill' as prefix of the
'from_address' column, it performs an reverse index scan using my index
on 'time', and looks "sequentially" if the 'from_address' column
contains 'bill' as prefix.

The problem is that "not so much" is in my case approx 10K rows
sometimes. It seems to be pretty costly to perform an (bitmap) index
scan over all these rows. As I only want the first few rows anyway
(LIMIT 50), I thought that there had to be some better solution.

The solution I had in mind was to create a multicolumn index over
'from_address' and 'time':

CREATE INDEX idx_from_time ON mail (from_address, time DESC);

so that it could directly use the 'time' ordering and lookup only the
first 50 rows using the index.

but... it doesn't work :-) i.e. my multicolumn index is never used. So:
- do you guys have any ideas why it doesn't work?
- do you see an alternative solution?

Infos:
- I use PostgreSQL 8.4.2
- I regularly VACUUM and ANALYZE my db. Statistics look OK.
- I'm relatively new to PostgreSQL, so maybe this question is trivial?

Thanks in advance, and happy holidays!

--
lucas maystre
trainee

open systems ag
raeffelstrasse 29
ch-8045 zurich
t: +41 44 455 74 00
f: +41 44 455 74 01
lum@open.ch

http://www.open.ch

Re: Multicolumn index - WHERE ... ORDER BY

From
Tom Lane
Date:
Lucas Maystre <lum@open.ch> writes:
> Example of a query I might have:
> SELECT id FROM mail WHERE from_address LIKE 'bill%'
> ORDER BY time DESC LIMIT 50;

> The solution I had in mind was to create a multicolumn index over
> 'from_address' and 'time':
> CREATE INDEX idx_from_time ON mail (from_address, time DESC);
> so that it could directly use the 'time' ordering and lookup only the
> first 50 rows using the index.

> but... it doesn't work :-) i.e. my multicolumn index is never used. So:
> - do you guys have any ideas why it doesn't work?

The from_address condition isn't simple equality, so the output of a
scan wouldn't be sorted by time --- it would have subranges that are
sorted, but that's no help overall.  You still have to read the whole
scan output and re-sort.  So this index has no advantage over the
smaller index on just from_address.

> - do you see an alternative solution?

There might be some use in an index on (time, from_address).  That
gives the correct time ordering, and at least the prefix part of the
from_address condition can be checked in the index without visiting the
heap.

            regards, tom lane