Thread: increase index performance

From:
Thomas Finneid
Date:

Hi

I have the following table (theoretical):

create table apartment_location (

    city_id    int,
    street_id  int,
    house_id   int,
    floor_id   int,
    owner      text
    ...
);

create index ... on apartment_location
    (city_id, street_id, house_id, floor_id) tablespace indexspace;

on a database with 260 GB of data and an index size of 109 GB on
separate RAID disks. There are:
    85 city_ids,
    2000 street_ids per city,
    20 house_ids per street per city,
    5 floor_ids per house per street per city

Then I perform a query to retrieve all street_ids for a specified city,
house and floor (a bit contrived, but the same cardinality applies):

   select street_id, floor_id from apartment_location where
       city_id  = 67 and
       house_id = 6 and
       floor_id = 4;

this returns about 2000 rows, but the query takes 3-4 seconds. It
performs an index scan, and everything happens inside 6 GB of memory.
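For anyone wanting to reproduce this, the plan and timing can be confirmed with EXPLAIN ANALYZE (a sketch against the contrived table above):

```sql
-- Show the actual plan and where the time goes; this confirms whether
-- the index scan itself dominates the runtime.
explain analyze
select street_id, floor_id
from apartment_location
where city_id  = 67
  and house_id = 6
  and floor_id = 4;
```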

So the question: any suggestions on how to decrease the query time?
From iostat etc. it seems that most of the work is reading the index;
reading the data takes next to nothing.

Any suggestions?

regards

thomas

From:
Greg Smith
Date:

On Tue, 12 May 2009, Thomas Finneid wrote:

> on a database with 260 GB of data and an index size of 109GB on separate raid
> disks. there are
>     85 city_ids,
>     2000 street_ids per city,
>     20 house_ids per street per city,
>     5 floor_ids per house per street per city

You should test what happens if you reduce the index to just being
(city_id,street_id).  Having all the fields in there makes the index
larger, and it may end up being faster to just pull all of the ~100 data
rows for a particular (city_id,street_id) using the smaller index and then
filter out just the ones you need.  Having a smaller index to traverse
also means that you'll be more likely to keep all the index blocks in the
buffer cache moving forward.

A second level improvement there is to then CLUSTER on the smaller index,
which increases the odds you'll get all of the rows you need by fetching
only a small number of data pages.
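As a sketch of that suggestion (the index name is illustrative, since the original was elided):

```sql
-- Smaller two-column index; the query then filters house_id/floor_id
-- out of the ~100 rows fetched per (city_id, street_id).
create index apartment_city_street_idx
    on apartment_location (city_id, street_id);

-- Physically reorder the table to match the index, so the rows you
-- need land on a small number of data pages.
cluster apartment_location using apartment_city_street_idx;
analyze apartment_location;
```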

--
* Greg Smith  http://www.gregsmith.com Baltimore, MD

From:
Thomas Finneid
Date:

First off, is there a way to pre-filter some of this data, by a view,
temporary table, partitioned indexes or something?

Secondly, one of the problems seems to be the size of the data and its
index. How can I calculate how much space a particular part of the
index needs in memory? Maybe I could rearrange things a bit so it fits
better inside pages and so on.
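As a starting point for the size question, PostgreSQL can report the on-disk size of a relation or index directly (a sketch; the index name is illustrative, since the real one was elided):

```sql
-- Total on-disk size of the index; finer-grained per-page statistics
-- would need contrib modules such as pgstattuple.
select pg_size_pretty(pg_relation_size('apartment_city_street_idx'));
```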

Thirdly, I was a bit unclear, and this was the best example I could
think of (my client probably doesn't want me to talk about this at
all... hence the contrived example):

        85 city_ids,
        2000 street_ids per city,
        10 house_ids per street
        500 floor_ids per house

Now it should have the correct data distribution and the correct
cardinality.

In this particular query I am interested in all streets in a city that
have the specific house id and the specific floor id.

By specifying
    city_id, house_id and floor_id

I should get all street_ids that matches

The example you gave, Greg, assumed I wanted to follow the index column
order, but I need to skip the second column to express the right query.
So pulling the data based on the first two fields, city and street,
would just give me data for a single street, when I want it for all
streets.



From:
Matthew Wakeling
Date:

On Tue, 12 May 2009, Greg Smith wrote:
> You should test what happens if you reduce the index to just being
> (city_id,street_id).

I think you're missing the point a little here. The point is that Thomas
is creating an index on (city_id, street_id, house_id, floor_id) and
running a query on (city_id, house_id, floor_id).

Thomas, the order of columns in the index matters. The index is basically
a tree structure, which resolves the left-most column before resolving the
column to the right of it. So to answer your query, it will resolve the
city_id, then it will have to scan almost all of the tree under that,
because you are not constraining for street_id. A much better index to
answer your query is (city_id, house_id, floor_id) - then it can just look
up straight away. Instead of the index returning 200000 rows to check, it
will return just the 2000.
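A sketch of such a reordered index (name illustrative):

```sql
-- Matches the query's WHERE clause column-for-column, so the btree can
-- descend directly instead of scanning the whole subtree for a city.
create index apartment_city_house_floor_idx
    on apartment_location (city_id, house_id, floor_id);
```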

Matthew

--
 An ant doesn't have a lot of processing power available to it. I'm not trying
 to be speciesist - I wouldn't want to detract you from such a wonderful
 creature, but, well, there isn't a lot there, is there?
                                        -- Computer Science Lecturer

From:
Thomas Finneid
Date:

Matthew Wakeling wrote:
> Thomas, the order of columns in the index matters. The index is
> basically a tree structure, which resolves the left-most column before
> resolving the column to the right of it. So to answer your query, it
> will resolve the city_id, then it will have to scan almost all of the
> tree under that, because you are not constraining for street_id. A much
> better index to answer your query is (city_id, house_id, floor_id) -
> then it can just look up straight away. Instead of the index returning
> 200000 rows to check, it will return just the 2000.

That's something I was a bit unsure about, because of the cardinality
of the data. But thanks, I will try it. I just need to populate a new
database with the new index. (Apparently, creating a new index on an
already existing database is slower than just recreating the db, when
the db is 250 GB.)


thomas

From:
"Ow Mun Heng"
Date:


-----Original Message-----
From: [mailto:pgsql-performance-]

>> A much
>> better index to answer your query is (city_id, house_id, floor_id) -
>> then it can just look up straight away. Instead of the index returning
>> 200000 rows to check, it will return just the 2000.

Shouldn't BITMAP indexes come into play?

Is having one index with 3 columns better than 3 indexes on the 3
different columns for BITMAP index seeks?



From:
Matthew Wakeling
Date:

On Thu, 14 May 2009, Ow Mun Heng wrote:
> Shouldn't BITMAP indexes come into play?
>
> Is having one index with 3 columns better than 3 indexes on the 3
> different columns for BITMAP index seeks?

I'll let someone correct me if I'm wrong, but using a single index that
exactly covers your search is always going to be better than munging
together results from several indexes, even if the planner decides to
turn it into a bitmap index scan (which will be more likely in PG 8.4
with effective_io_concurrency set).
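For contrast, a sketch of the multi-index alternative (names illustrative); the planner can combine these with a BitmapAnd, but the single three-column index avoids the merge step entirely:

```sql
-- Three single-column indexes the planner could combine via
-- Bitmap Index Scans joined by a BitmapAnd:
create index apartment_city_idx  on apartment_location (city_id);
create index apartment_house_idx on apartment_location (house_id);
create index apartment_floor_idx on apartment_location (floor_id);

-- EXPLAIN would then typically show a plan shaped like:
--   Bitmap Heap Scan on apartment_location
--     -> BitmapAnd
--          -> Bitmap Index Scan on apartment_city_idx
--          -> Bitmap Index Scan on apartment_house_idx
--          -> Bitmap Index Scan on apartment_floor_idx
```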

Matthew

--
 I don't want the truth. I want something I can tell parliament!
                                              -- Rt. Hon. Jim Hacker MP

From:
"Ow Mun Heng"
Date:


-----Original Message-----
From: Matthew Wakeling [mailto:]
On Thu, 14 May 2009, Ow Mun Heng wrote:
>> Shouldn't BITMAP indexes come into play?
>>
>> Is having one index with 3 columns better than 3 indexes on the 3
>> different columns for BITMAP index seeks?

>I'll let someone correct me if I'm wrong, but using a single index that
>exactly covers your search is always going to be better than munging
>together results from several indexes, even if the planner decides to turn
>it into a bitmap index scan (which will be more likely in PG 8.4 with
>effective_io_concurrency set).

I don't disagree there, but for multi-column indexes, the column order
has to be followed. If queries hit that exact order, then we're good to
go; if not, the index is wasted and not used to its full capability.

effective_io_concurrency, you say... Hmm, gotta google that.