Expected performance of querying 5k records from 4 million records? - Mailing list pgsql-performance

From Anish Kejariwal
Subject Expected performance of querying 5k records from 4 million records?
Date
Msg-id CAOpcnr8S2hmV1z-=ZF1k+t+uWWyw82nUbwhpQmdf=8R_Kpv78g@mail.gmail.com
Whole thread Raw
Responses Re: Expected performance of querying 5k records from 4 million records?  (Josh Berkus <josh@agliodbs.com>)
Re: Expected performance of querying 5k records from 4 million records?  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-performance
I've been struggling with this issue for the last several days, and I feel like I'm running into a few different issues that I don't understand.  I'm using postgres 9.0.8, and here's the OS I'm running this on:
inux  2.6.18-308.4.1.el5xen #1 SMP Tue Apr 17 17:49:15 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

from show all:
shared_buffers                  | 4GB 
work_mem                        | 192MB  
 maintenance_work_mem            | 1GB              
effective_cache_size            | 24GB             
 wal_sync_method                 | fdatasync 
random_page_cost                | 4                 

My situtation: I have an empty parent table, that has 250 partitions. Each partition has 4 million records (250 megs).  I'm querying 5k records directly from one partition (no joins), and it's taking ~2 seconds to get the results.  This feels very slow to me for an indexed table of only 4 million records. 

Quick overview of my questions::
1. expected performance? tips on what to look into to increase performance?
2. should multicolumn indices help?
3. does reindex table cache the table?

Below are the tables, queries, and execution plans with my questions with more detail.  (Since I have 250 partitions, I can query one partition after the other to ensure that I'm not pulling results form the cache)

Parent table:
# \d data
       Table "public.data"
    Column    |       Type       | Modifiers 
--------------+------------------+-----------
 data_id    | integer          | not null
 dataset_id   | integer          | not null
 stat            | double precision | not null
 stat_id | integer          | not null
Number of child tables: 254 (Use \d+ to list them.)


Child (partition) with ~4 million records:

\d data_part_201
genepool_1_11=# \d data_part_201
   Table "public.data_part_201"
    Column    |       Type       | Modifiers 
--------------+------------------+-----------
 data_id    | integer          | not null
 dataset_id   | integer          | not null
 stat            | double precision | not null
 stat_id | integer          | not null
Indexes:
    "data_unq_201" UNIQUE, btree (data_id)
    "data_part_201_dataset_id_idx" btree (dataset_id)
    "data_part_201_stat_id_idx" btree (stat_id)
Check constraints:
    "data_chk_201" CHECK (dataset_id = 201)
Inherits: data

explain analyze select data_id, dataset_id, stat from data_part_201 where dataset_id = 201
and stat_id = 6 and data_id>=50544630 and data_id<=50549979;

 Bitmap Heap Scan on data_part_201  (cost=115.79..14230.69 rows=4383 width=16) (actual time=36.103..1718.141 rows=5350 loops=1)
   Recheck Cond: ((data_id >= 50544630) AND (data_id <= 50549979))
   Filter: ((dataset_id = 201) AND (stat_id = 6))
   ->  Bitmap Index Scan on data_unq_201  (cost=0.00..114.70 rows=5403 width=0) (actual time=26.756..26.756 rows=5350 loops=1)
         Index Cond: ((data_id >= 50544630) AND (data_id <= 50549979))
 Total runtime: 1728.447 ms
(6 rows)

Time: 1743.535 ms

QUESTION 1: you can see that the query is very simple.  is this the optimal execution plan? any tips on what to look into to increase performance?

I then tried adding the following multi-column index:
"data_part_202_dataset_regionset_data_idx" btree (dataset_id, data_id, stat_id)

The query now takes 27 seconds!:   
explain analyze select data_id, dataset_id, stat from data_part_202 where dataset_id = 202
and stat_id = 6 and data_id>=50544630 and data_id<=50549979; 

 Index Scan using data_part_202_dataset_regionset_data_idx on data_part_202  (cost=0.00..7987.83 rows=4750 width=16) (actual time=39.152..27339.401 rows=5350 loops=1)
   Index Cond: ((dataset_id = 202) AND (data_id >= 50544630) AND (data_id <= 50549979) AND (stat_id = 6))
 Total runtime: 27349.091 ms
(3 rows)

QUESTION 2: why is a multicolumn index causing the query to run so much slower?  I had expected it to increase the performance


QUESTION 3:
If I do the following:  reindex table data_part_204 the query now takes 50-70 milliseconds.  Is this because the table is getting cached?  How do I know if a particular query is coming from the cache? The reason why I think "reindex table" is caching the results, is that select count(*) from the partition also causes the query to be fast.

(and yes, vacuum analyze on the partition makes no difference)


pgsql-performance by date:

Previous
From: Tatsuo Ishii
Date:
Subject: Re: pg_dump and thousands of schemas
Next
From: Josh Berkus
Date:
Subject: Re: Expected performance of querying 5k records from 4 million records?