Home > mailing lists

Expected performance of querying 5k records from 4 million records? - Mailing list pgsql-performance

From	Anish Kejariwal
Subject	Expected performance of querying 5k records from 4 million records?
Date	June 15, 2012 13:17:55
Msg-id	CAOpcnr8S2hmV1z-=ZF1k+t+uWWyw82nUbwhpQmdf=8R_Kpv78g@mail.gmail.com Whole thread Raw
Responses	Re: Expected performance of querying 5k records from 4 million records? Re: Expected performance of querying 5k records from 4 million records?
List	pgsql-performance

Tree view

I've been struggling with this issue for the last several days, and I feel like I'm running into a few different issues that I don't understand. I'm using postgres 9.0.8, and here's the OS I'm running this on:

inux 2.6.18-308.4.1.el5xen #1 SMP Tue Apr 17 17:49:15 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

from show all:

shared_buffers | 4GB

work_mem | 192MB

maintenance_work_mem | 1GB

effective_cache_size | 24GB

wal_sync_method | fdatasync

random_page_cost | 4

My situtation: I have an empty parent table, that has 250 partitions. Each partition has 4 million records (250 megs). I'm querying 5k records directly from one partition (no joins), and it's taking ~2 seconds to get the results. This feels very slow to me for an indexed table of only 4 million records.

Quick overview of my questions::

1. expected performance? tips on what to look into to increase performance?

2. should multicolumn indices help?

3. does reindex table cache the table?

Below are the tables, queries, and execution plans with my questions with more detail. (Since I have 250 partitions, I can query one partition after the other to ensure that I'm not pulling results form the cache)

Parent table:

# \d data

Table "public.data"

Column | Type | Modifiers

--------------+------------------+-----------

data_id | integer | not null

dataset_id | integer | not null

stat | double precision | not null

stat_id | integer | not null

Number of child tables: 254 (Use \d+ to list them.)

Child (partition) with ~4 million records:

\d data_part_201

genepool_1_11=# \d data_part_201

Table "public.data_part_201"

Column | Type | Modifiers

--------------+------------------+-----------

data_id | integer | not null

dataset_id | integer | not null

stat | double precision | not null

stat_id | integer | not null

Indexes:

"data_unq_201" UNIQUE, btree (data_id)

"data_part_201_dataset_id_idx" btree (dataset_id)

"data_part_201_stat_id_idx" btree (stat_id)

Check constraints:

"data_chk_201" CHECK (dataset_id = 201)

Inherits: data

explain analyze select data_id, dataset_id, stat from data_part_201 where dataset_id = 201

and stat_id = 6 and data_id>=50544630 and data_id<=50549979;

Bitmap Heap Scan on data_part_201 (cost=115.79..14230.69 rows=4383 width=16) (actual time=36.103..1718.141 rows=5350 loops=1)

Recheck Cond: ((data_id >= 50544630) AND (data_id <= 50549979))

Filter: ((dataset_id = 201) AND (stat_id = 6))

-> Bitmap Index Scan on data_unq_201 (cost=0.00..114.70 rows=5403 width=0) (actual time=26.756..26.756 rows=5350 loops=1)

Index Cond: ((data_id >= 50544630) AND (data_id <= 50549979))

Total runtime: 1728.447 ms

(6 rows)

Time: 1743.535 ms

QUESTION 1: you can see that the query is very simple. is this the optimal execution plan? any tips on what to look into to increase performance?

I then tried adding the following multi-column index:

"data_part_202_dataset_regionset_data_idx" btree (dataset_id, data_id, stat_id)

The query now takes 27 seconds!:

explain analyze select data_id, dataset_id, stat from data_part_202 where dataset_id = 202

and stat_id = 6 and data_id>=50544630 and data_id<=50549979;

Index Scan using data_part_202_dataset_regionset_data_idx on data_part_202 (cost=0.00..7987.83 rows=4750 width=16) (actual time=39.152..27339.401 rows=5350 loops=1)

Index Cond: ((dataset_id = 202) AND (data_id >= 50544630) AND (data_id <= 50549979) AND (stat_id = 6))

Total runtime: 27349.091 ms

(3 rows)

QUESTION 2: why is a multicolumn index causing the query to run so much slower? I had expected it to increase the performance

QUESTION 3:

If I do the following: reindex table data_part_204 the query now takes 50-70 milliseconds. Is this because the table is getting cached? How do I know if a particular query is coming from the cache? The reason why I think "reindex table" is caching the results, is that select count(*) from the partition also causes the query to be fast.

(and yes, vacuum analyze on the partition makes no difference)

pgsql-performance by date:

From: Tatsuo Ishii
Date: 12 June 2012, 22:46:54
Subject: Re: pg_dump and thousands of schemas

From: Josh Berkus
Date: 15 June 2012, 14:45:04
Subject: Re: Expected performance of querying 5k records from 4 million records?

Expected performance of querying 5k records from 4 million records? - Mailing list pgsql-performance

Previous

Next