Thread: Expected performance of querying 5k records from 4 million records?

Expected performance of querying 5k records from 4 million records?

From
Anish Kejariwal
Date:
I've been struggling with this issue for the last several days, and I feel like I'm running into a few different issues that I don't understand.  I'm using postgres 9.0.8, and here's the OS I'm running this on:
Linux 2.6.18-308.4.1.el5xen #1 SMP Tue Apr 17 17:49:15 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

from show all:
shared_buffers                  | 4GB
work_mem                        | 192MB
maintenance_work_mem            | 1GB
effective_cache_size            | 24GB
wal_sync_method                 | fdatasync
random_page_cost                | 4

My situation: I have an empty parent table that has 250 partitions.  Each partition has 4 million records (~250 MB).  I'm querying 5k records directly from one partition (no joins), and it's taking ~2 seconds to get the results.  This feels very slow to me for an indexed table of only 4 million records.

Quick overview of my questions:
1. expected performance? tips on what to look into to increase performance?
2. should multicolumn indices help?
3. does reindex table cache the table?

Below are the tables, queries, and execution plans, with my questions in more detail.  (Since I have 250 partitions, I can query one partition after another to ensure that I'm not pulling results from the cache.)

Parent table:
# \d data
       Table "public.data"
    Column    |       Type       | Modifiers 
--------------+------------------+-----------
 data_id    | integer          | not null
 dataset_id   | integer          | not null
 stat            | double precision | not null
 stat_id | integer          | not null
Number of child tables: 254 (Use \d+ to list them.)


Child (partition) with ~4 million records:

genepool_1_11=# \d data_part_201
   Table "public.data_part_201"
    Column    |       Type       | Modifiers 
--------------+------------------+-----------
 data_id    | integer          | not null
 dataset_id   | integer          | not null
 stat            | double precision | not null
 stat_id | integer          | not null
Indexes:
    "data_unq_201" UNIQUE, btree (data_id)
    "data_part_201_dataset_id_idx" btree (dataset_id)
    "data_part_201_stat_id_idx" btree (stat_id)
Check constraints:
    "data_chk_201" CHECK (dataset_id = 201)
Inherits: data

explain analyze select data_id, dataset_id, stat from data_part_201 where dataset_id = 201
and stat_id = 6 and data_id>=50544630 and data_id<=50549979;

 Bitmap Heap Scan on data_part_201  (cost=115.79..14230.69 rows=4383 width=16) (actual time=36.103..1718.141 rows=5350 loops=1)
   Recheck Cond: ((data_id >= 50544630) AND (data_id <= 50549979))
   Filter: ((dataset_id = 201) AND (stat_id = 6))
   ->  Bitmap Index Scan on data_unq_201  (cost=0.00..114.70 rows=5403 width=0) (actual time=26.756..26.756 rows=5350 loops=1)
         Index Cond: ((data_id >= 50544630) AND (data_id <= 50549979))
 Total runtime: 1728.447 ms
(6 rows)

Time: 1743.535 ms

QUESTION 1: You can see that the query is very simple.  Is this the optimal execution plan?  Any tips on what to look into to increase performance?

I then tried adding the following multi-column index:
"data_part_202_dataset_regionset_data_idx" btree (dataset_id, data_id, stat_id)

The query now takes 27 seconds:
explain analyze select data_id, dataset_id, stat from data_part_202 where dataset_id = 202
and stat_id = 6 and data_id>=50544630 and data_id<=50549979; 

 Index Scan using data_part_202_dataset_regionset_data_idx on data_part_202  (cost=0.00..7987.83 rows=4750 width=16) (actual time=39.152..27339.401 rows=5350 loops=1)
   Index Cond: ((dataset_id = 202) AND (data_id >= 50544630) AND (data_id <= 50549979) AND (stat_id = 6))
 Total runtime: 27349.091 ms
(3 rows)

QUESTION 2: Why is the multicolumn index causing the query to run so much slower?  I had expected it to increase performance.


QUESTION 3:
If I run "reindex table data_part_204", the query now takes 50-70 milliseconds.  Is this because the table is getting cached?  How do I know if a particular query is coming from the cache?  The reason I think "reindex table" is caching the results is that a "select count(*)" on the partition also makes the query fast.

(and yes, vacuum analyze on the partition makes no difference)


Re: Expected performance of querying 5k records from 4 million records?

From
Josh Berkus
Date:
Anish,

> I've been struggling with this issue for the last several days, and I feel
> like I'm running into a few different issues that I don't understand.  I'm
> using postgres 9.0.8, and here's the OS I'm running this on:
> Linux 2.6.18-308.4.1.el5xen #1 SMP Tue Apr 17 17:49:15 EDT 2012 x86_64
> x86_64 x86_64 GNU/Linux

RAM?  What does your disk support look like? (disks especially are
relevant, see below).

> explain analyze select data_id, dataset_id, stat from data_part_201 where
> dataset_id = 201
> and stat_id = 6 and data_id>=50544630 and data_id<=50549979;
>
>  Bitmap Heap Scan on data_part_201  (cost=115.79..14230.69 rows=4383
> width=16) (actual time=36.103..1718.141 rows=5350 loops=1)
>    Recheck Cond: ((data_id >= 50544630) AND (data_id <= 50549979))
>    Filter: ((dataset_id = 201) AND (stat_id = 6))
>    ->  Bitmap Index Scan on data_unq_201  (cost=0.00..114.70 rows=5403
> width=0) (actual time=26.756..26.756 rows=5350 loops=1)
>          Index Cond: ((data_id >= 50544630) AND (data_id <= 50549979))
>  Total runtime: 1728.447 ms
> (6 rows)

I've seen extremely slow Bitmap Heap Scans like this before.  There are a
few things which can cause them, in my experience:

1) Table is on disk, and random access to disk is very slow for some reason.

2) Recheck condition is computationally expensive (unlikely here)

3) Index is very bloated and needs reindexing (again, unlikely because
the initial Bitmap Index Scan is quite fast).

To test the above: run the exact same query several times in a row.
Does it get dramatically faster on the 2nd and successive runs?

>  Index Scan using data_part_202_dataset_regionset_data_idx on data_part_202
>  (cost=0.00..7987.83 rows=4750 width=16) (actual time=39.152..27339.401
> rows=5350 loops=1)
>    Index Cond: ((dataset_id = 202) AND (data_id >= 50544630) AND (data_id
> <= 50549979) AND (stat_id = 6))
>  Total runtime: 27349.091 ms
> (3 rows)

I'll point out that you're now querying a different partition than you
did above.

Again, this would point to random access to the underlying partition
being very slow.

> QUESTION 3:
> If I do the following:  reindex table data_part_204 the query now takes
> 50-70 milliseconds.  Is this because the table is getting cached?  How do I
> know if a particular query is coming from the cache? The reason why I think
> "reindex table" is caching the results, is that select count(*) from the
> partition also causes the query to be fast.

Yes, it's most likely because the table is being cached.  To test this,
run one of the slow query versions above repeatedly.

Things to investigate:

1) Is there some reason why random access on your disks would be
unusually slow?  iSCSI, cheap NAS/SAN, RAID 5+0, running on OSX or
Windows, etc.?

2) Is there a possibility that the partitions or indexes involved might
be unusually bloated, such as a large number of historical updates to
indexed columns?  If so, does a CLUSTER on one partition (sketched after
this list) make the issue go away?

3) Test your database using PostgreSQL 9.2 Beta2.  Do the new index-only
scans solve this issue?
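
For point 2, a minimal sketch of that test, using the partition and index
names from your first query (CLUSTER rewrites the table in index order and
takes an exclusive lock while it runs, so follow it with an ANALYZE):

CLUSTER data_part_201 USING data_unq_201;
ANALYZE data_part_201;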

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Expected performance of querying 5k records from 4 million records?

From
Jeff Janes
Date:
On Fri, Jun 15, 2012 at 9:17 AM, Anish Kejariwal <anishkej@gmail.com> wrote:
>
> Below are the tables, queries, and execution plans, with my questions in
> more detail.  (Since I have 250 partitions, I can query one partition after
> another to ensure that I'm not pulling results from the cache.)

Doesn't that explain why it is slow?  If you have 15000 rpm drives and
each row is in a different block and uncached, it would take 20 seconds
to read them all in (15000 rpm is at most ~250 random reads per second,
and 5000 reads / 250 reads per second = 20 seconds).  You are getting 10
times better than that, either due to caching, or because your rows are
clustered, or because effective_io_concurrency is doing its thing.

>
> explain analyze select data_id, dataset_id, stat from data_part_201 where
> dataset_id = 201
> and stat_id = 6 and data_id>=50544630 and data_id<=50549979;

What does "explain (analyze, buffers)" show?


> QUESTION 1: you can see that the query is very simple.  is this the optimal
> execution plan? any tips on what to look into to increase performance?
>
> I then tried adding the following multi-column index:
> "data_part_202_dataset_regionset_data_idx" btree (dataset_id, data_id,
> stat_id)

Since you query stat_id for equality and data_id for range, you should
probably reverse the order of those columns in the index.
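
A minimal sketch of that, with the equality columns first and the range
column last (the index name here is just illustrative):

create index data_part_202_dataset_stat_data_idx on data_part_202
    (dataset_id, stat_id, data_id);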


>
> QUESTION 3:
> If I do the following:  reindex table data_part_204 the query now takes
> 50-70 milliseconds.  Is this because the table is getting cached?  How do I
> know if a particular query is coming from the cache?

Using explain (analyze, buffers) will show you if it is coming from
the shared_buffers cache.

It is harder to see if it is coming from the file system cache.  If
the server is mostly idle other than your stuff, you can run vmstat
and see how much physical IO is caused by your activity.
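
For example, something like this in another terminal while the query runs
(the "bi" column shows blocks read in from disk per second):

$ vmstat 1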

Cheers,

Jeff

Re: Expected performance of querying 5k records from 4 million records?

From
Anish Kejariwal
Date:
Thanks for the help, Jeff and Josh.  It looks like reclustering on the multi-column index might solve things.  For my particular query, because I'm getting a range of records back, it makes sense that reclustering will benefit me if I have a slow disk, even if I had expected that the indices would be sufficient.  I now need to make sure that the speedup I'm seeing is not because things have been cached.

That being said, here's what I have:
2 CPUs, 12 physical cores, hyperthreaded (24 virtual cores), 2.67GHz
96G RAM, 80G available to dom0
CentOS 5.8, Xen
3Gbps SATA (7200 RPM, Hitachi ActiveStar Enterprise Class)

So, I have lots of RAM, but not necessarily the fastest disk.

default_statistics_target = 50 # pgtune wizard 2011-03-16
maintenance_work_mem = 1GB # pgtune wizard 2011-03-16
constraint_exclusion = on # pgtune wizard 2011-03-16
checkpoint_completion_target = 0.9 # pgtune wizard 2011-03-16
effective_cache_size = 24GB # pgtune wizard 2011-03-16
work_mem = 192MB # pgtune wizard 2011-03-16
wal_buffers = 8MB # pgtune wizard 2011-03-16
checkpoint_segments = 128 # pgtune wizard 2011-03-16, amended by am, 30may2011
shared_buffers = 4GB # pgtune wizard 2011-03-16
max_connections = 100 # pgtune wizard 2011-03-16: 80, bumped up to 100
max_locks_per_transaction = 1000

I didn't know about explain (analyze, buffers).  Very cool.  So, based on your advice, I ran it and here's what I found:

1st time I ran the query:
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on data_part_213  (cost=113.14..13725.77 rows=4189 width=16) (actual time=69.807..2763.174 rows=5350 loops=1)
   Recheck Cond: ((data_id >= 50544630) AND (data_id <= 50549979))
   Filter: ((dataset_id = 213) AND (stat_id = 6))
   Buffers: shared read=4820
   ->  Bitmap Index Scan on data_unq_213  (cost=0.00..112.09 rows=5142 width=0) (actual time=51.918..51.918 rows=5350 loops=1)
         Index Cond: ((data_id >= 50544630) AND (data_id <= 50549979))
         Buffers: shared read=19
 Total runtime: 2773.099 ms
(8 rows)

The second time I run the query it's very fast, since all the buffered read counts have turned into hit counts, showing that I'm reading from the cache (as I expected):
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on data_part_213  (cost=113.14..13725.77 rows=4189 width=16) (actual time=1.661..14.376 rows=5350 loops=1)
   Recheck Cond: ((data_id >= 50544630) AND (data_id <= 50549979))
   Filter: ((dataset_id = 213) AND (stat_id = 6))
   Buffers: shared hit=4819
   ->  Bitmap Index Scan on data_unq_213  (cost=0.00..112.09 rows=5142 width=0) (actual time=0.879..0.879 rows=5350 loops=1)
         Index Cond: ((data_id >= 50544630) AND (data_id <= 50549979))
         Buffers: shared hit=18
 Total runtime: 20.232 ms
(8 rows)



Next, I tried reclustering a partition on the multicolumn index.  The big thing is that the read count has dropped dramatically!
 Index Scan using data_part_214_dataset_stat_data_idx on data_part_214  (cost=0.00..7223.05 rows=4265 width=16) (actual time=0.093..7.251 rows=5350 loops=1)
   Index Cond: ((dataset_id = 214) AND (data_id >= 50544630) AND (data_id <= 50549979) AND (stat_id = 6))
   Buffers: shared hit=45 read=24
 Total runtime: 12.929 ms
(4 rows)


second time:
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using data_part_214_dataset_stat_data_idx on data_part_214  (cost=0.00..7223.05 rows=4265 width=16) (actual time=0.378..7.696 rows=5350 loops=1)
   Index Cond: ((dataset_id = 214) AND (data_id >= 50544630) AND (data_id <= 50549979) AND (stat_id = 6))
   Buffers: shared hit=68
 Total runtime: 13.511 ms
(4 rows)

So, it looks like clustering the index appropriately fixes things!  Also, I'll recreate the index, switching the order to (dataset_id, stat_id, data_id).

thanks!


Re: Expected performance of querying 5k records from 4 million records?

From
Samuel Gendler
Date:
On Mon, Jun 18, 2012 at 9:39 AM, Anish Kejariwal <anishkej@gmail.com> wrote:

So, it looks like clustering the index appropriately fixes things!  Also, I'll recreate the index, switching the order to (dataset_id, stat_id, data_id).

Just keep in mind that clustering is a one-time operation.  Inserts and updates will change the order of records in the table, so you'll need to re-cluster periodically to keep performance high if there are a lot of inserts and updates into the tables.  I didn't re-read the thread, but I seem to recall a partitioned table, so assuming you are partitioning in a manner which keeps the number of partitions that are actively being inserted/updated to a minimum, you only need to cluster the active partitions, which isn't usually terribly painful.  Also, if you are bulk loading data (and not creating random spaces in the table by deleting and updating), you can potentially order the data on the way into the table to avoid the need to cluster repeatedly.
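
For example, a sketch of that bulk-load approach, assuming a hypothetical
staging table and target partition, with the rows ordered to match the
planned index:

insert into data_part_215 (data_id, dataset_id, stat, stat_id)
select data_id, dataset_id, stat, stat_id
from staging_data                       -- hypothetical staging table
order by dataset_id, stat_id, data_id;  -- matches the planned index order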

--sam

Re: Expected performance of querying 5k records from 4 million records?

From
Josh Berkus
Date:
On 6/18/12 9:39 AM, Anish Kejariwal wrote:
> Thanks for the help, Jeff and Josh.  It looks like reclustering on the
> multi-column index might solve things.  For my particular query, because
> I'm getting a range of records back, it makes sense that reclustering will
> benefit me if I have a slow disk, even if I had expected that the indices
> would be sufficient.  I now need to make sure that the speedup I'm seeing
> is not because things have been cached.

Well, other than that, your performance is as expected given your
much-larger-than-RAM database and your relatively slow disk.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



pgbouncer - massive overhead?

From
"Strange, John W"
Date:
Given a baseline postgresql.conf config and a couple of DL580s (40 cores/256GB
memory), I noticed a large overhead for pgbouncer.  Has anyone seen this before?


$ pgbench -h `hostname -i` -j 32 -p 4320 -U asgprod -s 500 -c 32 -S -T 60 pgbench_500
Scale option ignored, using pgbench_branches table count = 500
starting vacuum...end.
transaction type: SELECT only
scaling factor: 500
query mode: simple
number of clients: 32
number of threads: 32
duration: 60 s
number of transactions actually processed: 1743073
tps = 29049.886666 (including connections establishing)
tps = 29050.308194 (excluding connections establishing)

$ pgbench -h `hostname -i` -j 32 -p 4310 -U asgprod -s 500 -c 32 -S -T 60 pgbench_500
Scale option ignored, using pgbench_branches table count = 500
starting vacuum...end.
transaction type: SELECT only
scaling factor: 500
query mode: simple
number of clients: 32
number of threads: 32
duration: 60 s
number of transactions actually processed: 8692204
tps = 144857.505107 (including connections establishing)
tps = 144880.181341 (excluding connections establishing)

processor       : 39
vendor_id       : GenuineIntel
cpu family      : 6
model           : 47
model name      :        Intel(R) Xeon(R) CPU E7- 4860  @ 2.27GHz




Re: pgbouncer - massive overhead?

From
Steve Crawford
Date:
On 06/19/2012 09:00 AM, Strange, John W wrote:
> Given a baseline postgresql.conf config and a couple of DL580s (40 cores/256GB
> memory), I noticed a large overhead for pgbouncer.  Has anyone seen this before?
>
>
> $ pgbench -h `hostname -i` -j 32 -p 4320 -U asgprod -s 500 -c 32 -S -T 60 pgbench_500
> Scale option ignored, using pgbench_branches table count = 500
> starting vacuum...end.
> transaction type: SELECT only
> scaling factor: 500
> query mode: simple
> number of clients: 32
> number of threads: 32
> duration: 60 s
> number of transactions actually processed: 1743073
> tps = 29049.886666 (including connections establishing)
> tps = 29050.308194 (excluding connections establishing)
>
> $ pgbench -h `hostname -i` -j 32 -p 4310 -U asgprod -s 500 -c 32 -S -T 60 pgbench_500
> Scale option ignored, using pgbench_branches table count = 500
> starting vacuum...end.
> transaction type: SELECT only
> scaling factor: 500
> query mode: simple
> number of clients: 32
> number of threads: 32
> duration: 60 s
> number of transactions actually processed: 8692204
> tps = 144857.505107 (including connections establishing)
> tps = 144880.181341 (excluding connections establishing)
>
> processor       : 39
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 47
> model name      :        Intel(R) Xeon(R) CPU E7- 4860  @ 2.27GHz
>
>
I'm very dubious that the stats are meaningful as run.  Were the above
stats generated on consecutive runs on the same machine, or was the test
database fully returned to baseline between runs and the machine
restarted to clear the cache?

I doubt anyone here would trust the results of a 60-second pgbench run -
especially a select-only test on a server that will likely end up with
virtually everything ultimately in cache. Make sure each run is started
from the same state and run for 30-60 minutes.

Still, you *are* adding a layer between the client and the server.
Running the simplest of read-only queries against a fully-cached
database on a fast many-core machine is likely to emphasize any latency
introduced by pgbouncer. But it's also not a use-case for which
pgbouncer is intended. If you were to add -C, so that each query required a
new client connection, a different picture would emerge. Same thing if
you had 2000 client connections of which only a handful were running
queries at any moment.
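
For example, a sketch based on the command lines above, adding -C (open a
new connection for each transaction) and a longer duration per the 30-60
minute suggestion:

$ pgbench -h `hostname -i` -j 32 -p 4310 -U asgprod -s 500 -c 32 -S -C -T 1800 pgbench_500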

Cheers,
Steve