Thread: large tables and simple "= constant" queries using indexes

large tables and simple "= constant" queries using indexes

From
John Beaver
Date:
Hi, I've started my first project with Postgres (after several years of
using Mysql), and I'm having an odd performance problem that I was
hoping someone might be able to explain the cause of.

----My query----
    - select count(*) from gene_prediction_view where gene_ref = 523
    - takes 26 seconds to execute, and returns 2400 (out of a total of
15 million records in the table)

---My problem---
    Using a single-column index to count 2400 records which are exactly
one constant value doesn't sound like something that would take 26
seconds. What's the slowdown? Any silver bullets that might fix this?

----Steps I've taken----
    - I ran vacuum and analyze
    - I upped the shared_buffers to 58384, and I upped some of the other
postgresql.conf values as well. Nothing seemed to help significantly,
but maybe I missed something that would help specifically for this query
type?
    - I tried to create a hash index, but gave up after more than 4
hours of waiting for it to finish indexing

----Table stats----
    - 15 million rows; I'm expecting to have four or five times this
number eventually.
    - 1.5 gigs of hard drive usage

----My development environment---
    - 2.6ghz dual-core MacBook Pro with 4 gigs of ram and a 7200 rpm
hard drive
    - OS X 10.5.2
    - Postgres 8.3 (installed via MacPorts)

----My table----

CREATE TABLE gene_prediction_view
(
  id serial NOT NULL,
  gene_ref integer NOT NULL,
  go_id integer NOT NULL,
  go_description character varying(200) NOT NULL,
  go_category character varying(50) NOT NULL,
  function_verified_exactly boolean NOT NULL,
  function_verified_with_parent_go boolean NOT NULL,
  function_verified_with_child_go boolean NOT NULL,
  score numeric(10,2) NOT NULL,
  precision_score numeric(10,2) NOT NULL,
  CONSTRAINT gene_prediction_view_pkey PRIMARY KEY (id),
  CONSTRAINT gene_prediction_view_gene_ref_fkey FOREIGN KEY (gene_ref)
      REFERENCES sgd_annotations (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT gene_prediction_view_go_id_fkey FOREIGN KEY (go_id)
      REFERENCES go_terms (term) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT gene_prediction_view_gene_ref_key UNIQUE (gene_ref, go_id)
)
WITH (OIDS=FALSE);
ALTER TABLE gene_prediction_view OWNER TO postgres;

CREATE INDEX ix_gene_prediction_view_gene_ref
  ON gene_prediction_view
  USING btree
  (gene_ref);




Re: large tables and simple "= constant" queries using indexes

From
Arjen van der Meijden
Date:
First of all, there is the 'explain analyze' output, which is pretty
helpful in postgresql.

My guess is that postgresql decides to do a table scan for some reason. It
might not have enough statistics for this particular table or column to
make a sound decision. What you can try is to increase the statistics
target, which is pretty easy:
ALTER TABLE gene_prediction_view ALTER gene_ref SET STATISTICS 200;

Valid ranges are from 1(0?) - 1000, the default is 10, the default on my
systems is usually 100. For such a large table, I'd go with 200.

After that, you'll need to re-analyze your table and you can try again.
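
For reference, a minimal sketch of the whole sequence (table and column names
taken from your original post; the EXPLAIN ANALYZE at the end just verifies
which plan is chosen):

ALTER TABLE gene_prediction_view ALTER gene_ref SET STATISTICS 200;
ANALYZE gene_prediction_view;
EXPLAIN ANALYZE
SELECT count(*) FROM gene_prediction_view WHERE gene_ref = 523;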

Perhaps analyze should try to establish its own best guess as to how many
samples it should take? The default of 10 is rather limited for large
tables.

Best regards,

Arjen

On 9-4-2008 22:58 John Beaver wrote:
> Hi, I've started my first project with Postgres (after several years of
> using Mysql), and I'm having an odd performance problem that I was
> hoping someone might be able to explain the cause of.
>
> ----My query----
>    - select count(*) from gene_prediction_view where gene_ref = 523
>    - takes 26 seconds to execute, and returns 2400 (out of a total of 15
> million records in the table)
>
> ---My problem---
>    Using a single-column index to count 2400 records which are exactly
> one constant value doesn't sound like something that would take 26
> seconds. What's the slowdown? Any silver bullets that might fix this?
>
> ----Steps I've taken----
>    - I ran vacuum and analyze
>    - I upped the shared_buffers to 58384, and I upped some of the other
> postgresql.conf values as well. Nothing seemed to help significantly,
> but maybe I missed something that would help specifically for this query
> type?
>    - I tried to create a hash index, but gave up after more than 4 hours
> of waiting for it to finish indexing
>
> ----Table stats----
>    - 15 million rows; I'm expecting to have four or five times this
> number eventually.
>    - 1.5 gigs of hard drive usage
>
> ----My development environment---
>    - 2.6ghz dual-core MacBook Pro with 4 gigs of ram and a 7200 rpm hard
> drive
>    - OS X 10.5.2
>    - Postgres 8.3 (installed via MacPorts)
>
> ----My table----
>
> CREATE TABLE gene_prediction_view
> (
>  id serial NOT NULL,
>  gene_ref integer NOT NULL,
>  go_id integer NOT NULL,
>  go_description character varying(200) NOT NULL,
>  go_category character varying(50) NOT NULL,
>  function_verified_exactly boolean NOT NULL,
>  function_verified_with_parent_go boolean NOT NULL,
>  function_verified_with_child_go boolean NOT NULL,
>  score numeric(10,2) NOT NULL,
>  precision_score numeric(10,2) NOT NULL,
>  CONSTRAINT gene_prediction_view_pkey PRIMARY KEY (id),
>  CONSTRAINT gene_prediction_view_gene_ref_fkey FOREIGN KEY (gene_ref)
>      REFERENCES sgd_annotations (id) MATCH SIMPLE
>      ON UPDATE NO ACTION ON DELETE NO ACTION,
>  CONSTRAINT gene_prediction_view_go_id_fkey FOREIGN KEY (go_id)
>      REFERENCES go_terms (term) MATCH SIMPLE
>      ON UPDATE NO ACTION ON DELETE NO ACTION,
>  CONSTRAINT gene_prediction_view_gene_ref_key UNIQUE (gene_ref, go_id)
> )
> WITH (OIDS=FALSE);
> ALTER TABLE gene_prediction_view OWNER TO postgres;
>
> CREATE INDEX ix_gene_prediction_view_gene_ref
>  ON gene_prediction_view
>  USING btree
>  (gene_ref);
>
>
>
>

Re: large tables and simple "= constant" queries using indexes

From
Bill Moran
Date:
This is a FAQ; it comes up on an almost weekly basis.  Please do a
little Googling on count(*) and PostgreSQL and you'll get all the
explanations and suggestions on how to fix the problem you could
ever want.

In response to Arjen van der Meijden <acmmailing@tweakers.net>:

> First of all, there is the 'explain analyze' output, which is pretty
> helpful in postgresql.
>
> My guess is that postgresql decides to do a table scan for some reason. It
> might not have enough statistics for this particular table or column to
> make a sound decision. What you can try is to increase the statistics
> target, which is pretty easy:
> ALTER TABLE gene_prediction_view ALTER gene_ref SET STATISTICS 200;
>
> Valid ranges are from 1(0?) - 1000, the default is 10, the default on my
> systems is usually 100. For such a large table, I'd go with 200.
>
> After that, you'll need to re-analyze your table and you can try again.
>
> Perhaps analyze should try to establish its own best guess as to how many
> samples it should take? The default of 10 is rather limited for large
> tables.
>
> Best regards,
>
> Arjen
>
> On 9-4-2008 22:58 John Beaver wrote:
> > Hi, I've started my first project with Postgres (after several years of
> > using Mysql), and I'm having an odd performance problem that I was
> > hoping someone might be able to explain the cause of.
> >
> > ----My query----
> >    - select count(*) from gene_prediction_view where gene_ref = 523
> >    - takes 26 seconds to execute, and returns 2400 (out of a total of 15
> > million records in the table)
> >
> > ---My problem---
> >    Using a single-column index to count 2400 records which are exactly
> > one constant value doesn't sound like something that would take 26
> > seconds. What's the slowdown? Any silver bullets that might fix this?
> >
> > ----Steps I've taken----
> >    - I ran vacuum and analyze
> >    - I upped the shared_buffers to 58384, and I upped some of the other
> > postgresql.conf values as well. Nothing seemed to help significantly,
> > but maybe I missed something that would help specifically for this query
> > type?
> >    - I tried to create a hash index, but gave up after more than 4 hours
> > of waiting for it to finish indexing
> >
> > ----Table stats----
> >    - 15 million rows; I'm expecting to have four or five times this
> > number eventually.
> >    - 1.5 gigs of hard drive usage
> >
> > ----My development environment---
> >    - 2.6ghz dual-core MacBook Pro with 4 gigs of ram and a 7200 rpm hard
> > drive
> >    - OS X 10.5.2
> >    - Postgres 8.3 (installed via MacPorts)
> >
> > ----My table----
> >
> > CREATE TABLE gene_prediction_view
> > (
> >  id serial NOT NULL,
> >  gene_ref integer NOT NULL,
> >  go_id integer NOT NULL,
> >  go_description character varying(200) NOT NULL,
> >  go_category character varying(50) NOT NULL,
> >  function_verified_exactly boolean NOT NULL,
> >  function_verified_with_parent_go boolean NOT NULL,
> >  function_verified_with_child_go boolean NOT NULL,
> >  score numeric(10,2) NOT NULL,
> >  precision_score numeric(10,2) NOT NULL,
> >  CONSTRAINT gene_prediction_view_pkey PRIMARY KEY (id),
> >  CONSTRAINT gene_prediction_view_gene_ref_fkey FOREIGN KEY (gene_ref)
> >      REFERENCES sgd_annotations (id) MATCH SIMPLE
> >      ON UPDATE NO ACTION ON DELETE NO ACTION,
> >  CONSTRAINT gene_prediction_view_go_id_fkey FOREIGN KEY (go_id)
> >      REFERENCES go_terms (term) MATCH SIMPLE
> >      ON UPDATE NO ACTION ON DELETE NO ACTION,
> >  CONSTRAINT gene_prediction_view_gene_ref_key UNIQUE (gene_ref, go_id)
> > )
> > WITH (OIDS=FALSE);
> > ALTER TABLE gene_prediction_view OWNER TO postgres;
> >
> > CREATE INDEX ix_gene_prediction_view_gene_ref
> >  ON gene_prediction_view
> >  USING btree
> >  (gene_ref);
> >
> >
> >
> >
>


--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023


Re: large tables and simple "= constant" queries using indexes

From
PFC
Date:
> Hi, I've started my first project with Postgres (after several years of
> using Mysql), and I'm having an odd performance problem that I was
> hoping someone might be able to explain the cause of.
>
> ----My query----
>     - select count(*) from gene_prediction_view where gene_ref = 523
>     - takes 26 seconds to execute, and returns 2400 (out of a total of
> 15 million records in the table)
>  ---My problem---
>     Using a single-column index to count 2400 records which are exactly
> one constant value doesn't sound like something that would take 26
> seconds. What's the slowdown? Any silver bullets that might fix this?

    * Please post an EXPLAIN ANALYZE of your query, which will let us choose
between these two options:
    - If Postgres uses a bad plan (like a seq scan), you need to up the
statistics for this column.
    - If you get the correct plan (index scan or bitmap index scan), then it
is likely that postgres does one disk seek per row that has to be counted.
26 seconds for 2400 rows would be consistent with a 10ms seek time. The
unmistakable sign is that re-running the query will result in a very fast
runtime (I'd say a couple ms for counting 2400 rows if no disk IO is
involved).
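
    A quick way to check this (just a sketch, assuming you run it from psql):

\timing on
SELECT count(*) FROM gene_prediction_view WHERE gene_ref = 523;
-- run the same statement again immediately; a drop from ~26 s to a few ms
-- means the first run was dominated by disk seeks, not by planning
SELECT count(*) FROM gene_prediction_view WHERE gene_ref = 523;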


Re: large tables and simple "= constant" queries using indexes

From
Jeremy Harris
Date:
Bill Moran wrote:
> This is a FAQ; it comes up on an almost weekly basis.

I don't think so. Note the "where" clause - this isn't the usual
unqualified count(*) question.

>>>    - select count(*) from gene_prediction_view where gene_ref = 523

Cheers,
    Jeremy


Re: large tables and simple "= constant" queries using indexes

From
Arjen van der Meijden
Date:
Hi John,

You don't use the same 'gene_ref ='-value, so it's not a perfect
comparison. And obviously, there is the fact that the data can be in the
disk cache the second time you run it, which would explain the almost
instantaneous result for the second query.

If repeating the query a few times with 200 still makes it do its work
in 15 seconds and with 800 in less than 100ms, then you might have found
a bug, or it is at least something I don't know how to fix.

I doubt upping the default for all tables to 1000 is a good idea. The
data collected is used in the query-planning-stage, where more data
means more processing time. Obviously there is a tradeoff somewhere
between having more statistics and thus being able to plan the query
better versus requiring more time to process those statistics.

Best regards,

Arjen

On 10-4-2008 0:24 John Beaver wrote:
> Perfect - thanks Arjen. Using your value of 200 decreased the time to 15
> seconds, and using a value of 800 makes it almost instantaneous. I'm
> really not concerned about space usage; if having more statistics
> increases performance this much, maybe I'll just default it to 1000?
>
> Strangely, the steps taken in the explain analyze are all the same. The
> only differences are the predicted costs (and execution times).
>
> explain analyze for a statistics of 200:
> "Aggregate  (cost=8831.27..8831.28 rows=1 width=0) (actual
> time=15198.407..15198.408 rows=1 loops=1)"
> "  ->  Bitmap Heap Scan on gene_prediction_view  (cost=44.16..8825.29
> rows=2392 width=0) (actual time=19.719..15191.875 rows=2455 loops=1)"
> "        Recheck Cond: (gene_ref = 500)"
> "        ->  Bitmap Index Scan on ix_gene_prediction_view_gene_ref
> (cost=0.00..43.56 rows=2392 width=0) (actual time=18.871..18.871
> rows=2455 loops=1)"
> "              Index Cond: (gene_ref = 500)"
> "Total runtime: 15198.651 ms"
>
> explain analyze for a statistics of 800:
> "Aggregate  (cost=8873.75..8873.76 rows=1 width=0) (actual
> time=94.473..94.473 rows=1 loops=1)"
> "  ->  Bitmap Heap Scan on gene_prediction_view  (cost=44.25..8867.74
> rows=2404 width=0) (actual time=39.358..93.733 rows=2455 loops=1)"
> "        Recheck Cond: (gene_ref = 301)"
> "        ->  Bitmap Index Scan on ix_gene_prediction_view_gene_ref
> (cost=0.00..43.65 rows=2404 width=0) (actual time=38.472..38.472
> rows=2455 loops=1)"
> "              Index Cond: (gene_ref = 301)"
> "Total runtime: 94.622 ms"
>
>
>
>
> Arjen van der Meijden wrote:
>> First of all, there is the 'explain analyze' output, which is pretty
>> helpful in postgresql.
>>
>> My guess is that postgresql decides to do a table scan for some reason. It
>> might not have enough statistics for this particular table or column
>> to make a sound decision. What you can try is to increase the
>> statistics target, which is pretty easy:
>> ALTER TABLE gene_prediction_view ALTER gene_ref SET STATISTICS 200;
>>
>> Valid ranges are from 1(0?) - 1000, the default is 10, the default on
>> my systems is usually 100. For such a large table, I'd go with 200.
>>
>> After that, you'll need to re-analyze your table and you can try again.
>>
>> Perhaps analyze should try to establish its own best guess as to how many
>> samples it should take? The default of 10 is rather limited for large
>> tables.
>>
>> Best regards,
>>
>> Arjen
>>
>> On 9-4-2008 22:58 John Beaver wrote:
>>> Hi, I've started my first project with Postgres (after several years
>>> of using Mysql), and I'm having an odd performance problem that I was
>>> hoping someone might be able to explain the cause of.
>>>
>>> ----My query----
>>>    - select count(*) from gene_prediction_view where gene_ref = 523
>>>    - takes 26 seconds to execute, and returns 2400 (out of a total of
>>> 15 million records in the table)
>>>
>>> ---My problem---
>>>    Using a single-column index to count 2400 records which are
>>> exactly one constant value doesn't sound like something that would
>>> take 26 seconds. What's the slowdown? Any silver bullets that might
>>> fix this?
>>>
>>> ----Steps I've taken----
>>>    - I ran vacuum and analyze
>>>    - I upped the shared_buffers to 58384, and I upped some of the
>>> other postgresql.conf values as well. Nothing seemed to help
>>> significantly, but maybe I missed something that would help
>>> specifically for this query type?
>>>    - I tried to create a hash index, but gave up after more than 4
>>> hours of waiting for it to finish indexing
>>>
>>> ----Table stats----
>>>    - 15 million rows; I'm expecting to have four or five times this
>>> number eventually.
>>>    - 1.5 gigs of hard drive usage
>>>
>>> ----My development environment---
>>>    - 2.6ghz dual-core MacBook Pro with 4 gigs of ram and a 7200 rpm
>>> hard drive
>>>    - OS X 10.5.2
>>>    - Postgres 8.3 (installed via MacPorts)
>>>
>>> ----My table----
>>>
>>> CREATE TABLE gene_prediction_view
>>> (
>>>  id serial NOT NULL,
>>>  gene_ref integer NOT NULL,
>>>  go_id integer NOT NULL,
>>>  go_description character varying(200) NOT NULL,
>>>  go_category character varying(50) NOT NULL,
>>>  function_verified_exactly boolean NOT NULL,
>>>  function_verified_with_parent_go boolean NOT NULL,
>>>  function_verified_with_child_go boolean NOT NULL,
>>>  score numeric(10,2) NOT NULL,
>>>  precision_score numeric(10,2) NOT NULL,
>>>  CONSTRAINT gene_prediction_view_pkey PRIMARY KEY (id),
>>>  CONSTRAINT gene_prediction_view_gene_ref_fkey FOREIGN KEY (gene_ref)
>>>      REFERENCES sgd_annotations (id) MATCH SIMPLE
>>>      ON UPDATE NO ACTION ON DELETE NO ACTION,
>>>  CONSTRAINT gene_prediction_view_go_id_fkey FOREIGN KEY (go_id)
>>>      REFERENCES go_terms (term) MATCH SIMPLE
>>>      ON UPDATE NO ACTION ON DELETE NO ACTION,
>>>  CONSTRAINT gene_prediction_view_gene_ref_key UNIQUE (gene_ref, go_id)
>>> )
>>> WITH (OIDS=FALSE);
>>> ALTER TABLE gene_prediction_view OWNER TO postgres;
>>>
>>> CREATE INDEX ix_gene_prediction_view_gene_ref
>>>  ON gene_prediction_view
>>>  USING btree
>>>  (gene_ref);
>>>
>>>
>>>
>>>
>>
>

Re: large tables and simple "= constant" queries using indexes

From
PFC
Date:
>> Perfect - thanks Arjen. Using your value of 200 decreased the time to
>> 15 seconds, and using a value of 800 makes it almost instantaneous. I'm
>> really not concerned about space usage; if having more statistics
>> increases performance this much, maybe I'll just default it to 1000?
>>  Strangely, the steps taken in the explain analyze are all the same.
>> The only differences are the predicted costs (and execution times).
>>  explain analyze for a statistics of 200:


    Actually, since you got the exact same plans and the second one is a lot
faster, this can mean that the data is in the disk cache, or that the
second query has all the rows it needs contiguous on disk whereas the
first one has its rows all over the place. Therefore you are IO-bound.
Statistics helped, perhaps (impossible to know since you don't provide the
plan with statistics set to 10), but your main problem is IO.
    Usually setting the statistics to 100 is enough...

    Now, here are some solutions to your problem in random order :

    - Install 64 bit Linux, 64 bit Postgres, and get lots of RAM, lol.
    - Switch to a RAID10 (4 times the IOs per second, however zero gain if
you're single-threaded, but massive gain when concurrent)

    - If you just need a count by gene_ref, a simple solution is to keep it
in a separate table and update it via triggers. This is a frequently used
solution, and it works well unless gene_ref is updated all the time (which
is probably not your case). Since you will be vacuuming this count-cache
table often, don't put the count as a field in your sgd_annotations table;
just create a small table with 2 fields, gene_ref and count (unless you
want to use the count for other things and you don't like the join).
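
    A rough sketch of that count-cache (names here are illustrative, not from
your schema; it assumes rows are only inserted or deleted, never updated, and
you may need CREATE LANGUAGE plpgsql; first if the language isn't installed):

CREATE TABLE gene_ref_counts (
    gene_ref integer PRIMARY KEY,
    n        bigint  NOT NULL DEFAULT 0
);

CREATE OR REPLACE FUNCTION maintain_gene_ref_counts() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE gene_ref_counts SET n = n + 1 WHERE gene_ref = NEW.gene_ref;
        IF NOT FOUND THEN
            INSERT INTO gene_ref_counts (gene_ref, n) VALUES (NEW.gene_ref, 1);
        END IF;
        RETURN NEW;
    ELSE  -- DELETE
        UPDATE gene_ref_counts SET n = n - 1 WHERE gene_ref = OLD.gene_ref;
        RETURN OLD;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER gene_prediction_view_count_trg
    AFTER INSERT OR DELETE ON gene_prediction_view
    FOR EACH ROW EXECUTE PROCEDURE maintain_gene_ref_counts();

    Then SELECT n FROM gene_ref_counts WHERE gene_ref = 523 replaces the
count(*). Under heavy concurrent inserts the counter rows become a contention
point, but for a mostly read-only table like this that is not an issue.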

    From your table definition gene_ref references another table. It would
seem that you have many rows in gene_prediction_view with the same
gene_ref value.

    - If you often query rows with the same gene_ref, consider using CLUSTER
to physically group those rows on disk. This way you can get all rows with
the same gene_ref in 1 seek instead of 2000. Clustered tables also make
Bitmap scan happy.
    This one is good since it can also speed up other queries (not just the
count).
    You could also cluster on (gene_ref,go_id) perhaps, I don't know what
your columns mean. Only you can decide that because clustering order has
to be meaningful (to group rows according to something that makes sense
and not at random).

    * Lose some weight :

CREATE INDEX ix_gene_prediction_view_gene_ref
  ON gene_prediction_view
  USING btree
  (gene_ref);

    - This index is useless since you have a UNIQUE constraint on (gene_ref,
go_id), which is also an index.
    Remove the index on (gene_ref); it will leave space in the disk cache for
other things.

    - Since (gene_ref, go_id) is UNIQUE NOT NULL, you might be able to use
that as your primary key, but only if it is never updated of course. Saves
another index.
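
    Something like this, roughly (keep the surrogate id primary key if
anything else refers to it):

DROP INDEX ix_gene_prediction_view_gene_ref;

-- optional, and only if nothing depends on gene_prediction_view.id:
-- ALTER TABLE gene_prediction_view DROP CONSTRAINT gene_prediction_view_pkey;
-- ALTER TABLE gene_prediction_view
--     ADD CONSTRAINT gene_prediction_view_pkey PRIMARY KEY (gene_ref, go_id);
-- (the UNIQUE constraint gene_prediction_view_gene_ref_key then becomes
--  redundant and can be dropped as well)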

    - If you often do queries that fetch many rows, but seldom fetch the
description, tell PG to store the description out of line in compressed
form (read the docs on ALTER TABLE ... SET STORAGE ..., I forgot the
syntax). Point being to make the main table smaller.
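
    For reference, the syntax being referred to looks like this (a sketch;
note that EXTENDED, i.e. out-of-line and compressed, is already the default
for varchar, and TOAST only moves values past a size threshold, so the gain
on a varchar(200) column may be small):

ALTER TABLE gene_prediction_view
    ALTER COLUMN go_description SET STORAGE EXTENDED;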

    - Also I see a category as VARCHAR. If you have a million different
categories, that's OK, but if you have 100 categories for your 15M rows,
put them in a separate table and replace the column with a category_id
(normalize!)
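
    An illustrative migration for that (the go_categories table and
go_category_id column are made-up names, not part of your schema):

CREATE TABLE go_categories (
    id   serial PRIMARY KEY,
    name character varying(50) NOT NULL UNIQUE
);

INSERT INTO go_categories (name)
    SELECT DISTINCT go_category FROM gene_prediction_view;

ALTER TABLE gene_prediction_view ADD COLUMN go_category_id integer;

UPDATE gene_prediction_view
   SET go_category_id = go_categories.id
  FROM go_categories
 WHERE go_categories.name = gene_prediction_view.go_category;

ALTER TABLE gene_prediction_view
    ALTER COLUMN go_category_id SET NOT NULL,
    ADD FOREIGN KEY (go_category_id) REFERENCES go_categories (id);

ALTER TABLE gene_prediction_view DROP COLUMN go_category;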


Re: large tables and simple "= constant" queries using indexes

From
Matthew
Date:
On Thu, 10 Apr 2008, PFC wrote:

... Lots of useful advice ...

>     - If you often query rows with the same gene_ref, consider using
> CLUSTER to physically group those rows on disk. This way you can get all rows
> with the same gene_ref in 1 seek instead of 2000. Clustered tables also make
> Bitmap scan happy.

In my opinion this is the one that will make the most difference. You will
need to run:

CLUSTER gene_prediction_view USING gene_prediction_view_gene_ref_key;

after you insert significant amounts of data into the table. This
re-orders the table according to the index, but new data is always written
out of order, so after adding lots more data the table will need to be
re-clustered again.
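
It's also worth running ANALYZE after the CLUSTER, so the planner's statistics
(in particular the physical-ordering "correlation" figure) reflect the newly
ordered table; roughly:

CLUSTER gene_prediction_view USING gene_prediction_view_gene_ref_key;
ANALYZE gene_prediction_view;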

> - Switch to a RAID10 (4 times the IOs per second, however zero gain if
> you're single-threaded, but massive gain when concurrent)

Greg Stark has a patch in the pipeline that will change this, for bitmap
index scans, by using fadvise(), so a single thread can utilise multiple
discs in a RAID array.

Matthew

--
Prolog doesn't have enough parentheses. -- Computer Science Lecturer

Re: large tables and simple "= constant" queries using indexes

From
John Beaver
Date:
Thanks a lot, all of you - this is excellent advice. With the data
clustered and statistics at a more reasonable value of 100, it now
reproducibly takes even less time - 20-57 ms per query.

After reading the section on "Statistics Used By the Planner" in the
manual, I was a little concerned that, while the statistics sped up the
queries that I tried immeasurably, that the most_common_vals array was
where the speedup was happening, and that the values which wouldn't fit
in this array wouldn't be sped up. Though I couldn't offhand find an
example where this occurred, the clustering approach seems intuitively
like a much more complete and scalable solution, at least for a
read-only table like this.

As to whether the entire index/table was getting into ram between my
statistics calls, I don't think this was the case. Here's the behavior
that I found:
- With statistics at 10, the query took 25 (or so) seconds no matter how
many times I tried different values. The query plan was the same as for
the 200 and 800 statistics below.
- Trying the same constant a second time gave an instantaneous result,
I'm guessing because of query/result caching.
- Immediately on increasing the statistics to 200, the query reproducibly
took less time. I tried about 10 different values.
- Immediately on increasing the statistics to 800, the query
reproducibly took less than a second every time. I tried about 30
different values.
- Decreasing the statistics to 100 and running the cluster command
brought it to 57 ms per query.
- The Activity Monitor (OSX) lists the relevant postgres process as
taking a little less than 500 megs.
- I didn't try decreasing the statistics back to 10 before I ran the
cluster command, so I can't show the search times going up because of
that. But I tried killing the 500 meg process. The new process uses less
than 5 megs of ram, and still reproducibly returns a result in less than
60 ms. Again, this is with a statistics value of 100 and the data
clustered by gene_prediction_view_gene_ref_key.

And I'll consider the idea of using triggers with an ancillary table for
other purposes; seems like it could be a useful solution for something.

Matthew wrote:
> On Thu, 10 Apr 2008, PFC wrote:
>
> ... Lots of useful advice ...
>
>>     - If you often query rows with the same gene_ref, consider using
>> CLUSTER to physically group those rows on disk. This way you can get
>> all rows with the same gene_ref in 1 seek instead of 2000. Clustered
>> tables also make Bitmap scan happy.
>
> In my opinion this is the one that will make the most difference. You
> will need to run:
>
> CLUSTER gene_prediction_view USING gene_prediction_view_gene_ref_key;
>
> after you insert significant amounts of data into the table. This
> re-orders the table according to the index, but new data is always
> written out of order, so after adding lots more data the table will
> need to be re-clustered again.
>
>> - Switch to a RAID10 (4 times the IOs per second, however zero gain
>> if you're single-threaded, but massive gain when concurrent)
>
> Greg Stark has a patch in the pipeline that will change this, for
> bitmap index scans, by using fadvise(), so a single thread can utilise
> multiple discs in a RAID array.
>
> Matthew
>

Re: large tables and simple "= constant" queries using indexes

From
Erik Jones
Date:
On Apr 10, 2008, at 9:44 AM, John Beaver wrote:
> Thanks a lot, all of you - this is excellent advice. With the data
> clustered and statistics at a more reasonable value of 100, it now
> reproducibly takes even less time - 20-57 ms per query.
>
> After reading the section on "Statistics Used By the Planner" in the
> manual, I was a little concerned that, while the statistics sped up
> the queries that I tried immeasurably, that the most_common_vals
> array was where the speedup was happening, and that the values which
> wouldn't fit in this array wouldn't be sped up. Though I couldn't
> offhand find an example where this occurred, the clustering approach
> seems intuitively like a much more complete and scalable solution,
> at least for a read-only table like this.
>
> As to whether the entire index/table was getting into ram between my
> statistics calls, I don't think this was the case. Here's the
> behavior that I found:
> - With statistics at 10, the query took 25 (or so) seconds no matter
> how many times I tried different values. The query plan was the same
> as for the 200 and 800 statistics below.
> - Trying the same constant a second time gave an instantaneous
> result, I'm guessing because of query/result caching.
> - Immediately on increasing the statistics to 200, the query reproducibly
> took less time. I tried about 10 different values.
> - Immediately on increasing the statistics to 800, the query
> reproducibly took less than a second every time. I tried about 30
> different values.
> - Decreasing the statistics to 100 and running the cluster command
> brought it to 57 ms per query.
> - The Activity Monitor (OSX) lists the relevant postgres process as
> taking a little less than 500 megs.
> - I didn't try decreasing the statistics back to 10 before I ran the
> cluster command, so I can't show the search times going up because
> of that. But I tried killing the 500 meg process. The new process
> uses less than 5 megs of ram, and still reproducibly returns a
> result in less than 60 ms. Again, this is with a statistics value of
> 100 and the data clustered by gene_prediction_view_gene_ref_key.
>
> And I'll consider the idea of using triggers with an ancillary table
> for other purposes; seems like it could be a useful solution for
> something.

FWIW, killing the backend process responsible for the query won't
necessarily clear the table's data from memory as that will be in the
shared_buffers.  If you really want to flush the data from memory you
need to read in data from other tables of a total size greater
than your shared_buffers setting.

Erik Jones

DBA | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com




Re: large tables and simple "= constant" queries using indexes

From
John Beaver
Date:
Thanks Erik and Gaetano - interesting, and both examples of my naiveté. :)

I tried running large select(*) queries on other tables followed by another
try at the offending query, and it was still fast. Just to be absolutely sure
this is a scalable solution, I'll try restarting my computer in a few hours
to see if it affects anything cache-wise.


Gaetano Mendola wrote:
> John Beaver wrote:
>
>> - Trying the same constant a second time gave an instantaneous result,
>> I'm guessing because of query/result caching.
>
> AFAIK no query/result caching is in place in postgres, what you are
> experiencing is OS disk/memory caching.
>
>
> Regards
> Gaetano Mendola

Re: large tables and simple "= constant" queries using indexes

From
Bill Moran
Date:
In response to John Beaver <john.e.beaver@gmail.com>:

> Thanks Erik and Gaetano - interesting, and both examples of my naiveté. :)
>
> I tried running large select(*) queries on other tables followed by another
> try at the offending query, and it was still fast. Just to be absolutely
> sure this is a scalable solution, I'll try restarting my computer in a few
> hours to see if it affects anything cache-wise.

I say this over and over again ... because I think it's really cool and
useful.

If you install the pg_buffercache addon, you can actually look into
PostgreSQL's internals and see what tables are in the buffer in real
time.  If you're having trouble, it can (potentially) be a helpful
tool.
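
For example, something along the lines of the query in the contrib docs
(a sketch; it needs the pg_buffercache module installed and superuser access):

SELECT c.relname, count(*) AS buffers
  FROM pg_class c
       INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
       INNER JOIN pg_database d ON (b.reldatabase = d.oid
                                    AND d.datname = current_database())
 GROUP BY c.relname
 ORDER BY 2 DESC
 LIMIT 10;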

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023

Re: large tables and simple "= constant" queries using indexes

From
Greg Smith
Date:
On Thu, 10 Apr 2008, Bill Moran wrote:

> If you install the pg_buffercache addon, you can actually look into
> PostgreSQL's internals and see what tables are in the buffer in real
> time.

The "Inside the PostgreSQL Buffer Cache" talk I did at the recent East
conference is now on-line at
http://www.westnet.com/~gsmith/content/postgresql/

The slides explain how that information gets updated and used internally,
and the separate "sample queries" file there shows some more complicated
views I've written against pg_buffercache.  Here's a sample one:

relname       | buffered | buffers % | % of rel
accounts      | 306 MB   | 65.3      | 24.7
accounts_pkey | 160 MB   | 34.1      | 93.2

This shows that 65.3% of the buffer cache is filled with the accounts
table, which is caching 24.7% of the full table.  These are labeled
"relations" because there's a mix of table and index data there.
accounts_pkey is an index for example, which is why almost all of it is
staying inside the buffer cache.

The queries that use usage_count only work against 8.3; the one above
should work on older versions as well.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: large tables and simple "= constant" queries using indexes

From
PFC
Date:
> Thanks a lot, all of you - this is excellent advice. With the data
> clustered and statistics at a more reasonable value of 100, it now
> reproducibly takes even less time - 20-57 ms per query.

    1000x speedup with proper tuning - always impressive, lol.
    IO seeks are always your worst enemy.

> After reading the section on "Statistics Used By the Planner" in the
> manual, I was a little concerned that, while the statistics sped up the
> queries that I tried immeasurably, that the most_common_vals array was
> where the speedup was happening, and that the values which wouldn't fit
> in this array wouldn't be sped up. Though I couldn't offhand find an
> example where this occurred, the clustering approach seems intuitively
> like a much more complete and scalable solution, at least for a
> read-only table like this.

    Actually, with statistics set to 100, 100 values will be stored in
most_common_vals. This would mean that the values not in most_common_vals
will have less than 1% frequency, and probably much less than that. The
choice of plan for these rare values is pretty simple.
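
    You can look at what the planner actually has for that column via the
pg_stats system view, for example:

SELECT n_distinct, most_common_vals, most_common_freqs
  FROM pg_stats
 WHERE tablename = 'gene_prediction_view'
   AND attname = 'gene_ref';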

    With two columns, "interesting" stuff can happen, like if you have col1
in [1...10] and col2 in [1...10] and use a condition on col1=const and
col2=const, the selectivity of the result depends not only on the
distribution of col1 and col2 but also their correlation.

    As for the tests you did, it's hard to say without seeing the explain
analyze outputs. If you change the stats and the plan choice (EXPLAIN)
stays the same, and you use the same values in your query, any difference
in timing comes from caching, since postgres is executing the same plan
and therefore doing the exact same thing. Caching (from PG and from the
OS) can make the timings vary a lot.

> - Trying the same constant a second time gave an instantaneous result,
> I'm guessing because of query/result caching.

    PG does not cache queries or results. It caches data & index pages in its
shared buffers, and then the OS adds another layer of the usual disk cache.
    A simple query like selecting one row based on PK takes about 60
microseconds of CPU time, but if it needs one seek for the index and one
for the data it may take 20 ms waiting for the moving parts to move...
Hence, CLUSTER is a very useful tool.

    Bitmap index scans love clustered tables because all the interesting rows
end up being grouped together, so much less pages need to be visited.

> - I didn't try decreasing the statistics back to 10 before I ran the
> cluster command, so I can't show the search times going up because of
> that. But I tried killing the 500 meg process. The new process uses less
> than 5 megs of ram, and still reproducibly returns a result in less than
> 60 ms. Again, this is with a statistics value of 100 and the data
> clustered by gene_prediction_view_gene_ref_key.

    Killing it or just restarting postgres ?
    If you let postgres run (not idle) for a while, naturally it will fill
the RAM up to the shared_buffers setting that you specified in the
configuration file. This is good, since grabbing data from postgres' own
cache is faster than having to make a syscall to the OS to get it from the
OS disk cache (or disk). This isn't bloat.
    But what those 500 MB versus 6 MB show is that before, postgres had to
read a lot of data for your query, so it stayed in the cache ; after
tuning it needs to read much less data (thanks to CLUSTER) so the cache
stays empty.


Re: large tables and simple "= constant" queries using indexes

From
Gaetano Mendola
Date:
John Beaver wrote:

> - Trying the same constant a second time gave an instantaneous result,
> I'm guessing because of query/result caching.

AFAIK no query/result caching is in place in postgres, what you are experiencing
is OS disk/memory caching.


Regards
Gaetano Mendola