Re: large tables and simple "= constant" queries using indexes - Mailing list pgsql-performance

From PFC
Subject Re: large tables and simple "= constant" queries using indexes
Msg-id op.t9dy1aqwcigqcu@apollo13.peufeu.com
In response to Re: large tables and simple "= constant" queries using indexes  (Arjen van der Meijden <acmmailing@tweakers.net>)
Responses Re: large tables and simple "= constant" queries using indexes  (Matthew <matthew@flymine.org>)
List pgsql-performance
>> Perfect - thanks Arjen. Using your value of 200 decreased the time to
>> 15 seconds, and using a value of 800 makes it almost instantaneous. I'm
>> really not concerned about space usage; if having more statistics
>> increases performance this much, maybe I'll just default it to 1000?
>>  Strangely, the steps taken in the explain analyze are all the same.
>> The only differences are the predicted costs (and execution times).
>>  explain analyze for a statistics of 200:


    Actually, since you got the exact same plans and the second one is a lot
faster, this can mean that the data is in the disk cache, or that the
second query has all the rows it needs contiguous on disk whereas the
first one has its rows all over the place. Therefore you are IO-bound.
Statistics helped, perhaps (impossible to know, since you don't provide the
plan with statistics set to 10), but your main problem is IO.
    Usually setting the statistics target to 100 is enough...
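
    For reference, the per-column statistics target is set like this (a
sketch, assuming the column you filter on is gene_ref):

ALTER TABLE gene_prediction_view ALTER COLUMN gene_ref SET STATISTICS 100;
ANALYZE gene_prediction_view;  -- re-collect stats so the new target takes effect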

    Now, here are some solutions to your problem, in random order:

    - Install 64-bit Linux, 64-bit Postgres, and get lots of RAM, lol.
    - Switch to RAID10 (4 times the I/Os per second; zero gain if you're
single-threaded, but a massive gain under concurrent load).

    - If you just need a count by gene_ref, a simple solution is to keep the
counts in a separate table and update them via triggers. This is a
frequently used pattern; it works well unless gene_ref is updated all the
time (which is probably not your case). Since you will be vacuuming this
count-cache table often, don't put the count as a field in your
sgd_annotations table; just create a small table with 2 fields, gene_ref
and count (unless you want to use the count for other things and you don't
like the join). A minimal sketch follows below.

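    Something along these lines (a sketch; the gene_ref_counts table, the
function name, and the trigger name are made up, and it ignores UPDATEs of
gene_ref since those should be rare):

CREATE TABLE gene_ref_counts (
    gene_ref integer PRIMARY KEY,
    cnt      bigint NOT NULL
);

-- Seed the cache from the existing data.
INSERT INTO gene_ref_counts (gene_ref, cnt)
    SELECT gene_ref, count(*) FROM gene_prediction_view GROUP BY gene_ref;

CREATE FUNCTION update_gene_ref_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE gene_ref_counts SET cnt = cnt + 1 WHERE gene_ref = NEW.gene_ref;
        IF NOT FOUND THEN
            -- first row for this gene_ref (not race-proof under heavy concurrency)
            INSERT INTO gene_ref_counts (gene_ref, cnt) VALUES (NEW.gene_ref, 1);
        END IF;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE gene_ref_counts SET cnt = cnt - 1 WHERE gene_ref = OLD.gene_ref;
    END IF;
    RETURN NULL;  -- the return value of an AFTER trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_gene_ref_count
    AFTER INSERT OR DELETE ON gene_prediction_view
    FOR EACH ROW EXECUTE PROCEDURE update_gene_ref_count();
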
    From your table definition, gene_ref references another table. It would
seem that you have many rows in gene_prediction_view with the same
gene_ref value.

    - If you often query rows with the same gene_ref, consider using CLUSTER
to physically group those rows on disk. This way you can get all the rows
with the same gene_ref in 1 seek instead of 2000. Clustered tables also
make bitmap scans happy.
    This one is good since it can also speed up other queries (not just the
count).
    You could perhaps also cluster on (gene_ref, go_id); I don't know what
your columns mean, so only you can decide. The clustering order has to be
meaningful (grouping rows according to something that makes sense, not at
random).
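
    For example (a sketch; the index name is a guess, check \di for the
real name of the index backing your UNIQUE constraint):

CLUSTER gene_prediction_view USING gene_prediction_view_gene_ref_key;
ANALYZE gene_prediction_view;

    Note that CLUSTER takes an exclusive lock and is a one-shot operation:
rows inserted later are not kept in clustered order, so you may need to
re-run it periodically.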

    * Lose some weight:

CREATE INDEX ix_gene_prediction_view_gene_ref
  ON gene_prediction_view
  USING btree
  (gene_ref);

    - This index is useless since you have a UNIQUE constraint on (gene_ref,
go_id), which is also an index.
    Remove the index on (gene_ref); that will leave space in the disk cache
for other things.

    - Since (gene_ref, go_id) is UNIQUE NOT NULL, you might be able to use
that as your primary key, but only if it is never updated, of course. That
saves another index.
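
    Concretely (a sketch; whether this works depends on your schema, check
\d gene_prediction_view for the real constraint names):

DROP INDEX ix_gene_prediction_view_gene_ref;
-- after dropping the existing surrogate primary key, if there is one:
ALTER TABLE gene_prediction_view ADD PRIMARY KEY (gene_ref, go_id);
-- then drop the now-redundant UNIQUE constraint so you actually save an index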

    - If you often run queries that fetch many rows but seldom fetch the
description, tell PG to always store the description in out-of-line
compressed form (read the docs on ALTER TABLE ... SET STORAGE ..., I
forgot the syntax). The point is to make the main table smaller.
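
    For the record, it is something like this (a sketch, assuming the column
is named description; EXTENDED means compress, and move into TOAST storage
when the row gets large, which is already the default for text/varchar
columns):

ALTER TABLE gene_prediction_view ALTER COLUMN description SET STORAGE EXTENDED;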

    - Also, I see category is a VARCHAR. If you have a million different
categories, that's OK, but if you have 100 categories for your 15M rows,
put them in a separate table and replace the column with a category_id
(normalize!)
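
    Something like this (a sketch with made-up table and column names):

CREATE TABLE categories (
    category_id serial PRIMARY KEY,
    name        varchar NOT NULL UNIQUE
);

-- Populate the lookup table from the existing values.
INSERT INTO categories (name)
    SELECT DISTINCT category FROM gene_prediction_view;

ALTER TABLE gene_prediction_view ADD COLUMN category_id integer
    REFERENCES categories;

UPDATE gene_prediction_view g
    SET category_id = c.category_id
    FROM categories c
    WHERE c.name = g.category;

ALTER TABLE gene_prediction_view DROP COLUMN category;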

