Re: speeding up a query on a large table - Mailing list pgsql-general

From Manfred Koizar
Subject Re: speeding up a query on a large table
Date
Msg-id nc77g1tfn1m5urhqphjtkatudd3o1nll6n@4ax.com
In response to speeding up a query on a large table  (Kevin Murphy <murphy@genome.chop.edu>)
Responses Re: speeding up a query on a large table  (Mike Rylander <mrylander@gmail.com>)
List pgsql-general
On Mon, 25 Jul 2005 17:50:55 -0400, Kevin Murphy
<murphy@genome.chop.edu> wrote:
> and because the number of possible search terms is so large, it
>would be nice if the entire index could somehow be preloaded into memory
>and encouraged to stay there.

Postgres does not have such a feature, and I wouldn't recommend messing
around inside Postgres.  You could try copying the relevant index
file(s) to /dev/null to populate the OS cache ...
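For what it's worth, a rough sketch of that trick.  The path below is
hypothetical -- map tg_mention_idx to its on-disk files via the
relfilenode column of pg_class under $PGDATA/base/<database-oid>/:

```shell
# Read every file in a directory sequentially so the OS pulls the
# pages into its disk cache.  Directory and file names are examples
# only -- look up the index's relfilenode in pg_class first.
warm_cache() {
    dir="$1"
    for f in "$dir"/*; do
        [ -f "$f" ] && cat "$f" > /dev/null
    done
}

# e.g.  warm_cache /usr/local/pgsql/data/base/16384
```

Whether the pages then *stay* cached is up to the OS, of course.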

>There are 1.3M unique values in that column.

That'd mean that each value occurs about 10 times on average.  In your
tests the planner consistently estimates 81, and the real numbers are
even higher.  Can this be explained by the nature of the data
distribution?
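You can see what the planner believes about the distribution by looking
at pg_stats after an ANALYZE (standard system view, nothing exotic):

```sql
-- n_distinct is the planner's estimate of distinct values;
-- most_common_vals/freqs show whether a few heavy hitters
-- dominate the column.
SELECT n_distinct, most_common_vals, most_common_freqs
  FROM pg_stats
 WHERE tablename = 'tagged_genes'
   AND attname   = 'mention';
```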

>and from postgresql.conf:
>shared_buffers = 15000
>work_mem = 32768
>Everything else in postgresql.conf is default.

Setting effective_cache_size to a sane value wouldn't hurt.  I don't
know about OS X; does it, like Linux, automatically tune its disk
cache or do you have to configure it somewhere?
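Just as a sketch: effective_cache_size is measured in 8 kB pages, so if
the box ends up with, say, 1 GB of OS disk cache (that figure is only an
example, not a recommendation), the postgresql.conf entry would be:

```
# effective_cache_size is in 8 kB pages; 131072 pages = 1 GB,
# assuming roughly 1 GB of OS cache is available to Postgres
effective_cache_size = 131072
```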

>tagged_genes table:
>13,982,464 rows
>422,028 pages  (although about half of that is the experimental tsvector
>column, though!)
>The index of the query column (mention) is 226,229 pages (= 1.7 GB?).

The average tuples per page ratio seems a bit low, both for the heap
(~33) and for the index (~62).  If the planner's tuple size estimation
of 67 bytes is approximately right, there's a lot of free space in
your relations.  Try VACUUM FULL and REINDEX or CLUSTER to shrink
these files.
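The commands themselves, for the record (both take exclusive locks on
the table, so schedule them accordingly):

```sql
-- Either shrink the heap in place and rebuild the indexes ...
VACUUM FULL tagged_genes;
REINDEX TABLE tagged_genes;

-- ... or rewrite the heap in tg_mention_idx order, which also
-- rebuilds the indexes in one go:
CLUSTER tg_mention_idx ON tagged_genes;
```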

>create table tagged_genes (
>        id           bigint NOT NULL PRIMARY KEY,  -- artificial primary key
>        mention      text,             -- a gene name or description
>        pmid         bigint,           -- identifies the document that
>the mention occurs in
>        create_date  timestamp NOT NULL,
>        vector       tsvector          -- experimental tsearch2 index of
>mention column
>);
>create index tg_mention_idx on tagged_genes(mention);
>create index tg_pmid_idx    on tagged_genes(pmid);
>create index tg_vector_idx  on tagged_genes(vector);

If mention is long (which is not implied by your examples, but an int
is still smaller than any nonempty text) and there are many
duplicates, it might pay off to put them in their own table:

CREATE TABLE mentions (
    id      SERIAL PRIMARY KEY,
    mention text UNIQUE,
    vector  tsvector          -- experimental tsearch2 index
) WITHOUT oids;
create index me_vector_idx on mentions(vector);

and reference them from tagged_genes:

create table tagged_genes (
        id           bigint NOT NULL PRIMARY KEY,
        mentionid    int REFERENCES mentions,
        pmid         bigint,           -- identifies the document that
                                       -- the mention occurs in
        create_date  timestamp NOT NULL
) WITHOUT oids;
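A migration could look roughly like this -- an untested sketch, where
old_tagged_genes is a hypothetical name for the renamed original table,
and the tsvector column would be repopulated in mentions afterwards:

```sql
BEGIN;

-- One row per distinct mention; the SERIAL column assigns the keys.
INSERT INTO mentions (mention)
SELECT DISTINCT mention FROM old_tagged_genes;

-- Rebuild tagged_genes referencing the surrogate key.
INSERT INTO tagged_genes (id, mentionid, pmid, create_date)
SELECT t.id, m.id, t.pmid, t.create_date
  FROM old_tagged_genes t
  JOIN mentions m ON m.mention = t.mention;

COMMIT;
```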

Unless my math is wrong, this would result in a heap size of ~120K
pages and an index size of ~52K pages, plus some 10% slack for updated
and deleted tuples, if you VACUUM regularly.

Servus
 Manfred

