speeding up a query on a large table - Mailing list pgsql-general
From:     Kevin Murphy
Subject:  speeding up a query on a large table
Msg-id:   42E55EBF.4050109@genome.chop.edu
Responses: Re: speeding up a query on a large table
List:     pgsql-general
I'm trying to speed up a query on a text column of a 14M-row table. Uncached query times vary between 1 and 20 seconds (maybe more), depending on the search term. Between time trials I've been trying to flush the disk buffer cache by selecting count(*) from a separate 4GB table, and times are pretty consistent for a given search term. If a search term hasn't been seen in a while, or my full table scan has purged memory, a query may take 20 seconds, whereas if the relevant pages are cached, it may take 8.5 milliseconds. I'd really like to avoid that 20-second turn-off for users.

I'm guessing I need to put lots of RAM in this machine (currently only 2GB; max 8GB) and somehow pre-load the entire index. The production version of this database will be read-only, btw. Because this index will be key to the whole application, and because the number of possible search terms is so large, it would be nice if the entire index could somehow be preloaded into memory and encouraged to stay there. Alternatively, we could prime the disk buffers by doing searches on what we guess will be the most common terms. I wasn't paying attention to the recent thread about RAM disks, but maybe this is a situation that might call for one?

The rest of this message contains details about the situation. Thanks for the advice, as usual!

-Kevin Murphy

Sample query:

explain analyze select * from tagged_genes where mention = 'bcl2';

                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using tagged_genes_mention_idx on tagged_genes  (cost=0.00..327.64 rows=81 width=67) (actual time=28.694..5544.779 rows=848 loops=1)
   Index Cond: (mention = 'bcl2'::text)
 Total runtime: 5545.434 ms

The index on the query column (mention) is 226,229 pages (= 1.7 GB?). There are 1.3M unique values in that column. I've run 'vacuum analyze' on the table.
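The cache-priming idea above can be sketched directly in SQL. This is only a sketch: the common_terms table is a hypothetical helper (not part of the schema below) holding whatever terms we guess will be popular, and the first query just sanity-checks the 1.7 GB index-size estimate at 8 KB per page:

```
-- Sanity check: 226,229 pages * 8 KB/page ~= 1.73 GB
SELECT 226229 * 8192 / (1024.0 * 1024 * 1024) AS index_size_gb;

-- Prime the buffer/OS cache at startup by running the searches we
-- expect users to issue most often.
-- 'common_terms' is a hypothetical table of frequent search terms.
SELECT count(*)
FROM tagged_genes
WHERE mention IN (SELECT term FROM common_terms);
```

Running something like this at server start would pull both the relevant index pages and the matching heap pages into memory before the first real user query arrives.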
I also tried setting the statistics target to 1000, and that did speed up searches for some search terms.

Just out of curiosity, and because I'm interested in possibly using tsearch2 in the future, I created a tsvector column indexing the mention column (and added a GiST index and vacuum-analyzed the table again). tsearch2 is a lot slower, presumably because it's doing a lot more (although in this case it doesn't return all that much more). Here is a typical result of a tsearch2 search on my data:

explain analyze select * from tagged_genes where vector @@ to_tsquery('bcl2');

                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using vector_idx on tagged_genes  (cost=0.00..56092.98 rows=13983 width=67) (actual time=202.078..43122.688 rows=980 loops=1)
   Index Cond: (vector @@ '\'bcl2\''::tsquery)
 Total runtime: 43124.215 ms

Blech. I'd love to use tsearch2, but it's hard to see it being useful with my puny hardware and not-so-puny data.

I'm using PG 8.0.3 on Mac OS X 10.4.2 on a dual 2.5GHz G5, currently with 2GB RAM. The data is physically stored on an Xserve RAID array (seven 7200rpm ultra-ATA drives, RAID 5) connected via fibre channel directly to the G5. I did some tests of this array's performance for a single user. Using a large block size (256K), the array can do sequential reads at 134 MB/sec, but performance drops to 12 MB/sec for 4K sequential reads. Random 4K reads are 5 MB/sec, and random 256K reads are 121 MB/sec (all according to "XBench" - not sure if it's a good benchmark).

I'm using these shared memory settings:

kern.sysv.shmmax=134217728
kern.sysv.shmall=32768
kern.sysv.shmmin=1
kern.sysv.shmmni=32
kern.sysv.shmseg=8

and from postgresql.conf:

shared_buffers = 15000
work_mem = 32768

Everything else in postgresql.conf is default.
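For what it's worth, the shared memory numbers above can be cross-checked with a little arithmetic: at 8 KB per buffer, shared_buffers = 15000 is about 117 MB, which only just fits under the 128 MB shmmax kernel limit:

```
SELECT 15000 * 8192 AS shared_buffers_bytes,  -- 122,880,000 (~117 MB)
       134217728    AS shmmax_bytes;          -- 128 MB kernel limit
```

So shared_buffers is currently a small fraction of the 1.7 GB index; raising it much further would also require raising kern.sysv.shmmax.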
-------

Here is the table description:

tagged_genes table: 13,982,464 rows, 422,028 pages (about half of that is the experimental tsvector column, though!)

create table tagged_genes (
    id          bigint NOT NULL PRIMARY KEY, -- artificial primary key
    mention     text,                        -- a gene name or description
    pmid        bigint,                      -- identifies the document that the mention occurs in
    create_date timestamp NOT NULL,
    vector      tsvector                     -- experimental tsearch2 index of mention column
);
create index tg_mention_idx on tagged_genes(mention);
create index tg_pmid_idx on tagged_genes(pmid);
create index tg_vector_idx on tagged_genes(vector);

========================================

Some trials with different search terms:

mycn:
 Index Scan using tagged_genes_mention_idx on tagged_genes  (cost=0.00..327.64 rows=81 width=67) (actual time=41.703..2751.600 rows=479 loops=1)
   Index Cond: (mention = 'mycn'::text)
 Total runtime: 2751.936 ms
mycn trials: 2752 ms, 2755 ms, 2766 ms

bcl2:
 Index Scan using tagged_genes_mention_idx on tagged_genes  (cost=0.00..327.64 rows=81 width=67) (actual time=28.694..5544.779 rows=848 loops=1)
   Index Cond: (mention = 'bcl2'::text)
 Total runtime: 5545.434 ms
bcl2 trials: 5545 ms, 5492 ms, 5493 ms

cyp3a4:
 Index Scan using tagged_genes_mention_idx on tagged_genes  (cost=0.00..7867.60 rows=1958 width=67) (actual time=58.138..9602.558 rows=1985 loops=1)
   Index Cond: (mention = 'cyp3a4'::text)
 Total runtime: 9603.733 ms
cyp3a4 trials: 9604 ms, 11872 ms, 9970 ms

tp53:
 Index Scan using tagged_genes_mention_idx on tagged_genes  (cost=0.00..327.64 rows=81 width=67) (actual time=28.505..8064.808 rows=1484 loops=1)
   Index Cond: (mention = 'tp53'::text)
 Total runtime: 8065.791 ms

P.S. I've heard that OS X sucks as a server OS. I don't know if that's true, but I'm interested in installing a PPC Linux and giving that a go for the sake of comparison. Don't know where that would leave my storage hardware, though.
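One pattern in the trials above is worth noting: each term matches only hundreds to a few thousand rows, yet a cold index scan takes seconds, which suggests the matching rows are scattered across many heap pages (one random 4K read each, at the ~5 MB/sec random-read rate measured above). Since the production database will be read-only, a one-time physical reordering might help; a sketch using the CLUSTER syntax of this PostgreSQL version (index name from the DDL above):

```
-- Rewrite the table in mention-index order, so all rows for a given
-- term land on adjacent heap pages, then refresh planner statistics.
CLUSTER tg_mention_idx ON tagged_genes;
ANALYZE tagged_genes;
```

After clustering, an index scan for one term should touch only a handful of heap pages instead of one page per row. CLUSTER takes an exclusive lock and rewrites the table, so it would be done once before the read-only database goes live.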