Postgres query optimization with varchar fields - Mailing list pgsql-performance
From | W.H. van Atteveldt |
---|---|
Subject | Postgres query optimization with varchar fields |
Date | |
Msg-id | C7F588D7403F1F4591EB34F1E2B5CFD10B1A3E@GAIA.2at.local Whole thread Raw |
Responses |
Re: Postgres query optimization with varchar fields
|
List | pgsql-performance |
Dear reader, I am investigating whether it is useful to directly query a database containing a rather large text corpus (order of magnitude 100k - 1m newspaper articles, so around 100 million words), or whether I should use third party text indexing services. I want to know things such as: how often is a certain word (or pattern) mentioned in an article and how often it is mentioned with the condition that another word is nearby (same article or n words distant). I created a table listing the words one word per row, and created an index on the word and wordnr columns. An example query would be: simple: select articleid, count(*) as count from words w where articleid in (select id from articles where batchid in (84,85,100,101,118,121)) and (word like '<PATTERN>') group by articleid complex: select articleid, count(*) as count from words w where articleid in (select id from articles where batchid in (84,85,100,101,118,121)) and (word like '<PATTERN>') and exists (select * from words w2 where w.articleid = w2.articleid and (word like '<PATTERN2>')) group by articleid According to the diagnostics, the database does use the indices for the query, but it is still rather slow (around 10 minutes for a 'simple query', x seconds for a complex one) It is important that the complex query only counts instances where the PATTERN is found and PATTERN2 only functions as a criterium and does not add to the count. My questions are: (technical details provided below) - Does anyone disagree with the general setup? - Is there a more sensible way to phrase my SQL? - Any other ideas to improve performance? Thanks, Wouter van Atteveldt Free University Amsterdam ------ Technicalities: I am using a Postgresql 7.4.1 database on a linux machine (uname -a: Linux swpc450.cs.vu.nl 2.4.22-1.2115.nptl #1 Wed Oct 29 15:31:21 EST 2003 i686 athlon i386 GNU/Linux). The table of interest is: (lemma, pos, simplepos currently not used) Table "public.words" Column | Type | Modifiers ------------+------------------------+---------------------------------- --------------------- id | integer | not null default nextval('public.words_id_seq'::text) articleid | integer | not null sentencenr | integer | not null word | character varying(255) | not null lemma | character varying(255) | pos | character varying(255) | simplepos | character(1) | wordnr | integer | not null parnr | integer | not null Indexes: "words_pkey" primary key, btree (id) "words_aid" btree (articleid) "words_word" btree (word) "words_word_ptrn" btree (word varchar_pattern_ops) "words_wordnr" btree (wordnr) Query plans: anoko=> explain select articleid, count(*) as count from words w where articleid in (select id from articles where batchid in (84,85,100,101,118,121)) and (word like 'integratie%') group by articleid; QUERY PLAN ------------------------------------------------------------------------ ------------------------------------------------------------------------ ------------------------------------------------------ HashAggregate (cost=937959.21..937959.22 rows=2 width=4) -> Hash IN Join (cost=95863.70..937816.01 rows=28640 width=4) Hash Cond: ("outer".articleid = "inner".id) -> Index Scan using words_word_ptrn on words w (cost=0.00..836604.62 rows=208886 width=4) Index Cond: (((word)::text ~>=~ 'integratie'::character varying) AND ((word)::text ~<~ 'integratif'::character varying)) Filter: ((word)::text ~~ 'integratie%'::text) -> Hash (cost=94998.60..94998.60 rows=146041 width=4) -> Index Scan using articles_batchid, articles_batchid, articles_batchid, articles_batchid, articles_batchid, articles_batchid on articles (cost=0.00..94998.60 rows=146041 width=4) Index Cond: ((batchid = 84) OR (batchid = 85) OR (batchid = 100) OR (batchid = 101) OR (batchid = 118) OR (batchid = 121)) explain select articleid, count(*) as count from words w where articleid in (select id from articles where batchid in (84,85,100,101,118,121)) and (word like '<PATTERN>') and exists (select * from words w2 where w.articleid = w2.articleid and (word like '<PATTERN2>')) group by articleid anoko-> ; QUERY PLAN ------------------------------------------------------------------------ ------------------------------------------------------------------------ ------------------------------------------------------ GroupAggregate (cost=168253089.23..168254556.46 rows=1 width=4) -> Merge IN Join (cost=168253089.23..168254484.85 rows=14320 width=4) Merge Cond: ("outer".articleid = "inner".id) -> Sort (cost=168144438.23..168144699.33 rows=104443 width=4) Sort Key: w.articleid -> Index Scan using words_word_ptrn on words w (cost=0.00..168134972.17 rows=104443 width=4) Index Cond: ((word)::text ~=~ '<PATTERN>'::character varying) Filter: (((word)::text ~~ '<PATTERN>'::text) AND (subplan)) SubPlan -> Index Scan using words_aid on words w2 (cost=0.00..836948.84 rows=1045 width=460) Index Cond: ($0 = articleid) Filter: ((word)::text ~~ '<PATTERN2>'::text) -> Sort (cost=108651.01..109016.11 rows=146041 width=4) Sort Key: articles.id -> Index Scan using articles_batchid, articles_batchid, articles_batchid, articles_batchid, articles_batchid, articles_batchid on articles (cost=0.00..94998.60 rows=146041 width=4) Index Cond: ((batchid = 84) OR (batchid = 85) OR (batchid = 100) OR (batchid = 101) OR (batchid = 118) OR (batchid = 121))
pgsql-performance by date: