Thread: Which hardware suits best for large full-text indexed databases
Hi folks,

I have a database using tsearch2 to index 300 000 documents. I've already optimized the queries, and the database is vacuumed on a daily basis. The stat function tells me that my index has approx. 460 000 unique words (I'm using a stemmer and a nice stopword list).

The problem is performance: some queries take more than 10 seconds to execute, and I'm not sure if my bottleneck is memory or I/O. The server is an Athlon XP 2000, ATA133 HD, 1.5 GB RAM, running PostgreSQL 7.4.3 on FreeBSD 5.0 with lots of shared_buffers and sort_mem...

Does anyone have an idea of a more cost-efficient solution? How can I get better performance without having to invest an astronomically high amount of money?

TIA,

--
Diogo Biazus
diogo@ikono.com.br
http://www.ikono.com.br
> -----Original Message-----
> From: Diogo Biazus [mailto:diogo@ikono.com.br]
> Sent: Tuesday, March 30, 2004 1:55 PM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] Which hardware suits best for large
> full-text indexed databases
>
> The problem is performance: some queries take more than 10 seconds to
> execute, and I'm not sure if my bottleneck is memory or I/O.
>
> Does anyone have an idea of a more cost-efficient solution?

What does the EXPLAIN command say about the slowest queries?
Dann Corbit wrote:

> What does the EXPLAIN command say about the slowest queries?

explain SELECT qrydocumentos.coddocumento, qrydocumentos.nomedocumento,
  qrydocumentos.conteudo, qrydocumentos.tamanho, qrydocumentos.hora,
  qrydocumentos.data, qrydocumentos.codfonte, qrydocumentos.nomefonte,
  qrydocumentos.numeroimagens AS "numeroImagens", qrydocumentos.subtitulo,
  qrydocumentos.codtipodocumento, qrydocumentos.codformato,
  numeroacessos AS acessos
FROM qrydocumentos
WHERE qrydocumentos.codgrupousuario = 1
  AND (qrydocumentos.conteudo_stem_ix @@ to_tsquery('default_portuguese', 'brasil'))
ORDER BY ajustadata(qrydocumentos.datapublicacao) DESC,
  ajustahora(qrydocumentos.horapublicacao) DESC,
  qrydocumentos.coddocumento DESC;

                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1270.87..1271.60 rows=293 width=880)
   Sort Key: ajustadata(documentos.datapublicacao), ajustahora(documentos.horapublicacao), documentos.coddocumento
   ->  Hash Join  (cost=1246.88..1258.87 rows=293 width=880)
         Hash Cond: ("outer".codfonte = "inner".codfonte)
         ->  Merge Join  (cost=1245.38..1252.25 rows=293 width=861)
               Merge Cond: (("outer".codfonte = "inner".codfonte) AND ("outer".codtipodocumento = "inner".codtipodocumento))
               ->  Sort  (cost=1195.00..1195.73 rows=291 width=845)
                     Sort Key: documentos.codfonte, documentos.codtipodocumento
                     ->  Index Scan using documentos_conteudo_stem_ix_ix on documentos  (cost=0.00..1183.08 rows=291 width=845)
                           Index Cond: (conteudo_stem_ix @@ '\'brasil\''::tsquery)
                           Filter: (conteudo_stem_ix @@ '\'brasil\''::tsquery)
               ->  Sort  (cost=50.38..50.89 rows=204 width=16)
                     Sort Key: fontes_gruposusuario.codfonte, tiposdocumento_gruposusuario.codtipodocumento
                     ->  Merge Join  (cost=0.00..42.57 rows=204 width=16)
                           Merge Cond: ("outer".codgrupousuario = "inner".codgrupousuario)
                           ->  Index Scan using fontes_gruposusuario_codgrupousuario_ix on fontes_gruposusuario  (cost=0.00..24.75 rows=28 width=8)
                                 Index Cond: (codgrupousuario = 1)
                           ->  Index Scan using tiposdocumento_gruposusuario_codgrupousuario_ix on tiposdocumento_gruposusuario  (cost=0.00..13.85 rows=542 width=8)
         ->  Hash  (cost=1.40..1.40 rows=40 width=19)
               ->  Seq Scan on fontes  (cost=0.00..1.40 rows=40 width=19)
-- Diogo Biazus diogo@ikono.com.br http://www.ikono.com.br
Can we see the schema for the table qrydocumentos?

> -----Original Message-----
> From: Diogo Biazus [mailto:diogo@ikono.com.br]
> Sent: Tuesday, March 30, 2004 2:32 PM
> To: Dann Corbit
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Which hardware suits best for large
> full-text indexed databases
>
> Dann Corbit wrote:
> > What does the EXPLAIN command say about the slowest queries?
>
> explain SELECT ... FROM qrydocumentos WHERE qrydocumentos.codgrupousuario = 1
> AND (qrydocumentos.conteudo_stem_ix @@ to_tsquery('default_portuguese', 'brasil'))
> ORDER BY ajustadata(qrydocumentos.datapublicacao) DESC, ...;
Diogo Biazus wrote:
> The problem is performance: some queries take more than 10 seconds to
> execute, and I'm not sure if my bottleneck is memory or I/O.
> The server is an Athlon XP 2000, ATA133 HD, 1.5 GB RAM, running
> PostgreSQL 7.4.3 on FreeBSD 5.0 with lots of shared_buffers and sort_mem...

This isn't hardware related, but FreeBSD 5 is not a particularly impressive performer. Especially 5.0... 5.2.1 would be better, but if you're shooting for performance, 4.9 will probably outperform both of them at this stage of the game.

Something to consider if the query tuning that others are helping with doesn't solve the problem. Follow through with that _first_ though.

However, if you insist on running 5, make sure your kernel is compiled without WITNESS... it speeds things up noticeably.

--
Bill Moran
Potential Technologies
http://www.potentialtech.com
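[Editor's sketch of the WITNESS change: the config file name MYKERNEL is illustrative, and these options should only be removed if they are actually present in your kernel configuration.]

```conf
# /usr/src/sys/i386/conf/MYKERNEL
# Comment out or delete the lock-order debugging options, then
# rebuild and install the kernel as usual:
#
# options  WITNESS          # lock-order verification; expensive at runtime
# options  WITNESS_SKIPSPIN # only meaningful together with WITNESS
```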
Look into running Swish-e instead:
http://www.swish-e.org

Great speed, nice engine, excellent boolean searches. We run it on several sites, each with over 500,000 documents. Performance is consistently sub-second response time, and we also integrate it with PHP, Perl, and PostgreSQL.

I know it is nice to use tsearch2, but we also found the performance lacking for those big indices. Maybe Oleg and the tsearch2 gang have some extra tips?

- Ericson
Diogo Biazus wrote:
> explain SELECT ... FROM qrydocumentos WHERE qrydocumentos.codgrupousuario = 1
> AND (qrydocumentos.conteudo_stem_ix @@ to_tsquery('default_portuguese', 'brasil'))
> ORDER BY ajustadata(qrydocumentos.datapublicacao) DESC, ...;

What is the distribution of records over the codgrupousuario field?

You can use a multi-column index (with contrib/btree_gist):

create index fti on qrydocumentos using gist (codgrupousuario, conteudo_stem_ix);

or a partial index:

create index fti on qrydocumentos using gist (conteudo_stem_ix) where codgrupousuario = 1;

One more thing: you could use an ispell dictionary (for Portuguese, I suppose):
http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html#Portuguese-dicts

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
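[Editor's sketch of how the combined index would be exercised, using the identifiers from this thread. Note that qrydocumentos later turns out to be a view, so in practice the index would have to be built on the underlying documentos table.]

```sql
-- Assumes contrib/btree_gist is installed so the integer column can
-- participate in a GiST index alongside the tsvector column.
CREATE INDEX fti ON qrydocumentos
  USING gist (codgrupousuario, conteudo_stem_ix);

-- A query constraining both columns can then be answered from one index,
-- instead of scanning every matching document and filtering by group:
SELECT coddocumento
FROM qrydocumentos
WHERE codgrupousuario = 1
  AND conteudo_stem_ix @@ to_tsquery('default_portuguese', 'brasil');

-- Verify with EXPLAIN that the planner actually picks the fti index.
```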
On Tue, 30 Mar 2004, Diogo Biazus wrote:
> The stat function tells me that my index has approx. 460 000 unique words
> (I'm using a stemmer and a nice stopword list).

460 000 unique words is a lot! Have you looked at them? Sometimes it's very useful to analyze what you actually indexed and whether you want all of it. I suggest you use an ispell dictionary and, if you index numbers (look at the statistics), use the special dictionaries for integer and decimal numbers:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/README.intdict

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Diogo,

could you send us 'explain analyze' for the fts-specific part of your query? For example:

www=# explain analyze select title from s_pages where to_tsquery('regression') @@ fts_index;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Index Scan using s_gist_key on s_pages  (cost=0.00..5.95 rows=1 width=29) (actual time=0.781..11.403 rows=62 loops=1)
   Index Cond: ('\'regress\''::tsquery @@ fts_index)
   Filter: ('\'regress\''::tsquery @@ fts_index)
 Total runtime: 11.554 ms
(4 rows)

Frankly, I don't understand your query :) Could you explain what you want to find?

qrydocumentos.conteudo_stem_ix @@ to_tsquery('default_portuguese', 'brasil')

On Tue, 30 Mar 2004, Diogo Biazus wrote:
> explain SELECT ... FROM qrydocumentos WHERE qrydocumentos.codgrupousuario = 1
> AND (qrydocumentos.conteudo_stem_ix @@ to_tsquery('default_portuguese', 'brasil'))
> ORDER BY ajustadata(qrydocumentos.datapublicacao) DESC, ...;

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
On Wed, 31 Mar 2004, Ericson Smith wrote:
> Look into running Swish-e instead:
> http://www.swish-e.org
>
> Great speed, nice engine, excellent boolean searches. We run it on
> several sites, each with over 500,000 documents.

It's a very different story! There are hundreds of *standalone* search engines based on inverted indices, but you don't have *native* access to metadata stored in the database, so your search collection isn't consistent. tsearch2 was developed specially for online update and consistency (think about access control to documents). If you don't care about that, you don't need tsearch2. Btw, tsearch2 scales much better with long queries.

> I know, it is nice to use tsearch2, but we also found the performance
> lacking for those big indices. Maybe Oleg and the tsearch2 gang have
> some extra tips?

Not very much; most of them are written in the documentation. Other tips are general to databases: use multi-key indices, use dictionaries and stop words, check word statistics.

One interesting option we have is a standalone archive search based on OpenFTS. The main idea is to have separate searches: online search (fresh documents) and archive (static documents) search. Online search is performed as usual (tsearch2), while archive search uses inverted indices, like swish-e, google, ... The nice thing is that results from both searches can easily be merged because they use the same ranking function! So you could have an online index for a month's news and an archive part for older news, and I bet you could manage millions of documents.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Dann Corbit wrote:
> Can we see the schema for the table qrydocumentos?

Sure:

CREATE OR REPLACE VIEW public.qrydocumentos AS
SELECT documentos.coddocumento, documentos.codfonte, documentos.codsecao,
  documentos.codusuario, documentos.codposicaopagina, documentos.codinterno,
  documentos.nomedocumento, documentos.subtitulo, documentos.resumohumano,
  documentos.resumoautomatico, documentos.conteudo, documentos.datapublicacao,
  documentos.localorigem, documentos.horapublicacao, documentos.pagina,
  documentos.anoedicao, documentos.numeroedicao_old, documentos.nomeautor,
  documentos.datainsercao, documentos.url, documentos.codtipoinsercao,
  documentos.codbusca, documentos.codtipodocumento, documentos.codformato,
  documentos.analisado, documentos.numeroedicao, documentos.conteudo_stem_ix,
  documentos.conteudo_ix, documentos.numeroimagens, documentos.numeroacessos,
  fontes.nomefonte, fontes.codtipofonte, fontes_gruposusuario.codgrupousuario,
  character_length(documentos.conteudo) AS tamanho,
  to_char((documentos.horapublicacao)::interval, 'HH24:MI'::text) AS hora,
  to_char((documentos.datapublicacao)::timestamp with time zone, 'dd/mm/yyyy'::text) AS data
FROM (((fontes_gruposusuario
  JOIN tiposdocumento_gruposusuario ON ((tiposdocumento_gruposusuario.codgrupousuario = fontes_gruposusuario.codgrupousuario)))
  JOIN documentos ON (((documentos.codtipodocumento = tiposdocumento_gruposusuario.codtipodocumento) AND (documentos.codfonte = fontes_gruposusuario.codfonte))))
  JOIN fontes ON ((fontes.codfonte = documentos.codfonte)));

--
Diogo Biazus
diogo@ikono.com.br
http://www.ikono.com.br
Teodor Sigaev wrote:
> What is the distribution of records over the codgrupousuario field?

There are lots of documents in common between different user groups (codgrupousuario = user group id), but each user group can access a subset of documents based on the types of document and data sources (fontes) they are allowed to use.

> You can use a multi-column index (with contrib/btree_gist):
> create index fti on qrydocumentos using gist (codgrupousuario, conteudo_stem_ix);
> or a partial index:
> create index fti on qrydocumentos using gist (conteudo_stem_ix) where codgrupousuario = 1;

I'd rather not create partial indexes (I would have to create a trigger that creates an index each time a row with a new codgrupousuario is inserted). But I'll try it to see how it performs. The multi-column index seems a good idea.

> One more thing: you could use an ispell dictionary (for Portuguese, I suppose):
> http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html#Portuguese-dicts

I'll make some tests with ispell. Thanks for the advice.

Regards,

--
Diogo Biazus
diogo@ikono.com.br
http://www.ikono.com.br
Oleg Bartunov wrote:
> 460 000 unique words is a lot! Have you looked at them? Sometimes it's
> very useful to analyze what you actually indexed and whether you want
> all of it.

I'll try the ispell dictionaries and the dicts for numbers too ;) Could the synonym dictionary also help me here (reducing unique words)?

Thanks,

--
Diogo Biazus
diogo@ikono.com.br
http://www.ikono.com.br
Bill Moran wrote:
> However, if you insist on running 5, make sure your kernel is compiled
> without WITNESS... it speeds things up noticeably.

Thanks for the advice, I'll try recompiling the kernel. Does the FreeBSD version make a noticeable difference?

Regards,

--
Diogo Biazus
diogo@ikono.com.br
http://www.ikono.com.br
Oleg Bartunov wrote:
> It's a very different story! There are hundreds of *standalone* search
> engines based on inverted indices, but you don't have *native* access
> to metadata stored in the database, so your search collection isn't
> consistent.

Actually, swish-e has excellent support for metadata. This allows you to nicely partition your indices, or to search only user-defined parts based on as much custom metadata as you'd care to define.

Granted, tsearch2 allows you to have *live* updates to the index. But we usually reindex nightly, and that tends to be good enough for most cases.

- Ericson Smith
Can we see the underlying tables also?

> -----Original Message-----
> From: Diogo Biazus [mailto:diogo@ikono.com.br]
> Sent: Wednesday, March 31, 2004 8:46 AM
> Subject: Re: [GENERAL] Which hardware suits best for large
> full-text indexed databases
>
> Dann Corbit wrote:
> > Can we see the schema for the table qrydocumentos?
>
> Sure:
>
> CREATE OR REPLACE VIEW public.qrydocumentos AS ...
Oleg Bartunov wrote:
> could you send us 'explain analyze' for the fts-specific part of your
> query?

Alright, here it goes:

eyeknow=# explain analyze select * from documentos where conteudo_stem_ix @@ to_tsquery('default_portuguese', 'brasil');
                                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using documentos_conteudo_stem_ix_ix on documentos  (cost=0.00..1183.08 rows=291 width=1156) (actual time=26.33..840039.40 rows=70686 loops=1)
   Index Cond: (conteudo_stem_ix @@ '\'brasil\''::tsquery)
   Filter: (conteudo_stem_ix @@ '\'brasil\''::tsquery)
 Total runtime: 840248.91 msec
(4 rows)

> Frankly, I don't understand your query :) Could you explain what you
> want to find?
>
> qrydocumentos.conteudo_stem_ix @@ to_tsquery('default_portuguese', 'brasil')

I'm trying to find documents which contain the word 'brasil' (stemmed). This is just an example, because the user can input any query to the database. The rest of the query is used to make sure that the user is accessing only documents he has permission to see.

Regards,

--
Diogo Biazus
diogo@ikono.com.br
http://www.ikono.com.br
> -----Original Message-----
> From: Diogo Biazus [mailto:diogo@ikono.com.br]
> Sent: Wednesday, March 31, 2004 11:02 AM
> To: Dann Corbit
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Wich hardware suits best for large
> full-text indexed databases
>
> Dann Corbit wrote:
>
> >I see that only table documentos has a unique index. Is it really so
> >that none of the other tables has a key to uniquely identify a record?
> >Perhaps the others have a unique attribute but it was never declared
> >when forming the index?
> >
> I thought that a primary key would create a unique index implicitly.

Yes, I believe that is correct also. It is strange that it is not
reflected in the schema.

> >I do not remember which version of PostgreSQL you are using, but there
> >is now a CLUSTER command. Now, I am speaking from my experience with
> >other database systems, and so perhaps the PostgreSQL group may find it
> >necessary to correct me. I have only recently begun working with
> >PostgreSQL versions beyond 7.1.3, which had no cluster command (or at
> >least I was not aware of one). In other database systems, a clustered
> >index is incredibly valuable. In fact, a unique, clustered index is
> >always a silver bullet for marvelous performance. Perhaps some of the
> >experts can give measurements as to the effectiveness of clustering in
> >PostgreSQL. Here is the link for the cluster documentation:
> >http://www.postgresql.org/docs/current/static/sql-cluster.html
> >
> >From the description, clustering is expensive and your tables are
> >large, so it should be an infrequent operation.
> >
> Sure. My pgsql is version 7.3.4 (I was obviously mistaken when I wrote
> 7.4.3 in a previous email). I used to cluster some tables, but it is an
> expensive operation, as you said...

Can you send me an unload of your schema? I will pull it into ERWin/Erx
and have a look at the model. It is hard for me to picture things unless
I can see them visually.
I see that only table documentos has a unique index. Is it really so
that none of the other tables has a key to uniquely identify a record?
Perhaps the others have a unique attribute but it was never declared
when forming the index?

I do not remember which version of PostgreSQL you are using, but there
is now a CLUSTER command. Now, I am speaking from my experience with
other database systems, and so perhaps the PostgreSQL group may find it
necessary to correct me. I have only recently begun working with
PostgreSQL versions beyond 7.1.3, which had no cluster command (or at
least I was not aware of one). In other database systems, a clustered
index is incredibly valuable. In fact, a unique, clustered index is
always a silver bullet for marvelous performance. Perhaps some of the
experts can give measurements as to the effectiveness of clustering in
PostgreSQL. Here is the link for the cluster documentation:
http://www.postgresql.org/docs/current/static/sql-cluster.html

From the description, clustering is expensive and your tables are large,
so it should be an infrequent operation.

> -----Original Message-----
> From: Diogo Biazus [mailto:diogo@ikono.com.br]
> Sent: Wednesday, March 31, 2004 10:30 AM
> To: Dann Corbit
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Wich hardware suits best for large
> full-text indexed databases
>
> Dann Corbit wrote:
>
> >Can we see the underlying tables also?
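For reference, the CLUSTER form in the documentation Dann links to (available from 7.3 on) takes an index name and a table name. A minimal sketch using names from the schema in this thread; clustering on the primary key is an assumption (another index might match the query pattern better), and since CLUSTER takes an exclusive lock and rewrites the whole table, a maintenance window is assumed too:

```sql
-- Physically reorder documentos following its primary-key index,
-- then refresh statistics so the planner sees the new layout:
CLUSTER documentos_pkey ON documentos;
ANALYZE documentos;
```

Note that in these versions CLUSTER is a one-shot operation: rows inserted afterwards are appended unordered, so it has to be repeated periodically, which matches Diogo's "expensive operation" remark.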
> >
> CREATE TABLE public.documentos
> (
>   coddocumento serial NOT NULL,
>   codfonte int4 NOT NULL,
>   codsecao int4,
>   codusuario int4,
>   codposicaopagina int4,
>   codinterno varchar(255),
>   nomedocumento text NOT NULL,
>   subtitulo text,
>   resumohumano text,
>   resumoautomatico text,
>   conteudo text,
>   datapublicacao date,
>   localorigem varchar(255),
>   horapublicacao time,
>   pagina varchar(5),
>   anoedicao varchar(5),
>   numeroedicao_old varchar(10),
>   nomeautor varchar(255),
>   datainsercao timestamp NOT NULL DEFAULT ('now'::text)::timestamp(6) with time zone,
>   url text,
>   codtipoinsercao varchar(1) NOT NULL,
>   codbusca int4,
>   codtipodocumento int4 NOT NULL,
>   codformato int4 NOT NULL,
>   analisado bool NOT NULL DEFAULT false,
>   numeroedicao varchar(120),
>   conteudo_stem_ix public.tsvector,
>   conteudo_ix public.tsvector,
>   numeroimagens int4 DEFAULT 0,
>   numeroacessos int4 DEFAULT 0,
>   codtarefa int8,
>   md5arquivo varchar(32),
>   CONSTRAINT documentos_pkey PRIMARY KEY (coddocumento),
>   CONSTRAINT arquivos_documentos_fk FOREIGN KEY (md5arquivo)
>     REFERENCES public.arquivos (md5arquivo) ON UPDATE CASCADE ON DELETE CASCADE,
>   CONSTRAINT buscas_documentos FOREIGN KEY (codbusca)
>     REFERENCES public.buscas (codbusca) ON UPDATE CASCADE ON DELETE CASCADE,
>   CONSTRAINT fontes_documentos FOREIGN KEY (codfonte)
>     REFERENCES public.fontes (codfonte) ON UPDATE NO ACTION ON DELETE NO ACTION,
>   CONSTRAINT secoes_documentos FOREIGN KEY (codsecao)
>     REFERENCES public.secoes (codsecao) ON UPDATE NO ACTION ON DELETE NO ACTION,
>   CONSTRAINT tiposdocumento_documentos FOREIGN KEY (codtipodocumento)
>     REFERENCES public.tiposdocumento (codtipodocumento) ON UPDATE CASCADE ON DELETE NO ACTION,
>   CONSTRAINT tiposinsercao_documentos FOREIGN KEY (codtipoinsercao)
>     REFERENCES public.tiposinsercao (codtipoinsercao) ON UPDATE NO ACTION ON DELETE NO ACTION,
>   CONSTRAINT usuarios_documentos FOREIGN KEY (codusuario)
>     REFERENCES public.usuarios (codusuario) ON UPDATE NO ACTION ON DELETE NO ACTION
> ) WITH OIDS;
>
> CREATE INDEX documentos_ajustaautor_ix
>   ON public.documentos USING btree (public.ajustaautor(nomeautor) text_ops);
>
> CREATE INDEX documentos_ajustadata_ix
>   ON public.documentos USING btree (public.ajustadata(datapublicacao) date_ops);
>
> CREATE INDEX documentos_ajustahora_ix
>   ON public.documentos USING btree (public.ajustahora(horapublicacao) time_ops);
>
> CREATE INDEX documentos_codtipodocumento_ix
>   ON public.documentos USING btree (codtipodocumento);
>
> CREATE UNIQUE INDEX documentos_codfonte_codinterno_ix
>   ON public.documentos USING btree (codinterno, codfonte);
>
> CREATE INDEX documentos_codfonte_ix
>   ON public.documentos USING btree (codfonte);
>
> CREATE INDEX documentos_codformato_ix
>   ON public.documentos USING btree (codformato);
>
> CREATE INDEX documentos_conteudo_stem_ix_ix
>   ON public.documentos USING gist (conteudo_stem_ix);
>
> CREATE INDEX documentos_conteudo_ix_ix
>   ON public.documentos USING gist (conteudo_ix);
>
> CREATE TABLE public.fontes
> (
>   codfonte serial NOT NULL,
>   codtipofonte int4 NOT NULL,
>   nomefonte varchar(50) NOT NULL,
>   ativo bool NOT NULL DEFAULT true,
>   periodicidade varchar(1),
>   codagente int4,
>   CONSTRAINT fontes_pkey PRIMARY KEY (codfonte),
>   CONSTRAINT tiposfonte_fontes FOREIGN KEY (codtipofonte)
>     REFERENCES public.tiposfonte (codtipofonte) ON UPDATE NO ACTION ON DELETE NO ACTION
> ) WITH OIDS;
>
> CREATE INDEX fontes_codtipofonte_ix
>   ON public.fontes USING btree (codtipofonte);
>
> CREATE INDEX fontes_nomefonte_ix
>   ON public.fontes USING btree (nomefonte);
>
> CREATE TABLE public.fontes_gruposusuario
> (
>   codfonte int4 NOT NULL,
>   codgrupousuario int4 NOT NULL,
>   CONSTRAINT fontes_gruposusuario_pkey PRIMARY KEY (codfonte, codgrupousuario),
>   CONSTRAINT fontes_gruposusuario_codfonte_fk FOREIGN KEY (codfonte)
>     REFERENCES public.fontes (codfonte) ON UPDATE CASCADE ON DELETE CASCADE,
>   CONSTRAINT fontes_gruposusuario_codgrupousuario_fk FOREIGN KEY (codgrupousuario)
>     REFERENCES public.gruposusuario (codgrupousuario) ON UPDATE CASCADE ON DELETE CASCADE
> ) WITH OIDS;
>
> CREATE INDEX fontes_gruposusuario_codfonte_ix
>   ON public.fontes_gruposusuario USING btree (codfonte);
>
> CREATE INDEX fontes_gruposusuario_codgrupousuario_ix
>   ON public.fontes_gruposusuario USING btree (codgrupousuario);
>
> CREATE TABLE public.tiposdocumento_gruposusuario
> (
>   codtipodocumento int4 NOT NULL,
>   codgrupousuario int4 NOT NULL,
>   CONSTRAINT tiposdocumento_gruposusuario_pkey PRIMARY KEY (codtipodocumento, codgrupousuario),
>   CONSTRAINT tiposdocumento_gruposusuario_codtipodocumento_fk FOREIGN KEY (codtipodocumento)
>     REFERENCES public.tiposdocumento (codtipodocumento) ON UPDATE CASCADE ON DELETE CASCADE,
>   CONSTRAINT tiposdocumentos_gruposusuario_codgrupousuario_fk FOREIGN KEY (codgrupousuario)
>     REFERENCES public.gruposusuario (codgrupousuario) ON UPDATE CASCADE ON DELETE CASCADE
> ) WITHOUT OIDS;
>
> CREATE INDEX tiposdocumento_gruposusuario_codgrupousuario_ix
>   ON public.tiposdocumento_gruposusuario USING btree (codgrupousuario);
>
> CREATE INDEX tiposdocumento_gruposusuario_codtipodocumento_ix
>   ON public.tiposdocumento_gruposusuario USING btree (codtipodocumento);
>
> --
> Diogo Biazus
> diogo@ikono.com.br
> http://www.ikono.com.br
Diogo Biazus wrote:
> Bill Moran wrote:
>
>> Diogo Biazus wrote:
>>
>>> Hi folks,
>>> <SNIP>
>>>
>>> Does anyone has an idea of a more cost eficient solution?
>>> How to get a better performance without having to invest some
>>> astronomicaly high amount of money?
>>
>> This isn't hardware related, but FreeBSD 5 is not a particularly
>> impressive performer. Especially 5.0 ... 5.2.1 would be better, but if
>> you're shooting for performance, 4.9 will probably outperform both of
>> them at this stage of the game.
>>
>> Something to consider if the query tuning that others are helping with
>> doesn't solve the problem. Follow through with that _first_, though.
>>
>> However, if you insist on running 5, make sure your kernel is compiled
>> without WITNESS ... it speeds things up noticeably.
>
> Thanks for the advice, I'll try recompiling the kernel. Does the FreeBSD
> version make a noticeable difference?

Absolutely ... FreeBSD 5.0 is awfully slow. 5.1 is better, and 5.2.1 is
almost as fast as 4.9. If you're really concerned with speed, though,
you need to stick with 4.9 for now.

I'd stay focused on the db tuning as long as there look like reasonable
things to tune there, but FreeBSD 5.0 is NOT a good performer - it's
still too early in the development process. If you have a reason to use
5, 5.2.1 is what you want, but if you want the best performer, use 4.9.

--
Bill Moran
Potential Technologies
http://www.potentialtech.com
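For anyone following Bill's WITNESS advice: in early 5.x the GENERIC kernel shipped with lock-order and invariant debugging enabled, and rebuilding without them means deleting those lines from the kernel config before `make buildkernel`. A sketch of the lines to remove; the exact set present in a given 5.0 config is an assumption, so check your own file:

```
# Debugging options enabled by default in early FreeBSD 5.x GENERIC;
# delete (or comment out) these in your kernel config for production:
options         WITNESS              # lock-order verification (the big cost)
options         WITNESS_SKIPSPIN     # WITNESS tweak, moot once WITNESS is gone
options         INVARIANTS           # extra runtime consistency checks
options         INVARIANT_SUPPORT    # support code for INVARIANTS
```

Bill only names WITNESS; removing INVARIANTS as well is a common companion step, but that part is this editor's assumption, not his claim.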
On Wed, 31 Mar 2004, Diogo Biazus wrote:

> Oleg Bartunov wrote:
>
> >On Tue, 30 Mar 2004, Diogo Biazus wrote:
> >
> >>Hi folks,
> >>
> >>I have a database using tsearch2 to index 300 000 documents.
> >>I've already have optimized the queries, and the database is vacuumed on
> >>a daily basis.
> >>The stat function tells me that my index has aprox. 460 000 unique words
> >>(I'm using stemmer and a nice stopword list).
> >>
> >
> >460 000 unique words is a lot! Have you looked at them? Sometimes it's
> >very useful to analyze what you indexed and whether you really want all
> >of it. I suggest you use an ispell dictionary and, if you index numbers
> >(look at the statistics), the special dictionaries for integer and
> >decimal numbers:
> >http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/README.intdict
> >
> I'll try the ispell dictionaries and dicts for numbers too ;)
> Could the synonym dictionary help me on this (reducing unique words)?

Why not? It's useful for words that aren't stemmed correctly.

> thanks,

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
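Oleg's two suggestions, inspecting the indexed lexicon and routing numbers to a special dictionary, can both be sketched with tsearch2's own machinery. The column and configuration names come from this thread; the dictionary name `intdict` and the exact token aliases are assumptions that depend on how the integer dictionary was installed:

```sql
-- 1. Look at the most frequent lexemes to spot junk worth stopping out:
SELECT word, ndoc, nentry
FROM stat('SELECT conteudo_stem_ix FROM documentos')
ORDER BY ndoc DESC, nentry DESC
LIMIT 50;

-- 2. Map integer tokens to the integer dictionary instead of indexing
--    every distinct number as a unique word (dictionary name assumed):
UPDATE pg_ts_cfgmap
SET dict_name = '{intdict}'
WHERE ts_name = 'default_portuguese'
  AND tok_alias IN ('int', 'uint');
```

Changing pg_ts_cfgmap only affects tsvectors computed from then on; the 300 000 existing conteudo_stem_ix values would have to be regenerated (and the GiST index with them) before the word count shrinks.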
Dann Corbit wrote:

>Can you send me an unload of your schema? I will pull it into ERWin/Erx
>and have a look at the model. It is hard for me to picture things
>unless I can see them visually.
>
I need some time to do this; I'll try to send it to you tonight.
Thank you very much for the help.

Regards,

--
Diogo Biazus
diogo@ikono.com.br
http://www.ikono.com.br
> I have a database using tsearch2 to index 300 000 documents.
> I've already have optimized the queries, and the database is vacuumed on
> a daily basis.
> The stat function tells me that my index has aprox. 460 000 unique words
> (I'm using stemmer and a nice stopword list).
> The problem is performance, some queries take more than 10 seconds to
> execute, and I'm not sure if my bottleneck is memory or io.
> The server is a Athlon XP 2000, HD ATA133, 1.5 GB RAM running postgresql
> 7.4.3 over freebsd 5.0 with lots of shared buffers and sort_mem...
>
> Does anyone has an idea of a more cost eficient solution?
> How to get a better performance without having to invest some
> astronomicaly high amount of money?

Can you identify a bottleneck during the 10 seconds? CPU at 100%, memory
swapping like crazy, HDD access non-stop?

I assume you're running the CPU FSB and memory at 266. Moving to an
Athlon 2600 or faster with faster synchronous memory (DDR333 or DDR400)
should give at least a 20% boost if the CPU is the issue. 4 GB of RAM
will be a boost if you are swapping or using all the memory. Make sure
the HDD is a 7200 RPM, 8 MB cache model, or even add a second drive and
RAID them if HDD access is the problem. The high-performance WD Raptor
drive on a suitable SATA controller may give a boost, as would moving to
a fast SCSI drive.

There are also a few hardware tweaks possible: if you are running your
CPU at FSB 266 and memory at 333, you MAY get better performance by
dropping the memory to 266, as the speed loss may be more than offset by
the gains from having synchronous memory.

Run a benchmark (bonnie or hdparm) and check that the hard drive is
performing OK; you should be getting up around 40-50 MB/sec on a good
ATA hard drive.

It all really depends where the bottleneck is....

Cheers,

Brent Wood