Re: Mnogosearch (Was: Re: website doc search is ... ) - Mailing list pgsql-general

From Dave Cramer
Subject Re: Mnogosearch (Was: Re: website doc search is ... )
Date
Msg-id 1072922687.1946.23.camel@localhost.localdomain
Whole thread Raw
In response to Mnogosearch (Was: Re: website doc search is ... )  ("Marc G. Fournier" <scrappy@postgresql.org>)
Responses Re: Mnogosearch (Was: Re: website doc search is ... )  ("Marc G. Fournier" <scrappy@postgresql.org>)
List pgsql-general
What is the locale of the database?

like won't use an index, unless it is 'C' locale, or you use 7.4 and
change the operator of the index.

Dave
On Wed, 2003-12-31 at 20:49, Marc G. Fournier wrote:
> On Wed, 31 Dec 2003, Bruce Momjian wrote:
>
> > Marc G. Fournier wrote:
> > > On Wed, 31 Dec 2003, Bruce Momjian wrote:
> > >
> > > > > Out of everything I've found so far, mnogosearch is one of the best ... I
> > > > > just wish I could figure out where the bottleneck for it was, since, from
> > > > > reading their docs, their method of storing the data doesn't appear to be
> > > > > particularly off.  I'm tempted to try their caching storage manager, and
> > > > > getting away from SQL totally, but I *really* want to showcase PostgreSQL
> > > > > on this :(
> > > >
> > > > Well, PostgreSQL is being un-showcased in the current setup, that's for
> > > > sure.  :-(
> > >
> > > Agreed ... I could install the MySQL backend, whichits designed for, and
> > > advertise it as PostgreSQL? :)
> >
> > I would be curious to know if it is faster --- that would tell use if it
> > is tuned only for MySQL.
> >
> > Have you tried CLUSTER?  I think the MySQL ISAM files are
> > auto-clustered, and clustering is usually important for full-text
> > searches.
>
> Actually, check out http://www.mnogosearch.com ... the way they do the
> indexing doesn't (at least, as far as I can tell) make use of full-text
> searching.  Simplistically, it appears to take the web page, sort -u all
> the words it finds, removes all 'stopwords' (and, the, in, etc) from the
> result, and then dumps the resultant words to the database, link'd to the
> URL ...
>
> We're using crc-multi, so a CRC value of the word is what is stored in the
> database, not the actual word itself ... the '-multi' part spreads the
> words across several tables depending on the word size, to keep total # of
> rows down ...
>
> The slow part on the database is finding those words, as can be seen by
> the following search on 'SECURITY INVOKER':
>
> Jan  1 01:21:05 pgsql74 postgres[59959]: [44-1] LOG:  statement: SELECT ndict8.url_id,ndict8.intag FROM ndict8, url
WHEREndict8.word_id=417851441 AND url.rec_id=ndict8 
> .url_id
> Jan  1 01:21:05 pgsql74 postgres[59959]: [44-2]  AND ((url.url || '') LIKE 'http://archives.postgresql.org/%%')
> Jan  1 01:22:00 pgsql74 postgres[59959]: [45-1] LOG:  duration: 55015.644 ms
> Jan  1 01:22:00 pgsql74 postgres[59959]: [46-1] LOG:  statement: SELECT ndict7.url_id,ndict7.intag FROM ndict7, url
WHEREndict7.word_id=-509484498 AND url.rec_id=ndict 
> 7.url_id
> Jan  1 01:22:00 pgsql74 postgres[59959]: [46-2]  AND ((url.url || '') LIKE 'http://archives.postgresql.org/%%')
> Jan  1 01:22:01 pgsql74 postgres[59959]: [47-1] LOG:  duration: 1167.407 ms
>
> ndict8 looks like:
>
> 186_archives=# select count(1) from ndict8;
>   count
> ---------
>  6320380
> (1 row)
> rchives=# select count(1) from ndict8 where word_id=417851441;
>  count
> -------
>  15532
> (1 row)
>
> 186_archives=# \d ndict8
>          Table "public.ndict8"
>  Column  |  Type   |     Modifiers
> ---------+---------+--------------------
>  url_id  | integer | not null default 0
>  word_id | integer | not null default 0
>  intag   | integer | not null default 0
> Indexes:
>     "n8_url" btree (url_id)
>     "n8_word" btree (word_id)
>
>
> and ndict7 looks like:
>
> 186_archives=# select count(1) from ndict7;
>   count
> ---------
>  8400616
> (1 row)
> 186_archives=# select count(1) from ndict7 where word_id=-509484498;
>  count
> -------
>    333
> (1 row)
>
> 186_archives=# \d ndict7
>          Table "public.ndict7"
>  Column  |  Type   |     Modifiers
> ---------+---------+--------------------
>  url_id  | integer | not null default 0
>  word_id | integer | not null default 0
>  intag   | integer | not null default 0
> Indexes:
>     "n7_url" btree (url_id)
>     "n7_word" btree (word_id)
>
>
> The slowdown is the LIKE condition, as the ndict[78] word_id conditions
> return near instantly when run individually, and when I run the 'url/LIKE'
> condition, it takes "forever" ...
>
> 186_archives-# ;
>  count
> --------
>  304811
> (1 row)
>
> 186_archives=# explain analyze select count(1) from url where ((url.url || '') LIKE
'http://archives.postgresql.org/%%');
>                                                     QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=93962.19..93962.19 rows=1 width=0) (actual time=5833.084..5833.088 rows=1 loops=1)
>    ->  Seq Scan on url  (cost=0.00..93957.26 rows=1968 width=0) (actual time=0.069..4387.378 rows=304811 loops=1)
>          Filter: ((url || ''::text) ~~ 'http://archives.postgresql.org/%%'::text)
>  Total runtime: 5833.179 ms
> (4 rows)
>
>
> Hrmmm ... I don't have much (any) experience with tsearch, but could it be
> used to replace the LIKE?  Then again, when its returning 300k rows out of
> 393k, it wouldn't help much on the above, would it?
>
> The full first query:
>
> SELECT ndict8.url_id,ndict8.intag
>   FROM ndict8, url
>  WHERE ndict8.word_id=417851441
>    AND url.rec_id=ndict8.url_id
>    AND ((url.url || '') LIKE 'http://archives.postgresql.org/%%');
>
> returns 13415 rows, and explain analyze shows:
>
>
-----------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=0.00..30199.82 rows=17 width=8) (actual time=0.312..1459.504 rows=13415 loops=1)
>    ->  Index Scan using n8_word on ndict8  (cost=0.00..12616.09 rows=3219 width=8) (actual time=0.186..387.673
rows=15532loops=1) 
>          Index Cond: (word_id = 417851441)
>    ->  Index Scan using url_rec_id on url  (cost=0.00..5.45 rows=1 width=4) (actual time=0.029..0.050 rows=1
loops=15532)
>          Index Cond: (url.rec_id = "outer".url_id)
>          Filter: ((url || ''::text) ~~ 'http://archives.postgresql.org/%%'::text)
>  Total runtime: 1520.145 ms
> (7 rows)
>
> Which, of course, doesn't come close to matching what the duration showed
> in the original, most likely due to catching :(
>
> The server that the database is on rarely jumps abov a loadavg of 1, isn't
> using any swap (after 77 days up, used swap is 0% -or- 17meg) and the
> database itself is on a strip'd file system ...
>
> I'm open to ideas/things to try here ...
>
> The whole 'process' of the search shows the following times for the
> queries:
>
> pgsql74# grep 59959 /var/log/pgsql | grep duration
> Jan  1 01:21:05 pgsql74 postgres[59959]: [39-1] LOG:  duration: 25.663 ms
> Jan  1 01:21:05 pgsql74 postgres[59959]: [41-1] LOG:  duration: 4.376 ms
> Jan  1 01:21:05 pgsql74 postgres[59959]: [43-1] LOG:  duration: 11.179 ms
> Jan  1 01:22:00 pgsql74 postgres[59959]: [45-1] LOG:  duration: 55015.644 ms
> Jan  1 01:22:01 pgsql74 postgres[59959]: [47-1] LOG:  duration: 1167.407 ms
> Jan  1 01:22:01 pgsql74 postgres[59959]: [49-1] LOG:  duration: 7.886 ms
> Jan  1 01:22:01 pgsql74 postgres[59959]: [51-1] LOG:  duration: 1.516 ms
> Jan  1 01:22:01 pgsql74 postgres[59959]: [53-1] LOG:  duration: 3.539 ms
> Jan  1 01:22:01 pgsql74 postgres[59959]: [55-1] LOG:  duration: 109.890 ms
> Jan  1 01:22:01 pgsql74 postgres[59959]: [57-1] LOG:  duration: 15.582 ms
> Jan  1 01:22:01 pgsql74 postgres[59959]: [59-1] LOG:  duration: 1.631 ms
> Jan  1 01:22:01 pgsql74 postgres[59959]: [61-1] LOG:  duration: 0.838 ms
> Jan  1 01:22:01 pgsql74 postgres[59959]: [63-1] LOG:  duration: 2.148 ms
> Jan  1 01:22:01 pgsql74 postgres[59959]: [65-1] LOG:  duration: 0.810 ms
> Jan  1 01:22:01 pgsql74 postgres[59959]: [67-1] LOG:  duration: 1.211 ms
> Jan  1 01:22:01 pgsql74 postgres[59959]: [69-1] LOG:  duration: 0.798 ms
> Jan  1 01:22:01 pgsql74 postgres[59959]: [71-1] LOG:  duration: 0.861 ms
> Jan  1 01:22:01 pgsql74 postgres[59959]: [73-1] LOG:  duration: 0.748 ms
> Jan  1 01:22:01 pgsql74 postgres[59959]: [75-1] LOG:  duration: 0.555 ms
>
> With the two >1000ms queries being the above two ndict[78] queries ...
>
> Doing two subsequent searches, on "setuid functions" and "privilege
> rules", just so that caching isn't involved, shows pretty much the same
> distribution:
>
> grep 61697 /var/log/pgsql | grep duration
> Jan  1 01:44:25 pgsql74 postgres[61697]: [41-1] LOG:  duration: 1.244 ms
> Jan  1 01:44:25 pgsql74 postgres[61697]: [43-1] LOG:  duration: 21.868 ms
> Jan  1 01:44:25 pgsql74 postgres[61697]: [45-1] LOG:  duration: 17.956 ms
> Jan  1 01:44:29 pgsql74 postgres[61697]: [47-1] LOG:  duration: 4452.326 ms
> Jan  1 01:44:57 pgsql74 postgres[61697]: [49-1] LOG:  duration: 27992.581 ms
> Jan  1 01:44:58 pgsql74 postgres[61697]: [51-1] LOG:  duration: 357.158 ms
> Jan  1 01:44:58 pgsql74 postgres[61697]: [53-1] LOG:  duration: 1.338 ms
> Jan  1 01:44:58 pgsql74 postgres[61697]: [55-1] LOG:  duration: 11.438 ms
> Jan  1 01:44:58 pgsql74 postgres[61697]: [57-1] LOG:  duration: 63.389 ms
> Jan  1 01:44:58 pgsql74 postgres[61697]: [59-1] LOG:  duration: 134.941 ms
> Jan  1 01:44:58 pgsql74 postgres[61697]: [61-1] LOG:  duration: 0.570 ms
> Jan  1 01:44:58 pgsql74 postgres[61697]: [63-1] LOG:  duration: 0.489 ms
> Jan  1 01:44:58 pgsql74 postgres[61697]: [65-1] LOG:  duration: 0.477 ms
> Jan  1 01:44:58 pgsql74 postgres[61697]: [67-1] LOG:  duration: 0.470 ms
> Jan  1 01:44:58 pgsql74 postgres[61697]: [69-1] LOG:  duration: 0.471 ms
> Jan  1 01:44:58 pgsql74 postgres[61697]: [71-1] LOG:  duration: 0.468 ms
> Jan  1 01:44:58 pgsql74 postgres[61697]: [73-1] LOG:  duration: 0.473 ms
> Jan  1 01:44:58 pgsql74 postgres[61697]: [75-1] LOG:  duration: 0.466 ms
> Jan  1 01:44:58 pgsql74 postgres[61697]: [77-1] LOG:  duration: 0.469 ms
> Jan  1 01:44:58 pgsql74 postgres[61697]: [79-1] LOG:  duration: 0.515 ms
>
> and:
>
> grep 61869 /var/log/pgsql | grep duration
> Jan  1 01:46:50 pgsql74 postgres[61869]: [41-1] LOG:  duration: 19.776 ms
> Jan  1 01:46:50 pgsql74 postgres[61869]: [43-1] LOG:  duration: 58.352 ms
> Jan  1 01:46:50 pgsql74 postgres[61869]: [45-1] LOG:  duration: 0.897 ms
> Jan  1 01:46:53 pgsql74 postgres[61869]: [47-1] LOG:  duration: 2859.331 ms
> Jan  1 01:47:47 pgsql74 postgres[61869]: [49-1] LOG:  duration: 54774.241 ms
> Jan  1 01:47:47 pgsql74 postgres[61869]: [51-1] LOG:  duration: 14.926 ms
> Jan  1 01:47:47 pgsql74 postgres[61869]: [53-1] LOG:  duration: 1.502 ms
> Jan  1 01:47:47 pgsql74 postgres[61869]: [55-1] LOG:  duration: 3.865 ms
> Jan  1 01:47:48 pgsql74 postgres[61869]: [57-1] LOG:  duration: 110.435 ms
> Jan  1 01:47:48 pgsql74 postgres[61869]: [59-1] LOG:  duration: 0.646 ms
> Jan  1 01:47:48 pgsql74 postgres[61869]: [61-1] LOG:  duration: 0.503 ms
> Jan  1 01:47:48 pgsql74 postgres[61869]: [63-1] LOG:  duration: 0.498 ms
> Jan  1 01:47:48 pgsql74 postgres[61869]: [65-1] LOG:  duration: 0.484 ms
> Jan  1 01:47:48 pgsql74 postgres[61869]: [67-1] LOG:  duration: 0.487 ms
> Jan  1 01:47:48 pgsql74 postgres[61869]: [69-1] LOG:  duration: 0.478 ms
> Jan  1 01:47:48 pgsql74 postgres[61869]: [71-1] LOG:  duration: 0.479 ms
> Jan  1 01:47:48 pgsql74 postgres[61869]: [73-1] LOG:  duration: 0.480 ms
> Jan  1 01:47:48 pgsql74 postgres[61869]: [75-1] LOG:  duration: 0.478 ms
> Jan  1 01:47:48 pgsql74 postgres[61869]: [77-1] LOG:  duration: 0.477 ms
>
> So it looks like its those joins that are really killing things ...
>
> Note that I haven't made many changes to the postgresql.conf file, so
> there might be something really obvious I've overlooked, but here are the
> uncommented ones (ie. ones I've modified from defaults):
>
> tcpip_socket = true
> max_connections = 512
> shared_buffers = 10000          # min 16, at least max_connections*2, 8KB each
> sort_mem = 10240                # min 64, size in KB
> vacuum_mem = 81920              # min 1024, size in KB
> syslog = 2                      # range 0-2; 0=stdout; 1=both; 2=syslog
> syslog_facility = 'LOCAL0'
> syslog_ident = 'postgres'
> log_connections = true
> log_duration = false
> log_statement = false
> lc_messages = 'C'               # locale for system error message strings
> lc_monetary = 'C'               # locale for monetary formatting
> lc_numeric = 'C'                # locale for number formatting
> lc_time = 'C'                   # locale for time formatting
>
>
> ----
> Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
> Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
>                http://archives.postgresql.org
>
--
Dave Cramer
519 939 0336
ICQ # 1467551


pgsql-general by date:

Previous
From: "Marc G. Fournier"
Date:
Subject: Re: website doc search is extremely SLOW
Next
From: "Marc G. Fournier"
Date:
Subject: Re: Mnogosearch (Was: Re: website doc search is ... )