Re: Mnogosearch (Was: Re: website doc search is ... ) - Mailing list pgsql-general

From Marc G. Fournier
Subject Re: Mnogosearch (Was: Re: website doc search is ... )
Date
Msg-id 20040101152310.E913@ganymede.hub.org
In response to Re: Mnogosearch (Was: Re: website doc search is ... )  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Mnogosearch (Was: Re: website doc search is ... )  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
On Thu, 1 Jan 2004, Tom Lane wrote:

> "Marc G. Fournier" <scrappy@postgresql.org> writes:
> > I don't see a difference between the two, other than time changes, but
> > that could just be that runA had a server a bit more idle than runB ...
> > something I'm not seeing here?
>
> Well, the difference I was hoping for was a more accurate rows estimate
> for the indexscan, which indeed we got (estimate went from 3210 to
> 16075, vs reality of 15533).  But it didn't change the plan :-(.
>
> Looking more closely, I see the rows estimate for the seqscan on "url"
> is pretty awful too (1968 vs reality of 304811).  I think it would get
> better if you were doing just
>     AND (url.url LIKE 'http://archives.postgresql.org/%%');
> without the concatenation of an empty string.  Is there a reason for the
> concatenation part of the expression?

Believe it or not, the concatenation was based on a discussion *way* back
(two years ago, maybe?) when we first started using Mnogosearch, in which
you suggested going that route ... in fact, at the time (bear in mind, this
was back in the 7.2 days), it actually sped things up ...
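For reference, the two filter forms under discussion look roughly like this (a hedged sketch using the table and column names visible in the plans below; the exact Mnogosearch-generated query is not shown in the thread):

```sql
-- Simplified shape of the join, with the plain LIKE that was suggested;
-- the planner can estimate this from url.url's statistics and rewrite it
-- into an index range scan:
SELECT ndict8.url_id
FROM ndict8
JOIN url ON url.rec_id = ndict8.url_id
WHERE ndict8.word_id = 417851441
  AND url.url LIKE 'http://archives.postgresql.org/%%';

-- versus the older concatenation form, which wraps the column in an
-- expression and so defeats the per-column statistics on url.url:
--   AND (url.url || '' LIKE 'http://archives.postgresql.org/%%')
```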

Ok, with statistics set to 10, we now have:

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..31672.49 rows=1927 width=8) (actual time=117.064..54476.806 rows=13415 loops=1)
   ->  Index Scan using n8_word on ndict8  (cost=0.00..12567.73 rows=3210 width=8) (actual time=80.230..47844.752 rows=15533 loops=1)
         Index Cond: (word_id = 417851441)
   ->  Index Scan using url_rec_id on url  (cost=0.00..5.94 rows=1 width=4) (actual time=0.392..0.398 rows=1 loops=15533)
         Index Cond: (url.rec_id = "outer".url_id)
         Filter: (url ~~ 'http://archives.postgresql.org/%%'::text)
 Total runtime: 54555.011 ms
(7 rows)
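(A hedged sketch of the change made between the two runs, assuming the target column is ndict8.word_id as the plans suggest; the actual commands used aren't shown in the thread:)

```sql
-- Raise the per-column statistics target from the default of 10,
-- then re-gather statistics so the planner sees the new histogram:
ALTER TABLE ndict8 ALTER COLUMN word_id SET STATISTICS 1000;
VACUUM ANALYZE ndict8;
```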

And, at 1000 (with an appropriate VACUUM ANALYZE on ndict8):

                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=91613.33..92959.41 rows=9026 width=8) (actual time=12834.316..16726.018 rows=13415 loops=1)
   Merge Cond: ("outer".url_id = "inner".rec_id)
   ->  Sort  (cost=59770.57..59808.18 rows=15043 width=8) (actual time=776.823..849.798 rows=15533 loops=1)
         Sort Key: ndict8.url_id
         ->  Index Scan using n8_word on ndict8  (cost=0.00..58726.82 rows=15043 width=8) (actual time=0.296..680.139 rows=15533 loops=1)
               Index Cond: (word_id = 417851441)
   ->  Sort  (cost=31842.76..32433.09 rows=236133 width=4) (actual time=12056.594..14159.852 rows=311731 loops=1)
         Sort Key: url.rec_id
         ->  Index Scan using url_url on url  (cost=0.00..10768.79 rows=236133 width=4) (actual time=225.243..8353.024 rows=304811 loops=1)
               Index Cond: ((url >= 'http://archives.postgresql.org/'::text) AND (url < 'http://archives.postgresql.org0'::text))
               Filter: (url ~~ 'http://archives.postgresql.org/%%'::text)
 Total runtime: 16796.932 ms
(12 rows)

Closer to what you were looking/hoping for?

Second run, @1000, shows:

 Total runtime: 12194.016 ms
(12 rows)

Second run, after knocking her back down to 10, shows:

 Total runtime: 58119.150 ms
(7 rows)

So we're definitely improved ... if these are the kind of results you were
hoping to see, then I guess the next step would be to increase the target
and re-analyze all the word_id columns ... what about the url.url column?
Should that be done as well?  What does that setting affect, *just* the
time it takes to analyze the table?  From the verbose output, it looks like
it is scanning more rows on an analyze than at 10 ... is this something
that can be set database-wide, before loading data?  And/or is the current
default just too low?
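(On the "database wide" question, the relevant knob is presumably default_statistics_target; a hedged sketch, not taken from the thread:)

```sql
-- Per session, before loading and analyzing data:
SET default_statistics_target = 1000;

-- Or in postgresql.conf, so new ANALYZEs across the cluster use it:
--   default_statistics_target = 1000
-- Columns changed with ALTER TABLE ... SET STATISTICS keep their
-- explicit per-column target and ignore this default.
```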

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664
