Re: Full Text Search ideas - Mailing list pgsql-general
From | Oleg Bartunov |
---|---|
Subject | Re: Full Text Search ideas |
Date | |
Msg-id | Pine.LNX.4.64.1007201458380.32129@sn.sai.msu.ru Whole thread Raw |
In response to | Full Text Search ideas (Howard Rogers <hjr@diznix.com>) |
List | pgsql-general |
It's doable. but requires a lot of work. We need support for this. Oleg On Sun, 18 Jul 2010, Howard Rogers wrote: > I asked recently about a performance problem I'd been having with some > full text queries, and got really useful help that pointed me to the > root issues. Currently, I'm trying to see if our document search > (running on Oracle Text) can be migrated to PostgreSQL, and the reason > I asked that earlier question points to a fundamental design issue > we'll have with PostgreSQL that doesn't affect us in Oracle (not, I > hasten to add, that that means Oracle is better/right-er/whatever. > It's just different -but the difference will cause us a problem). > > Consider the following example (which is just one of 40-odd I could > have picked). > > Some of our documents are in panoramic format, for example. But not > many (say, 30,000 out of 10,000,000). We have a flag for 'panoramic', > called 'sb12'. It's either 'y' or 'n' for any document. So a search > for 'sb12n' (find me all documents which are not panoramic) is > logically the same as a search for 'not sb12y'. However, 95% or more > of documents will be an sb12n, because hardly any documents are > panoramic in the first place. So. although the numeric outcome of > 'sb12n' and 'not sb12y' will always be the same, you would have to > check the entire table to find which ones are 'sb12n' (because most > documents are marked that way), whereas you'd only have to check the > 5% of records to find 'sb12y', because so few are marked that way. > > But in Oracle Text, this doesn't seem to happen: > > SQL> select count(*) from search_digital_rm where > contains(textsearch,'bat and sb12n')>0; > > COUNT(*) > ---------- > 3040 > > Elapsed: 00:00:00.10 > > SQL> select count(*) from search_digital_rm where > contains(textsearch,'bat not sb12y')>0; > > COUNT(*) > ---------- > 3040 > > Elapsed: 00:00:00.06 > > In both cases, the same number of records are returned. But, within a > margin of error, the time taken to do each test is about the same. > Even though the first test must be matching 'sb12n' for many millions > of records, it's taking not much longer than the search for 'sb12y', > which can only match about 90,000. It would seem (I can't tell from > the explain plan itself) as though what's happened is that the set of > 'bat' records has been fetched first (about 8000 in all). For so few > records, whether you're looking for sb12y or sb12n then becomes mostly > irrelevant for timing purposes, and hence the duration equivalence of > both queries. > > This is not what happens in PostgreSQL, however (as I now know, thanks > to the help on my question from a couple of days ago): > > ims=# select count(*) from search_rm > where to_tsvector('english', textsearch) @@ to_tsquery('english','bat & sb12n'); > count > ------- > 3849 > (1 row) > > Time: 408.962 ms > > ims=# select count(*) from search_rm > where to_tsvector('english', textsearch) @@ to_tsquery('english','bat > & !sb12y'); > count > ------- > 3849 > (1 row) > > Time: 11.533 ms > > Now, one test takes about 40 times longer than the other, though the > one taking just 11ms is as fast as Oracle can manage (impressive, > considering I've done absolutely nothing to tune this PostgreSQL > testbed as yet!). Logically equivalent the two tests may be, but > hunting through lots of sb12n records and working out which are > related to bats is apparently a lot slower than finding things the > other way around, it would seem. > > I'm wondering firstly if there's any way I can configure PostgreSQL > FTS so that it produces the sort of results we've gotten used to from > Oracle, i.e., where search speeds do not go up wildly when a 'search > term' is applied that happens to be used by the vast majority of > document records. (For example, we currently allows searches for file > types, where 80% of documents would be "word documents", another 19% > would be PDFs and the remaining 1% of documents could be pretty much > anything else! We can't have people searching for "definitely want > only Word documents" if that means matching 8 million records and > search speeds shoot to the stratosphere as a result). > > Secondly, I'm open to any suggestions as to how you would organise > things or re-write the SQL so that the "attribute filter" is only > applied to the small subset of records which match the relevant "real > word" search term, if that's what's needed here. In other words, is my > best bet in the earlier examples to fetch *all* "bat" records, and > then nest that query within an outer query that adds a test for a > separate attribute column? Or is something else called for here? > > Would appreciate any thoughts on the subject! > > Regards > HJR > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
pgsql-general by date: