Thread: Full Text Search ideas
I asked recently about a performance problem I'd been having with some full text queries, and got really useful help that pointed me to the root issues. Currently, I'm trying to see whether our document search (running on Oracle Text) can be migrated to PostgreSQL, and the reason I asked that earlier question points to a fundamental design issue we'll have with PostgreSQL that doesn't affect us in Oracle (not, I hasten to add, that that means Oracle is better/right-er/whatever. It's just different - but the difference will cause us a problem).

Consider the following example (which is just one of 40-odd I could have picked).

Some of our documents are in panoramic format, for example. But not many (say, 30,000 out of 10,000,000). We have a flag for 'panoramic', called 'sb12'. It's either 'y' or 'n' for any document, so a search for 'sb12n' (find me all documents which are not panoramic) is logically the same as a search for 'not sb12y'. However, 95% or more of documents will be sb12n, because hardly any documents are panoramic in the first place. So, although the numeric outcome of 'sb12n' and 'not sb12y' will always be the same, you would have to check the entire table to find which ones are 'sb12n' (because most documents are marked that way), whereas you'd only have to check 5% of the records to find 'sb12y', because so few are marked that way.

But in Oracle Text, this doesn't seem to happen:

SQL> select count(*) from search_digital_rm
     where contains(textsearch,'bat and sb12n')>0;

  COUNT(*)
----------
      3040

Elapsed: 00:00:00.10

SQL> select count(*) from search_digital_rm
     where contains(textsearch,'bat not sb12y')>0;

  COUNT(*)
----------
      3040

Elapsed: 00:00:00.06

In both cases, the same number of records is returned and, within a margin of error, the time taken for each test is about the same. Even though the first test must be matching 'sb12n' for many millions of records, it's taking not much longer than the search for 'sb12y', which can only match about 90,000.
It would seem (I can't tell from the explain plan itself) as though the set of 'bat' records has been fetched first (about 8,000 in all). For so few records, whether you're looking for sb12y or sb12n then becomes mostly irrelevant for timing purposes - hence the duration equivalence of both queries.

This is not what happens in PostgreSQL, however (as I now know, thanks to the help on my question from a couple of days ago):

ims=# select count(*) from search_rm
      where to_tsvector('english', textsearch) @@ to_tsquery('english','bat & sb12n');
 count
-------
  3849
(1 row)

Time: 408.962 ms

ims=# select count(*) from search_rm
      where to_tsvector('english', textsearch) @@ to_tsquery('english','bat & !sb12y');
 count
-------
  3849
(1 row)

Time: 11.533 ms

Now, one test takes about 40 times longer than the other, though the one taking just 11ms is as fast as Oracle can manage (impressive, considering I've done absolutely nothing to tune this PostgreSQL testbed as yet!). Logically equivalent the two tests may be, but hunting through lots of sb12n records and working out which are related to bats is apparently a lot slower than finding things the other way around.

I'm wondering firstly if there's any way I can configure PostgreSQL FTS so that it produces the sort of results we've gotten used to from Oracle, i.e. where search times do not go up wildly when a 'search term' is applied that happens to be used by the vast majority of document records. (For example, we currently allow searches for file types, where 80% of documents would be Word documents, another 19% would be PDFs, and the remaining 1% could be pretty much anything else! We can't have people searching for "definitely want only Word documents" if that means matching 8 million records and search times shooting to the stratosphere as a result.)
Secondly, I'm open to any suggestions as to how you would organise things or re-write the SQL so that the "attribute filter" is only applied to the small subset of records which match the relevant "real word" search term, if that's what's needed here. In other words, is my best bet in the earlier examples to fetch *all* "bat" records, and then nest that query within an outer query that adds a test for a separate attribute column? Or is something else called for here?

Would appreciate any thoughts on the subject!

Regards
HJR
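One shape the "separate attribute column" idea could take - a sketch only, with hypothetical column and index names, assuming the rare flag is extracted from the text into a real boolean column:

```sql
-- Hypothetical: pull a frequent/rare flag out of the text index into a real
-- column, so the planner can use ordinary column statistics for it.
ALTER TABLE search_rm ADD COLUMN panoramic boolean;

-- A partial index covering only the rare value stays tiny and cheap.
CREATE INDEX search_rm_panoramic_idx ON search_rm (panoramic)
    WHERE panoramic;

-- The FTS index then only handles the selective "real word" term; the
-- attribute test becomes an ordinary filter on the few matched rows.
SELECT count(*)
FROM   search_rm
WHERE  to_tsvector('english', textsearch) @@ to_tsquery('english', 'bat')
AND    NOT panoramic;
```

With the common value ('not panoramic') expressed as a plain column test rather than an indexed search term, the index never has to enumerate the millions of rows carrying it.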
Howard Rogers <hjr@diznix.com> writes:
> ims=# select count(*) from search_rm
> where to_tsvector('english', textsearch) @@ to_tsquery('english','bat & sb12n');
>  count
> -------
>   3849
> (1 row)
>
> Time: 408.962 ms
>
> ims=# select count(*) from search_rm
> where to_tsvector('english', textsearch) @@ to_tsquery('english','bat & !sb12y');
>  count
> -------
>   3849
> (1 row)
>
> Time: 11.533 ms

Yeah, I imagine that the first of these will involve examining all the index entries for sb12n. There's not a lot of smarts about that inside the GIN index machinery, AFAIK: it'll just fetch all the relevant TIDs for both terms and then AND them.

> I'm wondering firstly if there's any way I can configure PostgreSQL
> FTS so that it produces the sort of results we've gotten used to from
> Oracle, i.e., where search speeds do not go up wildly when a 'search
> term' is applied that happens to be used by the vast majority of
> document records.

If you're willing to split out the search terms that are like this, you could probably get better results with something like

select count(*) from search_rm
where to_tsvector('english', textsearch) @@ to_tsquery('english','bat') AND
      to_tsvector('english', textsearch) @@ to_tsquery('english','sb12n');

That will put it on the optimizer's head as to whether to use the index for one term or both terms.

It might be worth noting that the optimizer will of course not get this right unless it has decent statistics about both search terms --- and there is an as-yet-unreleased patch about tsvector stats gathering:
http://archives.postgresql.org/pgsql-committers/2010-05/msg00360.php
I am not sure that the situation addressed by that patch applies in your case, but it might.

			regards, tom lane
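A toy model (an illustration only, not PostgreSQL's actual code) of the fetch-and-AND behaviour described above, showing why the frequent term dominates the cost even when the result set is tiny:

```python
# Model GIN posting lists as collections of row TIDs. For "bat & sb12n" the
# index fetches the *complete* posting list of each term and intersects them,
# so the work done tracks the frequent term's list, not the result size.

def gin_and(posting_lists):
    """Intersect all posting lists, counting how many TIDs were examined."""
    examined = sum(len(p) for p in posting_lists)  # every TID gets fetched
    result = set(posting_lists[0])
    for p in posting_lists[1:]:
        result &= set(p)
    return result, examined

# 10,000 documents: 'sb12n' on ~95% of them, 'sb12y' on ~5%,
# 'bat' on a small scattering (hypothetical distributions).
docs = range(10_000)
sb12y = {d for d in docs if d % 20 == 0}   # ~5%: panoramic
sb12n = {d for d in docs if d % 20 != 0}   # ~95%: not panoramic
bat   = {d for d in docs if d % 37 == 0}

# 'bat & sb12n' must walk the huge sb12n posting list...
hits_n, cost_n = gin_and([list(bat), list(sb12n)])

# ...while 'bat & !sb12y' only touches the two small lists.
hits_y, cost_y = (set(bat) - sb12y), len(bat) + len(sb12y)

assert hits_n == hits_y          # logically equivalent results
print(cost_n, cost_y)            # the first is roughly 13x the second
```

Same answer both ways, but the TIDs examined differ by an order of magnitude - which matches the 409ms vs 11.5ms timings in the quoted queries.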
On Mon, Jul 19, 2010 at 6:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If you're willing to split out the search terms that are like this,
> you could probably get better results with something like [...]
> That will put it on the optimizer's head as to whether to use the index
> for one term or both terms.

Thanks, Tom.
The breaking out into separate search terms does make a difference, but not much:

ims=# select count(*) from search_rm
      where to_tsvector('english',textsearch) @@ to_tsquery('english','bat & sb12n');
 count
-------
  3849
(1 row)

Time: 413.329 ms

ims=# select count(*) from search_rm
ims-# where to_tsvector('english',textsearch) @@ to_tsquery('english','bat') AND
ims-#       to_tsvector('english',textsearch) @@ to_tsquery('english','sb12n');
 count
-------
  3849
(1 row)

Time: 352.583 ms

So it's shaving about a sixth of the time off, which isn't bad, but not spectacularly good either!

I'd also thought of trying something like this:

ims=# select count(*) from
      (
        select * from search_rm where
        to_tsvector('english',textsearch) @@ to_tsquery('english','bat')
      ) as core
      where to_tsvector('english',textsearch) @@ to_tsquery('english','sb12n');
 count
-------
  3849
(1 row)

Time: 357.248 ms

...in the hope that the sb12n test would only be applied to the set of 'bat' records acquired by the inner query. But as you can tell from the time, that's not particularly better or worse than your suggestion (bearing in mind that 'bat' on its own is a 12ms search).

I'm currently constructing a separate column containing a single bitmask value for about 15 of the 45 attributes, just to see if evaluating the bits with a bitand test for the bat records is faster than trying to FTS them in the first place. Something like:

select count(*) from
(
  select * from search_rm where
  to_tsvector('english',textsearch) @@ to_tsquery('english','bat')
) as core
where bitand(searchbits,4096)>0;

But it's taking a while to get that extra column constructed in the original table! Fingers crossed, because if not, it's all a bit of a show-stopper for our migration effort, I think. :-(

Regards & thanks
HJR
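One note on the sketch above: bitand() is an Oracle function, not a PostgreSQL built-in - PostgreSQL's native integer bitwise AND operator is `&`. Assuming the searchbits column described above, the test could be written like this:

```sql
-- PostgreSQL spelling of the bitmask test: searchbits & 4096 keeps only the
-- tested bit, so the comparison checks whether that attribute bit is set.
SELECT count(*)
FROM  (SELECT *
       FROM   search_rm
       WHERE  to_tsvector('english', textsearch) @@ to_tsquery('english', 'bat')
      ) AS core
WHERE (searchbits & 4096) <> 0;
```

The parentheses around the bitwise expression matter, since `&` binds more loosely than the comparison operators.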
On 19 July 2010 01:46, Howard Rogers <hjr@diznix.com> wrote:
Hi Howard,

As well as trying the bitand test, have you tried a plain %like% or a regex match/extraction on the results of performing the FTS search purely on the "real word" search terms? I'm guessing it would involve more calculation than the bitand approach, but it might require less maintenance. And, along the direction you're already heading, it places the emphasis on refining candidate matches rather than on retrieving better matches in the first instance - perhaps more so for non-exhaustive searching.

Regards,
Steve
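A sketch of that refinement idea - hypothetical only, and assuming the attribute codes still appear as plain text within the textsearch column:

```sql
-- FTS narrows the candidates to the few thousand 'bat' rows; a pattern test
-- then refines them, so the attribute check touches only the candidate rows.
SELECT count(*)
FROM  (SELECT textsearch
       FROM   search_rm
       WHERE  to_tsvector('english', textsearch) @@ to_tsquery('english', 'bat')
      ) AS candidates
WHERE textsearch ~* '\msb12n\M';   -- \m and \M are PostgreSQL's
                                   -- word-boundary regex escapes
```

Whether the planner actually evaluates the regex only on the FTS matches depends on how it flattens the subquery, so this would want checking with EXPLAIN ANALYZE.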
It's doable, but requires a lot of work. We need support for this.

Oleg

On Sun, 18 Jul 2010, Howard Rogers wrote:
> I'm wondering firstly if there's any way I can configure PostgreSQL
> FTS so that it produces the sort of results we've gotten used to from
> Oracle, i.e., where search speeds do not go up wildly when a 'search
> term' is applied that happens to be used by the vast majority of
> document records.
> [...]

 	Regards,
 		Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general