Thread: Full Text Search ideas

Full Text Search ideas

From
Howard Rogers
Date:
I asked recently about a performance problem I'd been having with some
full text queries, and got really useful help that pointed me to the
root issues. Currently, I'm trying to see if our document search
(running on Oracle Text) can be migrated to PostgreSQL, and the reason
I asked that earlier question points to a fundamental design issue
we'll have with PostgreSQL that doesn't affect us in Oracle (not, I
hasten to add, that that means Oracle is better/right-er/whatever.
It's just different - but the difference will cause us a problem).

Consider the following example (which is just one of 40-odd I could
have picked).

Some of our documents are in panoramic format, for example. But not
many (say, 30,000 out of 10,000,000). We have a flag for 'panoramic',
called 'sb12'. It's either 'y' or 'n' for any document. So a search
for 'sb12n' (find me all documents which are not panoramic) is
logically the same as a search for 'not sb12y'. However, 95% or more
of documents will be an sb12n, because hardly any documents are
panoramic in the first place. So, although the numeric outcome of
'sb12n' and 'not sb12y' will always be the same, you would have to
check the entire table to find which ones are 'sb12n' (because most
documents are marked that way), whereas you'd only have to check the
5% of records to find 'sb12y', because so few are marked that way.

But in Oracle Text, this doesn't seem to happen:

SQL> select count(*) from search_digital_rm where
contains(textsearch,'bat and sb12n')>0;

 COUNT(*)
----------
     3040

Elapsed: 00:00:00.10

SQL> select count(*) from search_digital_rm where
contains(textsearch,'bat not sb12y')>0;

 COUNT(*)
----------
     3040

Elapsed: 00:00:00.06

In both cases, the same number of records are returned. But, within a
margin of error, the time taken to do each test is about the same.
Even though the first test must be matching 'sb12n' for many millions
of records, it's taking not much longer than the search for 'sb12y',
which can only match about 90,000. It would seem (I can't tell from
the explain plan itself) as though what's happened is that the set of
'bat' records has been fetched first (about 8000 in all). For so few
records, whether you're looking for sb12y or sb12n then becomes mostly
irrelevant for timing purposes, and hence the duration equivalence of
both queries.

This is not what happens in PostgreSQL, however (as I now know, thanks
to the help on my question from a couple of days ago):

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch) @@ to_tsquery('english','bat & sb12n');
count
-------
 3849
(1 row)

Time: 408.962 ms

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch) @@ to_tsquery('english','bat
& !sb12y');
count
-------
 3849
(1 row)

Time: 11.533 ms

Now, one test takes about 40 times longer than the other, though the
one taking just 11ms is as fast as Oracle can manage (impressive,
considering I've done absolutely nothing to tune this PostgreSQL
testbed as yet!). Logically equivalent the two tests may be, but
hunting through lots of sb12n records and working out which are
related to bats is apparently a lot slower than finding things the
other way around.

I'm wondering firstly if there's any way I can configure PostgreSQL
FTS so that it produces the sort of results we've gotten used to from
Oracle, i.e., where search speeds do not go up wildly when a 'search
term' is applied that happens to be used by the vast majority of
document records. (For example, we currently allow searches for file
types, where 80% of documents would be "Word documents", another 19%
would be PDFs and the remaining 1% of documents could be pretty much
anything else! We can't have people searching for "definitely want
only Word documents" if that means matching 8 million records and
search speeds shoot to the stratosphere as a result).

Secondly, I'm open to any suggestions as to how you would organise
things or re-write the SQL so that the "attribute filter" is only
applied to the small subset of records which match the relevant "real
word" search term, if that's what's needed here. In other words, is my
best bet in the earlier examples to fetch *all* "bat" records, and
then nest that query within an outer query that adds a test for a
separate attribute column? Or is something else called for here?
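To make the nesting idea concrete, I had in mind something roughly
like this (a sketch only - 'is_panoramic' here stands for a
hypothetical boolean attribute column we'd have to add, not anything
in our current schema):

select count(*) from
(
  select * from search_rm where
  to_tsvector('english', textsearch) @@ to_tsquery('english','bat')
) as core
where core.is_panoramic = false;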

Would appreciate any thoughts on the subject!

Regards
HJR

Re: Full Text Search ideas

From
Tom Lane
Date:
Howard Rogers <hjr@diznix.com> writes:
> ims=# select count(*) from search_rm
> where to_tsvector('english', textsearch) @@ to_tsquery('english','bat & sb12n');
> count
> -------
>  3849
> (1 row)

> Time: 408.962 ms

> ims=# select count(*) from search_rm
> where to_tsvector('english', textsearch) @@ to_tsquery('english','bat & !sb12y');
> count
> -------
>  3849
> (1 row)

> Time: 11.533 ms

Yeah, I imagine that the first of these will involve examining all the
index entries for sb12n.  There's not a lot of smarts about that inside
the GIN index machinery, AFAIK: it'll just fetch all the relevant TIDs
for both terms and then AND them.

> I'm wondering firstly if there's any way I can configure PostgreSQL
> FTS so that it produces the sort of results we've gotten used to from
> Oracle, i.e., where search speeds do not go up wildly when a 'search
> term' is applied that happens to be used by the vast majority of
> document records.

If you're willing to split out the search terms that are like this,
you could probably get better results with something like

select count(*) from search_rm
where to_tsvector('english', textsearch) @@ to_tsquery('english','bat') AND
      to_tsvector('english', textsearch) @@ to_tsquery('english','sb12n');

That will put it on the optimizer's head as to whether to use the index
for one term or both terms.

It might be worth noting that the optimizer will of course not get this
right unless it has decent statistics about both search terms --- and
there is an as-yet-unreleased patch about tsvector stats gathering:
http://archives.postgresql.org/pgsql-committers/2010-05/msg00360.php
I am not sure that the situation addressed by that patch applies in
your case, but it might.
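(Those stats are only as good as the last ANALYZE, of course, so
before drawing conclusions from the split-out form, make sure

ANALYZE search_rm;

has been run since the bulk of the data went in.)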

            regards, tom lane

Re: Full Text Search ideas

From
Howard Rogers
Date:
On Mon, Jul 19, 2010 at 6:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Howard Rogers <hjr@diznix.com> writes:
>> ims=# select count(*) from search_rm
>> where to_tsvector('english', textsearch) @@ to_tsquery('english','bat & sb12n');
>> count
>> -------
>>  3849
>> (1 row)
>
>> Time: 408.962 ms
>
>> ims=# select count(*) from search_rm
>> where to_tsvector('english', textsearch) @@ to_tsquery('english','bat & !sb12y');
>> count
>> -------
>>  3849
>> (1 row)
>
>> Time: 11.533 ms
>
> Yeah, I imagine that the first of these will involve examining all the
> index entries for sb12n.  There's not a lot of smarts about that inside
> the GIN index machinery, AFAIK: it'll just fetch all the relevant TIDs
> for both terms and then AND them.
>
>> I'm wondering firstly if there's any way I can configure PostgreSQL
>> FTS so that it produces the sort of results we've gotten used to from
>> Oracle, i.e., where search speeds do not go up wildly when a 'search
>> term' is applied that happens to be used by the vast majority of
>> document records.
>
> If you're willing to split out the search terms that are like this,
> you could probably get better results with something like
>
> select count(*) from search_rm
> where to_tsvector('english', textsearch) @@ to_tsquery('english','bat') AND
>      to_tsvector('english', textsearch) @@ to_tsquery('english','sb12n');
>
> That will put it on the optimizer's head as to whether to use the index
> for one term or both terms.
>
> It might be worth noting that the optimizer will of course not get this
> right unless it has decent statistics about both search terms --- and
> there is an as-yet-unreleased patch about tsvector stats gathering:
> http://archives.postgresql.org/pgsql-committers/2010-05/msg00360.php
> I am not sure that the situation addressed by that patch applies in
> your case, but it might.
>
>                        regards, tom lane
>

Thanks, Tom.

The breaking out into separate search terms does make a difference,
but not much:

ims=# select count(*) from search_rm
where to_tsvector('english',textsearch) @@ to_tsquery('english','bat & sb12n');
 count
-------
  3849
(1 row)

Time: 413.329 ms

ims=# select count(*) from search_rm
ims-# where to_tsvector('english',textsearch) @@ to_tsquery('english','bat') AND
ims-# to_tsvector('english',textsearch) @@ to_tsquery('english','sb12n');
 count
-------
  3849
(1 row)

Time: 352.583 ms

So it's shaving about a sixth of the time off, which isn't bad, but
not spectacularly good either!

I'd also thought of trying something like this:

ims=# select count(*) from
(
  select * from search_rm where
  to_tsvector('english',textsearch) @@ to_tsquery('english','bat')
) as core
where to_tsvector('english',textsearch) @@ to_tsquery('english','sb12n');

 count
-------
  3849
(1 row)

Time: 357.248 ms

...in the hope that the sb12n test would only be applied to the set of
'bat' records acquired by the inner query. But as you can tell from
the time, that's not particularly better or worse than your suggestion
(bearing in mind that 'bat' on its own is a 12ms search).
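One variant I haven't actually tried yet: as I understand it, the
planner will often flatten a sub-select like that back into a single
WHERE clause, so the inner query may never really be evaluated
"first". Adding OFFSET 0 to the inner query is, I gather, the usual
trick to prevent that flattening - something like (untested):

ims=# select count(*) from
(
  select * from search_rm where
  to_tsvector('english',textsearch) @@ to_tsquery('english','bat')
  offset 0
) as core
where to_tsvector('english',textsearch) @@ to_tsquery('english','sb12n');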

I'm currently constructing a separate column containing a single
bitmask value for about 15 of the 45 attributes, just to see if
evaluating the bits with a bitand test for the bat records is faster
than trying to FTS them in the first place. Something like

select count(*) from
   (
  select * from search_rm where
  to_tsvector('english',textsearch) @@ to_tsquery('english','bat')
   ) as core
where (searchbits & 4096) <> 0;  -- & is PostgreSQL's built-in bitwise AND

But it's taking a while to get that extra column constructed in the
original table!

Fingers crossed, because if not, it's all a bit of a show-stopper for
our migration effort, I think. :-(

Regards & thanks
HJR

Re: Full Text Search ideas

From
Steve Grey
Date:
On 19 July 2010 01:46, Howard Rogers <hjr@diznix.com> wrote:
On Mon, Jul 19, 2010 at 6:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Howard Rogers <hjr@diznix.com> writes:
>> ims=# select count(*) from search_rm
>> where to_tsvector('english', textsearch) @@ to_tsquery('english','bat & sb12n');
>> count
>> -------
>>  3849
>> (1 row)
>
>> Time: 408.962 ms
>
>> ims=# select count(*) from search_rm
>> where to_tsvector('english', textsearch) @@ to_tsquery('english','bat & !sb12y');
>> count
>> -------
>>  3849
>> (1 row)
>
>> Time: 11.533 ms
>
> Yeah, I imagine that the first of these will involve examining all the
> index entries for sb12n.  There's not a lot of smarts about that inside
> the GIN index machinery, AFAIK: it'll just fetch all the relevant TIDs
> for both terms and then AND them.
>
>> I'm wondering firstly if there's any way I can configure PostgreSQL
>> FTS so that it produces the sort of results we've gotten used to from
>> Oracle, i.e., where search speeds do not go up wildly when a 'search
>> term' is applied that happens to be used by the vast majority of
>> document records.
>
> If you're willing to split out the search terms that are like this,
> you could probably get better results with something like
>
> select count(*) from search_rm
> where to_tsvector('english', textsearch) @@ to_tsquery('english','bat') AND
>      to_tsvector('english', textsearch) @@ to_tsquery('english','sb12n');
>
> That will put it on the optimizer's head as to whether to use the index
> for one term or both terms.
>
> It might be worth noting that the optimizer will of course not get this
> right unless it has decent statistics about both search terms --- and
> there is an as-yet-unreleased patch about tsvector stats gathering:
> http://archives.postgresql.org/pgsql-committers/2010-05/msg00360.php
> I am not sure that the situation addressed by that patch applies in
> your case, but it might.
>
>                        regards, tom lane
>

Thanks, Tom.

The breaking out into separate search terms does make a difference,
but not much:

ims=# select count(*) from search_rm
where to_tsvector('english',textsearch) @@ to_tsquery('english','bat & sb12n');
 count
-------
 3849
(1 row)

Time: 413.329 ms

ims=# select count(*) from search_rm
ims-# where to_tsvector('english',textsearch) @@ to_tsquery('english','bat') AND
ims-# to_tsvector('english',textsearch) @@ to_tsquery('english','sb12n');
 count
-------
 3849
(1 row)

Time: 352.583 ms

So it's shaving about a sixth of the time off, which isn't bad, but
not spectacularly good either!

I'd also thought of trying something like this:

ims=# select count(*) from
(
 select * from search_rm where
 to_tsvector('english',textsearch) @@ to_tsquery('english','bat')
) as core
where to_tsvector('english',textsearch) @@ to_tsquery('english','sb12n');

 count
-------
 3849
(1 row)

Time: 357.248 ms

...in the hope that the sb12n test would only be applied to the set of
'bat' records acquired by the inner query. But as you can tell from
the time, that's not particularly better or worse than your suggestion
(bearing in mind that 'bat' on its own is a 12ms search).

I'm currently constructing a separate column containing a single
bitmask value for about 15 of the 45 attributes, just to see if
evaluating the bits with a bitand test for the bat records is faster
than trying to FTS them in the first place. Something like

select count(*) from
  (
 select * from search_rm where
 to_tsvector('english',textsearch) @@ to_tsquery('english','bat')
  ) as core
where (searchbits & 4096) <> 0;  -- & is PostgreSQL's built-in bitwise AND

But it's taking a while to get that extra column constructed in the
original table!

Fingers crossed, because if not, it's all a bit of a show-stopper for
our migration effort, I think. :-(

Regards & thanks
HJR

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Hi Howard,

As well as trying the bitand test, have you tried a plain %like% or a regex match/extraction on the results of performing the FTS search purely on the "real word" search terms? I'm guessing it would involve more computation per candidate row than the bitand approach, but it might require less maintenance. It also fits the direction you're already heading in: putting the emphasis on refining a small set of candidate matches, rather than on retrieving better matches in the first instance - perhaps more so for non-exhaustive searching.
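I.e., something along these lines (untested, and assuming the attribute codes such as 'sb12n' appear literally in the textsearch column):

select count(*) from
(
  select * from search_rm where
  to_tsvector('english',textsearch) @@ to_tsquery('english','bat')
) as core
where textsearch like '%sb12n%';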

Regards,

Steve


Re: Full Text Search ideas

From
Oleg Bartunov
Date:
It's doable, but requires a lot of work. We need support for this.

Oleg
On Sun, 18 Jul 2010, Howard Rogers wrote:

> I asked recently about a performance problem I'd been having with some
> full text queries, and got really useful help that pointed me to the
> root issues. Currently, I'm trying to see if our document search
> (running on Oracle Text) can be migrated to PostgreSQL, and the reason
> I asked that earlier question points to a fundamental design issue
> we'll have with PostgreSQL that doesn't affect us in Oracle (not, I
> hasten to add, that that means Oracle is better/right-er/whatever.
> It's just different - but the difference will cause us a problem).
>
> Consider the following example (which is just one of 40-odd I could
> have picked).
>
> Some of our documents are in panoramic format, for example. But not
> many (say, 30,000 out of 10,000,000). We have a flag for 'panoramic',
> called 'sb12'. It's either 'y' or 'n' for any document. So a search
> for 'sb12n' (find me all documents which are not panoramic) is
> logically the same as a search for 'not sb12y'. However, 95% or more
> of documents will be an sb12n, because hardly any documents are
> panoramic in the first place. So, although the numeric outcome of
> 'sb12n' and 'not sb12y' will always be the same, you would have to
> check the entire table to find which ones are 'sb12n' (because most
> documents are marked that way), whereas you'd only have to check the
> 5% of records to find 'sb12y', because so few are marked that way.
>
> But in Oracle Text, this doesn't seem to happen:
>
> SQL> select count(*) from search_digital_rm where
> contains(textsearch,'bat and sb12n')>0;
>
>  COUNT(*)
> ----------
>      3040
>
> Elapsed: 00:00:00.10
>
> SQL> select count(*) from search_digital_rm where
> contains(textsearch,'bat not sb12y')>0;
>
>  COUNT(*)
> ----------
>      3040
>
> Elapsed: 00:00:00.06
>
> In both cases, the same number of records are returned. But, within a
> margin of error, the time taken to do each test is about the same.
> Even though the first test must be matching 'sb12n' for many millions
> of records, it's taking not much longer than the search for 'sb12y',
> which can only match about 90,000. It would seem (I can't tell from
> the explain plan itself) as though what's happened is that the set of
> 'bat' records has been fetched first (about 8000 in all). For so few
> records, whether you're looking for sb12y or sb12n then becomes mostly
> irrelevant for timing purposes, and hence the duration equivalence of
> both queries.
>
> This is not what happens in PostgreSQL, however (as I now know, thanks
> to the help on my question from a couple of days ago):
>
> ims=# select count(*) from search_rm
> where to_tsvector('english', textsearch) @@ to_tsquery('english','bat & sb12n');
> count
> -------
>  3849
> (1 row)
>
> Time: 408.962 ms
>
> ims=# select count(*) from search_rm
> where to_tsvector('english', textsearch) @@ to_tsquery('english','bat
> & !sb12y');
> count
> -------
>  3849
> (1 row)
>
> Time: 11.533 ms
>
> Now, one test takes about 40 times longer than the other, though the
> one taking just 11ms is as fast as Oracle can manage (impressive,
> considering I've done absolutely nothing to tune this PostgreSQL
> testbed as yet!). Logically equivalent the two tests may be, but
> hunting through lots of sb12n records and working out which are
> related to bats is apparently a lot slower than finding things the
> other way around.
>
> I'm wondering firstly if there's any way I can configure PostgreSQL
> FTS so that it produces the sort of results we've gotten used to from
> Oracle, i.e., where search speeds do not go up wildly when a 'search
> term' is applied that happens to be used by the vast majority of
> document records. (For example, we currently allow searches for file
> types, where 80% of documents would be "Word documents", another 19%
> would be PDFs and the remaining 1% of documents could be pretty much
> anything else! We can't have people searching for "definitely want
> only Word documents" if that means matching 8 million records and
> search speeds shoot to the stratosphere as a result).
>
> Secondly, I'm open to any suggestions as to how you would organise
> things or re-write the SQL so that the "attribute filter" is only
> applied to the small subset of records which match the relevant "real
> word" search term, if that's what's needed here. In other words, is my
> best bet in the earlier examples to fetch *all* "bat" records, and
> then nest that query within an outer query that adds a test for a
> separate attribute column? Or is something else called for here?
>
> Would appreciate any thoughts on the subject!
>
> Regards
> HJR
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
