Thread: Full Text Search dictionary issues
I have 10 million rows in a table, with a full text index created on one
of the columns. I submit this query:

ims=# select count(*) from search_rm
ims-# where to_tsvector('english', textsearch)
ims-# @@ to_tsquery('english', 'woman & beach & ball');
 count
-------
   646
(1 row)

Time: 107.570 ms

...and those are excellent times. But if I alter the query to read:

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch)
@@ to_tsquery('english', 'woman & beach & ftx1');
 count
-------
 38343
(1 row)

Time: 640.985 ms

...then, as you see, it slows the query down by a factor of about 6,
which is not so good! The problem is that we need to be able to search
for "ftx1", since that's a flag we put in our document records to tell
us the file type, and we need to be able to retrieve different file
types at different times.

Now, I *think* the problem is that 'ftx1' is not being treated as
though it were a proper word:

ims=# select * from ts_debug('english','woman ball ftx1');
   alias   |       description        | token |  dictionaries  |  dictionary  | lexemes
-----------+--------------------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII          | woman | {english_stem} | english_stem | {woman}
 blank     | Space symbols            |       | {}             |              |
 asciiword | Word, all ASCII          | ball  | {english_stem} | english_stem | {ball}
 blank     | Space symbols            |       | {}             |              |
 numword   | Word, letters and digits | ftx1  | {simple}       | simple       | {ftx1}
(5 rows)

Instead of being an asciiword that uses the english_stem dictionary,
'ftx1' gets regarded as a numword handled by the simple dictionary.
If I simply replace 'ftx1' with 'ftxa', it *is* then regarded as an
asciiword, and performance of the original query reverts to being just
fine, too:

ims=# select * from ts_debug('english','woman ball ftxa');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | woman | {english_stem} | english_stem | {woman}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | ball  | {english_stem} | english_stem | {ball}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | ftxa  | {english_stem} | english_stem | {ftxa}

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch)
@@ to_tsquery('english', 'woman & beach & ftxa');
 count
-------
     0
(1 row)

Time: 88.603 ms

As you can see, 88ms for a search with 'ftxa' compared to 600+ms for
one with 'ftx1'.

I should mention that we have about 45 different 'nonsense word' flags
we use for all sorts of different purposes, such as telling us which
region a document is visible in, whether it is in portrait or landscape
mode, and so on. All of these flag-words take the form of zzzz1, or
yyyy2, and so on. So there's a lot of these things causing the problem,
not just 'ftx1' specifically.

My question is, then, what can I do to stop this slowdown? Is there
some way to add 'ftx1' (and the others) as a word in the english_stem
dictionary so that it gets regarded as an asciiword, for example? Or is
there something else I can do to address the problem?

I'm fairly new to PostgreSQL's full text search. I've read Chapter 12
of the doco with rapt attention, but I don't see anything that leaps at
me as a fix for this issue. All help gratefully received, therefore,
and apologies in advance if this is a bit of a newbie question.

Regards
HJR
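[For reference: the mapping from token types to dictionaries is a property of the text search configuration, so the literal question does have an answer — copy the built-in 'english' configuration and point the numword token type at english_stem. A sketch, untested against this schema; the configuration name my_english is made up, and the functional index would need rebuilding against the new configuration:]

```sql
-- Clone the built-in 'english' configuration (name is hypothetical)
CREATE TEXT SEARCH CONFIGURATION my_english ( COPY = pg_catalog.english );

-- Route mixed letter-and-digit tokens through the English stemmer
-- instead of the 'simple' dictionary
ALTER TEXT SEARCH CONFIGURATION my_english
  ALTER MAPPING FOR numword WITH english_stem;

-- Verify: 'ftx1' is still a numword token, but now goes to english_stem
SELECT alias, token, dictionary, lexemes
FROM ts_debug('my_english', 'woman ball ftx1');
```

[Note, though, that this changes only which dictionary normalises the token, not the cost of matching it; as the rest of the thread shows, the slowdown has a different cause.]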
Howard Rogers <hjr@diznix.com> writes:
> I have 10 million rows in a table, with full text index created on one
> of the columns. I submit this query:

> ims=# select count(*) from search_rm
> ims-# where to_tsvector('english', textsearch)
> ims-# @@ to_tsquery('english', 'woman & beach & ball');
>  count
> -------
>    646
> (1 row)
> Time: 107.570 ms

> ...and those are excellent times. But if I alter the query to read:

> ims=# select count(*) from search_rm
> where to_tsvector('english', textsearch)
> @@ to_tsquery('english', 'woman & beach & ftx1');
>  count
> -------
>  38343
> (1 row)
> Time: 640.985 ms

> ...then, as you see, it slows the query down by a factor of about 6,

... um, but it increased the number of matching rows by a factor of
almost 60. I think your complaint of poor scaling is misplaced.

> which is not so good! The problem is that we need to be able to search
> for "ftx1", since that's a flag we put in our document records to tell
> us the file type, and we need to be able to retrieve different file
> types at different times.

You might want to rethink how you're doing that --- it seems like a file
type flag ought to be a separate column rather than a word in a text
field.

			regards, tom lane
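[A minimal sketch of what Tom is suggesting, with a made-up column and index name and the backfill left unspecified:]

```sql
-- Hypothetical: pull the file-type flag out into its own column
ALTER TABLE search_rm ADD COLUMN file_type text;
CREATE INDEX search_rm_file_type_idx ON search_rm (file_type);

-- The flag then becomes an ordinary equality filter instead of a
-- full-text term:
SELECT count(*)
FROM search_rm
WHERE to_tsvector('english', textsearch)
      @@ to_tsquery('english', 'woman & beach')
  AND file_type = 'ftx1';
```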
On 07/16/10 02:23, Tom Lane wrote:
> Howard Rogers <hjr@diznix.com> writes:
>> I have 10 million rows in a table, with full text index created on one
>> of the columns. I submit this query:
>
>> ims=# select count(*) from search_rm
>> ims-# where to_tsvector('english', textsearch)
>> ims-# @@ to_tsquery('english', 'woman & beach & ball');
>>  count
>> -------
>>    646
>> (1 row)
>> Time: 107.570 ms
>
>> ...and those are excellent times. But if I alter the query to read:
>
>> ims=# select count(*) from search_rm
>> where to_tsvector('english', textsearch)
>> @@ to_tsquery('english', 'woman & beach & ftx1');
>>  count
>> -------
>>  38343
>> (1 row)
>> Time: 640.985 ms
>
>> ...then, as you see, it slows the query down by a factor of about 6,
>
> ... um, but it increased the number of matching rows by a factor of
> almost 60. I think your complaint of poor scaling is misplaced.

This is basically the same question I asked a few days ago, and I think
this (mis)expectation of performance comes from expecting tsearch2 to
behave like external specialized indexers. In such products the search
result can be returned directly from the index, which scales fairly
well, but PostgreSQL actually has to look up all the records returned,
and that is where most of the time is spent.
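[One way to see that heap-lookup cost directly is EXPLAIN (ANALYZE, BUFFERS): with a GIN index on the tsvector expression you would typically see a Bitmap Index Scan feeding a Bitmap Heap Scan, and it is the heap scan's row count and buffer traffic that grow with common terms. A sketch against the queries above:]

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM search_rm
WHERE to_tsvector('english', textsearch)
      @@ to_tsquery('english', 'woman & beach & ftx1');
```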
On Fri, Jul 16, 2010 at 10:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Howard Rogers <hjr@diznix.com> writes:
>> I have 10 million rows in a table, with full text index created on one
>> of the columns. I submit this query:
>
>> ims=# select count(*) from search_rm
>> ims-# where to_tsvector('english', textsearch)
>> ims-# @@ to_tsquery('english', 'woman & beach & ball');
>>  count
>> -------
>>    646
>> (1 row)
>> Time: 107.570 ms
>
>> ...and those are excellent times. But if I alter the query to read:
>
>> ims=# select count(*) from search_rm
>> where to_tsvector('english', textsearch)
>> @@ to_tsquery('english', 'woman & beach & ftx1');
>>  count
>> -------
>>  38343
>> (1 row)
>> Time: 640.985 ms
>
>> ...then, as you see, it slows the query down by a factor of about 6,
>
> ... um, but it increased the number of matching rows by a factor of
> almost 60. I think your complaint of poor scaling is misplaced.
>
>> which is not so good! The problem is that we need to be able to search
>> for "ftx1", since that's a flag we put in our document records to tell
>> us the file type, and we need to be able to retrieve different file
>> types at different times.
>
> You might want to rethink how you're doing that --- it seems like a file
> type flag ought to be a separate column rather than a word in a text
> field.
>
> regards, tom lane
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

OK, Tom: I did actually account for the number of rows difference
before I posted, though I accept I didn't show you that.
So here goes:

ims=# select count(*)
ims-# from search_rm
ims-# where to_tsvector('english', textsearch) @@
to_tsquery('english', 'wommmman & batt & ftxa')
ims-# limit 20;
 count
-------
     0
(1 row)

Time: 0.593 ms

ims=# select count(*)
from search_rm
where to_tsvector('english', textsearch) @@ to_tsquery('english',
'wommmman & batt & ftx1')
limit 20;
 count
-------
     0
(1 row)

Time: 489.362 ms

Both queries return zero rows. One takes an awful lot longer than the
other. The only difference between them is that one searches for
'ftx1' and the other searches for 'ftx0'. My complaint of poor
scalability (actually, it was an enquiry about the role of dictionary
types!) is valid, I think. As a PostgreSQL newbie, I'm happy to accept
that I've done something plonkingly stupid to account for these
results, but I'd then like to know what it is I've done wrong! A
simple scale-up of the number of hits isn't, however, the problem, I
don't think.

With this amount of data, and with 45 different document attributes
that may or may not be searched for, some of them involving names and
places and dates, some just yes/no flags, it is utterly impossible to
have them as separate attribute columns and search on them with
anything like decent performance. We adopted this approach with Oracle
Text two years ago precisely because it was the only way to keep
web-based searches of 10,000,000 records coming back in less than a
second. So, no, we're not going to re-think the storage of 'attribute
data' as part of the searchable keyword field, though I'm more than
prepared to alter the precise format of that data if it helps
PostgreSQL any.

That said, however, we have people supplying us with document
references in the form DA3-76374YY-001, so alpha-numerics simply have
to be searchable with good speed, and I can't always magic-away the
alpha-numeric components, even if I wanted to.
So, I would still like to know if this performance difference when encountering alpha-numeric "words" is dictionary-related, and if so what I can do to fix that, please. Cheers, HJR
Does it run any differently if you split out the tag?
select count(*)
from search_rm
where to_tsvector('english', textsearch) @@ to_tsquery('english', 'wommmman & batt')
  and to_tsvector('english', textsearch) @@ 'ftx1'::tsquery;
Steve
On 16/07/10 05:22, Howard Rogers wrote:
> OK, Tom: I did actually account for the number of rows difference
> before I posted, though I accept I didn't show you that. So here goes:

Tom's good, but his mind-reading powers aren't what they used to be :-)

> ims=# select count(*)
> ims-# from search_rm
> ims-# where to_tsvector('english', textsearch) @@
> to_tsquery('english', 'wommmman & batt & ftxa')
> ims-# limit 20;
>  count
> -------
>      0
> (1 row)
>
> Time: 0.593 ms
>
> ims=# select count(*)
> from search_rm
> where to_tsvector('english', textsearch) @@ to_tsquery('english',
> 'wommmman & batt & ftx1')
> limit 20;
>  count
> -------
>      0
> (1 row)
>
> Time: 489.362 ms
>
> Both queries return zero rows. One takes an awful lot longer than the
> other.

And how many matches do you get for each term? Is it equally slow if
you search for 'wommman & batt & pzsdja' vs '... pzsdj1'? I'm assuming
pzsdja/1 aren't valid tokens, of course...

-- 
  Richard Huxton
  Archonet Ltd
Howard Rogers <hjr@diznix.com> writes:
> OK, Tom: I did actually account for the number of rows difference
> before I posted, though I accept I didn't show you that. So here goes:
> ...
> Both queries return zero rows. One takes an awful lot longer than the
> other. The only difference between them is that one searches for
> 'ftx1' and the other searches for 'ftx0'.

Well, this still doesn't tell us anything about what I think the
critical point is, namely how many actual matches there are for ftx1
versus ftx0. Could we see counts for *just* those words without the
other conditions?

> So, I would still like to know if this performance difference when
> encountering alpha-numeric "words" is dictionary-related,

AFAIK there is no significant difference between treatment of pure
alpha and mixed alphanumeric "words", at least not once you get past
to_tsquery. I'm still expecting this is just a matter of how many index
entries match. It's barely possible that you've got a dictionary
configuration that makes the to_tsquery() function itself a lot slower
in the alphanumeric case, but that should affect ftx1 and ftx0 equally.

			regards, tom lane
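[The per-word counts Tom is asking for can be had either with the same @@ query on a single term, or — more slowly, since it reads every row — with ts_stat, which reports per-lexeme frequencies. A sketch:]

```sql
-- Match count for a single term
SELECT count(*)
FROM search_rm
WHERE to_tsvector('english', textsearch) @@ to_tsquery('english', 'ftx1');

-- Lexeme frequencies over the whole column: ndoc = documents containing
-- the word, nentry = total occurrences (this scans the entire table)
SELECT word, ndoc, nentry
FROM ts_stat($$ SELECT to_tsvector('english', textsearch) FROM search_rm $$)
WHERE word IN ('ftx1', 'ftxa')
ORDER BY ndoc DESC;
```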
On Sat, Jul 17, 2010 at 3:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Howard Rogers <hjr@diznix.com> writes:
>> OK, Tom: I did actually account for the number of rows difference
>> before I posted, though I accept I didn't show you that. So here goes:
>> ...
>> Both queries return zero rows. One takes an awful lot longer than the
>> other. The only difference between them is that one searches for
>> 'ftx1' and the other searches for 'ftx0'.
>
> Well, this still doesn't tell us anything about what I think the
> critical point is, namely how many actual matches there are for
> ftx1 versus ftx0. Could we see counts for *just* those words without
> the other conditions?
>
>> So, I would still like to know if this performance difference when
>> encountering alpha-numeric "words" is dictionary-related,
>
> AFAIK there is no significant difference between treatment of pure alpha
> and mixed alphanumeric "words", at least not once you get past
> to_tsquery. I'm still expecting this is just a matter of how many index
> entries match. It's barely possible that you've got a dictionary
> configuration that makes the to_tsquery() function itself a lot slower
> in the alphanumeric case, but that should affect ftx1 and ftx0 equally.
>
> regards, tom lane

My apologies for making this harder than it should have been. My
queries were correct; my description of them wasn't. The only
difference was 'ftx1' and 'ftxa', not 'ftx0'.

Anyway, I think I've finally got the point being made by you and
Richard (and sorry for being so slow about it!): ftx1 is a very, very
common flag. I'd guess 99% of records would be tagged with it. And
ftxa doesn't exist at all, of course.
So, you're right:

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch) @@
to_tsquery('english','ball & beach & salsm1');
 count
-------
  1753
(1 row)

Time: 557.010 ms

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch) @@
to_tsquery('english','ball & beach & salsm4');
 count
-------
    97
(1 row)

Time: 39.518 ms

'salsm1' is another very common flag, applied to about 80% of
documents. 'salsm4' applies to about 160 documents in total. So it's
really not the alpha-numerics causing the issue after all: it's
genuinely the number of records matching each term, as you originally
said. (And a note to Steve: it makes no difference separating out the
'flag factor', I'm afraid.)

It's given me lots to think about. The hard part ought to be finding
the women, or the beaches; the flags should be mere filters applied
after those have been fetched. Clearly, though, that's not how this is
behaving, though I've been fooled into thinking it ought to be because
of the apparent default optimization done in Oracle Text. Some new
technique is called for, I guess! (If you've got any ideas, I'm all
ears...)

Anyway: at least you've all helped me realise that it's not a
dictionary (or word-type) problem, so thank you all very much for
that; much appreciated.

Regards
HJR
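[One possible "new technique", sketched here with made-up names and untested against this schema: keep the 45-odd flags in a text[] column with its own GIN index, so the flags stop inflating the tsvector index's posting lists and the planner can AND the two bitmap index scans together, or simply filter flags on the rows the text search returns:]

```sql
-- Hypothetical: flags live in an array column, not in the keyword text
ALTER TABLE search_rm ADD COLUMN flags text[];
CREATE INDEX search_rm_flags_idx ON search_rm USING gin (flags);

-- @> is array containment: "rows whose flags include all of these"
SELECT count(*)
FROM search_rm
WHERE to_tsvector('english', textsearch)
      @@ to_tsquery('english', 'woman & beach')
  AND flags @> ARRAY['ftx1', 'salsm1'];
```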