Thread: Full Text Search dictionary issues
I have 10 million rows in a table, with a full text index created on one
of the columns. I submit this query:

ims=# select count(*) from search_rm
ims-# where to_tsvector('english', textsearch)
ims-# @@ to_tsquery('english', 'woman & beach & ball');
 count
-------
   646
(1 row)

Time: 107.570 ms

...and those are excellent times. But if I alter the query to read:

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch)
@@ to_tsquery('english', 'woman & beach & ftx1');
 count
-------
 38343
(1 row)

Time: 640.985 ms

...then, as you see, it slows the query down by a factor of about 6,
which is not so good! The problem is that we need to be able to search
for "ftx1", since that's a flag we put in our document records to tell
us the file type, and we need to be able to retrieve different file
types at different times.

Now, I *think* the problem is that 'ftx1' is not being treated as
though it were a proper word:

ims=# select * from ts_debug('english','woman ball ftx1');
   alias   |       description        | token |  dictionaries  |  dictionary  | lexemes
-----------+--------------------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII          | woman | {english_stem} | english_stem | {woman}
 blank     | Space symbols            |       | {}             |              |
 asciiword | Word, all ASCII          | ball  | {english_stem} | english_stem | {ball}
 blank     | Space symbols            |       | {}             |              |
 numword   | Word, letters and digits | ftx1  | {simple}       | simple       | {ftx1}
(5 rows)

Instead of being an asciiword that uses the english_stem dictionary,
'ftx1' gets regarded as a numword handled by the simple dictionary.
If I simply replace 'ftx1' with 'ftxa', it *is* then regarded as an
asciiword, and performance of the original query reverts to being just
fine, too:

ims=# select * from ts_debug('english','woman ball ftxa');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | woman | {english_stem} | english_stem | {woman}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | ball  | {english_stem} | english_stem | {ball}
 blank     | Space symbols   |       | {}             |              |
 asciiword | Word, all ASCII | ftxa  | {english_stem} | english_stem | {ftxa}

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch)
@@ to_tsquery('english', 'woman & beach & ftxa');
 count
-------
     0
(1 row)

Time: 88.603 ms

As you can see, 88ms for a search with 'ftxa' compared to 600+ms for
one with 'ftx1'.

I should mention that we have about 45 different 'nonsense word' flags
we use for all sorts of different purposes, such as telling us which
region a document is visible in, whether it is in portrait or landscape
mode, and so on. All of these flag-words take the form of zzzz1, or
yyyy2, and so on. So there's a lot of these things causing the problem,
not just 'ftx1' specifically.

My question is, then, what can I do to stop this slowdown? Is there
some way to add 'ftx1' (and the others) as a word in the english_stem
dictionary so that it gets regarded as an asciiword, for example? Or is
there something else I can do to address the problem?

I'm fairly new to PostgreSQL's full text search. I've read Chapter 12
of the doco with rapt attention, but I don't see anything that leaps at
me as a fix for this issue. All help gratefully received, therefore,
and apologies in advance if this is a bit of a newbie question.

Regards
HJR
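[For reference: the mapping from token types to dictionaries is a property of the text search configuration, so the literal question does have an answer — copy the built-in 'english' configuration and point the numword token type at english_stem. A sketch, untested against this schema; the configuration name my_english is made up, and the functional index would need rebuilding against the new configuration:]

```sql
-- Clone the built-in 'english' configuration (name is hypothetical)
CREATE TEXT SEARCH CONFIGURATION my_english ( COPY = pg_catalog.english );

-- Route mixed letter-and-digit tokens through the English stemmer
-- instead of the 'simple' dictionary
ALTER TEXT SEARCH CONFIGURATION my_english
  ALTER MAPPING FOR numword WITH english_stem;

-- Verify: 'ftx1' is still a numword token, but now goes to english_stem
SELECT alias, token, dictionary, lexemes
FROM ts_debug('my_english', 'woman ball ftx1');
```

[Note, though, that this changes only which dictionary normalises the token, not the cost of matching it; as the rest of the thread shows, the slowdown has a different cause.]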
Howard Rogers <hjr@diznix.com> writes:
> I have 10 million rows in a table, with full text index created on one
> of the columns. I submit this query:

> ims=# select count(*) from search_rm
> ims-# where to_tsvector('english', textsearch)
> ims-# @@ to_tsquery('english', 'woman & beach & ball');
>  count
> -------
>    646
> (1 row)
> Time: 107.570 ms

> ...and those are excellent times. But if I alter the query to read:

> ims=# select count(*) from search_rm
> where to_tsvector('english', textsearch)
> @@ to_tsquery('english', 'woman & beach & ftx1');
>  count
> -------
>  38343
> (1 row)
> Time: 640.985 ms

> ...then, as you see, it slows the query down by a factor of about 6,

... um, but it increased the number of matching rows by a factor of
almost 60. I think your complaint of poor scaling is misplaced.

> which is not so good! The problem is that we need to be able to search
> for "ftx1", since that's a flag we put in our document records to tell
> us the file type, and we need to be able to retrieve different file
> types at different times.

You might want to rethink how you're doing that --- it seems like a file
type flag ought to be a separate column rather than a word in a text
field.

			regards, tom lane
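[A minimal sketch of what Tom is suggesting, with a made-up column and index name and the backfill left unspecified:]

```sql
-- Hypothetical: pull the file-type flag out into its own column
ALTER TABLE search_rm ADD COLUMN file_type text;
CREATE INDEX search_rm_file_type_idx ON search_rm (file_type);

-- The flag then becomes an ordinary equality filter instead of a
-- full-text term:
SELECT count(*)
FROM search_rm
WHERE to_tsvector('english', textsearch)
      @@ to_tsquery('english', 'woman & beach')
  AND file_type = 'ftx1';
```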
On 07/16/10 02:23, Tom Lane wrote:
> Howard Rogers <hjr@diznix.com> writes:
>> I have 10 million rows in a table, with full text index created on one
>> of the columns. I submit this query:
>
>> ims=# select count(*) from search_rm
>> ims-# where to_tsvector('english', textsearch)
>> ims-# @@ to_tsquery('english', 'woman & beach & ball');
>>  count
>> -------
>>    646
>> (1 row)
>> Time: 107.570 ms
>
>> ...and those are excellent times. But if I alter the query to read:
>
>> ims=# select count(*) from search_rm
>> where to_tsvector('english', textsearch)
>> @@ to_tsquery('english', 'woman & beach & ftx1');
>>  count
>> -------
>>  38343
>> (1 row)
>> Time: 640.985 ms
>
>> ...then, as you see, it slows the query down by a factor of about 6,
>
> ... um, but it increased the number of matching rows by a factor of
> almost 60. I think your complaint of poor scaling is misplaced.

This is basically the same question I asked a few days ago, and I think
this (mis)expectation of performance comes from expecting tsearch2 to
behave like external specialized indexers. In such products the search
result can be returned directly from the index, which scales fairly
well, but PostgreSQL actually has to look up all the records returned,
and that is where most of the time is spent.
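[One way to see that heap-lookup cost directly is EXPLAIN (ANALYZE, BUFFERS): with a GIN index on the tsvector expression you would typically see a Bitmap Index Scan feeding a Bitmap Heap Scan, and it is the heap scan's row count and buffer traffic that grow with common terms. A sketch against the queries above:]

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM search_rm
WHERE to_tsvector('english', textsearch)
      @@ to_tsquery('english', 'woman & beach & ftx1');
```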
On Fri, Jul 16, 2010 at 10:23 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Howard Rogers <hjr@diznix.com> writes:
>> I have 10 million rows in a table, with full text index created on one
>> of the columns. I submit this query:
>
>> ims=# select count(*) from search_rm
>> ims-# where to_tsvector('english', textsearch)
>> ims-# @@ to_tsquery('english', 'woman & beach & ball');
>>  count
>> -------
>>    646
>> (1 row)
>> Time: 107.570 ms
>
>> ...and those are excellent times. But if I alter the query to read:
>
>> ims=# select count(*) from search_rm
>> where to_tsvector('english', textsearch)
>> @@ to_tsquery('english', 'woman & beach & ftx1');
>>  count
>> -------
>>  38343
>> (1 row)
>> Time: 640.985 ms
>
>> ...then, as you see, it slows the query down by a factor of about 6,
>
> ... um, but it increased the number of matching rows by a factor of
> almost 60. I think your complaint of poor scaling is misplaced.
>
>> which is not so good! The problem is that we need to be able to search
>> for "ftx1", since that's a flag we put in our document records to tell
>> us the file type, and we need to be able to retrieve different file
>> types at different times.
>
> You might want to rethink how you're doing that --- it seems like a file
> type flag ought to be a separate column rather than a word in a text
> field.
>
> regards, tom lane
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

OK, Tom: I did actually account for the number of rows difference
before I posted, though I accept I didn't show you that.
So here goes:

ims=# select count(*)
ims-# from search_rm
ims-# where to_tsvector('english', textsearch) @@
to_tsquery('english', 'wommmman & batt & ftxa')
ims-# limit 20;
 count
-------
     0
(1 row)

Time: 0.593 ms

ims=# select count(*)
from search_rm
where to_tsvector('english', textsearch) @@ to_tsquery('english',
'wommmman & batt & ftx1')
limit 20;
 count
-------
     0
(1 row)

Time: 489.362 ms

Both queries return zero rows. One takes an awful lot longer than the
other. The only difference between them is that one searches for
'ftx1' and the other searches for 'ftx0'. My complaint of poor
scalability (actually, it was an enquiry about the role of dictionary
types!) is valid, I think. As a PostgreSQL newbie, I'm happy to accept
that I've done something plonkingly stupid to account for these
results, but I'd then like to know what it is I've done wrong! A
simple scale-up of the number of hits isn't, however, the problem, I
don't think.

With this amount of data, and with 45 different document attributes
that may or may not be searched for, some of them involving names and
places and dates, some just yes/no flags, it is utterly impossible to
have them as separate attribute columns and search on them with
anything like decent performance. We adopted this approach with Oracle
Text two years ago precisely because it was the only way to keep
web-based searches of 10,000,000 records coming back in less than a
second. So, no, we're not going to re-think the storage of 'attribute
data' as part of the searchable keyword field, though I'm more than
prepared to alter the precise format of that data if it helps
PostgreSQL any.

That said, however, we have people supplying us with document
references in the form DA3-76374YY-001, so alpha-numerics simply have
to be searchable with good speed, and I can't always magic-away the
alpha-numeric components, even if I wanted to.
So, I would still like to know if this performance difference when encountering alpha-numeric "words" is dictionary-related, and if so what I can do to fix that, please. Cheers, HJR
Does it run any differently if you split out the tag?
select count(*)
from search_rm
where to_tsvector('english', textsearch) @@ to_tsquery('english', 'wommmman & batt')
  and to_tsvector('english', textsearch) @@ 'ftx1'::tsquery;
Steve
On 16/07/10 05:22, Howard Rogers wrote:
> OK, Tom: I did actually account for the number of rows difference
> before I posted, though I accept I didn't show you that. So here goes:

Tom's good, but his mind-reading powers aren't what they used to be :-)

> ims=# select count(*)
> ims-# from search_rm
> ims-# where to_tsvector('english', textsearch) @@
> to_tsquery('english', 'wommmman & batt & ftxa')
> ims-# limit 20;
>  count
> -------
>      0
> (1 row)
>
> Time: 0.593 ms
>
> ims=# select count(*)
> from search_rm
> where to_tsvector('english', textsearch) @@ to_tsquery('english',
> 'wommmman & batt & ftx1')
> limit 20;
>  count
> -------
>      0
> (1 row)
>
> Time: 489.362 ms
>
> Both queries return zero rows. One takes an awful lot longer than the
> other.

And how many matches do you get for each term? Is it equally slow if
you search for 'wommman & batt & pzsdja' vs '... pzsdj1'? I'm assuming
pzsdja/1 aren't valid tokens, of course...

-- 
  Richard Huxton
  Archonet Ltd
Howard Rogers <hjr@diznix.com> writes:
> OK, Tom: I did actually account for the number of rows difference
> before I posted, though I accept I didn't show you that. So here goes:
> ...
> Both queries return zero rows. One takes an awful lot longer than the
> other. The only difference between them is that one searches for
> 'ftx1' and the other searches for 'ftx0'.

Well, this still doesn't tell us anything about what I think the
critical point is, namely how many actual matches there are for ftx1
versus ftx0. Could we see counts for *just* those words without the
other conditions?

> So, I would still like to know if this performance difference when
> encountering alpha-numeric "words" is dictionary-related,

AFAIK there is no significant difference between treatment of pure
alpha and mixed alphanumeric "words", at least not once you get past
to_tsquery. I'm still expecting this is just a matter of how many index
entries match. It's barely possible that you've got a dictionary
configuration that makes the to_tsquery() function itself a lot slower
in the alphanumeric case, but that should affect ftx1 and ftx0 equally.

			regards, tom lane
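[The per-word counts Tom is asking for can be had either with the same @@ query on a single term, or — more slowly, since it reads every row — with ts_stat, which reports per-lexeme frequencies. A sketch:]

```sql
-- Match count for a single term
SELECT count(*)
FROM search_rm
WHERE to_tsvector('english', textsearch) @@ to_tsquery('english', 'ftx1');

-- Lexeme frequencies over the whole column: ndoc = documents containing
-- the word, nentry = total occurrences (this scans the entire table)
SELECT word, ndoc, nentry
FROM ts_stat($$ SELECT to_tsvector('english', textsearch) FROM search_rm $$)
WHERE word IN ('ftx1', 'ftxa')
ORDER BY ndoc DESC;
```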
On Sat, Jul 17, 2010 at 3:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Howard Rogers <hjr@diznix.com> writes:
>> OK, Tom: I did actually account for the number of rows difference
>> before I posted, though I accept I didn't show you that. So here goes:
>> ...
>> Both queries return zero rows. One takes an awful lot longer than the
>> other. The only difference between them is that one searches for
>> 'ftx1' and the other searches for 'ftx0'.
>
> Well, this still doesn't tell us anything about what I think the
> critical point is, namely how many actual matches there are for
> ftx1 versus ftx0. Could we see counts for *just* those words without
> the other conditions?
>
>> So, I would still like to know if this performance difference when
>> encountering alpha-numeric "words" is dictionary-related,
>
> AFAIK there is no significant difference between treatment of pure alpha
> and mixed alphanumeric "words", at least not once you get past
> to_tsquery. I'm still expecting this is just a matter of how many index
> entries match. It's barely possible that you've got a dictionary
> configuration that makes the to_tsquery() function itself a lot slower
> in the alphanumeric case, but that should affect ftx1 and ftx0 equally.
>
> regards, tom lane

My apologies for making this harder than it should have been. My
queries were correct; my description of them wasn't. The only
difference was 'ftx1' and 'ftxa', not 'ftx0'.

Anyway, I think I've finally got the point being made by you and
Richard (and sorry for being so slow about it!): ftx1 is a very, very
common flag. I'd guess 99% of records would be tagged with it. And
ftxa doesn't exist at all, of course.
So, you're right:

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch) @@
to_tsquery('english','ball & beach & salsm1');
 count
-------
  1753
(1 row)

Time: 557.010 ms

ims=# select count(*) from search_rm
where to_tsvector('english', textsearch) @@
to_tsquery('english','ball & beach & salsm4');
 count
-------
    97
(1 row)

Time: 39.518 ms

'salsm1' is another very common flag, applied to about 80% of
documents. 'salsm4' applies to about 160 documents in total. So it's
really not the alpha-numerics causing the issue after all: it's
genuinely the number of records matching each term, as you originally
said. (And a note to Steve: it makes no difference separating out the
'flag factor', I'm afraid.)

It's given me lots to think about. The hard part ought to be finding
the women, or the beaches; the flags should be mere filters applied
after those have been fetched. Clearly, though, that's not how this is
behaving, though I've been fooled into thinking it ought to be because
of the apparent default optimization done in Oracle Text. Some new
technique is called for, I guess! (If you've got any ideas, I'm all
ears...)

Anyway: at least you've all helped me realise that it's not a
dictionary (or word-type) problem, so thank you all very much for
that; much appreciated.

Regards
HJR
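[One possible "new technique", sketched here with made-up names and untested against this schema: keep the 45-odd flags in a text[] column with its own GIN index, so the flags stop inflating the tsvector index's posting lists and the planner can AND the two bitmap index scans together, or simply filter flags on the rows the text search returns:]

```sql
-- Hypothetical: flags live in an array column, not in the keyword text
ALTER TABLE search_rm ADD COLUMN flags text[];
CREATE INDEX search_rm_flags_idx ON search_rm USING gin (flags);

-- @> is array containment: "rows whose flags include all of these"
SELECT count(*)
FROM search_rm
WHERE to_tsvector('english', textsearch)
      @@ to_tsquery('english', 'woman & beach')
  AND flags @> ARRAY['ftx1', 'salsm1'];
```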