Thread: fts, compound words?
Hi,

I use the tsearch full text search with pg 8.0.3. It works great, but I
wonder if it's possible to search for compound words? I.e. if I search for
"New York" I want to get a match on

  New York has traffic problems.

but not on

  New axe murderer incident in British York.

Is this possible?

I don't use any wrapper, just

  select ... from ... where idxfti @@ to_tsquery('default', 'searchstring')

Thanks,
Marcus
On Mon, 5 Dec 2005, Marcus Engene wrote:

> Hi,
>
> I use the tsearch full text search with pg 8.0.3. It works great, but I
> wonder if it's possible to search for compound words?
> I.e. if I search for "New York" I want to get a match on
>   New York has traffic problems.
> but not on
>   New axe murderer incident in British York.
>
> Is this possible?
>
> I don't use any wrapper, just
>   select ... from ... where idxfti @@ to_tsquery('default', 'searchstring')

A ranking function is what you need. Read the documentation.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Oleg Bartunov wrote:
> On Mon, 5 Dec 2005, Marcus Engene wrote:
>
> [snip]
>
> A ranking function is what you need. Read the documentation.

Hi,

I realized from the documentation that I'm not looking for compound words
after all; I meant "exact phrase".

I can't see how to make rank tell me which results have an exact phrase?
Like "there must be an occurrence of 'new' before 'york'" (stemmed, so not
a truly exact phrase)?

Is there something new in rank for pg 8.1?

Thanks!
Marcus
On 12/5/05, Marcus Engene <mengpg@engene.se> wrote:

[snip]

> I realized from the documentation that I'm not looking for
> compound words after all; I meant "exact phrase".
>
> I can't see how to make rank tell me which results have an
> exact phrase? Like "there must be an occurrence of 'new' before
> 'york'" (stemmed, so not a truly exact phrase)?

What you'll want to do is check the original text for the exact phrase
after the tsearch2 index has given you some targets. Given table foo:

  CREATE TABLE foo (
      id   serial primary key,
      txt  text,
      ts2  tsvector
  );

use the query:

  SELECT id
    FROM foo
   WHERE ts2 @@ to_tsquery('new&york')
     AND txt ILIKE '%new york%';

You can get rid of the '%'s if you want the entire txt column to match
the search phrase.

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org
On Mon, 5 Dec 2005, Marcus Engene wrote:

> I realized from the documentation that I'm not looking for
> compound words after all; I meant "exact phrase".
>
> I can't see how to make rank tell me which results have an
> exact phrase? Like "there must be an occurrence of 'new' before
> 'york'" (stemmed, so not a truly exact phrase)?

http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes

Phrase search
(this tip is by Mike Rylander)

To do phrase searching, just add an additional WHERE clause to your query:

  SELECT id FROM tab
   WHERE ts_idx_col @@ to_tsquery('history&lesson')
     AND text_col ~* '.*history\\s+lesson.*';

The full-text index will still be used, and the regex will be used to
prune the results afterwards.

> Is there something new in rank for pg 8.1?

It has some improvements, but not for your case.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
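The two-stage approach in the tip above can be modeled outside the
database. This toy Python sketch (made-up rows, simplified matching with
no stemming; the function names are invented for illustration) shows why
the regex prune is needed after the index match:

```python
import re

# Toy model of the two-stage phrase search: stage 1 (the full-text
# index) yields candidate rows containing both words somewhere; stage 2
# (the regex) keeps only rows where the words are actually adjacent.
rows = {
    1: "New York has traffic problems.",
    2: "New axe murderer incident in British York.",
}

def index_candidates(query_words, table):
    """Stage 1: rows containing every query word (roughly what '@@' does)."""
    return [rid for rid, txt in table.items()
            if all(w.lower() in txt.lower() for w in query_words)]

def phrase_prune(candidates, phrase_regex, table):
    """Stage 2: keep rows where the exact phrase occurs (the ~* / ILIKE step)."""
    pat = re.compile(phrase_regex, re.IGNORECASE)
    return [rid for rid in candidates if pat.search(table[rid])]

cands = index_candidates(["new", "york"], rows)
hits = phrase_prune(cands, r"new\s+york", rows)
print(cands)  # [1, 2] -- both rows contain both words
print(hits)   # [1]    -- only row 1 has them adjacent
```

This mirrors the SQL above: the index test alone would match both of
Marcus's example sentences; only the adjacency check distinguishes them.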
Oleg Bartunov wrote:
> [snip]
>
> To do phrase searching, just add an additional WHERE clause to your query:
>
>   SELECT id FROM tab WHERE ts_idx_col @@ to_tsquery('history&lesson')
>   AND text_col ~* '.*history\\s+lesson.*';
>
> The full-text index will still be used, and the regex will be used to
> prune the results afterwards.

Hi,

Thanks for the answer, Oleg and Mike.

This, I guess, will be problematic in a query like A & (B | C), or a more
complex expression. Say C is "New York" and tsearch receives

  A & (B | (new & york))

I cannot just add the regexp afterwards. What if B is true?

What would be nice to have, given of course that the index isn't stripped,
is something like

  A & (B | (New OperatorTheNextWordMustFollow York))

Would something like that be doable? Right now, intuitively, it would be
two trees in the where clause:

  tsearch(A & B) OR (tsearch(A & C) AND regexpmatch(C))

... and a nightmare in complex queries.

Best regards,
Marcus
On 12/6/05, Marcus Engene <mengpg@engene.se> wrote:

[snip]

> A & (B | (New OperatorTheNextWordMustFollow York))

Actually, I love that idea. Oleg, would it be possible to create a
tsquery operator that understands proximity? Or, how about allowing a
predicate on the current '&' op, as in '&[dist<=1]', meaning "the next
token follows with a max distance of 1"? I imagine that it would only be
useful on unstripped tsvectors, but if the lexeme position is already
stored ...

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org
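The proposed '&[dist<=d]' predicate can be sketched over tsvector-style
lexeme positions. A minimal Python model (assuming each lexeme carries a
list of word positions, as an unstripped tsvector does; names are
invented for illustration):

```python
# Sketch of the proposed '&[dist<=d]' check over lexeme positions.
# An unstripped tsvector records word positions, e.g. for
# "New York has traffic problems": {'new': [1], 'york': [2], ...}.
def follows_within(positions_a, positions_b, dist):
    """True if some occurrence of b follows an occurrence of a
    with (pos_b - pos_a) between 1 and dist."""
    return any(0 < pb - pa <= dist
               for pa in positions_a
               for pb in positions_b)

doc1 = {"new": [1], "york": [2]}   # "New York has traffic problems."
doc2 = {"new": [1], "york": [7]}   # "New axe ... British York."

print(follows_within(doc1["new"], doc1["york"], 1))  # True  -- adjacent
print(follows_within(doc2["new"], doc2["york"], 1))  # False -- too far apart
```

Note the check is directional (b must come after a), which is exactly the
question Andy raises next in the thread.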
> > A & (B | (New OperatorTheNextWordMustFollow York))

I had thought about this before myself. Alas, I have never had the time to
properly investigate implementing such a feature. :(

  A & (B | (New + York))

Something like that?

> Actually, I love that idea. Oleg, would it be possible to create a
> tsquery operator that understands proximity? [snip]

Would the proximity go in both directions? Or just forward? What about
tokens that come before? Just a thought.

Andy
That is a long-discussed thing. We can't formulate unconflicting rules...
For example:

  1) a &[dist<=2] ( b &[dist<=3] c )
  2) a &[dist<=2] ( b |[dist<=3] c )
  3) a &[dist<=2] !c
  4) a &[dist<=2] ( b |[dist<=3] !c )
  5) a &[dist<=2] ( b & c )

What exactly do they mean? Which tsvectors should be matched by those
queries?

The simple solution is: under a 'phrase search' operation (ok, it will be
'+' below) there must be only 'phrase search' operations. I.e.:

  a | b | ( c + ( d + e ) )  -- good
  a | ( c + ( d & g ) )      -- bad

For example, we have the word 'foonish', and after lexize we get two
lexemes: 'foo1' and 'foo2'. So a good query 'a + foonish' becomes
'a + ( foo1 | foo2 )'...

Mike Rylander wrote:
> On 12/6/05, Marcus Engene <mengpg@engene.se> wrote:
>
> [snip]
>
> Actually, I love that idea. Oleg, would it be possible to create a
> tsquery operator that understands proximity? [snip]

--
Teodor Sigaev  E-mail: teodor@sigaev.ru
               WWW: http://www.sigaev.ru/
On 12/7/05, Teodor Sigaev <teodor@sigaev.ru> wrote:
> That is a long-discussed thing. We can't formulate unconflicting rules...
> For example:
>   1) a &[dist<=2] ( b &[dist<=3] c )
>   2) a &[dist<=2] ( b |[dist<=3] c )
>   3) a &[dist<=2] !c
>   4) a &[dist<=2] ( b |[dist<=3] !c )
>   5) a &[dist<=2] ( b & c )
> What exactly do they mean? Which tsvectors should be matched by those
> queries?

1, 2, 4, and 5 are obviously ambiguous, but 3 seems straightforward to me,
if perhaps more difficult to implement. Would it not be acceptable to say
that proximity modifiers are only valid between two simple lexemes and
cannot be placed next to any compound expression?

> The simple solution is: under a 'phrase search' operation (ok, it will be
> '+' below) there must be only 'phrase search' operations.

Same as above. And, while '+' would be a very good shortcut for
"&[follows;dist=1]" (or some such), I think the user should be able to
specify the proximity more explicitly as well.

> For example, we have the word 'foonish', and after lexize we get two
> lexemes: 'foo1' and 'foo2'. So a good query 'a + foonish' becomes
> 'a + ( foo1 | foo2 )'...

hrm... that is a problem. Though, I think that's a case of how the
compiled expression is built from user input. Unless I'm mistaken,

  a + ( foo1 | foo2 )

is exactly equal to

  (a + foo1) | (a + foo2)

Ahhh... but then there is the more complex example of

  a + foonish + bar

becoming

  a + (foo1 | foo2) + bar

... but I guess that could be

  (a + foo1 + bar) | (a + foo2 + bar)

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org
As Teodor already pointed out, there is no non-ambiguous solution, or at
least we don't know of one.

Oleg

On Wed, 7 Dec 2005, Andrew J. Kopciuch wrote:

> [snip]
>
> Would the proximity go in both directions? Or just forward? What about
> tokens that come before? Just a thought.
>
> Andy

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Mike Rylander wrote:
> [snip]
>
>>> A & (B | (New OperatorTheNextWordMustFollow York))
>
> Actually, I love that idea. Oleg, would it be possible to create a
> tsquery operator that understands proximity? [snip]

This might not be a solution in the longer term, but what I do for that
type of thing is

  idxfti @@ '(a&b)' and message ~* 'a b'

Postgres is smart enough to use the results of the GiST index and go from
there with the message scanning.

Jeff
> hrm... that is a problem. Though, I think that's a case of how the
> compiled expression is built from user input. Unless I'm mistaken,
>
>   a + ( foo1 | foo2 )
>
> is exactly equal to
>
>   (a + foo1) | (a + foo2)
>
> [snip]
>
>   (a + foo1 + bar) | (a + foo2 + bar)

That's a simple case; what about languages such as Norwegian or German?
They have compound words, and the ispell dictionary can split them into
lexemes. But usually there is more than one variant of separation:

  forbruksvaremerkelov
    forbruk vare merke lov
    forbruk vare merkelov
    forbruk varemerke lov
    forbruk varemerkelov
    forbruksvare merke lov
    forbruksvare merkelov

(Notice: I don't know the translation, it's just an example. When we were
working on compound word support, we found a word which has 24 variants of
separation!!)

So, the query 'a + forbruksvaremerkelov' will be awful:

  a + ( (forbruk & vare & merke & lov) | (forbruk & vare & merkelov) | ... )

Of course, that is just an example from the top of my head, but a solution
for phrase search should work reasonably with such corner cases.

--
Teodor Sigaev  E-mail: teodor@sigaev.ru
               WWW: http://www.sigaev.ru/
On 12/8/05, Teodor Sigaev <teodor@sigaev.ru> wrote:

[snip]

> So, the query 'a + forbruksvaremerkelov' will be awful:
>
>   a + ( (forbruk & vare & merke & lov) | (forbruk & vare & merkelov) | ... )
>
> Of course, that is just an example from the top of my head, but a
> solution for phrase search should work reasonably with such corner cases.

WARNING: What follows is wild, hand-waving speculation, as I don't fully
understand the implications of compound words! ;-)

My naive impression is that it would be both possible and a good idea to
stem any compound words to their versions containing the most individual
lexemes. As an analogy, this would be similar to transforming composed
(Normalization Form C) UTF-8 characters into their decomposed
(Normalization Form D) versions. From your example above, the stemmed
version of 'forbrukvaremerkelov' would always decompose to
'forbruk vare merke lov', both for indexing and in to_tsquery().

For the purposes of phrase searching, or more generally proximity
searching, the compiled query

  a + forbrukvaremerkelov

might look something like

  a + forbruk + vare + merke + lov

and that's it ... all parts of the compound word are required, and
required to be in that order, for the "phrase" search to be valid.

A compiled query like

  a + (forbruk & vare & merke & lov)

wouldn't be valid anyway, because the user wants the entire compound word
to be adjacent to 'a', and the bare '&' op would allow any of the parts to
exist anywhere in the document ... or am I missing something? (I probably
am.)

The point is, once you go into an order-and-distance mode for two
user-supplied words (pre-stemming), you have to apply that mode to the
entire set of stemmed lexemes that are involved in the "phrase". If we
assume that "user-requested order and distance" uses a different set of
operators than free-form full-text searching, then I think it's doable.
Each sub-statement that comprises a phrase search is an atomic unit, and
can be applied anywhere within the global compiled query.

[Thinking ...]

Starting from that assumption, take the example of

  a + foonish & bar

The implication of the above assumption is that the '+' (or
'&[follows;dist=1]') operator has higher precedence than a bare '&'
operator. So, the next version of the query, before compilation is
complete, might look like:

  (a + foonish) & bar

Then we go through these steps:

  (a + (foo1 | foo2)) & bar          # decompose compound and multi-stem words
  ( (a + foo1) | (a + foo2) ) & bar  # create multiple atoms for multi-stem words

The end result is both non-ambiguous and reflects the most likely
user-intended query.

Let's try it with a compound word /and/ a multi-stem word, remembering
that "phrase operators" are only allowed between simple query terms, not
compound terms (grouped terms):

  1) a & (foonish + forbrukvaremerkelov) & !bar
     # user-supplied query
  2) a & ( (foo1 | foo2) + forbrukvaremerkelov ) & !bar
     # decompose multi-stem words
  3) a & ( (foo1 + forbrukvaremerkelov) | (foo2 + forbrukvaremerkelov) ) & !bar
     # make multiple atoms from multi-stemmed words involved in phrases
     # (this creates 1 atom per stem per multi-stem word, and yes, that
     # could get very big ... but, IMHO, slow but working corner cases
     # are OK)
  4) a & ( (foo1 + forbruk + vare + merke + lov)
         | (foo2 + forbruk + vare + merke + lov) ) & !bar
     # explode the compound words to their "decomposed" form, because
     # that's what ought to be in the indexed data

That meets the same criteria as the simpler example above, and I've not
said anything about compound and multi-stem words outside the "phrase
mode" portion of the query, because the current behaviour is what we want
in those cases.

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org
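The compilation steps Mike walks through can be sketched as a small
normalization pipeline. This Python model (the dictionaries and function
name are invented stand-ins, and a phrase is simplified to a flat word
list) shows steps 2-4 for the 'foonish + forbrukvaremerkelov' example:

```python
from itertools import product

# STEMS maps a word to its alternative stems (multi-stem words);
# COMPOUNDS maps a word to its fully decomposed lexeme sequence.
# Both are made-up stand-ins for what the dictionaries would produce.
STEMS = {"foonish": ["foo1", "foo2"]}
COMPOUNDS = {"forbrukvaremerkelov": ["forbruk", "vare", "merke", "lov"]}

def compile_phrase(words):
    """Return the list of lexeme sequences whose OR replaces the phrase."""
    # Step 2: each word becomes its list of stem alternatives.
    alts = [STEMS.get(w, [w]) for w in words]
    # Step 3: one atom per combination of stems (this is the part that
    # can blow up: one atom per stem per multi-stem word).
    atoms = [list(combo) for combo in product(*alts)]
    # Step 4: explode compound words into their decomposed lexemes.
    return [[lex for w in atom for lex in COMPOUNDS.get(w, [w])]
            for atom in atoms]

for atom in compile_phrase(["foonish", "forbrukvaremerkelov"]):
    print(" + ".join(atom))
# foo1 + forbruk + vare + merke + lov
# foo2 + forbruk + vare + merke + lov
```

Dropping the final exploding step (as Mike later suggests for languages
where compounds carry distinct meaning) amounts to removing the
COMPOUNDS lookup.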
> That's a simple case; what about languages such as Norwegian or German?
> They have compound words, and the ispell dictionary can split them into
> lexemes. But usually there is more than one variant of separation:
>
> [snip]
>
> Of course, that is just an example from the top of my head, but a
> solution for phrase search should work reasonably with such corner cases.

(Sorry for replying in the wrong place in the thread; I was away on a trip
and unsubscribed meanwhile.)

I'm a Swede, and Swedish is similar to Norwegian and German. Take this
example:

  lång hårig kvinna
  långhårig kvinna

Words are put together to make a new word with a different meaning. The
first example means "tall hairy woman" and the second is "woman with long
hair". If I were on, for example, a dating site, I'd want the
distinction. ;-) If not, I should enter both strings:

  ("lång hårig" | långhårig) & kvinna

... which is perfectly acceptable. IMHO, I don't see any point in
splitting these words.

Let's go back to the subject. What about a syntax like this:

  idxfti @@ to_tsquery('default', 'pizza & (Chicago | [New York])')

I.e. the exact match string is always atomic. Wouldn't that be doable
without any logical implications?

Best regards,
Marcus
On 12/12/05, Marcus Engene <mengpg@engene.se> wrote:

[snip]

> I'm a Swede, and Swedish is similar to Norwegian and German. Take this
> example:
>
>   lång hårig kvinna
>   långhårig kvinna
>
> Words are put together to make a new word with a different meaning. The
> first example means "tall hairy woman" and the second is "woman with
> long hair". If I were on, for example, a dating site, I'd want the
> distinction. ;-) If not, I should enter both strings:
>
>   ("lång hårig" | långhårig) & kvinna
>
> ... which is perfectly acceptable.

Well, that certainly kills my initial naive implementation plan! :-)
Thank you for the explanation.

[thinking]

Well, if compound words should always be treated as the user has entered
them, then it seems that the current implementation may be doing the wrong
thing with regard to stemming compound words. If compound words are being
decomposed to constituent stems, then you'd be getting semantically, or at
least contextually, incorrect results, right? (Again, not an expert
here. :-) )

[thinking more ...]

So, assuming that compound words should not be fully stemmed, due to the
way they are used to create new words with different meanings, if step (4)
were removed from my earlier plan then everything would continue to work
as proposed.

> IMHO, I don't see any point in splitting these words.
>
> Let's go back to the subject. What about a syntax like this:
>
>   idxfti @@ to_tsquery('default', 'pizza & (Chicago | [New York])')
>
> I.e. the exact match string is always atomic. Wouldn't that be doable
> without any logical implications?

I think there are several ways that phrase matching can be done in a
logically consistent way. That is certainly one of them, and it takes the
focus off a single infix operator. TS2 already recognises grouping
operations via parens, and restricting brackets ([,]) to surrounding only
simple expressions (no '&', '|', '!' or '()') shouldn't be too hard.

However, I'd still prefer that proximity searches could be specified more
explicitly by the user. Using the above example:

  pizza & (Chicago | [New York])

becomes

  pizza & (Chicago | New + York)

which is implicitly

  pizza & (Chicago | New +{follows;dist=1} York)

and that is read as: "pizza, and Chicago, or new followed by york at a
distance of 1", where the modifier to the '+' operator could be specified
by the user initially if desired.

While I understand and agree that "phrase searching" would be the most
common use for proximity-plus-direction operator modifiers, I see things
like the '+' operator and '[]' groupings as special cases of a more
generalized restriction operation (or set thereof) based on the positional
information recorded in (unstripped) indexes.

Thoughts?

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org