Re: fts, compond words? - Mailing list pgsql-general
From | Mike Rylander |
---|---|
Subject | Re: fts, compond words? |
Date | |
Msg-id | b918cf3d0512080809s1ecb1b2fn318ec886dbb1436e@mail.gmail.com |
In response to | Re: fts, compond words? (Teodor Sigaev <teodor@sigaev.ru>) |
List | pgsql-general |
On 12/8/05, Teodor Sigaev <teodor@sigaev.ru> wrote:
> > (a + foo1 + bar) | (a + foo2 + bar)
>
> That's a simple case; what about languages such as Norwegian or German? They have
> compound words, and an ispell dictionary can split them into lexemes. But usually
> there is more than one variant of separation:
>
> forbruksvaremerkelov
>     forbruk vare merke lov
>     forbruk vare merkelov
>     forbruk varemerke lov
>     forbruk varemerkelov
>     forbruksvare merke lov
>     forbruksvare merkelov
> (notice: I don't know the translation, it's just an example. When we were working on
> compound word support we found a word which has 24 variants of separation!!)
>
> So, the query 'a + forbruksvaremerkelov' will be awful:
>
> a + ( (forbruk & vare & merke & lov) | (forbruk & vare & merkelov) | ... )
>
> Of course, that is just an example off the top of my head, but a solution for phrase
> search should work reasonably with such corner cases.
>

WARNING: What follows is wild, hand-waving speculation, as I don't fully understand the implications of compound words! ;-)

My naive impression is that it would be both possible and a good idea to stem any compound word to the version containing the most individual lexemes. As an analogy, this would be similar to transforming composed (Normalization Form C) UTF-8 characters into their decomposed (Normalization Form D) versions. From your example above, the stemmed version of 'forbruksvaremerkelov' would always decompose to 'forbruk vare merke lov', both for indexing and in to_tsquery().

For the purposes of phrase searching, or more generally proximity searching, the compiled query

  a + forbruksvaremerkelov

might look something like

  a + forbruk + vare + merke + lov

and that's it ... all parts of the compound word are required, and required to be in that order, for the "phrase" search to be valid.

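The decomposition idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names and the list of split variants are hypothetical stand-ins for what an ispell dictionary would return, and '+' is the phrase notation used in this thread, not something to_tsquery() is known to accept.

```python
# Hypothetical sketch: always pick the split with the most lexemes,
# then chain the parts with the thread's '+' (adjacency) notation.

def most_granular_split(variants):
    """Return the variant containing the most individual lexemes."""
    return max(variants, key=len)

def phrase_query(left_term, variants):
    """Build 'left + part1 + part2 + ...' for a compound word."""
    parts = [left_term] + most_granular_split(variants)
    return " + ".join(parts)

# Split variants copied from the quoted example (assumed dictionary output).
VARIANTS = [
    ["forbruk", "vare", "merke", "lov"],
    ["forbruk", "vare", "merkelov"],
    ["forbruk", "varemerke", "lov"],
    ["forbruk", "varemerkelov"],
    ["forbruksvare", "merke", "lov"],
    ["forbruksvare", "merkelov"],
]

print(phrase_query("a", VARIANTS))
# prints: a + forbruk + vare + merke + lov
```

Because the same split would be used at index time and at query time, the other five variants never need to appear in the compiled query.
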
A compiled query like

  a + (forbruk & vare & merke & lov)

wouldn't be valid anyway, because the user wants the entire compound word to be adjacent to 'a', and the bare '&' op would allow any of the parts to exist anywhere in the document ... or am I missing something? (I probably am.)

The point is, once you go into an order-and-distance mode for two user-supplied words (pre-stemming), you have to apply that mode to the entire set of stemmed lexemes that are involved in the "phrase".

If that assumption holds, namely that "user requested order and distance" uses a different set of operators than free-form full text searching, then I think it's doable. Each sub-statement that comprises a phrase search is an atomic unit, and can be applied anywhere within the global compiled query.

[Thinking ...]

Starting from that assumption, take the example of

  a + foonish & bar

The implication of the above assumption is that the '+' (or '&[follows;dist=1]') operator has higher precedence than a bare '&' operator. So, the next version of the query, before compilation is complete, might look like:

  (a + foonish) & bar

Then we go through these steps:

  (a + (foo1 | foo2)) & bar
     # decompose compound and multi-stem words

  ( (a + foo1) | (a + foo2) ) & bar
     # create multiple atoms for multi-stem words

The end result is both unambiguous and reflects the most likely user-intended query.

Let's try it with a compound word /and/ a multi-stem word, remembering that "phrase operators" are only allowed between simple query terms, not compound terms (grouped terms):

  1) a & (foonish + forbruksvaremerkelov) & ! bar
     # user supplied query

  2) a & ( (foo1 | foo2) + forbruksvaremerkelov) & ! bar
     # decompose multi-stem words

  3) a & ( (foo1 + forbruksvaremerkelov) | (foo2 + forbruksvaremerkelov) ) & ! bar
     # make multiple atoms from multi-stemmed words involved in phrases
     # (this creates 1 atom per stem per multi-stem word, and yes, that
     # could get very big... but, IMHO, slow but working corner cases are OK)

  4) a & ( (foo1 + forbruk + vare + merke + lov) | (foo2 + forbruk + vare + merke + lov) ) & ! bar
     # explode the compound words to their "decomposed" form, because
     # that's what ought to be in the indexed data

That meets the same criteria as the simpler example above, and I've not said anything about compound and multi-stem words outside the "phrase mode" portion of the query because the current behaviour is what we want in those cases.

>
>
> --
> Teodor Sigaev                                   E-mail: teodor@sigaev.ru
>                                                    WWW: http://www.sigaev.ru/
>

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org
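
As a rough illustration of steps 1) through 4) in the message above, the phrase portion of the query could be expanded as follows. This is a sketch, not an implementation of anything in tsearch2: the STEMS and COMPOUNDS tables and the expand_phrase() helper are hypothetical stand-ins for stemmer and ispell output, and the result is just a string in the thread's own '+', '|', '&' notation.

```python
from itertools import product

# Hypothetical lookup tables standing in for stemmer / ispell results.
STEMS = {"foonish": ["foo1", "foo2"]}
COMPOUNDS = {"forbruksvaremerkelov": ["forbruk", "vare", "merke", "lov"]}

def expand_phrase(words):
    """Expand the phrase 'w1 + w2 + ...' into an OR of ordered atoms."""
    # Step 2: replace each word with its list of alternative stems.
    alternatives = [STEMS.get(word, [word]) for word in words]

    atoms = []
    # Step 3: one atom per combination of stems.
    for combo in product(*alternatives):
        # Step 4: explode each compound into its ordered parts.
        lexemes = []
        for term in combo:
            lexemes.extend(COMPOUNDS.get(term, [term]))
        atoms.append("(" + " + ".join(lexemes) + ")")

    if len(atoms) == 1:
        return atoms[0]
    return "( " + " | ".join(atoms) + " )"

# Step 1 is the user query: a & (foonish + forbruksvaremerkelov) & ! bar
phrase = expand_phrase(["foonish", "forbruksvaremerkelov"])
print("a & " + phrase + " & ! bar")
```

The number of atoms is the product of the stem counts of the words in the phrase, which is the "could get very big" corner case mentioned in step 3.
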