Thread: Tsearch vs Snowball, or what's a source file?
While looking at the tsearch-in-core patch I was distressed to notice that a good fraction of it is derived files, bearing notices such as

	/* This file was generated automatically by the Snowball to ANSI C compiler */

Our normal policy is "no derived files in CVS", so I went looking to see if we couldn't avoid that. I now see that contrib/tsearch2 has been doing the same thing for a while, and it's risen up to bite us before, e.g.
http://archives.postgresql.org/pgsql-committers/2005-09/msg00137.php

I had not previously known anything about Snowball, but after perusing their website http://snowball.tartarus.org/ for a bit, I believe the following is an accurate summary:

1. The original word-stemming algorithms are written in a special language, "Snowball". You can get both the Snowball compiler and the original ".sbl" source files off the Snowball site, but the files in the patch are not those.

2. The Snowball people also distribute a "pre-compiled" version of their stuff, i.e., the results of generating ANSI C code from all the stemming algorithms. They call this distribution "libstemmer".

3. What we've been distributing in contrib/tsearch2/snowball is a severely cut-back subset of libstemmer, i.e., just the English and Russian stemmers. This accounts for the occasional complaints on the mailing lists from people who were trying to add other stemmers from the libstemmer distribution (and running into version-skew problems, because the version we're using is not very up to date).

4. The proposed tsearch-in-core patch includes a larger subset of libstemmer, but it's still not the whole thing, and it still seems to be a modified copy rather than an exact one.

There isn't any part of this that seems to me to be a good idea. Arguably we should be relying on the original .sbl files, but that would make the Snowball compiler a required tool for building distributions, which is a dependency I for one don't want to add. In any case there's probably not a lot of practical difference between relying on the Snowball project's .sbl files and relying on their libstemmer distribution. Either way, we are importing someone else's sources. (At least they're BSD-licensed sources...)

What I definitely *don't* like is that we've whacked the fileset around in ways that make it hard for someone to drop in a newer version of the upstream sources. The filenames don't match, the directory layout doesn't match, and to add insult to injury we've plastered our copyright on their files. Following the precedent of the zic timezone files would suggest dropping an *unmodified* copy of the libstemmer distro into its own subdirectory of our CVS, and doing whatever we have to do to compile it without any changes, so that we can drop in updates later without creating problems. (This is, in fact, what the Snowball people recommend for incorporating their code into a larger application.)

OTOH, keeping our copy of the zic files up to date has proven to be a significant pain in the neck, so I'm not sure I care to follow that precedent exactly. The Snowball files may not change as often as politicians invent new timezone laws, but they seem to change regularly enough --- the libstemmer tarball I just downloaded from their website seems to have been generated barely a week ago, and no, it doesn't match what's in the patch now.

Is there a reasonable way to treat libstemmer as an external library?

			regards, tom lane
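For reference, treating libstemmer as an external library would mean coding only against its small public interface in libstemmer.h (sb_stemmer_new, sb_stemmer_stem, sb_stemmer_delete), rather than carrying its generated sources in our tree. A minimal sketch of what a caller looks like, assuming a stock libstemmer_c installation is available to compile and link against; the algorithm name and test word are just examples:

	/* Sketch: stem one word through libstemmer's public API. */
	#include <stdio.h>
	#include <string.h>
	#include "libstemmer.h"

	int
	main(void)
	{
		/* NULL encoding selects UTF-8 in the stock distribution */
		struct sb_stemmer *stemmer = sb_stemmer_new("english", NULL);
		const char *word = "connections";
		const sb_symbol *stem;

		if (stemmer == NULL)
			return 1;			/* unknown algorithm or encoding */

		stem = sb_stemmer_stem(stemmer, (const sb_symbol *) word,
							   (int) strlen(word));
		/* sb_stemmer_length() reports the byte length of the last result */
		printf("%.*s\n", sb_stemmer_length(stemmer), (const char *) stem);

		sb_stemmer_delete(stemmer);
		return 0;
	}

The point of the sketch is only that nothing outside libstemmer.h has to be visible to the caller; all the generated per-language files stay behind that interface.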
Tom,

> Is there a reasonable way to treat libstemmer as an external library?

Hmmm ... do we want to do that if we're distributing it in core? That would require us to have a --with-tsearch compile switch so that people who don't want to find & build libstemmer can build PostgreSQL. I thought the whole point of this feature was to have a version of Tsearch which "just worked" for users.

As annoying as it may be to keep it updated, I think it's probably worth it from a user-experience standpoint. However, we should definitely put the exact libstemmer C files, as distributed by the project, somewhere, so that updating the stemmer each time we do a patch release is simply a matter of download and rsync.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
>> Is there a reasonable way to treat libstemmer as an external library?

> Hmmm ... do we want to do that if we're distributing it in core? That
> would require us to have a --with-tsearch compile switch so that people
> who don't want to find & build libstemmer can build PostgreSQL. I thought
> the whole point of this feature was to have a version of Tsearch which
> "just worked" for users.

True. I just noticed that the upstream master distribution (their compiler source and .sbl files) weighs in at half the size of the libstemmer distribution: 68K vs 129K in tar.gz format --- no doubt due to all the repetitive boilerplate in the generated files. I'm not sure if the compiler source has any portability issues, but if not it is interesting to consider the idea of bundling the master distro instead of libstemmer.

This would fix at least one issue that we otherwise will have, which is that the #include paths they chose to generate libstemmer with seem a bit unfriendly for our purposes. The #include commands are determined by compiler options, so we could fix them if compiling the .sbl files on the fly.

This makes no difference in terms of the ease of tracking their changes, of course, but it just feels better to me to be distributing "real" source code and not derived files.

			regards, tom lane
> This makes no difference in terms of the ease of tracking their changes,
> of course, but it just feels better to me to be distributing "real"
> source code and not derived files.

Hmm.

1. Compiling from .sbl with the original Snowball makefile requires Perl, and it doesn't work cleanly:

% gmake
....
cc -o snowball compiler/space.o compiler/tokeniser.o compiler/analyser.o compiler/generator.o compiler/driver.o compiler/generator_java.o
gmake: *** No rule to make target `libstemmer/libstemmer_c.in', needed by `libstemmer/libstemmer.c'.  Stop.

I used the http://snowball.tartarus.org/dist/snowball_code.tgz tarball.

2. Snowball's build infrastructure doesn't support the Windows target.

I agree with simplifying the support process, but IMHO it's much simpler to do that with the C sources and pgsql's own build infrastructure.

And where should it be placed? A src/snowball directory?

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
> 1 Compiling from .sbl by original Snowball's makefile requires Perl and
> doesn't work cleanly:

AFAIK, you haven't received any answers to your email on the snowball mailing list.

> 2 Snowball's compiling infrastructure doesn't support Windows target.

3. I understand your wish to include only real sources, but snowball_code.tgz contains some extra material (docs, examples) which isn't needed for tsearch & pgsql.

4. You suggest simplifying support of snowball by using the original file layout from Snowball.

For those reasons, we suggest putting into the source tree (in src/snowball) only three directories from snowball_code.tgz:
- /compiler   - the compiler from *.sbl to *.c
- /runtime    - common code for all stemmers
- /algorithms - the *.sbl files

and using pgsql's makefile infrastructure to compile the stemmers.

Comments, objections?

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
> 2 Snowball's compiling infrastructure doesn't support Windows target.

Yeah. Another problem with using their original source code is that running the Snowball compiler during build would not work for cross-compiled builds of Postgres, at least not without solving the problem of building some code for the host platform instead of the target.

So what I'm thinking now is we should import libstemmer instead of the snowball_code representation. I haven't gotten as far as thinking about exactly how to lay out the files, though.

			regards, tom lane
I wrote:
> Teodor Sigaev <teodor@sigaev.ru> writes:
>> 2 Snowball's compiling infrastructure doesn't support Windows target.

> Yeah. Another problem with using their original source code is that
> running the Snowball compiler during build would not work for
> cross-compiled builds of Postgres, at least not without solving the
> problem of building some code for the host platform instead of the
> target.

> So what I'm thinking now is we should import libstemmer instead of the
> snowball_code representation. I haven't gotten as far as thinking about
> exactly how to lay out the files, though.

I've done some more work on this point. After looking at the Snowball code in more detail, I'm thinking it'd be a good idea to keep it at arm's length in a loadable shared library, instead of incorporating it directly into the backend. This is because they don't see anything wrong with exporting random global function names like "eq_v" and "skip_utf8"; so the probability of name collisions is a bit too high for my taste.

The current tsearch_core patch envisions having a couple of the snowball stemmers in the core backend and the rest in a loadable library, but I suggest we just put them all in a loadable library, with the only entry points being snowball_init() and snowball_lexize() tsearch dictionary support functions. (I am thinking of having just one such function pair, with the init function taking an init option to select which stemmer to use, instead of a separate Postgres function pair per stemmer.)

Attached is a rough proof-of-concept patch for this. It doesn't do anything useful, but it does prove that we can compile and link the Snowball stemmers into a Postgres loadable module with only trivial changes to their source code. The code compiles cleanly (zero warnings in gcc). The file layout is

	src/backend/snowball/Makefile		our files
	src/backend/snowball/README
	src/backend/snowball/dict_snowball.c
	src/backend/snowball/libstemmer/*.c	their .c files
	src/include/snowball/header.h		intercepting .h file
	src/include/snowball/libstemmer/*.h	their .h files

If there're no objections, I'll push forward with completing the dictionary support functions to go with this infrastructure.

			regards, tom lane
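A rough sketch of the shape such a single entry-point pair might take: one table of stemmers inside the module, with the init function selecting an entry by its option value. The option-passing and lexize conventions shown here are assumptions for illustration, not taken from the actual patch:

	/*
	 * Sketch only: one init/lexize pair fronting all Snowball stemmers
	 * from a single loadable module.  The argument handling and the
	 * (empty) stemmer table are assumptions for illustration.
	 */
	#include "postgres.h"
	#include "fmgr.h"

	PG_MODULE_MAGIC;

	struct SN_env;						/* opaque stemmer state from the generated code */

	typedef struct StemmerEntry
	{
		const char *lang;				/* init option value, e.g. "english" */
		struct SN_env *(*create) (void);
		void		(*close) (struct SN_env *);
		int			(*stem) (struct SN_env *);
	} StemmerEntry;

	/* One row per generated stemmer; would be filled in from the libstemmer sources. */
	static const StemmerEntry stemmer_table[] = {
		/* {"english", ..._create_env, ..._close_env, ..._stem}, and so on */
		{NULL, NULL, NULL, NULL}
	};

	typedef struct DictSnowball
	{
		const StemmerEntry *entry;
		struct SN_env *z;
	} DictSnowball;

	PG_FUNCTION_INFO_V1(snowball_init);
	Datum
	snowball_init(PG_FUNCTION_ARGS)
	{
		char	   *lang = PG_GETARG_CSTRING(0);	/* assumed option format */
		DictSnowball *d = (DictSnowball *) palloc0(sizeof(DictSnowball));
		const StemmerEntry *e;

		for (e = stemmer_table; e->lang != NULL; e++)
		{
			if (pg_strcasecmp(e->lang, lang) == 0)
			{
				d->entry = e;
				d->z = e->create();
				PG_RETURN_POINTER(d);
			}
		}
		elog(ERROR, "unrecognized Snowball language: \"%s\"", lang);
		PG_RETURN_POINTER(NULL);		/* keep compiler quiet */
	}

	PG_FUNCTION_INFO_V1(snowball_lexize);
	Datum
	snowball_lexize(PG_FUNCTION_ARGS)
	{
		DictSnowball *d = (DictSnowball *) PG_GETARG_POINTER(0);
		char	   *token = PG_GETARG_CSTRING(1);

		/*
		 * Real code would load the token into d->z with SN_set_current(),
		 * run d->entry->stem(d->z), and copy the result into a palloc'd
		 * lexeme structure.  That plumbing is omitted in this sketch.
		 */
		(void) d;
		(void) token;
		PG_RETURN_POINTER(NULL);
	}

The attraction of this shape is that the backend itself never links the Snowball object files at all, so names like eq_v never enter the backend's symbol space; only the dictionary support functions are meant to be called from outside the module.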
> I've done some more work on this point. After looking at the Snowball
> code in more detail, I'm thinking it'd be a good idea to keep it at
> arm's length in a loadable shared library, instead of incorporating it

I split the stemmers into two sets because of the regression tests. As I remember, there were some problems with loadable conversions and configure's --disable-shared flag.

> the only entry points being snowball_init() and snowball_lexize()
> tsearch dictionary support functions. (I am thinking of having just one
> such function pair, with the init function taking an init option to
> select which stemmer to use, instead of a separate Postgres function
> pair per stemmer.)

So we need to change the dictinitoption format of snowball dictionaries to point to both the stop-word file and the language name.

> If there're no objections, I'll push forward with completing the
> dictionary support functions to go with this infrastructure.

How will we synchronize our changes to the patch?

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
> I split the stemmers into two sets because of the regression tests. As I
> remember, there were some problems with loadable conversions and
> configure's --disable-shared flag.

I'm not worried about supporting --disable-shared installations very much. They didn't have tsearch support before, either, and they have not passed regression in many years for lack of plpgsql.

> So we need to change the dictinitoption format of snowball dictionaries
> to point to both the stop-word file and the language name.

Right.

> How will we synchronize our changes to the patch?

Go ahead and make the changes you want, and then I'll work on this.

			regards, tom lane
Tom Lane wrote:
> Teodor Sigaev <teodor@sigaev.ru> writes:
>> So we need to change the dictinitoption format of snowball dictionaries
>> to point to both the stop-word file and the language name.
>
> Right.

Is there any chance of getting support for languages other than English and Russian into the tsearch2 distribution?

Yours,
Laurenz Albe
Albe Laurenz wrote:
> Is there any chance of getting support for languages other than English
> and Russian into the tsearch2 distribution?

Danish, Dutch, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Spanish, Swedish, Russian and English.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
> Go ahead and make the changes you want, and then I'll work on this.

So, I'm planning the following for this weekend:
1) rename FULLTEXT to TEXT SEARCH in the SQL commands
2) rework the Snowball stemmers as Tom suggested
3) ALTER FULLTEXT CONFIGURATION cfgname ADD/ALTER/DROP MAPPING
4) remove support for a default configuration per schema; there will be only one default configuration per locale.

About the security holes in PARSER/DICTIONARY, I see the following ways now:

1) Allow only superusers to do CREATE/ALTER/DROP PARSER/DICTIONARY.
   Disadvantage: hosting users will not be able to change dictionaries.

2) Remove CREATE/ALTER/DROP PARSER, split pg_ts_dict into pg_ts_dict_template and pg_ts_dict, and change CREATE/ALTER/DROP DICTIONARY accordingly.
   Disadvantage: parsers and dictionary templates will not dump/restore; they would have to be restored manually (just an INSERT into pg_ts_parser/pg_ts_dict_template).

3) Similar to the previous point, but:
   * CREATE/ALTER/DROP PARSER - superuser only
   * CREATE/ALTER/DROP DICTIONARY TEMPLATE - superuser only
   * CREATE/ALTER/DROP DICTIONARY - allowed to non-superusers
   Disadvantage: a new command, CREATE/ALTER/DROP DICTIONARY TEMPLATE.

Which way do we choose? Or am I missing some variant? I would like to go with way 3)...

Comments?

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
>> 1) rename FULLTEXT to TEXT SEARCH in the SQL commands

> Working on it, I found a rather obvious undesired side-effect: if TEXT
> becomes a keyword then any output of the name of the text type becomes
> quoted, even if TEXT is in the unreserved_keyword list.

Yeah, I was aware that would happen. What I've been thinking for some time is that we ought to fix quote_ident and ruleutils.c to distinguish "unreserved" keywords from the other ones, and not quote unreserved ones. The list of unreserved words has gotten longer and more invasive in every release, and I don't see that stopping soon. It's already annoying that perfectly ordinary words like "document" and "key" get quoted.

The main argument I can think of for not doing this is that if a currently non-reserved keyword becomes reserved in a future release, then having quoted it would prevent problems with restoring dump scripts. I don't find this argument very compelling, though; it seems just as likely that words we don't even have as keywords today will become reserved later. The only thing I see on the horizon that does fit that argument is WITH, which we could special-case.

Comments? I'm willing to make this happen if there are no objections.

			regards, tom lane
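A minimal sketch of the proposed quoting rule: leave an identifier unquoted when it is lexically safe and is, at worst, an unreserved keyword. The keyword_needs_quoting() helper below is hypothetical, a stand-in for whatever keyword-category lookup the real change would use (stubbed here with a few reserved words), and the quoting path is simplified:

	#include <stdbool.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	/*
	 * Hypothetical stand-in for a keyword-category lookup: true only for
	 * keywords outside the unreserved class.  Stubbed with a few fully
	 * reserved words for illustration.
	 */
	static bool
	keyword_needs_quoting(const char *word)
	{
		static const char *const reserved[] = {"select", "table", "where", "with", NULL};
		int			i;

		for (i = 0; reserved[i] != NULL; i++)
			if (strcmp(word, reserved[i]) == 0)
				return true;
		return false;
	}

	/* Sketch of quote_identifier under the proposed rule. */
	static const char *
	quote_identifier_sketch(const char *ident)
	{
		bool		safe;
		const char *ptr;
		char	   *result;

		/* The unquoted form is usable only for [a-z_][a-z0-9_]* ... */
		safe = ((ident[0] >= 'a' && ident[0] <= 'z') || ident[0] == '_');
		for (ptr = ident; safe && *ptr; ptr++)
		{
			char		ch = *ptr;

			if (!((ch >= 'a' && ch <= 'z') ||
				  (ch >= '0' && ch <= '9') ||
				  ch == '_'))
				safe = false;
		}

		/*
		 * ... and, under the proposal, only if it is not a *reserved*
		 * keyword.  (Currently, matching any keyword at all forces quoting.)
		 */
		if (safe && keyword_needs_quoting(ident))
			safe = false;

		if (safe)
			return ident;

		/* Simplified: real code also doubles any embedded double quotes. */
		result = malloc(strlen(ident) + 3);
		sprintf(result, "\"%s\"", ident);
		return result;
	}

	int
	main(void)
	{
		/* "document" and "key" come back unquoted; "table" and "with" do not. */
		printf("%s %s %s %s\n",
			   quote_identifier_sketch("document"),
			   quote_identifier_sketch("key"),
			   quote_identifier_sketch("table"),
			   quote_identifier_sketch("with"));
		return 0;
	}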