Thread: Tsearch vs Snowball, or what's a source file?

Tsearch vs Snowball, or what's a source file?

From
Tom Lane
Date:
While looking at the tsearch-in-core patch I was distressed to notice
that a good fraction of it is derived files, bearing notices such as

/* This file was generated automatically by the Snowball to ANSI C compiler */

Our normal policy is "no derived files in CVS", so I went looking to
see if we couldn't avoid that.  I now see that contrib/tsearch2 has been
doing the same thing for awhile, and it's risen up to bite us before, eg
http://archives.postgresql.org/pgsql-committers/2005-09/msg00137.php

I had not previously known anything about Snowball, but after perusing
their websitehttp://snowball.tartarus.org/
for a bit, I believe the following is an accurate summary:

1. The original word-stemming algorithms are written in a special
language "Snowball".  You can get both the Snowball compiler and the
original ".sbl" source files off the Snowball site, but these files are
not those.

2. The Snowball people also distribute a "pre-compiled" version of their
stuff, ie, the results of generating ANSI C code from all the stemming
algorithms.  They call this distribution "libstemmer".

3. What we've been distributing in contrib/tsearch2/snowball is a
severely cut-back subset of libstemmer, ie, just the English and Russian
stemmers.  This accounts for the occasional complaints in the mailing
lists from people who were trying to add other stemmers from the
libstemmer distribution (and running into version-skew problems, because
the version we're using is not very up-to-date).

4. The proposed tsearch-in-core patch includes a larger subset of
libstemmer, but it's still not the whole thing, and it still seems to be
a modified copy rather than an exact one.


There isn't any part of this that seems to me to be a good idea.
Arguably we should be relying on the original .sbl files, but that would
make the Snowball compiler a required tool for building distributions,
which is a dependency I for one don't want to add.  In any case there's
probably not a lot of practical difference between relying on the
Snowball project's .sbl files and relying on their libstemmer
distribution.  Either way, we are importing someone else's sources.
(At least they're BSD-license sources...)

What I definitely *don't* like is that we've whacked the fileset around
in ways that make it hard for someone to drop in a newer version of the
upstream sources.  The filenames don't match, the directory layout
doesn't match, and to add insult to injury we've plastered our copyright
on their files.

Following the precedent of the zic timezone files would suggest dropping
an *unmodified* copy of the libstemmer distro into its own subdirectory
of our CVS, and doing whatever we have to do to compile it without any
changes, so that we can drop in updates later without creating problems.
(This is, in fact, what the Snowball people recommend for incorporating
their code into a larger application.)

OTOH, keeping our copy of the zic files up-to-date has proven to be a
significant pain in the neck, and so I'm not sure I care to follow that
precedent exactly.  The Snowball files may not change as often as
politicians invent new timezone laws, but they seem to change regularly
enough --- the libstemmer tarball I just downloaded from their website
seems to have been generated barely a week ago, and no it doesn't match
what's in the patch now.

Is there a reasonable way to treat libstemmer as an external library?
        regards, tom lane


Re: Tsearch vs Snowball, or what's a source file?

From
Josh Berkus
Date:
Tom,

> Is there a reasonable way to treat libstemmer as an external library?

Hmmm ... do we want to do that if we're distributing it in core?  That 
would require us to have a --with-tsearch compile switch so that people 
who don't want to find & build libstemmer can build PostgreSQL.  I thought 
the whole point of this feature was to have a version of Tsearch which 
"just worked" for users.

As annoying as it may be to keep it updated, I think it's probably worth it 
from a user experience standpoint.  However, we should definitely put the 
exact libstemmer C files, as distributed by the project, somewhere so that 
updating stemmer each time we do a patch release is simply a matter of 
download and rsync.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Tsearch vs Snowball, or what's a source file?

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>> Is there a reasonable way to treat libstemmer as an external library?

> Hmmm ... do we want to do that if we're distributing it in core?  That 
> would require us to have a --with-tsearch compile switch so that people 
> who don't want to find & build libstemmer can build PostgreSQL.  I thought 
> the whole point of this feature was to have a version of Tsearch which 
> "just worked" for users.

True.

I just noticed that the upstream master distribution (their compiler
source and .sbl files) weighs in at half the size of the libstemmer
distribution: 68K vs 129K in tar.gz format --- no doubt due to all the
repetitive boilerplate in the generated files.  I'm not sure if the
compiler source has any portability issues, but if not it is interesting
to consider the idea of bundling the master distro instead of
libstemmer.  This would fix at least one issue that we otherwise will
have, which is that the #include-paths they chose to generate libstemmer
with seem a bit unfriendly for our purposes.  The #include commands are
determined by compiler options, so we could fix them if compiling the
.sbl files on the fly.

This makes no difference in terms of the ease of tracking their changes,
of course, but it just feels better to me to be distributing "real"
source code and not derived files.
        regards, tom lane


Re: Tsearch vs Snowball, or what's a source file?

From
Teodor Sigaev
Date:
> This makes no difference in terms of the ease of tracking their changes,
> of course, but it just feels better to me to be distributing "real"
> source code and not derived files.

Hmm.
1 Compiling from .sbl by original Snowball's makefile requires Perl and doesn't 
work cleanly:
% gmake
....
cc -o snowball compiler/space.o compiler/tokeniser.o compiler/analyser.o 
compiler/generator.o compiler/driver.o compiler/generator_java.o
gmake: *** No rule to make target `libstemmer/libstemmer_c.in', needed by 
`libstemmer/libstemmer.c'.  Stop.
I used http://snowball.tartarus.org/dist/snowball_code.tgz tarball.

2 Snowball's compiling infrastructure doesn't support Windows target.

I agree with simplify support process but, IMHO, it's much simpler to do it with 
C sources with pgsql's building infrastructure

And where should it be placed? src/snowball directory?

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Tsearch vs Snowball, or what's a source file?

From
Teodor Sigaev
Date:
> 1 Compiling from .sbl by original Snowball's makefile requires Perl and 
> doesn't work cleanly:
AFAIK, you don't receive any answers on your email in snowball mailing list.

> 2 Snowball's compiling infrastructure doesn't support Windows target.

3 I understand your wish about including only real sources and snowball_code.tgz 
contains some extra data (doc, examples) which isn't needed for tsearch & pgsql

4 Your suggest to simplify support of snowball by using original file's layout 
from Snowball

For that reasons, we suggest to put into source tree (in src/snowball) only 
three directory for snowball_code.tgz: - /compiler - compiler from *.sbl to *.c - /runtime - common code for all
stemmers- /algorithms - *.sbl files
 
and use pgsql's makefile infrastructure to compiling stemmers.

Comments, objections?

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Tsearch vs Snowball, or what's a source file?

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
> 2 Snowball's compiling infrastructure doesn't support Windows target.

Yeah.  Another problem with using their original source code is that
running the Snowball compiler during build would not work for
cross-compiled builds of Postgres, at least not without solving the
problem of building some code for the host platform instead of the
target.

So what I'm thinking now is we should import libstemmer instead of the
snowball_code representation.  I haven't gotten as far as thinking about
exactly how to lay out the files though.
        regards, tom lane


Re: Tsearch vs Snowball, or what's a source file?

From
Tom Lane
Date:
I wrote:
> Teodor Sigaev <teodor@sigaev.ru> writes:
>> 2 Snowball's compiling infrastructure doesn't support Windows target.

> Yeah.  Another problem with using their original source code is that
> running the Snowball compiler during build would not work for
> cross-compiled builds of Postgres, at least not without solving the
> problem of building some code for the host platform instead of the
> target.

> So what I'm thinking now is we should import libstemmer instead of the
> snowball_code representation.  I haven't gotten as far as thinking about
> exactly how to lay out the files though.

I've done some more work on this point.  After looking at the Snowball
code in more detail, I'm thinking it'd be a good idea to keep it at
arm's length in a loadable shared library, instead of incorporating it
directly into the backend.  This is because they don't see anything
wrong with exporting random global function names like "eq_v" and
"skip_utf8"; so the probability of name collisions is a bit too high for
my taste.  The current tsearch_core patch envisions having a couple of
the snowball stemmers in the core backend and the rest in a loadable
library, but I suggest we just put them all in a loadable library, with
the only entry points being snowball_init() and snowball_lexize()
tsearch dictionary support functions.  (I am thinking of having just one
such function pair, with the init function taking an init option to
select which stemmer to use, instead of a separate Postgres function
pair per stemmer.)

Attached is a rough proof-of-concept patch for this.  It doesn't do
anything useful, but it does prove that we can compile and link the
Snowball stemmers into a Postgres loadable module with only trivial
changes to their source code.  The code compiles cleanly (zero warnings
in gcc).  The file layout is

src/backend/snowball/Makefile        our files
src/backend/snowball/README
src/backend/snowball/dict_snowball.c
src/backend/snowball/libstemmer/*.c    their .c files

src/include/snowball/header.h        intercepting .h file
src/include/snowball/libstemmer/*.h    their .h files

If there're no objections, I'll push forward with completing the
dictionary support functions to go with this infrastructure.

            regards, tom lane


Attachment

Re: Tsearch vs Snowball, or what's a source file?

From
Teodor Sigaev
Date:
> I've done some more work on this point.  After looking at the Snowball
> code in more detail, I'm thinking it'd be a good idea to keep it at
> arm's length in a loadable shared library, instead of incorporating it

I splited stemmers to two sets because of regression test. As I remember, there 
was some problems with loadable conversions  and configure's flag --disable-shared


> the only entry points being snowball_init() and snowball_lexize()
> tsearch dictionary support functions.  (I am thinking of having just one
> such function pair, with the init function taking an init option to
> select which stemmer to use, instead of a separate Postgres function
> pair per stemmer.)
So, it's needed to change dictinitoption format of snowball dictionaries to 
point both stop-word file and language's name.


> If there're no objections, I'll push forward with completing the
> dictionary support functions to go with this infrastructure.
How will we synchronize our changes in patch?

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Tsearch vs Snowball, or what's a source file?

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
> I splited stemmers to two sets because of regression test. As I
> remember, there was some problems with loadable conversions and
> configure's flag --disable-shared

I'm not worried about supporting --disable-shared installations very
much.  They didn't have tsearch support before, either, and they have
not passed regression in many years for lack of plpgsql.

> So, it's needed to change dictinitoption format of snowball dictionaries to 
> point both stop-word file and language's name.

Right.

> How will we synchronize our changes in patch?

Go ahead and make the changes you want, and then I'll work on this.
        regards, tom lane


Re: Tsearch vs Snowball, or what's a source file?

From
"Albe Laurenz"
Date:
Tom Lane wrote:
> Teodor Sigaev <teodor@sigaev.ru> writes:
>> So, it's needed to change dictinitoption format of snowball
dictionaries to
>> point both stop-word file and language's name.
>
> Right.

Is there any chance to get support for other languages than English and
Russian into the tsearch2 distribution?

Yours,
Laurenz Albe


Re: Tsearch vs Snowball, or what's a source file?

From
Teodor Sigaev
Date:
danish, dutch, finnish, french, german, hungarian, italian, norwegian, 
portuguese, spanish, swedish, russin and english

Albe Laurenz wrote:
> Tom Lane wrote:
>> Teodor Sigaev <teodor@sigaev.ru> writes:
>>> So, it's needed to change dictinitoption format of snowball
> dictionaries to 
>>> point both stop-word file and language's name.
>> Right.
> 
> Is there any chance to get support for other languages than English and
> Russian into the tsearch2 distribution?
> 
> Yours,
> Laurenz Albe
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Tsearch vs Snowball, or what's a source file?

From
Teodor Sigaev
Date:
> Go ahead and make the changes you want, and then I'll work on this.

So, I'm planing on this weekend:
1) rename FULLTEXT to TEXT SEARCH in SQL command
2) rework Snowball stemmer's as Tom suggested
3) ALTER FULLTEXT CONFIGURATION cfgname ADD/ALTER/DROP MAPPING
4) remove support of default configuration per scheme. Default configuration   will be only one per locale.


About security holes in PARSER/DICTIONARY. I see following ways now:
1) Allow to superuser only to do CREATE/ALTER/DROP PARSER/DICTIONARY   Disadvantage: hosting users will not be able to
changedictionaries
 
2) Remove CREATE/ALTER/DROP PARSER, split pg_ts_dict to pg_ts_dict_template   and pg_ts_dict and accordingly change
CREATE/ALTER/DROPDICTIONARY   Disadvantage: parser and dictionary's template will not dump/restore,                 it
shouldbe restored manually (just a INSERT into                 pg_ts_parser/pg_ts_dict_template)
 
3) Similar to previous point, but:   * CREATE/ALTER/DROP PARSER - super-user only   * CREATE/ALTER/DROP DICTIONARY
TEMPLATE- super-user only   * CREATE/ALTER/DROP DICTIONARY - allowed to non-superuser   Disadvantage: new command
CREATE/ALTER/DROPDICTIONARY TEMPLATE
 


Which way do we choose? or I miss some variant?

I would like to go by 3) way... Comments?

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 


Re: Tsearch vs Snowball, or what's a source file?

From
Tom Lane
Date:
Teodor Sigaev <teodor@sigaev.ru> writes:
>> 1) rename FULLTEXT to TEXT SEARCH in SQL command

>   Working on it, I found rather obvious undesired side-effect: if TEXT
> becomes a keyword then any output of name of text type becomes
> quoted. Even if TEXT is in unreserved_keyword list.

Yeah, I was aware that would happen.  What I've been thinking for some
time is that we ought to fix quote_ident and ruleutils.c to distinguish
"unreserved" keywords from the other ones, and not quote unreserved
ones.  The list of unreserved words has gotten longer and more invasive
in every release, and I don't see that stopping soon.  It's already
annoying that perfectly ordinary words like "document" and "key" get
quoted.

The main argument I can think of for not doing this is that if a
currently non-reserved keyword becomes reserved in a future release,
then having quoted it would prevent problems with restoring dump
scripts.  I don't find this argument very compelling, though; it seems
just as likely that words we don't even have as keywords today will
become reserved later.  The only thing I see on the horizon that does
fit that argument is WITH, which we could special-case.

Comments?  I'm willing to make this happen if there are no objections.
        regards, tom lane