Thread: PATCH: Update snowball stemmers

PATCH: Update snowball stemmers

From
Arthur Zakirov
Date:
Hello hackers,

I'd like to propose the patch which syncs PostgreSQL snowball stemmers.
As Tom pointed [1] stemmers haven't synced for a very long time.

I copied all source files without changes, except replacing '#include
"../runtime/header.h"' with '#include "header.h"' and removing includes
of standard headers from utilities.c.

Hungarian language uses ISO-8859-1 and UTF-8 charsets in Postgres HEAD.
But in Snowball HEAD it is ISO-8859-2 per commit [2]. This patch changes
hungarian's charset from ISO-8859-1 to ISO-8859-2 too.

Additionally updated files in the patch are:
- utilities.c
- header.h

Will add to the next commitfest.

Any comments?

1 - https://www.postgresql.org/message-id/5689.1519054983%40sss.pgh.pa.us
2 - https://github.com/snowballstem/snowball/commit/4bcae97db044253ea2edae1dd3ca59f3cddd4b9d

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: PATCH: Update snowball stemmers

From
Andrew Dunstan
Date:

On 06/26/2018 08:20 AM, Arthur Zakirov wrote:
> Hello hackers,
>
> I'd like to propose the patch which syncs PostgreSQL snowball stemmers.
> As Tom pointed [1] stemmers haven't synced for a very long time.
>
> I copied all source files without changes, except replacing '#include
> "../runtime/header.h"' with '#include "header.h"' and removing includes
> of standard headers from utilities.c.
>
> Hungarian language uses ISO-8859-1 and UTF-8 charsets in Postgres HEAD.
> But in Snowball HEAD it is ISO-8859-2 per commit [2]. This patch changes
> hungarian's charset from ISO-8859-1 to ISO-8859-2 too.
>
> Additionally updated files in the patch are:
> - utilities.c
> - header.h
>
> Will add to the next commitfest.
>
> Any comments?
>
> 1 - https://www.postgresql.org/message-id/5689.1519054983%40sss.pgh.pa.us
> 2 - https://github.com/snowballstem/snowball/commit/4bcae97db044253ea2edae1dd3ca59f3cddd4b9d
>


I agree with Tom that we should sync with the upstream before we do 
anything else. This is a very large patch  but with fairly limited 
impact. I think now at the start of a dev cycle is the right time to 
apply it.

I don't know if we have a buildfarm animal testing Hungarian. Maybe we 
need a buildfarm animal or two testing a large number of locales.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: Update snowball stemmers

From
Tom Lane
Date:
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 06/26/2018 08:20 AM, Arthur Zakirov wrote:
>> I'd like to propose the patch which syncs PostgreSQL snowball stemmers.
>> As Tom pointed [1] stemmers haven't synced for a very long time.

> I agree with Tom that we should sync with the upstream before we do 
> anything else. This is a very large patch but with fairly limited 
> impact. I think now at the start of a dev cycle is the right time to 
> apply it.

Agreed in principle.  The thing that's actually a bit shaky here is
to decide what "upstream" to follow, given that the original maintainer
has retired and there doesn't seem to be active work going on.

> I don't know if we have a buildfarm animal testing Hungarian. Maybe we 
> need a buildfarm animal or two testing a large number of locales.

I dunno that we need to set up a permanent buildfarm member for this.
I had been thinking in terms of testing every available locale on my
own machines before pushing, but given that the code is pretty static,
do we need to do that repetitively?

            regards, tom lane


Re: PATCH: Update snowball stemmers

From
Arthur Zakirov
Date:
On Fri, Jul 06, 2018 at 12:37:26PM -0400, Tom Lane wrote:
> > I don't know if we have a buildfarm animal testing Hungarian. Maybe we 
> > need a buildfarm animal or two testing a large number of locales.
> 
> I dunno that we need to set up a permanent buildfarm member for this.
> I had been thinking in terms of testing every available locale on my
> own machines before pushing, but given that the code is pretty static,
> do we need to do that repetitively?

I run installcheck for hu_HU.UTF-8, hu_HU.ISO-8859-2 and ru_RU.UTF-8
locales on laptop. Tests passed.

I think it is worth to consider that most text search functions
(to_tsvector, to_tsquery) use specific text search configuration (english
or a custom one). Only few of them use default text search configuration.
They are in json.sql, jsonb.sql, tsearch.sql tests.

Is it good idea to modify some current tests to use default configuration
or to add specific tests for locale testing?

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: PATCH: Update snowball stemmers

From
Tom Lane
Date:
I see that the cfbot is having difficulty building this:

make[2]: *** No rule to make target `stem_ISO_8859_2_hungarian.o', needed by `dict_snowball.so'.  Stop.

Did you miss including Makefile changes in the submitted patch?

            regards, tom lane


Re: PATCH: Update snowball stemmers

From
Arthur Zakirov
Date:
On Wed, Sep 12, 2018 at 04:03:57PM -0400, Tom Lane wrote:
> I see that the cfbot is having difficulty building this:
> 
> make[2]: *** No rule to make target `stem_ISO_8859_2_hungarian.o', needed by `dict_snowball.so'.  Stop.
> 
> Did you miss including Makefile changes in the submitted patch?

Snowball's Makefile has changes related to hungarian stemmer:

-    stem_ISO_8859_1_hungarian.o \
...
+    stem_ISO_8859_2_hungarian.o \

I've noticed the error some time ago. And I tried to understand why
cfbot can't build it. Somehow cfbot didn't rename
stem_ISO_8859_1_hungarian. You can see it from the commit [1]. And there
is no new file stem_ISO_8859_2_hungarian.c [2].

Another problem with the header file [3]. cfbot created new file
stem_ISO_8859_2_hungarian.h, but it didn't delete old
stem_ISO_8859_1_hungarian.h.

In my laptop there is no such problem. I tried both "git apply" and
"patch -p1". And I can build the patch.

Maybe cfbot's "patch" doesn't understand the patch file. Maybe I have
too recent git (2.19.0)...

1 -
https://github.com/postgresql-cfbot/postgresql/commit/efc280b89b181657afe5412f398681b2c393a35c#diff-efde70a147d16a83b9b132b7f396ab6d
2 - https://github.com/postgresql-cfbot/postgresql/tree/commitfest/19/1697/src/backend/snowball/libstemmer
3 - https://github.com/postgresql-cfbot/postgresql/tree/commitfest/19/1697/src/include/snowball/libstemmer

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: PATCH: Update snowball stemmers

From
Tom Lane
Date:
Arthur Zakirov <a.zakirov@postgrespro.ru> writes:
> On Wed, Sep 12, 2018 at 04:03:57PM -0400, Tom Lane wrote:
>> Did you miss including Makefile changes in the submitted patch?

> I've noticed the error some time ago. And I tried to understand why
> cfbot can't build it. Somehow cfbot didn't rename
> stem_ISO_8859_1_hungarian. You can see it from the commit [1]. And there
> is no new file stem_ISO_8859_2_hungarian.c [2].

Oh, patch(1) doesn't understand git's idea of "renaming" files, cf

https://www.postgresql.org/message-id/CAEepm=2+CF3PshNRAs-r8jtPLKj0K6UEACeRSqBi5Cf74L=w7Q@mail.gmail.com

I'd suggest using "git diff --no-renames", since some of us will want
to apply the patch using patch(1).

> In my laptop there is no such problem. I tried both "git apply" and
> "patch -p1". And I can build the patch.

Really?  What version of patch was that?

            regards, tom lane


Re: PATCH: Update snowball stemmers

From
Arthur Zakirov
Date:
On Wed, Sep 12, 2018 at 05:47:05PM -0400, Tom Lane wrote:
> Oh, patch(1) doesn't understand git's idea of "renaming" files, cf
> 
> https://www.postgresql.org/message-id/CAEepm=2+CF3PshNRAs-r8jtPLKj0K6UEACeRSqBi5Cf74L=w7Q@mail.gmail.com
> 
> I'd suggest using "git diff --no-renames", since some of us will want
> to apply the patch using patch(1).

Ah, I see. I attached new version made with --no-renames. Will wait for
what cfbot will say.

> > In my laptop there is no such problem. I tried both "git apply" and
> > "patch -p1". And I can build the patch.
> 
> Really?  What version of patch was that?

It is 2.7.6. It seems it was released in 2018-02-03. Not sure, but maybe
they improved support for git's file renames.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment

Re: PATCH: Update snowball stemmers

From
Tom Lane
Date:
Arthur Zakirov <a.zakirov@postgrespro.ru> writes:
> Ah, I see. I attached new version made with --no-renames. Will wait for
> what cfbot will say.

I reviewed and pushed this.

As a cross-check on the patch, I cloned the Snowball github repo
and built the derived files in it.  I noticed that they'd incorporated
several new stemmers since 2007 --- not only your Nepali one, but
half a dozen more besides.  Since the point here is (IMO) mostly to
follow their lead on what's interesting, I went ahead and added those
as well.

In short, therefore, the commit includes the Nepali stuff from your
other thread as well as what was in this one.

Although I added nepali.stop from the other patch, I've not done
anything about updating our other stopword lists.  Presumably those
are a bit obsolete by now as well.  I wonder if we can prevail on
the Snowball people to make those available in some less painful way
than scraping them off assorted web pages.  Ideally they'd stick them
into their git repo ...

            regards, tom lane


Re: PATCH: Update snowball stemmers

From
Arthur Zakirov
Date:
On Mon, Sep 24, 2018 at 05:36:39PM -0400, Tom Lane wrote:
> I reviewed and pushed this.

Great! Thank you.

> As a cross-check on the patch, I cloned the Snowball github repo
> and built the derived files in it.  I noticed that they'd incorporated
> several new stemmers since 2007 --- not only your Nepali one, but
> half a dozen more besides.  Since the point here is (IMO) mostly to
> follow their lead on what's interesting, I went ahead and added those
> as well.

Agree. It is good decision. It may attract more users.

> Although I added nepali.stop from the other patch, I've not done
> anything about updating our other stopword lists.  Presumably those
> are a bit obsolete by now as well.  I wonder if we can prevail on
> the Snowball people to make those available in some less painful way
> than scraping them off assorted web pages.  Ideally they'd stick them
> into their git repo ...

They have repository snowball-website [1]. It is snowballstem.org
web-site source repository. It also stores stopwords for various
languages (for example for english [2]). I checked couple languages. It
seems their russian and danish stopword lists look like PostgreSQL's
stopword lists. But their english stopword list is different.

There is lack of stopword lists for the following languages:
- arabic
- irish
- lithuanian
- nepali - I can create a pull request to add it to snowball-website
- tamil

There is also another project, called Stopwords ISO [3]. But I'm not
sure about them. It stores stopword lists from various sources. And also
there are stopwords for languages mentioned above, except for nepali and
tamil.

I think I could make a script, which generates stopwords from
snowball-website repository. It can be run periodically. Is it suitable?
Also it would be good to move missing stopwords from Stopwords ISO to
snowball-website...

1 - https://github.com/snowballstem/snowball-website/tree/master/algorithms
2 - https://github.com/snowballstem/snowball-website/blob/master/algorithms/english/stop.txt
3 - https://github.com/stopwords-iso

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company