Re: PATCH: Update snowball stemmers - Mailing list pgsql-hackers

From Arthur Zakirov
Subject Re: PATCH: Update snowball stemmers
Msg-id 20180925114506.GA14666@zakirov.localdomain
In response to Re: PATCH: Update snowball stemmers  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Mon, Sep 24, 2018 at 05:36:39PM -0400, Tom Lane wrote:
> I reviewed and pushed this.

Great! Thank you.

> As a cross-check on the patch, I cloned the Snowball github repo
> and built the derived files in it.  I noticed that they'd incorporated
> several new stemmers since 2007 --- not only your Nepali one, but
> half a dozen more besides.  Since the point here is (IMO) mostly to
> follow their lead on what's interesting, I went ahead and added those
> as well.

Agreed. It is a good decision, and it may attract more users.

> Although I added nepali.stop from the other patch, I've not done
> anything about updating our other stopword lists.  Presumably those
> are a bit obsolete by now as well.  I wonder if we can prevail on
> the Snowball people to make those available in some less painful way
> than scraping them off assorted web pages.  Ideally they'd stick them
> into their git repo ...

They have a snowball-website repository [1], which is the source of the
snowballstem.org web site. It also stores stopword lists for various
languages (for example, English [2]). I checked a couple of languages.
Their Russian and Danish stopword lists look similar to PostgreSQL's
stopword lists, but their English stopword list is different.

Stopword lists are missing there for the following languages:
- arabic
- irish
- lithuanian
- nepali - I can create a pull request to add it to snowball-website
- tamil

There is also another project called Stopwords ISO [3], but I'm not
sure about it. It collects stopword lists from various sources, and it
has stopwords for the languages mentioned above, except for Nepali and
Tamil.

I think I could write a script that generates stopword files from the
snowball-website repository. It could be run periodically. Would that be
suitable? It would also be good to move the missing stopword lists from
Stopwords ISO to snowball-website...
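
A minimal sketch of what such a script could do, offered as an
illustration only. It assumes the stop.txt files in snowball-website
use '|' to start a comment and put stopwords at the start of each line,
and that the output should be one word per line, as in PostgreSQL's
*.stop files. The function names are hypothetical, and fetching the
files from the repository is left out.

```python
from typing import Iterable, List


def parse_snowball_stopwords(text: str) -> List[str]:
    """Extract stopwords from stop.txt content (assumed '|' comment syntax)."""
    words: List[str] = []
    seen = set()
    for line in text.splitlines():
        # Assumption: everything after the first '|' on a line is a comment.
        content = line.split("|", 1)[0]
        for word in content.split():
            # Keep the first occurrence only, preserving the original order.
            if word not in seen:
                seen.add(word)
                words.append(word)
    return words


def render_stop_file(words: Iterable[str]) -> str:
    """Render the words one per line, the layout of a *.stop file."""
    return "".join(f"{w}\n" for w in words)


if __name__ == "__main__":
    # Made-up sample in the assumed layout, not copied from the repository.
    sample = """\
 | A tiny sample stop word list.

i              | subject pronoun
me
my
me             | duplicate entry, dropped
"""
    print(render_stop_file(parse_snowball_stopwords(sample)), end="")
```

Comparing the generated output against the existing *.stop files before
replacing them would show which lists have actually diverged.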

1 - https://github.com/snowballstem/snowball-website/tree/master/algorithms
2 - https://github.com/snowballstem/snowball-website/blob/master/algorithms/english/stop.txt
3 - https://github.com/stopwords-iso

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

