Thread: PATCH: Update snowball stemmers
Hello hackers, I'd like to propose the patch which syncs PostgreSQL snowball stemmers. As Tom pointed [1] stemmers haven't synced for a very long time. I copied all source files without changes, except replacing '#include "../runtime/header.h"' with '#include "header.h"' and removing includes of standard headers from utilities.c. Hungarian language uses ISO-8859-1 and UTF-8 charsets in Postgres HEAD. But in Snowball HEAD it is ISO-8859-2 per commit [2]. This patch changes hungarian's charset from ISO-8859-1 to ISO-8859-2 too. Additionally updated files in the patch are: - utilities.c - header.h Will add to the next commitfest. Any comments? 1 - https://www.postgresql.org/message-id/5689.1519054983%40sss.pgh.pa.us 2 - https://github.com/snowballstem/snowball/commit/4bcae97db044253ea2edae1dd3ca59f3cddd4b9d -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
On 06/26/2018 08:20 AM, Arthur Zakirov wrote: > Hello hackers, > > I'd like to propose the patch which syncs PostgreSQL snowball stemmers. > As Tom pointed [1] stemmers haven't synced for a very long time. > > I copied all source files without changes, except replacing '#include > "../runtime/header.h"' with '#include "header.h"' and removing includes > of standard headers from utilities.c. > > Hungarian language uses ISO-8859-1 and UTF-8 charsets in Postgres HEAD. > But in Snowball HEAD it is ISO-8859-2 per commit [2]. This patch changes > hungarian's charset from ISO-8859-1 to ISO-8859-2 too. > > Additionally updated files in the patch are: > - utilities.c > - header.h > > Will add to the next commitfest. > > Any comments? > > 1 - https://www.postgresql.org/message-id/5689.1519054983%40sss.pgh.pa.us > 2 - https://github.com/snowballstem/snowball/commit/4bcae97db044253ea2edae1dd3ca59f3cddd4b9d > I agree with Tom that we should sync with the upstream before we do anything else. This is a very large patch but with fairly limited impact. I think now at the start of a dev cycle is the right time to apply it. I don't know if we have a buildfarm animal testing Hungarian. Maybe we need a buildfarm animal or two testing a large number of locales. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > On 06/26/2018 08:20 AM, Arthur Zakirov wrote: >> I'd like to propose the patch which syncs PostgreSQL snowball stemmers. >> As Tom pointed [1] stemmers haven't synced for a very long time. > I agree with Tom that we should sync with the upstream before we do > anything else. This is a very large patch but with fairly limited > impact. I think now at the start of a dev cycle is the right time to > apply it. Agreed in principle. The thing that's actually a bit shaky here is to decide what "upstream" to follow, given that the original maintainer has retired and there doesn't seem to be active work going on. > I don't know if we have a buildfarm animal testing Hungarian. Maybe we > need a buildfarm animal or two testing a large number of locales. I dunno that we need to set up a permanent buildfarm member for this. I had been thinking in terms of testing every available locale on my own machines before pushing, but given that the code is pretty static, do we need to do that repetitively? regards, tom lane
On Fri, Jul 06, 2018 at 12:37:26PM -0400, Tom Lane wrote: > > I don't know if we have a buildfarm animal testing Hungarian. Maybe we > > need a buildfarm animal or two testing a large number of locales. > > I dunno that we need to set up a permanent buildfarm member for this. > I had been thinking in terms of testing every available locale on my > own machines before pushing, but given that the code is pretty static, > do we need to do that repetitively? I run installcheck for hu_HU.UTF-8, hu_HU.ISO-8859-2 and ru_RU.UTF-8 locales on laptop. Tests passed. I think it is worth to consider that most text search functions (to_tsvector, to_tsquery) use specific text search configuration (english or a custom one). Only few of them use default text search configuration. They are in json.sql, jsonb.sql, tsearch.sql tests. Is it good idea to modify some current tests to use default configuration or to add specific tests for locale testing? -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
I see that the cfbot is having difficulty building this: make[2]: *** No rule to make target `stem_ISO_8859_2_hungarian.o', needed by `dict_snowball.so'. Stop. Did you miss including Makefile changes in the submitted patch? regards, tom lane
On Wed, Sep 12, 2018 at 04:03:57PM -0400, Tom Lane wrote: > I see that the cfbot is having difficulty building this: > > make[2]: *** No rule to make target `stem_ISO_8859_2_hungarian.o', needed by `dict_snowball.so'. Stop. > > Did you miss including Makefile changes in the submitted patch? Snowball's Makefile has changes related to hungarian stemmer: - stem_ISO_8859_1_hungarian.o \ ... + stem_ISO_8859_2_hungarian.o \ I've noticed the error some time ago. And I tried to understand why cfbot can't build it. Somehow cfbot didn't rename stem_ISO_8859_1_hungarian. You can see it from the commit [1]. And there is no new file stem_ISO_8859_2_hungarian.c [2]. Another problem with the header file [3]. cfbot created new file stem_ISO_8859_2_hungarian.h, but it didn't delete old stem_ISO_8859_1_hungarian.h. In my laptop there is no such problem. I tried both "git apply" and "patch -p1". And I can build the patch. Maybe cfbot's "patch" doesn't understand the patch file. Maybe I have too recent git (2.19.0)... 1 - https://github.com/postgresql-cfbot/postgresql/commit/efc280b89b181657afe5412f398681b2c393a35c#diff-efde70a147d16a83b9b132b7f396ab6d 2 - https://github.com/postgresql-cfbot/postgresql/tree/commitfest/19/1697/src/backend/snowball/libstemmer 3 - https://github.com/postgresql-cfbot/postgresql/tree/commitfest/19/1697/src/include/snowball/libstemmer -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Arthur Zakirov <a.zakirov@postgrespro.ru> writes: > On Wed, Sep 12, 2018 at 04:03:57PM -0400, Tom Lane wrote: >> Did you miss including Makefile changes in the submitted patch? > I've noticed the error some time ago. And I tried to understand why > cfbot can't build it. Somehow cfbot didn't rename > stem_ISO_8859_1_hungarian. You can see it from the commit [1]. And there > is no new file stem_ISO_8859_2_hungarian.c [2]. Oh, patch(1) doesn't understand git's idea of "renaming" files, cf https://www.postgresql.org/message-id/CAEepm=2+CF3PshNRAs-r8jtPLKj0K6UEACeRSqBi5Cf74L=w7Q@mail.gmail.com I'd suggest using "git diff --no-renames", since some of us will want to apply the patch using patch(1). > In my laptop there is no such problem. I tried both "git apply" and > "patch -p1". And I can build the patch. Really? What version of patch was that? regards, tom lane
On Wed, Sep 12, 2018 at 05:47:05PM -0400, Tom Lane wrote: > Oh, patch(1) doesn't understand git's idea of "renaming" files, cf > > https://www.postgresql.org/message-id/CAEepm=2+CF3PshNRAs-r8jtPLKj0K6UEACeRSqBi5Cf74L=w7Q@mail.gmail.com > > I'd suggest using "git diff --no-renames", since some of us will want > to apply the patch using patch(1). Ah, I see. I attached new version made with --no-renames. Will wait for what cfbot will say. > > In my laptop there is no such problem. I tried both "git apply" and > > "patch -p1". And I can build the patch. > > Really? What version of patch was that? It is 2.7.6. It seems it was released in 2018-02-03. Not sure, but maybe they improved support for git's file renames. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Attachment
Arthur Zakirov <a.zakirov@postgrespro.ru> writes: > Ah, I see. I attached new version made with --no-renames. Will wait for > what cfbot will say. I reviewed and pushed this. As a cross-check on the patch, I cloned the Snowball github repo and built the derived files in it. I noticed that they'd incorporated several new stemmers since 2007 --- not only your Nepali one, but half a dozen more besides. Since the point here is (IMO) mostly to follow their lead on what's interesting, I went ahead and added those as well. In short, therefore, the commit includes the Nepali stuff from your other thread as well as what was in this one. Although I added nepali.stop from the other patch, I've not done anything about updating our other stopword lists. Presumably those are a bit obsolete by now as well. I wonder if we can prevail on the Snowball people to make those available in some less painful way than scraping them off assorted web pages. Ideally they'd stick them into their git repo ... regards, tom lane
On Mon, Sep 24, 2018 at 05:36:39PM -0400, Tom Lane wrote: > I reviewed and pushed this. Great! Thank you. > As a cross-check on the patch, I cloned the Snowball github repo > and built the derived files in it. I noticed that they'd incorporated > several new stemmers since 2007 --- not only your Nepali one, but > half a dozen more besides. Since the point here is (IMO) mostly to > follow their lead on what's interesting, I went ahead and added those > as well. Agree. It is good decision. It may attract more users. > Although I added nepali.stop from the other patch, I've not done > anything about updating our other stopword lists. Presumably those > are a bit obsolete by now as well. I wonder if we can prevail on > the Snowball people to make those available in some less painful way > than scraping them off assorted web pages. Ideally they'd stick them > into their git repo ... They have repository snowball-website [1]. It is snowballstem.org web-site source repository. It also stores stopwords for various languages (for example for english [2]). I checked couple languages. It seems their russian and danish stopword lists look like PostgreSQL's stopword lists. But their english stopword list is different. There is lack of stopword lists for the following languages: - arabic - irish - lithuanian - nepali - I can create a pull request to add it to snowball-website - tamil There is also another project, called Stopwords ISO [3]. But I'm not sure about them. It stores stopword lists from various sources. And also there are stopwords for languages mentioned above, except for nepali and tamil. I think I could make a script, which generates stopwords from snowball-website repository. It can be run periodically. Is it suitable? Also it would be good to move missing stopwords from Stopwords ISO to snowball-website... 1 - https://github.com/snowballstem/snowball-website/tree/master/algorithms 2 - https://github.com/snowballstem/snowball-website/blob/master/algorithms/english/stop.txt 3 - https://github.com/stopwords-iso -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company