Re: How to drop all tokens that a snowball dictionary cannot stem? - Mailing list pgsql-general

From	Jeff Janes
Subject	Re: How to drop all tokens that a snowball dictionary cannot stem?
Date	November 23, 2019 15:27:29
Msg-id	CAMkU=1zS-+M4yeN_msxdd9u=PzS+Ne=SkKPxNrnVmvaw-Knr_w@mail.gmail.com
In response to	How to drop all tokens that a snowball dictionary cannot stem? (Christoph Gößmann <mail@goessmann.io>)
Responses	Re: How to drop all tokens that a snowball dictionary cannot stem?
List	pgsql-general

On Fri, Nov 22, 2019 at 8:02 AM Christoph Gößmann <mail@goessmann.io> wrote:

Hi everybody,

I am trying to get all the lexemes for a text using to_tsvector(). But I want only words that english_stem -- the integrated snowball dictionary -- is able to handle to show up in the final tsvector. Since snowball dictionaries only remove stop words, but keep the words that they cannot stem, I don't see an easy option to do this. Do you have any ideas?

I went ahead with creating a new configuration:

-- add new configuration english_led
CREATE TEXT SEARCH CONFIGURATION public.english_led (COPY = pg_catalog.english);

-- dropping any words that contain numbers already in the parser
ALTER TEXT SEARCH CONFIGURATION english_led
DROP MAPPING FOR numword;

EXAMPLE:

SELECT * from to_tsvector('english_led','A test sentence with ui44 \tt somejnk words');
to_tsvector
--------------------------------------------------
'sentenc':3 'somejnk':6 'test':2 'tt':5 'word':7

In this tsvector, I would like 'somejnk' and 'tt' not to be included.

I don't think the question is well defined. It will happily stem 'somejnking' to 'somejnk'; doesn't that mean that it **can** handle it? The fact that 'somejnk' itself wasn't altered during stemming doesn't mean it wasn't handled, just as 'test' was handled even though it wasn't altered.
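
A quick way to check this is to call the dictionary directly with ts_lexize(). The query below is only a sketch: it assumes the built-in english_stem dictionary, and the word list is just the examples from this thread plus one stop word.

```sql
-- Ask the Snowball dictionary what it does with each word.
-- Assumes the stock english_stem dictionary that ships with PostgreSQL.
SELECT word,
       ts_lexize('english_stem', word) AS lexemes
FROM (VALUES ('somejnking'),  -- inflected form of a made-up word
             ('somejnk'),     -- the made-up word itself
             ('test'),        -- a real word the stemmer leaves unchanged
             ('with')         -- an English stop word
     ) AS t(word);
```

ts_lexize() returns NULL when a dictionary does not recognize a token and an empty array when the token is a stop word. A Snowball dictionary recognizes everything, so it should never return NULL here, which means its output gives no signal that separates 'somejnk' from 'test'.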

Cheers,

Jeff
