How to drop all tokens that a snowball dictionary cannot stem? - Mailing list pgsql-general

From	Christoph Gößmann
Subject	How to drop all tokens that a snowball dictionary cannot stem?
Date	November 22, 2019 13:01:18
Msg-id	50A531BE-8A5D-40BA-B6AF-4B9B32FB7FF3@goessmann.io
Responses	Re: How to drop all tokens that a snowball dictionary cannot stem?
List	pgsql-general

Hi everybody,

I am trying to get all the lexemes for a text using to_tsvector(), but I want only the words that english_stem (the
integrated snowball dictionary) is able to handle to show up in the final tsvector. Since snowball dictionaries only
remove stop words but keep the words that they cannot stem, I don't see an easy way to do this. Do you have any
ideas?
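
To illustrate the behavior described above, here is a minimal sketch that calls ts_lexize() directly against english_stem. The three input words are taken from the example further down; the result comments reflect what that example's tsvector shows and are meant as an illustration, not a full description of the stemmer.

```sql
-- A word the stemmer can reduce: the suffix is stripped.
SELECT ts_lexize('english_stem', 'words');    -- {word}

-- A stop word: an empty array is returned, so the token is dropped.
SELECT ts_lexize('english_stem', 'with');     -- {}

-- A word the stemmer has no rule for: it is returned unchanged
-- rather than rejected, so it still ends up in the tsvector.
SELECT ts_lexize('english_stem', 'somejnk');  -- {somejnk}
```

Note that a word returned unchanged is not necessarily junk: 'test' also comes back as 'test', so comparing input and output alone does not separate real words from unknown ones.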

I went ahead with creating a new configuration:

-- add new configuration english_led
CREATE TEXT SEARCH CONFIGURATION public.english_led (COPY = pg_catalog.english);

-- drop the mapping for numword so that tokens mixing letters and digits are discarded
ALTER TEXT SEARCH CONFIGURATION english_led
    DROP MAPPING FOR numword;

EXAMPLE:

SELECT * from to_tsvector('english_led','A test sentence with ui44 \tt somejnk words');
                   to_tsvector
--------------------------------------------------
 'sentenc':3 'somejnk':6 'test':2 'tt':5 'word':7

In this tsvector, I would like 'somejnk' and 'tt' not to be included.
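
A minimal sketch for checking where each lexeme comes from, assuming the english_led configuration created above exists. ts_debug() lists every token together with its parser token type, the dictionaries mapped to that type, and the lexemes produced; the column selection here is only one possible choice.

```sql
-- Show, per token, the token type (alias), the dictionaries consulted,
-- the dictionary that recognized the token, and the resulting lexemes.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('english_led', 'A test sentence with ui44 \tt somejnk words');
```

Tokens whose type has no mapping (such as numword after the DROP MAPPING above) should show an empty dictionary list and no lexemes, while 'somejnk' and 'tt' are expected to show english_stem as the recognizing dictionary.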

Many thanks,
Christoph
