Thread: How to drop all tokens that a snowball dictionary cannot stem?
Hi everybody,

I am trying to get all the lexemes for a text using to_tsvector(). But I want only words that english_stem -- the integrated snowball dictionary -- is able to handle to show up in the final tsvector. Since snowball dictionaries only remove stop words, but keep the words that they cannot stem, I don't see an easy option to do this. Do you have any ideas?

I went ahead with creating a new configuration:

-- add new configuration english_led
CREATE TEXT SEARCH CONFIGURATION public.english_led (COPY = pg_catalog.english);

-- drop any words that contain numbers already in the parser
ALTER TEXT SEARCH CONFIGURATION english_led
    DROP MAPPING FOR numword;

EXAMPLE:

SELECT * FROM to_tsvector('english_led', 'A test sentence with ui44 \tt somejnk words');

                   to_tsvector
--------------------------------------------------
 'sentenc':3 'somejnk':6 'test':2 'tt':5 'word':7

In this tsvector, I would like 'somejnk' and 'tt' not to be included.

Many thanks,
Christoph
On Fri, Nov 22, 2019 at 8:02 AM Christoph Gößmann <mail@goessmann.io> wrote:
I don't think the question is well defined. It will happily stem 'somejnking' to 'somejnk' -- doesn't that mean that it **can** handle it? The fact that 'somejnk' itself wasn't altered during stemming doesn't mean it wasn't handled, just like 'test' wasn't altered during stemming.
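Jeff's point can be checked directly with ts_lexize(), PostgreSQL's helper for running a single token through one dictionary. This is a quick sketch; the outputs in the comments are what Jeff's description implies for the stock english_stem dictionary, and may vary slightly across PostgreSQL versions:

```sql
-- Both the made-up word and its "-ing" form are accepted by the stemmer:
SELECT ts_lexize('english_stem', 'somejnking');  -- {somejnk}
SELECT ts_lexize('english_stem', 'somejnk');     -- {somejnk}

-- A real word behaves the same way; snowball never rejects a non-stopword,
-- it just maps it to a (possibly identical) stem.
SELECT ts_lexize('english_stem', 'test');        -- {test}
```

Since snowball returns a lexeme for every non-stopword, "the stemmer couldn't handle it" is not an observable condition you can filter on.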
Cheers,
Jeff
Hi Jeff,
You're right about that point. Let me redefine: I would like to drop all tokens that are neither the stemmed nor the unstemmed form of a known word. Would it be possible to put a wordlist as a filter ahead of the stemming? Or do you know of a good English lexeme list that could be used to filter after stemming?
Thanks,
Christoph
On 23. Nov 2019, at 16:27, Jeff Janes <jeff.janes@gmail.com> wrote:
On Sat, Nov 23, 2019 at 10:42 AM Christoph Gößmann <mail@goessmann.io> wrote:
I think what you describe is the opposite of what snowball was designed to do. You want an ispell-based dictionary instead.
PostgreSQL doesn't ship with real ispell dictionaries, so you have to obtain the dictionary files yourself and install them into $SHAREDIR/tsearch_data, as described in the docs: https://www.postgresql.org/docs/12/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY
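A minimal sketch of that setup, assuming en_us.dict and en_us.affix files (e.g. converted from a Hunspell dictionary, as the docs describe) have already been copied into $SHAREDIR/tsearch_data; the dictionary name english_ispell is illustrative, and english_led is the configuration from upthread:

```sql
-- Assumes en_us.dict and en_us.affix exist in $SHAREDIR/tsearch_data.
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE  = ispell,
    DictFile  = en_us,
    AffFile   = en_us,
    StopWords = english
);

-- Map word tokens to the ispell dictionary only. A token that no
-- dictionary in the list recognizes is simply discarded, so unknown
-- words like 'somejnk' drop out of the tsvector. To instead stem
-- unknown words rather than discard them, list english_stem after
-- english_ispell: WITH english_ispell, english_stem.
ALTER TEXT SEARCH CONFIGURATION english_led
    ALTER MAPPING FOR asciiword, word
    WITH english_ispell;
```

The key design point is that, unlike snowball, an ispell dictionary actually rejects words it doesn't know, which is the observable "known word" test being asked for.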
Cheers,
Jeff