pg_kazsearch: Full-text search extension for Kazakh language - Mailing list pgsql-general

From Darkhan
Subject pg_kazsearch: Full-text search extension for Kazakh language
Msg-id CAOW9cEpjUV0fG6u6m86vt8RJOBLymys=k33DWzgEP+0SnXhZGA@mail.gmail.com
List pgsql-general

Hi all,

I built pg_kazsearch, a PostgreSQL extension that adds full-text search support for Kazakh. Currently there's no Kazakh dictionary, stemmer, or stop word list available in PostgreSQL, so anyone searching Kazakh text is stuck with trigram matching or application-level workarounds.

Kazakh is agglutinative: a single word can carry 5-6 suffixes, so approaches that do not reduce words to their stems miss most relevant results. pg_kazsearch provides a custom Kazakh stemmer (core written in Rust), a stop word list, and a text search dictionary that plugs into the standard PostgreSQL FTS infrastructure, so GIN indexes, ts_rank, and phrase search all work out of the box.
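
To make the suffix problem concrete, here is a minimal Python sketch of suffix stripping. It is not the pg_kazsearch algorithm (the real core is in Rust and is not reproduced here); the function name toy_stem, the minimum stem length, and the short suffix list are assumptions made up for illustration and cover only a handful of endings.

```python
# Toy illustration only: NOT the pg_kazsearch stemmer.
# The suffix list is a tiny, hypothetical sample of Kazakh endings.
SUFFIXES = sorted(
    [
        "лар", "лер", "дар", "дер", "тар", "тер",  # plural
        "ымыз", "іміз",                            # 1st person plural possessive
        "да", "де", "та", "те",                    # locative
    ],
    key=len,
    reverse=True,  # try longer suffixes first
)

MIN_STEM_LEN = 3  # assumed floor so short words are not stripped to nothing


def toy_stem(word: str) -> str:
    """Repeatedly strip the longest known suffix from the end of the word."""
    stem = word.lower()
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM_LEN:
                stem = stem[: -len(suffix)]
                stripped = True
                break
    return stem


if __name__ == "__main__":
    # "мектептерімізде" = мектеп + тер + іміз + де ("in our schools")
    for form in ["мектеп", "мектептер", "мектептерімізде"]:
        print(form, "->", toy_stem(form))
```

All three forms reduce to the same stem, which is what lets a query for one form match documents that contain another. A naive list like this over-strips and under-strips real words, which is why a dedicated stemmer is needed.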

I tested it on a dataset of 3,000 real Kazakh news articles. For one example query, pg_kazsearch returned 61 relevant articles where trigram search returned 1, and recall improved by 23% overall.
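
For readers unfamiliar with the metric, recall is the share of all relevant documents that a search actually returns. A minimal sketch of the calculation follows; the article IDs are made up and are not taken from the 3,000-article dataset, and the function name recall is only illustrative.

```python
def recall(retrieved: set[int], relevant: set[int]) -> float:
    """Fraction of the relevant documents that the search returned."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(retrieved & relevant) / len(relevant)


# Hypothetical article IDs, for illustration only.
relevant = {1, 2, 3, 4}
trigram_hits = {1, 9}        # one relevant hit, one irrelevant
stemmed_hits = {1, 2, 3, 7}  # three relevant hits, one irrelevant

if __name__ == "__main__":
    print("trigram recall:", recall(trigram_hits, relevant))
    print("stemmed recall:", recall(stemmed_hits, relevant))
```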

You can install it with a single command from a deb package or a Docker image; no compilation is needed.

Repo: https://github.com/darkhanakh/pg-kazsearch

I'd appreciate any feedback, especially from anyone working on text search internals or with experience supporting non-Latin or agglutinative languages in PostgreSQL.

Thanks, Darkhan
