Re: pg_kazsearch: Full-text search extension for Kazakh language - Mailing list pgsql-general
| From | Darkhan |
|---|---|
| Subject | Re: pg_kazsearch: Full-text search extension for Kazakh language |
| Date | |
| Msg-id | CAOW9cErZJAZQT+5icb8KpDdPPwxLv0q5TEKBV8pzv4pgPmwQQA@mail.gmail.com |
| In response to | Re: pg_kazsearch: Full-text search extension for Kazakh language (Adrien Nayrat <adrien.nayrat@anayrat.info>) |
| Responses | Re: pg_kazsearch: Full-text search extension for Kazakh language |
| List | pgsql-general |
Thanks for the suggestion!
I did look into Snowball early on. There is already a Turkish stemmer in Snowball, and Turkish is structurally very similar to Kazakh (both are agglutinative Turkic languages). But honestly the Turkish one is pretty limited: it only handles nominal suffixes and doesn’t account for verb morphology at all. The author even mentions this in the comments. So it works for basic noun cases but falls apart on real text.
The reason I went with a standalone extension is that Kazakh has suffix chains where vowel harmony interacts with each layer, so you need context-aware decisions, not just stripping patterns from the end of the word. My stemmer uses a penalty-scored breadth-first search (BFS) over possible suffix decompositions instead of the linear step-by-step stripping that Snowball does. With 5-6 suffixes stacked on one word, you really need to evaluate multiple decomposition paths to find the best one.
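To make the idea concrete, here is a toy Python sketch of a penalty-scored BFS over suffix decompositions. It is not the pg_kazsearch implementation (the real core is in Rust). The suffix table, layer numbers, penalty values, cost formula, and the `stem` and `harmony` names are all made up for illustration, and the vowel classes are heavily simplified.

```python
from collections import deque

# Simplified vowel classes (assumption: ambiguous vowels are ignored).
BACK, FRONT = set("аоұы"), set("әеөүі")

# Hypothetical suffix table: (suffix, layer, penalty).
# Layers follow noun order: stem + plural (1) + possessive (2) + case (3).
SUFFIXES = [
    ("лар", 1, 0.5), ("лер", 1, 0.5), ("дар", 1, 0.5),
    ("дер", 1, 0.5), ("тар", 1, 0.5), ("тер", 1, 0.5),
    ("ымыз", 2, 0.5), ("іміз", 2, 0.5), ("ым", 2, 1.0), ("ім", 2, 1.0),
    ("да", 3, 1.0), ("де", 3, 1.0), ("та", 3, 1.0),
    ("те", 3, 1.0), ("ға", 3, 1.0), ("ге", 3, 1.0),
]


def harmony(text):
    """Return 'back' or 'front' based on the last vowel, or None if no vowel."""
    for ch in reversed(text):
        if ch in BACK:
            return "back"
        if ch in FRONT:
            return "front"
    return None


def stem(word, min_stem=3):
    """Explore every valid suffix decomposition and keep the cheapest stem."""
    # State: (remaining stem, layer of the last stripped suffix, penalty so far).
    # Layer 4 is a sentinel meaning "nothing stripped yet".
    queue = deque([(word, 4, 0.0)])
    best_stem, best_cost = word, float(len(word))
    while queue:
        rest, last_layer, penalty = queue.popleft()
        # Toy cost: shorter stems are better, each stripped suffix adds a penalty.
        cost = len(rest) + penalty
        if cost < best_cost:
            best_stem, best_cost = rest, cost
        for suffix, layer, pen in SUFFIXES:
            # Stripping from the right, layers must strictly decrease.
            if layer >= last_layer or not rest.endswith(suffix):
                continue
            candidate = rest[: -len(suffix)]
            if len(candidate) < min_stem:
                continue
            # Context check: the suffix must agree with the stem's vowel harmony.
            if harmony(candidate) != harmony(suffix):
                continue
            queue.append((candidate, layer, penalty + pen))
    return best_stem
```

With this toy table, `stem("мектептерімізде")` peels off `де`, `іміз` and `тер` to reach `мектеп`, while `мұғалім` is left alone because stripping `ім` would clash with the back vowel in `мұғал`. A real stemmer needs a far larger suffix inventory, verb morphology, and tuned penalties.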
That said, contributing a simplified Kazakh stemmer to Snowball is something I’d like to explore longer term. Even a basic version would be better than nothing, which is what exists today. I would need to figure out how much of the BFS logic can fit into the Snowball language, or whether a simpler approach gets close enough.
Appreciate the pointer!
Darkhan
On Wed, 8 Apr 2026 at 19:42 Adrien Nayrat <adrien.nayrat@anayrat.info> wrote:
On 4/5/26 3:32 PM, Darkhan wrote:
> Hi all,
>
> I built pg_kazsearch, a PostgreSQL extension that adds full-text search
> support for Kazakh. Currently there's no Kazakh dictionary, stemmer, or
> stop word list available in PostgreSQL, so anyone searching Kazakh text is
> stuck with trigram matching or application-level workarounds.
>
> Kazakh is agglutinative — a single word can carry 5-6 suffixes, which makes
> standard search approaches miss most relevant results. pg_kazsearch
> provides a custom Kazakh stemmer (core written in Rust), a stop word list,
> and a text search dictionary that plugs into the standard PostgreSQL FTS
> infrastructure — GIN indexes, ts_rank, phrase search all work out of the
> box.
>
> I tested it on a dataset of 3,000 real Kazakh news articles. On the same
> query, pg_kazsearch returns 61 relevant articles vs 1 with trigram search,
> with a 23% improvement in recall overall.
>
> You can install it with a single command via deb package or Docker image,
> no compilation needed.
>
> Repo: https://github.com/darkhanakh/pg-kazsearch
>
> I'd appreciate any feedback, especially from anyone working on text search
> internals or with experience supporting non-Latin or agglutinative
> languages in PostgreSQL.
>
> Thanks, Darkhan
>
Hello,
Thanks for your work.
I don't know anything about Kazakh.
But have you tried adding it to the Snowball stemmer [1]?
Since Postgres uses it, that would improve the chances of Kazakh
being supported in future versions.
1: https://github.com/snowballstem/snowball
--
Adrien NAYRAT
https://pro.anayrat.info