Re: Out of the box, full text search feature suggestion for postgresql - Mailing list pgsql-bugs

From Artur Zakirov
Subject Re: Out of the box, full text search feature suggestion for postgresql
Date
Msg-id CAKNkYnzheAEsB9MM6b9jEBn+W7j1T5Qh6OyogH3f8ZX8M+9gkw@mail.gmail.com
In response to Re: Out of the box, full text search feature suggestion for postgresql  (Bruce Momjian <bruce@momjian.us>)
Responses Re: Out of the box, full text search feature suggestion for postgresql  (aa <ghevge@gmail.com>)
List pgsql-bugs
On Thu, 28 Dec 2023 at 17:46, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Thu, Dec 28, 2023 at 10:15:07AM -0500, aa wrote:
> > Hello Postgres Team!
> >
> > First of all, a big THANK YOU for the great work you folks are doing!
> >
> > The reason I am writing to you is to suggest a feature in future Postgres
> > versions, a feature that is partially there but is not quite where it should be
> > in my opinion: the full text search functionality. This functionality in my
> > opinion, should be available out of the box, for any possible language
> > available, including east Asia character based languages. You would probably
> > say that this will require a huge amount of work, and I would say, a postgres
> > extension which does exactly this, already exists, and it is called : pgroonga
> > (https://pgroonga.github.io/)
>
> Please explain how this is different from what we already have:
>
>         https://www.postgresql.org/docs/current/textsearch.html

I'm not familiar with pgroonga, but the main issue with the built-in
text search is that it cannot properly tokenize Asian languages and
many others.

Here the default parser fails to tokenize Japanese text and returns
the whole sentence as a single token:

=# select * from ts_parse('default', 'これはペンです');
 tokid |     token
-------+----------------
     2 | これはペンです

Latin text, by contrast, is split into separate words:

=# select * from ts_parse('default', 'this is a pen');
 tokid | token
-------+-------
     1 | this
    12 |
     1 | is
    12 |
     1 | a
    12 |
     1 | pen
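
To read the tokid column above, and to see what the single-token
result means for an actual search, something like the following could
be run. This is only a sketch, not part of the session above: the
choice of the 'simple' configuration and of the search word 'ペン' are
assumptions made for illustration, and the results depend on the
database encoding and locale.

```sql
-- Look up the names of the token types that appear in the ts_parse output
-- (tokid 1, 2 and 12) for the built-in 'default' parser.
select tokid, alias, description
from ts_token_type('default')
where tokid in (1, 2, 12);

-- Assumption: the 'simple' configuration is used so that no stemming or
-- stop-word removal is involved. Because the whole Japanese sentence is kept
-- as one token, a query for a single word inside it is not expected to match.
select to_tsvector('simple', 'これはペンです') @@ to_tsquery('simple', 'ペン') as japanese_match;

-- The same kind of query against the Latin sentence, which is split into words.
select to_tsvector('simple', 'this is a pen') @@ to_tsquery('simple', 'pen') as latin_match;
```

If the parser behaves as in the outputs above, the first match should
come back false and the second true, which is the practical effect of
the tokenization problem.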

To add support for Japanese (and other languages), it is necessary to
write a new parser or fix the existing default parser.

On the other hand, pgroonga's source code looks complex, and I doubt
that there are pgsql-hackers who know both it and the target languages
well enough to port it to Postgres core.

--
Artur


