Thread: pg full text search very slow for Chinese characters
Hi Team,
Can anyone shed some light on why postgres 11 is extremely slow in my case?
I am making a mirror of zh.wikisource.org and I have downloaded 303049 pages and stored them in a postgres 11 database.
My postgres instance is based on docker image postgres:11 and runs on my MacBook Pro i7 16GB.
The database schema is as follows:
Table pages(id, url, html, downloaded, inserted_at, updated_at) and table books(id, name, info, preface, text, html, url, parent_id, inserted_at, updated_at, info_html, preface_html).
A wikisource web page is downloaded and its html text is inserted into table “pages”, column “html”.
Later, books.{name, info, preface, text, html, info_html, preface_html} are extracted from pages.html. The text column of books is a plain-text version of the content of the html column of table pages.
On average the text column of table books contains 7635 characters (each character is 3 bytes long because of the UTF-8 encoding), and I want to add full text search to books(text).
I tried pg_trgm and my own customized token parser https://github.com/huangjimmy/pg_cjk_parser
To my surprise, postgres 11 is extremely slow when creating a full text index.
I added a column of tsvector type and tried to create a GIN index on that column. Postgres could not finish creating the index after a long time, and I had to cancel the execution.
I then tried to create a partial full text index covering 500 rows, and it took Postgres 2 to 3 minutes to create it. Based on this estimate, Postgres would need at least one day to create the full GIN full text search index for all 303049 rows, which I think is ridiculously slow.
When I created an FTS index for books(name) or books(info), it took just 3 minutes. However, name and info are extremely short compared to books(text).
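Roughly, this is what I ran (simplified; the column name text_tsv and the text search configuration 'zh_cfg' are placeholders standing in for my actual pg_cjk_parser-based setup):

ALTER TABLE books ADD COLUMN text_tsv tsvector;
UPDATE books SET text_tsv = to_tsvector('zh_cfg', text);  -- 'zh_cfg' is a placeholder for my CJK text search configuration

CREATE INDEX books_text_tsv_idx ON books USING gin (text_tsv);  -- the full index I had to cancel

CREATE INDEX books_text_tsv_partial_idx ON books USING gin (text_tsv) WHERE id <= 500;  -- the ~500-row partial index that took 2-3 minutes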
I switched to Elasticsearch, which turned out to be extremely efficient for my case: it indexed all 303049 rows in 3 hours.
Jimmy Huang
jimmy_huang@live.com
>My postgres instance is based on docker image postgres:11 and runs on my MacBook Pro i7 16GB.
How much ram and such did you give to this vm?
>To my surprise, postgres 11 is extremely slow when creating a full text index. I added a column of tsvector type and tried to create an index on that column. Pg could not finish creating a GIN index for a long time and I had to cancel the execution. I then tried to create a partial full text index for 500 rows and it took postgres 2 to 3 minutes to create the index.
Did you customize any config? maintenance_work_mem specifically would be relevant to the time it takes to create an index, and the default value is only 64MB. Especially if you are running on a spinning hard drive and not an SSD, this could be problematic.
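For example, something like this checks and raises it (a rough sketch; pick a value that fits the memory you gave the container):

SHOW maintenance_work_mem;                      -- probably '64MB' if nothing was changed
SET maintenance_work_mem = '1GB';               -- per-session, just for one index build
ALTER SYSTEM SET maintenance_work_mem = '1GB';  -- or instance-wide
SELECT pg_reload_conf();                        -- reload so the instance-wide setting takes effect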
It is all default values.
I just checked maintenance_work_mem and indeed it is 64MB.
At first I gave 2GB of RAM to Docker, and later I increased it to 4GB.
Increasing the RAM from 2GB to 4GB did not make much difference.
I will try increasing maintenance_work_mem and see if it helps.
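Something along these lines is what I plan to try (assuming 1GB is a safe value inside a 4GB container; the column and index names are from my earlier test):

SET maintenance_work_mem = '1GB';
CREATE INDEX books_text_tsv_idx ON books USING gin (text_tsv);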
From: Michael Lewis <mlewis@entrata.com>
Date: Wednesday, September 11, 2019 at 12:11 AM
To: 黄 少君 <jimmy_huang@live.com>
Cc: "pgsql-general@lists.postgresql.org" <pgsql-general@lists.postgresql.org>
Subject: Re: pg full text search very slow for Chinese characters
>My postgres instance is based on docker image postgres:11 and runs on my MacBook Pro i7 16GB.
How much ram and such did you give to this vm?
>To my surprise, postgres 11 is extremely slow when creating a full text index. I added a column of tsvector type and tried to create an index on that column. Pg could not finish creating a GIN index for a long time and I had to cancel the execution. I then tried to create a partial full text index for 500 rows and it took postgres 2 to 3 minutes to create the index.
Did you customize any config? maintenance_work_mem specifically would be relevant to the time it takes to create an index, and the default value is only 64MB. Especially if you are running on a spinning hard drive and not an SSD, this could be problematic.
Jimmy Huang <jimmy_huang@live.com> writes:
> I tried pg_trgm and my own customized token parser https://github.com/huangjimmy/pg_cjk_parser
pg_trgm is going to be fairly useless for indexing text that's mostly
multibyte characters, since its unit of indexable data is just 3 bytes
(not characters). I don't know of any comparable issue in the core
tsvector logic, though. The numbers you're quoting do sound quite awful,
but I share Cory's suspicion that it's something about your setup rather
than an inherent Postgres issue.
regards, tom lane
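(As an aside, pg_trgm's show_trgm() helper shows which trigrams it would extract for a given string; a quick sketch, and the exact output depends on the database encoding and pg_trgm version:)

CREATE EXTENSION IF NOT EXISTS pg_trgm;
SELECT show_trgm('中文全文搜索');      -- trigrams extracted from a Chinese string
SELECT show_trgm('full text search');  -- compare with plain ASCII text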
Andreas Joseph Krogh <andreas@visena.com> wrote:
> The numbers you're quoting do sound quite awful, but I share Cory's suspicion that it's something about your setup rather than an inherent Postgres issue.
We experienced quite awful performance when we hosted the DB on virtual servers (~5 years ago) and it turned out we hit the write-cache limit (then 8GB), which resulted in ~1MB/s IO throughput. Running iozone might help tracing down IO problems.
--
Andreas Joseph Krogh
Hi.
At Tue, 10 Sep 2019 18:42:26 +0200 (CEST), Andreas Joseph Krogh <andreas@visena.com> wrote in <VisenaEmail.3.8750116fce15432e.16d1c0b2b28@tc7-visena>:
> We experienced quite awful performance when we hosted the DB on virtual servers (~5 years ago) and it turned out we hit the write-cache limit (then 8GB), which resulted in ~1MB/s IO throughput. Running iozone might help tracing down IO problems.
Multibyte characters also quickly bloat the index with many small buckets, one for every 3-character combination out of thousands of characters, which makes it useless. pg_bigm, which is based on bigrams (2-grams), works better on multibyte characters: https://pgbigm.osdn.jp/index_en.html
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
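For anyone who wants to try this route, a minimal sketch of a pg_bigm setup might look like the following (assuming the extension is available in the container; the index name and search term are only illustrative):

CREATE EXTENSION pg_bigm;
CREATE INDEX books_text_bigm_idx ON books USING gin (text gin_bigm_ops);  -- 2-gram GIN index on the text column
SELECT id, name FROM books WHERE text LIKE likequery('紅樓夢');           -- searches go through LIKE, which this index can service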