Thread: pg full text search very slow for Chinese characters

pg full text search very slow for Chinese characters

From: Jimmy Huang

Hi Team,

 

Can anyone shed some light on why postgres 11 is extremely slow in my case?

 

I am making a mirror of zh.wikisource.org and I have downloaded 303049 pages and stored them in a postgres 11 database.

 

My postgres instance is based on docker image postgres:11 and runs on my MacBook Pro i7 16GB.

 

The database schema is as follows:

 

Table pages(id, url, html, downloaded, inserted_at, updated_at) and books(id, name, info, preface, text, html, url, parent_id, inserted_at, updated_at, info_html, preface_html)
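
In SQL terms, the two tables look roughly like this (a sketch; the column types and constraints are approximations, not the exact DDL):

    CREATE TABLE pages (
        id          bigserial PRIMARY KEY,
        url         text,
        html        text,
        downloaded  boolean,
        inserted_at timestamp,
        updated_at  timestamp
    );

    CREATE TABLE books (
        id           bigserial PRIMARY KEY,
        name         text,
        info         text,
        preface      text,
        text         text,          -- plain-text version of pages.html
        html         text,
        url          text,
        parent_id    bigint,
        inserted_at  timestamp,
        updated_at   timestamp,
        info_html    text,
        preface_html text
    );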

 

A wikisource web page is downloaded and its HTML text is inserted into table “pages”, column “html”.

Later, books.{name, info, preface, text, html, info_html, preface_html} are extracted from pages.html. The text column of books is a plain-text version of the content of the html column of table pages.

 

On average there are 7635 characters (each character is 3 bytes long because of UTF-8 encoding) in the text column of table books, and I want to add full text search to books(text).
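
For scale, that is roughly 7635 × 3 ≈ 23 KB of text per row, so the 303049 rows add up to roughly 7 GB of raw text that has to be parsed and tokenized.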

 

I tried pg_trgm and my own customized token parser https://github.com/huangjimmy/pg_cjk_parser

 

To my surprise, postgres 11 is extremely slow when creating a full text index.

 

I added a column of tsvector type and tried to create an index on that column. Postgres could not finish creating the GIN index after a long time, and I had to cancel the execution.

I then tried to create a partial full text index covering only 500 rows, and it took Postgres 2 to 3 minutes to build. Extrapolating (303049 / 500 ≈ 606 batches at 2–3 minutes each, i.e. roughly 20–30 hours), Postgres would need at least a day to create the full GIN full text search index for all 303049 rows. I think this is ridiculously slow.
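
Roughly, the statements I ran look like this (a sketch from memory; 'zh_cfg' stands for the text search configuration built on pg_cjk_parser, and the column/index names are illustrative):

    ALTER TABLE books ADD COLUMN text_tsv tsvector;

    -- populate the new column from books.text ('zh_cfg' is a placeholder config name)
    UPDATE books SET text_tsv = to_tsvector('zh_cfg', text);

    -- this is the statement that would not finish in a reasonable time
    CREATE INDEX books_text_tsv_idx ON books USING gin (text_tsv);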

If I create a full text search index on books(name) or books(info) instead, it takes just 3 minutes. However, name and info are extremely short compared to books(text).

 

I switched to Elasticsearch and it turned out that Elasticsearch is extremely efficient for my case. It took Elasticsearch 3 hours to index all 303049 rows.  

 

Jimmy Huang

jimmy_huang@live.com

Re: pg full text search very slow for Chinese characters

From: Cory Nemelka
Well, there is a lot of information we would need to diagnose this: how much tuning have you done, etc.

My advice is pretty simple. Don't expect performance on a notebook and, unless you are planning on hosting it on a notebook, use the notebook for development only. Test performance on a properly configured and tuned server.

--cnemelka



Re: pg full text search very slow for Chinese characters

From: Michael Lewis
>My postgres instance is based on docker image postgres:11 and runs on my MacBook Pro i7 16GB.

How much RAM and other resources did you give to this VM?
 
 
>To my surprise, postgres 11 is extremely slow when creating a full text index. I added a column of tsvector type and tried to create an index on that column. Pg could not finish creating a GIN index for a long time and I had to cancel the execution. I then tried to create a partial full text index for 500 rows and it took postgres 2 to 3 minutes to create the index.
 

Did you customize any config? maintenance_work_mem specifically is relevant to index creation time, and its default value is only 64MB. Especially if you are running on a spinning hard drive rather than an SSD, this could be problematic.
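
For example, something along these lines (a sketch; the 1GB value and the index definition are placeholders to adjust for your setup):

    SHOW maintenance_work_mem;          -- default is 64MB

    SET maintenance_work_mem = '1GB';   -- per-session; pick what the container can spare
    CREATE INDEX books_text_tsv_idx ON books USING gin (text_tsv);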

Re: pg full text search very slow for Chinese characters

From: Cory Nemelka

I apologize for my top post. :D It won't happen again.

Re: pg full text search very slow for Chinese characters

From: Jimmy Huang

It is all default values.

I just checked maintenance_work_mem and indeed it is 64MB.

 

At first I gave 2GB of RAM to Docker and later I increased it to 4GB.

Going from 2GB to 4GB did not make much difference.

 

I will try increasing maintenance_work_mem and see if it helps.

 

From: Michael Lewis <mlewis@entrata.com>
Date: Wednesday, September 11, 2019, 12:11 AM
To: 少君 <jimmy_huang@live.com>
Cc: "pgsql-general@lists.postgresql.org" <pgsql-general@lists.postgresql.org>
Subject: Re: pg full text search very slow for Chinese characters

 


Re: pg full text search very slow for Chinese characters

From: Tom Lane
Jimmy Huang <jimmy_huang@live.com> writes:
> I tried pg_trgm and my own customized token parser https://github.com/huangjimmy/pg_cjk_parser

pg_trgm is going to be fairly useless for indexing text that's mostly
multibyte characters, since its unit of indexable data is just 3 bytes
(not characters).  I don't know of any comparable issue in the core
tsvector logic, though.  The numbers you're quoting do sound quite awful,
but I share Cory's suspicion that it's something about your setup rather
than an inherent Postgres issue.
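
(For what it's worth, pg_trgm's show_trgm() function will print the trigrams it would index for a given string, e.g.

    SELECT show_trgm('中文全文搜索');

so it's easy to see what you would actually be indexing.)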

            regards, tom lane



Re: pg full text search very slow for Chinese characters

From: Andreas Joseph Krogh
On Tuesday, September 10, 2019 at 18:21:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The numbers you're quoting do sound quite awful, but I share Cory's
> suspicion that it's something about your setup rather than an inherent
> Postgres issue.
 
 
We experienced quite awful performance when we hosted the DB on virtual servers (~5 years ago), and it turned out we had hit the write-cache limit (then 8GB), which resulted in ~1MB/s IO throughput.
Running iozone might help track down IO problems.
 
 
--
Andreas Joseph Krogh

Re: pg full text search very slow for Chinese characters

From: Kyotaro Horiguchi
Hi.

At Tue, 10 Sep 2019 18:42:26 +0200 (CEST), Andreas Joseph Krogh <andreas@visena.com> wrote in
<VisenaEmail.3.8750116fce15432e.16d1c0b2b28@tc7-visena>
> pg_trgm is going to be fairly useless for indexing text that's mostly
> multibyte characters, since its unit of indexable data is just 3 bytes
> (not characters).

Multibyte characters also quickly bloat the index with a huge number of small buckets, one for every 3-character combination drawn from thousands of distinct characters, which makes it useless.

pg_bigm, which is based on bigrams (2-grams), works better for multibyte
characters.

https://pgbigm.osdn.jp/index_en.html
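
A minimal sketch of typical pg_bigm usage (the index name and search term are just placeholders):

    CREATE EXTENSION pg_bigm;
    CREATE INDEX books_text_bigm_idx ON books USING gin (text gin_bigm_ops);

    -- LIKE queries on books.text can then use the bigram index
    SELECT id, name FROM books WHERE text LIKE '%紅樓夢%';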

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center