Thread: pg full text search very slow for Chinese characters
Hi Team,
Can anyone shed some light on why postgres 11 is extremely slow in my case?
I am making a mirror of zh.wikisource.org and I have downloaded 303049 pages and stored them in a postgres 11 database.
My postgres instance is based on docker image postgres:11 and runs on my MacBook Pro i7 16GB.
The database schema is as follows:
Table pages(id, url, html, downloaded, inserted_at, updated_at) and table books(id, name, info, preface, text, html, url, parent_id, inserted_at, updated_at, info_html, preface_html).
A wikisource web page is downloaded and its html text is inserted into table “pages”, column “html”.
Later, books.{name, info, preface, text, html, info_html, preface_html} are extracted from pages.html. The text column of books is a plain-text version of the content of the html column of table pages.
On average the text column of table books contains 7635 characters (each character is 3 bytes long because of the UTF-8 encoding), and I want to add full text search to books(text).
I tried pg_trgm and my own customized token parser https://github.com/huangjimmy/pg_cjk_parser
To my surprise, postgres 11 is extremely slow when creating a full text index.
I added a column of tsvector type and tried to create a GIN index on that column. Postgres could not finish creating the index after a long time, and I had to cancel the execution.
I then tried to create a partial full text index covering 500 rows, and it took Postgres 2 to 3 minutes to create it. Based on this estimate, Postgres would need at least one day to create the full GIN full text search index for all 303049 rows, which I think is ridiculously slow.
When I created an FTS index for books(name) or books(info), it took just 3 minutes. However, name and info are extremely short compared to books(text).
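Roughly, this is what I ran (simplified; the column name text_tsv and the text search configuration 'zh_cfg' are placeholders standing in for my actual pg_cjk_parser-based setup):

ALTER TABLE books ADD COLUMN text_tsv tsvector;
UPDATE books SET text_tsv = to_tsvector('zh_cfg', text);  -- 'zh_cfg' is a placeholder for my CJK text search configuration

CREATE INDEX books_text_tsv_idx ON books USING gin (text_tsv);  -- the full index I had to cancel

CREATE INDEX books_text_tsv_partial_idx ON books USING gin (text_tsv) WHERE id <= 500;  -- the ~500-row partial index that took 2-3 minutes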
I switched to Elasticsearch, which turned out to be extremely efficient for my case: it indexed all 303049 rows in 3 hours.
Jimmy Huang
jimmy_huang@live.com
>My postgres instance is based on docker image postgres:11 and runs on my MacBook Pro i7 16GB.
How much ram and such did you give to this vm?
>To my surprise, postgres 11 is extremely slow when creating a full text index. I added a column of tsvector type and tried to create an index on that column. Pg could not finish creating a GIN index for a long time and I had to cancel the execution. I then tried to create a partial full text index for 500 rows and it took postgres 2 to 3 minutes to create the index.
Did you customize any config? maintenance_work_mem specifically would be relevant to the time it takes to create an index, and the default value is only 64MB. Especially if you are running on a spinning hard drive and not an SSD, this could be problematic.
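For example, something like this checks and raises it (a rough sketch; pick a value that fits the memory you gave the container):

SHOW maintenance_work_mem;                      -- probably '64MB' if nothing was changed
SET maintenance_work_mem = '1GB';               -- per-session, just for one index build
ALTER SYSTEM SET maintenance_work_mem = '1GB';  -- or instance-wide
SELECT pg_reload_conf();                        -- reload so the instance-wide setting takes effect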
It is all default values.
I just checked maintenance_work_mem and indeed it is 64MB.
At first I gave 2GB of RAM to Docker, and later I increased it to 4GB.
Increasing the RAM from 2GB to 4GB did not make much difference.
I will try increasing maintenance_work_mem and see if it helps.
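Something along these lines is what I plan to try (assuming 1GB is a safe value inside a 4GB container; the column and index names are from my earlier test):

SET maintenance_work_mem = '1GB';
CREATE INDEX books_text_tsv_idx ON books USING gin (text_tsv);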
From: Michael Lewis <mlewis@entrata.com>
Date: Wednesday, September 11, 2019 at 12:11 AM
To: 黄 少君 <jimmy_huang@live.com>
Cc: "pgsql-general@lists.postgresql.org" <pgsql-general@lists.postgresql.org>
Subject: Re: pg full text search very slow for Chinese characters
>My postgres instance is based on docker image postgres:11 and runs on my MacBook Pro i7 16GB.
How much ram and such did you give to this vm?
>To my surprise, postgres 11 is extremely slow when creating a full text index. I added a column of tsvector type and tried to create an index on that column. Pg could not finish creating a GIN index for a long time and I had to cancel the execution. I then tried to create a partial full text index for 500 rows and it took postgres 2 to 3 minutes to create the index.
Did you customize any config? maintenance_work_mem specifically would be relevant to the time it takes to create an index, and the default value is only 64MB. Especially if you are running on a spinning hard drive and not an SSD, this could be problematic.
Jimmy Huang <jimmy_huang@live.com> writes:
> I tried pg_trgm and my own customized token parser https://github.com/huangjimmy/pg_cjk_parser
pg_trgm is going to be fairly useless for indexing text that's mostly
multibyte characters, since its unit of indexable data is just 3 bytes
(not characters). I don't know of any comparable issue in the core
tsvector logic, though. The numbers you're quoting do sound quite awful,
but I share Cory's suspicion that it's something about your setup rather
than an inherent Postgres issue.
regards, tom lane
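(As an aside, pg_trgm's show_trgm() helper shows which trigrams it would extract for a given string; a quick sketch, and the exact output depends on the database encoding and pg_trgm version:)

CREATE EXTENSION IF NOT EXISTS pg_trgm;
SELECT show_trgm('中文全文搜索');      -- trigrams extracted from a Chinese string
SELECT show_trgm('full text search');  -- compare with plain ASCII text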
Andreas Joseph Krogh <andreas@visena.com> wrote:
> The numbers you're quoting do sound quite awful, but I share Cory's suspicion that it's something about your setup rather than an inherent Postgres issue.
We experienced quite awful performance when we hosted the DB on virtual servers (~5 years ago) and it turned out we hit the write-cache limit (then 8GB), which resulted in ~1MB/s IO throughput. Running iozone might help tracing down IO problems.
--
Andreas Joseph Krogh
Hi.
At Tue, 10 Sep 2019 18:42:26 +0200 (CEST), Andreas Joseph Krogh <andreas@visena.com> wrote in <VisenaEmail.3.8750116fce15432e.16d1c0b2b28@tc7-visena>:
> We experienced quite awful performance when we hosted the DB on virtual servers (~5 years ago) and it turned out we hit the write-cache limit (then 8GB), which resulted in ~1MB/s IO throughput. Running iozone might help tracing down IO problems.
Multibyte characters also quickly bloat the index with many small buckets, one for every 3-character combination out of thousands of characters, which makes it useless. pg_bigm, which is based on bigrams (2-grams), works better on multibyte characters: https://pgbigm.osdn.jp/index_en.html
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
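For anyone who wants to try this route, a minimal sketch of a pg_bigm setup might look like the following (assuming the extension is available in the container; the index name and search term are only illustrative):

CREATE EXTENSION pg_bigm;
CREATE INDEX books_text_bigm_idx ON books USING gin (text gin_bigm_ops);  -- 2-gram GIN index on the text column
SELECT id, name FROM books WHERE text LIKE likequery('紅樓夢');           -- searches go through LIKE, which this index can service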