Re: pg full text search very slow for Chinese characters - Mailing list pgsql-general

From Cory Nemelka
Subject Re: pg full text search very slow for Chinese characters
Date
Msg-id CAMe5Gn2etfyY3579KtZCS2NyD9uvYPMxPe=Ns40RC9RVtAUjwQ@mail.gmail.com
In response to pg full text search very slow for Chinese characters  (Jimmy Huang <jimmy_huang@live.com>)
List pgsql-general
Well, there is a lot of information we would need to diagnose this: how much tuning have you done, and so on.

My advice is pretty simple. Don't expect performance on a notebook and, unless you are planning on hosting it on a notebook, use the notebook for development only. Test performance on a properly configured and tuned server.
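
For example, GIN index builds are bounded by maintenance_work_mem, which defaults to just 64MB; on a real server you would raise it for the build. A minimal sketch, assuming the tsvector column is called text_tsv (the column and index names are placeholders, and the 1GB figure is illustrative, not a recommendation for any particular machine):

    -- Give the build phase a larger memory budget (session-local setting;
    -- the default of 64MB forces frequent flushes to disk during the build).
    SET maintenance_work_mem = '1GB';

    -- Rebuild the GIN index with the bigger budget.
    CREATE INDEX books_text_tsv_idx ON books USING gin (text_tsv);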

--cnemelka


On Tue, Sep 10, 2019 at 9:53 AM Jimmy Huang <jimmy_huang@live.com> wrote:

Hi Team,

 

Can anyone shed some light on why postgres 11 is extremely slow in my case?

 

I am making a mirror of zh.wikisource.org and I have downloaded 303049 pages and stored them in a postgres 11 database.

 

My postgres instance is based on docker image postgres:11 and runs on my MacBook Pro i7 16GB.

 

Database schema is as follows

 

Table pages(id, url, html, downloaded, inserted_at, updated_at) and books(id, name, info, preface, text, html, url, parent_id, inserted_at, updated_at, info_html, preface_html)
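
For reference, a minimal sketch of that schema as DDL (the post only lists column names, so every type here is an assumption):

    CREATE TABLE pages (
        id          bigserial PRIMARY KEY,
        url         text,
        html        text,
        downloaded  boolean,
        inserted_at timestamptz,
        updated_at  timestamptz
    );

    CREATE TABLE books (
        id           bigserial PRIMARY KEY,
        name         text,
        info         text,
        preface      text,
        text         text,   -- plain-text body extracted from pages.html
        html         text,
        url          text,
        parent_id    bigint REFERENCES books(id),
        inserted_at  timestamptz,
        updated_at   timestamptz,
        info_html    text,
        preface_html text
    );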

 

A wikisource web page is downloaded and its html text is inserted into the “html” column of table “pages”.

Later, books.{name, info, preface, text, html, info_html, preface_html} are extracted from pages.html. The text column of books is a plain-text version of the content of the html column of table pages.

 

On average the text column of table books holds 7635 characters (each character is 3 bytes because of UTF-8 encoding), i.e. roughly 22 KB per row and somewhere around 7 GB of text across all 303049 rows, and I want to add full text search to books(text).

 

I tried pg_trgm and my own customized token parser, https://github.com/huangjimmy/pg_cjk_parser.
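
For context, wiring a custom parser into full text search generally follows this pattern (a sketch only: the configuration name is a placeholder, and the parser and token type names would need to match whatever pg_cjk_parser actually registers):

    -- Load the custom CJK-aware parser extension.
    CREATE EXTENSION pg_cjk_parser;

    -- Build a text search configuration on top of it. 'zh_cjk' is a
    -- placeholder name, and the token types below are those of the
    -- default parser, so the real list may differ.
    CREATE TEXT SEARCH CONFIGURATION zh_cjk (PARSER = pg_cjk_parser);
    ALTER TEXT SEARCH CONFIGURATION zh_cjk
        ADD MAPPING FOR word, numword, hword, hword_part WITH simple;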

 

To my surprise, postgres 11 is extremely slow when creating a full text index.

 

I added a column of tsvector type and tried to create a GIN index on that column. Postgres could not finish creating the index after a long time and I had to cancel the execution.
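
Concretely, those steps would look roughly like this (column, index, and configuration names are placeholders, reusing the zh_cjk configuration sketched above):

    -- Add a column for the precomputed tsvectors.
    ALTER TABLE books ADD COLUMN text_tsv tsvector;

    -- Populate it from the plain-text content (one pass over the table).
    UPDATE books SET text_tsv = to_tsvector('zh_cjk', text);

    -- The step that never finished: a GIN index over the tsvectors.
    CREATE INDEX books_text_tsv_idx ON books USING gin (text_tsv);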

I then tried to create a partial full text index covering 500 rows and it took Postgres 2 to 3 minutes to create it. Extrapolating from that, Postgres would need at least one day to create a full GIN full text search index for all 303049 rows of data. I think this is ridiculously slow.
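
For reference, that 500-row test can be expressed as a partial index; the WHERE predicate here is a guess at how the subset was selected:

    -- Index only a 500-row slice to estimate the full build time.
    CREATE INDEX books_text_tsv_sample_idx ON books
        USING gin (text_tsv)
        WHERE id <= 500;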

When I created a full text index on books(name) or books(info), it took just 3 minutes. However, name and info are extremely short compared to books(text).

 

I switched to Elasticsearch, and it turned out to be extremely efficient for my case: it took Elasticsearch only 3 hours to index all 303049 rows.

 

Jimmy Huang

jimmy_huang@live.com
