GIN, pg_trgm and large table - Mailing list pgsql-novice

From Max Fomichev
Subject GIN, pg_trgm and large table
Date
Msg-id 44c55f0b-a67e-94d3-07f3-6381204ef9e9@gmail.com
Whole thread Raw
Responses Re: GIN, pg_trgm and large table  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: GIN, pg_trgm and large table  (Tomasz Barszczewski <tb@nop.com.pl>)
List pgsql-novice
Hello!
I have the following table and index with about 15 billion records.

CREATE TABLE us_en.ngrams
(
     ngram text COLLATE pg_catalog."default" NOT NULL,
     year smallint NOT NULL,
     occurrence bigint NOT NULL,
     words smallint NOT NULL
)
WITH (
     OIDS = FALSE
)
TABLESPACE pg_default;

CREATE INDEX trgm_idx_ngram_ngrams_us_en
     ON us_en.ngrams USING gin
     (ngram COLLATE pg_catalog."default" gin_trgm_ops)
     TABLESPACE pg_default;

Configuration:
PostgreSQL 9.5.12 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 
5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit
128GB RAM, 2TB SDD
Server settings:
shared_buffers = 32GB
effective_cache_size = 96GB
work_mem = 4GB
maintenance_work_mem = 16GB
min_wal_size = 4GB
max_wal_size = 8GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
max_worker_processes = 16
gin_fuzzy_search_limit = 10000


My queries related to us_en.ngrams.ngram field are too slow:
ngrams=# set enable_seqscan=false;
SET
ngrams=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM us_en.ngrams WHERE 
ngram LIKE '%computer%' LIMIT 20;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=232884.91..232933.12 rows=20 width=36) (actual 
time=44962.926..44966.214 rows=14 loops=1)
    Buffers: shared hit=18177345 read=124224
    ->  Bitmap Heap Scan on ngrams  (cost=232884.91..61129746.57 
rows=25261021 width=36) (actual time=44962.925..44966.202 rows=14 loops=1)
          Recheck Cond: (ngram ~~ '%computer%'::text)
          Rows Removed by Index Recheck: 10
          Heap Blocks: exact=8
          Buffers: shared hit=18177345 read=124224
          ->  Bitmap Index Scan on trgm_idx_ngram_ngrams_us_en 
(cost=0.00..226569.66 rows=25261021 width=0) (actual 
time=44961.929..44961.929 rows=24 loops=1)
                Index Cond: (ngram ~~ '%computer%'::text)
                Buffers: shared hit=18177345 read=124216
  Planning time: 0.305 ms
  Execution time: 44966.271 ms
(12 rows)


ngrams=# set enable_seqscan=true;
SET
ngrams=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM us_en.ngrams WHERE 
ngram LIKE '%computer%' LIMIT 20;
                                                        QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=0.00..215.41 rows=20 width=36) (actual time=0.248..29.089 
rows=20 loops=1)
    Buffers: shared hit=14 read=1509
    ->  Seq Scan on ngrams  (cost=0.00..272072628.00 rows=25261021 
width=36) (actual time=0.247..29.074 rows=20 loops=1)
          Filter: (ngram ~~ '%computer%'::text)
          Rows Removed by Filter: 207598
          Buffers: shared hit=14 read=1509
  Planning time: 0.332 ms
  Execution time: 29.117 ms
(8 rows)


ngrams=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM us_en.ngrams WHERE 
ngram LIKE '%version%' ORDER BY (occurrence, year) DESC LIMIT 20;
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=4661042.23..4661042.28 rows=20 width=36) (actual 
time=144417.365..144417.365 rows=0 loops=1)
    Buffers: shared hit=28982531 read=167634
    ->  Sort  (cost=4661042.23..4664172.58 rows=1252138 width=36) 
(actual time=144417.363..144417.363 rows=0 loops=1)
          Sort Key: (ROW(occurrence, year)) DESC
          Sort Method: quicksort  Memory: 25kB
          Buffers: shared hit=28982531 read=167634
          ->  Bitmap Heap Scan on ngrams (cost=20372.07..4627723.29 
rows=1252138 width=36) (actual time=144417.331..144417.331 rows=0 loops=1)
                Recheck Cond: (ngram ~~ '%version%'::text)
                Buffers: shared hit=28982527 read=167633
                ->  Bitmap Index Scan on trgm_idx_ngram_ngrams_us_en  
(cost=0.00..20059.04 rows=1252138 width=0) (actual 
time=144417.328..144417.328 rows=0 loops=1)
                      Index Cond: (ngram ~~ '%version%'::text)
                      Buffers: shared hit=28982527 read=167633
  Planning time: 0.344 ms
  Execution time: 144417.522 ms
(14 rows)

It there any way to improve GIN/pg_tgrm performance on a such large table?
Thank you!


-- 
Best regards,
Max Fomichev



pgsql-novice by date:

Previous
From: Tom Lane
Date:
Subject: Re: pg_query won't execute
Next
From: Tom Lane
Date:
Subject: Re: GIN, pg_trgm and large table