Thread: GIN, pg_trgm and large table

GIN, pg_trgm and large table

From: Max Fomichev
Hello!
I have the following table and index with about 15 billion records.

CREATE TABLE us_en.ngrams
(
     ngram text COLLATE pg_catalog."default" NOT NULL,
     year smallint NOT NULL,
     occurrence bigint NOT NULL,
     words smallint NOT NULL
)
WITH (
     OIDS = FALSE
)
TABLESPACE pg_default;

CREATE INDEX trgm_idx_ngram_ngrams_us_en
     ON us_en.ngrams USING gin
     (ngram COLLATE pg_catalog."default" gin_trgm_ops)
     TABLESPACE pg_default;

Configuration:
PostgreSQL 9.5.12 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit
128GB RAM, 2TB SDD
Server settings:
shared_buffers = 32GB
effective_cache_size = 96GB
work_mem = 4GB
maintenance_work_mem = 16GB
min_wal_size = 4GB
max_wal_size = 8GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
max_worker_processes = 16
gin_fuzzy_search_limit = 10000


My queries on the us_en.ngrams.ngram field are too slow:
ngrams=# set enable_seqscan=false;
SET
ngrams=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM us_en.ngrams WHERE ngram LIKE '%computer%' LIMIT 20;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=232884.91..232933.12 rows=20 width=36) (actual time=44962.926..44966.214 rows=14 loops=1)
    Buffers: shared hit=18177345 read=124224
    ->  Bitmap Heap Scan on ngrams  (cost=232884.91..61129746.57 rows=25261021 width=36) (actual time=44962.925..44966.202 rows=14 loops=1)
          Recheck Cond: (ngram ~~ '%computer%'::text)
          Rows Removed by Index Recheck: 10
          Heap Blocks: exact=8
          Buffers: shared hit=18177345 read=124224
          ->  Bitmap Index Scan on trgm_idx_ngram_ngrams_us_en  (cost=0.00..226569.66 rows=25261021 width=0) (actual time=44961.929..44961.929 rows=24 loops=1)
                Index Cond: (ngram ~~ '%computer%'::text)
                Buffers: shared hit=18177345 read=124216
  Planning time: 0.305 ms
  Execution time: 44966.271 ms
(12 rows)


ngrams=# set enable_seqscan=true;
SET
ngrams=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM us_en.ngrams WHERE ngram LIKE '%computer%' LIMIT 20;
                                                        QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=0.00..215.41 rows=20 width=36) (actual time=0.248..29.089 rows=20 loops=1)
    Buffers: shared hit=14 read=1509
    ->  Seq Scan on ngrams  (cost=0.00..272072628.00 rows=25261021 width=36) (actual time=0.247..29.074 rows=20 loops=1)
          Filter: (ngram ~~ '%computer%'::text)
          Rows Removed by Filter: 207598
          Buffers: shared hit=14 read=1509
  Planning time: 0.332 ms
  Execution time: 29.117 ms
(8 rows)


ngrams=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM us_en.ngrams WHERE ngram LIKE '%version%' ORDER BY (occurrence, year) DESC LIMIT 20;
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=4661042.23..4661042.28 rows=20 width=36) (actual time=144417.365..144417.365 rows=0 loops=1)
    Buffers: shared hit=28982531 read=167634
    ->  Sort  (cost=4661042.23..4664172.58 rows=1252138 width=36) (actual time=144417.363..144417.363 rows=0 loops=1)
          Sort Key: (ROW(occurrence, year)) DESC
          Sort Method: quicksort  Memory: 25kB
          Buffers: shared hit=28982531 read=167634
          ->  Bitmap Heap Scan on ngrams  (cost=20372.07..4627723.29 rows=1252138 width=36) (actual time=144417.331..144417.331 rows=0 loops=1)
                Recheck Cond: (ngram ~~ '%version%'::text)
                Buffers: shared hit=28982527 read=167633
                ->  Bitmap Index Scan on trgm_idx_ngram_ngrams_us_en  (cost=0.00..20059.04 rows=1252138 width=0) (actual time=144417.328..144417.328 rows=0 loops=1)
                      Index Cond: (ngram ~~ '%version%'::text)
                      Buffers: shared hit=28982527 read=167633
  Planning time: 0.344 ms
  Execution time: 144417.522 ms
(14 rows)

Is there any way to improve GIN/pg_trgm performance on such a large table?
Thank you!


-- 
Best regards,
Max Fomichev



Re: GIN, pg_trgm and large table

From: Tom Lane
Max Fomichev <max.fomitchev@gmail.com> writes:
> I have the following table and index with about 15 billion records.
> ...
> PostgreSQL 9.5.12 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 
> 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit
> ...
> It there any way to improve GIN/pg_tgrm performance on a such large table?

There was some work done in 9.6 to improve pg_trgm's performance when
dealing with queries involving very common trigrams.  So maybe an update
to 9.6 or v10 would help you.

I have a bad feeling though that 15 billion rows is too many for an index
based on trigrams to be really useful --- there are just not enough
distinct trigrams.  It's too bad we don't have a more general N-gram
indexing facility.

            regards, tom lane


Re: GIN, pg_trgm and large table

From: Max Fomichev
Hello Tom,
thank you for your reply.

On 10/03/2018 19:00, Tom Lane wrote:
> Max Fomichev <max.fomitchev@gmail.com> writes:
>> I have the following table and index with about 15 billion records.
>> ...
>> PostgreSQL 9.5.12 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit
>> ...
>> Is there any way to improve GIN/pg_trgm performance on such a large table?
> There was some work done in 9.6 to improve pg_trgm's performance when 
> dealing with queries involving very common trigrams. So maybe an 
> update to 9.6 or v10 would help you.

I'll try the 10.x version.

> I have a bad feeling though that 15 billion rows is too many for an 
> index based on trigrams to be really useful --- there are just not 
> enough distinct trigrams. It's too bad we don't have a more general 
> N-gram indexing facility.
Could you please advise on the correct approach/index type for my case?
I have about 15 billion ngram records (each ngram contains from 1 to 5 
words). I'd like to find all ngrams that contain a given search word.
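For example, would a word-level full-text index be a better fit than 
trigrams, since I only need whole-word matches? A rough sketch of what 
I mean (just an idea, names are made up, not tested on my data):

```sql
-- Hypothetical alternative: index whole words instead of trigrams.
-- The 'simple' configuration skips stemming, so words match exactly.
CREATE INDEX words_idx_ngram_ngrams_us_en
    ON us_en.ngrams USING gin (to_tsvector('simple', ngram));

-- A whole-word search could then use this index:
SELECT *
FROM us_en.ngrams
WHERE to_tsvector('simple', ngram) @@ plainto_tsquery('simple', 'computer')
LIMIT 20;
```

Would that scale any better at 15 billion rows?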

-- 
Best regards,
Max Fomichev



Re: GIN, pg_trgm and large table

From: Tomasz Barszczewski
On 10.03.2018 11:31, Max Fomichev wrote:
> Hello!
> I have the following table and index with about 15 billion records.
tb: Maybe try partitioning the table (with CONSTRAINT ... CHECK) so that smaller tables are searched.
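
A rough sketch of what I mean, using inheritance partitioning (declarative 
partitioning only arrived in PostgreSQL 10; the split by the words column 
and all names here are just examples):

```sql
-- Parent table stays empty; children carry the data.
CREATE TABLE us_en.ngrams_1w (
    CONSTRAINT ngrams_1w_check CHECK (words = 1)
) INHERITS (us_en.ngrams);

CREATE TABLE us_en.ngrams_2w (
    CONSTRAINT ngrams_2w_check CHECK (words = 2)
) INHERITS (us_en.ngrams);

-- Each child gets its own, much smaller, trigram index.
CREATE INDEX trgm_idx_ngrams_1w
    ON us_en.ngrams_1w USING gin (ngram gin_trgm_ops);
```

With constraint_exclusion = partition, queries that also filter on words 
only touch the matching children; either way each child index is far 
smaller than one index over 15 billion rows.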


P.S. Also add a primary key - e.g. pg_repack requires one.
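
For example (assuming (ngram, year) is unique in your data; otherwise a 
surrogate bigserial column would do):

```sql
ALTER TABLE us_en.ngrams
    ADD CONSTRAINT ngrams_pkey PRIMARY KEY (ngram, year);
```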

--
Best regards, Tomasz Barszczewski - tb@nop.com.pl