Re: How to speed up pg_trgm / gin index scan - Mailing list pgsql-general

From Merlin Moncure
Subject Re: How to speed up pg_trgm / gin index scan
Date
Msg-id CAHyXU0za-pe53U92j+dLnHomd+BtnQnd2SMBUW83PBinH-0sMA@mail.gmail.com
Whole thread Raw
In response to Re: How to speed up pg_trgm / gin index scan  (Christian Ramseyer <rc@networkz.ch>)
Responses Re: How to speed up pg_trgm / gin index scan
List pgsql-general
On Tue, Jul 28, 2015 at 8:34 AM, Christian Ramseyer <rc@networkz.ch> wrote:
> On 22/06/15 13:51, Christian Ramseyer wrote:
>> Hi
>>
>> I have a pretty large table with syslog messages.
>>
>> It is already partitioned by month, and for a single month I have e.g.
>>
>>
>> DM=# \d+ logs_01
>>
>>     Column    |            Type             |
>> --------------+-----------------------------+
>>  host         | character varying(255)      |
>>  facility     | character varying(10)       |
>>  priority     | character varying(10)       |
>>  tag          | character varying(255)      |
>>  log_date     | timestamp without time zone |
>>  program      | character varying(255)      |
>>  msg          | text                        |
>>  seq          | bigint                      |
>>
>> Indexes:
>>     "logs_01_pkey" PRIMARY KEY, btree (seq)
>>     "idx_logs_01_lower_host" btree (lower(host::text) varchar_pattern_ops)
>>     "logs_01_date_index" btree (log_date)
>>     "tridx_logs_01_msg" gin (msg gin_trgm_ops)
>>
>>
>> DM=# select count(*) from logs_01;
>>   count
>> ----------
>>  83052864
>>
>>
>> I'd like to provide a fast "like %x%" search on the msg column, hence I added a trigram based gin index on it. It is
around60 GB on the 35 GB table: 
>>
>> DM=# select count(*) from logs_01;
>>   count
>> ----------
>>  83052864
>>
>>
>> DM=# \dt+ logs_01
>>                      List of relations
>>  Schema |  Name   | Type  |  Owner   | Size  | Description
>> --------+---------+-------+----------+-------+-------------
>>  public | logs_01 | table | postgres | 35 GB |
>>
>> DM=# \di+ tridx_logs_01_msg
>>                                List of relations
>>  Schema |       Name        | Type  |  Owner   |  Table  | Size  | Description
>> --------+-------------------+-------+----------+---------+-------+-------------
>>  public | tridx_logs_01_msg | index | postgres | logs_01 | 58 GB |
>>
>>
>> A typical query on this table looks like this:
>>
>> explain analyze
>> select log_date, host, msg
>> from logs_01 as log   where  log.msg like '%192.23.33.177%'
>>     and log.log_date >= '2015-1-18 1:45:24'
>>     and log.log_date <= '2015-1-19 1:45:24'
>>     order by log_date asc offset 200 limit 50;
>>
>>
>> It yields a promising explain that shows that the index is used, but unfortunately the bitmap index scan on the GIN
indextakes quite long (40 seconds) 
>>
>>                                                                                                 QUERY PLAN
>>
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>  Limit  (cost=34510.06..34510.06 rows=1 width=195) (actual time=42971.002..42971.015 rows=50 loops=1)
>>    ->  Sort  (cost=34509.75..34510.06 rows=124 width=195) (actual time=42970.960..42970.990 rows=250 loops=1)
>>          Sort Key: log_date
>>          Sort Method: top-N heapsort  Memory: 152kB
>>          ->  Bitmap Heap Scan on logs_01 log  (cost=34009.21..34505.44 rows=124 width=195) (actual
time=42963.969..42969.725rows=2472 loops=1) 
>>                Recheck Cond: ((msg ~~ '%192.23.33.177%'::text) AND (log_date >= '2015-01-18 01:45:24'::timestamp
withouttime zone) AND (log_date <= '2015-01-19 01:45:24'::timestamp without time zone)) 
>>                ->  BitmapAnd  (cost=34009.21..34009.21 rows=124 width=0) (actual time=42962.562..42962.562 rows=0
loops=1)
>>                      ->  Bitmap Index Scan on tridx_logs_01_msg  (cost=0.00..6992.15 rows=8020 width=0) (actual
time=42731.145..42731.145rows=168489 loops=1) 
>>                            Index Cond: (msg ~~ '%192.23.33.177%'::text)
>>                      ->  Bitmap Index Scan on logs_01_date_index  (cost=0.00..27016.75 rows=1287939 width=0) (actual
time=180.055..180.055rows=1173048 loops=1) 
>>                            Index Cond: ((log_date >= '2015-01-18 01:45:24'::timestamp without time zone) AND
(log_date<= '2015-01-19 01:45:24'::timestamp without time zone)) 
>>  Total runtime: 42971.137 ms
>>
>> (also on http://explain.depesz.com/s/KpaB)
>>
>>
>> Any good ideas on how I could speed this up a bit?
>>
>> I have already tried to throw quite a bunch of memory at the problem:
>>
>> shared_buffers = 64GB
>> work_mem = 16GB
>>
>> but it didn't improve between this and the 32GB shared/ 2GB work GB I had before.
>>
>> This is on Postgres 9.1.15 on Linux.
>>
>
>
>
>> Try 9.4 and you'll surprise.
>>
>> 1. GIN has compression
>> 2. GIN has fast scan feature.
>>
>> Oleg
>
>
> Hi Oleg and List
>
> I finally got around to try 9.4, and it is quite fantastic.
>
> Index size went from 58 to now 14 GB:
>
> DM=# \di+ tridx_logs_01_msg
>                                List of relations
>  Schema |       Name        | Type  |  Owner   |  Table  | Size  |
> Description
> --------+-------------------+-------+----------+---------+-------+-------------
>  public | tridx_logs_01_msg | index | postgres | logs_01 | 14 GB |
>
>
> And the time for the above query went down to about 20 seconds:
>
>
> DM=# explain analyze
> DM-# select log_date, host, msg
> DM-# from logs_01 as log   where  log.msg like '%192.23.33.177%'
> DM-#     and log.log_date >= '2015-1-18 1:45:24'
> DM-#     and log.log_date <= '2015-1-19 1:45:24'
> DM-#     order by log_date asc offset 200 limit 50;
>
>                         QUERY PLAN
>
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=28815.06..28815.06 rows=1 width=194) (actual
> time=19032.099..19032.099 rows=0 loops=1)
>    ->  Sort  (cost=28814.74..28815.06 rows=128 width=194) (actual
> time=19032.093..19032.093 rows=0 loops=1)
>          Sort Key: log_date
>          Sort Method: quicksort  Memory: 25kB
>          ->  Bitmap Heap Scan on logs_01 log  (cost=28298.06..28810.26
> rows=128 width=194) (actual time=19031.992..19031.992 rows=0 loops=1)
>                Recheck Cond: ((msg ~~ '%192.23.33.177%'::text) AND
> (log_date >= '2015-01-18 01:45:24'::timestamp without time zone) AND
> (log_date <= '2015-01-19 01:45:24'::timestamp without time zone))
>                ->  BitmapAnd  (cost=28298.06..28298.06 rows=128 width=0)
> (actual time=19031.983..19031.983 rows=0 loops=1)
>                      ->  Bitmap Index Scan on tridx_logs_01_msg
> (cost=0.00..508.15 rows=8020 width=0) (actual time=18408.121..18408.121
> rows=99 loops=1)
>                            Index Cond: (msg ~~ '%192.23.33.177%'::text)
>                      ->  Bitmap Index Scan on logs_01_date_index
> (cost=0.00..27789.60 rows=1325303 width=0) (actual time=623.084..623.084
> rows=1173048 loops=1)
>                            Index Cond: ((log_date >= '2015-01-18
> 01:45:24'::timestamp without time zone) AND (log_date <= '2015-01-19
> 01:45:24'::timestamp without time zone))
>  Planning time: 0.945 ms
>  Execution time: 19032.409 ms
> (13 rows)
>
> Great stuff! Sorry Oleg I don't have your original message anymore and
> can't reply into the right place in the thread, so I took the liberty to
> CC: you.

There are some more big optimizations (via Jeff Janes) coming down the
pike for trigram searching.  See thread:
http://www.postgresql.org/message-id/CAMkU=1woR_Pdmie6d-zj6sDOPiHd_iUe3vZSXFGe_i4-AQYsJQ@mail.gmail.com.

I think it should be possible to patch the 9.4 pg_trgm module with
Jeff's stuff -- it might be worthwhile to do that and run some tests
and report back.  I don't know if they address your particular case
but in some situations the speedups are really dramatic.

merlin


pgsql-general by date:

Previous
From: Christian Ramseyer
Date:
Subject: Re: How to speed up pg_trgm / gin index scan
Next
From: Arthur Silva
Date:
Subject: Re: How to speed up pg_trgm / gin index scan