Re: How to speed up pg_trgm / gin index scan - Mailing list pgsql-general
From:           Merlin Moncure
Subject:        Re: How to speed up pg_trgm / gin index scan
Msg-id:         CAHyXU0za-pe53U92j+dLnHomd+BtnQnd2SMBUW83PBinH-0sMA@mail.gmail.com
In response to: Re: How to speed up pg_trgm / gin index scan (Christian Ramseyer <rc@networkz.ch>)
Responses:      Re: How to speed up pg_trgm / gin index scan
List:           pgsql-general
On Tue, Jul 28, 2015 at 8:34 AM, Christian Ramseyer <rc@networkz.ch> wrote:
> On 22/06/15 13:51, Christian Ramseyer wrote:
>> Hi
>>
>> I have a pretty large table with syslog messages.
>>
>> It is already partitioned by month, and for a single month I have e.g.
>>
>> DM=# \d+ logs_01
>>
>>     Column    |            Type             |
>> --------------+-----------------------------+
>>  host         | character varying(255)      |
>>  facility     | character varying(10)       |
>>  priority     | character varying(10)       |
>>  tag          | character varying(255)      |
>>  log_date     | timestamp without time zone |
>>  program      | character varying(255)      |
>>  msg          | text                        |
>>  seq          | bigint                      |
>>
>> Indexes:
>>     "logs_01_pkey" PRIMARY KEY, btree (seq)
>>     "idx_logs_01_lower_host" btree (lower(host::text) varchar_pattern_ops)
>>     "logs_01_date_index" btree (log_date)
>>     "tridx_logs_01_msg" gin (msg gin_trgm_ops)
>>
>> DM=# select count(*) from logs_01;
>>   count
>> ----------
>>  83052864
>>
>> I'd like to provide a fast "like %x%" search on the msg column, hence I
>> added a trigram-based gin index on it. It is around 60 GB on the 35 GB
>> table:
>>
>> DM=# \dt+ logs_01
>>                     List of relations
>>  Schema |  Name   | Type  |  Owner   | Size  | Description
>> --------+---------+-------+----------+-------+-------------
>>  public | logs_01 | table | postgres | 35 GB |
>>
>> DM=# \di+ tridx_logs_01_msg
>>                            List of relations
>>  Schema |       Name        | Type  |  Owner   |  Table  | Size  | Description
>> --------+-------------------+-------+----------+---------+-------+-------------
>>  public | tridx_logs_01_msg | index | postgres | logs_01 | 58 GB |
>>
>> A typical query on this table looks like this:
>>
>> explain analyze
>> select log_date, host, msg
>> from logs_01 as log where log.msg like '%192.23.33.177%'
>> and log.log_date >= '2015-1-18 1:45:24'
>> and log.log_date <= '2015-1-19 1:45:24'
>> order by log_date asc offset 200 limit 50;
>>
>> It yields a promising explain that shows that the index is used, but
>> unfortunately the bitmap index scan on the GIN index takes quite long
>> (40 seconds):
>>
>>                                                          QUERY PLAN
>> ----------------------------------------------------------------------------------------------------------------------------
>>  Limit  (cost=34510.06..34510.06 rows=1 width=195) (actual time=42971.002..42971.015 rows=50 loops=1)
>>    ->  Sort  (cost=34509.75..34510.06 rows=124 width=195) (actual time=42970.960..42970.990 rows=250 loops=1)
>>          Sort Key: log_date
>>          Sort Method: top-N heapsort  Memory: 152kB
>>          ->  Bitmap Heap Scan on logs_01 log  (cost=34009.21..34505.44 rows=124 width=195) (actual time=42963.969..42969.725 rows=2472 loops=1)
>>                Recheck Cond: ((msg ~~ '%192.23.33.177%'::text) AND (log_date >= '2015-01-18 01:45:24'::timestamp without time zone) AND (log_date <= '2015-01-19 01:45:24'::timestamp without time zone))
>>                ->  BitmapAnd  (cost=34009.21..34009.21 rows=124 width=0) (actual time=42962.562..42962.562 rows=0 loops=1)
>>                      ->  Bitmap Index Scan on tridx_logs_01_msg  (cost=0.00..6992.15 rows=8020 width=0) (actual time=42731.145..42731.145 rows=168489 loops=1)
>>                            Index Cond: (msg ~~ '%192.23.33.177%'::text)
>>                      ->  Bitmap Index Scan on logs_01_date_index  (cost=0.00..27016.75 rows=1287939 width=0) (actual time=180.055..180.055 rows=1173048 loops=1)
>>                            Index Cond: ((log_date >= '2015-01-18 01:45:24'::timestamp without time zone) AND (log_date <= '2015-01-19 01:45:24'::timestamp without time zone))
>>  Total runtime: 42971.137 ms
>>
>> (also on http://explain.depesz.com/s/KpaB)
>>
>> Any good ideas on how I could speed this up a bit?
>>
>> I have already tried to throw quite a bunch of memory at the problem:
>>
>> shared_buffers = 64GB
>> work_mem = 16GB
>>
>> but it didn't improve between this and the 32 GB shared / 2 GB work_mem
>> I had before.
>>
>> This is on Postgres 9.1.15 on Linux.
>>
>
>> Try 9.4 and you'll be surprised.
>>
>> 1. GIN has compression.
>> 2. GIN has a fast scan feature.
>>
>> Oleg
>
> Hi Oleg and List
>
> I finally got around to trying 9.4, and it is quite fantastic.
>
> Index size went from 58 GB down to 14 GB:
>
> DM=# \di+ tridx_logs_01_msg
>                            List of relations
>  Schema |       Name        | Type  |  Owner   |  Table  | Size  | Description
> --------+-------------------+-------+----------+---------+-------+-------------
>  public | tridx_logs_01_msg | index | postgres | logs_01 | 14 GB |
>
> And the time for the above query went down to about 20 seconds:
>
> DM=# explain analyze
> DM-# select log_date, host, msg
> DM-# from logs_01 as log where log.msg like '%192.23.33.177%'
> DM-# and log.log_date >= '2015-1-18 1:45:24'
> DM-# and log.log_date <= '2015-1-19 1:45:24'
> DM-# order by log_date asc offset 200 limit 50;
>
>                                                         QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=28815.06..28815.06 rows=1 width=194) (actual time=19032.099..19032.099 rows=0 loops=1)
>    ->  Sort  (cost=28814.74..28815.06 rows=128 width=194) (actual time=19032.093..19032.093 rows=0 loops=1)
>          Sort Key: log_date
>          Sort Method: quicksort  Memory: 25kB
>          ->  Bitmap Heap Scan on logs_01 log  (cost=28298.06..28810.26 rows=128 width=194) (actual time=19031.992..19031.992 rows=0 loops=1)
>                Recheck Cond: ((msg ~~ '%192.23.33.177%'::text) AND (log_date >= '2015-01-18 01:45:24'::timestamp without time zone) AND (log_date <= '2015-01-19 01:45:24'::timestamp without time zone))
>                ->  BitmapAnd  (cost=28298.06..28298.06 rows=128 width=0) (actual time=19031.983..19031.983 rows=0 loops=1)
>                      ->  Bitmap Index Scan on tridx_logs_01_msg  (cost=0.00..508.15 rows=8020 width=0) (actual time=18408.121..18408.121 rows=99 loops=1)
>                            Index Cond: (msg ~~ '%192.23.33.177%'::text)
>                      ->  Bitmap Index Scan on logs_01_date_index  (cost=0.00..27789.60 rows=1325303 width=0) (actual time=623.084..623.084 rows=1173048 loops=1)
>                            Index Cond: ((log_date >= '2015-01-18 01:45:24'::timestamp without time zone) AND (log_date <= '2015-01-19 01:45:24'::timestamp without time zone))
>  Planning time: 0.945 ms
>  Execution time: 19032.409 ms
> (13 rows)
>
> Great stuff! Sorry Oleg, I don't have your original message anymore and
> can't reply in the right place in the thread, so I took the liberty of
> CCing you.

There are some more big optimizations for trigram searching (via Jeff
Janes) coming down the pike. See this thread:
http://www.postgresql.org/message-id/CAMkU=1woR_Pdmie6d-zj6sDOPiHd_iUe3vZSXFGe_i4-AQYsJQ@mail.gmail.com

I think it should be possible to patch the 9.4 pg_trgm module with
Jeff's patches -- it might be worthwhile to do that, run some tests, and
report back. I don't know whether they address your particular case, but
in some situations the speedups are really dramatic.

merlin
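[Editorial aside: a sketch of why the GIN scan for '%192.23.33.177%' has so much work to do. pg_trgm answers a LIKE query by intersecting the posting lists of the pattern's trigrams, and a dotted IP address decomposes into several very short numeric words whose trigrams are common in syslog data. The Python below is a rough model of pg_trgm's documented default extraction (lowercase, split into alphanumeric words, pad each word with two leading and one trailing blank, slide a 3-character window); the function name is invented for illustration and is not part of pg_trgm.]

```python
import re

def trigrams(text):
    # Rough model of pg_trgm's default extraction: lowercase the input,
    # split it into alphanumeric "words", pad each word with two leading
    # blanks and one trailing blank, then slide a 3-character window.
    out = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        out.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return out

# The IP splits into the words 192, 23, 33, 177 -- 13 distinct trigrams,
# mostly short numeric fragments likely to occur in many unrelated rows.
print(sorted(trigrams("192.23.33.177")))
```

Every heap row whose trigram sets match must still be rechecked against the full pattern, which is consistent with the 9.1 plan above: the bitmap index scan returns 168489 candidate rows, of which the recheck keeps only 2472.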