Re: pg_trgm version 1.2 - Mailing list pgsql-hackers

From Merlin Moncure
Subject Re: pg_trgm version 1.2
Date
Msg-id CAHyXU0z1Ht8r276OMrZZ01_g3j+C6eey+1jFFkzwbV62v9bRcQ@mail.gmail.com
Whole thread Raw
In response to Re: pg_trgm version 1.2  (Merlin Moncure <mmoncure@gmail.com>)
List pgsql-hackers
On Mon, Jun 29, 2015 at 7:23 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Sat, Jun 27, 2015 at 5:17 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> V1.1: Time: 1743.691 ms  --- after repeated execution to warm the cache
>>
>> V1.2: Time:  2.839 ms      --- after repeated execution to warm the cache
>
> Wow!  I'm going to test this.  I have some data sets for which trigram
> searching isn't really practical...if the search string touches
> trigrams with a lot of duplication the algorithm can have trouble
> beating brute force searches.
>
> trigram searching is important: it's the only way currently to search
> string encoded structures for partial strings quickly.

I ran your patch against stock 9.4 and am happy to confirm massive
speedups of pg_trgm; results of 90% reduction in runtime are common.
Also, with the new changes it's hard to get the indexed search to
significantly underperform brute force searching which is a huge
improvement vs the stock behavior, something that made me very wary of
using these kinds of searches in the past.

datatable: 'test2'
rows: ~ 2 million
heap size: 3.3GB (includes several unrelated fields)
index size: 1GB
9.4: stock
9.5: patched

match 50% rows, brute force seq scan
9.4: 11.5s
9.5: 9.1s

match 50% rows, indexed (time is quite variable with 9.4 giving > 40 sec times)
9.4: 21.0s
9.5: 11.8s

match 1% rows, indexed (>90% time reduction!)
9.4: .566s
9.5: .046s

match .1% rows, one selective one non-selective search term, selective
term first
9.4: .563s
9.5: .028s

match .1% rows, one selective one non-selective search term, selective term last
9.4: 1.014s
9.5: 0.093s

very nice!  Recently, I examined pg_tgrm for an attribute searching
system -- it failed due to response time variability and lack of tools
to control that.  Were your patch in place, I would have passed it.  I
had a 'real world' data set though.   With this, pg_trgm is basically
outperforming SOLR search engine for all cases we're interested in
whereas before low selectivity cases where having all kinds of
trouble.

merlin



pgsql-hackers by date:

Previous
From: Josh Berkus
Date:
Subject: Re: Support for N synchronous standby servers - take 2
Next
From: Andres Freund
Date:
Subject: Re: Rework the way multixact truncations work