Re: pg_trgm version 1.2 - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: pg_trgm version 1.2
Date
Msg-id CAPpHfdtUfkY7Hv+7dLQPYFWrW1xrUb1jvPrdwL84eyGN6y1E5w@mail.gmail.com
Whole thread Raw
In response to pg_trgm version 1.2  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: pg_trgm version 1.2  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-hackers
On Sun, Jun 28, 2015 at 1:17 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
This patch implements version 1.2 of contrib module pg_trgm.

This supports the triconsistent function, introduced in version 9.4 of the server, to make it faster to implement indexed queries where some keys are common and some are rare.

Thank you for the patch! Lack of pg_trgm triconsistent support was significant miss after "fast scan" implementation.
 
I've included the paths to both upgrade and downgrade between 1.1 and 1.2, although after doing so you must close and restart the session before you can be sure the change has taken effect. There is no change to the on-disk index structure

pg_trgm--1.1.sql andpg_trgm--1.1--1.2.sql are useful for debug, but do you expect them in final commit? As I can see in other contribs we have only last version and upgrade scripts.

 This shows the difference it can make in some cases:

create extension pg_trgm version "1.1";

create table foo as select  

  md5(random()::text)|| case when random()<0.000005 then 'lmnop' else '123' end || 

  md5(random()::text) as bar 

from generate_series(1,10000000);

create index on foo using gin (bar gin_trgm_ops);

--some queries

alter extension pg_trgm update to "1.2"; 

--close, reopen, more queries


select count(*) from foo where bar like '%12344321lmnabcddd%';

 

V1.1: Time: 1743.691 ms  --- after repeated execution to warm the cache

V1.2: Time:  2.839 ms      --- after repeated execution to warm the cache


Nice!
 

You could get the same benefit just by increasing MAX_MAYBE_ENTRIES (in core) from 4 to some higher value (which it probably should be anyway, but there will always be a case where it needs to be higher than you can afford it to be, so a real solution is needed).

 
Actually, it depends on how long it takes to calculate consistent function. The cheaper consistent function is, the higher MAX_MAYBE_ENTRIES could be. Since all functions in PostgreSQL may define its cost, GIN could calculate MAX_MAYBE_ENTRIES from the cost of consistent function.

I wasn't sure if this should be a new version of pg_trgm or not, because there is no user visible change other than to performance.  But there may be some cases where it results in performance reduction and so it is nice to provide options.  Also, I'd like to use it in a back-branch, so versions seems to be the right way to go there.

 
We definitely have to stamp a new version of extension every time when its SQL-definition changes. By the current design of GIN, providing both consistent and triconsistent function should be optimal for performance. If it's not so then it's a problem of GIN, not opclass.

There is a lot of code duplication between the binary consistent function and the ternary one.  I thought it the duplication was necessary in order to support both 1.1 and 1.2 from the same code base.

 
For me current code duplication is OK.

There may also be some gains in the similarity and regex cases, but I didn't really analyze those for performance.

 

I've thought about how to document this change.  Looking to other example of other contrib modules with multiple versions, I decided that we don't document them, other than in the release notes.


The same patch applies to 9.4 code with a minor conflict in the Makefile, and gives benefits there as well.


Some other notes about the patch:
* You allocate boolcheck array and don't use it. 
* Check coding style and formatting, in particular "check[i]==GIN_TRUE" should be "check[i] == GIN_TRUE".
* I think some comments needed in gin_trgm_triconsistent() about trigramsMatchGraph(). gin_trgm_triconsistent() may use trigramsMatchGraph() that way because trigramsMatchGraph() implements monotonous boolean function.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: anole: assorted stability problems
Next
From: Andres Freund
Date:
Subject: Re: Refactor to split nodeAgg.c?