Re: Supporting = operator in gin/gist_trgm_ops - Mailing list pgsql-hackers

From Georgios
Subject Re: Supporting = operator in gin/gist_trgm_ops
Date
Msg-id mFe1f5NtiSfa2VxFlEO2_vsvOIma9_v3ldlv0d5BSg7WSLQPnTd6kBpZPy4R3GMlsUYd7U_XcK4qfimk2NI8P0aM3wnJJSkxNY61wJ5FDsQ=@protonmail.com
Whole thread Raw
In response to Re: Supporting = operator in gin/gist_trgm_ops  (Julien Rouhaud <julien.rouhaud@free.fr>)
Responses Re: Supporting = operator in gin/gist_trgm_ops  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-hackers

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, November 13, 2020 10:50 AM, Julien Rouhaud <julien.rouhaud@free.fr> wrote:

> On Wed, Nov 11, 2020 at 8:34 PM Georgios Kokolatos
> gkokolatos@protonmail.com wrote:
>
> > The following review has been posted through the commitfest application:
> > make installcheck-world: tested, passed
> > Implements feature: tested, passed
> > Spec compliant: not tested
> > Documentation: not tested
> > Hi,
> > this patch implements a useful and missing feature. Thank you.
> > It includes documentation, which to a non-native speaker as myself seems appropriate.
> > It includes comprehensive tests that cover the implemented cases.
> > In the thread Alexander has pointed out, quote:
> > "It would be more efficient to generate trigrams for equal operator
> > using generate_trgm() instead of generate_wildcard_trgm()"
> > I will echo the sentiment, though from a slightly different and possibly not
> > as important point of view. The method used to extract trigrams from the query
> > should match the method used to extract trigrams from the values when they
> > get added to the index. This is gin_extract_value_trgm() and is indeed using
> > generate_trgm().
> > I have no opinion over Alexander's second comment regarding costing.
> > I change the status to 'Waiting on Author', but please feel free to override
> > my opinion if you feel I am wrong and reset it to 'Needs review'.
>
> Thanks for the reminder Georgios! Thanks a lot Alexander for the review!
>
> Indeed, I should have used generate_trgm() rather than
> generate_wildcard_trgm(). IIUC, the rest of the code should still be
> doing the same as [I]LikeStrategyNumber. I attach a v3 with that
> modification.

Great, thanks!

v3 looks good.

>
> For the costing, I tried this naive dataset:
>
> CREATE TABLE t1 AS select md5(random()::text) AS val from
> generate_series(1, 100000);
> CREATE INDEX t1_btree ON t1 (val);
> CREATE INDEX t1_gist ON t1 USING gist (val gist_trgm_ops);
>
> Cost are like this (all default configuration, using any random existing entry):
>
> EXPLAIN ANALYZE SELECT * FROM t1 where val =
>
> =============================================
>
> '8dcf324ce38428e4d27a363953ac1c51';
> QUERY PLAN
>
> -----------------------------------------------
>
> Index Only Scan using t1_btree on t1 (cost=0.42..4.44 rows=1
> width=33) (actual time=0.192..0.194 rows=1 loops=1)
> Index Cond: (val = '8dcf324ce38428e4d27a363953ac1c51'::text)
> Heap Fetches: 0
> Planning Time: 0.133 ms
> Execution Time: 0.222 ms
> (5 rows)
>
> EXPLAIN ANALYZE SELECT * FROM t1 where val =
>
> =============================================
>
> '8dcf324ce38428e4d27a363953ac1c51';
> QUERY PLAN
>
> -----------------------------------------------
>
> Index Scan using t1_gist on t1 (cost=0.28..8.30 rows=1 width=33)
> (actual time=0.542..2.359 rows=1 loops=1)
> Index Cond: (val = '8dcf324ce38428e4d27a363953ac1c51'::text)
> Planning Time: 0.189 ms
> Execution Time: 2.382 ms
> (4 rows)
>
> EXPLAIN ANALYZE SELECT * FROM t1 where val =
>
> =============================================
>
> '8dcf324ce38428e4d27a363953ac1c51';
> QUERY PLAN
>
> -----------------------------------------------
>
> Bitmap Heap Scan on t1 (cost=400.01..404.02 rows=1 width=33) (actual
> time=2.486..2.488 rows=1 loops=1)
> Recheck Cond: (val = '8dcf324ce38428e4d27a363953ac1c51'::text)
> Heap Blocks: exact=1
> -> Bitmap Index Scan on t1_gin (cost=0.00..400.01 rows=1 width=0)
> (actual time=2.474..2.474 rows=1 loops=1)
> Index Cond: (val = '8dcf324ce38428e4d27a363953ac1c51'::text)
> Planning Time: 0.206 ms
> Execution Time: 2.611 ms
>
> So assuming that this dataset is representative enough, costing indeed
> prefers a btree index over a gist/gin index, which should avoid
> regression with this change.

It sounds reasonable, although I would leave it to a bit more cost savvy
people to argue the point.

>
> Gin is however quite off, likely because it's a bitmap index scan
> rather than an index scan, so gist is preferred in this scenario.
> That's not ideal, but I'm not sure that there are many people having
> both gin_trgm_ops and gist_trgm_ops.

Same as above.

In short, I think v3 of the patch looks good to change to 'RFC' status.
Given the possible costing concerns, I will refrain from changing the
status just yet, to give the opportunity to more reviewers to chime in.
If in the next few days there are no more reviews, I will update the
status to 'RFC' to move the patch forward.

Thoughts?

Cheers,
//Georgios



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: WIP: WAL prefetch (another approach)
Next
From: Heikki Linnakangas
Date:
Subject: Re: [PATCH] Combine same ternary types in GIN and TSearch