Re: citext LIKE search bug - Mailing list pgsql-bugs

From Morris de Oryx
Subject Re: citext LIKE search bug
Date
Msg-id CAKqnccj44OKMzf8hL0oGbUP04DiHj7iKtvG8VrkjHbTxk1ssFA@mail.gmail.com
Whole thread Raw
In response to Re: citext LIKE search bug  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-bugs
Thanks for the answer and clarification. Yes, I'm using trigram index now, and it's working well. I love n-grams for fuzzy matching, and the Postgres implementation is a thing of beauty. It is a bit funny in that you have to use LIKE to hint the planner to use the index on a = search.

I've been working with Postgres for a couple of years now, coming from another environment. In our case, Postgres is not a front-line system, it's a secondary system for aggregation, analysis and pushing things through to a visualization platform. Postgres is just awesome but, whenever you're shifting to a new context, it's easy to come in with bad priors. That's me and citext, all the way down.

Postgres has case-sensitive text searches, as you all know. I guess that becomes a bit more optional in PG 12 with case and/or diacritical-blind collations...but we're on PG 11 deployed on RDS, so we're unlikely to see PG 12 a lot before ~March 2020. We started on 9.5 or 9.6. What I was used to are case-blind searches with B-trees of up to the first 1024 bytes of character data (it's all in UTF-16) that work on = <> and starts_with% without any functions or castings. So, Bad Priors. In Postgres:

-- Searches on text are case-sensitive (I figured that out early.)

-- The planner only uses a function index if the query uses *exactly* the same query. This I found out *last week*.

It's that second bit that was the killer, and that makes me question the point of citext at all. For background, I dug through my notes and assembled a chronological list of days when users asked me for case-sensitive search in their database over the past 30+ years:

....

Exactly never. My goal was to make case-blind searches automatic and transparent...no special goo in the query:

select foo from bar where lower(text) = lower('the search string');

But, as it turns out, you lose the indexes unless you do exactly something like that with citext. Like if you use the clever (not my clever) idea of B-treeing a hash of the text:

select *
  from record_changes_log_detail
 where hashtext(old_value) = hashtext('Gold Kerrison Neuro') and -- The first clause uses the index to find records with matching hash codes.
       old_value = 'Gold Kerrison Neuro'; -- The second clause makes sure that the records are real matches.

That works perfectly for = searches, but there's a whole lot of complexity in the query syntax for someone used to this to do the same thing:

select *
  from record_changes_log_detail
 where old_value = 'Gold Kerrison Neuro';
 
Is there, or could there be, any way for the query planner to guess that an expression-based index should be searched other than recapitulating the entire expression in the query? For example, my sample data has about 8M rows with a citext field with old values (a sort of audit-log thing). The length of these values are typically very small. 6 records in the entire data set have long values. If I index the first 128 characters, I think that gives me the original value for 99.9999+% of the rows. So, something like this - or the same for text: 

DROP INDEX IF EXISTS record_changes_log_detail_old_value_ix_btree;
CREATE INDEX record_changes_log_detail_old_value_ix_btree
    ON record_changes_log_detail
    USING btree ((left(old_value,128)::citext) citext_pattern_ops);

For the index to be used today, the search needs to be

select * from record_changes_log_detail
where left(old_value,128)::citext = 'Gold Kerrison Neuro';

But here's the thing, Postgres _knows_ that the expression index is on the first 128 characters of the field. If the query value is within that value on an = search, then the B-tree of the substring could be used, and would return 100% accurate results.

Is that sort of query planner optimization conceivable in Postgres? I'd offer to help but, well, I'm absolutely unqualified. I knew a guy who gave up his PhD on mapping five-dimensional math into four-dimensional space. He said "I was smart enough to ask the question, but not even close to smart enough to answer it." That's me on my query planner idea/question.

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #16013: Unexpected results from bit field query
Next
From: Fahar Abbas
Date:
Subject: Re: Postgres 11.5.1 failed installation