Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
Date
Msg-id 1405131377.9081.244.camel@jeff-desktop
Whole thread Raw
In response to Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
On Fri, 2014-07-11 at 11:51 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > Attached is a small patch to $SUBJECT.
> > In master, only single-byte characters are allowed as an escape. Of
> > course, with the patch it must still be a single character, but it may
> > be multi-byte.
>
> I'm concerned about the performance cost of this patch.  Have you done
> any measurements about what kind of overhead you are putting on the
> inner loop of similar_escape?

I didn't consider this very performance critical, because this is
looping through the pattern, which I wouldn't expect to be a long
string. On my machine using en_US.UTF-8, the difference is imperceptible
for a SIMILAR TO ... ESCAPE query.

I was able to see about a 2% increase in runtime when using the
similar_escape function directly. I made a 10M tuple table and did:

    explain analyze
      select
similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t;

which was the worst reasonable case I could think of. (It appears that
selecting from a table is faster than from generate_series. I'm curious
what you use when testing the performance of an individual function at
the SQL level.)

> At the very least, please don't call GetDatabaseEncoding() over again
> every single time through the inner loop.  More generally, why do we
> need to use pg_encoding_verifymb?  The input data is presumably validly
> encoded.  ISTM the significantly cheaper pg_mblen() would be more
> appropriate.

Thank you. Using the non-verifying variants reduces the penalty in the
above test to 1%.

If needed, we could optimize further through code specialization, as
like_escape() does. Though I think like_escape() is specialized just
because MatchText() is specialized; and MatchText is specialized because
it operates on the actual data, not just the pattern. So I don't see a
reason to specialize similar_escape().

Regards,
    Jeff Davis


Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: proposal: rounding up time value less than its unit.
Next
From: Noah Misch
Date:
Subject: Re: Crash on backend exit w/ OpenLDAP [2.4.24, 2.4.31]