Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
Date
Msg-id 53FB4B0F.9080502@vmware.com
Whole thread Raw
In response to Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
List pgsql-hackers
On 07/12/2014 05:16 AM, Jeff Davis wrote:
> On Fri, 2014-07-11 at 11:51 -0400, Tom Lane wrote:
>> Jeff Davis <pgsql@j-davis.com> writes:
>>> Attached is a small patch to $SUBJECT.
>>> In master, only single-byte characters are allowed as an escape. Of
>>> course, with the patch it must still be a single character, but it may
>>> be multi-byte.
>>
>> I'm concerned about the performance cost of this patch.  Have you done
>> any measurements about what kind of overhead you are putting on the
>> inner loop of similar_escape?
>
> I didn't consider this very performance critical, because this is
> looping through the pattern, which I wouldn't expect to be a long
> string. On my machine using en_US.UTF-8, the difference is imperceptible
> for a SIMILAR TO ... ESCAPE query.
>
> I was able to see about a 2% increase in runtime when using the
> similar_escape function directly. I made a 10M tuple table and did:
>
>      explain analyze
>        select
> similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t;
>
> which was the worst reasonable case I could think of.

Actually, that gets optimized to a constant in the planner:

postgres=# explain  verbose select
similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#')
from t;
                                         QUERY PLAN


--------------------------------------------------------------------------------
----------
  Seq Scan on public.t  (cost=0.00..144247.85 rows=9999985 width=0)
    Output:
'^(?:ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ
)$'::text
  Planning time: 0.033 ms
(3 rows)

With a working test case:

create table t (pattern text);
insert into t select
'ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ' from
generate_series(1, 1000000);
vacuum t;

explain (analyze) select similar_escape(pattern,'#') from t;

your patch seems to be about 2x-3x as slow as unpatched master. So this
needs some optimization. A couple of ideas:

1. If the escape string is in fact a single-byte character, you can
proceed with the loop just as it is today, without the pg_mblen calls.

2. Since pg_mblen() will always return an integer between 1-6, it would
probably be faster to replace the memcpy() and memcmp() calls with
simple for-loops iterating byte-by-byte.

In very brief testing, with the 1. change above, the performance with
this patch is back to what it's without the patch. See attached.

- Heikki


Attachment

pgsql-hackers by date:

Previous
From: Petr Jelinek
Date:
Subject: Re: Built-in binning functions
Next
From: Heikki Linnakangas
Date:
Subject: Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING