Thread: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

From
Jeff Davis
Date:
Attached is a small patch to $SUBJECT.

In master, only single-byte characters are allowed as an escape. Of
course, with the patch it must still be a single character, but it may
be multi-byte.

Regards,
    Jeff Davis


Attachment

Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

From
Fujii Masao
Date:
On Fri, Jul 11, 2014 at 12:49 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> Attached is a small patch to $SUBJECT.
>
> In master, only single-byte characters are allowed as an escape. Of
> course, with the patch it must still be a single character, but it may
> be multi-byte.

+1

Probably you know that, multi-byte character is already allowed
as escape in LIKE. There seems no reason why we don't support
multi-byte escape character in SIMILAR TO and SUBSTRING.

Could you add the patch into next CF?

The patch doesn't contain the change of the document. But I think that
it's better to document what character is allowed as escape in LIKE,
SIMILAR TO and SUBSTRING.

Regards,

-- 
Fujii Masao



Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

From
Jeff Davis
Date:
On Fri, 2014-07-11 at 14:41 +0900, Fujii Masao wrote:
> Could you add the patch into next CF?

Sure. The patch is so small I was thinking about committing it in a few
days (assuming no complaints), but I'm in no hurry.

> The patch doesn't contain the change of the document. But I think that
> it's better to document what character is allowed as escape in LIKE,
> SIMILAR TO and SUBSTRING.

It should be assumed that multi-byte characters are allowed nearly
everywhere, and we should document the places where only single-byte
characters are allowed.

Regards,Jeff Davis





Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> Attached is a small patch to $SUBJECT.
> In master, only single-byte characters are allowed as an escape. Of
> course, with the patch it must still be a single character, but it may
> be multi-byte.

I'm concerned about the performance cost of this patch.  Have you done
any measurements about what kind of overhead you are putting on the
inner loop of similar_escape?

At the very least, please don't call GetDatabaseEncoding() over again
every single time through the inner loop.  More generally, why do we
need to use pg_encoding_verifymb?  The input data is presumably validly
encoded.  ISTM the significantly cheaper pg_mblen() would be more
appropriate.
        regards, tom lane



Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

From
Jeff Davis
Date:
On Fri, 2014-07-11 at 11:51 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > Attached is a small patch to $SUBJECT.
> > In master, only single-byte characters are allowed as an escape. Of
> > course, with the patch it must still be a single character, but it may
> > be multi-byte.
>
> I'm concerned about the performance cost of this patch.  Have you done
> any measurements about what kind of overhead you are putting on the
> inner loop of similar_escape?

I didn't consider this very performance critical, because this is
looping through the pattern, which I wouldn't expect to be a long
string. On my machine using en_US.UTF-8, the difference is imperceptible
for a SIMILAR TO ... ESCAPE query.

I was able to see about a 2% increase in runtime when using the
similar_escape function directly. I made a 10M tuple table and did:

    explain analyze
      select
similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t;

which was the worst reasonable case I could think of. (It appears that
selecting from a table is faster than from generate_series. I'm curious
what you use when testing the performance of an individual function at
the SQL level.)

> At the very least, please don't call GetDatabaseEncoding() over again
> every single time through the inner loop.  More generally, why do we
> need to use pg_encoding_verifymb?  The input data is presumably validly
> encoded.  ISTM the significantly cheaper pg_mblen() would be more
> appropriate.

Thank you. Using the non-verifying variants reduces the penalty in the
above test to 1%.

If needed, we could optimize further through code specialization, as
like_escape() does. Though I think like_escape() is specialized just
because MatchText() is specialized; and MatchText is specialized because
it operates on the actual data, not just the pattern. So I don't see a
reason to specialize similar_escape().

Regards,
    Jeff Davis


Attachment

Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

From
Heikki Linnakangas
Date:
On 07/12/2014 05:16 AM, Jeff Davis wrote:
> I was able to see about a 2% increase in runtime when using the
> similar_escape function directly. I made a 10M tuple table and did:
>
>      explain analyze
>        select
> similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t;
>
> which was the worst reasonable case I could think of. (It appears that
> selecting from a table is faster than from generate_series. I'm curious
> what you use when testing the performance of an individual function at
> the SQL level.)

A large table like that is what I usually do. A large generate_series() 
spends a lot of time building the tuplestore, especially if it doesn't 
fit in work_mem and spills to disk. Sometimes I use this to avoid it:

explain analyze      select
similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') 
from generate_series(1, 10000) a, generate_series(1,1000);

although in my experience it still has somewhat more overhead than a 
straight seqscan because.

- Heikki




Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 07/12/2014 05:16 AM, Jeff Davis wrote:
>> I was able to see about a 2% increase in runtime when using the
>> similar_escape function directly. I made a 10M tuple table and did:
>> 
>> explain analyze
>> select
>> similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t;
>> 
>> which was the worst reasonable case I could think of. (It appears that
>> selecting from a table is faster than from generate_series. I'm curious
>> what you use when testing the performance of an individual function at
>> the SQL level.)

> A large table like that is what I usually do. A large generate_series() 
> spends a lot of time building the tuplestore, especially if it doesn't 
> fit in work_mem and spills to disk. Sometimes I use this to avoid it:

> explain analyze
>        select
> similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') 
> from generate_series(1, 10000) a, generate_series(1,1000);

> although in my experience it still has somewhat more overhead than a 
> straight seqscan because.

[ scratches head... ]  Surely similar_escape is marked immutable, and
will therefore be executed exactly once in either of these formulations,
because the planner will fold the expression to a constant.
        regards, tom lane



Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

From
Heikki Linnakangas
Date:
On 07/12/2014 05:16 AM, Jeff Davis wrote:
> On Fri, 2014-07-11 at 11:51 -0400, Tom Lane wrote:
>> Jeff Davis <pgsql@j-davis.com> writes:
>>> Attached is a small patch to $SUBJECT.
>>> In master, only single-byte characters are allowed as an escape. Of
>>> course, with the patch it must still be a single character, but it may
>>> be multi-byte.
>>
>> I'm concerned about the performance cost of this patch.  Have you done
>> any measurements about what kind of overhead you are putting on the
>> inner loop of similar_escape?
>
> I didn't consider this very performance critical, because this is
> looping through the pattern, which I wouldn't expect to be a long
> string. On my machine using en_US.UTF-8, the difference is imperceptible
> for a SIMILAR TO ... ESCAPE query.
>
> I was able to see about a 2% increase in runtime when using the
> similar_escape function directly. I made a 10M tuple table and did:
>
>      explain analyze
>        select
> similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t;
>
> which was the worst reasonable case I could think of.

Actually, that gets optimized to a constant in the planner:

postgres=# explain  verbose select
similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#')
from t;
                                         QUERY PLAN


--------------------------------------------------------------------------------
----------
  Seq Scan on public.t  (cost=0.00..144247.85 rows=9999985 width=0)
    Output:
'^(?:ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ
)$'::text
  Planning time: 0.033 ms
(3 rows)

With a working test case:

create table t (pattern text);
insert into t select
'ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ' from
generate_series(1, 1000000);
vacuum t;

explain (analyze) select similar_escape(pattern,'#') from t;

your patch seems to be about 2x-3x as slow as unpatched master. So this
needs some optimization. A couple of ideas:

1. If the escape string is in fact a single-byte character, you can
proceed with the loop just as it is today, without the pg_mblen calls.

2. Since pg_mblen() will always return an integer between 1-6, it would
probably be faster to replace the memcpy() and memcmp() calls with
simple for-loops iterating byte-by-byte.

In very brief testing, with the 1. change above, the performance with
this patch is back to what it's without the patch. See attached.

- Heikki


Attachment

Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

From
Heikki Linnakangas
Date:
On 08/25/2014 04:48 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> On 07/12/2014 05:16 AM, Jeff Davis wrote:
>>> I was able to see about a 2% increase in runtime when using the
>>> similar_escape function directly. I made a 10M tuple table and did:
>>>
>>> explain analyze
>>> select
>>>
similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#')
fromt;
 
>>>
>>> which was the worst reasonable case I could think of. (It appears that
>>> selecting from a table is faster than from generate_series. I'm curious
>>> what you use when testing the performance of an individual function at
>>> the SQL level.)
>
>> A large table like that is what I usually do. A large generate_series()
>> spends a lot of time building the tuplestore, especially if it doesn't
>> fit in work_mem and spills to disk. Sometimes I use this to avoid it:
>
>> explain analyze
>>         select
>>
similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#')
>> from generate_series(1, 10000) a, generate_series(1,1000);
>
>> although in my experience it still has somewhat more overhead than a
>> straight seqscan because.
>
> [ scratches head... ]  Surely similar_escape is marked immutable, and
> will therefore be executed exactly once in either of these formulations,
> because the planner will fold the expression to a constant.

Yeah, just noticed that myself..

- Heikki




Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 08/25/2014 04:48 PM, Tom Lane wrote:
>> [ scratches head... ]  Surely similar_escape is marked immutable, and
>> will therefore be executed exactly once in either of these formulations,
>> because the planner will fold the expression to a constant.

> Yeah, just noticed that myself..

... although, given that, it is also fair to wonder how much the speed of
similar_escape really matters.  Surely in most use-cases the pattern and
escape char will be constants.  And, when they are not, wouldn't the
subsequent parsing work for the regexes dominate the cost of
similar_escape anyway?

IOW, I'm not sure we should be advising Jeff to duplicate code in
order to have a fast path.  Keeping it short might be the best goal.
        regards, tom lane



Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

From
Heikki Linnakangas
Date:
On 08/25/2014 06:14 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> On 08/25/2014 04:48 PM, Tom Lane wrote:
>>> [ scratches head... ]  Surely similar_escape is marked immutable, and
>>> will therefore be executed exactly once in either of these formulations,
>>> because the planner will fold the expression to a constant.
>
>> Yeah, just noticed that myself..
>
> ... although, given that, it is also fair to wonder how much the speed of
> similar_escape really matters.  Surely in most use-cases the pattern and
> escape char will be constants.  And, when they are not, wouldn't the
> subsequent parsing work for the regexes dominate the cost of
> similar_escape anyway?
>
> IOW, I'm not sure we should be advising Jeff to duplicate code in
> order to have a fast path.  Keeping it short might be the best goal.

It's certainly not worth bending over backwards for a small performance 
gain here, but I think special-casing a single-byte escape sequence is 
still quite reasonable. It doesn't make the code any longer.

- Heikki




Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

From
Jeff Davis
Date:
On Mon, 2014-08-25 at 17:41 +0300, Heikki Linnakangas wrote:
> Actually, that gets optimized to a constant in the planner:

Oops, thank you (and Tom).

> your patch seems to be about 2x-3x as slow as unpatched master. So this
> needs some optimization. A couple of ideas:

I didn't see anywhere near that kind of regression. On unpatched master,
with your test case, I saw it stabilize to about 680ms. With
similar-escape-1, I saw about 775ms (15% regression). Are those at all
close to your numbers? Is there a chance you used an unoptimized build
for one of them, or left asserts enabled?

> 1. If the escape string is in fact a single-byte character, you can
> proceed with the loop just as it is today, without the pg_mblen calls.
>
> 2. Since pg_mblen() will always return an integer between 1-6, it would
> probably be faster to replace the memcpy() and memcmp() calls with
> simple for-loops iterating byte-by-byte.
>
> In very brief testing, with the 1. change above, the performance with
> this patch is back to what it's without the patch. See attached.

The particular patch has a mistake: the first branch is always taken
because pg_mblen() won't return 0. It's also fairly ugly to set mblen in
the test for the branch that doesn't use it.

Attached a patch implementing the same idea though: only use the
multibyte path if *both* the escape char and the current character from
the pattern are multibyte.

I also changed the comment to more clearly state the behavior upon which
we're relying. I hope what I said is accurate.

Regards,
    Jeff Davis


Attachment

Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

From
Jeff Davis
Date:
On Tue, 2014-08-26 at 22:13 -0700, Jeff Davis wrote:
> Attached a patch implementing the same idea though: only use the
> multibyte path if *both* the escape char and the current character from
> the pattern are multibyte.

Forgot to mention: with this patch, the test completes in about 720ms,
so just between unpatched and the v1 patch.

I feel like we're spending too much time optimizing this patch, but we
got this far, and it doesn't really complicate the code, so we might as
well finish it.

Regards,Jeff Davis





Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

From
Heikki Linnakangas
Date:
On 08/27/2014 08:13 AM, Jeff Davis wrote:
> On Mon, 2014-08-25 at 17:41 +0300, Heikki Linnakangas wrote:
>> your patch seems to be about 2x-3x as slow as unpatched master. So this
>> needs some optimization. A couple of ideas:
>
> I didn't see anywhere near that kind of regression. On unpatched master,
> with your test case, I saw it stabilize to about 680ms. With
> similar-escape-1, I saw about 775ms (15% regression). Are those at all
> close to your numbers? Is there a chance you used an unoptimized build
> for one of them, or left asserts enabled?

Oh. I can't now reproduce my earlier results either, I must've messed up 
something. I'm now seeing similar numbers as you.

> Attached a patch implementing the same idea though: only use the
> multibyte path if *both* the escape char and the current character from
> the pattern are multibyte.
>
> I also changed the comment to more clearly state the behavior upon which
> we're relying. I hope what I said is accurate.

s/the the/the/. Other than that, looks good to me.

- Heikki