Thread: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
Attached is a small patch to $SUBJECT.

In master, only single-byte characters are allowed as an escape. Of course,
with the patch it must still be a single character, but it may be multi-byte.

Regards,
	Jeff Davis
Attachment
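To illustrate what the patch enables, here is a hypothetical example on a UTF-8 database with the patch applied (the table and data are made up for illustration; unpatched master rejects the multi-byte ESCAPE character with an error):

-- '§' is a two-byte character in UTF-8 and is used as the escape here.
create table docs (body text);
insert into docs values ('100% organic'), ('100 percent organic');

-- SIMILAR TO with a multi-byte escape: returns only the row that contains
-- a literal '%' after "100".
select body from docs where body similar to '100§% %' escape '§';

-- The same escape character in the regex form of substring():
-- §" ... §" delimits the part to return, and §% escapes the literal '%'.
select substring('100% organic' from '§"100§%§"%' for '§');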
On Fri, Jul 11, 2014 at 12:49 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> Attached is a small patch to $SUBJECT.
>
> In master, only single-byte characters are allowed as an escape. Of
> course, with the patch it must still be a single character, but it may
> be multi-byte.

+1

You probably know this already, but a multi-byte character is already allowed
as the escape in LIKE. There seems to be no reason not to support a multi-byte
escape character in SIMILAR TO and SUBSTRING.

Could you add the patch to the next CF?

The patch doesn't contain any documentation changes, but I think it would be
better to document which characters are allowed as the escape in LIKE,
SIMILAR TO and SUBSTRING.

Regards,

--
Fujii Masao
On Fri, 2014-07-11 at 14:41 +0900, Fujii Masao wrote:
> Could you add the patch to the next CF?

Sure. The patch is so small I was thinking about committing it in a few days
(assuming no complaints), but I'm in no hurry.

> The patch doesn't contain any documentation changes, but I think it would be
> better to document which characters are allowed as the escape in LIKE,
> SIMILAR TO and SUBSTRING.

It should be assumed that multi-byte characters are allowed nearly everywhere,
and we should document the places where only single-byte characters are
allowed.

Regards,
	Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> Attached is a small patch to $SUBJECT.
> In master, only single-byte characters are allowed as an escape. Of
> course, with the patch it must still be a single character, but it may
> be multi-byte.

I'm concerned about the performance cost of this patch. Have you done any
measurements about what kind of overhead you are putting on the inner loop
of similar_escape?

At the very least, please don't call GetDatabaseEncoding() over again every
single time through the inner loop. More generally, why do we need to use
pg_encoding_verifymb? The input data is presumably validly encoded. ISTM the
significantly cheaper pg_mblen() would be more appropriate.

			regards, tom lane
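For reference, the shape of loop Tom is describing looks roughly like the sketch below (illustrative names, not the code in the patch): walk the pattern one character at a time with pg_mblen(), which trusts that the input is already validly encoded, instead of re-verifying every character.

#include "postgres.h"
#include "mb/pg_wchar.h"        /* pg_mblen() */

/*
 * Sketch only: advance through an already-validated pattern one whole
 * character at a time.  There is no GetDatabaseEncoding() or
 * pg_encoding_verifymb() call inside the loop.
 */
static void
walk_pattern(const char *p, int plen)
{
	while (plen > 0)
	{
		int		clen = pg_mblen(p);	/* byte length of the current character */

		/* ... compare the clen bytes at p against the escape character ... */

		p += clen;
		plen -= clen;
	}
}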
On Fri, 2014-07-11 at 11:51 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > Attached is a small patch to $SUBJECT.
> > In master, only single-byte characters are allowed as an escape. Of
> > course, with the patch it must still be a single character, but it may
> > be multi-byte.
>
> I'm concerned about the performance cost of this patch. Have you done
> any measurements about what kind of overhead you are putting on the
> inner loop of similar_escape?

I didn't consider this very performance critical, because this is looping
through the pattern, which I wouldn't expect to be a long string. On my
machine using en_US.UTF-8, the difference is imperceptible for a
SIMILAR TO ... ESCAPE query.

I was able to see about a 2% increase in runtime when using the
similar_escape function directly. I made a 10M tuple table and did:

explain analyze
select
similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t;

which was the worst reasonable case I could think of. (It appears that
selecting from a table is faster than from generate_series. I'm curious what
you use when testing the performance of an individual function at the SQL
level.)

> At the very least, please don't call GetDatabaseEncoding() over again
> every single time through the inner loop. More generally, why do we
> need to use pg_encoding_verifymb? The input data is presumably validly
> encoded. ISTM the significantly cheaper pg_mblen() would be more
> appropriate.

Thank you. Using the non-verifying variants reduces the penalty in the above
test to 1%.

If needed, we could optimize further through code specialization, as
like_escape() does. Though I think like_escape() is specialized just because
MatchText() is specialized; and MatchText() is specialized because it operates
on the actual data, not just the pattern. So I don't see a reason to
specialize similar_escape().

Regards,
	Jeff Davis
Attachment
On 07/12/2014 05:16 AM, Jeff Davis wrote:
> I was able to see about a 2% increase in runtime when using the
> similar_escape function directly. I made a 10M tuple table and did:
>
> explain analyze
> select
> similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t;
>
> which was the worst reasonable case I could think of. (It appears that
> selecting from a table is faster than from generate_series. I'm curious
> what you use when testing the performance of an individual function at
> the SQL level.)

A large table like that is what I usually do. A large generate_series()
spends a lot of time building the tuplestore, especially if it doesn't fit in
work_mem and spills to disk. Sometimes I use this to avoid it:

explain analyze
select
similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#')
from generate_series(1, 10000) a, generate_series(1,1000);

although in my experience it still has somewhat more overhead than a straight
seqscan.

- Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 07/12/2014 05:16 AM, Jeff Davis wrote:
>> I was able to see about a 2% increase in runtime when using the
>> similar_escape function directly. I made a 10M tuple table and did:
>>
>> explain analyze
>> select
>> similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t;
>>
>> which was the worst reasonable case I could think of. (It appears that
>> selecting from a table is faster than from generate_series. I'm curious
>> what you use when testing the performance of an individual function at
>> the SQL level.)

> A large table like that is what I usually do. A large generate_series()
> spends a lot of time building the tuplestore, especially if it doesn't
> fit in work_mem and spills to disk. Sometimes I use this to avoid it:

> explain analyze
> select
> similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#')
> from generate_series(1, 10000) a, generate_series(1,1000);

> although in my experience it still has somewhat more overhead than a
> straight seqscan.

[ scratches head... ] Surely similar_escape is marked immutable, and will
therefore be executed exactly once in either of these formulations, because
the planner will fold the expression to a constant.

			regards, tom lane
On 07/12/2014 05:16 AM, Jeff Davis wrote:
> On Fri, 2014-07-11 at 11:51 -0400, Tom Lane wrote:
>> Jeff Davis <pgsql@j-davis.com> writes:
>>> Attached is a small patch to $SUBJECT.
>>> In master, only single-byte characters are allowed as an escape. Of
>>> course, with the patch it must still be a single character, but it may
>>> be multi-byte.
>>
>> I'm concerned about the performance cost of this patch. Have you done
>> any measurements about what kind of overhead you are putting on the
>> inner loop of similar_escape?
>
> I didn't consider this very performance critical, because this is
> looping through the pattern, which I wouldn't expect to be a long
> string. On my machine using en_US.UTF-8, the difference is imperceptible
> for a SIMILAR TO ... ESCAPE query.
>
> I was able to see about a 2% increase in runtime when using the
> similar_escape function directly. I made a 10M tuple table and did:
>
> explain analyze
> select
> similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t;
>
> which was the worst reasonable case I could think of.

Actually, that gets optimized to a constant in the planner:

postgres=# explain verbose select similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t;
                                  QUERY PLAN
------------------------------------------------------------------------------------------
 Seq Scan on public.t  (cost=0.00..144247.85 rows=9999985 width=0)
   Output: '^(?:ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ)$'::text
 Planning time: 0.033 ms
(3 rows)

With a working test case:

create table t (pattern text);
insert into t select 'ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ'
from generate_series(1, 1000000);
vacuum t;
explain (analyze) select similar_escape(pattern,'#') from t;

your patch seems to be about 2x-3x as slow as unpatched master. So this needs
some optimization. A couple of ideas:

1. If the escape string is in fact a single-byte character, you can proceed
with the loop just as it is today, without the pg_mblen calls.

2. Since pg_mblen() will always return an integer between 1-6, it would
probably be faster to replace the memcpy() and memcmp() calls with simple
for-loops iterating byte-by-byte.

In very brief testing, with change 1 above, the performance with this patch
is back to what it is without the patch. See attached.

- Heikki
Attachment
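For readers not looking at the attachment, idea 1 amounts to something like the following sketch (illustrative names, not the attached patch): p is the current position in the pattern, e holds the escape character's bytes, and elen is its length in bytes.

#include "postgres.h"
#include "mb/pg_wchar.h"        /* pg_mblen() */

/*
 * Sketch of idea 1: keep the pre-patch single-byte comparison as the common
 * case, and call pg_mblen() only when the escape character itself is
 * multi-byte.
 */
static bool
is_escape_char(const char *p, const char *e, int elen)
{
	if (elen == 1)
		return *p == *e;		/* unchanged single-byte path */

	/* multi-byte escape: compare whole characters */
	return pg_mblen(p) == elen && memcmp(p, e, elen) == 0;
}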
On 08/25/2014 04:48 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> On 07/12/2014 05:16 AM, Jeff Davis wrote:
>>> I was able to see about a 2% increase in runtime when using the
>>> similar_escape function directly. I made a 10M tuple table and did:
>>>
>>> explain analyze
>>> select
>>> similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t;
>>>
>>> which was the worst reasonable case I could think of. (It appears that
>>> selecting from a table is faster than from generate_series. I'm curious
>>> what you use when testing the performance of an individual function at
>>> the SQL level.)
>
>> A large table like that is what I usually do. A large generate_series()
>> spends a lot of time building the tuplestore, especially if it doesn't
>> fit in work_mem and spills to disk. Sometimes I use this to avoid it:
>
>> explain analyze
>> select
>> similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#')
>> from generate_series(1, 10000) a, generate_series(1,1000);
>
>> although in my experience it still has somewhat more overhead than a
>> straight seqscan.
>
> [ scratches head... ] Surely similar_escape is marked immutable, and
> will therefore be executed exactly once in either of these formulations,
> because the planner will fold the expression to a constant.

Yeah, just noticed that myself..

- Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 08/25/2014 04:48 PM, Tom Lane wrote:
>> [ scratches head... ] Surely similar_escape is marked immutable, and
>> will therefore be executed exactly once in either of these formulations,
>> because the planner will fold the expression to a constant.

> Yeah, just noticed that myself..

... although, given that, it is also fair to wonder how much the speed of
similar_escape really matters. Surely in most use-cases the pattern and
escape char will be constants. And, when they are not, wouldn't the
subsequent parsing work for the regexes dominate the cost of similar_escape
anyway?

IOW, I'm not sure we should be advising Jeff to duplicate code in order to
have a fast path. Keeping it short might be the best goal.

			regards, tom lane
On 08/25/2014 06:14 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> On 08/25/2014 04:48 PM, Tom Lane wrote:
>>> [ scratches head... ] Surely similar_escape is marked immutable, and
>>> will therefore be executed exactly once in either of these formulations,
>>> because the planner will fold the expression to a constant.
>
>> Yeah, just noticed that myself..
>
> ... although, given that, it is also fair to wonder how much the speed of
> similar_escape really matters. Surely in most use-cases the pattern and
> escape char will be constants. And, when they are not, wouldn't the
> subsequent parsing work for the regexes dominate the cost of
> similar_escape anyway?
>
> IOW, I'm not sure we should be advising Jeff to duplicate code in
> order to have a fast path. Keeping it short might be the best goal.

It's certainly not worth bending over backwards for a small performance gain
here, but I think special-casing a single-byte escape sequence is still quite
reasonable. It doesn't make the code any longer.

- Heikki
On Mon, 2014-08-25 at 17:41 +0300, Heikki Linnakangas wrote:
> Actually, that gets optimized to a constant in the planner:

Oops, thank you (and Tom).

> your patch seems to be about 2x-3x as slow as unpatched master. So this
> needs some optimization. A couple of ideas:

I didn't see anywhere near that kind of regression. On unpatched master, with
your test case, I saw it stabilize to about 680ms. With similar-escape-1, I
saw about 775ms (a 15% regression). Are those at all close to your numbers?
Is there a chance you used an unoptimized build for one of them, or left
asserts enabled?

> 1. If the escape string is in fact a single-byte character, you can
> proceed with the loop just as it is today, without the pg_mblen calls.
>
> 2. Since pg_mblen() will always return an integer between 1-6, it would
> probably be faster to replace the memcpy() and memcmp() calls with
> simple for-loops iterating byte-by-byte.
>
> In very brief testing, with change 1 above, the performance with this
> patch is back to what it is without the patch. See attached.

That particular patch has a mistake: the first branch is always taken, because
pg_mblen() won't return 0. It's also fairly ugly to set mblen in the test for
the branch that doesn't use it.

Attached is a patch implementing the same idea, though: only use the
multi-byte path if *both* the escape character and the current character from
the pattern are multi-byte.

I also changed the comment to more clearly state the behavior upon which we're
relying. I hope what I said is accurate.

Regards,
	Jeff Davis
Attachment
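In outline, the check being described works roughly like the sketch below (illustrative names, not the attached patch). It relies on the property that, in PostgreSQL's multi-byte server encodings, every byte of a multi-byte character has its high bit set, so a one-byte comparison is safe whenever either character is single-byte.

#include "postgres.h"           /* IS_HIGHBIT_SET() */
#include "mb/pg_wchar.h"        /* pg_mblen() */

/*
 * Sketch of the revised check: compare whole characters only when both the
 * escape character and the current pattern character are multi-byte.
 */
static bool
is_escape_char(const char *p, const char *e, int elen)
{
	if (elen > 1 && IS_HIGHBIT_SET(*p))
		return pg_mblen(p) == elen && memcmp(p, e, elen) == 0;

	/*
	 * Otherwise one byte is enough: a multi-byte escape can never equal a
	 * single-byte pattern character, and vice versa.
	 */
	return elen == 1 && *p == *e;
}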
On Tue, 2014-08-26 at 22:13 -0700, Jeff Davis wrote:
> Attached is a patch implementing the same idea, though: only use the
> multi-byte path if *both* the escape character and the current character
> from the pattern are multi-byte.

Forgot to mention: with this patch, the test completes in about 720ms, so just
between unpatched and the v1 patch.

I feel like we're spending too much time optimizing this patch, but we got
this far, and it doesn't really complicate the code, so we might as well
finish it.

Regards,
	Jeff Davis
On 08/27/2014 08:13 AM, Jeff Davis wrote:
> On Mon, 2014-08-25 at 17:41 +0300, Heikki Linnakangas wrote:
>> your patch seems to be about 2x-3x as slow as unpatched master. So this
>> needs some optimization. A couple of ideas:
>
> I didn't see anywhere near that kind of regression. On unpatched master,
> with your test case, I saw it stabilize to about 680ms. With
> similar-escape-1, I saw about 775ms (a 15% regression). Are those at all
> close to your numbers? Is there a chance you used an unoptimized build
> for one of them, or left asserts enabled?

Oh. I can't now reproduce my earlier results either; I must've messed up
something. I'm now seeing similar numbers as you.

> Attached is a patch implementing the same idea, though: only use the
> multi-byte path if *both* the escape character and the current character
> from the pattern are multi-byte.
>
> I also changed the comment to more clearly state the behavior upon which
> we're relying. I hope what I said is accurate.

s/the the/the/. Other than that, looks good to me.

- Heikki