Thread: regex match and special characters
Hi, Here is a simple SQL statement that gives different results on PostgreSQL 9.6 and PostgreSQL 10+. The space character atthe end of the string is actually U+2006 SIX-PER-EM SPACE (http://www.fileformat.info/info/unicode/char/2006/index.htm) test=# select 'abcd ' ~ 'abcd\s'; ?column? ---------- t (1 row) test=# select version(); version ------------------------------------------------------------------------------------------------- PostgreSQL 12devel on x86_64-pc-linux-gnu, compiled by gcc (Gentoo 6.4.0-r1 p1.3) 6.4.0, 64-bit (1 row) On another server (running on the same system on a different port) postgres=# select version(); version ----------------------------------------------------------------------------------------------- PostgreSQL 9.6.9 on x86_64-pc-linux-gnu, compiled by gcc (Gentoo 6.4.0-r1 p1.3) 6.4.0, 64-bit (1 row) postgres=# select 'abcd ' ~ 'abcd\s'; ?column? ---------- f (1 row) For both clusters, the client encoding is UTF8, the database encoding and collation is UTF8 and en_US.utf8 respectively,and the lc_ctype is en_US.utf8. I am accessing the databases running locally by ssh-ing first to the host. I observed similar issues with other Linux-based servers running Ubuntu, in all cases the regex resulted in true on PostgreSQL10+ and false on earlier versions (down to 9.3). The query comes from a table check that suddenly stopped acceptingrows valid in the older version during the migration. Making it select 'abcd ' ~ E'abcd\\s' doesn't modify theoutcome, unsurprisingly. Is it reproducible for others here as well? Given that it is, Is there a way to make both versions behave the same? Cheers, Alex
On 08/16/2018 03:59 AM, Alex Kliukin wrote: > Hi, > > Here is a simple SQL statement that gives different results on PostgreSQL 9.6 and PostgreSQL 10+. The space character atthe end of the string is actually U+2006 SIX-PER-EM SPACE (http://www.fileformat.info/info/unicode/char/2006/index.htm) > > test=# select 'abcd ' ~ 'abcd\s'; > ?column? > ---------- > t > (1 row) > > test=# select version(); > version > ------------------------------------------------------------------------------------------------- > PostgreSQL 12devel on x86_64-pc-linux-gnu, compiled by gcc (Gentoo 6.4.0-r1 p1.3) 6.4.0, 64-bit > (1 row) > > > On another server (running on the same system on a different port) > > postgres=# select version(); > version > ----------------------------------------------------------------------------------------------- > PostgreSQL 9.6.9 on x86_64-pc-linux-gnu, compiled by gcc (Gentoo 6.4.0-r1 p1.3) 6.4.0, 64-bit > (1 row) > > postgres=# select 'abcd ' ~ 'abcd\s'; > ?column? > ---------- > f > (1 row) > > For both clusters, the client encoding is UTF8, the database encoding and collation is UTF8 and en_US.utf8 respectively,and the lc_ctype is en_US.utf8. I am accessing the databases running locally by ssh-ing first to the host. > > I observed similar issues with other Linux-based servers running Ubuntu, in all cases the regex resulted in true on PostgreSQL10+ and false on earlier versions (down to 9.3). The query comes from a table check that suddenly stopped acceptingrows valid in the older version during the migration. Making it select 'abcd ' ~ E'abcd\\s' doesn't modify theoutcome, unsurprisingly. > > Is it reproducible for others here as well? Given that it is, Is there a way to make both versions behave the same? select version(); version ------------------------------------------------------------------------------------ PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (SUSE Linux) 4.8.5, 64-bit lc_collate | en_US.UTF-8 lc_ctype | en_US.UTF-8 test=# select 'abcd'||chr(2006) ~ E'abcd\s'; ?column? ---------- f (1 row) In your example you are working on Postgres devel. Have you tried it on Postgres 10 and/or 11? > > Cheers, > Alex > > -- Adrian Klaver adrian.klaver@aklaver.com
Alex Kliukin <alexk@hintbits.com> writes: > Here is a simple SQL statement that gives different results on PostgreSQL 9.6 and PostgreSQL 10+. The space character atthe end of the string is actually U+2006 SIX-PER-EM SPACE (http://www.fileformat.info/info/unicode/char/2006/index.htm) I think the reason for the discrepancy is that in v10 we fixed the regex locale support so that it could properly classify code points above U+7FF, cf https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c54159d44ceaba26ceda9fea1804f0de122a8f30 So 10 is giving the right answer (i.e. that \s matches U+2006). 9.x is not, but we're not going to back-patch such a large change. regards, tom lane
På torsdag 16. august 2018 kl. 15:16:52, skrev Adrian Klaver <adrian.klaver@aklaver.com>:
On 08/16/2018 03:59 AM, Alex Kliukin wrote:
> Hi,
>
> Here is a simple SQL statement that gives different results on PostgreSQL 9.6 and PostgreSQL 10+. The space character at the end of the string is actually U+2006 SIX-PER-EM SPACE (http://www.fileformat.info/info/unicode/char/2006/index.htm)
>
> test=# select 'abcd ' ~ 'abcd\s';
> ?column?
> ----------
> t
> (1 row)
>
> test=# select version();
> version
> -------------------------------------------------------------------------------------------------
> PostgreSQL 12devel on x86_64-pc-linux-gnu, compiled by gcc (Gentoo 6.4.0-r1 p1.3) 6.4.0, 64-bit
> (1 row)
>
>
> On another server (running on the same system on a different port)
>
> postgres=# select version();
> version
> -----------------------------------------------------------------------------------------------
> PostgreSQL 9.6.9 on x86_64-pc-linux-gnu, compiled by gcc (Gentoo 6.4.0-r1 p1.3) 6.4.0, 64-bit
> (1 row)
>
> postgres=# select 'abcd ' ~ 'abcd\s';
> ?column?
> ----------
> f
> (1 row)
>
> For both clusters, the client encoding is UTF8, the database encoding and collation is UTF8 and en_US.utf8 respectively, and the lc_ctype is en_US.utf8. I am accessing the databases running locally by ssh-ing first to the host.
>
> I observed similar issues with other Linux-based servers running Ubuntu, in all cases the regex resulted in true on PostgreSQL 10+ and false on earlier versions (down to 9.3). The query comes from a table check that suddenly stopped accepting rows valid in the older version during the migration. Making it select 'abcd ' ~ E'abcd\\s' doesn't modify the outcome, unsurprisingly.
>
> Is it reproducible for others here as well? Given that it is, Is there a way to make both versions behave the same?
select version();
version
------------------------------------------------------------------------------------
PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (SUSE Linux)
4.8.5, 64-bit
lc_collate | en_US.UTF-8
lc_ctype | en_US.UTF-8
test=# select 'abcd'||chr(2006) ~ E'abcd\s';
?column?
----------
f
(1 row)
In your example you are working on Postgres devel. Have you tried it on
Postgres 10 and/or 11?
char(2006) produces the wrong character as 2006 is the hex-value. You have to use 8198:
andreak@[local]:5433 10.4 andreak=# select version();
┌────────────────────────────────────────────────────────────────────────────────────────────────┐
│ version │
├────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0, 64-bit │
└────────────────────────────────────────────────────────────────────────────────────────────────┘
(1 row)
andreak@[local]:5433 10.4 andreak=# select 'abcd'||chr(8198) ~ 'abcd\s';
┌──────────┐
│ ?column? │
├──────────┤
│ t │
└──────────┘
(1 row)
┌────────────────────────────────────────────────────────────────────────────────────────────────┐
│ version │
├────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0, 64-bit │
└────────────────────────────────────────────────────────────────────────────────────────────────┘
(1 row)
andreak@[local]:5433 10.4 andreak=# select 'abcd'||chr(8198) ~ 'abcd\s';
┌──────────┐
│ ?column? │
├──────────┤
│ t │
└──────────┘
(1 row)
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
Attachment
On 08/16/2018 07:04 AM, Andreas Joseph Krogh wrote: > char(2006) produces the wrong character as 2006 is the hex-value. You > have to use 8198: > andreak@[local]:543310.4 andreak=# select version(); > ┌────────────────────────────────────────────────────────────────────────────────────────────────┐ > │ version > │ > ├────────────────────────────────────────────────────────────────────────────────────────────────┤ > │ PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu > 7.3.0-16ubuntu3) 7.3.0, 64-bit │ > └────────────────────────────────────────────────────────────────────────────────────────────────┘ > (1 row) > > andreak@[local]:543310.4 andreak=# select 'abcd'||chr(8198) ~ 'abcd\s'; > ┌──────────┐ > │ ?column? │ > ├──────────┤ > │ t │ > └──────────┘ > (1 row) Argh, read the wrong line. Thanks for the correction. Still: test=# select version(); version ------------------------------------------------------------------------------------ PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (SUSE Linux) 4.8.5, 64-bit (1 row) test=# select 'abcd'||chr(8198) ~ E'abcd\s'; ?column? ---------- f (1 row) > -- > *Andreas Joseph Krogh* > CTO / Partner - Visena AS > Mobile: +47 909 56 963 > andreas@visena.com <mailto:andreas@visena.com> > www.visena.com <https://www.visena.com> > <https://www.visena.com> -- Adrian Klaver adrian.klaver@aklaver.com
På torsdag 16. august 2018 kl. 16:32:40, skrev Adrian Klaver <adrian.klaver@aklaver.com>:
On 08/16/2018 07:04 AM, Andreas Joseph Krogh wrote:
> char(2006) produces the wrong character as 2006 is the hex-value. You
> have to use 8198:
> andreak@[local]:543310.4 andreak=# select version();
> ┌────────────────────────────────────────────────────────────────────────────────────────────────┐
> │ version
> │
> ├────────────────────────────────────────────────────────────────────────────────────────────────┤
> │ PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
> 7.3.0-16ubuntu3) 7.3.0, 64-bit │
> └────────────────────────────────────────────────────────────────────────────────────────────────┘
> (1 row)
>
> andreak@[local]:543310.4 andreak=# select 'abcd'||chr(8198) ~ 'abcd\s';
> ┌──────────┐
> │ ?column? │
> ├──────────┤
> │ t │
> └──────────┘
> (1 row)
Argh, read the wrong line. Thanks for the correction. Still:
test=# select version();
version
------------------------------------------------------------------------------------
PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (SUSE Linux)
4.8.5, 64-bit
(1 row)
test=# select 'abcd'||chr(8198) ~ E'abcd\s';
?column?
----------
f
(1 row)
When using E-syntax you need to double the backslash for escaping:
andreak@[local]:5433 10.4 andreak=# select 'abcd'||chr(8198) ~ E'abcd\\s';
┌──────────┐
│ ?column? │
├──────────┤
│ t │
└──────────┘
(1 row)
┌──────────┐
│ ?column? │
├──────────┤
│ t │
└──────────┘
(1 row)
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
Attachment
On 08/16/2018 07:47 AM, Andreas Joseph Krogh wrote: > På torsdag 16. august 2018 kl. 16:32:40, skrev Adrian Klaver > <adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>>: > > On 08/16/2018 07:04 AM, Andreas Joseph Krogh wrote: > > > char(2006) produces the wrong character as 2006 is the hex-value. You > > have to use 8198: > > andreak@[local]:543310.4 andreak=# select version(); > > > ┌────────────────────────────────────────────────────────────────────────────────────────────────┐ > > │ version > > │ > > > ├────────────────────────────────────────────────────────────────────────────────────────────────┤ > > │ PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu > > 7.3.0-16ubuntu3) 7.3.0, 64-bit │ > > > └────────────────────────────────────────────────────────────────────────────────────────────────┘ > > (1 row) > > > > andreak@[local]:543310.4 andreak=# select 'abcd'||chr(8198) ~ > 'abcd\s'; > > ┌──────────┐ > > │ ?column? │ > > ├──────────┤ > > │ t │ > > └──────────┘ > > (1 row) > > Argh, read the wrong line. Thanks for the correction. Still: > > test=# select version(); > version > > ------------------------------------------------------------------------------------ > PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (SUSE Linux) > 4.8.5, 64-bit > (1 row) > > > test=# select 'abcd'||chr(8198) ~ E'abcd\s'; > ?column? > ---------- > f > (1 row) > > When using E-syntax you need to double the backslash for escaping: > andreak@[local]:543310.4 andreak=# select 'abcd'||chr(8198) ~ E'abcd\\s'; > ┌──────────┐ > │ ?column? │ > ├──────────┤ > │ t │ > └──────────┘ > (1 row) Hmm: test=# show standard_conforming_strings; standard_conforming_strings ----------------------------- off (1 row) test=# show escape_string_warning; escape_string_warning ----------------------- on (1 row) test=# select 'abcd'||chr(8198) ~ 'abcd\s'; WARNING: nonstandard use of escape in a string literal LINE 1: select 'abcd'||chr(8198) ~ 'abcd\s'; ^ HINT: Use the escape string syntax for escapes, e.g., E'\r\n'. ?column? ---------- f (1 row) test=# select 'abcd'||chr(8198) ~ 'abcd\\s'; WARNING: nonstandard use of \\ in a string literal LINE 1: select 'abcd'||chr(8198) ~ 'abcd\\s'; ^ HINT: Use the escape string syntax for backslashes, e.g., E'\\'. ?column? ---------- t (1 row) test=# set escape_string_warning = 'off'; SET test=# show escape_string_warning; escape_string_warning ----------------------- off (1 row) test=# select 'abcd'||chr(8198) ~ 'abcd\s'; ?column? ---------- f (1 row) test=# set standard_conforming_strings = 'on'; SET test=# show standard_conforming_strings; standard_conforming_strings ----------------------------- on (1 row) test=# select 'abcd'||chr(8198) ~ 'abcd\s'; ?column? ---------- t (1 row) Wonder if the OP has standard_conforming_strings='off' and escape_string_warning='off'? > -- > *Andreas Joseph Krogh* > CTO / Partner - Visena AS > Mobile: +47 909 56 963 > andreas@visena.com <mailto:andreas@visena.com> > www.visena.com <https://www.visena.com> > <https://www.visena.com> -- Adrian Klaver adrian.klaver@aklaver.com
Andreas Joseph Krogh <andreas@visena.com> writes: > When using E-syntax you need to double the backslash for escaping: > andreak@[local]:5433 10.4 andreak=# select 'abcd'||chr(8198) ~ E'abcd\\s'; Another thing to keep in mind is that this is all locale-dependent (specifically, LC_CTYPE, I believe). In plain old C locale, nothing beyond the standard ASCII whitespace characters will match \s. I'm not sure how universal it is for other locales to treat characters like U+2006 as whitespace. regards, tom lane
On 08/16/2018 08:13 AM, Adrian Klaver wrote: > > > Wonder if the OP has standard_conforming_strings='off' and > escape_string_warning='off'? In the above referring to 9.6.9 instance. > > > >> -- >> *Andreas Joseph Krogh* >> CTO / Partner - Visena AS >> Mobile: +47 909 56 963 >> andreas@visena.com <mailto:andreas@visena.com> >> www.visena.com <https://www.visena.com> >> <https://www.visena.com> > > -- Adrian Klaver adrian.klaver@aklaver.com
On 08/16/2018 08:19 AM, Adrian Klaver wrote: > On 08/16/2018 08:13 AM, Adrian Klaver wrote: > >> >> >> Wonder if the OP has standard_conforming_strings='off' and >> escape_string_warning='off'? > > In the above referring to 9.6.9 instance. Well that theory is no good: test=# select version(); version ------------------------------------------------------------------------------------- PostgreSQL 9.6.7 on x86_64-pc-linux-gnu, compiled by gcc (SUSE Linux) 4.8.5, 64-bit (1 row) test=# set standard_conforming_strings='off'; SET test=# show standard_conforming_strings; standard_conforming_strings ----------------------------- off test=# show escape_string_warning; escape_string_warning ----------------------- off test=# select 'abcd'||chr(8198) ~ E'abcd\s'; ?column? ---------- f (1 row) test=# select 'abcd'||chr(8198) ~ E'abcd\\s'; ?column? ---------- f (1 row) test=# set escape_string_warning='on'; SET test=# set standard_conforming_strings='on'; SET test=# select 'abcd'||chr(8198) ~ 'abcd\s'; ?column? ---------- f (1 row) test=# select 'abcd'||chr(8198) ~ E'abcd\\s'; ?column? ---------- f (1 row) >> >> >> >>> -- >>> *Andreas Joseph Krogh* >>> CTO / Partner - Visena AS >>> Mobile: +47 909 56 963 >>> andreas@visena.com <mailto:andreas@visena.com> >>> www.visena.com <https://www.visena.com> >>> <https://www.visena.com> >> >> > > -- Adrian Klaver adrian.klaver@aklaver.com
On 16. Aug 2018, at 16:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:Alex Kliukin <alexk@hintbits.com> writes:Here is a simple SQL statement that gives different results on PostgreSQL 9.6 and PostgreSQL 10+. The space character at the end of the string is actually U+2006 SIX-PER-EM SPACE (http://www.fileformat.info/info/unicode/char/2006/index.htm)
I think the reason for the discrepancy is that in v10 we fixed the regex
locale support so that it could properly classify code points above U+7FF,
cf
https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c54159d44ceaba26ceda9fea1804f0de122a8f30
This nails down the cause, thanks a lot for the link! Apparently I missed it from PostgreSQL 10 release notes, where it is present in the “Queries” section, although AFAIK it deserved an entry in the "migration to version 10”, as it may potentially make dump/restore from previous versions to version 10 error out if there are table constraints that use regex classes over the Unicode text fields with code points above U+7FF.
So 10 is giving the right answer (i.e. that \s matches U+2006).
9.x is not
Agreed.
Cheers,
Alex
Hi Adrian,
On 16. Aug 2018, at 18:13, Adrian Klaver <adrian.klaver@aklaver.com> wrote:test=# select 'abcd'||chr(8198) ~ 'abcd\s';
?column?
----------
t
(1 row)
Wonder if the OP has standard_conforming_strings='off' and
escape_string_warning='off'?
Both are set to ‘on’ for me for all versions (I believe those are default settings). I have 12devel indeed on my test system alongside 9.6, but I’ve tried it as well on PostgreSQL 10 running on a different distro with different locale settings and it produced the same result (check being true).
I think Tom’s answer solves it, although I am wondering how did you get true in the statement quoted above on PostgreSQL 9.6, perhaps that result is actually from PostgreSQL 10?
Cheers,
Oleksii
On 08/18/2018 08:12 AM, Oleksii Kliukin wrote: > Hi Adrian, > >> On 16. Aug 2018, at 18:13, Adrian Klaver <adrian.klaver@aklaver.com >> <mailto:adrian.klaver@aklaver.com>> wrote: >> >> test=# select 'abcd'||chr(8198) ~ 'abcd\s'; >> ?column? >> ---------- >> t >> (1 row) >> >> >> Wonder if the OP has standard_conforming_strings='off' and >> escape_string_warning='off'? >> > > Both are set to ‘on’ for me for all versions (I believe those are > default settings). I have 12devel indeed on my test system alongside > 9.6, but I’ve tried it as well on PostgreSQL 10 running on a different > distro with different locale settings and it produced the same result > (check being true). > > I think Tom’s answer solves it, although I am wondering how did you get > true in the statement quoted above on PostgreSQL 9.6, perhaps that > result is actually from PostgreSQL 10? Yes it was on Postgres 10. I got reversed in my thinking. A subsequent post: https://www.postgresql.org/message-id/8b8d9250-1ee3-6aff-b580-592c7c759586%40aklaver.com shows the behavior you saw on 9.6. What happens when I work through my thought process on line:( > > Cheers, > Oleksii -- Adrian Klaver adrian.klaver@aklaver.com