Thread: regex match and special characters

regex match and special characters

From
Alex Kliukin
Date:
Hi,

Here is a simple SQL statement that gives different results on PostgreSQL 9.6 and PostgreSQL 10+. The space character
atthe end of the string is actually U+2006 SIX-PER-EM SPACE
(http://www.fileformat.info/info/unicode/char/2006/index.htm)

test=# select 'abcd ' ~ 'abcd\s';
 ?column?
----------
 t
(1 row)

test=# select version();
                                             version
-------------------------------------------------------------------------------------------------
 PostgreSQL 12devel on x86_64-pc-linux-gnu, compiled by gcc (Gentoo 6.4.0-r1 p1.3) 6.4.0, 64-bit
(1 row)


On another server (running on the same system on a different port)

postgres=# select version();
                                            version
-----------------------------------------------------------------------------------------------
 PostgreSQL 9.6.9 on x86_64-pc-linux-gnu, compiled by gcc (Gentoo 6.4.0-r1 p1.3) 6.4.0, 64-bit
(1 row)

postgres=# select 'abcd ' ~ 'abcd\s';
 ?column?
----------
 f
(1 row)

For both clusters, the client encoding is UTF8, the database encoding and collation is UTF8 and en_US.utf8
respectively,and the lc_ctype is en_US.utf8. I am accessing the databases running locally by ssh-ing first to the host.


I observed similar issues with other Linux-based servers running Ubuntu, in all cases the regex resulted in true on
PostgreSQL10+ and false on earlier versions (down to 9.3). The query comes from a table check that suddenly stopped
acceptingrows valid in the older version during the migration. Making it  select 'abcd ' ~ E'abcd\\s' doesn't  modify
theoutcome, unsurprisingly. 

Is it reproducible for others here as well? Given that it is, Is there a way to make both versions behave the same?

Cheers,
Alex


Re: regex match and special characters

From
Adrian Klaver
Date:
On 08/16/2018 03:59 AM, Alex Kliukin wrote:
> Hi,
> 
> Here is a simple SQL statement that gives different results on PostgreSQL 9.6 and PostgreSQL 10+. The space character
atthe end of the string is actually U+2006 SIX-PER-EM SPACE
(http://www.fileformat.info/info/unicode/char/2006/index.htm)
> 
> test=# select 'abcd ' ~ 'abcd\s';
>   ?column?
> ----------
>   t
> (1 row)
> 
> test=# select version();
>                                               version
> -------------------------------------------------------------------------------------------------
>   PostgreSQL 12devel on x86_64-pc-linux-gnu, compiled by gcc (Gentoo 6.4.0-r1 p1.3) 6.4.0, 64-bit
> (1 row)
> 
> 
> On another server (running on the same system on a different port)
> 
> postgres=# select version();
>                                              version
> -----------------------------------------------------------------------------------------------
>   PostgreSQL 9.6.9 on x86_64-pc-linux-gnu, compiled by gcc (Gentoo 6.4.0-r1 p1.3) 6.4.0, 64-bit
> (1 row)
> 
> postgres=# select 'abcd ' ~ 'abcd\s';
>   ?column?
> ----------
>   f
> (1 row)
> 
> For both clusters, the client encoding is UTF8, the database encoding and collation is UTF8 and en_US.utf8
respectively,and the lc_ctype is en_US.utf8. I am accessing the databases running locally by ssh-ing first to the
host.
> 
> I observed similar issues with other Linux-based servers running Ubuntu, in all cases the regex resulted in true on
PostgreSQL10+ and false on earlier versions (down to 9.3). The query comes from a table check that suddenly stopped
acceptingrows valid in the older version during the migration. Making it  select 'abcd ' ~ E'abcd\\s' doesn't  modify
theoutcome, unsurprisingly.
 
> 
> Is it reproducible for others here as well? Given that it is, Is there a way to make both versions behave the same?

select version();
                                       version 

------------------------------------------------------------------------------------
  PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (SUSE Linux) 
4.8.5, 64-bit


lc_collate                          | en_US.UTF-8 

lc_ctype                            | en_US.UTF-8


test=# select 'abcd'||chr(2006) ~ E'abcd\s';
  ?column?
----------
  f
(1 row)

In your example you are working on Postgres devel. Have you tried it on 
Postgres 10 and/or 11?

> 
> Cheers,
> Alex
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: regex match and special characters

From
Tom Lane
Date:
Alex Kliukin <alexk@hintbits.com> writes:
> Here is a simple SQL statement that gives different results on PostgreSQL 9.6 and PostgreSQL 10+. The space character
atthe end of the string is actually U+2006 SIX-PER-EM SPACE
(http://www.fileformat.info/info/unicode/char/2006/index.htm)

I think the reason for the discrepancy is that in v10 we fixed the regex
locale support so that it could properly classify code points above U+7FF,
cf

https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c54159d44ceaba26ceda9fea1804f0de122a8f30

So 10 is giving the right answer (i.e. that \s matches U+2006).
9.x is not, but we're not going to back-patch such a large change.

            regards, tom lane


Sv: Re: regex match and special characters

From
Andreas Joseph Krogh
Date:
På torsdag 16. august 2018 kl. 15:16:52, skrev Adrian Klaver <adrian.klaver@aklaver.com>:
On 08/16/2018 03:59 AM, Alex Kliukin wrote:
> Hi,
>
> Here is a simple SQL statement that gives different results on PostgreSQL 9.6 and PostgreSQL 10+. The space character at the end of the string is actually U+2006 SIX-PER-EM SPACE (http://www.fileformat.info/info/unicode/char/2006/index.htm)
>
> test=# select 'abcd ' ~ 'abcd\s';
>   ?column?
> ----------
>   t
> (1 row)
>
> test=# select version();
>                                               version
> -------------------------------------------------------------------------------------------------
>   PostgreSQL 12devel on x86_64-pc-linux-gnu, compiled by gcc (Gentoo 6.4.0-r1 p1.3) 6.4.0, 64-bit
> (1 row)
>
>
> On another server (running on the same system on a different port)
>
> postgres=# select version();
>                                              version
> -----------------------------------------------------------------------------------------------
>   PostgreSQL 9.6.9 on x86_64-pc-linux-gnu, compiled by gcc (Gentoo 6.4.0-r1 p1.3) 6.4.0, 64-bit
> (1 row)
>
> postgres=# select 'abcd ' ~ 'abcd\s';
>   ?column?
> ----------
>   f
> (1 row)
>
> For both clusters, the client encoding is UTF8, the database encoding and collation is UTF8 and en_US.utf8 respectively, and the lc_ctype is en_US.utf8. I am accessing the databases running locally by ssh-ing first to the host.
>
> I observed similar issues with other Linux-based servers running Ubuntu, in all cases the regex resulted in true on PostgreSQL 10+ and false on earlier versions (down to 9.3). The query comes from a table check that suddenly stopped accepting rows valid in the older version during the migration. Making it  select 'abcd ' ~ E'abcd\\s' doesn't  modify the outcome, unsurprisingly.
>
> Is it reproducible for others here as well? Given that it is, Is there a way to make both versions behave the same?

select version();
                                       version

------------------------------------------------------------------------------------
  PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (SUSE Linux)
4.8.5, 64-bit


lc_collate                          | en_US.UTF-8

lc_ctype                            | en_US.UTF-8


test=# select 'abcd'||chr(2006) ~ E'abcd\s';
  ?column?
----------
  f
(1 row)

In your example you are working on Postgres devel. Have you tried it on
Postgres 10 and/or 11?
 
char(2006) produces the wrong character as 2006 is the hex-value. You have to use 8198:
 
andreak@[local]:5433 10.4 andreak=# select version();
┌────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                            version                                             │
├────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0, 64-bit │
└────────────────────────────────────────────────────────────────────────────────────────────────┘
(1 row)

andreak@[local]:5433 10.4 andreak=# select 'abcd'||chr(8198) ~ 'abcd\s';
┌──────────┐
│ ?column? │
├──────────┤
│ t        │
└──────────┘
(1 row)

 
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
Attachment

Re: Sv: Re: regex match and special characters

From
Adrian Klaver
Date:
On 08/16/2018 07:04 AM, Andreas Joseph Krogh wrote:

> char(2006) produces the wrong character as 2006 is the hex-value. You 
> have to use 8198:
> andreak@[local]:543310.4 andreak=# select version();
> ┌────────────────────────────────────────────────────────────────────────────────────────────────┐
> │                                            version 
>                                              │
> ├────────────────────────────────────────────────────────────────────────────────────────────────┤
> │ PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 
> 7.3.0-16ubuntu3) 7.3.0, 64-bit │
> └────────────────────────────────────────────────────────────────────────────────────────────────┘
> (1 row)
> 
> andreak@[local]:543310.4 andreak=# select 'abcd'||chr(8198) ~ 'abcd\s';
> ┌──────────┐
> │ ?column? │
> ├──────────┤
> │ t        │
> └──────────┘
> (1 row)

Argh, read the wrong line. Thanks for the correction. Still:

test=# select version();
                                       version 

------------------------------------------------------------------------------------
  PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (SUSE Linux) 
4.8.5, 64-bit
(1 row)


test=# select 'abcd'||chr(8198) ~ E'abcd\s';
  ?column?
----------
  f
(1 row)


> -- 
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andreas@visena.com <mailto:andreas@visena.com>
> www.visena.com <https://www.visena.com>
> <https://www.visena.com>


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Sv: Re: Sv: Re: regex match and special characters

From
Andreas Joseph Krogh
Date:
På torsdag 16. august 2018 kl. 16:32:40, skrev Adrian Klaver <adrian.klaver@aklaver.com>:
On 08/16/2018 07:04 AM, Andreas Joseph Krogh wrote:

> char(2006) produces the wrong character as 2006 is the hex-value. You
> have to use 8198:
> andreak@[local]:543310.4 andreak=# select version();
> ┌────────────────────────────────────────────────────────────────────────────────────────────────┐
> │                                            version
>                                              │
> ├────────────────────────────────────────────────────────────────────────────────────────────────┤
> │ PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
> 7.3.0-16ubuntu3) 7.3.0, 64-bit │
> └────────────────────────────────────────────────────────────────────────────────────────────────┘
> (1 row)
>
> andreak@[local]:543310.4 andreak=# select 'abcd'||chr(8198) ~ 'abcd\s';
> ┌──────────┐
> │ ?column? │
> ├──────────┤
> │ t        │
> └──────────┘
> (1 row)

Argh, read the wrong line. Thanks for the correction. Still:

test=# select version();
                                       version

------------------------------------------------------------------------------------
  PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (SUSE Linux)
4.8.5, 64-bit
(1 row)


test=# select 'abcd'||chr(8198) ~ E'abcd\s';
  ?column?
----------
  f
(1 row)
 
When using E-syntax you need to double the backslash for escaping:
 
andreak@[local]:5433 10.4 andreak=# select 'abcd'||chr(8198) ~ E'abcd\\s';
┌──────────┐
│ ?column? │
├──────────┤
│ t        │
└──────────┘
(1 row)

 
 
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
 
Attachment

Re: Sv: Re: Sv: Re: regex match and special characters

From
Adrian Klaver
Date:
On 08/16/2018 07:47 AM, Andreas Joseph Krogh wrote:
> På torsdag 16. august 2018 kl. 16:32:40, skrev Adrian Klaver 
> <adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>>:
> 
>     On 08/16/2018 07:04 AM, Andreas Joseph Krogh wrote:
> 
>      > char(2006) produces the wrong character as 2006 is the hex-value. You
>      > have to use 8198:
>      > andreak@[local]:543310.4 andreak=# select version();
>      >
>     ┌────────────────────────────────────────────────────────────────────────────────────────────────┐
>      > │                                            version
>      >                                              │
>      >
>     ├────────────────────────────────────────────────────────────────────────────────────────────────┤
>      > │ PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
>      > 7.3.0-16ubuntu3) 7.3.0, 64-bit │
>      >
>     └────────────────────────────────────────────────────────────────────────────────────────────────┘
>      > (1 row)
>      >
>      > andreak@[local]:543310.4 andreak=# select 'abcd'||chr(8198) ~
>     'abcd\s';
>      > ┌──────────┐
>      > │ ?column? │
>      > ├──────────┤
>      > │ t        │
>      > └──────────┘
>      > (1 row)
> 
>     Argh, read the wrong line. Thanks for the correction. Still:
> 
>     test=# select version();
>                                             version
> 
>     ------------------------------------------------------------------------------------
>        PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (SUSE Linux)
>     4.8.5, 64-bit
>     (1 row)
> 
> 
>     test=# select 'abcd'||chr(8198) ~ E'abcd\s';
>        ?column?
>     ----------
>        f
>     (1 row)
> 
> When using E-syntax you need to double the backslash for escaping:
> andreak@[local]:543310.4 andreak=# select 'abcd'||chr(8198) ~ E'abcd\\s';
> ┌──────────┐
> │ ?column? │
> ├──────────┤
> │ t        │
> └──────────┘
> (1 row)

Hmm:

test=# show standard_conforming_strings;
  standard_conforming_strings
-----------------------------
  off
(1 row)

test=# show escape_string_warning;
  escape_string_warning
-----------------------
  on
(1 row)



test=# select 'abcd'||chr(8198) ~ 'abcd\s';
WARNING:  nonstandard use of escape in a string literal
LINE 1: select 'abcd'||chr(8198) ~ 'abcd\s';
                                    ^
HINT:  Use the escape string syntax for escapes, e.g., E'\r\n'.
  ?column?
----------
  f
(1 row)

test=# select 'abcd'||chr(8198) ~ 'abcd\\s';
WARNING:  nonstandard use of \\ in a string literal
LINE 1: select 'abcd'||chr(8198) ~ 'abcd\\s';
                                    ^
HINT:  Use the escape string syntax for backslashes, e.g., E'\\'.
  ?column?
----------
  t
(1 row)


test=# set escape_string_warning = 'off';
SET
test=# show escape_string_warning;
  escape_string_warning
-----------------------
  off
(1 row)

test=# select 'abcd'||chr(8198) ~ 'abcd\s';
  ?column?
----------
  f
(1 row)

test=# set standard_conforming_strings = 'on';
SET
test=# show standard_conforming_strings;
  standard_conforming_strings
-----------------------------
  on
(1 row)

test=# select 'abcd'||chr(8198) ~ 'abcd\s';
  ?column?
----------
  t
(1 row)


Wonder if the OP has standard_conforming_strings='off' and
escape_string_warning='off'?



> -- 
> *Andreas Joseph Krogh*
> CTO / Partner - Visena AS
> Mobile: +47 909 56 963
> andreas@visena.com <mailto:andreas@visena.com>
> www.visena.com <https://www.visena.com>
> <https://www.visena.com>


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: Sv: Re: Sv: Re: regex match and special characters

From
Tom Lane
Date:
Andreas Joseph Krogh <andreas@visena.com> writes:
> When using E-syntax you need to double the backslash for escaping:
> andreak@[local]:5433 10.4 andreak=# select 'abcd'||chr(8198) ~ E'abcd\\s'; 

Another thing to keep in mind is that this is all locale-dependent
(specifically, LC_CTYPE, I believe).  In plain old C locale, nothing
beyond the standard ASCII whitespace characters will match \s.  I'm not
sure how universal it is for other locales to treat characters like
U+2006 as whitespace.

            regards, tom lane


Re: Sv: Re: Sv: Re: regex match and special characters

From
Adrian Klaver
Date:
On 08/16/2018 08:13 AM, Adrian Klaver wrote:

> 
> 
> Wonder if the OP has standard_conforming_strings='off' and
> escape_string_warning='off'?

In the above referring to 9.6.9 instance.
> 
> 
> 
>> -- 
>> *Andreas Joseph Krogh*
>> CTO / Partner - Visena AS
>> Mobile: +47 909 56 963
>> andreas@visena.com <mailto:andreas@visena.com>
>> www.visena.com <https://www.visena.com>
>> <https://www.visena.com>
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: Sv: Re: Sv: Re: regex match and special characters

From
Adrian Klaver
Date:
On 08/16/2018 08:19 AM, Adrian Klaver wrote:
> On 08/16/2018 08:13 AM, Adrian Klaver wrote:
> 
>>
>>
>> Wonder if the OP has standard_conforming_strings='off' and
>> escape_string_warning='off'?
> 
> In the above referring to 9.6.9 instance.

Well that theory is no good:

test=# select version();
                                        version 

-------------------------------------------------------------------------------------
  PostgreSQL 9.6.7 on x86_64-pc-linux-gnu, compiled by gcc (SUSE Linux) 
4.8.5, 64-bit
(1 row)


test=# set standard_conforming_strings='off';
SET
test=# show standard_conforming_strings;
  standard_conforming_strings
-----------------------------
  off

test=# show escape_string_warning;
  escape_string_warning
-----------------------
  off


test=# select 'abcd'||chr(8198) ~ E'abcd\s';
  ?column?
----------
  f
(1 row)

test=# select 'abcd'||chr(8198) ~ E'abcd\\s';
  ?column?
----------
  f
(1 row)

test=# set escape_string_warning='on';
SET
test=# set standard_conforming_strings='on';
SET


test=# select 'abcd'||chr(8198) ~ 'abcd\s';
  ?column?
----------
  f
(1 row)

test=# select 'abcd'||chr(8198) ~ E'abcd\\s';
  ?column?
----------
  f
(1 row)




>>
>>
>>
>>> -- 
>>> *Andreas Joseph Krogh*
>>> CTO / Partner - Visena AS
>>> Mobile: +47 909 56 963
>>> andreas@visena.com <mailto:andreas@visena.com>
>>> www.visena.com <https://www.visena.com>
>>> <https://www.visena.com>
>>
>>
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: regex match and special characters

From
Oleksii Kliukin
Date:

On 16. Aug 2018, at 16:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alex Kliukin <alexk@hintbits.com> writes:
Here is a simple SQL statement that gives different results on PostgreSQL 9.6 and PostgreSQL 10+. The space character at the end of the string is actually U+2006 SIX-PER-EM SPACE (http://www.fileformat.info/info/unicode/char/2006/index.htm)

I think the reason for the discrepancy is that in v10 we fixed the regex
locale support so that it could properly classify code points above U+7FF,
cf

https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=c54159d44ceaba26ceda9fea1804f0de122a8f30

This nails down the cause, thanks a lot for the link! Apparently I missed it from PostgreSQL 10 release notes, where it is present in the “Queries” section, although AFAIK it deserved an entry in the "migration to version 10”, as it may potentially make dump/restore from previous versions to version 10 error out if there are table constraints that use regex classes over the Unicode text fields with code points above U+7FF.


So 10 is giving the right answer (i.e. that \s matches U+2006).
9.x is not

Agreed.

Cheers,
Alex

Re: regex match and special characters

From
Oleksii Kliukin
Date:
Hi Adrian,

On 16. Aug 2018, at 18:13, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

test=# select 'abcd'||chr(8198) ~ 'abcd\s';
?column?
----------
t
(1 row)


Wonder if the OP has standard_conforming_strings='off' and
escape_string_warning='off'?


Both are set to ‘on’ for me for all versions (I believe those are default settings). I have 12devel indeed on my test system alongside 9.6, but I’ve tried it as well on PostgreSQL 10 running on a different distro with different locale settings and it produced the same result (check being true). 

I think Tom’s answer solves it, although I am wondering how did you get true in the statement quoted above on PostgreSQL 9.6, perhaps that result is actually from PostgreSQL 10?

Cheers,
Oleksii

Re: regex match and special characters

From
Adrian Klaver
Date:
On 08/18/2018 08:12 AM, Oleksii Kliukin wrote:
> Hi Adrian,
> 
>> On 16. Aug 2018, at 18:13, Adrian Klaver <adrian.klaver@aklaver.com 
>> <mailto:adrian.klaver@aklaver.com>> wrote:
>>
>> test=# select 'abcd'||chr(8198) ~ 'abcd\s';
>> ?column?
>> ----------
>> t
>> (1 row)
>>
>>
>> Wonder if the OP has standard_conforming_strings='off' and
>> escape_string_warning='off'?
>>
> 
> Both are set to ‘on’ for me for all versions (I believe those are 
> default settings). I have 12devel indeed on my test system alongside 
> 9.6, but I’ve tried it as well on PostgreSQL 10 running on a different 
> distro with different locale settings and it produced the same result 
> (check being true).
> 
> I think Tom’s answer solves it, although I am wondering how did you get 
> true in the statement quoted above on PostgreSQL 9.6, perhaps that 
> result is actually from PostgreSQL 10?

Yes it was on Postgres 10. I got reversed in my thinking. A subsequent post:

https://www.postgresql.org/message-id/8b8d9250-1ee3-6aff-b580-592c7c759586%40aklaver.com

shows the behavior you saw on 9.6. What happens when I work through my 
thought process on line:(

> 
> Cheers,
> Oleksii


-- 
Adrian Klaver
adrian.klaver@aklaver.com