Thread: Why format() adds double quote?

Why format() adds double quote?

From

Tatsuo Ishii

Date:

20 January 2016, 02:47:13

test=# select format('%I', t) from t1; format  
----------aaa"AAA""あいう"
(3 rows)

Why is the text value of the third line needed to be double quoted?
(note that it is a multi byte character). Same thing can be said to
quote_ident().

We treat identifiers made of the multi byte characters without double
quotation (non delimited identifier) in other places.

test=# create table t2(あいう text);
CREATE TABLE
test=# insert into t2 values('aaa');
INSERT 0 1
test=# select あいう from t2;あいう 
--------aaa
(1 row)

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: Why format() adds double quote?

From

Pavel Stehule

Date:

20 January 2016, 05:36:45

2016-01-20 3:47 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:

test=# select format('%I', t) from t1;
format
----------
aaa
"AAA"
"あいう"
(3 rows)

Why is the text value of the third line needed to be double quoted?
(note that it is a multi byte character). Same thing can be said to
quote_ident().

We treat identifiers made of the multi byte characters without double
quotation (non delimited identifier) in other places.

test=# create table t2(あいう text);
CREATE TABLE
test=# insert into t2 values('aaa');
INSERT 0 1
test=# select あいう from t2;
あいう
--------
aaa
(1 row)

format uses same routine as quote_ident. So quote_ident should be fixed first.

Regards

Pavel

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Why format() adds double quote?

From

Tatsuo Ishii

Date:

20 January 2016, 06:20:24

> 2016-01-20 3:47 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:
> 
>> test=# select format('%I', t) from t1;
>>   format
>> ----------
>>  aaa
>>  "AAA"
>>  "あいう"
>> (3 rows)
>>
>> Why is the text value of the third line needed to be double quoted?
>> (note that it is a multi byte character). Same thing can be said to
>> quote_ident().
>>
>> We treat identifiers made of the multi byte characters without double
>> quotation (non delimited identifier) in other places.
>>
>> test=# create table t2(あいう text);
>> CREATE TABLE
>> test=# insert into t2 values('aaa');
>> INSERT 0 1
>> test=# select あいう from t2;
>>  あいう
>> --------
>>  aaa
>> (1 row)
> 
> format uses same routine as quote_ident. So quote_ident should be fixed
> first.

Yes, I had that in my mind too.

Attached is the proposed patch to fix the bug.
Regression tests passed.

Here is an example after the patch. Note that the third row is not
quoted any more.

test=#  select format('%I', あいう) from t2;format 
--------aaa"AAA"あああ
(3 rows)

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 3783e97..b93fc27 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -9405,7 +9405,7 @@ quote_identifier(const char *ident)     * would like to use <ctype.h> macros here, but they might
yieldunwanted     * locale-specific results...     */
 
-    safe = ((ident[0] >= 'a' && ident[0] <= 'z') || ident[0] == '_');
+    safe = ((ident[0] >= 'a' && ident[0] <= 'z') || ident[0] == '_' || IS_HIGHBIT_SET(ident[0]));    for (ptr = ident;
*ptr;ptr++)    {
 
@@ -9413,7 +9413,8 @@ quote_identifier(const char *ident)        if ((ch >= 'a' && ch <= 'z') ||            (ch >= '0'
&&ch <= '9') ||
 
-            (ch == '_'))
+            (ch == '_') ||
+            (IS_HIGHBIT_SET(ch)))        {            /* okay */        }

Re: Why format() adds double quote?

From

Pavel Stehule

Date:

20 January 2016, 07:51:16

2016-01-20 7:20 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:

> 2016-01-20 3:47 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:
>
>> test=# select format('%I', t) from t1;
>> format
>> ----------
>> aaa
>> "AAA"
>> "あいう"
>> (3 rows)
>>
>> Why is the text value of the third line needed to be double quoted?
>> (note that it is a multi byte character). Same thing can be said to
>> quote_ident().
>>
>> We treat identifiers made of the multi byte characters without double
>> quotation (non delimited identifier) in other places.
>>
>> test=# create table t2(あいう text);
>> CREATE TABLE
>> test=# insert into t2 values('aaa');
>> INSERT 0 1
>> test=# select あいう from t2;
>> あいう
>> --------
>> aaa
>> (1 row)
>
> format uses same routine as quote_ident. So quote_ident should be fixed
> first.

Yes, I had that in my mind too.

Attached is the proposed patch to fix the bug.
Regression tests passed.

Here is an example after the patch. Note that the third row is not
quoted any more.

test=# select format('%I', あいう) from t2;
format
--------
aaa
"AAA"
あああ
(3 rows)

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 3783e97..b93fc27 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -9405,7 +9405,7 @@ quote_identifier(const char *ident)
* would like to use <ctype.h> macros here, but they might yield unwanted
* locale-specific results...
*/
- safe = ((ident[0] >= 'a' && ident[0] <= 'z') || ident[0] == '_');
+ safe = ((ident[0] >= 'a' && ident[0] <= 'z') || ident[0] == '_' || IS_HIGHBIT_SET(ident[0]));

for (ptr = ident; *ptr; ptr++)
{
@@ -9413,7 +9413,8 @@ quote_identifier(const char *ident)

if ((ch >= 'a' && ch <= 'z') ||
(ch >= '0' && ch <= '9') ||
- (ch == '_'))
+ (ch == '_') ||
+ (IS_HIGHBIT_SET(ch)))
{
/* okay */
}

This patch ls simply - I remember I was surprised, so we allow any multibyte char few months ago.

+1

Pavel

Re: Why format() adds double quote?

From

Tatsuo Ishii

Date:

20 January 2016, 09:17:26

> Hi
> 
> 2016-01-20 7:20 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:
> 
>> > 2016-01-20 3:47 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:
>> >
>> >> test=# select format('%I', t) from t1;
>> >>   format
>> >> ----------
>> >>  aaa
>> >>  "AAA"
>> >>  "あいう"
>> >> (3 rows)
>> >>
>> >> Why is the text value of the third line needed to be double quoted?
>> >> (note that it is a multi byte character). Same thing can be said to
>> >> quote_ident().
>> >>
>> >> We treat identifiers made of the multi byte characters without double
>> >> quotation (non delimited identifier) in other places.
>> >>
>> >> test=# create table t2(あいう text);
>> >> CREATE TABLE
>> >> test=# insert into t2 values('aaa');
>> >> INSERT 0 1
>> >> test=# select あいう from t2;
>> >>  あいう
>> >> --------
>> >>  aaa
>> >> (1 row)
>> >
>> > format uses same routine as quote_ident. So quote_ident should be fixed
>> > first.
>>
>> Yes, I had that in my mind too.
>>
>> Attached is the proposed patch to fix the bug.
>> Regression tests passed.
>>
>> Here is an example after the patch. Note that the third row is not
>> quoted any more.
>>
>> test=#  select format('%I', あいう) from t2;
>>  format
>> --------
>>  aaa
>>  "AAA"
>>  あああ
>> (3 rows)
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>>
>> diff --git a/src/backend/utils/adt/ruleutils.c
>> b/src/backend/utils/adt/ruleutils.c
>> index 3783e97..b93fc27 100644
>> --- a/src/backend/utils/adt/ruleutils.c
>> +++ b/src/backend/utils/adt/ruleutils.c
>> @@ -9405,7 +9405,7 @@ quote_identifier(const char *ident)
>>          * would like to use <ctype.h> macros here, but they might yield
>> unwanted
>>          * locale-specific results...
>>          */
>> -       safe = ((ident[0] >= 'a' && ident[0] <= 'z') || ident[0] == '_');
>> +       safe = ((ident[0] >= 'a' && ident[0] <= 'z') || ident[0] == '_' ||
>> IS_HIGHBIT_SET(ident[0]));
>>
>>         for (ptr = ident; *ptr; ptr++)
>>         {
>> @@ -9413,7 +9413,8 @@ quote_identifier(const char *ident)
>>
>>                 if ((ch >= 'a' && ch <= 'z') ||
>>                         (ch >= '0' && ch <= '9') ||
>> -                       (ch == '_'))
>> +                       (ch == '_') ||
>> +                       (IS_HIGHBIT_SET(ch)))
>>                 {
>>                         /* okay */
>>                 }
>>
>>
> This patch ls simply - I remember I was surprised, so we allow any
> multibyte char few months ago.
> 
> +1

If we would go this way, question is if we should back patch this or
not since the patch apparently changes the existing
behaviors. Comments?  I would think we should not.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: Why format() adds double quote?

From

Pavel Stehule

Date:

20 January 2016, 09:20:47

2016-01-20 10:17 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:

> Hi
>
> 2016-01-20 7:20 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:
>
>> > 2016-01-20 3:47 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:
>> >
>> >> test=# select format('%I', t) from t1;
>> >> format
>> >> ----------
>> >> aaa
>> >> "AAA"
>> >> "あいう"
>> >> (3 rows)
>> >>
>> >> Why is the text value of the third line needed to be double quoted?
>> >> (note that it is a multi byte character). Same thing can be said to
>> >> quote_ident().
>> >>
>> >> We treat identifiers made of the multi byte characters without double
>> >> quotation (non delimited identifier) in other places.
>> >>
>> >> test=# create table t2(あいう text);
>> >> CREATE TABLE
>> >> test=# insert into t2 values('aaa');
>> >> INSERT 0 1
>> >> test=# select あいう from t2;
>> >> あいう
>> >> --------
>> >> aaa
>> >> (1 row)
>> >
>> > format uses same routine as quote_ident. So quote_ident should be fixed
>> > first.
>>
>> Yes, I had that in my mind too.
>>
>> Attached is the proposed patch to fix the bug.
>> Regression tests passed.
>>
>> Here is an example after the patch. Note that the third row is not
>> quoted any more.
>>
>> test=# select format('%I', あいう) from t2;
>> format
>> --------
>> aaa
>> "AAA"
>> あああ
>> (3 rows)
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>>
>> diff --git a/src/backend/utils/adt/ruleutils.c
>> b/src/backend/utils/adt/ruleutils.c
>> index 3783e97..b93fc27 100644
>> --- a/src/backend/utils/adt/ruleutils.c
>> +++ b/src/backend/utils/adt/ruleutils.c
>> @@ -9405,7 +9405,7 @@ quote_identifier(const char *ident)
>> * would like to use <ctype.h> macros here, but they might yield
>> unwanted
>> * locale-specific results...
>> */
>> - safe = ((ident[0] >= 'a' && ident[0] <= 'z') || ident[0] == '_');
>> + safe = ((ident[0] >= 'a' && ident[0] <= 'z') || ident[0] == '_' ||
>> IS_HIGHBIT_SET(ident[0]));
>>
>> for (ptr = ident; *ptr; ptr++)
>> {
>> @@ -9413,7 +9413,8 @@ quote_identifier(const char *ident)
>>
>> if ((ch >= 'a' && ch <= 'z') ||
>> (ch >= '0' && ch <= '9') ||
>> - (ch == '_'))
>> + (ch == '_') ||
>> + (IS_HIGHBIT_SET(ch)))
>> {
>> /* okay */
>> }
>>
>>
> This patch ls simply - I remember I was surprised, so we allow any
> multibyte char few months ago.
>
> +1

If we would go this way, question is if we should back patch this or
not since the patch apparently changes the existing
behaviors. Comments? I would think we should not.

I am sure, so we should not backport this change. This can breaks customer regress tests - and the current behave isn't 100% correct, but it is safe.

Pavel

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: Why format() adds double quote?

From

Robert Haas

Date:

20 January 2016, 20:39:18

On Wed, Jan 20, 2016 at 4:20 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> If we would go this way, question is if we should back patch this or
>> not since the patch apparently changes the existing
>> behaviors. Comments?  I would think we should not.
>
> I am sure, so we should not backport this change. This can breaks customer
> regress tests - and the current behave isn't 100% correct, but it is safe.

Quite.  This is not a bug fix.  It's a behavior change, perhaps for the better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Why format() adds double quote?

From

Tatsuo Ishii

Date:

24 January 2016, 10:04:16

> On Wed, Jan 20, 2016 at 4:20 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>>> If we would go this way, question is if we should back patch this or
>>> not since the patch apparently changes the existing
>>> behaviors. Comments?  I would think we should not.
>>
>> I am sure, so we should not backport this change. This can breaks customer
>> regress tests - and the current behave isn't 100% correct, but it is safe.
> 
> Quite.  This is not a bug fix.  It's a behavior change, perhaps for the better.

Added to the commitfest 2016-03.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: Why format() adds double quote?

From

"Dickson S. Guedes"

Date:

24 January 2016, 22:54:51

2016-01-24 8:04 GMT-02:00 Tatsuo Ishii <ishii@postgresql.org>:
>> On Wed, Jan 20, 2016 at 4:20 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>>>> If we would go this way, question is if we should back patch this or
>>>> not since the patch apparently changes the existing
>>>> behaviors. Comments?  I would think we should not.
>>>
>>> I am sure, so we should not backport this change. This can breaks customer
>>> regress tests - and the current behave isn't 100% correct, but it is safe.
>>
>> Quite.  This is not a bug fix.  It's a behavior change, perhaps for the better.
>
> Added to the commitfest 2016-03.

Hi,

I gone ahead a little and tested this patch and it works like was
proposed, I agree that it's not a bug fix but a new behavior so -1 for
backport.

While applying patch against master
(1129c2b0ad2732f301f696ae2cf98fb063a4c1f8) it offsets two hunks.

Since format() has regression tests I suggest that one should be added
to cover this. It could worth to add the new behavior to the docs,
since there no explicit example for %I.

I performed the follow tests that works as expected using some Portuguese words:

postgres=# create table test (nome varchar, endereço text, "UF"
varchar(2), título varchar);
CREATE TABLE
Time: 80,769 ms
postgres=# select format('%I', attname) from pg_attribute join
pg_class on (attrelid = oid) where relname = 'test'; format
----------"UF"cmaxcminctidendereçonometableoidtítuloxmaxxmin
(10 rows)

Time: 1,728 ms
postgres=# select format('%I', 'endereco'); format
----------endereco
(1 row)

Time: 0,098 ms
postgres=# select format('%I', 'endereço'); format
----------endereço
(1 row)

Time: 0,088 ms
postgres=# select format('%I', 'あああ');format
--------あああ
(1 row)

Time: 0,072 ms
postgres=# select format('%I', 'título');format
--------título
(1 row)

Time: 0,051 ms
postgres=# select format('%I', 'título e');  format
------------"título e"
(1 row)

Time: 0,051 ms
postgres=# select format('%I', 'título_e'); format
----------título_e
(1 row)

Time: 0,051 ms
postgres=# select format('%I', '_título');format
---------_título
(1 row)

Time: 0,047 ms
postgres=# select format('%I', '1_título');  format
------------"1_título"
(1 row)

Time: 0,046 ms


Thank you for this!


Best regards,
--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://github.com/guedes - http://guedesoft.net
http://www.postgresql.org.br

Re: Why format() adds double quote?

From

Tatsuo Ishii

Date:

26 January 2016, 07:29:19

> 2016-01-24 8:04 GMT-02:00 Tatsuo Ishii <ishii@postgresql.org>:
>>> On Wed, Jan 20, 2016 at 4:20 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>>>>> If we would go this way, question is if we should back patch this or
>>>>> not since the patch apparently changes the existing
>>>>> behaviors. Comments?  I would think we should not.
>>>>
>>>> I am sure, so we should not backport this change. This can breaks customer
>>>> regress tests - and the current behave isn't 100% correct, but it is safe.
>>>
>>> Quite.  This is not a bug fix.  It's a behavior change, perhaps for the better.
>>
>> Added to the commitfest 2016-03.
> 
> Hi,
> 
> I gone ahead a little and tested this patch and it works like was
> proposed, I agree that it's not a bug fix but a new behavior so -1 for
> backport.

IMO, it's a bug or at least an inconsistency but I admit it's too late
to back patch to existing stable branches.

> While applying patch against master
> (1129c2b0ad2732f301f696ae2cf98fb063a4c1f8) it offsets two hunks.
> 
> Since format() has regression tests I suggest that one should be added
> to cover this. 

I don't think it's doable. The test requires to handle multiple
database encodings. The regression test framework handles only one
database encoding. Probably adding to the existing mb test is the
easiest.

> It could worth to add the new behavior to the docs,
> since there no explicit example for %I.

> I performed the follow tests that works as expected using some Portuguese words:

I assume you used UTF-8 encoding database.
Great.

> postgres=# create table test (nome varchar, endereço text, "UF"
> varchar(2), título varchar);
> CREATE TABLE
> Time: 80,769 ms
> postgres=# select format('%I', attname) from pg_attribute join
> pg_class on (attrelid = oid) where relname = 'test';
>   format
> ----------
>  "UF"
>  cmax
>  cmin
>  ctid
>  endereço
>  nome
>  tableoid
>  título
>  xmax
>  xmin
> (10 rows)
> 
> Time: 1,728 ms
> postgres=# select format('%I', 'endereco');
>   format
> ----------
>  endereco
> (1 row)
> 
> Time: 0,098 ms
> postgres=# select format('%I', 'endereço');
>   format
> ----------
>  endereço
> (1 row)
> 
> Time: 0,088 ms
> postgres=# select format('%I', 'あああ');
>  format
> --------
>  あああ
> (1 row)
> 
> Time: 0,072 ms
> postgres=# select format('%I', 'título');
>  format
> --------
>  título
> (1 row)
> 
> Time: 0,051 ms
> postgres=# select format('%I', 'título e');
>    format
> ------------
>  "título e"
> (1 row)
> 
> Time: 0,051 ms
> postgres=# select format('%I', 'título_e');
>   format
> ----------
>  título_e
> (1 row)
> 
> Time: 0,051 ms
> postgres=# select format('%I', '_título');
>  format
> ---------
>  _título
> (1 row)
> 
> Time: 0,047 ms
> postgres=# select format('%I', '1_título');
>    format
> ------------
>  "1_título"
> (1 row)
> 
> Time: 0,046 ms
> 
> 
> Thank you for this!
> 
> 
> Best regards,
> -- 
> Dickson S. Guedes
> mail/xmpp: guedes@guedesoft.net - skype: guediz
> http://github.com/guedes - http://guedesoft.net
> http://www.postgresql.org.br

Re: Why format() adds double quote?

From

"Daniel Verite"

Date:

26 January 2016, 20:01:18

    Tatsuo Ishii wrote:

> IMO, it's a bug or at least an inconsistency

Personally I don't see this change being good for everything.

Let's play devil's advocate:

create table abc(U&"foo\2003" int);

U+2003 is 'EM SPACE', in Unicode's General Punctuation block.

With the current version, format('%I', attname) on this column is:
"foo "

With the patched version, it produces this:
foo 

So the visual hint that there are more characters at the end is lost.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Re: Why format() adds double quote?

From

"Dickson S. Guedes"

Date:

26 January 2016, 22:04:16

2016-01-26 5:29 GMT-02:00 Tatsuo Ishii <ishii@postgresql.org>:
>
> I assume you used UTF-8 encoding database.

Yes, I do.


-- 
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://github.com/guedes - http://guedesoft.net
http://www.postgresql.org.br

Re: Why format() adds double quote?

From

"Dickson S. Guedes"

Date:

26 January 2016, 22:29:25

2016-01-26 18:00 GMT-02:00 Daniel Verite <daniel@manitou-mail.org>:
> ...
> create table abc(U&"foo\2003" int);
>
> U+2003 is 'EM SPACE', in Unicode's General Punctuation block.
>
> With the current version, format('%I', attname) on this column is:
> "foo "
>
> With the patched version, it produces this:
> foo
>
> So the visual hint that there are more characters at the end is lost.


Thanks for advocate, I see here that it even produces that output with
simple spaces.

postgres=# create table x ("aí      " text);
CREATE TABLE
postgres=# \d x       Tabela "public.x" Coluna  | Tipo | Modificadores
----------+------+---------------aí       | text |


This will break copy&paste user actions and scripts that parses that output.

Maybe the patch should consider left/right non-printable chars to
choose whether to show or not the " ?

[]s
--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://github.com/guedes - http://guedesoft.net
http://www.postgresql.org.br

Re: Why format() adds double quote?

From

Tatsuo Ishii

Date:

27 January 2016, 01:41:04

> Thanks for advocate, I see here that it even produces that output with
> simple spaces.
>
> postgres=# create table x ("aí      " text);
> CREATE TABLE
> postgres=# \d x
>         Tabela "public.x"
>   Coluna  | Tipo | Modificadores
> ----------+------+---------------
>  aí       | text |
>
>
> This will break copy&paste user actions and scripts that parses that output.
>
> Maybe the patch should consider left/right non-printable chars to
> choose whether to show or not the " ?

This is a totally different story from the topic discussed in this
thread. psql never adds double quotations to column name even with
upper case col names.

test=# create table t6("ABC" int);
CREATE TABLE
test=# \d t6     Table "public.t6"Column |  Type   | Modifiers
--------+---------+-----------ABC    | integer |

If you want to change the existing psql's behavior, propose it
yourself.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: Why format() adds double quote?

From

Tatsuo Ishii

Date:

27 January 2016, 01:46:28

>> IMO, it's a bug or at least an inconsistency
> 
> Personally I don't see this change being good for everything.
> 
> Let's play devil's advocate:
> 
> create table abc(U&"foo\2003" int);
> 
> U+2003 is 'EM SPACE', in Unicode's General Punctuation block.
> 
> With the current version, format('%I', attname) on this column is:
> "foo "
> 
> With the patched version, it produces this:
> foo 
> 
> So the visual hint that there are more characters at the end is lost.

What is the "visual hint"? If you are talking about psql's output, it
never adds "visual hint" (double quotations).

If you are talking about the string handling in a program, what kind
of program cares about "visiual"?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: Why format() adds double quote?

From

Pavel Stehule

Date:

27 January 2016, 04:57:13

2016-01-26 21:00 GMT+01:00 Daniel Verite <daniel@manitou-mail.org>:

Tatsuo Ishii wrote:

> IMO, it's a bug or at least an inconsistency

Personally I don't see this change being good for everything.

Let's play devil's advocate:

create table abc(U&"foo\2003" int);

U+2003 is 'EM SPACE', in Unicode's General Punctuation block.

With the current version, format('%I', attname) on this column is:
"foo "

With the patched version, it produces this:
foo

So the visual hint that there are more characters at the end is lost.

I can agree, so current behave can be useful in some cases, but still it is bug (inconsistency) between PostgreSQL parser and PostgreSQL escaping functions.

Currently, any multibyte char can be unescaped identifier (only apostrophes are tested). We should to test white chars too.

Regards

Pavel

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Re: Why format() adds double quote?

From

Tatsuo Ishii

Date:

27 January 2016, 05:13:13

> 2016-01-26 21:00 GMT+01:00 Daniel Verite <daniel@manitou-mail.org>:
> 
>>         Tatsuo Ishii wrote:
>>
>> > IMO, it's a bug or at least an inconsistency
>>
>> Personally I don't see this change being good for everything.
>>
>> Let's play devil's advocate:
>>
>> create table abc(U&"foo\2003" int);
>>
>> U+2003 is 'EM SPACE', in Unicode's General Punctuation block.
>>
>> With the current version, format('%I', attname) on this column is:
>> "foo "
>>
>> With the patched version, it produces this:
>> foo
>>
>> So the visual hint that there are more characters at the end is lost.
>>
> 
> I can agree, so current behave can be useful in some cases, but still it is
> bug (inconsistency) between PostgreSQL parser and PostgreSQL escaping
> functions.
> 
> Currently, any multibyte char can be unescaped identifier (only apostrophes
> are tested). We should to test white chars too.

Really? I thought we do that test.

test=# create table t6("あいう えお" int);
CREATE TABLE
test=# \d t6        Table "public.t6"  Column    |  Type   | Modifiers 
-------------+---------+-----------あいう えお | integer | 
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: Why format() adds double quote?

From

Pavel Stehule

Date:

27 January 2016, 05:16:22

2016-01-27 6:13 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:

> 2016-01-26 21:00 GMT+01:00 Daniel Verite <daniel@manitou-mail.org>:
>
>> Tatsuo Ishii wrote:
>>
>> > IMO, it's a bug or at least an inconsistency
>>
>> Personally I don't see this change being good for everything.
>>
>> Let's play devil's advocate:
>>
>> create table abc(U&"foo\2003" int);
>>
>> U+2003 is 'EM SPACE', in Unicode's General Punctuation block.
>>
>> With the current version, format('%I', attname) on this column is:
>> "foo "
>>
>> With the patched version, it produces this:
>> foo
>>
>> So the visual hint that there are more characters at the end is lost.
>>
>
> I can agree, so current behave can be useful in some cases, but still it is
> bug (inconsistency) between PostgreSQL parser and PostgreSQL escaping
> functions.
>
> Currently, any multibyte char can be unescaped identifier (only apostrophes
> are tested). We should to test white chars too.

Really? I thought we do that test.

what you are expecting from this test? UTF single quotes are tested only in quote functions probably.

Pavel

test=# create table t6("あいうえお" int);
CREATE TABLE
test=# \d t6
Table "public.t6"
Column | Type | Modifiers
-------------+---------+-----------
あいうえお | integer |
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: Why format() adds double quote?

From

Tatsuo Ishii

Date:

27 January 2016, 05:24:08

>> > I can agree, so current behave can be useful in some cases, but still it
>> is
>> > bug (inconsistency) between PostgreSQL parser and PostgreSQL escaping
>> > functions.
>> >
>> > Currently, any multibyte char can be unescaped identifier (only
>> apostrophes
>> > are tested). We should to test white chars too.
>>
>> Really? I thought we do that test.
>>
> 
> what you are expecting from this test? UTF single quotes are tested only in
> quote functions probably.

I just wanted to demonstrate multibyte chars including ASCII white
spaces can be an identifier.

> We should to test white chars too.

What do you exactly propose regarding white chars and multibyte chars
here? Maybe you propose to consider non ASCII white spaces (treate
them as ASCII white spaces)?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> Pavel
> 
> 
>>
>> test=# create table t6("あいう えお" int);
>> CREATE TABLE
>> test=# \d t6
>>          Table "public.t6"
>>    Column    |  Type   | Modifiers
>> -------------+---------+-----------
>>  あいう えお | integer |
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>>

Re: Why format() adds double quote?

From

Pavel Stehule

Date:

27 January 2016, 07:06:57

2016-01-27 6:24 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:

>> > I can agree, so current behave can be useful in some cases, but still it
>> is
>> > bug (inconsistency) between PostgreSQL parser and PostgreSQL escaping
>> > functions.
>> >
>> > Currently, any multibyte char can be unescaped identifier (only
>> apostrophes
>> > are tested). We should to test white chars too.
>>
>> Really? I thought we do that test.
>>
>
> what you are expecting from this test? UTF single quotes are tested only in
> quote functions probably.

I just wanted to demonstrate multibyte chars including ASCII white
spaces can be an identifier.

I understand now.

> We should to test white chars too.

What do you exactly propose regarding white chars and multibyte chars
here? Maybe you propose to consider non ASCII white spaces (treate
them as ASCII white spaces)?

I propose the work with UTF white chars should be same like ASCII white chars. The current design is too simple - with possible pretty bad issues. Daniel's example is good - there is big gap in design.

Regards

Pavel

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> Pavel
>
>
>>
>> test=# create table t6("あいうえお" int);
>> CREATE TABLE
>> test=# \d t6
>> Table "public.t6"
>> Column | Type | Modifiers
>> -------------+---------+-----------
>> あいうえお | integer |
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>>

Re: Why format() adds double quote?

From

Tatsuo Ishii

Date:

27 January 2016, 07:26:04

>> What do you exactly propose regarding white chars and multibyte chars
>> here? Maybe you propose to consider non ASCII white spaces (treate
>> them as ASCII white spaces)?
>>
> 
> I propose the work with UTF white chars should be same like ASCII white
> chars. The current design is too simple - with possible pretty bad issues.
> Daniel's example is good - there is big gap in design.

I think we should consider followings before going forward:

1) Does PostgreSQL treat non ASCII white spaces same as ASCII white  spaces anyware in the system? If not, there's no
reasonwe should  think format() and quote_indent() are exception.
 

2) What does the SQL standard say? Do they say that non ASCII white  spaces should be treated as ASCII white spaces?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: Why format() adds double quote?

From

Pavel Stehule

Date:

27 January 2016, 07:37:22

2016-01-27 8:25 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:

>> What do you exactly propose regarding white chars and multibyte chars
>> here? Maybe you propose to consider non ASCII white spaces (treate
>> them as ASCII white spaces)?
>>
>
> I propose the work with UTF white chars should be same like ASCII white
> chars. The current design is too simple - with possible pretty bad issues.
> Daniel's example is good - there is big gap in design.

I think we should consider followings before going forward:

1) Does PostgreSQL treat non ASCII white spaces same as ASCII white
spaces anyware in the system? If not, there's no reason we should
think format() and quote_indent() are exception.

2) What does the SQL standard say? Do they say that non ASCII white
spaces should be treated as ASCII white spaces?

I am not sure, if SQL standard say some about it. But I am sure, so using unescaped or unclosed UTF8 spaces is bug, dangerous, wrong

Pavel

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: Why format() adds double quote?

From

"Daniel Verite"

Date:

27 January 2016, 15:37:59

Tatsuo Ishii wrote:

> 2) What does the SQL standard say? Do they say that non ASCII white
>   spaces should be treated as ASCII white spaces?

I've used white space in the example, but I'm concerned about
punctuation too.

unicode.org has this helpful paper:
http://www.unicode.org/L2/L2000/00260-sql.pdf
which studies Unicode in SQL-99 identifiers.

The relevant BNF they extracted from the standard looks like this:

identifier body> ::=  <identifier start>  [ { <underscore> | <identifier part> }... ]

<identifier start> ::=  <initial alphabetic character>  | <ideographic character>

<identifier part> ::=   <alphabetic character>   | <ideographic character>   | <decimal digit character>   |
<identifiercombining character>   | <underscore>   | <alternate underscore>   | <extender character>   | <identifier
ignorablecharacter>   | <connector character> 

<delimited identifier> ::=  <double quote> <delimited identifier body> <double quote>

<delimited identifier body> ::= <delimited identifier part>...

<delimited identifier part> ::=  <nondoublequote character>  | <doublequote symbol>

========

The current version of quote_ident() plays it safe by implementing
the rule that, as soon it encounters a character outside
of US-ASCII, it surrounds the identifier with double quotes, no matter
to which category or block this character belongs.
So its output is guaranteed to be compatible with the above grammar.

The change in the patch is that multibyte characters just don't imply
quoting.

But according to the points 1 and 2 of the paper, the first character
must have the Unicode alphabetic property, and it must not
have the Unicode combining property.

I'm mostly ignorant in Unicode so I'm not sure of the precise
implications of having such Unicode properties, but still my
understanding is that the new quote_ident() ignores these rules,
so in this sense it could produce outputs that wouldn't be
compatible with SQL-99.

Also, here's what we say in the manual about non quoted identifiers:
http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html

"SQL identifiers and key words must begin with a letter (a-z, but also
letters with diacritical marks and non-Latin letters) or an underscore
(_). Subsequent characters in an identifier or key word can be
letters, underscores, digits (0-9), or dollar signs ($)"

So it explicitly allows letters in general  (and also seems less
strict than SQL-99 about underscore), but it makes no promise about
Unicode punctuation or spaces, for instance, even though in practice
the parser seems to accept them just fine.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Re: Why format() adds double quote?

From

"Daniel Verite"

Date:

27 January 2016, 16:09:02

    Tatsuo Ishii wrote:

> What is the "visual hint"? If you are talking about psql's output, it
> never adds "visual hint" (double quotations).
>
> If you are talking about the string handling in a program, what kind
> of program cares about "visiual"?

Sure. The scenario I'm thinking about would be someone inspecting
generated queries, say for controlling or troubleshooting,
the queries containing identifiers injected with the help of
quote_ident/format. That could be from the logs, or
monitoring or audit tools.

If identifiers contain weird Unicode characters, currently
that's relatively easy to spot because they get surrounded by
double quotes.

If I see something like this: UPDATE "test․table" SET...
I immediately think that there's something fishy. It looks like test
should be a schema, but the surrounding quotes say otherwise.
In any case, it's clear that it updates a table in the current schema.

But if I see that: UPDATE test․table SET...
is seems legit and seems to update "table" inside the "test" schema
even though that's not what it does, it actually updates the
"test․table" table in the current schema, because the dot between
test and table is not the US-ASCII U+002E,  it's U+2024,
'ONE DOT LEADER'
On my display, they are almost indiscernible.

This boils down to the fact that the current quote_ident gives:

=# select quote_ident('test․table');quote_ident
--------------"test․table"

whereas the quote_ident patched as proposed gives:

=# select quote_ident('test․table');quote_ident
-------------test․table

So this is what I don't feel good about.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Re: Why format() adds double quote?

From

Tom Lane

Date:

27 January 2016, 16:54:37

"Daniel Verite" <daniel@manitou-mail.org> writes:
> This boils down to the fact that the current quote_ident gives:

> =# select quote_ident('test․table');
>  quote_ident  
> --------------
>  "test․table"

> whereas the quote_ident patched as proposed gives:

> =# select quote_ident('test․table');
>  quote_ident 
> -------------
>  test․table

> So this is what I don't feel good about.

This patch was originally proposed as a simple, cost-free change,
but it's becoming obvious that it is no such thing.  I think
we should probably reject it and move on.
        regards, tom lane

Re: Why format() adds double quote?

From

"Dickson S. Guedes"

Date:

27 January 2016, 22:47:07

2016-01-26 23:40 GMT-02:00 Tatsuo Ishii <ishii@postgresql.org>:
>> Thanks for advocate, I see here that it even produces that output with
>> simple spaces.
>>
>> postgres=# create table x ("aí      " text);
>> CREATE TABLE
>> postgres=# \d x
>>         Tabela "public.x"
>>   Coluna  | Tipo | Modificadores
>> ----------+------+---------------
>>  aí       | text |
>>
>>
>> This will break copy&paste user actions and scripts that parses that output.
>>
>> Maybe the patch should consider left/right non-printable chars to
>> choose whether to show or not the " ?
>
> This is a totally different story from the topic discussed in this
> thread. psql never adds double quotations to column name even with
> upper case col names.

Indeed, you are right.

> If you want to change the existing psql's behavior, propose it
> yourself.

It could be interesting, maybe using a \pset quote_columns_char, I'll
think about, thank you.

Best regards.
--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://github.com/guedes - http://guedesoft.net
http://www.postgresql.org.br

Re: Why format() adds double quote?

From

Tatsuo Ishii

Date:

28 January 2016, 00:00:46

> I've used white space in the example, but I'm concerned about
> punctuation too.
> 
> unicode.org has this helpful paper:
> http://www.unicode.org/L2/L2000/00260-sql.pdf
> which studies Unicode in SQL-99 identifiers.
> 
> The relevant BNF they extracted from the standard looks like this:
> 
> identifier body> ::=
>    <identifier start>
>    [ { <underscore> | <identifier part> }... ]
> 
> <identifier start> ::=
>    <initial alphabetic character>
>    | <ideographic character>
> 
> <identifier part> ::=
>     <alphabetic character>
>     | <ideographic character>
>     | <decimal digit character>
>     | <identifier combining character>
>     | <underscore>
>     | <alternate underscore>
>     | <extender character>
>     | <identifier ignorable character>
>     | <connector character>
> 
> <delimited identifier> ::=
>    <double quote> <delimited identifier body> <double quote>
> 
> <delimited identifier body> ::= <delimited identifier part>...
> 
> <delimited identifier part> ::=
>    <nondoublequote character>
>    | <doublequote symbol>
> 
> ========
> 
> The current version of quote_ident() plays it safe by implementing
> the rule that, as soon it encounters a character outside
> of US-ASCII, it surrounds the identifier with double quotes, no matter
> to which category or block this character belongs.
> So its output is guaranteed to be compatible with the above grammar.
> 
> The change in the patch is that multibyte characters just don't imply
> quoting.
> 
> But according to the points 1 and 2 of the paper, the first character
> must have the Unicode alphabetic property, and it must not
> have the Unicode combining property.

Good point.

> I'm mostly ignorant in Unicode so I'm not sure of the precise
> implications of having such Unicode properties, but still my
> understanding is that the new quote_ident() ignores these rules,
> so in this sense it could produce outputs that wouldn't be
> compatible with SQL-99.
> 
> Also, here's what we say in the manual about non quoted identifiers:
> http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html
> 
> "SQL identifiers and key words must begin with a letter (a-z, but also
> letters with diacritical marks and non-Latin letters) or an underscore
> (_). Subsequent characters in an identifier or key word can be
> letters, underscores, digits (0-9), or dollar signs ($)"
> 
> So it explicitly allows letters in general  (and also seems less
> strict than SQL-99 about underscore), but it makes no promise about
> Unicode punctuation or spaces, for instance, even though in practice
> the parser seems to accept them just fine.

You could arbitary extend your point, not only with Unicode
punctuation or spaces, There are number of characters look-alike "-"
in Unicode, for example.　Do we want to treat them like ASCII "-"?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: Why format() adds double quote?

From

Tatsuo Ishii

Date:

28 January 2016, 00:05:47

> "Daniel Verite" <daniel@manitou-mail.org> writes:
>> This boils down to the fact that the current quote_ident gives:
> 
>> =# select quote_ident('test․table');
>>  quote_ident  
>> --------------
>>  "test․table"
> 
>> whereas the quote_ident patched as proposed gives:
> 
>> =# select quote_ident('test․table');
>>  quote_ident 
>> -------------
>>  test․table
> 
>> So this is what I don't feel good about.
> 
> This patch was originally proposed as a simple, cost-free change,
> but it's becoming obvious that it is no such thing.  I think
> we should probably reject it and move on.

It seems I opend a can of worms. I'm going to reject my proposal
myself.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp