Thread: Why format() adds double quote?
test=# select format('%I', t) from t1; format ----------aaa"AAA""あいう" (3 rows) Why is the text value of the third line needed to be double quoted? (note that it is a multi byte character). Same thing can be said to quote_ident(). We treat identifiers made of the multi byte characters without double quotation (non delimited identifier) in other places. test=# create table t2(あいう text); CREATE TABLE test=# insert into t2 values('aaa'); INSERT 0 1 test=# select あいう from t2;あいう --------aaa (1 row) Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
2016-01-20 3:47 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:
test=# select format('%I', t) from t1;
format
----------
aaa
"AAA"
"あいう"
(3 rows)
Why is the text value of the third line needed to be double quoted?
(note that it is a multi byte character). Same thing can be said to
quote_ident().
We treat identifiers made of the multi byte characters without double
quotation (non delimited identifier) in other places.
test=# create table t2(あいう text);
CREATE TABLE
test=# insert into t2 values('aaa');
INSERT 0 1
test=# select あいう from t2;
あいう
--------
aaa
(1 row)
format uses same routine as quote_ident. So quote_ident should be fixed first.
Regards
Pavel
Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
> 2016-01-20 3:47 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>: > >> test=# select format('%I', t) from t1; >> format >> ---------- >> aaa >> "AAA" >> "あいう" >> (3 rows) >> >> Why is the text value of the third line needed to be double quoted? >> (note that it is a multi byte character). Same thing can be said to >> quote_ident(). >> >> We treat identifiers made of the multi byte characters without double >> quotation (non delimited identifier) in other places. >> >> test=# create table t2(あいう text); >> CREATE TABLE >> test=# insert into t2 values('aaa'); >> INSERT 0 1 >> test=# select あいう from t2; >> あいう >> -------- >> aaa >> (1 row) > > format uses same routine as quote_ident. So quote_ident should be fixed > first. Yes, I had that in my mind too. Attached is the proposed patch to fix the bug. Regression tests passed. Here is an example after the patch. Note that the third row is not quoted any more. test=# select format('%I', あいう) from t2;format --------aaa"AAA"あああ (3 rows) Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index 3783e97..b93fc27 100644 --- a/src/backend/utils/adt/ruleutils.c +++ b/src/backend/utils/adt/ruleutils.c @@ -9405,7 +9405,7 @@ quote_identifier(const char *ident) * would like to use <ctype.h> macros here, but they might yieldunwanted * locale-specific results... */ - safe = ((ident[0] >= 'a' && ident[0] <= 'z') || ident[0] == '_'); + safe = ((ident[0] >= 'a' && ident[0] <= 'z') || ident[0] == '_' || IS_HIGHBIT_SET(ident[0])); for (ptr = ident; *ptr;ptr++) { @@ -9413,7 +9413,8 @@ quote_identifier(const char *ident) if ((ch >= 'a' && ch <= 'z') || (ch >= '0' &&ch <= '9') || - (ch == '_')) + (ch == '_') || + (IS_HIGHBIT_SET(ch))) { /* okay */ }
Hi
2016-01-20 7:20 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:
> 2016-01-20 3:47 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:
>
>> test=# select format('%I', t) from t1;
>> format
>> ----------
>> aaa
>> "AAA"
>> "あいう"
>> (3 rows)
>>
>> Why is the text value of the third line needed to be double quoted?
>> (note that it is a multi byte character). Same thing can be said to
>> quote_ident().
>>
>> We treat identifiers made of the multi byte characters without double
>> quotation (non delimited identifier) in other places.
>>
>> test=# create table t2(あいう text);
>> CREATE TABLE
>> test=# insert into t2 values('aaa');
>> INSERT 0 1
>> test=# select あいう from t2;
>> あいう
>> --------
>> aaa
>> (1 row)
>
> format uses same routine as quote_ident. So quote_ident should be fixed
> first.
Yes, I had that in my mind too.
Attached is the proposed patch to fix the bug.
Regression tests passed.
Here is an example after the patch. Note that the third row is not
quoted any more.
test=# select format('%I', あいう) from t2;
format
--------
aaa
"AAA"
あああ
(3 rows)
Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 3783e97..b93fc27 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -9405,7 +9405,7 @@ quote_identifier(const char *ident)
* would like to use <ctype.h> macros here, but they might yield unwanted
* locale-specific results...
*/
- safe = ((ident[0] >= 'a' && ident[0] <= 'z') || ident[0] == '_');
+ safe = ((ident[0] >= 'a' && ident[0] <= 'z') || ident[0] == '_' || IS_HIGHBIT_SET(ident[0]));
for (ptr = ident; *ptr; ptr++)
{
@@ -9413,7 +9413,8 @@ quote_identifier(const char *ident)
if ((ch >= 'a' && ch <= 'z') ||
(ch >= '0' && ch <= '9') ||
- (ch == '_'))
+ (ch == '_') ||
+ (IS_HIGHBIT_SET(ch)))
{
/* okay */
}
This patch ls simply - I remember I was surprised, so we allow any multibyte char few months ago.
+1
+1
Pavel
> Hi > > 2016-01-20 7:20 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>: > >> > 2016-01-20 3:47 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>: >> > >> >> test=# select format('%I', t) from t1; >> >> format >> >> ---------- >> >> aaa >> >> "AAA" >> >> "あいう" >> >> (3 rows) >> >> >> >> Why is the text value of the third line needed to be double quoted? >> >> (note that it is a multi byte character). Same thing can be said to >> >> quote_ident(). >> >> >> >> We treat identifiers made of the multi byte characters without double >> >> quotation (non delimited identifier) in other places. >> >> >> >> test=# create table t2(あいう text); >> >> CREATE TABLE >> >> test=# insert into t2 values('aaa'); >> >> INSERT 0 1 >> >> test=# select あいう from t2; >> >> あいう >> >> -------- >> >> aaa >> >> (1 row) >> > >> > format uses same routine as quote_ident. So quote_ident should be fixed >> > first. >> >> Yes, I had that in my mind too. >> >> Attached is the proposed patch to fix the bug. >> Regression tests passed. >> >> Here is an example after the patch. Note that the third row is not >> quoted any more. >> >> test=# select format('%I', あいう) from t2; >> format >> -------- >> aaa >> "AAA" >> あああ >> (3 rows) >> >> Best regards, >> -- >> Tatsuo Ishii >> SRA OSS, Inc. Japan >> English: http://www.sraoss.co.jp/index_en.php >> Japanese:http://www.sraoss.co.jp >> >> diff --git a/src/backend/utils/adt/ruleutils.c >> b/src/backend/utils/adt/ruleutils.c >> index 3783e97..b93fc27 100644 >> --- a/src/backend/utils/adt/ruleutils.c >> +++ b/src/backend/utils/adt/ruleutils.c >> @@ -9405,7 +9405,7 @@ quote_identifier(const char *ident) >> * would like to use <ctype.h> macros here, but they might yield >> unwanted >> * locale-specific results... >> */ >> - safe = ((ident[0] >= 'a' && ident[0] <= 'z') || ident[0] == '_'); >> + safe = ((ident[0] >= 'a' && ident[0] <= 'z') || ident[0] == '_' || >> IS_HIGHBIT_SET(ident[0])); >> >> for (ptr = ident; *ptr; ptr++) >> { >> @@ -9413,7 +9413,8 @@ quote_identifier(const char *ident) >> >> if ((ch >= 'a' && ch <= 'z') || >> (ch >= '0' && ch <= '9') || >> - (ch == '_')) >> + (ch == '_') || >> + (IS_HIGHBIT_SET(ch))) >> { >> /* okay */ >> } >> >> > This patch ls simply - I remember I was surprised, so we allow any > multibyte char few months ago. > > +1 If we would go this way, question is if we should back patch this or not since the patch apparently changes the existing behaviors. Comments? I would think we should not. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
2016-01-20 10:17 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:
If we would go this way, question is if we should back patch this or> Hi
>
> 2016-01-20 7:20 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:
>
>> > 2016-01-20 3:47 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:
>> >
>> >> test=# select format('%I', t) from t1;
>> >> format
>> >> ----------
>> >> aaa
>> >> "AAA"
>> >> "あいう"
>> >> (3 rows)
>> >>
>> >> Why is the text value of the third line needed to be double quoted?
>> >> (note that it is a multi byte character). Same thing can be said to
>> >> quote_ident().
>> >>
>> >> We treat identifiers made of the multi byte characters without double
>> >> quotation (non delimited identifier) in other places.
>> >>
>> >> test=# create table t2(あいう text);
>> >> CREATE TABLE
>> >> test=# insert into t2 values('aaa');
>> >> INSERT 0 1
>> >> test=# select あいう from t2;
>> >> あいう
>> >> --------
>> >> aaa
>> >> (1 row)
>> >
>> > format uses same routine as quote_ident. So quote_ident should be fixed
>> > first.
>>
>> Yes, I had that in my mind too.
>>
>> Attached is the proposed patch to fix the bug.
>> Regression tests passed.
>>
>> Here is an example after the patch. Note that the third row is not
>> quoted any more.
>>
>> test=# select format('%I', あいう) from t2;
>> format
>> --------
>> aaa
>> "AAA"
>> あああ
>> (3 rows)
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>>
>> diff --git a/src/backend/utils/adt/ruleutils.c
>> b/src/backend/utils/adt/ruleutils.c
>> index 3783e97..b93fc27 100644
>> --- a/src/backend/utils/adt/ruleutils.c
>> +++ b/src/backend/utils/adt/ruleutils.c
>> @@ -9405,7 +9405,7 @@ quote_identifier(const char *ident)
>> * would like to use <ctype.h> macros here, but they might yield
>> unwanted
>> * locale-specific results...
>> */
>> - safe = ((ident[0] >= 'a' && ident[0] <= 'z') || ident[0] == '_');
>> + safe = ((ident[0] >= 'a' && ident[0] <= 'z') || ident[0] == '_' ||
>> IS_HIGHBIT_SET(ident[0]));
>>
>> for (ptr = ident; *ptr; ptr++)
>> {
>> @@ -9413,7 +9413,8 @@ quote_identifier(const char *ident)
>>
>> if ((ch >= 'a' && ch <= 'z') ||
>> (ch >= '0' && ch <= '9') ||
>> - (ch == '_'))
>> + (ch == '_') ||
>> + (IS_HIGHBIT_SET(ch)))
>> {
>> /* okay */
>> }
>>
>>
> This patch ls simply - I remember I was surprised, so we allow any
> multibyte char few months ago.
>
> +1
not since the patch apparently changes the existing
behaviors. Comments? I would think we should not.
I am sure, so we should not backport this change. This can breaks customer regress tests - and the current behave isn't 100% correct, but it is safe.
Pavel
Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
On Wed, Jan 20, 2016 at 4:20 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: >> If we would go this way, question is if we should back patch this or >> not since the patch apparently changes the existing >> behaviors. Comments? I would think we should not. > > I am sure, so we should not backport this change. This can breaks customer > regress tests - and the current behave isn't 100% correct, but it is safe. Quite. This is not a bug fix. It's a behavior change, perhaps for the better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Wed, Jan 20, 2016 at 4:20 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: >>> If we would go this way, question is if we should back patch this or >>> not since the patch apparently changes the existing >>> behaviors. Comments? I would think we should not. >> >> I am sure, so we should not backport this change. This can breaks customer >> regress tests - and the current behave isn't 100% correct, but it is safe. > > Quite. This is not a bug fix. It's a behavior change, perhaps for the better. Added to the commitfest 2016-03. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
2016-01-24 8:04 GMT-02:00 Tatsuo Ishii <ishii@postgresql.org>: >> On Wed, Jan 20, 2016 at 4:20 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: >>>> If we would go this way, question is if we should back patch this or >>>> not since the patch apparently changes the existing >>>> behaviors. Comments? I would think we should not. >>> >>> I am sure, so we should not backport this change. This can breaks customer >>> regress tests - and the current behave isn't 100% correct, but it is safe. >> >> Quite. This is not a bug fix. It's a behavior change, perhaps for the better. > > Added to the commitfest 2016-03. Hi, I gone ahead a little and tested this patch and it works like was proposed, I agree that it's not a bug fix but a new behavior so -1 for backport. While applying patch against master (1129c2b0ad2732f301f696ae2cf98fb063a4c1f8) it offsets two hunks. Since format() has regression tests I suggest that one should be added to cover this. It could worth to add the new behavior to the docs, since there no explicit example for %I. I performed the follow tests that works as expected using some Portuguese words: postgres=# create table test (nome varchar, endereço text, "UF" varchar(2), título varchar); CREATE TABLE Time: 80,769 ms postgres=# select format('%I', attname) from pg_attribute join pg_class on (attrelid = oid) where relname = 'test'; format ----------"UF"cmaxcminctidendereçonometableoidtítuloxmaxxmin (10 rows) Time: 1,728 ms postgres=# select format('%I', 'endereco'); format ----------endereco (1 row) Time: 0,098 ms postgres=# select format('%I', 'endereço'); format ----------endereço (1 row) Time: 0,088 ms postgres=# select format('%I', 'あああ');format --------あああ (1 row) Time: 0,072 ms postgres=# select format('%I', 'título');format --------título (1 row) Time: 0,051 ms postgres=# select format('%I', 'título e'); format ------------"título e" (1 row) Time: 0,051 ms postgres=# select format('%I', 'título_e'); format ----------título_e (1 row) Time: 0,051 ms postgres=# select format('%I', '_título');format ---------_título (1 row) Time: 0,047 ms postgres=# select format('%I', '1_título'); format ------------"1_título" (1 row) Time: 0,046 ms Thank you for this! Best regards, -- Dickson S. Guedes mail/xmpp: guedes@guedesoft.net - skype: guediz http://github.com/guedes - http://guedesoft.net http://www.postgresql.org.br
> 2016-01-24 8:04 GMT-02:00 Tatsuo Ishii <ishii@postgresql.org>: >>> On Wed, Jan 20, 2016 at 4:20 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: >>>>> If we would go this way, question is if we should back patch this or >>>>> not since the patch apparently changes the existing >>>>> behaviors. Comments? I would think we should not. >>>> >>>> I am sure, so we should not backport this change. This can breaks customer >>>> regress tests - and the current behave isn't 100% correct, but it is safe. >>> >>> Quite. This is not a bug fix. It's a behavior change, perhaps for the better. >> >> Added to the commitfest 2016-03. > > Hi, > > I gone ahead a little and tested this patch and it works like was > proposed, I agree that it's not a bug fix but a new behavior so -1 for > backport. IMO, it's a bug or at least an inconsistency but I admit it's too late to back patch to existing stable branches. > While applying patch against master > (1129c2b0ad2732f301f696ae2cf98fb063a4c1f8) it offsets two hunks. > > Since format() has regression tests I suggest that one should be added > to cover this. I don't think it's doable. The test requires to handle multiple database encodings. The regression test framework handles only one database encoding. Probably adding to the existing mb test is the easiest. > It could worth to add the new behavior to the docs, > since there no explicit example for %I. > I performed the follow tests that works as expected using some Portuguese words: I assume you used UTF-8 encoding database. Great. > postgres=# create table test (nome varchar, endereço text, "UF" > varchar(2), título varchar); > CREATE TABLE > Time: 80,769 ms > postgres=# select format('%I', attname) from pg_attribute join > pg_class on (attrelid = oid) where relname = 'test'; > format > ---------- > "UF" > cmax > cmin > ctid > endereço > nome > tableoid > título > xmax > xmin > (10 rows) > > Time: 1,728 ms > postgres=# select format('%I', 'endereco'); > format > ---------- > endereco > (1 row) > > Time: 0,098 ms > postgres=# select format('%I', 'endereço'); > format > ---------- > endereço > (1 row) > > Time: 0,088 ms > postgres=# select format('%I', 'あああ'); > format > -------- > あああ > (1 row) > > Time: 0,072 ms > postgres=# select format('%I', 'título'); > format > -------- > título > (1 row) > > Time: 0,051 ms > postgres=# select format('%I', 'título e'); > format > ------------ > "título e" > (1 row) > > Time: 0,051 ms > postgres=# select format('%I', 'título_e'); > format > ---------- > título_e > (1 row) > > Time: 0,051 ms > postgres=# select format('%I', '_título'); > format > --------- > _título > (1 row) > > Time: 0,047 ms > postgres=# select format('%I', '1_título'); > format > ------------ > "1_título" > (1 row) > > Time: 0,046 ms > > > Thank you for this! > > > Best regards, > -- > Dickson S. Guedes > mail/xmpp: guedes@guedesoft.net - skype: guediz > http://github.com/guedes - http://guedesoft.net > http://www.postgresql.org.br
Tatsuo Ishii wrote: > IMO, it's a bug or at least an inconsistency Personally I don't see this change being good for everything. Let's play devil's advocate: create table abc(U&"foo\2003" int); U+2003 is 'EM SPACE', in Unicode's General Punctuation block. With the current version, format('%I', attname) on this column is: "foo " With the patched version, it produces this: foo So the visual hint that there are more characters at the end is lost. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
2016-01-26 5:29 GMT-02:00 Tatsuo Ishii <ishii@postgresql.org>: > > I assume you used UTF-8 encoding database. Yes, I do. -- Dickson S. Guedes mail/xmpp: guedes@guedesoft.net - skype: guediz http://github.com/guedes - http://guedesoft.net http://www.postgresql.org.br
2016-01-26 18:00 GMT-02:00 Daniel Verite <daniel@manitou-mail.org>: > ... > create table abc(U&"foo\2003" int); > > U+2003 is 'EM SPACE', in Unicode's General Punctuation block. > > With the current version, format('%I', attname) on this column is: > "foo " > > With the patched version, it produces this: > foo > > So the visual hint that there are more characters at the end is lost. Thanks for advocate, I see here that it even produces that output with simple spaces. postgres=# create table x ("aí " text); CREATE TABLE postgres=# \d x Tabela "public.x" Coluna | Tipo | Modificadores ----------+------+---------------aí | text | This will break copy&paste user actions and scripts that parses that output. Maybe the patch should consider left/right non-printable chars to choose whether to show or not the " ? []s -- Dickson S. Guedes mail/xmpp: guedes@guedesoft.net - skype: guediz http://github.com/guedes - http://guedesoft.net http://www.postgresql.org.br
> Thanks for advocate, I see here that it even produces that output with > simple spaces. > > postgres=# create table x ("aí " text); > CREATE TABLE > postgres=# \d x > Tabela "public.x" > Coluna | Tipo | Modificadores > ----------+------+--------------- > aí | text | > > > This will break copy&paste user actions and scripts that parses that output. > > Maybe the patch should consider left/right non-printable chars to > choose whether to show or not the " ? This is a totally different story from the topic discussed in this thread. psql never adds double quotations to column name even with upper case col names. test=# create table t6("ABC" int); CREATE TABLE test=# \d t6 Table "public.t6"Column | Type | Modifiers --------+---------+-----------ABC | integer | If you want to change the existing psql's behavior, propose it yourself. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
>> IMO, it's a bug or at least an inconsistency > > Personally I don't see this change being good for everything. > > Let's play devil's advocate: > > create table abc(U&"foo\2003" int); > > U+2003 is 'EM SPACE', in Unicode's General Punctuation block. > > With the current version, format('%I', attname) on this column is: > "foo " > > With the patched version, it produces this: > foo > > So the visual hint that there are more characters at the end is lost. What is the "visual hint"? If you are talking about psql's output, it never adds "visual hint" (double quotations). If you are talking about the string handling in a program, what kind of program cares about "visiual"? Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
2016-01-26 21:00 GMT+01:00 Daniel Verite <daniel@manitou-mail.org>:
Tatsuo Ishii wrote:
> IMO, it's a bug or at least an inconsistency
Personally I don't see this change being good for everything.
Let's play devil's advocate:
create table abc(U&"foo\2003" int);
U+2003 is 'EM SPACE', in Unicode's General Punctuation block.
With the current version, format('%I', attname) on this column is:
"foo "
With the patched version, it produces this:
foo
So the visual hint that there are more characters at the end is lost.
I can agree, so current behave can be useful in some cases, but still it is bug (inconsistency) between PostgreSQL parser and PostgreSQL escaping functions.
Currently, any multibyte char can be unescaped identifier (only apostrophes are tested). We should to test white chars too.
Regards
Pavel
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
> 2016-01-26 21:00 GMT+01:00 Daniel Verite <daniel@manitou-mail.org>: > >> Tatsuo Ishii wrote: >> >> > IMO, it's a bug or at least an inconsistency >> >> Personally I don't see this change being good for everything. >> >> Let's play devil's advocate: >> >> create table abc(U&"foo\2003" int); >> >> U+2003 is 'EM SPACE', in Unicode's General Punctuation block. >> >> With the current version, format('%I', attname) on this column is: >> "foo " >> >> With the patched version, it produces this: >> foo >> >> So the visual hint that there are more characters at the end is lost. >> > > I can agree, so current behave can be useful in some cases, but still it is > bug (inconsistency) between PostgreSQL parser and PostgreSQL escaping > functions. > > Currently, any multibyte char can be unescaped identifier (only apostrophes > are tested). We should to test white chars too. Really? I thought we do that test. test=# create table t6("あいう えお" int); CREATE TABLE test=# \d t6 Table "public.t6" Column | Type | Modifiers -------------+---------+-----------あいう えお | integer | -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
2016-01-27 6:13 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:
> 2016-01-26 21:00 GMT+01:00 Daniel Verite <daniel@manitou-mail.org>:
>
>> Tatsuo Ishii wrote:
>>
>> > IMO, it's a bug or at least an inconsistency
>>
>> Personally I don't see this change being good for everything.
>>
>> Let's play devil's advocate:
>>
>> create table abc(U&"foo\2003" int);
>>
>> U+2003 is 'EM SPACE', in Unicode's General Punctuation block.
>>
>> With the current version, format('%I', attname) on this column is:
>> "foo "
>>
>> With the patched version, it produces this:
>> foo
>>
>> So the visual hint that there are more characters at the end is lost.
>>
>
> I can agree, so current behave can be useful in some cases, but still it is
> bug (inconsistency) between PostgreSQL parser and PostgreSQL escaping
> functions.
>
> Currently, any multibyte char can be unescaped identifier (only apostrophes
> are tested). We should to test white chars too.
Really? I thought we do that test.
what you are expecting from this test? UTF single quotes are tested only in quote functions probably.
Pavel
test=# create table t6("あいう えお" int);
CREATE TABLE
test=# \d t6
Table "public.t6"
Column | Type | Modifiers
-------------+---------+-----------
あいう えお | integer |--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
>> > I can agree, so current behave can be useful in some cases, but still it >> is >> > bug (inconsistency) between PostgreSQL parser and PostgreSQL escaping >> > functions. >> > >> > Currently, any multibyte char can be unescaped identifier (only >> apostrophes >> > are tested). We should to test white chars too. >> >> Really? I thought we do that test. >> > > what you are expecting from this test? UTF single quotes are tested only in > quote functions probably. I just wanted to demonstrate multibyte chars including ASCII white spaces can be an identifier. > We should to test white chars too. What do you exactly propose regarding white chars and multibyte chars here? Maybe you propose to consider non ASCII white spaces (treate them as ASCII white spaces)? Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp > Pavel > > >> >> test=# create table t6("あいう えお" int); >> CREATE TABLE >> test=# \d t6 >> Table "public.t6" >> Column | Type | Modifiers >> -------------+---------+----------- >> あいう えお | integer | >> -- >> Tatsuo Ishii >> SRA OSS, Inc. Japan >> English: http://www.sraoss.co.jp/index_en.php >> Japanese:http://www.sraoss.co.jp >>
2016-01-27 6:24 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:
>> > I can agree, so current behave can be useful in some cases, but still it
>> is
>> > bug (inconsistency) between PostgreSQL parser and PostgreSQL escaping
>> > functions.
>> >
>> > Currently, any multibyte char can be unescaped identifier (only
>> apostrophes
>> > are tested). We should to test white chars too.
>>
>> Really? I thought we do that test.
>>
>
> what you are expecting from this test? UTF single quotes are tested only in
> quote functions probably.
I just wanted to demonstrate multibyte chars including ASCII white
spaces can be an identifier.
I understand now.
> We should to test white chars too.
What do you exactly propose regarding white chars and multibyte chars
here? Maybe you propose to consider non ASCII white spaces (treate
them as ASCII white spaces)?
I propose the work with UTF white chars should be same like ASCII white chars. The current design is too simple - with possible pretty bad issues. Daniel's example is good - there is big gap in design.
Regards
Pavel
Best regards,--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
> Pavel
>
>
>>
>> test=# create table t6("あいう えお" int);
>> CREATE TABLE
>> test=# \d t6
>> Table "public.t6"
>> Column | Type | Modifiers
>> -------------+---------+-----------
>> あいう えお | integer |
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>>
>> What do you exactly propose regarding white chars and multibyte chars >> here? Maybe you propose to consider non ASCII white spaces (treate >> them as ASCII white spaces)? >> > > I propose the work with UTF white chars should be same like ASCII white > chars. The current design is too simple - with possible pretty bad issues. > Daniel's example is good - there is big gap in design. I think we should consider followings before going forward: 1) Does PostgreSQL treat non ASCII white spaces same as ASCII white spaces anyware in the system? If not, there's no reasonwe should think format() and quote_indent() are exception. 2) What does the SQL standard say? Do they say that non ASCII white spaces should be treated as ASCII white spaces? Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
2016-01-27 8:25 GMT+01:00 Tatsuo Ishii <ishii@postgresql.org>:
>> What do you exactly propose regarding white chars and multibyte chars
>> here? Maybe you propose to consider non ASCII white spaces (treate
>> them as ASCII white spaces)?
>>
>
> I propose the work with UTF white chars should be same like ASCII white
> chars. The current design is too simple - with possible pretty bad issues.
> Daniel's example is good - there is big gap in design.
I think we should consider followings before going forward:
1) Does PostgreSQL treat non ASCII white spaces same as ASCII white
spaces anyware in the system? If not, there's no reason we should
think format() and quote_indent() are exception.
+1
2) What does the SQL standard say? Do they say that non ASCII white
spaces should be treated as ASCII white spaces?
I am not sure, if SQL standard say some about it. But I am sure, so using unescaped or unclosed UTF8 spaces is bug, dangerous, wrong
Pavel
Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
Tatsuo Ishii wrote: > 2) What does the SQL standard say? Do they say that non ASCII white > spaces should be treated as ASCII white spaces? I've used white space in the example, but I'm concerned about punctuation too. unicode.org has this helpful paper: http://www.unicode.org/L2/L2000/00260-sql.pdf which studies Unicode in SQL-99 identifiers. The relevant BNF they extracted from the standard looks like this: identifier body> ::= <identifier start> [ { <underscore> | <identifier part> }... ] <identifier start> ::= <initial alphabetic character> | <ideographic character> <identifier part> ::= <alphabetic character> | <ideographic character> | <decimal digit character> | <identifiercombining character> | <underscore> | <alternate underscore> | <extender character> | <identifier ignorablecharacter> | <connector character> <delimited identifier> ::= <double quote> <delimited identifier body> <double quote> <delimited identifier body> ::= <delimited identifier part>... <delimited identifier part> ::= <nondoublequote character> | <doublequote symbol> ======== The current version of quote_ident() plays it safe by implementing the rule that, as soon it encounters a character outside of US-ASCII, it surrounds the identifier with double quotes, no matter to which category or block this character belongs. So its output is guaranteed to be compatible with the above grammar. The change in the patch is that multibyte characters just don't imply quoting. But according to the points 1 and 2 of the paper, the first character must have the Unicode alphabetic property, and it must not have the Unicode combining property. I'm mostly ignorant in Unicode so I'm not sure of the precise implications of having such Unicode properties, but still my understanding is that the new quote_ident() ignores these rules, so in this sense it could produce outputs that wouldn't be compatible with SQL-99. Also, here's what we say in the manual about non quoted identifiers: http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html "SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($)" So it explicitly allows letters in general (and also seems less strict than SQL-99 about underscore), but it makes no promise about Unicode punctuation or spaces, for instance, even though in practice the parser seems to accept them just fine. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
Tatsuo Ishii wrote: > What is the "visual hint"? If you are talking about psql's output, it > never adds "visual hint" (double quotations). > > If you are talking about the string handling in a program, what kind > of program cares about "visiual"? Sure. The scenario I'm thinking about would be someone inspecting generated queries, say for controlling or troubleshooting, the queries containing identifiers injected with the help of quote_ident/format. That could be from the logs, or monitoring or audit tools. If identifiers contain weird Unicode characters, currently that's relatively easy to spot because they get surrounded by double quotes. If I see something like this: UPDATE "test․table" SET... I immediately think that there's something fishy. It looks like test should be a schema, but the surrounding quotes say otherwise. In any case, it's clear that it updates a table in the current schema. But if I see that: UPDATE test․table SET... is seems legit and seems to update "table" inside the "test" schema even though that's not what it does, it actually updates the "test․table" table in the current schema, because the dot between test and table is not the US-ASCII U+002E, it's U+2024, 'ONE DOT LEADER' On my display, they are almost indiscernible. This boils down to the fact that the current quote_ident gives: =# select quote_ident('test․table');quote_ident --------------"test․table" whereas the quote_ident patched as proposed gives: =# select quote_ident('test․table');quote_ident -------------test․table So this is what I don't feel good about. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
"Daniel Verite" <daniel@manitou-mail.org> writes: > This boils down to the fact that the current quote_ident gives: > =# select quote_ident('test․table'); > quote_ident > -------------- > "test․table" > whereas the quote_ident patched as proposed gives: > =# select quote_ident('test․table'); > quote_ident > ------------- > test․table > So this is what I don't feel good about. This patch was originally proposed as a simple, cost-free change, but it's becoming obvious that it is no such thing. I think we should probably reject it and move on. regards, tom lane
2016-01-26 23:40 GMT-02:00 Tatsuo Ishii <ishii@postgresql.org>: >> Thanks for advocate, I see here that it even produces that output with >> simple spaces. >> >> postgres=# create table x ("aí " text); >> CREATE TABLE >> postgres=# \d x >> Tabela "public.x" >> Coluna | Tipo | Modificadores >> ----------+------+--------------- >> aí | text | >> >> >> This will break copy&paste user actions and scripts that parses that output. >> >> Maybe the patch should consider left/right non-printable chars to >> choose whether to show or not the " ? > > This is a totally different story from the topic discussed in this > thread. psql never adds double quotations to column name even with > upper case col names. Indeed, you are right. > If you want to change the existing psql's behavior, propose it > yourself. It could be interesting, maybe using a \pset quote_columns_char, I'll think about, thank you. Best regards. -- Dickson S. Guedes mail/xmpp: guedes@guedesoft.net - skype: guediz http://github.com/guedes - http://guedesoft.net http://www.postgresql.org.br
> I've used white space in the example, but I'm concerned about > punctuation too. > > unicode.org has this helpful paper: > http://www.unicode.org/L2/L2000/00260-sql.pdf > which studies Unicode in SQL-99 identifiers. > > The relevant BNF they extracted from the standard looks like this: > > identifier body> ::= > <identifier start> > [ { <underscore> | <identifier part> }... ] > > <identifier start> ::= > <initial alphabetic character> > | <ideographic character> > > <identifier part> ::= > <alphabetic character> > | <ideographic character> > | <decimal digit character> > | <identifier combining character> > | <underscore> > | <alternate underscore> > | <extender character> > | <identifier ignorable character> > | <connector character> > > <delimited identifier> ::= > <double quote> <delimited identifier body> <double quote> > > <delimited identifier body> ::= <delimited identifier part>... > > <delimited identifier part> ::= > <nondoublequote character> > | <doublequote symbol> > > ======== > > The current version of quote_ident() plays it safe by implementing > the rule that, as soon it encounters a character outside > of US-ASCII, it surrounds the identifier with double quotes, no matter > to which category or block this character belongs. > So its output is guaranteed to be compatible with the above grammar. > > The change in the patch is that multibyte characters just don't imply > quoting. > > But according to the points 1 and 2 of the paper, the first character > must have the Unicode alphabetic property, and it must not > have the Unicode combining property. Good point. > I'm mostly ignorant in Unicode so I'm not sure of the precise > implications of having such Unicode properties, but still my > understanding is that the new quote_ident() ignores these rules, > so in this sense it could produce outputs that wouldn't be > compatible with SQL-99. > > Also, here's what we say in the manual about non quoted identifiers: > http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html > > "SQL identifiers and key words must begin with a letter (a-z, but also > letters with diacritical marks and non-Latin letters) or an underscore > (_). Subsequent characters in an identifier or key word can be > letters, underscores, digits (0-9), or dollar signs ($)" > > So it explicitly allows letters in general (and also seems less > strict than SQL-99 about underscore), but it makes no promise about > Unicode punctuation or spaces, for instance, even though in practice > the parser seems to accept them just fine. You could arbitary extend your point, not only with Unicode punctuation or spaces, There are number of characters look-alike "-" in Unicode, for example. Do we want to treat them like ASCII "-"? Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
> "Daniel Verite" <daniel@manitou-mail.org> writes: >> This boils down to the fact that the current quote_ident gives: > >> =# select quote_ident('test․table'); >> quote_ident >> -------------- >> "test․table" > >> whereas the quote_ident patched as proposed gives: > >> =# select quote_ident('test․table'); >> quote_ident >> ------------- >> test․table > >> So this is what I don't feel good about. > > This patch was originally proposed as a simple, cost-free change, > but it's becoming obvious that it is no such thing. I think > we should probably reject it and move on. It seems I opend a can of worms. I'm going to reject my proposal myself. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp