Thread: Ltree syntax improvement
Dear all,
Please find attached the patch extending the sets of symbols allowed in ltree labels. The patch introduces 2 variants of escaping symbols, via backslashing separate symbols and via quoting the labels in whole for ltree, lquery and ltxtquery datatypes.
--
SY, Dmitry Belyavsky
Attachment
В письме от вторник, 29 января 2019 г. 20:43:07 MSK пользователь Dmitry Belyavsky написал: > Please find attached the patch extending the sets of symbols allowed in > ltree labels. The patch introduces 2 variants of escaping symbols, via > backslashing separate symbols and via quoting the labels in whole for > ltree, lquery and ltxtquery datatypes. Hi! Let's me join the review process... I've just give a superficial look to the patch, read it through, and try here and there, so first I'll give feedback about exterior features of the patch. So, the first question: why do we need all this? The answer seems to be obvious, but it would be good say it explicitly. What problems would it solve? Do these problems really exists? The second question is about coding style. As far as I can see you've passed your code through pg_indent, but it does not solve all problems. As far as I know, postgres commiters are quite strict about code width, the code should not be wider that 80 col, except places were string constants are introduced (There you can legally exceed 80 col). So I would advice to use vim with tabstop=4 and colorcolumn=80 and make sure that new code text does not cross red vertical line. Comments. There is no fixed coding style for comments in postgres. Usually this means that it is better to follow coding in the code around. But ltree does not have much comments, so I would suggest to use the best style I've seen in the code /* * [function name] * < tab ><tab> [Short description, what does it do] * * [long explanation how it works, several subparagraph if needed */ (Seen this in src/include/utils/rel.h) or something like that. At least it would be good to have explicit separation between descriptions and explanation of how it works. And about comment /* * Function calculating bytes to escape * to_escape is an array of "special" 1-byte symbols * Behvaiour: * If there is no "special" symbols, return 0 * If there are any special symbol, we need initial and final quote, so return 2 * If there are any quotes, we need to escape all of them and also initial and final quote, so * return 2 + number of quotes */ It has some formatting. But it should not. As far as I understand this comment should be treated as single paragraph by reformatter, and refomatted. I do not understand why pg_indent have not done it. Documentation (https:// www.postgresql.org/docs/11/source-format.html) claims that if you want to avoid reformatting, you should add "-------------" at the beginning and at the end of the comment. So it would be good either remove this formatting, or add "----" if you are sure you want to keep it. Third part is about error messages. First I've seen errors like elog(ERROR, "internal error during splitting levels");. I've never seen error like that in postgres. May be if this error is in the part of the code, that should never be reached in real live, may be it would be good to change it to Assert? Or if this code can be normally reached, add some more explanation of what had happened... About other error messages. There are messages like errmsg("syntax error"), errdetail("Unexpected end of line."))); or errmsg("escaping syntax error"))); In postgres there is error message style guide https://www.postgresql.org/docs/11/error-style-guide.html Here I would expect messages like that Primary: Error parsing ltree path Detail: Curly quote is opened and not closed Or something like that, I am not expert in English, but this way it would be better for sure. Key points are: Primary message should point to general area where error occurred (there can be a large SQL with a lot of things in it, so it is good to point that problem is with ltree staff). And is is better to word from the point of view of a user. What for you (and me) is unexpected end of line, for him it is unclosed curly quote. Also I am not sure, if it is easy, but if it is possible, it would be good to move "^" symbol that points to the place of the error, to the place inside ' ' where unclosed curly quote is located. As a user I would appreciate that. So that's what I've seen while giving it superficial look...
В письме от вторник, 29 января 2019 г. 20:43:07 MSK пользователь Dmitry Belyavsky написал: > Dear all, > > Please find attached the patch extending the sets of symbols allowed in > ltree labels. The patch introduces 2 variants of escaping symbols, via > backslashing separate symbols and via quoting the labels in whole for > ltree, lquery and ltxtquery datatypes. Sorry I still did not do full revision, I've stumbled up on other things in the code, which, as I think should be fixed before final checking through. (I guess that code do what is expected, since it passes tests and so on, it needs recheck, but I'd like to do this full recheck only once) So my doubts about this code is that a lot parts of code is just a copy-paste of the same code that was written right above. And it can be rewritten in a way that the same lines (or parts of lines) is repeated only once... This should make code more readable, and make code more similar to the code of postgres core, I've seen. Here I will speak about lquery_in function from ltree_io.c. 1. May be it is better here to switch from else if (state == LQPRS_[something]) to switch (stat) { case LQPRS_[something]: } because because here the main case is to determine what is the current state of your state machine, do something with it, and then go to the next step. Now after you've done what should be done in this step, you should make sure that no other code is executed writing more ifs... If you use switch/case you will be able to say break wherever you like, it will save you some ifs. Theoretical example (cl is for character length fc is for fist char) if (cl==1 && fc =='1') {do_case1();} else if (cl==1 && fc =='2') {do_case2();} else if (cl==1 && fc =='3') {do_case3();} else {do_otherwise();} If it is wrapped in switch/case it will be like if (cl==1) { if (fc =='1') {do_case1(); break;} if (fc =='2') {do_case2(); break;} if (fc =='3') {do_case3(); break;} } /*if nothing is found */ do_otherwise(); This for example will allow you to get rid of multiply charlen == 1 checks in else if ((lptr->flag & LVAR_QUOTEDPART) == 0) { if (charlen == 1 && t_iseq(ptr, '@')) { if (lptr->start == ptr) UNCHAR; lptr->flag |= LVAR_INCASE; curqlevel->flag |= LVAR_INCASE; } else if (charlen == 1 && t_iseq(ptr, '*')) { if (lptr->start == ptr) UNCHAR; lptr->flag |= LVAR_ANYEND; curqlevel->flag |= LVAR_ANYEND; } else if (charlen == 1 && t_iseq(ptr, '%')) { if (lptr->start == ptr) UNCHAR; lptr->flag |= LVAR_SUBLEXEME; curqlevel->flag |= LVAR_SUBLEXEME; } else if (charlen == 1 && t_iseq(ptr, '|')) { real_nodeitem_len(lptr, ptr, escaped_count, tail_space_bytes, tail_space_symbols); if (state == LQPRS_WAITDELIMSTRICT) adjust_quoted_nodeitem(lptr); check_level_length(lptr, pos); state = LQPRS_WAITVAR; } else if (charlen == 1 && t_iseq(ptr, '.')) { real_nodeitem_len(lptr, ptr, escaped_count, tail_space_bytes, tail_space_symbols); if (state == LQPRS_WAITDELIMSTRICT) adjust_quoted_nodeitem(lptr); check_level_length(lptr, pos); state = LQPRS_WAITLEVEL; curqlevel = NEXTLEV(curqlevel); } else if (charlen == 1 && t_iseq(ptr, '\\')) { if (state == LQPRS_WAITDELIMSTRICT) UNCHAR; state = LQPRS_WAITESCAPED; } else { if (charlen == 1 && strchr("!{}", *ptr)) UNCHAR; if (state == LQPRS_WAITDELIMSTRICT) { if (t_isspace(ptr)) { ptr += charlen; pos++; tail_space_bytes += charlen; tail_space_symbols = 1; continue; } UNCHAR; } if (lptr->flag & ~LVAR_QUOTEDPART) UNCHAR; } } There are a lot of them, I would suggest to reduce there number to one check in the right place. Similar thing is about "GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) * numOR);" in this part of code. if (charlen == 1) { if (t_iseq(ptr, '!')) { GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) * numOR); lptr->start = ptr + 1; state = LQPRS_WAITDELIM; curqlevel->numvar = 1; curqlevel->flag |= LQL_NOT; hasnot = true; } else if (t_iseq(ptr, '*')) state = LQPRS_WAITOPEN; else if (t_iseq(ptr, '\\')) { GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) * numOR); lptr->start = ptr; curqlevel->numvar = 1; state = LQPRS_WAITESCAPED; } else if (strchr(".|@%{}", *ptr)) { UNCHAR; } else { GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) * numOR); lptr->start = ptr; state = LQPRS_WAITDELIM; curqlevel->numvar = 1; if (t_iseq(ptr, '"')) { lptr->flag |= LVAR_QUOTEDPART; } } } else { GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) * numOR); lptr->start = ptr; state = LQPRS_WAITDELIM; curqlevel->numvar = 1; } It is repeated almost is every branch of this if-subtree... Can it be rewritten the way it is repeated only once? And so on. So my request is to reduce multiply repetition of the same code... PS. One other thing that I've been thinking over but did not came to final conclusion... It is often quite often case if (t_iseq(ptr, 'a'){same_code(); code_for_a();} else if (t_iseq(ptr, 'b'){same_code(); code_for_b();} else if (t_iseq(ptr, 'c'){same_code(); code_for_c);} else something_else(); I've been thinking if it is possible to move this to const unsigned char c=TOUCHAR(ptr); switch(c) { case 'a': case 'b': case 'c': same_code(); if (c=='a') {code_for_a();} if (c=='b') {code_for_b();} if (c=='c') {code_for_c();} break; default: something_else(); } I did not still get, how valid is replacement of t_iseq() with TOUCHAR() + "==". It is quite equivalent now, but I do not know how the core code is going to evolve, so can't get final understanding. I even tried to discuss it with Teodor (you've seen it) but still did come to any conclusion. So for now status of this idea is "my considerations about ways to deduplicate the code". So this is all essential things I can say about this patch as it is now. May be somebody can add something more...
Dear Nikolay,
On Wed, Feb 20, 2019 at 12:28 PM Nikolay Shaplov <dhyan@nataraj.su> wrote:
В письме от вторник, 29 января 2019 г. 20:43:07 MSK пользователь Dmitry
Belyavsky написал:
> Dear all,
>
> Please find attached the patch extending the sets of symbols allowed in
> ltree labels. The patch introduces 2 variants of escaping symbols, via
> backslashing separate symbols and via quoting the labels in whole for
> ltree, lquery and ltxtquery datatypes.
Sorry I still did not do full revision, I've stumbled up on other things in
the code, which, as I think should be fixed before final checking through. (I
guess that code do what is expected, since it passes tests and so on, it needs
recheck, but I'd like to do this full recheck only once)
So my doubts about this code is that a lot parts of code is just a copy-paste
of the same code that was written right above. And it can be rewritten in a
way that the same lines (or parts of lines) is repeated only once...
This should make code more readable, and make code more similar to the code of
postgres core, I've seen.
Here I will speak about lquery_in function from ltree_io.c.
1. May be it is better here to switch from else if (state ==
LQPRS_[something]) to switch (stat) { case LQPRS_[something]: }
because because here the main case is to determine what is the current state
of your state machine, do something with it, and then go to the next step.
Now after you've done what should be done in this step, you should make sure
that no other code is executed writing more ifs... If you use switch/case you
will be able to say break wherever you like, it will save you some ifs.
I thought about it writing the code. But it significantly increased the size of the patch
so I decided to follow the original coding style here.
Theoretical example
(cl is for character length fc is for fist char)
if (cl==1 && fc =='1') {do_case1();}
else if (cl==1 && fc =='2') {do_case2();}
else if (cl==1 && fc =='3') {do_case3();}
else {do_otherwise();}
If it is wrapped in switch/case it will be like
if (cl==1)
{
if (fc =='1') {do_case1(); break;}
if (fc =='2') {do_case2(); break;}
if (fc =='3') {do_case3(); break;}
}
/*if nothing is found */
do_otherwise();
This for example will allow you to get rid of multiply charlen == 1 checks in
else if ((lptr->flag & LVAR_QUOTEDPART) == 0)
{
if (charlen == 1 && t_iseq(ptr, '@'))
{
if (lptr->start == ptr)
UNCHAR;
lptr->flag |= LVAR_INCASE;
curqlevel->flag |= LVAR_INCASE;
}
else if (charlen == 1 && t_iseq(ptr, '*'))
{
if (lptr->start == ptr)
UNCHAR;
lptr->flag |= LVAR_ANYEND;
curqlevel->flag |= LVAR_ANYEND;
}
else if (charlen == 1 && t_iseq(ptr, '%'))
{
if (lptr->start == ptr)
UNCHAR;
lptr->flag |= LVAR_SUBLEXEME;
curqlevel->flag |= LVAR_SUBLEXEME;
}
else if (charlen == 1 && t_iseq(ptr, '|'))
{
real_nodeitem_len(lptr, ptr, escaped_count,
tail_space_bytes, tail_space_symbols);
if (state == LQPRS_WAITDELIMSTRICT)
adjust_quoted_nodeitem(lptr);
check_level_length(lptr, pos);
state = LQPRS_WAITVAR;
}
else if (charlen == 1 && t_iseq(ptr, '.'))
{
real_nodeitem_len(lptr, ptr, escaped_count,
tail_space_bytes, tail_space_symbols);
if (state == LQPRS_WAITDELIMSTRICT)
adjust_quoted_nodeitem(lptr);
check_level_length(lptr, pos);
state = LQPRS_WAITLEVEL;
curqlevel = NEXTLEV(curqlevel);
}
else if (charlen == 1 && t_iseq(ptr, '\\'))
{
if (state == LQPRS_WAITDELIMSTRICT)
UNCHAR;
state = LQPRS_WAITESCAPED;
}
else
{
if (charlen == 1 && strchr("!{}", *ptr))
UNCHAR;
if (state == LQPRS_WAITDELIMSTRICT)
{
if (t_isspace(ptr))
{
ptr += charlen;
pos++;
tail_space_bytes += charlen;
tail_space_symbols = 1;
continue;
}
UNCHAR;
}
if (lptr->flag & ~LVAR_QUOTEDPART)
UNCHAR;
}
}
Yes, this piece of code could be converted to the form you suggest, but
only partially, because the final else includes a strchr() call that,
being rewritten to the 'case' syntax will look ugly enough.
There are a lot of them, I would suggest to reduce there number to one check
in the right place.
Similar thing is about "GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);" in this part of code.
if (charlen == 1)
{
if (t_iseq(ptr, '!'))
{
GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);
lptr->start = ptr + 1;
state = LQPRS_WAITDELIM;
curqlevel->numvar = 1;
curqlevel->flag |= LQL_NOT;
hasnot = true;
}
else if (t_iseq(ptr, '*'))
state = LQPRS_WAITOPEN;
else if (t_iseq(ptr, '\\'))
{
GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);
lptr->start = ptr;
curqlevel->numvar = 1;
state = LQPRS_WAITESCAPED;
}
else if (strchr(".|@%{}", *ptr))
{
UNCHAR;
}
else
{
GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);
lptr->start = ptr;
state = LQPRS_WAITDELIM;
curqlevel->numvar = 1;
if (t_iseq(ptr, '"'))
{
lptr->flag |= LVAR_QUOTEDPART;
}
}
}
else
{
GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);
lptr->start = ptr;
state = LQPRS_WAITDELIM;
curqlevel->numvar = 1;
}
It is repeated almost is every branch of this if-subtree... Can it be
rewritten the way it is repeated only once?
Well, I see the only one so long sequence of 'if'.
And so on.
So my request is to reduce multiply repetition of the same code...
PS. One other thing that I've been thinking over but did not came to final
conclusion...
It is often quite often case
if (t_iseq(ptr, 'a'){same_code(); code_for_a();}
else if (t_iseq(ptr, 'b'){same_code(); code_for_b();}
else if (t_iseq(ptr, 'c'){same_code(); code_for_c);}
else something_else();
I've been thinking if it is possible to move this to
const unsigned char c=TOUCHAR(ptr);
switch(c)
{
case 'a':
case 'b':
case 'c':
same_code();
if (c=='a') {code_for_a();}
if (c=='b') {code_for_b();}
if (c=='c') {code_for_c();}
break;
default:
something_else();
}
I did not still get, how valid is replacement of t_iseq() with TOUCHAR() +
"==". It is quite equivalent now, but I do not know how the core code is going
to evolve, so can't get final understanding. I even tried to discuss it with
Teodor (you've seen it) but still did come to any conclusion.
I agree with Teodor here. We do not know about various charsets so
the current syntax seems to be more safe.
So for now status of this idea is "my considerations about ways to deduplicate
the code".
So this is all essential things I can say about this patch as it is now. May
be somebody can add something more...
SY, Dmitry Belyavsky
Hi Dmitry, This patch appears too complex to be submitted to the last CF, as Andres has also noted [1]. I have set the target version to PG13. Regards, -- -David david@pgmasters.net [1] https://www.postgresql.org/message-id/20190216054526.zss2cufdxfeudr4i%40alap3.anarazel.de
On Tue, Feb 5, 2019 at 2:27 PM Nikolay Shaplov <dhyan@nataraj.su> wrote:
В письме от вторник, 29 января 2019 г. 20:43:07 MSK пользователь Dmitry
Belyavsky написал:
> Please find attached the patch extending the sets of symbols allowed in
> ltree labels. The patch introduces 2 variants of escaping symbols, via
> backslashing separate symbols and via quoting the labels in whole for
> ltree, lquery and ltxtquery datatypes.
Hi!
Let's me join the review process...
I've just give a superficial look to the patch, read it through, and try here
and there, so first I'll give feedback about exterior features of the patch.
So, the first question: why do we need all this? The answer seems to be
obvious, but it would be good say it explicitly. What problems would it solve?
Do these problems really exists?
Yes, it does. We maintain an extension (https://github.com/adjust/wltree) which has a fixed separator (::) and allows any utf-8 character in the tree.
In our case we currently use our extended tree to store user-defined hierarchies of labels, which might contain whitespace, Chinese, Japanese, or Korean characters, etc.
I would love to be able to deprecate our work on this extension and eventually use stock ltree.
The second question is about coding style. As far as I can see you've passed
your code through pg_indent, but it does not solve all problems.
As far as I know, postgres commiters are quite strict about code width, the
code should not be wider that 80 col, except places were string constants are
introduced (There you can legally exceed 80 col). So I would advice to use vim
with tabstop=4 and colorcolumn=80 and make sure that new code text does not
cross red vertical line.
Comments. There is no fixed coding style for comments in postgres. Usually this
means that it is better to follow coding in the code around. But ltree does
not have much comments, so I would suggest to use the best style I've seen in
the code
/*
* [function name]
* < tab ><tab> [Short description, what does it do]
*
* [long explanation how it works, several subparagraph if needed
*/
(Seen this in src/include/utils/rel.h)
or something like that.
At least it would be good to have explicit separation between descriptions and
explanation of how it works.
And about comment
/*
* Function calculating bytes to escape
* to_escape is an array of "special" 1-byte symbols
* Behvaiour:
* If there is no "special" symbols, return 0
* If there are any special symbol, we need initial and final quote, so return
2
* If there are any quotes, we need to escape all of them and also initial and
final quote, so
* return 2 + number of quotes
*/
It has some formatting. But it should not. As far as I understand this
comment should be treated as single paragraph by reformatter, and refomatted.
I do not understand why pg_indent have not done it. Documentation (https://
www.postgresql.org/docs/11/source-format.html) claims that if you want to
avoid reformatting, you should add "-------------" at the beginning and at the
end of the comment. So it would be good either remove this formatting, or add
"----" if you are sure you want to keep it.
Third part is about error messages.
First I've seen errors like elog(ERROR, "internal error during splitting
levels");. I've never seen error like that in postgres. May be if this error
is in the part of the code, that should never be reached in real live, may be
it would be good to change it to Assert? Or if this code can be normally
reached, add some more explanation of what had happened...
About other error messages.
There are messages like
errmsg("syntax error"),
errdetail("Unexpected end of line.")));
or
errmsg("escaping syntax error")));
In postgres there is error message style guide
https://www.postgresql.org/docs/11/error-style-guide.html
Here I would expect messages like that
Primary: Error parsing ltree path
Detail: Curly quote is opened and not closed
Or something like that, I am not expert in English, but this way it would be
better for sure. Key points are: Primary message should point to general area
where error occurred (there can be a large SQL with a lot of things in it, so
it is good to point that problem is with ltree staff). And is is better to
word from the point of view of a user. What for you (and me) is unexpected end
of line, for him it is unclosed curly quote.
Also I am not sure, if it is easy, but if it is possible, it would be good to
move "^" symbol that points to the place of the error, to the place inside ' '
where unclosed curly quote is located. As a user I would appreciate that.
So that's what I've seen while giving it superficial look...
Best Regards,
Chris Travers
Head of Database
Saarbrücker Straße 37a, 10405 Berlin
Good day. Sorry to pop in, but if you are active users of ltree, please let me know if you rely on the exclamation operator (negative matching)? I have just filed a patch, fixing very old bug - and it changes the logic substantially. https://commitfest.postgresql.org/23/2054/ - I'd be grateful for your comments (please use the other thread). On Thu, Mar 7, 2019 at 1:10 PM Chris Travers <chris.travers@adjust.com> wrote: > > > > On Tue, Feb 5, 2019 at 2:27 PM Nikolay Shaplov <dhyan@nataraj.su> wrote: >> >> В письме от вторник, 29 января 2019 г. 20:43:07 MSK пользователь Dmitry >> Belyavsky написал: >> >> > Please find attached the patch extending the sets of symbols allowed in >> > ltree labels. The patch introduces 2 variants of escaping symbols, via >> > backslashing separate symbols and via quoting the labels in whole for >> > ltree, lquery and ltxtquery datatypes. >> >> Hi! >> >> Let's me join the review process... >> >> I've just give a superficial look to the patch, read it through, and try here >> and there, so first I'll give feedback about exterior features of the patch. >> >> So, the first question: why do we need all this? The answer seems to be >> obvious, but it would be good say it explicitly. What problems would it solve? >> Do these problems really exists? > > > Yes, it does. We maintain an extension (https://github.com/adjust/wltree) which has a fixed separator (::) and allowsany utf-8 character in the tree. > > In our case we currently use our extended tree to store user-defined hierarchies of labels, which might contain whitespace,Chinese, Japanese, or Korean characters, etc. > > I would love to be able to deprecate our work on this extension and eventually use stock ltree. >> >> >> The second question is about coding style. As far as I can see you've passed >> your code through pg_indent, but it does not solve all problems. >> >> As far as I know, postgres commiters are quite strict about code width, the >> code should not be wider that 80 col, except places were string constants are >> introduced (There you can legally exceed 80 col). So I would advice to use vim >> with tabstop=4 and colorcolumn=80 and make sure that new code text does not >> cross red vertical line. >> >> Comments. There is no fixed coding style for comments in postgres. Usually this >> means that it is better to follow coding in the code around. But ltree does >> not have much comments, so I would suggest to use the best style I've seen in >> the code >> /* >> * [function name] >> * < tab ><tab> [Short description, what does it do] >> * >> * [long explanation how it works, several subparagraph if needed >> */ >> (Seen this in src/include/utils/rel.h) >> or something like that. >> At least it would be good to have explicit separation between descriptions and >> explanation of how it works. >> >> And about comment >> /* >> * Function calculating bytes to escape >> * to_escape is an array of "special" 1-byte symbols >> * Behvaiour: >> * If there is no "special" symbols, return 0 >> * If there are any special symbol, we need initial and final quote, so return >> 2 >> * If there are any quotes, we need to escape all of them and also initial and >> final quote, so >> * return 2 + number of quotes >> */ >> >> It has some formatting. But it should not. As far as I understand this >> comment should be treated as single paragraph by reformatter, and refomatted. >> I do not understand why pg_indent have not done it. Documentation (https:// >> www.postgresql.org/docs/11/source-format.html) claims that if you want to >> avoid reformatting, you should add "-------------" at the beginning and at the >> end of the comment. So it would be good either remove this formatting, or add >> "----" if you are sure you want to keep it. >> >> Third part is about error messages. >> First I've seen errors like elog(ERROR, "internal error during splitting >> levels");. I've never seen error like that in postgres. May be if this error >> is in the part of the code, that should never be reached in real live, may be >> it would be good to change it to Assert? Or if this code can be normally >> reached, add some more explanation of what had happened... >> >> About other error messages. >> >> There are messages like >> errmsg("syntax error"), >> errdetail("Unexpected end of line."))); >> >> or >> >> errmsg("escaping syntax error"))); >> >> >> In postgres there is error message style guide >> https://www.postgresql.org/docs/11/error-style-guide.html >> >> Here I would expect messages like that >> >> Primary: Error parsing ltree path >> Detail: Curly quote is opened and not closed >> >> Or something like that, I am not expert in English, but this way it would be >> better for sure. Key points are: Primary message should point to general area >> where error occurred (there can be a large SQL with a lot of things in it, so >> it is good to point that problem is with ltree staff). And is is better to >> word from the point of view of a user. What for you (and me) is unexpected end >> of line, for him it is unclosed curly quote. >> >> Also I am not sure, if it is easy, but if it is possible, it would be good to >> move "^" symbol that points to the place of the error, to the place inside ' ' >> where unclosed curly quote is located. As a user I would appreciate that. >> >> So that's what I've seen while giving it superficial look... >> >> > > > -- > Best Regards, > Chris Travers > Head of Database > > Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com > Saarbrücker Straße 37a, 10405 Berlin >
В письме от четверг, 7 марта 2019 г. 13:09:49 MSK пользователь Chris Travers написал: > We maintain an extension (https://github.com/adjust/wltree) > which has a fixed separator (::) and allows any utf-8 character in the tree. > > In our case we currently use our extended tree to store user-defined > hierarchies of labels, which might contain whitespace, Chinese, Japanese, > or Korean characters, etc. > > I would love to be able to deprecate our work on this extension and > eventually use stock ltree. I am afraid, that if would not be possible to have ltree with custom separator. Or we need to pass a very long way to reach there. To have custom separator we will need some place to store it. We can use GUC variable, and set separator for ltree globally. It might solve some of the problems. But will get some more. If for example we dump database, and then restore it on instance where that GUC option is not set, everything will be brocken. It is not opclass option (that are discussed here on the list), because opclass options is no about parsing, but about comparing things that was already parsed. It is not option of an attribute (if we can have custom attribute option), because we can have ltree as a variable, without creating an attribute... It should be an option of ltree type itself. I have no idea how we can implement this. And who will allow us to commit such strange thing ;-)
В письме от воскресенье, 24 февраля 2019 г. 14:31:55 MSK пользователь Dmitry Belyavsky написал: Hi! Am back here again. I've been thinking about this patch a while... Come to some conclusions and wrote some examples... First I came to idea that the best way to simplify the code is change the state machine from if to switch/case. Because in your code a lot of repetition is done just because you can't say "Thats all, let's go to next symbol" in any place in the code. Now it should follow all the branches of if-else tree that is inside the state-machine "node" to get to the next symbol. To show how simpler the things would be I changed the state machine processing in lquery_in form if to switch/case, and changed the code for LQPRS_WAITLEVEL state processing, removing duplicate code, using "break" wherever it is possible (The indention in this example is unproper to make diff more clear) so from that much of code ================= if (state == LQPRS_WAITLEVEL) { if (t_isspace(ptr)) { ptr += charlen; pos++; continue; } escaped_count = 0; real_levels++; if (charlen == 1) { if (t_iseq(ptr, '!')) { GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) * numOR); lptr->start = ptr + 1; state = LQPRS_WAITDELIM; curqlevel->numvar = 1; curqlevel->flag |= LQL_NOT; hasnot = true; } else if (t_iseq(ptr, '*')) state = LQPRS_WAITOPEN; else if (t_iseq(ptr, '\\')) { GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) * numOR); lptr->start = ptr; curqlevel->numvar = 1; state = LQPRS_WAITESCAPED; } else if (strchr(".|@%{}", *ptr)) { UNCHAR; } else { GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) * numOR); lptr->start = ptr; state = LQPRS_WAITDELIM; curqlevel->numvar = 1; if (t_iseq(ptr, '"')) { lptr->flag |= LVAR_QUOTEDPART; } } } else { GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) * numOR); lptr->start = ptr; state = LQPRS_WAITDELIM; curqlevel->numvar = 1; } } ===================== I came to this ===================== case LQPRS_WAITLEVEL: if (t_isspace(ptr)) break; /* Just go to next symbol */ escaped_count = 0; real_levels++; if (charlen == 1) { if (strchr(".|@%{}", *ptr)) UNCHAR; if (t_iseq(ptr, '*')) { state = LQPRS_WAITOPEN; break; } } GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) * numOR); lptr->start = ptr; curqlevel->numvar = 1; state = LQPRS_WAITDELIM; if (charlen == 1) { if (t_iseq(ptr, '\\')) { state = LQPRS_WAITESCAPED; break; } if (t_iseq(ptr, '!')) { lptr->start += 1 /*FIXME explain why */; curqlevel->flag |= LQL_NOT; hasnot = true; } else if (t_iseq(ptr, '"')) { lptr->flag |= LVAR_QUOTEDPART; } } break; ======= And this code is much more readable.... You can just read it aloud: -Spaces we just skip - on .|@%{} throws and exception - for asterisk change state and nothing else then for others allocate a buffer - for slash we just set a flag - for exclamation mark we set a flag and skip it - for double quote set a flag. All the logic are clear. And in your version of the code it is difficult to get the idea from the code that simple. So my suggestion is following: 1. Submit and commit the patch that changes state machines from ifs to switch/ cases, and nothing else. This will reindent the code, so in order to keep further changes visible, we should change nothing else. 2. Change your code to switch/cases, use break to deduplicate code wherever is possible, and commit it. I can join you while working with this (but then I am not sure I would be able to remain a reviewer...) I've also attached an example I've quoted above, formatted as a diff, so it would be easy to check how does it work. It should be applied after your patch. (I gave it a strange name ltree.not-a-patch hoping that commitfest software will not treat it a new version of a patch)
Attachment
Dear Nikolay,
Many thanks for your efforts!
On Sat, Apr 6, 2019 at 2:29 PM Nikolay Shaplov <dhyan@nataraj.su> wrote:
В письме от воскресенье, 24 февраля 2019 г. 14:31:55 MSK пользователь Dmitry
Belyavsky написал:
Hi! Am back here again.
I've been thinking about this patch a while... Come to some conclusions and
wrote some examples...
First I came to idea that the best way to simplify the code is change the
state machine from if to switch/case. Because in your code a lot of repetition
is done just because you can't say "Thats all, let's go to next symbol" in any
place in the code. Now it should follow all the branches of if-else tree that
is inside the state-machine "node" to get to the next symbol.
To show how simpler the things would be I changed the state machine processing
in lquery_in form if to switch/case, and changed the code for LQPRS_WAITLEVEL
state processing, removing duplicate code, using "break" wherever it is
possible
(The indention in this example is unproper to make diff more clear)
so from that much of code
=================
if (state == LQPRS_WAITLEVEL)
{
if (t_isspace(ptr))
{
ptr += charlen;
pos++;
continue;
}
escaped_count = 0;
real_levels++;
if (charlen == 1)
{
if (t_iseq(ptr, '!'))
{
GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);
lptr->start = ptr + 1;
state = LQPRS_WAITDELIM;
curqlevel->numvar = 1;
curqlevel->flag |= LQL_NOT;
hasnot = true;
}
else if (t_iseq(ptr, '*'))
state = LQPRS_WAITOPEN;
else if (t_iseq(ptr, '\\'))
{
GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);
lptr->start = ptr;
curqlevel->numvar = 1;
state = LQPRS_WAITESCAPED;
}
else if (strchr(".|@%{}", *ptr))
{
UNCHAR;
}
else
{
GETVAR(curqlevel) = lptr = (nodeitem *)
palloc0(sizeof(nodeitem) * numOR);
lptr->start = ptr;
state = LQPRS_WAITDELIM;
curqlevel->numvar = 1;
if (t_iseq(ptr, '"'))
{
lptr->flag |= LVAR_QUOTEDPART;
}
}
}
else
{
GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) *
numOR);
lptr->start = ptr;
state = LQPRS_WAITDELIM;
curqlevel->numvar = 1;
}
}
=====================
I came to this
=====================
case LQPRS_WAITLEVEL:
if (t_isspace(ptr))
break; /* Just go to next symbol */
escaped_count = 0;
real_levels++;
if (charlen == 1)
{
if (strchr(".|@%{}", *ptr))
UNCHAR;
if (t_iseq(ptr, '*'))
{
state = LQPRS_WAITOPEN;
break;
}
}
GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) *
numOR);
lptr->start = ptr;
curqlevel->numvar = 1;
state = LQPRS_WAITDELIM;
if (charlen == 1)
{
if (t_iseq(ptr, '\\'))
{
state = LQPRS_WAITESCAPED;
break;
}
if (t_iseq(ptr, '!'))
{
lptr->start += 1 /*FIXME explain why */;
curqlevel->flag |= LQL_NOT;
hasnot = true;
}
else if (t_iseq(ptr, '"'))
{
lptr->flag |= LVAR_QUOTEDPART;
}
}
break;
=======
And this code is much more readable.... You can just read it aloud:
-Spaces we just skip
- on .|@%{} throws and exception
- for asterisk change state and nothing else
then for others allocate a buffer
- for slash we just set a flag
- for exclamation mark we set a flag and skip it
- for double quote set a flag.
All the logic are clear. And in your version of the code it is difficult to get
the idea from the code that simple.
So my suggestion is following:
1. Submit and commit the patch that changes state machines from ifs to switch/
cases, and nothing else. This will reindent the code, so in order to keep
further changes visible, we should change nothing else.
2. Change your code to switch/cases, use break to deduplicate code wherever is
possible, and commit it.
I can join you while working with this (but then I am not sure I would be able
to remain a reviewer...)
I've also attached an example I've quoted above, formatted as a diff, so it
would be easy to check how does it work. It should be applied after your
patch.
(I gave it a strange name ltree.not-a-patch hoping that commitfest software
will not treat it a new version of a patch)
I've applied your patch.
From my point of view, there is no major difference between case and chain if here.
Neither case nor ifs allow extracting the common code to separate function - just because there seem to be no identical pieces of code.
So yes, your patch makes sense, but I do not see any advantages of applying it.
SY, Dmitry Belyavsky
On Wed, Apr 17, 2019 at 5:29 AM Dmitry Belyavsky <beldmit@gmail.com> wrote: > I've applied your patch. > From my point of view, there is no major difference between case and chain if here. > Neither case nor ifs allow extracting the common code to separate function - just because there seem to be no identicalpieces of code. Hi Dmitry, The documentation doesn't build[1], due to invalid XML. Since I'm here, here is some proof-reading of the English in the documentation: <para> - A <firstterm>label</firstterm> is a sequence of alphanumeric characters - and underscores (for example, in C locale the characters - <literal>A-Za-z0-9_</literal> are allowed). Labels must be less than 256 bytes - long. + A <firstterm>label</firstterm> is a sequence of characters. Labels must be + less than 256 symbols long. Label may contain any character supported by Postgres "fewer than 256 characters in length", and "<productname>PostgreSQL</productname>" + except <literal>\0</literal>. If label contains spaces, dots, lquery modifiers, "spaces, dots or lquery modifiers," + they may be <firstterm>escaped</firstterm>. Escaping can be done either by preceeding + backslash (<literal>\\</literal>) symbol, or by wrapping the label in whole in double + quotes (<literal>"</literal>). Initial and final unescaped whitespace is stripped. "Escaping can be done with either a preceding backslash [...] or by wrapping the whole label in double quotes [...]." </para> + During converting text into internal representations, wrapping double quotes "During conversion to internal representation, " + and escaping backslashes are removed. During converting internal + representations into text, if the label does not contain any special "During conversion from internal representation to text, " + symbols, it is printed as is. Otherwise, it is wrapped in quotes and, if + there are internal quotes, they are escaped with backslash. The list of special "escaped with backslashes." + <para> + Examples: <literal>42</literal>, <literal>"\\42"</literal>, + <literal>\\4\\2</literal>, <literal> 42 </literal> and <literal> "42" + </literal> will have the similar internal representation and, being "will all have the same internal representation and," + converted from internal representation, will become <literal>42</literal>. + Literal <literal>abc def</literal> will turn into <literal>"abc + def"</literal>. </para> [1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/555571856 -- Thomas Munro https://enterprisedb.com
Dear Thomas,
Thank you for your proofreading!
Please find the updated patch attached. It also contains the missing escaping.
On Mon, Jul 8, 2019 at 10:39 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Apr 17, 2019 at 5:29 AM Dmitry Belyavsky <beldmit@gmail.com> wrote:
> I've applied your patch.
> From my point of view, there is no major difference between case and chain if here.
> Neither case nor ifs allow extracting the common code to separate function - just because there seem to be no identical pieces of code.
Hi Dmitry,
The documentation doesn't build[1], due to invalid XML. Since I'm
here, here is some proof-reading of the English in the documentation:
<para>
- A <firstterm>label</firstterm> is a sequence of alphanumeric characters
- and underscores (for example, in C locale the characters
- <literal>A-Za-z0-9_</literal> are allowed). Labels must be less
than 256 bytes
- long.
+ A <firstterm>label</firstterm> is a sequence of characters. Labels must be
+ less than 256 symbols long. Label may contain any character
supported by Postgres
"fewer than 256 characters in length", and
"<productname>PostgreSQL</productname>"
+ except <literal>\0</literal>. If label contains spaces, dots,
lquery modifiers,
"spaces, dots or lquery modifiers,"
+ they may be <firstterm>escaped</firstterm>. Escaping can be done
either by preceeding
+ backslash (<literal>\\</literal>) symbol, or by wrapping the label
in whole in double
+ quotes (<literal>"</literal>). Initial and final unescaped
whitespace is stripped.
"Escaping can be done with either a preceding backslash [...] or by
wrapping the whole label in double quotes [...]."
</para>
+ During converting text into internal representations, wrapping
double quotes
"During conversion to internal representation, "
+ and escaping backslashes are removed. During converting internal
+ representations into text, if the label does not contain any special
"During conversion from internal representation to text, "
+ symbols, it is printed as is. Otherwise, it is wrapped in quotes and, if
+ there are internal quotes, they are escaped with backslash. The
list of special
"escaped with backslashes."
+ <para>
+ Examples: <literal>42</literal>, <literal>"\\42"</literal>,
+ <literal>\\4\\2</literal>, <literal> 42 </literal> and <literal> "42"
+ </literal> will have the similar internal representation and, being
"will all have the same internal representation and,"
+ converted from internal representation, will become <literal>42</literal>.
+ Literal <literal>abc def</literal> will turn into <literal>"abc
+ def"</literal>.
</para>
[1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/555571856
--
Thomas Munro
https://enterprisedb.com
SY, Dmitry Belyavsky
Attachment
On 2019-Jul-08, Dmitry Belyavsky wrote: > Dear Thomas, > > Thank you for your proofreading! > > Please find the updated patch attached. It also contains the missing > escaping. I think all these functions you're adding should have a short sentence explaining what it does. I'm not really convinced that we need this much testing. It seems a bit excessive. Many of these very focused test SQL lines could be removed with no loss of coverage, and many of the others could be grouped into one. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Dear Alvaro,
On Mon, Jul 8, 2019 at 11:16 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Jul-08, Dmitry Belyavsky wrote:
> Dear Thomas,
>
> Thank you for your proofreading!
>
> Please find the updated patch attached. It also contains the missing
> escaping.
I think all these functions you're adding should have a short sentence
explaining what it does.
I'm not really convinced that we need this much testing. It seems a bit
excessive. Many of these very focused test SQL lines could be removed
with no loss of coverage, and many of the others could be grouped into
one.
I did not introduce any functions. I've just changed the parser.
I'm not sure that it makes sense to remove any tests as most of them were written to catch really happened bugs during the implementation.
SY, Dmitry Belyavsky
On 2019-Jul-08, Dmitry Belyavsky wrote: > I did not introduce any functions. I've just changed the parser. I mean the C-level functions -- count_parts_ors() and so on. > I'm not sure that it makes sense to remove any tests as most of them were > written to catch really happened bugs during the implementation. Well, I don't mean to decrease the coverage, only to condense a lot of little tests in a small powerful test. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 8, 2019 at 11:33 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Jul-08, Dmitry Belyavsky wrote:
> I did not introduce any functions. I've just changed the parser.
I mean the C-level functions -- count_parts_ors() and so on.
Added a comment to count_parts_ors()
The other functions introduced by me were already described.
SY, Dmitry Belyavsky
Attachment
On Wed, Jul 10, 2019 at 7:40 AM Dmitry Belyavsky <beldmit@gmail.com> wrote: > [ltree_20190709.diff] Hi Dmitry, You need to update contrib/ltree_plpython/expected/ltree_plpython.out, otherwise check-world fails when built with Python support. The good news is that it looks like it fails because you fixed something! (Though I didn't check the details). CREATE FUNCTION test2() RETURNS ltree LANGUAGE plpythonu TRANSFORM FOR TYPE ltree AS $$ return ['foo', 'bar', 'baz'] $$; -- plpython to ltree is not yet implemented, so this will fail, -- because it will try to parse the Python list as an ltree input -- string. SELECT test2(); -ERROR: syntax error at position 0 -CONTEXT: while creating return value -PL/Python function "test2" + test2 +------------------------- + "['foo', 'bar', 'baz']" +(1 row) + -- Thomas Munro https://enterprisedb.com
Dear Thomas,
On Thu, Jul 11, 2019 at 11:20 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Jul 10, 2019 at 7:40 AM Dmitry Belyavsky <beldmit@gmail.com> wrote:
> [ltree_20190709.diff]
Hi Dmitry,
You need to update contrib/ltree_plpython/expected/ltree_plpython.out,
otherwise check-world fails when built with Python support. The good
news is that it looks like it fails because you fixed something!
(Though I didn't check the details).
CREATE FUNCTION test2() RETURNS ltree
LANGUAGE plpythonu
TRANSFORM FOR TYPE ltree
AS $$
return ['foo', 'bar', 'baz']
$$;
-- plpython to ltree is not yet implemented, so this will fail,
-- because it will try to parse the Python list as an ltree input
-- string.
SELECT test2();
-ERROR: syntax error at position 0
-CONTEXT: while creating return value
-PL/Python function "test2"
+ test2
+-------------------------
+ "['foo', 'bar', 'baz']"
+(1 row)
+
See attached. I'm not familiar enough with python so I just removed the failing tests.
If the main patch is accepted, the ltree_python extension should be redesigned, I think...
SY, Dmitry Belyavsky
Attachment
Hi! I have looked at the patch and found some problems. 1. I fixed some bugs (fixed patch with additional test cases is attached): -- NULL 'lptr' pointer dereference at lquery_in() =# SELECT '*'::lquery; -- crash -- '|' after '*{n}' is wrongly handled (LQPRS_WAITEND state) =# SELECT '*{1}|2'::lquery; WARNING: problem in alloc set MessageContext: req size > alloc size for chunk 0x1c2d508 in block 0x1c2bc58 WARNING: problem in alloc set MessageContext: req size > alloc size for chunk 0x1c2d508 in block 0x1c2bc58 lquery -------------*{1}.*{,48} (1 row) -- wrong handling of trailing whitespace =# SELECT '"a" '::ltree; ERROR: name of level is empty LINE 1: SELECT '"a" '::ltree; ^ DETAIL: Name length is 0 in position 4. =# SELECT '"a" '::lquery; ERROR: name of level is empty LINE 1: SELECT '"a" '::lquery; ^ DETAIL: Name length is 0 in position 4. -- backslashes are not escaped in ltree_out()/lquery_out(), -- which is not consistent with ltree_in()/lquery_in() =# SELECT '"\\"'::ltree;ltree -------"\" (1 row) =# SELECT '"\\"'::lquery;lquery --------"\" (1 row) =# SELECT '"\\"'::ltree::text::ltree; ERROR: syntax error DETAIL: Unexpected end of line. =# SELECT '"\\"'::lquery::text::lquery; ERROR: syntax error DETAIL: Unexpected end of line. 2. There are inconsistencies in whitespace handling before and after *, @, %, {} (I have not fixed this because I'm not sure how it is supposed to work): -- whitespace before '*' is not ignored =# SELECT '"a" *'::lquery;lquery --------"a\""* (1 row) =# SELECT 'a *'::lquery;lquery --------"a "* (1 row) -- whitespace after '*' and '{}' is disallowed =# SELECT 'a* .b'::lquery; ERROR: syntax error at position 2 LINE 1: SELECT 'a* .b'::lquery; ^ =# SELECT 'a* |b'::lquery; ERROR: syntax error at position 2 LINE 1: SELECT 'a* |b'::lquery; ^ =# SELECT '*{1} .a'::lquery; ERROR: syntax error at position 4 LINE 1: SELECT '*{1} .a'::lquery; ^ -- but the whitespace after levels without '*@%{}' is allowed =# SELECT 'a |b'::lquery;lquery --------a|b (1 row) 3. Empty level names between '!' and '|' are allowed. This behavior can be seen on master, so it seems that we cannot fix it now: -- master =# SELECT '!|a'::lquery;lquery --------!|a (1 row) -- patched =# SELECT '!|a'::lquery;lquery --------!""|a (1 row) -- empty level names in other places are disallowed =# SELECT '!a|'::lquery; ERROR: syntax error LINE 1: SELECT '!a|'::lquery; ^ DETAIL: Unexpected end of line. =# SELECT '|a'::lquery; ERROR: syntax error at position 0 LINE 1: SELECT '|a'::lquery; ^ 4. It looks strange to me that leading and trailing unquoted whitespace is ignored, but the internal whitespace is treated like a quoted: =# SELECT ' a b . c d '::lquery; lquery -----------------"a b"."c d" (1 row) I would prefer unquoted unescaped whitespace to be a delimiter always. 5. It seems wrong to me that ltree and lquery have different special character sets now. This leads to the fact that arbitrary ltree text cannot be used directly as lquery text, as it seemed to be before the syntax improvements: =# SELECT 'a|b'::ltree::text::lquery;lquery --------a|b (1 row) =# SELECT '"a|b"'::ltree::text::lquery;lquery --------a|b (1 row) =# SELECT '"a|b"'::lquery;lquery --------"a|b" (1 row) There might not be a problem if we had ltree::lquery cast. Also I think that text[]::ltree/ltree::text[] casts for ltree construction/deconstruction from text level names can be very useful. 6. ltree and escpecially lquery parsing code still look too complicated for me, and I believe that the bugs described above are a direct consequence of this. So the code needs to be refactored, maybe even without using of state machines.
On 11.07.2019 20:49, Dmitry Belyavsky wrote:
On Thu, Jul 11, 2019 at 11:20 AM Thomas Munro <thomas.munro@gmail.com> wrote:On Wed, Jul 10, 2019 at 7:40 AM Dmitry Belyavsky <beldmit@gmail.com> wrote:
> [ltree_20190709.diff]
You need to update contrib/ltree_plpython/expected/ltree_plpython.out,
otherwise check-world fails when built with Python support. The good
news is that it looks like it fails because you fixed something!
(Though I didn't check the details).
CREATE FUNCTION test2() RETURNS ltree
LANGUAGE plpythonu
TRANSFORM FOR TYPE ltree
AS $$
return ['foo', 'bar', 'baz']
$$;
-- plpython to ltree is not yet implemented, so this will fail,
-- because it will try to parse the Python list as an ltree input
-- string.
SELECT test2();
-ERROR: syntax error at position 0
-CONTEXT: while creating return value
-PL/Python function "test2"
+ test2
+-------------------------
+ "['foo', 'bar', 'baz']"
+(1 row)
+See attached. I'm not familiar enough with python so I just removed the failing tests.If the main patch is accepted, the ltree_python extension should be redesigned, I think...
7. ltree_plpython test does not fail now because Python list is converted to a text and then to a ltree, and the textual representation of a Python list has become a valid ltree text: SELECT $$['foo', 'bar', 'baz']$$::ltree; ltree -------------------------"['foo', 'bar', 'baz']" (1 row) So Python lists can be now successfully converted to ltrees without a transform, but the result is not that is expected ('foo.bar.baz'::ltree). -- Nikita Glukhov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Thu, Jul 18, 2019 at 1:28 AM Nikita Glukhov <n.gluhov@postgrespro.ru> wrote: > 1. I fixed some bugs (fixed patch with additional test cases is attached): Hi Nikita, Thanks. I have set this to "Needs review", in the September 'fest. -- Thomas Munro https://enterprisedb.com
Hi, This patch got mostly ignored since 2019-07 commitfest :-( The latest patch (sent by Nikita) does not apply because of a minor conflict in contrib/ltree/ltxtquery_io.c. I see the patch removes a small bit of ltree_plpython tests which would otherwise fail (with the "I don't know plpython" justification). Why not to instead update the tests to accept the new output? Or is it really the case that the case that we no longer need those tests? The patch also reworks some parts from "if" to "switch" statements. I agree switch statements are more readable, but maybe we should do this in two steps - first adopting the "switch" without changing the logic, and then making changes. But maybe that's an overkill. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Dear Tomas,
If the C part will be reviewed and considered mergeable, I'll update the plpython tests.
On Mon, Jan 6, 2020 at 4:49 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,
This patch got mostly ignored since 2019-07 commitfest :-( The latest
patch (sent by Nikita) does not apply because of a minor conflict in
contrib/ltree/ltxtquery_io.c.
I see the patch removes a small bit of ltree_plpython tests which would
otherwise fail (with the "I don't know plpython" justification). Why not
to instead update the tests to accept the new output? Or is it really
the case that the case that we no longer need those tests?
The patch also reworks some parts from "if" to "switch" statements. I
agree switch statements are more readable, but maybe we should do this
in two steps - first adopting the "switch" without changing the logic,
and then making changes. But maybe that's an overkill.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
SY, Dmitry Belyavsky
Dmitry Belyavsky <beldmit@gmail.com> writes: > If the C part will be reviewed and considered mergeable, I'll update the > plpython tests. I haven't looked at any of the code involved in this, but I did examine the "failing" plpython test, and I'm quite distressed about that change of behavior. Simply removing the test case is certainly not okay, and I do not think that just changing it to accept this new behavior is okay either. As Nikita said upthread: >> 7. ltree_plpython test does not fail now because Python list is converted to a >> text and then to a ltree, and the textual representation of a Python list has >> become a valid ltree text: >> >> SELECT $$['foo', 'bar', 'baz']$$::ltree; >> ltree >> ------------------------- >> "['foo', 'bar', 'baz']" >> (1 row) >> >> So Python lists can be now successfully converted to ltrees without a transform, >> but the result is not that is expected ('foo.bar.baz'::ltree). If this case doesn't throw an error, then we're going to have a compatibility problem whenever somebody finally gets around to implementing the python-to-ltree transform properly, because it would break any code that might be relying on this (wrong) behavior. In general, I think it's a mistake to allow unquoted punctuation to be taken as part of an ltree label, which is what this patch is evidently doing. By doing that, you'll make it impossible for anyone to ever again extend the ltree syntax, because if they want to assign special meaning to braces or whatever, they can't do so without breaking existing applications. For example, if the existing code allowed double-quote or backslash as a label character, we'd already have rejected this patch as being too big a compatibility break. So it's not very forward-thinking to close off future improvements like this. Thus, what I think you should do is require non-alphanumeric label characters to be quoted, either via double-quotes or backslashes (although it's questionable whether we really need two independent quoting mechanisms here). That would preserve extensibility, and it'd also preserve our freedom to fix ltree_plpython later, since the case of interest would still be an error for now. And it would mean that you don't have subtly different rules for what's data in ltree versus what's data in lquery or ltxtquery. BTW, the general rule in existing backend code that's doing string parsing is to allow non-ASCII (high-bit-set) characters to be taken as data without inquiring too closely as to what they are. This avoids a bunch of locale and encoding issues without much loss of flexibility. (If we do ever extend the ltree syntax again, we'd certainly choose ASCII punctuation characters for whatever special symbols we need, else the feature might not be available in all encodings.) So for instance in your examples involving "Ñ", it's fine to take that as a label character without concern for locale/encoding. I'm not sure what I think about the whitespace business. It looks like what you propose is to strip unquoted leading and trailing whitespace but allow embedded whitespace. There's precedent for that, certainly, but I wonder whether it isn't too confusing. In any case you didn't document that. regards, tom lane
Attached new version of the patch. I did major refactoring of ltree label parsing, extracting common parsing code for ltree, lquery, and ltxtquery. This greatly simplified state machines. On the advice of Tomas Vondra, I also extracted a preliminary patch with 'if' to 'switch' conversion.
On 21.01.2020 22:13, Tom Lane wrote:
Dmitry Belyavsky <beldmit@gmail.com> writes:If the C part will be reviewed and considered mergeable, I'll update the plpython tests.I haven't looked at any of the code involved in this, but I did examine the "failing" plpython test, and I'm quite distressed about that change of behavior. Simply removing the test case is certainly not okay, and I do not think that just changing it to accept this new behavior is okay either. As Nikita said upthread:7. ltree_plpython test does not fail now because Python list is converted to a text and then to a ltree, and the textual representation of a Python list has become a valid ltree text: SELECT $$['foo', 'bar', 'baz']$$::ltree; ltree ------------------------- "['foo', 'bar', 'baz']" (1 row) So Python lists can be now successfully converted to ltrees without a transform, but the result is not that is expected ('foo.bar.baz'::ltree).If this case doesn't throw an error, then we're going to have a compatibility problem whenever somebody finally gets around to implementing the python-to-ltree transform properly, because it would break any code that might be relying on this (wrong) behavior. In general, I think it's a mistake to allow unquoted punctuation to be taken as part of an ltree label, which is what this patch is evidently doing. By doing that, you'll make it impossible for anyone to ever again extend the ltree syntax, because if they want to assign special meaning to braces or whatever, they can't do so without breaking existing applications. For example, if the existing code allowed double-quote or backslash as a label character, we'd already have rejected this patch as being too big a compatibility break. So it's not very forward-thinking to close off future improvements like this. Thus, what I think you should do is require non-alphanumeric label characters to be quoted, either via double-quotes or backslashes (although it's questionable whether we really need two independent quoting mechanisms here). That would preserve extensibility, and it'd also preserve our freedom to fix ltree_plpython later, since the case of interest would still be an error for now. And it would mean that you don't have subtly different rules for what's data in ltree versus what's data in lquery or ltxtquery.
Now non-alphanumeric label characters should be escaped in ltree, lquery and ltxtquery. Plpython tests does not require changes now.
BTW, the general rule in existing backend code that's doing string parsing is to allow non-ASCII (high-bit-set) characters to be taken as data without inquiring too closely as to what they are. This avoids a bunch of locale and encoding issues without much loss of flexibility. (If we do ever extend the ltree syntax again, we'd certainly choose ASCII punctuation characters for whatever special symbols we need, else the feature might not be available in all encodings.) So for instance in your examples involving "Ñ", it's fine to take that as a label character without concern for locale/encoding.
I'm not sure what I think about the whitespace business. It looks like what you propose is to strip unquoted leading and trailing whitespace but allow embedded whitespace. There's precedent for that, certainly, but I wonder whether it isn't too confusing. In any case you didn't document that.
Embedded whitespace should also be escaped now. I'm also not sure about stripping unquoted leading and trailing whitespace.--
Attachment
Nikita Glukhov <n.gluhov@postgrespro.ru> writes: > Attached new version of the patch. I spent a little bit of time looking through this, and have a few comments: * You have a lot of places where tests for particular ASCII characters are done like this: if ((charlen == 1) && t_iseq(src, '\\')) This is a tedious and expensive way to spell if (*src == '\\') because charlen is necessarily 1 if you are looking at an ASCII character; there is no allowed backend encoding in which an ASCII character can be the first byte of a multibyte character. Aside from the direct simplifications of the tests that this makes possible, I see some places where you'd not have to pass charlen around, either. * I spent a fair amount of time thinking that a lot of the added code was wrong because it was only considering escaping and not double-quoting. I eventually concluded that the idea is to convert double-quoting to a pure escape-based representation during input and store it that way. However, I don't really see why that is either necessary or a good idea --- the internal storage already has a length counter for each label. So I think what you really ought to be doing here is simplifying out both quotes and escapes during ltree_in and just storing the notionally-represented string internally. (If I've misunderstood what the plan is, well the utter lack of documentation in the patch isn't helping.) * The added test cases seem a bit excessive and repetitive. regards, tom lane
On 25.03.2020 2:08, Tom Lane wrote:
Nikita Glukhov <n.gluhov@postgrespro.ru> writes:Attached new version of the patch.I spent a little bit of time looking through this, and have a few comments: * You have a lot of places where tests for particular ASCII characters are done like this: if ((charlen == 1) && t_iseq(src, '\\')) This is a tedious and expensive way to spell if (*src == '\\') because charlen is necessarily 1 if you are looking at an ASCII character; there is no allowed backend encoding in which an ASCII character can be the first byte of a multibyte character. Aside from the direct simplifications of the tests that this makes possible, I see some places where you'd not have to pass charlen around, either.
All unnecessary checks of charlen were removed, but t_iseq() were left for consistency.
* I spent a fair amount of time thinking that a lot of the added code was wrong because it was only considering escaping and not double-quoting. I eventually concluded that the idea is to convert double-quoting to a pure escape-based representation during input and store it that way. However, I don't really see why that is either necessary or a good idea --- the internal storage already has a length counter for each label. So I think what you really ought to be doing here is simplifying out both quotes and escapes during ltree_in and just storing the notionally-represented string internally. (If I've misunderstood what the plan is, well the utter lack of documentation in the patch isn't helping.)
ltree_in() removes quotes and escapes before storing strings (see copy_unescaped()), just as you suggest. ltree_out() adds escapes and quotes if necessary (see copy_escaped(), extra_bytes_for_escaping()). I have refactored code a bit, removed duplicated code, fixed several bugs in reallocation of output strings, and added some comments.
* The added test cases seem a bit excessive and repetitive.
I have removed some tests that have become redundant after changes in parsing.
--
Attachment
Nikita Glukhov <n.gluhov@postgrespro.ru> writes: > [ latest version of ltree syntax extension ] This is going to need another rebase after all the other ltree hacking that just got done. However, I did include 0001 (use a switch) in the commit I just pushed, so you don't need to worry about that. regards, tom lane
On 02.04.2020 2:46, Tom Lane wrote:
Nikita Glukhov <n.gluhov@postgrespro.ru> writes:[ latest version of ltree syntax extension ]This is going to need another rebase after all the other ltree hacking that just got done. However, I did include 0001 (use a switch) in the commit I just pushed, so you don't need to worry about that. regards, tom lane
Rebased patch attached.
I’m not sure whether it's worth to introduce one LTREE_TOK_SPECIAL for the whole set of special characters, and still check them with t_iseq(). -- Nikita Glukhov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Nikita Glukhov <n.gluhov@postgrespro.ru> writes: > Rebased patch attached. Thanks for rebasing! The cfbot's not very happy though: 4842ltxtquery_io.c: In function ‘makepol’: 4843ltxtquery_io.c:188:13: error: ‘escaped’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 4844 if (lenval - escaped <= 0) 4845 ^ 4846ltxtquery_io.c:230:6: note: ‘escaped’ was declared here 4847 int escaped; 4848 ^ 4849cc1: all warnings being treated as errors regards, tom lane
On 02.04.2020 19:16, Tom Lane wrote: > Nikita Glukhov <n.gluhov@postgrespro.ru> writes: >> Rebased patch attached. > Thanks for rebasing! The cfbot's not very happy though: > > 4842ltxtquery_io.c: In function ‘makepol’: > 4843ltxtquery_io.c:188:13: error: ‘escaped’ may be used uninitialized in this function [-Werror=maybe-uninitialized] > 4844 if (lenval - escaped <= 0) > 4845 ^ > 4846ltxtquery_io.c:230:6: note: ‘escaped’ was declared here > 4847 int escaped; > 4848 ^ > 4849cc1: all warnings being treated as errors > > regards, tom lane Fixed patch attached. -- Nikita Glukhov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Nikita Glukhov <n.gluhov@postgrespro.ru> writes: > Fixed patch attached. I looked through this again, and I still don't feel very happy with it. As I mentioned upthread, I'm not convinced that we ought to have *both* quoting and backslash-escaping in these datatypes. TIMTOWTDI may be a virtue to the Perl crowd, but to a lot of other people it's just confusing and extra complication. In particular it will translate directly to extra complication in every application that wants to use non-alphanumeric labels, since they'll have to be prepared to deparse whatever style the backend happens to put out. It's also adding a far from trivial amount of complication to the patch itself. And it adds bug-prone ambiguity; for instance, the patch fails to document whether backslashes do anything special within quotes. On balance, since the existing limitations on labels are an approximation of SQL identifier rules, I think it'd be a good idea to use SQL-identifier quoting rules; that is, you have to use quotes for anything that isn't alphanumeric, backslashes are not special, and the way to get a double quote that's data is to write two adjacent double quotes. That's simple and it doesn't create any confusion when writing an ltree value inside a SQL literal. It's less great if you're trying to write an ltree within an array or record literal value, but we can't have everything (and the backslash alternative would be no better for that anyway). I've still got mixed emotions about the exception of allowing whitespace to be stripped before or after a label. While that probably doesn't pose any forward-compatibility hazards (ie, it's unlikely we'd want to redefine such syntax to mean something different), I'm not sure it passes the least-confusion test. We were just getting beat up a couple days ago about the fact that record_in is lax about whitespace [1][2], so maybe we shouldn't make that same choice here. As far as the code itself goes, I don't think the code structure in ltree_io.c is very well chosen. There seem to be half a dozen routines that are all intimately dependent on the quoting/escaping rules, and it's really pretty hard to be sure that they're all in sync (and if they're not, that's going to lead to crashes and perhaps security-grade bugs). Moreover it looks like you end up making multiple passes over the input, so it's inefficient as well as hard to understand/maintain. Given the choice to have a separate frontend function that recognizes a label as a single token, it seems like it'd be better if that function were also responsible for generating a de-quoted label value as it went. I also feel that not enough attention has been paid to the error reporting. For instance, the patch changes the longstanding policy of reporting an overlength label with its end position: @@ -53,7 +54,7 @@ SELECT repeat('x', 255)::ltree; SELECT repeat('x', 256)::ltree; ERROR: label string is too long -DETAIL: Label length is 256, must be at most 255, at character 257. +DETAIL: Label length is 256, must be at most 255, at character 1. SELECT ltree2text('1.2.3.34.sdf'); ltree2text -------------- That might be an OK choice in a green field, but this isn't a green field, even if we did change the wording of the message since v12. I also noted some random inconsistencies such as regression=# select '1234567890*1234567890'::ltree; ERROR: ltree syntax error at character 11 LINE 1: select '1234567890*1234567890'::ltree; ^ regression=# select '1234567890+1234567890'::ltree; ERROR: ltree syntax error at character 11 LINE 1: select '1234567890+1234567890'::ltree; ^ DETAIL: Unexpected character Apparently that's because * is special to lquery while + isn't, but that distinction really shouldn't matter to ltree. (This DETAIL message isn't close to following project style for detail messages, either.) It'd probably be better if the frontend function didn't contain assumptions about which punctuation characters are important. Another thought here, though it's not really the fault of this patch, is that I really think ltree ought to go over to a treatment of non-ASCII characters that's more like the core parser's, ie anything that isn't ASCII is data (or assumed-to-be-alphanumeric, if you prefer). The trouble with the current definition is that it's dependent on LC_CTYPE, so labels that are acceptable to one database might not be acceptable elsewhere. We could remove that hazard, and as a bonus eliminate some possibly-expensive character classification tests, if we just said all non-ASCII characters are data. regards, tom lane [1] https://www.postgresql.org/message-id/BCE3F9FC-6BE9-44D8-AD25-F5FF109C7BD4@yugabyte.com [2] https://www.postgresql.org/message-id/158593753274.7901.11178770123847486396%40wrigleys.postgresql.org
> On 4 Apr 2020, at 01:26, Nikita Glukhov <n.gluhov@postgrespro.ru> wrote: > Fixed patch attached. This patch cause a regression in the ltree_plpython module, it needs the below trivial change to make it pass again: --- a/contrib/ltree_plpython/expected/ltree_plpython.out +++ b/contrib/ltree_plpython/expected/ltree_plpython.out @@ -39,5 +39,6 @@ $$; -- string. SELECT test2(); ERROR: ltree syntax error at character 1 +DETAIL: Unexpected character CONTEXT: while creating return value PL/Python function "test2" Please submit a rebased version of the patch. cheers ./daniel
On Wed, Jul 01, 2020 at 03:23:30PM +0200, Daniel Gustafsson wrote: > Please submit a rebased version of the patch. Which has not happened after two months, so I have marked the patch as returned with feedback. -- Michael