Thread: [RFC] nodeToString format and exporting the SQL parser

[RFC] nodeToString format and exporting the SQL parser

From
Michael Tharp
Date:
Most Esteemed Hackers:

Due to popular demand on #postgresql (by which I mean David Fetter), I 
have been spending a little time making the internal SQL parser 
available to clients via a C-language SQL function. The function itself 
is extremely simple: just a wrapper around a call to raw_parser followed 
by nodeToString. Most of the "hard stuff" has been in parsing the output 
of nodeToString on the client side. So, I have a few questions to help 
gauge interest in related patches:

Is there interest in a patch to extend nodes/outfuncs.c with support for 
serializing more node types? Coverage has been pretty good so far but 
various utility statements and their related nodes are missing, e.g. 
AlterTableStmt and GrantStmt. I expect that this will be the least 
contentious suggestion.

The nodeToString format as it stands is somewhat ambiguous with respect 
to the type of a node member's value if one does not have access to 
readfuncs.c. For example, a T_BitString called foo is serialized as 
':foo b1010' while a char * containing 'b1010' is also serialized as 
':foo b1010'. This may just mean that _outToken needs to escape the 
leading 'b'. A similar problem exists for booleans ('true' as a string 
vs. as a boolean).

Additionally, values may span more than one token for certain types e.g. 
Datum (":constvalue 4 [ 16 0 0 0 ]"). Plan trees have a few types that 
don't have a corresponding read function and output an array of 
space-separated integers. PlanInvalItem even seems to use a format 
containing parentheses, which the tokenizer splits as if it were a list. 
While most of these only occur in plan nodes and thus don't affect my 
use case (Datum being the exception), it would be ideal if they could be 
parsed more straightforwardly.

These last two problems perhaps can be worked around by escaping more 
things in _outToken, but maybe it would be smarter to make the fields 
self-descriptive in terms of type. For example, the field names could be 
prefixed with a short string describing its type, which in most cases 
would be a single character, e.g. 's:schemaname' for a char*, 'b:true' 
for a bool, 'n:...' for any node (including Value nodes), or longer 
strings for less commonly used types like the integer arrays in plan 
nodes (although these would probably be better as a real integer list). 
These could be used to unambiguously parse individual tokens and also to 
determine how many or what kind of token to expect for multi-token 
values such as Datum which would otherwise require guessing. Does this 
seem reasonable? Is there another format that might make more sense?

As far as I can tell, the current parser in nodes/read.c ignores the 
field names entirely, so this can be done without changing postgres' own 
parsing code at all and without affecting backwards compatibility of any 
stored trees. Does anyone else out there use nodeToString() output in 
their own tools, and if so, does this make your life easier or harder?

Lastly, I'll leave a link to my WIP implementation in case anyone is 
interested:  http://bitbucket.org/gxti/parse_sql/src/
Currently I'm working on adding support for cooked parse trees and 
figuring out what, if anything, I need to do to support multibyte 
encodings. My personal use is for parsing DDL so the input is decidedly 
not hostile but I'd still like to make this a generally useful module.

Thanks in advance for any comments, tips, or flames sent my way.

-- m. tharp


Re: [RFC] nodeToString format and exporting the SQL parser

From
Tom Lane
Date:
Michael Tharp <gxti@partiallystapled.com> writes:
> Due to popular demand on #postgresql (by which I mean David Fetter), I 
> have been spending a little time making the internal SQL parser 
> available to clients via a C-language SQL function. The function itself 
> is extremely simple: just a wrapper around a call to raw_parser followed 
> by nodeToString.

What exactly is the use-case for this?  Generally speaking I'm against
exposing that data structure to clients, because there will inevitably
be griping when we change it (as we most certainly will).  Your
complaints boil down to "this is hard to parse from the client side",
and that already tells me you're doing something that we will refuse
to support.
        regards, tom lane


Re: [RFC] nodeToString format and exporting the SQL parser

From
Markus Wanner
Date:
Hi,

Michael Tharp wrote:
> I have been spending a little time making the internal SQL parser 
> available to clients via a C-language SQL function.

This sounds very much like one of the Cluster Features:
http://wiki.postgresql.org/wiki/ClusterFeatures#API_into_the_Parser_.2F_Parser_as_an_independent_module

Is this what you (or David) have in mind?

Regards

Markus Wanner


Re: [RFC] nodeToString format and exporting the SQL parser

From
David Fetter
Date:
On Sat, Apr 03, 2010 at 03:17:30PM +0200, Markus Schiltknecht wrote:
> Hi,
> 
> Michael Tharp wrote:
> >I have been spending a little time making the internal SQL parser
> >available to clients via a C-language SQL function.
> 
> This sounds very much like one of the Cluster Features:
> http://wiki.postgresql.org/wiki/ClusterFeatures#API_into_the_Parser_.2F_Parser_as_an_independent_module
> 
> Is this what you (or David) have in mind?

I'm not a fan of statement-based replication of any description.  The
use cases I have in mind involve things like known-correct syntax
highlighting in text editors.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: [RFC] nodeToString format and exporting the SQL parser

From
Michael Tharp
Date:
On 04/02/2010 04:16 PM, Tom Lane wrote:
> Generally speaking I'm against
> exposing that data structure to clients, because there will inevitably
> be griping when we change it (as we most certainly will).  Your
> complaints boil down to "this is hard to parse from the client side",
> and that already tells me you're doing something that we will refuse
> to support.

Sorry, I got the impression that the node format was intended to be 
stable. If it's for internal use only then I don't suppose my 
suggestions make much sense.

As for the structure of the parse nodes themselves, changes aren't a big 
deal to me. What I'm doing is diffing entire files of DDL, so even if 
the node format changes over time the changes in two files parsed using 
the same code are still evident. In fact, the ambiguities in parsing 
don't really affect this use case much because I don't need exact values 
to detect changes.

Markus' link suggests using the parser for a load balancer, which would 
require that this all be in library form anyway as the balancer cannot 
afford a round trip to the server to parse the query, so maybe I should 
focus my efforts there. Making it a separate library would also resolve 
the compatibility issues as the library could keep its own structures 
regardless of where Postgres goes.

Thanks for the input!

-- m. tharp


Re: [RFC] nodeToString format and exporting the SQL parser

From
"Jehan-Guillaume (ioguix) de Rorthais"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 04/04/2010 18:10, David Fetter wrote:
> On Sat, Apr 03, 2010 at 03:17:30PM +0200, Markus Schiltknecht wrote:
>> Hi,
>>
>> Michael Tharp wrote:
>>> I have been spending a little time making the internal SQL parser
>>> available to clients via a C-language SQL function.
>>
>> This sounds very much like one of the Cluster Features:
>> http://wiki.postgresql.org/wiki/ClusterFeatures#API_into_the_Parser_.2F_Parser_as_an_independent_module
>>
>> Is this what you (or David) have in mind?
> 
> I'm not a fan of statement-based replication of any description.  The
> use cases I have in mind involve things like known-correct syntax
> highlighting in text editors.

The point here is not to expose the internal data structure, but to
deliver a tokenized version of the given SQL script.

There's actually many different use cases for external projects : - syntax highlighting - rewrite query with proper
indentation- replication - properly splitting queries from a script - define type of the query (SELECT ? UPDATE/DELETE
?DDL ?) - checking validity of a query before sending it - ...
 

In addition of PgPool needs, I can see 3 or 4 direct use cases for
pgAdmin and phpPgAdmin.

So it seems to me having the parser code in a shared library would be
very useful for external C projects which can link to it. However it
would be useless for other non-C projects which can't use it directly
but are connected to a PostgreSQL backend anyway (phpPgAdmin as instance).

What about having a new SQL command like TOKENIZE ? it would kinda act
like EXPLAIN but giving a tokenized version of the given SQL script. As
EXPLAIN, it could speak XML, YAML, JSON, you name it...

Each token could have : - a type ('identifier', 'string', 'sql command', 'sql keyword',
'variable'...) - the start position in the string - the value - the line number - ...

A simple example of a tokenizer is the php one: http://fr.php.net/token_get_all

And here is a basic example which return pseudo rows here :

=> TOKENIZE $script$   SELECT 1;   UPDATE test SET "a"=2; $script$;
  type      | pos |   value  | line
- -------------+-----+----------+------SQL_COMMAND | 1   | 'SELECT' |   1CONSTANT    | 8   | '1'      |   1DELIMITER
|9   | ';'      |   1SQL_COMMAND | 11  | 'UPDATE' |   2IDENTIFIER  | 18  | 'test'   |   2SQL_KEYWORD | 23  | 'SET'    |
 2IDENTIFIER  | 27  | '"a"'    |   2OPERATOR    | 30  | '='      |   2CONSTANT    | 31  | '1'      |   2
 

> 
> Cheers,
> David.

As a phpPgAdmin dev, I am thinking about this subject since a long time.
I am interested about trying to create such a patch after discussing it
and if you think it is doable.

- -- 
JGuillaume (ioguix) de Rorthais
http://www.dalibo.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkvPOJMACgkQxWGfaAgowiLrUACfa7qMVr3oiOVS7JfhTa1S9EqY
pYkAn3Sj6cezC/EdWPu2+kzrgjaDygGE
=oY1c
-----END PGP SIGNATURE-----


Re: [RFC] nodeToString format and exporting the SQL parser

From
Pavel Stehule
Date:
2010/4/21 Jehan-Guillaume (ioguix) de Rorthais <ioguix@free.fr>:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 04/04/2010 18:10, David Fetter wrote:
>> On Sat, Apr 03, 2010 at 03:17:30PM +0200, Markus Schiltknecht wrote:
>>> Hi,
>>>
>>> Michael Tharp wrote:
>>>> I have been spending a little time making the internal SQL parser
>>>> available to clients via a C-language SQL function.
>>>
>>> This sounds very much like one of the Cluster Features:
>>> http://wiki.postgresql.org/wiki/ClusterFeatures#API_into_the_Parser_.2F_Parser_as_an_independent_module
>>>
>>> Is this what you (or David) have in mind?
>>
>> I'm not a fan of statement-based replication of any description.  The
>> use cases I have in mind involve things like known-correct syntax
>> highlighting in text editors.
>
> The point here is not to expose the internal data structure, but to
> deliver a tokenized version of the given SQL script.
>
> There's actually many different use cases for external projects :
>  - syntax highlighting
>  - rewrite query with proper indentation
>  - replication
>  - properly splitting queries from a script
>  - define type of the query (SELECT ? UPDATE/DELETE ? DDL ?)
>  - checking validity of a query before sending it
>  - ...
>
> In addition of PgPool needs, I can see 3 or 4 direct use cases for
> pgAdmin and phpPgAdmin.
>
> So it seems to me having the parser code in a shared library would be
> very useful for external C projects which can link to it. However it
> would be useless for other non-C projects which can't use it directly
> but are connected to a PostgreSQL backend anyway (phpPgAdmin as instance).
>
> What about having a new SQL command like TOKENIZE ? it would kinda act
> like EXPLAIN but giving a tokenized version of the given SQL script. As
> EXPLAIN, it could speak XML, YAML, JSON, you name it...
>
> Each token could have :
>  - a type ('identifier', 'string', 'sql command', 'sql keyword',
> 'variable'...)
>  - the start position in the string
>  - the value
>  - the line number
>  - ...
>
> A simple example of a tokenizer is the php one:
>  http://fr.php.net/token_get_all
>
> And here is a basic example which return pseudo rows here :
>
> => TOKENIZE $script$
>    SELECT 1;
>    UPDATE test SET "a"=2;
>  $script$;
>

you don't need special command for this task .. function is enough

new SQL command is useless

http://www.pgsql.cz/index.php/Oracle_functionality_%28en%29#PLVlex

it can be very simple with new changes in parser.

Regards
Pavel Stehule



>   type      | pos |   value  | line
> - -------------+-----+----------+------
>  SQL_COMMAND | 1   | 'SELECT' |   1
>  CONSTANT    | 8   | '1'      |   1
>  DELIMITER   | 9   | ';'      |   1
>  SQL_COMMAND | 11  | 'UPDATE' |   2
>  IDENTIFIER  | 18  | 'test'   |   2
>  SQL_KEYWORD | 23  | 'SET'    |   2
>  IDENTIFIER  | 27  | '"a"'    |   2
>  OPERATOR    | 30  | '='      |   2
>  CONSTANT    | 31  | '1'      |   2
>
>>
>> Cheers,
>> David.
>
> As a phpPgAdmin dev, I am thinking about this subject since a long time.
> I am interested about trying to create such a patch after discussing it
> and if you think it is doable.
>
> - --
> JGuillaume (ioguix) de Rorthais
> http://www.dalibo.com
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.10 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAkvPOJMACgkQxWGfaAgowiLrUACfa7qMVr3oiOVS7JfhTa1S9EqY
> pYkAn3Sj6cezC/EdWPu2+kzrgjaDygGE
> =oY1c
> -----END PGP SIGNATURE-----
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: [RFC] nodeToString format and exporting the SQL parser

From
Bruce Momjian
Date:
Jehan-Guillaume (ioguix) de Rorthais wrote:
> A simple example of a tokenizer is the php one:
>   http://fr.php.net/token_get_all
> 
> And here is a basic example which return pseudo rows here :
> 
> => TOKENIZE $script$
>     SELECT 1;
>     UPDATE test SET "a"=2;
>   $script$;
> 
>    type      | pos |   value  | line
> - -------------+-----+----------+------
>  SQL_COMMAND | 1   | 'SELECT' |   1
>  CONSTANT    | 8   | '1'      |   1
>  DELIMITER   | 9   | ';'      |   1
>  SQL_COMMAND | 11  | 'UPDATE' |   2
>  IDENTIFIER  | 18  | 'test'   |   2
>  SQL_KEYWORD | 23  | 'SET'    |   2
>  IDENTIFIER  | 27  | '"a"'    |   2
>  OPERATOR    | 30  | '='      |   2
>  CONSTANT    | 31  | '1'      |   2

Sounds useful to me, though as a function like suggested in a later
email.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com


Re: [RFC] nodeToString format and exporting the SQL parser

From
Robert Haas
Date:
On Sat, Apr 24, 2010 at 8:07 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Jehan-Guillaume (ioguix) de Rorthais wrote:
>> A simple example of a tokenizer is the php one:
>>   http://fr.php.net/token_get_all
>>
>> And here is a basic example which return pseudo rows here :
>>
>> => TOKENIZE $script$
>>     SELECT 1;
>>     UPDATE test SET "a"=2;
>>   $script$;
>>
>>    type      | pos |   value  | line
>> - -------------+-----+----------+------
>>  SQL_COMMAND | 1   | 'SELECT' |   1
>>  CONSTANT    | 8   | '1'      |   1
>>  DELIMITER   | 9   | ';'      |   1
>>  SQL_COMMAND | 11  | 'UPDATE' |   2
>>  IDENTIFIER  | 18  | 'test'   |   2
>>  SQL_KEYWORD | 23  | 'SET'    |   2
>>  IDENTIFIER  | 27  | '"a"'    |   2
>>  OPERATOR    | 30  | '='      |   2
>>  CONSTANT    | 31  | '1'      |   2
>
> Sounds useful to me, though as a function like suggested in a later
> email.

If tool-builders think this is useful, I have no problem with making
it available.  It should be suitably disclaimed: "We reserve the right
to rip out the entire flex/yacc-based lexer and parser at any time and
replace them with a hand-coded system written in Prolog that emits
tokenization information only in ASN.1-encoded pig latin.  If massive
changes in the way this function works - or its complete disappearance
- are going to make you grumpy, don't call it."

But having said that, assuming there is a real use case for this, I
think it's better to let people get at it rather than forcing them to
roll their own.  Because frankly, if we do rip out the whole thing,
then people are going to have to adjust their stuff anyway, regardless
of whether they're using some API we provide or something they've
cooked up from scratch.  And in practice, most changes on our end are
likely to be incremental, though, again, we're not guaranteeing that
in any way.

...Robert


Re: [RFC] nodeToString format and exporting the SQL parser

From
Robert Haas
Date:
On Fri, Apr 2, 2010 at 3:53 PM, Michael Tharp <gxti@partiallystapled.com> wrote:
> Most Esteemed Hackers:
>
> Due to popular demand on #postgresql (by which I mean David Fetter), I have
> been spending a little time making the internal SQL parser available to
> clients via a C-language SQL function. The function itself is extremely
> simple: just a wrapper around a call to raw_parser followed by nodeToString.

Seems reasonable.

> Most of the "hard stuff" has been in parsing the output of nodeToString on
> the client side. So, I have a few questions to help gauge interest in
> related patches:
>
> Is there interest in a patch to extend nodes/outfuncs.c with support for
> serializing more node types? Coverage has been pretty good so far but
> various utility statements and their related nodes are missing, e.g.
> AlterTableStmt and GrantStmt. I expect that this will be the least
> contentious suggestion.

This wouldn't bother me provided the code footprint is small.  I would
be against adding a lot of complexity for this.

> The nodeToString format as it stands is somewhat ambiguous with respect to
> the type of a node member's value if one does not have access to
> readfuncs.c. For example, a T_BitString called foo is serialized as ':foo
> b1010' while a char * containing 'b1010' is also serialized as ':foo b1010'.
> This may just mean that _outToken needs to escape the leading 'b'. A similar
> problem exists for booleans ('true' as a string vs. as a boolean).

I am not inclined to change this.  Turning the format into something
self-describing seems to me to be significant work and a significant
compatibility break for a very small amount of benefit.

> Additionally, values may span more than one token for certain types e.g.
> Datum (":constvalue 4 [ 16 0 0 0 ]"). Plan trees have a few types that don't
> have a corresponding read function and output an array of space-separated
> integers. PlanInvalItem even seems to use a format containing parentheses,
> which the tokenizer splits as if it were a list. While most of these only
> occur in plan nodes and thus don't affect my use case (Datum being the
> exception), it would be ideal if they could be parsed more
> straightforwardly.

I'm not inclined to change this, either.

> These last two problems perhaps can be worked around by escaping more things
> in _outToken, but maybe it would be smarter to make the fields
> self-descriptive in terms of type. For example, the field names could be
> prefixed with a short string describing its type, which in most cases would
> be a single character, e.g. 's:schemaname' for a char*, 'b:true' for a bool,
> 'n:...' for any node (including Value nodes), or longer strings for less
> commonly used types like the integer arrays in plan nodes (although these
> would probably be better as a real integer list). These could be used to
> unambiguously parse individual tokens and also to determine how many or what
> kind of token to expect for multi-token values such as Datum which would
> otherwise require guessing. Does this seem reasonable? Is there another
> format that might make more sense?

This seems ugly to me and I don't see the utility of it.

> As far as I can tell, the current parser in nodes/read.c ignores the field
> names entirely, so this can be done without changing postgres' own parsing
> code at all and without affecting backwards compatibility of any stored
> trees. Does anyone else out there use nodeToString() output in their own
> tools, and if so, does this make your life easier or harder?
>
> Lastly, I'll leave a link to my WIP implementation in case anyone is
> interested:
>  http://bitbucket.org/gxti/parse_sql/src/
> Currently I'm working on adding support for cooked parse trees and figuring
> out what, if anything, I need to do to support multibyte encodings. My
> personal use is for parsing DDL so the input is decidedly not hostile but
> I'd still like to make this a generally useful module.
>
> Thanks in advance for any comments, tips, or flames sent my way.

Thanks for having a thick skin.  :-)

I'm having a hard time imaging what you could use this for without
encoding a lot of information about the meaning of particular
constructs.  In which case the self-describing stuff is not needed.
As you point out downthread, if all you want to do is compare, it's
not needed either.

...Robert


Re: [RFC] nodeToString format and exporting the SQL parser

From
Bruce Momjian
Date:
Added to TODO:
Allow the parser output to be seen by clientsThis is useful for external tools.*
http://archives.postgresql.org/pgsql-hackers/2010-04/msg00095.php

---------------------------------------------------------------------------

Robert Haas wrote:
> On Fri, Apr 2, 2010 at 3:53 PM, Michael Tharp <gxti@partiallystapled.com> wrote:
> > Most Esteemed Hackers:
> >
> > Due to popular demand on #postgresql (by which I mean David Fetter), I have
> > been spending a little time making the internal SQL parser available to
> > clients via a C-language SQL function. The function itself is extremely
> > simple: just a wrapper around a call to raw_parser followed by nodeToString.
> 
> Seems reasonable.
> 
> > Most of the "hard stuff" has been in parsing the output of nodeToString on
> > the client side. So, I have a few questions to help gauge interest in
> > related patches:
> >
> > Is there interest in a patch to extend nodes/outfuncs.c with support for
> > serializing more node types? Coverage has been pretty good so far but
> > various utility statements and their related nodes are missing, e.g.
> > AlterTableStmt and GrantStmt. I expect that this will be the least
> > contentious suggestion.
> 
> This wouldn't bother me provided the code footprint is small.  I would
> be against adding a lot of complexity for this.
> 
> > The nodeToString format as it stands is somewhat ambiguous with respect to
> > the type of a node member's value if one does not have access to
> > readfuncs.c. For example, a T_BitString called foo is serialized as ':foo
> > b1010' while a char * containing 'b1010' is also serialized as ':foo b1010'.
> > This may just mean that _outToken needs to escape the leading 'b'. A similar
> > problem exists for booleans ('true' as a string vs. as a boolean).
> 
> I am not inclined to change this.  Turning the format into something
> self-describing seems to me to be significant work and a significant
> compatibility break for a very small amount of benefit.
> 
> > Additionally, values may span more than one token for certain types e.g.
> > Datum (":constvalue 4 [ 16 0 0 0 ]"). Plan trees have a few types that don't
> > have a corresponding read function and output an array of space-separated
> > integers. PlanInvalItem even seems to use a format containing parentheses,
> > which the tokenizer splits as if it were a list. While most of these only
> > occur in plan nodes and thus don't affect my use case (Datum being the
> > exception), it would be ideal if they could be parsed more
> > straightforwardly.
> 
> I'm not inclined to change this, either.
> 
> > These last two problems perhaps can be worked around by escaping more things
> > in _outToken, but maybe it would be smarter to make the fields
> > self-descriptive in terms of type. For example, the field names could be
> > prefixed with a short string describing its type, which in most cases would
> > be a single character, e.g. 's:schemaname' for a char*, 'b:true' for a bool,
> > 'n:...' for any node (including Value nodes), or longer strings for less
> > commonly used types like the integer arrays in plan nodes (although these
> > would probably be better as a real integer list). These could be used to
> > unambiguously parse individual tokens and also to determine how many or what
> > kind of token to expect for multi-token values such as Datum which would
> > otherwise require guessing. Does this seem reasonable? Is there another
> > format that might make more sense?
> 
> This seems ugly to me and I don't see the utility of it.
> 
> > As far as I can tell, the current parser in nodes/read.c ignores the field
> > names entirely, so this can be done without changing postgres' own parsing
> > code at all and without affecting backwards compatibility of any stored
> > trees. Does anyone else out there use nodeToString() output in their own
> > tools, and if so, does this make your life easier or harder?
> >
> > Lastly, I'll leave a link to my WIP implementation in case anyone is
> > interested:
> > ?http://bitbucket.org/gxti/parse_sql/src/
> > Currently I'm working on adding support for cooked parse trees and figuring
> > out what, if anything, I need to do to support multibyte encodings. My
> > personal use is for parsing DDL so the input is decidedly not hostile but
> > I'd still like to make this a generally useful module.
> >
> > Thanks in advance for any comments, tips, or flames sent my way.
> 
> Thanks for having a thick skin.  :-)
> 
> I'm having a hard time imaging what you could use this for without
> encoding a lot of information about the meaning of particular
> constructs.  In which case the self-describing stuff is not needed.
> As you point out downthread, if all you want to do is compare, it's
> not needed either.
> 
> ...Robert
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com


Re: [RFC] nodeToString format and exporting the SQL parser

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Apr 24, 2010 at 8:07 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> Sounds useful to me, though as a function like suggested in a later
>> email.

> If tool-builders think this is useful, I have no problem with making
> it available.  It should be suitably disclaimed: "We reserve the right
> to rip out the entire flex/yacc-based lexer and parser at any time and
> replace them with a hand-coded system written in Prolog that emits
> tokenization information only in ASN.1-encoded pig latin.  If massive
> changes in the way this function works - or its complete disappearance
> - are going to make you grumpy, don't call it."

I'm a bit concerned with the vagueness of the goals here.  We started
with a request to dump out node trees, ie, post-parsing representation;
but the example use case of syntax highlighting would find that
representation quite useless.  (Example: foo::bar and CAST(foo AS bar)
yield the same parse tree.)  A syntax highlighter might get some use
out of the lexer-output token stream, but I'm afraid from the proposed
output that people might be expecting more semantic information than
the lexer can provide.  The lexer doesn't, for example, have any clue
that some keywords are commands and others aren't; nor any very clear
understanding about the semantic difference between the tokens '='
and ';'.

Also, if all you want is the lexer, it's not that hard to steal psql's
version and adapt it to your purposes.  The lexer doesn't change very
fast, and it's not that big either.

Anyway, it certainly wouldn't be hard for an add-on module to provide a
SRF that calls the lexer (or parser) and returns some sort of tabular
representation of the results.  I'm just not sure how useful it'll be
in the real world.
        regards, tom lane


Re: [RFC] nodeToString format and exporting the SQL parser

From
Michael Tharp
Date:
On 04/24/2010 08:49 PM, Robert Haas wrote:
>> The nodeToString format as it stands is somewhat ambiguous with respect to
>> the type of a node member's value if one does not have access to
>> readfuncs.c. For example, a T_BitString called foo is serialized as ':foo
>> b1010' while a char * containing 'b1010' is also serialized as ':foo b1010'.
>> This may just mean that _outToken needs to escape the leading 'b'. A similar
>> problem exists for booleans ('true' as a string vs. as a boolean).
>
> I am not inclined to change this.  Turning the format into something
> self-describing seems to me to be significant work and a significant
> compatibility break for a very small amount of benefit.

The funny thing is, it doesn't seem to be a compatibility break because 
the code in readfuncs.c that parses the node strings ignores the field 
names entirely because it assumes they are in a particular order. It 
also isn't much work to change the output because the code is, with the 
exception of a few weirdos, all at the top of outfuncs.c, and the 
weirdos are also dispersed within that file.

However, I'm no longer convinced that using a serialized node tree is 
the way to go for my use case, nor am I particularly sure that it even 
matches my use case at all anymore as I keep simplifying the goals as 
time goes on. I won't be able to make any compelling arguments until I 
figure out what I need :-)

Thanks for the feedback.

-- m. tharp


Re: [RFC] nodeToString format and exporting the SQL parser

From
Robert Haas
Date:
On Sat, Apr 24, 2010 at 9:08 PM, Michael Tharp
<gxti@partiallystapled.com> wrote:
> The funny thing is, it doesn't seem to be a compatibility break because the
> code in readfuncs.c that parses the node strings ignores the field names
> entirely because it assumes they are in a particular order. It also isn't
> much work to change the output because the code is, with the exception of a
> few weirdos, all at the top of outfuncs.c, and the weirdos are also
> dispersed within that file.

Yeah, I think that's basically all true.

> However, I'm no longer convinced that using a serialized node tree is the
> way to go for my use case, nor am I particularly sure that it even matches
> my use case at all anymore as I keep simplifying the goals as time goes on.
> I won't be able to make any compelling arguments until I figure out what I
> need :-)

However this is, as you say, a fairly damning point.  :-)

Bruce, I think we should not have a TODO for this until the OP (or
someone) can address this issue a bit better.

...Robert


Re: [RFC] nodeToString format and exporting the SQL parser

From
Bruce Momjian
Date:
Robert Haas wrote:
> On Sat, Apr 24, 2010 at 9:08 PM, Michael Tharp
> <gxti@partiallystapled.com> wrote:
> > The funny thing is, it doesn't seem to be a compatibility break because the
> > code in readfuncs.c that parses the node strings ignores the field names
> > entirely because it assumes they are in a particular order. It also isn't
> > much work to change the output because the code is, with the exception of a
> > few weirdos, all at the top of outfuncs.c, and the weirdos are also
> > dispersed within that file.
> 
> Yeah, I think that's basically all true.
> 
> > However, I'm no longer convinced that using a serialized node tree is the
> > way to go for my use case, nor am I particularly sure that it even matches
> > my use case at all anymore as I keep simplifying the goals as time goes on.
> > I won't be able to make any compelling arguments until I figure out what I
> > need :-)
> 
> However this is, as you say, a fairly damning point.  :-)
> 
> Bruce, I think we should not have a TODO for this until the OP (or
> someone) can address this issue a bit better.

OK, removed.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com


providing tokenized version of parsed SQL script (was: nodeToString format and exporting the SQL parser)

From
"Jehan-Guillaume (ioguix) de Rorthais"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 25/04/2010 03:02, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sat, Apr 24, 2010 at 8:07 PM, Bruce Momjian <bruce@momjian.us> wrote:
>>> Sounds useful to me, though as a function like suggested in a later
>>> email.
> 
>> If tool-builders think this is useful, I have no problem with making
>> it available.  It should be suitably disclaimed: "We reserve the right
>> to rip out the entire flex/yacc-based lexer and parser at any time and
>> replace them with a hand-coded system written in Prolog that emits
>> tokenization information only in ASN.1-encoded pig latin.  If massive
>> changes in the way this function works - or its complete disappearance
>> - are going to make you grumpy, don't call it."
> 
> I'm a bit concerned with the vagueness of the goals here.  We started
> with a request to dump out node trees, ie, post-parsing representation;
> but the example use case of syntax highlighting would find that
> representation quite useless.  (Example: foo::bar and CAST(foo AS bar)
> yield the same parse tree.)  

Well, the tokenizer stuff was actually my understanding of the following
quote from Michael Tharp :
« ... making the internal SQL parser available to clients via a
C-language SQL function. ».

I thought Michael was trying to write a tokenizer based on node tree
returned by raw_parser. As it seems Michael is not even sure about what
he's trying to do, I prefer refocus a bit this thread

> A syntax highlighter might get some use
> out of the lexer-output token stream, but I'm afraid from the proposed
> output that people might be expecting more semantic information than
> the lexer can provide.  The lexer doesn't, for example, have any clue
> that some keywords are commands and others aren't; nor any very clear
> understanding about the semantic difference between the tokens '='
> and ';'.

Exact, a proper tokenizer function should be able to give some (simple)
information about the type of each token. That is what I tried to define
in this draft with the "type" field :
 => SELECT pgtokenize($script$     SELECT 1;     UPDATE test SET "a"=2;   $script$);
    type      | pos |   value  | line -------------+-----+----------+------  SQL_COMMAND | 1   | 'SELECT' |   1
CONSTANT   | 8   | '1'      |   1  DELIMITER   | 9   | ';'      |   1  SQL_COMMAND | 11  | 'UPDATE' |   2  IDENTIFIER
|18  | 'test'   |   2  SQL_KEYWORD | 23  | 'SET'    |   2  IDENTIFIER  | 27  | '"a"'    |   2  OPERATOR    | 30  | '='
   |   2  CONSTANT    | 31  | '1'      |   2
 


> 
> Also, if all you want is the lexer, it's not that hard to steal psql's
> version and adapt it to your purposes.  The lexer doesn't change very
> fast, and it's not that big either.

Stealing the lexer from psql is possible...for C application.
Don't know yet if we could port it to other languages easily and if a
simple lexer would really answer the use cases here.

> 
> Anyway, it certainly wouldn't be hard for an add-on module to provide a
> SRF that calls the lexer (or parser) and returns some sort of tabular
> representation of the results.  I'm just not sure how useful it'll be
> in the real world.

Well, I would prefer not to tell users of pgAdmin or phpPgAdmin that
they depend on a contrib module.
Moreover, PostgreSQL already expose a lot of informations about its
internal mechanisms, configuration, ddl etc. I think having a proper
tokenizer function is just a natural new functionality for core if possible.

Having dropped an eye here and there in the parser code, I am not sure
where I could get required info and mix them to produce something close
to my draft yet.
But I prefer to discussing first before spending too much time and
throwing any potential code after...

> 
>             regards, tom lane

- -- 
JGuillaume (ioguix) de Rorthais
http://www.dalibo.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkvXdxgACgkQxWGfaAgowiJujQCglXpCYpFttwHOkmkCd92zMxnv
r00An1sjmRrR6u61VjCtXputcNBevHsz
=ri3i
-----END PGP SIGNATURE-----


On Tue, Apr 27, 2010 at 7:45 PM, Jehan-Guillaume (ioguix) de Rorthais
<ioguix@free.fr> wrote:
> I thought Michael was trying to write a tokenizer based on node tree
> returned by raw_parser. As it seems Michael is not even sure about what
> he's trying to do, I prefer refocus a bit this thread
[...]
> Having dropped an eye here and there in the parser code, I am not sure
> where I could get required info and mix them to produce something close
> to my draft yet.
> But I prefer to discussing first before spending too much time and
> throwing any potential code after...

I can't quite tell, from reading this, what you're trying to do with this...

...Robert


Re: providing tokenized version of parsed SQL script

From
"Jehan-Guillaume (ioguix) de Rorthais"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 28/04/2010 02:20, Robert Haas wrote:
> On Tue, Apr 27, 2010 at 7:45 PM, Jehan-Guillaume (ioguix) de Rorthais
> <ioguix@free.fr> wrote:
>> I thought Michael was trying to write a tokenizer based on node tree
>> returned by raw_parser. As it seems Michael is not even sure about what
>> he's trying to do, I prefer refocus a bit this thread
> [...]
>> Having dropped an eye here and there in the parser code, I am not sure
>> where I could get required info and mix them to produce something close
>> to my draft yet.
>> But I prefer to discussing first before spending too much time and
>> throwing any potential code after...
> 
> I can't quite tell, from reading this, what you're trying to do with this...

I would like to do what I described here:

http://archives.postgresql.org/pgsql-hackers/2010-04/msg00974.php

Pavel Stehule convinced me a function would be a better approach than a
new SQL keyword. Moreover he points an existing function in the Oracle
ecosystem that is doing very closely what I descibed in my draft:

http://archives.postgresql.org/pgsql-hackers/2010-04/msg00976.php
http://www.pgsql.cz/index.php/Oracle_functionality_%28en%29#PLVlex

Presently I am just studying some parts of the parser code to find where
I could find : - token/lexer representation of a SQL script - type of each token

Sorry the mess and confusion, hope this mail is clearer ?

> 
> ...Robert

- -- 
JGuillaume (ioguix) de Rorthais
http://www.dalibo.com

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkvX7aIACgkQxWGfaAgowiI5PQCeITOqAlCbeFyYc2KdHPUOfXhe
YY8An0jdfF31SK75bmd21ixFffXiyoEF
=ymbW
-----END PGP SIGNATURE-----