[RFC] nodeToString format and exporting the SQL parser - Mailing list pgsql-hackers

From Michael Tharp
Subject [RFC] nodeToString format and exporting the SQL parser
Msg-id 4BB64B3F.7020406@partiallystapled.com
Responses Re: [RFC] nodeToString format and exporting the SQL parser  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: [RFC] nodeToString format and exporting the SQL parser  (Markus Wanner <markus@bluegap.ch>)
Re: [RFC] nodeToString format and exporting the SQL parser  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Most Esteemed Hackers:

Due to popular demand on #postgresql (by which I mean David Fetter), I 
have been spending a little time making the internal SQL parser 
available to clients via a C-language SQL function. The function itself 
is extremely simple: just a wrapper around a call to raw_parser followed 
by nodeToString. Most of the "hard stuff" has been in parsing the output 
of nodeToString on the client side. So, I have a few questions to help 
gauge interest in related patches:

Is there interest in a patch to extend nodes/outfuncs.c with support for 
serializing more node types? Coverage has been pretty good so far but 
various utility statements and their related nodes are missing, e.g. 
AlterTableStmt and GrantStmt. I expect that this will be the least 
contentious suggestion.

The nodeToString format as it stands is somewhat ambiguous with respect 
to the type of a node member's value if one does not have access to 
readfuncs.c. For example, a T_BitString called foo is serialized as 
':foo b1010' while a char * containing 'b1010' is also serialized as 
':foo b1010'. This may just mean that _outToken needs to escape the 
leading 'b'. A similar problem exists for booleans ('true' as a string 
vs. as a boolean).
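
To make the ambiguity concrete, here is a minimal client-side sketch in Python. It assumes the client sees only the serialized text and has no knowledge of the node's struct definition; the helper names split_field and guess_kind are hypothetical and are not part of PostgreSQL or of the module linked later in this message.

```python
def split_field(text):
    """Split ':name value' into (name, value) on the first run of whitespace."""
    name, value = text.split(None, 1)
    return name.lstrip(":"), value


def guess_kind(token):
    """Guess a value's type from the token text alone.

    This is all a client can do without the knowledge encoded in readfuncs.c,
    so some answers are necessarily ambiguous.
    """
    if token in ("true", "false"):
        return "bool-or-string"
    # 'b' followed by at least one binary digit looks like a T_BitString,
    # but a char * holding the same characters serializes identically.
    if token.startswith("b") and token[1:] and set(token[1:]) <= {"0", "1"}:
        return "bitstring-or-string"
    return "string"


name, value = split_field(":foo b1010")
kind = guess_kind(value)
```

For ':foo b1010' the guess can be no better than "bitstring-or-string", which is exactly the problem described above.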

Additionally, values may span more than one token for certain types, e.g. 
Datum (":constvalue 4 [ 16 0 0 0 ]"). Plan trees have a few types that 
don't have a corresponding read function and output an array of 
space-separated integers. PlanInvalItem even seems to use a format 
containing parentheses, which the tokenizer splits as if it were a list. 
While most of these only occur in plan nodes and thus don't affect my 
use case (Datum being the exception), it would be ideal if they could be 
parsed more straightforwardly.
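
As an illustration of the multi-token case, the following Python sketch reads a Datum laid out like the example above: a length token, an opening bracket, that many integer tokens, and a closing bracket. The function name read_datum is hypothetical, the input is assumed to be already split on whitespace, and the masking with 0xFF is an assumption made in case byte values are printed as signed integers.

```python
def read_datum(tokens, pos):
    """Read '<length> [ b0 b1 ... ]' starting at tokens[pos].

    Returns (data, next_pos), where data is a bytes object and next_pos is
    the index of the first token after the closing bracket.
    """
    length = int(tokens[pos])
    start = pos + 2          # skip the length token and the '['
    end = start + length     # index where the ']' is expected
    if end >= len(tokens) or tokens[pos + 1] != "[" or tokens[end] != "]":
        raise ValueError("malformed Datum at token %d" % pos)
    # Assumption: mask to 0..255 in case bytes were printed as signed ints.
    data = bytes(int(t) & 0xFF for t in tokens[start:end])
    return data, end + 1


tokens = ":constvalue 4 [ 16 0 0 0 ]".split()
value, next_pos = read_datum(tokens, 1)
```

The caller has to know in advance that :constvalue holds a Datum in order to call read_datum at all; nothing in the token stream says so.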

These last two problems perhaps can be worked around by escaping more 
things in _outToken, but maybe it would be smarter to make the fields 
self-descriptive in terms of type. For example, each field name could be 
prefixed with a short string describing its type, which in most cases 
would be a single character, e.g. 's:schemaname' for a char*, 'b:true' 
for a bool, 'n:...' for any node (including Value nodes), or longer 
strings for less commonly used types like the integer arrays in plan 
nodes (although these would probably be better as a real integer list). 
These could be used to unambiguously parse individual tokens and also to 
determine how many or what kind of token to expect for multi-token 
values such as Datum which would otherwise require guessing. Does this 
seem reasonable? Is there another format that might make more sense?
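
A rough Python sketch of how a client might consume such prefixes follows. The exact encoding is not settled in this proposal, so the layout used here (':s:relname value', ':b:istemp value') is only one possible reading, and decode_field is a hypothetical helper that handles just the 's' and 'b' prefixes.

```python
def decode_field(name_token, value_token):
    """Decode one single-token field whose name carries a type prefix.

    Assumed layout (hypothetical): ':<prefix>:<fieldname>' followed by one
    value token. 's' means char *, 'b' means bool.
    """
    prefix, sep, name = name_token.lstrip(":").partition(":")
    if not sep:
        raise ValueError("field name has no type prefix: %r" % name_token)
    if prefix == "s":
        # A string stays a string, even if it happens to spell 'true'.
        return name, value_token
    if prefix == "b":
        if value_token not in ("true", "false"):
            raise ValueError("bad boolean: %r" % value_token)
        return name, value_token == "true"
    raise ValueError("unsupported type prefix: %r" % prefix)


as_string = decode_field(":s:relname", "true")
as_bool = decode_field(":b:istemp", "true")
```

With the prefix present, the same value token 'true' decodes to a string in one field and to a boolean in the other, without the client consulting readfuncs.c.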

As far as I can tell, the current parser in nodes/read.c ignores the 
field names entirely, so this can be done without changing postgres' own 
parsing code at all and without affecting backwards compatibility of any 
stored trees. Does anyone else out there use nodeToString() output in 
their own tools, and if so, does this make your life easier or harder?

Lastly, I'll leave a link to my WIP implementation in case anyone is 
interested:  http://bitbucket.org/gxti/parse_sql/src/
Currently I'm working on adding support for cooked parse trees and 
figuring out what, if anything, I need to do to support multibyte 
encodings. My personal use is for parsing DDL so the input is decidedly 
not hostile but I'd still like to make this a generally useful module.

Thanks in advance for any comments, tips, or flames sent my way.

-- m. tharp

