Home > mailing lists
Re: patch: function xmltable - Mailing list pgsql-hackers

From	Craig Ringer
Subject	Re: patch: function xmltable
Date	September 7, 2016 03:03:27
Msg-id	CAMsr+YFVDkngz-R5b9fNPs+Zu6AXbMZbwis9s0=GfrDVUds5GQ@mail.gmail.com Whole thread Raw
In response to	Re: patch: function xmltable (Pavel Stehule <pavel.stehule@gmail.com>)
Responses	Re: patch: function xmltable Re: patch: function xmltable
List	pgsql-hackers
Tree view
On 7 September 2016 at 04:13, Pavel Stehule <pavel.stehule@gmail.com> wrote:

>> Overall, I think this needs to be revised with appropriate comments.
>> Whitespace/formatting needs fixing since it's all over the place.
>> Documentation is insufficient (per notes below).
>
>
> I am not able to write documentation in English language :( - This function
> is pretty complex - so I hope so anybody with better language skills can
> help with this. It respects standard and it respects little bit different
> Oracle's behave too (different order of DEFAULT and PATH parts).

OK, no problem. It can't be committed without more comprehensive docs
though, especially for new and nontrivial functionality.

Is there some reference material you can point to so someone else can
help with docs? And can you describe what differences there are
between your implementation and the reference?

Alternately, if you document it in Czech, do you know of anyone who
could assist in translating to English for the main documentation?

>> Re identifier naming, some of this code uses XmlTable naming patterns,
>> some uses TableExpr prefixes. Is that intended to indicate a bounary
>> between things re-usable for other structured data ingesting
>> functions? Do you expect a "JSONEXPR" or similar in future? That's
>> alluded to by
>
>
> This structure should be reused by JSON_TABLE function. Now, it is little
> bit strange, because there is only XMLTABLE implementation - and I have to
> choose between a) using two different names now, b) renaming some part in
> future.

OK. Are you planning on writing this JSON_TABLE or are you leaving
room for future growth? Either way is fine, just curious.

> And although XMLTABLE and JSON_TABLE functions are pretty similar - share
> 90% of data (input value, path, columns definitions), these functions has
> different syntax - so only middle level code should be shared.

That makes sense.

I think it would be best if you separated out the TableExpr
infrastructure from the XMLTABLE implementation though, so we can
review the first level infrastrcture separately and make this a
2-patch series. Most importantly, doing it that way will help you find
places where TableExpr code calls directly into XMLTABLE code. If
TableExpr is supposed to be reusable for json etc, it probably
shouldn't be calling XmlTable stuff directly.

That also means somewhat smaller simpler patches, which probably isn't bad.

I don't necessarily think this needs to be fully pluggable with
callbacks etc. It doesn't sound like you expect this to be used by
extensions or to have a lot of users, right? So it probably just needs
clearer separation of the infrastructure layer from the xmltable
layer. I think splitting the patch will make that easier to see and
make it easier to find problems.

My biggest complaint at the moment is that execEvalTableExpr calls
initXmlTableContext(...) directly, is aware of XML namespaces
directly, calls XmlTableSetRowPath() directly, calls
XmlTableFetchRow() directly, etc. It is in no way generic/reusable for
some later JSONTABLE feature. That needs to be fixed by:

* Renaming it so it's clearly only for XMLTABLE; or
* Abstracting the init context, set row path, fetch row etc operations
so json ones can be plugged in later



> Currently the common part is not too big - just the Node related part - I am
> not sure about necessity of two patches.

The problem is that the common part is all mixed in with the
XMLTABLE-specific part, so it's not at all clear it can be common with
something else.

> I am agree, there is missing some
> TableExpBuilder, where can be better isolated the XML part.

Yeah, that's sort of what I'm getting at.


>> execEvalTableExpr seems to be defined twice, with a difference in
>> case. This is probably not going to fly:
>>
>>
>> +static Datum
>> +execEvalTableExpr(TableExprState *tstate,
>> +                        ExprContext *econtext,
>> +                        bool *isNull, ExprDoneCond *isDone)
>> +{
>>
>> +static Datum
>> +ExecEvalTableExpr(TableExprState *tstate,
>> +                        ExprContext *econtext,
>> +                        bool *isNull, ExprDoneCond *isDone)
>> +{
>>
>>
>> It looks like you've split the function into a "guts" and "wrapper"
>> part, with the error handling PG_TRY / PG_CATCH block in the wrapper.
>> That seems reasonable for readability, but the naming isn't.
>
>
> I invite any idea how these functions should be named.

Definitely not how they are ;) . They really can't differ in a single
character's case.

I'm not sure if PostgreSQL has any formal convention for this. Some
places use _impl e.g.  pg_read_barrier_impl() but that's in the
context of an interface-vs-implementation separation, which isn't the
case here.

Some places use _internal, like AlterObjectRename_internal(...), but
that's where there's an associated public/external part, which isn't
the case here.

Some places use _guts e.g. pg_logical_slot_get_changes_guts(...),
largely where there's common use by several callers.

This is a fairly arbitrary function split for readability/length. Is
it actually useful to split this function up at all?

Anyone else have an opinion?

>> A comment is needed to explain what ExecEvalTableExpr is / does. If
>> it's XMLTABLE specific (which it looks like based on the code), its
>> name should reflect that. This pattern is repeated elsewhere; e.g.
>> TableExprState is really the state for an XMLTABLE expression. But
>> PostgreSQL actually has TABLE statements, and in future we might want
>> to support table-expressions, so I don't think this naming is
>> appropriate. This is made worse by the lack of comments on things like
>> the definition of TableExprState. Please use something that makes it
>> clear it's for XMLTABLE and add appropriate comments.
>
>
> I understand, so using TableExpr can be strange (for XMLTABLE function). But
> when we will have JSON_TABLE function, then it will have a sense.

It's pretty hard to review that as shared infrastructure when it's
still tangled up in xmltable specifics, though.

> "TableExprState" is consistent with "TableExpr".
>
> Any idea how it should be changed?

I think if you want it to be shareable infrasructure, you need to
write it so it can be used as shared infrastructure. Not just name it
that way but then make it XMLTABLE specific in actual functionality.


>>                 /* when typmod is not valid, refresh it */
>>                 if (te->typmod == -1)
>>
>>
>> Is this a cache? How is it valid or not valid and when? The comment
>> (thanks!) on TableExprGetTupleDesc says:
>>
>> /*
>>  * When we skip transform stage (in view), then TableExpr's
>>  * TupleDesc should not be valid. Refresh is necessary.
>>  */
>>
>> but I'm not really grasping what you're trying to explain here. What
>> transform stage? What view? This could well be my ignorance of this
>> part of the code; if it should be understandable by a reader who is
>> appropriately familiar with the executor that's fine, but if it's
>> specific to how XMLTABLE works some more explanation would be good.
>
>
> This is most difficult part of this patch, and I am not sure it it is fully
> correctly implemented. I use TupleDesc cache. The TupleDesc is created in
> parser/transform stage. When the XMLTABLE is used in some view, then the
> transformed parser tree is materialized - and when the view is used in
> query, then this tree is loaded and the parser/transform stage is "skipped".
> I'll check this code against implementation of ROW constructor and I'll try
> to do more comments there.

Thanks. It would be good to highlight when this does and does not
happen and why. Why is it necessary at all?

What happens if XMLTABLE is used in a query directly, not part of a
view? if XMLTABLE is used in a view? If XMLTABLE is used in a prepared
statement / plpgsql statement / etc? What about a CTE term?

Not necessarily list all these cases one by one, just explain what
happens, when and why. Especially if it's complex, so other readers
can understand it and don't have to study it in detail to understand
what is going on. It does not need to be good public-facing
documentation, and details of wording, grammar etc can be fixed up
later, it's the ideas that matter.

Is this similar to other logic elsewhere? If so, reference that other
logic so readers know where to look. That way if they're
changing/bugfixing/etc one place they know there's another place that
might need changing.

I don't know this area of the code well enough to give a solid review
of the actual functionality, and I don't yet understand what it's
trying to do so it's hard to review it by studying what it actually
does vs what it claims to do. Maybe Peter E can help, he said he was
thinking of looking at this patch too. But more information on what
it's trying to do would be a big help.

>> PostgreSQL already has XPATH support in the form of xmlexists(...)
>> etc. Why is getXPathToken() etc needed? What re-use is possible here?
>> There's no explanation in the patch header or comments. Should the new
>> xpath parser be re-used by the existing xpath stuff? Why can't we use
>> libxml's facilities? etc. This at least needs explaining in the
>> submission, and some kind of hint as to why we have two different ways
>> to do it is needed in the code. If we do need a new XML parser, should
>> it be bundled in adt/xml.c along with a lot of user-facing
>> functionality, or a separate file?
>
> libxml2 and our XPATH function doesn't support default namespace (
> http://plasmasturm.org/log/259/ ). This is pretty useful feature - so I
> implemented.

OK, that makes sense.

For the purpose of getting this patch in, is it a _necessary_ feature?
Can XMLTABLE be usefully implemented without it, and if so, can it be
added in a subsequent patch? It would be nice to simplify this by
using existing libxml2 functionality in the first version rather than
adding a whole new xpath as well!

> This is the mayor issue of libxml2 library. Another difference
> between XPATH function and XMLTABLE function is using two phase searching
> and implicit prefix "./" and suffix ("/text()") in XMLTABLE. XMLTABLE using
> two XPATH expressions - for row data cutting and next for column data
> cutting (from row data). The our XPATH functions is pretty simple mapped to
> libxml2 XPATH API. But it is not possible with XMLTABLE function - due
> design of this function in standard (it is more user friendly and doesn't
> require exactly correct xpath expressions).

So you can't use existing libxml2 xpath support to implement XMLTABLE,
even without default namespaces?

> I didn't find any API in libxml2 for a work with parsed xpath expressions -
> I need some info about the first and last token of xpath expression - it is
> base for decision about using prefix or suffix.
>
> This functionality (xpath expression parser) cannot be used for our XPATH
> function now - maybe default namespace in future.

So we'll have two different XPATH implementations for different
places, with different limitations, different possible bugs, etc?

What would be needed to make the new XPATH work for our built-in xpath
functions too?

>> How does XmlTableGetValue(...) and XmlTableGetRowValue(...) relate to
>> this? It doesn't look like they're intended to be called directly by
>> the user, and they're not documented (or commented).
>
>
> Probably I used wrong names. XMLTABLE function is running in two different
> modes - with explicitly defined columns (XmlTableGetValue is used), and
> without explicitly defined columns - so result is one XML column and only
> one one step searching is used (there are not column related xpath
> expressions) ( XmlTableGetRowValue is used). The function XmlTableGetValue
> is used for getting one column value, the function XmlTableGetRowValue is
> used for getting one value too, but in special case, when there are not any
> other value.

So both are internal implementation of the parser-level XMLTABLE(...)
construct and are not intended to be called directly by users - right?

Comments please! A short comment on the function saying this would be
a big help.

Regarding naming, do we already have a convention for functions that
are internal implementation of something the user "spells"
differently? Where it's transformed by the parser? I couldn't find
one. I don't much care about the names so long as there are comments
explaining what calls the functions and what the user-facing interface
that matches the function is.

Is it safe for users to call these directly? What happens if they do
so incorrectly?

Why are they not in pg_proc.h? Do they need to be?

>> +/*
>> + * There are different requests from XMLTABLE, JSON_TABLE functions
>> + * on passed data than has CREATE TABLE command. It is reason for
>> + * introduction special structure instead using ColumnDef.
>> + */
>> +typedef struct TableExprRawCol
>> +{
>> +    NodeTag     type;
>> +    char       *colname;
>> +    TypeName   *typeName;
>> +    bool        for_ordinality;
>> +    bool        is_not_null;
>> +    Node       *path_expr;
>> +    Node       *default_expr;
>> +    int         location;
>> +} TableExprRawCol;
>
> I am sorry. It is my fault. Now we have very similar node ColumnDef. This
> node is designed for usage in utility commands - and it is not designed for
> usage inside a query.

Makes sense.

> I had to decide between enhancing ColumnDef node or
> introduction new special node. Because there are more special attributes and
> it is hard to serialize current ColumnDef, I decided to use new node.

Seems reasonable. The summary is "this is the parse node for a column
of an XMLTABLE expression".

Suggested comment:

/** This is the parsenode for a column definition in a table-expression
like XMLTABLE.** We can't re-use ColumnDef here; the utility command column
definition has all the* wrong attributes for use in table-expressions and just doesn't make
sense here.*/
typedef struct TableExprColumn
{
...
};

?

Why "RawCol" ? What does it become when it's not "raw" anymore? Is
that a reference to ColumnDef's raw_default and cooked_default for
untransformed vs transformed parse-trees?




-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
pgsql-hackers by date:
From: Peter Eisentraut
Date: 07 September 2016, 03:03:23
Subject: Re: improved DefElem list processing
From: Masahiko Sawada
Date: 07 September 2016, 03:14:02
Subject: Re: Optimization for lazy_scan_heap
Re: patch: function xmltable - Mailing list pgsql-hackers

Previous

Next