Thread: WIP: Generic functions for Node types using generated metadata
Hi,

There's been a lot of complaints over the years about how annoying it is to keep the out/read/copy/equalfuncs.c functions in sync with the actual underlying structs.

There've been various calls for automating their generation, but no actual patches that I am aware of.

There also recently has been discussion about generating more efficient memory layout for node trees that we know are read only (e.g. plan trees inside the plancache), and about copying such trees more efficiently (e.g. by having one big allocation, and then just adjusting pointers).

One way to approach this problem would be to parse the type definitions, and directly generate code for the various functions. But that does mean that such a code-generator needs to be expanded for each such function.

An alternative approach is to have a parser of the node definitions that doesn't generate code directly, but instead generates metadata. And then use that metadata to write node aware functions. This seems more promising to me.

In the attached, *very experimental WIP*, patch I've played around with this approach. I have written a small C program that uses libclang (more on that later) to parse relevant headers, and generate metadata about node types. Currently the generated metadata roughly looks like this:

#define TYPE_CAT_SCALAR (1 << 0) // FIXME: this isn't well named rn
#define TYPE_CAT_POINTER (1 << 1)
#define TYPE_CAT_UNION (1 << 2) // FIXME: unused
#define TYPE_CAT_INCOMPLETE (1 << 3)

#define KNOWN_TYPE_UNKNOWN 0
#define KNOWN_TYPE_STRING 1
#define KNOWN_TYPE_INT 2
#define KNOWN_TYPE_NODE 3
#define KNOWN_TYPE_BITMAPSET 4

typedef struct NodeTypeInfo
{
	/* FIXME: intern me */
	const char *structname;
	int16		nelems;
	int16		start_at;
	int16		size;
} NodeTypeInfo;

typedef struct NodeTypeComponents
{
	/* FIXME: intern me */
	const char *fieldname;
	/* FIXME: intern me */
	const char *fieldtype;
	int16		offset;
	int16		size;
	int16		flags;
	int16		node_type_id;
	int16		known_type_id;
} NodeTypeComponents;

NodeTypeInfo node_type_info[] = {
	{0},
	{.structname = "IndexInfo", .nelems = 22, .start_at = 2128, .size = sizeof(IndexInfo)},
	...
	{.structname = "Append", .nelems = 4, .start_at = 1878, .size = sizeof(Append)},
	...

NodeTypeComponents node_type_components[] = {
	...
	{.fieldname = "plan", .fieldtype = "struct Plan",
	 .offset = offsetof(struct Append, plan), .size = sizeof(struct Plan),
	 .flags = TYPE_CAT_SCALAR, .node_type_id = 9, .known_type_id = KNOWN_TYPE_UNKNOWN},
	{.fieldname = "appendplans", .fieldtype = "struct List *",
	 .offset = offsetof(struct Append, appendplans), .size = sizeof(struct List *),
	 .flags = TYPE_CAT_POINTER, .node_type_id = 223, .known_type_id = KNOWN_TYPE_UNKNOWN},
	{.fieldname = "first_partial_plan", .fieldtype = "int",
	 .offset = offsetof(struct Append, first_partial_plan), .size = sizeof(int),
	 .flags = TYPE_CAT_SCALAR, .node_type_id = -1, .known_type_id = KNOWN_TYPE_UNKNOWN},
	{.fieldname = "part_prune_info", .fieldtype = "struct PartitionPruneInfo *",
	 .offset = offsetof(struct Append, part_prune_info), .size = sizeof(struct PartitionPruneInfo *),
	 .flags = TYPE_CAT_POINTER, .node_type_id = 53, .known_type_id = KNOWN_TYPE_UNKNOWN},
	...

I.e. each node type is listed in node_type_info, indexed by the NodeTag value. The fields building up the node are defined in node_type_components. By including the size for Node structs and components, and the offset for components, we can write generic routines accessing node trees.
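To illustrate the kind of generic routine this metadata makes possible, here's a rough sketch of a metadata-driven size estimator. This is not the code from the attached patch - string/List/Bitmapset handling is elided, and the exact meaning of the flags / node_type_id checks is approximated:

static Size
node_tree_size_sketch(const Node *node)
{
	const NodeTypeInfo *ti;
	Size		sz;

	if (node == NULL)
		return 0;

	/* look up the per-type metadata via the node's tag */
	ti = &node_type_info[nodeTag(node)];
	sz = ti->size;

	for (int i = 0; i < ti->nelems; i++)
	{
		const NodeTypeComponents *c = &node_type_components[ti->start_at + i];
		const char *field = (const char *) node + c->offset;

		/* assuming node_type_id >= 0 marks fields referencing other nodes */
		if ((c->flags & TYPE_CAT_POINTER) && c->node_type_id >= 0)
			sz += node_tree_size_sketch(*(const Node *const *) field);

		/*
		 * Strings, the various List types, value nodes and Bitmapsets
		 * would need their own handling here, keyed off known_type_id.
		 */
	}

	return sz;
}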
In the attached I used this metadata to write a function that can compute the amount of memory a node tree needs (without currently considering memory allocator overhead). The function for doing so has a pretty limited amount of type awareness - it needs to know about strings, the various list types, value nodes and Bitmapset.

I'm fairly sure this metadata can also be used to write the other currently existing node functions.

I *suspect* that it'll be quite OK performancewise, compared to bespoke code, but I have *NOT* measured that.

The one set of fields this currently can not deal with is the various arrays that we directly reference from nodes. E.g.:

typedef struct Sort
{
	Plan		plan;
	int			numCols;		/* number of sort-key columns */
	AttrNumber *sortColIdx;		/* their indexes in the target list */
	Oid		   *sortOperators;	/* OIDs of operators to sort them by */
	Oid		   *collations;		/* OIDs of collations */
	bool	   *nullsFirst;		/* NULLS FIRST/LAST directions */
} Sort;

The generic code has no way of knowing that sortColIdx, sortOperators, collations, nullsFirst are all numCols long arrays.

I can see various ways of dealing with that:

1) We could convert them all to lists, now that we have fast arbitrary access. But that'd add a lot of indirection / separate allocations.

2) We could add annotations to the sourcecode, to declare that association. That's probably not trivial, but wouldn't be that hard - one disadvantage is that we probably couldn't use that declaration for automated asserts etc.

3) We could introduce a few macros to create array types that include the length of the members. That'd duplicate the length for each of those arrays, but I have a bit of a hard time believing that that's a meaningful enough overhead.

I'm thinking of a macro that'd be used like

	ARRAY_FIELD(AttrNumber) *sortColIdx;

and that'd generate code like

	struct
	{
		size_t nmembers;
		AttrNumber members[FLEXIBLE_ARRAY_MEMBER];
	} *sortColIdx;

plus a set of macros (or perhaps inline functions + macros) to access them.

With any of these we could add enough metadata for node type members to be able to handle such arrays in a generic manner, rather than having to write code for each of them.

With regards to using libclang for the parsing: I chose that because it seemed the easiest to experiment with, compared to annotating all the structs with enough metadata to be able to easily parse them from a perl script. The node definitions are after all distributed over quite a few headers.

I think it might even be the correct way forward, over inventing our own mini-languages and writing ad-hoc parsers for those. It sure is easier to understand plain C code, compared to having to understand various embedded mini-languages consisting of macros.

The obvious drawback is that it'd require more people to install libclang - a significant imposition. I think it might be OK if we only required it to be installed when changing node types (although it's not necessarily trivial how to detect that node types haven't changed, if we wanted to allow other code changes in the relevant headers), and stored the generated code in the repository.

Alternatively we could annotate the code enough to be able to write our own parser, or use some other C parser.

I don't really want to invest significantly more time into this without first debating the general idea. The attached patch really just is a fairly minimal prototype.
It does not:

- properly integrate into the buildsystem, to automatically re-generate the node information when necessary - instead one has to "make -C src/backend/catalog gennodes-run"
- solve the full problem (see discussion of referenced arrays above)
- actually use the metadata in a useful manner (there just is a function that estimates node tree sizes, which is used for parse/plan trees)
- have clean code

One interesting - but less important - bit is that the patch currently has the declared node type id available for node members, e.g.:

typedef struct FieldStore
{
	Expr		xpr;
	Expr	   *arg;			/* input tuple value */
	List	   *newvals;		/* new value(s) for field(s) */
	List	   *fieldnums;		/* integer list of field attnums */
	Oid			resulttype;		/* type of result (same as type of arg) */
	/* Like RowExpr, we deliberately omit a typmod and collation here */
} FieldStore;

but we cannot actually rely on arg actually pointing to a node of type Expr* - it'll always point to a "subclass". Obviously we can handle that by dispatching using nodeTag(arg), but it'd be more efficient if we didn't need that. And we also lose some error checking due to that. I wonder if we could collect enough metadata to determine which potential subclasses a node type has. And then dispatch more efficiently when there's only one, and assert that it's a legal subclass for the other cases.

Greetings,

Andres Freund
Attachment
Hello Andres,

Just my 0.02 €:

> There's been a lot of complaints over the years about how annoying it is
> to keep the out/read/copy/equalfuncs.c functions in sync with the actual
> underlying structs.
>
> There've been various calls for automating their generation, but no
> actual patches that I am aware of.

I started something a while back, AFAICR after spending stupid time looking for a stupid missing field copy or whatever. I wrote a (simple) perl script deriving all (most) node utility functions from the header files. I gave up as the idea did not gather much momentum from committers, so I assumed the effort would be rejected in the end. AFAICR the feedback spirit was something like "node definitions do not change often, we can manage it by hand".

> There also recently has been discussion about generating more efficient
> memory layout for node trees that we know are read only (e.g. plan trees
> inside the plancache), and about copying such trees more efficiently
> (e.g. by having one big allocation, and then just adjusting pointers).

If pointers are relative to the start, they could be just indexes that do not need much adjusting.

> One way to approach this problem would be to parse the type
> definitions, and directly generate code for the various functions. But
> that does mean that such a code-generator needs to be expanded for each
> such function.

No big deal for the effort I made. The issue was more dealing with exceptions (eg "we do not serialize this field because it is not used for some reason") and understanding some implicit assumptions in the struct declarations.

> An alternative approach is to have a parser of the node definitions that
> doesn't generate code directly, but instead generates metadata. And then
> use that metadata to write node aware functions. This seems more
> promising to me.

Hmmm. The approach we had in an (old) research project was to write the metadata, and derive all struct & utility functions from it. It is simpler this way because you save parsing some C, and it can be made language agnostic (i.e. serializing the data structure from one language and reading its value from another).

> I'm fairly sure this metadata can also be used to write the other
> currently existing node functions.

Beware of strange exceptions…

> With regards to using libclang for the parsing: I chose that because it
> seemed the easiest to experiment with, compared to annotating all the
> structs with enough metadata to be able to easily parse them from a perl
> script.

I did not find this an issue when I tried, because the annotation needed is basically the type name of the field.

> The node definitions are after all distributed over quite a few headers.

Yep.

> I think it might even be the correct way forward, over inventing our own
> mini-languages and writing ad-hoc parsers for those. It sure is easier
> to understand plain C code, compared to having to understand various
> embedded mini-languages consisting of macros.

Dunno.

> The obvious drawback is that it'd require more people to install
> libclang - a significant imposition.

Indeed. A perl-only dependence would be much simpler than relying on a particular library from a particular compiler to compile postgres, possibly with an unrelated compiler.

> Alternatively we could annotate the code enough to be able to write our
> own parser, or use some other C parser.

If you can dictate some conventions, eg one line/one field, simple perl regexprs would work well I think; you would not need a parser per se.
> I don't really want to invest significantly more time into this without
> first debating the general idea.

That's what I did, and I quit quickly :-)

On the general idea, I'm 100% convinced that stupid utility functions should be either generic or generated, not maintained by hand.

-- 
Fabien.
Hallo Andres,

>> There've been various calls for automating their generation, but no
>> actual patches that I am aware of.
>
> I started something a while back

I have found this thread:

https://www.postgresql.org/message-id/flat/E1cq93r-0004ey-Mp%40gemulon.postgresql.org

It seems that comments from committers discouraged me from going on… :-) For instance Robert wanted a "checker", which is basically harder than a generator because you have to parse both sides and then compare.

Basically there was no consensus at the time (March 2017), so I gave up. It seems that I even destroyed the branch I was working on, so the no doubt magnificent perl script I wrote is also lost.

-- 
Fabien.
Hi,

On 2019-08-28 16:41:36 -0700, Andres Freund wrote:
> There's been a lot of complaints over the years about how annoying it is
> to keep the out/read/copy/equalfuncs.c functions in sync with the actual
> underlying structs.
>
> There've been various calls for automating their generation, but no
> actual patches that I am aware of.
>
> There also recently has been discussion about generating more efficient
> memory layout for node trees that we know are read only (e.g. plan trees
> inside the plancache), and about copying such trees more efficiently
> (e.g. by having one big allocation, and then just adjusting pointers).
>
> One way to approach this problem would be to parse the type
> definitions, and directly generate code for the various functions. But
> that does mean that such a code-generator needs to be expanded for each
> such function.
>
> An alternative approach is to have a parser of the node definitions that
> doesn't generate code directly, but instead generates metadata. And then
> use that metadata to write node aware functions. This seems more
> promising to me.
>
> In the attached, *very experimental WIP*, patch I've played around with
> this approach. I have written a small C program that uses libclang
> (more on that later) to parse relevant headers, and generate metadata
> about node types.
>
> [more detailed explanation of the approach]

Attached is a patchset that implements this approach. At the end the patchset entirely replaces src/backend/nodes/{copyfuncs.c,equalfuncs.c,outfuncs.c,readfuncs.c,read.c} with relatively generic code using the metadata mentioned above.

To show how nice this is, here's the diffstat for the most relevant commits in the series:

1) WIP: Introduce compile-time node type metadata collection & reimplement node funcs.
   18 files changed, 3192 insertions(+), 30 deletions(-)

2) WIP: Add generated node metadata.
   1 file changed, 7279 insertions(+)

3) WIP: Remove manually maintained out/read/copy/equalfuncs code.
   11 files changed, 49 insertions(+), 17277 deletions(-)

Even including the full auto-generated metadata, it's still a quite substantial win.

Using that metadata one can do stuff that wasn't feasible before. As an example, the last patch in the series implements a version of copyObject() (badly named copyObjectRo()) that copies an object into a single allocation.
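Very roughly, the shape of that function is a metadata-driven size pass followed by a copy pass into one palloc'd block. A sketch only - the helper names below are hypothetical, not the actual patch code:

Node *
copyObjectRo_sketch(const Node *node)
{
	Size		size = node_tree_size(node);	/* metadata-driven size pass */
	char	   *block = palloc(size);
	char	   *cur = block;

	/*
	 * Second, metadata-driven pass (hypothetical helper): copy each
	 * (sub-)node and its out-of-line fields into the block, advancing
	 * 'cur' and pointing the copied pointer fields at their new
	 * locations.  The root node lands at the start of the block.
	 */
	copy_node_into(&cur, node);

	Assert(cur == block + size);
	return (Node *) block;
}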
That's quite worthwhile memory-usage wise:

PREPARE foo AS
SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, c.relhastriggers, c.relrowsecurity, c.relforcerowsecurity, false AS relhasoids, c.relispartition, pg_catalog.array_to_string(c.reloptions || array(select 'toast.' || x from pg_catalog.unnest(tc.reloptions) x), ', '), c.reltablespace, CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, c.relpersistence, c.relreplident, am.amname
FROM pg_catalog.pg_class c
  LEFT JOIN pg_catalog.pg_class tc ON (c.reltoastrelid = tc.oid)
  LEFT JOIN pg_catalog.pg_am am ON (c.relam = am.oid)
WHERE c.oid = '1259';
EXECUTE foo;

With single-allocation:
CachedPlan: 24504 total in 2 blocks; 664 free (0 chunks); 23840 used
Grand total: 24504 bytes in 2 blocks; 664 free (0 chunks); 23840 used

Default:
CachedPlan: 65536 total in 7 blocks; 16016 free (0 chunks); 49520 used
Grand total: 65536 bytes in 7 blocks; 16016 free (0 chunks); 49520 used

And with a bit more elbow grease we could expand that logic so that copyObject from such a "block allocated" node tree would already know how much memory to allocate, memcpy() the source over to the target, and just adjust the pointer offsets.

We could also use this to have a version of IsA() that supported our form of inheritance. In a quick hack I wrote a function that computes a per-node-type bitmap indicating whether other nodetypes are subtypes of that node. While there's still a bug (it doesn't correctly handle node types that are just typedefed to each other), it already allows producing:

type T_JoinState has 3 subtypes
	child: T_NestLoopState
	child: T_MergeJoinState
	child: T_HashJoinState
type T_Expr has 44 subtypes
	child: T_Var
	child: T_Const
	child: T_Param
	child: T_Aggref
	child: T_GroupingFunc
	child: T_WindowFunc
	child: T_SubscriptingRef
	child: T_FuncExpr
	child: T_NamedArgExpr

It seems quite useful to be able to assert such relationships.

> I *suspect* that it'll be quite OK performancewise, compared to bespoke
> code, but I have *NOT* measured that.

I now have. When testing with COPY_PARSE_PLAN_TREES and WRITE_READ_PARSE_PLAN_TREES, the "generic" code is a bit slower, but not much (< 5%). To some degree that's possibly inherent, but I think the bigger issue right now is that the new code does more appendStringInfo() calls, because it outputs the field name separately. Also, the slightly expanded node format adds a few more tokens, which need to be parsed (pg_strtok, or rather its replacement, is the slowest bit). I think with a bit of optimization we can get that back, and the facilities allow us to make much bigger wins. So I think this is quite good.

> With regards to using libclang for the parsing: I chose that because it
> seemed the easiest to experiment with, compared to annotating all the
> structs with enough metadata to be able to easily parse them from a perl
> script. The node definitions are after all distributed over quite a few
> headers.
>
> I think it might even be the correct way forward, over inventing our own
> mini-languages and writing ad-hoc parsers for those. It sure is easier
> to understand plain C code, compared to having to understand various
> embedded mini-languages consisting of macros.
>
> The obvious drawback is that it'd require more people to install
> libclang - a significant imposition.
> I think it might be OK if we only
> required it to be installed when changing node types (although it's not
> necessarily trivial how to detect that node types haven't changed, if we
> wanted to allow other code changes in the relevant headers), and stored
> the generated code in the repository.
>
> Alternatively we could annotate the code enough to be able to write our
> own parser, or use some other C parser.

Still using libclang. The .c file containing the metadata is now part of the source tree (i.e. would be checked in), and would only need to be regenerated when one of the relevant headers changes (but obviously such a change could be unimportant).

> The one set of fields this currently can not deal with is the various
> arrays that we directly reference from nodes. E.g.:
>
> typedef struct Sort
> {
>     Plan        plan;
>     int         numCols;        /* number of sort-key columns */
>     AttrNumber *sortColIdx;     /* their indexes in the target list */
>     Oid        *sortOperators;  /* OIDs of operators to sort them by */
>     Oid        *collations;     /* OIDs of collations */
>     bool       *nullsFirst;     /* NULLS FIRST/LAST directions */
> } Sort;
>
> the generic code has no way of knowing that sortColIdx, sortOperators,
> collations, nullsFirst are all numCols long arrays.
>
> I can see various ways of dealing with that:
>
> 1) We could convert them all to lists, now that we have fast arbitrary
>    access. But that'd add a lot of indirection / separate allocations.
>
> 2) We could add annotations to the sourcecode, to declare that
>    association. That's probably not trivial, but wouldn't be that hard -
>    one disadvantage is that we probably couldn't use that declaration
>    for automated asserts etc.
>
> 3) We could introduce a few macros to create array types that include the
>    length of the members. That'd duplicate the length for each of those
>    arrays, but I have a bit of a hard time believing that that's a
>    meaningful enough overhead.
>
> I'm thinking of a macro that'd be used like
>     ARRAY_FIELD(AttrNumber) *sortColIdx;
> that'd generate code like
>     struct
>     {
>         size_t nmembers;
>         AttrNumber members[FLEXIBLE_ARRAY_MEMBER];
>     } *sortColIdx;
>
> plus a set of macros (or perhaps inline functions + macros) to access
> them.

I've implemented 3), which seems to work well. But it's a fair bit of macro magic.

Basically, one can define a type to be array supported, by once using PGARR_DEFINE_TYPE(element_type); which defines a struct type that has a members array of type element_type. After that, variables of the array type can be defined using PGARR(element_type) (as members in a struct, variables, ...).

Macros like pgarr_size(arr), pgarr_empty(arr), pgarr_at(arr, at) can be used to query (and in the last case also modify) the array.

pgarr_append(element_type, arr, newel) can be used to append to the array. Unfortunately I haven't figured out a satisfying way to write pgarr_append() without specifying the element_type. Either there's multiple evaluation of any of the types (for checking whether the capacity needs to be increased), only `newel`s that can have their address taken are supported (so it can be passed to a helper function), or compiler specific magic has to be used (__typeof__ solves this nicely).

The array can be allocated (using pgarr_alloc_ro(type, capacity)) so that a certain number of elements fit inline.

Boring stuff aside, there's a few more parts of the patchseries:

1) Move int->string routines to src/common, and expand.
   The current pg_*toa routines are quite incomplete (no unsigned support), are terribly named, and require unnecessary strlen() calls to recompute the length of the number, even though the routines already know it. Fix all of that, and move them to src/common/string.c for good measure.

2) WIP: Move stringinfo to common/

   The metadata collection program needed a string buffer. PQExpBuffer is weird, and has less functionality.

3) WIP: Improve and expand stringinfo.[ch].

   Expands the set of functions exposed, to allow appending various numeric types etc. Also improves efficiency by moving code to inline functions - that's beneficial because the size is often known, which can make the copy faster.

The code is also available at
https://github.com/anarazel/postgres/tree/header
https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary (in the header branch).

While working on this I evolved the node string format a bit:

1) Node types start with their "normal" name, rather than uppercase. There seems little point in having such a divergence.

2) The node type is followed by the node-type id. That allows locating the corresponding node metadata more quickly (an array lookup and one name recheck, rather than a binary search). I.e. the node starts with "{Scan 18 " rather than "{SCAN " as before.

3) Nodes that contain other nodes as sub-types "inline" still emit {} for the subtype. There's no functional need for this, but I found the output otherwise much harder to read. E.g. for mergejoin we'd have something like

   {MergeJoin 37 :join {Join 35 :plan {Plan ...} :jointype JOIN_INNER ...} :skip_mark_restore true ...}

4) As seen in the above example, enums are decoded to their string values. I found that makes the output easier to read. Again, not functionally required.

5) Value nodes aren't emitted without a {Value ...} anymore. I changed this when I expanded the WRITE/READ tests, and encountered failures because the old encoding is not entirely roundtrip safe (e.g. -INT32_MIN will be parsed as a float at raw parse time, but after write/read, it'll be parsed as an integer). While that could be fixed in other ways (e.g. by emitting a trailing . for all floats), I also found it to be clearer this way - Value nodes are otherwise indistinguishable from raw strings, raw numbers etc, which is not great.

It'd also be easier to now just change the node format to something else.

Comments?

Todo / Questions:
- lots of cleanup
- better error checking when looking up node definitions
- better dealing with node types that are just typedef'ed to another type
- don't use the full tokenizer in all places, too expensive

Greetings,

Andres Freund
Attachment
- v2-0001-FIX-ExprState.tag-to-actually-just-be-a-tag.patch.gz
- v2-0002-Introduce-separate-type-for-location-information.patch.gz
- v2-0003-WIP-Introduce-strongly-typed-array-type.patch.gz
- v2-0004-Move-int-string-routines-to-src-common-and-expand.patch.gz
- v2-0005-WIP-Move-stringinfo-to-common.patch.gz
- v2-0006-WIP-Improve-and-expand-stringinfo.-ch.patch.gz
- v2-0007-Change-out-readfuncs-to-handle-floats-precisely.patch.gz
- v2-0008-WIP-Introduce-compile-time-node-type-metadata-col.patch.gz
- v2-0009-WIP-Add-generated-node-metadata.patch.gz
- v2-0010-Enable-previously-disabled-asserts-ensuring-node-.patch.gz
- v2-0011-WIP-Remove-manually-maintained-out-read-copy-equa.patch.gz
- v2-0012-WIP-Add-support-for-copying-node-tree-into-single.patch.gz
Hi,

On 2019-09-19 22:18:57 -0700, Andres Freund wrote:
> While working on this I evolved the node string format a bit:
>
> 1) Node types start with their "normal" name, rather than
> uppercase. There seems little point in having such a divergence.
>
> 2) The node type is followed by the node-type id. That allows to more
> quickly locate the corresponding node metadata (array and one name
> recheck, rather than a binary search). I.e. the node starts with
> "{Scan 18 " rather than "{SCAN " as before.
>
> 3) Nodes that contain other nodes as sub-types "inline", still emit {}
> for the subtype. There's no functional need for this, but I found the
> output otherwise much harder to read. E.g. for mergejoin we'd have
> something like
>
> {MergeJoin 37 :join {Join 35 :plan {Plan ...} :jointype JOIN_INNER ...} :skip_mark_restore true ...}
>
> 4) As seen in the above example, enums are decoded to their string
> values. I found that makes the output easier to read. Again, not
> functionally required.
>
> 5) Value nodes aren't emitted without a {Value ...} anymore. I changed
> this when I expanded the WRITE/READ tests, and encountered failures
> because the old encoding is not entirely roundtrip safe
> (e.g. -INT32_MIN will be parsed as a float at raw parse time, but
> after write/read, it'll be parsed as an integer). While that could be
> fixed in other ways (e.g. by emitting a trailing . for all floats), I
> also found it to be clearer this way - Value nodes are otherwise
> indistinguishable from raw strings, raw numbers etc, which is not
> great.
>
> It'd also be easier to now just change the node format to something else.

E.g. to just use json. Which'd certainly be a lot easier to delve into, given the amount of tooling (both on the pg SQL level, and for commandline / editors / etc). I don't think it'd be any less efficient. There'd be a few more = signs, but the lexer is smarter / faster than the one currently in use for the outfuncs format. And we'd just reuse pg_parse_json rather than having a dedicated parser.

- Andres
On Fri, Sep 20, 2019 at 03:43:54PM -0700, Andres Freund wrote:
> Hi,
>
> On 2019-09-19 22:18:57 -0700, Andres Freund wrote:
> > While working on this I evolved the node string format a bit:
> >
> > 1) Node types start with their "normal" name, rather than
> > uppercase. There seems little point in having such a divergence.
> >
> > 2) The node type is followed by the node-type id. That allows to more
> > quickly locate the corresponding node metadata (array and one name
> > recheck, rather than a binary search). I.e. the node starts with
> > "{Scan 18 " rather than "{SCAN " as before.
> >
> > 3) Nodes that contain other nodes as sub-types "inline", still emit {}
> > for the subtype. There's no functional need for this, but I found the
> > output otherwise much harder to read. E.g. for mergejoin we'd have
> > something like
> >
> > {MergeJoin 37 :join {Join 35 :plan {Plan ...} :jointype JOIN_INNER ...} :skip_mark_restore true ...}
> >
> > 4) As seen in the above example, enums are decoded to their string
> > values. I found that makes the output easier to read. Again, not
> > functionally required.
> >
> > 5) Value nodes aren't emitted without a {Value ...} anymore. I changed
> > this when I expanded the WRITE/READ tests, and encountered failures
> > because the old encoding is not entirely roundtrip safe
> > (e.g. -INT32_MIN will be parsed as a float at raw parse time, but
> > after write/read, it'll be parsed as an integer). While that could be
> > fixed in other ways (e.g. by emitting a trailing . for all floats), I
> > also found it to be clearer this way - Value nodes are otherwise
> > indistinguishable from raw strings, raw numbers etc, which is not
> > great.
> >
> > It'd also be easier to now just change the node format to something else.
>
> E.g. to just use json.

+many

JSON's been around long enough to give some assurance that it's not going away, and it's pretty simple.

Best,
David.

-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Fri, Aug 30, 2019 at 9:03 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> I have found this thread:
>
> https://www.postgresql.org/message-id/flat/E1cq93r-0004ey-Mp%40gemulon.postgresql.org
>
> It seems that comments from committers discouraged me from going on… :-) For
> instance Robert wanted a "checker", which is basically harder than a
> generator because you have to parse both sides and then compare.

Well, I don't think I intended to ask for something that was more difficult than a full generator. I think it's more that I had the idea that a checker would be simpler. It's true that you'd have to parse both sides and compare. On the other hand, a checker can be incomplete -- only checking certain things -- whereas a generator has to work completely -- including all of the strange cases. So it seemed to me that a checker would allow for tolerating more in the way of exceptions than a generator. A generator also has to integrate properly into the build system, which can be tricky.

It seems like the approach Andres is proposing here could work pretty well. I think the biggest possible problem is that any semi-serious developer will basically have to have LLVM installed. To build the software, you wouldn't need LLVM unless you want to build with JIT support. But to modify the software, you'll need LLVM for any modification that touches node definitions. I don't know how much of a nuisance that's likely to be for people, especially people developing on less-mainstream platforms.

One concern I have is about whether the code that uses LLVM is likely to be dependent on specific LLVM versions. If I can just type something like 'yum/port/brew/apt-get install llvm' on any semi-modern platform and have that be good enough, it won't bother me much at all. On the other hand, if I have to hand-compile it because RHEL version $X only ships older LLVM $Y (or ships unexpectedly-newer version $YY) then that's going to be annoying.

We already have the same annoyance with autoconf; at some times, I've needed to have multiple versions installed locally to cater to all the branches. However, that's less of a problem than this would be, because (1) updating configure is a substantially less-common need than updating node definitions and (2) autoconf is a much smaller piece of software than libclang. It builds in about 1 second, which I bet LLVM does not.

To point to an analogous case, note that we pretty much have to adjust a bunch of things every few years to be able to support new versions of Visual Studio, and until we do, it Just Doesn't Work. That stinks. In contrast, new versions of gcc often cause new warnings, but those are easier to work around until such time as somebody gets around to cleaning them up. But most developers get to ignore Windows most of the time, whereas if this breaks for somebody, they can't really work at all until they either work around it on their side or it gets fixed upstream. So it's a significant potential inconvenience.

As a benchmark, I'd propose this: if the LLVM interfaces that this new code would use work in all versions of LLVM released in the last 3 years and there's no indication that they will change in the next release, then I'd feel pretty comfortable. If they've changed once, that'd probably still be OK. If they've changed more than once, perhaps we should think about a Perl script under our own control as an alternative, so as to avoid having to keep adjusting the C code every time LLVM whacks the interfaces around.
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> It seems like the approach Andres is proposing here could work pretty
> well. I think the biggest possible problem is that any semi-serious
> developer will basically have to have LLVM installed. To build the
> software, you wouldn't need LLVM unless you want to build with JIT
> support. But to modify the software, you'll need LLVM for any
> modification that touches node definitions. I don't know how much of a
> nuisance that's likely to be for people, especially people developing
> on less-mainstream platforms.

I'm afraid that's going to be a deal-breaker for lots of people. It's fine for prototyping the idea but we'll need to find another implementation before we can move to commit.

> One concern I have is about whether the
> code that uses LLVM is likely to be dependent on specific LLVM
> versions.

Yeah, that's one of the reasons it's a deal-breaker. We've been able to insist that everybody touching configure use the same autoconf version, but I don't think that could possibly fly for LLVM. But without that, all the derived files would change in random ways depending on who'd committed last. Even if it's only whitespace changes, that'd be a mess.

regards, tom lane
On Wed, Oct 2, 2019 at 12:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm afraid that's going to be a deal-breaker for lots of people.
> It's fine for prototyping the idea but we'll need to find another
> implementation before we can move to commit.

Why do you think it will be a deal-breaker for lots of people? I mean, if we get this to a point where it just requires installing some reasonably modern version of LLVM, I don't see why that's worse than having to do the same thing for say, Perl if you want to build --with-perl, or Python if you want to build --with-python, or bison or lex if you want to change the lexer and parser. One more build-time dependency shouldn't be a big deal, as long as we don't need a really specific version. Or am I missing something?

> > One concern I have is about whether the
> > code that uses LLVM is likely to be dependent on specific LLVM
> > versions.
>
> Yeah, that's one of the reasons it's a deal-breaker. We've been able to
> insist that everybody touching configure use the same autoconf version,
> but I don't think that could possibly fly for LLVM. But without that,
> all the derived files would change in random ways depending on who'd
> committed last. Even if it's only whitespace changes, that'd be a mess.

I don't really see a reason why that would be an issue here. The code Andres wrote just uses LLVM to parse the structure definitions from our header files; the code generation stuff is hand-rolled and just prints out C. It's basically two big arrays, one of which is indexed by NodeTag and thus in a fixed order, and the other of which is an array of all structure members of all node types. The latter doesn't seem to be sorted in any terribly obvious way at the moment -- structure members are in order of occurrence within the corresponding definition, but the definitions themselves are not in any super-obvious order. That could presumably be fixed pretty easily, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Oct 2, 2019 at 12:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm afraid that's going to be a deal-breaker for lots of people.
>> It's fine for prototyping the idea but we'll need to find another
>> implementation before we can move to commit.

> Why do you think it will be a deal-breaker for lots of people? I mean,
> if we get this to a point where it just requires installing some
> reasonably modern version of LLVM, I don't see why that's worse than
> having to do the same thing for say, Perl if you want to build
> --with-perl, or Python if you want to build --with-python, or bison or
> lex if you want to change the lexer and parser. One more build-time
> dependency shouldn't be a big deal, as long as we don't need a really
> specific version. Or am I missing something?

Think it's available/trivially installable on e.g. Windows? I'm not convinced. In any case, our list of build requirements is depressingly long already.

The existing expectation is that we make our build tools in Perl. I'm sure Andres doesn't want to write a C parser in Perl, but poking around suggests that there are multiple options already available in CPAN. I'd much rather tell people "oh, there's YA module you need to get from CPAN" than "figure out how to install version XXX of LLVM".

The other direction we could plausibly go in is to give up the assumption that parsenodes.h and friends are the authoritative source of info, and instead declare all these structs in a little language based on JSON or what-have-you, from which we generate parsenodes.h along with the backend/nodes/ files.

I kind of suspect that we'll be forced into that eventually anyway, because one thing you are not going to get from LLVM or a pre-existing Perl C parser is anything but the lowest-common-denominator version of what's in the structs. I find it really hard to believe that we won't need some semantic annotations in addition to the bare C struct declarations. As an example: in some cases, pointer values in a Node struct point to arrays of length determined by a different field in the struct. How do we tie those together without magic? I think there has to be an annotation marking the connection, and we're not going to find that out from LLVM.

regards, tom lane
Hi,

On 2019-10-02 14:30:08 -0400, Robert Haas wrote:
> On Wed, Oct 2, 2019 at 12:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I'm afraid that's going to be a deal-breaker for lots of people.
> > It's fine for prototyping the idea but we'll need to find another
> > implementation before we can move to commit.
>
> Why do you think it will be a deal-breaker for lots of people? I mean,
> if we get this to a point where it just requires installing some
> reasonably modern version of LLVM, I don't see why that's worse than
> having to do the same thing for say, Perl if you want to build
> --with-perl, or Python if you want to build --with-python, or bison or
> lex if you want to change the lexer and parser. One more build-time
> dependency shouldn't be a big deal, as long as we don't need a really
> specific version. Or am I missing something?

It shouldn't be that bad. On windows it can just be a binary to install:
http://releases.llvm.org/download.html
and newer versions of visual studio apparently even have GUI support for installing LLVM + clang + ...

> > > One concern I have is about whether the
> > > code that uses LLVM is likely to be dependent on specific LLVM
> > > versions.

FWIW, after one forward-compatible code change (a convenience function didn't exist) and one configure check (3.7 is older than what we support for JIT), gennodes.c generated exactly the same output with 3.7 and the tip-of-tree libclang that I used for development. I did not check further back, but 3.7 is plenty old.

The libclang API is part of the small set of "stable" APIs (which only expose C) in LLVM, and the deprecation cycles are fairly long. That obviously doesn't protect against accidentally adding dependencies on a feature only available in newer libclang versions, but I don't think we'd be likely to change the feature set of the node generation program all that often, and we already have buildfarm animals using most llvm versions.

> > Yeah, that's one of the reasons it's a deal-breaker. We've been able to
> > insist that everybody touching configure use the same autoconf version,
> > but I don't think that could possibly fly for LLVM. But without that,
> > all the derived files would change in random ways depending on who'd
> > committed last. Even if it's only whitespace changes, that'd be a mess.
>
> I don't really see a reason why that would be an issue here. The code
> Andres wrote just uses LLVM to parse the structure definitions from
> our header files; the code generation stuff is hand-rolled and just
> prints out C. It's basically two big arrays, one of which is indexed
> by NodeTag and thus in a fixed order, and the other of which is an
> array of all structure members of all node types. The latter doesn't
> seem to be sorted in any terribly obvious way at the moment --
> structure members are in order of occurrence within the corresponding
> definition, but the definitions themselves are not in any
> super-obvious order. That could presumably be fixed pretty easily,
> though.

Yea, I don't think there should be any big problem here. The order already is "consistent" between versions etc, and only depends on the content of our include files.

It's not quite right yet, because adding a structure member currently can cause too big a diff in the generated metadata file, due to the renumbering of the array indexes included in the struct contents. For the "structure members" array that is probably best solved by simply removing it, and instead referencing the members directly from within the node array.
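I.e. very roughly something like this (just a sketch of the direction, reusing the field names from the first mail rather than what the patch currently generates):

typedef struct NodeTypeInfo
{
	const char *structname;
	int16		nelems;
	const NodeTypeComponents *components;	/* direct pointer, replacing the start_at index */
	int16		size;
} NodeTypeInfo;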
That'll slightly increase the size (pointer, rather than uint16) of the "node structs" array, and make it harder to iterate over the members of all structures (uh, why would you want that?), but is otherwise easy.

The strings table, which is currently "uniqued", is suboptimal for similar reasons, and also for performance (one more lookup in a separate array). I suspect the best way there too might be to just inline them, instead of storing them de-duplicated.

Inlining the contents of those would also make it fairly easy to update the file when doing small changes, if somebody doesn't want to / can't install libclang. We could probably include a few static asserts (once we can have them at file scope) in the generated node file, to make mistakes more apparent.

If we decide that size is enough of an issue, we'd probably have to output some macros to reduce the amount of visible re-numbering. So instead of something like

{.name = 2201 /* Result */, .first_field_at = 1857, .num_fields = 2, .size = sizeof(Result)},

we'd have

{.name = TI_STRINGOFF_RESULT, .first_field_at = TI_FIRST_FIELD_RESULT, .num_fields = 2, .size = sizeof(Result)},

#define TI_STRINGOFF_RESULT (TI_STRINGOFF_RESCONSTQUAL + 1)
#define TI_FIRST_FIELD_RESULT (TI_FIRST_FIELD_PLAN + 15)

(with a bit more smarts for how to name the string #defines in a non-conflicting way)

Greetings,

Andres Freund
Hi,

On 2019-10-02 14:47:22 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Wed, Oct 2, 2019 at 12:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I'm afraid that's going to be a deal-breaker for lots of people.
> >> It's fine for prototyping the idea but we'll need to find another
> >> implementation before we can move to commit.
>
> > Why do you think it will be a deal-breaker for lots of people? I mean,
> > if we get this to a point where it just requires installing some
> > reasonably modern version of LLVM, I don't see why that's worse than
> > having to do the same thing for say, Perl if you want to build
> > --with-perl, or Python if you want to build --with-python, or bison or
> > lex if you want to change the lexer and parser. One more build-time
> > dependency shouldn't be a big deal, as long as we don't need a really
> > specific version. Or am I missing something?
>
> Think it's available/trivially installable on e.g. Windows? I'm not
> convinced. In any case, our list of build requirements is depressingly
> long already.

As I wrote nearby, it's just a download of an installer away.

> The existing expectation is that we make our build tools in Perl.
> I'm sure Andres doesn't want to write a C parser in Perl, but
> poking around suggests that there are multiple options already
> available in CPAN. I'd much rather tell people "oh, there's YA
> module you need to get from CPAN" than "figure out how to install
> version XXX of LLVM".

As far as I can tell they're all at least one of
1) written in C, so also have build requirements (obviously a shorter
   build time)
2) not very good (including plenty unsupported C, not to speak of
   various optional extensions we use, not having preprocessor support,
   ...)
3) unmaintained for many years.

Did you find any convincing ones? Whereas libclang / llvm seem very unlikely to be unmaintained anytime soon, given the still increasing adoption. It's also much more complete than any such perl module will realistically be.

> The other direction we could plausibly go in is to give up the
> assumption that parsenodes.h and friends are the authoritative
> source of info, and instead declare all these structs in a little
> language based on JSON or what-have-you, from which we generate
> parsenodes.h along with the backend/nodes/ files.

I think this should really work for more than just parsenodes (if you mean primnodes etc with "friends"), and even more than just node types (if you mean all the common node types with "friends"). For other Node types we already have to have pretty complete out/readfuncs support (e.g. to ship plans to parallel workers), and there's plenty of other cases where we can use that information, e.g. as done in the prototype attached upthread:

On 2019-09-19 22:18:57 -0700, Andres Freund wrote:
> Using that metadata one can do stuff that wasn't feasible before. As an
> example, the last patch in the series implements a version of
> copyObject() (badly named copyObjectRo()) that copies an object into a
> single allocation.
> That's quite worthwhile memory-usage wise:
>
> PREPARE foo AS SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, c.relhastriggers, c.relrowsecurity, c.relforcerowsecurity, false AS relhasoids, c.relispartition, pg_catalog.array_to_string(c.reloptions || array(select 'toast.' || x from pg_catalog.unnest(tc.reloptions) x), ', '), c.reltablespace, CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, c.relpersistence, c.relreplident, am.amname FROM pg_catalog.pg_class c LEFT JOIN pg_catalog.pg_class tc ON (c.reltoastrelid = tc.oid) LEFT JOIN pg_catalog.pg_am am ON (c.relam = am.oid) WHERE c.oid = '1259';
> EXECUTE foo;
>
> With single-allocation:
> CachedPlan: 24504 total in 2 blocks; 664 free (0 chunks); 23840 used
> Grand total: 24504 bytes in 2 blocks; 664 free (0 chunks); 23840 used
>
> Default:
> CachedPlan: 65536 total in 7 blocks; 16016 free (0 chunks); 49520 used
> Grand total: 65536 bytes in 7 blocks; 16016 free (0 chunks); 49520 used
>
> And with a bit more elbow grease we could expand that logic so that
> copyObject from such a "block allocated" node tree would already know
> how much memory to allocate, memcpy() the source over to the target, and
> just adjust the pointer offsets.

And I'm currently prototyping implementing the serialization/deserialization of UNDO records into a compressed format using very similar information, to resolve the impasse that one side (among others, Robert) wants efficient and meaningful compression of undo records, while not believing a general compression library can provide that, and the other side (most prominently Heikki) doesn't want to limit the format of undo records that much.

> I kind of suspect that we'll be forced into that eventually anyway,
> because one thing you are not going to get from LLVM or a pre-existing
> Perl C parser is anything but the lowest-common-denominator version of
> what's in the structs. I find it really hard to believe that we won't
> need some semantic annotations in addition to the bare C struct
> declarations. As an example: in some cases, pointer values in a Node
> struct point to arrays of length determined by a different field in
> the struct. How do we tie those together without magic?

I did solve that in the patchset posted here by replacing such "bare" arrays with an array type that includes both the length and the members (without losing the type, using some macro magic). I think that approach has some promise, not just for this - I'd greatly appreciate thoughts on the relevant part of the messages upthread (copied at the bottom, for convenience). But also:

> I think there has to be an annotation marking the connection, and
> we're not going to find that out from LLVM.

libclang does allow access to macro "expansions" and also parsing of comments. So I don't think it'd be a problem to just recognize such connections if we added a few macros for that purpose.

On 2019-09-19 22:18:57 -0700, Andres Freund wrote:
> > The one set of fields this currently can not deal with is the various
> > arrays that we directly reference from nodes. E.g.:
> > typedef struct Sort
> > {
> >     Plan        plan;
> >     int         numCols;        /* number of sort-key columns */
> >     AttrNumber *sortColIdx;     /* their indexes in the target list */
> >     Oid        *sortOperators;  /* OIDs of operators to sort them by */
> >     Oid        *collations;     /* OIDs of collations */
> >     bool       *nullsFirst;     /* NULLS FIRST/LAST directions */
> > } Sort;
> >
> > the generic code has no way of knowing that sortColIdx, sortOperators,
> > collations, nullsFirst are all numCols long arrays.
> >
> > I can see various ways of dealing with that:
> >
> > 1) We could convert them all to lists, now that we have fast arbitrary
> >    access. But that'd add a lot of indirection / separate allocations.
> >
> > 2) We could add annotations to the sourcecode, to declare that
> >    association. That's probably not trivial, but wouldn't be that hard -
> >    one disadvantage is that we probably couldn't use that declaration
> >    for automated asserts etc.
> >
> > 3) We could introduce a few macros to create array types that include the
> >    length of the members. That'd duplicate the length for each of those
> >    arrays, but I have a bit of a hard time believing that that's a
> >    meaningful enough overhead.
> >
> > I'm thinking of a macro that'd be used like
> >     ARRAY_FIELD(AttrNumber) *sortColIdx;
> > that'd generate code like
> >     struct
> >     {
> >         size_t nmembers;
> >         AttrNumber members[FLEXIBLE_ARRAY_MEMBER];
> >     } *sortColIdx;
> >
> > plus a set of macros (or perhaps inline functions + macros) to access
> > them.
>
> I've implemented 3), which seems to work well. But it's a fair bit of
> macro magic.
>
> Basically, one can define a type to be array supported, by once using
> PGARR_DEFINE_TYPE(element_type); which defines a struct type that has a
> members array of type element_type. After that variables of the array
> type can be defined using PGARR(element_type) (as members in a struct,
> variables, ...).
>
> Macros like pgarr_size(arr), pgarr_empty(arr), pgarr_at(arr, at) can be
> used to query (and in the last case also modify) the array.
>
> pgarr_append(element_type, arr, newel) can be used to append to the
> array. Unfortunately I haven't figured out a satisfying way to write
> pgarr_append() without specifying the element_type. Either there's
> multiple evaluation of any of the types (for checking whether the
> capacity needs to be increased), only `newel`s that can have their
> address taken are supported (so it can be passed to a helper function),
> or compiler specific magic has to be used (__typeof__ solves this
> nicely).
>
> The array can be allocated (using pgarr_alloc_ro(type, capacity)) so
> that a certain number of elements fit inline.

Greetings,

Andres Freund
On Wed, Oct 2, 2019 at 4:46 PM Andres Freund <andres@anarazel.de> wrote:
> > The existing expectation is that we make our build tools in Perl.
> > I'm sure Andres doesn't want to write a C parser in Perl, but
> > poking around suggests that there are multiple options already
> > available in CPAN. I'd much rather tell people "oh, there's YA
> > module you need to get from CPAN" than "figure out how to install
> > version XXX of LLVM".
>
> As far as I can tell they're all at least one of
> 1) written in C, so also have build requirements (obviously a shorter
>    build time)
> 2) not very good (including plenty unsupported C, not to speak of
>    various optional extensions we use, not having preprocessor support,
>    ...)
> 3) unmaintained for many years.

I find this argument to be compelling. To me, it seems like having to install some random CPAN module is probably more likely to be a problem than having to install LLVM. For one thing, it's pretty likely that any given developer already has LLVM, either because they're using clang as their C compiler in general, or because their system has it installed anyway, or because they've built with JIT support at least once. And, if they haven't got it, their favorite packaging system will certainly have a build of LLVM, but it may not have a build of Parse::C::Erratically.

Really, this has been a problem even just for TAP tests, where we've had periodic debates over whether it's OK for a test to depend on some new module that everyone may not have installed, possibly because they're running some ancient Perl distribution (and our definition of "ancient" might make a historian giggle) where it wasn't bundled with core Perl. Can we rewrite the test so it doesn't need the module? Do we just skip the test, possibly masking a failure for some users? Do we make it fail and force everybody to find a way to install that module?

Just to be clear, I'm not in love with using LLVM. It's a big hairy piece of technology that I don't understand. However, it's also a very widely-supported and very capable piece of technology that other people do understand, and we're already using it for other things. I don't think we're going to do better by using something different and probably less capable and less well-supported for this thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, 20 Sep 2019 at 17:19, Andres Freund <andres@anarazel.de> wrote:
> 3) WIP: Improve and expand stringinfo.[ch].
>
> Expands the set of functions exposed, to allow appending various
> numeric types etc. Also improve efficiency by moving code to inline
> functions - that's beneficial because the size is often known, which
> can make the copy faster.

I've thought it might be useful to have appendStringInfoInt() and the like before. However, I wondered if there's a good reason that the format needs to be easily human-readable. We could reduce the size of the output and speed up read/write functions if we were to use base16 or even base64 output.

I also think the Bitmapset output is pretty inefficient. The output is optimized for something that Bitmapsets themselves are not optimized for. The only time the current output will be efficient is if there are very few set bits but the values of those bits are very far apart. Bitmapsets themselves are not optimal for that, so I don't see why our out function for it should be. ISTM it would be better encoded as a hex string, or perhaps if we still want to optimize a bit for sparse Bitmapsets then it could be done as word_number:word_value. However, that would mean the output would be variable depending on the width of the bitmapword.

David