Re: WIP: Generic functions for Node types using generated metadata - Mailing list pgsql-hackers

From:           Andres Freund
Subject:        Re: WIP: Generic functions for Node types using generated metadata
Msg-id:         20190920051857.2fhnvhvx4qdddviz@alap3.anarazel.de
In response to: WIP: Generic functions for Node types using generated metadata (Andres Freund <andres@anarazel.de>)
Responses:      Re: WIP: Generic functions for Node types using generated metadata
                Re: WIP: Generic functions for Node types using generated metadata
List:           pgsql-hackers
Hi,

On 2019-08-28 16:41:36 -0700, Andres Freund wrote:
> There's been a lot of complaints over the years about how annoying it is
> to keep the out/read/copy/equalfuncs.c functions in sync with the actual
> underlying structs.
>
> There've been various calls for automating their generation, but no
> actual patches that I am aware of.
>
> There also recently has been discussion about generating more efficient
> memory layout for node trees that we know are read only (e.g. plan trees
> inside the plancache), and about copying such trees more efficiently
> (e.g. by having one big allocation, and then just adjusting pointers).
>
> One way to approach this problem would be to parse the type
> definitions, and directly generate code for the various functions. But
> that does mean that such a code-generator needs to be expanded for each
> such function.
>
> An alternative approach is to have a parser of the node definitions that
> doesn't generate code directly, but instead generates metadata. And then
> use that metadata to write node-aware functions. This seems more
> promising to me.
>
> In the attached, *very experimental WIP*, patch I've played around with
> this approach. I have written a small C program that uses libclang
> (more on that later) to parse relevant headers, and generate metadata
> about node types.
>
> [more detailed explanation of the approach]

Attached is a patchset that implements this approach. At the end the
patchset entirely replaces
src/backend/nodes/{copyfuncs.c,equalfuncs.c,outfuncs.c,readfuncs.c,read.c}
with relatively generic code using the metadata mentioned above.

To show how nice this is, here's the diffstat for the most relevant
commits in the series:

1) WIP: Introduce compile-time node type metadata collection & reimplement node funcs.
   18 files changed, 3192 insertions(+), 30 deletions(-)

2) WIP: Add generated node metadata.
   1 file changed, 7279 insertions(+)

3) WIP: Remove manually maintained out/read/copy/equalfuncs code.
   11 files changed, 49 insertions(+), 17277 deletions(-)

Even including the full auto-generated metadata, it's still quite a
substantial win.
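To make that more concrete, here's a minimal sketch of what such
generated per-node metadata could look like. All names below
(NodeTypeInfo, NodeFieldInfo, NodeFieldKind) are invented for
illustration; the structs in the actual patch differ:

    /* what kind of thing a field is, so generic code knows how to
     * copy/compare/serialize it */
    typedef enum NodeFieldKind
    {
        NFK_SCALAR,             /* int, bool, Oid, enum, ... */
        NFK_STRING,             /* char * */
        NFK_NODE_POINTER        /* pointer to another node */
    } NodeFieldKind;

    typedef struct NodeFieldInfo
    {
        const char *name;       /* e.g. "numCols" */
        NodeFieldKind kind;
        uint16      offset;     /* offsetof() within the node struct */
        uint16      size;       /* sizeof() the field */
    } NodeFieldInfo;

    typedef struct NodeTypeInfo
    {
        const char *name;       /* e.g. "Sort" */
        uint16      size;       /* sizeof() the node struct */
        uint16      nfields;
        const NodeFieldInfo *fields;
    } NodeTypeInfo;

    /* indexed by NodeTag, emitted by the metadata collector */
    extern const NodeTypeInfo node_type_info[];

A generic copy/equal/out/read implementation can then loop over
node_type_info[nodeTag(node)].fields and switch on each field's kind,
instead of there being hand-maintained per-type functions.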
Using that metadata one can do stuff that wasn't feasible before. As an
example, the last patch in the series implements a version of
copyObject() (badly named copyObjectRo()) that copies an object into a
single allocation. That's quite worthwhile memory-usage wise:

PREPARE foo AS
    SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules,
        c.relhastriggers, c.relrowsecurity, c.relforcerowsecurity,
        false AS relhasoids, c.relispartition,
        pg_catalog.array_to_string(c.reloptions || array(select 'toast.' || x
            from pg_catalog.unnest(tc.reloptions) x), ', '),
        c.reltablespace,
        CASE WHEN c.reloftype = 0 THEN ''
            ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END,
        c.relpersistence, c.relreplident, am.amname
    FROM pg_catalog.pg_class c
        LEFT JOIN pg_catalog.pg_class tc ON (c.reltoastrelid = tc.oid)
        LEFT JOIN pg_catalog.pg_am am ON (c.relam = am.oid)
    WHERE c.oid = '1259';
EXECUTE foo;

With single allocation:
  CachedPlan: 24504 total in 2 blocks; 664 free (0 chunks); 23840 used
  Grand total: 24504 bytes in 2 blocks; 664 free (0 chunks); 23840 used

Default:
  CachedPlan: 65536 total in 7 blocks; 16016 free (0 chunks); 49520 used
  Grand total: 65536 bytes in 7 blocks; 16016 free (0 chunks); 49520 used

And with a bit more elbow grease we could expand that logic so that
copyObject() from such a "block allocated" node tree would already know
how much memory to allocate, memcpy() the source over to the target, and
just adjust the pointer offsets.

We could also use this to have a version of IsA() that supported our
form of inheritance. In a quick hack I wrote a function that computes a
per-node-type bitmap indicating whether other node types are subtypes of
that node. While there's still a bug (it doesn't correctly handle node
types that are just typedefed to each other), it already allows
producing:

type T_JoinState has 3 subtypes
        child: T_NestLoopState
        child: T_MergeJoinState
        child: T_HashJoinState
type T_Expr has 44 subtypes
        child: T_Var
        child: T_Const
        child: T_Param
        child: T_Aggref
        child: T_GroupingFunc
        child: T_WindowFunc
        child: T_SubscriptingRef
        child: T_FuncExpr
        child: T_NamedArgExpr

It seems quite useful to be able to assert such relationships.
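A sketch of what such an assertion could look like, assuming the
per-node-type subtype bitmaps above are exposed as an array indexed by
the parent tag (node_subtypes[] and IsAOrSubtype() are invented names,
not part of the patch as posted):

    #include "nodes/bitmapset.h"
    #include "nodes/nodes.h"

    /* one Bitmapset of subtype tags per node tag, built from the
     * generated metadata */
    extern const Bitmapset *node_subtypes[];

    static inline bool
    IsAOrSubtype(const void *nodeptr, NodeTag tag)
    {
        NodeTag     actual = nodeTag(nodeptr);

        return actual == tag ||
            bms_is_member((int) actual, node_subtypes[tag]);
    }

    /* usage, e.g.: Assert(IsAOrSubtype(planstate, T_JoinState)); */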
> I *suspect* that it'll be quite OK performance-wise, compared to bespoke
> code, but I have *NOT* measured that.

I now have. When testing with COPY_PARSE_PLAN_TREES and
WRITE_READ_PARSE_PLAN_TREES, the "generic" code is a bit slower, but not
much (< 5%). To some degree that's possibly inherent, but I think the
bigger issue right now is that the new code does more appendStringInfo()
calls, because it outputs the field name separately. Also, the slightly
expanded node format adds a few more tokens, which need to be parsed
(pg_strtok (or rather its replacement) is the slowest bit). I think with
a bit of optimization we can get that back, and the facilities allow us
to make much bigger wins. So I think this is quite good.

> With regards to using libclang for the parsing: I chose that because it
> seemed the easiest to experiment with, compared to annotating all the
> structs with enough metadata to be able to easily parse them from a perl
> script. The node definitions are after all distributed over quite a few
> headers.
>
> I think it might even be the correct way forward, over inventing our own
> mini-languages and writing ad-hoc parsers for those. It sure is easier
> to understand plain C code, compared to having to understand various
> embedded mini-languages consisting out of macros.
>
> The obvious drawback is that it'd require more people to install
> libclang - a significant imposition. I think it might be OK if we only
> required it to be installed when changing node types (although it's not
> necessarily trivial how to detect that node types haven't changed, if we
> wanted to allow other code changes in the relevant headers), and stored
> the generated code in the repository.
>
> Alternatively we could annotate the code enough to be able to write our
> own parser, or use some other C parser.

Still using libclang. The .c file containing the metadata is now part of
the source tree (i.e. would be checked in), and would only need to be
regenerated when one of the relevant headers changes (but obviously such
a change could be unimportant).

> The one set of fields this currently can not deal with is the various
> arrays that we directly reference from nodes. E.g. for
>
> typedef struct Sort
> {
>     Plan        plan;
>     int         numCols;        /* number of sort-key columns */
>     AttrNumber *sortColIdx;     /* their indexes in the target list */
>     Oid        *sortOperators;  /* OIDs of operators to sort them by */
>     Oid        *collations;     /* OIDs of collations */
>     bool       *nullsFirst;     /* NULLS FIRST/LAST directions */
> } Sort;
>
> the generic code has no way of knowing that sortColIdx, sortOperators,
> collations, nullsFirst are all numCols-long arrays.
>
> I can see various ways of dealing with that:
>
> 1) We could convert them all to lists, now that we have fast arbitrary
>    access. But that'd add a lot of indirection / separate allocations.
>
> 2) We could add annotations to the source code, to declare that
>    association. That's probably not trivial, but wouldn't be that hard -
>    one disadvantage is that we probably couldn't use that declaration
>    for automated asserts etc.
>
> 3) We could introduce a few macros to create an array type that includes
>    the length of the members. That'd duplicate the length for each of
>    those arrays, but I have a bit of a hard time believing that that's a
>    meaningful enough overhead.
>
>    I'm thinking of a macro that'd be used like
>        ARRAY_FIELD(AttrNumber) *sortColIdx;
>    that'd generate code like
>        struct
>        {
>            size_t nmembers;
>            AttrNumber members[FLEXIBLE_ARRAY_MEMBER];
>        } *sortColIdx;
>
>    plus a set of macros (or perhaps inline functions + macros) to access
>    them.

I've implemented 3), which seems to work well. But it's a fair bit of
macro magic.

Basically, one can define a type to be array-supported by once using
PGARR_DEFINE_TYPE(element_type); which defines a struct type that has a
members array of type element_type. After that, variables of the array
type can be defined using PGARR(element_type) (as members in a struct,
variables, ...).

Macros like pgarr_size(arr), pgarr_empty(arr), pgarr_at(arr, at) can be
used to query (and in the last case also modify) the array.
pgarr_append(element_type, arr, newel) can be used to append to the
array. Unfortunately I haven't figured out a satisfying way to write
pgarr_append() without specifying the element_type. Either there's
multiple evaluation of any of the arguments (for checking whether the
capacity needs to be increased), only `newel`s that can have their
address taken are supported (so it can be passed to a helper function),
or compiler-specific magic has to be used (__typeof__ solves this
nicely).

The array can be allocated (using pgarr_alloc_ro(type, capacity)) so
that a certain number of elements fit inline.
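To illustrate, here's roughly how the Sort fields from above would look
with these macros, going purely by the description here (the actual WIP
patch may differ in detail):

    PGARR_DEFINE_TYPE(AttrNumber);  /* once, defines the array struct type */

    typedef struct Sort
    {
        Plan        plan;
        PGARR(AttrNumber) *sortColIdx; /* carries its own length, so no
                                        * separate numCols needed for it */
        /* ... likewise for sortOperators, collations, nullsFirst ... */
    } Sort;

and building / inspecting such an array:

    /* space for 4 elements allocated inline */
    PGARR(AttrNumber) *cols = pgarr_alloc_ro(AttrNumber, 4);

    pgarr_append(AttrNumber, cols, (AttrNumber) 3);

    if (!pgarr_empty(cols))
    {
        for (size_t i = 0; i < pgarr_size(cols); i++)
            elog(DEBUG1, "sort column %d", (int) pgarr_at(cols, i));
    }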
Boring stuff aside, there are a few more parts of the patch series:

1) Move int->string routines to src/common, and expand.

   The current pg_*toa routines are quite incomplete (no unsigned
   support), are terribly named, and require unnecessary strlen() calls
   to recompute the length of the number, even though the routines
   already know it. Fix all of that, and move to src/common/string.c for
   good measure.

2) WIP: Move stringinfo to common/

   The metadata collection program needed a string buffer. PQExpBuffer
   is weird, and has less functionality.

3) WIP: Improve and expand stringinfo.[ch].

   Expands the set of functions exposed, to allow appending various
   numeric types etc. Also improves efficiency by moving code to inline
   functions - that's beneficial because the size is often known, which
   can make the copy faster.

The code is also available at
https://github.com/anarazel/postgres/tree/header
https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary
(in the header branch).

While working on this I evolved the node string format a bit:

1) Node types start with their "normal" name, rather than uppercase.
   There seems little point in having such a divergence.

2) The node type is followed by the node-type id. That allows locating
   the corresponding node metadata more quickly (an array lookup and one
   name recheck, rather than a binary search).

   I.e. the node starts with "{Scan 18 " rather than "{SCAN " as before.

3) Nodes that contain other nodes as sub-types "inline" still emit {}
   for the subtype. There's no functional need for this, but I found the
   output otherwise much harder to read. E.g. for a mergejoin we'd have
   something like:

   {MergeJoin 37 :join {Join 35 :plan {Plan ...} :jointype JOIN_INNER ...} :skip_mark_restore true ...}

4) As seen in the above example, enums are decoded to their string
   values. I found that makes the output easier to read. Again, not
   functionally required.

5) Value nodes aren't emitted without a {Value ...} anymore. I changed
   this when I expanded the WRITE/READ tests, and encountered failures
   because the old encoding is not entirely roundtrip-safe
   (e.g. -INT32_MIN will be parsed as a float at raw parse time, but
   after write/read, it'll be parsed as an integer). While that could be
   fixed in other ways (e.g. by emitting a trailing . for all floats), I
   also found it clearer this way - Value nodes are otherwise
   indistinguishable from raw strings, raw numbers etc., which is not
   great.

   It'd also be easier to now just change the node format to something
   else.

Comments?

Todo / Questions:
- lots of cleanup
- better error checking when looking up node definitions
- better dealing with node types that are just typedef'ed to another type
- don't use the full tokenizer in all places, too expensive

Greetings,

Andres Freund
Attachments:
- v2-0001-FIX-ExprState.tag-to-actually-just-be-a-tag.patch.gz
- v2-0002-Introduce-separate-type-for-location-information.patch.gz
- v2-0003-WIP-Introduce-strongly-typed-array-type.patch.gz
- v2-0004-Move-int-string-routines-to-src-common-and-expand.patch.gz
- v2-0005-WIP-Move-stringinfo-to-common.patch.gz
- v2-0006-WIP-Improve-and-expand-stringinfo.-ch.patch.gz
- v2-0007-Change-out-readfuncs-to-handle-floats-precisely.patch.gz
- v2-0008-WIP-Introduce-compile-time-node-type-metadata-col.patch.gz
- v2-0009-WIP-Add-generated-node-metadata.patch.gz
- v2-0010-Enable-previously-disabled-asserts-ensuring-node-.patch.gz
- v2-0011-WIP-Remove-manually-maintained-out-read-copy-equa.patch.gz
- v2-0012-WIP-Add-support-for-copying-node-tree-into-single.patch.gz