Thread: Functions in C with Ornate Data Structures
I'm trying to write C functions to handle some of the number crunching that I have been doing via backend processing. Specifically, I want to be able to write a function that supports a query like:

    select crunch_number(foo) from bar where [some condition];

...where `foo' is the name of some column and `bar' is some table name. The function needs to create some ornate data structures (e.g., doubly linked lists) and output some summary statistic.

If my data types were simpler, I could simply use an AGGREGATE function. Unfortunately, I don't know of any way to schlep something as complex as a doubly-linked list of arrays of arbitrary precision numbers.

I suppose ideally I'd like some way of either:

- Being able to call a function on each row (like most user-defined functions) which only returns a result on the last row; or
- Being able to pass the table name, column name, and selection conditions to the function, and walk through the matching rows inside the function, returning a single result upon completion.

In terms of logical structure, this looks similar to functions that compute means or standard deviations. The complication (as far as I can tell) is that I can't get by with a simple accumulator variable/transform function.

Is there any clean way to accomplish this in Postgres? Any pointers or suggestions would be appreciated.

-Steve
"Stephen P. Berry" <spb@meshuggeneh.net> writes: > If my data types were simpler, I could simply use an AGGREGATE function. > Unfortunately, I don't know of any way to schlep something as complex > as a doubly-linked list of arrays of arbitrary precision numbers. You could, but the amount of data copying needed would be annoying. However, there's no law that says you can't cheat. I'd suggest that you build this as an aggregate function whose nominal state value is simply a pointer to data structures that are off somewhere else. For example, assuming that you are willing to cheat to the extent of assuming sizeof(pointer) = sizeof(integer), try something like this: CREATE AGGREGATE crunch_number ( basetype = float8, -- or whatever the input column type is sfunc = crunch_func, stype = integer, ffunc = crunch_finish, initcond = 0); where crunch_func(integer) returns integer is your data accumulation function, and it looks like datstruct *ptr = (datstruct *) PG_GET_INT32(0); double newdataval = PG_GET_FLOAT8(1); if (ptr == NULL) { /* first call of query; initialize datastructures */ } /* update datastructures using newdataval */ PG_RETURN_INT32((int32) ptr); Finally, crunch_finish(integer) returns float8 (or whatever is needed) contains your code to compute the final result and release the working datastructure. Now, the important detail: you can't allocate your working datastructures with a simple palloc(), because these functions will be called in a short-lived memory context. What I'd suggest is that in your setup step, you create a private memory context that is a child of TransactionCommandContext; then allocate all your datastructures in that. Then in the crunch_finish step, you needn't bother with retail releasing of the data structures, just destroy the private context and you're done. regards, tom lane PS: this is not a novice-level question ;-). You should be asking this kind of stuff on pgsql-hackers, methinks. 
There really isn't any other list that discusses C coding inside the backend.
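For readers who want to see the shape of this pattern without a backend build tree, here is a minimal standalone sketch. It is a simulation, not PostgreSQL code: it includes no PostgreSQL headers, the Arena type is only a hypothetical stand-in for the private memory context, CrunchState and Node are invented names, the state is carried in a uintptr_t rather than an SQL integer, and the mean is used as a placeholder summary statistic.

```c
/* Standalone simulation: no PostgreSQL headers; all type names are hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for a private memory context: every allocation is chained
 * onto the arena so that a single call can release all of them. */
typedef union Chunk {
    union Chunk *next;   /* link to the previously allocated chunk */
    double align_d;      /* the members below only force generous alignment */
    long long align_ll;
} Chunk;

typedef struct Arena { Chunk *chunks; } Arena;

void *arena_alloc(Arena *a, size_t size)
{
    Chunk *c = malloc(sizeof(Chunk) + size);
    assert(c != NULL);
    c->next = a->chunks;
    a->chunks = c;
    return c + 1;        /* usable memory starts right after the header */
}

void arena_destroy(Arena *a)
{
    Chunk *c = a->chunks;
    while (c != NULL) {
        Chunk *next = c->next;
        free(c);
        c = next;
    }
    free(a);
}

/* The "ornate" working state: a doubly linked list of input values. */
typedef struct Node { struct Node *prev, *next; double value; } Node;
typedef struct CrunchState {
    Arena *arena;
    Node *head, *tail;
    size_t count;
} CrunchState;

/* Transition step: the nominal state is only a pointer-sized integer. */
uintptr_t crunch_func(uintptr_t state, double newdataval)
{
    CrunchState *ptr = (CrunchState *) state;
    if (ptr == NULL) {
        /* first call of the query: create the arena and the master struct */
        Arena *arena = malloc(sizeof(Arena));
        assert(arena != NULL);
        arena->chunks = NULL;
        ptr = arena_alloc(arena, sizeof(CrunchState));
        ptr->arena = arena;
        ptr->head = ptr->tail = NULL;
        ptr->count = 0;
    }
    /* update the data structures: append the new value to the list */
    Node *n = arena_alloc(ptr->arena, sizeof(Node));
    n->value = newdataval;
    n->next = NULL;
    n->prev = ptr->tail;
    if (ptr->tail != NULL)
        ptr->tail->next = n;
    else
        ptr->head = n;
    ptr->tail = n;
    ptr->count++;
    return (uintptr_t) ptr;
}

/* Final step: compute the result, then drop the whole arena at once. */
double crunch_finish(uintptr_t state)
{
    CrunchState *ptr = (CrunchState *) state;
    if (ptr == NULL)
        return 0.0;                 /* no rows were accumulated */
    double sum = 0.0;
    for (Node *n = ptr->head; n != NULL; n = n->next)
        sum += n->value;
    double mean = sum / (double) ptr->count;
    arena_destroy(ptr->arena);      /* also frees *ptr, which lives in the arena */
    return mean;
}
```

Starting from a state of 0 and feeding the values 1.0, 2.0 and 6.0 through crunch_func, crunch_finish returns 3.0 and releases every node in one step, which is the "no retail releasing" property described above. In a real backend the arena would be replaced by a memory context and the functions would use the version-1 calling convention.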
In message <11292.1011403856@sss.pgh.pa.us>, Tom Lane writes:
> You could, but the amount of data copying needed would be annoying. However, there's no law that says you can't cheat. I'd suggest that you build this as an aggregate function whose nominal state value is simply a pointer to data structures that are off somewhere else. For example, assuming that you are willing to cheat to the extent of assuming sizeof(pointer) = sizeof(integer), try something like this:

I'd actually thought of doing something like this, but couldn't find an actual explicit argument type for pointers[0], and I can't make the assumption you describe for portability reasons (my three main test platforms are alpha, sparc64, and x86).

I was also hoping that I could get away with not passing the problem data structures internally at all...i.e., have a crunch_input() function that initialises the linked list I need and populates it, then a crunch_result() function that spits out the result. The accumulator is just a dummy variable, and the interesting data structure(s) aren't in the argument list for any of the functions.

I tried doing such a thing with an aggregate, but it didn't work---although, interestingly, invoking the input and result functions manually in a single session did. I took this to mean that I really didn't understand aggregates, so I was assuming it was a novice-level question. I was sorta hoping this would turn out to be a standard question (although I couldn't find any useful references in the mailing list archives or via web searches).

> Now, the important detail: you can't allocate your working datastructures with a simple palloc(), because these functions will be called in a short-lived memory context. What I'd suggest is that in your setup step, you create a private memory context that is a child of TransactionCommandContext; then allocate all your datastructures in that. Then in the crunch_finish step, you needn't bother with retail releasing of the data structures, just destroy the private context and you're done.

Is there any way to keep `intermediate' data used by user-defined functions around indefinitely? I.e., have some sort of crunch_init() function that creates a bunch of in-memory data structures, which can then be used by subsequent (and independent) queries? I'm assuming not...and if I want to do that sort of thing I should populate a temporary table with the data from these `intermediate' results. Or keep all this fancy stuff in standalone applications rather than in user-defined functions.

It seems like the general class of thing I'm trying to accomplish isn't that esoteric. Imagine trying to write a function to compute the standard deviation of arbitrary precision numbers using the GMP library or some such. Note that I'm not saying that that's what I'm trying to do...I'm just offering it as a simple sample problem in which one can't pass everything as an argument in an aggregate. How does one set about doing such a thing in Postgres?

-Steve

-----
[0] I was hoping that there would be, since the macro widgetry in the version 1 function semantics clearly includes the concept of pointers as a distinct type.
"Stephen P. Berry" <spb@meshuggeneh.net> writes: >> For example, assuming that you are willing to cheat to the extent of >> assuming sizeof(pointer) = sizeof(integer), try something like this: > I'd actually thought of doing something like this, but couldn't find > an actual explicit argument type for pointers[0], and I can't make > the assumption you describe for portability reasons (my three main > test platforms are alpha, sparc64, and x86). Fair enough. I had actually thought better of that shortly after writing, so here's how I'd really do it: Still make the declaration of the state datatype be "integer" at the SQL level, and say initcond = 0. (If you don't do this, you have to fight nodeAgg.c's ideas about what to do with a pass-by-reference datatype, and it ain't worth the trouble.) But in the C code, write acquisition and return of the state value as datstruct *ptr = (datstruct *) PG_GETARG_POINTER(0); ... PG_RETURN_POINTER(ptr); This relies on the fact that what you are *really* passing and returning is not an int but a Datum, and Datum is by definition large enough for pointers. The only part of the above that's even slightly dubious is the assumption that a Datum created from an int32 zero will read as a pointer NULL --- but I am not aware of any platform where a zero bit pattern doesn't read as a pointer NULL (and lots of pieces of Postgres would break on such a platform). You could get around that too by making the initial state condition be a SQL NULL instead of a zero, but I don't see the point. Unless you need to treat NULL input values as something other than "ignores", you really want to declare the sfunc as strict, and that gets in the way of using a NULL initcond. > Is there any way to keep `intermediate' data used by user-defined > functions around indefinitely? I.e., have some sort of crunch_init() > function that creates a bunch of in-memory data structures, which > can then be used by subsequent (and independent) queries? 
You can if you can figure out how to find them again. However, the only obvious answer to that is to use static variables, which falls down miserably if someone tries to run two independent instances of your aggregate in one query. I'd suggest hewing closely to the external behavior of standard aggregates --- ie, each one is an independent calculation. You can use the above techniques to build an efficient implementation. If you instead build something that has an API involving state that persists across queries, I'm pretty sure you'll regret it in the long run. > It seems like the general class of thing I'm trying to accomplish > isn't that esoteric. Imagine trying to write a function to compute > the standard deviation of arbitrary precision numbers using the GMP > library or some such. Note that I'm not saying that that's what I'm > trying to do...I'm just offering it as a simple sample problem in > which one can't pass everything as an argument in an aggregate. How > does one set about doing such a thing in Postgres? I blink not an eye to say that I'd do it exactly as described above. Stick all the intermediate state into a data structure that's referenced by a single master pointer, and pass the pointer as the "state value" of the aggregate. BTW, mlw posted some contrib code on pghackers just a day or two back that does something similar to this. He did some details differently than I would've, notably this INT32-vs-POINTER business; but it's a working example. regards, tom lane
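The warning about static variables can be illustrated with a small standalone sketch. Again this is a simulation and not backend code: Datum_sim is a hypothetical pointer-sized stand-in for a state value, datum_get_pointer and pointer_get_datum are invented helpers that only mirror the "fetch pointer argument" and "return pointer" steps, and a mean stands in for whatever statistic is really wanted.

```c
/* Standalone simulation: no PostgreSQL headers; all names are hypothetical. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uintptr_t Datum_sim;   /* stand-in for a pointer-sized state value */

typedef struct SumState { double sum; long count; } SumState;

/* Helpers mirroring "get pointer argument" and "return pointer". */
SumState *datum_get_pointer(Datum_sim d) { return (SumState *) d; }
Datum_sim pointer_get_datum(SumState *p) { return (Datum_sim) p; }

/* Per-instance version: all working state hangs off one master pointer. */
Datum_sim mean_step(Datum_sim state, double x)
{
    SumState *ptr = datum_get_pointer(state);
    if (ptr == NULL) {
        /* a zero initial state reads as NULL: first row for this instance */
        ptr = malloc(sizeof(SumState));
        assert(ptr != NULL);
        ptr->sum = 0.0;
        ptr->count = 0;
    }
    ptr->sum += x;
    ptr->count++;
    return pointer_get_datum(ptr);
}

double mean_finish(Datum_sim state)
{
    SumState *ptr = datum_get_pointer(state);
    if (ptr == NULL)
        return 0.0;                     /* no rows were accumulated */
    double mean = ptr->sum / (double) ptr->count;
    free(ptr);                          /* release this instance's state */
    return mean;
}

/* Static-variable version, for contrast: one state shared by every caller. */
static SumState shared = { 0.0, 0 };

void shared_step(double x)
{
    shared.sum += x;
    shared.count++;
}

double shared_finish(void)
{
    if (shared.count == 0)
        return 0.0;
    double mean = shared.sum / (double) shared.count;
    shared.sum = 0.0;                   /* reset for the next use */
    shared.count = 0;
    return mean;
}
```

If two aggregates in one query see their rows interleaved, say 1 and 3 for the first and 10 and 30 for the second, the per-instance version yields 2 and 20 because each call chain carries its own pointer. The shared version folds all four values into one accumulator and reports 11 for whichever instance finishes first.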