Thread: C function accepting/returning cstring vs. text
I'm having a hard time trying to understand how everything should be done in C extensions. Right now I'm trying to understand how and when I should accept/return cstring vs. text, and whether and how I should take care of encoding.

I'm trying to output the content of a tsvector as a set of records (pos, weight, text|cstring). I've found tsvectorout, which should be "educational" enough, but I still miss details about when/why to use cstring vs. text, and about encoding.

Is there any part of the code from which I could learn about:
- memory allocation for both cstring and text
- an example of returning text (especially if the size of the output is larger than the input)
- what I should take care of for encoding (what's really inside a tsvector, text, cstring)?

I don't know if pgsql-hackers is the right place to ask, or if this kind of question is considered "homework", but without any reference or knowledge of the code base it's pretty hard even to find prototype/sample code.

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:

The README files might be a good place to start, then browse code.

> Is there any part of the code from where I could learn about:
> - memory allocation for both cstring and text

src/backend/utils/mmgr/README

> - example of returning text (especially if size of out>in)
> - what should I take care of for encoding (what's really inside a
>   tsvector, text, cstring)?

src/backend/utils/fmgr/README

-Kevin
On Mon, 25 Jan 2010 16:36:46 -0600
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

> Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:
>
> The README files might be a good place to start, then browse code.

Is there a book?

The more I read the source and the little information about it, the more I have questions that should have been answered by documentation of the function or data structure, instead of my having to look for some code that uses it and infer what it expects, what the best context to use it in would be, whether there are better candidates to do the same thing, etc.

> > - example of returning text (especially if size of out>in)
> > - what should I take care of for encoding (what's really inside a
> >   tsvector, text, cstring)?
>
> src/backend/utils/fmgr/README

I just learned there is a "return all rows" mode for set-returning functions:

"There are currently two modes in which a function can return a set result: value-per-call, or materialize. In value-per-call mode, the function returns one value each time it is called, and finally reports "done" when it has no more values to return. In materialize mode, the function's output set is instantiated in a Tuplestore object; all the values are returned in one call. Additional modes might be added in future."

There is no example of a function working in this mode. I'd guess it is suited to quick operations and small results. But... what should be considered quick and small? When should someone use a "row at a time" function versus a "return all rows" function?

I haven't been able to understand the difference between functions returning cstring and text, and whether there is any need to be careful about encoding and escaping when copying from the lexeme to a buffer that will be returned as cstring or text.

Is there any difference between a function returning text and a function returning cstring?

Can I make direct changes to input parameters and return pointers to internal parts of these structures? Or should I always allocate my own char*, make the modification there, and then return my "copy"?

Is there any operation that should take care of encoding when dealing with cstring or text?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
On 27/01/2010 9:14 PM, Ivan Sergio Borgonovo wrote:

> On Mon, 25 Jan 2010 16:36:46 -0600
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
>
>> Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:
>>
>> The README files might be a good place to start, then browse code.
>
> Is there a book?
>
> The more I read the source and the few info about it, the more I
> have questions that should have been answered by documenting the
> function or data structure in spite of looking for some code that
> use it and see if I can infer what is expecting, what should be the
> best context to use it in, if there are better candidates to do the
> same thing etc...

I don't code on PostgreSQL's guts, so I'm perhaps not in the best position to speak, but:

- Documentation has a cost too, particularly a maintenance cost. Outdated docs become misleading or downright false and can do much more harm than good, so a reasonable balance must be struck. I'm not saying PostgreSQL is _at_ that reasonable balance re its internal documentation, but there is such a thing as over-documenting. Writing a small book on each function means you have to maintain it, and that gets painful if the code is undergoing any sort of major change.

- It's easy to say "should" when you're not the one writing it. Personally, I try to say "hey, it's cool that I have access to this system" and "isn't it great that I even have the right to modify it to do what I want, even though the learning curve _can_ be pretty steep".

Hey, you could contribute yourself: patch some documentation into those functions where you find that reading the source isn't clear enough and they really need a "see also" or "called from" comment or the like.

As it is, I'm extremely grateful for the excellent user-level/admin-oriented manual, and glad to see the SPI docs too.

--
Craig Ringer
On Wed, Jan 27, 2010 at 02:14:36PM +0100, Ivan Sergio Borgonovo wrote:

> I haven't been able to understand the difference between function
> returning cstring and text and if there is any need to be careful
> about encoding and escaping when copying from the lexeme to a buffer
> that will return a cstring or text.

Well, the difference is that one is a cstring and the other is text. Seriously though, text is more useful if you want people to be able to use the result in other functions, since at the SQL level almost everything is text. cstring is needed for some APIs but is generally not used unless necessary.

Have a nice day,

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
> Is there any difference from function returning text and function
> returning cstring?

text is a varlena type: it can be TOASTed and compressed, and it is better integrated into the PostgreSQL world. cstring is just a C zero-terminated string, so it is good for system calls or for calling external libraries.

Regards
Pavel Stehule
On Wed, 27 Jan 2010 14:44:02 +0100
Martijn van Oosterhout <kleptog@svana.org> wrote:

> On Wed, Jan 27, 2010 at 02:14:36PM +0100, Ivan Sergio Borgonovo
> wrote:
> > I haven't been able to understand the difference between function
> > returning cstring and text and if there is any need to be careful
> > about encoding and escaping when copying from the lexeme to a
> > buffer that will return a cstring or text.
>
> Well, the difference is that one is a cstring and the other is
> text. Seriously though, text is more useful if you want people to
> be able to use the result in other functions since on SQL level
> almost everything is text. cstring is needed for some APIs but it
> generally not used unless necessary.

I didn't get it. Maybe I really chose the wrong function as an example (tsvectorout).

What's not included in "on SQL level almost everything is text"? There are a lot of functions in contrib taking cstring input and returning cstring output. Are they just in the same "special" class as the [type]in, [type]out, [type]recv... functions?

I have to re-read http://www.postgresql.org/docs/8.4/static/xfunc-c.html carefully, since I discovered there may be explanations about text buffers etc. I discovered there is a cstring_to_text function... and a text_to_cstring_buffer too... let me see if I can find something else...

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
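[Editor's note: the helpers Ivan just found (cstring_to_text, text_to_cstring, declared in utils/builtins.h) are how a text-returning extension function typically round-trips through a C string. The sketch below is illustrative only: the function name `shout` and its behavior are made up, it only compiles against the PostgreSQL source tree, and the byte-wise upper-casing is deliberately ASCII-only, as the comment notes.]

```c
#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"   /* cstring_to_text, text_to_cstring */

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(shout);

/* Toy function: return its text argument upper-cased (ASCII only).
 * NOT encoding-aware: a real version must not touch bytes of a
 * multibyte character individually. */
Datum
shout(PG_FUNCTION_ARGS)
{
    text *in  = PG_GETARG_TEXT_PP(0);
    char *buf = text_to_cstring(in);   /* palloc'd NUL-terminated copy */
    char *p;

    for (p = buf; *p; p++)
        if (*p >= 'a' && *p <= 'z')
            *p = *p - ('a' - 'A');

    /* cstring_to_text pallocs a fresh varlena; modify your own copy
     * and return that, never a pointer into the input datum */
    PG_RETURN_TEXT_P(cstring_to_text(buf));
}
```

This also answers the "can I modify the input?" question upthread in the common-case way: take a palloc'd copy, modify the copy, return the copy.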
On Wed, 27 Jan 2010 21:41:02 +0800
Craig Ringer <craig@postnewspapers.com.au> wrote:

> I don't code on PostgreSQL's guts, so I'm perhaps not in the best
> position to speak, but:
>
> - Documentation has a cost too, particularly a maintenance cost.
>   Outdated docs become misleading or downright false and can be much
>   more harm than good. So a reasonable balance must be struck. I'm
>   not saying PostgreSQL is _at_ that reasonable balance re its
>   internal documentation, but there is such a thing as
>   over-documenting. Writing a small book on each function means you
>   have to maintain that, and that gets painful if code is undergoing
>   any sort of major change.

I'd be willing to pay for a book.

> - It's easy to say "should" when you're not the one writing it.
>   Personally, I try to say "hey, it's cool that I have access to
>   this system" and "isn't it great I even have the right to modify
>   it to do what I want, even though the learning curve _can_ be
>   pretty steep".

Well... I generally tend to make available to others everything I learn. It would be nice to have a more advanced use of doxygen, so it would be easier to have a map of the functions.

> Hey, you could contribute yourself - patch some documentation into
> those functions where you find that reading the source isn't clear
> enough, and they really need a "see also" or "called from" comment
> or the like.

Right now I don't have enough knowledge to hope my notes would get into the source code. Once I have a working piece of code, I'll put the information I gathered in the process on my web site, and if someone finds it worthy of a better place I'll release it under a suitable license.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Ivan Sergio Borgonovo wrote:

> What's not included in "on SQL level almost everything is text"?

Input and output functions always use cstring rather than text. The I/O functions are normally not used directly at the SQL level.

> There are a lot of functions in contrib taking cstring input and
> returning cstring output.
> Are they just in the same "special" class of [type]in, [type]out
> [type]recv... functions?

Probably, but I didn't check.

--
Alvaro Herrera                     http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Ivan Sergio Borgonovo wrote:

> I just learned there is a "return all row" mode for returning set
> functions:
> "There are currently two modes in which a function can return a set
> result: value-per-call, or materialize. In value-per-call mode, the
> function returns one value each time it is called, and finally
> reports "done" when it has no more values to return. In materialize
> mode, the function's output set is instantiated in a Tuplestore
> object; all the values are returned in one call. Additional modes
> might be added in future."
>
> There is no example of a function working in this mode.
> I'd guess it should be suited for quick operation and small return.
> But... what should be considered quick and small?
> When someone should use a "row at a time" function and a "return all
> row" function?

There are quite a few SRF functions in the code. Look for example in contrib/hstore/hstore_op.c for some fairly simple examples. SRFs are quite capable of returning huge result sets, not just small ones. Example code for materialize mode can be found in the PLs among other places (e.g. plperl_return_next()).

cheers

andrew
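[Editor's note: for readers who, like Ivan, have not yet seen one, the value-per-call convention Andrew alludes to is driven by the macros in funcapi.h. The skeleton below is a hedged sketch, not code from the thread: the function name `count_to_n` is invented for illustration, and it only compiles against the PostgreSQL source tree. The per-call state that the thread calls "saving the function context" lives in the FuncCallContext.]

```c
#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"

PG_FUNCTION_INFO_V1(count_to_n);

/* Value-per-call skeleton: returns the integers 1..n, one per call.
 * Cross-call state lives in funcctx; anything it points at must be
 * allocated in funcctx->multi_call_memory_ctx to survive between calls. */
Datum
count_to_n(PG_FUNCTION_ARGS)
{
    FuncCallContext *funcctx;

    if (SRF_IS_FIRSTCALL())
    {
        /* one-time setup: runs on the first call only */
        funcctx = SRF_FIRSTCALL_INIT();
        funcctx->max_calls = PG_GETARG_INT32(0);
    }

    /* per-call setup: restore the saved context */
    funcctx = SRF_PERCALL_SETUP();

    if (funcctx->call_cntr < funcctx->max_calls)
        SRF_RETURN_NEXT(funcctx,
                        Int32GetDatum((int32) funcctx->call_cntr + 1));

    SRF_RETURN_DONE(funcctx);
}
```

As later messages in the thread explain, the executor currently materializes this result into a tuplestore anyway in the FunctionScan case, so the main payoff of this style today is its simplicity for "one value per iteration" logic, not streaming.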
On Wed, 27 Jan 2010 11:49:46 -0300
Alvaro Herrera <alvherre@commandprompt.com> wrote:

> > There are a lot of functions in contrib taking cstring input and
> > returning cstring output.
> > Are they just in the same "special" class of [type]in, [type]out
> > [type]recv... functions?
>
> Probably, but I didn't check.

Does this roughly translate to: "nothing you should care about right now; anyway, they are just functions that won't return results to SQL"?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:

> There are a lot of functions in contrib taking cstring input and
> returning cstring output.
> Are they just in the same "special" class of [type]in, [type]out
> [type]recv... functions?

If you're looking in contrib subdirectories, perhaps the missing link here is the *.sql.in files. These are what are run in a database to expose functions to the SQL language. Start at those functions and see what the call tree is from there.

-Kevin
On Wed, 27 Jan 2010 10:10:01 -0500
Andrew Dunstan <andrew@dunslane.net> wrote:

> There are quite a few SRF functions in the code. Look for example
> in contrib/hstore/hstore_op.c for some fairly simple examples.
> SRFs are quite capable of returning huge resultsets, not just
> small ones. Example code for materialize mode can be found in the
> PLs among other places (e.g. plperl_return_next())

I'm more interested in understanding when I should use materialize mode. E.g. should I be more concerned about memory or CPU cycles, and what should be taken as the reference point for considering memory needs "large"?

If, for example, I were going to split a large TEXT into a set of records (let's say I'm processing csv that has been loaded into a text field)... I'd consider the CPU use "light" but the memory needs "large". Would this task be suited to materialize mode?

Is there a rule of thumb for choosing between one mode and the other?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Ivan Sergio Borgonovo wrote:

> On Wed, 27 Jan 2010 10:10:01 -0500
> Andrew Dunstan <andrew@dunslane.net> wrote:
>
> > There are quite a few SRF functions in the code. Look for example
> > in contrib/hstore/hstore_op.c for some fairly simple examples.
> > SRFs are quite capable of returning huge resultsets, not just
> > small ones. Example code for materialize mode can be found in the
> > PLs among other places (e.g. plperl_return_next())
>
> I'm more interested in understanding when I should use materialized
> mode.
> eg. I should be more concerned about memory or cpu cycles and what
> should be taken as a reference to consider memory needs "large"?
> If for example I was going to split a large TEXT into a set of
> record (let's say I'm processing csv that has been loaded into a
> text field)... I'd consider the CPU use "light" but the memory needs
> "large". Would be this task suited for the materialized mode?
>
> Is there a rule of thumb to chose between one mode or the other?

If you don't know that your memory use will be light, use materialize mode. For small results the data will still be in memory anyway; the Tuplestore only spills to disk if it grows beyond a certain size.

cheers

andrew
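[Editor's note: a materialize-mode function builds its whole result in one call, as in the hedged sketch below. The function name `all_at_once` and its trivial three-row result are invented for illustration; the sketch assumes the 8.4-era API (get_call_result_type, tuplestore_putvalues) and only compiles against the PostgreSQL source tree. Real code should also check rsinfo->allowedModes for SFRM_Materialize before committing to this mode.]

```c
#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"          /* work_mem */
#include "utils/tuplestore.h"

PG_FUNCTION_INFO_V1(all_at_once);

/* Materialize-mode skeleton: put every row into a Tuplestore and hand
 * the whole store back in a single call.  The store keeps rows in
 * memory up to work_mem and then spills to disk, which is why even
 * "large" results are safe here. */
Datum
all_at_once(PG_FUNCTION_ARGS)
{
    ReturnSetInfo   *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
    TupleDesc        tupdesc;
    Tuplestorestate *tupstore;
    MemoryContext    oldcxt;
    int              i;

    /* the tuplestore must outlive this call: build it in per-query memory */
    oldcxt = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);

    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
        elog(ERROR, "return type must be a row type");

    tupstore = tuplestore_begin_heap(true, false, work_mem);

    for (i = 0; i < 3; i++)     /* stand-in for the real row source */
    {
        Datum values[1];
        bool  nulls[1] = { false };

        values[0] = Int32GetDatum(i);
        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
    }

    rsinfo->returnMode = SFRM_Materialize;
    rsinfo->setResult  = tupstore;
    rsinfo->setDesc    = tupdesc;

    MemoryContextSwitchTo(oldcxt);
    return (Datum) 0;           /* result is delivered via rsinfo */
}
```

Compared with the value-per-call skeleton, there is no cross-call state to save at all: everything happens in one invocation, which is the simplicity Andrew and Heikki refer to below.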
Ivan Sergio Borgonovo wrote:

> I'm more interested in understanding when I should use materialized
> mode.
> eg. I should be more concerned about memory or cpu cycles and what
> should be taken as a reference to consider memory needs "large"?
> If for example I was going to split a large TEXT into a set of
> record (let's say I'm processing csv that has been loaded into a
> text field)... I'd consider the CPU use "light" but the memory needs
> "large". Would be this task suited for the materialized mode?

Currently, there's no difference in terms of memory needs. The backend always materializes the result of a SRF into a tuplestore anyway, if the function didn't do it itself. There has been discussion of optimizing away that materialization step, but no-one has come up with an acceptable patch for that yet.

There probably isn't much difference in CPU usage either.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Ivan Sergio Borgonovo wrote:

> On Wed, 27 Jan 2010 11:49:46 -0300
> Alvaro Herrera <alvherre@commandprompt.com> wrote:
>
> > > There are a lot of functions in contrib taking cstring input and
> > > returning cstring output.
> > > Are they just in the same "special" class of [type]in, [type]out
> > > [type]recv... functions?
> >
> > Probably, but I didn't check.
>
> Does this nearly translate to:
> "nothing you should care about right now and anyway just functions
> that won't return results to SQL"?

I meant "they are almost certainly in the same class as typein and typeout, but I didn't check every single one of them".

As far as I can tell you are not writing an output function, but rather a debugging function of sorts. I see no reason to return cstring.

--
Alvaro Herrera                     http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Ivan Sergio Borgonovo wrote:

> Is there a book?

You'll find a basic intro to this area in "PostgreSQL Developer's Handbook" by Geschwinde and Schonig. You're already past the level of questions they answer in there, though.

This whole cstring/text issue you're asking about, I figured out by reading the source code, deciphering the examples in the talk at http://www.joeconway.com/presentations/tut_oscon_2004.pdf , and then using interesting keywords there ("DatumGetCString" is a good one to find interesting places in the code) to find code examples and messages on that topic in the list archives. Finally, finding some commits/patches that do things similar to what you want, and dissecting how they work, is quite informative too.

I agree the learning curve could be shortened a lot, and I have actually submitted a conference talk proposal in this area to try and improve that. You're a few months ahead of me in having something written down to share, though.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
On Wed, 27 Jan 2010 17:37:23 +0200
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

> Currently, there's no difference in terms of memory needs. The
> backend always materializes the result of a SRF into a tuplestore
> anyway, if the function didn't do it itself. There has been
> discussion of optimizing away that materialization step, but
> no-one has come up with an acceptable patch for that yet.
> There probably isn't much difference in CPU usage either.

On Wed, 27 Jan 2010 10:34:10 -0500
Andrew Dunstan <andrew@dunslane.net> wrote:

> If you don't know your memory use will be light, use materialized
> mode. For small results the data will still be in memory anyway.
> The Tuplestore will only spill to disk if it grows beyond a
> certain size.

I keep on missing something. Considering all the context-switching stuff, the "first call" initialization, etc., I'd expect that whatever collects the result would push it downstream a bit at a time. What actually happens is... the result gets collected in a Tuplestore anyway.

But then... why do we have all that logic to save the function context if it is more convenient to process everything in one run anyway? It's a pain to save the context just to save a pointer inside a structure; it would be more convenient to just process the whole structure and return it as a Tuplestore in one pass.

BTW thanks to Greg for pointing me once more to Joe Conway's tutorial. When I read it the first time I wasn't in a condition to take advantage of it. Now it looks more useful.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
On 01/27/2010 09:49 AM, Ivan Sergio Borgonovo wrote:

> On Wed, 27 Jan 2010 17:37:23 +0200
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
>
>> Currently, there's no difference in terms of memory needs. The
>> backend always materializes the result of a SRF into a tuplestore
>> anyway, if the function didn't do it itself. There has been
>> discussion of optimizing away that materialization step, but
>> no-one has come up with an acceptable patch for that yet.
>
> I keep on missing something.
> But then... why do we have all that logic to save the function
> context if anyway it is more convenient to process everything in one
> run?

As already pointed out, there is currently no difference between value_per_call and materialize modes. The original intent was to have both modes eventually, with the initial one being materialize (historical note: my first SRF patch was value_per_call, but it had difficult issues to be solved, and the consensus was that materialize was a simpler and safer approach for the first try at this feature).

The advantage of value_per_call (if it were not materialized anyway in the backend) would be to allow pipelining, which enables very large datasets to be streamed without exhausting memory or having to wait for everything to be materialized.

Implementing true value_per_call is still on my TODO list, but obviously it has not risen to a very high priority for me, as it has now been an embarrassingly long time since it was put there. That said, materialize mode has proven extremely good at covering the most common use cases with acceptable performance.

Joe
Ivan Sergio Borgonovo wrote:

> But then... why do we have all that logic to save the function
> context if anyway it is more convenient to process everything in one
> run?
> It's a pain to save the context just to save a pointer inside a
> structure, it would be more convenient to just process all the
> structure and return it as a Tuplestore in one pass.

When the set-returning-function feature was written originally, years ago, the tuple-at-a-time mode really did work a tuple at a time. But it had issues and was axed out of the patch before it was committed, to keep it simple. The idea was to revisit it at some point, but it hasn't bothered anyone enough to fix it. It's basically "not implemented yet".

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Jan 27, 2010, at 1:00 PM, Joe Conway wrote:

> Implementing true value_per_call is still something on my TODO list, but
> obviously has not risen to a very high priority for me as it has now
> been an embarrassing long time since it was put there. But that said,
> materialize mode has proven extremely good at covering the most common
> use cases with acceptable performance.

Hrm. I think this has been noted before, but one of the problems with VPC is that there can be a fairly significant amount of overhead involved with context setup and teardown, especially with PLs. If you're streaming millions of rows, it's no longer a small matter.

I would think some extension to Tuplestore would be preferable, where chunks of rows are placed into the Tuplestore on demand in order to minimize context setup/teardown overhead. That is, if the Tuplestore is empty and the user needs more rows, invoke the procedure again with the expectation that it will dump another chunk of rows into the container. Not a formal specification by any means, but I'm curious whether anyone has considered that direction.

Or along the same lines, how about a valueS-per-call mode? =)
On Wed, 27 Jan 2010 22:06:43 +0200
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

> When the set-returning-function feature was written originally,
> years ago, the tuple at a time mode did really work tuple at a
> time. But it had issues and was axed out of the patch before it
> was committed, to keep it simple. The idea was to revisit it at
> some point, but it hasn't bothered anyone enough to fix it. It's
> basically "not implemented yet".

Summing it up:
1) tuple-at-a-time should theoretically be better and "future proof" once someone writes the missing code, but the code is still not there
2) practically, there is no substantial extra cost in returning a tuple at a time

Is 2) really true? It seems that materialize mode is much simpler: it requires a lot less code and it doesn't force you to save local variables and then restore them every time.

So does it still make sense to get an idea of when the size of the returned data set and the complexity of the computation really fit value_per_call or materialize mode? What could happen between function calls in value_per_call? Would value_per_call still give postgresql/the OS a chance to better allocate CPU cycles/memory?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

> When the set-returning-function feature was written originally, years
> ago, the tuple at a time mode did really work tuple at a time. But it
> had issues and was axed out of the patch before it was committed, to
> keep it simple. The idea was to revisit it at some point, but it hasn't
> bothered anyone enough to fix it. It's basically "not implemented yet".

It depends on call context --- you're speaking of the nodeFunctionScan context, but I believe an SRF called in the select targetlist will operate in tuple-at-a-time mode if the function allows.

regards, tom lane