Thread: C function accepting/returning cstring vs. text
I'm having a hard time trying to understand how everything should be done in C extensions. Right now I'm trying to understand how and when I should accept/return cstring vs. text, and whether and how I should take care of encoding.

I'm trying to output the content of a tsvector as a set of records (pos, weight, text|cstring). I've found tsvectorout, which should be "educational" enough, but I still miss details about when/why to use cstring vs. text, and about encoding.

Is there any part of the code from which I could learn about:
- memory allocation for both cstring and text
- an example of returning text (especially if the size of the output is larger than the input)
- what I should take care of for encoding (what's really inside a tsvector, text, cstring)?

I don't know if pgsql-hackers is the right place to ask, or if this kind of question is considered "homework", but without any reference or knowledge of the code base it's pretty hard even to find prototype/sample code.

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:

The README files might be a good place to start, then browse code.

> Is there any part of the code from where I could learn about:
> - memory allocation for both cstring and text

src/backend/utils/mmgr/README

> - example of returning text (especially if size of out>in)
> - what should I take care of for encoding (what's really inside a
>   tsvector, text, cstring)?

src/backend/utils/fmgr/README

-Kevin
On Mon, 25 Jan 2010 16:36:46 -0600
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

> Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:
>
> The README files might be a good place to start, then browse code.

Is there a book?

The more I read the source and the little information about it, the more I have questions that should have been answered by documentation of the function or data structure, instead of my having to look for some code that uses it and infer what it expects, what the best context to use it in would be, whether there are better candidates to do the same thing, etc.

> > - example of returning text (especially if size of out>in)
> > - what should I take care of for encoding (what's really inside a
> >   tsvector, text, cstring)?
>
> src/backend/utils/fmgr/README

I just learned there is a "return all rows" mode for set-returning functions:

"There are currently two modes in which a function can return a set result: value-per-call, or materialize. In value-per-call mode, the function returns one value each time it is called, and finally reports "done" when it has no more values to return. In materialize mode, the function's output set is instantiated in a Tuplestore object; all the values are returned in one call. Additional modes might be added in future."

There is no example of a function working in this mode. I'd guess it is suited to quick operations and small results. But... what should be considered quick and small? When should someone use a "row at a time" function versus a "return all rows" function?

I haven't been able to understand the difference between functions returning cstring and text, and whether there is any need to be careful about encoding and escaping when copying from the lexeme to a buffer that will be returned as cstring or text.

Is there any difference between a function returning text and a function returning cstring?

Can I make direct changes to input parameters and return pointers to internal parts of these structures? Or should I always allocate my own char*, make the modification there, and then return my "copy"?

Is there any operation that should take care of encoding when dealing with cstring or text?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
On 27/01/2010 9:14 PM, Ivan Sergio Borgonovo wrote:

> On Mon, 25 Jan 2010 16:36:46 -0600
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
>
>> Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:
>>
>> The README files might be a good place to start, then browse code.
>
> Is there a book?
>
> The more I read the source and the few info about it, the more I
> have questions that should have been answered by documenting the
> function or data structure in spite of looking for some code that
> use it and see if I can infer what is expecting, what should be the
> best context to use it in, if there are better candidates to do the
> same thing etc...

I don't code on PostgreSQL's guts, so I'm perhaps not in the best position to speak, but:

- Documentation has a cost too, particularly a maintenance cost. Outdated docs become misleading or downright false and can do much more harm than good, so a reasonable balance must be struck. I'm not saying PostgreSQL is _at_ that reasonable balance re its internal documentation, but there is such a thing as over-documenting. Writing a small book on each function means you have to maintain it, and that gets painful if the code is undergoing any sort of major change.

- It's easy to say "should" when you're not the one writing it. Personally, I try to say "hey, it's cool that I have access to this system" and "isn't it great that I even have the right to modify it to do what I want, even though the learning curve _can_ be pretty steep".

Hey, you could contribute yourself: patch some documentation into those functions where you find that reading the source isn't clear enough and they really need a "see also" or "called from" comment or the like.

As it is, I'm extremely grateful for the excellent user-level/admin-oriented manual, and glad to see the SPI docs too.

--
Craig Ringer
On Wed, Jan 27, 2010 at 02:14:36PM +0100, Ivan Sergio Borgonovo wrote:

> I haven't been able to understand the difference between function
> returning cstring and text and if there is any need to be careful
> about encoding and escaping when copying from the lexeme to a buffer
> that will return a cstring or text.

Well, the difference is that one is a cstring and the other is text. Seriously though, text is more useful if you want people to be able to use the result in other functions, since at the SQL level almost everything is text. cstring is needed for some APIs but is generally not used unless necessary.

Have a nice day,

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
> Is there any difference from function returning text and function
> returning cstring?

text is a varlena type: it can be TOASTed and compressed, and it is better integrated into the PostgreSQL world. cstring is just a C zero-terminated string, so it is good for system calls or for calling external libraries.

Regards
Pavel Stehule
On Wed, 27 Jan 2010 14:44:02 +0100
Martijn van Oosterhout <kleptog@svana.org> wrote:

> On Wed, Jan 27, 2010 at 02:14:36PM +0100, Ivan Sergio Borgonovo
> wrote:
> > I haven't been able to understand the difference between function
> > returning cstring and text and if there is any need to be careful
> > about encoding and escaping when copying from the lexeme to a
> > buffer that will return a cstring or text.
>
> Well, the difference is that one is a cstring and the other is
> text. Seriously though, text is more useful if you want people to
> be able to use the result in other functions since on SQL level
> almost everything is text. cstring is needed for some APIs but it
> generally not used unless necessary.

I didn't get it. Maybe I really chose the wrong function as an example (tsvectorout).

What's not included in "on SQL level almost everything is text"? There are a lot of functions in contrib taking cstring input and returning cstring output. Are they just in the same "special" class as the [type]in, [type]out, [type]recv... functions?

I have to re-read http://www.postgresql.org/docs/8.4/static/xfunc-c.html carefully, since I discovered there may be explanations about text buffers etc. I discovered there is a cstring_to_text function... and a text_to_cstring_buffer too... let me see if I can find something else...

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
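[Editor's note: the helpers Ivan just found (cstring_to_text, text_to_cstring, declared in utils/builtins.h) are how a text-returning extension function typically round-trips through a C string. The sketch below is illustrative only: the function name `shout` and its behavior are made up, it only compiles against the PostgreSQL source tree, and the byte-wise upper-casing is deliberately ASCII-only, as the comment notes.]

```c
#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"   /* cstring_to_text, text_to_cstring */

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(shout);

/* Toy function: return its text argument upper-cased (ASCII only).
 * NOT encoding-aware: a real version must not touch bytes of a
 * multibyte character individually. */
Datum
shout(PG_FUNCTION_ARGS)
{
    text *in  = PG_GETARG_TEXT_PP(0);
    char *buf = text_to_cstring(in);   /* palloc'd NUL-terminated copy */
    char *p;

    for (p = buf; *p; p++)
        if (*p >= 'a' && *p <= 'z')
            *p = *p - ('a' - 'A');

    /* cstring_to_text pallocs a fresh varlena; modify your own copy
     * and return that, never a pointer into the input datum */
    PG_RETURN_TEXT_P(cstring_to_text(buf));
}
```

This also answers the "can I modify the input?" question upthread in the common-case way: take a palloc'd copy, modify the copy, return the copy.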
On Wed, 27 Jan 2010 21:41:02 +0800
Craig Ringer <craig@postnewspapers.com.au> wrote:

> I don't code on PostgreSQL's guts, so I'm perhaps not in the best
> position to speak, but:
>
> - Documentation has a cost too, particularly a maintenance cost.
>   Outdated docs become misleading or downright false and can be much
>   more harm than good. So a reasonable balance must be struck. I'm
>   not saying PostgreSQL is _at_ that reasonable balance re its
>   internal documentation, but there is such a thing as
>   over-documenting. Writing a small book on each function means you
>   have to maintain that, and that gets painful if code is undergoing
>   any sort of major change.

I'd be willing to pay for a book.

> - It's easy to say "should" when you're not the one writing it.
>   Personally, I try to say "hey, it's cool that I have access to
>   this system" and "isn't it great I even have the right to modify
>   it to do what I want, even though the learning curve _can_ be
>   pretty steep".

Well... I generally tend to make available to others everything I learn. It would be nice to have a more advanced use of doxygen, so it would be easier to have a map of the functions.

> Hey, you could contribute yourself - patch some documentation into
> those functions where you find that reading the source isn't clear
> enough, and they really need a "see also" or "called from" comment
> or the like.

Right now I don't have enough knowledge to hope my notes would get into the source code. Once I have a working piece of code, I'll put the information I gathered in the process on my web site, and if someone finds it worthy of a better place I'll release it under a suitable license.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Ivan Sergio Borgonovo wrote:

> What's not included in "on SQL level almost everything is text"?

Input and output functions always use cstring rather than text. The I/O functions are normally not used directly at the SQL level.

> There are a lot of functions in contrib taking cstring input and
> returning cstring output.
> Are they just in the same "special" class of [type]in, [type]out
> [type]recv... functions?

Probably, but I didn't check.

--
Alvaro Herrera                     http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Ivan Sergio Borgonovo wrote:

> I just learned there is a "return all row" mode for returning set
> functions:
> "There are currently two modes in which a function can return a set
> result: value-per-call, or materialize. In value-per-call mode, the
> function returns one value each time it is called, and finally
> reports "done" when it has no more values to return. In materialize
> mode, the function's output set is instantiated in a Tuplestore
> object; all the values are returned in one call. Additional modes
> might be added in future."
>
> There is no example of a function working in this mode.
> I'd guess it should be suited for quick operation and small return.
> But... what should be considered quick and small?
> When someone should use a "row at a time" function and a "return all
> row" function?

There are quite a few SRF functions in the code. Look for example in contrib/hstore/hstore_op.c for some fairly simple examples. SRFs are quite capable of returning huge result sets, not just small ones. Example code for materialize mode can be found in the PLs among other places (e.g. plperl_return_next()).

cheers

andrew
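[Editor's note: for readers who, like Ivan, have not yet seen one, the value-per-call convention Andrew alludes to is driven by the macros in funcapi.h. The skeleton below is a hedged sketch, not code from the thread: the function name `count_to_n` is invented for illustration, and it only compiles against the PostgreSQL source tree. The per-call state that the thread calls "saving the function context" lives in the FuncCallContext.]

```c
#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"

PG_FUNCTION_INFO_V1(count_to_n);

/* Value-per-call skeleton: returns the integers 1..n, one per call.
 * Cross-call state lives in funcctx; anything it points at must be
 * allocated in funcctx->multi_call_memory_ctx to survive between calls. */
Datum
count_to_n(PG_FUNCTION_ARGS)
{
    FuncCallContext *funcctx;

    if (SRF_IS_FIRSTCALL())
    {
        /* one-time setup: runs on the first call only */
        funcctx = SRF_FIRSTCALL_INIT();
        funcctx->max_calls = PG_GETARG_INT32(0);
    }

    /* per-call setup: restore the saved context */
    funcctx = SRF_PERCALL_SETUP();

    if (funcctx->call_cntr < funcctx->max_calls)
        SRF_RETURN_NEXT(funcctx,
                        Int32GetDatum((int32) funcctx->call_cntr + 1));

    SRF_RETURN_DONE(funcctx);
}
```

As later messages in the thread explain, the executor currently materializes this result into a tuplestore anyway in the FunctionScan case, so the main payoff of this style today is its simplicity for "one value per iteration" logic, not streaming.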
On Wed, 27 Jan 2010 11:49:46 -0300
Alvaro Herrera <alvherre@commandprompt.com> wrote:

> > There are a lot of functions in contrib taking cstring input and
> > returning cstring output.
> > Are they just in the same "special" class of [type]in, [type]out
> > [type]recv... functions?
>
> Probably, but I didn't check.

Does this roughly translate to: "nothing you should care about right now; anyway, they are just functions that won't return results to SQL"?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:

> There are a lot of functions in contrib taking cstring input and
> returning cstring output.
> Are they just in the same "special" class of [type]in, [type]out
> [type]recv... functions?

If you're looking in contrib subdirectories, perhaps the missing link here is the *.sql.in files. These are what are run in a database to expose functions to the SQL language. Start at those functions and see what the call tree is from there.

-Kevin
On Wed, 27 Jan 2010 10:10:01 -0500
Andrew Dunstan <andrew@dunslane.net> wrote:

> There are quite a few SRF functions in the code. Look for example
> in contrib/hstore/hstore_op.c for some fairly simple examples.
> SRFs are quite capable of returning huge resultsets, not just
> small ones. Example code for materialize mode can be found in the
> PLs among other places (e.g. plperl_return_next())

I'm more interested in understanding when I should use materialize mode. E.g. should I be more concerned about memory or CPU cycles, and what should be taken as the reference point for considering memory needs "large"?

If, for example, I were going to split a large TEXT into a set of records (let's say I'm processing csv that has been loaded into a text field)... I'd consider the CPU use "light" but the memory needs "large". Would this task be suited to materialize mode?

Is there a rule of thumb for choosing between one mode and the other?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Ivan Sergio Borgonovo wrote:

> On Wed, 27 Jan 2010 10:10:01 -0500
> Andrew Dunstan <andrew@dunslane.net> wrote:
>
> > There are quite a few SRF functions in the code. Look for example
> > in contrib/hstore/hstore_op.c for some fairly simple examples.
> > SRFs are quite capable of returning huge resultsets, not just
> > small ones. Example code for materialize mode can be found in the
> > PLs among other places (e.g. plperl_return_next())
>
> I'm more interested in understanding when I should use materialized
> mode.
> eg. I should be more concerned about memory or cpu cycles and what
> should be taken as a reference to consider memory needs "large"?
> If for example I was going to split a large TEXT into a set of
> record (let's say I'm processing csv that has been loaded into a
> text field)... I'd consider the CPU use "light" but the memory needs
> "large". Would be this task suited for the materialized mode?
>
> Is there a rule of thumb to chose between one mode or the other?

If you don't know that your memory use will be light, use materialize mode. For small results the data will still be in memory anyway; the Tuplestore only spills to disk if it grows beyond a certain size.

cheers

andrew
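[Editor's note: a materialize-mode function builds its whole result in one call, as in the hedged sketch below. The function name `all_at_once` and its trivial three-row result are invented for illustration; the sketch assumes the 8.4-era API (get_call_result_type, tuplestore_putvalues) and only compiles against the PostgreSQL source tree. Real code should also check rsinfo->allowedModes for SFRM_Materialize before committing to this mode.]

```c
#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"          /* work_mem */
#include "utils/tuplestore.h"

PG_FUNCTION_INFO_V1(all_at_once);

/* Materialize-mode skeleton: put every row into a Tuplestore and hand
 * the whole store back in a single call.  The store keeps rows in
 * memory up to work_mem and then spills to disk, which is why even
 * "large" results are safe here. */
Datum
all_at_once(PG_FUNCTION_ARGS)
{
    ReturnSetInfo   *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
    TupleDesc        tupdesc;
    Tuplestorestate *tupstore;
    MemoryContext    oldcxt;
    int              i;

    /* the tuplestore must outlive this call: build it in per-query memory */
    oldcxt = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);

    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
        elog(ERROR, "return type must be a row type");

    tupstore = tuplestore_begin_heap(true, false, work_mem);

    for (i = 0; i < 3; i++)     /* stand-in for the real row source */
    {
        Datum values[1];
        bool  nulls[1] = { false };

        values[0] = Int32GetDatum(i);
        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
    }

    rsinfo->returnMode = SFRM_Materialize;
    rsinfo->setResult  = tupstore;
    rsinfo->setDesc    = tupdesc;

    MemoryContextSwitchTo(oldcxt);
    return (Datum) 0;           /* result is delivered via rsinfo */
}
```

Compared with the value-per-call skeleton, there is no cross-call state to save at all: everything happens in one invocation, which is the simplicity Andrew and Heikki refer to below.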
Ivan Sergio Borgonovo wrote:

> I'm more interested in understanding when I should use materialized
> mode.
> eg. I should be more concerned about memory or cpu cycles and what
> should be taken as a reference to consider memory needs "large"?
> If for example I was going to split a large TEXT into a set of
> record (let's say I'm processing csv that has been loaded into a
> text field)... I'd consider the CPU use "light" but the memory needs
> "large". Would be this task suited for the materialized mode?

Currently, there's no difference in terms of memory needs. The backend always materializes the result of a SRF into a tuplestore anyway, if the function didn't do it itself. There has been discussion of optimizing away that materialization step, but no-one has come up with an acceptable patch for that yet.

There probably isn't much difference in CPU usage either.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Ivan Sergio Borgonovo wrote:

> On Wed, 27 Jan 2010 11:49:46 -0300
> Alvaro Herrera <alvherre@commandprompt.com> wrote:
>
> > > There are a lot of functions in contrib taking cstring input and
> > > returning cstring output.
> > > Are they just in the same "special" class of [type]in, [type]out
> > > [type]recv... functions?
> >
> > Probably, but I didn't check.
>
> Does this nearly translate to:
> "nothing you should care about right now and anyway just functions
> that won't return results to SQL"?

I meant "they are almost certainly in the same class as typein and typeout, but I didn't check every single one of them".

As far as I can tell you are not writing an output function, but rather a debugging function of sorts. I see no reason to return cstring.

--
Alvaro Herrera                     http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Ivan Sergio Borgonovo wrote:

> Is there a book?

You'll find a basic intro to this area in "PostgreSQL Developer's Handbook" by Geschwinde and Schonig. You're already past the level of questions they answer in there, though.

This whole cstring/text issue you're asking about, I figured out by reading the source code, deciphering the examples in the talk at http://www.joeconway.com/presentations/tut_oscon_2004.pdf , and then using interesting keywords there ("DatumGetCString" is a good one to find interesting places in the code) to find code examples and messages on that topic in the list archives. Finally, finding some commits/patches that do things similar to what you want, and dissecting how they work, is quite informative too.

I agree the learning curve could be shortened a lot, and I have actually submitted a conference talk proposal in this area to try and improve that. You're a few months ahead of me in having something written down to share, though.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
On Wed, 27 Jan 2010 17:37:23 +0200
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

> Currently, there's no difference in terms of memory needs. The
> backend always materializes the result of a SRF into a tuplestore
> anyway, if the function didn't do it itself. There has been
> discussion of optimizing away that materialization step, but
> no-one has come up with an acceptable patch for that yet.
> There probably isn't much difference in CPU usage either.

On Wed, 27 Jan 2010 10:34:10 -0500
Andrew Dunstan <andrew@dunslane.net> wrote:

> If you don't know your memory use will be light, use materialized
> mode. For small results the data will still be in memory anyway.
> The Tuplestore will only spill to disk if it grows beyond a
> certain size.

I keep on missing something. Considering all the context-switching stuff, the "first call" initialization, etc., I'd expect that whatever collects the result would push it downstream a bit at a time. What actually happens is... the result gets collected in a Tuplestore anyway.

But then... why do we have all that logic to save the function context if it is more convenient to process everything in one run anyway? It's a pain to save the context just to save a pointer inside a structure; it would be more convenient to just process the whole structure and return it as a Tuplestore in one pass.

BTW thanks to Greg for pointing me once more to Joe Conway's tutorial. When I read it the first time I wasn't in a condition to take advantage of it. Now it looks more useful.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
On 01/27/2010 09:49 AM, Ivan Sergio Borgonovo wrote:

> On Wed, 27 Jan 2010 17:37:23 +0200
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
>
>> Currently, there's no difference in terms of memory needs. The
>> backend always materializes the result of a SRF into a tuplestore
>> anyway, if the function didn't do it itself. There has been
>> discussion of optimizing away that materialization step, but
>> no-one has come up with an acceptable patch for that yet.
>
> I keep on missing something.
> But then... why do we have all that logic to save the function
> context if anyway it is more convenient to process everything in one
> run?

As already pointed out, there is currently no difference between value_per_call and materialize modes. The original intent was to have both modes eventually, with the initial one being materialize (historical note: my first SRF patch was value_per_call, but it had difficult issues to be solved, and the consensus was that materialize was a simpler and safer approach for the first try at this feature).

The advantage of value_per_call (if it were not materialized anyway in the backend) would be to allow pipelining, which enables very large datasets to be streamed without exhausting memory or having to wait for everything to be materialized.

Implementing true value_per_call is still on my TODO list, but obviously it has not risen to a very high priority for me, as it has now been an embarrassingly long time since it was put there. That said, materialize mode has proven extremely good at covering the most common use cases with acceptable performance.

Joe
Ivan Sergio Borgonovo wrote:

> But then... why do we have all that logic to save the function
> context if anyway it is more convenient to process everything in one
> run?
> It's a pain to save the context just to save a pointer inside a
> structure, it would be more convenient to just process all the
> structure and return it as a Tuplestore in one pass.

When the set-returning-function feature was written originally, years ago, the tuple-at-a-time mode really did work a tuple at a time. But it had issues and was axed out of the patch before it was committed, to keep it simple. The idea was to revisit it at some point, but it hasn't bothered anyone enough to fix it. It's basically "not implemented yet".

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Jan 27, 2010, at 1:00 PM, Joe Conway wrote:

> Implementing true value_per_call is still something on my TODO list, but
> obviously has not risen to a very high priority for me as it has now
> been an embarrassing long time since it was put there. But that said,
> materialize mode has proven extremely good at covering the most common
> use cases with acceptable performance.

Hrm. I think this has been noted before, but one of the problems with VPC is that there can be a fairly significant amount of overhead involved with context setup and teardown, especially with PLs. If you're streaming millions of rows, it's no longer a small matter.

I would think some extension to Tuplestore would be preferable, where chunks of rows are placed into the Tuplestore on demand in order to minimize context setup/teardown overhead. That is, if the Tuplestore is empty and the user needs more rows, invoke the procedure again with the expectation that it will dump another chunk of rows into the container. Not a formal specification by any means, but I'm curious whether anyone has considered that direction.

Or along the same lines, how about a valueS-per-call mode? =)
On Wed, 27 Jan 2010 22:06:43 +0200
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

> When the set-returning-function feature was written originally,
> years ago, the tuple at a time mode did really work tuple at a
> time. But it had issues and was axed out of the patch before it
> was committed, to keep it simple. The idea was to revisit it at
> some point, but it hasn't bothered anyone enough to fix it. It's
> basically "not implemented yet".

Summing it up:
1) tuple-at-a-time should theoretically be better and "future proof" once someone writes the missing code, but the code is still not there
2) practically, there is no substantial extra cost in returning a tuple at a time

Is 2) really true? It seems that materialize mode is much simpler: it requires a lot less code and it doesn't force you to save local variables and then restore them every time.

So does it still make sense to get an idea of when the size of the returned data set and the complexity of the computation really fit value_per_call or materialize mode? What could happen between function calls in value_per_call? Would value_per_call still give postgresql/the OS a chance to better allocate CPU cycles/memory?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

> When the set-returning-function feature was written originally, years
> ago, the tuple at a time mode did really work tuple at a time. But it
> had issues and was axed out of the patch before it was committed, to
> keep it simple. The idea was to revisit it at some point, but it hasn't
> bothered anyone enough to fix it. It's basically "not implemented yet".

It depends on call context --- you're speaking of the nodeFunctionScan context, but I believe an SRF called in the select targetlist will operate in tuple-at-a-time mode if the function allows.

regards, tom lane