Thread: Re: [HACKERS] Text <-> C string

Re: [HACKERS] Text <-> C string

From
"Brendan Jurd"
Date:
As discussed on -hackers, I'm trying to get rid of some redundant code
by creating a widely useful set of functions to convert between text
and C string in the backend.

The new extern functions, declared in include/utils/builtins.h and
defined in backend/utils/adt/varlena.c, are:

char * text_cstring(const text *t)
char * text_cstring_limit(const text *t, int len)
text * cstring_text(const char *s)
text * cstring_text_limit(const char *s, int len)

Within varlena.c, the actual conversions are performed by:

char * do_text_cstring(const text *t, const int len)
text * do_cstring_text(const char *s, const int len)

These functions now do the work for the fmgr functions textin and
textout, as well as being directly accessible by backend code.

I've searched through the backend for any code which converted between
text and C string manually (with memcpy and VARDATA), replacing with
calls to one of the four new functions as appropriate.

I came across some areas which were using the same, or similar,
conversion technique on other varlena data types, such as bytea or
xmltype.  In cases where the conversion was completely identical I
used the new functions.  In cases with any differences (even if they
seemed minor) I played it safe and left them alone.

I'd now like to submit my work so far for review.  This patch compiled
cleanly on Linux and passed all parallel regression tests.  It appears
to be performance-neutral based on a few rough tests; I haven't tried
to profile the changes in detail.

There is still a lot of code out there using DirectFunctionCall1 to
call text(in|out)).  I've decided to wait for some community feedback
on the patch as it stands before replacing those calls.  There are a
great many, and it would be a shame to have to go through them more
than once.

I would naively expect that replacing fmgr calls with direct calls
would lead to a performance gain (no fmgr overhead), but honestly I'm
not sure whether that would actually make a difference.

Thanks for your time,
BJ

Attachment

Re: [HACKERS] Text <-> C string

From
Bruce Momjian
Date:
This has been saved for the 8.4 release:

    http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Brendan Jurd wrote:
> As discussed on -hackers, I'm trying to get rid of some redundant code
> by creating a widely useful set of functions to convert between text
> and C string in the backend.
>
> The new extern functions, declared in include/utils/builtins.h and
> defined in backend/utils/adt/varlena.c, are:
>
> char * text_cstring(const text *t)
> char * text_cstring_limit(const text *t, int len)
> text * cstring_text(const char *s)
> text * cstring_text_limit(const char *s, int len)
>
> Within varlena.c, the actual conversions are performed by:
>
> char * do_text_cstring(const text *t, const int len)
> text * do_cstring_text(const char *s, const int len)
>
> These functions now do the work for the fmgr functions textin and
> textout, as well as being directly accessible by backend code.
>
> I've searched through the backend for any code which converted between
> text and C string manually (with memcpy and VARDATA), replacing with
> calls to one of the four new functions as appropriate.
>
> I came across some areas which were using the same, or similar,
> conversion technique on other varlena data types, such as bytea or
> xmltype.  In cases where the conversion was completely identical I
> used the new functions.  In cases with any differences (even if they
> seemed minor) I played it safe and left them alone.
>
> I'd now like to submit my work so far for review.  This patch compiled
> cleanly on Linux and passed all parallel regression tests.  It appears
> to be performance-neutral based on a few rough tests; I haven't tried
> to profile the changes in detail.
>
> There is still a lot of code out there using DirectFunctionCall1 to
> call text(in|out)).  I've decided to wait for some community feedback
> on the patch as it stands before replacing those calls.  There are a
> great many, and it would be a shame to have to go through them more
> than once.
>
> I would naively expect that replacing fmgr calls with direct calls
> would lead to a performance gain (no fmgr overhead), but honestly I'm
> not sure whether that would actually make a difference.
>
> Thanks for your time,
> BJ

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [HACKERS] Text <-> C string

From
Bruce Momjian
Date:
This has been saved for the 8.4 release:

    http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Brendan Jurd wrote:
> As discussed on -hackers, I'm trying to get rid of some redundant code
> by creating a widely useful set of functions to convert between text
> and C string in the backend.
>
> The new extern functions, declared in include/utils/builtins.h and
> defined in backend/utils/adt/varlena.c, are:
>
> char * text_cstring(const text *t)
> char * text_cstring_limit(const text *t, int len)
> text * cstring_text(const char *s)
> text * cstring_text_limit(const char *s, int len)
>
> Within varlena.c, the actual conversions are performed by:
>
> char * do_text_cstring(const text *t, const int len)
> text * do_cstring_text(const char *s, const int len)
>
> These functions now do the work for the fmgr functions textin and
> textout, as well as being directly accessible by backend code.
>
> I've searched through the backend for any code which converted between
> text and C string manually (with memcpy and VARDATA), replacing with
> calls to one of the four new functions as appropriate.
>
> I came across some areas which were using the same, or similar,
> conversion technique on other varlena data types, such as bytea or
> xmltype.  In cases where the conversion was completely identical I
> used the new functions.  In cases with any differences (even if they
> seemed minor) I played it safe and left them alone.
>
> I'd now like to submit my work so far for review.  This patch compiled
> cleanly on Linux and passed all parallel regression tests.  It appears
> to be performance-neutral based on a few rough tests; I haven't tried
> to profile the changes in detail.
>
> There is still a lot of code out there using DirectFunctionCall1 to
> call text(in|out)).  I've decided to wait for some community feedback
> on the patch as it stands before replacing those calls.  There are a
> great many, and it would be a shame to have to go through them more
> than once.
>
> I would naively expect that replacing fmgr calls with direct calls
> would lead to a performance gain (no fmgr overhead), but honestly I'm
> not sure whether that would actually make a difference.
>
> Thanks for your time,
> BJ

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
>                 http://www.postgresql.org/about/donate

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [HACKERS] Text <-> C string

From
Tom Lane
Date:
"Brendan Jurd" <direvus@gmail.com> writes:
> As discussed on -hackers, I'm trying to get rid of some redundant code
> by creating a widely useful set of functions to convert between text
> and C string in the backend.

> The new extern functions, declared in include/utils/builtins.h and
> defined in backend/utils/adt/varlena.c, are:

> char * text_cstring(const text *t)
> char * text_cstring_limit(const text *t, int len)
> text * cstring_text(const char *s)
> text * cstring_text_limit(const char *s, int len)

I started to look at applying this patch and immediately decided that
I didn't like these names --- it's exceeding un-obvious which direction
of conversion any one of the functions performs.  Seems like every time
you wanted to call one, you'd be going back to look at the source code
to remember which to use.

What do people think of text_to_cstring?  Or should we go with
TextPGetCString for consistency with the Datum-whacking macros?  (I seem
to recall having argued against the latter idea, but am reconsidering.)
Or something else?

BTW, I suspect that the _limit functions are mostly useless ---
a quick look through the patch did not suggest that any of the places
using them really needed a limit.  The point of, for instance,
TZ_STRLEN_MAX is to be able to use fixed-size local buffers, and
if you're going to pay for a palloc anyway then you might as well
forget it.  (There might be some value in a strlcpy equivalent that
copies from a text datum into a limited-size caller-supplied buffer,
but that's not what we've got here.)

            regards, tom lane

Re: [HACKERS] Text <-> C string

From
"Brendan Jurd"
Date:
On 20/03/2008, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I started to look at applying this patch and immediately decided that
>  I didn't like these names --- it's exceeding un-obvious which direction
>  of conversion any one of the functions performs.  Seems like every time
>  you wanted to call one, you'd be going back to look at the source code
>  to remember which to use.
>

That's a fair criticism.  I wanted to make the function names as
compact as possible, but your comment about the directionality of the
functions rings true.

>  What do people think of text_to_cstring?  Or should we go with
>  TextPGetCString for consistency with the Datum-whacking macros?  (I seem
>  to recall having argued against the latter idea, but am reconsidering.)
>  Or something else?
>

Your original argument against FooGetBar was that it would be *too*
consistent with the Datum macros, leading people to think that these
functions actually were macros.

As long as we don't want people getting confused about the
function/macro distinction, that argument still makes sense to me, and
I'd be more inclined towards the foo_to_bar() convention.

>  BTW, I suspect that the _limit functions are mostly useless ---
>  a quick look through the patch did not suggest that any of the places
>  using them really needed a limit.  The point of, for instance,
>  TZ_STRLEN_MAX is to be able to use fixed-size local buffers, and
>  if you're going to pay for a palloc anyway then you might as well
>  forget it.

What about callers like dotrim() in oracle_compat.c, which only want
to copy characters from the source string up to a particular length?
Doesn't that indicate a legitimate requirement for a
cstring_to_text_limit() (the call site was palloc'ing the text value
anyway)?

On the other hand, we do have those call sites (TZ_STRLEN_MAX is a
good example) where the caller just wanted to use a local buffer.  In
which case your strlcpy-equivalent function would probably be the
right thing to offer.

> (There might be some value in a strlcpy equivalent that
>  copies from a text datum into a limited-size caller-supplied buffer,
>  but that's not what we've got here.)
>

Perhaps we keep cstring_to_text_limit(), but make
text_to_cstring_limit() behave like strlcpy() instead?

One of the questions in the original patch submission was whether it
would be worth changing all those DirectFunctionCall(textin) and
(textout) calls to use the new functions.  Is it worthwhile avoiding
the fmgr overhead?

Thanks for taking the time to review my patch and provide this
feedback.  I'm happy to send in an updated version once we settle on
the naming convention for the functions.

Last time I looked, the codebase had shifted quite a bit since I
originally wrote the patch.  So it probably needs some work to apply
cleanly on the latest sources anyway.

Regards,
BJ

Re: [HACKERS] Text <-> C string

From
Tom Lane
Date:
"Brendan Jurd" <direvus@gmail.com> writes:
> One of the questions in the original patch submission was whether it
> would be worth changing all those DirectFunctionCall(textin) and
> (textout) calls to use the new functions.  Is it worthwhile avoiding
> the fmgr overhead?

I think that's worth doing just on notational clarity grounds.
The small cycle savings doesn't excite me, but understanding
DirectFunctionCall1(textin, CStringGetDatum(foo)) just involves
more different bits of trivia than cstring_to_text(foo).

> Last time I looked, the codebase had shifted quite a bit since I
> originally wrote the patch.  So it probably needs some work to apply
> cleanly on the latest sources anyway.

Yeah, with wide-impact patches like this you are always going to have
that problem.  One point though is that we don't have to improve every
call site at the same time.  I'd be inclined to put in the new functions
and hit some representative sample of utils/adt/ files in the first
commit, and then incrementally fix other stuff.

            regards, tom lane

Re: [HACKERS] Text <-> C string

From
Tom Lane
Date:
"Brendan Jurd" <direvus@gmail.com> writes:
> As discussed on -hackers, I'm trying to get rid of some redundant code
> by creating a widely useful set of functions to convert between text
> and C string in the backend.

Applied with revisions --- the functions were modified as per recent
discussion, and I fixed a lot more potential call sites.

There are no textout/textin calls left, but I may have missed some
places that were doing it the hard way with direct palloc/memcpy
manipulations.  It might be worth trolling all the VARDATA() references
to see if any more are easily replaceable.

I notice in particular that xfunc.sgml contains sample C functions to
copy and concatenate text.  While these aren't directly replaceable
with the new functions, I wonder whether we ought to change the examples
to make them less certain to break if we ever change text's
representation.

            regards, tom lane

Re: [HACKERS] Text <-> C string

From
"Brendan Jurd"
Date:
On 26/03/2008, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>  There are no textout/textin calls left, but I may have missed some
>  places that were doing it the hard way with direct palloc/memcpy
>  manipulations.  It might be worth trolling all the VARDATA() references
>  to see if any more are easily replaceable.
>

I think there are perhaps a dozen such sites, and I hope to submit a
cleanup patch that gets rid of these soon.

I did come across one site I'm not sure about in utils/adt/xml.c:291

        /*
         * We need a null-terminated string to pass to parse_xml_decl().  Rather
         * than make a separate copy, make the temporary result one byte bigger
         * than it needs to be.
         */
        result = palloc(nbytes + 1 + VARHDRSZ);
        SET_VARSIZE(result, nbytes + VARHDRSZ);
        memcpy(VARDATA(result), str, nbytes);
        str = VARDATA(result);
        str[nbytes] = '\0';

The author seemed pretty sure he didn't want to make a separate copy
of the string, so I'm thinking there's not a whole lot I can do to
improve this site.

>  I notice in particular that xfunc.sgml contains sample C functions to
>  copy and concatenate text.  While these aren't directly replaceable
>  with the new functions, I wonder whether we ought to change the examples
>  to make them less certain to break if we ever change text's
>  representation.
>

Yes, these sample C functions are shown in tutorial/funcs.c and
funcs_new.c as well.  I agree that the examples could do with
changing, but don't have any keen insight on what to change them to.

Re: [HACKERS] Text <-> C string

From
Tom Lane
Date:
"Brendan Jurd" <direvus@gmail.com> writes:
> On 26/03/2008, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> There are no textout/textin calls left, but I may have missed some
>> places that were doing it the hard way with direct palloc/memcpy
>> manipulations.  It might be worth trolling all the VARDATA() references
>> to see if any more are easily replaceable.

> I think there are perhaps a dozen such sites, and I hope to submit a
> cleanup patch that gets rid of these soon.
> I did come across one site I'm not sure about in utils/adt/xml.c:291

I intentionally didn't touch xml.c, nor anyplace that is not dealing
in text, even if it happens to be binary-compatible with text.

            regards, tom lane

Re: [HACKERS] Text <-> C string

From
"Brendan Jurd"
Date:
On 29/03/2008, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I intentionally didn't touch xml.c, nor anyplace that is not dealing
>  in text, even if it happens to be binary-compatible with text.
>

Hmm, okay.  My original submission did include a few such changes; for
example, in xml_in and xml_out_internal I saw that the conversion
between xmltype and cstring was identical to the text conversion, so I
went ahead and used the text functions.  Feedback upthread suggested
that it was okay to go ahead with casting in identical cases. [1]

I saw that these changes made it into the commit, so I assumed that it
was the right call.

If we don't want to meddle with xmltype/bytea/VarChar at all, we'll
have to revert those changes, and I'll have to seriously scale back
the cleanup patch I was about to post.

Regards,
BJ

[1] http://archives.postgresql.org/pgsql-hackers/2007-09/msg01094.php

Re: [HACKERS] Text <-> C string

From
"Brendan Jurd"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

- -----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Sat, Mar 29, 2008 at 5:40 AM, Brendan Jurd  wrote:
> On 29/03/2008, Tom Lane  wrote:
>  > I intentionally didn't touch xml.c, nor anyplace that is not dealing
>  >  in text, even if it happens to be binary-compatible with text.
>  >
>
>  Hmm, okay.  My original submission did include a few such changes; for
>  example, in xml_in and xml_out_internal I saw that the conversion
>  between xmltype and cstring was identical to the text conversion, so I
>  went ahead and used the text functions.  Feedback upthread suggested
>  that it was okay to go ahead with casting in identical cases. [1]
>
>  I saw that these changes made it into the commit, so I assumed that it
>  was the right call.
>
>  If we don't want to meddle with xmltype/bytea/VarChar at all, we'll
>  have to revert those changes, and I'll have to seriously scale back
>  the cleanup patch I was about to post.

Still not sure where we stand on the above.  To cast, or not to cast?

Cheers,
BJ
- -----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.7 (GNU/Linux)
Comment: http://getfiregpg.org

iD8DBQFIBaFy5YBsbHkuyV0RAsMmAKDHaEb7aMudKJgVxcf5RRcOtAJ+bwCgivLI
5B3xJze46NGWjEyOpq/TSaY=
=BObd
- -----END PGP SIGNATURE-----

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.7 (GNU/Linux)
Comment: http://getfiregpg.org

iD8DBQFIBaIl5YBsbHkuyV0RArDDAJ0QThLXAhzy+NX+2YsF+q4z/sIy1QCeJPiW
s/rVns3FFQVP5g9DTOshDfQ=
=4Tdh
-----END PGP SIGNATURE-----

Re: [HACKERS] Text <-> C string

From
Tom Lane
Date:
"Brendan Jurd" <direvus@gmail.com> writes:
>> If we don't want to meddle with xmltype/bytea/VarChar at all, we'll
>> have to revert those changes, and I'll have to seriously scale back
>> the cleanup patch I was about to post.

> Still not sure where we stand on the above.  To cast, or not to cast?

I dunno.  I know there was previously some handwaving about representing
XML values in some more intelligent fashion than a plain text string,
but I have no idea if anyone is likely to do anything about it in the
foreseeable future.

            regards, tom lane

Re: [HACKERS] Text <-> C string

From
Andrew Dunstan
Date:

Tom Lane wrote:
> "Brendan Jurd" <direvus@gmail.com> writes:
>
>>> If we don't want to meddle with xmltype/bytea/VarChar at all, we'll
>>> have to revert those changes, and I'll have to seriously scale back
>>> the cleanup patch I was about to post.
>>>
>
>
>> Still not sure where we stand on the above.  To cast, or not to cast?
>>
>
> I dunno.  I know there was previously some handwaving about representing
> XML values in some more intelligent fashion than a plain text string,
> but I have no idea if anyone is likely to do anything about it in the
> foreseeable future.
>
>
>

It seems very unlikely to me.

cheers

andrew

Re: [HACKERS] Text <-> C string

From
"Brendan Jurd"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, Apr 17, 2008 at 1:16 AM, Tom Lane  wrote:
> "Brendan Jurd"  writes:
>
> >> If we don't want to meddle with xmltype/bytea/VarChar at all, we'll
>  >> have to revert those changes, and I'll have to seriously scale back
>  >> the cleanup patch I was about to post.
>
>  > Still not sure where we stand on the above.  To cast, or not to cast?
>
>  I dunno.  I know there was previously some handwaving about representing
>  XML values in some more intelligent fashion than a plain text string,
>  but I have no idea if anyone is likely to do anything about it in the
>  foreseeable future.
>

Well ... if somebody does want to change the representation of xml
down the road, he's going to have to touch all the sites where the
code converts to and from cstring anyway, right?

So the only real difference this will make is that, instead of having
to replace four lines of VARDATA/memcpy per site, he'll have to
replace a single function call.  That doesn't seem like a negative.

With that in mind, please find attached my followup patch.  It cleans
up another 21 sites manually copying between cstring and varlena, for
a net reduction of 115 lines of code.

I didn't attempt to work through every reference to VARDATA, but I did
look at every hit from a  `grep -Rn VARDATA . | grep memcpy`.

All regression tests passed (contrib tests included) on gentoo.

Patch added to commitfest queue.

Cheers,
BJ
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: http://getfiregpg.org

iD8DBQFIBm105YBsbHkuyV0RAmqfAKCfyNyFdciqX4QV81sG9MhPt+KXuACfe694
3d/ICZF6yqV6K20X3TVX+So=
=CvRM
-----END PGP SIGNATURE-----

Attachment

Re: [HACKERS] Text <-> C string

From
Tom Lane
Date:
"Brendan Jurd" <direvus@gmail.com> writes:
> Well ... if somebody does want to change the representation of xml
> down the road, he's going to have to touch all the sites where the
> code converts to and from cstring anyway, right?

True.

> With that in mind, please find attached my followup patch.  It cleans
> up another 21 sites manually copying between cstring and varlena, for
> a net reduction of 115 lines of code.

I applied most of this, but not the parts that were dependent on the
assumption that bytea and text are the same.  That is unlikely to remain
true if we ever get around to putting locale/encoding information into
text values.  Furthermore it's a horrid type pun, because bytea can
contain embedded nulls whereas cstrings can't; you were making
unwarranted assumptions about whether, say, cstring_to_text_with_len
would allow embedded nulls to go by.  And lastly, quite a few of those
changes were just plain broken, eg several places in selfuncs.c where
you allowed strlen() to be applied to a "bytea converted to cstring".

            regards, tom lane