Thread: Re: [HACKERS] Text <-> C string
As discussed on -hackers, I'm trying to get rid of some redundant code by creating a widely useful set of functions to convert between text and C string in the backend. The new extern functions, declared in include/utils/builtins.h and defined in backend/utils/adt/varlena.c, are: char * text_cstring(const text *t) char * text_cstring_limit(const text *t, int len) text * cstring_text(const char *s) text * cstring_text_limit(const char *s, int len) Within varlena.c, the actual conversions are performed by: char * do_text_cstring(const text *t, const int len) text * do_cstring_text(const char *s, const int len) These functions now do the work for the fmgr functions textin and textout, as well as being directly accessible by backend code. I've searched through the backend for any code which converted between text and C string manually (with memcpy and VARDATA), replacing with calls to one of the four new functions as appropriate. I came across some areas which were using the same, or similar, conversion technique on other varlena data types, such as bytea or xmltype. In cases where the conversion was completely identical I used the new functions. In cases with any differences (even if they seemed minor) I played it safe and left them alone. I'd now like to submit my work so far for review. This patch compiled cleanly on Linux and passed all parallel regression tests. It appears to be performance-neutral based on a few rough tests; I haven't tried to profile the changes in detail. There is still a lot of code out there using DirectFunctionCall1 to call text(in|out)). I've decided to wait for some community feedback on the patch as it stands before replacing those calls. There are a great many, and it would be a shame to have to go through them more than once. I would naively expect that replacing fmgr calls with direct calls would lead to a performance gain (no fmgr overhead), but honestly I'm not sure whether that would actually make a difference. Thanks for your time, BJ
Attachment
This has been saved for the 8.4 release: http://momjian.postgresql.org/cgi-bin/pgpatches_hold --------------------------------------------------------------------------- Brendan Jurd wrote: > As discussed on -hackers, I'm trying to get rid of some redundant code > by creating a widely useful set of functions to convert between text > and C string in the backend. > > The new extern functions, declared in include/utils/builtins.h and > defined in backend/utils/adt/varlena.c, are: > > char * text_cstring(const text *t) > char * text_cstring_limit(const text *t, int len) > text * cstring_text(const char *s) > text * cstring_text_limit(const char *s, int len) > > Within varlena.c, the actual conversions are performed by: > > char * do_text_cstring(const text *t, const int len) > text * do_cstring_text(const char *s, const int len) > > These functions now do the work for the fmgr functions textin and > textout, as well as being directly accessible by backend code. > > I've searched through the backend for any code which converted between > text and C string manually (with memcpy and VARDATA), replacing with > calls to one of the four new functions as appropriate. > > I came across some areas which were using the same, or similar, > conversion technique on other varlena data types, such as bytea or > xmltype. In cases where the conversion was completely identical I > used the new functions. In cases with any differences (even if they > seemed minor) I played it safe and left them alone. > > I'd now like to submit my work so far for review. This patch compiled > cleanly on Linux and passed all parallel regression tests. It appears > to be performance-neutral based on a few rough tests; I haven't tried > to profile the changes in detail. > > There is still a lot of code out there using DirectFunctionCall1 to > call text(in|out)). I've decided to wait for some community feedback > on the patch as it stands before replacing those calls. There are a > great many, and it would be a shame to have to go through them more > than once. > > I would naively expect that replacing fmgr calls with direct calls > would lead to a performance gain (no fmgr overhead), but honestly I'm > not sure whether that would actually make a difference. > > Thanks for your time, > BJ [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
This has been saved for the 8.4 release: http://momjian.postgresql.org/cgi-bin/pgpatches_hold --------------------------------------------------------------------------- Brendan Jurd wrote: > As discussed on -hackers, I'm trying to get rid of some redundant code > by creating a widely useful set of functions to convert between text > and C string in the backend. > > The new extern functions, declared in include/utils/builtins.h and > defined in backend/utils/adt/varlena.c, are: > > char * text_cstring(const text *t) > char * text_cstring_limit(const text *t, int len) > text * cstring_text(const char *s) > text * cstring_text_limit(const char *s, int len) > > Within varlena.c, the actual conversions are performed by: > > char * do_text_cstring(const text *t, const int len) > text * do_cstring_text(const char *s, const int len) > > These functions now do the work for the fmgr functions textin and > textout, as well as being directly accessible by backend code. > > I've searched through the backend for any code which converted between > text and C string manually (with memcpy and VARDATA), replacing with > calls to one of the four new functions as appropriate. > > I came across some areas which were using the same, or similar, > conversion technique on other varlena data types, such as bytea or > xmltype. In cases where the conversion was completely identical I > used the new functions. In cases with any differences (even if they > seemed minor) I played it safe and left them alone. > > I'd now like to submit my work so far for review. This patch compiled > cleanly on Linux and passed all parallel regression tests. It appears > to be performance-neutral based on a few rough tests; I haven't tried > to profile the changes in detail. > > There is still a lot of code out there using DirectFunctionCall1 to > call text(in|out)). I've decided to wait for some community feedback > on the patch as it stands before replacing those calls. There are a > great many, and it would be a shame to have to go through them more > than once. > > I would naively expect that replacing fmgr calls with direct calls > would lead to a performance gain (no fmgr overhead), but honestly I'm > not sure whether that would actually make a difference. > > Thanks for your time, > BJ [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 7: You can help support the PostgreSQL project by donating at > > http://www.postgresql.org/about/donate -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
"Brendan Jurd" <direvus@gmail.com> writes: > As discussed on -hackers, I'm trying to get rid of some redundant code > by creating a widely useful set of functions to convert between text > and C string in the backend. > The new extern functions, declared in include/utils/builtins.h and > defined in backend/utils/adt/varlena.c, are: > char * text_cstring(const text *t) > char * text_cstring_limit(const text *t, int len) > text * cstring_text(const char *s) > text * cstring_text_limit(const char *s, int len) I started to look at applying this patch and immediately decided that I didn't like these names --- it's exceeding un-obvious which direction of conversion any one of the functions performs. Seems like every time you wanted to call one, you'd be going back to look at the source code to remember which to use. What do people think of text_to_cstring? Or should we go with TextPGetCString for consistency with the Datum-whacking macros? (I seem to recall having argued against the latter idea, but am reconsidering.) Or something else? BTW, I suspect that the _limit functions are mostly useless --- a quick look through the patch did not suggest that any of the places using them really needed a limit. The point of, for instance, TZ_STRLEN_MAX is to be able to use fixed-size local buffers, and if you're going to pay for a palloc anyway then you might as well forget it. (There might be some value in a strlcpy equivalent that copies from a text datum into a limited-size caller-supplied buffer, but that's not what we've got here.) regards, tom lane
On 20/03/2008, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I started to look at applying this patch and immediately decided that > I didn't like these names --- it's exceeding un-obvious which direction > of conversion any one of the functions performs. Seems like every time > you wanted to call one, you'd be going back to look at the source code > to remember which to use. > That's a fair criticism. I wanted to make the function names as compact as possible, but your comment about the directionality of the functions rings true. > What do people think of text_to_cstring? Or should we go with > TextPGetCString for consistency with the Datum-whacking macros? (I seem > to recall having argued against the latter idea, but am reconsidering.) > Or something else? > Your original argument against FooGetBar was that it would be *too* consistent with the Datum macros, leading people to think that these functions actually were macros. As long as we don't want people getting confused about the function/macro distinction, that argument still makes sense to me, and I'd be more inclined towards the foo_to_bar() convention. > BTW, I suspect that the _limit functions are mostly useless --- > a quick look through the patch did not suggest that any of the places > using them really needed a limit. The point of, for instance, > TZ_STRLEN_MAX is to be able to use fixed-size local buffers, and > if you're going to pay for a palloc anyway then you might as well > forget it. What about callers like dotrim() in oracle_compat.c, which only want to copy characters from the source string up to a particular length? Doesn't that indicate a legitimate requirement for a cstring_to_text_limit() (the call site was palloc'ing the text value anyway)? On the other hand, we do have those call sites (TZ_STRLEN_MAX is a good example) where the caller just wanted to use a local buffer. In which case your strlcpy-equivalent function would probably be the right thing to offer. > (There might be some value in a strlcpy equivalent that > copies from a text datum into a limited-size caller-supplied buffer, > but that's not what we've got here.) > Perhaps we keep cstring_to_text_limit(), but make text_to_cstring_limit() behave like strlcpy() instead? One of the questions in the original patch submission was whether it would be worth changing all those DirectFunctionCall(textin) and (textout) calls to use the new functions. Is it worthwhile avoiding the fmgr overhead? Thanks for taking the time to review my patch and provide this feedback. I'm happy to send in an updated version once we settle on the naming convention for the functions. Last time I looked, the codebase had shifted quite a bit since I originally wrote the patch. So it probably needs some work to apply cleanly on the latest sources anyway. Regards, BJ
"Brendan Jurd" <direvus@gmail.com> writes: > One of the questions in the original patch submission was whether it > would be worth changing all those DirectFunctionCall(textin) and > (textout) calls to use the new functions. Is it worthwhile avoiding > the fmgr overhead? I think that's worth doing just on notational clarity grounds. The small cycle savings doesn't excite me, but understanding DirectFunctionCall1(textin, CStringGetDatum(foo)) just involves more different bits of trivia than cstring_to_text(foo). > Last time I looked, the codebase had shifted quite a bit since I > originally wrote the patch. So it probably needs some work to apply > cleanly on the latest sources anyway. Yeah, with wide-impact patches like this you are always going to have that problem. One point though is that we don't have to improve every call site at the same time. I'd be inclined to put in the new functions and hit some representative sample of utils/adt/ files in the first commit, and then incrementally fix other stuff. regards, tom lane
"Brendan Jurd" <direvus@gmail.com> writes: > As discussed on -hackers, I'm trying to get rid of some redundant code > by creating a widely useful set of functions to convert between text > and C string in the backend. Applied with revisions --- the functions were modified as per recent discussion, and I fixed a lot more potential call sites. There are no textout/textin calls left, but I may have missed some places that were doing it the hard way with direct palloc/memcpy manipulations. It might be worth trolling all the VARDATA() references to see if any more are easily replaceable. I notice in particular that xfunc.sgml contains sample C functions to copy and concatenate text. While these aren't directly replaceable with the new functions, I wonder whether we ought to change the examples to make them less certain to break if we ever change text's representation. regards, tom lane
On 26/03/2008, Tom Lane <tgl@sss.pgh.pa.us> wrote: > There are no textout/textin calls left, but I may have missed some > places that were doing it the hard way with direct palloc/memcpy > manipulations. It might be worth trolling all the VARDATA() references > to see if any more are easily replaceable. > I think there are perhaps a dozen such sites, and I hope to submit a cleanup patch that gets rid of these soon. I did come across one site I'm not sure about in utils/adt/xml.c:291 /* * We need a null-terminated string to pass to parse_xml_decl(). Rather * than make a separate copy, make the temporary result one byte bigger * than it needs to be. */ result = palloc(nbytes + 1 + VARHDRSZ); SET_VARSIZE(result, nbytes + VARHDRSZ); memcpy(VARDATA(result), str, nbytes); str = VARDATA(result); str[nbytes] = '\0'; The author seemed pretty sure he didn't want to make a separate copy of the string, so I'm thinking there's not a whole lot I can do to improve this site. > I notice in particular that xfunc.sgml contains sample C functions to > copy and concatenate text. While these aren't directly replaceable > with the new functions, I wonder whether we ought to change the examples > to make them less certain to break if we ever change text's > representation. > Yes, these sample C functions are shown in tutorial/funcs.c and funcs_new.c as well. I agree that the examples could do with changing, but don't have any keen insight on what to change them to.
"Brendan Jurd" <direvus@gmail.com> writes: > On 26/03/2008, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> There are no textout/textin calls left, but I may have missed some >> places that were doing it the hard way with direct palloc/memcpy >> manipulations. It might be worth trolling all the VARDATA() references >> to see if any more are easily replaceable. > I think there are perhaps a dozen such sites, and I hope to submit a > cleanup patch that gets rid of these soon. > I did come across one site I'm not sure about in utils/adt/xml.c:291 I intentionally didn't touch xml.c, nor anyplace that is not dealing in text, even if it happens to be binary-compatible with text. regards, tom lane
On 29/03/2008, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I intentionally didn't touch xml.c, nor anyplace that is not dealing > in text, even if it happens to be binary-compatible with text. > Hmm, okay. My original submission did include a few such changes; for example, in xml_in and xml_out_internal I saw that the conversion between xmltype and cstring was identical to the text conversion, so I went ahead and used the text functions. Feedback upthread suggested that it was okay to go ahead with casting in identical cases. [1] I saw that these changes made it into the commit, so I assumed that it was the right call. If we don't want to meddle with xmltype/bytea/VarChar at all, we'll have to revert those changes, and I'll have to seriously scale back the cleanup patch I was about to post. Regards, BJ [1] http://archives.postgresql.org/pgsql-hackers/2007-09/msg01094.php
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 - -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Sat, Mar 29, 2008 at 5:40 AM, Brendan Jurd wrote: > On 29/03/2008, Tom Lane wrote: > > I intentionally didn't touch xml.c, nor anyplace that is not dealing > > in text, even if it happens to be binary-compatible with text. > > > > Hmm, okay. My original submission did include a few such changes; for > example, in xml_in and xml_out_internal I saw that the conversion > between xmltype and cstring was identical to the text conversion, so I > went ahead and used the text functions. Feedback upthread suggested > that it was okay to go ahead with casting in identical cases. [1] > > I saw that these changes made it into the commit, so I assumed that it > was the right call. > > If we don't want to meddle with xmltype/bytea/VarChar at all, we'll > have to revert those changes, and I'll have to seriously scale back > the cleanup patch I was about to post. Still not sure where we stand on the above. To cast, or not to cast? Cheers, BJ - -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.7 (GNU/Linux) Comment: http://getfiregpg.org iD8DBQFIBaFy5YBsbHkuyV0RAsMmAKDHaEb7aMudKJgVxcf5RRcOtAJ+bwCgivLI 5B3xJze46NGWjEyOpq/TSaY= =BObd - -----END PGP SIGNATURE----- -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.7 (GNU/Linux) Comment: http://getfiregpg.org iD8DBQFIBaIl5YBsbHkuyV0RArDDAJ0QThLXAhzy+NX+2YsF+q4z/sIy1QCeJPiW s/rVns3FFQVP5g9DTOshDfQ= =4Tdh -----END PGP SIGNATURE-----
"Brendan Jurd" <direvus@gmail.com> writes: >> If we don't want to meddle with xmltype/bytea/VarChar at all, we'll >> have to revert those changes, and I'll have to seriously scale back >> the cleanup patch I was about to post. > Still not sure where we stand on the above. To cast, or not to cast? I dunno. I know there was previously some handwaving about representing XML values in some more intelligent fashion than a plain text string, but I have no idea if anyone is likely to do anything about it in the foreseeable future. regards, tom lane
Tom Lane wrote: > "Brendan Jurd" <direvus@gmail.com> writes: > >>> If we don't want to meddle with xmltype/bytea/VarChar at all, we'll >>> have to revert those changes, and I'll have to seriously scale back >>> the cleanup patch I was about to post. >>> > > >> Still not sure where we stand on the above. To cast, or not to cast? >> > > I dunno. I know there was previously some handwaving about representing > XML values in some more intelligent fashion than a plain text string, > but I have no idea if anyone is likely to do anything about it in the > foreseeable future. > > > It seems very unlikely to me. cheers andrew
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Thu, Apr 17, 2008 at 1:16 AM, Tom Lane wrote: > "Brendan Jurd" writes: > > >> If we don't want to meddle with xmltype/bytea/VarChar at all, we'll > >> have to revert those changes, and I'll have to seriously scale back > >> the cleanup patch I was about to post. > > > Still not sure where we stand on the above. To cast, or not to cast? > > I dunno. I know there was previously some handwaving about representing > XML values in some more intelligent fashion than a plain text string, > but I have no idea if anyone is likely to do anything about it in the > foreseeable future. > Well ... if somebody does want to change the representation of xml down the road, he's going to have to touch all the sites where the code converts to and from cstring anyway, right? So the only real difference this will make is that, instead of having to replace four lines of VARDATA/memcpy per site, he'll have to replace a single function call. That doesn't seem like a negative. With that in mind, please find attached my followup patch. It cleans up another 21 sites manually copying between cstring and varlena, for a net reduction of 115 lines of code. I didn't attempt to work through every reference to VARDATA, but I did look at every hit from a `grep -Rn VARDATA . | grep memcpy`. All regression tests passed (contrib tests included) on gentoo. Patch added to commitfest queue. Cheers, BJ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) Comment: http://getfiregpg.org iD8DBQFIBm105YBsbHkuyV0RAmqfAKCfyNyFdciqX4QV81sG9MhPt+KXuACfe694 3d/ICZF6yqV6K20X3TVX+So= =CvRM -----END PGP SIGNATURE-----
Attachment
"Brendan Jurd" <direvus@gmail.com> writes: > Well ... if somebody does want to change the representation of xml > down the road, he's going to have to touch all the sites where the > code converts to and from cstring anyway, right? True. > With that in mind, please find attached my followup patch. It cleans > up another 21 sites manually copying between cstring and varlena, for > a net reduction of 115 lines of code. I applied most of this, but not the parts that were dependent on the assumption that bytea and text are the same. That is unlikely to remain true if we ever get around to putting locale/encoding information into text values. Furthermore it's a horrid type pun, because bytea can contain embedded nulls whereas cstrings can't; you were making unwarranted assumptions about whether, say, cstring_to_text_with_len would allow embedded nulls to go by. And lastly, quite a few of those changes were just plain broken, eg several places in selfuncs.c where you allowed strlen() to be applied to a "bytea converted to cstring". regards, tom lane