Thread: patch: General purpose utility functions used by the JSON data type
I factored out the general-purpose utility functions in the JSON data type code into a patch against HEAD. I have made a few changes to them since I posted about them earlier ( http://archives.postgresql.org/pgsql-hackers/2010-08/msg00692.php ). A summary of the utility functions along with some of my own thoughts about them: getEnumLabelOids * Useful-ometer: ()-----------------------------------o * Rationale: There is currently no streamlined way to return a custom enum value from a PostgreSQL function written in C. This function performs a batch lookup of enum OIDs, which can then be cached with fn_extra. This should be reasonably efficient, and it's quite elegant to use. FN_EXTRA, FN_EXTRA_ALLOC, FN_MCXT * Useful-ometer: ()--------------------o * Rationale: Using fcinfo->flinfo->fn_extra takes a lot of boilerplate. These macros help cut down the boilerplate, and the comment explains what fn_extra is all about. getTypeInfo * Useful-ometer: ()---------------------------o * Rationale: The get_type_io_data "six-fer" function is very cumbersome to use, since one has to declare all the output variables. The getTypeInfo puts the results in a structure. It also performs the fmgr_info_cxt step, which is a step done after every usage of get_type_io_data in the PostgreSQL code. * Other thoughts: getTypeInfo also retrieves typcategory (and typispreferred), which is rather ad-hoc. This benefits the JSON code because to_json() uses the typcategory to figure out what type of JSON value to convert something to (for instance, things in category 'A' become JSON arrays). Other data types could care less about the typcategory. Should getTypeInfo leave that step out? pg_substring, pg_encoding_substring * Useful-ometer: ()-------o * Rationale: The JSONPath code uses it / will use it for extracting substrings, which is probably not a very useful feature (but who am I to say that). This function could probably benefit the text_substring() function in varlena.c , but it would take a bit of work to ensure it continues to comply with standards. server_to_utf8, utf8_to_server, text_to_utf8_cstring, utf8_cstring_to_text, utf8_cstring_to_text_with_len * Useful-ometer: ()--------------o * Rationale: The JSON data type operates in UTF-8 rather than the server encoding because it needs to deal with Unicode escapes, but individual Unicode characters can't be converted to/from the server encoding simply and efficiently (as far as I know). These routines made the conversions done by the JSON data type vastly simpler, and they could simplify other data types in the future (XML does a lot of server<->UTF-8 conversions too). This patch doesn't include tests . How would I go about writing them? I have made the JSON data type built-in, and I will post that patch shortly (it depends on this one). The built-in JSON data type uses all of these utility functions, and the tests for the JSON data type pass.
Attachment
On 13/08/10 10:45, Joseph Adams wrote: > This patch doesn't include tests . How would I go about writing them? > > I have made the JSON data type built-in, and I will post that patch > shortly (it depends on this one). The built-in JSON data type uses > all of these utility functions, and the tests for the JSON data type > pass. > Joseph, Most existing data types have a regression SQL script in src/test/regress/sql. Using one of them as an example to draw some inspiration from, you should be able to write a 'json.sql' script that exercises the data type and it's supporting functions. Full instructions can be found at http://wiki.postgresql.org/wiki/Regression_test_authoring Regards, -- Mike Fowler Registered Linux user: 379787
On Fri, Aug 13, 2010 at 5:45 AM, Joseph Adams <joeyadams3.14159@gmail.com> wrote: > getEnumLabelOids > * Useful-ometer: ()-----------------------------------o > * Rationale: There is currently no streamlined way to return a custom > enum value from a PostgreSQL function written in C. This function > performs a batch lookup of enum OIDs, which can then be cached with > fn_extra. This should be reasonably efficient, and it's quite elegant > to use. It's possible that there might be a contrib module out there someplace that wants to do this, but it's clearly a waste of time for a core datatype. The OIDs are fixed. Just get rid of the enum altogether and use the OIDs directly wherever you would have used the enum. Then you don't need this. Incidentally, if we were going to accept this function, I think we'd need to add some kind of a check to throw an error if any of the labels can't be mapped onto an Oid. Otherwise, an error in this area might lead to difficult-to-find misbehavior. > FN_EXTRA, FN_EXTRA_ALLOC, FN_MCXT > * Useful-ometer: ()--------------------o > * Rationale: Using fcinfo->flinfo->fn_extra takes a lot of > boilerplate. These macros help cut down the boilerplate, and the > comment explains what fn_extra is all about. I think this is not a good idea. Macros that use local variables under the covers make it hard for new contributors to read the code and understand what it does. It's only worth doing if it saves a lot of typing, and this doesn't. If we were going to do this, the right thing for your patch to do would be to update every instance of this coding pattern in our source base, so that people who copy those examples in the future do it the new way. But I think there's no real advantage in that: it would complicate back-patching future bug fixes for no particularly compelling reason. > getTypeInfo > * Useful-ometer: ()---------------------------o > * Rationale: The get_type_io_data "six-fer" function is very > cumbersome to use, since one has to declare all the output variables. > The getTypeInfo puts the results in a structure. It also performs the > fmgr_info_cxt step, which is a step done after every usage of > get_type_io_data in the PostgreSQL code. > * Other thoughts: getTypeInfo also retrieves typcategory (and > typispreferred), which is rather ad-hoc. This benefits the JSON code > because to_json() uses the typcategory to figure out what type of JSON > value to convert something to (for instance, things in category 'A' > become JSON arrays). Other data types could care less about the > typcategory. Should getTypeInfo leave that step out? Well, again, you have to decide whether this is a function that you're adding just for the JSON code or whether it's really a general purpose utility function. If you want it to be a general purpose utility function, you need to change all the callers that could potentially leverage it. If it's JSON specific, put it in the JSON code. It might be simpler to just declare the variables. > pg_substring, pg_encoding_substring > * Useful-ometer: ()-------o > * Rationale: The JSONPath code uses it / will use it for extracting > substrings, which is probably not a very useful feature (but who am I > to say that). This function could probably benefit the > text_substring() function in varlena.c , but it would take a bit of > work to ensure it continues to comply with standards. I'm more positive about this idea. If you can make this generally useful, I'd encourage you to do that. On a random coding style note, I see that you have two copies of the following code, which can fairly obviously be written in a single line rather than five, assuming it's actually safe. + if (sub_end + len > e) + { + Assert(false); /* Clipped multibyte character */ + break; + } > server_to_utf8, utf8_to_server, text_to_utf8_cstring, > utf8_cstring_to_text, utf8_cstring_to_text_with_len > * Useful-ometer: ()--------------o > * Rationale: The JSON data type operates in UTF-8 rather than the > server encoding because it needs to deal with Unicode escapes, but > individual Unicode characters can't be converted to/from the server > encoding simply and efficiently (as far as I know). These routines > made the conversions done by the JSON data type vastly simpler, and > they could simplify other data types in the future (XML does a lot of > server<->UTF-8 conversions too). Sounds interesting. But again, you would need to modify the XML code to use these also, so that we can clearly see that this is reusable code, and actually reuse it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Fri, Aug 13, 2010 at 10:46 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Aug 13, 2010 at 5:45 AM, Joseph Adams > <joeyadams3.14159@gmail.com> wrote: >> getEnumLabelOids >> * Useful-ometer: ()-----------------------------------o >> * Rationale: There is currently no streamlined way to return a custom >> enum value from a PostgreSQL function written in C. This function >> performs a batch lookup of enum OIDs, which can then be cached with >> fn_extra. This should be reasonably efficient, and it's quite elegant >> to use. > > It's possible that there might be a contrib module out there someplace > that wants to do this, but it's clearly a waste of time for a core > datatype. The OIDs are fixed. Just get rid of the enum altogether > and use the OIDs directly wherever you would have used the enum. Then > you don't need this. Indeed, a built-in JSON data type will certainly not need it. However, users may want to return enums from procedures written in C, and this function provides a way to do it. > Incidentally, if we were going to accept this function, I think we'd > need to add some kind of a check to throw an error if any of the > labels can't be mapped onto an Oid. Otherwise, an error in this area > might lead to difficult-to-find misbehavior. I agree. Perhaps an ereport(ERROR, ...) would be better, since it would not be hard for a user to cause an enum mapping error (even in a production database) by updating a stored procedure in C but not updating the corresponding enum (or vice versa, if enum labels are renamed). >> FN_EXTRA, FN_EXTRA_ALLOC, FN_MCXT >> * Useful-ometer: ()--------------------o >> * Rationale: Using fcinfo->flinfo->fn_extra takes a lot of >> boilerplate. These macros help cut down the boilerplate, and the >> comment explains what fn_extra is all about. > > I think this is not a good idea. Macros that use local variables > under the covers make it hard for new contributors to read the code > and understand what it does. It's only worth doing if it saves a lot > of typing, and this doesn't. If we were going to do this, the right > thing for your patch to do would be to update every instance of this > coding pattern in our source base, so that people who copy those > examples in the future do it the new way. But I think there's no real > advantage in that: it would complicate back-patching future bug fixes > for no particularly compelling reason. Fair enough. Perhaps the comment about FN_EXTRA (which explains fn_extra in more detail) could be reworded (to talk about using fcinfo->flinfo->fn_extra manually) and included in the documentation at xfunc-c.html . fn_extra currently only gets a passing mention in the discussion about set-returning functions. >> pg_substring, pg_encoding_substring >> * Useful-ometer: ()-------o >> * Rationale: The JSONPath code uses it / will use it for extracting >> substrings, which is probably not a very useful feature (but who am I >> to say that). This function could probably benefit the >> text_substring() function in varlena.c , but it would take a bit of >> work to ensure it continues to comply with standards. > > I'm more positive about this idea. If you can make this generally > useful, I'd encourage you to do that. On a random coding style note, > I see that you have two copies of the following code, which can fairly > obviously be written in a single line rather than five, assuming it's > actually safe. > > + if (sub_end + len > e) > + { > + Assert(false); /* Clipped multibyte > character */ > + break; > + } If I simply say Assert(sub_end + len <= e), the function will yield a range hanging off the edge of the input string (out of bounds). The five lines include a safeguard against that when assertion checking is off. This could happen if the input string has a clipped multibyte character at the end. Perhaps I should just drop the assertions and make it so if there's a clipped character at the end, it'll be ignored, no big deal. Joey Adams
On Fri, Aug 13, 2010 at 12:55 PM, Joseph Adams <joeyadams3.14159@gmail.com> wrote: > Indeed, a built-in JSON data type will certainly not need it. > However, users may want to return enums from procedures written in C, > and this function provides a way to do it. Yeah, but I can't see accepting it on speculation. We'll add it if and when it's clear that it will be generally useful. >> Incidentally, if we were going to accept this function, I think we'd >> need to add some kind of a check to throw an error if any of the >> labels can't be mapped onto an Oid. Otherwise, an error in this area >> might lead to difficult-to-find misbehavior. > > I agree. Perhaps an ereport(ERROR, ...) would be better, since it > would not be hard for a user to cause an enum mapping error (even in a > production database) by updating a stored procedure in C but not > updating the corresponding enum (or vice versa, if enum labels are > renamed). Right... > Fair enough. Perhaps the comment about FN_EXTRA (which explains > fn_extra in more detail) could be reworded (to talk about using > fcinfo->flinfo->fn_extra manually) and included in the documentation > at xfunc-c.html . fn_extra currently only gets a passing mention in > the discussion about set-returning functions. Feel to propose a patch to that comment. >>> pg_substring, pg_encoding_substring >>> * Useful-ometer: ()-------o >>> * Rationale: The JSONPath code uses it / will use it for extracting >>> substrings, which is probably not a very useful feature (but who am I >>> to say that). This function could probably benefit the >>> text_substring() function in varlena.c , but it would take a bit of >>> work to ensure it continues to comply with standards. >> >> I'm more positive about this idea. If you can make this generally >> useful, I'd encourage you to do that. On a random coding style note, >> I see that you have two copies of the following code, which can fairly >> obviously be written in a single line rather than five, assuming it's >> actually safe. >> >> + if (sub_end + len > e) >> + { >> + Assert(false); /* Clipped multibyte >> character */ >> + break; >> + } > > If I simply say Assert(sub_end + len <= e), the function will yield a > range hanging off the edge of the input string (out of bounds). The > five lines include a safeguard against that when assertion checking is > off. This could happen if the input string has a clipped multibyte > character at the end. Perhaps I should just drop the assertions and > make it so if there's a clipped character at the end, it'll be > ignored, no big deal. I think maybe what you want is ereport(ERROR). It should never be possible for user action to trip an Assert(); Assert() is only to catch coding mistakes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Joseph Adams <joeyadams3.14159@gmail.com> writes: > On Fri, Aug 13, 2010 at 10:46 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> + if (sub_end + len > e) >> + { >> + Assert(false); /* Clipped multibyte character */ >> + break; >> + } > If I simply say Assert(sub_end + len <= e), the function will yield a > range hanging off the edge of the input string (out of bounds). The > five lines include a safeguard against that when assertion checking is > off. If you think it is actually likely to happen in practice, then an Assert is 100% inappropriate. Throw an actual error instead. Code that has provisions for continuing after an Assert failure is wrong by definition. regards, tom lane
On Fri, Aug 13, 2010 at 12:59:48PM -0400, Robert Haas wrote: > On Fri, Aug 13, 2010 at 12:55 PM, Joseph Adams > <joeyadams3.14159@gmail.com> wrote: > > Indeed, a built-in JSON data type will certainly not need it. > > However, users may want to return enums from procedures written in > > C, and this function provides a way to do it. > > Yeah, but I can't see accepting it on speculation. We'll add it if > and when it's clear that it will be generally useful. Please pardon me for jumping in here, but one of the things people really love about PostgreSQL is that when you reach for something, it's frequently "just there." As we're improving enums (allowing them to be altered easily after creation, etc.), it seems reasonable to provide ways to return them from all kinds of PLs, including making this easier in C. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Fri, Aug 13, 2010 at 1:05 PM, David Fetter <david@fetter.org> wrote: > On Fri, Aug 13, 2010 at 12:59:48PM -0400, Robert Haas wrote: >> On Fri, Aug 13, 2010 at 12:55 PM, Joseph Adams >> <joeyadams3.14159@gmail.com> wrote: >> > Indeed, a built-in JSON data type will certainly not need it. >> > However, users may want to return enums from procedures written in >> > C, and this function provides a way to do it. >> >> Yeah, but I can't see accepting it on speculation. We'll add it if >> and when it's clear that it will be generally useful. > > Please pardon me for jumping in here, but one of the things people > really love about PostgreSQL is that when you reach for something, > it's frequently "just there." > > As we're improving enums (allowing them to be altered easily after > creation, etc.), it seems reasonable to provide ways to return them > from all kinds of PLs, including making this easier in C. Maybe so, but it's not clear the interface that Joseph implemented is the one everyone wants... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
On Fri, Aug 13, 2010 at 01:33:06PM -0400, Robert Haas wrote: > On Fri, Aug 13, 2010 at 1:05 PM, David Fetter <david@fetter.org> wrote: > > On Fri, Aug 13, 2010 at 12:59:48PM -0400, Robert Haas wrote: > >> On Fri, Aug 13, 2010 at 12:55 PM, Joseph Adams > >> <joeyadams3.14159@gmail.com> wrote: > >> > Indeed, a built-in JSON data type will certainly not need it. > >> > However, users may want to return enums from procedures written in > >> > C, and this function provides a way to do it. > >> > >> Yeah, but I can't see accepting it on speculation. We'll add it if > >> and when it's clear that it will be generally useful. > > > > Please pardon me for jumping in here, but one of the things people > > really love about PostgreSQL is that when you reach for something, > > it's frequently "just there." > > > > As we're improving enums (allowing them to be altered easily after > > creation, etc.), it seems reasonable to provide ways to return them > > from all kinds of PLs, including making this easier in C. > > Maybe so, but it's not clear the interface that Joseph implemented is > the one everyone wants... Fair enough. What's the interface now in a nutshell? Lack of nutshells might well mean the interface needs more work... Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Fri, Aug 13, 2010 at 2:02 PM, David Fetter <david@fetter.org> wrote: > On Fri, Aug 13, 2010 at 01:33:06PM -0400, Robert Haas wrote: >> Maybe so, but it's not clear the interface that Joseph implemented is >> the one everyone wants... > > Fair enough. What's the interface now in a nutshell? Lack of > nutshells might well mean the interface needs more work... Suppose we have: -- SQL -- CREATE TYPE colors AS ENUM ('red', 'green', 'blue'); -- C -- enum Colors {RED, GREEN, BLUE, COLOR_COUNT}; In a nutshell: * Define an enum mapping like so: static EnumLabel enum_labels[COLOR_COUNT] = {{COLOR_RED, "red"},{COLOR_GREEN, "green"},{COLOR_BLUE, "blue"} }; * Get the OIDs of the enum labels: Oid oids[COLOR_COUNT]; getEnumLabelOids("colors", enum_labels, oids, COLOR_COUNT); * Convert C enum values to OIDs using the table: PG_RETURN_OID(oids[COLOR_BLUE]); A caller would typically cache the Oid table with fn_extra. Currently, getEnumLabelOids puts InvalidOid for labels that could not successfully be looked up. This will probably be changed to ereport(ERROR) instead. Also, I'm thinking that getEnumLabelOids should be renamed to something that's easier to remember. Maybe enum_oids or get_enum_oids. Joey Adams
On 08/13/2010 03:46 PM, Joseph Adams wrote: > On Fri, Aug 13, 2010 at 2:02 PM, David Fetter<david@fetter.org> wrote: >> On Fri, Aug 13, 2010 at 01:33:06PM -0400, Robert Haas wrote: >>> Maybe so, but it's not clear the interface that Joseph implemented is >>> the one everyone wants... >> Fair enough. What's the interface now in a nutshell? Lack of >> nutshells might well mean the interface needs more work... > Suppose we have: > > -- SQL -- > CREATE TYPE colors AS ENUM ('red', 'green', 'blue'); > > -- C -- > enum Colors {RED, GREEN, BLUE, COLOR_COUNT}; > > In a nutshell: > > * Define an enum mapping like so: > > static EnumLabel enum_labels[COLOR_COUNT] = > { > {COLOR_RED, "red"}, > {COLOR_GREEN, "green"}, > {COLOR_BLUE, "blue"} > }; > > * Get the OIDs of the enum labels: > > Oid oids[COLOR_COUNT]; > getEnumLabelOids("colors", enum_labels, oids, COLOR_COUNT); > > * Convert C enum values to OIDs using the table: > > PG_RETURN_OID(oids[COLOR_BLUE]); > > A caller would typically cache the Oid table with fn_extra. > > Currently, getEnumLabelOids puts InvalidOid for labels that could not > successfully be looked up. This will probably be changed to > ereport(ERROR) instead. > > Also, I'm thinking that getEnumLabelOids should be renamed to > something that's easier to remember. Maybe enum_oids or > get_enum_oids. > > > I should point out that I'm hoping to present a patch in a few weeks for extensible enums, along the lines previously discussed on this list. I have only just noticed this thread or I would have jumped in earlier. Maybe what I'm doing won't impact much on this - it does cache enum oids and their associated sort orders in the function context, but lazily - the theory being that a retail comparison should not have to look up the whole of a large enum set just to get two sort order values. cheers andrew