Thread: patch: General purpose utility functions used by the JSON data type

patch: General purpose utility functions used by the JSON data type

From
Joseph Adams
Date:
I factored out the general-purpose utility functions in the JSON data
type code into a patch against HEAD.  I have made a few changes to
them since I posted about them earlier (
http://archives.postgresql.org/pgsql-hackers/2010-08/msg00692.php ).

A summary of the utility functions along with some of my own thoughts
about them:

getEnumLabelOids
 * Useful-ometer: ()-----------------------------------o
 * Rationale: There is currently no streamlined way to return a custom
enum value from a PostgreSQL function written in C.  This function
performs a batch lookup of enum OIDs, which can then be cached with
fn_extra.  This should be reasonably efficient, and it's quite elegant
to use.

FN_EXTRA, FN_EXTRA_ALLOC, FN_MCXT
 * Useful-ometer: ()--------------------o
 * Rationale: Using fcinfo->flinfo->fn_extra takes a lot of
boilerplate.  These macros help cut down the boilerplate, and the
comment explains what fn_extra is all about.

getTypeInfo
 * Useful-ometer: ()---------------------------o
 * Rationale: The get_type_io_data "six-fer" function is very
cumbersome to use, since one has to declare all the output variables.
The getTypeInfo puts the results in a structure.  It also performs the
fmgr_info_cxt step, which is a step done after every usage of
get_type_io_data in the PostgreSQL code.
 * Other thoughts: getTypeInfo also retrieves typcategory (and
typispreferred), which is rather ad-hoc.  This benefits the JSON code
because to_json() uses the typcategory to figure out what type of JSON
value to convert something to (for instance, things in category 'A'
become JSON arrays).  Other data types could care less about the
typcategory.  Should getTypeInfo leave that step out?

pg_substring, pg_encoding_substring
 * Useful-ometer: ()-------o
 * Rationale: The JSONPath code uses it / will use it for extracting
substrings, which is probably not a very useful feature (but who am I
to say that).  This function could probably benefit the
text_substring() function in varlena.c , but it would take a bit of
work to ensure it continues to comply with standards.

server_to_utf8, utf8_to_server, text_to_utf8_cstring,
utf8_cstring_to_text, utf8_cstring_to_text_with_len
 * Useful-ometer: ()--------------o
 * Rationale:  The JSON data type operates in UTF-8 rather than the
server encoding because it needs to deal with Unicode escapes, but
individual Unicode characters can't be converted to/from the server
encoding simply and efficiently (as far as I know).  These routines
made the conversions done by the JSON data type vastly simpler, and
they could simplify other data types in the future (XML does a lot of
server<->UTF-8 conversions too).

This patch doesn't include tests .  How would I go about writing them?

I have made the JSON data type built-in, and I will post that patch
shortly (it depends on this one).  The built-in JSON data type uses
all of these utility functions, and the tests for the JSON data type
pass.

Attachment

Re: patch: General purpose utility functions used by the JSON data type

From
Mike Fowler
Date:
On 13/08/10 10:45, Joseph Adams wrote:
> This patch doesn't include tests .  How would I go about writing them?
>
> I have made the JSON data type built-in, and I will post that patch
> shortly (it depends on this one).  The built-in JSON data type uses
> all of these utility functions, and the tests for the JSON data type
> pass.
>    

Joseph,

Most existing data types have a regression SQL script in 
src/test/regress/sql. Using one of them as an example to draw some 
inspiration from, you should be able to write a 'json.sql' script that 
exercises the data type and it's supporting functions. Full instructions 
can be found at http://wiki.postgresql.org/wiki/Regression_test_authoring

Regards,

-- 
Mike Fowler
Registered Linux user: 379787



Re: patch: General purpose utility functions used by the JSON data type

From
Robert Haas
Date:
On Fri, Aug 13, 2010 at 5:45 AM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:
> getEnumLabelOids
>  * Useful-ometer: ()-----------------------------------o
>  * Rationale: There is currently no streamlined way to return a custom
> enum value from a PostgreSQL function written in C.  This function
> performs a batch lookup of enum OIDs, which can then be cached with
> fn_extra.  This should be reasonably efficient, and it's quite elegant
> to use.

It's possible that there might be a contrib module out there someplace
that wants to do this, but it's clearly a waste of time for a core
datatype.  The OIDs are fixed.  Just get rid of the enum altogether
and use the OIDs directly wherever you would have used the enum.  Then
you don't need this.

Incidentally, if we were going to accept this function, I think we'd
need to add some kind of a check to throw an error if any of the
labels can't be mapped onto an Oid.  Otherwise, an error in this area
might lead to difficult-to-find misbehavior.

> FN_EXTRA, FN_EXTRA_ALLOC, FN_MCXT
>  * Useful-ometer: ()--------------------o
>  * Rationale: Using fcinfo->flinfo->fn_extra takes a lot of
> boilerplate.  These macros help cut down the boilerplate, and the
> comment explains what fn_extra is all about.

I think this is not a good idea.  Macros that use local variables
under the covers make it hard for new contributors to read the code
and understand what it does.  It's only worth doing if it saves a lot
of typing, and this doesn't.  If we were going to do this, the right
thing for your patch to do would be to update every instance of this
coding pattern in our source base, so that people who copy those
examples in the future do it the new way.  But I think there's no real
advantage in that: it would complicate back-patching future bug fixes
for no particularly compelling reason.

> getTypeInfo
>  * Useful-ometer: ()---------------------------o
>  * Rationale: The get_type_io_data "six-fer" function is very
> cumbersome to use, since one has to declare all the output variables.
> The getTypeInfo puts the results in a structure.  It also performs the
> fmgr_info_cxt step, which is a step done after every usage of
> get_type_io_data in the PostgreSQL code.
>  * Other thoughts: getTypeInfo also retrieves typcategory (and
> typispreferred), which is rather ad-hoc.  This benefits the JSON code
> because to_json() uses the typcategory to figure out what type of JSON
> value to convert something to (for instance, things in category 'A'
> become JSON arrays).  Other data types could care less about the
> typcategory.  Should getTypeInfo leave that step out?

Well, again, you have to decide whether this is a function that you're
adding just for the JSON code or whether it's really a general purpose
utility function.  If you want it to be a general purpose utility
function, you need to change all the callers that could potentially
leverage it.  If it's JSON specific, put it in the JSON code.  It
might be simpler to just declare the variables.

> pg_substring, pg_encoding_substring
>  * Useful-ometer: ()-------o
>  * Rationale: The JSONPath code uses it / will use it for extracting
> substrings, which is probably not a very useful feature (but who am I
> to say that).  This function could probably benefit the
> text_substring() function in varlena.c , but it would take a bit of
> work to ensure it continues to comply with standards.

I'm more positive about this idea.  If you can make this generally
useful, I'd encourage you to do that.  On a random coding style note,
I see that you have two copies of the following code, which can fairly
obviously be written in a single line rather than five, assuming it's
actually safe.

+               if (sub_end + len > e)
+               {
+                       Assert(false);          /* Clipped multibyte
character */
+                       break;
+               }


> server_to_utf8, utf8_to_server, text_to_utf8_cstring,
> utf8_cstring_to_text, utf8_cstring_to_text_with_len
>  * Useful-ometer: ()--------------o
>  * Rationale:  The JSON data type operates in UTF-8 rather than the
> server encoding because it needs to deal with Unicode escapes, but
> individual Unicode characters can't be converted to/from the server
> encoding simply and efficiently (as far as I know).  These routines
> made the conversions done by the JSON data type vastly simpler, and
> they could simplify other data types in the future (XML does a lot of
> server<->UTF-8 conversions too).

Sounds interesting.  But again, you would need to modify the XML code
to use these also, so that we can clearly see that this is reusable
code, and actually reuse it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: patch: General purpose utility functions used by the JSON data type

From
Joseph Adams
Date:
On Fri, Aug 13, 2010 at 10:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Aug 13, 2010 at 5:45 AM, Joseph Adams
> <joeyadams3.14159@gmail.com> wrote:
>> getEnumLabelOids
>>  * Useful-ometer: ()-----------------------------------o
>>  * Rationale: There is currently no streamlined way to return a custom
>> enum value from a PostgreSQL function written in C.  This function
>> performs a batch lookup of enum OIDs, which can then be cached with
>> fn_extra.  This should be reasonably efficient, and it's quite elegant
>> to use.
>
> It's possible that there might be a contrib module out there someplace
> that wants to do this, but it's clearly a waste of time for a core
> datatype.  The OIDs are fixed.  Just get rid of the enum altogether
> and use the OIDs directly wherever you would have used the enum.  Then
> you don't need this.

Indeed, a built-in JSON data type will certainly not need it.
However, users may want to return enums from procedures written in C,
and this function provides a way to do it.

> Incidentally, if we were going to accept this function, I think we'd
> need to add some kind of a check to throw an error if any of the
> labels can't be mapped onto an Oid.  Otherwise, an error in this area
> might lead to difficult-to-find misbehavior.

I agree.  Perhaps an ereport(ERROR, ...) would be better, since it
would not be hard for a user to cause an enum mapping error (even in a
production database) by updating a stored procedure in C but not
updating the corresponding enum (or vice versa, if enum labels are
renamed).

>> FN_EXTRA, FN_EXTRA_ALLOC, FN_MCXT
>>  * Useful-ometer: ()--------------------o
>>  * Rationale: Using fcinfo->flinfo->fn_extra takes a lot of
>> boilerplate.  These macros help cut down the boilerplate, and the
>> comment explains what fn_extra is all about.
>
> I think this is not a good idea.  Macros that use local variables
> under the covers make it hard for new contributors to read the code
> and understand what it does.  It's only worth doing if it saves a lot
> of typing, and this doesn't.  If we were going to do this, the right
> thing for your patch to do would be to update every instance of this
> coding pattern in our source base, so that people who copy those
> examples in the future do it the new way.  But I think there's no real
> advantage in that: it would complicate back-patching future bug fixes
> for no particularly compelling reason.

Fair enough.  Perhaps the comment about FN_EXTRA (which explains
fn_extra in more detail) could be reworded (to talk about using
fcinfo->flinfo->fn_extra manually) and included in the documentation
at xfunc-c.html .  fn_extra currently only gets a passing mention in
the discussion about set-returning functions.

>> pg_substring, pg_encoding_substring
>>  * Useful-ometer: ()-------o
>>  * Rationale: The JSONPath code uses it / will use it for extracting
>> substrings, which is probably not a very useful feature (but who am I
>> to say that).  This function could probably benefit the
>> text_substring() function in varlena.c , but it would take a bit of
>> work to ensure it continues to comply with standards.
>
> I'm more positive about this idea.  If you can make this generally
> useful, I'd encourage you to do that.  On a random coding style note,
> I see that you have two copies of the following code, which can fairly
> obviously be written in a single line rather than five, assuming it's
> actually safe.
>
> +               if (sub_end + len > e)
> +               {
> +                       Assert(false);          /* Clipped multibyte
> character */
> +                       break;
> +               }

If I simply say Assert(sub_end + len <= e), the function will yield a
range hanging off the edge of the input string (out of bounds).  The
five lines include a safeguard against that when assertion checking is
off.  This could happen if the input string has a clipped multibyte
character at the end.  Perhaps I should just drop the assertions and
make it so if there's a clipped character at the end, it'll be
ignored, no big deal.


Joey Adams


Re: patch: General purpose utility functions used by the JSON data type

From
Robert Haas
Date:
On Fri, Aug 13, 2010 at 12:55 PM, Joseph Adams
<joeyadams3.14159@gmail.com> wrote:
> Indeed, a built-in JSON data type will certainly not need it.
> However, users may want to return enums from procedures written in C,
> and this function provides a way to do it.

Yeah, but I can't see accepting it on speculation.  We'll add it if
and when it's clear that it will be generally useful.

>> Incidentally, if we were going to accept this function, I think we'd
>> need to add some kind of a check to throw an error if any of the
>> labels can't be mapped onto an Oid.  Otherwise, an error in this area
>> might lead to difficult-to-find misbehavior.
>
> I agree.  Perhaps an ereport(ERROR, ...) would be better, since it
> would not be hard for a user to cause an enum mapping error (even in a
> production database) by updating a stored procedure in C but not
> updating the corresponding enum (or vice versa, if enum labels are
> renamed).

Right...

> Fair enough.  Perhaps the comment about FN_EXTRA (which explains
> fn_extra in more detail) could be reworded (to talk about using
> fcinfo->flinfo->fn_extra manually) and included in the documentation
> at xfunc-c.html .  fn_extra currently only gets a passing mention in
> the discussion about set-returning functions.

Feel to propose a patch to that comment.

>>> pg_substring, pg_encoding_substring
>>>  * Useful-ometer: ()-------o
>>>  * Rationale: The JSONPath code uses it / will use it for extracting
>>> substrings, which is probably not a very useful feature (but who am I
>>> to say that).  This function could probably benefit the
>>> text_substring() function in varlena.c , but it would take a bit of
>>> work to ensure it continues to comply with standards.
>>
>> I'm more positive about this idea.  If you can make this generally
>> useful, I'd encourage you to do that.  On a random coding style note,
>> I see that you have two copies of the following code, which can fairly
>> obviously be written in a single line rather than five, assuming it's
>> actually safe.
>>
>> +               if (sub_end + len > e)
>> +               {
>> +                       Assert(false);          /* Clipped multibyte
>> character */
>> +                       break;
>> +               }
>
> If I simply say Assert(sub_end + len <= e), the function will yield a
> range hanging off the edge of the input string (out of bounds).  The
> five lines include a safeguard against that when assertion checking is
> off.  This could happen if the input string has a clipped multibyte
> character at the end.  Perhaps I should just drop the assertions and
> make it so if there's a clipped character at the end, it'll be
> ignored, no big deal.

I think maybe what you want is ereport(ERROR).  It should never be
possible for user action to trip an Assert(); Assert() is only to
catch coding mistakes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: patch: General purpose utility functions used by the JSON data type

From
Tom Lane
Date:
Joseph Adams <joeyadams3.14159@gmail.com> writes:
> On Fri, Aug 13, 2010 at 10:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> +               if (sub_end + len > e)
>> +               {
>> +                       Assert(false);          /* Clipped multibyte character */
>> +                       break;
>> +               }

> If I simply say Assert(sub_end + len <= e), the function will yield a
> range hanging off the edge of the input string (out of bounds).  The
> five lines include a safeguard against that when assertion checking is
> off.

If you think it is actually likely to happen in practice, then an Assert
is 100% inappropriate.  Throw an actual error instead.  Code that has
provisions for continuing after an Assert failure is wrong by definition.
        regards, tom lane


Re: patch: General purpose utility functions used by the JSON data type

From
David Fetter
Date:
On Fri, Aug 13, 2010 at 12:59:48PM -0400, Robert Haas wrote:
> On Fri, Aug 13, 2010 at 12:55 PM, Joseph Adams
> <joeyadams3.14159@gmail.com> wrote:
> > Indeed, a built-in JSON data type will certainly not need it.
> > However, users may want to return enums from procedures written in
> > C, and this function provides a way to do it.
> 
> Yeah, but I can't see accepting it on speculation.  We'll add it if
> and when it's clear that it will be generally useful.

Please pardon me for jumping in here, but one of the things people
really love about PostgreSQL is that when you reach for something,
it's frequently "just there."

As we're improving enums (allowing them to be altered easily after
creation, etc.), it seems reasonable to provide ways to return them
from all kinds of PLs, including making this easier in C.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: patch: General purpose utility functions used by the JSON data type

From
Robert Haas
Date:
On Fri, Aug 13, 2010 at 1:05 PM, David Fetter <david@fetter.org> wrote:
> On Fri, Aug 13, 2010 at 12:59:48PM -0400, Robert Haas wrote:
>> On Fri, Aug 13, 2010 at 12:55 PM, Joseph Adams
>> <joeyadams3.14159@gmail.com> wrote:
>> > Indeed, a built-in JSON data type will certainly not need it.
>> > However, users may want to return enums from procedures written in
>> > C, and this function provides a way to do it.
>>
>> Yeah, but I can't see accepting it on speculation.  We'll add it if
>> and when it's clear that it will be generally useful.
>
> Please pardon me for jumping in here, but one of the things people
> really love about PostgreSQL is that when you reach for something,
> it's frequently "just there."
>
> As we're improving enums (allowing them to be altered easily after
> creation, etc.), it seems reasonable to provide ways to return them
> from all kinds of PLs, including making this easier in C.

Maybe so, but it's not clear the interface that Joseph implemented is
the one everyone wants...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: patch: General purpose utility functions used by the JSON data type

From
David Fetter
Date:
On Fri, Aug 13, 2010 at 01:33:06PM -0400, Robert Haas wrote:
> On Fri, Aug 13, 2010 at 1:05 PM, David Fetter <david@fetter.org> wrote:
> > On Fri, Aug 13, 2010 at 12:59:48PM -0400, Robert Haas wrote:
> >> On Fri, Aug 13, 2010 at 12:55 PM, Joseph Adams
> >> <joeyadams3.14159@gmail.com> wrote:
> >> > Indeed, a built-in JSON data type will certainly not need it.
> >> > However, users may want to return enums from procedures written in
> >> > C, and this function provides a way to do it.
> >>
> >> Yeah, but I can't see accepting it on speculation.  We'll add it if
> >> and when it's clear that it will be generally useful.
> >
> > Please pardon me for jumping in here, but one of the things people
> > really love about PostgreSQL is that when you reach for something,
> > it's frequently "just there."
> >
> > As we're improving enums (allowing them to be altered easily after
> > creation, etc.), it seems reasonable to provide ways to return them
> > from all kinds of PLs, including making this easier in C.
> 
> Maybe so, but it's not clear the interface that Joseph implemented is
> the one everyone wants...

Fair enough.  What's the interface now in a nutshell?  Lack of
nutshells might well mean the interface needs more work...

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: patch: General purpose utility functions used by the JSON data type

From
Joseph Adams
Date:
On Fri, Aug 13, 2010 at 2:02 PM, David Fetter <david@fetter.org> wrote:
> On Fri, Aug 13, 2010 at 01:33:06PM -0400, Robert Haas wrote:
>> Maybe so, but it's not clear the interface that Joseph implemented is
>> the one everyone wants...
>
> Fair enough.  What's the interface now in a nutshell?  Lack of
> nutshells might well mean the interface needs more work...

Suppose we have:

-- SQL --
CREATE TYPE colors AS ENUM ('red', 'green', 'blue');

-- C --
enum Colors {RED, GREEN, BLUE, COLOR_COUNT};

In a nutshell:

* Define an enum mapping like so:

static EnumLabel enum_labels[COLOR_COUNT] =
{{COLOR_RED, "red"},{COLOR_GREEN, "green"},{COLOR_BLUE, "blue"}
};

* Get the OIDs of the enum labels:

Oid oids[COLOR_COUNT];
getEnumLabelOids("colors", enum_labels, oids, COLOR_COUNT);

* Convert C enum values to OIDs using the table:

PG_RETURN_OID(oids[COLOR_BLUE]);

A caller would typically cache the Oid table with fn_extra.

Currently, getEnumLabelOids puts InvalidOid for labels that could not
successfully be looked up.  This will probably be changed to
ereport(ERROR) instead.

Also, I'm thinking that getEnumLabelOids should be renamed to
something that's easier to remember.  Maybe enum_oids or
get_enum_oids.


Joey Adams


Re: patch: General purpose utility functions used by the JSON data type

From
Andrew Dunstan
Date:

On 08/13/2010 03:46 PM, Joseph Adams wrote:
> On Fri, Aug 13, 2010 at 2:02 PM, David Fetter<david@fetter.org>  wrote:
>> On Fri, Aug 13, 2010 at 01:33:06PM -0400, Robert Haas wrote:
>>> Maybe so, but it's not clear the interface that Joseph implemented is
>>> the one everyone wants...
>> Fair enough.  What's the interface now in a nutshell?  Lack of
>> nutshells might well mean the interface needs more work...
> Suppose we have:
>
> -- SQL --
> CREATE TYPE colors AS ENUM ('red', 'green', 'blue');
>
> -- C --
> enum Colors {RED, GREEN, BLUE, COLOR_COUNT};
>
> In a nutshell:
>
> * Define an enum mapping like so:
>
> static EnumLabel enum_labels[COLOR_COUNT] =
> {
>     {COLOR_RED, "red"},
>     {COLOR_GREEN, "green"},
>     {COLOR_BLUE, "blue"}
> };
>
> * Get the OIDs of the enum labels:
>
> Oid oids[COLOR_COUNT];
> getEnumLabelOids("colors", enum_labels, oids, COLOR_COUNT);
>
> * Convert C enum values to OIDs using the table:
>
> PG_RETURN_OID(oids[COLOR_BLUE]);
>
> A caller would typically cache the Oid table with fn_extra.
>
> Currently, getEnumLabelOids puts InvalidOid for labels that could not
> successfully be looked up.  This will probably be changed to
> ereport(ERROR) instead.
>
> Also, I'm thinking that getEnumLabelOids should be renamed to
> something that's easier to remember.  Maybe enum_oids or
> get_enum_oids.
>
>
>


I should point out that I'm hoping to present a patch in a few weeks for 
extensible enums, along the lines previously discussed on this list. I 
have only just noticed this thread or I would have jumped in earlier.

Maybe what I'm doing won't impact much on this - it does cache enum oids 
and their associated sort orders in the function context, but lazily - 
the theory being that a retail comparison should not have to look up the 
whole of a large enum set just to get two sort order values.

cheers

andrew