Re: jsonb and nested hstore - Mailing list pgsql-hackers

From Andres Freund
Subject Re: jsonb and nested hstore
Date
Msg-id 20140201222045.GC5930@awork2.anarazel.de
Whole thread Raw
In response to Re: jsonb and nested hstore  (Andrew Dunstan <andrew@dunslane.net>)
Responses Re: jsonb and nested hstore
Re: jsonb and nested hstore
Re: jsonb and nested hstore
adt Makefile, was Re: jsonb and nested hstore
List pgsql-hackers
On 2014-01-30 14:07:42 -0500, Andrew Dunstan wrote:
> +  <para id="functions-json-table">
> +   <xref linkend="functions-json-creation-table"> shows the functions that are
> +   available for creating <type>json</type> values.
> +   (see <xref linkend="datatype-json">)
>    </para>
>  
> -  <table id="functions-json-table">
> -    <title>JSON Support Functions</title>
> +  <indexterm>
> +   <primary>array_to_json</primary>
> +  </indexterm>
> +  <indexterm>
> +   <primary>row_to_json</primary>
> +  </indexterm>
> +  <indexterm>
> +   <primary>to_json</primary>
> +  </indexterm>
> +  <indexterm>
> +   <primary>json_build_array</primary>
> +  </indexterm>
> +  <indexterm>
> +   <primary>json_build_object</primary>
> +  </indexterm>
> +  <indexterm>
> +   <primary>json_object</primary>
> +  </indexterm>

Hm, why are you collecting the indexterms at the top in the contrast to
the previous way of collecting them at the point of documentation?

> diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
> index 1ae9fa0..fd93d9b 100644
> --- a/src/backend/utils/adt/Makefile
> +++ b/src/backend/utils/adt/Makefile
> @@ -32,7 +32,8 @@ OBJS = acl.o arrayfuncs.o array_selfuncs.o array_typanalyze.o \
>      tsquery_op.o tsquery_rewrite.o tsquery_util.o tsrank.o \
>      tsvector.o tsvector_op.o tsvector_parser.o \
>      txid.o uuid.o windowfuncs.o xml.o rangetypes_spgist.o \
> -    rangetypes_typanalyze.o rangetypes_selfuncs.o
> +    rangetypes_typanalyze.o rangetypes_selfuncs.o \
> +    jsonb.o jsonb_support.o

Odd, most OBJS lines are kept in alphabetical order, but that doesn't
seem to be the case here.

> +/*
> + * for jsonb we always want the de-escaped value - that's what's in token
> + */
> +

strange newline.

> +static void
> +jsonb_in_scalar(void *state, char *token, JsonTokenType tokentype)
> +{
> +    JsonbInState *_state = (JsonbInState *) state;
> +    JsonbValue    v;
> +
> +    v.size = sizeof(JEntry);
> +
> +    switch (tokentype)
> +    {
> +
...

> +        default:                /* nothing else should be here in fact */
> +            break;

Shouldn't this at least Assert(false) or something?

> +static void
> +recvJsonbValue(StringInfo buf, JsonbValue *v, uint32 level, int c)
> +{
> +    uint32        hentry = c & JENTRY_TYPEMASK;
> +
> +    if (hentry == JENTRY_ISNULL)
> +    {
> +        v->type = jbvNull;
> +        v->size = sizeof(JEntry);
> +    }
> +    else if (hentry == JENTRY_ISOBJECT || hentry == JENTRY_ISARRAY || hentry == JENTRY_ISCALAR)
> +    {
> +        recvJsonb(buf, v, level + 1, (uint32) c);
> +    }
> +    else if (hentry == JENTRY_ISFALSE || hentry == JENTRY_ISTRUE)
> +    {
> +        v->type = jbvBool;
> +        v->size = sizeof(JEntry);
> +        v->boolean = (hentry == JENTRY_ISFALSE) ? false : true;
> +    }
> +    else if (hentry == JENTRY_ISNUMERIC)
> +    {
> +        v->type = jbvNumeric;
> +        v->numeric = DatumGetNumeric(DirectFunctionCall3(numeric_recv, PointerGetDatum(buf),
> +                                       Int32GetDatum(0), Int32GetDatum(-1)));
> +
> +        v->size = sizeof(JEntry) * 2 + VARSIZE_ANY(v->numeric);

What's the *2 here?

> +static void
> +recvJsonb(StringInfo buf, JsonbValue *v, uint32 level, uint32 header)
> +{
> +    uint32        hentry;
> +    uint32        i;

This function and recvJsonbValue call each other recursively, afaics
without any limit, shouldn't they check for the stack depth?

> +    hentry = header & JENTRY_TYPEMASK;
> +
> +    v->size = 3 * sizeof(JEntry);

*3?

> +    if (hentry == JENTRY_ISOBJECT)
> +    {
> +        v->type = jbvHash;
> +        v->hash.npairs = header & JB_COUNT_MASK;
> +        if (v->hash.npairs > 0)
> +        {
> +            v->hash.pairs = palloc(sizeof(*v->hash.pairs) * v->hash.npairs);
> +

Hm, if I understand correctly, we're just allocating JB_COUNT_MASK
(which is 0x0FFFFFFF) * sizeof(*v->hash.pairs) bytes here, without any
crosschecks about the actual length of the data? So with a few bytes the
server can be coaxed to allocate a gigabyte of data?
Since this immediately calls another input routine, this can be done in
a nested fashion, quickly OOMing the server.

I think this and several other places really need a bit more input
sanity checking.

> +            for (i = 0; i < v->hash.npairs; i++)
> +            {
> +                recvJsonbValue(buf, &v->hash.pairs[i].key, level, pq_getmsgint(buf, 4));
> +                if (v->hash.pairs[i].key.type != jbvString)
> +                    elog(ERROR, "jsonb's key could be only a string");

Shouldn't that be an ereport(ERRCODE_DATATYPE_MISMATCH)? Similar in a
few other places.

> +char *
> +JsonbToCString(StringInfo out, char *in, int estimated_len)
> +{
> +    bool        first = true;
> +    JsonbIterator *it;
> +    int            type;
> +    JsonbValue    v;
> +    int            level = 0;
> +
> +    if (out == NULL)
> +        out = makeStringInfo();

Such a behaviour certainly deserves a documentary comment. Generally
some more functions could use that.

> +    while ((type = JsonbIteratorGet(&it, &v, false)) != 0)
> +    {
> +reout:
> +        switch (type)
> +        {
...
> +                {
> +                    Assert(type == WJB_BEGIN_OBJECT || type == WJB_BEGIN_ARRAY);
> +                    goto reout;

Hrmpf.

> +Datum
> +jsonb_typeof(PG_FUNCTION_ARGS)
> +{
...
> +}

Hm, shouldn't that be in jsonfuncs.c?

> diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
> index a19b222..f1eacc6 100644
> --- a/src/backend/utils/adt/jsonfuncs.c
> +++ b/src/backend/utils/adt/jsonfuncs.c
> @@ -27,6 +27,7 @@
>  #include "utils/builtins.h"
>  #include "utils/hsearch.h"
>  #include "utils/json.h"
> +#include "utils/jsonb.h"
>  #include "utils/jsonapi.h"
>  #include "utils/lsyscache.h"
>  #include "utils/memutils.h"
> @@ -51,6 +52,7 @@ static inline Datum get_path_all(PG_FUNCTION_ARGS, bool as_text);
>  static inline text *get_worker(text *json, char *field, int elem_index,
>             char **tpath, int *ipath, int npath,
>             bool normalize_results);
> +static inline Datum get_jsonb_path_all(PG_FUNCTION_ARGS, bool as_text);

I don't see the point of using PG_FUNCTION_ARGS if you're manually
calling it like
+       return get_jsonb_path_all(fcinfo, false);

That just makes it harder if someday PG_FUNCTION_ARGS grows a second
argument or something.


> +Datum
> +jsonb_object_keys(PG_FUNCTION_ARGS)
> +{
> +    FuncCallContext *funcctx;
> +    OkeysState *state;
> +    int            i;
> +
> +    if (SRF_IS_FIRSTCALL())
> +    {
> +        MemoryContext oldcontext;
> +        Jsonb       *jb = PG_GETARG_JSONB(0);
> +        bool        skipNested = false;
> +        JsonbIterator *it;
> +        JsonbValue    v;
> +        int            r = 0;
> +
> +        if (JB_ROOT_IS_SCALAR(jb))
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                     errmsg("cannot call jsonb_object_keys on a scalar")));
> +        else if (JB_ROOT_IS_ARRAY(jb))
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                     errmsg("cannot call jsonb_object_keys on an array")));
> +
> +        funcctx = SRF_FIRSTCALL_INIT();
> +        oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);

This will detoast 'jb' into the expression context, since
PG_GETARG_JSONB() is called before the MemoryContextSwitchTo. But that's
ok since the percall code only deals with ->result, right?

> -    /* make these in a sufficiently long-lived memory context */
>      old_cxt = MemoryContextSwitchTo(rsi->econtext->ecxt_per_query_memory);

wh remove that comment?

> +#define JENTRY_ISCALAR        (0x10000000 | 0x40000000)

Isn't there an S missing here?

> --- a/contrib/hstore/hstore_compat.c
> +++ b/contrib/hstore/hstore_compat.c
> +/*
> + * New Old version (new not-nested version of hstore, v2 version)
> + * V2 and v3 (nested) are upward binary compatible. But
> + * framework was fully changed. Keep here old definitions (v2)
> + */

That's an, err, interesting sentence. I think referring to old new
version and stuff is less than helpful. I realize lots of that is
baggage from existing code, but yet another version doesn't make it
easier.


I lost my stomach (or maybe it was the glass of red) somewhere in the
middle, but I think this needs a lot of work. Especially the io code
doesn't seem ready to me. I'd consider ripping out the send/recv code
for 9.4, that seems the biggest can of worms. It will still be usable
without.

There's just about no comments in large and relevant parts of the
code. There's not much documentation about the binary layout of a
definitely not trivial type with convoluted interdependencies with
hstore...

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Marco Atzeri
Date:
Subject: Re: Postgresql for cygwin - 3rd
Next
From: Andrew Dunstan
Date:
Subject: Re: Postgresql for cygwin - 3rd