Re: jsonb and nested hstore - Mailing list pgsql-hackers

From Andrew Dunstan
Subject Re: jsonb and nested hstore
Date
Msg-id 530E5B40.6040606@dunslane.net
Whole thread Raw
In response to Re: jsonb and nested hstore  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: jsonb and nested hstore
Re: jsonb and nested hstore
Re: jsonb and nested hstore
List pgsql-hackers

Apologies for top-post.

I have made some fixes based in these comments. A new patch for the
jsonb portion is attached.

Responses interspersed below.

Teodor, many of these comments are basically for you. Please respond.

cheers

andrew




On 02/10/2014 09:11 PM, Andres Freund wrote:
> Hi,
>
> Is it just me or is jsonapi.h not very well documented?


What about it do you think is missing? In any case, it's hardly relevant
to this patch, so I'll take that as obiter dicta.

>
> On 2014-02-06 18:47:31 -0500, Andrew Dunstan wrote:
>> +/*
>> + * for jsonb we always want the de-escaped value - that's what's in token
>> + */
>> +static void
>> +jsonb_in_scalar(void *state, char *token, JsonTokenType tokentype)
>> +{
>> +    JsonbInState *_state = (JsonbInState *) state;
>> +    JsonbValue    v;
>> +
>> +    v.size = sizeof(JEntry);
>> +
>> +    switch (tokentype)
>> +    {
>> +
>> +        case JSON_TOKEN_STRING:
>> +            v.type = jbvString;
>> +            v.string.len = token ? checkStringLen(strlen(token)) : 0;
>> +            v.string.val = token ? pnstrdup(token, v.string.len) : NULL;
>> +            v.size += v.string.len;
>> +            break;
>> +        case JSON_TOKEN_NUMBER:
>> +            v.type = jbvNumeric;
>> +            v.numeric = DatumGetNumeric(DirectFunctionCall3(numeric_in, CStringGetDatum(token), 0, -1));
>> +
>> +            v.size += VARSIZE_ANY(v.numeric) +sizeof(JEntry) /* alignment */ ;
> missing space.

Fixed.

>
> Why does + sizeof(JEntry) change anything about alignment? If it was
> aligned before, adding a statically sized value doesn't give any new
> guarantees about alignment?


Teodor, please comment.


>
>> +/*
>> + * jsonb type recv function
>> + *
>> + * the type is sent as text in binary mode, so this is almost the same
>> + * as the input function.
>> + */
>> +Datum
>> +jsonb_recv(PG_FUNCTION_ARGS)
>> +{
>> +    StringInfo    buf = (StringInfo) PG_GETARG_POINTER(0);
>> +    text       *result = cstring_to_text_with_len(buf->data, buf->len);
>> +
>> +    return deserialize_json_text(result);
>> +}
> This is a bit absurd, we're receiving a string in a StringInfo buffer,
> just to copy it into text, and then in makeJsonLexContext() access the
> raw chars again.


I have fixed this so that we don't construct a text object just so we
can json parse a cstring.


>> +static void
>> +putEscapedValue(StringInfo out, JsonbValue *v)
>> +{
>> +    switch (v->type)
>> +    {
>> +        case jbvNull:
>> +            appendBinaryStringInfo(out, "null", 4);
>> +            break;
>> +        case jbvString:
>> +            escape_json(out, pnstrdup(v->string.val, v->string.len));
>> +            break;
>> +        case jbvBool:
>> +            if (v->boolean)
>> +                appendBinaryStringInfo(out, "true", 4);
>> +            else
>> +                appendBinaryStringInfo(out, "false", 5);
>> +            break;
>> +        case jbvNumeric:
>> +            appendStringInfoString(out, DatumGetCString(DirectFunctionCall1(numeric_out,
PointerGetDatum(v->numeric))));
>> +            break;
>> +        default:
>> +            elog(ERROR, "unknown jsonb scalar type");
>> +    }
>> +}
> Hm, will the jbvNumeric always result in correct correct quoting?
> datum_to_json() does extra hangups for that case, any reason we don't
> need that here?


Yes, there is a reason we don't need it here. datum_to_json is
converting SQL numerics to json, and these might be strings such as
'Nan'. But we never store something in a jsonb numeric field unless it
came in as a json numeric format, which never needs quoting. The json
parser will never parse 'NaN' as a numeric value.


>> +char *
>> +JsonbToCString(StringInfo out, char *in, int estimated_len)
>> +{
> ...
>> +    while (redo_switch || ((type = JsonbIteratorGet(&it, &v, false)) != 0))
>> +    {
>> +        redo_switch = false;
> Not sure if I see the advantage over the goto here. A comment explaining
> what the reason for the goto is wouldhave sufficed.

I think you're being pretty damn picky here. You whined about the goto,
I removed it, now you don't like that either. Personally I think this is
cleaner.

>
>> +            case WJB_KEY:
>> +                if (first == false)
>> +                    appendBinaryStringInfo(out, ", ", 2);
>> +                first = true;
>> +
>> +                putEscapedValue(out, &v);
>> +                appendBinaryStringInfo(out, ": ", 2);
> putEscapedValue doesn't gurantee only strings are output, but
> datum_to_json does extra hangups for that case.

But the key here will always be a string. It's enforced by the JSON
rules. I suppose we could call escape_json directly here and save a
function call, but I don't agree that there is any problem here.

>
>> +                type = JsonbIteratorGet(&it, &v, false);
>> +                if (type == WJB_VALUE)
>> +                {
>> +                    first = false;
>> +                    putEscapedValue(out, &v);
>> +                }
>> +                else
>> +                {
>> +                    Assert(type == WJB_BEGIN_OBJECT || type == WJB_BEGIN_ARRAY);
>> +                    /*
>> +                     * We need to rerun current switch() due to put
>> +                     * in current place object which we just got
>> +                     * from iterator.
>> +                     */
> "due to put"?


I think that's due to the author not being a native English speaker.
I've tried to improve it a bit.

Teodor, please comment if you like.


>
>> +/*
>> + * jsonb type send function
>> + *
>> + * Just send jsonb as a string of text
>> + */
>> +Datum
>> +jsonb_send(PG_FUNCTION_ARGS)
>> +{
>> +    Jsonb       *jb = PG_GETARG_JSONB(0);
>> +    StringInfoData buf;
>> +    char       *out;
>> +
>> +    out = JsonbToCString(NULL, (JB_ISEMPTY(jb)) ? NULL : VARDATA(jb), VARSIZE(jb));
>> +
>> +    pq_begintypsend(&buf);
>> +    pq_sendtext(&buf, out, strlen(out));
>> +    PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
>> +}
> Why aren't you using using the stringbuf passing JsonbToCString
> convention here to avoid the strlen()?

Fixed.

>
>> +/*
>> + * Compare two jbvString JsonbValue values, third argument
>> + * 'arg', if it's not null, should be a pointer to bool
>> + * value which will be set to true if strings are equal and
>> + * untouched otherwise.
>> + */
>> +int
>> +compareJsonbStringValue(const void *a, const void *b, void *arg)
>> +{
>> +    const JsonbValue *va = a;
>> +    const JsonbValue *vb = b;
>> +    int            res;
>> +
>> +    Assert(va->type == jbvString);
>> +    Assert(vb->type == jbvString);
>> +
>> +    if (va->string.len == vb->string.len)
>> +    {
>> +        res = memcmp(va->string.val, vb->string.val, va->string.len);
>> +        if (res == 0 && arg)
>> +            *(bool *) arg = true;
> Should be NULL, not 0.

No, the compiler doesn't like that for int values.


>
>> +/*
>> + * qsort helper to compare JsonbPair values, third argument
>> + * arg will be trasferred as is to subsequent
> *transferred.
>

fixed.

>> +/*
>> + * some constant order of JsonbValue
>> + */
>> +int
>> +compareJsonbValue(JsonbValue *a, JsonbValue *b)
>> +{
> Called recursively, needs to check for stack depth.


fixed.



Teodor, please examine and comment on all comments below this point.



>
>> +JsonbValue *
>> +findUncompressedJsonbValueByValue(char *buffer, uint32 flags,
>> +                                  uint32 *lowbound, JsonbValue *key)
>> +{
> Functions like this *REALLY* need documentation for their
> parameters. And of their actual purpose.
>
> What's actually the uncompressed bit here? Isn't it actually the
> contrary? This is navigating the compressed, non-tree form, no?
>
>> +    if (flags & JB_FLAG_ARRAY & header)
>> +    {
>> +        JEntry       *array = (JEntry *) (buffer + sizeof(header));
>> +        char       *data = (char *) (array + (header & JB_COUNT_MASK));
>> +        int            i;
>> +        for (i = (lowbound) ? *lowbound : 0; i < (header & JB_COUNT_MASK); i++)
>> +        {
>> +            JEntry       *e = array + i;
>> +            else if (JBE_ISSTRING(*e) && key->type == jbvString)
>> +            {
>> +                if (key->string.len == JBE_LEN(*e) &&
>> +                    memcmp(key->string.val, data + JBE_OFF(*e),
>> +                           key->string.len) == 0)
>> +                {
> So, here we have our own undocumented! indexing system. Grand.
>
>> +    else if (flags & JB_FLAG_OBJECT & header)
>> +    {
>> +        JEntry       *array = (JEntry *) (buffer + sizeof(header));
>> +        char       *data = (char *) (array + (header & JB_COUNT_MASK) * 2);
>> +        uint32        stopLow = lowbound ? *lowbound : 0,
>> +                    stopHigh = (header & JB_COUNT_MASK),
>> +                    stopMiddle;
> I don't understand what the point of the lowbound logic could be here?
> If a key hasn't been found, it hasn't been found? Maybe the idea is to
> use it when testing containedness or somesuch? Wouldn't iterating over
> the keyspace be a better idea for that case?
>
>> +        if (key->type != jbvString)
>> +            return NULL;
> That's not allowed, right?
>
>> +/*
>> + * Get i-th value of array or hash. if i < 0 then it counts from
>> + * the end of array/hash. Note: returns pointer to statically
>> + * allocated JsonbValue.
>> + */
>> +JsonbValue *
>> +getJsonbValue(char *buffer, uint32 flags, int32 i)
>> +{
>> +    uint32        header = *(uint32 *) buffer;
>> +    static JsonbValue r;
> Really? And why on earth would static allocation be a good idea? Specify
> it on the caller's stack if need be. Or even return by value, today's
> calling convention will just allocate that on the caller's stack without
> copying.
> Accessing static data isn't even faster.
>
>> +    if (JBE_ISSTRING(*e))
>> +    {
>> +        r.type = jbvString;
>> +        r.string.val = data + JBE_OFF(*e);
>> +        r.string.len = JBE_LEN(*e);
>> +        r.size = sizeof(JEntry) + r.string.len;
>> +    }
>> +    else if (JBE_ISBOOL(*e))
>> +    {
>> +        r.type = jbvBool;
>> +        r.boolean = (JBE_ISBOOL_TRUE(*e)) ? true : false;
>> +        r.size = sizeof(JEntry);
>> +    }
>> +    else if (JBE_ISNUMERIC(*e))
>> +    {
>> +        r.type = jbvNumeric;
>> +        r.numeric = (Numeric) (data + INTALIGN(JBE_OFF(*e)));
>> +
>> +        r.size = 2 * sizeof(JEntry) + VARSIZE_ANY(r.numeric);
>> +    }
>> +    else if (JBE_ISNULL(*e))
>> +    {
>> +        r.type = jbvNull;
>> +        r.size = sizeof(JEntry);
>> +    }
>> +    else
>> +    {
>> +        r.type = jbvBinary;
>> +        r.binary.data = data + INTALIGN(JBE_OFF(*e));
>> +        r.binary.len = JBE_LEN(*e) - (INTALIGN(JBE_OFF(*e)) - JBE_OFF(*e));
>> +        r.size = r.binary.len + 2 * sizeof(JEntry);
>> +    }
> This bit of code exists pretty similarly in several places, maybe consolitate?
>
>> +/****************************************************************************
>> + *                      Walk on tree representation of jsonb                    *
>> + ****************************************************************************/
>> +static void
>> +walkUncompressedJsonbDo(JsonbValue *v, walk_jsonb_cb cb, void *cb_arg, uint32 level)
>> +{
>> +    int            i;
> check stack limit.
>
>> +void
>> +walkUncompressedJsonb(JsonbValue *v, walk_jsonb_cb cb, void *cb_arg)
>> +{
>> +    if (v)
>> +        walkUncompressedJsonbDo(v, cb, cb_arg, 0);
>> +}
>> +
>> +/****************************************************************************
>> + *                           Iteration over binary jsonb                        *
>> + ****************************************************************************/
> This needs docs.
>
>> +static void
>> +parseBuffer(JsonbIterator *it, char *buffer)
>> +{
> Why invent completely independent naming conventions to the previous
> functions here?
>
>> +static bool
>> +formAnswer(JsonbIterator **it, JsonbValue *v, JEntry * e, bool skipNested)
>> +{
> Imaginatively undescriptive name. But if it were slightly more more
> abstracted away from JsonbIterator it could be the answer to my prayers
> above about removing redundant code.
>
>> +static JsonbIterator *
>> +up(JsonbIterator *it)
>> +{
> Not a good name.
>
>> +int
>> +JsonbIteratorGet(JsonbIterator **it, JsonbValue *v, bool skipNested)
>> +{
>> +    int            res;
> recursive, stack depth check.
>
>> +    switch ((*it)->type | (*it)->state)
>> +    {
>> +        case JB_FLAG_ARRAY | jbi_start:
> I don't know, but I don't see the point in avoid if (), else if()
> ... constructs if it requires such dirty tricks.
>
>
>> +/****************************************************************************
>> + *          Transformation from tree to binary representation of jsonb        *
>> + ****************************************************************************/
>> +typedef struct CompressState
>> +{
>> +    char       *begin;
>> +    char       *ptr;
>> +
>> +    struct
>> +    {
>> +        uint32        i;
>> +        uint32       *header;
>> +        JEntry       *array;
>> +        char       *begin;
>> +    }           *levelstate, *lptr, *pptr;
>> +
>> +    uint32        maxlevel;
>> +
>> +}    CompressState;
>> +
>> +#define curLevelState    state->lptr
>> +#define prevLevelState    state->pptr
> brrr.
>
> I stopped looking at code at this point.
>
>> diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
>> index e1d8aae..50ddf50 100644
>> --- a/src/backend/utils/adt/jsonfuncs.c
>> +++ b/src/backend/utils/adt/jsonfuncs.c
> there's lots of whitespace/tab damage in this file. Check git log/diff
> --check or such.
>
> This is still a mess, sorry:
> * Large and important part continue to be undocumented. Especially in
>    jsonb_support.c
> * Lots of naming inconsistencies.
> * There's no documentation about what compressed/uncompressed jsonbs
>    are. The former is the ondisk representation, the latter the in-memory
>    tree representation.
> * There's no non-code documentation about the on-disk format.
>
> Unfortunately I can't see how this patch could get ready in time for
> this CF. There's *lots* of work to be done. The code as is isn't going
> to be maintainable. Much of it obvious by simply scanning through the
> code, without even looking for higher level issues. And much of it has
> previously been pointed out, without getting real attention.
>
> That's not to speak of the nested hstore patch, which I didn't even
> start to look at. That's twice this patches size.
>
> Greetings,
>
> Andres Freund
>


Attachment

pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Another possible corruption bug in 9.3.2 or possibly a known MultiXact problem?
Next
From: Alexander Korotkov
Date:
Subject: Re: GIN improvements part2: fast scan