Thread: json(b)_to_tsvector with numeric values

json(b)_to_tsvector with numeric values

From
Dmitry Dolgov
Date:
Hi,

We've just noticed, that current implementation of `json(b)_to_tsvector` can be
confusing sometimes, if the target document contains numeric values.
In this case
we just drop them, and only string values will contribute to the result:

    select to_tsvector('english', '{"a": "The Fat Rats", "b": 123}'::jsonb);
       to_tsvector
    -----------------
     'fat':2 'rat':3
    (1 row)

The result would be less surprising if all values, that can be converted to
string representation (so, strings and numeric values, nothing to do for null &
boolean), will take part in it:

    select to_tsvector('english', '{"a": "The Fat Rats", "b": 123}'::jsonb);
           to_tsvector
    -------------------------
     '123':5 'fat':2 'rat':3
    (1 row)

Attached patch contains small fix that's necessary to get the described
behavior. This patch doesn't touch `ts_headline` though, because following the
same approach it would require changing the type of element in the resulting
json(b).

Any opinions about this suggestion? Can it be considered as a bug fix and
included into this release?

Attachment

Re: json(b)_to_tsvector with numeric values

From
Arthur Zakirov
Date:
Hello Dmitry,

Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Any opinions about this suggestion? Can it be considered as a bug fix and
included into this release?

I think there is no chance to include it into v11. You can add the patch to the 2018-09 commitfest. 


--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: json(b)_to_tsvector with numeric values

From
Oleg Bartunov
Date:
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


On Mon, Apr 2, 2018 at 9:45 AM, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
> Hello Dmitry,
>
> Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>>
>> Any opinions about this suggestion? Can it be considered as a bug fix and
>> included into this release?
>
>
> I think there is no chance to include it into v11. You can add the patch to
> the 2018-09 commitfest.

I found this bug, when working on presentation about FTS and it looked
annoying, since it validates
the consistency of FTS.I think this is a bug, which needs to be fixed,
else inconsistency with existing full text search  will be gets
deeper.

The fix looks trivial, but needs a review, of course.

>
>
> --
> Arthur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company


Re: json(b)_to_tsvector with numeric values

From
Arthur Zakirov
Date:
On Mon, Apr 02, 2018 at 11:41:12AM +0300, Oleg Bartunov wrote:
> On Mon, Apr 2, 2018 at 9:45 AM, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
> I found this bug, when working on presentation about FTS and it looked
> annoying, since it validates
> the consistency of FTS.I think this is a bug, which needs to be fixed,
> else inconsistency with existing full text search  will be gets
> deeper.
> 
> The fix looks trivial, but needs a review, of course.

Oh, I understood. The code looks good, tests passed. But maybe it is
better to use NumericGetDatum() instead of PointerGetDatum()?

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: json(b)_to_tsvector with numeric values

From
Dmitry Dolgov
Date:
> On 2 April 2018 at 11:27, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
> On Mon, Apr 02, 2018 at 11:41:12AM +0300, Oleg Bartunov wrote:
>> On Mon, Apr 2, 2018 at 9:45 AM, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote:
>> I found this bug, when working on presentation about FTS and it looked
>> annoying, since it validates
>> the consistency of FTS.I think this is a bug, which needs to be fixed,
>> else inconsistency with existing full text search  will be gets
>> deeper.
>>
>> The fix looks trivial, but needs a review, of course.
>
> Oh, I understood. The code looks good, tests passed. But maybe it is
> better to use NumericGetDatum() instead of PointerGetDatum()?

Well, technically speaking they're the same, but yes, NumericGetDatum would be
more precise. I've modified it in the attached patch.

Attachment

Re: json(b)_to_tsvector with numeric values

From
Teodor Sigaev
Date:
>>> the consistency of FTS.I think this is a bug, which needs to be fixed,
>>> else inconsistency with existing full text search  will be gets
>>> deeper.
Hm, seems, it's useful feature, but I suggest to make separate function 
jsonb_any_to_tsvector and add support for boolean too (if you know better name 
for function, do not hide it). Changing behavior of existing function is not 
obvious for users and, seems, should not backpatched.



-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: json(b)_to_tsvector with numeric values

From
Dmitry Dolgov
Date:
> On 4 April 2018 at 11:52, Teodor Sigaev <teodor@sigaev.ru> wrote:
>>>> the consistency of FTS.I think this is a bug, which needs to be fixed,
>>>> else inconsistency with existing full text search  will be gets
>>>> deeper.
>
> Hm, seems, it's useful feature, but I suggest to make separate function
> jsonb_any_to_tsvector and add support for boolean too (if you know better
> name for function, do not hide it). Changing behavior of existing function
> is not obvious for users and, seems, should not backpatched.

What do you think about having not a separate function, but a flag argument to
the existing one (like `create` in `jsonb_set`), that will have false as
default value? The result would be the same, but without an extra function with
almost the same implementation.


Re: json(b)_to_tsvector with numeric values

From
Teodor Sigaev
Date:
>> Hm, seems, it's useful feature, but I suggest to make separate function
>> jsonb_any_to_tsvector and add support for boolean too (if you know better
>> name for function, do not hide it). Changing behavior of existing function
>> is not obvious for users and, seems, should not backpatched.
> 
> What do you think about having not a separate function, but a flag argument to
> the existing one (like `create` in `jsonb_set`), that will have false as
> default value? The result would be the same, but without an extra function with
> almost the same implementation.

tsvector jsonb_to_tsvector(jsonb[, bool]) ?
Agreed. Second arg should be optional.


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: json(b)_to_tsvector with numeric values

From
Dmitry Dolgov
Date:
> On 4 April 2018 at 16:09, Teodor Sigaev <teodor@sigaev.ru> wrote:
>
>>> Hm, seems, it's useful feature, but I suggest to make separate function
>>> jsonb_any_to_tsvector and add support for boolean too (if you know better
>>> name for function, do not hide it). Changing behavior of existing
>>> function
>>> is not obvious for users and, seems, should not backpatched.
>>
>>
>> What do you think about having not a separate function, but a flag
>> argument to
>> the existing one (like `create` in `jsonb_set`), that will have false as
>> default value? The result would be the same, but without an extra function
>> with
>> almost the same implementation.
>
>
> tsvector jsonb_to_tsvector(jsonb[, bool]) ?
> Agreed. Second arg should be optional.

Unfortunately, this idea with a flag argument can't be implemented easily
(related discussion is here [1]). So I've modified the patch accordingly to
your original suggestion about having separate functions
`json(b)_all_to_tsvector`.

1:
https://www.postgresql.org/message-id/flat/CA%2Bq6zcVJ%2BWx%2B-%3DkkN5UC0T-LtsJWnx0g9S0xSnn3jUWkriufDA%40mail.gmail.com

Attachment

Re: json(b)_to_tsvector with numeric values

From
Teodor Sigaev
Date:
1) I don't like jsonb_all_to_tsvector too.. What if we will accept new variant 
to index? Let me suggest:

tsvector jsonb_to_tsvector([regclass,] jsonb, text[])

where text[] arg is actually a flags, array contains any combination of literals 
'numeric', 'string', 'boolean' (and even 'key' to index keys_ to point which 
types should be indexed. More than it, may be, it should a jsonb type for 
possible improvements in future. For now, it shouldbe a jsonb array type with 
string elements described above, example:

select jsonb_to_tsvector('{"a": "aaa in bbb ddd ccc", "b":123}',
                '["numeric", "boolean"]');


Form jsonb_to_tsvector('...', '["string"]) is effectively the same as current 
to_tsvector(jsonb)

2)
Now it fails, and I see something strange in resuling tsvector: 'true':9,13 and
'fals':9,13 - I don't see any bool keys in input json.

% more /home/teodor/pgsql/src/test/regress/regression.diffs
*** /home/teodor/pgsql/src/test/regress/expected/jsonb.out      2018-04-06 
16:34:59.424481000 +0300
--- /home/teodor/pgsql/src/test/regress/results/jsonb.out       2018-04-06 
16:36:48.095411000 +0300
***************
*** 4132,4138 ****
   select jsonb_all_to_tsvector('english', '{"a": "aaa in bbb ddd ccc", "b": 
123, "c": 456}'::jsonb);
                       jsonb_all_to_tsvector
   --------------------------------------------------------------
!  '123':7 '456':11 'aaa':1 'bbb':3 'ccc':5 'ddd':4 'true':9,13
   (1 row)

   -- ts_vector corner cases
--- 4132,4138 ----
   select jsonb_all_to_tsvector('english', '{"a": "aaa in bbb ddd ccc", "b": 
123, "c": 456}'::jsonb);
                       jsonb_all_to_tsvector
   --------------------------------------------------------------
!  '123':7 '456':11 'aaa':1 'bbb':3 'ccc':5 'ddd':4 'fals':9,13
   (1 row)

   -- ts_vector corner cases


Dmitry Dolgov wrote:
>> On 4 April 2018 at 16:09, Teodor Sigaev <teodor@sigaev.ru> wrote:
>>
>>>> Hm, seems, it's useful feature, but I suggest to make separate function
>>>> jsonb_any_to_tsvector and add support for boolean too (if you know better
>>>> name for function, do not hide it). Changing behavior of existing
>>>> function
>>>> is not obvious for users and, seems, should not backpatched.
>>>
>>>
>>> What do you think about having not a separate function, but a flag
>>> argument to
>>> the existing one (like `create` in `jsonb_set`), that will have false as
>>> default value? The result would be the same, but without an extra function
>>> with
>>> almost the same implementation.
>>
>>
>> tsvector jsonb_to_tsvector(jsonb[, bool]) ?
>> Agreed. Second arg should be optional.
> 
> Unfortunately, this idea with a flag argument can't be implemented easily
> (related discussion is here [1]). So I've modified the patch accordingly to
> your original suggestion about having separate functions
> `json(b)_all_to_tsvector`.
> 
> 1:
https://www.postgresql.org/message-id/flat/CA%2Bq6zcVJ%2BWx%2B-%3DkkN5UC0T-LtsJWnx0g9S0xSnn3jUWkriufDA%40mail.gmail.com
> 

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: json(b)_to_tsvector with numeric values

From
Dmitry Dolgov
Date:
> On 6 April 2018 at 16:25, Teodor Sigaev <teodor@sigaev.ru> wrote:
> 1) I don't like jsonb_all_to_tsvector too.. What if we will accept new
> variant to index? Let me suggest:
>
> tsvector jsonb_to_tsvector([regclass,] jsonb, text[])
>
> where text[] arg is actually a flags, array contains any combination of
> literals 'numeric', 'string', 'boolean' (and even 'key' to index keys_ to
> point which types should be indexed. More than it, may be, it should a jsonb
> type for possible improvements in future. For now, it shouldbe a jsonb array
> type with string elements described above, example:
>
> select jsonb_to_tsvector('{"a": "aaa in bbb ddd ccc", "b":123}',
>                                 '["numeric", "boolean"]');
>
>
> Form jsonb_to_tsvector('...', '["string"]) is effectively the same as
> current to_tsvector(jsonb)

Thank you for the suggestion, this sounds appealing. But I have two questions:

* why it should be a jsonb array, not a regular array?

* it would introduce the idea of jsonb element type expressed in text format,
  so "string", "numeric", "boolean" etc - are there any consequences of that?
  As far as I understand so far there was only jsonb_typeof.

> 2) Now it fails, and I see something strange in resuling tsvector

Oh, sorry, stupid copy-paste mistake in the condition. Just for the records,
I've attached fixed version of the previous patch (without any changes about an
array instead of a boolean flag).

Attachment

Re: json(b)_to_tsvector with numeric values

From
Teodor Sigaev
Date:

Dmitry Dolgov wrote:
>> On 6 April 2018 at 16:25, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> 1) I don't like jsonb_all_to_tsvector too.. What if we will accept new
>> variant to index? Let me suggest:
>>
>> tsvector jsonb_to_tsvector([regclass,] jsonb, text[])
>>
>> where text[] arg is actually a flags, array contains any combination of
>> literals 'numeric', 'string', 'boolean' (and even 'key' to index keys_ to
>> point which types should be indexed. More than it, may be, it should a jsonb
>> type for possible improvements in future. For now, it shouldbe a jsonb array
>> type with string elements described above, example:
>>
>> select jsonb_to_tsvector('{"a": "aaa in bbb ddd ccc", "b":123}',
>>                                  '["numeric", "boolean"]');
>>
>>
>> Form jsonb_to_tsvector('...', '["string"]) is effectively the same as
>> current to_tsvector(jsonb)
> 
> Thank you for the suggestion, this sounds appealing. But I have two questions:
> 
> * why it should be a jsonb array, not a regular array?
To simplify extension of this array in future, for example to add limitation of 
recursion level in converted jsonb, etc.


> 
> * it would introduce the idea of jsonb element type expressed in text format,
>    so "string", "numeric", "boolean" etc - are there any consequences of that?
>    As far as I understand so far there was only jsonb_typeof.
Good catch, jsonb_typeof strings are okay: "string", "number", "boolean"
also  "all", "key", "value"

See workable sketch for parsing jsonb flags and new worker variant.



>> 2) Now it fails, and I see something strange in resuling tsvector
> 
> Oh, sorry, stupid copy-paste mistake in the condition. Just for the records,
> I've attached fixed version of the previous patch (without any changes about an
> array instead of a boolean flag).


by the way:
Datum
jsonb_to_tsvector(PG_FUNCTION_ARGS)
{
     Jsonb      *jb = PG_GETARG_JSONB_P(0);
...
     PG_FREE_IF_COPY(jb, 1); //wrong arg number
}

jsonb_all_to_tsvector and json variants too




-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Attachment

Re: json(b)_to_tsvector with numeric values

From
Dmitry Dolgov
Date:
> On 6 April 2018 at 18:55, Teodor Sigaev <teodor@sigaev.ru> wrote:
>
>
> Dmitry Dolgov wrote:
>>>
>>> On 6 April 2018 at 16:25, Teodor Sigaev <teodor@sigaev.ru> wrote:
>>> 1) I don't like jsonb_all_to_tsvector too.. What if we will accept new
>>> variant to index? Let me suggest:
>>>
>>> tsvector jsonb_to_tsvector([regclass,] jsonb, text[])
>>>
>>> where text[] arg is actually a flags, array contains any combination of
>>> literals 'numeric', 'string', 'boolean' (and even 'key' to index keys_ to
>>> point which types should be indexed. More than it, may be, it should a
>>> jsonb
>>> type for possible improvements in future. For now, it shouldbe a jsonb
>>> array
>>> type with string elements described above, example:
>>>
>>> select jsonb_to_tsvector('{"a": "aaa in bbb ddd ccc", "b":123}',
>>>                                  '["numeric", "boolean"]');
>>>
>>>
>>> Form jsonb_to_tsvector('...', '["string"]) is effectively the same as
>>> current to_tsvector(jsonb)
>>
>>
>> Thank you for the suggestion, this sounds appealing. But I have two
>> questions:
>>
>> * why it should be a jsonb array, not a regular array?
>
> To simplify extension of this array in future, for example to add limitation
> of recursion level in converted jsonb, etc.
>
>
>>
>> * it would introduce the idea of jsonb element type expressed in text
>> format,
>>    so "string", "numeric", "boolean" etc - are there any consequences of
>> that?
>>    As far as I understand so far there was only jsonb_typeof.
>
> Good catch, jsonb_typeof strings are okay: "string", "number", "boolean"
> also  "all", "key", "value"
>
> See workable sketch for parsing jsonb flags and new worker variant.

Yep, thanks for the sketch. Here is the new version of patch, does it look
close to what you have in mind?

Attachment

Re: json(b)_to_tsvector with numeric values

From
Teodor Sigaev
Date:
>> See workable sketch for parsing jsonb flags and new worker variant.
> 
> Yep, thanks for the sketch. Here is the new version of patch, does it look
> close to what you have in mind?

Patch looks good except error messaging, you took it directly from 
sketch where I didn't spend time for it. Please, improve. elog() should 
be used only for impossible error, whereas user input could contins 
mistakes.


-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/


Re: json(b)_to_tsvector with numeric values

From
Dmitry Dolgov
Date:
> On 7 April 2018 at 17:09, Teodor Sigaev <teodor@sigaev.ru> wrote:
>>> See workable sketch for parsing jsonb flags and new worker variant.
>>
>>
>> Yep, thanks for the sketch. Here is the new version of patch, does it look
>> close to what you have in mind?
>
>
> Patch looks good except error messaging, you took it directly from sketch
> where I didn't spend time for it. Please, improve. elog() should be used
> only for impossible error, whereas user input could contins mistakes.

I assume what you mean is that for user input errors we need to use ereport.
Indeed, thanks for noticing. I've replaced all elog except the last one, since
it actually describes an impossible situation, when we started to read an
array, but ended up having something else instead WJB_END_ARRAY.

Attachment

Re: json(b)_to_tsvector with numeric values

From
Teodor Sigaev
Date:
Thank you, pushed with some editorization

Dmitry Dolgov wrote:
>> On 7 April 2018 at 17:09, Teodor Sigaev <teodor@sigaev.ru> wrote:
>>>> See workable sketch for parsing jsonb flags and new worker variant.
>>>
>>>
>>> Yep, thanks for the sketch. Here is the new version of patch, does it look
>>> close to what you have in mind?
>>
>>
>> Patch looks good except error messaging, you took it directly from sketch
>> where I didn't spend time for it. Please, improve. elog() should be used
>> only for impossible error, whereas user input could contins mistakes.
> 
> I assume what you mean is that for user input errors we need to use ereport.
> Indeed, thanks for noticing. I've replaced all elog except the last one, since
> it actually describes an impossible situation, when we started to read an
> array, but ended up having something else instead WJB_END_ARRAY.
> 

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/