Thread: json(b)_to_tsvector with numeric values
Hi, We've just noticed, that current implementation of `json(b)_to_tsvector` can be confusing sometimes, if the target document contains numeric values. In this case we just drop them, and only string values will contribute to the result: select to_tsvector('english', '{"a": "The Fat Rats", "b": 123}'::jsonb); to_tsvector ----------------- 'fat':2 'rat':3 (1 row) The result would be less surprising if all values, that can be converted to string representation (so, strings and numeric values, nothing to do for null & boolean), will take part in it: select to_tsvector('english', '{"a": "The Fat Rats", "b": 123}'::jsonb); to_tsvector ------------------------- '123':5 'fat':2 'rat':3 (1 row) Attached patch contains small fix that's necessary to get the described behavior. This patch doesn't touch `ts_headline` though, because following the same approach it would require changing the type of element in the resulting json(b). Any opinions about this suggestion? Can it be considered as a bug fix and included into this release?
Attachment
Hello Dmitry,
Dmitry Dolgov <9erthalion6@gmail.com> wrote:
--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Any opinions about this suggestion? Can it be considered as a bug fix and
included into this release?
I think there is no chance to include it into v11. You can add the patch to the 2018-09 commitfest.
--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Postgres Professional: http://www.postgrespro.com The Russian Postgres Company On Mon, Apr 2, 2018 at 9:45 AM, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote: > Hello Dmitry, > > Dmitry Dolgov <9erthalion6@gmail.com> wrote: >> >> Any opinions about this suggestion? Can it be considered as a bug fix and >> included into this release? > > > I think there is no chance to include it into v11. You can add the patch to > the 2018-09 commitfest. I found this bug, when working on presentation about FTS and it looked annoying, since it validates the consistency of FTS.I think this is a bug, which needs to be fixed, else inconsistency with existing full text search will be gets deeper. The fix looks trivial, but needs a review, of course. > > > -- > Arthur Zakirov > Postgres Professional: http://www.postgrespro.com > Russian Postgres Company
On Mon, Apr 02, 2018 at 11:41:12AM +0300, Oleg Bartunov wrote: > On Mon, Apr 2, 2018 at 9:45 AM, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote: > I found this bug, when working on presentation about FTS and it looked > annoying, since it validates > the consistency of FTS.I think this is a bug, which needs to be fixed, > else inconsistency with existing full text search will be gets > deeper. > > The fix looks trivial, but needs a review, of course. Oh, I understood. The code looks good, tests passed. But maybe it is better to use NumericGetDatum() instead of PointerGetDatum()? -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
> On 2 April 2018 at 11:27, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote: > On Mon, Apr 02, 2018 at 11:41:12AM +0300, Oleg Bartunov wrote: >> On Mon, Apr 2, 2018 at 9:45 AM, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote: >> I found this bug, when working on presentation about FTS and it looked >> annoying, since it validates >> the consistency of FTS.I think this is a bug, which needs to be fixed, >> else inconsistency with existing full text search will be gets >> deeper. >> >> The fix looks trivial, but needs a review, of course. > > Oh, I understood. The code looks good, tests passed. But maybe it is > better to use NumericGetDatum() instead of PointerGetDatum()? Well, technically speaking they're the same, but yes, NumericGetDatum would be more precise. I've modified it in the attached patch.
Attachment
>>> the consistency of FTS.I think this is a bug, which needs to be fixed, >>> else inconsistency with existing full text search will be gets >>> deeper. Hm, seems, it's useful feature, but I suggest to make separate function jsonb_any_to_tsvector and add support for boolean too (if you know better name for function, do not hide it). Changing behavior of existing function is not obvious for users and, seems, should not backpatched. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
> On 4 April 2018 at 11:52, Teodor Sigaev <teodor@sigaev.ru> wrote: >>>> the consistency of FTS.I think this is a bug, which needs to be fixed, >>>> else inconsistency with existing full text search will be gets >>>> deeper. > > Hm, seems, it's useful feature, but I suggest to make separate function > jsonb_any_to_tsvector and add support for boolean too (if you know better > name for function, do not hide it). Changing behavior of existing function > is not obvious for users and, seems, should not backpatched. What do you think about having not a separate function, but a flag argument to the existing one (like `create` in `jsonb_set`), that will have false as default value? The result would be the same, but without an extra function with almost the same implementation.
>> Hm, seems, it's useful feature, but I suggest to make separate function >> jsonb_any_to_tsvector and add support for boolean too (if you know better >> name for function, do not hide it). Changing behavior of existing function >> is not obvious for users and, seems, should not backpatched. > > What do you think about having not a separate function, but a flag argument to > the existing one (like `create` in `jsonb_set`), that will have false as > default value? The result would be the same, but without an extra function with > almost the same implementation. tsvector jsonb_to_tsvector(jsonb[, bool]) ? Agreed. Second arg should be optional. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
> On 4 April 2018 at 16:09, Teodor Sigaev <teodor@sigaev.ru> wrote: > >>> Hm, seems, it's useful feature, but I suggest to make separate function >>> jsonb_any_to_tsvector and add support for boolean too (if you know better >>> name for function, do not hide it). Changing behavior of existing >>> function >>> is not obvious for users and, seems, should not backpatched. >> >> >> What do you think about having not a separate function, but a flag >> argument to >> the existing one (like `create` in `jsonb_set`), that will have false as >> default value? The result would be the same, but without an extra function >> with >> almost the same implementation. > > > tsvector jsonb_to_tsvector(jsonb[, bool]) ? > Agreed. Second arg should be optional. Unfortunately, this idea with a flag argument can't be implemented easily (related discussion is here [1]). So I've modified the patch accordingly to your original suggestion about having separate functions `json(b)_all_to_tsvector`. 1: https://www.postgresql.org/message-id/flat/CA%2Bq6zcVJ%2BWx%2B-%3DkkN5UC0T-LtsJWnx0g9S0xSnn3jUWkriufDA%40mail.gmail.com
Attachment
1) I don't like jsonb_all_to_tsvector too.. What if we will accept new variant to index? Let me suggest: tsvector jsonb_to_tsvector([regclass,] jsonb, text[]) where text[] arg is actually a flags, array contains any combination of literals 'numeric', 'string', 'boolean' (and even 'key' to index keys_ to point which types should be indexed. More than it, may be, it should a jsonb type for possible improvements in future. For now, it shouldbe a jsonb array type with string elements described above, example: select jsonb_to_tsvector('{"a": "aaa in bbb ddd ccc", "b":123}', '["numeric", "boolean"]'); Form jsonb_to_tsvector('...', '["string"]) is effectively the same as current to_tsvector(jsonb) 2) Now it fails, and I see something strange in resuling tsvector: 'true':9,13 and 'fals':9,13 - I don't see any bool keys in input json. % more /home/teodor/pgsql/src/test/regress/regression.diffs *** /home/teodor/pgsql/src/test/regress/expected/jsonb.out 2018-04-06 16:34:59.424481000 +0300 --- /home/teodor/pgsql/src/test/regress/results/jsonb.out 2018-04-06 16:36:48.095411000 +0300 *************** *** 4132,4138 **** select jsonb_all_to_tsvector('english', '{"a": "aaa in bbb ddd ccc", "b": 123, "c": 456}'::jsonb); jsonb_all_to_tsvector -------------------------------------------------------------- ! '123':7 '456':11 'aaa':1 'bbb':3 'ccc':5 'ddd':4 'true':9,13 (1 row) -- ts_vector corner cases --- 4132,4138 ---- select jsonb_all_to_tsvector('english', '{"a": "aaa in bbb ddd ccc", "b": 123, "c": 456}'::jsonb); jsonb_all_to_tsvector -------------------------------------------------------------- ! '123':7 '456':11 'aaa':1 'bbb':3 'ccc':5 'ddd':4 'fals':9,13 (1 row) -- ts_vector corner cases Dmitry Dolgov wrote: >> On 4 April 2018 at 16:09, Teodor Sigaev <teodor@sigaev.ru> wrote: >> >>>> Hm, seems, it's useful feature, but I suggest to make separate function >>>> jsonb_any_to_tsvector and add support for boolean too (if you know better >>>> name for function, do not hide it). Changing behavior of existing >>>> function >>>> is not obvious for users and, seems, should not backpatched. >>> >>> >>> What do you think about having not a separate function, but a flag >>> argument to >>> the existing one (like `create` in `jsonb_set`), that will have false as >>> default value? The result would be the same, but without an extra function >>> with >>> almost the same implementation. >> >> >> tsvector jsonb_to_tsvector(jsonb[, bool]) ? >> Agreed. Second arg should be optional. > > Unfortunately, this idea with a flag argument can't be implemented easily > (related discussion is here [1]). So I've modified the patch accordingly to > your original suggestion about having separate functions > `json(b)_all_to_tsvector`. > > 1: https://www.postgresql.org/message-id/flat/CA%2Bq6zcVJ%2BWx%2B-%3DkkN5UC0T-LtsJWnx0g9S0xSnn3jUWkriufDA%40mail.gmail.com > -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
> On 6 April 2018 at 16:25, Teodor Sigaev <teodor@sigaev.ru> wrote: > 1) I don't like jsonb_all_to_tsvector too.. What if we will accept new > variant to index? Let me suggest: > > tsvector jsonb_to_tsvector([regclass,] jsonb, text[]) > > where text[] arg is actually a flags, array contains any combination of > literals 'numeric', 'string', 'boolean' (and even 'key' to index keys_ to > point which types should be indexed. More than it, may be, it should a jsonb > type for possible improvements in future. For now, it shouldbe a jsonb array > type with string elements described above, example: > > select jsonb_to_tsvector('{"a": "aaa in bbb ddd ccc", "b":123}', > '["numeric", "boolean"]'); > > > Form jsonb_to_tsvector('...', '["string"]) is effectively the same as > current to_tsvector(jsonb) Thank you for the suggestion, this sounds appealing. But I have two questions: * why it should be a jsonb array, not a regular array? * it would introduce the idea of jsonb element type expressed in text format, so "string", "numeric", "boolean" etc - are there any consequences of that? As far as I understand so far there was only jsonb_typeof. > 2) Now it fails, and I see something strange in resuling tsvector Oh, sorry, stupid copy-paste mistake in the condition. Just for the records, I've attached fixed version of the previous patch (without any changes about an array instead of a boolean flag).
Attachment
Dmitry Dolgov wrote: >> On 6 April 2018 at 16:25, Teodor Sigaev <teodor@sigaev.ru> wrote: >> 1) I don't like jsonb_all_to_tsvector too.. What if we will accept new >> variant to index? Let me suggest: >> >> tsvector jsonb_to_tsvector([regclass,] jsonb, text[]) >> >> where text[] arg is actually a flags, array contains any combination of >> literals 'numeric', 'string', 'boolean' (and even 'key' to index keys_ to >> point which types should be indexed. More than it, may be, it should a jsonb >> type for possible improvements in future. For now, it shouldbe a jsonb array >> type with string elements described above, example: >> >> select jsonb_to_tsvector('{"a": "aaa in bbb ddd ccc", "b":123}', >> '["numeric", "boolean"]'); >> >> >> Form jsonb_to_tsvector('...', '["string"]) is effectively the same as >> current to_tsvector(jsonb) > > Thank you for the suggestion, this sounds appealing. But I have two questions: > > * why it should be a jsonb array, not a regular array? To simplify extension of this array in future, for example to add limitation of recursion level in converted jsonb, etc. > > * it would introduce the idea of jsonb element type expressed in text format, > so "string", "numeric", "boolean" etc - are there any consequences of that? > As far as I understand so far there was only jsonb_typeof. Good catch, jsonb_typeof strings are okay: "string", "number", "boolean" also "all", "key", "value" See workable sketch for parsing jsonb flags and new worker variant. >> 2) Now it fails, and I see something strange in resuling tsvector > > Oh, sorry, stupid copy-paste mistake in the condition. Just for the records, > I've attached fixed version of the previous patch (without any changes about an > array instead of a boolean flag). by the way: Datum jsonb_to_tsvector(PG_FUNCTION_ARGS) { Jsonb *jb = PG_GETARG_JSONB_P(0); ... PG_FREE_IF_COPY(jb, 1); //wrong arg number } jsonb_all_to_tsvector and json variants too -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Attachment
> On 6 April 2018 at 18:55, Teodor Sigaev <teodor@sigaev.ru> wrote: > > > Dmitry Dolgov wrote: >>> >>> On 6 April 2018 at 16:25, Teodor Sigaev <teodor@sigaev.ru> wrote: >>> 1) I don't like jsonb_all_to_tsvector too.. What if we will accept new >>> variant to index? Let me suggest: >>> >>> tsvector jsonb_to_tsvector([regclass,] jsonb, text[]) >>> >>> where text[] arg is actually a flags, array contains any combination of >>> literals 'numeric', 'string', 'boolean' (and even 'key' to index keys_ to >>> point which types should be indexed. More than it, may be, it should a >>> jsonb >>> type for possible improvements in future. For now, it shouldbe a jsonb >>> array >>> type with string elements described above, example: >>> >>> select jsonb_to_tsvector('{"a": "aaa in bbb ddd ccc", "b":123}', >>> '["numeric", "boolean"]'); >>> >>> >>> Form jsonb_to_tsvector('...', '["string"]) is effectively the same as >>> current to_tsvector(jsonb) >> >> >> Thank you for the suggestion, this sounds appealing. But I have two >> questions: >> >> * why it should be a jsonb array, not a regular array? > > To simplify extension of this array in future, for example to add limitation > of recursion level in converted jsonb, etc. > > >> >> * it would introduce the idea of jsonb element type expressed in text >> format, >> so "string", "numeric", "boolean" etc - are there any consequences of >> that? >> As far as I understand so far there was only jsonb_typeof. > > Good catch, jsonb_typeof strings are okay: "string", "number", "boolean" > also "all", "key", "value" > > See workable sketch for parsing jsonb flags and new worker variant. Yep, thanks for the sketch. Here is the new version of patch, does it look close to what you have in mind?
Attachment
>> See workable sketch for parsing jsonb flags and new worker variant. > > Yep, thanks for the sketch. Here is the new version of patch, does it look > close to what you have in mind? Patch looks good except error messaging, you took it directly from sketch where I didn't spend time for it. Please, improve. elog() should be used only for impossible error, whereas user input could contins mistakes. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
> On 7 April 2018 at 17:09, Teodor Sigaev <teodor@sigaev.ru> wrote: >>> See workable sketch for parsing jsonb flags and new worker variant. >> >> >> Yep, thanks for the sketch. Here is the new version of patch, does it look >> close to what you have in mind? > > > Patch looks good except error messaging, you took it directly from sketch > where I didn't spend time for it. Please, improve. elog() should be used > only for impossible error, whereas user input could contins mistakes. I assume what you mean is that for user input errors we need to use ereport. Indeed, thanks for noticing. I've replaced all elog except the last one, since it actually describes an impossible situation, when we started to read an array, but ended up having something else instead WJB_END_ARRAY.
Attachment
Thank you, pushed with some editorization Dmitry Dolgov wrote: >> On 7 April 2018 at 17:09, Teodor Sigaev <teodor@sigaev.ru> wrote: >>>> See workable sketch for parsing jsonb flags and new worker variant. >>> >>> >>> Yep, thanks for the sketch. Here is the new version of patch, does it look >>> close to what you have in mind? >> >> >> Patch looks good except error messaging, you took it directly from sketch >> where I didn't spend time for it. Please, improve. elog() should be used >> only for impossible error, whereas user input could contins mistakes. > > I assume what you mean is that for user input errors we need to use ereport. > Indeed, thanks for noticing. I've replaced all elog except the last one, since > it actually describes an impossible situation, when we started to read an > array, but ended up having something else instead WJB_END_ARRAY. > -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/