Thread: [HACKERS] [PATCH] few fts functions for jsonb
Hi all
I would like to propose patch with a set of new small functions for fts in case of
jsonb data type:
* to_tsvector(config, jsonb) - make a tsvector from all string values and
elements of jsonb object. To prevent the situation, when tsquery can find a
phrase consisting of lexemes from two different values/elements, this
function will add an increment to position of each lexeme from every new
value/element.
* ts_headline(config, jsonb, tsquery, options) - generate a headline directly
from jsonb object
Here are the examples how they work:
```
=# select to_tsvector('{"a": "aaa bbb", "b": ["ccc ddd"], "c": {"d": "eee fff"}}'::jsonb);
to_tsvector
-------------------------------------------------
'aaa':1 'bbb':2 'ccc':4 'ddd':5 'eee':7 'fff':8
(1 row)
=# select ts_headline('english', '{"a": "aaa bbb", "b": {"c": "ccc ddd"}}'::jsonb, tsquery('bbb & ddd & hhh'), 'StartSel = <, StopSel = >');
ts_headline
----------------------
aaa <bbb> ccc <ddd>
(1 row)
```
Any comments or suggestions?
Attachment
The proposed patch looks not very important, but I consider it as an important feature, which Oracle and Microsoft already have, that's why I asked Dmitry to work on this and made it before feature freeze. My comments follows below the post.
On Tue, Feb 28, 2017 at 1:59 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
1. add json support
Hi allI would like to propose patch with a set of new small functions for fts in case ofjsonb data type:* to_tsvector(config, jsonb) - make a tsvector from all string values andelements of jsonb object. To prevent the situation, when tsquery can find aphrase consisting of lexemes from two different values/elements, thisfunction will add an increment to position of each lexeme from every newvalue/element.* ts_headline(config, jsonb, tsquery, options) - generate a headline directlyfrom jsonb objectHere are the examples how they work:```=# select to_tsvector('{"a": "aaa bbb", "b": ["ccc ddd"], "c": {"d": "eee fff"}}'::jsonb);to_tsvector------------------------------------------------- 'aaa':1 'bbb':2 'ccc':4 'ddd':5 'eee':7 'fff':8(1 row)=# select ts_headline('english', '{"a": "aaa bbb", "b": {"c": "ccc ddd"}}'::jsonb, tsquery('bbb & ddd & hhh'), 'StartSel = <, StopSel = >');ts_headline----------------------aaa <bbb> ccc <ddd>(1 row)```
Any comments or suggestions?
1. add json support
2. Its_headline should returns the original json with highlighting. As a first try the proposed ts_headline could be ok, probably need special option.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
> On 28 February 2017 at 19:21, Oleg Bartunov <obartunov@gmail.com> wrote:
> 1. add json support
> 1. add json support
I've added json support for all functions.
> Its_headline should returns the original json with highlighting
Yes, I see now. I don't think it's worth it to add a special option for that
purpose, so I've just changed the implementation to return the original json(b).
Attachment
On 03/10/2017 11:13 AM, Dmitry Dolgov wrote: > > On 28 February 2017 at 19:21, Oleg Bartunov <obartunov@gmail.com > <mailto:obartunov@gmail.com>> wrote: > > 1. add json support > > I've added json support for all functions. > > > Its_headline should returns the original json with highlighting > > Yes, I see now. I don't think it's worth it to add a special option > for that > purpose, so I've just changed the implementation to return the > original json(b). > This is a pretty good idea. However, I think it should probably be broken up into a couple of pieces - one for the generic json/jsonb transforms infrastructure (which probably needs some more comments) and one for the FTS functions that will use it. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On 21 March 2017 at 03:03, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
>
> However, I think it should probably be broken up into a couple of pieces -
> one for the generic json/jsonb transforms infrastructure (which probably
> needs some more comments) and one for the FTS functions that will use it.
Sure, here are two patches with separated functionality and a bit more
commentaries for the transform functions.
>
> However, I think it should probably be broken up into a couple of pieces -
> one for the generic json/jsonb transforms infrastructure (which probably
> needs some more comments) and one for the FTS functions that will use it.
Sure, here are two patches with separated functionality and a bit more
commentaries for the transform functions.
Attachment
On 03/21/2017 06:28 PM, Dmitry Dolgov wrote: > > On 21 March 2017 at 03:03, Andrew Dunstan > <andrew.dunstan@2ndquadrant.com > <mailto:andrew.dunstan@2ndquadrant.com>> wrote: > > > > However, I think it should probably be broken up into a couple of > pieces - > > one for the generic json/jsonb transforms infrastructure (which probably > > needs some more comments) and one for the FTS functions that will > use it. > > Sure, here are two patches with separated functionality and a bit more > commentaries for the transform functions. I'm not through looking at this. However, here are a few preliminary comments * we might need to rationalize the header locations a bit * iterate_json(b) and transform_json(b) are a bit too generallynamed. Really what they do is iterate over or transform string values in the json(b). They ignore / preservethe structure, keys, and non-string scalar values in the json(b). A general iterate or transform function wouldbe called in effect with a stream of all the elements in the json, not just scalar strings. * Unless I'm missing somethingthe iterate_json(b)_values return value is ignored. Instead of returning the state it looks to me like it shouldreturn nothing and be declared as void instead of void * * transform_jsonb and transform_json are somewhat asymmetrical.The latter should probably return a text* instead of a StringInfo, to be consistent with the former. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> I'm not through looking at this. However, here are a few preliminary comments
I've attached new versions of the patches with improvements related to these commentaries.
I've attached new versions of the patches with improvements related to these commentaries.
Attachment
On 26 March 2017 at 17:57, Dmitry Dolgov <9erthalion6@gmail.com> wrote: >> I'm not through looking at this. However, here are a few preliminary >> comments > > I've attached new versions of the patches with improvements related to these > commentaries. These patches seem fundamentally OK. But I'm still not happy with the naming etc. I think the header changes would probably be better placed in jsonapi.h or in a new header file. And the names still seem too general to me. e.g. transform_json_values should probably be transform_json_string_values, and the static support functions should be renamed to match. Also the JsonIterateAction and JsonTransformAction funtion typedefs should probably be renamed to match. I'm not sure there is any great point in the is_jsonb_data macro, which is only used in one spot. I would get rid of it and expand the test in place. I don't have much time this week to work on it, as there are one or two other patches I also want to look at. If you clean these things up I will commit it. The second patch looks fine. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On 29 March 2017 at 18:28, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
>
> These patches seem fundamentally OK. But I'm still not happy with the
> naming etc.
I've changed names for all functions and action definitions, moved out the
changes in header file to `jsonapi.h` and removed `is_jsonb_data` macro. So it
should be better now.
Attachment
On 29 March 2017 at 16:19, Dmitry Dolgov <9erthalion6@gmail.com> wrote: >> On 29 March 2017 at 18:28, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> >> wrote: >> >> These patches seem fundamentally OK. But I'm still not happy with the >> naming etc. > > I've changed names for all functions and action definitions, moved out the > changes in header file to `jsonapi.h` and removed `is_jsonb_data` macro. So > it > should be better now. I have just noticed as I was writing/testing the non-existent docs for this patch that it doesn't supply variants of to_tsvector that take a regconfig as the first argument. Is there a reason for that? Why should the json(b) versions be different from the text versions? cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 31 March 2017 at 00:01, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
>
> I have just noticed as I was writing/testing the non-existent docs for
> this patch that it doesn't supply variants of to_tsvector that take a
> regconfig as the first argument. Is there a reason for that? Why
> should the json(b) versions be different from the text versions?
No, there is no reason, I just missed that. Here is a new version of the patch (only the functions part)
>
> I have just noticed as I was writing/testing the non-existent docs for
> this patch that it doesn't supply variants of to_tsvector that take a
> regconfig as the first argument. Is there a reason for that? Why
> should the json(b) versions be different from the text versions?
No, there is no reason, I just missed that. Here is a new version of the patch (only the functions part)
to add those variants.
Attachment
On 30 Mar 2017 23:43, "Dmitry Dolgov" <9erthalion6@gmail.com> wrote:
On 31 March 2017 at 00:01, Andrew Dunstan <andrew.dunstan@2ndquadrant.coNo, there is no reason, I just missed that. Here is a new version of the patch (only the functions part)m> wrote:
>
> I have just noticed as I was writing/testing the non-existent docs for
> this patch that it doesn't supply variants of to_tsvector that take a
> regconfig as the first argument. Is there a reason for that? Why
> should the json(b) versions be different from the text versions?to add those variants.
Congratulations with patch committed, who will write an addition documentation? I think we need to touch FTS and JSON parts.
On 03/31/2017 03:17 PM, Oleg Bartunov wrote: > > > On 30 Mar 2017 23:43, "Dmitry Dolgov" <9erthalion6@gmail.com > <mailto:9erthalion6@gmail.com>> wrote: > > On 31 March 2017 at 00:01, Andrew Dunstan > <andrew.dunstan@2ndquadrant.com > <mailto:andrew.dunstan@2ndquadrant.com>> wrote: > > > > I have just noticed as I was writing/testing the non-existent > docs for > > this patch that it doesn't supply variants of to_tsvector that > take a > > regconfig as the first argument. Is there a reason for that? Why > > should the json(b) versions be different from the text versions? > > No, there is no reason, I just missed that. Here is a new version > of the patch (only the functions part) > to add those variants. > > > Congratulations with patch committed, who will write an addition > documentation? I think we need to touch FTS and JSON parts. I added documentation when I committed it for the new functions, in the FTS section. I'm not sure what we need to add to the JSON section if anything. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-04-01 16:20:46 -0400, Andrew Dunstan wrote: > > > On 03/31/2017 03:17 PM, Oleg Bartunov wrote: > > > > > > On 30 Mar 2017 23:43, "Dmitry Dolgov" <9erthalion6@gmail.com > > <mailto:9erthalion6@gmail.com>> wrote: > > > > On 31 March 2017 at 00:01, Andrew Dunstan > > <andrew.dunstan@2ndquadrant.com > > <mailto:andrew.dunstan@2ndquadrant.com>> wrote: > > > > > > I have just noticed as I was writing/testing the non-existent > > docs for > > > this patch that it doesn't supply variants of to_tsvector that > > take a > > > regconfig as the first argument. Is there a reason for that? Why > > > should the json(b) versions be different from the text versions? > > > > No, there is no reason, I just missed that. Here is a new version > > of the patch (only the functions part) > > to add those variants. > > > > > > Congratulations with patch committed, who will write an addition > > documentation? I think we need to touch FTS and JSON parts. > I added documentation when I committed it for the new functions, in the > FTS section. I'm not sure what we need to add to the JSON section if > anything. I see that the CF entry for this hasn't been marked as committed: https://commitfest.postgresql.org/13/1054/ Is there anything left here? - Andres
On 01.04.2017 22:20, Andrew Dunstan wrote: > I added documentation when I committed it for the new functions, in the > FTS section. I'm not sure what we need to add to the JSON section if > anything. Not sure, if this is related but the formatting of https://www.postgresql.org/docs/devel/static/functions-textsearch.html looks a bit strange. Just 2 questions/notes: 1) in what order are the values of the JSON extracted? 2) Regarding the additional line: to_tsvector([ config regconfig , ] document json(b)) tsvector reduce document text to tsvector to_tsvector('english', '{"a": "The Fat Rats"}'::json) 'fat':2 'rat':3 Maybe change "reduce document text to tsvector" to "extracting JSON values <in what order> and reduce to tsvector"? Sven
On 04/03/2017 02:22 PM, Andres Freund wrote: > On 2017-04-01 16:20:46 -0400, Andrew Dunstan wrote: >> >> On 03/31/2017 03:17 PM, Oleg Bartunov wrote: >>> >>> On 30 Mar 2017 23:43, "Dmitry Dolgov" <9erthalion6@gmail.com >>> <mailto:9erthalion6@gmail.com>> wrote: >>> >>> On 31 March 2017 at 00:01, Andrew Dunstan >>> <andrew.dunstan@2ndquadrant.com >>> <mailto:andrew.dunstan@2ndquadrant.com>> wrote: >>> > >>> > I have just noticed as I was writing/testing the non-existent >>> docs for >>> > this patch that it doesn't supply variants of to_tsvector that >>> take a >>> > regconfig as the first argument. Is there a reason for that? Why >>> > should the json(b) versions be different from the text versions? >>> >>> No, there is no reason, I just missed that. Here is a new version >>> of the patch (only the functions part) >>> to add those variants. >>> >>> >>> Congratulations with patch committed, who will write an addition >>> documentation? I think we need to touch FTS and JSON parts. >> I added documentation when I committed it for the new functions, in the >> FTS section. I'm not sure what we need to add to the JSON section if >> anything. > I see that the CF entry for this hasn't been marked as committed: > https://commitfest.postgresql.org/13/1054/ > Is there anything left here? > Says "Status committed" for me. I fixed this in Sunday after Tom prodded me. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 04/03/2017 02:44 PM, Sven R. Kunze wrote: > On 01.04.2017 22:20, Andrew Dunstan wrote: >> I added documentation when I committed it for the new functions, in the >> FTS section. I'm not sure what we need to add to the JSON section if >> anything. > > Not sure, if this is related but the formatting of > https://www.postgresql.org/docs/devel/static/functions-textsearch.html > looks a bit strange. > > Just 2 questions/notes: > 1) in what order are the values of the JSON extracted? In the order they exist in the underlying document. > > 2) Regarding the additional line: > to_tsvector([ config regconfig , ] document json(b)) tsvector > reduce document text to tsvector to_tsvector('english', '{"a": "The > Fat Rats"}'::json) 'fat':2 'rat':3 > > Maybe change "reduce document text to tsvector" to "extracting JSON > values <in what order> and reduce to tsvector"? > > OK, I will do something along those lines. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 03.04.2017 21:30, Andrew Dunstan wrote: > On 04/03/2017 02:44 PM, Sven R. Kunze wrote: >> On 01.04.2017 22:20, Andrew Dunstan wrote: >>> I added documentation when I committed it for the new functions, in the >>> FTS section. I'm not sure what we need to add to the JSON section if >>> anything. >> Not sure, if this is related but the formatting of >> https://www.postgresql.org/docs/devel/static/functions-textsearch.html >> looks a bit strange. >> >> Just 2 questions/notes: >> 1) in what order are the values of the JSON extracted? > In the order they exist in the underlying document. Just asking as the order can have implications for fulltext searches. So, might be valuable for the docs. Are these documents equally ordered in this sense? srkunze=# select '{"a": "abc", "b": "def"}'::jsonb; jsonb -------------------------- {"a": "abc", "b": "def"} (1 row) srkunze=# select '{"b": "def", "a": "abc"}'::jsonb; jsonb -------------------------- {"a": "abc", "b": "def"} (1 row) Also what about non-ascii keys? Are they ordered by the default locale of the PostgreSQL cluster (say de_DE.utf-8)? >> 2) Regarding the additional line: >> to_tsvector([ config regconfig , ] document json(b)) tsvector >> reduce document text to tsvector to_tsvector('english', '{"a": "The >> Fat Rats"}'::json) 'fat':2 'rat':3 >> >> Maybe change "reduce document text to tsvector" to "extracting JSON >> values <in what order> and reduce to tsvector"? >> >> > > OK, I will do something along those lines. > > cheers > > andrew >
On 04/03/2017 03:41 PM, Sven R. Kunze wrote: > On 03.04.2017 21:30, Andrew Dunstan wrote: >> On 04/03/2017 02:44 PM, Sven R. Kunze wrote: >>> On 01.04.2017 22:20, Andrew Dunstan wrote: >>>> I added documentation when I committed it for the new functions, in >>>> the >>>> FTS section. I'm not sure what we need to add to the JSON section if >>>> anything. >>> Not sure, if this is related but the formatting of >>> https://www.postgresql.org/docs/devel/static/functions-textsearch.html >>> looks a bit strange. >>> >>> Just 2 questions/notes: >>> 1) in what order are the values of the JSON extracted? >> In the order they exist in the underlying document. > > Just asking as the order can have implications for fulltext searches. > So, might be valuable for the docs. > > > Are these documents equally ordered in this sense? > > srkunze=# select '{"a": "abc", "b": "def"}'::jsonb; > jsonb > -------------------------- > {"a": "abc", "b": "def"} > (1 row) > > srkunze=# select '{"b": "def", "a": "abc"}'::jsonb; > jsonb > -------------------------- > {"a": "abc", "b": "def"} > (1 row) > Yes, when converted to jsonb these two documents are identical. > > Also what about non-ascii keys? Are they ordered by the default locale > of the PostgreSQL cluster (say de_DE.utf-8)? Yes, I believe so. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services