Thread: [HACKERS] [PATCH] few fts functions for jsonb

[HACKERS] [PATCH] few fts functions for jsonb

From
Dmitry Dolgov
Date:
Hi all

I would like to propose patch with a set of new small functions for fts in case of
jsonb data type:

* to_tsvector(config, jsonb) - make a tsvector from all string values and
  elements of jsonb object. To prevent the situation, when tsquery can find a
  phrase consisting of lexemes from two different values/elements, this
  function will add an increment to position of each lexeme from every new
  value/element.

* ts_headline(config, jsonb, tsquery, options) - generate a headline directly
  from jsonb object

Here are the examples how they work:

```
=# select to_tsvector('{"a": "aaa bbb", "b": ["ccc ddd"], "c": {"d": "eee fff"}}'::jsonb);
                   to_tsvector
-------------------------------------------------
 'aaa':1 'bbb':2 'ccc':4 'ddd':5 'eee':7 'fff':8
(1 row)


=# select ts_headline('english', '{"a": "aaa bbb", "b": {"c": "ccc ddd"}}'::jsonb, tsquery('bbb & ddd & hhh'), 'StartSel = <, StopSel = >');
     ts_headline
----------------------
 aaa <bbb> ccc <ddd>
(1 row)
```

Any comments or suggestions?
Attachment

Re: [HACKERS] [PATCH] few fts functions for jsonb

From
Oleg Bartunov
Date:
The proposed patch looks not very important, but I consider it as an important feature, which Oracle and Microsoft already have, that's why I asked Dmitry to work on this and made it before feature freeze. My comments follows below the post.

On Tue, Feb 28, 2017 at 1:59 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Hi all

I would like to propose patch with a set of new small functions for fts in case of
jsonb data type:

* to_tsvector(config, jsonb) - make a tsvector from all string values and
  elements of jsonb object. To prevent the situation, when tsquery can find a
  phrase consisting of lexemes from two different values/elements, this
  function will add an increment to position of each lexeme from every new
  value/element.

* ts_headline(config, jsonb, tsquery, options) - generate a headline directly
  from jsonb object

Here are the examples how they work:

```
=# select to_tsvector('{"a": "aaa bbb", "b": ["ccc ddd"], "c": {"d": "eee fff"}}'::jsonb);
                   to_tsvector
-------------------------------------------------
 'aaa':1 'bbb':2 'ccc':4 'ddd':5 'eee':7 'fff':8
(1 row)


=# select ts_headline('english', '{"a": "aaa bbb", "b": {"c": "ccc ddd"}}'::jsonb, tsquery('bbb & ddd & hhh'), 'StartSel = <, StopSel = >');
     ts_headline
----------------------
 aaa <bbb> ccc <ddd>
(1 row)
```

Any comments or suggestions?

1. add json support
2. Its_headline  should returns the original json with highlighting.  As a first try the proposed ts_headline  could be ok, probably need special option.



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] few fts functions for jsonb

From
Dmitry Dolgov
Date:
> On 28 February 2017 at 19:21, Oleg Bartunov <obartunov@gmail.com> wrote:
> 1. add json support

I've added json support for all functions.

>  Its_headline  should returns the original json with highlighting

Yes, I see now. I don't think it's worth it to add a special option for that
purpose, so I've just changed the implementation to return the original json(b).
Attachment

Re: [HACKERS] [PATCH] few fts functions for jsonb

From
Andrew Dunstan
Date:

On 03/10/2017 11:13 AM, Dmitry Dolgov wrote:
> > On 28 February 2017 at 19:21, Oleg Bartunov <obartunov@gmail.com
> <mailto:obartunov@gmail.com>> wrote:
> > 1. add json support
>
> I've added json support for all functions.
>
> >  Its_headline  should returns the original json with highlighting
>
> Yes, I see now. I don't think it's worth it to add a special option
> for that
> purpose, so I've just changed the implementation to return the
> original json(b).
>


This is a pretty good idea.

However, I think it should probably be broken up into a couple of pieces
- one for the generic json/jsonb transforms infrastructure (which
probably needs some more comments) and one for the FTS functions that
will use it.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [PATCH] few fts functions for jsonb

From
Dmitry Dolgov
Date:
> On 21 March 2017 at 03:03, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
>
> However, I think it should probably be broken up into a couple of pieces -
> one for the generic json/jsonb transforms infrastructure (which probably
> needs some more comments) and one for the FTS functions that will use it.

Sure, here are two patches with separated functionality and a bit more
commentaries for the transform functions.
Attachment

Re: [HACKERS] [PATCH] few fts functions for jsonb

From
Andrew Dunstan
Date:

On 03/21/2017 06:28 PM, Dmitry Dolgov wrote:
> > On 21 March 2017 at 03:03, Andrew Dunstan
> <andrew.dunstan@2ndquadrant.com
> <mailto:andrew.dunstan@2ndquadrant.com>> wrote:
> >
> > However, I think it should probably be broken up into a couple of
> pieces -
> > one for the generic json/jsonb transforms infrastructure (which probably
> > needs some more comments) and one for the FTS functions that will
> use it.
>
> Sure, here are two patches with separated functionality and a bit more
> commentaries for the transform functions.

I'm not through looking at this. However, here are a few preliminary
comments
 * we might need to rationalize the header locations a bit * iterate_json(b) and transform_json(b) are a bit too
generallynamed.   Really what they do is iterate over or transform string values in   the json(b). They ignore /
preservethe structure, keys, and   non-string scalar values in the json(b). A general iterate or   transform function
wouldbe called in effect with a stream of all   the elements in the json, not just scalar strings. * Unless I'm missing
somethingthe iterate_json(b)_values return value   is ignored. Instead of returning the state it looks to me like it
shouldreturn nothing and be declared as void instead of void * * transform_jsonb and transform_json are somewhat
asymmetrical.The   latter should probably return a text* instead of a StringInfo, to be   consistent with the former.
 

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] few fts functions for jsonb

From
Dmitry Dolgov
Date:
> I'm not through looking at this. However, here are a few preliminary comments

I've attached new versions of the patches with improvements related to these commentaries.
Attachment

Re: [PATCH] few fts functions for jsonb

From
Andrew Dunstan
Date:
On 26 March 2017 at 17:57, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> I'm not through looking at this. However, here are a few preliminary
>> comments
>
> I've attached new versions of the patches with improvements related to these
> commentaries.

These patches seem fundamentally OK. But I'm still not happy with the
naming etc.

I think the header changes would probably be better placed in
jsonapi.h or in a new header file.

And the names still seem too general to me. e.g. transform_json_values
should probably be transform_json_string_values, and the static
support functions should be renamed to match. Also the
JsonIterateAction and JsonTransformAction funtion typedefs should
probably be renamed to match.

I'm not sure there is any great point in the is_jsonb_data macro,
which is only used in one spot. I would get rid of it and expand the
test in place.

I don't have much time this week to work on it, as there are one or
two other patches I also want to look at.  If you clean these things
up I will commit it. The second patch looks fine.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] few fts functions for jsonb

From
Dmitry Dolgov
Date:
> On 29 March 2017 at 18:28, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
>
> These patches seem fundamentally OK. But I'm still not happy with the
> naming etc.

I've changed names for all functions and action definitions, moved out the
changes in header file to `jsonapi.h` and removed `is_jsonb_data` macro. So it
should be better now.
Attachment

Re: [PATCH] few fts functions for jsonb

From
Andrew Dunstan
Date:
On 29 March 2017 at 16:19, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> On 29 March 2017 at 18:28, Andrew Dunstan <andrew.dunstan@2ndquadrant.com>
>> wrote:
>>
>> These patches seem fundamentally OK. But I'm still not happy with the
>> naming etc.
>
> I've changed names for all functions and action definitions, moved out the
> changes in header file to `jsonapi.h` and removed `is_jsonb_data` macro. So
> it
> should be better now.

I have just noticed as I was writing/testing the non-existent docs for
this patch that it doesn't supply variants of to_tsvector that take a
regconfig as the first argument. Is there a reason for that? Why
should the json(b) versions be different from the text versions?

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] few fts functions for jsonb

From
Dmitry Dolgov
Date:
On 31 March 2017 at 00:01, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
>
> I have just noticed as I was writing/testing the non-existent docs for
> this patch that it doesn't supply variants of to_tsvector that take a
> regconfig as the first argument. Is there a reason for that? Why
> should the json(b) versions be different from the text versions?

No, there is no reason, I just missed that. Here is a new version of the patch (only the functions part)
to add those variants.
Attachment

Re: [PATCH] few fts functions for jsonb

From
Oleg Bartunov
Date:


On 30 Mar 2017 23:43, "Dmitry Dolgov" <9erthalion6@gmail.com> wrote:
On 31 March 2017 at 00:01, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
>
> I have just noticed as I was writing/testing the non-existent docs for
> this patch that it doesn't supply variants of to_tsvector that take a
> regconfig as the first argument. Is there a reason for that? Why
> should the json(b) versions be different from the text versions?

No, there is no reason, I just missed that. Here is a new version of the patch (only the functions part)
to add those variants.

Congratulations with patch committed, who will write an addition documentation? I think we need to touch  FTS and JSON parts.

Re: [PATCH] few fts functions for jsonb

From
Andrew Dunstan
Date:

On 03/31/2017 03:17 PM, Oleg Bartunov wrote:
>
>
> On 30 Mar 2017 23:43, "Dmitry Dolgov" <9erthalion6@gmail.com
> <mailto:9erthalion6@gmail.com>> wrote:
>
>     On 31 March 2017 at 00:01, Andrew Dunstan
>     <andrew.dunstan@2ndquadrant.com
>     <mailto:andrew.dunstan@2ndquadrant.com>> wrote:
>     >
>     > I have just noticed as I was writing/testing the non-existent
>     docs for
>     > this patch that it doesn't supply variants of to_tsvector that
>     take a
>     > regconfig as the first argument. Is there a reason for that? Why
>     > should the json(b) versions be different from the text versions?
>
>     No, there is no reason, I just missed that. Here is a new version
>     of the patch (only the functions part)
>     to add those variants.
>
>
> Congratulations with patch committed, who will write an addition
> documentation? I think we need to touch  FTS and JSON parts.


I added documentation when I committed it for the new functions, in the
FTS section. I'm not sure what we need to add to the JSON section if
anything.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] few fts functions for jsonb

From
Andres Freund
Date:
On 2017-04-01 16:20:46 -0400, Andrew Dunstan wrote:
> 
> 
> On 03/31/2017 03:17 PM, Oleg Bartunov wrote:
> >
> >
> > On 30 Mar 2017 23:43, "Dmitry Dolgov" <9erthalion6@gmail.com
> > <mailto:9erthalion6@gmail.com>> wrote:
> >
> >     On 31 March 2017 at 00:01, Andrew Dunstan
> >     <andrew.dunstan@2ndquadrant.com
> >     <mailto:andrew.dunstan@2ndquadrant.com>> wrote:
> >     >
> >     > I have just noticed as I was writing/testing the non-existent
> >     docs for
> >     > this patch that it doesn't supply variants of to_tsvector that
> >     take a
> >     > regconfig as the first argument. Is there a reason for that? Why
> >     > should the json(b) versions be different from the text versions?
> >
> >     No, there is no reason, I just missed that. Here is a new version
> >     of the patch (only the functions part)
> >     to add those variants.
> >
> >
> > Congratulations with patch committed, who will write an addition
> > documentation? I think we need to touch  FTS and JSON parts.

> I added documentation when I committed it for the new functions, in the
> FTS section. I'm not sure what we need to add to the JSON section if
> anything.

I see that the CF entry for this hasn't been marked as committed:
https://commitfest.postgresql.org/13/1054/
Is there anything left here?

- Andres



Re: [PATCH] few fts functions for jsonb

From
"Sven R. Kunze"
Date:
On 01.04.2017 22:20, Andrew Dunstan wrote:
> I added documentation when I committed it for the new functions, in the
> FTS section. I'm not sure what we need to add to the JSON section if
> anything.

Not sure, if this is related but the formatting of 
https://www.postgresql.org/docs/devel/static/functions-textsearch.html 
looks a bit strange.

Just 2 questions/notes:
1) in what order are the values of the JSON extracted?

2) Regarding the additional line:
to_tsvector([ config regconfig , ] document json(b))    tsvector reduce 
document text to tsvector    to_tsvector('english', '{"a": "The Fat 
Rats"}'::json)    'fat':2 'rat':3

Maybe change "reduce document text to tsvector" to "extracting JSON 
values <in what order> and reduce to tsvector"?


Sven



Re: [PATCH] few fts functions for jsonb

From
Andrew Dunstan
Date:

On 04/03/2017 02:22 PM, Andres Freund wrote:
> On 2017-04-01 16:20:46 -0400, Andrew Dunstan wrote:
>>
>> On 03/31/2017 03:17 PM, Oleg Bartunov wrote:
>>>
>>> On 30 Mar 2017 23:43, "Dmitry Dolgov" <9erthalion6@gmail.com
>>> <mailto:9erthalion6@gmail.com>> wrote:
>>>
>>>     On 31 March 2017 at 00:01, Andrew Dunstan
>>>     <andrew.dunstan@2ndquadrant.com
>>>     <mailto:andrew.dunstan@2ndquadrant.com>> wrote:
>>>     >
>>>     > I have just noticed as I was writing/testing the non-existent
>>>     docs for
>>>     > this patch that it doesn't supply variants of to_tsvector that
>>>     take a
>>>     > regconfig as the first argument. Is there a reason for that? Why
>>>     > should the json(b) versions be different from the text versions?
>>>
>>>     No, there is no reason, I just missed that. Here is a new version
>>>     of the patch (only the functions part)
>>>     to add those variants.
>>>
>>>
>>> Congratulations with patch committed, who will write an addition
>>> documentation? I think we need to touch  FTS and JSON parts.
>> I added documentation when I committed it for the new functions, in the
>> FTS section. I'm not sure what we need to add to the JSON section if
>> anything.
> I see that the CF entry for this hasn't been marked as committed:
> https://commitfest.postgresql.org/13/1054/
> Is there anything left here?
>


Says "Status committed" for me. I fixed this in Sunday after Tom prodded me.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] few fts functions for jsonb

From
Andrew Dunstan
Date:

On 04/03/2017 02:44 PM, Sven R. Kunze wrote:
> On 01.04.2017 22:20, Andrew Dunstan wrote:
>> I added documentation when I committed it for the new functions, in the
>> FTS section. I'm not sure what we need to add to the JSON section if
>> anything.
>
> Not sure, if this is related but the formatting of
> https://www.postgresql.org/docs/devel/static/functions-textsearch.html
> looks a bit strange.
>
> Just 2 questions/notes:
> 1) in what order are the values of the JSON extracted?

In the order they exist in the underlying document.

>
> 2) Regarding the additional line:
> to_tsvector([ config regconfig , ] document json(b))    tsvector
> reduce document text to tsvector    to_tsvector('english', '{"a": "The
> Fat Rats"}'::json)    'fat':2 'rat':3
>
> Maybe change "reduce document text to tsvector" to "extracting JSON
> values <in what order> and reduce to tsvector"?
>
>


OK, I will do something along those lines.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] few fts functions for jsonb

From
"Sven R. Kunze"
Date:
On 03.04.2017 21:30, Andrew Dunstan wrote:
> On 04/03/2017 02:44 PM, Sven R. Kunze wrote:
>> On 01.04.2017 22:20, Andrew Dunstan wrote:
>>> I added documentation when I committed it for the new functions, in the
>>> FTS section. I'm not sure what we need to add to the JSON section if
>>> anything.
>> Not sure, if this is related but the formatting of
>> https://www.postgresql.org/docs/devel/static/functions-textsearch.html
>> looks a bit strange.
>>
>> Just 2 questions/notes:
>> 1) in what order are the values of the JSON extracted?
> In the order they exist in the underlying document.

Just asking as the order can have implications for fulltext searches. 
So, might be valuable for the docs.


Are these documents equally ordered in this sense?

srkunze=# select '{"a": "abc", "b": "def"}'::jsonb;          jsonb
-------------------------- {"a": "abc", "b": "def"}
(1 row)

srkunze=# select '{"b": "def", "a": "abc"}'::jsonb;          jsonb
-------------------------- {"a": "abc", "b": "def"}
(1 row)


Also what about non-ascii keys? Are they ordered by the default locale 
of the PostgreSQL cluster (say de_DE.utf-8)?

>> 2) Regarding the additional line:
>> to_tsvector([ config regconfig , ] document json(b))    tsvector
>> reduce document text to tsvector    to_tsvector('english', '{"a": "The
>> Fat Rats"}'::json)    'fat':2 'rat':3
>>
>> Maybe change "reduce document text to tsvector" to "extracting JSON
>> values <in what order> and reduce to tsvector"?
>>
>>
>
> OK, I will do something along those lines.
>
> cheers
>
> andrew
>




Re: [PATCH] few fts functions for jsonb

From
Andrew Dunstan
Date:

On 04/03/2017 03:41 PM, Sven R. Kunze wrote:
> On 03.04.2017 21:30, Andrew Dunstan wrote:
>> On 04/03/2017 02:44 PM, Sven R. Kunze wrote:
>>> On 01.04.2017 22:20, Andrew Dunstan wrote:
>>>> I added documentation when I committed it for the new functions, in
>>>> the
>>>> FTS section. I'm not sure what we need to add to the JSON section if
>>>> anything.
>>> Not sure, if this is related but the formatting of
>>> https://www.postgresql.org/docs/devel/static/functions-textsearch.html
>>> looks a bit strange.
>>>
>>> Just 2 questions/notes:
>>> 1) in what order are the values of the JSON extracted?
>> In the order they exist in the underlying document.
>
> Just asking as the order can have implications for fulltext searches.
> So, might be valuable for the docs.
>
>
> Are these documents equally ordered in this sense?
>
> srkunze=# select '{"a": "abc", "b": "def"}'::jsonb;
>           jsonb
> --------------------------
>  {"a": "abc", "b": "def"}
> (1 row)
>
> srkunze=# select '{"b": "def", "a": "abc"}'::jsonb;
>           jsonb
> --------------------------
>  {"a": "abc", "b": "def"}
> (1 row)
>


Yes, when converted to jsonb these two documents are identical.


>
> Also what about non-ascii keys? Are they ordered by the default locale
> of the PostgreSQL cluster (say de_DE.utf-8)?


Yes, I believe so.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services