Thread: Abnormal JSON query performance

Abnormal JSON query performance

From
007reader@gmail.com
Date:
I observed a performance anomaly with JSON that I believe is incorrect in version 10, although I didn’t test earlier versions.
Here is a test table:
create table test(
  id integer primary key,
  data json
);

A simple query works fine on a table of 92,000 records with ~60KB average record size.

select data->>'key1' from test limit 1000;

The query above takes about 155 ms. If I add an additional field to the list of fields, execution time increases linearly
- retrieving 2 fields takes ~310 ms, 3 fields ~465 ms, and so on. The EXPLAIN output for 5 fields looks normal:

Limit  (cost=0.00..34.10 rows=1000 width=160) (actual time=1.783..813.643 rows=1000 loops=1)
   ->  Seq Scan on loans  (cost=0.00..3165.85 rows=92834 width=160) (actual time=1.753..805.629 rows=1000 loops=1)
 Planning time: 8.546 ms
 Execution time: 817.874 ms

Looks like every record gets parsed separately for each key... Unless I’m doing something incorrectly, this makes
retrieval of individual keys completely useless for practical cases. It’d leave only the unattractive alternative of getting
the entire JSON document. In my case, instead of retrieving a few hundred bytes, it would force getting 60KB for each
query...

-Bob Jones

Re: Abnormal JSON query performance

From
Pavel Stehule
Date:
Hi

2018-05-12 6:32 GMT+02:00 <007reader@gmail.com>:

It is not a bug. This behaviour is expected.

JSON values are stored as plain text in PostgreSQL. They are parsed every time they are accessed. If you need fast access to one specific field, you can use a functional index:

CREATE INDEX ON test ((data->>'key1'));

You can try the JSONB type - it is like JSON, but pre-parsed, so access to individual fields should be faster. Again, a functional index can be used, or special JSONB (GIN) indexes for faster searching.
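
For illustration, a minimal sketch (the table and index definitions below are only examples, not taken from your schema):

CREATE TABLE test_jsonb (
  id   integer PRIMARY KEY,
  data jsonb
);

-- expression (functional) index: fast lookups by one specific key
CREATE INDEX test_jsonb_key1_idx ON test_jsonb ((data ->> 'key1'));

-- GIN index: supports containment searches such as
--   SELECT * FROM test_jsonb WHERE data @> '{"key1": "some value"}';
CREATE INDEX test_jsonb_data_idx ON test_jsonb USING gin (data);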

Don't forget - PostgreSQL is a relational database - sequential access to data is fastest when the data is stored in relational form, with each attribute in its own column. Non-atomic data types are more dynamic, but access is significantly slower - that is unimportant for hundreds of values, but it becomes obvious when you read thousands of values or more.

P.S. An index is your friend.

Regards

Pavel


 


Re: Abnormal JSON query performance

From
007reader
Date:
Sorry, I mistakenly showed the data type as json. In reality, it's jsonb. Based on the documentation, I'm under the impression that jsonb is stored parsed, hence my complaint.


Re: Abnormal JSON query performance

From
Pavel Stehule
Date:


2018-05-12 9:12 GMT+02:00 007reader <007reader@gmail.com>:
Sorry, I mistakenly showed the data type as json. In reality, it's jsonb. Based on the documentation, I'm under the impression that jsonb is stored parsed, hence my complaint.

So, try to use indexes, or move the important fields into their own table columns.

jsonb is pre-parsed, but it is stored as one blob, and if you read one field, the complete jsonb value is read from disk.
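
For example, a rough sketch of moving one frequently read key into its own column (assuming the key is key1 and its values are text):

ALTER TABLE test ADD COLUMN key1 text;

-- copy the value out of the document once, then index the plain column
UPDATE test SET key1 = data ->> 'key1';
CREATE INDEX ON test (key1);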

Regards

Pavel




Re: Abnormal JSON query performance

From
007reader@gmail.com
Date:
Sorry again, autocorrect kills me. I use JSONB, not JSON. Reading the entire document for jsonb doesn’t make much sense. If jsonb is pre-parsed, selecting and assembling a few keys should be faster than assembling all of them.



Re: Abnormal JSON query performance

From
Pavel Stehule
Date:


2018-05-12 9:31 GMT+02:00 <007reader@gmail.com>:
Sorry again, autocorrect kills me. I use JSONB, not JSON. Reading the entire document for jsonb doesn’t make much sense. If jsonb is pre-parsed, selecting and assembling a few keys should be faster than assembling all of them.
 
jsonb is pre-parsed, but that is all. PostgreSQL cannot store one value in separately readable pieces. It cannot do that for any type - json(b) is not an exception.

Regards

Pavel






Re: Abnormal JSON query performance

From
007reader@gmail.com
Date:
You are right. I tried similar queries for the array data type, retrieving multiple selected array elements, and the query time increases as the number of requested array elements increases.
This is a very disappointing fact. Documentation and multiple articles on the internet promote the ability to extract specific JSON keys like “select field1->field2->>key1, field2->>key2, ...”, but it turns out to be very inefficient. The design seems to favor retrieving the entire JSON document and processing it on the application side.

Is it possible to add a note to the documentation articulating this important detail? I’ll ask someone to write a blog post about this deficiency ...

-Michael



Re: Abnormal JSON query performance

From
Pavel Stehule
Date:


2018-05-13 19:16 GMT+02:00 <007reader@gmail.com>:
You are right. I tried similar queries for the array data type, retrieving multiple selected array elements, and the query time increases as the number of requested array elements increases.
This is a very disappointing fact. Documentation and multiple articles on the internet promote the ability to extract specific JSON keys like “select field1->field2->>key1, field2->>key2, ...”, but it turns out to be very inefficient. The design seems to favor retrieving the entire JSON document and processing it on the application side.

PostgreSQL is not a JSON or XML database that breaks a document into a graph and operates on that graph. Unfortunately, I don't know of any database that does that. Searching a very complex graph is available only in in-memory databases.

PostgreSQL 11 allows an index-only scan over a GIN index, if I remember correctly. Maybe that can help you.
 

Is it possible to add a note to the documentation articulating this important detail? I’ll ask someone to write a blog post about this deficiency ...

"JSON data is subject to the same concurrency-control considerations as any other data type when stored in a table. Although storing large documents is practicable, keep in mind that any update acquires a row-level lock on the whole row. Consider limiting JSON documents to a manageable size in order to decrease lock contention among updating transactions. Ideally, JSON documents should each represent an atomic datum that business rules dictate cannot reasonably be further subdivided into smaller datums that could be modified independently."

This is part of the documentation - and it should be mentioned there that a JSON or JSONB document is read and written as one value every time. So if you use long documents and usually read only a few fields, then a seq scan will be very ineffective, and an index scan for many rows will be expensive too.

Regards

Pavel


 




Re: Abnormal JSON query performance

From
007reader@gmail.com
Date:
I think we are talking about two different use cases. I am not interested in graph operations on a document. 

My interest is in fast access to data. On a relational table, the query time is about the same whether I have one or ten fields in a select statement. I’d love to see the same behavior when getting multiple keys from a JSON document. That doesn’t seem to require graph manipulations. In a simplistic naive approach, it may be just walking a JSON document and getting all fields from the select statement in a single pass. It’s not quite full independence from the number of fields, but should be better than doubling execution time for each additional key in the current implementation.




Re: Abnormal JSON query performance

From
"David G. Johnston"
Date:
On Sunday, May 13, 2018, <007reader@gmail.com> wrote: 
My interest is in fast access to data. On a relational table, the query time is about the same whether I have one or ten fields in a select statement. I’d love to see the same behavior when getting multiple keys from a JSON document.

I would hazard to say this is a solid desire and one shared by many.  It is also seemingly something that today has little or no precedent in PostgreSQL.  The lack is not a bug.  Trying to use PostgreSQL in ways contrary to a normalized relational database forces one to decide between the convenience of the platform and a suboptimal implementation when dealing with this kind of situation.  Discussion of those dynamics is better done on the -general list, not -bugs.

David J.

Re: Abnormal JSON query performance

From
Pavel Stehule
Date:


2018-05-14 5:48 GMT+02:00 <007reader@gmail.com>:
I think we are talking about two different use cases. I am not interested in graph operations on a document. 

My interest is in fast access to data. On a relational table, the query time is about the same whether I have one or ten fields in a select statement. I’d love to see the same behavior when getting multiple keys from a JSON document. That doesn’t seem to require graph manipulations. In a simplistic naive approach, it may be just walking a JSON document and getting all fields from the select statement in a single pass. It’s not quite full independence from the number of fields, but should be better than doubling execution time for each additional key in the current implementation.

Your expectations are not valid for longer fields. Here is one experiment:

create table test(a varchar, b varchar);

CREATE OR REPLACE FUNCTION public.random_str(integer)
 RETURNS character varying
 LANGUAGE sql
AS $function$
select string_agg(v, '') from (
  select substring('abcdefghijklmnopqrstuvwxyz' from (random()*26)::int + 1 for 1) v
    from generate_series(1, $1)) s;
$function$;

insert into test select random_str(4000), random_str(4000) from generate_series(1,100000);

postgres=# explain analyze select length(a) from test;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                  QUERY PLAN                                                  │
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Seq Scan on test  (cost=0.00..2084.00 rows=100000 width=4) (actual time=0.315..2952.151 rows=100000 loops=1) │
│ Planning Time: 0.109 ms                                                                                      │
│ Execution Time: 2960.654 ms                                                                                  │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(3 rows)

postgres=# explain analyze select length(a), length(b) from test;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                  QUERY PLAN                                                  │
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Seq Scan on test  (cost=0.00..2334.00 rows=100000 width=8) (actual time=0.367..5420.946 rows=100000 loops=1) │
│ Planning Time: 0.103 ms                                                                                      │
│ Execution Time: 5431.446 ms                                                                                  │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(3 rows)

postgres=# explain analyze select length(a), length(a) from test;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                  QUERY PLAN                                                  │
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Seq Scan on test  (cost=0.00..2334.00 rows=100000 width=8) (actual time=0.367..5404.059 rows=100000 loops=1) │
│ Planning Time: 0.100 ms                                                                                      │
│ Execution Time: 5414.443 ms                                                                                  │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(3 rows)


You can break the document into smaller pieces - but that has its limits too.

The total speed depends on a few factors:

1. How much data is read from disk - t1 (this can be unimportant when the data is already in the PostgreSQL cache - shared buffers)
2. Decompression - long XML, JSON, and jsonb values are stored on disk, and in the PostgreSQL cache, in compressed form, and every access to a long document decompresses it - t2. Jsonb documents can be longer than JSON.
3. Parsing the document - on every access; significant for XML and JSON - t3
4. Loading the binary document - on every access; significant for jsonb - t4

Total time is t1 + t2 + t3 + t4.

I am not sure if t2 is optimized now for multiple accesses to the same data - probably not - you can find some messages in the archive about a cache for detoasted data. Postgres has some optimizations, but they are used only from PL/pgSQL and only for arrays and records. Certainly not for JSON or jsonb.

You can probably do some optimization using your own functions:

CREATE OR REPLACE FUNCTION fx(d jsonb, OUT a text, OUT b text)
AS $$
BEGIN
  a := d ->> 'a';
  b := d ->> 'b';
END;
$$ LANGUAGE plpgsql;

SELECT (x).a, (x).b FROM (SELECT fx(d) AS x FROM jsontab) s;

can be faster than

SELECT d->>'a', d->>'b' FROM jsontab

because it eliminates the repeated t2 time.

Maybe someone will write a special row cache that can eliminate the repeated t2, t3, t4 times - that could be interesting for repeated access to different fields of a JSON document. Currently there is nothing similar, although there is a patch for a JSON_TABLE function.

It is an analogy to the XMLTABLE function. With such a function you can extract several fields with just one pass over the document.
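
For illustration only, a toy XMLTABLE example (the document and column names are made up) - several columns are produced from a single pass over one document:

SELECT x.a, x.b
  FROM XMLTABLE('/row'
                PASSING '<row><a>1</a><b>hello</b></row>'::xml
                COLUMNS a int  PATH 'a',
                        b text PATH 'b') AS x;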

Regards

Pavel Stehule











Re: Abnormal JSON query performance

From
Tom Lane
Date:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Sunday, May 13, 2018, <007reader@gmail.com> wrote:
>> My interest is in fast access to data. On a relational table, the query
>> time is about the same whether I have one or ten fields in a select
>> statement. I’d love to see the same behavior when getting multiple keys
>> from a JSON document.

> I would hazard to say this is a solid desire and one shared by many.  It is
> also seemingly something that today has little or no precedent in
> PostgreSQL.  The lack is not a bug.

Yeah.  This has been discussed from time to time before.  The problem
is how to reconcile it with PG's extensible architecture, in which these
various -> and ->> operators are independent functions that are black
boxes so far as the core code is concerned.  It's very unclear how to
set up an arrangement that would let them share processing.

For the moment, you can work around it to some extent by writing out
the shared processing manually, along the lines of

select (x).this, (x).that, (x).the_other from
  (select jsonb_populate_record(null::myrowtype, jsonb_column) as x
   from ...) ss

where myrowtype defines the fields you want to extract.
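
For instance, a sketch using the table and key names from the original report (assuming the column is jsonb, as clarified upthread; the type and key names are only illustrative):

CREATE TYPE myrowtype AS (key1 text, key2 text);

SELECT (x).key1, (x).key2
  FROM (SELECT jsonb_populate_record(null::myrowtype, data) AS x
          FROM test LIMIT 1000) ss;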

            regards, tom lane


Re: Abnormal JSON query performance

From
Merlin Moncure
Date:
On Mon, May 14, 2018 at 12:18 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "David G. Johnston" <david.g.johnston@gmail.com> writes:
> > On Sunday, May 13, 2018, <007reader@gmail.com> wrote:
> >> My interest is in fast access to data. On a relational table, the query
> >> time is about the same whether I have one or ten fields in a select
> >> statement. I’d love to see the same behavior when getting multiple keys
> >> from a JSON document.

> > I would hazard to say this is a solid desire and one shared by many.  It is
> > also seemingly something that today has little or no precedent in
> > PostgreSQL.  The lack is not a bug.

> Yeah.  This has been discussed from time to time before.  The problem
> is how to reconcile it with PG's extensible architecture, in which these
> various -> and ->> operators are independent functions that are black
> boxes so far as the core code is concerned.  It's very unclear how to
> set up an arrangement that would let them share processing.

> For the moment, you can work around it to some extent by writing out
> the shared processing manually, along the lines of

> select (x).this, (x).that, (x).the_other from
>    (select jsonb_populate_record(null::myrowtype, jsonb_column) as x
>     from ...) ss

> where myrowtype defines the fields you want to extract.

This is really the answer.  Pretty typically you'd use CROSS JOIN LATERAL
to fold the jsonb_populate_record portion into a broader query to work
around some restrictions.  TBH this approach is reasonably fast,
expressive, and flexible; it's not broken. My only gripe (such as it is) is
the verbosity of the json api.  Question: is there any technical reason as
to why this can't be worked into a simple cast?

jsonb_populate_record(null::myrowtype, jsonb_column)  ->
jsonb_column::myrowtype
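
For example, a sketch of the CROSS JOIN LATERAL pattern mentioned above (the table, column, and type names are placeholders):

SELECT t.id, r.key1, r.key2
  FROM test AS t
  CROSS JOIN LATERAL jsonb_populate_record(null::myrowtype, t.data) AS r;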

merlin




Re: Abnormal JSON query performance

From
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes:
> ... My only gripe (such as it is) is
> the verbosity of the json api.  Question: is there any technical reason as
> to why this can't be worked into a simple cast?

> jsonb_populate_record(null::myrowtype, jsonb_column)  ->
> jsonb_column::myrowtype

Hmm, good question.  I think it could work about the same at execution,
but I'm not sure whether the cast parsing machinery would need extension.
(The pg_cast entry would presumably have to have casttarget = record.
I'm not sure that a named record type would match that without an extra
hack, which is something that might or might not be a good idea in the
long run.  There are no casts-to-polymorphic-types at present.)

Now, from the standpoint of system design, the key question is whether
we're sufficiently convinced that the semantics of jsonb_populate_record
are The Right Thing that we're willing to give it pride of place as the
jsonb->record cast function.  As long as it's just a function, you can
put different semantics beside it easily --- just write another function.
But as far as a cast goes, there can be only one.

I would certainly not have had that much faith in jsonb_populate_record's
choices initially, but maybe now with nearly five years of field
experience we'd be good with making that commitment.

Another point is that the need to create a named composite type is
a usability strike against the cast syntax anyway.  Personally,
if I'm going to write it as a lateral function call, I'd much
rather write out the fields I want in-line, a la

    SELECT ... FROM ...some tables...,
      jsonb_to_record(jsonbcol) AS j(id int, name text, price numeric)
    WHERE ...

which is something you can do today.

            regards, tom lane


Re: Abnormal JSON query performance

From
"David G. Johnston"
Date:
On Mon, May 14, 2018 at 7:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
        SELECT ... FROM ...some tables...,
          jsonb_to_record(jsonbcol) AS j(id int, name text, price numeric)
        WHERE ...

which is something you can do today.

Indeed you can - could you please point to the docs for that one?

SELECT *
FROM (VALUES ('{"id":1,"name":"Dave","country":"US"}'::json)) vals (v)
, json_to_record(v) j(name text, country text)

David J.

Re: Abnormal JSON query performance

From
Tom Lane
Date:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Mon, May 14, 2018 at 7:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> SELECT ... FROM ...some tables...,
>> jsonb_to_record(jsonbcol) AS j(id int, name text, price numeric)
>> WHERE ...
>>
>> which is something you can do today.

> Indeed you can - could you please point to the docs for that one?

https://www.postgresql.org/docs/current/static/functions-json.html#FUNCTIONS-JSON-PROCESSING-TABLE

Perhaps it'd be worth emphasizing the usefulness of jsonb_to_record[set]
a bit more, say with examples in section 8.14.

            regards, tom lane


Re: Abnormal JSON query performance

From
"David G. Johnston"
Date:
On Mon, May 14, 2018 at 8:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Mon, May 14, 2018 at 7:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> SELECT ... FROM ...some tables...,
>> jsonb_to_record(jsonbcol) AS j(id int, name text, price numeric)
>> WHERE ...
>>
>> which is something you can do today.

> Indeed you can - could you please point to the docs for that one?

https://www.postgresql.org/docs/current/static/functions-json.html#FUNCTIONS-JSON-PROCESSING-TABLE

Thanks.  I was thinking you were still talking about "populate" variants of the functions and missed that you switched to the "to" variant in the final example.
 
Perhaps it'd be worth emphasizing the usefulness of jsonb_to_record[set]
a bit more, say with examples in section 8.14.

A section titled "JSON Element Extraction" under 8.14 that covers those functions in context and discusses the dynamics of multiple columns of -> invocations seems worthwhile.

"Ideally, JSON documents should each represent an atomic datum that business rules dictate cannot reasonably be further subdivided into smaller datums that could be modified independently."

See that new section should you decide not to heed the above advice, and in general when converting between JSON and table forms.

David J.

Re: Abnormal JSON query performance

From
007reader
Date:
It would be great to document jsonb_populate_record better, especially the rowtype. Maybe it is obvious to an experienced user, but for a less experienced one it isn't clear how it should be defined. Only after Tom's email did I realize that it can be done without creating a table.

My use case may be a bit more complex:
1. My JSON doc is large - a few hundred keys - and it is not practical to define a rowtype for the entire doc. Plus, not all docs have all keys in every record. I'd like to specify only a relatively small number of keys (by their path) for jsonb_populate_record instead of the entire JSON field.
2. My docs have a hierarchical structure, but the output should be flattened based on the structure defined in #1.

Can those problems be addressed within the current implementation?



Re: Abnormal JSON query performance

From
Dmitry Dolgov
Date:
> On 14 May 2018 at 23:37, 007reader <007reader@gmail.com> wrote:
> It would be great to document jsonb_populate_record better, especially the
> rowtype. May be it is obvious to an experienced user, but for a less
> experienced it isn't clear how it should be defined. Only after Tom's email,
> I realized that it can be done without creating a table.
>
> My use case may be a bit more complex:
> 1. My JSON doc is large - few hundred keys and it is not practical to define
> rowtype for the entire doc. Plus not all docs have all keys in each record.
> I'd like to specify only a relatively small number of keys (by their path)
> for jsonb_populate_record instead of the entire json field.
> 2. My docs have hierarchical structure, but the output should be flattened
> base on the structure defined in #1.
>
> Can those problems be addressed within the current implementation?

Just to mention about #1. If I understand you correctly, it's not necessary to
define a rowtype for the entire doc, you can do this only for a part
that you want to extract from the document. It's also fine to have
some keys missing:

    create type test as (a integer, b text, c text);
    select * from json_populate_record(null::test, '{"a": 1, "b": "test", "d": "test2"}');

     a |  b   |  c
    ---+------+------
     1 | test | NULL
    (1 row)


Re: Abnormal JSON query performance

From
reader 1001
Date:
Yes, I realized it by now, thank you. 
My question remains for hierarchical keys in a JSON document. If I have a document like below, I clearly can extract key1 using the described rowtype definition. How can I specify selected keys deeper in the document, e.g. key3 and key5? 
{
   "key1": "value1",
   "key2": {
       "key3": "value3"
   },
   "key4": [
      { "key5": "value5" },
      { "key6": "value6" }
   ]
}

Thank you for your help.


Re: Abnormal JSON query performance

From
"David G. Johnston"
Date:
On Tuesday, May 15, 2018, reader 1001 <007reader@gmail.com> wrote:


I believe you would need a type for each subtree and apply the function multiple times with the result of one feeding the next.

David J. 

Re: Abnormal JSON query performance

From
Dmitry Dolgov
Date:
> On 16 May 2018 at 05:59, David G. Johnston <david.g.johnston@gmail.com> wrote:
> I believe you would need a type for each subtree and apply the function
> multiple times with the result of one feeding the next.

Yes, you need to define a type for each subtree, but as far as I can
tell it's not necessary to apply the function multiple times:
`jsonb_populate_record` can work with nested types, so it's enough
just to have every new type included in the previous one. I have this
simple example; it should be easy to adapt it to your case (although
the value extraction part looks a bit cumbersome):

create type key1 as (a text);
create type key2 as (b key1);
create type key3 as (c key2);
create type key4 as (d key3, e text);

select (((d).c).b).a, e
    from jsonb_populate_record(null::key4,
    '{"d": {"c": {"b": {"a": "nested"}}}, "e": "test"}');

   a    |  e
--------+------
 nested | test
(1 row)


Re: Abnormal JSON query performance

From
Tom Lane
Date:
Dmitry Dolgov <9erthalion6@gmail.com> writes:
>> On 16 May 2018 at 05:59, David G. Johnston <david.g.johnston@gmail.com> wrote:
>> On Tuesday, May 15, 2018, reader 1001 <007reader@gmail.com> wrote:
>>> My question remains for hierarchical keys in a JSON document. If I have a
>>> document like below, I clearly can extract key1 using the described rowtype
>>> definition. How can I specify selected keys deeper in the document, e.g.
>>> key3 and key5?

>> I believe you would need a type for each subtree and apply the function
>> multiple times with the result of one feeding the next.

> Yes, you need to defined a type for each subtree, but as far as I can
> tell it's not necessary to apply the function multiple times,
> `jsonb_populate_record` can work with nested types, so it's enough
> just to have every new type included in the previous one.

FWIW, I really doubt that there's much performance win from going further
than the first-level keys.  I suspect most of the cost that the OP is
seeing comes from fetching the large JSONB document out of toast storage
multiple times.  Fetching it just in a single jsonb_populate_record()
call will fix that.  So I'd just return the top-level field(s) as jsonb
column(s) and use the normal -> or ->> operators to go further down.
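
For example, a sketch along those lines, using the key names from the sample document upthread (the type and key names are illustrative; the column is assumed to be jsonb):

CREATE TYPE toplevel AS (key1 text, key2 jsonb, key4 jsonb);

SELECT (x).key1,
       (x).key2 ->> 'key3'      AS key3,
       (x).key4 -> 0 ->> 'key5' AS key5
  FROM (SELECT jsonb_populate_record(null::toplevel, data) AS x
          FROM test) ss;

The intent, per the advice above, is that the document is pulled out once per row by jsonb_populate_record, and the remaining -> and ->> steps only drill into the already-extracted jsonb values.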

The vague ideas that I've had about fixing this type of problem
automatically mostly center around detecting the need for duplicate
toast fetches and doing that just once.  For data types having "expanded"
forms, it's tempting to consider also expanding them during the fetch,
but that's less clearly a win.

            regards, tom lane


Re: Abnormal JSON query performance

From
Pavel Stehule
Date:


2018-05-16 17:07 GMT+02:00 Tom Lane <tgl@sss.pgh.pa.us>:
Dmitry Dolgov <9erthalion6@gmail.com> writes:
>> On 16 May 2018 at 05:59, David G. Johnston <david.g.johnston@gmail.com> wrote:
>> On Tuesday, May 15, 2018, reader 1001 <007reader@gmail.com> wrote:
>>> My question remains for hierarchical keys in a JSON document. If I have a
>>> document like below, I clearly can extract key1 using the described rowtype
>>> definition. How can I specify selected keys deeper in the document, e.g.
>>> key3 and key5?

>> I believe you would need a type for each subtree and apply the function
>> multiple times with the result of one feeding the next.

> Yes, you need to define a type for each subtree, but as far as I can
> tell it's not necessary to apply the function multiple times,
> `jsonb_populate_record` can work with nested types, so it's enough
> just to have every new type included in the previous one.

FWIW, I really doubt that there's much performance win from going further
than the first-level keys.  I suspect most of the cost that the OP is
seeing comes from fetching the large JSONB document out of toast storage
multiple times.  Fetching it just in a single jsonb_populate_record()
call will fix that.  So I'd just return the top-level field(s) as jsonb
column(s) and use the normal -> or ->> operators to go further down.

The vague ideas that I've had about fixing this type of problem
automatically mostly center around detecting the need for duplicate
toast fetches and doing that just once.  For data types having "expanded"
forms, it's tempting to consider also expanding them during the fetch,
but that's less clearly a win.

Just a note: if SQL/JSON gets implemented, this discussion becomes moot, because the JSON_TABLE function allows reading multiple values per call.
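
For reference, a hedged sketch of how the standard's JSON_TABLE might read for this case (it is not available in the PostgreSQL release discussed in this thread; key names are illustrative):

select jt.key1, jt.key3
    from test,
         json_table(data, '$'
             columns (
                 key1 text path '$.key1',
                 key3 text path '$.key2.key3'
             )) as jt
    limit 1000;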

Regards

Pavel 

                        regards, tom lane

Re: Abnormal JSON query performance

From
007reader
Date:
It works, but it's not very practical. Users choose JSON for its flexibility, so it's impossible to predict all queries upfront. I found a limitation as well: I tried to create the new types, run the query, and drop the types all within a single transaction, but it turned out I couldn't create a new type within a transaction (begin ... end;).

-------- Original message --------
From: Dmitry Dolgov <9erthalion6@gmail.com>
Date: 5/16/18 12:12 AM (GMT-08:00)
To: "David G. Johnston" <david.g.johnston@gmail.com>
Cc: reader 1001 <007reader@gmail.com>, Tom Lane <tgl@sss.pgh.pa.us>, Merlin Moncure <mmoncure@gmail.com>, Pavel Stehule <pavel.stehule@gmail.com>, pgsql-bugs <pgsql-bugs@postgresql.org>
Subject: Re: Abnormal JSON query performance

> On 16 May 2018 at 05:59, David G. Johnston <david.g.johnston@gmail.com> wrote:
> On Tuesday, May 15, 2018, reader 1001 <007reader@gmail.com> wrote:
>>
>> Yes, I realized it by now, thank you.
>> My question remains for hierarchical keys in a JSON document. If I have a
>> document like below, I clearly can extract key1 using the described rowtype
>> definition. How can I specify selected keys deeper in the document, e.g.
>> key3 and key5?
>> {
>>    key1:value1,
>>    key2: {
>>        key3:value3},
>>        key4:[
>>          {
>>             key5:value5
>>           },
>>           {
>>             key6:value6
>>           }
>>         ]
>>     }
>>
>
> I believe you would need a type for each subtree and apply the function
> multiple times with the result of one feeding the next.

Yes, you need to define a type for each subtree, but as far as I can
tell it's not necessary to apply the function multiple times:
`jsonb_populate_record` can work with nested types, so it's enough
just to have every new type included in the previous one. Here is a
simple example; it should be easy to adapt it to your case (although
the value extraction part looks a bit cumbersome):

create type key1 as (a text);
create type key2 as (b key1);
create type key3 as (c key2);
create type key4 as (d key3, e text);

select (((d).c).b).a, e
    from jsonb_populate_record(null::key4,
    '{"d": {"c": {"b": {"a": "nested"}}}, "e": "test"}');

   a    |  e
--------+------
 nested | test
(1 row)

Re: Abnormal JSON query performance

From
007reader
Date:
Thank you for all your recommendations. My conclusion is that, at this point, Postgres doesn't have a robust way of extracting specific keys from a JSON column. In real applications, JSON documents are usually hierarchical and fairly complex. Adding fast access to parts of a JSON document would be a great improvement and would keep Postgres' leadership in the unstructured-data space.

-------- Original message --------
From: Pavel Stehule <pavel.stehule@gmail.com>
Date: 5/16/18 8:10 AM (GMT-08:00)
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Dmitry Dolgov <9erthalion6@gmail.com>, "David G. Johnston" <david.g.johnston@gmail.com>, reader 1001 <007reader@gmail.com>, Merlin Moncure <mmoncure@gmail.com>, pgsql-bugs <pgsql-bugs@postgresql.org>
Subject: Re: Abnormal JSON query performance



2018-05-16 17:07 GMT+02:00 Tom Lane <tgl@sss.pgh.pa.us>:
Dmitry Dolgov <9erthalion6@gmail.com> writes:
>> On 16 May 2018 at 05:59, David G. Johnston <david.g.johnston@gmail.com> wrote:
>> On Tuesday, May 15, 2018, reader 1001 <007reader@gmail.com> wrote:
>>> My question remains for hierarchical keys in a JSON document. If I have a
>>> document like below, I clearly can extract key1 using the described rowtype
>>> definition. How can I specify selected keys deeper in the document, e.g.
>>> key3 and key5?

>> I believe you would need a type for each subtree and apply the function
>> multiple times with the result of one feeding the next.

> Yes, you need to define a type for each subtree, but as far as I can
> tell it's not necessary to apply the function multiple times,
> `jsonb_populate_record` can work with nested types, so it's enough
> just to have every new type included in the previous one.

FWIW, I really doubt that there's much performance win from going further
than the first-level keys.  I suspect most of the cost that the OP is
seeing comes from fetching the large JSONB document out of toast storage
multiple times.  Fetching it just in a single jsonb_populate_record()
call will fix that.  So I'd just return the top-level field(s) as jsonb
column(s) and use the normal -> or ->> operators to go further down.

The vague ideas that I've had about fixing this type of problem
automatically mostly center around detecting the need for duplicate
toast fetches and doing that just once.  For data types having "expanded"
forms, it's tempting to consider also expanding them during the fetch,
but that's less clearly a win.

Just a note: if SQL/JSON gets implemented, this discussion becomes moot, because the JSON_TABLE function allows reading multiple values per call.

Regards

Pavel 

                        regards, tom lane

Re: Abnormal JSON query performance

From
reader 1001
Date:
Through my testing, I confirmed Tom's guess: the performance benefit of going beyond the first-level keys is negligible. Fetching the top-level key's JSON part of the document and using the -> and ->> operators to extract data works very well, even where I had to go down multiple levels of the hierarchy. Thanks a lot!

On Wed, May 16, 2018 at 8:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dmitry Dolgov <9erthalion6@gmail.com> writes:
>> On 16 May 2018 at 05:59, David G. Johnston <david.g.johnston@gmail.com> wrote:
>> On Tuesday, May 15, 2018, reader 1001 <007reader@gmail.com> wrote:
>>> My question remains for hierarchical keys in a JSON document. If I have a
>>> document like below, I clearly can extract key1 using the described rowtype
>>> definition. How can I specify selected keys deeper in the document, e.g.
>>> key3 and key5?

>> I believe you would need a type for each subtree and apply the function
>> multiple times with the result of one feeding the next.

> Yes, you need to define a type for each subtree, but as far as I can
> tell it's not necessary to apply the function multiple times,
> `jsonb_populate_record` can work with nested types, so it's enough
> just to have every new type included in the previous one.

FWIW, I really doubt that there's much performance win from going further
than the first-level keys.  I suspect most of the cost that the OP is
seeing comes from fetching the large JSONB document out of toast storage
multiple times.  Fetching it just in a single jsonb_populate_record()
call will fix that.  So I'd just return the top-level field(s) as jsonb
column(s) and use the normal -> or ->> operators to go further down.

The vague ideas that I've had about fixing this type of problem
automatically mostly center around detecting the need for duplicate
toast fetches and doing that just once.  For data types having "expanded"
forms, it's tempting to consider also expanding them during the fetch,
but that's less clearly a win.

                        regards, tom lane

Re: Abnormal JSON query performance

From
Bruce Momjian
Date:
On Wed, May 16, 2018 at 11:07:59AM -0400, Tom Lane wrote:
> Dmitry Dolgov <9erthalion6@gmail.com> writes:
> >> On 16 May 2018 at 05:59, David G. Johnston <david.g.johnston@gmail.com> wrote:
> >> On Tuesday, May 15, 2018, reader 1001 <007reader@gmail.com> wrote:
> >>> My question remains for hierarchical keys in a JSON document. If I have a
> >>> document like below, I clearly can extract key1 using the described rowtype
> >>> definition. How can I specify selected keys deeper in the document, e.g.
> >>> key3 and key5?
> 
> >> I believe you would need a type for each subtree and apply the function
> >> multiple times with the result of one feeding the next.
> 
> > Yes, you need to define a type for each subtree, but as far as I can
> > tell it's not necessary to apply the function multiple times,
> > `jsonb_populate_record` can work with nested types, so it's enough
> > just to have every new type included in the previous one.
> 
> FWIW, I really doubt that there's much performance win from going further
> than the first-level keys.  I suspect most of the cost that the OP is
> seeing comes from fetching the large JSONB document out of toast storage
> multiple times.  Fetching it just in a single jsonb_populate_record()
> call will fix that.  So I'd just return the top-level field(s) as jsonb
> column(s) and use the normal -> or ->> operators to go further down.
> 
> The vague ideas that I've had about fixing this type of problem
> automatically mostly center around detecting the need for duplicate
> toast fetches and doing that just once.  For data types having "expanded"
> forms, it's tempting to consider also expanding them during the fetch,
> but that's less clearly a win.

Should this be a TODO item?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +