Thread: [PERFORM] Speeding up JSON + TSQUERY + GIN

[PERFORM] Speeding up JSON + TSQUERY + GIN

From
"Sven R. Kunze"
Date:
Hello everyone,

I am currently evaluating the possibility of using PostgreSQL for storing jsonb documents and querying them with tsvector/tsquery full-text search. Let's consider this setup:

create table docs (id serial primary key, meta jsonb);
-- generate 10M entries, cf. appendix
create index docs_meta_idx ON docs using gin (meta jsonb_path_ops);
create index docs_name_idx ON docs using gin (to_tsvector('english', meta->>'name'));
create index docs_address_idx ON docs using gin (to_tsvector('english', meta->>'address'));


Testing with some smaller datasets, functionality-wise it's great. However, after increasing to 10M entries, things tend to slow down (using PostgreSQL 9.5):


explain analyze select id from docs where meta @> '{"age": 20}';
 Planning time: 0.121 ms
 Execution time: 4873.507 ms

explain analyze select id from docs where meta @> '{"age": 20}';
 Planning time: 0.122 ms
 Execution time: 206.289 ms



explain analyze select id from docs where meta @> '{"age": 30}';
 Planning time: 0.109 ms
 Execution time: 7496.886 ms

explain analyze select id from docs where meta @> '{"age": 30}';
 Planning time: 0.114 ms
 Execution time: 1169.649 ms



explain analyze select id from docs where to_tsvector('english', meta->>'name') @@ to_tsquery('english', 'john');
 Planning time: 0.179 ms
 Execution time: 10109.375 ms

explain analyze select id from docs where to_tsvector('english', meta->>'name') @@ to_tsquery('english', 'john');
 Planning time: 0.188 ms
 Execution time: 238.854 ms
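
For illustration (a sketch, not part of the original mail): running the same queries with EXPLAIN (ANALYZE, BUFFERS) would show how many blocks were read from disk versus found in shared_buffers, which makes the cold-versus-warm difference above visible:

explain (analyze, buffers) select id from docs where meta @> '{"age": 20}';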


Using "select pg_prewarm('docs');" and on any of the indexes doesn't help either.
After a "systemctl stop postgresql.service && sync && echo 3 > /proc/sys/vm/drop_caches && systemctl start postgresql.service" the age=20, 30 or name=john queries are slow again.


Is there a way to speed these queries up, or to warm up the caches permanently?


Regards,
Sven


Appendix I:

example json:

{"age": 20, "name": "Michelle Hernandez", "birth": "1991-08-16", "address": "94753 Tina Bridge Suite 318\\nEmilyport, MT 75302"}



Appendix II:


The Python script to generate fake json data. Needs "pip install faker".

$ python fake_json.py > test.json  # generates 2M entries; takes some time
$ cat test.json | psql -c 'copy docs (meta) from stdin'
$ cat test.json | psql -c 'copy docs (meta) from stdin'
$ cat test.json | psql -c 'copy docs (meta) from stdin'
$ cat test.json | psql -c 'copy docs (meta) from stdin'
$ cat test.json | psql -c 'copy docs (meta) from stdin'


-- fake_json.py --

import json
import faker

fake = faker.Faker()
for i in range(2 * 10**6):
    # double the backslash in the \n escapes json.dumps emits, so COPY FROM STDIN keeps them intact
    print(json.dumps({"name": fake.name(), "birth": fake.date(), "address": fake.address(), "age": fake.random_int(0, 100)}).replace('\\n', '\\\\n'))

Re: [PERFORM] Speeding up JSON + TSQUERY + GIN

From
Oleg Bartunov
Date:


On Sun, Feb 26, 2017 at 4:28 PM, Sven R. Kunze <srkunze@mail.de> wrote:
Hello everyone,

I am currently evaluating the possibility of using PostgreSQL for storing and querying jsonb+tsvector queries. Let's consider this setup:

create table docs (id serial primary key, meta jsonb);
# generate 10M entries, cf. appendix
create index docs_meta_idx ON docs using gin (meta jsonb_path_ops);
create index docs_name_idx ON docs using gin (to_tsvector('english', meta->>'name'));
create index docs_address_idx ON docs using gin (to_tsvector('english', meta->>'address'));


A functional index tends to be slow; better to use separate column(s) for the tsvector.
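
A minimal sketch of that separate-column approach (the column, function and trigger names here are illustrative, not from the thread; since 9.5 has no generated columns, a trigger keeps the column in sync):

alter table docs add column name_tsv tsvector;
update docs set name_tsv = to_tsvector('english', meta->>'name');
create index docs_name_tsv_idx on docs using gin (name_tsv);

create function docs_name_tsv_update() returns trigger as $$
begin
  new.name_tsv := to_tsvector('english', new.meta->>'name');
  return new;
end;
$$ language plpgsql;

create trigger docs_name_tsv_trg before insert or update on docs
  for each row execute procedure docs_name_tsv_update();

Queries would then filter on "name_tsv @@ to_tsquery('english', 'john')" and hit the plain-column index directly.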
 

Testing around with some smaller datasets, functionality-wise it's great. However increasing to 10M, things tend to slow down (using PostgreSQL 9.5):


explain analyze select id from docs where meta @> '{"age": 20}';
 Planning time: 0.121 ms
 Execution time: 4873.507 ms

explain analyze select id from docs where meta @> '{"age": 20}';
 Planning time: 0.122 ms
 Execution time: 206.289 ms



explain analyze select id from docs where meta @> '{"age": 30}';
 Planning time: 0.109 ms
 Execution time: 7496.886 ms

explain analyze select id from docs where meta @> '{"age": 30}';
 Planning time: 0.114 ms
 Execution time: 1169.649 ms



explain analyze select id from docs where to_tsvector('english', meta->>'name') @@ to_tsquery('english', 'john');
 Planning time: 0.179 ms
 Execution time: 10109.375 ms

explain analyze select id from docs where to_tsvector('english', meta->>'name') @@ to_tsquery('english', 'john');
 
Planning time: 0.188 ms
 Execution time: 238.854 ms

What is the full output from explain analyze?
 


Using "select pg_prewarm('docs');" and on any of the indexes doesn't help either.
After a "systemctl stop postgresql.service && sync && echo 3 > /proc/sys/vm/drop_caches && systemctl start postgresql.service" the age=20, 30 or name=john queries are slow again.


Is there a way to speed up or to warm up things permanently?


Regards,
Sven


Appendix I:

example json:

{"age": 20, "name": "Michelle Hernandez", "birth": "1991-08-16", "address": "94753 Tina Bridge Suite 318\\nEmilyport, MT 75302"}



Appendix II:


The Python script to generate fake json data. Needs "pip install faker".

>>> python fake_json.py > test.json  # generates 2M entries; takes some time
>>> cat test.json | psql -c 'copy docs (meta) from stdin'
>>> cat test.json | psql -c 'copy docs (meta) from stdin'
>>> cat test.json | psql -c 'copy docs (meta) from stdin'
>>> cat test.json | psql -c 'copy docs (meta) from stdin'
>>> cat test.json | psql -c 'copy docs (meta) from stdin'


-- fake_json.py --

import faker, json;
fake = faker.Faker();
for i in range(2*10**6):
    print(json.dumps({"name": fake.name(), "birth": fake.date(), "address": fake.address(), "age": fake.random_int(0,100)}).replace('\\n', '\\\\n'))


Re: [PERFORM] Speeding up JSON + TSQUERY + GIN

From
"Sven R. Kunze"
Date:
Thanks Oleg for your reply.

On 26.02.2017 21:13, Oleg Bartunov wrote:
On Sun, Feb 26, 2017 at 4:28 PM, Sven R. Kunze <srkunze@mail.de> wrote:
create index docs_meta_idx ON docs using gin (meta jsonb_path_ops);
create index docs_name_idx ON docs using gin (to_tsvector('english', meta->>'name'));
create index docs_address_idx ON docs using gin (to_tsvector('english', meta->>'address'));


functional index tends to be slow, better use separate column(s) for tsvector

Why? Don't we have indexes to make them faster?

The idea is to accelerate all operations as specified (cf. the table schema below) without adding more and more columns.

what is full output from explain analyze ?

Okay, let's stick to the gin index + @> operator for now, before we tackle the functional index issue.
Maybe I did something wrong while defining the gin indexes:



explain analyze select id from docs where meta @> '{"age": 40}';
                                                           QUERY PLAN                                                           
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on docs  (cost=86.50..9982.50 rows=10000 width=4) (actual time=97.443..8073.983 rows=98385 loops=1)
   Recheck Cond: (meta @> '{"age": 40}'::jsonb)
   Heap Blocks: exact=79106
   ->  Bitmap Index Scan on docs_meta_idx  (cost=0.00..84.00 rows=10000 width=0) (actual time=66.878..66.878 rows=98385 loops=1)
         Index Cond: (meta @> '{"age": 40}'::jsonb)
 Planning time: 0.118 ms
 Execution time: 8093.533 ms
(7 rows)

explain analyze select id from docs where meta @> '{"age": 40}';
                                                           QUERY PLAN                                                           
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on docs  (cost=86.50..9982.50 rows=10000 width=4) (actual time=99.527..3349.001 rows=98385 loops=1)
   Recheck Cond: (meta @> '{"age": 40}'::jsonb)
   Heap Blocks: exact=79106
   ->  Bitmap Index Scan on docs_meta_idx  (cost=0.00..84.00 rows=10000 width=0) (actual time=68.503..68.503 rows=98385 loops=1)
         Index Cond: (meta @> '{"age": 40}'::jsonb)
 Planning time: 0.113 ms
 Execution time: 3360.773 ms
(7 rows)

explain analyze select id from docs where meta @> '{"age": 40}';
                                                           QUERY PLAN                                                           
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on docs  (cost=86.50..9982.50 rows=10000 width=4) (actual time=64.928..168.311 rows=98385 loops=1)
   Recheck Cond: (meta @> '{"age": 40}'::jsonb)
   Heap Blocks: exact=79106
   ->  Bitmap Index Scan on docs_meta_idx  (cost=0.00..84.00 rows=10000 width=0) (actual time=45.340..45.340 rows=98385 loops=1)
         Index Cond: (meta @> '{"age": 40}'::jsonb)
 Planning time: 0.121 ms
 Execution time: 171.098 ms
(7 rows)

explain analyze select id from docs where meta @> '{"age": 40}';
                                                           QUERY PLAN                                                           
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on docs  (cost=86.50..9982.50 rows=10000 width=4) (actual time=86.118..215.755 rows=98385 loops=1)
   Recheck Cond: (meta @> '{"age": 40}'::jsonb)
   Heap Blocks: exact=79106
   ->  Bitmap Index Scan on docs_meta_idx  (cost=0.00..84.00 rows=10000 width=0) (actual time=54.535..54.535 rows=98385 loops=1)
         Index Cond: (meta @> '{"age": 40}'::jsonb)
 Planning time: 0.127 ms
 Execution time: 219.746 ms
(7 rows)

explain analyze select id from docs where meta @> '{"age": 40}';
                                                           QUERY PLAN                                                           
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on docs  (cost=86.50..9982.50 rows=10000 width=4) (actual time=83.197..211.840 rows=98385 loops=1)
   Recheck Cond: (meta @> '{"age": 40}'::jsonb)
   Heap Blocks: exact=79106
   ->  Bitmap Index Scan on docs_meta_idx  (cost=0.00..84.00 rows=10000 width=0) (actual time=53.036..53.036 rows=98385 loops=1)
         Index Cond: (meta @> '{"age": 40}'::jsonb)
 Planning time: 0.127 ms
 Execution time: 215.753 ms
(7 rows)


Regards,
Sven


Table Schema:

                         Table "public.docs"
 Column |  Type   |                     Modifiers                    
--------+---------+---------------------------------------------------
 id     | integer | not null default nextval('docs_id_seq'::regclass)
 meta   | jsonb   |
Indexes:
    "docs_pkey" PRIMARY KEY, btree (id)
    "docs_address_idx" gin (to_tsvector('english'::regconfig, meta ->> 'address'::text))
    "docs_address_trgm_idx" gin ((meta ->> 'address'::text) gin_trgm_ops)
    "docs_birth_idx" btree ((meta ->> 'birth'::text))
    "docs_meta_idx" gin (meta jsonb_path_ops)
    "docs_name_idx" gin (to_tsvector('english'::regconfig, meta ->> 'name'::text))

Re: [PERFORM] Speeding up JSON + TSQUERY + GIN

From
Jeff Janes
Date:
On Sun, Feb 26, 2017 at 5:28 AM, Sven R. Kunze <srkunze@mail.de> wrote:


Using "select pg_prewarm('docs');" and on any of the indexes doesn't help either.
After a "systemctl stop postgresql.service && sync && echo 3 > /proc/sys/vm/drop_caches && systemctl start postgresql.service" the age=20, 30 or name=john queries are slow again.


Is there a way to speed up or to warm up things permanently?


If by 'permanently', you mean even when you intentionally break things, then no.  You will always be able to intentionally break things.  There is on-going discussion of an auto-prewarm feature.  But that doesn't yet exist; and once it does, a super user will always be able to break it.

Presumably you have a use-case in mind other than intentional sabotage of your caches by root.  But, what is it?  If you reboot the server frequently, maybe you can just throw 'select pg_prewarm...' into an init script?
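
A hedged sketch of what such an init script could execute (the object names match the schema above; the exact set of relations to warm is up to you):

-- once per database
create extension if not exists pg_prewarm;

-- after each restart: pull the table and its gin indexes into shared_buffers
select pg_prewarm('docs');
select pg_prewarm('docs_meta_idx');
select pg_prewarm('docs_name_idx');
select pg_prewarm('docs_address_idx');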

Cheers,

Jeff

Re: [PERFORM] Speeding up JSON + TSQUERY + GIN

From
"Sven R. Kunze"
Date:
On 27.02.2017 19:22, Jeff Janes wrote:
If by 'permanently', you mean even when you intentionally break things, then no.  You will always be able to intentionally break things.  There is on-going discussion of an auto-prewarm feature.  But that doesn't yet exist; and once it does, a super user will always be able to break it.

Presumably you have a use-case in mind other than intentional sabotage of your caches by root.  But, what is it?  If you reboot the server frequently, maybe you can just throw 'select pg_prewarm...' into an init script?

I didn't express myself well enough. pg_prewarm doesn't help to speed up those queries at all.


Looking at these numbers, I wonder why it takes ~5 seconds to answer.


Best,
Sven

Re: [PERFORM] Speeding up JSON + TSQUERY + GIN

From
Jeff Janes
Date:
On Tue, Feb 28, 2017 at 12:27 AM, Sven R. Kunze <srkunze@mail.de> wrote:
On 27.02.2017 19:22, Jeff Janes wrote:
If by 'permanently', you mean even when you intentionally break things, then no.  You will always be able to intentionally break things.  There is on-going discussion of an auto-prewarm feature.  But that doesn't yet exist; and once it does, a super user will always be able to break it.

Presumably you have a use-case in mind other than intentional sabotage of your caches by root.  But, what is it?  If you reboot the server frequently, maybe you can just throw 'select pg_prewarm...' into an init script?

I didn't express myself well enough. pg_prewarm doesn't help to speed up those queries at all.


Oh.  In my hands, it works very well.  I get 70 seconds to do the {age: 20} query from pure cold caches, versus 1.4 seconds from cold caches which was followed by pg_prewarm('docs','prefetch').

How much RAM do you have?  Maybe you don't have enough to hold the table in RAM.  What kind of IO system?  And what OS?


Cheers,

Jeff

Re: [PERFORM] Speeding up JSON + TSQUERY + GIN

From
"Sven R. Kunze"
Date:
On 28.02.2017 17:49, Jeff Janes wrote:
Oh.  In my hands, it works very well.  I get 70 seconds to do the {age: 20} query from pure cold caches, versus 1.4 seconds from cold caches which was followed by pg_prewarm('docs','prefetch').

How much RAM do you have?  Maybe you don't have enough to hold the table in RAM.  What kind of IO system?  And what OS?

On my test system:

RAM: 4GB
IO: SSD (random_page_cost = 1.0)
OS: Ubuntu 16.04

Regards,
Sven

Re: [PERFORM] Speeding up JSON + TSQUERY + GIN

From
Jeff Janes
Date:
On Wed, Mar 1, 2017 at 6:02 AM, Sven R. Kunze <srkunze@mail.de> wrote:
On 28.02.2017 17:49, Jeff Janes wrote:
Oh.  In my hands, it works very well.  I get 70 seconds to do the {age: 20} query from pure cold caches, versus 1.4 seconds from cold caches which was followed by pg_prewarm('docs','prefetch').

How much RAM do you have?  Maybe you don't have enough to hold the table in RAM.  What kind of IO system?  And what OS?

On my test system:

RAM: 4GB
IO: SSD (random_page_cost = 1.0)
OS: Ubuntu 16.04


4GB is not much RAM to be trying to pre-warm this amount of data into.  Towards the end of the pg_prewarm, it is probably evicting data read in by the earlier part of it.

What is shared_buffers?
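
(One way to check, sketched here rather than quoted from the thread: compare the on-disk size of the table plus its indexes with the memory available for caching.)

select pg_size_pretty(pg_total_relation_size('docs'));  -- heap + indexes + toast
show shared_buffers;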

Cheers,

Jeff

Re: [PERFORM] Speeding up JSON + TSQUERY + GIN

From
"Sven R. Kunze"
Date:
On 01.03.2017 18:04, Jeff Janes wrote:
On Wed, Mar 1, 2017 at 6:02 AM, Sven R. Kunze <srkunze@mail.de> wrote:
On 28.02.2017 17:49, Jeff Janes wrote:
Oh.  In my hands, it works very well.  I get 70 seconds to do the {age: 20} query from pure cold caches, versus 1.4 seconds from cold caches which was followed by pg_prewarm('docs','prefetch').

How much RAM do you have?  Maybe you don't have enough to hold the table in RAM.  What kind of IO system?  And what OS?

On my test system:

RAM: 4GB
IO: SSD (random_page_cost = 1.0)
OS: Ubuntu 16.04


4GB is not much RAM to be trying to pre-warm this amount of data into.  Towards the end of the pg_prewarm, it is probably evicting data read in by the earlier part of it.

What is shared_buffers?

942MB.

But I see where you are coming from. How come these queries need a Recheck Cond? I gather that this would require reading not only the index data but also the table itself, which could be huge, right?

Sven

Re: [PERFORM] Speeding up JSON + TSQUERY + GIN

From
Jeff Janes
Date:
On Thu, Mar 2, 2017 at 1:19 PM, Sven R. Kunze <srkunze@mail.de> wrote:
On 01.03.2017 18:04, Jeff Janes wrote:
On Wed, Mar 1, 2017 at 6:02 AM, Sven R. Kunze <srkunze@mail.de> wrote:
On 28.02.2017 17:49, Jeff Janes wrote:
Oh.  In my hands, it works very well.  I get 70 seconds to do the {age: 20} query from pure cold caches, versus 1.4 seconds from cold caches which was followed by pg_prewarm('docs','prefetch').

How much RAM do you have?  Maybe you don't have enough to hold the table in RAM.  What kind of IO system?  And what OS?

On my test system:

RAM: 4GB
IO: SSD (random_page_cost = 1.0)
OS: Ubuntu 16.04


4GB is not much RAM to be trying to pre-warm this amount of data into.  Towards the end of the pg_prewarm, it is probably evicting data read in by the earlier part of it.

What is shared_buffers?

942MB.

But I see where you are coming from. How come that these queries need a Recheck Cond? I gather that this would require reading not only the index data but also the table itself which could be huge, right?

Bitmaps can overflow and drop the row-level information, tracking only the blocks which need to be inspected.  So it has to have a recheck in case that happens (although in your case it is not actually overflowing--but it still needs to be prepared for that).  Also, I think that jsonb_path_ops indexes the hashes of the paths, so it can deliver false positives which need to be rechecked.  And you are selecting `id`, which is not in the index so it would have to consult the table anyway to retrieve that.  Even if it could get all the data from the index itself, I don't think GIN indexes support that feature.
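
To make the lossy case visible, one illustrative experiment (not from the thread) is to shrink work_mem until the bitmap can no longer hold exact row pointers; "Heap Blocks: lossy" and "Rows Removed by Index Recheck" then show up in the plan:

set work_mem = '64kB';
explain (analyze, buffers) select id from docs where meta @> '{"age": 40}';
reset work_mem;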

Cheers,

Jeff

Re: [PERFORM] Speeding up JSON + TSQUERY + GIN

From
"Sven R. Kunze"
Date:
On 06.03.2017 05:25, Jeff Janes wrote:
Bitmaps can overflow and drop the row-level information, tracking only the blocks which need to be inspected.  So it has to have a recheck in case that happens (although in your case it is not actually overflowing--but it still needs to be prepared for that).

Good to know.

Also, I think that jsonb_path_ops indexes the hashes of the paths, so it can deliver false positives which need to be rechecked.

Wow, that's a very important piece of information. It explains a lot. Thanks a lot.

And you are selecting `id`, which is not in the index so it would have to consult the table anyway to retrieve that.  Even if it could get all the data from the index itself, I don't think GIN indexes support that feature.

Yes, I see. I was actually sloppy about the query: what's really important here is counting the number of rows. However, from what I can see, this is the best PostgreSQL can do right now.
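
(For reference, the query of actual interest would then presumably be a count rather than a list of ids:)

select count(*) from docs where meta @> '{"age": 40}';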


Or do you have any more ideas on how to speed up counting?


Best,
Sven