Thread: Drastic select count performance hit when jsonb GIN indices are present

Drastic select count performance hit when jsonb GIN indices are present

From: Anton Melser

Hi,

I've been playing with jsonb for storing and querying data from the Mailgun Events API (http://documentation.mailgun.com/api-events.html#examples). I already have a system that parses the JSON to CSV and loads it into standard tables, but what better way to spend the holidays than nerding out on new postgres tech :-)? I am using the official postgres-supplied 9.4 Ubuntu repo on 14.04.

I have been testing both just inserting the raw API-call JSON ({"items":[{"item1"....],"paging":...}) and extracting the actual events from the items array ({"item1":...}, {"item2":...}) and inserting one row per event rather than one row per page of 100 events. I did the per-event version to try to benefit from GIN indices, but much to my surprise, adding indices when there is a row per event actually drastically *reduces* performance!
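For reference, the kind of per-event split I mean can be done in SQL along these lines (raw_pages here is just an illustrative staging table holding one raw API page per row; I'm not claiming this is the best way):

    -- hypothetical staging table: one API page (JSON document) per row
    create table raw_pages (page jsonb);
    -- one row per event, pulled out of the "items" array
    select item as event
    from raw_pages, jsonb_array_elements(page -> 'items') as item;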

mgevents=# create table myevents (event jsonb);
mgevents=# create index idx_myevents on myevents using gin (event jsonb_path_ops);
mgevents=# create index idx_myevents_no_path on myevents using gin (event);

mgevents=# select count(*) from myevents;
  count  
---------
 3715600
(1 row)

mgevents=# explain (analyze, buffers) select count(*) from myevents where event ->> 'event' = 'delivered';
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=279884.45..279884.46 rows=1 width=0) (actual time=14626.213..14626.216 rows=1 loops=1)
   Buffers: shared read=448208
   ->  Seq Scan on myevents  (cost=0.00..279838.00 rows=18578 width=0) (actual time=0.719..11432.283 rows=1417152 loops=1)
         Filter: ((event ->> 'event'::text) = 'delivered'::text)
         Rows Removed by Filter: 2298448
         Buffers: shared read=448208
 Planning time: 0.074 ms
 Execution time: 14626.955 ms
(8 rows)

mgevents=# explain (analyze, buffers) select count(*) from myevents where event @> '{"event": "delivered"}';
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3624.38..3624.39 rows=1 width=0) (actual time=80984.095..80984.098 rows=1 loops=1)
   Buffers: shared hit=1 read=298589
   ->  Bitmap Heap Scan on myevents  (cost=35.80..3615.09 rows=3716 width=0) (actual time=351.510..77669.907 rows=1417152 loops=1)
         Recheck Cond: (event @> '{"event": "delivered"}'::jsonb)
         Heap Blocks: exact=298362
         Buffers: shared hit=1 read=298589
         ->  Bitmap Index Scan on idx_myevents  (cost=0.00..34.87 rows=3716 width=0) (actual time=257.219..257.219 rows=1417152 loops=1)
               Index Cond: (event @> '{"event": "delivered"}'::jsonb)
               Buffers: shared hit=1 read=227
 Planning time: 3.197 ms
 Execution time: 80986.340 ms
(11 rows)

mgevents=# drop index idx_myevents;

mgevents=# explain (analyze, buffers) select count(*) from myevents where event @> '{"event": "delivered"}';
                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3631.38..3631.39 rows=1 width=0) (actual time=81898.250..81898.252 rows=1 loops=1)
   Buffers: shared hit=383 read=299133
   ->  Bitmap Heap Scan on myevents  (cost=42.80..3622.09 rows=3716 width=0) (actual time=534.816..78526.944 rows=1417152 loops=1)
         Recheck Cond: (event @> '{"event": "delivered"}'::jsonb)
         Heap Blocks: exact=298362
         Buffers: shared hit=383 read=299133
         ->  Bitmap Index Scan on idx_myevents_no_path  (cost=0.00..41.87 rows=3716 width=0) (actual time=453.412..453.412 rows=1417152 loops=1)
               Index Cond: (event @> '{"event": "delivered"}'::jsonb)
               Buffers: shared hit=383 read=771
 Planning time: 2.322 ms
 Execution time: 81898.578 ms
(11 rows)

mgevents=# drop index idx_myevents_no_path;

mgevents=# explain (analyze, buffers) select count(*) from myevents where event @> '{"event": "delivered"}';
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=270558.29..270558.30 rows=1 width=0) (actual time=19834.530..19834.532 rows=1 loops=1)
   Buffers: shared hit=130935 read=317273
   ->  Seq Scan on myevents  (cost=0.00..270549.00 rows=3716 width=0) (actual time=4.650..16764.726 rows=1417152 loops=1)
         Filter: (event @> '{"event": "delivered"}'::jsonb)
         Rows Removed by Filter: 2298448
         Buffers: shared hit=130935 read=317273
 Planning time: 0.238 ms
 Execution time: 19834.605 ms
(8 rows)

So it doesn't even appear to be a problem with the operator (->> vs @>), but rather that the planner is choosing the index over a plain scan. Now, as I'm completely new to jsonb and GIN, there are probably lots of better ways to do this and to configure the server (a VM on my laptop...), but it seems surprising that adding an index should ever drastically reduce select performance! I also tried an index on the event key itself, with and without jsonb_path_ops:

mgevents=# create index idx_myevents_event_type_path on myevents using gin ((event -> 'event') jsonb_path_ops);

But it just ends up doing a scan anyway, so I'm probably doing something wrong.
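If I'm reading the docs right, that expression index can only be matched when the query uses the same expression, so presumably something along these lines would be needed (and even then the planner may well still prefer a scan for a value this common):

    explain (analyze, buffers)
    select count(*) from myevents where event -> 'event' @> '"delivered"';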

Any pointers? Thanks!
Anton

Re: Drastic select count performance hit when jsonb GIN indices are present

From: Maxim Boguk

   ->  Bitmap Heap Scan on myevents  (cost=35.80..3615.09 rows=3716 width=0) (actual time=351.510..77669.907 rows=1417152 loops=1)
         Recheck Cond: (event @> '{"event": "delivered"}'::jsonb)
         Heap Blocks: exact=298362
         Buffers: shared hit=1 read=298589
 ...
 Execution time: 80986.340 ms

   ->  Bitmap Heap Scan on myevents  (cost=42.80..3622.09 rows=3716 width=0) (actual time=534.816..78526.944 rows=1417152 loops=1)
         Recheck Cond: (event @> '{"event": "delivered"}'::jsonb)
         Heap Blocks: exact=298362
         Buffers: shared hit=383 read=299133
 ...
 Execution time: 81898.578 ms

Hi Anton,

What you see there, I think, is the performance hit of random disk reads on a non-cached database.
Try increasing shared_buffers to a value the table and index can fit into, then redo the queries a few times until you see something like
"Buffers: shared hit=bigvalue read=0", and compare performance; it may change the timings quite a lot.

Also, I recommend setting track_io_timing=on in postgresql.conf, then using explain (analyze, buffers, timing) to check how much time the database spends doing IO operations.
Also try performing vacuum analyze myevents; before testing, because it seems the visibility map on the table is not up to date.

However, even in the fully cached case, selecting ~40% of the table's rows will almost always be faster via a sequential scan, so I don't expect miracles.
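Concretely, something along these lines (the shared_buffers value is only an example, and changing it requires a server restart):

    -- example value only; pick something the table and index fit into, then restart
    alter system set shared_buffers = '4GB';
    alter system set track_io_timing = on;   -- or edit postgresql.conf and reload
    -- after the restart:
    vacuum analyze myevents;
    explain (analyze, buffers, timing)
        select count(*) from myevents where event @> '{"event": "delivered"}';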

--
Maxim Boguk
Senior Postgresql DBA
http://www.postgresql-consulting.ru/

Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
Jabber: maxim.boguk@gmail.com
МойКруг: http://mboguk.moikrug.ru/

"People problems are solved with people.
If people cannot solve the problem, try technology.
People will then wish they'd listened at the first stage."

Re: Drastic select count performance hit when jsonb GIN indices are present

From: Anton Melser

Hi Maxim,

Thanks for the suggestions. Unfortunately, it does appear to be an IO issue, but not one I can get around (on my laptop at least!). Postgres seems to refuse to put any of the index or table in shared buffers at all. For some reason, there is a very large increase in the space required when each event (item) gets its own row, compared with keeping the original items arrays in far fewer database rows:

              relation               |    size    
-------------------------------------+------------
 public.myevents                     | 3502 MB
...
 pg_toast.pg_toast_57487             | 800 MB
...
 public.events                       | 2232 kB
...

myevents is the table where each event has a row to itself, and events keeps 100 events per row inside an items array in the JSON (the toast table above is where its data actually seems to be stored). The events table in fact contains more data, since it still has the paging objects that were stripped out of myevents. Performing vacuum analyze seems to have no effect either.
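For anyone wanting to reproduce the listing above, a query along these lines gives a similar result:

    select n.nspname || '.' || c.relname as relation,
           pg_size_pretty(pg_relation_size(c.oid)) as size
    from pg_class c
    join pg_namespace n on n.oid = c.relnamespace
    order by pg_relation_size(c.oid) desc
    limit 20;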

Getting back to my original point: you pointed out that for queries needing a decent % of the table it will be cheaper to do a scan, which is exactly what the query planner does for the relational version. If it only needs a small % of the values it uses the index, and for a large % it goes for a scan (it also puts everything in shared buffers and is lightning quick!). Is this just a lack of maturity in the jsonb planner, or am I missing something?

Thanks again,
Anton


--
echo '16i[q]sa[ln0=aln100%Pln100/snlbx]sbA0D4D465452snlbxq' | dc
This will help you for 99.9% of your problems ...

Re: Drastic select count performance hit when jsonb GIN indices are present

From: Maxim Boguk


Getting back to my original point: you pointed out that for queries needing a decent % of the table it will be cheaper to do a scan, which is exactly what the query planner does for the relational version. If it only needs a small % of the values it uses the index, and for a large % it goes for a scan (it also puts everything in shared buffers and is lightning quick!). Is this just a lack of maturity in the jsonb planner, or am I missing something?

Hi Anton,

Good selectivity estimators exist only for scalar data types.
For complex data types such as json/jsonb, introducing a reasonable selectivity estimator is a very complicated task, so the database can only guess in these cases.
In your case the database's guess at the number of returned rows was off by almost three orders of magnitude (estimated 3716 rows, actual 1417152 rows).
Personally, I don't expect serious progress in json/jsonb selectivity estimators in the near future, so it's better to avoid low-selectivity queries against indexed json/jsonb fields.
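For illustration: 3716 is almost exactly 0.1% of the table's 3,715,600 rows (3,715,600 × 0.001 ≈ 3716), which looks like a fixed default guess rather than anything derived from the data, while the real fraction is around 38%. You can check the actual fraction with something like:

    select 100.0 * count(*) filter (where event @> '{"event": "delivered"}') / count(*) as actual_pct
    from myevents;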

 
--
Maxim Boguk
Senior Postgresql DBA
http://www.postgresql-consulting.ru/

Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
Jabber: maxim.boguk@gmail.com
МойКруг: http://mboguk.moikrug.ru/

"People problems are solved with people.
If people cannot solve the problem, try technology.
People will then wish they'd listened at the first stage."