Slow join over three tables

From
Tim van der Linden
Date:
Hi all

I have asked this question in a somewhat different form on the DBA Stackexchange site, but without much luck
(https://dba.stackexchange.com/questions/136423/postgresql-slow-join-on-three-large-tables). So I apologize for possible
double posting, but I hope this might get a better response on the mailing list.

I'm joining three fairly large tables together, and it is slow. The tables are:

- "reports": 6 million rows
- "report_drugs": 20 million rows
- "report_adverses": 20 million rows

The table "reports" holds main report data and has a primary key column "id". The other two tables have a foreign key
to that table with "rid". Those are the columns I use to join them together.

All tables have indexes on the "id"/"rid" columns and on the "drug"/"adverse" columns.
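
For completeness, the indexes look roughly like this (index names taken from the query plans below; the name of the
"rid" index on "report_drugs" is a guess on my part):

CREATE INDEX report_drugs_drug_idx ON report_drugs (drug);
CREATE INDEX report_drugs_rid_idx ON report_drugs (rid);               -- name from memory
CREATE INDEX report_adverses_adverse_idx ON report_adverses (adverse);
CREATE INDEX report_adverses_rid_idx ON report_adverses (rid);
-- "id" on reports is covered by reports_id_key, the unique index behind the primary key.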

The query:

SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM reports r
JOIN report_drugs d ON d.rid = r.id
JOIN report_adverses a ON a.rid = r.id
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;

The plan:

Sort  (cost=105773.63..105774.46 rows=333 width=76) (actual time=5143.162..5143.185 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=1.31..105759.68 rows=333 width=76) (actual time=54.784..5142.872 rows=448 loops=1)
     Join Filter: (d.rid = a.rid)
     ->  Nested Loop  (cost=0.87..94657.59 rows=14005 width=72) (actual time=0.822..2038.952 rows=14199 loops=1)
         ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900 rows=14199 loops=1)
               Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
         ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.71 rows=1 width=41) (actual time=0.143..0.143 rows=1 loops=14199)
               Index Cond: (id = d.rid)
     ->  Index Scan using report_adverses_rid_idx on report_adverses a  (cost=0.44..0.78 rows=1 width=12) (actual time=0.218..0.218 rows=0 loops=14199)
           Index Cond: (rid = r.id)
           Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
           Rows Removed by Filter: 5
Planning time: 13.994 ms
Execution time: 5143.235 ms

This takes well over 5 seconds, which to me, feels much too slow.
If I query each table directly with the same conditions, thus:

SELECT reason
FROM report_drugs
WHERE drug = ANY (ARRAY[359, 360, 361, 362, 363]);

I get:

Index Scan using report_drugs_drug_idx on report_drugs  (cost=0.44..500.28 rows=14005 width=27) (actual time=0.621..4.510 rows=14199 loops=1)
  Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
Planning time: 6.939 ms
Execution time: 4.759 ms

Under 5 ms. The same goes for querying the "adverse" column in the "report_adverses" table: under 20 ms.
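
The equivalent probe on that side being, for example:

SELECT adverse
FROM report_adverses
WHERE adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain']);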

This indicates to me that indeed the join itself causes a major performance bottleneck.

I'm running the cluster from an SSD drive, as a traditional HDD could not even manage the query in under 5 minutes. The
system has a total memory of 24 GB, runs on Debian and uses a 4 GHz 8-core i7-4790 processor.

Some important postgresql.conf readouts:

- shared_buffers = 4GB
- work_mem = 64MB
- maintenance_work_mem = 1GB
- checkpoint_segments = 50
- checkpoint_completion_target = 0.9
- autovacuum = on

Is there something I am missing here? Any help on getting this join faster is much appreciated.

Cheers,
Tim


Re: Slow join over three tables

From
Sameer Kumar
Date:


On Wed, 27 Apr 2016 07:28 Tim van der Linden, <tim@shisaa.jp> wrote:
Hi all

I have asked this question in a somewhat different form on the DBA Stackexchange site, but without much luck (https://dba.stackexchange.com/questions/136423/postgresql-slow-join-on-three-large-tables). So I apologize for possible double posting, but I hope this might get a better response on the mailing list.

I'm joining three fairly large tables together, and it is slow. The tables are:

- "reports": 6 million rows
- "report_drugs": 20 million rows
- "report_adverses": 20 million rows

The table "reports" holds main report data and has a primary key column "id". The other two tables have a foreign key to that table with "rid". Those are the columns I use to join them together.

All tables have indexes on the "id"/"rid" columns and on the "drug"/"adverse" columns.

The query:

SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM reports r
JOIN report_drugs d ON d.rid = r.id
JOIN report_adverses a ON a.rid = r.id
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;

The plan:

Sort  (cost=105773.63..105774.46 rows=333 width=76) (actual time=5143.162..5143.185 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=1.31..105759.68 rows=333 width=76) (actual time=54.784..5142.872 rows=448 loops=1)
     Join Filter: (d.rid = a.rid)
     ->  Nested Loop  (cost=0.87..94657.59 rows=14005 width=72) (actual time=0.822..2038.952 rows=14199 loops=1)

Quite clearly the nested loop joins are the most costly operations here.


         ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900 rows=14199 loops=1)
               Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
         ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.71 rows=1 width=41) (actual time=0.143..0.143 rows=1 loops=14199)
               Index Cond: (id = d.rid)
     ->  Index Scan using report_adverses_rid_idx on report_adverses a  (cost=0.44..0.78 rows=1 width=12) (actual time=0.218..0.218 rows=0 loops=14199)
           Index Cond: (rid = r.id)
           Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
           Rows Removed by Filter: 5

I suppose it might help if the filters are performed before the join. I am not an expert on the optimizer, but I guess it might help if you change the join order and add duplicate conditions for reports:

SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM report_drugs d 
JOIN report_adverses a ON a.rid = d.rid
JOIN reports r ON d.rid = r.id
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;


Or, since you are using INNER JOIN (as far as I understand the concept of joins), it won't hurt the result set if the WHERE clause is pushed into the INNER JOIN criteria:

SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM report_drugs d
JOIN report_adverses a ON a.rid = d.rid
    AND a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
    AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
JOIN reports r ON d.rid = r.id
ORDER BY r.created;


Planning time: 13.994 ms
Execution time: 5143.235 ms

This takes well over 5 seconds, which to me, feels much too slow.

Well, given your hardware (which seems to be pretty decent), I am not sure if this is an unfair expectation.
 
If I query each table directly with the same conditions, thus:

SELECT reason
FROM report_drugs
WHERE drug = ANY (ARRAY[359, 360, 361, 362, 363]);

I get:

Index Scan using report_drugs_drug_idx on report_drugs  (cost=0.44..500.28 rows=14005 width=27) (actual time=0.621..4.510 rows=14199 loops=1)
  Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
Planning time: 6.939 ms
Execution time: 4.759 ms

Under 5 ms. The same goes for querying the "adverse" column in the "report_adverses" table: under 20 ms.

This indicates to me that indeed the join itself causes a major performance bottleneck.

I'm running the cluster from an SSD drive, as a traditional HDD could not even manage the query in under 5 minutes. The system has a total memory of 24 GB, runs on Debian and uses an 4Ghz 8 core i7-4790 processor.

Some important postgresql.conf readouts:

- shared_buffers = 4GB
- work_mem = 64MB
- maintenance_work_mem = 1GB
- checkpoint_segments = 50
- checkpoint_completion_target = 0.9
- autovacuum = on

Is there something I am missing here? Any help on getting this join faster is much appreciated.

Cheers,
Tim


--
Best Regards
Sameer Kumar | DB Solution Architect 
ASHNIK PTE. LTD.

101 Cecil Street, #11-11 Tong Eng Building, Singapore 069 533

T: +65 6438 3504 | M: +65 8110 0350 | www.ashnik.com

Re: Slow join over three tables

From
David Rowley
Date:
On 27 April 2016 at 11:27, Tim van der Linden <tim@shisaa.jp> wrote:
> Hi all
>
> I have asked this question in a somewhat different form on the DBA Stackexchange site, but without much luck
> (https://dba.stackexchange.com/questions/136423/postgresql-slow-join-on-three-large-tables). So I apologize for possible
> double posting, but I hope this might get a better response on the mailing list.
>
> I'm joining three fairly large tables together, and it is slow. The tables are:
>
> - "reports": 6 million rows
> - "report_drugs": 20 million rows
> - "report_adverses": 20 million rows
>
> The table "reports" holds main report data and has a primary key column "id". The other two tables have a foreign key
> to that table with "rid". Those are the columns I use to join them together.
>
> All tables have indexes on the "id"/"rid" columns and on the "drug"/"adverse" columns.
>
> The query:
>
> SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
> FROM reports r
> JOIN report_drugs d ON d.rid = r.id
> JOIN report_adverses a ON a.rid = r.id
> WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
> AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;
>
> The plan:
>
> Sort  (cost=105773.63..105774.46 rows=333 width=76) (actual time=5143.162..5143.185 rows=448 loops=1)
>    Sort Key: r.created
>    Sort Method: quicksort  Memory: 60kB
>    ->  Nested Loop  (cost=1.31..105759.68 rows=333 width=76) (actual time=54.784..5142.872 rows=448 loops=1)
>      Join Filter: (d.rid = a.rid)
>      ->  Nested Loop  (cost=0.87..94657.59 rows=14005 width=72) (actual time=0.822..2038.952 rows=14199 loops=1)
>          ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900 rows=14199 loops=1)
>                Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
>          ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.71 rows=1 width=41) (actual time=0.143..0.143 rows=1 loops=14199)
>                Index Cond: (id = d.rid)
>      ->  Index Scan using report_adverses_rid_idx on report_adverses a  (cost=0.44..0.78 rows=1 width=12) (actual time=0.218..0.218 rows=0 loops=14199)
>            Index Cond: (rid = r.id)
>            Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
>            Rows Removed by Filter: 5
> Planning time: 13.994 ms
> Execution time: 5143.235 ms
>
> This takes well over 5 seconds, which to me, feels much too slow.
> If I query each table directly with the same conditions, thus:
>
> SELECT reason
> FROM report_drugs
> WHERE drug = ANY (ARRAY[359, 360, 361, 362, 363]);
>
> I get:
>
> Index Scan using report_drugs_drug_idx on report_drugs  (cost=0.44..500.28 rows=14005 width=27) (actual time=0.621..4.510 rows=14199 loops=1)
>   Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
> Planning time: 6.939 ms
> Execution time: 4.759 ms
>
> Under 5 ms. The same goes for querying the "adverse" column in the "report_adverses" table: under 20 ms.

I'm not sure why you're comparing this to the join plan above. They're
very different; I can only imagine that it's because you've not quite
understood what the EXPLAIN output is saying.

>      ->  Nested Loop  (cost=0.87..94657.59 rows=14005 width=72) (actual time=0.822..2038.952 rows=14199 loops=1)
>          ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900 rows=14199 loops=1)
>                Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
>          ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.71 rows=1 width=41) (actual time=0.143..0.143 rows=1 loops=14199)
>                Index Cond: (id = d.rid)

This is a parameterised nested loop. This means that the inner side of
the loop (reports) is parameterised by the outer side of the loop; you
can see the d.rid in the Index Cond. This means that the
reports_id_key index is looked up 14199 times. You can see from the
plan that the nested loop here took 2038.952 milliseconds to complete,
so about 0.144 ms per lookup. Not too bad, right?

The alternative to a parameterised nested loop plan is that a Hash
Join plan is generated. If you imagine what would happen here, likely
the matching report_drugs records would be hashed, then the outer side
of the join would then perform a SeqScan over the entire 6 million
reports records... probably not better. You can test this yourself
with SET enable_nestloop = 0; you might need SET enable_mergejoin = 0;
too. I imagine it'll be slower.
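
For example (SET LOCAL keeps the change to the current transaction, so
the experiment can't leak into other sessions):

BEGIN;
SET LOCAL enable_nestloop = off;
SET LOCAL enable_mergejoin = off;
EXPLAIN ANALYZE
SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM reports r
JOIN report_drugs d ON d.rid = r.id
JOIN report_adverses a ON a.rid = r.id
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;
ROLLBACK;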

Likely the only way to speed this up would be to create indexes:

create index on reports (id, age, gender, created);

the above might allow an index only scan, which should speed up the
nested loop a bit. This will only be useful if you're never going to
need other report columns in the SELECT list.

and perhaps:

create index on report_adverses (rid, adverse);

this might speedup the 2nd join a bit.

On the other hand, if you do find SET enable_nestloop = 0; to be
faster, then you may want to tweak some costs to encourage the planner
to choose that plan.
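
For example, something like this (illustrative values only; random_page_cost
defaults to 4.0, which assumes spinning disks, so a much lower value is
usually more realistic on an SSD):

SET random_page_cost = 1.5;         -- closer to seq_page_cost (1.0) on SSD
SET effective_cache_size = '16GB';  -- planner estimate only, allocates nothing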

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Slow join over three tables

From
Victor Yegorov
Date:
2016-04-27 2:27 GMT+03:00 Tim van der Linden <tim@shisaa.jp>:
The plan:

Sort  (cost=105773.63..105774.46 rows=333 width=76) (actual time=5143.162..5143.185 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=1.31..105759.68 rows=333 width=76) (actual time=54.784..5142.872 rows=448 loops=1)
     Join Filter: (d.rid = a.rid)
     ->  Nested Loop  (cost=0.87..94657.59 rows=14005 width=72) (actual time=0.822..2038.952 rows=14199 loops=1)
         ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900 rows=14199 loops=1)
               Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
         ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.71 rows=1 width=41) (actual time=0.143..0.143 rows=1 loops=14199)
               Index Cond: (id = d.rid)
     ->  Index Scan using report_adverses_rid_idx on report_adverses a  (cost=0.44..0.78 rows=1 width=12) (actual time=0.218..0.218 rows=0 loops=14199)
           Index Cond: (rid = r.id)
           Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
           Rows Removed by Filter: 5
Planning time: 13.994 ms
Execution time: 5143.235 ms

Can you post output of `EXPLAIN (analyze, buffers)`, please?
It'd be good to check how many buffers are hit/read during Index Scans.

I have had a somewhat similar case where, due to misconfigured autovacuum (not aggressive enough for the increased volume),
bloat accumulated in the indexes. Perhaps REINDEX can help here.
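
For example (note that plain REINDEX rebuilds the indexes from scratch, but
locks out writes to the table while it runs):

REINDEX TABLE report_drugs;
REINDEX TABLE report_adverses;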


--
Victor Y. Yegorov

Re: Slow join over three tables

From
Tim van der Linden
Date:
On Wed, 27 Apr 2016 01:45:55 +0000
Sameer Kumar <sameer.kumar@ashnik.com> wrote:

Hi Sameer

Thanks for taking the time to look into this!

> > ...
> Quite clearly the nested loop joins are the most costly operations here.

Indeed.

> > ...
> I suppose. It might help if the filters are performed before the join. I am
> not an expert on optimizer but I guess it might help if you change the join
> order and add duplicate conditions for reports-
>
> SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
> FROM report_drugs d
> JOIN report_adverses a ON a.rid = d.rid
> JOIN reports r ON d.rid = r.id
> WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back
> pain', 'back pain'])
> AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;

Looks like a nice approach, but it had no effect on the query time. The plan for this approach:

 Sort  (cost=104928.07..104928.86 rows=317 width=41) (actual time=5435.210..5435.236 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=1.31..104914.90 rows=317 width=41) (actual time=57.230..5434.930 rows=448 loops=1)
         Join Filter: (d.rid = a.rid)
         ->  Nested Loop  (cost=0.87..93919.79 rows=13870 width=28) (actual time=0.569..2240.955 rows=14200 loops=1)
                ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..496.92 rows=13870 width=8) (actual time=0.565..4.678 rows=14200 loops=1)
                      Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
                ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.73 rows=1 width=20) (actual time=0.157..0.157 rows=1 loops=14200)
                      Index Cond: (id = d.rid)
          ->  Index Scan using report_adverses_rid_idx on report_adverses a  (cost=0.44..0.78 rows=1 width=21) (actual time=0.224..0.225 rows=0 loops=14200)
                Index Cond: (rid = r.id)
                Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
               Rows Removed by Filter: 5
 Planning time: 18.512 ms
 Execution time: 5435.293 ms


> OR since you are using INNER JOIN, (As far as I understand the concept of
> joins) it won't hurt the result set if the where clause is pushed into the
> INNER JOIN criteria-

Correct. I have tried those as well, but the planner seems to take the exact same path and as a result the query time
is unchanged.

> > ...
>
> --
> Best Regards
> Sameer Kumar | DB Solution Architect

Cheers,
Tim


Re: Slow join over three tables

From
Tim van der Linden
Date:
On Wed, 27 Apr 2016 14:09:04 +1200
David Rowley <david.rowley@2ndquadrant.com> wrote:

Hi David

Thanks for your time on this. I tried your proposals with the results below.

> > ...
> > Under 5 ms. The same goes for querying the "adverse" column in the "report_adverses" table: under 20 ms.
>
> I'm not sure why you're comparing this to the join plan above. They're
> very different, I can only imagine that it' because you've not quite
> understood what the EXPLAIN output is saying.

Well, it is more due to me poorly wording the question, I guess. I understand that both plans are totally different and
both queries are fairly incomparable, as two very different things are happening.

My point here was to show that the "drug" column seems to be indexed correctly and that, even though it is a very large
table, the query executes very fast (all these execution times are given after flushing the OS's disk cache).

> >      ->  Nested Loop  (cost=0.87..94657.59 rows=14005 width=72) (actual time=0.822..2038.952 rows=14199 loops=1)
> >          ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900 rows=14199 loops=1)
> >                Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
> >          ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.71 rows=1 width=41) (actual time=0.143..0.143 rows=1 loops=14199)
> >                Index Cond: (id = d.rid)
>
> This is a parameterised nested loop. This means that the inner side of
> the loop (reports), is parameterised by the outerside of the loop, you
> can see the d.rid in the Index Cond. This means that the
> reports_id_key index is looked up 14199 times. You can see from the
> plan that the nested loop here took 2038.952 milliseconds to complete,
> so about 0.144 ms per lookup, not too bad, right?

That is indeed not bad at all, thanks for pointing this out. This nested loop is thus responsible for a little over 40%
of the execution time.

> The alternative to a parameterised nested loop plan is that a Hash
> Join plan is generated. If you imagine what would happen here, likely
> the matching report_drugs records would be hashed, then the outer side
> of the join would then perform a SeqScan over the entire 6 million
> reports records... probably not better. You can test this yourself
> with; SET enable_nestloop = 0; you might need to SET enable_mergejoin
> = 0; too. I imagine it'll be slower.

Doing this with the current (unaltered) indexes almost doubles the time and it seems to behave exactly as you
mentioned:

 Sort  (cost=372968.28..372969.07 rows=317 width=41) (actual time=9308.174..9308.187 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Hash Join  (cost=130941.07..372955.11 rows=317 width=41) (actual time=8131.509..9307.983 rows=448 loops=1)
         Hash Cond: (r.id = d.rid)
          ->  Seq Scan on reports r  (cost=0.00..105465.00 rows=5749300 width=20) (actual time=0.237..849.564 rows=5749300 loops=1)
          ->  Hash  (cost=130937.10..130937.10 rows=317 width=29) (actual time=8130.175..8130.175 rows=448 loops=1)
                Buckets: 1024  Batches: 1  Memory Usage: 25kB
                ->  Hash Join  (cost=122150.13..130937.10 rows=317 width=29) (actual time=8118.130..8130.090 rows=448 loops=1)
                      Hash Cond: (d.rid = a.rid)
                      ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..496.92 rows=13870 width=8) (actual time=0.830..3.503 rows=14200 loops=1)
                            Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
                      ->  Hash  (cost=121170.69..121170.69 rows=78320 width=21) (actual time=8116.544..8116.544 rows=76972 loops=1)
                            Buckets: 8192  Batches: 1  Memory Usage: 3609kB
                            ->  Bitmap Heap Scan on report_adverses a  (cost=2024.68..121170.69 rows=78320 width=21) (actual time=19.303..8093.257 rows=76972 loops=1)
                                  Recheck Cond: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
                                  Heap Blocks: exact=56959
                                  ->  Bitmap Index Scan on report_adverses_adverse_idx  (cost=0.00..2005.10 rows=78320 width=0) (actual time=11.756..11.756 rows=76972 loops=1)
                                        Index Cond: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
 Planning time: 11.348 ms
 Execution time: 9308.694 ms

> Likely the only way to speed this up would be to create indexes;
>
> create index on reports (id, age, gender, created);
> ...
> create index on report_adverses (rid, adverse);
>
> this might speedup the 2nd join a bit.

Done. This does help a bit. The query runs about ~1 second faster. The plan:

 Sort  (cost=66065.73..66066.59 rows=344 width=41) (actual time=4313.679..4313.708 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=1.44..66051.23 rows=344 width=41) (actual time=43.987..4313.435 rows=448 loops=1)
         Join Filter: (d.rid = a.rid)
         ->  Nested Loop  (cost=1.00..54700.19 rows=14319 width=28) (actual time=0.772..1158.338 rows=14200 loops=1)
                ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..507.78 rows=14319 width=8) (actual time=0.579..4.327 rows=14200 loops=1)
                      Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
                ->  Index Only Scan using reports_id_age_gender_created_idx on reports r  (cost=0.56..3.77 rows=1 width=20) (actual time=0.081..0.081 rows=1 loops=14200)
                      Index Cond: (id = d.rid)
                      Heap Fetches: 0
          ->  Index Scan using report_adverses_rid_idx on report_adverses a  (cost=0.44..0.78 rows=1 width=21) (actual time=0.222..0.222 rows=0 loops=14200)
               Index Cond: (rid = r.id)
               Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
               Rows Removed by Filter: 5
 Planning time: 15.968 ms
 Execution time: 4313.755 ms

Both the (rid, adverse) and the (id, age, gender, created) indexes are now used.

Yet ... this is "only" 1 second faster, still 4 seconds to complete this query (the join(s)). Is this truly the most
that I could get out of it ... or could further tweaking of PostgreSQL's configuration be of any help here?

>  David Rowley                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services

Cheers,
Tim


Re: Slow join over three tables

From
Tim van der Linden
Date:
On Wed, 27 Apr 2016 09:14:27 +0300
Victor Yegorov <vyegorov@gmail.com> wrote:

Hi Victor

> > ...
>
> Can you post output of `EXPLAIN (analyze, buffers)`, please?
> It'd be good to check how many buffers are hit/read during Index Scans.

Happy to, here it is:

 Sort  (cost=107727.85..107728.71 rows=344 width=41) (actual time=5357.453..5357.482 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   Buffers: shared hit=75462 read=37183
   ->  Nested Loop  (cost=1.31..107713.36 rows=344 width=41) (actual time=53.866..5357.211 rows=448 loops=1)
         Join Filter: (d.rid = a.rid)
         Buffers: shared hit=75459 read=37183
         ->  Nested Loop  (cost=0.87..96362.32 rows=14319 width=28) (actual time=0.569..2214.900 rows=14200 loops=1)
               Buffers: shared hit=39949 read=17072
                ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..507.78 rows=14319 width=8) (actual time=0.567..4.406 rows=14200 loops=1)
                      Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
                      Buffers: shared hit=14 read=131
                ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.68 rows=1 width=20) (actual time=0.155..0.155 rows=1 loops=14200)
                      Index Cond: (id = d.rid)
                      Buffers: shared hit=39935 read=16941
          ->  Index Scan using report_adverses_rid_idx on report_adverses a  (cost=0.44..0.78 rows=1 width=21) (actual time=0.221..0.221 rows=0 loops=14200)
               Index Cond: (rid = r.id)
               Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
               Rows Removed by Filter: 5
               Buffers: shared hit=35510 read=20111
 Planning time: 13.962 ms
 Execution time: 5357.520 ms

> --
> Victor Y. Yegorov

Cheers,
Tim


Re: Slow join over three tables

From
David Rowley
Date:
On 27 April 2016 at 22:29, Tim van der Linden <tim@shisaa.jp> wrote:
>  Sort  (cost=372968.28..372969.07 rows=317 width=41) (actual time=9308.174..9308.187 rows=448 loops=1)
>    Sort Key: r.created
>  Sort  (cost=66065.73..66066.59 rows=344 width=41) (actual time=4313.679..4313.708 rows=448 loops=1)
>    Sort Key: r.created
>    Sort Method: quicksort  Memory: 60kB
>    ->  Nested Loop  (cost=1.44..66051.23 rows=344 width=41) (actual time=43.987..4313.435 rows=448 loops=1)
>          Join Filter: (d.rid = a.rid)
>          ->  Nested Loop  (cost=1.00..54700.19 rows=14319 width=28) (actual time=0.772..1158.338 rows=14200 loops=1)
>                ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..507.78 rows=14319 width=8) (actual time=0.579..4.327 rows=14200 loops=1)
>                      Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
>                ->  Index Only Scan using reports_id_age_gender_created_idx on reports r  (cost=0.56..3.77 rows=1 width=20) (actual time=0.081..0.081 rows=1 loops=14200)
>                      Index Cond: (id = d.rid)
>                      Heap Fetches: 0
>          ->  Index Scan using report_adverses_rid_idx on report_adverses a  (cost=0.44..0.78 rows=1 width=21) (actual time=0.222..0.222 rows=0 loops=14200)
>                Index Cond: (rid = r.id)
>                Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
>                Rows Removed by Filter: 5
>  Planning time: 15.968 ms
>  Execution time: 4313.755 ms
>
> Both the (rid, adverse) and the (id, age, gender, created) indexes are now used.
>

Seems the (rid, adverse) index is not being used; report_adverses_rid_idx
is your (rid) index.

> Yet ... this is "only" 1 second faster, still 4 seconds to complete this query (the join(s)). Is this truly the most
> that I could get out of it ... or could further tweaking of PostgreSQL's configuration be of any help here?

EXPLAIN ANALYZE also has quite a bit of timing overhead, so it might
not be taking quite as long as you think.

How long does it take with EXPLAIN (ANALYZE, TIMING OFF) ... ?

Or perhaps just run the query, as there's only 448 rows anyway.
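
For example, in psql:

\timing on
-- now run the bare query; psql reports "Time: ... ms" measured at the
-- client, without the per-node instrumentation overhead of EXPLAIN ANALYZE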

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Slow join over three tables

From
Tim van der Linden
Date:
On Wed, 27 Apr 2016 22:40:43 +1200
David Rowley <david.rowley@2ndquadrant.com> wrote:

Hi David

> > ...
> >  Planning time: 15.968 ms
> >  Execution time: 4313.755 ms
> >
> > Both the (rid, adverse) and the (id, age, gender, created) indexes are now used.
> >
>
> Seems the (rid, adverse) is not being used. report_adverses_rid_idx
> is your (rid) index.

Ah ... good catch, I blame it on the late evening over here ;)

> > Yet ... this is "only" 1 second faster, still 4 seconds to complete this query (the join(s)). Is this truly the
> > most that I could get out of it ... or could further tweaking of PostgreSQL's configuration be of any help here?
>
> EXPLAIN ANALYZE also has quite a bit of timing overhead, so it might
> not be taking quite as long as you think.
>
> How long does it take with EXPLAIN (ANALYZE, TIMING OFF) ... ?

Hmmm, this takes roughly the same amount of time:

 Planning time: 39.889 ms
 Execution time: 4532.642 ms

> Or perhaps just run the query, as there's only 448 rows anyway.

Also, did not speed up much ... query time pulled from the log statement: 4305.686 ms.

> --
>  David Rowley                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services

Cheers,
Tim


Re: Slow join over three tables

From
Alban Hertroys
Date:
> On 27 Apr 2016, at 4:09, David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> On 27 April 2016 at 11:27, Tim van der Linden <tim@shisaa.jp> wrote:

>> The query:
>>
>> SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
>> FROM reports r
>> JOIN report_drugs d ON d.rid = r.id
>> JOIN report_adverses a ON a.rid = r.id
>> WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
>> AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;
>>

> Likely the only way to speed this up would be to create indexes;
>
> create index on reports (id, age, gender, created);

Since you're not using age and gender in this (particular) query until the rows are combined into a result set already,
it doesn't make a whole lot of sense to add them to the index. Moreover, since you're ordering by created, I'd at least
put that directly after id:

    create index on reports (id, created);

> create index on report_adverses (rid, adverse);

In this case, you're using the values in adverse to filter relevant rid's for the FK join, so you might be better off
with the inverse of the above index:
    create index on report_adverses (adverse, rid);
    create index on report_drugs (drug, rid);

Do these tables have a primary key and can that be used here?

I hope I'm not being superfluous here, but don't forget to vacuum analyze after creating those indexes. Eventually
autovacuum will do that for you, but you could otherwise be running tests to verify the impact of adding those indexes
before autovacuum gets around to it.
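
Something like:

VACUUM ANALYZE reports;
VACUUM ANALYZE report_drugs;
VACUUM ANALYZE report_adverses;

The VACUUM part also matters for index-only scans: they can only skip heap
visits for pages that the visibility map marks all-visible, and VACUUM is
what maintains that map.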

Finally, perhaps it's more efficient to weed out all unrelated drugs and adverses before relating them to reports,
something like:

SELECT r.id, r.age, r.gender, r.created, x.adverse, x.drug
FROM reports r
JOIN (
    SELECT d.rid, a.adverse, d.drug
    FROM report_drugs d
    JOIN report_adverses a ON a.rid = d.rid
    WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
    AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
) x ON x.rid = r.id
ORDER BY r.created;

Looking at the cardinality of your tables that does seem a bit unlikely though. Still, worth a shot...

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



Re: Slow join over three tables

From
Tim van der Linden
Date:
On Wed, 27 Apr 2016 13:48:06 +0200
Alban Hertroys <haramrae@gmail.com> wrote:

Hi Alban

Thanks for chiming in!

> Since you're not using age and gender in this (particular) query until the rows are combined into a result set
> already, it doesn't make a whole lot of sense to add them to the index. Moreover, since you're ordering by created, I'd
> at least put that directly after id:
>
>     create index on reports (id, created);

The ORDER statement did not attribute much to the total query time, but it makes sense to index this.

> In this case, you're using the values in adverse to filter relevant rid's for the FK join, so you might be better off
> with the inverse of the above index:
>     create index on report_adverses (adverse, rid);
>     create index on report_drugs (drug, rid);

Hold the phone ... you just cracked the code.

I added these two proposed indexes (reversing their order). After purging the OS disk cache, the query runs in (*drum
roll*) 120 ms. You managed to speed this up ~40 times.

The query plan:

 Sort  (cost=12108.99..12109.83 rows=337 width=41) (actual time=119.517..119.531 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=10940.25..12094.84 rows=337 width=41) (actual time=26.473..119.392 rows=448 loops=1)
         Join Filter: (d.rid = r.id)
         ->  Merge Join  (cost=10939.69..11414.84 rows=337 width=29) (actual time=25.624..33.650 rows=448 loops=1)
               Merge Cond: (d.rid = a.rid)
               ->  Sort  (cost=1417.78..1454.02 rows=14496 width=8) (actual time=4.208..4.976 rows=14074 loops=1)
                     Sort Key: d.rid
                     Sort Method: quicksort  Memory: 1050kB
                      ->  Index Only Scan using report_drugs_drug_rid_idx on report_drugs d  (cost=0.44..415.86 rows=14496 width=8) (actual time=0.648..2.236 rows=14200 loops=1)
                            Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
                            Heap Fetches: 0
                ->  Sort  (cost=9521.91..9721.56 rows=79860 width=21) (actual time=21.410..24.174 rows=76974 loops=1)
                      Sort Key: a.rid
                      Sort Method: quicksort  Memory: 6682kB
                      ->  Index Only Scan using report_adverses_adverse_rid_idx on report_adverses a  (cost=0.56..3019.24 rows=79860 width=21) (actual time=0.916..10.689 rows=76972 loops=1)
                            Index Cond: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
                            Heap Fetches: 0
          ->  Index Only Scan using reports_id_age_gender_created_idx on reports r  (cost=0.56..2.01 rows=1 width=20) (actual time=0.190..0.191 rows=1 loops=448)
               Index Cond: (id = a.rid)
               Heap Fetches: 0
 Planning time: 15.466 ms
 Execution time: 119.871 ms

This is amazing, and the slowdown was purely down to missing indexes on the sweet spot.
You can now clearly see that the two new indexes (report_drugs_drug_rid_idx and
report_adverses_adverse_rid_idx) are used and that the nested loop is much faster.

> Do these tables have a primary key and can that be used here?

Only the "reports" table has a PK, the other two don't (only a foreign key to "reports").

> I hope I'm not being superfluous here, but don't forget to vacuum analyze after creating those indexes. Eventually
> autovacuum will do that for you, but you could otherwise be running tests to verify the impact of adding those indexes
> before autovacuum gets around to it.

I'm becoming the king of manual VACUUM-ing the past few days ... no need to point that out ;)

> Finally, perhaps it's more efficient to weed out all unrelated drugs and adverses before relating them to reports,
> something like:
>
> SELECT r.id, r.age, r.gender, r.created, x.adverse, x.drug
> FROM reports r
> JOIN (
>     SELECT d.rid, a.adverse, d.drug
>     FROM report_drugs d
>     JOIN report_adverses a ON a.rid = d.rid
>     WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
>     AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
> ) x ON x.rid = r.id
> ORDER BY r.created;

With the above indexes created and the disk cache flushed, this query ran at the exact same speed. The plan:

 Sort  (cost=12108.14..12108.99 rows=337 width=41) (actual time=119.102..119.111 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=10940.25..12094.00 rows=337 width=41) (actual time=26.797..118.969 rows=448 loops=1)
         ->  Merge Join  (cost=10939.69..11414.84 rows=337 width=29) (actual time=25.899..33.203 rows=448 loops=1)
               Merge Cond: (d.rid = a.rid)
               ->  Sort  (cost=1417.78..1454.02 rows=14496 width=8) (actual time=4.319..4.981 rows=14074 loops=1)
                     Sort Key: d.rid
                     Sort Method: quicksort  Memory: 1050kB
                      ->  Index Only Scan using report_drugs_drug_rid_idx on report_drugs d  (cost=0.44..415.86 rows=14496 width=8) (actual time=0.748..2.369 rows=14200 loops=1)
                            Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
                            Heap Fetches: 0
                ->  Sort  (cost=9521.91..9721.56 rows=79860 width=21) (actual time=21.573..23.940 rows=76974 loops=1)
                      Sort Key: a.rid
                      Sort Method: quicksort  Memory: 6682kB
                      ->  Index Only Scan using report_adverses_adverse_rid_idx on report_adverses a  (cost=0.56..3019.24 rows=79860 width=21) (actual time=0.922..10.637 rows=76972 loops=1)
                            Index Cond: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
                            Heap Fetches: 0
          ->  Index Only Scan using reports_id_age_gender_created_idx on reports r  (cost=0.56..2.01 rows=1 width=20) (actual time=0.190..0.191 rows=1 loops=448)
               Index Cond: (id = a.rid)
               Heap Fetches: 0
 Planning time: 15.238 ms
 Execution time: 119.431 ms

So your hunch was right, not much difference there.

But man ... this query is now flying ... Hat's off to you, sir!

> Alban Hertroys

Cheers,
Tim


Re: Slow join over three tables

From
Karl Czajkowski
Date:
On Apr 27, Tim van der Linden modulated:
...
> I'm joining three fairly large tables together, and it is slow. The tables are:
>
> - "reports": 6 million rows
> - "report_drugs": 20 million rows
> - "report_adverses": 20 million rows
>
...
> All tables have indexes on the "id"/"rid" columns and on the "drug"/"adverse" columns.
>
> The query:
>
> SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
> FROM reports r
> JOIN report_drugs d ON d.rid = r.id
> JOIN report_adverses a ON a.rid = r.id
> WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
> AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;
>

I would suggest a few experiments to see how you can modify the plans
available to the optimizer:

1. CREATE INDEX ON report_drugs (drug, rid)
2. CREATE INDEX ON report_adverses (adverse, rid)
3. CREATE INDEX ON reports (id, created)

Re-run EXPLAIN ANALYZE of your query after each of these steps to see
how each one affects planning.  You might also try two variants of the
query at each step, with and without the ORDER BY.

Note, the index column order in (1) and (2) above is very important.
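
A sketch of why, using (1) as the example: the drug = ANY (...) condition is
an equality list on the index's leading column, so the scan can jump straight
to the matching entries and return rid without touching the heap; with the
column order reversed, the drug condition could not seed the scan at all.

CREATE INDEX ON report_drugs (drug, rid);
-- the rid values now come straight out of the index:
EXPLAIN SELECT rid FROM report_drugs
WHERE drug = ANY (ARRAY[359, 360, 361, 362, 363]);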


Karl



Re: Slow join over three tables

From
Tim van der Linden
Date:
On Wed, 27 Apr 2016 13:48:06 +0200
Alban Hertroys <haramrae@gmail.com> wrote:

> In this case, you're using the values in adverse to filter relevant rid's for the FK join, so you might be better off
> with the inverse of the above index:
>     create index on report_adverses (adverse, rid);
>     create index on report_drugs (drug, rid);

Hmmmm, like I reported yesterday this achieved a huge performance boost.

However, there are two (new) things I'd like to touch on while further experimenting with this query:

1. Altering drug IDs or adverse names affects the execution time negatively.

In this example, altering the drug IDs I search for makes the query 6 times slower again:

SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM reports r
JOIN report_drugs d ON d.rid = r.id
JOIN report_adverses a ON a.rid = r.id
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
AND d.drug = ANY (ARRAY[9557, 17848, 17880, 18223]) ORDER BY r.created;

Different drug ID ([9557, 17848, 17880, 18223]), but the rest is the same. Query plan:

 Sort  (cost=31409.71..31417.48 rows=3107 width=41) (actual time=662.707..662.819 rows=4076 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 415kB
   ->  Nested Loop  (cost=25693.17..31229.48 rows=3107 width=41) (actual time=71.748..661.743 rows=4076 loops=1)
         ->  Merge Join  (cost=25692.61..26795.53 rows=3107 width=29) (actual time=70.841..97.094 rows=4076 loops=1)
               Merge Cond: (d.rid = a.rid)
               ->  Sort  (cost=16332.08..16671.61 rows=135810 width=8) (actual time=48.946..58.623 rows=135370 loops=1)
                     Sort Key: d.rid
                     Sort Method: quicksort  Memory: 12498kB
                      ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..4753.44 rows=135810 width=8) (actual time=0.681..28.441 rows=135532 loops=1)
                            Index Cond: (drug = ANY ('{9557,17848,17880,18223}'::integer[]))
                ->  Sort  (cost=9360.53..9556.94 rows=78565 width=21) (actual time=21.880..25.969 rows=77163 loops=1)
                      Sort Key: a.rid
                      Sort Method: quicksort  Memory: 6682kB
                      ->  Index Only Scan using report_adverses_adverse_rid_idx on report_adverses a  (cost=0.56..2972.57 rows=78565 width=21) (actual time=0.983..10.744 rows=76972 loops=1)
                            Index Cond: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
                            Heap Fetches: 0
          ->  Index Only Scan using reports_id_age_gender_created_idx on reports r  (cost=0.56..1.42 rows=1 width=20) (actual time=0.137..0.138 rows=1 loops=4076)
               Index Cond: (id = d.rid)
               Heap Fetches: 0
 Planning time: 16.438 ms
 Execution time: 663.583 ms

A total of 660 ms instead of the previous 120 ms. The number of rows returned is 4076, instead of the 448 rows of the
previous query. Could this be the cause of the slowdown? Or is it the fact that the integers in the ANY clause are
further apart from one another and thus make for longer index searches?

The same happens with changing the adverses, some adverses come back in ~120 ms, others can take up to one second.

2. Adding extra columns ignores indexes

This one was kind of expected. Adding extra columns from both "report_drugs" and "report_adverses" slows the lot down
again. Extra columns added:

- a.recovery
- a.severity
- d.reason
- d.effectiveness
- d.duration

As expected, adding these columns makes the previous multi-column indexes useless, and they were indeed
not used anymore.

So my initial thought was to create new indexes on the columns I now query:

- CREATE INDEX ON report_adverses(adverse, rid, severity, recovery);
- CREATE INDEX ON report_drugs(drug, rid, reason, effectiveness, duration);

After running the query again, the new index on "report_adverses" got picked up, but the index on "report_drugs" did
not:

 Sort  (cost=12365.79..12366.61 rows=329 width=76) (actual time=129.106..129.120 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 66kB
   ->  Nested Loop  (cost=11212.93..12352.04 rows=329 width=76) (actual time=31.558..128.951 rows=448 loops=1)
         Join Filter: (d.rid = r.id)
         ->  Merge Join  (cost=11212.38..11680.44 rows=329 width=64) (actual time=30.705..39.527 rows=448 loops=1)
               Merge Cond: (d.rid = a.rid)
               ->  Sort  (cost=1503.85..1539.82 rows=14390 width=35) (actual time=6.977..7.993 rows=14074 loops=1)
                     Sort Key: d.rid
                     Sort Method: quicksort  Memory: 1202kB
                      ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..510.02 rows=14390 width=35) (actual time=0.567..4.638 rows=14200 loops=1)
                            Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
                ->  Sort  (cost=9708.53..9904.94 rows=78565 width=29) (actual time=23.717..26.540 rows=76974 loops=1)
                      Sort Key: a.rid
                      Sort Method: quicksort  Memory: 6702kB
                      ->  Index Only Scan using report_adverses_adverse_rid_severity_recovery_idx on report_adverses a  (cost=0.56..3320.57 rows=78565 width=29) (actual time=0.878..12.297 rows=76972 loops=1)
                            Index Cond: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
                            Heap Fetches: 0
          ->  Index Only Scan using reports_id_age_gender_created_idx on reports r  (cost=0.56..2.03 rows=1 width=20) (actual time=0.198..0.199 rows=1 loops=448)
               Index Cond: (id = a.rid)
               Heap Fetches: 0
 Planning time: 18.310 ms
 Execution time: 129.483 ms

Still fast enough ... but I was wondering why the planner would not use the new index and instead fell back on the
"report_drugs_drug_idx" single-column index.


> Alban Hertroys
> --
> If you can't see the forest for the trees,
> cut the trees and you'll find there is no forest.

Cheers,
Tim


Re: Slow join over three tables

From
Alban Hertroys
Date:
On 28 April 2016 at 08:36, Tim van der Linden <tim@shisaa.jp> wrote:
> On Wed, 27 Apr 2016 13:48:06 +0200
> Alban Hertroys <haramrae@gmail.com> wrote:
>
>> In this case, you're using the values in adverse to filter relevant rid's for the FK join, so you might be better
>> off with the inverse of the above index:
>>       create index on report_adverses (adverse, rid);
>>       create index on report_drugs (drug, rid);
>
> Hmmmm, like I reported yesterday this achieved a huge performance boost.
>
> However, two (new) things I like to touch on while further experimenting with this query:
>
> 1. Altering drug IDs or adverse names affects the execution time negatively.
>
> In this example altering the drug IDs I search for makes the query 6 times slower again:
>
> SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
> FROM reports r
> JOIN report_drugs d ON d.rid = r.id
> JOIN report_adverses a ON a.rid = r.id
> WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
> AND d.drug = ANY (ARRAY[9557, 17848, 17880, 18223]) ORDER BY r.created;
>
> Different drug ID ([9557, 17848, 17880, 18223]), but the rest is the same. Query plan:
>
>  Sort  (cost=31409.71..31417.48 rows=3107 width=41) (actual time=662.707..662.819 rows=4076 loops=1)
>    Sort Key: r.created
>    Sort Method: quicksort  Memory: 415kB
>    ->  Nested Loop  (cost=25693.17..31229.48 rows=3107 width=41) (actual time=71.748..661.743 rows=4076 loops=1)
>          ->  Merge Join  (cost=25692.61..26795.53 rows=3107 width=29) (actual time=70.841..97.094 rows=4076 loops=1)
>                Merge Cond: (d.rid = a.rid)
>                ->  Sort  (cost=16332.08..16671.61 rows=135810 width=8) (actual time=48.946..58.623 rows=135370 loops=1)
>                      Sort Key: d.rid
>                      Sort Method: quicksort  Memory: 12498kB
>                      ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..4753.44 rows=135810 width=8) (actual time=0.681..28.441 rows=135532 loops=1)
>                            Index Cond: (drug = ANY ('{9557,17848,17880,18223}'::integer[]))


You're doing ~9 times as many index lookups. A slowdown of 6x of this
part of the query seems rather reasonable.

>  Planning time: 16.438 ms
>  Execution time: 663.583 ms
>
> A total of 660 ms instead of the previous 120 ms. The number of rows returned is 4076, instead of the 448 rows of
> the previous query. Could this be the cause of the slowdown? Or is it the fact that the integers in the ANY clause are
> further apart from one another and thus make for longer index searches?

Technically probably both, but the index node distance (assuming a
btree index) should be insignificant.

The sorting of found candidate rows on rid probably contributes the
most. I forgot how quicksort scales with the number of items to sort,
probably something like n log(n) or so.

That said, you didn't happen to perform these tests (assuming >1)
during a busier period of the database server? I ask because later on
you seem to be getting fast results again after some more index
changes and I don't see cause for the difference.

> The same happens with changing the adverses, some adverses come back in ~120 ms, others can take up to one second.

Likely the same issue

> 2. Adding extra columns ignores indexes
>
> This one was kind of expected. Adding extra columns from both "report_drugs" and "report_adverses" slows the lot down
> again. Extra columns added:
>
> - a.recovery
> - a.severity
> - d.reason
> - d.effectiveness
> - d.duration

I have to admit that I'm a bit behind with the current state of the
art of PostgreSQL, but last time I checked, the database needed to
look at each row in the result-set for transaction visibility
information. I recall that there was (at some point) much discussion
whether that visibility information could be added to indexes and that
there were strong arguments against doing so. Hence, I doubt that
those new index-only scans skip that step.

Unless I'm wrong there, adding non-queried fields to the index is only
going to affect your performance adversely. Relevant fields for
indices are those used in joins, those regularly used in conditions
(where-clauses) and those that are regularly sorted over.

Other fields are available from the candidate rows of the result set.
Putting them in the index just results in storing redundant
information.

That's how it always has been and I'm pretty sure that's still the case.

> As expected, adding these columns makes the previous multi-column indexes useless, and they were indeed
> not used anymore.

What adding those fields to the index did was drive up the cost of
using that index. It requires more memory to cache, for example, and
this will be evicted from caches sooner than a smaller index.

I suspect the query planner also noticed that and chose the
smaller index, as that index's usage costs are now lower than those of
your (unnecessarily) large index.

> Still fast enough ... but I was wondering why the planner would not use the new index and instead fell back on the
> "report_drugs_drug_idx" single-column index.

I'm actually a bit suspicious of those numbers; those are different
drug id's than those from the first explain too.

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.


Re: Slow join over three tables

From
Tim van der Linden
Date:
On Thu, 28 Apr 2016 16:38:53 +0200
Alban Hertroys <haramrae@gmail.com> wrote:

Hi Alban

First off ... thank you for your time on this so far and for giving me the educational smacks on the head :) I appreciate
this a lot.

> You're doing ~9 times as many index lookups. A slowdown of 6x of this
> part of the query seems rather reasonable.

Again, very true. But why 9 times as much?

> >  Planning time: 16.438 ms
> >  Execution time: 663.583 ms
> >
> > A total of 660 ms instead of the previous 120 ms. The number of rows returned is 4076, instead of the 448 rows of
> > the previous query. Could this be the cause of the slowdown? Or is it the fact that the integers in the ANY clause
> > are further apart from one another and thus make for longer index searches?
>
> Technically probably both, but the index node distance (assuming a
> btree index) should be insignificant.

It is indeed a b-tree index.

> The sorting of found candidate rows on rid probably contributes the
> most. I forgot how quicksort scales to the number of items to sort,
> probably something like 2log(n) or something like that.

Looked this up and you are right, a quicksort (in C) is of n log(n) complexity ... and if I'm right this is "rather"
linear. Less sorting is thus the message :)

> That said, you didn't happen to perform these tests (assuming >1)
> during a busier period of the database server? I ask because later on
> you seem to be getting fast results again after some more index
> changes and I don't see cause for the difference.

No, these test are on a local development machine with nothing else going on but the database cluster.

> > 2. Adding extra columns ignores indexes
> > ...
> > - a.recovery
> > - a.severity
> > - d.reason
> > - d.effectiveness
> > - d.duration
>
> I have to admit that I'm a bit behind with the current state of the
> art of PostgreSQL, but last time I checked, the database needed to
> look at each row in the result-set for transaction visibility
> information. I recall that there was (at some point) much discussion
> whether that visibility information could be added to indexes and that
> there were strong arguments against doing so. Hence, I doubt that
> those new index-only scans skip that step.

I see.

> Unless I'm wrong there, adding non-queried fields to the index is only
> going to affect your performance adversely. Relevant fields for
> indices are those used in joins, those regularly used in conditions
> (where-clauses) and those that are regularly sorted over.
>
> Other fields are available from the candidate rows of the result set.
> Putting them in the index just results in storing redundant
> information.

Storing redundant information and making for bloated indexes which the planner might choose to skip, if I understand
correctly. Good to know.

> > ...
> > Still fast enough ... but I was wondering why the planner would not use the new index and instead fell back on the
> > "report_drugs_drug_idx" single-column index.
>
> I'm actually a bit suspicious of those numbers; those are different
> drug id's than those from the first explain too.

You are completely right, this was the wrong plan ... the correct plan is:

 Sort  (cost=31757.71..31765.48 rows=3107 width=76) (actual time=722.348..722.461 rows=4076 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 495kB
   ->  Nested Loop  (cost=26041.17..31577.48 rows=3107 width=76) (actual time=139.407..721.090 rows=4076 loops=1)
         ->  Merge Join  (cost=26040.61..27143.53 rows=3107 width=64) (actual time=139.396..170.446 rows=4076 loops=1)
               Merge Cond: (d.rid = a.rid)
                ->  Sort  (cost=16332.08..16671.61 rows=135810 width=35) (actual time=108.866..119.143 rows=135370 loops=1)
                      Sort Key: d.rid
                      Sort Method: quicksort  Memory: 13992kB
                      ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..4753.44 rows=135810 width=35) (actual time=0.038..53.758 rows=135532 loops=1)
                            Index Cond: (drug = ANY ('{9557,17848,17880,18223}'::integer[]))
                ->  Sort  (cost=9708.53..9904.94 rows=78565 width=29) (actual time=30.517..34.876 rows=77163 loops=1)
                      Sort Key: a.rid
                      Sort Method: quicksort  Memory: 6702kB
                      ->  Index Only Scan using report_adverses_adverse_rid_severity_recovery_idx on report_adverses a  (cost=0.56..3320.57 rows=78565 width=29) (actual time=1.005..16.135 rows=76972 loops=1)
                            Index Cond: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
                            Heap Fetches: 0
          ->  Index Only Scan using reports_id_age_gender_created_idx on reports r  (cost=0.56..1.42 rows=1 width=20) (actual time=0.134..0.135 rows=1 loops=4076)
               Index Cond: (id = d.rid)
               Heap Fetches: 0
 Planning time: 29.415 ms
 Execution time: 723.545 ms

And this is now indeed much closer to the ~660 ms from before; it doesn't make much of a difference after all.

Cheers,
Tim


Re: Slow join over three tables

From
MOLINA BRAVO FELIPE DE JESUS
Date:
Hi!!

what happens if you change your query to:

SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
  FROM report_drugs d
  INNER JOIN report_adverses a USING (rid)
  INNER JOIN reports r ON r.id = d.rid
 WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
   AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
 ORDER BY r.created;



I have seen differences in timing between ON and USING ... I prefer USING when it's possible (here "reports" has "id" rather than "rid", so the join to reports still needs ON).



> >
> > All tables have indexes on the "id"/"rid" columns and on the "drug"/"adverse" columns.
> >
> > The query:
> >
> > SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
> > FROM reports r
> > JOIN report_drugs d ON d.rid = r.id
> > JOIN report_adverses a ON a.rid = r.id 
> > WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain']) 
> > AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;
> >
> I would suggest a few experiments to see how you can modify the plans
> available to the optimizer:
>
> 1. CREATE INDEX ON report_drugs (drug, rid)
> 2. CREATE INDEX ON report_adverses (adverse, rid)
> 3. CREATE INDEX ON report (id, created)
>
> Re-run EXPLAIN ANALYZE of your query after each of these steps to see
> how each one affects planning.  You might also try two variants of the
> query at each step, with and without the ORDER BY.
>
> Note, the index column order in (1) and (2) above is very important.
>
>
> Karl
>
>
>