Thread: Join Bad Performance on different data types

Join Bad Performance on different data types

From
Adarsh Sharma
Date:
Hi,

Today I needed to change the data type of a column in one of my tables from bigint to bigint[] due to
application requirements. But one of my queries hangs after this change:

select DISTINCT glt.id || ':' || gtt.name as id_type, glt.name, latitude, longitude, radius, latitude || ',' || longitude as latlon from graph1 glt, graph2 gcr, graph3 gtd, graph5 td, graph6 gtt where gtt.id=td.entity_type_id AND glt.id=ANY(gtd.id_list) and gtd.id = gcr.t_ids AND gcr.id=ANY(td.detail_id);

Expected time (EXPLAIN ANALYZE of the query before the change): Total runtime: 19460.809 ms

select DISTINCT glt.id || ':' || gtt.name as id_type, glt.name, latitude, longitude, radius, latitude || ',' || longitude as latlon from graph1 glt, graph2 gcr, graph3 gtd, graph5 td, graph6 gtt where gtt.id=td.entity_type_id AND glt.id=ANY(gtd.id_list) and gtd.id = ANY(gcr.t_ids) AND gcr.id=ANY(td.detail_id);

The new query has been running for the last hour without finishing. The tables are fairly small, around 5k rows each.

I tried creating simple (btree) and GIN indexes on the column (t_ids), but it is still not helping. Has anyone any ideas, or has anyone faced this before? The PostgreSQL version is 9.2.
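For reference, a minimal sketch of the kind of indexes in question (names are illustrative only, not the exact DDL used here):

        -- A btree index on the scalar side is what can serve a
        -- "scalar_col = ANY(array_expr)" condition in a join;
        -- the primary keys usually already provide that.
        CREATE INDEX graph3_id_idx ON graph3 (id);

        -- A GIN index on the array column itself only helps when the
        -- predicate uses array operators, e.g. t_ids @> ARRAY[42]::bigint[].
        CREATE INDEX graph2_t_ids_gin ON graph2 USING gin (t_ids);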

Thanks



Re: Join Bad Performance on different data types

From
Sameer Kumar
Date:

On Tue, Mar 4, 2014 at 2:57 PM, Adarsh Sharma <eddy.adarsh@gmail.com> wrote:
I tried creating simple (btree) and GIN indexes on the column (t_ids), but it is still not helping. Has anyone any ideas, or has anyone faced this before? The PostgreSQL version is 9.2.

Have you run VACUUM ANALYZE or ANALYZE after this step?

You might have to disable sequential scans:

set enable_seqscan=off;

and then run the query.
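A minimal sketch of that sequence in a psql session (table names are the ones from the query above; the setting change affects the current session only):

        -- Refresh planner statistics after the column type change.
        VACUUM ANALYZE graph2;
        VACUUM ANALYZE graph3;
        VACUUM ANALYZE graph5;

        -- Disable sequential scans for this session, then re-run
        -- EXPLAIN ANALYZE on the slow query to compare plans.
        SET enable_seqscan = off;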


Best Regards,

Sameer Kumar | Database Consultant

ASHNIK PTE. LTD.

101 Cecil Street, #11-11 Tong Eng Building, Singapore 069533

M: +65 8110 0350  T: +65 6438 3504 | www.ashnik.com


Re: Join Bad Performance on different data types

From
Pavel Stehule
Date:
Hello

PostgreSQL doesn't use an index when the types on the left and right side of an operator are not equal.

You have probably lost the use of some index; you can see the difference in the EXPLAIN SELECT ... output.
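A quick way to check (purely illustrative predicates, using the column names from the query above):

        -- Equality against the scalar primary key can use its btree index:
        EXPLAIN SELECT * FROM graph3 WHERE id = 42;

        -- With t_ids now being bigint[], only array operators such as
        -- containment are candidates for a GIN index; a plain "= bigint"
        -- comparison no longer applies to that column at all:
        EXPLAIN SELECT * FROM graph2 WHERE t_ids @> ARRAY[42]::bigint[];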

Regards

Pavel


2014-03-04 7:57 GMT+01:00 Adarsh Sharma <eddy.adarsh@gmail.com>:
Hi,

Today I needed to change the data type of a column in one of my tables from bigint to bigint[] due to
application requirements. But one of my queries hangs after this change:

select DISTINCT glt.id || ':' || gtt.name as id_type, glt.name, latitude, longitude, radius, latitude || ',' || longitude as latlon from graph1 glt, graph2 gcr, graph3 gtd, graph5 td, graph6 gtt where gtt.id=td.entity_type_id AND glt.id=ANY(gtd.id_list) and gtd.id = gcr.t_ids AND gcr.id=ANY(td.detail_id);

Expected time (EXPLAIN ANALYZE of the query before the change): Total runtime: 19460.809 ms

select DISTINCT glt.id || ':' || gtt.name as id_type, glt.name, latitude, longitude, radius, latitude || ',' || longitude as latlon from graph1 glt, graph2 gcr, graph3 gtd, graph5 td, graph6 gtt where gtt.id=td.entity_type_id AND glt.id=ANY(gtd.id_list) and gtd.id = ANY(gcr.t_ids) AND gcr.id=ANY(td.detail_id);

The new query has been running for the last hour without finishing. The tables are fairly small, around 5k rows each.

I tried creating simple (btree) and GIN indexes on the column (t_ids), but it is still not helping. Has anyone any ideas, or has anyone faced this before? The PostgreSQL version is 9.2.

Thanks




Re: Join Bad Performance on different data types

From
Adarsh Sharma
Date:



On Tue, Mar 4, 2014 at 1:13 PM, Sameer Kumar <sameer.kumar@ashnik.com> wrote:

On Tue, Mar 4, 2014 at 2:57 PM, Adarsh Sharma <eddy.adarsh@gmail.com> wrote:
I tried creating simple (btree) and GIN indexes on the column (t_ids), but it is still not helping. Has anyone any ideas, or has anyone faced this before? The PostgreSQL version is 9.2.

Have you run VACUUM ANALYZE or ANALYZE after this step?

You might have to disable sequential scans:

set enable_seqscan=off;

and then run the query.


Thanks Sameer. Yes, I had already run VACUUM ANALYZE, but this time I tried enable_seqscan=off and EXPLAIN ANALYZE finished in 34 seconds.


PLAN                                                                                     
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=10651634346.70..10651780073.12 rows=4163612 width=64) (actual time=34375.675..34764.705 rows=751392 loops=1)
   ->  Nested Loop  (cost=10000000000.03..10646590270.49 rows=336271747 width=64) (actual time=0.217..24988.534 rows=6541944 loops=1)
         ->  Nested Loop  (cost=10000000000.02..10012318364.23 rows=33628639 width=116) (actual time=0.177..3427.380 rows=1431 loops=1)
               ->  Nested Loop  (cost=10000000000.01..10001045237.36 rows=3368723 width=38) (actual time=0.138..3373.767 rows=1431 loops=1)
                     ->  Nested Loop  (cost=10000000000.00..10000097742.23 rows=340181 width=38) (actual time=0.047..2151.183 rows=418145 loops=1)
                           Join Filter: (td.entity_type_id = gtt.id)
                           Rows Removed by Join Filter: 1269335
                           ->  Seq Scan on graph5 td  (cost=10000000000.00..10000077008.13 rows=345413 width=33) (actual time=0.020..1231.823 rows=421870 loops=1)
                           ->  Materialize  (cost=0.00..9.33 rows=4 width=13) (actual time=0.000..0.001 rows=4 loops=421870)
                                 ->  Index Scan using geo_type_pkey on graph6 gtt  (cost=0.00..9.31 rows=4 width=13) (actual time=0.009..0.012 rows=4 loops=1)
                     ->  Index Scan using graph2_pkey on graph2 gcr  (cost=0.01..2.69 rows=10 width=33) (actual time=0.002..0.002 rows=0 loops=418145)
                           Index Cond: (id = ANY (td.graph3_id))
               ->  Index Scan using graph3_pkey on graph3 gtd  (cost=0.01..3.25 rows=10 width=115) (actual time=0.035..0.036 rows=1 loops=1431)
                     Index Cond: (id = ANY (gcr.t_ids))
         ->  Index Scan using graph1_pkey on graph1 glt  (cost=0.01..18.51 rows=10 width=55) (actual time=0.085..9.082 rows=4572 loops=1431)
               Index Cond: (id = ANY (gtd.lat_long_id_list))
 Total runtime: 34810.040 ms


Is there any way I can rewrite the query so that I don't need to set enable_seqscan=off? I don't want to embed one more line in the application code, and I also don't want to change the global setting in postgresql.conf to disable sequential scans.


Thanks



 

Re: Join Bad Performance on different data types

From
Sameer Kumar
Date:

On Tue, Mar 4, 2014 at 4:19 PM, Adarsh Sharma <eddy.adarsh@gmail.com> wrote:
Is there any way I can rewrite the query so that I don't need to set enable_seqscan=off? I don't want to embed one more line in the application code, and I also don't want to change the global setting in postgresql.conf to disable sequential scans.

You can use a specific user (role) for this query and set the parameter's value at the user level.
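A minimal sketch of that, assuming a dedicated role (the role name is made up for illustration):

        -- Applies to new sessions of this role only; the global
        -- enable_seqscan default stays untouched.
        ALTER ROLE reporting_user SET enable_seqscan = off;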

But if you don't have any way of doing that, then you pretty much have to "embed one more line in application code" (though I have never understood why application teams do not want to do that). It should be fine, unless you are developing a product that can also be used with some other RDBMS and you therefore don't want to introduce a PostgreSQL-specific line in the code.
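If the concern is the setting leaking into the rest of the session, a transaction-scoped variant is also possible (sketch only):

        BEGIN;
        -- SET LOCAL reverts automatically at COMMIT/ROLLBACK,
        -- so nothing outside this transaction is affected.
        SET LOCAL enable_seqscan = off;
        -- ... run the slow DISTINCT join query here ...
        COMMIT;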



Best Regards,

Sameer Kumar | Database Consultant

ASHNIK PTE. LTD.

101 Cecil Street, #11-11 Tong Eng Building, Singapore 069533

M: +65 8110 0350  T: +65 6438 3504 | www.ashnik.com


Re: Join Bad Performance on different data types

From
Victor Yegorov
Date:
2014-03-04 10:19 GMT+02:00 Adarsh Sharma <eddy.adarsh@gmail.com>:
PLAN                                                                                     
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=10651634346.70..10651780073.12 rows=4163612 width=64) (actual time=34375.675..34764.705 rows=751392 loops=1)
   ->  Nested Loop  (cost=10000000000.03..10646590270.49 rows=336271747 width=64) (actual time=0.217..24988.534 rows=6541944 loops=1)
         ->  Nested Loop  (cost=10000000000.02..10012318364.23 rows=33628639 width=116) (actual time=0.177..3427.380 rows=1431 loops=1)
               ->  Nested Loop  (cost=10000000000.01..10001045237.36 rows=3368723 width=38) (actual time=0.138..3373.767 rows=1431 loops=1)
                     ->  Nested Loop  (cost=10000000000.00..10000097742.23 rows=340181 width=38) (actual time=0.047..2151.183 rows=418145 loops=1)
                           Join Filter: (td.entity_type_id = gtt.id)
                           Rows Removed by Join Filter: 1269335
                           ->  Seq Scan on graph5 td  (cost=10000000000.00..10000077008.13 rows=345413 width=33) (actual time=0.020..1231.823 rows=421870 loops=1)
                           ->  Materialize  (cost=0.00..9.33 rows=4 width=13) (actual time=0.000..0.001 rows=4 loops=421870)
                                 ->  Index Scan using geo_type_pkey on graph6 gtt  (cost=0.00..9.31 rows=4 width=13) (actual time=0.009..0.012 rows=4 loops=1)
                     ->  Index Scan using graph2_pkey on graph2 gcr  (cost=0.01..2.69 rows=10 width=33) (actual time=0.002..0.002 rows=0 loops=418145)
                           Index Cond: (id = ANY (td.graph3_id))
               ->  Index Scan using graph3_pkey on graph3 gtd  (cost=0.01..3.25 rows=10 width=115) (actual time=0.035..0.036 rows=1 loops=1431)
                     Index Cond: (id = ANY (gcr.t_ids))
         ->  Index Scan using graph1_pkey on graph1 glt  (cost=0.01..18.51 rows=10 width=55) (actual time=0.085..9.082 rows=4572 loops=1431)
               Index Cond: (id = ANY (gtd.lat_long_id_list))
 Total runtime: 34810.040 ms


Is there any way I can rewrite the query so that I don't need to set enable_seqscan=off? I don't want to embed one more line in the application code, and I also don't want to change the global setting in postgresql.conf to disable sequential scans.

Could you kindly also post the `EXPLAIN` output of your original and modified queries, leaving all `enable_...` settings at their defaults?
Just EXPLAIN, without `ANALYZE`.


--
Victor Y. Yegorov

Re: Join Bad Performance on different data types

From
Victor Yegorov
Date:
2014-03-05 6:53 GMT+02:00 Adarsh Sharma <eddy.adarsh@gmail.com>:
Anyhow working on finding the root cause.

I would do the following:

1. Check your `default_statistics_target`, like:

        SHOW default_statistics_target;

2. Your previously posted `EXPLAIN ANALYZE` with `set enable_seqscan=off` shows
    a significant skew in the estimated number of rows. I would bump
    `default_statistics_target` up quite a bit (modify your `postgresql.conf` and reload),
    up to 500, then re-analyze all tables and check the performance again
    (a minimal reload-and-re-analyze sketch follows after this list).

    If you don't want to do it for the whole cluster, then increase it for individual columns.
    I think increasing it up to 500 (or higher) for the columns containing arrays will be
    beneficial in any case:

        ALTER TABLE graph5 ALTER detail_id SET STATISTICS 1000;
        ALTER TABLE graph3 ALTER id_list  SET STATISTICS 1000;
        ALTER TABLE graph2 ALTER targeting_ids SET STATISTICS 1000;

    Of course, you need to `ANALYZE` (or `VACUUM ANALYZE`) the tables after this change.

3. Next, I've reformatted your query the following way (easier for me):

        SELECT DISTINCT
               glt.id || ':' || gtt.name as id_type,
               glt.name,latitude,longitude,radius,
               latitude || ',' || longitude as latlon
          FROM graph5 td
          JOIN graph6 gtt ON gtt.id=td.entity_type_id
          JOIN graph2 gcr ON gcr.id=ANY(td.detail_id)
          JOIN graph3 gtd ON gtd.id=ANY(gcr.t_ids)
          JOIN graph1 glt ON glt.id=ANY(gtd.id_list);

    I've ordered the joins (well, tables) in the way they depend on each other.
    I would try moving the first join into a CTE in order to isolate its results
    and help the optimizer better estimate the rows. Something like this:

        WITH first AS (
            SELECT td.*,gtt.*
              FROM graph5 td
              JOIN graph6 gtt ON gtt.id=td.entity_type_id
        )
        SELECT DISTINCT
               glt.id || ':' || first.name as id_type,
               glt.name,latitude,longitude,radius,
               latitude || ',' || longitude as latlon
          FROM first
          JOIN graph2 gcr ON gcr.id=ANY(first.detail_id)
          JOIN graph3 gtd ON gtd.id=ANY(gcr.t_ids)
          JOIN graph1 glt ON glt.id=ANY(gtd.id_list)

4. Try disabling materialization, like `set enable_material=off`.
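As mentioned in step 2, a minimal sketch of the reload-and-re-analyze sequence (assumes superuser access; on 9.2 the value has to be edited in `postgresql.conf` by hand, since `ALTER SYSTEM` only appears in 9.4):

        -- After raising default_statistics_target (e.g. to 500) in postgresql.conf:
        SELECT pg_reload_conf();
        SHOW default_statistics_target;   -- confirm the new value is in effect

        -- Rebuild statistics for the tables involved in the join.
        ANALYZE graph5;
        ANALYZE graph3;
        ANALYZE graph2;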


--
Victor Y. Yegorov