Thread: improving performance for a delete

improving performance for a delete

From

kevin kempter

Date:

20 May 2008, 16:58:28

Hi all;

I have 2 tables where I basically want to delete from the first table
(seg_id_tmp7) any rows where the entire row already exists in the
second table (sl_cd_segment_dim)

I have a query that looks like this (and it's slow):


delete from seg_id_tmp7
where
    customer_srcid::text ||
    show_srcid::text ||
    show_name::text ||
    season_srcid::text ||
    season_name::text ||
    episode_srcid::text ||
    episode_name::text ||
    segment_type_id::text ||
    segment_type::text ||
    segment_srcid::text ||
    segment_name::text
in
    ( select
        customer_srcid::text ||
        show_srcid::text ||
        show_name::text ||
        season_srcid::text ||
        season_name::text ||
        episode_srcid::text ||
        episode_name::text ||
        segment_type_id::text ||
        segment_type::text ||
        segment_srcid::text ||
        segment_name::text
        from sl_cd_location_dim )
;





Here's the query plan for it:

                                     QUERY PLAN
-----------------------------------------------------------------------------------
  Seq Scan on seg_id_tmp7  (cost=0.00..138870701.56 rows=2136 width=6)
    Filter: (subplan)
    SubPlan
      ->  Seq Scan on sl_cd_location_dim  (cost=0.00..63931.60
rows=433040 width=8)
(4 rows)








I also tried this:

delete from seg_id_tmp7
where
    ( customer_srcid ,
    show_srcid ,
    show_name ,
    season_srcid ,
    season_name ,
    episode_srcid ,
    episode_name ,
    segment_type_id ,
    segment_type ,
    segment_srcid ,
    segment_name )
in
    ( select
        customer_srcid ,
        show_srcid ,
        show_name ,
        season_srcid ,
        season_name ,
        episode_srcid ,
        episode_name ,
        segment_type_id ,
        segment_type ,
        segment_srcid ,
        segment_name
        from sl_cd_location_dim )
;


and I get this query plan:

                                     QUERY PLAN
-----------------------------------------------------------------------------------
  Seq Scan on seg_id_tmp7  (cost=0.00..87997034.20 rows=2136 width=6)
    Filter: (subplan)
    SubPlan
      ->  Seq Scan on sl_cd_location_dim  (cost=0.00..40114.40
rows=433040 width=8)
(4 rows)



If it helps here's the describe's (including indexes) for both tables:

# \d seg_id_tmp7
                 Table "public.seg_id_tmp7"
      Column      |            Type             | Modifiers
-----------------+-----------------------------+-----------
  customer_srcid  | bigint                      |
  show_srcid      | bigint                      |
  show_name       | character varying           |
  season_srcid    | bigint                      |
  season_name     | character varying           |
  episode_srcid   | bigint                      |
  episode_name    | character varying           |
  segment_type_id | bigint                      |
  segment_type    | character varying           |
  segment_srcid   | bigint                      |
  segment_name    | character varying           |
  create_dt       | timestamp without time zone |




# \d sl_cd_segment_dim
                                          Table
"public.sl_cd_segment_dim"
         Column        |            Type
|                          Modifiers
----------------------+-----------------------------
+-------------------------------------------------------------
  sl_cd_segment_dim_id | bigint                      | not null
default nextval('sl_cd_segment_dim_seq'::regclass)
  customer_srcid       | bigint                      | not null
  show_srcid           | bigint                      | not null
  show_name            | character varying(500)      | not null
  season_srcid         | bigint                      | not null
  season_name          | character varying(500)      | not null
  episode_srcid        | bigint                      | not null
  episode_name         | character varying(500)      | not null
  segment_type_id      | integer                     |
  segment_type         | character varying(500)      |
  segment_srcid        | bigint                      |
  segment_name         | character varying(500)      |
  effective_dt         | timestamp without time zone | not null
default now()
  inactive_dt          | timestamp without time zone |
  last_update_dt       | timestamp without time zone | not null
default now()
Indexes:
     "sl_cd_segment_dim_pk" PRIMARY KEY, btree (sl_cd_segment_dim_id)
     "seg1" btree (customer_srcid)
     "seg2" btree (show_srcid)
     "seg3" btree (season_srcid)
     "seg4" btree (episode_srcid)
     "seg5" btree (segment_srcid)
     "sl_cd_segment_dim_ix1" btree (customer_srcid)






Any thoughts, suggestions, etc on how to improve performance for this
delete ?


Thanks in advance..

/Kevin

Re: improving performance for a delete

From

kevin kempter

Date:

20 May 2008, 17:03:40

Version 8.3.1


On May 20, 2008, at 1:51 PM, kevin kempter wrote:

> Hi all;
>
> I have 2 tables where I basically want to delete from the first
> table  (seg_id_tmp7) any rows where the entire row already exists in
> the second table (sl_cd_segment_dim)
>
> I have a query that looks like this (and it's slow):
>
>
> delete from seg_id_tmp7
> where
>     customer_srcid::text ||
>     show_srcid::text ||
>     show_name::text ||
>     season_srcid::text ||
>     season_name::text ||
>     episode_srcid::text ||
>     episode_name::text ||
>     segment_type_id::text ||
>     segment_type::text ||
>     segment_srcid::text ||
>     segment_name::text
> in
>     ( select
>         customer_srcid::text ||
>         show_srcid::text ||
>         show_name::text ||
>         season_srcid::text ||
>         season_name::text ||
>         episode_srcid::text ||
>         episode_name::text ||
>         segment_type_id::text ||
>         segment_type::text ||
>         segment_srcid::text ||
>         segment_name::text
>         from sl_cd_location_dim )
> ;
>
>
>
>
>
> Here's the query plan for it:
>
>                                    QUERY PLAN
> -----------------------------------------------------------------------------------
> Seq Scan on seg_id_tmp7  (cost=0.00..138870701.56 rows=2136 width=6)
>   Filter: (subplan)
>   SubPlan
>     ->  Seq Scan on sl_cd_location_dim  (cost=0.00..63931.60
> rows=433040 width=8)
> (4 rows)
>
>
>
>
>
>
>
>
> I also tried this:
>
> delete from seg_id_tmp7
> where
>     ( customer_srcid ,
>     show_srcid ,
>     show_name ,
>     season_srcid ,
>     season_name ,
>     episode_srcid ,
>     episode_name ,
>     segment_type_id ,
>     segment_type ,
>     segment_srcid ,
>     segment_name )
> in
>     ( select
>         customer_srcid ,
>         show_srcid ,
>         show_name ,
>         season_srcid ,
>         season_name ,
>         episode_srcid ,
>         episode_name ,
>         segment_type_id ,
>         segment_type ,
>         segment_srcid ,
>         segment_name
>         from sl_cd_location_dim )
> ;
>
>
> and I get this query plan:
>
>                                    QUERY PLAN
> -----------------------------------------------------------------------------------
> Seq Scan on seg_id_tmp7  (cost=0.00..87997034.20 rows=2136 width=6)
>   Filter: (subplan)
>   SubPlan
>     ->  Seq Scan on sl_cd_location_dim  (cost=0.00..40114.40
> rows=433040 width=8)
> (4 rows)
>
>
>
> If it helps here's the describe's (including indexes) for both tables:
>
> # \d seg_id_tmp7
>                Table "public.seg_id_tmp7"
>     Column      |            Type             | Modifiers
> -----------------+-----------------------------+-----------
> customer_srcid  | bigint                      |
> show_srcid      | bigint                      |
> show_name       | character varying           |
> season_srcid    | bigint                      |
> season_name     | character varying           |
> episode_srcid   | bigint                      |
> episode_name    | character varying           |
> segment_type_id | bigint                      |
> segment_type    | character varying           |
> segment_srcid   | bigint                      |
> segment_name    | character varying           |
> create_dt       | timestamp without time zone |
>
>
>
>
> # \d sl_cd_segment_dim
>                                         Table
> "public.sl_cd_segment_dim"
>        Column        |            Type
> |                          Modifiers
> ----------------------+-----------------------------
> +-------------------------------------------------------------
> sl_cd_segment_dim_id | bigint                      | not null
> default nextval('sl_cd_segment_dim_seq'::regclass)
> customer_srcid       | bigint                      | not null
> show_srcid           | bigint                      | not null
> show_name            | character varying(500)      | not null
> season_srcid         | bigint                      | not null
> season_name          | character varying(500)      | not null
> episode_srcid        | bigint                      | not null
> episode_name         | character varying(500)      | not null
> segment_type_id      | integer                     |
> segment_type         | character varying(500)      |
> segment_srcid        | bigint                      |
> segment_name         | character varying(500)      |
> effective_dt         | timestamp without time zone | not null
> default now()
> inactive_dt          | timestamp without time zone |
> last_update_dt       | timestamp without time zone | not null
> default now()
> Indexes:
>    "sl_cd_segment_dim_pk" PRIMARY KEY, btree (sl_cd_segment_dim_id)
>    "seg1" btree (customer_srcid)
>    "seg2" btree (show_srcid)
>    "seg3" btree (season_srcid)
>    "seg4" btree (episode_srcid)
>    "seg5" btree (segment_srcid)
>    "sl_cd_segment_dim_ix1" btree (customer_srcid)
>
>
>
>
>
>
> Any thoughts, suggestions, etc on how to improve performance for
> this delete ?
>
>
> Thanks in advance..
>
> /Kevin
>
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org
> )
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance

Re: improving performance for a delete

From

PFC

Date:

20 May 2008, 17:47:53

On Tue, 20 May 2008 22:03:30 +0200, kevin kempter
<kevin@kevinkempterllc.com> wrote:

> Version 8.3.1
>
>
> On May 20, 2008, at 1:51 PM, kevin kempter wrote:
>
>> Hi all;
>>
>> I have 2 tables where I basically want to delete from the first table
>> (seg_id_tmp7) any rows where the entire row already exists in the
>> second table (sl_cd_segment_dim)
>>
>> I have a query that looks like this (and it's slow):
>>
>>
>> delete from seg_id_tmp7
>> where
>>     customer_srcid::text ||

    Besides being slow as hell and not able to use any indexes, the string
concatenation can also yield incorrect results, for instance :

season_name::text || episode_srcid::text

    Will have the same contents for

season_name='season 1' episode_srcid=12
season_name='season 11' episode_srcid=2

    I suggest doing it the right way, one possibility being :

test=> EXPLAIN DELETE from test where (id,value) in (select id,value from
test2);
                                QUERY PLAN
-------------------------------------------------------------------------
  Hash IN Join  (cost=2943.00..6385.99 rows=2 width=6)
    Hash Cond: ((test.id = test2.id) AND (test.value = test2.value))
    ->  Seq Scan on test  (cost=0.00..1442.99 rows=99999 width=14)
    ->  Hash  (cost=1443.00..1443.00 rows=100000 width=8)
          ->  Seq Scan on test2  (cost=0.00..1443.00 rows=100000 width=8)

    Thanks to the hash it is very fast, one seq scan on both tables, instead
of one seq scan PER ROW in your query.

    Another solution would be :

test=> EXPLAIN DELETE FROM test USING test2 WHERE test.id=test2.id AND
test.value=test2.value;
                                QUERY PLAN
-------------------------------------------------------------------------
  Hash Join  (cost=2943.00..6385.99 rows=2 width=6)
    Hash Cond: ((test.id = test2.id) AND (test.value = test2.value))
    ->  Seq Scan on test  (cost=0.00..1442.99 rows=99999 width=14)
    ->  Hash  (cost=1443.00..1443.00 rows=100000 width=8)
          ->  Seq Scan on test2  (cost=0.00..1443.00 rows=100000 width=8)

    Which chooses the same plan here, quite logically, as it is the best one
in this particular case.