Re: how to efficiently update tuple in many-to-many relationship? - Mailing list pgsql-performance

From Drew Wilson
Subject Re: how to efficiently update tuple in many-to-many relationship?
Date
Msg-id D14BB823-DA04-453C-AD70-248B34F1307B@gmail.com
Whole thread Raw
In response to Re: how to efficiently update tuple in many-to-many relationship?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: how to efficiently update tuple in many-to-many relationship?
List pgsql-performance
Thanks for the suggestions, Tom. But I'm still stumped.

On Apr 9, 2007, at 7:13 PM, Tom Lane wrote:

> Drew Wilson <drewmwilson@gmail.com> writes:
>> Here's the query plan for a SELECT statement that returns 1,207,161
>> rows in 6 seconds.
>> ...
>> And here's the query plan for the UPDATE query that seems to never
>> complete. (Execution time > 30 minutes.)
>
> Well, the subplan is certainly the same as before, so it seems
> there are
> two possibilities:
>
> * there's something unreasonably inefficient about the hash join being
> used to perform the IN (work_mem too small? inefficient-to-compare
> datatype? bad data distribution?)
I'm not sure why. The ids are OIDs generated from a sequence, with no
deletions.

> * the time is actually going into the UPDATE operation proper, or
> perhaps some triggers it fires (have you got any foreign keys
> involving
> this table?  what's checkpoint_segments set to?)

> You could narrow it down by checking the runtime for
> select count(*) from translation_pair_data
>   where translation_pair_id in
>         (select translation_pair_id from translation_pair_data ...
After I increasing work_mem from 1M to 32M, checkpoint_segments from
3 to 8, (and reloading), the UPDATE operation still takes about 15
minutes (944 seconds) to update 637,712 rows.

Whereas replacing the the "UPDATE ... WHERE translation_pair_id IN"
with "SELECT count(*) WHERE translation_pair_id IN" drops the time
from 15 minutes to 19 seconds (returning the same 637712 rows.)

> If that's slow it's the topmost hash join's fault, else we have
> to look at the UPDATE's side effects.

The SELECT is not slow, so its a side effect of the update... Looking
at the table definition, there is a "BEFORE ON DELETE" trigger
defined, two CHECK constraints for this table, and three foreign
keys. Nothing looks suspicious to me.
Any clues in the table description below?

Here's the table definition. (And I've appended updated query plans
descriptions.)

MatchBox=# \d translation_pair_data               Table
"public.translation_pair_data"
        Column        |            Type             |   Modifiers
---------------------+-----------------------------+---------------
translation_pair_id | oid                         | not null
translation_id      | oid                         | not null
history_id          | oid                         | not null
source_id           | oid                         | not null
created_ts          | timestamp without time zone | default now()
last_added_ts       | timestamp without time zone | default now()
obsolete            | boolean                     |
style               | character(1)                |
localizability      | boolean                     |
ui_restricted       | boolean                     |
linguistic          | boolean                     |
gender              | character(1)                |
platforms           | character varying[]         |
is_public           | boolean                     |
Indexes:
     "translation_pair_pkey" PRIMARY KEY, btree (translation_pair_id)
     "translation_pair_source_id_key" UNIQUE, btree (source_id,
translation_id)
     "translation_pair_created_date" btree (date(created_ts))
     "translation_pair_data_is_public" btree (is_public)
     "translation_pair_source_id" btree (source_id)
     "translation_pair_source_id_is_not_obsolete" btree (source_id,
obsolete) WHERE obsolete IS NOT TRUE
     "translation_pair_translation_id" btree (translation_id)
Check constraints:
     "translation_pair_gender_check" CHECK (gender = 'M'::bpchar OR
gender = 'F'::bpchar OR gender = 'N'::bpchar)
     "translation_pair_style_check" CHECK (style = 'P'::bpchar OR
style = 'O'::bpchar OR style = 'N'::bpchar)
Foreign-key constraints:
     "translation_pair_history_id_fkey" FOREIGN KEY (history_id)
REFERENCES history(history_id)
     "translation_pair_source_id_fkey" FOREIGN KEY (source_id)
REFERENCES source_data(source_id)
     "translation_pair_translation_id_fkey" FOREIGN KEY
(translation_id) REFERENCES translation_data(translation_id)
Triggers:
     del_tp_prodtype BEFORE DELETE ON translation_pair_data FOR EACH
ROW EXECUTE PROCEDURE eme_delete_tp_prodtype()


Thanks for all your help,

Drew

p.s. here are the updated query plans after bumping work_mem to 32M.

MatchBox=# explain select count(*) from
translation_pair_data

                             where translation_pair_id in (select
translation_pair_id from translation_pair_data join instance i using
(translation_pair_id) join loc_submission ls using(loc_submission_id)
where ls.is_public = true);

QUERY PLAN
------------------------------------------------------------------------
-------------------------------------------------------------------
Aggregate  (cost=424978.46..424978.47 rows=1 width=0)
    ->  Hash IN Join  (cost=324546.91..420732.64 rows=1698329 width=0)
          Hash Cond:
(public.translation_pair_data.translation_pair_id =
public.translation_pair_data.translation_pair_id)
          ->  Seq Scan on translation_pair_data  (cost=0.00..38494.29
rows=1698329 width=4)
          ->  Hash  (cost=290643.93..290643.93 rows=2006718 width=8)
                ->  Hash Join  (cost=66710.78..290643.93 rows=2006718
width=8)
                      Hash Cond: (i.translation_pair_id =
public.translation_pair_data.translation_pair_id)
                      ->  Hash Join  (cost=352.38..169363.36
rows=2006718 width=4)
                            Hash Cond: (i.loc_submission_id =
ls.loc_submission_id)
                            ->  Seq Scan on instance i
(cost=0.00..99016.16 rows=5706016 width=8)
                            ->  Hash  (cost=283.23..283.23 rows=5532
width=4)
                                  ->  Index Scan using
loc_submission_is_public on loc_submission ls  (cost=0.00..283.23
rows=5532 width=4)
                                        Index Cond: (is_public = true)
                                        Filter: is_public
                      ->  Hash  (cost=38494.29..38494.29 rows=1698329
width=4)
                            ->  Seq Scan on translation_pair_data
(cost=0.00..38494.29 rows=1698329 width=4)

The SELECT above takes approx 20s, whereas this UPDATE below takes
944s (15 minutes)

MatchBox=# explain update translation_pair_data set is_public = true
where translation_pair_id in (select translation_pair_id from
translation_pair_data join instance i using(translation_pair_id) join
loc_submission ls using(loc_submission_id) where ls.is_public = true);
                                                              QUERY PLAN
------------------------------------------------------------------------
-------------------------------------------------------------
Hash IN Join  (cost=324546.91..457218.64 rows=1698329 width=90)
    Hash Cond: (public.translation_pair_data.translation_pair_id =
public.translation_pair_data.translation_pair_id)
    ->  Seq Scan on translation_pair_data  (cost=0.00..38494.29
rows=1698329 width=90)
    ->  Hash  (cost=290643.93..290643.93 rows=2006718 width=8)
          ->  Hash Join  (cost=66710.78..290643.93 rows=2006718 width=8)
                Hash Cond: (i.translation_pair_id =
public.translation_pair_data.translation_pair_id)
                ->  Hash Join  (cost=352.38..169363.36 rows=2006718
width=4)
                      Hash Cond: (i.loc_submission_id =
ls.loc_submission_id)
                      ->  Seq Scan on instance i
(cost=0.00..99016.16 rows=5706016 width=8)
                      ->  Hash  (cost=283.23..283.23 rows=5532 width=4)
                            ->  Index Scan using
loc_submission_is_public on loc_submission ls  (cost=0.00..283.23
rows=5532 width=4)
                                  Index Cond: (is_public = true)
                                  Filter: is_public
                ->  Hash  (cost=38494.29..38494.29 rows=1698329 width=4)
                      ->  Seq Scan on translation_pair_data
(cost=0.00..38494.29 rows=1698329 width=4)


pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: how to efficiently update tuple in many-to-many relationship?
Next
From: "Merlin Moncure"
Date:
Subject: Re: join to view over custom aggregate seems like it should be faster