Thread: Slow query

Slow query

From

Roger Ging

Date:

22 October 2004, 20:06:26

The following query has never finished.  I have let it run for over 24
hours.  This is a one time update that is part of a conversion script
from MSSQL data.  All of the tables are freshly built and inserted
into.  I have not run explain analyze because it does not return in a
reasonable time.  Explain output is posted below.  Any suggestions on
syntax changes or anything else to improve this would be appreciated.

Dual PIII 1Ghz
4 GB RAM
4 spindle IDE RAID 0 on LSI controller.
Postgres 7.4.5
Linux version 2.6.3-7mdk-p3-smp-64GB

postgresql.cong snip:

tcpip_socket = true
max_connections = 40
shared_buffers = 1000
sort_mem = 65536
fsync = true


source_song_title +-10,500,000 rows
source_song +-9,500,000 rows
source_system 10 rows
source_title +- 5,600,000

Code run right before this query:
create index ssa_source_song_id on source_song_artist (source_song_id);
analyze source_song_artist;
create index sa_artist_id on source_artist (artist_id);
analyze source_artist;
create index ss_source_song_id on source_song (source_song_id);
analyze source_song;
create index st_title_id on source_title (title_id);
analyze source_title;

source_song.source_song_id = int4
source_song_title.source_song_id = int4
source_title.title_id = int4
source_song_title.title_id = int4

update source_song_title set
source_song_title_id = nextval('source_song_title_seq')
,licensing_match_order = (select licensing_match_order from
source_system where source_system_id = ss.source_system_id)
,affiliation_match_order = (select affiliation_match_order from
source_system where source_system_id = ss.source_system_id)
,title = st.title
from source_song_title sst
join source_song ss on ss.source_song_id = sst.source_song_id
join source_title st on st.title_id = sst.title_id
where source_song_title.source_song_id = sst.source_song_id;


Explain output:
"Hash Join  (cost=168589.60..16651072.43 rows=6386404 width=335)"
"  Hash Cond: ("outer".title_id = "inner".title_id)"
"  ->  Merge Join  (cost=0.00..1168310.61 rows=6386403 width=311)"
"        Merge Cond: ("outer".source_song_id = "inner".source_song_id)"
"        ->  Merge Join  (cost=0.00..679279.40 rows=6386403 width=16)"
"              Merge Cond: ("outer".source_song_id =
"inner".source_song_id)"
"              ->  Index Scan using source_song_title_pkey on
source_song_title sst  (cost=0.00..381779.37 rows=10968719 width=8)"
"              ->  Index Scan using ss_source_song_id on source_song ss
(cost=0.00..190583.36 rows=6386403 width=8)"
"        ->  Index Scan using source_song_title_pkey on
source_song_title  (cost=0.00..381779.37 rows=10968719 width=303)"
"  ->  Hash  (cost=117112.08..117112.08 rows=5513808 width=32)"
"        ->  Seq Scan on source_title st  (cost=0.00..117112.08
rows=5513808 width=32)"
"  SubPlan"
"    ->  Seq Scan on source_system  (cost=0.00..1.14 rows=2 width=4)"
"          Filter: (source_system_id = $0)"
"    ->  Seq Scan on source_system  (cost=0.00..1.14 rows=2 width=2)"
"          Filter: (source_system_id = $0)"

Re: Slow query

From

Tom Lane

Date:

22 October 2004, 21:37:10

Roger Ging <rging@paccomsys.com> writes:
> update source_song_title set
> source_song_title_id = nextval('source_song_title_seq')
> ,licensing_match_order = (select licensing_match_order from
> source_system where source_system_id = ss.source_system_id)
> ,affiliation_match_order = (select affiliation_match_order from
> source_system where source_system_id = ss.source_system_id)
> ,title = st.title
> from source_song_title sst
> join source_song ss on ss.source_song_id = sst.source_song_id
> join source_title st on st.title_id = sst.title_id
> where source_song_title.source_song_id = sst.source_song_id;

Why is "source_song_title sst" in there?  To the extent that
source_song_id is not unique, you are multiply updating rows
because of the self-join.

            regards, tom lane

Re: Slow query

From

Joshua Marsh

Date:

23 October 2004, 13:07:38

Any time you run subqueries, it's going to slow down the update
process a lot.  Each record that is updated in source_song_title runs
two additional queries.  When I do large updates like this, I usualy
Run a transaction that will select all the new data into a new table
on a join.  For example

SELECT
  a.*,
  b.licensing_match_order,
  b.affiliation_match_order,
  d.title
INTO
  updated_data
FROM
  source_song_title AS a
INNER JOIN
  source_system AS b
ON
  b.id = d.id
INNER JOIN
  source_song AS c
ON
  a.id = c.id
INNER JOIN
  source_title AS d
ON
  a.id = d.id

I'm not sure that query does what you want, but you get the idea.
Then just drop the old table and rename the updated_data table.  This
way instead of doing a bunch of updates, you do one select and a
rename.

-Josh

On Fri, 22 Oct 2004 16:37:14 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Roger Ging <rging@paccomsys.com> writes:
> > update source_song_title set
> > source_song_title_id = nextval('source_song_title_seq')
> > ,licensing_match_order = (select licensing_match_order from
> > source_system where source_system_id = ss.source_system_id)
> > ,affiliation_match_order = (select affiliation_match_order from
> > source_system where source_system_id = ss.source_system_id)
> > ,title = st.title
> > from source_song_title sst
> > join source_song ss on ss.source_song_id = sst.source_song_id
> > join source_title st on st.title_id = sst.title_id
> > where source_song_title.source_song_id = sst.source_song_id;
>
> Why is "source_song_title sst" in there?  To the extent that
> source_song_id is not unique, you are multiply updating rows
> because of the self-join.
>
>                         regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>