Thread: ETL optimization
Situation:
I'm trying to optimize an ETL process with many upserts (~100k aggregated rows),
with no duplicates allowed. The source table (t2) holds around 14 million rows,
and I'm grabbing them 100,000 rows at a time from t2, resulting in about
100,000 distinct rows in the destination table (t1).

What I've tried:

i. a FOR ... EXECUTE loop over my result set (aggregated results, 100k-ish
rows): try an update first, check the ROW_COUNT, and if it is 0, do an insert
(roughly the shape sketched just below this message).
...
run time: approx. 25 mins

ii. in a function (pseudocode; the table name is dynamic):
...
up_stm :=
'UPDATE '||t1||' SET x=t2.x
FROM (select sum(x) as x,a,b,c
      from t2
      group by a,b,c) as t2
WHERE '||t1||'.a=t2.a AND '||t1||'.b=t2.b AND '||t1||'.c=t2.c';

EXECUTE up_stm;

ins_stm :=
'INSERT INTO '||t1||' (x,a,b,c) select x,a,b,c
FROM (select sum(x) as x,a,b,c from t2 group by a,b,c) as t2
WHERE NOT EXISTS
  (select true from '||t1||'
   where '||t1||'.a=t2.a
   and '||t1||'.b=t2.b
   and '||t1||'.c=t2.c
   limit 1)';

EXECUTE ins_stm;
...
takes about 7 minutes. The performance of this is reasonable, but there is room
for improvement. I think it's the NOT EXISTS subquery on the insert that makes
the first run slow. Any revisions that might be faster (for the subquery)?
Note that this subquery is necessary so that duplicates don't get into the
target table (t1).

Subsequent runs will be mostly updates (and still slow), with few inserts. I'm
not seeing a way for that update statement to be sped up, but maybe someone
else does?

iii. a UNIQUE constraint on table "t1". This didn't seem to perform too badly
with fewer rows (preliminary tests), but as you'd expect, on error the whole
transaction would roll back. Is it possible to skip a row if it causes an
error, as opposed to aborting the transaction altogether?

To summarize, I'm looking for the most efficient and fastest way to perform my
upserts. Tips and/or references to pertinent docs are also appreciated! If any
more information is necessary, please let me know.

(postgresql 8.0.3, linux)

Cheers,
Bricklen
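For concreteness, the loop elided in (i) presumably looks something like the
following PL/pgSQL. This is only a reconstruction from the description above:
the function name is made up, and the table names are written statically even
though the real code uses a dynamic table name.

CREATE OR REPLACE FUNCTION upsert_rowwise() RETURNS void AS $$
DECLARE
    r  RECORD;
    rc INTEGER;
BEGIN
    -- aggregate the source batch, then try an UPDATE per row;
    -- INSERT only when the UPDATE touched nothing
    FOR r IN EXECUTE 'SELECT sum(x) AS x, a, b, c FROM t2 GROUP BY a, b, c' LOOP
        UPDATE t1 SET x = r.x WHERE a = r.a AND b = r.b AND c = r.c;
        GET DIAGNOSTICS rc = ROW_COUNT;
        IF rc = 0 THEN
            INSERT INTO t1 (x, a, b, c) VALUES (r.x, r.a, r.b, r.c);
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

Each aggregated row costs at least one UPDATE and possibly an INSERT, which is
consistent with the ~25-minute runtime reported above.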
I don't know what this will change wrt how often you need to run VACUUM (I'm a SQL Server guy), but instead of an update and insert, try a delete and insert. You'll only have to find the duplicate rows once and your insert doesn't need a where clause.
Meetesh
Meetesh Karia wrote:
> I don't know what this will change wrt how often you need to run VACUUM
> (I'm a SQL Server guy), but instead of an update and insert, try a
> delete and insert. You'll only have to find the duplicate rows once and
> your insert doesn't need a where clause.
>
> Meetesh

Vacuum analyze is generally run about once an hour. You know, I didn't even
think to try a delete + insert combo (which will not be visible to the other
queries that are occurring). Truncate is out of the question because of the
aforementioned queries, but I'll give the d+i a shot. Thanks!
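A minimal sketch of that delete + insert combo, assuming the same x, a, b, c
columns and writing the table names statically (the real statements are built
dynamically):

BEGIN;

-- drop every destination row that the incoming batch will re-aggregate,
-- so the INSERT below needs no NOT EXISTS guard
DELETE FROM t1
WHERE EXISTS (SELECT 1 FROM t2
              WHERE t2.a = t1.a AND t2.b = t1.b AND t2.c = t1.c);

-- re-insert the fresh aggregates
INSERT INTO t1 (x, a, b, c)
SELECT sum(x), a, b, c
FROM t2
GROUP BY a, b, c;

COMMIT;

Until the COMMIT, concurrent readers still see the old rows, which matches the
"not visible to the other queries" point above.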
Hi,

I have a similar situation, and the solution I use (though I haven't really
tested many different situations) is:

- have a trigger ON INSERT which does:

  UPDATE ... SET whatever_value=NEW.whatever_value, ...
  WHERE whatever_key=NEW.whatever_key AND ...;
  IF FOUND THEN
      RETURN NULL;
  ELSE
      RETURN NEW;
  END IF;

- use COPY

For optimal performance, a different trigger function is created for each
table, which allows the query plan of the UPDATE to be cached.

Let us know how that works out for you and if you find a better solution!

Jacques.
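A rough PL/pgSQL rendering of that trigger for t1 (the function and trigger
names are made up, and the COPY line is only an illustration):

-- convert each incoming row into an UPDATE; only let the INSERT through
-- when no existing row matched
CREATE OR REPLACE FUNCTION t1_upsert_trig() RETURNS trigger AS $$
BEGIN
    UPDATE t1 SET x = NEW.x
    WHERE a = NEW.a AND b = NEW.b AND c = NEW.c;
    IF FOUND THEN
        RETURN NULL;   -- row existed and has been updated; suppress the INSERT
    ELSE
        RETURN NEW;    -- no match; let the INSERT (e.g. from COPY) proceed
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER t1_upsert BEFORE INSERT ON t1
    FOR EACH ROW EXECUTE PROCEDURE t1_upsert_trig();

-- the batch is then loaded with something like:
-- COPY t1 (x, a, b, c) FROM '/path/to/batch.dat';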
Jacques Caron wrote:
> I have a similar situation, and the solution I use (though I haven't
> really tested many different situations):
> - have a trigger ON INSERT which does:
>   UPDATE ... SET whatever_value=NEW.whatever_value, ...
>   WHERE whatever_key=NEW.whatever_key AND ...
>   IF FOUND THEN
>       RETURN NULL;
>   ELSE
>       RETURN NEW;
>   END IF;
> - use COPY
>
> For optimal performance, a different trigger function is created for
> each table, which allows the query plan of the UPDATE to be cached.
>
> Let us know how that works out for you and if you find a better solution!
>
> Jacques.

Hi Jacques, thanks for the suggestion. I've previously tested triggers under a
variety of situations and there was no way that they would work under the load
we currently have, and the much greater load that we will be expecting soon
(~40x increase in data).

I'm in the process of testing the delete scenario right now, and at first
blush it seems to perform fairly well: 2.5 million rows before aggregation,
and 171,000 after, in a little under 7 minutes. I'm currently testing again
with about 18.5 million rows.

A drawback of the delete method is that we cannot do any of the aggregation
incrementally, but so far that hasn't been a big help anyway. I still need to
test the performance of concurrent querying against the destination table
whilst the aggregation is occurring.
On Thu, 23 Jun 2005, Bricklen Anderson wrote:

> iii. UNIQUE constraint on table "t1". This didn't seem to perform too
> badly with fewer rows (preliminary tests), but as you'd expect, on error
> the whole transaction would roll back. Is it possible to skip a row if
> it causes an error, as opposed to aborting the transaction altogether?

You don't need to roll back the whole transaction if you use savepoints or the
exception features in pl/pgsql.

Take a look at this example:

http://developer.postgresql.org/docs/postgres/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE

--
/Dennis Björklund
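Adapted to t1 here, the linked example boils down to something like this
sketch (the parameter types are guesses, and it relies on a UNIQUE constraint
over (a, b, c) as in option iii above):

CREATE OR REPLACE FUNCTION merge_t1(p_x numeric, p_a integer, p_b integer, p_c integer)
RETURNS void AS $$
BEGIN
    LOOP
        -- first try to update an existing row
        UPDATE t1 SET x = p_x WHERE a = p_a AND b = p_b AND c = p_c;
        IF FOUND THEN
            RETURN;
        END IF;
        -- not there, so try to insert it; if another transaction inserts the
        -- same key first, the unique_violation sends us back to the UPDATE
        BEGIN
            INSERT INTO t1 (x, a, b, c) VALUES (p_x, p_a, p_b, p_c);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- loop around and try the UPDATE again
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

The surrounding transaction survives individual duplicate-key failures because
each INSERT attempt runs inside its own exception block, which PL/pgSQL
implements with a subtransaction (i.e. a savepoint) under the hood.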
Dennis Bjorklund wrote:
> You don't need to roll back the whole transaction if you use savepoints or
> the exception features in pl/pgsql.
>
> Take a look at this example:
>
> http://developer.postgresql.org/docs/postgres/plpgsql-control-structures.html#PLPGSQL-UPSERT-EXAMPLE

Hmmm... I'd forgotten about savepoints. That's an interesting idea that I'll
have to check out.

I mentioned earlier that I was going to test the delete + insert version, and
it works pretty well: I got it down to about 3 minutes using that method. I'll
test the savepoint and the exception versions that you listed as well.

Thanks!
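For the savepoint variant, the mechanics at the SQL level look roughly like
this; the literal values and the assumptions that t1 starts out empty, has
only these four columns, and carries the UNIQUE constraint from option iii
are all illustrative:

BEGIN;
INSERT INTO t1 (x, a, b, c) VALUES (10, 1, 2, 3);
SAVEPOINT one_row;
INSERT INTO t1 (x, a, b, c) VALUES (10, 1, 2, 3);  -- duplicate key: raises an error
ROLLBACK TO SAVEPOINT one_row;                     -- undo only the failed row
INSERT INTO t1 (x, a, b, c) VALUES (20, 4, 5, 6);  -- the transaction is still usable
COMMIT;

The loading code issues ROLLBACK TO only when an insert fails (and RELEASE
SAVEPOINT when it succeeds), so just the offending row is skipped rather than
the whole batch being aborted.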