Thread: MERGE: performance advices
I need to merge 2 tables: update d set c1=s.c1, c2=s.c2... from s where d.pk=s.pk; insert into d (pk, c1, c2, ...) select pk, c1, c2, c3 from s where s.pk not in (select pk from d); Any strategy to make it faster? Including modifying postgres.conf temporary? Considering I've no concurrency problems. The tables I'm dealing with are "read only" for everything else other than the merge process. thanks -- Ivan Sergio Borgonovo http://www.webthatworks.it
Ivan Sergio Borgonovo <mail@webthatworks.it> writes: > I need to merge 2 tables: > > update d set c1=s.c1, c2=s.c2... from s where d.pk=s.pk; > insert into d (pk, c1, c2, ...) select pk, c1, c2, c3 from s > where s.pk not in (select pk from d); you could try making the not in an exists. In released versions of Postgres sometimes one is better than the other. Raising work_mem might matter if it lets you do a hash join for either the IN/EXISTS or the join. There is another approach though whether it's faster depends on how many indexes you have and other factors: CREATE TABLE new_d AS SELECT DISTINCT ON (pk) pk,c1,c FROM (select 1 as t, * from s union all select 2 as t, * from d ) ORDER BY pk, t This will pull in all the rows from both tables and sort them by pk with records from s appearing before matching records from t and then keep only the first value for each pk. Then you'll have to build indexes, swap the tables, and fix any views or rules which refer to the old table (they'll still refer to the old table, not the new table even after renaming it to the old name). -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
On Tue, Sep 2, 2008 at 4:19 AM, Ivan Sergio Borgonovo <mail@webthatworks.it> wrote: > insert into d (pk, c1, c2, ...) select pk, c1, c2, c3 from s > where s.pk not in (select pk from d); This insert statement might be faster: INSERT INTO d (pk, c1, c2, ... ) SELECT pk, c1, c2, ... FROM s LEFT JOIN d ON s.pk = d.pk WHERE d.pk IS NULL; -- Regards, Richard Broersma Jr. Visit the Los Angeles PostgreSQL Users Group (LAPUG) http://pugs.postgresql.org/lapug
Richard Broersma wrote: > On Tue, Sep 2, 2008 at 4:19 AM, Ivan Sergio Borgonovo > <mail@webthatworks.it> wrote: > > > >>insert into d (pk, c1, c2, ...) select pk, c1, c2, c3 from s >> where s.pk not in (select pk from d); > > > This insert statement might be faster: > > INSERT INTO d (pk, c1, c2, ... ) > SELECT pk, c1, c2, ... > FROM s > LEFT JOIN d ON s.pk = d.pk > WHERE d.pk IS NULL; > > Hello Richard, Is there a way to do something similar with the following? I am an SQL noob and the following takes longer to run than is reasonable, on the order of hours. insert into myevents select * from t_unit_event_log a where exists (select b.event_log_no from myevents b where a.event_status = 1 and a.event_ref_log_no IS NOT NULL and a.event_ref_log_no = b.event_log_no and a.event_log_no not in (select event_log_no from myevents) ) Thanks, Steve
On Tue, Sep 2, 2008 at 8:10 AM, Steve Clark <sclark@netwolves.com> wrote: > Is there a way to do something similar with the following? I am an SQL noob > and the > following takes longer to run than is reasonable, on the order of hours. > > insert into myevents select * from t_unit_event_log a where exists > (select b.event_log_no from myevents b > where a.event_status = 1 and a.event_ref_log_no IS NOT NULL > and a.event_ref_log_no = b.event_log_no and a.event_log_no not > in > (select event_log_no from myevents) > ) To start off with, this SQL statement can be refined a bit. Many of the sub-query WHERE clause constraints have nothing to do with the Correlated sub-query. The refinement would look like so: INSERT INTO Myevents SELECT * FROM T_unit_event_log AS A WHERE A.event_status = 1 AND A.event_ref_log_no IS NOT NULL AND A.event_log_no NOT IN ( SELECT event_log_no FROM Myevents) AND EXISTS ( SELECT B.event_log_no FROM Myevents AS B WHERE A.event_ref_log_no = B.event_log_no ); The next step would be to rework the NOT IN sub-query into a LEFT JOIN WHERE IS NULL; INSERT INTO Myevents SELECT * FROM T_unit_event_log AS A LEFT JOIN Myevents AS C ON A.event_log_no = C.event_log_no WHERE A.event_status = 1 AND A.event_ref_log_no IS NOT NULL AND C.event_log_no IS NULL AND EXISTS ( SELECT B.event_log_no FROM Myevents AS B WHERE A.event_ref_log_no = B.event_log_no ); There is one possible alteration that may or many not improve performance. This would be to replace the EXISTS with a LEFT JOIN WHERE IS NOT NULL; INSERT INTO Myevents SELECT * FROM T_unit_event_log AS A LEFT JOIN Myevents AS C ON A.event_log_no = C.event_log_no LEFT JOIN Myevents AS B ON A.event_ref_log_no = B.event_log_no WHERE C.event_log_no IS NULL AND B.event_log_no IS NOT NULL AND A.event_status = 1 AND A.event_ref_log_no IS NOT NULL; -- Regards, Richard Broersma Jr. Visit the Los Angeles PostgreSQL Users Group (LAPUG) http://pugs.postgresql.org/lapug
Richard Broersma wrote: > On Tue, Sep 2, 2008 at 8:10 AM, Steve Clark <sclark@netwolves.com> wrote: > > >>Is there a way to do something similar with the following? I am an SQL noob >>and the >>following takes longer to run than is reasonable, on the order of hours. >> >>insert into myevents select * from t_unit_event_log a where exists >> (select b.event_log_no from myevents b >> where a.event_status = 1 and a.event_ref_log_no IS NOT NULL >> and a.event_ref_log_no = b.event_log_no and a.event_log_no not >>in >> (select event_log_no from myevents) >> ) > > > > To start off with, this SQL statement can be refined a bit. Many of > the sub-query WHERE clause constraints have nothing to do with the > Correlated sub-query. The refinement would look like so: > > INSERT INTO Myevents > SELECT * > FROM T_unit_event_log AS A > WHERE A.event_status = 1 > AND A.event_ref_log_no IS NOT NULL > AND A.event_log_no NOT IN ( SELECT event_log_no > FROM Myevents) > AND EXISTS ( SELECT B.event_log_no > FROM Myevents AS B > WHERE A.event_ref_log_no = B.event_log_no ); > > > The next step would be to rework the NOT IN sub-query into a LEFT JOIN > WHERE IS NULL; > > INSERT INTO Myevents > SELECT * > FROM T_unit_event_log AS A > LEFT JOIN Myevents AS C > ON A.event_log_no = C.event_log_no > WHERE A.event_status = 1 > AND A.event_ref_log_no IS NOT NULL > AND C.event_log_no IS NULL > AND EXISTS ( SELECT B.event_log_no > FROM Myevents AS B > WHERE A.event_ref_log_no = B.event_log_no ); > > There is one possible alteration that may or many not improve > performance. This would be to replace the EXISTS with a LEFT JOIN > WHERE IS NOT NULL; > > INSERT INTO Myevents > SELECT * > FROM T_unit_event_log AS A > LEFT JOIN Myevents AS C > ON A.event_log_no = C.event_log_no > LEFT JOIN Myevents AS B > ON A.event_ref_log_no = B.event_log_no > WHERE C.event_log_no IS NULL > AND B.event_log_no IS NOT NULL > AND A.event_status = 1 > AND A.event_ref_log_no IS NOT NULL; > > Hi Richard and thanks for the response. When I try the last two queries i get and error. I have listed the results of explain on all three. srm2=# explain srm2-# INSERT INTO Myevents srm2-# SELECT * srm2-# FROM T_unit_event_log AS A srm2-# WHERE A.event_status = 1 srm2-# AND A.event_ref_log_no IS NOT NULL srm2-# AND A.event_log_no NOT IN ( SELECT event_log_no srm2(# FROM Myevents) srm2-# AND EXISTS ( SELECT B.event_log_no srm2(# FROM Myevents AS B srm2(# WHERE A.event_ref_log_no = B.event_log_no ); QUERY PLAN ----------------------------------------------------------------------------------------------------------------- Index Scan using indx_tuel_usn_ec_es on t_unit_event_log a (cost=31711.73..3607445990.61 rows=51844 width=146) Index Cond: (event_status = 1) Filter: ((event_ref_log_no IS NOT NULL) AND (subplan) AND (NOT (subplan))) SubPlan -> Materialize (cost=31711.73..42857.85 rows=830612 width=4) -> Seq Scan on myevents (cost=0.00..28041.12 rows=830612 width=4) -> Index Scan using indx1myevents on myevents b (cost=0.00..8.37 rows=1 width=4) Index Cond: ($1 = event_log_no) (8 rows) srm2=# srm2=# srm2=# explain srm2-# INSERT INTO Myevents srm2-# SELECT * srm2-# FROM T_unit_event_log AS A srm2-# LEFT JOIN Myevents AS C srm2-# ON A.event_log_no = C.event_log_no srm2-# WHERE A.event_status = 1 srm2-# AND A.event_ref_log_no IS NOT NULL srm2-# AND C.event_log_no IS NULL srm2-# AND EXISTS ( SELECT B.event_log_no srm2(# FROM Myevents AS B srm2(# WHERE A.event_ref_log_no = B.event_log_no ); ERROR: INSERT has more expressions than target columns srm2=# srm2=# explain srm2-# INSERT INTO Myevents srm2-# SELECT * srm2-# FROM T_unit_event_log AS A srm2-# LEFT JOIN Myevents AS C srm2-# ON A.event_log_no = C.event_log_no srm2-# LEFT JOIN Myevents AS B srm2-# ON A.event_ref_log_no = B.event_log_no srm2-# WHERE C.event_log_no IS NULL srm2-# AND B.event_log_no IS NOT NULL srm2-# AND A.event_status = 1 srm2-# AND A.event_ref_log_no IS NOT NULL; ERROR: INSERT has more expressions than target columns I really appreciate your help. Steve
On Tue, Sep 2, 2008 at 9:47 AM, Steve Clark <sclark@netwolves.com> wrote: > srm2=# explain > srm2-# INSERT INTO Myevents > srm2-# SELECT * > ERROR: INSERT has more expressions than target columns > srm2=# explain > srm2-# INSERT INTO Myevents > srm2-# SELECT * > ERROR: INSERT has more expressions than target columns Oops, replace SELECT * with SELECT A.*. -- Regards, Richard Broersma Jr. Visit the Los Angeles PostgreSQL Users Group (LAPUG) http://pugs.postgresql.org/lapug
Richard Broersma wrote: > On Tue, Sep 2, 2008 at 9:47 AM, Steve Clark <sclark@netwolves.com> wrote: > >>srm2=# explain >>srm2-# INSERT INTO Myevents >>srm2-# SELECT * > > >>ERROR: INSERT has more expressions than target columns > > >>srm2=# explain >>srm2-# INSERT INTO Myevents >>srm2-# SELECT * >>ERROR: INSERT has more expressions than target columns > > > Oops, replace SELECT * with SELECT A.*. > Richard, This is AWESOME! This now only takes seconds where before it was taking longer than I wanted to wait, I had let it run all night before aborting it. Thanks so much Richard, Regards, Steve
"Richard Broersma" <richard.broersma@gmail.com> writes: > There is one possible alteration that may or many not improve > performance. This would be to replace the EXISTS with a LEFT JOIN > WHERE IS NOT NULL; That changes the behavior, doesn't it? Or is event_log_no a unique key for Myevents? I think what you want is to make the EXISTS an IN instead. regards, tom lane
Tom Lane wrote: > "Richard Broersma" <richard.broersma@gmail.com> writes: > >>There is one possible alteration that may or many not improve >>performance. This would be to replace the EXISTS with a LEFT JOIN >>WHERE IS NOT NULL; > > > That changes the behavior, doesn't it? Or is event_log_no a unique > key for Myevents? I think what you want is to make the EXISTS an IN > instead. > > regards, tom lane > > Yes event_log_no is a unique key for myevents.
On Tue, Sep 2, 2008 at 10:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Richard Broersma" <richard.broersma@gmail.com> writes: >> There is one possible alteration that may or many not improve >> performance. This would be to replace the EXISTS with a LEFT JOIN >> WHERE IS NOT NULL; > > That changes the behavior, doesn't it? Or is event_log_no a unique > key for Myevents? I think what you want is to make the EXISTS an IN > instead. Thanks for pointing that out Tom. I hadn't consider how the EXISTS clause would return potentially fewer rows if event_log_no wasn't unique. -- Regards, Richard Broersma Jr. Visit the Los Angeles PostgreSQL Users Group (LAPUG) http://pugs.postgresql.org/lapug