Thread: Performance issue with cross table updates
Hi all

I am no database wizard, so I am hoping someone may be able to assist me :)

I get a daily CSV file of 6.5 million records. I create a temporary table and COPY them in. On completion I create an index on the mdnid column. This column is also indexed in table 2 (the main e_lookup table). This part is very fast. I had some 'checkpoint too often' issues, but those I have resolved.

I then use the following procedure to update all the records, and if a record does not exist, insert it instead:

    declare
        v_update e_lookup_load%ROWTYPE;
    begin
        for v_update in select * from e_lookup_load loop
            update e_lookup
               set cur_balance    = v_update.cur_balance,
                   last_usage     = v_update.last_usage,
                   b_code         = v_update.b_code,
                   u_date         = v_update.u_date,
                   date_of_insert = v_update.date_of_insert
             where mdnid = v_update.mdnid;

            if not found then
                insert into e_lookup (mdnid, cur_balance, last_usage, b_code, u_date, date_of_insert)
                values (v_update.mdnid, v_update.cur_balance, v_update.last_usage,
                        v_update.b_code, v_update.u_date, v_update.date_of_insert);
            end if;
        end loop;
    end;

From my understanding, a FOR loop is wrapped in a cursor internally anyway, so there is no need to declare one explicitly. Am I fundamentally doing something wrong, as the operation is slow as molasses? Maybe there is a better way altogether that I have not thought of? The bottom line is that at no point can the e_lookup table be unavailable to clients, else I would have just done a simple drop and rename after the COPY.

Kind Regards
Craig
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Craig Gibson
> Sent: Monday, September 10, 2012 12:34 PM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] Performance issue with cross table updates
>
> Hi all
>
> I am no database wizard so I am hoping someone may be able to assist me :)
>
> I get a daily CSV file of 6.5 million records. I create a temporary table and
> COPY them in. On completion I create an index on the mdnid column. This
> column is also indexed in table 2. This part is very fast. I had some 'checkpoint
> too often' issues, but that I have resolved.
>
> I then use the following procedure to update all the records, and if a record
> does not exist, insert it instead:
>
> ................................
>
> From my understanding, a for loop is encapsulated in a cursor anyway so no
> need to do that. Am I fundamentally doing something wrong as the
> operation is slow as molasses? Maybe there is a better way altogether that I
> have not thought of? The bottom line is that at no point can the e_lookup
> table be unavailable to clients, else I would have just done a simple drop and
> rename post the COPY.

What version of PostgreSQL are you using? Your ability to solve this in more efficient ways is greatly enhanced the newer the version of PostgreSQL you are using (writable CTEs, mostly).

That said, you really need to use "set logic" and not procedural logic. In this situation I would simply add another column to the import staging table called "exists_on_master" or something similar. I would then:

    UPDATE staging
       SET exists_on_master = TRUE
      FROM live
     WHERE staging.id = live.id;

I would then issue a:

    UPDATE live
       SET ....
      FROM staging
     WHERE staging.id = live.id;

Then I would execute:

    INSERT INTO live (...)
    SELECT ... FROM staging
     WHERE NOT exists_on_master;

You don't really need the extra boolean on the staging table, but it makes the subsequent queries more explicit. Though for 6.5 million records you may want to try a couple of variations of this to see which performs best.

I am presuming that the live table does not change other than by this synchronization query, so you really shouldn't have any concurrency issues.

The main point is that you should not be attempting to UPDATE a record that does not exist. Identify in advance all the ones you know will fail and just perform the insert for those.

David J.
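For concreteness, the same three steps written against the table and column names from the original procedure might look roughly like this (an untested sketch; the column list is trimmed to the ones shown in the loop, and the staging table here is the temporary e_lookup_load table from the original post):

    ALTER TABLE e_lookup_load ADD COLUMN exists_on_master boolean DEFAULT false;

    -- Flag the staging rows that already have a match in the live table.
    UPDATE e_lookup_load s
       SET exists_on_master = true
      FROM e_lookup l
     WHERE s.mdnid = l.mdnid;

    -- Update every matching live row in a single statement.
    UPDATE e_lookup l
       SET cur_balance    = s.cur_balance,
           last_usage     = s.last_usage,
           b_code         = s.b_code,
           u_date         = s.u_date,
           date_of_insert = s.date_of_insert
      FROM e_lookup_load s
     WHERE l.mdnid = s.mdnid;

    -- Insert the staging rows that had no match.
    INSERT INTO e_lookup (mdnid, cur_balance, last_usage, b_code, u_date, date_of_insert)
    SELECT mdnid, cur_balance, last_usage, b_code, u_date, date_of_insert
      FROM e_lookup_load
     WHERE NOT exists_on_master;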
Craig Gibson <craiggib@gmail.com> writes:
> I get a daily CSV file of 6.5 million records. I create a temporary
> table and COPY them in. On completion I create an index on the mdnid
> column. This column is also indexed in table 2. This part is very
> fast. I had some 'checkpoint too often' issues, but that I have
> resolved.

FWIW, you probably should also issue an ANALYZE on the temp table after
you've loaded it. If you skip that, the planner is working blind as to
column statistics and is quite likely to choose inappropriate plans.

I concur with the other comment that you might be able to make this a
lot faster if you could convert it into a couple of multiple-row
commands instead of doing a manual loop. But you'd need up-to-date
statistics to get a decent plan for that, too ...

			regards, tom lane
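In this case that is a single extra statement once the COPY and CREATE INDEX on the staging table have finished, e.g.:

    ANALYZE e_lookup_load;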
On Mon, Sep 10, 2012 at 9:34 AM, Craig Gibson <craiggib@gmail.com> wrote:
> Hi all
>
> I am no database wizard so I am hoping someone may be able to assist me :)
>
> I get a daily CSV file of 6.5 million records. I create a temporary
> table and COPY them in. On completion I create an index on the mdnid
> column. This column is also indexed in table 2. This part is very
> fast. I had some 'checkpoint too often' issues, but that I have
> resolved.
>
> I then use the following procedure to update all the records, and if a
> record does not exist, insert it instead:
> ...
>
> From my understanding, a for loop is encapsulated in a cursor anyway
> so no need to do that. Am I fundamentally doing something wrong as the
> operation is slow as molasses?

How big is the parent table? Are you CPU limited or IO limited?

If you are not CPU limited, then I would guess that the indexes on the
parent table do not fit in RAM or shared_buffers, and that maintaining
the indexes on the parent table during the updates/inserts is the
bottleneck.

> Maybe there is a better way altogether
> that I have not thought of? The bottom line is that at no point can
> the e_lookup table be unavailable to clients,

Does it always have to be available for updates, or just for selects?

> else I would have just
> done a simple drop and rename post the COPY.

Maybe you can just do an atomic rename.

Cheers,

Jeff
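If the daily CSV is a complete snapshot and clients only need read access during the swap, the atomic-rename idea might look roughly like this (a sketch; e_lookup_new is a hypothetical table built and indexed from the CSV beforehand, and any views, foreign keys, or grants pointing at e_lookup would need separate handling):

    BEGIN;
    -- Both renames become visible to other sessions atomically at commit.
    ALTER TABLE e_lookup RENAME TO e_lookup_old;
    ALTER TABLE e_lookup_new RENAME TO e_lookup;
    COMMIT;

    DROP TABLE e_lookup_old;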