Thread: slow table updates
I'm trying to update a table but it's taking a very long time. I would appreciate any tips folks may have about ways to speed it up.
The table is paprospect2, as below:
paprospect2 contains ~40M rows. The goal now is to migrate the data to the supertable-inherited column with
The update's been running for 5 hours (unloaded dual 2.4 GHz Xeon w/2GB RAM, SCSI160 10K drive). There are no other jobs running. Load is ~1.2 and the update's using ~3-5% of the CPU.
Presumably the large number of blocks written (bo) versus blocks read (bi) reflects an enormous amount of bookkeeping that has to be done for MVCC, logging, perhaps rewriting a row for the new definition (a guess -- I don't know how this is handled), indicies, etc. There's no swapping and no processes are waiting. In short, it seems that this is ENTIRELY an I/O issue. Obviously, faster drives will help (but probably only by small factor).
Any ideas how I might speed this up? Presumably this is all getting wrapped in a transaction -- does that hurt me for such a large update?
Thanks,
Reece
Bonus diversionary topic: In case it's not obvious, the motivation for this is that the subtable (paprospect2) contains a column (run_id) whose definition I would like to migrate to the inherited table (i.e., the 'super-table'). Although postgresql permits adding a column to a supertable with the same name as an extant column in a subtable, it appears that such "merged definition" columns do not have the same properties as a typical inherited column. In particular, dropping the column from the supertable does not drop it from the subtable (but renaming it does change both names). Hmm.
The table is paprospect2, as below:
The columns pfeature_id..confidence and run_id_new (in red) are from an inherited table. Although the inheritance itself is probably not relevant here (correction welcome), I suspect it may be relevant that all existing rows were written before the table definition included run_id_new. p2thread_i_trigger is defined fires on insert only (not update).\d paprospect2 Column | Type | Modifiers -------------+---------+------------------------------------------------------------------- pfeature_id | integer | not null default nextval('unison.pfeature_pfeature_id_seq'::text)pseq_id | integer | not nullpftype_id | integer | not nullstart | integer |stop | integer |confidence | real |run_id | integer | not null[snip 13 integer and real columns] run_id_new | integer | Indexes: paprospect2_redundant_alignment unique btree (pseq_id, "start", stop, run_id, pmodel_id), p2thread_p2params_id btree (run_id), p2thread_pmodel_id btree (pmodel_id) Foreign Key constraints: pftype_id_exists FOREIGN KEY (pftype_id) REFERENCES pftype(pftype_id) ON UPDATE CASCADE ON DELETE CASCADE, p2thread_pmodel_id_exists FOREIGN KEY (pmodel_id) REFERENCES pmprospect2(pmodel_id) ON UPDATE CASCADE ON DELETE CASCADE, pseq_id_exists FOREIGN KEY (pseq_id) REFERENCES pseq(pseq_id) ON UPDATE CASCADE ON DELETE CASCADE Triggers: p2thread_i_trigger
paprospect2 contains ~40M rows. The goal now is to migrate the data to the supertable-inherited column with
update paprospect2 set run_id_new=run_id;
The update's been running for 5 hours (unloaded dual 2.4 GHz Xeon w/2GB RAM, SCSI160 10K drive). There are no other jobs running. Load is ~1.2 and the update's using ~3-5% of the CPU.
This suggests that the update is I/O bound (duh) and vmstat supports this:$ ps -ostime,time,pcpu,cmd 28701 STIME TIME %CPU CMD 12:18 00:07:19 2.3 postgres: admin csb 128.137.116.213 UPDATE
$ vmstat 1 procs memory swap io system cpur b w swpd free buff cache si so bi bo in cs us sy id0 1 0 0 11288 94632 3558960 0 0 14 6 12 21 1 0 60 1 0 0 12044 94632 3558956 0 0 0 972 332 16 0 1 990 1 0 0 11092 94632 3558932 0 0 16 4420 309 25 0 2 970 1 0 0 11456 94636 3558928 0 0 0 980 326 23 0 1 991 0 0 0 12340 94636 3558924 0 0 16 532 329 14 0 0 1000 1 0 0 12300 94636 3558916 0 0 0 1376 324 16 1 0 990 1 0 0 12252 94636 3558904 0 0 16 1888 325 18 0 0 990 1 0 0 11452 94636 3558888 0 0 16 2864 324 23 1 1 980 1 0 0 12172 94636 3558884 0 0 0 940 320 12 0 1 990 1 0 0 12180 94636 3558872 0 0 16 1840 318 22 0 1 990 1 0 0 11588 94636 3558856 0 0 0 2752 312 16 1 2 97
Presumably the large number of blocks written (bo) versus blocks read (bi) reflects an enormous amount of bookkeeping that has to be done for MVCC, logging, perhaps rewriting a row for the new definition (a guess -- I don't know how this is handled), indicies, etc. There's no swapping and no processes are waiting. In short, it seems that this is ENTIRELY an I/O issue. Obviously, faster drives will help (but probably only by small factor).
Any ideas how I might speed this up? Presumably this is all getting wrapped in a transaction -- does that hurt me for such a large update?
Thanks,
Reece
Bonus diversionary topic: In case it's not obvious, the motivation for this is that the subtable (paprospect2) contains a column (run_id) whose definition I would like to migrate to the inherited table (i.e., the 'super-table'). Although postgresql permits adding a column to a supertable with the same name as an extant column in a subtable, it appears that such "merged definition" columns do not have the same properties as a typical inherited column. In particular, dropping the column from the supertable does not drop it from the subtable (but renaming it does change both names). Hmm.
-- Reece Hart, Ph.D. rkh@gene.com, http://www.gene.com/ Genentech, Inc. 650/225-6133 (voice), -5389 (fax) Bioinformatics and Protein Engineering 1 DNA Way, MS-93 http://www.in-machina.com/~reece/ South San Francisco, CA 94080-4990 reece@in-machina.com, GPG: 0x25EC91A0 |
Richard-
Thanks for the suggestions. I too had thought about the FK checks, even though the columns aren't getting updated.
I'm flabbergasted that the update is still running (~22 hours elapsed). By comparison, the database takes only 4 hours to recreate from backup! Something funny is happening here. I just interrupted the update and will find another way.
But hang on, what's this:
I am the only user and I'm not doing a copy... this must be part of the update process. Does anyone out there know whether updates do a table copy instead of in-table udpating (perhaps as a special case for whole-table updates)?
Of course, I can't help but wonder whether I just killed it when it was nearly done...
Thanks,
Reece
Thanks for the suggestions. I too had thought about the FK checks, even though the columns aren't getting updated.
I'm flabbergasted that the update is still running (~22 hours elapsed). By comparison, the database takes only 4 hours to recreate from backup! Something funny is happening here. I just interrupted the update and will find another way.
But hang on, what's this:
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND 30164 compbio 25 0 6056 6056 3568 R 98.8 0.1 29:55 postgres: admin csb [local] COPY
I am the only user and I'm not doing a copy... this must be part of the update process. Does anyone out there know whether updates do a table copy instead of in-table udpating (perhaps as a special case for whole-table updates)?
Of course, I can't help but wonder whether I just killed it when it was nearly done...
Thanks,
Reece
-- Reece Hart, Ph.D. rkh@gene.com, http://www.gene.com/ Genentech, Inc. 650/225-6133 (voice), -5389 (fax) Bioinformatics and Protein Engineering 1 DNA Way, MS-93 http://www.in-machina.com/~reece/ South San Francisco, CA 94080-4990 reece@in-machina.com, GPG: 0x25EC91A0 |