Thread: Loading speed
Hi again,

Is there an FAQ or something I can read about optimising the speed of insertions and queries? I am doing some simple-minded stuff at the moment with pgsql, and my insertions are very, very slow.

A rough idea of the problem, for background: a protein has a small number of chains ('A', 'B', etc.), each of which has N2 amino acids, each of which has N3 atoms. Now I've got lots and lots of atoms - hundreds of thousands most likely, maybe a million or two, I haven't counted. How can I most efficiently load these?

The atom table is a bit like this, with some language changes:

CREATE TABLE atom (
    id              INTEGER PRIMARY KEY DEFAULT nextval('atom_id_seq'),
    protein_id      CHAR(4) REFERENCES protein,
    chain_id        CHAR(1) NOT NULL,
    amino_acid_id   INTEGER REFERENCES amino_acid,
    atom_type       TEXT,
    element         TEXT,
    x               FLOAT,
    y               FLOAT,
    z               FLOAT
);

Now the REFERENCES bits build triggers, and in each case the constraint refers to the primary key of the referenced table. That is always indexed, yes? When I construct the table it says

    CREATE TABLE / PRIMARY KEY will create implicit index 'amino_acid_pkey' for table 'amino_acid'

- this IS an index on the column I named as the PRIMARY KEY, not on some oid, isn't it?

With my simple-minded pygresql script, I have one insert statement per atom, and of course it crawls very feebly. Perhaps there's an insert_many method I haven't found yet (the docs suck; I will keep looking). Or perhaps I should write the rows out to a text file and use COPY?

Any advice?

cheers
Cath

Cath Lawrence, Cath.Lawrence@anu.edu.au
Senior Scientific Programmer, Centre for Bioinformation Science,
John Curtin School of Medical Research (room 4088)
Australian National University, Canberra ACT 0200
ph: (02) 61257959   mobile: 0421-902694   fax: (02) 61252595
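For concreteness, the per-atom statements such a script issues might look like this (the values are invented for illustration). Run one at a time outside a transaction, each statement pays its own commit and planning overhead:

INSERT INTO atom (protein_id, chain_id, amino_acid_id, atom_type, element, x, y, z)
VALUES ('1ABC', 'A', 42, 'CA', 'C', 11.104, 6.134, 9.210);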
On Wed, 2003-10-22 at 03:25, Cath Lawrence wrote:

> Now the REFERENCES bits build triggers, and in each case the constraint
> refers to the primary key of the referenced table.

It needs to refer to a UNIQUE column, not necessarily the primary key.

> That is always indexed, yes? When I construct the table it says
> CREATE TABLE / PRIMARY KEY will create implicit index 'amino_acid_pkey'
> for table 'amino_acid' - this IS an index on the column I named as the
> PRIMARY KEY, not on some oid, isn't it?

Yes.

> With my simple-minded pygresql script, I have one insert statement per
> atom, and of course it crawls very feebly. Perhaps there's an
> insert_many method I haven't found yet (the docs suck; I will keep
> looking).

Use transactions to group them 1000 at a time; that will greatly reduce the overhead on each insertion:

BEGIN;
INSERT...
INSERT...
[1000 times]
END;

> Or perhaps I should write the rows out to a text file and use COPY?

That is the fastest way to do bulk importing.

--
Oliver Elphick                          Oliver.Elphick@lfix.co.uk
Isle of Wight, UK                       http://www.lfix.co.uk/oliver
GPG: 1024D/3E1D0C1C: CA12 09E0 E8D5 8870 5839 932A 614D 4C34 3E1D 0C1C
                 ========================================
"But what things were gain to me, those I counted loss for Christ."
                                        Philippians 3:7
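A minimal sketch of the COPY route Oliver recommends, assuming the rows have been written to a text file first (the path and data below are examples only): one row per line, columns in table order, fields separated by tabs, with \N for NULL:

COPY atom FROM '/tmp/atoms.txt';

-- /tmp/atoms.txt would contain tab-separated lines like:
-- 1    1ABC    A    42    CA    C    11.104    6.134     9.210
-- 2    1ABC    A    42    CB    C    12.560    5.892    10.334

Note that without a column list the file must supply every column, including id, so the loading script has to generate the ids itself. Also, COPY FROM a file is read by the server process, so the file must be accessible on the server side; psql's \copy command is the usual client-side alternative.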
Cath Lawrence <Cath.Lawrence@anu.edu.au> writes:
> The atom table is a bit like this, with some language changes:
> CREATE TABLE atom (
>     id              INTEGER PRIMARY KEY DEFAULT nextval('atom_id_seq'),
>     protein_id      CHAR(4) REFERENCES protein,
>     chain_id        CHAR(1) NOT NULL,
>     amino_acid_id   INTEGER REFERENCES amino_acid,
>     atom_type       TEXT,
>     element         TEXT,
>     x               FLOAT,
>     y               FLOAT,
>     z               FLOAT
> );

> With my simple-minded pygresql script, I have one insert statement per
> atom, and of course it crawls very feebly. Perhaps there's an
> insert_many method I haven't found yet (the docs suck; I will keep
> looking). Or perhaps I should write the rows out to a text file and
> use COPY?

COPY would be better than individual INSERTs, if you can bunch a reasonably large number of rows into each COPY (the point being to cut down per-statement overhead, of course). You could perhaps avoid the use of an intermediate file, if pygresql supports COPY FROM STDIN (I don't know if it does).

However, I suspect the real problem is poor performance of the REFERENCES checks. What is likely happening is that the planner thinks the "protein" and "amino_acid" tables are small, and is generating plans for the checks that only work well for small tables. Once you get a respectable amount of data into the referenced tables, the checks take forever.

If you can load the "protein" and "amino_acid" tables first, then the answer is simple --- do that, then VACUUM and/or ANALYZE those tables, then start on the "atom" table. If you want to load them in parallel, then I think your best bet is to drop and recreate all three tables before you start loading, and do *not* run VACUUM or ANALYZE while the tables are empty. The default planner assumptions, when no VACUUM or ANALYZE has happened, should give tolerable plans.

Another thing you can think about is not doing the REFERENCES checks while loading data at all; that is, create the tables without REFERENCES, load all the data, ANALYZE, then install the references clauses with ALTER TABLE ADD FOREIGN KEY. However, this is only really a good idea if you're pretty certain the initial data is clean --- if you have any referential check failures, tracking them down will be painful. (7.4 would tell you the failing key value, but I don't think 7.3 does.)

			regards, tom lane
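In SQL terms, the two loading strategies Tom outlines might look roughly like this (table and column names are taken from the schema above; this is only a sketch, so check the exact syntax against the documentation for your version):

-- Strategy 1: load the referenced tables first, update planner
-- statistics, then load atom so the REFERENCES checks get good plans.
-- ... load protein and amino_acid here ...
VACUUM ANALYZE protein;
VACUUM ANALYZE amino_acid;
-- ... load atom here, with COPY or batched INSERTs ...

-- Strategy 2: create atom without the REFERENCES clauses, bulk-load
-- everything, then add the constraints afterwards.
-- ... load protein, amino_acid and atom here ...
ANALYZE;
ALTER TABLE atom ADD FOREIGN KEY (protein_id) REFERENCES protein;
ALTER TABLE atom ADD FOREIGN KEY (amino_acid_id) REFERENCES amino_acid;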