Thread: Bulk Insert and Index use
Hi,

I have a question on bulk checking and inserting into a table, and how best to use an index for performance.

The data I have to work with is a monthly CD-ROM CSV dump of 300,000 property owners from one area/shire. So every CD has 300,000-odd lines, each of which fills a row in the 'property' table.

Beginning with the first CD, each line should require one SELECT and one INSERT, as it will be the first property with this address. The SELECT uses fields like 'street' and 'suburb' to check for an existing property, so I have built an index on those fields.

My question is: does each INSERT rebuild the index on the 'street' and 'suburb' fields? I believe it does, but I'm asking to be sure. If that is the case, I guess performance will suffer when I have, say, 200,000 rows in the table.

Would it be like:

a) Use the index to search on 'street' and 'suburb'
b) No result? Insert a new record
c) Rebuild the index on 'street' and 'suburb'

for each row? Would this mean that after 200,000 rows each INSERT would require an index covering thousands of rows to be rebuilt?

So far I believe my only options are to use either an index or a sequential scan and see which is faster.

A minute for your thoughts and/or suggestions would be great.

Thanks.
Regards,
Rudi.
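In sketch form, the per-row flow I mean (column names and values are simplified placeholders, not my real schema):

    -- For each CSV line: check for the address, then insert if absent.
    SELECT id FROM property
    WHERE street = '1 Main St' AND suburb = 'Carlton';

    -- ...and if no row came back:
    INSERT INTO property (street, suburb, owner)
    VALUES ('1 Main St', 'Carlton', 'J. Smith');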
If the bulk load has the possibility of duplicating data, then you need to change methods. Try bulk loading into a temp table, index it like the original, eliminate the dups, and merge the tables.

It is also possible to do an insert from the temp table into the final table with an anti-join, along these lines:

    insert into original (x, x, x)
    select t.col1, t.col2, ...
    from temp t
    left join original o on t.street = o.street
    where o.street is null;

Good Luck
Jim

Rudi Starcevic wrote:
> My question is: does each INSERT rebuild the index on the 'street' and
> 'suburb' fields? I believe it does, but I'm asking to be sure.
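Spelled out end to end, Jim's approach might look like the following minimal sketch. The table and column names (staging, street, suburb, owner) and the file path are assumptions, and COPY's CSV mode requires a PostgreSQL version that supports it:

    BEGIN;

    -- Staging table shaped like the target (columns are assumptions).
    CREATE TEMP TABLE staging (street text, suburb text, owner text);

    -- Bulk load the whole CD in one COPY; far faster than row-by-row INSERTs.
    COPY staging FROM '/path/to/dump.csv' WITH CSV;  -- hypothetical path

    -- Index the staging table like the original so the anti-join is cheap.
    CREATE INDEX staging_addr_idx ON staging (street, suburb);

    -- Merge: insert only addresses not already present in 'property';
    -- DISTINCT drops exact duplicates within the CD itself.
    INSERT INTO property (street, suburb, owner)
    SELECT DISTINCT s.street, s.suburb, s.owner
    FROM staging s
    LEFT JOIN property p ON p.street = s.street AND p.suburb = s.suburb
    WHERE p.street IS NULL;

    COMMIT;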
Usually any bulk load is faster with indexes dropped and then rebuilt ... failing that (like you really need the indexes while loading, say into a "hot" table), be sure to wrap all the SQL into one transaction (BEGIN; ... COMMIT;) ... if any data fails it all fails, which is usually easier to deal with than partial data loads, and it is *much* faster than having each insert be its own transaction.

HTH,

Greg Williamson
DBA
GlobeXplorer LLC

-----Original Message-----
From: Rudi Starcevic [mailto:tech@wildcash.com]
Sent: Tue 8/10/2004 4:04 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Bulk Insert and Index use
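Concretely, the single-transaction pattern Greg describes is just this (column names and values are illustrative assumptions):

    BEGIN;
    -- All 300,000 inserts ride in one transaction: one commit at the end,
    -- and a clean all-or-nothing failure mode if any row is bad.
    INSERT INTO property (street, suburb, owner) VALUES ('1 Main St', 'Carlton', 'J. Smith');
    INSERT INTO property (street, suburb, owner) VALUES ('2 Main St', 'Carlton', 'A. Jones');
    -- ... remaining rows ...
    COMMIT;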
Hi Jim,

Thanks for your time.

> If the bulk load has the possibility of duplicating data

Yes, each row will require either:

a) One SELECT + One INSERT
or
b) One SELECT + One UPDATE

I did think of using more than one table, i.e. a temp table.

As each month's worth of data is added I expect to see a change from lots of INSERTs to lots of UPDATEs. Perhaps when the UPDATEs become more dominant it would be best to start using indexes; while INSERTs are more prevalent perhaps a seq. scan is better.

I guess of all the options available it boils down to which is quicker for my data: index or sequential scan.

Many thanks.

Jim J wrote:
> Try bulk loading into a temp table, index it like the original,
> eliminate the dups and merge the tables.

--
Regards,
Rudi.

Internet Media Productions
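For the later months where UPDATEs dominate, the same staging-table idea extends to an update-then-insert merge. A sketch, reusing the hypothetical staging table and column names from above (PostgreSQL of this era has no built-in upsert, so it takes two statements):

    BEGIN;

    -- First refresh owners for addresses that already exist.
    UPDATE property p
    SET owner = s.owner
    FROM staging s
    WHERE p.street = s.street AND p.suburb = s.suburb;

    -- Then add the addresses that are genuinely new.
    INSERT INTO property (street, suburb, owner)
    SELECT s.street, s.suburb, s.owner
    FROM staging s
    LEFT JOIN property p ON p.street = s.street AND p.suburb = s.suburb
    WHERE p.street IS NULL;

    COMMIT;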
In an attempt to throw the authorities off his trail, tech@wildcash.com (Rudi Starcevic) transmitted:
> A minute for your thoughts and/or suggestions would be great.

Could you give a more concrete example? E.g. - the DDL for the table(s), most particularly.

At first guess, I think you're worrying about a nonissue. Each insert will lead to a _modification_ of the various indices, which costs _something_, but which is WAY less expensive than creating each index from scratch.

But perhaps I'm misreading things; DDL for the intended tables and indices would be real handy.
--
output = ("cbbrowne" "@" "cbbrowne.com")
http://www.ntlug.org/~cbbrowne/linux.html
Rules of the Evil Overlord #21. "I will hire a talented fashion designer to create original uniforms for my Legions of Terror, as opposed to some cheap knock-offs that make them look like Nazi stormtroopers, Roman footsoldiers, or savage Mongol hordes. All were eventually defeated and I want my troops to have a more positive mind-set." <http://www.eviloverlord.com/>
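For readers following along: the real DDL never made it into the thread, but a hypothetical reconstruction of the schema under discussion would be something like:

    -- Hypothetical schema; the real table surely has more fields.
    CREATE TABLE property (
        id     serial PRIMARY KEY,
        street text NOT NULL,
        suburb text NOT NULL,
        owner  text
    );

    -- One composite index serves the existence check on both fields.
    -- Each INSERT adds one entry to this B-tree; it does not rebuild it.
    CREATE INDEX property_addr_idx ON property (street, suburb);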
If it has to read a majority (or even a good percentage) of the rows in question, a sequential scan is probably faster ... and as Jim pointed out, a temp table can often be a useful medium for getting speed in a load and then allowing you to clean/alter data for a final (easy) push.

G

-----Original Message-----
From: Rudi Starcevic [mailto:tech@wildcash.com]
Sent: Tue 8/10/2004 8:33 PM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Bulk Insert and Index use
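Rather than guessing which plan wins, the two access paths can be timed directly with EXPLAIN ANALYZE, which runs the query and reports actual times (the literal values below are placeholders):

    -- See which plan the optimizer picks, and how long it really takes.
    EXPLAIN ANALYZE
    SELECT 1 FROM property WHERE street = '1 Main St' AND suburb = 'Carlton';

    -- Temporarily forbid seq scans in this session to force the index path,
    -- then compare the two timings.
    SET enable_seqscan = off;
    EXPLAIN ANALYZE
    SELECT 1 FROM property WHERE street = '1 Main St' AND suburb = 'Carlton';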
Hi,

> In an attempt to throw the authorities off his trail, tech@wildcash.com
> (Rudi Starcevic) transmitted:
>> A minute for your thoughts and/or suggestions would be great.

Heh heh ....

> Could you give a more concrete example? E.g. - the DDL for the
> table(s), most particularly.

Thanks. I didn't add the DDL as I thought it might make my question too long. I have the DDL at another office, so I'll pick up this email thread when I get there in a couple of hours.

> At first guess, I think you're worrying about a nonissue. Each insert
> will lead to a _modification_ of the various indices, which costs
> _something_, but which is WAY less expensive than creating each index
> from scratch.

Very interesting: modification versus creation. I will post another email later today.

Many thanks.

--
Regards,
Rudi.

Internet Media Productions