Thread: Bulk Insert and Index use

Bulk Insert and Index use

From: Rudi Starcevic
Hi,

I have a question on bulk checking, inserting into a table and
how best to use an index for performance.

The data I have to work with is a monthly CD-ROM CSV data dump of
300,000 property owners from one area/shire.

So every CD has 300,000-odd lines, each of which becomes a row in the
'property' table.

Beginning with the first CD, each line should require one SELECT and
one INSERT, as it will be the first property with this address.

The SELECT uses fields like 'street' and 'suburb' to check for an
existing property, so I have built an index on those fields.

My question is: does each INSERT rebuild the index on the 'street' and
'suburb' fields?
I believe it does, but I'm asking to be sure.

If this is the case I guess performance will suffer when I have, say,
200,000 rows in the table.

Would it be like:

a) Use index to search on 'street' and 'suburb'
b) No result? Insert new record
c) Rebuild index on 'street' and 'suburb'

for each row?
Would this mean that after 200,000 rows each INSERT would require an
index spanning hundreds of thousands of rows to be rebuilt?
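
(Concretely, the per-row pattern I mean is the following; the column
names are just placeholders for my real ones:)

-- look for an existing property at this address
SELECT id FROM property WHERE street = '...' AND suburb = '...';
-- and only if that returns no row:
INSERT INTO property (street, suburb, owner) VALUES ('...', '...', '...');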

So far I believe my only options are to use either an index or a
sequential scan and see which is faster.

A minute for your thoughts and/or suggestions would be great.

Thanks.
Regards,
Rudi.


Re: Bulk Insert and Index use

From: Jim J
If the bulk load has the possibility of duplicating data, then you need
to change methods. Try bulk loading into a temp table, index it like
the original, eliminate the dups and merge the tables.

It is also possible to do an insert from the temp table into the final
table like (the column list here is a placeholder for your own):

INSERT INTO original (street, suburb, ...)
SELECT t.street, t.suburb, ... FROM temp t
LEFT JOIN original o ON t.street = o.street
WHERE o.street IS NULL;
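
A minimal sketch of that whole flow, assuming a hypothetical 'property'
table with 'street', 'suburb' and 'owner' columns and a CSV file at
'/tmp/owners.csv' (adjust all names to the real schema):

-- 1. stage the CSV into a temp table shaped like the target
CREATE TEMP TABLE property_stage (LIKE property);
COPY property_stage FROM '/tmp/owners.csv' WITH CSV;

-- 2. index the staging table like the original
CREATE INDEX property_stage_addr ON property_stage (street, suburb);

-- 3. merge, skipping rows that already exist
--    (DISTINCT also drops exact duplicates within the file)
INSERT INTO property (street, suburb, owner)
SELECT DISTINCT s.street, s.suburb, s.owner
FROM property_stage s
LEFT JOIN property p ON s.street = p.street AND s.suburb = p.suburb
WHERE p.street IS NULL;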

Good Luck
Jim


Re: Bulk Insert and Index use

From: "Gregory S. Williamson"
Usually any bulk load is faster with indexes dropped and then rebuilt ... failing that (like you really need the
indexes while loading, say into a "hot" table), be sure to wrap all the SQL into one transaction (BEGIN; ... COMMIT;)
... if any data fails it all fails, which is usually easier to deal with than partial data loads, and it is *much*
faster than having each insert be its own transaction.
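
For example (table and values are made up here), the difference is just:

BEGIN;
INSERT INTO property (street, suburb) VALUES ('1 Main St', 'Carlton');
INSERT INTO property (street, suburb) VALUES ('2 Main St', 'Carlton');
-- ... 300,000-odd more rows ...
COMMIT;

One COMMIT flushes the whole batch at once, instead of paying for a
synchronous commit on every single row.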

HTH,

Greg Williamson
DBA
GlobeXplorer LLC



Re: Bulk Insert and Index use

From: Rudi Starcevic
Hi Jim,

Thanks for your time.

> If the bulk load has the possibility of duplicating data

Yes, each row will require either:

a) One SELECT + One INSERT
or
b) One SELECT + One UPDATE

I did think of using more than one table, i.e. a temp table.
As each month's worth of data is added I expect to see
a change from lots of INSERTs to lots of UPDATEs.

Perhaps when the UPDATEs become more dominant it would
be best to start using indexes.

While INSERTs are more prevalent perhaps a seq. scan is better.

I guess, of all the options available, it boils down to which
is quicker for my data: an index or a sequential scan.
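
One way I can measure that (query values made up) is to time the lookup
under both plans with EXPLAIN ANALYZE, using the planner toggles to
force each plan in turn:

EXPLAIN ANALYZE SELECT * FROM property
WHERE street = '1 Main St' AND suburb = 'Carlton';

-- force the other plan and compare the actual times
SET enable_seqscan = off;   -- or: SET enable_indexscan = off;
EXPLAIN ANALYZE SELECT * FROM property
WHERE street = '1 Main St' AND suburb = 'Carlton';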

Many thanks.



--


Regards,
Rudi.

Internet Media Productions

Re: Bulk Insert and Index use

From: Christopher Browne
In an attempt to throw the authorities off his trail, tech@wildcash.com (Rudi Starcevic) transmitted:
> A minute for your thoughts and/or suggestions would be great.

Could you give a more concrete example?  E.g. - the DDL for the
table(s), most particularly.

At first guess, I think you're worrying about a nonissue.  Each insert
will lead to a _modification_ of the various indices, which costs
_something_, but which is WAY less expensive than creating each index
from scratch.

But perhaps I'm misreading things; DDL for the intended tables and
indices would be real handy.
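
(For anyone reading along, the kind of DDL meant is something like the
following; this is a made-up guess, not Rudi's actual schema:)

CREATE TABLE property (
    id     serial PRIMARY KEY,
    street text NOT NULL,
    suburb text NOT NULL,
    owner  text
);
CREATE INDEX property_addr_idx ON property (street, suburb);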
--
output = ("cbbrowne" "@" "cbbrowne.com")
http://www.ntlug.org/~cbbrowne/linux.html
Rules  of the  Evil  Overlord #21.  "I  will hire  a talented  fashion
designer  to create  original uniforms  for my  Legions of  Terror, as
opposed  to  some cheap  knock-offs  that  make  them look  like  Nazi
stormtroopers, Roman  footsoldiers, or savage Mongol  hordes. All were
eventually  defeated and  I want  my troops  to have  a  more positive
mind-set." <http://www.eviloverlord.com/>

Re: Bulk Insert and Index use

From: "Gregory S. Williamson"
If it has to read a majority (or even a good percentage) of the rows in question, a sequential scan is probably faster
... and as Jim pointed out, a temp table can often be a useful medium for getting speed in a load and then allowing
you to clean/alter data for a final (easy) push.
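
As a sketch of that "clean then push" step (staging table and columns as
in Jim's example; the clean-up rules are made up):

-- tidy the staged data in place before the final merge
UPDATE property_stage SET suburb = initcap(trim(suburb));
DELETE FROM property_stage WHERE street IS NULL OR street = '';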

G




Re: Bulk Insert and Index use

From: Rudi Starcevic
Hi,

> In an attempt to throw the authorities off his trail, tech@wildcash.com (Rudi Starcevic) transmitted:
>> A minute for your thoughts and/or suggestions would be great.

Heh heh ....

> Could you give a more concrete example?  E.g. - the DDL for the
> table(s), most particularly.

Thanks, I didn't add the DDL as I thought it may make my question too
long. I have the DDL at another office so I'll pick up this email
thread when I get there in a couple of hours.

> At first guess, I think you're worrying about a nonissue.  Each insert
> will lead to a _modification_ of the various indices, which costs
> _something_, but which is WAY less expensive than creating each index
> from scratch.

Very interesting, modification rather than creation from scratch.
I will post another email later today.

Many thanks.

--

Regards,
Rudi.

Internet Media Productions