Thread: Re: [GENERAL] 7.4Beta

Re: [GENERAL] 7.4Beta

From

"Dann Corbit"

Date:

15 August 2003, 18:53:42

> -----Original Message-----
> From: Andreas Pflug [mailto:pgadmin@pse-consulting.de]
> Sent: Friday, August 15, 2003 11:36 AM
> To: Dann Corbit
> Cc: Stephan Szabo; PostgreSQL-development
> Subject: Re: [HACKERS] [GENERAL] 7.4Beta
>
> Dann Corbit wrote:
>
> >Simplification of bulk operations can be very important for
> customers
> >(on the other hand).  For the CONNX tool set, we offer an escape on
> >INSERT/SELECT that performs the operation in bulk mode.
> >
> >There are serious downsides to bulk operations also (such as
> not being
> >logged and therefore existing outside of a transaction).  Therefore,
> >they are useful really only in cases where a complete
> destruction and
> >repopulation is called for.  If anything goes haywire, you
> can't simply
> >roll it back.
> >
> >Yet the speed savings can be enormous (orders of magnitude).
> >
> >Compared to iteration over a set of prepared inserts, a bulk insert
> >(such as using Microsoft's BCP API or Oracles Direct Path
> loading) can
> >be 100 times faster.  If you are moving gigabytes of data and
> >performing a complete refresh, the method to use becomes obvious.
> >
> >When we go outside of the language bounds, a curly braced escape
> >notation is used.  For instance, an insert/select might look
> like this:
> >INSERT INTO <destination_table> SELECT <column_list> FROM
> ><source_table> {fn commitcount 1000} {bulkmode} The commit
> count says
> >to use batches of 1000 rows and bulkmode says to use the fastest
> >possible insert method.
> >
> >Imagine (for instance) that PostgreSQL can use an ODBC {or similar}
> >data source as a table.  Then, with a syntax such as the above (or
> >SELECT INTO etc.), you could very rapidly move data from one system
> >into another.
> >
> >
> When saying "bulk operation" I don't necessarily mean using
> bulk load or
> stuff like that. What I mean is handling large amounts of
> similar data
> at the same time. That doesn't say anything about transactions or
> logging problems.
> Imagine you have 100k or rows to load, each having FKs to (hopefully)
> existing rows
> - Step 1: load the rows into a temp table
> - Step 2: identify duplicates, logging  and deleting them
> - Step 3: insert all data satisfying the FK constraints.
> - Step 4: log all that didn't insert.
>
> This are relatively few statements (not the simplest), which can be
> handled in a safe manner.
> Just an example, how a RDBMS can do the job in a mass
> oriented (and thus
> optimizable) way.

I really, really like SQL*Server's solution to the problem you have
stated above.  A typical example is you have a pile of incoming words
and you want to create a dictionary.  An example might be a library
database, where you have an abstract for each item.  You parse the words
and look for new ones to add.

What SQL*Server allows is an index with an option called "IGNORE
DUPLICATES" that simply tosses out rows that are already in the table.
For applications like what I have described and what you have described
it is an incredibly useful extension.  Once I got used to it, I found
myself using it all the time.  Of course, you must be very careful to
ensure that the duplicates really are completely unimportant.

Re: [GENERAL] 7.4Beta

From

Andreas Pflug

Date:

15 August 2003, 19:13:52

Dann Corbit wrote:

>>-----Original Message-----
>>From: Andreas Pflug [mailto:pgadmin@pse-consulting.de] 
>>Sent: Friday, August 15, 2003 11:36 AM
>>To: Dann Corbit
>>Cc: Stephan Szabo; PostgreSQL-development
>>Subject: Re: [HACKERS] [GENERAL] 7.4Beta
>>
>>Dann Corbit wrote:
>>
>>    
>>
>>>Simplification of bulk operations can be very important for 
>>>      
>>>
>>customers 
>>    
>>
>>>(on the other hand).  For the CONNX tool set, we offer an escape on 
>>>INSERT/SELECT that performs the operation in bulk mode.
>>>
>>>There are serious downsides to bulk operations also (such as 
>>>      
>>>
>>not being 
>>    
>>
>>>logged and therefore existing outside of a transaction).  Therefore, 
>>>they are useful really only in cases where a complete 
>>>      
>>>
>>destruction and 
>>    
>>
>>>repopulation is called for.  If anything goes haywire, you 
>>>      
>>>
>>can't simply 
>>    
>>
>>>roll it back.
>>>
>>>Yet the speed savings can be enormous (orders of magnitude).
>>>
>>>Compared to iteration over a set of prepared inserts, a bulk insert 
>>>(such as using Microsoft's BCP API or Oracles Direct Path 
>>>      
>>>
>>loading) can 
>>    
>>
>>>be 100 times faster.  If you are moving gigabytes of data and 
>>>performing a complete refresh, the method to use becomes obvious.
>>>
>>>When we go outside of the language bounds, a curly braced escape 
>>>notation is used.  For instance, an insert/select might look 
>>>      
>>>
>>like this: 
>>    
>>
>>>INSERT INTO <destination_table> SELECT <column_list> FROM 
>>><source_table> {fn commitcount 1000} {bulkmode} The commit 
>>>      
>>>
>>count says 
>>    
>>
>>>to use batches of 1000 rows and bulkmode says to use the fastest 
>>>possible insert method.
>>>
>>>Imagine (for instance) that PostgreSQL can use an ODBC {or similar} 
>>>data source as a table.  Then, with a syntax such as the above (or 
>>>SELECT INTO etc.), you could very rapidly move data from one system 
>>>into another.
>>> 
>>>
>>>      
>>>
>>When saying "bulk operation" I don't necessarily mean using 
>>bulk load or 
>>stuff like that. What I mean is handling large amounts of 
>>similar data 
>>at the same time. That doesn't say anything about transactions or 
>>logging problems.
>>Imagine you have 100k or rows to load, each having FKs to (hopefully) 
>>existing rows
>>- Step 1: load the rows into a temp table
>>- Step 2: identify duplicates, logging  and deleting them
>>- Step 3: insert all data satisfying the FK constraints.
>>- Step 4: log all that didn't insert.
>>
>>This are relatively few statements (not the simplest), which can be 
>>handled in a safe manner.
>>Just an example, how a RDBMS can do the job in a mass 
>>oriented (and thus 
>>optimizable) way.
>>    
>>
>
>I really, really like SQL*Server's solution to the problem you have
>stated above.  A typical example is you have a pile of incoming words
>and you want to create a dictionary.  An example might be a library
>database, where you have an abstract for each item.  You parse the words
>and look for new ones to add.
>
>What SQL*Server allows is an index with an option called "IGNORE
>DUPLICATES" that simply tosses out rows that are already in the table.
>For applications like what I have described and what you have described
>it is an incredibly useful extension.  Once I got used to it, I found
>myself using it all the time.  Of course, you must be very careful to
>ensure that the duplicates really are completely unimportant.
>  
>

Hm, option is quite a quick hack. I like logging in the database much 
better than some sql messages. And it's only about duplicates, not about 
fk constraint violations.

Regards,
Andresa