Re: [GENERAL] 7.4Beta - Mailing list pgsql-hackers
From | Dann Corbit |
---|---|
Subject | Re: [GENERAL] 7.4Beta |
Date | |
Msg-id | D90A5A6C612A39408103E6ECDD77B829408B8C@voyager.corporate.connx.com |
Responses | Re: [GENERAL] 7.4Beta |
List | pgsql-hackers |
> -----Original Message-----
> From: Andreas Pflug [mailto:pgadmin@pse-consulting.de]
> Sent: Friday, August 15, 2003 11:36 AM
> To: Dann Corbit
> Cc: Stephan Szabo; PostgreSQL-development
> Subject: Re: [HACKERS] [GENERAL] 7.4Beta
>
> Dann Corbit wrote:
>
> >Simplification of bulk operations can be very important for customers
> >(on the other hand). For the CONNX tool set, we offer an escape on
> >INSERT/SELECT that performs the operation in bulk mode.
> >
> >There are serious downsides to bulk operations as well (such as not
> >being logged and therefore existing outside of a transaction).
> >Therefore, they are really only useful in cases where a complete
> >destruction and repopulation is called for. If anything goes haywire,
> >you can't simply roll it back.
> >
> >Yet the speed savings can be enormous (orders of magnitude).
> >
> >Compared to iteration over a set of prepared inserts, a bulk insert
> >(such as using Microsoft's BCP API or Oracle's Direct Path loading)
> >can be 100 times faster. If you are moving gigabytes of data and
> >performing a complete refresh, the method to use becomes obvious.
> >
> >When we go outside of the language bounds, a curly-braced escape
> >notation is used. For instance, an insert/select might look like this:
> >INSERT INTO <destination_table> SELECT <column_list> FROM
> ><source_table> {fn commitcount 1000} {bulkmode}
> >The commit count says to use batches of 1000 rows, and bulkmode says
> >to use the fastest possible insert method.
> >
> >Imagine (for instance) that PostgreSQL can use an ODBC {or similar}
> >data source as a table. Then, with a syntax such as the above (or
> >SELECT INTO etc.), you could very rapidly move data from one system
> >into another.
>
> When saying "bulk operation" I don't necessarily mean using bulk load
> or stuff like that. What I mean is handling large amounts of similar
> data at the same time. That doesn't say anything about transactions or
> logging problems.
> Imagine you have 100k rows to load, each having FKs to (hopefully)
> existing rows:
> - Step 1: load the rows into a temp table
> - Step 2: identify duplicates, logging and deleting them
> - Step 3: insert all data satisfying the FK constraints
> - Step 4: log all that didn't insert
>
> These are relatively few statements (not the simplest ones), which can
> be handled in a safe manner.
> Just an example of how an RDBMS can do the job in a mass-oriented (and
> thus optimizable) way.

I really, really like SQL Server's solution to the problem you have
stated above. A typical case: you have a pile of incoming words and you
want to build a dictionary. An example might be a library database,
where you have an abstract for each item; you parse the words and look
for new ones to add.

What SQL Server allows is an index with an option called "IGNORE
DUPLICATES" that simply tosses out rows that are already in the table.
For applications like the one I have described and the one you have
described, it is an incredibly useful extension. Once I got used to it,
I found myself using it all the time.

Of course, you must be very careful to ensure that the duplicates
really are completely unimportant.
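
[Editor's sketch] Purely as illustration, here is a minimal version of the
four-step staged load Andreas describes, written as plain SQL. All names are
hypothetical and not from the thread: "items" is the target table keyed on
item_id with an FK from category_id to "categories", "items_stage" is the
staging table, and "load_errors" is an error log.

```sql
BEGIN;

-- Step 1: load the incoming rows into a temporary staging table.
CREATE TEMP TABLE items_stage (
    item_id     integer,
    category_id integer,
    payload     text
);
-- COPY items_stage FROM '/path/to/incoming.dat';  -- or batched INSERTs

-- Step 2: log rows that duplicate existing keys, then drop them from staging.
INSERT INTO load_errors (item_id, reason)
SELECT s.item_id, 'duplicate'
FROM items_stage s
WHERE EXISTS (SELECT 1 FROM items i WHERE i.item_id = s.item_id);

DELETE FROM items_stage
WHERE EXISTS (SELECT 1 FROM items i WHERE i.item_id = items_stage.item_id);

-- Step 3: insert only the rows whose FK target actually exists.
INSERT INTO items (item_id, category_id, payload)
SELECT s.item_id, s.category_id, s.payload
FROM items_stage s
WHERE EXISTS (SELECT 1 FROM categories c WHERE c.category_id = s.category_id);

-- Step 4: log everything that could not be inserted because of the FK.
INSERT INTO load_errors (item_id, reason)
SELECT s.item_id, 'missing FK target'
FROM items_stage s
WHERE NOT EXISTS (SELECT 1 FROM categories c WHERE c.category_id = s.category_id);

COMMIT;
```

Because the whole sequence runs inside one transaction, a failure at any step
rolls back cleanly, which is the "safe manner" Andreas is pointing at.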
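
[Editor's sketch] For comparison, a hedged sketch of the SQL Server behaviour
Dann mentions: a unique index created with the IGNORE_DUP_KEY option discards
rows that would duplicate an existing key instead of failing the statement.
Table and index names below are invented for the example.

```sql
-- SQL Server (T-SQL); hypothetical names throughout.
CREATE TABLE dictionary (word varchar(100) NOT NULL);

-- IGNORE_DUP_KEY turns duplicate keys into warnings: offending rows are
-- silently discarded rather than aborting the INSERT.
CREATE UNIQUE INDEX ix_dictionary_word
    ON dictionary (word)
    WITH IGNORE_DUP_KEY;

-- Bulk-add newly parsed words; anything already in the dictionary is dropped.
INSERT INTO dictionary (word)
SELECT DISTINCT word
FROM parsed_abstract_words;
```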