Re: [GENERAL] 7.4Beta - Mailing list pgsql-hackers

From Dann Corbit
Subject Re: [GENERAL] 7.4Beta
Date
Msg-id D90A5A6C612A39408103E6ECDD77B829408B8C@voyager.corporate.connx.com
Whole thread Raw
Responses Re: [GENERAL] 7.4Beta
List pgsql-hackers
> -----Original Message-----
> From: Andreas Pflug [mailto:pgadmin@pse-consulting.de]
> Sent: Friday, August 15, 2003 11:36 AM
> To: Dann Corbit
> Cc: Stephan Szabo; PostgreSQL-development
> Subject: Re: [HACKERS] [GENERAL] 7.4Beta
>
> Dann Corbit wrote:
>
> >Simplification of bulk operations can be very important for
> customers
> >(on the other hand).  For the CONNX tool set, we offer an escape on
> >INSERT/SELECT that performs the operation in bulk mode.
> >
> >There are serious downsides to bulk operations also (such as
> not being
> >logged and therefore existing outside of a transaction).  Therefore,
> >they are useful really only in cases where a complete
> destruction and
> >repopulation is called for.  If anything goes haywire, you
> can't simply
> >roll it back.
> >
> >Yet the speed savings can be enormous (orders of magnitude).
> >
> >Compared to iteration over a set of prepared inserts, a bulk insert
> >(such as using Microsoft's BCP API or Oracles Direct Path
> loading) can
> >be 100 times faster.  If you are moving gigabytes of data and
> >performing a complete refresh, the method to use becomes obvious.
> >
> >When we go outside of the language bounds, a curly braced escape
> >notation is used.  For instance, an insert/select might look
> like this:
> >INSERT INTO <destination_table> SELECT <column_list> FROM
> ><source_table> {fn commitcount 1000} {bulkmode} The commit
> count says
> >to use batches of 1000 rows and bulkmode says to use the fastest
> >possible insert method.
> >
> >Imagine (for instance) that PostgreSQL can use an ODBC {or similar}
> >data source as a table.  Then, with a syntax such as the above (or
> >SELECT INTO etc.), you could very rapidly move data from one system
> >into another.
> >
> >
> When saying "bulk operation" I don't necessarily mean using
> bulk load or
> stuff like that. What I mean is handling large amounts of
> similar data
> at the same time. That doesn't say anything about transactions or
> logging problems.
> Imagine you have 100k or rows to load, each having FKs to (hopefully)
> existing rows
> - Step 1: load the rows into a temp table
> - Step 2: identify duplicates, logging  and deleting them
> - Step 3: insert all data satisfying the FK constraints.
> - Step 4: log all that didn't insert.
>
> This are relatively few statements (not the simplest), which can be
> handled in a safe manner.
> Just an example, how a RDBMS can do the job in a mass
> oriented (and thus
> optimizable) way.

I really, really like SQL*Server's solution to the problem you have
stated above.  A typical example is you have a pile of incoming words
and you want to create a dictionary.  An example might be a library
database, where you have an abstract for each item.  You parse the words
and look for new ones to add.

What SQL*Server allows is an index with an option called "IGNORE
DUPLICATES" that simply tosses out rows that are already in the table.
For applications like what I have described and what you have described
it is an incredibly useful extension.  Once I got used to it, I found
myself using it all the time.  Of course, you must be very careful to
ensure that the duplicates really are completely unimportant.


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [GENERAL] 7.4Beta
Next
From: Andreas Pflug
Date:
Subject: Re: [GENERAL] 7.4Beta