Re: [Bizgres-general] WAL bypass for INSERT, UPDATE and - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: [Bizgres-general] WAL bypass for INSERT, UPDATE and
Date
Msg-id 1135433897.2964.650.camel@localhost.localdomain
In response to Re: [Bizgres-general] WAL bypass for INSERT, UPDATE and  (Greg Stark <gsstark@mit.edu>)
List pgsql-hackers
On Fri, 2005-12-23 at 22:41 -0500, Greg Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> 
> > Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> > > I revised the idea with MINIMAL XLOG (instead of WITHOUT XLOG) like the
> > > below. I think in this way, we can always guarantee its correctness and
> > > can always improve it.
> > 
> > I think the entire idea is a waste of time anyway.  If we have the COPY
> > case covered then we have done the only case that has any practical use
> > AFAICS --- you don't use UPDATE or DELETE for bulk loads, and you don't
> > use retail INSERTs either if you are striving for speed.
> 
> Well the "industry standard" approach for loading data is called "ETL". I
> forget what the E stands for, but the idea is to first load the data into a
> table exactly as it appears in the input data.
> 
> Then the "T" stands for "transform". This could include just normalizing the
> input data into the correct format but it could also include looking up
> foreign key values for input strings, checking for duplicate records, etc.
> This would be a series of UPDATEs and DELETEs.
> 
> I guess the "L" stands for load, where you insert it into the real tables.

Just for completeness: the whole data load area is typically known as
ETL now, but there are two main approaches: ETL and ELT.

ETL stands for:
Extract - take the data from the source system
Transform - the process of mapping from source to target data model
Load - load the transformed data into the target system

In this mode, the only SQL used is COPY, during the "L" phase.
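For example, the "L" phase in this mode is just a bulk load of already-transformed data (table and file names here are illustrative):

```sql
-- ETL: the data arrives already in the target model,
-- so the only SQL involved is the load itself
COPY sales_fact FROM '/staging/sales_fact.dat' WITH DELIMITER '|';
```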

"ETL" is the "industry standard" as set by vendors who sell tools that
work that way: Informatica, Ascential etc. Another, equally viable route
is similar to your description, but is known as ELT, which is supported
by vendors such as Oracle (proprietary-only) and Sunopsis (open):

Extract - take the data from the source system
Load - load the data into the DW staging area in same form as source
Transform - transform the data using SQL commands

In the latter ELT approach, you use COPY during the "L" phase, but then
you definitely do use INSERT ... SELECT and sometimes UPDATE commands,
and rarely DELETEs, during the "T" phase.
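A minimal sketch of those two ELT phases (table, column, and file names
here are purely illustrative):

```sql
-- L: bulk-load the source extract unchanged into a staging table
COPY stg_sales FROM '/staging/sales_extract.dat' WITH DELIMITER '|';

-- T: map staging rows onto the target model, resolving surrogate keys
INSERT INTO sales_fact (cust_id, prod_id, sale_date, amount)
SELECT c.cust_id, p.prod_id, s.sale_date, s.amount
FROM stg_sales s
JOIN customer_dim c ON c.cust_code = s.cust_code
JOIN product_dim  p ON p.prod_code = s.prod_code;

-- occasional corrections also happen during T
UPDATE sales_fact SET amount = 0 WHERE amount IS NULL;
```

The point is that everything after the initial COPY is ordinary SQL run
inside the database, which is why the "T" workload matters for bulk-load
performance.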

The ELT approach is the main basis for a *large* number of very large
Oracle, Teradata and DB2 data warehouses that have custom-built load
procedures. Check out this link for a high level explanation,
http://www.sunopsis.com/corporate/us/products/sunopsis/snps_dc.htm
but noting that Sunopsis did not invent this approach, or the name ELT.

> Up until the point where you load it into the real tables you don't really
> care about the transactional integrity of the data. If it's lost you can
> always just restart the process.

> In many cases you could use temporary tables, but sometimes you might want
> multiple processes or multiple transactions to be able to see the data.
> Consider for example a loading process that includes a human operator
> reviewing the data before the data is approved to be loaded into the final
> tables.

Agreed

> But I don't see turning on and off the WAL on a per-transaction basis to be
> useful. Every transaction in the system is affected by the WAL status of every
> other transaction working with the same tables. It doesn't serve any purpose
> to have one transaction bypassing the WAL while everyone else does WAL logging
> for the same table; they're all going to lose if the system crashes.

Yes, I really don't like that approach.

> It seems to me the only rational way to approach this is to have a per-table
> flag that sets that table to be non-logged. Essentially changing a table's
> behaviour to that of a temporary table except that other transactions can see
> it. If the system crashes the table is truncated on system restore.

Often, you need to speed up the load into your largest tables. The data
you are loading often comes from a file that you have backed up, so if
the database crashes during the load, you can always restart the load.
But that doesn't mean you want to lose the data that is already there if
you crash.

That's exactly what COPY LOCK has been designed to do. It is analogous
to Oracle's SQL*Loader direct path NOLOGGING mode.

Best Regards, Simon Riggs


