Re: [Bizgres-general] WAL bypass for INSERT, UPDATE and - Mailing list pgsql-hackers
From: Simon Riggs
Subject: Re: [Bizgres-general] WAL bypass for INSERT, UPDATE and
Date:
Msg-id: 1135433897.2964.650.camel@localhost.localdomain
In response to: Re: [Bizgres-general] WAL bypass for INSERT, UPDATE and (Greg Stark <gsstark@mit.edu>)
List: pgsql-hackers
On Fri, 2005-12-23 at 22:41 -0500, Greg Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> > Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> > > I revised the idea with MINIMAL XLOG (instead of WITHOUT XLOG) like the
> > > below. I think in this way, we can always gaurantee its correctness and
> > > can always improve it.
> >
> > I think the entire idea is a waste of time anyway. If we have the COPY
> > case covered then we have done the only case that has any practical use
> > AFAICS --- you don't use UPDATE or DELETE for bulk loads, and you don't
> > use retail INSERTs either if you are striving for speed.
>
> Well the "industry standard" approach for loading data is called "ETL". I
> forget what the E stands for, but the idea is to first load the data into a
> table exactly as it appears in the input data.
>
> Then the "T" stands for "transform". This could include just normalizing the
> input data into the correct format but it could also include looking up
> foreign key values for input strings, checking for duplicate records, etc.
> This would be a series of UPDATEs and DELETEs.
>
> I guess the "L" stands for load, where you insert it into the real tables.

Just for completeness: the whole data load area is typically known as ETL now, but there are two main approaches: ETL and ELT.

ETL stands for:
  Extract - take the data from the source system
  Transform - the process of mapping from source to target data model
  Load - load the transformed data into the target system

In this mode, the only SQL used is COPY, during the "L" phase. "ETL" is the "industry standard" as set by vendors who sell tools that work that way: Informatica, Ascential etc.
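A sketch of that pure-ETL "L" phase may make the contrast concrete. All transformation happens outside the database, so the database-side work reduces to one bulk COPY of already-clean data (the table, columns and file path here are invented for illustration):

```sql
-- ETL style: the data was transformed externally, so the only SQL
-- is a single bulk COPY into the target table.
-- Table and file names are hypothetical.
COPY sales_fact (sale_date, store_id, product_id, amount)
FROM '/data/out/sales_fact.csv' WITH CSV;
```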
Another, equally viable route is similar to your description, but is known as ELT, which is supported by vendors such as Oracle (proprietary-only) and Sunopsis (open):

ELT stands for:
  Extract - take the data from the source system
  Load - load the data into the DW staging area in same form as source
  Transform - transform the data using SQL commands

In the latter ELT approach, you use COPY during the "L" phase, but then you definitely do use INSERT SELECT and sometimes UPDATE commands, rarely DELETEs, during the "T" phase. The ELT approach is the main basis for a *large* number of very large Oracle, Teradata and DB2 data warehouses that have custom-built load procedures. Check out this link for a high-level explanation,
http://www.sunopsis.com/corporate/us/products/sunopsis/snps_dc.htm
but noting that Sunopsis did not invent this approach, or the name ELT.

> Up until the point where you load it into the real tables you don't really
> care about the transactional integrity of the data. If it's lost you can
> always just restart the process.
>
> In many cases you could use temporary tables, but sometimes you might want
> multiple processes or multiple transactions to be able to see the data.
> Consider for example a loading process that includes a human operator
> reviewing the data before the data is approved to be loaded into the final
> tables.

Agreed.

> But I don't see turning on and off the WAL on a per-transaction basis to be
> useful. Every transaction in the system is affected by the WAL status of every
> other transaction working with the same tables. It doesn't serve any purpose
> to have one transaction bypassing the WAL while everyone else does WAL logging
> for the same table; they're all going to lose if the system crashes.

Yes, I really don't like that way.

> It seems to me the only rational way to approach this is to have a per-table
> flag that sets that table to be non-logged. Essentially changing a table's
> behaviour to that of a temporary table except that other transactions can see
> it. If the system crashes the table is truncated on system restore.

Often, you need to speed up the load into your largest tables. The data you are loading often comes from a file that can be backed up, so if the database crashes during the load, you can always restart the load. But that doesn't mean you want to lose the data that is already there if you crash. That's exactly what COPY LOCK has been designed to do. It is analogous to Oracle's sql*loader direct path nologging mode.

Best Regards,
Simon Riggs