Re: NOLOGGING option, or ? - Mailing list pgsql-hackers

From:           Steve Atkins
Subject:        Re: NOLOGGING option, or ?
Date:
Msg-id:         20050602150550.GA17081@gp.word-to-the-wise.com
In response to: Re: NOLOGGING option, or ? ("Luke Lonergan" <llonergan@greenplum.com>)
Responses:      Re: NOLOGGING option, or ?
List:           pgsql-hackers
On Thu, Jun 02, 2005 at 07:33:13AM -0700, Luke Lonergan wrote:
> Oliver,
>
> > Haven't you just replaced one preprocessing step with another, then?
>
> Generally not. The most common problem with the current choice of escape
> character is that there are *lots* of data load scenarios with backslash in
> the text strings.

I can only think of one where it's common: Windows filenames. But if
you're going to support arbitrary data in a load, then whatever escape
character you choose will appear sometimes.

> The extra preprocessing to escape them is unnecessary on
> other databases and, in effect, causes the load to be even slower because
> you have to prepare the data ahead of time.

> Also, note that this patch can also do escape processing and the net result
> will still be 5+ times faster than what is there.

I strongly suspect that a patch to improve performance without changing
behaviour would be accepted with no questions asked. One that allowed
specifying the field and record delimiters, the escape character, and
the null symbol might require, at the very least, more discussion about
an appropriate syntax. So you may want to separate the two.

> In the data warehousing industry, data conversion and manipulation is
> normally kept distinct from data loading. Conversion is done by tools
> called ETL (Extract Transform Load) and the database will have a very fast
> path for direct loading of the resulting data. PostgreSQL is definitely a
> strange database right now in that there is a default filter applied to the
> data on load.

> It's even more strange because the load path is so slow, and now that we've
> found that the slowness is there mostly because of non-optimized parsing and
> attribute conversion routines. The question of how to do escape processing
> is a separate one, but is wrapped up in the question of whether to introduce
> a new loading routine or whether to optimize the old one.

There are already two loader routines. One of them is text-based and is
designed so that the load format can easily be generated with simple
text-manipulation tools, using delimiters to separate fields. It also
allows (unlike your suggestion) for loading of arbitrary data from a
text file. Because it allows arbitrary data and uses delimiters to
separate fields, it has to use an escaping mechanism.

If you want to be able to load arbitrary data without having to handle
escape characters, there are two obvious ways to do it.

The first is the one used by MIME, and the one you suggest: use a
separator that you believe will not appear in the data. That can be
done by using a long multicharacter separator of random characters and
assuming that sequence won't appear; by parsing the input data twice,
looking for a string that doesn't appear, for use as the delimiter; or
by taking advantage of knowledge about which characters can and cannot
appear in the input data. (I can't imagine any case involving
data-mining of web logs where the last is likely to be relevant.)

The other is to use a length+data format for each tuple, avoiding all
issues of escapes in the data and allowing arbitrary data to be
represented. That's how the binary load format PG supports works, I
believe. If you're really concerned about speed of load, that may be a
better format for your front-end to generate, perhaps?

Cheers,
  Steve
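
A minimal Python sketch of the per-field preprocessing that the text
format forces on loaders (the function name copy_escape is made up for
the example; the rules shown are the ones PostgreSQL's default
tab-delimited text COPY format requires):

    def copy_escape(field):
        """Escape one value for PostgreSQL's default (tab-delimited)
        text COPY format: each backslash doubles, and tab, newline and
        carriage return become backslash sequences."""
        return (field.replace("\\", "\\\\")
                     .replace("\t", "\\t")
                     .replace("\n", "\\n")
                     .replace("\r", "\\r"))

    # A Windows filename gains a pass over every byte just to load it.
    print(copy_escape("C:\\logs\\access.log"))  # prints C:\\logs\\access.log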
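
And a rough sketch, again in Python, of generating the length-prefixed
binary COPY format mentioned above. The table layout (one int4 column,
one text column), the file path, and the function name are all
assumptions for illustration:

    import struct

    def write_binary_copy(path, rows):
        """Write rows of (int, text) tuples in PostgreSQL's binary COPY
        format: fixed signature, then length-prefixed fields, then a
        trailer. All integers are in network byte order."""
        with open(path, "wb") as f:
            # 11-byte signature, 32-bit flags, 32-bit extension length
            f.write(b"PGCOPY\n\xff\r\n\x00")
            f.write(struct.pack("!ii", 0, 0))
            for num, text in rows:
                f.write(struct.pack("!h", 2))          # field count
                f.write(struct.pack("!i", 4))          # int4 field length
                f.write(struct.pack("!i", num))        # int4 value
                data = text.encode("utf-8")            # assumes UTF-8 server encoding
                f.write(struct.pack("!i", len(data)))  # text field length
                f.write(data)                          # raw bytes, no escaping
            f.write(struct.pack("!h", -1))             # end-of-data trailer

    # Backslashes and tabs pass through untouched; no escaping pass needed.
    write_binary_copy("/tmp/demo.bin",
                      [(1, "C:\\logs\\access.log"), (2, "tab\there")])

The payload bytes go through verbatim, so there is nothing to escape;
the file can then be loaded with something like
COPY demo FROM '/tmp/demo.bin' WITH BINARY.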