Thread: COPY fast parse patch
Here is the patch I was talking about in my message to the "NOLOGGING option, or ?" thread. A few important points about it:

1) The patch includes 2 parallel parsing code paths. One is the regular COPY path that we all know, and the other is the improved one that I wrote. This is only temporary, as there is a lot of code duplication, but I left it as such for several reasons:

- The improved path currently supports only ASCII delimited text format, and only when client and server encodings are identical. If these conditions are not met when running COPY, the old path is taken. In other words, under no condition will a "not supported" error be raised.

- Having both code paths allows for easy performance comparison between the two. To run the regular COPY parsing, call CopyFrom() from DoCopy(); to run the improved parsing, call FastCopyFrom() from DoCopy(). Right now FastCopyFrom() is called automatically whenever the conditions explained in the previous point are met, but that is easy to change (it requires a recompile).

* NOTE: the Fast*() function names, ugly as they are, are there to emphasize the differences between the old and the improved path (i.e. CopyReadLine() vs. FastReadLine(), ...). They are not intended to stay this way. This is not elegant (yet)!

2) There are some utilities such as bytebuffer and strchrlen at the bottom of the file. This is probably not the right home for them, but for now, to simplify things, they are included in copy.c. (A rough sketch of the idea behind them appears after this message.)

3) EOL is assumed to be NL. I raised a point about EOLs in COPY in my previous thread, which explains this.

4) Performance numbers can be viewed at
http://lists.pgfoundry.org/pipermail/bizgres-general/2005-May/000135.html
Some numbers include:
8.7MB/sec -> 11.8MB/sec on a 15 column (mixed) table.
12.1MB/sec -> 21MB/sec on a 1 column (TEXT) table.

5) Data integrity and escaping improvements. The new path treats all characters as data (unless it is an escaped delimiter or EOL), and therefore data integrity is preserved. However, some people who are already used to the postgres COPY escaping rules may want to keep them; they can do so by still using the old COPY.

As a part of submitting this patch I also presented an argument for a LOAD DATA command (in the NOLOGGING option thread). The points I made there are closely related to this message. There may be a valid argument that most of the points I raised could be implemented in the COPY code instead of a LOAD DATA command, but that requires a great deal of flexibility in adding features to the COPY syntax. That may not be a good idea for some, and it will also be problematic for backwards compatibility.

Thx,
Alon.
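For readers who want a feel for what the bytebuffer/strchrlen approach is about, here is a minimal sketch of the idea. This is not the patch code itself: scan_to_char() and split_line() are hypothetical names, and the sketch assumes a single-byte delimiter, NL line ends, and no escape handling at all. The point it illustrates is jumping from delimiter to delimiter over whole spans with memchr() rather than fetching and examining one character at a time.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper in the spirit of strchrlen(): find the next
     * occurrence of 'c' in buf[0..len) and report how many bytes were
     * scanned.  Returns NULL if 'c' is not present. */
    static const char *
    scan_to_char(const char *buf, size_t len, char c, size_t *span)
    {
        const char *hit = memchr(buf, c, len);

        *span = hit ? (size_t) (hit - buf) : len;
        return hit;
    }

    /* Split one NL-free line into fields by jumping from delimiter to
     * delimiter, instead of looping over every character. */
    static int
    split_line(const char *line, size_t linelen, char delim,
               const char **starts, size_t *lens, int maxfields)
    {
        int    nfields = 0;
        size_t off = 0;

        while (off <= linelen && nfields < maxfields)
        {
            size_t      span;
            const char *hit = scan_to_char(line + off, linelen - off,
                                           delim, &span);

            starts[nfields] = line + off;
            lens[nfields] = span;
            nfields++;
            if (hit == NULL)
                break;          /* last field on the line */
            off += span + 1;    /* step past the delimiter */
        }
        return nfields;
    }

    int
    main(void)
    {
        const char *line = "10|red widget|4.25";
        const char *starts[8];
        size_t      lens[8];
        int         n = split_line(line, strlen(line), '|', starts, lens, 8);

        for (int i = 0; i < n; i++)
            printf("field %d: %.*s\n", i + 1, (int) lens[i], starts[i]);
        return 0;
    }

Real escape and quote handling obviously complicates this, but the inner scan can still move over multi-byte spans between "interesting" characters instead of branching on every byte.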
Alon Goldshuv wrote:
> 5) Data integrity and escaping improvements. The new path treats all
> characters as data (unless it is an escaped delimiter or EOL), and
> therefore data integrity is preserved. However, some people who are
> already used to the postgres COPY escaping rules may want to keep them;
> they can do so by still using the old COPY.

OK, if the COPY file contains:

    \n\||\r\\a|\N

and the delimiter is '|', what does that represent? What if there were a literal '\' followed by a '|' in the data table being dumped?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
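For context, under the stock COPY text-format rules (as documented for the existing code), that sample decodes to three fields: newline + '|', then CR + '\' + 'a', then NULL. The sketch below is a simplified illustration of those rules only, not the copy.c logic; it handles just the escapes that appear in the sample and assumes the default "\N" null marker.

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        const char *line = "\\n\\||\\r\\\\a|\\N";  /* raw file bytes: \n\||\r\\a|\N */
        const char *p = line;
        int         fieldno = 1;

        for (;;)
        {
            char        buf[64];
            int         len = 0;
            const char *rawstart = p;

            while (*p != '\0' && *p != '|')
            {
                if (*p == '\\' && p[1] != '\0')
                {
                    p++;                    /* look at the escaped character */
                    if (*p == 'n')
                        buf[len++] = '\n';
                    else if (*p == 'r')
                        buf[len++] = '\r';
                    else
                        buf[len++] = *p;    /* \\ -> '\', \| -> '|', \a -> 'a' */
                }
                else
                    buf[len++] = *p;
                p++;
            }

            printf("field %d: ", fieldno++);
            if (p - rawstart == 2 && strncmp(rawstart, "\\N", 2) == 0)
                printf("NULL");             /* a field that is exactly \N is null */
            else
            {
                for (int i = 0; i < len; i++)
                {
                    if (isprint((unsigned char) buf[i]))
                        putchar(buf[i]);
                    else
                        printf("<0x%02x>", (unsigned char) buf[i]);
                }
            }
            putchar('\n');

            if (*p == '\0')
                break;
            p++;                            /* step over the field delimiter */
        }
        return 0;
    }

    /* prints:
     *   field 1: <0x0a>|
     *   field 2: <0x0d>\a
     *   field 3: NULL
     */

Under those stock rules a literal '\' followed by '|' in the table is dumped as \\\| (escaped backslash, then escaped delimiter), so the round trip is unambiguous; the question is how the new path handles the same input.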
On Wed, 2005-06-01 at 16:34 -0700, Alon Goldshuv wrote:
> 1) The patch includes 2 parallel parsing code paths. One is the regular COPY
> path that we all know, and the other is the improved one that I wrote. This
> is only temporary, as there is a lot of code duplication

Right; I really dislike the idea of having two separate code paths for COPY. When you say this approach is "temporary", are you suggesting that you intend to reimplement your changes as improvements/replacements of the existing COPY code path rather than as a parallel code path?

> As a part of submitting this patch I also presented an argument for a
> LOAD DATA command (in the NOLOGGING option thread). The points I made
> there are closely related to this message. There may be a valid argument
> that most of the points I raised could be implemented in the COPY code
> instead of a LOAD DATA command

I'm definitely not keen to see a new "LOAD DATA" command.

> That may not be a good idea for some, and it will also be problematic
> for backwards compatibility.

In what way would the performance improvements to COPY be backward incompatible with the existing COPY behaviour?

-Neil
Neil Conway said:
> On Wed, 2005-06-01 at 16:34 -0700, Alon Goldshuv wrote:
>> 1) The patch includes 2 parallel parsing code paths. One is the
>> regular COPY path that we all know, and the other is the improved one
>> that I wrote. This is only temporary, as there is a lot of code
>> duplication
>
> Right; I really dislike the idea of having two separate code paths for
> COPY. When you say this approach is "temporary", are you suggesting
> that you intend to reimplement your changes as
> improvements/replacements of the existing COPY code path rather than as
> a parallel code path?

It's not an all or nothing deal. When we put in CSV handling, we introduced two new routines for attribute input/output and otherwise used the rest of the COPY code. When I did a fix for the multiline problem, it was originally done with a separate read-line function for CSV mode - Bruce didn't like that, so I merged it back into the existing code. In retrospect, given this discussion, that might not have been an optimal choice. But the point is that you can break out at several levels.

Incidentally, there might be a good case for allowing the user to set the line end explicitly, but you can't just hardwire it - we will get massive Windows breakage. What is more, in CSV mode line end sequences can occur within logical lines (see the example after this message). You need to take that into account. It's tricky and easy to get badly wrong.

I will be the first to admit that there are probably some very good possibilities for optimisation of this code. My impression though has been that in almost all cases it's fast enough anyway. I know that on some very modest hardware I have managed to load a 6m row TPC line-items table in just a few minutes. Before we start getting too hung up, I'd be interested to know just how much data people want to load and how fast they want it to be. If people have massive data loads that take hours, days or weeks then it's obviously worth improving if we can. I'm curious to know what size datasets people are really handling this way.

cheers

andrew
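To illustrate the point about line ends inside logical lines, here is a made-up record (not data from the thread): in CSV mode the following is a single logical row, even though it spans two physical lines, because the newline falls inside a quoted field:

    1,"line one of the note
    line two of the note",ok

A reader that simply splits the input on newlines would wrongly treat this as two records; a CSV reader has to track whether it is inside a quoted field before deciding that a CR or LF really ends the row.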
Andrew,

> I will be the first to admit that there are probably some very good
> possibilities for optimisation of this code. My impression though has been
> that in almost all cases it's fast enough anyway. I know that on some very
> modest hardware I have managed to load a 6m row TPC line-items table in just
> a few minutes. Before we start getting too hung up, I'd be interested to
> know just how much data people want to load and how fast they want it to be.
> If people have massive data loads that take hours, days or weeks then it's
> obviously worth improving if we can. I'm curious to know what size datasets
> people are really handling this way.

x0+ GB files are common in data warehousing. The issue is often "can we load our data within the time allotted for the batch window", usually a matter of an hour or two.

Assuming that TPC lineitem is 140 bytes/row, 6M rows in 3 minutes is about 4.7 MB/s. To load a 10GB file at that rate takes about 2/3 of an hour. If one were to restore a 300GB database, it would take 18 hours. Maintenance operations are impractical after a few hours; 18 is a non-starter.

In practice, we're usually replacing an Oracle system with PostgreSQL, and the load speed difference between the two is currently embarrassing and makes the work impractical.

- Luke
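Spelling out the arithmetic behind those figures (same assumptions as above: 140 bytes per lineitem row, 6M rows loaded in roughly 3 minutes):

$$6\times10^{6}\ \text{rows}\times 140\ \text{bytes/row}\approx 840\ \text{MB},\qquad \frac{840\ \text{MB}}{180\ \text{s}}\approx 4.7\ \text{MB/s}$$

$$\frac{10\ \text{GB}}{4.7\ \text{MB/s}}\approx 36\ \text{minutes},\qquad \frac{300\ \text{GB}}{4.7\ \text{MB/s}}\approx 18\ \text{hours}$$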
Neil,

> Right; I really dislike the idea of having two separate code paths for
> COPY. When you say this approach is "temporary", are you suggesting
> that you intend to reimplement your changes as improvements/replacements
> of the existing COPY code path rather than as a parallel code path?

My thoughts were -- see how the responses are, and if people think that this is a better way to go, then replace the COPY parsing logic with the new one. The whole escaping discussion that is going on is something else; escapes could be implemented either way. The important thing I am trying to show is that there is a much faster way to parse the data than doing a char-by-char fetch-examine-load (the old pattern is sketched after this message).

>> As a part of submitting this patch I also presented an argument for a
>> LOAD DATA command (in the NOLOGGING option thread). The points I made
>> there are closely related to this message. There may be a valid argument
>> that most of the points I raised could be implemented in the COPY code
>> instead of a LOAD DATA command
>
> I'm definitely not keen to see a new "LOAD DATA" command.

It seems that most people don't :-) I can see valid arguments both for having it and for not having it.

>> That may not be a good idea for some, and it will also be problematic
>> for backwards compatibility.
>
> In what way would the performance improvements to COPY be backward
> incompatible with the existing COPY behaviour?

That comment was in respect to the escape logic. You can regard it as irrelevant for now, as long as the escape discussion goes on in parallel.

Alon.
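To make the "fetch-examine-load" contrast concrete, a loop in the spirit of the existing per-character approach looks roughly like the sketch below. This is an illustration of the pattern only, not the actual CopyReadLine()/CopyReadAttribute() code, and it ignores escapes. Compare it with the memchr()-based sketch earlier in the thread, which moves over whole spans at a time.

    #include <stddef.h>

    /* Illustrative only: the per-character pattern the patch tries to avoid.
     * Each byte is fetched, examined against the delimiter and newline, and
     * then appended to the current attribute buffer, one at a time. */
    size_t
    copy_one_field_per_char(const char *src, size_t len, char delim,
                            char *dst, size_t dstsize)
    {
        size_t i, out = 0;

        for (i = 0; i < len; i++)
        {
            char c = src[i];             /* fetch   */

            if (c == delim || c == '\n') /* examine */
                break;
            if (out < dstsize)
                dst[out++] = c;          /* load    */
        }
        return out;                      /* bytes copied into this field */
    }

The per-byte branch and copy is what dominates the inner loop; the buffer-scanning approach replaces it with a library scan plus a bulk copy per field.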
On Thu, 2005-06-02 at 00:53 -0400, Alon Goldshuv wrote:
> My thoughts were -- see how the responses are, and if people think
> that this is a better way to go, then replace the COPY parsing logic
> with the new one.

If the new approach to parsing can handle all the cases that the old approach can handle (WRT CSV, binary, encoding conversion and the like) and is faster to boot, I don't see a reason not to replace the current approach with something derived from your patch.

-Neil
Luke Lonergan said:
> Andrew,
>
>> I will be the first to admit that there are probably some very good
>> possibilities for optimisation of this code. My impression though has
>> been that in almost all cases it's fast enough anyway. I know that on
>> some very modest hardware I have managed to load a 6m row TPC
>> line-items table in just a few minutes. Before we start getting too
>> hung up, I'd be interested to know just how much data people want to
>> load and how fast they want it to be. If people have massive data
>> loads that take hours, days or weeks then it's obviously worth
>> improving if we can. I'm curious to know what size datasets people are
>> really handling this way.
>
> x0+ GB files are common in data warehousing. The issue is often "can
> we load our data within the time allotted for the batch window",
> usually a matter of an hour or two.
>
> Assuming that TPC lineitem is 140 bytes/row, 6M rows in 3 minutes is
> about 4.7 MB/s. To load a 10GB file at that rate takes about 2/3 of an
> hour. If one were to restore a 300GB database, it would take 18 hours.
> Maintenance operations are impractical after a few hours; 18 is a
> non-starter.
>
> In practice, we're usually replacing an Oracle system with PostgreSQL,
> and the load speed difference between the two is currently embarrassing
> and makes the work impractical.

OK ... that seems fair enough. The next question is where the data being loaded comes from. pg_dump? How does load speed compare with using COPY's binary mode?

cheers

andrew
Andrew,

> OK ... that seems fair enough. The next question is where the data being
> loaded comes from. pg_dump? How does load speed compare with using COPY's
> binary mode?

Oddly, our tests in the past have shown that binary is actually slower.

Luke
FYI - we're prepping a proposal for just that. Stay tuned for Alon's proposal.

Luke

On 6/1/05 11:43 PM, "Neil Conway" <neilc@samurai.com> wrote:

> On Thu, 2005-06-02 at 00:53 -0400, Alon Goldshuv wrote:
>> My thoughts were -- see how the responses are, and if people think
>> that this is a better way to go, then replace the COPY parsing logic
>> with the new one.
>
> If the new approach to parsing can handle all the cases that the old
> approach can handle (WRT CSV, binary, encoding conversion and the like)
> and is faster to boot, I don't see a reason not to replace the current
> approach with something derived from your patch.
>
> -Neil