Thread: CopyReadLineText optimization revisited
I'm reviving the effort I started a while back to make COPY faster:

http://archives.postgresql.org/pgsql-patches/2008-02/msg00100.php
http://archives.postgresql.org/pgsql-patches/2008-03/msg00015.php

The patch I now have is based on using memchr() to search for the end of line. In a nutshell:

* we perform possible encoding conversion early, one input block at a time, rather than after splitting the input into lines. This allows us to assume in the later stages that the data is in server encoding, allowing us to search for the '\n' byte without worrying about multi-byte characters.

* instead of the byte-at-a-time loop in CopyReadLineText(), use memchr() to find the next NL/CR character. This is where the speedup comes from. Unfortunately we can't do that in the CSV codepath, because newlines can be embedded in quoted fields, so that's unchanged.

These changes seem to give an overall speedup of 0-10%, depending on the shape of the table. I tested various tables from the TPC-H schema, and a narrow table consisting of just one short text column.

I can't think of a case where these changes would be a net loss in performance, and they didn't perform worse on any of the cases I tested either.

There's a small fly in the ointment: the patch won't recognize backslash followed by a linefeed as an escaped linefeed. I think we should simply drop support for that. The docs already say:

> It is strongly recommended that applications generating COPY data convert data newlines and carriage returns to the \n and \r sequences respectively. At present it is possible to represent a data carriage return by a backslash and carriage return, and to represent a data newline by a backslash and newline. However, these representations might not be accepted in future releases. They are also highly vulnerable to corruption if the COPY file is transferred across different machines (for example, from Unix to Windows or vice versa).

I vaguely recall that we discussed this some time ago already and agreed that we can drop it if it makes life easier.

This patch is in pretty good shape, but it needs to be tested with different exotic input formats. Also, the loop in CopyReadLineText() could probably be cleaned up a bit: some of the uglifications that were done for performance reasons in the old code are no longer necessary, since memchr() does the heavy lifting and the loop only iterates 1-2 times per line in typical cases.

It's not strictly necessary, but how about dropping support for the old COPY protocol, and the EOF marker \. while we're at it? It would allow us to drop some code, making the remaining code simpler, and reduce the testing effort. Thoughts on that?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Attachment
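[For illustration, here is a minimal sketch of the memchr()-based end-of-line scan described above. This is not the actual patch; find_eol(), buf, and len are invented names. Once a block is known to be in server encoding, the '\n' and '\r' byte values cannot occur inside a multi-byte character, so a plain byte search is safe.]

    /*
     * Illustrative sketch only (not the actual patch): find the next NL/CR in
     * a buffer that has already been converted to server encoding.
     */
    #include <stddef.h>
    #include <string.h>

    /* Returns a pointer to the first '\n' or '\r', or NULL if this block has
     * neither (the line continues into the next input block). */
    static const char *
    find_eol(const char *buf, size_t len)
    {
        const char *nl = memchr(buf, '\n', len);
        const char *cr = memchr(buf, '\r', len);

        if (nl == NULL)
            return cr;          /* may be NULL too */
        if (cr == NULL)
            return nl;
        return (nl < cr) ? nl : cr;
    }

[The point of the change is that libc's memchr(), which is typically optimized per platform, inspects the bytes instead of a hand-written byte-at-a-time loop, so the outer loop only runs once or twice per line.]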
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> * we perform possible encoding conversion early, one input block at a
> time, rather than after splitting the input into lines. This allows us
> to assume in the later stages that the data is in server encoding,
> allowing us to search for the '\n' byte without worrying about
> multi-byte characters.

Seems reasonable, although the need to deal with multibyte characters crossing a block boundary injects some ugliness that wasn't there before.

> * instead of the byte-at-a-time loop in CopyReadLineText(), use memchr()
> to find the next NL/CR character. This is where the speedup comes from.

It seems like that speedup, if any, would be extremely platform-dependent. What have you tested on?

> There's a small fly in the ointment: the patch won't recognize backslash
> followed by a linefeed as an escaped linefeed. I think we should simply
> drop support for that.

I think this is likely to break apps that have worked for years. I can't get excited about doing that in return for a "0-10%" speedup that might only materialize on some platforms. If the numbers were better, it'd be worth paying that price, but ...

> It's not strictly necessary, but how about dropping support for the old
> COPY protocol, and the EOF marker \. while we're at it? It would allow
> us to drop some code, making the remaining code simpler, and reduce the
> testing effort. Thoughts on that?

Again, I think the threshold requirement for breaking compatibility needs to be a lot higher than this.

			regards, tom lane
On 26/08/10 22:16, Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> * instead of the byte-at-a-time loop in CopyReadLineText(), use memchr()
>> to find the next NL/CR character. This is where the speedup comes from.
>
> It seems like that speedup, if any, would be extremely
> platform-dependent. What have you tested on?

Only on my 32-bit Intel Linux laptop. If anyone out there has more interesting platforms to test on, that would be appreciated.

>> There's a small fly in the ointment: the patch won't recognize backslash
>> followed by a linefeed as an escaped linefeed. I think we should simply
>> drop support for that.
>
> I think this is likely to break apps that have worked for years. I
> can't get excited about doing that in return for a "0-10%" speedup
> that might only materialize on some platforms. If the numbers were
> better, it'd be worth paying that price, but ...

OK. If we have to, we can keep that; it just requires more programming. After searching for a \n, we can peek at the previous byte to check if it's a backslash (and if it is, the one before that to see if it's a backslash too, and so forth until we find a non-backslash).

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
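[A hypothetical sketch of the backward peek Heikki describes; the names are invented, and this is not code from the patch. After memchr() finds a newline at buf[pos], count the consecutive backslashes immediately before it: an odd-length run means the newline itself is escaped, while an even-length run means the backslashes pair up into escaped backslashes.]

    #include <stdbool.h>
    #include <stddef.h>

    /* line_start is the first byte belonging to the current line; buf[pos]
     * is the newline that memchr() just found. */
    static bool
    newline_is_escaped(const char *buf, size_t line_start, size_t pos)
    {
        size_t  nbackslashes = 0;

        while (pos - nbackslashes > line_start &&
               buf[pos - 1 - nbackslashes] == '\\')
            nbackslashes++;

        /* "\\" followed by a newline is an escaped backslash and a real
         * newline, so only an odd run escapes the newline itself. */
        return (nbackslashes % 2) == 1;
    }

[As the next message notes, this kind of backward scan was judged unreliable in earlier discussions, at least while the data is still in client encoding.]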
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 26/08/10 22:16, Tom Lane wrote:
>> I think this is likely to break apps that have worked for years. I
>> can't get excited about doing that in return for a "0-10%" speedup
>> that might only materialize on some platforms. If the numbers were
>> better, it'd be worth paying that price, but ...

> OK. If we have to, we can keep that; it just requires more programming.
> After searching for a \n, we can peek at the previous byte to check if
> it's a backslash (and if it is, the one before that to see if it's a
> backslash too, and so forth until we find a non-backslash).

I seem to recall that this same problem was discussed before, and we concluded that you couldn't reliably parse backwards to figure out whether the newline was backslashed. Although that might have been in the context of data still in client encoding, where you have the extra frammish that a "backslash" could be a non-first byte of a character. Anyway, it'd be a good idea to recheck those old discussions if you can find 'em.

Another approach that came to me was to parse forwards, and if we find a backslash at the end of the line, then conclude that we had a backslashed newline and slurp in another line's worth of data at that time. I'm not sure how much restructuring would be needed to make that feasible, but it seems worth considering.

			regards, tom lane
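[A rough sketch of the forward-parsing idea Tom suggests, assuming the line-reading machinery can simply be asked for another physical line. read_physical_line() is a made-up stand-in for the real block/line reading code; StringInfo is PostgreSQL's resizable string buffer. Not real copy.c code.]

    #include "postgres.h"
    #include "lib/stringinfo.h"

    /* Hypothetical helper: appends the next physical line's text (without
     * its terminating newline) to "line" and returns false at end of input. */
    extern bool read_physical_line(StringInfo line);

    static bool
    read_logical_line(StringInfo line)
    {
        int     nbackslashes;

        resetStringInfo(line);

        for (;;)
        {
            if (!read_physical_line(line))
                return line->len > 0;   /* end of input */

            /* Does an odd-length run of backslashes end this segment? */
            nbackslashes = 0;
            while (line->len > nbackslashes &&
                   line->data[line->len - 1 - nbackslashes] == '\\')
                nbackslashes++;

            if (nbackslashes % 2 == 0)
                return true;            /* the newline was not escaped */

            /* Escaped newline: keep the raw backslash-newline sequence for
             * later escape processing and slurp in the next line. */
            appendStringInfoChar(line, '\n');
        }
    }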
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> OK. If we have to, we can keep that; it just requires more
> programming. After searching for a \n, we can peek at the previous byte to
> check if it's a backslash (and if it is, the one before that to see if it's
> a backslash too, and so forth until we find a non-backslash).

That's what pgloader does to allow for non-quoted fields containing escaped separators in some contrived input formats (UNLOAD from Informix, I'm looking at you).

I guess the same kind of trick could be applied to CSV too, but it'd be necessary to search back to the previous \n and count the QUOTE chars you find. That does not sound like a huge win, even if you remember the state at the last quoted \n.

Fancy format parsing ain't fun.

Regards,
--
dim
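[To illustrate why the CSV side is messier, here is a deliberately naive sketch of the quote-counting idea; the names are invented, and this is neither pgloader nor copy.c code. It scans back from a newline to the previous accepted line boundary and checks the parity of QUOTE characters: odd parity means the newline sits inside a quoted field. With the default doubled-quote escaping the parity still works out, but a distinct ESCAPE character would break it, which is part of why this doesn't look like a win.]

    #include <stdbool.h>
    #include <stddef.h>

    /* prev_eol is the end of the previous accepted line, buf[pos] is the
     * newline just found, and quote is the CSV QUOTE character. */
    static bool
    newline_inside_quoted_field(const char *buf, size_t prev_eol, size_t pos,
                                char quote)
    {
        size_t  nquotes = 0;
        size_t  i;

        for (i = prev_eol; i < pos; i++)
        {
            if (buf[i] == quote)
                nquotes++;
        }

        return (nquotes % 2) == 1;
    }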
Was this implemented? Is it a TODO?

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> I'm reviving the effort I started a while back to make COPY faster:
>
> http://archives.postgresql.org/pgsql-patches/2008-02/msg00100.php
> http://archives.postgresql.org/pgsql-patches/2008-03/msg00015.php
>
> The patch I now have is based on using memchr() to search for the end of
> line. [...]
>
> It's not strictly necessary, but how about dropping support for the old
> COPY protocol, and the EOF marker \. while we're at it? It would allow
> us to drop some code, making the remaining code simpler, and reduce the
> testing effort. Thoughts on that?

[ Attachment, skipping... ]

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ It's impossible for everything to be true. +
On Fri, Feb 18, 2011 at 9:35 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Was this implemented? Is it a TODO?

It's not entirely clear to me that there's a meaningful win here. Speeding up COPY is already on the TODO list, with this link:

http://archives.postgresql.org/pgsql-hackers/2008-02/msg00954.php

...and it may be that some of the ideas proposed there are more worthwhile than this one.

But maybe that same TODO item could also get a link to the start of this thread.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
> On Fri, Feb 18, 2011 at 9:35 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > Was this implemented? Is it a TODO?
>
> It's not entirely clear to me that there's a meaningful win here.
> Speeding up COPY is already on the TODO list, with this link:
>
> http://archives.postgresql.org/pgsql-hackers/2008-02/msg00954.php
>
> ...and it may be that some of the ideas proposed there are more
> worthwhile than this one.
>
> But maybe that same TODO item could also get a link to the start of
> this thread.

OK, I added a link to this thread to the existing COPY TODO performance item.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ It's impossible for everything to be true. +