Thread: CopyReadLineText optimization
The purpose of CopyReadLineText is to scan the input buffer, and find the next newline, taking into account any escape characters. It currently operates in a loop, one byte at a time, searching for LF, CR, or a backslash. That's a bit slow: I've been running oprofile on COPY, and I've seen CopyReadLine to take around ~10% of the CPU time, and Joshua Drake just posted a very similar profile to hackers. Attached is a patch that modifies CopyReadLineText so that it uses memchr to speed up the scan. The nice thing about memchr is that we can take advantage of any clever optimizations that might be in libc or compiler. In the tests I've been running, it roughly halves the time spent in CopyReadLine (including the new memchr calls), thus reducing the total CPU overhead by ~5%. I'm planning to run more tests with data that has backslashes and with different width tables to see what the worst-case and best-case performance is like. Also, it doesn't work for CSV format at the moment; that needs to be fixed. 5% isn't exactly breathtaking, but it's a start. I tried the same trick to CopyReadAttributesText, but unfortunately it doesn't seem to help there because you need to "stop" the efficient word-at-a-time scan that memchr does (at least with glibc, YMMV) whenever there's a column separator, while in CopyReadLineText you get to process the whole line in one call, assuming there's no backslashes. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com Index: src/backend/commands/copy.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v retrieving revision 1.295 diff -c -r1.295 copy.c *** src/backend/commands/copy.c 1 Jan 2008 19:45:48 -0000 1.295 --- src/backend/commands/copy.c 24 Feb 2008 00:40:09 -0000 *************** *** 67,72 **** --- 67,77 ---- EOL_CRNL } EolType; + typedef struct { + char c; /* byte we're looking for */ + char *s; /* pointer to next occurance of this byte */ + } searchbyte; + /* * This struct contains all the state variables used throughout a COPY * operation. For simplicity, we use the same struct for all variants of COPY, *************** *** 159,164 **** --- 164,171 ---- char *raw_buf; int raw_buf_index; /* next byte to process */ int raw_buf_len; /* total # of bytes stored */ + + searchbyte sb[3]; /* pointers to next "interesting" characters in raw_buf */ } CopyStateData; typedef CopyStateData *CopyState; *************** *** 170,175 **** --- 177,283 ---- CopyState cstate; /* CopyStateData for the command */ } DR_copy; + #define swap_sb(a,b) do { \ + searchbyte c = (a); \ + (a) = (b); \ + (b) = (c); \ + } while(0) + + /* is a < b? NULLs are last */ + #define cmp_sb(a,b) ((a).s < (b).s && (a).s != NULL) + + static void + sort_sb(searchbyte *sb) + { + if (cmp_sb(sb[2], sb[1])) + swap_sb(sb[1], sb[2]); + if (cmp_sb(sb[1], sb[0])) + swap_sb(sb[0], sb[1]); + if (cmp_sb(sb[2], sb[1])) + swap_sb(sb[1], sb[2]); + } + + /* + * Starts a new search in a string + */ + static void + init_interesting(searchbyte *sb, char *str, int len) + { + sb[0].c = '\n'; + sb[1].c = '\r'; + sb[2].c = '\\'; + sb[0].s = memchr(str, '\n', len); + sb[1].s = memchr(str, '\r', len); + sb[2].s = memchr(str, '\\', len); + sort_sb(sb); + } + + /* + * Look for the next interesting character + * + * sb - array of three searchbytes. 
+ * ss - string to search + * len_left - number of bytes left in ss to search + * + * Returns the offset of the next interesting location, relative to 'ss' + */ + static int + next_interesting(searchbyte *sb, char *ss, int len_left) + { + char *s; + + if (sb[0].s == NULL) + return 0; + + /* + * The caller skipped over the next pointer we had memorized, so we have to + * search for the one after that. + */ + if (ss > sb[0].s) + { + sb[0].s = memchr(ss, sb[0].c, len_left); + if (sb[0].s == NULL) + return 0; /* we're done */ + if (sb[1].s != NULL) + { + if (ss > sb[1].s) + { + /* + * The caller skipped over the 2nd pointer we had memorized + * as well. Have to search for that as well + */ + sb[1].s = memchr(ss, sb[1].c, len_left); + if (sb[2].s != NULL && ss > sb[2].s) + { + /* Caller skipped over 3rd pointer as well... */ + sb[2].s = memchr(ss, sb[2].c, len_left); + } + } + sort_sb(sb); + } + } + + /* + * Return the next interesting location we had memorized, and search + * for the next occurance of that byte. + */ + s = sb[0].s; + + sb[0].s = memchr(s+1, sb[0].c, (ss+len_left) - (s + 1)); + + /* + * Bubble down the next location into the three-element list we have, + */ + if (cmp_sb(sb[1], sb[0])) + { + swap_sb(sb[0], sb[1]); + if (cmp_sb(sb[2], sb[1])) + swap_sb(sb[1], sb[2]); + } + + return s - ss; + } + /* * These macros centralize code used to process line_buf and raw_buf buffers. *************** *** 2350,2355 **** --- 2458,2465 ---- raw_buf_ptr = 0; copy_buf_len = cstate->raw_buf_len; + init_interesting(cstate->sb, copy_raw_buf, copy_buf_len); + /* * If we are completely out of data, break out of the loop, * reporting EOF. *************** *** 2363,2368 **** --- 2473,2495 ---- } /* OK to fetch a character */ + if (!cstate->csv_mode) + { + int skip = next_interesting(cstate->sb, + ©_raw_buf[raw_buf_ptr], + copy_buf_len - raw_buf_ptr); + + + if (skip != 0) + { + raw_buf_ptr += skip; + first_char_in_line = false; + + prev_raw_ptr = raw_buf_ptr - 1; + IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(1); + } + } + prev_raw_ptr = raw_buf_ptr; c = copy_raw_buf[raw_buf_ptr++];
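For readers skimming the diff, the core idea reduces to the following standalone sketch. The function name and structure here are invented for illustration and are not taken from the patch; the patch additionally remembers the next occurrence of each byte between calls, so it does not rescan the buffer every time.

    /*
     * Simplified illustration of the approach above: instead of testing
     * every byte against '\n', '\r' and '\\', call memchr() once per
     * interesting byte and return the earliest hit.
     */
    #include <stdio.h>
    #include <string.h>

    static const char *
    next_interesting_byte(const char *start, size_t len,
                          const char *interesting, size_t ninteresting)
    {
        const char *best = NULL;
        size_t      i;

        for (i = 0; i < ninteresting; i++)
        {
            const char *hit = memchr(start, interesting[i], len);

            if (hit != NULL && (best == NULL || hit < best))
                best = hit;
        }
        return best;            /* NULL if none of the bytes occur */
    }

    int
    main(void)
    {
        const char  buf[] = "plain text\\with an escape and a newline\n";
        const char *hit = next_interesting_byte(buf, sizeof(buf) - 1, "\n\r\\", 3);

        if (hit != NULL)
            printf("first interesting byte at offset %ld\n", (long) (hit - buf));
        return 0;
    }

On long stretches without any of the interesting bytes, the whole stretch is handled inside memchr rather than one byte at a time, which is where the speedup comes from.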
Cool! It's been a while since we've done the same kind of thing :-) - Luke > -----Original Message----- > From: pgsql-patches-owner@postgresql.org > [mailto:pgsql-patches-owner@postgresql.org] On Behalf Of > Heikki Linnakangas > Sent: Saturday, February 23, 2008 5:30 PM > To: pgsql-patches@postgresql.org > Subject: [PATCHES] CopyReadLineText optimization > > The purpose of CopyReadLineText is to scan the input buffer, > and find the next newline, taking into account any escape > characters. It currently operates in a loop, one byte at a > time, searching for LF, CR, or a backslash. That's a bit > slow: I've been running oprofile on COPY, and I've seen > CopyReadLine to take around ~10% of the CPU time, and Joshua > Drake just posted a very similar profile to hackers. > > Attached is a patch that modifies CopyReadLineText so that it > uses memchr to speed up the scan. The nice thing about memchr > is that we can take advantage of any clever optimizations > that might be in libc or compiler. > > In the tests I've been running, it roughly halves the time > spent in CopyReadLine (including the new memchr calls), thus > reducing the total CPU overhead by ~5%. I'm planning to run > more tests with data that has backslashes and with different > width tables to see what the worst-case and best-case > performance is like. Also, it doesn't work for CSV format at > the moment; that needs to be fixed. > > 5% isn't exactly breathtaking, but it's a start. I tried the > same trick to CopyReadAttributesText, but unfortunately it > doesn't seem to help there because you need to "stop" the > efficient word-at-a-time scan that memchr does (at least with > glibc, YMMV) whenever there's a column separator, while in > CopyReadLineText you get to process the whole line in one > call, assuming there's no backslashes. > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com >
Heikki Linnakangas wrote: > Attached is a patch that modifies CopyReadLineText so that it uses > memchr to speed up the scan. The nice thing about memchr is that we can > take advantage of any clever optimizations that might be in libc or > compiler. Here's an updated version of the patch. The principle is the same, but the same optimization is now used for CSV input as well, and there's more comments. I still need to do more benchmarking. I mentioned a ~5% speedup on the test I ran earlier, which was a load of the lineitem table from TPC-H. It looks like with cheaper data types the gain can be much bigger; here's an oprofile from loading the TPC-H partsupp table, Before: samples % image name symbol name 5146 25.7635 postgres CopyReadLine 4089 20.4716 postgres DoCopy 1449 7.2544 reiserfs (no symbols) 1369 6.8539 postgres pg_verify_mbstr_len 1013 5.0716 libc-2.7.so memcpy 749 3.7499 libc-2.7.so ____strtod_l_internal 598 2.9939 postgres heap_formtuple 548 2.7436 libc-2.7.so ____strtol_l_internal 403 2.0176 libc-2.7.so memset 309 1.5470 libc-2.7.so strlen 208 1.0414 postgres AllocSetAlloc ... After: samples % image name symbol name 4165 25.7879 postgres DoCopy 1574 9.7455 postgres pg_verify_mbstr_len 1520 9.4112 reiserfs (no symbols) 1005 6.2225 libc-2.7.so memchr 986 6.1049 libc-2.7.so memcpy 632 3.9131 libc-2.7.so ____strtod_l_internal 589 3.6468 postgres heap_formtuple 546 3.3806 libc-2.7.so ____strtol_l_internal 386 2.3899 libc-2.7.so memset 366 2.2661 postgres CopyReadLine 287 1.7770 libc-2.7.so strlen 215 1.3312 postgres LWLockAcquire 208 1.2878 postgres hash_any 176 1.0897 postgres LWLockRelease 161 0.9968 postgres InputFunctionCall 157 0.9721 postgres AllocSetAlloc ... Profile shows that with the patch, ~8.5% of the CPU time is spent in CopyReadLine+memchr, vs. 25.5% before. That's a quite significant speedup. I still need to test the worst-case performance, with input that has a lot of escapes. It would be interesting to hear reports with this patch from people on different platforms. These results are from my laptop with 32-bit Intel CPU, running Linux. There could be big differences in the memchr implementations. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com Index: src/backend/commands/copy.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v retrieving revision 1.295 diff -c -r1.295 copy.c *** src/backend/commands/copy.c 1 Jan 2008 19:45:48 -0000 1.295 --- src/backend/commands/copy.c 29 Feb 2008 17:43:53 -0000 *************** *** 67,72 **** --- 67,275 ---- EOL_CRNL } EolType; + + /***** Fast byte searcher functions ***** + * + * CopyReadLine needs to search for newlines, as well as quoting and escape + * characters to determine which newlines are real and which ones are escaped. + * Doing that in the naive way, looping through the input byte by byte and + * comparing against the interesting characters, can be slow. + * + * To speed that up, we provide a "searcher" interface, that can be used to + * search a piece of memory for up to 4 different bytes simultaneously. It's + * similar to memchr, but allows searching for multiple chars at the same + * time. + * + * To start a search on a new block of memory, call init_searcher. Then call + * next_searcher repeatedly to return all the occurrances of the bytes being + * searched for. + * + * The searcher implementation uses memchr to search for the bytes, and keeps + * track of where the next occurrance of each is. 
Using memchr allows us + * to take advantage of any platform-specific optimizations. + */ + + /* + * Struct used to store state of the current search between calls to + * next_searcher. Initialized by init_search. + */ + typedef struct { + /* Each element in c corresponds the element in s. These are sorted + * by the pointers (ptr) */ + int n; /* elements used in the arrays */ + char c[4]; /* bytes we're looking for */ + char *ptr[4]; /* pointers to next occurances of the bytes */ + char *end; /* end of block being searched. Last byte to search is end-1 */ + } searcher; + + /* Functions internal to "searcher" */ + + #define swap_searcher_entries(searcher, a, b) do { \ + char tmpc = (searcher)->c[a]; \ + char *tmpptr = (searcher)->ptr[a]; \ + (searcher)->c[a] = (searcher)->c[b]; \ + (searcher)->ptr[a] = (searcher)->ptr[b]; \ + (searcher)->c[b] = tmpc; \ + (searcher)->ptr[b] = tmpptr; \ + } while(0) + + static void + sort_searcher(searcher *s) + { + switch(s->n) + { + case 0: + case 1: + /* Nothing to do */ + break; + case 2: + if (s->ptr[0] > s->ptr[1]) + swap_searcher_entries(s, 0, 1); + break; + case 3: + if (s->ptr[0] > s->ptr[2]) + swap_searcher_entries(s, 0, 2); + if (s->ptr[0] > s->ptr[1]) + swap_searcher_entries(s, 0, 1); + if (s->ptr[1] > s->ptr[2]) + swap_searcher_entries(s, 1, 2); + break; + case 4: + if (s->ptr[0] > s->ptr[1]) + swap_searcher_entries(s, 0, 1); + if (s->ptr[2] > s->ptr[3]) + swap_searcher_entries(s, 2, 3); + if (s->ptr[1] > s->ptr[2]) + swap_searcher_entries(s, 2, 3); + if (s->ptr[0] > s->ptr[1]) + swap_searcher_entries(s, 0, 1); + if (s->ptr[2] > s->ptr[3]) + swap_searcher_entries(s, 2, 3); + break; + } + } + + /* Remove the topmost element */ + static void + remove_top(searcher *searcher) + { + int i; + + searcher->n--; + + /* Move up the remaining items */ + for (i = 0; i < searcher->n; i++) + { + searcher->c[i] = searcher->c[i + 1]; + searcher->ptr[i] = searcher->ptr[i + 1]; + } + } + + /* + * The first element in the list has been replaced by the caller. + * Move it to the right place in the list, to keep it sorted. + */ + static void + sift_down(searcher *searcher) + { + if (searcher->n < 2 || searcher->ptr[0] <= searcher->ptr[1]) + return; + swap_searcher_entries(searcher, 0, 1); + + if (searcher->n < 3 || searcher->ptr[1] <= searcher->ptr[2]) + return; + swap_searcher_entries(searcher, 1, 2); + + if (searcher->n < 4 || searcher->ptr[2] <= searcher->ptr[3]) + return; + swap_searcher_entries(searcher, 2, 3); + } + + /* + * Starts a new search in a block of memory. + * + * searcher - for storing state between calls. Opaque to caller, + * init_searcher will initialize this. + * str - block of memory to search + * len - length of str + * c - bytes to search for, max 4. + * n - number of bytes in c + */ + static void + init_searcher(searcher *searcher, char *str, int len, char c[4], int n) + { + int i; + + Assert(n <= 4); + + searcher->n = 0; + searcher->end = str + len; + + /* Find first occurances of the searched-for bytes */ + for (i = 0; i < n; i++) + { + char *hit = memchr(str, c[i], len); + if (hit != NULL) + { + searcher->c[searcher->n] = c[i]; + searcher->ptr[searcher->n] = hit; + searcher->n++; + } + /* + * If the byte doesn't occur in the string at all, we don't + * need to put it in the list. + */ + } + sort_searcher(searcher); + } + + /* + * Look for the next interesting character + * + * searcher - state, opaque to caller + * str - block to search. 
+ * + * 'str' must be part of the block passed to init_searcher, and >= + * the return value of last call to next_searcher. + * + * Returns the offset of the next interesting location, relative to 'ss' + */ + static char * + next_searcher(searcher *searcher, char *str) + { + Assert(str < searcher->end); + + if (searcher->n == 0) + return NULL; + + /* + * If the caller skipped over the next pointer we had memorized, we have to + * search for the one after that. + */ + while (str > searcher->ptr[0]) + { + char *s = memchr(str, searcher->c[0], searcher->end - str); + + if (s == NULL) + { + remove_top(searcher); + if (searcher->n == 0) + return NULL; + } + else + { + searcher->ptr[0] = s; + sift_down(searcher); + } + } + + /* + * Return the next interesting location we had memorized. + */ + return searcher->ptr[0]; + } + + /* * This struct contains all the state variables used throughout a COPY * operation. For simplicity, we use the same struct for all variants of COPY, *************** *** 159,164 **** --- 362,369 ---- char *raw_buf; int raw_buf_index; /* next byte to process */ int raw_buf_len; /* total # of bytes stored */ + + searcher searcher; } CopyStateData; typedef CopyStateData *CopyState; *************** *** 2285,2299 **** --- 2490,2516 ---- last_was_esc = false; char quotec = '\0'; char escapec = '\0'; + char search_chars[4]; + int nsearch_chars = 0; + + search_chars[nsearch_chars++] = '\n'; + search_chars[nsearch_chars++] = '\r'; if (cstate->csv_mode) { quotec = cstate->quote[0]; escapec = cstate->escape[0]; + + search_chars[nsearch_chars++] = quotec; + /* ignore special escape processing if it's the same as quotec */ if (quotec == escapec) escapec = '\0'; + else + search_chars[nsearch_chars++] = escapec; } + else + search_chars[nsearch_chars++] = '\\'; mblen_str[1] = '\0'; *************** *** 2360,2368 **** --- 2577,2601 ---- break; } need_data = false; + + init_searcher(&cstate->searcher, copy_raw_buf, copy_buf_len, + search_chars, nsearch_chars); } /* OK to fetch a character */ + if ((!cstate->csv_mode || !first_char_in_line) + && (raw_buf_ptr + 4 <= copy_buf_len)) + { + char *next = next_searcher(&cstate->searcher, + ©_raw_buf[raw_buf_ptr]); + + if (next != NULL) + { + raw_buf_ptr = next - copy_raw_buf; + first_char_in_line = false; + } + } + prev_raw_ptr = raw_buf_ptr; c = copy_raw_buf[raw_buf_ptr++];
Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --------------------------------------------------------------------------- Heikki Linnakangas wrote: > Heikki Linnakangas wrote: > > Attached is a patch that modifies CopyReadLineText so that it uses > > memchr to speed up the scan. The nice thing about memchr is that we can > > take advantage of any clever optimizations that might be in libc or > > compiler. > > Here's an updated version of the patch. The principle is the same, but > the same optimization is now used for CSV input as well, and there's > more comments. > > I still need to do more benchmarking. I mentioned a ~5% speedup on the > test I ran earlier, which was a load of the lineitem table from TPC-H. > It looks like with cheaper data types the gain can be much bigger; > here's an oprofile from loading the TPC-H partsupp table, > > Before: > > samples % image name symbol name > 5146 25.7635 postgres CopyReadLine > 4089 20.4716 postgres DoCopy > 1449 7.2544 reiserfs (no symbols) > 1369 6.8539 postgres pg_verify_mbstr_len > 1013 5.0716 libc-2.7.so memcpy > 749 3.7499 libc-2.7.so ____strtod_l_internal > 598 2.9939 postgres heap_formtuple > 548 2.7436 libc-2.7.so ____strtol_l_internal > 403 2.0176 libc-2.7.so memset > 309 1.5470 libc-2.7.so strlen > 208 1.0414 postgres AllocSetAlloc > ... > > After: > > samples % image name symbol name > 4165 25.7879 postgres DoCopy > 1574 9.7455 postgres pg_verify_mbstr_len > 1520 9.4112 reiserfs (no symbols) > 1005 6.2225 libc-2.7.so memchr > 986 6.1049 libc-2.7.so memcpy > 632 3.9131 libc-2.7.so ____strtod_l_internal > 589 3.6468 postgres heap_formtuple > 546 3.3806 libc-2.7.so ____strtol_l_internal > 386 2.3899 libc-2.7.so memset > 366 2.2661 postgres CopyReadLine > 287 1.7770 libc-2.7.so strlen > 215 1.3312 postgres LWLockAcquire > 208 1.2878 postgres hash_any > 176 1.0897 postgres LWLockRelease > 161 0.9968 postgres InputFunctionCall > 157 0.9721 postgres AllocSetAlloc > ... > > Profile shows that with the patch, ~8.5% of the CPU time is spent in > CopyReadLine+memchr, vs. 25.5% before. That's a quite significant speedup. > > I still need to test the worst-case performance, with input that has a > lot of escapes. It would be interesting to hear reports with this patch > from people on different platforms. These results are from my laptop > with 32-bit Intel CPU, running Linux. There could be big differences in > the memchr implementations. > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Heikki Linnakangas wrote: > I still need to test the worst-case performance, with input that has a > lot of escapes. Ok, I've done some more performance testing with this. I tested COPY FROM with a table with a single "text" column. There was a million rows in the table, with a 1000 character long string: postgres=# CREATE TABLE narrowtable2 (id text); CREATE TABLE postgres=# INSERT INTO narrowtable2 SELECT repeat(E'\\', 1000) FROM generate_series(1, 1000000); INSERT 0 1000000 After that, I dumped that to a file, and loaded it back using COPY FROM: time ~/installations/cvshead/bin/psql postgres -c "BEGIN; TRUNCATE narrowtable2; COPY narrowtable2 FROM '/home/perftester/narrowtable3.tbl'; ROLLBACK;" I repeated the test with different frequencies of backslashes in the string, with and without the patch, and the took the smallest number of each test case: backslashes with without patch all 24.9 15.6 every 4th 12.7 11.4 every 8th 10.4 10.7 every 16th 8.7 10.3 none 6.8 9.8 So the overhead of using memchr slows us down if there's a lot of escape or quote characters. The breakeven point seems to be about 1 in 8 characters. I'm not sure if that's a good tradeoff or not... I also tested a table with single integer column, and found no meaningful difference (10.5 without patch vs 10.6 with patch). oprofile shows that in this test case, only ~5% of the CPU time is spent in CopyReadLineText, and the patch doesn't change that. Without patch: samples % image name app name symbol name 7563 12.7220 no-vmlinux postgres (no symbols) 4050 6.8127 postgres postgres DoCopy 3334 5.6083 postgres postgres LWLockAcquire 3238 5.4468 postgres postgres CopyReadLine 2900 4.8782 postgres postgres LWLockRelease 2781 4.6780 libc-2.7.so postgres __GI_____strtoll_l_internal 2778 4.6730 postgres postgres heap_formtuple 2636 4.4341 postgres postgres hash_any 2087 3.5106 no-vmlinux no-vmlinux (no symbols) 1748 2.9404 libc-2.7.so postgres memset 1724 2.9000 postgres postgres PinBuffer 1670 2.8092 postgres postgres PageAddItem 1645 2.7671 postgres postgres heap_insert 1459 2.4542 postgres postgres UnpinBuffer 1457 2.4509 postgres postgres ReadBuffer_common 1321 2.2221 postgres postgres hash_search_with_hash_value 1278 2.1498 postgres postgres MarkBufferDirty 1219 2.0505 oprofiled oprofiled (no symbols) 972 1.6350 postgres postgres pg_verify_mbstr_len 756 1.2717 postgres postgres RelationPutHeapTuple 665 1.1186 postgres postgres pg_atoi 631 1.0614 postgres postgres RelationGetBufferForTuple 613 1.0312 postgres postgres AllocSetReset ... With patch: samples % image name app name symbol name 42720 18.1450 no-vmlinux postgres (no symbols) 15367 6.5270 postgres postgres DoCopy 11831 5.0251 postgres postgres LWLockAcquire 11500 4.8845 no-vmlinux no-vmlinux (no symbols) 10182 4.3247 postgres postgres LWLockRelease 9912 4.2100 libc-2.7.so postgres __GI_____strtoll_l_internal 9811 4.1671 postgres postgres hash_any 8824 3.7479 postgres postgres heap_formtuple 7459 3.1682 postgres postgres CopyReadLine 7187 3.0526 postgres postgres PageAddItem 6313 2.6814 libc-2.7.so postgres memset 5842 2.4813 postgres postgres PinBuffer 5230 2.2214 postgres postgres UnpinBuffer 5160 2.1917 postgres postgres heap_insert 4838 2.0549 postgres postgres ReadBuffer_common 4819 2.0468 postgres postgres hash_search_with_hash_value 4691 1.9925 postgres postgres MarkBufferDirty 3675 1.5609 libc-2.7.so postgres memchr 3617 1.5363 postgres postgres AllocSetAlloc 3585 1.5227 postgres postgres pg_verify_mbstr_len 3326 1.4127 postgres postgres AllocSetReset ... 
These tests were on a test server with a dual-core 64-bit Intel Xeons. I'd still like to hear reports from other platforms. Another thing that seems like an obvious win is to merge CopyReadLine and CopyReadAttributesText/CSV so that we do just one pass over the input. But that seems suspiciously obvious, I wonder if I'm missing something. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Heikki Linnakangas wrote: >> Attached is a patch that modifies CopyReadLineText so that it uses >> memchr to speed up the scan. The nice thing about memchr is that we >> can take advantage of any clever optimizations that might be in libc >> or compiler. > > Here's an updated version of the patch. The principle is the same, but > the same optimization is now used for CSV input as well, and there's > more comments. Another update attached: It occurred to me that the memchr approach is only safe for server encodings, where the non-first bytes of a multi-byte character always have the hi-bit set. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com Index: src/backend/commands/copy.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v retrieving revision 1.295 diff -c -r1.295 copy.c *** src/backend/commands/copy.c 1 Jan 2008 19:45:48 -0000 1.295 --- src/backend/commands/copy.c 5 Mar 2008 14:44:58 -0000 *************** *** 67,72 **** --- 67,275 ---- EOL_CRNL } EolType; + + /***** Fast byte searcher functions ***** + * + * CopyReadLine needs to search for newlines, as well as quoting and escape + * characters to determine which newlines are real and which ones are escaped. + * Doing that in the naive way, looping through the input byte by byte and + * comparing against the interesting characters, can be slow. + * + * To speed that up, we provide a "searcher" interface, that can be used to + * search a piece of memory for up to 4 different bytes simultaneously. It's + * similar to memchr, but allows searching for multiple chars at the same + * time. + * + * To start a search on a new block of memory, call init_searcher. Then call + * next_searcher repeatedly to return all the occurrances of the bytes being + * searched for. + * + * The searcher implementation uses memchr to search for the bytes, and keeps + * track of where the next occurrance of each is. Using memchr allows us + * to take advantage of any platform-specific optimizations. + */ + + /* + * Struct used to store state of the current search between calls to + * next_searcher. Initialized by init_search. + */ + typedef struct { + /* Each element in c corresponds the element in s. These are sorted + * by the pointers (ptr) */ + int n; /* elements used in the arrays */ + char c[4]; /* bytes we're looking for */ + char *ptr[4]; /* pointers to next occurances of the bytes */ + char *end; /* end of block being searched. 
Last byte to search is end-1 */ + } searcher; + + /* Functions internal to "searcher" */ + + #define swap_searcher_entries(searcher, a, b) do { \ + char tmpc = (searcher)->c[a]; \ + char *tmpptr = (searcher)->ptr[a]; \ + (searcher)->c[a] = (searcher)->c[b]; \ + (searcher)->ptr[a] = (searcher)->ptr[b]; \ + (searcher)->c[b] = tmpc; \ + (searcher)->ptr[b] = tmpptr; \ + } while(0) + + static void + sort_searcher(searcher *s) + { + switch(s->n) + { + case 0: + case 1: + /* Nothing to do */ + break; + case 2: + if (s->ptr[0] > s->ptr[1]) + swap_searcher_entries(s, 0, 1); + break; + case 3: + if (s->ptr[0] > s->ptr[2]) + swap_searcher_entries(s, 0, 2); + if (s->ptr[0] > s->ptr[1]) + swap_searcher_entries(s, 0, 1); + if (s->ptr[1] > s->ptr[2]) + swap_searcher_entries(s, 1, 2); + break; + case 4: + if (s->ptr[0] > s->ptr[1]) + swap_searcher_entries(s, 0, 1); + if (s->ptr[2] > s->ptr[3]) + swap_searcher_entries(s, 2, 3); + if (s->ptr[1] > s->ptr[2]) + swap_searcher_entries(s, 2, 3); + if (s->ptr[0] > s->ptr[1]) + swap_searcher_entries(s, 0, 1); + if (s->ptr[2] > s->ptr[3]) + swap_searcher_entries(s, 2, 3); + break; + } + } + + /* Remove the topmost element */ + static void + remove_top(searcher *searcher) + { + int i; + + searcher->n--; + + /* Move up the remaining items */ + for (i = 0; i < searcher->n; i++) + { + searcher->c[i] = searcher->c[i + 1]; + searcher->ptr[i] = searcher->ptr[i + 1]; + } + } + + /* + * The first element in the list has been replaced by the caller. + * Move it to the right place in the list, to keep it sorted. + */ + static void + sift_down(searcher *searcher) + { + if (searcher->n < 2 || searcher->ptr[0] <= searcher->ptr[1]) + return; + swap_searcher_entries(searcher, 0, 1); + + if (searcher->n < 3 || searcher->ptr[1] <= searcher->ptr[2]) + return; + swap_searcher_entries(searcher, 1, 2); + + if (searcher->n < 4 || searcher->ptr[2] <= searcher->ptr[3]) + return; + swap_searcher_entries(searcher, 2, 3); + } + + /* + * Starts a new search in a block of memory. + * + * searcher - for storing state between calls. Opaque to caller, + * init_searcher will initialize this. + * str - block of memory to search + * len - length of str + * c - bytes to search for, max 4. + * n - number of bytes in c + */ + static void + init_searcher(searcher *searcher, char *str, int len, char c[4], int n) + { + int i; + + Assert(n <= 4); + + searcher->n = 0; + searcher->end = str + len; + + /* Find first occurances of the searched-for bytes */ + for (i = 0; i < n; i++) + { + char *hit = memchr(str, c[i], len); + if (hit != NULL) + { + searcher->c[searcher->n] = c[i]; + searcher->ptr[searcher->n] = hit; + searcher->n++; + } + /* + * If the byte doesn't occur in the string at all, we don't + * need to put it in the list. + */ + } + sort_searcher(searcher); + } + + /* + * Look for the next interesting character + * + * searcher - state, opaque to caller + * str - block to search. + * + * 'str' must be part of the block passed to init_searcher, and >= + * the return value of last call to next_searcher. + * + * Returns the offset of the next interesting location, relative to 'ss' + */ + static char * + next_searcher(searcher *searcher, char *str) + { + Assert(str < searcher->end); + + if (searcher->n == 0) + return NULL; + + /* + * If the caller skipped over the next pointer we had memorized, we have to + * search for the one after that. 
+ */ + while (str > searcher->ptr[0]) + { + char *s = memchr(str, searcher->c[0], searcher->end - str); + + if (s == NULL) + { + remove_top(searcher); + if (searcher->n == 0) + return NULL; + } + else + { + searcher->ptr[0] = s; + sift_down(searcher); + } + } + + /* + * Return the next interesting location we had memorized. + */ + return searcher->ptr[0]; + } + + /* * This struct contains all the state variables used throughout a COPY * operation. For simplicity, we use the same struct for all variants of COPY, *************** *** 159,164 **** --- 362,369 ---- char *raw_buf; int raw_buf_index; /* next byte to process */ int raw_buf_len; /* total # of bytes stored */ + + searcher searcher; } CopyStateData; typedef CopyStateData *CopyState; *************** *** 2285,2299 **** --- 2490,2516 ---- last_was_esc = false; char quotec = '\0'; char escapec = '\0'; + char search_chars[4]; + int nsearch_chars = 0; + + search_chars[nsearch_chars++] = '\n'; + search_chars[nsearch_chars++] = '\r'; if (cstate->csv_mode) { quotec = cstate->quote[0]; escapec = cstate->escape[0]; + + search_chars[nsearch_chars++] = quotec; + /* ignore special escape processing if it's the same as quotec */ if (quotec == escapec) escapec = '\0'; + else + search_chars[nsearch_chars++] = escapec; } + else + search_chars[nsearch_chars++] = '\\'; mblen_str[1] = '\0'; *************** *** 2360,2368 **** --- 2577,2603 ---- break; } need_data = false; + + if (!cstate->encoding_embeds_ascii) + init_searcher(&cstate->searcher, copy_raw_buf, copy_buf_len, + search_chars, nsearch_chars); } /* OK to fetch a character */ + if ((!cstate->csv_mode || !first_char_in_line) + && !cstate->encoding_embeds_ascii + && (raw_buf_ptr + 4 <= copy_buf_len)) + { + char *next = next_searcher(&cstate->searcher, + ©_raw_buf[raw_buf_ptr]); + + if (next != NULL) + { + raw_buf_ptr = next - copy_raw_buf; + first_char_in_line = false; + } + } + prev_raw_ptr = raw_buf_ptr; c = copy_raw_buf[raw_buf_ptr++];
Heikki Linnakangas wrote: > > > So the overhead of using memchr slows us down if there's a lot of > escape or quote characters. The breakeven point seems to be about 1 in > 8 characters. I'm not sure if that's a good tradeoff or not... > > How about we test the first buffer read in from the file and change strategy accordingly? cheers andrew
Heikki Linnakangas wrote: > Heikki Linnakangas wrote: >> Heikki Linnakangas wrote: >>> Attached is a patch that modifies CopyReadLineText so that it uses >>> memchr to speed up the scan. The nice thing about memchr is that we >>> can take advantage of any clever optimizations that might be in libc >>> or compiler. >> >> Here's an updated version of the patch. The principle is the same, >> but the same optimization is now used for CSV input as well, and >> there's more comments. > > Another update attached: It occurred to me that the memchr approach is > only safe for server encodings, where the non-first bytes of a > multi-byte character always have the hi-bit set. > We currently make the following assumption in the code: * These four characters, and the CSV escape and quote characters, are * assumed the same in frontend and backend encodings. * The four characters are the carriage return, line feed, backslash and dot. I think the requirement might well actually be somewhat stronger than that: i.e. that none of these will appear as a non-first byte in any multi-byte client encoding character. If that's right, then we should be able to write CopyReadLineText without bothering about multi-byte chars. If it's not right then I suspect we have some cases that can fail now anyway. (I believe some client encodings at least use backslash in subsequent chars, and that's a nasty one because the "\." end sequence is hard coded). I believe all the chars up to 0x2f are safe - that includes both quote chars and dot) cheers andrew
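To make the concern about backslash appearing in non-first bytes concrete, here is a tiny standalone example. The two bytes 0x95 0x5C form a single double-byte Shift-JIS character, and the trailing byte happens to be the ASCII backslash, so a raw memchr() over such client-encoded data would report a backslash that is really half of another character. This is exactly why the fast path in the patch above is restricted to encodings whose non-first bytes always have the high bit set.

    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        /* 0x95 0x5C is a single double-byte Shift-JIS character */
        const char  sjis[] = { (char) 0x95, 0x5C, ' ', 'x', '\n' };
        const char *hit = memchr(sjis, '\\', sizeof(sjis));

        if (hit != NULL)
            printf("memchr sees a \"backslash\" at offset %ld, but it is the "
                   "trailing byte of a multi-byte character\n",
                   (long) (hit - sjis));
        return 0;
    }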
Andrew Dunstan wrote: > Heikki Linnakangas wrote: >> Another update attached: It occurred to me that the memchr approach is >> only safe for server encodings, where the non-first bytes of a >> multi-byte character always have the hi-bit set. >> > > We currently make the following assumption in the code: > > * These four characters, and the CSV escape and quote characters, are > * assumed the same in frontend and backend encodings. > * > > The four characters are the carriage return, line feed, backslash and dot. > > I think the requirement might well actually be somewhat stronger than > that: i.e. that none of these will appear as a non-first byte in any > multi-byte client encoding character. If that's right, then we should be > able to write CopyReadLineText without bothering about multi-byte chars. > If it's not right then I suspect we have some cases that can fail now > anyway. No, we don't require that, and we do handle it correctly. We use pg_encoding_mblen to determine the length of each character in CopyReadLineText when the encoding is a client-only encoding, and only look at the first byte of each character. In CopyReadAttributesText, where we have a similar loop, we've already transformed the input to server encoding. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > Andrew Dunstan wrote: >> We currently make the following assumption in the code: >> >> * These four characters, and the CSV escape and quote characters, are >> * assumed the same in frontend and backend encodings. >> >> The four characters are the carriage return, line feed, backslash and dot. >> >> I think the requirement might well actually be somewhat stronger than >> that: i.e. that none of these will appear as a non-first byte in any >> multi-byte client encoding character. > No, we don't require that, and we do handle it correctly. I believe we *have to* handle it correctly, because backslash actually does appear as a non-first byte in SJIS or some other Far Eastern encoding. In the CSV case such a restriction would be untenable anyway. BTW, I notice that the code allows CSV escape and quote characters that have the high bit set (in single-byte server encodings that is). Is this a good idea? It seems like such are extremely unlikely to be the same in two different encodings. Maybe we should restrict to the ASCII range? Especially if the client encoding is multibyte ... regards, tom lane
Tom Lane wrote: > BTW, I notice that the code allows CSV escape and quote characters that > have the high bit set (in single-byte server encodings that is). Is > this a good idea? It seems like such are extremely unlikely to be the > same in two different encodings. Maybe we should restrict to the ASCII > range? Especially if the client encoding is multibyte ... > > > In the commonest case these are both either " or '. I would not have any objection to requiring them to be ASCII chars. cheers andrew
Heikki Linnakangas wrote: > Andrew Dunstan wrote: >> Heikki Linnakangas wrote: >>> Another update attached: It occurred to me that the memchr approach is >>> only safe for server encodings, where the non-first bytes of a >>> multi-byte character always have the hi-bit set. >>> >> >> We currently make the following assumption in the code: >> >> * These four characters, and the CSV escape and quote characters, >> are >> * assumed the same in frontend and backend encodings. >> * >> >> The four characters are the carriage return, line feed, backslash and >> dot. >> >> I think the requirement might well actually be somewhat stronger than >> that: i.e. that none of these will appear as a non-first byte in any >> multi-byte client encoding character. If that's right, then we should >> be able to write CopyReadLineText without bothering about multi-byte >> chars. If it's not right then I suspect we have some cases that can >> fail now anyway. > > No, we don't require that, and we do handle it correctly. We use > pg_encoding_mblen to determine the length of each character in > CopyReadLineText when the encoding is a client-only encoding, and only > look at the first byte of each character. In CopyReadAttributesText, > where we have a similar loop, we've already transformed the input to > server encoding. Oops. I see that now. Funny how I missed it when I went looking for it :-( I think I understand the patch now :-) I'm still a bit worried about applying it unless it gets some adaptive behaviour or something so that we don't cause any serious performance regressions in some cases. Also, could we perhaps benefit from inlining some calls, or is your compiler doing that anyway? cheers andrew
Tom Lane wrote: > BTW, I notice that the code allows CSV escape and quote characters that > have the high bit set (in single-byte server encodings that is). Is > this a good idea? It seems like such are extremely unlikely to be the > same in two different encodings. Maybe we should restrict to the ASCII > range? Especially if the client encoding is multibyte ... At least many of the ISO-8859-* encodings have many common non-ascii characters, and there's no problem if the client_ and server_encodings match. But it does seem risky to allow it if we can't detect and throw an error on the non-safe cases. Perhaps we could translate the chars from client to server encoding? If the client encoding is a multibyte one, then we certainly should elog(ERROR) if you try to do that. Though from a practical point of view, I doubt anyone would mind if we just always restricted them to ASCII range... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Andrew Dunstan wrote: > I'm still a bit worried about applying it unless it gets some adaptive > behaviour or something so that we don't cause any serious performance > regressions in some cases. I'll try to come up with something. At the most conservative end, we could fall back to the current method on the first escape, quote or backslash character. > Also, could we perhaps benefit from inlining > some calls, or is your compiler doing that anyway? gcc does inline all static functions that are only called from one site, and small functions, using some heuristic. I don't think more aggressive inlining would help. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Andrew Dunstan wrote: >> I'm still a bit worried about applying it unless it gets some >> adaptive behaviour or something so that we don't cause any serious >> performance regressions in some cases. > > I'll try to come up with something. At the most conservative end, we > could fall back to the current method on the first escape, quote or > backslash character. > > That's far too conservative, I think. Somewhere a bit short of your observed breakeven point seems about right. cheers andrew
Andrew Dunstan wrote: > Heikki Linnakangas wrote: >> Andrew Dunstan wrote: >>> I'm still a bit worried about applying it unless it gets some >>> adaptive behaviour or something so that we don't cause any serious >>> performance regressions in some cases. >> >> I'll try to come up with something. At the most conservative end, we >> could fall back to the current method on the first escape, quote or >> backslash character. > > That's far too conservative, I think. Somewhere a bit short of your > observed breakeven point seems about right. The problem is, you don't know how many "stop" characters there are until you've counted them. We could fall back after X such characters, or only start using memchr after seeing 8 consecutive non-stop characters. Whatever we choose, the heuristic needs to be very simple and fast to check, otherwise we just introduce more overhead trying to decide which method to use. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 6 Mar 2008, Heikki Linnakangas wrote: > At the most conservative end, we could fall back to the current method > on the first escape, quote or backslash character. I would just count the number of escape/quote characters on each line, and then at the end of the line switch modes between the current code and the new version based on what the previous line looked like. That way the only additional overhead is a small bit when escapes show up often, plus a touch more just once per line. Barely noticeable in the case where nothing is escaped, a very small regression for escape-heavy stuff, but certainly better than the drop you reported in the last rev of this patch. Rev two of that design would keep a weighted moving average of the total number of escaped characters per line (say wma=(7*wma+current)/8) and switch modes based on that instead of just the previous line. There's enough play in the region where the two approaches trade off that it should be easy enough to get a decent transition. Based on your data I would put the transition at wma>4, which should keep the old code in play even if only half the lines have the bad regression that shows up with >8 escapes per line. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
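A sketch of what that per-line bookkeeping could look like. The names (ScanHeuristic, wma, USE_MEMCHR_THRESHOLD) are invented for illustration only; in a real patch this state would presumably live in CopyStateData and be updated once per completed line.

    #include <stdbool.h>

    #define USE_MEMCHR_THRESHOLD 4      /* wma > 4 => fall back to the old code */

    typedef struct
    {
        int     wma;            /* weighted moving average of escapes per line */
        bool    use_memchr;     /* strategy to use for the next line */
    } ScanHeuristic;

    /* Call once per completed line with the number of escape/quote bytes seen. */
    void
    line_finished(ScanHeuristic *h, int escapes_this_line)
    {
        /* wma = (7*wma + current)/8, as suggested above */
        h->wma = (7 * h->wma + escapes_this_line) / 8;
        h->use_memchr = (h->wma <= USE_MEMCHR_THRESHOLD);
    }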
Greg Smith wrote: > On Thu, 6 Mar 2008, Heikki Linnakangas wrote: > >> At the most conservative end, we could fall back to the current >> method on the first escape, quote or backslash character. > > I would just count the number of escaped/quote characters on each > line, and then at the end of the line switch modes between the current > code on the new version based on what the previous line looked like. > That way the only additional overhead is a small bit only when escapes > show up often, plus a touch more just once per line. Barely noticable > in the case where nothing is escaped, very small regression for > escape-heavy stuff but certainly better than the drop you reported in > the last rev of this patch. > > Rev two of that design would keep a weighted moving average of the > total number of escaped characters per line (say > wma=(7*wma+current)/8) and switch modes based on that instead of the > previous one. There's enough play in the transition between where the > two approaches work better at that this should be easy enough to get a > decent transition between. Based on your data I would put the > transition at wma>4, which should keep the old code in play even if > only half the lines have the bad regression that shows up with >8 > escapes per line. > > I'd be inclined just to look at the first buffer of data we read in, and make a one-off decision there, if we can get away with it. Then the cost of testing is fixed rather than per line. cheers andrew
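One possible shape for that one-off decision, again only a sketch with invented names: count the stop bytes in the first buffer and compare against the roughly 1-in-8 breakeven measured earlier in the thread.

    #include <stdbool.h>
    #include <string.h>

    /* Hypothetical helper: true if the memchr-based scan looks worthwhile. */
    bool
    first_buffer_prefers_memchr(const char *buf, int len,
                                const char *stop_chars, int nstop_chars)
    {
        int     hits = 0;
        int     i;

        for (i = 0; i < nstop_chars; i++)
        {
            const char *p = buf;
            int         left = len;
            const char *hit;

            while ((hit = memchr(p, stop_chars[i], left)) != NULL)
            {
                hits++;
                left -= (hit - p) + 1;
                p = hit + 1;
            }
        }

        /* memchr won above when fewer than ~1 in 8 bytes were stop characters */
        return hits * 8 < len;
    }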
Heikki Linnakangas wrote: > Andrew Dunstan wrote: >> I'm still a bit worried about applying it unless it gets some >> adaptive behaviour or something so that we don't cause any serious >> performance regressions in some cases. > > I'll try to come up with something. At the most conservative end, we > could fall back to the current method on the first escape, quote or > backslash character. > >> Also, could we perhaps benefit from inlining some calls, or is your >> compiler doing that anyway? > > gcc does inline all static functions that are only called from one > site, and small functions, using some heuristic. I don't think more > aggressive inlining would help. > Another question that occurred to me - did you try using strpbrk() to look for the next interesting character rather than your homegrown searcher gadget? If so, how did that perform? cheers andrew
Andrew Dunstan wrote: > > > Heikki Linnakangas wrote: >> Andrew Dunstan wrote: >>> I'm still a bit worried about applying it unless it gets some >>> adaptive behaviour or something so that we don't cause any serious >>> performance regressions in some cases. >> >> I'll try to come up with something. At the most conservative end, we >> could fall back to the current method on the first escape, quote or >> backslash character. >> >>> Also, could we perhaps benefit from inlining some calls, or is your >>> compiler doing that anyway? >> >> gcc does inline all static functions that are only called from one >> site, and small functions, using some heuristic. I don't think more >> aggressive inlining would help. >> > > Another question that occurred to me - did you try using strpbrk() to > look for the next interesting character rather than your homegrown > searcher gadget? If so, how did that perform? I haven't tried that. There's a small difference: strpbrk stops at '\0'. But come to think of it, I guess it doesn't matter. Will test... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Andrew Dunstan wrote: > Another question that occurred to me - did you try using strpbrk() to > look for the next interesting character rather than your homegrown > searcher gadget? If so, how did that perform? It looks like strpbrk() performs poorly: unpatched: testname | min duration ----------+----------------- all | 00:00:08.099656 1/2 | 00:00:06.734241 1/4 | 00:00:06.016946 1/8 | 00:00:05.622122 1/16 | 00:00:05.304252 none | 00:00:05.155755 (6 rows) strpbrk: testname | min duration ----------+----------------- all | 00:00:22.980997 1/2 | 00:00:13.724834 1/4 | 00:00:08.980246 1/8 | 00:00:06.582357 1/16 | 00:00:05.291485 none | 00:00:06.239468 (6 rows) memchr: testname | min duration ----------+----------------- all | 00:00:13.684479 1/2 | 00:00:09.509213 1/4 | 00:00:06.921959 1/8 | 00:00:05.654761 1/16 | 00:00:04.719027 none | 00:00:03.718361 (6 rows) Attached is the test script and patches I used, if someone wants to test this on another platform. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com Index: src/backend/commands/copy.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v retrieving revision 1.295 diff -c -r1.295 copy.c *** src/backend/commands/copy.c 1 Jan 2008 19:45:48 -0000 1.295 --- src/backend/commands/copy.c 5 Mar 2008 14:44:58 -0000 *************** *** 67,72 **** --- 67,275 ---- EOL_CRNL } EolType; + + /***** Fast byte searcher functions ***** + * + * CopyReadLine needs to search for newlines, as well as quoting and escape + * characters to determine which newlines are real and which ones are escaped. + * Doing that in the naive way, looping through the input byte by byte and + * comparing against the interesting characters, can be slow. + * + * To speed that up, we provide a "searcher" interface, that can be used to + * search a piece of memory for up to 4 different bytes simultaneously. It's + * similar to memchr, but allows searching for multiple chars at the same + * time. + * + * To start a search on a new block of memory, call init_searcher. Then call + * next_searcher repeatedly to return all the occurrances of the bytes being + * searched for. + * + * The searcher implementation uses memchr to search for the bytes, and keeps + * track of where the next occurrance of each is. Using memchr allows us + * to take advantage of any platform-specific optimizations. + */ + + /* + * Struct used to store state of the current search between calls to + * next_searcher. Initialized by init_search. + */ + typedef struct { + /* Each element in c corresponds the element in s. These are sorted + * by the pointers (ptr) */ + int n; /* elements used in the arrays */ + char c[4]; /* bytes we're looking for */ + char *ptr[4]; /* pointers to next occurances of the bytes */ + char *end; /* end of block being searched. 
Last byte to search is end-1 */ + } searcher; + + /* Functions internal to "searcher" */ + + #define swap_searcher_entries(searcher, a, b) do { \ + char tmpc = (searcher)->c[a]; \ + char *tmpptr = (searcher)->ptr[a]; \ + (searcher)->c[a] = (searcher)->c[b]; \ + (searcher)->ptr[a] = (searcher)->ptr[b]; \ + (searcher)->c[b] = tmpc; \ + (searcher)->ptr[b] = tmpptr; \ + } while(0) + + static void + sort_searcher(searcher *s) + { + switch(s->n) + { + case 0: + case 1: + /* Nothing to do */ + break; + case 2: + if (s->ptr[0] > s->ptr[1]) + swap_searcher_entries(s, 0, 1); + break; + case 3: + if (s->ptr[0] > s->ptr[2]) + swap_searcher_entries(s, 0, 2); + if (s->ptr[0] > s->ptr[1]) + swap_searcher_entries(s, 0, 1); + if (s->ptr[1] > s->ptr[2]) + swap_searcher_entries(s, 1, 2); + break; + case 4: + if (s->ptr[0] > s->ptr[1]) + swap_searcher_entries(s, 0, 1); + if (s->ptr[2] > s->ptr[3]) + swap_searcher_entries(s, 2, 3); + if (s->ptr[1] > s->ptr[2]) + swap_searcher_entries(s, 2, 3); + if (s->ptr[0] > s->ptr[1]) + swap_searcher_entries(s, 0, 1); + if (s->ptr[2] > s->ptr[3]) + swap_searcher_entries(s, 2, 3); + break; + } + } + + /* Remove the topmost element */ + static void + remove_top(searcher *searcher) + { + int i; + + searcher->n--; + + /* Move up the remaining items */ + for (i = 0; i < searcher->n; i++) + { + searcher->c[i] = searcher->c[i + 1]; + searcher->ptr[i] = searcher->ptr[i + 1]; + } + } + + /* + * The first element in the list has been replaced by the caller. + * Move it to the right place in the list, to keep it sorted. + */ + static void + sift_down(searcher *searcher) + { + if (searcher->n < 2 || searcher->ptr[0] <= searcher->ptr[1]) + return; + swap_searcher_entries(searcher, 0, 1); + + if (searcher->n < 3 || searcher->ptr[1] <= searcher->ptr[2]) + return; + swap_searcher_entries(searcher, 1, 2); + + if (searcher->n < 4 || searcher->ptr[2] <= searcher->ptr[3]) + return; + swap_searcher_entries(searcher, 2, 3); + } + + /* + * Starts a new search in a block of memory. + * + * searcher - for storing state between calls. Opaque to caller, + * init_searcher will initialize this. + * str - block of memory to search + * len - length of str + * c - bytes to search for, max 4. + * n - number of bytes in c + */ + static void + init_searcher(searcher *searcher, char *str, int len, char c[4], int n) + { + int i; + + Assert(n <= 4); + + searcher->n = 0; + searcher->end = str + len; + + /* Find first occurances of the searched-for bytes */ + for (i = 0; i < n; i++) + { + char *hit = memchr(str, c[i], len); + if (hit != NULL) + { + searcher->c[searcher->n] = c[i]; + searcher->ptr[searcher->n] = hit; + searcher->n++; + } + /* + * If the byte doesn't occur in the string at all, we don't + * need to put it in the list. + */ + } + sort_searcher(searcher); + } + + /* + * Look for the next interesting character + * + * searcher - state, opaque to caller + * str - block to search. + * + * 'str' must be part of the block passed to init_searcher, and >= + * the return value of last call to next_searcher. + * + * Returns the offset of the next interesting location, relative to 'ss' + */ + static char * + next_searcher(searcher *searcher, char *str) + { + Assert(str < searcher->end); + + if (searcher->n == 0) + return NULL; + + /* + * If the caller skipped over the next pointer we had memorized, we have to + * search for the one after that. 
+ */ + while (str > searcher->ptr[0]) + { + char *s = memchr(str, searcher->c[0], searcher->end - str); + + if (s == NULL) + { + remove_top(searcher); + if (searcher->n == 0) + return NULL; + } + else + { + searcher->ptr[0] = s; + sift_down(searcher); + } + } + + /* + * Return the next interesting location we had memorized. + */ + return searcher->ptr[0]; + } + + /* * This struct contains all the state variables used throughout a COPY * operation. For simplicity, we use the same struct for all variants of COPY, *************** *** 159,164 **** --- 362,369 ---- char *raw_buf; int raw_buf_index; /* next byte to process */ int raw_buf_len; /* total # of bytes stored */ + + searcher searcher; } CopyStateData; typedef CopyStateData *CopyState; *************** *** 2285,2299 **** --- 2490,2516 ---- last_was_esc = false; char quotec = '\0'; char escapec = '\0'; + char search_chars[4]; + int nsearch_chars = 0; + + search_chars[nsearch_chars++] = '\n'; + search_chars[nsearch_chars++] = '\r'; if (cstate->csv_mode) { quotec = cstate->quote[0]; escapec = cstate->escape[0]; + + search_chars[nsearch_chars++] = quotec; + /* ignore special escape processing if it's the same as quotec */ if (quotec == escapec) escapec = '\0'; + else + search_chars[nsearch_chars++] = escapec; } + else + search_chars[nsearch_chars++] = '\\'; mblen_str[1] = '\0'; *************** *** 2360,2368 **** --- 2577,2603 ---- break; } need_data = false; + + if (!cstate->encoding_embeds_ascii) + init_searcher(&cstate->searcher, copy_raw_buf, copy_buf_len, + search_chars, nsearch_chars); } /* OK to fetch a character */ + if ((!cstate->csv_mode || !first_char_in_line) + && !cstate->encoding_embeds_ascii + && (raw_buf_ptr + 4 <= copy_buf_len)) + { + char *next = next_searcher(&cstate->searcher, + ©_raw_buf[raw_buf_ptr]); + + if (next != NULL) + { + raw_buf_ptr = next - copy_raw_buf; + first_char_in_line = false; + } + } + prev_raw_ptr = raw_buf_ptr; c = copy_raw_buf[raw_buf_ptr++]; Index: src/backend/commands/copy.c =================================================================== RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v retrieving revision 1.296 diff -c -r1.296 copy.c *** src/backend/commands/copy.c 8 Mar 2008 01:16:26 -0000 1.296 --- src/backend/commands/copy.c 9 Mar 2008 09:34:13 -0000 *************** *** 2285,2299 **** --- 2285,2311 ---- last_was_esc = false; char quotec = '\0'; char escapec = '\0'; + char search_chars[5] = { 0, 0, 0, 0, 0 }; + int nsearch_chars = 0; + + search_chars[nsearch_chars++] = '\n'; + search_chars[nsearch_chars++] = '\r'; if (cstate->csv_mode) { quotec = cstate->quote[0]; escapec = cstate->escape[0]; + + search_chars[nsearch_chars++] = quotec; + /* ignore special escape processing if it's the same as quotec */ if (quotec == escapec) escapec = '\0'; + else + search_chars[nsearch_chars++] = escapec; } + else + search_chars[nsearch_chars++] = '\\'; mblen_str[1] = '\0'; *************** *** 2363,2368 **** --- 2375,2394 ---- } /* OK to fetch a character */ + if ((!cstate->csv_mode || !first_char_in_line) + && !cstate->encoding_embeds_ascii + && (raw_buf_ptr + 4 <= copy_buf_len)) + { + + char *next = strpbrk(©_raw_buf[raw_buf_ptr], search_chars); + + if (next != NULL) + { + raw_buf_ptr = next - copy_raw_buf; + first_char_in_line = false; + } + } + prev_raw_ptr = raw_buf_ptr; c = copy_raw_buf[raw_buf_ptr++];
Heikki Linnakangas wrote: > Andrew Dunstan wrote: >> Another question that occurred to me - did you try using strpbrk() to >> look for the next interesting character rather than your homegrown >> searcher gadget? If so, how did that perform? > > It looks like strpbrk() performs poorly: Yes, not surprising. I just looked at the implementation in glibc, which I assume you are using, and it seemed rather basic. The one in NetBSD's libc looks much more efficient. See http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/libc/string/strpbrk.c?rev=1.1.2.1&content-type=text/plain&cvsroot=glibc and http://cvsweb.de.netbsd.org/cgi-bin/cvsweb.cgi/src/lib/libc/string/strpbrk.c?rev=1.16;content-type=text%2Fx-cvsweb-markup Not that what you've done isn't good, if a little obscure (as is the NetBSD implementation) cheers andrew
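For context, here is a generic table-driven scan of the kind an efficient strpbrk-style implementation can use, shown purely as an illustration of the technique rather than as a copy of either libc's code: it pays a fixed setup cost per call and then a single table lookup per byte, whereas the memchr-based searcher can skip long stretches without touching every byte at all.

    #include <stddef.h>
    #include <string.h>

    /* Return a pointer to the first stop character in buf, or NULL. */
    const char *
    scan_with_table(const char *buf, size_t len, const char *stop_chars)
    {
        unsigned char   is_stop[256];
        size_t          i;

        memset(is_stop, 0, sizeof(is_stop));
        for (i = 0; stop_chars[i] != '\0'; i++)
            is_stop[(unsigned char) stop_chars[i]] = 1;

        for (i = 0; i < len; i++)
        {
            if (is_stop[(unsigned char) buf[i]])
                return buf + i;
        }
        return NULL;
    }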
Andrew Dunstan wrote: > Heikki Linnakangas wrote: >> It looks like strpbrk() performs poorly: > > Yes, not surprising. I just looked at the implementation in glibc, which > I assume you are using, and it seemed rather basic. The one in NetBSD's > libc looks much more efficient. > > See > > http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/libc/string/strpbrk.c?rev=1.1.2.1&content-type=text/plain&cvsroot=glibc There's architecture-specific optimized versions of that in glibc, though. For example, for x86_64: http://sourceware.org/cgi-bin/cvsweb.cgi/~checkout~/libc/sysdeps/x86_64/strspn.S?rev=1.4&cvsroot=glibc > Not that what you've done isn't good, if a little obscure (as is the > NetBSD implementation) Yeah, it's a lot of code for such a simple thing :-(. On the other hand, it's very isolated, so it's not that bad. We chatted about this with Greg Stark yesterday, and we came up with a few more observations and ideas: 1. CopyReadLineText is all about finding the the next end of line; splitting to fields is done later. We therefore only care about quotes and escapes when they affect the end of line detection. In text mode, we only need to care about a backslash that precedes a LF/CR. Therefore, we could search for the next LF/CR character with memchr(), and check if the preceding character is a backslash (and if it is, check if there's yet another backslash behind it, and so forth until we hit a non-backslash character). 2. The manual has this to say about escaping newlines with a backslash: "It is strongly recommended that applications generating COPY data convert data newlines and carriage returns to the \n and \r sequences respectively. At present it is possible to represent a data carriage return by a backslash and carriage return, and to represent a data newline by a backslash and newline. However, these representations might not be accepted in future releases." If we disallowed the backslash+LF sequence, like that paragraph suggests we could, we wouldn't need to care about backslashes in CopyReadLineText in text mode at all. 3. If we did the client->server encoding conversion before CopyReadLineText, one input block at a time: - We could skip calling pg_encoding_mblen for each character, when the client encoding is one not supported as a server encoding. That would be a big performance gain when that's the case, and would simplify the code in CopyReadLineText a little bit as well. - The conversion might be faster, as it could operate on bigger blocks - We could merge the CopyReadLine and CopyReadAttributes functions, and split the input block into fields in one loop. I would imagine that being more efficient than scanning the input twice, no matter how fast we make those scans. There's a small difficulty with doing the conversion one input block at a time: there's no suitable interface in the conversion routines for that. The input block boundary can be between 1st and 2nd byte of a multi-byte character, so we'd need to have a function that converts a block of bytes, ignoring any incomplete bytes at the end. I think I'll experiment with the 1st idea, to see how much code restructuring it needs and what the performance is like. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
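A rough sketch of what idea 1 could look like; CR handling, CSV mode and buffer refills are left out, the function name is invented, and an LF preceded by an odd-length run of backslashes is treated as escaped.

    #include <stddef.h>
    #include <string.h>

    const char *
    find_unescaped_newline(const char *buf, size_t len)
    {
        const char *p = buf;
        size_t      left = len;

        while (left > 0)
        {
            const char *nl = memchr(p, '\n', left);
            const char *b;
            size_t      nbackslashes = 0;

            if (nl == NULL)
                return NULL;        /* no newline in this buffer */

            /* count the run of backslashes immediately preceding the LF */
            for (b = nl; b > buf && b[-1] == '\\'; b--)
                nbackslashes++;

            if (nbackslashes % 2 == 0)
                return nl;          /* even run: the newline is real */

            /* odd run: this newline is escaped, keep searching */
            left -= (nl - p) + 1;
            p = nl + 1;
        }
        return NULL;
    }

With this shape the backslashes only have to be examined next to a newline, so the common case of long lines without escapes is handled entirely by memchr.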