Thread: CopyReadLineText optimization

CopyReadLineText optimization

From
"Heikki Linnakangas"
Date:
The purpose of CopyReadLineText is to scan the input buffer, and find
the next newline, taking into account any escape characters. It
currently operates in a loop, one byte at a time, searching for LF, CR,
or a backslash. That's a bit slow: I've been running oprofile on COPY,
and I've seen CopyReadLine to take around ~10% of the CPU time, and
Joshua Drake just posted a very similar profile to hackers.

Attached is a patch that modifies CopyReadLineText so that it uses
memchr to speed up the scan. The nice thing about memchr is that we can
take advantage of any clever optimizations that might be in libc or
compiler.

In the tests I've been running, it roughly halves the time spent in
CopyReadLine (including the new memchr calls), thus reducing the total
CPU overhead by ~5%. I'm planning to run more tests with data that has
backslashes and with different width tables to see what the worst-case
and best-case performance is like. Also, it doesn't work for CSV format
at the moment; that needs to be fixed.

5% isn't exactly breathtaking, but it's a start. I tried the same trick
to CopyReadAttributesText, but unfortunately it doesn't seem to help
there because you need to "stop" the efficient word-at-a-time scan that
memchr does (at least with glibc, YMMV) whenever there's a column
separator, while in CopyReadLineText you get to process the whole line
in one call, assuming there's no backslashes.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/commands/copy.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.295
diff -c -r1.295 copy.c
*** src/backend/commands/copy.c    1 Jan 2008 19:45:48 -0000    1.295
--- src/backend/commands/copy.c    24 Feb 2008 00:40:09 -0000
***************
*** 67,72 ****
--- 67,77 ----
      EOL_CRNL
  } EolType;

+ typedef struct {
+     char c;        /* byte we're looking for */
+     char *s;    /* pointer to next occurance of this byte */
+ } searchbyte;
+
  /*
   * This struct contains all the state variables used throughout a COPY
   * operation. For simplicity, we use the same struct for all variants of COPY,
***************
*** 159,164 ****
--- 164,171 ----
      char       *raw_buf;
      int            raw_buf_index;    /* next byte to process */
      int            raw_buf_len;    /* total # of bytes stored */
+
+     searchbyte    sb[3];            /* pointers to next "interesting" characters in raw_buf */
  } CopyStateData;

  typedef CopyStateData *CopyState;
***************
*** 170,175 ****
--- 177,283 ----
      CopyState    cstate;            /* CopyStateData for the command */
  } DR_copy;

+ #define swap_sb(a,b) do { \
+     searchbyte c = (a); \
+     (a) = (b); \
+     (b) = (c); \
+ } while(0)
+
+ /* is a < b? NULLs are last */
+ #define cmp_sb(a,b) ((a).s < (b).s && (a).s != NULL)
+
+ static void
+ sort_sb(searchbyte *sb)
+ {
+     if (cmp_sb(sb[2], sb[1]))
+         swap_sb(sb[1], sb[2]);
+     if (cmp_sb(sb[1], sb[0]))
+         swap_sb(sb[0], sb[1]);
+     if (cmp_sb(sb[2], sb[1]))
+         swap_sb(sb[1], sb[2]);
+ }
+
+ /*
+  * Starts a new search in a string
+  */
+ static void
+ init_interesting(searchbyte *sb, char *str, int len)
+ {
+     sb[0].c = '\n';
+     sb[1].c = '\r';
+     sb[2].c = '\\';
+     sb[0].s = memchr(str, '\n', len);
+     sb[1].s = memchr(str, '\r', len);
+     sb[2].s = memchr(str, '\\', len);
+     sort_sb(sb);
+ }
+
+ /*
+  * Look for the next interesting character
+  *
+  * sb - array of three searchbytes.
+  * ss - string to search
+  * len_left - number of bytes left in ss to search
+  *
+  * Returns the offset of the next interesting location, relative to 'ss'
+  */
+ static int
+ next_interesting(searchbyte *sb, char *ss, int len_left)
+ {
+     char *s;
+
+     if (sb[0].s == NULL)
+         return 0;
+
+     /*
+      * The caller skipped over the next pointer we had memorized, so we have to
+      * search for the one after that.
+      */
+     if (ss > sb[0].s)
+     {
+         sb[0].s = memchr(ss, sb[0].c, len_left);
+         if (sb[0].s == NULL)
+             return 0; /* we're done */
+         if (sb[1].s != NULL)
+         {
+             if (ss > sb[1].s)
+             {
+                 /*
+                  * The caller skipped over the 2nd pointer we had memorized
+                  * as well. Have to search for that as well
+                  */
+                 sb[1].s = memchr(ss, sb[1].c, len_left);
+                 if (sb[2].s != NULL && ss > sb[2].s)
+                 {
+                     /* Caller skipped over 3rd pointer as well... */
+                     sb[2].s = memchr(ss, sb[2].c, len_left);
+                 }
+             }
+             sort_sb(sb);
+         }
+     }
+
+     /*
+      * Return the next interesting location we had memorized, and search
+      * for the next occurance of that byte.
+      */
+     s = sb[0].s;
+
+     sb[0].s = memchr(s+1, sb[0].c, (ss+len_left) - (s + 1));
+
+     /*
+      * Bubble down the next location into the three-element list we have,
+      */
+     if (cmp_sb(sb[1], sb[0]))
+     {
+         swap_sb(sb[0], sb[1]);
+         if (cmp_sb(sb[2], sb[1]))
+             swap_sb(sb[1], sb[2]);
+     }
+
+     return s - ss;
+ }
+

  /*
   * These macros centralize code used to process line_buf and raw_buf buffers.
***************
*** 2350,2355 ****
--- 2458,2465 ----
              raw_buf_ptr = 0;
              copy_buf_len = cstate->raw_buf_len;

+             init_interesting(cstate->sb, copy_raw_buf, copy_buf_len);
+
              /*
               * If we are completely out of data, break out of the loop,
               * reporting EOF.
***************
*** 2363,2368 ****
--- 2473,2495 ----
          }

          /* OK to fetch a character */
+         if (!cstate->csv_mode)
+         {
+             int skip = next_interesting(cstate->sb,
+                                         ©_raw_buf[raw_buf_ptr],
+                                         copy_buf_len - raw_buf_ptr);
+
+
+             if (skip != 0)
+             {
+                 raw_buf_ptr += skip;
+                 first_char_in_line = false;
+
+                 prev_raw_ptr = raw_buf_ptr - 1;
+                 IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(1);
+             }
+         }
+
          prev_raw_ptr = raw_buf_ptr;
          c = copy_raw_buf[raw_buf_ptr++];


Re: CopyReadLineText optimization

From
"Luke Lonergan"
Date:
Cool!  It's been a while since we've done the same kind of thing :-)

- Luke

> -----Original Message-----
> From: pgsql-patches-owner@postgresql.org
> [mailto:pgsql-patches-owner@postgresql.org] On Behalf Of
> Heikki Linnakangas
> Sent: Saturday, February 23, 2008 5:30 PM
> To: pgsql-patches@postgresql.org
> Subject: [PATCHES] CopyReadLineText optimization
>
> The purpose of CopyReadLineText is to scan the input buffer,
> and find the next newline, taking into account any escape
> characters. It currently operates in a loop, one byte at a
> time, searching for LF, CR, or a backslash. That's a bit
> slow: I've been running oprofile on COPY, and I've seen
> CopyReadLine to take around ~10% of the CPU time, and Joshua
> Drake just posted a very similar profile to hackers.
>
> Attached is a patch that modifies CopyReadLineText so that it
> uses memchr to speed up the scan. The nice thing about memchr
> is that we can take advantage of any clever optimizations
> that might be in libc or compiler.
>
> In the tests I've been running, it roughly halves the time
> spent in CopyReadLine (including the new memchr calls), thus
> reducing the total CPU overhead by ~5%. I'm planning to run
> more tests with data that has backslashes and with different
> width tables to see what the worst-case and best-case
> performance is like. Also, it doesn't work for CSV format at
> the moment; that needs to be fixed.
>
> 5% isn't exactly breathtaking, but it's a start. I tried the
> same trick to CopyReadAttributesText, but unfortunately it
> doesn't seem to help there because you need to "stop" the
> efficient word-at-a-time scan that memchr does (at least with
> glibc, YMMV) whenever there's a column separator, while in
> CopyReadLineText you get to process the whole line in one
> call, assuming there's no backslashes.
>
> --
>    Heikki Linnakangas
>    EnterpriseDB   http://www.enterprisedb.com
>

Re: CopyReadLineText optimization

From
"Heikki Linnakangas"
Date:
Heikki Linnakangas wrote:
> Attached is a patch that modifies CopyReadLineText so that it uses
> memchr to speed up the scan. The nice thing about memchr is that we can
> take advantage of any clever optimizations that might be in libc or
> compiler.

Here's an updated version of the patch. The principle is the same, but
the same optimization is now used for CSV input as well, and there's
more comments.

I still need to do more benchmarking. I mentioned a ~5% speedup on the
test I ran earlier, which was a load of the lineitem table from TPC-H.
It looks like with cheaper data types the gain can be much bigger;
here's an oprofile from loading the TPC-H partsupp table,

Before:

samples  %        image name               symbol name
5146     25.7635  postgres                 CopyReadLine
4089     20.4716  postgres                 DoCopy
1449      7.2544  reiserfs                 (no symbols)
1369      6.8539  postgres                 pg_verify_mbstr_len
1013      5.0716  libc-2.7.so              memcpy
749       3.7499  libc-2.7.so              ____strtod_l_internal
598       2.9939  postgres                 heap_formtuple
548       2.7436  libc-2.7.so              ____strtol_l_internal
403       2.0176  libc-2.7.so              memset
309       1.5470  libc-2.7.so              strlen
208       1.0414  postgres                 AllocSetAlloc
...

After:

samples  %        image name               symbol name
4165     25.7879  postgres                 DoCopy
1574      9.7455  postgres                 pg_verify_mbstr_len
1520      9.4112  reiserfs                 (no symbols)
1005      6.2225  libc-2.7.so              memchr
986       6.1049  libc-2.7.so              memcpy
632       3.9131  libc-2.7.so              ____strtod_l_internal
589       3.6468  postgres                 heap_formtuple
546       3.3806  libc-2.7.so              ____strtol_l_internal
386       2.3899  libc-2.7.so              memset
366       2.2661  postgres                 CopyReadLine
287       1.7770  libc-2.7.so              strlen
215       1.3312  postgres                 LWLockAcquire
208       1.2878  postgres                 hash_any
176       1.0897  postgres                 LWLockRelease
161       0.9968  postgres                 InputFunctionCall
157       0.9721  postgres                 AllocSetAlloc
...

Profile shows that with the patch, ~8.5% of the CPU time is spent in
CopyReadLine+memchr, vs. 25.5% before. That's a quite significant speedup.

I still need to test the worst-case performance, with input that has a
lot of escapes. It would be interesting to hear reports with this patch
from people on different platforms. These results are from my laptop
with 32-bit Intel CPU, running Linux. There could be big differences in
the memchr implementations.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/commands/copy.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.295
diff -c -r1.295 copy.c
*** src/backend/commands/copy.c    1 Jan 2008 19:45:48 -0000    1.295
--- src/backend/commands/copy.c    29 Feb 2008 17:43:53 -0000
***************
*** 67,72 ****
--- 67,275 ----
      EOL_CRNL
  } EolType;

+
+ /***** Fast byte searcher functions *****
+  *
+  * CopyReadLine needs to search for newlines, as well as quoting and escape
+  * characters to determine which newlines are real and which ones are escaped.
+  * Doing that in the naive way, looping through the input byte by byte and
+  * comparing against the interesting characters, can be slow.
+  *
+  * To speed that up, we provide a "searcher" interface, that can be used to
+  * search a piece of memory for up to 4 different bytes simultaneously. It's
+  * similar to memchr, but allows searching for multiple chars at the same
+  * time.
+  *
+  * To start a search on a new block of memory, call init_searcher. Then call
+  * next_searcher repeatedly to return all the occurrances of the bytes being
+  * searched for.
+  *
+  * The searcher implementation uses memchr to search for the bytes, and keeps
+  * track of where the next occurrance of each is. Using memchr allows us
+  * to take advantage of any platform-specific optimizations.
+  */
+
+ /*
+  * Struct used to store state of the current search between calls to
+  * next_searcher. Initialized by init_search.
+  */
+ typedef struct {
+     /* Each element in c corresponds the element in s. These are sorted
+      * by the pointers (ptr) */
+     int n;        /* elements used in the arrays */
+     char c[4];    /* bytes we're looking for */
+     char *ptr[4]; /* pointers to next occurances of the bytes */
+     char *end;    /* end of block being searched. Last byte to search is end-1 */
+ } searcher;
+
+ /* Functions internal to "searcher" */
+
+ #define swap_searcher_entries(searcher, a, b) do { \
+     char tmpc = (searcher)->c[a]; \
+     char *tmpptr = (searcher)->ptr[a]; \
+     (searcher)->c[a] = (searcher)->c[b]; \
+     (searcher)->ptr[a] = (searcher)->ptr[b]; \
+     (searcher)->c[b] = tmpc; \
+     (searcher)->ptr[b] = tmpptr; \
+ } while(0)
+
+ static void
+ sort_searcher(searcher *s)
+ {
+     switch(s->n)
+     {
+         case 0:
+         case 1:
+             /* Nothing to do */
+             break;
+         case 2:
+             if (s->ptr[0] > s->ptr[1])
+                 swap_searcher_entries(s, 0, 1);
+             break;
+         case 3:
+             if (s->ptr[0] > s->ptr[2])
+                 swap_searcher_entries(s, 0, 2);
+             if (s->ptr[0] > s->ptr[1])
+                 swap_searcher_entries(s, 0, 1);
+             if (s->ptr[1] > s->ptr[2])
+                 swap_searcher_entries(s, 1, 2);
+             break;
+         case 4:
+             if (s->ptr[0] > s->ptr[1])
+                 swap_searcher_entries(s, 0, 1);
+             if (s->ptr[2] > s->ptr[3])
+                 swap_searcher_entries(s, 2, 3);
+             if (s->ptr[1] > s->ptr[2])
+                 swap_searcher_entries(s, 2, 3);
+             if (s->ptr[0] > s->ptr[1])
+                 swap_searcher_entries(s, 0, 1);
+             if (s->ptr[2] > s->ptr[3])
+                 swap_searcher_entries(s, 2, 3);
+             break;
+     }
+ }
+
+ /* Remove the topmost element */
+ static void
+ remove_top(searcher *searcher)
+ {
+     int i;
+
+     searcher->n--;
+
+     /* Move up the remaining items */
+     for (i = 0; i < searcher->n; i++)
+     {
+         searcher->c[i] = searcher->c[i + 1];
+         searcher->ptr[i] = searcher->ptr[i + 1];
+     }
+ }
+
+ /*
+  * The first element in the list has been replaced by the caller.
+  * Move it to the right place in the list, to keep it sorted.
+  */
+ static void
+ sift_down(searcher *searcher)
+ {
+     if (searcher->n < 2 || searcher->ptr[0] <= searcher->ptr[1])
+         return;
+     swap_searcher_entries(searcher, 0, 1);
+
+     if (searcher->n < 3 || searcher->ptr[1] <= searcher->ptr[2])
+         return;
+     swap_searcher_entries(searcher, 1, 2);
+
+     if (searcher->n < 4 || searcher->ptr[2] <= searcher->ptr[3])
+         return;
+     swap_searcher_entries(searcher, 2, 3);
+ }
+
+ /*
+  * Starts a new search in a block of memory.
+  *
+  *  searcher    - for storing state between calls. Opaque to caller,
+  *                  init_searcher will initialize this.
+  *  str            - block of memory to search
+  *  len            - length of str
+  *  c            - bytes to search for, max 4.
+  *  n            - number of bytes in c
+  */
+ static void
+ init_searcher(searcher *searcher, char *str, int len, char c[4], int n)
+ {
+     int i;
+
+     Assert(n <= 4);
+
+     searcher->n = 0;
+     searcher->end = str + len;
+
+     /* Find first occurances of the searched-for bytes */
+     for (i = 0; i < n; i++)
+     {
+         char *hit = memchr(str, c[i], len);
+         if (hit != NULL)
+         {
+             searcher->c[searcher->n] = c[i];
+             searcher->ptr[searcher->n] = hit;
+             searcher->n++;
+         }
+         /*
+          * If the byte doesn't occur in the string at all, we don't
+          * need to put it in the list.
+          */
+     }
+     sort_searcher(searcher);
+ }
+
+ /*
+  * Look for the next interesting character
+  *
+  * searcher - state, opaque to caller
+  * str        - block to search.
+  *
+  * 'str' must be part of the block passed to init_searcher, and >=
+  * the return value of last call to next_searcher.
+  *
+  * Returns the offset of the next interesting location, relative to 'ss'
+  */
+ static char *
+ next_searcher(searcher *searcher, char *str)
+ {
+     Assert(str < searcher->end);
+
+     if (searcher->n == 0)
+         return NULL;
+
+     /*
+      * If the caller skipped over the next pointer we had memorized, we have to
+      * search for the one after that.
+      */
+     while (str > searcher->ptr[0])
+     {
+         char *s = memchr(str, searcher->c[0], searcher->end - str);
+
+         if (s == NULL)
+         {
+             remove_top(searcher);
+             if (searcher->n == 0)
+                 return NULL;
+         }
+         else
+         {
+             searcher->ptr[0] = s;
+             sift_down(searcher);
+         }
+     }
+
+     /*
+      * Return the next interesting location we had memorized.
+      */
+     return searcher->ptr[0];
+ }
+
+
  /*
   * This struct contains all the state variables used throughout a COPY
   * operation. For simplicity, we use the same struct for all variants of COPY,
***************
*** 159,164 ****
--- 362,369 ----
      char       *raw_buf;
      int            raw_buf_index;    /* next byte to process */
      int            raw_buf_len;    /* total # of bytes stored */
+
+     searcher    searcher;
  } CopyStateData;

  typedef CopyStateData *CopyState;
***************
*** 2285,2299 ****
--- 2490,2516 ----
                  last_was_esc = false;
      char        quotec = '\0';
      char        escapec = '\0';
+     char        search_chars[4];
+     int            nsearch_chars = 0;
+
+     search_chars[nsearch_chars++] = '\n';
+     search_chars[nsearch_chars++] = '\r';

      if (cstate->csv_mode)
      {
          quotec = cstate->quote[0];
          escapec = cstate->escape[0];
+
+         search_chars[nsearch_chars++] = quotec;
+
          /* ignore special escape processing if it's the same as quotec */
          if (quotec == escapec)
              escapec = '\0';
+         else
+             search_chars[nsearch_chars++] = escapec;
      }
+     else
+         search_chars[nsearch_chars++] = '\\';

      mblen_str[1] = '\0';

***************
*** 2360,2368 ****
--- 2577,2601 ----
                  break;
              }
              need_data = false;
+
+             init_searcher(&cstate->searcher, copy_raw_buf, copy_buf_len,
+                           search_chars, nsearch_chars);
          }

          /* OK to fetch a character */
+         if ((!cstate->csv_mode || !first_char_in_line)
+             && (raw_buf_ptr + 4 <= copy_buf_len))
+         {
+             char *next = next_searcher(&cstate->searcher,
+                                        ©_raw_buf[raw_buf_ptr]);
+
+             if (next != NULL)
+             {
+                 raw_buf_ptr = next - copy_raw_buf;
+                 first_char_in_line = false;
+             }
+         }
+
          prev_raw_ptr = raw_buf_ptr;
          c = copy_raw_buf[raw_buf_ptr++];


Re: CopyReadLineText optimization

From
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
> > Attached is a patch that modifies CopyReadLineText so that it uses
> > memchr to speed up the scan. The nice thing about memchr is that we can
> > take advantage of any clever optimizations that might be in libc or
> > compiler.
>
> Here's an updated version of the patch. The principle is the same, but
> the same optimization is now used for CSV input as well, and there's
> more comments.
>
> I still need to do more benchmarking. I mentioned a ~5% speedup on the
> test I ran earlier, which was a load of the lineitem table from TPC-H.
> It looks like with cheaper data types the gain can be much bigger;
> here's an oprofile from loading the TPC-H partsupp table,
>
> Before:
>
> samples  %        image name               symbol name
> 5146     25.7635  postgres                 CopyReadLine
> 4089     20.4716  postgres                 DoCopy
> 1449      7.2544  reiserfs                 (no symbols)
> 1369      6.8539  postgres                 pg_verify_mbstr_len
> 1013      5.0716  libc-2.7.so              memcpy
> 749       3.7499  libc-2.7.so              ____strtod_l_internal
> 598       2.9939  postgres                 heap_formtuple
> 548       2.7436  libc-2.7.so              ____strtol_l_internal
> 403       2.0176  libc-2.7.so              memset
> 309       1.5470  libc-2.7.so              strlen
> 208       1.0414  postgres                 AllocSetAlloc
> ...
>
> After:
>
> samples  %        image name               symbol name
> 4165     25.7879  postgres                 DoCopy
> 1574      9.7455  postgres                 pg_verify_mbstr_len
> 1520      9.4112  reiserfs                 (no symbols)
> 1005      6.2225  libc-2.7.so              memchr
> 986       6.1049  libc-2.7.so              memcpy
> 632       3.9131  libc-2.7.so              ____strtod_l_internal
> 589       3.6468  postgres                 heap_formtuple
> 546       3.3806  libc-2.7.so              ____strtol_l_internal
> 386       2.3899  libc-2.7.so              memset
> 366       2.2661  postgres                 CopyReadLine
> 287       1.7770  libc-2.7.so              strlen
> 215       1.3312  postgres                 LWLockAcquire
> 208       1.2878  postgres                 hash_any
> 176       1.0897  postgres                 LWLockRelease
> 161       0.9968  postgres                 InputFunctionCall
> 157       0.9721  postgres                 AllocSetAlloc
> ...
>
> Profile shows that with the patch, ~8.5% of the CPU time is spent in
> CopyReadLine+memchr, vs. 25.5% before. That's a quite significant speedup.
>
> I still need to test the worst-case performance, with input that has a
> lot of escapes. It would be interesting to hear reports with this patch
> from people on different platforms. These results are from my laptop
> with 32-bit Intel CPU, running Linux. There could be big differences in
> the memchr implementations.
>
> --
>    Heikki Linnakangas
>    EnterpriseDB   http://www.enterprisedb.com


>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: CopyReadLineText optimization

From
"Heikki Linnakangas"
Date:
Heikki Linnakangas wrote:
> I still need to test the worst-case performance, with input that has a
> lot of escapes.

Ok, I've done some more performance testing with this. I tested COPY
FROM with a table with a single "text" column. There was a million rows
in the table, with a 1000 character long string:

postgres=# CREATE TABLE narrowtable2 (id text);
CREATE TABLE
postgres=# INSERT INTO narrowtable2 SELECT repeat(E'\\', 1000) FROM
generate_series(1, 1000000);
INSERT 0 1000000

After that, I dumped that to a file, and loaded it back using COPY FROM:

time ~/installations/cvshead/bin/psql postgres -c "BEGIN; TRUNCATE
narrowtable2; COPY narrowtable2 FROM
'/home/perftester/narrowtable3.tbl'; ROLLBACK;"

I repeated the test with different frequencies of backslashes in the
string, with and without the patch, and the took the smallest number of
each test case:

backslashes    with    without patch
all        24.9    15.6
every 4th    12.7    11.4
every 8th    10.4    10.7
every 16th    8.7    10.3
none        6.8    9.8

So the overhead of using memchr slows us down if there's a lot of escape
or quote characters. The breakeven point seems to be about 1 in 8
characters. I'm not sure if that's a good tradeoff or not...


I also tested a table with single integer column, and found no
meaningful difference (10.5 without patch vs 10.6 with patch). oprofile
shows that in this test case, only ~5% of the CPU time is spent in
CopyReadLineText, and the patch doesn't change that.

Without patch:
samples  %        image name               app name
symbol name
7563     12.7220  no-vmlinux               postgres                 (no
symbols)
4050      6.8127  postgres                 postgres                 DoCopy
3334      5.6083  postgres                 postgres
LWLockAcquire
3238      5.4468  postgres                 postgres
CopyReadLine
2900      4.8782  postgres                 postgres
LWLockRelease
2781      4.6780  libc-2.7.so              postgres
__GI_____strtoll_l_internal
2778      4.6730  postgres                 postgres
heap_formtuple
2636      4.4341  postgres                 postgres                 hash_any
2087      3.5106  no-vmlinux               no-vmlinux               (no
symbols)
1748      2.9404  libc-2.7.so              postgres                 memset
1724      2.9000  postgres                 postgres
PinBuffer
1670      2.8092  postgres                 postgres
PageAddItem
1645      2.7671  postgres                 postgres
heap_insert
1459      2.4542  postgres                 postgres
UnpinBuffer
1457      2.4509  postgres                 postgres
ReadBuffer_common
1321      2.2221  postgres                 postgres
hash_search_with_hash_value
1278      2.1498  postgres                 postgres
MarkBufferDirty
1219      2.0505  oprofiled                oprofiled                (no
symbols)
972       1.6350  postgres                 postgres
pg_verify_mbstr_len
756       1.2717  postgres                 postgres
RelationPutHeapTuple
665       1.1186  postgres                 postgres                 pg_atoi
631       1.0614  postgres                 postgres
RelationGetBufferForTuple
613       1.0312  postgres                 postgres
AllocSetReset
...

With patch:
samples  %        image name               app name
symbol name
42720    18.1450  no-vmlinux               postgres                 (no
symbols)
15367     6.5270  postgres                 postgres                 DoCopy
11831     5.0251  postgres                 postgres
LWLockAcquire
11500     4.8845  no-vmlinux               no-vmlinux               (no
symbols)
10182     4.3247  postgres                 postgres
LWLockRelease
9912      4.2100  libc-2.7.so              postgres
__GI_____strtoll_l_internal
9811      4.1671  postgres                 postgres                 hash_any
8824      3.7479  postgres                 postgres
heap_formtuple
7459      3.1682  postgres                 postgres
CopyReadLine
7187      3.0526  postgres                 postgres
PageAddItem
6313      2.6814  libc-2.7.so              postgres                 memset
5842      2.4813  postgres                 postgres
PinBuffer
5230      2.2214  postgres                 postgres
UnpinBuffer
5160      2.1917  postgres                 postgres
heap_insert
4838      2.0549  postgres                 postgres
ReadBuffer_common
4819      2.0468  postgres                 postgres
hash_search_with_hash_value
4691      1.9925  postgres                 postgres
MarkBufferDirty
3675      1.5609  libc-2.7.so              postgres                 memchr
3617      1.5363  postgres                 postgres
AllocSetAlloc
3585      1.5227  postgres                 postgres
pg_verify_mbstr_len
3326      1.4127  postgres                 postgres
AllocSetReset
...


These tests were on a test server with a dual-core 64-bit Intel Xeons.
I'd still like to hear reports from other platforms.

Another thing that seems like an obvious win is to merge CopyReadLine
and CopyReadAttributesText/CSV so that we do just one pass over the
input. But that seems suspiciously obvious, I wonder if I'm missing
something.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: CopyReadLineText optimization

From
"Heikki Linnakangas"
Date:
Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
>> Attached is a patch that modifies CopyReadLineText so that it uses
>> memchr to speed up the scan. The nice thing about memchr is that we
>> can take advantage of any clever optimizations that might be in libc
>> or compiler.
>
> Here's an updated version of the patch. The principle is the same, but
> the same optimization is now used for CSV input as well, and there's
> more comments.

Another update attached: It occurred to me that the memchr approach is
only safe for server encodings, where the non-first bytes of a
multi-byte character always have the hi-bit set.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/commands/copy.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.295
diff -c -r1.295 copy.c
*** src/backend/commands/copy.c    1 Jan 2008 19:45:48 -0000    1.295
--- src/backend/commands/copy.c    5 Mar 2008 14:44:58 -0000
***************
*** 67,72 ****
--- 67,275 ----
      EOL_CRNL
  } EolType;

+
+ /***** Fast byte searcher functions *****
+  *
+  * CopyReadLine needs to search for newlines, as well as quoting and escape
+  * characters to determine which newlines are real and which ones are escaped.
+  * Doing that in the naive way, looping through the input byte by byte and
+  * comparing against the interesting characters, can be slow.
+  *
+  * To speed that up, we provide a "searcher" interface, that can be used to
+  * search a piece of memory for up to 4 different bytes simultaneously. It's
+  * similar to memchr, but allows searching for multiple chars at the same
+  * time.
+  *
+  * To start a search on a new block of memory, call init_searcher. Then call
+  * next_searcher repeatedly to return all the occurrances of the bytes being
+  * searched for.
+  *
+  * The searcher implementation uses memchr to search for the bytes, and keeps
+  * track of where the next occurrance of each is. Using memchr allows us
+  * to take advantage of any platform-specific optimizations.
+  */
+
+ /*
+  * Struct used to store state of the current search between calls to
+  * next_searcher. Initialized by init_search.
+  */
+ typedef struct {
+     /* Each element in c corresponds the element in s. These are sorted
+      * by the pointers (ptr) */
+     int n;        /* elements used in the arrays */
+     char c[4];    /* bytes we're looking for */
+     char *ptr[4]; /* pointers to next occurances of the bytes */
+     char *end;    /* end of block being searched. Last byte to search is end-1 */
+ } searcher;
+
+ /* Functions internal to "searcher" */
+
+ #define swap_searcher_entries(searcher, a, b) do { \
+     char tmpc = (searcher)->c[a]; \
+     char *tmpptr = (searcher)->ptr[a]; \
+     (searcher)->c[a] = (searcher)->c[b]; \
+     (searcher)->ptr[a] = (searcher)->ptr[b]; \
+     (searcher)->c[b] = tmpc; \
+     (searcher)->ptr[b] = tmpptr; \
+ } while(0)
+
+ static void
+ sort_searcher(searcher *s)
+ {
+     switch(s->n)
+     {
+         case 0:
+         case 1:
+             /* Nothing to do */
+             break;
+         case 2:
+             if (s->ptr[0] > s->ptr[1])
+                 swap_searcher_entries(s, 0, 1);
+             break;
+         case 3:
+             if (s->ptr[0] > s->ptr[2])
+                 swap_searcher_entries(s, 0, 2);
+             if (s->ptr[0] > s->ptr[1])
+                 swap_searcher_entries(s, 0, 1);
+             if (s->ptr[1] > s->ptr[2])
+                 swap_searcher_entries(s, 1, 2);
+             break;
+         case 4:
+             if (s->ptr[0] > s->ptr[1])
+                 swap_searcher_entries(s, 0, 1);
+             if (s->ptr[2] > s->ptr[3])
+                 swap_searcher_entries(s, 2, 3);
+             if (s->ptr[1] > s->ptr[2])
+                 swap_searcher_entries(s, 2, 3);
+             if (s->ptr[0] > s->ptr[1])
+                 swap_searcher_entries(s, 0, 1);
+             if (s->ptr[2] > s->ptr[3])
+                 swap_searcher_entries(s, 2, 3);
+             break;
+     }
+ }
+
+ /* Remove the topmost element */
+ static void
+ remove_top(searcher *searcher)
+ {
+     int i;
+
+     searcher->n--;
+
+     /* Move up the remaining items */
+     for (i = 0; i < searcher->n; i++)
+     {
+         searcher->c[i] = searcher->c[i + 1];
+         searcher->ptr[i] = searcher->ptr[i + 1];
+     }
+ }
+
+ /*
+  * The first element in the list has been replaced by the caller.
+  * Move it to the right place in the list, to keep it sorted.
+  */
+ static void
+ sift_down(searcher *searcher)
+ {
+     if (searcher->n < 2 || searcher->ptr[0] <= searcher->ptr[1])
+         return;
+     swap_searcher_entries(searcher, 0, 1);
+
+     if (searcher->n < 3 || searcher->ptr[1] <= searcher->ptr[2])
+         return;
+     swap_searcher_entries(searcher, 1, 2);
+
+     if (searcher->n < 4 || searcher->ptr[2] <= searcher->ptr[3])
+         return;
+     swap_searcher_entries(searcher, 2, 3);
+ }
+
+ /*
+  * Starts a new search in a block of memory.
+  *
+  *  searcher    - for storing state between calls. Opaque to caller,
+  *                  init_searcher will initialize this.
+  *  str            - block of memory to search
+  *  len            - length of str
+  *  c            - bytes to search for, max 4.
+  *  n            - number of bytes in c
+  */
+ static void
+ init_searcher(searcher *searcher, char *str, int len, char c[4], int n)
+ {
+     int i;
+
+     Assert(n <= 4);
+
+     searcher->n = 0;
+     searcher->end = str + len;
+
+     /* Find first occurances of the searched-for bytes */
+     for (i = 0; i < n; i++)
+     {
+         char *hit = memchr(str, c[i], len);
+         if (hit != NULL)
+         {
+             searcher->c[searcher->n] = c[i];
+             searcher->ptr[searcher->n] = hit;
+             searcher->n++;
+         }
+         /*
+          * If the byte doesn't occur in the string at all, we don't
+          * need to put it in the list.
+          */
+     }
+     sort_searcher(searcher);
+ }
+
+ /*
+  * Look for the next interesting character
+  *
+  * searcher - state, opaque to caller
+  * str        - block to search.
+  *
+  * 'str' must be part of the block passed to init_searcher, and >=
+  * the return value of last call to next_searcher.
+  *
+  * Returns the offset of the next interesting location, relative to 'ss'
+  */
+ static char *
+ next_searcher(searcher *searcher, char *str)
+ {
+     Assert(str < searcher->end);
+
+     if (searcher->n == 0)
+         return NULL;
+
+     /*
+      * If the caller skipped over the next pointer we had memorized, we have to
+      * search for the one after that.
+      */
+     while (str > searcher->ptr[0])
+     {
+         char *s = memchr(str, searcher->c[0], searcher->end - str);
+
+         if (s == NULL)
+         {
+             remove_top(searcher);
+             if (searcher->n == 0)
+                 return NULL;
+         }
+         else
+         {
+             searcher->ptr[0] = s;
+             sift_down(searcher);
+         }
+     }
+
+     /*
+      * Return the next interesting location we had memorized.
+      */
+     return searcher->ptr[0];
+ }
+
+
  /*
   * This struct contains all the state variables used throughout a COPY
   * operation. For simplicity, we use the same struct for all variants of COPY,
***************
*** 159,164 ****
--- 362,369 ----
      char       *raw_buf;
      int            raw_buf_index;    /* next byte to process */
      int            raw_buf_len;    /* total # of bytes stored */
+
+     searcher    searcher;
  } CopyStateData;

  typedef CopyStateData *CopyState;
***************
*** 2285,2299 ****
--- 2490,2516 ----
                  last_was_esc = false;
      char        quotec = '\0';
      char        escapec = '\0';
+     char        search_chars[4];
+     int            nsearch_chars = 0;
+
+     search_chars[nsearch_chars++] = '\n';
+     search_chars[nsearch_chars++] = '\r';

      if (cstate->csv_mode)
      {
          quotec = cstate->quote[0];
          escapec = cstate->escape[0];
+
+         search_chars[nsearch_chars++] = quotec;
+
          /* ignore special escape processing if it's the same as quotec */
          if (quotec == escapec)
              escapec = '\0';
+         else
+             search_chars[nsearch_chars++] = escapec;
      }
+     else
+         search_chars[nsearch_chars++] = '\\';

      mblen_str[1] = '\0';

***************
*** 2360,2368 ****
--- 2577,2603 ----
                  break;
              }
              need_data = false;
+
+             if (!cstate->encoding_embeds_ascii)
+                 init_searcher(&cstate->searcher, copy_raw_buf, copy_buf_len,
+                               search_chars, nsearch_chars);
          }

          /* OK to fetch a character */
+         if ((!cstate->csv_mode || !first_char_in_line)
+             && !cstate->encoding_embeds_ascii
+             && (raw_buf_ptr + 4 <= copy_buf_len))
+         {
+             char *next = next_searcher(&cstate->searcher,
+                                        ©_raw_buf[raw_buf_ptr]);
+
+             if (next != NULL)
+             {
+                 raw_buf_ptr = next - copy_raw_buf;
+                 first_char_in_line = false;
+             }
+         }
+
          prev_raw_ptr = raw_buf_ptr;
          c = copy_raw_buf[raw_buf_ptr++];


Re: CopyReadLineText optimization

From
Andrew Dunstan
Date:

Heikki Linnakangas wrote:
>
>
> So the overhead of using memchr slows us down if there's a lot of
> escape or quote characters. The breakeven point seems to be about 1 in
> 8 characters. I'm not sure if that's a good tradeoff or not...
>
>

How about we test the first buffer read in from the file and change
strategy accordingly?

cheers

andrew

Re: CopyReadLineText optimization

From
Andrew Dunstan
Date:

Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
>> Heikki Linnakangas wrote:
>>> Attached is a patch that modifies CopyReadLineText so that it uses
>>> memchr to speed up the scan. The nice thing about memchr is that we
>>> can take advantage of any clever optimizations that might be in libc
>>> or compiler.
>>
>> Here's an updated version of the patch. The principle is the same,
>> but the same optimization is now used for CSV input as well, and
>> there's more comments.
>
> Another update attached: It occurred to me that the memchr approach is
> only safe for server encodings, where the non-first bytes of a
> multi-byte character always have the hi-bit set.
>

We currently make the following assumption in the code:

     * These four characters, and the CSV escape and quote characters, are
     * assumed the same in frontend and backend encodings.
     *

The four characters are the carriage return, line feed, backslash and dot.

I think the requirement might well actually be somewhat stronger than
that: i.e. that none of these will appear as a non-first byte in any
multi-byte client encoding character. If that's right, then we should be
able to write CopyReadLineText without bothering about multi-byte chars.
If it's not right then I suspect we have some cases that can fail now
anyway. (I believe some client encodings at least use backslash in
subsequent chars, and that's a nasty one because the "\." end sequence
is hard coded). I believe all the chars up to 0x2f are safe - that
includes both quote chars and dot)

cheers

andrew

Re: CopyReadLineText optimization

From
"Heikki Linnakangas"
Date:
Andrew Dunstan wrote:
> Heikki Linnakangas wrote:
>> Another update attached: It occurred to me that the memchr approach is
>> only safe for server encodings, where the non-first bytes of a
>> multi-byte character always have the hi-bit set.
>>
>
> We currently make the following assumption in the code:
>
>     * These four characters, and the CSV escape and quote characters, are
>     * assumed the same in frontend and backend encodings.
>     *
>
> The four characters are the carriage return, line feed, backslash and dot.
>
> I think the requirement might well actually be somewhat stronger than
> that: i.e. that none of these will appear as a non-first byte in any
> multi-byte client encoding character. If that's right, then we should be
> able to write CopyReadLineText without bothering about multi-byte chars.
> If it's not right then I suspect we have some cases that can fail now
> anyway.

No, we don't require that, and we do handle it correctly. We use
pg_encoding_mblen to determine the length of each character in
CopyReadLineText when the encoding is a client-only encoding, and only
look at the first byte of each character. In CopyReadAttributesText,
where we have a similar loop, we've already transformed the input to
server encoding.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: CopyReadLineText optimization

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Andrew Dunstan wrote:
>> We currently make the following assumption in the code:
>>
>> * These four characters, and the CSV escape and quote characters, are
>> * assumed the same in frontend and backend encodings.
>>
>> The four characters are the carriage return, line feed, backslash and dot.
>>
>> I think the requirement might well actually be somewhat stronger than
>> that: i.e. that none of these will appear as a non-first byte in any
>> multi-byte client encoding character.

> No, we don't require that, and we do handle it correctly.

I believe we *have to* handle it correctly, because backslash actually
does appear as a non-first byte in SJIS or some other Far Eastern
encoding.  In the CSV case such a restriction would be untenable anyway.

BTW, I notice that the code allows CSV escape and quote characters that
have the high bit set (in single-byte server encodings that is).  Is
this a good idea?  It seems like such are extremely unlikely to be the
same in two different encodings.  Maybe we should restrict to the ASCII
range?  Especially if the client encoding is multibyte ...

            regards, tom lane

Re: CopyReadLineText optimization

From
Andrew Dunstan
Date:

Tom Lane wrote:
> BTW, I notice that the code allows CSV escape and quote characters that
> have the high bit set (in single-byte server encodings that is).  Is
> this a good idea?  It seems like such are extremely unlikely to be the
> same in two different encodings.  Maybe we should restrict to the ASCII
> range?  Especially if the client encoding is multibyte ...
>
>
>

In the commonest case these are both either " or '. I would not have any
objection to requiring them to be ASCII chars.

cheers

andrew

Re: CopyReadLineText optimization

From
Andrew Dunstan
Date:

Heikki Linnakangas wrote:
> Andrew Dunstan wrote:
>> Heikki Linnakangas wrote:
>>> Another update attached: It occurred to me that the memchr approach is
>>> only safe for server encodings, where the non-first bytes of a
>>> multi-byte character always have the hi-bit set.
>>>
>>
>> We currently make the following assumption in the code:
>>
>>     * These four characters, and the CSV escape and quote characters,
>> are
>>     * assumed the same in frontend and backend encodings.
>>     *
>>
>> The four characters are the carriage return, line feed, backslash and
>> dot.
>>
>> I think the requirement might well actually be somewhat stronger than
>> that: i.e. that none of these will appear as a non-first byte in any
>> multi-byte client encoding character. If that's right, then we should
>> be able to write CopyReadLineText without bothering about multi-byte
>> chars. If it's not right then I suspect we have some cases that can
>> fail now anyway.
>
> No, we don't require that, and we do handle it correctly. We use
> pg_encoding_mblen to determine the length of each character in
> CopyReadLineText when the encoding is a client-only encoding, and only
> look at the first byte of each character. In CopyReadAttributesText,
> where we have a similar loop, we've already transformed the input to
> server encoding.

Oops. I see that now. Funny how I missed it when I went looking for it :-(

I think I understand the patch now :-)

I'm still a bit worried about applying it unless it gets some adaptive
behaviour or something so that we don't cause any serious performance
regressions in some cases. Also, could we perhaps benefit from inlining
some calls, or is your compiler doing that anyway?

cheers

andrew

Re: CopyReadLineText optimization

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> BTW, I notice that the code allows CSV escape and quote characters that
> have the high bit set (in single-byte server encodings that is).  Is
> this a good idea?  It seems like such are extremely unlikely to be the
> same in two different encodings.  Maybe we should restrict to the ASCII
> range?  Especially if the client encoding is multibyte ...

At least many of the ISO-8859-* encodings have many common non-ascii
characters, and there's no problem if the client_ and server_encodings
match. But it does seem risky to allow it if we can't detect and throw
an error on the non-safe cases. Perhaps we could translate the chars
from client to server encoding?

If the client encoding is a multibyte one, then we certainly should
elog(ERROR) if you try to do that.

Though from a practical point of view, I doubt anyone would mind if we
just always restricted them to ASCII range...

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: CopyReadLineText optimization

From
"Heikki Linnakangas"
Date:
Andrew Dunstan wrote:
> I'm still a bit worried about applying it unless it gets some adaptive
> behaviour or something so that we don't cause any serious performance
> regressions in some cases.

I'll try to come up with something. At the most conservative end, we
could fall back to the current method on the first escape, quote or
backslash character.

> Also, could we perhaps benefit from inlining
> some calls, or is your compiler doing that anyway?

gcc does inline all static functions that are only called from one site,
  and small functions, using some heuristic. I don't think more
aggressive inlining would help.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: CopyReadLineText optimization

From
Andrew Dunstan
Date:

Heikki Linnakangas wrote:
> Andrew Dunstan wrote:
>> I'm still a bit worried about applying it unless it gets some
>> adaptive behaviour or something so that we don't cause any serious
>> performance regressions in some cases.
>
> I'll try to come up with something. At the most conservative end, we
> could fall back to the current method on the first escape, quote or
> backslash character.
>
>

That's far too conservative, I think. Somewhere a bit short of your
observed breakeven point seems about right.

cheers

andrew

Re: CopyReadLineText optimization

From
"Heikki Linnakangas"
Date:
Andrew Dunstan wrote:
> Heikki Linnakangas wrote:
>> Andrew Dunstan wrote:
>>> I'm still a bit worried about applying it unless it gets some
>>> adaptive behaviour or something so that we don't cause any serious
>>> performance regressions in some cases.
>>
>> I'll try to come up with something. At the most conservative end, we
>> could fall back to the current method on the first escape, quote or
>> backslash character.
>
> That's far too conservative, I think. Somewhere a bit short of your
> observed breakeven point seems about right.

The problem is, you don't know how many "stop" characters there is until
you've counted them.

We could fall back after X such characters, or only start using memchr
after seeing 8 consecutive non-stop characters. Whatever we choose, the
heuristic needs to be very simple and fast to check, otherwise we just
introduce more overhead trying to decide which method to use.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: CopyReadLineText optimization

From
Greg Smith
Date:
On Thu, 6 Mar 2008, Heikki Linnakangas wrote:

> At the most conservative end, we could fall back to the current method
> on the first escape, quote or backslash character.

I would just count the number of escaped/quote characters on each line,
and then at the end of the line switch modes between the current code on
the new version based on what the previous line looked like.  That way the
only additional overhead is a small bit only when escapes show up often,
plus a touch more just once per line.  Barely noticable in the case where
nothing is escaped, very small regression for escape-heavy stuff but
certainly better than the drop you reported in the last rev of this patch.

Rev two of that design would keep a weighted moving average of the total
number of escaped characters per line (say wma=(7*wma+current)/8) and
switch modes based on that instead of the previous one.  There's enough
play in the transition between where the two approaches work better at
that this should be easy enough to get a decent transition between.
Based on your data I would put the transition at wma>4, which should keep
the old code in play even if only half the lines have the bad regression
that shows up with >8 escapes per line.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: CopyReadLineText optimization

From
Andrew Dunstan
Date:

Greg Smith wrote:
> On Thu, 6 Mar 2008, Heikki Linnakangas wrote:
>
>> At the most conservative end, we could fall back to the current
>> method on the first escape, quote or backslash character.
>
> I would just count the number of escaped/quote characters on each
> line, and then at the end of the line switch modes between the current
> code on the new version based on what the previous line looked like.
> That way the only additional overhead is a small bit only when escapes
> show up often, plus a touch more just once per line.  Barely noticable
> in the case where nothing is escaped, very small regression for
> escape-heavy stuff but certainly better than the drop you reported in
> the last rev of this patch.
>
> Rev two of that design would keep a weighted moving average of the
> total number of escaped characters per line (say
> wma=(7*wma+current)/8) and switch modes based on that instead of the
> previous one.  There's enough play in the transition between where the
> two approaches work better at that this should be easy enough to get a
> decent transition between. Based on your data I would put the
> transition at wma>4, which should keep the old code in play even if
> only half the lines have the bad regression that shows up with >8
> escapes per line.
>
>

I'd be inclined just to look at the first buffer of data we read in, and
make a one-off decision there, if we can get away with it. Then the cost
of testing is fixed rather than per line.

cheers

andrew

Re: CopyReadLineText optimization

From
Andrew Dunstan
Date:

Heikki Linnakangas wrote:
> Andrew Dunstan wrote:
>> I'm still a bit worried about applying it unless it gets some
>> adaptive behaviour or something so that we don't cause any serious
>> performance regressions in some cases.
>
> I'll try to come up with something. At the most conservative end, we
> could fall back to the current method on the first escape, quote or
> backslash character.
>
>> Also, could we perhaps benefit from inlining some calls, or is your
>> compiler doing that anyway?
>
> gcc does inline all static functions that are only called from one
> site,  and small functions, using some heuristic. I don't think more
> aggressive inlining would help.
>

Another question that occurred to me - did you try using strpbrk() to
look for the next interesting character rather than your homegrown
searcher gadget? If so, how did that perform?

cheers

andrew

Re: CopyReadLineText optimization

From
"Heikki Linnakangas"
Date:
Andrew Dunstan wrote:
>
>
> Heikki Linnakangas wrote:
>> Andrew Dunstan wrote:
>>> I'm still a bit worried about applying it unless it gets some
>>> adaptive behaviour or something so that we don't cause any serious
>>> performance regressions in some cases.
>>
>> I'll try to come up with something. At the most conservative end, we
>> could fall back to the current method on the first escape, quote or
>> backslash character.
>>
>>> Also, could we perhaps benefit from inlining some calls, or is your
>>> compiler doing that anyway?
>>
>> gcc does inline all static functions that are only called from one
>> site,  and small functions, using some heuristic. I don't think more
>> aggressive inlining would help.
>>
>
> Another question that occurred to me - did you try using strpbrk() to
> look for the next interesting character rather than your homegrown
> searcher gadget? If so, how did that perform?

I haven't tried that. There's a small difference: strpbrk stops at '\0'.
But come to think of it, I guess it doesn't matter. Will test...

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: CopyReadLineText optimization

From
"Heikki Linnakangas"
Date:
Andrew Dunstan wrote:
> Another question that occurred to me - did you try using strpbrk() to
> look for the next interesting character rather than your homegrown
> searcher gadget? If so, how did that perform?

It looks like strpbrk() performs poorly:

unpatched:
  testname |  min duration
----------+-----------------
  all      | 00:00:08.099656
  1/2      | 00:00:06.734241
  1/4      | 00:00:06.016946
  1/8      | 00:00:05.622122
  1/16     | 00:00:05.304252
  none     | 00:00:05.155755
(6 rows)

strpbrk:

  testname |  min duration
----------+-----------------
  all      | 00:00:22.980997
  1/2      | 00:00:13.724834
  1/4      | 00:00:08.980246
  1/8      | 00:00:06.582357
  1/16     | 00:00:05.291485
  none     | 00:00:06.239468
(6 rows)

memchr:

  testname |  min duration
----------+-----------------
  all      | 00:00:13.684479
  1/2      | 00:00:09.509213
  1/4      | 00:00:06.921959
  1/8      | 00:00:05.654761
  1/16     | 00:00:04.719027
  none     | 00:00:03.718361
(6 rows)

Attached is the test script and patches I used, if someone wants to test
this on another platform.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/commands/copy.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.295
diff -c -r1.295 copy.c
*** src/backend/commands/copy.c    1 Jan 2008 19:45:48 -0000    1.295
--- src/backend/commands/copy.c    5 Mar 2008 14:44:58 -0000
***************
*** 67,72 ****
--- 67,275 ----
      EOL_CRNL
  } EolType;

+
+ /***** Fast byte searcher functions *****
+  *
+  * CopyReadLine needs to search for newlines, as well as quoting and escape
+  * characters to determine which newlines are real and which ones are escaped.
+  * Doing that in the naive way, looping through the input byte by byte and
+  * comparing against the interesting characters, can be slow.
+  *
+  * To speed that up, we provide a "searcher" interface, that can be used to
+  * search a piece of memory for up to 4 different bytes simultaneously. It's
+  * similar to memchr, but allows searching for multiple chars at the same
+  * time.
+  *
+  * To start a search on a new block of memory, call init_searcher. Then call
+  * next_searcher repeatedly to return all the occurrances of the bytes being
+  * searched for.
+  *
+  * The searcher implementation uses memchr to search for the bytes, and keeps
+  * track of where the next occurrance of each is. Using memchr allows us
+  * to take advantage of any platform-specific optimizations.
+  */
+
+ /*
+  * Struct used to store state of the current search between calls to
+  * next_searcher. Initialized by init_search.
+  */
+ typedef struct {
+     /* Each element in c corresponds the element in s. These are sorted
+      * by the pointers (ptr) */
+     int n;        /* elements used in the arrays */
+     char c[4];    /* bytes we're looking for */
+     char *ptr[4]; /* pointers to next occurances of the bytes */
+     char *end;    /* end of block being searched. Last byte to search is end-1 */
+ } searcher;
+
+ /* Functions internal to "searcher" */
+
+ #define swap_searcher_entries(searcher, a, b) do { \
+     char tmpc = (searcher)->c[a]; \
+     char *tmpptr = (searcher)->ptr[a]; \
+     (searcher)->c[a] = (searcher)->c[b]; \
+     (searcher)->ptr[a] = (searcher)->ptr[b]; \
+     (searcher)->c[b] = tmpc; \
+     (searcher)->ptr[b] = tmpptr; \
+ } while(0)
+
+ static void
+ sort_searcher(searcher *s)
+ {
+     switch(s->n)
+     {
+         case 0:
+         case 1:
+             /* Nothing to do */
+             break;
+         case 2:
+             if (s->ptr[0] > s->ptr[1])
+                 swap_searcher_entries(s, 0, 1);
+             break;
+         case 3:
+             if (s->ptr[0] > s->ptr[2])
+                 swap_searcher_entries(s, 0, 2);
+             if (s->ptr[0] > s->ptr[1])
+                 swap_searcher_entries(s, 0, 1);
+             if (s->ptr[1] > s->ptr[2])
+                 swap_searcher_entries(s, 1, 2);
+             break;
+         case 4:
+             if (s->ptr[0] > s->ptr[1])
+                 swap_searcher_entries(s, 0, 1);
+             if (s->ptr[2] > s->ptr[3])
+                 swap_searcher_entries(s, 2, 3);
+             if (s->ptr[1] > s->ptr[2])
+                 swap_searcher_entries(s, 2, 3);
+             if (s->ptr[0] > s->ptr[1])
+                 swap_searcher_entries(s, 0, 1);
+             if (s->ptr[2] > s->ptr[3])
+                 swap_searcher_entries(s, 2, 3);
+             break;
+     }
+ }
+
+ /* Remove the topmost element */
+ static void
+ remove_top(searcher *searcher)
+ {
+     int i;
+
+     searcher->n--;
+
+     /* Move up the remaining items */
+     for (i = 0; i < searcher->n; i++)
+     {
+         searcher->c[i] = searcher->c[i + 1];
+         searcher->ptr[i] = searcher->ptr[i + 1];
+     }
+ }
+
+ /*
+  * The first element in the list has been replaced by the caller.
+  * Move it to the right place in the list, to keep it sorted.
+  */
+ static void
+ sift_down(searcher *searcher)
+ {
+     if (searcher->n < 2 || searcher->ptr[0] <= searcher->ptr[1])
+         return;
+     swap_searcher_entries(searcher, 0, 1);
+
+     if (searcher->n < 3 || searcher->ptr[1] <= searcher->ptr[2])
+         return;
+     swap_searcher_entries(searcher, 1, 2);
+
+     if (searcher->n < 4 || searcher->ptr[2] <= searcher->ptr[3])
+         return;
+     swap_searcher_entries(searcher, 2, 3);
+ }
+
+ /*
+  * Starts a new search in a block of memory.
+  *
+  *  searcher    - for storing state between calls. Opaque to caller,
+  *                  init_searcher will initialize this.
+  *  str            - block of memory to search
+  *  len            - length of str
+  *  c            - bytes to search for, max 4.
+  *  n            - number of bytes in c
+  */
+ static void
+ init_searcher(searcher *searcher, char *str, int len, char c[4], int n)
+ {
+     int i;
+
+     Assert(n <= 4);
+
+     searcher->n = 0;
+     searcher->end = str + len;
+
+     /* Find first occurances of the searched-for bytes */
+     for (i = 0; i < n; i++)
+     {
+         char *hit = memchr(str, c[i], len);
+         if (hit != NULL)
+         {
+             searcher->c[searcher->n] = c[i];
+             searcher->ptr[searcher->n] = hit;
+             searcher->n++;
+         }
+         /*
+          * If the byte doesn't occur in the string at all, we don't
+          * need to put it in the list.
+          */
+     }
+     sort_searcher(searcher);
+ }
+
+ /*
+  * Look for the next interesting character
+  *
+  * searcher - state, opaque to caller
+  * str        - block to search.
+  *
+  * 'str' must be part of the block passed to init_searcher, and >=
+  * the return value of last call to next_searcher.
+  *
+  * Returns the offset of the next interesting location, relative to 'ss'
+  */
+ static char *
+ next_searcher(searcher *searcher, char *str)
+ {
+     Assert(str < searcher->end);
+
+     if (searcher->n == 0)
+         return NULL;
+
+     /*
+      * If the caller skipped over the next pointer we had memorized, we have to
+      * search for the one after that.
+      */
+     while (str > searcher->ptr[0])
+     {
+         char *s = memchr(str, searcher->c[0], searcher->end - str);
+
+         if (s == NULL)
+         {
+             remove_top(searcher);
+             if (searcher->n == 0)
+                 return NULL;
+         }
+         else
+         {
+             searcher->ptr[0] = s;
+             sift_down(searcher);
+         }
+     }
+
+     /*
+      * Return the next interesting location we had memorized.
+      */
+     return searcher->ptr[0];
+ }
+
+
  /*
   * This struct contains all the state variables used throughout a COPY
   * operation. For simplicity, we use the same struct for all variants of COPY,
***************
*** 159,164 ****
--- 362,369 ----
      char       *raw_buf;
      int            raw_buf_index;    /* next byte to process */
      int            raw_buf_len;    /* total # of bytes stored */
+
+     searcher    searcher;
  } CopyStateData;

  typedef CopyStateData *CopyState;
***************
*** 2285,2299 ****
--- 2490,2516 ----
                  last_was_esc = false;
      char        quotec = '\0';
      char        escapec = '\0';
+     char        search_chars[4];
+     int            nsearch_chars = 0;
+
+     search_chars[nsearch_chars++] = '\n';
+     search_chars[nsearch_chars++] = '\r';

      if (cstate->csv_mode)
      {
          quotec = cstate->quote[0];
          escapec = cstate->escape[0];
+
+         search_chars[nsearch_chars++] = quotec;
+
          /* ignore special escape processing if it's the same as quotec */
          if (quotec == escapec)
              escapec = '\0';
+         else
+             search_chars[nsearch_chars++] = escapec;
      }
+     else
+         search_chars[nsearch_chars++] = '\\';

      mblen_str[1] = '\0';

***************
*** 2360,2368 ****
--- 2577,2603 ----
                  break;
              }
              need_data = false;
+
+             if (!cstate->encoding_embeds_ascii)
+                 init_searcher(&cstate->searcher, copy_raw_buf, copy_buf_len,
+                               search_chars, nsearch_chars);
          }

          /* OK to fetch a character */
+         if ((!cstate->csv_mode || !first_char_in_line)
+             && !cstate->encoding_embeds_ascii
+             && (raw_buf_ptr + 4 <= copy_buf_len))
+         {
+             char *next = next_searcher(&cstate->searcher,
+                                        ©_raw_buf[raw_buf_ptr]);
+
+             if (next != NULL)
+             {
+                 raw_buf_ptr = next - copy_raw_buf;
+                 first_char_in_line = false;
+             }
+         }
+
          prev_raw_ptr = raw_buf_ptr;
          c = copy_raw_buf[raw_buf_ptr++];

Index: src/backend/commands/copy.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.296
diff -c -r1.296 copy.c
*** src/backend/commands/copy.c    8 Mar 2008 01:16:26 -0000    1.296
--- src/backend/commands/copy.c    9 Mar 2008 09:34:13 -0000
***************
*** 2285,2299 ****
--- 2285,2311 ----
                  last_was_esc = false;
      char        quotec = '\0';
      char        escapec = '\0';
+     char        search_chars[5] = { 0, 0, 0, 0, 0 };
+     int            nsearch_chars = 0;
+
+     search_chars[nsearch_chars++] = '\n';
+     search_chars[nsearch_chars++] = '\r';

      if (cstate->csv_mode)
      {
          quotec = cstate->quote[0];
          escapec = cstate->escape[0];
+
+         search_chars[nsearch_chars++] = quotec;
+
          /* ignore special escape processing if it's the same as quotec */
          if (quotec == escapec)
              escapec = '\0';
+         else
+             search_chars[nsearch_chars++] = escapec;
      }
+     else
+         search_chars[nsearch_chars++] = '\\';

      mblen_str[1] = '\0';

***************
*** 2363,2368 ****
--- 2375,2394 ----
          }

          /* OK to fetch a character */
+         if ((!cstate->csv_mode || !first_char_in_line)
+             && !cstate->encoding_embeds_ascii
+             && (raw_buf_ptr + 4 <= copy_buf_len))
+         {
+
+             char *next = strpbrk(©_raw_buf[raw_buf_ptr], search_chars);
+
+             if (next != NULL)
+             {
+                 raw_buf_ptr = next - copy_raw_buf;
+                 first_char_in_line = false;
+             }
+         }
+
          prev_raw_ptr = raw_buf_ptr;
          c = copy_raw_buf[raw_buf_ptr++];


Attachment

Re: CopyReadLineText optimization

From
Andrew Dunstan
Date:

Heikki Linnakangas wrote:
> Andrew Dunstan wrote:
>> Another question that occurred to me - did you try using strpbrk() to
>> look for the next interesting character rather than your homegrown
>> searcher gadget? If so, how did that perform?
>
> It looks like strpbrk() performs poorly:


Yes, not surprising. I just looked at the implementation in glibc, which
I assume you are using, and it seemed rather basic. The one in NetBSD's
libc looks much more efficient.

See


http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/libc/string/strpbrk.c?rev=1.1.2.1&content-type=text/plain&cvsroot=glibc
and

http://cvsweb.de.netbsd.org/cgi-bin/cvsweb.cgi/src/lib/libc/string/strpbrk.c?rev=1.16;content-type=text%2Fx-cvsweb-markup

Not that what you've done isn't good, if a little obscure (as is the
NetBSD implementation)

cheers

andrew




Re: CopyReadLineText optimization

From
"Heikki Linnakangas"
Date:
Andrew Dunstan wrote:
> Heikki Linnakangas wrote:
>> It looks like strpbrk() performs poorly:
>
> Yes, not surprising. I just looked at the implementation in glibc, which
> I assume you are using, and it seemed rather basic. The one in NetBSD's
> libc looks much more efficient.
>
> See
>
>
http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/libc/string/strpbrk.c?rev=1.1.2.1&content-type=text/plain&cvsroot=glibc


There's architecture-specific optimized versions of that in glibc,
though. For example, for x86_64:

http://sourceware.org/cgi-bin/cvsweb.cgi/~checkout~/libc/sysdeps/x86_64/strspn.S?rev=1.4&cvsroot=glibc

> Not that what you've done isn't good, if a little obscure (as is the
> NetBSD implementation)

Yeah, it's a lot of code for such a simple thing :-(. On the other hand,
it's very isolated, so it's not that bad.

We chatted about this with Greg Stark yesterday, and we came up with a
few more observations and ideas:

1. CopyReadLineText is all about finding the the next end of line;
splitting to fields is done later. We therefore only care about quotes
and escapes when they affect the end of line detection. In text mode, we
  only need to care about a backslash that precedes a LF/CR. Therefore,
we could search for the next LF/CR character with memchr(), and check if
the preceding character is a backslash (and if it is, check if there's
yet another backslash behind it, and so forth until we hit a
non-backslash character).

2. The manual has this to say about escaping newlines with a backslash:

"It is strongly recommended that applications generating COPY data
convert data newlines and carriage returns to the \n and \r sequences
respectively. At present it is possible to represent a data carriage
return by a backslash and carriage return, and to represent a data
newline by a backslash and newline. However, these representations might
not be accepted in future releases."

If we disallowed the backslash+LF sequence, like that paragraph suggests
we could, we wouldn't need to care about backslashes in CopyReadLineText
in text mode at all.

3. If we did the client->server encoding conversion before
CopyReadLineText, one input block at a time:

- We could skip calling pg_encoding_mblen for each character, when the
client encoding is one not supported as a server encoding. That would be
a big performance gain when that's the case, and would simplify the code
in CopyReadLineText a little bit as well.

- The conversion might be faster, as it could operate on bigger blocks

- We could merge the CopyReadLine and CopyReadAttributes functions, and
split the input block into fields in one loop. I would imagine that
being more efficient than scanning the input twice, no matter how fast
we make those scans.

There's a small difficulty with doing the conversion one input block at
a time: there's no suitable interface in the conversion routines for
that. The input block boundary can be between 1st and 2nd byte of a
multi-byte character, so we'd need to have a function that converts a
block of bytes, ignoring any incomplete bytes at the end.

I think I'll experiment with the 1st idea, to see how much code
restructuring it needs and what the performance is like.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com