Thread: Undocumented feature costs a lot of performance in COPY IN

Undocumented feature costs a lot of performance in COPY IN

From
Tom Lane
Date:
I have been fooling around profiling various ways of inserting wide
(8000-byte, not all that wide) bytea fields, per Brent Verner's note
of a few days ago.  COPY IN should be, and is, the fastest way to
do it.  But I was rather startled to discover that 25% of the runtime
of COPY IN went to an inefficient way of fetching single bytes from
pqcomm.c (pq_getbytes(&ch, 1) instead of ch = pq_getbyte()), and
20% of what's left after fixing that is going into the strchr() call
in CopyReadAttribute.

Now the point of that strchr() call is to detect whether the current
character is the column delimiter.  The COPY reference page clearly
says:
By default, a text copy uses a tab ("\t") character as adelimiter between fields. The field delimiter may be changed
toanyother single character with the keyword phrase USINGDELIMITERS. Characters in data fields which happen to match
thedelimitercharacter will be backslash quoted. Note that thedelimiter is always a single character. If multiple
charactersarespecified in the delimiter string, only the first characteris used.
 

and indeed, only the first character is used by COPY OUT.  But COPY IN
is presently coded so that if multiple characters are mentioned in
USING DELIMITERS, any one of them will be taken as a field delimiter.

I would like to change the code to just "if (c == delim[0])",
which should buy back most of that 20% and make the behavior match the
documentation.  Question for the list: is this a bad change?  Is anyone
out there actually using this undocumented behavior?
        regards, tom lane


Re: Undocumented feature costs a lot of performance in COPY

From
Bruce Momjian
Date:
> I would like to change the code to just "if (c == delim[0])",
> which should buy back most of that 20% and make the behavior match the
> documentation.  Question for the list: is this a bad change?  Is anyone
> out there actually using this undocumented behavior?

Yes, please fix it.  In fact, I think we should throw an error if more
than one character is specified as a delimiter.  Saying we ignore
multiple characters in the documentation is not enough when we silently
ignore them in the code.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: Undocumented feature costs a lot of performance in COPY IN

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Yes, please fix it.  In fact, I think we should throw an error if more
> than one character is specified as a delimiter.  Saying we ignore
> multiple characters in the documentation is not enough when we silently
> ignore them in the code.

Well, it'd be an easy enough addition:
if (strlen(delim) != 1)    elog(ERROR, "COPY delimiter must be a single character");

This isn't multibyte-aware, but then neither is the implementation;
delimiters that are multibyte characters won't work at the moment.

Any comments out there?
        regards, tom lane


Re: Undocumented feature costs a lot of performance in COPY IN

From
Doug McNaught
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> and indeed, only the first character is used by COPY OUT.  But COPY IN
> is presently coded so that if multiple characters are mentioned in
> USING DELIMITERS, any one of them will be taken as a field delimiter.
> 
> I would like to change the code to just "if (c == delim[0])",
> which should buy back most of that 20% and make the behavior match the
> documentation.  Question for the list: is this a bad change?  Is anyone
> out there actually using this undocumented behavior?

Not I.

As an utter nitpick, the syntax should IMHO be USING DELIMITER (no S)
if there is only one possible delimiter character.  But that *would*
break lots of apps so I don't advocate it.  ;)

-Doug
-- 
Let us cross over the river, and rest under the shade of the trees.  --T. J. Jackson, 1863


Re: Undocumented feature costs a lot of performance in

From
Bill Studenmund
Date:
On Tue, 4 Dec 2001, Tom Lane wrote:

>     By default, a text copy uses a tab ("\t") character as a
>     delimiter between fields. The field delimiter may be changed to
>     any other single character with the keyword phrase USING
>     DELIMITERS. Characters in data fields which happen to match the
>     delimiter character will be backslash quoted. Note that the
>     delimiter is always a single character. If multiple characters
>     are specified in the delimiter string, only the first character
>     is used.
>
> and indeed, only the first character is used by COPY OUT.  But COPY IN
> is presently coded so that if multiple characters are mentioned in
> USING DELIMITERS, any one of them will be taken as a field delimiter.
>
> I would like to change the code to just "if (c == delim[0])",
> which should buy back most of that 20% and make the behavior match the
> documentation.  Question for the list: is this a bad change?  Is anyone
> out there actually using this undocumented behavior?

I think you should make the change. Because, as I understand it, when you
give multiple delimiter characters COPY OUT will not delimit characters
other than the first, since they won't be treated special. But COPY IN
will treat them special; you will read in more columns than you output.
Thus as it is, you can't COPY IN something you COPY OUT'd.

One alternative would be to make the code use different paths for the
just-one and many delimiter cases. But then COPY OUT would need fixing.

Take care,

Bill




Re: Undocumented feature costs a lot of performance in COPY IN

From
Tom Lane
Date:
Bill Studenmund <wrstuden@netbsd.org> writes:
> One alternative would be to make the code use different paths for the
> just-one and many delimiter cases. But then COPY OUT would need fixing.

Well, it's not clear what COPY OUT should *do* with multiple
alternatives, anyway.  Pick one at random?  I guess it does that now,
if you consider "always use the first one" as a random choice.  The
real problem is that it will only backslash the first one, too.  That
means that data emitted with DELIMITERS "|_=", say, will fail to be
reloaded correctly if that same DELIMITERS string is given to COPY IN
--- because any _ or = characters in the data won't be backslashed,
but would need to be to keep COPY IN from treating them as delimiters.

For COPY OUT's purposes, a sensible interpretation of a multicharacter
delimiter string would be that the whole string is emitted as the
delimiter.  Eg,
COPY OUT WITH DELIMITERS "<TAB>";
foo<TAB>bar<TAB>baz...

But as long as COPY IN considers that delimiter spec to mean "any one of
these characters", and not a multicharacter string, we couldn't do that.

If we restrict DELIMITERS strings to be exactly one character for a
release or three, we could think about implementing this idea of
multicharacter delimiter strings later on.  Not sure if anyone really
needs it though.  In any case, the current behavior is inconsistent.
        regards, tom lane


Re: Undocumented feature costs a lot of performance in

From
Bill Studenmund
Date:
On Tue, 4 Dec 2001, Tom Lane wrote:

> Bill Studenmund <wrstuden@netbsd.org> writes:
> > One alternative would be to make the code use different paths for the
> > just-one and many delimiter cases. But then COPY OUT would need fixing.
>
> Well, it's not clear what COPY OUT should *do* with multiple
> alternatives, anyway.  Pick one at random?  I guess it does that now,
> if you consider "always use the first one" as a random choice.  The

I think that'd be fine.

> real problem is that it will only backslash the first one, too.  That

Ick. I was thinking that if you gave multiple delimiters, it would escape
each one. Which would be slow, and is why I think seperate code paths
would be good. :-)

> means that data emitted with DELIMITERS "|_=", say, will fail to be
> reloaded correctly if that same DELIMITERS string is given to COPY IN
> --- because any _ or = characters in the data won't be backslashed,
> but would need to be to keep COPY IN from treating them as delimiters.
>
> For COPY OUT's purposes, a sensible interpretation of a multicharacter
> delimiter string would be that the whole string is emitted as the
> delimiter.  Eg,
>
>     COPY OUT WITH DELIMITERS "<TAB>";
>
>     foo<TAB>bar<TAB>baz
>     ...
>
> But as long as COPY IN considers that delimiter spec to mean "any one of
> these characters", and not a multicharacter string, we couldn't do that.
>
> If we restrict DELIMITERS strings to be exactly one character for a
> release or three, we could think about implementing this idea of
> multicharacter delimiter strings later on.  Not sure if anyone really
> needs it though.  In any case, the current behavior is inconsistent.

I think this restriction sounds fine, and quite practical. :-)

Take care,

Bill



Re: Undocumented feature costs a lot of performance in COPY

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Yes, please fix it.  In fact, I think we should throw an error if more
> > than one character is specified as a delimiter.  Saying we ignore
> > multiple characters in the documentation is not enough when we silently
> > ignore them in the code.
> 
> Well, it'd be an easy enough addition:
> 
>     if (strlen(delim) != 1)
>         elog(ERROR, "COPY delimiter must be a single character");
> 
> This isn't multibyte-aware, but then neither is the implementation;
> delimiters that are multibyte characters won't work at the moment.

My point was that the documentation was saying it could only be one
character, and that we would ignore any characters after the first one,
but there was no enforcement in the code.

The right way to do it is to just say in the documentation it has to be
one character, and throw an error in the code if it isn't.

Limitations should be enforced in the code, if possible, not just
mentioned in the documenation, which may or may not get read.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: Undocumented feature costs a lot of performance in

From
Hannu Krosing
Date:

Bill Studenmund wrote:

>>
>>But as long as COPY IN considers that delimiter spec to mean "any one of
>>these characters", and not a multicharacter string, we couldn't do that.
>>
>>If we restrict DELIMITERS strings to be exactly one character for a
>>release or three, we could think about implementing this idea of
>>multicharacter delimiter strings later on.  Not sure if anyone really
>>needs it though.
>>
\r\n is quite popular (row) delimiter on some systems (and causes 
sometimes a weird box
char to appear at the end of last database field :), but I doubt I can 
give any examples
of multichar field delimiters

>>  In any case, the current behavior is inconsistent.
>>
>
>I think this restriction sounds fine, and quite practical. :-)
>
I sincerely doubt that anyone knowingly :) uses this undocumented 
feature for copy in,
as it can be found out only by trial and error.

Much better to remove it, enforce it in code as Bruce suggested, and 
document it.

------------------
Hannu




Re: Undocumented feature costs a lot of performance in COPY IN

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> We could support keywords DELIMITER and DELIMITERS and only document
> the first one.

One could also argue that it should be WITH DELIMITER for more
consistency with the other optional clauses.

But let's put that in the TODO list, not try to get it done now...
        regards, tom lane


Re: Undocumented feature costs a lot of performance in COPY

From
Bruce Momjian
Date:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> 
> > and indeed, only the first character is used by COPY OUT.  But COPY IN
> > is presently coded so that if multiple characters are mentioned in
> > USING DELIMITERS, any one of them will be taken as a field delimiter.
> > 
> > I would like to change the code to just "if (c == delim[0])",
> > which should buy back most of that 20% and make the behavior match the
> > documentation.  Question for the list: is this a bad change?  Is anyone
> > out there actually using this undocumented behavior?
> 
> Not I.
> 
> As an utter nitpick, the syntax should IMHO be USING DELIMITER (no S)
> if there is only one possible delimiter character.  But that *would*
> break lots of apps so I don't advocate it.  ;)

We could support keywords DELIMITER and DELIMITERS and only document the first
one.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: Undocumented feature costs a lot of performance in

From
Tatsuo Ishii
Date:
> Well, it'd be an easy enough addition:
> 
>     if (strlen(delim) != 1)
>         elog(ERROR, "COPY delimiter must be a single character");
> 
> This isn't multibyte-aware, but then neither is the implementation;
> delimiters that are multibyte characters won't work at the moment.
> 
> Any comments out there?

I think it will be acceptable for multibyte users only ASCII
characters could be a candidate of delimiters.
I don't think anybody wants to use Kanji as a delimiter:-)
--
Tatsuo Ishii


Re: Undocumented feature costs a lot of performance in COPY

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > We could support keywords DELIMITER and DELIMITERS and only document
> > the first one.
> 
> One could also argue that it should be WITH DELIMITER for more
> consistency with the other optional clauses.
> 
> But let's put that in the TODO list, not try to get it done now...

Updated TODO:

COPY...       o Change syntax to WITH DELIMITER, (keep old syntax around?)

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026