Thread: Undocumented feature costs a lot of performance in COPY IN
I have been fooling around profiling various ways of inserting wide (8000-byte, not all that wide) bytea fields, per Brent Verner's note of a few days ago. COPY IN should be, and is, the fastest way to do it. But I was rather startled to discover that 25% of the runtime of COPY IN went to an inefficient way of fetching single bytes from pqcomm.c (pq_getbytes(&ch, 1) instead of ch = pq_getbyte()), and 20% of what's left after fixing that is going into the strchr() call in CopyReadAttribute. Now the point of that strchr() call is to detect whether the current character is the column delimiter. The COPY reference page clearly says: By default, a text copy uses a tab ("\t") character as adelimiter between fields. The field delimiter may be changed toanyother single character with the keyword phrase USINGDELIMITERS. Characters in data fields which happen to match thedelimitercharacter will be backslash quoted. Note that thedelimiter is always a single character. If multiple charactersarespecified in the delimiter string, only the first characteris used. and indeed, only the first character is used by COPY OUT. But COPY IN is presently coded so that if multiple characters are mentioned in USING DELIMITERS, any one of them will be taken as a field delimiter. I would like to change the code to just "if (c == delim[0])", which should buy back most of that 20% and make the behavior match the documentation. Question for the list: is this a bad change? Is anyone out there actually using this undocumented behavior? regards, tom lane
> I would like to change the code to just "if (c == delim[0])", > which should buy back most of that 20% and make the behavior match the > documentation. Question for the list: is this a bad change? Is anyone > out there actually using this undocumented behavior? Yes, please fix it. In fact, I think we should throw an error if more than one character is specified as a delimiter. Saying we ignore multiple characters in the documentation is not enough when we silently ignore them in the code. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Yes, please fix it. In fact, I think we should throw an error if more > than one character is specified as a delimiter. Saying we ignore > multiple characters in the documentation is not enough when we silently > ignore them in the code. Well, it'd be an easy enough addition: if (strlen(delim) != 1) elog(ERROR, "COPY delimiter must be a single character"); This isn't multibyte-aware, but then neither is the implementation; delimiters that are multibyte characters won't work at the moment. Any comments out there? regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > and indeed, only the first character is used by COPY OUT. But COPY IN > is presently coded so that if multiple characters are mentioned in > USING DELIMITERS, any one of them will be taken as a field delimiter. > > I would like to change the code to just "if (c == delim[0])", > which should buy back most of that 20% and make the behavior match the > documentation. Question for the list: is this a bad change? Is anyone > out there actually using this undocumented behavior? Not I. As an utter nitpick, the syntax should IMHO be USING DELIMITER (no S) if there is only one possible delimiter character. But that *would* break lots of apps so I don't advocate it. ;) -Doug -- Let us cross over the river, and rest under the shade of the trees. --T. J. Jackson, 1863
On Tue, 4 Dec 2001, Tom Lane wrote: > By default, a text copy uses a tab ("\t") character as a > delimiter between fields. The field delimiter may be changed to > any other single character with the keyword phrase USING > DELIMITERS. Characters in data fields which happen to match the > delimiter character will be backslash quoted. Note that the > delimiter is always a single character. If multiple characters > are specified in the delimiter string, only the first character > is used. > > and indeed, only the first character is used by COPY OUT. But COPY IN > is presently coded so that if multiple characters are mentioned in > USING DELIMITERS, any one of them will be taken as a field delimiter. > > I would like to change the code to just "if (c == delim[0])", > which should buy back most of that 20% and make the behavior match the > documentation. Question for the list: is this a bad change? Is anyone > out there actually using this undocumented behavior? I think you should make the change. Because, as I understand it, when you give multiple delimiter characters COPY OUT will not delimit characters other than the first, since they won't be treated special. But COPY IN will treat them special; you will read in more columns than you output. Thus as it is, you can't COPY IN something you COPY OUT'd. One alternative would be to make the code use different paths for the just-one and many delimiter cases. But then COPY OUT would need fixing. Take care, Bill
Bill Studenmund <wrstuden@netbsd.org> writes: > One alternative would be to make the code use different paths for the > just-one and many delimiter cases. But then COPY OUT would need fixing. Well, it's not clear what COPY OUT should *do* with multiple alternatives, anyway. Pick one at random? I guess it does that now, if you consider "always use the first one" as a random choice. The real problem is that it will only backslash the first one, too. That means that data emitted with DELIMITERS "|_=", say, will fail to be reloaded correctly if that same DELIMITERS string is given to COPY IN --- because any _ or = characters in the data won't be backslashed, but would need to be to keep COPY IN from treating them as delimiters. For COPY OUT's purposes, a sensible interpretation of a multicharacter delimiter string would be that the whole string is emitted as the delimiter. Eg, COPY OUT WITH DELIMITERS "<TAB>"; foo<TAB>bar<TAB>baz... But as long as COPY IN considers that delimiter spec to mean "any one of these characters", and not a multicharacter string, we couldn't do that. If we restrict DELIMITERS strings to be exactly one character for a release or three, we could think about implementing this idea of multicharacter delimiter strings later on. Not sure if anyone really needs it though. In any case, the current behavior is inconsistent. regards, tom lane
On Tue, 4 Dec 2001, Tom Lane wrote: > Bill Studenmund <wrstuden@netbsd.org> writes: > > One alternative would be to make the code use different paths for the > > just-one and many delimiter cases. But then COPY OUT would need fixing. > > Well, it's not clear what COPY OUT should *do* with multiple > alternatives, anyway. Pick one at random? I guess it does that now, > if you consider "always use the first one" as a random choice. The I think that'd be fine. > real problem is that it will only backslash the first one, too. That Ick. I was thinking that if you gave multiple delimiters, it would escape each one. Which would be slow, and is why I think seperate code paths would be good. :-) > means that data emitted with DELIMITERS "|_=", say, will fail to be > reloaded correctly if that same DELIMITERS string is given to COPY IN > --- because any _ or = characters in the data won't be backslashed, > but would need to be to keep COPY IN from treating them as delimiters. > > For COPY OUT's purposes, a sensible interpretation of a multicharacter > delimiter string would be that the whole string is emitted as the > delimiter. Eg, > > COPY OUT WITH DELIMITERS "<TAB>"; > > foo<TAB>bar<TAB>baz > ... > > But as long as COPY IN considers that delimiter spec to mean "any one of > these characters", and not a multicharacter string, we couldn't do that. > > If we restrict DELIMITERS strings to be exactly one character for a > release or three, we could think about implementing this idea of > multicharacter delimiter strings later on. Not sure if anyone really > needs it though. In any case, the current behavior is inconsistent. I think this restriction sounds fine, and quite practical. :-) Take care, Bill
> Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Yes, please fix it. In fact, I think we should throw an error if more > > than one character is specified as a delimiter. Saying we ignore > > multiple characters in the documentation is not enough when we silently > > ignore them in the code. > > Well, it'd be an easy enough addition: > > if (strlen(delim) != 1) > elog(ERROR, "COPY delimiter must be a single character"); > > This isn't multibyte-aware, but then neither is the implementation; > delimiters that are multibyte characters won't work at the moment. My point was that the documentation was saying it could only be one character, and that we would ignore any characters after the first one, but there was no enforcement in the code. The right way to do it is to just say in the documentation it has to be one character, and throw an error in the code if it isn't. Limitations should be enforced in the code, if possible, not just mentioned in the documenation, which may or may not get read. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bill Studenmund wrote: >> >>But as long as COPY IN considers that delimiter spec to mean "any one of >>these characters", and not a multicharacter string, we couldn't do that. >> >>If we restrict DELIMITERS strings to be exactly one character for a >>release or three, we could think about implementing this idea of >>multicharacter delimiter strings later on. Not sure if anyone really >>needs it though. >> \r\n is quite popular (row) delimiter on some systems (and causes sometimes a weird box char to appear at the end of last database field :), but I doubt I can give any examples of multichar field delimiters >> In any case, the current behavior is inconsistent. >> > >I think this restriction sounds fine, and quite practical. :-) > I sincerely doubt that anyone knowingly :) uses this undocumented feature for copy in, as it can be found out only by trial and error. Much better to remove it, enforce it in code as Bruce suggested, and document it. ------------------ Hannu
Bruce Momjian <pgman@candle.pha.pa.us> writes: > We could support keywords DELIMITER and DELIMITERS and only document > the first one. One could also argue that it should be WITH DELIMITER for more consistency with the other optional clauses. But let's put that in the TODO list, not try to get it done now... regards, tom lane
> Tom Lane <tgl@sss.pgh.pa.us> writes: > > > and indeed, only the first character is used by COPY OUT. But COPY IN > > is presently coded so that if multiple characters are mentioned in > > USING DELIMITERS, any one of them will be taken as a field delimiter. > > > > I would like to change the code to just "if (c == delim[0])", > > which should buy back most of that 20% and make the behavior match the > > documentation. Question for the list: is this a bad change? Is anyone > > out there actually using this undocumented behavior? > > Not I. > > As an utter nitpick, the syntax should IMHO be USING DELIMITER (no S) > if there is only one possible delimiter character. But that *would* > break lots of apps so I don't advocate it. ;) We could support keywords DELIMITER and DELIMITERS and only document the first one. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> Well, it'd be an easy enough addition: > > if (strlen(delim) != 1) > elog(ERROR, "COPY delimiter must be a single character"); > > This isn't multibyte-aware, but then neither is the implementation; > delimiters that are multibyte characters won't work at the moment. > > Any comments out there? I think it will be acceptable for multibyte users only ASCII characters could be a candidate of delimiters. I don't think anybody wants to use Kanji as a delimiter:-) -- Tatsuo Ishii
> Bruce Momjian <pgman@candle.pha.pa.us> writes: > > We could support keywords DELIMITER and DELIMITERS and only document > > the first one. > > One could also argue that it should be WITH DELIMITER for more > consistency with the other optional clauses. > > But let's put that in the TODO list, not try to get it done now... Updated TODO: COPY... o Change syntax to WITH DELIMITER, (keep old syntax around?) -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026