Thread: invalidly encoded strings
I have been looking at fixing the issue of accepting strings that are not valid in the database encoding. It appears from previous discussion that we need to add a call to pg_verifymbstr() to the relevant input routines and ensure that the chr() function returns a valid string. That leaves several issues:

. which are the relevant input routines? I have identified the following as needing remediation: textin(), bpcharin(), varcharin(), anyenum_in(), namein(). Do we also need one for cstring_in()? Does the xml code handle this as part of xml validation?

. what do we need to do to make the verification code more efficient? I think we need to address the correctness issue first, but doing so should certainly make us want to improve the verification code. For example, I'm wondering if it might benefit from having a tiny cache.

. for chr() under UTF8, it seems to be generally agreed that the argument should represent the codepoint and the function should return the correspondingly encoded character. If so, possibly the argument should be a bigint to accommodate the full range of possible code points. It is not clear what the argument should represent for other multi-byte encodings for any argument higher than 127. Similarly, it is not clear what ascii() should return in such cases. I would be inclined just to error out.

cheers

andrew
On Sun, Sep 09, 2007 at 12:02:28AM -0400, Andrew Dunstan wrote: > . what do we need to do to make the verification code more efficient? I > think we need to address the correctness issue first, but doing so > should certainly make us want to improve the verification code. For > example, I'm wondering if it might benefit from having a tiny cache. It has been pointed out that the verification for UTF-8 is very inefficient, involving several function calls to first get the length, then check characters, etc. It could be significantly improved. I don't know whether a cache would make any useful difference. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout wrote: > On Sun, Sep 09, 2007 at 12:02:28AM -0400, Andrew Dunstan wrote: > >> . what do we need to do to make the verification code more efficient? I >> think we need to address the correctness issue first, but doing so >> should certainly make us want to improve the verification code. For >> example, I'm wondering if it might benefit from having a tiny cache. >> > > It has been pointed out the the verification for UTF-8 is very > inefficient, involving several function calls to first get the length, > then check characters, etc. It could be significantly improved. I don't > know whether a cache would make any useful difference. > > > Well, it looks to me like "several" = 2. If function call overhead is the worry, we could either create inline versions of pg_utf_mblen and pg_utf8_islegal and rely on the compiler to make things good for us, or inline the calls directly ourselves. Either way there's some duplication, since the whole "extern inline" thing is still a mess. I'd be inclined to do the inlining by hand since it's not that much code and it would be more portable. Unless I'm missing something it should be a 10 minute c&p job. Is that all we're worried about? cheers andrew
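For illustration, a standalone, single-pass UTF-8 legality check along the lines being discussed might look like the sketch below. This is not the backend's pg_utf_mblen()/pg_utf8_islegal() code; the function name is invented for the example, and the rules simply follow RFC 3629 (reject bad continuation bytes, overlong forms, UTF-16 surrogates, and code points above U+10FFFF). The point is that the length determination and the byte checks fold into one loop with no extra function calls.

/*
 * Minimal sketch (not PostgreSQL source): a single-pass UTF-8 validity
 * check that folds the "get length" and "check bytes" steps into one loop.
 */
#include <stdbool.h>
#include <stdio.h>

static bool
utf8_string_is_legal(const unsigned char *s, int len)
{
    int i = 0;

    while (i < len)
    {
        unsigned char b = s[i];

        if (b < 0x80)                       /* ASCII */
            i += 1;
        else if (b >= 0xC2 && b <= 0xDF)    /* 2-byte sequence */
        {
            if (i + 1 >= len || (s[i + 1] & 0xC0) != 0x80)
                return false;
            i += 2;
        }
        else if (b >= 0xE0 && b <= 0xEF)    /* 3-byte sequence */
        {
            if (i + 2 >= len ||
                (s[i + 1] & 0xC0) != 0x80 || (s[i + 2] & 0xC0) != 0x80)
                return false;
            if (b == 0xE0 && s[i + 1] < 0xA0)   /* overlong */
                return false;
            if (b == 0xED && s[i + 1] > 0x9F)   /* UTF-16 surrogate */
                return false;
            i += 3;
        }
        else if (b >= 0xF0 && b <= 0xF4)    /* 4-byte sequence */
        {
            if (i + 3 >= len ||
                (s[i + 1] & 0xC0) != 0x80 ||
                (s[i + 2] & 0xC0) != 0x80 || (s[i + 3] & 0xC0) != 0x80)
                return false;
            if (b == 0xF0 && s[i + 1] < 0x90)   /* overlong */
                return false;
            if (b == 0xF4 && s[i + 1] > 0x8F)   /* above U+10FFFF */
                return false;
            i += 4;
        }
        else                                /* 0x80-0xC1, 0xF5-0xFF: never a lead byte */
            return false;
    }
    return true;
}

int
main(void)
{
    printf("%d\n", utf8_string_is_legal((const unsigned char *) "caf\xC3\xA9", 5)); /* 1: valid */
    printf("%d\n", utf8_string_is_legal((const unsigned char *) "\xF6\x36", 2));    /* 0: invalid */
    return 0;
}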
Andrew Dunstan <andrew@dunslane.net> writes: > I have been looking at fixing the issue of accepting strings that are > not valid in the database encoding. It appears from previous discussion > that we need to add a call to pg_verifymbstr() to the relevant input > routines and ensure that the chr() function returns a valid string. That > leaves several issues: > . which are the relevant input routines? I have identified the following > as needing remediation: textin(), bpcharin(), varcharin(), anyenum_in(), > namein(). Do we also need one for cstring_in()? Does the xml code > handle this as part of xml validation? This seems entirely the wrong approach, because 99% of the time a check placed here will be redundant with the one in the main client-input logic. (That was, indeed, the reason I took such checks out of these places in the first place.) Moreover, as you've already found out there are N places that would have to be fixed and we'd face a constant hazard of new datatypes introducing new holes. AFAICS the risk comes only from chr() and the possibility of numeric backslash-escapes in SQL string literals, and we ought to think about fixing it in those places. A possible answer is to add a verifymbstr to the string literal converter anytime it has processed a numeric backslash-escape in the string. Open questions for that are (1) does it have negative effects for bytea, and if so is there any hope of working around it? (2) how can we postpone the conversion/test to the parse analysis phase? (To the extent that database encoding is frozen it'd probably be OK to do it in the scanner, but such a choice will come back to bite us eventually.) > . for chr() under UTF8, it seems to be generally agreed that the > argument should represent the codepoint and the function should return > the correspondingly encoded character. If so, possible the argument > should be a bigint to accommodate the full range of possible code > points. It is not clear what the argument should represent for other > multi-byte encodings for any argument higher than 127. Similarly, it is > not clear what ascii() should return in such cases. I would be inclined > just to error out. In SQL_ASCII I'd argue for allowing 0..255. In actual MB encodings, OK with throwing error. regards, tom lane
Tom Lane wrote: > > A possible answer is to add a verifymbstr to the string literal > converter anytime it has processed a numeric backslash-escape in the > string. Open questions for that are (1) does it have negative effects > for bytea, and if so is there any hope of working around it? (2) how > can we postpone the conversion/test to the parse analysis phase? > Finding out how to do (2) seems to me the only possible answer to (1). I'll have a look. Is that going to cover data coming in via COPY? and parameters for prepared statements? >> . for chr() under UTF8, it seems to be generally agreed that the >> argument should represent the codepoint and the function should return >> the correspondingly encoded character. If so, possible the argument >> should be a bigint to accommodate the full range of possible code >> points. It is not clear what the argument should represent for other >> multi-byte encodings for any argument higher than 127. Similarly, it is >> not clear what ascii() should return in such cases. I would be inclined >> just to error out. >> > > In SQL_ASCII I'd argue for allowing 0..255. In actual MB encodings, > OK with throwing error. > > > I was planning on allowing up to 255 for all single byte encodings too. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Is that going to cover data coming in via COPY? and parameters for > prepared statements? Those should be checked already --- if not, the right fix is still to fix it there, not in per-datatype code. I think we are OK though, eg see "need_transcoding" logic in copy.c. >> In SQL_ASCII I'd argue for allowing 0..255. In actual MB encodings, >> OK with throwing error. > I was planning on allowing up to 255 for all single byte encodings too. OK, that sounds fine. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Is that going to cover data coming in via COPY? and parameters for >> prepared statements? >> > > Those should be checked already --- if not, the right fix is still to > fix it there, not in per-datatype code. I think we are OK though, > eg see "need_transcoding" logic in copy.c. >

Well, a little experimentation shows that we currently are not OK:

in foo.data: \366\66

utf8test=# \copy xx from foo.data
utf8test=# select encode(t::bytea,'hex') from xx;
 encode
--------
 f636
(1 row)

utf8test=# \copy xx to bb.data
utf8test=# \copy xx from bb.data
ERROR:  invalid byte sequence for encoding "UTF8": 0xf636
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT:  COPY xx, line 1
utf8test=#

BTW, all the foo_recv functions that call pq_getmsgtext or pq_getmsgstring are thereby calling pg_verify_mbstr already (via pg_client_to_server). So I am still not 100% convinced that doing the same directly in the corresponding foo_in functions is such a bad idea.

cheers

andrew
On Sun, 2007-09-09 at 10:51 -0400, Tom Lane wrote: > A possible answer is to add a verifymbstr to the string literal > converter anytime it has processed a numeric backslash-escape in the > string. Open questions for that are (1) does it have negative effects > for bytea, and if so is there any hope of working around it? (2) how > can we postpone the conversion/test to the parse analysis phase? > (To the extent that database encoding is frozen it'd probably be OK > to do it in the scanner, but such a choice will come back to bite > us eventually.) Regarding #1: Currently, you can pass a bytea literal as either: E'\377\377\377' or E'\\377\\377\\377'. The first strategy (single backslash) is not correct, because if you do E'\377\000\377', the embedded null character counts as the end of the cstring, even though there are bytes after it. Similar strange things happen if you have a E'\134' (backslash) somewhere in the string. However, I have no doubt that there are people who use the first strategy anyway, and the proposed change would break that for them. There may be more issues, too. Regards,Jeff Davis
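To make the embedded-null point concrete: once the literal E'\377\000\377' has been de-escaped, anything downstream that receives the result as a plain cstring can only see up to the first NUL. The following is a tiny standalone illustration, not backend code; it just simulates the bytes the scanner would hand onward under the single-backslash strategy.

#include <stdio.h>
#include <string.h>

int
main(void)
{
    /* the bytes produced by de-escaping E'\377\000\377' */
    const char deescaped[] = {'\377', '\0', '\377', '\0'};

    /* a cstring consumer stops at the embedded NUL */
    printf("bytes intended: 3, bytes a cstring consumer sees: %zu\n",
           strlen(deescaped));              /* prints 1 */
    return 0;
}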
Andrew Dunstan <andrew@dunslane.net> writes: > Well, a little experimentation shows that we currently are not OK: This experiment is inadequately described. What is the type of the column involved? regards, tom lane
Jeff Davis <pgsql@j-davis.com> writes: > Currently, you can pass a bytea literal as either: E'\377\377\377' or > E'\\377\\377\\377'. > The first strategy (single backslash) is not correct, because if you do > E'\377\000\377', the embedded null character counts as the end of the > cstring, even though there are bytes after it. Similar strange things > happen if you have a E'\134' (backslash) somewhere in the string. > However, I have no doubt that there are people who use the first > strategy anyway, and the proposed change would break that for them. If their code is broken anyway, calling their attention to it would be a good idea, hm? If we are not going to reject the embedded-null case then there is hardly any point in considering any behavioral change at all here. I want to point out in particular that Andrew's proposal of making datatype input functions responsible for encoding verification cannot possibly handle this, since they have to take the "terminating" null at face value. regards, tom lane
On Sun, 2007-09-09 at 17:09 -0400, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > Currently, you can pass a bytea literal as either: E'\377\377\377' or > > E'\\377\\377\\377'. > > > The first strategy (single backslash) is not correct, because if you do > > E'\377\000\377', the embedded null character counts as the end of the > > cstring, even though there are bytes after it. Similar strange things > > happen if you have a E'\134' (backslash) somewhere in the string. > > However, I have no doubt that there are people who use the first > > strategy anyway, and the proposed change would break that for them. > > If their code is broken anyway, calling their attention to it would be a > good idea, hm? > Agreed. > If we are not going to reject the embedded-null case then there is > hardly any point in considering any behavioral change at all here. > I want to point out in particular that Andrew's proposal of making > datatype input functions responsible for encoding verification cannot > possibly handle this, since they have to take the "terminating" null > at face value. Would stringTypeDatum() in parse_type.c be a good place to put the pg_verifymbstr()? Regards,Jeff Davis
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Well, a little experimentation shows that we currently are not OK: >> > > This experiment is inadequately described. > What is the type of the column involved? > > > Sorry. It's text. cheers andrew
Jeff Davis <pgsql@j-davis.com> writes: > Would stringTypeDatum() in parse_type.c be a good place to put the > pg_verifymbstr()? Probably not, in its current form, since it hasn't got any idea where the "char *string" came from; moreover it is not in any better position than the typinput function to determine whether there was a bogus embedded null. OTOH, there may be no decent way to fix the embedded-null problem other than by hacking the scanner to reject \0 immediately. If we did that it would give us more flexibility about where to put the encoding validity checks. In any case, I feel dubious that checking in stringTypeDatum will cover every code path. Somewhere around where A_Const gets transformed to Const seems like it'd be a better plan. (But I think that in most utility statement parsetrees, A_Const never does get transformed to Const; and there seem to be a few places in gram.y where an SCONST gives rise to something other than A_Const; so this is still not a bulletproof choice, at least not without additional changes.) In the short run it might be best to do it in scan.l after all. A few minutes' thought about what it'd take to delay the decisions till later yields a depressingly large number of changes; and we do not have time to be developing mostly-cosmetic patches for 8.3. Given that database_encoding is frozen for any one DB at the moment, and that that is unlikely to change in the near future, insisting on a solution that allows it to vary is probably unreasonable at this stage of the game. regards, tom lane
Tom Lane wrote: > In the short run it might be best to do it in scan.l after all. > I have not come up with a way of doing that and handling the bytea case. If you have I'm all ears. And then I am still worried about COPY. cheers andrew
On Sun, 2007-09-09 at 23:22 -0400, Tom Lane wrote: > In the short run it might be best to do it in scan.l after all. A few > minutes' thought about what it'd take to delay the decisions till later > yields a depressingly large number of changes; and we do not have time > to be developing mostly-cosmetic patches for 8.3. Given that > database_encoding is frozen for any one DB at the moment, and that that > is unlikely to change in the near future, insisting on a solution that > allows it to vary is probably unreasonable at this stage of the game. > Sounds reasonable to me. Regards,Jeff Davis
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> In the short run it might be best to do it in scan.l after all. > I have not come up with a way of doing that and handling the bytea case. AFAICS we have no realistic choice other than to reject \0 in SQL literals; to do otherwise requires API changes throughout that stack of modules. And once you admit \0 is no good, it's not clear that \somethingelse is any better for bytea-using clients. Moreover, given that we are moving away from backslash escapes as fast as we can sanely travel, expending large amounts of energy to make them work better doesn't seem like a good use of development manpower. > If you have I'm all ears. And then I am still worried about COPY. Haven't looked at your test case yet... but it sure looks to me like the COPY code *ought* to cover this. regards, tom lane
On Sun, 2007-09-09 at 23:33 -0400, Andrew Dunstan wrote: > Tom Lane wrote: > > In the short run it might be best to do it in scan.l after all. > > > > I have not come up with a way of doing that and handling the bytea case. > If you have I'm all ears. And then I am still worried about COPY. If it's done in the scanner it should still accept things like: E'\\377\\000\\377'::bytea right? I think the concern is when they use only one slash, like: E'\377\000\377'::bytea which, as I mentioned before, is not correct anyway. Regards,Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > If it's done in the scanner it should still accept things like: > E'\\377\\000\\377'::bytea > right? Right, that will work, because the transformed literal is '\377\000\377' (no strange characters there, just what it says) and that has not got any encoding violations. > I think the concern is when they use only one slash, like: > E'\377\000\377'::bytea > which, as I mentioned before, is not correct anyway. Note also that if you have standard_conforming_strings set to ON, this gives the very same result: '\377\000\377'::bytea I'm not convinced that we need to move heaven and earth to make this work the same with or without E''. Basically what I'm thinking is we should just decree that the de-escaped string literal still has to be valid per the database encoding. If you don't want to be bound by that, use the right number of backslashes to get the thing passed through to the bytea input routine. regards, tom lane
Tom Lane wrote: >> . for chr() under UTF8, it seems to be generally agreed >> that the argument should represent the codepoint and the >> function should return the correspondingly encoded character. >> If so, possible the argument should be a bigint to >> accommodate the full range of possible code points. >> It is not clear what the argument should represent for other >> multi-byte encodings for any argument higher than 127. >> Similarly, it is not clear what ascii() should return in >> such cases. I would be inclined just to error out. > > In SQL_ASCII I'd argue for allowing 0..255. In actual MB > encodings, OK with throwing error. I'd like to repeat my suggestion for chr() and ascii(). Instead of the code point, I'd prefer the actual encoding of the character as argument to chr() and return value of ascii(). The advantage I see is the following: - It would make these functions from oracle_compat.c compatible with Oracle (Oracle's chr() and ascii() work the way I suggest). I agree with Tom's earlier suggestion to throw an error for chr(0), although this is not what Oracle does. Of course, if it is generally perceived that the code point is more useful than the encoding, then Oracle compliance is probably secondary. Yours, Laurenz Albe
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Jeff Davis <pgsql@j-davis.com> writes: > >> I think the concern is when they use only one slash, like: >> E'\377\000\377'::bytea >> which, as I mentioned before, is not correct anyway. Wait, why would this be wrong? How would you enter the three byte bytea of consisting of those three bytes described above? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
>>> I think the concern is when they use only one slash, like: >>> E'\377\000\377'::bytea >>> which, as I mentioned before, is not correct anyway. > > Wait, why would this be wrong? How would you enter the three byte bytea of > consisting of those three bytes described above? Either as E'\\377\\000\\377' or '\377\000\377' /Dennis
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Tom Lane wrote: >> >>> In the short run it might be best to do it in scan.l after all. >>> > > >> I have not come up with a way of doing that and handling the bytea case. >> > > AFAICS we have no realistic choice other than to reject \0 in SQL > literals; to do otherwise requires API changes throughout that stack of > modules. And once you admit \0 is no good, it's not clear that > \somethingelse is any better for bytea-using clients. Moreover, given > that we are moving away from backslash escapes as fast as we can sanely > travel, expending large amounts of energy to make them work better > doesn't seem like a good use of development manpower. > > > Perhaps we're talking at cross purposes. The problem with doing encoding validation in scan.l is that it lacks context. Null bytes are only the tip of the bytea iceberg, since any arbitrary sequence of bytes can be valid for a bytea. So we can only do validation of encoding on a literal when we know it isn't destined for a bytea. That's what I haven't come up with a way of doing in the scanner (and as you noted upthread it's getting pretty darn late in the cycle for us to be looking for ways to do things). I still don't see why it's OK for us to do validation from the foo_recv functions but not the corresponding foo_in functions. At least in the short term that would provide us with fairly complete protection against accepting invalidly encoded data into the database, once we fix up chr(), without having to mess with the scanner, parser, COPY code etc. We could still get corruption from UDFs and UDTs - it's hard to see how we can avoid that danger. Yes, we would need to make sure that any additions to the type system acted properly, and yes we should fix up any validation inefficiencies (like possibly inlining calls in the UTF8 validation code). Personally, I don't see those as killer objections. cheers andrew
Albe Laurenz wrote: > I'd like to repeat my suggestion for chr() and ascii(). > > Instead of the code point, I'd prefer the actual encoding of > the character as argument to chr() and return value of ascii(). > > > [snip] > Of course, if it is generally perceived that the code point > is more useful than the encoding, then Oracle compliance > is probably secondary. > > > Last time this was discussed, you were the only person arguing for that behaviour, IIRC. And frankly, I don't know how to do it sanely anyway. A character encoding has a fixed byte pattern, but a given byte pattern doesn't have a single universal number value. I really don't think we want to have the value of chr(n) depend on the endianness of the machine, do we? The reason we are prepared to make an exception for Unicode is precisely because the code point maps to an encoding pattern independently of architecture, ISTM. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Perhaps we're talking at cross purposes. > The problem with doing encoding validation in scan.l is that it lacks > context. Null bytes are only the tip of the bytea iceberg, since any > arbitrary sequence of bytes can be valid for a bytea. If you think that, then we're definitely talking at cross purposes. I assert we should require the post-scanning value of a string literal to be valid in the database encoding. If you want to produce an arbitrary byte sequence within a bytea value, the way to get there is for the bytea input function to do the de-escaping, not for the string literal parser to do it. The current situation where there is overlapping functionality is a bit unfortunate, but once standard_conforming_strings is on by default, it'll get a lot easier to work with. I'm not eager to contort APIs throughout the backend in order to produce a more usable solution for the standard_conforming_strings = off case, given the expected limited lifespan of that usage. The only reason I was considering not doing it in scan.l is that scan.l's behavior ideally shouldn't depend on any changeable variables. But until there's some prospect of database_encoding actually being mutable at run time, there's not much point in investing a lot of sweat on that either. > I still don't see why it's OK for us to do validation from the foo_recv > functions but not the corresponding foo_in functions. Basically, a CSTRING handed to an input function should already have been encoding-validated by somebody. The most obvious reason why this must be so is the embedded-null problem, but in general it will already have been validated (typically as part of a larger string such as the whole SQL query or whole COPY data line), and doing that over is pointless and expensive. On the other hand, the entire point of a recv function is that it gets raw data that no one else in the backend knows the format of; so if the data is to be considered textual, the recv function has to be the one that considers it so and invokes appropriate conversion or validation. The reason backslash escapes in string literals are a problem is that they can produce incorrect-encoding results from what had been a validated string. > At least in the > short term that would provide us with fairly complete protection against > accepting invalidly encoded data into the database, once we fix up > chr(), without having to mess with the scanner, parser, COPY code etc. Instead, we have to mess with an unknown number of UDTs ... regards, tom lane
Andrew Dunstan <andrew@dunslane.net> writes: > The reason we are prepared to make an exception for Unicode is precisely > because the code point maps to an encoding pattern independently of > architecture, ISTM. Right --- there is a well-defined standard for the numerical value of each character in Unicode. And it's also clear what to do in single-byte encodings. It's not at all clear what the representation ought to be for other multibyte encodings. A direct transliteration of the byte sequence not only has endianness issues, but will have a weird non-dense set of valid values because of the restrictions on valid multibyte characters. Given that chr() has never before behaved sanely for multibyte values at all, extending it to Unicode code points is a reasonable extension, and throwing error for other encodings is reasonable too. If we ever do come across code-point standards for other encodings we can adopt 'em at that time. regards, tom lane
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> Those should be checked already --- if not, the right fix is still to >> fix it there, not in per-datatype code. I think we are OK though, >> eg see "need_transcoding" logic in copy.c. > Well, a little experimentation shows that we currently are not OK: > in foo.data: > \366\66 I looked at this and found that the problem is that CopyReadAttributesText() does backslash conversions and doesn't validate the result. My proposal at this point is that both scan.l and CopyReadAttributesText need to guarantee that the result of de-escaping is valid in the database encoding: they should reject \0 and check validity of multibyte characters. You could optimize this fairly well (important in the COPY case) by not redoing verifymbstr unless you'd seen at least one numerical backslash-escape that set the high bit. regards, tom lane
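As an illustration of the optimization described above (only re-verify when a numeric escape has produced a NUL or a high-bit byte), here is a standalone sketch. It is not the actual CopyReadAttributesText() code: it handles only octal escapes, the function names are invented for the example, and the place where the backend would call pg_verifymbstr() is marked with a comment.

/*
 * Sketch: de-escape octal \ooo sequences, remembering whether any escape
 * produced a NUL or a byte with the high bit set.  Only in that case does
 * the result need re-verification against the database encoding; otherwise
 * the input line was already checked once and cannot have been broken.
 */
#include <stdbool.h>
#include <stdio.h>

/* De-escape octal \ooo sequences; returns new length, sets *needs_verify. */
static int
deescape_octal(const char *in, char *out, bool *needs_verify)
{
    int outlen = 0;

    *needs_verify = false;
    while (*in)
    {
        if (*in == '\\' && in[1] >= '0' && in[1] <= '7')
        {
            int val = 0, digits = 0;

            in++;
            while (digits < 3 && *in >= '0' && *in <= '7')
            {
                val = (val << 3) + (*in - '0');
                in++;
                digits++;
            }
            if (val == 0 || val >= 0x80)
                *needs_verify = true;   /* could break encoding validity */
            out[outlen++] = (char) val;
        }
        else
            out[outlen++] = *in++;
    }
    out[outlen] = '\0';
    return outlen;
}

int
main(void)
{
    char buf[64];
    bool needs_verify;
    int len;

    len = deescape_octal("\\141bc", buf, &needs_verify);    /* "abc" */
    printf("len=%d verify=%d\n", len, needs_verify);        /* 3, 0 */

    len = deescape_octal("\\366\\66", buf, &needs_verify);  /* 0xF6 '6' */
    printf("len=%d verify=%d\n", len, needs_verify);        /* 2, 1 */
    /* here the backend would call pg_verifymbstr(buf, len, false) */
    return 0;
}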
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Perhaps we're talking at cross purposes. >> > > >> The problem with doing encoding validation in scan.l is that it lacks >> context. Null bytes are only the tip of the bytea iceberg, since any >> arbitrary sequence of bytes can be valid for a bytea. >> > > If you think that, then we're definitely talking at cross purposes. > I assert we should require the post-scanning value of a string literal > to be valid in the database encoding. If you want to produce an > arbitrary byte sequence within a bytea value, the way to get there is > for the bytea input function to do the de-escaping, not for the string > literal parser to do it. >

[looks again ... thinks ...] Ok, I'm sold.

> > The only reason I was considering not doing it in scan.l is that > scan.l's behavior ideally shouldn't depend on any changeable variables. > But until there's some prospect of database_encoding actually being > mutable at run time, there's not much point in investing a lot of sweat > on that either. >

Agreed. This would just be one item of many to change if/when we ever come to that.

> Instead, we have to mess with an unknown number of UDTs ... > > >

We're going to have that danger anyway, aren't we, unless we check the encoding validity of the result of every UDF that returns some text type? I'm not going to lose sleep over something I can't cure but the user can - what concerns me is that our own code ensures data integrity, including for encoding.

Anyway, it looks to me like we have the following items to do:

. add validity checking to the scanner
. fix COPY code
. fix chr()
. improve efficiency of validity checks, at least for UTF8

cheers

andrew
> Andrew Dunstan <andrew@dunslane.net> writes: > > The reason we are prepared to make an exception for Unicode is precisely > > because the code point maps to an encoding pattern independently of > > architecture, ISTM. > > Right --- there is a well-defined standard for the numerical value of > each character in Unicode. And it's also clear what to do in > single-byte encodings. It's not at all clear what the representation > ought to be for other multibyte encodings. A direct transliteration > of the byte sequence not only has endianness issues, but will have > a weird non-dense set of valid values because of the restrictions on > valid multibyte characters. > > Given that chr() has never before behaved sanely for multibyte values at > all, extending it to Unicode code points is a reasonable extension, > and throwing error for other encodings is reasonable too. If we ever do > come across code-point standards for other encodings we can adopt 'em at > that time. I don't understand the whole discussion. Why do you think that employing the Unicode code point as the chr() argument could avoid endianness issues? Are you going to represent the Unicode code point as UCS-4? Then you have to specify the endianness anyway. (see the UCS-4 standard for more details) Or are you going to represent the Unicode code point as a character string such as 'U+0259'? Then representing any encoding as a string could avoid endianness issues anyway, and I don't see that the Unicode code point is any better than others. Also I'd like to point out that every encoding has its own code point system as far as I know. For example, EUC-JP has its corresponding code point systems, ASCII, JIS X 0208 and JIS X 0212. So I don't see why we can't use "code point" as chr()'s argument for other encodings (of course we need an optional parameter specifying which character set is intended). -- Tatsuo Ishii SRA OSS, Inc. Japan
BTW, I'm sure this was discussed but I forgot the conclusion: should chr(0) throw an error? If we're trying to get rid of embedded-null problems, seems it must. regards, tom lane
On Tue, Sep 11, 2007 at 12:30:51AM +0900, Tatsuo Ishii wrote: > Why do you think that employing the Unicode code point as the chr() > argument could avoid endianness issues? Are you going to represent > Unicode code point as UCS-4? Then you have to specify the endianness > anyway. (see the UCS-4 standard for more details) Because the argument to chr() is an integer, which has no endian-ness. You only get into endian-ness if you look at how you store the resulting string. > Also I'd like to point out all encodings has its own code point > systems as far as I know. For example, EUC-JP has its corresponding > code point systems, ASCII, JIS X 0208 and JIS X 0212. So I don't see > we can't use "code point" as chr()'s argument for othe encodings(of > course we need optional parameter specifying which character set is > supposed). Oh, the last discussion on this didn't answer this question. Is there a standard somewhere that maps integers to characters in EUC-JP? If so, how can I find out what character 512 is? Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Tom Lane wrote: > BTW, I'm sure this was discussed but I forgot the conclusion: should > chr(0) throw an error? If we're trying to get rid of embedded-null > problems, seems it must. > > > I think it should, yes. cheers andrew
Tatsuo Ishii wrote: > > I don't understand whole discussion. > > Why do you think that employing the Unicode code point as the chr() > argument could avoid endianness issues? Are you going to represent > Unicode code point as UCS-4? Then you have to specify the endianness > anyway. (see the UCS-4 standard for more details) > The code point is simply a number. The result of chr() will be a text value one char (not one byte) wide, in the relevant database encoding. U+nnnn maps to the same Unicode char and hence the same UTF8 encoding pattern regardless of endianness. e.g. U+00a9 is the copyright symbol on all machines. So to get this char in a UTF8 database you could call "select chr(169)" and get back the byte pattern \xC2A9. > Or are you going to represent Unicode point as a character string such > as 'U+0259'? Then representing any encoding as a string could avoid > endianness issues anyway, and I don't see Unicode code point is any > better than others. > The argument will be a number, as now. > Also I'd like to point out all encodings has its own code point > systems as far as I know. For example, EUC-JP has its corresponding > code point systems, ASCII, JIS X 0208 and JIS X 0212. So I don't see > we can't use "code point" as chr()'s argument for othe encodings(of > course we need optional parameter specifying which character set is > supposed). > Where can I find the tables that map code points (as opposed to encodings) to characters for these others? cheers andrew
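A sketch of the mapping being described here: the chr() argument is a Unicode code point (just a number, so no endianness enters into it), and the result is the corresponding UTF-8 byte sequence in the database encoding. This is standalone illustration code, not the proposed patch; the function name is made up, and a real implementation would also reject 0 and the surrogate range U+D800..U+DFFF.

/*
 * Sketch of the code-point-to-UTF-8 mapping discussed for chr() under UTF8.
 */
#include <stdio.h>

/* Encode one code point as UTF-8; returns number of bytes, 0 on error. */
static int
unicode_to_utf8(unsigned int cp, unsigned char *out)
{
    if (cp < 0x80)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    }
    if (cp < 0x10000)
    {
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    }
    if (cp <= 0x10FFFF)
    {
        out[0] = 0xF0 | (cp >> 18);
        out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (cp & 0x3F);
        return 4;
    }
    return 0;                   /* beyond the 21-bit Unicode range */
}

int
main(void)
{
    unsigned char buf[4];
    int n = unicode_to_utf8(169, buf);      /* U+00A9, the copyright sign */

    for (int i = 0; i < n; i++)
        printf("%02X", buf[i]);             /* prints C2A9 */
    printf("\n");
    return 0;
}

This also shows why a plain int argument is enough: valid code points stop at U+10FFFF, which fits comfortably in 21 bits.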
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> BTW, I'm sure this was discussed but I forgot the conclusion: should >> chr(0) throw an error? > I think it should, yes. OK. Looking back, there was also some mention of changing chr's argument to bigint, but I'd counsel against doing that. We should not need it since we only support 4-byte UTF8, hence code points only up to 21 bits (and indeed even 6-byte UTF8 can only have 31-bit code points, no?). If Tatsuo can find official code-point mappings for any other MB encodings, +1 on supporting those too. regards, tom lane
On Mon, Sep 10, 2007 at 11:48:29AM -0400, Tom Lane wrote: > BTW, I'm sure this was discussed but I forgot the conclusion: should > chr(0) throw an error? If we're trying to get rid of embedded-null > problems, seems it must. It is pointed out on Wikipedia that Java sometimes uses the byte pair C0 80 to represent the NUL character to allow it to be embedded in C strings without tripping anything up. It is however technically an illegal representation (violates the minimal representation rule) and thus rejected by postgres. I'm not suggesting we copy this, but it does show there are other ways to deal with this. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Tom Lane wrote: > OK. Looking back, there was also some mention of changing chr's > argument to bigint, but I'd counsel against doing that. We should not > need it since we only support 4-byte UTF8, hence code points only up to > 21 bits (and indeed even 6-byte UTF8 can only have 31-bit code points, > no?). > > > Yes, that was a thinko on my part. I realised that yesterday. cheers andrew
> Tatsuo Ishii wrote: > > > > I don't understand whole discussion. > > > > Why do you think that employing the Unicode code point as the chr() > > argument could avoid endianness issues? Are you going to represent > > Unicode code point as UCS-4? Then you have to specify the endianness > > anyway. (see the UCS-4 standard for more details) > > > > The code point is simply a number. The result of chr() will be a text > value one char (not one byte) wide, in the relevant database encoding. > > U+nnnn maps to the same Unicode char and hence the same UTF8 encoding > pattern regardless of endianness. e.g. U+00a9 is the copyright symbol on > all machines. So to get this char in a UTF8 database you could call > "select chr(169)" and get back the byte pattern \xC2A9. If you regard the Unicode code point as simply a number, why not regard the multibyte characters as a number too? I mean, since 0xC2A9 = 49833, "select chr(49833)" should work fine, no? Also I'm wondering what we should do with a different backend/frontend encoding combo. For example, if your database is in UTF-8, and your client encoding is LATIN2, what integer value should be passed to chr()? LATIN2 or Unicode code point? > > Or are you going to represent Unicode point as a character string such > > as 'U+0259'? Then representing any encoding as a string could avoid > > endianness issues anyway, and I don't see Unicode code point is any > > better than others. > > The argument will be a number, as now. > > > Also I'd like to point out all encodings has its own code point > > systems as far as I know. For example, EUC-JP has its corresponding > > code point systems, ASCII, JIS X 0208 and JIS X 0212. So I don't see > > we can't use "code point" as chr()'s argument for othe encodings(of > > course we need optional parameter specifying which character set is > > supposed). > > > > Where can I find the tables that map code points (as opposed to > encodings) to characters for these others? You mean the code point table of a character set? The actual standard is not on the web since it is copyrighted by the Japanese government (you need to buy it as a book or a pdf file). However you could find many code point tables on the web. For example, JIS X 0208 code points can be found on: http://www.infonet.co.jp/ueyama/ip/binary/x0208txt.html (you need to have a Japanese font and set page encoding to Shift JIS) BTW, if you want to pass "code point of character set" rather than the encoding value, you need to give chr() what character set you are referring to. So we need to have two arguments, one for the code point, the other for the character set specification. What do you think? -- Tatsuo Ishii SRA OSS, Inc. Japan
Tatsuo Ishii <ishii@postgresql.org> writes: > If you regard the unicode code point as simply a number, why not > regard the multibyte characters as a number too? Because there's a standard specifying the Unicode code points *as numbers*. The mapping from those numbers to UTF8 strings (and other representations) is well-defined by the standard. > Also I'm wondering you what we should do with different > backend/frontend encoding combo. Nothing. chr() has always worked with reference to the database encoding, and we should keep it that way. BTW, it strikes me that there is another hole that we need to plug in this area, and that's the convert() function. Being able to create a value of type text that is not in the database encoding is simply broken. Perhaps we could make it work on bytea instead (providing a cast from text to bytea but not vice versa), or maybe we should just forbid the whole thing if the database encoding isn't SQL_ASCII. regards, tom lane
Tatsuo Ishii wrote: > > If you regard the unicode code point as simply a number, why not > regard the multibyte characters as a number too? I mean, since 0xC2A9 > = 49833, "select chr(49833)" should work fine no? > > No. The number corresponding to a given byte pattern depends on the endianness of the architecture. That's exactly why we can't sanely use the byte pattern of the encoded characters as numbers. cheers andrew
> Tatsuo Ishii <ishii@postgresql.org> writes: > > If you regard the unicode code point as simply a number, why not > > regard the multibyte characters as a number too? > > Because there's a standard specifying the Unicode code points *as > numbers*. The mapping from those numbers to UTF8 strings (and other > representations) is well-defined by the standard. > > > Also I'm wondering you what we should do with different > > backend/frontend encoding combo. > > Nothing. chr() has always worked with reference to the database > encoding, and we should keep it that way. Where is it documented? > BTW, it strikes me that there is another hole that we need to plug in > this area, and that's the convert() function. Being able to create > a value of type text that is not in the database encoding is simply > broken. Perhaps we could make it work on bytea instead (providing > a cast from text to bytea but not vice versa), or maybe we should just > forbid the whole thing if the database encoding isn't SQL_ASCII. Please don't do that. It will break a useful use case of convert(). A user has a database encoded in UTF-8. He has English, French, Chinese and Japanese data in tables. To sort the tables in the language order, he will do like this: SELECT * FROM japanese_table ORDER BY convert(japanese_text using utf8_to_euc_jp); Without using convert(), he will get a random order of data. This is because Kanji characters are in random order in UTF-8, while Kanji characters are reasonably ordered in EUC_JP. -- Tatsuo Ishii SRA OSS, Inc. Japan
On Tue, 2007-09-11 at 11:27 +0900, Tatsuo Ishii wrote: > > BTW, it strikes me that there is another hole that we need to plug in > > this area, and that's the convert() function. Being able to create > > a value of type text that is not in the database encoding is simply > > broken. Perhaps we could make it work on bytea instead (providing > > a cast from text to bytea but not vice versa), or maybe we should just > > forbid the whole thing if the database encoding isn't SQL_ASCII. > > Please don't do that. It will break an usefull use case of convert(). > > A user has a database encoded in UTF-8. He has English, French, > Chinese and Japanese data in tables. To sort the tables in the > language order, he will do like this: > > SELECT * FROM japanese_table ORDER BY convert(japanese_text using utf8_to_euc_jp); > > Without using convert(), he will get random order of data. This is > because Kanji characters are in random order in UTF-8, while Kanji > characters are reasonably ordered in EUC_JP. Isn't the collation a locale issue, not an encoding issue? Is there a ja_JP.UTF-8 that defines the proper order? Regards,Jeff Davis
> On Tue, 2007-09-11 at 11:27 +0900, Tatsuo Ishii wrote: > > > BTW, it strikes me that there is another hole that we need to plug in > > > this area, and that's the convert() function. Being able to create > > > a value of type text that is not in the database encoding is simply > > > broken. Perhaps we could make it work on bytea instead (providing > > > a cast from text to bytea but not vice versa), or maybe we should just > > > forbid the whole thing if the database encoding isn't SQL_ASCII. > > > > Please don't do that. It will break an usefull use case of convert(). > > > > A user has a database encoded in UTF-8. He has English, French, > > Chinese and Japanese data in tables. To sort the tables in the > > language order, he will do like this: > > > > SELECT * FROM japanese_table ORDER BY convert(japanese_text using utf8_to_euc_jp); > > > > Without using convert(), he will get random order of data. This is > > because Kanji characters are in random order in UTF-8, while Kanji > > characters are reasonably ordered in EUC_JP. > > Isn't the collation a locale issue, not an encoding issue? Is there a > ja_JP.UTF-8 that defines the proper order? I don't think it helps. The point is, he needs different languages' collations, while PostgreSQL allows only one collation (locale) per database cluster. -- Tatsuo Ishii SRA OSS, Inc. Japan
Tatsuo Ishii wrote: >> BTW, it strikes me that there is another hole that we need to plug in >> this area, and that's the convert() function. Being able to create >> a value of type text that is not in the database encoding is simply >> broken. Perhaps we could make it work on bytea instead (providing >> a cast from text to bytea but not vice versa), or maybe we should just >> forbid the whole thing if the database encoding isn't SQL_ASCII. >> > > Please don't do that. It will break an usefull use case of convert(). > > A user has a database encoded in UTF-8. He has English, French, > Chinese and Japanese data in tables. To sort the tables in the > language order, he will do like this: > > SELECT * FROM japanese_table ORDER BY convert(japanese_text using utf8_to_euc_jp); > > Without using convert(), he will get random order of data. This is > because Kanji characters are in random order in UTF-8, while Kanji > characters are reasonably ordered in EUC_JP. > > Tatsuo-san, would not this case be at least as well met by an operator supplying the required ordering? The operator of course would not have the danger of supplying values that are invalid in the database encoding. Admittedly, the user might need several operators for the case you describe. I'm not sure we are going to be able to catch every path by which invalid data can get into the database in one release. I suspect we might need two or three goes at this. (I'm just wondering if the routines that return cstrings are a possible vector). cheers andrew
Jeff Davis wrote: > On Tue, 2007-09-11 at 11:27 +0900, Tatsuo Ishii wrote: > >>> BTW, it strikes me that there is another hole that we need to plug in >>> this area, and that's the convert() function. Being able to create >>> a value of type text that is not in the database encoding is simply >>> broken. Perhaps we could make it work on bytea instead (providing >>> a cast from text to bytea but not vice versa), or maybe we should just >>> forbid the whole thing if the database encoding isn't SQL_ASCII. >>> >> Please don't do that. It will break an usefull use case of convert(). >> >> A user has a database encoded in UTF-8. He has English, French, >> Chinese and Japanese data in tables. To sort the tables in the >> language order, he will do like this: >> >> SELECT * FROM japanese_table ORDER BY convert(japanese_text using utf8_to_euc_jp); >> >> Without using convert(), he will get random order of data. This is >> because Kanji characters are in random order in UTF-8, while Kanji >> characters are reasonably ordered in EUC_JP. >> > > Isn't the collation a locale issue, not an encoding issue? Is there a > ja_JP.UTF-8 that defines the proper order? > > > That won't help you much if you have the whole collection of languages mentioned above. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > I'm not sure we are going to be able to catch every path by which > invalid data can get into the database in one release. I suspect we > might need two or three goes at this. (I'm just wondering if the > routines that return cstrings are a possible vector). We need to have a policy that cstrings are in the database encoding. regards, tom lane
Tatsuo Ishii <ishii@postgresql.org> writes: >> BTW, it strikes me that there is another hole that we need to plug in >> this area, and that's the convert() function. Being able to create >> a value of type text that is not in the database encoding is simply >> broken. Perhaps we could make it work on bytea instead (providing >> a cast from text to bytea but not vice versa), or maybe we should just >> forbid the whole thing if the database encoding isn't SQL_ASCII. > Please don't do that. It will break an usefull use case of convert(). The reason we have a problem here is that we've been choosing convenience over safety in encoding-related issues. I wonder if we must stoop to having a "strict_encoding_checks" GUC variable to satisfy everyone. > A user has a database encoded in UTF-8. He has English, French, > Chinese and Japanese data in tables. To sort the tables in the > language order, he will do like this: > SELECT * FROM japanese_table ORDER BY convert(japanese_text using utf8_to_euc_jp); > Without using convert(), he will get random order of data. I'd say that *with* convert() he will get a random order of data. This is making a boatload of unsupportable assumptions about the locale and encoding of the surrounding database. There are a lot of bad-encoding situations for which strcoll() simply breaks down completely and can't even deliver self-consistent answers. It might work the way you are expecting if the database uses SQL_ASCII encoding and C locale --- and I'd be fine with allowing convert() only when the database encoding is SQL_ASCII. regards, tom lane
> Tatsuo Ishii <ishii@postgresql.org> writes: > >> BTW, it strikes me that there is another hole that we need to plug in > >> this area, and that's the convert() function. Being able to create > >> a value of type text that is not in the database encoding is simply > >> broken. Perhaps we could make it work on bytea instead (providing > >> a cast from text to bytea but not vice versa), or maybe we should just > >> forbid the whole thing if the database encoding isn't SQL_ASCII. > > > Please don't do that. It will break an usefull use case of convert(). > > The reason we have a problem here is that we've been choosing > convenience over safety in encoding-related issues. I wonder if we must > stoop to having a "strict_encoding_checks" GUC variable to satisfy > everyone. Please show me concrete examples of how I could introduce a vulnerability using this kind of convert() usage. > > A user has a database encoded in UTF-8. He has English, French, > > Chinese and Japanese data in tables. To sort the tables in the > > language order, he will do like this: > > > SELECT * FROM japanese_table ORDER BY convert(japanese_text using utf8_to_euc_jp); > > > Without using convert(), he will get random order of data. > > I'd say that *with* convert() he will get a random order of data. This > is making a boatload of unsupportable assumptions about the locale and > encoding of the surrounding database. There are a lot of bad-encoding > situations for which strcoll() simply breaks down completely and can't > even deliver self-consistent answers. > > It might work the way you are expecting if the database uses SQL_ASCII > encoding and C locale --- and I'd be fine with allowing convert() only > when the database encoding is SQL_ASCII. I don't believe that. With the C locale, convert() works fine as I described. -- Tatsuo Ishii SRA OSS, Inc. Japan
On Tue, 2007-09-11 at 11:53 +0900, Tatsuo Ishii wrote: > > Isn't the collation a locale issue, not an encoding issue? Is there a > > ja_JP.UTF-8 that defines the proper order? > > I don't think it helps. The point is, he needs different language's > collation, while PostgreSQL allows only one collation(locale) per > database cluster. My thought was: if we had some function that treated a string as a different locale, it might solve the problem without violating the database encoding. Converting to a different encoding, and producing text result, is the source of the problem. Regards,Jeff Davis
On Tue, 2007-09-11 at 12:29 +0900, Tatsuo Ishii wrote: > Please show me concrete examples how I could introduce a vulnerability > using this kind of convert() usage.

Try the sequence below. Then, try to dump and then reload the database. When you try to reload it, you will get an error:

ERROR: invalid byte sequence for encoding "UTF8": 0xbd

Regards, Jeff Davis

test=> select version();
                                                          version
--------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.3devel on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.3 20070601 (prerelease) (Debian 4.1.2-12)
(1 row)

test=> show lc_collate;
 lc_collate
-------------
 en_US.UTF-8
(1 row)

test=> create table encoding_test(t text);
CREATE TABLE
test=> insert into encoding_test values('初');
INSERT 0 1
test=> insert into encoding_test values(convert('初' using utf8_to_euc_jp));
INSERT 0 1
> On Tue, 2007-09-11 at 12:29 +0900, Tatsuo Ishii wrote: > > Please show me concrete examples how I could introduce a vulnerability > > using this kind of convert() usage. > > Try the sequence below. Then, try to dump and then reload the database. > When you try to reload it, you will get an error: > > ERROR: invalid byte sequence for encoding "UTF8": 0xbd I know this could be a problem (like chr() with an invalid byte pattern). What I really want to know is whether a read query, something like this: SELECT * FROM japanese_table ORDER BY convert(japanese_text using utf8_to_euc_jp); could be a problem (I assume we use the C locale). -- Tatsuo Ishii SRA OSS, Inc. Japan > Regards, > Jeff Davis > > test=> select version(); > > version > -------------------------------------------------------------------------------------------------------------------------- > PostgreSQL 8.3devel on x86_64-unknown-linux-gnu, compiled by GCC gcc > (GCC) 4.1.3 20070601 (prerelease) (Debian 4.1.2-12) > (1 row) > > test=> show lc_collate; > lc_collate > ------------- > en_US.UTF-8 > (1 row) > > test=> create table encoding_test(t text); > CREATE TABLE > test=> insert into encoding_test values('初'); > INSERT 0 1 > test=> insert into encoding_test values(convert('初' using > utf8_to_euc_jp)); > INSERT 0 1 > >
On Tue, Sep 11, 2007 at 11:27:50AM +0900, Tatsuo Ishii wrote: > SELECT * FROM japanese_table ORDER BY convert(japanese_text using utf8_to_euc_jp); > > Without using convert(), he will get random order of data. This is > because Kanji characters are in random order in UTF-8, while Kanji > characters are reasonably ordered in EUC_JP. The usual way to approach this is to make convert return bytea instead of text. Then your problems vanish. Bytea can still be sorted, but it won't be treated as a text string and thus does not need to conform to the requirements of a text string. Languages like Perl distinguish between "encode" which is text->bytea and "decode" which is bytea->text. We've got "convert" for both and that causes problems. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
On Tue, 2007-09-11 at 14:50 +0900, Tatsuo Ishii wrote: > > > On Tue, 2007-09-11 at 12:29 +0900, Tatsuo Ishii wrote: > > > Please show me concrete examples how I could introduce a > vulnerability > > > using this kind of convert() usage. > > > > Try the sequence below. Then, try to dump and then reload the > database. > > When you try to reload it, you will get an error: > > > > ERROR: invalid byte sequence for encoding "UTF8": 0xbd > > I know this could be a problem (like chr() with invalid byte pattern). > What I really want to know is, read query something like this: > > SELECT * FROM japanese_table ORDER BY convert(japanese_text using > utf8_to_euc_jp); I guess I don't quite understand the question. I agree that ORDER BY convert() must be safe in the C locale, because it just passes the strings to strcmp(). Are you saying that we should not remove convert() until we can support multiple locales in one database? If we make convert() operate on bytea and return bytea, as Tom suggested, would that solve your use case? Regards,Jeff Davis
>> Try the sequence below. Then, try to dump and then reload the database.
>> When you try to reload it, you will get an error:
>>
>> ERROR:  invalid byte sequence for encoding "UTF8": 0xbd
>
> I know this could be a problem (like chr() with invalid byte pattern).

And that's enough of a problem already. We don't need more problems.

> What I really want to know is, read query something like this:
>
> SELECT * FROM japanese_table ORDER BY convert(japanese_text using
> utf8_to_euc_jp);
>
> could be a problem (I assume we use C locale).

If convert() produces a sequence of bytes that can't be interpreted as a
string in the server encoding then it's broken. Imho convert() should
return a bytea value. If we had good encoding/charset support we could
do better, but we can't today.

The above example would work fine if convert() returned a bytea. In the
C locale the string would be compared byte for byte, and that's what you
get with bytea values as well.

Strings are not sequences of bytes that can be interpreted in different
ways; that's what bytea values are. Strings are always in a specific
encoding, and in pg that encoding is fixed to a single one for a whole
cluster at initdb time. We should not confuse text with bytea.

/Dennis
> On Tue, 2007-09-11 at 14:50 +0900, Tatsuo Ishii wrote:
> > > On Tue, 2007-09-11 at 12:29 +0900, Tatsuo Ishii wrote:
> > > > Please show me concrete examples how I could introduce a vulnerability
> > > > using this kind of convert() usage.
> > >
> > > Try the sequence below. Then, try to dump and then reload the database.
> > > When you try to reload it, you will get an error:
> > >
> > > ERROR:  invalid byte sequence for encoding "UTF8": 0xbd
> >
> > I know this could be a problem (like chr() with invalid byte pattern).
> > What I really want to know is, read query something like this:
> >
> > SELECT * FROM japanese_table ORDER BY convert(japanese_text using
> > utf8_to_euc_jp);
>
> I guess I don't quite understand the question.
>
> I agree that ORDER BY convert() must be safe in the C locale, because it
> just passes the strings to strcmp().
>
> Are you saying that we should not remove convert() until we can support
> multiple locales in one database?
>
> If we make convert() operate on bytea and return bytea, as Tom
> suggested, would that solve your use case?

The problem is, the above use case is just one of the ones I can think of.
Another use case is something like this:

SELECT sum(octet_length(convert(text_column using utf8_to_euc_jp))) FROM mytable;

to know the total byte length of a text column if it's encoded in EUC_JP.
So I'm not sure we could change convert() to return bytea without
complaints from users...
--
Tatsuo Ishii
SRA OSS, Inc. Japan
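[Illustrative sketch, not part of the original message: the byte-counting
use case survives a bytea-returning conversion, since octet_length() is
defined for bytea as well. convert_to() here stands in for the
bytea-returning form under discussion; the table and column names are
Tatsuo's hypothetical ones.]

    SELECT sum(octet_length(convert_to(text_column, 'EUC_JP')))
    FROM mytable;   -- total byte length of the column when encoded in EUC_JP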
Andrew Dunstan wrote:
>> Instead of the code point, I'd prefer the actual encoding of
>> the character as argument to chr() and return value of ascii().
>
> And frankly, I don't know how to do it sanely anyway. A character
> encoding has a fixed byte pattern, but a given byte pattern doesn't
> have a single universal number value. I really don't think we want to
> have the value of chr(n) depend on the endianness of the machine, do we?
>
> The reason we are prepared to make an exception for Unicode is
> precisely because the code point maps to an encoding pattern
> independently of architecture, ISTM.

Point taken.

I only wanted to make sure that there are good reasons to differ from
Oracle. Oracle's chr() is big-endian on all platforms, BTW.

Yours,
Laurenz Albe
On Mon, 2007-09-10 at 23:20 -0400, Tom Lane wrote:
> The reason we have a problem here is that we've been choosing
> convenience over safety in encoding-related issues. I wonder if we must
> stoop to having a "strict_encoding_checks" GUC variable to satisfy
> everyone.

That would be satisfactory to me. However, I'm sure some will cringe at
a MySQL-like configurable integrity option. Might it be better as an
initdb-time option? I can't think why anyone would want to change it
later.

> It might work the way you are expecting if the database uses SQL_ASCII
> encoding and C locale --- and I'd be fine with allowing convert() only
> when the database encoding is SQL_ASCII.

I prefer this option. It's consistent with the docs, which say: "The
SQL_ASCII setting behaves considerably differently from the other
settings," and also "One way to use multiple encodings safely is to set
the locale to C or POSIX during initdb, thus disabling any real locale
awareness."

Regards,
	Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> On Mon, 2007-09-10 at 23:20 -0400, Tom Lane wrote:
>> It might work the way you are expecting if the database uses SQL_ASCII
>> encoding and C locale --- and I'd be fine with allowing convert() only
>> when the database encoding is SQL_ASCII.

> I prefer this option.

I think really the technically cleanest solution would be to make
convert() return bytea instead of text; then we'd not have to put
restrictions on what encoding or locale it's working inside of.
However, it's not clear to me whether there are valid usages that
that would foreclose. Tatsuo mentioned length() but bytea has that.

What I think we'd need to have a complete solution is

convert(text, name) returns bytea
	-- convert from DB encoding to arbitrary encoding

convert(bytea, name, name) returns bytea
	-- convert between any two encodings

convert(bytea, name) returns text
	-- convert from arbitrary encoding to DB encoding

The second and third would need to do a verify step before
converting, of course.

Comments?

			regards, tom lane
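[Illustrative sketch of how the proposed overloads would read, not part
of the original message. Of these spellings only the three-argument
bytea form was ultimately shipped under the name convert(); the other
two were renamed later in the thread (convert_to/convert_from). The byte
values are arbitrary valid EUC_JP data.]

    SELECT convert('初', 'EUC_JP');                          -- text -> bytea (proposed form)
    SELECT convert(decode('a4cb', 'hex'), 'EUC_JP', 'UTF8'); -- bytea -> bytea
    SELECT convert(decode('a4cb', 'hex'), 'EUC_JP');         -- bytea -> text in the DB
                                                             -- encoding (proposed form)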
On Tue, 2007-09-11 at 14:48 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > On Mon, 2007-09-10 at 23:20 -0400, Tom Lane wrote:
> >> It might work the way you are expecting if the database uses SQL_ASCII
> >> encoding and C locale --- and I'd be fine with allowing convert() only
> >> when the database encoding is SQL_ASCII.
>
> > I prefer this option.
>
> I think really the technically cleanest solution would be to make
> convert() return bytea instead of text; then we'd not have to put
> restrictions on what encoding or locale it's working inside of.
> However, it's not clear to me whether there are valid usages that
> that would foreclose. Tatsuo mentioned length() but bytea has that.

Once it's in bytea, you can make operators to achieve the old
functionality. If I understood correctly, he was making a backwards
compatibility argument, not a functionality argument. I can't think of a
problem without a workaround, but maybe there are some.

> What I think we'd need to have a complete solution is
>
> convert(text, name) returns bytea
> 	-- convert from DB encoding to arbitrary encoding
>
> convert(bytea, name, name) returns bytea
> 	-- convert between any two encodings
>
> convert(bytea, name) returns text
> 	-- convert from arbitrary encoding to DB encoding
>
> The second and third would need to do a verify step before
> converting, of course.

I like it.

Regards,
	Jeff Davis
Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > On Mon, 2007-09-10 at 23:20 -0400, Tom Lane wrote:
> >> It might work the way you are expecting if the database uses SQL_ASCII
> >> encoding and C locale --- and I'd be fine with allowing convert() only
> >> when the database encoding is SQL_ASCII.
>
> > I prefer this option.
>
> I think really the technically cleanest solution would be to make
> convert() return bytea instead of text; then we'd not have to put
> restrictions on what encoding or locale it's working inside of.
> However, it's not clear to me whether there are valid usages that
> that would foreclose. Tatsuo mentioned length() but bytea has that.

But length(bytea) cannot count characters, only bytes.

> What I think we'd need to have a complete solution is
>
> convert(text, name) returns bytea
> 	-- convert from DB encoding to arbitrary encoding
>
> convert(bytea, name, name) returns bytea
> 	-- convert between any two encodings
>
> convert(bytea, name) returns text
> 	-- convert from arbitrary encoding to DB encoding

That seems good. This is the encode/decode that other systems have.

However ISTM we would also need something like

length(bytea, name) returns int
	-- counts the number of characters assuming that the bytea is in
	-- the given encoding

Hmm, I wonder if counting chars is consistent regardless of the
encoding the string is in. To me it sounds like it should, in which
case it works to convert to the DB encoding and count chars there.

--
Alvaro Herrera              Valdivia, Chile   ICBM: S 39º 49' 18.1", W 73º 13' 56.4"
<inflex> really, I see PHP as like a strange amalgamation of C, Perl, Shell
<crab> inflex: you know that "amalgam" means "mixture with mercury", more or less, right?
<crab> i.e., "deadly poison"
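[Hedged illustration, not part of the original message: the workaround
Alvaro alludes to can be written by converting the bytea into the
database encoding and counting characters there. It assumes a UTF8
database and the bytea-to-text conversion proposed above (convert_from()
in the form that was eventually committed); the hex string is simply two
valid EUC_JP characters.]

    SELECT char_length(convert_from(decode('a4cba4db', 'hex'), 'EUC_JP'));
    -- four EUC_JP bytes, counted as 2 characters after conversion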
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> I think really the technically cleanest solution would be to make >> convert() return bytea instead of text; then we'd not have to put >> restrictions on what encoding or locale it's working inside of. >> However, it's not clear to me whether there are valid usages that >> that would foreclose. Tatsuo mentioned length() but bytea has that. > But length(bytea) cannot count characters, only bytes. So what? If you want characters, just count the original text string. Encoding conversion won't change that. > Hmm, I wonder if counting chars is consistent regardless of the > encoding the string is in. To me it sounds like it should, in which > case it works to convert to the DB encoding and count chars there. A conversion that isn't one-for-one is not merely an encoding conversion IMHO. regards, tom lane
Tatsuo Ishii <ishii@postgresql.org> writes: >> If we make convert() operate on bytea and return bytea, as Tom >> suggested, would that solve your use case? > The problem is, the above use case is just one of what I can think of. > Another use case is, something like this: > SELECT sum(octet_length(convert(text_column using utf8_to_euc_jp))) FROM mytable; > to know the total byte length of text column if it's encoded in > EUC_JP. Since octet_length exists and produces identical results for both text and bytea, this example is hardly very convincing... regards, tom lane
> However ISTM we would also need something like
>
> length(bytea, name) returns int
> 	-- counts the number of characters assuming that the bytea is in
> 	-- the given encoding
>
> Hmm, I wonder if counting chars is consistent regardless of the
> encoding the string is in. To me it sounds like it should, in which
> case it works to convert to the DB encoding and count chars there.

Not necessarily. It's possible that the number of chars is different
before and after an encoding conversion. An example is UTF-8 and
EUC_JIS_2004:

0xa4f7 (EUC_JIS_2004) <--> U+304B *and* U+309A (Unicode)

This is defined in the Japanese government's standard.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
Tom Lane wrote: > I think really the technically cleanest solution would be to make > convert() return bytea instead of text; then we'd not have to put > restrictions on what encoding or locale it's working inside of. > However, it's not clear to me whether there are valid usages that > that would foreclose. Tatsuo mentioned length() but bytea has that. > > What I think we'd need to have a complete solution is > > convert(text, name) returns bytea > -- convert from DB encoding to arbitrary encoding > > convert(bytea, name, name) returns bytea > -- convert between any two encodings > > convert(bytea, name) returns text > -- convert from arbitrary encoding to DB encoding > > The second and third would need to do a verify step before > converting, of course. > > > Are you wanting this done for 8.3? If so, by whom? :-) cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Are you wanting this done for 8.3? If so, by whom? :-) [ shrug... ] I'm not the one who's worried about closing all the holes leading to encoding problems. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Are you wanting this done for 8.3? If so, by whom? :-) >> > > [ shrug... ] I'm not the one who's worried about closing all the holes > leading to encoding problems. > > > I can certainly have a go at it. Are we still talking about Oct 1 for a possible beta? cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > I can certainly have a go at it. Are we still talking about Oct 1 for a > possible beta? Yeah, there's still a little time left --- HOT will take at least a few more days. regards, tom lane
Tom Lane wrote: > What I think we'd need to have a complete solution is > > convert(text, name) returns bytea > -- convert from DB encoding to arbitrary encoding > > convert(bytea, name, name) returns bytea > -- convert between any two encodings > > convert(bytea, name) returns text > -- convert from arbitrary encoding to DB encoding > > The second and third would need to do a verify step before > converting, of course. > > > Is there any reason these functions shouldn't be marked immutable? Also, I'm wondering if we should give them disambiguating names, rather than call them all convert. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> What I think we'd need to have a complete solution is >> >> convert(text, name) returns bytea >> -- convert from DB encoding to arbitrary encoding >> >> convert(bytea, name, name) returns bytea >> -- convert between any two encodings >> >> convert(bytea, name) returns text >> -- convert from arbitrary encoding to DB encoding >> >> The second and third would need to do a verify step before >> converting, of course. > Is there any reason these functions shouldn't be marked immutable? No, not as long as DB encoding is frozen... > I'm wondering if we should give them disambiguating names, rather than > call them all convert. No. We have a function overloading system, we should use it. regards, tom lane
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> Tom Lane wrote:
>>> What I think we'd need to have a complete solution is
>>>
>>> convert(text, name) returns bytea
>>> 	-- convert from DB encoding to arbitrary encoding
>>>
>>> convert(bytea, name, name) returns bytea
>>> 	-- convert between any two encodings
>>>
>>> convert(bytea, name) returns text
>>> 	-- convert from arbitrary encoding to DB encoding
>>>
>>> The second and third would need to do a verify step before
>>> converting, of course.
>>
>> I'm wondering if we should give them disambiguating names, rather than
>> call them all convert.
>
> No. We have a function overloading system, we should use it.

In general I agree with you.

What's bothering me here though is that in the two argument forms, if
the first argument is text the second argument is the destination
encoding, but if the first argument is a bytea the second argument is
the source encoding. That strikes me as likely to be quite confusing,
and we might alleviate that with something like:

text convert_from(bytea, name)
bytea convert_to(text, name)

But if I'm the only one bothered by it I won't worry.

cheers

andrew
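[For illustration, not part of the original message: the proposed names
in use. These are the signatures that were ultimately committed later in
the thread. Note how the second argument names the destination encoding
in one case and the source encoding in the other, which is the asymmetry
being discussed.]

    SELECT convert_to('初', 'EUC_JP');
        -- text in the DB encoding -> bytea; 'EUC_JP' is the destination encoding

    SELECT convert_from(convert_to('初', 'EUC_JP'), 'EUC_JP');
        -- bytea -> text in the DB encoding; 'EUC_JP' is the source encoding
        -- (round-trips to '初')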
On Tue, 2007-09-18 at 08:08, Andrew Dunstan wrote:
> Tom Lane wrote:
> > Andrew Dunstan <andrew@dunslane.net> writes:
> >> Tom Lane wrote:
> >>> What I think we'd need to have a complete solution is
> >>>
> >>> convert(text, name) returns bytea
> >>> 	-- convert from DB encoding to arbitrary encoding
> >>>
> >>> convert(bytea, name, name) returns bytea
> >>> 	-- convert between any two encodings
> >>>
> >>> convert(bytea, name) returns text
> >>> 	-- convert from arbitrary encoding to DB encoding
> >>>
> >>> The second and third would need to do a verify step before
> >>> converting, of course.
> >
> >> I'm wondering if we should give them disambiguating names, rather than
> >> call them all convert.
> >
> > No. We have a function overloading system, we should use it.
>
> In general I agree with you.
>
> What's bothering me here though is that in the two argument forms, if
> the first argument is text the second argument is the destination
> encoding, but if the first argument is a bytea the second argument is
> the source encoding. That strikes me as likely to be quite confusing,
> and we might alleviate that with something like:
>
> text convert_from(bytea, name)
> bytea convert_to(text, name)

How is this fundamentally different from the encode/decode functions we
have now?

> But if I'm the only one bothered by it I won't worry.
>
> cheers
>
> andrew
Hannu Krosing wrote:
> On Tue, 2007-09-18 at 08:08, Andrew Dunstan wrote:
>> Tom Lane wrote:
>>> Andrew Dunstan <andrew@dunslane.net> writes:
>>>> Tom Lane wrote:
>>>>> What I think we'd need to have a complete solution is
>>>>>
>>>>> convert(text, name) returns bytea
>>>>> 	-- convert from DB encoding to arbitrary encoding
>>>>>
>>>>> convert(bytea, name, name) returns bytea
>>>>> 	-- convert between any two encodings
>>>>>
>>>>> convert(bytea, name) returns text
>>>>> 	-- convert from arbitrary encoding to DB encoding
>>>>>
>>>>> The second and third would need to do a verify step before
>>>>> converting, of course.
>>>>
>>>> I'm wondering if we should give them disambiguating names, rather than
>>>> call them all convert.
>>>
>>> No. We have a function overloading system, we should use it.
>>
>> In general I agree with you.
>>
>> What's bothering me here though is that in the two argument forms, if
>> the first argument is text the second argument is the destination
>> encoding, but if the first argument is a bytea the second argument is
>> the source encoding. That strikes me as likely to be quite confusing,
>> and we might alleviate that with something like:
>>
>> text convert_from(bytea, name)
>> bytea convert_to(text, name)
>
> How is this fundamentally different from the encode/decode functions we
> have now?

They are in effect reversed. encode() applies the named encoding to a
bytea; convert_from() above unapplies the named encoding (i.e. converts
the bytea to text in the database encoding).

cheers

andrew
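[A brief illustrative contrast, not part of the original message:
encode()/decode() move between bytea and a textual representation of the
raw bytes (hex, base64, escape), with no character-set conversion, while
convert_from()/convert_to() convert between text in the database encoding
and bytea in a named character set. The byte values are arbitrary valid
EUC_JP data.]

    SELECT encode(decode('a4cb', 'hex'), 'base64');
        -- bytea -> 'pMs=': just a different textual representation of the bytes

    SELECT convert_from(decode('a4cb', 'hex'), 'EUC_JP');
        -- bytea assumed to be EUC_JP -> text in the database encoding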
Andrew Dunstan <andrew@dunslane.net> writes: > What's bothering me here though is that in the two argument forms, if > the first argument is text the second argument is the destination > encoding, but if the first argument is a bytea the second argument is > the source encoding. That strikes me as likely to be quite confusing, > and we might alleviate that with something like: > text convert_from(bytea, name) > bytea convert_to(text, name) In a green field this would be fine, but what will you do with the existing 2-argument form of convert()? [ pokes around... ] Actually, on looking into it, it seems the 2-argument form of convert() is there because somebody thought it would be a suitable approximation to SQL99's <form-of-use conversion>, which in reality I fear means something different entirely. I might not be grasping what the spec is trying to say, but I *think* that what they intend is that the argument of CONVERT(x USING encoding) has to be a client-side variable in embedded SQL, and that the intent is that x is in the specified encoding and it's brought into the DB encoding by the function. Or maybe I'm all wrong. Anyway, on the strength of that, these functions are definitely best named to stay away from the spec syntax, so +1 for your proposal above. regards, tom lane
Tom Lane wrote:
> Anyway, on the strength of that, these functions are definitely
> best named to stay away from the spec syntax, so +1 for your
> proposal above.

OK, I have committed this and the other functional changes that should
close the encoding holes. Catalog version bumped.

cheers

andrew
"Andrew Dunstan" <andrew@dunslane.net> writes: > Tom Lane wrote: > >> No. We have a function overloading system, we should use it. >> > In general I agree with you. > > What's bothering me here though is that in the two argument forms, if the first > argument is text the second argument is the destination encoding, but if the > first argument is a bytea the second argument is the source encoding. That > strikes me as likely to be quite confusing, and we might alleviate that with > something like: > > text convert_from(bytea, name) > bytea convert_to(text, name) > > But if I'm the only one bothered by it I won't worry. I tend to agree with you. We should only use overloading when the function is essentially the same just tweaked as appropriate for the datatype, not when the meaning is different. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com