Thread: client side syntax error localisation for psql (v1)
Dear patchers, Please find attached my first attempt at providing a localisation information on the client side for syntax errors in "psql". Basically, one function is added which takes care of the burden. I put a lot of comments in the function to try make it clear what is going on. It validates for me against current cvs head. I've also added new regression tests in "errors.sql". Although I have taken great and painful care to deal with multi-byte encodings, I'm not sure how I can validate that, as I don't have some japanese terminal to play with. Have a nice day, -- Fabien Coelho - coelho@cri.ensmp.fr
> Dear patchers, Sorry, wrong list:-( -- Fabien.
Fabien COELHO <coelho@cri.ensmp.fr> writes: > Please find attached my first attempt at providing a localisation > information on the client side for syntax errors in "psql". "Localisation" tends to mean something quite different around here. I'd suggest renaming the function to something like ReportSyntaxErrorPosition. Would it read better to add "..." at the front and/or back of the query extract, when you have actually truncated text at that end? The "on line N" bit seems just noise to me. regards, tom lane
Fabien COELHO <coelho@cri.ensmp.fr> writes: >> The "on line N" bit seems just noise to me. > It depends. I can see that it would be useful in a very large query. Perhaps include it only when the query has more than N lines, for some N like three to five or so? Another possibility is to keep the cursor as just "^", and bury the line info in the query extract. For instance: Short single-line query, no truncation needed: QUERY: SELECT * FROM foo WHRE bar;QUERY: ^ Truncation on the right just adds ..., no other change needed: QUERY: SELECT * FROM foo WHRE lots-of-conditions...QUERY: ^ When you truncate on the left, count the number of newlines removed, and emit "LINE n" if there were any: QUERY: LINE 4: FROM foo WHRE lots-of-conditions...QUERY: ^ orQUERY: LINE 4: ...FROM foo WHRE lots-of-conditions...QUERY: ^ (So "LINE n" would never say "LINE 1", only larger line numbers.) Here I'm imagining that the leading ... gets inserted when text has been removed after the start of the indicated line. So another possible output format is QUERY: ...FROM foo WHRE lots-of-conditions...QUERY: ^ if you removed some text but not any newlines from the start of the query. I think this wouldn't be terribly complex to implement, and it would make things look fairly nice for both short and long queries. One last point: it seems a bit odd to use QUERY: as the prefix for both the query extract and the cursor line. I don't have a suggestion what to use instead, offhand. regards, tom lane
Dear Tom, > >> The "on line N" bit seems just noise to me. > > It depends. > > I can see that it would be useful in a very large query. Perhaps > include it only when the query has more than N lines, for some N > like three to five or so? Yes, I can do that. > Another possibility is to keep the cursor as just "^", and bury the > line info in the query extract. For instance: Well, I want to preserve space for the query extract, that's where the context information is! If I put more information there, I'll have to reduce the extract length. Moreover the cursor line information content is small, so it seems better to put the line information there. > (So "LINE n" would never say "LINE 1", only larger line numbers.) I agree that LINE 1 looks pretty useless, especially if there is only one line;-) > I think this wouldn't be terribly complex to implement, and it would > make things look fairly nice for both short and long queries. There is also an alignment issue here, as depending on the number of figures, the cursor line would have to be fixed accordingly. > One last point: it seems a bit odd to use QUERY: as the prefix for both > the query extract and the cursor line. I don't have a suggestion what > to use instead, offhand. I agree, but couldn't think of a better choice either. There is also a localisation issue here, as the translation of both lines must match so that the alignment is kept. I thought that if it is the very same word, the translation should be the same. Thanks for your comments and ideas, I'll submit a v3 on (my) tomorrow. Have a nice day, -- Fabien Coelho - coelho@cri.ensmp.fr
Dear Tom, > > Please find attached my first attempt at providing a localisation > > information on the client side for syntax errors in "psql". > > "Localisation" tends to mean something quite different around here. I'd > suggest renaming the function to something like ReportSyntaxErrorPosition. Ok. "Localisation" was the good french word;-) > Would it read better to add "..." at the front and/or back of the query > extract, when you have actually truncated text at that end? I thought about it as well, but did not implement it. I agree that it would look better. I can have a go. > The "on line N" bit seems just noise to me. It depends. If you're in the middle of a sql-script as I often do, giving the line within the statement isn't bad. Say: SELECT o.titre FROM collections AS col, ouvrages AS o, a_ecrit AS e, auteurs AS a WHERE o.id=e.oeuvre AND e.auteur=a.id AND o.collection=col.idAND col.nom LIKE '%Spirou%' AND a.nom<>'Franquin' EXCEPT SELECT o.titre FROM collections AS col, ouvrages AS o, a_ecrit AS e, auteurs AS a WHERE o.id=e.oeuvre AND e.auteur=a.id AND o.collection=col.idAND col.nom LIKE '%Spirou%' AND a.nom='Franquin'; All lines look the same... So I would prefer to keep it. Well, if it is the only issue against the adoption of the patch, I will do without the line number. I'll submit a new patch in a moment. Thanks for your comments, -- Fabien.
Fabien COELHO <coelho@cri.ensmp.fr> writes: >> Another possibility is to keep the cursor as just "^", and bury the >> line info in the query extract. For instance: > Well, I want to preserve space for the query extract, that's where the > context information is! If I put more information there, I'll have to > reduce the extract length. But in the form you are using, you have to reserve that space anyway. Consider an error marker at the end of the line: QUERY: select 'text that is long enough to be a problem':QUERY: ^ on line1 Having to leave space for "on line 1" at the right end is not better than putting "line 1:" at the left end. > Moreover the cursor line information content is small, so it seems better > to put the line information there. It seems confusing to me. The cursor is just a marker. Consider also the next step up from this scheme: instead of assuming a dumb terminal, suppose you can use color or underlining or reverse-video or some such. Then you might want to do just one line of output: QUERY: SELECT foo FROM bar WHRE baz; ---- where my underlining represents using the terminal's highlight ability. With such a scheme, putting LINE n on the same line fits naturally. > There is also an alignment issue here, as depending on the number of > figures, the cursor line would have to be fixed accordingly. Sure, but you have to buy into that anyway if you are going to do "...". It's not going to add more than about one strlen() call to what you need to do. > There is also a localisation issue here, as the translation of both lines > must match so that the alignment is kept. I thought that if it is the very > same word, the translation should be the same. Hmm, that's a good point, although perhaps you could insert a note to translators to add space as needed to make the two strings the same length. regards, tom lane
Fabien COELHO wrote: > There is also a localisation issue here, as the translation of both > lines must match so that the alignment is kept. I thought that if it > is the very same word, the translation should be the same. You can just indent with as many spaces. This is done in other places as well. Btw., a better word than "query" would be "statement".
> Please find attached my first attempt at providing a localisation > information on the client side for syntax errors in "psql". Basically, one > function is added which takes care of the burden. I put a lot of > comments in the function to try make it clear what is going on. > > It validates for me against current cvs head. > > I've also added new regression tests in "errors.sql". > > Although I have taken great and painful care to deal with multi-byte > encodings, I'm not sure how I can validate that, as I don't have some > japanese terminal to play with. As far as I understand your code, it will be broken on many multi byte encodings. 1) a character is not always represented on a terminal propotional to the storage size. For example a kanji character inUTF-8 encoding has a storage size of 3 bytes while it occupies spaces only twice of ASCII characters on a terminal. Samething can be said to LATIN 2,3 etc. in UTF-8 perhaps. 2) It assume all encodings are "ASCII compatible". Apparently some client-side-only encodings do not satisfy this request.Examples include SJIS, Big5. I think 2) would be solved by carefull codings. However 1) is hard to solve. We need kind of a mapping table between "storage size" and "terminal size" for each encoding. Note that this is not new problem, and sophiscated editors such as Emacs (mule) has already concur that. -- Tatsuo Ishii
Dear Tom, > > Well, I want to preserve space for the query extract, that's where the > > context information is! If I put more information there, I'll have to > > reduce the extract length. > > But in the form you are using, you have to reserve that space anyway. > Consider an error marker at the end of the line: > > QUERY: select 'text that is long enough to be a problem': > QUERY: ^ on line 1 > > Having to leave space for "on line 1" at the right end is not better > than putting "line 1:" at the left end. Yes, I partly agree. However the space above can be used nevertheless for the extract, so it is not totally "lost": QUERY: select 'text that is long enough to be a problem': from foo ... QUERY: ^ on line 1 > > Moreover the cursor line information content is small, so it seems better > > to put the line information there. > > It seems confusing to me. The cursor is just a marker. Sure. > Consider also the next step up from this scheme: instead of assuming a > dumb terminal, suppose you can use color or underlining or reverse-video > or some such. Then you might want to do just one line of output: > > QUERY: SELECT foo FROM bar WHRE baz; > ---- > where my underlining represents using the terminal's highlight ability. > With such a scheme, putting LINE n on the same line fits naturally. Sure, but the current status of psql is to have no such thing, and I'm not going to start terminal-dependent features in the software;-) I just want to help my students with syntax errors. As a compromise, I can suggest the following: LINE 4: WHERE id=123 AND name LIKE 'calvin' GROP BY name... ^ So "QUERY:" is dropped out. If there is very few lines, I can just put "LINE:". I'll have to deal with the alignment problem, no big deal, but not very nice either. > > There is also an alignment issue here, as depending on the number of > > figures, the cursor line would have to be fixed accordingly. > > Sure, but you have to buy into that anyway if you are going to do "...". Well, I know that "..." is 3 char long, so I can put a " " in the cursor line in that case. If it is a figure, the length depends on its value. > It's not going to add more than about one strlen() call to what you need > to do. I cannot strlen an integer value;-) I have to convert to a string, or deal directly with the line number. With my new current implementation, the line number is only shown if the query is more than 3 lines. -- Fabien Coelho - coelho@cri.ensmp.fr
Dear Tatsuo, Thanks for your reply, as I noticed from the source code that your name often appears when dealing with multi-byte issues;-) On Fri, 12 Mar 2004, Tatsuo Ishii wrote: > As far as I understand your code, it will be broken on many multi byte > encodings. > > 1) a character is not always represented on a terminal propotional to > the storage size. For example a kanji character in UTF-8 encoding > has a storage size of 3 bytes while it occupies spaces only twice > of ASCII characters on a terminal. Same thing can be said to LATIN > 2,3 etc. in UTF-8 perhaps. I thought I dealt with that in the code by calling PQmblen for every char. Am I wrong ? > 2) It assume all encodings are "ASCII compatible". Apparently some > client-side-only encodings do not satisfy this request. Examples > include SJIS, Big5. What I mean by "ASCII compatible" is that spaces, new lines, carriage returns, tabs and NULL (C string terminaison) are one byte characters. This assumption seemed pretty safe to me. If this is not the case, I cannot understand how any error message could work in psql. If one printf(" "), that would not be a space character? Or is the terminal doing some "on the fly" translation?? What if a file is read with such encoding??? Or is there a special compilation option to generate special strings, but in this case the executable would not be compatible with any other terminal???? Well, I just underline my lack of knowledge here:-( If not, how can I detect these special characters that I need to change ? Maybe I could translate the string to a pg_wchar[] if the function is available to psql? Also as I quick and dirty temporary fix, I can skip statement extraction for those encodings that do not meet my expectations. So I would need to know what encodings are at risk with the current scheme? -- Fabien Coelho - coelho@cri.ensmp.fr
> Fabien COELHO wrote: > > There is also a localisation issue here, as the translation of both > > lines must match so that the alignment is kept. I thought that if it > > is the very same word, the translation should be the same. > > You can just indent with as many spaces. This is done in other places > as well. Yes, but then I need to know the length of the translation... I did that, so I hope sprintf did the translation job. Thanks, -- Fabien Coelho - coelho@cri.ensmp.fr
> Dear Tatsuo, > > Thanks for your reply, as I noticed from the source code that your name > often appears when dealing with multi-byte issues;-) > > On Fri, 12 Mar 2004, Tatsuo Ishii wrote: > > As far as I understand your code, it will be broken on many multi byte > > encodings. > > > > 1) a character is not always represented on a terminal propotional to > > the storage size. For example a kanji character in UTF-8 encoding > > has a storage size of 3 bytes while it occupies spaces only twice > > of ASCII characters on a terminal. Same thing can be said to LATIN > > 2,3 etc. in UTF-8 perhaps. > > I thought I dealt with that in the code by calling PQmblen for every char. > Am I wrong ? PQmblen returns the storage size, which is not necessarily same as the character width reprensented in a terminal. For example for a kanji character in UTF-8 PQmblen returns 3, but it ocuppies 2 x ASCII character space, not x 3. Isn't that a problem for you? > > 2) It assume all encodings are "ASCII compatible". Apparently some > > client-side-only encodings do not satisfy this request. Examples > > include SJIS, Big5. > > What I mean by "ASCII compatible" is that spaces, new lines, carriage > returns, tabs and NULL (C string terminaison) are one byte characters. > This assumption seemed pretty safe to me. > > If this is not the case, I cannot understand how any error message could > work in psql. If one printf(" "), that would not be a space character? > Or is the terminal doing some "on the fly" translation?? What if a > file is read with such encoding??? Or is there a special compilation > option to generate special strings, but in this case the executable > would not be compatible with any other terminal???? > > Well, I just underline my lack of knowledge here:-( > > If not, how can I detect these special characters that I need to change ? > Maybe I could translate the string to a pg_wchar[] if the function is > available to psql? I think you can do it safely using PQmblen. 1) start from the begining of the target string 2) apply PQmblen 3) if it returns 1, you can do the spcecial character detection 4) otherwise it must not be an ASCII character and you can skip as many characters as PQmnlen returns 5) goto 1) if any characters remain > Also as I quick and dirty temporary fix, I can skip statement extraction > for those encodings that do not meet my expectations. So I would need to > know what encodings are at risk with the current scheme? > > -- > Fabien Coelho - coelho@cri.ensmp.fr > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org >
> PQmblen returns the storage size, which is not necessarily same as the > character width reprensented in a terminal. For example for a kanji > character in UTF-8 PQmblen returns 3, but it ocuppies 2 x ASCII > character space, not x 3. Isn't that a problem for you? > > > > 2) It assume all encodings are "ASCII compatible". Apparently some > > > client-side-only encodings do not satisfy this request. Examples > > > include SJIS, Big5. > > > > What I mean by "ASCII compatible" is that spaces, new lines, carriage > > returns, tabs and NULL (C string terminaison) are one byte characters. > > This assumption seemed pretty safe to me. > > > > If this is not the case, I cannot understand how any error message could > > work in psql. If one printf(" "), that would not be a space character? > > Or is the terminal doing some "on the fly" translation?? What if a > > file is read with such encoding??? Or is there a special compilation > > option to generate special strings, but in this case the executable > > would not be compatible with any other terminal???? > > > > Well, I just underline my lack of knowledge here:-( > > > > If not, how can I detect these special characters that I need to change ? > > Maybe I could translate the string to a pg_wchar[] if the function is > > available to psql? > > I think you can do it safely using PQmblen. > > 1) start from the begining of the target string > > 2) apply PQmblen > > 3) if it returns 1, you can do the spcecial character detection > > 4) otherwise it must not be an ASCII character and you can skip as > many characters as PQmnlen returns > > 5) goto 1) if any characters remain ~~of course this should be 2) -- Tatsuo Ishii
Dear Tatsuo, > > > 1) a character is not always represented on a terminal propotional to > > > the storage size. For example a kanji character in UTF-8 encoding > > > has a storage size of 3 bytes while it occupies spaces only twice > > > of ASCII characters on a terminal. Same thing can be said to LATIN > > > 2,3 etc. in UTF-8 perhaps. > > > > I thought I dealt with that in the code by calling PQmblen for every char. > > Am I wrong ? > > PQmblen returns the storage size, which is not necessarily same as the > character width reprensented in a terminal. For example for a kanji > character in UTF-8 PQmblen returns 3, but it ocuppies 2 x ASCII > character space, not x 3. Isn't that a problem for you? If I read you correctly, you mean that 1 character may take 3 bytes of storage in the string, but it is not guaranteed to be 1 character from the terminal perspective... Argh, that's definitely an issue:-( I assumed that one character whatever the encoding would be 1 character on the display. If it is not the case, I think I can put/compute this information in the translation structures that is use by PQmblen, and implement a PQmbtermlen function... Maybe you could point me some source of information about display lengths of characters depending on the encoding? > > What I mean by "ASCII compatible" is that spaces, new lines, carriage > > returns, tabs and NULL (C string terminaison) are one byte characters. > > This assumption seemed pretty safe to me. > > I think you can do it safely using PQmblen. Ok, what you describe is basically what I've done with the qidx computation as suggested by Tom Lane and then later I check that the encoded length is one to find my special characters. Thanks for you reply, -- Fabien Coelho - coelho@cri.ensmp.fr
> > PQmblen returns the storage size, which is not necessarily same as the > > character width reprensented in a terminal. For example for a kanji > > character in UTF-8 PQmblen returns 3, but it ocuppies 2 x ASCII > > character space, not x 3. Isn't that a problem for you? > > If I read you correctly, you mean that 1 character may take 3 bytes > of storage in the string, but it is not guaranteed to be 1 character > from the terminal perspective... Argh, that's definitely an issue:-( > I assumed that one character whatever the encoding would be 1 character > on the display. That's not correct... One thing I have to note is that some Asian characters such as Japanese, Chinese require twice the space on a terminal for each character comparing with plain ASCII characters. This is hard to explain to those who are not familiar with kanji... Could you take a look at included screen shot? As you can see there are four ASCII characters in the first line. On the second line there are *two* kanji characters and they occupy same space as above four ASCII characters. Moreover the strage size for the first line is 4, but the strage size for the second line may vary depending on the encoding. If the encoding is EUC_JP or SJIS, it takes 4 bytes, however it takes 6 bytes if the encoding is UTF-8. Got it? > If it is not the case, I think I can put/compute this information in the > translation structures that is use by PQmblen, and implement a > PQmbtermlen function... > > Maybe you could point me some source of information about display lengths > of characters depending on the encoding? I could write "PQmbtermlen" function for every encoding supported by PostgreSQL except UTF-8. Such kind of info for UTF-8 might be quite complex. I believe there are some mapping tables or functions to get such kind of info somewhere on the Internet, but I don't remember. > > I think you can do it safely using PQmblen. > > Ok, what you describe is basically what I've done with the qidx > computation as suggested by Tom Lane and then later I check that the > encoded length is one to find my special characters. Oh, ok. > Thanks for you reply, You are welcome! -- Tatsuo Ishii
Fabien COELHO <coelho@cri.ensmp.fr> writes: > As a compromise, I can suggest the following: > LINE 4: WHERE id=123 AND name LIKE 'calvin' GROP BY name... > ^ That works for me. I don't mind it saying LINE 1: in the one-line case. >> It's not going to add more than about one strlen() call to what you need >> to do. > I cannot strlen an integer value;-) I have to convert to a string, or > deal directly with the line number. Well, what I'm imagining is that you sprintf("LINE %d") and then strlen that in the process of constructing the first line, and later add that length to the offset you need in the second line. Tatsuo's complaints about characters that span more than one position are a much nastier problem, in any case :-(. Can we cope? regards, tom lane
Dear Tatsuo, > One thing I have to note is that some Asian characters such as Japanese, > Chinese require twice the space on a terminal for each character > comparing with plain ASCII characters. This is hard to explain to those > who are not familiar with kanji... I learnt a little bit of chinese a few years ago, but I never saw the computer version. > Could you take a look at included screen shot? I haven't found it. However I've made a little bit of trying with my emacs 21 mule demonstration, and indeed there is two different character widths... > As you can see there are four ASCII characters in the first line. On the > second line there are *two* kanji characters and they occupy same space > as above four ASCII characters. Moreover the strage size for the first > line is 4, but the strage size for the second line may vary depending on > the encoding. If the encoding is EUC_JP or SJIS, it takes 4 bytes, > however it takes 6 bytes if the encoding is UTF-8. Got it? Yep. > > Maybe you could point me some source of information about display lengths > > of characters depending on the encoding? > > I could write "PQmbtermlen" function for every encoding supported by > PostgreSQL That would be great ! ;-) > except UTF-8. Such kind of info for UTF-8 might be quite > complex. I believe there are some mapping tables or functions to get > such kind of info somewhere on the Internet, but I don't remember. That would be even greater;-) I guess a "return 1" version would be a false quick fix that could be improved later on... -- Fabien COELHO.
> > Could you take a look at included screen shot? > > I haven't found it. However I've made a little bit of trying with my Oops. I have included this time. > > > Maybe you could point me some source of information about display lengths > > > of characters depending on the encoding? > > > > I could write "PQmbtermlen" function for every encoding supported by > > PostgreSQL > > That would be great ! ;-) Ok, I will work on this. > > except UTF-8. Such kind of info for UTF-8 might be quite > > complex. I believe there are some mapping tables or functions to get > > such kind of info somewhere on the Internet, but I don't remember. > > That would be even greater;-) > > I guess a "return 1" version would be a false quick fix that could > be improved later on... Agreeed.
On Sat, 13 Mar 2004, Tatsuo Ishii wrote: > Oops. I have included this time. How ! a japanese vi ! > > > I could write "PQmbtermlen" function for every encoding supported by > > > PostgreSQL > > > > That would be great ! ;-) > > Ok, I will work on this. Thanks. -- Fabien Coelho - coelho@cri.ensmp.fr