Thread: Facing issue in using special characters

Facing issue in using special characters

From
M Tarkeshwar Rao
Date:

Hi all,

We are facing an issue with special characters. We are trying to insert records into a remote Postgres server, and our application is not able to do so because of errors.

It seems the issue is caused by special characters used in one of the fields of a row.

Regards

Tarkeshwar

Re: Facing issue in using special characters

From
"David G. Johnston"
Date:
On Thursday, March 14, 2019, M Tarkeshwar Rao <m.tarkeshwar.rao@ericsson.com> wrote:

> We are facing an issue with special characters. We are trying to insert records into a remote Postgres server, and our application is not able to do so because of errors.
>
> It seems the issue is caused by special characters used in one of the fields of a row.


Emailing -general ONLY is both sufficient and polite.  Providing more detail, and ideally an example, is necessary.

David J.

Re: Facing issue in using special characters

From
Gunther
Date:

This is not an issue for "hackers" or "performance"; in fact, even for "general" it isn't really an issue.

"Special characters" is actually nonsense.

When people complain about "special characters" they haven't thought things through.

If you are unwilling to think things through and go step by step to make sure you know what you are doing, then you will not get it and really nobody can help you.

In my professional experience, people who complain about "special characters" need to be cut loose or be given a chance (if they are established employees who carry some weight). If a contractor complains about "special characters" they need to be fired.

Understand charsets -- character set, code point, and encoding. Then understand how encoding and string literals and "escape sequences" in string literals might work.

Know that UNICODE today is the one standard, and there is no more need to switch code tables. There is nothing special about a Hebrew alef or a Greek lowercase alpha or a Latin A, nor about a hyphen, an en-dash, or an em-dash. All these characters are in UNICODE. Yes, there are some Japanese who say they don't like that their versions of Chinese characters are lumped together with the simplified, reformed Chinese forms. But that's a font issue, not a character code issue.

7-bit ASCII is the first 128 code points of UNICODE, and those code points are encoded as the same single bytes in UTF-8.

ISO Latin 1 (ISO-8859-1) has the same code points as the first 256 of UNICODE (the Windows-1252 variant differs in the range 0x80-0x9F), but its bytes above 0x7F are not compatible with the UTF-8 encoding because of the way UTF-8 uses the 8th bit.
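
To make the byte-level difference concrete, here is a small sketch in Python (not part of the original post; the sample strings are arbitrary). It only shows how the same characters come out as bytes under ASCII, Latin 1 and UTF-8.

```python
# Compare how the same text is stored as bytes under different encodings.
ascii_text = "A-z"
accented = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, present in ISO Latin 1

# Pure ASCII text has identical bytes in ASCII, Latin 1 and UTF-8.
ascii_same = (
    ascii_text.encode("ascii")
    == ascii_text.encode("latin-1")
    == ascii_text.encode("utf-8")
)

# Latin 1 stores U+00E9 as the single byte 0xE9 (byte value equals code point).
latin1_bytes = accented.encode("latin-1")

# UTF-8 needs two bytes for the same code point, both with the 8th bit set.
utf8_bytes = accented.encode("utf-8")

# Bytes written as Latin 1 are not valid UTF-8, so a strict decode fails.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

print(ascii_same, latin1_bytes.hex(), utf8_bytes.hex(), latin1_is_valid_utf8)
```

If a client sends Latin 1 bytes while claiming UTF-8, this kind of decode failure is roughly what the server ends up complaining about.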

But none of this is likely your problem.

Your problem is about string literals in SQL, for example, and about the configuration of your database (I always use initdb with --locale C and --encoding UTF-8). Use UTF-8 in the database. Then all your issues are about string literals in SQL and in Java and JSON and XML or whatever you are using.

You have to do the right thing. If you produce any representation, whether that is XML or JSON or SQL or URL query parameters, or a CSV file, or anything at all, you need to escape your string values properly.
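
As a rough illustration of "escape properly" (a sketch in Python using only the standard library, not anything from the original poster's application): the usual way is to let a serializer or the database driver do the escaping instead of building strings by hand. The sample value is made up, and sqlite3 is used purely as a stand-in for a PostgreSQL driver, on the assumption that the driver in use supports bound parameters in a similar way.

```python
import csv
import io
import json
import sqlite3
from urllib.parse import parse_qs, urlencode

# A hypothetical value containing quotes, an ampersand, a comma, a percent
# sign and a non-ASCII letter.
value = 'O\'Reilly & "S\u00f6hne", 50% off'

# JSON: the serializer escapes double quotes and, by default, non-ASCII text.
as_json = json.dumps({"name": value})

# URL query parameters: reserved and non-ASCII characters are percent-encoded.
as_query = urlencode({"name": value})

# CSV: the writer quotes the field and doubles the embedded double quotes.
buf = io.StringIO()
csv.writer(buf).writerow([value])
as_csv = buf.getvalue()

# SQL: pass the value as a bound parameter instead of pasting it into the
# statement text. sqlite3 is only a stand-in for a PostgreSQL driver here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT)")
conn.execute("INSERT INTO t (name) VALUES (?)", (value,))
stored = conn.execute("SELECT name FROM t").fetchone()[0]
conn.close()

# Every representation should decode back to exactly the original value.
round_trips = (
    json.loads(as_json)["name"] == value
    and parse_qs(as_query)["name"] == [value]
    and next(csv.reader(io.StringIO(as_csv))) == [value]
    and stored == value
)
print(round_trips)
```

The point of the round-trip check is that no character needed special handling by the caller; each format's own tooling took care of it.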

This question with no detail didn't deserve such a thorough answer, but it's my soap box. I do not accept people complaining about "special characters". My own people get that same sermon from me when they make that mistake.

-Gunther

On 3/15/2019 1:19, M Tarkeshwar Rao wrote:

Hi all,

We are facing an issue with special characters. We are trying to insert records into a remote Postgres server, and our application is not able to do so because of errors.

It seems the issue is caused by special characters used in one of the fields of a row.

Regards

Tarkeshwar

Re: Facing issue in using special characters

From
Chapman Flack
Date:
On 3/15/19 11:59 AM, Gunther wrote:
> This is not an issue for "hackers" or "performance"; in fact, even for
> "general" it isn't really an issue.

As long as it's already been posted, may as well make it something
helpful to find in the archive.

> Understand charsets -- character set, code point, and encoding. Then
> understand how encoding and string literals and "escape sequences" in
> string literals might work.

Good advice for sure.

> Know that UNICODE today is the one standard, and there is no more need

I wasn't sure from the question whether the original poster was in
a position to choose the encoding of the database. Lots of things are
easier if it can be set to UTF-8 these days, but perhaps it's a legacy
situation.

Maybe a good start would be to go do

  SHOW server_encoding;
  SHOW client_encoding;

and then hit the internet and look up what that encoding (or those
encodings, if different) can and can't represent, and go from there.
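
As a quick way to see what a given encoding can't represent, something like the following Python sketch may help (not from the original thread). The helper name is made up, and it assumes Python's codec names, e.g. "latin-1" for what PostgreSQL calls LATIN1 and "utf-8" for UTF8; other PostgreSQL encodings may need a different codec name or have no Python equivalent.

```python
def unrepresentable(text, encoding):
    """Return the distinct characters of text that encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # Keep first-seen order and skip characters already reported.
            if ch not in missing:
                missing.append(ch)
    return missing


sample = "Price: 5 \u20ac for the caf\u00e9"  # contains a euro sign and an e-acute

# Latin 1 has the e-acute but no euro sign; UTF-8 can represent both.
print(unrepresentable(sample, "latin-1"))
print(unrepresentable(sample, "utf-8"))
```

Running a suspect input row through a check like this, with the codec matching the reported server_encoding, narrows the problem down to specific characters.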

It's worth knowing that, when the server encoding isn't UTF-8,
PostgreSQL will have the obvious limitations entailed by that,
but also some non-obvious ones that may be surprising, e.g. [1].

-Chap


[1]
https://www.postgresql.org/message-id/CA%2BTgmobUp8Q-wcjaKvV%3DsbDcziJoUUvBCB8m%2B_xhgOV4DjiA1A%40mail.gmail.com


Re: Facing issue in using special characters

From
"Warner, Gary, Jr"
Date:
Many of us have faced character encoding issues because we are not in control of our input sources and made the common assumption that UTF-8 covers everything.

In my lab, as an example, some of our social media posts have included ZawGyi Burmese character sets rather than Unicode Burmese. (Because Myanmar developed technology in an environment closed to the world, they made up their own non-standard character set, which is still very common in mobile phones.) We had fully tested the app with Unicode Burmese, but honestly didn’t know ZawGyi was even a thing that we would see in our dataset. We’ve also had problems with non-Unicode word separators in Arabic.

What we’ve found to be helpful is to view the troublesome text in a hex editor and determine what non-standard characters may be causing the problem.

It may be that some data conversion is necessary before insertion. But the first step is knowing WHICH characters are causing the issue.
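
If a hex editor isn't handy, a few lines of Python can do a similar job. This is only a sketch (the function name and sample string are made up): it lists each non-ASCII character with its position, code point, Unicode name and UTF-8 bytes.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (index, char, code point, Unicode name, UTF-8 hex) per non-ASCII char."""
    report = []
    for index, ch in enumerate(text):
        if ord(ch) > 127:
            report.append((
                index,
                ch,
                f"U+{ord(ch):04X}",
                unicodedata.name(ch, "<unnamed>"),  # fallback for unnamed code points
                ch.encode("utf-8").hex(),
            ))
    return report


# Sample input with an i-diaeresis and a no-break space hiding in it.
for row in describe_non_ascii("na\u00efve\u00a0text"):
    print(row)
```

Invisible characters such as a no-break space show up clearly this way, which is often the hard part of finding the offending input.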

Re: Facing issue in using special characters

From
"Peter J. Holzer"
Date:
On 2019-03-17 15:01:40 +0000, Warner, Gary, Jr wrote:
> Many of us have faced character encoding issues because we are not in control
> of our input sources and made the common assumption that UTF-8 covers
> everything.

UTF-8 covers "everything" in the sense that there is a round-trip from
each character in every commonly-used charset/encoding to Unicode and
back.

The actual code may of course be different. For example, the € sign is
0xA4 in iso-8859-15, but U+20AC in Unicode. So you need an
encoding/decoding step.
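
A minimal sketch of that step in Python (an illustration added here, with a made-up sample byte string): decode the legacy bytes with the encoding they were actually written in, then encode the resulting text as UTF-8.

```python
# "5 " followed by the euro sign as stored in ISO-8859-15 (byte 0xA4).
legacy_bytes = b"5 \xa4"

# Decoding turns bytes into code points: 0xA4 becomes U+20AC.
text = legacy_bytes.decode("iso-8859-15")

# Encoding as UTF-8 turns U+20AC into the three bytes E2 82 AC.
utf8_bytes = text.encode("utf-8")

# Guessing the wrong source encoding silently yields a different character:
# in ISO-8859-1 the byte 0xA4 is the generic currency sign U+00A4.
wrong_text = legacy_bytes.decode("iso-8859-1")

print(f"U+{ord(text[-1]):04X}", utf8_bytes.hex(), f"U+{ord(wrong_text[-1]):04X}")
```

Note that the wrong guess does not raise an error, which is why knowing the real source encoding matters more than the conversion itself.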

And "commonly-used" means just that. Unicode covers a lot of character
sets, but it can't cover every character set ever invented (I invented
my own character sets when I was sixteen. Nobody except me ever used
them and they have long succumbed to bit rot).

> In my lab, as an example, some of our social media posts have included ZawGyi
> Burmese character sets rather than Unicode Burmese.  (Because Myanmar developed
> technology in an environment closed to the world, they made up their own
> non-standard character set, which is still very common in mobile phones.)

I'd be surprised if there was a character set which is "very common in
mobile phones", even in a relatively poor country like Myanmar. Does
ZawGyi actually include characters which aren't in Unicode, or are they
just encoded differently?

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
