Re: bytea encode performance issues - Mailing list pgsql-general

From Sim Zacks
Subject Re: bytea encode performance issues
Date
Msg-id 489B094E.2090200@compulab.co.il
In response to Re: bytea encode performance issues  ("Merlin Moncure" <mmoncure@gmail.com>)
List pgsql-general
Merlin,

You are suggesting that I fight the flexible nature of email by forcing
it into a UTF-8 shell, and that doesn't always work.

I would suggest you read the PostgreSQL definition of SQL_ASCII:
> The SQL_ASCII setting behaves considerably differently from the other
> settings. When the server character set is SQL_ASCII, the server
> interprets byte values 0-127 according to the ASCII standard, while byte
> values 128-255 are taken as uninterpreted characters. No encoding
> conversion will be done when the setting is SQL_ASCII. Thus, this
> setting is not so much a declaration that a specific encoding is in use,
> as a declaration of ignorance about the encoding. In most cases, if you
> are working with any non-ASCII data, it is unwise to use the SQL_ASCII
> setting, because PostgreSQL will be unable to help you by converting or
> validating non-ASCII characters.

It says that in most cases it is unwise to use it if you are working
with non-ASCII data. That is because most situations do not involve
multiple encodings. However, email is a special case where the user does
not have control of what is being sent. Therefore it is possible (and it
happens to us) that we get emails that are not convertible to UTF-8.
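
A minimal Python sketch of the client-side attempt, assuming only the two
encodings mentioned in this thread (UTF-8 and windows-1255, which Python
calls cp1255). The function name decode_email_bytes and the candidate
list are hypothetical. Because cp1255 is a single-byte encoding, it
accepts many byte sequences that are not really Hebrew text, so a
successful decode is a guess, not proof of the original encoding.

```python
def decode_email_bytes(raw, candidates=("utf-8", "cp1255")):
    """Try each candidate encoding in order.

    Returns (text, encoding) for the first candidate that decodes
    without error, or (None, None) if none of them do. The candidate
    list is an assumption for this example, not a complete list of
    what real email can contain.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Nothing worked: the caller has to keep the raw bytes as-is.
    return None, None


if __name__ == "__main__":
    hebrew = "\u05e9\u05dc\u05d5\u05dd"  # the word "shalom" in Hebrew letters
    print(decode_email_bytes(hebrew.encode("utf-8")))
    print(decode_email_bytes(hebrew.encode("cp1255")))
    print(decode_email_bytes(b"\xff\x81"))
```

The last call is the case described above: the bytes are valid in neither
candidate encoding, so the only lossless option left is to store them
uninterpreted.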

The only way I could convert from MySQL, which does not check encoding,
to a PostgreSQL UTF-8 database was to first use an SQL_ASCII database as
a bridge. Because it does not check the encoding, I could load the data
into a bytea there, then take a backup of that database and restore it
into a UTF-8 database.

Sim


Merlin Moncure wrote:
> On Thu, Aug 7, 2008 at 9:38 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Thu, Aug 7, 2008 at 1:16 AM, Sim Zacks <sim@compulab.co.il> wrote:
>>>> I don't quite follow that...the whole point of utf8 encoded database
>>>> is so that you can use text functions and operators without the bytea
>>>> treatment.  As long as your client encoding is set up properly (so
>>>> that data coming in and out is computed to utf8), then you should be
>>>> ok.  Dropping to ascii is usually not the solution.  Your data
>>>> inputting application should set the client encoding properly and
>>>> coerce data into the unicode text type...it's really the only
>>>> solution.
>>>>
>>> Email does not always follow a specific character set. I have tried
>>> converting the data that comes in to utf-8 and it does not always work.
>>> We receive Hebrew emails which come in mostly 2 flavors, UTF-8 and
>>> windows-1255. Unfortunately, they are not compatible with one another.
>>> SQL-ASCII and ASCII are different as someone on the list pointed out to
>>> me. According to the documentation, SQL-ASCII makes no assumption about
>>> encoding, so you can throw in any encoding you want.
>> no, you can't! SQL-ASCII means that the database treats everything
>> like ascii.  This means that any operation that deals with text could
>> (and in the case of Hebrew, almost certainly will) be broken.  Simple
>> things like getting the length of a string will be wrong.  If you are
>> accepting unicode input, you absolutely must be using a unicode
>> encoded backend.
>
> er, I see the problem (single piece of text with multiple encodings
> inside) :-).  ok, it's more complicated than I thought.  still, you
> need to convert the email to utf8.  There simply must be a way,
> otherwise your emails are not well defined.  This is a client side
> problem...if you push it to the server in ascii, you can't use any
> server side text operations reliably.
>
> merlin

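The string-length point in the quoted message can be illustrated outside
the database. This is a small Python sketch, not PostgreSQL behaviour
itself: it only compares a character count with the byte counts a server
would see if it treated the data as uninterpreted bytes. The helper name
lengths is hypothetical.

```python
def lengths(text):
    """Return (characters, utf8_bytes, cp1255_bytes) for a string."""
    return (
        len(text),                  # number of Unicode characters
        len(text.encode("utf-8")),  # bytes when stored as UTF-8
        len(text.encode("cp1255")), # bytes when stored as windows-1255
    )


if __name__ == "__main__":
    hebrew = "\u05e9\u05dc\u05d5\u05dd"  # four Hebrew letters
    print(lengths(hebrew))  # (4, 8, 4)
```

The same four letters take 8 bytes in UTF-8 and 4 bytes in windows-1255,
so a byte-based length only matches the character count for the
single-byte encoding.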