Thread: Character Encoding Question

Character Encoding Question

From: Don Parris

Hi all,

I am using Psycopg2 with Python3 & PostgreSQL 9.1 with a database encoded as UTF-8 (Kubuntu 12.10).  I have a question about the encoding and decoding, and why some fetch calls did what I wanted while fetchall() seemed to choke.  I got the following error message when calling fetchall():
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)

The traceback pointed to the fetchall() statement.
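
For reference, the error itself is pure Python-side decoding: byte 0xe2 is the first byte of a multi-byte UTF-8 sequence (for example, the curly apostrophe ’, U+2019, encodes as 0xE2 0x80 0x99), which the 'ascii' codec refuses. A standalone sketch of the failure, with no database involved, using one of the sample payee names from later in the thread:

```python
# The curly apostrophe U+2019 becomes three bytes in UTF-8.
raw = "Tony\u2019s Fancy Restaurant".encode("utf-8")
print(raw[:7])               # b'Tony\xe2\x80\x99'

print(raw.decode("utf-8"))   # decodes fine as UTF-8

try:
    raw.decode("ascii")      # what a SQL_ASCII connection effectively attempts
except UnicodeDecodeError as exc:
    print(exc)               # 'ascii' codec can't decode byte 0xe2 in position 4: ...
```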

First, I was able to do a pattern matching search on a specific field in my payee table and could get back whatever records matched the user input in my script.  Then I ran a simple SELECT on the same table, using fetchall(), but that choked.

I decided to experiment some before asking questions here on the list, and discovered that, if I called fetchone(), I got a single record with the fields presented as rows in my Bash console.  Then I tried fetchmany() and got the same record presented as a tuple.

In the psycopg usage doc (http://www.initd.org/psycopg/docs/usage.html#unicode-handling), I found this reference:
conn.set_client_encoding('LATIN9')
I tried that, but set the encoding to UTF-8 instead of LATIN9, supposing that might be useful.  Given the successful result, it appears that was just what the doctor ordered!

My first question is, is this really all I need to do - just set the client encoding?  Or is there something else I need to do or be aware of?

My second question is why does fetchall() fail without setting the client encoding (while the other fetch calls work)?

Thanks!
Don
--
D.C. Parris, FMP, Linux+, ESL Certificate
Minister, Security/FM Coordinator, Free Software Advocate
GPG Key ID: F5E179BE

Re: Character Encoding Question

From: Daniele Varrazzo

On Thu, Mar 28, 2013 at 3:35 PM, Don Parris <parrisdc@gmail.com> wrote:
> Hi all,
>
> I am using Psycopg2 with Python3 & PostgreSQL 9.1 with a database encoded as
> UTF-8 (Kubuntu 12.10).  I have a question about the encoding and decoding,
> and why some fetch calls did what I wanted while fetchall() seemed to choke.
> I got the following error message when calling fetchall():
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9:
> ordinal not in range(128)
>
> The traceback pointed to the fetchall() statement.
>
> First, I was able to do a pattern matching search on a specific field in my
> payee table and could get back whatever records matched the user input in my
> script.  Then I ran a simple SELECT on the same table, using fetchall(), but
> that choked.
>
> I decided to experiment some before asking questions here on the list, and
> discovered that, if I call fetchone(), I got a single record with the fields
> presented as rows in my Bash console.  Then I tried fetchmany() and got the
> same record presented as a tuple.
>
> In the psycopg usage doc
> (http://www.initd.org/psycopg/docs/usage.html#unicode-handling), I found
> this reference:
>
> conn.set_client_encoding('LATIN9')
>
> I tried that, but set the encoding to UTF-8 instead of the LATIN9, supposing
> that might be useful.  Given the successful result, it appears that was just
> what the doctor ordered!

Yes, that should be the case. What was the original encoding? Can you
check "conn.encoding" (to get what psycopg thinks the encoding is) and
get the result of the query "show client_encoding" (to get what the
database says it is)? It's strange that your connection is not in utf8
encoding by default.


> My first question is, is this really all I need to do - just set the client
> encoding?  Or is there something else I need to do or be aware of?

Yes, but as I said, it is suspicious that it wasn't already in utf8.


> My second question is why does fetchall() fail without setting the client
> encoding (while the other fetch calls work)?

Are you sure the error is in fetchall itself and not somewhere
downstream? The decoding code path should be common for all the
fetch*() methods. Can you produce a synthetic test case with a couple
of rows in a table to trigger the error?

What psycopg version are you using?

Thanks,

-- Daniele


Re: Character Encoding Question

From: Don Parris

Thanks Daniele,

I'm using psycopg 2.4.5 with Python 3.2.3 on Kubuntu 12.10.

This is the connection encoding based on a quick check at the Python console:
>>> print(con.encoding)
SQLASCII

I honestly don't know where the error actually is - I'm not a very advanced programmer by any means.  All I really know is that the program breaks when I use fetchall(), but it works with fetchone() and fetchmany().  It could have something to do with the way my main program module imports the module that calls the function, for all I know.  Here is the traceback:
  File "main_program_file.py", line 63, in <module>
    rpt_payee_list()
  File "imported_sqlcommand_module.py", line 50, in rpt_payee_list
    rows = cur.fetchall()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)

Here is some fake data retrieved by the query (when specifying the UTF-8 client encoding):
(79, 'Payee', "Bob's Restaurant", '123 Some St.', 'Big City', 'STATE', 'USA', '337709', 'Restaurant')
(80, 'Payee', 'Tony's Fancy Restaurant', '34 Long Rd', 'Big City', 'STATE', 'USA', 'NA', 'Restaurant')
(81, 'Payee', 'Piña Colada Resort', 'Resort Avenue', 'Resort Town', 'Lima', 'Perú', 'NA', 'Resort Hotel')
(82, 'Payee', "Great Foods", '45 Eating Way', 'Big City', 'STATE', 'USA', '2028', 'Grocery Store')
(83, 'Payee', 'The Home Store', '25 Shopping Plaza', 'Big City', 'STATE', 'USA', '17', 'Home Improvement Store')

Here is the actual function in all its glory (try not to laugh too hard - I'm still learning):
def rpt_payee_list():
    con = psycopg2.connect(database='financedb_dev', user='donp')
    con.set_client_encoding('UTF-8')
    cur = con.cursor()
    try:
        cur.execute("""SELECT * FROM entity
            ORDER BY entity_name""")
    except psycopg2.Error as e:
        print("It ain't workin:", e)

    rows = cur.fetchall()

    for row in rows:
        print(row)

    con.close()

A little later, I will likely create a connection function that I can call when I want to run a specific query, but for the initial steps, I've included the connection as part of the function that creates/runs the query.






--
D.C. Parris, FMP, Linux+, ESL Certificate
Minister, Security/FM Coordinator, Free Software Advocate
GPG Key ID: F5E179BE

Re: Character Encoding Question

From: Daniele Varrazzo

On Thu, Mar 28, 2013 at 4:34 PM, Don Parris <parrisdc@gmail.com> wrote:
> Thanks Daniele,
>
> I'm using psycopg 2.4.5 with Python 3.2.3 on Kubuntu 12.10.
>
> This is the connection encoding based on a quick check at the Python
> console:
> >>> print(con.encoding)
> SQLASCII

In this case it's natural that you get an error on decode. But is it
really the database encoding? What does this query say:

    cur = con.cursor()
    cur.execute("show client_encoding")
    print(cur.fetchone()[0])


> I honestly don't know where the error actually is - not a very advanced
> programmer by any means.  All I really know is that the program breaks when
> I use fetchall(), but it works with fetchone() and fetchmany().  It could
> have something to do with the way my main program module imports the module
> that calls the function, for all I know.  Here is the traceback:

I'll try to load your dataset and play with it, thank you.

-- Daniele


Re: Character Encoding Question

From: Daniele Varrazzo

Please keep the mailing list in copy.

On Thu, Mar 28, 2013 at 5:17 PM, Don Parris <parrisdc@gmail.com> wrote:
> Here is the result of the query you suggested:
>
>>>> cur = con.cursor()
>>>> cur.execute("show client_encoding")
>>>> print(cur.fetchone()[0])
> SQL_ASCII

This means that everything is going as expected: the connection has
been requested in ascii and obtained so. I'll check whether the
encoding errors really happen inconsistently in the fetch*() methods.
What I suspect is that you issued fetchone() and got an ascii-only
record: if you continued to loop over the entire recordset you'd
eventually get an error. Try (without setting the encoding to utf8):

    cur.execute("select [your stuff, including non-ascii records]")
    while 1:
        record = cur.fetchone()
        if not record:
            break

I'm pretty sure you will get an encoding error too.


> I created the DB in postgresql using the following command:
> CREATE DATABASE mydb
>     WITH TEMPLATE template0 ENCODING 'UTF8';
>
> Although I cannot now recall my reason for doing so, there *is* a reason why
> I'm using this approach.  :-/  I should have said "why" in my comments!

I don't see anything wrong here. The only question is what is setting
the connection to SQL_ASCII. Maybe something in postgresql.conf or a
setting of the database user. It's not a psycopg issue anyway: the
driver is following the instructions.

-- Daniele


Re: Character Encoding Question

From: Federico Di Gregorio

On 28/03/2013 17:41, Daniele Varrazzo wrote:
> On Thu, Mar 28, 2013 at 4:34 PM, Don Parris <parrisdc@gmail.com> wrote:
>> Thanks Daniele,
>>
>> I'm using psycopg 2.4.5 with Python 3.2.3 on Kubuntu 12.10.
>>
>> This is the connection encoding based on a quick check at the Python
>> console:
>> >>> print(con.encoding)
>> SQLASCII
> In this case it's natural that you get an error on decode. But is it
> really the database encoding? What does this query say:

IMHO this seems very much like the "uh-oh! I put latin9 data into my
database and now I am trying to read it as unicode" problem. PostgreSQL
does NOT enforce any kind of encoding. It just converts between the
client encoding and the database one before storing the data. If you
configure either of the two encodings the wrong way, it is quite easy
to store data in the wrong format.
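
To illustrate the failure mode in plain Python (no database involved; "Piña" borrowed from the sample data earlier in the thread): the same bytes read back under the wrong encoding either come out mangled or fail to decode at all.

```python
text = "Piña Colada"
utf8_bytes = text.encode("utf-8")        # ñ is stored as the two bytes 0xC3 0xB1

# UTF-8 bytes read back as latin-1: no error, but classic mojibake.
print(utf8_bytes.decode("latin-1"))      # PiÃ±a Colada

# The reverse misconfiguration: latin-1 bytes are not valid UTF-8.
latin1_bytes = text.encode("latin-1")    # ñ is the single byte 0xF1
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)                           # invalid continuation byte
```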

federico



Re: Character Encoding Question

From: Don Parris

For the list:

Here is the result of the query you suggested:

>>> cur = con.cursor()
>>> cur.execute("show client_encoding")
>>> print(cur.fetchone()[0])
SQL_ASCII


I created the DB in postgresql using the following command:
CREATE DATABASE mydb
    WITH TEMPLATE template0 ENCODING 'UTF8';

Although I cannot now recall my reason for doing so, there *is* a reason why I'm using this approach.  :-/  I should have said "why" in my comments!





--
D.C. Parris, FMP, Linux+, ESL Certificate
Minister, Security/FM Coordinator, Free Software Advocate
GPG Key ID: F5E179BE

Re: Character Encoding Question

From: Don Parris

On Thu, Mar 28, 2013 at 1:30 PM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:
> Please keep the mailing list in copy.

Yes... My apologies - I have gotten accustomed to munging lists....  Will try to remember (or maybe stop using gmail for this list).

 
> On Thu, Mar 28, 2013 at 5:17 PM, Don Parris <parrisdc@gmail.com> wrote:
>> Here is the result of the query you suggested:
> <SNIP>
> Try (without setting the encoding to utf8):
>
>     cur.execute("select [your stuff, including non-ascii records]")
>     while 1:
>         record = cur.fetchone()
>         if not record:
>             break
>
> I'm pretty sure you will get an encoding error too.

You are right about this.  It stopped at record #7 (entity_id = 7) and gave me the traceback.  That is, record #7 was the last record printed.  Does that have anything to do with the position mentioned in the traceback?  Would #8 have been the 'guilty' record?

This record was loaded (as were all of the first 80-100 in this table) via the \copy command from within psql.  The remaining records were inserted manually via PGAdmin, simply by opening the table and adding rows.  I also played with my pattern-matching search query, which had turned up great results, until I searched on record #8 - then it broke.

Given that most of the early records were loaded via \copy, why and how would the character encoding change in the middle of a \copy command?  Or is it possible they were all different when initially saved as a CSV file?  I could understand better if the encoding broke the program at, say, the first record I loaded via \copy but worked on the first record I inserted via PGAdmin - or vice versa.
 

>> I created the DB in postgresql using the following command:
>> CREATE DATABASE mydb
>>     WITH TEMPLATE template0 ENCODING 'UTF8';
>>
>> Although I cannot now recall my reason for doing so, there *is* a reason why
>> I'm using this approach.  :-/  I should have said "why" in my comments!
>
> I don't see anything wrong here. The only question is what is setting
> the connection to SQL_ASCII. Maybe something in postgresql.conf or a
> setting of the database user. It's not a psycopg issue anyway: the
> driver is following the instructions.

I can't say there's anything wrong with psycopg - I just want to better understand what is happening with my data and front-end code.



--
D.C. Parris, FMP, Linux+, ESL Certificate
Minister, Security/FM Coordinator, Free Software Advocate
GPG Key ID: F5E179BE

Re: Character Encoding Question

From: Don Parris

On Thu, Mar 28, 2013 at 1:12 PM, Federico Di Gregorio <fog@dndg.it> wrote:
<SNIP>

Thanks  Federico,
If I created the database using the UTF-8 encoding, then why would some data be encoded differently than the rest?  And how can I control how the data gets inserted?  See my previous post, where I mentioned loading a good chunk of the data via the \copy command in psql, and then later added more via PGAdmin.  Many records seem to work just fine, but quite a few others don't - and I was just naively entering or loading data without knowing any encoding was being changed.

Re: Character Encoding Question

From: Daniele Varrazzo

On Thu, Mar 28, 2013 at 8:49 PM, Don Parris <parrisdc@gmail.com> wrote:
> On Thu, Mar 28, 2013 at 1:30 PM, Daniele Varrazzo

>> >Try (without setting the encoding to utf8):
>>
>>     cur.execute("select [your stuff, including non-ascii records]")
>>     while 1:
>>         record = cur.fetchone()
>>         if not record:
>>             break
>>
>> I'm pretty sure you will get an encoding error too.
>>
> You are right about this.  It stopped at record #7 (entity_id = 7) and gave
> me the traceback.  That is, record #7 was the last record printed.  Does
> that have anything to do with the position mentioned in the traceback?
> Would #8 have been the 'guilty' record?

Yes: try again after setting the encoding to utf8 and you'll see a record
with a non-ascii char: it could be "Piña Colada Resort".


> This record was loaded (as were all of the first 80-100 in this table) via
> \copy command from within psql.  The remaining records have been inserted
> manually via PGAdmin

Your database is probably fine and perfectly encoded. If there's
anything to be understood, it is why psycopg connects by default with a
SQL_ASCII encoding - which is not good for transferring the ñ char or
any other char with ascii value > 127. The answer could be, from the
most general to the most specific, in postgresql.conf (global setting),
in the pg_user view (per-user setting), or in some environment variable
(per-connection setting). Which one of these settings says
client_encoding=sql_ascii?

I bet it's the first.

-- Daniele


Re: Character Encoding Question

From: Joe Abbate

On 28/03/13 16:59, Don Parris wrote:
> If I created the database using the UTF-8 encoding, then why would some
> data be encoded differently than the rest?  And how can I control how
> the data gets inserted?  See my previous post, where I mentioned loading
> a good chunk of the data via the \copy command in psql, and then later
> added more via PGAdmin.  Many records seem to work just fine, but quite
> a few others don't - and I was just naively entering or loading data
> without knowing any encoding was being changed.

If the database is created with UTF-8 encoding, all character data will
be encoded as UTF-8.  The problem was that your client was using
SQL_ASCII encoding so any UTF-8, non-ASCII data (i.e., characters above
decimal 127) received from PG couldn't be decoded.  IIRC the client
encoding is set according to the template0 encoding.  I would do a psql
-l to see the encoding of other databases in your cluster, in particular
template0, template1 and postgres.

Joe


Re: Character Encoding Question

From: Don Parris

On Thu, Mar 28, 2013 at 7:07 PM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:
> On Thu, Mar 28, 2013 at 8:49 PM, Don Parris <parrisdc@gmail.com> wrote:
> <SNIP>
>
> Your database is probably fine and perfectly encoded. If there's
> anything to be understood, it is why psycopg connects by default with a
> SQL_ASCII encoding - which is not good for transferring the ñ char or
> any other char with ascii value > 127. The answer could be, from the
> most general to the most specific, in postgresql.conf (global setting),
> in the pg_user view (per-user setting), or in some environment variable
> (per-connection setting). Which one of these settings says
> client_encoding=sql_ascii?
>
> I bet it's the first.

Aha!  As it turns out, I started looking into the character set support in the postgresql documentation, and discovered the psql -l command.  It showed this test database is actually *not* encoded in UTF-8 at all, but rather in ASCII.  I am not sure how I managed to do that, but I did.  I was sure I had used the same DB creation script and just changed the DB name, but clearly, I missed something.  I am not sure if it is necessary to drop and re-create the database to correct this, but that is what I have done.

When I tried using \encoding or SET client_encoding, I got no errors, but I still saw this test DB set as ASCII when running the psql -l command.  Anyway, I'll have to pursue this further later.  Many thanks for the help!


--
D.C. Parris, FMP, Linux+, ESL Certificate
Minister, Security/FM Coordinator, Free Software Advocate
GPG Key ID: F5E179BE

Re: Character Encoding Question

From: Don Parris

On Thu, Mar 28, 2013 at 9:21 PM, Joe Abbate <jma@freedomcircle.com> wrote:
> On 28/03/13 16:59, Don Parris wrote:
>> If I created the database using the UTF-8 encoding, then why would some
>> data be encoded differently than the rest?  And how can I control how
>> the data gets inserted?  See my previous post, where I mentioned loading
>> a good chunk of the data via the \copy command in psql, and then later
>> added more via PGAdmin.  Many records seem to work just fine, but quite
>> a few others don't - and I was just naively entering or loading data
>> without knowing any encoding was being changed.
>
> If the database is created with UTF-8 encoding, all character data will
> be encoded as UTF-8.  The problem was that your client was using
> SQL_ASCII encoding so any UTF-8, non-ASCII data (i.e., characters above
> decimal 127) received from PG couldn't be decoded.  IIRC the client
> encoding is set according to the template0 encoding.  I would do a psql
> -l to see the encoding of other databases in your cluster, in particular
> template0, template1 and postgres.

From the postgresql.conf file:
#client_encoding = sql_ascii        # actually, defaults to database encoding

The way I understand it, the client should default to whatever encoding the DB uses.  Sounds like they tried to make it difficult to do what I apparently did.


--
D.C. Parris, FMP, Linux+, ESL Certificate
Minister, Security/FM Coordinator, Free Software Advocate
GPG Key ID: F5E179BE

Re: Character Encoding Question

From: Joe Abbate

On 28/03/13 22:07, Don Parris wrote:
> From the postgresql.conf file:
> #client_encoding = sql_ascii        # actually, defaults to database
> encoding
>
> The way I understand it, the client should default to whatever encoding
> the DB uses.  Sounds like they tried to make it difficult to do what I
> apparently did.

Yes, you're right.  However, when you run initdb for the cluster, the
template0 database gets created with whatever encoding is specified to
initdb (IIRC nowadays the default is UTF-8, but in some previous version
the default used to be SQL_ASCII).  After that, any createdb that
doesn't specify an encoding defaults to the template0 encoding.  So if
your template0 has SQL_ASCII encoding, it's very easy to create a
database with that encoding.

Joe


Re: Character Encoding Question

From: Don Parris

On Thu, Mar 28, 2013 at 10:42 PM, Joe Abbate <jma@freedomcircle.com> wrote:
<SNIP>

 
Thanks Joe.  That's good to know.
 

Re: Character Encoding Question

From: Daniele Varrazzo

On Fri, Mar 29, 2013 at 2:01 AM, Don Parris <parrisdc@gmail.com> wrote:

> Aha!  As it turns out, I started looking into the character set support in
> the postgresql documentation, and discovered the psql -l command.  It showed
> this test database is actually *not* encoded in UTF-8 at all, but rather in
> ASCII.  I am not sure how I managed to do that, but I did.  I was sure I had
> used the same DB creation script and just changed the DB name, but clearly,
> I missed something.  I am not sure if it is necessary to drop and re-create
> the database to correct this, but that is what I have done.
>
> When I tried using \encoding or SET client_encoding, I got no errors, but I
> still saw this test DB set as ASCII when running the psql -l command.
> Anyway, I'll have to pursue this further later.  Many thanks for the help!

In this case you should convert your database to utf8 (because it
contains utf8 data) asap. SQL_ASCII doesn't actually mean ASCII: it
means "store whatever octet you throw at it, as-is" - it's more akin
to binary data (but without the possibility of storing 0x00). From
your examples, and with some luck, your database may contain utf8-only
data, but if you connect with different clients or encodings and feed
it some latin1 etc., the database will be happy to accept everything,
no questions asked; it will just be a nightmare to read the data back
or to make it uniform later.
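
A plain-Python sketch of that nightmare: once rows stored under different client encodings sit side by side in a SQL_ASCII database, no single codec decodes them all, and you are reduced to per-row guessing. The helper below is purely illustrative - a common recovery heuristic, not a psycopg feature:

```python
def guess_decode(raw: bytes) -> str:
    """Try strict UTF-8 first; fall back to latin-1, which accepts any byte."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

# Two rows whose bytes were stored under different client encodings:
print(guess_decode("Perú".encode("utf-8")))    # Perú
print(guess_decode("Perú".encode("latin-1")))  # Perú
```

Note that the fallback can still mis-read latin-1 text that happens to form valid UTF-8 sequences, which is why fixing the encoding up front beats repairing the data later.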

If you aren't familiar with encodings and their related problems, the
Spolsky article is a nice introduction:
<http://www.joelonsoftware.com/articles/Unicode.html>


-- Daniele


Re: Character Encoding Question

From: Don Parris

On Fri, Mar 29, 2013 at 5:35 AM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:
<SNIP>



Thanks Daniele,

I think I sent a follow-up post to this one saying that I have now converted this db to UTF-8.  I appreciate your help in tracking down the problem, as well as the link to the article.  Good reading for sure.  If I understand the article correctly, I can handle pretty much any language - Korean, Bulgarian, Arabic, etc. - by using the UTF-8 encoding.  Is that correct?
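
A quick round-trip check in pure Python supports this - UTF-8 can encode any Unicode text losslessly (the sample strings below are arbitrary):

```python
samples = ["한국어", "български", "العربية", "Perú", "Straße"]
for s in samples:
    raw = s.encode("utf-8")
    assert raw.decode("utf-8") == s   # lossless round trip for any Unicode text
    print(s, "->", len(raw), "bytes")
```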

Incidentally, my code actually broke on records that were only in English.  Or at least that is how it appears.  The particular table I was searching on contains no non-English letters.  It probably will contain non-English characters in the future, but does not now.

I am very interested in being able to support multiple languages, as my wife and I speak Castillano (Peruvian flavored) and I speak a little German and a few words in other languages.  That's a topic for another day and probably for another list, however.  :-)

Again, many thanks to all of you for the help!



--
D.C. Parris, FMP, Linux+, ESL Certificate
Minister, Security/FM Coordinator, Free Software Advocate
GPG Key ID: F5E179BE