Thread: problem with query

problem with query

From
Roberto Scattini
Date:
hi, today we discovered that this query doesn't return the expected values:

SELECT DISTINCT
 p.id, p.apellido AS "Apellido", p.nombre AS "Nombre", pf.nombre AS "Funcion", to_char(da.f_ingreso_pg, 'dd/mm/yyyy') AS "Fecha Ingreso PG", e.estado AS "Estado", to_char(pe.fecha, 'dd/mm/yyyy') AS "Fecha Estado"
 FROM personal.personas AS p
  LEFT JOIN personal.personas_da_pg AS da ON p.id=da.id_persona
  LEFT JOIN personal.personas_estado AS pe ON da.estado_id=pe.id
  LEFT JOIN personal.estados AS e ON pe.id_estado=e.id
  LEFT JOIN procu_departamento AS d ON d.id=da.id_departamento
  LEFT JOIN procu_departamento AS dto ON left(d.c_organigrama, 4)||'000'=dto.c_organigrama
  LEFT JOIN procu_departamento AS dir ON left(d.c_organigrama, 3)||'0000'=dir.c_organigrama
  LEFT JOIN procu_departamento AS dg ON left(d.c_organigrama, 2)||'00000'=dg.c_organigrama
  LEFT JOIN personal.funciones AS pf ON pf.id=da.funcion_id
  LEFT JOIN personal.profesiones AS pp ON pp.id=p.id_profesion
 WHERE p.apellido ilike '%nuñez%'
 ORDER BY "Apellido"

the exact same query with  ilike '%NUÑEZ%' works OK...

we are using postgresql 9.1 from ubuntu packages and the database encoding is (sadly) SQL_ASCII

can anybody point me in the right direction to solve this issue?

--
Roberto Scattini

Re: problem with query

From
Chris Curvey
Date:


On Thu, Sep 12, 2013 at 4:34 PM, Roberto Scattini <roberto.scattini@gmail.com> wrote:
hi, today we discovered that this query doesn't return the expected values:

SELECT DISTINCT
 p.id, p.apellido AS "Apellido", p.nombre AS "Nombre", pf.nombre AS "Funcion", to_char(da.f_ingreso_pg, 'dd/mm/yyyy') AS "Fecha Ingreso PG", e.estado AS "Estado", to_char(pe.fecha, 'dd/mm/yyyy') AS "Fecha Estado"
 FROM personal.personas AS p
  LEFT JOIN personal.personas_da_pg AS da ON p.id=da.id_persona
  LEFT JOIN personal.personas_estado AS pe ON da.estado_id=pe.id
  LEFT JOIN personal.estados AS e ON pe.id_estado=e.id
  LEFT JOIN procu_departamento AS d ON d.id=da.id_departamento
  LEFT JOIN procu_departamento AS dto ON left(d.c_organigrama, 4)||'000'=dto.c_organigrama
  LEFT JOIN procu_departamento AS dir ON left(d.c_organigrama, 3)||'0000'=dir.c_organigrama
  LEFT JOIN procu_departamento AS dg ON left(d.c_organigrama, 2)||'00000'=dg.c_organigrama
  LEFT JOIN personal.funciones AS pf ON pf.id=da.funcion_id
  LEFT JOIN personal.profesiones AS pp ON pp.id=p.id_profesion
 WHERE p.apellido ilike '%nuñez%'
 ORDER BY "Apellido"

the exact same query with  ilike '%NUÑEZ%' works OK...

we are using postgresql 9.1 from ubuntu packages and the database encoding is (sadly) SQL_ASCII

can anybody point me in the right direction to solve this issue?

--
Roberto Scattini

If you could tells what you are expecting, and what you are actually getting, that would be helpful.

--
The person who says it cannot be done should not interrupt the person who is doing it.  -- Chinese Proverb

Re: problem with query

From
Roberto Scattini
Date:


On Thu, Sep 12, 2013 at 6:22 PM, Chris Curvey <chris@chriscurvey.com> wrote:


On Thu, Sep 12, 2013 at 4:34 PM, Roberto Scattini <roberto.scattini@gmail.com> wrote:

we are using postgresql 9.1 from ubuntu packages and the database encoding is (sadly) SQL_ASCII

can anybody point me in the right direction to solve this issue?

--
Roberto Scattini

If you could tells what you are expecting, and what you are actually getting, that would be helpful.


hi chris, sorry, the query with ilike '%NUÑEZ%'  is returning 4 rows and the query with ilike '%nuñez%' is returning 0 rows.

a simplified version of the query with his results:


pgn=# set client_encoding = SQL_ASCII;
SET
pgon=# SELECT DISTINCT
 p.id, p.apellido, p.nombre
 FROM personal.personas AS p
  WHERE p.apellido ilike '%nuñez%';
 id | apellido | nombre
----+----------+--------
(0 rows)

pgon=# SELECT DISTINCT
 p.id, p.apellido, p.nombre
 FROM personal.personas AS p
  WHERE p.apellido ilike '%NUÑEZ%';
  id   |    apellido    |     nombre
-------+----------------+-----------------
 39489 | NUÑEZ          | JUAN
 39937 | PEREZ NUÑEZ    | FRANCISCO
 39498 | NUÑEZ ESQUIBEL | RICARDO RODOLFO
 40220 | NUÑEZ VERA     | MANUEL SANTIAGO
(4 rows)


thanks!


--
Roberto Scattini

Re: problem with query

From
Chris Curvey
Date:

On Thu, Sep 12, 2013 at 5:33 PM, Roberto Scattini <roberto.scattini@gmail.com> wrote:


On Thu, Sep 12, 2013 at 6:22 PM, Chris Curvey <chris@chriscurvey.com> wrote:


On Thu, Sep 12, 2013 at 4:34 PM, Roberto Scattini <roberto.scattini@gmail.com> wrote:

we are using postgresql 9.1 from ubuntu packages and the database encoding is (sadly) SQL_ASCII

can anybody point me in the right direction to solve this issue?

--
Roberto Scattini

If you could tells what you are expecting, and what you are actually getting, that would be helpful.


hi chris, sorry, the query with ilike '%NUÑEZ%'  is returning 4 rows and the query with ilike '%nuñez%' is returning 0 rows.

a simplified version of the query with his results:


pgn=# set client_encoding = SQL_ASCII;
SET
pgon=# SELECT DISTINCT
 p.id, p.apellido, p.nombre
 FROM personal.personas AS p
  WHERE p.apellido ilike '%nuñez%';
 id | apellido | nombre
----+----------+--------
(0 rows)

pgon=# SELECT DISTINCT
 p.id, p.apellido, p.nombre
 FROM personal.personas AS p
  WHERE p.apellido ilike '%NUÑEZ%';
  id   |    apellido    |     nombre
-------+----------------+-----------------
 39489 | NUÑEZ          | JUAN
 39937 | PEREZ NUÑEZ    | FRANCISCO
 39498 | NUÑEZ ESQUIBEL | RICARDO RODOLFO
 40220 | NUÑEZ VERA     | MANUEL SANTIAGO
(4 rows)



Is this just a case-sentitvity issue?  if personas.apellido is a varchar field, then I think that's your trouble.  (it would have to be citext in order for "nunez" = "NUNEZ".)  


Re: problem with query

From
Bosco Rama
Date:
On 09/12/13 14:49, Chris Curvey wrote:
> Is this just a case-sentitvity issue?  if personas.apellido is a varchar
> field, then I think that's your trouble.  (it would have to be citext in
> order for "nunez" = "NUNEZ".)

He's using 'ilike' in his query, so this is more likely to be a
locale/charset/collation/encoding issue.

Roberto, what encodings are in use by the client and the server/DB?

HTH,
Bosco.


Re: problem with query

From
Roberto Scattini
Date:



On Thu, Sep 12, 2013 at 6:49 PM, Chris Curvey <chris@chriscurvey.com> wrote:


Is this just a case-sentitvity issue?  if personas.apellido is a varchar field, then I think that's your trouble.  (it would have to be citext in order for "nunez" = "NUNEZ".)  



yes, is just a case-sensitivity issue. yes, is a varchar field. 
but what i dont understand is why the problem of "insensitivity" is only with ñ (lower). i mean, if i remove my ñ (and subsecuent chars), both querys return the same rows:

pgon=# set client_encoding = SQL_ASCII;
SET
pgon=# SELECT DISTINCT
 p.id, p.apellido, p.nombre
 FROM personal.personas AS p
  WHERE p.apellido ilike '%NU%';
  id   |    apellido    |      nombre
-------+----------------+------------------
 39489 | NUÑEZ          | JUAN
 39937 | PEREZ NUÑEZ    | FRANCISCO
 40229 | PANNUNZIO      | MARIA CRISTINA
 39453 | GUARNUCIO      | CARLA GISELLE
 39498 | NUÑEZ ESQUIBEL | RICARDO RODOLFO
 40326 | MANUZZA        | ANDREA ALEJANDRA
 40205 | NIETO FRAGNUL  | LAUTARO ALFREDO
 40220 | NUÑEZ VERA     | MANUEL SANTIAGO
(8 rows)

pgon=# SELECT DISTINCT
 p.id, p.apellido, p.nombre
 FROM personal.personas AS p
  WHERE p.apellido ilike '%nu%';
  id   |    apellido    |      nombre
-------+----------------+------------------
 39489 | NUÑEZ          | JUAN
 39937 | PEREZ NUÑEZ    | FRANCISCO
 40229 | PANNUNZIO      | MARIA CRISTINA
 39453 | GUARNUCIO      | CARLA GISELLE
 39498 | NUÑEZ ESQUIBEL | RICARDO RODOLFO
 40326 | MANUZZA        | ANDREA ALEJANDRA
 40205 | NIETO FRAGNUL  | LAUTARO ALFREDO
 40220 | NUÑEZ VERA     | MANUEL SANTIAGO
(8 rows)


what makes the 'ñ' char special that makes the queries the same when it is not there?

--
Roberto Scattini

Re: problem with query

From
Roberto Scattini
Date:



On Thu, Sep 12, 2013 at 7:02 PM, Bosco Rama <postgres@boscorama.com> wrote:
On 09/12/13 14:49, Chris Curvey wrote:
> Is this just a case-sentitvity issue?  if personas.apellido is a varchar
> field, then I think that's your trouble.  (it would have to be citext in
> order for "nunez" = "NUNEZ".)

He's using 'ilike' in his query, so this is more likely to be a
locale/charset/collation/encoding issue.

Roberto, what encodings are in use by the client and the server/DB?


hi bosco, in fact i believe that i have that problem... but i cant undestand why and how to fix it.

the database has SQL_ASCII encoding, and my client... i am not sure, when i connect directly with psql from localhost i must set client encoding to SQL_ASCII or i receive an "invalid utf8 sequence" error...

but the problem is there, from psql, from pgadmin and from php/apache (my webservers)


thanks 


--
Roberto Scattini

Re: problem with query

From
Rodrigo Gonzalez
Date:
On Thu, 12 Sep 2013 19:07:04 -0300
Roberto Scattini <roberto.scattini@gmail.com> wrote:

> On Thu, Sep 12, 2013 at 7:02 PM, Bosco Rama <postgres@boscorama.com>
> wrote:
>
> > On 09/12/13 14:49, Chris Curvey wrote:
> > > Is this just a case-sentitvity issue?  if personas.apellido is a
> > > varchar field, then I think that's your trouble.  (it would have
> > > to be citext in order for "nunez" = "NUNEZ".)
> >
> > He's using 'ilike' in his query, so this is more likely to be a
> > locale/charset/collation/encoding issue.
> >
> > Roberto, what encodings are in use by the client and the server/DB?
> >
> >
> hi bosco, in fact i believe that i have that problem... but i cant
> undestand why and how to fix it.
>
> the database has SQL_ASCII encoding, and my client... i am not sure,
> when i connect directly with psql from localhost i must set client
> encoding to SQL_ASCII or i receive an "invalid utf8 sequence" error...
>
> but the problem is there, from psql, from pgadmin and from php/apache
> (my webservers)

Without being an expert....dont use sql_ascii, dump, recreate the DB as
utf8 and you will get what you want.

>
>
> thanks
>
>



Re: problem with query

From
Bosco Rama
Date:
On 09/12/13 15:07, Roberto Scattini wrote:
> hi bosco, in fact i believe that i have that problem... but i cant
> undestand why and how to fix it.
>
> the database has SQL_ASCII encoding, and my client... i am not sure, when i
> connect directly with psql from localhost i must set client encoding to
> SQL_ASCII or i receive an "invalid utf8 sequence" error...
>
> but the problem is there, from psql, from pgadmin and from php/apache (my
> webservers)

Well, the bytes that get encoded (in UTF8) for lowercase 'enye' is
x'c3b1' and for the uppercase version is x'c391' thus when they are
compared as straight bytes they will never be considered equal wrt
case-insensitivity.

The "invalid utf8 sequence" error is due to, as it says, an invalid
utf8 sequence in your text strings.  This is due to the lack of any
enforcement provided by the SQL_ASCII encoding that was probably in
effect when the data was initially inserted.

To get to a point where you can use your data in its proper encoding
you will need to identify which strings are invalid and then dump
the DB and restore it into an appropriately initialized DB.  You could
use the restore process to help you find the bad strings but that is
tedious as all hell. Better to scan you strings somehow and locate the
ones that are invalid and fix them prior to the dump.

HTH,
Bosco.


Re: problem with query

From
John R Pierce
Date:
On 9/12/2013 3:03 PM, Roberto Scattini wrote:
> yes, is just a case-sensitivity issue. yes, is a varchar field.
> but what i dont understand is why the problem of "insensitivity" is
> only with ñ (lower). i mean, if i remove my ñ (and subsecuent chars),
> both querys return the same rows:


SQLASCII means its just 8 bit octets, and only standard USASCII
characters are recognized, which doesn't include ñ  (what encoding is
that in, anyways?)



--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



Re: problem with query

From
David Johnston
Date:
Roberto Scattini wrote
> what makes the 'ñ' char special that makes the queries the same when it is
> not there?

My knowledge here is a little rough around the edges but the following is
conceptually true:

For the most part legacy encodings (or non-encodings as this case
technically falls under) recognize only the explicit case-conversions for
the latin alphabet A-Z (mapped onto "a-z") without any accents.

To reasonably process strings/varchars/clobs that contain accented letters
it is necessary to use a more modern encoding - such as UTF-8/Unicode -
which contains the necessary logic to perform the additional conversions.

these should (not going to test it myself at this time) by
case-insensitively identical:

abcdëFGH
ABCDëfgh

since the "a-d, f-h" can be converted between and the one symbol that
cannot, "ë" is the same in both string....so it isn't that the ë breaks
things but rather that symbol has no upper-case alternative to equivalently
match against...just like numbers and symbols behave in the same situation.
To ASCII ë is just a symbol without any "letter of alphabet"
characteristics.

David J.




--
View this message in context: http://postgresql.1045698.n5.nabble.com/problem-with-query-tp5770637p5770653.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: problem with query

From
Giuseppe Broccolo
Date:
Il 12/09/2013 22:34, Roberto Scattini ha scritto:
hi, today we discovered that this query doesn't return the expected values:

SELECT DISTINCT
 p.id, p.apellido AS "Apellido", p.nombre AS "Nombre", pf.nombre AS "Funcion", to_char(da.f_ingreso_pg, 'dd/mm/yyyy') AS "Fecha Ingreso PG", e.estado AS "Estado", to_char(pe.fecha, 'dd/mm/yyyy') AS "Fecha Estado"
 FROM personal.personas AS p
  LEFT JOIN personal.personas_da_pg AS da ON p.id=da.id_persona
  LEFT JOIN personal.personas_estado AS pe ON da.estado_id=pe.id
  LEFT JOIN personal.estados AS e ON pe.id_estado=e.id
  LEFT JOIN procu_departamento AS d ON d.id=da.id_departamento
  LEFT JOIN procu_departamento AS dto ON left(d.c_organigrama, 4)||'000'=dto.c_organigrama
  LEFT JOIN procu_departamento AS dir ON left(d.c_organigrama, 3)||'0000'=dir.c_organigrama
  LEFT JOIN procu_departamento AS dg ON left(d.c_organigrama, 2)||'00000'=dg.c_organigrama
  LEFT JOIN personal.funciones AS pf ON pf.id=da.funcion_id
  LEFT JOIN personal.profesiones AS pp ON pp.id=p.id_profesion
 WHERE p.apellido ilike '%nuñez%'
 ORDER BY "Apellido"
The problem is the encoding: SQL_ASCII encodes only the first 128 characters, so 'ñ' and 'Ñ' cannot be encoded in ASCII. If you insert text 'ñ' or 'Ñ' in a table inside a database (e.g. 'ascii_test') with SQL_ASCII encoding, they will be seen as two distinct characters:

ascii_test=# SELECT length(E'ñ');
 length
--------
      2
(1 row)
ascii_test=# SELECT length(E'Ñ');
 length
--------
      2
(1 row)
ascii_test=# SELECT 'ñ'::bytea;
 bytea 
--------
 \xc3b1
(1 row)
ascii_test=# SELECT 'Ñ'::bytea;
 bytea 
--------
 \xc391
(1 row)

Hexadecimals 'c3', 'b1' and '91' does not decode anything in ASCII, even if terminal show pairs as 'ñ' or 'Ñ'. So ILIKE operator cannot apply case insensitive pattern matching to data which does not encode any string (in the SQL_ASCII encoding) and works as a normal LIKE.

Even if the client which insert data has 8-bit encoding (UTF8, where 'ñ' and 'Ñ' are decoded), SQL_ASCII database cannot convert strings anyway, and problem remains.

The best way is to work with a UTF8 encoded database. Is there a particular reason to work with SQL_ASCII encoding?

Giuseppe.
-- 
Giuseppe Broccolo - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
giuseppe.broccolo@2ndQuadrant.it | www.2ndQuadrant.it

Re: problem with query

From
Roberto Scattini
Date:
hi giuseppe,


On Fri, Sep 13, 2013 at 11:49 AM, Giuseppe Broccolo <giuseppe.broccolo@2ndquadrant.it> wrote:
The problem is the encoding: SQL_ASCII encodes only the first 128 characters, so 'ñ' and 'Ñ' cannot be encoded in ASCII. If you insert text 'ñ' or 'Ñ' in a table inside a database (e.g. 'ascii_test') with SQL_ASCII encoding, they will be seen as two distinct characters:


Hexadecimals 'c3', 'b1' and '91' does not decode anything in ASCII, even if terminal show pairs as 'ñ' or 'Ñ'. So ILIKE operator cannot apply case insensitive pattern matching to data which does not encode any string (in the SQL_ASCII encoding) and works as a normal LIKE.

Even if the client which insert data has 8-bit encoding (UTF8, where 'ñ' and 'Ñ' are decoded), SQL_ASCII database cannot convert strings anyway, and problem remains.

The best way is to work with a UTF8 encoded database. Is there a particular reason to work with SQL_ASCII encoding?


no, the only reason until now has been "If it ain't broke, don't fix it."... so maybe it is time to start thinking in migrate the encoding to utf8...

is there any normal procedure to do this?

dump and then restore in a freshly initiated database should be enough?


thanks!
 


--
Roberto Scattini