Thread: unicode searches failing that use % and LIKE operators

unicode searches failing that use % and LIKE operators

From
Benjamin Weaver
Date:
Dear all,

I have the following problem:  a compound search, involving 2 wildcarded
character search terms, in which one search term consists of Latin characters
and the other, of UTF-8 unicode Greek characters, fails.  This is strange,
because similar searches in which both terms are either unicode Greek or Latin
characters succeed.

Both terms query a column of type text.  Searches of this kind fail both via
JDBC and the PSQL interface via an xterm window (which handles unicode
properly).  The JDBC search will have put both terms in UTF-8.  The column is
called metadatafulltext.  The search uses the LIKE operator in conjunction with
the wildcard character:

For example, the following search fails:

SELECT ..

FROM metadatafulltext...

WHERE metadatafulltext LIKE '%Jones%' AND metadatafulltext LIKE '%ALPHABETA%'
(where ALPHABETA is actually a unicode Greek string (\u03b1\u03b2).


whereas searches using all Greek characters succeed:

WHERE metadatafulltext LIKE '%BETAEPSILONDELTA%' AND metadatafulltext LIKE
'%ALPHABETA%'


and equally, all-Latin searches also succeed:
WHERE metadatafulltext LIKE '%Jones%' AND metadatafulltext LIKE '%Smith%'



What must I do to ensure that mixed-term searches of the first kind succeed?


Thanks in advance,

Ben Weaver

--
Benjamin Weaver
Faculty Research Associate, Imaging Papyri Projects, Herculaneum Society, Oxford
email:  benjamin.weaver@classics.ox.ac.uk
phone:  (0)1865 610236


Re: unicode searches failing that use % and LIKE operators

From
Tom Lane
Date:
Benjamin Weaver <benjamin.weaver@classics.ox.ac.uk> writes:
> I have the following problem:  a compound search, involving 2 wildcarded
> character search terms, in which one search term consists of Latin characters
> and the other, of UTF-8 unicode Greek characters, fails.  This is strange,
> because similar searches in which both terms are either unicode Greek or Latin
> characters succeed.

That's pretty strange all right.  What PG version is this, what database
encoding are you using, and what are the server's lc_collate and
lc_ctype settings?

            regards, tom lane

Re: unicode searches failing that use % and LIKE operators

From
Benjamin Weaver
Date:
Tom,

Thanks.  I am running:

Postgres 8.1.4
server_encoding  UTF8
lc_collate en_GB.UTF-8
lc_ctype   en_GB.UTF-8

Ben

In message <19842.1193077201@sss.pgh.pa.us> Tom Lane <tgl@sss.pgh.pa.us>
writes:
> Benjamin Weaver <benjamin.weaver@classics.ox.ac.uk> writes:
> > I have the following problem:  a compound search, involving 2 wildcarded
> > character search terms, in which one search term consists of Latin
characters
> > and the other, of UTF-8 unicode Greek characters, fails.  This is strange,
> > because similar searches in which both terms are either unicode Greek or
Latin
> > characters succeed.
>
> That's pretty strange all right.  What PG version is this, what database
> encoding are you using, and what are the server's lc_collate and
> lc_ctype settings?
>
>             regards, tom lane

--
Benjamin Weaver
Faculty Research Associate, Imaging Papyri Projects, Herculaneum Society,
Oxford
email:  benjamin.weaver@classics.ox.ac.uk
phone:  (0)1865 610236


Re: unicode searches failing that use % and LIKE operators

From
Benjamin Weaver
Date:
Tom,

To be more precise, the mixed queries "fail" in that they return hits of 0
rows, when they should return more than 0 rows.

Ben

In message <19842.1193077201@sss.pgh.pa.us> Tom Lane <tgl@sss.pgh.pa.us>
writes:
> Benjamin Weaver <benjamin.weaver@classics.ox.ac.uk> writes:
> > I have the following problem:  a compound search, involving 2 wildcarded
> > character search terms, in which one search term consists of Latin
characters
> > and the other, of UTF-8 unicode Greek characters, fails.  This is strange,
> > because similar searches in which both terms are either unicode Greek or
Latin
> > characters succeed.
>
> That's pretty strange all right.  What PG version is this, what database
> encoding are you using, and what are the server's lc_collate and
> lc_ctype settings?
>
>             regards, tom lane

--
Benjamin Weaver
Faculty Research Associate, Imaging Papyri Projects, Herculaneum Society,
Oxford
email:  benjamin.weaver@classics.ox.ac.uk
phone:  (0)1865 610236


Re: unicode searches failing that use % and LIKE operators

From
Tom Lane
Date:
Benjamin Weaver <benjamin.weaver@classics.ox.ac.uk> writes:
> Postgres 8.1.4
> server_encoding  UTF8
> lc_collate en_GB.UTF-8
> lc_ctype   en_GB.UTF-8

Hmph, nothing strange-looking there.  I tried to reproduce the problem
here, without success.  Now I was using 8.1.10 on Linux (I gather your
platform is not Linux from the spelling of the locale names), but there
hasn't been any change in the LIKE code in the 8.1 branch.

One thing that would be worth trying is to turn on query logging
(set log_statement = 'all') and verify that the problem query appears in
the postmaster log as you expect; this is just to eliminate any doubt
about whether the string encoding is getting messed up somewhere between
client and server.

Assuming you don't find a smoking gun that way, can you put together a
self-contained test case?  There must be some detail of what you're
doing that I failed to reproduce.

            regards, tom lane

Re: unicode searches failing that use % and LIKE operators

From
Gregory Stark
Date:
"Benjamin Weaver" <benjamin.weaver@classics.ox.ac.uk> writes:

> Tom,
>
> Thanks.  I am running:
>
> Postgres 8.1.4
> server_encoding  UTF8
> lc_collate en_GB.UTF-8
> lc_ctype   en_GB.UTF-8

Hm, I wonder what the en_GB locale on your machine does when it sees
characters unused in English such as Greek characters. Is this Windows or some
form of Unix?

Do you have a particular LIKE pattern and a particular example for which it
fails? Or does it work reasonably when you test individual values but fail
when you perform large searches?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: unicode searches failing that use % and LIKE operators

From
Benjamin Weaver
Date:
Dear Greg, Tom,

I AM in fact running the db on Linux.  Redhat 9.  Are the encoding parameters
wrong for Linux?

I am sending the queries via JDBC from a windows machine.  But I have also
gotten the same results via psql when sending the queries from one local redhat
9 box to the redhat 9 database server box.

Ben


In message <87bqaqhme9.fsf@oxford.xeocode.com> Gregory Stark
<stark@enterprisedb.com> writes:
> "Benjamin Weaver" <benjamin.weaver@classics.ox.ac.uk> writes:
>
> > Tom,
> >
> > Thanks.  I am running:
> >
> > Postgres 8.1.4
> > server_encoding  UTF8
> > lc_collate en_GB.UTF-8
> > lc_ctype   en_GB.UTF-8
>
> Hm, I wonder what the en_GB locale on your machine does when it sees
> characters unused in English such as Greek characters. Is this Windows or some
> form of Unix?
>
> Do you have a particular LIKE pattern and a particular example for which it
> fails? Or does it work reasonably when you test individual values but fail
> when you perform large searches?
>
> --
>   Gregory Stark
>   EnterpriseDB          http://www.enterprisedb.com

--
Benjamin Weaver
Faculty Research Associate, Imaging Papyri Projects, Herculaneum Society, Oxford
email:  benjamin.weaver@classics.ox.ac.uk
phone:  (0)1865 610236


Re: unicode searches failing that use % and LIKE operators

From
Tom Lane
Date:
Benjamin Weaver <benjamin.weaver@classics.ox.ac.uk> writes:
> I AM in fact running the db on Linux.  Redhat 9.  Are the encoding parameters
> wrong for Linux?

Hmm ... RH 9 is awfully old.  It's at least conceivable that you're
getting bit by some glibc bug.  However, if these are just plain LIKE
calls and not ILIKE then I don't think that theory is very viable
--- AFAICS the regular LIKE code doesn't depend on anything from the
platform.

A slightly more viable theory is that you're getting bit by a gcc bug.
Did you build the Postgres executables in-house, and if not where did
you get them from?  It might be helpful if you'd show us the output
of "pg_config".

Again, if you could put together a self-contained test case (preferably
a short psql script) it would be helpful so we could try the case on
other machines.  It's not at all clear at this point whether the bug
is Postgres' fault or something about the underlying platform.

            regards, tom lane

Re: unicode searches failing that use % and LIKE operators

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Benjamin Weaver <benjamin.weaver@classics.ox.ac.uk> writes:
>> Postgres 8.1.4
>> server_encoding  UTF8
>> lc_collate en_GB.UTF-8
>> lc_ctype   en_GB.UTF-8
>
> Hmph, nothing strange-looking there.  I tried to reproduce the problem
> here, without success.  Now I was using 8.1.10 on Linux (I gather your
> platform is not Linux from the spelling of the locale names)

Really? On my Debian system the locales are named precisely like that.
What do they look like on Red Hat?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: unicode searches failing that use % and LIKE operators

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> Hmph, nothing strange-looking there.  I tried to reproduce the problem
>> here, without success.  Now I was using 8.1.10 on Linux (I gather your
>> platform is not Linux from the spelling of the locale names)

> Really? On my Debian system the locales are named precisely like that.
> What do they look like on Red Hat?

$ locale -a | grep en_GB
en_GB
en_GB.iso88591
en_GB.iso885915
en_GB.utf8

This is on Fedora Core 6.  It appears that you can *set* the locale
to "en_GB.UTF-8" successfully, but it doesn't seem to be the preferred
spelling.

            regards, tom lane

Re: unicode searches failing that use % and LIKE operators

From
Benjamin Weaver
Date:
Thanks, Tom and Greg, for all your help.  I agree that Redhat 9 is a bit
creaky and that we here should upgrade.  In the meantime I will put together a
tiny test case that will reproduce the problem (at least on my platform).

Ben

In message <26870.1193091184@sss.pgh.pa.us> Tom Lane <tgl@sss.pgh.pa.us>
writes:
> Benjamin Weaver <benjamin.weaver@classics.ox.ac.uk> writes:
> > I AM in fact running the db on Linux.  Redhat 9.  Are the encoding
parameters
> > wrong for Linux?
>
> Hmm ... RH 9 is awfully old.  It's at least conceivable that you're
> getting bit by some glibc bug.  However, if these are just plain LIKE
> calls and not ILIKE then I don't think that theory is very viable
> --- AFAICS the regular LIKE code doesn't depend on anything from the
> platform.
>
> A slightly more viable theory is that you're getting bit by a gcc bug.
> Did you build the Postgres executables in-house, and if not where did
> you get them from?  It might be helpful if you'd show us the output
> of "pg_config".
>
> Again, if you could put together a self-contained test case (preferably
> a short psql script) it would be helpful so we could try the case on
> other machines.  It's not at all clear at this point whether the bug
> is Postgres' fault or something about the underlying platform.
>
>             regards, tom lane

--
Benjamin Weaver
Faculty Research Associate, Imaging Papyri Projects, Herculaneum Society,
Oxford
email:  benjamin.weaver@classics.ox.ac.uk
phone:  (0)1865 610236


Re: unicode searches failing that use % and LIKE operators

From
Benjamin Weaver
Date:
Tom, Greg,

Please accept my considerable apologies.  The fault was my own program that
loaded metadata into the text column, metadatafulltext.  My reason (not a
sufficient excuse!) for overlooking the problem is that the Unicode is
invisible in psql windows, therefore it was not a simple matter of displaying
each row of metadatafulltext to examine its contents.  Some rows got loaded
with Latin metadata, others with Greek, whereas each and every row needed both.

You both have been very helpful.  In the future I will be careful as I can to
ensure any problems don't exist with my down read-write-create-selete software!

Thanks again for your help,

Ben

In message <26870.1193091184@sss.pgh.pa.us> Tom Lane <tgl@sss.pgh.pa.us>
writes:
> Benjamin Weaver <benjamin.weaver@classics.ox.ac.uk> writes:
> > I AM in fact running the db on Linux.  Redhat 9.  Are the encoding
parameters
> > wrong for Linux?
>
> Hmm ... RH 9 is awfully old.  It's at least conceivable that you're
> getting bit by some glibc bug.  However, if these are just plain LIKE
> calls and not ILIKE then I don't think that theory is very viable
> --- AFAICS the regular LIKE code doesn't depend on anything from the
> platform.
>
> A slightly more viable theory is that you're getting bit by a gcc bug.
> Did you build the Postgres executables in-house, and if not where did
> you get them from?  It might be helpful if you'd show us the output
> of "pg_config".
>
> Again, if you could put together a self-contained test case (preferably
> a short psql script) it would be helpful so we could try the case on
> other machines.  It's not at all clear at this point whether the bug
> is Postgres' fault or something about the underlying platform.
>
>             regards, tom lane

--
Benjamin Weaver
Faculty Research Associate, Imaging Papyri Projects, Herculaneum Society,
Oxford
email:  benjamin.weaver@classics.ox.ac.uk
phone:  (0)1865 610236


Re: unicode searches failing that use % and LIKE operators

From
fabio fumarola
Date:
I had the same problem. I solved it by doing query using the pattern matching operator ~*. 
I paste an example query

select id_edb,transcription from epigraph where transcription ~* 'εἰρήνῃ'

and it works

best

Fabio