Thread: Backpatching of "Teach the regular expression functions to do case-insensitive matching"

Hi,

In my opinion this is actually a bug in < 9.0. As it's a (IMO) low-impact fix 
that's constrained to two files, it seems sensible to backpatch it now that the 
solution has proven itself in the field?

The issue is hard to find and has come up several times in the field. And it has 
been slightly embarrassing more than once ;)

I am happy to provide patches for the supported releases < 9.0 if that helps 
move the issue along.

Greetings,

Andres


Andres Freund <andres@anarazel.de> writes:
> In my opinion this is actually a bug in < 9.0. As it's a (IMO) low-impact fix 
> that's constrained to two files, it seems sensible to backpatch it now that the 
> solution has proven itself in the field?

FWIW, I still don't trust that patch a lot (and I was the one who wrote it).
The question is whether you believe that every platform uses Unicode
code points directly as the wchar_t representation in UTF8-based locales.
Although we've not heard any complaints suggesting that that assumption
doesn't hold, I don't feel that 9.0 has been out long enough to prove
much.  The complaints about the old code were rare enough to make me
think people just don't exercise the case too much, or don't notice if
the behavior is wrong.  So it likely hasn't had that much exercise in
9.0 either.

In short, I'm pretty wary of back-patching it.
        regards, tom lane


On Thu, May 5, 2011 at 5:21 AM, Andres Freund <andres@anarazel.de> wrote:
> In my opinion this is actually a bug in < 9.0. As it's a (IMO) low-impact fix
> that's constrained to two files, it seems sensible to backpatch it now that the
> solution has proven itself in the field?
>
> The issue is hard to find and has come up several times in the field. And it has
> been slightly embarrassing more than once ;)

Can you share some more details about your experiences?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Friday, May 06, 2011 04:30:01 AM Robert Haas wrote:
> On Thu, May 5, 2011 at 5:21 AM, Andres Freund <andres@anarazel.de> wrote:
> > In my opinion this is actually a bug in < 9.0. As it's a (IMO) low-impact
> > fix that's constrained to two files, it seems sensible to backpatch it now
> > that the solution has proven itself in the field?
> > The issue is hard to find and has come up several times in the field. And
> > it has been slightly embarrassing more than once ;)
> Can you share some more details about your experiences?
About the embarrassing part or the hard-to-find part?

One of the hard-to-find cases involved a search (constraining word order
after a tsearch query) where slightly fewer search results than usual were
returned in production.
Nobody had noticed during testing that case-insensitive search worked for most
things except multibyte chars, because the tested case was something like
SELECT 'ÖFFENTLICHKEIT' ~* 'Öffentlichkeit' -- where the only multibyte
character, Ö, is the same case on both sides -- and the regex condition was
only relevant when searching for multiple words.

One of the embarrassing examples was that I suggested moving away from a
solution using several ILIKE rules to one case-insensitive regular expression,
totally forgetting that this was only fixed in 9.0. It turned out to be
faster. And it turned out to be wrong. In production :-(.


Both examples show that the problem often goes unnoticed: most of the people
who realize that the case could be a problem don't know the content well
enough to spot the wrong results until later...

Andres


On Fri, May 6, 2011 at 9:22 AM, Andres Freund <andres@anarazel.de> wrote:
> On Friday, May 06, 2011 04:30:01 AM Robert Haas wrote:
>> On Thu, May 5, 2011 at 5:21 AM, Andres Freund <andres@anarazel.de> wrote:
>> > In my opinion this is actually a bug in < 9.0. As it's a (IMO) low-impact
>> > fix that's constrained to two files, it seems sensible to backpatch it now
>> > that the solution has proven itself in the field?
>> > The issue is hard to find and has come up several times in the field. And
>> > it has been slightly embarrassing more than once ;)
>> Can you share some more details about your experiences?
> About the embarrassing part or the hard-to-find part?
>
> One of the hard-to-find cases involved a search (constraining word order
> after a tsearch query) where slightly fewer search results than usual were
> returned in production.
> Nobody had noticed during testing that case-insensitive search worked for most
> things except multibyte chars, because the tested case was something like
> SELECT 'ÖFFENTLICHKEIT' ~* 'Öffentlichkeit' -- where the only multibyte
> character, Ö, is the same case on both sides -- and the regex condition was
> only relevant when searching for multiple words.
>
> One of the embarrassing examples was that I suggested moving away from a
> solution using several ILIKE rules to one case-insensitive regular expression,
> totally forgetting that this was only fixed in 9.0. It turned out to be
> faster. And it turned out to be wrong. In production :-(.
>
>
> Both examples show that the problem often goes unnoticed: most of the people
> who realize that the case could be a problem don't know the content well
> enough to spot the wrong results until later...

After mulling this over a bit more, I guess I'm a little skeptical of
back-patching this because it is clearly a behavior change.  It seems
unlikely, but not impossible, that someone is relying on the current
behavior, and changing it in a minor release might be considered
unfriendly.

On the flip side, the risk of it flat-out blowing up seems pretty
small.  For someone to invent their own version of wchar_t that uses
something other than Unicode code points would be pretty much pure
masochism, wouldn't it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> writes:
> On the flip side, the risk of it flat-out blowing up seems pretty
> small.  For someone to invent their own version of wchar_t that uses
> something other than Unicode code points would be pretty much pure
> masochism, wouldn't it?

Well, no, that's not clear.  The C standard has pretty carefully avoided
constraining the wchar_t representation, so implementors are free to do
whatever is most convenient from the standpoint of their library routines.
I could easily see somebody deciding to do something that wasn't quite
Unicode because it let him re-use lookup tables designed for some other
encoding, or some such.

Now it's also perfectly possible, maybe even likely, that nobody's done
that on any platform we care about.  But I don't believe we know that
with any degree of certainty.  We definitely have not made any effort to
establish whether it's true --- for example, we have no regression tests
that address the point.  (I think that collate.linux.utf8 touches on it,
but we're a long way from being able to run that on non-glibc
platforms...)
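
For illustration, a quick standalone probe of that assumption might look
something like the untested sketch below.  (The hard-wired locale name is
just a guess; it varies across platforms, which is part of why this is
awkward to turn into a portable regression test.)

/*
 * Untested sketch: does mbstowcs() in a UTF-8 locale hand back
 * Unicode code points in wchar_t?
 */
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int
main(void)
{
    const char *mb = "\xC3\x96";    /* UTF-8 for U+00D6, "Ö" */
    wchar_t     wc[2];

    /* locale name is platform-dependent; this one is only a guess */
    if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
    {
        fprintf(stderr, "no UTF-8 locale available, cannot check\n");
        return 1;
    }

    if (mbstowcs(wc, mb, 2) != 1)
    {
        fprintf(stderr, "mbstowcs failed\n");
        return 1;
    }

    if (wc[0] == 0xD6)
        printf("wchar_t appears to hold Unicode code points\n");
    else
        printf("assumption violated: got 0x%lx\n", (unsigned long) wc[0]);
    return 0;
}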
        regards, tom lane


On Sat, May 7, 2011 at 12:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On the flip side, the risk of it flat-out blowing up seems pretty
>> small.  For someone to invent their own version of wchar_t that uses
>> something other than Unicode code points would be pretty much pure
>> masochism, wouldn't it?
>
> Well, no, that's not clear.  The C standard has pretty carefully avoided
> constraining the wchar_t representation, so implementors are free to do
> whatever is most convenient from the standpoint of their library routines.
> I could easily see somebody deciding to do something that wasn't quite
> Unicode because it let him re-use lookup tables designed for some other
> encoding, or some such.
>
> Now it's also perfectly possible, maybe even likely, that nobody's done
> that on any platform we care about.  But I don't believe we know that
> with any degree of certainty.  We definitely have not made any effort to
> establish whether it's true --- for example, we have no regression tests
> that address the point.  (I think that collate.linux.utf8 touches on it,
> but we're a long way from being able to run that on non-glibc
> platforms...)

Well, since any problems in this area are going to bite us eventually
in 9.0+ even without any further action on our part, maybe it would be
wise to think up something we could add to the regression tests.  That
would give us some immediate feedback from the buildfarm, and also
significantly improve the odds of someone compiling on a weird
platform noticing if things are broken.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> writes:
> Well, since any problems in this area are going to bite us eventually
> in 9.0+ even without any further action on our part, maybe it would be
> wise to think up something we could add to the regression tests.  That
> would give us some immediate feedback from the buildfarm, and also
> significantly improve the odds of someone compiling on a weird
> platform noticing if things are broken.

No objection here, but how will we do that?  The regression tests are
designed to work in any locale/encoding, and would become significantly
less useful if they weren't.
        regards, tom lane


On Mon, May 9, 2011 at 10:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Well, since any problems in this area are going to bite us eventually
>> in 9.0+ even without any further action on our part, maybe it would be
>> wise to think up something we could add to the regression tests.  That
>> would give us some immediate feedback from the buildfarm, and also
>> significantly improve the odds of someone compiling on a weird
>> platform noticing if things are broken.
>
> No objection here, but how will we do that?  The regression tests are
> designed to work in any locale/encoding, and would become significantly
> less useful if they weren't.

I'm just shooting from the hip here, but maybe we could have a
separate (probably smaller) set of tests that are only designed to
work in a limited range of locales and/or encodings.  I'm really
pleased that we now have the src/test/isolation stuff, and I think
some more auxiliary test suites would be quite excellent.  Even if
people didn't always want to run every single one when doing things
manually, the buildfarm certainly could.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, May 9, 2011 at 10:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> No objection here, but how will we do that?  The regression tests are
>> designed to work in any locale/encoding, and would become significantly
>> less useful if they weren't.

> I'm just shooting from the hip here, but maybe we could have a
> separate (probably smaller) set of tests that are only designed to
> work in a limited range of locales and/or encodings.  I'm really
> pleased that we now have the src/test/isolation stuff, and I think
> some more auxiliary test suites would be quite excellent.  Even if
> people didn't always want to run every single one when doing things
> manually, the buildfarm certainly could.

Hmm.  We don't need new infrastructure like the isolation tests do,
so another subdirectory seems like overkill.  I am thinking about a new
target "installcheck-collations" in src/test/regress/GNUmakefile that
creates a UTF8-encoding database and runs a different test schedule than
the regular tests.

The problem we'd have is that there's no way (at present) to make such
a test pass on every platform.  Windows has its own set of locale names
(which initdb fails to install as collations anyway) and we also have
the problem that OS X can be counted on to get UTF8 sorting wrong.
(It might be okay for case-folding though; not sure.)  Possibly we could
just provide an alternate expected file for OS X, but I don't see a
decent workaround for Windows --- it would pretty much have to have its
very own test case.

Andrew, what kinds of options have we got for persuading the buildfarm
to run such tests only on a subset of platforms?
        regards, tom lane


Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I am thinking about a new target "installcheck-collations" in
> src/test/regress/GNUmakefile that creates a UTF8-encoding database
> and runs a different test schedule than the regular tests.
I don't know the best way to do this (or how many people agree we
should), but I found that running the standard `make installcheck`
against a database where the postgresql.conf file set
default_transaction_isolation = serializable helped provide some
baseline testing of SSI.  None of the results change, but it helps
protect against certain classes of dumb error.  (I know this from
personal experience.)  I know *I'd* feel better if at least a few
buildfarm animals were set up to do this.
-Kevin


On Mon, 2011-05-09 at 10:56 -0400, Robert Haas wrote:
> I'm just shooting from the hip here, but maybe we could have a
> separate (probably smaller) set of tests that are only designed to
> work in a limited range of locales and/or encodings.  I'm really
> pleased that we now have the src/test/isolation stuff, and I think
> some more auxiliary test suites would be quite excellent.  Even if
> people didn't always want to run every single one when doing things
> manually, the buildfarm certainly could.

Well, the result of "people don't always run them" is the rest of
src/test/.  How much of that stuff even works anymore?




On Mon, 2011-05-09 at 12:42 -0400, Tom Lane wrote:
> The problem we'd have is that there's no way (at present) to make such
> a test pass on every platform.  Windows has its own set of locale names
> (which initdb fails to install as collations anyway) and we also have
> the problem that OS X can be counted on to get UTF8 sorting wrong.
> (It might be okay for case-folding though; not sure.)  Possibly we could
> just provide an alternate expected file for OS X, but I don't see a
> decent workaround for Windows --- it would pretty much have to have its
> very own test case.

Windows >=Vista has locale names similar to Linux, and my cursory
testing with a hacked-up test suite indicates that it produces the
same results as the Linux expected file, modulo some error message
differences.  So I think this could be made to work; it just needs
someone to implement a few bits.




Peter Eisentraut <peter_e@gmx.net> writes:
> On Mon, 2011-05-09 at 12:42 -0400, Tom Lane wrote:
>> The problem we'd have is that there's no way (at present) to make such
>> a test pass on every platform.  Windows has its own set of locale names
>> (which initdb fails to install as collations anyway) and we also have
>> the problem that OS X can be counted on to get UTF8 sorting wrong.
>> (It might be okay for case-folding though; not sure.)  Possibly we could
>> just provide an alternate expected file for OS X, but I don't see a
>> decent workaround for Windows --- it would pretty much have to have its
>> very own test case.

> Windows >=Vista has locale names similar to Linux, and my cursory
> testing with a hacked-up test suite indicates that it produces the
> same results as the Linux expected file, modulo some error message
> differences.  So I think this could be made to work; it just needs
> someone to implement a few bits.

Well, that would be great, but the "someone" is not going to be me;
I don't do Windows.  I'd be willing to take responsibility for putting
in the regression test once the necessary Windows-specific code was
committed, though.
        regards, tom lane


On Tue, May 10, 2011 at 3:09 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Mon, 2011-05-09 at 10:56 -0400, Robert Haas wrote:
>> I'm just shooting from the hip here, but maybe we could have a
>> separate (probably smaller) set of tests that are only designed to
>> work in a limited range of locales and/or encodings.  I'm really
>> pleased that we now have the src/test/isolation stuff, and I think
>> some more auxiliary test suites would be quite excellent.  Even if
>> people didn't always want to run every single one when doing things
>> manually, the buildfarm certainly could.
>
> Well, the result of "people don't always run them" is the rest of
> src/test/.  How much of that stuff even works anymore?

I don't know.  But I'm not sure I see your point.  The fact that we
haven't yet succeeded in doing something doesn't mean that it's either
impossible or unimportant.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


On Tue, 2011-05-10 at 15:17 -0400, Tom Lane wrote:
> Well, that would be great, but the "someone" is not going to be me;
> I don't do Windows.

Yeah, me neither.  At least not for this release.



On Tue, 2011-05-10 at 15:48 -0400, Robert Haas wrote:
> On Tue, May 10, 2011 at 3:09 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> > Well, the result of "people don't always run them" is the rest of
> > src/test/.  How much of that stuff even works anymore?
> 
> I don't know.  But I'm not sure I see your point.  The fact that we
> haven't yet succeeded in doing something doesn't mean that it's either
> impossible or unimportant.

Yes, but doing the same thing again in hope of different results isn't
the right thing either.

I'm all for more test suites, but we should make them as widely
accessible and accessed as possible so that they get maintained.



Peter Eisentraut <peter_e@gmx.net> writes:
> I'm all for more test suites, but we should make them as widely
> accessible and accessed as possible so that they get maintained.

Yeah.  My preference would really be to push something like
collate.linux.utf8 into the standard regression tests, but we'd
first have to get it to where there was only one .sql file and
not more than three or so variant expected files (one of which
would be the one for platforms that don't support locale_t,
analogous to the no-support expected file for the xml test).

If we were at that point, then instead of having a separate make target,
I'd be very strongly tempted to include the test in the standard tests
by the expedient of having it create and \c to a separate database with
suitable values of ENCODING, LC_COLLATE, etc.

The lack of initdb support for getting more-or-less-standard collation
entries into pg_collation on Windows seems to be the major missing piece
from here (dunno if Peter is aware of others).  If we don't fix that
before release, we're going to regret it anyway IMO, because of the
inevitable tide of questions/complaints from Windows users trying to use
the collation feature.  We've already seen at least one such from a beta
tester.
        regards, tom lane


On Tue, 2011-05-10 at 18:05 -0400, Tom Lane wrote:
> The lack of initdb support for getting more-or-less-standard collation
> entries into pg_collation on Windows seems to be the major missing
> piece from here (dunno if Peter is aware of others).  If we don't fix
> that before release, we're going to regret it anyway IMO, because of
> the inevitable tide of questions/complaints from Windows users trying
> to use the collation feature.  We've already seen at least one such
> from a beta tester.

Well, someone who wants it will have to do it.  It's pretty simple, but
not simple enough to code it blindly.  If someone wants to do it, I can
tell them exactly what to do.




Peter Eisentraut <peter_e@gmx.net> writes:
> On Tue, 2011-05-10 at 18:05 -0400, Tom Lane wrote:
>> The lack of initdb support for getting more-or-less-standard collation
>> entries into pg_collation on Windows seems to be the major missing
>> piece from here (dunno if Peter is aware of others).  If we don't fix
>> that before release, we're going to regret it anyway IMO, because of
>> the inevitable tide of questions/complaints from Windows users trying
>> to use the collation feature.  We've already seen at least one such
>> from a beta tester.

> Well, someone who wants it will have to do it.  It's pretty simple, but
> not simple enough to code it blindly.  If someone wants to do it, I can
> tell them exactly what to do.

Hm, do you know how to enumerate the available locales on Windows?
(Still not volunteering, since I couldn't test it, but that's the only
missing piece of information AFAIK.)
        regards, tom lane


On Wed, 2011-05-11 at 16:47 -0400, Tom Lane wrote:
> Hm, do you know how to enumerate the available locales on Windows?

EnumSystemLocalesEx()

Reference:
http://msdn.microsoft.com/en-us/library/dd317829(v=vs.85).aspx

Example: http://msdn.microsoft.com/en-us/library/dd319091(v=vs.85).aspx

As you can see in the example, this returns names like "en-US" and
"es-ES".  I would imagine we normalize this to the usual "en_US",
"es_ES" (but we could also install the not normalized names, just like
we install "en_US.utf8").

But you need to rearrange the code in initdb a bit because this thing
works with callbacks.
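
Untested, but I imagine the shape of it is roughly this (printing instead
of collecting, just to show the callback protocol):

#define _WIN32_WINNT 0x0600     /* Vista, needed for EnumSystemLocalesEx */
#include <windows.h>
#include <stdio.h>

static BOOL CALLBACK
print_locale(LPWSTR name, DWORD flags, LPARAM lparam)
{
    wprintf(L"%ls\n", name);    /* e.g. "en-US", "es-ES" */
    return TRUE;                /* TRUE = continue enumerating */
}

int
main(void)
{
    if (!EnumSystemLocalesEx(print_locale, LOCALE_WINDOWS, 0, NULL))
        fprintf(stderr, "EnumSystemLocalesEx failed: %lu\n",
                (unsigned long) GetLastError());
    return 0;
}

In initdb the callback would append each name to a list instead of
printing it.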

There is an older interface EnumSystemLocales() which returns locale
IDs, which you then have to look up and convert into a name manually.
There is code for that in the old installer CVS on pgfoundry.  But it's
very ugly, so I'd rather skip that and just concentrate on supporting
the newer interface.



Peter Eisentraut <peter_e@gmx.net> writes:
> On Wed, 2011-05-11 at 16:47 -0400, Tom Lane wrote:
>> Hm, do you know how to enumerate the available locales on Windows?

> EnumSystemLocalesEx()

> Reference:
> http://msdn.microsoft.com/en-us/library/dd317829(v=vs.85).aspx

> Example: http://msdn.microsoft.com/en-us/library/dd319091(v=vs.85).aspx

Doesn't look too bad ...

> There is an older interface EnumSystemLocales() which returns locale
> IDs, which you then have to look up and convert into a name manually.
> There is code for that in the old installer CVS on pgfoundry.  But it's
> very ugly, so I'd rather skip that and just concentrate on supporting
> the newer interface.

I guess the question is what happens on pre-Vista Windows if we use
EnumSystemLocalesEx.  I don't object to just not populating pg_collation
in that case, but we probably don't want it to fail entirely.
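
Presumably the usual trick applies: look the symbol up at runtime with
GetProcAddress, and just skip populating pg_collation when it's absent.
An untested sketch (the function name is mine, not committed code):

#define _WIN32_WINNT 0x0600
#include <windows.h>

typedef BOOL (WINAPI *enum_locales_fn) (LOCALE_ENUMPROCEX, DWORD,
                                        LPARAM, LPVOID);

static enum_locales_fn
lookup_enum_system_locales_ex(void)
{
    HMODULE     kernel32 = GetModuleHandle(TEXT("kernel32.dll"));

    if (kernel32 == NULL)
        return NULL;
    /* returns NULL on XP/2003, where the export doesn't exist */
    return (enum_locales_fn) GetProcAddress(kernel32,
                                            "EnumSystemLocalesEx");
}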
        regards, tom lane


Peter Eisentraut <peter_e@gmx.net> writes:
> On Wed, 2011-05-11 at 16:47 -0400, Tom Lane wrote:
>> Hm, do you know how to enumerate the available locales on Windows?

> EnumSystemLocalesEx()

> Reference:
> http://msdn.microsoft.com/en-us/library/dd317829(v=vs.85).aspx

> Example: http://msdn.microsoft.com/en-us/library/dd319091(v=vs.85).aspx

> As you can see in the example, this returns names like "en-US" and
> "es-ES".  I would imagine we normalize this to the usual "en_US",
> "es_ES" (but we could also install the not normalized names, just like
> we install "en_US.utf8").

I poked around in Microsoft's documentation a bit.  It's not entirely
clear to me that the locale names enumerated by this API match up
with the names accepted by setlocale() --- in particular, 
http://msdn.microsoft.com/en-us/library/hzz3tw78(v=VS.90).aspx
does not show any two-letter abbreviations for most of the languages.
That would need some testing, but it seems likely that what we'd have to
do is construct the longer-form locale name using info obtained from
GetLocaleInfoEx.  Also, the locale names from EnumSystemLocalesEx
definitely don't include any code page identifier.  We could probably
just enter the alias under UTF8, and again under the ANSI code page
number from GetLocaleInfoEx, if the locale has one.
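
Roughly this sort of thing, perhaps -- untested, and the LCTYPE choices
are guesses from the docs rather than anything I've verified:

#define _WIN32_WINNT 0x0600
#include <windows.h>
#include <stdio.h>

/* e.g. "de-DE" in, something like "German_Germany.1252" out */
static void
show_setlocale_style_name(LPCWSTR name)
{
    WCHAR       lang[128];
    WCHAR       country[128];
    DWORD       codepage = 0;

    GetLocaleInfoEx(name, LOCALE_SENGLANGUAGE, lang, 128);
    GetLocaleInfoEx(name, LOCALE_SENGCOUNTRY, country, 128);
    GetLocaleInfoEx(name, LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER,
                    (LPWSTR) &codepage, sizeof(codepage) / sizeof(WCHAR));

    /* codepage 0 means the locale has no ANSI code page (Unicode-only) */
    wprintf(L"%ls_%ls.%lu\n", lang, country, (unsigned long) codepage);
}

Whether setlocale() actually accepts a string built this way is exactly
the part that would need testing.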

> There is an older interface EnumSystemLocales() which returns locale
> IDs, which you then have to look up and convert into a name manually.
> There is code for that in the old installer CVS on pgfoundry.  But it's
> very ugly, so I'd rather skip that and just concentrate on supporting
> the newer interface.

Another thing I found out from the docs is that locale IDs aren't unique
in Vista and later: they have "supplemental" locales that have distinct
names but share the same ID as the primary.  It's entirely unclear how
that maps to setlocale names, but it does seem like we probably don't
want to use EnumSystemLocales.
        regards, tom lane