Thread: Solaris testers wanted for strxfrm() behavior

Solaris testers wanted for strxfrm() behavior

From
Noah Misch
Date:
convert_string_datum() says:

        /*
         * Note: originally we guessed at a suitable output buffer size, and
         * only needed to call strxfrm twice if our guess was too small.
         * However, it seems that some versions of Solaris have buggy strxfrm
         * that can write past the specified buffer length in that scenario.
         * So, do it the dumb way for portability.

That arrived in commit 59d9a37, and I think this is the background:
http://www.postgresql.org/message-id/flat/3224.1020394610@sss.pgh.pa.us

PostgreSQL 9.5 adds a strxfrm() call in bttext_abbrev_convert(), which does
not account for the Solaris bug.  I wish to determine whether that bug is
still relevant today.  If you have access to Solaris with the is_IS.ISO8859-1
locale installed (or root access to install it), please compile and run the
attached test program on each Solaris version you have available.  Reply here
with the program's output.  I especially need a report from Solaris 10, but
reports from older and newer versions are valuable.  Thanks.


Here is the output on OmniOS r151006, which does not have the bug:

SunOS ip-10-152-178-106.ec2.internal 5.11 omnios-b281e50 i86pc i386 i86xpv
locale "is_IS.ISO8859-1": strxfrm returned 212; last modified byte at 58 (0x0)
locale "is_IS.ISO8859-1": strxfrm returned 212; last modified byte at 58 (0x0)
locale "": strxfrm returned 264; last modified byte at 58 (0x0)

Attachment

Re: Solaris testers wanted for strxfrm() behavior

From
Peter Geoghegan
Date:
On Sat, Jun 27, 2015 at 9:51 AM, Noah Misch <noah@leadboat.com> wrote:
> PostgreSQL 9.5 adds a strxfrm() call in bttext_abbrev_convert(), which does
> not account for the Solaris bug.  I wish to determine whether that bug is
> still relevant today.  If you have access to Solaris with the is_IS.ISO8859-1
> locale installed (or root access to install it), please compile and run the
> attached test program on each Solaris version you have available.  Reply here
> with the program's output.  I especially need a report from Solaris 10, but
> reports from older and newer versions are valuable.  Thanks.

I did consider this.

Sorry, but I must question the point of worrying about an ancient bug
in Solaris. When you have to worry about a standard library function
blithely writing past the end of a buffer, when its C89 era interface
must be passed the size of said buffer, where does it end?

This is not a portability concern, like checking for an INT_MAX return
value from strxfrm() on Windows. The Solaris issue is patently a bug
that existed in some particular release of the Solaris C stdlib many
years ago. The documented behavior of strxfrm() in that library was
surely not "We ignore the bufsize argument".
-- 
Peter Geoghegan



Re: Solaris testers wanted for strxfrm() behavior

From
Oskari Saarenmaa
Date:
27.06.2015, 19:51, Noah Misch kirjoitti:
> convert_string_datum() says:
> 
>         /*
>          * Note: originally we guessed at a suitable output buffer size, and
>          * only needed to call strxfrm twice if our guess was too small.
>          * However, it seems that some versions of Solaris have buggy strxfrm
>          * that can write past the specified buffer length in that scenario.
>          * So, do it the dumb way for portability.
> 
> That arrived in commit 59d9a37, and I think this is the background:
> http://www.postgresql.org/message-id/flat/3224.1020394610@sss.pgh.pa.us
> 
> PostgreSQL 9.5 adds a strxfrm() call in bttext_abbrev_convert(), which does
> not account for the Solaris bug.  I wish to determine whether that bug is
> still relevant today.  If you have access to Solaris with the is_IS.ISO8859-1
> locale installed (or root access to install it), please compile and run the
> attached test program on each Solaris version you have available.  Reply here
> with the program's output.  I especially need a report from Solaris 10, but
> reports from older and newer versions are valuable.  Thanks.
> 
> 
> Here is the output on OmniOS r151006, which does not have the bug:
> 
> SunOS ip-10-152-178-106.ec2.internal 5.11 omnios-b281e50 i86pc i386 i86xpv
> locale "is_IS.ISO8859-1": strxfrm returned 212; last modified byte at 58 (0x0)
> locale "is_IS.ISO8859-1": strxfrm returned 212; last modified byte at 58 (0x0)
> locale "": strxfrm returned 264; last modified byte at 58 (0x0)

SunOS larry 5.10 Generic_147147-26 sun4u sparc SUNW,Sun-Fire-V215
locale "is_IS.ISO8859-1": strxfrm returned 216; last modified byte at 58
(0x0)
locale "is_IS.ISO8859-1": strxfrm returned 216; last modified byte at 58
(0x0)
locale "": strxfrm returned 26; last modified byte at 27 (0x0)

/ Oskari



Re: Solaris testers wanted for strxfrm() behavior

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Sat, Jun 27, 2015 at 9:51 AM, Noah Misch <noah@leadboat.com> wrote:
>> PostgreSQL 9.5 adds a strxfrm() call in bttext_abbrev_convert(), which does
>> not account for the Solaris bug.  I wish to determine whether that bug is
>> still relevant today.  If you have access to Solaris with the is_IS.ISO8859-1
>> locale installed (or root access to install it), please compile and run the
>> attached test program on each Solaris version you have available.  Reply here
>> with the program's output.  I especially need a report from Solaris 10, but
>> reports from older and newer versions are valuable.  Thanks.

> I did consider this.

> Sorry, but I must question the point of worrying about an ancient bug
> in Solaris.

I think the point of Noah's query is to find out whether "ancient" is an
accurate description.  If indeed nothing newer than Solaris 8 exhibits
the bug, I'd be okay with not working around it, but otherwise we have
some decisions to make.

Also, the post Noah dug up was merely the oldest description of the
problem.  I believe what prompted us to actually change the code was
some reports in July 2003:

http://www.postgresql.org/message-id/flat/56510AAEF435D240958D1CE8C6B1770A016D2DB9@mailc03.aurigin.com
http://www.postgresql.org/message-id/flat/56510AAEF435D240958D1CE8C6B1770A016D2DB0@mailc03.aurigin.com

(the archive threading here is pretty crappy, but you can poke around
among posts of similar date)

Those reports suggest that the problem was observable in many non-C
locales on Solaris 8, not only Icelandic.
        regards, tom lane



Re: Solaris testers wanted for strxfrm() behavior

From
Peter Geoghegan
Date:
On Sat, Jun 27, 2015 at 7:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think the point of Noah's query is to find out whether "ancient" is an
> accurate description.  If indeed nothing newer than Solaris 8 exhibits
> the bug, I'd be okay with not working around it, but otherwise we have
> some decisions to make.

Even Solaris 9 is EOL.

> Also, the post Noah dug up was merely the oldest description of the
> problem.  I believe what prompted us to actually change the code was
> some reports in July 2003:
>
> http://www.postgresql.org/message-id/flat/56510AAEF435D240958D1CE8C6B1770A016D2DB9@mailc03.aurigin.com

You said it yourself at the time -- why trust the strxfrm()
implementation when a NULL pointer is passed? It may have worked on
someone's machine in 2003, but that isn't a very good reason. It was
never a documented part of the interface that this fails or that
works; how could it be? This Solaris strxfrm() issue is (in the
simplest and least contentious sense) a bug. It is not a portability
issue. Someone made a mistake, and most likely the mistake was
corrected in the next point release. That is the only logical
explanation I can see.

I wouldn't say that adding defenses to workaround serious bugs in a C
stdlib is something we should never do, but ISTM that the standard for
undertaking that ought to be very high. BTW, unlike the author of
convert_string_datum(), I think it's okay that glibc sometimes returns
different values with strxfrm() calls on the same string (based on
whether or not the passed-in buffer is actually big enough to store
the transformed string) -- that is actually allowed by the standard, I
think. In other words, Solaris 8 has the only actually buggy strxfrm()
of the cases convert_string_datum() enumerates, since AFAICT the
standard doesn't promise that the strxfrm() return value need be
exact, only sufficient (this leeway seems useful to me).

-- 
Peter Geoghegan



Re: Solaris testers wanted for strxfrm() behavior

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Sat, Jun 27, 2015 at 7:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think the point of Noah's query is to find out whether "ancient" is an
>> accurate description.

> You said it yourself at the time -- why trust the strxfrm()
> implementation when a NULL pointer is passed? It may have worked on
> someone's machine in 2003, but that isn't a very good reason. It was
> never a documented part of the interface that this fails or that
> works; how could it be? This Solaris strxfrm() issue is (in the
> simplest and least contentious sense) a bug. It is not a portability
> issue. Someone made a mistake, and most likely the mistake was
> corrected in the next point release.

The point here is to *find out*, rather than assuming.  I agree that
Sun should have been embarrassed that such a bug ever made it into
a released libc, but it did.  The question is how long did it take
them to notice and fix it.  Assuming that it happened in "the next point
release", with absolutely no evidence to support that optimistic guess,
doesn't seem like a good idea to me.

(BTW, another way to settle the question might be to check the Solaris
release history.  Wonder if that's available anywhere.)
        regards, tom lane



Re: Solaris testers wanted for strxfrm() behavior

From
Peter Geoghegan
Date:
On Sun, Jun 28, 2015 at 8:31 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The point here is to *find out*, rather than assuming.  I agree that
> Sun should have been embarrassed that such a bug ever made it into
> a released libc, but it did.  The question is how long did it take
> them to notice and fix it.  Assuming that it happened in "the next point
> release", with absolutely no evidence to support that optimistic guess,
> doesn't seem like a good idea to me.
>
> (BTW, another way to settle the question might be to check the Solaris
> release history.  Wonder if that's available anywhere.)

I did attempt that before abbreviated keys went in, too. I didn't have
much luck finding the relevant information.

*My* point here is that this seems like taking unbounded
responsibility for a massive number of standard library bugs --
indisputable, garden-variety bugs. This bug was really likely to
result in a segfault back when Postgres lacked (but also arguably
needed) defenses against it, which was a period of about a year. The
reason that defenses were added was because there were several
complaints, but maybe those complaints were misdirected.

It might have been the right decision at the time to paper over the
problem, but only for a year or two. I'd only favor adding defenses if
it could be expected to take longer for the Solaris stdlib people to
ship a fix for their egregious bug than it would take for the next
Postgres point release. Why should that be true, though?

This just happened to be one of the more embarrassing, obvious stdlib
bugs that appeared in the last 15 years, and so we're talking about it
now, but there are plenty more, a good deal of which are far more
recent than this one. Where does it end? I don't see why this bug is
special just because 4 or 5 people complained about it on
pgsql-hackers over a decade ago.

-- 
Peter Geoghegan



Re: Solaris testers wanted for strxfrm() behavior

From
Josh Berkus
Date:
On 06/28/2015 12:29 PM, Peter Geoghegan wrote:
> It might have been the right decision at the time to paper over the
> problem, but only for a year or two. I'd only favor adding defenses if
> it could be expected to take longer for the Solaris stdlib people to
> ship a fix for their egregious bug than it would take for the next
> Postgres point release. Why should that be true, though?

BTW, I've asked Joyent to test on their platform.

My perspective is that if both SmartOS and OmniOS pass, it's not our
responsibility to support OldSolaris if they won't update libraries.

However, OpenSCG and EDB support more OldSolaris customers than anyone,
AFAIK, so I'd like to her some opinions from them ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Solaris testers wanted for strxfrm() behavior

From
Thomas Munro
Date:
On Mon, Jun 29, 2015 at 7:58 AM, Josh Berkus <josh@agliodbs.com> wrote:
> My perspective is that if both SmartOS and OmniOS pass, it's not our
> responsibility to support OldSolaris if they won't update libraries.

Just by the way, I wonder if this was that bug:

https://illumos.org/issues/1594

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Solaris testers wanted for strxfrm() behavior

From
Peter Geoghegan
Date:
On Sun, Jun 28, 2015 at 12:58 PM, Josh Berkus <josh@agliodbs.com> wrote:
> My perspective is that if both SmartOS and OmniOS pass, it's not our
> responsibility to support OldSolaris if they won't update libraries.

Obviously I especially don't want to double the number of strxfrm()
calls made during text abbreviation for *everyone* just to work around
this silly bug. Those calls will generally be a large fraction of the
cost of any text sort in 9.5, so clearly that would be unacceptable.

Maybe Noah should commit a patch that makes the initial size of the
buffer that stores the transformed string blob very small. This can be
reverted once it has some buildfarm cycles. That is a bit of a scatter
gun approach, but maybe that's inevitable given the paucity of
information around the issue.
-- 
Peter Geoghegan



Re: Solaris testers wanted for strxfrm() behavior

From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> Just by the way, I wonder if this was that bug:
> https://illumos.org/issues/1594

Oooh.  Might or might not be *same* bug, but it sure looks like it could
have the right symptom.  If this is indeed inherited from old Solaris,
I'm afraid we are totally fooling ourselves if we guess that it's no
longer present in the wild.
        regards, tom lane



Re: Solaris testers wanted for strxfrm() behavior

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Sun, Jun 28, 2015 at 12:58 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> My perspective is that if both SmartOS and OmniOS pass, it's not our
>> responsibility to support OldSolaris if they won't update libraries.

> Obviously I especially don't want to double the number of strxfrm()
> calls made during text abbreviation for *everyone* just to work around
> this silly bug. Those calls will generally be a large fraction of the
> cost of any text sort in 9.5, so clearly that would be unacceptable.

I agree, but ...

> Maybe Noah should commit a patch that makes the initial size of the
> buffer that stores the transformed string blob very small. This can be
> reverted once it has some buildfarm cycles. That is a bit of a scatter
> gun approach, but maybe that's inevitable given the paucity of
> information around the issue.

Another idea would be to make a test during postmaster start to see
if this bug exists, and fail if so.  I'm generally on board with the
thought that we don't need to work on systems with such a bad bug,
but it would be a good thing if the failure was clean and produced
a helpful error message, rather than looking like a Postgres bug.
        regards, tom lane



Re: Solaris testers wanted for strxfrm() behavior

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> It might have been the right decision at the time to paper over the
> problem, but only for a year or two. I'd only favor adding defenses if
> it could be expected to take longer for the Solaris stdlib people to
> ship a fix for their egregious bug than it would take for the next
> Postgres point release. Why should that be true, though?

> This just happened to be one of the more embarrassing, obvious stdlib
> bugs that appeared in the last 15 years, and so we're talking about it
> now, but there are plenty more, a good deal of which are far more
> recent than this one. Where does it end? I don't see why this bug is
> special just because 4 or 5 people complained about it on
> pgsql-hackers over a decade ago.

Peter, your arguments are beginning to sound like desperate excuse-making.

The reason that bug is "special" is that it looks like a crash in
Postgres, one that people have complained of because they didn't see it
in other programs, which is not totally surprising because it requires
a somewhat unusual usage of strxfrm().  I think the dumb two-call
implementation exhibited in convert_string_datum() is mainstream usage,
which would explain why Sun hadn't noticed the bug ages ago.

It might be all right to refuse to support platforms that have this
bug, but I think we need to determine with some more clarity just
how widespread the bug still is, and in any case we should at least
install some defense that would allow us to report that libc is
buggy.  Otherwise we'll be back to fielding bug reports that trace
to this, which is no fun for anyone and does not make us look good.
        regards, tom lane



Re: Solaris testers wanted for strxfrm() behavior

From
Robert Haas
Date:
On Sun, Jun 28, 2015 at 7:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@heroku.com> writes:
>> It might have been the right decision at the time to paper over the
>> problem, but only for a year or two. I'd only favor adding defenses if
>> it could be expected to take longer for the Solaris stdlib people to
>> ship a fix for their egregious bug than it would take for the next
>> Postgres point release. Why should that be true, though?
>
>> This just happened to be one of the more embarrassing, obvious stdlib
>> bugs that appeared in the last 15 years, and so we're talking about it
>> now, but there are plenty more, a good deal of which are far more
>> recent than this one. Where does it end? I don't see why this bug is
>> special just because 4 or 5 people complained about it on
>> pgsql-hackers over a decade ago.
>
> Peter, your arguments are beginning to sound like desperate excuse-making.
>
> The reason that bug is "special" is that it looks like a crash in
> Postgres, one that people have complained of because they didn't see it
> in other programs, which is not totally surprising because it requires
> a somewhat unusual usage of strxfrm().  I think the dumb two-call
> implementation exhibited in convert_string_datum() is mainstream usage,
> which would explain why Sun hadn't noticed the bug ages ago.
>
> It might be all right to refuse to support platforms that have this
> bug, but I think we need to determine with some more clarity just
> how widespread the bug still is, and in any case we should at least
> install some defense that would allow us to report that libc is
> buggy.  Otherwise we'll be back to fielding bug reports that trace
> to this, which is no fun for anyone and does not make us look good.

I completely agree.  Noah is quite right to try to find out whether
this is still an issue, and I'm glad he's doing it, and I think it's
very unfortunate that Peter is trying to discourage that research.  If
in fact the bug does not still exist in the wild, then Noah's research
will demonstrate that we have no problem, and we do not need to do
anything.  If it does, then we can decide what to do about that.  But
I have come to value Noah's diligent attitude towards hunting down
problems in our code base, and I hope Peter (and everyone else) will
appreciate that attitude as well - or at the very least, stay out of
the way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Solaris testers wanted for strxfrm() behavior

From
Peter Geoghegan
Date:
On Sun, Jun 28, 2015 at 4:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The reason that bug is "special" is that it looks like a crash in
> Postgres, one that people have complained of because they didn't see it
> in other programs, which is not totally surprising because it requires
> a somewhat unusual usage of strxfrm().  I think the dumb two-call
> implementation exhibited in convert_string_datum() is mainstream usage,
> which would explain why Sun hadn't noticed the bug ages ago.

It hardly matters much, but I don't think that it is. I think the
issue is entirely explained by sloppy code in the Solaris 8 stdlib.

Anyone using strxfrm() is probably doing so for exactly the same
reason as bttext_abbrev_convert() -- making sorting text in a
collation-aware fashion faster. That's what strxfrm() is more or less
advertised for by the glibc docs, for example, and in fact the
strxfrm() glibc example involves resizing a buffer to fit [1], to
avoid doubling the number of strxfrm() calls.

For that reason, I think that convert_string_datum() is much further
from mainstream usage than bttext_abbrev_convert().

> It might be all right to refuse to support platforms that have this
> bug, but I think we need to determine with some more clarity just
> how widespread the bug still is, and in any case we should at least
> install some defense that would allow us to report that libc is
> buggy.  Otherwise we'll be back to fielding bug reports that trace
> to this, which is no fun for anyone and does not make us look good.

I can definitely get behind having a defense at startup. That seems
like cheap insurance. It probably isn't too much work to adopt Noah's
test program.

[1] http://www.gnu.org/software/libc/manual/html_node/Collation-Functions.html
-- 
Peter Geoghegan



Re: Solaris testers wanted for strxfrm() behavior

From
Thomas Munro
Date:
On Mon, Jun 29, 2015 at 10:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> Just by the way, I wonder if this was that bug:
>> https://illumos.org/issues/1594
>
> Oooh.  Might or might not be *same* bug, but it sure looks like it could
> have the right symptom.  If this is indeed inherited from old Solaris,
> I'm afraid we are totally fooling ourselves if we guess that it's no
> longer present in the wild.

Also, here is an interesting patch that went into the Apache C++
standard library.  Maybe the problem was limited to amd64 system...

https://github.com/illumos/illumos-userland/blob/master/components/stdcxx/patches/047-collate.cpp.patch

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Solaris testers wanted for strxfrm() behavior

From
Peter Geoghegan
Date:
On Sun, Jun 28, 2015 at 4:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I completely agree.  Noah is quite right to try to find out whether
> this is still an issue, and I'm glad he's doing it, and I think it's
> very unfortunate that Peter is trying to discourage that research.

Far from it. I am providing constructive feedback.

> If in fact the bug does not still exist in the wild, then Noah's research
> will demonstrate that we have no problem, and we do not need to do
> anything.  If it does, then we can decide what to do about that.  But
> I have come to value Noah's diligent attitude towards hunting down
> problems in our code base, and I hope Peter (and everyone else) will
> appreciate that attitude as well - or at the very least, stay out of
> the way.

I hope that it goes without saying that I greatly appreciate Noah's
efforts in tracking down this kind of thing. I am not standing in
anyone's way. My intent is to save Noah some work. I imagined that it
might not be clear to him how completely unreasonable it is for a
strxfrm() implementation to have this issue -- it is totally
unreasonable. It is not a portability problem. Clearly Tom agrees with
that view, since he stated that he thinks it's okay to not support a
platform with an abjectly broke strxfrm().

Now we're talking about failing in a sane way, rather than trying to
add band aids, and now the exact extent of the problem seems less
important.
-- 
Peter Geoghegan



Re: Solaris testers wanted for strxfrm() behavior

From
Peter Geoghegan
Date:
On Sun, Jun 28, 2015 at 4:35 PM, Peter Geoghegan <pg@heroku.com> wrote:
> It hardly matters much, but I don't think that it is. I think the
> issue is entirely explained by sloppy code in the Solaris 8 stdlib.

I don't imagine that it will come as a surprise to anybody, but the
manpage [1] for strxfrm() covering old Solaris libc versions
(predating even version 8) states:

"No more than n bytes are placed into the resulting array pointed to
by s1, including the terminating null byte" ('n' is the bufsize
argument passed by the caller).

This is proof positive (though it is hardly needed) that Solaris was
never supposed to allow this. I also noticed that the manpage had a
typo:

"Because no return value is reserved to indicate an error, an
application wishing to check for error situations should set errno to
0, then call strcoll(3C), then check errno and if it is non-zero,
assume an error has occurred."

Obviously this is a copy-paste leftover from the strcoll() manpage.
Pretty slipshod for a C standard library implementation, I would say.

[1] http://docs.oracle.com/cd/E19455-01/806-0627/6j9vhfn88/index.html#indexterm-1090
-- 
Peter Geoghegan



Re: Solaris testers wanted for strxfrm() behavior

From
Josh Berkus
Date:
All,

Joyent confirms that the bug is fixed on SmartOS:

This is what I see:
 SunOS pkgsrc-pbulk-2014Q4-1.local 5.11 joyent_20141030T081701Z i86pc
i386 i86pc locale "is_IS.ISO8859-1": strxfrm returned 212; last modified byte at
58 (0x0) locale "is_IS.ISO8859-1": strxfrm returned 212; last modified byte at
58 (0x0) locale "": strxfrm returned 26; last modified byte at 27 (0x0)

Cheers,


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Solaris testers wanted for strxfrm() behavior

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Joyent confirms that the bug is fixed on SmartOS:

The more interesting bit of information would be *when* it was fixed.
        regards, tom lane



Re: Solaris testers wanted for strxfrm() behavior

From
Josh Berkus
Date:
On 06/29/2015 02:08 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> Joyent confirms that the bug is fixed on SmartOS:
> 
> The more interesting bit of information would be *when* it was fixed.

Answer: "not certain, but fixed at least 2 years ago".


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Solaris testers wanted for strxfrm() behavior

From
Robert Haas
Date:
On Mon, Jun 29, 2015 at 6:07 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 06/29/2015 02:08 PM, Tom Lane wrote:
>> Josh Berkus <josh@agliodbs.com> writes:
>>> Joyent confirms that the bug is fixed on SmartOS:
>>
>> The more interesting bit of information would be *when* it was fixed.
>
> Answer: "not certain, but fixed at least 2 years ago".

2 years is definitely not long enough to assume that no unpatched
machines exist in the wild.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Solaris testers wanted for strxfrm() behavior

From
Noah Misch
Date:
On Mon, Jun 29, 2015 at 11:52:26AM +1200, Thomas Munro wrote:
> On Mon, Jun 29, 2015 at 10:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Thomas Munro <thomas.munro@enterprisedb.com> writes:
> >> Just by the way, I wonder if this was that bug:
> >> https://illumos.org/issues/1594
> >
> > Oooh.  Might or might not be *same* bug, but it sure looks like it could
> > have the right symptom.  If this is indeed inherited from old Solaris,
> > I'm afraid we are totally fooling ourselves if we guess that it's no
> > longer present in the wild.

Very interesting.  Looks like the illumos strxfrm() came from FreeBSD, not
from Solaris; illumos introduced their bug independently:

https://illumos.org/issues/2
https://github.com/illumos/illumos-gate/commits/master/usr/src/lib/libc/port/locale/collate.c

> Also, here is an interesting patch that went into the Apache C++
> standard library.  Maybe the problem was limited to amd64 system...
> 
> https://github.com/illumos/illumos-userland/blob/master/components/stdcxx/patches/047-collate.cpp.patch

That's a useful data point.  Based on Oskari Saarenmaa's report, newer Solaris
10 is not affected.  The fix presumably showed up after the 05/08 release and
no later than the 01/13 release.

On Sun, Jun 28, 2015 at 07:00:14PM -0400, Tom Lane wrote:
> > On Sun, Jun 28, 2015 at 12:58 PM, Josh Berkus <josh@agliodbs.com> wrote:
> >> My perspective is that if both SmartOS and OmniOS pass, it's not our
> >> responsibility to support OldSolaris if they won't update libraries.

> Another idea would be to make a test during postmaster start to see
> if this bug exists, and fail if so.  I'm generally on board with the
> thought that we don't need to work on systems with such a bad bug,
> but it would be a good thing if the failure was clean and produced
> a helpful error message, rather than looking like a Postgres bug.

Failing cleanly on unpatched Solaris is adequate, agreed.  A check at
postmaster start isn't enough, because the postmaster might run in the C
locale while individual databases or collations use problem locales.  The
safest thing is to test after every setlocale(LC_COLLATE) and
newlocale(LC_COLLATE).  That's once at backend start and once per backend per
collation used, more frequent than I would like.  Hmm.



Re: Solaris testers wanted for strxfrm() behavior

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Sun, Jun 28, 2015 at 07:00:14PM -0400, Tom Lane wrote:
>> Another idea would be to make a test during postmaster start to see
>> if this bug exists, and fail if so.  I'm generally on board with the
>> thought that we don't need to work on systems with such a bad bug,
>> but it would be a good thing if the failure was clean and produced
>> a helpful error message, rather than looking like a Postgres bug.

> Failing cleanly on unpatched Solaris is adequate, agreed.  A check at
> postmaster start isn't enough, because the postmaster might run in the C
> locale while individual databases or collations use problem locales.  The
> safest thing is to test after every setlocale(LC_COLLATE) and
> newlocale(LC_COLLATE).  That's once at backend start and once per backend per
> collation used, more frequent than I would like.  Hmm.

I was thinking more along the lines of making a single test by momentarily
switching into a known-buggy locale.

However, your results here imply that there are or were more than one bug
with this symptom, which may make a catchall test impossible.
        regards, tom lane



Re: Solaris testers wanted for strxfrm() behavior

From
Josh Berkus
Date:
On 06/29/2015 07:53 PM, Robert Haas wrote:
> On Mon, Jun 29, 2015 at 6:07 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> On 06/29/2015 02:08 PM, Tom Lane wrote:
>>> Josh Berkus <josh@agliodbs.com> writes:
>>>> Joyent confirms that the bug is fixed on SmartOS:
>>>
>>> The more interesting bit of information would be *when* it was fixed.
>>
>> Answer: "not certain, but fixed at least 2 years ago".
> 
> 2 years is definitely not long enough to assume that no unpatched
> machines exist in the wild.  :-(

Well, AFAIK, 90% of SmartOS users are on the Joyent Cloud and get
upgraded automatically.  It's not much used as a portable OS.  So for
them, it is enough.

Given that Theo's patch against Illumos went in in 2012, I think we can
take that as our cutoff date for Illumos.

The question is: how many folks out there are running PostgreSQL on
Solaris 10?  And are they at all likely to upgrade to PostgreSQL 9.5?

Unfortunately, these questions are coming up right when Bjorn is on
vacation ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Solaris testers wanted for strxfrm() behavior

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> The question is: how many folks out there are running PostgreSQL on
> Solaris 10?  And are they at all likely to upgrade to PostgreSQL 9.5?

That's only the pertinent question if the bug exists on Solaris 10,
which I don't think we know do we?  Oskari Saarenmaa's results upthread
seemed to indicate not.
        regards, tom lane



Re: Solaris testers wanted for strxfrm() behavior

From
Noah Misch
Date:
On Tue, Jun 30, 2015 at 09:45:08AM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Sun, Jun 28, 2015 at 07:00:14PM -0400, Tom Lane wrote:
> >> Another idea would be to make a test during postmaster start to see
> >> if this bug exists, and fail if so.  I'm generally on board with the
> >> thought that we don't need to work on systems with such a bad bug,
> >> but it would be a good thing if the failure was clean and produced
> >> a helpful error message, rather than looking like a Postgres bug.
> 
> > Failing cleanly on unpatched Solaris is adequate, agreed.  A check at
> > postmaster start isn't enough, because the postmaster might run in the C
> > locale while individual databases or collations use problem locales.  The
> > safest thing is to test after every setlocale(LC_COLLATE) and
> > newlocale(LC_COLLATE).  That's once at backend start and once per backend per
> > collation used, more frequent than I would like.  Hmm.
> 
> I was thinking more along the lines of making a single test by momentarily
> switching into a known-buggy locale.

It seems like every Solaris system I meet has a different set of installed
locales.  Any one I might pick could sometimes be unavailable when other buggy
locales are available.

On this Solaris 10 05/08 (10u5) system, the buggy and non-buggy locales have
no clear pattern.  While fr_FR.UTF-8 is fine, fr_LU.UTF-8 is buggy.



Re: Solaris testers wanted for strxfrm() behavior

From
Noah Misch
Date:
On Wed, Jul 01, 2015 at 03:22:33AM -0400, Noah Misch wrote:
> On Tue, Jun 30, 2015 at 09:45:08AM -0400, Tom Lane wrote:
> > Noah Misch <noah@leadboat.com> writes:
> > > On Sun, Jun 28, 2015 at 07:00:14PM -0400, Tom Lane wrote:
> > >> Another idea would be to make a test during postmaster start to see
> > >> if this bug exists, and fail if so.  I'm generally on board with the
> > >> thought that we don't need to work on systems with such a bad bug,
> > >> but it would be a good thing if the failure was clean and produced
> > >> a helpful error message, rather than looking like a Postgres bug.
> >
> > > Failing cleanly on unpatched Solaris is adequate, agreed.  A check at
> > > postmaster start isn't enough, because the postmaster might run in the C
> > > locale while individual databases or collations use problem locales.  The
> > > safest thing is to test after every setlocale(LC_COLLATE) and
> > > newlocale(LC_COLLATE).  That's once at backend start and once per backend per
> > > collation used, more frequent than I would like.  Hmm.

Solaris does not have locale_t or strxfrm_l(), so per-collation checks aren't
relevant after all.  Checking at postmaster start and backend start seems
cheap enough, hence this patch.

Attachment

Re: Solaris testers wanted for strxfrm() behavior

From
Noah Misch
Date:
On Sat, Jun 27, 2015 at 11:57:30AM -0700, Peter Geoghegan wrote:
> On Sat, Jun 27, 2015 at 9:51 AM, Noah Misch <noah@leadboat.com> wrote:
> > PostgreSQL 9.5 adds a strxfrm() call in bttext_abbrev_convert(), which does
> > not account for the Solaris bug.  I wish to determine whether that bug is
> > still relevant today.  If you have access to Solaris with the is_IS.ISO8859-1
> > locale installed (or root access to install it), please compile and run the
> > attached test program on each Solaris version you have available.  Reply here
> > with the program's output.  I especially need a report from Solaris 10, but
> > reports from older and newer versions are valuable.  Thanks.
> 
> I did consider this.
> 
> Sorry, but I must question the point of worrying about an ancient bug
> in Solaris.

One function had a comment explaining its workaround for an OS bug, while
another function ignored the same bug.  That is always a defect in the
comments at least; our code shall tell a uniform story about its API
assumptions.  I started this thread estimating that it would end with me
merely deleting the comment.  Thomas Munro and Tom Lane located evidence I
hadn't found, evidence that changed the conclusion.

> When you have to worry about a standard library function
> blithely writing past the end of a buffer, when its C89 era interface
> must be passed the size of said buffer, where does it end?

Don't worry about the possibility of such basic bugs until someone reports
one.  Once you have such a report, though, assume the interface behaves as
last reported until you receive new evidence.  We decide whether to work
around such bugs based on factors like prevalence of affected systems,
simplicity of the workaround, and ease of field diagnosis in the absence of
the workaround.  Non-factors include whether the bug is egregious, is contrary
to documentation, or is contrary to a pertinent third-party specification.
Those sources speak to which of the library provider or PostgreSQL was wrong,
but they play little or no role in dictating the workarounds to deploy.

I hope that clarifies things.

nm



Re: Solaris testers wanted for strxfrm() behavior

From
Peter Geoghegan
Date:
On Wed, Jul 8, 2015 at 10:18 PM, Noah Misch <noah@leadboat.com> wrote:
> One function had a comment explaining its workaround for an OS bug, while
> another function ignored the same bug.  That is always a defect in the
> comments at least; our code shall tell a uniform story about its API
> assumptions.  I started this thread estimating that it would end with me
> merely deleting the comment.  Thomas Munro and Tom Lane located evidence I
> hadn't found, evidence that changed the conclusion.

That seems very reasonable. I noticed that you removed the glibc
strxfrm() comment (or at least the questioning of its behavior), which
was a good decision.

>> When you have to worry about a standard library function
>> blithely writing past the end of a buffer, when its C89 era interface
>> must be passed the size of said buffer, where does it end?
>
> Don't worry about the possibility of such basic bugs until someone reports
> one.  Once you have such a report, though, assume the interface behaves as
> last reported until you receive new evidence.  We decide whether to work
> around such bugs based on factors like prevalence of affected systems,
> simplicity of the workaround, and ease of field diagnosis in the absence of
> the workaround.

I must admit that I was rather surprised that more or less the same
blitheness about writing past the end of a buffer occurred a second
time in an apparently independent standard library implementation. I
think that illustrates your point well.

Thanks
-- 
Peter Geoghegan



Re: Solaris testers wanted for strxfrm() behavior

From
Bjorn Munch
Date:
On 27/06 12.51, Noah Misch wrote:
> 
> PostgreSQL 9.5 adds a strxfrm() call in bttext_abbrev_convert(), which does
> not account for the Solaris bug.  I wish to determine whether that bug is
> still relevant today.  If you have access to Solaris with the is_IS.ISO8859-1
> locale installed (or root access to install it), please compile and run the
> attached test program on each Solaris version you have available.  Reply here
> with the program's output.  I especially need a report from Solaris 10, but
> reports from older and newer versions are valuable.  Thanks.

I ran this program on Solaris 9 U5 (September 2006) on Sparc and got:

---
SunOS tor10-z1 5.9 Generic_Virtual sun4v sparc SUNW,SPARC-Enterprise-T5120
locale "is_IS.ISO8859-1": strxfrm returned 162; last modified byte at 106 (0xffd2)
locale "is_IS.ISO8859-1": strxfrm returned 162; last modified byte at 106 (0xffd2)
locale "": strxfrm returned 26; last modified byte at 58 (0x0)
---

NB this runs in a Container on a Solaris 10 host. On our oldest
Solaris 10 sparc box, running Solaris 10 U6 (October 2008) I get:

---
SunOS tor07 5.10 Generic_137137-09 sun4v sparc SUNW,SPARC-Enterprise-T5120
locale "is_IS.ISO8859-1": strxfrm returned 216; last modified byte at 58 (0x0)
locale "is_IS.ISO8859-1": strxfrm returned 216; last modified byte at 58 (0x0)
locale "": strxfrm returned 26; last modified byte at 27 (0x0)
---

On x86: Solaris 10 U5 (May 2008) I get:

---
SunOS loki08 5.10 Generic_127128-11 i86pc i386 i86pc
locale "is_IS.ISO8859-1": strxfrm returned 216; last modified byte at 58 (0x0)
locale "is_IS.ISO8859-1": strxfrm returned 216; last modified byte at 58 (0x0)
locale "": strxfrm returned 26; last modified byte at 27 (0x0)
---

- Bjorn



Re: Solaris testers wanted for strxfrm() behavior

From
Noah Misch
Date:
On Tue, Jul 21, 2015 at 04:45:21PM +0200, Bjorn Munch wrote:
> On 27/06 12.51, Noah Misch wrote:
> > 
> > PostgreSQL 9.5 adds a strxfrm() call in bttext_abbrev_convert(), which does
> > not account for the Solaris bug.  I wish to determine whether that bug is
> > still relevant today.  If you have access to Solaris with the is_IS.ISO8859-1
> > locale installed (or root access to install it), please compile and run the
> > attached test program on each Solaris version you have available.  Reply here
> > with the program's output.  I especially need a report from Solaris 10, but
> > reports from older and newer versions are valuable.  Thanks.
> 
> I ran this program on Solaris 9 U5 (September 2006) on Sparc and got:

I appreciate your testing.  A few sources give December 2003 as the month for
Solaris 9 Update 5; would you verify the vintage you used?



Re: Solaris testers wanted for strxfrm() behavior

From
Bjorn Munch
Date:
On 22/07 02.29, Noah Misch wrote:
> > I ran this program on Solaris 9 U5 (September 2006) on Sparc and got:
> 
> I appreciate your testing.  A few sources give December 2003 as the month for
> Solaris 9 Update 5; would you verify the vintage you used?

Sorry I was mis-parsing the /etc/release. 9/05 is the month. I should
know that. :-/ This was Solaris 9 U9 from September 2005.

From the output it looks like the bug was present here?

I did a quick search for bugs in libc with strxfrm in the title and
got a few hits but none that seemed to be this one.

- Bjorn



Re: Solaris testers wanted for strxfrm() behavior

From
Noah Misch
Date:
On Wed, Jul 22, 2015 at 08:40:05AM +0200, Bjorn Munch wrote:
> On 22/07 02.29, Noah Misch wrote:
> > > I ran this program on Solaris 9 U5 (September 2006) on Sparc and got:
> > 
> > I appreciate your testing.  A few sources give December 2003 as the month for
> > Solaris 9 Update 5; would you verify the vintage you used?
> 
> Sorry I was mis-parsing the /etc/release. 9/05 is the month. I should
> know that. :-/ This was Solaris 9 U9 from September 2005.

Does it have recent Solaris 9 updates, or is it closer to a base install of
the 9/05 packages?  Apparently, Oracle issued a final batch of Solaris 9
updates in February 2015.  I would find it useful to know the age of the
installed package containing /lib/libc.so.1.

> From the output it looks like the bug was present here?

Yes.

> I did a quick search for bugs in libc with strxfrm in the title and
> got a few hits but none that seemed to be this one.

I was curious about that.  Thanks for looking.



Re: Solaris testers wanted for strxfrm() behavior

From
Josh Berkus
Date:
Noah, All:

Where are we with this?  Do we feel confident that this bug is only on
old versions of Solaris we don't care about?  Or does it remain to be
resolved?


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Solaris testers wanted for strxfrm() behavior

From
Noah Misch
Date:
On Fri, Jul 31, 2015 at 08:29:51PM -0700, Josh Berkus wrote:
> Where are we with this?  Do we feel confident that this bug is only on
> old versions of Solaris we don't care about?  Or does it remain to be
> resolved?

Affected systems either have an available vendor update addressing the problem
or are so old that we don't care.  Commit be8b06c makes affected systems fail
early, which resolves the matter.