Thread: Solaris testers wanted for strxfrm() behavior
convert_string_datum() says: /* * Note: originally we guessed at a suitable output buffer size, and * only needed to call strxfrm twice if our guess was too small. * However, it seems that some versions of Solaris have buggy strxfrm * that can write past the specified buffer length in that scenario. * So, do it the dumb way for portability. That arrived in commit 59d9a37, and I think this is the background: http://www.postgresql.org/message-id/flat/3224.1020394610@sss.pgh.pa.us PostgreSQL 9.5 adds a strxfrm() call in bttext_abbrev_convert(), which does not account for the Solaris bug. I wish to determine whether that bug is still relevant today. If you have access to Solaris with the is_IS.ISO8859-1 locale installed (or root access to install it), please compile and run the attached test program on each Solaris version you have available. Reply here with the program's output. I especially need a report from Solaris 10, but reports from older and newer versions are valuable. Thanks. Here is the output on OmniOS r151006, which does not have the bug: SunOS ip-10-152-178-106.ec2.internal 5.11 omnios-b281e50 i86pc i386 i86xpv locale "is_IS.ISO8859-1": strxfrm returned 212; last modified byte at 58 (0x0) locale "is_IS.ISO8859-1": strxfrm returned 212; last modified byte at 58 (0x0) locale "": strxfrm returned 264; last modified byte at 58 (0x0)
Attachment
On Sat, Jun 27, 2015 at 9:51 AM, Noah Misch <noah@leadboat.com> wrote: > PostgreSQL 9.5 adds a strxfrm() call in bttext_abbrev_convert(), which does > not account for the Solaris bug. I wish to determine whether that bug is > still relevant today. If you have access to Solaris with the is_IS.ISO8859-1 > locale installed (or root access to install it), please compile and run the > attached test program on each Solaris version you have available. Reply here > with the program's output. I especially need a report from Solaris 10, but > reports from older and newer versions are valuable. Thanks. I did consider this. Sorry, but I must question the point of worrying about an ancient bug in Solaris. When you have to worry about a standard library function blithely writing past the end of a buffer, when its C89 era interface must be passed the size of said buffer, where does it end? This is not a portability concern, like checking for an INT_MAX return value from strxfrm() on Windows. The Solaris issue is patently a bug that existed in some particular release of the Solaris C stdlib many years ago. The documented behavior of strxfrm() in that library was surely not "We ignore the bufsize argument". -- Peter Geoghegan
27.06.2015, 19:51, Noah Misch kirjoitti: > convert_string_datum() says: > > /* > * Note: originally we guessed at a suitable output buffer size, and > * only needed to call strxfrm twice if our guess was too small. > * However, it seems that some versions of Solaris have buggy strxfrm > * that can write past the specified buffer length in that scenario. > * So, do it the dumb way for portability. > > That arrived in commit 59d9a37, and I think this is the background: > http://www.postgresql.org/message-id/flat/3224.1020394610@sss.pgh.pa.us > > PostgreSQL 9.5 adds a strxfrm() call in bttext_abbrev_convert(), which does > not account for the Solaris bug. I wish to determine whether that bug is > still relevant today. If you have access to Solaris with the is_IS.ISO8859-1 > locale installed (or root access to install it), please compile and run the > attached test program on each Solaris version you have available. Reply here > with the program's output. I especially need a report from Solaris 10, but > reports from older and newer versions are valuable. Thanks. > > > Here is the output on OmniOS r151006, which does not have the bug: > > SunOS ip-10-152-178-106.ec2.internal 5.11 omnios-b281e50 i86pc i386 i86xpv > locale "is_IS.ISO8859-1": strxfrm returned 212; last modified byte at 58 (0x0) > locale "is_IS.ISO8859-1": strxfrm returned 212; last modified byte at 58 (0x0) > locale "": strxfrm returned 264; last modified byte at 58 (0x0) SunOS larry 5.10 Generic_147147-26 sun4u sparc SUNW,Sun-Fire-V215 locale "is_IS.ISO8859-1": strxfrm returned 216; last modified byte at 58 (0x0) locale "is_IS.ISO8859-1": strxfrm returned 216; last modified byte at 58 (0x0) locale "": strxfrm returned 26; last modified byte at 27 (0x0) / Oskari
Peter Geoghegan <pg@heroku.com> writes: > On Sat, Jun 27, 2015 at 9:51 AM, Noah Misch <noah@leadboat.com> wrote: >> PostgreSQL 9.5 adds a strxfrm() call in bttext_abbrev_convert(), which does >> not account for the Solaris bug. I wish to determine whether that bug is >> still relevant today. If you have access to Solaris with the is_IS.ISO8859-1 >> locale installed (or root access to install it), please compile and run the >> attached test program on each Solaris version you have available. Reply here >> with the program's output. I especially need a report from Solaris 10, but >> reports from older and newer versions are valuable. Thanks. > I did consider this. > Sorry, but I must question the point of worrying about an ancient bug > in Solaris. I think the point of Noah's query is to find out whether "ancient" is an accurate description. If indeed nothing newer than Solaris 8 exhibits the bug, I'd be okay with not working around it, but otherwise we have some decisions to make. Also, the post Noah dug up was merely the oldest description of the problem. I believe what prompted us to actually change the code was some reports in July 2003: http://www.postgresql.org/message-id/flat/56510AAEF435D240958D1CE8C6B1770A016D2DB9@mailc03.aurigin.com http://www.postgresql.org/message-id/flat/56510AAEF435D240958D1CE8C6B1770A016D2DB0@mailc03.aurigin.com (the archive threading here is pretty crappy, but you can poke around among posts of similar date) Those reports suggest that the problem was observable in many non-C locales on Solaris 8, not only Icelandic. regards, tom lane
On Sat, Jun 27, 2015 at 7:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think the point of Noah's query is to find out whether "ancient" is an > accurate description. If indeed nothing newer than Solaris 8 exhibits > the bug, I'd be okay with not working around it, but otherwise we have > some decisions to make. Even Solaris 9 is EOL. > Also, the post Noah dug up was merely the oldest description of the > problem. I believe what prompted us to actually change the code was > some reports in July 2003: > > http://www.postgresql.org/message-id/flat/56510AAEF435D240958D1CE8C6B1770A016D2DB9@mailc03.aurigin.com You said it yourself at the time -- why trust the strxfrm() implementation when a NULL pointer is passed? It may have worked on someone's machine in 2003, but that isn't a very good reason. It was never a documented part of the interface that this fails or that works; how could it be? This Solaris strxfrm() issue is (in the simplest and least contentious sense) a bug. It is not a portability issue. Someone made a mistake, and most likely the mistake was corrected in the next point release. That is the only logical explanation I can see. I wouldn't say that adding defenses to workaround serious bugs in a C stdlib is something we should never do, but ISTM that the standard for undertaking that ought to be very high. BTW, unlike the author of convert_string_datum(), I think it's okay that glibc sometimes returns different values with strxfrm() calls on the same string (based on whether or not the passed-in buffer is actually big enough to store the transformed string) -- that is actually allowed by the standard, I think. In other words, Solaris 8 has the only actually buggy strxfrm() of the cases convert_string_datum() enumerates, since AFAICT the standard doesn't promise that the strxfrm() return value need be exact, only sufficient (this leeway seems useful to me). -- Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes: > On Sat, Jun 27, 2015 at 7:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think the point of Noah's query is to find out whether "ancient" is an >> accurate description. > You said it yourself at the time -- why trust the strxfrm() > implementation when a NULL pointer is passed? It may have worked on > someone's machine in 2003, but that isn't a very good reason. It was > never a documented part of the interface that this fails or that > works; how could it be? This Solaris strxfrm() issue is (in the > simplest and least contentious sense) a bug. It is not a portability > issue. Someone made a mistake, and most likely the mistake was > corrected in the next point release. The point here is to *find out*, rather than assuming. I agree that Sun should have been embarrassed that such a bug ever made it into a released libc, but it did. The question is how long did it take them to notice and fix it. Assuming that it happened in "the next point release", with absolutely no evidence to support that optimistic guess, doesn't seem like a good idea to me. (BTW, another way to settle the question might be to check the Solaris release history. Wonder if that's available anywhere.) regards, tom lane
On Sun, Jun 28, 2015 at 8:31 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The point here is to *find out*, rather than assuming. I agree that > Sun should have been embarrassed that such a bug ever made it into > a released libc, but it did. The question is how long did it take > them to notice and fix it. Assuming that it happened in "the next point > release", with absolutely no evidence to support that optimistic guess, > doesn't seem like a good idea to me. > > (BTW, another way to settle the question might be to check the Solaris > release history. Wonder if that's available anywhere.) I did attempt that before abbreviated keys went in, too. I didn't have much luck finding the relevant information. *My* point here is that this seems like taking unbounded responsibility for a massive number of standard library bugs -- indisputable, garden-variety bugs. This bug was really likely to result in a segfault back when Postgres lacked (but also arguably needed) defenses against it, which was a period of about a year. The reason that defenses were added was because there were several complaints, but maybe those complaints were misdirected. It might have been the right decision at the time to paper over the problem, but only for a year or two. I'd only favor adding defenses if it could be expected to take longer for the Solaris stdlib people to ship a fix for their egregious bug than it would take for the next Postgres point release. Why should that be true, though? This just happened to be one of the more embarrassing, obvious stdlib bugs that appeared in the last 15 years, and so we're talking about it now, but there are plenty more, a good deal of which are far more recent than this one. Where does it end? I don't see why this bug is special just because 4 or 5 people complained about it on pgsql-hackers over a decade ago. -- Peter Geoghegan
On 06/28/2015 12:29 PM, Peter Geoghegan wrote: > It might have been the right decision at the time to paper over the > problem, but only for a year or two. I'd only favor adding defenses if > it could be expected to take longer for the Solaris stdlib people to > ship a fix for their egregious bug than it would take for the next > Postgres point release. Why should that be true, though? BTW, I've asked Joyent to test on their platform. My perspective is that if both SmartOS and OmniOS pass, it's not our responsibility to support OldSolaris if they won't update libraries. However, OpenSCG and EDB support more OldSolaris customers than anyone, AFAIK, so I'd like to her some opinions from them ... -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, Jun 29, 2015 at 7:58 AM, Josh Berkus <josh@agliodbs.com> wrote: > My perspective is that if both SmartOS and OmniOS pass, it's not our > responsibility to support OldSolaris if they won't update libraries. Just by the way, I wonder if this was that bug: https://illumos.org/issues/1594 -- Thomas Munro http://www.enterprisedb.com
On Sun, Jun 28, 2015 at 12:58 PM, Josh Berkus <josh@agliodbs.com> wrote: > My perspective is that if both SmartOS and OmniOS pass, it's not our > responsibility to support OldSolaris if they won't update libraries. Obviously I especially don't want to double the number of strxfrm() calls made during text abbreviation for *everyone* just to work around this silly bug. Those calls will generally be a large fraction of the cost of any text sort in 9.5, so clearly that would be unacceptable. Maybe Noah should commit a patch that makes the initial size of the buffer that stores the transformed string blob very small. This can be reverted once it has some buildfarm cycles. That is a bit of a scatter gun approach, but maybe that's inevitable given the paucity of information around the issue. -- Peter Geoghegan
Thomas Munro <thomas.munro@enterprisedb.com> writes: > Just by the way, I wonder if this was that bug: > https://illumos.org/issues/1594 Oooh. Might or might not be *same* bug, but it sure looks like it could have the right symptom. If this is indeed inherited from old Solaris, I'm afraid we are totally fooling ourselves if we guess that it's no longer present in the wild. regards, tom lane
Peter Geoghegan <pg@heroku.com> writes: > On Sun, Jun 28, 2015 at 12:58 PM, Josh Berkus <josh@agliodbs.com> wrote: >> My perspective is that if both SmartOS and OmniOS pass, it's not our >> responsibility to support OldSolaris if they won't update libraries. > Obviously I especially don't want to double the number of strxfrm() > calls made during text abbreviation for *everyone* just to work around > this silly bug. Those calls will generally be a large fraction of the > cost of any text sort in 9.5, so clearly that would be unacceptable. I agree, but ... > Maybe Noah should commit a patch that makes the initial size of the > buffer that stores the transformed string blob very small. This can be > reverted once it has some buildfarm cycles. That is a bit of a scatter > gun approach, but maybe that's inevitable given the paucity of > information around the issue. Another idea would be to make a test during postmaster start to see if this bug exists, and fail if so. I'm generally on board with the thought that we don't need to work on systems with such a bad bug, but it would be a good thing if the failure was clean and produced a helpful error message, rather than looking like a Postgres bug. regards, tom lane
Peter Geoghegan <pg@heroku.com> writes: > It might have been the right decision at the time to paper over the > problem, but only for a year or two. I'd only favor adding defenses if > it could be expected to take longer for the Solaris stdlib people to > ship a fix for their egregious bug than it would take for the next > Postgres point release. Why should that be true, though? > This just happened to be one of the more embarrassing, obvious stdlib > bugs that appeared in the last 15 years, and so we're talking about it > now, but there are plenty more, a good deal of which are far more > recent than this one. Where does it end? I don't see why this bug is > special just because 4 or 5 people complained about it on > pgsql-hackers over a decade ago. Peter, your arguments are beginning to sound like desperate excuse-making. The reason that bug is "special" is that it looks like a crash in Postgres, one that people have complained of because they didn't see it in other programs, which is not totally surprising because it requires a somewhat unusual usage of strxfrm(). I think the dumb two-call implementation exhibited in convert_string_datum() is mainstream usage, which would explain why Sun hadn't noticed the bug ages ago. It might be all right to refuse to support platforms that have this bug, but I think we need to determine with some more clarity just how widespread the bug still is, and in any case we should at least install some defense that would allow us to report that libc is buggy. Otherwise we'll be back to fielding bug reports that trace to this, which is no fun for anyone and does not make us look good. regards, tom lane
On Sun, Jun 28, 2015 at 7:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Peter Geoghegan <pg@heroku.com> writes: >> It might have been the right decision at the time to paper over the >> problem, but only for a year or two. I'd only favor adding defenses if >> it could be expected to take longer for the Solaris stdlib people to >> ship a fix for their egregious bug than it would take for the next >> Postgres point release. Why should that be true, though? > >> This just happened to be one of the more embarrassing, obvious stdlib >> bugs that appeared in the last 15 years, and so we're talking about it >> now, but there are plenty more, a good deal of which are far more >> recent than this one. Where does it end? I don't see why this bug is >> special just because 4 or 5 people complained about it on >> pgsql-hackers over a decade ago. > > Peter, your arguments are beginning to sound like desperate excuse-making. > > The reason that bug is "special" is that it looks like a crash in > Postgres, one that people have complained of because they didn't see it > in other programs, which is not totally surprising because it requires > a somewhat unusual usage of strxfrm(). I think the dumb two-call > implementation exhibited in convert_string_datum() is mainstream usage, > which would explain why Sun hadn't noticed the bug ages ago. > > It might be all right to refuse to support platforms that have this > bug, but I think we need to determine with some more clarity just > how widespread the bug still is, and in any case we should at least > install some defense that would allow us to report that libc is > buggy. Otherwise we'll be back to fielding bug reports that trace > to this, which is no fun for anyone and does not make us look good. I completely agree. Noah is quite right to try to find out whether this is still an issue, and I'm glad he's doing it, and I think it's very unfortunate that Peter is trying to discourage that research. If in fact the bug does not still exist in the wild, then Noah's research will demonstrate that we have no problem, and we do not need to do anything. If it does, then we can decide what to do about that. But I have come to value Noah's diligent attitude towards hunting down problems in our code base, and I hope Peter (and everyone else) will appreciate that attitude as well - or at the very least, stay out of the way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Jun 28, 2015 at 4:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The reason that bug is "special" is that it looks like a crash in > Postgres, one that people have complained of because they didn't see it > in other programs, which is not totally surprising because it requires > a somewhat unusual usage of strxfrm(). I think the dumb two-call > implementation exhibited in convert_string_datum() is mainstream usage, > which would explain why Sun hadn't noticed the bug ages ago. It hardly matters much, but I don't think that it is. I think the issue is entirely explained by sloppy code in the Solaris 8 stdlib. Anyone using strxfrm() is probably doing so for exactly the same reason as bttext_abbrev_convert() -- making sorting text in a collation-aware fashion faster. That's what strxfrm() is more or less advertised for by the glibc docs, for example, and in fact the strxfrm() glibc example involves resizing a buffer to fit [1], to avoid doubling the number of strxfrm() calls. For that reason, I think that convert_string_datum() is much further from mainstream usage than bttext_abbrev_convert(). > It might be all right to refuse to support platforms that have this > bug, but I think we need to determine with some more clarity just > how widespread the bug still is, and in any case we should at least > install some defense that would allow us to report that libc is > buggy. Otherwise we'll be back to fielding bug reports that trace > to this, which is no fun for anyone and does not make us look good. I can definitely get behind having a defense at startup. That seems like cheap insurance. It probably isn't too much work to adopt Noah's test program. [1] http://www.gnu.org/software/libc/manual/html_node/Collation-Functions.html -- Peter Geoghegan
On Mon, Jun 29, 2015 at 10:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@enterprisedb.com> writes: >> Just by the way, I wonder if this was that bug: >> https://illumos.org/issues/1594 > > Oooh. Might or might not be *same* bug, but it sure looks like it could > have the right symptom. If this is indeed inherited from old Solaris, > I'm afraid we are totally fooling ourselves if we guess that it's no > longer present in the wild. Also, here is an interesting patch that went into the Apache C++ standard library. Maybe the problem was limited to amd64 system... https://github.com/illumos/illumos-userland/blob/master/components/stdcxx/patches/047-collate.cpp.patch -- Thomas Munro http://www.enterprisedb.com
On Sun, Jun 28, 2015 at 4:35 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I completely agree. Noah is quite right to try to find out whether > this is still an issue, and I'm glad he's doing it, and I think it's > very unfortunate that Peter is trying to discourage that research. Far from it. I am providing constructive feedback. > If in fact the bug does not still exist in the wild, then Noah's research > will demonstrate that we have no problem, and we do not need to do > anything. If it does, then we can decide what to do about that. But > I have come to value Noah's diligent attitude towards hunting down > problems in our code base, and I hope Peter (and everyone else) will > appreciate that attitude as well - or at the very least, stay out of > the way. I hope that it goes without saying that I greatly appreciate Noah's efforts in tracking down this kind of thing. I am not standing in anyone's way. My intent is to save Noah some work. I imagined that it might not be clear to him how completely unreasonable it is for a strxfrm() implementation to have this issue -- it is totally unreasonable. It is not a portability problem. Clearly Tom agrees with that view, since he stated that he thinks it's okay to not support a platform with an abjectly broke strxfrm(). Now we're talking about failing in a sane way, rather than trying to add band aids, and now the exact extent of the problem seems less important. -- Peter Geoghegan
On Sun, Jun 28, 2015 at 4:35 PM, Peter Geoghegan <pg@heroku.com> wrote: > It hardly matters much, but I don't think that it is. I think the > issue is entirely explained by sloppy code in the Solaris 8 stdlib. I don't imagine that it will come as a surprise to anybody, but the manpage [1] for strxfrm() covering old Solaris libc versions (predating even version 8) states: "No more than n bytes are placed into the resulting array pointed to by s1, including the terminating null byte" ('n' is the bufsize argument passed by the caller). This is proof positive (though it is hardly needed) that Solaris was never supposed to allow this. I also noticed that the manpage had a typo: "Because no return value is reserved to indicate an error, an application wishing to check for error situations should set errno to 0, then call strcoll(3C), then check errno and if it is non-zero, assume an error has occurred." Obviously this is a copy-paste leftover from the strcoll() manpage. Pretty slipshod for a C standard library implementation, I would say. [1] http://docs.oracle.com/cd/E19455-01/806-0627/6j9vhfn88/index.html#indexterm-1090 -- Peter Geoghegan
All, Joyent confirms that the bug is fixed on SmartOS: This is what I see: SunOS pkgsrc-pbulk-2014Q4-1.local 5.11 joyent_20141030T081701Z i86pc i386 i86pc locale "is_IS.ISO8859-1": strxfrm returned 212; last modified byte at 58 (0x0) locale "is_IS.ISO8859-1": strxfrm returned 212; last modified byte at 58 (0x0) locale "": strxfrm returned 26; last modified byte at 27 (0x0) Cheers, -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > Joyent confirms that the bug is fixed on SmartOS: The more interesting bit of information would be *when* it was fixed. regards, tom lane
On 06/29/2015 02:08 PM, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: >> Joyent confirms that the bug is fixed on SmartOS: > > The more interesting bit of information would be *when* it was fixed. Answer: "not certain, but fixed at least 2 years ago". -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, Jun 29, 2015 at 6:07 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 06/29/2015 02:08 PM, Tom Lane wrote: >> Josh Berkus <josh@agliodbs.com> writes: >>> Joyent confirms that the bug is fixed on SmartOS: >> >> The more interesting bit of information would be *when* it was fixed. > > Answer: "not certain, but fixed at least 2 years ago". 2 years is definitely not long enough to assume that no unpatched machines exist in the wild. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jun 29, 2015 at 11:52:26AM +1200, Thomas Munro wrote: > On Mon, Jun 29, 2015 at 10:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Thomas Munro <thomas.munro@enterprisedb.com> writes: > >> Just by the way, I wonder if this was that bug: > >> https://illumos.org/issues/1594 > > > > Oooh. Might or might not be *same* bug, but it sure looks like it could > > have the right symptom. If this is indeed inherited from old Solaris, > > I'm afraid we are totally fooling ourselves if we guess that it's no > > longer present in the wild. Very interesting. Looks like the illumos strxfrm() came from FreeBSD, not from Solaris; illumos introduced their bug independently: https://illumos.org/issues/2 https://github.com/illumos/illumos-gate/commits/master/usr/src/lib/libc/port/locale/collate.c > Also, here is an interesting patch that went into the Apache C++ > standard library. Maybe the problem was limited to amd64 system... > > https://github.com/illumos/illumos-userland/blob/master/components/stdcxx/patches/047-collate.cpp.patch That's a useful data point. Based on Oskari Saarenmaa's report, newer Solaris 10 is not affected. The fix presumably showed up after the 05/08 release and no later than the 01/13 release. On Sun, Jun 28, 2015 at 07:00:14PM -0400, Tom Lane wrote: > > On Sun, Jun 28, 2015 at 12:58 PM, Josh Berkus <josh@agliodbs.com> wrote: > >> My perspective is that if both SmartOS and OmniOS pass, it's not our > >> responsibility to support OldSolaris if they won't update libraries. > Another idea would be to make a test during postmaster start to see > if this bug exists, and fail if so. I'm generally on board with the > thought that we don't need to work on systems with such a bad bug, > but it would be a good thing if the failure was clean and produced > a helpful error message, rather than looking like a Postgres bug. Failing cleanly on unpatched Solaris is adequate, agreed. A check at postmaster start isn't enough, because the postmaster might run in the C locale while individual databases or collations use problem locales. The safest thing is to test after every setlocale(LC_COLLATE) and newlocale(LC_COLLATE). That's once at backend start and once per backend per collation used, more frequent than I would like. Hmm.
Noah Misch <noah@leadboat.com> writes: > On Sun, Jun 28, 2015 at 07:00:14PM -0400, Tom Lane wrote: >> Another idea would be to make a test during postmaster start to see >> if this bug exists, and fail if so. I'm generally on board with the >> thought that we don't need to work on systems with such a bad bug, >> but it would be a good thing if the failure was clean and produced >> a helpful error message, rather than looking like a Postgres bug. > Failing cleanly on unpatched Solaris is adequate, agreed. A check at > postmaster start isn't enough, because the postmaster might run in the C > locale while individual databases or collations use problem locales. The > safest thing is to test after every setlocale(LC_COLLATE) and > newlocale(LC_COLLATE). That's once at backend start and once per backend per > collation used, more frequent than I would like. Hmm. I was thinking more along the lines of making a single test by momentarily switching into a known-buggy locale. However, your results here imply that there are or were more than one bug with this symptom, which may make a catchall test impossible. regards, tom lane
On 06/29/2015 07:53 PM, Robert Haas wrote: > On Mon, Jun 29, 2015 at 6:07 PM, Josh Berkus <josh@agliodbs.com> wrote: >> On 06/29/2015 02:08 PM, Tom Lane wrote: >>> Josh Berkus <josh@agliodbs.com> writes: >>>> Joyent confirms that the bug is fixed on SmartOS: >>> >>> The more interesting bit of information would be *when* it was fixed. >> >> Answer: "not certain, but fixed at least 2 years ago". > > 2 years is definitely not long enough to assume that no unpatched > machines exist in the wild. :-( Well, AFAIK, 90% of SmartOS users are on the Joyent Cloud and get upgraded automatically. It's not much used as a portable OS. So for them, it is enough. Given that Theo's patch against Illumos went in in 2012, I think we can take that as our cutoff date for Illumos. The question is: how many folks out there are running PostgreSQL on Solaris 10? And are they at all likely to upgrade to PostgreSQL 9.5? Unfortunately, these questions are coming up right when Bjorn is on vacation ... -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: > The question is: how many folks out there are running PostgreSQL on > Solaris 10? And are they at all likely to upgrade to PostgreSQL 9.5? That's only the pertinent question if the bug exists on Solaris 10, which I don't think we know do we? Oskari Saarenmaa's results upthread seemed to indicate not. regards, tom lane
On Tue, Jun 30, 2015 at 09:45:08AM -0400, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > On Sun, Jun 28, 2015 at 07:00:14PM -0400, Tom Lane wrote: > >> Another idea would be to make a test during postmaster start to see > >> if this bug exists, and fail if so. I'm generally on board with the > >> thought that we don't need to work on systems with such a bad bug, > >> but it would be a good thing if the failure was clean and produced > >> a helpful error message, rather than looking like a Postgres bug. > > > Failing cleanly on unpatched Solaris is adequate, agreed. A check at > > postmaster start isn't enough, because the postmaster might run in the C > > locale while individual databases or collations use problem locales. The > > safest thing is to test after every setlocale(LC_COLLATE) and > > newlocale(LC_COLLATE). That's once at backend start and once per backend per > > collation used, more frequent than I would like. Hmm. > > I was thinking more along the lines of making a single test by momentarily > switching into a known-buggy locale. It seems like every Solaris system I meet has a different set of installed locales. Any one I might pick could sometimes be unavailable when other buggy locales are available. On this Solaris 10 05/08 (10u5) system, the buggy and non-buggy locales have no clear pattern. While fr_FR.UTF-8 is fine, fr_LU.UTF-8 is buggy.
On Wed, Jul 01, 2015 at 03:22:33AM -0400, Noah Misch wrote: > On Tue, Jun 30, 2015 at 09:45:08AM -0400, Tom Lane wrote: > > Noah Misch <noah@leadboat.com> writes: > > > On Sun, Jun 28, 2015 at 07:00:14PM -0400, Tom Lane wrote: > > >> Another idea would be to make a test during postmaster start to see > > >> if this bug exists, and fail if so. I'm generally on board with the > > >> thought that we don't need to work on systems with such a bad bug, > > >> but it would be a good thing if the failure was clean and produced > > >> a helpful error message, rather than looking like a Postgres bug. > > > > > Failing cleanly on unpatched Solaris is adequate, agreed. A check at > > > postmaster start isn't enough, because the postmaster might run in the C > > > locale while individual databases or collations use problem locales. The > > > safest thing is to test after every setlocale(LC_COLLATE) and > > > newlocale(LC_COLLATE). That's once at backend start and once per backend per > > > collation used, more frequent than I would like. Hmm. Solaris does not have locale_t or strxfrm_l(), so per-collation checks aren't relevant after all. Checking at postmaster start and backend start seems cheap enough, hence this patch.
Attachment
On Sat, Jun 27, 2015 at 11:57:30AM -0700, Peter Geoghegan wrote: > On Sat, Jun 27, 2015 at 9:51 AM, Noah Misch <noah@leadboat.com> wrote: > > PostgreSQL 9.5 adds a strxfrm() call in bttext_abbrev_convert(), which does > > not account for the Solaris bug. I wish to determine whether that bug is > > still relevant today. If you have access to Solaris with the is_IS.ISO8859-1 > > locale installed (or root access to install it), please compile and run the > > attached test program on each Solaris version you have available. Reply here > > with the program's output. I especially need a report from Solaris 10, but > > reports from older and newer versions are valuable. Thanks. > > I did consider this. > > Sorry, but I must question the point of worrying about an ancient bug > in Solaris. One function had a comment explaining its workaround for an OS bug, while another function ignored the same bug. That is always a defect in the comments at least; our code shall tell a uniform story about its API assumptions. I started this thread estimating that it would end with me merely deleting the comment. Thomas Munro and Tom Lane located evidence I hadn't found, evidence that changed the conclusion. > When you have to worry about a standard library function > blithely writing past the end of a buffer, when its C89 era interface > must be passed the size of said buffer, where does it end? Don't worry about the possibility of such basic bugs until someone reports one. Once you have such a report, though, assume the interface behaves as last reported until you receive new evidence. We decide whether to work around such bugs based on factors like prevalence of affected systems, simplicity of the workaround, and ease of field diagnosis in the absence of the workaround. Non-factors include whether the bug is egregious, is contrary to documentation, or is contrary to a pertinent third-party specification. Those sources speak to which of the library provider or PostgreSQL was wrong, but they play little or no role in dictating the workarounds to deploy. I hope that clarifies things. nm
On Wed, Jul 8, 2015 at 10:18 PM, Noah Misch <noah@leadboat.com> wrote: > One function had a comment explaining its workaround for an OS bug, while > another function ignored the same bug. That is always a defect in the > comments at least; our code shall tell a uniform story about its API > assumptions. I started this thread estimating that it would end with me > merely deleting the comment. Thomas Munro and Tom Lane located evidence I > hadn't found, evidence that changed the conclusion. That seems very reasonable. I noticed that you removed the glibc strxfrm() comment (or at least the questioning of its behavior), which was a good decision. >> When you have to worry about a standard library function >> blithely writing past the end of a buffer, when its C89 era interface >> must be passed the size of said buffer, where does it end? > > Don't worry about the possibility of such basic bugs until someone reports > one. Once you have such a report, though, assume the interface behaves as > last reported until you receive new evidence. We decide whether to work > around such bugs based on factors like prevalence of affected systems, > simplicity of the workaround, and ease of field diagnosis in the absence of > the workaround. I must admit that I was rather surprised that more or less the same blitheness about writing past the end of a buffer occurred a second time in an apparently independent standard library implementation. I think that illustrates your point well. Thanks -- Peter Geoghegan
On 27/06 12.51, Noah Misch wrote: > > PostgreSQL 9.5 adds a strxfrm() call in bttext_abbrev_convert(), which does > not account for the Solaris bug. I wish to determine whether that bug is > still relevant today. If you have access to Solaris with the is_IS.ISO8859-1 > locale installed (or root access to install it), please compile and run the > attached test program on each Solaris version you have available. Reply here > with the program's output. I especially need a report from Solaris 10, but > reports from older and newer versions are valuable. Thanks. I ran this program on Solaris 9 U5 (September 2006) on Sparc and got: --- SunOS tor10-z1 5.9 Generic_Virtual sun4v sparc SUNW,SPARC-Enterprise-T5120 locale "is_IS.ISO8859-1": strxfrm returned 162; last modified byte at 106 (0xffd2) locale "is_IS.ISO8859-1": strxfrm returned 162; last modified byte at 106 (0xffd2) locale "": strxfrm returned 26; last modified byte at 58 (0x0) --- NB this runs in a Container on a Solaris 10 host. On our oldest Solaris 10 sparc box, running Solaris 10 U6 (October 2008) I get: --- SunOS tor07 5.10 Generic_137137-09 sun4v sparc SUNW,SPARC-Enterprise-T5120 locale "is_IS.ISO8859-1": strxfrm returned 216; last modified byte at 58 (0x0) locale "is_IS.ISO8859-1": strxfrm returned 216; last modified byte at 58 (0x0) locale "": strxfrm returned 26; last modified byte at 27 (0x0) --- On x86: Solaris 10 U5 (May 2008) I get: --- SunOS loki08 5.10 Generic_127128-11 i86pc i386 i86pc locale "is_IS.ISO8859-1": strxfrm returned 216; last modified byte at 58 (0x0) locale "is_IS.ISO8859-1": strxfrm returned 216; last modified byte at 58 (0x0) locale "": strxfrm returned 26; last modified byte at 27 (0x0) --- - Bjorn
On Tue, Jul 21, 2015 at 04:45:21PM +0200, Bjorn Munch wrote: > On 27/06 12.51, Noah Misch wrote: > > > > PostgreSQL 9.5 adds a strxfrm() call in bttext_abbrev_convert(), which does > > not account for the Solaris bug. I wish to determine whether that bug is > > still relevant today. If you have access to Solaris with the is_IS.ISO8859-1 > > locale installed (or root access to install it), please compile and run the > > attached test program on each Solaris version you have available. Reply here > > with the program's output. I especially need a report from Solaris 10, but > > reports from older and newer versions are valuable. Thanks. > > I ran this program on Solaris 9 U5 (September 2006) on Sparc and got: I appreciate your testing. A few sources give December 2003 as the month for Solaris 9 Update 5; would you verify the vintage you used?
On 22/07 02.29, Noah Misch wrote: > > I ran this program on Solaris 9 U5 (September 2006) on Sparc and got: > > I appreciate your testing. A few sources give December 2003 as the month for > Solaris 9 Update 5; would you verify the vintage you used? Sorry I was mis-parsing the /etc/release. 9/05 is the month. I should know that. :-/ This was Solaris 9 U9 from September 2005. From the output it looks like the bug was present here? I did a quick search for bugs in libc with strxfrm in the title and got a few hits but none that seemed to be this one. - Bjorn
On Wed, Jul 22, 2015 at 08:40:05AM +0200, Bjorn Munch wrote: > On 22/07 02.29, Noah Misch wrote: > > > I ran this program on Solaris 9 U5 (September 2006) on Sparc and got: > > > > I appreciate your testing. A few sources give December 2003 as the month for > > Solaris 9 Update 5; would you verify the vintage you used? > > Sorry I was mis-parsing the /etc/release. 9/05 is the month. I should > know that. :-/ This was Solaris 9 U9 from September 2005. Does it have recent Solaris 9 updates, or is it closer to a base install of the 9/05 packages? Apparently, Oracle issued a final batch of Solaris 9 updates in February 2015. I would find it useful to know the age of the installed package containing /lib/libc.so.1. > From the output it looks like the bug was present here? Yes. > I did a quick search for bugs in libc with strxfrm in the title and > got a few hits but none that seemed to be this one. I was curious about that. Thanks for looking.
Noah, All: Where are we with this? Do we feel confident that this bug is only on old versions of Solaris we don't care about? Or does it remain to be resolved? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Fri, Jul 31, 2015 at 08:29:51PM -0700, Josh Berkus wrote: > Where are we with this? Do we feel confident that this bug is only on > old versions of Solaris we don't care about? Or does it remain to be > resolved? Affected systems either have an available vendor update addressing the problem or are so old that we don't care. Commit be8b06c makes affected systems fail early, which resolves the matter.