Thread: Beta 6 Regression results on Redat 7.0.

Beta 6 Regression results on Redat 7.0.

From
Lamar Owen
Date:
Ok, thanks to our snowstorm :-0 I have been working on the beta 6 RPM situation
on my _slow_ notebook today (power outages for ten minutes at a time happening
at hour or so intervals due to 45mph+ winds and a foot of snow....).

Well, I have preliminary RPM's built -- just need to work on the contrib tree
situation.  I ran regression the usual RPM way (which I am fully aware is not
the normally approved method, but it _would_ be the method any RPM beta testers
would use), and got a different failure, one that is not locale related
(LC_ALL=C both for the initdb and the postmaster startup in the newest
initscript).  See attached regression.diffs for details of the temptest failure
I experienced.

Regression run with CWD=/usr/share/test/regress, user=postgres.
./pg_regress --schedule=parallel_schedule

This is the only regression test failure I have found thus far. I have never
seen this failure before, so I'm not sure where to proceed.

Now to attack the contrib tree (looking forward to my new notebook, as this old
P133 takes an hour and twenty minutes to slog through a full build....).

Seeing that RC1 is in prep, is there a pressing need to upload and release beta
6 RPM's, or will it be a day or two before RC1?
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

Re: Beta 6 Regression results on Redat 7.0.

From
Tom Lane
Date:
Lamar Owen <lamar.owen@wgcr.org> writes:
>   DROP TABLE temptest;
> + NOTICE:  FlushRelationBuffers(temptest, 0): block 0 is referenced (private 0, global 1)
> + ERROR:  heap_drop_with_catalog: FlushRelationBuffers returned -2
>   SELECT * FROM temptest;

Hoo, that's interesting ...  Exactly what fileset were you using again?

> Seeing that RC1 is in prep, is there a pressing need to upload and
> release beta 6 RPM's, or will it be a day or two before RC1?

I think you might as well wait for RC1 as far as actually making RPMs
goes.  But do you want to let anyone else check out the RPM build
process?  For instance, I've been wondering what you did about the
which-set-of-headers-to-install issue.
        regards, tom lane


Re: Beta 6 Regression results on Redat 7.0.

From
The Hermit Hacker
Date:
On Tue, 20 Mar 2001, Lamar Owen wrote:

> Ok, thanks to our snowstorm :-0 I have been working on the beta 6 RPM situation
> on my _slow_ notebook today (power outages for ten minutes at a time happening
> at hour or so intervals due to 45mph+ winds and a foot of snow....).
>
> Well, I have preliminary RPM's built -- just need to work on the contrib tree
> situation.  I ran regression the usual RPM way (which I am fully aware is not
> the normally approved method, but it _would_ be the method any RPM beta testers
> would use), and got a different failure, one that is not locale related
> (LC_ALL=C both for the initdb and the postmaster startup in the newest
> initscript).  See attached regression.diffs for details of the temptest failure
> I experienced.
>
> Regression run with CWD=/usr/share/test/regress, user=postgres.
> ./pg_regress --schedule=parallel_schedule
>
> This is the only regression test failure I have found thus far. I have never
> seen this failure before, so I'm not sure where to proceed.
>
> Now to attack the contrib tree (looking forward to my new notebook, as this old
> P133 takes an hour and twenty minutes to slog through a full build....).
>
> Seeing that RC1 is in prep, is there a pressing need to upload and release beta
> 6 RPM's, or will it be a day or two before RC1?

I'm going to do RC1 tonight ... so no pressing need :)




Re: Beta 6 Regression results on Redat 7.0.

From
Lamar Owen
Date:
On Tue, 20 Mar 2001, Tom Lane wrote:
> Lamar Owen <lamar.owen@wgcr.org> writes:
> >   DROP TABLE temptest;
> > + NOTICE:  FlushRelationBuffers(temptest, 0): block 0 is referenced (private 0, global 1)
> > + ERROR:  heap_drop_with_catalog: FlushRelationBuffers returned -2
> >   SELECT * FROM temptest;
> Hoo, that's interesting ...  Exactly what fileset were you using again?

When you say 'fileset', I'm assuming you are referring to the --schedule
parameter -- I am invoking the following command:
./pg_regress --schedule=parallel_schedule  

7.1beta6 distribution tarball.  LC_ALL=C.  Compiled on RedHat 7 as shipped.

I'm rerunning to see if it is intermittent. Second run -- no error.  Running a
third time......no error.  Now I'm confused.  What would cause such an error,
Tom?  I'm going to check on my desktop, once power gets more stable (and it
quits lightning -- yes, a snowstorm with lightning :-0  I certainly got what I
wanted.....).  So, more to come later.

> > Seeing that RC1 is in prep, is there a pressing need to upload and
> > release beta 6 RPM's, or will it be a day or two before RC1?
> I think you might as well wait for RC1 as far as actually making RPMs
> goes.  But do you want to let anyone else check out the RPM build
> process?  For instance, I've been wondering what you did about the
> which-set-of-headers-to-install issue.

Oh, ok.  Spec file attached.  All other files needed are the beta6 tarball and
the contents of the beta4-1 source rpm, with names changed to match the beta6
version number.  There are some other changes I have to merge in --
particularly a set from Karl for the optional PL/Perl build, as well as others,
so this is a preliminary spec file.

But I was just getting the basic build done and tested.

To directly answer your question, I'm using 'make install-all-headers' and
stuffing it into the devel rpm in one piece at this time.
-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

Re: Beta 6 Regression results on Redat 7.0.

From
Tom Lane
Date:
Lamar Owen <lamar.owen@wgcr.org> writes:
> DROP TABLE temptest;
> + NOTICE:  FlushRelationBuffers(temptest, 0): block 0 is referenced (private 0, global 1)
> + ERROR:  heap_drop_with_catalog: FlushRelationBuffers returned -2
> SELECT * FROM temptest;
>> Hoo, that's interesting ...  Exactly what fileset were you using again?

> When you say 'fileset', I'm assuming you are referring to the --schedule
> parameter --

No, I was wondering about whether you had an inconsistent set of source
files, or had managed to not do a complete rebuild, or something like
that.  The above error should be entirely impossible considering that
the table in question is a temp table that's not been touched by any
other backend.  If you did manage to get this from a clean build then
I think we have a serious problem to look at.

>> I think you might as well wait for RC1 as far as actually making RPMs
>> goes.  But do you want to let anyone else check out the RPM build
>> process?  For instance, I've been wondering what you did about the
>> which-set-of-headers-to-install issue.

> Oh, ok.  Spec file attached.  All other files needed are the beta6 tarball and
> the contents of the beta4-1 source rpm, with names changed to match the beta6
> version number.

OK, I will pull the files and try to replicate this on my own laptop.
Does anyone else have time to try to duplicate the problem tonight?
If it's replicatable at all, I think it's a release stopper.

> To directly answer your question, I'm using 'make install-all-headers' and
> stuffing it into the devel rpm in one piece at this time.

Works for me.
        regards, tom lane


Re: Beta 6 Regression results on Redat 7.0.

From
Lamar Owen
Date:
On Tue, 20 Mar 2001, Tom Lane wrote:
> Lamar Owen <lamar.owen@wgcr.org> writes:
> > DROP TABLE temptest;
> > + NOTICE:  FlushRelationBuffers(temptest, 0): block 0 is referenced (private 0, global 1)
> > + ERROR:  heap_drop_with_catalog: FlushRelationBuffers returned -2
> > SELECT * FROM temptest;

> >> Hoo, that's interesting ...  Exactly what fileset were you using again?
> > When you say 'fileset', I'm assuming you are referring to the --schedule
> > parameter --
> No, I was wondering about whether you had an inconsistent set of source
> files, or had managed to not do a complete rebuild, or something like
> that.  The above error should be entirely impossible considering that
> the table in question is a temp table that's not been touched by any
> other backend.  If you did manage to get this from a clean build then
> I think we have a serious problem to look at.

Standard RPM rebuild -- always wipes the whole build tree out and re-expands
from the tarball, reapplies patches, and rebuilds from scratch every time I
change even the smallest detail in the spec file -- which is why it takes so
long to get these things out.  So, no, this is a scratch build from a fresh
tarball.

> Does anyone else have time to try to duplicate the problem tonight?
> If it's replicatable at all, I think it's a release stopper.

I have not yet been able to repeat the problem.  I am running my fifth
regression test run (which takes a long time on this P133) with a freshly
initdb'ed PGDATA -- the previous regression runs were done on the same PGDATA
tree as the first run was done on.  Took 12 minutes 40 seconds, but I can't
repeat the error. I'm hoping it was a problem on my machine -- educate me on
what caused the error so I can see if something in my setup did something not
so nice.  So, the score is one error out of six test runs, thus far.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11


Re: Beta 6 Regression results on Redat 7.0.

From
Tom Lane
Date:
Lamar Owen <lamar.owen@wgcr.org> writes:
> I'm hoping it was a problem on my machine -- educate me on
> what caused the error

Well, that's exactly what I'd like to know.  The direct cause of the
error is that DROP TABLE is finding that some other backend has a
reference-count hold on a page of the temp table it's trying to drop.
Since no other backend should be trying to touch this temp table,
there's something pretty fishy here.

Given that this is a parallel test, you may be looking at a
low-probability timing-dependent failure.  I'd say set up the machine
and run repeat tests for an hour or three ... that's what I plan to do
here.

BTW, what postmaster parameters are you using --- -B and so forth?
        regards, tom lane
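
The repeat-testing approach suggested above can be sketched as a small driver
script. This is an illustrative sketch, not something posted in the thread:
`run_until_failure` and its parameters are hypothetical names, and it assumes
the test command exits with a non-zero status when a test fails.

```python
import subprocess
import time


def run_until_failure(cmd, max_seconds, runner=None, clock=time.monotonic):
    """Run cmd repeatedly until it fails or max_seconds have elapsed.

    Returns (runs_started, index_of_failing_run or None).
    """
    if runner is None:
        # Default runner: execute the command and report its exit status.
        # Assumption: a non-zero status means at least one test failed.
        def runner(command):
            return subprocess.run(command).returncode

    deadline = clock() + max_seconds
    runs = 0
    while clock() < deadline:
        runs += 1
        if runner(cmd) != 0:
            return runs, runs  # stop at the first failing run
    return runs, None
```

For the command used in this thread, the call would be something like
`run_until_failure(["./pg_regress", "--schedule=parallel_schedule"], 3 * 3600)`,
issued from the regress directory. A low-probability timing failure may still
need many such runs before it shows up.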


RE: Beta 6 Regression results on Redat 7.0.

From
"Mikheev, Vadim"
Date:
> I'm rerunning to see if it is intermittent. Second run -- no 
> error.  Running a third time......no error.  Now I'm confused.
>  What would cause such an error, Tom? I'm going to check on my

Hmm, concurrent checkpoint? Probably we could simplify dirty test
in BufferSync() - i.e. test bufHdr->cntxDirty without holding
shlock (and pin!) on buffer: should be good as long as we set
cntxDirty flag *before* XLogInsert in access methods. Have to
look more...

Vadim


Re: Beta 6 Regression results on Redat 7.0.

From
Lamar Owen
Date:
On Tue, 20 Mar 2001, Tom Lane wrote:
> Since no other backend should be trying to touch this temp table,
> there's something pretty fishy here.

I see.
> Given that this is a parallel test, you may be looking at a
> low-probability timing-dependent failure.  I'd say set up the machine
> and run repeat tests for an hour or three ... that's what I plan to do
> here.

As a broadcast engineer, I'm a little too familiar with such things.  But this
isn't an engineer list, so I'll spare you the war stories. :-)

> BTW, what postmaster parameters are you using --- -B and so forth?

Default.  To be changed before RPM release, but currently it is the default.
The only option that postmaster.opts records is -D, and I'm not passing
anything else. 
-- 
Lamar Owen
WGCR Internet Radio
1 Peter 4:11


Re: Beta 6 Regression results on Redat 7.0.

From
The Hermit Hacker
Date:
On Tue, 20 Mar 2001, Tom Lane wrote:

> Lamar Owen <lamar.owen@wgcr.org> writes:
> > I'm hoping it was a problem on my machine -- educate me on
> > what caused the error
>
> Well, that's exactly what I'd like to know.  The direct cause of the
> error is that DROP TABLE is finding that some other backend has a
> reference-count hold on a page of the temp table it's trying to drop.
> Since no other backend should be trying to touch this temp table,
> there's something pretty fishy here.
>
> Given that this is a parallel test, you may be looking at a
> low-probability timing-dependent failure.  I'd say set up the machine
> and run repeat tests for an hour or three ... that's what I plan to do
> here.

Okay, I roll'd an RC1 but haven't put it up for FTP yet ... I'll wait for
a few hours to see if anyone can reproduce this, and, if not, put out what
I've rolled ...

say, 00:00AST ...



Re: Beta 6 Regression results on Redat 7.0.

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
> Hmm, concurrent checkpoint? Probably we could simplify dirty test
> in BufferSync() - i.e. test bufHdr->cntxDirty without holding
> shlock (and pin!) on buffer: should be good as long as we set
> cntxDirty flag *before* XLogInsert in access methods. Have to
> look more...

Yes, I'm wondering if some other backend is trying to write/flush
the buffer (maybe as part of a checkpoint, maybe not).  But seems
like we should have seen this before, if so; that's not a low-
probability scenario, particularly with just 64 buffers...
        regards, tom lane


Re: Beta 6 Regression results on Redat 7.0.

From
Tom Lane
Date:
The Hermit Hacker <scrappy@hub.org> writes:
> Okay, I roll'd an RC1 but haven't put it up for FTP yet ... I'll wait for
> a few hours to see if anyone can reproduce this, and, if not, put out what
> I've rolled ...

This will not be RC1 :-(

I've been running one backend doing repeated iterations of

CREATE TABLE temptest(col int);
INSERT INTO temptest VALUES (1);

CREATE TEMP TABLE temptest(col int);
INSERT INTO temptest VALUES (2);
SELECT * FROM temptest;
DROP TABLE temptest;

SELECT * FROM temptest;
DROP TABLE temptest;

and another one doing repeated CHECKPOINTs.  I've already gotten a
couple occurrences of Lamar's failure.

I think the problem is that BufferSync unconditionally does PinBuffer
on each buffer, and holds the pin during intervals where it's released
BufMgrLock, even if there's not really anything for it to do on that
buffer.  If someone else is running FlushRelationBuffers then it's
possible for that routine to see a nonzero pin count when it looks.

Vadim, what do you think about how to change this?  I think this is
BufferSync's fault not FlushRelationBuffers's ...
        regards, tom lane
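
The interleaving described above can be illustrated with a toy model. This is
a sketch under stated assumptions, not PostgreSQL source: the `Buffer` class,
the function names, and the `pin_only_dirty` switch are all hypothetical, and
the switch only stands in for "do not pin buffers that need no work", which is
one conceivable change rather than the fix that was adopted.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    """Toy shared buffer with only the fields the illustration needs."""
    rel: str
    dirty: bool = False
    pin_count: int = 0


def flush_relation_buffers(buffers, rel):
    """Return 0 if no buffer of rel is pinned, else -2 (as in the log above)."""
    for buf in buffers:
        if buf.rel == rel and buf.pin_count > 0:
            return -2
    return 0


def buffer_sync_steps(buffers, pin_only_dirty):
    """Model one checkpoint pass. Each yield marks a point where the buffer
    manager lock is released and another backend may run."""
    for buf in buffers:
        if pin_only_dirty and not buf.dirty:
            continue  # nothing to write, so never pin this buffer
        buf.pin_count += 1
        yield buf  # lock released while the pin is still held
        buf.dirty = False  # pretend the page was written out
        buf.pin_count -= 1


def drop_during_checkpoint(pin_only_dirty):
    """Run a checkpoint up to its first lock release, then drop the table."""
    buffers = [Buffer(rel="temptest")]  # a clean page of the temp table
    sync = buffer_sync_steps(buffers, pin_only_dirty)
    next(sync, None)
    return flush_relation_buffers(buffers, "temptest")
```

With unconditional pinning, `drop_during_checkpoint(False)` returns -2, the
same code reported in the regression diff. When clean buffers are skipped,
`drop_during_checkpoint(True)` returns 0.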


Re: Beta 6 Regression results on Redat 7.0.

From
Lamar Owen
Date:
On Tue, 20 Mar 2001, Tom Lane wrote:
> This will not be RC1 :-(
> I've already gotten a
> couple occurrences of Lamar's failure.

Well, I was at least hoping it was a problem here -- particularly since I
haven't been able to reproduce it.  But, since it is not a local problem, I'm
glad I caught it -- on the first regression test run, no less.  I've run a
dozen tests since without duplication.

Although, like you, Tom, I'm curious as to why it hadn't shown up before -- is
the fact that this is a slow machine a factor, possibly?

Although I am now much more leery of our regression suite -- this issue isn't
even tested, in reality.  Do we have _any_ WAL-related tests?  The parallel
testing is a good thing -- but I wonder what boundary conditions aren't getting
tested.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11


Re: Beta 6 Regression results on Redat 7.0.

From
Tom Lane
Date:
Lamar Owen <lamar.owen@wgcr.org> writes:
> Although I am now much more leery of our regression suite

The regression tests are not at all designed to test concurrent
behavior, and never have been.  The parallel form runs some tests
in parallel, true, but those tests are deliberately designed not to
interact.  So I don't put any faith in the regression tests as a means
to catch bugs like this.  We need some thought and work on better
concurrent tests...
        regards, tom lane


RE: Beta 6 Regression results on Redat 7.0.

From
"Mikheev, Vadim"
Date:
> I think the problem is that BufferSync unconditionally does PinBuffer
> on each buffer, and holds the pin during intervals where it's released
> BufMgrLock, even if there's not really anything for it to do on that
> buffer.  If someone else is running FlushRelationBuffers then it's
> possible for that routine to see a nonzero pin count when it looks.
> 
> Vadim, what do you think about how to change this?  I think this is
> BufferSync's fault not FlushRelationBuffers's ...

I'm looking there right now...

Vadim


Re: Beta 6 Regression results on Redat 7.0.

From
Tom Lane
Date:
>> I think the problem is that BufferSync unconditionally does PinBuffer
>> on each buffer, and holds the pin during intervals where it's released
>> BufMgrLock, even if there's not really anything for it to do on that
>> buffer.  If someone else is running FlushRelationBuffers then it's
>> possible for that routine to see a nonzero pin count when it looks.

Further note: this bug does not arise in 7.0.* because in that code,
BufferSync will only pin buffers that have been dirtied in the current
transaction.  This cannot affect a concurrent FlushRelationBuffers,
which should be holding exclusive lock on the table it's flushing.

Or can it?  The above is safe enough for user tables, but on system
tables we have a bad habit of releasing locks early.  It seems possible
that a VACUUM on a system table might see pins due to BufferSyncs
running in concurrent transactions that have altered that system table.

Perhaps this issue does explain some of the reports of
FlushRelationBuffers failure that we've seen from the field.
        regards, tom lane


Re: Beta 6 Regression results on Redat 7.0.

From
Thomas Lockhart
Date:
> Seeing that RC1 is in prep, is there a pressing need to upload and release beta
> 6 RPM's, or will it be a day or two before RC1?

Can I get the src rpm to give a try on Mandrake? I had trouble with
7.0.3 (a mysterious disappearing file in the perl build) and would like
to see where we are at with 7.1...
                   - Thomas


RE: Beta 6 Regression results on Redat 7.0.

From
"Mikheev, Vadim"
Date:
> Further note: this bug does not arise in 7.0.* because in that code,
> BufferSync will only pin buffers that have been dirtied in the current
> transaction.  This cannot affect a concurrent FlushRelationBuffers,
> which should be holding exclusive lock on the table it's flushing.
> 
> Or can it?  The above is safe enough for user tables, but on system
> tables we have a bad habit of releasing locks early. It seems possible
> that a VACUUM on a system table might see pins due to BufferSyncs
> running in concurrent transactions that have altered that system table.
> 
> Perhaps this issue does explain some of the reports of
> FlushRelationBuffers failure that we've seen from the field.

Another possible source of this problem (in 7.0.X) is BufferReplace..?

Vadim


Re: Beta 6 Regression results on Redat 7.0.

From
Lamar Owen
Date:
On Tue, 20 Mar 2001, Thomas Lockhart wrote:
> > Seeing that RC1 is in prep, is there a pressing need to upload and release beta
> > 6 RPM's, or will it be a day or two before RC1?
> Can I get the src rpm to give a try on Mandrake? I had trouble with
> 7.0.3 (a mysterious disappearing file in the perl build) and would like
> to see where we are at with 7.1...

Sure.  If you want to try out one already up there, pull the beta4 set off the
ftp site.  I'm on dialup right now -- it will take quite some time to get an
src.rpm up for beta 6.  Although, it does look like it may be a little bit
before RC1, now.  I'm at beta6-0.2 right now, with several changes to make in
the line, but, I can upload if you can wait a couple of hours (I'm in a rebuild
right now for 0.2, which will take 77 minutes or more on this machine, and then
I have to scp it over to hub.).

Tomorrow morning, if I can get out of the snow-covered driveway and to work, I
can upload it much quicker.

I'll go ahead and upload the one I'm testing with right now if you'd like.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11


Re: Beta 6 Regression results on Redat 7.0.

From
Thomas Lockhart
Date:
> I'll go ahead and upload the one I'm testing with right now if you'd like.

Not necessary, unless (I suppose) that you know the rpm for beta 4 is
broken. That vintage CVS tree behaved well enough for me to try it out
afaicr...
                   - Thomas


Re: Beta 6 Regression results on Redat 7.0.

From
Lamar Owen
Date:
On Tue, 20 Mar 2001, Thomas Lockhart wrote:
> > I'll go ahead and upload the one I'm testing with right now if you'd like.
> Not necessary, unless (I suppose) that you know the rpm for beta 4 is
> broken. That vintage CVS tree behaved well enough for me try it out
> afaicr...

It's a good start to test with for the purposes for which I think you want to
test for.  (and I'm an English teacher by night -- argh).  Beta 6 changes a few
minor things and one major thing -- the minor things are:
- Separate libs package with requisite dependency redo
- Change in the initscript to use pg_ctl to (properly) stop postmaster (no
  kill -9's here this time :-))
- Change in the initscript to initdb with LC_ALL=C and to start postmaster
  with LC_ALL=C as well.
- devel subpackage now uses make install-all-headers instead of cpp hack to
  pull in required headers for client and server development.
 

The major thing is going to be a build of the contrib tree and a contrib
subpackage -- the source will remain as part of the docs, but now that whole
set of useful files will be built out.  That is what I was beginning to do when
I stumbled across the regression failure that subsequently took the rest of the
afternoon to track.

Before final release I have a rewrite of the README to do, as well as a full
update of the migration scripts for testing.

I'm looking at /usr/lib/pgsql/contrib/* for the contrib stuff.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11


RPM building (was regression on RedHat)

From
Thomas Lockhart
Date:
> It's a good start to test with for the purposes for which I think you want to
> test for.  (and I'm an English teacher by night -- argh).

:)

Mandrake (as of 7.2) still does a brain-dead mix of "-O3" and
"-ffast-math", which is a risky and unnecessary combination according to
the gcc folks (and which kills some of our date/time rounding). From the
man page for gcc:

-ffast-math
 This  option  should never be turned on by any `-O' option
 since it can result in incorrect output for programs which
 depend on an exact implementation of IEEE  or  ANSI
 rules/specifications for math functions.
 

I'd like to get away from having to post a non-brain-dead /root/.rpmrc
file which omits the -ffast-math flag. Can you suggest mechanisms for
putting a "-fno-fast-math" into the spec file? Isn't there a mechanism
to mark things as "distro specific"? Suggestions?

Also, I'm getting the same symptom as I had for 7.0.3 with a
"disappearing file". Anyone seen this? I recall tracing this back for
the 7.0.3 case and found that Pg.bs existed in the build tree, at least
at some point in the build, but then goes away. 7.0.2, at least at the
time I did the build, did not have the problem :(

File not found: /var/tmp/postgresql-7.1beta4-root/ (cont'd) usr/lib/perl5/site_perl/5.6.0/i386-linux/auto/Pg/Pg.bs
                     - Thomas


Re: RPM building (was regression on RedHat)

From
Thomas Swan
Date:
At 3/20/2001 09:24 PM, Thomas Lockhart wrote:
> > It's a good start to test with for the purposes for which I think you want to
> > test for.  (and I'm an English teacher by night -- argh).
>
>:)
>
>Mandrake (as of 7.2) still does a brain-dead mix of "-O3" and
>"-ffast-math", which is a risky and unnecessary combination according to
>the gcc folks (and which kills some of our date/time rounding). From the
>man page for gcc:
>
>-ffast-math
>  This  option  should never be turned on by any `-O' option
>  since it can result in incorrect output for programs which
>  depend on an exact implementation of IEEE  or  ANSI
>  rules/specifications for math functions.
>
>I'd like to get away from having to post a non-brain-dead /root/.rpmrc
>file which omits the -ffast-math flag. Can you suggest mechanisms for
>putting a "-fno-fast-math" into the spec file? Isn't there a mechanism
>to mark things as "distro specific"? Suggestions?

I don't know if it helps.  But, a stock install has the environment variable
MACHTYPE=i586-mandrake-linux.

If you hunt for mandrake in the MACHTYPE variable you could reset those 
variables.

Also, I think those are set in the rpmrc file of the distro for the i386 
target.  If you specify anything else like i486, i686, you don't have that 
problem.

It would be in the RPM_OPT_FLAGS or RPM_OPTS part of the build 
environment.  I don't think there would be a problem overriding it, in 
fact, I would recommend the following : RPM_OPTS="$RPM_OPTS 
-fno-fast-math".   Since gcc will take the last argument as overriding the 
first, it would be a nice safeguard.

Even setting CFLAGS="$CFLAGS -fno-fast-math" might be good idea.

Hope this helps,
Thomas



Re: RPM building (was regression on RedHat)

From
teg@redhat.com (Trond Eivind Glomsrød)
Date:
Thomas Lockhart <lockhart@alumni.caltech.edu> writes:

> > It's a good start to test with for the purposes for which I think you want to
> > test for.  (and I'm an English teacher by night -- argh).
> 
> :)
> 
> Mandrake (as of 7.2) still does a brain-dead mix of "-O3" and
> "-ffast-math", which is a risky and unnecessary combination according to
> the gcc folks (and which kills some of our date/time rounding). From the
> man page for gcc:
> 
> -ffast-math
>  This  option  should never be turned on by any `-O' option
>  since it can result in incorrect output for programs which
>  depend on an exact implementation of IEEE  or  ANSI
>  rules/specifications for math functions.
> 
> I'd like to get away from having to post a non-brain-dead /root/.rpmrc
> file which omits the -ffast-math flag. Can you suggest mechanisms for
> putting a "-fno-fast-math" into the spec file? Isn't there a mechanism
> to mark things as "distro specific"? Suggestions?

If Mandrake wants to be broken, let them - and tell them.

-- 
Trond Eivind Glomsrød
Red Hat, Inc.


Re: RPM building (was regression on RedHat)

From
Thomas Lockhart
Date:
> If Mandrake wants to be broken, let them - and tell them.

They know ;) But just as with RH, they build ~1500 packages, so it is
probably not realistic to get them to change their build standards over
one misbehavior in one package.

The goal here is to get PostgreSQL to work well for as many platforms as
possible. Heck, we even build for M$ ;)

So, I'm still looking for the best way to add a compile flag while
making it clear that it is for one distro only. Of course, it would be
possible to just add it at the end of the flags, but it would be nice to
do that only when necessary.

Regards.
                    - Thomas


RE: Beta 6 Regression results on Redat 7.0.

From
"Mikheev, Vadim"
Date:
> I've been running one backend doing repeated iterations of
> 
> CREATE TABLE temptest(col int);
> INSERT INTO temptest VALUES (1);
> 
> CREATE TEMP TABLE temptest(col int);
> INSERT INTO temptest VALUES (2);
> SELECT * FROM temptest;
> DROP TABLE temptest;
> 
> SELECT * FROM temptest;
> DROP TABLE temptest;
> 
> and another one doing repeated CHECKPOINTs.  I've already gotten a
> couple occurrences of Lamar's failure.

I wasn't able to reproduce failure with current sources.

Vadim


Re: RPM building (was regression on RedHat)

From
Peter Eisentraut
Date:
Thomas Lockhart writes:

> Mandrake (as of 7.2) still does a brain-dead mix of "-O3" and
> "-ffast-math", which is a risky and unnecessary combination according to
> the gcc folks (and which kills some of our date/time rounding). From the
> man page for gcc:
>
> -ffast-math
>  This  option  should never be turned on by any `-O' option
>  since it can result in incorrect output for programs which
>  depend on an exact implementation of IEEE  or  ANSI
>  rules/specifications for math functions.

You're reading this wrong.  What this means is:

"If you're working on GCC, do not ever think of enabling -ffast-math
implicitly by any -Ox level [since most other -fxxx options are grouped
under some -Ox], since programs that might want optimization could still
depend on correct IEEE math."

In particular, Mandrake is not wrong to compile with -O3 and -ffast-math.
The consequence would only be slightly incorrect math results, and that is
what indeed happened.

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/



Re: RPM building (was regression on RedHat)

From
Justin Clift
Date:
NO!

It's not "Mandrake" that will be broken.  Mandrake is also often used by
new Linux users who wouldn't have the slightest idea about setting GCC
options.  It'll be THEM that have broken installations if we take this
approach (as an aside, that means that WE will probably also be
answering more questions about PostgreSQL being broken on Mandrake
systems).

Isn't it better that PostgreSQL works with what it's got on a system AND
ALSO that someone notifies the Mandrake people regarding the problem?

Regards and best wishes,

Justin Clift

Trond Eivind Glomsrød wrote:
> 
<snip>
>
> If Mandrake wants to be broken, let them - and tell them.
> 
> --
> Trond Eivind Glomsrød
> Red Hat, Inc.
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)


Re: RPM building (was regression on RedHat)

From
Justin Clift
Date:
Is the right approach for the ./configure script to check for the
existence of the /etc/mandrake-release file as at least an initial
indicator that the compile is happening on Mandrake?

Regards and best wishes,

Justin Clift

Thomas Lockhart wrote:
> 
> > If Mandrake wants to be broken, let them - and tell them.
> 
> They know ;) But just as with RH, they build ~1500 packages, so it is
> probably not realistic to get them to change their build standards over
> one misbehavior in one package.
> 
> The goal here is to get PostgreSQL to work well for as many platforms as
> possible. Heck, we even build for M$ ;)
> 
> So, I'm still looking for the best way to add a compile flag while
> making it clear that it is for one distro only. Of course, it would be
> possible to just add it at the end of the flags, but it would be nice to
> do that only when necessary.
> 
> Regards.
> 
>                      - Thomas


Re: RPM building (was regression on RedHat)

From
teg@redhat.com (Trond Eivind Glomsrød)
Date:
Justin Clift <aa2@bigpond.net.au> writes:

> It's not "Mandrake" that will be broken.  Mandrake is also often used by
> new Linux users who wouldn't have the slightest idea about setting GCC
> options.  It'll be THEM that have broken installations if we take this
> approach (as an aside, that means that WE will probably also be
> answering more questions about PostgreSQL being broken on Mandrake
> systems).
> 
> Isn't it better that PostgreSQL works with what it's got on a system AND
> ALSO that someone notifies the Mandrake people regarding the problem?

Most people will use what the vendor ships - a vendor (like us) looks
into the benefits (stability, performance, compatibility) of different
packages, and makes a selection. If they've made a choice of which
options are used in their distribution, they are obviously fine with
the consequences.

-- 
Trond Eivind Glomsrød
Red Hat, Inc.


Re: RPM building (was regression on RedHat)

From
Tom Lane
Date:
Justin Clift <aa2@bigpond.net.au> writes:
>> So, I'm still looking for the best way to add a compile flag while
>> making it clear that it is for one distro only.

Since this is only an RPM problem, it should be solved in the RPM spec
file, not by hacking the configure script.  We had at least one similar
patch in the 7.0 spec file (for -fsigned-char stupidity in the RPM
configuration on LinuxPPC).  That's not needed anymore, but couldn't
you fix Mandrake the same way?
        regards, tom lane
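
The distro-conditional flag handling discussed in this sub-thread can be
sketched as follows. A real spec file would do this in shell inside the build
section, so the Python below only illustrates the logic. The function names
are hypothetical; the MACHTYPE value and the /etc/mandrake-release file are
the indicators mentioned earlier in the thread, and whether either is reliable
on every Mandrake release is an assumption.

```python
import os


def is_mandrake(environ=None, release_file="/etc/mandrake-release"):
    """Guess whether the build host is Mandrake from MACHTYPE or a release file."""
    environ = os.environ if environ is None else environ
    machtype = environ.get("MACHTYPE", "").lower()
    return "mandrake" in machtype or os.path.exists(release_file)


def adjust_opt_flags(flags, on_mandrake):
    """Leave flags untouched elsewhere; on Mandrake drop -ffast-math and
    append -fno-fast-math so the last option given wins."""
    if not on_mandrake:
        return flags
    kept = [flag for flag in flags.split() if flag != "-ffast-math"]
    kept.append("-fno-fast-math")
    return " ".join(kept)
```

Keeping the adjustment behind the distro check addresses the wish to add the
flag only where it is needed, while the appended `-fno-fast-math` acts as a
safeguard if `-ffast-math` reappears from another source.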


Re: RPM building (was regression on RedHat)

From
Thomas Lockhart
Date:
> You're reading this wrong.  What this means is:
> "If you're working on GCC, do not ever think of enabling -ffast-math
> implicitly by any -Ox level [since most other -fxxx options are grouped
> under some -Ox], since programs that might want optimization could still
> depend on correct IEEE math."
> In particular, Mandrake is not wrong to compile with -O3 and -ffast-math.
> The consequence would only be slightly incorrect math results, and that is
> what indeed happened.

?? I think we agree. It happens to be the case that slightly incorrect
results are wrong results, and that full IEEE math conformance gives
exactly correct results. For the case of date/time, the "slightly wrong"
results round up to 60.0 seconds for times on an even minute boundary,
which is just plain wrong.
                      - Thomas
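
The 60.0-second symptom can be reproduced with a toy formatter. This is a
sketch, not PostgreSQL's date/time code: `format_time` is a hypothetical
helper, and the tiny error that imprecise floating-point math might introduce
is injected by hand, since Python has no equivalent of -ffast-math.

```python
def format_time(seconds_since_midnight):
    """Split a float time of day into hours, minutes and seconds by
    truncation, then round the seconds to two decimals for display."""
    hours = int(seconds_since_midnight // 3600)
    rem = seconds_since_midnight - hours * 3600
    minutes = int(rem // 60)
    secs = rem - minutes * 60
    return f"{hours:02d}:{minutes:02d}:{secs:05.2f}"


exact = 12 * 3600 + 35 * 60.0  # 12:35:00 exactly
slightly_low = exact - 1e-7    # same instant with a tiny arithmetic error

print(format_time(exact))         # 12:35:00.00
print(format_time(slightly_low))  # 12:34:60.00
```

The minute is truncated to 34 because the value sits just below the boundary,
and the remaining 59.9999999 seconds then round up to 60.00 for display.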


Re: RPM building (was regression on RedHat)

From
Peter Eisentraut
Date:
Thomas Lockhart writes:

> ?? I think we agree. It happens to be the case that slightly incorrect
> results are wrong results, and that full IEEE math conformance gives
> exactly correct results. For the case of date/time, the "slightly wrong"
> results round up to 60.0 seconds for times on an even minute boundary,
> which is just plain wrong.

Well, you're going to have to ask a numerical analyst about this.  If you
take that stance then -ffast-math is always wrong, no matter what the
combination of other switches.  The "wrong" results might be harder to
reproduce without any optimization going on, but they could still happen.

-- 
Peter Eisentraut      peter_e@gmx.net       http://yi.org/peter-e/



Re: RPM building (was regression on RedHat)

From
Thomas Lockhart
Date:
> Well, you're going to have to ask a numerical analyst about this.  If you
> take that stance then -ffast-math is always wrong, no matter what the
> combination of other switches.  The "wrong" results might be harder to
> reproduce without any optimization going on, but they could still happen.

Grumble. OK, I'll rephrase my statement: it is not "wrong", but "does
not produce the *required* result". 

The date/time stuff relies on conventional IEEE arithmetic rounding and
truncation rules to produce the world-wide, universally accepted
conventions for date/time representation. And will do so *if* the
compiler produces math which conforms to IEEE (and many other, in my
experience) conventions for arithmetic. So, if someone actually would
want to get date/time results which conform to those conventions, and if
they would characterize that conformance as "correct", then they might
make the leap of phrase to characterize nonconformance to those
conventions as "wrong".
                 - Thomas (who is just finishing eight days of jury duty ;)