Thread: Various intermittent bugs/instability - how to debug?

Various intermittent bugs/instability - how to debug?

From
Frederik Ramm
Date:
Dear PostgreSQL community,

    I hope you can help me with a problem I'm having - I'm stuck and
don't know how to debug this further.

I have a rather large nightly process that imports a lot of data from
the OpenStreetMap project into a PostGIS database, then proceeds doing
all sorts of things - creating spatial indexes, computing bounding
boxes, doing simplification of geometries, that kind of stuff. The whole
job usually takes about five hours.

I'm running this on a Quad-Core Linux (Ubuntu, PostgreSQL 8.3) machine
with 8 GB RAM.

Every other night, the process aborts with some strange error message,
and never at the same position:

ERROR:  invalid page header in block 166406 of relation "node_tags"

ERROR:  could not open segment 2 of relation 1663/24253056/24253895
(target block 1421295656): No such file or directory

ERROR:  Unknown geometry type: 10

When I continue the process after the failure, it will usually work.

I know you all think "hardware problem" now. Of course this was my first
guess as well. I ran a memory test for a night, no results; I downgraded
do "failsafe defaults" for all BIOS timings, again no change. Ran
"cpuburn" and all sorts of other things to grill the hardware - nothing.

Then I bought an entirely new machine; similar setup, but using a
Gigabyte instead of Asus mainboard, different chipset, slightly faster
Quad-Core processor, and again 8 GB RAM and Ubuntu "Hardy" with
PostgresSQL 8.3 and matching PostGIS.

Believe it or not, this machine shows the *same* problems. It is not
100% reproducible, sometimes the job works fully, but every other day it
just breaks down with one of the funny messages like above. No memtest
errors here either.

Both machines are "consumer" quality, i.e. normal Intel processors and
not the "server" (Xeon) stock.

I am at a loss - how can I proceed? This looks like a hardware problem
alright, but so simliar problems on two so different machines? Is there
something wrong with Intel's Quad-Core CPUs?

What could I do to have a better chance of reproducing the error and
ultimately identifying the component responsible? Is there some kind of
"PostgresSQL load test", something like "cpuburn" for PostgreSQL?

Have there been other reports of intermittent problems like mine, and
does anybody have any blind guesses...?

Thanks
Frederik

--
Frederik Ramm               www.geofabrik.de
Geofabrik GmbH              Handelsregister: HRB Mannheim 703657
Rueppurrer Strasse 4        Geschaeftsfuehrung: Frederik Ramm
76137 Karlsruhe             Tel: 0721-1803560-0
frederik.ramm@geofabrik.de  Fax: 0721-1803560-9

Re: Various intermittent bugs/instability - how to debug?

From
"Joshua D. Drake"
Date:
Frederik Ramm wrote:
> Dear PostgreSQL community,

> Every other night, the process aborts with some strange error message,
> and never at the same position:
>
> ERROR:  invalid page header in block 166406 of relation "node_tags"
>
> ERROR:  could not open segment 2 of relation 1663/24253056/24253895
> (target block 1421295656): No such file or directory

> What could I do to have a better chance of reproducing the error and
> ultimately identifying the component responsible? Is there some kind of
> "PostgresSQL load test", something like "cpuburn" for PostgreSQL?
>
> Have there been other reports of intermittent problems like mine, and
> does anybody have any blind guesses...?

Hard drives. You said you replaced the machines and ran cpu burn etc...
You didn't say you checked the hard drives (or replaced them).

Joshua D. Drake


>
> Thanks
> Frederik
>


Re: Various intermittent bugs/instability - how to debug?

From
Frederik Ramm
Date:
Hi,

Joshua D. Drake wrote:
> Hard drives. You said you replaced the machines and ran cpu burn etc...
> You didn't say you checked the hard drives (or replaced them).

The new machine has new hard drives. On the old machine I had a hardware
RAID5 array made up of 6x500 GB SATA with an Areca RAID controller and
Linux LVM; the new machine has 3x750 GB SATA plugged directly into the
main board (no RAID, no LVM). So even the drives are quite different,
and it's different drivers in both cases...

Bye
Frederik

--
Frederik Ramm               www.geofabrik.de
Geofabrik GmbH              Handelsregister: HRB Mannheim 703657
Rueppurrer Strasse 4        Geschaeftsfuehrung: Frederik Ramm
76137 Karlsruhe             Tel: 0721-1803560-0
frederik.ramm@geofabrik.de  Fax: 0721-1803560-9

Re: Various intermittent bugs/instability - how to debug?

From
"Scott Marlowe"
Date:
On Tue, Sep 9, 2008 at 8:50 AM, Frederik Ramm
<frederik.ramm@geofabrik.de> wrote:
> Dear PostgreSQL community,
>
>   I hope you can help me with a problem I'm having - I'm stuck and don't
> know how to debug this further.
>
> I have a rather large nightly process that imports a lot of data from the
> OpenStreetMap project into a PostGIS database, then proceeds doing all sorts
> of things - creating spatial indexes, computing bounding boxes, doing
> simplification of geometries, that kind of stuff. The whole job usually
> takes about five hours.
>
> I'm running this on a Quad-Core Linux (Ubuntu, PostgreSQL 8.3) machine with
> 8 GB RAM.
>
> Every other night, the process aborts with some strange error message, and
> never at the same position:
>
> ERROR:  invalid page header in block 166406 of relation "node_tags"
>
> ERROR:  could not open segment 2 of relation 1663/24253056/24253895 (target
> block 1421295656): No such file or directory
>
> ERROR:  Unknown geometry type: 10
>
> When I continue the process after the failure, it will usually work.
>
> I know you all think "hardware problem" now. Of course this was my first
> guess as well. I ran a memory test for a night, no results; I downgraded do
> "failsafe defaults" for all BIOS timings, again no change. Ran "cpuburn" and
> all sorts of other things to grill the hardware - nothing.

You definitely are suffering from db corruption, and given the number
and differing type of errors, it would seem unlikely that pgsql has a
load of bugs only you are seeing.  OTOH, if the bug is hidden deep in
postgis or something, then who knows...

I'd definitely run something like bonnie++ for a few days and see if
it gets HD errors or not.

And definitely run memtest86 for a day or so and make sure you're not
getting any errors there.

Re: Various intermittent bugs/instability - how to debug?

From
"Scott Marlowe"
Date:
Also, I'd try running a large, long running pgbench test to see if you
get any errors.  If such a simple schema / operation in large scale
gets errors it's almost certainly a hardware failure.

Have you had power removed while the system was running?  With a lot
of direct attached SATA drives you can get corruption from that as
well due to linux / sata often not obeying write barriers.

Re: Various intermittent bugs/instability - how to debug?

From
Martijn van Oosterhout
Date:
On Tue, Sep 09, 2008 at 04:50:53PM +0200, Frederik Ramm wrote:
> I have a rather large nightly process that imports a lot of data from
> the OpenStreetMap project into a PostGIS database, then proceeds doing
> all sorts of things - creating spatial indexes, computing bounding
> boxes, doing simplification of geometries, that kind of stuff. The whole
> job usually takes about five hours.

Which program is this?

> ERROR:  invalid page header in block 166406 of relation "node_tags"

After this happens, can you find the relfilenode of the node_tags table
and extract the relevent block? Something like:

dd if=<dboid>/<relfilenode> bs=8k skip=166406 count=1 of=/tmp/block

What sort of data does it contain?

> ERROR:  could not open segment 2 of relation 1663/24253056/24253895
> (target block 1421295656): No such file or directory

Does any file of that name exist?

ls 1663/24253056/24253895*

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Attachment

Re: Various intermittent bugs/instability - how to debug?

From
Tom Lane
Date:
"Scott Marlowe" <scott.marlowe@gmail.com> writes:
> On Tue, Sep 9, 2008 at 8:50 AM, Frederik Ramm
>> I hope you can help me with a problem I'm having - I'm stuck and don't
>> know how to debug this further.

> You definitely are suffering from db corruption, and given the number
> and differing type of errors, it would seem unlikely that pgsql has a
> load of bugs only you are seeing.  OTOH, if the bug is hidden deep in
> postgis or something, then who knows...

Yeah, I would wonder about a postgis bug.  Another thing to look into
is kernel bugs --- I gather that these boxes are running more or less
the same kernel?  If there's something about your hardware that the
kernel doesn't quite grok, all sorts of fun could ensue.

            regards, tom lane

Re: Various intermittent bugs/instability - how to debug?

From
Mark Cave-Ayland
Date:
Frederik Ramm wrote:

> Dear
> PostgreSQL community,
>
>    I hope you can help me with a problem I'm having - I'm stuck and
> don't know how to debug this further.
>
> I have a rather large nightly process that imports a lot of data from
> the OpenStreetMap project into a PostGIS database, then proceeds doing
> all sorts of things - creating spatial indexes, computing bounding
> boxes, doing simplification of geometries, that kind of stuff. The whole
> job usually takes about five hours.
>
> I'm running this on a Quad-Core Linux (Ubuntu, PostgreSQL 8.3) machine
> with 8 GB RAM.
>
> Every other night, the process aborts with some strange error message,
> and never at the same position:
>
> ERROR:  invalid page header in block 166406 of relation "node_tags"
>
> ERROR:  could not open segment 2 of relation 1663/24253056/24253895
> (target block 1421295656): No such file or directory
>
> ERROR:  Unknown geometry type: 10
>
> When I continue the process after the failure, it will usually work.
>
> I know you all think "hardware problem" now. Of course this was my first
> guess as well. I ran a memory test for a night, no results; I downgraded
> do "failsafe defaults" for all BIOS timings, again no change. Ran
> "cpuburn" and all sorts of other things to grill the hardware - nothing.
>
> Then I bought an entirely new machine; similar setup, but using a
> Gigabyte instead of Asus mainboard, different chipset, slightly faster
> Quad-Core processor, and again 8 GB RAM and Ubuntu "Hardy" with
> PostgresSQL 8.3 and matching PostGIS.
>
> Believe it or not, this machine shows the *same* problems. It is not
> 100% reproducible, sometimes the job works fully, but every other day it
> just breaks down with one of the funny messages like above. No memtest
> errors here either.
>
> Both machines are "consumer" quality, i.e. normal Intel processors and
> not the "server" (Xeon) stock.
>
> I am at a loss - how can I proceed? This looks like a hardware problem
> alright, but so simliar problems on two so different machines? Is there
> something wrong with Intel's Quad-Core CPUs?
>
> What could I do to have a better chance of reproducing the error and
> ultimately identifying the component responsible? Is there some kind of
> "PostgresSQL load test", something like "cpuburn" for PostgreSQL?
>
> Have there been other reports of intermittent problems like mine, and
> does anybody have any blind guesses...?
>
> Thanks
> Frederik


Hi Frederik,

We did find a memory clobber in the PostGIS ANALYZE routine a while
back, but the fix hasn't yet made it into a release.

If you are building from source, please can you try applying the patch
here: http://code.google.com/p/postgis/issues/detail?id=43 and reporting
back whether it helps or not?


ATB,

Mark.

--
Mark Cave-Ayland
Sirius Corporation - The Open Source Experts
http://www.siriusit.co.uk
T: +44 870 608 0063

Re: Various intermittent bugs/instability - how to debug?

From
Frederik Ramm
Date:
Hi,

    just to give an update on this:

Frederik Ramm wrote:
> Every other night, the process aborts with some strange error message,
> and never at the same position:

[...]

Turns out it *was* a RAM defect on one of the machines. memtest86 ran
for a day and didn't detect it, but when I started making copies of a
100 GB file and m5dsumming it afterwards, I'd get one or two bad copies
per day. I then juggled RAM chips until I had identified the defective
one. As you can imagine, this took a while.

Lesson learnt: memtest86 running for a day and not finding an error does
not mean your RAM is ok!

I still have occasional problems on the *other* machine (if you remember
my original post, I had similar errors on two different machines) and
even massive data copying did not show any error there but at least a
PostGIS bug seems extremely unlikely now (and that's still running stock
packages, not patched).

Bye
Frederik

--
Frederik Ramm               www.geofabrik.de
Geofabrik GmbH              Handelsregister: HRB Mannheim 703657
Rueppurrer Strasse 4        Geschaeftsfuehrung: Frederik Ramm
76137 Karlsruhe             Tel: 0721-1803560-0
frederik.ramm@geofabrik.de  Fax: 0721-1803560-9