Thread: Various intermittent bugs/instability - how to debug?
Dear PostgreSQL community, I hope you can help me with a problem I'm having - I'm stuck and don't know how to debug this further. I have a rather large nightly process that imports a lot of data from the OpenStreetMap project into a PostGIS database, then proceeds doing all sorts of things - creating spatial indexes, computing bounding boxes, doing simplification of geometries, that kind of stuff. The whole job usually takes about five hours. I'm running this on a Quad-Core Linux (Ubuntu, PostgreSQL 8.3) machine with 8 GB RAM. Every other night, the process aborts with some strange error message, and never at the same position: ERROR: invalid page header in block 166406 of relation "node_tags" ERROR: could not open segment 2 of relation 1663/24253056/24253895 (target block 1421295656): No such file or directory ERROR: Unknown geometry type: 10 When I continue the process after the failure, it will usually work. I know you all think "hardware problem" now. Of course this was my first guess as well. I ran a memory test for a night, no results; I downgraded do "failsafe defaults" for all BIOS timings, again no change. Ran "cpuburn" and all sorts of other things to grill the hardware - nothing. Then I bought an entirely new machine; similar setup, but using a Gigabyte instead of Asus mainboard, different chipset, slightly faster Quad-Core processor, and again 8 GB RAM and Ubuntu "Hardy" with PostgresSQL 8.3 and matching PostGIS. Believe it or not, this machine shows the *same* problems. It is not 100% reproducible, sometimes the job works fully, but every other day it just breaks down with one of the funny messages like above. No memtest errors here either. Both machines are "consumer" quality, i.e. normal Intel processors and not the "server" (Xeon) stock. I am at a loss - how can I proceed? This looks like a hardware problem alright, but so simliar problems on two so different machines? Is there something wrong with Intel's Quad-Core CPUs? What could I do to have a better chance of reproducing the error and ultimately identifying the component responsible? Is there some kind of "PostgresSQL load test", something like "cpuburn" for PostgreSQL? Have there been other reports of intermittent problems like mine, and does anybody have any blind guesses...? Thanks Frederik -- Frederik Ramm www.geofabrik.de Geofabrik GmbH Handelsregister: HRB Mannheim 703657 Rueppurrer Strasse 4 Geschaeftsfuehrung: Frederik Ramm 76137 Karlsruhe Tel: 0721-1803560-0 frederik.ramm@geofabrik.de Fax: 0721-1803560-9
Frederik Ramm wrote: > Dear PostgreSQL community, > Every other night, the process aborts with some strange error message, > and never at the same position: > > ERROR: invalid page header in block 166406 of relation "node_tags" > > ERROR: could not open segment 2 of relation 1663/24253056/24253895 > (target block 1421295656): No such file or directory > What could I do to have a better chance of reproducing the error and > ultimately identifying the component responsible? Is there some kind of > "PostgresSQL load test", something like "cpuburn" for PostgreSQL? > > Have there been other reports of intermittent problems like mine, and > does anybody have any blind guesses...? Hard drives. You said you replaced the machines and ran cpu burn etc... You didn't say you checked the hard drives (or replaced them). Joshua D. Drake > > Thanks > Frederik >
Hi, Joshua D. Drake wrote: > Hard drives. You said you replaced the machines and ran cpu burn etc... > You didn't say you checked the hard drives (or replaced them). The new machine has new hard drives. On the old machine I had a hardware RAID5 array made up of 6x500 GB SATA with an Areca RAID controller and Linux LVM; the new machine has 3x750 GB SATA plugged directly into the main board (no RAID, no LVM). So even the drives are quite different, and it's different drivers in both cases... Bye Frederik -- Frederik Ramm www.geofabrik.de Geofabrik GmbH Handelsregister: HRB Mannheim 703657 Rueppurrer Strasse 4 Geschaeftsfuehrung: Frederik Ramm 76137 Karlsruhe Tel: 0721-1803560-0 frederik.ramm@geofabrik.de Fax: 0721-1803560-9
On Tue, Sep 9, 2008 at 8:50 AM, Frederik Ramm <frederik.ramm@geofabrik.de> wrote: > Dear PostgreSQL community, > > I hope you can help me with a problem I'm having - I'm stuck and don't > know how to debug this further. > > I have a rather large nightly process that imports a lot of data from the > OpenStreetMap project into a PostGIS database, then proceeds doing all sorts > of things - creating spatial indexes, computing bounding boxes, doing > simplification of geometries, that kind of stuff. The whole job usually > takes about five hours. > > I'm running this on a Quad-Core Linux (Ubuntu, PostgreSQL 8.3) machine with > 8 GB RAM. > > Every other night, the process aborts with some strange error message, and > never at the same position: > > ERROR: invalid page header in block 166406 of relation "node_tags" > > ERROR: could not open segment 2 of relation 1663/24253056/24253895 (target > block 1421295656): No such file or directory > > ERROR: Unknown geometry type: 10 > > When I continue the process after the failure, it will usually work. > > I know you all think "hardware problem" now. Of course this was my first > guess as well. I ran a memory test for a night, no results; I downgraded do > "failsafe defaults" for all BIOS timings, again no change. Ran "cpuburn" and > all sorts of other things to grill the hardware - nothing. You definitely are suffering from db corruption, and given the number and differing type of errors, it would seem unlikely that pgsql has a load of bugs only you are seeing. OTOH, if the bug is hidden deep in postgis or something, then who knows... I'd definitely run something like bonnie++ for a few days and see if it gets HD errors or not. And definitely run memtest86 for a day or so and make sure you're not getting any errors there.
Also, I'd try running a large, long running pgbench test to see if you get any errors. If such a simple schema / operation in large scale gets errors it's almost certainly a hardware failure. Have you had power removed while the system was running? With a lot of direct attached SATA drives you can get corruption from that as well due to linux / sata often not obeying write barriers.
On Tue, Sep 09, 2008 at 04:50:53PM +0200, Frederik Ramm wrote: > I have a rather large nightly process that imports a lot of data from > the OpenStreetMap project into a PostGIS database, then proceeds doing > all sorts of things - creating spatial indexes, computing bounding > boxes, doing simplification of geometries, that kind of stuff. The whole > job usually takes about five hours. Which program is this? > ERROR: invalid page header in block 166406 of relation "node_tags" After this happens, can you find the relfilenode of the node_tags table and extract the relevent block? Something like: dd if=<dboid>/<relfilenode> bs=8k skip=166406 count=1 of=/tmp/block What sort of data does it contain? > ERROR: could not open segment 2 of relation 1663/24253056/24253895 > (target block 1421295656): No such file or directory Does any file of that name exist? ls 1663/24253056/24253895* Hope this helps, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
Attachment
"Scott Marlowe" <scott.marlowe@gmail.com> writes: > On Tue, Sep 9, 2008 at 8:50 AM, Frederik Ramm >> I hope you can help me with a problem I'm having - I'm stuck and don't >> know how to debug this further. > You definitely are suffering from db corruption, and given the number > and differing type of errors, it would seem unlikely that pgsql has a > load of bugs only you are seeing. OTOH, if the bug is hidden deep in > postgis or something, then who knows... Yeah, I would wonder about a postgis bug. Another thing to look into is kernel bugs --- I gather that these boxes are running more or less the same kernel? If there's something about your hardware that the kernel doesn't quite grok, all sorts of fun could ensue. regards, tom lane
Frederik Ramm wrote: > Dear > PostgreSQL community, > > I hope you can help me with a problem I'm having - I'm stuck and > don't know how to debug this further. > > I have a rather large nightly process that imports a lot of data from > the OpenStreetMap project into a PostGIS database, then proceeds doing > all sorts of things - creating spatial indexes, computing bounding > boxes, doing simplification of geometries, that kind of stuff. The whole > job usually takes about five hours. > > I'm running this on a Quad-Core Linux (Ubuntu, PostgreSQL 8.3) machine > with 8 GB RAM. > > Every other night, the process aborts with some strange error message, > and never at the same position: > > ERROR: invalid page header in block 166406 of relation "node_tags" > > ERROR: could not open segment 2 of relation 1663/24253056/24253895 > (target block 1421295656): No such file or directory > > ERROR: Unknown geometry type: 10 > > When I continue the process after the failure, it will usually work. > > I know you all think "hardware problem" now. Of course this was my first > guess as well. I ran a memory test for a night, no results; I downgraded > do "failsafe defaults" for all BIOS timings, again no change. Ran > "cpuburn" and all sorts of other things to grill the hardware - nothing. > > Then I bought an entirely new machine; similar setup, but using a > Gigabyte instead of Asus mainboard, different chipset, slightly faster > Quad-Core processor, and again 8 GB RAM and Ubuntu "Hardy" with > PostgresSQL 8.3 and matching PostGIS. > > Believe it or not, this machine shows the *same* problems. It is not > 100% reproducible, sometimes the job works fully, but every other day it > just breaks down with one of the funny messages like above. No memtest > errors here either. > > Both machines are "consumer" quality, i.e. normal Intel processors and > not the "server" (Xeon) stock. > > I am at a loss - how can I proceed? This looks like a hardware problem > alright, but so simliar problems on two so different machines? Is there > something wrong with Intel's Quad-Core CPUs? > > What could I do to have a better chance of reproducing the error and > ultimately identifying the component responsible? Is there some kind of > "PostgresSQL load test", something like "cpuburn" for PostgreSQL? > > Have there been other reports of intermittent problems like mine, and > does anybody have any blind guesses...? > > Thanks > Frederik Hi Frederik, We did find a memory clobber in the PostGIS ANALYZE routine a while back, but the fix hasn't yet made it into a release. If you are building from source, please can you try applying the patch here: http://code.google.com/p/postgis/issues/detail?id=43 and reporting back whether it helps or not? ATB, Mark. -- Mark Cave-Ayland Sirius Corporation - The Open Source Experts http://www.siriusit.co.uk T: +44 870 608 0063
Hi, just to give an update on this: Frederik Ramm wrote: > Every other night, the process aborts with some strange error message, > and never at the same position: [...] Turns out it *was* a RAM defect on one of the machines. memtest86 ran for a day and didn't detect it, but when I started making copies of a 100 GB file and m5dsumming it afterwards, I'd get one or two bad copies per day. I then juggled RAM chips until I had identified the defective one. As you can imagine, this took a while. Lesson learnt: memtest86 running for a day and not finding an error does not mean your RAM is ok! I still have occasional problems on the *other* machine (if you remember my original post, I had similar errors on two different machines) and even massive data copying did not show any error there but at least a PostGIS bug seems extremely unlikely now (and that's still running stock packages, not patched). Bye Frederik -- Frederik Ramm www.geofabrik.de Geofabrik GmbH Handelsregister: HRB Mannheim 703657 Rueppurrer Strasse 4 Geschaeftsfuehrung: Frederik Ramm 76137 Karlsruhe Tel: 0721-1803560-0 frederik.ramm@geofabrik.de Fax: 0721-1803560-9