Thread: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory
ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory
From
Mike Williams
Date:
Hi guys, We upgraded from 8.3.5 to 8.3.9 yesterday, after an extended period of testing, to 8.3.10 today in hope of fixing it, and we're getting the following error on occasion. ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 2171336): No such file or directory CONTEXT: SQL statement "SELECT day FROM min_stats WHERE day<= $1 AND unique_ip=-1 ORDER BY day DESC LIMIT 1" PL/pgSQL function "update_unique_ips" line 13 at SQL statement update_unique_ips() is called once per minute, but we get anywhere from one error per minute, to one error per hour. Relation 743420 in database 743352 is 441270272 bytes long and growing by a few thousand bytes here and there. Obviously, as we've gone from 8.3.9 to 8.3.10, we're restarted postgres a few times. We have also REINDEXd each table, the database, dropped and re-created all the indexes. I've even pg_dump'd the database, and loaded it into a fresh database. The relation numbers changed, but the 1663 bit remained the same. Best I can tell it's this bit of code from the beginning of update_unique_ips() that's doing it, but it seems completely innocuous to me. CREATE FUNCTION update_unique_ips() RETURNS timestamp without time zone AS $$ DECLARE last_day timestamp; day_for_update timestamp; BEGIN -- declare last_day SELECT INTO last_day day-interval '30 min' as day FROM min_stats ORDER BY day DESC LIMIT 1; SELECT INTO day_for_update day FROM min_stats WHERE day<=last_day AND unique_ip=-1 ORDER BY day DESC LIMIT 1; Could anyone help us out here please? Thanks -- Mike Williams
Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory
From
Tom Lane
Date:
Mike Williams <mike.williams@comodo.com> writes: > We upgraded from 8.3.5 to 8.3.9 yesterday, after an extended period of > testing, to 8.3.10 today in hope of fixing it, and we're getting the following > error on occasion. > ERROR: could not open segment 1 of relation 1663/743352/743420 (target block > 2171336): No such file or directory > CONTEXT: SQL statement "SELECT day FROM min_stats WHERE day<= $1 AND > unique_ip=-1 ORDER BY day DESC LIMIT 1" > PL/pgSQL function "update_unique_ips" line 13 at SQL statement > update_unique_ips() is called once per minute, but we get anywhere from one > error per minute, to one error per hour. > Relation 743420 in database 743352 is 441270272 bytes long and growing by a > few thousand bytes here and there. > Obviously, as we've gone from 8.3.9 to 8.3.10, we're restarted postgres a few > times. We have also REINDEXd each table, the database, dropped and re-created > all the indexes. > I've even pg_dump'd the database, and loaded it into a fresh database. The > relation numbers changed, but the 1663 bit remained the same. Wow. The symptom seems to be what you'd get from a corrupt tuple pointer in an index, but it's hard to believe that such a thing would persist over reindexes and even reloads. Is the "target block" number always the same, or does that change? Can you reproduce the problem on a different machine? What platform are you running on? regards, tom lane
Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory
From
"Mike Williams"
Date:
On Sun, March 28, 2010 7:20 pm, Tom Lane wrote: > Mike Williams <mike.williams@comodo.com> writes: >> ERROR: could not open segment 1 of relation 1663/743352/743420 (target >> block >> 2171336): No such file or directory >> CONTEXT: SQL statement "SELECT day FROM min_stats WHERE day<= $1 AND >> unique_ip=-1 ORDER BY day DESC LIMIT 1" >> PL/pgSQL function "update_unique_ips" line 13 at SQL statement > Wow. The symptom seems to be what you'd get from a corrupt tuple > pointer in an index, but it's hard to believe that such a thing would > persist over reindexes and even reloads. Is the "target block" number > always the same, or does that change? Can you reproduce the problem on > a different machine? What platform are you running on? While it's always pleasing to find an error condition that makes an expert go "Wow", it is also quite troubling! :) The "target block" stays the same for extended periods, but does change. Oddly it's been quite for a while now, half an hour or so ago it had been going nearly every minute for a good hour. For the last 30 times the block was 11387925, before then 6634333 once, then 15520041 for 4 times, 6667190 before that for a while. And just as I'm about to hit send, 3 more errors. All block 14528541. Where does the 1663 number come from? <datadir/base>/743352/743420 is a real file, but I don't have any file 1663. The server is Gentoo, and I don't have an identical server to use. This machines HA partner is staying at 8.3.5 until we're completely and entirely sure 8.3.10 works right. -- Mike Williams
Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory
From
Tom Lane
Date:
"Mike Williams" <mike.williams@comodo.com> writes: > The "target block" stays the same for extended periods, but does change. Fascinating. Don't know what it means, but it's fascinating. > Where does the 1663 number come from? > <datadir/base>/743352/743420 is a real file, but I don't have any file 1663. 1663 is the OID of the pg_default tablespace; that's never going to change unless you move the table to a non-default tablespace. > The server is Gentoo, and I don't have an identical server to use. This > machines HA partner is staying at 8.3.5 until we're completely and > entirely sure 8.3.10 works right. I wasn't really suggesting that you need an *identical* machine. If this is a software problem, as seems moderately likely at this point, it should be possible to reproduce it on a different machine. Even more to the point, it's going to be difficult to find the problem without tracing through the fault case with a debugger --- which isn't something you're going to want to do on a production server. Have you got a test machine you can load the data onto? Preferably one you could give a PG developer access to? <aside>Using gentoo for production servers is generally viewed with suspicion around here; gentoo is a fine distro but, um, not especially focused on stability. Still, there's no obvious reason to blame this on the distro.</aside> regards, tom lane
Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory
From
"Mike Williams"
Date:
On Sun, March 28, 2010 8:49 pm, Tom Lane wrote: >> The server is Gentoo, and I don't have an identical server to use. This >> machines HA partner is staying at 8.3.5 until we're completely and >> entirely sure 8.3.10 works right. > > I wasn't really suggesting that you need an *identical* machine. If > this is a software problem, as seems moderately likely at this point, > it should be possible to reproduce it on a different machine. Even more > to the point, it's going to be difficult to find the problem without > tracing through the fault case with a debugger --- which isn't something > you're going to want to do on a production server. Have you got a test > machine you can load the data onto? Preferably one you could give a PG > developer access to? We do have a test system, and the only significant difference from production is the kernel. Test are para-virt VMs with "regular" kernels, production are real machines with hardened kernels (grsec+pax). I got prod onto a regular kernel and nothing in nearly an hour. 8.3.5 has worked forever and a day on this type of setup. -- Mike Williams
Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory
From
ALEXANDER JOSE
Date:
Good morning
You have to revise the base folder to create postgres when you perform the initial installation and verify the relation fallanado that I spend And it turned out he had a folder in C: \ Program Files \ PostgreSQL \ 8.2 \ data \ base folder exists the number 1, this is where the file belongs to the database that lost the connection does not exist, determine why and try to copy again, delete the database that you are creating and then think again.
Alexander Angel
Venezuela
Invite your mail contacts to join your friends list with Windows Live Spaces. It's easy! Try it!
You have to revise the base folder to create postgres when you perform the initial installation and verify the relation fallanado that I spend And it turned out he had a folder in C: \ Program Files \ PostgreSQL \ 8.2 \ data \ base folder exists the number 1, this is where the file belongs to the database that lost the connection does not exist, determine why and try to copy again, delete the database that you are creating and then think again.
Alexander Angel
Venezuela
Invite your mail contacts to join your friends list with Windows Live Spaces. It's easy! Try it!
Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory
From
Alex Hunsaker
Date:
On Sun, Mar 28, 2010 at 16:27, Mike Williams <mike.williams@comodo.com> wrote: > Test are para-virt VMs with "regular" kernels, production are real > machines with hardened kernels (grsec+pax). Ive seen this error on a few boxes around here, using a non grsec kernel fixes it. I never bothered to report it because I cant reproduce it and it goes away without grsec. (2.6.30-32 IIRC, have not tired the latest 33). Interestingly I have the *exact* same hardware and setup (kernel/userspace/postgres etc) on a test machine that I've never been able to trigger it on.
Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory
From
Mike Williams
Date:
On Monday 29 March 2010 20:26:08 Alex Hunsaker wrote: > > Test are para-virt VMs with "regular" kernels, production are real > > machines with hardened kernels (grsec+pax). > > Ive seen this error on a few boxes around here, using a non grsec > kernel fixes it. I never bothered to report it because I cant > reproduce it and it goes away without grsec. (2.6.30-32 IIRC, have > not tired the latest 33). Interestingly I have the *exact* same > hardware and setup (kernel/userspace/postgres etc) on a test machine > that I've never been able to trigger it on. Thanks Alex, good to know I've not screwed up the kernel somehow. I've been using 2.6.32 with grsecurity-2.1.14-2.6.32.9-201002231820 applied. -- Mike Williams
Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory
From
Alex Hunsaker
Date:
On Tue, Mar 30, 2010 at 04:16, Mike Williams <mike.williams@comodo.com> wrote: > Thanks Alex, good to know I've not screwed up the kernel somehow. > > I've been using 2.6.32 with grsecurity-2.1.14-2.6.32.9-201002231820 applied. Looks like the first instance I had of this problem was with 2.6.31.1-rc1-grsec. I know I tried 2.6.32-grsec and various 2.6.32.X-grsecs but all those had this issue at some point. Currently im on a mostly stock 2.6.33.1 with no problems. I have not had the nerve to try a -grsec kernel on it again. For reference here are the errors I got: could not open segment 3 of relation base/4440720/8003730 COPY public.page_loads (cgi, content_length, date_created, defunct, host, ip, page_load_id, protocol, proxy_ip, referrer, request_method, sessionid, url, user_id, audit_tid, user_agent_id, action, server) TO '/tmp/blah.sql'; ERROR: invalid memory alloc request size 18446744073709551613 There were more could not open segment errors... but I seem to have lost them. Normally I would think the above is corrupt data, but it would sometimes work. It *always* worked on the non grsec kernel. So instead it smells like bad ram, well its got ecc ram and survived multiple runs of memtest, memtest86+ various versions. [ Yeah I know people including me have seen ram that passes all that and is still bad ] Since you are having similar problems with a -grsec kernel sounds like there might be some kind of memory corruption bug with it. I would recommend trying a stock kernel and seeing if the problem goes away. I also think the general attitude here is if you run crazy security patches you get to keep both pieces. :) Another fact that seemed to point to bad ram or some kind of kernel corruption was trying to find the bad row COPY reported above: SELECT count(*) from (select * from page_loads order by page_load_id desc limit 937980) as foo; ERROR: could not open segment 3 of relation base/4440720/8003730 (target block 4680336): No such file or directory SELECT count(*) from ( select * from page_loads order by page_load_id desc limit 937970) as foo; count -------- 937970 <70-79 snipped all worked> SELECT count(*) from (select * from page_loads order by page_load_id desc limit 937979) as foo; count -------- 937979 -- Uhh this was just broken... SELECT count(*) from (select * order by page_load_id desc limit 937980) as foo; count -------- 937980 I thought I had some stacktraces... but they are not in my notes... I do remember tracing through them and coming to the conclusion that its most likely some kind of kernel bug. (That error can only happen if we try to open a file that does not exist, but we can only get that far if the file existed or some such) Sorry Im a bit hazy this was back in September.