Thread: ReadBuffer(P_NEW) versus valid buffers
Some off-list investigation of Dan Kavan's data loss problem,
http://archives.postgresql.org/pgsql-admin/2006-09/msg00092.php
has led to the conclusion that it seems to be a kernel bug. The smoking gun is this strace excerpt:

> lseek(10, 0, SEEK_END) = 913072128
> write(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
> lseek(10, 0, SEEK_END) = 913080320
> write(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
> lseek(10, 0, SEEK_END) = 913088512
> write(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
> lseek(10, 0, SEEK_END) = 913088512
> write(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
> lseek(10, 0, SEEK_END) = 913096704
> write(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192

Note the lseek results --- surely each successive result ought to be 8K more than the one before, but the fourth in this extract seems to have forgotten about the immediately preceding write().

These calls are coming from successive ReadBuffer(rel, P_NEW) operations, which should just extend the file each time. But the incorrect lseek result is causing ReadBuffer to re-find the buffer we had just finished filling with a page of data, and that leads it to this conclusion:

	/*
	 * We get here only in the corner case where we are trying to extend
	 * the relation but we found a pre-existing buffer marked BM_VALID.
	 * (This can happen because mdread doesn't complain about reads
	 * beyond EOF --- which is arguably bogus, but changing it seems
	 * tricky.)  We *must* do smgrextend before succeeding, else the
	 * page will not be reserved by the kernel, and the next P_NEW call
	 * will decide to return the same page.  Clear the BM_VALID bit,
	 * do the StartBufferIO call that BufferAlloc didn't, and proceed.
	 */

So ReadBuffer without hesitation zeroes out the page of data we just filled, and returns it for re-filling. There went some tuples :-(

Although this is clearly Not Our Bug, it's annoying that ReadBuffer falls into the trap so easily instead of complaining. I'm still disinclined to try to change the behavior of mdread(), but what I am considering doing is adding a check here to error out if not PageIsNew. AFAICS, if we do find a buffer for a page supposedly past EOF, it should be zero-filled because that's what mdread returns in this case. So this change would prevent Dan's silent-overwrite scenario without changing the behavior for any legitimate case.

Thoughts, problems, better ideas?

			regards, tom lane
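The proposed check can be illustrated with a minimal standalone sketch. This is not PostgreSQL code: SketchPageHeader, sketch_page_is_new() and sketch_extend_over_valid_buffer() are hypothetical stand-ins invented for illustration, and the header layout is deliberately simplified. The only assumptions carried over from the real system are the 8192-byte block size seen in the strace and the idea that a "new" page is one whose header still has pd_upper equal to zero, which is how PageIsNew is understood to work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Block size matching the 8192-byte writes in the strace excerpt. */
#define BLCKSZ 8192

/* Simplified, hypothetical page header: only the fields this sketch needs. */
typedef struct SketchPageHeader
{
    uint16_t pd_lower;   /* offset to start of free space */
    uint16_t pd_upper;   /* offset to end of free space; 0 on a never-initialized page */
} SketchPageHeader;

typedef enum
{
    EXTEND_OK,           /* page was still new, safe to zero and hand out */
    EXTEND_ERR_NOT_NEW   /* page already holds data, refuse to overwrite it */
} ExtendResult;

/* Stand-in for PageIsNew: a page is "new" if its header was never filled in. */
static bool
sketch_page_is_new(const unsigned char *page)
{
    SketchPageHeader hdr;

    memcpy(&hdr, page, sizeof hdr);
    return hdr.pd_upper == 0;
}

/*
 * Stand-in for the corner case in the extend path: a buffer for the
 * supposedly new block already exists and is marked valid.  Instead of
 * zeroing it unconditionally, report an error if it contains data.
 */
static ExtendResult
sketch_extend_over_valid_buffer(unsigned char *page)
{
    assert(page != NULL);

    if (!sketch_page_is_new(page))
        return EXTEND_ERR_NOT_NEW;   /* the file length we were told is not believable */

    memset(page, 0, BLCKSZ);         /* legitimate case: page was zero-filled anyway */
    return EXTEND_OK;
}
```

In the legitimate case (a zero-filled page read from beyond EOF) the function behaves exactly as before; in Dan's scenario the already-filled page would produce EXTEND_ERR_NOT_NEW, which a caller would turn into an error report instead of a silent overwrite.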
Tom Lane wrote:
> So ReadBuffer without hesitation zeroes out the page of data we just
> filled, and returns it for re-filling. There went some tuples :-(
>
> Although this is clearly Not Our Bug, it's annoying that ReadBuffer
> falls into the trap so easily instead of complaining. I'm still
> disinclined to try to change the behavior of mdread(), but what I am
> considering doing is adding a check here to error out if not PageIsNew.
> AFAICS, if we do find a buffer for a page supposedly past EOF, it should
> be zero-filled because that's what mdread returns in this case. So this
> change would prevent Dan's silent-overwrite scenario without changing the
> behavior for any legitimate case.
>
> Thoughts, problems, better ideas?

The check looks good - are we chasing up the Linux kernel (or Suse) guys to get the bug investigated?

Cheers

Mark
Mark Kirkwood <markir@paradise.net.nz> writes:
> The check looks good - are we chasing up the Linux kernel (or Suse) guys
> to get the bug investigated?

I asked around inside Red Hat but haven't gotten any responses yet ... seeing that it's a rather old Suse kernel, I can understand that RH's kernel hackers might not be too excited about investigating. (Alan Cox, for one, has got other things to worry about this weekend: http://zeniv.linux.org.uk/%7etelsa/boom/ )

I believe Dan's busy updating his kernel --- if a current Suse kernel still shows the problem then he should definitely file a bug with them.

			regards, tom lane
Tom Lane wrote:
> Mark Kirkwood <markir@paradise.net.nz> writes:
>> The check looks good - are we chasing up the Linux kernel (or Suse) guys
>> to get the bug investigated?
>
> I asked around inside Red Hat but haven't gotten any responses yet ...
> seeing that it's a rather old Suse kernel, I can understand that RH's
> kernel hackers might not be too excited about investigating. (Alan Cox,
> for one, has got other things to worry about this weekend:
> http://zeniv.linux.org.uk/%7etelsa/boom/

Uhmm... doh?

Joshua D. Drake

> I believe Dan's busy updating his kernel --- if a current Suse kernel
> still shows the problem then he should definitely file a bug with them.
>
> regards, tom lane
Joshua D. Drake wrote:
> Tom Lane wrote:
> >I asked around inside Red Hat but haven't gotten any responses yet ...
> >seeing that it's a rather old Suse kernel, I can understand that RH's
> >kernel hackers might not be too excited about investigating. (Alan Cox,
> >for one, has got other things to worry about this weekend:
> >http://zeniv.linux.org.uk/%7etelsa/boom/
>
> Uhmm... doh?

Telsa got "fired" for buying IBM?

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On Sun, Sep 24, 2006 at 12:26:55AM -0400, Alvaro Herrera wrote:
> Joshua D. Drake wrote:
> > Tom Lane wrote:
> > >I asked around inside Red Hat but haven't gotten any responses yet ...
> > >seeing that it's a rather old Suse kernel, I can understand that RH's
> > >kernel hackers might not be too excited about investigating. (Alan Cox,
> > >for one, has got other things to worry about this weekend:
> > >http://zeniv.linux.org.uk/%7etelsa/boom/
> >
> > Uhmm... doh?
>
> Telsa got "fired" for buying IBM?

You should be fired for that pun. :P

--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)