smgrsettransient mechanism is full of bugs - Mailing list pgsql-hackers

From Tom Lane
Subject smgrsettransient mechanism is full of bugs
Date
Msg-id 7924.1350263908@sss.pgh.pa.us
Whole thread Raw
Responses Re: smgrsettransient mechanism is full of bugs  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: smgrsettransient mechanism is full of bugs  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
I got a bit suspicious of the transient-file mechanism introduced in
commit fba105b1099f4f5fa7283bb17cba6fed2baa8d0c after noticing that
CleanupTempFiles seemed to take an unreasonable amount of time in a
test case that didn't involve any temp files, cf
http://archives.postgresql.org/message-id/7110.1349392471@sss.pgh.pa.us

After further review, I have become convinced that in fact it's
completely broken and needs to be redone from scratch.  The temp-file
marking at the fd.c level can easily get out of sync with the marking
at the smgr level, and that marking isn't too consistent with reality
either, which means we have all of the following problems:

(1) It can leak kernel descriptors, as reported in
http://archives.postgresql.org/message-id/B9BEA448-978F-4A14-A088-3FD82214FFAC@pvv.ntnu.no
The triggering sequence for that appears to be:* Transaction 1 does a blind write, sets FD_XACT_TRANSIENT.* At
transactionclose, we close kernel FD and clear  FD_XACT_TRANSIENT in the VFD, but the VFD, the smgr relation,  and the
md.cdata structure are all still there.* Transaction 2 does another blind write on same file.  This  does not cause
FD_XACT_TRANSIENTto get set because md.c  data structure already exists.* Now we are carrying a "leaked" kernel FD that
willnever get  closed short of a CacheInvalidateSmgr message.  Which doesn't  happen in a dropdb scenario.  (That might
bea bug in itself.)
 

(2) FlushBuffer will set the smgr-level transient flag even if we have
a relcache entry for the relation.  (The fact that we're doing a blind
write to flush a dirty buffer does not prove that the rel is one we're
not interested in.)  This can result in unnecessary forced closures of
kernel FDs, and it also results in too many scans of the VFD array,
because have_pending_fd_cleanup can get set unnecessarily.

(3) If the smgr-level flag gets cleared intra-transaction (ie, we did
a blind write and later started doing normal accesses to the same
relation), this fails to propagate to the VFD level, again resulting in
undesirable FD closures.

(4) After a blind write, we will close the kernel FD at transaction end,
but we don't flush the VFD array entry.  This results in VFD array bloat
over time.  The combination of this and (2) seems to explain the
performance problem I complained of above: there are too many VFD
searches done by CleanupTempFiles, and they have to pass over too many
useless entries.


I believe that we probably ought to revert this mechanism entirely, and
build a new implementation based on these concepts:

* An SMgrRelation is transient if and only if it doesn't have an
"owning" relcache entry.  Keep a list of all such SmgrRelations, and
close them all at transaction end.  (Obviously, an SMgrRelation gets
removed from the list if it acquires an owner mid-transaction.)

* There's no such concept as FD_XACT_TRANSIENT at the fd.c level.
Rather, we close and delete the VFD entry when told to by SmgrRelation
closure.

Comments?
        regards, tom lane



pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: Deprecating RULES
Next
From: Robert Haas
Date:
Subject: Re: Making the planner more tolerant of implicit/explicit casts