Getting rid of excess lseeks() - Mailing list pgsql-hackers

From Tom Lane
Subject Getting rid of excess lseeks()
Date
Msg-id 18441.989516445@sss.pgh.pa.us
List pgsql-hackers
We've known for a long time that Postgres does a lot of
redundant-seeming "lseek(fd,0,SEEK_END)" kernel calls while inserting
data; one for each inserted tuple, in fact.  This is coming from
RelationGetBufferForTuple() in src/backend/access/heap/hio.c, which does
RelationGetNumberOfBlocks() to ensure that it knows the currently last
page of the relation to insert into.  That results in the lseek() call,
which is the only way to be sure we know the current file EOF exactly,
given that other backends might be extending the file too.
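
For readers unfamiliar with the idiom, here is a minimal sketch of how a
block count can be derived from the file EOF with lseek().  The helper
name file_nblocks and the BLOCK_SIZE constant are illustrative
assumptions, not the actual code behind RelationGetNumberOfBlocks().

```c
#include <sys/types.h>
#include <unistd.h>

/* Assumed page size for this sketch (8192 is PostgreSQL's default BLCKSZ). */
#define BLOCK_SIZE 8192

/*
 * Hypothetical helper: return the number of whole blocks currently in
 * the file, or -1 if lseek() fails.  Each call costs one kernel call,
 * and it also moves the file offset to EOF as a side effect.
 */
long
file_nblocks(int fd)
{
    off_t end = lseek(fd, 0, SEEK_END);

    if (end == (off_t) -1)
        return -1;
    /* A trailing partial block is not counted. */
    return (long) (end / BLOCK_SIZE);
}
```

Calling this once per inserted tuple is the cost under discussion: the
answer is always current, but it requires a trip into the kernel.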

We have talked about avoiding this kernel call by keeping an accurate
EOF location somewhere in shared memory.  However, I just had what is
either a brilliant or foolish idea: who says that we absolutely must
insert the new tuple on the very last page of the table?  If it fits on
a page that's not-quite-the-last-one, why shouldn't we put it there?
If that works, we could just use "rel->rd_nblocks-1" as our initial
guess of the page to insert onto, and skip the lseek.  It doesn't
matter if rd_nblocks is slightly out of date.  The logic in 
RelationGetBufferForTuple would then be something like:

    /*
     * First, use cached rd_nblocks to guess which page to put tuple
     * on.
     */
    if (rel->rd_nblocks > 0)
    {
        see if tuple will fit on page rel->rd_nblocks-1;
        if so, put it there and return.
    }
    /*
     * Before extending relation, make sure no one else has done
     * so more recently than our last rd_nblocks update.  (If we
     * blindly extend the relation here, then probably most of the
     * page the other guy added will end up going to waste.)
     */
    newlastblock = RelationGetNumberOfBlocks(relation);
    if (newlastblock > rel->rd_nblocks)
    {
        /*
         * Someone else has indeed extended the rel.
         * Update my idea of the rel length, and see if
         * I can fit my tuple on the page he made.
         */
        rel->rd_nblocks = newlastblock;
        see if tuple will fit on page rel->rd_nblocks-1;
        if so, put it there and return.
    }
    /*
     * Otherwise, extend the rel by one block and put our tuple
     * there, same as before.  (Be sure to update rel->rd_nblocks
     * for next time...)
     */
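
The outline above can be exercised with a small in-memory model.  This is
only a sketch under stated assumptions: ToyFile, ToyRel, toy_insert and
the eof_checks counter are all hypothetical stand-ins (pages are reduced
to a free-space number, and there is no locking or real I/O), so it
illustrates the control flow rather than the real hio.c code.

```c
#include <stddef.h>

#define PAGE_SIZE 8192      /* assumed page size */
#define MAX_PAGES 1024      /* arbitrary cap for the toy model */

/* Toy stand-in for the relation file shared by all backends. */
typedef struct
{
    int    nblocks;                 /* true current length in pages */
    size_t free_space[MAX_PAGES];   /* free bytes left on each page */
    int    eof_checks;              /* counts simulated lseek() calls */
} ToyFile;

/* Toy stand-in for one backend's relation descriptor. */
typedef struct
{
    ToyFile *file;
    int      rd_nblocks;            /* cached length, possibly stale */
} ToyRel;

/* Simulates RelationGetNumberOfBlocks(): the "expensive" path. */
static int
toy_get_nblocks(ToyFile *f)
{
    f->eof_checks++;
    return f->nblocks;
}

/* Place the tuple on page blk if it fits; return 1 on success. */
static int
toy_try_page(ToyFile *f, int blk, size_t len)
{
    if (f->free_space[blk] < len)
        return 0;
    f->free_space[blk] -= len;
    return 1;
}

/* Return the page the tuple was placed on, or -1 if it cannot be stored. */
int
toy_insert(ToyRel *rel, size_t len)
{
    ToyFile *f = rel->file;
    int      newlastblock;

    if (len > PAGE_SIZE)
        return -1;

    /* First, use the cached length to guess the target page. */
    if (rel->rd_nblocks > 0 && toy_try_page(f, rel->rd_nblocks - 1, len))
        return rel->rd_nblocks - 1;

    /* Before extending, check whether someone else already did. */
    newlastblock = toy_get_nblocks(f);
    if (newlastblock > rel->rd_nblocks)
    {
        rel->rd_nblocks = newlastblock;
        if (toy_try_page(f, rel->rd_nblocks - 1, len))
            return rel->rd_nblocks - 1;
    }

    /* Otherwise, extend by one page and remember the new length. */
    if (f->nblocks >= MAX_PAGES)
        return -1;
    f->free_space[f->nblocks] = PAGE_SIZE - len;
    f->nblocks++;
    rel->rd_nblocks = f->nblocks;
    return rel->rd_nblocks - 1;
}
```

In this model the simulated EOF check runs only when the guessed page is
full, so a run of insertions that fit on the cached page costs no
eof_checks at all.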
 

An additional small win is that we'd not have to do the

    if (!relation->rd_myxactonly)
        LockPage(relation, 0, ExclusiveLock);

bit unless the first insertion attempt fails.  This lock is only needed
to ensure that just one backend extends the rel at a time, so as long as
we are adding a tuple to a pre-existing page there's no need to grab it.
That would improve concurrency some more, since the majority of tuple
insertions will succeed in adding to an existing page.

So the question is, is it safe to insert on non-last pages?  AFAIK,
the only aspect of the system that really makes assumptions about tuple
positioning is that sequential scans stop when they reach rel->rd_nblocks
(which they update at the beginning of the scan).  They are assuming
that tuples appearing on pages added after a scan starts are
uninteresting because they can't be committed from the point of view of
the scanning transaction.  But that assumption is not violated by
placing new tuples in pages earlier than the last possible place.
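
To make that argument concrete, here is a toy model of a sequential scan.
Everything in it is an assumption for illustration: ToyTuple, ToyPage and
toy_seqscan_count are hypothetical names, and the visibility test (a
tuple counts only if its inserting transaction id is below a snapshot
limit) is a deliberate oversimplification of the real tuple visibility
rules.

```c
#define MAX_TUPLES 16

/* Toy tuple: xmin is the id of the transaction that inserted it. */
typedef struct
{
    unsigned xmin;
    int      value;
} ToyTuple;

typedef struct
{
    ToyTuple tuples[MAX_TUPLES];
    int      ntuples;
} ToyPage;

/*
 * Count the tuples a scan would return.  The scan reads only the pages
 * that existed when it started (nblocks_at_start), and within those
 * pages it skips tuples inserted by transactions at or beyond
 * snapshot_xmax (simplified visibility rule, assumed for this sketch).
 */
int
toy_seqscan_count(const ToyPage *pages, int nblocks_at_start,
                  unsigned snapshot_xmax)
{
    int visible = 0;

    for (int blk = 0; blk < nblocks_at_start; blk++)
        for (int i = 0; i < pages[blk].ntuples; i++)
            if (pages[blk].tuples[i].xmin < snapshot_xmax)
                visible++;
    return visible;
}
```

In this model a too-new tuple is rejected by the visibility test no
matter which page it lands on, so the page cutoff only saves work; it is
not what keeps such tuples out of the result.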

Comments?  Is there a hole in my reasoning?
        regards, tom lane

