Thread: Caching number of blocks in relation to avoi lseek.

Caching number of blocks in relation to avoi lseek.

From
Denis Perchine
Date:
Hello all,

While digging the code I found out quite interesting comment in
src/backend/access/heap/hio.c before function RelationPutHeapTupleAtEnd
* Eventually, we should cache the number of blocks in a relation somewhere.* Until that time, this code will have to do
anlseek to determine the number* of blocks in a relation.
 

As far as I can see there's field rd_nblocks in Relation.

Question: is this field properly updated? Could it be used instead of RelationGetNumberOfBlocks
which calls lseek.

-- 
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------


Re: Caching number of blocks in relation to avoi lseek.

From
Bruce Momjian
Date:
> Hello all,
> 
> While digging the code I found out quite interesting comment in
> src/backend/access/heap/hio.c before function RelationPutHeapTupleAtEnd
> 
>  * Eventually, we should cache the number of blocks in a relation somewhere.
>  * Until that time, this code will have to do an lseek to determine the number
>  * of blocks in a relation.
> 
> As far as I can see there's field rd_nblocks in Relation.
> 
> Question: is this field properly updated? Could it be used instead of RelationGetNumberOfBlocks
> which calls lseek.

Probably should be kept up-to-date.  Currently only vacuum sets it.  It
would have to be kept up-to-date for every INSERT/UPDATE that adds a block.

--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: Caching number of blocks in relation to avoi lseek.

From
Denis Perchine
Date:
> > Question: is this field properly updated? Could it be used instead of RelationGetNumberOfBlocks
> > which calls lseek.
> 
> Probably should be kept up-to-date.  Currently only vacuum sets it.  It
> would have to be kept up-to-date for every INSERT/UPDATE that adds a block.

If we can acomplish this it will be quite big speed improvement. I saw lots of such calls
and strace show lots of lseek(,,END) calls. I would do this, but I do not know the code so
well to be sure I will walk through all places.

-- 
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------


Re: Caching number of blocks in relation to avoi lseek.

From
Tom Lane
Date:
>> * Eventually, we should cache the number of blocks in a relation somewhere.
>> * Until that time, this code will have to do an lseek to determine the number
>> * of blocks in a relation.
>> 
>> As far as I can see there's field rd_nblocks in Relation.
>> 
>> Question: is this field properly updated?

No.  If it were that easy, this problem would have been fixed long ago.
The problem with the relcache field is that it's backend-local, so it's
not going to get updated when some other backend extends the relation.

We do use rd_nblocks to cache the last table length determined from
lseek, but that can't be trusted in critical cases, like when we are
trying to add a block to the relation.

There has been some talk of keeping a small cache of current relation
lengths in shared memory, but I'm dubious that that'd actually be a win.
In order to save an lseek (which ought to be pretty quick as kernel
calls go) we'd be talking about grabbing a spinlock, searching a shared
hashtable, and releasing a spinlock.  Spinlock contention on this
heavily used datastructure might well be a problem.  Now add the costs
of updating that hashtable every time we extend a relation (or even just
fail to find a relation in it).  Not immediately obvious that this is
a net win overall, is it?  If it were easier to do, or more obviously
a large performance gain, someone would likely have tried it by now ...
but there is lots of lower-hanging fruit to keep us busy.
        regards, tom lane


Re: Caching number of blocks in relation to avoi lseek.

From
Denis Perchine
Date:
> No.  If it were that easy, this problem would have been fixed long ago.
> The problem with the relcache field is that it's backend-local, so it's
> not going to get updated when some other backend extends the relation.

Sorry. I missed this.
> We do use rd_nblocks to cache the last table length determined from
> lseek, but that can't be trusted in critical cases, like when we are
> trying to add a block to the relation.
> 
> There has been some talk of keeping a small cache of current relation
> lengths in shared memory, but I'm dubious that that'd actually be a win.
> In order to save an lseek (which ought to be pretty quick as kernel
> calls go) we'd be talking about grabbing a spinlock, searching a shared
> hashtable, and releasing a spinlock.  Spinlock contention on this
> heavily used datastructure might well be a problem.  Now add the costs
> of updating that hashtable every time we extend a relation (or even just
> fail to find a relation in it).  Not immediately obvious that this is
> a net win overall, is it?  If it were easier to do, or more obviously
> a large performance gain, someone would likely have tried it by now ...
> but there is lots of lower-hanging fruit to keep us busy.

Hard to say what is faster... Lots of testing is needed. You are right.
And what about skipping lseek when continually read relation?
Is it possible?

-- 
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------


Re: Caching number of blocks in relation to avoi lseek.

From
Tom Lane
Date:
Denis Perchine <dyp@perchine.com> writes:
> And what about skipping lseek when continually read relation?
> Is it possible?

In a pure read scenario the way it's supposed to work is that an
lseek(END) is done once at the start of each sequential scan, and we
save that value in rd_nblocks.  Then we read rd_nblocks pages and stop.
By the time we finish the scan there might be more pages in the relation
(added by other backends, or even by ourselves if it's an update query).
But those pages cannot contain any tuples that could be visible to the
current scan, so it's OK if we don't read them.  However, we do need a
new lseek() --- or some other way to verify the right table length
--- at least once per transaction start or CommandCounterIncrement.
Either of those events could make new tuples visible to us.

I think there may be some code paths that cause us to do more than just
one lseek(END) during scan startup, and if so it'd be worthwhile to try
to get rid of the extra lseeks().  But you'd have to be careful that
you didn't remove any lseeks that are essential in some other paths.
        regards, tom lane


Re: Caching number of blocks in relation to avoi lseek.

From
Denis Perchine
Date:
On ���, 13 ��� 2000, Tom Lane wrote:
> Denis Perchine <dyp@perchine.com> writes:
> > And what about skipping lseek when continually read relation?
> > Is it possible?
> 
> In a pure read scenario the way it's supposed to work is that an
> lseek(END) is done once at the start of each sequential scan, and we
> save that value in rd_nblocks.  Then we read rd_nblocks pages and stop.
> By the time we finish the scan there might be more pages in the relation
> (added by other backends, or even by ourselves if it's an update query).
> But those pages cannot contain any tuples that could be visible to the
> current scan, so it's OK if we don't read them.  However, we do need a
> new lseek() --- or some other way to verify the right table length
> --- at least once per transaction start or CommandCounterIncrement.
> Either of those events could make new tuples visible to us.
> 
> I think there may be some code paths that cause us to do more than just
> one lseek(END) during scan startup, and if so it'd be worthwhile to try
> to get rid of the extra lseeks().  But you'd have to be careful that
> you didn't remove any lseeks that are essential in some other paths.

No... You did not get me. I am talking about completly different thing:
I strace'ed postgres binary when doing queries and found out that it
do lseek after each read, and the difference in the position is 8096.
It means that we are in correct position anyway and do not need additional lseek.

-- 
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------


Re: Caching number of blocks in relation to avoi lseek.

From
Tom Lane
Date:
Denis Perchine <dyp@perchine.com> writes:
> No... You did not get me. I am talking about completly different thing:
> I strace'ed postgres binary when doing queries and found out that it
> do lseek after each read, and the difference in the position is 8096.
> It means that we are in correct position anyway and do not need additional lseek.

Oh.  Hmm.  Not sure if it's really worth the trouble, but you could try
having fd.c keep track of the current seek position of VFDs when they
are open as well as when they are closed, and optimize away the lseek
call in FileSeek if the position is already right.  You'd have to think
carefully about what to do if a read or write fails, however --- where
has the kernel left its seek position in that case?  Possibly this could
be dealt with by setting the internal position to "unknown" anytime
we're not perfectly sure where the kernel is.
        regards, tom lane


Re: Caching number of blocks in relation to avoi lseek.

From
Alfred Perlstein
Date:
* Tom Lane <tgl@sss.pgh.pa.us> [000612 19:37] wrote:
> Denis Perchine <dyp@perchine.com> writes:
> > No... You did not get me. I am talking about completly different thing:
> > I strace'ed postgres binary when doing queries and found out that it
> > do lseek after each read, and the difference in the position is 8096.
> > It means that we are in correct position anyway and do not need additional lseek.
> 
> Oh.  Hmm.  Not sure if it's really worth the trouble, but you could try
> having fd.c keep track of the current seek position of VFDs when they
> are open as well as when they are closed, and optimize away the lseek
> call in FileSeek if the position is already right.  You'd have to think
> carefully about what to do if a read or write fails, however --- where
> has the kernel left its seek position in that case?  Possibly this could
> be dealt with by setting the internal position to "unknown" anytime
> we're not perfectly sure where the kernel is.

Have you thought of using pread/pwrite which are available on many
modern platforms:
    ssize_t    pread(int d, void *buf, size_t nbytes, off_t offset)
    Pread() performs the same function, but reads from the specified    position in the file without modifying the file
pointer.

I'm unsure how the postgresql system uses it's fds, however if they
are really shared via dup() or across fork/exec they will share
the same offset across multiple instances of the fd.  The only way
around this behavior is to reopen the file in each process to get
a private offset.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."