Thread: Caching number of blocks in relation to avoi lseek.
Hello all, While digging the code I found out quite interesting comment in src/backend/access/heap/hio.c before function RelationPutHeapTupleAtEnd * Eventually, we should cache the number of blocks in a relation somewhere.* Until that time, this code will have to do anlseek to determine the number* of blocks in a relation. As far as I can see there's field rd_nblocks in Relation. Question: is this field properly updated? Could it be used instead of RelationGetNumberOfBlocks which calls lseek. -- Sincerely Yours, Denis Perchine ---------------------------------- E-Mail: dyp@perchine.com HomePage: http://www.perchine.com/dyp/ FidoNet: 2:5000/120.5 ----------------------------------
> Hello all, > > While digging the code I found out quite interesting comment in > src/backend/access/heap/hio.c before function RelationPutHeapTupleAtEnd > > * Eventually, we should cache the number of blocks in a relation somewhere. > * Until that time, this code will have to do an lseek to determine the number > * of blocks in a relation. > > As far as I can see there's field rd_nblocks in Relation. > > Question: is this field properly updated? Could it be used instead of RelationGetNumberOfBlocks > which calls lseek. Probably should be kept up-to-date. Currently only vacuum sets it. It would have to be kept up-to-date for every INSERT/UPDATE that adds a block. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> > Question: is this field properly updated? Could it be used instead of RelationGetNumberOfBlocks > > which calls lseek. > > Probably should be kept up-to-date. Currently only vacuum sets it. It > would have to be kept up-to-date for every INSERT/UPDATE that adds a block. If we can acomplish this it will be quite big speed improvement. I saw lots of such calls and strace show lots of lseek(,,END) calls. I would do this, but I do not know the code so well to be sure I will walk through all places. -- Sincerely Yours, Denis Perchine ---------------------------------- E-Mail: dyp@perchine.com HomePage: http://www.perchine.com/dyp/ FidoNet: 2:5000/120.5 ----------------------------------
>> * Eventually, we should cache the number of blocks in a relation somewhere. >> * Until that time, this code will have to do an lseek to determine the number >> * of blocks in a relation. >> >> As far as I can see there's field rd_nblocks in Relation. >> >> Question: is this field properly updated? No. If it were that easy, this problem would have been fixed long ago. The problem with the relcache field is that it's backend-local, so it's not going to get updated when some other backend extends the relation. We do use rd_nblocks to cache the last table length determined from lseek, but that can't be trusted in critical cases, like when we are trying to add a block to the relation. There has been some talk of keeping a small cache of current relation lengths in shared memory, but I'm dubious that that'd actually be a win. In order to save an lseek (which ought to be pretty quick as kernel calls go) we'd be talking about grabbing a spinlock, searching a shared hashtable, and releasing a spinlock. Spinlock contention on this heavily used datastructure might well be a problem. Now add the costs of updating that hashtable every time we extend a relation (or even just fail to find a relation in it). Not immediately obvious that this is a net win overall, is it? If it were easier to do, or more obviously a large performance gain, someone would likely have tried it by now ... but there is lots of lower-hanging fruit to keep us busy. regards, tom lane
> No. If it were that easy, this problem would have been fixed long ago. > The problem with the relcache field is that it's backend-local, so it's > not going to get updated when some other backend extends the relation. Sorry. I missed this. > We do use rd_nblocks to cache the last table length determined from > lseek, but that can't be trusted in critical cases, like when we are > trying to add a block to the relation. > > There has been some talk of keeping a small cache of current relation > lengths in shared memory, but I'm dubious that that'd actually be a win. > In order to save an lseek (which ought to be pretty quick as kernel > calls go) we'd be talking about grabbing a spinlock, searching a shared > hashtable, and releasing a spinlock. Spinlock contention on this > heavily used datastructure might well be a problem. Now add the costs > of updating that hashtable every time we extend a relation (or even just > fail to find a relation in it). Not immediately obvious that this is > a net win overall, is it? If it were easier to do, or more obviously > a large performance gain, someone would likely have tried it by now ... > but there is lots of lower-hanging fruit to keep us busy. Hard to say what is faster... Lots of testing is needed. You are right. And what about skipping lseek when continually read relation? Is it possible? -- Sincerely Yours, Denis Perchine ---------------------------------- E-Mail: dyp@perchine.com HomePage: http://www.perchine.com/dyp/ FidoNet: 2:5000/120.5 ----------------------------------
Denis Perchine <dyp@perchine.com> writes: > And what about skipping lseek when continually read relation? > Is it possible? In a pure read scenario the way it's supposed to work is that an lseek(END) is done once at the start of each sequential scan, and we save that value in rd_nblocks. Then we read rd_nblocks pages and stop. By the time we finish the scan there might be more pages in the relation (added by other backends, or even by ourselves if it's an update query). But those pages cannot contain any tuples that could be visible to the current scan, so it's OK if we don't read them. However, we do need a new lseek() --- or some other way to verify the right table length --- at least once per transaction start or CommandCounterIncrement. Either of those events could make new tuples visible to us. I think there may be some code paths that cause us to do more than just one lseek(END) during scan startup, and if so it'd be worthwhile to try to get rid of the extra lseeks(). But you'd have to be careful that you didn't remove any lseeks that are essential in some other paths. regards, tom lane
On ���, 13 ��� 2000, Tom Lane wrote: > Denis Perchine <dyp@perchine.com> writes: > > And what about skipping lseek when continually read relation? > > Is it possible? > > In a pure read scenario the way it's supposed to work is that an > lseek(END) is done once at the start of each sequential scan, and we > save that value in rd_nblocks. Then we read rd_nblocks pages and stop. > By the time we finish the scan there might be more pages in the relation > (added by other backends, or even by ourselves if it's an update query). > But those pages cannot contain any tuples that could be visible to the > current scan, so it's OK if we don't read them. However, we do need a > new lseek() --- or some other way to verify the right table length > --- at least once per transaction start or CommandCounterIncrement. > Either of those events could make new tuples visible to us. > > I think there may be some code paths that cause us to do more than just > one lseek(END) during scan startup, and if so it'd be worthwhile to try > to get rid of the extra lseeks(). But you'd have to be careful that > you didn't remove any lseeks that are essential in some other paths. No... You did not get me. I am talking about completly different thing: I strace'ed postgres binary when doing queries and found out that it do lseek after each read, and the difference in the position is 8096. It means that we are in correct position anyway and do not need additional lseek. -- Sincerely Yours, Denis Perchine ---------------------------------- E-Mail: dyp@perchine.com HomePage: http://www.perchine.com/dyp/ FidoNet: 2:5000/120.5 ----------------------------------
Denis Perchine <dyp@perchine.com> writes: > No... You did not get me. I am talking about completly different thing: > I strace'ed postgres binary when doing queries and found out that it > do lseek after each read, and the difference in the position is 8096. > It means that we are in correct position anyway and do not need additional lseek. Oh. Hmm. Not sure if it's really worth the trouble, but you could try having fd.c keep track of the current seek position of VFDs when they are open as well as when they are closed, and optimize away the lseek call in FileSeek if the position is already right. You'd have to think carefully about what to do if a read or write fails, however --- where has the kernel left its seek position in that case? Possibly this could be dealt with by setting the internal position to "unknown" anytime we're not perfectly sure where the kernel is. regards, tom lane
* Tom Lane <tgl@sss.pgh.pa.us> [000612 19:37] wrote: > Denis Perchine <dyp@perchine.com> writes: > > No... You did not get me. I am talking about completly different thing: > > I strace'ed postgres binary when doing queries and found out that it > > do lseek after each read, and the difference in the position is 8096. > > It means that we are in correct position anyway and do not need additional lseek. > > Oh. Hmm. Not sure if it's really worth the trouble, but you could try > having fd.c keep track of the current seek position of VFDs when they > are open as well as when they are closed, and optimize away the lseek > call in FileSeek if the position is already right. You'd have to think > carefully about what to do if a read or write fails, however --- where > has the kernel left its seek position in that case? Possibly this could > be dealt with by setting the internal position to "unknown" anytime > we're not perfectly sure where the kernel is. Have you thought of using pread/pwrite which are available on many modern platforms: ssize_t pread(int d, void *buf, size_t nbytes, off_t offset) Pread() performs the same function, but reads from the specified position in the file without modifying the file pointer. I'm unsure how the postgresql system uses it's fds, however if they are really shared via dup() or across fork/exec they will share the same offset across multiple instances of the fd. The only way around this behavior is to reopen the file in each process to get a private offset. -- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] "I have the heart of a child; I keep it in a jar on my desk."