Re: [Patch] Optimize dropping of relation buffers using dlist - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: [Patch] Optimize dropping of relation buffers using dlist |
Date | |
Msg-id | CAA4eK1K8++A=zws=58MyH84gf3DZg8Rs=-a4QLZCHM7f9b3MsQ@mail.gmail.com Whole thread Raw |
In response to | Re: [Patch] Optimize dropping of relation buffers using dlist (Kyotaro Horiguchi <horikyota.ntt@gmail.com>) |
List | pgsql-hackers |
On Thu, Oct 22, 2020 at 2:20 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Thu, 22 Oct 2020 07:31:55 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in > > From: Thomas Munro <thomas.munro@gmail.com> > > > On Thu, Oct 22, 2020 at 7:33 PM Kyotaro Horiguchi > > > <horikyota.ntt@gmail.com> wrote: > > > > Mmm. Not exact. The requirement here is that we must be certain that > > > > the we don't have a buffuer for blocks after the file size known to > > > > the process. While recoverying, If the first lseek() returned smaller > > > > size than actual, we cannot have a buffer for the blocks after the > > > > size. After we trncated or extended the file, we are certain that we > > > > don't have a buffer for unknown blocks. > > > > > > Thanks, I understand now. Something feels fragile about it, perhaps > > > because it's not really acting as a "cache" anymore despite its name, > > > but I see the logic now. It becomes the authoritative source of > > > information, even if the kernel decides to make our file smaller > > > asynchronously. I understand your hesitation but I guess if we can't rely on this cache in recovery then probably we have a problem without this patch itself because the current relation extension (in ReadBuffer_common) relies on the smgrnblocks. So, if the cache lies with us it will overwrite some existing block. > > Thank you Horiguchi-san, you are a savior! I was worried like the end of the world has come. > > > > > > > I think a synchronised file size cache wouldn't be enough to use this > > > trick outside the recovery process, because the initial value would > > > come from a call to lseek(), but unlike recovery, that wouldn't happen > > > *before* we start putting pages in the buffer pool. This is true because the other sessions might have pulled the page to buffer pool but I think if we have invalidations for extension/truncation of a relation then probably before relying on this value we can process the invalidations to update this cache value. > > > Also, if we one > > > day have a size-limited relcache, even recovery could get into > > > trouble, if it evicts the RelationData that holds the authoritative > > > nblocks value. > > > > That's too bad, because we hoped we may be able to various operations during normal operation (TRUNCATE, DROP TABLE/INDEX,DROP DATABASE, etc.) An honest man can't believe the system call, that's a hell. > > > > I'm probably being silly, but can't we avoid the problem by using fstat() instead of lseek(SEEK_END)? Would they returnthe same value from the i-node? > > > > Or, can't we just try to do BufTableLookup() one block after what smgrnblocks() returns? > > Lossy smgrrelcache or relacache is not a hard obstacle. As the same > with the case of !accurate, we just give up the optimized dropping if > the relcache doesn't give the authoritative size. > I think detecting lossy cache is the key thing, probably it might not be as straight forward as it is in recovery path. > By the way, heap scan finds the size of target relation using > smgrnblocks(). I'm not sure why we don't miss recently-extended pages > on a heap-scan? It seems to be possible that concurrent checkpoint > fsyncs relation files inbetween the extension and scanning and the > scanning gets smaller size than it really is. > Yeah, I think that would be a problem but not as serious as in the case we are trying to deal here. -- With Regards, Amit Kapila.
pgsql-hackers by date: