Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core |
Msg-id | CAM3SWZSySkp2z44C_BFthC9rDr10PTF7KjL80nD-Sr=Xv-XjZA@mail.gmail.com |
In response to | Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: Storing pg_stat_statements query texts externally, pg_stat_statements in core |
List | pgsql-hackers |
On Tue, Jan 21, 2014 at 6:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> BTW, what is the value of using posix_fadvise() in warm_fcache?
>
> ISTM that if you're worried about waiting for I/O while holding the
> LWLock (as I concur you should be), this coding is entirely inadequate
> for preventing that, as the whole point of posix_fadvise is that it
> won't wait around for the I/O to happen. It strains credulity to
> the breaking point to imagine that the kernel will have gotten
> around to reading the file in the small number of nanoseconds that
> will elapse before you start needing the data. Unless of course the file
> is already in buffers, but then the whole thing was a waste of code.

Not necessarily. Under Linux, for example, POSIX_FADV_SEQUENTIAL "sets the readahead window to the default size for the backing device". That seems like a useful thing, though I will admit that it's hard to quantify how useful it is here. If it is useful, and furthermore if it's more useful than just reading the file into a temp buffer (which, I admit, is not clear), then why not do it where we can? Adopting support for the non-portable posix_fadvise() is a sunk cost.

> I think we should flush the posix_fadvise() code path as being neither
> useful nor portable, and just do the actual read() as in the other
> path

Perhaps.

> Actually, I think the whole thing is rather badly designed, as warm
> cache or no you're still proposing to do thousands of kernel calls
> while holding a potentially contended LWLock. I suggest what we
> do is (1) read in the whole file, (2) acquire the lock, (3)
> re-read the whole file in the same buffer, (4) work from the buffer.

fseeko() and fread() are part of the standard library, not the kernel. I don't believe that what I've done in qtext_fetch() implies thousands of kernel calls, or thousands of context switches. Now, using a buffer not managed by libc might be better, but I should note what holding that LWLock actually represents.
As you know, it's a shared lock that does not block the updating of existing entries. It will only block the creation of entirely new ones. That ought to be very rare. You may recall that when I worked with you on pg_stat_statements in the past, I characterized the rate of growth as being logarithmic. Once the system has been running for half an hour, new entries become very infrequent indeed. Admittedly, I may be failing to consider a worst case there, but in the average case this works fine. Encouraging automated tools to lazily retrieve query texts is a big part of why I thought this might be the least worst thing.

It's not obvious to me that always avoiding blocking new entries while performing physical I/O is important enough to warrant the additional complexity of copying to a buffer first, and mostly working off that. Also, you must consider that there may have been a garbage collection in the interim, so you have to get the shared lock twice (once to check the garbage collection count, and again later to guard against your buffer having been invalidated by a garbage collection) instead of just getting the shared lock once. I discussed this with Pavel, actually. Getting the shared lock twice on every pg_stat_statements() call doesn't seem very good either, especially since you'll have to open the file twice, *with* the shared lock held the second time around, when your scheme misses new entries. Your scheme won't work out under exactly the same circumstances that mine doesn't do too well in, which is when there is a new entry to consider. When that doesn't happen, holding a shared lock is not that important.

> It might be possible to optimize the re-read away in the common case
> that the file didn't actually change since we started to read it;
> we'd need to put a bit more reporting into qtext_store probably, so
> that it's possible to tell if any space is reserved but still unwritten.

So, to be clear, you'd still be reading from the file if that proved necessary?
I think you'll have to, because it seems useful that pg_stat_statements offers a limited form of consistency, in that you only see entries at a point in time, even if the figures themselves are not exactly consistent. Maybe I've missed something, but it's not obvious to me that you're any better off as compared to just using a FILE/stream.

> BTW shouldn't there be an fflush() in qtext_store?

I don't think so, no. Whenever qtext_store() callers don't fflush() before releasing their exclusive lock, they FreeFile() before doing so instead. In the case of pgss_shmem_startup() there is no lock held anyway, because it doesn't matter, for the same reason it doesn't matter that we're modifying shmem structures in a way that ordinarily necessitates an exclusive lock. Why fflush() after every qtext_store() call if there is expected to be more than one, as is the case for several callers? FreeFile() calls fclose(), and fclose() is described by the standard as flushing its stream. It is my understanding that calling fflush() on a stream immediately prior to calling fclose() on the same stream is always superfluous.

> We could also eliminate some palloc/pfree cycles by using the string
> directly in the buffer (at the small cost of needing to include the
> strings' trailing nulls in the file).

It might be useful to do that anyway, for sanity checking purposes. So +1 from me.

--
Peter Geoghegan