Re: WAL prefetch - Mailing list pgsql-hackers
| From | Tomas Vondra | 
|---|---|
| Subject | Re: WAL prefetch | 
| Date | |
| Msg-id | 8da3c2dd-8577-2141-d64a-d109ac038388@2ndquadrant.com | 
| In response to | Re: WAL prefetch (Andres Freund <andres@anarazel.de>) | 
| Responses | Re: WAL prefetch | 
| List | pgsql-hackers | 
On 06/16/2018 09:02 PM, Andres Freund wrote:
> On 2018-06-16 11:38:59 +0200, Tomas Vondra wrote:
>>
>>
>> On 06/15/2018 08:01 PM, Andres Freund wrote:
>>> On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
>>>>
>>>>
>>>> On 14.06.2018 09:52, Thomas Munro wrote:
>>>>> On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
>>>>> <k.knizhnik@postgrespro.ru> wrote:
>>>>>> pg_wal_prefetch function will infinitely traverse WAL and prefetch block
>>>>>> references in WAL records
>>>>>> using posix_fadvise(WILLNEED) system call.
>>>>> Hi Konstantin,
>>>>>
>>>>> Why stop at the page cache... what about shared buffers?
>>>>>
>>>>
>>>> It is good question. I thought a lot about prefetching directly to shared
>>>> buffers.
>>>
>>> I think that's definitely how this should work. I'm pretty strongly
>>> opposed to a prefetching implementation that doesn't read into s_b.
>>>
>>
>> Could you elaborate why prefetching into s_b is so much better (I'm sure it
>> has advantages, but I suppose prefetching into page cache would be much
>> easier to implement).
>
> I think there's a number of issues with just issuing prefetch requests
> via fadvise etc:
>
> - it leads to guaranteed double buffering, in a way that's just about
>   guaranteed to *never* be useful. Because we'd only prefetch whenever
>   there's an upcoming write, there's simply no benefit in the page
>   staying in the page cache - we'll write out the whole page back to the
>   OS.

How does reading directly into shared buffers substantially change the
behavior? The only difference is that we end up with the double
buffering after performing the write. Which is expected to happen pretty
quick after the read request.

> - reading from the page cache is far from free - so you add costs to the
>   replay process that it doesn't need to do.
> - you don't have any sort of completion notification, so you basically
>   just have to guess how far ahead you want to read. If you read a bit
>   too much you suddenly get into synchronous blocking land.
> - The OS page is actually not particularly scalable to large amounts of
>   data either. Nor are the decisions what to keep cached likely to be
>   particularly useful.

The posix_fadvise approach is not perfect, no doubt about that. But it
works pretty well for bitmap heap scans, and it's about 13249x better
(rough estimate) than the current solution (no prefetching).

> - We imo need to add support for direct IO before long, and adding more
>   and more work to reach feature parity strikes me as a bad move.
>

IMHO it's unlikely to happen in PG12, but I might be over-estimating the
invasiveness and complexity of the direct I/O change. While this patch
seems pretty doable, and the improvements are pretty significant.

My point was that I don't think this actually adds a significant amount
of work to the direct IO patch, as we already do prefetch for bitmap
heap scans. So this needs to be written anyway, and I'd expect those two
places to share most of the code. So where's the additional work?

I don't think we should reject patches just because it might add a bit
of work to some not-yet-written future patch ... (which I however don't
think is this case).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
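
For readers unfamiliar with the fadvise-style prefetch discussed above, here is a
minimal sketch of the idea. It is not the patch under discussion and not
PostgreSQL code. It is written in Python for brevity, and it only works on
Unix-like platforms. The names prefetch_blocks and read_block are hypothetical,
and the 8 kB block size is an assumption matching PostgreSQL's default. The
hint only asks the kernel to pull blocks into the OS page cache. The call
returns without waiting for the I/O and reports no completion, which is the
limitation raised in the quoted list.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # assumption: PostgreSQL's default block size


def prefetch_blocks(fd, block_numbers, block_size=BLOCK_SIZE):
    """Hint the kernel that the given blocks will be read soon.

    Returns the number of hints issued. The hint is advisory only:
    there is no completion notification, so the caller has to guess
    how far ahead to prefetch.
    """
    if not hasattr(os, "posix_fadvise"):
        return 0  # platform without posix_fadvise: skip prefetching
    issued = 0
    for blkno in sorted(set(block_numbers)):  # de-duplicate repeated references
        os.posix_fadvise(fd, blkno * block_size, block_size,
                         os.POSIX_FADV_WILLNEED)
        issued += 1
    return issued


def read_block(fd, blkno, block_size=BLOCK_SIZE):
    """Read one block; ideally served from the page cache after a hint."""
    return os.pread(fd, block_size, blkno * block_size)


if __name__ == "__main__":
    # Hypothetical stand-in for a relation file: 16 blocks, each filled
    # with its own block number.
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"".join(bytes([i]) * BLOCK_SIZE for i in range(16)))
        f.flush()
        hinted = prefetch_blocks(f.fileno(), [3, 7, 7, 12])
        block = read_block(f.fileno(), 7)
        print(hinted, len(block))
```

The later read still copies the page from the page cache into the caller's
buffer. That copy is the double buffering and "far from free" cost debated in
the message.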