Re: What is the posix_memalign() equivalent for the PostgreSQL? - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: What is the posix_memalign() equivalent for the PostgreSQL? |
Date | |
Msg-id | CA+TgmoZEcG3u7DzTpQtzYUpqRnvb3cKdW3G+ZbdM+9Lq=JeQTQ@mail.gmail.com |
In response to | Re: What is the posix_memalign() equivalent for the PostgreSQL? (Andres Freund <andres@anarazel.de>) |
List | pgsql-hackers |
On Fri, Sep 2, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-09-02 13:05:37 -0400, Tom Lane wrote:
>> Anderson Carniel <accarniel@gmail.com> writes:
>> > If not, according to your experience, is there a
>> > significance difference between the performance of the O_DIRECT or not?
>>
>> AFAIK, nobody's really bothered to measure whether that would be useful
>> for Postgres. The results would probably be quite platform-specific
>> anyway.
>
> I've played with patches to make postgres use O_DIRECT. On linux, it's
> rather beneficial for some workloads (fits into memory), but it also
> works really badly for some others, because our IO code isn't
> intelligent enough. We pretty much rely on write() being nearly
> instantaneous when done by normal backends (during buffer replacement),
> we rely on readahead, we rely on the kernel to stopgap some bad
> replacement decisions we're making.

So, suppose we changed the world so that backends don't write dirty
buffers, or at least not normally. If they need to perform a buffer
eviction, they first check the freelist, then run the clock sweep. The
clock sweep puts clean buffers on the freelist and dirty buffers on a
to-be-cleaned list. A background process writes buffers on the
to-be-cleaned list and then adds them to the freelist afterward if the
usage count hasn't been bumped meanwhile. As in Amit's bgreclaimer
patch, we have a target size for the freelist, with a low watermark and
a high watermark. When we drop below the low watermark, the background
processes run the clock sweep and write from the to-be-cleaned list to
try to populate it; when we surge above the high watermark, they go back
to sleep.

Further, suppose we also create a prefetch system, maybe based on the
synchronous scan machinery. It preemptively pulls data into
shared_buffers if an ongoing scan will need it soon.
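
To make the eviction scheme above a bit more concrete, here is a rough,
single-threaded sketch in C. It is not PostgreSQL's buffer manager code
and not taken from any patch: every name in it (SketchPool, sweep_one,
clean_one, background_refill, backend_get_buffer), the pool size, and
the watermark values are made up for illustration, and locking,
pinning, and the real write() are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

#define NBUFFERS      8
#define FREELIST_LOW  2   /* hypothetical low watermark */
#define FREELIST_HIGH 4   /* hypothetical high watermark */

typedef enum { ON_NONE, ON_FREELIST, ON_TO_CLEAN } ListId;

typedef struct {
    int    usage_count;   /* bumped on access, decayed by the sweep */
    bool   dirty;
    ListId list;          /* which list, if any, holds this buffer */
} SketchBuffer;

typedef struct {
    SketchBuffer buf[NBUFFERS];
    int freelist[NBUFFERS];  int nfree;
    int to_clean[NBUFFERS];  int nto_clean;
    int hand;                /* clock-sweep position */
} SketchPool;

/* Advance the clock hand by one buffer and classify that buffer. */
static void sweep_one(SketchPool *p)
{
    int id = p->hand;
    SketchBuffer *b = &p->buf[id];

    p->hand = (p->hand + 1) % NBUFFERS;
    if (b->list != ON_NONE)
        return;                          /* already queued somewhere */
    if (b->usage_count > 0) {
        b->usage_count--;                /* recently used: give it another lap */
    } else if (b->dirty) {
        b->list = ON_TO_CLEAN;           /* needs a write before reuse */
        p->to_clean[p->nto_clean++] = id;
    } else {
        b->list = ON_FREELIST;           /* clean and cold: reusable now */
        p->freelist[p->nfree++] = id;
    }
}

/* Background process: "write" one queued buffer, free it if still cold. */
static void clean_one(SketchPool *p)
{
    assert(p->nto_clean > 0);
    int id = p->to_clean[--p->nto_clean];
    SketchBuffer *b = &p->buf[id];

    b->dirty = false;                    /* stands in for the actual write */
    if (b->usage_count == 0) {
        b->list = ON_FREELIST;
        p->freelist[p->nfree++] = id;
    } else {
        b->list = ON_NONE;               /* bumped meanwhile: keep it cached */
    }
}

/* Background process: refill the freelist between the two watermarks. */
static void background_refill(SketchPool *p)
{
    if (p->nfree >= FREELIST_LOW)
        return;                          /* enough free buffers: stay asleep */
    /* The step budget keeps the sketch from spinning if nothing can be freed. */
    for (int steps = 0; p->nfree < FREELIST_HIGH && steps < 16 * NBUFFERS; steps++) {
        if (p->nto_clean > 0)
            clean_one(p);
        else
            sweep_one(p);
    }
}

/* Backend eviction: freelist first, then the clock sweep; never writes. */
static int backend_get_buffer(SketchPool *p)
{
    for (int steps = 0; p->nfree == 0 && steps < 16 * NBUFFERS; steps++)
        sweep_one(p);
    if (p->nfree == 0)
        return -1;                       /* only dirty buffers found: caller waits */
    int id = p->freelist[--p->nfree];
    p->buf[id].list = ON_NONE;
    return id;
}
```

The point of the sketch is only that backend_get_buffer never performs
a write itself: dirty victims are handed to the to-be-cleaned list, and
a backend that finds nothing clean gets -1 and would have to wait for
the background process. What that wait should look like is exactly the
part a real patch would need to work out.
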
Or maybe don't base it on the synchronous scan machinery, but instead
just have a queue that lets backends throw prefetch requests over the
wall; when the queue wraps, old requests are discarded. A background
process - or perhaps one per tablespace or something like that - pulls
the data in.

Neither of those things seems that hard. And if we could do those
things and make them work, then maybe we could offer direct I/O as an
option. We'd still lose heavily in the case where our buffer eviction
decisions are poor, but that'd probably spur some improvement in that
area, which IMHO would be a good thing.

I personally think direct I/O would be a really good thing, not least
because O_ATOMIC is designed to allow MySQL to avoid double buffering,
their alternative to full page writes. But we can't use it because it
requires O_DIRECT. The savings are probably massive.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company