Re: postgresql latency & bgwriter not doing its job - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: postgresql latency & bgwriter not doing its job |
Date | |
Msg-id | 20140826083446.GG21544@awork2.anarazel.de |
In response to | Re: postgresql latency & bgwriter not doing its job (Fabien COELHO <coelho@cri.ensmp.fr>) |
List | pgsql-hackers |
On 2014-08-26 10:25:29 +0200, Fabien COELHO wrote:
> >Did you check whether xfs yields a, err, more predictable performance?
>
> No. I cannot test that easily without reinstalling the box. I did some quick
> tests with ZFS/FreeBSD which seemed to freeze the same, but not in the very
> same conditions. Maybe I could try again.

After Robert and I went to LSF/MM this spring I sent a test program for
precisely this problem and while it could *crash* machines when using
ext4, xfs yielded much more predictable performance. There's a problem
with prioritization of write vs read IO that's apparently FS dependent.

> >[...] Note that it would *not* be a good idea to make the bgwriter write
> >out everything, as much as possible - that'd turn sequential write io into
> >random write io.
>
> Hmmm. I'm not sure it would be necessarily the case, it depends on how
> bgwriter would choose the pages to write? If they are chosen randomly then
> indeed that could be bad.

They essentially have to be random for the bgwriter to fulfil its role of
reducing the likelihood of a backend having to write out a buffer itself.
Consider how the clock sweep algorithm (not that I am happy with it)
works. When looking for a new victim buffer all backends scan the buffer
cache in one continuous cycle. If they find a buffer with usagecount==0
they'll use that one and throw away its contents. Otherwise they reduce
usagecount by 1 and move on. What the bgwriter *tries* to do is to write
out buffers with usagecount==0 that are dirty and will soon be visited in
the clock cycle, so that the backends don't have to do that themselves.

> If there is a big sequential write, should not the
> backend do the write directly anyway? ISTM that currently checkpoint is
> mostly random writes anyway, at least with the OLTP write load of pgbench.
> I'm just trying to be able to start them earlier so that they can be
> completed quickly.

If the IO scheduling worked - which it really doesn't in many cases -
there'd really be no need to make it finish fast. I think you should try
to tune spread checkpoints to have less impact, not make bgwriter do
something it's not written for.

> So although bgwriter is not the solution, ISTM that pg has no reason to wait
> for minutes before starting to write dirty pages, if it has nothing else to
> do.

That precisely *IS* a spread checkpoint.

> If the OS does some retention later and cannot spread the load, as Josh
> suggests, this could also be a problem, but currently the OS seems not to
> have much to write (but WAL) till the checkpoint.

The actual problem is that the writes by the checkpointer - done in the
background - aren't flushed out of the OS's page cache eagerly enough.
Then, when the final phase of the checkpoint comes, where relation files
need to be fsynced, some filesystems essentially stall while trying to
write out lots and lots of dirty buffers.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
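
For readers following the thread, here is a minimal sketch of the clock sweep and the bgwriter's look-ahead as described in the message above. It is a toy model under stated assumptions, not PostgreSQL's code: the names `Buffer`, `ClockSweep`, `next_victim`, `bgwriter_candidates` and the `lookahead` parameter are hypothetical, and pinning, locking and the actual write to disk are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Buffer:
    usagecount: int = 0   # decremented each time the sweep passes over the buffer
    dirty: bool = False   # True if the buffer must be written before reuse


class ClockSweep:
    """Toy buffer pool with a single shared clock hand (illustrative only)."""

    def __init__(self, buffers: List[Buffer]) -> None:
        self.buffers = buffers
        self.hand = 0  # index of the next buffer the sweep will inspect

    def next_victim(self) -> int:
        """Advance the hand until a buffer with usagecount == 0 is found."""
        while True:
            idx = self.hand
            self.hand = (self.hand + 1) % len(self.buffers)
            buf = self.buffers[idx]
            if buf.usagecount == 0:
                return idx          # caller would evict this buffer
            buf.usagecount -= 1     # otherwise age it and move on

    def bgwriter_candidates(self, lookahead: int) -> List[int]:
        """Dirty usagecount == 0 buffers within `lookahead` slots of the hand.

        These are the buffers a backend would otherwise have to write out
        itself when the sweep reaches them.
        """
        n = len(self.buffers)
        found = []
        for step in range(min(lookahead, n)):
            idx = (self.hand + step) % n
            buf = self.buffers[idx]
            if buf.usagecount == 0 and buf.dirty:
                found.append(idx)
        return found


pool = ClockSweep([Buffer(2, True), Buffer(0, True),
                   Buffer(1, False), Buffer(0, True)])
print(pool.bgwriter_candidates(lookahead=4))  # [1, 3]
print(pool.next_victim())                     # 1
```

In this model the candidates come back in buffer-pool order, which has no relation to where the pages live on disk. That is one way to picture why the writes described above end up effectively random.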