On Mon, Oct 7, 2013 at 11:44 AM, Michal TOMA <mt@sicoop.com> wrote:
I gave it in my first post. It is a software RAID 1 of average 7200 rpm disks (Hitachi HDS723020BLE640) for the main tablespace, and a software RAID 1 of SSDs for another tablespace and also the partition holding the pg_xlog directory.
So that is exactly 2 drives on the HDD side? Yeah, that isn't going to go very far.
The problem is not the workload, as the application is a web crawler. So the workload can be infinite. What I would expect Postgres to do is to regulate the workload somehow instead of just crashing twice a day with a "partition full" followed by automatic recovery.
There has been some discussion about mechanisms to throttle throughput based on the log file partition filling up, but it was more in the context of archiving going down rather than checkpointing being way too slow. No real conclusion was reached though.
And I'm not very hopeful about it, especially not as something that would be on by default. I'd be pretty ticked if the system started automatically throttling a bulk load because it extrapolated and decided that some problem might occur at some point in the future--even though I know that the bulk load will be finished before that point is reached.
It seems like the best place to implement the throttling would be in your application, as that is where the sleeping can be done with the least amount of locks/resources being held. Maybe you could check `fgrep Dirty /proc/meminfo` and throttle based on that value.
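A minimal sketch of that idea in Python, assuming a Linux system where `/proc/meminfo` is available. The threshold and sleep interval here are arbitrary example values, not recommendations; tune them for your hardware:

```python
import re
import time

def dirty_kb(meminfo_text):
    """Extract the 'Dirty' value (in kB) from /proc/meminfo contents."""
    m = re.search(r"^Dirty:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(m.group(1)) if m else 0

def wait_for_dirty_pages(limit_kb=512 * 1024, pause_s=1.0):
    """Sleep while the kernel's dirty-page count exceeds limit_kb.

    Call this from the crawler before each batch of inserts, so the
    sleeping happens in the application, where no database locks or
    resources are held.  limit_kb=512MB is just an example threshold.
    """
    while True:
        with open("/proc/meminfo") as f:
            if dirty_kb(f.read()) < limit_kb:
                return
        time.sleep(pause_s)
```

The crawler would call `wait_for_dirty_pages()` before issuing each write batch, letting the OS writeback catch up whenever dirty memory climbs too high.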
Also, the nasty slug of dirty pages is accumulating in the OS, not in PostgreSQL itself, so you could turn down dirty_ratio and friends in the kernel to limit the problem.
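For example, something along these lines in sysctl (the specific percentages are illustrative only; the right values depend on RAM size and disk speed):

```
# Example sysctl settings -- illustrative values, tune for your system.
# Lower thresholds make the kernel write dirty pages out sooner, in
# smaller bursts, instead of accumulating a huge slug of them.
vm.dirty_background_ratio = 2    # start background writeback at 2% of RAM
vm.dirty_ratio = 5               # block writers once dirty pages hit 5%
```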