Re: Improvement of checkpoint IO scheduler for stable transaction responses - Mailing list pgsql-hackers
From: KONDO Mitsumasa
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date:
Msg-id: 51D56A5B.3050504@lab.ntt.co.jp
In response to: Re: Improvement of checkpoint IO scheduler for stable transaction responses (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Improvement of checkpoint IO scheduler for stable transaction responses
List: pgsql-hackers
(2013/07/03 22:31), Robert Haas wrote:
> On Wed, Jul 3, 2013 at 4:18 AM, KONDO Mitsumasa
> <kondo.mitsumasa@lab.ntt.co.jp> wrote:
>> I tested and changed segsize=0.25GB which is max partitioned table file size and
>> default setting is 1GB in configure option (./configure --with-segsize=0.25).
>> Because I thought that small segsize is good for fsync phase and background disk
>> write in OS in checkpoint. I got significant improvements in DBT-2 result!
>
> This is interesting. Unfortunately, it has a significant downside:
> potentially, there will be a lot more files in the data directory. As
> it is, the number of files that exist there today has caused
> performance problems for some of our customers. I'm not sure off-hand
> to what degree those problems have been related to overall inode
> consumption vs. the number of files in the same directory.

Did you change the maximum number of FDs per process in the kernel
parameters? By default, the limit is 1024 FDs per process, and I think
it could be exceeded with a database in the 500GB class.

Or this problem might be caused by _mdfd_getseg() in md.c. In the write
phase, dirty buffers do not carry their own FD, so we have to search
for the right FD and re-check the file for each dirty buffer. I think
this is safe for file writing, but it may be too wasteful. If the
BufferTag carried its own FD, checkpoint writing could be more
efficient.

> If the problem is mainly with number of of files in the same
> directory, we could consider revising our directory layout. Instead
> of:
>
> base/${DBOID}/${RELFILENODE}_{FORK}
>
> We could have:
>
> base/${DBOID}/${FORK}/${RELFILENODE}
>
> That would move all the vm and fsm forks to separate directories,
> which would cut down the number of files in the main-fork directory
> significantly. That might be worth doing independently of the issue
> you're raising here. For large clusters, you'd even want one more
> level to keep the directories from getting too big:
>
> base/${DBOID}/${FORK}/${X}/${RELFILENODE}
>
> ...where ${X} is two hex digits, maybe just the low 16 bits of the
> relfilenode number. But this would be not as good for small clusters
> where you'd end up with oodles of little-tiny directories, and I'm not
> sure it'd be practical to smoothly fail over from one system to the
> other.

That seems like a good idea! In general, the base directory is not seen
by users, so a layout that is more efficient for performance and better
suited to large databases should be acceptable.

(2013/07/03 22:39), Andres Freund wrote:
> On 2013-07-03 17:18:29 +0900
> Hm. I wonder how much of this could be gained by doing a
> sync_file_range(SYNC_FILE_RANGE_WRITE) (or similar) either while doing
> the original checkpoint-pass through the buffers or when fsyncing the
> files.

The sync_file_range() system call is interesting, but it is supported
only by Linux kernel 2.6.22 or later. For PostgreSQL, Robert's idea may
suit better, since it does not depend on the kind of OS.

> Presumably the smaller segsize is better because we don't
> completely stall the system by submitting up to 1GB of io at once. So,
> if we were to do it in 32MB chunks and then do a final fsync()
> afterwards we might get most of the benefits.

Yes, I will try to test this setting ('./configure --with-segsize=0.03125')
tonight and will send you the result tomorrow.

I think the best way to write buffers in a checkpoint is to sort them
by the buffer's FD and block number, combined with a small segsize
setting and appropriate sleep times between writes. That would realize
a genuine sorted checkpoint with sequential disk writing! I have put a
few rough sketches below to illustrate these ideas.
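First, about the FD limit: here is a minimal sketch (not PostgreSQL code,
just for illustration) of checking the per-process limit with getrlimit().
The equivalent check from the shell is 'ulimit -n'.

    /* Minimal sketch: report the per-process open-file limit on Linux.
     * For illustration only; PostgreSQL manages FDs through its own
     * virtual file descriptor layer in fd.c. */
    #include <stdio.h>
    #include <sys/resource.h>

    int
    main(void)
    {
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        {
            perror("getrlimit");
            return 1;
        }
        printf("max open files: soft=%llu hard=%llu\n",
               (unsigned long long) rl.rlim_cur,
               (unsigned long long) rl.rlim_max);
        return 0;
    }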
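About Robert's layout proposal, this is roughly how I understand the path
construction. The function name is hypothetical, and I assume the
two-hex-digit bucket ${X} is the low 8 bits of the relfilenode; this is
only a sketch, not a patch.

    /* Hypothetical sketch of the proposed layout
     * base/${DBOID}/${FORK}/${X}/${RELFILENODE}, assuming ${X} is the
     * low 8 bits of the relfilenode printed as two hex digits. */
    #include <stdio.h>

    static void
    relation_path(char *buf, size_t buflen,
                  unsigned int dboid, unsigned int forknum,
                  unsigned int relfilenode)
    {
        snprintf(buf, buflen, "base/%u/%u/%02x/%u",
                 dboid, forknum, relfilenode & 0xff, relfilenode);
    }

    int
    main(void)
    {
        char path[64];

        relation_path(path, sizeof(path), 16384, 0, 24576);
        printf("%s\n", path);   /* prints base/16384/0/00/24576 */
        return 0;
    }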
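About Andres's sync_file_range() suggestion, the idea as I understand it
is roughly the following (Linux 2.6.22 or later only; the function and
file names here are made up for illustration, not actual PostgreSQL code):

    /* Sketch: write a chunk of a segment file, then ask the kernel to
     * start writeback for just that range, and do one fsync() at the
     * end instead of flushing up to 1GB at once. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    static int
    write_chunk_and_hint(int fd, const char *buf, off_t offset, size_t len)
    {
        if (pwrite(fd, buf, len, offset) != (ssize_t) len)
            return -1;

        /* Start asynchronous writeback of this range only. */
        return sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WRITE);
    }

    int
    main(void)
    {
        char buf[8192] = {0};
        int  fd = open("dummy_segment", O_CREAT | O_WRONLY, 0600);

        if (fd < 0)
            return 1;
        if (write_chunk_and_hint(fd, buf, 0, sizeof(buf)) != 0)
            return 1;
        fsync(fd);              /* final fsync, as Andres describes */
        close(fd);
        return 0;
    }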
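And about sorted checkpoint writing, the sort I have in mind is roughly
this comparator over a simplified stand-in for BufferTag (the struct and
names are not the real definitions, only a sketch):

    /* Sketch: sort dirty buffers by relation file, fork and block number
     * before the checkpoint write pass, so that each segment file is
     * written sequentially. */
    #include <stdlib.h>

    typedef struct DirtyBuf
    {
        unsigned int relfilenode;
        unsigned int forknum;
        unsigned int blocknum;
    } DirtyBuf;

    static int
    dirtybuf_cmp(const void *a, const void *b)
    {
        const DirtyBuf *da = (const DirtyBuf *) a;
        const DirtyBuf *db = (const DirtyBuf *) b;

        if (da->relfilenode != db->relfilenode)
            return (da->relfilenode < db->relfilenode) ? -1 : 1;
        if (da->forknum != db->forknum)
            return (da->forknum < db->forknum) ? -1 : 1;
        if (da->blocknum != db->blocknum)
            return (da->blocknum < db->blocknum) ? -1 : 1;
        return 0;
    }

    /* Usage: qsort(bufs, nbufs, sizeof(DirtyBuf), dirtybuf_cmp);
     * then write the buffers in this order, sleeping a little between
     * chunks to keep response times stable. */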
Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center