Thread: fstat vs. lseek
In response to my blog post on lseek contention, someone posted a
comment wherein they proposed using fstat() rather than lseek() to get
file sizes.

http://rhaas.blogspot.com/2011/08/linux-and-glibc-scalability.html

I tried that on a RHEL 6.1 machine with 64 cores running
2.6.32-131.6.1.el6.x86_64, and it's pretty clear that the locking
characteristics are completely different.  At 1 client, the lseek
method appears to be slightly faster, although it's not beyond belief
that the difference could be in the noise.  Above 40 cores, however,
the fstat method thumps the lseek method up one side and down the
other.

Patch and test results are attached.  Test runs are 5-minute runs with
scale factor 100 and shared_buffers=8GB.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
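For readers following along, the two ways of asking the kernel for a file's size that this thread compares can be sketched as below. This is only an illustration, not the attached patch: the helper names file_size_lseek and file_size_fstat are made up for this example, and PostgreSQL's real code goes through its own file-access layer rather than calling these directly.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: size via lseek().
 * Seeking to offset 0 relative to SEEK_END returns the end-of-file
 * offset, which is the file size.  Side effect: the descriptor's
 * file position is moved to end of file.
 * Returns (off_t) -1 on error.
 */
off_t
file_size_lseek(int fd)
{
    return lseek(fd, 0, SEEK_END);
}

/*
 * Hypothetical helper: size via fstat().
 * Reads the file's metadata and returns st_size; the descriptor's
 * file position is left untouched.
 * Returns (off_t) -1 on error.
 */
off_t
file_size_fstat(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return (off_t) -1;
    return st.st_size;
}
```

For a regular file both helpers report the same size; which one scales better under concurrency depends on the kernel and file system, which is what the rest of the thread is about.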
Robert Haas <robertmhaas@gmail.com> writes:
> In response to my blog post on lseek contention, someone posted a
> comment wherein they proposed using fstat() rather than lseek() to get
> file sizes.

> Patch and test results are attached.  Test runs are 5-minute runs with
> scale factor 100 and shared_buffers=8GB.

> Thoughts?

I'm a bit concerned by the fact that you've only tested this on one
operating system, and thus the performance characteristics could be
quite different elsewhere.

The comment in mdextend also points out a way in which this might not
be a win --- did you test anything besides read-only scenarios?

			regards, tom lane
On Monday, August 08, 2011 10:30:38 Robert Haas wrote:
> In response to my blog post on lseek contention, someone posted a
> comment wherein they proposed using fstat() rather than lseek() to get
> file sizes.
>
> Thoughts?

I don't think it's a good idea to replace lseek with fstat in the long
run.  The likelihood that the lockless generic_file_llseek will get
included seems rather high to me.  In contrast to that, fstat will
always be more expensive, as it's going through a security check and
then the fs' getattr implementation (which actually takes a lock on
some fs).

On the other hand, it's currently lockless if the security subsystem is
compiled out (i.e. no selinux et al) for some common fs (ext3/4, xfs).

Andres
On Mon, Aug 8, 2011 at 10:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm a bit concerned by the fact that you've only tested this on one
> operating system, and thus the performance characteristics could be
> quite different elsewhere.  The comment in mdextend also points out
> a way in which this might not be a win --- did you test anything besides
> read-only scenarios?

Nope.

On Mon, Aug 8, 2011 at 10:49 AM, Andres Freund <andres@anarazel.de> wrote:
> I don't think it's a good idea to replace lseek with fstat in the long run. The
> likelihood that the lockless generic_file_llseek will get included seems rather
> high to me. In contrast to that fstat will always be more expensive than that
> as it's going through a security check and then the fs' getattr implementation
> (which actually takes a lock on some fs).

*scratches head*  I understand that stat() would need a security
check, but why would fstat()?

I think both of you raise good points.  I wasn't too enthusiastic
about this approach either.  It's not very appealing to adopt an
approach where the right performance decision is going to depend on
operating system, file system, kernel version, core count, and
workload.  We could add a GUC, but it would be pretty annoying to have
a setting that won't matter for most people at all, except
occasionally when it makes a huge difference.

I wasn't aware that there was any current activity around this on the
Linux side.  But Andres' comments made me Google it again, and now I
see this:

https://lkml.org/lkml/2011/6/16/800

Andres, any idea what the status of that patch is?  I'm not clear on
how Linux works in terms of things getting upstreamed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Monday, August 08, 2011 11:33:29 Robert Haas wrote:
> On Mon, Aug 8, 2011 at 10:49 AM, Andres Freund <andres@anarazel.de> wrote:
> > I don't think it's a good idea to replace lseek with fstat in the long
> > run. The likelihood that the lockless generic_file_llseek will get
> > included seems rather high to me. In contrast to that fstat will always
> > be more expensive than that as it's going through a security check and
> > then the fs' getattr implementation (which actually takes a lock on
> > some fs).
> *scratches head*  I understand that stat() would need a security
> check, but why would fstat()?

That I am not totally sure of either.  I guess Kaigai might know more
about that.  I guess it might be that a forked process possibly is not
allowed anymore to access the information from an inherited file
handle?  Also, I think a process can change its permissions during
runtime.

> I think both of you raise good points.  I wasn't too enthusiastic
> about this approach either.  It's not very appealing to adopt an
> approach where the right performance decision is going to depend on
> operating system, file system, kernel version, core count, and
> workload.  We could add a GUC, but it would be pretty annoying to have
> a setting that won't matter for most people at all, except
> occasionally when it makes a huge difference.
>
> I wasn't aware that there was any current activity around this on the
> Linux side.  But Andres' comments made me Google it again, and now I
> see this:
>
> https://lkml.org/lkml/2011/6/16/800
>
> Andres, any idea what the status of that patch is?  I'm not clear on
> how Linux works in terms of things getting upstreamed.

There doesn't seem to have been any activity to include it in 3.1.  The
merge window for 3.1 just ended.  The next one will open for about a
week after the release.  It's also not yet included in linux-next,
which is a "preview" for the currently worked on release + 1.  A
release takes roughly 3 months.

For upstreaming, somebody needs to be persistent enough to convince one
of the maintainers of the particular area to include the code so that
Linus can then pull that.  I guess citing your numbers would go a long
way in that direction.  Naturally it would be even better to include
results with the patch applied.

My largest machine I can reboot often enough to test such a thing has
only two sockets (4-core E5520).  I guess you cannot reboot your loaned
machine with a new kernel easily?

Greetings,

Andres
On Mon, Aug 8, 2011 at 1:10 PM, Andres Freund <andres@anarazel.de> wrote:
> There doesn't seem to have been any activity to include it in 3.1. The merge
> window for 3.1 just ended. The next one will open for about a week after the
> release.
> It's also not yet included in linux-next which is a "preview" for the currently
> worked on release + 1. A release takes roughly 3 months.

OK.  If it doesn't get into Linux 3.2 we had better start thinking
hard about a workaround on our side.  I am not too concerned about
people hitting this with PostgreSQL 9.1 or prior, because you'd
basically need a workload targeted to exercise the problem, which
workload is not that similar to the way people actually do things in
real life.  However, in PostgreSQL 9.2devel, it's going to be much
more of a real-world problem, so I'd hate to wait until after our
feature freeze and then decide we've got a problem we have to fix.

> For upstreaming somebody needs to be persistent enough to convince one of the
> maintainers of the particular area to include the code so that Linus then can
> pull that.
> I guess citing your numbers would go a long way in that direction. Naturally
> it would be even better to include results with the patch applied.
> My largest machine I can reboot often enough to test such a thing has only two
> sockets (4-core E5520). I guess you cannot reboot your loaned machine with a
> new kernel easily?

Not really.  I do have root access to a 64-core box at the moment, and
I could probably get permission to reboot it, but if it didn't come
back on-line that would be awkward.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> Not really.  I do have root access to a 64-core box at the moment, and
> I could probably get permission to reboot it, but if it didn't come
> back on-line that would be awkward.

Red Hat has some test hardware that I can use (... pokes around ...)
Hmm, this one looks promising:

Memory:      64348 MB
NUMA Nodes:  4

Cpu
  Vendor:      GenuineIntel
  Model Name:  Intel(R) Xeon(R) CPU E7- 4860 @ 2.27GHz
  Family:      6
  Model:       47
  Stepping:    2
  Speed:       1064.0
  Processors:  80
  Cores:       40
  Sockets:     4
  Hyper:       True

If you can wrap something up to the point where someone else can
run it, I'll give it a shot.

			regards, tom lane
On Monday, August 08, 2011 13:19:13 Robert Haas wrote:
> On Mon, Aug 8, 2011 at 1:10 PM, Andres Freund <andres@anarazel.de> wrote:
> > There doesn't seem to have been any activity to include it in 3.1. The
> > merge window for 3.1 just ended. The next one will open for about a
> > week after the release.
> > It's also not yet included in linux-next which is a "preview" for the
> > currently worked on release + 1. A release takes roughly 3 months.
>
> OK.  If it doesn't get into Linux 3.2 we had better start thinking
> hard about a workaround on our side.

If it's ok, I will write a mail to lkml referencing this thread and
your numbers inline (with attribution obviously).

I don't think it will be that hard to convince them.  But I constantly
surprise myself with naivety, so I may be wrong.

> > My largest machine I can reboot often enough to test such a thing has only
> > two sockets (4-core E5520). I guess you cannot reboot your loaned machine
> > with a new kernel easily?
> Not really.  I do have root access to a 64-core box at the moment, and
> I could probably get permission to reboot it, but if it didn't come
> back on-line that would be awkward.

As I feared.  Any chance that the person lending you the machine can
give you a hand?

Although I don't know how that could be after reading the code, it
would be disappointing to wait for 3.2 with the llseek fixes appearing
in $distribution just to notice fstat is still faster for
$unobvious_reason...

Andres
On Mon, Aug 8, 2011 at 1:31 PM, Andres Freund <andres@anarazel.de> wrote:
> If it's ok I will write a mail to lkml referencing this thread and your numbers
> inline (with attribution obviously).

That would be great.  Please go ahead.

> I don't think it will be that hard to convince them. But I constantly surprise
> myself with naivety so I may be wrong.

Heh, heh, open source is fun.

> > > My largest machine I can reboot often enough to test such a thing has only
> > > two sockets (4-core E5520). I guess you cannot reboot your loaned machine
> > > with a new kernel easily?
> > Not really.  I do have root access to a 64-core box at the moment, and
> > I could probably get permission to reboot it, but if it didn't come
> > back on-line that would be awkward.
> As I feared. Any chance that the person lending you the machine can give you a
> hand?

Uh, maybe, but considering my relative inexperience in compiling the
Linux kernel, I'd be a little worried about having to iterate too many
times.

> Although I don't know how that could be after reading the code it would be
> disappointing to wait for 3.2 with the llseek fixes appearing in $distribution
> just to notice fstat is still faster for $unobvious_reason...

Well, the good thing here is that we are really only concerned with
gross effects.  It's pretty obvious from the numbers I posted upthread
that the problem is related to lock contention.  If that gets fixed,
and lseek is still 20% slower under some set of circumstances, it's
not clear that we're really gonna care.  I mean, maybe it would be
nice to avoid going to the kernel at all here just so we're immune to
possible inefficiencies in other operating systems (it would be nice
if someone could repeat these tests on a big SMP box running Windows
and/or one of the BSD systems) and to save the overhead of a system
call, but those effects are pretty tiny.  We could spend a lot of time
optimizing other things before that one percolated up to the top of
the heap, at least based on what I've seen so far.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
hi

On 08/08/2011 07:50 PM, Robert Haas wrote:
> On Mon, Aug 8, 2011 at 1:31 PM, Andres Freund <andres@anarazel.de> wrote:
>> If it's ok I will write a mail to lkml referencing this thread and your numbers
>> inline (with attribution obviously).
>
> That would be great.  Please go ahead.

I've just stumbled across this thread on lkml [1],
"Improve lseek scalability v3",

and I thought to ping the pgsql hackers list
just in case. More to the point, they're
asking "are there any real workloads which care
[Make generic lseek lockless safe]".

Maybe I've got it wrong, but it seems somewhat
related to what has been discussed here and also
in Robert Haas's "Linux and glibc Scalability"
blog post [2].

[cut]

Andrea

[1] https://lkml.org/lkml/2011/9/15/399
[2] http://rhaas.blogspot.com/2011/08/linux-and-glibc-scalability.html
On Friday 16 Sep 2011 15:19:07 Andrea Suisani wrote:
> hi
>
> On 08/08/2011 07:50 PM, Robert Haas wrote:
> > On Mon, Aug 8, 2011 at 1:31 PM, Andres Freund <andres@anarazel.de> wrote:
> >> If it's ok I will write a mail to lkml referencing this thread and your
> >> numbers inline (with attribution obviously).
> >
> > That would be great.  Please go ahead.
>
> I've just stumbled across this thread on lkml [1]
> "Improve lseek scalability v3".
>
> and I thought to ping pgsql hackers list
> just in case, more to the point they're
> asking "are there any real workloads which care
> [Make generic lseek lockless safe]"

I wrote them a mail some time ago (some weeks) regarding an earlier
version of the patch...  Can't find it right now, though.

Andres
Hi All,

The lseek patches just got included in Linus's tree.

Andres
On Fri, Oct 28, 2011 at 3:33 PM, Andres Freund <andres@anarazel.de> wrote:
> The lseek patches just got included in Linus's tree.

Excellent, thanks for the update!

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=ef3d0fd27e90f67e35da516dafc1482c82939a60

So I guess this will be in Linux 3.2?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On Friday, October 28, 2011 09:40:51 PM Robert Haas wrote:
> On Fri, Oct 28, 2011 at 3:33 PM, Andres Freund <andres@anarazel.de> wrote:
> > The lseek patches just got included in Linus's tree.
>
> Excellent, thanks for the update!
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=ef3d0fd27e90f67e35da516dafc1482c82939a60
>
> So I guess this will be in Linux 3.2?

Unless they get reverted for some reason, yes.

Andres