Thread: odd postgresql performance (excessive lseek)
postgres 8.4.4 on openSUSE 11.3 (2.6.36rc7, x86_64). I was watching a fairly large query run and observed that the disk light went out. I checked 'top' and postgres was using 100% CPU, so I strace'd the running process. This is what I saw:

lseek(87, 0, SEEK_END) = 585531392
lseek(87, 0, SEEK_END) = 585531392
lseek(87, 0, SEEK_END) = 585531392
lseek(87, 0, SEEK_END) = 585531392
lseek(87, 0, SEEK_END) = 585531392
lseek(87, 0, SEEK_END) = 585531392
[the same lseek(87, 0, SEEK_END) call repeated dozens of times]
lseek(94, 270680064, SEEK_SET) = 270680064
read(94, "<elided>"..., 8192) = 8192

and I observed that pattern quite a bit. I know lseek is cheap, but a superfluous system call is still a superfluous system call, and over a short period these amounted to 37% (according to strace) of the time spent in system calls. What's with the excess calls to lseek? The query plan was a nested loop anti-join (on purpose).

-- Jon
No replies?

This is another situation where using pread would have saved a lot of time and sped things up a bit, but failing that, keeping track of the file position ourselves and only lseek'ing when necessary would also help. PostgreSQL was spending 37% of its time in redundant lseek calls!

-- Jon
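A minimal sketch (not PostgreSQL source) of the substitution being suggested here: pread() reads from an explicit offset in a single system call, replacing an lseek()/read() pair. The file name and offset below are placeholders. As Tom Lane's reply further down explains, the repeated lseek(..., 0, SEEK_END) calls in the trace serve a different purpose, so this substitution would not have removed them.

/* Illustrative only, not PostgreSQL source: pread() vs. lseek() + read(). */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char    buf[8192];
    int     fd = open("relation_segment", O_RDONLY);   /* placeholder file */

    if (fd < 0)
        return 1;

    /* Two system calls: position the file, then read at that position. */
    lseek(fd, 270680064, SEEK_SET);
    ssize_t n1 = read(fd, buf, sizeof(buf));

    /* One system call: read at an explicit offset; the file position is
     * left untouched, so no lseek is needed at all. */
    ssize_t n2 = pread(fd, buf, sizeof(buf), 270680064);

    printf("read: %zd bytes, pread: %zd bytes\n", n1, n2);
    close(fd);
    return 0;
}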
On Tue, Oct 19, 2010 at 9:10 AM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> No replies?
>
> This is another situation where using pread would have saved a lot of
> time and sped things up a bit, but failing that, keeping track of the
> file position ourselves and only lseek'ing when necessary would also
> help. PostgreSQL was spending 37% of its time in redundant lseek calls!

37% of CPU time? Is that according to strace -T? How did you measure it?

merlin
On Tue, Oct 19, 2010 at 8:25 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Oct 19, 2010 at 9:10 AM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
>> No replies?
>>
>> This is another situation where using pread would have saved a lot of
>> time and sped things up a bit, but failing that, keeping track of the
>> file position ourselves and only lseek'ing when necessary would also
>> help. PostgreSQL was spending 37% of its time in redundant lseek calls!
>
> 37% of CPU time? Is that according to strace -T? How did you measure it?

Per the original post, it (redundant lseek system calls) accounted for 37% of the time spent in the kernel. Measured with:

strace -f -p <pid> -c

-- Jon
Jon Nelson <jnelson+pgsql@jamponi.net> writes:
> This is another situation where using pread would have saved a lot of
> time and sped things up a bit, but failing that, keeping track of the
> file position ourselves and only lseek'ing when necessary would also
> help.

No, it wouldn't; you don't have the slightest idea what's going on there. Those lseeks are for the purpose of detecting the current EOF location, i.e., finding out whether some other backend has extended the file recently. We could get rid of them, but only at the cost of putting in some other communication mechanism instead.

			regards, tom lane
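For context, a sketch of the size check being described, assuming an already-open file descriptor for one relation segment. Inside PostgreSQL this lives in the md.c storage manager (mdnblocks()); the lseek is how a backend notices that some other backend has extended the file since it last looked.

/* Illustrative sketch, not PostgreSQL source: finding a relation segment's
 * current block count via lseek(SEEK_END), roughly what mdnblocks() boils
 * down to.  Cheap, but still one kernel round trip per call, which is what
 * showed up in the strace above. */
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192                     /* PostgreSQL's default block size */

long
segment_nblocks(int fd)
{
    off_t   len = lseek(fd, 0, SEEK_END);   /* current EOF, reflecting any
                                             * extension by other backends */

    if (len < 0)
        return -1;
    return (long) (len / BLCKSZ);           /* 585531392 / 8192 = 71476 blocks */
}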
On Tue, Oct 19, 2010 at 9:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>> This is another situation where using pread would have saved a lot of
>> time and sped things up a bit, but failing that, keeping track of the
>> file position ourselves and only lseek'ing when necessary would also
>> help.
>
> No, it wouldn't; you don't have the slightest idea what's going on
> there. Those lseeks are for the purpose of detecting the current EOF
> location, i.e., finding out whether some other backend has extended the
> file recently. We could get rid of them, but only at the cost of
> putting in some other communication mechanism instead.

That's a little harsh (it's not untrue, though). It's true I don't know how PostgreSQL manages its files, but now I've been educated (some).

I'm guessing, then, that because each backend may extend files without the other backends knowing of it, using fallocate or some such is also likely a non-starter. I ask because, especially when files are allocated 8 KB at a time, file fragmentation on a busy system is potentially high. I recently saw an ext3 filesystem (dedicated to PostgreSQL) with 38% file fragmentation, and yes, it does make a huge performance difference in some cases. After manually defragmenting some files (with pg offline) I saw read speeds increase from single-digit MB/s to high-double-digit MB/s. However, after asking pg to rewrite some of the worst files (by way of CLUSTER or ALTER TABLE) I saw no improvement - I'm guessing because of the 8KB-at-a-time allocation mechanism.

Has any work been done on making use of shared memory for file stats, or on using fallocate (or posix_fallocate) to allocate files in larger chunks?

-- Jon
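For reference, a minimal sketch of the kind of chunked preallocation being asked about, using posix_fallocate() with a hypothetical 1 MB extension step. PostgreSQL 8.4 does not do this (it extends relations one 8 KB block at a time); this only illustrates the idea.

/* Illustrative only, not PostgreSQL source: extending a file by a larger,
 * contiguous chunk in a single call, which gives the filesystem a better
 * chance to avoid the fragmentation described above. */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

#define EXTEND_CHUNK (1024 * 1024)      /* hypothetical 1 MB extension step */

int
extend_in_chunks(int fd, off_t current_len)
{
    /* posix_fallocate() returns 0 on success or an error number directly. */
    int     err = posix_fallocate(fd, current_len, EXTEND_CHUNK);

    if (err != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
    return err;
}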
Jon Nelson wrote:
> That's a little harsh (it's not untrue, though).

Welcome to pgsql-performance! You can get a right answer, or a nice answer, but given the personalities involved it's hard to get both at the same time. With this crowd, you need to be careful stating something you were speculating about as if you were certain of it, and to be certain here usually means "I read the source code". I recommend writing theories as a question ("would X have sped this up?") rather than a statement ("X will speed this up") if you want to see gentler answers.

> Has any work been done on making use of shared memory for file stats
> or using fallocate (or posix_fallocate) to allocate files in larger
> chunks?

Just plain old pre-allocating in larger chunks turns out to work well. I'm hoping to get that into an official patch eventually.

-- 
Greg Smith, 2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"
Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
On Tue, Oct 19, 2010 at 12:28 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Jon Nelson wrote:
>> That's a little harsh (it's not untrue, though).
>
> Welcome to pgsql-performance! You can get a right answer, or a nice answer,
> but given the personalities involved it's hard to get both at the same time.
> With this crowd, you need to be careful stating something you were
> speculating about as if you were certain of it, and to be certain here
> usually means "I read the source code". I recommend writing theories as a
> question ("would X have sped this up?") rather than a statement ("X will
> speed this up") if you want to see gentler answers.
>
>> Has any work been done on making use of shared memory for file stats
>> or using fallocate (or posix_fallocate) to allocate files in larger
>> chunks?
>
> Just plain old pre-allocating in larger chunks turns out to work well. I'm
> hoping to get that into an official patch eventually.

Hm... wouldn't larger allocation blocks reduce the frequency of "this file got larger via another backend" events, and therefore raise the benefit of pushing those events out to the other backends vs. checking for them over and over?

merlin
On Tue, Oct 19, 2010 at 10:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>> This is another situation where using pread would have saved a lot of
>> time and sped things up a bit, but failing that, keeping track of the
>> file position ourselves and only lseek'ing when necessary would also
>> help.
>
> No, it wouldn't; you don't have the slightest idea what's going on
> there. Those lseeks are for the purpose of detecting the current EOF
> location, i.e., finding out whether some other backend has extended the
> file recently. We could get rid of them, but only at the cost of
> putting in some other communication mechanism instead.

I don't get it. Why would we be doing that in a tight loop within a single backend?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Oct 19, 2010 at 10:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Those lseeks are for the purpose of detecting the current EOF
>> location, i.e., finding out whether some other backend has extended the
>> file recently. We could get rid of them, but only at the cost of
>> putting in some other communication mechanism instead.

> I don't get it. Why would we be doing that in a tight loop within a
> single backend?

Well, we weren't shown any context whatsoever about what the backend was actually doing ... but, for example, the planner likes to recheck the current physical size of each relation in a query, so that it's working with an up-to-date number. That could probably be avoided, since an estimate would be good enough as long as it wasn't horribly stale. But there are other places that *have* to have the accurate size, like seqscan startup.

I doubt it was as tight a loop as all that. It wouldn't be hard at all to have an example where those lseeks are the only operations visible to strace, if all the data the backend needs is in shared buffers.

			regards, tom lane
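A schematic sketch (not PostgreSQL source) of the situation described above, assuming a nested-loop plan that restarts its inner sequential scan once per outer row: each restart re-checks the relation's length with lseek(), while the page fetches themselves are satisfied from an in-memory buffer cache and never reach the kernel, so lseek is all strace sees. The cache-lookup functions are hypothetical stand-ins.

/* Schematic only -- not PostgreSQL source.  Shows how one lseek(SEEK_END) per
 * inner-scan restart can be the only system call visible in strace when every
 * page the scan needs is already cached in memory. */
#include <unistd.h>

#define BLCKSZ 8192

/* Hypothetical stand-ins for a shared buffer cache. */
extern const char *buffer_cache_lookup(int fd, long blocknum); /* NULL on miss */
extern const char *buffer_cache_load(int fd, long blocknum);   /* issues read() */
extern void        process_tuples_on_page(const char *page);

static void
inner_seqscan(int fd)
{
    /* Scan startup must know the file's current length, because another
     * backend may have extended it: one lseek() per restart. */
    long    nblocks = lseek(fd, 0, SEEK_END) / BLCKSZ;

    for (long blk = 0; blk < nblocks; blk++)
    {
        const char *page = buffer_cache_lookup(fd, blk);

        if (page == NULL)
            page = buffer_cache_load(fd, blk);   /* only a miss hits the kernel */
        process_tuples_on_page(page);
    }
}

void
nested_loop(int inner_fd, long n_outer_rows)
{
    /* One inner-scan restart per outer row means one lseek() per outer row,
     * and no other syscalls if the whole table fits in the buffer cache. */
    for (long i = 0; i < n_outer_rows; i++)
        inner_seqscan(inner_fd);
}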