Thread: odd postgresql performance (excessive lseek)
postgres 8.4.4 on openSUSE 11.3 (2.6.36rc7, x86_64). I was watching a fairly large query run and observed that the disk light went out. I checked 'top' and postgres was using 100% CPU, so I strace'd the running process. This is what I saw:

lseek(87, 0, SEEK_END) = 585531392
lseek(87, 0, SEEK_END) = 585531392
lseek(87, 0, SEEK_END) = 585531392
lseek(87, 0, SEEK_END) = 585531392
lseek(87, 0, SEEK_END) = 585531392
lseek(87, 0, SEEK_END) = 585531392
[the same lseek(87, 0, SEEK_END) call repeated dozens of times]
lseek(94, 270680064, SEEK_SET) = 270680064
read(94, "<elided>"..., 8192) = 8192

and I observed that pattern quite a bit. I know lseek is cheap, but a superfluous system call is still a superfluous system call, and over a short period these amounted to 37% (according to strace) of the time spent in system calls. What's with the excess calls to lseek? The query plan was a nested loop anti-join (on purpose).

-- Jon
No replies?

This is another situation where using pread would have saved a lot of time and sped things up a bit, but failing that, keeping track of the file position ourselves and only lseek'ing when necessary would also help. PostgreSQL was spending 37% of its time in redundant lseek calls!

-- Jon
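A minimal sketch (not PostgreSQL source) of the substitution being suggested here: pread() reads from an explicit offset in a single system call, replacing an lseek()/read() pair. The file name and offset below are placeholders. As Tom Lane's reply further down explains, the repeated lseek(..., 0, SEEK_END) calls in the trace serve a different purpose, so this substitution would not have removed them.

/* Illustrative only, not PostgreSQL source: pread() vs. lseek() + read(). */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char    buf[8192];
    int     fd = open("relation_segment", O_RDONLY);   /* placeholder file */

    if (fd < 0)
        return 1;

    /* Two system calls: position the file, then read at that position. */
    lseek(fd, 270680064, SEEK_SET);
    ssize_t n1 = read(fd, buf, sizeof(buf));

    /* One system call: read at an explicit offset; the file position is
     * left untouched, so no lseek is needed at all. */
    ssize_t n2 = pread(fd, buf, sizeof(buf), 270680064);

    printf("read: %zd bytes, pread: %zd bytes\n", n1, n2);
    close(fd);
    return 0;
}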
On Tue, Oct 19, 2010 at 9:10 AM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> No replies?
>
> This is another situation where using pread would have saved a lot of
> time and sped things up a bit, but failing that, keeping track of the
> file position ourselves and only lseek'ing when necessary would also
> help. PostgreSQL was spending 37% of its time in redundant lseek calls!

37% of CPU time? Is that according to strace -T? How did you measure it?

merlin
On Tue, Oct 19, 2010 at 8:25 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Oct 19, 2010 at 9:10 AM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
>> No replies?
>>
>> This is another situation where using pread would have saved a lot of
>> time and sped things up a bit, but failing that, keeping track of the
>> file position ourselves and only lseek'ing when necessary would also
>> help. PostgreSQL was spending 37% of its time in redundant lseek calls!
>
> 37% of CPU time? Is that according to strace -T? How did you measure it?

Per the original post, it (redundant lseek system calls) accounted for 37% of the time spent in the kernel. Measured with:

strace -f -p <pid> -c

-- Jon
Jon Nelson <jnelson+pgsql@jamponi.net> writes:
> This is another situation where using pread would have saved a lot of
> time and sped things up a bit, but failing that, keeping track of the
> file position ourselves and only lseek'ing when necessary would also
> help.

No, it wouldn't; you don't have the slightest idea what's going on there. Those lseeks are for the purpose of detecting the current EOF location, i.e., finding out whether some other backend has extended the file recently. We could get rid of them, but only at the cost of putting in some other communication mechanism instead.

			regards, tom lane
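For context, a sketch of the size check being described, assuming an already-open file descriptor for one relation segment. Inside PostgreSQL this lives in the md.c storage manager (mdnblocks()); the lseek is how a backend notices that some other backend has extended the file since it last looked.

/* Illustrative sketch, not PostgreSQL source: finding a relation segment's
 * current block count via lseek(SEEK_END), roughly what mdnblocks() boils
 * down to.  Cheap, but still one kernel round trip per call, which is what
 * showed up in the strace above. */
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192                     /* PostgreSQL's default block size */

long
segment_nblocks(int fd)
{
    off_t   len = lseek(fd, 0, SEEK_END);   /* current EOF, reflecting any
                                             * extension by other backends */

    if (len < 0)
        return -1;
    return (long) (len / BLCKSZ);           /* 585531392 / 8192 = 71476 blocks */
}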
On Tue, Oct 19, 2010 at 9:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>> This is another situation where using pread would have saved a lot of
>> time and sped things up a bit, but failing that, keeping track of the
>> file position ourselves and only lseek'ing when necessary would also
>> help.
>
> No, it wouldn't; you don't have the slightest idea what's going on
> there. Those lseeks are for the purpose of detecting the current EOF
> location, i.e., finding out whether some other backend has extended the
> file recently. We could get rid of them, but only at the cost of
> putting in some other communication mechanism instead.

That's a little harsh (it's not untrue, though). It's true I don't know how PostgreSQL manages its files, but now I've been educated (some).

I'm guessing, then, that because each backend may extend files without the other backends knowing of it, using fallocate or some such is also likely a non-starter. I ask because, especially when files are allocated 8 KB at a time, file fragmentation on a busy system is potentially high. I recently saw an ext3 filesystem (dedicated to PostgreSQL) with 38% file fragmentation, and yes, it does make a huge performance difference in some cases. After manually defragmenting some files (with pg offline) I saw read speeds increase from single-digit MB/s to high-double-digit MB/s. However, after asking pg to rewrite some of the worst files (by way of CLUSTER or ALTER TABLE) I saw no improvement - I'm guessing because of the 8KB-at-a-time allocation mechanism.

Has any work been done on making use of shared memory for file stats, or on using fallocate (or posix_fallocate) to allocate files in larger chunks?

-- Jon
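For reference, a minimal sketch of the kind of chunked preallocation being asked about, using posix_fallocate() with a hypothetical 1 MB extension step. PostgreSQL 8.4 does not do this (it extends relations one 8 KB block at a time); this only illustrates the idea.

/* Illustrative only, not PostgreSQL source: extending a file by a larger,
 * contiguous chunk in a single call, which gives the filesystem a better
 * chance to avoid the fragmentation described above. */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

#define EXTEND_CHUNK (1024 * 1024)      /* hypothetical 1 MB extension step */

int
extend_in_chunks(int fd, off_t current_len)
{
    /* posix_fallocate() returns 0 on success or an error number directly. */
    int     err = posix_fallocate(fd, current_len, EXTEND_CHUNK);

    if (err != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
    return err;
}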
Jon Nelson wrote:
> That's a little harsh (it's not untrue, though).

Welcome to pgsql-performance! You can get a right answer, or a nice answer, but given the personalities involved it's hard to get both at the same time. With this crowd, you need to be careful stating something you were speculating about as if you were certain of it, and to be certain here usually means "I read the source code". I recommend writing theories as a question ("would X have sped this up?") rather than a statement ("X will speed this up") if you want to see gentler answers.

> Has any work been done on making use of shared memory for file stats
> or using fallocate (or posix_fallocate) to allocate files in larger
> chunks?

Just plain old pre-allocating in larger chunks turns out to work well. I'm hoping to get that into an official patch eventually.

-- 
Greg Smith, 2ndQuadrant US  greg@2ndQuadrant.com  Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"
Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
On Tue, Oct 19, 2010 at 12:28 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Jon Nelson wrote:
>> That's a little harsh (it's not untrue, though).
>
> Welcome to pgsql-performance! You can get a right answer, or a nice answer,
> but given the personalities involved it's hard to get both at the same time.
> With this crowd, you need to be careful stating something you were
> speculating about as if you were certain of it, and to be certain here
> usually means "I read the source code". I recommend writing theories as a
> question ("would X have sped this up?") rather than a statement ("X will
> speed this up") if you want to see gentler answers.
>
>> Has any work been done on making use of shared memory for file stats
>> or using fallocate (or posix_fallocate) to allocate files in larger
>> chunks?
>
> Just plain old pre-allocating in larger chunks turns out to work well. I'm
> hoping to get that into an official patch eventually.

Hm... wouldn't larger allocation blocks reduce the frequency of "this file got larger via another backend" events, and therefore raise the benefit of pushing those events out to the other backends vs. checking for them over and over?

merlin
On Tue, Oct 19, 2010 at 10:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>> This is another situation where using pread would have saved a lot of
>> time and sped things up a bit, but failing that, keeping track of the
>> file position ourselves and only lseek'ing when necessary would also
>> help.
>
> No, it wouldn't; you don't have the slightest idea what's going on
> there. Those lseeks are for the purpose of detecting the current EOF
> location, i.e., finding out whether some other backend has extended the
> file recently. We could get rid of them, but only at the cost of
> putting in some other communication mechanism instead.

I don't get it. Why would we be doing that in a tight loop within a single backend?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Oct 19, 2010 at 10:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Those lseeks are for the purpose of detecting the current EOF
>> location, i.e., finding out whether some other backend has extended the
>> file recently. We could get rid of them, but only at the cost of
>> putting in some other communication mechanism instead.

> I don't get it. Why would we be doing that in a tight loop within a
> single backend?

Well, we weren't shown any context whatsoever about what the backend was actually doing ... but, for example, the planner likes to recheck the current physical size of each relation in a query, so that it's working with an up-to-date number. That could probably be avoided, since an estimate would be good enough as long as it wasn't horribly stale. But there are other places that *have* to have the accurate size, like seqscan startup.

I doubt it was as tight a loop as all that. It wouldn't be hard at all to have an example where those lseeks are the only operations visible to strace, if all the data the backend needs is in shared buffers.

			regards, tom lane
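A schematic sketch (not PostgreSQL source) of the situation described above, assuming a nested-loop plan that restarts its inner sequential scan once per outer row: each restart re-checks the relation's length with lseek(), while the page fetches themselves are satisfied from an in-memory buffer cache and never reach the kernel, so lseek is all strace sees. The cache-lookup functions are hypothetical stand-ins.

/* Schematic only -- not PostgreSQL source.  Shows how one lseek(SEEK_END) per
 * inner-scan restart can be the only system call visible in strace when every
 * page the scan needs is already cached in memory. */
#include <unistd.h>

#define BLCKSZ 8192

/* Hypothetical stand-ins for a shared buffer cache. */
extern const char *buffer_cache_lookup(int fd, long blocknum); /* NULL on miss */
extern const char *buffer_cache_load(int fd, long blocknum);   /* issues read() */
extern void        process_tuples_on_page(const char *page);

static void
inner_seqscan(int fd)
{
    /* Scan startup must know the file's current length, because another
     * backend may have extended it: one lseek() per restart. */
    long    nblocks = lseek(fd, 0, SEEK_END) / BLCKSZ;

    for (long blk = 0; blk < nblocks; blk++)
    {
        const char *page = buffer_cache_lookup(fd, blk);

        if (page == NULL)
            page = buffer_cache_load(fd, blk);   /* only a miss hits the kernel */
        process_tuples_on_page(page);
    }
}

void
nested_loop(int inner_fd, long n_outer_rows)
{
    /* One inner-scan restart per outer row means one lseek() per outer row,
     * and no other syscalls if the whole table fits in the buffer cache. */
    for (long i = 0; i < n_outer_rows; i++)
        inner_seqscan(inner_fd);
}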