Thread: Large number of open(2) calls with bulk INSERT into empty table

Large number of open(2) calls with bulk INSERT into empty table

From
Florian Weimer
Date:
I noticed that a bulk INSERT into an empty table (which has been
TRUNCATEd in the same transaction, for good measure) results in a
curious number of open(2) calls for the FSM resource fork:

open("base/657862/16554373_fsm", O_RDWR) = -1 ENOENT (No such file or directory)
lseek(17, 0, SEEK_END)                  = 407584768
write(17, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
lseek(18, 0, SEEK_END)                  = 333119488
write(18, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
lseek(18, 0, SEEK_END)                  = 333127680
write(18, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
open("base/657862/16554373_fsm", O_RDWR) = -1 ENOENT (No such file or directory)
lseek(17, 0, SEEK_END)                  = 407592960
write(17, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
lseek(18, 0, SEEK_END)                  = 333135872
write(18, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
open("base/657862/16554373_fsm", O_RDWR) = -1 ENOENT (No such file or directory)
lseek(17, 0, SEEK_END)                  = 407601152
write(17, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
lseek(18, 0, SEEK_END)                  = 333144064
write(18, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
lseek(18, 0, SEEK_END)                  = 333152256
write(18, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
open("base/657862/16554373_fsm", O_RDWR) = -1 ENOENT (No such file or directory)

I'm not sure if this result in a significant performance hit on Linux
(because the dentry cache covers negative lookups, too), but I suspect
that it could be an issue with other systems.

This happens with PostgreSQL 9.1.0.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: Large number of open(2) calls with bulk INSERT into empty table

From
Robert Haas
Date:
On Sun, Nov 27, 2011 at 10:24 AM, Florian Weimer <fweimer@bfk.de> wrote:
> I noticed that a bulk INSERT into an empty table (which has been
> TRUNCATEd in the same transaction, for good measure) results in a
> curious number of open(2) calls for the FSM resource fork:

That's kind of unfortunate.  It looks like every time we extend the
relation, we try to read the free space map to see whether there's a
block available with free space in it.  But since we never actually
make any entries in the free space map, the fork never gets created,
so every attempt to read it involves a system call to see whether it's
there.

I set up the following test case to try to measure the overhead on my
MacBook Pro:

create table bob (a integer, b text);

pgbench -f foo -t 100, with the following contents for foo:

begin;
truncate bob;
insert into bob select g,
random()::text||random()::text||random()::text||random()::text from
generate_series(1,10000) g;
commit;

I tried whacking out the call to GetPageWithFreeSpace() in
RelationGetBufferForTuple(), and also with the unpatched code, but the
run-to-run randomness was way more than any difference the change
made.  Is there a better test case?

I've had the thought before that maybe we should cache the size of
some limited number of relation forks in shared memory.  That would
potentially eliminate not only the open() calls but also the lseek()
calls.  The trouble is, to get any benefit from such a change, we'd
need to have a userspace cache which was at least as concurrent as
what the kernel implements.  We're currently well behind the Linux
kernel in terms of synchronization techniques, so that would represent
a considerable investment of time and energy.

In this particular case, it seems like there's probably some way to be
smarter.  If we knew that the relation was created or truncated in the
current transaction, and we knew that we hadn't created the free space
map for it, we could presumably deduce that it still doesn't exist.
Not sure exactly how to make that work, though, and it doesn't solve
the more general problem where you create in one transaction and then
insert in the next.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Large number of open(2) calls with bulk INSERT into empty table

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Nov 27, 2011 at 10:24 AM, Florian Weimer <fweimer@bfk.de> wrote:
>> I noticed that a bulk INSERT into an empty table (which has been
>> TRUNCATEd in the same transaction, for good measure) results in a
>> curious number of open(2) calls for the FSM resource fork:

> That's kind of unfortunate.  It looks like every time we extend the
> relation, we try to read the free space map to see whether there's a
> block available with free space in it.  But since we never actually
> make any entries in the free space map, the fork never gets created,
> so every attempt to read it involves a system call to see whether it's
> there.

I wonder whether it'd help if we went ahead and created the FSM file,
with length zero, as soon as the relation is made (or maybe when it
first becomes of nonzero length).  That would at least save the failed
open()s.  We'd still be doing lseeks on the FSM file, but those ought
to be cheaper.

A less shaky way to do it would be to just create the first page of the
FSM file immediately, but that would represent an annoying percentage
increase in the disk space needed for small tables.  Or maybe we could
compromise and do that once the rel reaches N pages, for some N.
        regards, tom lane


Re: Large number of open(2) calls with bulk INSERT into empty table

From
Robert Haas
Date:
On Wed, Nov 30, 2011 at 12:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sun, Nov 27, 2011 at 10:24 AM, Florian Weimer <fweimer@bfk.de> wrote:
>>> I noticed that a bulk INSERT into an empty table (which has been
>>> TRUNCATEd in the same transaction, for good measure) results in a
>>> curious number of open(2) calls for the FSM resource fork:
>
>> That's kind of unfortunate.  It looks like every time we extend the
>> relation, we try to read the free space map to see whether there's a
>> block available with free space in it.  But since we never actually
>> make any entries in the free space map, the fork never gets created,
>> so every attempt to read it involves a system call to see whether it's
>> there.
>
> I wonder whether it'd help if we went ahead and created the FSM file,
> with length zero, as soon as the relation is made (or maybe when it
> first becomes of nonzero length).  That would at least save the failed
> open()s.  We'd still be doing lseeks on the FSM file, but those ought
> to be cheaper.
>
> A less shaky way to do it would be to just create the first page of the
> FSM file immediately, but that would represent an annoying percentage
> increase in the disk space needed for small tables.

Well, unfortunately, we're not really doing a good job dodging that
problem as it is.  For example:

rhaas=# create table foo (a int);
CREATE TABLE
rhaas=# select pg_relation_size('foo'), pg_table_size('foo');pg_relation_size | pg_table_size
------------------+---------------               0 |             0
(1 row)

rhaas=# insert into foo values (1);
INSERT 0 1
rhaas=# select pg_relation_size('foo'), pg_table_size('foo');pg_relation_size | pg_table_size
------------------+---------------            8192 |          8192
(1 row)

rhaas=# vacuum foo;
VACUUM
rhaas=# select pg_relation_size('foo'), pg_table_size('foo');pg_relation_size | pg_table_size
------------------+---------------            8192 |         40960
(1 row)

rhaas=#

Yikes!  A table with 4 bytes of useful data is consuming 40kB on disk
- 8kB in the main form, 8kB in the VM fork, and 24kB in the FSM fork.
Ouch!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Large number of open(2) calls with bulk INSERT into empty table

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Yikes!  A table with 4 bytes of useful data is consuming 40kB on disk
> - 8kB in the main form, 8kB in the VM fork, and 24kB in the FSM fork.
> Ouch!

Yeah, ouch.  Why is the FSM fork eating so much space --- I'd have
expected 8k there, but 24?

Also, if VACUUM is going to cause the FSM to be created anyway, there
may not be a lot of point to refraining from creating the first page
right away.
        regards, tom lane


Re: Large number of open(2) calls with bulk INSERT into empty table

From
Heikki Linnakangas
Date:
On 30.11.2011 20:45, Tom Lane wrote:
> Robert Haas<robertmhaas@gmail.com>  writes:
>> Yikes!  A table with 4 bytes of useful data is consuming 40kB on disk
>> - 8kB in the main form, 8kB in the VM fork, and 24kB in the FSM fork.
>> Ouch!
>
> Yeah, ouch.  Why is the FSM fork eating so much space --- I'd have
> expected 8k there, but 24?

The FSM is a three-level tree (with the default BLCKSZ), and the code 
creates all three levels right from the start. That keeps the addressing 
simple, you can easily calculate the location of the FSM entry for a 
given page. I tried to come up with an addressing scheme that would 
allow adding upper-level pages only as needed, but failed. If you have 
any ideas on that, I'm all ears. I have a feeling that there's got to be 
some well-known scheme that does that, but I didn't find it. The current 
addressing scheme is documented in storage/freespace/README, in section 
Higher-Level Structure.

> Also, if VACUUM is going to cause the FSM to be created anyway, there
> may not be a lot of point to refraining from creating the first page
> right away.

Perhaps we should refrain from creating a FSM unless the table is larger 
than 1 block.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Large number of open(2) calls with bulk INSERT into empty table

From
Florian Weimer
Date:
* Robert Haas:

> I tried whacking out the call to GetPageWithFreeSpace() in
> RelationGetBufferForTuple(), and also with the unpatched code, but the
> run-to-run randomness was way more than any difference the change
> made.  Is there a better test case?

I think that if you want to exercise file system lookup performance, you
need a larger directory, which presumably means a large number of
tables.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: Large number of open(2) calls with bulk INSERT into empty table

From
Robert Haas
Date:
On Tue, Dec 6, 2011 at 7:12 AM, Florian Weimer <fweimer@bfk.de> wrote:
> * Robert Haas:
>
>> I tried whacking out the call to GetPageWithFreeSpace() in
>> RelationGetBufferForTuple(), and also with the unpatched code, but the
>> run-to-run randomness was way more than any difference the change
>> made.  Is there a better test case?
>
> I think that if you want to exercise file system lookup performance, you
> need a larger directory, which presumably means a large number of
> tables.

OK.  I created 100,000 dummy tables, 10,000 at a time avoid blowing up
the lock manager.  I then repeated my previous tests, and I still
can't see any meaningful difference (on my MacBook Pro, running MacOS
X v10.6.8).  So at least on this OS, it doesn't seem to matter much.
I'm inclined to defer putting any more work into it until such time as
someone can demonstrate that it actually causes a problem and provides
a reproducible test case.  I don't deny that there's probably an
effect and it would be nice to improve this, but it doesn't seem worth
spending a lot of time on until we can find a case where the effect is
measurable.

On the other hand, the problem of the FSM taking up 24kB for an 8kB
table seems clearly worth fixing, but I don't think I have the cycles
for it at present.  Maybe a TODO is in order.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Large number of open(2) calls with bulk INSERT into empty table

From
Andres Freund
Date:
On Tuesday, December 06, 2011 08:53:42 PM Robert Haas wrote:
> On Tue, Dec 6, 2011 at 7:12 AM, Florian Weimer <fweimer@bfk.de> wrote:
> > * Robert Haas:
> >> I tried whacking out the call to GetPageWithFreeSpace() in
> >> RelationGetBufferForTuple(), and also with the unpatched code, but the
> >> run-to-run randomness was way more than any difference the change
> >> made.  Is there a better test case?
> > 
> > I think that if you want to exercise file system lookup performance, you
> > need a larger directory, which presumably means a large number of
> > tables.
> 
> OK.  I created 100,000 dummy tables, 10,000 at a time avoid blowing up
> the lock manager.  I then repeated my previous tests, and I still
> can't see any meaningful difference (on my MacBook Pro, running MacOS
> X v10.6.8).  So at least on this OS, it doesn't seem to matter much.
> I'm inclined to defer putting any more work into it until such time as
> someone can demonstrate that it actually causes a problem and provides
> a reproducible test case.  I don't deny that there's probably an
> effect and it would be nice to improve this, but it doesn't seem worth
> spending a lot of time on until we can find a case where the effect is
> measurable.
I think if at all youre going to notice differences at a high concurrency 
because you then would start to hit the price of synchronizing the dcache 
between cpu cores in the kernel.

Andres


Re: Large number of open(2) calls with bulk INSERT into empty table

From
Robert Haas
Date:
On Tue, Dec 6, 2011 at 8:12 PM, Andres Freund <andres@anarazel.de> wrote:
> On Tuesday, December 06, 2011 08:53:42 PM Robert Haas wrote:
>> On Tue, Dec 6, 2011 at 7:12 AM, Florian Weimer <fweimer@bfk.de> wrote:
>> > * Robert Haas:
>> >> I tried whacking out the call to GetPageWithFreeSpace() in
>> >> RelationGetBufferForTuple(), and also with the unpatched code, but the
>> >> run-to-run randomness was way more than any difference the change
>> >> made.  Is there a better test case?
>> >
>> > I think that if you want to exercise file system lookup performance, you
>> > need a larger directory, which presumably means a large number of
>> > tables.
>>
>> OK.  I created 100,000 dummy tables, 10,000 at a time avoid blowing up
>> the lock manager.  I then repeated my previous tests, and I still
>> can't see any meaningful difference (on my MacBook Pro, running MacOS
>> X v10.6.8).  So at least on this OS, it doesn't seem to matter much.
>> I'm inclined to defer putting any more work into it until such time as
>> someone can demonstrate that it actually causes a problem and provides
>> a reproducible test case.  I don't deny that there's probably an
>> effect and it would be nice to improve this, but it doesn't seem worth
>> spending a lot of time on until we can find a case where the effect is
>> measurable.
> I think if at all youre going to notice differences at a high concurrency
> because you then would start to hit the price of synchronizing the dcache
> between cpu cores in the kernel.

Well, if the premise is that the table has been truncated in the same
transaction, then it's going to be tough to get high concurrency.
Maybe you could do it with multiple tables or with without truncation,
but either way I think you're going to be primarily limited by I/O
bandwidth or WALInsertLock contention, not kernel dcache
synchronization.  I might be wrong, of course, but that's what I
think.  I'm not saying this isn't worth improving, just that I don't
see it as a priority for me personally to spend time on right now.  If
you or someone else wants to beat on it, or even just come up with a
test case, great!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Large number of open(2) calls with bulk INSERT into empty table

From
Bruce Momjian
Date:
A TODO for this?

---------------------------------------------------------------------------

On Tue, Dec  6, 2011 at 02:53:42PM -0500, Robert Haas wrote:
> On Tue, Dec 6, 2011 at 7:12 AM, Florian Weimer <fweimer@bfk.de> wrote:
> > * Robert Haas:
> >
> >> I tried whacking out the call to GetPageWithFreeSpace() in
> >> RelationGetBufferForTuple(), and also with the unpatched code, but the
> >> run-to-run randomness was way more than any difference the change
> >> made.  Is there a better test case?
> >
> > I think that if you want to exercise file system lookup performance, you
> > need a larger directory, which presumably means a large number of
> > tables.
> 
> OK.  I created 100,000 dummy tables, 10,000 at a time avoid blowing up
> the lock manager.  I then repeated my previous tests, and I still
> can't see any meaningful difference (on my MacBook Pro, running MacOS
> X v10.6.8).  So at least on this OS, it doesn't seem to matter much.
> I'm inclined to defer putting any more work into it until such time as
> someone can demonstrate that it actually causes a problem and provides
> a reproducible test case.  I don't deny that there's probably an
> effect and it would be nice to improve this, but it doesn't seem worth
> spending a lot of time on until we can find a case where the effect is
> measurable.
> 
> On the other hand, the problem of the FSM taking up 24kB for an 8kB
> table seems clearly worth fixing, but I don't think I have the cycles
> for it at present.  Maybe a TODO is in order.
> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +



Re: Large number of open(2) calls with bulk INSERT into empty table

From
Robert Haas
Date:
On Thu, Aug 16, 2012 at 9:05 PM, Bruce Momjian <bruce@momjian.us> wrote:
> A TODO for this?

You mean this part?

>> On the other hand, the problem of the FSM taking up 24kB for an 8kB
>> table seems clearly worth fixing, but I don't think I have the cycles
>> for it at present.  Maybe a TODO is in order.

I certainly think that'd be worth a TODO.  Whether the rest of this is
worth worrying about I'm not sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Large number of open(2) calls with bulk INSERT into empty table

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
>>> On the other hand, the problem of the FSM taking up 24kB for an 8kB
>>> table seems clearly worth fixing, but I don't think I have the cycles
>>> for it at present.  Maybe a TODO is in order.

> I certainly think that'd be worth a TODO.  Whether the rest of this is
> worth worrying about I'm not sure.

Surely we could just prevent creation of the FSM until the table has
reached at least, say, 10 blocks.

Any threshold beyond one block would mean potential space wastage,
but it's hard to get excited about that until you're into the dozens
of pages.
        regards, tom lane



Re: Large number of open(2) calls with bulk INSERT into empty table

From
Robert Haas
Date:
On Mon, Aug 20, 2012 at 4:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>>>> On the other hand, the problem of the FSM taking up 24kB for an 8kB
>>>> table seems clearly worth fixing, but I don't think I have the cycles
>>>> for it at present.  Maybe a TODO is in order.
>
>> I certainly think that'd be worth a TODO.  Whether the rest of this is
>> worth worrying about I'm not sure.
>
> Surely we could just prevent creation of the FSM until the table has
> reached at least, say, 10 blocks.
>
> Any threshold beyond one block would mean potential space wastage,
> but it's hard to get excited about that until you're into the dozens
> of pages.

I dunno, I think one-row tables are pretty common.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Large number of open(2) calls with bulk INSERT into empty table

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Aug 20, 2012 at 4:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Surely we could just prevent creation of the FSM until the table has
>> reached at least, say, 10 blocks.
>> 
>> Any threshold beyond one block would mean potential space wastage,
>> but it's hard to get excited about that until you're into the dozens
>> of pages.

> I dunno, I think one-row tables are pretty common.

Sure, and for that you don't need an FSM, because any row allocation
attempt will default to trying the last existing block before it extends
(see RelationGetBufferForTuple).  It's only once you've got more than
one block in the table that it becomes interesting.

If we had a convention that FSM is only created for rels of more than
N blocks, perhaps it'd be worthwhile to teach RelationGetBufferForTuple
to try all existing blocks when relation size <= N.  Or equivalently,
hack the FSM code to return all N pages when it has no info.
        regards, tom lane



Re: Large number of open(2) calls with bulk INSERT into empty table

From
Robert Haas
Date:
On Mon, Aug 20, 2012 at 6:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Aug 20, 2012 at 4:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Surely we could just prevent creation of the FSM until the table has
>>> reached at least, say, 10 blocks.
>>>
>>> Any threshold beyond one block would mean potential space wastage,
>>> but it's hard to get excited about that until you're into the dozens
>>> of pages.
>
>> I dunno, I think one-row tables are pretty common.
>
> Sure, and for that you don't need an FSM, because any row allocation
> attempt will default to trying the last existing block before it extends
> (see RelationGetBufferForTuple).  It's only once you've got more than
> one block in the table that it becomes interesting.
>
> If we had a convention that FSM is only created for rels of more than
> N blocks, perhaps it'd be worthwhile to teach RelationGetBufferForTuple
> to try all existing blocks when relation size <= N.  Or equivalently,
> hack the FSM code to return all N pages when it has no info.

Now that's an idea I could get behind.  I'd pick a smaller value of N
than what you suggested (10), perhaps 5.  But I like it otherwise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Large number of open(2) calls with bulk INSERT into empty table

From
Bruce Momjian
Date:
On Tue, Aug 21, 2012 at 09:52:02AM -0400, Robert Haas wrote:
> On Mon, Aug 20, 2012 at 6:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> On Mon, Aug 20, 2012 at 4:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>> Surely we could just prevent creation of the FSM until the table has
> >>> reached at least, say, 10 blocks.
> >>>
> >>> Any threshold beyond one block would mean potential space wastage,
> >>> but it's hard to get excited about that until you're into the dozens
> >>> of pages.
> >
> >> I dunno, I think one-row tables are pretty common.
> >
> > Sure, and for that you don't need an FSM, because any row allocation
> > attempt will default to trying the last existing block before it extends
> > (see RelationGetBufferForTuple).  It's only once you've got more than
> > one block in the table that it becomes interesting.
> >
> > If we had a convention that FSM is only created for rels of more than
> > N blocks, perhaps it'd be worthwhile to teach RelationGetBufferForTuple
> > to try all existing blocks when relation size <= N.  Or equivalently,
> > hack the FSM code to return all N pages when it has no info.
> 
> Now that's an idea I could get behind.  I'd pick a smaller value of N
> than what you suggested (10), perhaps 5.  But I like it otherwise.

TODO added:
Avoid creation of the free space map for small tables
http://archives.postgresql.org/pgsql-hackers/2011-11/msg01751.php
http://archives.postgresql.org/pgsql-hackers/2012-08/msg00552.php
 

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +