Thread: Large number of open(2) calls with bulk INSERT into empty table
I noticed that a bulk INSERT into an empty table (which has been
TRUNCATEd in the same transaction, for good measure) results in a
curious number of open(2) calls for the FSM resource fork:

open("base/657862/16554373_fsm", O_RDWR) = -1 ENOENT (No such file or directory)
lseek(17, 0, SEEK_END) = 407584768
write(17, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
lseek(18, 0, SEEK_END) = 333119488
write(18, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
lseek(18, 0, SEEK_END) = 333127680
write(18, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
open("base/657862/16554373_fsm", O_RDWR) = -1 ENOENT (No such file or directory)
lseek(17, 0, SEEK_END) = 407592960
write(17, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
lseek(18, 0, SEEK_END) = 333135872
write(18, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
open("base/657862/16554373_fsm", O_RDWR) = -1 ENOENT (No such file or directory)
lseek(17, 0, SEEK_END) = 407601152
write(17, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
lseek(18, 0, SEEK_END) = 333144064
write(18, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
lseek(18, 0, SEEK_END) = 333152256
write(18, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
open("base/657862/16554373_fsm", O_RDWR) = -1 ENOENT (No such file or directory)

I'm not sure whether this results in a significant performance hit on
Linux (because the dentry cache covers negative lookups, too), but I
suspect that it could be an issue on other systems.

This happens with PostgreSQL 9.1.0.

-- 
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
On Sun, Nov 27, 2011 at 10:24 AM, Florian Weimer <fweimer@bfk.de> wrote:
> I noticed that a bulk INSERT into an empty table (which has been
> TRUNCATEd in the same transaction, for good measure) results in a
> curious number of open(2) calls for the FSM resource fork:

That's kind of unfortunate.  It looks like every time we extend the
relation, we try to read the free space map to see whether there's a
block available with free space in it.  But since we never actually
make any entries in the free space map, the fork never gets created,
so every attempt to read it involves a system call to see whether it's
there.

I set up the following test case to try to measure the overhead on my
MacBook Pro:

create table bob (a integer, b text);

Then: pgbench -f foo -t 100, with the following contents for foo:

begin;
truncate bob;
insert into bob
  select g, random()::text||random()::text||random()::text||random()::text
  from generate_series(1,10000) g;
commit;

I tried it both with the call to GetPageWithFreeSpace() in
RelationGetBufferForTuple() whacked out and with the unpatched code,
but the run-to-run randomness was way more than any difference the
change made.  Is there a better test case?

I've had the thought before that maybe we should cache the size of
some limited number of relation forks in shared memory.  That would
potentially eliminate not only the open() calls but also the lseek()
calls.  The trouble is, to get any benefit from such a change, we'd
need to have a userspace cache which was at least as concurrent as
what the kernel implements.  We're currently well behind the Linux
kernel in terms of synchronization techniques, so that would represent
a considerable investment of time and energy.

In this particular case, it seems like there's probably some way to be
smarter.  If we knew that the relation was created or truncated in the
current transaction, and we knew that we hadn't created the free space
map for it, we could presumably deduce that it still doesn't exist.
Not sure exactly how to make that work, though, and it doesn't solve
the more general problem where you create in one transaction and then
insert in the next.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
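[Editor's note: the "remember that the fork doesn't exist" idea above can
be sketched as a small negative-lookup cache.  This is an illustrative
Python model, not backend code — the ForkCache name and its methods are
hypothetical, and a real implementation would need the cross-backend
invalidation that makes the idea hard outside the created-or-truncated-
in-this-transaction case.]

```python
import os

class ForkCache:
    """Illustrative negative-lookup cache for relation fork files.

    Remembers which fork files are known to be absent, so repeated
    probes skip the failed open(2)/stat(2) entirely.
    """
    def __init__(self):
        self._known_absent = set()

    def fork_exists(self, path):
        if path in self._known_absent:
            return False              # cached negative result: no syscall
        if os.path.exists(path):      # one real filesystem probe
            return True
        self._known_absent.add(path)  # cache the miss
        return False

    def invalidate(self, path):
        # Must be called when the fork is created (e.g. on the first
        # FSM entry); otherwise the cache returns a stale negative.
        self._known_absent.discard(path)
```

The stale-negative case is exactly why this only obviously works when the
relation was created or truncated in the current transaction: no other
backend can have created the fork behind our back in that window.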
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Nov 27, 2011 at 10:24 AM, Florian Weimer <fweimer@bfk.de> wrote:
>> I noticed that a bulk INSERT into an empty table (which has been
>> TRUNCATEd in the same transaction, for good measure) results in a
>> curious number of open(2) calls for the FSM resource fork:

> That's kind of unfortunate.  It looks like every time we extend the
> relation, we try to read the free space map to see whether there's a
> block available with free space in it.  But since we never actually
> make any entries in the free space map, the fork never gets created,
> so every attempt to read it involves a system call to see whether it's
> there.

I wonder whether it'd help if we went ahead and created the FSM file,
with length zero, as soon as the relation is made (or maybe when it
first becomes of nonzero length).  That would at least save the failed
open()s.  We'd still be doing lseeks on the FSM file, but those ought
to be cheaper.

A less shaky way to do it would be to just create the first page of the
FSM file immediately, but that would represent an annoying percentage
increase in the disk space needed for small tables.  Or maybe we could
compromise and do that once the rel reaches N pages, for some N.

			regards, tom lane
On Wed, Nov 30, 2011 at 12:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wonder whether it'd help if we went ahead and created the FSM file,
> with length zero, as soon as the relation is made (or maybe when it
> first becomes of nonzero length).  That would at least save the failed
> open()s.  We'd still be doing lseeks on the FSM file, but those ought
> to be cheaper.
>
> A less shaky way to do it would be to just create the first page of the
> FSM file immediately, but that would represent an annoying percentage
> increase in the disk space needed for small tables.

Well, unfortunately, we're not really doing a good job dodging that
problem as it is.  For example:

rhaas=# create table foo (a int);
CREATE TABLE
rhaas=# select pg_relation_size('foo'), pg_table_size('foo');
 pg_relation_size | pg_table_size
------------------+---------------
                0 |             0
(1 row)

rhaas=# insert into foo values (1);
INSERT 0 1
rhaas=# select pg_relation_size('foo'), pg_table_size('foo');
 pg_relation_size | pg_table_size
------------------+---------------
             8192 |          8192
(1 row)

rhaas=# vacuum foo;
VACUUM
rhaas=# select pg_relation_size('foo'), pg_table_size('foo');
 pg_relation_size | pg_table_size
------------------+---------------
             8192 |         40960
(1 row)

Yikes!  A table with 4 bytes of useful data is consuming 40kB on disk
- 8kB in the main fork, 8kB in the VM fork, and 24kB in the FSM fork.
Ouch!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
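[Editor's note: the 40960 bytes above decompose exactly into five 8kB
blocks across the three forks; a quick sanity check of the arithmetic,
with 8192 being PostgreSQL's default BLCKSZ:]

```python
BLCKSZ = 8192                  # default PostgreSQL block size

main_fork = 1 * BLCKSZ         # one heap page holding the single row
vm_fork   = 1 * BLCKSZ         # one visibility-map page, created by VACUUM
fsm_fork  = 3 * BLCKSZ         # three free-space-map pages

total = main_fork + vm_fork + fsm_fork
print(total)                   # 40960, matching pg_table_size('foo')
print(4 / total)               # ~0.01% of the on-disk footprint is payload
```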
Robert Haas <robertmhaas@gmail.com> writes:
> Yikes!  A table with 4 bytes of useful data is consuming 40kB on disk
> - 8kB in the main fork, 8kB in the VM fork, and 24kB in the FSM fork.
> Ouch!

Yeah, ouch.  Why is the FSM fork eating so much space --- I'd have
expected 8k there, but 24?

Also, if VACUUM is going to cause the FSM to be created anyway, there
may not be a lot of point to refraining from creating the first page
right away.

			regards, tom lane
On 30.11.2011 20:45, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Yikes!  A table with 4 bytes of useful data is consuming 40kB on disk
>> - 8kB in the main fork, 8kB in the VM fork, and 24kB in the FSM fork.
>> Ouch!
>
> Yeah, ouch.  Why is the FSM fork eating so much space --- I'd have
> expected 8k there, but 24?

The FSM is a three-level tree (with the default BLCKSZ), and the code
creates all three levels right from the start.  That keeps the
addressing simple: you can easily calculate the location of the FSM
entry for a given page.  I tried to come up with an addressing scheme
that would allow adding upper-level pages only as needed, but failed.
If you have any ideas on that, I'm all ears.  I have a feeling that
there's got to be some well-known scheme that does that, but I didn't
find one.  The current addressing scheme is documented in
storage/freespace/README, in the section "Higher-Level Structure".

> Also, if VACUUM is going to cause the FSM to be created anyway, there
> may not be a lot of point to refraining from creating the first page
> right away.

Perhaps we should refrain from creating an FSM unless the table is
larger than one block.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
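[Editor's note: a rough model of why a fixed-shape tree keeps the
addressing simple.  Each FSM page holds on the order of a few thousand
leaf slots; the 4000 used below is an illustrative stand-in, not the
real SlotsPerFSMPage constant (which is derived from BLCKSZ and page
header sizes).  With a fixed three-level shape, the leaf slot for any
heap block is found by pure arithmetic, and the minimum on-disk
footprint is one page per level:]

```python
SLOTS_PER_PAGE = 4000   # illustrative; real value depends on BLCKSZ
BLCKSZ = 8192
FSM_LEVELS = 3

def fsm_leaf_location(heap_blk, slots=SLOTS_PER_PAGE):
    """Map a heap block number to (leaf page within leaf level, slot).

    No search through the upper levels is needed: because the tree
    shape is fixed, the position is a pure function of the block number.
    """
    return heap_blk // slots, heap_blk % slots

# Capacity of the fixed three-level tree, and the fixed startup cost
# that Tom is reacting to: one page per level even for a one-row table.
max_addressable_blocks = SLOTS_PER_PAGE ** FSM_LEVELS
min_fsm_bytes = FSM_LEVELS * BLCKSZ
print(min_fsm_bytes)    # 24576 — the 24kB FSM fork
```

A variable-depth tree would shrink that minimum, but then the page
holding a given heap block's entry would depend on the current depth,
which is the addressing complication Heikki describes failing to avoid.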
* Robert Haas:

> I tried whacking out the call to GetPageWithFreeSpace() in
> RelationGetBufferForTuple(), and also with the unpatched code, but the
> run-to-run randomness was way more than any difference the change
> made.  Is there a better test case?

I think that if you want to exercise file system lookup performance, you
need a larger directory, which presumably means a large number of
tables.

-- 
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
On Tue, Dec 6, 2011 at 7:12 AM, Florian Weimer <fweimer@bfk.de> wrote:
> * Robert Haas:
>
>> I tried whacking out the call to GetPageWithFreeSpace() in
>> RelationGetBufferForTuple(), and also with the unpatched code, but the
>> run-to-run randomness was way more than any difference the change
>> made.  Is there a better test case?
>
> I think that if you want to exercise file system lookup performance, you
> need a larger directory, which presumably means a large number of
> tables.

OK.  I created 100,000 dummy tables, 10,000 at a time to avoid blowing
up the lock manager.  I then repeated my previous tests, and I still
can't see any meaningful difference (on my MacBook Pro, running MacOS
X v10.6.8).  So at least on this OS, it doesn't seem to matter much.
I'm inclined to defer putting any more work into it until such time as
someone can demonstrate that it actually causes a problem and provides
a reproducible test case.  I don't deny that there's probably an
effect and it would be nice to improve this, but it doesn't seem worth
spending a lot of time on until we can find a case where the effect is
measurable.

On the other hand, the problem of the FSM taking up 24kB for an 8kB
table seems clearly worth fixing, but I don't think I have the cycles
for it at present.  Maybe a TODO is in order.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tuesday, December 06, 2011 08:53:42 PM Robert Haas wrote:
> OK.  I created 100,000 dummy tables, 10,000 at a time to avoid blowing
> up the lock manager.  I then repeated my previous tests, and I still
> can't see any meaningful difference (on my MacBook Pro, running MacOS
> X v10.6.8).  So at least on this OS, it doesn't seem to matter much.
> I'm inclined to defer putting any more work into it until such time as
> someone can demonstrate that it actually causes a problem and provides
> a reproducible test case.  I don't deny that there's probably an
> effect and it would be nice to improve this, but it doesn't seem worth
> spending a lot of time on until we can find a case where the effect is
> measurable.

I think you're only going to notice differences at high concurrency,
because then you would start to hit the price of synchronizing the
dcache between CPU cores in the kernel.

Andres
On Tue, Dec 6, 2011 at 8:12 PM, Andres Freund <andres@anarazel.de> wrote:
> I think you're only going to notice differences at high concurrency,
> because then you would start to hit the price of synchronizing the
> dcache between CPU cores in the kernel.

Well, if the premise is that the table has been truncated in the same
transaction, then it's going to be tough to get high concurrency.
Maybe you could do it with multiple tables or without truncation, but
either way I think you're going to be primarily limited by I/O
bandwidth or WALInsertLock contention, not kernel dcache
synchronization.  I might be wrong, of course, but that's what I
think.

I'm not saying this isn't worth improving, just that I don't see it as
a priority for me personally to spend time on right now.  If you or
someone else wants to beat on it, or even just come up with a test
case, great!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
A TODO for this?

---------------------------------------------------------------------------

On Tue, Dec 6, 2011 at 02:53:42PM -0500, Robert Haas wrote:
> On Tue, Dec 6, 2011 at 7:12 AM, Florian Weimer <fweimer@bfk.de> wrote:
> > I think that if you want to exercise file system lookup performance, you
> > need a larger directory, which presumably means a large number of
> > tables.
>
> OK.  I created 100,000 dummy tables, 10,000 at a time to avoid blowing
> up the lock manager.  I then repeated my previous tests, and I still
> can't see any meaningful difference (on my MacBook Pro, running MacOS
> X v10.6.8).  So at least on this OS, it doesn't seem to matter much.
> I'm inclined to defer putting any more work into it until such time as
> someone can demonstrate that it actually causes a problem and provides
> a reproducible test case.  I don't deny that there's probably an
> effect and it would be nice to improve this, but it doesn't seem worth
> spending a lot of time on until we can find a case where the effect is
> measurable.
>
> On the other hand, the problem of the FSM taking up 24kB for an 8kB
> table seems clearly worth fixing, but I don't think I have the cycles
> for it at present.  Maybe a TODO is in order.

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ It's impossible for everything to be true. +
On Thu, Aug 16, 2012 at 9:05 PM, Bruce Momjian <bruce@momjian.us> wrote:
> A TODO for this?

You mean this part?

>> On the other hand, the problem of the FSM taking up 24kB for an 8kB
>> table seems clearly worth fixing, but I don't think I have the cycles
>> for it at present.  Maybe a TODO is in order.

I certainly think that'd be worth a TODO.  Whether the rest of this is
worth worrying about I'm not sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
>>> On the other hand, the problem of the FSM taking up 24kB for an 8kB
>>> table seems clearly worth fixing, but I don't think I have the cycles
>>> for it at present.  Maybe a TODO is in order.

> I certainly think that'd be worth a TODO.  Whether the rest of this is
> worth worrying about I'm not sure.

Surely we could just prevent creation of the FSM until the table has
reached at least, say, 10 blocks.

Any threshold beyond one block would mean potential space wastage,
but it's hard to get excited about that until you're into the dozens
of pages.

			regards, tom lane
On Mon, Aug 20, 2012 at 4:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Surely we could just prevent creation of the FSM until the table has
> reached at least, say, 10 blocks.
>
> Any threshold beyond one block would mean potential space wastage,
> but it's hard to get excited about that until you're into the dozens
> of pages.

I dunno, I think one-row tables are pretty common.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Aug 20, 2012 at 4:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Surely we could just prevent creation of the FSM until the table has
>> reached at least, say, 10 blocks.
>>
>> Any threshold beyond one block would mean potential space wastage,
>> but it's hard to get excited about that until you're into the dozens
>> of pages.

> I dunno, I think one-row tables are pretty common.

Sure, and for that you don't need an FSM, because any row allocation
attempt will default to trying the last existing block before it
extends (see RelationGetBufferForTuple).  It's only once you've got
more than one block in the table that it becomes interesting.

If we had a convention that the FSM is only created for rels of more
than N blocks, perhaps it'd be worthwhile to teach
RelationGetBufferForTuple to try all existing blocks when relation
size <= N.  Or equivalently, hack the FSM code to return all N pages
when it has no info.

			regards, tom lane
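[Editor's note: Tom's proposal can be modeled as a small change to the
page-selection loop.  The Python below is an illustrative sketch, not
PostgreSQL's actual C code (which must also deal with locking and
concurrent insertions); the function name, the fsm_lookup callback, and
the threshold value are stand-ins.]

```python
FSM_THRESHOLD = 10   # Tom's suggested N; Robert later argues for ~5

def candidate_pages(nblocks, fsm_exists, fsm_lookup):
    """Yield heap pages worth probing for free space, in order.

    Models the proposal: with no FSM and a small relation, just try
    every existing block (newest first, generalizing the existing
    'try the last block before extending' behavior); otherwise fall
    back to the old behavior or consult the FSM as usual.
    """
    if not fsm_exists:
        if nblocks <= FSM_THRESHOLD:
            # Small relation, no FSM yet: probe all existing blocks.
            yield from range(nblocks - 1, -1, -1)
        elif nblocks > 0:
            yield nblocks - 1      # old behavior: only the last block
        return
    page = fsm_lookup()            # normal FSM-driven selection
    if page is not None:
        yield page
```

The equivalent alternative Tom mentions — having the FSM code itself
return all N pages when it has no info — would move this loop inside the
FSM lookup instead of its caller, leaving RelationGetBufferForTuple
unchanged.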
On Mon, Aug 20, 2012 at 6:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Sure, and for that you don't need an FSM, because any row allocation
> attempt will default to trying the last existing block before it
> extends (see RelationGetBufferForTuple).  It's only once you've got
> more than one block in the table that it becomes interesting.
>
> If we had a convention that the FSM is only created for rels of more
> than N blocks, perhaps it'd be worthwhile to teach
> RelationGetBufferForTuple to try all existing blocks when relation
> size <= N.  Or equivalently, hack the FSM code to return all N pages
> when it has no info.

Now that's an idea I could get behind.  I'd pick a smaller value of N
than what you suggested (10), perhaps 5.  But I like it otherwise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Aug 21, 2012 at 09:52:02AM -0400, Robert Haas wrote:
> On Mon, Aug 20, 2012 at 6:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > If we had a convention that the FSM is only created for rels of more
> > than N blocks, perhaps it'd be worthwhile to teach
> > RelationGetBufferForTuple to try all existing blocks when relation
> > size <= N.  Or equivalently, hack the FSM code to return all N pages
> > when it has no info.
>
> Now that's an idea I could get behind.  I'd pick a smaller value of N
> than what you suggested (10), perhaps 5.  But I like it otherwise.

TODO added:

	Avoid creation of the free space map for small tables

	http://archives.postgresql.org/pgsql-hackers/2011-11/msg01751.php
	http://archives.postgresql.org/pgsql-hackers/2012-08/msg00552.php

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ It's impossible for everything to be true. +