Re: Bunching "transactions" - Mailing list pgsql-performance
From | Chris Browne |
---|---|
Subject | Re: Bunching "transactions" |
Date | |
Msg-id | 60lk9q9b2h.fsf@dba2.int.libertyrms.com Whole thread Raw |
In response to | Bunching "transactions" (Jean-David Beyer <jeandavid8@verizon.net>) |
Responses |
Re: Bunching "transactions"
|
List | pgsql-performance |
jeandavid8@verizon.net (Jean-David Beyer) writes: > Chris Browne wrote: >> jeandavid8@verizon.net (Jean-David Beyer) writes: >>> But what is the limitation on such a thing? In this case, I am just >>> populating the database and there are no other users at such a time. I am >>> willing to lose the whole insert of a file if something goes wrong -- I >>> would fix whatever went wrong and start over anyway. >>> >>> But at some point, disk IO would have to be done. Is this just a function of >>> how big /pgsql/data/postgresql.conf's shared_buffers is set to? Or does it >>> have to do with wal_buffers and checkpoint_segments? >> >> I have done bulk data loads where I was typically loading hundreds of >> thousands of rows in as a single transaction, and it is worth >> observing that loading in data from a pg_dump will do exactly the same >> thing, where, in general, each table's data is loaded as a single >> transaction. > > I guess a reasonable standard of performance would be that if my initial > population of the database takes only a little longer than a restore of the > database using pg_restore, I am pretty close, and that is good enough. Of > course, the restore depends on how fast my tape drive can pull the tape -- > it claims up to 12 MB/sec transfer rate, so it looks as though it will be > tape-limited rather than postgreSQL-limited. That's quite possible. There is a further factor, which is that grouping things into larger transactions has very clearly diminishing returns. Supposing you have a stream of 50,000 operations updating one tuple (those could be UPDATE, DELETE, or INSERT; it is not, at first order, material what sort they are), then the effects of grouping are thus... - With none... Cost = cost of doing 50,000 updates + cost of doing 50,000 COMMITs - If you COMMIT after every 2 updates Cost = cost of doing 50,000 updates + cost of doing 25,000 COMMITs - If you COMMIT after every 10 updates Cost = cost of doing 50,000 updates + cost of doing 5,000 COMMITs - If you COMMIT after every 100 updates Cost = cost of doing 50,000 updates + cost of doing 500 COMMITs The amount of work that COMMIT does is fairly much constant, regardless of the number of updates in the transaction, so that the cost, in that equation, of COMMITs pretty quickly evaporates to irrelevancy. And increasing the sizes of the transactions does not give you *increasing* performance improvements; the improvements will tend to decline. I wouldn't worry about trying to strictly minimize the number of transactions COMMITted; once you have grouped "enough" data into one transaction, that should be good enough. Further, the Right Thing is to group related data together, and come up with a policy that is driven primarily by the need for data consistency. If things work well enough, then don't go off trying to optimize something that doesn't really need optimization, and perhaps break the logic of the application. -- output = ("cbbrowne" "@" "acm.org") http://cbbrowne.com/info/unix.html Users should cultivate an ability to make the simplest molehill into a mountain by finding controversial interpretations of innocuous sounding statements that the sender never intended or imagined. -- from the Symbolics Guidelines for Sending Mail
pgsql-performance by date: