Thread: One long transaction or multiple short transactions?
We have a system which is constantly importing flat file data feeds into normalized tables in a DB warehouse over 10-20 connections. Each data feed row results in a single transaction of multiple single row writes to multiple normalized tables.
The more columns in the feed row, the more write operations, and the longer the transaction.
Operators are noticing that splitting a single feed of, say, 100 columns into two consecutive feeds of 50 columns improves performance dramatically. I am wondering whether the multi-threaded and very busy import environment causes non-linear performance degradation for longer transactions. Would the operators be advised to rewrite the feeds to produce more, smaller transactions rather than fewer, longer ones?
Carlo
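For concreteness, a minimal sketch of the import pattern described above: one feed row becomes a single transaction of several single-row inserts into normalized tables. The schema, table names, and column split here are hypothetical, and sqlite3 (Python stdlib) stands in for PostgreSQL purely so the sketch is runnable; the production importer would use a Postgres driver.

```python
import sqlite3

def import_feed_row(conn, row):
    # One feed row -> one transaction of several single-row inserts,
    # one per normalized table. Table and column names are hypothetical;
    # the real schema is the client's.
    try:
        conn.execute("INSERT INTO person (name) VALUES (?)", (row["name"],))
        person_id = conn.execute("SELECT last_insert_rowid()").fetchone()[0]
        conn.execute("INSERT INTO address (person_id, city) VALUES (?, ?)",
                     (person_id, row["city"]))
        conn.execute("INSERT INTO phone (person_id, number) VALUES (?, ?)",
                     (person_id, row["phone"]))
        conn.commit()   # the transaction is as long as the column count demands
    except Exception:
        conn.rollback()
        raise

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE address (person_id INTEGER, city TEXT);
    CREATE TABLE phone   (person_id INTEGER, number TEXT);
""")
import_feed_row(conn, {"name": "Ada", "city": "Toronto", "phone": "555-0100"})
```

Splitting a feed in two, as the operators did, amounts to two calls like this per row, each carrying half the columns and committing on its own.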
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Carlo
Sent: Monday, October 05, 2015 11:11 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] One long transaction or multiple short transactions?
> over 10-20 connections
How many cores do you have on that machine?
Test whether limiting the number of simultaneous feeds, e.g. bringing them down to half of your normal connection count, has the same positive effect.
Regards,
Igor Neyman
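One way to run Igor's experiment without touching the feed logic itself is to gate the existing workers behind a semaphore so only a fixed number of feeds import at once. A minimal, hypothetical sketch (the DB work is elided; names are invented for illustration):

```python
import threading

MAX_CONCURRENT_FEEDS = 8          # hypothetical: half of the usual 10-20
feed_slots = threading.Semaphore(MAX_CONCURRENT_FEEDS)
done, done_lock = [], threading.Lock()

def run_feed(feed_id):
    with feed_slots:              # blocks until one of the N slots frees up
        # ... open a DB connection and import this feed here ...
        with done_lock:
            done.append(feed_id)

threads = [threading.Thread(target=run_feed, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Varying `MAX_CONCURRENT_FEEDS` and timing the total run gives the comparison Igor suggests.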
> How many cores do you have on that machine?
> Test if limiting number of simultaneous feeds, like bringing their number down to half of your normal connections has the same positive effect.
I am told 32 cores on a Linux VM. The operators have tried limiting the number of threads. They feel that the number of connections is optimal. However, under the same conditions they noticed a sizable boost in performance if the same import was split into two successive imports which had shorter transactions.
I am just looking to see if there is any reason to think that lock contention (or anything else) over longer vs. shorter single-row-write transactions under the same conditions might explain this.
Carlo
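To illustrate why transaction length can matter disproportionately under concurrency, here is a small stand-alone simulation: a `threading.Lock` plays the role of any single contended resource, and a sleep plays the role of the transaction body. This is only an analogy, not a model of PostgreSQL's actual locking.

```python
import threading
import time

def throughput(n_workers, hold_time, duration=0.5):
    # Count how many times n_workers manage to acquire one shared lock
    # within `duration` seconds, when each acquisition holds the lock
    # for hold_time seconds (the stand-in for the transaction body).
    shared = threading.Lock()
    count = 0
    count_lock = threading.Lock()
    deadline = time.monotonic() + duration

    def work():
        nonlocal count
        while time.monotonic() < deadline:
            with shared:
                time.sleep(hold_time)
            with count_lock:
                count += 1

    threads = [threading.Thread(target=work) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return count

short_txns = throughput(8, 0.001)   # many short "transactions"
long_txns = throughput(8, 0.010)    # fewer long "transactions"
```

With every worker queuing behind the lock holder, total throughput tracks the critical-section length, which is consistent with shorter transactions winning in a busy import environment.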
From: Igor Neyman [mailto:ineyman@perceptron.com]
Sent: October 6, 2015 9:10 AM
To: Carlo; pgsql-performance@postgresql.org
Subject: RE: [PERFORM] One long transaction or multiple short transactions?
Sounds like a locking problem, but assuming you aren't Sherlock Holmes and simply want to get the thing working as soon as possible: stick a fast SSD in there (whether you stay on a VM or go physical). If you have enough I/O, you may be able to solve the problem with brute force. SSDs are a lot cheaper than your time.

Suggest you forward this to your operators: a talk I have about optimising multi-threaded work in Postgres: http://graemebell.net/foss4gcomo.pdf (slides: "Input/Output" in the middle of the talk, and also the slides at the end labelled "For Techies").

Graeme Bell

p.s. You mentioned a VM. Consider making the machine physical rather than a VM. You'll get a performance boost and remove the risk of DB corruption from untrustworthy VM fsyncs. One day there will be a power cut or O/S crash during these writes, and with a VM you have a reasonable chance of nuking your DB, because virtualised VM storage often doesn't honour fsync (for performance reasons), yet fsync is fundamental to the correct operation of PG.

> On 08 Oct 2015, at 01:40, Carlo <reg01@stonebanks.ca> wrote:
>
> I am told 32 cores on a Linux VM. The operators have tried limiting the number of threads. They feel that the number of connections is optimal. However, under the same conditions they noticed a sizable boost in performance if the same import was split into two successive imports which had shorter transactions.
>> Sounds like a locking problem

This is what I am trying to get at. The reason that I am not addressing hardware or OS configuration concerns is that this is not my environment, but my client's. The client is running my import software and has a choice of how long the transactions can be. They are going for long transactions, and I am trying to determine whether there is a penalty for single long transactions over a configuration which would allow for more successive short transactions. (Keep in mind all reads and writes are single-row.)

There are other people working on hardware and OS configuration, and that's why I don't want to get into a general optimization discussion; the client is concerned with just this question.

-----Original Message-----
From: Graeme B. Bell [mailto:graeme.bell@nibio.no]
Sent: October 8, 2015 4:55 AM
To: Carlo
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] One long transaction or multiple short transactions?
On Thu, Oct 08, 2015 at 11:08:55AM -0400, Carlo wrote:
> I am trying to determine whether there is a penalty for single long
> transactions over a configuration which would allow for more successive
> short transactions. (Keep in mind all reads and writes are single-row.)

Hi Carlo,

Since the reads/writes are basically independent, which is what I take your "single-row" comment to mean, by batching them you are balancing two opposing factors. First, larger batches let you consolidate I/O and other resource requests, making them more efficient per row. Second, larger batches require more locking as the number of rows updated grows. It may very well be the case that the system can process two half-size batches more quickly than a single batch of twice the size.

Regards,
Ken
-----Original Message-----
From: ktm@rice.edu [mailto:ktm@rice.edu]
Sent: October 8, 2015 1:00 PM
To: Carlo
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] One long transaction or multiple short transactions?

On October 8, 2015 1:00 PM Ken wrote:
> Larger batches allow you to consolidate I/O and other resource requests to make them more efficient per row; larger batches also require more locking as the number of rows updated grows. It may very well be the case that by halving your batch size the system can process them more quickly than a single batch twice the size.

Just to clarify: one transaction of this type may take longer to commit than two successive transactions of half the size?
On Thu, Oct 08, 2015 at 05:43:11PM -0400, Carlo wrote:
> Just to clarify: one transaction of this type may take longer to commit
> than two successive transactions of half the size?

Yes, but where the optimum batch size lies should be determined by testing: vary the batch size and note where performance reaches its maximum.

Regards,
Ken
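Ken's advice can be mechanized with a small harness that sweeps batch sizes (rows per commit, as a stand-in for transaction size) and times each run. sqlite3 stands in for the real warehouse so the sketch runs anywhere; the absolute numbers are meaningless, and only the shape of the curve measured on the real system matters.

```python
import sqlite3
import time

def time_batch_size(batch_size, total_rows=2000):
    # Import total_rows rows, committing every batch_size rows, and
    # return the elapsed wall time for the whole import.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE feed (a INTEGER, b TEXT)")
    start = time.perf_counter()
    for i in range(0, total_rows, batch_size):
        for j in range(i, min(i + batch_size, total_rows)):
            conn.execute("INSERT INTO feed VALUES (?, ?)", (j, "x" * 20))
        conn.commit()
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed

timings = {size: time_batch_size(size) for size in (1, 10, 100, 1000)}
best = min(timings, key=timings.get)   # batch size with the lowest wall time
```

Run against the production schema under production concurrency, the `timings` curve shows where the optimum sits between "one long transaction" and "many short ones".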
I don't think inserts can cause contention on the server. Inserts do not lock tables during the transaction. You may have contention on a sequence, but it won't vary with transaction size.
Have you checked the resource usage (CPU, memory) on the client side?
How do you insert rows? Do you use the plain Postgres API?
Regards,
Laurent
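On Laurent's first question, here is one quick way to snapshot client-side resource usage from inside a Python importer. This is hypothetical tooling (the real importer may be in another language, where the equivalent getrusage call applies), and the `resource` module is Unix-only.

```python
import resource   # Unix-only; Windows would need a different approach

def client_usage():
    # CPU seconds and peak RSS of this importer process so far - a
    # quick way to check whether the client, rather than the server,
    # is the bottleneck during an import.
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return {"user_s": ru.ru_utime, "system_s": ru.ru_stime,
            "peak_rss_kb": ru.ru_maxrss}   # KB on Linux, bytes on macOS

before = client_usage()
sum(i * i for i in range(200_000))   # stand-in for one import batch
after = client_usage()
cpu_burned = after["user_s"] - before["user_s"]
```

If `cpu_burned` approaches the batch's wall-clock time, the client is CPU-bound and shortening server-side transactions won't help much.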
> I don't think inserts can cause contention on the server. Inserts do not lock tables during the transaction. You may have contention on a sequence, but it won't vary with transaction size.

Perhaps there could be a trigger on inserts which creates some lock contention?
On 10/9/15 3:33 AM, Graeme B. Bell wrote:
>> I don't think inserts can cause contention on the server. Inserts do not lock tables during the transaction. You may have contention on a sequence, but it won't vary with transaction size.
>
> Perhaps there could be a trigger on inserts which creates some lock contention?

Except inserts *do* take a lot of locks, just not user-level locks: finding a page to insert into, seeing if that page is in shared buffers, loading the page into shared buffers, modifying a shared buffer, getting the relation extension lock if you need to add a new page. Then there's a whole pile of additional locking you could be looking at for inserting into any indexes.

Now, most of the locks I described up there are transaction-aware, but there's other things happening at a transaction level that could alter that locking. So it wouldn't surprise me if you're seeing radically different behavior based on transaction duration.

Also, it sounds like perhaps longer transactions are involving more tables? Is this a star schema you're dealing with?

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 2015-10-17 10:26:01 -0500, Jim Nasby wrote:
> Except inserts *do* take a lot of locks, just not user-level locks.
> [...]
> Now, most of the locks I described up there are transaction-aware

Missing *not*?
On 10/17/15 12:13 PM, Andres Freund wrote:
>> Now, most of the locks I described up there are transaction-aware
>
> Missing *not*?

Oops. Yes, they're *not* transaction-aware.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX