Thread: COPY Performance
Hello,

We're using a statement like this to dump between 500K and >5 million rows.

COPY (SELECT SomeID FROM SomeTable WHERE SomeColumn > '0')
TO '/dev/shm/SomeFile.csv'

Upon first run, this operation can take several minutes. Upon second run, it will generally complete in well under a minute.

Some observations/thoughts:

-- When watching the file grow, it will "trickle" upon first run, increasing in size by only a few KB/s. Upon second run, it will grow by megabytes per second.
-- If we dump to disk, the same behavior is observed - no apparent RAM contention issues.
-- The SELECT by itself will typically complete in a few seconds, on either the first or second run.
-- Upon the first run, the PostgreSQL process typically consumes <10% CPU.
-- In very rare cases, we've seen even the first run be as fast as the second run.
-- There is no other activity on the server while performing these tests.
-- SomeID is just an int.
-- CPU usage (wait %) is low; it seems as though there is some other bottleneck.

Any thoughts on:

-- tuning parameters to speed up the initial execution
-- why it is so slow to begin with?
-- possible areas to check for bottlenecks?
-- a better method for accomplishing the same thing?

Thank you,

H
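P.S. In case it's useful, this is roughly how I've been comparing the bare SELECT against the COPY from psql (same placeholder names as above; just a sketch, not our exact script):

\timing

-- the SELECT alone typically finishes in a few seconds
SELECT count(*) FROM SomeTable WHERE SomeColumn > '0';

-- the COPY of the same rows is what trickles out on the first run
COPY (SELECT SomeID FROM SomeTable WHERE SomeColumn > '0')
TO '/dev/shm/SomeFile.csv';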
"Hans Zaunere" <lists@zaunere.com> writes: > We're using a statement like this to dump between 500K and >5 million rows. > COPY(SELECT SomeID FROM SomeTable WHERE SomeColumn > '0') > TO '/dev/shm/SomeFile.csv' > Upon first run, this operation can take several minutes. Upon second run, > it will be complete in generally well under a minute. Hmmm ... define "first" versus "second". What do you do to return it to the slow state? regards, tom lane
On Sun, May 4, 2008 at 5:11 PM, Hans Zaunere <lists@zaunere.com> wrote:
> Hello,
>
> We're using a statement like this to dump between 500K and >5 million rows.
>
> COPY(SELECT SomeID FROM SomeTable WHERE SomeColumn > '0')
> TO '/dev/shm/SomeFile.csv'
>
> Upon first run, this operation can take several minutes. Upon second run,
> it will be complete in generally well under a minute.

Almost certainly a buffering issue. The first time, it's reading the data into memory while also doing other things, file-system wise. The second time, it's already in memory (kernel cache) and zips right by.

What can you do? First you need to see what's really happening, which means learning how to drive vmstat, iostat, top, etc. to see what's going on on your machine. For starters, you'll likely want to look into something that will reduce contention on the database partition set: tablespaces, big RAID arrays (big meaning a lot of spindles), a battery-backed RAID controller.
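For the tablespace idea, it'd be something along these lines (path and names made up here, just to show the shape of it; note that moving the table physically rewrites it and holds an exclusive lock, so it's a maintenance-window job):

-- create a tablespace on a separate set of spindles
CREATE TABLESPACE fast_spindles LOCATION '/mnt/raid10/pgdata';

-- move the big table (or just an index) onto it
ALTER TABLE SomeTable SET TABLESPACE fast_spindles;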
> > We're using a statement like this to dump between 500K and >5
> > million rows.
> >
> > COPY(SELECT SomeID FROM SomeTable WHERE SomeColumn > '0')
> > TO '/dev/shm/SomeFile.csv'
> >
> > Upon first run, this operation can take several minutes. Upon
> > second run, it will be complete in generally well under a minute.
> >
>
> Almost certainly a buffering issue. First time it's reading the file
> into memory WHILE also doing other things, file system wise. Second
> time it's in memory (kernel cache) and zips right by.
>
> What can you do? First you need to see what's really happening, which
> means learning how to drive vmstat, iostat, top, etc to see what's
> happening on your machine. You'll likely want to look into doing
> something that will reduce contention on the database partition set
> for starters. Table spaces, big RAID arrays (big meaning a lot of
> spindles), battery backed RAID controller.

We do have some pretty fair hardware:

-- PowerEdge R900 with 132 GB and 16 cores
-- PowerVault MD1000 with 1.6 TB

I am getting more exact information from the NOC.

Contention might not be the issue. I ran a test last night with zero activity on the server, and the results were the same. In fact, even the second run took the same amount of time as the first run.

The queries, however, are always very simple - a single column equal to or greater than a single value, and typically the column is either an integer or a one-byte char. Would a clustered index or ordered table help here - partitioning - something to get sequential reads from the disk? I could experiment, but alas we only have this one production environment right now, so I'm looking for low-hanging fruit from prior experiences.

The table is about 600+ columns wide, with a total of >32 million rows. Would SELECTing into a temporary table first, and then doing the COPY(), make any difference?

Regards,

H
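P.S. To be concrete, the two things I'm thinking of trying look roughly like this (untested on our side; the index and staging-table names are invented):

-- physically order the table by the column we filter on
CREATE INDEX somecolumn_idx ON SomeTable (SomeColumn);
CLUSTER somecolumn_idx ON SomeTable;

-- or stage just the narrow result first, then dump it
CREATE TEMP TABLE dump_stage AS
SELECT SomeID FROM SomeTable WHERE SomeColumn > '0';

COPY dump_stage TO '/dev/shm/SomeFile.csv';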
> > We're using a statement like this to dump between 500K and >5 million
> > rows.
>
> > COPY(SELECT SomeID FROM SomeTable WHERE SomeColumn > '0')
> > TO '/dev/shm/SomeFile.csv'
>
> > Upon first run, this operation can take several minutes. Upon second
> > run, it will be complete in generally well under a minute.
>
> Hmmm ... define "first" versus "second". What do you do to return it
> to the slow state?

Interesting that you ask. I haven't found a very reliable way to reproduce this.

Typically, just waiting a while before running the same query again will reproduce this behavior. Restarting PostgreSQL reproduced it as well. However, I can't find a way to flush buffers/etc. to reproduce the problem on demand.

"First time" versus "second time" is usually simply defined by the temporary CSV file being created slowly once and then quickly thereafter, although frankly sometimes it's always slow, and sometimes seemingly a bit quicker, depending on the query. As I mentioned in my other post, the query is always of the simple form above - a single column greater than or equal to a single value (int or single-byte char).

We had set shared_buffers to 1028M, which at first seemed to make a significant difference. With some testing, various queries ran in under 20 seconds the "first time", and the second time at the same speed or a bit quicker. However, operations then restarted the system (and downgraded to Red Hat 5.0), and now we're back where we started. Even with the increased shared_buffers, things are slow - perhaps I'll try increasing it further.

Upon first starting PostgreSQL and executing the above statement, the process grows to about 1 GB and then stops. When it stops, the CSV file starts to "trickle" out to disk.

Best,

H
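P.S. After the restart I've been double-checking what the server actually came back up with, roughly like this (nothing fancy, just confirming the setting survived):

SHOW shared_buffers;

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'effective_cache_size');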
On Mon, May 5, 2008 at 6:18 AM, Hans Zaunere <lists@zaunere.com> wrote:
> > > We're using a statement like this to dump between 500K and >5 million
> > > rows.
> >
> > > COPY(SELECT SomeID FROM SomeTable WHERE SomeColumn > '0')
> > > TO '/dev/shm/SomeFile.csv'
> >
> > > Upon first run, this operation can take several minutes. Upon second
> > > run, it will be complete in generally well under a minute.
> >
> > Hmmm ... define "first" versus "second". What do you do to return it
> > to the slow state?
>
> Interesting that you ask. I haven't found a very reliable way to reproduce
> this.
>
> Typically, just waiting a while to run the same query the second time will
> reproduce this behavior. I restarted postgresql and it was reproduced as
> well. However, I can't find a way to flush buffers/etc, to reproduce the

What happens if you do something like:

select count(*) from (select ...);

i.e. don't make the .csv file each time. How's the performance without making the csv versus making it?
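(Note that the subselect needs an alias in PostgreSQL, so with the placeholder names from your original post it'd be roughly:

SELECT count(*)
FROM (SELECT SomeID FROM SomeTable WHERE SomeColumn > '0') AS s;

just to see how the raw read performs without the output side.)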
On Sun, May 4, 2008 at 5:11 PM, Hans Zaunere <lists@zaunere.com> wrote:
> Hello,
>
> We're using a statement like this to dump between 500K and >5 million rows.
>
> COPY(SELECT SomeID FROM SomeTable WHERE SomeColumn > '0')
> TO '/dev/shm/SomeFile.csv'

Wait, are you really creating a .csv file in shared memory? Can such a thing even work?

If you're not creating a file in /dev/shm, where is it being built? On the main OS drive? The battery-backed cached RAID array?
On Mon, May 5, 2008 at 9:03 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Sun, May 4, 2008 at 5:11 PM, Hans Zaunere <lists@zaunere.com> wrote:
> > Hello,
> >
> > We're using a statement like this to dump between 500K and >5 million rows.
> >
> > COPY(SELECT SomeID FROM SomeTable WHERE SomeColumn > '0')
> > TO '/dev/shm/SomeFile.csv'
>
> Wait, are you really creating a .csv file in shared memory? Can such
> a thing even work?
>
> If you're not creating a file in /dev/shm, where is it being built?
> On the main OS drive? the battery backed cached RAID array?

OK, looked it up. You're making your .csv file in a ramdisk? If it gets big, it's gonna make the machine start swapping. I'd suggest storing only small things in a RAM disk, really.

Other than that, I'm not sure what the problem is.
> > > > We're using a statement like this to dump between 500K and >5
> > > > million rows.
> > > >
> > > > COPY(SELECT SomeID FROM SomeTable WHERE SomeColumn > '0')
> > > > TO '/dev/shm/SomeFile.csv'
> > > >
> > > > Upon first run, this operation can take several minutes. Upon
> > > > second run, it will be complete in generally well under a minute.
> > >
> > > Hmmm ... define "first" versus "second". What do you do to return
> > > it to the slow state?
> >
> > Interesting that you ask. I haven't found a very reliable way to
> > reproduce this.
> >
> > Typically, just waiting a while to run the same query the second
> > time will reproduce this behavior. I restarted postgresql and it
> > was reproduced as well. However, I can't find a way to flush
> > buffers/etc, to reproduce the
>
> what happens if you do something like:
>
> select count(*) from (select ...);
>
> i.e. don't make the .csv file each time. How's the performance
> without making the csv versus making it?

It's the same.

And regarding /dev/shm, we do watch that memory doesn't become contentious. We've also done the dump to another set of disk spindles, and we've seen the same performance.

So at the end of the day, it certainly does seem like a read bottleneck off of the disks. Unfortunately, from a hardware perspective, there's not much we can do about it currently.

Does anyone have any experience they can share with using partitioning or index tricks to speed up what should basically be reading large contiguous runs of rows from a table, based on a single-column WHERE constraint?

H
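P.S. For what it's worth, the kinds of things I'm considering look roughly like these sketches (names invented, nothing tried on the production box yet):

-- a partial index covering just the rows the dump touches
CREATE INDEX someid_dump_idx ON SomeTable (SomeID)
WHERE SomeColumn > '0';

-- or old-style partitioning via inheritance, so the dumped rows
-- live in their own child table
CREATE TABLE SomeTable_dump (
    CHECK (SomeColumn > '0')
) INHERITS (SomeTable);

I'd be very interested to hear whether either has actually helped anyone with a table this wide.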