> > We're using a statement like this to dump anywhere from 500K to more
> > than 5 million rows.
> >
> > COPY (SELECT SomeID FROM SomeTable WHERE SomeColumn > '0')
> > TO '/dev/shm/SomeFile.csv' WITH CSV;
> >
> > Upon first run, this operation can take several minutes. Upon
> > second run, it will be complete in generally well under a minute.
> >
>
> Almost certainly a buffering issue. The first time, it's reading the
> data from disk into memory while the filesystem is doing other work as
> well. The second time it's already in memory (kernel cache) and zips
> right by.
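For what it's worth, one way to test the cache theory is to drop the
kernel cache between runs. On Linux 2.6.16 or later, as root, something
like this should make every run start cold:

    sync
    # drop page cache, dentries and inodes so the next run starts cold
    echo 3 > /proc/sys/vm/drop_caches

Strictly a test-box trick, of course, not something to script on
production.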
>
> What can you do? First you need to see what's really happening, which
> means learning how to drive vmstat, iostat, top, etc. to watch what's
> going on on your machine. For starters, you'll likely want to look
> into reducing contention on the database partition set: tablespaces,
> big RAID arrays (big meaning a lot of spindles), a battery-backed
> RAID controller.
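Fair enough. I'll watch the disks while the COPY runs; with the standard
sysstat tools I'd expect something like this to tell the story:

    vmstat 5        # blocks in/out (bi/bo) and the wa (I/O wait) column
    iostat -x 5     # per-device %util and average wait times

If %util sits near 100 for the whole slow run, we're plainly I/O bound.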
We do have some pretty fair hardware:
-- PowerEdge R900 with 132 GB RAM and 16 cores
-- PowerVault MD1000 with 1.6 TB
I am getting more exact information from the NOC.
The contention might not be an issue. I ran a test last night with zero
activity on the server, and the results were the same. In fact, even the
second run took the same amount of time as the first run.
The queries, however, are always very simple: a single column equal to
or greater than a single value, and the column is typically either an
integer or a one-byte char.
Would a clustered index or ordered table help here? Partitioning?
Something to get sequential reads from the disk? I could experiment, but
alas we only have this one production environment right now, so I'm
looking for low-hanging fruit from others' prior experience.
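If clustering turns out to be the answer, I assume it would look roughly
like this (index name made up; CLUSTER ... USING is 8.3+ syntax, and it
rewrites the table under an exclusive lock, so it would need a window):

    CREATE INDEX idx_sometable_somecolumn ON SomeTable (SomeColumn);
    -- physically reorder the table by that index so the range scan
    -- reads sequential blocks instead of seeking all over the array
    CLUSTER SomeTable USING idx_sometable_somecolumn;
    ANALYZE SomeTable;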
The table is 600+ columns wide, with more than 32 million rows in total.
Would SELECTing into a temporary table first, then doing the COPY(), make
any difference?
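I was picturing something along these lines (the temp table name is just
for illustration):

    CREATE TEMP TABLE dump_rows AS
        SELECT SomeID FROM SomeTable WHERE SomeColumn > '0';
    -- COPY from the narrow one-column temp table rather than
    -- the 600-column base table
    COPY dump_rows TO '/dev/shm/SomeFile.csv' WITH CSV;

Though I suppose the CREATE TEMP TABLE step still has to do the same
scan, so it may just move the cost around.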
Regards,
H