Re: COPY Performance - Mailing list pgsql-general

From: Hans Zaunere
Subject: Re: COPY Performance
Msg-id: 009d01c8aeec$931682a0$b94387e0$@com
In response to: Re: COPY Performance ("Scott Marlowe" <scott.marlowe@gmail.com>)
List: pgsql-general
> > > > We're using a statement like this to dump between 500K and >5
> > > > million rows.
> > > >
> > > > COPY(SELECT SomeID FROM SomeTable WHERE SomeColumn > '0')
> > > >   TO '/dev/shm/SomeFile.csv'
> > > >
> > > > Upon a first run, this operation can take several minutes.  Upon a
> > > > second run, it generally completes in well under a minute.
> > > >
> > > Hmmm ... define "first" versus "second".  What do you do to return
> > > it to the slow state?
> >
> >  Interesting that you ask.  I haven't found a very reliable way to
> >  reproduce this.
> >
> >  Typically, just waiting a while before running the same query a
> >  second time will reproduce this behavior.  Restarting PostgreSQL
> >  reproduced it as well.  However, I can't find a way to flush
> >  buffers/etc. to reliably reproduce the slow state.
>
> what happens if you do something like:
>
> select count(*) from (select ...) as x;
>
> i.e. don't make the .csv file each time.  How's the performance
> without making the csv versus making it?

It's the same.

And regarding /dev/shm, we do watch to make sure that memory there doesn't
become a point of contention.  We've also directed the dump at a separate
set of disk spindles and seen the same performance.

So at the end of the day, it certainly does look like a read bottleneck on
the disks.  Unfortunately, from a hardware perspective, there's not much we
can do about that at the moment.
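
For what it's worth, a rough way to check whether the fast runs are simply
being served from PostgreSQL's buffer cache is the contrib pg_buffercache
view.  A sketch, assuming that module is installed (note it only covers
shared_buffers, not the OS page cache):

  -- Count the shared_buffers pages currently holding SomeTable; running it
  -- before and after the COPY shows how much of the table is cached.
  -- Unquoted identifiers fold to lower case, hence 'sometable'.
  SELECT count(*) AS buffered_pages
  FROM pg_buffercache b
  JOIN pg_class c ON b.relfilenode = c.relfilenode
  WHERE c.relname = 'sometable';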

Does anyone have any experience they can share with using partitioning or
index tricks to speed up reading what should basically be a large,
contiguous range of rows from a table, selected on a single-column WHERE
constraint?
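
One candidate along these lines is CLUSTER, which rewrites the table in
index order so that rows matching a range constraint on SomeColumn end up
physically contiguous on disk.  A sketch, with illustrative index names:

  -- Physically reorder SomeTable by SomeColumn so that a range scan
  -- reads mostly contiguous heap pages.
  CREATE INDEX sometable_somecolumn_idx ON SomeTable (SomeColumn);

  -- 8.3 syntax; older releases spell it:
  --   CLUSTER sometable_somecolumn_idx ON SomeTable;
  CLUSTER SomeTable USING sometable_somecolumn_idx;

  -- Refresh planner statistics after the rewrite.
  ANALYZE SomeTable;

CLUSTER takes an exclusive lock and is a one-time reorder, so it would have
to be repeated as the table churns.  The inheritance-based partitioning
route (child tables with CHECK constraints, plus constraint_exclusion)
instead keeps each SomeColumn range in its own, smaller table.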

H


