Re: questions on (parallel) COPY and when to REINDEX - Mailing list pgsql-general

From Andy Colson
Subject Re: questions on (parallel) COPY and when to REINDEX
Date
Msg-id 4A74E98B.5030503@squeakycode.net
Whole thread Raw
In response to questions on (parallel) COPY and when to REINDEX  (Janet Jacobsen <jsjacobsen@lbl.gov>)
Responses Re: questions on (parallel) COPY and when to REINDEX  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
Janet Jacobsen wrote:
> Hi.  We are running a data processing/analysis pipeline that
> writes about 100K records to two tables on a daily basis.
> The pipeline runs from about 6:00 a.m. to 10:00 a.m.
>
> Our user base is small - about five people.  Each accesses
> the database in a different way (generally using some script
> - either Perl or Python).
>
> Some people begin querying the database as soon as the new
> data/analysis results start being loaded.  Others wait until the
> day's run is complete, so the number of concurrent users is
> small at this time.
>
> The data/analysis results are loaded into two tables from two
> files of 200 to 1,000 rows each using the COPY command,
> which is executed from a Perl script that uses DBD-Pg.
>
> Other details: Postgres 8.3.7 running on a Linux system
> with eight processors.
>
> Both of the big tables (now up to > 15 M rows each) have
> indexes on several of the columns.  The indexes were
> created using CREATE INDEX CONCURRENTLY...
> Both tables have one or two foreign key constraints.
>
> My questions are:
> (1) At the point that the data are being loaded into the tables,
> are the new data indexed?

it depends if an index exists on the table when you fill it with data.  If there is an index, it will be updated.


> (2) Should I REINDEX these two tables daily after the pipeline
> completes?  Is this what other people do in practice?

it depends if an index exists on the table when you fill it with data.  But I repeat myself :-).  If an index exists
youwould not need to reindex it.  It may be faster to fill a table without an index, then add an index later.  But that
woulddepend on if you need the index for unique constraints. 


> (3) Currently the pipeline executes in serial fashion.  We'd
> like to cut the wall clock time down as much as possible.
> The data processing and data analysis can be done in parallel,
> but can the loading of the database be done in parallel, i.e.,
> can I execute four parallel COPY commands from four copies

We'd need more specifics.  Are you COPY'ing into two different tables at once?  (that should work).  Or the same table
withdifferent data (that should work too, I'd guess) or the same data with a unique key (that'll break)? 

> Our initial attempt at doing this failed.

What was the error?

> I found one
> posting in the archives about parallel COPY, but it doesn't seem
> to be quite on point.

They have added parallel copy to the pg_restore, but I think that does different tables, not the same table.  Was that
whatyou saw? 

> (4) Does COPY lock the table?  Do I need to explicitly
> LOCK the table before the COPY command?  Does LOCK
> even apply to using COPY?  If I used table locking, would
> parallel COPY work?

pg does not need to lock tables.  Locking is counter productive to multiuser access.  Why would you think locking a
tablewould let parallel copy work?  A lock is to give one process exclusive access to a table.  Locking is exactly what
youdont want. 

> (5) If I drop the indexes and foreign key constraints, then is it
> possible to COPY to a table from more than one script, i.e., do
> parallel COPY?  It seems like a really bad idea to drop those
> foreign key constraints.

It would be a bad idea yes.  One thing that could stop you is a unique constraint and two copy's are inserting the same
data. What sort of errors did you get last time you tried this? 

I have never tried two processes copy'ing into the same table at the same time, but I'd bet its possible.

-Andy

pgsql-general by date:

Previous
From: Chris Spotts
Date:
Subject: Re: How to execute external script from a TRIGGER or FUNCTION ?
Next
From: andy
Date:
Subject: Re: Can I add a super table to existing tables?