Re: questions on (parallel) COPY and when to REINDEX - Mailing list pgsql-general

From Janet Jacobsen
Subject Re: questions on (parallel) COPY and when to REINDEX
Date
Msg-id 4A762E8C.4090300@lbl.gov
Whole thread Raw
In response to Re: questions on (parallel) COPY and when to REINDEX  (Alban Hertroys <dalroi@solfertje.student.utwente.nl>)
Responses Re: questions on (parallel) COPY and when to REINDEX  (Andy Colson <andy@squeakycode.net>)
List pgsql-general
Thanks for your reply.  Responses below, and one follow-up
question about when/how often to use analyze.

Janet


On 02/08/2009 05:09 a.m., Alban Hertroys wrote:
> On 1 Aug 2009, at 23:24, Janet Jacobsen wrote:
>> My questions are:
>> (2) Should I REINDEX these two tables daily after the pipeline
>> completes?  Is this what other people do in practice?
>
> No need, but as people are querying as soon as data is arriving, an
> analyse of the table you just copied to should help performance - the
> query plans will be derived from more accurate table statistics that way.

The files that are being copied into the table(s) are between 200 to 1,000
each, but there are hundreds of these small files every day.  Would you
recommend running analyze after every COPY?
>
>> (3) Currently the pipeline executes in serial fashion.  We'd
>> like to cut the wall clock time down as much as possible.
>> The data processing and data analysis can be done in parallel,
>> but can the loading of the database be done in parallel, i.e.,
>> can I execute four parallel COPY commands from four copies
>> of a script?  Our initial attempt at doing this failed.  I found one
>> posting in the archives about parallel COPY, but it doesn't seem
>> to be quite on point.
>
> As long as you're not using the same database-session in parallel you
> should be fine. You can't do parallel database operations in the same
> session. Last time I did something similar I used separate database
> connections.

Yes, they are separate database connections because the connection
is made in the data loading script.
>
>> (5) If I drop the indexes and foreign key constraints, then is it
>> possible to COPY to a table from more than one script, i.e., do
>> parallel COPY?  It seems like a really bad idea to drop those
>> foreign key constraints.
>
> You can COPY in parallel, but having no FK's does help
> insert-performance. In that case whether you should or shouldn't
> remove your FK's depends on what's more important to you; insert
> performance or data correctness.
>
> As some of your users query the data while it's still coming in I
> guess that data correctness is in fact more important to you and you
> should keep the FK's.

Yes, I think we should keep the FK's.
>
> You wrote that your pipeline runs for a period of 4 hours and the
> table is about 15M rows now. What is taking up all that time? I
> understand why you'd want to parallelise that process, but where do
> you expect the gains?


We're processing images, and the data processing and analysis takes
up most of the time, but the images can be processed/analyzed in parallel.
We've been doing all of the data loading at the end - one COPY at a
time.  Originally that made sense because the researchers wanted to
check the images before loading the data/analysis results into the db.

But now we want to load the data/analysis results as soon as they are
available, so if there are four copies of the processing/analysis script
running, we want to have the data loading script initiated at the end of
each processing/analysis script, which is equivalent to four COPYs
writing four different files into the same table.

In terms of time, we will not see a huge decrease in the wall clock time,
but the data/analysis results will be available to the users much sooner,
and this matters a lot to the researchers.
>
> From the above I'm guessing that part of the problem you want to solve
> by parallelising is insert-performance. In cases like these I've seen
> people with problems with the look-up speed of foreign keys because
> the database chose a bad query plan. Regular analyses during inserting
> data should prevent that.

Okay, so I can try this.
>
>> Should re-think about where our database loading fits into the
>> overall pipeline, i.e., do the data processing and analysis in
>> parallel, but keep the data loading sequential?  The reason for
>> not doing all of the data loading at the end is that some of the
>> users *really* want to start querying the data and analysis
>> results as soon as they enter the database.
>
> You seem to have a design where the database grows indefinitely? Your
> database doesn't look to be very big yet, but I expect that at some
> point in the future your data will grow so much that the ordering of
> it on disk starts to matter for how fast records can be fetched.
> That's a typical scenario where people here start to advise using
> table partitioning.
>
> The thing is though that partitioning only works well if the queries
> your users perform contain clauses of which the database knows they
> divide the data (the same clauses used to partition the table).
>
> The people you work with apparently are knowledgeable enough that they
> create their own queries. They will have to take partition constraints
> into account too if you choose to use table partitioning.
>

Yes, there are a couple of tables that it seems will grow indefinitely.
We are not in production yet, so there is some question about
should so much be going into the database.  For now the answer is
"yes, everything".

Last week or so I posted a query about table partitioning, and I got
some good suggestions having to do with partitioning and partial
indexes, so I think both are in our future.

Yes, the researchers do write their own queries, so I think with
adequate explanation, they could take advantage of table partitions.
I need to think a lot about what the right partitions are, however.

Thanks,
Janet

>
> Regards,
> Alban Hertroys
>
> --
> If you can't see the forest for the trees,
> cut the trees and you'll see there is no forest.

pgsql-general by date:

Previous
From: Michael Gould
Date:
Subject: Re: Problem trying to load trigger
Next
From: andy
Date:
Subject: Re: questions on (parallel) COPY and when to REINDEX