Thread: questions on (parallel) COPY and when to REINDEX

questions on (parallel) COPY and when to REINDEX

From: Janet Jacobsen
Date:
Hi.  We are running a data processing/analysis pipeline that
writes about 100K records to two tables on a daily basis.
The pipeline runs from about 6:00 a.m. to 10:00 a.m.

Our user base is small - about five people.  Each accesses
the database in a different way (generally using some script
- either Perl or Python).

Some people begin querying the database as soon as the new
data/analysis results start being loaded.  Others wait until the
day's run is complete, so the number of concurrent users is
small at this time.

The data/analysis results are loaded into two tables from two
files of 200 to 1,000 rows each using the COPY command,
which is executed from a Perl script that uses DBD-Pg.

Other details: Postgres 8.3.7 running on a Linux system
with eight processors.

Both of the big tables (now up to > 15 M rows each) have
indexes on several of the columns.  The indexes were
created using CREATE INDEX CONCURRENTLY...
Both tables have one or two foreign key constraints.

My questions are:
(1) At the point that the data are being loaded into the tables,
are the new data indexed?
(2) Should I REINDEX these two tables daily after the pipeline
completes?  Is this what other people do in practice?
(3) Currently the pipeline executes in serial fashion.  We'd
like to cut the wall clock time down as much as possible.
The data processing and data analysis can be done in parallel,
but can the loading of the database be done in parallel, i.e.,
can I execute four parallel COPY commands from four copies
of a script?  Our initial attempt at doing this failed.  I found one
posting in the archives about parallel COPY, but it doesn't seem
to be quite on point.
(4) Does COPY lock the table?  Do I need to explicitly
LOCK the table before the COPY command?  Does LOCK
even apply to using COPY?  If I used table locking, would
parallel COPY work?
(5) If I drop the indexes and foreign key constraints, then is it
possible to COPY to a table from more than one script, i.e., do
parallel COPY?  It seems like a really bad idea to drop those
foreign key constraints.

Should we re-think where our database loading fits into the
overall pipeline, i.e., do the data processing and analysis in
parallel, but keep the data loading sequential?  The reason for
not doing all of the data loading at the end is that some of the
users *really* want to start querying the data and analysis
results as soon as they enter the database.

Looking forward to your replies.

Janet



Re: questions on (parallel) COPY and when to REINDEX

From: Andy Colson
Date:
Janet Jacobsen wrote:
> Hi.  We are running a data processing/analysis pipeline that
> writes about 100K records to two tables on a daily basis.
> The pipeline runs from about 6:00 a.m. to 10:00 a.m.
>
> Our user base is small - about five people.  Each accesses
> the database in a different way (generally using some script
> - either Perl or Python).
>
> Some people begin querying the database as soon as the new
> data/analysis results start being loaded.  Others wait until the
> day's run is complete, so the number of concurrent users is
> small at this time.
>
> The data/analysis results are loaded into two tables from two
> files of 200 to 1,000 rows each using the COPY command,
> which is executed from a Perl script that uses DBD-Pg.
>
> Other details: Postgres 8.3.7 running on a Linux system
> with eight processors.
>
> Both of the big tables (now up to > 15 M rows each) have
> indexes on several of the columns.  The indexes were
> created using CREATE INDEX CONCURRENTLY...
> Both tables have one or two foreign key constraints.
>
> My questions are:
> (1) At the point that the data are being loaded into the tables,
> are the new data indexed?

It depends on whether an index exists on the table when you fill it with data.  If there is an index, it will be updated as the rows are loaded.


> (2) Should I REINDEX these two tables daily after the pipeline
> completes?  Is this what other people do in practice?

It depends on whether an index exists on the table when you fill it with data.  But I repeat myself :-).  If an index exists
you would not need to reindex it.  It may be faster to fill a table without an index, then add the index later.  But that
would depend on whether you need the index for unique constraints.
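As a sketch of the drop-then-rebuild approach (table and index names here are hypothetical, and whether this wins depends on how large the batch is relative to the table):

```sql
-- Sketch: bulk-load with the index out of the way, then rebuild it.
-- Assumes a hypothetical table "detections" indexed on "obs_date".
DROP INDEX detections_obs_date_idx;

COPY detections FROM '/data/batch_0001.txt';

-- CONCURRENTLY avoids blocking readers while the index is rebuilt;
-- it takes longer overall, but queries keep working in the meantime.
CREATE INDEX CONCURRENTLY detections_obs_date_idx
    ON detections (obs_date);
```

Note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so these have to be issued as separate statements.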


> (3) Currently the pipeline executes in serial fashion.  We'd
> like to cut the wall clock time down as much as possible.
> The data processing and data analysis can be done in parallel,
> but can the loading of the database be done in parallel, i.e.,
> can I execute four parallel COPY commands from four copies

We'd need more specifics.  Are you COPY'ing into two different tables at once?  (that should work).  Or the same table
with different data (that should work too, I'd guess) or the same data with a unique key (that'll break)?

> Our initial attempt at doing this failed.

What was the error?

> I found one
> posting in the archives about parallel COPY, but it doesn't seem
> to be quite on point.

They have added parallel copy to pg_restore, but I think that does different tables, not the same table.  Was that
what you saw?

> (4) Does COPY lock the table?  Do I need to explicitly
> LOCK the table before the COPY command?  Does LOCK
> even apply to using COPY?  If I used table locking, would
> parallel COPY work?

pg does not need to lock tables.  Locking is counterproductive to multiuser access.  Why would you think locking a
table would let parallel COPY work?  A lock gives one process exclusive access to a table.  Locking is exactly what
you don't want.

> (5) If I drop the indexes and foreign key constraints, then is it
> possible to COPY to a table from more than one script, i.e., do
> parallel COPY?  It seems like a really bad idea to drop those
> foreign key constraints.

It would be a bad idea, yes.  One thing that could stop you is a unique constraint while two COPYs are inserting the same
data.  What sort of errors did you get last time you tried this?

I have never tried two processes COPY'ing into the same table at the same time, but I'd bet it's possible.

-Andy

Re: questions on (parallel) COPY and when to REINDEX

From: Tom Lane
Date:
Andy Colson <andy@squeakycode.net> writes:
> I have never tried two processes copy'ing into the same table at the same time, but I'd bet its possible.

It's no different from concurrent INSERTs ...

            regards, tom lane

Re: questions on (parallel) COPY and when to REINDEX

From: Alban Hertroys
Date:
On 1 Aug 2009, at 23:24, Janet Jacobsen wrote:

> Both of the big tables (now up to > 15 M rows each) have
> indexes on several of the columns.  The indexes were
> created using CREATE INDEX CONCURRENTLY...
> Both tables have one or two foreign key constraints.

Others have answered some of your questions already, so I snipped a few.

> My questions are:
> (2) Should I REINDEX these two tables daily after the pipeline
> completes?  Is this what other people do in practice?

No need, but as people are querying as soon as data is arriving, an
analyse of the table you just copied to should help performance - the
query plans will be derived from more accurate table statistics that
way.

> (3) Currently the pipeline executes in serial fashion.  We'd
> like to cut the wall clock time down as much as possible.
> The data processing and data analysis can be done in parallel,
> but can the loading of the database be done in parallel, i.e.,
> can I execute four parallel COPY commands from four copies
> of a script?  Our initial attempt at doing this failed.  I found one
> posting in the archives about parallel COPY, but it doesn't seem
> to be quite on point.

As long as you're not using the same database-session in parallel you
should be fine. You can't do parallel database operations in the same
session. Last time I did something similar I used separate database
connections.
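A minimal sketch of the one-connection-per-process pattern in Python (psycopg2, the "detections" table, and the file names are all assumptions for illustration; this obviously needs a running database):

```python
# Sketch: four parallel loads, each worker opening its own connection.
# Sharing one DBD-Pg/DBI handle across forked children is what
# typically fails; separate sessions per process is the key point.
from multiprocessing import Process

import psycopg2


def load_file(path):
    # Each worker gets its own connection, and therefore its own session.
    conn = psycopg2.connect("dbname=pipeline")
    cur = conn.cursor()
    with open(path) as f:
        cur.copy_from(f, "detections")  # COPY detections FROM STDIN
    conn.commit()
    conn.close()


if __name__ == "__main__":
    files = ["batch_01.txt", "batch_02.txt", "batch_03.txt", "batch_04.txt"]
    workers = [Process(target=load_file, args=(p,)) for p in files]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```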

> (5) If I drop the indexes and foreign key constraints, then is it
> possible to COPY to a table from more than one script, i.e., do
> parallel COPY?  It seems like a really bad idea to drop those
> foreign key constraints.

You can COPY in parallel, but having no FK's does help insert-
performance. In that case whether you should or shouldn't remove your
FK's depends on what's more important to you; insert performance or
data correctness.

As some of your users query the data while it's still coming in I
guess that data correctness is in fact more important to you and you
should keep the FK's.

You wrote that your pipeline runs for a period of 4 hours and the
table is about 15M rows now. What is taking up all that time? I
understand why you'd want to parallelise that process, but where do
you expect the gains?

From the above I'm guessing that part of the problem you want to
solve by parallelising is insert-performance. In cases like these I've
seen people with problems with the look-up speed of foreign keys
because the database chose a bad query plan. Regular analyses during
inserting data should prevent that.

> Should we re-think where our database loading fits into the
> overall pipeline, i.e., do the data processing and analysis in
> parallel, but keep the data loading sequential?  The reason for
> not doing all of the data loading at the end is that some of the
> users *really* want to start querying the data and analysis
> results as soon as they enter the database.

You seem to have a design where the database grows indefinitely? Your
database doesn't look to be very big yet, but I expect that at some
point in the future your data will grow so much that the ordering of
it on disk starts to matter for how fast records can be fetched.
That's a typical scenario where people here start to advise using
table partitioning.

The thing is though that partitioning only works well if the queries
your users perform contain clauses of which the database knows they
divide the data (the same clauses used to partition the table).

The people you work with apparently are knowledgeable enough that they
create their own queries. They will have to take partition constraints
into account too if you choose to use table partitioning.

> Looking forward to your replies.
>
> Janet

Regards,
Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.





Re: questions on (parallel) COPY and when to REINDEX

From: Janet Jacobsen
Date:
Thanks for your reply.  Responses below, and one follow-up
question about when/how often to use analyze.

Janet


On 02/08/2009 05:09 a.m., Alban Hertroys wrote:
> On 1 Aug 2009, at 23:24, Janet Jacobsen wrote:
>> My questions are:
>> (2) Should I REINDEX these two tables daily after the pipeline
>> completes?  Is this what other people do in practice?
>
> No need, but as people are querying as soon as data is arriving, an
> analyse of the table you just copied to should help performance - the
> query plans will be derived from more accurate table statistics that way.

The files that are being copied into the table(s) are between 200 and 1,000
rows each, but there are hundreds of these small files every day.  Would you
recommend running analyze after every COPY?
>
>> (3) Currently the pipeline executes in serial fashion.  We'd
>> like to cut the wall clock time down as much as possible.
>> The data processing and data analysis can be done in parallel,
>> but can the loading of the database be done in parallel, i.e.,
>> can I execute four parallel COPY commands from four copies
>> of a script?  Our initial attempt at doing this failed.  I found one
>> posting in the archives about parallel COPY, but it doesn't seem
>> to be quite on point.
>
> As long as you're not using the same database-session in parallel you
> should be fine. You can't do parallel database operations in the same
> session. Last time I did something similar I used separate database
> connections.

Yes, they are separate database connections because the connection
is made in the data loading script.
>
>> (5) If I drop the indexes and foreign key constraints, then is it
>> possible to COPY to a table from more than one script, i.e., do
>> parallel COPY?  It seems like a really bad idea to drop those
>> foreign key constraints.
>
> You can COPY in parallel, but having no FK's does help
> insert-performance. In that case whether you should or shouldn't
> remove your FK's depends on what's more important to you; insert
> performance or data correctness.
>
> As some of your users query the data while it's still coming in I
> guess that data correctness is in fact more important to you and you
> should keep the FK's.

Yes, I think we should keep the FK's.
>
> You wrote that your pipeline runs for a period of 4 hours and the
> table is about 15M rows now. What is taking up all that time? I
> understand why you'd want to parallelise that process, but where do
> you expect the gains?


We're processing images, and the data processing and analysis takes
up most of the time, but the images can be processed/analyzed in parallel.
We've been doing all of the data loading at the end - one COPY at a
time.  Originally that made sense because the researchers wanted to
check the images before loading the data/analysis results into the db.

But now we want to load the data/analysis results as soon as they are
available, so if there are four copies of the processing/analysis script
running, we want to have the data loading script initiated at the end of
each processing/analysis script, which is equivalent to four COPYs
writing four different files into the same table.

In terms of time, we will not see a huge decrease in the wall clock time,
but the data/analysis results will be available to the users much sooner,
and this matters a lot to the researchers.
>
> From the above I'm guessing that part of the problem you want to solve
> by parallelising is insert-performance. In cases like these I've seen
> people with problems with the look-up speed of foreign keys because
> the database chose a bad query plan. Regular analyses during inserting
> data should prevent that.

Okay, so I can try this.
>
>> Should we re-think where our database loading fits into the
>> overall pipeline, i.e., do the data processing and analysis in
>> parallel, but keep the data loading sequential?  The reason for
>> not doing all of the data loading at the end is that some of the
>> users *really* want to start querying the data and analysis
>> results as soon as they enter the database.
>
> You seem to have a design where the database grows indefinitely? Your
> database doesn't look to be very big yet, but I expect that at some
> point in the future your data will grow so much that the ordering of
> it on disk starts to matter for how fast records can be fetched.
> That's a typical scenario where people here start to advise using
> table partitioning.
>
> The thing is though that partitioning only works well if the queries
> your users perform contain clauses of which the database knows they
> divide the data (the same clauses used to partition the table).
>
> The people you work with apparently are knowledgeable enough that they
> create their own queries. They will have to take partition constraints
> into account too if you choose to use table partitioning.
>

Yes, there are a couple of tables that it seems will grow indefinitely.
We are not in production yet, so there is some question about whether
so much should be going into the database.  For now the answer is
"yes, everything".

Last week or so I posted a query about table partitioning, and I got
some good suggestions having to do with partitioning and partial
indexes, so I think both are in our future.

Yes, the researchers do write their own queries, so I think with
adequate explanation, they could take advantage of table partitions.
I need to think a lot about what the right partitions are, however.

Thanks,
Janet

>
> Regards,
> Alban Hertroys
>
> --
> If you can't see the forest for the trees,
> cut the trees and you'll see there is no forest.

Re: questions on (parallel) COPY and when to REINDEX

From: andy
Date:
Janet Jacobsen wrote:
> Thanks for your reply.  Responses below to answer your questions,
> and one follow-up question on REINDEX.
>
> Janet
>
>
>>> Both tables have one or two foreign key constraints.
>>>
>>> (2) Should I REINDEX these two tables daily after the pipeline
>>> completes?  Is this what other people do in practice?
>>
>> it depends if an index exists on the table when you fill it with
>> data.  But I repeat myself :-).  If an index exists you would not need
>> to reindex it.  It may be faster to fill a table without an index,
>> then add an index later.  But that would depend on if you need the
>> index for unique constraints.
>>
>
> Ok.  Since data loading occurs daily, and the users query the table
> while the data loading is going on, it seems like I should not drop
> the indexes before the daily loading.
>
> I re-read the REINDEX pages.  I see the following statement,
>
>     "Also, for B-tree indexes a freshly-constructed index is somewhat
>     faster to access than one that has been updated many times, because
>     logically adjacent pages are usually also physically adjacent in a
>     newly built index. (This consideration does not currently apply to
>     non-B-tree indexes.) It might be worthwhile to reindex periodically
>     just to improve access speed."
>
> This quote says "has been updated many times" and "worthwhile to reindex
> periodically".  I'm not sure how to interpret "many times" and
> "periodically".
>
> In our case, on a daily basis, 100K rows or so are added to two tables,
> and a later script does 100K updates on one of the tables. Does that make
> us a candidate for daily REINDEXing?

It's tough to say.  I'd guess not every day.  Once a month?  The best way to find out is to do some timing.  Do a few
indexed select statements and 'explain analyze' them.  See what the numbers tell you.  Then REINDEX and do the same
test.  Then wait a month and try the same test.  See if it's much slower.
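The timing test Andy describes might look like this in psql; the table and column names are hypothetical stand-ins:

```sql
-- Before: time a representative indexed lookup and note the actuals.
EXPLAIN ANALYZE
SELECT * FROM detections WHERE obs_date = '2009-08-01';

-- Rebuild the table's indexes.  REINDEX takes heavy locks on the
-- indexes while it runs, so schedule it in a quiet window.
REINDEX TABLE detections;

-- After: run the same query again and compare the execution times.
EXPLAIN ANALYZE
SELECT * FROM detections WHERE obs_date = '2009-08-01';
```

If the before/after numbers are close, a monthly (or rarer) REINDEX is plenty.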

The difference between having an index and not having one is huge, orders of magnitude.  The difference between a balanced
index and an unbalanced one is minor.

A vacuum analyze might be more important than a reindex; depending on how many indexes you have, it will update the
stats about the indexes and help the planner pick the best index.

I can't answer as to what others do; my pg database is 25 meg.  Yes, meg.  And 260K rows.  It's embarrassing.  By next
month I'll probably be up to 260.5K rows.  So I really have no experience with the volume of data you're pushing around.


>>
>>> (3) Currently the pipeline executes in serial fashion.  We'd
>>> like to cut the wall clock time down as much as possible.
>>> The data processing and data analysis can be done in parallel,
>>> but can the loading of the database be done in parallel, i.e.,
>>> can I execute four parallel COPY commands from four copies
>>
>> We'd need more specifics.  Are you COPY'ing into two different tables
>> at once?  (that should work).  Or the same table with different data
>> (that should work too, I'd guess) or the same data with a unique key
>> (that'll break)?
>>
>
> We'd like to run four identical scripts in parallel, each of which will
> copy a different file into the same table.
>>> Our initial attempt at doing this failed.
>>
>> What was the error?
>>
>
> If the return status from trying to do the COPY is 7, the script prints
> a failure message, i.e., not the Postgres error. I will set up a test on
> a test case to get more information.  (I didn't run the initial try.)

COPY won't return 7.  Not sure where the 7 comes from.  The copy may fail, and whatever language you're programming in may
raise an exception, which might be numbered 7... I suppose.  Multiple COPYs into the same table at the same time should
work.  I think the error was on your side.

Note that while you are in a transaction your clients won't be able to see any of the data until you commit.  Since some
of them want the data ASAP, you might want to break up your COPYs with a few commits.  I suppose that'd depend on
how long it all takes, though.
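Committing per file keeps the data visible to queries as it arrives; a sketch (table and file paths hypothetical):

```sql
-- Each file is loaded and committed on its own, so rows from
-- batch 1 are queryable while batch 2 is still loading.
BEGIN;
COPY detections FROM '/data/batch_0001.txt';
COMMIT;

BEGIN;
COPY detections FROM '/data/batch_0002.txt';
COMMIT;
```

(In autocommit mode each COPY commits on its own anyway; the explicit BEGIN/COMMIT only matters if the loading script currently wraps the whole run in one transaction.)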


>>> I found one
>>> posting in the archives about parallel COPY, but it doesn't seem
>>> to be quite on point.
>>
>> They have added parallel copy to the pg_restore, but I think that does
>> different tables, not the same table.  Was that what you saw?
>
> Yes, I think so.  The reference is to "Andrews parallel restore patch":
> http://archives.postgresql.org/pgsql-hackers/2008-09/msg01711.php
> The subject line is "lock contention on parallel COPY ?"

Yeah, that's an internal lock on some really small variable deep in the guts of pg core.  Not an entire table lock.


-Andy

Re: questions on (parallel) COPY and when to REINDEX

From: Andy Colson
Date:
>> On 1 Aug 2009, at 23:24, Janet Jacobsen wrote:
>>> My questions are:
>>> (2) Should I REINDEX these two tables daily after the pipeline
>>> completes?  Is this what other people do in practice?
>>
>> No need, but as people are querying as soon as data is arriving, an
>> analyse of the table you just copied to should help performance - the
>> query plans will be derived from more accurate table statistics that way.
>
> The files that are being copied into the table(s) are between 200 and 1,000
> rows each, but there are hundreds of these small files every day.  Would you
> recommend running analyze after every COPY?

Just once, at the end.  Assuming you are only deleting 100k records and re-adding/updating another 100k in a batch,
that's not so many records it'll throw the stats out of whack.  (If you were dropping/updating 15M in a batch you
might wanna analyze in the middle.)

>>
>> You wrote that your pipeline runs for a period of 4 hours and the
>> table is about 15M rows now. What is taking up all that time? I
>> understand why you'd want to parallelise that process, but where do
>> you expect the gains?
>
>
> We're processing images, and the data processing and analysis takes
> up most of the time, but the images can be processed/analyzed in parallel.
> We've been doing all of the data loading at the end - one COPY at a
> time.  Originally that made sense because the researchers wanted to
> check the images before loading the data/analysis results into the db.

Ah!  Images!  When you are doing the COPY are you escaping the data?  You can't "just" COPY a binary file.
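For anyone generating COPY input by hand, the text-format escaping Andy alludes to can be done with a small helper.  This is an illustrative sketch, not part of Janet's actual loader; the field values are made up:

```python
def copy_escape(value):
    """Escape one field for PostgreSQL COPY text format.

    Backslashes are doubled first, then tab, newline, and carriage
    return become backslash sequences; None becomes the \\N null marker.
    """
    if value is None:
        return r"\N"
    return (value.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))

# One output line per row: escaped fields joined by literal tabs.
row = ["img_0042.fits", "a comment with\ttab", None]
print("\t".join(copy_escape(v) for v in row))
```

Raw binary image data still wouldn't survive this path, which is the point of the caveat; storing attributes rather than the images themselves (as it turns out Janet does) keeps COPY simple.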

-Andy

Re: questions on (parallel) COPY and when to REINDEX

From: Janet Jacobsen
Date:
Hi, Andy.  Thanks for the responses and information.

Just to let you know... what we are storing in the db are the image
attributes - about 40 of them - not the images.  So the COPY
is reading an ascii file of the image attributes.  It turns out to be
useful to have the image attributes handy - much better than reading
the image headers. The images are available on spinning disk, and
the image locations are in the db.

Thanks,
Janet


On 02/08/2009 05:59 p.m., Andy Colson wrote:
>>> On 1 Aug 2009, at 23:24, Janet Jacobsen wrote:
>>>> My questions are:
>>>> (2) Should I REINDEX these two tables daily after the pipeline
>>>> completes?  Is this what other people do in practice?
>>>
>>> No need, but as people are querying as soon as data is arriving, an
>>> analyse of the table you just copied to should help performance -
>>> the query plans will be derived from more accurate table statistics
>>> that way.
>>
>> The files that are being copied into the table(s) are between 200 and
>> 1,000 rows each, but there are hundreds of these small files every day.
>> Would you
>> recommend running analyze after every COPY?
>
> Just once, at the end.  Assuming you are only deleting 100k records
> and re-adding/updating another 100k in a batch.  That's not so many
> records it'll throw the stats out of whack.  (If you were
> dropping/updating 15M in a batch you might wanna analyze in the middle)
>
>>>
>>> You wrote that your pipeline runs for a period of 4 hours and the
>>> table is about 15M rows now. What is taking up all that time? I
>>> understand why you'd want to parallelise that process, but where do
>>> you expect the gains?
>>
>>
>> We're processing images, and the data processing and analysis takes
>> up most of the time, but the images can be processed/analyzed in
>> parallel.
>> We've been doing all of the data loading at the end - one COPY at a
>> time.  Originally that made sense because the researchers wanted to
>> check the images before loading the data/analysis results into the db.
>
> Ah!  Images!  When you are doing the COPY are you escaping the data?
> You can't "just" COPY a binary file.
>
> -Andy