Re: questions on (parallel) COPY and when to REINDEX - Mailing list pgsql-general

From: andy
Subject: Re: questions on (parallel) COPY and when to REINDEX
Msg-id: 4A76311D.4020609@squeakycode.net
In response to: questions on (parallel) COPY and when to REINDEX  (Janet Jacobsen <jsjacobsen@lbl.gov>)
List: pgsql-general
Janet Jacobsen wrote:
> Thanks for your reply.  Responses below to answer your questions,
> and one follow-up question on REINDEX.
>
> Janet
>
>
>>> Both tables have one or two foreign key constraints.
>>>
>>> (2) Should I REINDEX these two tables daily after the pipeline
>>> completes?  Is this what other people do in practice?
>>
>> it depends if an index exists on the table when you fill it with
>> data.  But I repeat myself :-).  If an index exists you would not need
>> to reindex it.  It may be faster to fill a table without an index,
>> then add an index later.  But that would depend on if you need the
>> index for unique constraints.
>>
>
> Ok.  Since data loading occurs daily, and the users query the table
> while the data loading is going on, it seems like I should not drop
> the indexes before the daily loading.
>
> I re-read the REINDEX pages.  I see the following statement,
>
>     "Also, for B-tree indexes a freshly-constructed index is somewhat
>     faster to access than one that has been updated many times, because
>     logically adjacent pages are usually also physically adjacent in a
>     newly built index. (This consideration does not currently apply to
>     non-B-tree indexes.) It might be worthwhile to reindex periodically
>     just to improve access speed."
>
> This quote says "has been updated many times" and "worthwhile to reindex
> periodically".  I'm not sure how to interpret "many times" and
> "periodically".
>
> In our case, on a daily basis, 100K rows or so are added to two tables,
> and a later script does 100K updates on one of the tables.  Does that make
> us a candidate for daily REINDEXing?

It's tough to say.  I'd guess not every day.  Once a month?  The best way to find out is to do some timing.  Run a few
indexed select statements and EXPLAIN ANALYZE them.  See what the numbers tell you.  Then REINDEX and run the same
test.  Then wait a month and try the same test again.  See if it's much slower.
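
Roughly something like this (the table and column names here are made up, so swap in your own):

    -- time a typical indexed lookup
    EXPLAIN ANALYZE SELECT * FROM observations WHERE obs_id = 12345;

    -- rebuild the table's indexes, then time the same query again
    REINDEX TABLE observations;
    EXPLAIN ANALYZE SELECT * FROM observations WHERE obs_id = 12345;

If the "Total runtime" barely moves after the REINDEX, you probably don't need to do it daily.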

The difference between having an index and not having one is huge, orders of magnitude.  The difference between a
balanced index and an unbalanced one is minor.

A VACUUM ANALYZE might be more important than a REINDEX.  Depending on how many indexes you have, it will update the
stats about the indexes and help the planner pick the best index.
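
For example (again, the table name is just a placeholder):

    -- reclaim dead row space and refresh planner statistics after the daily load
    VACUUM ANALYZE observations;

You could run that from a cron job right after the loading script finishes.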

I can't answer as to what others do; my pg database is 25 meg.  Yes, meg.  And 260K rows.  It's embarrassing.  By next
month I'll probably be up to 260.5K rows.  So I really have no experience with the volume of data you're pushing around.


>>
>>> (3) Currently the pipeline executes in serial fashion.  We'd
>>> like to cut the wall clock time down as much as possible.
>>> The data processing and data analysis can be done in parallel,
>>> but can the loading of the database be done in parallel, i.e.,
>>> can I execute four parallel COPY commands from four copies
>>
>> We'd need more specifics.  Are you COPY'ing into two different tables
>> at once?  (that should work).  Or the same table with different data
>> (that should work too, I'd guess) or the same data with a unique key
>> (that'll break)?
>>
>
> We'd like to run four identical scripts in parallel, each of which will
> copy a different file into the same table.
>>> Our initial attempt at doing this failed.
>>
>> What was the error?
>>
>
> If the return status from trying to do the COPY is 7, the script prints
> a failure message, i.e., not the Postgres error. I will set up a test on
> a test case to get more information.  (I didn't run the initial try.)

COPY won't return 7.  Not sure where the 7 comes from.  The COPY may fail, and whatever language you're programming in
may raise an exception, which might be numbered 7... I suppose.  Multiple COPYs into the same table at the same time
should work.  I think the error was on your side.
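
Something like this should be fine from several sessions at once (the table and file paths are hypothetical, and COPY
FROM a file like this has to run as superuser with a path the server can see; otherwise use psql's \copy):

    -- session 1
    COPY observations FROM '/data/batch_a.csv' WITH CSV;

    -- session 2, running at the same time
    COPY observations FROM '/data/batch_b.csv' WITH CSV;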

Note that while you are in a transaction your clients won't be able to see any of the data until you commit.  Since some
of them want at the data ASAP, you might want to break up your COPYs with a few commits.  I suppose that'd depend on
how long it all takes, though.
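
For instance, from psql you could load the data in chunks (the chunk file names are made up here); with autocommit on,
each \copy commits on its own, so the earlier chunks become visible to readers while the later ones are still loading:

    \copy observations from 'chunk_01.csv' with csv
    \copy observations from 'chunk_02.csv' with csv
    \copy observations from 'chunk_03.csv' with csv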


>>> I found one
>>> posting in the archives about parallel COPY, but it doesn't seem
>>> to be quite on point.
>>
>> They have added parallel copy to the pg_restore, but I think that does
>> different tables, not the same table.  Was that what you saw?
>
> Yes, I think so.  The reference is to "Andrews parallel restore patch":
> http://archives.postgresql.org/pgsql-hackers/2008-09/msg01711.php
> The subject line is "lock contention on parallel COPY ?"

Yeah, that's an internal lock on some really small variable deep in the guts of pg core.  Not an entire table lock.


-Andy
