Home > mailing lists

Re: questions on (parallel) COPY and when to REINDEX - Mailing list pgsql-general

From	Janet Jacobsen
Subject	Re: questions on (parallel) COPY and when to REINDEX
Date	August 2, 2009 23:51:28
Msg-id	4A76508D.6010509@lbl.gov Whole thread Raw
In response to	Re: questions on (parallel) COPY and when to REINDEX (Andy Colson <andy@squeakycode.net>)
List	pgsql-general

Tree view

Hi, Andy.  Thanks for the responses and information.

Just to let you know... what we are storing in the db are the image
attributes - about 40 of them - not the images.  So the COPY
is reading an ascii file of the image attributes.  It turns out to be
useful to have the image attributes handy - much better than reading
the image headers. The images are available on spinning disk, and
the image locations are in the db.

Thanks,
Janet


On 02/08/2009 05:59 p.m., Andy Colson wrote:
>>> On 1 Aug 2009, at 23:24, Janet Jacobsen wrote:
>>>> My questions are:
>>>> (2) Should I REINDEX these two tables daily after the pipeline
>>>> completes?  Is this what other people do in practice?
>>>
>>> No need, but as people are querying as soon as data is arriving, an
>>> analyse of the table you just copied to should help performance -
>>> the query plans will be derived from more accurate table statistics
>>> that way.
>>
>> The files that are being copied into the table(s) are between 200 to
>> 1,000
>> each, but there are hundreds of these small files every day.  Would you
>> recommend running analyze after every COPY?
>
> Just once, at the end.  Assuming you are only deleting 100k records
> and re-adding/updating another 100k in a batch.  That's not so many
> records it'll through the stats out of whack.  (If you were
> dropping/updating 15M in a batch you might wanna analyze in the middle)
>
>>>
>>> You wrote that your pipeline runs for a period of 4 hours and the
>>> table is about 15M rows now. What is taking up all that time? I
>>> understand why you'd want to parallelise that process, but where do
>>> you expect the gains?
>>
>>
>> We're processing images, and the data processing and analysis takes
>> up most of the time, but the images can be processed/analyzed in
>> parallel.
>> We've been doing all of the data loading at the end - one COPY at a
>> time.  Originally that made sense because the researchers wanted to
>> check the images before loading the data/analysis results into the db.
>
> Ah!  Images!  When you are doing the COPY are you escaping the data?
> You cant "just" copy a binary file.
>
> -Andy

pgsql-general by date:

From: Ilya Urikh
Date: 02 August 2009, 23:46:09
Subject: Problem with Numerics multiplication in C-function

From: Ilya Urikh
Date: 03 August 2009, 01:05:06
Subject: Problem with Numerics multiplication in C-function

Re: questions on (parallel) COPY and when to REINDEX - Mailing list pgsql-general

Previous

Next