Thread: ...

From
Anton de Wet
Date:
Hi

I'm playing with postgresql and a few rather large datasets.

My biggest table I'm going to have to work with will be something like:

create table salesitems (storecode char(5), docid char(12),
    stockcode char(7), nrsold int, totalcost float8);

with data like:
BRM2,BRM2-001681,M01207,8,117.89
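For reference, each block load boils down to one COPY statement. A minimal sketch of building the command (the database name, file path, and the dry-run "echo" are my assumptions, not from the post; drop the echo to actually run it):

```shell
#!/bin/sh
# Build the COPY command for one CSV-style data file.
# "sales" and the file path are assumed names; remove "echo" to execute.
file="/tmp/salesitems.csv"
stmt="COPY salesitems FROM '$file' USING DELIMITERS ','"
echo psql -d sales -c "$stmt"
```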

Now my full file will have 5-10 million rows, but I'm playing with test sets of
600000 rows. It's taking a VERY long time with COPY to get the data
into postgres. To find out what is going on, I've broken the set down
into 10000-row blocks and I'm timing how long each one takes to import.

The system is a PentiumII-300 with 96 MB of memory and it's running Linux 2.0.35.
I've tried 6.3.2 and 6.4-beta2 with the same results (though 6.4 solves a few
other problems I've had).

I'm running the COPYs from a bash script which runs psql for each copy.
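A sketch of the kind of per-block loop described (file names, database name, and the dry-run "echo" are assumptions; the poster's actual script is not shown):

```shell
#!/bin/sh
# Import 62 blocks of 10000 rows each, one psql invocation per block.
# Remove the "echo" before psql to actually run the imports.
total=62
n=1
while [ "$n" -le "$total" ]; do
    block="salesitems.part$n"    # one 10000-row block file
    echo "importing part $n of $total"
    echo psql -d sales -c "COPY salesitems FROM '$PWD/$block' USING DELIMITERS ','"
    n=$((n + 1))
done
```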

COPYing the blocks gives the following times (time shown is per block, not
cumulative):

importing part 1 of      62, time : 0:02.35
importing part 2 of      62, time : 0:01.88
importing part 3 of      62, time : 0:01.99
importing part 4 of      62, time : 0:02.67
importing part 5 of      62, time : 0:02.09
importing part 6 of      62, time : 0:02.20
importing part 7 of      62, time : 0:02.35
importing part 8 of      62, time : 0:02.41
importing part 9 of      62, time : 0:02.62
importing part 10 of      62, time : 0:02.79
importing part 11 of      62, time : 0:03.10
importing part 12 of      62, time : 0:03.82
importing part 13 of      62, time : 0:03.98
importing part 14 of      62, time : 0:03.72
importing part 15 of      62, time : 0:03.81
importing part 16 of      62, time : 0:04.01
importing part 17 of      62, time : 0:04.29
importing part 18 of      62, time : 0:04.62
importing part 19 of      62, time : 0:05.35
importing part 20 of      62, time : 0:05.40
importing part 21 of      62, time : 0:05.51
importing part 22 of      62, time : 0:05.92
importing part 23 of      62, time : 0:06.04
importing part 24 of      62, time : 0:06.53
importing part 25 of      62, time : 0:06.79
importing part 26 of      62, time : 0:07.24
importing part 27 of      62, time : 0:07.56
importing part 28 of      62, time : 0:08.56
importing part 29 of      62, time : 0:08.85
importing part 30 of      62, time : 0:09.50
importing part 31 of      62, time : 0:09.50
importing part 32 of      62, time : 0:10.28
importing part 33 of      62, time : 0:10.10
importing part 34 of      62, time : 0:13.45
importing part 35 of      62, time : 0:11.23
importing part 36 of      62, time : 0:11.32
importing part 37 of      62, time : 0:11.65
importing part 38 of      62, time : 0:11.70
importing part 39 of      62, time : 0:11.97
importing part 40 of      62, time : 0:12.67
importing part 41 of      62, time : 0:12.42
importing part 42 of      62, time : 0:12.90
importing part 43 of      62, time : 0:13.26
importing part 44 of      62, time : 0:14.04
importing part 45 of      62, time : 0:13.52
importing part 46 of      62, time : 0:14.31
importing part 47 of      62, time : 0:14.48
importing part 48 of      62, time : 0:14.99
importing part 49 of      62, time : 0:15.22
importing part 50 of      62, time : 0:15.18
importing part 51 of      62, time : 0:16.00
importing part 52 of      62, time : 0:17.46
importing part 53 of      62, time : 0:16.55
importing part 54 of      62, time : 0:17.01
importing part 55 of      62, time : 0:17.30
importing part 56 of      62, time : 0:17.55
importing part 57 of      62, time : 0:19.36
importing part 58 of      62, time : 0:18.14
importing part 59 of      62, time : 0:17.86
importing part 60 of      62, time : 0:18.65
importing part 61 of      62, time : 0:18.95



Or nicely plotted it looks like this:


     20 ++--------+--------+---------+---------+---------+--------+--------++
        +         +        +         +         +         +    A  AA         +
     18 ++                                                     "times"  A  ++
        |                                                 AAAA              |
     16 ++                                               A A               ++
        |                                             AAA                   |
     14 ++                                        A AA                     ++
        |                               A       AA A                        |
     12 ++                                   AAA                           ++
        |                                AAAA                               |
     10 ++                            AA                                   ++
        |                           AA                                      |
        |                         AA                                        |
      8 ++                      AA                                         ++
        |                     AA                                            |
      6 ++               AAAAA                                             ++
        |               AA                                                  |
      4 ++         AAAAA                                                   ++
        |  A   AAAA                                                         |
      2 AAA AAA                                                            ++
        +         +        +         +         +         +        +         +
      0 ++--------+--------+---------+---------+---------+--------+A-------++
        0        10       20        30        40        50       60        70


Apparently the time per block increases linearly.

Since the per-block time grows linearly, the total time grows quadratically:
importing 1/2 million rows takes 7 minutes, but
importing 5 million takes more than 10 hours!
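Back-of-the-envelope: if block k takes roughly a + b*k seconds, the total for n blocks is n*a + b*n*(n+1)/2. Eyeballing the timings above gives a ≈ 2 s and b ≈ 0.28 s/block (rough fits, not measured constants); a quick check of the model:

```shell
#!/bin/sh
# Rough model of the timings above: block k takes about a + b*k seconds,
# so n blocks take n*a + b*n*(n+1)/2 in total. a and b are eyeballed fits.
a=2.0
b=0.28
for n in 62 500; do   # 62 blocks ~ 620k rows, 500 blocks ~ 5M rows
    awk -v a="$a" -v b="$b" -v n="$n" 'BEGIN {
        total = n * a + b * n * (n + 1) / 2   # seconds
        printf "%d blocks: ~%.0f s (~%.1f h)\n", n, total, total / 3600
    }'
done
```

This gives roughly 670 s for 62 blocks and roughly 36000 s (about 10 hours) for 500 blocks, consistent with the figures quoted above.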

Is there a way around this? I'm currently creating indexes afterwards. (I haven't
even tried creating indexes on the full set of data; will that show similar
linear time increases?)


Any help will be appreciated.

Anton





"We must be the change we wish to see in the world."--Gandhi


Re: your mail

From
Oleg Broytmann
Date:
Hi!

On Mon, 19 Oct 1998, Anton de Wet wrote:
> I'm running the COPYs from a bash script which runs psql for each copy.

   But why? Isn't it better and simpler to just do
COPY table FROM stdin
   ???

Oleg.
----
    Oleg Broytmann  National Research Surgery Centre  http://sun.med.ru/~phd/
           Programmers don't die, they just GOSUB without RETURN.


Re: your mail

From
Anton de Wet
Date:
On Mon, 19 Oct 1998, Oleg Broytmann wrote:

> > I'm running the COPYs from a bash script which runs psql for each copy.
>
>    But why? Isn't it better and simpler just do
> COPY table FROM stdin

Yes, I did that on the first attempt and gave up after 4 hours. I'm doing it
in blocks so that you can see the time it takes per 10000 records. There
is no big difference between doing it in one block or several (proved
experimentally).

Anton