Hi
I'm playing with postgresql and a few rather large datasets.
My biggest table I'm going to have to work with will be something like:
create table salesitems (storecode char(5), docid char(12),
stockcode char(7), nrsold int, totalcost float8);
with data like:
BRM2,BRM2-001681,M01207,8,117.89
Now my full file will have 5-10 million rows but I'm playing with test sets of
6000000 rows. Its taking a VERY long time with COPY to get the data
into postgres. To try and find out what is going on, I've broken down the set
into 10000 row blocks and I'm timing the time it takes to import it.
The system is a PentiumII-300 with 96 MB of memory and its running Linux 2.0.35.
I've tried 6.3.2 and 6.4-beta2 with the same results (though 6.4 solves a few
other problems I've had).
I'm running the COPYs from a bash script which runs psql for each copy.
COPYing the blocks give the following times (time given per block, not
total time)
importing part 1 of 62, time : 0:02.35
importing part 2 of 62, time : 0:01.88
importing part 3 of 62, time : 0:01.99
importing part 4 of 62, time : 0:02.67
importing part 5 of 62, time : 0:02.09
importing part 6 of 62, time : 0:02.20
importing part 7 of 62, time : 0:02.35
importing part 8 of 62, time : 0:02.41
importing part 9 of 62, time : 0:02.62
importing part 10 of 62, time : 0:02.79
importing part 11 of 62, time : 0:03.10
importing part 12 of 62, time : 0:03.82
importing part 13 of 62, time : 0:03.98
importing part 14 of 62, time : 0:03.72
importing part 15 of 62, time : 0:03.81
importing part 16 of 62, time : 0:04.01
importing part 17 of 62, time : 0:04.29
importing part 18 of 62, time : 0:04.62
importing part 19 of 62, time : 0:05.35
importing part 20 of 62, time : 0:05.40
importing part 21 of 62, time : 0:05.51
importing part 22 of 62, time : 0:05.92
importing part 23 of 62, time : 0:06.04
importing part 24 of 62, time : 0:06.53
importing part 25 of 62, time : 0:06.79
importing part 26 of 62, time : 0:07.24
importing part 27 of 62, time : 0:07.56
importing part 28 of 62, time : 0:08.56
importing part 29 of 62, time : 0:08.85
importing part 30 of 62, time : 0:09.50
importing part 31 of 62, time : 0:09.50
importing part 32 of 62, time : 0:10.28
importing part 33 of 62, time : 0:10.10
importing part 34 of 62, time : 0:13.45
importing part 35 of 62, time : 0:11.23
importing part 36 of 62, time : 0:11.32
importing part 37 of 62, time : 0:11.65
importing part 38 of 62, time : 0:11.70
importing part 39 of 62, time : 0:11.97
importing part 40 of 62, time : 0:12.67
importing part 41 of 62, time : 0:12.42
importing part 42 of 62, time : 0:12.90
importing part 43 of 62, time : 0:13.26
importing part 44 of 62, time : 0:14.04
importing part 45 of 62, time : 0:13.52
importing part 46 of 62, time : 0:14.31
importing part 47 of 62, time : 0:14.48
importing part 48 of 62, time : 0:14.99
importing part 49 of 62, time : 0:15.22
importing part 50 of 62, time : 0:15.18
importing part 51 of 62, time : 0:16.00
importing part 52 of 62, time : 0:17.46
importing part 53 of 62, time : 0:16.55
importing part 54 of 62, time : 0:17.01
importing part 55 of 62, time : 0:17.30
importing part 56 of 62, time : 0:17.55
importing part 57 of 62, time : 0:19.36
importing part 58 of 62, time : 0:18.14
importing part 59 of 62, time : 0:17.86
importing part 60 of 62, time : 0:18.65
importing part 61 of 62, time : 0:18.95
Or nicely plotted it looks like this:
20 ++--------+--------+---------+---------+---------+--------+--------++
+ + + + + + A AA +
18 ++ "times" A ++
| AAAA |
16 ++ A A ++
| AAA |
14 ++ A AA ++
| A AA A |
12 ++ AAA ++
| AAAA |
10 ++ AA ++
| AA |
| AA |
8 ++ AA ++
| AA |
6 ++ AAAAA ++
| AA |
4 ++ AAAAA ++
| A AAAA |
2 AAA AAA ++
+ + + + + + + +
0 ++--------+--------+---------+---------+---------+--------+A-------++
0 10 20 30 40 50 60 70
Apparently a linear increase in time per data set.
This means that importing 1/2 million rows takes 7 minuites, but
importing 5 million takes more than 10 hours!
Is there a way around this? I'm currently creating indexes afterwards (I haven't
even tried creating indexes on the full set of data, will that show similar
linear time increases?
Any help will be appreciated.
Anton
"We must be the change we wish to see in the world."--Gandhi