Thread: Inserting 26 million rows takes 8 hours, how to improve those times?
From: Jose Vicente Nunez Z
Date:
Greetings,

I'm trying to use PostgreSQL to manage big amounts of data. One of the first things I'm testing is how fast PostgreSQL can load some big CSV text files. For that I'm using the PostgreSQL copy tool, but the problem is that it is taking almost 9 hours to load the data:

copy nb_cmo_deal_pools from '/postgres-system/datafile.txt' DELIMITERS '|';

[root@linux0105 root]# time psql -Upostgres MYDB < load.sql

real    487m47.632s
user    0m0.020s
sys     0m0.000s
[root@linux0105 root]#
[root@linux0105 root]# cat /postgres-system/datafile.txt | wc -l
26026965

I've already played with filesystem options on my ext3 system, and I would like to know:

1) How can I tune PostgreSQL to improve the insertion speed (besides dropping indexes and using the fsync option)?
2) How safe is it to use the fsync=off option on an ext3 journaled system? The journal should give me some protection in case the system goes down, shouldn't it?
3) I've read several web pages that talk about tuning the sort and buffer options in PostgreSQL, but found no good examples of how to optimize them. Does anyone know where I can find more help? (My test system has 2GB of RAM and I think 1.5GB for the database will be fair enough.)

I apologize if this is not the proper place to post these questions.

Thanks in advance,

--
Jose Vicente Nunez Zuleta (josevnz at newbreak dot com)
Newbreak LLC System Administrator
http://www.newbreak.com
RHCE, SCJD, SCJP
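For reference, the kind of bulk-load settings that questions 1 and 3 are asking about might look like the sketch below in a 7.3-era postgresql.conf; the values are illustrative assumptions for a 2GB machine, not tested recommendations.

    # postgresql.conf sketch (7.3-era parameter names; values are illustrative only)
    shared_buffers = 16384         # 8KB pages, roughly 128MB of shared buffer cache
    sort_mem = 65536               # KB per sort; mainly helps CREATE INDEX after the load
    checkpoint_segments = 16       # fewer, larger checkpoints during the bulk COPY
    wal_buffers = 64               # more WAL buffering for sustained write bursts
    fsync = false                  # only if the load can simply be re-run after a crash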
Jose Vicente Nunez Z <josevnz@newbreak.com> writes:
> For that I'm using the PostgreSQL copy tool, but the problem is that it is
> taking almost 9 hours to load the data:

Hmm, 889 rows/second doesn't sound that bad, considering that you've given us no data about the hardware you're running on. In particular, what's your disk setup?

But at any rate, the only bottlenecks I could think of for a COPY command are:

1. Data conversion. This would depend on the column datatypes of the table, which you didn't tell us.

2. Index updating. This would depend on the indexes on the table, which you didn't tell us.

3. Foreign-key constraint checking. This would depend on the foreign keys the table has, which you didn't tell us.

In short ... you'd better provide a lot more detail if you want useful commentary.

> I apologize if this is not the proper place to post these questions.

pgsql-performance would be more appropriate, probably.

			regards, tom lane
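For bottlenecks 2 and 3, the usual workaround is to do the maintenance once after the COPY rather than once per row. A minimal sketch, using hypothetical table, index, and constraint names rather than the real schema:

    -- Sketch only: mytable, myindex, myfk and othertable are hypothetical names.
    DROP INDEX myindex;                                -- bottleneck 2: skip per-row index updates
    ALTER TABLE mytable DROP CONSTRAINT myfk;          -- bottleneck 3: skip per-row FK checks
    COPY mytable FROM '/postgres-system/datafile.txt' DELIMITERS '|';
    CREATE INDEX myindex ON mytable (a, b, c);         -- rebuild the index in one pass
    ALTER TABLE mytable ADD CONSTRAINT myfk
        FOREIGN KEY (f) REFERENCES othertable (id);    -- re-check the FK in one pass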
Tom,

On Tue, 2003-09-16 at 10:44, Tom Lane wrote:
> Jose Vicente Nunez Z <josevnz@newbreak.com> writes:
> > For that I'm using the PostgreSQL copy tool, but the problem is that it is
> > taking almost 9 hours to load the data:
>
> Hmm, 889 rows/second doesn't sound that bad, considering that you've
> given us no data about the hardware you're running on. In particular,
> what's your disk setup?

My fault, I should have provided more information about my setup.

I have two ATA disks (one for PostgreSQL, the other for the data files). The box is a Pentium III 1.3GHz, dual CPU, with 2 GB of RAM. One disk is 30 GB, the other is 40 GB.

As you can see it is not state-of-the-art hardware, but for development purposes I should get decent performance (only one or two applications will be accessing the server to do some data mining).

> But at any rate, the only bottlenecks I could think of for a COPY
> command are:
>
> 1. Data conversion. This would depend on the column datatypes of the
> table, which you didn't tell us.

create table mytable
(
    a varchar(20) not null,
    b varchar(20) not null,
    c varchar(20) not null,
    d char(6),
    f int null,
    g float not null,
    h float not null
)

Data conversions shouldn't be that complicated, should they?

> 2. Index updating. This would depend on the indexes on the table,
> which you didn't tell us.

I had one index and dropped it for the load. I also turned off the 'fsync' flag (fsync=false) and now the load time is only 15 minutes! But then recreating the original index takes forever:

create index myindex on mytable ( a, b, c );

For some reason I don't see much I/O but a lot of CPU usage:

 11:18:05 up 3 days, 20:55, 3 users, load average: 1.00, 1.00, 0.92
51 processes: 49 sleeping, 2 running, 0 zombie, 0 stopped
CPU0 states: 96.0% user  3.0% system  0.0% nice  0.0% iowait  0.0% idle
CPU1 states:  0.1% user  0.0% system  0.0% nice  0.0% iowait 99.0% idle
Mem:  2064404k av, 2049312k used,   15092k free,       0k shrd,  149544k buff
                    406064k actv, 1076876k in_d,  345612k in_c
Swap: 2047744k av,    4252k used, 2043492k free                 1809160k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 1882 postgres  25   0 41492  40M 36080 R    98.6  2.0  87:34   0 postmaster
16998 root      15   0     0    0     0 SW    1.9  0.0   4:30   0 kjournald
 1956 root      19   0  1152 1152   848 R     0.9  0.0   0:00   1 top
    1 root      15   0   476  448   420 S     0.0  0.0   0:07   1 init
    2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0
    3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1
    4 root      15   0     0    0     0 SW    0.0  0.0   0:00   1 keventd
    5 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd_CPU0
    6 root      34  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd_CPU1
   11 root      19   0     0    0     0 SW    0.0  0.0   0:00   0 bdflush
    7 root      15   0     0    0     0 SW    0.0  0.0   0:22   0 kswapd
    8 root      15   0     0    0     0 SW    0.0  0.0   0:00   1 kscand/DMA
    9 root      15   0     0    0     0 SW    0.0  0.0   1:57   0 kscand/Normal
   10 root      15   0     0    0     0 SW    0.0  0.0   2:28   0 kscand/HighMem
   12 root      15   0     0    0     0 SW    0.0  0.0   0:04   0 kupdated

Also memory usage seems to be OK; I don't see any process blocked or waiting for resources, nor swapped out due to lack of memory:

[root@linux0105 data]# vmstat 10 10
   procs                      memory      swap          io     system         cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs  us sy id
 1  0  0   4252  15092 149564 1809160    0    0    60    43    1    16   2  1 33
 1  0  0   4252  15092 149564 1809160    0    0     0  1820  134    12  49  1 50
 1  0  0   4252  15092 149564 1809160    0    0     0  1282  125     8  49  1 50
 1  0  0   4252  15092 149568 1809160    0    0     0  1202  124     9  50  1 50
 1  0  0   4252  15092 149572 1809160    0    0     0  1216  123     9  50  1 50
 1  0  0   4252  15092 149576 1809160    0    0     0  1224  125     8  50  0 50
 1  0  0   4252  15092 149580 1809160    0    0     0  1212  124    10  50  1 50
 1  0  0   4252  15092 149584 1809160    0    0     0  1206  123     8  50  1 50
 1  0  0   4252  15092 149588 1809160    0    0     0  1212  125     8  49  1 50
 1  0  0   4252  15092 149592 1809160    0    0     0  1208  125     9  49  1 50

Any ideas about what I can do to speed up the index creation? Why is it using so much CPU? (It doesn't seem to be writing much information to disk, though.)

> 3. Foreign-key constraint checking. This would depend on the foreign
> keys the table has, which you didn't tell us.

None so far, it is just one table.

> In short ... you'd better provide a lot more detail if you want
> useful commentary.
>
> > I apologize if this is not the proper place to post these questions.
>
> pgsql-performance would be more appropriate, probably.

I'm reading some documentation (checking the list archives):

http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html
http://www.varlena.com/varlena/GeneralBits/Tidbits/annotated_conf_e.html

Hopefully this will help me fix the problem.

Thanks for the help,

JV.

> 			regards, tom lane
--
Jose Vicente Nunez Zuleta (josevnz at newbreak dot com)
Newbreak LLC System Administrator
http://www.newbreak.com
RHCE, SCJD, SCJP
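Since the index build here is sort-heavy and CPU-bound, one lever worth noting is a per-session bump of the 7.3-era sort_mem setting before the CREATE INDEX. A minimal sketch, with an illustrative value:

    -- Sketch: raise sort_mem for this session only (value in KB; 131072 = 128MB is illustrative)
    SET sort_mem = 131072;
    CREATE INDEX myindex ON mytable (a, b, c);
    RESET sort_mem;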
Jose Vicente Nunez Z <josevnz@newbreak.com> writes:
> create table mytable
> (
>     a varchar(20) not null,
>     b varchar(20) not null,
>     c varchar(20) not null,
>     d char(6),
>     f int null,
>     g float not null,
>     h float not null
> )

> recreating the original index takes forever:
> create index myindex on mytable ( a, b, c );

> For some reason I don't see much I/O but a lot of CPU usage:

Hm. I'll bet that you are using a non-C locale, such that varchar comparisons depend on strcoll() instead of strcmp(). strcoll can be incredibly slow in some locales.

Do you really need non-ASCII sort order? If not, I'll bet that re-initdb'ing in C locale will make a difference.

			regards, tom lane
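What checking and switching the cluster locale could look like, as a rough sketch only: the data-directory path is an assumption, and re-running initdb destroys the existing cluster, so everything has to be dumped first and reloaded afterwards.

    # Sketch only; the paths are assumptions and the postmaster must be stopped before initdb.
    pg_controldata /var/lib/pgsql/data          # reports the LC_COLLATE the cluster was initdb'd with
    pg_dumpall > /postgres-system/dumpall.sql   # save everything before wiping the cluster
    # stop the postmaster, move the old data directory aside, then:
    initdb --locale=C -D /var/lib/pgsql/data
    # start the postmaster again and reload the dump:
    psql -U postgres -f /postgres-system/dumpall.sql template1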
On 16 Sep 2003, Jose Vicente Nunez Z wrote:

> Tom,
>
> On Tue, 2003-09-16 at 10:44, Tom Lane wrote:
> > Jose Vicente Nunez Z <josevnz@newbreak.com> writes:
> > > For that I'm using the PostgreSQL copy tool, but the problem is that it is
> > > taking almost 9 hours to load the data:
> >
> > Hmm, 889 rows/second doesn't sound that bad, considering that you've
> > given us no data about the hardware you're running on. In particular,
> > what's your disk setup?
>
> My fault, I should have provided more information about my setup.
>
> I have two ATA disks (one for PostgreSQL, the other for the data files).
> The box is a Pentium III 1.3GHz, dual CPU, with 2 GB of RAM. One disk is
> 30 GB, the other is 40 GB.
>
> As you can see it is not state-of-the-art hardware, but for development
> purposes I should get decent performance (only one or two applications
> will be accessing the server to do some data mining).
>
> > But at any rate, the only bottlenecks I could think of for a COPY
> > command are:
> >
> > 1. Data conversion. This would depend on the column datatypes of the
> > table, which you didn't tell us.
>
> create table mytable
> (
>     a varchar(20) not null,
>     b varchar(20) not null,
>     c varchar(20) not null,
>     d char(6),
>     f int null,
>     g float not null,
>     h float not null
> )
>
> Data conversions shouldn't be that complicated, should they?
>
> > 2. Index updating. This would depend on the indexes on the table,
> > which you didn't tell us.
>
> I had one index and dropped it for the load. I also turned off the 'fsync'
> flag (fsync=false) and now the load time is only 15 minutes! But then
> recreating the original index takes forever:
>
> create index myindex on mytable ( a, b, c );

Are you sure this is the type of index you want? Keep in mind postgresql won't use this index for queries like:

select * from mytable where b='abc';

or

select * from mytable where a like '%abc%';

but only for queries that reference a, then b, then c, and only with left-anchored text. A query like

select * from mytable where a='abc' and b='def';

will work.

You might want to look at using the tsearch engine just updated for 7.4.x for this kind of thing, or just running seq scans with a whole lot of shared_buffers to hold your data. Cranking shared_buffers up to 1000 to 20000 or so should help. Note that there's a point of diminishing returns for certain queries with large shared_buffers, where buffer management costs more than having the data in the buffer. For what you're doing, however, larger may well be better.

Oh, and don't forget to vacuum and analyze...
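A small sketch tying the last two suggestions together, reusing the mytable/myindex names from earlier in the thread; whether the planner actually chooses the index still depends on the statistics that ANALYZE collects:

    -- In postgresql.conf (7.3-era): shared_buffers = 20000    (8KB pages, roughly 160MB)
    -- After the bulk load, refresh planner statistics, then check the plans:
    VACUUM ANALYZE mytable;
    EXPLAIN SELECT * FROM mytable WHERE a = 'abc' AND b = 'def';  -- can use the (a, b, c) index
    EXPLAIN SELECT * FROM mytable WHERE b = 'def';                -- cannot use that index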