Re: Improve BULK insertion - Mailing list pgsql-performance
From | Gregory S. Williamson |
---|---|
Subject | Re: Improve BULK insertion |
Date | |
Msg-id | 71E37EF6B7DCC1499CEA0316A256832801D4BBDF@loki.wc.globexplorer.net Whole thread Raw |
In response to | Improve BULK insertion (Grupos <grupos@carvalhaes.net>) |
List | pgsql-performance |
Rodrigo -- You should definitely drop the indexes and any other FK constraints before loading and then rebuild them. Check your logsand see if there are warnings about checkpoint intervals -- only 3 logs seems like it might be small; if you have thedisk space I would definitely consider raising the number. If you haven't already posted your config settings you mightdo so -- this seems very slow. I regularly use COPY to load or unload data sets in the 200k-900k range and they don'ttake 90 minutes, even on slower hardware (and usually only a few minutes on our production servers; rebuilding the indexesusually takes longer. This unloading a 300k+ row data set on a dell linux box with not very good disks and 1 gig of RAM: Starting copy of parcel staging table parcels_12031 at Thu Dec 2 01:13:52 2004 Done with staging table copy at Thu Dec 2 01:15:16 2004 ... Starting compression of parcel file at Thu Dec 2 01:15:22 2004 gzip: /tmp/parcels_12031.unl.gz already exists; do you wish to overwrite (y or n )? y Done with compression of parcel file at Thu Dec 2 01:17:23 2004 ... And loading them on a rather faster server: Starting unzip of parcels at Thu Dec 2 01:29:15 2004 Finished with unzip at Thu Dec 2 01:29:22 2004 ... Target db detail table updated at Thu Dec 2 01:29:29 2004 Dropping indexes Dropping fk constraint on tracking id Dropping indexes Done dropping indexes on target parcels table at Thu Dec 2 01:29:30 2004 NOTICE: drop cascades to table f12031.parcel_pins NOTICE: drop cascades to table f12031.parcel_addresses NOTICE: drop cascades to table f12031.parcel_owner_fti NOTICE: drop cascades to table f12031.parcel_owners Removing old parcels entries starting at Thu Dec 2 01:29:30 2004 Done deleting schema and parcels for track_id 10163541 at Thu Dec 2 01:33:04 2004 Starting load of parcels at Thu Dec 2 01:33:04 2004 Done copying data into parcels at Thu Dec 2 01:35:18 2004 Deleting old v_detail reference for track_id 10163541 Done with delete of old v_detail reference Starting creation of foreign key constraint at Thu Dec 2 01:39:43 2004 Done with creation of foreign key constraint at Thu Dec 2 01:42:14 2004 Starting spatial index create at Thu Dec 2 01:42:14 2004 Done creating spatial index at Thu Dec 2 01:55:04 2004 Starting stats on geometry column now Done doing stats for spatial index at Thu Dec 2 02:03:47 2004 Starting index on PIN now Done creating pin index at Thu Dec 2 02:09:36 2004 Starting index on tracking id now Done creating trid index at Thu Dec 2 02:12:35 2004 Starting centroid index now Done creating centroid index at Thu Dec 2 02:24:11 2004 Starting stats on centroid column Done doing stats for spatial index at Thu Dec 2 02:29:55 2004 Doing City/Street Index on parcels table ...Done creating city/street index at Thu Dec 2 02:42:41 2004 with result <-1> Committing changes So this took about 70 minutes to delete 200000+ rows from a table with about 5 million rows, load a new set and reindex them(and do some statistics for spatial geometry). If the table had only this data the indexing would have been *much* faster.These are moderate size columns -- about 2 dozen columns and some spatial data (polygon and point). Both servers haverather more log files than your setup, but I am not familiar enough with postgres to know how much of an impact thatalone will have. The comment about it slowing down part way through a load makes me suspect indexing issues, somehow(not from postgres experience but it rings a bell with other DBs); if you explicitly drop the indexes first and thenload does it show the same performance behavior ? If you are doing the data read from, the database write and the WAL logging all on single disk drive, then I would guessthat that is your bottleneck. If you use vmstat and/or top or the like, is your I/O pegged ? HTH Greg WIlliamson DBA GlobeXplorer LLC -----Original Message----- From: Rodrigo Carvalhaes [mailto:grupos@carvalhaes.net] Sent: Sun 12/5/2004 11:52 AM To: Christopher Browne Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] Improve BULK insertion Hi! 1. I am doing the inserts using pg_restore. The dump was created using pg_dump and the standard format (copy statements) 2. See below the table schema. There are only 7 indexes. 3. My transaction log configuration are : checkpoint_segments = 3 and checkpoint_timeout = 300 and my transaction logs are on the same disk . I know that I can increase the performance separating the transaction logs and making a RAID 5 array BUT I am really curious about WHY this performance is so poor and HOW can I try to improve on this actual machine because actualy this inserts are taking around 90 minutes!!! Cheers! Rodrigo dadosadv=# \d si2010 Table "public.si2010" Column | Type | Modifiers ------------+------------------+--------------------------------------------------------------------- i2_filial | character(2) | not null default ' '::bpchar i2_num | character(10) | not null default ' '::bpchar i2_linha | character(2) | not null default ' '::bpchar i2_data | character(8) | not null default ' '::bpchar i2_dc | character(1) | not null default ' '::bpchar i2_debito | character(20) | not null default ' '::bpchar i2_dcd | character(1) | not null default ' '::bpchar i2_credito | character(20) | not null default ' '::bpchar i2_dcc | character(1) | not null default ' '::bpchar i2_moedas | character(5) | not null default ' '::bpchar i2_valor | double precision | not null default 0.0 i2_hp | character(3) | not null default ' '::bpchar i2_hist | character(40) | not null default ' '::bpchar i2_ccd | character(9) | not null default ' '::bpchar i2_ccc | character(9) | not null default ' '::bpchar i2_ativdeb | character(6) | not null default ' '::bpchar i2_ativcrd | character(6) | not null default ' '::bpchar i2_vlmoed2 | double precision | not null default 0.0 i2_vlmoed3 | double precision | not null default 0.0 i2_vlmoed4 | double precision | not null default 0.0 i2_vlmoed5 | double precision | not null default 0.0 i2_dtvenc | character(8) | not null default ' '::bpchar i2_criter | character(4) | not null default ' '::bpchar i2_rotina | character(8) | not null default ' '::bpchar i2_periodo | character(6) | not null default ' '::bpchar i2_listado | character(1) | not null default ' '::bpchar i2_origem | character(40) | not null default ' '::bpchar i2_permat | character(4) | not null default ' '::bpchar i2_filorig | character(2) | not null default ' '::bpchar i2_intercp | character(1) | not null default ' '::bpchar i2_identcp | character(12) | not null default ' '::bpchar i2_lote | character(4) | not null default ' '::bpchar i2_doc | character(6) | not null default ' '::bpchar i2_emporig | character(2) | not null default ' '::bpchar i2_lp | character(3) | not null default ' '::bpchar i2_itemd | character(9) | not null default ' '::bpchar i2_itemc | character(9) | not null default ' '::bpchar i2_prelan | character(1) | not null default ' '::bpchar i2_tipo | character(2) | not null default ' '::bpchar i2_dcc | character(1) | not null default ' '::bpchar i2_moedas | character(5) | not null default ' '::bpchar i2_valor | double precision | not null default 0.0 i2_hp | character(3) | not null default ' '::bpchar i2_hist | character(40) | not null default ' '::bpchar i2_ccd | character(9) | not null default ' '::bpchar i2_ccc | character(9) | not null default ' '::bpchar i2_ativdeb | character(6) | not null default ' '::bpchar i2_ativcrd | character(6) | not null default ' '::bpchar i2_vlmoed2 | double precision | not null default 0.0 i2_vlmoed3 | double precision | not null default 0.0 i2_vlmoed4 | double precision | not null default 0.0 i2_vlmoed5 | double precision | not null default 0.0 i2_dtvenc | character(8) | not null default ' '::bpchar i2_criter | character(4) | not null default ' '::bpchar i2_rotina | character(8) | not null default ' '::bpchar i2_periodo | character(6) | not null default ' '::bpchar i2_listado | character(1) | not null default ' '::bpchar i2_origem | character(40) | not null default ' '::bpchar i2_permat | character(4) | not null default ' '::bpchar i2_filorig | character(2) | not null default ' '::bpchar i2_intercp | character(1) | not null default ' '::bpchar i2_identcp | character(12) | not null default ' '::bpchar i2_lote | character(4) | not null default ' '::bpchar i2_doc | character(6) | not null default ' '::bpchar i2_emporig | character(2) | not null default ' '::bpchar i2_lp | character(3) | not null default ' '::bpchar i2_itemd | character(9) | not null default ' '::bpchar i2_itemc | character(9) | not null default ' '::bpchar i2_prelan | character(1) | not null default ' '::bpchar i2_tipo | character(2) | not null default ' '::bpchar d_e_l_e_t_ | character(1) | not null default ' '::bpchar r_e_c_n_o_ | double precision | not null default 0.0 Indexes: "si2010_pkey" primary key, btree (r_e_c_n_o_) "si20101" btree (i2_filial, i2_num, i2_linha, i2_periodo, r_e_c_n_o_, d_e_l_e_t_) "si20102" btree (i2_filial, i2_periodo, i2_num, i2_linha, r_e_c_n_o_, d_e_l_e_t_) "si20103" btree (i2_filial, i2_data, i2_num, i2_linha, r_e_c_n_o_, d_e_l_e_t_) "si20104" btree (i2_filial, i2_debito, i2_data, i2_num, i2_linha, r_e_c_n_o_, d_e_l_e_t_) "si20105" btree (i2_filial, i2_credito, i2_data, i2_num, i2_linha, r_e_c_n_o_, d_e_l_e_t_) "si20106" btree (i2_filial, i2_doc, i2_periodo, r_e_c_n_o_, d_e_l_e_t_) "si20107" btree (i2_filial, i2_origem, r_e_c_n_o_, d_e_l_e_t_) Christopher Browne wrote: >In the last exciting episode, grupos@carvalhaes.net (Grupos) wrote: > > >>Hi ! >> >>I need to insert 500.000 records on a table frequently. It´s a bulk >>insertion from my applicatoin. >>I am with a very poor performance. PostgreSQL insert very fast until >>the tuple 200.000 and after it the insertion starts to be really slow. >>I am seeing on the log and there is a lot of transaction logs, >>something like : >> >>2004-12-04 11:08:59 LOG: recycled transaction log file "0000000600000012" >>2004-12-04 11:08:59 LOG: recycled transaction log file "0000000600000013" >>2004-12-04 11:08:59 LOG: recycled transaction log file "0000000600000011" >>2004-12-04 11:14:04 LOG: recycled transaction log file "0000000600000015" >>2004-12-04 11:14:04 LOG: recycled transaction log file "0000000600000014" >>2004-12-04 11:19:08 LOG: recycled transaction log file "0000000600000016" >>2004-12-04 11:19:08 LOG: recycled transaction log file "0000000600000017" >>2004-12-04 11:24:10 LOG: recycled transaction log file "0000000600000018" >> >> > >It is entirely normal for there to be a lot of transaction log file >recycling when bulk inserts are taking place; that goes through a lot >of transaction logs. > > > >>How can I configure PostgreSQL to have a better performance on this >>bulk insertions ? I already increased the memory values. >> >> > >Memory is, as likely as not, NOT the issue. > >Two questions: > > 1. How are you doing the inserts? Via INSERT statements? Or > via COPY statements? What sort of transaction grouping > is involved? > > COPY is way faster than INSERT, and grouping plenty of updates > into a single transaction is generally a "win." > > 2. What is the schema like? Does the table have a foreign key > constraint? Does it have a bunch of indices? > > If there should eventually be lots of indices, it tends to be > faster to create the table with none/minimal indices, and add > indexes afterwards, as long as your "load" process can be trusted > to not break "unique" constraints... > > If there is some secondary table with a foreign key constraint, > and _that_ table is growing, it is possible that a sequential > scan is being used to search the secondary table where, if you > did an ANALYZE on that table, an index scan would be preferred > once it grew to larger size... > >There isn't a particular reason for PostgreSQL to "hit a wall" upon >seeing 200K records; I and coworkers routinely load database dumps >that have millions of (sometimes pretty fat) records, and they don't >"choke." That's true whether talking about loading things onto my >(somewhat wimpy) desktop PC, or a SMP Xeon system with a small RAID >array, or higher end stuff involving high end SMP and EMC disk arrays. >The latter obviously being orders of magnitude faster than desktop >equipment :-). > > ---------------------------(end of broadcast)--------------------------- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
pgsql-performance by date: