Re: COPY FROM performance improvements - Mailing list pgsql-patches
From: Andrew Dunstan
Subject: Re: COPY FROM performance improvements
Msg-id: 42DD7C10.4090008@dunslane.net
In response to: Re: COPY FROM performance improvements (Mark Wong <markw@osdl.org>)
List: pgsql-patches
Mark,

You should definitely not be doing this sort of thing, I believe:

CREATE TABLE orders (
    o_orderkey INTEGER,
    o_custkey INTEGER,
    o_orderstatus CHAR(1),
    o_totalprice REAL,
    o_orderDATE DATE,
    o_orderpriority CHAR(15),
    o_clerk CHAR(15),
    o_shippriority INTEGER,
    o_comment VARCHAR(79),
    PRIMARY KEY (o_orderkey))

Create the table with no constraints, load the data, then set up primary
keys and whatever other constraints you want using ALTER TABLE. Last time
I did a load like this (albeit 2 orders of magnitude smaller) I saw a 50%
speedup from deferring constraint creation.

cheers

andrew

Mark Wong wrote:

>Hi Alon,
>
>Yeah, that helps. I just need to break up my scripts a little to just
>load the data and not build indexes.
>
>Is the following information good enough to give a guess about the data
>I'm loading, if you don't mind? ;) Here's a link to my script to create
>tables:
>http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432&path=scripts/pgsql/create_tables.sh.in
>
>File sizes:
>-rw-r--r--  1 markw  50  2.3G Jul  8 15:03 customer.tbl
>-rw-r--r--  1 markw  50   74G Jul  8 15:03 lineitem.tbl
>-rw-r--r--  1 markw  50  2.1K Jul  8 15:03 nation.tbl
>-rw-r--r--  1 markw  50   17G Jul  8 15:03 orders.tbl
>-rw-r--r--  1 markw  50  2.3G Jul  8 15:03 part.tbl
>-rw-r--r--  1 markw  50   12G Jul  8 15:03 partsupp.tbl
>-rw-r--r--  1 markw  50   391 Jul  8 15:03 region.tbl
>-rw-r--r--  1 markw  50  136M Jul  8 15:03 supplier.tbl
>
>Number of rows:
># wc -l *.tbl
>   15000000 customer.tbl
>  600037902 lineitem.tbl
>         25 nation.tbl
>  150000000 orders.tbl
>   20000000 part.tbl
>   80000000 partsupp.tbl
>          5 region.tbl
>    1000000 supplier.tbl
>
>Thanks,
>Mark
>
>On Tue, 19 Jul 2005 14:05:56 -0700
>"Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
>
>>Hi Mark,
>>
>>I improved the data *parsing* capabilities of COPY, and didn't touch the
>>data conversion or data insertion parts of the code. The parsing
>>improvement will vary largely depending on the ratio of parsing -to-
>>converting and inserting.
>>
>>Therefore, the speed increase really depends on the nature of your data:
>>
>>A 100GB file with
>>long data rows (lots of parsing)
>>a small number of columns (a small number of attr conversions per row)
>>fewer rows (fewer tuple insertions)
>>
>>will show the best performance improvements.
>>
>>However, the same 100GB file with
>>short data rows (minimal parsing)
>>a large number of columns (a large number of attr conversions per row)
>>AND/OR
>>more rows (more tuple insertions)
>>
>>will show improvements, but not as significant.
>>In general I'd estimate a 40%-95% improvement in load speed for the 1st
>>case and 10%-40% for the 2nd. But that also depends on the hardware,
>>disk speed etc... This is for TEXT format. As for CSV, it may be faster,
>>but not as much as I specified here. BINARY will stay the same as before.
>>
>>HTH
>>Alon.
>>
>>On 7/19/05 12:54 PM, "Mark Wong" <markw@osdl.org> wrote:
>>
>>>On Thu, 14 Jul 2005 17:22:18 -0700
>>>"Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
>>>
>>>>I revisited my patch and removed the code duplications that were there,
>>>>and added support for CSV with buffered input, so CSV now runs faster
>>>>too (although it is not as optimized as the TEXT format parsing). So
>>>>now TEXT, CSV and BINARY are all parsed in CopyFrom(), like in the
>>>>original file.
>>>>
>>>Hi Alon,
>>>
>>>I'm curious, what kind of system are you testing this on? I'm trying to
>>>load 100GB of data in our dbt3 workload on a 4-way itanium2. I'm
>>>interested in the results you would expect.
>>>
>>>Mark
>
>---------------------------(end of broadcast)---------------------------
>TIP 3: Have you checked our extensive FAQ?
>
>               http://www.postgresql.org/docs/faq
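
For reference, a minimal sketch of the deferred-constraint load pattern
Andrew describes, using the orders table from Mark's schema. The file path
and the '|' delimiter are illustrative assumptions (TPC-H dbgen output is
conventionally pipe-delimited, though dbgen also emits a trailing delimiter
that may need to be stripped before COPY will accept the file):

-- 1. Create the table with no constraints, so no index is maintained
--    while rows are inserted.
CREATE TABLE orders (
    o_orderkey      INTEGER,
    o_custkey       INTEGER,
    o_orderstatus   CHAR(1),
    o_totalprice    REAL,
    o_orderDATE     DATE,
    o_orderpriority CHAR(15),
    o_clerk         CHAR(15),
    o_shippriority  INTEGER,
    o_comment       VARCHAR(79)
);

-- 2. Bulk-load the data; with no primary key yet, COPY does not pay
--    per-row index maintenance costs. (Hypothetical path.)
COPY orders FROM '/path/to/orders.tbl' WITH DELIMITER '|';

-- 3. Add the primary key afterwards, building its index in one pass.
ALTER TABLE orders ADD PRIMARY KEY (o_orderkey);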