Re: COPY FROM performance improvements - Mailing list pgsql-patches

From:           Mark Wong
Subject:        Re: COPY FROM performance improvements
Date:
Msg-id:         200507192251.j6JMpAjA016147@smtp.osdl.org
In response to: Re: COPY FROM performance improvements (Andrew Dunstan <andrew@dunslane.net>)
Responses:      Re: COPY FROM performance improvements
List:           pgsql-patches
Whoopsies, yeah good point about the PRIMARY KEY. I'll fix that.

Mark

On Tue, 19 Jul 2005 18:17:52 -0400
Andrew Dunstan <andrew@dunslane.net> wrote:

> Mark,
>
> You should definitely not be doing this sort of thing, I believe:
>
>     CREATE TABLE orders (
>         o_orderkey INTEGER,
>         o_custkey INTEGER,
>         o_orderstatus CHAR(1),
>         o_totalprice REAL,
>         o_orderDATE DATE,
>         o_orderpriority CHAR(15),
>         o_clerk CHAR(15),
>         o_shippriority INTEGER,
>         o_comment VARCHAR(79),
>         PRIMARY KEY (o_orderkey))
>
> Create the table with no constraints, load the data, then set up primary
> keys and whatever other constraints you want using ALTER TABLE. Last time
> I did a load like this (albeit 2 orders of magnitude smaller) I saw a 50%
> speedup from deferring constraint creation.
>
> cheers
>
> andrew
>
> Mark Wong wrote:
>
> > Hi Alon,
> >
> > Yeah, that helps. I just need to break up my scripts a little to just
> > load the data and not build indexes.
> >
> > Is the following information good enough to give a guess about the data
> > I'm loading, if you don't mind? ;)  Here's a link to my script to
> > create tables:
> > http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432&path=scripts/pgsql/create_tables.sh.in
> >
> > File sizes:
> > -rw-r--r--  1 markw  50  2.3G  Jul  8 15:03  customer.tbl
> > -rw-r--r--  1 markw  50   74G  Jul  8 15:03  lineitem.tbl
> > -rw-r--r--  1 markw  50  2.1K  Jul  8 15:03  nation.tbl
> > -rw-r--r--  1 markw  50   17G  Jul  8 15:03  orders.tbl
> > -rw-r--r--  1 markw  50  2.3G  Jul  8 15:03  part.tbl
> > -rw-r--r--  1 markw  50   12G  Jul  8 15:03  partsupp.tbl
> > -rw-r--r--  1 markw  50   391  Jul  8 15:03  region.tbl
> > -rw-r--r--  1 markw  50  136M  Jul  8 15:03  supplier.tbl
> >
> > Number of rows:
> > # wc -l *.tbl
> >    15000000 customer.tbl
> >   600037902 lineitem.tbl
> >          25 nation.tbl
> >   150000000 orders.tbl
> >    20000000 part.tbl
> >    80000000 partsupp.tbl
> >           5 region.tbl
> >     1000000 supplier.tbl
> >
> > Thanks,
> > Mark
> >
> > On Tue, 19 Jul 2005 14:05:56 -0700
> > "Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
> >
> >> Hi Mark,
> >>
> >> I improved the data *parsing* capabilities of COPY, and didn't touch
> >> the data conversion or data insertion parts of the code. The parsing
> >> improvement will vary largely depending on the ratio of parsing -to-
> >> converting and inserting.
> >>
> >> Therefore, the speed increase really depends on the nature of your
> >> data. A 100GB file with
> >> long data rows (lots of parsing),
> >> a small number of columns (few attr conversions per row), and
> >> fewer rows (fewer tuple insertions)
> >> will show the best performance improvements.
> >>
> >> However, the same 100GB file with
> >> short data rows (minimal parsing),
> >> a large number of columns (many attr conversions per row), and/or
> >> more rows (more tuple insertions)
> >> will show improvements, but not as significant.
> >>
> >> In general I'd estimate a 40%-95% improvement in load speed for the
> >> 1st case and 10%-40% for the 2nd. But that also depends on the
> >> hardware, disk speed etc... This is for TEXT format. As for CSV, it
> >> may be faster but not as much as I specified here. BINARY will stay
> >> the same as before.
> >>
> >> HTH
> >> Alon.
> >>
> >> On 7/19/05 12:54 PM, "Mark Wong" <markw@osdl.org> wrote:
> >>
> >>> On Thu, 14 Jul 2005 17:22:18 -0700
> >>> "Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
> >>>
> >>>> I revisited my patch and removed the code duplications that were
> >>>> there, and added support for CSV with buffered input, so CSV now
> >>>> runs faster too (although it is not as optimized as the TEXT format
> >>>> parsing). So now TEXT, CSV and BINARY are all parsed in CopyFrom(),
> >>>> like in the original file.
> >>>
> >>> Hi Alon,
> >>>
> >>> I'm curious, what kind of system are you testing this on? I'm trying
> >>> to load 100GB of data in our dbt3 workload on a 4-way itanium2. I'm
> >>> interested in the results you would expect.
> >>>
> >>> Mark
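
For readers landing on this thread: the load-then-constrain pattern Andrew
describes above can be sketched in a few statements. This is a minimal
illustration, not code from the thread; the file path and the '|' delimiter
are assumptions about how the dbt3 .tbl files are laid out.

    -- Create the table with no constraints, so the bulk load has no
    -- index to maintain per row.
    CREATE TABLE orders (
        o_orderkey      INTEGER,
        o_custkey       INTEGER,
        o_orderstatus   CHAR(1),
        o_totalprice    REAL,
        o_orderdate     DATE,
        o_orderpriority CHAR(15),
        o_clerk         CHAR(15),
        o_shippriority  INTEGER,
        o_comment       VARCHAR(79)
    );                              -- note: no PRIMARY KEY yet

    -- Bulk load; this is the COPY path whose parsing Alon's patch speeds up.
    COPY orders FROM '/data/orders.tbl' WITH DELIMITER '|';

    -- Add the constraint afterwards: the unique index is built in one
    -- pass over the loaded data instead of being updated row by row.
    ALTER TABLE orders ADD PRIMARY KEY (o_orderkey);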
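
For context on the TEXT/CSV/BINARY distinction Alon draws, these are the
three COPY input modes in the 8.x-era syntax; the file names here are
placeholders. Per Alon's description, the default TEXT format gets the full
parsing speedup, CSV a smaller one, and BINARY is unchanged.

    COPY lineitem FROM '/data/lineitem.tbl' WITH DELIMITER '|';  -- TEXT (default)
    COPY lineitem FROM '/data/lineitem.csv' WITH CSV;            -- CSV
    COPY lineitem FROM '/data/lineitem.bin' WITH BINARY;         -- BINARY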