Re: COPY FROM performance improvements - Mailing list pgsql-patches

From:           Mark Wong
Subject:        Re: COPY FROM performance improvements
Date:
Msg-id:         200507192251.j6JMpAjA016147@smtp.osdl.org
In response to: Re: COPY FROM performance improvements (Andrew Dunstan <andrew@dunslane.net>)
Responses:      Re: COPY FROM performance improvements
List:           pgsql-patches
Whoopsies, yeah good point about the PRIMARY KEY. I'll fix that.

Mark

On Tue, 19 Jul 2005 18:17:52 -0400
Andrew Dunstan <andrew@dunslane.net> wrote:

> Mark,
>
> You should definitely not be doing this sort of thing, I believe:
>
>     CREATE TABLE orders (
>         o_orderkey INTEGER,
>         o_custkey INTEGER,
>         o_orderstatus CHAR(1),
>         o_totalprice REAL,
>         o_orderDATE DATE,
>         o_orderpriority CHAR(15),
>         o_clerk CHAR(15),
>         o_shippriority INTEGER,
>         o_comment VARCHAR(79),
>         PRIMARY KEY (o_orderkey))
>
> Create the table with no constraints, load the data, then set up primary
> keys and whatever other constraints you want using ALTER TABLE. Last time
> I did a load like this (albeit 2 orders of magnitude smaller) I saw a 50%
> speedup from deferring constraint creation.
>
> cheers
>
> andrew
>
> Mark Wong wrote:
>
> > Hi Alon,
> >
> > Yeah, that helps. I just need to break up my scripts a little to just
> > load the data and not build indexes.
> >
> > Is the following information good enough to give a guess about the data
> > I'm loading, if you don't mind? ;)  Here's a link to my script to
> > create tables:
> > http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432&path=scripts/pgsql/create_tables.sh.in
> >
> > File sizes:
> > -rw-r--r--  1 markw  50  2.3G  Jul  8 15:03  customer.tbl
> > -rw-r--r--  1 markw  50   74G  Jul  8 15:03  lineitem.tbl
> > -rw-r--r--  1 markw  50  2.1K  Jul  8 15:03  nation.tbl
> > -rw-r--r--  1 markw  50   17G  Jul  8 15:03  orders.tbl
> > -rw-r--r--  1 markw  50  2.3G  Jul  8 15:03  part.tbl
> > -rw-r--r--  1 markw  50   12G  Jul  8 15:03  partsupp.tbl
> > -rw-r--r--  1 markw  50   391  Jul  8 15:03  region.tbl
> > -rw-r--r--  1 markw  50  136M  Jul  8 15:03  supplier.tbl
> >
> > Number of rows:
> > # wc -l *.tbl
> >    15000000 customer.tbl
> >   600037902 lineitem.tbl
> >          25 nation.tbl
> >   150000000 orders.tbl
> >    20000000 part.tbl
> >    80000000 partsupp.tbl
> >           5 region.tbl
> >     1000000 supplier.tbl
> >
> > Thanks,
> > Mark
> >
> > On Tue, 19 Jul 2005 14:05:56 -0700
> > "Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
> >
> >> Hi Mark,
> >>
> >> I improved the data *parsing* capabilities of COPY, and didn't touch
> >> the data conversion or data insertion parts of the code. The parsing
> >> improvement will vary largely depending on the ratio of parsing -to-
> >> converting and inserting.
> >>
> >> Therefore, the speed increase really depends on the nature of your
> >> data. A 100GB file with
> >> long data rows (lots of parsing),
> >> a small number of columns (few attr conversions per row), and
> >> fewer rows (fewer tuple insertions)
> >> will show the best performance improvements.
> >>
> >> However, the same 100GB file with
> >> short data rows (minimal parsing),
> >> a large number of columns (many attr conversions per row), and/or
> >> more rows (more tuple insertions)
> >> will show improvements, but not as significant.
> >>
> >> In general I'd estimate a 40%-95% improvement in load speed for the
> >> 1st case and 10%-40% for the 2nd. But that also depends on the
> >> hardware, disk speed etc... This is for TEXT format. As for CSV, it
> >> may be faster but not as much as I specified here. BINARY will stay
> >> the same as before.
> >>
> >> HTH
> >> Alon.
> >>
> >> On 7/19/05 12:54 PM, "Mark Wong" <markw@osdl.org> wrote:
> >>
> >>> On Thu, 14 Jul 2005 17:22:18 -0700
> >>> "Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
> >>>
> >>>> I revisited my patch and removed the code duplications that were
> >>>> there, and added support for CSV with buffered input, so CSV now
> >>>> runs faster too (although it is not as optimized as the TEXT format
> >>>> parsing). So now TEXT, CSV and BINARY are all parsed in CopyFrom(),
> >>>> like in the original file.
> >>>
> >>> Hi Alon,
> >>>
> >>> I'm curious, what kind of system are you testing this on? I'm trying
> >>> to load 100GB of data in our dbt3 workload on a 4-way itanium2. I'm
> >>> interested in the results you would expect.
> >>>
> >>> Mark
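
For readers landing on this thread: the load-then-constrain pattern Andrew
describes above can be sketched in a few statements. This is a minimal
illustration, not code from the thread; the file path and the '|' delimiter
are assumptions about how the dbt3 .tbl files are laid out.

    -- Create the table with no constraints, so the bulk load has no
    -- index to maintain per row.
    CREATE TABLE orders (
        o_orderkey      INTEGER,
        o_custkey       INTEGER,
        o_orderstatus   CHAR(1),
        o_totalprice    REAL,
        o_orderdate     DATE,
        o_orderpriority CHAR(15),
        o_clerk         CHAR(15),
        o_shippriority  INTEGER,
        o_comment       VARCHAR(79)
    );                              -- note: no PRIMARY KEY yet

    -- Bulk load; this is the COPY path whose parsing Alon's patch speeds up.
    COPY orders FROM '/data/orders.tbl' WITH DELIMITER '|';

    -- Add the constraint afterwards: the unique index is built in one
    -- pass over the loaded data instead of being updated row by row.
    ALTER TABLE orders ADD PRIMARY KEY (o_orderkey);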
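
For context on the TEXT/CSV/BINARY distinction Alon draws, these are the
three COPY input modes in the 8.x-era syntax; the file names here are
placeholders. Per Alon's description, the default TEXT format gets the full
parsing speedup, CSV a smaller one, and BINARY is unchanged.

    COPY lineitem FROM '/data/lineitem.tbl' WITH DELIMITER '|';  -- TEXT (default)
    COPY lineitem FROM '/data/lineitem.csv' WITH CSV;            -- CSV
    COPY lineitem FROM '/data/lineitem.bin' WITH BINARY;         -- BINARY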