Home > mailing lists

Re: COPY FROM performance improvements - Mailing list pgsql-patches

From	Andrew Dunstan
Subject	Re: COPY FROM performance improvements
Date	July 19, 2005 19:18:16
Msg-id	42DD7C10.4090008@dunslane.net Whole thread Raw
In response to	Re: COPY FROM performance improvements (Mark Wong <markw@osdl.org>)
Responses	Re: COPY FROM performance improvements
List	pgsql-patches

Tree view

Mark,

You should definitely not be doing this sort of thing, I believe:

CREATE TABLE orders (
    o_orderkey INTEGER,
    o_custkey INTEGER,
    o_orderstatus CHAR(1),
    o_totalprice REAL,
    o_orderDATE DATE,
    o_orderpriority CHAR(15),
    o_clerk CHAR(15),
    o_shippriority INTEGER,
    o_comment VARCHAR(79),
    PRIMARY KEY (o_orderkey))

Create the table with no constraints, load the data, then set up primary keys and whatever other constraints you want
usingALTER TABLE. Last time I did a load like this (albeit 2 orders of magnitude smaller) I saw a 50% speedup from
deferringconstarint creation. 


cheers

andrew



Mark Wong wrote:

>Hi Alon,
>
>Yeah, that helps.  I just need to break up my scripts a little to just
>load the data and not build indexes.
>
>Is the following information good enough to give a guess about the data
>I'm loading, if you don't mind? ;)  Here's a link to my script to create
>tables:

>http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432&path=scripts/pgsql/create_tables.sh.in
>
>File sizes:
>-rw-r--r--  1 markw 50 2.3G Jul  8 15:03 customer.tbl
>-rw-r--r--  1 markw 50  74G Jul  8 15:03 lineitem.tbl
>-rw-r--r--  1 markw 50 2.1K Jul  8 15:03 nation.tbl
>-rw-r--r--  1 markw 50  17G Jul  8 15:03 orders.tbl
>-rw-r--r--  1 markw 50 2.3G Jul  8 15:03 part.tbl
>-rw-r--r--  1 markw 50  12G Jul  8 15:03 partsupp.tbl
>-rw-r--r--  1 markw 50  391 Jul  8 15:03 region.tbl
>-rw-r--r--  1 markw 50 136M Jul  8 15:03 supplier.tbl
>
>Number of rows:
># wc -l *.tbl
>    15000000 customer.tbl
>   600037902 lineitem.tbl
>          25 nation.tbl
>   150000000 orders.tbl
>    20000000 part.tbl
>    80000000 partsupp.tbl
>           5 region.tbl
>     1000000 supplier.tbl
>
>Thanks,
>Mark
>
>On Tue, 19 Jul 2005 14:05:56 -0700
>"Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
>
>
>
>>Hi Mark,
>>
>>I improved the data *parsing* capabilities of COPY, and didn't touch the
>>data conversion or data insertion parts of the code. The parsing improvement
>>will vary largely depending on the ratio of parsing -to- converting and
>>inserting.
>>
>>Therefore, the speed increase really depends on the nature of your data:
>>
>>100GB file with
>>long data rows (lots of parsing)
>>Small number of columns (small number of attr conversions per row)
>>less rows (less tuple insertions)
>>
>>Will show the best performance improvements.
>>
>>However, same file size 100GB with
>>Short data rows (minimal parsing)
>>large number of columns (large number of attr conversions per row)
>>AND/OR
>>more rows (more tuple insertions)
>>
>>Will show improvements but not as significant.
>>In general I'll estimate 40%-95% improvement in load speed for the 1st case
>>and 10%-40% for the 2nd. But that also depends on the hardware, disk speed
>>etc... This is for TEXT format. As for CSV, it may be faster but not as much
>>as I specified here. BINARY will stay the same as before.
>>
>>HTH
>>Alon.
>>
>>
>>
>>
>>
>>
>>On 7/19/05 12:54 PM, "Mark Wong" <markw@osdl.org> wrote:
>>
>>
>>
>>>On Thu, 14 Jul 2005 17:22:18 -0700
>>>"Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
>>>
>>>
>>>
>>>>I revisited my patch and removed the code duplications that were there, and
>>>>added support for CSV with buffered input, so CSV now runs faster too
>>>>(although it is not as optimized as the TEXT format parsing). So now
>>>>TEXT,CSV and BINARY are all parsed in CopyFrom(), like in the original file.
>>>>
>>>>
>>>Hi Alon,
>>>
>>>I'm curious, what kind of system are you testing this on?  I'm trying to
>>>load 100GB of data in our dbt3 workload on a 4-way itanium2.  I'm
>>>interested in the results you would expect.
>>>
>>>Mark
>>>
>>>
>>>
>
>---------------------------(end of broadcast)---------------------------
>TIP 3: Have you checked our extensive FAQ?
>
>               http://www.postgresql.org/docs/faq
>
>
>

pgsql-patches by date:

From: "Alon Goldshuv"
Date: 19 July 2005, 19:06:47
Subject: Re: COPY FROM performance improvements

From: Mark Wong
Date: 19 July 2005, 19:51:23
Subject: Re: COPY FROM performance improvements

Re: COPY FROM performance improvements - Mailing list pgsql-patches

Previous

Next