Re: COPY FROM performance improvements - Mailing list pgsql-patches

From: Mark Wong
Subject: Re: COPY FROM performance improvements
Date: 2005-07-19 21:37
Msg-id: 200507192137.j6JLbZjA011011@smtp.osdl.org
In response to: Re: COPY FROM performance improvements ("Alon Goldshuv" <agoldshuv@greenplum.com>)
Responses: Re: COPY FROM performance improvements ("Alon Goldshuv" <agoldshuv@greenplum.com>)
           Re: COPY FROM performance improvements (Andrew Dunstan <andrew@dunslane.net>)
List: pgsql-patches

Hi Alon,

Yeah, that helps.  I just need to break up my scripts a little so they
load the data without building the indexes.
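
Something like this is what I have in mind (just a sketch; the database
name, file paths, and the index are placeholders, and the dbgen .tbl
files are pipe-delimited):

# load phase: COPY the flat files only, no index builds yet
psql -d dbt3 -c "COPY lineitem FROM '/tmp/dbt3/lineitem.tbl' WITH DELIMITER '|';"
psql -d dbt3 -c "COPY orders FROM '/tmp/dbt3/orders.tbl' WITH DELIMITER '|';"
# ... and the same for the other six tables ...

# index phase: run separately, after the timed load
psql -d dbt3 -c "CREATE INDEX i_l_orderkey ON lineitem (l_orderkey);"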

Is the following information enough for you to take a guess at the data
I'm loading, if you don't mind? ;)  Here's a link to the script that
creates the tables:

http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432&path=scripts/pgsql/create_tables.sh.in

File sizes:
-rw-r--r--  1 markw 50 2.3G Jul  8 15:03 customer.tbl
-rw-r--r--  1 markw 50  74G Jul  8 15:03 lineitem.tbl
-rw-r--r--  1 markw 50 2.1K Jul  8 15:03 nation.tbl
-rw-r--r--  1 markw 50  17G Jul  8 15:03 orders.tbl
-rw-r--r--  1 markw 50 2.3G Jul  8 15:03 part.tbl
-rw-r--r--  1 markw 50  12G Jul  8 15:03 partsupp.tbl
-rw-r--r--  1 markw 50  391 Jul  8 15:03 region.tbl
-rw-r--r--  1 markw 50 136M Jul  8 15:03 supplier.tbl

Number of rows:
# wc -l *.tbl
    15000000 customer.tbl
   600037902 lineitem.tbl
          25 nation.tbl
   150000000 orders.tbl
    20000000 part.tbl
    80000000 partsupp.tbl
           5 region.tbl
     1000000 supplier.tbl
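
If my rough arithmetic is right, the rows are fairly short across the
board, for example:

  lineitem: ~74GB / 600037902 rows ~= 130 bytes/row
  orders:   ~17GB / 150000000 rows ~= 120 bytes/row
  customer: ~2.3GB / 15000000 rows ~= 165 bytes/row

so I'd guess this load leans toward your second case (lots of rows and
tuple insertions) rather than the long-row case.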

Thanks,
Mark

On Tue, 19 Jul 2005 14:05:56 -0700
"Alon Goldshuv" <agoldshuv@greenplum.com> wrote:

> Hi Mark,
>
> I improved the data *parsing* capabilities of COPY, and didn't touch the
> data conversion or data insertion parts of the code. The gain from the
> parsing improvement will vary widely depending on the ratio of parsing to
> converting and inserting in the overall load.
>
> Therefore, the speed increase really depends on the nature of your data.
>
> A 100GB file with:
>   - long data rows (lots of parsing)
>   - a small number of columns (few attr conversions per row)
>   - fewer rows (fewer tuple insertions)
> will show the best performance improvements.
>
> However, a file of the same 100GB size with:
>   - short data rows (minimal parsing)
>   - a large number of columns (many attr conversions per row)
>   - and/or more rows (more tuple insertions)
> will still show improvements, but not as significant.
> In general I'd estimate a 40%-95% improvement in load speed for the first
> case and 10%-40% for the second, though that also depends on the hardware,
> disk speed, etc. Those figures are for the TEXT format. CSV may be faster
> too, but not by as much as specified here, and BINARY will stay the same
> as before.
>
> HTH
> Alon.
>
> On 7/19/05 12:54 PM, "Mark Wong" <markw@osdl.org> wrote:
>
> > On Thu, 14 Jul 2005 17:22:18 -0700
> > "Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
> >
> >> I revisited my patch, removed the code duplication that was there, and
> >> added support for CSV with buffered input, so CSV now runs faster too
> >> (although it is not as optimized as the TEXT format parsing). So now
> >> TEXT, CSV, and BINARY are all parsed in CopyFrom(), as in the original file.
> >
> > Hi Alon,
> >
> > I'm curious, what kind of system are you testing this on?  I'm trying to
> > load 100GB of data for our dbt3 workload on a 4-way Itanium 2.  I'm
> > interested in what results you would expect.
> >
> > Mark
> >
>
