Re: COPY FROM performance improvements - Mailing list pgsql-patches

From: Mark Wong
Subject: Re: COPY FROM performance improvements
Date: 2005-07-19 21:37
Msg-id: 200507192137.j6JLbZjA011011@smtp.osdl.org
In response to: Re: COPY FROM performance improvements ("Alon Goldshuv" <agoldshuv@greenplum.com>)
Responses: Re: COPY FROM performance improvements ("Alon Goldshuv" <agoldshuv@greenplum.com>)
           Re: COPY FROM performance improvements (Andrew Dunstan <andrew@dunslane.net>)
List: pgsql-patches

Hi Alon,

Yeah, that helps.  I just need to break up my scripts a little so they
load the data without building the indexes.
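
Something like this is what I have in mind (just a sketch; the database
name, file paths, and the index are placeholders, and the dbgen .tbl
files are pipe-delimited):

# load phase: COPY the flat files only, no index builds yet
psql -d dbt3 -c "COPY lineitem FROM '/tmp/dbt3/lineitem.tbl' WITH DELIMITER '|';"
psql -d dbt3 -c "COPY orders FROM '/tmp/dbt3/orders.tbl' WITH DELIMITER '|';"
# ... and the same for the other six tables ...

# index phase: run separately, after the timed load
psql -d dbt3 -c "CREATE INDEX i_l_orderkey ON lineitem (l_orderkey);"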

Is the following information enough for you to take a guess at the data
I'm loading, if you don't mind? ;)  Here's a link to the script that
creates the tables:

http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432&path=scripts/pgsql/create_tables.sh.in

File sizes:
-rw-r--r--  1 markw 50 2.3G Jul  8 15:03 customer.tbl
-rw-r--r--  1 markw 50  74G Jul  8 15:03 lineitem.tbl
-rw-r--r--  1 markw 50 2.1K Jul  8 15:03 nation.tbl
-rw-r--r--  1 markw 50  17G Jul  8 15:03 orders.tbl
-rw-r--r--  1 markw 50 2.3G Jul  8 15:03 part.tbl
-rw-r--r--  1 markw 50  12G Jul  8 15:03 partsupp.tbl
-rw-r--r--  1 markw 50  391 Jul  8 15:03 region.tbl
-rw-r--r--  1 markw 50 136M Jul  8 15:03 supplier.tbl

Number of rows:
# wc -l *.tbl
    15000000 customer.tbl
   600037902 lineitem.tbl
          25 nation.tbl
   150000000 orders.tbl
    20000000 part.tbl
    80000000 partsupp.tbl
           5 region.tbl
     1000000 supplier.tbl
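
If my rough arithmetic is right, the rows are fairly short across the
board, for example:

  lineitem: ~74GB / 600037902 rows ~= 130 bytes/row
  orders:   ~17GB / 150000000 rows ~= 120 bytes/row
  customer: ~2.3GB / 15000000 rows ~= 165 bytes/row

so I'd guess this load leans toward your second case (lots of rows and
tuple insertions) rather than the long-row case.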

Thanks,
Mark

On Tue, 19 Jul 2005 14:05:56 -0700
"Alon Goldshuv" <agoldshuv@greenplum.com> wrote:

> Hi Mark,
>
> I improved the data *parsing* capabilities of COPY, and didn't touch the
> data conversion or data insertion parts of the code. The gain from the
> parsing improvement will vary widely depending on the ratio of parsing to
> converting and inserting in the overall load.
>
> Therefore, the speed increase really depends on the nature of your data.
>
> A 100GB file with:
>   - long data rows (lots of parsing)
>   - a small number of columns (few attr conversions per row)
>   - fewer rows (fewer tuple insertions)
> will show the best performance improvements.
>
> However, a file of the same 100GB size with:
>   - short data rows (minimal parsing)
>   - a large number of columns (many attr conversions per row)
>   - and/or more rows (more tuple insertions)
> will still show improvements, but not as significant.
> In general I'd estimate a 40%-95% improvement in load speed for the first
> case and 10%-40% for the second, though that also depends on the hardware,
> disk speed, etc. Those figures are for the TEXT format. CSV may be faster
> too, but not by as much as specified here, and BINARY will stay the same
> as before.
>
> HTH
> Alon.
>
> On 7/19/05 12:54 PM, "Mark Wong" <markw@osdl.org> wrote:
>
> > On Thu, 14 Jul 2005 17:22:18 -0700
> > "Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
> >
> >> I revisited my patch, removed the code duplication that was there, and
> >> added support for CSV with buffered input, so CSV now runs faster too
> >> (although it is not as optimized as the TEXT format parsing). So now
> >> TEXT, CSV, and BINARY are all parsed in CopyFrom(), as in the original file.
> >
> > Hi Alon,
> >
> > I'm curious, what kind of system are you testing this on?  I'm trying to
> > load 100GB of data for our dbt3 workload on a 4-way Itanium 2.  I'm
> > interested in what results you would expect.
> >
> > Mark
> >
>
