Re: COPY FROM performance improvements - Mailing list pgsql-patches

From:           Mark Wong
Subject:        Re: COPY FROM performance improvements
Date:
Msg-id:         200507192137.j6JLbZjA011011@smtp.osdl.org
In response to: Re: COPY FROM performance improvements ("Alon Goldshuv" <agoldshuv@greenplum.com>)
Responses:      Re: COPY FROM performance improvements ("Alon Goldshuv" <agoldshuv@greenplum.com>)
                Re: COPY FROM performance improvements (Andrew Dunstan <andrew@dunslane.net>)
List:           pgsql-patches
Hi Alon,

Yeah, that helps. I just need to break up my scripts a little to just load
the data and not build indexes. Is the following information good enough to
give a guess about the data I'm loading, if you don't mind? ;)

Here's a link to my script to create tables:
http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432&path=scripts/pgsql/create_tables.sh.in

File sizes:

-rw-r--r-- 1 markw 50 2.3G Jul  8 15:03 customer.tbl
-rw-r--r-- 1 markw 50  74G Jul  8 15:03 lineitem.tbl
-rw-r--r-- 1 markw 50 2.1K Jul  8 15:03 nation.tbl
-rw-r--r-- 1 markw 50  17G Jul  8 15:03 orders.tbl
-rw-r--r-- 1 markw 50 2.3G Jul  8 15:03 part.tbl
-rw-r--r-- 1 markw 50  12G Jul  8 15:03 partsupp.tbl
-rw-r--r-- 1 markw 50  391 Jul  8 15:03 region.tbl
-rw-r--r-- 1 markw 50 136M Jul  8 15:03 supplier.tbl

Number of rows:

# wc -l *.tbl
  15000000 customer.tbl
 600037902 lineitem.tbl
        25 nation.tbl
 150000000 orders.tbl
  20000000 part.tbl
  80000000 partsupp.tbl
         5 region.tbl
   1000000 supplier.tbl

Thanks,
Mark

On Tue, 19 Jul 2005 14:05:56 -0700
"Alon Goldshuv" <agoldshuv@greenplum.com> wrote:

> Hi Mark,
>
> I improved the data *parsing* capabilities of COPY, and didn't touch the
> data conversion or data insertion parts of the code. The parsing improvement
> will vary largely depending on the ratio of parsing -to- converting and
> inserting.
>
> Therefore, the speed increase really depends on the nature of your data:
>
> 100GB file with
> long data rows (lots of parsing)
> Small number of columns (small number of attr conversions per row)
> less rows (less tuple insertions)
>
> Will show the best performance improvements.
>
> However, same file size 100GB with
> Short data rows (minimal parsing)
> large number of columns (large number of attr conversions per row)
> AND/OR
> more rows (more tuple insertions)
>
> Will show improvements but not as significant.
> In general I'll estimate 40%-95% improvement in load speed for the 1st case
> and 10%-40% for the 2nd.
> But that also depends on the hardware, disk speed
> etc... This is for TEXT format. As for CSV, it may be faster but not as much
> as I specified here. BINARY will stay the same as before.
>
> HTH
> Alon.
>
> On 7/19/05 12:54 PM, "Mark Wong" <markw@osdl.org> wrote:
>
> > On Thu, 14 Jul 2005 17:22:18 -0700
> > "Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
> >
> >> I revisited my patch and removed the code duplications that were there, and
> >> added support for CSV with buffered input, so CSV now runs faster too
> >> (although it is not as optimized as the TEXT format parsing). So now
> >> TEXT, CSV and BINARY are all parsed in CopyFrom(), like in the original file.
> >
> > Hi Alon,
> >
> > I'm curious, what kind of system are you testing this on? I'm trying to
> > load 100GB of data in our dbt3 workload on a 4-way itanium2. I'm
> > interested in the results you would expect.
> >
> > Mark
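[As a rough sanity check on which of the tables above fall into Alon's "long
rows" case, the average row width can be estimated from the file sizes and row
counts Mark posted. Note the `ls -h` sizes are rounded to whole GiB, so the
byte counts below are approximations derived from those rounded figures, not
exact file sizes:]

```python
# Estimate average row width (bytes/row) for the largest .tbl files,
# using the rounded sizes from `ls -h` (interpreted as GiB) and the
# row counts from `wc -l`. Sizes are approximate -- ls -h rounds.
GIB = 2 ** 30

tables = {
    # name: (approx_size_bytes, row_count)
    "lineitem": (74 * GIB, 600_037_902),
    "orders":   (17 * GIB, 150_000_000),
    "partsupp": (12 * GIB, 80_000_000),
    "customer": (int(2.3 * GIB), 15_000_000),
}

for name, (size, rows) in tables.items():
    print(f"{name:9s} ~{size / rows:.0f} bytes/row")
# prints roughly: lineitem ~132, orders ~122, partsupp ~161, customer ~165
```

[On these numbers every big table averages well over 100 bytes per row, so
there is a fair amount of parsing per tuple; whether the improvement lands
nearer the 40%-95% band or the 10%-40% band would then hinge mostly on the
column count per row, which the create_tables.sh.in script linked above
determines.]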