Thread: optimising data load

optimising data load

From
John Taylor
Date:
Hi all,

I'm (still) working on an application to regularly populate my database with some data provided from
a third party DB.

I'm still having really bad performance problems.
There are 300,000 records to be inserted, but I'm only getting 10,000/hour.
30 hours to populate a single table is just not on. There must be something seriously bad going on.

I have to update two different databases: live, and an update log.

I've included EXPLAIN output for both DBs below.

I have a few questions:

1) Can it really be right that it is going this slow?
   I'm running Red Hat Linux with kernel 2.4.7, and PostgreSQL 7.1.3.

2) Can anyone see a way to optimise these queries further?

3) Is it likely to be quicker if I use a query to obtain ol.line, and then generate a CSV file for use with COPY?

Thanks
JohnT

--- LIVE ---
explain INSERT INTO orderlines (theorder,type,stock,line,ordercurrent,sellingquant,price,discount,vatrate,comment)
SELECT oh.theorder,'P','  0310',coalesce(ol.line+1,1),5,0,.52,0,0,''  FROM orderheader oh
LEFT OUTER JOIN orderlines ol ON oh.theorder = ol.theorder
WHERE oh.account=' MILN1' AND oh.delivery=1 AND oh.thedate='2002-06-01' AND oh.ordertype='O'
ORDER BY ol.line DESC LIMIT 1;
NOTICE:  QUERY PLAN:

Subquery Scan *SELECT*  (cost=47.41..47.41 rows=1 width=12)
  ->  Limit  (cost=47.41..47.41 rows=1 width=12)
        ->  Sort  (cost=47.41..47.41 rows=1 width=12)
              ->  Nested Loop  (cost=0.00..47.40 rows=1 width=12)
                    ->  Index Scan using orderheader_account on orderheader oh  (cost=0.00..21.64 rows=1 width=4)
                    ->  Index Scan using orderlines_pkey on orderlines ol  (cost=0.00..25.54 rows=17 width=8)

EXPLAIN

--- UPDATE LOG ---
explain INSERT INTO orderlinesupdates
(theorder,type,stock,line,ordercurrent,sellingquant,price,discount,vatrate,comment,updated,utype,origin)
SELECT oh.theorder,'P','  0310',coalesce(ol.line,ol2.line+1,1),5,0,.52,0,0,'',128,'+','C' FROM orderheaderupdates oh
LEFT OUTER JOIN orderlinesupdates ol ON oh.theorder = ol.theorder AND ol.stock='  0310'
LEFT OUTER JOIN orderlinesupdates ol2 ON oh.theorder = ol2.theorder
WHERE oh.account=' MILN1' AND oh.delivery=1 AND oh.thedate='2002-06-01' AND oh.ordertype='O'
ORDER BY oh.updated DESC, ol.line DESC, ol2.line DESC LIMIT 1;
NOTICE:  QUERY PLAN:

Subquery Scan *SELECT*  (cost=81.29..81.29 rows=1 width=36)
  ->  Limit  (cost=81.29..81.29 rows=1 width=36)
        ->  Sort  (cost=81.29..81.29 rows=1 width=36)
              ->  Nested Loop  (cost=0.00..81.28 rows=1 width=36)
                    ->  Nested Loop  (cost=0.00..52.47 rows=1 width=28)
                          ->  Index Scan using orderheaderupdates_account on orderheaderupdates oh  (cost=0.00..23.62 rows=1 width=8)
                          ->  Index Scan using orderlinesupdates_theorder on orderlinesupdates ol  (cost=0.00..28.60 rows=17 width=20)
                    ->  Index Scan using orderlinesupdates_theorder on orderlinesupdates ol2  (cost=0.00..28.60 rows=17 width=8)

EXPLAIN


Re: optimising data load

From
Rasmus Mohr
Date:
I think something like setting fsync = off in postgresql.conf and dropping
indexes should boost performance. I think it did help when we had a similar
problem populating our database with data from an old Access database.

--------------------------------------------------------------
Rasmus T. Mohr            Direct  :             +45 36 910 122
Application Developer     Mobile  :             +45 28 731 827
Netpointers Intl. ApS     Phone   :             +45 70 117 117
Vestergade 18 B           Fax     :             +45 70 115 115
1456 Copenhagen K         Email   : mailto:rmo@netpointers.com
Denmark                   Website : http://www.netpointers.com

"Remember that there are no bugs, only undocumented features."
--------------------------------------------------------------


Re: optimising data load

From
"Patrick Hatcher"
Date:
Dump the records from the other database to a text file and then use the COPY
command for Pg.  I update tables nightly with 400K+ records and it only
takes 1-2 mins.  You should drop and re-add your indexes and then do a
VACUUM ANALYZE.
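
Roughly (untested; the index name here is just an example, recreate whatever
indexes you actually have):

  -- after dumping the source data to a tab-delimited file:
  DROP INDEX orderlines_theorder_idx;
  COPY orderlines FROM '/tmp/orderlines.txt';    -- COPY FROM reads a path on the server
  CREATE INDEX orderlines_theorder_idx ON orderlines (theorder);
  VACUUM ANALYZE orderlines;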

Patrick Hatcher
Macys.Com
Legacy Integration Developer
415-932-0610 office
HatcherPT - AIM





Re: optimising data load

From
John Taylor
Date:
On Wednesday 22 May 2002 16:29, Patrick Hatcher wrote:
> Dump the records from the other dbase to a text file and then use the COPY
> command for Pg.  I update tables nightly with 400K+ records and it only
> takes 1 -2 mins.  You should drop and re-add your indexes and then do a
> vacuum analyze
>

I'm looking into that at the moment.
I'm getting some very variable results.
There are some tables that it is easy to do this for.

However, for some tables I don't get the data in the right format, so I need to
perform some queries to get the right values to use when populating.

In this situation I'm not sure if I should drop the indexes to make the insert faster,
or keep them to make the selects faster.


Thanks
JohnT

Re: optimising data load

From
Ron Johnson
Date:
On Wed, 2002-05-22 at 08:45, John Taylor wrote:
>
> Hi all,
>
> I'm (still) working on an application to regularly populate my database with some data provided from
> a third party DB.
>
> I'm still having really bad performance problems.
> There are 300,000 records to be inserted, but I'm only getting 10,000/hour.
> 30 hours to populate a single table is just not on. There must be something seriously bad going on.

From what I see of your INSERT statement, you are joining orderheader
to orderlines and then inserting those records into orderlines.
Is that correct?

I think I'd make a view from the SELECT portion.  Then, nightly:
1. COPY out the view.
2. Drop the index(es) on orderlines
3. COPY into orderlines
4. recreate indexes on orderlines.

Do the same for orderlinesupdates.  The thing is that
modifying indexes is a _very_ expensive operation, no matter
what RDBMS or hardware configuration you have...


--
+---------------------------------------------------------+
| Ron Johnson, Jr.        Home: ron.l.johnson@cox.net     |
| Jefferson, LA  USA      http://ronandheather.dhs.org:81 |
|                                                         |
| "I have created a government of whirled peas..."        |
|   Maharishi Mahesh Yogi, 12-May-2002,                   |
!   CNN, Larry King Live                                  |
+---------------------------------------------------------+


Re: optimising data load

From
John Taylor
Date:

On Wednesday 22 May 2002 18:40, Ron Johnson wrote:
> On Wed, 2002-05-22 at 08:45, John Taylor wrote:
> >
> > Hi all,
> >
> > I'm (still) working on an application to regularly populate my database with some data provided from
> > a third party DB.
> >
> > I'm still having really bad performance problems.
> > There are 300,000 records to be inserted, but I'm only getting 10,000/hour.
> > 30 hours to populate a single table is just not on. There must be something seriously bad going on.
>
> From what I see of your INSERT statement, you are joining orderheader
> to orderlines and then inserting those records into orderlines.
> Is that correct?

Yes, unfortunately I get incomplete data, so I need to select some details before inserting.

>
> I think I'd make a view from the SELECT portion.  Then, nightly:
> 1. COPY out the view.
> 2. Drop the index(es) on orderlines
> 3. COPY into orderlines
> 4. recreate indexes on orderlines.

So, how would I merge the data from the view with the data to copy in?

I was thinking of generating the copy file by performing a select for each row,
but selecting each row into a file and then copying it in might be more expensive than
just inserting as I do now.

>
> Do the same for orderlineupdates.  The thing is is that
> modifying indexes is a _very_ expensive operation, no matter
> what RDBMS or hardware configuration you have...
>

Yes, I was thinking of that for my next step.
Once I figure out how to get COPY working, I'll also try dropping indexes.

Thanks
JohnT


Re: optimising data load

From
Ron Johnson
Date:
On Wed, 2002-05-22 at 16:05, John Taylor wrote:
> On Wednesday 22 May 2002 18:40, Ron Johnson wrote:
> > On Wed, 2002-05-22 at 08:45, John Taylor wrote:
> > >
> > > Hi all,
> > >
> > > I'm (still) working on an application to regularly populate my database with some data provided from
> > > a third party DB.
> > >
> > > I'm still having really bad performance problems.
> > > There are 300,000 records to be inserted, but I'm only getting 10,000/hour.
> > > 30 hours to populate a single table is just not on. There must be something seriously bad going on.
> >
> > From what I see of your INSERT statement, you are joining orderheader
> > to orderlines and then inserting those records into orderlines.
> > Is that correct?
>
> Yes, unfortunately I get incomplete data, so I need to select some details before inserting.
>
> >
> > I think I'd make a view from the SELECT portion.  Then, nightly:
> > 1. COPY out the view.
> > 2. Drop the index(es) on orderlines
> > 3. COPY into orderlines
> > 4. recreate indexes on orderlines.
>
> So, how would I merge the data from the view with the data to copy in ?

???  Unless I am misunderstanding you:

0. create view v_new_orderlines (theorder, type, stock, line,
       ordercurrent, sellingquant, price, discount, vatrate, comment)
   as  SELECT oh.theorder, 'P', '0310', coalesce(ol.line+1,1), 5, 0,
              .52, 0, 0, ''
       FROM orderheader oh
               LEFT OUTER JOIN orderlines ol
                       ON oh.theorder = ol.theorder
       WHERE oh.account=' MILN1'
         AND oh.delivery=1
         AND oh.thedate='2002-06-01'
         AND oh.ordertype='O'
       ORDER BY ol.line ;
1. COPY v_new_orderlines to '/tmp/new_orderlines.tsv';
2. DROP each index from orderlines
3. COPY orderlines from '/tmp/new_orderlines.tsv';
4. CREATE each index on orderlines


> I was thinking of generating the copy file by performing a select for each row,
> but a selecting each row into a file, and then copy in might be more expensive than
> just inserting it like I do now.
>
> >
> > Do the same for orderlineupdates.  The thing is is that
> > modifying indexes is a _very_ expensive operation, no matter
> > what RDBMS or hardware configuration you have...
> >
>
> Yes, I was thinking of that for my next step.
> Once I figure how to get copy working, I'll also try dropping indexes.



Re: optimising data load

From
John Taylor
Date:
On Thursday 23 May 2002 03:40, Ron Johnson wrote:
> On Wed, 2002-05-22 at 16:05, John Taylor wrote:
>
> > So, how would I merge the data from the view with the data to copy in ?
>
> ???  Unless I am misunderstanding you:

I think you are misunderstanding me.
I have 300,000 rows to insert.
They each require a different query to obtain the correct data to insert.

>
> 0. create view v_new_orderlines (theorder, type, stock, line,
>        ordercurrent, sellingquant, price, discount, vatrate, comment)
>    as  SELECT oh.theorder, 'P', '0310', coalesce(ol.line+1,1), 5, 0,
>               .52, 0, 0, ''
>        FROM orderheader oh
>                LEFT OUTER JOIN orderlines ol
>                        ON oh.theorder = ol.theorder
>        WHERE oh.account=' MILN1'
>          AND oh.delivery=1
>          AND oh.thedate='2002-06-01'
>          AND oh.ordertype='O'
>        ORDER BY ol.line ;
> 1. COPY v_new_orderlines to '/tmp/new_orderlines.tsv';
> 2. DROP each index from orderlines
> 3. COPY orderlines from '/tmp/new_orderlines.tsv';
> 4. CREATE each index on orderlines
>

JohnT

Re: optimising data load

From
Ron Johnson
Date:
On Thu, 2002-05-23 at 02:01, John Taylor wrote:
> On Thursday 23 May 2002 03:40, Ron Johnson wrote:
> > On Wed, 2002-05-22 at 16:05, John Taylor wrote:
> >
> > > So, how would I merge the data from the view with the data to copy in ?
> >
> > ???  Unless I am misunderstanding you:
>
> I think you are misunderstanding me.
> I have 300,000 rows to insert.
> They each require a different query to obtain the correct data to insert.

Oh, ok.  How about creating an intermediary, indexless table
that the 300,000 records will be inserted into.  Then, after
the intermediary table is populated, drop the index(es) on
orderlines and INSERT INTO orderlines SELECT * FROM intermediary;
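
Roughly (the table and index names are only examples):

  -- empty clone of orderlines: same columns, no rows, no indexes
  CREATE TABLE orderlines_load AS SELECT * FROM orderlines WHERE 1 = 2;

  -- run the 300,000 individual INSERTs against orderlines_load, then:
  DROP INDEX orderlines_stock_idx;       -- each index on orderlines
  INSERT INTO orderlines SELECT * FROM orderlines_load;
  CREATE INDEX orderlines_stock_idx ON orderlines (stock);
  VACUUM ANALYZE orderlines;
  DROP TABLE orderlines_load;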

> >
> > 0. create view v_new_orderlines (theorder, type, stock, line,
> >        ordercurrent, sellingquant, price, discount, vatrate, comment)
> >    as  SELECT oh.theorder, 'P', '0310', coalesce(ol.line+1,1), 5, 0,
> >               .52, 0, 0, ''
> >        FROM orderheader oh
> >                LEFT OUTER JOIN orderlines ol
> >                        ON oh.theorder = ol.theorder
> >        WHERE oh.account=' MILN1'
> >          AND oh.delivery=1
> >          AND oh.thedate='2002-06-01'
> >          AND oh.ordertype='O'
> >        ORDER BY ol.line ;
> > 1. COPY v_new_orderlines to '/tmp/new_orderlines.tsv';
> > 2. DROP each index from orderlines
> > 3. COPY orderlines from '/tmp/new_orderlines.tsv';
> > 4. CREATE each index on orderlines
> >
>
> JohnT


Re: optimising data load

From
John Taylor
Date:
On Thursday 23 May 2002 08:22, you wrote:
> On Thu, 2002-05-23 at 02:01, John Taylor wrote:
> > On Thursday 23 May 2002 03:40, Ron Johnson wrote:
> > > On Wed, 2002-05-22 at 16:05, John Taylor wrote:
> > >
> > > > So, how would I merge the data from the view with the data to copy in ?
> > >
> > > ???  Unless I am misunderstanding you:
> >
> > I think you are misunderstanding me.
> > I have 300,000 rows to insert.
> > They each require a different query to obtain the correct data to insert.
>
> Oh, ok.  How about creating an intermediary, indexless table
> that the 300,000 records will be inserted into.  Then, after
> the intermediary table is populated, drop the index(es) on
> orderlines and INSERT INTO orderlines SELECT * FROM intermediary;
>

That sounds like a good idea.
Would it be quicker to COPY out from the intermediary, and then COPY into the real one?
I guess I'll have to experiment.

I have a feeling a lot of the problems may be due to the indexes.
I've been experimenting with an easier table.
COPY into the live table takes 5 minutes for 850,000 records;
COPY into the updates table takes 2 hours!
It is the same data with 3 extra columns, but there are an additional 2 indexes.
I think it must be the indexes making the difference. I'm just checking now.

Thanks for the ideas !

Regards
JohnT

Re: optimising data load

From
Andrew McMillan
Date:
On Thu, 2002-05-23 at 21:10, John Taylor wrote:
> >
> > Oh, ok.  How about creating an intermediary, indexless table
> > that the 300,000 records will be inserted into.  Then, after
> > the intermediary table is populated, drop the index(es) on
> > orderlines and INSERT INTO orderlines SELECT * FROM intermediary;
> >
>
> That sounds a good idea.
> Would it be quicker to copy out from the intermediary, and then copy in to the real one ?
> I guess I'll have to experiment.

It will be quickest to:

"INSERT INTO orderlinesupdates SELECT * FROM intermediary", like Ron said.

No COPY involved, except to get the records into the intermediary table
initially.

That syntax on the INSERT statement is _really_ useful for this sort of
thing.


> I have a feeling a lot of the problems may be due to the indexes.
> I've been experimenting with an easier table.
> copy in to the live takes 5 minutes for 850000 records
> copy in to the updates takes 2 hours !
> It is the same data with 3 extra columns, but there are an additional 2 indexes.
> I think it must be the indexes making the diffrence. I'm just checking now.

Also consider that if the number of records in the table is changing a
lot, then it is well worthwhile doing a VACUUM ANALYZE from time to time, so
the planner statistics get updated and the system produces plans
appropriate to a 400k record table (e.g.) rather than a 100k record
table.
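
So once the intermediary table is populated, the nightly run boils down to
something like:

  INSERT INTO orderlinesupdates SELECT * FROM intermediary;
  VACUUM ANALYZE orderlinesupdates;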

Regards,
                    Andrew.
--
--------------------------------------------------------------------
Andrew @ Catalyst .Net.NZ Ltd, PO Box 11-053, Manners St, Wellington
WEB: http://catalyst.net.nz/        PHYS: Level 2, 150-154 Willis St
DDI: +64(4)916-7201    MOB: +64(21)635-694    OFFICE: +64(4)499-2267
       Are you enrolled at http://schoolreunions.co.nz/ yet?



Full Outer Joins

From
John Taylor
Date:
Hi,

Can someone give me examples of the correct syntax for FULL OUTER JOIN, where each table
has some additional conditions? The docs don't give any complete examples :-(

I can only get it to work using subselects:

SELECT stock,o.ordercurrent,s.quantity FROM
(SELECT stock,ordercurrent FROM orderlines o WHERE  o.theorder='  4494' AND (o.type='P' OR o.type='T')) AS o
FULL OUTER JOIN
(SELECT stock,quantity FROM standingorders s WHERE s.account=' 15048' AND s.dayno=2) AS s
USING (stock)

Thanks
JohnT

Re: Full Outer Joins

From
"Joel Burton"
Date:
> -----Original Message-----
> From: pgsql-novice-owner@postgresql.org
> [mailto:pgsql-novice-owner@postgresql.org]On Behalf Of John Taylor
> Sent: Monday, May 27, 2002 6:15 AM
> To: PgSQL Novice ML
> Subject: [NOVICE] Full Outer Joins
>
> Can someone give me examples for the correct syntax for FULL
> OUTER JOIN, where each table
> has some additional conditions. The docs don't give any complete
> examples :-(
>
> I can only get it to work using subselects:
>
> SELECT stock,o.ordercurrent,s.quantity FROM
> (SELECT stock,ordercurrent FROM orderlines o WHERE  o.theorder='
> 4494' AND (o.type='P' OR o.type='T')) AS o
> FULL OUTER JOIN
> (SELECT stock,quantity FROM standingorders s WHERE s.account='
> 15048' AND s.dayno=2) AS s
> USING (stock)

create table Customers (custid int primary key,
                        custname text unique,
                        zip char(5) not null);

create table Orders (orderid int primary key,
                     custid int not null references Customers,
                     part text not null,
                     orderamt int not null);

insert into Customers values (1,'Joel','20009');
insert into Customers values (2,'Johnny Cheapskate','20009');
insert into Customers values (3,'Santa Claus','10005');

insert into Orders values (1,1,'Pink Widget',100);
insert into Orders values (2,1,'Pink Widget',200);
insert into Orders values (3,3,'Pink Widget',100);

Select the customer id, name, # of orders, and total $ of orders, only for those
customers in zip 20009 and only for Pink Widgets. However, be sure to show
all customers in that zip code; for those that never ordered a Pink Widget,
show zero.


           SELECT c.custid,
                  c.custname,
                  count(o.orderid),
                  sum(o.orderamt)
             FROM Customers AS c
  LEFT OUTER JOIN Orders AS o
               ON (o.custid=c.custid
                   AND o.part='Pink Widget')
            WHERE c.zip = '20009'
         GROUP BY c.custid,
                  c.custname;


 custid |     custname      | count | sum
--------+-------------------+-------+-----
      1 | Joel              |     2 | 300
      2 | Johnny Cheapskate |     0 |
(2 rows)


If this doesn't give you the idea, you'll have to be more specific about
where you're stuck.

- J.

Joel BURTON | joel@joelburton.com | joelburton.com | aim: wjoelburton
Knowledge Management & Technology Consultant



Re: Full Outer Joins

From
John Taylor
Date:
On Monday 27 May 2002 15:33, Joel Burton wrote:
> > -----Original Message-----
> > From: pgsql-novice-owner@postgresql.org
> > [mailto:pgsql-novice-owner@postgresql.org]On Behalf Of John Taylor
> > Sent: Monday, May 27, 2002 6:15 AM
> > To: PgSQL Novice ML
> > Subject: [NOVICE] Full Outer Joins
> >
> > Can someone give me examples for the correct syntax for FULL
> > OUTER JOIN, where each table
> > has some additional conditions. The docs don't give any complete
> > examples :-(
> >

>
>
>            SELECT c.custid,
>                   c.custname,
>                   count(o.orderid),
>                   sum(o.orderamt)
>              FROM Customers AS c
>   LEFT OUTER JOIN Orders AS o
>                ON (o.custid=c.custid
>                    AND o.part='Pink Widget')
>             WHERE c.zip = '20009'
>          GROUP BY c.custid,
>                   c.custname;
>
>

Postgres doesn't seem to like that form with full outer joins ...

SELECT o.stock,o.ordercurrent,o.type,s.stock,s.quantity
FROM orderlines o FULL OUTER JOIN standingorders s ON (s.stock=o.stock AND  s.account=' 15048' and s.dayno=2 )
WHERE o.theorder='  4494' AND (o.type='P' OR o.type='T')

gives ...

ERROR:  FULL JOIN is only supported with mergejoinable join conditions

If I join with USING() and move all the conditionals to the final WHERE, it only returns rows matching the first
table.

If I join with USING(), and use subselects to generate each side of the JOIN, then it all works fine.

Regards
JohnT


Re: Full Outer Joins

From
Tom Lane
Date:
John Taylor <postgres@jtresponse.co.uk> writes:
> Postgres doesn't seem to like that form with full outer joins ...

> SELECT o.stock,o.ordercurrent,o.type,s.stock,s.quantity
> FROM orderlines o FULL OUTER JOIN standingorders s ON (s.stock=o.stock AND  s.account=' 15048' and s.dayno=2 )
> WHERE o.theorder='  4494' AND (o.type='P' OR o.type='T')

> ERROR:  FULL JOIN is only supported with mergejoinable join conditions

While that implementation limitation is annoying (it's partly fixed in
development sources, FWIW), I really wonder why you'd want to do the
above.  With a FULL JOIN, you are going to get a lot of dummy rows out:
every s row *not* satisfying account=' 15048' and dayno=2 will still
generate a joined row (with nulls for the o columns) and also every o
row that doesn't join to an s row with account=' 15048' and dayno=2 will
generate a joined row (with nulls for the s columns).  It seems unlikely
that that's what you wanted.  I have a hard time envisioning a use for
FULL JOIN with join conditions that restrict only one of the tables;
seems like the restrictions ought to be in WHERE, instead.
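
To see the difference, take the Customers/Orders tables from Joel's message
upthread; these two queries are not equivalent (shown with LEFT JOIN to keep
it simple):

  -- condition attached to the join: every customer survives, and orders
  -- that aren't Pink Widgets just come through as NULLs
  SELECT c.custname, o.part
    FROM Customers c LEFT JOIN Orders o
      ON o.custid = c.custid AND o.part = 'Pink Widget';

  -- condition in WHERE: applied after the join, so the NULL-extended rows
  -- are thrown away and it behaves like a plain inner join
  SELECT c.custname, o.part
    FROM Customers c LEFT JOIN Orders o ON o.custid = c.custid
   WHERE o.part = 'Pink Widget';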

            regards, tom lane

Re: Full Outer Joins

From
John Taylor
Date:
On Monday 27 May 2002 19:10, Tom Lane wrote:
> John Taylor <postgres@jtresponse.co.uk> writes:
> > Postgres doesn't seem to like that form with full outer joins ...
>
> > SELECT o.stock,o.ordercurrent,o.type,s.stock,s.quantity
> > FROM orderlines o FULL OUTER JOIN standingorders s ON (s.stock=o.stock AND  s.account=' 15048' and s.dayno=2 )
> > WHERE o.theorder='  4494' AND (o.type='P' OR o.type='T')
>
> > ERROR:  FULL JOIN is only supported with mergejoinable join conditions
>
> While that implementation limitation is annoying (it's partly fixed in
> development sources, FWIW), I really wonder why you'd want to do the
> above.  With a FULL JOIN, you are going to get a lot of dummy rows out:
> every s row *not* satisfying account=' 15048' and dayno=2 will still
> generate a joined row (with nulls for the o columns) and also every o
> row that doesn't join to an s row with account=' 15048' and dayno=2 will
> generate a joined row (with nulls for the s columns).  It seems unlikely
> that that's what you wanted.  I have a hard time envisioning a use for
> FULL JOIN with join conditions that restrict only one of the tables;
> seems like the restrictions ought to be in WHERE, instead.

Yes, I WANT to restrict both tables, but I can't figure out how.
Where do I put the conditionals for each table?

I have two tables of orders: temporary and permanent.
For each day there are a number of orders to be delivered.
Each order may have an entry in the temporary AND/OR the permanent table.

I want to get all orders that are in either table.
For each order, I need to know which table it is in (and if it is in both), and also join to the stockitems table to get
the description.

I can do it all like this:

SELECT stock,stockitems.description,o.ordercurrent,s.quantity FROM
(SELECT stock,ordercurrent FROM orderlines o WHERE  o.theorder='  4494' AND (o.type='P' OR o.type='T')) AS o
FULL OUTER JOIN
(SELECT stock,quantity FROM standingorders s WHERE s.account=' 15048' AND s.dayno=2) AS s
USING (stock)
JOIN stockitems USING (stock)

How do I do it without the subselects ?

Thanks
JohnT


Re: Full Outer Joins

From
Tom Lane
Date:
John Taylor <postgres@jtresponse.co.uk> writes:
> I can do it all like this:

> SELECT stock,stockitems.description,o.ordercurrent,s.quantity FROM
> (SELECT stock,ordercurrent FROM orderlines o WHERE  o.theorder='  4494' AND (o.type='P' OR o.type='T')) AS o
> FULL OUTER JOIN
> (SELECT stock,quantity FROM standingorders s WHERE s.account=' 15048' AND s.dayno=2) AS s
> USING (stock)
> JOIN stockitems USING (stock)

If that actually has the behavior you want, then that's the way you
write it.  Placing restrictions inside an outer join is not semantically
the same as placing them outside (at the WHERE clause), so it wasn't
clear to me exactly what you wanted.

> How do I do it without the subselects ?

Why worry?  It looks like it should be an efficient solution, or at
least as efficient as you'll get.

            regards, tom lane

Re: Full Outer Joins

From
John Taylor
Date:
On Monday 27 May 2002 20:02, Tom Lane wrote:

>
> If that actually has the behavior you want, then that's the way you
> write it.  Placing restrictions inside an outer join is not semantically
> the same as placing them outside (at the WHERE clause), so it wasn't
> clear to me exactly what you wanted.
>

Ah, that explains why I couldn't figure it out.
I think I need to move on from my current Postgres Bible; it stops short of
full outer joins :-(

Thanks
JohnT