Thread: Update performance ... is 200,000 updates per hour what I should expect?

Update performance ... is 200,000 updates per hour what I should expect?

From
Erik Norvelle
Date:
Folks:

I'm running a query which is designed to generate a foreign key for a
table of approx. 10 million records (I've mentioned this in an earlier
posting).  The table is called "indethom", and each row contains a
single word from the works of St. Thomas Aquinas, along with
grammatical data about the word form, and (most importantly for my
current problem) a set of columns identifying the particular
work/section/paragraph that the word appears in.

This database is completely non-normalized, and I'm working on
performing some basic normalization, beginning with creating a table
called "s2.sectiones" which (naturally) contains a complete listing of
all of the sections of all the works of St. Thomas.  I will then
eliminate this information from the original "indethom" table,
replacing it with the foreign key I am currently generating.

** My question has to do with whether or not I am getting maximal speed
out of PostgreSQL, or whether I need to perform further optimizations.
I am currently getting about 200,000 updates per hour, and updating the
entire 10 million rows thus requires 50 hours, which seems a bit much.

Here's the query I am running:
update indethom
    set query_counter = nextval('s2.query_counter_seq'),  -- Just for keeping track of how fast the query is running
        sectref = (select clavis from s2.sectiones where
            s2.sectiones.nomeoper = indethom.nomeoper
            and s2.sectiones.refere1a = indethom.refere1a
            and s2.sectiones.refere1b = indethom.refere1b
            and s2.sectiones.refere2a = indethom.refere2a
            and s2.sectiones.refere2b = indethom.refere2b
            and s2.sectiones.refere3a = indethom.refere3a
            and s2.sectiones.refere3b = indethom.refere3b
            and s2.sectiones.refere4a = indethom.refere4a
            and s2.sectiones.refere4b = indethom.refere4b);
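
(The query_counter column is there so that progress can be sampled from a
second session while the update runs, roughly like this:

select last_value from s2.query_counter_seq;
-- run again N seconds later; (difference / N) * 3600 approximates updates per hour

which is where the 200,000/hour figure comes from.)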

Here's the query plan:
                            QUERY PLAN
-------------------------------------------------------------------------------------
 Seq Scan on indethom  (cost=0.00..1310352.72 rows=10631972 width=212)
   SubPlan
     ->  Index Scan using sectiones_ndx on sectiones  (cost=0.00..6.03 rows=1 width=4)
           Index Cond: ((nomeoper = $0) AND (refere1a = $1) AND (refere1b = $2) AND (refere2a = $3) AND (refere2b = $4) AND (refere3a = $5) AND (refere3b = $6) AND (refere4a = $7) AND (refere4b = $8))
(4 rows)

Note: I have just performed a VACUUM ANALYZE on the indethom table, as
suggested on this listserv.

Here's the structure of the s2.sectiones table:
it=> \d s2.sectiones
       Table "s2.sectiones"
  Column  |     Type     | Modifiers
----------+--------------+-----------
 nomeoper | character(3) |
 refere1a | character(2) |
 refere1b | character(2) |
 refere2a | character(2) |
 refere2b | character(2) |
 refere3a | character(2) |
 refere3b | character(2) |
 refere4a | character(2) |
 refere4b | character(2) |
 clavis   | integer      |
Indexes: sectiones_ndx btree (nomeoper, refere1a, refere1b, refere2a, refere2b, refere3a, refere3b, refere4a, refere4b)

Finally, here is the structure of indethom (some non-relevant columns
not shown):
it=> \d indethom
              Table "public.indethom"
    Column     |         Type          | Modifiers
---------------+-----------------------+-----------
 numeoper      | smallint              | not null
 nomeoper      | character(3)          | not null
 editcrit      | character(1)          |
 refere1a      | character(2)          |
 refere1b      | character(2)          |
 refere2a      | character(2)          |
 refere2b      | character(2)          |
 refere3a      | character(2)          |
 refere3b      | character(2)          |
 refere4a      | character(2)          |
 refere4b      | character(2)          |
 refere5a      | character(2)          | not null
 refere5b      | smallint              | not null
 referen6      | smallint              | not null
 ... several columns skipped ...
 verbum        | character varying(22) | not null
 ... other columns skipped ...
 poslinop      | integer               | not null
 posverli      | smallint              | not null
 posverop      | integer               | not null
 clavis        | integer               | not null
 articref      | integer               |
 sectref       | integer               |
 query_counter | integer               |
Indexes: indethom_pkey primary key btree (clavis),
         indethom_articulus_ndx btree (nomeoper, refere1a, refere1b, refere2a, refere2b, refere3a, refere3b),
         indethom_sectio_ndx btree (nomeoper, refere1a, refere1b, refere2a, refere2b, refere3a, refere3b, refere4a, refere4b),
         verbum_ndx btree (verbum)

Thanks for your assistance!
-Erik Norvelle

Re: Update performance ... is 200,000 updates per hour

From
Stephan Szabo
Date:
On Tue, 2 Dec 2003, Erik Norvelle wrote:

> ** My question has to do with whether or not I am getting maximal speed
> out of PostgreSQL, or whether I need to perform further optimizations.
> I am currently getting about 200,000 updates per hour, and updating the
> entire 10 million rows thus requires 50 hours, which seems a bit much.

Well, it doesn't entirely surprise me, given the presumably 10 million
iterations of the index scan that it's doing.  EXPLAIN ANALYZE output (even
over a subset of the indethom table, by adding a where clause) would
probably give better information to work from.
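
For instance (the clavis range below is just an arbitrary way to limit the
rows; wrapping it in a transaction lets you roll the test rows back):

begin;
explain analyze
update indethom
    set sectref = (select clavis from s2.sectiones where
        s2.sectiones.nomeoper = indethom.nomeoper
        and s2.sectiones.refere1a = indethom.refere1a
        and s2.sectiones.refere1b = indethom.refere1b
        and s2.sectiones.refere2a = indethom.refere2a
        and s2.sectiones.refere2b = indethom.refere2b
        and s2.sectiones.refere3a = indethom.refere3a
        and s2.sectiones.refere3b = indethom.refere3b
        and s2.sectiones.refere4a = indethom.refere4a
        and s2.sectiones.refere4b = indethom.refere4b)
 where indethom.clavis < 100000;
rollback;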

I'd suggest seeing if something like:
update indethom set query_counter = ..., sectref = s.clavis
 from s2.sectiones s where
  s.nomeoper = indethom.nomeoper and ...;
tries a join that might give a better plan.


> Here's the query I am running:
> update indethom
>     set query_counter = nextval('s2.query_counter_seq'),           -- Just
> for keeping track of how fast the query is running
>     sectref = (select clavis from s2.sectiones where
>         s2.sectiones.nomeoper = indethom.nomeoper
>         and s2.sectiones.refere1a = indethom.refere1a and
> s2.sectiones.refere1b = indethom.refere1b
>         and s2.sectiones.refere2a = indethom.refere2a  and
> s2.sectiones.refere2b = indethom.refere2b
>         and s2.sectiones.refere3a = indethom.refere3a  and
> s2.sectiones.refere3b = indethom.refere3b
>         and s2.sectiones.refere4a = indethom.refere4a and
> s2.sectiones.refere4b = indethom.refere4b);
>
> Here's the query plan:
>                              QUERY PLAN
> -------------------------------------------------------------------------------------
>   Seq Scan on indethom  (cost=0.00..1310352.72 rows=10631972 width=212)
>     SubPlan
>       ->  Index Scan using sectiones_ndx on sectiones  (cost=0.00..6.03 rows=1 width=4)
>             Index Cond: ((nomeoper = $0) AND (refere1a = $1) AND (refere1b = $2) AND (refere2a = $3) AND (refere2b = $4) AND (refere3a = $5) AND (refere3b = $6) AND (refere4a = $7) AND (refere4b = $8))
> (4 rows)

Re: Update performance ... is 200,000 updates per hour what I should expect?

From
Tom Lane
Date:
Erik Norvelle <erik@norvelle.net> writes:
> update indethom
>     set query_counter = nextval('s2.query_counter_seq'),           -- Just
> for keeping track of how fast the query is running
>     sectref = (select clavis from s2.sectiones where
>         s2.sectiones.nomeoper = indethom.nomeoper
>         and s2.sectiones.refere1a = indethom.refere1a and
> s2.sectiones.refere1b = indethom.refere1b
>         and s2.sectiones.refere2a = indethom.refere2a and
> s2.sectiones.refere2b = indethom.refere2b
>         and s2.sectiones.refere3a = indethom.refere3a and
> s2.sectiones.refere3b = indethom.refere3b
>         and s2.sectiones.refere4a = indethom.refere4a and
> s2.sectiones.refere4b = indethom.refere4b);

This is effectively forcing a nestloop-with-inner-indexscan join.  You
might be better off with

update indethom
    set query_counter = nextval('s2.query_counter_seq'),
        sectref = sectiones.clavis
from s2.sectiones
where
    s2.sectiones.nomeoper = indethom.nomeoper
    and s2.sectiones.refere1a = indethom.refere1a
    and s2.sectiones.refere1b = indethom.refere1b
    and s2.sectiones.refere2a = indethom.refere2a
    and s2.sectiones.refere2b = indethom.refere2b
    and s2.sectiones.refere3a = indethom.refere3a
    and s2.sectiones.refere3b = indethom.refere3b
    and s2.sectiones.refere4a = indethom.refere4a
    and s2.sectiones.refere4b = indethom.refere4b;

            regards, tom lane

Re: Update performance ... is 200,000 updates per hour what I should expect?

From
Greg Stark
Date:
Erik Norvelle <erik@norvelle.net> writes:

> Here's the query I am running:
> update indethom
>     set query_counter = nextval('s2.query_counter_seq'), -- Just for keeping track of how fast the query is running
>     sectref = (select clavis from s2.sectiones where
>         s2.sectiones.nomeoper = indethom.nomeoper
>         and s2.sectiones.refere1a = indethom.refere1a and s2.sectiones.refere1b = indethom.refere1b
>         and s2.sectiones.refere2a = indethom.refere2a and s2.sectiones.refere2b = indethom.refere2b
>         and s2.sectiones.refere3a = indethom.refere3a and s2.sectiones.refere3b = indethom.refere3b
>         and s2.sectiones.refere4a = indethom.refere4a and s2.sectiones.refere4b = indethom.refere4b);
>
> Here's the query plan:
>                             QUERY PLAN
> -------------------------------------------------------------------------------------
>  Seq Scan on indethom (cost=0.00..1310352.72 rows=10631972 width=212)
>    SubPlan
>      -> Index Scan using sectiones_ndx on sectiones (cost=0.00..6.03 rows=1 width=4)
>            Index Cond: ((nomeoper = $0) AND (refere1a = $1) AND (refere1b = $2) AND (refere2a = $3) AND (refere2b = $4) AND (refere3a = $5) AND (refere3b = $6) AND (refere4a = $7) AND (refere4b = $8))
> (4 rows)

Firstly, you might try running "vacuum full" on both tables. If there are
tons of leftover dead records, they could be slowing down the update.
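
That is, something along these lines (vacuum full takes an exclusive lock on
each table, so run it while nothing else needs them, and analyze afterwards
so the planner has fresh statistics):

vacuum full indethom;
vacuum full s2.sectiones;
analyze indethom;
analyze s2.sectiones;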

This isn't the fastest possible plan but it's pretty good.

You might be able to get it somewhat faster using the non-standard "from"
clause on the update statement.

update indethom
   set sectref = clavis
  from s2.sectiones
 where sectiones.nomeoper = indethom.nomeoper
   and sectiones.refere1a = indethom.refere1a
   and sectiones.refere1b = indethom.refere1b
   and sectiones.refere2a = indethom.refere2a
   and sectiones.refere2b = indethom.refere2b
   and sectiones.refere3a = indethom.refere3a
   and sectiones.refere3b = indethom.refere3b
   and sectiones.refere4a = indethom.refere4a
   and sectiones.refere4b = indethom.refere4b;

This might be able to use a merge join, which will take longer to get
started because it has to sort both tables, but might finish faster.
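
If it does go for a merge join, the sorts will want memory; you can give the
session more with something like (sort_mem is in kilobytes, and 64MB here is
just an arbitrary example value):

set sort_mem = 65536;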

You might also try paring the index down to just the two or three most
selective columns. Is it common that something matches refere1a and refere1b
but doesn't match the remaining columns? An 8-column index is a lot of
overhead. I'm not sure how much that affects lookup times, but it might be
substantial.
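
Something like this, say (which columns are actually worth keeping is for
you to judge from the data; the three leading ones here are just a guess):

create index sectiones_short_ndx on s2.sectiones (nomeoper, refere1a, refere1b);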


--
greg