Thread: Transaction progress

Transaction progress

From
Pablo Yaggi
Date:
Hi,
Is there any way to know a transaction's progress? I'm executing queries from psql and I have a query that has
been running for 24 hours, and I don't know when it will end. Is there any way I could find out?
If not, can I estimate the time from the explain result? How?
This is the explain result for the query I'm running:

Hash Join  (cost=100347487.59..145240934.03 rows=26 width=60)
  ->  Index Scan using inars_rectificaciones_inx on inars_rectificaciones b2  (cost=0.00..37839140.62 rows=9546466
width=34)
  ->  Hash  (cost=100347379.07..100347379.07 rows=43407 width=26)
        ->  Seq Scan on inars_nocontrib_perm1  (cost=100000000.00..100347379.07 rows=43407 width=26)

and this is the query:

update inars_nocontrib_perm1 set
    rectificada=TRUE, actividad=b2.actividad, presentacion=b2.presentacion, remuneracion=b2.remuneracion
from inars_rectificaciones b2
where inars_nocontrib_perm1.cuil=b2.cuil and inars_nocontrib_perm1.cuit=b2.cuit
  and inars_nocontrib_perm1.ano=b2.ano and inars_nocontrib_perm1.mes=b2.mes;

Thanks in advance,

Pablo Yaggi


Re: Transaction progress

From
Manfred Koizar
Date:
On Sun, 19 Jan 2003 14:53:49 -0300, Pablo Yaggi
<pyaggi@aulamagna.com.ar> wrote:
>if not, can I estimate the time from the explain result ? how ?
>this is the explain result about the query I'm running:
>
>Hash Join  (cost=100347487.59..145240934.03 rows=26 width=60)
>  ->  Index Scan using inars_rectificaciones_inx on inars_rectificaciones b2
>                  (cost=0.00..37839140.62 rows=9546466 width=34)
                                                ^^^^^^^       ^^
                                                  (c)         (d)
>  ->  Hash  (cost=100347379.07..100347379.07 rows=43407 width=26)
>        ->  Seq Scan on inars_nocontrib_perm1
>                 (cost=100000000.00..100347379.07 rows=43407 width=26)
                        ^^^^^^^^^^^^     ^^^^^^
                         (a)               (b)

The end result of explain, 145240934.03, basically tells us that you
have SET enable_seqscan = OFF (a).

(b) looks a bit strange.  Does inars_nocontrib_perm1 have 300000 pages
but only 43407 tuples?  If so, you should VACUUM more often.  Anyway
this seq scan contributes only a small fraction to the overall cost.
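
If you want to check it, VACUUM VERBOSE reports the table's page and tuple counts (adding ANALYZE also
refreshes the statistics the planner works from):

  vacuum verbose analyze inars_nocontrib_perm1;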

(c) There are almost 10M rows in inars_rectificaciones, the width of
which is at least 34 (d).  Counting for overhead and columns not used
by your statement, let's guess 80 bytes per tuple (might be more).
This gives at least 800MB for the whole relation.  Assuming that your
physical memory is much less than this and the index
inars_rectificaciones_inx doesn't retrieve tuples in physical order,
you need one random page read per tuple.  So your question boils down
to how long it takes to do 10M random reads on your hardware.

I have to guess again: 4ms seek, 4ms latency.  Ignoring CPU and
transfer times we get 8ms per page access or 80000 seconds (more than
22 hours) for the query on average PC hardware.  This could be much
more, if your disk is fragmented or slow or both ...
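
Spelled out, the back-of-envelope figures above come to roughly (both the ~80 bytes/tuple and the ~8 ms/read
are guesses, as noted):

  9,546,466 tuples * ~80 bytes   =~ 760 MB           (the relation alone, before indexes)
  9,546,466 reads  * ~8 ms/read  =~ 76,000 s =~ 21 hours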

Next time you run that query SET enable_seqscan = ON;  this should cut
down execution time to way below one hour.  YMMV.

Servus
 Manfred

Re: Transaction progress

From
Pablo Yaggi
Date:
On Sunday 19 January 2003 07:08 pm, Manfred Koizar wrote:
> On Sun, 19 Jan 2003 14:53:49 -0300, Pablo Yaggi
>
> <pyaggi@aulamagna.com.ar> wrote:
> >if not, can I estimate the time from the explain result ? how ?
> >this is the explain result about the query I'm running:
> >
> >Hash Join  (cost=100347487.59..145240934.03 rows=26 width=60)
> >  ->  Index Scan using inars_rectificaciones_inx on inars_rectificaciones
> > b2 (cost=0.00..37839140.62 rows=9546466 width=34)
>
>                                                 ^^^^^^^       ^^
>                                                   (c)         (d)
>
> >  ->  Hash  (cost=100347379.07..100347379.07 rows=43407 width=26)
> >        ->  Seq Scan on inars_nocontrib_perm1
> >                 (cost=100000000.00..100347379.07 rows=43407 width=26)
>
>                         ^^^^^^^^^^^^     ^^^^^^
>                          (a)               (b)
>
> The end result of explain, 145240934.03, basically tells us that you
> have SET enable_seqscan = OFF (a).
You're right, that's a configuration error. I fixed it, but the query is still running,
so the fix will only take effect after I restart the server. But the planner says it will use
a seq scan anyway, so isn't it the same whether enable_seqscan is on or not?

>
> (b) looks a bit strange.  Does inars_nocontrib_perm1 have 300000 pages
> but only 43407 tuples?  If so, you should VACUUM more often.  Anyway
> this seq scan contributes only a small fraction to the overall cost.
How did you work out that inars_nocontrib_perm1 is 300000 pages?
I did a VACUUM before I executed the query that is still in progress, but the plan I sent last time
was made before that, so this is the output of the planner while the query is running:

Merge Join  (cost=209527960.44..247478144.43 rows=612 width=60)
  ->  Sort  (cost=209527960.44..209527960.44 rows=35037092 width=26)
        ->  Seq Scan on inars_nocontrib_perm1  (cost=100000000.00..100697315.92 rows=35037092 width=26)
  ->  Index Scan using inars_rectificaciones_inx on inars_rectificaciones b2  (cost=0.00..37838713.13 rows=9546358
width=34)


> (c) There are almost 10M rows in inars_rectificaciones, the width of
> which is at least 34 (d).  Counting for overhead and columns not used
34 (d)? Sorry, I didn't get it.

These are the tables and index:

create table inars_rectificaciones (
        cuil            int8,
        cuit            int8,
        ano             int2,
        mes             int2,
        presentacion    date,
        actividad       int2,
        remuneracion    float
        );

create table inars_contrib_perm1 (
        cuil            int8,
        cuit            int8,
        ano             int2,
        mes             int2,
        presentacion    date,
        rectificada     bool,
        actividad       int2,
        remuneracion    float
        );

create index inars_rectificaciones_inx on inars_rectificaciones (ano,mes,cuil,cuit);

> by your statement, let's guess 80 bytes per tuple (might be more).
> This gives at least 800MB for the whole relation.  Assuming that your
> physical memory is much less than this and the index
> inars_rectificaciones_inx doesn't retrieve tuples in physical order,
> you need one random page read per tuple.  So your question boils down
> to how long it takes to do 10M random reads on your hardware.

My memory size is 512MB and the processor is a dual Pentium 1.4GHz.


> I have to guess again: 4ms seek, 4ms latency.  Ignoring CPU and
> transfer times we get 8ms per page access or 80000 seconds (more than
> 22 hours) for the query on average PC hardware.  This could be much
> more, if your disk is fragmented or slow or both ...

Thanks a lot,
Pablo



Re: Transaction progress

From
Manfred Koizar
Date:
On Sun, 19 Jan 2003 20:22:45 -0300, Pablo Yaggi
<pyaggi@aulamagna.com.ar> wrote:
>But anyway the planner says that
>it will use Seq scan anyway, so is it not the same if enable_seqscan is on or not ?

"Seq scan anyway" on inars_nocontrib_perm1, but there's still that
index scan on inars_rectificaciones which will turn into a (much
faster) seq scan, if you allow it.

>how did you realize that inars_nocontrib_perm1 is 300000 pages ?

cost=100000000.00..100347379.07:
100000000.00 is a fake startup cost to talk the planner out of
choosing the seq scan.  From costsize.c: "The cost of reading a page
sequentially is 1.0, by definition."  347379.07 is the cost for
reading all pages plus processing all tuples.  Given the low number of
tuples (43407) the latter is not important, so I guessed 300K pages.
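
Roughly, assuming the default cpu_tuple_cost of 0.01 (an assumption about your settings):

  347379.07 =~ pages * 1.0  +  43407 tuples * 0.01
  pages     =~ 347379 - 434 =~ 347000      (hence the rough "300K pages" guess)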

>Merge Join  (cost=209527960.44..247478144.43 rows=612 width=60)
>  ->  Sort  (cost=209527960.44..209527960.44 rows=35037092 width=26)
>        ->  Seq Scan on inars_nocontrib_perm1  (cost=100000000.00..100697315.92 rows=35037092 width=26)
>  ->  Index Scan using inars_rectificaciones_inx on inars_rectificaciones b2  (cost=0.00..37838713.13 rows=9546358
width=34)

I have absolutely no feeling for how long a sort of 35M rows might
take.  Be sure to give it enough sort_mem; but OTOH not too much, the
whole dataset doesn't fit into memory, so it has to be written to disk
anyway.  Sort_mem (and shared_buffers) is discussed on -general,
-admin and -performance every now and then.  Search the list archives.

>> (c) There are almost 10M rows in inars_rectificaciones, the width of
>> which is at least 34 (d).  Counting for overhead and columns not used
>34 (d) ? Sorry, I didn't get it.

"34 (d)" means "34 bytes.  See note (d) above"

BTW, you join 35M rows to 10M rows and the planner expects to get only
612 rows.  Is this realistic?  If it is, can you change your
application to store the "candidates" in a third table as soon as they
are created?  Then you could get a plan like

  Nested loop
    -> Seq scan on candidates (rows=612)
    -> Index scan on inars_rectificaciones_inx ( rows=1 loops=612)
    -> Index scan on inars_nocontrib_perm1_inx ( rows=1 loops=612)

and an execution time of a few seconds.
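
A minimal sketch of that idea (the candidates table and its columns are made up here; the real keys are
whatever identifies the rows that still need updating):

  create table candidates (ano int2, mes int2, cuil int8, cuit int8);

  -- the application inserts the key of each newly created row that will later
  -- need rectification; the update then only has to touch those keys:
  update inars_nocontrib_perm1 set
      rectificada=TRUE, actividad=b2.actividad,
      presentacion=b2.presentacion, remuneracion=b2.remuneracion
  from inars_rectificaciones b2, candidates c
  where c.ano=b2.ano and c.mes=b2.mes and c.cuil=b2.cuil and c.cuit=b2.cuit
    and inars_nocontrib_perm1.ano=c.ano and inars_nocontrib_perm1.mes=c.mes
    and inars_nocontrib_perm1.cuil=c.cuil and inars_nocontrib_perm1.cuit=c.cuit;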

Servus
 Manfred

Re: Transaction progress

From
Pablo Yaggi
Date:
> BTW, you join 35M rows to 10M rows and the planner expects to get only
> 612 rows.  Is this realistic?  If it is, can you change your
> application to store the "candidates" in a third table as soon as they
> are created?  Then you could get a plan like

I expect the join to produce about 4M rows, so I think the planner is
wrong. About the data: I already have it and I'm preparing it for later
processing, so I can't go back to any original source.

I had also created an index, inars_nocontrib_perm1_inx, this way:

create index inars_nocontrib_perm1_inx on inars_nocontrib_perm1 (ano,mes,cuil,cuit)

but the planner didn't use it, as you can see. That's how I ended up breaking the config file
before (enable_seqscan=false).

So this is my scenario; the query I need to run is this:

update inars_nocontrib_perm1 set
    rectificada=TRUE, actividad=b2.actividad, presentacion=b2.presentacion, remuneracion=b2.remuneracion
from inars_rectificaciones b2
where inars_nocontrib_perm1.cuil=b2.cuil and inars_nocontrib_perm1.cuit=b2.cuit
  and inars_nocontrib_perm1.ano=b2.ano and inars_nocontrib_perm1.mes=b2.mes;
where
    inars_nocontrib_perm1 is about 35M rows
    inars_rectificaciones    is about 10M rows
    sort memory 4096

Based on your experience/calculations, could you give me some advice? Do I have to increase sort memory?
Do I have to change the query? The query has been running for about 28 hours now; do I stop it
and try something else? Is there any way to check how long, even estimated, it will take to finish?

Well, thanks a lot for your help; I hope I'm not abusing it,
Best Regards,
Pablo



Re: Transaction progress

From
Pablo Yaggi
Date:
Sorry,
    I checked and I noticed that the index on
inars_nocontrib_perm1 doesn't exist. Do you think it's better if I stop the
query and create that index?
Pablo

PS: I hope you read this one together with the last one. =)

Re: Transaction progress

From
Manfred Koizar
Date:
On Sun, 19 Jan 2003 22:23:49 -0300, Pablo Yaggi
<pyaggi@aulamagna.com.ar> wrote:
>I had also created and index inars_nocontrib_perm1_inx this way:
>
>create index inars_nocontrib_perm1_inx on inars_nocontrib_perm1 (ano,mes,cuil,cuit)

How long did this take?  Knowing this can help with estimating the sort cost.
As I told you I have no gut feeling for large sorts;  simply not
enough experience ...

Does this index still exist?

>but the planner didn't use it, as you can see. That's the way I broke the config file
>before (enable_seqscan=false).

If your tuples are physically ordered by ano, mes, cuil, cuit, then an
index scan is almost as fast as a seq scan and there is no need for a
separate sort step.  Unfortunately the planner has its problems with
multi column indices.  So if *you* know that tuples are stored in
index order in both relations, this might indeed be a good case for
setting enable_seqscan=off.

>based on your experience/calculation, could you give some advice, do I have to increase sort memory ?

Definitely!  I just don't know how much :-(
A shot in the dark: 60000 or even 120000, but don't leave it that
high when you go multiuser.
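
For this session only, that would be (sort_mem is given in kilobytes, so 60000 is about 60 MB):

  SET sort_mem = 60000;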

>do I have to change the query ?

If the sort turns out to be the problem and it is unavoidable, I'd do
several smaller updates:

UPDATE ... WHERE ano=2000::int2 AND mes=1::int2;
UPDATE ... WHERE ano=2000::int2 AND mes=2::int2;
...
UPDATE ... WHERE ano=2003::int2 AND mes=1::int2;

>... well something, the query is running from about 28 hours, do I stop it
>and try something else ? is there anyway to check how long, even estimated, it will take to finish ?

Not that I know of, except watching your disk files grow and trying to
estimate how many tuples have already been updated ...
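
A rough sketch of how to find the file to watch, assuming a 7.2-or-later server and shell access to the data
directory (substitute your own database name):

  select relfilenode from pg_class    where relname = 'inars_nocontrib_perm1';
  select oid         from pg_database where datname = 'yourdatabase';

  -- then watch $PGDATA/base/<database oid>/<relfilenode> grow with ls -l or du.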

Servus
 Manfred

Re: Transaction progress

From
Pablo Yaggi
Date:
Manfred,
    well, thanks a lot, you really helped me clarify many concepts
about postgres explain and setup parameters.
    About the query, it's finally done. It took about 40 hours and
updated about 9.1M rows, so the planner missed that.
    Once again, thanks a lot,
Best Regards
Pablo


