Thread: ehm...

ehm...

From: andrea gelmini
Well, I will be a little verbose, but I am afraid this could be a big bug
in PostgreSQL, so...

I've made a script in Python that, using the psycopg module, puts the freecddb
archive into a relational DB. It does a few SELECTs and a lot of INSERTs for
each file of freecddb (something like 500,000 files).
The process is slow (most of the time is spent reading the small files, parsing
them and generating the right queries), so I have tried it just 8 times, and
now I need to know whether I am doing something wrong or whether it is a real
PostgreSQL problem (having seen the problem of duplicated keys, and so on).
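
Just to give an idea of the workload, the script does roughly something like
this. This is not the real code: the parsing is stubbed out, and the "disc"
table, its columns, the DSN and the parameter style are just placeholders; only
the "author" table name is real (it shows up in the error output below).

    # Rough sketch of the loader; everything except the "author" table name
    # is a placeholder, not taken from the real script.
    import os
    import psycopg

    def parse_file(path):
        # stand-in for the real freecddb parser
        return {"author": os.path.basename(path), "title": "unknown"}

    conn = psycopg.connect("dbname=freecddb")
    curs = conn.cursor()

    archive = "/archive/freecddb"
    for name in os.listdir(archive):          # something like 500,000 small files
        rec = parse_file(os.path.join(archive, name))

        # a few SELECTs to look up existing rows...
        curs.execute("SELECT id FROM author WHERE name = %s", (rec["author"],))
        if curs.fetchone() is None:
            curs.execute("INSERT INTO author (name) VALUES (%s)", (rec["author"],))

        # ...and a lot of INSERTs for the new data
        curs.execute("INSERT INTO disc (title, author) VALUES (%s, %s)",
                     (rec["title"], rec["author"]))
        conn.commit()                         # one transaction per file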

Well, just to make this a little bit more of a bug report:

os: linux 2.4.18pre1 (also tried with 2.4.17)
distribution: debian unstable
gcc: gcc version 3.0.3 (also tried with 3.0.2)
python: 2.1
psycopg: http://packages.debian.org/unstable/non-us/python2.1-psycopg.html
postgresql: different CVS snapshots over the last 8 days

Now, what happens is simple:
I run my script, and after a lot of INSERTs (the error below appeared after
76,242 files parsed), if I run vacuumdb I get various errors, for example:

vacuumdb -a -v -f -z

NOTICE:  Analyzing author
FATAL 2:  open of /home/postgres/db/pg_clog/0000 failed: No such file or directory
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
connection to server was lost
vacuumdb: vacuum  freecddb failed

I've got errors like this using vacuum with and without '-f', both with the DB
under load and with the DB doing nothing other than the vacuum.
Sometimes I've got duplicated key errors (an already known problem), but that
time I didn't cut & paste the error, believing it was my fault due to some
mistake in my schema.

Now, I can investigate more deeply, but I need your opinion on whether it is
worth doing or not. As I said, it takes a long time to reproduce this (and
maybe I'm doing something wrong).

Thanks a lot for your work,
andrea

N.B.: if I run other DBs on the same Postgres installation I get no problems,
but they are a lot smaller and see few reads and even fewer writes per day
(3-4 INSERTs, without foreign keys, and about a hundred SELECTs).

Re: ehm...

From: Tom Lane
andrea gelmini <andrea.gelmini@linux.it> writes:
> vacuumdb -a -v -f -z

> NOTICE:  Analyzing author
> FATAL 2:  open of /home/postgres/db/pg_clog/0000 failed: No such file or directory
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> connection to server was lost
> vacuumdb: vacuum  freecddb failed

> I've got errors like this using vacuum with and without '-f', both with the
> DB under load and with the DB doing nothing other than the vacuum.
> Sometimes I've got duplicated key errors (an already known problem), but that
> time I didn't cut & paste the error, believing it was my fault due to some
> mistake in my schema.

> Now, I can investigate more deeply, but I need your opinion on whether it is
> worth doing or not. As I said, it takes a long time to reproduce this (and
> maybe I'm doing something wrong).

Yup, it looks like a bug to me.  Apparently a CLOG segment has been
recycled too soon.  We just found a bug of that ilk in sequence
processing, but VACUUM doesn't touch sequences, so apparently you have
a different bug.  Please submit details.

Since CLOG segments normally hold a million transactions each, it'll
necessarily take a long time to reproduce any problem of this kind.
If you don't mind doing an initdb, you could reduce the CLOG segment
size to make it easier to try to reproduce the problem.  In
src/backend/access/transam/clog.c change
#define CLOG_XACTS_PER_SEGMENT    0x100000
to 0x10000 (I think that's about as small as you can make it without
breaking anything).  That gives you a shot at a problem every 64K
transactions.

            regards, tom lane

Re: ehm...

From: Tom Lane
andrea gelmini <andrea.gelmini@linux.it> writes:
> On Fri, Jan 11, 2002 at 01:56:12 -0500, Tom Lane wrote:
>> Yup, it looks like a bug to me.  Apparently a CLOG segment has been
>> recycled too soon.  We just found a bug of that ilk in sequence
>> processing, but VACUUM doesn't touch sequences, so apparently you have
>> a different bug.  Please submit details.
> Yes, I will retry tonight.
> What kind of info do you want?
> By the way, I can give you the script and the rest of the things you need
> (and it will be very easy for you to reproduce the problem... the script is
> very simple, and the schema is nothing more than a few tables).

Yes, I would like a copy of your script.

I just found and fixed a problem that could lead to this kind of failure
(reference to an already-deleted CLOG entry).  However, I'm not sure
that the problem I fixed could affect VACUUM itself.  Please update from
cvs (make sure you get src/backend/utils/time/tqual.c v 1.46) and then
see if you still see the problem.  If you do, a backtrace from the point
where elog() is called would be really helpful.

            regards, tom lane

Re: ehm...

From: Tom Lane
andrea gelmini <andrea.gelmini@linux.it> writes:
> You were right, and there's no need to vacuum the DB to crash the
> postmaster... Doing a fresh installation and a fresh initdb, then starting my
> script (and a second bash script that does nothing more than a simple
> INSERT), after about 1 million transactions I can see the problem (after
> changing the sources as you told me), without vacuumdb.

Hm, you never did a VACUUM at all?  Don't see how that can be ---
removal of CLOG segments is triggered by VACUUM.

Please let me know if you can still see the problem with today's snapshot.

            regards, tom lane

Re: ehm...

From: andrea gelmini
On Fri, Jan 11, 2002 at 01:56:12 -0500, Tom Lane wrote:
> Yup, it looks like a bug to me.  Apparently a CLOG segment has been
> recycled too soon.  We just found a bug of that ilk in sequence
> processing, but VACUUM doesn't touch sequences, so apparently you have
> a different bug.  Please submit details.
Yes, I will retry tonight.
What kind of info do you want?
By the way, I can give you the script and the rest of the things you need (and
it will be very easy for you to reproduce the problem... the script is very
simple, and the schema is nothing more than a few tables).

> If you don't mind doing an initdb, you could reduce the CLOG segment
No problem, I do an initdb every time I recompile from CVS...

> size to make it easier to try to reproduce the problem.  In
> src/backend/access/transam/clog.c change
> #define CLOG_XACTS_PER_SEGMENT    0x100000
> to 0x10000 (I think that's about as small as you can make it without
> breaking anything).  That gives you a shot at a problem every 64K
> transactions.

Thanks for your time,
andrea

Re: ehm...

From: andrea gelmini
On Fri, Jan 11, 2002 at 03:16:12 -0500, Tom Lane wrote:
> Yes, I would like a copy of your script.
I will package the script and 'instructions' soon.
I've also got two snapshots of the DB after the crash (tar files of the
postgres directory).

>
> I just found and fixed a problem that could lead to this kind of failure
> (reference to an already-deleted CLOG entry).  However, I'm not sure
Yesterday I spent the afternoon doing a lot of tests... now, I think you
don't need the details, because...

> that the problem I fixed could affect VACUUM itself.  Please update from
You were right, and there's no need to vacuum the DB to crash the postmaster...
Doing a fresh installation and a fresh initdb, then starting my script (and a
second bash script that does nothing more than a simple INSERT), after about
1 million transactions I can see the problem (after changing the sources as you
told me), without vacuumdb.
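
Roughly, that second script just does the following, in an endless loop (shown
here as a Python equivalent; the real one is a trivial bash script, and the
"noise" table, its column and the DSN are only placeholders):

    # One single-row INSERT per transaction, forever, just to consume
    # transaction IDs. Table name and DSN are placeholders.
    import psycopg

    conn = psycopg.connect("dbname=freecddb")
    curs = conn.cursor()

    i = 0
    while True:
        curs.execute("INSERT INTO noise (n) VALUES (%s)", (i,))
        conn.commit()       # each INSERT commits separately, burning one XID
        i += 1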

> cvs (make sure you get src/backend/utils/time/tqual.c v 1.46) and then
> see if you still see the problem.  If you do, a backtrace from the point
> where elog() is called would be really helpful.
Tomorrow I will redo all my tests, but I guess the bug is fixed now.

Thanks a lot for your work,
andrea

N.B.: when you come to Italy I'll be happy to be your host...

Re: ehm...

From: andrea gelmini
On Sun, Jan 13, 2002 at 11:30:08 -0500, Tom Lane wrote:
> Hm, you never did a VACUUM at all?  Don't see how that can be ---
> removal of CLOG segments is triggered by VACUUM.
Uhm... I was sure I hadn't run a vacuum... but... well... so I made other
tests.
>
> Please let me know if you can still see the problem with today's snapshot.

I've tried over the last few days, and it seems everything works right now...

Thanks a lot for your time,
andrea