Thread: Re: pgsql-sql-digest V1 #225

Re: pgsql-sql-digest V1 #225

From
"Steven M. Wheeler"
Date:
Two problems, one question:

First my configuration:
I have a 350MB+ DB with two tables in it.  One of the tables contains 2.5M+ rows,
the other is empty.  I have 1 index on the table with rows in it.  This is all
running on a Compaq 450MHz Pentium II with 256MB RAM, 1GB of swap.  The DB is on
its own 6GB EIDE drive.

Problem #1:
Using psql in interactive mode, I issue "select count(*) from currnt;".  The
postmaster starts to read data in, as seen in vmstat by the bi stat jumping
sharply.  Very rapidly, CPU goes to 0% idle, postmaster is using 99.4%, bi stat
runs about 12K for 60-80 seconds, and then goes to 0, and everything stays there.
I never get a return and the system stays maxed.  When the row count in this table
was below 2M, I would get a return count.

Problem #2:
Running psql in interactive mode, I issue "select * from currnt where
cast(statdate as text) like '03-%-1999);".  This runs for quite awhile, eats into
swap to the tune of 670MB+ and then bombs out with a palloc error.  After reading
one of the FAQs, I first set ulimit -d to 65535, and then set it to unlimited.  No
change, still bombed out.

The question:
Am I trying to run too large a DB?  Have I exceeded the capabilities of Postgres
in its current incarnation?  Have I got a hope in hell of getting this to work?  I
have already spent many, many, many, too many hours on these problems.

Thanks
--
Steven M. Wheeler
UNIX Engineering
The SABRE Group
(918) 292-4119
(918) 292-4165 FAX




Re: [SQL] Re: pgsql-sql-digest V1 #225

From
Tom Lane
Date:
"Steven M. Wheeler" <swheeler@sabre.com> writes:
> First my configuration:

Er, what version of Postgres?

> Using psql in interactive mode, I issue "select count(*) from
> currnt;".  The postmaster starts to read data in, as seen in vmstat by
> the bi stat jumping sharply.  Very rapidly, CPU goes to 0% idle,
> postmaster is using 99.4%, bi stat runs about 12K for 60-80 seconds,
> and then goes to 0, and everything stays there.  I never get a return
> and the system stays maxed.  When the row count in this table was
> below 2M, I would get a return count.

This is odd, all right.  Can you attach to the backend with a debugger
and see where it is (get a backtrace)?  It sounds like the backend's
going into an infinite loop, but I've never seen that sort of behavior
on such a simple query...

> Running psql in interactive mode, I issue "select * from currnt where
> cast(statdate as text) like '03-%-1999);".  This runs for quite
> awhile, eats into swap to the tune of 670MB+ and then bombs out with a
> palloc error.

This is a known problem that I hope to see fixed in 6.6, but it will
not be fixed for 6.5.   An expression involving any datatype more
complex than integers consumes temporary memory for each evaluation,
and currently the backend doesn't try to recover that memory until
end of statement.  So if you process enough tuples in one statement,
you run out of memory :-(.  We know how to fix this but it's too large
a change to make at this late stage of the 6.5 release cycle.

It sounds like this expression is consuming several hundred bytes per
iteration, which is more than I would've expected --- a couple dozen
bytes for the temporary text value should've been enough.  It could be
that the "like" operator itself is wasting memory internally, which
might be fixable now; I'll look into it.

In the meantime, you might be able to work around the problem by using
a less memory-hungry form of the WHERE expression --- for example, the
above query could probably be written without any text temporary asstatdate >= '03-01-1999' AND statdate <=
'03-31-1999';
This way only requires boolean intermediate values, which don't require
extra memory to be allocated.  (Haven't actually tried it, but I think
it should work.)

> Am I trying to run too large a DB?

No; there are people running Postgres DBs with individual tables larger
than 2Gb without trouble.  But it does depend on working around some of
the known limitations :-(.  The developers plan to address these
limitations in future releases, but there are only so many hours in the
day...
        regards, tom lane


Re: [SQL] Re: pgsql-sql-digest V1 #225

From
"Steven M. Wheeler"
Date:
Tom,

I have a little more information.

Version of Postgres: 6.4.2

Regarding your request for a backtrace, I recompiled with debugging and
profiling options on.  Subsequently I have attached the debugger and
interrupted the backend a number of times.  I keep coming up in mcount() and
a couple of hash functions.  BTW: I had let the query run for over 50+ hours
without it returning a value.  The offending SQL: select count(*) from
currnt;  Is there something more definitive you would like me to do?

Using the syntax you suggested, "statdate >= 'date' and statdate <= 'date'"
works like a champ.  So does using "between" as suggested by another user.
However, issuing "select count(*) form currnt where statdate = '03-01-1999';
still returns an error telling me that I must use a cast since there is more
than one possible function.

A funny note.  Issuing the SQL: select count(*) from currnt where statdate
>= '03-01-1999' and statdate <= '05-31-1999'; works just fine (runtime:
00:07:30:81, nice and fast) and returns the count 2956630.  The laugh is
that this date range covers everything in the database, the same as the
above offending syntax should.

Thanks for the assist!
--
Steven M. Wheeler
UNIX Engineering
The SABRE Group
(918) 292-4119
(918) 292-4165 FAX

Tom Lane wrote:

> "Steven M. Wheeler" <swheeler@sabre.com> writes:
> > First my configuration:
>
> Er, what version of Postgres?
>
> > Using psql in interactive mode, I issue "select count(*) from
> > currnt;".  The postmaster starts to read data in, as seen in vmstat by
> > the bi stat jumping sharply.  Very rapidly, CPU goes to 0% idle,
> > postmaster is using 99.4%, bi stat runs about 12K for 60-80 seconds,
> > and then goes to 0, and everything stays there.  I never get a return
> > and the system stays maxed.  When the row count in this table was
> > below 2M, I would get a return count.
>
> This is odd, all right.  Can you attach to the backend with a debugger
> and see where it is (get a backtrace)?  It sounds like the backend's
> going into an infinite loop, but I've never seen that sort of behavior
> on such a simple query...
>
> > Running psql in interactive mode, I issue "select * from currnt where
> > cast(statdate as text) like '03-%-1999);".  This runs for quite
> > awhile, eats into swap to the tune of 670MB+ and then bombs out with a
> > palloc error.
>
> This is a known problem that I hope to see fixed in 6.6, but it will
> not be fixed for 6.5.   An expression involving any datatype more
> complex than integers consumes temporary memory for each evaluation,
> and currently the backend doesn't try to recover that memory until
> end of statement.  So if you process enough tuples in one statement,
> you run out of memory :-(.  We know how to fix this but it's too large
> a change to make at this late stage of the 6.5 release cycle.
>
> It sounds like this expression is consuming several hundred bytes per
> iteration, which is more than I would've expected --- a couple dozen
> bytes for the temporary text value should've been enough.  It could be
> that the "like" operator itself is wasting memory internally, which
> might be fixable now; I'll look into it.
>
> In the meantime, you might be able to work around the problem by using
> a less memory-hungry form of the WHERE expression --- for example, the
> above query could probably be written without any text temporary as
>         statdate >= '03-01-1999' AND statdate <= '03-31-1999';
> This way only requires boolean intermediate values, which don't require
> extra memory to be allocated.  (Haven't actually tried it, but I think
> it should work.)
>
> > Am I trying to run too large a DB?
>
> No; there are people running Postgres DBs with individual tables larger
> than 2Gb without trouble.  But it does depend on working around some of
> the known limitations :-(.  The developers plan to address these
> limitations in future releases, but there are only so many hours in the
> day...
>
>                         regards, tom lane



Re: [SQL] Re: pgsql-sql-digest V1 #225

From
Bruce Momjian
Date:
> Tom,
> 
> I have a little more information.
> 
> Version of Postgres: 6.4.2
> 
> Regarding your request for a backtrace, I recompiled with debugging and
> profiling options on.  Subsequently I have attached the debugger and
> interrupted the backend a number of times.  I keep coming up in mcount() and
> a couple of hash functions.  BTW: I had let the query run for over 50+ hours
> without it returning a value.  The offending SQL: select count(*) from
> currnt;  Is there something more definitive you would like me to do?
> 
> Using the syntax you suggested, "statdate >= 'date' and statdate <= 'date'"
> works like a champ.  So does using "between" as suggested by another user.
> However, issuing "select count(*) form currnt where statdate = '03-01-1999';
> still returns an error telling me that I must use a cast since there is more
> than one possible function.

I hate to say this but 6.5 is just about to be released, and has major
hashjoin fixes that may fix this problem.  Can you try the most recent
snapshot on ftp.postgresql.org.

--  Bruce Momjian                        |  http://www.op.net/~candle maillist@candle.pha.pa.us            |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: [SQL] Re: pgsql-sql-digest V1 #225

From
Tom Lane
Date:
"Steven M. Wheeler" <swheeler@sabre.com> writes:
> Regarding your request for a backtrace, I recompiled with debugging and
> profiling options on.  Subsequently I have attached the debugger and
> interrupted the backend a number of times.  I keep coming up in mcount() and
> a couple of hash functions.  BTW: I had let the query run for over 50+ hours
> without it returning a value.  The offending SQL: select count(*) from
> currnt;  Is there something more definitive you would like me to do?

If you could interrupt the backend a few times and provide a full
backtrace (gdb "bt" command) each time, we could maybe form a picture of
what the heck it's doing.  This report does seem *very* odd, especially
your discovery that adding a "where" clause speeds it up.  (That'd be
fine if the where clause eliminated many rows, but since it doesn't...)

Also, it would be useful to know what "explain" says about how the query
will be executed.  I'd expect an index scan for the "select ... where"
case, and a plain sequential scan for the case without where; if it's
doing something else that would be important to know.

One more thing --- exactly what is the declaration of the currnt table,
and of its indexes if any?

BTW, I concur with Bruce's suggestion to try a recent 6.5 snapshot.
I don't see any hashjoin going on here, but it is true that we've
squashed a remarkable number of bugs between 6.4.* and 6.5.  Perhaps
you are hitting one of them.
        regards, tom lane


Re: [SQL] Re: pgsql-sql-digest V1 #225

From
"Steven M. Wheeler"
Date:
Boy is my face red;-}

I just found out that my vacuum has not been running.  Once I vacuumed my DB the
performance problems went away.  I can now issue a select count(*) from currnt;
and get a response in less than 45 seconds.

Also, I just sent a message to pgsql-admin about performance with pg_dumpall.
Guess what?  The vacuum fixed that too.  Dumped my 1+GB DB in under 2 minutes (I
didn't time it closely).


Thanks again for all the help.
--
Steven M. Wheeler
UNIX Engineering
The SABRE Group
(918) 292-4119
(918) 292-4165 FAX

Tom Lane wrote:

> "Steven M. Wheeler" <swheeler@sabre.com> writes:
> > Regarding your request for a backtrace, I recompiled with debugging and
> > profiling options on.  Subsequently I have attached the debugger and
> > interrupted the backend a number of times.  I keep coming up in mcount() and
> > a couple of hash functions.  BTW: I had let the query run for over 50+ hours
> > without it returning a value.  The offending SQL: select count(*) from
> > currnt;  Is there something more definitive you would like me to do?
>
> If you could interrupt the backend a few times and provide a full
> backtrace (gdb "bt" command) each time, we could maybe form a picture of
> what the heck it's doing.  This report does seem *very* odd, especially
> your discovery that adding a "where" clause speeds it up.  (That'd be
> fine if the where clause eliminated many rows, but since it doesn't...)
>
> Also, it would be useful to know what "explain" says about how the query
> will be executed.  I'd expect an index scan for the "select ... where"
> case, and a plain sequential scan for the case without where; if it's
> doing something else that would be important to know.
>
> One more thing --- exactly what is the declaration of the currnt table,
> and of its indexes if any?
>
> BTW, I concur with Bruce's suggestion to try a recent 6.5 snapshot.
> I don't see any hashjoin going on here, but it is true that we've
> squashed a remarkable number of bugs between 6.4.* and 6.5.  Perhaps
> you are hitting one of them.
>
>                         regards, tom lane