Vacuum & Memory Exhausted error

From: Greg Spiegelberg
Morning,

I have multiple databases running on a Solaris 9, PostgreSQL 7.3.2
system, and nightly I run a process that, for each database, first
performs a "VACUUM ANALYZE;" and then runs vacuumlo on the same database.

The problem is that the "VACUUM ANALYZE;" results in a

   ERROR:  Memory exhausted in AllocSetAlloc(1048575996)

The value (1048575996) doesn't change regardless of which database the
vacuum is running on.  Additionally, the tables in all of our databases
are reindexed every 6 hours.

That's the problem.

My current workaround is, prior to the "VACUUM ANALYZE;", to drop all
indexes on all non-system, non-postgres tables, then perform the
vacuum on each non-system, non-postgres table, and finally recreate
the indexes.  The reindex every 6 hours is currently disabled as well.
This is the only solution I have to date.
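
One way to capture the index definitions before dropping them, so they
can be recreated afterwards, is a query like this (the tablename filter
is just a simple way of skipping the system catalogs):

    -- Save the CREATE INDEX statements for non-system tables so the
    -- indexes can be rebuilt after the vacuum completes
    SELECT indexdef || ';'
      FROM pg_indexes
     WHERE tablename NOT LIKE 'pg_%';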

Any input would be greatly appreciated.

Greg

--
Greg Spiegelberg
  Sr. Product Development Engineer
  Cranel, Incorporated.
  Phone: 614.318.4314
  Fax:   614.431.8388
  Email: gspiegelberg@Cranel.com
Cranel. Technology. Integrity. Focus.



Re: Vacuum & Memory Exhausted error

From: Tom Lane
Greg Spiegelberg <gspiegelberg@cranel.com> writes:
> The problem is that the "VACUUM ANALYZE;" results in a
>    ERROR:  Memory exhausted in AllocSetAlloc(1048575996)

If you do VACUUM and ANALYZE separately, which part shows the failure?
(I'm betting on ANALYZE.)  And have you narrowed down which table it
fails on?  (Adding the VERBOSE option would let you track that.)
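
Concretely, the split might look like this (the table name is only a
placeholder):

    -- Run the two phases separately; VERBOSE reports each table as it
    -- is processed, so the last table named before the error is suspect
    VACUUM VERBOSE;
    ANALYZE VERBOSE;

    -- Once a suspect table is identified, it can be hit directly
    ANALYZE VERBOSE suspect_table;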

This looks to me like a data-corruption problem, specifically a bad
value of the length word in a variable-length field.  (Adjacent data
is probably clobbered too, but the length problem is showing itself
first.)  See previous threads in the archives for general advice about
tracking down broken rows and getting rid of them.

            regards, tom lane

Re: Vacuum & Memory Exhausted error

From: Tom Lane
Greg Spiegelberg <gspiegelberg@cranel.com> writes:
> I have isolated the table but can't seem to find the info on finding
> the bad row and removing it.

Well, it's not rocket science, you just do trial and error to see which
rows you can select without getting the error.  I'd try a SELECT
COUNT(*) first to check that there is no corruption of tuple headers.
If that works, identify which column contains the damage by seeing
whether you can do SELECT max(col) for each column left-to-right.
Then identify the broken row by doing
    SELECT broken-col FROM table OFFSET n LIMIT 1
for various values of n --- this reads and discards n rows then reads
and returns another, so if it doesn't fail then the first n+1 rows are
good.  If you understand the principle of binary search you can home
in on the target row quickly.  Actually it's a little bit complicated
because of an undocumented fact: in current releases the LIMIT mechanism
actually reads one more row than it needs to, so when you have narrowed
down the exact n at which it first fails, the broken row is actually
identifiable by
    SELECT ctid FROM table OFFSET n+1 LIMIT 1
Double-check that you have identified the right row by verifying that
    SELECT * FROM table WHERE ctid = 'ctid obtained above'
blows up --- if not, you're off by one in the LIMIT stuff.
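
As a rough sketch, the trial-and-error loop above can be driven from the
shell like this (the database, table, and column names are placeholders;
this only automates the binary search described above):

    # Rough sketch: binary-search for the smallest OFFSET n at which the
    # query first fails.  dbname, bad_table, and broken_col are placeholders.
    lo=0
    hi=`psql -t -A -d dbname -c "SELECT count(*) FROM bad_table"`
    while [ $lo -lt $hi ]; do
        mid=`expr \( $lo + $hi \) / 2`
        if psql -d dbname -c \
            "SELECT broken_col FROM bad_table OFFSET $mid LIMIT 1" >/dev/null 2>&1
        then
            lo=`expr $mid + 1`   # no error: the failure point is past this offset
        else
            hi=$mid              # error: the failure point is at or before this offset
        fi
    done
    echo "first failing OFFSET: $lo"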

Once you have found the broken row, learn what you can from it (with
luck you can select at least the first few columns) and then delete it
by ctid.
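
For example (the ctid value here is purely illustrative):

    -- Inspect whatever is still readable in the damaged row, then remove it.
    -- '(128,7)' is an example ctid only; use the value found above.
    SELECT first_col, second_col FROM bad_table WHERE ctid = '(128,7)';
    DELETE FROM bad_table WHERE ctid = '(128,7)';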

This all assumes that there is exactly one point of corruption, which is
a really bad assumption when dealing with real cases.  Keep in mind that
there is likely to be more than one broken row, and that some of the
corruption may show only as incorrect values and not anything that
provokes an error report.  Once you can select all the data in the
table, do what you can to validate your data.

BTW, "ctid" is the physical location of a row, which is expressed in the
form '(block number, line number)'.  Once you have determined which
block(s) contain broken data, it would be interesting to dump them out
with a tool like pg_filedump (see http://sources.redhat.com/rhdb/).
The pattern of wrong data might possibly suggest something about the
cause.
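
To find the underlying file to feed to such a tool, something like this
should work (names are placeholders; with the default 8K block size, the
damaged block starts at block_number * 8192 bytes into that file):

    -- The table's data file lives under $PGDATA/base/<database oid>/<relfilenode>
    SELECT oid FROM pg_database WHERE datname = 'dbname';
    SELECT relfilenode FROM pg_class WHERE relname = 'bad_table';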

            regards, tom lane