Thread: Re: Huge number of disk writes after migration to 8.1

Re: Huge number of disk writes after migration to 8.1

From
Tom Lane
Date:
Marcin <migor@op.pl> writes:
> Attached you'll find the archive with data you've asked.

Well, breaking down the contents of the stats file I find:
    190 backends (218880 bytes)
    118 databases (8496 bytes)
    81952 tables (8523008 bytes)
And the ps listing shows there actually are about 190 live backends,
so the idea about leaked backend stats entries is clearly wrong.

The per-table data is 104 bytes/table, up from 80 in 8.0, so that's
a noticeable increase but hardly a killer.  What I have to conclude
is that 8.1 is tracking stats for a lot more tables than 8.0 was.
(Do you happen to have a number for the size of the stats file in
the 8.0 installation?  I don't need to see its contents, I just want
to know how big it is.)

Do you actually have 81952 tables in the installation?  Maybe we are
looking at a table-level leak.  Do you have autovacuum on where it was
not in 8.0?  Maybe that's causing it.

            regards, tom lane

Re: Huge number of disk writes after migration to 8.1

From
Tom Lane
Date:
Marcin <migor@op.pl> writes:
> Autovacuum is off.
> The VACUUM FULL  bottom line is:
> INFO:  free space map contains 59842 pages in 25371 relations
> DETAIL:  A total of 426720 page slots are in use (including overhead).
> 426720 page slots are required to track all free space.

> And the quick iterations for all the DBs with psql -t -c "\dt" shows
> 67654 rows.

> So there's not 81952, BUT, the tables are created (and dropped) quite
> often during the work hours (and they're regular, not TEMPORARY tables).
>  I also find out, that there were 11170 tables created (and most of them
> dropped) today.

Looking at the code, stats entries for dropped tables are cleaned out
only when a VACUUM command is done; which is something we probably ought
to change.  However, that's always been the case, so I don't understand
why your stats file is so much bigger in 8.1.  Have you changed your
vacuuming strategy at all since the 8.0 installation?  Perhaps row or
block statistics weren't enabled in the 8.0 installation and are now?

            regards, tom lane

Re: Huge number of disk writes after migration to 8.1

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Marcin <migor@op.pl> writes:

> > So there's not 81952, BUT, the tables are created (and dropped) quite
> > often during the work hours (and they're regular, not TEMPORARY tables).
> >  I also find out, that there were 11170 tables created (and most of them
> > dropped) today.
>
> Looking at the code, stats entries for dropped tables are cleaned out
> only when a VACUUM command is done; which is something we probably ought
> to change.

I was going to ask if you were confusing pgstat_vacuum_tabstat with a
VACUUM command, when I noticed that only in vacuum() is that function
called!  This surprised me and I agree that it should be changed.  I'm
not sure what would be a more appropiate place to call it, however.

Maybe in heap_drop_with_catalog() we could place a call, or a limited
version that'd only "vacuum" the table being dropped.

--
Alvaro Herrera                 http://www.amazon.com/gp/registry/DXLWNGRJD34J
"Having your biases confirmed independently is how scientific progress is
made, and hence made our great society what it is today" (Mary Gardiner)

Re: Huge number of disk writes after migration to 8.1

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> Looking at the code, stats entries for dropped tables are cleaned out
>> only when a VACUUM command is done; which is something we probably ought
>> to change.

> I was going to ask if you were confusing pgstat_vacuum_tabstat with a
> VACUUM command, when I noticed that only in vacuum() is that function
> called!  This surprised me and I agree that it should be changed.  I'm
> not sure what would be a more appropiate place to call it, however.

I don't have a problem with that.  What I'm thinking is that a DROP
TABLE command should issue a tabpurge message for the specific table
(or index) being zapped.  We still need vacuum_tabstat as a backstop
in case the tabpurge message gets lost, though.

Another thought is that in autovacuum, pgstat_vacuum_tabstat is really
called too often: once per autovac cycle would be sufficient, but
instead it's repeated for each table we vacuum.

            regards, tom lane

Re: Huge number of disk writes after migration to 8.1

From
Marcin
Date:
On Tue, Jan 17, 2006 at 05:12:09PM -0500, Tom Lane wrote:
> However, that's always been the case, so I don't understand
> why your stats file is so much bigger in 8.1.  Have you changed your
> vacuuming strategy at all since the 8.0 installation?  Perhaps row or
> block statistics weren't enabled in the 8.0 installation and are now?

The only change was the upgrade to 8.1. I run VACUUM FULL everyday, and
double checked that it finished succesfully in last two days.
The block and row statistics are disabled:
stats_start_collector = on
stats_command_string = off
stats_block_level = off
stats_row_level = off
stats_reset_on_server_start = off

I have no idea, why the pgstat.stat is so large. I just shutdown the
idle backend connection, so there're only 30 left (instead of 200),
and the pgstat size decreases by 200KB from 8.8 to 8.6 MB. The lowest size
of pgstat in last six hours was 8.5 MB.

--
Marcin

Re: Huge number of disk writes after migration to 8.1

From
Tom Lane
Date:
Marcin <migor@op.pl> writes:
> On Tue, Jan 17, 2006 at 05:12:09PM -0500, Tom Lane wrote:
>> However, that's always been the case, so I don't understand
>> why your stats file is so much bigger in 8.1.  Have you changed your
>> vacuuming strategy at all since the 8.0 installation?  Perhaps row or
>> block statistics weren't enabled in the 8.0 installation and are now?

> The only change was the upgrade to 8.1. I run VACUUM FULL everyday, and
> double checked that it finished succesfully in last two days.
> The block and row statistics are disabled:
> stats_start_collector = on
> stats_command_string = off
> stats_block_level = off
> stats_row_level = off
> stats_reset_on_server_start = off

Stats off and it's still bloating the file??

[ studies code... ]  I see the culprit: it's the pgstat_report_vacuum
and pgstat_report_analyze routines that were added in 8.1.  Those send
messages unconditionally, meaning that the collector will create table
entries for every table during a database-wide vacuum, even with stats
turned off.

This seems like a bad idea.  Given the nature of what's counted, I think
that treating these messages as "row level" stats would be appropriate.
Alvaro, what do you think?

            regards, tom lane

Re: Huge number of disk writes after migration to 8.1

From
Alvaro Herrera
Date:
Tom Lane wrote:

> Stats off and it's still bloating the file??
>
> [ studies code... ]  I see the culprit: it's the pgstat_report_vacuum
> and pgstat_report_analyze routines that were added in 8.1.  Those send
> messages unconditionally, meaning that the collector will create table
> entries for every table during a database-wide vacuum, even with stats
> turned off.
>
> This seems like a bad idea.

Sorry, clearly my bug :-(

> Given the nature of what's counted, I think that treating these
> messages as "row level" stats would be appropriate.  Alvaro, what do
> you think?

Yeah, row level seems appropiate for what we use it.  I'll take care of
it, unless you want to do it.

--
Alvaro Herrera                                http://www.PlanetPostgreSQL.org
<inflex> really, I see PHP as like a strange amalgamation of C, Perl, Shell
<crab> inflex: you know that "amalgam" means "mixture with mercury",
       more or less, right?
<crab> i.e., "deadly poison"

Re: Huge number of disk writes after migration to 8.1

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Yeah, row level seems appropiate for what we use it.  I'll take care of
> it, unless you want to do it.

I'll fix it --- I want to put in an immediate tabpurge on drop, too.

            regards, tom lane

Re: Huge number of disk writes after migration to 8.1

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Tom Lane wrote:
>> Given the nature of what's counted, I think that treating these
>> messages as "row level" stats would be appropriate.  Alvaro, what do
>> you think?

> Yeah, row level seems appropiate for what we use it.  I'll take care of
> it, unless you want to do it.

Actually, there's another problem here: if we do have row-level stats
turned on, a manual "VACUUM" command will still cause the set of tables
listed in the stats file to grow to include every table in the DB,
whether or not anything interesting is happening to that table.  I think
this is probably undesirable.  I'm tempted to change pgstat_recv_vacuum
and pgstat_recv_analyze so that they will not create new hash entries,
but only insert the row count if the hash entry already exists.  I am a
bit worried that I might be missing something about possible
interactions with autovacuum though.  I see that autovac skips vacuuming
tables that have no hash entry ... is there some circular reasoning
going on there?

            regards, tom lane

Re: Huge number of disk writes after migration to 8.1

From
Alvaro Herrera
Date:
Tom Lane wrote:

> Actually, there's another problem here: if we do have row-level stats
> turned on, a manual "VACUUM" command will still cause the set of tables
> listed in the stats file to grow to include every table in the DB,
> whether or not anything interesting is happening to that table.  I think
> this is probably undesirable.  I'm tempted to change pgstat_recv_vacuum
> and pgstat_recv_analyze so that they will not create new hash entries,
> but only insert the row count if the hash entry already exists.  I am a
> bit worried that I might be missing something about possible
> interactions with autovacuum though.  I see that autovac skips vacuuming
> tables that have no hash entry ... is there some circular reasoning
> going on there?

The idea was that autovac should skip tables for which it doesn't have
info, because it can't decide and we chose to err on the side of
caution.  However, after a vacuum or analyze we do have info about the
table, which is what we store in the pgstat_recv functions
inconditionally.  Thus the next autovacuum is able to make an informed
decision about this table.

The principles are: 1) store as much information as possible,
2) autovacuum should act iff it has information.

I don't think this is a bug, but I'm not dead set on it if you want to
change this behavior.

--
Alvaro Herrera                 http://www.amazon.com/gp/registry/DXLWNGRJD34J
"The first of April is the day we remember what we are
the other 364 days of the year"  (Mark Twain)

Re: Huge number of disk writes after migration to 8.1

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Tom Lane wrote:
>> I'm tempted to change pgstat_recv_vacuum
>> and pgstat_recv_analyze so that they will not create new hash entries,
>> but only insert the row count if the hash entry already exists.  I am a
>> bit worried that I might be missing something about possible
>> interactions with autovacuum though.  I see that autovac skips vacuuming
>> tables that have no hash entry ... is there some circular reasoning
>> going on there?

> The idea was that autovac should skip tables for which it doesn't have
> info, because it can't decide and we chose to err on the side of
> caution.  However, after a vacuum or analyze we do have info about the
> table, which is what we store in the pgstat_recv functions
> inconditionally.  Thus the next autovacuum is able to make an informed
> decision about this table.

However, if it skips the table for lack of info, then there still won't
be any info next time.

More to the point, that's not why we skip entry-less tables.  We skip
entry-less tables because there is no need for autovac to do anything to
a table that's not being modified, and if it *is* being modified then
those operations will result in creation of a table entry.  So I'm still
not seeing why vacuum or analyze should force creation of an entry.

            regards, tom lane

Re: Huge number of disk writes after migration to 8.1

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > Tom Lane wrote:
> >> I'm tempted to change pgstat_recv_vacuum
> >> and pgstat_recv_analyze so that they will not create new hash entries,
> >> but only insert the row count if the hash entry already exists.  I am a
> >> bit worried that I might be missing something about possible
> >> interactions with autovacuum though.  I see that autovac skips vacuuming
> >> tables that have no hash entry ... is there some circular reasoning
> >> going on there?
>
> > The idea was that autovac should skip tables for which it doesn't have
> > info, because it can't decide and we chose to err on the side of
> > caution.  However, after a vacuum or analyze we do have info about the
> > table, which is what we store in the pgstat_recv functions
> > inconditionally.  Thus the next autovacuum is able to make an informed
> > decision about this table.
>
> However, if it skips the table for lack of info, then there still won't
> be any info next time.

Of course, because there's nowhere to take it from.

> More to the point, that's not why we skip entry-less tables.  We skip
> entry-less tables because there is no need for autovac to do anything to
> a table that's not being modified, and if it *is* being modified then
> those operations will result in creation of a table entry.  So I'm still
> not seeing why vacuum or analyze should force creation of an entry.

Well, if you issue a manual vacuum then you can argue that it has been
"modified" (it's seeing some activity).  But immediately after a vacuum
the table will not be vacuumed by autovac if there's no other activity,
because there's no need for it, so we can create the entry anyway and it
won't make an immediate difference.  However, it (to have the pgstat
table entry) _will_ make a difference as soon as that entry is modified
by other activity, because the number of tuples pgstat knows about will
be more accurate.

So maybe we could supress the entry creation on analyze (because analyze
_can_ force a vacuum during the next autovac iteration, which is
something we may want to avoid), but keep it on vacuum.

--
Alvaro Herrera       Valdivia, Chile   ICBM: S 39º 49' 17.7", W 73º 14' 26.8"
"The West won the world not by the superiority of its ideas or values
or religion but rather by its superiority in applying organized violence.
Westerners often forget this fact, non-Westerners never do."
(Samuel P. Huntington)

Re: Huge number of disk writes after migration to 8.1

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Well, if you issue a manual vacuum then you can argue that it has been
> "modified" (it's seeing some activity).  But immediately after a vacuum
> the table will not be vacuumed by autovac if there's no other activity,
> because there's no need for it, so we can create the entry anyway and it
> won't make an immediate difference.

You're ignoring the point of this thread, which is that creating
hashtable entries for tables that aren't actively being modified
causes significant ongoing overhead that we ought to try to minimize.

What I'm suggesting is that only foreground modification activity
(insert/update/delete) should trigger creation of a hashtable entry.
Once that has happened, it's reasonable to update the entry after
any vacuum or analyze.  But a vacuum or analyze shouldn't force an
entry to spring into existence when nothing else is happening to
the table.  That just bloats the stats file.

Given the current autovac logic, ordinary autovac operations won't cause
useless entries to be created anyway ... but an anti-wraparound vacuum
will, and so will manual vacuum or analyze commands.

            regards, tom lane

Re: Huge number of disk writes after migration to 8.1

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > Well, if you issue a manual vacuum then you can argue that it has been
> > "modified" (it's seeing some activity).  But immediately after a vacuum
> > the table will not be vacuumed by autovac if there's no other activity,
> > because there's no need for it, so we can create the entry anyway and it
> > won't make an immediate difference.
>
> You're ignoring the point of this thread, which is that creating
> hashtable entries for tables that aren't actively being modified
> causes significant ongoing overhead that we ought to try to minimize.

True.  I agree with the rest of the reasoning in this case.

Maybe the fact that the stat file is completely rewritten every 500 ms
should be reconsidered, if in the future someone chooses to rewrite
the stat system.  We can reconsider this part then, as well.

--
Alvaro Herrera                                http://www.PlanetPostgreSQL.org
Jason Tesser: You might not have understood me or I am not understanding you.
Paul Thomas: It feels like we're 2 people divided by a common language...

Re: Huge number of disk writes after migration to 8.1

From
Marcin
Date:
Tom Lane wrote:
> [ studies code... ]  I see the culprit: it's the pgstat_report_vacuum
> and pgstat_report_analyze routines that were added in 8.1.  Those send
> messages unconditionally, meaning that the collector will create table
> entries for every table during a database-wide vacuum, even with stats
> turned off.

So in my case disabling the daily vacuum would, in fact, improve the
write performance ;) ... just joking.

I think I turn off the stats completely, and wait patiently for a fix.
Do you think it will be included in one of next 8.1.x releases, or do I
have to wait for 8.2?

Thanks again for a great help,
--
Marcin

Re: Huge number of disk writes after migration to 8.1

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Maybe the fact that the stat file is completely rewritten every 500 ms
> should be reconsidered, if in the future someone chooses to rewrite
> the stat system.  We can reconsider this part then, as well.

Yeah, it's becoming pretty obvious that that design does not scale very
well.  I don't immediately have any ideas about a better way though.

I am working on some marginal hacks like not writing more of the backend
activity strings than is needed, but it'd be nicer to think of a
different solution.

            regards, tom lane

Re: Huge number of disk writes after migration to 8.1

From
Tom Lane
Date:
Marcin <migor@op.pl> writes:
> Do you think it will be included in one of next 8.1.x releases, or do I
> have to wait for 8.2?

I'm working on a fix for 8.1.3.

            regards, tom lane

pg_restore error checking

From
Stephen Frost
Date:
Greetings,

  pg_restore doesn't handle errors very well.  While attempting to
  upgraded a database from 8.0 to 8.1 there were some errors creating
  some tables because 8.0 had PostGIS installed (along with some tables)
  and 8.1 did not.  This also meant that attempts to load the tables
  which had failed to be created didn't work.  The bad part about this
  is that instead of skipping to the end of the data from the COPY
  command, pg_restore seems to have thought that the contents of the
  table was an SQL command and read in the whole table and I think then
  got very confused.  Here's the process list from this:

postgres  6692  2.7  0.7  78380 48524 pts/3    S+   Jan19  32:26 /usr/lib/p=
ostgresql/8.1/bin/pg_dump -h /var/run/postgresql -p 5432 -Fc tsf
postgres  6693  1.2 51.7 3198716 3169240 pts/3 S+   Jan19  15:06 /usr/lib/p=
ostgresql/8.1/bin/pg_restore -h /var/run/postgresql -p 5433 -d template1 -C
postgres  6694 10.8  0.9  96780 59316 ?        S    Jan19 126:40 postgres: =
postgres tsf [local] COPY=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=
=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=
=20=20=20=20=20=20=20=20=20=20=20=20=20
postgres  6782  3.4  0.3  52364 24092 ?        S    Jan19  40:12 postgres: =
postgres tsf [local] idle=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=
=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=20=
=20=20=20=20=20=20=20=20=20=20=20=20=20

  As you can see, pg_restore is using up 3G or so, which seems likely to
  be the size of one of the PostGIS-based tables.  The 8.0 COPY seems to
  think it's still going (perhaps it's blocking because pg_restore is
  trying to do something, but it's not actually using any CPU time),
  while the 8.1 connection is idle (waiting for pg_restore to give it
  something to do).

  It seems like pg_restore really should be able to handle COPY errors
  correctly by skipping to the end of the COPY data segment when the
  initial COPY command comes back as an error.  If it's not too invasive
  of a fix, I really think it should be included in the next point
  release of 8.1.

      Thanks!

        Stephen

Re: pg_restore error checking

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
>   It seems like pg_restore really should be able to handle COPY errors
>   correctly by skipping to the end of the COPY data segment when the
>   initial COPY command comes back as an error.

Send a patch ;-)

            regards, tom lane