Re: Postgresqlism & Vacuum? - Mailing list pgsql-general

From: Charles Tassell
Subject: Re: Postgresqlism & Vacuum?
Date:
Msg-id: 4.2.0.58.20000414151418.009f3610@mailer.isn.net
In response to: Re: Postgresqlism & Vacuum?  (Stephen J Lombardo <lombardo@mac.com>)
List: pgsql-general

This may be a dumb question, since I have no idea how the core of Postgres
is actually implemented, but couldn't the VACUUM command be run within a
transaction, or something similar, so that it doesn't lock the table while
running?  Of course the result would be slightly out of date if write
actions (insert/update/delete/drop) were done during the process, but not
much more out of date than if those actions had been taken a few seconds
after the vacuum finished.

I don't see having to VACUUM ANALYZE the database every once in a while as
a big problem, but the fact that it completely locks the table against even
read access while it's running is a bit scary.

BTW: Does dropping the indexes on a table and then recreating them achieve
the same effect as doing a VACUUM ANALYZE, minus the garbage collection?
Or does CREATE INDEX just define the index, with the next VACUUM ANALYZE
actually populating it?
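
For what it's worth, the two approaches I have in mind would look something
like this (table and index names are just made up for illustration):

-- garbage collection plus statistics update
VACUUM ANALYZE mytable;

-- rebuild the index from scratch (does this also refresh what the planner sees?)
DROP INDEX mytable_name_idx;
CREATE INDEX mytable_name_idx ON mytable (name);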

At 12:58 PM 4/14/00, Stephen J Lombardo wrote:

> > I think there must be something wrong with the optimizer if it's
> > "postgresqlism" that you must vacuum analyze frequently.  Just as an
> > example, for Clipper (the dBase compiler), it's Clipperism that you must
> > re-index if you cannot locate some records, just because the indexing
> > module screws up.
> >
> > For large 24x7 installations, it's impossible to vacuum nightly, because
> > when postgresql is vacuuming the table is locked up; to the end-user the
> > database has already hung.
> >
> > There has been effort to speed up the vacuuming process, but this isn't
> > the cure.  I believe the fault lies in the optimizer.
> >
> > For example, in Bruce Momjian's FAQ 4.9:
> >
> > PostgreSQL does not automatically maintain statistics. One has to make
> > an explicit vacuum call to update the statistics. After statistics are
> > updated, the optimizer knows how many rows are in the table, and can
> > better decide if it should use indices. Note that the optimizer does
> > not use indices in cases when the table is small because a sequential
> > scan would be faster.
> >
> > Why save a few microseconds by using a sequential scan when the table is
> > small, and then 'forget' that the table is now big because you didn't
> > vacuum analyze?  Why can't the optimizer just use indexes when they are
> > there, instead of 'optimizing' for the special case where the table is
> > small to save microseconds?
>
>     Because small is a relative term. You will notice that Bruce does not
>say "where a table is less than 100 tuples" or something like that. And
>because in the end you would probably waste significantly more time than a
>few microseconds. Consider a table with some round number of tuples, say
>100,000.  Suppose you had b-tree indexes on two attributes, employee_id
>(primary key) and last_name. Now if you were to run a query to look up an
>employee by the primary key you would surely want to use the index. Assume
>that it would take 3 disk accesses to search the index, and one to fetch
>the data page from the heap. So you have a total of 4 disk accesses to
>search on the primary key and retrieve one row. Now suppose you were going
>to run a query that would return a significant number of rows, let's say
>half the table (50,000). If the optimizer chose to use the index on that
>query it would take 4 disk accesses to locate each and every row (3 to
>search the index, 1 to grab the data page). So if the query ran using the
>index it would use 200,000 (50,000 * 4) disk accesses (a worst-case
>scenario, of course; using CLUSTER could improve the efficiency). Let's
>assume that the average size of a tuple is 500 bytes, so PostgreSQL would
>pack about 16 tuples into a single 8 KB page. Doing a sequential scan of
>the table would therefore require 100,000/16, or 6,250 disk accesses.
>Depending on the speed of your drive this could make a big difference.
>Suppose the large query were run only 10 times a day; that would waste
>around 2 million disk accesses. If you were using a join, performance
>would suffer even more.
>     The job of the optimizer is to make educated decisions about how to run
>a query. Statistics help it out significantly, but it is expensive to
>maintain statistics on a running database, and doing so would decrease
>overall performance. Instead, the answer is to collect statistics
>periodically. There is reasoning behind this, too. Consider a table with
>1,000,000 tuples. One of the attributes is called state, and currently
>there are only 5 distinct states in the database. A query is run like this:
>
>SELECT state FROM table_name WHERE state='NY';
>
>The optimizer will see if it has any statistics on this table. If not, it
>will make a guess at how many rows will be returned; say it guesses that 1%
>of the table, or 10,000 rows, will be returned. It will then use that
>number to assess how to run the query. If it had statistics on the table,
>the optimizer would know that there were only 5 different values in the
>state column, so it would assume that 20% of the table, or 200,000 rows,
>would be returned by the query. The optimizer is likely to choose a very
>different plan when it expects 200,000 rows rather than 10,000.
>     You can be confident that the fine PostgreSQL developers have done a
>good job with the optimizer. There are reasons that things are done the way
>they are, but they might not be immediately apparent.
>
>Cheers,
>Stephen
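
To put Stephen's cost example in concrete terms, the kind of schema he
describes might look something like this (the names here are mine, purely
illustrative):

CREATE TABLE employees (
    employee_id integer PRIMARY KEY,  -- b-tree index comes with the primary key
    last_name   text
);
CREATE INDEX employees_last_name_idx ON employees (last_name);

-- Selective lookup: roughly 3 index page reads plus 1 heap page read.
SELECT * FROM employees WHERE employee_id = 12345;

-- Returns about half of the 100,000 rows: a sequential scan over
-- 100,000/16 = 6,250 pages beats 50,000 * 4 = 200,000 index-driven reads.
SELECT * FROM employees WHERE last_name >= 'M';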


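And the statistics effect he describes can be checked directly with EXPLAIN,
before and after collecting statistics (using his table name; the actual
plans and row estimates will of course depend on the data):

-- With no statistics, the planner guesses at the number of matching rows:
EXPLAIN SELECT state FROM table_name WHERE state='NY';

-- Collect statistics, then compare the estimated row count and chosen plan:
VACUUM ANALYZE table_name;
EXPLAIN SELECT state FROM table_name WHERE state='NY';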