Re: Postgresqlism & Vacuum? - Mailing list pgsql-general
From | Charles Tassell |
---|---|
Subject | Re: Postgresqlism & Vacuum? |
Date | |
Msg-id | 4.2.0.58.20000414151418.009f3610@mailer.isn.net |
In response to | Re: Postgresqlism & Vacuum? (Stephen J Lombardo <lombardo@mac.com>) |
List | pgsql-general |
This may be a dumb question, since I have no idea how the core of Postgres is actually implemented, but couldn't the VACUUM command be done within a transaction, or something similar, so that it doesn't lock the table while running? Of course it would be slightly out of date if write actions (insert/update/delete/drop) were done during the process, but not much more than it would be if those actions were taken a few seconds after the vacuum finished. I don't see having to vacuum analyze the database every once in a while as a big problem, but the fact that it completely locks the table against even read access while it's running is a bit scary.

BTW: Does dropping the indexes on a table and then recreating them achieve the same effect as doing a VACUUM ANALYZE, minus the garbage collection? Or does CREATE INDEX just define the index, with the next VACUUM ANALYZE actually populating it?
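To make the question concrete, here is roughly the comparison I have in mind (the table and index names are just made up for illustration):

-- The periodic maintenance everyone recommends; while it runs the
-- table is locked against readers as well as writers:
VACUUM ANALYZE mytable;

-- versus rebuilding an index by hand: does this refresh the same
-- statistics, or does it only redefine the index?
DROP INDEX mytable_name_idx;
CREATE INDEX mytable_name_idx ON mytable (name);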
At 12:58 PM 4/14/00, Stephen J Lombardo wrote:
> > I think there must be something wrong with the optimiser that it's
> > "postgresqlism" that you must vacuum analyze frequently. Just as an
> > example, for Clipper (dBase compiler), it's Clipperism that you must
> > re-index if you cannot locate some records just because the indexing
> > module screws up.
> >
> > For large 24x7 installations, it's impossible to vacuum nightly because
> > when postgresql is vacuuming the table is locked up; to the end-user
> > the database has already hung.
> >
> > There has been effort to speed up the vacuuming process, but this isn't
> > the cure. I believe the fault lies with the optimizer.
> >
> > For example, in Bruce Momjian's FAQ 4.9:
> >
> > PostgreSQL does not automatically maintain statistics. One has to make
> > an explicit vacuum call to update the statistics. After statistics are
> > updated, the optimizer knows how many rows are in the table, and can
> > better decide if it should use indices. Note that the optimizer does
> > not use indices in cases when the table is small because a sequential
> > scan would be faster.
> >
> > Why save microseconds with a sequential scan when the table is small,
> > only to 'forget' that the table is now big because you didn't vacuum
> > analyze? Why can't the optimizer just use indexes when they are there,
> > and not 'optimize' for the special case where the table is small to
> > save microseconds?
>
>     Because small is a relative term. You will notice that Bruce does not
> say "where a table is less than 100 tuples" or something like that. And
> because in the end you would probably waste significantly more time than
> a few microseconds. Consider a table where you have some round number of
> tuples, say 100,000, and suppose you had b-tree indexes on two
> attributes, employee_id (the primary key) and last_name. If you were to
> run a query to look up an employee by the primary key, you would surely
> want to use the index. Assume that it would take 3 disk accesses to
> search the index and one to fetch the data page from the heap, so you
> have a total of 4 disk accesses to search on the primary key and
> retrieve one row. Now suppose you were going to run a query that would
> return a significant number of rows, let's say half the table (50,000).
> If the optimizer chose to use the index for that query, it would take 4
> disk accesses to locate each and every row (3 to search the index, 1 to
> grab the data page), so running the query through the index would use
> 200,000 (50,000 * 4) disk accesses. (Worst case scenario, of course;
> using CLUSTER could improve the efficiency.)
>     Let's assume that the average size of a tuple is 500 bytes, so
> PostgreSQL would pack about 16 tuples into a single page. Doing a
> sequential scan of the table would therefore require 100,000/16, or
> 6,250, disk accesses. Depending on the speed of your drive this could
> make a big difference. Suppose the large query were run only 10 times a
> day; that would waste around 2 million disk accesses. If you were using
> a join, performance would suffer even more.
>     The job of the optimizer is to make educated decisions about how to
> run a query. Stats will help it out significantly, but it is expensive
> to maintain statistics on a running database and it would decrease
> overall performance. Instead the answer is to collect statistics
> periodically. There is reasoning behind this too. Consider a table where
> you have 1,000,000 tuples. One of the attributes is called state, and
> currently there are only 5 states in the database. A query is run like
> this:
>
> SELECT state FROM table_name WHERE state='NY';
>
> The optimizer will see if it has any statistics on this table. If not,
> it will make a guess at how many rows are returned; the optimizer
> guesses that 1% of the table, or 10,000 rows, will be returned, and then
> uses that number to assess how to run the query. If it had statistics on
> the table, the optimizer would know that there were only 5 different
> values in the state column, so it would assume that 20% of the table
> would be returned by the query. It is likely that the optimizer will
> choose a very different plan when it thinks that 200,000 rows will be
> returned.
>     You can be confident that the fine PostgreSQL developers have done a
> good job with the optimizer. There are reasons that things are done the
> way they are, but they might not be immediately apparent.
>
> Cheers,
> Stephen
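Spelling Stephen's statistics example out as the actual commands involved (table_name is his hypothetical table, and the plan you get back will of course depend on your own data):

-- Gathering statistics is what gives the optimizer the row counts and
-- value distribution Stephen describes, turning the blind 1% guess
-- into the 20% estimate above:
VACUUM ANALYZE table_name;

-- EXPLAIN shows which plan the optimizer actually settles on:
EXPLAIN SELECT state FROM table_name WHERE state = 'NY';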