Thread: cache estimates, cache access cost

cache estimates, cache access cost

From
Cédric Villemain
Date:
Hello

cache estimation and cache access cost are currently not accounted for
explicitly: they have a cost associated with them but no constants of
their own (other than effective_cache_size, which has very limited
usage).

Every IO cost is built from a derivation of seq_page_cost,
random_page_cost and the number of pages. Formulas are used in some
places to raise or lower the cost, to take caching and data alignment
into account.

There are:
* estimates of the pages we will find in the PostgreSQL buffer cache
* estimates of the pages we will find in the operating system cache

and they can be computed for:
* the first access
* subsequent accesses

We currently don't distinguish between the two cache areas (there are
more cache areas, but they don't matter here), and we 'prefer' to
estimate repeated accesses rather than the first one.

There is also a point related to cost estimation: the estimates are
abrupt. For example, once a sort goes over work_mem, its cost jumps
because page accesses are then accounted for.

The current cost estimates are already very good: most of our queries
run well without those famous 'HINTs', and the planner provides the
best plan in most cases.

But I believe we now need more tools to improve cost estimation even
further.
I would like to propose some ideas, not all of them mine; the topic has
been in the air for a long time, and probably everything has already
been said (at least around a beer or a Pepsi).

Adding a new GUC "cache_page_cost":
- allows costing a page access when the page is estimated to be in cache
- allows costing a sort that exceeds work_mem but should not hit disk
- allows using random_page_cost for what it is really meant for.
(I was tempted by a "write_page_cost" GUC too, but I am unsure about
that one at this stage.)
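
To make this concrete, here is a minimal sketch (the helper name and the
way the cached fraction is obtained are purely illustrative, not taken
from any patch) of how such a GUC could blend with the existing page
costs once we have an estimate of how much of a relation is cached:

    /*
     * Illustrative only: charge pages believed to be cached at
     * cache_page_cost and the rest at random_page_cost.  The cached
     * fraction would come from the statistics discussed below.
     */
    static double
    blended_page_cost(double cached_fraction,
                      double cache_page_cost,
                      double random_page_cost)
    {
        return cached_fraction * cache_page_cost +
               (1.0 - cached_fraction) * random_page_cost;
    }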

Adding 2 columns to pg_class, "oscache_percent" and "pgcache_percent"
(or similar names): they allow storing statistics about the percentage
of a relation present in each cache.
- The usage should be to estimate the cost of the first access to pages,
then use the Mackert and Lohman formula for subsequent accesses. The
latter only provides a way to estimate the cost of re-reading.
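
For reference, the Mackert and Lohman approximation (as used by
index_pages_fetched() in costsize.c, if I recall the comment there
correctly) estimates the number of pages fetched as

    PF =
      \begin{cases}
        \min\left(\frac{2TNs}{2T+Ns},\, T\right)            & T \le b \\
        \frac{2TNs}{2T+Ns}                                  & T > b,\ Ns \le \frac{2Tb}{2T-b} \\
        b + \left(Ns - \frac{2Tb}{2T-b}\right)\frac{T-b}{T} & T > b,\ Ns > \frac{2Tb}{2T-b}
      \end{cases}

where T is the number of pages in the table, N the number of tuples, s
the selectivity, and b the number of buffer pages assumed available
(effective_cache_size). It says nothing about whether the *first* fetch
of a page is already cached, which is exactly the gap the new columns
are meant to fill.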

It is hard to advocate for this with a concrete expected performance
gain, other than: we will have more options for more precise planner
decisions, and we may reduce the number of bad-planning reports. (It is
also on the TODO list to improve cache estimation.)

--

I've already hacked the core a bit for this and added the 2 new columns,
with hooks to update them. ANALYZE OSCACHE updates one of them, and a
plugin can be used to provide the estimate (so how it is filled is not
important; most OSes have ways to estimate it accurately, if anyone
wonders).
It is as-is for a POC, probably not clean enough to go to a commitfest,
and not expected to go there before some consensus is reached.
http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache

--

Hacking costsize is ... dangerous, I would say. It is easy to break
something which already works so well. Changing only one cost function
is not enough to keep a good balance....
A performance farm should help here ... and so should the full 9.2
cycle.

Comments ?
--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: cache estimates, cache access cost

From
Greg Smith
Date:
Cédric Villemain wrote:
> http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache
>

This rebases easily to make Cedric's changes move to the end; I just
pushed a version with that change to
https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone
wants a cleaner one to browse.  I've attached a patch too if that's more
your thing.

I'd recommend not getting too stuck on the particular hook Cédric has
added here to compute the cache estimate, which uses mmap and mincore to
figure it out.  It's possible to compute similar numbers, albeit less
accurate, using an approach similar to how pg_buffercache inspects
things.  And I even once wrote a background writer extension that
collected this sort of data as it was running the LRU scan anyway.
Discussions of this idea seem to focus on how the "what's in the cache?"
data is collected, which as far as I'm concerned is the least important
part.  There are multiple options, some work better than others, and
there's no reason that can't be swapped out later.  The more important
question is how to store the data collected and then use it for
optimizing queries.
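
For illustration only, a PGCache_hook implementation in that
pg_buffercache style might look roughly like this (assuming 9.1-era
buffer manager internals, and skipping the buffer-mapping partition
locks pg_buffercache takes for a consistent snapshot):

    #include "postgres.h"

    #include "storage/buf_internals.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /*
     * Rough sketch: count shared buffers currently holding valid pages
     * of the given relation fork and return that as a percentage of the
     * fork size.  Only the per-buffer header spinlock is taken here.
     */
    static float4
    pgcache_snapshot(Relation relation, ForkNumber forkNum)
    {
        BlockNumber nblocks = RelationGetNumberOfBlocksInFork(relation, forkNum);
        int64       in_cache = 0;
        int         i;

        for (i = 0; i < NBuffers; i++)
        {
            volatile BufferDesc *bufHdr = &BufferDescriptors[i];

            LockBufHdr(bufHdr);
            if ((bufHdr->flags & BM_VALID) &&
                bufHdr->tag.forkNum == forkNum &&
                RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node))
                in_cache++;
            UnlockBufHdr(bufHdr);
        }

        return (nblocks > 0) ? (float4) (in_cache * 100 / nblocks) : 0;
    }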

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


diff --git a/contrib/Makefile b/contrib/Makefile
index 6967767..47652d5 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -27,6 +27,7 @@ SUBDIRS = \
         lo        \
         ltree        \
         oid2name    \
+        oscache    \
         pageinspect    \
         passwordcheck    \
         pg_archivecleanup \
diff --git a/contrib/oscache/Makefile b/contrib/oscache/Makefile
new file mode 100644
index 0000000..8d8dcc5
--- /dev/null
+++ b/contrib/oscache/Makefile
@@ -0,0 +1,15 @@
+# contrib/oscache/Makefile
+
+MODULE_big = oscache
+OBJS = oscache.o
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/oscache
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/oscache/oscache.c b/contrib/oscache/oscache.c
new file mode 100644
index 0000000..1ad7dc2
--- /dev/null
+++ b/contrib/oscache/oscache.c
@@ -0,0 +1,151 @@
+/*-------------------------------------------------------------------------
+ *
+ * oscache.c
+ *
+ *
+ * Copyright (c) 2011, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *      contrib/oscache/oscache.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/* { POSIX stuff */
+#include <stdlib.h> /* exit, calloc, free */
+#include <fcntl.h> /* open, O_RDONLY */
+#include <sys/stat.h> /* stat, fstat */
+#include <sys/types.h> /* size_t, mincore */
+#include <unistd.h> /* sysconf, close */
+#include <sys/mman.h> /* mmap, mincore */
+/* } */
+
+/* { PostgreSQL stuff */
+#include "postgres.h" /* general Postgres declarations */
+#include "utils/rel.h" /* Relation */
+#include "storage/bufmgr.h"
+#include "catalog/catalog.h" /* relpath */
+/* } */
+
+PG_MODULE_MAGIC;
+
+void        _PG_init(void);
+
+float4 oscache(Relation, ForkNumber);
+
+/*
+ * Module load callback
+ */
+void
+_PG_init(void)
+{
+    /* Install hook. */
+    OSCache_hook = &oscache;
+}
+
+/*
+ * oscache process the os cache inspection for the relation.
+ * It returns the percentage of blocks in OS cache.
+ */
+float4
+oscache(Relation relation, ForkNumber forkNum)
+{
+    int  segment = 0;
+    char *relationpath;
+    char filename[MAXPGPATH];
+    int fd;
+    int64  total_block_disk = 0;
+    int64  total_block_mem  = 0;
+
+    /* OS things */
+    int64 pageSize  = sysconf(_SC_PAGESIZE); /* Page size */
+    register int64 pageIndex;
+
+    relationpath = relpathperm(relation->rd_node, forkNum);
+
+    /*
+     * For each segment of the relation
+     */
+    snprintf(filename, MAXPGPATH, "%s", relationpath);
+    while ((fd = open(filename, O_RDONLY)) != -1)
+    {
+        // for stat file
+        struct stat st;
+        // for mmap file
+        void *pa = (char *)0;
+        // for calloc file
+        unsigned char *vec = (unsigned char *)0;
+        int64  block_disk = 0;
+        int64  block_mem  = 0;
+
+        if (fstat(fd, &st) == -1)
+        {
+            close(fd);
+            elog(ERROR, "Can not stat object file : %s",
+                filename);
+            return 0;
+        }
+
+        /*
+        * if file ok
+        * then process
+        */
+        if (st.st_size != 0)
+        {
+            /* number of block in the current file */
+            block_disk = st.st_size/pageSize;
+
+            /* TODO We need to split mmap size to be sure (?) to be able to mmap */
+            pa = mmap(NULL, st.st_size, PROT_NONE, MAP_SHARED, fd, 0);
+            if (pa == MAP_FAILED)
+            {
+                close(fd);
+                elog(ERROR, "Can not mmap object file : %s, errno = %i,%s\nThis error can happen if there is not enough space in memory to do the projection. Please mail cedric@2ndQuadrant.fr with '[oscache] ENOMEM' as subject.",
+                    filename, errno, strerror(errno));
+                return 0;
+            }
+
+            /* Prepare our vector containing all blocks information */
+            vec = calloc(1, (st.st_size+pageSize-1)/pageSize);
+            if ((void *)0 == vec)
+            {
+                munmap(pa, st.st_size);
+                close(fd);
+                elog(ERROR, "Can not calloc object file : %s",
+                    filename);
+                return 0;
+            }
+
+            /* Affect vec with mincore */
+            if (mincore(pa, st.st_size, vec) != 0)
+            {
+                free(vec);
+                munmap(pa, st.st_size);
+                close(fd);
+                elog(ERROR, "mincore(%p, %lld, %p): %s\n",
+                    pa, (int64)st.st_size, vec, strerror(errno));
+                return 0;
+            }
+
+            /* handle the results */
+            for (pageIndex = 0; pageIndex < (st.st_size+pageSize-1)/pageSize; pageIndex++)
+            {
+                // block in memory
+                if (vec[pageIndex] & 1)
+                {
+                    block_mem++;
+                }
+            }
+        }
+        elog(DEBUG1, "oscache %s: %lld of %lld block in linux cache",
+            filename, block_mem,  block_disk);
+
+        //   free things
+        free(vec);
+        munmap(pa, st.st_size);
+        close(fd);
+        total_block_mem += block_mem;
+        total_block_disk += block_disk;
+
+        snprintf(filename, MAXPGPATH, "%s.%u", relationpath, ++segment);
+    }
+    return (float4)(total_block_mem*100/(total_block_disk+1));
+}
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 7b62818..25338d0 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1634,6 +1634,26 @@
      </row>

      <row>
+      <entry><structfield>reloscache</structfield></entry>
+      <entry><type>float4</type></entry>
+      <entry></entry>
+      <entry>
+       Percentage of the files in OS cache.  This is only an estimate used by
+       the planner.  It is updated by <command>ANALYZE OSCACHE</command>.
+      </entry>
+     </row>
+
+     <row>
+      <entry><structfield>relpgcache</structfield></entry>
+      <entry><type>float4</type></entry>
+      <entry></entry>
+      <entry>
+       Percentage of the files in PostgreSQL cache.  This is only an estimate used by
+       the planner.  It is updated by <command>ANALYZE PGCACHE</command>.
+      </entry>
+     </row>
+
+     <row>
       <entry><structfield>reltoastrelid</structfield></entry>
       <entry><type>oid</type></entry>
       <entry><literal><link linkend="catalog-pg-class"><structname>pg_class</structname></link>.oid</literal></entry>
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 4cb29b2..7f39a93 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -54,6 +54,8 @@ hashbuild(PG_FUNCTION_ARGS)
     IndexBuildResult *result;
     BlockNumber relpages;
     double        reltuples;
+    float4        reloscache;
+    float4        relpgcache;
     uint32        num_buckets;
     HashBuildState buildstate;

@@ -66,7 +68,7 @@ hashbuild(PG_FUNCTION_ARGS)
              RelationGetRelationName(index));

     /* Estimate the number of rows currently present in the table */
-    estimate_rel_size(heap, NULL, &relpages, &reltuples);
+    estimate_rel_size(heap, NULL, &relpages, &reltuples, &reloscache, &relpgcache);

     /* Initialize the hash index metadata page and initial buckets */
     num_buckets = _hash_metapinit(index, reltuples, MAIN_FORKNUM);
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 71c9931..73ba67b 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -756,6 +756,8 @@ InsertPgClassTuple(Relation pg_class_desc,
     values[Anum_pg_class_reltablespace - 1] = ObjectIdGetDatum(rd_rel->reltablespace);
     values[Anum_pg_class_relpages - 1] = Int32GetDatum(rd_rel->relpages);
     values[Anum_pg_class_reltuples - 1] = Float4GetDatum(rd_rel->reltuples);
+    values[Anum_pg_class_reloscache - 1] = Float4GetDatum(rd_rel->reloscache);
+    values[Anum_pg_class_relpgcache - 1] = Float4GetDatum(rd_rel->relpgcache);
     values[Anum_pg_class_reltoastrelid - 1] = ObjectIdGetDatum(rd_rel->reltoastrelid);
     values[Anum_pg_class_reltoastidxid - 1] = ObjectIdGetDatum(rd_rel->reltoastidxid);
     values[Anum_pg_class_relhasindex - 1] = BoolGetDatum(rd_rel->relhasindex);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 0568a1b..284ab5d 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -86,6 +86,8 @@ static BufferAccessStrategy vac_strategy;

 static void do_analyze_rel(Relation onerel, VacuumStmt *vacstmt,
                bool update_reltuples, bool inh);
+static void do_cache_analyze_rel(Relation onerel, VacuumStmt *vacstmt,
+               bool update_reltuples, bool inh);
 static void BlockSampler_Init(BlockSampler bs, BlockNumber nblocks,
                   int samplesize);
 static bool BlockSampler_HasMore(BlockSampler bs);
@@ -238,13 +240,21 @@ analyze_rel(Oid relid, VacuumStmt *vacstmt,
     /*
      * Do the normal non-recursive ANALYZE.
      */
-    do_analyze_rel(onerel, vacstmt, update_reltuples, false);
+    if (vacstmt->options & (VACOPT_CACHE))
+        do_cache_analyze_rel(onerel, vacstmt, update_reltuples, false);
+    else
+        do_analyze_rel(onerel, vacstmt, update_reltuples, false);

     /*
      * If there are child tables, do recursive ANALYZE.
      */
     if (onerel->rd_rel->relhassubclass)
-        do_analyze_rel(onerel, vacstmt, false, true);
+    {
+        if (vacstmt->options & (VACOPT_CACHE))
+            do_cache_analyze_rel(onerel, vacstmt, false, true);
+        else
+            do_analyze_rel(onerel, vacstmt, false, true);
+    }

     /*
      * Close source relation now, but keep lock so that no one deletes it
@@ -640,6 +650,120 @@ cleanup:
 }

 /*
+ *    do_analyze_rel() -- analyze one relation, recursively or not
+ */
+static void
+do_cache_analyze_rel(Relation onerel, VacuumStmt *vacstmt,
+                     bool update_relcache, bool inh)
+{
+    int            ind;
+    Relation   *Irel;
+    int            nindexes;
+    bool        hasindex;
+    AnlIndexData *indexdata;
+    PGRUsage    ru0;
+    TimestampTz starttime = 0;
+    MemoryContext caller_context;
+    int            save_nestlevel;
+
+    if (inh)
+        ereport(elevel,
+                (errmsg("cache analyzing \"%s.%s\" inheritance tree",
+                        get_namespace_name(RelationGetNamespace(onerel)),
+                        RelationGetRelationName(onerel))));
+    else
+        ereport(elevel,
+                (errmsg("cache analyzing \"%s.%s\"",
+                        get_namespace_name(RelationGetNamespace(onerel)),
+                        RelationGetRelationName(onerel))));
+
+    /*
+     * Set up a working context so that we can easily free whatever junk gets
+     * created.
+     */
+    anl_context = AllocSetContextCreate(CurrentMemoryContext,
+                                        "Analyze",
+                                        ALLOCSET_DEFAULT_MINSIZE,
+                                        ALLOCSET_DEFAULT_INITSIZE,
+                                        ALLOCSET_DEFAULT_MAXSIZE);
+    caller_context = MemoryContextSwitchTo(anl_context);
+
+    /*
+     * Arrange to make GUC variable changes local to this command.
+     */
+    save_nestlevel = NewGUCNestLevel();
+
+    /* measure elapsed time iff autovacuum logging requires it */
+    if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0)
+    {
+        pg_rusage_init(&ru0);
+        if (Log_autovacuum_min_duration > 0)
+            starttime = GetCurrentTimestamp();
+    }
+
+    /*
+     * Open all indexes of the relation, and see if there are any analyzable
+     * columns in the indexes.    We do not analyze index columns if there was
+     * an explicit column list in the ANALYZE command, however.  If we are
+     * doing a recursive scan, we don't want to touch the parent's indexes at
+     * all.
+     */
+    if (!inh)
+        vac_open_indexes(onerel, AccessShareLock, &nindexes, &Irel);
+    else
+    {
+        Irel = NULL;
+        nindexes = 0;
+    }
+    hasindex = (nindexes > 0);
+    indexdata = NULL;
+
+    /*
+     * Update cache stats in pg_class.
+     */
+    cache_update_relstats(onerel,
+                          RelationGetRelationOSCacheInFork(onerel, MAIN_FORKNUM),
+                          RelationGetRelationPGCacheInFork(onerel, MAIN_FORKNUM),
+                          InvalidTransactionId);
+
+    /*
+     * Same for indexes.
+     */
+    for (ind = 0; ind < nindexes; ind++)
+    {
+        cache_update_relstats(Irel[ind],
+                              RelationGetRelationOSCacheInFork(Irel[ind], MAIN_FORKNUM),
+                              RelationGetRelationPGCacheInFork(Irel[ind], MAIN_FORKNUM),
+                              InvalidTransactionId);
+    }
+
+    /* Done with indexes */
+    vac_close_indexes(nindexes, Irel, NoLock);
+
+    /* Log the action if appropriate */
+    if (IsAutoVacuumWorkerProcess() && Log_autovacuum_min_duration >= 0)
+    {
+        if (Log_autovacuum_min_duration == 0 ||
+            TimestampDifferenceExceeds(starttime, GetCurrentTimestamp(),
+                                       Log_autovacuum_min_duration))
+            ereport(LOG,
+                    (errmsg("automatic cache analyze of table \"%s.%s.%s\" system usage: %s",
+                            get_database_name(MyDatabaseId),
+                            get_namespace_name(RelationGetNamespace(onerel)),
+                            RelationGetRelationName(onerel),
+                            pg_rusage_show(&ru0))));
+    }
+
+    /* Roll back any GUC changes executed by index functions */
+    AtEOXact_GUC(false, save_nestlevel);
+
+    /* Restore current context and release memory */
+    MemoryContextSwitchTo(caller_context);
+    MemoryContextDelete(anl_context);
+    anl_context = NULL;
+}
+
+/*
  * Compute statistics about indexes of a relation
  */
 static void
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 9606569..b45f012 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1096,3 +1096,61 @@ vacuum_delay_point(void)
         CHECK_FOR_INTERRUPTS();
     }
 }
+
+
+/*
+ *    cache_update_relstats() -- update cache statistics for one relation
+ *
+ *  /!\ Same comment as function vac_update_relstats()
+ */
+void
+cache_update_relstats(Relation relation,
+                      float4 per_oscache, float4 per_pgcache,
+                      TransactionId frozenxid)
+{
+    Oid            relid = RelationGetRelid(relation);
+    Relation    rd;
+    HeapTuple    ctup;
+    Form_pg_class pgcform;
+    bool        dirty;
+
+    rd = heap_open(RelationRelationId, RowExclusiveLock);
+
+    /* Fetch a copy of the tuple to scribble on */
+    ctup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
+    if (!HeapTupleIsValid(ctup))
+        elog(ERROR, "pg_class entry for relid %u vanished during cache analyze",
+             relid);
+    pgcform = (Form_pg_class) GETSTRUCT(ctup);
+
+    /* Apply required updates, if any, to copied tuple */
+
+    dirty = false;
+    if (pgcform->reloscache != (float4) per_oscache)
+    {
+        pgcform->reloscache = (float4) per_oscache;
+        dirty = true;
+    }
+    if (pgcform->relpgcache != (float4) per_pgcache)
+    {
+        pgcform->relpgcache = (float4) per_pgcache;
+        dirty = true;
+    }
+
+    /*
+     * relfrozenxid should never go backward.  Caller can pass
+     * InvalidTransactionId if it has no new data.
+     */
+    if (TransactionIdIsNormal(frozenxid) &&
+        TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid))
+    {
+        pgcform->relfrozenxid = frozenxid;
+        dirty = true;
+    }
+
+    /* If anything changed, write out the tuple. */
+    if (dirty)
+        heap_inplace_update(rd, ctup);
+
+    heap_close(rd, RowExclusiveLock);
+}
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index fd8ea45..39f9eab 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -108,7 +108,8 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
      */
     if (!inhparent)
         estimate_rel_size(relation, rel->attr_widths - rel->min_attr,
-                          &rel->pages, &rel->tuples);
+                          &rel->pages, &rel->tuples,
+                          &rel->oscache, &rel->pgcache);

     /*
      * Make list of indexes.  Ignore indexes on system catalogs if told to.
@@ -323,11 +324,14 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
             {
                 info->pages = RelationGetNumberOfBlocks(indexRelation);
                 info->tuples = rel->tuples;
+                info->oscache = 0;
+                info->pgcache = 0;
             }
             else
             {
                 estimate_rel_size(indexRelation, NULL,
-                                  &info->pages, &info->tuples);
+                                  &info->pages, &info->tuples,
+                                  &info->oscache, &info->pgcache);
                 if (info->tuples > rel->tuples)
                     info->tuples = rel->tuples;
             }
@@ -362,7 +366,8 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
  */
 void
 estimate_rel_size(Relation rel, int32 *attr_widths,
-                  BlockNumber *pages, double *tuples)
+                  BlockNumber *pages, double *tuples,
+                  float4 *oscache, float4 *pgcache)
 {
     BlockNumber curpages;
     BlockNumber relpages;
@@ -451,21 +456,29 @@ estimate_rel_size(Relation rel, int32 *attr_widths,
                 density = (BLCKSZ - SizeOfPageHeaderData) / tuple_width;
             }
             *tuples = rint(density * (double) curpages);
+            *oscache = (float4) rel->rd_rel->reloscache;
+            *pgcache = (float4) rel->rd_rel->relpgcache;
             break;
         case RELKIND_SEQUENCE:
             /* Sequences always have a known size */
             *pages = 1;
             *tuples = 1;
+            *oscache = 0;
+            *pgcache = 0;
             break;
         case RELKIND_FOREIGN_TABLE:
             /* Just use whatever's in pg_class */
             *pages = rel->rd_rel->relpages;
             *tuples = rel->rd_rel->reltuples;
+            *oscache = 0;
+            *pgcache = 0;
             break;
         default:
             /* else it has no disk storage; probably shouldn't get here? */
             *pages = 0;
             *tuples = 0;
+            *oscache = 0;
+            *pgcache = 0;
             break;
     }
 }
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 1d39674..cc0d6f5 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -342,7 +342,7 @@ static void SplitColQualList(List *qualList,

 %type <boolean> opt_instead
 %type <boolean> opt_unique opt_concurrently opt_verbose opt_full
-%type <boolean> opt_freeze opt_default opt_recheck
+%type <boolean> opt_freeze opt_oscache opt_default opt_recheck
 %type <defelt>    opt_binary opt_oids copy_delimiter

 %type <boolean> copy_from
@@ -529,7 +529,7 @@ static void SplitColQualList(List *qualList,
     NULLS_P NUMERIC

     OBJECT_P OF OFF OFFSET OIDS ON ONLY OPERATOR OPTION OPTIONS OR
-    ORDER OUT_P OUTER_P OVER OVERLAPS OVERLAY OWNED OWNER
+    ORDER OSCACHE OUT_P OUTER_P OVER OVERLAPS OVERLAY OWNED OWNER

     PARSER PARTIAL PARTITION PASSING PASSWORD PLACING PLANS POSITION
     PRECEDING PRECISION PRESERVE PREPARE PREPARED PRIMARY
@@ -7801,11 +7801,13 @@ vacuum_option_elem:
         ;

 AnalyzeStmt:
-            analyze_keyword opt_verbose
+            analyze_keyword opt_oscache opt_verbose
                 {
                     VacuumStmt *n = makeNode(VacuumStmt);
                     n->options = VACOPT_ANALYZE;
                     if ($2)
+                        n->options |= VACOPT_CACHE;
+                    if ($3)
                         n->options |= VACOPT_VERBOSE;
                     n->freeze_min_age = -1;
                     n->freeze_table_age = -1;
@@ -7813,16 +7815,18 @@ AnalyzeStmt:
                     n->va_cols = NIL;
                     $$ = (Node *)n;
                 }
-            | analyze_keyword opt_verbose qualified_name opt_name_list
+            | analyze_keyword opt_oscache opt_verbose qualified_name opt_name_list
                 {
                     VacuumStmt *n = makeNode(VacuumStmt);
                     n->options = VACOPT_ANALYZE;
                     if ($2)
+                        n->options |= VACOPT_CACHE;
+                    if ($3)
                         n->options |= VACOPT_VERBOSE;
                     n->freeze_min_age = -1;
                     n->freeze_table_age = -1;
-                    n->relation = $3;
-                    n->va_cols = $4;
+                    n->relation = $4;
+                    n->va_cols = $5;
                     $$ = (Node *)n;
                 }
         ;
@@ -7845,6 +7849,11 @@ opt_freeze: FREEZE                                    { $$ = TRUE; }
             | /*EMPTY*/                                { $$ = FALSE; }
         ;

+opt_oscache:
+            OSCACHE                                    { $$ = TRUE; }
+            | /*EMPTY*/                             { $$ = FALSE; }
+        ;
+
 opt_name_list:
             '(' name_list ')'                        { $$ = $2; }
             | /*EMPTY*/                                { $$ = NIL; }
@@ -12158,6 +12167,7 @@ type_func_name_keyword:
             | LIKE
             | NATURAL
             | NOTNULL
+            | OSCACHE
             | OUTER_P
             | OVER
             | OVERLAPS
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f96685d..5cea929 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -106,6 +106,13 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
 static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
 static void AtProcExit_Buffers(int code, Datum arg);

+/*
+ * Hooks for plugins to get control in
+ * RelationGetRelationOSCacheInFork
+ * RelationGetRelationPGCacheInFork
+ */
+oscache_hook_type OSCache_hook = NULL;
+pgcache_hook_type PGCache_hook = NULL;

 /*
  * PrefetchBuffer -- initiate asynchronous read of a block of a relation
@@ -1922,6 +1929,40 @@ RelationGetNumberOfBlocksInFork(Relation relation, ForkNumber forkNum)
     return smgrnblocks(relation->rd_smgr, forkNum);
 }

+/*
+ * RelationGetRelationOSCacheInFork
+ *        Determines the current percentage of pages in OS cache for the
+ *        relation.
+ */
+float4
+RelationGetRelationOSCacheInFork(Relation relation, ForkNumber forkNum)
+{
+    float4 percent = 0;
+
+    /* if a plugin is present, let it manage things */
+    if (OSCache_hook)
+        percent = (*OSCache_hook) (relation, forkNum);
+
+    return percent;
+}
+
+/*
+ * RelationGetRelationPGCacheInFork
+ *        Determines the current percentage of pages in PostgreSQL cache
+ *        for the relation.
+ */
+float4
+RelationGetRelationPGCacheInFork(Relation relation, ForkNumber forkNum)
+{
+    float4 percent = 0;
+
+    /* if a plugin is present, let it manage things */
+    if (PGCache_hook)
+        percent = (*PGCache_hook) (relation, forkNum);
+
+    return percent;
+}
+
 /* ---------------------------------------------------------------------
  *        DropRelFileNodeBuffers
  *
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index d7e94ff..159096a 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1417,6 +1417,8 @@ formrdesc(const char *relationName, Oid relationReltype,

     relation->rd_rel->relpages = 1;
     relation->rd_rel->reltuples = 1;
+    relation->rd_rel->reloscache = 0;
+    relation->rd_rel->relpgcache = 0;
     relation->rd_rel->relkind = RELKIND_RELATION;
     relation->rd_rel->relhasoids = hasoids;
     relation->rd_rel->relnatts = (int16) natts;
@@ -2661,6 +2663,8 @@ RelationSetNewRelfilenode(Relation relation, TransactionId freezeXid)
     {
         classform->relpages = 0;    /* it's empty until further notice */
         classform->reltuples = 0;
+        classform->reloscache = 0;
+        classform->relpgcache = 0;
     }
     classform->relfrozenxid = freezeXid;

diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index ffcce3c..dc79df5 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -45,6 +45,8 @@ CATALOG(pg_class,1259) BKI_BOOTSTRAP BKI_ROWTYPE_OID(83) BKI_SCHEMA_MACRO
     Oid            reltablespace;    /* identifier of table space for relation */
     int4        relpages;        /* # of blocks (not always up-to-date) */
     float4        reltuples;        /* # of tuples (not always up-to-date) */
+    float4        reloscache;        /* % of files in OS cache (not always up-to-date) */
+    float4        relpgcache;        /* % of files in PostgreSQL cache (not always up-to-date) */
     Oid            reltoastrelid;    /* OID of toast table; 0 if none */
     Oid            reltoastidxid;    /* if toast table, OID of chunk_id index */
     bool        relhasindex;    /* T if has (or has had) any indexes */
@@ -92,7 +94,7 @@ typedef FormData_pg_class *Form_pg_class;
  * ----------------
  */

-#define Natts_pg_class                    26
+#define Natts_pg_class                    28
 #define Anum_pg_class_relname            1
 #define Anum_pg_class_relnamespace        2
 #define Anum_pg_class_reltype            3
@@ -103,22 +105,24 @@ typedef FormData_pg_class *Form_pg_class;
 #define Anum_pg_class_reltablespace        8
 #define Anum_pg_class_relpages            9
 #define Anum_pg_class_reltuples            10
-#define Anum_pg_class_reltoastrelid        11
-#define Anum_pg_class_reltoastidxid        12
-#define Anum_pg_class_relhasindex        13
-#define Anum_pg_class_relisshared        14
-#define Anum_pg_class_relpersistence    15
-#define Anum_pg_class_relkind            16
-#define Anum_pg_class_relnatts            17
-#define Anum_pg_class_relchecks            18
-#define Anum_pg_class_relhasoids        19
-#define Anum_pg_class_relhaspkey        20
-#define Anum_pg_class_relhasrules        21
-#define Anum_pg_class_relhastriggers    22
-#define Anum_pg_class_relhassubclass    23
-#define Anum_pg_class_relfrozenxid        24
-#define Anum_pg_class_relacl            25
-#define Anum_pg_class_reloptions        26
+#define Anum_pg_class_reloscache        11
+#define Anum_pg_class_relpgcache        12
+#define Anum_pg_class_reltoastrelid        13
+#define Anum_pg_class_reltoastidxid        14
+#define Anum_pg_class_relhasindex        15
+#define Anum_pg_class_relisshared        16
+#define Anum_pg_class_relpersistence    17
+#define Anum_pg_class_relkind            18
+#define Anum_pg_class_relnatts            19
+#define Anum_pg_class_relchecks            20
+#define Anum_pg_class_relhasoids        21
+#define Anum_pg_class_relhaspkey        22
+#define Anum_pg_class_relhasrules        23
+#define Anum_pg_class_relhastriggers    24
+#define Anum_pg_class_relhassubclass    25
+#define Anum_pg_class_relfrozenxid        26
+#define Anum_pg_class_relacl            27
+#define Anum_pg_class_reloptions        28

 /* ----------------
  *        initial contents of pg_class
@@ -130,13 +134,13 @@ typedef FormData_pg_class *Form_pg_class;
  */

 /* Note: "3" in the relfrozenxid column stands for FirstNormalTransactionId */
-DATA(insert OID = 1247 (  pg_type        PGNSP 71 0 PGUID 0 0 0 0 0 0 0 f f p r 29 0 t f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1247 (  pg_type        PGNSP 71 0 PGUID 0 0 0 0 0 0 0 0 0 f f p r 29 0 t f f f f 3 _null_ _null_ ));
 DESCR("");
-DATA(insert OID = 1249 (  pg_attribute    PGNSP 75 0 PGUID 0 0 0 0 0 0 0 f f p r 20 0 f f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1249 (  pg_attribute    PGNSP 75 0 PGUID 0 0 0 0 0 0 0 0 0 f f p r 20 0 f f f f f 3 _null_ _null_ ));
 DESCR("");
-DATA(insert OID = 1255 (  pg_proc        PGNSP 81 0 PGUID 0 0 0 0 0 0 0 f f p r 25 0 t f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1255 (  pg_proc        PGNSP 81 0 PGUID 0 0 0 0 0 0 0 0 0 f f p r 25 0 t f f f f 3 _null_ _null_ ));
 DESCR("");
-DATA(insert OID = 1259 (  pg_class        PGNSP 83 0 PGUID 0 0 0 0 0 0 0 f f p r 26 0 t f f f f 3 _null_ _null_ ));
+DATA(insert OID = 1259 (  pg_class        PGNSP 83 0 PGUID 0 0 0 0 0 0 0 0 0 f f p r 28 0 t f f f f 3 _null_ _null_ ));
 DESCR("");


diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 79c9f5d..7f1801a 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -155,6 +155,11 @@ extern void vacuum_set_xid_limits(int freeze_min_age, int freeze_table_age,
 extern void vac_update_datfrozenxid(void);
 extern void vacuum_delay_point(void);

+extern void cache_update_relstats(Relation relation,
+                                  float4 per_oscache,
+                                  float4 per_pgcache,
+                                  TransactionId frozenxid);
+
 /* in commands/vacuumlazy.c */
 extern void lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
                 BufferAccessStrategy bstrategy, bool *scanned_all);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index ee1881b..bc7a301 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2417,7 +2417,8 @@ typedef enum VacuumOption
     VACOPT_VERBOSE = 1 << 2,    /* print progress info */
     VACOPT_FREEZE = 1 << 3,        /* FREEZE option */
     VACOPT_FULL = 1 << 4,        /* FULL (non-concurrent) vacuum */
-    VACOPT_NOWAIT = 1 << 5
+    VACOPT_NOWAIT = 1 << 5,
+    VACOPT_CACHE = 1 << 6        /* do CACHE stats analyze */
 } VacuumOption;

 typedef struct VacuumStmt
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index f659269..3f08bb0 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -408,6 +408,8 @@ typedef struct RelOptInfo
     List       *indexlist;        /* list of IndexOptInfo */
     BlockNumber pages;
     double        tuples;
+    float4        oscache;
+    float4        pgcache;
     struct Plan *subplan;        /* if subquery */
     List       *subrtable;        /* if subquery */
     List       *subrowmark;        /* if subquery */
@@ -466,6 +468,8 @@ typedef struct IndexOptInfo
     /* statistics from pg_class */
     BlockNumber pages;            /* number of disk pages in index */
     double        tuples;            /* number of index tuples in index */
+    float4        oscache;
+    float4        pgcache;

     /* index descriptor information */
     int            ncolumns;        /* number of columns in index */
diff --git a/src/include/optimizer/plancat.h b/src/include/optimizer/plancat.h
index c0b8eda..1dc78d5 100644
--- a/src/include/optimizer/plancat.h
+++ b/src/include/optimizer/plancat.h
@@ -29,7 +29,8 @@ extern void get_relation_info(PlannerInfo *root, Oid relationObjectId,
                   bool inhparent, RelOptInfo *rel);

 extern void estimate_rel_size(Relation rel, int32 *attr_widths,
-                  BlockNumber *pages, double *tuples);
+                              BlockNumber *pages, double *tuples,
+                              float4 *oscache, float4 *pgcache);

 extern int32 get_relation_data_width(Oid relid, int32 *attr_widths);

diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 12c2faf..95a7e3d 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -264,6 +264,7 @@ PG_KEYWORD("option", OPTION, UNRESERVED_KEYWORD)
 PG_KEYWORD("options", OPTIONS, UNRESERVED_KEYWORD)
 PG_KEYWORD("or", OR, RESERVED_KEYWORD)
 PG_KEYWORD("order", ORDER, RESERVED_KEYWORD)
+PG_KEYWORD("oscache", OSCACHE, TYPE_FUNC_NAME_KEYWORD)
 PG_KEYWORD("out", OUT_P, COL_NAME_KEYWORD)
 PG_KEYWORD("outer", OUTER_P, TYPE_FUNC_NAME_KEYWORD)
 PG_KEYWORD("over", OVER, TYPE_FUNC_NAME_KEYWORD)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index b8fc87e..8b621de 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -179,6 +179,10 @@ extern void CheckPointBuffers(int flags);
 extern BlockNumber BufferGetBlockNumber(Buffer buffer);
 extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                 ForkNumber forkNum);
+extern float4 RelationGetRelationOSCacheInFork(Relation relation,
+                                ForkNumber forkNum);
+extern float4 RelationGetRelationPGCacheInFork(Relation relation,
+                                ForkNumber forkNum);
 extern void FlushRelationBuffers(Relation rel);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
@@ -215,4 +219,14 @@ extern void AtProcExit_LocalBuffers(void);
 extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
 extern void FreeAccessStrategy(BufferAccessStrategy strategy);

+/*
+* Hooks for plugins to get control in
+* RelationGetRelationOSCacheInFork
+* RelationGetRelationPGCacheInFork
+*/
+typedef float4 (*oscache_hook_type) (Relation relation, ForkNumber forkNum);
+extern PGDLLIMPORT oscache_hook_type OSCache_hook;
+typedef float4 (*pgcache_hook_type) (Relation relation, ForkNumber forkNum);
+extern PGDLLIMPORT pgcache_hook_type PGCache_hook;
+
 #endif
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5e28289..64ef53f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -963,6 +963,7 @@ OprCacheKey
 OprInfo
 OprProofCacheEntry
 OprProofCacheKey
+OSCache_hook_type
 OutputContext
 OverrideSearchPath
 OverrideStackEntry
@@ -973,6 +974,7 @@ PBOOL
 PCtxtHandle
 PFN
 PGAsyncStatusType
+PGCache_hook_type
 PGCALL2
 PGEvent
 PGEventConnDestroy

Re: cache estimates, cache access cost

From
Robert Haas
Date:
On Sun, May 15, 2011 at 11:52 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Cédric Villemain wrote:
>>
>> http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache
>
> This rebases easily to make Cedric's changes move to the end; I just pushed
> a version with that change to
> https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone
> wants a cleaner one to browse.  I've attached a patch too if that's more
> your thing.

Thank you.  I don't much like sucking in other people's git repos - it
tends to take a lot longer than just opening a patch file, and if I
add the repo as a remote then my git repo ends up bloated.  :-(

> The more important question is how to store the data collected and
> then use it for optimizing queries.

Agreed, but unless I'm missing something, this patch does nothing
about that.  I think the first step needs to be to update all the
formulas that are based on random_page_cost and seq_page_cost to
properly take cache_page_cost into account - and in some cases it may
be a bit debatable what the right mathematics are.

For what it's worth, I don't believe for a minute that an analyze
process that may only run on a given table every six months has a
chance of producing useful statistics about the likelihood that a
table will be cached.  The buffer cache can turn over completely in
under a minute, and a minute is a lot less than a month.  Now, if we
measured this information periodically for a long period of time and
averaged it, that might be a believable basis for setting an optimizer
parameter.  But I think we should take the approach recently discussed
on performance: allow it to be manually set by the administrator on a
per-relation basis, with some reasonable default (maybe based on the
size of the relation relative to effective_cache_size) if the
administrator doesn't intervene.  I don't want to be excessively
negative about the approach of examining the actual behavior of the
system and using that to guide system behavior - indeed, I think there
are quite a few places where we would do well to incorporate that
approach to a greater degree than we do currently.  But I think that
it's going to take a lot of research, and a lot of work, and a lot of
performance testing, to convince ourselves that we've come up with an
appropriate feedback mechanism that will actually deliver better
performance across a large variety of workloads.  It would be much
better, IMHO, to *first* get a cached_page_cost parameter added, even
if the mechanism by which caching percentages are set is initially
quite crude - that will give us a clear-cut benefit that people can
begin enjoying immediately.
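
(To be concrete, the sort of crude default I have in mind is nothing
more sophisticated than the following; the helper is made up and the
exact heuristic is very much debatable:)

    /*
     * Hypothetical default for the cached fraction of a relation when
     * the administrator hasn't set one: compare the relation's size to
     * effective_cache_size (both measured in pages), capping at fully
     * cached.
     */
    static double
    default_cached_fraction(double relpages, double effective_cache_size)
    {
        if (relpages <= 0)
            return 1.0;             /* empty or tiny relation: call it cached */
        if (effective_cache_size >= relpages)
            return 1.0;
        return effective_cache_size / relpages;
    }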

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: cache estimates, cache access cost

From
Cédric Villemain
Date:
2011/5/17 Robert Haas <robertmhaas@gmail.com>:
> On Sun, May 15, 2011 at 11:52 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> Cédric Villemain wrote:
>>>
>>> http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache
>>
>> This rebases easily to make Cedric's changes move to the end; I just pushed
>> a version with that change to
>> https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone
>> wants a cleaner one to browse.  I've attached a patch too if that's more
>> your thing.
>
> Thank you.  I don't much like sucking in other people's git repos - it
> tends to take a lot longer than just opening a patch file, and if I
> add the repo as a remote then my git repo ends up bloated.  :-(
>
>> The more important question is how to store the data collected and
>> then use it for optimizing queries.
>
> Agreed, but unless I'm missing something, this patch does nothing
> about that.  I think the first step needs to be to update all the
> formulas that are based on random_page_cost and seq_page_cost to
> properly take cache_page_cost into account - and in some cases it may
> be a bit debatable what the right mathematics are.

Yes, I provided the branch only in case someone wants to hack costsize.c,
and to close out the problem of getting the stats.

>
> For what it's worth, I don't believe for a minute that an analyze
> process that may run only run on a given table every six months has a
> chance of producing useful statistics about the likelihood that a
> table will be cached.  The buffer cache can turn over completely in
> under a minute, and a minute is a lot less than a month.  Now, if we
> measured this information periodically for a long period of time and
> averaged it, that might be a believable basis for setting an optimizer

The point is to get the ratio in cache, not the distribution of the data
in cache (pgfincore also lets you see that information).
I don't see how a stable system (a server in production) can have its
ratio moving up and down so fast without a known pattern.
Maybe it is a data warehouse, so data moves a lot; then just update your
per-relation stats before starting your queries, as suggested in other
threads. Maybe it is just a matter of the frequency of stats updates, or
of an explicit request like we *used to do* (ANALYZE foo;), to handle
those situations.

> parameter.  But I think we should take the approach recently discussed
> on performance: allow it to be manually set by the administrator on a
> per-relation basis, with some reasonable default (maybe based on the
> size of the relation relative to effective_cache_size) if the
> administrator doesn't intervene.  I don't want to be excessively
> negative about the approach of examining the actual behavior of the
> system and using that to guide system behavior - indeed, I think there
> are quite a few places where we would do well to incorporate that
> approach to a greater degree than we do currently.  But I think that
> it's going to take a lot of research, and a lot of work, and a lot of
> performance testing, to convince ourselves that we've come up with an
> appropriate feedback mechanism that will actually deliver better
> performance across a large variety of workloads.  It would be much
> better, IMHO, to *first* get a cached_page_cost parameter added, even
> if the mechanism by which caching percentages are set is initially
> quite crude - that will give us a clear-cut benefit that people can
> begin enjoying immediately.

The plugin I provided is just there to allow a first analysis of how the
OS cache size moves. You can either use pgfincore to monitor that per
table, or use the patch and monitor the values of the *cache columns.

I took the hooks approach because it lets you do whatever you want :)
You can set up a hook where you set the values you want to see; it
allows, for example, fixing cold-start values, or permanent values set
by the DBA, or ... whatever you want here (see the sketch below).
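
A minimal sketch of such a plugin (the module and GUC names are made up;
only OSCache_hook and its type come from the patch):

    #include "postgres.h"
    #include "fmgr.h"

    #include "storage/bufmgr.h"
    #include "utils/guc.h"

    PG_MODULE_MAGIC;

    void        _PG_init(void);

    /* DBA-chosen percentage returned for every relation */
    static double fixed_percent = 50.0;

    static float4
    oscache_fixed(Relation relation, ForkNumber forkNum)
    {
        return (float4) fixed_percent;
    }

    void
    _PG_init(void)
    {
        DefineCustomRealVariable("oscache_fixed.percent",
                                 "Percentage of each relation assumed to be in OS cache.",
                                 NULL,
                                 &fixed_percent,
                                 50.0, 0.0, 100.0,
                                 PGC_SUSET, 0,
                                 NULL, NULL, NULL);

        OSCache_hook = oscache_fixed;
    }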

The topic is: do we need more parameters to increase the value of our planner?
1/ cache_page_cost
2/ cache information, arbitrarily set or not.

Starting with 1/ is OK for me; I would prefer to try both at once if
possible, to avoid the pain of hacking costsize.c twice.

Several items are to be discussed after that: formulas to handle 'small'
tables, data distribution usage (which touches an old topic about
auto-partitioning, while we are here), cold state, hot state, ...

PS: there is a very good blocker for the pg_class changes: what happens
on a standby? Maybe it just opens the door to unlocking that, or to
finding another option to get the information per table but distinct per
server. (Or we don't care, at least for a first implementation, as with
other parameters.)
--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: cache estimates, cache access cost

From
Robert Haas
Date:
On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:
> The point is to get ratio in cache, not the distribution of the data
> in cache (pgfincore also allows you to see this information).
> I don't see how a stable (a server in production) system can have its
> ratio moving up and down so fast without known pattern.

Really?  It doesn't seem that hard to me.  For example, your nightly
reports might use a different set of tables than are active during the
day....

> PS: there is very good blocker for the pg_class changes : what happens
> in a standby ? Maybe it just opens the door on how to unlock that or
> find another option to get the information per table but distinct per
> server. (or we don't care, at least for a first implementation, like
> for other parameters)

That's a good point, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: cache estimates, cache access cost

From
Cédric Villemain
Date:
2011/5/19 Robert Haas <robertmhaas@gmail.com>:
> On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
> <cedric.villemain.debian@gmail.com> wrote:
>> The point is to get ratio in cache, not the distribution of the data
>> in cache (pgfincore also allows you to see this information).
>> I don't see how a stable (a server in production) system can have its
>> ratio moving up and down so fast without known pattern.
>
> Really?  It doesn't seem that hard to me.  For example, your nightly
> reports might use a different set of tables than are active during the
> day....

Yes, this is a known pattern; I believe we can work with it.

>
>> PS: there is very good blocker for the pg_class changes : what happens
>> in a standby ? Maybe it just opens the door on how to unlock that or
>> find another option to get the information per table but distinct per
>> server. (or we don't care, at least for a first implementation, like
>> for other parameters)
>
> That's a good point, too.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: cache estimates, cache access cost

From
Robert Haas
Date:
On Thu, May 19, 2011 at 8:19 AM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:
> 2011/5/19 Robert Haas <robertmhaas@gmail.com>:
>> On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
>> <cedric.villemain.debian@gmail.com> wrote:
>>> The point is to get ratio in cache, not the distribution of the data
>>> in cache (pgfincore also allows you to see this information).
>>> I don't see how a stable (a server in production) system can have its
>>> ratio moving up and down so fast without known pattern.
>>
>> Really?  It doesn't seem that hard to me.  For example, your nightly
>> reports might use a different set of tables than are active during the
>> day....
>
> yes, this is known pattern, I believe we can work with it.

I guess the case where I agree that this would be relatively static is
on something like a busy OLTP system.  If different users access
different portions of the main tables, which parts of each relation
are hot might move around, but overall the percentage of that relation
in cache probably won't move around a ton, except perhaps just after
running a one-off reporting query, or when the system is first
starting up.

But that's not everybody's workload.  Imagine a system that is
relatively lightly used.  Every once in a while someone comes along
and runs a big reporting query.  Well, the contents of the buffer
caches might vary considerably depending on *which* big reporting
queries ran most recently.

Also, even if we knew what was going to be in cache at the start of
the query, the execution of the query might change things greatly as
it runs.  For example, imagine a join between some table and itself.
If we estimate that none of the data is in cache, we will almost
certainly be wrong, because it's likely both sides of the join are
going to access some of the same pages.  Exactly how many depends on
the details of the join condition and whether we choose to implement
it by merging, sorting, or hashing.  But it's likely going to be more
than zero.  This problem can also arise in other contexts - for
example, if a query accesses a bunch of large tables, the tables that
are accessed later in the computation might be less cached than the
ones accessed earlier in the computation, because the earlier accesses
pushed parts of the tables accessed later out of cache.  Or, if a
query requires a large sort, and the value of work_mem is very high
(say 1GB), the sort might evict data from cache.  Now maybe none of
this matters a bit in practice, but it's something to think about.

There was an interesting report on a problem along these lines from
Kevin Grittner a while back.  He found he needed to set seq_page_cost
and random_page_cost differently for the database user that ran the
nightly reports, precisely because the degree of caching was very
different than it was for the daily activity, and he got bad plans
otherwise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: cache estimates, cache access cost

From
Cédric Villemain
Date:
2011/5/19 Robert Haas <robertmhaas@gmail.com>:
> On Thu, May 19, 2011 at 8:19 AM, Cédric Villemain
> <cedric.villemain.debian@gmail.com> wrote:
>> 2011/5/19 Robert Haas <robertmhaas@gmail.com>:
>>> On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
>>> <cedric.villemain.debian@gmail.com> wrote:
>>>> The point is to get ratio in cache, not the distribution of the data
>>>> in cache (pgfincore also allows you to see this information).
>>>> I don't see how a stable (a server in production) system can have its
>>>> ratio moving up and down so fast without known pattern.
>>>
>>> Really?  It doesn't seem that hard to me.  For example, your nightly
>>> reports might use a different set of tables than are active during the
>>> day....
>>
>> yes, this is known pattern, I believe we can work with it.
>
> I guess the case where I agree that this would be relatively static is
> on something like a busy OLTP system.  If different users access
> different portions of the main tables, which parts of each relation
> are hot might move around, but overall the percentage of that relation
> in cache probably won't move around a ton, except perhaps just after
> running a one-off reporting query, or when the system is first
> starting up.

yes.

>
> But that's not everybody's workload.  Imagine a system that is
> relatively lightly used.  Every once in a while someone comes along
> and runs a big reporting query.  Well, the contents of the buffer
> caches are might vary considerably depending on *which* big reporting
> queries ran most recently.

Yes, I agree. This scenario is the case where oscache_percent and
pgcache_percent are subject to change, I guess. We need to define 1/
whether the values can/need to be changed and 2/ when to update them.
For 2/, database usage may help trigger an ANALYZE when required. But to
be honest I'd like to hear more about the strategy Greg suggests here.

Those scenarios are good to keep in mind to build good indicators, both
for the plugin doing the ANALYZE and for solving 2/.

>
> Also, even if we knew what was going to be in cache at the start of
> the query, the execution of the query might change things greatly as
> it runs.  For example, imagine a join between some table and itself.
> If we estimate that none of the data is i cache, we will almost
> certainly be wrong, because it's likely both sides of the join are
> going to access some of the same pages.  Exactly how many depends on
> the details of the join condition and whether we choose to implement
> it by merging, sorting, or hashing.  But it's likely going to be more
> than zero.  This problem can also arise in other contexts - for
> example, if a query accesses a bunch of large tables, the tables that
> are accessed later in the computation might be less cached than the
> ones accessed earlier in the computation, because the earlier accesses
> pushed parts of the tables accessed later out of cache.

Yes, I believe the Mackert and Lohman formula has been good so far, and
I never suggested removing it.
It will need some rewriting to work with the new GUC and the new
pg_class columns, but the code is already in place for that.
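
To hand-wave at what that rewrite could look like (this is only one
possible reading, with made-up names: total_page_accesses is the number
of page accesses a node will make, pages_fetched is the Mackert and
Lohman estimate of how many of them miss the cache, and pgcache_percent
is the new pg_class column):

    /*
     * Illustrative only: charge the accesses Mackert and Lohman predicts
     * will miss the cache according to the stored cache fraction, and
     * charge the remaining accesses, which currently cost nothing, at
     * cache_page_cost.
     */
    static double
    cached_access_cost(double total_page_accesses, double pages_fetched,
                       double pgcache_percent,
                       double cache_page_cost, double random_page_cost)
    {
        double cached_frac = pgcache_percent / 100.0;
        double hits = total_page_accesses - pages_fetched;

        return pages_fetched * (cached_frac * cache_page_cost +
                                (1.0 - cached_frac) * random_page_cost) +
               hits * cache_page_cost;
    }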

> Or, if a
> query requires a large sort, and the value of work_mem is very high
> (say 1GB), the sort might evict data from cache.  Now maybe none of
> this matters a bit in practice, but it's something to think about.

Yes I agree again.

>
> There was an interesting report on a problem along these lines from
> Kevin Grittner a while back.  He found he needed to set seq_page_cost
> and random_page_cost differently for the database user that ran the
> nightly reports, precisely because the degree of caching was very
> different than it was for the daily activity, and he got bad plans
> otherwise.

This is in fact a very interesting use case. I believe the same strategy
can be applied, updating cache_page_cost and pg_class.
But I would really like it if this closes that use case: seq_page_cost,
random_page_cost and cache_page_cost should not need to be changed; they
should be more 'hardware dependent'. What needs to change in such a case
is in fact the frequency of ANALYZE CACHE (or arbitrarily set values).
That should let the planner and the costsize functions have accurate
values and provide the best plan (again, the cache estimation arising
from the running query itself remains in the hands of Mackert and
Lohman).
OK, maybe the user will have to issue an ANALYZE CACHE; between some
queries in his scenarios.

Maybe a good scenario to add to the performance farm? (Like the others,
but this one has the great value of being a production case.)

I'll write those scenarios up on a wiki page so they can be used to
review corner cases and possible issues (not now, it is late here).

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


[WIP] cache estimates, cache access cost

From
Cédric Villemain
Date:
2011/5/16 Greg Smith <greg@2ndquadrant.com>:
> Cédric Villemain wrote:
>>
>>
>> http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache
>>
>
> This rebases easily to make Cedric's changes move to the end; I just pushed
> a version with that change to
> https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone
> wants a cleaner one to browse.  I've attached a patch too if that's more
> your thing.
>
> I'd recommend not getting too stuck on the particular hook Cédric has added
> here to compute the cache estimate, which uses mmap and mincore to figure it
> out.  It's possible to compute similar numbers, albeit less accurate, using
> an approach similar to how pg_buffercache inspects things.  And I even once
> wrote a background writer extension that collected this sort of data as it
> was running the LRU scan anyway.  Discussions of this idea seem to focus on
> how the "what's in the cache?" data is collected, which as far as I'm
> concerned is the least important part.  There are multiple options, some
> work better than others, and there's no reason that can't be swapped out
> later.  The more important question is how to store the data collected and
> then use it for optimizing queries.

Attached are updated patches without the plugin itself. I've also
added the cache_page_cost GUC; unlike the other page_cost settings,
this one is not settable per tablespace.

There are 6 patches:

0001-Add-reloscache-column-to-pg_class.patch
0002-Add-a-function-to-update-the-new-pg_class-cols.patch
0003-Add-ANALYZE-OSCACHE-VERBOSE-relation.patch
0004-Add-a-Hook-to-handle-OSCache-stats.patch
0005-Add-reloscache-to-Index-Rel-OptInfo.patch
0006-Add-cache_page_cost-GUC.patch

I have some comments on my own code:

* I am not sure of the best datatype to use for 'reloscache'.
* I didn't include the catalog number change in the patch itself.
* oscache_update_relstats() is very similar to vac_update_relstats();
it may be better to merge them, but reloscache should not be updated at
the same time as the other stats.
* There is probably too much work done in do_oscache_analyze_rel()
because I kept vac_open_indexes() (not a big drama atm).
* I don't know gram.y very well, so I am not sure my changes
cover all cases.
* No tests; the similar columns and GUCs do not have tests either, but a
test for ANALYZE OSCACHE is still missing.

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Attachment

Re: [WIP] cache estimates, cache access cost

From
Robert Haas
Date:
On Tue, Jun 14, 2011 at 10:29 AM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:
> 0001-Add-reloscache-column-to-pg_class.patch
> 0002-Add-a-function-to-update-the-new-pg_class-cols.patch
> 0003-Add-ANALYZE-OSCACHE-VERBOSE-relation.patch
> 0004-Add-a-Hook-to-handle-OSCache-stats.patch
> 0005-Add-reloscache-to-Index-Rel-OptInfo.patch
> 0006-Add-cache_page_cost-GUC.patch

It seems to me that posting updated versions of this patch gets us no
closer to addressing the concerns I (and Tom, on other threads)
expressed about this idea previously.  Specifically:

1. ANALYZE happens far too infrequently to believe that any data taken
at ANALYZE time will still be relevant at execution time.
2. Using data gathered by ANALYZE will make plans less stable, and our
users complain not infrequently about the plan instability we already
have, therefore we should not add more.
3. Even if the data were accurate and did not cause plan instability, we
have no evidence that using it will improve real-world performance.

Now, it's possible that you or someone else could provide some
experimental evidence refuting these points.  But right now there
isn't any, and until there is, -1 from me on applying any of this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [WIP] cache estimates, cache access cost

From
Cédric Villemain
Date:
2011/6/14 Robert Haas <robertmhaas@gmail.com>:
> On Tue, Jun 14, 2011 at 10:29 AM, Cédric Villemain
> <cedric.villemain.debian@gmail.com> wrote:
>> 0001-Add-reloscache-column-to-pg_class.patch
>> 0002-Add-a-function-to-update-the-new-pg_class-cols.patch
>> 0003-Add-ANALYZE-OSCACHE-VERBOSE-relation.patch
>> 0004-Add-a-Hook-to-handle-OSCache-stats.patch
>> 0005-Add-reloscache-to-Index-Rel-OptInfo.patch
>> 0006-Add-cache_page_cost-GUC.patch
>
> It seems to me that posting updated versions of this patch gets us no
> closer to addressing the concerns I (and Tom, on other threads)
> expressed about this idea previously.  Specifically:
>
> 1. ANALYZE happens far too infrequently to believe that any data taken
> at ANALYZE time will still be relevant at execution time.

ANALYZE happens when people execute it; otherwise it is auto-analyze, and I
am not providing an auto-analyze-oscache.
ANALYZE OSCACHE is just a very simple wrapper to update pg_class. The
frequency is not important here, I believe.

> 2. Using data gathered by ANALYZE will make plans less stable, and our
> users complain not infrequently about the plan instability we already
> have, therefore we should not add more.

Again, it is hard to do an UPDATE pg_class SET reloscache, so I used the
ANALYZE logic.
Also, I have taken into account the fact that someone may want to SET
the values, as was also suggested, so my patches allow the DBA to declare
'this table is 95% in cache' (a stable value, not based on
OS stats).

This case has been suggested several times and is covered by my patch.

> 3. Even if the data were accurate and did not cause plan stability, we
> have no evidence that using it will improve real-world performance.

I have not finished my work on cost estimation; I believe this work
will take some time and can be done in another commitfest. At the
moment my patches do not change anything in the decisions of the
planner, they just offer the tools I need to hack on cost estimates.

>
> Now, it's possible that you or someone else could provide some
> experimental evidence refuting these points.  But right now there
> isn't any, and until there is, -1 from me on applying any of this.

I was trying to split the patch into groups of features to reduce
its size. The work is in progress.

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: [WIP] cache estimates, cache access cost

From
Greg Smith
Date:
On 06/14/2011 11:04 AM, Robert Haas wrote:
> Even if the data were accurate and did not cause plan stability, we
> have no evidence that using it will improve real-world performance.
>    

That's the dependency Cédric has provided us a way to finally make 
progress on.  Everyone says there's no evidence that this whole approach 
will improve performance.  But we can't collect such data, to prove or 
disprove that it helps, without a proof of concept patch that implements 
*something*.  You may not like the particular way the data is collected 
here, but it's a working implementation that may be useful for some 
people.  I'll take "data collected at ANALYZE time" as a completely 
reasonable way to populate the new structures with realistic enough test 
data to use initially.

Surely at least one other way to populate the statistics, and possibly 
multiple other ways that the user selects, will be needed eventually.  I 
commented a while ago on this thread:  every one of these discussions 
always gets dragged into the details of how the cache statistics data 
will be collected and rejects whatever is suggested as not good enough.  
Until that stops, no progress will ever get made on the higher level 
details.  By its nature, developing toward integrating cached 
percentages is going to lurch forward on both "collecting the cache 
data" and "using the cache knowledge in queries" fronts almost 
independently.  This is not a commit candidate; it's the first useful 
proof of concept step for something we keep talking about but never 
really doing.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us




Re: [WIP] cache estimates, cache access cost

From
Robert Haas
Date:
On Tue, Jun 14, 2011 at 1:10 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 06/14/2011 11:04 AM, Robert Haas wrote:
>> Even if the data were accurate and did not cause plan stability, we
>> have no evidence that using it will improve real-world performance.
>
> That's the dependency Cédric has provided us a way to finally make progress
> on.  Everyone says there's no evidence that this whole approach will improve
> performance.  But we can't collect such data, to prove or disprove it helps,
> without a proof of concept patch that implements *something*.  You may not
> like the particular way the data is collected here, but it's a working
> implementation that may be useful for some people.  I'll take "data
> collected at ANALYZE time" as a completely reasonable way to populate the
> new structures with realistic enough test data to use initially.

But there's no reason that code (which may or may not eventually prove
useful) has to be incorporated into the main tree.  We don't commit
code so people can go benchmark it; we ask for the benchmarking to be
done first, and then if the results are favorable, we commit the code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [WIP] cache estimates, cache access cost

From
Robert Haas
Date:
On Tue, Jun 14, 2011 at 12:06 PM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:
>> 1. ANALYZE happens far too infrequently to believe that any data taken
>> at ANALYZE time will still be relevant at execution time.
>
> ANALYZE happens when people execute it, else it is auto-analyze and I
> am not providing auto-analyze-oscache.
> ANALYZE OSCACHE is just a very simple wrapper to update pg_class. The
> frequency is not important here, I believe.

Well, I'm not saying you have to have all the answers to post a WIP
patch, certainly.  But in terms of getting something committable, it
seems like we need to have at least an outline of what the long-term
plan is.  If ANALYZE OSCACHE is an infrequent operation, then the data
isn't going to be a reliable guide to what will happen at execution
time...

>> 2. Using data gathered by ANALYZE will make plans less stable, and our
>> users complain not infrequently about the plan instability we already
>> have, therefore we should not add more.

...and if it is a frequent operation then it's going to result in
unstable plans (and maybe pg_class bloat).  There's a fundamental
tension here that I don't think you can just wave your hands at.

> I was trying to split the patch size by group of features to reduce
> its size. The work is in progress.

Totally reasonable, but I can't see committing any of it without some
evidence that there's light at the end of the tunnel.  No performance
tests *whatsoever* have been done.  We can debate the exact amount of
evidence that should be required to prove that something is useful
from a performance perspective, but we at least need some.  I'm
beating on this point because I believe that the whole idea of trying
to feed this information back into the planner is going to turn out to
be something that we don't want to do.  I think it's going to turn out
to have downsides that are far larger than the upsides.  I am
completely willing to be proven wrong, but right now I think this
will make things worse and you think it will make things better and I
don't see any way to bridge that gap without doing some measurements.

For example, if you run this patch on a system and subject that system
to a relatively even workload, how much do the numbers bounce around
between runs?  What if you vary the workload, so that you blast it
with OLTP traffic at some times and then run reporting queries at
other times?  Or different tables become hot at different times?

Once you've written code to make the planner do something with the
caching % values, then you can start to explore other questions.  Can
you generate plan instability, especially on complex queries, which
are more prone to change quickly based on small changes in the cost
estimates?  Can you demonstrate a workload where bad performance is
inevitable with the current code, but with your code, the system
becomes self-tuning and ends up with good performance?  What happens
if you have a large cold table with a small hot end where all activity
is concentrated?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [WIP] cache estimates, cache access cost

From
Cédric Villemain
Date:
2011/6/14 Robert Haas <robertmhaas@gmail.com>:
> On Tue, Jun 14, 2011 at 12:06 PM, Cédric Villemain
> <cedric.villemain.debian@gmail.com> wrote:
>>> 1. ANALYZE happens far too infrequently to believe that any data taken
>>> at ANALYZE time will still be relevant at execution time.
>>
>> ANALYZE happens when people execute it, else it is auto-analyze and I
>> am not providing auto-analyze-oscache.
>> ANALYZE OSCACHE is just a very simple wrapper to update pg_class. The
>> frequency is not important here, I believe.
>
> Well, I'm not saying you have to have all the answers to post a WIP
> patch, certainly.  But in terms of getting something committable, it
> seems like we need to have at least an outline of what the long-term
> plan is.  If ANALYZE OSCACHE is an infrequent operation, then the data
> isn't going to be a reliable guide to what will happen at execution
> time...

Ok.

>
>>> 2. Using data gathered by ANALYZE will make plans less stable, and our
>>> users complain not infrequently about the plan instability we already
>>> have, therefore we should not add more.
>
> ...and if it is a frequent operation then it's going to result in
> unstable plans (and maybe pg_class bloat).  There's a fundamental
> tension here that I don't think you can just wave your hands at.

I don't want to hide that point, which is simply correct.
The idea is not to have something that needs to be updated too often,
but it does need to be taken into account.

>
>> I was trying to split the patch size by group of features to reduce
>> its size. The work is in progress.
>
> Totally reasonable, but I can't see committing any of it without some
> evidence that there's light at the end of the tunnel.  No performance
> tests *whatsoever* have been done.  We can debate the exact amount of
> evidence that should be required to prove that something is useful
> from a performance perspective, but we at least need some.  I'm
> beating on this point because I believe that the whole idea of trying
> to feed this information back into the planner is going to turn out to
> be something that we don't want to do.  I think it's going to turn out
> to have downsides that are far larger than the upsides.

It is possible, yes.
I try to make the changes in such a way that if reloscache keeps its
default value, the planner keeps the same behavior as in the past.

> I am
> completely willing to be be proven wrong, but right now I think this
> will make things worse and you think it will make things better and I
> don't see any way to bridge that gap without doing some measurements.

correct.

>
> For example, if you run this patch on a system and subject that system
> to a relatively even workload, how much do the numbers bounce around
> between runs?  What if you vary the workload, so that you blast it
> with OLTP traffic at some times and then run reporting queries at
> other times?  Or different tables become hot at different times?

This is all true, and it is *already* true.
Like the thread about random_page_cost vs index_page_cost, where the
good option is to change the parameters at certain moments of the day
(IIRC the use case).

I mean that I agree those benchmarks need to be done; hopefully I can
fix some use cases while not breaking others too much, or at all,
or ...

>
> Once you've written code to make the planner do something with the
> caching % values, then you can start to explore other questions.  Can
> you generate plan instability, especially on complex queries, which
> are more prone to change quickly based on small changes in the cost
> estimates?  Can you demonstrate a workload where bad performance is
> inevitable with the current code, but with your code, the system

My next step is the cost estimation changes. I already have some very
small use cases where the minimal changes I did so far look interesting,
but that is not enough to present as evidence.

> becomes self-tuning and ends up with good performance?  What happens
> if you have a large cold table with a small hot end where all activity
> is concentrated?

We are at step 3 here :-) I already have some ideas to handle those
situations, but they are not yet polished.

The current idea is to be conservative, like PostgreSQL used to be, for example:
/*
 * disk and cache costs
 * this assumes an agnostic knowledge of the data repartition and query
 * usage, despite large tables may have a hot part of 10% which is the only
 * requested part, or that we select only (c)old data so the cache is
 * useless. We keep the original strategy to not guess too much and just
 * ponderate the cost globally.
 */
run_cost += baserel->pages * ( spc_seq_page_cost * (1 - baserel->oscache)
                             + cache_page_cost   * baserel->oscache);
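
To give a rough feel for the effect (numbers purely illustrative): with
seq_page_cost = 1.0, a hypothetical cache_page_cost = 0.1 and
oscache = 0.9, a 10000-page sequential scan is charged
10000 * (1.0 * 0.1 + 0.1 * 0.9) = 1900 instead of 10000 today, while a
relation with oscache = 0 keeps exactly the current cost.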


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: [WIP] cache estimates, cache access cost

From
Alvaro Herrera
Date:
Excerpts from Cédric Villemain's message of mar jun 14 10:29:36 -0400 2011:

> Attached are updated patches without the plugin itself. I've also
> added the cache_page_cost GUC, this one is not per tablespace, like
> others page_cost.
> 
> There are 6 patches:
> 
> 0001-Add-reloscache-column-to-pg_class.patch

Hmm, do you really need this to be a new column?  Would it work to have
it be a reloption?

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: [WIP] cache estimates, cache access cost

From
Cédric Villemain
Date:
2011/6/14 Alvaro Herrera <alvherre@commandprompt.com>:
> Excerpts from Cédric Villemain's message of mar jun 14 10:29:36 -0400 2011:
>
>> Attached are updated patches without the plugin itself. I've also
>> added the cache_page_cost GUC, this one is not per tablespace, like
>> others page_cost.
>>
>> There are 6 patches:
>>
>> 0001-Add-reloscache-column-to-pg_class.patch
>
> Hmm, do you really need this to be a new column?  Would it work to have
> it be a reloption?

If we can have ALTER TABLE running on a heavy workload, why not.
I am a bit scared by the effect of such a reloption: it focuses on a
HINT-oriented strategy, when I would like to allow a dynamic strategy
driven by the server. That work is not done and may not work, so a
reloption is good at least as a backup (and is more in line with the idea
suggested by Tom and others).

>
> --
> Álvaro Herrera <alvherre@commandprompt.com>
> The PostgreSQL Company - Command Prompt, Inc.
> PostgreSQL Replication, Consulting, Custom Development, 24x7 support
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: [WIP] cache estimates, cache access cost

From
Alvaro Herrera
Date:
Excerpts from Cédric Villemain's message of mar jun 14 17:10:20 -0400 2011:

> If we can have ALTER TABLE running on heavy workload, why not.
> I am bit scared by the effect of such reloption, it focus on HINT
> oriented strategy when I would like to allow a dynamic strategy from
> the server. This work is not done and may not work, so a reloption is
> good at least as a backup  (and is more in the idea suggested by Tom
> and others)

Hmm, sounds like yet another use case for pg_class_nt.  Why do these
keep popping up?

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: [WIP] cache estimates, cache access cost

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Cédric Villemain's message of mar jun 14 10:29:36 -0400 2011:
>> 0001-Add-reloscache-column-to-pg_class.patch

> Hmm, do you really need this to be a new column?  Would it work to have
> it be a reloption?

If it's to be updated in the same way as ANALYZE updates reltuples and
relpages (ie, an in-place non-transactional update), I think it'll have
to be a real column.
        regards, tom lane


Re: [WIP] cache estimates, cache access cost

From
Greg Smith
Date:
On 06/14/2011 01:16 PM, Robert Haas wrote:
> But there's no reason that code (which may or may not eventually prove
> useful) has to be incorporated into the main tree.  We don't commit
> code so people can go benchmark it; we ask for the benchmarking to be
> done first, and then if the results are favorable, we commit the code.
>    

Who said anything about this being a commit candidate?  The "WIP" in the 
subject says it's not intended to be.  The community asks people to 
submit design ideas early so that ideas around them can be explored 
publicly.  One of the things that needs to be explored, and that could 
use some community feedback, is exactly how this should be benchmarked 
in the first place.  This topic--planning based on cached 
percentage--keeps coming up, but hasn't gone very far as an abstract 
discussion.  Having a patch to test lets it turn to a concrete one.

Note that I already listed myself as the reviewer  here, so it's not 
even like this is asking explicitly for a community volunteer to help.  
Would you like us to research this privately and then dump a giant patch 
that is commit candidate quality on everyone six months from now, 
without anyone else getting input to the process, or would you like the 
work to happen here?  I recommended Cédric not ever bother soliciting 
ideas early, because I didn't want to get into this sort of debate.  I 
avoid sending anything here unless I already have a strong idea about 
the solution, because it's hard to keep criticism at bay even with 
that.  He was more optimistic about working within the community 
contribution guidelines and decided to send this over early instead.  If 
you feel this is too rough to even discuss, I'll mark it returned with 
feedback and we'll go develop this ourselves.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us




Re: [WIP] cache estimates, cache access cost

From
Bruce Momjian
Date:
Greg Smith wrote:
> On 06/14/2011 01:16 PM, Robert Haas wrote:
> > But there's no reason that code (which may or may not eventually prove
> > useful) has to be incorporated into the main tree.  We don't commit
> > code so people can go benchmark it; we ask for the benchmarking to be
> > done first, and then if the results are favorable, we commit the code.
> >    
> 
> Who said anything about this being a commit candidate?  The "WIP" in the 
> subject says it's not intended to be.  The community asks people to 
> submit design ideas early so that ideas around them can be explored 
> publicly.  One of the things that needs to be explored, and that could 
> use some community feedback, is exactly how this should be benchmarked 
> in the first place.  This topic--planning based on cached 
> percentage--keeps coming up, but hasn't gone very far as an abstract 
> discussion.  Having a patch to test lets it turn to a concrete one.
> 
> Note that I already listed myself as the reviewer  here, so it's not 
> even like this is asking explicitly for a community volunteer to help.  
> Would you like us to research this privately and then dump a giant patch 
> that is commit candidate quality on everyone six months from now, 
> without anyone else getting input to the process, or would you like the 
> work to happen here?  I recommended Cédric not ever bother soliciting 
> ideas early, because I didn't want to get into this sort of debate.  I 
> avoid sending anything here unless I already have a strong idea about 
> the solution, because it's hard to keep criticism at bay even with 
> that.  He was more optimistic about working within the community 
> contribution guidelines and decided to send this over early instead.  If 
> you feel this is too rough to even discuss, I'll mark it returned with 
> feedback and we'll go develop this ourselves.

I would like to see us continue researching in this direction.  I think
perhaps the background writer would be ideal for collecting this
information because it scans the buffer cache already, and frequently.
(Yes, I know it can't access databases.)

I think random_page_cost is a dead-end --- it will never be possible for
it to produce the right value for us.  Its value is tied up in caching,
e.g. the default 4 is not the right value for a physical drive (it
should be much higher), but kernel and shared buffer caching require it
to be a hybrid number that isn't really realistic.  And once we have
caching baked into that number, the caching is obviously not going to be
even across all tables.  Hence, there is no way for random_page_cost to be
improved and we have to start thinking about alternatives.

Basically, random_page_cost is a terrible setting and we have to admit
that and move forward.  I realize the concerns about unstable plans, and
we might need to give users the option of stable plans with a fixed
random_page_cost, but at this point we don't even have enough data to
know we need that.  What we do know is that random_page_cost is
inadequate.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: [WIP] cache estimates, cache access cost

From
Tom Lane
Date:
Greg Smith <greg@2ndQuadrant.com> writes:
> On 06/14/2011 01:16 PM, Robert Haas wrote:
>> But there's no reason that code (which may or may not eventually prove
>> useful) has to be incorporated into the main tree.  We don't commit
>> code so people can go benchmark it; we ask for the benchmarking to be
>> done first, and then if the results are favorable, we commit the code.

> Who said anything about this being a commit candidate?  The "WIP" in the 
> subject says it's not intended to be.  The community asks people to 
> submit design ideas early so that ideas around them can be explored 
> publicly.  One of the things that needs to be explored, and that could 
> use some community feedback, is exactly how this should be benchmarked 
> in the first place.  This topic--planning based on cached 
> percentage--keeps coming up, but hasn't gone very far as an abstract 
> discussion.  Having a patch to test lets it turn to a concrete one.

Yeah, it *can't* go very far as an abstract discussion ... we need some
realistic testing to decide whether this is a good idea, and you can't
get that without code.

I think the real underlying issue here is that we have this CommitFest
process that is focused on getting committable or nearly-committable
code into the tree, and it just doesn't fit well for experimental code.
I concur with Robert's desire to not push experimental code into the
main repository, but we need to have *some* way of working with it.
Maybe a separate repo where experimental branches could hang out would
be helpful?

(Another way of phrasing my point is that "WIP" is not conveying the
true status of this patch.  Maybe "Experimental" would be an appropriate
label.)
        regards, tom lane


Re: [WIP] cache estimates, cache access cost

From
Greg Smith
Date:
On 06/14/2011 07:08 PM, Tom Lane wrote:
> I concur with Robert's desire to not push experimental code into the
> main repository, but we need to have *some* way of working with it.
> Maybe a separate repo where experimental branches could hang out would
> be helpful?
>    

Well, this one is sitting around in branches at both git.postgresql.org 
and github so far, both being updated periodically.  Maybe there's some 
value around an official experimental repository too, but I thought that 
was the idea of individual people having their own directories on 
git.postgresql.org.  Do we need something fancier than that?  It would be 
nice, but seems little return on investment to improve that, relative to 
what you can do easily enough now.

The idea David Fetter has been advocating of having a "bit rot" farm to 
help detect when the experimental branches drift too far out of date 
tries to make that concept really formal.  I like that idea, too, but 
find it hard to marshal enough resources to do something about it.  The 
current status quo isn't that terrible; noticing bit rot when it's 
relevant isn't that hard to do.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us




Re: [WIP] cache estimates, cache access cost

From
Robert Haas
Date:
On Tue, Jun 14, 2011 at 6:17 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Who said anything about this being a commit candidate?  The "WIP" in the
> subject says it's not intended to be.  The community asks people to submit
> design ideas early so that ideas around them can be explored publicly.  One
> of the things that needs to be explored, and that could use some community
> feedback, is exactly how this should be benchmarked in the first place.
>  This topic--planning based on cached percentage--keeps coming up, but
> hasn't gone very far as an abstract discussion.  Having a patch to test lets
> it turn to a concrete one.
>
> Note that I already listed myself as the reviewer  here, so it's not even
> like this is asking explicitly for a community volunteer to help.  Would you
> like us to research this privately and then dump a giant patch that is
> commit candidate quality on everyone six months from now, without anyone
> else getting input to the process, or would you like the work to happen
> here?  I recommended Cédric not ever bother soliciting ideas early, because
> I didn't want to get into this sort of debate.  I avoid sending anything
> here unless I already have a strong idea about the solution, because it's
> hard to keep criticism at bay even with that.  He was more optimistic about
> working within the community contribution guidelines and decided to send
> this over early instead.  If you feel this is too rough to even discuss,
> I'll mark it returned with feedback and we'll go develop this ourselves.

My usual trope on this subject is that WIP patches tend to elicit
helpful feedback if and only if the patch author is clear about what
sort of feedback they are seeking.  I'm interested in this topic, so,
I'm willing to put some effort into it; but, as I've said before, I
think this patch is coming from the wrong end, so in the absence of
any specific guidance on what sort of input would be useful, that's
the feedback you're getting.  Feel free to clarify what would be more
helpful.  :-)

Incidentally, I have done a bit of math around how to rejigger the
costing formulas to take cached_page_cost and caching_percentage into
account, which I think is the most interesting place to start this
work.  If it's helpful, I can write it up in a more organized way and
post that; it likely wouldn't be that much work to incorporate it into
what Cedric has here already.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [WIP] cache estimates, cache access cost

From
Greg Stark
Date:
On Tue, Jun 14, 2011 at 4:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> 1. ANALYZE happens far too infrequently to believe that any data taken
> at ANALYZE time will still be relevant at execution time.
> 2. Using data gathered by ANALYZE will make plans less stable, and our
> users complain not infrequently about the plan instability we already
> have, therefore we should not add more.
> 3. Even if the data were accurate and did not cause plan stability, we
> have no evidence that using it will improve real-world performance.

I feel like this is all baseless FUD. ANALYZE isn't perfect but it's
our interface for telling postgres to gather stats and we generally
agree that having stats and modelling the system behaviour as
accurately as practical is the right direction so we need a specific
reason why this stat and this bit of modeling is a bad idea before we
dismiss it.

I think the kernel of truth in these concerns is simply that
everything else ANALYZE looks at mutates only on DML. If you load the
same data into two databases and run ANALYZE you'll get (modulo random
sampling) the same stats. And if you never modify it and analyze it
again a week later you'll get the same stats again. So autovacuum can
guess when to run analyze based on the number of DML operations, it
can run it without regard to how busy the system is, and it can hold
off on running it if the data hasn't changed.

In the case of the filesystem buffer cache the cached percentage will
vary over time regardless of whether the data changes. Plain select
queries will change it, even other activity outside the database will
change it. There are a bunch of strategies for mitigating this
problem: we might want to look at the cache situation more frequently,
discount the results we see more aggressively, and possibly
maintain a kind of running average over time.
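
Just to make the "discount and maintain a running average" idea concrete,
something as simple as this would already do it (purely illustrative,
nothing from the patches; the 0.2 weight is an arbitrary choice):

/*
 * Exponentially weighted moving average of successive cache observations,
 * so that one unusual snapshot (say, taken right after a big reporting
 * query) cannot swing the stored estimate too hard.
 */
static double
smooth_cached_fraction(double stored_estimate, double observed_fraction)
{
    const double alpha = 0.2;   /* weight given to the newest observation */

    return alpha * observed_fraction + (1.0 - alpha) * stored_estimate;
}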

There's another problem which I haven't seen mentioned. Because the
access method will affect the cache there's the possibility of
feedback loops. e.g. A freshly loaded system prefers sequential scans
for a given table because without the cache the seeks of random reads
are too expensive... causing it to never load that table into cache...
causing that table to never be cached and never switch to an index
method. It's possible there are mitigation strategies for this as well
such as keeping a running average over time and discounting the
estimates with some heuristic values.







-- 
greg


Re: [WIP] cache estimates, cache access cost

From
Cédric Villemain
Date:
2011/6/19 Greg Stark <stark@mit.edu>:
> On Tue, Jun 14, 2011 at 4:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 1. ANALYZE happens far too infrequently to believe that any data taken
>> at ANALYZE time will still be relevant at execution time.
>> 2. Using data gathered by ANALYZE will make plans less stable, and our
>> users complain not infrequently about the plan instability we already
>> have, therefore we should not add more.
>> 3. Even if the data were accurate and did not cause plan stability, we
>> have no evidence that using it will improve real-world performance.
>
> I feel like this is all baseless FUD. ANALYZE isn't perfect but it's
> our interface for telling postgres to gather stats and we generally
> agree that having stats and modelling the system behaviour as
> accurately as practical is the right direction so we need a specific
> reason why this stat and this bit of modeling is a bad idea before we
> dismiss it.
>
> I think the kernel of truth in these concerns is simply that
> everything else ANALYZE looks at mutates only on DML. If you load the
> same data into two databases and run ANALYZE you'll get (modulo random
> sampling) the same stats. And if you never modify it and analyze it
> again a week later you'll get the same stats again. So autovacuum can
> guess when to run analyze based on the number of DML operations, it
> can run it without regard to how busy the system is, and it can hold
> off on running it if the data hasn't changed.
>
> In the case of the filesystem buffer cache the cached percentage will
> vary over time regardless of whether the data changes. Plain select
> queries will change it, even other activity outside the database will
> change it. There are a bunch of strategies for mitigating this
> problem: we might want to look at the cache situation more frequently,
> discount the results we see since more aggressively, and possibly
> maintain a kind of running average over time.

Yes.

>
> There's another problem which I haven't seen mentioned. Because the
> access method will affect the cache there's the possibility of
> feedback loops. e.g. A freshly loaded system prefers sequential scans
> for a given table because without the cache the seeks of random reads
> are too expensive... causing it to never load that table into cache...
> causing that table to never be cached and never switch to an index
> method. It's possible there are mitigation strategies for this as well

Yeah, that's one of the problems to solve. So far I've tried to keep a
planner which behaves as it does currently when rel_oscache == 0, so that
a fresh server will have the same planning as a server without
rel_oscache.

Those points are to be solved in the cost estimates (and selfuncs). For
this case, there is a balance between page filtering cost and index access
cost. *And* once the table is in cache, the index costs less and can
be better because it needs less filtering (fewer rows, fewer pages, less
work). There is also a possible issue here (if using the index evicts
the table from cache), but I am not too afraid of that right now.

> such as keeping a running average over time and discounting the
> estimates with some heuristic values.

Yes, definitely something to think about. My biggest fear here is
for shared servers (when there is competition between services for
the OS cache, defeating the kernel's caching strategies).

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: [WIP] cache estimates, cache access cost

From
Robert Haas
Date:
On Sun, Jun 19, 2011 at 9:38 AM, Greg Stark <stark@mit.edu> wrote:
> On Tue, Jun 14, 2011 at 4:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> 1. ANALYZE happens far too infrequently to believe that any data taken
>> at ANALYZE time will still be relevant at execution time.
>> 2. Using data gathered by ANALYZE will make plans less stable, and our
>> users complain not infrequently about the plan instability we already
>> have, therefore we should not add more.
>> 3. Even if the data were accurate and did not cause plan stability, we
>> have no evidence that using it will improve real-world performance.
>
> I feel like this is all baseless FUD. ANALYZE isn't perfect but it's
> our interface for telling postgres to gather stats and we generally
> agree that having stats and modelling the system behaviour as
> accurately as practical is the right direction so we need a specific
> reason why this stat and this bit of modeling is a bad idea before we
> dismiss it.
>
> I think the kernel of truth in these concerns is simply that
> everything else ANALYZE looks at mutates only on DML. If you load the
> same data into two databases and run ANALYZE you'll get (modulo random
> sampling) the same stats. And if you never modify it and analyze it
> again a week later you'll get the same stats again. So autovacuum can
> guess when to run analyze based on the number of DML operations, it
> can run it without regard to how busy the system is, and it can hold
> off on running it if the data hasn't changed.
>
> In the case of the filesystem buffer cache the cached percentage will
> vary over time regardless of whether the data changes. Plain select
> queries will change it, even other activity outside the database will
> change it. There are a bunch of strategies for mitigating this
> problem: we might want to look at the cache situation more frequently,
> discount the results we see since more aggressively, and possibly
> maintain a kind of running average over time.
>
> There's another problem which I haven't seen mentioned. Because the
> access method will affect the cache there's the possibility of
> feedback loops. e.g. A freshly loaded system prefers sequential scans
> for a given table because without the cache the seeks of random reads
> are too expensive... causing it to never load that table into cache...
> causing that table to never be cached and never switch to an index
> method. It's possible there are mitigation strategies for this as well
> such as keeping a running average over time and discounting the
> estimates with some heuristic values.

*scratches head*

Well, yeah.  I completely agree with you that these are the things we
need to worry about.  Maybe I did a bad job explaining myself, because
ISTM you said my concerns were FUD and then went on to restate them in
different words.

I'm not bent out of shape about using ANALYZE to try to gather the
information.  That's probably a reasonable approach if it turns out we
actually need to do it at all.  I am not sure we do.  What I've argued
for in the past is that we start by estimating the percentage of the
relation that will be cached based on its size relative to
effective_cache_size, and allow the administrator to override the
percentage on a per-relation basis if it turns out to be wrong.  That
would avoid all of these concerns and allow us to focus on the issue
of how the caching percentages impact the choice of plan, and whether
the plans that pop out are in fact better when you provide information
on caching as input.  If we have that facility in core, then people
can write scripts or plug-in modules to do ALTER TABLE .. SET
(caching_percentage = XYZ) every hour or so based on the sorts of
statistics that Cedric is gathering here, and users will be able to
experiment with a variety of algorithms and determine which ones work
the best.
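
To make that starting point concrete, the kind of heuristic I have in mind
is roughly this (only a sketch; caching_percentage is a made-up reloption
name and nothing here is code from the posted patches):

/*
 * Guess what fraction of a relation is cached purely from its size
 * relative to effective_cache_size, unless the administrator has set an
 * explicit per-relation override.
 */
static double
guess_cached_fraction(double rel_pages, double cache_pages,
                      double override)  /* caching_percentage, < 0 if unset */
{
    if (override >= 0.0)
        return override;
    if (rel_pages <= 0.0)
        return 1.0;             /* an empty relation is as cached as it gets */

    /* relations larger than the cache get a proportionally smaller share */
    return (cache_pages >= rel_pages) ? 1.0 : cache_pages / rel_pages;
}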

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [WIP] cache estimates, cache access cost

From
Cédric Villemain
Date:
2011/6/19 Robert Haas <robertmhaas@gmail.com>:
> On Sun, Jun 19, 2011 at 9:38 AM, Greg Stark <stark@mit.edu> wrote:
>> On Tue, Jun 14, 2011 at 4:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> 1. ANALYZE happens far too infrequently to believe that any data taken
>>> at ANALYZE time will still be relevant at execution time.
>>> 2. Using data gathered by ANALYZE will make plans less stable, and our
>>> users complain not infrequently about the plan instability we already
>>> have, therefore we should not add more.
>>> 3. Even if the data were accurate and did not cause plan stability, we
>>> have no evidence that using it will improve real-world performance.
>>
>> I feel like this is all baseless FUD. ANALYZE isn't perfect but it's
>> our interface for telling postgres to gather stats and we generally
>> agree that having stats and modelling the system behaviour as
>> accurately as practical is the right direction so we need a specific
>> reason why this stat and this bit of modeling is a bad idea before we
>> dismiss it.
>>
>> I think the kernel of truth in these concerns is simply that
>> everything else ANALYZE looks at mutates only on DML. If you load the
>> same data into two databases and run ANALYZE you'll get (modulo random
>> sampling) the same stats. And if you never modify it and analyze it
>> again a week later you'll get the same stats again. So autovacuum can
>> guess when to run analyze based on the number of DML operations, it
>> can run it without regard to how busy the system is, and it can hold
>> off on running it if the data hasn't changed.
>>
>> In the case of the filesystem buffer cache the cached percentage will
>> vary over time regardless of whether the data changes. Plain select
>> queries will change it, even other activity outside the database will
>> change it. There are a bunch of strategies for mitigating this
>> problem: we might want to look at the cache situation more frequently,
>> discount the results we see since more aggressively, and possibly
>> maintain a kind of running average over time.
>>
>> There's another problem which I haven't seen mentioned. Because the
>> access method will affect the cache there's the possibility of
>> feedback loops. e.g. A freshly loaded system prefers sequential scans
>> for a given table because without the cache the seeks of random reads
>> are too expensive... causing it to never load that table into cache...
>> causing that table to never be cached and never switch to an index
>> method. It's possible there are mitigation strategies for this as well
>> such as keeping a running average over time and discounting the
>> estimates with some heuristic values.
>
> *scratches head*
>
> Well, yeah.  I completely agree with you that these are the things we
> need to worry about.  Maybe I did a bad job explaining myself, because
> ISTM you said my concerns were FUD and then went on to restate them in
> different words.
>
> I'm not bent out of shape about using ANALYZE to try to gather the
> information.  That's probably a reasonable approach if it turns out we
> actually need to do it at all.  I am not sure we do.  What I've argued
> for in the past is that we start by estimating the percentage of the
> relation that will be cached based on its size relative to
> effective_cache_size, and allow the administrator to override the
> percentage on a per-relation basis if it turns out to be wrong.  That
> would avoid all of these concerns and allow us to focus on the issue
> of how the caching percentages impact the choice of plan, and whether
> the plans that pop out are in fact better when you provide information
> on caching as input.  If we have that facility in core, then people
> can write scripts or plug-in modules to do ALTER TABLE .. SET
> (caching_percentage = XYZ) every hour or so based on the sorts of
> statistics that Cedric is gathering here, and users will be able to
> experiment with a variety of algorithms and determine which ones work
> the best.

Robert, I am very surprised.
My patch does offer that.

1st, I used ANALYZE because it is the way I found to update pg_class.
You are suggesting ALTER TABLE instead; that is fine, but give it to me
lock-free :) else we have the, ahem, Alvaro's pg_class_ng (I find this
one interesting because it will be a lot easier to have different values
on a standby server if we find a way to have pg_class_ng 'updatable' per
server).
So, as long as the value can be changed without problems, I don't care
where it resides.

2nd, I provided the patches in the last CF exactly to allow us to get to
the exciting part: the cost-estimate changes. (After all, we can work
on the cost estimates, and if later we find a way to use ALTER TABLE or
pg_class_ng, we can just do that instead of the ANALYZE magic.)

3rd, you can right now write a plugin to set the value of rel_oscache
(exactly like the one you would write for an ALTER TABLE SET reloscache...)

float4
RelationGetRelationOSCacheInFork(Relation relation, ForkNumber forkNum)
{
    float4      percent = 0;

    /* if a plugin is present, let it manage things */
    if (OSCache_hook)
        percent = (*OSCache_hook) (relation, forkNum);

    return percent;
}
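
For anyone curious what the plugin side amounts to, here is a simplified,
standalone sketch of the mmap()/mincore() measurement it wraps (Linux
flavoured, working on a single segment file path rather than a
Relation/ForkNumber pair; none of this is copied from the branch):

#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Fraction of one relation segment file resident in the OS page cache. */
static double
segment_cached_fraction(const char *path)
{
    int             fd;
    struct stat     st;
    void           *map;
    unsigned char  *vec;
    size_t          pages, resident = 0;
    long            pagesize = sysconf(_SC_PAGESIZE);
    double          result = 0.0;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0.0;
    if (fstat(fd, &st) != 0 || st.st_size == 0)
    {
        close(fd);
        return 0.0;
    }

    /* PROT_NONE is enough: we only ask the kernel which pages are resident */
    map = mmap(NULL, st.st_size, PROT_NONE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
    {
        close(fd);
        return 0.0;
    }

    pages = (st.st_size + pagesize - 1) / pagesize;
    vec = malloc(pages);
    if (vec != NULL && mincore(map, st.st_size, vec) == 0)
    {
        size_t      i;

        for (i = 0; i < pages; i++)
            if (vec[i] & 1)     /* low bit set => page is in core */
                resident++;
        result = (double) resident / (double) pages;
    }

    free(vec);
    munmap(map, st.st_size);
    close(fd);
    return result;
}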

Looks like the main fear comes from my use of the word ANALYZE...

PS: ANALYZE OSCACHE does *not* run with ANALYZE; those are distinct
operations. (ANALYZE won't do the job of ANALYZE OSCACHE. We can
discuss the grammar; maybe an ANALYZE ([OSCACHE], [DATA], ...) would be
better.)
--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: [WIP] cache estimates, cache access cost

From
Greg Smith
Date:
On 06/19/2011 09:38 AM, Greg Stark wrote:
> There's another problem which I haven't seen mentioned. Because the
> access method will affect the cache there's the possibility of
> feedback loops. e.g. A freshly loaded system prefers sequential scans
> for a given table because without the cache the seeks of random reads
> are too expensive...

Not sure if it's been mentioned in this thread yet, but the feedback 
issue has popped up in regard to this area plenty of times.  I think 
everyone who's producing regular input into this is aware of it, even if 
it's not mentioned regularly.  I'm not too concerned about the specific 
case you warned about because I don't see how sequential scan vs. index 
costing will be any different on a fresh system than it is now.  But 
there are plenty of cases like it to be mapped out here, and many are 
not solvable--they're just something that needs to be documented as a risk.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us




Re: [WIP] cache estimates, cache access cost

From
Robert Haas
Date:
On Sun, Jun 19, 2011 at 3:32 PM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:
> 2nd, I provided the patches on the last CF, exactly to allow to go to
> the exciting part: the cost-estimates changes. (after all, we can work
> on the cost estimate, and if later we find a way to use ALTER TABLE or
> pg_class_ng, just do it instead of via the ANALYZE magic)

We're talking past each other here, somehow.  The cost-estimating part
does not require this patch in order to do something useful, but this
patch, AFAICT, absolutely does require the cost-estimating part to do
something useful.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company