Thread: Thanks, naming conventions, and count()

Thanks, naming conventions, and count()

From
Casey Lyon
Date:
First off I just wanted to give a big 'thank you' to all the developers and contributors
who have made PostgreSQL what it is today. I haven't come across a single thing
since my first experience with it a few years ago that hasn't been corrected, sped
up, or otherwise positively enhanced!

In working with 7.1 over the past couple weeks, I've noted the following mods may
add to usability and speed:

o v7.1 changed the database naming convention to be all numeric; I suggest having
  the DB engine create symbolic links when creating a new DB and subsequent
  tables.  For instance, in creating a database 'foo' with table 'bar' the
  /path/to/pgsql/data/base folder will have a new folder named something like
  '18720'; this folder could also have a symbolic link to 'foo'.  Then in the
  '18720' folder, rather than just having numeric files for each table, pk,
  index, etc., there could be symbolic links following the naming convention
  'bar', 'pk_foo_pkey', 'field1_foo_ukey', 'field2_foo_key'.

  Maybe this would work best as a configurable flag that could be set during
  compilation or in the conf file.

o count() should use index scans for tables with a PK; scans would be on the PK
  index; even after running 'vacuum analyze' such a query still uses a
  sequential scan.  For instance, "select count(*) from bar" and even
  "select(pk_name) from bar" both use sequential scans.  Likewise, scans on
  fields with indexes should use the index.
 


I hope this input is useful; keep up the excellent work,

Casey Lyon
Systems Engineer
Earthcars.com, Inc
www.earthcars.com
casey@earthcars.com



Re: Thanks, naming conventions, and count()

From
Bruce Momjian
Date:
> First off I just wanted to give a big 'thank you' to all the
> developers and contributors who have made PostgreSQL what it is
> today. I haven't come across a single thing since my first
> experience with it a few years ago that hasn't been corrected,
> sped up, or otherwise positively enhanced!
> 
> In working with 7.1 over the past couple weeks, I've noted the
> following mods may add to usability and speed:
> 
> o v7.1 changed the database naming convention to be all numeric;
> I suggest having
>    the DB engine create symbolic links when creating a new DB
>    and subsequent tables.  For instance, in creating a database
>    'foo' with table 'bar' the /path/to/pgsql/data/base folder
>    will have a new folder named something like '18720'; this
>    folder could also have a symbolic link to 'foo'. Then in the
>    '18720' folder rather than just having numeric files for each
>    table, pk, index, etc.   there could be symbolic links
>    following the naming convention 'bar', 'pk_foo_pkey',
>    'field1_foo_ukey', 'field2_foo_key'.
> 
>    Maybe this would work best as a configurable flag that could
>    be set during compilation or in the conf file.

I think this is an excellent idea, and will add it to the TODO list.  We
agonized over moving to numeric names, and we couldn't think of a good
way to allow administrators to know which table matched which files.  The
big problem is that there is no good way to make the symlinks reliable
because in a crash, the symlink could point to a table creation that got
rolled back or the renaming of a table that got rolled back.  I think
symlinks with some postmaster cleanup script that fixed bad symlinks
would be great for 7.2.

I have added this to the TODO list.  If someone objects, I will remove
it:
* Add tablename symlinks for numeric file names
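
To make the "cleanup script" idea a little more concrete, here is a rough
sketch of what such a pass might look like.  This is not code that exists
anywhere in the tree; the directory layout and the assumption that name
symlinks sit right beside the numeric files are purely illustrative.

/*
 * Sketch only: remove dangling symlinks from one database directory,
 * e.g. at postmaster startup.  Layout and naming are assumptions.
 */
#include <stdio.h>
#include <dirent.h>
#include <sys/stat.h>
#include <unistd.h>

static void
cleanup_dangling_symlinks(const char *dbdir)
{
    DIR        *dir = opendir(dbdir);
    struct dirent *de;

    if (dir == NULL)
        return;

    while ((de = readdir(dir)) != NULL)
    {
        char        path[1024];
        struct stat lst, st;

        snprintf(path, sizeof(path), "%s/%s", dbdir, de->d_name);

        /* only look at symlinks ... */
        if (lstat(path, &lst) != 0 || !S_ISLNK(lst.st_mode))
            continue;

        /* ... and drop any whose numeric target no longer exists */
        if (stat(path, &st) != 0)
        {
            fprintf(stderr, "removing dangling symlink %s\n", path);
            unlink(path);
        }
    }
    closedir(dir);
}

int
main(int argc, char **argv)
{
    /* e.g. cleanup_dangling_symlinks("/usr/local/pgsql/data/base/18720") */
    if (argc == 2)
        cleanup_dangling_symlinks(argv[1]);
    return 0;
}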

> 
> o count() should use index scans for tables with a PK; scans
> would be on the PK index;
>    even after running 'vacuum analyze' such a query still uses
>    a sequential scan. For instance, "select count(*) from bar"
>    and even "select(pk_name) from bar" both use sequential scans.
>    Likewise, scans on fields with indexes should use the index.

The problem here is that now we don't have commit status in the index
rows, so they have to check the heap for every row.  One idea is to
update the index status on an index scan, and if we can do that, we can
easily use the index.  However, the table scan is pretty quick.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Thanks, naming conventions, and count()

From
The Hermit Hacker
Date:
doesn't this defeat the reasons for going to numerics?  is there a reason
why its such a difficult thing to do a SELECT oid on pg_database and
pg_class to get this information?  that's what I've been doing when I need
to know *shrug*

On Sun, 29 Apr 2001, Bruce Momjian wrote:

> > First off I just wanted to give a big 'thank you' to all the
> > developers and contributors who have made PostgreSQL what it is
> > today. I haven't come across a single thing since my first
> > experience with it a few years ago that hasn't been corrected,
> > sped up, or otherwise positively enhanced!
> >
> > In working with 7.1 over the past couple weeks, I've noted the
> > following mods may add to usability and speed:
> >
> > o v7.1 changed the database naming convention to be all numeric;
> > I suggest having
> >    the DB engine create symbolic links when creating a new DB
> >    and subsequent tables.  For instance, in creating a database
> >    'foo' with table 'bar' the /path/to/pgsql/data/base folder
> >    will have a new folder named something like '18720'; this
> >    folder could also have a symbolic link to 'foo'. Then in the
> >    '18720' folder rather than just having numeric files for each
> >    table, pk, index, etc.   there could be symbolic links
> >    following the naming convention 'bar', 'pk_foo_pkey',
> >    'field1_foo_ukey', 'field2_foo_key'.
> >
> >    Maybe this would work best as a configurable flag that could
> >    be set during compilation or in the conf file.
>
> I think this is an excellent idea, and will add it to the TODO list.  We
> agonized over moving to numeric names, and we couldn't think of a good
> way to allow administrators to know which table matched which files.  The
> big problem is that there is no good way to make the symlinks reliable
> because in a crash, the symlink could point to a table creation that got
> rolled back or the renaming of a table that got rolled back.  I think
> symlinks with some postmaster cleanup script that fixed bad symlinks
> would be great for 7.2.
>
> I have added this to the TODO list.  If someone objects, I will remove
> it:
>
>     * Add tablename symlinks for numeric file names
>
> >
> > o count() should use index scans for tables with a PK; scans
> > would be on the PK index;
> >    even after running 'vacuum analyze' such a query still uses
> >    a sequential scan. For instance, "select count(*) from bar"
> >    and even "select(pk_name) from bar" both use sequential scans.
> >    Likewise, scans on fields with indexes should use the index.
>
> The problem here is that now we don't have commit status in the index
> rows, so they have to check the heap for every row.  One idea is to
> update the index status on an index scan, and if we can do that, we can
> easily use the index.  However, the table scan is pretty quick.
>
> --
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 853-3000
>   +  If your life is a hard drive,     |  830 Blythe Avenue
>   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html
>

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org



Re: Thanks, naming conventions, and count()

From
Bruce Momjian
Date:
> 
> doesn't this defeat the reasons for going to numerics?  is there a reason
> why its such a difficult thing to do a SELECT oid on pg_database and
> pg_class to get this information?  that's what I've been doing when I need
> to know *shrug*

Yes, but you can't do that if you can't start the database or can't
connect for some reason.  If people don't think it is worthwhile, we can
delete the TODO item.

For example, when someone has trouble figuring out which directory is
which database, they can just ls and look at the symlinks.  Seems like a
nice feature.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Thanks, naming conventions, and count()

From
Bruce Momjian
Date:
> > 
> > doesn't this defeat the reasons for going to numerics?  is there a reason
> > why its such a difficult thing to do a SELECT oid on pg_database and
> > pg_class to get this information?  that's what I've been doing when I need
> > to know *shrug*
> 
> Yes, but you can't do that if you can't start the database or can't
> connect for some reason.  If people don't think it is worthwhile, we can
> delete the TODO item.
> 
> For example, when someone has trouble figuring out which directory is
> which database, they can just ls and look at the symlinks.  Seems like a
> nice feature.

I will admit we are not getting flooded with problems due to the new
numeric file names like I thought we would, so maybe it is not worth the
symlinks.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Thanks, naming conventions, and count()

From
The Hermit Hacker
Date:
On Sun, 29 Apr 2001, Bruce Momjian wrote:

> >
> > doesn't this defeat the reasons for going to numerics?  is there a reason
> > why its such a difficult thing to do a SELECT oid on pg_database and
> > pg_class to get this information?  that's what I've been doing when I need
> > to know *shrug*
>
> Yes, but you can't do that if you can't start the database or can't
> connect for some reason.  If people don't think it is worthwhile, we can
> delete the TODO item.

Okay, what does being able to ls the directory give you if you can't start
the database?  the only thing I do it for is to figure out which tables
are taking up so much disk space, or which databases ...

> For example, when someone has trouble figuring out which directory is
> which database, they can just ls and look at the symlinks.  Seems like
> a nice feature.

Ya, but I thought that the reason for going numeric had to do with being
transaction safe ... something about being able to safely RENAME a table,
if my recollection remotely comes close ... as soon as you start throwing
around symlinks, do we break that once more?  what about if someone wants
to physically move a table to a separate file system, which is something
that has been suggested as a way around the fact that all files are in the
same subdirectory?  You have a symlink to the symlink?

I don't know the answers to these questions, which is why I'm asking them
... if this is something safe to do, and doesn't break us again, then
sounds like a good idea to me too ...



Re: Thanks, naming conventions, and count()

From
Bruce Momjian
Date:
> On Sun, 29 Apr 2001, Bruce Momjian wrote:
> 
> > >
> > > doesn't this defeat the reasons for going to numerics?  is there a reason
> > > why its such a difficult thing to do a SELECT oid on pg_database and
> > > pg_class to get this information?  that's what I've been doing when I need
> > > to know *shrug*
> >
> > Yes, but you can't do that if you can't start the database or can't
> > connect for some reason.  If people don't think it is worthwhile, we can
> > delete the TODO item.
> 
> Okay, what does being able to ls the directory give you if you can't start
> the database?  the only thing I do it for is to figure out which tables
> are taking up so much disk space, or which databases ...

Yes, it is just for admin convenience, and if you pull back a database
from a tar backup, you can know which files are which without starting
the database.

> 
> > For example, when someone has trouble figuring out which directory is
> > which database, they can just ls and look at the symlinks.  Seems like
> > a nice feature.
> 
> Ya, but I thought that the reason for going numeric had to do with being
> transaction safe ... something about being able to safely RENAME a table,
> if my recollection remotely comes close ... as soon as you start throwing
> around symlinks, do we break that once more?  what about if someone wants
> to physically move a table to a separate file system, which is something
> that has been suggested as a way around the fact that all files are in the
> same subdirectory?  You have a symlink to the symlink?
> 
> I don't know the answers to these questions, which is why I'm asking them
> ... if this is something safe to do, and doesn't break us again, then
> sounds like a good idea to me too ...

I was suggesting the symlinks purely for admin convenience.  The database
would use only the numeric names.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Thanks, naming conventions, and count()

From
The Hermit Hacker
Date:
On Sun, 29 Apr 2001, Bruce Momjian wrote:

> > I don't know the answers to these questions, which is why I'm asking them
> > ... if this is something safe to do, and doesn't break us again, then
> > sounds like a good idea to me too ...
>
> I was suggesting the symlinks purely for admin convenience.  The database
> would use only the numeric names.

Except that the database would have to maintain those links ... now you've
given something ppl are relying on being there, but, for some reason, a
symlink wasn't created, so they think their table doesn't exist?

I can even think of a situation, as unlikely as it can be, where this
could happen ... run out of inodes on the file system ... last inode used
by the table, no inode to stick the symlink onto ...

its a remote situation, but I've personally had it happen ...

I'd personally prefer to see some text file created in the database
directory itself that contains the mappings ... so that each time there is
a change, it just redumps that data to the text file ... less to maintain
overall ...





Re: Thanks, naming conventions, and count()

From
Bruce Momjian
Date:
> I can even think of a situation, as unlikely as it can be, where this
> could happen ... run out of inodes on the file system ... last inode used
> by the table, no inode to stick the symlink onto ...


If you run out of inodes, you are going to have much bigger problems
than symlinks.  Sort file creation would fail too.

> 
> its a remote situation, but I've personally had it happen ...
> 
> I'd personally prefer to see some text file created in the database
> directory itself that contains the mappings ... so that each time there is
> a change, it just redumps that data to the text file ... less to maintain
> overall ...

Yes, I like that idea, but the problem is that it is hard to update just
one table in the file.  You sort of have to update the entire file each
time a table changes.  That is why I liked symlinks because they are
per-table, but you are right that the symlink creation could fail
because the new table file was never created or something, leaving the
symlink pointing to nothing.  Not sure how to address this.  Is there a
way to update a flat file when a single table changes?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Thanks, naming conventions, and count()

From
Alfred Perlstein
Date:
* Bruce Momjian <pgman@candle.pha.pa.us> [010429 20:14] wrote:

> Yes, I like that idea, but the problem is that it is hard to update just
> one table in the file.  You sort of have to update the entire file each
> time a table changes.  That is why I liked symlinks because they are
> per-table, but you are right that the symlink creation could fail
> because the new table file was never created or something, leaving the
> symlink pointing to nothing.  Not sure how to address this.  Is there a
> way to update a flat file when a single table changes?

Sort of, if that flat file is in the form of:
123456;"tablename                   "
000033;"another_table               "

ie, each line is a fixed length.
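
A rough C sketch of why the fixed length matters: a single entry can then be
rewritten in place without rewriting the rest of the file.  The map file name,
the 6-digit OID field and the 28-character padded name are made up to match
the example above; nothing here is actual PostgreSQL behavior.

#include <stdio.h>

#define LINE_LEN 38             /* 6-digit oid + ';"' + 28-char name + '"' + '\n' */

static int
update_mapping(const char *mapfile, unsigned int oid, const char *name)
{
    FILE   *fp = fopen(mapfile, "r+");  /* the file must already exist */
    char    line[LINE_LEN + 1];
    long    pos = 0;

    if (fp == NULL)
        return -1;

    while (fread(line, 1, LINE_LEN, fp) == LINE_LEN)
    {
        unsigned int cur;

        line[LINE_LEN] = '\0';
        if (sscanf(line, "%u;", &cur) == 1 && cur == oid)
        {
            /* overwrite just this record; the length never changes */
            snprintf(line, sizeof(line), "%06u;\"%-28.28s\"\n", oid, name);
            fseek(fp, pos, SEEK_SET);
            fwrite(line, 1, LINE_LEN, fp);
            fclose(fp);
            return 0;
        }
        pos += LINE_LEN;
    }
    fclose(fp);
    return 1;                   /* oid not found */
}

int
main(void)
{
    /* e.g. repoint the (hypothetical) entry for oid 123456 at "tablename" */
    return update_mapping("table_names.map", 123456, "tablename");
}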


-- 
-Alfred Perlstein - [alfred@freebsd.org]
Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/


Re: Thanks, naming conventions, and count()

From
Bruce Momjian
Date:
> * Bruce Momjian <pgman@candle.pha.pa.us> [010429 20:14] wrote:
> 
> > Yes, I like that idea, but the problem is that it is hard to update just
> > one table in the file.  You sort of have to update the entire file each
> > time a table changes.  That is why I liked symlinks because they are
> > per-table, but you are right that the symlink creation could fail
> > because the new table file was never created or something, leaving the
> > symlink pointing to nothing.  Not sure how to address this.  Is there a
> > way to update a flat file when a single table changes?
> 
> Sort of, if that flat file is in the form of:
> 123456;"tablename                   "
> 000033;"another_table               "
> 
> ie, each line is a fixed length.
> 

Yea, after I posted, I realized that using a fixed length line would
solve the problem.  The larger problem, though, I think, is concurrency.
Can multiple backends update that single flat file reliably?  I suppose
they could do append-only to the file, and you could grab the last
entry, but again, sometimes it is rolled back, so I think there has to
be a way to clean it up.

Of course, Tom or Vadim may come along and say this is a stupid idea,
and we would be done discussing it.  :-)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Thanks, naming conventions, and count()

From
The Hermit Hacker
Date:
On Sun, 29 Apr 2001, Bruce Momjian wrote:

> > I can even think of a situation, as unlikely as it can be, where this
> > could happen ... run out of inodes on the file system ... last inode used
> > by the table, no inode to stick the symlink onto ...
>
>
> If you run out of inodes, you are going to have much bigger problems
> than symlinks.  Sort file creation would fail too.
>
> >
> > its a remote situation, but I've personally had it happen ...
> >
> > I'd personally prefer to see some text file created in the database
> > directory itself that contains the mappings ... so that each time there is
> > a change, it just redumps that data to the text file ... less to maintain
> > overall ...
>
> Yes, I like that idea, but the problem is that it is hard to update just
> one table in the file.  You sort of have to update the entire file each
> time a table changes.  That is why I liked symlinks because they are
> per-table, but you are right that the symlink creation could fail
> because the new table file was never created or something, leaving the
> symlink pointing to nothing.  Not sure how to address this.  Is there a
> way to update a flat file when a single table changes?

Why not just dump the whole file?  That way, if a previous dump failed for
whatever reason, the new dump would correct that omission ...

Then again, why not some sort of 'lsdb' command that looks at where it is
and gives you info as appropriate?

if in data/base, then do a connect to template1 using postgres so that you
can dump and parse the raw data from pg_database ... if in a directory,
you should be able to connect to that database in a similar way to grab
the contents of pg_class ...

no server would need to be running for this to work, and if it was
readonly, it should be workable if a server is running, no?



Re: Thanks, naming conventions, and count()

From
"B. Palmer"
Date:
2 points:

- I thought that a big part of the reason we got rid of filenames was
so we could use arbitrary table / db names that were not restricted by
the file system / OS.  Using links would then reintroduce this restriction.

- What is the format for the table?  Could we write a tool that can
read the tables raw in case of a 'HARD' crash?  One could then walk the
table by hand as needed.  Can someone give me information on the schema
for the files?  I'll take a look at it.  There may also be a way to
then use WAL files to do some more serious recovery.

Thoughts?

- brandon





Re: Thanks, naming conventions, and count()

From
Bruce Momjian
Date:
> > Yes, I like that idea, but the problem is that it is hard to update just
> > one table in the file.  You sort of have to update the entire file each
> > time a table changes.  That is why I liked symlinks because they are
> > per-table, but you are right that the symlink creation could fail
> > because the new table file was never created or something, leaving the
> > symlink pointing to nothing.  Not sure how to address this.  Is there a
> > way to update a flat file when a single table changes?
> 
> Why not just dump the whole file?  That way, if a previous dump failed for
> whatever reason, the new dump would correct that omission ...

Yes, you can do that, but it is only updated during a dump, right? 
Makes it hard to use during the day, no?

> 
> Then again, why not some sort of 'lsdb' command that looks at where it is
> and gives you info as appropriate?


I want to do that for oid2name.  I had the plan laid out, but never got
to it.

> 
> if in data/base, then do a connect to template1 using postgres so that you
> can dump and parse the raw data from pg_database ... if in a directory,
> you should be able to connect to that database in a similar way to grab
> the contents of pg_class ...
> 
> no server would need to be running for this to work, and if it was
> readonly, it should be workable if a server is running, no?

I think parsing the file contents is too hard.  The database would have
to be running and I would use psql.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Thanks, naming conventions, and count()

From
Casey Lyon
Date:
I could even see a utility that does a dump of this info into a flat file,
entirely overwriting the file every time.

This would be quick to reference and usable in a meltdown scenario. Could
easily be incorporated into vacuum and other db maintenance cron scripts.

-Casey


Bruce Momjian wrote:

>>> Yes, I like that idea, but the problem is that it is hard to update just
>>> one table in the file.  You sort of have to update the entire file each
>>> time a table changes.  That is why I liked symlinks because they are
>>> per-table, but you are right that the symlink creation could fail
>>> because the new table file was never created or something, leaving the
>>> symlink pointing to nothing.  Not sure how to address this.  Is there a
>>> way to update a flat file when a single table changes?
>> 
>> Why not just dump the whole file?  That way, if a previous dump failed for
>> whatever reason, the new dump would correct that omission ...
> 
> 
> Yes, you can do that, but it is only updated during a dump, right? 
> Makes it hard to use during the day, no?
> 
> 
>> Then again, why not some sort of 'lsdb' command that looks at where it is
>> and gives you info as appropriate?
> 
> 
> 
> I want to do that for oid2name.  I had the plan laid out, but never got
> to it.
> 
> 
>> if in data/base, then do a connect to template1 using postgres so that you
>> can dump and parse the raw data from pg_database ... if in a directory,
>> you should be able to connect to that database in a similar way to grab
>> the contents of pg_class ...
>> 
>> no server would need to be running for this to work, and if it was
>> readonly, it should be workable if a server is running, no?
> 
> 
> I think parsing the file contents is too hard.  The database would have
> to be running and I would use psql.



Re: Thanks, naming conventions, and count()

From
Casey Lyon
Date:
Bruce Momjian wrote:

> The problem here is that now we don't have commit status in the index
> rows, so they have to check the heap for every row.  One idea is to
> update the index status on an index scan, and if we can do that, we can
> easily use the index.  However, the table scan is pretty quick.

It certainly works quickly for smaller tables, however the 21.7 million
record table I ran this on takes a touch longer as shown here:

database=# explain select count(*) from table;
NOTICE:  QUERY PLAN:

Aggregate  (cost=478056.20..478056.20 rows=1 width=0)
   ->  Seq Scan on table  (cost=0.00..423737.76 rows=21727376 width=0)

EXPLAIN

However I noted explain provides rows as part of its data; from what
I've seen this loses precision over time or with large data imports,
though; at least until the table is vacuumed again.

-Casey



Re: Thanks, naming conventions, and count()

From
The Hermit Hacker
Date:
On Sun, 29 Apr 2001, Bruce Momjian wrote:

> > > Yes, I like that idea, but the problem is that it is hard to update just
> > > one table in the file.  You sort of have to update the entire file each
> > > time a table changes.  That is why I liked symlinks because they are
> > > per-table, but you are right that the symlink creation could fail
> > > because the new table file was never created or something, leaving the
> > > symlink pointing to nothing.  Not sure how to address this.  Is there a
> > > way to update a flat file when a single table changes?
> >
> > Why not just dump the whole file?  That way, if a previous dump failed for
> > whatever reason, the new dump would correct that omission ...
>
> Yes, you can do that, but it is only updated during a dump, right?
> Makes it hard to use during the day, no?
>
> >
> > Then again, why not some sort of 'lsdb' command that looks at where it is
> > and gives you info as appropriate?
>
>
> I want to do that for oid2name.  I had the plan laid out, but never got
> to it.
>
> >
> > if in data/base, then do a connect to template1 using postgres so that you
> > can dump and parse the raw data from pg_database ... if in a directory,
> > you should be able to connect to that database in a similar way to grab
> > the contents of pg_class ...
> >
> > no server would need to be running for this to work, and if it was
> > readonly, it should be workable if a server is running, no?
>
> I think parsing the file contents is too hard.  The database would have
> to be running and I would use psql.

I don't know, I recovered someone's database using a "raw" connection ...
wasn't that difficult once I figured out the format *shrug*

the following gets the oid,relname's for a database in the format:

echo "select oid,relname from pg_class" | postgres -L -D /usr/local/pgsql/data eceb | egrep "oid|relname"

then just parse the output using a simple perl script:
        1: oid = "163338"      (typeid = 26, len = 4, typmod = -1, byval = t)        2: relname = "auth_info_uid_key"
   (typeid = 19, len = 32, typmod = -1, byval = f)        1: oid = "163341"      (typeid = 26, len = 4, typmod = -1,
byval= t)        2: relname = "auth_info_id"    (typeid = 19, len = 32, typmod = -1, byval = f)        1: oid = "56082"
     (typeid = 26, len = 4, typmod = -1, byval = t)        2: relname = "auth_info"       (typeid = 19, len = 32,
typmod= -1, byval = f)
 

the above won't work on a live database, did try that, so best is to test
for a connection first, and this would be a fall back ... but you'd at
least have a live *and* non live way of parsing the data *shrug*



Re: Thanks, naming conventions, and count()

From
"Serguei Mokhov"
Date:
----- Original Message ----- 
From: Alfred Perlstein <bright@wintelcom.net>
To: Bruce Momjian <pgman@candle.pha.pa.us>
Cc: The Hermit Hacker <scrappy@hub.org>; Casey Lyon <casey@earthcars.com>; <pgsql-hackers@postgresql.org>
Sent: Sunday, April 29, 2001 11:17 PM
Subject: Re: [HACKERS] Thanks, naming conventions, and count()


> * Bruce Momjian <pgman@candle.pha.pa.us> [010429 20:14] wrote:
> 
> > Yes, I like that idea, but the problem is that it is hard to update just
> > one table in the file.  You sort of have to update the entire file each
> > time a table changes.  That is why I liked symlinks because they are
> > per-table, but you are right that the symlink creation could fail
> > because the new table file was never created or something, leaving the
> > symlink pointing to nothing.  Not sure how to address this.  Is there a
> > way to update a flat file when a single table changes?
> 
> Sort of, if that flat file is in the form of:
> 123456;"tablename                   "
> 000033;"another_table               "
> 
> ie, each line is a fixed length.

What if we have one such line in a separate file, one file per table, in one dir?
Then there is no restriction on field length, you don't need
to dump the whole file each time, and you don't have to maintain real symlinks.

The 'lsdb' command (courtesy of The Hermit Hacker :))
will assemble all of them together and show the DBA
where to look for a specific table.
File names can be your OIDs again, with just the
table name kept inside the file.  Keep these files under
a certain dir, and let lsdb display them appropriately
when needed.
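
A tiny sketch of that layout, just to make it concrete (the "names"
subdirectory and the paths are assumptions, not anything PostgreSQL does):

#include <stdio.h>

/* write one small file, named after the OID, containing only the relation name */
static void
write_name_file(const char *dbdir, unsigned int oid, const char *relname)
{
    char    path[1024];
    FILE   *fp;

    snprintf(path, sizeof(path), "%s/names/%u", dbdir, oid);   /* names/ assumed to exist */
    fp = fopen(path, "w");
    if (fp != NULL)
    {
        fprintf(fp, "%s\n", relname);
        fclose(fp);
    }
}

int
main(void)
{
    /* e.g. record that numeric file 163338 in database dir 18720 is table "bar" */
    write_name_file("/usr/local/pgsql/data/base/18720", 163338, "bar");
    return 0;
}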

Or another idea is to create 'deferred' symlinks.
The (real) symlinks are only created when the DBA issues the 'lsdb'
command and lists them, and this list is maintained only
when 'lsdb' is invoked....

Maybe this sounds stupid, but just a thought... 

Serguei








Re: Thanks, naming conventions, and count()

From
Bruce Momjian
Date:
> It certainly works quickly for smaller tables, however the 21.7 million
> record table I ran this on takes a touch longer as shown here:
> 
> database=# explain select count(*) from table;
> NOTICE:  QUERY PLAN:
> 
> Aggregate  (cost=478056.20..478056.20 rows=1 width=0)
>    ->  Seq Scan on table  (cost=0.00..423737.76 rows=21727376 width=0)
> 
> EXPLAIN
> 
> However I noted explain provides rows as part of its data; from what
> I've seen this loses precision over time or with large data imports,
> though; at least until the table is vacuumed again.

I guess I was saying that an index scan could take longer because it has
to walk the btree.  However it only has one column of the table, so it
may be faster.  I never measured the two, but the heap access needed for
the index scan currently is a performance killer.  Sequential is faster
than all those random heap lookups from the index.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Thanks, naming conventions, and count()

From
Bruce Momjian
Date:
> > I think parsing the file contents is too hard.  The database would have
> > to be running and I would use psql.
> 
> I don't know, I recovered someone's database using a "raw" connection ...
> wasn't that difficult once I figured out the format *shrug*
> 
> the following gets the oid,relname's for a database in the format:
> 
> echo "select oid,relname from pg_class" | postgres -L -D /usr/local/pgsql/data eceb | egrep "oid|relname"
> 
> then just parse the output using a simple perl script:
> 
>          1: oid = "163338"      (typeid = 26, len = 4, typmod = -1, byval = t)
>          2: relname = "auth_info_uid_key"       (typeid = 19, len = 32, typmod = -1, byval = f)
>          1: oid = "163341"      (typeid = 26, len = 4, typmod = -1, byval = t)
>          2: relname = "auth_info_id"    (typeid = 19, len = 32, typmod = -1, byval = f)
>          1: oid = "56082"       (typeid = 26, len = 4, typmod = -1, byval = t)
>          2: relname = "auth_info"       (typeid = 19, len = 32, typmod = -1, byval = f)

Oh, you did a direct postgres backend connect.  Yes, that will work
fine.  Good idea if the postmaster is down.  I originally thought you
meant reading the pg_class file raw.  Of course, that would be really
hard because there is no way to know what numeric file is pg_class!

Actually, seems it is always 1259.  I see this in
include/catalog/pg_class.h:
DATA(insert OID = 1259 (  pg_class 83         PGUID 0 1259 0 0 0 0 f f r 22 0 0 0 0 0 f f f _null_ ));

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Thanks, naming conventions, and count()

From
Casey Lyon
Date:
If this isn't incorporated into a utility, it would certainly be prime
for inclusion in the yet-to-be-written chapter 11 of the PG Admin Manual,
"Database Recovery."

Thanks for your responses,   -Casey


The Hermit Hacker wrote:

> On Sun, 29 Apr 2001, Bruce Momjian wrote:
> 
> 
>>>> Yes, I like that idea, but the problem is that it is hard to update just
>>>> one table in the file.  You sort of have to update the entire file each
>>>> time a table changes.  That is why I liked symlinks because they are
>>>> per-table, but you are right that the symlink creation could fail
>>>> because the new table file was never created or something, leaving the
>>>> symlink pointing to nothing.  Not sure how to address this.  Is there a
>>>> way to update a flat file when a single table changes?
>>> 
>>> Why not just dump the whole file?  That way, if a previous dump failed for
>>> whatever reason, the new dump would correct that omission ...
>> 
>> Yes, you can do that, but it is only updated during a dump, right?
>> Makes it hard to use during the day, no?
>> 
>> 
>>> Then again, why not some sort of 'lsdb' command that looks at where it is
>>> and gives you info as appropriate?
>> 
>> 
>> I want to do that for oid2name.  I had the plan laid out, but never got
>> to it.
>> 
>> 
>>> if in data/base, then do a connect to template1 using postgres so that you
>>> can dump and parse the raw data from pg_database ... if in a directory,
>>> you should be able to connect to that database in a similar way to grab
>>> the contents of pg_class ...
>>> 
>>> no server would need to be running for this to work, and if it was
>>> readonly, it should be workable if a server is running, no?
>> 
>> I think parsing the file contents is too hard.  The database would have
>> to be running and I would use psql.
> 
> 
> I don't know, I recovered someone's database using a "raw" connection ...
> wasn't that difficult once I figured out the format *shrug*
> 
> the following gets the oid,relname's for a database in the format:
> 
> echo "select oid,relname from pg_class" | postgres -L -D /usr/local/pgsql/data eceb | egrep "oid|relname"
> 
> then just parse the output using a simple perl script:
> 
>          1: oid = "163338"      (typeid = 26, len = 4, typmod = -1, byval = t)
>          2: relname = "auth_info_uid_key"       (typeid = 19, len = 32, typmod = -1, byval = f)
>          1: oid = "163341"      (typeid = 26, len = 4, typmod = -1, byval = t)
>          2: relname = "auth_info_id"    (typeid = 19, len = 32, typmod = -1, byval = f)
>          1: oid = "56082"       (typeid = 26, len = 4, typmod = -1, byval = t)
>          2: relname = "auth_info"       (typeid = 19, len = 32, typmod = -1, byval = f)
> 
> the above won't work on a live database, did try that, so best is to test
> for a connection first, and this would be a fall back ... but you'd at
> least have a live *and* non live way of parsing the data *shrug*



Re: Thanks, naming conventions, and count()

From
Bruce Momjian
Date:
Here is what I suggested for oid2name to do with file names:

---------------------------------------------------------------------------

Just seems like a major pain; not worth the work.

If you do an ls and pipe it, here is what you would need to do:

- find out where $PWD is
- in that database (found from PID),  for each file in the dir,  look it
up using oid2name
- print that out

problems:
- ls -l vs ls
- columns are different for differing OSs / filesystems
- du will REALLY suck
- what if the user tries to do "ls /var/postgres/data/base/12364"  Will
you try to parse out the request?  Ugh.  no thanks.

I also don't think people will have much reason to use the script.
oid2name will have little enough use; what use will the script have?  Who
knows.  I guess keep it on the back burner till there is a demand.


--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Thanks, naming conventions, and count()

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> big problem is that there is no good way to make the symlinks reliable
> because in a crash, the symlink could point to a table creation that got
> rolled back or the renaming of a table that got rolled back.

Yes.  Have you already forgotten the very long discussion we had about
this some months back?  There is no way to provide a reliable symlink
mapping without re-introducing all the same problems that we went to
numeric filenames to avoid.  Now if you want an *UNRELIABLE* symlink
mapping, maybe we could talk about it ... but IMHO such a feature would
be worse than useless.  Murphy's law says that the symlinks would be
right often enough to mislead dbadmins into trusting them, and wrong
exactly when it would do the most damage to trust them.  The same goes
for other methods of unreliably exporting the name-to-number mapping,
such as dumping it into a flat file.

We do need to document how to get the mapping (ie, select relfilenode,
relname from pg_class).  But I really doubt that an automated method
for exporting the mapping would be worth the cycles it would cost,
even if it could be made reliable which it can't.
        regards, tom lane
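
For what it's worth, that query is also about all an oid2name-style helper
needs.  A minimal libpq sketch, assuming the server is up (the connection
string and output format are just for illustration):

#include <stdio.h>
#include "libpq-fe.h"

int
main(int argc, char **argv)
{
    const char *conninfo = (argc > 1) ? argv[1] : "dbname=template1";
    PGconn     *conn = PQconnectdb(conninfo);
    PGresult   *res;
    int         i;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* the name-to-number mapping Tom mentions */
    res = PQexec(conn, "SELECT relfilenode, relname FROM pg_class ORDER BY relname");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
        return 1;
    }

    for (i = 0; i < PQntuples(res); i++)
        printf("%-10s %s\n", PQgetvalue(res, i, 0), PQgetvalue(res, i, 1));

    PQclear(res);
    PQfinish(conn);
    return 0;
}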


Re: Thanks, naming conventions, and count()

From
Alfred Perlstein
Date:
* Tom Lane <tgl@sss.pgh.pa.us> [010429 23:12] wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > big problem is that there is no good way to make the symlinks reliable
> > because in a crash, the symlink could point to a table creation that got
> > rolled back or the renaming of a table that got rolled back.
> 
> Yes.  Have you already forgotten the very long discussion we had about
> this some months back?  There is no way to provide a reliable symlink
> mapping without re-introducing all the same problems that we went to
> numeric filenames to avoid.  Now if you want an *UNRELIABLE* symlink
> mapping, maybe we could talk about it ... but IMHO such a feature would
> be worse than useless.  Murphy's law says that the symlinks would be
> right often enough to mislead dbadmins into trusting them, and wrong
> exactly when it would do the most damage to trust them.  The same goes
> for other methods of unreliably exporting the name-to-number mapping,
> such as dumping it into a flat file.
> 
> We do need to document how to get the mapping (ie, select relfilenode,
> relname from pg_class).  But I really doubt that an automated method
> for exporting the mapping would be worth the cycles it would cost,
> even if it could be made reliable which it can't.

Perhaps an external tool to rebuild the symlink state that could be
run on an offline database.  But I'm sure you have more important
things to do. :)

-- 
-Alfred Perlstein - [alfred@freebsd.org]
Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/


Re: Thanks, naming conventions, and count()

From
Vince Vielhaber
Date:
On Sun, 29 Apr 2001, Bruce Momjian wrote:

> > > I think parsing the file contents is too hard.  The database would have
> > > to be running and I would use psql.
> >
> > I don't know, I recovered someone's database using a "raw" connection ...
> > wasn't that difficult once I figured out the format *shrug*
> >
> > the following gets the oid,relname's for a database in the format:
> >
> > echo "select oid,relname from pg_class" | postgres -L -D /usr/local/pgsql/data eceb | egrep "oid|relname"
> >
> > then just parse the output using a simple perl script:
> >
> >          1: oid = "163338"      (typeid = 26, len = 4, typmod = -1, byval = t)
> >          2: relname = "auth_info_uid_key"       (typeid = 19, len = 32, typmod = -1, byval = f)
> >          1: oid = "163341"      (typeid = 26, len = 4, typmod = -1, byval = t)
> >          2: relname = "auth_info_id"    (typeid = 19, len = 32, typmod = -1, byval = f)
> >          1: oid = "56082"       (typeid = 26, len = 4, typmod = -1, byval = t)
> >          2: relname = "auth_info"       (typeid = 19, len = 32, typmod = -1, byval = f)
>
> Oh, you did a direct postgres backend connect.  Yes, that will work
> fine.  Good idea if the postmaster is down.  I originally thought you
> meant reading the pg_class file raw.  Of course, that would be really
> hard because there is no way to know what numeric file is pg_class!

But would it work on a crashed database that won't come up?  Or doesn't
the direct connect care about any other tables in this usage?

Vince.
-- 
==========================================================================
Vince Vielhaber -- KA8CSH    email: vev@michvhf.com    http://www.pop4.net
   56K Nationwide Dialup from $16.00/mo at Pop4 Networking
   Online Campground Directory    http://www.camping-usa.com
   Online Giftshop Superstore     http://www.cloudninegifts.com
==========================================================================





Re: Thanks, naming conventions, and count()

From
Tom Lane
Date:
Vince Vielhaber <vev@michvhf.com> writes:
>> Oh, you did a direct postgres backend connect.  Yes, that will work
>> fine.  Good idea if the postmaster is down.  I originally thought you
>> meant reading the pg_class file raw.  Of course, that would be really
>> hard because there is no way to know what numeric file is pg_class!

> But would it work on a crashed database that won't come up

No.

It's not that hard to know "which numeric file is pg_class" --- that
info has to be hard-wired in at some level.  (The backends cannot learn
pg_class's own relfilenode number by examining its pg_class entry...)

It might be worth making a simple utility (could be based on Bryan
White's pg_check) to grovel through the raw pg_class bits and extract
relfilenode info the hard way.  You'd only need it in certain disaster
scenarios, but when you did need it you'd need it bad.

So far we have not seen a report of a situation where this seemed to be
useful, so I'm not that excited about having it... WAL dump and
interrogation utilities are higher on my want list.
        regards, tom lane


Re: Thanks, naming conventions, and count()

From
Bruce Momjian
Date:
> It might be worth making a simple utility (could be based on Bryan
> White's pg_check) to grovel through the raw pg_class bits and extract
> relfilenode info the hard way.  You'd only need it in certain disaster
> scenarios, but when you did need it you'd need it bad.
> 
> So far we have not seen a report of a situation where this seemed to be
> useful, so I'm not that excited about having it... WAL dump and
> interrogation utilities are higher on my want list.

OK, updated TODO item:
* Add table name mapping for numeric file names

I removed the symlink mention, and I agree it is low priority. No one is
really asking for it.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Thanks, naming conventions, and count()

From
Michael Samuel
Date:
On Sun, Apr 29, 2001 at 08:17:28PM -0700, Alfred Perlstein wrote:
> Sort of, if that flat file is in the form of:
> 123456;"tablename                   "
> 000033;"another_table               "

Or better yet, since the flat file is unlikely to be large, you could
just do this dance:

1) open file for reading
2) flock() file exclusively, non-blocking.
3) If 2 failed, sleep a bit, then go back to 1; otherwise open a new file for writing
4) Write out new file
5) rename() the temp file over the original file
6) close files, etc

That way, you'll never have the race of two programs trying to write the file
at the same time (and therefore losing changes), and you get total atomicity
of the writing.

You could also do it with an open(O_EXCL) on a fixed temp file, instead of
the flock() call.  The semantics should be the same.
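
Spelled out in C, the dance above might look roughly like this (the file
names and the two sample records are illustrative; error handling is
minimal):

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>

static int
rewrite_mapping(const char *mapfile, const char *tmpfile)
{
    int     fd;
    FILE   *out;

    /* steps 1-3: open the current file and take an exclusive, non-blocking lock */
    for (;;)
    {
        fd = open(mapfile, O_RDONLY);
        if (fd < 0)
            return -1;
        if (flock(fd, LOCK_EX | LOCK_NB) == 0)
            break;              /* got the lock */
        close(fd);
        usleep(100000);         /* someone else is writing; sleep a bit, retry */
    }

    /* step 4: write the complete new contents to a temp file */
    out = fopen(tmpfile, "w");
    if (out == NULL)
    {
        close(fd);
        return -1;
    }
    /* in real life these lines would be regenerated from pg_class;
     * the two records are just Alfred's examples */
    fprintf(out, "%06u;\"%-28.28s\"\n", 123456, "tablename");
    fprintf(out, "%06u;\"%-28.28s\"\n", 33, "another_table");
    fflush(out);
    fsync(fileno(out));
    fclose(out);

    /* step 5: atomically replace the old file with the new one */
    if (rename(tmpfile, mapfile) != 0)
    {
        close(fd);
        return -1;
    }

    /* step 6: closing the old descriptor releases the flock */
    close(fd);
    return 0;
}

int
main(void)
{
    return rewrite_mapping("table_names.map", "table_names.map.tmp");
}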

Of course, you could always fork() a child to handle this in the background,
as it's hardly important to the running of the database. (Or if it is, it
can become part of the transaction, which means that at rename() time, there
must be no room for other failures, but it mustn't be too late to roll back)

-- 
Michael Samuel <michael@miknet.net>


Re: Thanks, naming conventions, and count()

From
Bruce Momjian
Date:
> > Yes, I like that idea, but the problem is that it is hard to update just
> > one table in the file.
> 
> why not have just one ever-growing file that is only appended to and
> that has lines of the form
> 
> OID, type (DB/TABLE/INDEX/...), name, time
> 
> so when you need the actual info you grep for the name and use the last
> line whose file actually exists.  Not too convenient, but useful enough
> when you really need it.

Yes, that is one idea, but it is hard to undo a change.  You would have
to write to the file only on a commit.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: Thanks, naming conventions, and count()

From
Hannu Krosing
Date:
Bruce Momjian wrote:
> 
> > I can even think of a situation, as unlikely as it can be, where this
> > could happen ... run out of inodes on the file system ... last inode used
> > by the table, no inode to stick the symlink onto ...
> 
> If you run out of inodes, you are going to have much bigger problems
> than symlinks.  Sort file creation would fail too.
> 
> >
> > its a remote situation, but I've personally had it happen ...
> >
> > I'd personally prefer to see some text file created in the database
> > directory itself that contains the mappings ... so that each time there is
> > a change, it just redumps that data to the dext file ... less to maintain
> > overall ...
> 
> Yes, I like that idea, but the problem is that it is hard to update just
> one table in the file.

why not have just one ever-growing file that is only appended to and
that has lines of the form

OID, type (DB/TABLE/INDEX/...), name, time

so when you need the actual info you grep for the name and use the last
line whose file actually exists.  Not too convenient, but useful enough
when you really need it.

-------------------
Hannu
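
A throwaway sketch of that append-and-grep scheme, with Bruce's caveat that
the append should only happen at commit.  The map file name, the line format
and the check against the bare numeric file in the current directory are all
assumptions for illustration:

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* append one "OID, type, name, time" line; call this only at commit */
static void
log_name(const char *mapfile, unsigned int oid, const char *type, const char *name)
{
    FILE   *fp = fopen(mapfile, "a");

    if (fp != NULL)
    {
        fprintf(fp, "%u, %s, %s, %ld\n", oid, type, name, (long) time(NULL));
        fclose(fp);
    }
}

/* return the newest OID recorded for 'name' whose numeric file still exists */
static long
lookup_name(const char *mapfile, const char *name)
{
    FILE   *fp = fopen(mapfile, "r");
    char    line[256];
    long    best = -1;

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof(line), fp) != NULL)
    {
        unsigned int oid;
        char    type[32], lname[128], fname[32];

        if (sscanf(line, "%u, %31[^,], %127[^,],", &oid, type, lname) == 3 &&
            strcmp(lname, name) == 0)
        {
            snprintf(fname, sizeof(fname), "%u", oid);
            if (access(fname, F_OK) == 0)   /* later matching lines win */
                best = oid;
        }
    }
    fclose(fp);
    return best;
}

int
main(void)
{
    log_name("name_map.log", 163338, "TABLE", "auth_info");
    /* prints -1 unless a file literally named 163338 exists in this directory */
    printf("auth_info -> %ld\n", lookup_name("name_map.log", "auth_info"));
    return 0;
}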