Re: Additional Chapter for Tutorial - Mailing list pgsql-hackers

From Jürgen Purtz
Subject Re: Additional Chapter for Tutorial
Date
Msg-id b4124734-9d3f-849d-b408-b122b624772c@purtz.de
Whole thread Raw
In response to Re: Additional Chapter for Tutorial  ("David G. Johnston" <david.g.johnston@gmail.com>)
Responses Re: Additional Chapter for Tutorial - arch-dev.sgml  (Erik Rijkers <er@xs4all.nl>)
List pgsql-hackers
On 10.11.20 00:14, David G. Johnston wrote:
Reviewed the first three sections.

template0 - I would remove the schema portions of this and simply note this as being a pristine recovery database in the diagram.
ok

I would drop the word "more" and just say "system schemas".  I would drop pg_toast from the list of system schema and focus on the three user-facing ones.
ok

Instead of "my_schema" (optional) I would do "my_schema" (example)
The terms 'optional' and 'default' are used at various places with their literal meaning. We shall not change them.

Server Graphic
#3 Global SQL Objects: Objects which are shared among all databases within a cluster.
#6 Client applications are prohibited from connecting to template0
ok
#1 If by you we mean "the client" saying that you work "in the cluster data" doesn't really help.  I would emphasize the point that the client sees an endpoint the Postmaster publishes as a port or socket file and that plus the database name defines the endpoint the client connects to (meld with #5)
ok, with some changes.

In lieu of some of the existing detail provided about structure I would add information about configuration and search_path at this level.
Search path appended. But IMO configuration questions are out of scope of this sub-chapter.

I like the object type enumeration - I would suggest grouping them by type in a manner consistent with the documentation and making each one a link to its "primary" section - the SQL Command reference if all else fails.
ok. But don't how to group them in a better way.

The "i" in internal in 51.3 (the image) needs capitalization).
ok

You correctly add both Extension and Collation as database-level objects but they are not mentioned anywhere else.  They do belong here and need to be tied in properly in the text.
Have some courage to the gap, it's an introductory chapter.

The whole thing needs a good pass focused on capitalization.  Both for typos and to decide when various primary concepts like Instance should be capitalized and when not.
'Instance' and 'Cluster' are now uppercase because of their importance, everything else lowercase for better reading.

51.4 - When you look at the diagram seeing /pg/data/base looks really cool, but when reading the prose where both the "pg" and the "base" are omitted and all you get are repeated references to "data", the directory name choice becomes an issue IMO.  I suggest (and changed the attached) to name the actual root directory "pgdata".  You should change the /pg/ directory name to something like ".../tutorial_project/".

The graphic shall reflect the default behavior of PG. Without the parameter -D, initdb creates the new cluster in the directory where PGDATA points to. This is in many cases /var/lib/pgsql/data. Therefore 'data' and its subdirectory 'base' are not my invention but reflects the default situation.

(Diving a little deeper into this issue I noticed that there is a parameter 'cluster_name' in the config file. But it does not change the name of the cluster's root directory, it only changes the names of the running processes. Choosing 'instance_name' instead of 'cluster_name' as the parameter's name would be a better choice imo - but that is not what we are speaking about in the context of the new chapter).

I changed the very first directory in the graphic to visualize the standard behavior; I reverted your recommendation to use 'pgdata' instead of 'data' in the text part.

Since you aren't following alphabetical order anyway I would place pg_tblspc after globals since tablespaces are globals and thus proximity links them here - and pointing out that pg_tblspc holds the data makes stating that global doesn't contain tablespace data unnecessary.
ok

Maybe point out somewhere the the "base/databaseOID" directory represents the default tablespace for each database, which isn't "global", only the non-default tablespaces are considered globals (or just get rid of the mentioned on "non-default tablespace" for now).

ok

more:

1) some changes concerning the nature of connections (52.2: logical perspective). IMO accessing multiple databases within one connection is not a question of configuring, you have to take more actions. But I'm not sure we should mention this at all.

2) you propose to cancel or trim down the paragraphs behind figure 51.2. (cluster, database, schema). I believe that a textual description of this hierarchy is essential for the understanding of the system. Because it isn't described explicitly at a different place, it should remain.

--- snipp -------- from other e-mail ----

MVCC Section

The first paragraph and example in the MVCC section is a good example but seems misplaced - its relationship to MVCC generally is tenuous, rather I would expect a discussion of the serializable isolation mode to follow.

I'm not sure how much detail this section wants to get into given the coverage of concurrency elsewhere in the documentation.  "Not much" would be my baseline.
The paragraph focus on the fact that new row versions are generated instead of locking something. Explaining serialization isolation modes is imo very complicate and out of the scope of this subchapter. If we want to give an overview - in addition to the exiting documentation - it should be a separate subchapter.

I would suggest spelling out what "OLTP" stands for and ideally pointing the user to the glossary for the term.
ok, but not added to glossary. The given explanation "... with a massive number of concurrent write actions" should be sufficient.

Tending more toward a style gripe but the amount of leader phrases and redundancy are at a level that I am noticing them when I read this but do not have the same impression having read large portions of documentation. In particular:
Because I'm not a native English speaker, orthographic and style hits are always welcome.

"When we speak about transaction IDs, you need to know that xids are like sequences."

"But keep in mind that xids are independent of any time measurement — in milliseconds or otherwise. If you dive deeper into PostgreSQL, you will recognize parameters with names such as 'xxx_age'. Despite their names, these '_age' parameters do not specify a period of time but represent a certain number of transactions, e.g., 100 million."

Could just be:  xids are sequences and age computations involving them measure a transaction count as opposed to a time interval.
ok

Then I would consider adding a bit more detail/context here.

xids are 32bit sequences, with a reserved value to handle wrap-around.  There are 4 billion values in the sequence but wrap-around handling must occur every 2 billion transactions. Age computations involving xids measure a transaction count as opposed to a time interval.

I would move the mentioning of "vacuum" to the main paragraph about delete and not solely as a "keep in mind" note.
The mentioning here at the food of the page is a crossover to the next subchapter.

The part before the diagram seems like it should be much shorter, concise, and provide links to the excellent documentation.  The part after the image, and the image itself, are good material, though possibly should be in a main administration chapter instead of an internals chapter.

vacuum: The problem - and one reason for the existence of this subchapter - is that vacuum's documentation is scattered across may pages:

19.4: parameters to configure the server, especially five parameters 'vacuum_cost_xxx'.

19.10: parameters to configure autovacuum.

19.11: parameters to configure client connections, especially five parameters 'vacuum_xxx' concerning their freeze-behavior.

24.1: explains the general necessity of (auto)vacuum and their strategies.

The page about the SQL command VACUUM explains the different options (FULL, FREEZE, ..) and their meaning.

Because of the structure of our documentation as well as the complexity of the issue that's ok. The existing documentation describes every parameter very well, but I'm missing a page where the 'big picture' of vacuum is explained (not necessarily here). It shall show the relationship between the huge number of parameters and an explanation *why* they exists. As far as we don't have such a page within the vacuum documentation the proposed subchapter fills the gap. (The provided graphics can be included multiple times without generating redundancies - here and at arbitrary other places.)


The first bullet of "keep in mind" is both wordy and wrong - in particular "as xids grow old row versions get out of scope over time" doesn't make sense (or rather it only does in the context of wrap-around, not normal visibility).  Having the only mention of bloat be here is also not ideal, it too should be weaved into the main narrative.  The "keep in mind" section here should be a recap of already covered material in a succinct form, nothing should be new to someone who just read the entire section.
ok.

I don't think that usage of exclamation marks (!) is warranted here, though emphasis on the key phrase wouldn't hurt.
ok

Vacuum Section

avoid -> prevent (continued growth)
ok

Autovacuum is enabled by default.  The whole note needs commas.
ok

I'd try to get rid of "at arbitrary point in time"
ok

"Instance." we've already described where instances are previously ("on the server")
ok

The other sections - these seem misplaced for the tutorial, update the main documentation if this information is wholly missing or lacking.  The MVCC chapter can incorporate overview information as it is a strict consequence of that implementation.

Statistics belong elsewhere - the tutorial should not use poor command implementation choices as a guide for user education.

In short, this whole section should not exist and its content moved to more appropriate areas (mainly MVCC).  Vacuum is a tool that one must use but the narrative should be about the system generally.


concerning vacuum section: see my comments above

concerning 'the other sections' (transactions, reliability, backup (plus: someone should add 'replication', I'm not familiar with this issue)): The intention of the chapter is to give a *summary* about PG's essential architecture and about central implementation aspects. This implies that the chapters does not present any new information. They shall only show (or repeat) essential things in their context and explain *why* they are used. In this sense the three chapters may be reasonable. Concerning this, I like to hear some comments from other people.


Attachments:

0013-architecture.patch: complete patch vs. master

0013-architecture.sgml.diff: changes in file architecture.sgml since 0012

0013-images.diff: changes in files *-raw.svg since 0012

--

J. Purtz


Attachment

pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: list of extended statistics on psql
Next
From: Victor Yegorov
Date:
Subject: Re: Deleting older versions in unique indexes to avoid page splits