Re: On How To Shorten the Steep Learning Curve Towards PG Hacking... - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: On How To Shorten the Steep Learning Curve Towards PG Hacking...
Msg-id CAMsr+YEcxs9AvVhq5Wv5v62VNyLWNQZ6niuvW-3Cs2j07VWEKg@mail.gmail.com
In response to On How To Shorten the Steep Learning Curve Towards PG Hacking...  (Kang Yuzhe <tiggreen87@gmail.com>)
Responses Re: On How To Shorten the Steep Learning Curve Towards PG Hacking...  (Kang Yuzhe <tiggreen87@gmail.com>)
Re: On How To Shorten the Steep Learning Curve Towards PGHacking...  (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>)
List pgsql-hackers
On 29 March 2017 at 10:53, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Hi,
>
> On 2017/03/28 15:40, Kang Yuzhe wrote:
>> Thanks Tsunakawa for such an informative reply.
>>
>> Almost all of the docs related to the internals of PG are of introductory
>> concepts only.
>> There is even more useful PG internals site entitled "The Internals of
>> PostgreSQL" in http://www.interdb.jp/pg/ translation of the Japanese PG
>> Internals.
>>
>> The query processing framework that is described in the manual as you
>> mentioned is of informative and introductory nature.
>> In theory, the query processing framework described in the manual is
>> understandable.
>>
>> Unfortunately, it is another story to understand how the query processing
>> framework in the PG codebase really works.
>> It has become a difficult task for me to walk through the PG source code,
>> for example how SELECT/INSERT/TRUNCATE in the different modules under
>> "src/.." really work.
>>
>> I wish there were Hands-On with PostgreSQL Internals like
>> https://bkmjournal.wordpress.com/2017/01/22/hands-on-with-postgresql-internals/
>> for more complex PG features.
>>
>> For example, MERGE SQL standard is not supported yet by PG.  I wish there
>> were Hands-On with PostgreSQL Internals for MERGE/UPSERT. How it is
>> implemented in parser/executor/storage etc. modules with detailed
>> explanation for each code and debugging and other important concepts
>> related to system programming.
>
> I am not sure if I can show you that one place where you could learn all
> of that, but many people who started with PostgreSQL development at some
> point started by exploring the source code itself (either for learning or
> to write a feature patch), articles on PostgreSQL wiki, and many related
> presentations accessible using the Internet. I liked the following among
> many others:

Personally I have to agree that the learning curve is very steep. Some
of the docs and presentations help, but there's a LOT to understand.

When you're getting started you're lost in a world of language you
don't know, and trying to understand one piece often gets you lost in
other pieces. In no particular order:

* Memory contexts and palloc
* Managing transactions and how that interacts with memory contexts
and the default memory context
* Snapshots, snapshot push/pop, etc
* LWLocks, memory barriers, spinlocks, latches
* Heavyweight locks (and the different APIs to them)
* GUCs, their scopes, the rules around their callbacks, etc
* dynahash
* catalogs and oids and access methods
* The heap AM, e.g. heap_open
* relcache, catcache, syscache
* genam and the systable_ calls and their limitations with indexes
* The SPI
* When to use each of the above 4!
* Heap tuples and minimal tuples
* VARLENA
* GETSTRUCT, when you can/can't use it, other attribute fetching methods
* TOAST and detoasting datums.
* forming and deforming tuples
* LSNs, WAL/xlog generation and redo. Timelines. (ARGH, timelines).
* cache invalidations, when they can happen, and how to do anything
safely around them.
* TIDs, cmin and cmax, xmin and xmax
* postmaster, vacuum, bgwriter, checkpointer, startup process,
walsender, walreceiver, all our auxiliary procs and what they do
* relmapper, relfilenodes vs relation oids, filenode extents
* ondisk structure, page headers, pages
* shmem management, buffers and buffer pins
* bgworkers
* PG_TRY() and PG_CATCH() and their limitations
* elog and ereport and errcontexts, exception unwinding/longjmp and
how it interacts with memory contexts, lwlocks, etc
* The nest of macros around datum manipulation and functions, PL
handlers. How to find the macros for the data types you want to work
with.
* Everything to do with the C API for arrays (is horrible)
* The details of the parse/rewrite/plan phases with rewrite calling
back into parse, paths, the mess with inheritance_planner, reading and
understanding plantrees
* The permissions and grants model and how to interact with it
* PGPROC, PGXACT, other main shmem structures
* Resource owners (which I still don't fully "get")
* Checkpoints, pg_control and ShmemVariableCache, crash recovery
* How globals are used in Pg and how they interact with fork()ing from
postmaster
* SSI (haven't gone there yet myself)
* ....

Personally I recall finding the magic of resource owner and memory
context changing under me when I started/stopped xacts in a bgworker,
along with the need to manage snapshots and SPI state to be distinctly
confusing.
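
A toy model of that surprise, in plain C rather than against the real
backend API. Every name below (toy_alloc, toy_start_xact, ToyContext and so
on) is made up for illustration. It only mimics the general idea that
allocations go to whatever context is current, and that starting or ending
a transaction switches that context and frees the per-transaction memory.
It does not model resource owners, snapshots or SPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins only: this is NOT PostgreSQL's MemoryContext code. */
typedef union ToyChunk
{
    union ToyChunk *next;   /* link to the next chunk in the same context */
    max_align_t     align;  /* keeps the payload after the header aligned */
} ToyChunk;

typedef struct ToyContext
{
    ToyChunk *chunks;       /* everything allocated in this context */
    size_t    nchunks;
} ToyContext;

static ToyContext  top_context = {NULL, 0};
static ToyContext  xact_context = {NULL, 0};
static ToyContext *current_context = &top_context;

/* Allocate from whatever context is current (palloc-like in spirit). */
static void *toy_alloc(size_t size)
{
    ToyChunk *chunk = malloc(sizeof(ToyChunk) + size);

    assert(chunk != NULL);
    chunk->next = current_context->chunks;
    current_context->chunks = chunk;
    current_context->nchunks++;
    return chunk + 1;       /* payload starts right after the header */
}

/* Free every chunk that was allocated in ctx. */
static void toy_reset(ToyContext *ctx)
{
    while (ctx->chunks != NULL)
    {
        ToyChunk *next = ctx->chunks->next;

        free(ctx->chunks);
        ctx->chunks = next;
    }
    ctx->nchunks = 0;
}

/* Starting a transaction silently changes the current context ... */
static void toy_start_xact(void)
{
    current_context = &xact_context;
}

/* ... and ending it frees per-transaction memory and switches back. */
static void toy_commit_xact(void)
{
    toy_reset(&xact_context);
    current_context = &top_context;
}

/* Walk through one transaction and check where allocations ended up. */
bool demo_context_switches(void)
{
    bool  ok = (current_context == &top_context);
    char *s;

    toy_start_xact();
    ok = ok && (current_context == &xact_context);

    s = toy_alloc(32);
    strcpy(s, "per-transaction data");
    ok = ok && (xact_context.nchunks == 1) && (top_context.nchunks == 0);

    toy_commit_xact();
    /* s is dangling from here on: its memory went away with the xact */
    ok = ok && (current_context == &top_context) && (xact_context.nchunks == 0);
    return ok;
}
```

The caller of toy_alloc never names a context, so nothing at the call site
hints that the pointer dies at commit. Anything that has to outlive the
transaction needs to be allocated while a longer-lived context is current.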

There are various READMEs, blog posts, presentation slides/videos, etc
that explain bits and pieces. But not much exists to tie it together
into a comprehensible whole with simple, minimal explanations for each
part so someone who's new to it all can begin to get a handle on it,
find resources to learn more about subsystems they need to care about,
etc.

Lots of it boils down to "read the code". But so much code! You don't
know if what you're reading is really relevant or if it's even
correct, or if it makes assumptions that differ from your situation.
There are lots of coding rules that aren't necessarily obvious unless
you've read the right place, e.g. that you don't need to and shouldn't
LWLockRelease() before elog(ERROR), or that SPI doesn't manage
snapshots or xacts for you (but will often silently work anyway!).
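
To make the LWLockRelease() rule less mysterious, here is a toy model in
plain C, not PostgreSQL code. It assumes the reason for the rule is that
error recovery does a blanket release of every lock still held, so the code
that raises the error has nothing to clean up. All names (ToyLock,
toy_error, toy_lock_release_all, ...) are hypothetical stand-ins, and
setjmp/longjmp stands in for the backend's error unwinding.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_HELD_LOCKS 8

/* Toy stand-ins: these are NOT PostgreSQL's real types or functions. */
typedef struct ToyLock
{
    bool held;
} ToyLock;

/* File scope, so its value is well defined after a longjmp. */
static ToyLock  toy_lock = {false};
static ToyLock *held_locks[MAX_HELD_LOCKS]; /* locks currently held */
static int      num_held_locks = 0;
static jmp_buf  error_handler;              /* where toy_error() jumps to */

static void toy_lock_acquire(ToyLock *lock)
{
    assert(num_held_locks < MAX_HELD_LOCKS);
    lock->held = true;
    held_locks[num_held_locks++] = lock;    /* remember it for cleanup */
}

/* Normal-path release of one specific lock. */
static void toy_lock_release(ToyLock *lock)
{
    for (int i = 0; i < num_held_locks; i++)
    {
        if (held_locks[i] == lock)
        {
            held_locks[i] = held_locks[--num_held_locks];
            lock->held = false;
            return;
        }
    }
}

/* Error-path cleanup: release everything that is still registered. */
static void toy_lock_release_all(void)
{
    while (num_held_locks > 0)
        held_locks[--num_held_locks]->held = false;
}

/* Like elog(ERROR) in spirit: reports and never returns to the caller. */
static void toy_error(const char *msg)
{
    fprintf(stderr, "ERROR: %s\n", msg);
    longjmp(error_handler, 1);
}

static void do_work(ToyLock *lock, bool fail)
{
    toy_lock_acquire(lock);
    if (fail)
        toy_error("failed while holding a lock");  /* no release first */
    toy_lock_release(lock);                         /* normal path only */
}

/* Returns true if no lock is left held, whether or not the work failed. */
bool run_with_recovery(bool fail)
{
    if (setjmp(error_handler) != 0)
        toy_lock_release_all();     /* error path: blanket cleanup */
    else
        do_work(&toy_lock, fail);   /* may longjmp back to the setjmp above */

    return !toy_lock.held && num_held_locks == 0;
}
```

Both run_with_recovery(false) and run_with_recovery(true) leave no lock
held. In the failing case do_work never released anything itself; the
handler did it. toy_lock lives at file scope on purpose: a non-volatile
local modified between setjmp and longjmp has an indeterminate value
afterwards.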

I've long intended to start a blog series on PostgreSQL innards
concepts, partly with the intent of turning it into such an overview.
I find that people are better at shouting you down when you're wrong
than they are at writing new material or reviewing proposed docs, so
it's often a good way to fact-check things ;) .  Plus it's a good way
to learn. Time is always short though.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


