Thread: On How To Shorten the Steep Learning Curve Towards PG Hacking...

On How To Shorten the Steep Learning Curve Towards PG Hacking...

From
Kang Yuzhe
Date:

Dear PG Hacker/Experts,


I am a newbie to PG hacking.
I have been reading the PG code base to find my space in it but without success.

There are hundreds of Hands-on with PG Application development on the web.
Alas, there is almost none in PG hacking.

I have found reading and hacking the PG source code to be one of the most frustrating experiences in my life.  I believe that PG hacking should not be a painful journey but an enjoyable one!

It is my strong belief that out of my PG hacking frustrations there may come insights for the PG experts on how to devise hands-on material for PG internals, so that newcomers can become great coders as quickly as possible.

I also believe that we should spend our time reading great papers and books related to data management problems, BUT not the PG code base.

Here are my suggestions for the experts on ways to shorten the steep learning curve towards PG hacking.

1. Prepare Hands-on with PG internals

 For example, a complete hands-on with SELECT/INSERT SQL Standard PG internals. The point is that the experts can pick one fairly complex feature and walk through it from parser to executor in a hands-on manner, explaining every technical detail step by step.

2. Write a book on PG Internals.

There is one book on PG internals. Unfortunately, it's in Chinese.
Why not in English??
It is my strong belief that if there were a great book on PG internals, with hands-on coverage of some of the basic features of the PG internals machinery, PG hacking would be almost as easy as PG application development.

If the experts help newbies understand the PG code base as quickly as possible, that means more reviewers, more contributors and more users of PG, which in turn means more PG usability, more PG popularity and a stronger PG community.

These are my personal feelings, and I am ready to be corrected and advised on the right way towards PG hacking.

Regards,
Zeray





Re: On How To Shorten the Steep Learning Curve Towards PG Hacking...

From
Michael Paquier
Date:
On Mon, Mar 27, 2017 at 9:00 PM, Kang Yuzhe <tiggreen87@gmail.com> wrote:
> 1. Prepare Hands-on with PG internals
>
>  For example, a complete Hands-on  with SELECT/INSERT SQL Standard PG
> internals. The point is the experts can pick one fairly complex feature and
> walk it from Parser to Executor in a hands-on manner explaining step by step
> every technical detail.

There are resources on the net, in English as well. Take for example
this manual explaining the internals of Postgres by Hironobu Suzuki:
http://www.interdb.jp/pg/
-- 
Michael



Re: On How To Shorten the Steep Learning Curve Towards PG Hacking...

From
"Tsunakawa, Takayuki"
Date:
From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kang Yuzhe

> 1. Prepare Hands-on with PG internals
> 
> 
>  For example, a complete Hands-on  with SELECT/INSERT SQL Standard PG
> internals. The point is the experts can pick one fairly complex feature
> and walk it from Parser to Executor in a hands-on manner explaining step
> by step every technical detail.

I sympathize with you.  What level of detail do you have in mind?  The query processing framework is described in the manual:

Chapter 50. Overview of PostgreSQL Internals
https://www.postgresql.org/docs/devel/static/overview.html

More detailed source code analysis is provided for very old PostgreSQL 7.4, but I guess it's not much different now.  The document is in Japanese only:

http://ikubo.x0.com/PostgreSQL/pg_source.htm

Are you thinking of something like this?

MySQL Internals Manual
https://dev.mysql.com/doc/internals/en/





Regards
Takayuki Tsunakawa


Re: On How To Shorten the Steep Learning Curve Towards PG Hacking...

From
Kang Yuzhe
Date:
Thanks Tsunakawa for such an informative reply.

Almost all of the docs related to the internals of PG are of introductory concepts only.
There is an even more useful PG internals site entitled "The Internals of PostgreSQL" at http://www.interdb.jp/pg/, a translation of the Japanese PG Internals.

The query processing framework that is described in the manual as you mentioned is of informative and introductory nature.
In theory, the query processing framework described in the manual is understandable.

Unfortunately, it is another story to understand how the query processing framework in the PG codebase really works.
It has become a difficult task for me to walk through the PG source code, for example to see how SELECT/INSERT/TRUNCATE really work in the different modules under "src/..".

I wish there were Hands-On with PostgreSQL Internals like https://bkmjournal.wordpress.com/2017/01/22/hands-on-with-postgresql-internals/ for more complex PG features.

For example, the SQL-standard MERGE is not yet supported by PG.  I wish there were a Hands-On with PostgreSQL Internals for MERGE/UPSERT: how it is implemented in the parser, executor, storage and other modules, with a detailed explanation of each piece of code, of debugging, and of other important concepts related to systems programming.

Regards,
Zeray



On Tue, Mar 28, 2017 at 6:04 AM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
> [...]


Re: On How To Shorten the Steep Learning Curve Towards PG Hacking...

From
Adrien Nayrat
Date:
On 03/27/2017 02:00 PM, Kang Yuzhe wrote:
> 1. Prepare Hands-on with PG internals
>
>  For example, a complete Hands-on  with SELECT/INSERT SQL Standard PG internals.
> The point is the experts can pick one fairly complex feature and walk it from
> Parser to Executor in a hands-on manner explaining step by step every technical
> detail.
Hi,

Bruce Momjian has given several presentations about Postgres internals:
http://momjian.us/main/presentations/internals.html


Regards
--
Adrien NAYRAT



Re: On How To Shorten the Steep Learning Curve Towards PG Hacking...

From
Amit Langote
Date:
Hi,

On 2017/03/28 15:40, Kang Yuzhe wrote:
> Thanks Tsunakawa for such an informative reply.
> 
> Almost all of the docs related to the internals of PG are of introductory
> concepts only.
> There is even more useful PG internals site entitled "The Internals of
> PostgreSQL" in http://www.interdb.jp/pg/ translation of the Japanese PG
> Internals.
> 
> The query processing framework that is described in the manual as you
> mentioned is of informative and introductory nature.
> In theory, the query processing framework described in the manual is
> understandable.
> 
> Unfortunately, it is another story to understand how query processing
> framework in PG codebase really works.
> It has become a difficult task for me to walk through the PG source code
> for example how SELECT/INSERT/TRUNCATE in the different modules under
> "src/..". really works.
> 
> I wish there were Hands-On with PostgreSQL Internals like
> https://bkmjournal.wordpress.com/2017/01/22/hands-on-with-postgresql-internals/
> for more complex PG features.
> 
> For example, MERGE SQL standard is not supported yet by PG.  I wish there
> were Hands-On with PostgreSQL Internals for MERGE/UPSERT. How it is
> implemented in parser/executor/storage etc. modules with detailed
> explanation for each code and debugging and other important concepts
> related to system programming.

I am not sure if I can show you that one place where you could learn all
of that, but many people who started with PostgreSQL development at some
point started by exploring the source code itself (either for learning or
to write a feature patch), articles on PostgreSQL wiki, and many related
presentations accessible using the Internet. I liked the following among
many others:

Introduction to Hacking PostgreSQL:
http://www.neilconway.org/talks/hacking/

Inside the PostgreSQL Query Optimizer:
http://www.neilconway.org/talks/optimizer/optimizer.pdf

Postgres Internals Presentations:
http://momjian.us/main/presentations/internals.html

Thanks,
Amit





Re: On How To Shorten the Steep Learning Curve Towards PG Hacking...

From
Craig Ringer
Date:
On 29 March 2017 at 10:53, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Hi,
>
> On 2017/03/28 15:40, Kang Yuzhe wrote:
>> Thanks Tsunakawa for such an informative reply.
>>
>> Almost all of the docs related to the internals of PG are of introductory
>> concepts only.
>> There is even more useful PG internals site entitled "The Internals of
>> PostgreSQL" in http://www.interdb.jp/pg/ translation of the Japanese PG
>> Internals.
>>
>> The query processing framework that is described in the manual as you
>> mentioned is of informative and introductory nature.
>> In theory, the query processing framework described in the manual is
>> understandable.
>>
>> Unfortunately, it is another story to understand how query processing
>> framework in PG codebase really works.
>> It has become a difficult task for me to walk through the PG source code
>> for example how SELECT/INSERT/TRUNCATE in the different modules under
>> "src/..". really works.
>>
>> I wish there were Hands-On with PostgreSQL Internals like
>> https://bkmjournal.wordpress.com/2017/01/22/hands-on-with-postgresql-internals/
>> for more complex PG features.
>>
>> For example, MERGE SQL standard is not supported yet by PG.  I wish there
>> were Hands-On with PostgreSQL Internals for MERGE/UPSERT. How it is
>> implemented in parser/executor/storage etc. modules with detailed
>> explanation for each code and debugging and other important concepts
>> related to system programming.
>
> I am not sure if I can show you that one place where you could learn all
> of that, but many people who started with PostgreSQL development at some
> point started by exploring the source code itself (either for learning or
> to write a feature patch), articles on PostgreSQL wiki, and many related
> presentations accessible using the Internet. I liked the following among
> many others:

Personally I have to agree that the learning curve is very steep. Some
of the docs and presentations help, but there's a LOT to understand.

When you're getting started you're lost in a world of language you
don't know, and trying to understand one piece often gets you lost in
other pieces. In no particular order:

* Memory contexts and palloc
* Managing transactions and how that interacts with memory contexts
and the default memory context
* Snapshots, snapshot push/pop, etc
* LWLocks, memory barriers, spinlocks, latches
* Heavyweight locks (and the different APIs to them)
* GUCs, their scopes, the rules around their callbacks, etc
* dynahash
* catalogs and oids and access methods
* The heap AM like heap_open
* relcache, catcache, syscache
* genam and the systable_ calls and their limitations with indexes
* The SPI
* When to use each of the above 4!
* Heap tuples and minimal tuples
* VARLENA
* GETSTRUCT, when you can/can't use it, other attribute fetching methods
* TOAST and detoasting datums.
* forming and deforming tuples
* LSNs, WAL/xlog generation and redo. Timelines. (ARGH, timelines).
* cache invalidations, when they can happen, and how to do anything
safely around them.
* TIDs, cmin and cmax, xmin and xmax
* postmaster, vacuum, bgwriter, checkpointer, startup process,
walsender, walreceiver, all our auxiliary procs and what they do
* relmapper, relfilenodes vs relation oids, filenode extents
* ondisk structure, page headers, pages
* shmem management, buffers and buffer pins
* bgworkers
* PG_TRY() and PG_CATCH() and their limitations
* elog and ereport and errcontexts, exception unwinding/longjmp and
how it interacts with memory contexts, lwlocks, etc
* The nest of macros around datum manipulation and functions, PL
handlers. How to find the macros for the data types you want to work
with.
* Everything to do with the C API for arrays (is horrible)
* The details of the parse/rewrite/plan phases with rewrite calling
back into parse, paths, the mess with inheritance_planner, reading and
understanding plantrees
* The permissions and grants model and how to interact with it
* PGPROC, PGXACT, other main shmem structures
* Resource owners (which I still don't fully "get")
* Checkpoints, pg_control and ShmemVariableCache, crash recovery
* How globals are used in Pg and how they interact with fork()ing from
postmaster
* SSI (haven't gone there yet myself)
* ....

Personally I recall finding the magic of resource owner and memory
context changing under me when I started/stopped xacts in a bgworker,
along with the need to manage snapshots and SPI state to be distinctly
confusing.
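
The surprise described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not PostgreSQL code: every name in it (ToyContext, toy_alloc, toy_start_xact, toy_commit_xact) is hypothetical, and it assumes, as a simplification, that starting a transaction switches the current allocation context and that ending it frees everything allocated in the transaction's context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Header stored in front of every allocation; the union keeps the payload aligned. */
typedef union ToyChunk {
    union ToyChunk *next;
    max_align_t align;
} ToyChunk;

/* A toy "memory context": just a list of the allocations made in it. */
typedef struct ToyContext {
    ToyChunk *chunks;   /* linked list of live allocations */
    size_t live;        /* number of live allocations */
} ToyContext;

static ToyContext top_context = {NULL, 0};    /* lives for the whole process */
static ToyContext xact_context = {NULL, 0};   /* lives for one toy transaction */
static ToyContext *current_context = &top_context;

/* Allocate in whatever context happens to be current (hypothetical helper). */
static void *toy_alloc(size_t size)
{
    ToyChunk *chunk = malloc(sizeof(ToyChunk) + size);

    assert(chunk != NULL);
    chunk->next = current_context->chunks;
    current_context->chunks = chunk;
    current_context->live++;
    return chunk + 1;   /* payload starts right after the header */
}

/* Free every allocation made in ctx in one sweep. */
static void toy_context_reset(ToyContext *ctx)
{
    while (ctx->chunks != NULL) {
        ToyChunk *next = ctx->chunks->next;

        free(ctx->chunks);
        ctx->chunks = next;
    }
    ctx->live = 0;
}

/* Starting a toy transaction silently changes where toy_alloc() allocates. */
static void toy_start_xact(void)
{
    current_context = &xact_context;
}

/* Ending it frees everything allocated since toy_start_xact(). */
static void toy_commit_xact(void)
{
    toy_context_reset(&xact_context);
    current_context = &top_context;
}
```

In this model, a pointer obtained from toy_alloc() between toy_start_xact() and toy_commit_xact() is dangling after the commit, while allocations made before toy_start_xact() survive. The caller never names a context explicitly, which is what makes the change feel like it happens "under you".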

There are various READMEs, blog posts, presentation slides/videos, etc
that explain bits and pieces. But not much exists to tie it together
into a comprehensible whole with simple, minimal explanations for each
part so someone who's new to it all can begin to get a handle on it,
find resources to learn more about subsystems they need to care about,
etc.

Lots of it boils down to "read the code". But so much code! You don't
know if what you're reading is really relevant or if it's even
correct, or if it makes assumptions that differ from your situation.
There are lots of coding rules that aren't necessarily obvious unless
you read the right place, e.g. that you don't need to and shouldn't
LWLockRelease() before elog(ERROR). That SPI doesn't manage snapshots
or xacts for you (but will often silently work anyway!). etc.
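
A toy model may help show why a rule like "don't release the lock before raising the error" can make sense. The sketch below is not PostgreSQL code and all names in it (toy_lock_acquire, toy_error, run_guarded and so on) are hypothetical. It assumes that an error is raised with longjmp() to a central handler, and that this handler releases every lock still held, so the code raising the error has nothing to clean up itself.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdio.h>

#define MAX_HELD 8

static jmp_buf error_jmp;            /* where toy_error() jumps back to */
static int held_locks[MAX_HELD];     /* ids of the locks currently held */
static int num_held = 0;
static int last_cleanup_count = 0;   /* locks released by the last error cleanup */

static void toy_lock_acquire(int id)
{
    assert(num_held < MAX_HELD);
    held_locks[num_held++] = id;
}

static void toy_lock_release(int id)
{
    for (int i = 0; i < num_held; i++) {
        if (held_locks[i] == id) {
            held_locks[i] = held_locks[--num_held];
            return;
        }
    }
}

/* Central cleanup: drop every lock that is still held. */
static void toy_lock_release_all(void)
{
    last_cleanup_count = num_held;
    num_held = 0;
}

/* Report an error and unwind to the handler; this call never returns. */
static void toy_error(const char *msg)
{
    fprintf(stderr, "ERROR: %s\n", msg);
    longjmp(error_jmp, 1);
}

/* Work done under a lock. Note: no release on the error path. */
static void do_work(int fail)
{
    toy_lock_acquire(42);
    if (fail)
        toy_error("something went wrong");
    toy_lock_release(42);
}

/* Returns 0 on success, 1 if an error was raised and cleaned up. */
static int run_guarded(int fail)
{
    if (setjmp(error_jmp) == 0) {
        do_work(fail);
        return 0;
    }
    /* Error path: the handler, not do_work(), releases the locks. */
    toy_lock_release_all();
    return 1;
}
```

Calling run_guarded(1) leaves no lock held even though do_work() never released lock 42 on its error path. Releasing it by hand before toy_error() would only duplicate what the handler does anyway.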

I've long intended to start a blog series on postgresql innards
concepts, partly with the intent of turning it into such an overview.
I find that people are better at shouting you down when you're wrong
than they are at writing new material or reviewing proposed docs, so
it's often a good way to fact-check things ;) .  Plus it's a good way
to learn. Time is always short though.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: On How To Shorten the Steep Learning Curve Towards PG Hacking...

From
Kang Yuzhe
Date:
Thank you all for pointing me to useful docs on PG kernel stuff, as well as for being sympathetic to me and to a newbie question that appears to be valid and interesting but has yet to be addressed by the PG experts.

Last but not least, Craig Ringer, you just nailed it!! You also made me feel that my question was worth asking.

Regards,
Zeray

On Wed, Mar 29, 2017 at 6:36 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> [...]


Re: On How To Shorten the Steep Learning Curve Towards PG Hacking...

From
Amit Langote
Date:
On 2017/03/29 12:36, Craig Ringer wrote:
> On 29 March 2017 at 10:53, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Hi,
>>
>> On 2017/03/28 15:40, Kang Yuzhe wrote:
>>> Thanks Tsunakawa for such an informative reply.
>>>
>>> Almost all of the docs related to the internals of PG are of introductory
>>> concepts only.
>>> There is even more useful PG internals site entitled "The Internals of
>>> PostgreSQL" in http://www.interdb.jp/pg/ translation of the Japanese PG
>>> Internals.
>>>
>>> The query processing framework that is described in the manual as you
>>> mentioned is of informative and introductory nature.
>>> In theory, the query processing framework described in the manual is
>>> understandable.
>>>
>>> Unfortunately, it is another story to understand how query processing
>>> framework in PG codebase really works.
>>> It has become a difficult task for me to walk through the PG source code
>>> for example how SELECT/INSERT/TRUNCATE in the different modules under
>>> "src/..". really works.
>>>
>>> I wish there were Hands-On with PostgreSQL Internals like
>>> https://bkmjournal.wordpress.com/2017/01/22/hands-on-with-postgresql-internals/
>>> for more complex PG features.
>>>
>>> For example, MERGE SQL standard is not supported yet by PG.  I wish there
>>> were Hands-On with PostgreSQL Internals for MERGE/UPSERT. How it is
>>> implemented in parser/executor/storage etc. modules with detailed
>>> explanation for each code and debugging and other important concepts
>>> related to system programming.
>>
>> I am not sure if I can show you that one place where you could learn all
>> of that, but many people who started with PostgreSQL development at some
>> point started by exploring the source code itself (either for learning or
>> to write a feature patch), articles on PostgreSQL wiki, and many related
>> presentations accessible using the Internet. I liked the following among
>> many others:
> 
> Personally I have to agree that the learning curve is very steep. Some
> of the docs and presentations help, but there's a LOT to understand.

I agree too. :)

> When you're getting started you're lost in a world of language you
> don't know, and trying to understand one piece often gets you lost in
> other pieces. In no particular order:
> 
> * Memory contexts and palloc
> * Managing transactions and how that interacts with memory contexts
> and the default memory context
> * Snapshots, snapshot push/pop, etc
> * LWLocks, memory barriers, spinlocks, latches
> * Heavyweight locks (and the different APIs to them)
> * GUCs, their scopes, the rules around their callbacks, etc
> * dynahash
> * catalogs and oids and access methods
> * The heap AM like heap_open
> * relcache, catcache, syscache
> * genam and the systable_ calls and their limitations with indexes
> * The SPI
> * When to use each of the above 4!
> * Heap tuples and minimal tuples
> * VARLENA
> * GETSTRUCT, when you can/can't use it, other attribute fetching methods
> * TOAST and detoasting datums.
> * forming and deforming tuples
> * LSNs, WAL/xlog generation and redo. Timelines. (ARGH, timelines).
> * cache invalidations, when they can happen, and how to do anything
> safely around them.
> * TIDs, cmin and cmax, xmin and xmax
> * postmaster, vacuum, bgwriter, checkpointer, startup process,
> walsender, walreceiver, all our auxiliary procs and what they do
> * relmapper, relfilenodes vs relation oids, filenode extents
> * ondisk structure, page headers, pages
> * shmem management, buffers and buffer pins
> * bgworkers
> * PG_TRY() and PG_CATCH() and their limitations
> * elog and ereport and errcontexts, exception unwinding/longjmp and
> how it interacts with memory contexts, lwlocks, etc
> * The nest of macros around datum manipulation and functions, PL
> handlers. How to find the macros for the data types you want to work
> with.
> * Everything to do with the C API for arrays (is horrible)
> * The details of the parse/rewrite/plan phases with rewrite calling
> back into parse, paths, the mess with inheritance_planner, reading and
> understanding plantrees
> * The permissions and grants model and how to interact with it
> * PGPROC, PGXACT, other main shmem structures
> * Resource owners (which I still don't fully "get")
> * Checkpoints, pg_control and ShmemVariableCache, crash recovery
> * How globals are used in Pg and how they interact with fork()ing from
> postmaster
> * SSI (haven't gone there yet myself)
> * ....

That is indeed a big list of things to know and (have to) worry about.  If
we indeed come up with a PG-hackers-handbook someday, things in your list
could be organized such that it's clear to someone wanting to contribute
code which of those things they need to *absolutely* worry about and which
they don't.

> Personally I recall finding the magic of resource owner and memory
> context changing under me when I started/stopped xacts in a bgworker,
> along with the need to manage snapshots and SPI state to be distinctly
> confusing.
> 
> There are various READMEs, blog posts, presentation slides/videos, etc
> that explain bits and pieces. But not much exists to tie it together
> into a comprehensible whole with simple, minimal explanations for each
> part so someone who's new to it all can begin to get a handle on it,
> find resources to learn more about subsystems they need to care about,
> etc.
> 
> Lots of it boils down to "read the code". But so much code! You don't
> know if what you're reading is really relevant or if it's even
> correct, or if it makes assumptions that differ from your situation.
> There are lots of coding rules that aren't necessarily obvious unless
> you read the right place, e.g. that you don't need to and shouldn't
> LWLockRelease() before elog(ERROR). That SPI doesn't manage snapshots
> or xacts for you (but will often silently work anyway!). etc.
> 
> I've long intended to start a blog series on postgresql innards
> concepts, partly with the intent of turning it into such an overview.
> I find that people are better at shouting you down when you're wrong
> than they are at writing new material or reviewing proposed docs, so
> it's often a good way to fact-check things ;) .  Plus it's a good way
> to learn. Time is always short though.

Agreed on all counts.  Look forward to the blog. :)

Thanks,
Amit





Re: On How To Shorten the Steep Learning Curve Towards PG Hacking...

From
Kang Yuzhe
Date:
Thanks, Amit, for the further confirmation of Craig's intention.

I am looking forward to seeing your "PG internal machinery under microscope" blog. May health, persistence and courage be with YOU.

Regards,
Zeray

On Wed, Mar 29, 2017 at 10:36 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> [...]