WIP: a way forward on bootstrap data - Mailing list pgsql-hackers
From | John Naylor |
---|---|
Subject | WIP: a way forward on bootstrap data |
Date | |
Msg-id | CAJVSVGWO48JbbwXkJz_yBFyGYW-M9YWxnPdxJBUosDC9ou_F0Q@mail.gmail.com Whole thread Raw |
Responses |
Re: WIP: a way forward on bootstrap data
(John Naylor <jcnaylor@gmail.com>)
|
List | pgsql-hackers |
Hi, I was looking through the archives one day for a few topics that interest me, and saw there was continued interest in making bootstrap data less painful [1] [2]. There were quite a few good ideas thrown around in those threads, but not much in the way of concrete results. I took a few of them as a starting point and threw together the attached WIP patchset. == An overview (warning: long): 1 through 3 are small tweaks worth doing even without a data format change. 4 through 7 are separated for readability and flexibility, but should be understood as one big patch. I tried to credit as many people's ideas as possible. Some things are left undone until basic agreement is reached. -- Patch 1 - Minor corrections -- Patch 2 - Minor cleanup Be more consistent style-wise, change a confusing variable name, fix perltidy junk. -- Patch 3 - Remove hard-coded schema information about pg_attribute from genbki.pl. This means slightly less code maintenance, but more importantly it's a proving ground for mechanisms used in later patches. 1. Label a column's default value in the catalog header [3]. Note: I implemented it in the simplest way possible for now, which happens to prevents a column from having both FORCE_(NOT_)NULL and a default at the same time, but I think in practice that would almost never matter. If more column options are added in the future, this will have to be rewritten. 2. Add a new function to Catalog.pm to fill in a tuple with default values. It will complain loudly if it can't find either a default or a given value, so change the signature of emit_pgattr_row() so we can pass a partially built tuple to it. 3. Format the schema macro entries according to their types. 4. Commit 8137f2c32322c624e0431fac1621e8e9315202f9 labeled which columns are variable length. Expose that label so we can exclude those columns from schema macros in a general fashion. -- Patch 4 - Infrastructure for the data conversion 1. Copy a modified version of Catalogs() from Catalog.pm to a new script that turns DATA()/(SH)DESCR() statements into serialized Perl data structures in pg_*.dat files, preserving comments along the way. Toast and index statements are unaffected. Although it's a one-off as far as the core project is concerned, I imagine third-parties might want access to this tool, so it's in the patch and not separate. 2. Remove data parsing from the original Catalogs() function and rename it to reflect its new, limited role of extracting the schema info from a header. The data files are handled by a new function. 3. Introduce a script to rewrite pg_*.dat files in a standard format [4], strip default values, preserve comments and normalize blank lines. It can also change default values on the fly. It is intended to be run when adding new data. 4. Add default values to a few catalog headers for demonstration purposes. There might be some questionable choices here. Review is appreciated. Note: At this point, the build is broken. TODO: See what pgindent does to the the modified headers. -- Patch 5 - Mechanical data conversion This is the result of: cd src/include/catalog perl convert_header2dat.pl list-of-catalog-headers-from-the-Makefile perl -I ../../backend/catalog rewrite_dat.pl *.dat rm *.bak Note: The data has been stripped of all double-quotes for readability, since the Perl hash literals have single quotes everywhere. Patches 6 and 7 restore them where needed. -- Patch 6 - Hand edits Re-doublequote values that are macros expanded by initdb.c, remove stray comments, fix up whitespace, and do a minimum of comment editing to reflect the new data format. At this point the files are ready to take a look at. Format is the central issue, of course. I tried to structure things so it wouldn't be a huge lift to bikeshed on the format. See pg_authid.dat for a conveniently small example. Each entry is 1 or 2 lines long, depending on whether oid or (sh)descr is present. Regarding pg_proc.dat, I think readability is improved, and to some extent line length, although it's not great: pg_proc.h: avg=175, stdev=25 pg_proc.dat: avg=159, stdev=43 (grep -E '^DATA' pg_proc.h | awk '{print length}' grep prosrc pg_proc.dat | awk '{print length}') Many lines now fit in an editor window, but the longest line has ballooned from 576 chars to 692. I made proparallel default to 'u' for safety, but the vast majority are 's'. If we risked making 's' the default, most entries would shrink by 20 chars. On the other hand, human-readable types would inflate that again, but maybe patch 8 below can help with editing. There was some discussion about abbreviated labels that were mapped to column names - I haven't thought about that yet. -- Patch 7 - Distprep scripts 1. Teach genbki.pl and Gen_fmgrtab.pl to read the data files, and arrange for the former to double-quote certain values so bootscanner.l can read them. 2. Introduce Makefile dependencies on the data files. The build works now. Note: Since the original DATA() entries contained some (as far as I can tell) useless quotes, the postgres.bki diff won't be zero, but it will be close. Performance note: On my laptop, running Make in the catalog dir with no compilation goes from ~700ms on master to ~1000ms with the new data files. -- Patch 8 - SQL output 1. Write out postgres.sql, which allows you to insert all the source BKI data into a development catalog schema for viewing and possibly bulk-editing [5]. It retains oids, and (sh)descr fields in their own columns, and implements default values. 2. Make it a distclean target. TODO: Find a way to dump dev catalog tuples into the canonical format. -- Not implemented yet: -Gut the header files of DATA() statements. I'm thinking we should keep the #defines in the headers, but see below: -Update README and documentation -- Future work: -More lookups for human-readable types, operators, etc. -Automate pg_type #defines a la pg_proc [6], which could also be used to maintain ecpg's copy of pg_type #defines. -- [1] https://www.postgresql.org/message-id/flat/20150220234142.GH12653%40awork2.anarazel.de [2] https://www.postgresql.org/message-id/flat/CAGoxFiFeW64k4t95Ez2udXZmKA%2BtazUFAaSTtYQLM4Jhzw%2B-pg%40mail.gmail.com [3] https://www.postgresql.org/message-id/20161113171017.7sgaqdeq6jslmsr3%40alap3.anarazel.de [4] https://www.postgresql.org/message-id/D8F1D509-6498-43AC-BEFC-052DFE16847A%402ndquadrant.com [5] https://www.postgresql.org/message-id/20150304150712.GV29780%40tamriel.snowman.net [6] https://www.postgresql.org/message-id/15697.1479161432%40sss.pgh.pa.us == I'll add this to the next commitfest soon. -John Naylor
Attachment
pgsql-hackers by date: