Thread: next draft of list rewrite

next draft of list rewrite

From
Neil Conway
Date:
I've attached the latest header and implementation files (pg_list.h
and list.c, respectively) for the new linked list implementation. I'm
pretty satisfied with the new API: now would be the ideal time to
suggest any changes you think I should make to it. I decided to use a
new naming convention: all public functions and macros are prefixed
with list_ or cell_, as appropriate, and type-specific functions are
suffixed with _int or _oid, as appropriate.

ISTM that the "allocate the list itself and the head node in one
palloc()" idea that Tom suggested actually won't work: since the head
node can change over the life of the list, we need to be able to
deallocate a list cell via pfree() -- which would not be the case if a
particular node was allocated via the same palloc() as the list
itself.

BTW, one nice thing about the new List API is that it lets us get rid
of the WRITE_INTLIST_FIELD() and related cruft in nodes/, because each
type of list (T_List, T_IntList, or T_OidList) is now a legitimate
Node in its own right. So we can now use WRITE_NODE_FIELD() to output
a list of integers, for example.

(The implementation is a little "off the cuff", and I haven't reviewed
it; also, since the tree is totally broken with this patch applied, I
haven't tested it either. That said, I'd appreciate anyone mentioning
any implementation bugs they might happen to spot.)

Incidentally, does anyone have any thoughts on the best way to commit
this? It requires changing pretty much _every single_ place in the
code that uses the List API. Although most of those modifications
should be trivial, it will still require touching a lot of
files. Should I just make the changes locally and commit them all in
one fell swoop? Create a private branch in CVS? Any other suggestions?

FWIW, I'm hoping to have this done (and committed to CVS HEAD) by the
end of next weekend.

-Neil


Attachment

Re: next draft of list rewrite

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> I've attached the latest header and implementation files (pg_list.h
> and list.c, respectively) for the new linked list implementation. I'm
> pretty satisfied with the new API: now would be the ideal time to
> suggest any changes you think I should make to it. I decided to use a
> new naming convention: all public functions and macros are prefixed
> with list_ or cell_, as appropriate, and type-specific functions are
> suffixed with _int or _oid, as appropriate.

I'm fairly unhappy with that idea.  I thought the plan was to keep the
API exactly the same as before, with the single exception of having to
use ListCell* rather than List* as the type of pointers used to iterate
through a list.  Your apparent intention to rename every single list
routine makes the change far more invasive than it has any need to be.
It will undoubtedly break a fair amount of user-written code that we
don't have control over; and to what benefit?  Everyone who works on
the backend is already accustomed to the existing names, which are
mostly derived from ancient Lisp tradition.  I don't think we should
change them.  If you want to push forward on it, we had better have
a vote in pghackers, rather than assuming everyone concerned will see
this discussion in -patches.

Other than the naming issues, the code looks reasonable.  I noticed one
trivial bug, which is that the Asserts in list_member_int and
list_member_oid seem backwards.  A bigger problem is that list_length,
list_head, and list_tail must *not* be macros --- we cannot afford
double-evaluation bugs associated with operations as widely used as these.
Since these functions are not used within loop bodies, I don't think
that avoiding function call overhead for them is conceivably worth the
risks involved.

> ISTM that the "allocate the list itself and the head node in one
> palloc()" idea that Tom suggested actually won't work: since the head
> node can change over the life of the list, we need to be able to
> deallocate a list cell via pfree() -- which would not be the case if a
> particular node was allocated via the same palloc() as the list
> itself.

No, you'd simply leave that cell-space wasted if the head member were to
change.  It's doable, though it would complicate the deletion routine
and probably nconc as well.  It's entirely likely that it'd be more
complexity than it's worth, though, so if you don't want to do it that's
fine with me.  Certainly it makes sense not to do it in the first pass.

> BTW, one nice thing about the new List API is that it lets us get rid
> of the WRITE_INTLIST_FIELD() and related cruft in nodes/, because each
> type of list (T_List, T_IntList, or T_OidList) is now a legitimate
> Node in its own right. So we can now use WRITE_NODE_FIELD() to output
> a list of integers, for example.

Right, that was a benefit I was expecting ...

> (The implementation is a little "off the cuff", and I haven't reviewed
> it; also, since the tree is totally broken with this patch applied, I
> haven't tested it either.

This is one reason I'd suggest refraining from renaming everything.
Without the rename, it'd be possible to change the implementation with
little or no change to the rest of the tree.  The idea I had was to
temporarily place "(ListCell*)" casts in foreach, lfirst, and lnext,
so that they would work without complaint if the iteration variable is
misdeclared as List* rather than ListCell*.  Then it should be possible
to test and commit the reimplemented list.c with few if any source-code
changes elsewhere.  After that, you can incrementally go through the
rest of the tree and change iteration variables to ListCell*, removing
the casts in the macros when all is done.  This is a lot more pleasant
way to proceed than a "big bang" changeover.  (Removal of the FastList
code could also be done incrementally.)

            regards, tom lane

Re: next draft of list rewrite

From
Neil Conway
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> I thought the plan was to keep the API exactly the same as before,
> with the single exception of having to use ListCell* rather than
> List* as the type of pointers used to iterate through a list.

I thought that once we had decided to change the API, anything was
fair game: either we need to change every call site of the API (which
would be required if we adopted ListCell* as the type of the foreach
iteration variable), or we don't.

> It will undoubtedly break a fair amount of user-written code that we
> don't have control over

Granted, but so will changing the API at all: as above, the ListCell*
change would cause approximately the same amount of breakage for
users.

> Everyone who works on the backend is already accustomed to the
> existing names, which are mostly derived from ancient Lisp
> tradition.

The heritage of the names was part of my reason for changing them: the
Lisp names imply a certain style of list implementation (cons cells,
linear-time length and append ops but constant time prepend) that is
no longer being used. In addition, that makes names like lfirst()
(which would now be applied to a ListCell*) no longer very meaningful.

In addition, there are plenty of infelicities that it would be nice to
remove from the API. For example:

       - names like lispRemove() that aren't very meaningful (AFAICS),
         even in the present implementation

       - some names use case to separate words (e.g. LispRemove(),
         freeList()), some use underscores (e.g. set_union()), while
         some use nothing (e.g. llastnode())

       - some names begin with capital letters (e.g. FastAppend()),
         the rest do not

       - some names use a 'ptr/int/oid' prefix to denote type-specific
         variants of a function, whereas some use a 'i' or 'o' suffix
         for the same purpose.

       - some names are prefixed with 'l', some are not -- without
         apparently rhyme or reason for this distinction

       - some functions take the List* as their first argument
         (e.g. lappend()), some do not (e.g. nth()) -- again, without
         any good reason for the inconsistency that I could see

> If you want to push forward on it, we had better have a vote in
> pghackers, rather than assuming everyone concerned will see this
> discussion in -patches.

I'm still open to being convinced, BTW: I only write the code as it is
without prior discussion because I misunderstood your earlier
position. I'll send a mail to hackers if we're still in disagreement
at the end of this thread.

> I noticed one trivial bug, which is that the Asserts in
> list_member_int and list_member_oid seem backwards.

Thanks: Alvaro was already kind enough to point that out via email.

> A bigger problem is that list_length, list_head, and list_tail must
> *not* be macros --- we cannot afford double-evaluation bugs
> associated with operations as widely used as these.

Fair enough. I'll also take a look at GCC-style inline functions...

> No, you'd simply leave that cell-space wasted if the head member
> were to change.  It's doable, though it would complicate the
> deletion routine and probably nconc as well.  It's entirely likely
> that it'd be more complexity than it's worth, though, so if you
> don't want to do it that's fine with me.  Certainly it makes sense
> not to do it in the first pass.

Ok, I won't worry about it for now.

> [ a scheme for iteratively committing the changes ]
> This is a lot more pleasant way to proceed than a "big bang"
> changeover.

I agree that your method is easier procedurally, but ISTM that the
aggregate amount of work required is about the same: we still need to
change pretty much every call site of the list API. Yes, it is
slightly easier if we can do this one call site at a time, and yes,
it's slightly easier if s/List/ListCell/ is the only required change,
but there is about the same degree of work involved either
way -- and IMHO the benefit of a well named API is worth the
short-term pain.

Furthermore, it is quite possible to reduce the pain induced by my
method. For example, we could create a private CVS branch. In that
branch, it wouldn't matter if the tree is broken for a week or two, in
which time we can make the changes across the tree at our leisure, and
then merge them into HEAD in one relatively painless branch landing.

-Neil


Re: next draft of list rewrite

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> I thought that once we had decided to change the API, anything was
> fair game: either we need to change every call site of the API (which
> would be required if we adopted ListCell* as the type of the foreach
> iteration variable), or we don't.

Changing the iteration variable type only impacts places that use
foreach, lfirst, and lnext, which is a small subset of the places that
use the complete API.  (A quick look at a couple of planner files
suggested that it'd be about half as many changes, though I'm not
certain that the ones I looked at are very representative.)

A bigger point is that the iteration variable type makes no real
difference in terms of reading the code, whereas a wholesale change in
function names will entail a fair amount of relearning.

BTW, I noticed that lfirst() gets applied to both iteration variables
and true Lists, so we will certainly need to do something about that.
The second case is much less frequent, so possibly
lfirst(first_cell(list)) is the right sort of locution for it.
It might be worthwhile to define ListCell and List as the same struct
temporarily (with some wasted fields in each case) until all the cases
like that can be fixed.

> In addition, there are plenty of infelicities that it would be nice to
> remove from the API. For example:

Yes, there's inconsistency in the names.  That's true all through
Postgres, and is an inevitable result when you have code that's been
worked on by many people over many years.  If we were going to start
enforcing some project-wide naming standards, we could make things
prettier-looking, but I don't see the call to do it to list.c alone.

> I agree that your method is easier procedurally, but ISTM that the
> aggregate amount of work required is about the same: we still need to
> change pretty much every call site of the list API.

You could still do it incrementally, by making macros that implement the
old function names on top of the new.  But I remain of the opinion that
renaming functions simply for the sake of renaming is a dead loss of
time --- and not only your time, but a lot of other contributors'.

> Furthermore, it is quite possible to reduce the pain induced by my
> method. For example, we could create a private CVS branch. In that
> branch, it wouldn't matter if the tree is broken for a week or two, in
> which time we can make the changes across the tree at our leisure, and
> then merge them into HEAD in one relatively painless branch landing.

I doubt the branch landing would be painless, unless the rest of us sit
around twiddling our thumbs meanwhile.  This is such core code that a
wholesale renaming would likely break any patch anyone else has in
progress.

            regards, tom lane