I've pushed a fix that uses DEFAULT_NUM_DISTINCT. The input comes from a set operation (which is where we call generate_append_tlist), so it's probably fairly unique, and maybe we should use input_tuples instead. But uniqueness is not guaranteed, so DEFAULT_NUM_DISTINCT seems reasonably defensive.
Thanks for the fix. Verified that the crash has been fixed.
One detail I've changed is that instead of matching the expression directly to a Var, it now calls pull_varnos() to also detect Vars somewhere deeper in the expression. Looking at examine_variable(), it calls find_base_rel for such cases too, but I haven't tried constructing a query that triggers the issue.
A minor comment is that I don't think we need to strip relabel explicitly before calling pull_varnos(), because this function would recurse into T_RelabelType nodes.
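To illustrate the recursion argument, here is a minimal toy sketch of a varno collector that walks through relabel nodes on its own. The names ToyNode, ToyTag and toy_pull_varnos are hypothetical stand-ins and not PostgreSQL's actual Node types or its pull_varnos() implementation; the bitmask assumes varnos in the range 0..31.

```c
#include <stddef.h>

/* Toy node kinds: illustrative stand-ins, NOT PostgreSQL's Node types. */
typedef enum { TOY_VAR, TOY_RELABEL, TOY_OP } ToyTag;

typedef struct ToyNode
{
    ToyTag      tag;
    int         varno;          /* used by TOY_VAR only (assumed 0..31) */
    const struct ToyNode *left; /* TOY_RELABEL argument or first operand */
    const struct ToyNode *right;/* second TOY_OP operand, else NULL */
} ToyNode;

/*
 * Hypothetical walker: returns a bitmask with bit N set for every
 * varno N found anywhere in the tree.
 */
unsigned int
toy_pull_varnos(const ToyNode *node)
{
    if (node == NULL)
        return 0;

    switch (node->tag)
    {
        case TOY_VAR:
            return 1u << node->varno;
        case TOY_RELABEL:
            /* The walker descends through the relabel by itself. */
            return toy_pull_varnos(node->left);
        case TOY_OP:
            return toy_pull_varnos(node->left) |
                   toy_pull_varnos(node->right);
    }
    return 0;
}
```

In this toy model, a Var wrapped in a relabel node yields the same result as the bare Var, so stripping the wrapper before the call would change nothing.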
Also, do we need to call bms_free(varnos) for each pathkey here to avoid wasting memory?
One improvement I can think of is handling lists where only some expressions contain varno 0. We could still call estimate_num_groups for the expressions with varno != 0, and multiply that by the estimate for the other part (be it DEFAULT_NUM_DISTINCT). This might produce a higher estimate than just using DEFAULT_NUM_DISTINCT directly, resulting in a lower incremental sort cost. But it's not clear to me if this can even happen - AFAICS either all Vars have varno 0 or none do, so I haven't done this.
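For concreteness, the arithmetic of that idea might look roughly like the sketch below. This is only an illustration: combined_group_estimate is a hypothetical helper, not something in the patch or in PostgreSQL, and clamping the result to input_tuples is an assumption on my part. DEFAULT_NUM_DISTINCT is redefined locally with the value PostgreSQL uses (200).

```c
#include <stdbool.h>

/* Redefined here so the sketch is self-contained; PostgreSQL uses 200. */
#define DEFAULT_NUM_DISTINCT 200.0

/*
 * Hypothetical helper (not part of the patch).
 *
 * known_groups: estimate_num_groups() result for the expressions whose
 *               Vars all have varno != 0
 * has_unknown:  true if some expressions contain varno 0
 * input_tuples: number of input rows, used as an upper bound (assumption)
 */
double
combined_group_estimate(double known_groups, bool has_unknown,
                        double input_tuples)
{
    double      groups = known_groups;

    /* Treat the varno-0 part as one unknown with the default estimate. */
    if (has_unknown)
        groups *= DEFAULT_NUM_DISTINCT;

    /* There can't be more groups than input rows, nor fewer than one. */
    if (groups > input_tuples)
        groups = input_tuples;
    if (groups < 1.0)
        groups = 1.0;

    return groups;
}
```

With known_groups = 10 and a large input, this gives 2000 groups instead of the flat 200, which is the "higher estimate" case described above.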