I think I found the issue - it's kinda obvious, really. We need to consider the timezone, because the "time" parts alone may sort differently than the full values do. The attached patch should fix this, and it also fixes a similar issue for the inet data type.
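To illustrate the ordering problem, here is a minimal Python sketch (not the patch itself). The `TimeTz` tuple, the helper names and the sign convention for the zone offset are assumptions made for the example; the point is only that two values can sort one way by their time part and the other way once the zone is applied, so a distance based on the time part alone can come out negative.

```python
from typing import NamedTuple

USECS_PER_SEC = 1_000_000
USECS_PER_HOUR = 3600 * USECS_PER_SEC


class TimeTz(NamedTuple):
    """Illustrative stand-in for a time-with-timezone value (hypothetical)."""
    time_usec: int  # microseconds since local midnight
    zone_sec: int   # zone offset in seconds, negative east of UTC (assumed)


def make_timetz(hour: int, utc_offset_hours: int) -> TimeTz:
    # +05 is stored as -18000 under the sign convention assumed above.
    return TimeTz(hour * USECS_PER_HOUR, -utc_offset_hours * 3600)


def utc_usec(v: TimeTz) -> int:
    """Time of day shifted by the zone, comparable across zones."""
    return v.time_usec + v.zone_sec * USECS_PER_SEC


def distance_time_only(a: TimeTz, b: TimeTz) -> int:
    """Distance that ignores the zone (the problematic variant)."""
    return b.time_usec - a.time_usec


def distance_with_zone(a: TimeTz, b: TimeTz) -> int:
    """Distance that takes the zone into account."""
    return utc_usec(b) - utc_usec(a)


a = make_timetz(10, +5)  # 10:00+05, i.e. 05:00 UTC
b = make_timetz(9, -3)   # 09:00-03, i.e. 12:00 UTC

sorted_by_time_only = sorted([a, b], key=lambda v: v.time_usec)
sorted_by_zone_adjusted = sorted([a, b], key=utc_usec)

print(sorted_by_time_only == [b, a])      # time parts alone put b first
print(sorted_by_zone_adjusted == [a, b])  # with the zone applied, a comes first
print(distance_time_only(a, b))           # negative although a sorts before b
print(distance_with_zone(a, b))           # positive, consistent with the ordering
```

Here `a` sorts before `b` once the zone is applied, yet the time-only distance from `a` to `b` is minus one hour.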
As for why the regression tests did not catch this, it's most likely because the data is generated in a "nice" ordering, or something like that. I'll see if I can tweak the ordering to trigger these issues reliably, and I'll do a bit more randomized testing.
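A randomized check along these lines could look like the sketch below. It is only an illustration in Python: the value representation, the offset bound `MAX_ZONE_SEC`, and the guess that "nice" data means a single shared zone are all assumptions, not something taken from the regression tests. It sorts values in zone-adjusted order and counts adjacent pairs whose distance is negative.

```python
import random

USECS_PER_SEC = 1_000_000
USECS_PER_DAY = 24 * 3600 * USECS_PER_SEC
MAX_ZONE_SEC = 12 * 3600  # hypothetical bound on the zone offset


def sort_key(v):
    """Zone-adjusted time first, zone as a tie-breaker (assumed ordering)."""
    time_usec, zone_sec = v
    return (time_usec + zone_sec * USECS_PER_SEC, zone_sec)


def distance_time_only(a, b):
    return b[0] - a[0]


def distance_with_zone(a, b):
    return sort_key(b)[0] - sort_key(a)[0]


def count_negative_gaps(values, distance):
    """Sort the values, then count adjacent pairs with a negative distance."""
    ordered = sorted(values, key=sort_key)
    return sum(1 for a, b in zip(ordered, ordered[1:]) if distance(a, b) < 0)


rng = random.Random(12345)  # fixed seed so a failure can be reproduced

# "Nice" data: every value uses the same zone, so the bug stays hidden.
nice_values = [(rng.randrange(USECS_PER_DAY), 0) for _ in range(1000)]

# Mixed zones: time parts and zone-adjusted times no longer agree.
mixed_values = [
    (rng.randrange(USECS_PER_DAY), rng.randrange(-MAX_ZONE_SEC, MAX_ZONE_SEC + 1))
    for _ in range(1000)
]

nice_bad = count_negative_gaps(nice_values, distance_time_only)
mixed_bad = count_negative_gaps(mixed_values, distance_time_only)
mixed_fixed = count_negative_gaps(mixed_values, distance_with_zone)

print(nice_bad, mixed_bad, mixed_fixed)
```

With a single zone the time-only distance never goes negative, which would explain a test suite passing; with mixed zones it does, while the zone-aware distance stays non-negative.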
There's also the question of rounding errors, which I think might cause random assert failures (but in practice that's harmless; in the worst case we'll merge the ranges a bit differently).
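As one illustration of how rounding can creep in (a sketch only, and not necessarily the mechanism behind the assert failures mentioned above): if a distance is computed in double precision, two distinct 128-bit values can collapse to the same floating-point number. The example assumes distances are represented as doubles and uses IPv6 addresses from the documentation prefix.

```python
import ipaddress

# Two adjacent IPv6 addresses, converted to exact 128-bit integers.
a = int(ipaddress.IPv6Address("2001:db8::1"))
b = int(ipaddress.IPv6Address("2001:db8::2"))

# Exact integer arithmetic keeps the difference.
exact_distance = b - a

# A double has a 53-bit significand, so the low bits of a value this
# large are rounded away and both addresses map to the same float.
float_distance = float(b) - float(a)

print(exact_distance)  # 1
print(float_distance)  # 0.0
```

A check that expects distinct values to have a strictly positive distance would fail here, even though the only practical effect is a slightly different choice of which ranges to merge.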