Also grasping at straws and wondering if it's related to L3 contention. Hash table probes are almost always unpredictable memory accesses that the hardware prefetcher can't help with. If useful cachelines are being evicted from L3 by other processes, that means more stalls waiting on RAM when probing the hash table.
I can reproduce it on some near-idle systems as well as the busy one. Queries are sometimes 10% to 30% slower on the busy system than on the near-idle read-only streaming replica.
Cache contention and memory contention can certainly slow hash access. And although we often model hash lookups as O(1), there is no such thing as true O(1) scaling for random memory access on today's hardware: the gap between accessing something entirely in L1/L2 cache and something mostly in RAM is huge.
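To make that gap concrete, here is a minimal pointer-chasing sketch (not PostgreSQL code; the array size and access pattern are illustrative). It walks the same set of slots twice, once in sequential order the prefetcher can predict and once in shuffled order where every step is a dependent, unpredictable load, much like a hash probe:

```python
import random
import time

def chase(nxt):
    """Follow a permutation cycle once; each step is a dependent load."""
    i = 0
    t0 = time.perf_counter()
    for _ in range(len(nxt)):
        i = nxt[i]
    return time.perf_counter() - t0

n = 1 << 21  # ~2M slots; far larger than typical L1/L2 caches

# Sequential cycle: 0 -> 1 -> ... -> n-1 -> 0 (prefetcher-friendly order).
seq_next = list(range(1, n)) + [0]

# Shuffled cycle over the same slots (prefetcher-hostile, like hash probes).
perm = list(range(n))
random.shuffle(perm)
rnd_next = [0] * n
for a, b in zip(perm, perm[1:] + perm[:1]):
    rnd_next[a] = b

t_seq = chase(seq_next)
t_rnd = chase(rnd_next)
print(f"sequential: {t_seq:.3f}s shuffled: {t_rnd:.3f}s ratio: {t_rnd/t_seq:.1f}x")
```

Interpreter overhead mutes the effect in Python; the same experiment in C typically shows a much larger ratio once the working set spills out of L3.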
I'm curious how consistent the run times are, and whether the json_agg() query can be just as slow as the array_agg() one. Could it be that the json_agg() version was simply run at a time the server wasn't as busy with other things?
The run times are consistent enough (within a 15% range most of the time on the busy server), and we are talking about a 50x performance difference, which dwarfs that variation.
I have run each of these tests dozens of times now, back-to-back and with delays, and the result is extremely consistent regardless of timing. On the low-load servers and my laptop it is even more consistent. The only time it is 'unexpectedly fast' is when something triggers a different query plan, such as one using sort + GroupAggregate or one doing an index scan. Every time the plan is a HashAggregate over a sequential scan, the slowdown reproduces.
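For anyone else trying to reproduce the plan dependence described above, the planner can be steered away from HashAggregate with the enable_hashagg setting so the two plan shapes can be timed on the same data. A sketch, with `t`, `key`, and `val` as placeholder names (the original query and table aren't shown in this thread):

```sql
-- Baseline: confirm the slow plan is HashAggregate over a Seq Scan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT key, array_agg(val) FROM t GROUP BY key;

-- Steer the planner toward the sort + GroupAggregate alternative
-- for comparison. Session-local; reset afterwards.
SET enable_hashagg = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT key, array_agg(val) FROM t GROUP BY key;
RESET enable_hashagg;
```

Comparing the two EXPLAIN ANALYZE outputs back-to-back should show whether the regression really tracks the HashAggregate node rather than the aggregate function itself.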