Attached is a spreadsheet with results for various work_mem values, and also with a smaller data set (just 30M rows in the fact table), which easily fits into memory. Yet it shows similar gains, shaving off ~40% in the best case, suggesting the improvement is not just due to reduced I/O from forcing the temp files to disk.
A neat idea! Have you also tried collecting statistics on the actual false-positive rates and filter allocation sizes at each of the measured data points?
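For context, the *expected* false-positive rate of a standard Bloom filter follows directly from the allocation size, the number of inserted elements, and the hash count, so it can be compared against the measured rate at each data point. A quick sketch of the textbook formulas (function names here are illustrative, not from the patch):

```python
import math

def bloom_fpp(m_bits, n_items, k_hashes):
    """Expected false-positive probability of a standard Bloom filter:
    (1 - e^(-k*n/m))^k, with m bits, n inserted items, k hash functions."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

def optimal_k(m_bits, n_items):
    """Hash count minimizing the false-positive rate: (m/n) * ln 2."""
    return max(1, round(m_bits / n_items * math.log(2)))

# Example: 8 bits per element, optimal hash count, expected FPP ~2%
m, n = 8 * 1_000_000, 1_000_000
k = optimal_k(m, n)
print(k, bloom_fpp(m, n, k))
```

Comparing this expected rate against the observed one per data point would show how far the sizing heuristic drifts from the theoretical optimum.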