I also tried creating the table as "UNLOGGED" which led to walwriter I/O to drop drastically and I now get no wall-clock time increase with two import processes but it gets slower with four.
Switching off fsync leads to a drastic time improvement but still higher wall-clock time for four threads.
Does switching fsync off make it faster even when the table are unlogged (that would be surprising) or were the two changes made one at a time? When you say still higher for four threads, do you mean the four threads for fsync=off are much faster than 4 threads for fsync=on but still doesn't scale linearly within the fsync=off set? Or is the nonlinearity so bad that you fsync=off doesn't even improve the 4 thread situation?
PostgreSQL fsyncs each wal segment once it is full. Under heavy load, this is effectively done in the foreground (even when done by WALwriter), because other processes inserting WAL records will soon be blocked by locks taken out by the fsyncing process. So if you can't run your production database with fsync=off, one thing you can try is setting up a new database with a larger wal segment size (--wal-segsize argument to initdb).