I used the following C-flags at compile time:
-msse2 -fmerge-all-constants -fmodulo-sched -fgcse-sm -fgcse-las -funsafe-loop-optimizations -fsched-spec-load -fsched-spec-load-dangerous -fsched-stalled-insns=0 -fsched-stalled-insns-dep -fsched2-use-superblocks -fipa-pta -ftree-loop-linear -ftree-loop-im -ftree-loop-ivcanon -fivopts -fvariable-expansion-in-unroller -ffast-math -fbranch-target-load-optimize -maccumulate-outgoing-args -combine
Because of -msse2, the builds will only work on CPUs that support the SSE2 SIMD instructions, but that includes pretty much every processor manufactured in the past 8 years, so I figured that was good enough.
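If you're not sure whether your CPU has SSE2, here's a rough sketch of a check you could compile yourself. It assumes GCC 4.8 or newer (or a compatible compiler such as Clang) on x86, since it relies on the `__builtin_cpu_supports` builtin. The function name `cpu_has_sse2` is just something I made up for this example; it isn't part of the builds.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if the running CPU reports SSE2
   support, 0 otherwise. On non-x86 targets, or with a compiler that
   lacks the GCC builtins, it conservatively returns 0. */
int cpu_has_sse2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();                           /* populate CPU feature data */
    return __builtin_cpu_supports("sse2") ? 1 : 0;  /* normalize to 0 or 1 */
#else
    return 0;
#endif
}
```

Call it from your own `main`, e.g. `puts(cpu_has_sse2() ? "SSE2: yes" : "SSE2: no");`. Any 64-bit x86 processor should report yes, since SSE2 is part of the x86-64 baseline.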
If you want to give them a try, you can download them from my MediaFire.