c - Fastest way to find out minimum of 3 numbers? -
in program wrote, 20% of time being spent on finding out minimum of 3 numbers in inner loop, in routine:
static inline unsigned int min(unsigned int a, unsigned int b, unsigned int c) { unsigned int m = a; if (m > b) m = b; if (m > c) m = c; return m; }
is there way speed up? ok assembly code x86/x86_64.
edit: in reply of comments:
* compiler being used gcc 4.3.3
* far assembly concerned, beginner there. asked assembly here, learn how this. :)
* have quad-core intel 64 running, mmx/sse etc. supported.
* it's hard post loop here, can tell it's heavily optimized implementation of levenshtein algorithm.
this compiler giving me non-inlined version of min:
.globl min .type min, @function min: pushl %ebp movl %esp, %ebp movl 8(%ebp), %edx movl 12(%ebp), %eax movl 16(%ebp), %ecx cmpl %edx, %eax jbe .l2 movl %edx, %eax .l2: cmpl %ecx, %eax jbe .l3 movl %ecx, %eax .l3: popl %ebp ret .size min, .-min .ident "gcc: (ubuntu 4.3.3-5ubuntu4) 4.3.3" .section .note.gnu-stack,"",@progbits
the inlined version within -o2 optimized code (even markers mrk = 0xfefefefe, before , after call min()) getting optimized away gcc, couldn't hold of it.
update: tested changes suggested nils, ephemient, there's no perceptible performance boost using assembly versions of min(). however, 12.5% boost compiling program -march=i686, guess because whole program getting benefits of new faster instructions gcc generating option. guys.
p.s. - used ruby profiler measure performance (my c program shared library loaded ruby program), time spent top-level c function called ruby program, ends calling min() down stack. please see question.
make sure using appropriate -march
setting, first off. gcc defaults not using instructions not supported on original i386 - allowing use newer instruction sets can make big difference @ times! on -march=core2 -o2
get:
min: pushl %ebp movl %esp, %ebp movl 8(%ebp), %edx movl 12(%ebp), %ecx movl 16(%ebp), %eax cmpl %edx, %ecx leave cmovbe %ecx, %edx cmpl %eax, %edx cmovbe %edx, %eax ret
the use of cmov here may avoid branch delays - , without inline asm passing in -march
. when inlined larger function more efficient, possibly 4 assembly operations. if need faster this, see if can sse vector operations work in context of overall algorithm.
Comments
Post a Comment