Anyone who has written C code has probably at one point or another had to write a small function to find the minimum value in a set of integers. When the set contains only two numbers, a macro like #define min(a,b) ((a) < (b) ? (a) : (b)) is frequently used. When the set is expanded to three it is possible to add to that macro, but I generally prefer to err on the side of code readability whenever possible, so I just write a little function to do it. Recently, as I was performing research for my master’s thesis, I saw some C++ code which implemented the minimum_of_three function slightly differently than my standard naïve approach. The comment said something to the effect of, “spend a register to reduce the number of branches and increase performance.” I’m always looking for ways to improve my code, so I thought that I would try out this better way. However, I am not one to just blindly believe stuff I read in code comments, so I had to test it out for myself. Below are the results of my analysis.

So, to start, let’s look at the functions that I am investigating. Here is the minimum function that I have written on numerous occasions. (Did I steal this from K&R? Perhaps, many C ideas originate there, my inspiration was probably on page 121 of the second edition)

Spoiler alert: tldr: use this function

static inline int min3a(int a, int b, int c)
{
    if (a < b) {
        if (a < c) {
            return a;
        } else {
            return c;
        }
    } else {
        if (b < c) {
            return b;
        } else {
            return c;
        }
    }
}

And, here is the function that was supposed to improve performance by reducing the number of jumps. There are only two ‘if’ statements, so it’s looking like this new method will improve performance, but we still need to perform some actual analysis to be sure.

static inline int min3b(int a, int b, int c)
{
    int result = a;
    if (b < result) {
        result = b;
    }
    if (c < result) {
        return c;
    }
    return result;
}

To start I needed to look at the assembly code for the two programs. With clang I can get the LLVM assembly with the command cc -S -c ./example.c.

The assembly code for both functions start the same way …

_min3:                                  ## @min3
    .cfi_startproc
## BB#0:
    pushq   %rbp
Ltmp32:
    .cfi_def_cfa_offset 16
Ltmp33:
    .cfi_offset %rbp, -16
    movq    %rsp, %rbp
Ltmp34:
    .cfi_def_cfa_register %rbp
    movl    %edi, -8(%rbp)
    movl    %esi, -12(%rbp)
    movl    %edx, -16(%rbp)
    movl    -8(%rbp), %edx

… and end the same way.

LBB6_7:
    movl    -4(%rbp), %eax
    popq    %rbp
    ret
    .cfi_endproc

    .align  4, 0x90

With the common sections removed we can have a closer look at the actual work done by the two functions. Looking at the bodies of the functions it seems more likely that min3a will take more time than min3b.

    cmpl    -12(%rbp), %edx
    jge LBB6_4 ## a > b
## BB#1:
    movl    -8(%rbp), %eax
    cmpl    -16(%rbp), %eax
    jge LBB6_3 ## c > a
## BB#2:
    movl    -8(%rbp), %eax
    movl    %eax, -4(%rbp)
    jmp LBB6_7
LBB6_3:
    movl    -16(%rbp), %eax
    movl    %eax, -4(%rbp)
    jmp LBB6_7
LBB6_4:
    movl    -12(%rbp), %eax
    cmpl    -16(%rbp), %eax
    jge LBB6_6
## BB#5:
    movl    -12(%rbp), %eax
    movl    %eax, -4(%rbp)
    jmp LBB6_7
LBB6_6:
    movl    -16(%rbp), %eax
    movl    %eax, -4(%rbp)

Function min3b only has 14 lines of assembly code, compared to min3a’s 19 lines (excluding comments and labels).

    movl    %edx, -20(%rbp)
    movl    -12(%rbp), %edx
    cmpl    -20(%rbp), %edx
    jge LBB8_2
## BB#1:
    movl    -12(%rbp), %eax
    movl    %eax, -20(%rbp)
LBB8_2:
    movl    -16(%rbp), %eax
    cmpl    -20(%rbp), %eax
    jge LBB8_4
## BB#3:
    movl    -16(%rbp), %eax
    movl    %eax, -4(%rbp)
    jmp LBB6_7
LBB8_4:
    movl    -20(%rbp), %eax
    movl    %eax, -4(%rbp)

So, now that we have the llvm assembly code for each function we can examine how many instructions are actually required for each of the three cases (a being minimum, b being minimum, and c being minimum).

Method a b c Instructions Jumps Note
min3 A X 7 1
min3 A X 6 2
min3 A X 6/7 2
min3 A 6.5 1.6 Average
min3 B X 11 1
min3 B X 8 1
min3 B X 10 1
min3 B 9.6 1 Average

That’s interesting. If we step through the code, each of the possible outcomes takes fewer instructions in my original function than in the supposedly improved function. However, the commenter was right; she saves one jump two out of three times. But, these are near jumps, is that really going to offset the increased number of assembly instructions? To find out we will have to write a little test program and perform some empirical analysis.

I wrote a test program which executes each of the functions 1,200,000 times and takes the average of 100 runs. The program was compiled using clang (Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)) without any optimizations. The empirical analysis supports the code analysis, my original naïve minimum function is about 15% faster than the “improved” version.

Method Time (microseconds)
A 3446
B 4128

So, what’s the moral of the story? Well, besides for reaffirming my belief of not blindly believing code comments [potentially left by a Historian turned programmer], my takeaway is that just because code looks like it should be more efficient, that doesn’t necessarily mean it will be more efficient. Here the first C function and the LLVM code for it is longer with no apparent optimizations done by the compiler, and yet it performs faster; the branches reduce the number of steps required by the processor to complete the same amount of work. In this specific case, not only is my version of the function more readable and easier to follow the control flow, it is also demonstrably faster.

Who cares about saving 600 microseconds … microseconds, not milliseconds? Well, I do. I am presently working with a dynamic programming algorithm which manipulates each pixel in an image. An iPhone 5 image is 3,264 x 2,448 pixels, or 7,990,272 total pixels (8 megapixel). Let’s say it took me just 600 microseconds to manipulate each pixel, then it would take me 4,794,163,200 microseconds to manipulate the entire image. That’s an hour and 20 minutes, which makes the problem intractable.

That sounds impresive, and makes you better appreciate 600 microseconds, but really the savings of 600 microseconds was over 1.2 million itterations. So, the per function call savings was only 0.0005 microseconds. However, if we have an algorithm which must call the minimum function 8 times for each pixel in the image, then the savings adds up to 31,961.088 microseconds, or just over 0.03 seconds. Now we are getting close to talking about real time loss, 30 bad choices like that and you are talking about loosing a second per image.

If that doesn’t persuade you that this is important, then perhaps if you think of it from a battery consumption perspective you will change your mind. The first function uses 15% less battery power to get the same result. I know that is not strictly true, but you know what I mean.

The program I wrote to test the two functions is listed below. I am running the code under OS X 10.9.2 Mavericks on a 1.7 GHz Intel Core i7 (mid-2013 13” MacBook Air)

#include <stdio.h>
#include <limits.h>
#include <sys/time.h>

#define TEST_ITTERATIONS 100
#define TEST_LOOPS 100000

static inline int min3a(int a, int b, int c)
{
    if (a < b) {
        if (a < c) {
            return a;
        } else {
            return c;
        }
    } else {
        if (b < c) {
            return b;
        } else {
            return c;
        }
    }
}

static inline int min3b(int a, int b, int c)
{
    int result = a;
    if (b < result) {
        result = b;
    }
    if (c < result) {
        return c;
    }
    return result;
}

static void testA()
{
    int result = 0;
    for (int i = 0; i < TEST_LOOPS; ++i) {
        result = 0;
        result += min3a(1, 2, 3);
        result += min3a(1, 3, 2);
        result += min3a(2, 1, 3);
        result += min3a(2, 3, 1);
        result += min3a(3, 1, 2);
        result += min3a(3, 2, 1);

        if (result != 6) {
            printf("Bad min (method A, group 1, loop %d)\n", i);
        }

        result = 0;
        result = min3a(INT_MAX-2, INT_MAX-1, INT_MAX);
        result = min3a(INT_MAX-2, INT_MAX, INT_MAX-1);
        result = min3a(INT_MAX-1, INT_MAX-2, INT_MAX);
        result = min3a(INT_MAX-1, INT_MAX, INT_MAX-2);
        result = min3a(INT_MAX, INT_MAX-2, INT_MAX-1);
        result = min3a(INT_MAX, INT_MAX-1, INT_MAX-2);

        if (result != INT_MAX-2) {
            printf("Bad min (method A, group 2, loop %d)\n", i);
        }
    }
}

static void testB()
{
    int result = 0;
    for (int i = 0; i < TEST_LOOPS; ++i) {
        result = 0;
        result += min3b(1, 2, 3);
        result += min3b(1, 3, 2);
        result += min3b(2, 1, 3);
        result += min3b(2, 3, 1);
        result += min3b(3, 1, 2);
        result += min3b(3, 2, 1);

        if (result != 6) {
            printf("Bad min (method B, group 1, loop %d)\n", i);
        }

        result = 0;
        result = min3b(INT_MAX-2, INT_MAX-1, INT_MAX);
        result = min3b(INT_MAX-2, INT_MAX, INT_MAX-1);
        result = min3b(INT_MAX-1, INT_MAX-2, INT_MAX);
        result = min3b(INT_MAX-1, INT_MAX, INT_MAX-2);
        result = min3b(INT_MAX, INT_MAX-2, INT_MAX-1);
        result = min3b(INT_MAX, INT_MAX-1, INT_MAX-2);

        if (result != INT_MAX-2) {
            printf("Bad min (method B, group 2, loop %d)\n", i);
        }
    }
}

int main(int argc, char const *argv[])
{
    struct timeval stop, start;

    int totalTime = 0;
    for (int i = 0; i < TEST_ITTERATIONS; ++i) {
        gettimeofday(&start, NULL);
        testA();
        gettimeofday(&stop, NULL);
        totalTime += (stop.tv_usec - start.tv_usec);
        printf("Method A: %u us \n", stop.tv_usec - start.tv_usec);
    }
    printf("A Avg: %u us \n", (totalTime / TEST_ITTERATIONS));

    totalTime = 0;
    for (int i = 0; i < TEST_ITTERATIONS; ++i) {
        gettimeofday(&start, NULL);
        testB();
        gettimeofday(&stop, NULL);
        totalTime += (stop.tv_usec - start.tv_usec);
        printf("Method B: %u us \n", stop.tv_usec - start.tv_usec);
    }
    printf("B Avg: %u us \n", (totalTime / TEST_ITTERATIONS));
    
    return 0;
}

Tags: C LLVM Assembly Minimum

Published: 2014-05-23