So I was reading some rather not-so-clever code today. I had a gut feeling something was wrong with it, since I had never seen an idiom like that before: a server that does a little hash calculation with lots of threads – and the function that computes the hash has a peculiar feature. Its entire body is wrapped in a lock/unlock pair on a function-static mutex initialized with PTHREAD_MUTEX_INITIALIZER, like this:
static EVP_MD_CTX mdctx;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned char first = 1;

pthread_mutex_lock(&lock);
if (first) {
    EVP_MD_CTX_init(&mdctx);
    first = 0;
}

/* the actual hash computation using &mdctx */

pthread_mutex_unlock(&lock);
In other words, if this function is called from multiple threads, only one invocation runs at a time; the others have to wait for the (shared) mutex to be unlocked first.
The computation code inside the function looks roughly like this:
if (!EVP_DigestInit_ex(&mdctx, EVP_sha256(), NULL) ||
    !EVP_DigestUpdate(&mdctx, input, inputlen) ||
    !EVP_DigestFinal(&mdctx, hash, &md_len)) {
    ERR_print_errors_fp(stderr);
    exit(-1);
}
This is the typical OpenSSL pattern: you tell it to initialize mdctx for computing a SHA-256 digest, then you “update” the digest (i.e., you feed it some bytes), and finally you tell it to finish, storing the resulting hash in hash. If any of these functions fails, the OpenSSL error is printed.
So the lock mutex really only protects mdctx (short for ‘message digest context’). And my gut feeling was that re-initializing the context on every call (i.e., copying a bit of memory around) is much cheaper than synchronizing all the hash operations (i.e., having one stupid bottleneck).
To be sure, I ran a few tests. I wrote a simple C program that scales up the number of threads and measures how much time you need to hash 10 million 16-byte strings. (You can find the whole quick’n’dirty code on GitHub.)
First, I have to create a data set. For it to be the same on every run, I use rand_r() with a hard-coded seed, so that all iterations operate on an identical “random” data set:
#define DATANUM  10000000
#define DATASIZE 16

static char data[DATANUM][DATASIZE];

void init_data(void)
{
    int n, i;
    unsigned int seedp = 0xdeadbeef; /* make the randomness predictable */
    char alpha[] = "abcdefghijklmnopqrstuvwxyz";

    for (n = 0; n < DATANUM; n++)
        for (i = 0; i < DATASIZE; i++)
            data[n][i] = alpha[rand_r(&seedp) % 26];
}
Next, you have to give OpenSSL a helping hand so that it can be run multithreaded: there are, it seems, certain internal data structures that need protection. This is a technical detail.
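Since the code uses a stack-allocated EVP_MD_CTX, it targets a pre-1.1.0 OpenSSL, where the usual way to make the library thread-safe is to register locking callbacks via CRYPTO_set_locking_callback(). A minimal sketch of that setup (helper and variable names are my own; error handling is omitted):

#include <stdlib.h>
#include <pthread.h>
#include <openssl/crypto.h>

static pthread_mutex_t *ssl_locks; /* one mutex per internal OpenSSL lock */

static void ssl_locking_cb(int mode, int n, const char *file, int line)
{
    if (mode & CRYPTO_LOCK)
        pthread_mutex_lock(&ssl_locks[n]);
    else
        pthread_mutex_unlock(&ssl_locks[n]);
}

static unsigned long ssl_id_cb(void)
{
    return (unsigned long) pthread_self();
}

void init_openssl_threading(void) /* hypothetical helper, called once before any threads start */
{
    int i;

    ssl_locks = malloc(CRYPTO_num_locks() * sizeof *ssl_locks);
    for (i = 0; i < CRYPTO_num_locks(); i++)
        pthread_mutex_init(&ssl_locks[i], NULL);

    CRYPTO_set_locking_callback(ssl_locking_cb);
    CRYPTO_set_id_callback(ssl_id_cb);
}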
Then I start num threads on equally-sized slices of data, while recording and printing out timing statistics:
void hash_all(int num)
{
    int i;
    pthread_t *t;
    struct fromto *ft;
    struct timespec start, end;
    double delta;

    clock_gettime(CLOCK_MONOTONIC, &start);

    t = malloc(num * sizeof *t);
    for (i = 0; i < num; i++) {
        ft = malloc(sizeof(struct fromto));
        ft->from = i * (DATANUM/num);
        ft->to = ((i+1) * (DATANUM/num)) > DATANUM ?
                 DATANUM : (i+1) * (DATANUM/num);
        pthread_create(&t[i], NULL, hash_slice, ft);
    }
    for (i = 0; i < num; i++)
        pthread_join(t[i], NULL);

    clock_gettime(CLOCK_MONOTONIC, &end);
    delta = end.tv_sec - start.tv_sec;
    delta += (end.tv_nsec - start.tv_nsec) / 1000000000.0;
    printf("%d threads: %ld hashes/s, total = %.3fs\n",
           num, (unsigned long) (DATANUM / delta), delta);

    free(t);
    sleep(1);
}
Each thread runs the hash_slice() function, which linearly iterates over its slice and calls hash_one(n) for each entry (a sketch of hash_slice() follows the listing below). Using preprocessor macros, I define two versions of hash_one():
void hash_one(int num)
{
    int i;
    unsigned char hash[EVP_MAX_MD_SIZE];
    unsigned int md_len;

#ifdef LOCK_STATIC_EVP_MD_CTX
    static EVP_MD_CTX mdctx;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned char first = 1;

    pthread_mutex_lock(&lock);
    if (first) {
        EVP_MD_CTX_init(&mdctx);
        first = 0;
    }
#else
    EVP_MD_CTX mdctx;
    EVP_MD_CTX_init(&mdctx);
#endif

    /* the actual hashing from above */

#ifdef LOCK_STATIC_EVP_MD_CTX
    pthread_mutex_unlock(&lock);
#endif
    return;
}
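The hash_slice() worker and the fromto descriptor themselves are not shown above. Based on how hash_all() uses them, they could look roughly like this; the field names and the cleanup of the heap-allocated descriptor are my assumptions:

struct fromto {
    int from;
    int to;
};

void *hash_slice(void *arg)
{
    struct fromto *ft = arg;
    int n;

    for (n = ft->from; n < ft->to; n++)
        hash_one(n);

    free(ft); /* assumed: release the descriptor allocated in hash_all() */
    return NULL;
}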
The Makefile produces two binaries:
$ make
gcc -Wall -pthread -lrt -lssl -DLOCK_STATIC_EVP_MD_CTX -o speedtest-locked speedtest.c
gcc -Wall -pthread -lrt -lssl -o speedtest-copied speedtest.c
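For reference, a minimal Makefile that reproduces these two invocations could look like this (my reconstruction; the Makefile in the repository may differ):

CC = gcc
# libraries kept in CFLAGS only to mirror the commands shown above
CFLAGS = -Wall -pthread -lrt -lssl

all: speedtest-locked speedtest-copied

speedtest-locked: speedtest.c
	$(CC) $(CFLAGS) -DLOCK_STATIC_EVP_MD_CTX -o $@ speedtest.c

speedtest-copied: speedtest.c
	$(CC) $(CFLAGS) -o $@ speedtest.c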
The results are as expected. On my Intel i7-2620M (two cores, four hardware threads):
$ ./speedtest-copied
1 threads: 1999113 hashes/s, total = 5.002s
2 threads: 3443722 hashes/s, total = 2.904s
4 threads: 3709510 hashes/s, total = 2.696s
8 threads: 3665865 hashes/s, total = 2.728s
12 threads: 3650451 hashes/s, total = 2.739s
24 threads: 3642619 hashes/s, total = 2.745s
$ ./speedtest-locked
1 threads: 2013590 hashes/s, total = 4.966s
2 threads: 857542 hashes/s, total = 11.661s
4 threads: 631336 hashes/s, total = 15.839s
8 threads: 932238 hashes/s, total = 10.727s
12 threads: 850431 hashes/s, total = 11.759s
24 threads: 802501 hashes/s, total = 12.461s
And on an Intel Xeon X5650 machine with 24 hardware threads:
$ ./speedtest-copied
1 threads: 1564546 hashes/s, total = 6.392s
2 threads: 1973912 hashes/s, total = 5.066s
4 threads: 3821067 hashes/s, total = 2.617s
8 threads: 5096136 hashes/s, total = 1.962s
12 threads: 5849133 hashes/s, total = 1.710s
24 threads: 7467990 hashes/s, total = 1.339s
$ ./speedtest-locked
1 threads: 1481025 hashes/s, total = 6.752s
2 threads: 701797 hashes/s, total = 14.249s
4 threads: 338231 hashes/s, total = 29.566s
8 threads: 318873 hashes/s, total = 31.360s
12 threads: 402054 hashes/s, total = 24.872s
24 threads: 304193 hashes/s, total = 32.874s
So, while the real computation times shrink when you don’t force a bottleneck – yes, it’s an embarrassingly parallel problem – the reverse happens when you force synchronization: all the waiting on the mutex slows the program down so much that you’re better off using just one thread, or else you lose.
Rule of thumb: if you don’t have a good argument for making an application multithreaded, simply don’t put in the extra effort of implementing it in the first place.