So I was reading some rather not-so-clever code today. I had a gut feeling something was wrong with it, since I had never seen an idiom like that before: a server that does a little hash calculation with lots of threads – and the function that computes the hash has a peculiar feature. Its entire body is wrapped in a lock/unlock pair on a function-static mutex initialized with PTHREAD_MUTEX_INITIALIZER, like this:
static EVP_MD_CTX mdctx;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned char first = 1;

pthread_mutex_lock(&lock);
if (first) {
    EVP_MD_CTX_init(&mdctx);
    first = 0;
}

/* the actual hash computation using &mdctx */

pthread_mutex_unlock(&lock);
In other words, if this function is called from multiple threads, only one invocation runs at a time; the others have to wait for the (shared) mutex to be unlocked first.
The computation code inside the function looks roughly like this:
if (!EVP_DigestInit_ex(&mdctx, EVP_sha256(), NULL) ||
    !EVP_DigestUpdate(&mdctx, input, inputlen) ||
    !EVP_DigestFinal(&mdctx, hash, &md_len)) {
    ERR_print_errors_fp(stderr);
    exit(-1);
}
This is the typical OpenSSL pattern: you tell it to initialize mdctx for computing a SHA-256 digest, then you “update” the digest (i.e., you feed it some bytes), and finally you tell it to finish, storing the resulting hash in hash. If any of these functions fails, the OpenSSL error is printed.
So the lock mutex really only protects mdctx (short for ‘message digest context’). And my gut feeling was that re-initializing the context on every call (i.e., copying a bit of memory around) is much cheaper than synchronizing all the hash operations (i.e., having one stupid bottleneck).
To be sure, I ran a few tests. I wrote a simple C program that scales up the number of threads and measures how much time you need to hash 10 million 16-byte strings. (You can find the whole quick’n’dirty code on GitHub.)
First, I have to create a data set. For it to be the same on every run, I use rand_r() with a hard-coded seed, so that all iterations operate on an identical “random” data set:
#define DATANUM  10000000
#define DATASIZE 16

static char data[DATANUM][DATASIZE];

void init_data(void)
{
    int n, i;
    unsigned int seedp = 0xdeadbeef; /* make the randomness predictable */
    char alpha[] = "abcdefghijklmnopqrstuvwxyz";

    for (n = 0; n < DATANUM; n++)
        for (i = 0; i < DATASIZE; i++)
            data[n][i] = alpha[rand_r(&seedp) % 26];
}
Next, you have to give OpenSSL a helping hand so that it can be run multithreaded: there are, it seems, certain internal data structures that need protection. This is a technical detail.
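Since the code uses a stack-allocated EVP_MD_CTX, it targets a pre-1.1.0 OpenSSL, where the usual way to make the library thread-safe is to register locking callbacks via CRYPTO_set_locking_callback(). A minimal sketch of that setup (helper and variable names are my own; error handling is omitted):

#include <stdlib.h>
#include <pthread.h>
#include <openssl/crypto.h>

static pthread_mutex_t *ssl_locks; /* one mutex per internal OpenSSL lock */

static void ssl_locking_cb(int mode, int n, const char *file, int line)
{
    if (mode & CRYPTO_LOCK)
        pthread_mutex_lock(&ssl_locks[n]);
    else
        pthread_mutex_unlock(&ssl_locks[n]);
}

static unsigned long ssl_id_cb(void)
{
    return (unsigned long) pthread_self();
}

void init_openssl_threading(void) /* hypothetical helper, called once before any threads start */
{
    int i;

    ssl_locks = malloc(CRYPTO_num_locks() * sizeof *ssl_locks);
    for (i = 0; i < CRYPTO_num_locks(); i++)
        pthread_mutex_init(&ssl_locks[i], NULL);

    CRYPTO_set_locking_callback(ssl_locking_cb);
    CRYPTO_set_id_callback(ssl_id_cb);
}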
Then I start num threads on equally-sized slices of data, while recording and printing out timing statistics:
void hash_all(int num)
{
    int i;
    pthread_t *t;
    struct fromto *ft;
    struct timespec start, end;
    double delta;

    clock_gettime(CLOCK_MONOTONIC, &start);

    t = malloc(num * sizeof *t);
    for (i = 0; i < num; i++) {
        ft = malloc(sizeof(struct fromto));
        ft->from = i * (DATANUM/num);
        ft->to = ((i+1) * (DATANUM/num)) > DATANUM ?
                 DATANUM : (i+1) * (DATANUM/num);
        pthread_create(&t[i], NULL, hash_slice, ft);
    }
    for (i = 0; i < num; i++)
        pthread_join(t[i], NULL);

    clock_gettime(CLOCK_MONOTONIC, &end);
    delta = end.tv_sec - start.tv_sec;
    delta += (end.tv_nsec - start.tv_nsec) / 1000000000.0;
    printf("%d threads: %ld hashes/s, total = %.3fs\n",
           num, (unsigned long) (DATANUM / delta), delta);

    free(t);
    sleep(1);
}
Each thread runs the hash_slice() function, which linearly iterates over its slice and calls hash_one(n) for each entry (a sketch of hash_slice() follows the listing below). Using preprocessor macros, I define two versions of hash_one():
void hash_one(int num)
{
    int i;
    unsigned char hash[EVP_MAX_MD_SIZE];
    unsigned int md_len;

#ifdef LOCK_STATIC_EVP_MD_CTX
    static EVP_MD_CTX mdctx;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned char first = 1;

    pthread_mutex_lock(&lock);
    if (first) {
        EVP_MD_CTX_init(&mdctx);
        first = 0;
    }
#else
    EVP_MD_CTX mdctx;
    EVP_MD_CTX_init(&mdctx);
#endif

    /* the actual hashing from above */

#ifdef LOCK_STATIC_EVP_MD_CTX
    pthread_mutex_unlock(&lock);
#endif
    return;
}
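The hash_slice() worker and the fromto descriptor themselves are not shown above. Based on how hash_all() uses them, they could look roughly like this; the field names and the cleanup of the heap-allocated descriptor are my assumptions:

struct fromto {
    int from;
    int to;
};

void *hash_slice(void *arg)
{
    struct fromto *ft = arg;
    int n;

    for (n = ft->from; n < ft->to; n++)
        hash_one(n);

    free(ft); /* assumed: release the descriptor allocated in hash_all() */
    return NULL;
}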
The Makefile produces two binaries:
$ make
gcc -Wall -pthread -lrt -lssl -DLOCK_STATIC_EVP_MD_CTX -o speedtest-locked speedtest.c
gcc -Wall -pthread -lrt -lssl -o speedtest-copied speedtest.c
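For reference, a minimal Makefile that reproduces these two invocations could look like this (my reconstruction; the Makefile in the repository may differ):

CC = gcc
# libraries kept in CFLAGS only to mirror the commands shown above
CFLAGS = -Wall -pthread -lrt -lssl

all: speedtest-locked speedtest-copied

speedtest-locked: speedtest.c
	$(CC) $(CFLAGS) -DLOCK_STATIC_EVP_MD_CTX -o $@ speedtest.c

speedtest-copied: speedtest.c
	$(CC) $(CFLAGS) -o $@ speedtest.c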
The results are as expected. On my Intel i7-2620M (two cores, four hardware threads):
$ ./speedtest-copied
1 threads: 1999113 hashes/s, total = 5.002s
2 threads: 3443722 hashes/s, total = 2.904s
4 threads: 3709510 hashes/s, total = 2.696s
8 threads: 3665865 hashes/s, total = 2.728s
12 threads: 3650451 hashes/s, total = 2.739s
24 threads: 3642619 hashes/s, total = 2.745s
$ ./speedtest-locked
1 threads: 2013590 hashes/s, total = 4.966s
2 threads: 857542 hashes/s, total = 11.661s
4 threads: 631336 hashes/s, total = 15.839s
8 threads: 932238 hashes/s, total = 10.727s
12 threads: 850431 hashes/s, total = 11.759s
24 threads: 802501 hashes/s, total = 12.461s
And on an Intel Xeon X5650 machine with 24 hardware threads:
$ ./speedtest-copied
1 threads: 1564546 hashes/s, total = 6.392s
2 threads: 1973912 hashes/s, total = 5.066s
4 threads: 3821067 hashes/s, total = 2.617s
8 threads: 5096136 hashes/s, total = 1.962s
12 threads: 5849133 hashes/s, total = 1.710s
24 threads: 7467990 hashes/s, total = 1.339s
$ ./speedtest-locked
1 threads: 1481025 hashes/s, total = 6.752s
2 threads: 701797 hashes/s, total = 14.249s
4 threads: 338231 hashes/s, total = 29.566s
8 threads: 318873 hashes/s, total = 31.360s
12 threads: 402054 hashes/s, total = 24.872s
24 threads: 304193 hashes/s, total = 32.874s
So, while the real computation times shrink when you don’t force a bottleneck – yes, it’s an embarrassingly parallel problem – the reverse happens when you force synchronization: all the waiting on the mutex slows the program down so much that you’re better off using just one thread, or else you lose.
Rule of thumb: if you don’t have a good argument for making an application multithreaded, simply don’t put in the extra effort of implementing it in the first place.