I am working on improving the nwipe tool, specifically by implementing an AES-CTR PRNG using AES-128 in counter mode to generate high-quality random numbers for securely wiping HDDs and SSDs. The original implementation runs on a single core, and I am trying to parallelize it using pthreads to utilize all available CPU cores. However, my attempt at parallelization has resulted in a significant performance drop, and I'm seeking advice on how to correct this.
Here is the single-threaded implementation, which works correctly but uses only one core:
int nwipe_aes_ctr_prng_read(NWIPE_PRNG_READ_SIGNATURE) {
    u8* restrict bufpos = buffer;
    size_t words = count / SIZE_OF_AES_CTR_PRNG;

    for(size_t ii = 0; ii < words; ++ii) {
        aes_ctr_prng_genrand_uint128_to_buf((aes_ctr_state_t*) *state, bufpos);
        bufpos += 16; // Move to the next block
    }

    // Handle remaining bytes if count is not a multiple of SIZE_OF_AES_CTR_PRNG
    const size_t remain = count % SIZE_OF_AES_CTR_PRNG;
    if(remain > 0) {
        unsigned char temp_output[16]; // Temporary buffer for the last block
        aes_ctr_prng_genrand_uint128_to_buf((aes_ctr_state_t*) *state, temp_output);
        memcpy(bufpos, temp_output, remain);
    }

    return 0; // Success
}
My attempt to parallelize it with pthreads is below. With this version, throughput dropped from 200 MB/s to around 15 MB/s:
typedef struct {
    aes_ctr_state_t* state;
    u8* buffer;
    size_t start;
    size_t end;
} prng_thread_arg_t;

void* nwipe_aes_ctr_prng_read_thread(void* arg) {
    prng_thread_arg_t* thread_arg = (prng_thread_arg_t*)arg;
    aes_ctr_state_t* state = thread_arg->state;
    u8* buffer = thread_arg->buffer + thread_arg->start;
    size_t words = (thread_arg->end - thread_arg->start) / SIZE_OF_AES_CTR_PRNG;

    for(size_t ii = 0; ii < words; ++ii) {
        aes_ctr_prng_genrand_uint128_to_buf(state, buffer);
        buffer += SIZE_OF_AES_CTR_PRNG;
    }

    return NULL;
}

int nwipe_aes_ctr_prng_read(NWIPE_PRNG_READ_SIGNATURE) {
    int num_threads = 8; // Adjustable based on requirements
    pthread_t threads[num_threads];
    prng_thread_arg_t thread_args[num_threads];
    size_t total_words = count / SIZE_OF_AES_CTR_PRNG;
    size_t words_per_thread = total_words / num_threads;

    for(int i = 0; i < num_threads; i++) {
        size_t start = i * words_per_thread * SIZE_OF_AES_CTR_PRNG;
        size_t end = (i + 1) * words_per_thread * SIZE_OF_AES_CTR_PRNG;
        if(i == num_threads - 1) {
            end = total_words * SIZE_OF_AES_CTR_PRNG; // Correct end calculation
        }
        thread_args[i].state = (aes_ctr_state_t*)*state;
        thread_args[i].buffer = buffer;
        thread_args[i].start = start;
        thread_args[i].end = end;
        pthread_create(&threads[i], NULL, nwipe_aes_ctr_prng_read_thread, &thread_args[i]);
    }

    for(int i = 0; i < num_threads; i++) {
        pthread_join(threads[i], NULL);
    }

    // Remaining bytes handling omitted for brevity
    return 0;
}
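The thread count above is hard-coded to 8. Since the goal is to use all available cores, the count could instead be queried at run time. A minimal sketch, assuming a Linux/glibc-style system where `sysconf(_SC_NPROCESSORS_ONLN)` is available (it is a common extension rather than a guaranteed POSIX name), and using the hypothetical helper name `online_cpu_count`:

```c
#include <unistd.h>

/* Hypothetical helper: number of CPUs currently online.
 * Falls back to 1 if the query is unsupported or fails. */
long online_cpu_count(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}
```

The result would replace the literal `8` in `num_threads`; whether using every core actually helps here still has to be measured.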
Both versions use the following function to generate the random numbers:
void aes_ctr_prng_genrand_uint128_to_buf(aes_ctr_state_t* state, unsigned char* bufpos) {
    CRYPTO_ctr128_encrypt(bufpos, bufpos, 16, &state->aes_key, state->ivec,
                          state->ecount, &state->num, (block128_f) AES_encrypt);
    next_state(state);
}
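For context on what a parallel version would have to preserve: in the threaded code above, every thread receives the same `aes_ctr_state_t*`, so all of them update one counter concurrently without synchronization. Counter mode itself allows block i to be computed directly from counter + i, so each thread could instead be given a private starting counter. The sketch below illustrates only that partitioning idea. It is not nwipe code: `toy_block` is a hypothetical, non-cryptographic stand-in for the AES block call, and `job_t`, `fill_range`, `fill_parallel`, and `MAX_THREADS` are names invented for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16
#define MAX_THREADS 64

/* Hypothetical stand-in for encrypting one counter block.
 * NOT cryptographic; it only makes output depend on (key, counter). */
void toy_block(uint64_t key, uint64_t counter, unsigned char out[BLOCK_SIZE]) {
    for (uint64_t half = 0; half < 2; ++half) {
        uint64_t z = (counter * 2 + half) ^ key;
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
        z ^= z >> 31;
        memcpy(out + 8 * half, &z, sizeof z);
    }
}

/* Fill `blocks` consecutive blocks, starting at counter `first_counter`. */
void fill_range(uint64_t key, uint64_t first_counter,
                unsigned char *dst, size_t blocks) {
    for (size_t i = 0; i < blocks; ++i)
        toy_block(key, first_counter + i, dst + i * BLOCK_SIZE);
}

/* Per-thread job: a private counter start and a private output slice. */
typedef struct {
    uint64_t key;
    uint64_t first_counter;
    unsigned char *dst;
    size_t blocks;
} job_t;

static void *worker(void *arg) {
    job_t *job = arg;
    fill_range(job->key, job->first_counter, job->dst, job->blocks);
    return NULL;
}

/* Split the buffer into contiguous slices, one per thread.
 * Returns 0 on success, -1 if a thread could not be created. */
int fill_parallel(uint64_t key, uint64_t counter, unsigned char *buf,
                  size_t blocks, int nthreads) {
    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    pthread_t tid[MAX_THREADS];
    job_t jobs[MAX_THREADS];
    size_t per = blocks / (size_t)nthreads;
    int started = 0, rc = 0;

    for (int i = 0; i < nthreads; ++i) {
        size_t begin = (size_t)i * per;
        /* The last thread also takes the blocks left over by the division. */
        size_t n = (i == nthreads - 1) ? blocks - begin : per;
        jobs[i] = (job_t){ key, counter + begin, buf + begin * BLOCK_SIZE, n };
        if (pthread_create(&tid[i], NULL, worker, &jobs[i]) != 0) {
            rc = -1;
            break;
        }
        ++started;
    }
    for (int i = 0; i < started; ++i)
        pthread_join(tid[i], NULL);
    return rc;
}
```

With this layout, `fill_parallel` produces the same bytes as a single `fill_range` call over the whole buffer, because no state is shared between workers. The sketch still creates and joins threads on every call, which may itself be expensive when each call fills only a small buffer.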
Question: What could be the cause of the performance drop when parallelizing with pthreads, and how can I efficiently use all cores for the AES-CTR PRNG implementation?
I appreciate any insights or suggestions you may have. Thank you!