I've been working on the volunteer project described here: https://www.torproject.org/getinvolved/volunteer.html.en#useMoreCores Unfortunately, I can't spend much more time on it.
So far, I have refactored circuit_receive_relay_cell() in relay.c (which calls relay_crypt() and eventually the AES crypt routines) to use the workqueue.c infrastructure, in the same way cpuworker.c does.
When the refactored code runs in single-threaded mode, everything seems fine. Once I activate the thread pool and start sending it work with threadpool_queue_work(), Tor bootstraps to 100% and runs for several minutes before crashing on cells it doesn't handle properly.
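For anyone unfamiliar with the pattern, here is a minimal standalone sketch of the work/reply hand-off that this kind of refactoring relies on. It is not the refactored Tor code and does not use the workqueue.c API: job_t, job_queue_t, worker_main() and run_demo() are hypothetical names, it uses plain pthreads, and the XOR is only a placeholder for the real cipher.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define PAYLOAD_LEN 8

/* Hypothetical stand-in for one queued cell; this is not Tor's cell_t. */
typedef struct job_t {
  struct job_t *next;
  int is_stop;                    /* sentinel that tells the worker to exit */
  uint8_t payload[PAYLOAD_LEN];
} job_t;

/* A FIFO protected by a mutex, with a condition variable for waiters. */
typedef struct {
  job_t *head;
  job_t *tail;
  pthread_mutex_t lock;
  pthread_cond_t nonempty;
} job_queue_t;

static job_queue_t work_q, reply_q;

static void queue_init(job_queue_t *q) {
  q->head = q->tail = NULL;
  pthread_mutex_init(&q->lock, NULL);
  pthread_cond_init(&q->nonempty, NULL);
}

static void queue_push(job_queue_t *q, job_t *job) {
  assert(job != NULL);
  job->next = NULL;
  pthread_mutex_lock(&q->lock);
  if (q->tail) q->tail->next = job; else q->head = job;
  q->tail = job;
  pthread_cond_signal(&q->nonempty);
  pthread_mutex_unlock(&q->lock);
}

/* Blocks until a job is available, then removes and returns it. */
static job_t *queue_pop(job_queue_t *q) {
  pthread_mutex_lock(&q->lock);
  while (!q->head)
    pthread_cond_wait(&q->nonempty, &q->lock);
  job_t *job = q->head;
  q->head = job->next;
  if (!q->head) q->tail = NULL;
  pthread_mutex_unlock(&q->lock);
  return job;
}

/* Worker thread: take a job, transform it, hand it back as a reply. */
static void *worker_main(void *arg) {
  (void)arg;
  for (;;) {
    job_t *job = queue_pop(&work_q);
    int stop = job->is_stop;      /* read before giving the job away */
    if (!stop)
      for (size_t i = 0; i < PAYLOAD_LEN; i++)
        job->payload[i] ^= 0x5a;  /* placeholder for the real crypto */
    queue_push(&reply_q, job);
    if (stop) return NULL;
  }
}

/* Queues a few jobs, collects the replies on the calling thread, and
 * returns how many came back in order and correctly transformed
 * (N_JOBS on success, -1 if the thread could not be started). */
int run_demo(void) {
  enum { N_JOBS = 4 };
  job_t jobs[N_JOBS + 1] = {0};
  pthread_t tid;
  int ok = 0;

  queue_init(&work_q);
  queue_init(&reply_q);
  if (pthread_create(&tid, NULL, worker_main, NULL) != 0) return -1;

  for (int j = 0; j < N_JOBS; j++) {
    for (size_t i = 0; i < PAYLOAD_LEN; i++) jobs[j].payload[i] = (uint8_t)j;
    queue_push(&work_q, &jobs[j]);
  }
  jobs[N_JOBS].is_stop = 1;
  queue_push(&work_q, &jobs[N_JOBS]);

  for (int k = 0; ; k++) {
    job_t *reply = queue_pop(&reply_q);
    if (reply->is_stop) break;
    if (reply == &jobs[k] && reply->payload[0] == (uint8_t)(k ^ 0x5a)) ok++;
  }
  pthread_join(tid, NULL);
  return ok;
}
```

Call run_demo() from main() to try it. The sketch uses a single worker, so replies come back in the order the jobs were queued; it makes no claim about how a multi-threaded pool behaves.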
I'm happy to share my code with anyone interested.
Thanks.