It is entirely possible to build a workstation with 44 cores, which means this blockchain should be able to process 8.8M transfers per second, provided throughput scales linearly with the number of cores.
Porting the processing to GPU threads might be even more practical, although I don't know whether there are any obstacles.
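The 44-core figure implies a specific per-core rate. Working backwards from the numbers in the text, under the same linear-scaling assumption:

```latex
\text{per-core rate} = \frac{8.8 \times 10^{6}\ \text{transfers/s}}{44\ \text{cores}} = 2 \times 10^{5}\ \text{transfers/s per core}
```

Linear scaling only holds if the cores rarely have to wait on one another, which is what the requirements on independent transactions and few sequential steps are meant to guarantee.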
A blockchain that aims to scale must:
- perform a small number of well-defined tasks
- operate on a protocol-defined state
- change minimally in function over time
- ensure that all transactions in a block are independent of one another
- minimize the number of sequential steps per block
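The last two requirements can be sketched in Python. This is a minimal illustration under stated assumptions, not this blockchain's implementation: `Transfer`, `accounts_touched`, `all_independent` and `schedule_waves` are hypothetical names, and a transfer is assumed to read and write only its sender and receiver accounts. Two transfers count as independent when they share no account. `schedule_waves` greedily groups a block into waves of mutually independent transfers. Transfers within a wave could run in parallel, and waves run one after another, so the number of waves is the number of sequential steps for the block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    """Hypothetical minimal transfer: moves `amount` from sender to receiver."""
    sender: str
    receiver: str
    amount: int


def accounts_touched(tx: Transfer) -> set[str]:
    """Accounts whose state this transfer reads or writes (assumed: only these two)."""
    return {tx.sender, tx.receiver}


def all_independent(txs: list[Transfer]) -> bool:
    """True if no two transfers in `txs` touch the same account."""
    seen: set[str] = set()
    for tx in txs:
        touched = accounts_touched(tx)
        if seen & touched:
            return False
        seen |= touched
    return True


def schedule_waves(txs: list[Transfer]) -> list[list[Transfer]]:
    """Greedily group transfers into waves of mutually independent transfers.

    Waves run one after another, and transfers within a wave could run in
    parallel. Each transfer goes into the earliest wave that comes after
    every earlier transfer sharing one of its accounts.
    """
    waves: list[list[Transfer]] = []
    last_wave: dict[str, int] = {}  # account -> latest wave that touched it
    for tx in txs:
        touched = accounts_touched(tx)
        target = max((last_wave[a] + 1 for a in touched if a in last_wave), default=0)
        if target == len(waves):
            waves.append([])
        waves[target].append(tx)
        for a in touched:
            last_wave[a] = target
    return waves


if __name__ == "__main__":
    block = [
        Transfer("alice", "bob", 5),
        Transfer("carol", "dave", 3),
        Transfer("bob", "erin", 2),  # shares "bob" with the first transfer
        Transfer("frank", "gina", 1),
    ]
    waves = schedule_waves(block)
    print(all_independent(block))  # False
    print(len(waves))  # 2 sequential steps instead of 4
    print([len(w) for w in waves])  # [3, 1]
```

Transfers that share an account keep their block order across waves, so running the waves in sequence gives the same result as running the block one transfer at a time. A block whose transfers are all independent collapses to a single wave.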
I'd also add an encoding scheme (in effect, custom-made compression) to minimize the number of bytes used. The fewer bytes there are, the easier it is to fit the data into faster storage media. It may not be a problem now, but it will become one over time for a blockchain that processes a lot of data.
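The comment does not specify a concrete scheme. As one possible stand-in, chosen purely for illustration, the Python sketch below implements the unsigned base-128 varint (LEB128) used by formats such as Protocol Buffers. Each byte carries 7 data bits, so small integers such as typical amounts or IDs take one or two bytes instead of a fixed eight. `encode_varint` and `decode_varint` are hypothetical helper names.

```python
import struct


def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as an unsigned LEB128 varint.

    Each byte carries 7 data bits (least significant group first);
    the high bit is set when more bytes follow.
    """
    if n < 0:
        raise ValueError("unsigned varint requires n >= 0")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)  # final byte, high bit clear
            return bytes(out)


def decode_varint(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode one varint starting at `offset`; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        if offset >= len(data):
            raise ValueError("truncated varint")
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


if __name__ == "__main__":
    amount = 300
    fixed = struct.pack("<Q", amount)  # fixed-width 64-bit little-endian
    compact = encode_varint(amount)
    print(len(fixed), len(compact))  # 8 2
    print(decode_varint(compact))  # (300, 2)
```

The saving depends on the value distribution. Values below 128 take one byte, but values near 2**64 take ten bytes, so a varint only pays off when most encoded numbers are small.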