Just a thought, and excuse me if it has already been done: memory sizes on ASICs are growing (as they are on FPGAs and GPUs), and logic transistors are extremely cheap. The next hard limit seems to be memory bandwidth. Has any project investigated mining that works by running an algorithm over the entire binary content of the last N blocks, where N is adjusted for difficulty, and where the input blocks are rearranged or reprocessed for every mining iteration so that partial results from the previous iteration cannot be reused?
An example:
Let's say mining the next block requires running an algorithm over the last N = 10000 blocks (that's all those gigabytes of data arranged in a flat buffer) after the blocks are prepared. For preparation, the blocks are first shuffled (so their arrangement is not sequential) in a way determined by the nonce, and an expensive, uncommon cipher like Blowfish is used to encrypt the whole buffer in CBC mode. The goal of the mining process is then to find a PoW hash of the resulting buffer with a suitable algorithm (maybe even SHA256 is good enough?), where the nonce being mined is the seed of the PRNG that drives the shuffle.
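A minimal sketch of what one iteration of that might look like, just to make the idea concrete. It assumes the last-N blocks are already loaded in memory as a list of byte strings, uses pycryptodome's Blowfish for the CBC step, and the names BLOWFISH_KEY, IV and difficulty_target are placeholders I made up, not part of any existing protocol:

```python
import hashlib
import random
from Crypto.Cipher import Blowfish  # pycryptodome

BLOWFISH_KEY = b"example-key-16by"   # hypothetical fixed key, derived elsewhere in a real scheme
IV = bytes(8)                        # Blowfish has an 8-byte block size

def mining_hash(blocks, nonce):
    """Shuffle the last-N blocks with a nonce-seeded PRNG, encrypt the
    whole flat buffer with Blowfish-CBC, then hash the result."""
    order = list(range(len(blocks)))
    random.Random(nonce).shuffle(order)          # the nonce seeds the shuffle
    buf = b"".join(blocks[i] for i in order)     # flat, non-sequential buffer
    buf += bytes((-len(buf)) % Blowfish.block_size)  # pad to the 8-byte block size
    cipher = Blowfish.new(BLOWFISH_KEY, Blowfish.MODE_CBC, iv=IV)
    return hashlib.sha256(cipher.encrypt(buf)).digest()

def mine(blocks, difficulty_target):
    nonce = 0
    while True:
        h = mining_hash(blocks, nonce)
        if int.from_bytes(h, "big") < difficulty_target:
            return nonce, h
        nonce += 1
```

The point is that the CBC chaining and the nonce-dependent shuffle force every iteration to stream the whole multi-gigabyte buffer again, so nothing from the previous nonce can be cached or reused.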
Just a thought experiment, but it seems like something like that would make memory bandwidth the bottleneck.