Thanks. I hope some of it was beneficial. The devil is usually in the details on projects like this, so the sooner you can confront potential traps, the better.
Re: matrices versus vectors: I tend to use the term "matrix" more because I keep the numbers in that form until a matrix multiplication is required. At that point I unroll the matrix into a vector, and roll it back afterward as needed, so that I can use the optimized multiplication algorithm.
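To make the unroll/roll idea concrete, here is a minimal NumPy sketch. The helper names `unroll` and `roll`, the 3x4 input, and the all-ones weight matrix are made up for illustration, not taken from any particular library or from your project.

```python
import numpy as np

def unroll(matrix):
    """Flatten a 2-D matrix into a 1-D vector (row-major order)."""
    return matrix.reshape(-1)

def roll(vector, shape):
    """Restore a 1-D vector to the given 2-D shape."""
    return vector.reshape(shape)

# Hypothetical 3x4 "image" and a dense layer mapping 12 inputs to 2 outputs.
image = np.arange(12, dtype=float).reshape(3, 4)
weights = np.ones((2, image.size))

flat = unroll(image)                 # shape (12,)
output = weights @ flat              # optimized matrix-vector product, shape (2,)
restored = roll(flat, image.shape)   # back to shape (3, 4)
```

Since `reshape` normally returns a view rather than a copy for contiguous arrays, going back and forth like this costs very little.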
Re: colors: If the inputs to the CNN are photos of real-world objects, I would go with the multi-channel approach, because it is the closest to the biology that inspired it. We have separate cones in our eyes for reds, greens, and blues, while other living organisms have even more diversity, adding cones for UV light and polarized light (useful for seeing underwater). If the retinal neural cells combined the inputs from the different cones, that would be one thing, but I don't think they do. That said, I think the situation changes if the inputs to the CNN are fMRI images. Don't they basically use false colors to indicate some level of intensity? If that's the case, I would expect grayscale encoding to work.
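In case it helps, here is a rough sketch of what the two input encodings look like as arrays. The 4x4 all-red "photo" is a made-up example, and the grayscale conversion uses the common BT.601 luma weights, which is just one reasonable choice and an assumption on my part.

```python
import numpy as np

# Hypothetical 4x4 photo, channels last: pure red everywhere.
rgb = np.zeros((4, 4, 3))
rgb[..., 0] = 1.0

# BT.601 luma weights for R, G, B (an assumed choice of conversion).
LUMA = np.array([0.299, 0.587, 0.114])
gray = rgb @ LUMA                              # shape (4, 4)

multi_channel_input = rgb                      # shape (4, 4, 3): one plane per color
single_channel_input = gray[..., np.newaxis]   # shape (4, 4, 1): intensity only
```

If the fMRI colors really are just a colormap applied to one intensity value, the three channels carry no more information than the single grayscale plane.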
Re: temporal characteristic of fMRI data: I was just going by personal experience. I underwent a brain MRI scan once, and it took 10 minutes. But the stats on the Wikipedia page you cited don't give me much confidence. I suspect that current techniques miss a large share of neural activations.
Re: filtering out neural activations in the fMRI image that represent undesired sensory inputs: I understand your response, but just be aware that this might be a matter of relevance, not noise. If the unwanted signal follows a particular pattern, the CNN may wrongly incorporate that pattern into its predictions. Think of the old experiment that tried to distinguish tanks from cars, and failed, because all the photos of cars were taken on sunny days, and all the photos of tanks were taken on cloudy days. The network in that experiment started focusing on the sky color as a feature, rather than treating it as irrelevant data. It wasn't noise in the classic sense (from information theory and communication channels), because it was deterministic and not random (although that was an unfortunate coincidence). From another perspective, this problem seems a lot like an NLP experiment that attempts to separate simultaneous conversations at a party, where a number of microphones scattered among the crowd are recording voices (the classic cocktail party problem).
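Here is a toy illustration of the tank/car trap, as a sketch only. Everything in it is synthetic: the "shape" and "sky" features, the sample size, and the plain least-squares classifier are all assumptions I picked to keep it short, not a reconstruction of the original experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

def make_split(confounded):
    """Build a synthetic split; every feature here is made up for illustration."""
    labels = rng.integers(0, 2, size=n)                  # 0 = car, 1 = tank
    shape = labels + rng.normal(0.0, 1.0, size=n)        # noisy but genuinely relevant
    if confounded:
        sky = labels.astype(float)                       # cloudy (1) whenever it is a tank
    else:
        sky = rng.integers(0, 2, size=n).astype(float)   # weather unrelated to the label
    X = np.column_stack([shape, sky, np.ones(n)])        # last column is the bias term
    return X, labels

X_train, y_train = make_split(confounded=True)
X_test, y_test = make_split(confounded=False)

# Fit a linear classifier by ordinary least squares on the confounded data.
w, *_ = np.linalg.lstsq(X_train, y_train.astype(float), rcond=None)

def accuracy(X, y):
    """Fraction of samples where the thresholded linear score matches the label."""
    return float(np.mean((X @ w > 0.5) == y))

train_acc = accuracy(X_train, y_train)
test_acc = accuracy(X_test, y_test)
print("weights [shape, sky, bias]:", np.round(w, 3))
print("train accuracy:", train_acc, "test accuracy:", test_acc)
```

The fit puts essentially all of its weight on the sky feature because it explains the training labels perfectly, so the model looks flawless in training and falls to roughly chance once the weather no longer tracks the label.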
Incidentally, a Kaggle competition from a while ago used brain fMRI data. It might be useful to track down the winner of that competition and ask about the approach he or she took.