It's just that it would take an obscenely long time to get any signal out of the random generations of the LLM, which is why we start it out with pretraining and fine-tuning. But it's good to recognize this, and not get stuck thinking that these models can only get as good as the human training data that went in. That's only true of the pretraining and fine-tuning steps.
To give an example, we could use this reinforcement loop to train an LLM not only to give the right answer to a coding problem, but to do it in the most concise way, or with the shortest execution time, or the smallest memory footprint, etc. Eventually, the LLM would drift away from the "human" way of working through code and invent its own process that humans might not even find recognizable. I'm not suggesting we do this; it's just an example of where things are going, and why DeepSeek is much more interesting than just MoE and distillation.
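To make that concrete, here's a rough sketch of what such a composite reward might look like. Everything here (the function names, the weights, the way efficiency is measured) is made up for illustration, not how any real RL pipeline scores completions: correctness gates the reward, and faster or leaner solutions earn a small bonus on top.

```python
import time
import tracemalloc

def composite_reward(candidate_fn, test_cases, time_weight=0.1, mem_weight=0.05):
    """Hypothetical reward for a generated solution: correctness first,
    with small bonuses for shorter execution time and lower peak memory."""
    passed = 0
    total_time = 0.0
    peak_mem = 0
    for args, expected in test_cases:
        tracemalloc.start()
        start = time.perf_counter()
        try:
            result = candidate_fn(*args)
        except Exception:
            result = None
        total_time += time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peak_mem = max(peak_mem, peak)
        if result == expected:
            passed += 1

    correctness = passed / len(test_cases)
    # Efficiency bonuses only kick in once the answer is actually right,
    # so the model can't game the reward with fast-but-wrong code.
    if correctness < 1.0:
        return correctness
    return (correctness
            + time_weight / (1.0 + total_time)
            + mem_weight / (1.0 + peak_mem / 1024))

# Two candidate "generations" solving the same problem:
def sum_loop(n):
    return sum(range(n + 1))

def sum_formula(n):
    return n * (n + 1) // 2

tests = [((10,), 55), ((1_000_000,), 500000500000)]
print(composite_reward(sum_loop, tests))     # correct, but slower
print(composite_reward(sum_formula, tests))  # correct and faster, higher reward
```

Nothing about this scoring function cares how a human would have written the code, which is exactly why optimizing against it long enough could push the model toward solutions we'd struggle to recognize.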