By Sergey Nikolenko
This article covers TD-Gammon and Q-learning and their practical application to Atari games.
In 2013, Mnih et al. published a paper where one of the standard methods for reinforcement learning, combined with deep neural networks, was used to play Atari games. TD-learning (temporal difference learning) is commonly used in contexts where the reward represents the outcome of a relatively long sequence of actions, and the problem is to redistribute this single reward among the moves and/or states leading to it. For instance, a game of Go can last a few hundred moves, but the model will only get cheese for winning or an electric shock for losing at the very end, when the players reach a final outcome. Which of those hundreds of moves were good and which were bad? That’s still a big question, even when the outcome is known. It’s quite possible that you were heading for defeat in the middle of the game, but then your opponent blundered, and you wound up winning. Also, trying to artificially introduce intermediate goals, like winning material, is a universally bad idea in reinforcement learning: we have ample evidence that a smart opponent can take advantage of the inevitable “near-sightedness” of such a system.
The main idea of TD-learning is to reuse evaluations of later states, which are closer to the reward, as training targets for earlier states. We can start with random (probably completely ludicrous) evaluations of every position and then, after each game, perform the following process. First, we are absolutely sure about the final result. For instance, we won, and the result is +1 (hooray!). We push our evaluation of the penultimate position toward +1, the evaluation of the third-to-last position toward the freshly updated evaluation of the penultimate one, and so on. Eventually, if you train long enough, you get good evaluations for every position (state).
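To make the idea concrete, here is a minimal sketch of this kind of backward update in Python; the state names, learning rate, and the dictionary-based value table are purely illustrative and not taken from any particular system.

```python
# A minimal sketch of the TD idea described above: after a finished game,
# each position's evaluation is nudged toward the evaluation of the position
# that followed it, and the final position toward the actual game result.
# State names and the game record below are made up for illustration.
from collections import defaultdict

ALPHA = 0.1                 # learning rate: how far we "push" each evaluation
V = defaultdict(float)      # position evaluations, initially 0 (i.e., "ludicrous")

def td_update(game_states, final_reward):
    """game_states: the sequence of positions from one finished game."""
    # Walk the game backwards: the target for each position is the
    # (already updated) evaluation of the position that came after it.
    target = final_reward
    for state in reversed(game_states):
        V[state] += ALPHA * (target - V[state])
        target = V[state]

# Example: a game we won (+1). After many such games, the intermediate
# positions also acquire sensible evaluations.
td_update(["opening", "middlegame", "endgame"], final_reward=+1.0)
```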
This method was first used successfully in TD-Gammon, a computer backgammon program. Backgammon turned out to be easy enough for the computer to master because it is a game played with dice. Since the dice can fall every which way, it wasn’t hard to get training games or a reasonable opponent: you simply have the computer play against itself, and the inherent randomness of backgammon lets the program explore the vast space of possible game states.
TD-Gammon was developed back in the early 1990s; however, even then, a neural network served as its foundation. A game position was the input, and the network predicted an evaluation of that position, i.e., the odds of winning. The computer-versus-computer games produced new training examples for the network, and the network kept learning as it played against itself (or against slightly earlier versions of itself).
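For illustration, here is a rough sketch of such a value network in PyTorch; the 198-feature board encoding and the layer sizes echo common descriptions of TD-Gammon, but they should be read as assumptions rather than a reconstruction of the original program.

```python
# A TD-Gammon-style value network sketch (not Tesauro's original code):
# input is a fixed-size encoding of the board, output is an estimated
# probability of winning. The 198-unit encoding is an illustrative choice.
import torch
import torch.nn as nn

value_net = nn.Sequential(
    nn.Linear(198, 80),   # board features -> hidden layer
    nn.Sigmoid(),
    nn.Linear(80, 1),     # hidden layer -> single evaluation
    nn.Sigmoid(),         # squash to a win probability in (0, 1)
)

board_features = torch.rand(1, 198)        # a made-up encoded position
win_probability = value_net(board_features)
```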
TD-Gammon learned to defeat humans back in the early nineties, but this was attributed to the specific nature of the game, namely the dice. By now we understand that deep learning can help computers win in numerous other games, too, like the Atari games mentioned earlier. The key difference between Mnih et al.’s paper and earlier work on backgammon or chess was that they did not teach the model the rules of an Atari game. All the computer knew was the image on the screen, the same one a human player would see. The only other input was the current score, which had to be supplied externally; otherwise it would be unclear what the objective was. The computer could perform one of the possible joystick actions: moving the joystick and/or pushing the button.
The machine spent roughly 200 games figuring out the objective, another 400 acquiring the necessary skills, and started winning after about 600 games.
Q-learning is used here, too. In exactly the same way, we try to build a model that approximates the Q-function, which estimates how good a given action is in a given state; only now this model is a deep convolutional network. This approach proved to work very well: in 29 games, including wildly popular ones like Space Invaders, Pong, Boxing, and Breakout, the system wound up being better than humans. The DeepMind team responsible for this design is now focusing on games from the 1990s (Doom will probably be their first project). There’s no doubt that they will beat these games in the near future and keep moving forward to more recent releases.
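As a sketch of what a deep convolutional network approximating the Q-function looks like, here is a small PyTorch version; the layer sizes roughly follow the 2013 paper, but the action count, discount factor, and target computation below are illustrative assumptions, not a faithful reimplementation.

```python
# A DQN-style sketch: a convolutional network maps a stack of recent screen
# frames to one Q-value per joystick action. Treat the exact numbers as
# illustrative rather than as the published system.
import torch
import torch.nn as nn

N_ACTIONS = 4      # number of joystick actions (game-dependent, assumed here)
GAMMA = 0.99       # discount factor for future rewards (assumed)

q_net = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4 stacked 84x84 grayscale frames
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=4, stride=2),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 9 * 9, 256),
    nn.ReLU(),
    nn.Linear(256, N_ACTIONS),                   # one Q-value per action
)

def q_learning_target(reward, next_frames, done):
    # Bootstrapped target: immediate reward plus the discounted value of the
    # best action in the next state (zero if the episode ended).
    with torch.no_grad():
        best_next = q_net(next_frames).max(dim=1).values
    return reward + GAMMA * best_next * (1.0 - done)
```

The actual 2013 system also relied on experience replay and frame preprocessing, which this sketch omits.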
Another interesting example of how the Deep Q-Network is used is paraphrasing. You are given a sentence, and you want to write it in a different way while expressing the same meaning. The task is a bit artificial, but it’s very closely linked to text generation in general. In a recently proposed approach, the model contains an LSTM-RNN (Long Short-Term Memory Recurrent Neural Network) that serves as an encoder, condensing the text into a vector. Then this condensed version is “unfolded” into a sentence by a decoder. Since the new sentence is decoded from the condensed form, it will most likely differ from the original in wording. This setup is called the encoder-decoder architecture. Machine translation works in a similar manner: we condense the text in one language and then unfold it into another language with roughly the same kind of model, relying on the encoded version being semantically equivalent. The Deep Q-Network can then iteratively generate candidate sentences from the hidden representation, trying different decoding decisions and gradually moving the resulting sentence closer to the initial one. The model’s behavior is rather intelligent: in the experiments, the DQN first fixes the parts that are already rephrased well and then moves on to the more complex parts where the quality is still poor. In other words, the DQN takes the place of the decoder in this architecture.
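To show the encoder-decoder part of this pipeline concretely, here is a bare-bones seq2seq sketch in PyTorch with a plain greedy decoder; the vocabulary size, dimensions, and token IDs are illustrative assumptions, and this simple decoder is exactly the component the DQN-based approach would replace.

```python
# Encoder-decoder sketch: an LSTM encoder condenses a token sequence into a
# fixed vector (its final hidden state), and an LSTM decoder unfolds that
# vector back into a sequence, one token at a time.
import torch
import torch.nn as nn

VOCAB, EMB, HID = 10_000, 128, 256   # illustrative sizes

embed    = nn.Embedding(VOCAB, EMB)
encoder  = nn.LSTM(EMB, HID, batch_first=True)
decoder  = nn.LSTM(EMB, HID, batch_first=True)
to_vocab = nn.Linear(HID, VOCAB)

def paraphrase(token_ids, max_len=20, bos_id=1):
    # Encode: the final hidden state is the "condensed" version of the sentence.
    _, state = encoder(embed(token_ids))
    # Decode greedily, feeding each predicted token back in as the next input.
    out, tok = [], torch.tensor([[bos_id]])
    for _ in range(max_len):
        dec_out, state = decoder(embed(tok), state)
        tok = to_vocab(dec_out[:, -1]).argmax(dim=-1, keepdim=True)
        out.append(tok.item())
    return out

print(paraphrase(torch.randint(0, VOCAB, (1, 7))))  # untrained, so output is noise
```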
What’s next?
Contemporary neural networks are getting smarter every day. The deep learning revolution occurred in 2005–2006, and since then, interest in this topic has only continued to grow. New research is published every month, if not every week, and interesting new applications of deep neural networks keep cropping up. In this article, which we hope has been sufficiently accessible, we have tried to explain how the deep learning revolution fits into the modern history and development of neural networks, and we went into more detail about reinforcement learning and how deep networks can learn to interact with their environment.
Multiple examples have shown that now, when deep learning is undergoing explosive growth, it’s quite possible to create something new and exciting that will solve real tasks without huge investments. All you need is a modern video card, enthusiasm, and the desire to try new things. Who knows, maybe you’ll be the one to make history during this ongoing revolution — at any rate, it’s worth a try.