
Introduction

OpenAI Gym is a platform where you can test your learning algorithms in various applications, including games and virtual physics experiments. It provides a common API for all these applications so that algorithms can be integrated conveniently. The API is called the “environment” in OpenAI Gym. On one hand, the environment only receives “action” instructions as input and outputs the observation, reward, signal of termination, and other information. On the other hand, your learning algorithm receives observations, rewards, and signals of termination as input and outputs the action. So in principle, one can develop a learning algorithm, wrap it into a class object, and test it against all the environments in OpenAI Gym.
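
For illustration, here is a minimal sketch of this environment API (using CartPole-v0, with random actions standing in for a learning algorithm):

import gym

# Minimal sketch of the Gym environment API: send an action in,
# get observation, reward, termination signal, and extra info back.
env = gym.make('CartPole-v0')
observation = env.reset()
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()  # a random action stands in for the learning algorithm
    observation, reward, done, info = env.step(action)
    total_reward += reward
print('Episode reward:', total_reward)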


Because I have already implemented a Deep Q-Learning class to learn Flappy Bird, I thought it would be very convenient to test the Deep Q-Learning algorithm on all these environments in OpenAI Gym.

Make OpenAI Deep Q-Learning Class

The environments in OpenAI Gym can be categorized into two classes regarding the type of their observation output. The video game environments usually output two-dimensional images as observations, and the virtual physics experiments usually output one-dimensional numerical observation data. Therefore, in addition to the existing Deep Q-Learning class for two-dimensional image data, an additional Deep Q-Learning class suitable for learning from one-dimensional data should be prepared for the OpenAI Gym environments.
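
For instance, a quick sketch of how to tell which class an environment needs (CartPole-v0 belongs to the one-dimensional case):

import gym

# Sketch: inspect the observation space to decide which Deep Q-Learning class applies.
env = gym.make('CartPole-v0')
print(env.observation_space.shape)  # (4,): one-dimensional observation data
print(env.action_space.n)           # 2 discrete actions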

Test OpenAI Deep Q-Learning Class in OpenAI Gym CartPole-v0 Environment

The CartPole environment is probably the simplest environment in OpenAI Gym. However, when I was trying to load this environment, there was an issue with the Box2D component. To fix this, please take the following steps. Many thanks to this blogger for the straightforward instructions. This bug might be fixed in a future release of OpenAI Gym, according to someone related to OpenAI.

pip uninstall Box2D box2d-py
git clone https://github.com/pybox2d/pybox2d
cd pybox2d/
python setup.py clean
python setup.py build
python setup.py install

At first, I thought it would be super easy to train the Q-Learning algorithm, given that a similar Q-Learning algorithm did extremely well in the Flappy Bird game after training with 60,000 game frames. However, I was wrong in some respects. With some parameter-setting details from songrotek’s code, I was able to overcome the problems and learned a lot. So I have to thank songrotek here.

Number of Game Frames

When I was implementing the Deep Q-Learning algorithm on the Flappy Bird game, I integrated multiple game frames as input data, because a single game frame cannot fully represent the current state of the game. For example, the moving direction and velocity cannot be told from a single game frame.


Most of the environments in OpenAI Gym lack detailed explanations of the physical meanings of the actions and the observations. (I already complained about this in the forum, but it seems that nobody is responding.) This is also true for the CartPole-v0 environment. So I was not sure whether I could ignore the observations preceding the current observation. In principle, I thought including the preceding observations would not hurt, because even if they are not relevant to the choice of action, the neural network would give them zero weights after sufficient training. However, it turned out that increasing the number of game frames did not help the algorithm learn well. If I set the number of game frames to 1, the algorithm was able to play CartPole very well after 5,000 to 8,000 episodes. If I set it to 4, the algorithm was not able to play, at least within 10,000 episodes. I could set the episode maximum to 100,000 in the future to see whether good learning performance could be achieved, but for this CartPole game, introducing multiple game frames is bad. If I had known the physical meaning of the observation data, I would not even have tried introducing multiple game frames. (It really made me sad when the algorithm did not work at the beginning.)
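
For reference, here is a sketch of how multiple observations could be stacked into a single state; this only illustrates the idea and is not the exact code in my repository:

import numpy as np
from collections import deque

GAME_STATE_FRAMES = 1  # a single frame turned out to work best for CartPole

# Keep a sliding window of the most recent observations and concatenate them
# into one state vector for the Q-network. Call frame_buffer.clear() at the
# start of each episode.
frame_buffer = deque(maxlen=GAME_STATE_FRAMES)

def get_state(observation):
    frame_buffer.append(observation)
    while len(frame_buffer) < GAME_STATE_FRAMES:
        frame_buffer.append(observation)  # pad at the start of an episode
    return np.concatenate(frame_buffer)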

Neural Network

Because the observation of CartPole is only a 4-dimensional vector and the action space of CartPole has only 2 actions, I figured this must be a very simple game. So I used a single fully-connected hidden layer with only 20 units, and it turned out to work just fine. Note that no convolutional neural network is needed in such applications.
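
Here is a minimal sketch of such a Q-network in TensorFlow 1.x (the exact code in my repository may differ): one fully-connected hidden layer with 20 units mapping the 4-dimensional observation to 2 Q-values, one per action.

import tensorflow as tf

# Sketch of the FC-20 Q-network: 4-dimensional CartPole observation in,
# one Q-value per action (2 actions) out. No convolutional layers are involved.
observation_input = tf.placeholder(tf.float32, shape=[None, 4])
hidden = tf.layers.dense(observation_input, units=20, activation=tf.nn.relu)
q_values = tf.layers.dense(hidden, units=2, activation=None)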

Learning Rate

The learning rate is usually the most important parameter for the success of an algorithm in an application. Deep Learning is different from traditional Machine Learning. One may systematically explore almost all the hyperparameters of a traditional Machine Learning task in a short period of time; however, training in Deep Learning usually takes much longer, which makes it much more difficult to tune the hyperparameters with limited computational resources. In this situation, experience, which I lack, becomes very important.


In this CartPole game, I first set the learning rate to 0.0001 in the Adam optimizer and started to observe the loss during training. The loss increased right after the start of training, and the learning performance was extremely poor. So I thought the learning rate was too high. I immediately terminated the program and set the learning rate to smaller numbers. After training with smaller learning rates, say 0.000001, the loss decreased after the start of training, but it stopped decreasing when it reached around 0.4. The learning performance was extremely good in some rare cases; however, most of the time it was extremely poor. I did not understand what was happening at that time. Later, I came to think the optimization was trapped in a local minimum: the learning rate was too small for the optimization to overcome the barriers around the local minimum. To my knowledge, a small learning rate leading to bad optimization outcomes rarely happens in ordinary gradient descent for ordinary machine learning tasks, though it may take a very long time to reach the minimum. I am not sure whether a small learning rate would sometimes prevent the optimization from ever reaching the minimum when we use stochastic gradient descent, as we usually do in Deep Learning tasks.


It turned out that the learning rate of 0.0001 was the right one to use in the CartPole game. The loss first increased and then decreased, and the algorithm was able to play CartPole very well after 5,000 to 8,000 episodes of training.
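
For completeness, here is a sketch of the training objective with the Adam optimizer at this learning rate. The FC-20 network is repeated so that the snippet is self-contained, and the Bellman targets reward + GAMMA * max Q(next state) are assumed to be computed from the replay minibatch outside this graph:

import tensorflow as tf

# Sketch: mean-squared Bellman error minimized with Adam at learning rate 0.0001.
observation_input = tf.placeholder(tf.float32, shape=[None, 4])
hidden = tf.layers.dense(observation_input, units=20, activation=tf.nn.relu)
q_values = tf.layers.dense(hidden, units=2, activation=None)

action_input = tf.placeholder(tf.float32, shape=[None, 2])  # one-hot encoded actions
target_q = tf.placeholder(tf.float32, shape=[None])         # Bellman targets from the minibatch
q_for_taken_action = tf.reduce_sum(q_values * action_input, axis=1)
loss = tf.reduce_mean(tf.square(target_q - q_for_taken_action))
train_op = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(loss)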

Key Parameters

FC-20

GAME_STATE_FRAMES = 1  # number of game state frames used as input
GAMMA = 0.9 # decay rate of past observations
EPSILON_INITIALIZED = 0.5 # probability epsilon used to determine random actions
EPSILON_FINAL = 0.01 # final epsilon after decay
BATCH_SIZE = 32 # number of sample size in one minibatch
LEARNING_RATE = 0.0001 # learning rate in deep learning
FRAME_PER_ACTION = 1 # number of frames per action
REPLAYS_SIZE = 1000 # maximum number of replays in cache
TRAINING_DELAY = 1000 # time steps before starting training for the purpose of collecting sufficient replays to initialize training
EXPLORATION_TIME = 10000 # time steps used for decaying epsilon during training before epsilon decreases to zero
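
For illustration, the epsilon-greedy exploration schedule implied by these parameters could look like the following sketch (a linear decay; the exact schedule in my repository may differ):

EPSILON_INITIALIZED = 0.5
EPSILON_FINAL = 0.01
TRAINING_DELAY = 1000
EXPLORATION_TIME = 10000

def epsilon_at(time_step):
    # Sketch of a linear epsilon decay consistent with the parameters above:
    # keep epsilon fixed while collecting replays, then decay it to EPSILON_FINAL
    # over EXPLORATION_TIME time steps.
    if time_step < TRAINING_DELAY:
        return EPSILON_INITIALIZED
    progress = min(1.0, (time_step - TRAINING_DELAY) / float(EXPLORATION_TIME))
    return EPSILON_INITIALIZED + progress * (EPSILON_FINAL - EPSILON_INITIALIZED)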

Algorithm Performance

Before Training:

After Training:

OpenAI Gym Evaluation


Solved after 9919 episodes. The best 100-episode average reward was 200.00 ± 0.00.


https://gym.openai.com/evaluations/eval_ewr0DWHeTmGE6x1NGQ1LiQ

Conclusions

Deep Q-Learning is a good technique to solve the CartPole problem. However, it seems to suffer from high variance, and its convergence seems to be slow.

Links to GitHub


https://github.com/leimao/OpenAI_Gym_AI/tree/master/CartPole-v0/Deep_Q-Learning/2017-04-28-v1

Follow-up Optimizations

I used a single fully-connected hidden layer with only 20 units in the first implementation. I found that increasing the depth and size of the neural network, together with increasing the batch size for stochastic gradient descent, could improve the learning efficiency and the robustness of performance. Personally, I think the deeper and larger neural network helped to improve the robustness of performance, and the larger batch size helped to prevent random sampling bias and optimization bias during stochastic gradient descent. As a result, the learning became faster, and the robustness of the learning performance improved.
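
Assuming "FC-128 -> FC-128" in the parameter listings below denotes two fully-connected hidden layers with 128 units each, the larger network is a straightforward extension of the FC-20 sketch above:

import tensorflow as tf

# Sketch of the deeper FC-128 -> FC-128 Q-network used in the follow-up runs.
observation_input = tf.placeholder(tf.float32, shape=[None, 4])
hidden_1 = tf.layers.dense(observation_input, units=128, activation=tf.nn.relu)
hidden_2 = tf.layers.dense(hidden_1, units=128, activation=tf.nn.relu)
q_values = tf.layers.dense(hidden_2, units=2, activation=None)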

2017-04-29-v1

Parameters


FC-128 -> FC-128

GAME_STATE_FRAMES = 1  # number of game state frames used as input
GAMMA = 0.95 # decay rate of past observations
EPSILON_INITIALIZED = 0.5 # probability epsilon used to determine random actions
EPSILON_FINAL = 0.0001 # final epsilon after decay
BATCH_SIZE = 128 # number of sample size in one minibatch
LEARNING_RATE = 0.0005 # learning rate in deep learning
FRAME_PER_ACTION = 1 # number of frames per action
REPLAYS_SIZE = 2000 # maximum number of replays in cache
TRAINING_DELAY = 2000 # time steps before starting training for the purpose of collecting sufficient replays to initialize training
EXPLORATION_TIME = 10000 # time steps used for decaying epsilon during training before epsilon decreases to zero

OpenAI Gym Evaluation


Solved after 293 episodes. The best 100-episode average reward was 197.39 ± 1.68.


https://gym.openai.com/evaluations/eval_Jr2oXkrS8KMUQEkCBurAw


Links to GitHub


https://github.com/leimao/OpenAI_Gym_AI/tree/master/CartPole-v0/Deep_Q-Learning/2017-04-29-v1

2017-04-29-v2

Parameters


FC-128 -> FC-128

GAME_STATE_FRAMES = 1  # number of game state frames used as input
GAMMA = 0.95 # decay rate of past observations
EPSILON_INITIALIZED = 0.5 # probability epsilon used to determine random actions
EPSILON_FINAL = 0.0005 # final epsilon after decay
BATCH_SIZE = 128 # number of sample size in one minibatch
LEARNING_RATE = 0.0005 # learning rate in deep learning
FRAME_PER_ACTION = 1 # number of frames per action
REPLAYS_SIZE = 5000 # maximum number of replays in cache
TRAINING_DELAY = 1000 # time steps before starting training for the purpose of collecting sufficient replays to initialize training
EXPLORATION_TIME = 10000 # time steps used for decaying epsilon during training before epsilon decreases to zero

OpenAI Gym Evaluation


Solved after 138 episodes. The best 100-episode average reward was 196.58 ± 1.34.


https://gym.openai.com/evaluations/eval_F90GxQxrQK2J6ESQkLVaA


Links to GitHub


https://github.com/leimao/OpenAI_Gym_AI/tree/master/CartPole-v0/Deep_Q-Learning/2017-04-29-v2

Notes

2017-4-28

When I was training the algorithm, I found that if it was trained for a sufficiently long time, the learning performance would fluctuate. Say, the learning performance reached the maximum at episode 5000 and stayed there for 300 episodes. Then it dropped significantly. After training for some more time, it reached the maximum again for another while. This phenomenon repeated throughout the training. From my point of view, the optimization might have deviated from the optimum, because I could often see some large loss values even in the later stages of training. Is it because the learning rate is sometimes too big and causes the optimization to jump out of the optimum, or is it often just not possible to train a Deep Q-Learning algorithm to a perfect solution, or is the neural network simply not sophisticated enough? I am not able to answer this question with my current knowledge.


I was also surprised that, counting game frames, it took nearly 1,000,000 game frames to reach good performance. Recall that a similar algorithm only took 600,000 game frames to achieve extremely good performance in the Flappy Bird game.

2017-4-28

Specifically for this problem in OpenAI Gym, to achieve both learning efficiency and performance robustness, I think learning rate decay might be a good strategy. I may try it if I have a chance in the future.
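
For example, TensorFlow 1.x provides tf.train.exponential_decay for this purpose; the decay parameters in the sketch below are made-up placeholders, not tuned values:

import tensorflow as tf

# Sketch: exponentially decay the learning rate as training proceeds.
# decay_steps and decay_rate here are placeholder values for illustration.
global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(
    learning_rate=0.0005, global_step=global_step,
    decay_steps=10000, decay_rate=0.5, staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
# optimizer.minimize(loss, global_step=global_step) would then be used during training.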


I also found that, compared to Q-Learning, Policy Gradient might work better. I may implement this algorithm in the future.


https://github.com/lancerts/Reinforcement-Learning

https://gym.openai.com/evaluations/eval_9niu4HNZTgm0VLJ0b8MUtA