This project trains and applies RL methods to solve a grid world environment; the agent should learn to navigate to the goal efficiently.
Environment
The environment is a 4x4 grid in which the agent must reach the goal state from the initial state. The same grid layout is used for both the deterministic and stochastic environments.
| Component | Description |
|---|---|
| Action | The agent has 4 possible actions in any state: up=0, down=1, right=2, left=3 |
| State | The 4x4 grid gives 16 states |
| Rewards | Four intermediate rewards plus the goal reward are defined: (2,0): -3, (1,2): -4, (1,0): 2, (3,1): 5, and the goal (3,3): 20, which is the highest reward in the grid |
| Objective | Reach the goal state |
- Deterministic: In a deterministic environment, the next state is fully determined by the current state and action; there is no randomness or probability in the transitions.
- Stochastic: Stochasticity is introduced in the step function by choosing between executing the given action and taking a random action. The given action is executed with probability 0.6 and a random action with probability 0.4 (see the sketch after this list).
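The following is a minimal sketch of such an environment, not the project's actual code: the class name `GridWorld`, the (row, col) coordinate convention, episode termination at the goal (3,3), and consuming each reward cell once are all assumptions made for illustration.

```python
import random

class GridWorld:
    """Sketch of the 4x4 grid environment described above (illustrative, assumed details)."""

    # up=0, down=1, right=2, left=3, expressed as (row, col) offsets
    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, 1), 3: (0, -1)}

    def __init__(self, stochastic=False):
        self.stochastic = stochastic
        self.reset()

    def reset(self):
        self.pos = (0, 0)  # initial state
        # reward cells and values from the table above; the goal (3,3) has the highest reward
        self.rewards = {(2, 0): -3, (1, 2): -4, (1, 0): 2, (3, 1): 5, (3, 3): 20}
        return self.pos

    def step(self, action):
        # Stochastic variant: execute the given action with probability 0.6,
        # otherwise (probability 0.4) replace it with a random action.
        if self.stochastic and random.random() < 0.4:
            action = random.randrange(4)
        dr, dc = self.ACTIONS[action]
        r = min(max(self.pos[0] + dr, 0), 3)  # clip to the 4x4 grid
        c = min(max(self.pos[1] + dc, 0), 3)
        self.pos = (r, c)
        reward = self.rewards.pop(self.pos, 0)  # collect the cell's reward once (assumption)
        done = self.pos == (3, 3)               # episode ends at the goal state (assumption)
        return self.pos, reward, done
```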
Algorithms
- SARSA
- Q-Learning
For the defined deterministic and stochastic environments, both Q-Learning and SARSA are used to solve the task. The environment is a 4x4 grid world where the agent starts at the initial position (0,0) and must reach the goal state (3,3), collecting rewards along the way. Since both negative and positive rewards are defined in the grid, the agent should avoid the negative-reward cells, collect the positive rewards, and reach the goal state.
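The two algorithms differ only in how the bootstrap target is formed: SARSA (on-policy) uses the action actually taken next, while Q-Learning (off-policy) uses the greedy maximum over the next state's actions. Below is a minimal training sketch under assumed details: it reuses the hypothetical `GridWorld` class above, and the function names (`epsilon_greedy`, `train`) and hyperparameter values (alpha, gamma, epsilon, episode count) are illustrative rather than the project's actual settings.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, epsilon=0.1):
    # Explore with probability epsilon, otherwise act greedily on the current estimate.
    if random.random() < epsilon:
        return random.randrange(4)
    return max(range(4), key=lambda a: Q[(state, a)])

def train(env, episodes=200, alpha=0.1, gamma=0.99, epsilon=0.1, sarsa=True):
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, epsilon)
            if sarsa:
                # SARSA (on-policy): bootstrap with the action actually taken next.
                target = reward + gamma * Q[(next_state, next_action)] * (not done)
            else:
                # Q-Learning (off-policy): bootstrap with the greedy next action.
                best = max(Q[(next_state, a)] for a in range(4))
                target = reward + gamma * best * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```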
Analysis
- Both algorithms navigate the deterministic environment successfully, obtaining the highest possible reward in each episode.
- The algorithms perform equally well when evaluated over 50 episodes (Graph 6.1), both obtaining the highest reward (an evaluation sketch follows this list).
- Q-Learning required more training episodes before its reward curve flattened out, whereas SARSA reached that point within 200 episodes.
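A minimal evaluation sketch, assuming the hypothetical `GridWorld` and the Q-table returned by the `train` function above, and running a purely greedy policy (no exploration) over 50 episodes; the step cap is an added safeguard, not part of the project's reported setup.

```python
def evaluate(env, Q, episodes=50, max_steps=100):
    # Roll out the learned greedy policy and report the average episode return.
    returns = []
    for _ in range(episodes):
        state, done, total = env.reset(), False, 0
        for _ in range(max_steps):  # safety cap in case the greedy policy loops
            action = max(range(4), key=lambda a: Q[(state, a)])
            state, reward, done = env.step(action)
            total += reward
            if done:
                break
        returns.append(total)
    return sum(returns) / len(returns)
```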
Screenshots
- Initial state
- When the agent is navigating
- When the agent collects the reward
- When the agent is at the goal state
Read more - Project Report