This project trains and applies RL methods to solve a grid world environment; the agent should learn to navigate to the goal efficiently.
Environment
The environment is a 4x4 grid in which the agent must reach the goal state from the initial state. The same grid layout is used for both the deterministic and stochastic environments.
| Component | Description |
|---|---|
| Action | The agent has 4 possible actions in any state: up=0, down=1, right=2, left=3 |
| State | The 4x4 grid gives 16 states |
| Rewards | Four intermediate rewards plus the goal reward are defined: (2,0): -3, (1,2): -4, (1,0): 2, (3,1): 5, and the goal (3,3): 20, which is the highest reward in the grid |
| Objective | Reach the goal state |
- Deterministic: In a deterministic environment, the next state is fully determined by the current state and action; there is no randomness or probability in the transitions.
- Stochastic: Stochasticity is introduced in the step function by choosing between executing the given action and taking a random action. The given action is executed with probability 0.6 and a random action with probability 0.4 (see the sketch after this list).
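The following is a minimal sketch of such an environment, not the project's actual code: the class name `GridWorld`, the (row, col) coordinate convention, episode termination at the goal (3,3), and consuming each reward cell once are all assumptions made for illustration.

```python
import random

class GridWorld:
    """Sketch of the 4x4 grid environment described above (illustrative, assumed details)."""

    # up=0, down=1, right=2, left=3, expressed as (row, col) offsets
    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, 1), 3: (0, -1)}

    def __init__(self, stochastic=False):
        self.stochastic = stochastic
        self.reset()

    def reset(self):
        self.pos = (0, 0)  # initial state
        # reward cells and values from the table above; the goal (3,3) has the highest reward
        self.rewards = {(2, 0): -3, (1, 2): -4, (1, 0): 2, (3, 1): 5, (3, 3): 20}
        return self.pos

    def step(self, action):
        # Stochastic variant: execute the given action with probability 0.6,
        # otherwise (probability 0.4) replace it with a random action.
        if self.stochastic and random.random() < 0.4:
            action = random.randrange(4)
        dr, dc = self.ACTIONS[action]
        r = min(max(self.pos[0] + dr, 0), 3)  # clip to the 4x4 grid
        c = min(max(self.pos[1] + dc, 0), 3)
        self.pos = (r, c)
        reward = self.rewards.pop(self.pos, 0)  # collect the cell's reward once (assumption)
        done = self.pos == (3, 3)               # episode ends at the goal state (assumption)
        return self.pos, reward, done
```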
Algorithms
- SARSA
- Q-Learning
For the defined deterministic and stochastic environments, both Q-Learning and SARSA are used to solve the task. The environment is a 4x4 grid world where the agent starts at the initial position (0,0) and must reach the goal state (3,3), collecting rewards along the way. Since both negative and positive rewards are defined in the grid, the agent should avoid the negative-reward cells, collect the positive rewards, and reach the goal state.
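The two algorithms differ only in how the bootstrap target is formed: SARSA (on-policy) uses the action actually taken next, while Q-Learning (off-policy) uses the greedy maximum over the next state's actions. Below is a minimal training sketch under assumed details: it reuses the hypothetical `GridWorld` class above, and the function names (`epsilon_greedy`, `train`) and hyperparameter values (alpha, gamma, epsilon, episode count) are illustrative rather than the project's actual settings.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, epsilon=0.1):
    # Explore with probability epsilon, otherwise act greedily on the current estimate.
    if random.random() < epsilon:
        return random.randrange(4)
    return max(range(4), key=lambda a: Q[(state, a)])

def train(env, episodes=200, alpha=0.1, gamma=0.99, epsilon=0.1, sarsa=True):
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, epsilon)
            if sarsa:
                # SARSA (on-policy): bootstrap with the action actually taken next.
                target = reward + gamma * Q[(next_state, next_action)] * (not done)
            else:
                # Q-Learning (off-policy): bootstrap with the greedy next action.
                best = max(Q[(next_state, a)] for a in range(4))
                target = reward + gamma * best * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```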
Analysis
- Both algorithms navigate the deterministic environment successfully, obtaining the highest possible reward in each episode.
- The algorithms perform equally well when evaluated over 50 episodes (Graph 6.1), both obtaining the highest reward (an evaluation sketch follows this list).
- Q-Learning required more training episodes before its reward curve flattened out, whereas SARSA reached that point within 200 episodes.
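A minimal evaluation sketch, assuming the hypothetical `GridWorld` and the Q-table returned by the `train` function above, and running a purely greedy policy (no exploration) over 50 episodes; the step cap is an added safeguard, not part of the project's reported setup.

```python
def evaluate(env, Q, episodes=50, max_steps=100):
    # Roll out the learned greedy policy and report the average episode return.
    returns = []
    for _ in range(episodes):
        state, done, total = env.reset(), False, 0
        for _ in range(max_steps):  # safety cap in case the greedy policy loops
            action = max(range(4), key=lambda a: Q[(state, a)])
            state, reward, done = env.step(action)
            total += reward
            if done:
                break
        returns.append(total)
    return sum(returns) / len(returns)
```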
Screenshots
- Initial state
- When the agent is navigating
- When the agent collects the reward
- When the agent is at the goal state
Read more - Project Report