Alongside supervised learning and unsupervised learning, reinforcement learning (RL) is one of the primary machine learning paradigms. Reinforcement learning is used primarily in dynamic environments without predefined inputs and outputs. As it navigates its problem space, the program is provided with rewards (which it tries to maximize) and penalties (which it tries to minimize). Reinforcement learning stems from a similar process in animal psychology: biological brains are hardwired to interpret signals such as pain and hunger as negative reinforcements, and to interpret pleasure and food intake as positive reinforcements. Animals learn to engage in behaviours that minimize negative reinforcements and maximize positive ones.
Reinforcement learning does not require labelled input/output pairs, nor does it require sub-optimal actions to be explicitly corrected. Instead, the focus of RL is on finding a balance between exploration of uncharted territory and exploitation of current knowledge, with the goal of maximizing the long-term reward.
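A common way to strike this balance is an $\varepsilon$-greedy rule: with probability $\varepsilon$ the agent explores by picking a random action, and otherwise it exploits by picking the action with the highest estimated value. The sketch below is a minimal illustration in Python; the value table `q_values` and the action names are assumptions invented for the example, not part of any particular library:

```python
import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    """Pick a random action with probability epsilon (explore),
    otherwise the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)                   # explore uncharted territory
    return max(actions, key=lambda a: q_values[a])      # exploit current knowledge

# Example: with these estimates, "move" is usually chosen, "stay" occasionally.
q = {"stay": 0.0, "move": 0.5}
print(epsilon_greedy(q, ["stay", "move"], epsilon=0.1))
```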
Reinforcement learning is typically modelled as a Markov decision process (MDP), defined by:
- $\mathcal{S}$: a set of environment and agent states
- $\mathcal{A}$: a set of actions of the agent
- $P_a(s, s') = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$: the probability of transition (at time $t$) from state $s$ to state $s'$ under action $a$
- $R_a(s, s')$: the immediate reward after the transition from $s$ to $s'$ with action $a$ (an MDP with these four components is shown as a code sketch below)
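For a small, fully known problem, these four components can be written down directly. The following sketch encodes a hypothetical two-state MDP as plain Python dictionaries; the state and action names are invented for the example:

```python
# A toy MDP: states S, actions A, transition probabilities P, rewards R.
S = ["s0", "s1"]
A = ["stay", "move"]

# P[(s, a)] maps each successor state s' to Pr(s' | s, a).
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.9, "s1": 0.1},
}

# R[(s, a, s')] is the reward received after taking a in s and landing in s'.
R = {
    ("s0", "move", "s1"): 1.0,   # reaching s1 is rewarded
    ("s1", "move", "s0"): -1.0,  # falling back to s0 is penalized
}
```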
A basic RL agent interacts with its environment in discrete time steps. At each time $t$, the agent receives the current state $S_t$ and reward $R_t$. It then chooses an action $A_t$ from the set of available actions. The environment moves to a new state $S_{t+1}$ and emits the reward $R_{t+1}$ associated with the transition. The goal of an RL agent is to learn a policy $\pi(s, a) = \Pr(A_t = a \mid S_t = s)$ that maximizes the expected cumulative reward.
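The interaction loop itself is only a few lines. The sketch below samples transitions from dictionaries shaped like the toy `P` and `R` above; the always-"move" policy at the end is a trivial stand-in for whatever policy the agent has learned:

```python
import random

def step(P, R, state, action):
    """Environment step: sample S_{t+1} from P and look up the reward
    for the realized transition (0 if the transition has no entry in R)."""
    probs = P[(state, action)]
    next_state = random.choices(list(probs), weights=list(probs.values()))[0]
    return next_state, R.get((state, action, next_state), 0.0)

def run_episode(P, R, policy, start, horizon=10):
    """One episode of the basic agent-environment loop."""
    state, total_reward = start, 0.0
    for _ in range(horizon):
        action = policy(state)                      # agent chooses A_t
        state, reward = step(P, R, state, action)   # environment returns S_{t+1}, R_{t+1}
        total_reward += reward
    return total_reward

# Usage with the toy MDP defined earlier:
print(run_episode(P, R, lambda s: "move", start="s0"))
```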
When the agent's performance is compared to that of an agent that acts optimally, the difference in performance is called regret. To act near-optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize future reward), even though the immediate reward associated with an action might be negative.
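One standard way to make regret precise (a common finite-horizon formulation; the symbols $T$ and $\pi^*$ are introduced here for illustration) is:

$$\mathrm{Regret}(T) \;=\; \mathbb{E}\left[\sum_{t=1}^{T} R_t^{\pi^*}\right] \;-\; \mathbb{E}\left[\sum_{t=1}^{T} R_t\right]$$

where $R_t^{\pi^*}$ is the reward collected at time $t$ by an optimal policy $\pi^*$ and $R_t$ is the reward collected by the learning agent; a good learner keeps this gap growing as slowly as possible in $T$.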