Starter code is provided here. It is a full, working implementation that handles MDPs with discrete action spaces.
Each iteration, it collects a batch of trajectories, computes the advantage at every timestep, and concatenates the observations, actions, and advantages from all timesteps. Then it symbolically constructs the following surrogate objective,

\[
L(\theta) = \frac{1}{N} \sum_{n=1}^{N} \log \pi_\theta(a_n \mid o_n)\, \hat{A}_n,
\]

where \(n\) indexes the \(N\) concatenated timesteps, and differentiates it (using Theano) to get the policy gradient estimator.
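For concreteness, here is a minimal sketch of how such a surrogate objective can be built and differentiated symbolically in Theano. The variable names (`ob_no`, `a_n`, `adv_n`) and the network dimensions are illustrative assumptions, not the starter code's actual definitions:

```python
import numpy as np
import theano
import theano.tensor as T

floatX = theano.config.floatX

# Symbolic inputs: concatenated observations, actions, and advantages.
ob_no = T.matrix("ob_no")    # observations, shape (N, ob_dim)
a_n = T.ivector("a_n")       # actions taken, shape (N,)
adv_n = T.vector("adv_n")    # advantage estimates, shape (N,)

# Hypothetical one-hidden-layer policy network (dimensions illustrative).
rng = np.random.RandomState(0)
W0 = theano.shared(rng.randn(4, 32).astype(floatX), "W0")
b0 = theano.shared(np.zeros(32, dtype=floatX), "b0")
W1 = theano.shared(rng.randn(32, 2).astype(floatX), "W1")
b1 = theano.shared(np.zeros(2, dtype=floatX), "b1")
params = [W0, b0, W1, b1]

h = T.tanh(ob_no.dot(W0) + b0)
prob_na = T.nnet.softmax(h.dot(W1) + b1)               # action probabilities
logp_n = T.log(prob_na)[T.arange(a_n.shape[0]), a_n]   # log pi(a_n | o_n)

surr = T.mean(logp_n * adv_n)   # surrogate objective L(theta)
grads = T.grad(surr, params)    # policy gradient estimator
compute_grad = theano.function([ob_no, a_n, adv_n], grads)
```

Maximizing this surrogate by following `grads` is equivalent to ascending the REINFORCE policy gradient, since the advantages are treated as constants during differentiation.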
Here, the policy is parameterized as a neural network with one hidden layer, so the parameters \(\theta\) are the weights and biases of this neural network.
This code uses a time-dependent baseline: the average return at the \(t^{\text{th}}\) timestep, computed across the batch of trajectories, is subtracted from each return to form the advantage.
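As a rough illustration (not the starter code itself), a time-dependent baseline over variable-length trajectories might be computed as follows; the input format, a list of per-trajectory return arrays, is an assumption:

```python
import numpy as np

def time_dependent_baseline(returns):
    """returns: list of 1-D arrays; returns[i][t] is the return from
    timestep t of trajectory i. Yields the average return at each t
    over the trajectories that reach that timestep."""
    max_len = max(len(r) for r in returns)
    total = np.zeros(max_len)
    count = np.zeros(max_len)
    for r in returns:
        total[:len(r)] += r
        count[:len(r)] += 1
    return total / count

# The advantage at timestep t of a trajectory with returns r is then
#   adv[t] = r[t] - baseline[t]
```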
You can try various things:
```python
agent = REINFORCEAgent(env.observation_space, env.action_space,
                       episode_max_length=env.spec.timestep_limit)
```
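A hypothetical end-to-end usage sketch follows; the training entry point (`learn` here) is an assumption, not necessarily the starter code's actual interface:

```python
import gym

env = gym.make("CartPole-v0")   # any discrete-action environment
agent = REINFORCEAgent(env.observation_space, env.action_space,
                       episode_max_length=env.spec.timestep_limit)
agent.learn(env)  # assumed: repeatedly collects batches and updates the policy
```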