CSCA 5902: Mastering Classical Reinforcement Learning Algorithms

Get a head start on program admission

ÌýÌýPreview this courseÌýin the non-credit experience today!Ìý
Start working toward program admission and requirements right away.ÌýWork you complete in the non-credit experience will transfer to the for-credit experience when you upgrade and pay tuition. See How It Works for details.

Course Type: MS-AI Breadth, MS-CS Elective

Specialization: Reinforcement Learning

Instructor:ÌýDr. Ashutosh Trivedi, Associate Professor of Computer Science

Prior knowledge needed: TBD

Learning Outcomes

  • Formulate sequential decision-making problems as deterministic decision processes, Markov chains, and finite Markov decision processes.
  • Explain and apply core reinforcement learning concepts, including discounting, value functions, policies, Bellman equations, and optimality.
  • Implement planning algorithms for finite Markov decision processes, including value iteration, policy iteration, and linear programming formulations.
  • Implement and compare tabular reinforcement learning algorithms, including bandits, Monte Carlo methods, temporal-difference learning, SARSA, and Q-learning.
  • Analyze the role of sampling, exploration, and convergence guarantees in classical reinforcement learning.

Course Grading Policy

AssessmentPercentage of GradeAI Usage Policy
Quizzes (5)50% (10% each)Conditional
Final Exam50%Conditional

Course Content

Duration: 1Ìýhours, 50 minutes

This module introduces the modeling and optimization foundations for sequential decision-making in their simplest form: deterministic decision processes with discounted rewards. We begin with states, actions, transitions, and rewards as a language for representing decision problems over time. We then develop value functions and discounted optimality equations as tools for optimizing long-term return. The goal is to build intuition for why dynamic programming is correct in the simpler setting of deterministic decision processes before introducing stochastic transitions, learning from sampled experience, and bootstrapping in later modules.

Duration: 1Ìýhour

This module adds stochasticity to the deterministic picture developed in the previous module. Learners continue with the surprise-quiz example, now with uncertain outcomes: studying usually helps but may not always help, and relaxing may reduce preparation but may not always do so. The module first introduces stochastic transitions as probability distributions over next states, then studies Markov chains as stochastic systems without choices, and finally adds actions to obtain Markov decision processes. The goal is to make expected discounted reward, policies, and Bellman equations feel like natural extensions of the deterministic setting.

Duration: 4 hours, 20 minutes

This module studies dynamic programming methods for finite discounted Markov decision processes. We assume that the model is known: the transition probabilities and rewards are available. The goal is to compute value functions and policies using Bellman equations. The module is organized into five lessons. The first lesson evaluates a fixed policy. The second lesson explains policy improvement. The third lesson studies value iteration through the Bellman optimality operator. The fourth lesson combines evaluation and improvement into policy iteration. The final lesson gives a linear programming formulation of discounted MDP planning.

Duration: 4Ìýhours, 30 minutes

This module is organized into four lessons. The first lesson explains reinforcement learning as sample-based dynamic programming: the Bellman equations are still the conceptual target, but their expectations are no longer computed from a known model. The second lesson introduces Monte Carlo methods, where values are estimated by averaging complete returns. The third lesson introduces temporal-difference learning, where values are updated online using one-step bootstrapped targets. The final lesson gives the stochastic-approximation intuition behind incremental value learning.

Duration: 4Ìýhours, 40 minutes

This module moves from prediction to control. In the previous module, we estimated value functions for a fixed policy. Here, the learner must also improve behavior. This creates the central exploration–exploitation tension: the agent must use what it has learned while continuing to gather information. The first lesson studies exploration in bandit problems. The second lesson extends Monte Carlo methods from prediction to control by learning action values. The third and fourth lessons introduce SARSA and Q-learning as temporal-difference control algorithms. The final lesson compares these algorithms, explains their convergence intuition, and discusses why tabular methods struggle in large state spaces.

Duration: TBD

This final assessment synthesizes the main ideas. Learners solve the same finite discounted Markov decision process from two perspectives: first as a planning problem with a known model, and then as a learning problem from sampled experience. The assessment emphasizes the relationship between Bellman equations, dynamic programming, Monte Carlo estimation, temporal-difference learning, SARSA, and Q-learning.Ìý

Notes

  • Cross-listed Courses: CoursesÌýthat are offered under two or more programs. Considered equivalent when evaluating progress toward degree requirements. You may not earn credit for more than one version of a cross-listed course.
  • Page Updates: This page is periodically updated. Course information on the Coursera platform supersedes the information on this page. Click theÌýView on CourseraÌýbuttonÌýabove for the most up-to-date information.