Module 8 — Continuous control & robotics

Robotics RL case studies

Manipulation benchmarks, sim stacks, and connecting to the robotics track.

~55 min read + exercises

Before we begin

Robotics RL research is more than an algorithm: it integrates simulation, reward design, safety, and hardware. This lesson surveys landmark case studies in locomotion, manipulation, and sim-to-real transfer. Use them as templates for scoping projects and reading papers critically.

Locomotion — walking, running, balancing under physics constraints.
Manipulation — grasping, reorientation, tool use.
Sim-to-real robotics — policies trained mostly in sim, deployed on physical platforms.


What you will learn

  • Summarize OpenAI / ETH / Google style robotics RL pipelines.
  • Decompose papers into: environment, algorithm, reward, domain randomization (DR), and real-world results.
  • Recognize reward engineering patterns for locomotion vs manipulation.
  • Judge claims: sample complexity, generalization, hardware wear.
  • Extract ideas applicable to smaller course projects and portfolios.

Case study: learning to walk (sim)

Setup: quadruped or biped in MuJoCo / Isaac Gym; proprioceptive state (joint angles, velocities, IMU).
Algorithm: PPO or SAC, often massively parallel envs (thousands).
Reward shaping:

Term                       | Purpose
Forward velocity tracking  | Task progress
Energy / torque penalty    | Smooth, efficient gait
Foot slip penalty          | Realistic contact
Upright orientation        | Do not fall
Survival bonus             | Episode length

Lesson: locomotion rewards are densely shaped. Ablations typically show that each term guards against a specific failure mode, such as diving forward, hopping in place, or scuttling.
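The reward terms in the table above can be combined as a weighted sum. The following is a minimal sketch under stated assumptions: the weights, the Gaussian-shaped velocity term, and the function and argument names are all hypothetical choices for illustration, not values taken from any specific paper.

```python
import math
from dataclasses import dataclass


@dataclass
class RewardWeights:
    # Hypothetical weights; real values are tuned per robot and task.
    velocity: float = 1.0
    energy: float = 0.002
    slip: float = 0.1
    upright: float = 0.5
    survival: float = 0.2


def locomotion_reward(forward_vel, target_vel, torques, joint_vels,
                      foot_slip_speeds, up_z, w=RewardWeights()):
    """Return (total, terms) for one control step.

    up_z is the z-component of the body's up axis in the world frame
    (1.0 means perfectly upright). foot_slip_speeds holds the tangential
    speed of each foot that is currently in contact.
    """
    terms = {
        # Peaks at 1 * weight when the commanded velocity is matched.
        "velocity": w.velocity * math.exp(-(forward_vel - target_vel) ** 2 / 0.25),
        # Penalize mechanical power |torque * joint velocity|.
        "energy": -w.energy * sum(abs(t * q) for t, q in zip(torques, joint_vels)),
        # Penalize feet sliding while in contact.
        "slip": -w.slip * sum(s ** 2 for s in foot_slip_speeds),
        # Reward keeping the torso upright.
        "upright": w.upright * up_z,
        # Constant bonus for every step the episode survives.
        "survival": w.survival,
    }
    return sum(terms.values()), terms
```

Returning the individual terms alongside the total makes ablations easy: logging each term per episode shows which one dominates, and setting one weight to zero reproduces the "remove a term, watch the failure mode" experiment.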


Case study: dexterous manipulation (OpenAI Rubik's cube)

Setup: Shadow Hand + cube in sim; domain randomization on dynamics and sensing, scheduled by automatic domain randomization (ADR), a curriculum that widens the randomization ranges as the policy improves.
Algorithm: PPO with large batch from distributed rollouts.
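Real transfer: the DR-trained policy was deployed on the physical hand with no RL training on real hardware; vision was used for cube state estimation in some configurations; human interventions were reportedly rare.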
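A curriculum over randomization ranges can be sketched in a few lines. This is a deliberately simplified illustration, not the published algorithm: the class name `AutoRange`, the thresholds, and the step size are hypothetical, and a single scalar success rate stands in for whatever performance signal a real system would use.

```python
import random


class AutoRange:
    """One randomized parameter whose range widens as the policy succeeds."""

    def __init__(self, nominal, step, max_half_width):
        self.nominal = nominal
        self.step = step
        self.max_half_width = max_half_width
        self.half_width = 0.0  # start with no randomization

    def sample(self, rng):
        """Draw a parameter value from the current range."""
        return rng.uniform(self.nominal - self.half_width,
                           self.nominal + self.half_width)

    def update(self, success_rate, widen_at=0.8, shrink_at=0.4):
        """Widen the range after good performance, narrow it after poor."""
        if success_rate >= widen_at:
            self.half_width = min(self.half_width + self.step, self.max_half_width)
        elif success_rate < shrink_at:
            self.half_width = max(self.half_width - self.step, 0.0)


# Usage: a hypothetical friction coefficient centered on 1.0.
friction = AutoRange(nominal=1.0, step=0.1, max_half_width=0.5)
rng = random.Random(0)
for recent_success in (0.9, 0.85, 0.9, 0.2):
    friction.update(recent_success)
sampled = friction.sample(rng)
```

The design point is that difficulty is driven by measured performance rather than by a hand-written schedule, so the policy only sees harder randomizations once it handles the current ones.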

Takeaways:

  • DR scale — thousands of parallel sims.
  • Memory — partial observability needs history or LSTM.
  • Evaluation — success rate on held-out randomizations, not one sim seed.
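The evaluation takeaway above can be made concrete by scoring a policy on parameters drawn from ranges it never trained on. The sketch below is illustrative only: `toy_episode` is a hypothetical stand-in for a real rollout, and the parameter names and ranges are assumptions.

```python
import random


def sample_params(rng, friction_range, mass_scale_range):
    """Draw one set of simulator parameters from the given ranges."""
    return {
        "friction": rng.uniform(*friction_range),
        "mass_scale": rng.uniform(*mass_scale_range),
    }


def success_rate(run_episode, param_sets):
    """Fraction of parameter sets on which the episode succeeds."""
    successes = sum(1 for params in param_sets if run_episode(params))
    return successes / len(param_sets)


def toy_episode(params):
    # Stand-in for a real rollout: this toy "policy" fails on slippery floors.
    return params["friction"] >= 0.6


rng = random.Random(0)
# Ranges resembling the training distribution.
train_like = [sample_params(rng, (0.7, 1.3), (0.9, 1.1)) for _ in range(200)]
# Held-out ranges that extend past anything seen in training.
held_out = [sample_params(rng, (0.4, 0.7), (0.8, 1.2)) for _ in range(200)]

in_dist = success_rate(toy_episode, train_like)
out_dist = success_rate(toy_episode, held_out)
print(f"in-distribution: {in_dist:.2f}, held-out: {out_dist:.2f}")
```

Reporting both numbers side by side exposes the gap that a single-seed evaluation would hide.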

Checkpoint: Why was sim-to-real possible without real RL training data?

Answer

The randomized simulator covered a broad enough distribution of conditions that the real cube and hand were likely to fall inside it. Vision and sensor noise were also randomized, so the policy could not rely on cues that exist only in simulation.


Case study: industrial pick-and-place

Many factories pair model-based control with learned residual corrections. RL appears for:

  • Grasp point selection from point clouds.
  • Insertion with force feedback.
  • Regrasp policies after slips.

These policies are often warm-started with offline RL or behavior cloning on teleoperation logs, because online RL on hardware is expensive. The offline RL material in Module 9 connects here.
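Behavior cloning on teleop logs is, at its simplest, supervised regression from observations to operator actions. The sketch below fits a scalar linear policy by least squares; the log values and the function name `fit_linear_policy` are hypothetical, and a real pipeline would use a neural network over much richer observations.

```python
def fit_linear_policy(observations, actions):
    """Least-squares fit of action = gain * obs + bias (scalar obs and action)."""
    n = len(observations)
    mean_o = sum(observations) / n
    mean_a = sum(actions) / n
    cov = sum((o - mean_o) * (a - mean_a) for o, a in zip(observations, actions))
    var = sum((o - mean_o) ** 2 for o in observations)
    gain = cov / var
    bias = mean_a - gain * mean_o
    return gain, bias


# Hypothetical teleop log: the operator pushes in proportion to insertion error.
log_obs = [-0.02, -0.01, 0.0, 0.01, 0.02]   # position error in meters
log_act = [-1.0, -0.5, 0.0, 0.5, 1.0]       # commanded force in newtons

gain, bias = fit_linear_policy(log_obs, log_act)


def cloned_policy(obs):
    """Policy recovered from the log; usable as a warm start for RL."""
    return gain * obs + bias
```

The cloned policy only covers states the operator visited, which is why it is usually treated as a starting point for further offline or carefully constrained online training rather than a finished controller.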


Case study: ETH ANYmal and similar quadrupeds

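Pattern: train locomotion in simulation (PPO, or a differentiable-simulation method such as SHAC), run system identification, apply narrow DR around the identified parameters, then fine-tune on the robot within safe exploration limits. A highlight is perceptive locomotion, which feeds elevation maps built from lidar to the policy and so moves beyond proprioception alone.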

Stage         | Data source              | Risk
Sim pretrain  | Millions of steps        | Sim exploit
Real adapt    | Thousands                | Hardware stress
Deploy        | Frozen policy + monitor  | Drift, falls
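The "frozen policy + monitor" deployment stage can be illustrated with a small wrapper that sits between the policy and the motors. This is a sketch under assumptions: the tilt threshold, the joint limit, and the zero-command fallback are hypothetical placeholders, and a real robot would hand over to a vetted recovery or damping controller instead.

```python
import math


def safe_action(action, roll, pitch, joint_limit=1.0,
                max_tilt_rad=math.radians(35)):
    """Return (command, tripped).

    If the body tilt exceeds max_tilt_rad, the monitor trips and returns a
    zero command (placeholder fallback). Otherwise each action component is
    clamped to [-joint_limit, joint_limit].
    """
    if abs(roll) > max_tilt_rad or abs(pitch) > max_tilt_rad:
        return [0.0] * len(action), True
    clamped = [max(-joint_limit, min(joint_limit, a)) for a in action]
    return clamped, False


# Usage: the policy output is filtered at every control step.
command, tripped = safe_action([2.0, -0.5], roll=0.1, pitch=0.0)
```

Logging every clamp and trip also gives a cheap drift signal: a rising intervention rate over weeks of deployment suggests the robot or its environment has moved away from the training distribution.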

Reading a robotics RL paper (template)

  1. Observation / action — what is actually deployed?
  2. Parallelism — how many env steps total?
  3. Reward — full equation in appendix?
  4. DR table — which parameters?
  5. Real experiments — how many trials? video cherry-picking?
  6. Baselines — tuned PID / MPC compared fairly?
  7. Failure modes — discussed or hidden?
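For the parallelism question in the template, it helps to convert a paper's reported settings into total environment steps and the equivalent amount of robot time. The numbers below are hypothetical, chosen only to show the arithmetic.

```python
def total_env_steps(num_envs, steps_per_rollout, iterations):
    """Total simulator steps consumed by a parallel on-policy run."""
    return num_envs * steps_per_rollout * iterations


def simulated_days(env_steps, control_hz):
    """Robot time those steps represent if each step is one control tick."""
    return env_steps / control_hz / 86_400


# Hypothetical run: 4096 envs, 24 steps per rollout, 1500 policy updates, 50 Hz control.
steps = total_env_steps(4096, 24, 1500)
days = simulated_days(steps, 50)
print(steps, round(days, 1))  # prints: 147456000 34.1
```

About a month of continuous robot experience for a single run makes it clear why this kind of training happens in simulation rather than on hardware.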

Smaller-scale portfolio projects

You may not have a robot arm, but you can still demonstrate the same patterns:

Project                         | Demonstrates
SAC on Pendulum / HalfCheetah   | Continuous control
Domain rand on Hopper friction  | DR code + learning curve
Sim + real video comparison     | Analysis write-up
Teleop log + BC                 | Industry-relevant pipeline
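The friction-randomization project in the table can be structured as a thin wrapper that resamples a friction scale on every reset. The sketch below is simulator-agnostic: `FrictionRandomizer` and the `set_friction` callback are hypothetical names, and how the callback writes into the simulator is an assumption to verify (for MuJoCo-based environments it would typically scale the model's geom friction array).

```python
import random


class FrictionRandomizer:
    """Resample a friction scale at every reset and apply it via a setter."""

    def __init__(self, env, set_friction, low=0.5, high=1.5, seed=0):
        self.env = env
        self.set_friction = set_friction  # callable(env, scale), user-supplied
        self.low, self.high = low, high
        self.rng = random.Random(seed)
        self.current_scale = 1.0

    def reset(self, **kwargs):
        # New physics for each episode, then a normal reset.
        self.current_scale = self.rng.uniform(self.low, self.high)
        self.set_friction(self.env, self.current_scale)
        return self.env.reset(**kwargs)

    def step(self, action):
        return self.env.step(action)


class ToyEnv:
    """Minimal stand-in environment so the sketch runs without a simulator."""

    friction = 1.0

    def reset(self, **kwargs):
        return 0.0, {}

    def step(self, action):
        # Returns (obs, reward, terminated, truncated, info).
        return 0.0, 0.0, False, False, {}


env = FrictionRandomizer(ToyEnv(), lambda e, scale: setattr(e, "friction", scale))
obs, info = env.reset()
```

Logging `current_scale` with each episode return lets you plot return against friction, which is a more informative learning-curve figure than a single averaged line.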

Document what is needed to reproduce each result: random seeds, commit hash, and environment versions.
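One low-effort way to capture these details is to write a small manifest next to every run's results. The sketch below uses only the Python standard library; the function names, the manifest fields, and the default package list are assumptions to adapt to your own project.

```python
import json
import platform
import subprocess
from importlib import metadata


def git_commit():
    """Return the current commit hash, or "unknown" outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def package_version(name):
    """Return the installed version of a package, if any."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def run_manifest(seed, env_id, packages=("gymnasium", "mujoco", "torch")):
    """Collect the facts needed to rerun an experiment."""
    return {
        "seed": seed,
        "env_id": env_id,
        "commit": git_commit(),
        "python": platform.python_version(),
        "packages": {name: package_version(name) for name in packages},
    }


# Usage: save this alongside learning curves and videos.
manifest = run_manifest(seed=7, env_id="Hopper-v4")
print(json.dumps(manifest, indent=2))
```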


Common mistakes (when emulating papers)

Mistake                            | Reality check
Copy reward without ablation       | Your morphologies differ
Under-budget parallel envs         | Sample complexity explodes
Skip energy penalties              | Sim-only violent gaits
Claim sim-to-real from one video   | Need statistics
Ignore safety on student hardware  | Start constrained
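The "need statistics" check can be as simple as reporting a confidence interval on the real-world success rate. The sketch below computes a Wilson score interval, one standard choice for small trial counts; the trial numbers in the usage lines are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return center - half, center + half


# Hypothetical results: 8 successes in 10 hardware trials.
low, high = wilson_interval(8, 10)
print(f"success rate 0.80, 95% interval [{low:.2f}, {high:.2f}]")

# A single successful video is 1 success in 1 trial.
one_low, one_high = wilson_interval(1, 1)
```

With 8 of 10 successes the interval runs from roughly 0.49 to 0.94, and a single successful trial gives a lower bound of only about 0.21, which is why one video supports very little.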

Closing

Robotics RL successes combine scalable simulation, careful rewards, domain randomization, and humble real-world iteration. Algorithms (PPO, SAC, TD3) are one line in the system diagram — your Module 8 project practices the control stack; Module 9 covers deployment, offline data, and safety at scale.

