Robotics RL case studies
Before we begin
Robotics RL research is more than an algorithm: it integrates simulation, reward design, safety, and hardware. This lesson surveys landmark case studies in locomotion, manipulation, and sim-to-real transfer. Use them as templates for scoping projects and reading papers critically.
Locomotion — walking, running, balancing under physics constraints.
Manipulation — grasping, reorientation, tool use.
Sim-to-real robotics — policies trained mostly in sim, deployed on physical platforms.
What you will learn
- Summarize OpenAI-, ETH-, and Google-style robotics RL pipelines.
- Decompose papers into: env, algorithm, reward, domain randomization (DR), real results.
- Recognize reward engineering patterns for locomotion vs manipulation.
- Judge claims: sample complexity, generalization, hardware wear.
- Extract ideas applicable to smaller course projects and portfolios.
Case study: learning to walk (sim)
Setup: quadruped or biped in MuJoCo / Isaac Gym; proprioceptive state (joint angles, velocities, IMU).
Algorithm: PPO or SAC, often massively parallel envs (thousands).
Reward shaping:
| Term | Purpose |
|---|---|
| Forward velocity tracking | Task progress |
| Energy / torque penalty | Smooth, efficient gait |
| Foot slip penalty | Realistic contact |
| Upright orientation | Do not fall |
| Survival bonus | Episode length |
Lesson: locomotion rewards rely on dense shaping. Ablations typically show that each term guards against a specific failure mode, such as diving forward, hopping in place, or scuttling.
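To make the table concrete, here is a minimal sketch of how such terms might be combined into one per-step reward. The weights, the exponential velocity-tracking form, and every name (`RewardWeights`, `locomotion_reward`) are illustrative assumptions, not values from any specific paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; real projects tune these per robot and task.
    velocity: float = 1.0
    energy: float = 0.001
    slip: float = 0.1
    upright: float = 0.5
    survival: float = 0.2


def locomotion_reward(forward_vel, target_vel, torques, foot_slip_speeds,
                      tilt_rad, fallen, w=RewardWeights()):
    """Return (total, terms) for one control step."""
    terms = {
        # Peaks at w.velocity when measured velocity matches the command.
        "velocity": w.velocity * math.exp(-(forward_vel - target_vel) ** 2),
        # Squared joint torques act as a cheap proxy for energy use.
        "energy": -w.energy * sum(t * t for t in torques),
        # Horizontal speed of feet that are in contact with the ground.
        "slip": -w.slip * sum(s * s for s in foot_slip_speeds),
        # cos(tilt) is 1 when the base is upright and drops as it tips.
        "upright": w.upright * math.cos(tilt_rad),
        # Paid on every step the robot has not fallen.
        "survival": 0.0 if fallen else w.survival,
    }
    return sum(terms.values()), terms
```

Returning the individual terms alongside the total makes ablations easy: log each term separately, then zero one weight at a time and watch which failure mode appears.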
Case study: dexterous manipulation (OpenAI Rubik's cube)
Setup: Shadow Hand + cube in sim; domain randomization on dynamics and sensing; automatic curriculum on scramble difficulty.
Algorithm: PPO with large batch from distributed rollouts.
Real transfer: domain randomization (DR) plus optional vision input; human interventions are rare.
Takeaways:
- DR scale — thousands of parallel sims.
- Memory — partial observability needs history or LSTM.
- Evaluation — success rate on held-out randomizations, not one sim seed.
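The memory takeaway can be approximated without a recurrent network by feeding the policy a short window of past observations. The sketch below is one simple way to do that; the class name `ObservationHistory` and the zero-padding choice are assumptions for illustration, and an LSTM remains the alternative the takeaway mentions.

```python
from collections import deque


class ObservationHistory:
    """Keeps the last `horizon` observations and returns them flattened."""

    def __init__(self, obs_dim, horizon):
        self.obs_dim = obs_dim
        self.horizon = horizon
        self.reset()

    def reset(self):
        # Zero-pad so the stacked vector has a fixed size from the first step.
        self.frames = deque(
            ([0.0] * self.obs_dim for _ in range(self.horizon)),
            maxlen=self.horizon,
        )

    def push(self, obs):
        if len(obs) != self.obs_dim:
            raise ValueError("unexpected observation size")
        self.frames.append(list(obs))  # the oldest frame is dropped
        # Oldest frame first, newest frame last.
        return [x for frame in self.frames for x in frame]
```

Call `reset()` at the start of every episode so frames from a previous episode do not leak into the next one.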
Checkpoint: Why was sim-to-real possible without real RL training data?
Answer
Randomized sim covered a broad enough distribution of real conditions that the real cube/hand instance was likely in-distribution. Vision and tactile noise were also randomized so the policy did not rely on sim-only cues.
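In code, this idea amounts to resampling simulator parameters at the start of every episode. A minimal sketch follows; the parameter names and ranges are hypothetical placeholders, not the values used in the OpenAI work, and applying the sampled values to a simulator is left out because it depends on the engine.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    low: float
    high: float

    def sample(self, rng):
        return rng.uniform(self.low, self.high)


# Hypothetical parameters and ranges, chosen only for illustration.
DR_RANGES = {
    "friction": Range(0.5, 1.5),
    "object_mass_scale": Range(0.8, 1.2),
    "motor_gain_scale": Range(0.9, 1.1),
    "obs_noise_std": Range(0.0, 0.02),
}


def sample_episode_params(ranges, rng):
    """Draw one set of simulator parameters for a single episode."""
    return {name: r.sample(rng) for name, r in ranges.items()}


rng = random.Random(0)  # seeded so the draws are reproducible
params = sample_episode_params(DR_RANGES, rng)
```

Keeping the ranges in one table like `DR_RANGES` also makes it straightforward to report them, which is the "DR table" a reader will look for.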
Case study: industrial pick-and-place
Many factories use model-based control + learning for residual corrections. RL appears for:
- Grasp point selection from point clouds.
- Insertion with force feedback.
- Regrasp policies after slips.
These systems are often warm-started with offline RL or behavior cloning from teleoperation logs, because online RL on hardware is expensive. The offline RL material in Module 9 connects here.
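Behavior cloning is supervised regression from logged states to logged actions. The toy sketch below fits a one-dimensional linear policy with closed-form least squares; the log format (pairs of position error and commanded velocity) is an assumption, and a real pipeline would use a neural network over much richer observations.

```python
def fit_linear_policy(logs):
    """Fit action = gain * error + bias to (error, action) pairs."""
    n = len(logs)
    if n < 2:
        raise ValueError("need at least two logged samples")
    mean_x = sum(x for x, _ in logs) / n
    mean_a = sum(a for _, a in logs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in logs)
    if var_x == 0.0:
        raise ValueError("logged errors are all identical")
    cov = sum((x - mean_x) * (a - mean_a) for x, a in logs)
    gain = cov / var_x
    bias = mean_a - gain * mean_x
    return gain, bias


def make_policy(gain, bias):
    # The cloned policy simply replays the fitted mapping.
    return lambda error: gain * error + bias


# Hypothetical teleop log: the operator behaved like a proportional controller.
teleop_log = [(e, 2.0 * e + 0.1) for e in (-1.0, -0.5, 0.0, 0.5, 1.0)]
policy = make_policy(*fit_linear_policy(teleop_log))
```

The cloned policy is only trustworthy near states the operator visited, which is one reason such warm starts are usually followed by careful fine-tuning or residual corrections.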
Case study: ETH quadruped (ANYmal and similar)
Pattern: train locomotion in sim (PPO/SHAC), run system identification, narrow the DR ranges, then fine-tune on the robot within safe exploration limits. A highlight is perceptive locomotion, which adds elevation maps from lidar to move beyond proprioception alone.
| Stage | Data / setup | Risk |
|---|---|---|
| Sim pretrain | Millions of sim steps | Policy exploits simulator flaws |
| Real adapt | Thousands of real steps | Hardware stress |
| Deploy | Frozen policy + monitor | Drift, falls |
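Safe exploration limits and a deployment monitor can start as a thin filter between the policy and the motors. The sketch below rate-limits joint targets and holds the current pose when the base tilts too far; the thresholds and names (`SafetyLimits`, `safe_action`) are hypothetical, and real robots layer this on top of hardware-level protections.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyLimits:
    # Hypothetical thresholds; tune conservatively for real hardware.
    max_joint_delta: float = 0.1  # rad per control step
    max_tilt_rad: float = 0.6


def safe_action(policy_action, current_joints, tilt_rad, limits):
    """Return (joint_targets, ok). ok=False asks the caller to stop."""
    if abs(tilt_rad) > limits.max_tilt_rad:
        # Likely falling: hold the current pose and flag the episode.
        return list(current_joints), False
    targets = []
    for target, q in zip(policy_action, current_joints):
        # Clamp how far a joint target may move in one control step.
        delta = max(-limits.max_joint_delta,
                    min(limits.max_joint_delta, target - q))
        targets.append(q + delta)
    return targets, True
```

Logging how often the filter clamps an action is a cheap drift signal: a frozen policy that starts hitting the limits more often is seeing conditions it did not see in training.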
Reading a robotics RL paper (template)
- Observation / action — what is actually deployed?
- Parallelism — how many env steps total?
- Reward — full equation in appendix?
- DR table — which parameters?
- Real experiments — how many trials? video cherry-picking?
- Baselines — tuned PID / MPC compared fairly?
- Failure modes — discussed or hidden?
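The parallelism question in the checklist reduces to simple arithmetic once a paper states its rollout settings. A small sketch, where every number is a made-up placeholder rather than a figure from any paper:

```python
def rollout_budget(num_envs, steps_per_rollout, iterations, control_dt_s):
    """Return (total env steps, days of simulated experience)."""
    total_steps = num_envs * steps_per_rollout * iterations
    sim_seconds = total_steps * control_dt_s
    return total_steps, sim_seconds / 86400.0


# Hypothetical settings: 4096 envs, 24 steps per rollout, 1500 updates, 50 Hz.
steps, days = rollout_budget(4096, 24, 1500, 0.02)
print(f"{steps:,} env steps, about {days:.1f} days of simulated experience")
```

With these placeholder settings the total is about 147 million steps, or roughly 34 days of simulated time, which is a useful reality check before attempting the same recipe on a single workstation or on hardware.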
Smaller-scale portfolio projects
You may not have a robot arm, but you can still demonstrate the same patterns:
| Project | Demonstrates |
|---|---|
| SAC on Pendulum / HalfCheetah | Continuous control |
| Domain randomization on Hopper friction | DR code + learning curve |
| Sim + real video comparison | Analysis write-up |
| Teleop log + BC | Industry-relevant pipeline |
Document reproducibility: seeds, commit hash, env versions.
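One lightweight way to capture these details is to write a small manifest next to every run. The sketch below uses only the standard library; the package list is an example to adapt to your own stack, and the helper names are hypothetical.

```python
import json
import platform
import subprocess
from importlib import metadata


def git_commit():
    """Return the current commit hash, or 'unknown' outside a git repo."""
    try:
        out = subprocess.run(["git", "rev-parse", "HEAD"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def package_version(name):
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def build_manifest(seed, packages=("gymnasium", "mujoco", "torch")):
    # The default package names are examples; list what your project imports.
    return {
        "seed": seed,
        "commit": git_commit(),
        "python": platform.python_version(),
        "packages": {name: package_version(name) for name in packages},
    }


print(json.dumps(build_manifest(seed=0), indent=2))
```

Saving this JSON alongside checkpoints and learning curves lets a reviewer match every plot to the exact code and environment versions that produced it.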
Common mistakes (when emulating papers)
| Mistake | Reality check |
|---|---|
| Copy reward without ablation | Your robot's morphology differs |
| Under-budgeting parallel envs | Training time explodes |
| Skip energy penalties | Violent gaits that only work in sim |
| Claim sim-to-real from one video | Need statistics |
| Ignore safety on student hardware | Start constrained |
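For the "need statistics" row, a success rate is more convincing with an interval around it. The sketch below computes a Wilson score interval for a binomial success rate; the choice of the Wilson interval and the 95% level (z = 1.96) are assumptions here, and other interval methods are equally defensible.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success rate (z=1.96 gives about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical result: 18 successful real-world trials out of 20.
low, high = wilson_interval(18, 20)
print(f"success rate 90%, 95% interval {low:.2f} to {high:.2f}")
```

Even 18 out of 20 leaves an interval of roughly 0.70 to 0.97, and a single successful video (1 out of 1) leaves the true rate almost entirely undetermined.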
Closing
Robotics RL successes combine scalable simulation, careful rewards, domain randomization, and humble real-world iteration. Algorithms (PPO, SAC, TD3) are one line in the system diagram — your Module 8 project practices the control stack; Module 9 covers deployment, offline data, and safety at scale.