Robotics RL case studies

Before we begin

Research-grade robotics RL is more than an algorithm: it integrates simulation, reward design, safety, and hardware. This lesson surveys landmark case studies in locomotion, manipulation, and sim-to-real transfer. Use them as templates for scoping projects and reading papers critically.

Locomotion — walking, running, balancing under physics constraints.
Manipulation — grasping, reorientation, tool use.
Sim-to-real robotics — policies trained mostly in sim, deployed on physical platforms.

What you will learn

Summarize OpenAI-, ETH-, and Google-style robotics RL pipelines.
Decompose papers into environment, algorithm, reward, domain randomization (DR), and real-world results.
Recognize reward engineering patterns for locomotion vs manipulation.
Judge claims: sample complexity, generalization, hardware wear.
Extract ideas applicable to smaller course projects and portfolios.

Case study: learning to walk (sim)

Setup: quadruped or biped in MuJoCo / Isaac Gym; proprioceptive state (joint angles, velocities, IMU).
Algorithm: PPO or SAC, often with thousands of environments simulated in parallel.
Reward shaping:

Term	Purpose
Forward velocity tracking	Task progress
Energy / torque penalty	Smooth, efficient gait
Foot slip penalty	Realistic contact
Upright orientation	Do not fall
Survival bonus	Episode length

Lesson: locomotion rewards rely on dense shaping. Ablations typically show that each term guards against a specific failure mode, such as diving forward, hopping in place, or scuttling.
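The reward terms in the table can be combined as a weighted sum. The sketch below is a minimal illustration, not the reward from any particular paper: the weights, the exponential velocity-tracking kernel, and every field on the hypothetical `StepInfo` record are assumptions chosen for readability.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class StepInfo:
    """Hypothetical per-step quantities a simulator wrapper might expose."""
    forward_velocity: float            # m/s along the heading direction
    target_velocity: float             # commanded m/s
    joint_torques: Sequence[float]     # N*m, one entry per actuated joint
    foot_slip_speeds: Sequence[float]  # m/s, tangential speed of feet in contact
    up_alignment: float                # cosine of body-z vs world-z, 1.0 = upright
    fallen: bool


# Illustrative weights only (assumption); real projects tune these per robot.
WEIGHTS: Dict[str, float] = {
    "velocity": 1.0,
    "energy": 0.001,
    "slip": 0.1,
    "upright": 0.5,
    "alive": 0.2,
}


def reward_terms(info: StepInfo) -> Dict[str, float]:
    """Return each unweighted term separately so it can be logged or ablated."""
    vel_err = info.forward_velocity - info.target_velocity
    return {
        "velocity": math.exp(-4.0 * vel_err ** 2),           # 1.0 when tracking exactly
        "energy": -sum(t * t for t in info.joint_torques),   # penalize large torques
        "slip": -sum(s * s for s in info.foot_slip_speeds),  # penalize sliding feet
        "upright": info.up_alignment,                        # reward staying level
        "alive": 0.0 if info.fallen else 1.0,                # survival bonus
    }


def total_reward(info: StepInfo, weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of the shaping terms."""
    terms = reward_terms(info)
    return sum(weights[name] * value for name, value in terms.items())
```

Keeping the terms in a dictionary makes an ablation a one-line change: set one weight to zero, retrain, and watch which failure mode returns.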

Case study: dexterous manipulation (OpenAI Rubik's cube)

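Setup: Shadow Hand + cube in sim; domain randomization on dynamics and sensing; automatic domain randomization (ADR), a curriculum that widens the randomization ranges as the policy improves.
Algorithm: PPO with large batches gathered from distributed rollouts.
Real transfer: the sim-trained policy is deployed on the physical hand, relying on DR and (optionally) vision-based state estimation; human interventions are rare.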

Takeaways:

DR scale — thousands of parallel sims.
Memory — partial observability needs history or LSTM.
Evaluation — success rate on held-out randomizations, not one sim seed.
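The evaluation takeaway can be sketched in a few lines. This is a toy illustration under stated assumptions: the parameter names and ranges in `DR_RANGES` are hypothetical, and `toy_episode` stands in for a real policy rollout so that the block runs without a simulator.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical randomization ranges (low, high); published DR tables list many more.
DR_RANGES: Dict[str, Tuple[float, float]] = {
    "friction": (0.5, 1.5),
    "object_mass_kg": (0.05, 0.15),
    "action_delay_s": (0.0, 0.04),
}


def sample_params(rng: random.Random) -> Dict[str, float]:
    """Draw one simulator configuration uniformly from each range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in DR_RANGES.items()}


def held_out_success_rate(
    run_episode: Callable[[Dict[str, float]], bool],
    n_configs: int,
    eval_seed: int,
) -> float:
    """Fraction of successful episodes over freshly sampled configurations.

    Use an eval_seed that was never used for training so the evaluation
    draws differ from the configurations the policy was trained on.
    """
    rng = random.Random(eval_seed)
    configs: List[Dict[str, float]] = [sample_params(rng) for _ in range(n_configs)]
    successes = sum(1 for cfg in configs if run_episode(cfg))
    return successes / n_configs


def toy_episode(params: Dict[str, float]) -> bool:
    """Stand-in for a real rollout: 'succeeds' unless friction is very low."""
    return params["friction"] > 0.6


if __name__ == "__main__":
    print(held_out_success_rate(toy_episode, n_configs=200, eval_seed=12345))
```

Reporting the rate over many sampled configurations, rather than over one simulator seed, is what makes the number comparable across runs.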

Checkpoint: Why was sim-to-real possible without real RL training data?

Answer

The randomized simulator covered a broad enough distribution of conditions that the real cube/hand instance was likely in-distribution. Sensor noise, including noise in the vision pipeline, was also randomized so the policy did not rely on sim-only cues.

Case study: industrial pick-and-place

Many factories pair model-based control with learned residual corrections. RL appears for:

Grasp point selection from point clouds.
Insertion with force feedback.
Regrasp policies after slips.

These systems often rely on offline RL or a behavior-cloning warm start from teleop logs, because online RL on hardware is expensive. The offline RL material in Module 9 connects here.
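The residual-correction idea can be illustrated with a small sketch: a conventional controller produces the base command and a learned policy adds a bounded correction. The PD gains, the clipping bound, and both function names are assumptions for illustration; in practice the residual would come from a trained network rather than a hard-coded list.

```python
from typing import List, Sequence


def pd_controller(
    q: Sequence[float],
    q_target: Sequence[float],
    dq: Sequence[float],
    kp: float = 50.0,
    kd: float = 2.0,
) -> List[float]:
    """Classical joint-space PD command (gains are illustrative)."""
    return [kp * (qt - qi) - kd * dqi for qi, qt, dqi in zip(q, q_target, dq)]


def residual_action(
    base: Sequence[float],
    residual: Sequence[float],
    max_residual: float,
) -> List[float]:
    """Add a learned correction to the base command, clipped elementwise.

    Clipping bounds how far the learned part can push the command away
    from the trusted controller's output.
    """
    clipped = [max(-max_residual, min(max_residual, r)) for r in residual]
    return [b + r for b, r in zip(base, clipped)]


if __name__ == "__main__":
    base = pd_controller(q=[0.0, 0.2], q_target=[0.1, 0.2], dq=[0.0, 0.0])
    learned = [0.3, -4.0]  # would come from a trained policy
    print(residual_action(base, learned, max_residual=1.0))
```

Bounding the residual is a design choice that trades some expressiveness for predictability: even an untrained or badly trained policy cannot move the command more than `max_residual` away from the base controller.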

ETH quadruped ANYmal / similar

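Pattern: train locomotion in sim (PPO/SHAC), run system identification, apply DR over a narrow range around the identified parameters, then fine-tune on the robot within safe exploration limits. A notable extension is perceptive locomotion, which adds elevation maps built from lidar and so moves beyond pure proprioception.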

Stage	Data / setup	Risk
Sim pretrain	Millions of simulated steps	Policy exploits simulator flaws
Real adapt	Thousands of real steps	Hardware stress
Deploy	Frozen policy + monitor	Drift, falls
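The "frozen policy + monitor" deployment stage can be sketched as a thin wrapper around the policy. Everything here is an assumption for illustration: the limit values, the `tilt_rad` input, and the `MonitoredPolicy` class are hypothetical, and a real robot would have additional layers of hardware-level safety below this one.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Policy = Callable[[Sequence[float]], Sequence[float]]


@dataclass
class SafetyLimits:
    max_abs_action: float = 1.0  # elementwise command bound (illustrative)
    max_tilt_rad: float = 0.6    # body tilt beyond which the fallback takes over


class MonitoredPolicy:
    """Wrap a frozen policy with action clipping and a tilt-triggered fallback."""

    def __init__(self, policy: Policy, fallback: Policy, limits: SafetyLimits):
        self.policy = policy
        self.fallback = fallback
        self.limits = limits
        self.fallback_count = 0  # log this to spot drift over time

    def act(self, obs: Sequence[float], tilt_rad: float) -> List[float]:
        if abs(tilt_rad) > self.limits.max_tilt_rad:
            # The robot is close to falling: hand over to the safe controller.
            self.fallback_count += 1
            return list(self.fallback(obs))
        m = self.limits.max_abs_action
        return [max(-m, min(m, a)) for a in self.policy(obs)]


if __name__ == "__main__":
    monitored = MonitoredPolicy(
        policy=lambda obs: [2.0, -0.5],    # stand-in for a trained network
        fallback=lambda obs: [0.0, 0.0],   # e.g. hold still or sit down
        limits=SafetyLimits(),
    )
    print(monitored.act([0.0], tilt_rad=0.1))
    print(monitored.act([0.0], tilt_rad=0.9))
```

A rising `fallback_count` over days of operation is one simple signal of the drift listed in the Risk column.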

Reading a robotics RL paper (template)

Observation / action — what is actually deployed?
Parallelism — how many env steps total?
Reward — full equation in appendix?
DR table — which parameters?
Real experiments — how many trials? video cherry-picking?
Baselines — tuned PID / MPC compared fairly?
Failure modes — discussed or hidden?
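One way to apply the template consistently is to fill in a small record per paper and list what the paper leaves unanswered. The sketch below is a note-taking aid under assumptions: the `PaperNotes` fields simply mirror the seven questions above, and `count_env_steps` assumes the common layout of a fixed number of parallel environments stepped for a fixed rollout length per training iteration.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class PaperNotes:
    """One field per question in the reading template; None means 'not found'."""
    observation_action: Optional[str] = None
    total_env_steps: Optional[int] = None
    reward_equation: Optional[str] = None
    dr_parameters: Optional[str] = None
    real_trials: Optional[int] = None
    baselines: Optional[str] = None
    failure_modes: Optional[str] = None

    def missing(self) -> List[str]:
        """Names of the template questions the paper did not answer."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


def count_env_steps(num_envs: int, steps_per_rollout: int, iterations: int) -> int:
    """Total simulated steps for a parallel on-policy training run."""
    return num_envs * steps_per_rollout * iterations


if __name__ == "__main__":
    notes = PaperNotes(
        total_env_steps=count_env_steps(4096, 24, 1000),  # made-up numbers
        real_trials=10,
    )
    print(notes.total_env_steps)
    print(notes.missing())
```

A long `missing()` list is itself a finding worth recording when you judge a paper's claims.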

Smaller-scale portfolio projects

You may not have a robot arm, but you can still demonstrate the same patterns:

Project	Demonstrates
SAC on Pendulum / HalfCheetah	Continuous control
Domain randomization on Hopper friction	DR code + learning curve
Sim + real video comparison	Analysis write-up
Teleop log + BC	Industry-relevant pipeline

Document reproducibility: seeds, commit hash, env versions.
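A small manifest written at the start of each run covers the three items above. This is a minimal sketch using only the standard library; the package names passed in are examples (an assumption about your setup), and the git lookup returns `None` when the code is not inside a repository or git is not installed.

```python
import json
import platform
import subprocess
from importlib import metadata
from typing import Dict, Iterable, Optional


def git_commit() -> Optional[str]:
    """Current commit hash, or None if git or a repository is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True, timeout=5,
        )
        return out.stdout.strip()
    except (OSError, subprocess.SubprocessError):
        return None


def package_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Installed version of each package, or None if it is not installed."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def run_manifest(seed: int, packages: Iterable[str]) -> Dict[str, object]:
    """Collect the facts needed to rerun an experiment."""
    return {
        "seed": seed,
        "git_commit": git_commit(),
        "python": platform.python_version(),
        "packages": package_versions(packages),
    }


if __name__ == "__main__":
    # Example package names; list whatever your project actually imports.
    manifest = run_manifest(seed=0, packages=["gymnasium", "mujoco", "numpy"])
    print(json.dumps(manifest, indent=2))
```

Saving this JSON next to every learning curve lets a reader match a plot to the exact code and environment versions that produced it.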

Common mistakes (when emulating papers)

Mistake	Reality check
Copy reward without ablation	Your robot's morphology differs
Under-budget parallel envs	Sample complexity explodes
Skip energy penalties	Sim-only violent gaits
Claim sim-to-real from one video	Need statistics
Ignore safety on student hardware	Start constrained
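The "need statistics" row can be made concrete with a confidence interval on the real-world success rate. The sketch below uses the Wilson score interval, one standard choice for binomial proportions; the default `z = 1.96` corresponds to roughly 95% confidence, and the trial counts in the usage lines are made up for illustration.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success probability.

    Behaves sensibly for small trial counts and for rates near 0 or 1,
    which is the usual situation in hardware experiments.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # One successful video clip versus twenty logged trials (made-up counts).
    print(wilson_interval(1, 1))
    print(wilson_interval(18, 20))
```

A single successful trial is consistent with a true success rate anywhere from roughly 0.21 up to 1.0, which is why one video supports very little.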

Closing

Robotics RL successes combine scalable simulation, careful rewards, domain randomization, and humble real-world iteration. Algorithms (PPO, SAC, TD3) are one line in the system diagram — your Module 8 project practices the control stack; Module 9 covers deployment, offline data, and safety at scale.
