Crossing the Sim2Real Gap with RotorPy
Teaching a quadrotor to fly in minutes on a laptop.
TL;DR: We trained a trajectory-tracking control policy for a quadrotor via reinforcement learning in just minutes using RotorPy, and deployed it on a real quadrotor.
Recently my colleague Hersh Sanghvi has been using RotorPy to generate training data for his research on meta-learning. In support of this, we’ve been working on rewriting RotorPy’s backend in PyTorch so that it can be parallelized on GPUs. The results have been impressive: depending on the system specs, we’re seeing upwards of a 100x speedup over the CPU-bound RotorPy!
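To give a flavor of what that looks like, here’s a toy sketch of batched dynamics stepping in PyTorch. The names are mine, not RotorPy’s actual API, and the dynamics are reduced to a trivial point mass just to show how thousands of simulated vehicles can be advanced in one tensor operation.

```python
import torch

# Toy illustration of the batched-backend idea (hypothetical names, not
# RotorPy's actual API): every environment's state lives in one tensor, so a
# single Euler step advances all N simulated vehicles at once on the GPU.
def step_batched(pos, vel, accel_cmd, dt=0.01):
    # pos, vel, accel_cmd: (N, 3) tensors; gravity is the only other force here.
    g = torch.tensor([0.0, 0.0, -9.81], device=pos.device)
    vel_next = vel + (accel_cmd + g) * dt
    pos_next = pos + vel * dt
    return pos_next, vel_next

device = "cuda" if torch.cuda.is_available() else "cpu"
N = 4096                                   # thousands of environments in parallel
pos = torch.zeros(N, 3, device=device)
vel = torch.zeros(N, 3, device=device)
cmd = torch.zeros(N, 3, device=device)
cmd[:, 2] = 9.81                           # hover-ish acceleration command
pos, vel = step_batched(pos, vel, cmd)
```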
We thought a really cool demo to illustrate the speedup would be to train an RL policy in RotorPy and then transfer it to the real world. After quite a bit of experimentation, we managed to get a solid policy. Below is a video demonstration of the policy in action!
The policy receives an observation containing the position error, velocity error, orientation, and body rates. The observation also includes a horizon of future position commands from the trajectory. Odometry is conveniently provided by an external motion capture system!
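For concreteness, here’s roughly how that observation vector could be assembled. The variable names and ordering are illustrative, not the exact layout of our Gymnasium environment.

```python
import numpy as np

# Rough sketch of the observation layout (illustrative names, not the exact
# fields of our Gymnasium environment).
def build_observation(state, reference, horizon_waypoints):
    # state: dict with 'x' (position), 'v' (velocity), 'q' (quaternion), 'w' (body rates)
    # reference: dict with the current desired position 'x' and velocity 'v'
    # horizon_waypoints: (H, 3) array of future position commands along the trajectory
    pos_err = state["x"] - reference["x"]
    vel_err = state["v"] - reference["v"]
    return np.concatenate([
        pos_err,                    # 3:   position error
        vel_err,                    # 3:   velocity error
        state["q"],                 # 4:   orientation (quaternion)
        state["w"],                 # 3:   body rates
        horizon_waypoints.ravel(),  # 3*H: future position commands
    ])
```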
The output of the policy is a collective thrust and attitude command (i.e. “angle” mode), which is then tracked by lower-level controllers running onboard the quadrotor. The policy itself runs on the base station computer.
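Here’s a sketch of how a raw policy output could be turned into that kind of command. The mass, tilt limit, and command format are made up for illustration; the real setpoint is whatever the onboard attitude controller expects.

```python
import numpy as np

# Sketch of mapping a raw policy output to a collective thrust + attitude
# ("angle" mode) setpoint. All constants here are illustrative.
def action_to_command(action, mass=0.035, g=9.81, max_tilt=np.radians(25)):
    # action: 4-vector in [-1, 1] from the policy
    thrust = mass * g * (1.0 + 0.5 * action[0])   # collective thrust about hover
    roll = max_tilt * action[1]
    pitch = max_tilt * action[2]
    yaw = np.pi * action[3]
    return {"thrust": thrust, "roll": roll, "pitch": pitch, "yaw": yaw}
```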
The policy was trained using PPO and our custom Gymnasium environment. In simulation, the agent was exposed to sinusoidal trajectories of varying amplitudes and frequencies. On Hersh’s M1 MacBook Pro, it took a little over 3 minutes to train a policy over a couple million simulation steps. My (untested) hypothesis is that RotorPy’s dynamics model is more accurate than those of other, more simplified simulators, so fewer simulation steps (and therefore less wall-clock time) are required to successfully transfer a policy. The results here are a positive indication, but by no means conclusive.
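As a rough sketch of the recipe (not our exact configuration; the environment id, trajectory sampler, and hyperparameters below are placeholders), the training setup looks something like this with stable-baselines3:

```python
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO

def sample_sinusoidal_reference(rng: np.random.Generator):
    """Draw a sinusoidal position reference with random amplitude and frequency."""
    A = rng.uniform(0.5, 2.0)   # amplitude [m]  (illustrative range)
    f = rng.uniform(0.2, 1.0)   # frequency [Hz] (illustrative range)
    return lambda t: np.array([A * np.sin(2.0 * np.pi * f * t), 0.0, 0.0])

# Our custom RotorPy environment resamples a reference like the one above at
# every reset; "Quadrotor-v0" is only a stand-in id for that environment.
env = gym.make("Quadrotor-v0")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=2_000_000)      # "a couple million" simulation steps
```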
Below is another example of the policy, but this time tracking a figure-eight pattern on the XY plane.
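The reference itself is just a simple closed curve on the XY plane; a parameterization like the following (with illustrative amplitude and period, not necessarily the exact curve we flew) is all it takes to generate it:

```python
import numpy as np

# A figure-eight (Gerono lemniscate) position reference on the XY plane;
# the amplitude A and period T are illustrative values.
def figure_eight(t, A=1.0, T=8.0):
    w = 2.0 * np.pi / T
    return np.array([A * np.sin(w * t), A * np.sin(w * t) * np.cos(w * t), 0.0])
```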
If you’re an RL expert, you’re probably thinking we didn’t really do anything novel here. I completely agree! Nevertheless, we’re quite proud of how fast we were able to train a policy using RotorPy, and super excited that we were able to cross the infamous sim2real gap without any fine-tuning on real-world data!
Possible follow-ups we’re thinking about:
- We’d love to figure out how to compile the policy and run it on the onboard processor. It’s been done before, and our policy is probably small enough to fit in flash memory.
- Towards running everything onboard, we want to try training a policy that uses IMU measurements (accelerometer and gyroscope), perhaps even replacing the motion capture observations. There are a number of practical issues that I suspect would prevent this from working, but it would get our policies closer to the edge! Fortunately, RotorPy already has an IMU model!
- Try training policies on lower-level control abstractions, such as body rates or even individual motor thrust commands.
- Deeper comparisons with other quadrotor environments for RL (like Isaac Sim), to test my hypothesis above.
- Replicating and testing similar works such as DATT.