To properly test this approach, we need an env that (1) produces renderings, and (2) has train tasks that we can easily solve. Currently, the only envs that satisfy (1) are burger and kitchen, but neither satisfies (2). Ideally, we create or reuse a super-simple env that adds state renderings under state.simulator_state["images"] (perhaps we can do this for something like cover?). A rough sketch of the idea is below.
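A minimal, self-contained sketch of what "adding renderings to the state" could look like. The `State` class, `render_state`, and `attach_rendering` below are illustrative stand-ins, not the actual classes in the codebase; the only piece taken from this note is the convention of storing images in `simulator_state["images"]`.

```python
# Illustrative sketch only: these classes stand in for the real env/State
# classes, which have different constructors and richer interfaces.
from dataclasses import dataclass, field
from typing import Any, Dict, List

import numpy as np


@dataclass
class State:
    """Stand-in for a State: feature values plus a simulator_state dict
    that can carry auxiliary data such as renderings."""
    data: Dict[str, float]
    simulator_state: Dict[str, Any] = field(default_factory=dict)


def render_state(state: State, width: int = 32, height: int = 32) -> np.ndarray:
    """Trivial placeholder renderer: a blank RGB image whose first pixel
    encodes one (hypothetical) state feature, so tests have something
    deterministic to check."""
    img = np.zeros((height, width, 3), dtype=np.uint8)
    img[0, 0, 0] = int(state.data.get("block_x", 0.0) * 255) % 256
    return img


def attach_rendering(state: State) -> State:
    """Store the rendering under simulator_state["images"], following the
    convention described for the burger/kitchen envs."""
    images: List[np.ndarray] = [render_state(state)]
    state.simulator_state["images"] = images
    return state


if __name__ == "__main__":
    s = attach_rendering(State(data={"block_x": 0.4}))
    print(s.simulator_state["images"][0].shape)  # (32, 32, 3)
```

For a real test env (e.g. a cover variant), the same hook would presumably live in the env's state-creation or simulation step, so every state the agent sees already carries its rendering.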