Reduce overhead in ODM

In the current implementation, when the entropy reward is selected, we run multiple forward passes of the model to get the logits and then compute the reward from them.

We should optimize this path and remove the overhead of the extra forward passes.
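One possible direction, sketched below under stated assumptions, is to run the forward pass once per batch and reuse the resulting logits for the entropy reward. This is not the existing ODM code: `LogitsCache`, `entropy_reward`, `token_entropy`, and the `forward` callable are hypothetical names used only for illustration, and plain Python lists stand in for tensors so the example is self-contained.

```python
import math
from typing import Callable, Dict, Hashable, List, Sequence

# Assumption: logits are one row of vocabulary scores per token.
Logits = List[List[float]]


def token_entropy(row: Sequence[float]) -> float:
    """Shannon entropy (in nats) of softmax(row), computed stably."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    # H = log z - sum_i p_i * (x_i - m), with p_i = exps_i / z
    return math.log(z) - sum(e * (x - m) for e, x in zip(exps, row)) / z


def entropy_reward(logits: Logits) -> float:
    """Mean per-token entropy; a hypothetical stand-in for the entropy reward."""
    return sum(token_entropy(row) for row in logits) / len(logits)


class LogitsCache:
    """Runs the forward pass at most once per batch key and reuses the result."""

    def __init__(self, forward: Callable[[Hashable], Logits]) -> None:
        self._forward = forward
        self._cache: Dict[Hashable, Logits] = {}
        self.forward_calls = 0  # counts real forward passes, for inspection

    def logits(self, batch_key: Hashable) -> Logits:
        if batch_key not in self._cache:
            self._cache[batch_key] = self._forward(batch_key)
            self.forward_calls += 1
        return self._cache[batch_key]

    def clear(self) -> None:
        """Drop cached logits, e.g. after the model weights are updated."""
        self._cache.clear()
```

With this shape, every caller that needs logits for the same batch goes through `LogitsCache.logits`, so the model runs only once for that batch. The cache would need to be cleared after each optimizer step, since logits computed with old weights are stale.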