Optimistic Policy Optimization with Bandit Feedback