Offline reinforcement learning aims to learn an optimal policy from a pre-collected dataset and is becoming prevalent in numerous domains. A common challenge is the scarcity of fully reward-labeled data. Relying on the labeled data alone often yields a narrow state-action distribution, and the distributional shift between the behavior policy and the optimal policy poses significant hurdles to high-quality policy learning. Recognizing that the volume of unlabeled data is typically much larger than that of labeled data, and guided by the principle of pessimism, we propose to leverage the abundant unlabeled data to learn a lower bound of the reward function, rather than of the Q-function or the state transition function. This strategy effectively addresses the issue of distributional shift and greatly simplifies the learning process. We develop the corresponding semi-supervised offline RL algorithms. We further introduce the notion of "semi-coverage," which is essential for establishing the theoretical guarantees of our new method. We conduct extensive numerical analyses to demonstrate the empirical efficacy of the proposed method.
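To make the idea concrete, below is a minimal sketch (not the paper's exact algorithm) of pessimistic reward labeling: fit an ensemble of reward models on the small labeled set, assign each unlabeled transition a lower-confidence-bound reward, and then hand the combined dataset to any standard offline RL learner. All specifics here (bootstrap ridge ensemble, `beta`, `n_models`) are illustrative assumptions, not the method proposed in the paper.

```python
# Sketch: pessimistic reward labeling for semi-supervised offline RL.
# Labeled data: (state, action) features with observed rewards.
# Unlabeled data: (state, action) features without rewards.
import numpy as np

rng = np.random.default_rng(0)

def fit_ridge(X, y, lam=1e-2):
    """Closed-form ridge regression: w = (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def pessimistic_rewards(sa_labeled, r_labeled, sa_unlabeled,
                        n_models=10, beta=1.0):
    """Lower-confidence-bound reward for unlabeled state-action pairs:
    ensemble mean minus beta times ensemble standard deviation."""
    preds = []
    n = len(sa_labeled)
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)            # bootstrap resample
        w = fit_ridge(sa_labeled[idx], r_labeled[idx])
        preds.append(sa_unlabeled @ w)
    preds = np.stack(preds)                          # (n_models, n_unlabeled)
    return preds.mean(axis=0) - beta * preds.std(axis=0)

# Toy data: features are concatenated (state, action) vectors.
sa_lab = rng.normal(size=(200, 6))
r_lab = sa_lab @ rng.normal(size=6) + 0.1 * rng.normal(size=200)
sa_unlab = rng.normal(size=(5000, 6))

r_lower = pessimistic_rewards(sa_lab, r_lab, sa_unlab)
# The relabeled transitions (sa_unlab, r_lower) can now be merged with the
# labeled data and passed to an off-the-shelf offline RL algorithm.
print(r_lower[:5])
```

The design choice illustrated here is that pessimism is applied only to the reward model, so any downstream offline RL algorithm can consume the relabeled dataset unchanged.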