Interpretable Reward Modeling with Active Concept Bottlenecks
Best AI papers explained - A podcast by Enoch H. Kang

This academic paper introduces Concept Bottleneck Reward Models (CB-RM), a framework designed to make the reward functions used in Reinforcement Learning from Human Feedback (RLHF) more interpretable. Unlike traditional opaque reward models, CB-RM decomposes reward prediction into human-understandable concepts, such as helpfulness or correctness. To address the high cost of concept annotation, the authors propose an active learning (AL) strategy that uses an Expected Information Gain (EIG) acquisition function to select the most informative concept labels to query. Experiments on the UltraFeedback dataset show that this approach significantly improves concept accuracy and sample efficiency without compromising overall preference prediction accuracy, a step toward more transparent and auditable AI alignment. The paper also cautions against potential information leakage when using large language models pre-trained on evaluation datasets.
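To make the two core ideas concrete, here is a minimal PyTorch-style sketch of what a concept-bottleneck reward head and an active acquisition step might look like. It is illustrative only: the class and function names are invented, the bottleneck is a single linear layer over assumed concept scores, and the entropy-based score below is a simple stand-in for the paper's EIG criterion rather than its exact formulation.

```python
import torch
import torch.nn as nn

class ConceptBottleneckRewardModel(nn.Module):
    """Sketch of a CB-RM head: a backbone embedding is first mapped to
    interpretable concept scores (e.g. helpfulness, correctness), and the
    scalar reward is a transparent linear combination of those scores."""

    def __init__(self, hidden_dim: int, num_concepts: int):
        super().__init__()
        self.concept_head = nn.Linear(hidden_dim, num_concepts)  # predicts concept scores
        self.reward_head = nn.Linear(num_concepts, 1)            # reward computed from concepts only

    def forward(self, embedding: torch.Tensor):
        concepts = torch.sigmoid(self.concept_head(embedding))   # per-concept scores in [0, 1]
        reward = self.reward_head(concepts)                      # reward is auditable via concept weights
        return reward, concepts


def acquisition_score(concept_probs: torch.Tensor) -> torch.Tensor:
    """Toy acquisition score: rank candidate (example, concept) labels by
    predictive entropy, a common proxy for information gain with binary
    labels. Higher entropy -> a more informative query."""
    eps = 1e-8
    p = concept_probs.clamp(eps, 1 - eps)
    return -(p * p.log() + (1 - p) * (1 - p).log())


# Usage sketch: score a pool of unlabeled candidates, then query the
# top-k (example, concept) labels from human annotators.
model = ConceptBottleneckRewardModel(hidden_dim=768, num_concepts=4)
pool_embeddings = torch.randn(100, 768)            # stand-in for backbone embeddings
_, concept_probs = model(pool_embeddings)
scores = acquisition_score(concept_probs)          # shape: (100, 4)
queries = torch.topk(scores.flatten(), k=10).indices  # concept labels to annotate next
```

The design point this illustrates is that the reward depends on the input only through the concept scores, so inspecting those scores (and the final linear weights) explains why a response was rewarded, while the acquisition loop spends the annotation budget where concept labels are most uncertain.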