Although non-stationary bandit learning has attracted much recent attention, the field still lacks a formal definition of non-stationarity that consistently distinguishes non-stationary bandits from stationary ones. Prior work characterizes a non-stationary bandit as one whose reward distribution changes over time. We demonstrate that this definition can ambiguously classify the same bandit as both stationary and non-stationary; the ambiguity arises from the definition's dependence on the latent sequence of reward distributions. Moreover, this definition has given rise to two widely used notions of regret, dynamic regret and weak regret, which fail to reflect qualitative agent performance in some bandits that are intuitively nearly stationary, i.e., non-stationary bandits that closely resemble stationary ones. The definition has also led to the design of agents that explore excessively. We introduce a formal definition of non-stationary bandits that resolves these issues. Under our definition, two bandits that provide agents with indistinguishable experiences are classified alike: both stationary or both non-stationary. Our definition also applies seamlessly to both the Bayesian and the frequentist formulations of bandits, providing a unified approach.