2015年12月
Non-Asymptotic Analysis of a New Bandit Algorithm for Semi-Bounded Reward
JOURNAL OF MACHINE LEARNING RESEARCH
- ,
- 巻
- 16
- 号
- 開始ページ
- 3721
- 終了ページ
- 3756
- 記述言語
- 英語
- 掲載種別
- 研究論文(学術雑誌)
- 出版者・発行元
- MICROTOME PUBL
In this paper we consider a stochastic multiarmed bandit problem. It is known in this problem that Deterministic Minimum Empirical Divergence (DMED) policy achieves the asymptotic theoretical bound for the model where each reward distribution is supported in a known bounded interval, say [0; 1]. However, the regret bound of DMED is described in an asymptotic form and the performance in finite time has been unknown. We modify this policy and derive a finite-time regret bound for the new policy, Indexed Minimum Empirical Divergence (IMED), by refining large deviation probabilities to a simple nonasymptotic form. Further, the refined analysis reveals that the finite-time regret bound is valid even in the case that the reward is not bounded from below. Therefore, our finitetime result applies to the case that the minimum reward (that is, the maximum loss) is unknown or unbounded. We also present some simulation results which shows that IMED much improves DMED and performs competitively to other state-of-the-art policies.
- リンク情報
- ID情報
-
- ISSN : 1532-4435
- Web of Science ID : WOS:000369888000042