Paper

Peer-reviewed
December 2015

Non-Asymptotic Analysis of a New Bandit Algorithm for Semi-Bounded Rewards

JOURNAL OF MACHINE LEARNING RESEARCH
  • Honda, Junya
  • Takemura, Akimichi

Volume
16
First page
3721
Last page
3756
Language
English
Publication type
Research paper (academic journal)
Publisher
MICROTOME PUBL

In this paper we consider a stochastic multi-armed bandit problem. It is known that the Deterministic Minimum Empirical Divergence (DMED) policy achieves the asymptotic theoretical bound for the model in which each reward distribution is supported in a known bounded interval, say [0, 1]. However, the regret bound of DMED is stated only in an asymptotic form, and its finite-time performance has been unknown. We modify this policy and derive a finite-time regret bound for the new policy, Indexed Minimum Empirical Divergence (IMED), by refining large deviation probabilities into a simple non-asymptotic form. Further, the refined analysis reveals that the finite-time regret bound remains valid even when the reward is not bounded from below. Therefore, our finite-time result applies to the case in which the minimum reward (that is, the maximum loss) is unknown or unbounded. We also present simulation results which show that IMED substantially improves on DMED and performs competitively with other state-of-the-art policies.
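For reference, the IMED policy pulls, at each round, the arm minimizing the index N_i * D_inf(F̂_i, μ̂*) + log N_i, where N_i is the pull count of arm i, F̂_i its empirical reward distribution, and μ̂* the largest empirical mean. The following is a minimal sketch of this index in the Bernoulli special case, where D_inf reduces to the binary KL divergence; the paper's semi-bounded setting computes D_inf via a dual maximization instead, and all function names here are illustrative, not from the paper.

```python
import math
import random

def kl_bernoulli(p, q, eps=1e-12):
    """Binary KL divergence kl(p, q); stands in for D_inf in the Bernoulli case."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def imed_choose(counts, sums):
    """Return the arm minimizing the IMED index N_i * D_inf(F_i, mu*) + log N_i."""
    # Pull each arm once before any index is defined.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    means = [s / n for s, n in zip(sums, counts)]
    mu_star = max(means)  # largest empirical mean
    # For the best empirical arm the KL term vanishes, leaving log N_i.
    indices = [n * kl_bernoulli(m, mu_star) + math.log(n)
               for n, m in zip(counts, means)]
    return min(range(len(counts)), key=lambda i: indices[i])

def run(probs, horizon, seed=0):
    """Simulate IMED on a Bernoulli bandit with the given success probabilities."""
    rng = random.Random(seed)
    k = len(probs)
    counts, sums = [0] * k, [0.0] * k
    for _ in range(horizon):
        i = imed_choose(counts, sums)
        reward = 1.0 if rng.random() < probs[i] else 0.0
        counts[i] += 1
        sums[i] += reward
    return counts

if __name__ == "__main__":
    # Most pulls should concentrate on the 0.7 arm.
    print(run([0.3, 0.5, 0.7], 10000))
```

Note that, unlike upper-confidence-bound policies, this rule needs no confidence-interval tuning: exploration is driven entirely by the log N_i term against the divergence penalty.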

Link information
Web of Science
https://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcAuth=JSTA_CEL&SrcApp=J_Gate_JST&DestLinkType=FullRecord&KeyUT=WOS:000369888000042&DestApp=WOS_CPL
ID information
  • ISSN : 1532-4435
  • Web of Science ID : WOS:000369888000042
