- 1709.06853
Accepted author manuscript, 4.11 MB, PDF document

Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License

- http://www.lancs.ac.uk/~pikeburc/mabdaaf.pdf
Final published version

- http://proceedings.mlr.press/v80/pike-burke18a.html
Final published version

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

**Bandits with Delayed, Aggregated Anonymous Feedback.** / Pike-Burke, Ciara; Agrawal, Shipra; Szepesvari, Csaba; Grunewalder, Steffen.

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Pike-Burke, C, Agrawal, S, Szepesvari, C & Grunewalder, S 2018, Bandits with Delayed, Aggregated Anonymous Feedback. in J Dy (ed.), *Proceedings of the International Conference on Machine Learning, 10-15 July 2018, Stockholmsmässan, Stockholm Sweden.* Proceedings of Machine Learning Research, vol. 80, PMLR, pp. 4105-4123. <http://www.lancs.ac.uk/~pikeburc/mabdaaf.pdf>

Pike-Burke, C., Agrawal, S., Szepesvari, C., & Grunewalder, S. (2018). Bandits with Delayed, Aggregated Anonymous Feedback. In J. Dy (Ed.), *Proceedings of the International Conference on Machine Learning, 10-15 July 2018, Stockholmsmässan, Stockholm Sweden *(pp. 4105-4123). (Proceedings of Machine Learning Research; Vol. 80). PMLR. http://www.lancs.ac.uk/~pikeburc/mabdaaf.pdf

Pike-Burke C, Agrawal S, Szepesvari C, Grunewalder S. Bandits with Delayed, Aggregated Anonymous Feedback. In Dy J, editor, Proceedings of the International Conference on Machine Learning, 10-15 July 2018, Stockholmsmässan, Stockholm Sweden. PMLR. 2018. p. 4105-4123. (Proceedings of Machine Learning Research).

@inproceedings{7db87a74695040beb576ce6d577d9755,

title = "Bandits with Delayed, Aggregated Anonymous Feedback",

abstract = "We study a variant of the stochastic K-armed bandit problem, which we call “bandits with delayed,aggregated anonymous feedback”. In this problem, when the player pulls an arm, a rewardis generated, however it is not immediately observed.Instead, at the end of each round the player observes only the sum of a number of previouslygenerated rewards which happen to arrive in the given round. The rewards are stochasticallydelayed and due to the aggregated nature of the observations, the information of which armled to a particular reward is lost. The question is what is the cost of the information loss due to this delayed, aggregated anonymous feedback? Previous works have studied bandits with stochastic, non-anonymous delays and found that the regret increases only by an additive factor relating to the expected delay. In this paper, we show that this additive regret increase can be maintained in the harder delayed, aggregated anonymous feedback setting when the expected delay (or a bound on it) is known. We provide an algorithm that matches the worst case regret of the non-anonymous problem exactly when the delays are bounded, and up to logarithmic factors or an additive variance term for unbounded delays.",

author = "Ciara Pike-Burke and Shipra Agrawal and Csaba Szepesvari and Steffen Grunewalder",

year = "2018",

month = jul,

day = "15",

language = "English",

series = "Proceedings of Machine Learning Research",

publisher = "PMLR",

pages = "4105--4123",

editor = "Jennifer Dy",

booktitle = "Proceedings of the International Conference on Machine Learning, 10-15 July 2018, Stockholmsm{\"a}ssan, Stockholm Sweden",

}

TY - GEN

T1 - Bandits with Delayed, Aggregated Anonymous Feedback

AU - Pike-Burke, Ciara

AU - Agrawal, Shipra

AU - Szepesvari, Csaba

AU - Grunewalder, Steffen

PY - 2018/7/15

Y1 - 2018/7/15

N2 - We study a variant of the stochastic K-armed bandit problem, which we call “bandits with delayed,aggregated anonymous feedback”. In this problem, when the player pulls an arm, a rewardis generated, however it is not immediately observed.Instead, at the end of each round the player observes only the sum of a number of previouslygenerated rewards which happen to arrive in the given round. The rewards are stochasticallydelayed and due to the aggregated nature of the observations, the information of which armled to a particular reward is lost. The question is what is the cost of the information loss due to this delayed, aggregated anonymous feedback? Previous works have studied bandits with stochastic, non-anonymous delays and found that the regret increases only by an additive factor relating to the expected delay. In this paper, we show that this additive regret increase can be maintained in the harder delayed, aggregated anonymous feedback setting when the expected delay (or a bound on it) is known. We provide an algorithm that matches the worst case regret of the non-anonymous problem exactly when the delays are bounded, and up to logarithmic factors or an additive variance term for unbounded delays.

AB - We study a variant of the stochastic K-armed bandit problem, which we call “bandits with delayed,aggregated anonymous feedback”. In this problem, when the player pulls an arm, a rewardis generated, however it is not immediately observed.Instead, at the end of each round the player observes only the sum of a number of previouslygenerated rewards which happen to arrive in the given round. The rewards are stochasticallydelayed and due to the aggregated nature of the observations, the information of which armled to a particular reward is lost. The question is what is the cost of the information loss due to this delayed, aggregated anonymous feedback? Previous works have studied bandits with stochastic, non-anonymous delays and found that the regret increases only by an additive factor relating to the expected delay. In this paper, we show that this additive regret increase can be maintained in the harder delayed, aggregated anonymous feedback setting when the expected delay (or a bound on it) is known. We provide an algorithm that matches the worst case regret of the non-anonymous problem exactly when the delays are bounded, and up to logarithmic factors or an additive variance term for unbounded delays.

M3 - Conference contribution/Paper

T3 - Proceedings of Machine Learning Research

SP - 4105

EP - 4123

BT - Proceedings of the International Conference on Machine Learning, 10-15 July 2018, Stockholmsmässan, Stockholm Sweden

A2 - Dy, Jennifer

PB - PMLR

ER -