P. Henderson, R. Islam, P. Bachman et al., Deep reinforcement learning that matters, 2017.

G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman et al., OpenAI Gym, 2016.

M. Harvey, Mat Harvey's self-driving car project, pp.28-33

A. Dosovitskiy and V. Koltun, Learning to act by predicting the future, 2016.

V. Mnih, K. Kavukcuoglu, D. Silver et al., Human-level control through deep reinforcement learning, Nature, vol.518, p.529, 2015.

M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, The arcade learning environment: An evaluation platform for general agents, Journal of Artificial Intelligence Research, vol.47, pp.253-279, 2013.

C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward et al., DeepMind Lab, 2016.

M. Kempka, M. Wydmuch, G. Runc et al., ViZDoom: A Doom-based AI research platform for visual reinforcement learning, IEEE Conference on Computational Intelligence and Games (CIG), pp.1-8, 2016.

M. Johnson, K. Hofmann, T. Hutton, and D. Bignell, The Malmo Platform for Artificial Intelligence Experimentation, 2016.

E. Todorov, T. Erez, and Y. Tassa, MuJoCo: A physics engine for model-based control, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.5026-5033, 2012.

V. Mnih, A. P. Badia, M. Mirza et al., Asynchronous methods for deep reinforcement learning, International Conference on Machine Learning, pp.1928-1937, 2016.

P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine, Learning to poke by poking: Experiential learning of intuitive physics, 2016.

D. Ha and J. Schmidhuber, World Models, 2018.

G. Wayne et al., Unsupervised Predictive Memory in a Goal-Directed Agent, 2018.

J. Friston, The free-energy principle: a rough guide to the brain?, Trends in Cognitive Sciences, vol.13, pp.293-301, 2009.

D. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.

A. Barreto, W. Dabney, R. Munos et al., Successor features for transfer in reinforcement learning, Advances in Neural Information Processing Systems, pp.4055-4065, 2017.

R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, 1998.

B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman, Building machines that learn and think like people, Behavioral and Brain Sciences, vol.40, 2017.

C. J. C. H. Watkins, Learning from Delayed Rewards, 1989.

L. Lin, Self-improving reactive agents based on reinforcement learning, planning and teaching, Machine Learning, vol.8, pp.293-321, 1992.

R. V. Hogg and E. A. Tanis, Probability and statistical inference, 2009.

Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan et al., Distral: Robust multitask reinforcement learning, Advances in Neural Information Processing Systems, pp.4499-4509, 2017.

APPENDIX

We present implementation details for each of the three RL baselines that we experiment with (see Sec. IV of the main paper).

A. Deep Q-Network
• Implementation: Keras-rl
• Normalization of inputs
• Adam: learning rate = 0.001, β1 = 0.9, β2 = 0.999
• Policy: Boltzmann policy (softmax) with temperature 1
• 1500 timesteps warmup
• Soft updates
• Replay buffer size: 500000
• Architecture: Input (shape = (64, 3)) - Convolution 1-D (filters: 32, kernel size: 8, 1) - Convolution 1-D (48, 4, 1) - Convolution 1-D (64, 3, 1) - Max Pooling 1-D (a sketch of this setup follows below)
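As an illustration, the following is a minimal sketch of how the settings above could be wired together with keras-rl. The convolutional trunk and the hyper-parameters come from the list; the input reshaping, the activations, the soft-update rate, the final dense Q-value layer, and the action count are assumptions.

```python
# Hypothetical sketch of the DQN baseline above, assuming a (64, 3) observation
# and keras-rl's SequentialMemory with window_length = 1 (which adds a leading axis).
from keras.models import Sequential
from keras.layers import Reshape, Conv1D, MaxPooling1D, Flatten, Dense
from keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

nb_actions = 4  # assumption: replace with env.action_space.n

model = Sequential([
    Reshape((64, 3), input_shape=(1, 64, 3)),            # drop keras-rl's window axis
    Conv1D(32, 8, strides=1, activation='relu'),          # Convolution 1-D (32, 8, 1)
    Conv1D(48, 4, strides=1, activation='relu'),          # Convolution 1-D (48, 4, 1)
    Conv1D(64, 3, strides=1, activation='relu'),          # Convolution 1-D (64, 3, 1)
    MaxPooling1D(),
    Flatten(),
    Dense(nb_actions, activation='linear'),               # Q-value head (assumed)
])

memory = SequentialMemory(limit=500000, window_length=1)  # replay buffer size: 500000
policy = BoltzmannQPolicy(tau=1.0)                        # Boltzmann (softmax), temperature 1

dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,
               nb_steps_warmup=1500,                      # 1500 timesteps warmup
               target_model_update=1e-2,                  # value < 1 gives soft target updates
               policy=policy)
dqn.compile(Adam(lr=0.001), metrics=['mae'])              # Adam, learning rate 0.001
# dqn.fit(env, nb_steps=...)  # env is assumed to return normalized observations
```

The specific soft-update rate of 1e-2 is an assumption; keras-rl treats any target_model_update below 1 as a soft (Polyak-style) update, matching the "soft updates" item above.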

B. Asynchronous Advantage Actor-Critic
• Implementation
• 5 actor-learner threads; all methods performed updates after every 20 actions (t_max = 20 and I_update = 20)
• No action repeat: execute action on every frame (action repeat = 1)
• Architecture: Convolution 1-D (filters: 32, kernel size: 8, 1) - Convolution 1-D (see the sketch after this list)
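Because the architecture item above is truncated, the sketch below assumes the remaining layers mirror the convolutional trunk of the other two baselines; the softmax policy head, the value head, the input shape, and the action count are likewise assumptions. Only the thread count, update interval, and action repeat are taken from the list.

```python
# Sketch of a shared actor-critic network for the A3C baseline, under the
# assumptions stated above (trunk mirrored from the DQN/DFP architecture).
from keras.models import Model
from keras.layers import Input, Conv1D, MaxPooling1D, Flatten, Dense

NB_ACTIONS = 4     # assumption: env.action_space.n
N_THREADS = 5      # 5 actor-learner threads
T_MAX = 20         # update after every 20 actions (t_max = 20, I_update = 20)
ACTION_REPEAT = 1  # no action repeat: act on every frame

obs = Input(shape=(64, 3))                              # assumed input shape
x = Conv1D(32, 8, strides=1, activation='relu')(obs)    # Convolution 1-D (32, 8, 1)
x = Conv1D(48, 4, strides=1, activation='relu')(x)      # assumed, mirroring the other trunks
x = Conv1D(64, 3, strides=1, activation='relu')(x)      # assumed
x = Flatten()(MaxPooling1D()(x))

pi = Dense(NB_ACTIONS, activation='softmax', name='policy')(x)  # actor head (assumed)
v = Dense(1, activation='linear', name='value')(x)              # critic head (assumed)
actor_critic = Model(inputs=obs, outputs=[pi, v])
```

In A3C, each of the N_THREADS workers would collect T_MAX transitions with a local copy of this network, compute policy and value gradients, and apply them asynchronously to the shared weights.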

C. Direct Future Prediction
• Implementation: Direct-Future-Prediction-Keras
• Adam: learning rate = 0.00001, β1 = 0.9, β2 = 0.999
• Measurements used: score, number of fruits picked up
• Normalization of inputs and measurements
• 1000 timesteps warmup
• Training interval: 3 timesteps
• Policy (300000)
• Replay buffer size: 20000
• Architecture: We only modify the convolutional part with: Convolution 1-D (filters: 32, kernel size: 8, 1) - Convolution 1-D (48, 4, 1) - Convolution 1-D (64, 3, 1) - Max Pooling 1-D; the rest is unchanged (see the sketch below).

Video: Deep Q-Network agent after learning in the experiments. The video can be found here.
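The following is a hedged sketch of the replacement convolutional (perception) branch described above. The input shape, the activations, the measurement names, and the exact point where this branch plugs into the otherwise unchanged Direct-Future-Prediction-Keras network are assumptions; the constants simply restate the hyper-parameters listed.

```python
# Sketch of the modified perception branch for the DFP baseline; only the
# convolutional part is swapped in, while the measurement/goal branches and the
# expectation/action streams of Direct-Future-Prediction-Keras are assumed unchanged.
from keras.models import Model
from keras.layers import Input, Conv1D, MaxPooling1D, Flatten
from keras.optimizers import Adam

obs = Input(shape=(64, 3))                              # assumed input shape
x = Conv1D(32, 8, strides=1, activation='relu')(obs)    # Convolution 1-D (32, 8, 1)
x = Conv1D(48, 4, strides=1, activation='relu')(x)      # Convolution 1-D (48, 4, 1)
x = Conv1D(64, 3, strides=1, activation='relu')(x)      # Convolution 1-D (64, 3, 1)
perception = Flatten()(MaxPooling1D()(x))               # feeds the unchanged DFP head
perception_model = Model(inputs=obs, outputs=perception)

optimizer = Adam(lr=0.00001, beta_1=0.9, beta_2=0.999)  # as listed above
MEASUREMENTS = ['score', 'fruits_picked_up']            # measurements to predict (names assumed)
REPLAY_BUFFER_SIZE = 20000
TRAINING_INTERVAL = 3                                   # train every 3 timesteps
WARMUP_TIMESTEPS = 1000
```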