Abstract: Robots are expected to live alongside humans and perform tasks for them. Doing so requires a suitable representation of the world that includes human detection. Evidential grids let the robot handle partial information and ignorance explicitly, which is useful in many situations. This paper presents an audiovisual perception scheme for a robot in an indoor environment (apartment, house, etc.). As the robot moves, it must take into account its environment and the humans present. The paper describes the key stages of the multimodal fusion: an evidential grid is built from each modality using a modified Dempster combination, and temporal fusion is performed by an evidential filter based on an adapted version of the Generalized Bayesian Theorem. This enables the robot to keep track of the state of its environment. The robot's next move can then be decided from its mission and the extracted information. The system is tested in a simulated environment under realistic conditions.
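
To make the grid fusion step concrete, the following is a minimal sketch of the standard (unmodified) Dempster combination applied to a single grid cell. It is not the modified combination used in this paper, whose details are not given in the abstract. The two-hypothesis frame (free, occupied), the `CellMass` type, and the `dempster_combine` function are illustrative assumptions, as are the numerical masses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellMass:
    """Hypothetical mass function for one cell over the frame {free, occupied}.

    `unknown` is the mass assigned to the whole frame, i.e. ignorance.
    The three masses are assumed to sum to 1.
    """
    free: float
    occupied: float
    unknown: float


def dempster_combine(a: CellMass, b: CellMass) -> CellMass:
    """Combine two cell masses with the standard Dempster rule (an assumption;
    the paper uses a modified rule that is not reproduced here)."""
    # Conflict: mass given to contradictory pairs (free vs. occupied).
    conflict = a.free * b.occupied + a.occupied * b.free
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    norm = 1.0 - conflict
    # Each hypothesis collects the products whose intersection equals it.
    free = (a.free * b.free + a.free * b.unknown + a.unknown * b.free) / norm
    occupied = (a.occupied * b.occupied + a.occupied * b.unknown
                + a.unknown * b.occupied) / norm
    unknown = (a.unknown * b.unknown) / norm
    return CellMass(free, occupied, unknown)


# Illustrative masses for the same cell from two modalities (made-up values).
visual = CellMass(free=0.6, occupied=0.1, unknown=0.3)
audio = CellMass(free=0.5, occupied=0.2, unknown=0.3)
fused = dempster_combine(visual, audio)
```

Combining any mass with the vacuous mass `CellMass(0.0, 0.0, 1.0)` leaves it unchanged, which is how a cell that one modality has not observed contributes nothing to the fused grid.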