Abstract: This paper proposes an accurate sensor fusion scheme for navigation inside a real-scale 3D model, combining audio and video signals. The audio signal from a microphone array is beamformed with the Minimum Variance Distortionless Response (MVDR) algorithm and processed in real time by a Hidden Markov Model (HMM), whose output is converted into translation commands by the word-to-action module of the speech processing system. In parallel, the output of an optical head tracker (four IR cameras) is analyzed by a non-linear/non-Gaussian Bayesian algorithm to estimate the orientation of the user's head; this orientation is used to redirect the user toward a new heading by applying a quaternion rotation. The outputs of the two sensors (video and audio) are combined under the sensor fusion scheme to support continuous traveling inside the model, and the highest traveling precision is achieved when fusion is enabled. A practical experiment shows promising results for the implementation.
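As a rough illustration of the quaternion-based redirection mentioned in the abstract, the sketch below rotates a heading vector by a yaw quaternion using NumPy. The axis/angle convention and the helper names (`quat_from_axis_angle`, `rotate`) are assumptions for illustration, not the paper's actual implementation:

```python
import numpy as np

def quat_from_axis_angle(axis, angle):
    # Unit quaternion (w, x, y, z) for a rotation by `angle` radians about `axis`.
    axis = np.asarray(axis, dtype=float)
    axis /= np.linalg.norm(axis)
    half = angle / 2.0
    return np.concatenate(([np.cos(half)], np.sin(half) * axis))

def quat_mul(a, b):
    # Hamilton product of two quaternions in (w, x, y, z) order.
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def rotate(v, q):
    # Rotate vector v by unit quaternion q via v' = q * (0, v) * q^-1.
    qv = np.concatenate(([0.0], v))
    q_conj = q * np.array([1.0, -1.0, -1.0, -1.0])
    return quat_mul(quat_mul(q, qv), q_conj)[1:]

# Redirect the forward heading by a 90-degree yaw about the vertical (z) axis,
# as a stand-in for the head-tracker-driven redirection described in the paper.
forward = np.array([1.0, 0.0, 0.0])
q = quat_from_axis_angle([0.0, 0.0, 1.0], np.pi / 2)
new_heading = rotate(forward, q)
```

With this convention, a 90-degree yaw maps the forward vector (1, 0, 0) onto (0, 1, 0); in the described system, the yaw angle would instead come from the Bayesian head-orientation estimate.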