Abstract: In this paper we tackle the problem of object instance retrieval in video repositories using minimal information from the user (e.g., a textual description or tags). Starting from a set of tags, images containing the object of interest are crawled from popular image search engines and repositories (e.g., Bing, Flickr, Google), and the positive, most representative instances of the object are automatically identified. These positive images are then used to generate a visual query descriptor and to retrieve videos containing the object of interest. This multi-modal approach makes it possible to retrieve video content through images obtained from textual queries, without the use of any advanced learning technique. We test our method on the Flickr corpus of the TRECVID 2012 Instance Search Task.
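The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how representative images are selected, how the query descriptor is built, or how videos are scored. The sketch assumes that each crawled image and each video frame is already described by a fixed-length feature vector, that representative images are those with the highest mean cosine similarity to the other crawled images, that the query descriptor is the average of the selected vectors, and that a video is scored by its best-matching frame. All function names and the toy data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_representative(descriptors, keep):
    """Assumed selection rule: keep the `keep` descriptors whose mean
    similarity to all other crawled descriptors is highest."""
    scores = []
    for i, d in enumerate(descriptors):
        others = [cosine(d, e) for j, e in enumerate(descriptors) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    ranked = sorted(range(len(descriptors)), key=lambda i: scores[i], reverse=True)
    return [descriptors[i] for i in ranked[:keep]]


def build_query(descriptors):
    """Assumed query descriptor: component-wise mean of the positive descriptors."""
    n = len(descriptors)
    return [sum(col) / n for col in zip(*descriptors)]


def rank_videos(query, video_descriptors):
    """Assumed scoring: a video's score is the best cosine match among its frames."""
    scored = {
        vid: max(cosine(query, frame) for frame in frames)
        for vid, frames in video_descriptors.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Toy data: three mutually similar crawled images and one off-topic outlier.
crawled = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.95, 0.15, 0.05],
    [0.0, 0.1, 1.0],
]
positives = select_representative(crawled, keep=3)
query = build_query(positives)

# Hypothetical per-frame descriptors for two videos.
videos = {
    "video_a": [[0.0, 0.0, 1.0], [0.1, 0.9, 0.0]],
    "video_b": [[1.0, 0.2, 0.0]],
}
ranking = rank_videos(query, videos)
print(ranking)
```

In this toy setting the off-topic image is discarded before the query is built, so the video whose frame resembles the three consistent images is ranked first. A real system would substitute actual image descriptors and a selection rule matched to the noise level of the crawled results.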