Finding Groups of Duplicate Images In Very Large Dataset

Winn Voravuthikunchai (1), Bruno Crémilleux (2), Frédéric Jurie (1)
(1) Equipe Image - Laboratoire GREYC - UMR6072
GREYC - Groupe de Recherche en Informatique, Image, Automatique et Instrumentation de Caen
(2) Equipe CODAG - Laboratoire GREYC - UMR6072
GREYC - Groupe de Recherche en Informatique, Image, Automatique et Instrumentation de Caen
Abstract: This paper addresses the problem of detecting groups of duplicates in large-scale unstructured image datasets such as the Internet. Leveraging recent progress in data mining, we propose an efficient approach based on the search for closed patterns. Moreover, we present a novel way to encode the bag-of-words image representation into data mining transactions. We validate our approach on a new dataset of one million Internet images obtained with random searches on Google image search. Using the proposed method, we find more than 80 thousand groups of duplicates among the one million images in less than three minutes while using only 150 megabytes of memory. Unlike other existing approaches, our method scales gracefully to larger datasets, as it has linear time and space (memory) complexity. Furthermore, the approach does not need to build or use any precomputed indexing structure.
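As a rough illustration of the idea named in the abstract, the sketch below treats each image as a transaction of visual-word IDs and reports closed patterns (word sets that equal the intersection of all images containing them) together with the images that share them. It is a toy under stated assumptions and not the authors' method: the abstract does not describe their encoding or mining algorithm, the function name `closed_pattern_groups`, the `min_items` threshold, and the sample data are hypothetical, and this naive pairwise scan is quadratic, whereas the paper reports linear time and space.

```python
from itertools import combinations


def closed_pattern_groups(transactions, min_items=3):
    """Map each closed pattern (frozenset of visual-word IDs) to the images sharing it.

    Naive illustration only: the intersection of two transactions is always a
    closed pattern, so every pairwise intersection is collected together with
    the full set of images that contain it.
    """
    groups = {}
    for (_, words_a), (_, words_b) in combinations(transactions.items(), 2):
        pattern = words_a & words_b
        # Skip patterns that are too small to suggest a duplicate, or already seen.
        if len(pattern) < min_items or pattern in groups:
            continue
        # Support set: every image whose transaction contains the pattern.
        groups[pattern] = frozenset(
            image for image, words in transactions.items() if pattern <= words
        )
    return groups


# Hypothetical toy data: image id -> set of visual-word IDs (bag-of-words, binarized).
transactions = {
    "img_a": frozenset({1, 4, 7, 9}),
    "img_b": frozenset({1, 4, 7, 9}),
    "img_c": frozenset({1, 4, 7, 12}),
    "img_d": frozenset({2, 3, 5}),
}

groups = closed_pattern_groups(transactions, min_items=3)
for pattern, images in groups.items():
    print(sorted(pattern), "->", sorted(images))
```

In this toy data, `img_a` and `img_b` share all four words and form the tightest group, while `img_c` joins them only under the smaller three-word pattern; `img_d` shares nothing and appears in no group.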

Cited literature: 29 references

https://hal.archives-ouvertes.fr/hal-00806196
Contributor : Yvain Queau
Submitted on : Friday, March 29, 2013 - 3:22:47 PM
Last modification on : Tuesday, February 26, 2019 - 6:06:03 PM
Long-term archiving on : Sunday, April 2, 2017 - 10:43:12 PM

File

12_bmvc_LargeDataetsMining.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00806196, version 1

Citation

Winn Voravuthikunchai, Bruno Crémilleux, Frédéric Jurie. Finding Groups of Duplicate Images In Very Large Dataset. Proceedings of the British Machine Vision Conference (BMVC 2012), Sep 2012, Guildford, United Kingdom. pp.105.1--105.12. ⟨hal-00806196⟩

Metrics

Record views: 456
File downloads: 410