Abstract

Research topic

A problem of fundamental interest is semantic video understanding, particularly with respect to the location and identity of humans. There are several reasons for interest in this topic. First, a semantic video description can provide the basis for a video indexing system, which processes video signals so as to make natural-language search requests possible. Second, face detection and recognition in video are important because they can tell us a great deal about an entire video without requiring deep semantic understanding of it; in the general case, humans are the key objects of interest in semantic video analysis and carry a significantly higher degree of importance than any other objects for the majority of applications. This property is especially valuable when the cost of extracting semantic meaning from video is very high. Finally, and most importantly, face detection and recognition in video often work extremely well in practice and enable accurate, reliable, and efficient practical systems.

In this work we concern ourselves strictly with the face detection and recognition problem as a detached part of the more general semantic video understanding problem. We therefore did not explore the possibility of better video comprehension through the integration of multiple types of semantic descriptors. We have focused our attention on a solution that is accurate and reliable enough to work more effectively than a human expert in video analysis applications. In particular, we are interested in all types of video in which human individuals can be found.

Problems and Objective

Neither the theory of face detection and recognition in video nor its application to pattern recognition is new; many research papers have been published starting from the 1970s.
However, widespread application of face detection and recognition theory to semantic video understanding has occurred only within the past decade, for several reasons. First, the growth of the telecommunication industry has reached the limits of human-driven media information management, so automatic video understanding and management systems have become critical for further development. Second, the original applications of face detection and recognition, mainly in the security domain, did not provide sufficient performance for most semantic video analysis demands. As a result, the fundamentals of face detection and recognition theory have matured to a level of detail sufficient for a number of research labs to begin working on further development in the direction discussed above.

The objective of this work can therefore be summarized as follows: to propose a face detection and recognition solution for video that is fast, accurate, and reliable enough to be implemented in a semantic video understanding system capable of replacing a human expert in a variety of multimedia indexing applications. At the same time, we assume that the research results obtained in this work are complete enough to be adapted or modified as part of other image processing, pattern recognition, and video indexing and analysis systems.

Our Approach

A video signal can be represented by a series of layers, each containing one semantic object. The only difference between such a semantically decomposed signal and a standard video is that in the latter all layers are mixed together. Given this view, the problem of interest is how to decompose the video signal into layers so as to isolate the semantic objects in it. Because of its unique nature, the human face was used as a token for solving the semantic video understanding problem, and in particular for understanding human individuals.
The main difficulties stem from the variability of human appearance (expression, orientation, scale, etc.) and of photometric conditions within the video (lighting conditions, video quality, duration, and others). Unlike many problems in image processing and pattern recognition for which an exact solution can be given, there are several possible ways of solving the face detection and recognition problem associated with the given semantic video understanding task. The difficulty lies in the optimal choice of the general architecture and basic principles of such a system. For some of these implementation issues we can prescribe exact analytical solutions; for others, mainly concerning the architectural fundamentals, we can only draw on the experience the scientific community has gained in working with the problem of interest. Considering the two polar concepts, human-like data comprehension on the one hand and computer-like processing, which is fast but very simple in nature, on the other, we have to benefit from both approaches to meet real-world demands. Following this hypothesis, our approach consists of choosing a computationally efficient core with simple instructions, which is then adapted by human knowledge in order to empower the entire solution.

Contributions

To illustrate the contributions made in this work, we first present our general concept of face detection and recognition as a part of the semantic video understanding problem; we then separately discuss the face annotating, detection, matching, and recognition problems and our particular solutions to them.

Face detection and recognition as a part of the semantic video understanding problem: in this part of the thesis we point out that each particular classification/analysis solution for any semantic object has theoretical limits on its learning capacity. However, each such solution can be trained to accurately and reliably solve a part of the entire problem.
Therefore we conclude that face detection and recognition must be decomposed into a series of connected sub-problems, each of which can be solved efficiently. This architecture relies on human knowledge of the subject of interest to decompose the problem and connect the separate solutions; at the same time, it is grounded in fast, accurate, and reliable image processing and pattern recognition methods for resolving the particular sub-problems.

Face annotating: this section presents a method for semi-automatic ground truth segmentation for benchmarking face detection and recognition in video. We show how an image processing and pattern recognition expert can segment and annotate facial patterns in video sequences at a rate of 7500 frames per hour. Evaluation criteria are discussed for different aspects of manual face segmentation. We extend these ideas to a semi-automatic face segmentation methodology in which all facial patterns are categorized into 4 classes in order to increase the flexibility of evaluation result analysis. We present a strict guide for speeding up the manual segmentation process by up to 30 times and illustrate it on sample test video sequences consisting of more than 90000 frames, 800 individuals, and 50000 facial images. Experimental evaluation of face detection using the semi-automatically segmented ground truth data demonstrates the effectiveness of the approach for both the learning and test stages.

Face detection: given the role of face detection and recognition in semantic video understanding, there are basic problems of interest that have been solved in this thesis. The first is the integration problem: given a set of methods for face detection, how do we combine their results in the most efficient way?
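Before describing our particular solution, it is worth seeing the shape of the integration problem in code. The sketch below is a generic, illustrative way of fusing regions proposed by several independent localizers through overlap voting; it is an assumption for illustration only, not the data-mining strategy developed in the thesis, and the box format, vote count, and overlap threshold are all placeholder choices.

```python
# Illustrative sketch (not the thesis method): fuse regions of interest
# proposed by several independent face localizers by keeping boxes that at
# least `min_votes` localizers agree on, where agreement is measured by
# intersection-over-union (IoU).

def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def fuse(detections, min_votes=2, iou_thr=0.5):
    """detections: one list of candidate boxes per localizer."""
    all_boxes = [b for boxes in detections for b in boxes]
    fused = []
    for box in all_boxes:
        # Count how many localizers proposed an overlapping region.
        votes = sum(
            any(iou(box, other) >= iou_thr for other in boxes)
            for boxes in detections
        )
        # Keep the box if enough localizers agree and it is not a duplicate.
        if votes >= min_votes and not any(iou(box, f) >= iou_thr for f in fused):
            fused.append(box)
    return fused

# Two of three localizers agree on roughly the same region:
d = [[(10, 10, 50, 50)], [(12, 11, 50, 50)], [(200, 200, 40, 40)]]
print(fuse(d))  # -> [(10, 10, 50, 50)]
```

The point of such a fusion step is that individually weak localizers become useful as a group: a region survives only when several of them agree on it.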
Our solution utilizes five independent face localizers to select potential regions of interest; these regions are then classified using data mining, which produces an optimal detection strategy. Following this strategy, potential facial regions are searched by a neural network classifier in order to localize the face position with minimum computational expense while preserving high accuracy and reliability. The second problem is the construction of a face/non-face classifier with high accuracy (>0.7) and extremely high reliability (a false alarm rate below 10E-8). The presented solution is based on a series of cascades of three different kinds of classifier. The first set of cascades is a modified version of the state-of-the-art classifier based on integral facial features. It is followed by a set of neural network cascades on integral facial features. The last type of classifier is an intensity-based neural network. Although this solution is computationally more expensive than the reference works, it provides higher reliability, which is the most significant factor for the problem of interest. Finally, we consider the problem of adapting face detection to the current video stream conditions so that it provides better results than a fixed solution. The application of interest is media streaming on mobile phones; for selected cases, for example videoconferences, we were able to achieve frame-rate face detection on a standard smartphone.

Face matching: the face detection and recognition problem has led to many schools of thought. Our idea is to let a face with known parameters correspond roughly to a probe face that is found by the face detector and must be recognized by the face recognizer. The recognition process can then take advantage of a priori available probe characteristics such as lighting conditions, orientation, and expression.
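The matching idea described above, pairing a probe face with a gallery face of known parameters, can be sketched as a nearest-neighbour search over compactly encoded faces. The sketch below is an illustrative assumption, not the thesis implementation: it assumes each face has already been encoded into a small binary code and matches by Hamming distance; the codes and labels are placeholders.

```python
# Illustrative sketch: nearest-neighbour matching over compact binary facial
# codes. We assume (this is a simplification, not the thesis encoder) that
# each face is already encoded as a small integer bit pattern; the probe is
# matched to the gallery entry with the smallest Hamming distance.

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")

def best_match(probe: int, gallery: dict) -> str:
    """Return the gallery label whose code is closest to the probe code."""
    return min(gallery, key=lambda label: hamming(probe, gallery[label]))

gallery = {
    "alice": (0b101101 << 54) | 0x1234,  # placeholder 60-bit codes
    "bob":   (0b010010 << 54) | 0xFFFF,
}
probe = gallery["alice"] ^ 0b111  # probe differs from "alice" by 3 bits
print(best_match(probe, gallery))  # prints "alice"
```

With such codes, the matched gallery face supplies the known parameters (lighting, orientation, expression) that the recognizer can exploit.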
Furthermore, we require each detected facial image to have a corresponding best match; this implies that the models used for recognition will work best when they are properly adjusted. Face matching requires a large gallery of training facial images to correctly approximate the practically infinite variability of probe images. To solve this problem, this work uses the concept of a compact facial model, a facial image that is encoded and can be decoded by a 3D face modeling engine when needed, together with fast search methods. The first important issue raised by face matching is the accuracy of the probe image, which depends entirely on the face detector. Face matching was solved with a 60-bit encoded compact facial model and the ability to correctly match probe facial images with varying in-plane/out-of-plane orientations, head mesh transforms, emotions, texture details, lighting conditions, scales, and translations.

Face recognition: this process places constraints on the face detection and matching parts of the entire solution: the probe facial image located by the detector has to be described by the compact facial model in the face matching unit. Genetic algorithms are used for matching at the learning stage because of their higher accuracy compared with the fast search methods used in the previous case (the drawback is significantly higher computational complexity). The probe facial image is reconstructed into a normalized texture by the 3D face modeling engine instantiated with the compact facial model parameters. The facial feature matching then finds the most suitable templates for the eyes, eyebrows, nose, and mouth using fast matching methods. Such a facial image can thus be represented by a set of facial feature templates and their locations. This information is used for preliminary classification, where only the top-N individuals are selected for further recognition.
Finally, a set of neural network cascades based on the encoded facial description (trained independently for each individual) is used for choosing the appropriate response. This again makes the recognition task more accurate and reliable, and thus leads to higher system performance. The results of this work were applied in the multimedia indexing system CINDI within the framework of the RNRT Cyrano project on personal media distribution and management. The fast search methods were used in the biomedical image indexing project B-705. A series of face detection and tracking methods was integrated into a commercial video assets management system.
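The final identity-selection step described above, where per-individual classifiers vote on the encoded facial description, can be sketched as follows. This is an illustrative assumption, not the thesis implementation: the placeholder callables below stand in for the trained per-individual neural network cascades, and the rejection threshold is invented for the example.

```python
# Illustrative sketch: one small classifier per enrolled individual scores
# the encoded facial description; the highest-confidence identity wins,
# provided it clears a rejection threshold. The classifiers here are
# placeholder callables standing in for trained neural network cascades.

def recognize(encoding, per_person_classifiers, reject_below=0.5):
    """Return the best-scoring identity, or None if no classifier is confident."""
    best_name, best_score = None, reject_below
    for name, classifier in per_person_classifiers.items():
        score = classifier(encoding)  # confidence in [0, 1]
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Toy classifiers: confidence falls off with distance from a reference code.
make = lambda ref: (lambda enc: max(0.0, 1.0 - 0.1 * abs(enc - ref)))
classifiers = {"alice": make(3), "bob": make(11)}
print(recognize(10, classifiers))  # prints "bob"
```

Training one classifier per individual, as the thesis does for its cascades, lets each model specialize, and the rejection threshold provides the reliability that an open-set video setting requires (unknown faces return no identity).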