Recognition of Object Categories and Scenes

Lead Research Organisation: University of Surrey
Department Name: Vision Speech and Signal Proc CVSSP

Abstract

This research project proposes to advance state-of-the-art image recognition techniques so that they can recognize a large number of scenes and object categories in real, unconstrained indoor and outdoor environments, e.g. traffic scenes (cars, bicycles and other vehicles, pedestrians, human faces, street signs etc.) and urban and natural scenes (buildings, landscapes etc.) containing various rigid and articulated objects as well as textures. Nowadays almost everybody carries a digital camera, and taking a photo or a short video has never been easier. Broadcasting companies receive thousands of pictures from the general public after every major event, and these documents are annotated manually. Crime investigators collect large amounts of visual evidence, which is also classified manually. The UK has the largest number of security cameras in Europe, but the data they provide is largely unexplored. Furthermore, recognition and interpretation of visual information is one of the major requirements for autonomous intelligent robots. There is therefore a pressing need for a reliable recognition system capable of automatically classifying and annotating large amounts of visual documents. Any progress towards that goal, e.g. automatically prioritizing documents for expert browsing, would be a clear benefit in improving the efficiency of this work.
To fulfil the objectives of this project, major progress has to be made in feature extraction, category representation and efficient search. Recent interest-point-based approaches have demonstrated the capability of dealing with large numbers of categories in visual recognition, and these methods show promising directions towards successful scene and object recognition. Based on these results, we propose to develop novel techniques for extracting image features that are robust to background clutter and viewpoint change, which are currently major challenges in the image recognition domain.
These features will be suitable for the simultaneous representation of scenes and objects at various levels of appearance and structure, as well as for object segmentation. Mid-level image segmentation methods have the potential to provide such features and can bridge the gap between interest point detectors and semantic segmentation in the context of category recognition. There has been little overlap between the recognition and segmentation domains, although the goal is to solve both problems simultaneously.
We also propose to introduce novel hierarchical representations that will exploit the properties of the new features and allow large numbers of image categories to be handled efficiently. The representation will model the categories in multiple hierarchies of image attributes, e.g. intensity, color and texture, as well as relations between different object parts and views. The multiple hierarchies will allow coarse-to-fine classification based on the image cues relevant to the query. Very little work has been done in this area, and the proposed research can shed new light on image representation problems. Finally, efficient tree structures and nearest neighbor search techniques will be employed to handle large amounts of data in multi-category learning.
Developing novel, efficient and robust techniques that may provide successful solutions to fundamental recognition problems and advance the state of the art in feature extraction, category representation and data exploration makes this project very challenging and adventurous. The project is expected to achieve its objectives within 36 months and will involve a research student, a research assistant and the principal investigator.
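The abstract names efficient tree structures and nearest neighbor search as the tools for handling large amounts of data. The record does not say which trees the project used, so the following is only a minimal sketch of one standard choice, an exact k-d tree, assuming feature descriptors are fixed-length numeric tuples with a category label attached. The labels and toy descriptors are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    point: Point
    label: str
    axis: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(items: List[Tuple[Point, str]], depth: int = 0) -> Optional[Node]:
    """Recursively split the labelled points at the median of a cycling axis."""
    if not items:
        return None
    axis = depth % len(items[0][0])
    items = sorted(items, key=lambda it: it[0][axis])
    mid = len(items) // 2
    return Node(items[mid][0], items[mid][1], axis,
                build(items[:mid], depth + 1),
                build(items[mid + 1:], depth + 1))


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node: Optional[Node], query: Point,
            best: Optional[Tuple[float, str]] = None) -> Optional[Tuple[float, str]]:
    """Exact 1-NN search returning (squared distance, label)."""
    if node is None:
        return best
    d = sq_dist(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.label)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Only visit the far side if the current best ball crosses the split plane.
    if diff * diff < best[0]:
        best = nearest(far, query, best)
    return best


if __name__ == "__main__":
    train = [((0.0, 0.0), "road"), ((1.0, 1.0), "car"),
             ((5.0, 5.0), "building"), ((0.5, 4.0), "tree")]
    tree = build(train)
    print(nearest(tree, (4.6, 5.2))[1])  # building
```

Exact k-d trees degrade in high dimensions, which is why large-scale systems often switch to approximate search over many randomized trees; this sketch only shows the basic pruning idea.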
 
Description 'A Smart Camera that Learns from Experience' addresses challenges in visual tracking, i.e. following an object as it moves through a video (a task well demonstrated in computer games). Rather than being a programme that inevitably keeps making mistakes, Predator accepts its mistakes, stores them and uses them to make better decisions in the future. The result is a unique, real-time visual tracking system that improves its performance over time, with the aim of approaching the performance of human vision.

A generic visual recognition system capable of dealing with large numbers of scene and object categories in unconstrained indoor and outdoor environments, e.g. traffic scenes (vehicles, pedestrians, human faces, street signs etc.), urban scenes (buildings, rooms) and natural scenes with various rigid and articulated objects as well as textures (landscapes, animals, vegetation). It addresses extremely challenging problems in visual recognition: the simultaneous recognition, localization and segmentation of various objects and scenes independently of viewing conditions, in the presence of background clutter and occlusion. The human eye and brain have an outstanding ability to deal with these problems; unfortunately, existing recognition systems are still far from this level of performance. One of the main limiting factors is the unlimited and unpredictable variability in the appearance of objects, even those with the same semantic meaning. This calls for large amounts of training data, compact image representations and efficient search techniques.

The main achievement of this project was to advance the state of the art in visual recognition: classifying large numbers of scenes as well as detecting and segmenting object categories in still images and video frames. The project developed:
Novel image representations suitable for simultaneous modeling of scenes and object categories.
New methods for extracting local features robust to viewpoint change and background clutter.
Data structures, clustering and search techniques for efficient recognition.
A generic recognition system capable of dealing with hundreds of scene and object categories.
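The list above mentions clustering and search techniques for efficient recognition without detailing them. A widely used way to combine the two is a bag-of-visual-words representation: local descriptors are clustered with k-means into a vocabulary, and each image becomes a histogram of its nearest visual words. The sketch below assumes that pipeline; it is an illustration, not the project's documented method, and the toy 2-D descriptors stand in for real ones such as 128-D local descriptors.

```python
from typing import List, Sequence

Vec = Sequence[float]


def sq_dist(a: Vec, b: Vec) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(v: Vec, centres: List[List[float]]) -> int:
    """Index of the visual word (cluster centre) nearest to descriptor v."""
    return min(range(len(centres)), key=lambda k: sq_dist(v, centres[k]))


def kmeans(data: List[Vec], k: int, iters: int = 20) -> List[List[float]]:
    """Plain Lloyd's k-means, initialised with the first k descriptors
    (a simplification; random or k-means++ seeding is more usual)."""
    centres = [list(v) for v in data[:k]]
    for _ in range(iters):
        groups: List[List[Vec]] = [[] for _ in range(k)]
        for v in data:
            groups[closest(v, centres)].append(v)
        for j, g in enumerate(groups):
            if g:  # keep the previous centre if a cluster becomes empty
                centres[j] = [sum(c) / len(g) for c in zip(*g)]
    return centres


def bow_histogram(descriptors: List[Vec], vocab: List[List[float]]) -> List[float]:
    """L1-normalised histogram of visual-word occurrences in one image."""
    hist = [0.0] * len(vocab)
    for d in descriptors:
        hist[closest(d, vocab)] += 1
    total = sum(hist) or 1.0
    return [h / total for h in hist]


if __name__ == "__main__":
    training = [(0.0, 0.0), (10.0, 10.0), (0.2, 0.1),
                (9.8, 10.1), (0.1, 0.3), (10.2, 9.9)]
    vocab = kmeans(training, k=2)
    print(bow_histogram([(0.0, 0.5), (9.0, 9.0), (9.5, 10.0)], vocab))
```

The histograms have a fixed length regardless of how many features an image contains, which is what makes them convenient inputs for kernel classifiers and fast search.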
Exploitation Route The potential applications range from new human-computer interfaces, eHealth and animal behaviour analysis to surveillance, robot navigation, assisted driving etc.
Sectors Aerospace, Defence and Marine; Creative Economy; Digital/Communication/Information Technologies (including Software); Education; Leisure Activities, including Sports, Recreation and Tourism; Culture, Heritage, Museums and Collections; Retail; Transport

URL http://kahlan.eps.surrey.ac.uk/featurespace
 
Description The image retrieval and classification systems and feature detectors developed in this project were used in various industrial applications, such as video archive exploration by the BBC and the matching of user-generated content on the BBC web pages. These state-of-the-art recognition systems were evaluated in international image classification competitions, including independent evaluations organized by the National Institute of Standards and Technology (NIST, USA) and the Pascal Network of Excellence (Europe). Our image classification system with novel multi-kernel KDA classifiers won 3 different competitions (TrecVid, Pascal, ImageClef) in 2008, 2009 and 2010. The system was extensively tested as part of a UAV robot designed for the MoD Grand Challenge 2008. For the design, construction and control of an autonomous reconnaissance robot, the Swarm Systems Team received the Best Innovative Idea Award from the Ministry of Defence (UK) in 2008. TLD technology is exploited in various professional fields all over the world. Photron, headquartered in Japan, is the world's leading manufacturer of high-speed digital imaging systems; TLD is used for motion analysis of video images in its high-speed camera software. Virginia Tech is a public land-grant university with a main campus in Blacksburg, Virginia; TLD is used there to build a face and object (vehicle) tracker. Movcam develops and manufactures camera stabilizing devices; TLD technology is used in the development of an autonomous camera and lens control system. Indra Sistemas is an information technology and defense systems company; TLD technology is used in the development of the EWE-8000 family of emulators for antenna alignment. Cladoop is a startup company focusing on computer vision, surveillance and big data; TLD technology is used in the development of its core product. MAGnet Systems is a software and hardware development company based in Canada, and G2Associates is the authorized representative of its products; TLD is used in the development of advanced object tracking capabilities for UAVs.
First Year Of Impact 2008
Sector Creative Economy; Digital/Communication/Information Technologies (including Software); Education; Leisure Activities, including Sports, Recreation and Tourism; Culture, Heritage, Museums and Collections; Retail; Transport
Impact Types Economic

 
Title Action Classification
Description We proposed an approach to action recognition based on a vocabulary of local motion-appearance features and fast approximate search in a large number of trees. Large numbers of features with associated motion vectors are extracted from the video data and represented by many trees. Multiple interest point detectors provide features for every frame. The motion vectors of the features are estimated using optical flow and descriptor-based matching. The features are combined with image segmentation to estimate dominant homographies, and are then separated into static and moving ones despite the camera motion. Features from a query sequence are matched to the trees and vote for action categories and their locations, which are then validated with an SVM classifier. Using a large number of trees makes the process efficient and robust. The system is capable of simultaneous categorization and localization of actions using only a few frames per sequence, and it obtains excellent performance on standard action recognition sequences. We performed large-scale experiments on 17 challenging real action categories from various sports disciplines and demonstrated the robustness of our method to appearance variations, camera motion, scale change, asymmetric actions, background clutter and occlusion.
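The voting step described above (query features are matched and vote for action categories and their locations) can be illustrated with a small sketch. It assumes, hypothetically, that each stored training feature carries an action label and the offset from the feature to the centre of the action it came from, and that votes are accumulated in a coarse spatial grid. A brute-force nearest-neighbour search stands in for the project's approximate search in many trees, and the SVM validation of the winning locations is not shown.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Vec = Sequence[float]
# Hypothetical stored entry: (descriptor, action label, (dx, dy) offset to action centre)
Entry = Tuple[Vec, str, Tuple[float, float]]


def sq_dist(a: Vec, b: Vec) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vote(query: List[Tuple[Vec, Tuple[float, float]]],
         codebook: List[Entry], cell: float = 10.0) -> Counter:
    """Each query feature (descriptor, (x, y)) casts one vote for the label of
    its nearest stored feature at the predicted action centre, quantised to a grid."""
    votes: Counter = Counter()
    for desc, (x, y) in query:
        # Brute-force 1-NN; a stand-in for approximate search in many trees.
        _, label, (dx, dy) = min(codebook, key=lambda e: sq_dist(desc, e[0]))
        cx, cy = x + dx, y + dy
        votes[(label, math.floor(cx / cell), math.floor(cy / cell))] += 1
    return votes


if __name__ == "__main__":
    codebook = [((1.0, 0.0), "kick", (5.0, 0.0)),
                ((0.0, 1.0), "kick", (-5.0, 0.0)),
                ((1.0, 1.0), "swing", (0.0, 5.0))]
    query = [((0.9, 0.1), (45.0, 50.0)),
             ((0.1, 0.9), (55.0, 50.0)),
             ((0.95, 0.9), (20.0, 22.0))]
    print(vote(query, codebook).most_common(1))  # [(('kick', 5, 5), 2)]
```

Because two features from different parts of the same action agree on one centre cell, the correct hypothesis accumulates more votes than isolated mismatches; this is the intuition behind localizing actions from only a few frames.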
Type Of Material Computer model/algorithm 
Year Produced 2010 
Provided To Others? Yes  
Impact This approach had scientific impact on various computer vision projects, including Vidi-video, IDASH and the EPSRC project Visen, where the image classification code developed here was used to analyse the data. The system was extensively tested in various application scenarios and was part of a robot designed for the MoD Grand Challenge 2008. For the design, construction and control of an autonomous reconnaissance robot, the Swarm Systems Team received the Best Innovative Idea Award from the Ministry of Defence (UK) in 2008. The system was also evaluated in international image classification competitions. These are prestigious events that receive a lot of attention from the research community and industry: recently published recognition systems claiming state-of-the-art performance are submitted to independent evaluations organized by the National Institute of Standards and Technology (NIST, USA) or the Pascal Network of Excellence (Europe), and the participants include leading research institutions. Our image classification system with novel multi-kernel KDA classifiers won 3 different competitions (TrecVid, Pascal, ImageClef) in 2008, 2009 and 2010.
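The record credits the winning system to multi-kernel KDA (kernel discriminant analysis) classifiers but gives no details. The sketch below illustrates only the multi-kernel part under stated assumptions: each image is described by several histogram channels (e.g. one per descriptor type), each channel gets a chi-square kernel, and the channels are merged as a weighted sum. The weights, the chi-square choice and the fixed `gamma` are hypothetical; KDA training itself is not shown.

```python
import math
from typing import List, Sequence


def chi2_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Chi-square distance between two histograms; 0/0 bins are skipped.
    (Some definitions include an extra factor of 0.5.)"""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)


def chi2_kernel(x: Sequence[float], y: Sequence[float], gamma: float = 1.0) -> float:
    return math.exp(-gamma * chi2_distance(x, y))


def combined_kernel(xs: List[Sequence[float]], ys: List[Sequence[float]],
                    weights: Sequence[float]) -> float:
    """Weighted sum of per-channel kernels; xs[m] and ys[m] are the
    channel-m histograms of the two images being compared."""
    return sum(w * chi2_kernel(x, y) for w, x, y in zip(weights, xs, ys))


if __name__ == "__main__":
    img_a = [[1.0, 0.0], [0.5, 0.5]]   # two hypothetical channels
    img_b = [[0.0, 1.0], [0.5, 0.5]]
    print(combined_kernel(img_a, img_b, weights=[0.5, 0.5]))
```

A non-negative weighted sum of valid kernels is itself a valid kernel, so the combined similarity can be plugged into any kernel classifier; how the weights are chosen is the part that distinguishes multi-kernel learning methods.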
URL http://kahlan.eps.surrey.ac.uk/featurespace
 
Description BBC R & D 
Organisation British Broadcasting Corporation (BBC)
Department BBC Research & Development
Country United Kingdom 
Sector Public 
PI Contribution Development of image retrieval and classification systems for exploring the video archives of the BBC and user-generated content for the BBC website. The system was presented at the BBC festival of research in 2008.
Collaborator Contribution The BBC provided image and video data for analysis, defined user requirements, and participated in the data annotation and the evaluation of the developed system.
Impact The collaboration defined several benchmark datasets for evaluating image classification and retrieval systems. Software packages were produced for feature extraction, image classification and tracking. Several publications resulted from this project; they are listed in the publications section.
Start Year 2007
 
Title Tracking Learning Detection 
Description TLD is an award-winning, real-time algorithm for tracking unknown objects in video streams. The object of interest is defined by a bounding box in a single frame. TLD simultaneously Tracks the object, Learns its appearance and Detects it whenever it appears in the video. The result is a real-time tracker whose performance typically improves over time.
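The "Learns" part of TLD refers, in the published TLD work, to P-N learning, where the tracked trajectory is used to label new training examples for the detector. The sketch below shows only that labelling step in simplified form, assuming boxes are `(x1, y1, x2, y2)` tuples and using overlap thresholds of 0.6 (positive) and 0.2 (negative); these thresholds are assumptions for illustration, and in the real system labelling happens only when the trajectory is considered reliable.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def overlap(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pn_labels(track_box: Box, candidates: List[Box],
              pos_thr: float = 0.6, neg_thr: float = 0.2) -> Tuple[List[Box], List[Box]]:
    """Windows close to the tracked box become positives, distant ones negatives;
    windows in between are left unlabelled."""
    pos = [c for c in candidates if overlap(c, track_box) > pos_thr]
    neg = [c for c in candidates if overlap(c, track_box) < neg_thr]
    return pos, neg


if __name__ == "__main__":
    track = (10, 10, 50, 50)
    windows = [(12, 12, 52, 52), (100, 100, 140, 140), (30, 30, 70, 70), (20, 10, 60, 50)]
    pos, neg = pn_labels(track, windows)
    print(pos)  # [(12, 12, 52, 52)]
    print(neg)  # [(100, 100, 140, 140), (30, 30, 70, 70)]
```

The window `(20, 10, 60, 50)` has an overlap of exactly 0.6 and is deliberately left out of both sets: ambiguous examples are not used to update the detector.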
IP Reference  
Protection Copyrighted (e.g. software)
Year Protection Granted 2010
Licensed Yes
Impact The tracking code has been licensed to a large international company by the University of Surrey under an NDA for exclusive commercial use in the entertainment business.
 
Title Tracking Learning Detection 
Description TLD is an award-winning, real-time algorithm for tracking unknown objects in video streams. The object of interest is defined by a bounding box in a single frame. TLD simultaneously Tracks the object, Learns its appearance and Detects it whenever it appears in the video. The result is a real-time tracker whose performance typically improves over time.
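For the "Tracks" component, the published TLD work uses, to our understanding, a Median-Flow tracker that discards unreliable point tracks using the forward-backward error: each point is tracked forward one frame and then back, and points that do not return close to where they started are rejected. The sketch below assumes the forward and backward point positions are already given (in practice they would come from an optical-flow routine) and estimates only the box translation, not its scale change.

```python
from statistics import median
from typing import List, Tuple

Pt = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def fb_errors(p0: List[Pt], p0_back: List[Pt]) -> List[float]:
    """Distance between each start point and its forward-then-backward track."""
    return [((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5 for a, b in zip(p0, p0_back)]


def median_flow_step(box: Box, p0: List[Pt], p1: List[Pt], p0_back: List[Pt]) -> Box:
    """Shift the box by the median displacement of the reliable points,
    i.e. those whose forward-backward error is at most the median error."""
    err = fb_errors(p0, p0_back)
    thr = median(err)
    kept = [(a, b) for a, b, e in zip(p0, p1, err) if e <= thr]
    dx = median(b[0] - a[0] for a, b in kept)
    dy = median(b[1] - a[1] for a, b in kept)
    x1, y1, x2, y2 = box
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)


if __name__ == "__main__":
    p0 = [(0, 0), (10, 0), (0, 10), (10, 10)]
    p1 = [(2, 1), (12, 1), (2, 11), (30, 30)]             # last point drifted
    p0_back = [(0, 0), (10, 0), (0.5, 10), (18, 22)]      # it fails to come back
    print(median_flow_step((0, 0, 10, 10), p0, p1, p0_back))  # (2.0, 1.0, 12.0, 11.0)
```

Using medians rather than means keeps a single failed point track, such as the drifting fourth point, from dragging the box away from the object.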
Type Of Technology Software 
Year Produced 2010 
Open Source License? Yes  
Impact This technology is a significant step towards reliable long-term tracking with learning capabilities. The potential applications range from new human-computer interfaces, eHealth and animal behaviour analysis to surveillance, robot navigation, assisted driving etc. In 2011 this work won the ICT Pioneer Prize in a new national scientific competition organized by the EPSRC. Our smart camera (called Predator) was featured worldwide in The Engineer, Engadget, New Electronics, Surrey Advertiser, Time, New Scientist, Gottabemobile, Laptopmag, Hacker News etc. It attracted a lot of attention from industry, with direct enquiries from well-known organisations including NASA (Johnson Space Center), Google, Microsoft, Sony, Nokia and many others. A number of invited seminars on this research were given at various institutions, including a Google Tech Talk. A licence was created at the University of Surrey in September 2011 (PENN-DMS.FID1994352).
URL http://kahlan.eps.surrey.ac.uk/featurespace/tld/
 
Company Name TLDVision 
Description TLD Vision is a research company focusing on real-time object tracking in videos. The ability to track objects is at the core of any application that aims to understand video data. Potential use cases range from object-centric stabilization on consumer cameras to target following from UAVs.
Year Established 2011 
Impact TLD technology is exploited in various professional fields all over the world. Photron is the world's leading manufacturer of high-speed digital imaging systems; TLD is used for motion analysis of video images in its high-speed camera software. Virginia Tech is a public land-grant university with a main campus in Blacksburg, Virginia; TLD is used there to build a face and object (vehicle) tracker. Movcam develops and manufactures camera stabilizing devices; TLD technology is used in the development of an autonomous camera and lens control system. Indra Sistemas is an information technology and defense systems company; TLD technology is used in the development of the EWE-8000 family of emulators for antenna alignment. Cladoop is a startup company focusing on computer vision, surveillance and big data; TLD technology is used in the development of its core product. MAGnet Systems is a software and hardware development company based in Canada, and G2Associates is the authorized representative of its products; TLD is used in the development of advanced object tracking capabilities for UAVs.
Website http://www.tldvision.com/