Learning to Recognise Dynamic Visual Content from Broadcast Footage

Lead Research Organisation: University of Surrey
Department Name: Vision, Speech and Signal Processing (CVSSP)

Abstract

This research is in the area of computer vision: making computers that can understand what is happening in photographs and video. As humans we are fascinated by other humans, and we capture endless images of their activities, for example home movies of our family on holiday, video of sports events, or CCTV footage of people in a town centre. A computer capable of understanding what people are doing in such images could do many jobs for us, for example finding clips of our children waving, fast-forwarding to a goal in a football game, or spotting when someone starts a fight in the street. For Deaf people, who use a language combining hand gestures with facial expression and body language, a computer that could visually understand their actions would allow them to communicate in their native language.

While humans are very good at understanding what people are doing (and can learn to understand special actions such as sign language), this has proved extremely challenging for computers. Much work has tried to solve this problem, and works well in particular settings: for example, the computer can tell if a person is walking so long as they do it clearly and face to the side, or can understand a few sign language gestures as long as the signer cooperates and signs slowly. We will investigate better models for recognising activities by teaching the computer through many example videos. To make sure our method works well in all kinds of settings, we will use real-world video from movies and TV. For each video we have to tell the computer what it represents, for example "throwing a ball" or "a man hugging a woman". It would be expensive to collect and label lots of videos in this way, so instead we will extract approximate labels automatically from subtitle text and scripts, which are available for TV.
Our new methods will combine learning from large quantities of approximately labelled video (cheap, because we get the labels automatically); use of contextual information, such as which actions people do at the same time, or how one action leads to another ("he hits the man, who falls to the floor"); and computer vision methods for understanding the pose of a person (how they are standing), how they are moving, and the objects they are using. By having lots of video to learn from, and methods for making use of approximate labels, we will be able to build stronger and more flexible models of human activities. This will lead to recognition methods that work better in the real world and contribute to applications such as interpreting sign language and automatically tagging video with its content.
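The idea of mining approximate action labels from subtitles can be illustrated with a minimal sketch. Everything below is invented for illustration, not the project's actual pipeline: the tiny action vocabulary, the crude prefix-based matching, and the simplified SRT parsing are all assumptions; a real system would use proper script alignment and language processing.

```python
import re

# Hypothetical action vocabulary; a real system would mine verbs from scripts.
ACTIONS = {"hug", "wave", "hit", "run"}

def parse_srt(srt_text):
    """Parse minimal SRT subtitles into (start_sec, end_sec, text) tuples."""
    pattern = re.compile(
        r"(\d{2}):(\d{2}):(\d{2}),\d{3} --> (\d{2}):(\d{2}):(\d{2}),\d{3}\n(.+?)(?:\n\n|\Z)",
        re.S,
    )
    entries = []
    for m in pattern.finditer(srt_text):
        start = int(m.group(1)) * 3600 + int(m.group(2)) * 60 + int(m.group(3))
        end = int(m.group(4)) * 3600 + int(m.group(5)) * 60 + int(m.group(6))
        entries.append((start, end, m.group(7).replace("\n", " ")))
    return entries

def approximate_labels(srt_text, actions=ACTIONS):
    """Emit (start, end, action) weak labels wherever an action verb appears.

    Labels are only approximate: the subtitle timing loosely brackets the
    action, and a mentioned verb need not be visible on screen.
    """
    labels = []
    for start, end, text in parse_srt(srt_text):
        for word in re.findall(r"[a-z]+", text.lower()):
            for action in actions:
                # crude stemming: 'hugs'/'hugging' -> 'hug' (illustrative only)
                if word.startswith(action):
                    labels.append((start, end, action))
    return labels
```

For example, a subtitle saying "He hugs her tightly" between seconds 1 and 3 would yield the weak label (1, 3, "hug"): cheap to obtain at scale, but noisy, which is exactly why the learning methods must tolerate approximate labels.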

Planned Impact

The proposed research has potential impact for three communities:
1. The computer vision and machine learning communities in academia.
2. The potential end users of tools for automatic categorisation and searching of content, such as organisations like the BBC.
3. The Deaf community, by providing tools for automatic recognition of sign language.

The computer vision and machine learning communities will benefit from new knowledge, new techniques, and the creation of new and challenging datasets for use by the wider research community. Dissemination to the research community will be via publication in the major national and international conferences and journals and via a project website. The principal search and media companies have a significant presence at these conferences. Additionally, the PIs will organise an international workshop in conjunction with one of the major international conferences to disseminate the outputs of the project, and there is the possibility of running a technical meeting in association with the British Machine Vision Association (BMVA). There is also potential to contribute to other scientific disciplines: one of our project partners, DCAL (see letter of support), has indicated that any tools for automatic recognition, annotation and search of sign language would be immensely beneficial to their research on sign linguistics and the ESRC Corpus project.

By automatically categorising, labelling and providing search facilities, the research has immense benefits for the broadcast and media industry. Much of the BBC's digital video archive material from the 1970s and 1980s carries only the name of the programme and the transmission date; enabling search and annotation of such material will bring vast improvements in accessibility. The BBC Archives are another project partner, specifically for this reason. Moreover, with the ever-growing quantities of digital media from personal devices, such tools have far wider-reaching applications for the general public.
Finally, the research could have a considerable impact on Deaf-hearing communication by providing tools to automatically translate sign into spoken English. Such tools could benefit organisations that provide these services to the Deaf community, such as Significan't, another project partner, who are also well placed to commercialise the technology. More widely, tools that allow the categorisation and searching of sign have applications for online and web-based resources for the Deaf community.

Publications


Gilbert A (2017) Image and video mining through online learning in Computer Vision and Image Understanding

Gilbert A (2015) Computer Vision -- ACCV 2014

Hadfield S (2014) Scene particles: unregularized particle-based scene flow estimation in IEEE Transactions on Pattern Analysis and Machine Intelligence

Hadfield S (2017) Stereo reconstruction using top-down cues in Computer Vision and Image Understanding

Hadfield S (2014) Computer Vision - ECCV 2014

Hadfield S (2019) HARD-PnP: PnP Optimization Using a Hybrid Approximate Representation in IEEE Transactions on Pattern Analysis and Machine Intelligence

 
Description We have developed new methods for the recognition of sign and motion in video; techniques for accurate long-term tracking; approaches to labelling image and video content without user supervision; and approaches to learning to categorise content automatically using linguistic annotation.
Exploitation Route Automatic recognition and categorisation of images and video
Sectors Aerospace, Defence and Marine; Creative Economy; Digital/Communication/Information Technologies (including Software); Leisure Activities, including Sports, Recreation and Tourism; Government, Democracy and Justice; Retail

URL http://cvssp.org/projects/dynavis/
 
Description We have had high uptake of some of the datasets we have released, and the associated publications are gaining citations. Several keynote talks have also taken place, but it is still early days in terms of impact and citations for this on-going work. We are in discussions with a small SME about exploitation: we have now moved the code base to the company so they can market it, and have a heads of terms in place to agree a licence deal should the code generate revenue. This research project has also led on to new collaborations, including another EPSRC project, direct funding from the Swiss National Science Foundation (SNSF), and an EU project.
First Year Of Impact 2016
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Societal, Economic

 
Description H2020 Innovative Action
Amount € 3,499,856 (EUR)
Funding ID Project ID: 762021 
Organisation European Union 
Sector Public
Country European Union (EU)
Start 09/2017 
End 08/2020
 
Description Sinergia
Amount SFr. 405,034 (CHF)
Funding ID crsii22_160811 
Organisation Swiss National Science Foundation 
Sector Public
Country Switzerland
Start 03/2016 
End 02/2019
 
Title Hollywood3D 
Description As part of the project, I have examined the exploitation of 3D information within natural action recognition as a means of reducing the amount of variation within classes. To this end I compiled Hollywood 3D, a dataset of natural actions with 3D data, from broadcast footage on 3D Blu-ray.
Type Of Material Database/Collection of data 
Year Produced 2013 
Provided To Others? Yes  
Impact This benchmark is used by an increasing number of international research groups to assess the performance of algorithms. 
URL http://cvssp.org/data/Hollywood3D/
 
Title Kinect Sign Data Sets 
Description The data covers two languages GSL (Greek Sign Language) and DGS (German Sign Language). We provide the skeletal data extracted from the original (calibrated) OpenNI tracker with annotations at the sign level. 
Type Of Material Database/Collection of data 
Year Produced 2012 
Provided To Others? Yes  
Impact
URL http://cvssp.org/data/KinectSign/webpages/index.html
 
Title YouTube Long Term Tracking Sequences 
Description The dataset contains two video sequences, NissanSkylineChase and LiverRun, which are a subset of the ytLongTrack dataset. Both contain traffic scenes, with the camera mounted on a vehicle chasing another car, and both are of low quality. They were chosen to test long-term tracking for their challenging properties, such as length (LiverRun exceeds 29,000 frames), strong illumination and viewpoint changes, extreme scale changes, and full occlusions. The following table summarises the properties of the video sequences. 
Type Of Material Database/Collection of data 
Year Produced 2013 
Provided To Others? Yes  
Impact
URL http://cvssp.org/data/YTLongTrack/
 
Description Aachen Uni 
Organisation RWTH Aachen University
Country Germany 
Sector Academic/University 
PI Contribution PhD co-supervision with Aachen and hosting of a visiting researcher from Aachen
Collaborator Contribution Aachen funded the student during their year-long placement with us
Impact Koller O, Ney H, Bowden R, Deep Learning of Mouth Shapes for Sign Language. Accepted, to appear in Third Workshop on Assistive Computer Vision and Robotics (ACVR-15), ICCV 2015.
Koller O, Ney H, Bowden R, Read My Lips: Continuous Signer Independent Weakly Supervised Viseme Recognition. In Proc. European Conference on Computer Vision (ECCV 2014), LNCS 8690, pp. 281-296.
Ong E J, Koller O, Pugeault N, Bowden R, Sign Spotting using Hierarchical Sequential Patterns with Temporal Intervals. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), pp. 1931-1938. DOI: 10.1109/CVPR.2014.248
Koller O, Ney H, Bowden R, Weakly Supervised Automatic Transcription of Mouthings for Gloss-Based Sign Language Corpora. LREC Workshop on the Representation and Processing of Sign Languages: Beyond the Manual Channel, LREC Proceedings 2014, pp. 94-98.
Koller O, Ney H, Bowden R, May the Force be with you: Force-Aligned SignWriting for Automatic Subunit Annotation of Corpora. 10th IEEE Int. Conf. on Automatic Face and Gesture Recognition (FG 2013), Shanghai, China, 22-26 April 2013, pp. 1-6. DOI: 10.1109/FG.2013.6553777
Start Year 2012
 
Description Leeds Uni 
Organisation University of Leeds
Country United Kingdom 
Sector Academic/University 
PI Contribution Open collaboration on sign language recognition and translation
Collaborator Contribution Collaboration on sign language recognition and translation
Impact See awards outcomes
Start Year 2011
 
Description University of Oxford 
Organisation University of Oxford
Country United Kingdom 
Sector Academic/University 
PI Contribution Open collaboration on sign language recognition and translation
Collaborator Contribution Collaboration on sign language recognition and translation
Impact See awards outcomes