Learning to Recognise Dynamic Visual Content from Broadcast Footage

Lead Research Organisation: University of Surrey
Department Name: Centre for Vision, Speech and Signal Processing (CVSSP)

Abstract

This research is in the area of computer vision: making computers that can understand what is happening in photographs and video. As humans we are fascinated by other humans, and we capture endless images of their activities, for example home movies of our family on holiday, video of sports events, or CCTV footage of people in a town centre. A computer capable of understanding what people are doing in such images could do many jobs for us, for example finding clips of our children waving, fast-forwarding to a goal in a football game, or spotting when someone starts a fight in the street. For Deaf people, who use a language combining hand gestures with facial expression and body language, a computer that could visually understand their actions would allow them to communicate in their native language. While humans are very good at understanding what people are doing (and can learn to understand special actions such as sign language), this has proved extremely challenging for computers.

Much work has tried to solve this problem, and it works well in particular settings: for example, the computer can tell if a person is walking so long as they do it clearly and face to the side, or can understand a few sign language gestures as long as the signer cooperates and signs slowly. We will investigate better models for recognising activities by teaching the computer with many example videos. To make sure our method works well in all kinds of setting, we will use real-world video from movies and TV. For each video we have to tell the computer what it represents, for example "throwing a ball" or "a man hugging a woman". It would be expensive to collect and label many videos in this way, so instead we will extract approximate labels automatically from the subtitle text and scripts which are available for TV. Our new methods will combine learning from large amounts of approximately labelled video (cheap, because we get the labels automatically), the use of contextual information such as which actions people do at the same time, or how one action leads to another ("he hits the man, who falls to the floor"), and computer vision methods for understanding the pose of a person (how they are standing), how they are moving, and the objects they are using.

By having plenty of video to learn from, and methods for making use of approximate labels, we will be able to build stronger and more flexible models of human activities. This will lead to recognition methods that work better in the real world and contribute to applications such as interpreting sign language and automatically tagging video with its content.
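To make the weak-labelling idea above concrete, the sketch below shows one minimal way that approximate action labels could be mined from subtitle text, by matching a small keyword vocabulary against timed subtitle blocks. The SRT layout, the vocabulary and all function names here are illustrative assumptions, not the project's actual pipeline.

```python
# Illustrative sketch only: mining approximate ("weak") action labels from
# subtitle text. The keyword vocabulary, SRT layout and names are hypothetical.
import re
from dataclasses import dataclass

# Hypothetical vocabulary mapping subtitle words to coarse action labels.
ACTION_KEYWORDS = {
    "hug": "hug", "hugs": "hug",
    "wave": "wave", "waving": "wave",
    "throw": "throw", "throws": "throw",
    "hit": "hit", "hits": "hit",
}

@dataclass
class WeakLabel:
    start: float   # clip start time in seconds
    end: float     # clip end time in seconds
    action: str    # approximate action label mined from the text

def _to_seconds(ts: str) -> float:
    """Convert an SRT timestamp such as '00:01:02,500' to seconds."""
    h, m, s = ts.replace(",", ".").split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

def mine_weak_labels(srt_text: str) -> list[WeakLabel]:
    """Scan subtitle blocks and emit a weak label whenever a known
    action keyword appears in the spoken text."""
    labels = []
    # An SRT block: index line, "start --> end" line, then one or more text lines.
    pattern = re.compile(
        r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\n(.*?)(?:\n\n|\Z)",
        re.S,
    )
    for start, end, text in pattern.findall(srt_text):
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in ACTION_KEYWORDS:
                labels.append(WeakLabel(_to_seconds(start), _to_seconds(end),
                                        ACTION_KEYWORDS[word]))
    return labels

if __name__ == "__main__":
    sample = (
        "1\n00:01:02,000 --> 00:01:04,500\nHe hits the man, who falls to the floor.\n\n"
        "2\n00:02:10,000 --> 00:02:12,000\nThe children are waving at the camera.\n"
    )
    for lab in mine_weak_labels(sample):
        print(f"{lab.start:.1f}-{lab.end:.1f}s  ->  {lab.action}")
```

Labels mined in this way are noisy by design; the learning methods described above are intended to cope with exactly this kind of approximate supervision.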

Planned Impact

The proposed research has potential impact for three communities:

1. The computer vision and machine learning communities in academia.
2. Potential end users of tools for automatic categorisation and searching of content, such as organisations like the BBC.
3. The Deaf community, by providing tools for automatic recognition of sign language.

The computer vision and machine learning communities will benefit from new knowledge, new techniques, and the creation of new and challenging datasets for use by the wider research community. Dissemination to the research community will be via publication in the major national and international conferences and journals and via a project website. The principal search and media companies have a significant presence at these conferences. Additionally, the PIs will organise an international workshop in conjunction with one of the major international conferences to disseminate the outputs of the project, and there is the possibility of running a technical meeting in association with the British Machine Vision Association (BMVA). There is also potential to contribute to other scientific disciplines: one of our project partners, DCAL (see letter of support), has indicated that any tools for automatic recognition, annotation and search of sign language would be immensely beneficial to their research on sign linguistics and the ESRC Corpus project.

By automatically categorising, labelling and providing search facilities, the research has substantial benefits for the broadcast and media industry. Much of the BBC's digital video archive material from the 1970s and 1980s has only the name of the programme and the transmission date; enabling search and annotation of such material will bring vast improvements in accessibility. The BBC Archives are another project partner specifically for this reason. Moreover, with the ever-growing quantities of digital media from personal devices, such tools have far wider-reaching applications for the general public.

Finally, the research could have a considerable impact on Deaf-hearing communication by providing tools to automatically translate sign into spoken English. Such tools could benefit organisations that provide these services to the Deaf community, such as Significan't, another project partner, which is also well placed to commercialise the technology. More widely, tools that allow the categorisation and searching of sign have applications for online and web-based resources for the Deaf community.

Publications


Gilbert A (2017) Image and video mining through online learning, Computer Vision and Image Understanding

Hadfield S (2014) Scene particles: unregularized particle-based scene flow estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence

Hadfield S (2019) HARD-PnP: PnP Optimization Using a Hybrid Approximate Representation, IEEE Transactions on Pattern Analysis and Machine Intelligence

Hadfield S (2017) Hollywood 3D: What are the Best 3D Features for Action Recognition? International Journal of Computer Vision

Hadfield S (2017) Stereo reconstruction using top-down cues, Computer Vision and Image Understanding

Krejov P (2017) Guided optimisation through classification and regression for hand pose estimation, Computer Vision and Image Understanding

Lebeda K (2017) TMAGIC: A Model-Free 3D Tracker, IEEE Transactions on Image Processing

 
Description We have developed new methods for the recognition of sign and motion in video, techniques for accurate long-term tracking, approaches to labelling image and video content without user supervision, and approaches to learning to categorise content automatically using linguistic annotation.
Exploitation Route Automatic recognition and categorisation of images and video
Sectors Aerospace, Defence and Marine; Creative Economy; Digital/Communication/Information Technologies (including Software); Leisure Activities, including Sports, Recreation and Tourism; Government, Democracy and Justice; Retail

URL http://cvssp.org/projects/dynavis/
 
Description We have seen high uptake of some of the datasets we have released, and the associated publications are gaining citations. Several keynote talks have also been given, but it is still early days in terms of impact and citations for this work, which is ongoing. We are in discussions with a small SME about exploitation: we have now moved the code base to the company so that they can market it, and we have heads of terms in place for a licence deal should the code generate revenue. This research project has also led to new collaborations, including another EPSRC project, direct funding from the Swiss National Science Foundation (SNSF) and an EU project.
First Year Of Impact 2016
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Societal, Economic

 
Description H2020 Innovative Action
Amount € 3,499,856 (EUR)
Funding ID Project ID: 762021 
Organisation European Union 
Sector Public
Country European Union (EU)
Start 09/2017 
End 08/2020
 
Description Sinergia
Amount SFr. 405,034 (CHF)
Funding ID crsii22_160811 
Organisation Swiss National Science Foundation 
Sector Public
Country Switzerland
Start 03/2016 
End 02/2019
 
Title Hollywood3D 
Description As part of the project, we examined the exploitation of 3D information within natural action recognition as a means of reducing the amount of variation within classes. To this end, a dataset of natural actions with 3D data, called Hollywood 3D, was compiled from broadcast footage on 3D Blu-ray. 
Type Of Material Database/Collection of data 
Year Produced 2013 
Provided To Others? Yes  
Impact This benchmark is used by an increasing number of international research groups to assess the performance of algorithms. 
URL http://cvssp.org/data/Hollywood3D/
 
Title Kinect Sign Data Sets 
Description The data covers two languages: GSL (Greek Sign Language) and DGS (German Sign Language). We provide the skeletal data extracted from the original (calibrated) OpenNI tracker, with annotations at the sign level. 
Type Of Material Database/Collection of data 
Year Produced 2012 
Provided To Others? Yes  
Impact
URL http://cvssp.org/data/KinectSign/webpages/index.html
 
Title YouTube Long Term Tracking Sequences 
Description The dataset contains two video sequences, NissanSkylineChase and LiverRun, which are a subset of the ytLongTrack dataset. Both contain traffic scenes, with the camera mounted on a vehicle chasing another car, and both are of low quality. They were chosen to test long-term tracking because of their challenging properties, such as length (LiverRun exceeds 29,000 frames), strong illumination and viewpoint changes, extreme scale changes, and full occlusions. 
Type Of Material Database/Collection of data 
Year Produced 2013 
Provided To Others? Yes  
Impact
URL http://cvssp.org/data/YTLongTrack/
 
Description Aachen Uni 
Organisation RWTH Aachen University
Country Germany 
Sector Academic/University 
PI Contribution PhD co-supervision with Aachen and hosting of a visiting researcher from Aachen
Collaborator Contribution Aachen funded the student during their year-long placement with us
Impact Koller O, Ney H, Bowden R, Deep Learning of Mouth Shapes for Sign Language, Third Workshop on Assistive Computer Vision and Robotics (ACVR-15), ICCV 2015.
Koller O, Ney H, Bowden R, Read My Lips: Continuous Signer Independent Weakly Supervised Viseme Recognition, in Proc. European Conference on Computer Vision (ECCV 2014), LNCS 8690, pp. 281-296.
Ong E J, Koller O, Pugeault N, Bowden R, Sign Spotting using Hierarchical Sequential Patterns with Temporal Intervals, in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), pp. 1931-1938. DOI: 10.1109/CVPR.2014.248
Koller O, Ney H, Bowden R, Weakly Supervised Automatic Transcription of Mouthings for Gloss-Based Sign Language Corpora, LREC Workshop on the Representation and Processing of Sign Languages: Beyond the Manual Channel, LREC Proceedings 2014, pp. 94-98.
Koller O, Ney H, Bowden R, May the Force be with you: Force-Aligned SignWriting for Automatic Subunit Annotation of Corpora, 10th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2013), Shanghai, China, 22-26 April 2013, pp. 1-6. DOI: 10.1109/FG.2013.6553777
Start Year 2012
 
Description Leeds Uni 
Organisation University of Leeds
Country United Kingdom 
Sector Academic/University 
PI Contribution Open collaboration on sign language recognition and translation
Collaborator Contribution Collaboration on sign language recognition and translation
Impact See award outcomes
Start Year 2011
 
Description University of Oxford 
Organisation University of Oxford
Country United Kingdom 
Sector Academic/University 
PI Contribution Open collaboration on sign language recognition and translation
Collaborator Contribution Collaboration on sign language recognition and translation
Impact See award outcomes