Learning to Recognise Dynamic Visual Content from Broadcast Footage
Lead Research Organisation:
University of Surrey
Department Name: Vision, Speech and Signal Processing (CVSSP)
Abstract
This research is in the area of computer vision: making computers that can understand what is happening in photographs and video. As humans we are fascinated by other humans, and we capture endless images of their activities, for example home movies of our family on holiday, video of sports events, or CCTV footage of people in a town centre. A computer capable of understanding what people are doing in such images could do many jobs for us, for example finding clips of our children waving, fast-forwarding to a goal in a football game, or spotting when someone starts a fight in the street. For Deaf people, who use a language combining hand gestures with facial expression and body language, a computer that could visually understand their actions would allow them to communicate in their native language. While humans are very good at understanding what people are doing (and can learn to understand special actions such as sign language), this has proved extremely challenging for computers.

Much work has tried to solve this problem, and it works well in particular settings: for example, the computer can tell if a person is walking so long as they do it clearly and face to the side, or it can understand a few sign language gestures as long as the signer cooperates and signs slowly. We will investigate better models for recognising activities by teaching the computer with many example videos. To make sure our methods work well in all kinds of settings we will use real-world video from movies and TV. For each video we would normally have to tell the computer what it represents, for example "throwing a ball" or "a man hugging a woman". It would be expensive to collect and label many videos in this way, so instead we will extract approximate labels automatically from the subtitle text and scripts that are available for TV.

Our new methods will combine learning from large amounts of approximately labelled video (cheap, because we get the labels automatically); the use of contextual information, such as which actions people do at the same time or how one action leads to another (he hits the man, who falls to the floor); and computer vision methods for understanding the pose of a person (how they are standing), how they are moving, and the objects they are using. By having a large amount of video to learn from, and methods for making use of approximate labels, we will be able to build stronger and more flexible models of human activities. This will lead to recognition methods that work better in the real world and contribute to applications such as interpreting sign language and automatically tagging video with its content.
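To make the weak-supervision idea above concrete, the short sketch below shows one simple way approximate action labels could be pulled from subtitle files. It is a minimal sketch only: the SRT-style layout, the keyword list and the function names are illustrative assumptions, not the project's actual pipeline.

# A minimal sketch, assuming an SRT-style subtitle file and a hand-picked
# keyword list (both hypothetical), of deriving approximate "weak" action
# labels for stretches of TV footage from what is being said.
import re

ACTION_KEYWORDS = {  # hypothetical vocabulary: keyword -> action label
    "wave": "waving",
    "hug": "hugging",
    "hit": "hitting",
    "fall": "falling",
}

def parse_timestamp(ts):
    """Convert an SRT timestamp such as '00:01:23,450' into seconds."""
    h, m, rest = ts.split(":")
    s, ms = rest.split(",")
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def weak_labels_from_srt(srt_text):
    """Return (start_sec, end_sec, label) triples for subtitle blocks whose
    text mentions an action keyword. The labels are only approximate: the
    action may happen before or after the line is spoken, or not be visible."""
    labels = []
    # SRT blocks are separated by blank lines: index line, 'start --> end' line, text lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue
        m = re.match(r"(\S+)\s*-->\s*(\S+)", lines[1])
        if not m:
            continue
        start, end = (parse_timestamp(t) for t in m.groups())
        text = " ".join(lines[2:]).lower()
        for keyword, label in ACTION_KEYWORDS.items():
            if re.search(r"\b" + keyword, text):
                labels.append((start, end, label))
    return labels

The point of the research is then to learn from many such noisily labelled clips, together with contextual cues and pose, rather than trusting any single label.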
Planned Impact
The proposed research has potential impact for three communities:
1. The computer vision and machine learning communities in academia.
2. Potential end users of tools for automatic categorisation and searching of content, such as organisations like the BBC.
3. The Deaf community, through tools for automatic recognition of sign language.

The computer vision and machine learning communities will benefit from new knowledge, new techniques, and the creation of new and challenging datasets for use by the wider research community. Dissemination to the research community will be via publication in the major national and international conferences and journals and via a project website. The principal search and media companies have a significant presence at these conferences. Additionally, the PIs will organise an international workshop in conjunction with one of the major international conferences to disseminate the outputs of the project, and there is the possibility of running a technical meeting in association with the British Machine Vision Association (BMVA). There is also potential to contribute to other scientific disciplines: one of our project partners, DCAL (see letter of support), has indicated that any tools for automatic recognition, annotation and search of sign language would be immensely beneficial to their research on sign linguistics and the EPSC Corpus project.

By automatically categorising, labelling and providing search facilities, the research has immense benefits for the broadcast and media industry. Much of the BBC's digital video archive material from the 1970s and 1980s carries only the name of the programme and the transmission date. Enabling search and annotation of such material will bring vast improvements in accessibility; the BBC Archive is another project partner, specifically for this reason. Moreover, with the ever-growing quantities of digital media from personal devices, such tools have far wider-reaching applications for the general public.

Finally, the research could have a considerable impact on Deaf-hearing communication, providing tools to automatically translate sign into spoken English. Such tools could benefit organisations that provide these services to the Deaf community, such as Significan't, another project partner, who are also well placed to commercialise the technology. More widely, tools that allow the categorisation and searching of sign have applications for online and web-based resources for the Deaf community.
People | Richard Bowden (Principal Investigator) |
Publications
Eng-Jon Ong (2012) Sign Language Recognition using Sequential Pattern Trees.
Gilbert A (2017) Image and video mining through online learning, in Computer Vision and Image Understanding.
Gilbert A (2015) Geometric Mining: Scaling Geometric Hashing to Large Datasets.
Hadfield S (2019) HARD-PnP: PnP Optimization Using a Hybrid Approximate Representation, in IEEE Transactions on Pattern Analysis and Machine Intelligence.
Hadfield S (2017) Stereo reconstruction using top-down cues, in Computer Vision and Image Understanding.
Hadfield S (2017) Hollywood 3D: What are the Best 3D Features for Action Recognition?, in International Journal of Computer Vision.
Hadfield S (2013) Hollywood 3D: Recognizing Actions in 3D Natural Scenes.
Hadfield S (2014) Scene Particles: Unregularized Particle-Based Scene Flow Estimation, in IEEE Transactions on Pattern Analysis and Machine Intelligence.
Description | We have developed new methods for recognising sign and motion in video, techniques for accurate long-term tracking, approaches to labelling image and video content without user supervision, and approaches to learning to categorise content automatically using linguistic annotation. |
Exploitation Route | Automatic recognition and categorisation of images and video |
Sectors | Aerospace, Defence and Marine; Creative Economy; Digital/Communication/Information Technologies (including Software); Leisure Activities, including Sports, Recreation and Tourism; Government, Democracy and Justice; Retail |
URL | http://cvssp.org/projects/dynavis/ |
Description | We have had a high uptake of some of the datasets we have released, and the associated publications are gaining citations. Several keynote talks have also taken place, but it is still early days in terms of impact and citations for this ongoing work. We are in discussions with a small SME about exploitation: we have now moved the code base to the company so they can market it, and have heads of terms for a licence deal in place should the code generate revenue. This research project has also led to new collaborations, including another EPSRC project, direct funding from the Swiss National Science Foundation (SNSF), and an EU project. |
First Year Of Impact | 2016 |
Sector | Digital/Communication/Information Technologies (including Software) |
Impact Types | Societal, Economic
Description | H2020 Innovative Action |
Amount | € 3,499,856 (EUR) |
Funding ID | Project ID: 762021 |
Organisation | European Union |
Sector | Public |
Country | European Union (EU) |
Start | 08/2017 |
End | 08/2020 |
Description | Sinergia |
Amount | SFr. 405,034 (CHF) |
Funding ID | crsii22_160811 |
Organisation | Swiss National Science Foundation |
Sector | Public |
Country | Switzerland |
Start | 03/2016 |
End | 02/2019 |
Title | Hollywood3D |
Description | As part of the project, I have examined the exploitation of 3D information within natural action recognition as a means to reduce the amount of variation within classes. To this end, a dataset of natural actions with 3D data, called Hollywood 3D, was compiled from broadcast footage on 3D Blu-ray. |
Type Of Material | Database/Collection of data |
Year Produced | 2013 |
Provided To Others? | Yes |
Impact | This benchmark is used by an increasing number of international research groups to assess the performance of algorithms. |
URL | http://cvssp.org/data/Hollywood3D/ |
Title | Kinect Sign Data Sets |
Description | The data covers two languages: GSL (Greek Sign Language) and DGS (German Sign Language). We provide the skeletal data extracted from the original (calibrated) OpenNI tracker, with annotations at the sign level. |
Type Of Material | Database/Collection of data |
Year Produced | 2012 |
Provided To Others? | Yes |
Impact | . |
URL | http://cvssp.org/data/KinectSign/webpages/index.html |
Title | YouTube Long Term Tracking Sequences |
Description | The dataset contains two video sequences, NissanSkylineChase and LiverRun, which are a subset of the ytLongTrack dataset. These contain traffic scenes, with the camera mounted on a vehicle chasing another car. Both are of low quality. They were chosen to test long-term tracking because of their challenging properties, such as length (LiverRun exceeds 29,000 frames), strong illumination and viewpoint changes, extreme scale changes, and full occlusions. |
Type Of Material | Database/Collection of data |
Year Produced | 2013 |
Provided To Others? | Yes |
Impact | . |
URL | http://cvssp.org/data/YTLongTrack/ |
Description | Aachen Uni |
Organisation | RWTH Aachen University |
Country | Germany |
Sector | Academic/University |
PI Contribution | PhD co-supervision with Aachen and hosting of a visiting researcher from Aachen |
Collaborator Contribution | Aachen funded the student during their year-long placement with us |
Impact | Koller O, Ney H, Bowden R (2015) Deep Learning of Mouth Shapes for Sign Language. Third Workshop on Assistive Computer Vision and Robotics (ACVR-15), ICCV 2015.
Koller O, Ney H, Bowden R (2014) Read My Lips: Continuous Signer Independent Weakly Supervised Viseme Recognition. In Proc. European Conference on Computer Vision (ECCV 2014), LNCS 8690, pp. 281-296.
Ong E J, Koller O, Pugeault N, Bowden R (2014) Sign Spotting using Hierarchical Sequential Patterns with Temporal Intervals. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), pp. 1931-1938. DOI: 10.1109/CVPR.2014.248
Koller O, Ney H, Bowden R (2014) Weakly Supervised Automatic Transcription of Mouthings for Gloss-Based Sign Language Corpora. LREC Workshop on the Representation and Processing of Sign Languages: Beyond the Manual Channel, LREC Proceedings 2014, pp. 94-98.
Koller O, Ney H, Bowden R (2013) May the Force be with you: Force-Aligned SignWriting for Automatic Subunit Annotation of Corpora. 10th IEEE Int. Conf. on Automatic Face and Gesture Recognition (FG 2013), Shanghai, China, 22-26 April 2013, pp. 1-6. DOI: 10.1109/FG.2013.6553777 |
Start Year | 2012 |
Description | Leeds Uni |
Organisation | University of Leeds |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Open collaboration on sign language recognition and translation |
Collaborator Contribution | Collaboration on sign language recognition and translation |
Impact | See awards outcomes |
Start Year | 2011 |
Description | University of Oxford |
Organisation | University of Oxford |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Open collaboration on sign language recognition and translation |
Collaborator Contribution | Collaboration on sign language recognition and translation |
Impact | See awards outcomes |