Adaptive cognition for automated sports video annotation (ACASVA)

Lead Research Organisation: University of East Anglia
Department Name: Computing Sciences

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.

Publications

Huang Q (2011) Inferring the Structure of a Tennis Game Using Audio Information in IEEE Transactions on Audio, Speech, and Language Processing

Huang Q (2011) Iterative improvement of speaker segmentation in a noisy environment using high-level knowledge in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

Huang Q (2010) Using high-level information to detect key audio events in a tennis game in Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010

Huang Q (2011) Learning score structure from spoken language for a tennis game in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

Huang Q (2012) Detection of ball hits in a tennis game using audio and visual information in 2012 Conference Handbook - Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2012

Poh N (2010) Addressing Missing Values in Kernel-Based Multimodal Biometric Fusion Using Neutral Point Substitution in IEEE Transactions on Information Forensics and Security

 
Description The ACASVA project is concerned with teaching a computer to "understand" events by teaching it how to "see" and "hear" a video and its associated soundtrack. Because we need to start with quite simple events that also have a simple "syntax" (ordering), we have concentrated on videos of tennis games, which have a clear set of rules governing how events develop. We are particularly interested in how knowledge of the rules affects how events are "seen" and "heard": for instance, should the computer pay more attention to entities that are relevant to the rules (e.g. the players) and less attention to entities that are irrelevant (e.g. the crowd), and can this be determined automatically? Humans clearly accomplish this "attentional focus", but the mechanisms behind it are not well understood. A key aim of the project was therefore to establish how these processes work in humans, and to feed those insights into the computational domain to establish performance baselines (e.g. are humans best at detecting individual events, or at combining information?). A further question is whether learning of audio and visual information can be transferred from one domain to another, and whether the same is true of rule knowledge (e.g. transferring from tennis to badminton).
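To make the notion of a simple event "syntax" concrete, the sketch below encodes the legal orderings of tennis events as a tiny state machine and checks a detected event sequence against it. The event names and transition table are illustrative assumptions only; the project's published work infers such structure from audio and spoken-language data rather than hand-coding it.

```python
# Minimal sketch of tennis-event "syntax" as a state machine.
# Event names and transitions are illustrative, not the project's actual model.

ALLOWED = {
    "serve":     {"hit", "line_call", "applause"},  # an ace or fault can end the rally early
    "hit":       {"hit", "line_call", "applause"},
    "line_call": {"applause"},                      # "out!" followed by crowd reaction
    "applause":  {"serve"},                         # point over; the next point begins
}

def is_valid_rally(events):
    """Check that a detected event sequence obeys the game's ordering rules."""
    for prev, cur in zip(events, events[1:]):
        if cur not in ALLOWED.get(prev, set()):
            return False
    return True

print(is_valid_rally(["serve", "hit", "hit", "applause"]))  # True
print(is_valid_rally(["serve", "applause", "hit"]))         # False: a rally cannot resume after applause
```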



There have been several strands to the research:



(1) Development and tuning of artificial audio and visual detectors for sport video annotation.



(2) Integration of information from audio and video sources to improve recognition of events. For instance, the computer learns that a long burst of applause indicates the end of a point, which helps it to segment the events in the game; or it can decide who won a point by combining its interpretation of the video action with listening for the shout of the line judge at the end of a rally. This combination can be achieved at both the high (verbal) level and the low (audio) level; a minimal sketch of the low-level fusion idea is given after this list.



(3) Evaluation of human cognitive abilities in relation to sport video understanding, by directly measuring human attention with an eye-tracker in a series of audio/visual experiments.
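As a minimal sketch of the low-level fusion mentioned in strand (2), the code below combines per-event posteriors from independent audio and visual detectors with a weighted log-linear rule. All names, weights, and numbers here are hypothetical, chosen only to illustrate the principle; they are not the project's actual detectors or parameters.

```python
import numpy as np

# Hypothetical event classes for a tennis rally.
EVENTS = ["rally", "point_end", "line_call"]

def late_fusion(p_audio, p_video, w_audio=0.5):
    """Fuse modality posteriors with a weighted product (log-linear fusion)."""
    log_p = w_audio * np.log(p_audio + 1e-12) + (1 - w_audio) * np.log(p_video + 1e-12)
    p = np.exp(log_p)
    return p / p.sum(axis=-1, keepdims=True)

# Example: a long applause burst gives a confident audio cue while the
# video is ambiguous, so the fused decision marks the end of the point.
p_audio = np.array([0.10, 0.85, 0.05])  # audio strongly indicates applause
p_video = np.array([0.40, 0.35, 0.25])  # video alone is inconclusive
fused = late_fusion(p_audio, p_video)
print(EVENTS[int(fused.argmax())])      # -> "point_end"
```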



Sectors Digital/Communication/Information Technologies (including Software)

URL http://cvssp.org/acasva/
 
Description Partnership with Surrey University 
Organisation University of Surrey
Country United Kingdom 
Sector Academic/University 
PI Contribution Joint work on automatic lip-reading that came about partly as a result of the ACASVA EPSRC project.
Collaborator Contribution Surrey: tracking of lips. UEA: speech recognition.
Impact Automatic lip-reading
Start Year 2009