Adaptive cognition for automated sports video annotation (ACASVA)

Lead Research Organisation: University of Surrey
Department Name: Vision Speech and Signal Proc CVSSP


The development of a machine that can autonomously understand and interpret patterns of real-world events remains a challenging goal in AI. Humans are able to achieve this by developing sophisticated internal representational structures for objects and events and the grammars that connect them. ACASVA aims to investigate the interaction between visual and linguistic grammars in learning by developing grammars in a scenario where the number of different events is constrained, by a set of rules, to be small: a sport. We will analyse video footage of a game (e.g. tennis) and use computer vision techniques to progressively understand it as a sequence of (possibly overlapping) events, and build a grammar of events. We will do a similar audio/linguistic analysis on the commentary on the game. Both of these grammars will be used to build a representational structure for understanding the game. Visual representations are additionally constrained by the inference of game rules so that object-classification mechanisms are preferentially tuned to game-relevant entities like 'player' rather than game-irrelevant entities like 'crowd-member'. We will also investigate how the two modes, sight and sound, can influence each other in the learning process; interpretation of the video is affected by the linguistic grammar and vice versa. Furthermore, this coupling of modes will lead to improved recognition of both audio and video events when the grammars from the video modes are used to influence the audio recognition, and vice versa. The psychological component of ACASVA correspondingly attempts to learn how these capabilities are developed in humans: how visual grammars are organized and employed in the learning problem, how these grammars are modified by prior linguistic knowledge of the domain, how visual grammars map onto linguistic grammars, and how game rule-inferences influence lower-level visual learning (determined via gaze-behaviour).
These results will feed back into the machine-learning problem and vice versa, as well as providing a performance benchmark for the system. Potential beneficiaries of ACASVA (in addition to the knowledge beneficiaries within the fields of science and engineering) include the broadcasting and on-line video search industries.


Description Overview

ACASVA aimed to investigate the interaction between visual and linguistic grammars in both human and computer learning within environments for which the number and type of events are highly constrained by a set of rules (in particular sports game environments).

ACASVA thus investigated the 'horizontal' cross-modal bootstrapping of audio-visual entities, as well as the 'vertical' bootstrapping of low-level representational entities in a manner consistent with inferred high-level game rules. Thus, in addition to the two axes of investigation within the machine learning field, ACASVA also integrated interdisciplinary activity along a third axis, linking the psychological sciences to engineering and computer science.

As the core linking activity, CVSSP (UniS) constructed a complete, adaptive system for the automated analysis and annotation of video footage of sport games (especially tennis and badminton). In collaboration with UEA, computer vision and computer audio techniques were used to progressively understand games as a sequence of high level events, thereby building a complete audio-visual grammar of the video in question [CP7, CP10, CC10, UP1, UC1, UC3, UC6]. Parallel audio-visual analysis of game sounds and sights [UP2, UC2, UC4, UC7] thus enabled a corresponding grammatical structure to be built up; both of these grammars were used in conjunction with each other to build a representational structure for understanding the game as a whole, one capable of utilizing not just low-level, but also high-level linguistic information [UC5, UC8]. Coupling of modalities in this way, with the grammars from the video modalities used to influence the audio recognition, and vice versa, was demonstrated to lead to improved recognition of both audio and video events [CC5]. (Also implicit in the structure of this system were a number of other novelties of construction, in particular an adaptive memory architecture - the subject of a forthcoming paper [CP10] - as well as several contributions to the field of image processing [CP2, CP3, CC4, CC6, CC12, CC13, CT1]).

More generally, various novel developments in the theory of machine learning were made in the construction of this system, including in the areas of kernel methods [CP1, CP6, CC1, CC14-CC16], classifier fusion [CP8, CP9] and transfer learning [CC3, CC7, CC8] (the latter required to enable the system to adaptively switch between different sport domains).

As well as being innovatory in both the audio and visual domains, the completed system was designed so as to be leveraged within the psychological component of the investigation, to annotate experimental material in order to evaluate parallel learning structures in human psychology, in particular the relationship between a spectator's high-level rule inferences and low-level attention to key features of the game (as measured by the eye-tracker), ACASVA being the first project to directly assess this interaction [QP7-QP10, QC1-QC4].

This analysis of top-down/bottom-up interactions in human and machine inference turned out to have a highly significant application in other rule-constrained areas; in particular the symbol-grounding problem involved in relating the Highway Code to computer vision representations of the road situation within driver assistance systems. ACASVA research was thus leveraged both with respect to the determination of human driver intention via eye tracking [CP4], and the building of an adaptive system for determining the road traffic situation [CP5, CC13].

ACASVA investigations also led to key contributions to the field of anomaly detection [CC2, CC9, CC11], with a potentially seminal contribution in the premier pattern-recognition journal, IEEE Trans. PAMI, to the taxonomy of anomaly within machine learning [CP11]. In particular, the notion of a 'domain anomaly' is presented and characterized for the first time; a key development critical for the adaptivity of the annotation system between sport domains.


The psychological aspect of ACASVA investigated several fundamental issues concerning the cognitive mechanisms associated with the abstraction of information in complex dynamic scenes. We developed several methodological and theoretical innovations that have advanced psychological research in vision. To date, few empirical studies have been able to investigate high- and low-level visual processing using complex dynamic scenes. The experimental methods typically used to examine how people abstract, infer and make sense of visual information involve either static complex scenes (e.g. pictures of traffic, crowds) or dynamic simple scenes (e.g. circular dot formations with changes to the colors of the dots). Without the advances in ball tracking and frame-by-frame coding of the video footage that CVSSP provided to the psychology work package, the experimental methods would never have reached a stage at which experiments could be developed using actual tennis footage married up with actual online eye movement data (e.g. saccades, fixations, dwell times). As a result of the successful integration of engineering and psychology there have been several conference proceedings (QC1-QC4) and publications in top-tier peer-reviewed psychology journals (QP7-QP10). In the latter stage of the project, collaborative work combining the expertise of CMP, UEA, CVSSP UniS and QMUL DoP also resulted in a journal publication currently under review (QP11) in a major science journal. Finally, one of the connected core research focuses of this project concerned the methods by which attentional mechanisms tackle the problem of abstraction and inference online when the features of the environment are constantly changing.
This has resulted in a high profile theoretical publication (QB11), a monograph (book 1) and the empirical validation of this theoretical work in several high profile psychology journals (QP2, QP4-QP6) and conference proceedings (QC5, QC7). Moreover, the project has attracted attention from several international researchers who have contributed to addressing core questions concerning the learning mechanisms associated with object classification and inference. These collaborations have resulted in journal output (QP3) and a conference proceeding (QC7).

Computer Vision

-Video segmentation

One of the first steps in video annotation is video segmentation. Existing scene segmentation measures have a number of deficiencies. In [CP12] we developed a novel approach to evaluating video temporal decomposition algorithms. The evaluation measures typically used to this end are non-linear combinations of Precision-Recall or Coverage-Overflow, which are not metrics and additionally possess undesirable properties, such as non-symmetry. To alleviate these drawbacks we introduced a novel uni-dimensional measure that is proven to be a metric and satisfies a number of qualitative prerequisites that previous measures do not. This measure is named Differential Edit Distance (DED), since it can be seen as a variation of the well-known edit distance. After defining DED, we further introduced an algorithm that computes it in less than cubic time. DED was extensively compared with state-of-the-art measures, namely the harmonic means (F-Score) of Precision-Recall and Coverage-Overflow. The experiments included comparisons of qualitative properties, of the time required for optimising the parameters of scene segmentation algorithms with the help of these measures, and a user study gauging the agreement of these measures with the users' assessment of the segmentation results. The results confirm that the proposed measure is a uni-dimensional metric that is effective in evaluating scene segmentation techniques and in helping to optimise their parameters.
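For reference, the classic (Levenshtein) edit distance that DED is described as a variation of can be sketched as below. This is not the DED algorithm of [CP12], whose definition is not reproduced here; the function name and the per-shot scene labels in the usage note are purely illustrative assumptions.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance between two sequences (NOT the DED of [CP12])."""
    # prev[j] holds the distance between the first i-1 items of a and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical per-shot scene labels for a ground truth and a candidate segmentation.
ground_truth = [0, 0, 1, 1, 2]
candidate = [0, 0, 0, 1, 2]
print(edit_distance(ground_truth, candidate))
```

Unlike an F-Score, this distance is symmetric in its two arguments and is zero only for identical sequences, which are the kind of metric properties the text asks of a segmentation measure.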

-Event Recognition

A key prerequisite of automatic sport video indexing and summarisation is the description of events. In particular, the motion of the ball and agent plays an essential role in describing events. However, existing solutions for the tennis event recognition problem in the literature rely on sets of heuristic rules such as the proximity between ball and players or court lines to classify ball event candidates. In [CC10] a novel hidden Markov model (HMM) paradigm for automatically learning to identify events from ball trajectories is presented. It is demonstrated that the ability to capture the dynamics of the ball movement leads to far higher performance than heuristic approaches.
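To illustrate how an HMM can label events from the dynamics of a ball trajectory, the sketch below runs standard Viterbi decoding over a toy model. It is a minimal illustration, not the model of [CC10]: the state names, the discretised observation symbols and every probability are hand-set assumptions, whereas in practice they would be learned from data.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for obs (log-space Viterbi)."""
    best = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    backpointers = []
    for o in obs[1:]:
        prev, best, ptr = best, {}, {}
        for s in states:
            score, arg = max((prev[r] + math.log(trans_p[r][s]), r) for r in states)
            best[s] = score + math.log(emit_p[s][o])
            ptr[s] = arg
        backpointers.append(ptr)
    # Trace back from the best final state.
    path = [max(best, key=best.get)]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical toy model: all names and numbers below are assumptions.
STATES = ("flight", "bounce", "hit")
START = {"flight": 0.8, "bounce": 0.1, "hit": 0.1}
TRANS = {
    "flight": {"flight": 0.8, "bounce": 0.1, "hit": 0.1},
    "bounce": {"flight": 0.9, "bounce": 0.05, "hit": 0.05},
    "hit": {"flight": 0.9, "bounce": 0.05, "hit": 0.05},
}
# Observations are discretised trajectory changes between consecutive frames.
EMIT = {
    "flight": {"smooth": 0.9, "v_flip": 0.05, "h_flip": 0.05},
    "bounce": {"smooth": 0.1, "v_flip": 0.8, "h_flip": 0.1},
    "hit": {"smooth": 0.1, "v_flip": 0.1, "h_flip": 0.8},
}

observations = ["smooth", "smooth", "v_flip", "smooth", "h_flip", "smooth"]
print(viterbi(observations, STATES, START, TRANS, EMIT))
```

The point of the sketch is that event labels come from the whole sequence of trajectory changes rather than from proximity rules applied frame by frame.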

-Agent Recognition

Another important problem in video analysis and annotation is agent recognition. This can commonly be based on face recognition. However, in uncontrolled scenarios, such as in sports videos, face recognition is challenging because of pose and illumination changes, blur, and low resolution. We have addressed all these problems in the project. Face recognition in low-resolution images has been addressed using a 3D morphable face model [CC18]. We approach the problem of fitting a 3D morphable model to a low-resolution face by modelling the low-resolution image formation process. We show that the image synthesis model assumed by existing fitting algorithms becomes less relevant as the resolution of the input image decreases. We propose an alternative imaging model which takes into account the point-spread function of the virtual camera and use this model in the fitting algorithm. Experimental results show that incorporating this imaging model into the fitting algorithm improves performance for low-resolution images.
Pose invariance can also be achieved with the help of a 3D morphable face model. However, one of the drawbacks of 3D model fitting is its computational complexity. In [CC17] we have developed a resolution-aware 3D morphable face model, which can be applied to high-resolution face images in a hierarchical manner to speed up the fitting process. The image blur problem has been tackled by developing blur-invariant face descriptors. In [CP13] we propose a novel blur-robust face image descriptor based on Local Phase Quantization (LPQ) and extend it to a multiscale framework (MLPQ) to increase its effectiveness. To maximize the insensitivity to misalignment, the MLPQ descriptor is computed regionally by adopting a component-based framework. The regional features are combined using kernel fusion. Further performance gains are obtained by combining the proposed MLPQ representation with the Multiscale Local Binary Pattern (MLBP) descriptor using kernel fusion to increase insensitivity to illumination. Kernel Discriminant Analysis (KDA) of the combined features extracts discriminative information for face recognition. The proposed approach has been comprehensively evaluated using the combined Yale and Extended Yale database B (degraded by artificially induced linear motion blur) as well as the FERET, FRGC 2.0, and LFW databases. The combined system achieves state-of-the-art performance. The reported work also provides a new insight into the merits of various face representation and fusion methods, as well as their role in dealing with variable lighting and blur degradation.

-Action Recognition

Once agents have been detected, it is necessary, in a sport video annotation context, to recognize their actions. In [CC6], two popular approaches for action recognition from video, Bags-of-visual-Words (BoW) and Spatio-Temporal Shapes (STS), are compared. The former (BoW) is an unstructured global representation of videos which is built using a large set of local features. The latter (STS) uses a single feature located on a region of interest (where the actor is) in the video. Despite the popularity of these methods, no comparison between them had previously been carried out. Also, given that BoW and STS differ intrinsically in terms of context inclusion and globality/locality of operation, an appropriate evaluation framework has to be designed carefully. [CC6] compares these two approaches using four different datasets with varying degrees of space-time specificity of the actions and varying relevance of the contextual background. We use the same local feature extraction method and the same classifier for both approaches. Further to BoW and STS, we also evaluated novel variations of BoW constrained in time or space. We thus determined that the STS approach leads to better results in all datasets whose background is of little relevance to action classification.
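The BoW representation described above can be sketched as follows: each local descriptor is assigned to its nearest codeword and the video is summarised by the normalised histogram of assignments. This is a generic illustration rather than the pipeline of [CC6]; the two-dimensional descriptors and the three-word codebook are toy assumptions (real codebooks are typically obtained by clustering many high-dimensional local features).

```python
def bow_histogram(descriptors, codebook):
    """Normalised histogram of nearest-codeword assignments for a set of local descriptors."""
    counts = [0] * len(codebook)
    for d in descriptors:
        # Squared Euclidean distance to every codeword.
        dists = [sum((x - c) ** 2 for x, c in zip(d, word)) for word in codebook]
        counts[dists.index(min(dists))] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)


# Toy data (assumed): 2-D descriptors and a 3-word codebook.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (4.8, 5.1), (5.2, 4.9)]
print(bow_histogram(descriptors, codebook))
```

Because the histogram discards where and when each descriptor occurred, background features contribute as much as features on the actor, which is the context-inclusion difference from STS noted above.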

Computer Audio

Although visual features are clearly a primary source of information about events and interactions, audio is an important source of complementary information and, for some events, is superior to video. Much of the audio research in ACASVA was concerned with detection of audio events in order (a) to integrate with the visual features and (b) to infer an audio "grammar" of the game as a means of tracking it. Our baseline for event detection was conventional modelling techniques (usually Gaussian mixture models), but we improved their performance considerably by adding contextual information, modelled probabilistically. For instance, in [UC8], we described a technique that utilised a hierarchy of language models, comprising a low-level model of acoustic observations and a high-level model of audio events that occur during a game; the models were integrated using a maximum entropy approach. Our modelling also utilised event duration, inter-event duration and speech "voicing" information, and all these features were found to contribute to increased performance. Another useful technique was multi-gram modelling, which enabled us to "segment" a tennis game into points automatically, using only audio information. Segmentation was also useful for isolating other key events in a game, such as the line-judges' calls, and we showed that the use of contextual knowledge could improve the performance of segmentation considerably in cases where the signal-to-noise ratio was low [UC9, UC4]. Later work showed that we could use event and segmentation information to infer the structure of the scores of a tennis game [UC5].

Another important aspect of the audio work was robustness, not just to interfering noise but to different acoustics, microphone placement etc. One of the most crucial pieces of information is the sound of the ball being struck, but it is often corrupted by neighbouring audio events, such as players' grunts or yells as well as by acoustic mismatch between the training and test data. In [UC3], we showed how the use of contextual sounds could help improve detection.

Most of the work referred to above was summarized in [UP1].


-Kernel Methods

As well as the above developments within computer vision and computer audition, ACASVA's objectives necessitated, on a more general level, a number of parallel developments within machine learning. Within this field, kernel methods demonstrate a number of advantages for general classification, notably their resilience to overfitting in the context of Support Vector Machines (SVMs).
In [CC1] we thus considered automatic annotation of game event sequences by utilizing the Structured Output Learning SVM variant (which can be engineered to output strings of events), which demonstrated significantly superior performance to HMM-based approaches. More generally, in [CC19], we formulated multiple kernel learning (MKL) as a distance metric learning (DML) problem, proposing the learning of a linear combination of a set of base kernels by optimising two objective functions commonly used in distance metric learning. (We first proposed a global version of such an MKL via DML scheme, then a localised version). It was demonstrated that the localised version not only yields better performance than the global version, but also fits naturally into the framework of example-based retrieval and relevance feedback. The usefulness of the proposed schemes was also verified through experiments on two standard image retrieval datasets.
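The object that MKL learns, a linear combination of base kernels, can be sketched as below. The weights here are fixed by hand purely for illustration; the DML-based objectives of [CC19] that would actually learn them are not reproduced, and the choice of a linear and an RBF base kernel is an assumption.

```python
import math


def linear_kernel(X):
    """Gram matrix of the linear kernel."""
    return [[sum(a * b for a, b in zip(x, y)) for y in X] for x in X]


def rbf_kernel(X, gamma=0.5):
    """Gram matrix of the Gaussian (RBF) kernel."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y))) for y in X]
            for x in X]


def combine_kernels(kernels, weights):
    """Weighted sum of Gram matrices; non-negative weights keep the result a valid kernel."""
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
            for i in range(n)]


X = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]  # toy samples (assumed)
K = combine_kernels([linear_kernel(X), rbf_kernel(X)], weights=[0.5, 0.5])
for row in K:
    print([round(v, 3) for v in row])
```

The combined Gram matrix can be passed to any kernel classifier that accepts a precomputed kernel, so learning the weights amounts to choosing how much each base representation contributes.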
In carrying out multimodal information fusion of the kind implicit in the video annotation problem, it is common to encounter 'missing' modalities where a particular detector has not returned a sufficiently confident output. We therefore in [CP6] addressed the multimodal fusion problem involving missing modalities by using a 'neutral point substitution' (NPS) method. Thus, when a modality has missing information, the missing modality is substituted by artificial data that is unbiased with regards to the classification, called a 'neutral point'. Critically, unlike conventional missing-data substitution methods, explicit calculation of neutral points may be omitted by virtue of their implicit incorporation within the classifier training framework. Experiments based on publicly available biometric data sets showed that this approach achieves very good generalization performance compared to existing methods, especially with severe missing modalities.
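The effect of substituting a neutral point can be illustrated under a simplifying assumption: if the fused decision function is a sum of per-modality contributions, a modality whose input is replaced by a point that is unbiased with respect to the classification contributes nothing to that sum. The sketch below encodes only this consequence; it is not the training framework of [CP6], and the modality names and scores are hypothetical.

```python
def fused_decision(scores):
    """Sum signed per-modality scores; a missing modality (None) contributes zero,
    mimicking the substitution of a classification-neutral point."""
    return sum(0.0 if s is None else s for s in scores.values())


def classify(scores):
    """Return +1 or -1 from the sign of the fused score."""
    return 1 if fused_decision(scores) >= 0 else -1


# Hypothetical signed scores from three detectors; 'voice' returned no confident output.
sample = {"face": 1.2, "voice": None, "iris": -0.4}
print(fused_decision(sample), classify(sample))
```

The contrast with conventional imputation is that no surrogate feature vector has to be computed for the missing modality; only its (zero) effect on the decision is needed.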

-Anomaly Detection

A significant development made during ACASVA, one that was necessitated by the nature of the project's goals (though only implicitly at the outset), was a formal categorization of anomaly in machine learning. Anomaly detection is a crucial trigger for initiating transfer learning (see below) in a generalized online learning system. It was thus found to be crucial to distinguish between the various types of anomaly.

In [CP11], the concept of 'domain anomaly' is thus introduced as distinct from the conventional notion of anomaly used in the literature. We hence proposed a unified framework for anomaly detection which exposed the multifaceted nature of anomalies and suggested effective mechanisms for identifying and distinguishing each facet as instruments for domain anomaly detection. The framework drew upon a Bayesian probabilistic reasoning apparatus, which clearly defined concepts such as outlier, noise, distribution drift, novelty detection (object, object primitive), rare events, and unexpected events. Based on these concepts, a taxonomy of domain anomaly events was provided using the video annotation system as an exemplar. One of the proposed mechanisms for helping to pinpoint the nature of anomaly was based on detecting incongruence between contextual and noncontextual data interpretations. The proposed methodology is intended to have a wide applicability, and to underpin, in a unified way, all other anomaly detection applications found in the literature.
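One way to picture incongruence detection is to compare the posterior produced by a noncontextual classifier with the posterior implied by context, and to flag the observation when the two disagree strongly. The sketch below uses the Kullback-Leibler divergence with a fixed threshold; both the divergence and the threshold value are assumptions made for illustration, not the specific mechanism of [CP11].

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same labels."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def is_incongruent(noncontextual, contextual, threshold=1.0):
    """Flag a possible anomaly when the two interpretations diverge strongly."""
    return kl_divergence(noncontextual, contextual) > threshold


# Posteriors over hypothetical event labels ("serve", "hit", "bounce").
low_level = [0.05, 0.90, 0.05]   # noncontextual classifier is confident of "hit"
with_rules = [0.90, 0.05, 0.05]  # game context expects "serve"
print(is_incongruent(low_level, with_rules))
```

A flag of this kind only signals that something is unexpected; deciding which type of anomaly it is (noise, drift, novelty and so on) requires the further distinctions set out in the taxonomy.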

-Transfer Learning

A key component of ACASVA was the notion of transfer learning; the leveraging of learning in one domain within a different, but related domain (for example tennis and badminton). In [CC11] we proposed a composite mechanism for anomaly detection and transfer learning within the context of sport video annotation, such that it is envisaged that continuous adaptive learning is abandoned and a new transfer learning process initiated once the presence of a new domain is determined via the existence of an anomaly. The mechanism adopted in [CC11] is that of anomaly rectification; the adaptation of the existing learning mechanism to the change of domain by accommodating the anomalies in an appropriate fashion. In particular, a novel lattice-based HMM induction strategy for arbitrary court-game environments was proposed. We thus tested (in real and simulated domains) the ability of the method to adapt to a change of rule structures going from tennis singles to tennis doubles, finding that the method had the capacity to form the basis of a continuously adaptive system with respect to high-level sport video annotation.

A typical problem setting within transfer learning is transductive learning, where unlabeled data is available in the new domain. In [CC8] we investigated the application of transductive transfer learning methods to player action classification (the application scenario thus being that of off-line video annotation for retrieval). We demonstrated that if a classification system can analyze the unlabeled test data in order to adapt its models, a significant performance improvement can be achieved. (We applied the method to action classification in tennis games in which the train and test videos were of a different nature, e.g. US Open and Wimbledon tennis matches). Actions were described using HOG3D features and for transfer we employed a method based on feature re-weighting as well as a novel method based on feature translation and scaling.
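The general idea of adapting features by translation and scaling, using only unlabeled target data, can be sketched as per-dimension moment matching: target features are shifted and rescaled so that their mean and standard deviation agree with those of the source domain. This is a generic illustration under that assumption and is not claimed to be the specific method of [CC8]; the function names and toy data are hypothetical.

```python
from statistics import mean, pstdev


def fit_translation_scaling(source, target):
    """Estimate, per feature dimension, the shift and scale that map target
    statistics onto source statistics. No target labels are used."""
    params = []
    for d in range(len(source[0])):
        s_col = [row[d] for row in source]
        t_col = [row[d] for row in target]
        t_sd = pstdev(t_col)
        scale = pstdev(s_col) / t_sd if t_sd > 0 else 1.0
        params.append((mean(t_col), scale, mean(s_col)))
    return params


def adapt(rows, params):
    """Apply the estimated translation and scaling to target-domain features."""
    return [[(x - t_mu) * scale + s_mu for x, (t_mu, scale, s_mu) in zip(row, params)]
            for row in rows]


source = [[0.0, 10.0], [2.0, 14.0], [4.0, 18.0]]  # e.g. features from one tournament
target = [[10.0, 1.0], [14.0, 2.0], [18.0, 3.0]]  # e.g. features from another
print(adapt(target, fit_translation_scaling(source, target)))
```

After adaptation, a classifier trained on the source domain sees target features on the same scale it was trained on, which is the sense in which analysing the unlabeled test data can help.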

In [CC7] this approach was applied in a domain adaptation context to the problem of recognizing common actions between entirely different court-game sport videos (in particular tennis and badminton games). Actions were again characterized in terms of HOG3D features extracted at the bounding box of each detected player, thus having large intrinsic dimensionality. The techniques evaluated for domain adaptation were based on estimating linear transformations to adapt the source domain features in order to maximize the similarity between posterior PDFs for each class in the source domain and the expected posterior PDF for each class in the target domain. As such, the problem scaled linearly with feature dimensionality, making the video-environment domain adaptation problem tractable on reasonable time scales and resilient to over-fitting. We thereby demonstrated that significant performance improvement can be achieved by applying domain adaptation in this context.

-Rule Induction/Symbol grounding

A key aspect of ACASVA is the relationship between high-level rules and low-level audio/visual features. The various machine learning methods listed above thus had implicitly to accommodate the dual nature of the environment (i.e. with low-level stochasticity underpinning high level abstract game structures). However, various ACASVA activities were aimed at addressing this problem in its own right.
Two novel approaches to the problem of rule induction in a stochastic context were consequently developed during ACASVA. The first of these, [CP9], was based on a second-order logic variant of the Markov Logic Network (MLN), and was tested with both a simulated game predicate generator and also predicatised real-world data (of detected and ground-truth varieties). Predicate generation in the case of the non-ground-truthed real data was carried out by low-level computer vision processes, including the HOG3D-based player-action classification indicated above, but also Hough-transform-based court detection and graph-theoretic ball-tracking. Experiments demonstrated that the method exhibits both error resilience and learning transfer within the court-game context.

The second approach to inter-level integration looked at a range of novel hierarchical Dirichlet processes (HDPs) as a means of providing a natural mechanism for continuous abstraction of HMM processes in a manner suitable for characterizing the high- and low-level aspects of court-game videos. We thus, in [CP14], investigated a number of novel hierarchical HMM generating methods for rule induction in the context of automated sports video annotation including a multi-level 'Chinese Takeaway Process' (CTP) based on the Chinese Restaurant Process (CRP) and also a novel label-based hierarchical bottom-up HDP clustering method that employs prior information contained within label structures. Our results demonstrated significant improvement in comparison against a standard HMM: optimal performance was obtained by using a hybrid method that combines the CTP-generated hierarchical topological structures with generated event labels.
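For readers unfamiliar with the Chinese Restaurant Process on which the CTP is based, the standard CRP can be sketched as follows: each new customer joins an existing table with probability proportional to its occupancy, or opens a new table with probability proportional to a concentration parameter. This is the textbook CRP only, not the multi-level CTP of [CP14]; the parameter names are assumptions.

```python
import random


def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Sample a seating arrangement from a standard CRP with concentration alpha > 0."""
    rng = random.Random(seed)
    tables = []   # tables[k] = number of customers seated at table k
    seating = []  # seating[i] = table index of customer i
    for n in range(n_customers):
        # Existing table k is chosen with probability tables[k] / (n + alpha),
        # a new table with probability alpha / (n + alpha).
        r = rng.uniform(0, n + alpha)
        acc = 0.0
        for k, count in enumerate(tables):
            acc += count
            if r < acc:
                tables[k] += 1
                seating.append(k)
                break
        else:
            tables.append(1)
            seating.append(len(tables) - 1)
    return seating, tables


seating, tables = chinese_restaurant_process(20, alpha=1.0)
print(seating)
print(tables)
```

The number of tables is not fixed in advance but grows with the data, which is what makes such processes attractive for inducing an unknown number of states or event classes.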

The examination of the relationship between high-level rules and lower level computer vision implicit in ACASVA also yielded benefits in other domains; in particular, driver assistance systems. In [CP5] we set out a framework for hierarchical learning that utilized a fuzzy-logical approach to cognitive system building that sought to reduce the complexity associated with conventional environment-representation/action-planning approaches. Here, actions are directly mapped onto the detected changes that they bring about, eliminating the need for intermediate representations and significantly reducing training requirements. We thus set out, in [CP5], a very general learning framework for cognitive systems in which online learning of the representation/symbol mapping may be conducted within a logic processing context, so that complex contextual reasoning can influence the mapping. We experimentally demonstrated that the resulting framework achieves significantly better accuracy than learning without top-down modulation by the fuzzy-reasoning system. We also demonstrated that this approach permits novel forms of context-dependent multilevel learning, enabling an adaptive driver assistance system that nonetheless stays within the rule-bounds of the environment.

Interaction Between Disciplines


As well as the provision of computer vision tools for psychological analysis [QC1-QC4,QP7-QP10], interdisciplinary interaction between psychology and engineering also took the form of direct scientific transfer, one example being in the characterization of driver intentions for driver assistance systems in [CP4]. We thus proposed a mechanism for the classification of the intentional behaviour of a cognitive agent in terms of a hierarchical Perception-Action (P-A) model. (P-A models of human intentionality assume that a cognitive agent's perceptual domain is learned in response to the outcome of the agent's actions rather than vice-versa, as in the classical model of perception). The model was assessed by comparative evaluation against a number of logic-based methods for carrying out intentional classification and found to be substantially better at intentional characterization.

-Multi-modal Audio Visual Interaction

One of the drivers for ACASVA was to explore how the coupling of audio and video modalities could increase the accuracy of detection of events in each other's domains. For instance, for ball-hit detection, neither modality on its own gives satisfactory performance, because of noise and training/test-data mismatch for audio, and complex backgrounds, camera calibration and multiple moving objects for video. In [UC2], we demonstrated that ball-hit detection can be successfully performed on a recording that has a low visual frame rate and a poor quality soundtrack by directly fusing audio and visual information at the "event" level. In [CP7] a mechanism for combining generalized event-classification via ball-tracking with Mel-frequency Cepstral Coefficient (MFCC) audio features is presented, for which composite audio-visual performance far exceeds either modality considered independently. The composite ACASVA video annotation system (cf. [CP10]) thus exhibits adaptive, intermodal learning, integrating 'horizontal' (audio-visual) and 'vertical' (high-level/low-level) components within an integrated framework for systems engineering.
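Fusion at the "event" level can be pictured as pairing audio and video detections that fall close together in time and combining their confidences, so that an event supported by both modalities outranks one supported by either alone. The sketch below is a minimal illustration of that idea, not the method of [UC2] or [CP7]; the time tolerance, the equal weighting and the detection values are all assumptions.

```python
def fuse_events(audio, video, tolerance=0.1, w_audio=0.5):
    """Fuse (time_seconds, confidence) detections from two modalities.
    Detections within `tolerance` seconds are paired and their confidences combined;
    unpaired detections keep only their own weighted confidence."""
    fused, used = [], set()
    for ta, ca in audio:
        best = None
        for j, (tv, _) in enumerate(video):
            if j in used or abs(tv - ta) > tolerance:
                continue
            if best is None or abs(tv - ta) < abs(video[best][0] - ta):
                best = j
        if best is None:
            fused.append((ta, w_audio * ca))
        else:
            used.add(best)
            tv, cv = video[best]
            fused.append(((ta + tv) / 2, w_audio * ca + (1 - w_audio) * cv))
    for j, (tv, cv) in enumerate(video):
        if j not in used:
            fused.append((tv, (1 - w_audio) * cv))
    return sorted(fused)


# Hypothetical ball-hit candidates from each modality.
audio_hits = [(1.00, 0.9), (2.50, 0.4)]
video_hits = [(1.04, 0.8), (4.00, 0.6)]
for t, score in fuse_events(audio_hits, video_hits):
    print(round(t, 2), round(score, 2))
```

With a decision threshold of, say, 0.5 on the fused score, only the candidate near 1.0 s (seen by both modalities) would be accepted, while the single-modality candidates are suppressed.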


Computer Vision-led outputs

-Refereed full papers

CP1. Fei Yan, Josef Kittler, Krystian Mikolajczyk, and Atif Tahir. Non-Sparse Multiple Kernel Fisher Discriminant Analysis. Journal of Machine Learning Research (JMLR). vol. 13(3), pp. 607-642, 2012.

CP2. T E deCampos and G Csurka and F Perronnin, Images as Sets of Locally Weighted Features, In Computer Vision and Image Understanding, 2012

CP3. J Sanchez and F Perronnin and T E deCampos, Modeling the Spatial Layout of Images Beyond Spatial Pyramids, In Pattern Recognition Letters, 2012

CP4. D Windridge, A Shaukat, E Hollnagel, Characterizing Driver Intention via Hierarchical Perception Action Modeling, IEEE Transactions on Human-Machine Systems, vol. 43, no. 1, DOI: 10.1109/TSMCA.2012.2216868, Jan. 2013

CP5. D Windridge, M Felsberg, A Shaukat, A Framework for Hierarchical Perception-Action Learning Utilizing First Logic Resolution, IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS Part B, DOI:10.1109/TSMCB.2012.2202109

CP6. N. Poh, D. Windridge, V. Mottl, A. Tatarchuk, A. Eliseyev, Addressing Missing Values in Kernel-based Multimodal Biometric Fusion using Neutral Point Substitution In IEEE Trans. on Information Forensics and Security, 2010

CP7. Fei Yan, Josef Kittler, David Windridge, William Christmas, Krystian Mikolajczyk, Stephen Cox, and Qiang Huang. Automatic Annotation of Tennis Game: An Integration of Audio, Vision, and Learning. Image and Vision Computing. (Under Review)

CP8. Muhammad Awais, Fei Yan, Krystian Mikolajczyk, and Josef Kittler. Towards Flexible Fusion Schemes for Pattern Recognition. IEEE Transactions on Neural Networks and Learning Systems (NNLS). (Under Review)

CP9. D. Windridge, T. E. deCampos, F. Yan, W. Christmas, J. Kittler, A. Khan, Rule Induction for Adaptive Sport Video Characterization Using MLN Clause Templates, IEEE Transactions on Multimedia, (Under Review), submitted in May 2013.

CP10. I. Kolonias, T. E. deCampos, F. Yan, W. Christmas, J. Kittler, D. Windridge, A. Kostin, A Bayesian Reasoning System for Sports Video Annotation, IEEE Transactions on Cybernetics (formerly Trans. Sys. Man & Cyb. - Part B), (Under Review), submitted in April 2013.

CP11. J. Kittler, W. Christmas, T. E. de Campos, D. Windridge, Y. Fei, Domain anomaly detection in machine perception: A framework and taxonomy, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 18 Oct. 2013. IEEE Computer Society Digital Library.

CP12. P Sidiropoulos, V Mezaris, I Kompatsiaris and J Kittler, Differential edit distance: A metric for scene segmentation evaluation, IEEE Trans. on Circuits and Systems for Video Technology, 2012.

CP13. CH Chan, MA Tahir, J Kittler and M Pietikainen, IEEE Trans Pattern Analysis and Machine Intelligence, 35: 1164-1177, 2013

CP14. Khan, A, Windridge D, Kittler J, Multi-Level Chinese Takeaway Process and Label-Based Processes for Rule Induction in the Context of Automated Sports Video Annotation, (formerly Trans. Sys. Man & Cyb. - Part B), (Under Revision)

-Conference papers

CC1. Fei Yan and Josef Kittler and Krystian Mikolajczyk and David Windridge. Automatic Annotation of Court Games with Structured Output Learning. International Conference on Pattern Recognition (ICPR). 2012.

CC2. Ibrahim Almajai, Fei Yan, Teo de Campos, Aftab Khan, William Christmas, David Windridge, and Josef Kittler. Anomaly Detection and Knowledge Transfer in Automatic Sports Video Annotation. Springer Berlin Heidelberg. Detection and Identification of Rare Audiovisual Cues. pp. 109-117. 2012.

CC3. T E deCampos and A Khan and F Yan and N FarajiDavar and D Windridge and J Kittler and W Christmas, A framework for automatic sports video annotation with anomaly detection and transfer learning. In Machine Learning and Cognitive Science, collocated with EUCOGIII, 2013

CC4. I Calixto and TE deCampos and L Specia, Images as Context in Statistical Machine Translation, In The 2nd Annual Meeting of the EPSRC Network on Vision & Language (VL'12), 2012

CC5. Q Huang and S Cox and F Yan and T E deCampos and D Windridge and J Kittler and W Christmas, Improved Detection of Ball Hit Events in a Tennis Game Using Multimodal Information, In 11th International Conference on Auditory-Visual Speech Processing (AVSP), 2011

CC6. T deCampos and M Barnard and K Mikolajczyk and J Kittler and F Yan and W Christmas and D Windridge, An evaluation of bags-of-words and spatio-temporal shapes for action recognition, In IEEE Workshop on Applications of Computer Vision (WACV), 2011

CC7. N FarajiDavar and T E deCampos and D Windridge and J Kittler and W Christmas, Domain Adaptation in the Context of Sport Video Action Recognition, In Domain Adaptation Workshop, in conjunction with NIPS, 2011

CC8. N FarajiDavar and T E deCampos and J Kittler and F Yan, Transductive Transfer Learning for Action Recognition in Tennis Games, In 3rd International Workshop on Video Event Categorization, Tagging and Retrieval for Real-World Applications (VECTaR), in conjunction with ICCV, 2011

CC9. I Almajai, F Yan, T de Campos, A Khan, W Christmas, D Windridge and J Kittler, Anomaly Detection and Knowledge Transfer in Automatic Sports Video Annotation, In Proceedings of DIRAC Workshop, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2010), 2010

CC10. I Almajai and J Kittler and T DeCampos and W. Christmas and F Yan and D Windridge and A Khan, Ball Event Recognition using HMM for Automatic Tennis Annotation, In Proceedings of Intl. Conf. on Image Proc., 2010

CC11. Aftab Khan and David Windridge and Teofilo de Campos and Josef Kittler and William Christmas, Lattice-based Anomaly Rectification for Sport Video Annotation, In Proceedings of ICPR 2010, 2010

CC12. J J McAuley and T E deCampos and T S Caetano, Unified graph matching in Euclidean spaces, In Conference on Computer Vision and Pattern Recognition (CVPR), 2010

CC13. A Shaukat, A Gilbert, D Windridge, R Bowden, A Top-down and Bottom-up Approach to Detect Pedestrians, ICPR 2012

CC14. N. Razin, D. Sungurov, V. Mottl, I. Torshin, V. Sulimova, O. Seredin, D. Windridge, Application of the Multi-modal Relevance Vector Machine to the problem of protein secondary structure prediction, PRIB 2012.

CC15. Maxim Panov, Alexander Tatarchuk, Vadim Mottl, and David Windridge, A Modified Neutral Point Method for Kernel-Based Fusion of Pattern-Recognition Modalities with Incomplete Data Sets, MCS 2011

CC16. Alexander Tatarchuk and Eugene Urlov and Vadim Mottl and David Windridge, A Support Kernel Machine for Supervised Selective Combining of Diverse Pattern-Recognition Modalities, In Proc. Multiple Classifier Systems, 9th International Workshop, MCS 2010, 2010

CC17. G Hu, W Christmas and J Kittler, A resolution aware 3D morphable model, Proc. BMVC 2012.

CC18. P Mortazavian, J Kittler and W Christmas, 3D morphable model fitting for low-resolution facial images, Proc. ICB 2012

CC19. Yan F, Kittler J, Mikolajczyk K, Multiple Kernel Learning via Distance Metric Learning for Interactive Image Retrieval, Proc. MCS 201

-Technical Reports

CT1. T E deCampos and G Csurka and F Perronnin, Images as Sets of Locally Weighted Features, In VSSP-TR-1/2010, 2010

Computer Audio-led outputs

-Refereed full papers

UP1. Inferring the Structure of a Tennis Game using Audio Information, Qiang Huang and Stephen Cox, IEEE Transactions on Audio, Speech & Language Processing, Vol. 19 No 7, pp. 1925-1937, September 2011.

UP2. Robust Event Detection by Fusion of Contextual Audio and Visual Information, to be submitted to IEEE Trans. on Audio, Speech and Language Processing. (In Prep.)

-Conference papers

UC1. A Two Layered Data Association Approach For Ball Tracking, Xiangzeng Zhou, Qiang Huang, Lei Xie, Stephen Cox, Proc. IEEE Conf. on Acoustics, Speech and Signal Processing, Vancouver, 2013

UC2. Detection of Ball Hits in a Tennis Game Using Audio and Visual Information, Qiang Huang, Stephen Cox, Xiangzeng Zhou, Lei Xie, Proc. Asia-Pacific Signal and Information Processing Association (APSIPA), Hollywood, December 2012

UC3. Improved Audio Event Detection By Use Of Contextual Noise, Qiang Huang and Stephen Cox, Proc. IEEE Conf. on Acoustics, Speech and Signal Processing, Kyoto, 2012

UC4. Improved Detection of Ball Hit Events in a Tennis Game Using Multimodal Information, Qiang Huang and Stephen Cox, Proc. International Conference on Auditory-Visual Speech Processing (AVSP) 2011

UC5. Learning Score Structure from Spoken Language for a Tennis Game, Qiang Huang and Stephen Cox, Proc. 14th International Conference on Spoken Language Processing (Interspeech), Florence, August 2011

UC6. Shallow Parsing of a Tennis Game from Audio Events, Qiang Huang and Stephen Cox, Fourth International Conference on Intelligent Information Technology Application (IITA 2010), Qinhuangdao, China, November 5 - 7, 2010

UC7. Using High-level Information to Detect Key Audio Events in a Tennis Game, Qiang Huang and Stephen Cox, Proc. 13th International Conference on Spoken Language Processing (Interspeech), Makuhari, September 2010

UC8. Hierarchical Language Modeling for Audio Events Detection in a Sports Game, Qiang Huang and Stephen Cox, Proc. IEEE Conf. on Acoustics, Speech and Signal Processing, Dallas, 2010

Psychology-led Outputs

-Refereed full papers

QP1. Osman, M. (2010). Controlling Uncertainty: A Review of Human Behavior in Complex Dynamic Environments. Psychological Bulletin, 136, 65-86

QP2. Osman, M., & Speekenbrink, M. (2011). Cue utilization and strategy application in stable and unstable dynamic environments. Cognitive Systems Research, 12, 355-364.

QP3. De Neys, W., Cromheeke, K., & Osman, M. (2011). Biased but in doubt: Conflict and decision confidence. PLOS One, 6, e15954.

QP4. Osman, M. (2012). How powerful is the effect of external goals on learning in an uncertain environment? Learning and Individual Differences, 22, 575-584.

QP5. Osman, M. & Speekenbrink, M. (2012). Predicting vs. Controlling a Dynamic Environment. Frontiers in Decision Neuroscience: Dynamics of Decision Making. 3:68.

QP6. Osman, M. (2012). The role of feedback in dynamic decision making. Frontiers in Decision Neuroscience: Human Choice and Adaptive Decision Making 6:56

QP7. Taya S* & Miura K. (2010) Cast shadow can modulate the judged final position of a moving target. Attention, Perception & Psychophysics, 72(7), 1930-1937.

QP8. Seno T, Taya S*, Ito H & Sunaga S. (2011). The mental number line in depth revealed by vection. Perception, 40(10), 1237-1240.

QP9. Taya S, Windridge D, Osman M.* (2012). Looking to score: The dissociation of goal influence on eye movement and meta-attentional allocation in a complex dynamic natural scene. PLoS One, 7(6), e39060

QP10. Taya S, Windridge D, Osman M.* (in press). Knowledge-based modulation of eye-movements in dynamic scene observation. PLoS One.

QP11. Glass, B., & Osman, M. (under revision). Sound before vision, or vision before sound, the interfering effects of auditory delays in visual processing of complex dynamic scenes. Science.

QP12. Osman, M., & Johansen, M. (under review). Coincidence and Causality: Flip Sides of the Same Covariance Detection Coin. Psychological Science.

QP13. Ryterska, A., Jahanshahi, M. & Osman, M. (under review). A review of the impact of striatal damage on Decision making. Neuroscience & Biobehavioral Reviews

QP14. Osman, M., Hola, Z., Glass, B., & Stieglitz. N. (under review) Exploration and Exploitation in dynamic decision environments. Journal of Behavioral Decision Making.

QP15. Theocharis, Z., Kingham, S., & Osman, M. (under review). Banking on optimism and control: Judgments of personal and professional events. Journal of Experimental Psychology: Applied

QP16. Osman, M., & Ananiadis-Basias, A. (under review). The role of social cues and causal knowledge in Dynamic decision making. Decision


QB1. Osman, M. (2010). Controlling Uncertainty: Learning and Decision Making in complex worlds. Wiley Blackwell Publishers.

-Conference Papers

QC1. Taya S, Windridge D, Kittler J, Osman M. Rule-based modulation of visual attention. 33rd European Conference of Visual Perception. Lausanne (Switzerland), August 2010.

QC2. Taya S, Windridge D, Osman M. Investigating the influence of task-specific goals on attention allocation and eye-movement behaviour while viewing a dynamic scene. 51st Annual Meeting of Psychonomic Society, St Louis (USA), November 2010.

QC3. Taya S, Windridge D, Osman M. The effects of goal-oriented task on eye-movements during dynamic natural scene observation. Vision Science Society 11th Annual Meeting, Naples (USA), May 2011.

QC4. Taya S, Windridge D, Osman M. Experience-based modulation of eye-movement behavior in dynamic and uncertain visual environment. 34th Annual Meeting of the Cognitive Science Society, Sapporo (Japan), August 2012.

QC5. Smyth, A., Taya, S., Hope, C., & Osman, M. (2011). Can Sleep Enhance both Implicit and Explicit Processes? In Proceedings of the 33rd Annual Meeting of the Cognitive Science Society, Boston, July.

QC6. Osman, M., & DeNeys, W. (2011). More than just logic tasks: New approaches to understanding reasoning. In Proceedings of the 33rd Annual Meeting of the Cognitive Science Society, Boston, July.

QC7. Osman, M., & Speekenbrink, M. (2011). Controlling stable and unstable dynamic decision making environments. In Proceedings of the 33rd Annual Meeting of the Cognitive Science Society, Boston, July.
Exploitation Route The employment of ACASVA methodologies in driver assistance systems constitutes a significant non-academic use of ACASVA technology. Furthermore, the accumulation of very large archives of video footage, both by broadcasters and private individuals, has made the problem of video annotation an increasingly central 'big data' issue. The techniques developed by ACASVA are designed to be directly applicable to this problem (both to sport video and more general types of footage), and avenues of exploitation for ACASVA annotation technology are being actively explored.
Sectors Electronics

Description EPSRC Programme Grant
Amount £6,104,265 (GBP)
Funding ID EP/N007743/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Academic/University
Country United Kingdom
Start 01/2016 
End 12/2020
Description MURI
Amount £8,000,000 (GBP)
Funding ID EP/R018456/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Academic/University
Country United Kingdom
Start 01/2018 
End 12/2022
Description Platform Grant
Amount £1,539,000 (GBP)
Funding ID EP/P022529/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Academic/University
Country United Kingdom
Start 07/2017 
End 06/2022
Description Signal processing for the networked battlespace
Amount £3,800,000 (GBP)
Funding ID EP/K014307/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Academic/University
Country United Kingdom
Start 04/2013 
End 03/2018
Title ACASVA Actions Dataset 
Description Player's action recognition is one of the challenges in the ACASVA project. The goal is to classify each action sample into three classes: Non-Hit, Hit and Serve. Following deCampos et al [3], we used HOG3D descriptors extracted on player bounding boxes. Two different sets of feature extraction parameters were used: the 960D parameters (4x4x3x20) optimised for the KTH dataset and the 300D parameters (2x2x5x5x3) optimised for the Hollywood dataset. Each file contains HOG3D data extracted 
Type Of Material Database/Collection of data 
Year Produced 2012 
Provided To Others? Yes  
Impact The data set was used by peer groups in evaluation studies 
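As a quick check on the descriptor sizes quoted in the dataset description, the product of each parameter tuple gives the stated dimensionality. The following is a minimal sketch only: the dictionary keys and the function name are illustrative assumptions, and this record does not say what each factor in a tuple represents.

```python
from math import prod

# Parameter tuples exactly as quoted in the dataset description.
# The dictionary keys are hypothetical labels, not names used by the dataset.
HOG3D_PARAMS = {
    "kth_960d": (4, 4, 3, 20),
    "hollywood_300d": (2, 2, 5, 5, 3),
}

# The three action classes named in the description.
ACTION_CLASSES = ("Non-Hit", "Hit", "Serve")


def descriptor_dim(params):
    """Return the descriptor length implied by a parameter tuple."""
    return prod(params)


dims = {name: descriptor_dim(p) for name, p in HOG3D_PARAMS.items()}
for name, dim in dims.items():
    print(f"{name}: {dim} dimensions")
```

Both products agree with the 960D and 300D figures given above.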
Description MILES 
Organisation University of Surrey
Country United Kingdom 
Sector Academic/University 
PI Contribution Internal inter-department collaboration was initiated with Department of Computing and School of Psychology, and a small feasibility study fund was awarded by the MILES (Models and Mathematics in Life and Social Sciences) project (12/2012-12/2013).
Start Year 2011
Description ACASVA Webpage 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact

Further enquiries about the research done
Year(s) Of Engagement Activity 2009,2010,2011,2012,2013