Developing Foundation Model Capabilities for Video Understanding in the Open World

Lead Research Organisation: University of Oxford


The goal of this project is to develop open-world deep learning models for video understanding that allow users to ask queries about video content using natural language descriptions. Rather than training deep learning models from scratch, the methods developed will leverage pre-trained foundation models. A foundation model is a machine learning model trained on large quantities of data that can be adapted to solve a wide variety of downstream tasks. Adapting a foundation model to a specific problem usually requires less data than training a model from scratch and improves the generalisability of the resulting specialist model.

First, methods will be developed that use pre-trained foundation models to solve open-world, image-level problems requiring natural language input. Insights from these developments will then inform the construction of models that solve analogous problems in videos. Examples of such problems include counting text-specified objects in images, counting repetitions in videos, and answering queries about the area, shape, and structure of objects in a scene.

Rather than solving a problem for one particular class, the models developed will allow users to solve the problem for any arbitrary class by providing text describing the class of interest at inference time. Importantly, adapting such open-world models to new classes would require no additional training or data, even if the class were unseen during training. Hence, this work will result in AI systems that are more accessible to the general public, who may not have access to the large quantities of labelled data and compute typically necessary to train class-specific models.
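The open-vocabulary inference pattern described above can be illustrated with a minimal sketch. It assumes a CLIP-style joint embedding space in which image regions and text queries are compared by cosine similarity; the embedding vectors below are hypothetical toy stand-ins for the outputs of learned encoders, not a real model. The point is that switching the class of interest only changes the text embedding supplied at inference time, with no retraining.

```python
# Hypothetical sketch: counting text-specified objects via a shared
# text-image embedding space (CLIP-style). The "embeddings" here are toy
# unit vectors standing in for the outputs of learned encoders.

import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def count_matches(patch_embeddings, text_embedding, threshold=0.8):
    """Count image regions whose embedding aligns with the text query."""
    return sum(
        1 for p in patch_embeddings
        if cosine(p, text_embedding) >= threshold
    )


# Toy region embeddings for one image: two "sheep" regions, one "car".
sheep = [1.0, 0.0]
car = [0.0, 1.0]
patches = [sheep, sheep, car]

# The class is specified purely by its text embedding at inference time;
# changing the query from "sheep" to "car" requires no retraining.
print(count_matches(patches, sheep))  # 2
print(count_matches(patches, car))    # 1
```

In a real system the thresholding step would be replaced by a learned counting head, but the interface is the same: an arbitrary class name is encoded as text and matched against visual features at inference time.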
Aims & Objectives
1. Develop models for image understanding that allow users to ask questions about the image content using natural language.
2. Leverage insights and methods from step (1) to develop models with similar capabilities for video understanding, allowing users to ask questions about video content using natural language. For instance, a model developed in step (1) to count objects in images using text could inspire a model in step (2) to count objects in videos using text.
3. Iterate on steps (1) and (2), adding more capabilities.
Novelty of the Research Methodology
While leveraging pre-trained vision-language foundation models for tasks such as image retrieval, object detection, and instance segmentation has been explored extensively for images, similar developments remain comparatively unexplored for videos, largely because the additional temporal dimension makes learning from videos more complex. Furthermore, the methods developed in this project will include novel deep learning architectures that are more general and perform better on existing tasks, or that solve new problems such as repetition counting in videos using natural language descriptions and answering arbitrary natural language queries about the size, shape, and structure of objects.
Alignment to the EPSRC's Strategies & Research Areas
This project relates to the "Artificial Intelligence Technologies" research area.
Any Companies or Collaborators Involved?

Planned Impact

AIMS's impact will be felt across domains of acute need within the UK. We expect AIMS to benefit: UK economic performance, through start-up creation; existing UK firms, both through research and addressing skills needs; UK health, by contributing to cancer research, and quality of life, through the delivery of autonomous vehicles; UK public understanding of and policy related to the transformational societal change engendered by autonomous systems.

Autonomous systems are acknowledged by essentially all stakeholders as important to the future UK economy. PwC claim that there is a £232 billion opportunity offered by AI to the UK economy by 2030 (10% of GDP). AIMS has an excellent track record of leadership in spinout creation, and will continue to foster the commercial projects of its students, through the provision of training in IP, licensing and entrepreneurship. With the help of Oxford Science Innovation (investment fund) and Oxford University Innovation (technology transfer office), student projects will be evaluated for commercial potential.

AIMS will also concretely contribute to UK economic competitiveness by meeting the UK's needs for experts in autonomous systems. To meet this need, AIMS will train cohorts with advanced skills that span the breadth of AI, machine learning, robotics, verification and sensor systems. The relevance of the training to the needs of industry will be ensured by the industrial partnerships at the heart of AIMS. These partnerships will also ensure that AIMS will produce research that directly targets UK industrial needs. Our partners span a wide range of UK sectors, including energy, transport, infrastructure, factory automation, finance, health, space and other extreme environments.

The autonomous systems that AIMS will enable also offer the prospect of epochal change in the UK's quality of life and health. As put by former Digital Secretary Matt Hancock, "whether it's improving travel, making banking easier or helping people live longer, AI is already revolutionising our economy and our society." AIMS will help to realise this potential through its delivery of trained experts and targeted research. In particular, two of the four Grand Challenge missions in the UK Industrial Strategy highlight the positive societal impact underpinned by autonomous systems. The "Artificial Intelligence and data" challenge has as its mission to "Use data, Artificial Intelligence and innovation to transform the prevention, early diagnosis and treatment of chronic diseases by 2030". To this mission, AIMS will contribute the outputs of its research pillar on cancer research. The "Future of mobility" challenge highlights the importance that autonomous vehicles will have in making transport "safer, cleaner and better connected." To this challenge, AIMS offers the world-leading research of its robotic systems research pillar.

AIMS will further promote the positive realisation of autonomous technologies through direct influence on policy. The world-leading academics amongst AIMS's supervisory pool are well connected to policy formation, e.g. Prof Osborne serving as a Commissioner on the Independent Commission on the Future of Work. Further, Dr Dan Mawson, Head of the Economy Unit, Economy and Strategic Analysis Team at BEIS, will serve as an advisor to AIMS, ensuring bidirectional influence between policy objectives and AIMS research and training.

Broad understanding of autonomous systems is crucial in making a society robust to the transformations they will engender. AIMS will foster such understanding through its provision of opportunities for AIMS students to directly engage with the public. Given the broad societal importance of getting autonomous systems right, AIMS will deliver core training on the ethical, governance, economic and societal implications of autonomous systems.



Studentship Projects

Project Reference   Relationship   Related To     Start        End          Student Name
EP/S024050/1                                      01/10/2019   31/03/2028
2711268             Studentship    EP/S024050/1   01/10/2022   30/09/2026   Niki Amini-Naieni