Understanding scenes and events through joint parsing, cognitive reasoning and lifelong learning
Lead Research Organisation:
University of Oxford
Department Name: Engineering Science
Abstract
The goal of this MURI team is to develop machines that have the following capabilities:
i) Represent visual knowledge in probabilistic compositional models in spatial, temporal, and causal hierarchies augmented with rich attributes and relations, use task-oriented representations for efficient task-dependent inference from an agent's perspective, and preserve uncertainties;
ii) Acquire massive visual commonsense via web-scale continuous lifelong learning from large and small data in weakly supervised HCI, and maintain consistency via dialogue with humans;
iii) Achieve deep understanding of scenes and events through joint parsing and cognitive reasoning about appearance, geometry, functions, physics, causality, and the intents and beliefs of agents, and use joint and long-range reasoning to close the performance gap with human vision;
iv) Understand human needs and values, interact with humans effectively, and answer human queries about what, who, where, when, why and how in storylines through Turing tests.
Collaboration with US:
Principal Investigator: Dr. Song-Chun Zhu
Tel. 310-206-8693, Fax. 310-206-5658, email: sczhu@stat.ucla.edu
Institution: University of California, Los Angeles
Statistics and Computer Science
8125 Math Sciences Bldg, Box 951554, Los Angeles, CA 90095
Institution proposal no. 20153924
Other universities in the US
CMU: Martial Hebert Computer Vision, Robotics & AI
Abhinav Gupta Computer Vision, Lifelong Learning
MIT: Joshua Tenenbaum Cognitive Modeling and Learning
Nancy Kanwisher Cognitive Neuroscience
Stanford: Fei-Fei Li Computer Vision, Psychology & AI
UIUC: Derek Hoiem Computer Vision, Machine Learning
Yale: Brian Scholl Psychology, Cognitive Science
Planned Impact
Not required
People
Philip Torr (Principal Investigator)
Publications
Acciarini G
(2020)
Spacecraft Collision Risk Assessment with Probabilistic Programming
Adam D. Cobb
(2020)
An Ensemble of Bayesian Neural Networks for Exoplanetary Atmospheric Retrieval
in The Astronomical Journal 158 (1). doi:10.3847/1538-3881/ab2390
Ajanthan T
(2017)
Efficient Linear Programming for Dense CRFs
Ajanthan T
(2019)
Proximal Mean-Field for Neural Network Quantization
Ajanthan Thalaiyasingam
(2018)
Proximal Mean-field for Neural Network Quantization
in arXiv e-prints
Alfarra M
(2021)
DeformRS: Certifying Input Deformations with Randomized Smoothing
Alfarra M
(2022)
DeformRS: Certifying Input Deformations with Randomized Smoothing
in Proceedings of the AAAI Conference on Artificial Intelligence
Amartya Sanyal
(2021)
How Benign is Benign Overfitting?
Andreas Munk
(2020)
Deep Probabilistic Surrogate Networks for Universal Simulator Approximation
Description | This work has led to a range of advances, as listed in the academic papers. These have been used in the spin-out OxSight to help the partially sighted, and in the new company FiveAI for autonomous cars. The work has been featured on BBC TV (e.g. Horizon and Switch) as well as in other news coverage (Yan Lan, Chinese TV). A fuller report has been submitted to EPSRC as part of the MURI programme; please refer to that. |
Exploitation Route | Spin-out companies and wide adoption by other academics. |
Sectors | Creative Economy, Digital/Communication/Information Technologies (including Software), Education, Healthcare, Transport
URL | https://www.robots.ox.ac.uk/~tvg/publication/ |
Description | Fed into various spin-outs and other companies
First Year Of Impact | 2016 |
Sector | Healthcare, Transport
Impact Types | Societal, Economic
Description | Five AI/RAEng Research Chair in Computer Vision |
Amount | £225,000 (GBP) |
Organisation | Royal Academy of Engineering |
Sector | Charity/Non Profit |
Country | United Kingdom |
Start | 10/2018 |
End | 09/2023 |
Description | Research grant |
Amount | £118,000 (GBP) |
Organisation | Innovate UK |
Sector | Public |
Country | United Kingdom |
Start | 03/2020 |
End | 08/2021 |
Description | Turing AI Fellowship: Robust, Efficient and Trustworthy Deep Learning |
Amount | £3,087,056 (GBP) |
Funding ID | EP/W002981/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 10/2021 |
End | 09/2026 |
Company Name | OxSight
Description | OxSight is a University of Oxford venture that uses the latest smart glasses to develop sight-enhancing technologies and improve the quality of life for blind and partially sighted people around the world. Its current commercial products can enhance vision for people affected by conditions such as glaucoma, diabetes and retinitis pigmentosa, as well as some other degenerative eye diseases. |
Year Established | 2016 |
Impact | see http://smartspecs.co/ |
Website | http://smartspecs.co/ |
Company Name | AISTETIC LIMITED |
Description | We are a University of Oxford spin-out, applying state-of-the-art computer vision and deep learning to the real-world problems associated with shopping for clothes online. We're building an innovative e-commerce platform, applying computer vision and deep tech to clothing. With world-class founders from leading academic institutions and companies, we are a team on a mission to disrupt and improve how everyone shops for clothes. Aistetic was founded with a clear purpose: to make tailoring accessible to more people, wherever they are. Our mission is to do so sustainably, reducing waste and encouraging a more sustainable approach to clothing. Our innovation partners are the University of Oxford, Innovate UK, and OxLEP. We are part of the Digital Catapult's Machine Intelligence Garage and the Data Market Services Accelerator (EU Horizon 2020). |
Year Established | 2019 |
Impact | Information Technology and Services |
Description | Workshop on language and vision at CVPR 2019 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | The interaction between language and vision, despite seeing traction as of late, is still largely unexplored. This is a particularly relevant topic to the vision community because humans routinely perform tasks which involve both modalities, and we do so largely without even noticing. Every time you ask for an object, ask someone to imagine a scene, or describe what you're seeing, you're performing a task which bridges a linguistic and a visual representation.

The importance of vision-language interaction can also be seen in the numerous approaches that cross domains, such as the popularity of image grammars. More concretely, we've recently seen a renewed interest in one-shot learning for object and event models. Humans go further than this using our linguistic abilities: we perform zero-shot learning without seeing a single example. You can recognize a picture of a zebra after hearing the description "horse-like animal with black and white stripes" without ever having seen one.

Furthermore, integrating language with vision brings with it the possibility of expanding the horizons and tasks of the vision community. We have seen significant growth in image- and video-to-text tasks, but many other potential applications of such integration - answering questions, dialogue systems, and grounded language acquisition - remain largely unexplored. Going beyond such novel tasks, language can make a deeper contribution to vision: it provides a prism through which to understand the world. A major difference between human and machine vision is that humans form a coherent and global understanding of a scene. This process is facilitated by our ability to shape our perception with high-level knowledge, which provides resilience in the face of errors from low-level perception. It also provides a framework through which one can learn about the world: language can describe many phenomena succinctly, thereby helping filter out irrelevant details.

Topics covered (non-exhaustive): language as a mechanism to structure and reason about visual perception, language as a learning bias to aid vision in both machines and humans, novel tasks which combine language and vision, dialogue as a means of sharing knowledge about visual perception, stories as a means of abstraction, transfer learning across language and vision, understanding the relationship between language and vision in humans, reasoning visually about language problems, visual captioning, dialogue, and question-answering, visual synthesis from language, sequence learning towards bridging vision and language, joint video and language alignment and parsing, and video sentiment analysis. |
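The zero-shot idea described above (recognising a zebra from the description "horse-like animal with black and white stripes") can be illustrated with a toy attribute-matching classifier. This is only a minimal sketch: the class names, attribute vocabulary, and similarity measure below are illustrative assumptions, not part of the workshop or any system built by the project; real approaches use learned embeddings rather than hand-written attribute sets.

```python
# Toy zero-shot classification: classes never seen in training are
# described by attributes derived from language, and an image is
# classified by matching its detected attributes to those descriptions.

# Language-derived attribute descriptions for unseen classes (illustrative).
class_descriptions = {
    "zebra": {"horse_like", "black_and_white", "striped"},
    "panda": {"bear_like", "black_and_white"},
    "tiger": {"cat_like", "orange", "striped"},
}

def zero_shot_classify(detected_attributes):
    """Return the unseen class whose linguistic description best matches
    the attributes detected in an image (Jaccard set similarity)."""
    def jaccard(a, b):
        return len(a & b) / len(a | b)
    return max(class_descriptions,
               key=lambda c: jaccard(class_descriptions[c], detected_attributes))

# Suppose an attribute detector (trained only on seen classes such as
# horses) reports these attributes for a new photo:
print(zero_shot_classify({"horse_like", "striped", "black_and_white"}))
```

The point of the sketch is that the "training signal" for the zebra class is purely linguistic: no zebra image is ever needed, only a description that grounds out in attributes the vision system can already detect.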
Year(s) Of Engagement Activity | 2019 |
URL | http://languageandvision.com/ |