Understanding scenes and events through joint parsing, cognitive reasoning and lifelong learning
Lead Research Organisation:
University of Oxford
Department Name: Engineering Science
Abstract
The goal of this MURI team is to develop machines that have the following capabilities:
i) Represent visual knowledge in probabilistic compositional models in spatial, temporal, and causal hierarchies augmented with rich attributes and relations, use task-oriented representations for efficient task-dependent inference from an agent's perspective, and preserve uncertainties;
ii) Acquire massive visual commonsense via web-scale continuous lifelong learning from large and small data in weakly supervised HCI, and maintain consistency via dialogue with humans;
iii) Achieve deep understanding of scenes and events through joint parsing and cognitive reasoning about appearance, geometry, functions, physics, causality, and the intents and beliefs of agents, and use joint and long-range reasoning to close the performance gap with human vision;
iv) Understand human needs and values, interact with humans effectively, and answer human queries about what, who, where, when, why and how in storylines through Turing tests.
Collaboration with US:
Principal Investigator: Dr. Song-Chun Zhu
Tel. 310-206-8693, Fax. 310-206-5658, email: sczhu@stat.ucla.edu
Institution: University of California, Los Angeles
Statistics and Computer Science
8125 Math Sciences Bldg, Box 951554, Los Angeles, CA 90095
Institution proposal no. 20153924
Other universities in the US
CMU: Martial Hebert Computer Vision, Robotics & AI
Abhinav Gupta Computer Vision, Lifelong Learning
MIT: Joshua Tenenbaum Cognitive Modeling and Learning
Nancy Kanwisher Cognitive Neuroscience
Stanford: Fei-Fei Li Computer Vision, Psychology & AI
UIUC: Derek Hoiem Computer Vision, Machine Learning
Yale: Brian Scholl Psychology, Cognitive Science
Planned Impact
Not required
People |
Philip Torr (Principal Investigator) |
Publications
Acciarini G
(2020)
Spacecraft Collision Risk Assessment with Probabilistic Programming
Cobb A
(2020)
An Ensemble of Bayesian Neural Networks for Exoplanetary Atmospheric Retrieval
in The Astronomical Journal 158 (1). doi:10.3847/1538-3881/ab2390
Ajanthan T
(2019)
Proximal Mean-Field for Neural Network Quantization
Ajanthan T
(2017)
Efficient Linear Programming for Dense CRFs
Ajanthan T
(2018)
Proximal Mean-field for Neural Network Quantization
in arXiv e-prints
Alfarra M
(2021)
DeformRS: Certifying Input Deformations with Randomized Smoothing
Alfarra M
(2022)
DeformRS: Certifying Input Deformations with Randomized Smoothing
in Proceedings of the AAAI Conference on Artificial Intelligence
Sanyal A
(2021)
How Benign is Benign Overfitting?
Munk A
(2020)
Deep Probabilistic Surrogate Networks for Universal Simulator Approximation
Description | This work has led to a range of advances, as detailed in the academic papers listed above. These have been applied in the spin-out OxSight, which helps the partially sighted, and in the new company FiveAI, which works on autonomous cars. The work has been featured on BBC TV (e.g. Horizon and Switch) and in other media, including Yan Lan Chinese TV. A fuller report has been submitted to EPSRC as part of the MURI programme; please refer to that report for details. |
Exploitation Route | Spin-out companies and wide adoption by other academics. |
Sectors | Creative Economy, Digital/Communication/Information Technologies (including Software), Education, Healthcare, Transport |
URL | https://www.robots.ox.ac.uk/~tvg/publication/ |
Description | Fed into various spin-out companies. |
First Year Of Impact | 2016 |
Sector | Healthcare,Transport |
Impact Types | Societal Economic |
Description | Five AI/RAEng Research Chair in Computer Vision |
Amount | £225,000 (GBP) |
Organisation | Royal Academy of Engineering |
Sector | Charity/Non Profit |
Country | United Kingdom |
Start | 09/2018 |
End | 09/2023 |
Description | Research grant |
Amount | £118,000 (GBP) |
Organisation | Innovate UK |
Sector | Public |
Country | United Kingdom |
Start | 03/2020 |
End | 08/2021 |
Description | Turing AI Fellowship: Robust, Efficient and Trustworthy Deep Learning |
Amount | £3,087,056 (GBP) |
Funding ID | EP/W002981/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 09/2021 |
End | 09/2026 |
Company Name | Aistetic |
Description | Aistetic develops AI technology for retailers that aims to reduce returns. |
Year Established | 2019 |
Impact | Information Technology and Services |
Website | http://aistetic.com |
Company Name | OxSight |
Description | OxSight has developed SmartSpecs, a system of devices to help people with severe visual impairment navigate independently. The system uses cameras and computer vision algorithms to detect and highlight objects in real-time, creating an interactive overlay over the wearer's normal vision. |
Year Established | 2016 |
Impact | see http://smartspecs.co/ |
Website | http://www.oxsight.co.uk |
Description | Workshop on language and vision at CVPR 2019 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | The interaction between language and vision, despite seeing traction of late, is still largely unexplored. This is a particularly relevant topic to the vision community because humans routinely perform tasks which involve both modalities. We do so largely without even noticing. Every time you ask for an object, ask someone to imagine a scene, or describe what you're seeing, you're performing a task which bridges a linguistic and a visual representation. The importance of vision-language interaction can also be seen in the numerous approaches that cross domains, such as the popularity of image grammars.

More concretely, we have recently seen a renewed interest in one-shot learning for object and event models. Humans go further than this using our linguistic abilities: we perform zero-shot learning without seeing a single example. You can recognize a picture of a zebra after hearing the description "horse-like animal with black and white stripes" without ever having seen one.

Furthermore, integrating language with vision brings with it the possibility of expanding the horizons and tasks of the vision community. We have seen significant growth in image and video-to-text tasks, but many other potential applications of such integration - answering questions, dialogue systems, and grounded language acquisition - remain largely unexplored. Going beyond such novel tasks, language can make a deeper contribution to vision: it provides a prism through which to understand the world. A major difference between human and machine vision is that humans form a coherent and global understanding of a scene. This process is facilitated by our ability to inform our perception with high-level knowledge, which provides resilience in the face of errors from low-level perception. It also provides a framework through which one can learn about the world: language can be used to describe many phenomena succinctly, thereby helping filter out irrelevant details.

Topics covered (non-exhaustive): language as a mechanism to structure and reason about visual perception; language as a learning bias to aid vision in both machines and humans; novel tasks which combine language and vision; dialogue as a means of sharing knowledge about visual perception; stories as a means of abstraction; transfer learning across language and vision; understanding the relationship between language and vision in humans; reasoning visually about language problems; visual captioning, dialogue, and question-answering; visual synthesis from language; sequence learning towards bridging vision and language; joint video and language alignment and parsing; and video sentiment analysis. |
Year(s) Of Engagement Activity | 2019 |
URL | http://languageandvision.com/ |