Understanding scenes and events through joint parsing, cognitive reasoning and lifelong learning

Lead Research Organisation: University of Oxford

Department Name: Engineering Science

Abstract

The goal of this MURI team is to develop machines that have the following capabilities:
i) Represent visual knowledge in probabilistic compositional models in spatial, temporal, and causal hierarchies augmented with rich attributes and relations, use task-oriented representations for efficient task-dependent inference from an agent's perspective, and preserve uncertainties;
ii) Acquire massive visual commonsense via web-scale continuous lifelong learning from large and small data in weakly supervised HCI, and maintain consistency via dialogue with humans;
iii) Achieve deep understanding of scenes and events through joint parsing and cognitive reasoning about appearance, geometry, functions, physics, causality, intents and belief of agents, and use joint and long-range reasoning to fill the performance gap with human vision;
iv) Understand human needs and values, interact with humans effectively, and answer human queries about what, who, where, when, why and how in storylines through Turing tests.

Collaboration with US:
Principal Investigator: Dr. Song-Chun Zhu
Tel. 310-206-8693, Fax. 310-206-5658, email: sczhu@stat.ucla.edu
Institution: University of California, Los Angeles
Statistics and Computer Science
8125 Math Sciences Bldg, Box 951554, Los Angeles, CA 90095
Institution proposal no. 20153924

Other universities in the US:
CMU: Martial Hebert (Computer Vision, Robotics & AI)
CMU: Abhinav Gupta (Computer Vision, Lifelong Learning)
MIT: Joshua Tenenbaum (Cognitive Modeling and Learning)
MIT: Nancy Kanwisher (Cognitive Neuroscience)
Stanford: Fei-Fei Li (Computer Vision, Psychology & AI)
UIUC: Derek Hoiem (Computer Vision, Machine Learning)
Yale: Brian Scholl (Psychology, Cognitive Science)

Planned Impact

Not required

Funded Value:

£1,114,546

Funded Period:

Sep 15 - Mar 22

Funder:

EPSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

EP/N019474/1

Principal Investigator:

Philip Torr

Research Subject:

Info. & commun. Technol. (75%)

Psychology (25%)

Research Topic:

Artificial Intelligence (25%)

Cognitive Psychology (25%)

Image & Vision Computing (25%)

Vision & Senses - ICT appl. (25%)

Organisations

People	ORCID iD
Philip Torr (Principal Investigator)

Publications

Acciarini G (2020) Spacecraft Collision Risk Assessment with Probabilistic Programming

Adam D. Cobb (2020) An Ensemble of Bayesian Neural Networks for Exoplanetary Atmospheric Retrieval in The Astronomical Journal 158 (1). doi:10.3847/1538-3881/ab2390

Ajanthan T (2019) Proximal Mean-Field for Neural Network Quantization

Ajanthan T (2017) Efficient Linear Programming for Dense CRFs

Ajanthan Thalaiyasingam (2018) Proximal Mean-field for Neural Network Quantization in arXiv e-prints

Alfarra M (2021) DeformRS: Certifying Input Deformations with Randomized Smoothing

Alfarra M (2022) DeformRS: Certifying Input Deformations with Randomized Smoothing in Proceedings of the AAAI Conference on Artificial Intelligence

Amartya Sanyal (2021) How Benign is Benign Overfitting?

Andreas Munk (2020) Deep Probabilistic Surrogate Networks for Universal Simulator Approximation

Arnab A (2016) Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II

Key Findings
Impact Summary
Further Funding
Spin Outs
Engagement Activities


Description	This work has led to a range of advances, as listed in the academic papers. These have been used in the spin-out OxSight to help the partially sighted and in the new company FiveAI for autonomous cars. The work has been featured on BBC TV (e.g. Horizon and Switch) as well as in other news outlets (Yan Lan Chinese TV). A fuller report has been submitted to EPSRC as part of the MURI program; please refer to that.
Exploitation Route	Spin-out companies and wide adoption by other academics.
Sectors	Creative Economy; Digital/Communication/Information Technologies (including Software); Education; Healthcare; Transport
URL	https://www.robots.ox.ac.uk/~tvg/publication/


Description	Fed into various spin-outs and companies
First Year Of Impact	2016
Sector	Healthcare, Transport
Impact Types	Societal, Economic


Description	Five AI/RAEng Research Chair in Computer Vision
Amount	£225,000 (GBP)
Organisation	Royal Academy of Engineering
Sector	Charity/Non Profit
Country	United Kingdom
Start	09/2018
End	09/2023


Description	Research grant
Amount	£118,000 (GBP)
Organisation	Innovate UK
Sector	Public
Country	United Kingdom
Start	03/2020
End	08/2021


Description	Turing AI Fellowship: Robust, Efficient and Trustworthy Deep Learning
Amount	£3,087,056 (GBP)
Funding ID	EP/W002981/1
Organisation	Engineering and Physical Sciences Research Council (EPSRC)
Sector	Public
Country	United Kingdom
Start	09/2021
End	09/2026


Company Name	Aistetic
Description	Aistetic develops AI technology for retailers which aims to reduce returns.
Year Established	2019
Impact	Information Technology and Services
Website	http://aistetic.com


Company Name	OxSight
Description	OxSight has developed SmartSpecs, a system of devices to help people with severe visual impairment navigate independently. The system uses cameras and computer vision algorithms to detect and highlight objects in real-time, creating an interactive overlay over the wearer's normal vision.
Year Established	2016
Impact	see http://smartspecs.co/
Website	http://www.oxsight.co.uk


Description	Workshop on language and vision at CVPR 2019
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Industry/Business
Results and Impact	The interaction between language and vision, despite seeing traction as of late, is still largely unexplored. This is a particularly relevant topic to the vision community because humans routinely perform tasks which involve both modalities. We do so largely without even noticing. Every time you ask for an object, ask someone to imagine a scene, or describe what you're seeing, you're performing a task which bridges a linguistic and a visual representation. The importance of vision-language interaction can also be seen by the numerous approaches that often cross domains, such as the popularity of image grammars. More concretely, we've recently seen a renewed interest in one-shot learning for object and event models. Humans go further than this using our linguistic abilities; we perform zero-shot learning without seeing a single example. You can recognize a picture of a zebra after hearing the description "horse-like animal with black and white stripes" without ever having seen one. Furthermore, integrating language with vision brings with it the possibility of expanding the horizons and tasks of the vision community. We have seen significant growth in image and video-to-text tasks but many other potential applications of such integration - answering questions, dialog systems, and grounded language acquisition - remain largely unexplored. Going beyond such novel tasks, language can make a deeper contribution to vision: it provides a prism through which to understand the world. A major difference between human and machine vision is that humans form a coherent and global understanding of a scene. This process is facilitated by our ability to affect our perception with high-level knowledge which provides resilience in the face of errors from low-level perception. It also provides a framework through which one can learn about the world: language can be used to describe many phenomena succinctly thereby helping filter out irrelevant details. 
Topics covered (non-exhaustive): language as a mechanism to structure and reason about visual perception, language as a learning bias to aid vision in both machines and humans, novel tasks which combine language and vision, dialogue as means of sharing knowledge about visual perception, stories as means of abstraction, transfer learning across language and vision, understanding the relationship between language and vision in humans, reasoning visually about language problems, visual captioning, dialogue, and question-answering, visual synthesis from language, sequence learning towards bridging vision and language, joint video and language alignment and parsing, and video sentiment analysis.
Year(s) Of Engagement Activity	2019
URL	http://languageandvision.com/
