Understanding scenes and events through joint parsing, cognitive reasoning and lifelong learning

Lead Research Organisation: University of Oxford
Department Name: Engineering Science


The goal of this MURI team is to develop machines that have the following capabilities:
i) Represent visual knowledge in probabilistic compositional models in spatial, temporal, and causal hierarchies augmented with rich attributes and relations, use task-oriented representations for efficient task-dependent inference from an agent's perspective, and preserve uncertainties;
ii) Acquire massive visual commonsense via web-scale continuous lifelong learning from large and small data in weakly supervised HCI, and maintain consistency via dialogue with humans;
iii) Achieve deep understanding of scenes and events through joint parsing and cognitive reasoning about appearance, geometry, functions, physics, causality, intents and belief of agents, and use joint and long-range reasoning to fill the performance gap with human vision;
iv) Understand human needs and values, interact with humans effectively, and answer human queries about what, who, where, when, why and how in storylines through Turing tests.

Collaboration with US:
Principal Investigator: Dr. Song-Chun Zhu
Tel. 310-206-8693, Fax. 310-206-5658, email: sczhu@stat.ucla.edu
Institution: University of California, Los Angeles
Statistics and Computer Science
8125 Math Sciences Bldg, Box 951554, Los Angeles, CA 90095
Institution proposal no. 20153924

Other universities in the US:
CMU: Martial Hebert (Computer Vision, Robotics & AI)
CMU: Abhinav Gupta (Computer Vision, Lifelong Learning)
MIT: Joshua Tenenbaum (Cognitive Modeling and Learning)
MIT: Nancy Kanwisher (Cognitive Neuroscience)
Stanford: Fei-Fei Li (Computer Vision, Psychology & AI)
UIUC: Derek Hoiem (Computer Vision, Machine Learning)
Yale: Brian Scholl (Psychology, Cognitive Science)

Planned Impact

Not required
Description This work has led to a range of advances, as listed in the academic papers. These have been used in the spin-out OxSight to help partially sighted people and in the new company FiveAI for autonomous cars. The work has been featured on BBC TV (e.g. Horizon and Switch) as well as in other news (Yan Lan Chinese TV). A fuller report has been submitted to EPSRC as part of the MURI program; please refer to that report.
Exploitation Route Spin-out companies and wide adoption by other academics.
Sectors Creative Economy, Digital/Communication/Information Technologies (including Software), Education, Healthcare, Transport

URL https://www.robots.ox.ac.uk/~tvg/publication/
Description Fed into various spin-outs and companies
First Year Of Impact 2016
Sector Healthcare, Transport
Impact Types Societal, Economic

Description Five AI/RAEng Research Chair in Computer Vision
Amount £225,000 (GBP)
Organisation Royal Academy of Engineering 
Sector Charity/Non Profit
Country United Kingdom
Start 10/2018 
End 09/2023

Description Research grant
Amount £118,000 (GBP)
Organisation Innovate UK 
Sector Public
Country United Kingdom
Start 03/2020 
End 08/2021

Description Turing AI Fellowship: Robust, Efficient and Trustworthy Deep Learning
Amount £3,087,056 (GBP)
Funding ID EP/W002981/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 09/2021 
End 09/2026

Company Name OxSight
Description OxSight is a University of Oxford venture that uses the latest smart glasses to improve sight for blind and partially sighted people. OxSight's aim is to develop sight enhancing technologies to improve the quality of life for blind and partially sighted people around the world. Our current commercial products can enhance vision for people affected by conditions like glaucoma, diabetes and retinitis pigmentosa as well as some other degenerative eye diseases. 
Year Established 2016 
Impact See http://smartspecs.co/
Website http://smartspecs.co/

Company Name Aistetic
Description We are a University of Oxford spinout, applying state-of-the-art computer vision and deep learning to the real-world problems associated with shopping for clothes online. We're building an innovative ecommerce platform, applying computer vision and deep tech to clothing. With world-class founders from leading academic institutions and companies, we are a team on a mission to disrupt and improve how everyone shops for clothes. Aistetic was founded with a clear purpose: to make tailoring accessible to more people wherever they are. Our mission is to do so sustainably, reducing waste and encouraging a more sustainable approach to clothing. Our innovation partners are the University of Oxford, Innovate UK, and OxLEP. We are part of the Digital Catapult's Machine Intelligence Garage and Data Market Services Accelerator (EU Horizon 2020).
Year Established 2019 
Impact Information Technology and Services

Description Workshop on language and vision at CVPR 2019
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact The interaction between language and vision, despite seeing traction as of late, is still largely unexplored. This is a particularly relevant topic to the vision community because humans routinely perform tasks which involve both modalities. We do so largely without even noticing. Every time you ask for an object, ask someone to imagine a scene, or describe what you're seeing, you're performing a task which bridges a linguistic and a visual representation. The importance of vision-language interaction can also be seen by the numerous approaches that often cross domains, such as the popularity of image grammars. More concretely, we've recently seen a renewed interest in one-shot learning for object and event models. Humans go further than this using our linguistic abilities; we perform zero-shot learning without seeing a single example. You can recognize a picture of a zebra after hearing the description "horse-like animal with black and white stripes" without ever having seen one.

Furthermore, integrating language with vision brings with it the possibility of expanding the horizons and tasks of the vision community. We have seen significant growth in image and video-to-text tasks but many other potential applications of such integration - answering questions, dialog systems, and grounded language acquisition - remain largely unexplored. Going beyond such novel tasks, language can make a deeper contribution to vision: it provides a prism through which to understand the world. A major difference between human and machine vision is that humans form a coherent and global understanding of a scene. This process is facilitated by our ability to affect our perception with high-level knowledge which provides resilience in the face of errors from low-level perception. It also provides a framework through which one can learn about the world: language can be used to describe many phenomena succinctly thereby helping filter out irrelevant details.

Topics covered (non-exhaustive):

language as a mechanism to structure and reason about visual perception,
language as a learning bias to aid vision in both machines and humans,
novel tasks which combine language and vision,
dialogue as means of sharing knowledge about visual perception,
stories as means of abstraction,
transfer learning across language and vision,
understanding the relationship between language and vision in humans,
reasoning visually about language problems,
visual captioning, dialogue, and question-answering,
visual synthesis from language,
sequence learning towards bridging vision and language,
joint video and language alignment and parsing, and
video sentiment analysis.
Year(s) Of Engagement Activity 2019
URL http://languageandvision.com/