EPSRC Network on Vision and Language (V&L Net)

Lead Research Organisation: University of Brighton
Department Name: Sch of Computing, Engineering & Maths


The amount of digital information accessible on the web and, more generally, in data repositories of various sorts is growing at an ever faster pace. Increasingly, digital information means visual content (image and video), and this development has resulted in a situation where computational solutions are lagging behind a diverse range of current image/video search, processing and management needs. There is a big, and as yet unbridged, semantic gap between visual content and language. Finding solutions for image/video retrieval, automatic image/video annotation and similar challenges will require this gap to be bridged, and this in turn will require expertise from both the computer vision (CV) and natural language processing (NLP) fields. Yet, while language and vision are the two primary modalities for human perception and computer-mediated communication, the two corresponding computing science disciplines hardly talk to each other, and this is part of the reason why the language-vision gap is still so wide: NLP research is perhaps not aware enough of the range of possible applications involving visual content and their specific language processing requirements; CV can tend to underestimate the complexity of the language processing problem, and currently uses mostly basic language processing technology, whereas sophisticated, high-performance tools exist.

We propose an EPSRC Network on Vision and Language, V&L Net, to create a forum for researchers from CV and NLP to meet and exchange ideas, expertise and technology. The UK has some of the world's leading researchers in NLP and CV. V&L Net aims to tap this body of expertise to create new strategic partnerships aimed at narrowing the language-vision gap by developing the theory required for solutions to the difficult challenges posed by our increasingly multi-modal world. A successful network will place the UK at the forefront of developing solutions at the language-vision intersection which have clear commercial potential.

Our overarching goal in V&L Net is the creation of a new interdisciplinary research community working towards computational solutions for challenges that involve both language and vision. By (i) bringing researchers from the two currently separate disciplines of computer vision and language processing together, (ii) facilitating access to relevant information, expertise, and resources, and (iii) stimulating research and pump-priming individual research projects, we aim to engender a substantial increase in interdisciplinary research activity. Through this increase in work bringing to bear expertise from both computer vision and language processing, we expect to see a step change in progress towards solutions for a range of real-world challenges as well as theoretical questions. While the latter will tend to have a more long-term impact (laying the groundwork for future breakthroughs), the former have substantial potential to result in ground-breaking new products and services that will improve people's quality of life in diverse ways even in the short to medium term. People with impairments in sight, hearing and cognitive ability will benefit from assistive technology that will help them access multiple modalities. Improvements in image search and retrieval will enhance online search experience, as well as help institutions such as hospitals and police forces to cope with the massive amounts of images and videos they deal with daily.

Planned Impact

As an EPSRC Network, V&L Net will not engage in primary research, but has the key aims of creating a strong new interdisciplinary research community and stimulating new research at the vision-language intersection. The resulting benefits cluster around four main impacts:

Impact 1: Creation of a New Interdisciplinary Research Community (Time scale: 3 years) Beneficiaries: Researchers in Natural Language Processing (NLP) and Computer Vision (CV). Benefits: (a) opening up new channels of communication between NLP and CV researchers; (b) facilitating identification of potential academic and industrial partners for interdisciplinary collaboration; (c) increase in critical mass of research aimed at vision-language problems; (d) easier access to expertise, publications, data and software resources. Realisation: high-profile initial recruitment drive, large membership, events for networking and presenting research, online resources for networking and repositories of information about researchers, data resources and expertise, continuation of V&L Net after funding period in self-sustaining Special Interest Group, electronic journal and annual event.

Impact 2: Technological Progress (Time scale: 3-6 years) Beneficiaries: NLP & CV researchers, industry, end-users. Benefits: (a) increased interaction and research activity resulting in 'hothousing' effect leading to a speed-up in technological progress; (b) new data and software resources accessible to researchers; (c) improved techniques for image search, video retrieval, visual content description, text-to-image generation and related tasks; (d) better tools for range of everyday activities in work and leisure time spent interacting with a computer. Realisation: online tools for sharing resources and expertise, matchmaking services for finding research collaborators in academia and industry, roadmapping initiative to identify key research challenges and milestones, white paper series, research workshops, stimulating research on specific important topics, and pump-priming research collaboration aimed at proposal preparation.

Impact 3: Technology for commercial exploitation (Time scale: 3-12 years) Beneficiaries: industry. Benefits: (a) increased collaboration between industry and academia on vision-language problems; (b) real breakthroughs in technologies for saleable products such as image search tools, assistive technologies, video analysis, etc. Realisation: strong involvement of UK-based industry in V&L Net, building on V&L Net members' extensive existing links with industry, academia-industry match-making service, demo sessions at V&L Net meetings, industrial sponsorship, recruiting industrial members, involving industry representatives in roadmapping initiative, linking up with existing KTN initiatives.

Impact 4: Products for next-generation human-computer interaction (Time scale: 6-12 years) Beneficiaries: end users, including companies, organisations and individuals. Benefits: (a) enhanced efficiency and enjoyment of work and leisure activities involving interaction with a computer through improved image search tools, multi-modal interaction etc; (b) improved quality of life for people with vision, hearing or cognitive impairments through assistive technology capable of compensating for impaired vision, hearing or language modalities; (c) contribution to public health and safety through new tools for forensic and medical search and retrieval; etc. Realisation: this is a set of more indirect benefits that will be the end results of the activities outlined under the other three chief impacts above; to maximise end-user benefits we will also involve end-user groups in the roadmapping exercise and other activities.


Description While this was a research network that did not directly support any primary research, it led to numerous new research collaborations between researchers in language and vision, which in turn produced a diverse range of new results confirming the benefits of bringing language and vision technology together and, in particular, of developing integrated approaches.

Such new collaborations were fostered and supported directly by kick-starter grants allocated on a competitive basis to enable researchers to meet.
Exploitation Route The work of the EPSRC Network on Vision and Language has continued in the form of the European COST Action on Integrating Vision and Language which has a direct scientific focus on integration of approaches as well as working with end-users.
Sectors Digital/Communication/Information Technologies (including Software)
including Industrial Biotechnology
Museums and Collections
Security and Diplomacy

Description The European Network on Integrating Vision and Language (iV&L Net): Combining Computer Vision and Language Processing For Advanced Search, Retrieval, Annotation and Description of Visual Data
Amount £500,000 (GBP)
Funding ID COST IC1307 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 11/2013 
End 11/2017