Interactive Perception-Action-Learning for Modelling Objects

Lead Research Organisation: Imperial College London
Department Name: Electrical and Electronic Engineering

Abstract

Manipulating everyday objects without detailed prior models is still beyond the capabilities of existing robots. This is due to the many challenges posed by diverse types of objects: manipulation requires an understanding and an accurate model of physical properties of objects such as shape, mass, friction, and elasticity. Many objects are deformable, articulated, or even organic with undefined shape (e.g., plants), such that a fixed model is insufficient. On top of this, objects may be difficult to perceive, typically because of cluttered scenarios or complex lighting and reflectance properties such as specularity or partial transparency. Creating such rich representations of objects is beyond current datasets and benchmarking practices used for grasping and manipulation. In this project we will develop an automated interactive perception pipeline for building such rich digitizations.

More specifically, in IPALM, we will develop methods for the automatic digitization of objects and their physical properties by exploratory manipulations. These methods will be used to build a large collection of object models required for realistic grasping and manipulation experiments in robotics. Household objects such as tools, kitchenware, clothes, and food items are not only widely accessible and the focus of many practical applications, but also pose great challenges for robot object perception and manipulation in realistic scenarios. We propose to advance the state of the art by including household objects that can be deformable, articulated, interactive, specular or transparent, as well as shapeless, such as cloth and food items.

Our methods will learn physical properties essential for perception and grasping simultaneously from different modalities: vision, touch, and audio, as well as text documents such as online manuals. The modelled properties will include 3D model, texture, elasticity, friction, weight, size, and grasping techniques for the intended use. At the core of our approach is a two-level modelling scheme, where a category-level model provides priors for capturing instance-level attributes of specific objects. We will exploit online resources to build prior category-level models, and a perception-action-learning loop will use the robot's vision, audio, and touch to model instance-level object properties. In return, knowledge acquired from a new instance will be used to improve the category-level knowledge. Our approach will allow us to efficiently create a large database of models for objects of diverse types, suitable, for example, for training neural-network-based methods or enhancing existing simulators. We will propose a benchmark and evaluation metrics for object grasping to enable comparison of results generated with various robotics platforms on our database.

Our main objectives target commercially relevant robotics technologies, as endorsed by the support letters of several companies. We will pursue our goals with a consortium that brings together 5 world-class academic institutions from 5 EU countries (Imperial College London (UK), University of Bordeaux (France), Institut de Robòtica i Informàtica Industrial (Spain), Aalto University (Finland), and the Czech Technical University (Czech Republic)), assembling a complementary research team with strong expertise in the acquisition, processing, and learning of multimodal information with applications in robotics.

Planned Impact

We expect impact in several directions from IPALM. We will create an open-access dataset,
together with the evaluation data and metrics, in order to help build momentum beyond
the project partners. A large part of the project is devoted to the development of objective
benchmarks and evaluation strategies to facilitate research in the relevant domain.
We will disseminate project outcomes across different scientific disciplines to strengthen the
community involved in addressing the challenges in object modelling and manipulation. Our
consortium includes experts from a broad range of disciplines needed to tackle this topic, e.g.,
robotic manipulation, computer vision, embodied cognition, and performance
evaluation.
At the 2015 Amazon Picking Challenge, Prof. Henrik Christensen from Georgia Tech noted
that perception was the dominating factor separating the winners from the rest of the
field and that "90% of all robots today don't use sensors". Thus, the functionality of
perceiving objects in everyday tasks will act as a multiplier, creating a large impact on markets
and society.
Domestic service robots are foreseen as a key technology to meet the challenge of the
ageing society, enhancing the quality of life by providing personal assistance in
smart homes. The IPALM website will provide information to the general public (target: 10 000
views).
An International Federation of Robotics study from 2016 showed a 24% annual increase in
the number of service robots sold in both professional and personal markets. The trend is
continuing, and if these capabilities are developed in Europe through projects such as IPALM,
Europe will be able to capture a significant share of the growing market.
As we demonstrated in our ImageCLEF data challenge in 2015, many researchers can be
attracted to a scientific area if given the data and focus. The IPALM benchmark will provide this
for object manipulation in domestic scenarios. Our aim is to become a leading research
project in the area of object modelling and manipulation. Released open-source software and
open-access datasets will promote reproducible research, allowing other researchers to build
on the advances.

Beyond individual benchmarks, the IPALM framework will provide a general-purpose tool for
object modelling and manipulation, which can be used to construct applications in robotics
and beyond.
Scientific advances in object modelling by combining innovations in robotics, computer
vision and machine learning will contribute to new scientific knowledge with wide
implications. The scientific impact will be enhanced by vigorous dissemination activities
through top-ranking journal and conference publications in the respective fields. In terms of
academic impact, the robotics, computer vision and machine learning research communities
will benefit directly from our methodological innovations, which will be published in top
venues in these areas, including high impact journals (IEEE TPAMI, IJCV, IEEE T-RO, IJRR,
IEEE T-CDS) and conferences (RSS, IROS, ICRA, ICCV, CVPR, ECCV). All consortium
partners have a track record in publishing in these venues.
The scientific achievements will impact on education and training through the partner
institutions' research-led postgraduate degree programmes and their continuing professional
education offerings. It will also contribute to training highly skilled researchers and engineers
for the European economy. The outreach element of IPALM will contribute to attracting new
generations to careers in science and engineering. PDRAs at the partner institutions will
participate in Postgraduate Researcher Development Programmes, which support early-
career researchers in the development of research skills to enhance their employability
through career and personal development. PDRAs can apply for a teaching assistant position,
which will allow them to contribute to the delivery of a relevant course and gain
experience of various aspects of teaching.
 
Description As part of the research plan aiming to develop methods for the automatic digitization of objects and their physical
properties by exploratory manipulations: 1) A dataset of small 3D objects (SHOP-VRB) has been generated for benchmarking robotic reasoning systems. Its main advantage is the realistic rendering of real household objects, which improves the learning of object models and generalises to real robotic scenarios. 2) Using this dataset, a new approach for planning robotic actions has been developed. We presented a new AI task, Vision to Action (V2A), in which an agent (a robotic arm) is asked to perform a high-level task with objects present in a scene (e.g., stacking). The agent has to suggest a plan consisting of primitive actions (e.g., simple movement, grasping) in order to successfully complete the given task; a minimal sketch of such a plan is given below. We propose a novel approach based on multimodal attention for this task and demonstrate its performance on our new dataset. 3) We developed a system (Embodied Reasoning) that integrates a reasoning apparatus with an embodied agent in simulation and in a real environment. It is capable of understanding tasks formulated by humans, disambiguating the scene and objects, planning actions and motion, manipulating real objects, and measuring their physical properties. In particular, we integrate components that are trained from existing datasets or generated by simulators, which allows the system to generalise to new environments. Our approach segments a long-term goal into action sequences that can be executed by an embodied agent to acquire information about the environment via object manipulation. Such an embodied agent can improve its model of the environment, which in turn improves the planning and execution of more complex tasks.
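To make the V2A plan representation above concrete, the following minimal Python sketch decomposes a high-level stacking task into primitive actions. The primitive names, the Action structure, and the decomposition are hypothetical illustrations, not the project's actual interface.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List

class Primitive(Enum):
    """Hypothetical primitive actions available to the robotic arm."""
    MOVE_TO = auto()
    GRASP = auto()
    RELEASE = auto()

@dataclass
class Action:
    primitive: Primitive
    target: str          # object or location identifier in the scene

def stack_plan(objects: List[str], destination: str) -> List[Action]:
    """Decompose the high-level task 'stack these objects at destination'
    into a sequence of primitive actions (illustrative decomposition)."""
    plan: List[Action] = []
    for obj in objects:
        plan += [Action(Primitive.MOVE_TO, obj),
                 Action(Primitive.GRASP, obj),
                 Action(Primitive.MOVE_TO, destination),
                 Action(Primitive.RELEASE, destination)]
    return plan

# e.g. stacking two mugs on a tray
print(stack_plan(["mug_1", "mug_2"], "tray"))
```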
Exploitation Route The methods, the datasets and the results from our approach can be used by other research labs to directly measure and compare the performance of their methods for the given task.
The methods can serve as a basis for other projects in robotics or for developing complete robotic systems that rely on visual reasoning, object grasping and manipulation, object pose estimation, or visual localisation.
Sectors Aerospace, Defence and Marine, Creative Economy, Education

URL https://sites.google.com/view/ipalm/
 
Description Our implementation of object detection and semantic segmentation was used by Sees.ai in an industrial knowledge-transfer project for drone navigation, which will have a significant impact on the drone industry. By improving the accuracy and efficiency of drone navigation, the new approach can enhance the safety and reliability of drone operations, making them more viable for monitoring landing areas. This can lead to increased adoption of drones in various applications, including surveillance, inspection, and emergency response. Additionally, the successful implementation of the new approach can inspire further research and development in drone technology, leading to even more advanced and effective solutions. Ultimately, this will contribute to the growth and evolution of the drone industry, benefiting society as a whole and informing the policy makers who are currently drafting regulations for operating drones in the UK.
First Year Of Impact 2023
Sector Aerospace, Defence and Marine
Impact Types Economic, Policy & public services

 
Description Future Flight Phase 3
Amount £996,205 (GBP)
Funding ID 10021134 
Organisation Innovate UK 
Sector Public
Country United Kingdom
Start 09/2022 
End 09/2023
 
Title D2d: Keypoint extraction with describe to detect approach 
Description We proposed a novel approach that exploits the information within the descriptor space to propose keypoint locations. Detect then describe, or detect and describe jointly are two typical strategies for extracting local descriptors. In contrast, we propose an approach that inverts this process by first describing and then detecting the keypoint locations. Describe-to-Detect (D2D) leverages successful descriptor models without the need for any additional training. Our method selects keypoints as salient locations with high information content which is defined by the descriptors rather than some independent operators. We perform experiments on multiple benchmarks including image matching, camera localisation, and 3D reconstruction. The results indicate that our method improves the matching performance of various descriptors and that it generalises across methods and tasks. 
Type Of Material Computer model/algorithm 
Year Produced 2020 
Provided To Others? Yes  
Impact The ACCV conference provides a platform for researchers to present their work, exchange ideas, and discuss the latest developments in computer vision. Presenting a paper at ACCV increases the visibility and recognition of the authors and their research. The conference attracts a diverse range of participants, including researchers, practitioners, and industry professionals, providing opportunities for networking and learning. 
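As a rough illustration of the describe-then-detect idea above, the sketch below assumes a dense descriptor map from any descriptor network and scores each location by a simplified information measure (descriptor norm times dissimilarity to its 4-neighbourhood); the actual D2D saliency measure differs in its details.

```python
import torch
import torch.nn.functional as F

def d2d_keypoints_sketch(desc_map: torch.Tensor, k: int = 500):
    """Select keypoints from a dense descriptor map of shape (C, H, W).

    Saliency here is a simplified stand-in: descriptor norm (absolute
    information) times average dissimilarity to the 4-neighbourhood
    (relative information).
    """
    C, H, W = desc_map.shape
    unit = F.normalize(desc_map, dim=0)                  # unit-norm descriptors
    absolute = desc_map.norm(dim=0)                      # (H, W)
    relative = torch.zeros(H, W)
    for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        neighbour = torch.roll(unit, shifts=(dy, dx), dims=(1, 2))
        relative += 1.0 - (unit * neighbour).sum(dim=0)  # 1 - cosine similarity
    saliency = absolute * relative / 4.0
    flat = saliency.flatten().topk(k).indices            # top-k locations
    ys = torch.div(flat, W, rounding_mode="floor")
    return torch.stack([ys, flat % W], dim=1), saliency

# usage with the dense output of any descriptor network (random map here)
keypoints, sal = d2d_keypoints_sketch(torch.randn(128, 60, 80), k=100)
```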
 
Title DESC: Domain Adaptation for Depth Estimation via Semantic Consistency 
Description We implement a domain adaptation approach to train a monocular depth estimation model using a fully-annotated source dataset and a non-annotated target dataset. We bridge the domain gap by leveraging semantic predictions and low-level edge features to provide guidance for the target domain. We enforce consistency between the main model and a second model trained with semantic segmentation and edge maps, and introduce priors in the form of instance heights. 
Type Of Material Computer model/algorithm 
Year Produced 2022 
Provided To Others? Yes  
Impact This software has been used within the research lab and by project partners to improve monocular depth estimation for applications in robotics, 3D reconstruction, navigation, augmented reality, etc. 
URL https://github.com/alopezgit/DESC
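A minimal sketch, assuming PyTorch, of two ingredients named in the description above: an instance-height prior derived from pinhole geometry and an L1 consistency term between the main and auxiliary depth predictions. Function names and example values are illustrative, not the released implementation.

```python
import torch

def height_prior_depth(focal_px: float, real_height_m: float, bbox_height_px: float) -> float:
    """Pinhole prior: an object of known real-world height that spans h pixels
    under focal length f (in pixels) lies at roughly depth = f * H / h.
    Used here as a weak per-instance depth prior (values are illustrative)."""
    return focal_px * real_height_m / bbox_height_px

def consistency_loss(depth_main: torch.Tensor, depth_aux: torch.Tensor) -> torch.Tensor:
    """L1 consistency between the main depth network and an auxiliary network
    that also receives semantic segmentation and edge maps (names hypothetical)."""
    return (depth_main - depth_aux).abs().mean()

# e.g. a car ~1.5 m tall appearing 75 px tall with a 720 px focal length
print(height_prior_depth(720.0, 1.5, 75.0))          # ~14.4 m
print(consistency_loss(torch.rand(1, 1, 96, 320), torch.rand(1, 1, 96, 320)))
```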
 
Title HDD-Net: Hybrid Detector Descriptor with Mutual Interactive Learning 
Description The success of SLAM, 3D reconstruction, or AR applications relies on the performance of the feature detector and descriptor. While most methods model the detector-descriptor interaction by unifying detections and descriptors in a single network, we propose a method that treats both extractions independently and focuses on their interaction in the learning process rather than on parameter sharing. We formulate the classical hard-mining triplet loss as a new detector optimisation term to refine candidate positions based on the descriptor map. We propose a dense descriptor that uses a multi-scale approach and a hybrid combination of hand-crafted and learned features to obtain rotation and scale robustness by design. We evaluate our method extensively on different benchmarks and show improvements over the state of the art in terms of image matching on HPatches and 3D reconstruction quality, while staying on par on camera localisation tasks.
Type Of Material Computer model/algorithm 
Year Produced 2020 
Provided To Others? Yes  
Impact Improved performance of image matching. Used by project partners and within the research lab. 
URL https://github.com/axelBarroso/HDD-Net
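The hard-mining triplet term mentioned above can be illustrated with a standard in-batch hardest-negative formulation (HardNet-style); this is a generic sketch, not necessarily the exact detector optimisation term used in HDD-Net.

```python
import torch
import torch.nn.functional as F

def hardest_in_batch_triplet(anchors: torch.Tensor, positives: torch.Tensor,
                             margin: float = 1.0) -> torch.Tensor:
    """anchors, positives: (N, D) L2-normalised descriptors where row i of each
    matrix is a matching pair. For each pair the hardest negative is the closest
    non-matching descriptor anywhere in the batch (HardNet-style mining)."""
    dists = torch.cdist(anchors, positives)              # (N, N) pairwise distances
    pos = dists.diag()                                    # distances of matching pairs
    masked = dists + torch.eye(len(anchors)) * 1e6        # exclude the positives
    hardest_neg = torch.minimum(masked.min(dim=1).values,
                                masked.min(dim=0).values)
    return torch.relu(margin + pos - hardest_neg).mean()

# toy usage with random unit-norm descriptors
a = F.normalize(torch.randn(16, 128), dim=1)
p = F.normalize(torch.randn(16, 128), dim=1)
print(hardest_in_batch_triplet(a, p))
```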
 
Title HPatches 
Description Database for training and evaluation of methods for image matching 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact The largest benchmark in this area. It has been and will continue to be used in a number of publications by others. 
URL https://github.com/hpatches/hpatches-benchmark
 
Title Key.Net: Keypoint Detection by Handcrafted and Learned CNN Filters 
Description It is a novel approach for keypoint detection that combines handcrafted and learned CNN filters within a shallow multi-scale architecture. Handcrafted filters provide anchor structures for learned filters, which localize, score, and rank repeatable features. Scale-space representation is used within the network to extract keypoints at different levels. We design a loss function to detect robust features that exist across a range of scales and to maximize the repeatability score. Our Key.Net model is trained on data synthetically created from ImageNet and evaluated on HPatches and other benchmarks. Results show that our approach outperforms state-of-the-art detectors in terms of repeatability, matching performance, and complexity. 
Type Of Material Computer model/algorithm 
Year Produced 2020 
Provided To Others? Yes  
Impact Used by scientific community in the area of Computer Vision, project partners and within the research group. 
URL https://github.com/axelBarroso/Key.Net
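A minimal sketch of the handcrafted-plus-learned filter idea described above, assuming PyTorch. The specific handcrafted filters (Sobel derivatives and their products), the channel counts, and the omission of the multi-scale pyramid and repeatability loss are simplifications of this sketch, not Key.Net's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HandcraftedPlusLearned(nn.Module):
    """Fixed derivative filters act as anchor structures and are concatenated
    with a small learned convolution (all choices illustrative)."""
    def __init__(self):
        super().__init__()
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        kernels = torch.stack([sobel_x, sobel_x.t()]).unsqueeze(1)   # (2, 1, 3, 3)
        self.register_buffer("kernels", kernels)
        self.learned = nn.Conv2d(1, 6, 3, padding=1)

    def forward(self, x):                        # x: (B, 1, H, W) grayscale
        d = F.conv2d(x, self.kernels, padding=1)                     # dx, dy
        handcrafted = torch.cat([d, d.pow(2), d.prod(dim=1, keepdim=True)], dim=1)
        return torch.cat([handcrafted, torch.relu(self.learned(x))], dim=1)

features = HandcraftedPlusLearned()(torch.rand(1, 1, 128, 128))      # (1, 11, 128, 128)
```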
 
Title NinjaDesc: content-concealing visual descriptors via adversarial learning 
Description In the context of privacy concerns about scene content being revealed from visual descriptors, we develop descriptors that conceal the input image content. In particular, we propose an adversarial learning framework for training visual descriptors that prevent image reconstruction while maintaining matching accuracy. We let a feature encoding network and an image reconstruction network compete with each other, such that the feature encoder tries to impede image reconstruction with its generated descriptors, while the reconstructor tries to recover the input image from the descriptors. The experimental results demonstrate that the visual descriptors obtained with our method significantly deteriorate the image reconstruction quality with minimal impact on correspondence matching and camera localization performance.
Type Of Material Computer model/algorithm 
Year Produced 2022 
Provided To Others? Yes  
Impact Has been developed in collaboration with Meta and used internally within the company. 
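The adversarial scheme described above can be sketched as two alternating updates; the tiny networks, the placeholder matching loss, and the weight lam below are stand-ins for the real descriptor encoder, reconstructor, and objectives.

```python
import torch
import torch.nn as nn

# Tiny stand-ins: the real models are a descriptor CNN and an image decoder.
encoder = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 64, 3, padding=1))
reconstructor = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
                              nn.Conv2d(32, 3, 3, padding=1))
opt_enc = torch.optim.Adam(encoder.parameters(), lr=1e-4)
opt_rec = torch.optim.Adam(reconstructor.parameters(), lr=1e-4)
lam = 0.1                                     # privacy/utility trade-off weight

def matching_loss(desc):
    """Placeholder for the descriptor matching objective (e.g. a triplet loss)."""
    return desc.pow(2).mean()

for img in (torch.rand(2, 3, 64, 64) for _ in range(3)):     # toy batches
    # 1) the reconstructor tries to recover the image from (detached) descriptors
    rec_loss = (reconstructor(encoder(img).detach()) - img).abs().mean()
    opt_rec.zero_grad()
    rec_loss.backward()
    opt_rec.step()

    # 2) the encoder keeps descriptors useful for matching while impeding reconstruction
    desc = encoder(img)
    enc_loss = matching_loss(desc) - lam * (reconstructor(desc) - img).abs().mean()
    opt_enc.zero_grad()
    enc_loss.backward()
    opt_enc.step()
```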
 
Title OoD-Pose: Camera Pose Regression From Out-of-Distribution Synthetic Views 
Description We address the problem of camera pose estimation in outdoor and indoor scenarios. We propose a relative pose regression method that can directly regress the camera pose from images with significantly higher accuracy than existing methods of the same class. We first investigate one of the main factors that limits the accuracy of relative pose regression, and then introduce a new approach that significantly improves the performance. Specifically, we propose a method to overcome the biased training data by a novel training technique. It generates poses, guided by a probability distribution of the training set, which are then used to synthesise new views for training. Lastly, we evaluate our approach on widely used benchmarks and show that it achieves significantly lower error compared to prior regression-based methods and retrieval techniques. 
Type Of Material Computer model/algorithm 
Year Produced 2022 
Provided To Others? Yes  
Impact The International Conference on 3D Vision (3DV) provides a platform for researchers to present their work, exchange ideas, and discuss the latest developments in 3D vision. Presenting a paper at 3DV increases the visibility and recognition of the authors and their research. The conference attracts experts in the field, providing opportunities for networking and learning. 
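A minimal sketch of the distribution-guided pose sampling step described above: fit a simple Gaussian to the training camera positions and sample new ones from it for view synthesis. The Gaussian model, the omission of orientations, and the toy trajectory are assumptions of this sketch, not the method's actual pose model.

```python
import numpy as np

def sample_training_guided_positions(train_positions, n, rng=None):
    """Fit a Gaussian to the training camera positions and draw new positions
    from it, so that synthesised views follow the training distribution.
    Orientations and the view-synthesis step itself are omitted."""
    rng = rng if rng is not None else np.random.default_rng(0)
    mean = train_positions.mean(axis=0)
    cov = np.cov(train_positions, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n)

# toy example: 100 training positions along a rough spiral, sample 5 new ones
t = np.linspace(0, 2 * np.pi, 100)
train_xyz = np.stack([np.cos(t), np.sin(t), 0.05 * t], axis=1)
print(sample_training_guided_positions(train_xyz, n=5))
```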
 
Title SHOP-VRB: A Visual Reasoning Benchmark for Object Perception 
Description SHOP-VRB (Simple Household Object Properties) provides a benchmark for visual reasoning and recovering a structured, semantic representation of a scene. Our dataset builds on the CLEVR benchmark, which contains synthetically generated images and questions related to simple geometrical objects and their composition against a clear background. SHOP-VRB provides scenes with various kitchen objects and appliances, including articulated ones, along with questions associated with those scenes. Each class of objects is provided with a set of short natural language descriptions, expanding the dataset with visual-textual questions. Apart from the classical split into training, validation and test sets, we suggest another split: a benchmark containing unseen objects of known classes as a measure of generalisability.
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact It includes an approach and a benchmark, SHOP-VRB (Simple Household Object Properties), to bridge the gap between the requirements of robotic perception tasks and typical problems in visual reasoning. We provide a large number of procedurally generated scenes that contain household objects and appliances suitable for robotic grasping and manipulation. To improve upon the YCB benchmarking strategy, we include a wide range of object types and test scenarios with unambiguous descriptions of the experimental setups. We emphasise the importance of the choice of experimental setup in order to better simulate real conditions for assessing the generalisation potential of the proposed models, thus providing a suitable benchmark for the task. We focus our attention on methods that yield a fully interpretable scene representation at a human level of abstraction. We consider such methods suitable for real-world applications that include human-robot knowledge exchange.
URL https://michaal94.github.io/SHOP-VRB/
 
Title SOLAR: second-order loss and attention for image retrieval 
Description Recent works in deep learning have shown that second-order information is beneficial in many computer vision tasks. Second-order information can be enforced both in the spatial context and in the abstract feature dimensions. In this work, we explore two second-order components. One is focused on second-order spatial information to increase the performance of image descriptors, both local and global. It is used to re-weight feature maps, and thus emphasise salient image locations that are subsequently used for description. The second component is concerned with a second-order similarity (SOS) loss, which we extend to global descriptors for image retrieval and use to enhance the triplet loss with hard-negative mining. We validate our approach on two different tasks and datasets for image retrieval and image matching. The results show that our two second-order components complement each other, bringing significant performance improvements in both tasks and leading to state-of-the-art results across the public benchmarks.
Type Of Material Computer model/algorithm 
Year Produced 2020 
Provided To Others? Yes  
Impact ECCV is a highly regarded conference, and presenting a paper there can enhance the credibility and visibility of the research. Additionally, the feedback and discussions from the conference led to improvements and new insights, advancing the state-of-the-art in computer vision. 
URL https://github.com/tonyngjichun/SOLAR
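A simplified sketch of a second-order similarity term in the spirit of the SOS loss described above: it penalises differences between the pairwise-distance structure of anchor descriptors and that of their matching positives. The exact SOS formulation and the spatial second-order attention are not reproduced here.

```python
import torch

def second_order_similarity_loss(anchors: torch.Tensor, positives: torch.Tensor) -> torch.Tensor:
    """Simplified second-order similarity (SOS) term: the pairwise-distance
    structure among anchor descriptors should match the structure among their
    matching positives. anchors, positives: (N, D) descriptors, row-aligned."""
    da = torch.cdist(anchors, anchors)      # (N, N) distances among anchors
    dp = torch.cdist(positives, positives)  # (N, N) distances among positives
    return (da - dp).pow(2).mean()

print(second_order_similarity_loss(torch.rand(8, 128), torch.rand(8, 128)))
```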
 
Title ScaleNet: A Shallow Architecture for Scale Estimation 
Description We address the problem of estimating scale factors between images. We formulate scale estimation as the prediction of a probability distribution over scale factors. We design a new architecture, ScaleNet, that exploits dilated convolutions as well as self- and cross-correlation layers to predict the scale between images. It can be used to estimate the scale factor between two input images.
Type Of Material Computer model/algorithm 
Year Produced 2022 
Provided To Others? Yes  
Impact Rectifying images with estimated scales leads to significant performance improvements for various tasks and methods. Specifically, ScaleNet can be combined with sparse local features and dense correspondence networks to improve camera pose estimation, 3D reconstruction, or dense geometric matching in different benchmarks and datasets. We provide an extensive evaluation on several tasks, and analyze the computational overhead of ScaleNet. The code, evaluation protocols, and trained models are publicly available and have been used by project partners. 
URL https://github.com/axelBarroso/ScaleNet
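Since ScaleNet predicts a distribution over scale factors, a single scale estimate can be read out as an expectation over discrete bins. The bin range and the uniform log-spacing below are assumptions of this sketch rather than ScaleNet's actual output head.

```python
import math
import torch

def expected_scale(logits: torch.Tensor, min_scale: float = 0.25,
                   max_scale: float = 4.0) -> torch.Tensor:
    """Turn per-bin logits over discrete scale factors into a single estimate.
    Bins uniformly spaced in log-scale over an assumed range; the expectation
    is taken in log space and exponentiated back to a scale factor."""
    n_bins = logits.shape[-1]
    log_bins = torch.linspace(math.log(min_scale), math.log(max_scale), n_bins)
    probs = torch.softmax(logits, dim=-1)
    return torch.exp((probs * log_bins).sum(dim=-1))

print(expected_scale(torch.randn(2, 13)))   # two image pairs, 13 scale bins
```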
 
Title WaTur: Estimating water turbidity from a smartphone camera 
Description Water quality monitoring is indispensable for safeguarding human health. One aspect of water quality is turbidity, the measurement of which typically involves on-site water sampling and laboratory analysis, which may be both costly and labour-intensive in the context of developing countries. Alternative portable devices have been developed, but they are often inconvenient and require technical expertise. In recent years, smartphone-based solutions have been developed with the aim of bringing turbidimeters to the wider population. However, they rely on additional equipment to create enclosed environments for the sample and the camera to remove ambient light. Turbidimeters therefore generally require either technical expertise or additional equipment, which has limited their usage, especially in developing countries, where they are most needed. In this work we introduce a new benchmark with a new computer vision task that aims at estimating the blur of a pattern observed through a liquid. We propose and evaluate an approach for measuring water turbidity from a picture taken by a smartphone camera without any additional equipment. We design a simple protocol for taking a picture of a water sample that allows its turbidity to be estimated, collect a dataset, and design a benchmark for measuring the performance of computer vision methods on this task. Our model is able to accurately determine turbidity in the range of 0-40 NTU.
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
Impact This software has been used by Civil Engineering from Imperial College London, in a research project, to measure tap water quality in Thailand. 
URL https://github.com/lml418/WaTur-Water-Turbidity-Dataset
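The task above estimates the blur of a pattern seen through the liquid. As a generic, hypothetical stand-in for such a blur measure (the WaTur model is learned from the collected dataset and is not reproduced here), the sketch below uses the variance of the Laplacian response; mapping it to NTU would require the benchmark's calibration data.

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Generic sharpness proxy: variance of the Laplacian response (valid mode).
    Lower values indicate a blurrier pattern."""
    k = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
    h, w = gray.shape
    resp = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            resp += k[i, j] * gray[i:i + h - 2, j:j + w - 2]
    return float(resp.var())

# toy comparison: a sharp checkerboard vs. a crudely blurred copy
sharp = np.indices((64, 64)).sum(axis=0) % 2 * 1.0
blurred = (sharp + np.roll(sharp, 1, axis=0) + np.roll(sharp, 1, axis=1)) / 3.0
print(laplacian_variance(sharp), laplacian_variance(blurred))
```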
 
Description Grasp-Oriented Cloth Segmentation 
Organisation Institute of Robotics and Industrial Informatics
Country Spain 
Sector Public 
PI Contribution We address the problem of fine-grained region detection in deformed clothes using only a depth image. We implement an approach for T-shirts and define up to six semantic regions of varying extent, including edges on the neckline, sleeve cuffs, and hem, plus top and bottom grasping points. We introduce a U-Net-based network to segment and label these parts.
Collaborator Contribution Our contribution is concerned with the level of supervision required to train the proposed network. While most approaches learn to detect grasping points by combining real and synthetic annotations, in this work we propose a multilayered Domain Adaptation strategy that does not use any real annotations. We thoroughly evaluate our approach on real depth images of a T-shirt annotated with fine-grained labels, and show that training our network only with synthetic labels and our proposed DA approach yields results competitive with real data supervision.
Impact - Realistic synthetic data and a mid-size real dataset of deformed T-shirts annotated with edge labels and grasping points; this dataset can be used either for fine-tuning synthetically trained networks or for evaluation, and will be made publicly available together with the proposed model. - Implementation and evaluation of an approach to fine-grained region detection in deformed clothes using only a depth image. Automatically detecting graspable regions from a single depth image is a key ingredient in cloth manipulation. The large variability of cloth deformations has motivated most current approaches to focus on identifying specific grasping points rather than semantic parts, as the appearance and depth variations of local regions are smaller and easier to model than larger ones. However, tasks like cloth folding or assisted dressing require recognising larger segments, such as semantic edges, which carry more information than points.
Start Year 2021
 
Description Language, vision and action in Robotics 
Organisation Czech Technical University in Prague
Department Czech Institute of Informatics Robotics, and Cybernetics
Country Czech Republic 
Sector Academic/University 
PI Contribution We developed an integrated system that includes reasoning from visual and natural-language inputs, action and motion planning, execution of tasks by a robotic arm, and manipulation of objects to discover their properties. With the proposed system we address a number of important design questions for implementing an embodied agent. In particular, we integrate components that are trained from existing datasets or generated by simulators, which allows the system to generalise to new environments. We propose a system for exploring object properties that require manipulation. We make use of synthetic simulated data to train various components of the agent. We propose an evaluation framework, report success rates for critical components, and identify weaknesses of the embodied agent. We identify the factors that are robot/setup dependent and cannot be "blindly executed".
Collaborator Contribution We propose suitable representations for objects, scenes, actions, and perceptions to foster successful executions and scalable knowledge acquisition. We select the necessary components and interfaces that facilitate the interplay between high-level action reasoning and low-level control schemes such as blind execution, visual servoing, or multi-modal control of action execution with sanity checks. We also address the robot/embodiment requirements that are hardware-specific (e.g., workspace, DoF, gripper collisions, etc.).
Impact Joint publication: "Embodied Reasoning for Discovering Object Properties via Manipulation", Jan Kristof Behrens, Michal Nazarczuk, Karla Stepanova, Matej Hoffmann, Yiannis Demiris, and Krystian Mikolajczyk, International Conference on Robotics and Automation, 2021.
Start Year 2020
 
Title DESC: Domain Adaptation for Depth Estimation via Semantic Consistency 
Description We implement a domain adaptation approach to train a monocular depth estimation model using a fully-annotated source dataset and a non-annotated target dataset. We bridge the domain gap by leveraging semantic predictions and low-level edge features to provide guidance for the target domain. We enforce consistency between the main model and a second model trained with semantic segmentation and edge maps, and introduce priors in the form of instance heights. 
Type Of Technology Software 
Year Produced 2022 
Open Source License? Yes  
Impact This software has been used within the research lab and by project partners to improve monocular depth estimation for applications in robotics, 3D reconstruction, navigation, augmented reality, etc. 
URL https://link.springer.com/article/10.1007/s11263-022-01718-1
 
Title HDD-Net: Hybrid Detector Descriptor with Mutual Interactive Learning 
Description The success of SLAM, 3D reconstruction, or AR applications relies on the performance of the feature detector and descriptor. While most methods model the detector-descriptor interaction by unifying detections and descriptors in a single network, we propose a method that treats both extractions independently and focuses on their interaction in the learning process rather than on parameter sharing. We formulate the classical hard-mining triplet loss as a new detector optimisation term to refine candidate positions based on the descriptor map. We propose a dense descriptor that uses a multi-scale approach and a hybrid combination of hand-crafted and learned features to obtain rotation and scale robustness by design. We evaluate our method extensively on different benchmarks and show improvements over the state of the art in terms of image matching on HPatches and 3D reconstruction quality, while staying on par on camera localisation tasks.
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact Used by project partners and within the research lab. 
URL https://arxiv.org/abs/2005.05777
 
Title HyNet: Learning Local Descriptor with Hybrid Similarity Measure and Triplet Loss 
Description We propose HyNet, a new local descriptor that leads to state-of-the-art results in matching. HyNet introduces a hybrid similarity measure for triplet margin loss, a regularisation term constraining the descriptor norm, and a new network architecture that performs L2 normalisation of all intermediate feature maps and the output descriptors. HyNet surpasses previous methods by a significant margin on standard benchmarks that include patch matching, verification, and retrieval, as well as outperforming full end-to-end methods on 3D reconstruction tasks. 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact This code was released with the scientific publication and it has an impact on the research community. It allows other researchers to verify the study's results and methods, enhancing transparency, reproducibility, and credibility. It enables further experimentation, expansion, and application of the research, facilitating scientific progress and collaboration. Moreover, it encourages the adoption of new techniques and technologies and fosters open science practices, promoting scientific integrity and trust. Ultimately, an implementation with code release can lead to improved scientific understanding, better decision-making, and more efficient problem-solving. 
URL https://proceedings.neurips.cc/paper/2020/file/52d2752b150f9c35ccb6869cbf074e48-Paper.pdf
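One of the stated ingredients above, L2 normalisation of all intermediate feature maps, can be sketched as a small PyTorch block; the layer sizes and the BatchNorm/ReLU choices are illustrative and do not reproduce HyNet's actual architecture or its hybrid similarity measure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class L2NormConvBlock(nn.Module):
    """Conv block whose output feature map is L2-normalised across channels,
    in the spirit of HyNet's normalisation of intermediate maps and the final
    descriptor (layer choices here are illustrative)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        x = F.relu(self.bn(self.conv(x)))
        return F.normalize(x, p=2, dim=1)      # unit L2 norm per spatial location

patch = torch.rand(4, 1, 32, 32)               # a batch of grayscale patches
feat = L2NormConvBlock(1, 32)(patch)
print(feat.norm(dim=1).mean())                 # ~1.0
```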
 
Title Key.Net: Keypoint Detection by Handcrafted and Learned CNN Filters 
Description This software can be used to extract Key.Net features for a given list of images. It is a novel approach for keypoint detection that combines handcrafted and learned CNN filters within a shallow multi-scale architecture. Handcrafted filters provide anchor structures for learned filters, which localize, score, and rank repeatable features. Scale-space representation is used within the network to extract keypoints at different levels. We design a loss function to detect robust features that exist across a range of scales and to maximize the repeatability score. Our Key.Net model is trained on data synthetically created from ImageNet and evaluated on HPatches and other benchmarks. Results show that our approach outperforms state-of-the-art detectors in terms of repeatability, matching performance, and complexity. 
Type Of Technology Software 
Year Produced 2022 
Open Source License? Yes  
Impact Used by scientific community, project partners and within the research group. 
URL https://arxiv.org/abs/1904.00889
 
Title NinjaDesc: content-concealing visual descriptors via adversarial learning 
Description In the context of privacy concerns about scene content being revealed from visual descriptors, we develop descriptors that conceal the input image content. In particular, we propose an adversarial learning framework for training visual descriptors that prevent image reconstruction while maintaining matching accuracy. We let a feature encoding network and an image reconstruction network compete with each other, such that the feature encoder tries to impede image reconstruction with its generated descriptors, while the reconstructor tries to recover the input image from the descriptors. The experimental results demonstrate that the visual descriptors obtained with our method significantly deteriorate the image reconstruction quality with minimal impact on correspondence matching and camera localization performance.
Type Of Technology Software 
Year Produced 2022 
Impact Has been developed in collaboration with Meta and used internally within the company. 
 
Title SOLAR: second-order loss and attention for image retrieval 
Description Recent works in deep learning have shown that second-order information is beneficial in many computer vision tasks. Second-order information can be enforced both in the spatial context and in the abstract feature dimensions. In this work, we explore two second-order components. One is focused on second-order spatial information to increase the performance of image descriptors, both local and global. It is used to re-weight feature maps, and thus emphasise salient image locations that are subsequently used for description. The second component is concerned with a second-order similarity (SOS) loss, which we extend to global descriptors for image retrieval and use to enhance the triplet loss with hard-negative mining. We validate our approach on two different tasks and datasets for image retrieval and image matching. The results show that our two second-order components complement each other, bringing significant performance improvements in both tasks and leading to state-of-the-art results across the public benchmarks.
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact This code was released with the scientific publication and it has an impact on the research community. It allows other researchers to verify the study's results and methods, enhancing transparency, reproducibility, and credibility. It enables further experimentation, expansion, and application of the research, facilitating scientific progress and collaboration. Moreover, it encourages the adoption of new techniques and technologies and fosters open science practices, promoting scientific integrity and trust. Ultimately, an implementation with code release can lead to improved scientific understanding, better decision-making, and more efficient problem-solving. 
URL https://arxiv.org/abs/2001.08972
 
Title ScaleNet: A Shallow Architecture for Scale Estimation 
Description We address the problem of estimating scale factors between images. We formulate scale estimation as the prediction of a probability distribution over scale factors. We design a new architecture, ScaleNet, that exploits dilated convolutions as well as self- and cross-correlation layers to predict the scale between images. It can be used to estimate the scale factor between two input images.
Type Of Technology Software 
Year Produced 2022 
Open Source License? Yes  
Impact Rectifying images with estimated scales leads to significant performance improvements for various tasks and methods. Specifically, ScaleNet can be combined with sparse local features and dense correspondence networks to improve camera pose estimation, 3D reconstruction, or dense geometric matching in different benchmarks and datasets. We provide an extensive evaluation on several tasks, and analyze the computational overhead of ScaleNet. The code, evaluation protocols, and trained models are publicly available and have been used by project partners. 
URL https://arxiv.org/abs/2112.04846
 
Description CVPRW2020: Image Matching: Local Features & Beyond 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact Invited talk on recent advances in local feature extraction and image matching in the context of 3D reconstruction.
An overview of milestones in the past 50 years of progress in the field of local features and correspondence search.
Year(s) Of Engagement Activity 2020
URL https://image-matching-workshop.github.io/
 
Description CVPRW2020: Long-Term Visual Localization 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Joint Workshop on Long-Term Visual Localization, Visual Odometry and Geometric and Learning-based SLAM, 2020
Visual Localization is the problem of estimating the position and orientation, i.e., the camera pose, from which an image was taken. Long-Term Visual Localization is the problem of robustly handling changes in the scene. Simultaneous Localization and Mapping (SLAM) is the problem of tracking the motion of a camera (or sensor system) while simultaneously building a (3D) map of the scene. Similarly, Visual Odometry (VO) algorithms track the motion of a sensor system, without necessarily creating a map of the scene. Localization, SLAM, and VO are highly related problems, e.g., SLAM algorithms can be used to construct maps that are later used by Localization techniques, Localization approaches can be used to detect loop closures in SLAM and SLAM / VO can be used to integrate frame-to-frame tracking into real-time Localization approaches.
Year(s) Of Engagement Activity 2020
URL https://sites.google.com/view/vislocslamcvpr2020/home
 
Description ECCVW2020: Long-Term Visual Localization under Changing Conditions 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Long-Term Visual Localization under Changing Conditions
Visual localization is the problem of estimating the position and orientation from which an image was taken. It is a vital component in many Computer Vision and Robotics scenarios, including autonomous vehicles, Augmented / Mixed / Virtual Reality, Structure-from-Motion, and SLAM. Due to its central role, visual localization is currently receiving significant interest from both academia and industry. Of special practical importance are long-term localization algorithms that generalize to unseen scene conditions, including illumination changes and the effects of seasons on scene appearance and geometry. The purpose of this workshop is to benchmark the current state of visual localization under changing conditions and to encourage new work on this challenging problem. The workshop consists of both presentations by experts in the field (from academia and industry) and challenges designed to highlight the currently unsolved problems.
Year(s) Of Engagement Activity 2020
URL https://sites.google.com/view/ltvl2020/home
 
Description ICPRAM 2022 keynote: Correspondence search for building 3D models 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Study participants or study members
Results and Impact Keynote at ICPRAM 2022, the 11th International Conference on Pattern Recognition Applications and Methods. Dissemination of achievements in this project to the scientific community. New contacts were established with potential for collaboration in the area of this project.
Year(s) Of Engagement Activity 2022
URL https://www.insticc.org/node/technicalprogram/icpram/2022
 
Description ORMR Workshop 2021 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact The purpose of this workshop was to present and discuss the challenges and the progress in the field of robotic perception and manipulation, building solid scientific foundations of experimental reproducibility through transparent sharing of data and methods.
It challenged researchers to work on collaborative projects that simultaneously address the three pillars of recognition, manipulation, and reproducibility within this domain.
The workshop led to the creation of an online platform that groups resources and publications from several European projects in the closely related areas of Object Recognition and Manipulation in Robotics (ORMR).
https://sites.google.com/view/ormr-icvs2021/home
Year(s) Of Engagement Activity 2021
URL https://sites.google.com/view/ormr-icvs2021/ormr-workshop
 
Description Perception and Modelling for Manipulation of Objects (PaMMO) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact PaMMO was a half-day workshop that took place during ICPR 2021 (International Conference on Pattern Recognition).

Object recognition, modelling, and pose estimation for robot manipulation have achieved great advances at recent vision and robotics conferences (e.g., the 6D pose challenges at vision conferences). However, vision still struggles with recognition under partial occlusions (e.g., in clutter or when an object is held by a human), and robots are still unable to handle everyday objects under realistic conditions in home, care, and service settings. Despite significant recent advances in data-driven grasping and manipulation, there are still considerable challenges to be addressed at the intersection of perception, robotics hardware, learning, and grasp prediction. For example, learning to approach and grasp objects does not generalise beyond constrained objects and settings, object grasps are limited to specific robot embodiments, and perception does not generalise to cluttered scenes and objects touching or partially occluding each other. Furthermore, evaluation and benchmarking (e.g., with the YCB object set and the YCB and Amazon challenges) are yet to become a standard for publishing advances in this domain.

PaMMO focused on recent advances, open problems, and next steps to be taken in recognition and modelling for the handover and manipulation of everyday objects. The Workshop included invited talks by prominent researchers from different disciplines (e.g., vision and robotics for object manipulation), an open discussion, interactive videos and demos accompanied by posters.
Year(s) Of Engagement Activity 2021
URL https://sites.google.com/view/pammo-icpr2020