CORSMAL: Collaborative object recognition, shared manipulation and learning

Lead Research Organisation: Queen Mary University of London
Department Name: Sch of Electronic Eng & Computer Science

Abstract

CORSMAL proposes to develop and validate a new framework for collaborative recognition and manipulation of objects via cooperation with humans. The project will explore the fusion of multiple sensing modalities (touch, sound and first/third person vision) to accurately and robustly estimate the physical properties of objects in noisy and potentially ambiguous environments. The framework will mimic human capability of learning and adapting across a set of different manipulators, tasks, sensing configurations and environments. In particular, we will address the problems of (1) learning shared autonomy models via observations of and interactions with humans and (2) generalising capabilities across tasks and sites by aggregating data and abstracting models to enable accurate object recognition and manipulation of unknown objects in unknown environments. The focus of CORSMAL is to define learning architectures for multimodal sensory data as well as for aggregated data from different environments. A key aim of the project is to identify the most suitable framework resulting from learning across environments and the optimal trade-off between the use of specialised local models and generalised global models. The goal here is to continually improve the adaptability and robustness of the models. The robustness of the proposed framework will be evaluated with prototype implementations in different environments. Importantly, during the project we will organise two community challenges to favour data sharing and support experiment reproducibility in additional sites.

Planned Impact

n/a
 
Description A method for the contactless estimation (through vision and sound signals) of the physical properties of objects manipulated by humans; these estimates inform the control of robots performing accurate and safe grasps of objects handed over by humans.

A real-to-simulation framework that integrates sensing data with a robotic arm simulator to complete the handover task, and that estimates the pose of the hand holding the container to help prevent an unsafe grasp [19]. The framework facilitates the development of algorithms for object property estimation and robot planning, to test methods for safe handovers and to enable progress when access to a robot is unavailable. The simulator was developed to mitigate the limited access to laboratories caused by COVID-19 lockdowns and restrictions, and to facilitate the take-up of the CORSMAL dataset by a wider community. We demonstrated and validated the framework on the CORSMAL Containers Manipulation dataset using the CORSMAL vision-based baseline to estimate - online and without access to object models or motion-capture data - the shape and trajectory of a container.

A new method for testing the robustness of machine-learning classifiers through adversarial attacks. The proposed method can generate perturbations for images of any size, and outperforms five state-of-the-art attacks on two different tasks (scene and object classification) and three state-of-the-art deep neural networks.

A new method for localising container-like objects in 3D and estimating their dimensions using two wide-baseline RGB cameras.
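
For illustration, the following is a minimal sketch of the general two-view triangulation idea behind localising a detected container in 3D from two calibrated cameras. It is not the project's method; the use of OpenCV, the function names and the example pixel coordinates are assumptions.

```python
# Minimal sketch: triangulating the 3D centre of a detected container from
# two calibrated, wide-baseline RGB cameras (illustrative only).
import cv2
import numpy as np

def projection_matrix(K, R, t):
    """Build the 3x4 projection matrix P = K [R | t]."""
    return K @ np.hstack([R, t.reshape(3, 1)])

def triangulate_point(P1, P2, uv1, uv2):
    """Triangulate one 2D-2D correspondence (pixel coordinates) into a 3D point."""
    pts1 = np.asarray(uv1, dtype=float).reshape(2, 1)
    pts2 = np.asarray(uv2, dtype=float).reshape(2, 1)
    X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)  # 4x1 homogeneous coordinates
    return (X_h[:3] / X_h[3]).ravel()                # Euclidean 3D point

# Hypothetical usage with calibration obtained from a calibration board:
# P1 = projection_matrix(K1, R1, t1)
# P2 = projection_matrix(K2, R2, t2)
# centre_3d = triangulate_point(P1, P2, (640, 360), (512, 300))
```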

A new method for training a filling level classifier using transfer learning and adversarial training.
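
As a hedged illustration of the general recipe named above (transfer learning plus adversarial training), the sketch below fine-tunes an ImageNet-pretrained ResNet-18 with single-step FGSM adversarial examples. It is not the project's exact training strategy; the `train_loader`, the 3-class head, the perturbation budget and the recent-torchvision weights API are assumptions.

```python
# Minimal sketch: transfer learning + FGSM adversarial training for
# filling level classification (illustrative only).
import torch
import torch.nn as nn
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

# Transfer learning: start from an ImageNet-pretrained ResNet-18 and
# replace the classifier head with 3 filling-level classes (empty/50%/90%, assumed).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 3)
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
epsilon = 4 / 255  # FGSM perturbation budget (assumes inputs in [0, 1])

def fgsm(images, labels):
    """Generate one-step FGSM adversarial examples for the current model."""
    images = images.clone().detach().requires_grad_(True)
    loss = criterion(model(images), labels)
    grad = torch.autograd.grad(loss, images)[0]
    return (images + epsilon * grad.sign()).clamp(0, 1).detach()

def train_one_epoch(train_loader):
    """Train on both clean and adversarial images (hypothetical DataLoader)."""
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        adv_images = fgsm(images, labels)
        optimizer.zero_grad()
        loss = criterion(model(images), labels) + criterion(model(adv_images), labels)
        loss.backward()
        optimizer.step()
```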
Exploitation Route Through a benchmark that we designed and the open-source code that we distribute: https://corsmal.eecs.qmul.ac.uk/benchmark.html

The data we have produced have already been used by research laboratories across the world (see the list of participants at https://corsmal.eecs.qmul.ac.uk/challenge.html)
Sectors Digital/Communication/Information Technologies (including Software)

URL https://corsmal.eecs.qmul.ac.uk/publications.html
 
Description The organisation of the CORSMAL Challenge at IEEE ICASSP 2022, IEEE ICME 2020, the Intelligent Sensing Summer School 2020, and ICPR 2020, which had 30 participants. The leaderboard for the CORSMAL Challenge has accumulated 12 entries, of which 6 are results from teams and 6 are baselines. IET QMUL Children's Christmas Lecture 2019 - London, UK (11 December 2019): presentation of the tasks and objectives of CORSMAL to an audience of children, teachers and parents. The publication, by teams outside the CORSMAL project, of six papers and one technical report on the methods developed for the CORSMAL Challenge.
First Year Of Impact 2000
Sector Digital/Communication/Information Technologies (including Software)
Impact Types Cultural

 
Title Audio-based Containers Manipulation Setup 2 (ACM-S2) 
Description Audio-based Containers Manipulation Setup 2 (ACM-S2) is a dataset for the validation of audio-based models for the tasks of filling type and filling level classification. The dataset contains 21 recordings acquired in a different setup from the CORSMAL Containers Manipulation dataset, with a different microphone, room and containers, providing a new scenario with different acoustics. The recordings are in .wav format and correspond to 19 pouring and 2 shaking actions performed manually by a human. The microphone is a Blue Yeti Studio microphone placed on a table, facing the container at 12 cm for the pouring action and 20 cm for the shaking action. The containers are a medium glass (300 ml), a tall glass (450 ml), a small plastic cup (200 ml), and a muesli box (300 ml). The filling types (materials) are pasta penne (Gallo plumas nº6), white rice (Arroz SOS), and tap water, and the filling levels (%) are 0 (empty), 50, and 90. The dataset was used in the paper 'Audio classification of the content of food containers and drinking glasses'. A minimal loading sketch is given after this entry. 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL https://zenodo.org/record/4770438
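
A minimal sketch of loading one ACM-S2 recording and computing a log-spectrogram as a possible input feature for filling type and level classification. The file name and the use of scipy are assumptions and not part of the dataset's tooling.

```python
# Minimal sketch: load one .wav recording and compute a log-spectrogram
# (illustrative only; file name is a placeholder).
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, audio = wavfile.read("acm_s2_pouring_rice_glass_50.wav")  # hypothetical file
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # mix down to mono if stereo

# Short-time spectrogram in dB, a common input representation for
# audio classifiers of filling type and filling level.
freqs, times, spec = spectrogram(audio.astype(np.float32), fs=rate,
                                 nperseg=1024, noverlap=512)
log_spec = 10.0 * np.log10(spec + 1e-10)
print(log_spec.shape)  # (frequency bins, time frames)
```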
 
Title CHOC: The CORSMAL Hand-Occluded Containers dataset 
Description CORSMAL Hand-Occluded Containers (CHOC) is an image-based dataset for category-level 6D object pose and size estimation, affordance segmentation, object detection, object and arm segmentation, and hand+object reconstruction. The dataset has 138,240 pseudo-realistic composite RGB-D images of hand-held containers on top of 30 real backgrounds (mixed-reality set) and 3,951 RGB-D images selected from the CORSMAL Containers Manipulation (CCM) dataset (real set). CHOC-AFF is the subset that focuses on the problem of visual affordance segmentation and consists of the RGB images, the object and arm segmentation masks, and the affordance segmentation masks.

The images of the mixed-reality set are automatically rendered using Blender and are split into 129,600 images of hand-held containers and 8,640 images of objects without a hand. Only one synthetic container is rendered in each image. Images are evenly split among 48 unique synthetic objects from three categories, namely 16 boxes, 16 drinking containers without stem (nonstems) and 16 drinking containers with stems (stems), selected from ShapeNetSem. For each object, 6 realistic grasps were manually annotated using GraspIt!: bottom grasp, natural grasp, and top grasp, for the left and right hand. The mixed-reality set provides RGB images, depth images, segmentation masks (hand and object), normalised object coordinate images (object only), object meshes, annotated 6D object poses (orientation and translation in 3D with respect to the camera view), and grasp meshes with their MANO parameters. Each image has a resolution of 640x480 pixels. Background images were acquired using an Intel RealSense D435i depth camera and include 15 indoor and 15 outdoor scenes. All information necessary to re-render the dataset is provided, namely backgrounds, camera intrinsic parameters, lighting, object models, and hand + forearm meshes and poses; users can complement the existing data with additional annotations. Note: the mixed-reality set was built on top of previous works for the generation of synthetic and mixed-reality datasets, such as OBMan and NOCS-CAMERA.

The images of the real set are selected from 180 representative sequences of the CCM dataset. Each image contains a person holding one of the 15 containers during a manipulation occurring in the video prior to a handover (e.g., picking up an empty container, shaking an empty or filled food box, or pouring content into a cup or drinking glass). For each object instance, sequences were chosen under four randomly sampled conditions, including background and lighting conditions, scenarios (person sitting, with the object on the table; person sitting and already holding the object; person standing while holding the container and then walking towards the table), and filling amount and type. The same sequence is selected from the three fixed camera views (two side views and one frontal view) of the CCM setup (60 sequences for each view). Fifteen sequences exhibit the case of the empty container for all fifteen objects, whereas the other sequences have the person filling the container with either pasta, rice or water at 50% or 90% of the full container capacity. The real set has RGB images, depth images and 6D pose annotations. For each sequence, the 6D poses of the containers are manually annotated every 10 frames if the container is visible in at least two views, resulting in a total of 3,951 annotations. Annotations of the 6D poses for the intermediate frames are also provided using interpolation. A minimal sketch of back-projecting the depth images with the provided camera intrinsics is given after this entry.

Contacts: For enquiries, questions, or comments, please contact Alessio Xompero. For enquiries, questions, or comments about CHOC-AFF, please contact Tommaso Apicella.

References: If you work on visual affordance segmentation and you use the CHOC-AFF subset, please see the related work on ACANet and also cite: Affordance segmentation of hand-occluded containers from exocentric images, T. Apicella, A. Xompero, E. Ragusa, R. Berta, A. Cavallaro, P. Gastaldo, IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2023.

Additional resources: webpage of 6D pose estimation using CHOC; toolkit to parse and inspect the dataset, or generate new data.

Release notes: 2023/09/10 - Added object affordance segmentation masks. 2023/02/08 - Fixed NOCS maps due to a missing rotation during the generation; fixed annotations to include the missing rotation. 2023/01/09 - Fixed RGB_070001_80000 (wrong files previously). 2022/12/14 - Added a mapping dictionary from grasp IDs to their corresponding MANO parameter IDs to grasp.zip; added object meshes with the NOCS textures/material in object_models.zip; fixed the folder name in annotations.zip; updated the README file to include these changes and fix a typo in the code block to unzip files. 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
URL https://zenodo.org/record/5085800
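
A minimal sketch of back-projecting one CHOC depth image into a point cloud with pinhole intrinsics, as mentioned in the entry above. The file path, depth scale and intrinsic values are placeholders, since the dataset distributes the actual camera intrinsic parameters.

```python
# Minimal sketch: back-project a depth image into a 3D point cloud
# (illustrative only; paths, depth scale and intrinsics are placeholders).
import numpy as np
import cv2

depth = cv2.imread("mixed_reality/depth/000001.png", cv2.IMREAD_UNCHANGED)
depth_m = depth.astype(np.float32) / 1000.0   # assumes depth stored in millimetres

fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0   # placeholder pinhole intrinsics

h, w = depth_m.shape
u, v = np.meshgrid(np.arange(w), np.arange(h))
z = depth_m
x = (u - cx) * z / fx
y = (v - cy) * z / fy

points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
points = points[points[:, 2] > 0]             # discard pixels with no depth
print(points.shape)                           # (N, 3) point cloud in the camera frame
```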
 
Title CORSMAL Containers 
Description The dataset was acquired with two Intel RealSense D435i cameras, located approximately 40 cm from the object placed on top of a table. The cameras are calibrated and localised with respect to a calibration board. The resulting images (1280x720 pixels) are RGB, depth and stereo infrared, with the RGB and depth images being spatially aligned. Data acquisition was performed in two separate rooms with different lighting conditions, and different backgrounds were obtained using two tablecloths in addition to the bare table-top. The first room has natural light and a table-top of 160x80 cm at a height of 82 cm from the ground; the second room has no windows, with illumination provided by either ceiling lights or additional portable lights, and a table of 60x60 cm at a height of 82 cm from the ground. We collected in total 207 configurations, as the combination of objects (23), backgrounds (3) and lighting conditions (3), resulting in 414 RGB images, 414 depth images and 828 IR images. We manually annotated the maximum width and height of each object with a digital caliper (0-150 mm ± 0.01 mm) and a measuring tape (0-10 m ± 1 mm). 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact The dataset was used in 2 publications. 
URL http://corsmal.eecs.qmul.ac.uk/containers.html
 
Title CORSMAL Containers Manipulation 
Description The dataset consists of multiple recordings of containers: drinking cups, drinking glasses and food boxes. These containers are made of different materials, such as plastic, glass and paper. Each container can be empty or filled with water, rice or pasta at two different levels of fullness. The combinations of containers and fillings are acquired for three scenarios with an increasing level of difficulty, caused by occlusions or subject motion. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact Used in the 2020 CORSMAL Challenge. It has already appeared in 3 external publications. 
URL http://corsmal.eecs.qmul.ac.uk/containers_manip.html
 
Title Crop - CORSMAL Containers Manipulation (C-CCM) 
Description Crop - CORSMAL Containers Manipulation (C-CCM) is a dataset for filling level classification from a single RGB image. C-CCM consists of 10,216 images automatically sampled, and then manually verified, from public video recordings of the CORSMAL Containers Manipulation dataset. The RGB images are extracted from recordings of three fixed views and capture cups (4) and drinking glasses (4) as containers. The selected containers are a red cup, small white cup, small transparent cup, green glass, wine glass, champagne flute, beer cup, and cocktail glass. Frames were selected so that the object is either completely visible or occluded by the person's hand, under different backgrounds, and only after the pouring process has finished (all frames where a person is still pouring the content were excluded). The containers can be transparent, translucent or opaque, and can be empty or filled by a person (pouring) up to 50% or 90% of the capacity of the container with transparent (water) or opaque (pasta, rice) content. C-CCM distributes the selected RGB images, binary masks of the region with the container estimated using Mask R-CNN, and annotations of filling type and level, hand occlusion, transparency of the container, and a rectangular bounding box indicating the top-left and bottom-right corners for each image. The final images can be extracted by cropping only the region with the container using the annotated bounding boxes; C-CCM provides a Python script to extract the image crops from the original images (a minimal cropping sketch is also given after this entry). 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL https://zenodo.org/record/4642576
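
A minimal sketch of cropping the container region using the annotated top-left and bottom-right corners, as referenced in the entry above. This is not the Python script distributed with C-CCM; the file names and corner values are assumptions.

```python
# Minimal sketch: crop a C-CCM image to the annotated container region
# (illustrative only; paths and coordinates are placeholders).
from PIL import Image

def crop_container(image_path, box):
    """Crop the container region; box = (x_min, y_min, x_max, y_max) in pixels."""
    return Image.open(image_path).crop(box)

# Hypothetical image path and annotated corners.
crop = crop_container("images/000123.png", (150, 80, 420, 400))
crop.save("000123_crop.png")
```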
 
Title Towards safe human-to-robot handovers of unknown containers: pre-trained models and 3D hand keypoints annotations 
Description This repository contains additional data to be used with the implementation of the real-to-simulation framework of the paper 'Towards safe human-to-robot handovers of unknown containers'. The data include pre-trained models and annotations of the 3D hand poses for selected recordings from the public training and testing sets of the CORSMAL Containers Manipulation (CCM) dataset. The pre-trained models are used for classifying the filling type and filling level of a container. 3D hand poses are annotated as 21 keypoints following the OpenPose format (a minimal parsing sketch is given after this entry). 
Type Of Material Database/Collection of data 
Year Produced 2021 
Provided To Others? Yes  
URL https://zenodo.org/record/5525332
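
A minimal parsing sketch for the 21-keypoint OpenPose-format hand annotations mentioned in the entry above. The JSON layout, field name and file name are assumptions, not the repository's actual format.

```python
# Minimal sketch: read 21 OpenPose-format 3D hand keypoints and compute a
# coarse hand position (illustrative only; the file layout is hypothetical).
import json
import numpy as np

with open("hand_keypoints_000001.json") as f:  # hypothetical annotation file
    data = json.load(f)

keypoints = np.asarray(data["keypoints_3d"], dtype=float).reshape(21, 3)

wrist = keypoints[0]               # joint 0 is the wrist in the OpenPose hand layout
centroid = keypoints.mean(axis=0)  # rough hand position, e.g. for handover planning
print(wrist, centroid)
```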
 
Title Trained models for filling level classification 
Description The networks are pre-trained on the 3 splits (S1, S2, S3) of the C-CCM dataset, using six different training strategies, and are implemented in PyTorch. More information regarding the C-CCM dataset can be found here: https://corsmal.eecs.qmul.ac.uk/filling.html. The CCM_Filling_Level_Pretrained_Models.zip file contains 3 folders (S1, S2, S3) that correspond to the different dataset splits. Each of the S1, S2, S3 folders contains 6 subfolders (ST, AT, ST-FT, ST-AFT, AT-FT, AT-AFT), which correspond to the different training strategies used in the paper. Each of the ST, AT, ..., AT-AFT subfolders contains a PyTorch file named last.t7: the PyTorch ResNet-18 model trained on the corresponding split (S1/S2/S3) using the corresponding training strategy (ST, AT, ..., AT-AFT). A Python example script for loading the models is also provided (load_model.py); a minimal loading sketch is also given after this entry. 
Type Of Material Computer model/algorithm 
Year Produced 2021 
Provided To Others? Yes  
Impact Recently published 
URL https://zenodo.org/record/4518951#.YC9-z-qnw5k
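
A minimal sketch of loading one of the pre-trained filling-level models described in the entry above. The distributed load_model.py is the authoritative reference; whether last.t7 stores a state_dict or a full module, as well as the 3-class head, are assumptions here.

```python
# Minimal sketch: load a pre-trained ResNet-18 filling-level model
# (illustrative only; see the distributed load_model.py for the reference).
import torch
from torchvision import models

# Load one checkpoint (placeholder path: split S1, strategy ST).
checkpoint = torch.load("S1/ST/last.t7", map_location="cpu")

# Rebuild a ResNet-18 with a 3-class head (empty / 50% / 90%, assumed).
model = models.resnet18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 3)

# The checkpoint may store either a state_dict or a full module; handle both.
if isinstance(checkpoint, dict):
    model.load_state_dict(checkpoint.get("state_dict", checkpoint))
else:
    model = checkpoint
model.eval()
```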
 
Description EPFL-corsmal 
Organisation Swiss Federal Institute of Technology in Lausanne (EPFL)
Country Switzerland 
Sector Public 
PI Contribution Research collaboration resulting in 3 joint publications.
Collaborator Contribution Expertise in robotic control, robotic manipulation, and robustness of machine learning models
Impact A multi-disciplinary collaboration that resulted in 3 joint publications. Disciplines involved: robotics, control, machine learning, computer vision, digital signal processing.
Start Year 2019
 
Description information fusion 
Organisation Swiss Federal Institute of Technology in Lausanne (EPFL)
Country Switzerland 
Sector Public 
PI Contribution A new method for the real-time estimation, through vision, of the physical properties of objects manipulated by humans, used to inform the control of robots for performing accurate and safe grasps of objects handed over by humans.
Collaborator Contribution The design of the control of a robot for performing accurate and safe grasps of objects handed over by humans.
Impact multi-disciplinary collaboration - outcome: https://ieeexplore.ieee.org/document/8968407
Start Year 2019
 
Title Saafke/CHOC-NOCS: Public release (v1.0.0) 
Description This release makes publicly available the official code to train, test and evaluate the NOCS model trained on the CORSMAL Hand-Occluded Containers (CHOC) dataset. It includes a demo to run the trained model on RGB or RGB-D images. 
Type Of Technology Software 
Year Produced 2022 
URL https://zenodo.org/record/7406417
 
Title Saafke/CHOC-renderer: public release (v1.0.0) 
Description This release makes publicly available the official code to automatically render images in the style of the CORSMAL Hand-Occluded Containers (CHOC) dataset via Blender and the Python API. The code can be used: (i) to exactly generate the mixed-reality set of the CHOC dataset, which consists of RGB images, segmentation masks (object, hand+forearm), depth maps, 6D object poses, and Normalised Object Coordinate Space (NOCS) maps; and (ii) as a starting point to render other types of mixed-reality datasets. A minimal bpy rendering sketch is given after this entry. 
Type Of Technology Software 
Year Produced 2022 
Open Source License? Yes  
URL https://zenodo.org/record/7406367
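
A minimal sketch of rendering a single image with Blender's Python API (bpy), in the spirit of the CHOC renderer but not its actual pipeline. All paths are placeholders, and the OBJ import operator name depends on the Blender version.

```python
# Minimal sketch: render one still image with Blender's Python API.
# Run with: blender --background --python render_one.py
import bpy

# Import a container mesh into the current scene (placeholder path; in
# Blender 4.x the operator is bpy.ops.wm.obj_import instead).
bpy.ops.import_scene.obj(filepath="/path/to/container.obj")

# Configure a 640x480 PNG output, matching the CHOC image resolution.
scene = bpy.context.scene
scene.render.image_settings.file_format = "PNG"
scene.render.resolution_x = 640
scene.render.resolution_y = 480
scene.render.filepath = "/tmp/choc_style_render.png"

# Render one still image with the scene's existing camera and lights.
bpy.ops.render.render(write_still=True)
```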
 
Description Sub-track 4 of the ICRA 2024 Robot Grasping and Manipulation Competition 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact 9th Robotic Grasping and Manipulation Competition
Sub-Track 4: Human-to-Robot Handovers -- The tasks in this sub-track are designed based on the benchmark published in Sanchez-Matilla, R., Chatzilygeroudis, K., Modas, A., Duarte, N.F., Xompero, A., Frossard, P., Billard, A. and Cavallaro, A., 2020. Benchmark for human-to-robot handovers of unseen containers with unknown filling. IEEE Robotics and Automation Letters, 5(2), pp. 1642-1649.
https://www.cse.usf.edu/~yusun/rgmc/2024.html
Year(s) Of Engagement Activity 2024