Presenter:
Fernando Pereira, Instituto Superior Técnico, Lisbon, Portugal
Abstract:
The recent advances in visual data acquisition and consumption have led to the emergence of the so-called plenoptic visual models, where Point Clouds (PCs) are playing an increasingly important role. Point clouds are a 3D visual model in which the visual scene is represented by a set of points and associated attributes, notably color. To offer realistic and immersive experiences, point clouds need to contain millions, or even billions, of points, thus calling for efficient representation and coding solutions. This is critical for emerging applications and services, notably virtual and augmented reality, personal communications and meetings, education, medical applications, and virtual museum tours.
The point cloud coding field has received many contributions in recent years, notably adopting deep learning-based approaches, and it is critical for the future of immersive media experiences. In this context, the key objective of this tutorial is to review the most relevant point cloud coding solutions available in the literature, with a special focus on deep learning-based solutions and their specific novel features, e.g., model design and training. Special attention will be dedicated to the ongoing standardization projects in this domain, notably in JPEG and MPEG.
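As a purely illustrative picture of the data structure being coded, the following sketch (assuming NumPy; all names and the voxel size are assumptions, and this is not any JPEG or MPEG codec) represents a point cloud as a set of XYZ coordinates plus RGB colors and applies a naive voxel-grid quantization, a minimal example of why millions of points call for compression.

import numpy as np

# Illustrative only: a point cloud as N points with XYZ coordinates plus
# RGB color attributes, and a naive voxel-grid quantization as the simplest
# possible form of lossy "coding" (not a standardized JPEG/MPEG codec).
rng = np.random.default_rng(0)
xyz = rng.uniform(0.0, 1.0, size=(100_000, 3))                    # point positions
rgb = rng.integers(0, 256, size=(100_000, 3), dtype=np.uint8)     # per-point color

def voxelize(xyz, rgb, voxel_size=0.01):
    """Snap points to a voxel grid and average the colors per occupied voxel."""
    keys = np.floor(xyz / voxel_size).astype(np.int64)
    uniq, inverse, counts = np.unique(keys, axis=0, return_inverse=True, return_counts=True)
    inverse = inverse.ravel()
    rgb_sum = np.zeros((counts.size, 3))
    np.add.at(rgb_sum, inverse, rgb.astype(np.float64))
    xyz_q = (uniq + 0.5) * voxel_size                              # voxel centers
    return xyz_q, (rgb_sum / counts[:, None]).astype(np.uint8)

xyz_q, rgb_q = voxelize(xyz, rgb)
print(xyz.shape[0], "points quantized to", xyz_q.shape[0], "occupied voxels")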
Presenters:
Yiannis Andreopoulos, iSIZE, UK, and University College London, UK
Cosmin Stejerean, Meta, Video Infrastructure Group, USA
There has been a flurry of recent developments in video quality assessment in academia and industry, with relevant initiatives in VQEG, AOMedia, MPEG, ITU-T P.910, and other standardization and advisory bodies. Most advanced video streaming systems are now clearly moving away from ‘good old-fashioned’ PSNR and structural-similarity types of assessment, towards metrics that align better with mean opinion scores from viewers. Several of these algorithms, methods, and tools have been developed only in the last three years and, while they are of significant interest to the research community, they are not widely known. The purpose of this tutorial is to provide such an overview, but also to focus on practical aspects and on how to design quality assessment tests that can scale to large datasets. The participants will also benefit from the experience of the tutorial presenters in designing algorithms, methods, and products based on the material summarized in the tutorial.
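For reference, the ‘good old-fashioned’ baseline that the tutorial contrasts with perceptual, MOS-aligned metrics can be written in a few lines; the following is a minimal frame-level PSNR sketch in Python (illustrative only, not one of the newer metrics discussed in the tutorial; the toy frames are assumptions).

import numpy as np

def psnr(reference, distorted, max_val=255.0):
    """Frame-level PSNR in dB: the classic fidelity measure that newer,
    MOS-aligned perceptual metrics are meant to improve upon."""
    mse = np.mean((reference.astype(np.float64) - distorted.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

# Toy usage with a random 8-bit frame and a mildly perturbed copy.
ref = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)
noise = np.random.randint(-5, 6, ref.shape)
dist = np.clip(ref.astype(np.int16) + noise, 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(ref, dist):.2f} dB")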
Presenters:
Zheng Wang, Wuhan University, China
Dan Xu, Hong Kong University of Science and Technology, Hong Kong
Zhedong Zheng, National University of Singapore, Singapore
Kui Jiang, Huawei Cloud & AI, China
Multimedia content understanding is a key application for effective and efficient search, retrieval, delivery, management, and sharing of multimedia content. Existing work shows that media understanding performs well under favorable conditions. Harsh environments (e.g., fog, rain, snow, darkness, low light, glare, blur, and low resolution) introduce challenges for the visibility, analysis, and understanding of visual data in real applications, such as autonomous cars and video surveillance systems. Despite advances in computing power and deep learning algorithms, the performance of current multimedia content understanding algorithms is still mainly benchmarked under high-quality conditions. This tutorial will introduce some key directions in the field of multimedia content understanding under harsh environments. It should be useful for the multimedia community, especially for multimedia content understanding tasks in practical, open-set domains. First, it will introduce some multimedia enhancement methods, including image deraining, dehazing, and low-light enhancement, and demonstrate their performance in downstream vision tasks, such as object detection and segmentation. Second, the tutorial will present recent advances in 2D and 3D visual scene understanding, and describe how deep learning and visual big data are significantly driving research and development in this domain. Third, it will introduce strategies to estimate prediction uncertainty during training in order to rectify pseudo-label learning for unsupervised semantic segmentation adaptation. Finally, it will give a brief summary and present some typical applications and trends in this area.
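As a rough illustration of the third point, the sketch below (a generic confidence-thresholding scheme in PyTorch, not the presenters' specific method; the function name and threshold value are assumptions) shows how a per-pixel uncertainty proxy can be used to mask unreliable pseudo-labels during adaptation.

import torch
import torch.nn.functional as F

def rectified_pseudo_label_loss(logits, pseudo_labels, confidence_threshold=0.9):
    """Generic sketch (not the presenters' exact method): use the model's own
    predictive confidence as an uncertainty proxy and mask out low-confidence
    pixels so that noisy pseudo-labels do not dominate adaptation.
    logits: (B, C, H, W) segmentation scores; pseudo_labels: (B, H, W) int64."""
    confidence, _ = F.softmax(logits, dim=1).max(dim=1)              # (B, H, W)
    mask = (confidence > confidence_threshold).float()               # keep confident pixels
    per_pixel = F.cross_entropy(logits, pseudo_labels, reduction="none")
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1.0)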
Presenters:
Ioannis Pitas, Aristotle University of Thessaloniki, Greece
Ioannis Mademlis, Aristotle University of Thessaloniki, Greece
This tutorial will present the state of the art and current research concerning “autonomous Unmanned Aerial Vehicle (UAV) cinematography”: an emerging, exciting, interdisciplinary subject lying at the crossroads of machine learning, computer graphics, computer vision, and aerial robotics. The field concerns the use of robotic drones with high cognitive autonomy for filming aesthetically pleasing footage in dynamic environments. The tutorial will emphasize the definition and formalization of UAV cinematography aesthetic components, as well as the use of robotic planning/control methods for autonomously capturing them in footage, without the need for manual tele-operation. Additionally, it will focus on state-of-the-art Imitation Learning and Deep Reinforcement Learning approaches for automated UAV/camera control, path planning, and cinematography planning, in the general context of “flying & filming”.
Presenters:
Xin Wang, Tsinghua University, China
Xiaohan Lan, Tsinghua University, China
Wenwu Zhu, Tsinghua University, China
Video grounding aims to retrieve, from an untrimmed video, the segment described by a sentence query; it is a core vision-and-language task attracting continuous attention from the multimedia community. The grounding accuracy straightforwardly reflects the ability to model the correlation between visual and textual inputs. Moreover, video grounding can also serve as an intermediate task for various downstream vision-and-language tasks such as video question answering and video summarization. Hence, it is worthwhile to explore video grounding in depth, which may draw broad interest in the multimedia community, as well as further promote a variety of downstream applications.
However, recent video grounding studies find that current datasets have obvious annotation distribution biases and that the common metrics are unreliable on such biased datasets. How to effectively de-bias a video grounding model so that it truly focuses on the semantic alignment between texts and videos has therefore become a primary issue to be addressed. Furthermore, we would like to generalize video grounding to more settings, e.g., adopting continual learning as scenes change.
In this tutorial, we will first summarize the taxonomy of existing video grounding methods, along with the potential problems of current benchmarking designs. Then the dataset bias issue will be discussed, together with several solutions based on causality and disentangled representation learning. Furthermore, we will provide clear definitions and specific cases of generalized video grounding along several promising directions.
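To make the task definition concrete, the following toy PyTorch sketch (all names and the sliding-window proposal scheme are assumptions, not a published method) scores candidate temporal segments of an untrimmed video against a sentence-query embedding and returns the best-matching segment.

import torch
import torch.nn.functional as F

def ground_query(clip_features, query_embedding, proposals):
    """Toy grounding sketch under assumed pre-extracted features: score each
    candidate (start, end) segment by cosine similarity between its mean-pooled
    clip features and the sentence-query embedding, and return the best one.
    clip_features: (T, D); query_embedding: (D,); proposals: list of (start, end)."""
    scores = []
    for start, end in proposals:
        segment = clip_features[start:end].mean(dim=0)               # (D,)
        scores.append(F.cosine_similarity(segment, query_embedding, dim=0))
    return proposals[int(torch.stack(scores).argmax())]

# Example: 120 clips, 512-D features, three candidate segments.
feats = torch.randn(120, 512)
query = torch.randn(512)
print(ground_query(feats, query, [(0, 30), (30, 75), (75, 120)]))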
Presenters:
Federico Becattini, University of Florence, Italy
Tiberio Uricchio, University of Florence, Italy
Memory Networks are models equipped with a storage component to which information can be written and from which it can subsequently be retrieved for any purpose. Simple forms of memory networks, such as the popular recurrent neural networks (RNNs), LSTMs, or GRUs, have limited storage capabilities and are tailored to specific tasks. In contrast, recent works, starting from Memory Augmented Neural Networks, overcome storage and computational limitations by adding a controller network coupled with an external, element-wise addressable memory. This tutorial aims at providing an overview of such memory-based techniques and their applications in multimedia. It will cover an explanation of the basic concepts behind recurrent neural networks and will then delve into the details of memory augmented neural networks, their structure, and how such models can be trained. We target a broad audience, from beginners to experienced researchers, offering an in-depth introduction to a growing body of literature that is starting to gain interest in the multimedia, computer vision and natural language processing communities.
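As an illustration of the external, element-wise addressable memory mentioned above, the following minimal PyTorch sketch (in the general spirit of Memory Augmented Neural Networks, not any specific published model; class and parameter names are assumptions) implements content-based read and write over a fixed set of memory slots.

import torch
import torch.nn.functional as F

class ContentAddressableMemory(torch.nn.Module):
    """Minimal sketch of an external, element-wise addressable memory: a key
    emitted by a controller network addresses slots by content similarity,
    reads return an attention-weighted sum of slots, and writes blend new
    content into the addressed slots."""
    def __init__(self, num_slots=128, slot_dim=64):
        super().__init__()
        self.register_buffer("memory", torch.zeros(num_slots, slot_dim))

    def _address(self, key):
        # Soft attention weights over slots, by cosine similarity to the key.
        return F.softmax(F.cosine_similarity(self.memory, key.unsqueeze(0), dim=1), dim=0)

    def read(self, key):
        return self._address(key) @ self.memory                       # (slot_dim,)

    def write(self, key, value, erase=0.5):
        w = self._address(key).unsqueeze(1)                           # (num_slots, 1)
        self.memory = (1.0 - erase * w) * self.memory + w * value.unsqueeze(0)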
Presenters:
Jakub Lokoč, Charles University Prague, Czech Republic
Klaus Schoeffmann, Alpen-Adria-Universität Klagenfurt, Austria
Werner Bailer, Joanneum Research, Graz, Austria
Luca Rossetto, University of Zurich, Switzerland
Björn Þór Jónsson, IT University of Copenhagen, Denmark
In the age of large video databases, both automatic and interactive search approaches are necessary. Whereas automatic video content analysis is dominated by deep neural networks, interactive search systems still require combinations of various approaches (including deep models). This tutorial summarizes challenging search tasks that require interactive means of retrieval, including a list of search approaches that have proved effective at respected evaluation campaigns. Our experience with the performance of state-of-the-art deep neural networks for multimedia retrieval will be summarized, as well as attempts at automatic configuration of interactive search systems. Last but not least, a new system for distributed evaluation of multimedia search competitions will be presented and demonstrated during a VBS-like demo session closing the tutorial.