Score: 9.9 • Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, et al. Summary The objective of this study is to offer a comprehensive overview of the rapidly evolving field of self-supervised learning (SSL), which has recently gained significant attention. The authors categorize and summarize different SSL methods and provide practical advice for training and evaluating these models. They also conduct experiments to shed light on unresolved issues in the field, such as the role of projectors; their experiments indicate that projectors improve robustness to the noise introduced by image augmentations.
The authors classify SSL methods into three categories: 1) the Deep Metric Learning (DML) family (e.g., SimCLR), 2) the self-distillation family (e.g., BYOL, DINO), and 3) the Canonical Correlation Analysis (CCA) family (e.g., VICReg, Barlow Twins). For each family, the authors provide historical background, outlining how it originated and developed into the modern deep SSL approaches. For instance, the shift from classical DML to modern contrastive SSL emerged with the use of data augmentation instead of sampling, along with deep networks and projectors.
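To make the contrastive recipe concrete, here is a minimal NumPy sketch of a SimCLR-style objective: embeddings of two augmented views of the same batch (in practice taken after the projector MLP) are compared with the NT-Xent loss, which pulls matched views together and pushes all other pairs apart. The function name and temperature value are illustrative, not taken from the paper.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss over 2N embeddings (two augmented views per image).

    z1, z2: (N, d) arrays, e.g. projector outputs for view 1 and view 2.
    """
    z = np.concatenate([z1, z2], axis=0)               # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize
    sim = z @ z.T / temperature                        # cosine similarities
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    # The positive for index i is its other view at index i+n (and vice versa).
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(2 * n), pos] - logsumexp)
    return loss.mean()
```

After pretraining, the projector is typically discarded and the encoder's representation is used for downstream tasks; the loss above only ever sees projector outputs.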
Given that the literature on SSL is vast, the authors provide a summary of each major component of SSL, including data augmentation, projectors, and standard hyperparameters such as batch size and learning rate. They offer helpful tips for training SSL models on limited resources, as well as strategies for better convergence in general. This paper can serve as a valuable reference for novice SSL practitioners, allowing them to comprehend and integrate even the most recent advancements. Details Submitted on Apr 25 • Computer Vision and Pattern Recognition • Navigate • Self-Supervised Learning | | Score: 9.2 • Heewoo Jun, Alex Nichol Summary A team of prominent researchers at OpenAI has introduced a novel 3D generative model called Shap-E, which generates the parameters of an implicit Neural Radiance Fields (NeRF) MLP directly. Unlike DreamFusion-based methods, which require training a NeRF specifically for each object at inference time, Shap-E is considerably faster, taking only 13 seconds to generate a 3D object on a V100 GPU. Shap-E can also directly generate high-resolution textured meshes without requiring any additional super-resolution modules, unlike its predecessor, Point-E. The researchers trained Shap-E using over 1 million 3D assets that were text-labeled by human labelers.
Shap-E consists of three main parts: a 3D encoder that maps both point clouds and 20-view renderings of the asset into the latent space, a latent diffusion model that models the distribution of the latents, and a NeRF MLP that uses the latents as parameters for rendering. In addition, to enable the model to generate textured 3D meshes, an STF output head is added and fine-tuned during the second stage.
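The "latents as parameters" idea can be illustrated with a toy hypernetwork-style sketch: a latent vector is reshaped into the weights of a tiny coordinate MLP that maps 3D points to a scalar density. This is only a loose analogy to Shap-E's design, not its actual architecture; all names and sizes below are illustrative.

```python
import numpy as np

def render_density(latent, coords, hidden=8):
    """Interpret a latent vector as the flattened weights of a tiny MLP
    mapping 3D coordinates to a density (hypernetwork-style toy analogy
    to generating NeRF MLP parameters directly from a latent)."""
    d_in = 3
    n1 = d_in * hidden                  # first-layer weight count
    n2 = hidden                         # output-layer weight count
    assert latent.size >= n1 + n2, "latent too small for this MLP shape"
    W1 = latent[:n1].reshape(d_in, hidden)
    w2 = latent[n1:n1 + n2]
    h = np.tanh(coords @ W1)            # (N, hidden) hidden activations
    return h @ w2                       # (N,) density at each query point
```

The appeal of this scheme is that the diffusion model only has to sample one latent per object, after which rendering is a cheap forward pass, which is why Shap-E avoids DreamFusion's per-object optimization at inference time.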
The model is inherently multi-representational, as it can be rendered both as textured meshes and as NeRFs. Moreover, in Appendix D, the researchers provide a method to guide Shap-E in image space, which allows one to leverage the score distillation loss from DreamFusion, combining the best of both worlds. As the inference code and the model are open-sourced, it will be fascinating to see how researchers utilize this new tool. Details Submitted on May 3 • Computer Vision and Pattern Recognition • Diffusion | | Score: 8.5 • Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, et al. Summary In the realm of machine learning (ML), much research has been focused on developing better algorithms and optimization strategies to improve performance on fixed benchmark datasets. However, what happens when the situation is reversed, and the goal is to design a better dataset while fixing the algorithm? With the growing prominence of large foundation models, the importance of large-scale data collection and curation has become increasingly crucial. This is the motivation behind DATACOMP, a benchmark proposed by researchers from various organizations that presents new training sets while fixing the training code.
The authors introduce COMMONPOOL, a dataset containing 12.8 billion image-text pairs collected from Common Crawl, as the candidate pool for DATACOMP. They apply a filtering strategy that combines CLIP score-based thresholding from LAION with image-based filtering based on ImageNet features, resulting in DATACOMP-1B, which contains 1.4 billion image-text pairs that can be used to train a state-of-the-art, open-sourced CLIP model from scratch. Remarkably, training a CLIP ViT-L/14 model with a compute budget of 12.8 billion samples achieves an ImageNet zero-shot accuracy of 79.2%.
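The core of the CLIP-score filtering step can be sketched as a simple threshold over precomputed image-text similarity scores. The function name and threshold value below are illustrative; the actual DATACOMP-1B filter also intersects the score-based subset with an ImageNet-feature-based image filter.

```python
def filter_by_clip_score(pairs, scores, threshold=0.28):
    """Keep only image-text pairs whose CLIP image-text cosine similarity
    clears a threshold. The 0.28 default is illustrative, not the paper's
    exact value; `scores` would come from a pretrained CLIP model."""
    return [pair for pair, score in zip(pairs, scores) if score >= threshold]
```

Thresholding on CLIP similarity discards pairs whose captions do not describe the image (boilerplate, filenames, spam), which is why a 1.4B-pair filtered subset can outperform training on the full 12.8B-pair pool at a fixed compute budget.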
Apart from the main findings, the paper and project also include more than 300 baseline experiments with varying compute budgets and model sizes. There is also a BYOD (bring your own data) track, which enables users to utilize external datasets in addition to the proposed benchmark datasets. As this is the start of a new generation of multimodal datasets, it will be intriguing to see what contributions this initiative brings to the community. Details Submitted on Apr 27 • Computer Vision and Pattern Recognition | -
Score: 9.3 • Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari Morcos, Ali Farhadi, Ludwig Schmidt -
Score: 7.8 • Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye -
Score: 7.7 • Weiyu Li, Xuelin Chen, Jue Wang, Baoquan Chen | | Hyungjin Chung is a contributing writer at AlphaSignal, an incoming researcher at Google, and a Ph.D. student at the KAIST Bio-Imaging Signal Processing & Learning lab (BISPL). His work mostly focuses on solving inverse problems arising in computational imaging and generative modeling.