ChipBrain AI Research (CAIR)

CAIR is a fundamental AI research group focusing on data-centric ML, conversational AI, NLU/NLP, emotion detection, data labeling, audio ML, and intelligence augmentation.

Our focus is humanity, and our long-term mission is to develop new technologies that empower humans, enhance societal empathy, and augment human intelligence.

News
  • The Batch
  • The Foundations of AI Are Riddled With Errors
  • Turns out humans are leading AI systems astray because we can't agree on labeling
  • Error-riddled data sets are warping our sense of how good AI really is
  • MIT study finds labelling errors in datasets used to test AI
  • Major ML datasets have tens of thousands of errors
  • MIT study finds ‘systematic’ labeling errors in popular AI benchmark datasets
  • Big AIs made with the help of bad data
  • Major machine learning datasets have tens of thousands of errors
  • AI Is Getting A Few Things Wrong, Because Humans May Have Incorrectly Labeled A Bunch Of Images
  • Label errors abound in the most common AI test sets
  • The mistaken foundations of artificial intelligence
  • What if artificial intelligence learns stupidity? Researchers at MIT found errors in the data for learning AI
  • AI: Study finds many incorrect descriptions in machine learning datasets
Papers
Embodied Intelligence
EgoCom: A Multi-person Multi-modal Egocentric Communications Dataset
By Curtis Northcutt (ChipBrain), Cindy Zha (Facebook AI), Steve Lovegrove (Oculus Research), Richard Newcombe (Oculus Research)

Multi-modal datasets in artificial intelligence (AI) often capture a third-person perspective, but our embodied human intelligence evolved with sensory input from the egocentric, first-person perspective. Towards embodied AI, we introduce the Egocentric Communications (EgoCom) dataset to advance the state-of-the-art in conversational AI, natural language, audio speech analysis, computer vision, and machine learning.

Read more
Diversity and Fairness
Comment Ranking Diversification in Forum Discussions
By Curtis Northcutt (ChipBrain), Kim Leon (Facebook Reality Labs), Naichun Chen (MIT)

On platforms such as Facebook, Twitter, and Reddit, posts by the majority drown out minority voices simply because there are more people in the majority to upvote them. Learn more about how we address this problem.

Read more
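The sketch below shows one generic way to diversify a comment ranking: greedily re-rank by trading off each comment's vote score against its similarity to comments already selected (a maximal-marginal-relevance style heuristic). It is an illustration only, not necessarily the method in the paper, and the score and embedding inputs are hypothetical placeholders.

```python
# Generic greedy diversified re-ranking (a maximal-marginal-relevance style
# heuristic); an illustration only, not necessarily the paper's method.
# `scores` and `embeddings` are hypothetical inputs (e.g., vote counts and
# comment text embeddings).
import numpy as np

def diversified_ranking(scores, embeddings, lam=0.3, k=10):
    """Greedily pick comments, trading relevance against redundancy.

    scores     : (n,) base relevance scores, scaled to roughly [0, 1].
    embeddings : (n, d) comment representations.
    lam        : weight on relevance; (1 - lam) weights diversity.
    k          : number of comments to return (in ranked order).
    """
    emb = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-12)
    selected, remaining = [], list(range(len(scores)))
    while remaining and len(selected) < k:
        def value(i):
            # Redundancy = max cosine similarity to anything already shown.
            redundancy = max((float(emb[i] @ emb[j]) for j in selected), default=0.0)
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(remaining, key=value)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: comments 1 and 2 are near-duplicates of the highest-voted comment 0,
# so with a small lam they tend to be demoted in favor of minority comments 3 and 4.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))
emb[1] = emb[0] + 0.01 * rng.normal(size=8)
emb[2] = emb[0] + 0.01 * rng.normal(size=8)
votes = np.array([50.0, 45.0, 40.0, 5.0, 3.0])
print(diversified_ranking(votes / votes.max(), emb, lam=0.3, k=3))
```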
Confident Learning: Estimating Uncertainty in Dataset Labels
By Curtis Northcutt, Lu Jiang, Isaac Chuang

Learning exists in the context of data, yet notions of confidence typically focus on model predictions, not label quality. Confident learning (CL) is an alternative approach which focuses instead on label quality by characterizing and identifying label errors in datasets, based on the principles of pruning noisy data, counting with probabilistic thresholds to estimate noise, and ranking examples to train with confidence. Whereas numerous studies have developed these principles independently, here, we combine them, building on the assumption of a class-conditional noise process to directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels. This results in a generalized CL which is provably consistent and experimentally performant. We present sufficient conditions where CL exactly finds label errors, and show CL performance exceeding seven recent competitive approaches for learning with noisy labels on the CIFAR dataset. Uniquely, the CL framework is not coupled to a specific data modality or model (e.g., we use CL to find several label errors in the presumed error-free MNIST dataset and improve sentiment classification on text data in Amazon Reviews). We also employ CL on ImageNet to quantify ontological class overlap (e.g., estimating 645 missile images are mislabeled as their parent class projectile), and moderately increase model accuracy (e.g., for ResNet) by cleaning data prior to training. These results are replicable using the open-source cleanlab release.

Read more
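The open-source cleanlab release mentioned in the abstract packages confident learning. Below is a minimal sketch of flagging likely label errors with it; it assumes cleanlab 2.x's find_label_issues API, a scikit-learn classifier, and out-of-sample predicted probabilities from cross-validation.

```python
# Minimal sketch of finding likely label errors with the open-source cleanlab
# package referenced above. Assumes cleanlab >= 2.0 (find_label_issues) and a
# scikit-learn classifier; the full CL estimator of the noisy-label joint
# distribution is described in the paper.
from cleanlab.filter import find_label_issues
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)

# Out-of-sample predicted probabilities, so CL's per-class probabilistic
# thresholds are not contaminated by the model memorizing its training labels.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=2000), X, y, cv=5, method="predict_proba"
)

# Count, rank, and flag the examples most likely to be mislabeled.
issue_indices = find_label_issues(
    labels=y, pred_probs=pred_probs, return_indices_ranked_by="self_confidence"
)
print(f"{len(issue_indices)} suspected label errors; top 10: {issue_indices[:10]}")
```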
Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks
By Curtis G. Northcutt, Anish Athalye, Jonas Mueller

We identify label errors in the test sets of 10 of the most commonly-used computer vision, natural language, and audio datasets, and subsequently study the potential for these label errors to affect benchmark results. Errors in test sets are numerous and widespread: we estimate an average of 3.4% errors across the 10 datasets, where for example 2916 label errors comprise 6% of the ImageNet validation set. Putative label errors are identified using confident learning algorithms and then human-validated via crowdsourcing (54% of the algorithmically-flagged candidates are indeed erroneously labeled). Traditionally, machine learning practitioners choose which model to deploy based on test accuracy; our findings advise caution here, proposing that judging models over correctly labeled test sets may be more useful, especially for noisy real-world datasets. Surprisingly, we find that lower capacity models may be practically more useful than higher capacity models in real-world datasets with high proportions of erroneously labeled data. For example, on ImageNet with corrected labels: ResNet-18 outperforms ResNet-50 if the prevalence of originally mislabeled test examples increases by just 6%. On CIFAR-10 with corrected labels: VGG-11 outperforms VGG-19 if the prevalence of originally mislabeled test examples increases by just 5%.

Read more
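As a rough illustration of the evaluation suggested above, one can score each candidate model on both the original and the corrected test labels before deciding which to deploy. The sketch below uses hypothetical placeholder arrays for the labels and predictions.

```python
# Sketch of the evaluation suggested above: score each candidate model on both
# the original (possibly erroneous) test labels and the corrected labels before
# choosing which model to deploy. The label and prediction arrays are
# hypothetical placeholders for your own test set and models.
import numpy as np

def accuracy(pred, labels):
    return float(np.mean(np.asarray(pred) == np.asarray(labels)))

def compare_on_corrected_labels(predictions, original_labels, corrected_labels):
    """predictions: dict mapping model name -> array of predicted classes."""
    print(f"{'model':<12}{'orig acc':>10}{'corrected acc':>16}")
    for name, pred in predictions.items():
        print(f"{name:<12}{accuracy(pred, original_labels):>10.3f}"
              f"{accuracy(pred, corrected_labels):>16.3f}")

# A model that ranks first on the original labels (often the higher-capacity
# one) may fall behind a smaller model once the label errors are corrected.
```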
Pervasive Label Errors in ML Benchmark Test Sets, Consequences, and Benefits
By Curtis G. Northcutt, Anish Athalye, Jessy Lin

We use new algorithmic techniques to automatically identify numerous label errors in the test sets of ten of the most commonly-used computer vision, natural language, and audio datasets. Errors are widespread: validated using human studies, we estimate an average of 3.4% errors across ten datasets, where for example 2916 errors comprise 6% of the ImageNet validation set. In a case study on ImageNet, we find that label errors do not corrupt current benchmarks. Unexpectedly, we find a use for erroneously labeled test data: as a “honeypot” for reliable benchmarking of generalization accuracy. Independently, in both ImageNet and CIFAR-10, pretrained classifiers exhibit a negative correlation in performance on corrected labels versus performance on original (erroneous) labels on the validation set, with lower capacity models (e.g. ResNet-18) outperforming more expressive models (e.g. NASNet), suggesting that this honeypot technique may measure overfitting.

Read more
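The "honeypot" measurement described above can be sketched as follows: restrict attention to the test examples flagged as originally mislabeled and, for each pretrained model, compare agreement with the erroneous original labels against accuracy on the corrected labels. The array names below are hypothetical placeholders.

```python
# Sketch of the "honeypot" measurement described above, restricted to the test
# examples flagged as originally mislabeled. Array names (predictions, original
# and corrected labels, flagged indices) are hypothetical placeholders.
import numpy as np

def honeypot_scores(pred, original, corrected, flagged_idx):
    """Return (agreement with the erroneous original labels, accuracy on the
    corrected labels), both computed only over the flagged examples."""
    p, o, c = pred[flagged_idx], original[flagged_idx], corrected[flagged_idx]
    return float(np.mean(p == o)), float(np.mean(p == c))

def honeypot_correlation(predictions, original, corrected, flagged_idx):
    """Correlation across models between the two scores; a strongly negative
    value suggests the benchmark rewards fitting label noise (overfitting)."""
    pairs = np.array([honeypot_scores(p, original, corrected, flagged_idx)
                      for p in predictions.values()])
    return float(np.corrcoef(pairs[:, 0], pairs[:, 1])[0, 1])
```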
Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels
By Curtis G. Northcutt, Tailin Wu, Isaac L. Chuang

P̃Ñ learning is the problem of binary classification when training examples may be mislabeled (flipped) uniformly with noise rate ρ₁ for positive examples and ρ₀ for negative examples. We propose Rank Pruning (RP) to solve P̃Ñ learning and the open problem of estimating the noise rates. Unlike prior solutions, RP is efficient and general, requiring O(T) for any unrestricted choice of probabilistic classifier with T fitting time. We prove RP achieves consistent noise estimation and equivalent expected risk as learning with uncorrupted labels in ideal conditions, and derive closed-form solutions when conditions are non-ideal. RP achieves state-of-the-art noise estimation and F1, error, and AUC-PR for both MNIST and CIFAR datasets, regardless of the amount of noise. To highlight, RP with a CNN classifier can predict if an MNIST digit is a one or not with only 0.25% error, and 0.46% error across all digits, even when 50% of positive examples are mislabeled and 50% of observed positive labels are mislabeled negative examples.

Read more
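A much-simplified sketch of the rank-pruning idea appears below: rank training examples by an out-of-sample classifier confidence, prune those whose observed label is least consistent with that ranking, and retrain on the rest. The prune fractions are passed in explicitly here for illustration; the paper's algorithm instead estimates them from the noise rates ρ₁ and ρ₀.

```python
# Much-simplified sketch of the rank-pruning idea: rank examples by an
# out-of-sample classifier confidence, prune those whose observed label is
# least consistent with that ranking, and retrain on the rest. The prune
# fractions are explicit arguments here for illustration; the paper's algorithm
# instead estimates them from the noise rates (rho_1, rho_0) in closed form.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def rank_prune_and_fit(X, s, prune_frac_pos=0.2, prune_frac_neg=0.2):
    """X: features; s: observed (possibly noisy) binary labels in {0, 1}."""
    # Out-of-sample P(y = 1 | x), so the ranking is not overfit to the noise.
    p1 = cross_val_predict(LogisticRegression(max_iter=1000), X, s,
                           cv=5, method="predict_proba")[:, 1]

    pos, neg = np.where(s == 1)[0], np.where(s == 0)[0]
    n_pos, n_neg = int(prune_frac_pos * len(pos)), int(prune_frac_neg * len(neg))

    # Drop labeled positives the model is least confident are positive, and
    # labeled negatives the model is most confident are positive.
    drop_pos = pos[np.argsort(p1[pos])[:n_pos]]
    drop_neg = neg[np.argsort(p1[neg])[len(neg) - n_neg:]]
    keep = np.setdiff1d(np.arange(len(s)), np.concatenate([drop_pos, drop_neg]))

    # Retrain on the pruned ("confident") examples only.
    return LogisticRegression(max_iter=1000).fit(X[keep], s[keep])
```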
Rapformer: Conditional Rap Lyrics Generation with Denoising Autoencoders
By Nikola I. Nikolov, Eric Malmi, Curtis Northcutt, Loreto Parisi

The ability to combine symbols to generate language is a defining characteristic of human intelligence, particularly in the context of artistic story-telling through lyrics. We develop a method for synthesizing a rap verse based on the content of any text (e.g., a news article), or for augmenting pre-existing rap lyrics. Our method, called Rapformer, is based on training a Transformer-based denoising autoencoder to reconstruct rap lyrics from content words extracted from the lyrics, trying to preserve the essential meaning, while matching the target style. Rapformer features a novel BERT-based paraphrasing scheme for rhyme enhancement which increases the average rhyme density of output lyrics by 10%. Experimental results on three diverse input domains show that Rapformer is capable of generating technically fluent verses that offer a good trade-off between content preservation and style transfer. Furthermore, a Turing-test-like experiment reveals that Rapformer fools human lyrics experts 25% of the time.

Read more
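The denoising setup described above pairs a corrupted input (the content words of a lyric line) with the original line as the reconstruction target. The sketch below shows that corruption step with a tiny hand-written stopword list; the paper's actual content-word extraction pipeline differs.

```python
# Minimal sketch of the denoising setup described above: the corrupted input is
# the content words of a lyric line and the reconstruction target is the
# original line. The tiny stopword list and example line are illustrative; the
# paper's actual content-word extraction pipeline differs.
import re

STOPWORDS = {
    "a", "an", "the", "and", "or", "but", "to", "of", "in", "on", "for",
    "with", "is", "are", "was", "were", "be", "i", "you", "it", "my", "your",
}

def content_words(line):
    tokens = re.findall(r"[a-z']+", line.lower())
    return [t for t in tokens if t not in STOPWORDS]

def make_denoising_pair(line):
    """Return (corrupted input, reconstruction target) for one lyric line."""
    return " ".join(content_words(line)), line

src, tgt = make_denoising_pair("I keep the city on my shoulders like a scar")
print(src)   # keep city shoulders like scar
print(tgt)   # I keep the city on my shoulders like a scar
```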