ChipBrain AI Research (CAIR)

CAIR is a fundamental AI research group focusing on data-centric ML, conversational AI, NLU/NLP, emotion detection, data labeling, audio ML, and intelligence augmentation.

Our focus is humanity, and our long-term mission is to develop new technologies that empower humans, enhance societal empathy, and augment human intelligence.

News
  • The Batch
  • The Foundations of AI Are Riddled With Errors
  • Turns out humans are leading AI systems astray because we can't agree on labeling
  • Error-riddled data sets are warping our sense of how good AI really is
  • MIT study finds labelling errors in datasets used to test AI
  • Major ML datasets have tens of thousands of errors
  • MIT study finds ‘systematic’ labeling errors in popular AI benchmark datasets
  • Big AIs made with the help of bad data
  • Major machine learning datasets have tens of thousands of errors
  • AI Is Getting A Few Things Wrong, Because Humans May Have Incorrectly Labeled A Bunch Of Images
  • Label errors abound in the most common AI test sets
  • The mistaken foundations of artificial intelligence
  • What if artificial intelligence learns stupidity? Researchers at MIT found errors in the data for learning AI
  • AI: Study finds many incorrect descriptions in machine learning datasets
Papers
Embodied Intelligence
EgoCom: A Multi-person Multi-modal Egocentric Communications Dataset
By Curtis Northcutt (ChipBrain), Cindy Zha (Facebook AI), Steve Lovegrove (Oculus Research), Richard Newcombe (Oculus Research)

Multi-modal datasets in artificial intelligence (AI) often capture a third-person perspective, but our embodied human intelligence evolved with sensory input from the egocentric, first-person perspective. Towards embodied AI, we introduce the Egocentric Communications (EgoCom) dataset to advance the state-of-the-art in conversational AI, natural language, audio speech analysis, computer vision, and machine learning.

Read more
Diversity and Fairness
Comment Ranking Diversification in Forum Discussions
By Curtis Northcutt (ChipBrain), Kim Leon (Facebook Reality Labs), Naichun Chen (MIT)

On platforms such as Facebook, Twitter, and Reddit, posts by the majority drown out minority voices simply because there are more people in the majority to upvote them. Learn more about how we address this problem.

Read more
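The sketch below shows one generic way to diversify a comment ranking: greedily re-rank by trading off each comment's vote score against its similarity to comments already selected (a maximal-marginal-relevance style heuristic). It is an illustration only, not necessarily the method in the paper, and the score and embedding inputs are hypothetical placeholders.

```python
# Generic greedy diversified re-ranking (a maximal-marginal-relevance style
# heuristic); an illustration only, not necessarily the paper's method.
# `scores` and `embeddings` are hypothetical inputs (e.g., vote counts and
# comment text embeddings).
import numpy as np

def diversified_ranking(scores, embeddings, lam=0.3, k=10):
    """Greedily pick comments, trading relevance against redundancy.

    scores     : (n,) base relevance scores, scaled to roughly [0, 1].
    embeddings : (n, d) comment representations.
    lam        : weight on relevance; (1 - lam) weights diversity.
    k          : number of comments to return (in ranked order).
    """
    emb = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-12)
    selected, remaining = [], list(range(len(scores)))
    while remaining and len(selected) < k:
        def value(i):
            # Redundancy = max cosine similarity to anything already shown.
            redundancy = max((float(emb[i] @ emb[j]) for j in selected), default=0.0)
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(remaining, key=value)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: comments 1 and 2 are near-duplicates of the highest-voted comment 0,
# so with a small lam they tend to be demoted in favor of minority comments 3 and 4.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))
emb[1] = emb[0] + 0.01 * rng.normal(size=8)
emb[2] = emb[0] + 0.01 * rng.normal(size=8)
votes = np.array([50.0, 45.0, 40.0, 5.0, 3.0])
print(diversified_ranking(votes / votes.max(), emb, lam=0.3, k=3))
```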
Confident Learning: Estimating Uncertainty in Dataset Labels
By Curtis Northcutt, Lu Jiang, Isaac Chuang

Learning exists in the context of data, yet notions of confidence typically focus on model predictions, not label quality. Confident learning (CL) is an alternative approach which focuses instead on label quality by characterizing and identifying label errors in datasets, based on the principles of pruning noisy data, counting with probabilistic thresholds to estimate noise, and ranking examples to train with confidence. Whereas numerous studies have developed these principles independently, here, we combine them, building on the assumption of a class-conditional noise process to directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels. This results in a generalized CL which is provably consistent and experimentally performant. We present sufficient conditions where CL exactly finds label errors, and show CL performance exceeding seven recent competitive approaches for learning with noisy labels on the CIFAR dataset. Uniquely, the CL framework is not coupled to a specific data modality or model (e.g., we use CL to find several label errors in the presumed error-free MNIST dataset and improve sentiment classification on text data in Amazon Reviews). We also employ CL on ImageNet to quantify ontological class overlap (e.g., estimating 645 missile images are mislabeled as their parent class projectile), and moderately increase model accuracy (e.g., for ResNet) by cleaning data prior to training. These results are replicable using the open-source cleanlab release.

Read more
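The open-source cleanlab release mentioned in the abstract packages confident learning. Below is a minimal sketch of flagging likely label errors with it; it assumes cleanlab 2.x's find_label_issues API, a scikit-learn classifier, and out-of-sample predicted probabilities from cross-validation.

```python
# Minimal sketch of finding likely label errors with the open-source cleanlab
# package referenced above. Assumes cleanlab >= 2.0 (find_label_issues) and a
# scikit-learn classifier; the full CL estimator of the noisy-label joint
# distribution is described in the paper.
from cleanlab.filter import find_label_issues
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)

# Out-of-sample predicted probabilities, so CL's per-class probabilistic
# thresholds are not contaminated by the model memorizing its training labels.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=2000), X, y, cv=5, method="predict_proba"
)

# Count, rank, and flag the examples most likely to be mislabeled.
issue_indices = find_label_issues(
    labels=y, pred_probs=pred_probs, return_indices_ranked_by="self_confidence"
)
print(f"{len(issue_indices)} suspected label errors; top 10: {issue_indices[:10]}")
```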
Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks
By Curtis G. Northcutt, Anish Athalye, Jonas Mueller

We identify label errors in the test sets of 10 of the most commonly-used computer vision, natural language, and audio datasets, and subsequently study the potential for these label errors to affect benchmark results. Errors in test sets are numerous and widespread: we estimate an average of 3.4% errors across the 10 datasets, where for example 2916 label errors comprise 6% of the ImageNet validation set. Putative label errors are identified using confident learning algorithms and then human-validated via crowdsourcing (54% of the algorithmically-flagged candidates are indeed erroneously labeled). Traditionally, machine learning practitioners choose which model to deploy based on test accuracy; our findings advise caution here, proposing that judging models over correctly labeled test sets may be more useful, especially for noisy real-world datasets. Surprisingly, we find that lower capacity models may be practically more useful than higher capacity models in real-world datasets with high proportions of erroneously labeled data. For example, on ImageNet with corrected labels: ResNet-18 outperforms ResNet-50 if the prevalence of originally mislabeled test examples increases by just 6%. On CIFAR-10 with corrected labels: VGG-11 outperforms VGG-19 if the prevalence of originally mislabeled test examples increases by just 5%.

Read more
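As a rough illustration of the evaluation suggested above, one can score each candidate model on both the original and the corrected test labels before deciding which to deploy. The sketch below uses hypothetical placeholder arrays for the labels and predictions.

```python
# Sketch of the evaluation suggested above: score each candidate model on both
# the original (possibly erroneous) test labels and the corrected labels before
# choosing which model to deploy. The label and prediction arrays are
# hypothetical placeholders for your own test set and models.
import numpy as np

def accuracy(pred, labels):
    return float(np.mean(np.asarray(pred) == np.asarray(labels)))

def compare_on_corrected_labels(predictions, original_labels, corrected_labels):
    """predictions: dict mapping model name -> array of predicted classes."""
    print(f"{'model':<12}{'orig acc':>10}{'corrected acc':>16}")
    for name, pred in predictions.items():
        print(f"{name:<12}{accuracy(pred, original_labels):>10.3f}"
              f"{accuracy(pred, corrected_labels):>16.3f}")

# A model that ranks first on the original labels (often the higher-capacity
# one) may fall behind a smaller model once the label errors are corrected.
```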
Pervasive Label Errors in ML Benchmark Test Sets, Consequences, and Benefits
By Curtis G. Northcutt, Anish Athalye, Jessy Lin

We use new algorithmic techniques to automatically identify numerous label errors in the test sets of ten of the most commonly-used computer vision, natural language, and audio datasets. Errors are widespread: validated using human studies, we estimate an average of 3.4% errors across ten datasets, where for example 2916 errors comprise 6% of the ImageNet validation set. In a case study on ImageNet, we find that label errors do not corrupt current benchmarks. Unexpectedly, we find a use for erroneously labeled test data: as a “honeypot” for reliable benchmarking of generalization accuracy. Independently, in both ImageNet and CIFAR-10, pretrained classifiers exhibit a negative correlation in performance on corrected labels versus performance on original (erroneous) labels on the validation set, with lower capacity models (e.g. ResNet-18) outperforming more expressive models (e.g. NASNet), suggesting that this honeypot technique may measure overfitting.

Read more
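The "honeypot" measurement described above can be sketched as follows: restrict attention to the test examples flagged as originally mislabeled and, for each pretrained model, compare agreement with the erroneous original labels against accuracy on the corrected labels. The array names below are hypothetical placeholders.

```python
# Sketch of the "honeypot" measurement described above, restricted to the test
# examples flagged as originally mislabeled. Array names (predictions, original
# and corrected labels, flagged indices) are hypothetical placeholders.
import numpy as np

def honeypot_scores(pred, original, corrected, flagged_idx):
    """Return (agreement with the erroneous original labels, accuracy on the
    corrected labels), both computed only over the flagged examples."""
    p, o, c = pred[flagged_idx], original[flagged_idx], corrected[flagged_idx]
    return float(np.mean(p == o)), float(np.mean(p == c))

def honeypot_correlation(predictions, original, corrected, flagged_idx):
    """Correlation across models between the two scores; a strongly negative
    value suggests the benchmark rewards fitting label noise (overfitting)."""
    pairs = np.array([honeypot_scores(p, original, corrected, flagged_idx)
                      for p in predictions.values()])
    return float(np.corrcoef(pairs[:, 0], pairs[:, 1])[0, 1])
```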
Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels
By Curtis G. Northcutt, Tailin Wu, Isaac L. Chuang

P̃Ñ learning is the problem of binary classification when training examples may be mislabeled (flipped) uniformly with noise rate ρ₁ for positive examples and ρ₀ for negative examples. We propose Rank Pruning (RP) to solve P̃Ñ learning and the open problem of estimating the noise rates. Unlike prior solutions, RP is efficient and general, requiring O(T) for any unrestricted choice of probabilistic classifier with T fitting time. We prove RP achieves consistent noise estimation and equivalent expected risk as learning with uncorrupted labels in ideal conditions, and derive closed-form solutions when conditions are non-ideal. RP achieves state-of-the-art noise estimation and F1, error, and AUC-PR for both MNIST and CIFAR datasets, regardless of the amount of noise. To highlight, RP with a CNN classifier can predict if an MNIST digit is a one or not with only 0.25% error, and 0.46% error across all digits, even when 50% of positive examples are mislabeled and 50% of observed positive labels are mislabeled negative examples.

Read more
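A much-simplified sketch of the rank-pruning idea appears below: rank training examples by an out-of-sample classifier confidence, prune those whose observed label is least consistent with that ranking, and retrain on the rest. The prune fractions are passed in explicitly here for illustration; the paper's algorithm instead estimates them from the noise rates ρ₁ and ρ₀.

```python
# Much-simplified sketch of the rank-pruning idea: rank examples by an
# out-of-sample classifier confidence, prune those whose observed label is
# least consistent with that ranking, and retrain on the rest. The prune
# fractions are explicit arguments here for illustration; the paper's algorithm
# instead estimates them from the noise rates (rho_1, rho_0) in closed form.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def rank_prune_and_fit(X, s, prune_frac_pos=0.2, prune_frac_neg=0.2):
    """X: features; s: observed (possibly noisy) binary labels in {0, 1}."""
    # Out-of-sample P(y = 1 | x), so the ranking is not overfit to the noise.
    p1 = cross_val_predict(LogisticRegression(max_iter=1000), X, s,
                           cv=5, method="predict_proba")[:, 1]

    pos, neg = np.where(s == 1)[0], np.where(s == 0)[0]
    n_pos, n_neg = int(prune_frac_pos * len(pos)), int(prune_frac_neg * len(neg))

    # Drop labeled positives the model is least confident are positive, and
    # labeled negatives the model is most confident are positive.
    drop_pos = pos[np.argsort(p1[pos])[:n_pos]]
    drop_neg = neg[np.argsort(p1[neg])[len(neg) - n_neg:]]
    keep = np.setdiff1d(np.arange(len(s)), np.concatenate([drop_pos, drop_neg]))

    # Retrain on the pruned ("confident") examples only.
    return LogisticRegression(max_iter=1000).fit(X[keep], s[keep])
```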
Rapformer: Conditional Rap Lyrics Generation with Denoising Autoencoders
By Nikola I. Nikolov, Eric Malmi, Curtis Northcutt, Loreto Parisi

The ability to combine symbols to generate language is a defining characteristic of human intelligence, particularly in the context of artistic story-telling through lyrics. We develop a method for synthesizing a rap verse based on the content of any text (e.g., a news article), or for augmenting pre-existing rap lyrics. Our method, called Rapformer, is based on training a Transformer-based denoising autoencoder to reconstruct rap lyrics from content words extracted from the lyrics, trying to preserve the essential meaning, while matching the target style. Rapformer features a novel BERT-based paraphrasing scheme for rhyme enhancement which increases the average rhyme density of output lyrics by 10%. Experimental results on three diverse input domains show that Rapformer is capable of generating technically fluent verses that offer a good trade-off between content preservation and style transfer. Furthermore, a Turing-test-like experiment reveals that Rapformer fools human lyrics experts 25% of the time.

Read more
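The denoising setup described above pairs a corrupted input (the content words of a lyric line) with the original line as the reconstruction target. The sketch below shows that corruption step with a tiny hand-written stopword list; the paper's actual content-word extraction pipeline differs.

```python
# Minimal sketch of the denoising setup described above: the corrupted input is
# the content words of a lyric line and the reconstruction target is the
# original line. The tiny stopword list and example line are illustrative; the
# paper's actual content-word extraction pipeline differs.
import re

STOPWORDS = {
    "a", "an", "the", "and", "or", "but", "to", "of", "in", "on", "for",
    "with", "is", "are", "was", "were", "be", "i", "you", "it", "my", "your",
}

def content_words(line):
    tokens = re.findall(r"[a-z']+", line.lower())
    return [t for t in tokens if t not in STOPWORDS]

def make_denoising_pair(line):
    """Return (corrupted input, reconstruction target) for one lyric line."""
    return " ".join(content_words(line)), line

src, tgt = make_denoising_pair("I keep the city on my shoulders like a scar")
print(src)   # keep city shoulders like scar
print(tgt)   # I keep the city on my shoulders like a scar
```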