Natural language processing
Find generative ML models like Stable Diffusion and Midjourney fascinating.
Love using ChatGPT and its open variants like Alpaca. Alpaca in particular has a lot of momentum, with projects building on top of it, like this UI.
Prompt Engineering is a great read.
spaCy (with their NLP course) & Fairseq are interesting libraries. The Natural Language Processing with Transformers Book is a nice book. Hugging Face NLP Course is probably the best NLP intro out there.
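To show what the spaCy workflow looks like, here is a minimal sketch (assuming the `en_core_web_sm` pipeline has been downloaded with `python -m spacy download en_core_web_sm`):

```python
# Minimal spaCy sketch: load a small English pipeline and inspect a document.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Hugging Face released a new NLP course in 2021.")

# Tokens with part-of-speech and dependency labels.
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Named entities detected by the pipeline.
for ent in doc.ents:
    print(ent.text, ent.label_)
```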
DALL·E 2 is fascinating too. Trying to understand the DALL-E in PyTorch implementation.
Getting started with NLP for absolute beginners is a nice intro.
LangChain & Petals are interesting. Lightning GPT is nice minimal GPT implementation. Want to try use LLaMA model.
Tokenizers & tiktoken are interesting tokenizers.
rust-bert is useful for making NLP pipelines.
Want to explore fine-tuning the FLAN-T5 model, together with examples from the OpenAI Cookbook.
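A small sketch of prompting FLAN-T5 with Hugging Face transformers before any fine-tuning (the checkpoint `google/flan-t5-small` is just a small model chosen for a quick test):

```python
# Zero-shot FLAN-T5 inference sketch; fine-tuning would typically wrap a
# tokenized dataset with Seq2SeqTrainer / Seq2SeqTrainingArguments instead.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Translate English to German: The book is on the table.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```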
Notes
- Figuring out correctly when/what to escalate to a human would change customer service more than anything else.
- GPT-3 was created by mining a human-written internet that will never again exist thanks to the creation of GPT-3.
- Creating a delightful AI assistant is no longer a problem of getting smarter models. It is now a product problem. Better models will help, but the main blocker is 100% a product problem at this point.
Links
- SpaCy - Industrial-strength Natural Language Processing (NLP) with Python and Cython. (HN: SpaCy 3.0 (2021))
- Adding voice control to your projects
- Increasing data science productivity; founders of spaCy & Prodigy
- Course materials for "Natural Language" course
- NLP progress - Track the progress in Natural Language Processing (NLP) and give an overview of the state-of-the-art across the most common NLP tasks and their corresponding datasets. (Web)
- Natural - General natural language facilities for Node.
- YSDA Natural Language Processing course (2018)
- PyText - Natural language modeling framework based on PyTorch.
- FlashText - Extract Keywords from sentence or Replace keywords in sentences.
- BERT PyTorch implementation
- LASER Language-Agnostic SEntence Representations - Library to calculate and use multilingual sentence embeddings.
- StanfordNLP - Python NLP Library for Many Human Languages.
- nlp-tutorial - Tutorial for those studying NLP (Natural Language Processing) using TensorFlow and PyTorch.
- Better Language Models and Their Implications (2019)
- gpt-2 - Code for the paper "Language Models are Unsupervised Multitask Learners".
- Lingvo - Framework for building neural networks in Tensorflow, particularly sequence models.
- Fairseq - Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
- Stanford CS224N: NLP with Deep Learning (2019) - Course page. (HN)
- Advanced NLP with spaCy: Free Course (Web) (HN)
- Code for Stanford Natural Language Understanding course, CS224u (2019)
- Awesome Reinforcement Learning for Natural Language Processing
- ParlAI - Framework for training and evaluating AI models on a variety of openly available dialogue datasets.
- Training language GANs from Scratch (2019)
- Olivia - Your new best friend built with an artificial neural network.
- Learn-Natural-Language-Processing-Curriculum
- This repository recorded my NLP journey
- Project Alias - Open-source parasite to train custom wake-up names for smart home devices while disturbing their built-in microphone.
- Cornell Tech NLP Code
- Cornell Tech NLP Publications
- Thinc - SpaCy's Machine Learning library for NLP in Python. (Docs)
- Knowledge is embedded in language neural networks but can they reason? (2019)
- NLP Best Practices
- Transfer NLP library - Framework built on top of PyTorch to promote reproducible experimentation and Transfer Learning in NLP.
- FARM - Fast & easy transfer learning for NLP. Harvesting language models for the industry.
- Transformers - State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch. (Web)
- NLP Roadmap 2019
- Flair - Very simple framework for state-of-the-art NLP. Developed by Zalando Research.
- Unsupervised Data Augmentation - Semi-supervised learning method which achieves state-of-the-art results on a wide variety of language and vision tasks.
- Rasa - Open source machine learning framework to automate text- and voice-based conversations.
- T5 - Text-To-Text Transfer Transformer.
- 100 Must-Read NLP Papers (HN)
- Awesome NLP
- NLP Library - Curated collection of papers for the NLP practitioner.
- spacy-transformers - spaCy pipelines for pre-trained BERT, XLNet and GPT-2.
- AllenNLP - Open-source NLP research library, built on PyTorch. (Announcing AllenNLP 1.0)
- GloVe - Global Vectors for Word Representation.
- Botpress - Open-source Virtual Assistant platform.
- Mycroft - Hackable open source voice assistant. (HN)
- VizSeq - Visual Analysis Toolkit for Text Generation Tasks.
- Awesome Natural Language Generation
- How I used NLP (Spacy) to screen Data Science Resume (2019)
- Introduction to Natural Language Processing book - Survey of computational methods for understanding, generating, and manipulating human language, which offers a synthesis of classical representations and algorithms with contemporary machine learning techniques.
- Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning (Code)
- Tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production. (Article)
- Example Notebook using BERT for NLP with Keras (2020)
- NLP 2019/2020 Highlights
- Overview of Modern Deep Learning Techniques Applied to Natural Language Processing
- Language Identification from Very Short Strings (2019)
- SentenceRepresentation - Code accompanying the paper 'Learning Sentence Representations from Unlabelled Data' by Felix Hill, KyungHyun Cho and Anna Korhonen (2016).
- Deep Learning for Language Processing course
- Megatron LM - Ongoing research training transformer language models at scale, including: BERT & GPT-2. (Megatron with FastMoE) (Fork)
- XLNet - New unsupervised language representation learning method based on a novel generalized permutation language modeling objective.
- ALBERT - Lite BERT for Self-supervised Learning of Language Representations.
- BERT - TensorFlow code and pre-trained models for BERT.
- Multilingual Denoising Pre-training for Neural Machine Translation (2020)
- List of NLP tutorials built on PyTorch
- sticker - Sequence labeler that uses either recurrent neural networks, transformers, or dilated convolution networks.
- sticker-transformers - Pretrained transformer models for sticker.
- pke - Python Keyphrase Extraction module.
- How to train a new language model from scratch using Transformers and Tokenizers (2020)
- Interactive Attention Visualization - Small example of an interactive visualization for attention values as being used by transformer language models like GPT2 and BERT.
- The Annotated GPT-2 (2020)
- GluonNLP - Toolkit that enables easy text preprocessing, datasets loading and neural models building to help you speed up your NLP research.
- Finetune - Scikit-learn style model finetuning for NLP.
- Stanza: A Python Natural Language Processing Toolkit for Many Human Languages (2020) (HN)
- NLP Newsletter
- NLP Paper Summaries
- Advanced NLP with spaCy
- Myle Ott's research
- Natural Language Toolkit (NLTK) - Suite of open source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing. (Web) (Book)
- NLP 100 Exercise - Bootcamp designed for learning skills for programming, data analysis, and research activities. (Code)
- The Transformer Family (2020)
- Minimalist Implementation of a BERT Sentence Classifier
- fastText - Library for efficient text classification and representation learning. (Code) (Article) (HN) (Fork)
- Awesome NLP Paper Discussions - Papers & presentations from Hugging Face's weekly science day.
- SynST: Syntactically Supervised Transformers
- The Cost of Training NLP Models: A Concise Overview (2020)
- Tutorial - Transformers (Tweet)
- TTS - Deep learning for Text to Speech.
- MAD-X: An Adapter-based Framework for Multi-task Cross-lingual Transfer (2020)
- gpt-2-simple - Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts.
- BERTScore - BERT score for text generation.
- ML and NLP Paper Discussions
- NLP Index - Collection of NLP resources.
- NLP Datasets
- Word Embeddings (2017)
- NLP from Scratch: Annotated Attention (2020)
- This Word Does Not Exist - Allows people to train a variant of GPT-2 that makes up words, definitions and examples from scratch. (Code) (HN)
- Ultimate guide to choosing an online course covering practical NLP (2020)
- HuggingFace nlp library - Quick overview (2020) (Twitter)
- aitextgen - Robust Python tool for text-based AI training and generation using GPT-2. (HN)
- Self Supervised Representation Learning in NLP (2020) (HN)
- Synthetic and Natural Noise Both Break Neural Machine Translation (2017)
- Inferbeddings - Injecting Background Knowledge in Neural Models via Adversarial Set Regularisation.
- UCL Natural Language Processing group
- Interactive Lecture Notes, Slides and Exercises for Statistical NLP
- Beyond Accuracy: Behavioral Testing of NLP models with CheckList
- CMU LTI Low Resource NLP Bootcamp 2020
- GPT-3: Language Models Are Few-Shot Learners (2020) (HN) (Code)
- nlp - Lightweight and extensible library to easily share and access datasets and evaluation metrics for NLP.
- Brainsources for NLP enthusiasts
- Movement Pruning: Adaptive Sparsity by Fine-Tuning (Paper)
- NLP Resources
- TaBERT: Learning Contextual Representations for Natural Language Utterances and Structured Tables (Article) (HN)
- vtext - NLP in Rust with Python bindings.
- Language Technology Lab @ University of Cambridge
- The Natural Language Processing Dictionary
- Introduction to NLP using Fastai (2020)
- Gwern on GPT-3 (HN)
- Semantic Machines - Solving conversational artificial intelligence. Part of Microsoft.
- The Reformer – Pushing the limits of language modeling (HN)
- GPT-3 Creative Fiction (2020) (HN)
- Classifying 200k articles in 7 hours using NLP (2020) (HN)
- HN: Using GPT-3 to generate user interfaces (2020)
- Thread of GPT-3 use cases (2020)
- GPT-3 Code Experiments (Examples)
- How GPT3 Works - Visualizations and Animations (2020) (Lobsters) (HN)
- What is GPT-3? written in layman's terms (2020) (HN)
- GPT3 Examples (HN)
- DQI: Measuring Data Quality in NLP (2020)
- Humanloop - Train and deploy NLP. (HN)
- Do NLP Beyond English (2020) (HN)
- Giving GPT-3 a Turing Test (2020) (HN)
- Neural Network Methods for Natural Language Processing (2017)
- Tempering Expectations for GPT-3 and OpenAI’s API (2020)
- Philosophers on GPT-3 (2020) (HN)
- GPT-3 Explorer - Power tool for experimenting with GPT-3. (Code)
- Recent Advances in Natural Language Processing (2020) (HN)
- Project Insight - NLP as a Service. (Forum post)
- Bob Coecke: Quantum Natural Language Processing (QNLP) (2020) (Article)
- Language-Agnostic BERT Sentence Embedding (2020)
- Language Interpretability Tool (LIT) - Interactively analyze NLP models for model understanding in an extensible and framework agnostic interface.
- Booste Pre Trained Models - Free-to-use GPT-2 API. (HN)
- Context-theoretic Semantics for Natural Language: an Algebraic Framework (2007)
- THUNLP (Natural Language Processing Lab at Tsinghua University) research
- AI training method exceeds GPT-3 performance with fewer parameters (2020) (HN)
- BERT Attention Analysis
- Neural Modules and Models for Conversational AI (2020)
- BERTopic - Topic modeling technique that leverages BERT embeddings and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
- NLP Pandect - Comprehensive reference for all topics related to Natural Language Processing.
- Practical Natural Language Processing book (Code)
- NLP Research Project: Best Practices for Finetuning Large Transformer Language models (2020)
- Deep Learning for NLP notes (2020)
- Modern Practical Natural Language Processing course
- LXMERT: Learning Cross-Modality Encoder Representations from Transformers in PyTorch
- Awesome software for Text ML
- Pretrained Transformers for Text Ranking: BERT and Beyond (2020)
- SpaCy v3.0 Nightly (2020) (HN) (Tweet)
- Explore trained spaCy v3.0 pipelines
- spacy-streamlit - spaCy building blocks for Streamlit apps. (Tweet)
- Informers - State-of-the-art natural language processing for Ruby.
- How to Structure and Manage Natural Language Processing (NLP) Projects (2020)
- Sentence-BERT for spaCy - Wraps sentence-transformers (also known as sentence-BERT) directly in spaCy.
- Lingua Franca - Mycroft's multilingual text parsing and formatting library.
- Simple Transformers - Based on the Transformers library by HuggingFace. Lets you quickly train and evaluate Transformer models.
- Deep Bidirectional Transformers for Language Understanding (2020) - Explains a legendary paper, BERT. (HN)
- EasyTransfer - Designed to make the development of transfer learning in NLP applications easier.
- LambdaBERT - Transformers-style implementation of BERT using LambdaNetworks instead of self-attention.
- DialoGPT - State-of-the-Art Large-scale Pretrained Response Generation Model.
- Neural reading comprehension and beyond - Danqi Chen's Thesis (2020) (Code)
- LAMA: LAnguage Model Analysis - Probe for analyzing the factual and commonsense knowledge contained in pretrained language models.
- awesome-2vec - Curated list of 2vec-type embedding models.
- Rethinking Attention with Performers (2020) (HN)
- BERT Research - Key Concepts & Sources (2019)
- The Pile - Large, diverse, open source language modelling data set that consists of many smaller datasets combined together.
- Bort - Companion code for the paper "Optimal Subarchitecture Extraction for BERT."
- Vector AI - Encode And Deploy Vectors At The Edge. (Code)
- KeyBERT - Minimal keyword extraction with BERT. (Web)
- Multimodal Transformer for Unaligned Multimodal Language Sequences - In PyTorch.
- The Illustrated GPT-2 (Visualizing Transformer Language Models) (2020)
- A Primer in BERTology: What we know about how BERT works (2020) (HN)
- GPT Neo - Open-source GPT model, with pretrained 1.3B & 2.7B weight models. (HN)
- TextSynth - Bellard's free GPT-NeoX-20B, GPT-J playground and paid API. (Playground) (HN)
- How to Go from NLP in 1 Language to NLP in N Languages in One Shot (2020)
- Contextualized Topic Models - Family of topic models that use pre-trained representations of language (e.g., BERT) to support topic modeling.
- Language Style Transfer - Code for Style Transfer from Non-Parallel Text by Cross-Alignment paper.
- NLU - Power of Spark NLP, the Simplicity of Python. 1 line for hundreds of NLP models and algorithms.
- PyTorch Implementation of Google BERT
- High Performance Natural Language Processing (2020)
- duoBERT - Multi-stage passage ranking: monoBERT + duoBERT.
- Awesome GPT-3
- SMAC3 - Sequential Model-based Algorithm Configuration.
- Semantic Experiences by Google - Experiments in understanding language.
- Long-Range Arena - Systematic evaluation of efficient transformer models.
- PaddleHub - Awesome pre-trained models toolkit based on PaddlePaddle.
- DeepSPIN (Deep Structured Prediction in Natural Language Processing) (GitHub)
- Multi-Task Learning in NLP
- FastSeq - Provides efficient implementation of popular sequence models (e.g. Bart, ProphetNet) for text generation, summarization, translation tasks etc.
- Sentence Embeddings with BERT & XLNet
- FastFormers - Provides a set of recipes and methods to achieve highly efficient inference of Transformer models for Natural Language Understanding (NLU).
- Adversarial NLI - Adversarial Natural Language Inference Benchmark.
- textract - Extract text from any document. No muss. No fuss. (Docs)
- NLP and Named Entity Recognition (2020)
- Big Bird: Transformers for Longer Sequences
- NLP PyTorch Tutorial
- EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
- CrossWeigh: Training Named Entity Tagger from Imperfect Annotations (2019) (Code)
- Does GPT-2 Know Your Phone Number? (2020)
- Towards Fully Automated Manga Translation (2020)
- Text Classification Models - All kinds of text classification models and more with deep learning.
- Awesome Text Summarization
- Shortformer: Better Language Modeling using Shorter Inputs (2020) (HN)
- huggingface_hub - Client library to download and publish models and other files on the huggingface.co hub.
- Embeddings from the Ground Up (2020)
- Ecco - Tools to visualize and explore NLP language models. (Web) (HN)
- Interfaces for Explaining Transformer Language Models (2020)
- DALL·E: Creating Images from Text (2021) (HN) (Reddit)
- CLIP: Connecting Text and Images (2021) (HN) (Paper) (Code)
- OpenNRE - Open-Source Package for Neural Relation Extraction (NRE).
- Princeton NLP Group (GitHub)
- Must-read papers on neural relation extraction (NRE)
- FewRel Dataset, Toolkits and Baseline Models
- Tree Transformer: Integrating Tree Structures into Self-Attention (2019) (Code)
- SentEval: evaluation toolkit for sentence embeddings
- gpt-scrolls - Collaborative collection of open-source safe GPT-3 prompts that work well.
- SLING - A natural language frame semantics parser - Built to learn to read and understand Wikipedia articles in many languages for the purpose of knowledge base completion.
- Awesome Neural Adaptation in NLP
- Natural language generation: The commercial state of the art in 2020 (HN)
- Non-Autoregressive Generation Progress
- Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
- VecMap - Framework to learn cross-lingual word embedding mappings.
- Kiri - Natural Language Engine. (Web)
- GPT3 List - List of things that people are claiming is enabled by GPT3.
- DeBERTa - Decoding-enhanced BERT with Disentangled Attention.
- Sockeye - Open-source sequence-to-sequence framework for Neural Machine Translation based on Apache MXNet. (Docs)
- Robustness Gym - Python evaluation toolkit for natural language processing.
- State-of-the-Art Conversational AI with Transfer Learning
- GPT-Neo - GPT-3-sized model, open source and free. (HN) (Code)
- Deep Daze - Simple command line tool for text to image generation using OpenAI's CLIP and Siren (Implicit neural representation network).
- Notebooks using the Hugging Face libraries
- NLP Cloud - Serve spaCy pre-trained models, and your own custom models, through a RESTful API.
- CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters (2020) (Code)
- jiant - Multitask and transfer learning toolkit for NLP. (Web)
- Must-read Papers on Textual Adversarial Attack and Defense
- Reranker - Build Text Rerankers with Deep Language Models.
- rust-bert - Rust native ready-to-use NLP pipelines and transformer-based models (BERT, DistilBERT, GPT2,...).
- rust-tokenizers - Offers high-performance tokenizers for modern language models.
- Replicating GPT-2 at Home (2021) (HN)
- Shifterator - Interpretable data visualizations for understanding how texts differ at the word level.
- CMU Neural Networks for NLP Course (2021) (Videos)
- minnn - Exercise in developing a minimalist neural network toolkit for NLP.
- Controllable Sentence Simplification (2019) (Code)
- Awesome Relation Extraction
- retext - Natural language processor powered by plugins part of the unified collective. (Awesome)
- CLIP Playground - Try OpenAI's CLIP model in your browser.
- GPT-3 Demo - GPT-3 Examples, Demos, Showcase, and NLP Use-cases.
- Big Sleep - Simple command line tool for text to image generation, using OpenAI's CLIP and a BigGAN.
- Beyond the Imitation Game Benchmark (BIG-bench) - Collaborative benchmark intended to probe large language models, and extrapolate their future capabilities.
- AutoNLP - Automatic way to train, evaluate and deploy state-of-the-art NLP models for different tasks.
- DeText - Deep Neural Text Understanding Framework for Ranking and Classification Tasks.
- Paragraph Vectors in PyTorch
- NeuSpell: A Neural Spelling Correction Toolkit
- Natural Language YouTube Search - Search inside YouTube videos using natural language.
- Accelerate - Simple way to train and use NLP models with multi-GPU, TPU, mixed-precision.
- Classical Language Toolkit (CLTK) - Python library offering natural language processing (NLP) for pre-modern languages. (Web)
- Guide: Finetune GPT2-XL
- GENRE (Generative ENtity REtrieval) - Uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on fine-tuned BART architecture.
- Teachable NLP - GPT-2 Training as a Service.
- DensePhrases - Provides answers to your natural language questions from the entire Wikipedia in real-time.
- How to use GPT-3 recursively to solve general problems (2021)
- Podium - Framework agnostic Python NLP library for data loading and preprocessing.
- Prompts - Advanced GPT-3 playground. (Code)
- TextFlint - Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing.
- SimCSE: Simple Contrastive Learning of Sentence Embeddings (2021) (Code)
- Berkeley Neural Parser - High-accuracy NLP parser with models for 11 languages. (Web)
- nlpaug - Data augmentation for NLP.
- Top2Vec - Learns jointly embedded topic, document and word vectors.
- Focused Attention Improves Document-Grounded Generation (2021) (Code)
- NLPretext - All the go-to functions you need to handle NLP use-cases.
- spaCy + UDPipe
- adapter-transformers - Friendly fork of HuggingFace's Transformers, adding Adapters to PyTorch language models.
- TextAttack - Generating adversarial examples for NLP models.
- GPT-NeoX - Implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library.
- Transfer Learning in Natural Language Processing (2019) (Code)
- Cohere - Help computers understand language. (Tweet)
- Transformers Interpret - Model explainability tool designed to work exclusively with the transformers package.
- Whatlang - Natural language detection library for Rust. (Web)
- Category Theory + NLP Papers
- UniLM - Pre-trained models for natural language understanding (NLU) and generation (NLG) tasks. (HN)
- AutoNLP - Faster and easier training and deployments of SOTA NLP models.
- TAble PArSing (TAPAS) - End-to-end neural table-text understanding models.
- Replacing Bert Self-Attention with Fourier Transform: 92% Accuracy, 7X Faster (2021)
- FNet: Mixing Tokens with Fourier Transforms (2021) (Tweet)
- True Few-Shot Learning with Language Models (2021) (Tweet) (Code)
- End-to-end NLP workflows from prototype to production (Web)
- Haystack - End-to-end Python framework for building natural language search interfaces to data. (HN)
- PLMpapers - Must-read Papers on pre-trained language models.
- English-to-Spanish translation with a sequence-to-sequence Transformer in Keras
- Evaluation Harness for Large Language Models - Framework for few-shot evaluation of autoregressive language models.
- MLP GPT - Jax - GPT, made only of MLPs, in Jax.
- Few-Shot Question Answering by Pretraining Span Selection (2021) (Code)
- Neural Extractive Search (2021) (Demo)
- Hugging Face NLP Course (Code)
- SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation.
- LoRA: Low-Rank Adaptation of Large Language Models (2021) (Code) (Code) (HN)
- PromptPapers - Must-read papers on prompt-based tuning for pre-trained language models.
- Obsei - Automation tool for text analysis needs.
- Evaluating Large Language Models Trained on Code (2021) (Code)
- Survey of Surveys for Natural Language Processing (SOS4NLP)
- CLIP guided diffusion
- Data driven literary analysis
- DALL·E Mini - Generate images from a text prompt.
- Jury - Evaluation for Natural Language Generation.
- Rubrix - Free and open-source tool to explore, label, and monitor data for NLP projects.
- Knowledge Neurons in Pretrained Transformers (2021) (Code) (Code)
- OpenCLIP - Open source implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training).
- Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning (2021) (Code)
- Can a Fruit Fly Learn Word Embeddings? (2021)
- Spark NLP - Natural Language Processing library built on top of Apache Spark ML. (Web)
- Spark NLP Workshop - Showcasing notebooks and codes of how to use Spark NLP in Python and Scala.
- ConceptNet Numberbatch - Set of semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings.
- OpenAI Codex - AI system that translates natural language to code. (HN)
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (2021)
- NL-Augmenter - Collaborative Repository of Natural Language Transformations.
- wevi - Word embedding visual inspector. (Code)
- clip-retrieval - Easily computing clip embeddings and building a clip retrieval system with them.
- NVIDIA NeMo - Toolkit for conversational AI.
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
- BEIR - Heterogeneous benchmark containing diverse IR tasks. It also provides a common and easy framework for evaluation of your NLP-based retrieval models within the benchmark.
- UER-py - Open Source Pre-training Model Framework in PyTorch & Pre-trained Model Zoo.
- ExplainaBoard - Explainable Leaderboard for NLP.
- Fast-BERT - Super easy library for BERT based NLP models.
- Genie Toolkit - Generator of Natural Language Parsers for Compositional Virtual Assistants. (Paper)
- Quantum Stat - Your NLP Model Training Platform.
- Mistral - Framework for transparent and accessible large-scale language model training, built with Hugging Face. (Docs)
- NERDA - Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks.
- Data Augmentation Techniques for NLP
- Feed forward VQGAN-CLIP model
- Yet Another Keyword Extractor (Yake) - Unsupervised Approach for Automatic Keyword Extraction using Text Features.
- Challenges in Detoxifying Language Models (2021) (Tweet)
- TextBrewer - PyTorch-based model distillation toolkit for natural language processing.
- GPT-3 Models are Poor Few-Shot Learners in the Biomedical Domain (2021)
- PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models (2021) (Code)
- VQGAN-CLIP Overview - Repo for running VQGAN+CLIP locally.
- TLDR: Extreme Summarization of Scientific Documents (2020) (Code)
- Can Language Models be Biomedical Knowledge Bases? (2021)
- ColBERT: Contextualized Late Interaction over BERT (2020)
- Investigating Pretrained Language Models for Graph-to-Text Generation (2020) (Code)
- Ubiquitous Knowledge Processing Lab (GitHub)
- DedupliPy - Python package for deduplication/entity resolution using active learning.
- Flexible Generation of Natural Language Deductions (2021) (Code)
- Machine Translation Reading List
- Compressive Transformers for Long-Range Sequence Modelling (2020) (Code)
- pyxclib - Tools for multi-label classification problems.
- ELECTRA - Pre-training Text Encoders as Discriminators Rather Than Generators.
- OpenPrompt - Open-Source Toolkit for Prompt-Learning.
- Unsupervised Neural Machine Translation with Generative Language Models Only (2021) (Tweet)
- Grounding Spatio-Temporal Language with Transformers (2021) (Code)
- Fast Sentence Embeddings (fse) - Compute Sentence Embeddings Fast.
- Symbolic Knowledge Distillation: from General Language Models to Commonsense Models (2021)
- Surge AI - Build powerful NLP datasets using our global labeling force and platform. (Python SDK)
- Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels (Code)
- ogen - OpenAPI v3 code generator for go.
- PromptSource - Toolkit for collecting and applying prompts to NLP datasets. (Web) (HN)
- Creating User Interface Mock-ups from High-Level Text Descriptions with Deep-Learning Models (2021)
- Filtlong - Tool for filtering long reads by quality. It can take a set of long reads and produce a smaller, better subset.
- Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System (2021) (Code)
- xFormers - Hackable and optimized Transformers building blocks, supporting a composable construction.
- Language Models As or For Knowledge Bases (2021)
- Wikipedia2Vec - Tool for learning vector representations of words and entities from Wikipedia. (Code)
- Reflections on Foundation Models (2021) (Tweet)
- textacy - NLP, before and after spaCy.
- Natural Language Processing Specialization Course (Tweet)
- Hugging Face on Amazon SageMaker Workshop
- CS224N: Natural Language Processing with Deep Learning | Winter 2021 - YouTube
- GPT-3 creates geofoam, but out of text (2021)
- Towards Efficient NLP: A Standard Evaluation and A Strong Baseline (2021) (Code)
- Hierarchical Transformers Are More Efficient Language Models (2021) (HN) (Code)
- Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration (2021) (Code)
- GPT-3 is no longer the only game in town (2021) (HN)
- PatrickStar - Parallel Training of Large Language Models via a Chunk-based Memory Management.
- Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) (2021)
- Text2Art - AI Powered Text-to-Art Generator.
- Emergent Communication of Generalizations (2021) (Code)
- Awesome Pretrained Models for Information Retrieval
- SummerTime - Text Summarization Toolkit for Non-experts.
- NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework (2021) (Code)
- Differentially Private Fine-tuning of Language Models (2021) (Tweet)
- TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning (2021) (Code)
- Aphantasia - CLIP + FFT/DWT/RGB = text to image/video.
- OpenAI’s API Now Available with No Waitlist (2021) (HN)
- Recent trends of Entity Linking, Disambiguation, and Representation
- Intro to Large Language Models with Cohere
- spacy-experimental - Cutting-edge experimental spaCy components and features.
- AdaptNLP - High level framework and library for running, training, and deploying state-of-the-art Natural Language Processing (NLP) models for end to end tasks. (Docs)
- Reading list for Awesome Sentiment Analysis papers
- Aspect-Based-Sentiment-Analysis: Transformer & Explainable ML (TensorFlow)
- Deploy optimized transformer based models in production
- PyConverse - Conversational text Analysis using various NLP techniques.
- KILT - Library for Knowledge Intensive Language Tasks.
- RoFormer: Enhanced Transformer with Rotary Position Embedding (2021) (Code)
- N-grammer: Augmenting Transformers with latent n-grams (2021) (Code)
- textsearch - Find strings/words in text; convenience and C speed.
- Mastering spaCy Book (2021) (Code)
- sense2vec - Contextually-keyed word vectors.
- Pureformer: Do We Even Need Attention? (2021)
- Knover - Toolkit for knowledge grounded dialogue generation based on PaddlePaddle.
- Language Modelling at Scale: Gopher, Ethical considerations, and Retrieval | DeepMind (2021) (HN)
- CMU Advanced NLP 2021 - YouTube
- CMU Advanced NLP 2022 - YouTube (Tweet)
- whatlies - Toolkit to help understand "what lies" in word embeddings. Also benchmarking.
- CLIP-Guided-Diffusion
- Factual Probing Is [MASK]: Learning vs. Learning to Recall (2021) (Code)
- Improving Compositional Generalization with Latent Structure and Data Augmentation (2021)
- PORORO - Platform Of neuRal mOdels for natuRal language prOcessing.
- PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization (2021) (Code)
- To Understand Language Is to Understand Generalization (2021) (HN)
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (2020) (Code)
- Multimodal Transformers | Transformers with Tabular Data (Article)
- Learn to Resolve Conversational Dependency: A Consistency Training Framework for Conversational Question Answering (2021) (Code)
- Improving Language Models by Retrieving from Trillions of Tokens (2021)
- Open Information Extraction (OIE) Resources
- Deeper Text Understanding for IR with Contextual Neural Language Modeling (2019) (Code)
- x-clip - Concise but complete implementation of CLIP with various experimental improvements from recent papers.
- Calamity - Self-hosted GPT playground.
- VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation (2021) (Code)
- Transactions of the Association for Computational Linguistics (2021) (Code)
- DocEE - Toolkit for document-level event extraction, containing some SOTA model implementations.
- Autoregressive Entity Retrieval (2020)
- Generate, Delete and Rewrite: A Three-Stage Framework for Improving Persona Consistency of Dialogue Generation (2020)
- A Span-Based Model for Joint Overlapped and Discontinuous Named Entity Recognition (2021)
- Deduplicating Training Data Makes Language Models Better (2021) (Code)
- Transformers without Tears: Improving the Normalization of Self-Attention (2019) (Code)
- CTCDecoder - Connectionist Temporal Classification (CTC) decoding algorithms: best path, beam search, lexicon search, prefix search, and token passing. Implemented in Python.
- Custom Named Entity Recognition with Spacy3
- BARTScore: Evaluating Generated Text as Text Generation (2021) (Code)
- minDALL-E on Conceptual Captions - PyTorch implementation of a 1.3B text-to-image generation model trained on 14 million image-text pairs.
- Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation (2021) (Code)
- Multitask Prompted Training Enables Zero-Shot Task Generalization (2021) (Code)
- spaCy models - Models for the spaCy Natural Language Processing (NLP) library.
- Awesome Huggingface
- SyntaxDot - Neural syntax annotator, supporting sequence labeling, lemmatization, and dependency parsing.
- STriP Net - Semantic Similarity of Scientific Papers (S3P) Network.
- Small-Text - Active Learning for Text Classification in Python.
- Plug and Play Language Models: A Simple Approach to Controlled Text Generation (2020) (Code)
- RuDOLPH - One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP.
- PLM papers - Paper list of pre-trained language models (PLMs).
- Ongoing research training transformer language models at scale, including: BERT & GPT-2
- Improving language models by retrieving from trillions of tokens (2022) (Code)
- EntitySeg Toolbox - Towards precise and open-world image segmentation.
- Aligning Language Models to Follow Instructions (2022) (Tweet) (Code)
- Simple Questions Generate Named Entity Recognition Datasets (2021) (Code)
- KRED: Knowledge-Aware Document Representation for News Recommendations (2019) (Code)
- Stanford Open Information Extraction
- Python3 wrapper for Stanford OpenIE
- I-BERT: Integer-only BERT Quantization (2021) (Code)
- spaCy-wrap - Wrapping fine-tuned transformers in spaCy pipelines.
- DeepMatcher - Python package for performing Entity and Text Matching using Deep Learning.
- Machine Reading Comprehension: The Role of Contextualized Language Models and Beyond (2020) (Code)
- medspacy - Library for clinical NLP with spaCy.
- Natural Language Processing with Transformers Book (Code)
- blurr - Library that integrates huggingface transformers with the world of fastai, giving fastai devs everything they need to train, evaluate, and deploy transformer specific models.
- HanLP - Multilingual NLP library for researchers and companies, built on PyTorch and TensorFlow 2.x.
- Awesome Text-to-Image
- NLP News Newsletter
- Named Entity Recognition as Dependency Parsing (2020) (Code)
- Multilingual-CLIP - OpenAI CLIP text encoders for any language.
- FasterTransformer - Transformer related optimization, including BERT, GPT.
- Papers about Causal Inference and Language
- EET (Easy and Efficient Transformer) - Efficient PyTorch inference plugin focus on Transformer-based models with large model sizes and long sequences.
- Measuring Massive Multitask Language Understanding (2021) (Code)
- A Theoretical Analysis of the Repetition Problem in Text Generation (2021) (Code)
- TransformerSum - Models to perform neural summarization (extractive and abstractive) using machine learning transformers and a tool to convert abstractive summarization datasets to the extractive task.
- Transformer Memory as a Differentiable Search Index (2022) (HN) (Tweet)
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators (2020) (Code)
- spaCy + Stanza - Use the latest Stanza (StanfordNLP) research models directly in spaCy.
- Awesome Document Understanding
- Sequential Transformer - Code for training Transformers on sequential tasks such as language modeling.
- bert-as-service - Mapping a variable-length sentence to a fixed-length vector using BERT model.
- A Contrastive Framework for Neural Text Generation (2022) (Code)
- Parallax - Tool for interactive embeddings visualization.
- Serve PyTorch model as an API using AWS + serverless framework
- Neural reality of argument structure constructions (2022)
- DeepNet: Scaling Transformers to 1,000 Layers (2022) (HN)
- Large Models of Source Code - Guide to using pre-trained large language models of source code.
- HyperMixer: An MLP-based Green AI Alternative to Transformers (2022)
- NLP Course Material & QA
- Survey of Surveys (NLP & ML) - Collection of 700+ survey papers on Natural Language Processing (NLP) and Machine Learning (ML).
- Awesome CLIP - Awesome list for research on CLIP (Contrastive Language-Image Pre-Training).
- MAGMA - GPT-style multimodal model that can understand any combination of images and language.
- Timexy - spaCy custom component that extracts and normalizes temporal expressions.
- New Capabilities for GPT-3: Edit and Insert (2022) (HN)
- Which hardware to train a 176B parameters model? (2022) (Tweet)
- Fundamentals of NLP - Series of hands-on notebooks for learning the fundamentals of NLP.
- BertViz - Visualize Attention in Transformer Models (BERT, GPT2, BART, etc.).
- Attention Is All You Need (2017) (Code) (PyTorch Code)
- Word2Vec Explained. Explaining the Intuition of Word2Vec (2021) (HN)
- imgbeddings - Python package to generate image embeddings with CLIP without PyTorch/TensorFlow.
- Linking Emergent and Natural Languages via Corpus Transfer (2022)
- Transformer Inference Arithmetic (2022)
- Training Compute-Optimal Large Language Models (2022) (Tweet)
- KeyphraseVectorizers - Set of vectorizers that extract keyphrases with part-of-speech patterns from a collection of text documents and convert them into a document-keyphrase matrix.
- Gramformer - Framework for detecting, highlighting and correcting grammatical errors on natural language text.
- Classy Classification - Easy and intuitive approach to few-shot classification using sentence-transformers or spaCy models, or zero-shot classification with Huggingface.
- Sphere - Web-scale retrieval for knowledge-intensive NLP.
- muTransformers - Common Huggingface transformers in maximal update parametrization (µP).
- Event Extraction papers - List of NLP resources focused on event extraction task.
- Summarization Papers
- GLID-3 - Combination of OpenAI GLIDE, Latent Diffusion and CLIP.
- Optimum Transformers - Accelerated NLP pipelines for fast inference on CPU and GPU. Built with Transformers, Optimum and ONNX Runtime.
- Pathways Language Model (PaLM): Scaling to 540B parameters (2022) (HN) (Code) (Code)
- A Divide-and-Conquer Approach to the Summarization of Long Documents (2020) (Code)
- Resources for learning about Text Mining and Natural Language Processing
- LinkBERT: Pretraining Language Models with Document Links (2022) (Code)
- Dall-E 2 (2022) (HN) (Tweet) (Tweet) (Code) (Code) (Code) (Tweet) (Tweet) (HN) (Video Summary) (HN) (Tweet)
- Variations of the Similarity Function of TextRank for Automated Summarization (2016) (Code)
- Logic-Guided Data Augmentation and Regularization for Consistent Question Answering (2020) (Code)
- Awesome Knowledge Distillation
- You Only Look at One Sequence (2021)
- Towards Understanding and Mitigating Social Biases in Language Models (2021) (Code)
- DialogLM: Pre-trained Model for Long Dialogue Understanding and Summarization (2021) (Code)
- Humanloop Programmatic - Create large high-quality datasets for NLP in minutes. No hand labelling required. (HN)
- Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language (2022)
- Second order effects of the rise of large language models (2022)
- Simple Annotated implementation of GPT-NeoX in PyTorch
- BLEURT: Learning Robust Metrics for Text Generation (2020) (Code)
- Bootleg - Self-supervised named entity disambiguation (NED) system that links mentions in text to entities in a knowledge base. (Code)
- DALL-E in Mesh-TensorFlow
- A few things to try with DALL·E (2022) (HN)
- Google's 540B PaLM Language Model & OpenAI's DALL-E 2 Text-to-Image Revolution (2022)
- Turn the Combination Lock: Learnable Textual Backdoor Attacks via Word Substitution (2021) (Code)
- Simple and Effective Multi-Paragraph Reading Comprehension (2017) (Code)
- Researchers Glimpse How AI Gets So Good at Language Processing (2022)
- Cornell Conversational Analysis Toolkit (ConvoKit) - Toolkit for extracting conversational features and analyzing social phenomena in conversations.
- UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models (2022) (Code)
- exBERT - Visual Analysis Tool to Explore Learned Representations in Transformers Models.
- How DALL-E 2 Works (2022) (HN)
- Getting started with NLP for absolute beginners (2022)
- EasyNLP - Comprehensive and Easy-to-use NLP Toolkit.
- Reframing Human-AI Collaboration for Generating Free-Text Explanations (2021) (Tweet)
- Detoxify - Comment Classification with PyTorch Lightning and Transformers.
- DLATK - End to end human text analysis package, specifically suited for social media and social scientific applications.
- Language modeling via stochastic processes (2022) (Code)
- An Enhanced Span-based Decomposition Method for Few-Shot Sequence Labeling (2022) (Code)
- SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization (2021) (Code)
- DataLab - Unified platform that allows for NLP researchers to perform a number of data-related tasks in an efficient and easy-to-use manner.
- Limitations of DALL-E (HN)
- AutoPrompt - Automatic Prompt Construction for Masked Language Models.
- DALL·E Flow - Human-in-the-Loop workflow for creating HD images from text.
- Recon NER - Debug and correct annotated Named Entity Recognition (NER) data for inconsistencies and get insights on improving the quality of your data.
- CausalNLP - Practical toolkit for causal inference with text as treatment, outcome, or "controlled-for" variable.
- OPT: Open Pre-trained Transformer Language Models (2022) - Meta's 175B parameter language model. (Reddit) (Tweet)
- Bert Extractive Summarizer - Easy to use extractive text summarization with BERT.
- Dialogue Response Ranking Training with Large-Scale Human Feedback Data (2020) (Code)
- LM-Debugger - Interactive tool for inspection and intervention in transformer-based language models.
- 100 Pages of raw notes released with the language model OPT-175 (HN)
- Unsupervised Cross-Task Generalization via Retrieval Augmentation (2022) (Code)
- On Continual Model Refinement in Out-of-Distribution Data Streams (2022)
- GLID-3-XL - 1.4B latent diffusion model from CompVis back-ported to the guided diffusion codebase.
- Neutralizing Subjectivity Bias with HuggingFace Transformers (2022)
- Entropy-based Attention Regularization Frees Unintended Bias Mitigation from Lists (2022) (Code) (Tweet)
- gse - Go efficient multilingual NLP and text segmentation; support english, chinese, japanese and other.
- BERTopic: The Future of Topic Modeling (2022) (HN)
- Unifying Language Learning Paradigms (2022) (Code)
- GLM: General Language Model Pretraining with Autoregressive Blank Infilling (2021) (Code)
- GPT-3 limitations (2022)
- Natural Language Processing Demystified
- Concise Concepts - Contains an easy and intuitive approach to few-shot NER using most similar expansion over spaCy embeddings. Now with entity scoring.
- Dynamic language understanding: adaptation to new knowledge in parametric and semi-parametric models (2022) (Tweet)
- nlprule - Fast, low-resource Natural Language Processing and Text Correction library written in Rust.
- Quark: Controllable Text Generation with Reinforced Unlearning (2022) (Tweet)
- DALL-E 2 has a secret language (HN) (Tweet) (HN)
- AdaTest - Find and fix bugs in natural language machine learning models using adaptive testing.
- Diffusion-LM Improves Controllable Text Generation (2022) (Code) (Tweet)
- RnG-KBQA: Generation Augmented Iterative Ranking for Knowledge Base Question Answering (2021) (Code)
- Neural Prompt Search - Searching prompt modules for parameter-efficient transfer learning.
- makemore - Most accessible way of tinkering with a GPT - one hackable script.
- DALL-E Playground - Playground for DALL-E enthusiasts to tinker with the open-source version of OpenAI's DALL-E, based on DALL-E Mini.
- Craiyon - AI model drawing images from any prompt. Formerly DALL-E mini.
- Contrastive Learning for Natural Language Processing
- MSCTD: A Multimodal Sentiment Chat Translation Dataset (Code)
- Auto-Lambda: Disentangling Dynamic Task Relationships (2022) (Code)
- Concepts in Neural Networks for NLP
- DinkyTrain - Princeton NLP's pre-training library based on fairseq with DeepSpeed kernel integration.
- Pretrained Language Models
- BERT-of-Theseus: Compressing BERT by Progressive Module Replacing (2020) (Code)
- YaLM 100B - GPT-like neural network for generating and processing text by Yandex. (HN) (Article)
- Pathways Autoregressive Text-to-Image model (Parti) - Autoregressive text-to-image generation model that achieves high-fidelity photorealistic image generation and supports content-rich synthesis involving complex compositions and world knowledge. (Web) (HN)
- How Imagen Actually Works (2022)
- First impressions of DALL-E, generating images from text (2022) (Lobsters)
- Meta is inviting researchers to pick apart the flaws in its version of GPT-3 (2022) (HN)
- 'Making Moves' In DALL·E mini (2022)
- min(DALL·E) - Minimal implementation of DALL·E Mini. It has been stripped to the bare essentials necessary for doing inference, and converted to PyTorch.
- Awesome Document Similarity Measures
- RETRO Is Blazingly Fast (2022)
- LightOn - Unlock Extreme-Scale Machine Intelligence. Most repos are focused on the use of photonic hardware. (GitHub)
- Minerva: Solving Quantitative Reasoning Problems with Language Models (2022) (Paper)
- winkNLP - Developer friendly Natural Language Processing. (Docs)
- Facebook Low Resource (FLoRes) MT Benchmark
- Using GPT-3 to explain how code works (2022) (Lobsters) (HN)
- Awesome Topic Models
- Introducing The World’s Largest Open Multilingual Language Model: BLOOM
- The DALL·E 2 Prompt Book (HN) (Tweet)
- RWKV - RNN with Transformer-level performance, which can also be directly trained like a GPT transformer (parallelizable).
- Kern AI - Open-source IDE for data-centric NLP. Combining programmatic labeling, extensive data management and neural search capabilities. (Code) (HN)
- spaCy fishing - spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata.
- DALL·E Now Available in Beta (2022) (HN)
- Inside language models (from GPT-3 to PaLM)
- Timeline of AI and language models
- Cascades - Python library which enables complex compositions of language models such as scratchpads, chain of thought, tool use, selection-inference, and more.
- Awesome Neural Symbolic
- Towards Knowledge-Based Recommender Dialog System (2019) (Code)
- Asent - Rule-based sentiment analysis library for Python made using SpaCy.
- extractacy - Pattern extraction and named entity linking for spaCy.
- A Hazard Analysis Framework for Code Synthesis Large Language Models (2022)
- Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? (2022) (Code)
- A Frustratingly Easy Approach for Entity and Relation Extraction (2021) (Code)
- Chinchilla's Wild Implications (2022) (HN)
- DALL·E 2 prompt book (2022) (HN)
- GLM-130B - Open Bilingual Pre-Trained Model.
- An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion (2022) (Code)
- DALL-E + GPT-3 = ♥ (2022) (HN)
- Run your own DALL-E-like image generator (2022) (HN)
- Stable Diffusion launch announcement (2022) (HN)
- Stable Diffusion
- MidJourney Styles and Keywords Reference
- Spent $15 in DALL·E 2 credits creating this AI image (2022) (HN)
- Phraser - Better way to generate prompts.
- Seminar on Large Language Models (2022)
- DocQuery - Document Query Engine Powered by NLP. (Article) (Tweet)
- Petals - Decentralized platform for running 100B+ language models. (Web) (HN) (HN)
- LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (2022) (Code)
- ekphrasis - Text processing tool, geared towards text from social networks.
- ALToolbox - Framework for practical active learning in NLP.
- Tools and scripts for experimenting with Transformers: Bert, T5
- Action Transformer (ACT-1) model in action
- Label Sleuth - Open source no-code system for text annotation and building text classifiers.
- Vectoring Words (Word Embeddings) (2022)
- CodeGeeX: A Multilingual Code Generative Model (2022)
- The first neural machine translation system for the Erzya language (2022) (Code)
- Awesome Efficient PLM Papers
- Polyglot: Large Language Models of Well-balanced Competence in Multi-languages
- Interactive Composition Explorer - Python library and trace visualizer for language model programs.
- TrAVis: Transformer Attention Visualizer (Code)
- Knowledge Unlearning for Mitigating Privacy Risks in Language Models (2022) (Code)
- SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model (2022) (Code)
- End-to-end Neural Coreference Resolution in spaCy (2022)
- Ask Me Anything: A simple strategy for prompting language models
- Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval (2022) (Code)
- A Kernel-Based View of Language Model Fine-Tuning (2022) (Code)
- Large Language Models are few(1)-shot Table Reasoners (2022) (Tweet)
- The Importance of Being Parameters: An Intra-Distillation Method for Serious Gains (2022) (Code)
- Binding Language Models in Symbolic Languages (2022) (Code)
- ML and text manipulation tools (2022)
- Table-To-Text generation and pre-training with TabT5 (2022)
- concepCy - SpaCy wrapper for ConceptNet.
- AliceMind - ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab.
- CrossRE: A Cross-Domain Dataset for Relation Extraction (2022) (Code)
- Scaling Instruction-Finetuned Language Models (2022) (Tweet) (Tweet)
- Large Language Models Can Self-Improve (2022) (Tweet)
- Everyprompt - Playground for GPT-3. (Tweet)
- Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing (2021) (Tweet)
- Composable Text Controls in Latent Space with ODEs (2022) (Code)
- flashgeotext - Extract city and country mentions from text like GeoText, but using FlashText (an Aho-Corasick implementation) instead of regex.
- lm-scorer - Language Model based sentences scoring library.
- CodeT: Code Generation with Generated Tests
- Bloom - BigScience Large Open-science Open-access Multilingual Language Model. (Tweet)
- Prompts - Free and open-source (FOSS) curation of prompts for OpenAI’s GPT-3, EleutherAI’s GPT-j, and other LMs.
- FSNER - Few-shot Named Entity Recognition.
- Ilya Sutskever (OpenAI): What's Next for Large Language Models (LLMs) (2022)
- Galactica - General-purpose scientific language model. It is trained on a large corpus of scientific text and data. (Code) (Tweet)
- Three-level Hierarchical Transformer Networks for Long-sequence and Multiple Clinical Documents Classification (2021) (Code)
- WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models (2022) (Code)
- Convenient Text-to-Text Training for Transformers
- Homophone Reveals the Truth: A Reality Check for Speech2Vec (2022) (Code)
- RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder (2022) (Code)
- Generate conversation starters given two personalities using AI
- MetaICL: Learning to Learn In Context (2021) (Code)
- PAL: Program-aided Language Models (2022) (Code)
- ReAct: Synergizing Reasoning and Acting in Language Models (2022) (Code)
- CogIE - Information Extraction Toolkit for Bridging Text and CogNet.
- T-NER - All-Round Python Library for Transformer-based Named Entity Recognition.
- mGPT: Multilingual Generative Pretrained Transformer
- LangChain - Building applications with LLMs through composability. (HN)
- HN Summary - Summarizes top stories from Hacker News using a large language model and posts them to a Telegram channel. (HN)
- OpenAI Model index for researchers
- ChatGPT
- Adventures in generating music via ChatGPT text prompts (2022)
- All the best examples of ChatGPT, from OpenAI
- ChatGPT nice examples
- WhatsApp-GPT
- What ChatGPT features/improvements do you want?
- Summarize-Webpage - Small NLP SaaS project that summarizes a webpage.
- Nonparametric Masked Language Modeling (2022) - 500x fewer parameters than GPT-3 while outperforming it on zero-shot tasks. (Reddit) (Code)
- Holistic Evaluation of Language Models - Framework to increase the transparency of language models. (Paper)
- Dramatron - Uses large language models to generate long, coherent text and could be useful for authors for co-writing theatre scripts and screenplays. (HN)
- ExtremeBERT - Toolkit that accelerates the pretraining of customized language models on customized datasets.
- Talking About Large Language Models (2022) (HN) (Tweet)
- The GPT-3 Architecture, on a Napkin (2022) (HN)
- Discovering Latent Knowledge in Language Models Without Supervision (2022) (HN)
- Lightning GPT
- Bricks - Open-source natural language enrichments at your fingertips.
- GPT-2 Output Detector
- Language Model Operationalization
- NLQuery - Natural language query engine on WikiData.
- Categorical Tools for Natural Language Processing (2022)
- Historical analogies for large language models (2022) (Tweet)
- CMU Advanced NLP Assignment: End-to-end NLP System Building
- New and Improved Embedding Model for OpenAI (2022) (HN)
- GPT-NeoX (HN)
- OpenAI Cookbook - Examples and guides for using the OpenAI API.
- OpenAI Question Answering using Embeddings
- GreaseLM: Graph REASoning Enhanced Language Models for Question Answering (2022) (Code)
- Rank-One Model Editing (ROME) - Locating and editing factual associations in GPT.
- Open Assistant - Give everyone access to a great chat based large language model. (Web) (HN)
- Characterizing Emergent Phenomena in Large Language Models (2022)
- SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features (2022) (Code)
- Blob - Powerful tool that uses large language models (LLMs) to assist in the creation and maintenance of software projects.
- Chain of Thought Prompting Elicits Reasoning in Large Language Models (2022) (Code)
- Compress-fastText - Python 3 package allows to compress fastText word embedding models.
- Large Language Models are Zero-Shot Reasoners (2022)
- MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning (2022)
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (2022) (Code)
- Improving Language Model Behavior by Training on a Curated Dataset (2021)
- Reasoning in Large Language Models
- SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization (2022) (Code)
- Happy Transformer - Package built on top of Hugging Face's transformers library that makes it easy to utilize state-of-the-art NLP models.
- TextBox - Text generation library with pre-trained language models.
- Advances in Neural Information Processing Systems 30 (NIPS 2017)
- Poincaré Embeddings for Learning Hierarchical Representations (2017) (Code)
- llm-strategy - Implementing the Strategy Pattern using LLMs.
- Zshot - Zero and Few shot named entity & relationships recognition.
- Cramming: Training a Language Model on a Single GPU in One Day (2022) (Code)
- Trend starts from "Chain of Thought Prompting Elicits Reasoning in Large Language Models"
- Training language models to follow instructions with human feedback (2022) (Web) (Code)
- Lila: A Unified Benchmark for Mathematical Reasoning (2022)
- LibMultiLabel - Library for Multi-class and Multi-label Text Classification.
- Paper Notes on Pretrain Language Models with Factual Knowledge
- Atlas: Few-shot Learning with Retrieval Augmented Language Models (2022) (Code)
- Some Remarks on Large Language Models (2023) (HN)
- Massive Language Models Can Be Accurately Pruned in One-Shot (2023) (Reddit)
- LM Identifier - Toolkit for identifying pretrained language models from potentially AI-generated text.
- BRIO: Bringing Order to Abstractive Summarization (2022) (Code)
- DOC: Improving Long Story Coherence With Detailed Outline Control (2022) (Code)
- InPars: Data Augmentation for Information Retrieval using Large Language Models (2022) (Code)
- Unified Structure Generation for Universal Information Extraction (2022) (Code)
- Awesome Resource for NLP
- PromptArray: A Prompting Language for Neural Text Generators
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (2022) (Code)
- Multi Task NLP - Utility toolkit enabling NLP developers to easily train and infer a single model for multiple tasks.
- FairSeq with Apollo optimizer
- TFKit - Handling multiple NLP tasks in one pipeline.
- ReAct: Synergizing Reasoning and Acting in Language Models (2022)
- Repository of Language Instructions for NLP Tasks
- tasksource - Datasets curation and datasets metadata for NLP extreme multitask learning.
- ChatLangChain - Implementation of a chatbot specifically focused on question answering over the LangChain documentation.
- summaries - Toolkit for summarization analysis and aspect-based summarizers.
- SymbolicAI - Neuro-Symbolic Perspective on Large Language Models (LLMs).
- PEFT - Parameter-Efficient Fine-Tuning.
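
A minimal sketch of what parameter-efficient fine-tuning looks like with `peft`'s LoRA support: wrap a causal LM so only small adapter matrices are trained. The base model, target modules, and hyperparameters below are illustrative assumptions.

```python
# Sketch: LoRA adapters via peft; model name, target modules and hyperparameters
# are assumptions chosen for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_name = "EleutherAI/gpt-neo-125m"  # small model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections for this model family
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights is trainable

# The wrapped model can now be trained with the usual Trainer / training loop;
# only the LoRA adapter weights receive gradient updates.
```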
- Large Transformer Model Inference Optimization (2023) (HN)
- Embed-VTT - Generate & query embeddings from VTT files using OpenAI & Pinecone, applied to Andrej Karpathy's latest GPT tutorial.
- CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation (2021) (Code)
- Awesome LLM Engineering
- Minimal GPT-NeoX-20B in PyTorch
- Language Models of Code are Few-Shot Commonsense Learners (2022) (Code)
- Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP (2022) (Code)
- DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations (2022) (Code)
- LangChainHub (Article)
- NLP-Cube - Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing.
- Dust - Design and Deploy Large Language Models Apps. (Code) (Twitter)
- Awesome papers on Language-Model-as-a-Service (LMaaS)
- Sentences - Command line sentence tokenizer.
- Diff Models – A New Way to Edit Code (2023) (HN)
- MegaBlocks - Light-weight library for mixture-of-experts (MoE) training.
- Read Pilot - Analyzes online articles and generate Q&A cards for you. Powered by OpenAI & Next.js. (Code)
- Promptify - Prompt engineering toolkit: solve NLP problems with LLMs & easily generate prompts for different NLP tasks.
- polymath - Utility that uses AI to intelligently answer free-form questions based on a particular library of content.
- Incorporating External Knowledge through Pre-training for Natural Language to Code Generation (2020) (Code)
- Longformer: The Long-Document Transformer (2020) (Code)
- ProbSem - Probabilistic semantic parsing with program synthesis LLMs.
- Generate rather than Retrieve: Large Language Models are Strong Context Generators (2023) (Code)
- Text Generation Inference - Large Language Model Text Generation Inference.
- New AI classifier for indicating AI-written text (2023) (HN)
- DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature (2023) (HN)
- Towards Continual Knowledge Learning of Language Models (2022) (Code)
- AI Text Classifier - OpenAI API
- Fine-tuning GPT-J and other GPT models
- Adversarial Prompts
- Ignore Previous Prompt: Attack Techniques For Language Models (2022) (Code)
- Multimodal Chain-of-Thought Reasoning in Language Models (2023) (Paper)
- Prodigy OpenAI recipes - Bootstrap annotation with zero- & few-shot learning via OpenAI GPT-3.
- Summarization Programs: Interpretable Abstractive Summarization with Neural Modular Trees (2023) (Code)
- Online Language Modelling Training Pipeline
- Storing OpenAI embeddings in Postgres with pgvector (2023) (HN)
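
The gist of that post: a `vector(1536)` column for `text-embedding-ada-002` embeddings plus pgvector's distance operator for similarity search. A hedged sketch with `psycopg2`; the table/column names, connection string, and documents are assumptions.

```python
# Sketch: store OpenAI embeddings in Postgres (pgvector) and query by cosine distance.
import openai
import psycopg2

def embed(text):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return resp["data"][0]["embedding"]

def to_vector_literal(vec):
    # pgvector accepts '[x1,x2,...]' text input for the vector type
    return "[" + ",".join(str(x) for x in vec) + "]"

conn = psycopg2.connect("dbname=docs")  # assumed connection string
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(1536)  -- ada-002 embeddings have 1536 dimensions
    )
""")

content = "pgvector adds vector similarity search to Postgres."
cur.execute(
    "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
    (content, to_vector_literal(embed(content))),
)

# <=> is pgvector's cosine distance operator; smallest distance = most similar.
query = "How do I do similarity search in Postgres?"
cur.execute(
    "SELECT content FROM documents ORDER BY embedding <=> %s::vector LIMIT 3",
    (to_vector_literal(embed(query)),),
)
print(cur.fetchall())
conn.commit()
```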
- Theory of Mind May Have Spontaneously Emerged in Large Language Models (2023) (HN)
- Steamship Python Client Library For LangChain
- Toolformer: Language Models Can Teach Themselves to Use Tools (2023) (HN) (Code) (HN)
- Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery (2023) (Code)
- Understanding Large Language Models – A Transformative Reading List (2023) (HN)
- Discovering Latent Knowledge Without Supervision
- Offsite-Tuning: Transfer Learning without Full Model (2023) (Code)
- Awesome Neural Reprogramming Acoustic Prompting
- Chroma - Open-source embedding database. Makes it easy to build LLM apps by making knowledge, facts, and skills pluggable for LLMs.
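
A minimal sketch of the add-then-query loop Chroma is built around, using a local in-memory client and its default embedding function; the collection name and documents are made up.

```python
# Sketch: store documents in Chroma and retrieve the most similar ones for an LLM prompt.
import chromadb

client = chromadb.Client()  # local, in-memory instance by default
collection = client.create_collection(name="notes")

collection.add(
    documents=[
        "LLaMA is a family of foundation language models from Meta AI.",
        "pgvector adds vector similarity search to Postgres.",
    ],
    ids=["doc1", "doc2"],
)

results = collection.query(query_texts=["Who made LLaMA?"], n_results=1)
print(results["documents"][0])  # most similar document(s) to feed into an LLM prompt
```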
- Prompt Engine - Microsoft's prompt engineering library. (HN)
- PCAE: A framework of plug-in conditional auto-encoder for controllable text generation (2022) (Code)
- EasyLM - Easy to use model parallel large language models in JAX/Flax with pjit support on cloud TPU pods.
- Promptable - Library that enables you to build powerful AI applications with LLMs and Embeddings providers such as OpenAI, Hugging Face, Cohere and Anthropic.
- Lightning + Colossal-AI - Efficient Large-Scale Distributed Training with Colossal-AI and Lightning AI.
- MarioGPT: Open-Ended Text2Level Generation through Large Language Models (2023) (Code)
- LangChain.js - Building applications with LLMs through composability.
- Top resources on prompt engineering (2023)
- What are Transformers & Named Entity Recognition (2023)
- Text is All You Need (2023) (HN)
- Awesome LLM
- On Prompt Engineering (2023)
- MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation (2022) (Code)
- How to make LLMs say true things (2023)
- A Fast Post-Training Pruning Framework for Transformers (2022) (Code)
- Awesome Prompt Engineering
- FlexGen - Running large language models on a single GPU. (HN) (HN)
- Butterfish - CLI tools for LLMs.
- Elk - Eliciting latent knowledge inside the activations of a language model.
- Neurosymbolic Reading Group
- One Embedder, Any Task: Instruction-Finetuned Text Embeddings (2022) (Code)
- Fine-tune FLAN-T5 for chat & dialogue summarization (2022)
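
A compressed sketch of that kind of fine-tune with Hugging Face `transformers`: tokenize dialogue/summary pairs and train FLAN-T5 with `Seq2SeqTrainer`. The dataset choice (`samsum`) and hyperparameters are assumptions, not the article's exact recipe.

```python
# Sketch: fine-tune FLAN-T5 on dialogue summarization; dataset and hyperparameters assumed.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

dataset = load_dataset("samsum")  # dialogue -> summary pairs

def preprocess(batch):
    inputs = tokenizer(["summarize: " + d for d in batch["dialogue"]],
                       max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="flan-t5-samsum",
                                  per_device_train_batch_size=8,
                                  learning_rate=5e-5, num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```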
- Cohere Playground - Summarize texts up to 50K characters.
- SGPT: GPT Sentence Embeddings for Semantic Search (2022) (Code)
- PromptKG - Gallery of Prompt Learning & KG-related research works, toolkits, and paper-list.
- Text generation web UI - Gradio web UI for running Large Language Models like GPT-J 6B, OPT, GALACTICA, GPT-Neo, and Pygmalion.
- Knowledge is a Region in Weight Space for Fine-tuned Language Models (2023)
- LangChain Sidecar - UI starterkit for building LangChain apps that can be embedded on any website, similar to how Intercom can be embedded.
- embedland - Collection of text embedding experiments.
- Understanding large language models
- MindsJS - Build your workflows and app backends with large language models (LLMs) like OpenAI, Cohere and AlephAlpha.
- LLaMA Inference code
- Language Is Not All You Need: Aligning Perception with Language Models (2023) (Tweet)
- LLMs are compilers (2023) (Lobsters)
- Beating OpenAI CLIP with 100x less data and compute (2023) (HN)
- SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks
- TCJA-SNN: Temporal-Channel Joint Attention for Spiking Neural Networks
- LLM Security - New ways of breaking app-integrated LLMs.
- LangChain Chat
- Awesome Generative Information Retrieval
- Facebook LLAMA is being openly distributed via torrents (2023)
- Batch Prompting: Efficient Inference with Large Language Model APIs (2023) (Code)
- Local attention - Implementation of local windowed attention for language modeling.
- Tiktokenizer - Online playground for OpenAI tokenizers. (Code)
- LLaMA: INT8 edition - Hastily quantized inference code for LLaMA models.
- The Waluigi Effect: an explanation of bizarre semiotic effects in LLMs (2023) (HN)
- Vellum - Dev platform for LLM apps. (HN)
- Large Language Model Training Playbook
- Inference-only implementation of LLaMA in plain NumPy
- GPT-3 will ignore tools when it disagrees with them (2023)
- PaLM-E: An Embodied Multimodal Language Model (2023) (HN)
- UForm - Multi-Modal Inference Library For Semantic Search Applications and Mid-Fusion Vision-Language Transformers.
- Basaran - Open-source alternative to the OpenAI text completion API. It provides a compatible streaming API for your Hugging Face Transformers-based text generation models.
- 4-bit quantization of LLaMA using GPTQ
- ClickPrompt - Streamline your prompt design.
- Fork of Facebook's LLaMA model to run on CPU (HN)
- Running LLaMA 7B on a 64GB M2 MacBook Pro with llama.cpp (2023)
- Llama.cpp - Port of Facebook's LLaMA model in C/C++, with Apple Silicon support. (HN)
- Large language models are having their Stable Diffusion moment right now (2023) (HN)
- Vaporetto - Fast and lightweight pointwise prediction-based tokenizer.
- Using LLaMA with M1 Mac (2023) (HN)
- Dalai - Automatically install, run, and play with LLaMA on your computer. (HN) (Code)
- What is Temperature in NLP? (2021) (HN)
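
Temperature divides the logits before the softmax, so T < 1 sharpens the next-token distribution toward the most likely token and T > 1 flattens it. A tiny NumPy illustration with made-up logits:

```python
# Temperature scaling of a toy next-token distribution: p_i = softmax(logit_i / T).
import numpy as np

logits = np.array([3.0, 1.5, 0.2])  # made-up scores for three candidate tokens

def softmax_with_temperature(logits, temperature):
    scaled = logits / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

for t in (0.2, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t).round(3))
# Low T -> probability mass concentrates on the top token (more deterministic);
# high T -> the distribution flattens (more diverse/random sampling).
```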
- FLAN Instruction Tuning
- Minimal LLaMA
- ALLaMo - Simple, hackable and fast implementation for training/finetuning medium-sized LLaMA-based models.
- Stanford Alpaca - Instruction-following LLaMA model. (HN) (Web) (HN) (HN) (Web)
- Modern language models refute Chomsky’s approach to language (2023)
- High-throughput Generative Inference of Large Language Models with a Single GPU (2023) (HN)
- LLaMA-rs - Run LLaMA inference on CPU, with Rust. (HN)
- llama-dl - High-speed download of LLaMA, Facebook's 65B parameter GPT model. (HN)
- RLLaMA - Rust+OpenCL+AVX2 implementation of LLaMA inference code.
- Self-Instruct: Aligning Language Model with Self Generated Instructions (2022) (Code)
- LLaMA - Run LLMs on a single 4GB GPU
- GPT-4 (2023) (HN) (Demo) (Tweet) (Tweet)
- Evals - Framework for evaluating OpenAI models and an open-source registry of benchmarks. (HN)
- Anthropic | Introducing Claude (2023) (HN)
- Prompt in Context-Learning - Awesome resources for in-context learning and prompt engineering.
- GPT-4 System Card (2023)
- LangFlow - User Interface For LangChain.
- Alpaca-LoRA: Low-Rank LLaMA Instruct-Tuning
- Paper list of "The Life Cycle of Knowledge in Big Language Models: A Survey"
- AI Q&A for huggingface/diffusers
- bloomz.cpp - Inference of HuggingFace's BLOOM-like models in pure C/C++.
- MiniLLM: Large Language Models on Consumer GPUs
- Guardrails - Python package for specifying structure and type, validating and correcting the outputs of large language models.
- Alpaca.cpp - Run an Instruction-Tuned Chat-Style LLM on a MacBook. (HN)
- TextSynth Server - REST API to large language models. (HN)
- Wolverine - Give your Python scripts regenerative healing abilities.
- Recursive LLM prompts - Implement recursion using English as the programming language and GPT as the runtime.
- Alpaca-LoRA as a Chatbot Service
- Simple UI for LLaMA Model Finetuning
- Serge - Web interface for chatting with Alpaca through llama.cpp. Fully dockerized, with an easy to use API.
- llama-cli - Self-hosted, simple LLaMA/Alpaca API & CLI written in Go.
- Kor - Extract structured data from text using LLMs. Specify the schema of what should be extracted and provide some examples.
- Cheating is all you need (2023) (HN)
- Prompt Engineering (2023)
- Prompt Engineering Guide
- Anthropic Python SDK - Access to Anthropic's safety-first language model APIs.
- Alpaca-LoRA with Docker (HN)
- Autodoc - Toolkit for auto-generating codebase documentation using LLMs. (HN)
- Dolly - Fine-tunes the GPT-J 6B model on the Alpaca dataset using a Databricks notebook.
- Reflexion: an autonomous agent with dynamic memory and self-reflection (2023) (Code)
- CodeAlpaca – Instruction following code generation model (HN)
- LLaMA retrieval plugin - Using OpenAI's retrieval plugin.
- Retrieval in LangChain (2023) (HN)
- Open Sourcing Cody – Sourcegraph's AI-enabled editor assistant (2023) (HN)
- Cerebras-GPT: A Family of Open, Compute-Efficient, Large Language Models (2023) (HN)
- Lit-LLaMA - Open-source implementation of LLaMA. (HN)
- GPT4All - Demo, data and code to train an assistant-style large language model with ~800k GPT-3.5-Turbo generations based on LLaMA. (HN) (Tweet)
- LLMs and GPT: Some of my favorite learning materials (HN)
- Malleable software in the age of LLMs (2023)