🔥 Large language models (LLMs) have taken the NLP community, the AI community, and the whole world by storm. Here is a curated list of papers about large language models, especially those relating to ChatGPT, together with other LLM-related resources.
Here are the trends of the cumulative numbers of arXiv papers that contain the keyphrases “language model” (since June 2018) and “large language model” (since October 2019), respectively.
The statistics are calculated by exact-match queries of the keyphrases in titles or abstracts, aggregated by month. We set different x-axis ranges for the two keyphrases because "language models" were explored at an earlier time. We label the points corresponding to important landmarks in the research progress of LLMs. A sharp increase occurs after the release of ChatGPT: the average number of arXiv papers published per day that contain "large language model" in the title or abstract rises from 0.40 to 8.58.
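The counting procedure described above can be sketched as follows. This is a minimal illustration, assuming paper metadata has already been fetched; `papers` and `count_by_month` are hypothetical names, and the real statistics were computed over arXiv metadata:

```python
from collections import Counter

def count_by_month(papers, keyphrase):
    """Count papers whose title or abstract contains the keyphrase
    (case-insensitive exact match), grouped by month (YYYY-MM).
    Each paper is a dict with 'date' (YYYY-MM-DD), 'title', 'abstract'."""
    counts = Counter()
    phrase = keyphrase.lower()
    for p in papers:
        text = (p["title"] + " " + p["abstract"]).lower()
        if phrase in text:
            counts[p["date"][:7]] += 1  # bucket by year-month
    return dict(counts)

# Toy example with made-up records:
papers = [
    {"date": "2023-01-15", "title": "A study of large language models", "abstract": "..."},
    {"date": "2023-01-20", "title": "Graph networks", "abstract": "We use a large language model."},
    {"date": "2023-02-02", "title": "Vision transformers", "abstract": "No match here."},
]
print(count_by_month(papers, "large language model"))  # {'2023-01': 2}
```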
A brief illustration of the technical evolution of the GPT-series models. We plot this figure mainly based on the papers, blog articles, and official APIs from OpenAI. Here, solid lines denote that there is explicit evidence (e.g., an official statement that a new model is developed based on a base model) for the evolution path between two models, while dashed lines denote a relatively weaker evolution relation.
An evolutionary graph of the research work conducted on LLaMA. Because of their sheer number, we cannot include all LLaMA variants in this figure, even much excellent work. To support incremental updates, we share the source file of this figure and welcome readers to include their desired models by submitting pull requests on our GitHub page.
We will explore the effect of different types of instructions in fine-tuning LLMs (i.e., the 7B LLaMA), as well as examine the usefulness of several instruction improvement strategies.
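For illustration, a common instruction format in such fine-tuning setups is the Alpaca-style prompt template. The sketch below is an assumption about the general setup, not the exact template used in these experiments:

```python
def build_prompt(instruction, inp=""):
    """Format an (instruction, input) pair into a single training prompt,
    following the widely used Alpaca-style template."""
    if inp:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            f"completes the request.\n\n### Instruction:\n{instruction}\n\n"
            f"### Input:\n{inp}\n\n### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        f"appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )

print(build_prompt("Translate to French.", "Hello"))
```

During fine-tuning, the model is trained to continue each such prompt with the reference response; the optional `### Input:` section is dropped for instructions that need no extra context.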
We conduct a fine-grained evaluation of the abilities discussed in Section 7.1 and Section 7.2. For each kind of ability, we select representative tasks and datasets for conducting evaluation experiments to examine the corresponding performance of LLMs.
We build a table summarizing the usage restrictions of LLMs (e.g., for commercial and research purposes). In particular, we provide the information from the perspective of both the models and their pretraining data. We urge users in the community to refer to the licensing information for public models and data and to use them in a responsible manner. We urge developers to pay special attention to licensing and to make it transparent and comprehensive, to prevent any unwanted or unforeseen usage.
| LLMs | Model License | Commercial Use | Other notable restrictions | Data License | Corpus |
|---|---|---|---|---|---|
| **Encoder-only** | | | | | |
| BERT series of models (general domain) | Apache 2.0 | ✅ | | Public | BooksCorpus, English Wikipedia |
| RoBERTa | MIT license | ✅ | | Public | BookCorpus, CC-News, OpenWebText, STORIES |
| ERNIE | Apache 2.0 | ✅ | | Public | English Wikipedia |
| SciBERT | Apache 2.0 | ✅ | | Public | BERT corpus, 1.14M papers from Semantic Scholar |
| LegalBERT | CC BY-SA 4.0 | ❌ | | Public (except data from the Case Law Access Project) | EU legislation, US court cases, etc. |
| BioBERT | Apache 2.0 | ✅ | | PubMed | PubMed, PMC |
| **Encoder-Decoder** | | | | | |
| T5 | Apache 2.0 | ✅ | | Public | C4 |
| Flan-T5 | Apache 2.0 | ✅ | | Public | C4, mixture of tasks (Fig. 2 in the paper) |
| BART | Apache 2.0 | ✅ | | Public | RoBERTa corpus |
| GLM | Apache 2.0 | ✅ | | Public | BooksCorpus and English Wikipedia |
| ChatGLM | ChatGLM License | ❌ | No use for illegal purposes or military research; no harming the public interest of society | N/A | 1T tokens of Chinese and English corpus |
| **Decoder-only** | | | | | |
| GPT-2 | Modified MIT License | ✅ | Use GPT-2 responsibly and clearly indicate that your content was created using GPT-2 | Public | WebText |
| GPT-Neo | MIT license | ✅ | | Public | Pile |
| GPT-J | Apache 2.0 | ✅ | | Public | Pile |
| → Dolly | CC BY-NC 4.0 | ❌ | Subject to the terms of use of the data generated by OpenAI | CC BY-NC 4.0 | Pile, Self-Instruct |
| → GPT4All-J | Apache 2.0 | ✅ | | Public | GPT4All-J dataset |
| Pythia | Apache 2.0 | ✅ | | Public | Pile |
| → Dolly v2 | MIT license | ✅ | | Public | Pile, databricks-dolly-15k |
| OPT | OPT-175B LICENSE AGREEMENT | ❌ | No development relating to surveillance research and military; no harming the public interest of society | Public | RoBERTa corpus, the Pile, PushShift.io Reddit |
| → OPT-IML | OPT-175B LICENSE AGREEMENT | ❌ | Same as OPT | Public | OPT corpus, extended version of Super-NaturalInstructions |
| YaLM | Apache 2.0 | ✅ | | Unspecified | Pile, texts in Russian collected by the team |
| BLOOM | The BigScience RAIL License | ✅ | No generating verifiably false information with the purpose of harming others; no content without expressly disclaiming that the text is machine-generated | Public | ROOTS corpus (Laurençon et al., 2022) |
| → BLOOMZ | The BigScience RAIL License | ✅ | Same as BLOOM | Public | ROOTS corpus, xP3 |
| Galactica | CC BY-NC 4.0 | ❌ | | N/A | The Galactica Corpus |
| LLaMA | Non-commercial bespoke license | ❌ | No development relating to surveillance research and military; no harming the public interest of society | Public | CommonCrawl, C4, GitHub, Wikipedia, etc. |
| → Alpaca | CC BY-NC 4.0 | ❌ | Subject to the terms of use of the data generated by OpenAI | CC BY-NC 4.0 | LLaMA corpus, Self-Instruct |
| → Vicuna | CC BY-NC 4.0 | ❌ | Subject to the terms of use of the data generated by OpenAI; privacy practices of ShareGPT | | LLaMA corpus, 70K conversations from ShareGPT.com |
| → GPT4All | GPL-licensed LLaMA | ❌ | | Public | GPT4All dataset |
| OpenLLaMA | Apache 2.0 | ✅ | | Public | RedPajama |
| CodeGeeX | The CodeGeeX License | ❌ | No use for illegal purposes or military research | Public | Pile, CodeParrot, etc. |
| StarCoder | BigCode OpenRAIL-M v1 license | ✅ | No generating verifiably false information with the purpose of harming others; no content without expressly disclaiming that the text is machine-generated | Public | The Stack |
| MPT-7B | Apache 2.0 | ✅ | | Public | mC4 (English), The Stack, RedPajama, S2ORC |
| Falcon | TII Falcon LLM License | ✅/❌ | Available under a license allowing commercial use | Public | RefinedWeb |
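As a small utility over a table like the one above, one could filter models by commercial-use permission. The entries and the `commercially_usable` helper below are illustrative assumptions, not part of any library, and are no substitute for reading the actual license texts:

```python
# Each entry: (model, license, commercial_use_allowed) -- a small sample
# transcribed from the licensing table above.
MODELS = [
    ("BERT", "Apache 2.0", True),
    ("RoBERTa", "MIT", True),
    ("LLaMA", "Non-commercial bespoke license", False),
    ("Alpaca", "CC BY-NC 4.0", False),
    ("OpenLLaMA", "Apache 2.0", True),
    ("BLOOM", "BigScience RAIL", True),
]

def commercially_usable(models):
    """Return the names of models whose license permits commercial use.
    Always verify the license text itself before relying on this flag."""
    return [name for name, _license, ok in models if ok]

print(commercially_usable(MODELS))  # ['BERT', 'RoBERTa', 'OpenLLaMA', 'BLOOM']
```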
Name | Release Date | Paper/Blog | Dataset | Tokens (T) | License |
---|---|---|---|---|---|
starcoderdata | 2023/05 | StarCoder: A State-of-the-Art LLM for Code | starcoderdata | 0.25 | Apache 2.0 |
RedPajama | 2023/04 | RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens | RedPajama-Data | 1.2 | Apache 2.0 |
Name | Release Date | Paper/Blog | Dataset | Samples (K) | License |
---|---|---|---|---|---|
MPT-7B-Instruct | 2023/05 | Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs | dolly_hhrlhf | 59 | CC BY-SA-3.0 |
databricks-dolly-15k | 2023/04 | Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM | databricks-dolly-15k | 15 | CC BY-SA-3.0 |
OIG (Open Instruction Generalist) | 2023/03 | THE OIG DATASET | OIG | 44,000 | Apache 2.0 |
Name | Release Date | Paper/Blog | Dataset | Samples (K) | License |
---|---|---|---|---|---|
OpenAssistant Conversations Dataset | 2023/04 | OpenAssistant Conversations - Democratizing Large Language Model Alignment | oasst1 | 161 | Apache 2.0 |
- [Andrej Karpathy] State of GPT video
- [Hyung Won Chung] Instruction finetuning and RLHF lecture Youtube
- [Jason Wei] Scaling, emergence, and reasoning in large language models Slides
- [Susan Zhang] Open Pretrained Transformers Youtube
- [Ameet Deshpande] How Does ChatGPT Work? Slides
- [Yao Fu] Pretraining, Instruction Fine-tuning, Alignment, and Specialization: On the Sources of LLM Abilities Bilibili
- [Hung-yi Lee] ChatGPT Principles Explained Youtube
- [Jay Mody] GPT in 60 Lines of NumPy Link
- [ICML 2022] Welcome to the "Big Model" Era: Techniques and Systems to Train and Serve Bigger Models Link
- [NeurIPS 2022] Foundational Robustness of Foundation Models Link
- [Andrej Karpathy] Let's build GPT: from scratch, in code, spelled out. Video|Code
- [DAIR.AI] Prompt Engineering Guide Link
- [邱锡鹏] Capability Analysis and Applications of Large Language Models Slides | Video
- [Philipp Schmid] Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers Link
- [HuggingFace] Illustrating Reinforcement Learning from Human Feedback (RLHF) Link
- [HuggingFace] What Makes a Dialog Agent Useful? Link
- [张俊林] The Road to AGI: Technical Essentials of Large Language Models (LLMs) Link
- [大师兄] ChatGPT/InstructGPT Explained in Detail Link
- [HeptaAI] Inside ChatGPT: InstructGPT, PPO Reinforcement Learning from Feedback on Instructions Link
- [Yao Fu] How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources Link
- [Stephen Wolfram] What Is ChatGPT Doing … and Why Does It Work? Link
- [Jingfeng Yang] Why did all of the public reproduction of GPT-3 fail? Link
- [Hung-yi Lee] How ChatGPT Was (Probably) Made: The Socialization Process of GPT Video
- [Keyvan Kambakhsh] Pure Rust implementation of a minimal Generative Pretrained Transformer code
- [DeepLearning.AI] ChatGPT Prompt Engineering for Developers Homepage
- [Princeton] Understanding Large Language Models Homepage
- [OpenBMB] Open Course on Large Models Homepage
- [Stanford] CS224N-Lecture 11: Prompting, Instruction Finetuning, and RLHF Slides
- [Stanford] CS324-Large Language Models Homepage
- [Stanford] CS25-Transformers United V2 Homepage
- [Stanford Webinar] GPT-3 & Beyond Video
- [李沐] InstructGPT Paper Walkthrough Bilibili Youtube
- [陳縕儂] OpenAI InstructGPT: Learning from Human Feedback, the Predecessor of ChatGPT Youtube
- [李沐] HELM: Holistic Language Model Evaluation Bilibili
- [李沐] GPT, GPT-2, GPT-3 Paper Walkthrough Bilibili Youtube
- [Aston Zhang] Chain-of-Thought Paper Bilibili Youtube
- [MIT] Introduction to Data-Centric AI Homepage
- FastChat - A distributed multi-model LLM serving system with web UI and OpenAI-compatible RESTful APIs.
- SkyPilot - Run LLMs and batch jobs on any cloud. Get maximum cost savings, highest GPU availability, and managed execution -- all with a simple interface.
- vLLM - A high-throughput and memory-efficient inference and serving engine for LLMs.
- Text Generation Inference - A Rust, Python and gRPC server for text generation inference. Used in production at HuggingFace to power LLMs' api-inference widgets.
- Haystack - An open-source NLP framework that allows you to use LLMs and transformer-based models from Hugging Face, OpenAI and Cohere to interact with your own data.
- Sidekick - Data integration platform for LLMs.
- LangChain - Building applications with LLMs through composability.
- wechat-chatgpt - Use ChatGPT on WeChat via wechaty.
- promptfoo - Test your prompts. Evaluate and compare LLM outputs, catch regressions, and improve prompt quality.
- Agenta - Easily build, version, evaluate and deploy your LLM-powered apps.
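Several of the serving tools above (e.g., FastChat, vLLM) expose OpenAI-compatible RESTful APIs. The sketch below only builds the JSON body for a `/v1/chat/completions` call so it stays runnable offline; the model name is a placeholder, and actually sending the request (e.g., with `requests`) is left to the reader:

```python
import json

def chat_request(model, user_message, temperature=0.7):
    """Build the JSON body for an OpenAI-compatible /v1/chat/completions call.
    POSTing it to a running server is intentionally omitted here."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }

body = chat_request("vicuna-7b", "Hello!")  # placeholder model name
print(json.dumps(body))
```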
- Apache 2.0: Allows users to use the software for any purpose, to distribute it, to modify it, and to distribute modified versions of the software under the terms of the license, without concern for royalties.
- MIT: Similar to Apache 2.0 but shorter and simpler. Also, in contrast to Apache 2.0, does not require stating any significant changes to the original code.
- CC BY-SA-4.0: Allows (i) copying and redistributing the material and (ii) remixing, transforming, and building upon the material for any purpose, even commercially. But if you do the latter, you must distribute your contributions under the same license as the original. (Thus, may not be viable for internal teams.)
- OpenRAIL-M v1: Allows royalty-free access and flexible downstream use and sharing of the model and modifications of it, and comes with a set of use restrictions (see Attachment A)
- BSD-3-Clause: This version allows unlimited redistribution for any purpose as long as its copyright notices and the license's disclaimers of warranty are maintained.
Disclaimer: The information provided in this repo does not, and is not intended to, constitute legal advice. Maintainers of this repo are not responsible for the actions of third parties who use the models. Please seek legal advice before using models for commercial purposes.
Give a 🌟 if this repo helped you!
@misc{Resources,
title={LLM-Resources-Papers-Frameworks-Tools},
author= {Susant Achary},
year={2023}
}