🔥 Large language models (LLMs) have taken the NLP community, the AI community, and the whole world by storm. Here is a curated list of papers about large language models, especially those relating to ChatGPT, together with other LLM-related resources.
Here are the trends of the cumulative numbers of arXiv papers that contain the keyphrases “language model” (since June 2018) and “large language model” (since October 2019), respectively.
The statistics are calculated by exact-match queries of the keyphrases in titles or abstracts, aggregated by month. We set different x-axis ranges for the two keyphrases because "language models" were explored at an earlier time. We label the points corresponding to important landmarks in the research progress of LLMs. A sharp increase occurs after the release of ChatGPT: the average number of arXiv papers published per day that contain "large language model" in the title or abstract rises from 0.40 to 8.58.
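The counting procedure described above can be sketched as follows. This is a minimal illustration, assuming paper metadata has already been fetched; `papers` and `count_by_month` are hypothetical names, and the real statistics were computed over arXiv metadata:

```python
from collections import Counter

def count_by_month(papers, keyphrase):
    """Count papers whose title or abstract contains the keyphrase
    (case-insensitive exact match), grouped by month (YYYY-MM).
    Each paper is a dict with 'date' (YYYY-MM-DD), 'title', 'abstract'."""
    counts = Counter()
    phrase = keyphrase.lower()
    for p in papers:
        text = (p["title"] + " " + p["abstract"]).lower()
        if phrase in text:
            counts[p["date"][:7]] += 1  # bucket by year-month
    return dict(counts)

# Toy example with made-up records:
papers = [
    {"date": "2023-01-15", "title": "A study of large language models", "abstract": "..."},
    {"date": "2023-01-20", "title": "Graph networks", "abstract": "We use a large language model."},
    {"date": "2023-02-02", "title": "Vision transformers", "abstract": "No match here."},
]
print(count_by_month(papers, "large language model"))  # {'2023-01': 2}
```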
A brief illustration of the technical evolution of the GPT-series models. We plot this figure mainly based on the papers, blog articles, and official APIs from OpenAI. Here, solid lines denote that there is explicit evidence (e.g., an official statement that a new model is developed based on a base model) for the evolution path between two models, while dashed lines denote a relatively weaker evolution relation.
An evolutionary graph of the research work conducted on LLaMA. Because of their sheer number, we cannot include all LLaMA variants in this figure, even much excellent work. To support incremental updates, we share the source file of this figure and welcome readers to include their desired models by submitting pull requests on our GitHub page.
We will explore the effect of different types of instructions in fine-tuning LLMs (i.e., the 7B LLaMA), as well as examine the usefulness of several instruction improvement strategies.
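For illustration, a common instruction format in such fine-tuning setups is the Alpaca-style prompt template. The sketch below is an assumption about the general setup, not the exact template used in these experiments:

```python
def build_prompt(instruction, inp=""):
    """Format an (instruction, input) pair into a single training prompt,
    following the widely used Alpaca-style template."""
    if inp:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            f"completes the request.\n\n### Instruction:\n{instruction}\n\n"
            f"### Input:\n{inp}\n\n### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        f"appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )

print(build_prompt("Translate to French.", "Hello"))
```

During fine-tuning, the model is trained to continue each such prompt with the reference response; the optional `### Input:` section is dropped for instructions that need no extra context.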
We conduct a fine-grained evaluation of the abilities discussed in Section 7.1 and Section 7.2. For each kind of ability, we select representative tasks and datasets for conducting evaluation experiments to examine the corresponding performance of LLMs.
We build a table summarizing the usage restrictions of LLMs (e.g., for commercial and research purposes). In particular, we provide the information from the perspective of both the models and their pretraining data. We urge users in the community to refer to the licensing information for public models and data and to use them in a responsible manner. We urge developers to pay special attention to licensing and to make it transparent and comprehensive, to prevent any unwanted or unforeseen usage.
| LLMs | Model License | Commercial Use | Other notable restrictions | Data License | Corpus |
|---|---|---|---|---|---|
| **Encoder-only** | | | | | |
| BERT series of models (general domain) | Apache 2.0 | ✅ | | Public | BooksCorpus, English Wikipedia |
| RoBERTa | MIT license | ✅ | | Public | BookCorpus, CC-News, OpenWebText, STORIES |
| ERNIE | Apache 2.0 | ✅ | | Public | English Wikipedia |
| SciBERT | Apache 2.0 | ✅ | | Public | BERT corpus, 1.14M papers from Semantic Scholar |
| LegalBERT | CC BY-SA 4.0 | ❌ | | Public (except data from the Case Law Access Project) | EU legislation, US court cases, etc. |
| BioBERT | Apache 2.0 | ✅ | | PubMed | PubMed, PMC |
| **Encoder-Decoder** | | | | | |
| T5 | Apache 2.0 | ✅ | | Public | C4 |
| Flan-T5 | Apache 2.0 | ✅ | | Public | C4, mixture of tasks (Fig. 2 in the paper) |
| BART | Apache 2.0 | ✅ | | Public | RoBERTa corpus |
| GLM | Apache 2.0 | ✅ | | Public | BooksCorpus and English Wikipedia |
| ChatGLM | ChatGLM License | ❌ | No use for illegal purposes or military research; no harming the public interest of society | N/A | 1T tokens of Chinese and English corpus |
| **Decoder-only** | | | | | |
| GPT-2 | Modified MIT License | ✅ | Use GPT-2 responsibly and clearly indicate that your content was created using GPT-2 | Public | WebText |
| GPT-Neo | MIT license | ✅ | | Public | Pile |
| GPT-J | Apache 2.0 | ✅ | | Public | Pile |
| → Dolly | CC BY-NC 4.0 | ❌ | Subject to the terms of use of the data generated by OpenAI | CC BY-NC 4.0 | Pile, Self-Instruct |
| → GPT4All-J | Apache 2.0 | ✅ | | Public | GPT4All-J dataset |
| Pythia | Apache 2.0 | ✅ | | Public | Pile |
| → Dolly v2 | MIT license | ✅ | | Public | Pile, databricks-dolly-15k |
| OPT | OPT-175B LICENSE AGREEMENT | ❌ | No development relating to surveillance research and military; no harming the public interest of society | Public | RoBERTa corpus, the Pile, PushShift.io Reddit |
| → OPT-IML | OPT-175B LICENSE AGREEMENT | ❌ | Same as OPT | Public | OPT corpus, extended version of Super-NaturalInstructions |
| YaLM | Apache 2.0 | ✅ | | Unspecified | Pile, texts in Russian collected by the team |
| BLOOM | The BigScience RAIL License | ✅ | No generating verifiably false information with the purpose of harming others; no content without expressly disclaiming that the text is machine-generated | Public | ROOTS corpus (Laurençon et al., 2022) |
| → BLOOMZ | The BigScience RAIL License | ✅ | Same as BLOOM | Public | ROOTS corpus, xP3 |
| Galactica | CC BY-NC 4.0 | ❌ | | N/A | The Galactica Corpus |
| LLaMA | Non-commercial bespoke license | ❌ | No development relating to surveillance research and military; no harming the public interest of society | Public | CommonCrawl, C4, GitHub, Wikipedia, etc. |
| → Alpaca | CC BY-NC 4.0 | ❌ | Subject to the terms of use of the data generated by OpenAI | CC BY-NC 4.0 | LLaMA corpus, Self-Instruct |
| → Vicuna | CC BY-NC 4.0 | ❌ | Subject to the terms of use of the data generated by OpenAI; privacy practices of ShareGPT | | LLaMA corpus, 70K conversations from ShareGPT.com |
| → GPT4All | GPL-licensed LLaMA | ❌ | | Public | GPT4All dataset |
| OpenLLaMA | Apache 2.0 | ✅ | | Public | RedPajama |
| CodeGeeX | The CodeGeeX License | ❌ | No use for illegal purposes or military research | Public | Pile, CodeParrot, etc. |
| StarCoder | BigCode OpenRAIL-M v1 license | ✅ | No generating verifiably false information with the purpose of harming others; no content without expressly disclaiming that the text is machine-generated | Public | The Stack |
| MPT-7B | Apache 2.0 | ✅ | | Public | mC4 (English), The Stack, RedPajama, S2ORC |
| Falcon | TII Falcon LLM License | ✅/❌ | Available under a license allowing commercial use | Public | RefinedWeb |
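As a small utility over a table like the one above, one could filter models by commercial-use permission. The entries and the `commercially_usable` helper below are illustrative assumptions, not part of any library, and are no substitute for reading the actual license texts:

```python
# Each entry: (model, license, commercial_use_allowed) -- a small sample
# transcribed from the licensing table above.
MODELS = [
    ("BERT", "Apache 2.0", True),
    ("RoBERTa", "MIT", True),
    ("LLaMA", "Non-commercial bespoke license", False),
    ("Alpaca", "CC BY-NC 4.0", False),
    ("OpenLLaMA", "Apache 2.0", True),
    ("BLOOM", "BigScience RAIL", True),
]

def commercially_usable(models):
    """Return the names of models whose license permits commercial use.
    Always verify the license text itself before relying on this flag."""
    return [name for name, _license, ok in models if ok]

print(commercially_usable(MODELS))  # ['BERT', 'RoBERTa', 'OpenLLaMA', 'BLOOM']
```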
Name | Release Date | Paper/Blog | Dataset | Tokens (T) | License |
---|---|---|---|---|---|
starcoderdata | 2023/05 | StarCoder: A State-of-the-Art LLM for Code | starcoderdata | 0.25 | Apache 2.0 |
RedPajama | 2023/04 | RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens | RedPajama-Data | 1.2 | Apache 2.0 |
Name | Release Date | Paper/Blog | Dataset | Samples (K) | License |
---|---|---|---|---|---|
MPT-7B-Instruct | 2023/05 | Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs | dolly_hhrlhf | 59 | CC BY-SA-3.0 |
databricks-dolly-15k | 2023/04 | Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM | databricks-dolly-15k | 15 | CC BY-SA-3.0 |
OIG (Open Instruction Generalist) | 2023/03 | THE OIG DATASET | OIG | 44,000 | Apache 2.0 |
Name | Release Date | Paper/Blog | Dataset | Samples (K) | License |
---|---|---|---|---|---|
OpenAssistant Conversations Dataset | 2023/04 | OpenAssistant Conversations - Democratizing Large Language Model Alignment | oasst1 | 161 | Apache 2.0 |
- [Andrej Karpathy] State of GPT video
- [Hyung Won Chung] Instruction finetuning and RLHF lecture Youtube
- [Jason Wei] Scaling, emergence, and reasoning in large language models Slides
- [Susan Zhang] Open Pretrained Transformers Youtube
- [Ameet Deshpande] How Does ChatGPT Work? Slides
- [Yao Fu] Pretraining, Instruction Fine-tuning, Alignment, and Specialization: On the Sources of LLM Abilities Bilibili
- [Hung-yi Lee] ChatGPT Principles Explained Youtube
- [Jay Mody] GPT in 60 Lines of NumPy Link
- [ICML 2022] Welcome to the "Big Model" Era: Techniques and Systems to Train and Serve Bigger Models Link
- [NeurIPS 2022] Foundational Robustness of Foundation Models Link
- [Andrej Karpathy] Let's build GPT: from scratch, in code, spelled out. Video|Code
- [DAIR.AI] Prompt Engineering Guide Link
- [邱锡鹏] Capability Analysis and Applications of Large Language Models Slides | Video
- [Philipp Schmid] Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers Link
- [HuggingFace] Illustrating Reinforcement Learning from Human Feedback (RLHF) Link
- [HuggingFace] What Makes a Dialog Agent Useful? Link
- [张俊林] The Road to AGI: Technical Essentials of Large Language Models (LLMs) Link
- [大师兄] ChatGPT/InstructGPT Explained in Detail Link
- [HeptaAI] Inside ChatGPT: InstructGPT, PPO Reinforcement Learning from Feedback on Instructions Link
- [Yao Fu] How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources Link
- [Stephen Wolfram] What Is ChatGPT Doing … and Why Does It Work? Link
- [Jingfeng Yang] Why did all of the public reproduction of GPT-3 fail? Link
- [Hung-yi Lee] How ChatGPT Was (Probably) Made: The Socialization Process of GPT Video
- [Keyvan Kambakhsh] Pure Rust implementation of a minimal Generative Pretrained Transformer code
- [DeepLearning.AI] ChatGPT Prompt Engineering for Developers Homepage
- [Princeton] Understanding Large Language Models Homepage
- [OpenBMB] Open Course on Large Models Homepage
- [Stanford] CS224N-Lecture 11: Prompting, Instruction Finetuning, and RLHF Slides
- [Stanford] CS324-Large Language Models Homepage
- [Stanford] CS25-Transformers United V2 Homepage
- [Stanford Webinar] GPT-3 & Beyond Video
- [李沐] InstructGPT Paper Walkthrough Bilibili Youtube
- [陳縕儂] OpenAI InstructGPT: Learning from Human Feedback, the Predecessor of ChatGPT Youtube
- [李沐] HELM: Holistic Language Model Evaluation Bilibili
- [李沐] GPT, GPT-2, GPT-3 Paper Walkthrough Bilibili Youtube
- [Aston Zhang] Chain-of-Thought Paper Bilibili Youtube
- [MIT] Introduction to Data-Centric AI Homepage
- FastChat - A distributed multi-model LLM serving system with web UI and OpenAI-compatible RESTful APIs.
- SkyPilot - Run LLMs and batch jobs on any cloud. Get maximum cost savings, highest GPU availability, and managed execution -- all with a simple interface.
- vLLM - A high-throughput and memory-efficient inference and serving engine for LLMs.
- Text Generation Inference - A Rust, Python and gRPC server for text generation inference. Used in production at HuggingFace to power LLMs' api-inference widgets.
- Haystack - An open-source NLP framework that allows you to use LLMs and transformer-based models from Hugging Face, OpenAI and Cohere to interact with your own data.
- Sidekick - Data integration platform for LLMs.
- LangChain - Building applications with LLMs through composability.
- wechat-chatgpt - Use ChatGPT on WeChat via wechaty.
- promptfoo - Test your prompts. Evaluate and compare LLM outputs, catch regressions, and improve prompt quality.
- Agenta - Easily build, version, evaluate and deploy your LLM-powered apps.
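Several of the serving tools above (e.g., FastChat, vLLM) expose OpenAI-compatible RESTful APIs. The sketch below only builds the JSON body for a `/v1/chat/completions` call so it stays runnable offline; the model name is a placeholder, and actually sending the request (e.g., with `requests`) is left to the reader:

```python
import json

def chat_request(model, user_message, temperature=0.7):
    """Build the JSON body for an OpenAI-compatible /v1/chat/completions call.
    POSTing it to a running server is intentionally omitted here."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }

body = chat_request("vicuna-7b", "Hello!")  # placeholder model name
print(json.dumps(body))
```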
- Apache 2.0: Allows users to use the software for any purpose, to distribute it, to modify it, and to distribute modified versions of the software under the terms of the license, without concern for royalties.
- MIT: Similar to Apache 2.0 but shorter and simpler. Also, in contrast to Apache 2.0, does not require stating any significant changes to the original code.
- CC BY-SA-4.0: Allows (i) copying and redistributing the material and (ii) remixing, transforming, and building upon the material for any purpose, even commercially. But if you do the latter, you must distribute your contributions under the same license as the original. (Thus, may not be viable for internal teams.)
- OpenRAIL-M v1: Allows royalty-free access and flexible downstream use and sharing of the model and modifications of it, and comes with a set of use restrictions (see Attachment A)
- BSD-3-Clause: This version allows unlimited redistribution for any purpose as long as its copyright notices and the license's disclaimers of warranty are maintained.
Disclaimer: The information provided in this repo does not, and is not intended to, constitute legal advice. Maintainers of this repo are not responsible for the actions of third parties who use the models. Please seek legal advice before using models for commercial purposes.
Give a 🌟 if this repo helped you!
@misc{Resources,
title={LLM-Resources-Papers-Frameworks-Tools},
author= {Susant Achary},
year={2023}
}