
GenAI Ecosystem🧠

Large Language Models🧠

🔥 Large Language Models (LLMs) have taken the NLP community, the AI community, and the whole world by storm. Here is a curated list of papers about large language models, especially those relating to ChatGPT, along with other LLM-related resources.

🚀(New) The trends of the number of papers related to LLMs on arXiv

Here are the trends of the cumulative numbers of arXiv papers that contain the keyphrases “language model” (since June 2018) and “large language model” (since October 2019), respectively.

arxiv_llms

The statistics are computed by exact-match queries for the keyphrases in paper titles or abstracts, aggregated by month. I set different x-axis ranges for the two keyphrases because "language model" has been explored since an earlier time. I label the points corresponding to important landmarks in LLM research. A sharp increase occurs after the release of ChatGPT: the average number of published arXiv papers containing "large language model" in the title or abstract rises from 0.40 per day to 8.58 per day.
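The counting method described above can be sketched as follows. This is a minimal illustration, not the actual script: `papers` is a hypothetical list of metadata records, whereas the real statistics were computed against arXiv metadata.

```python
from collections import Counter

def monthly_keyphrase_counts(papers, keyphrase):
    """Count papers per month whose title or abstract contains the
    keyphrase as an exact (case-insensitive) substring match."""
    counts = Counter()
    for p in papers:
        text = (p["title"] + " " + p["abstract"]).lower()
        if keyphrase.lower() in text:
            counts[p["date"][:7]] += 1  # group by "YYYY-MM"
    return dict(counts)

# Hypothetical toy records; real data comes from arXiv metadata.
papers = [
    {"date": "2022-11-30", "title": "ChatGPT study",
     "abstract": "we analyze a large language model"},
    {"date": "2022-12-05", "title": "Prompting LLMs",
     "abstract": "a large language model evaluation"},
    {"date": "2022-12-09", "title": "Vision", "abstract": "object detection"},
]
print(monthly_keyphrase_counts(papers, "large language model"))
# {'2022-11': 1, '2022-12': 1}
```

Dividing each month's count by the number of days in the month gives the per-day averages quoted above.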

🚀(New) Technical Evolution of GPT-series Models

A brief illustration of the technical evolution of GPT-series models. I plot this figure mainly based on papers, blog articles, and official APIs from OpenAI. Solid lines denote explicit evidence for the evolution path between two models (e.g., an official statement that a new model was developed from a base model), while dashed lines denote a relatively weaker evolution relation.

gpt-series

🚀(New) Evolutionary Graph of LLaMA Family

An evolutionary graph of the research work conducted on LLaMA. Because of the huge number of variants, I cannot include all of them in this figure, and much excellent work is necessarily omitted.

LLaMA_family

To support incremental updates, I share the source file of this figure and welcome readers to add their desired models by submitting pull requests on our GitHub page. If you're interested, please submit a request.

🚀(New) Prompts

prompt examples
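As a minimal illustration of the kind of prompt examples collected in this section, a few-shot prompt can be assembled from a task instruction, a handful of worked demonstrations, and the new query. The demonstrations below are made up for illustration:

```python
def build_few_shot_prompt(instruction, demos, query):
    """Assemble a few-shot prompt: task instruction, worked examples,
    then the new input left for the model to complete."""
    lines = [instruction, ""]
    for q, a in demos:
        lines += [f"Q: {q}", f"A: {a}", ""]
    lines += [f"Q: {query}", "A:"]
    return "\n".join(lines)

demos = [("2 + 2", "4"), ("7 - 3", "4")]
prompt = build_few_shot_prompt("Answer the arithmetic question.", demos, "5 + 8")
print(prompt)
```

The model is expected to continue the text after the final "A:"; adding "Let's think step by step" style demonstrations turns the same template into a chain-of-thought prompt.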

🚀(New) Experiments

Instruction Tuning Experiments

I will explore the effect of different types of instructions in fine-tuning LLMs (i.e., the 7B LLaMA model), as well as examine the usefulness of several instruction-improvement strategies.

instruction_tuning_table
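For context, instruction-tuning data is typically serialized into a single training string before fine-tuning. The template below is an Alpaca-style sketch given as an assumption, not necessarily the exact format used in these experiments:

```python
def format_instruction_example(instruction, inp, output):
    """Serialize one instruction-tuning example into a single training
    string (Alpaca-style template, shown here as an assumption)."""
    if inp:
        return (
            "Below is an instruction that describes a task, paired with an input.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{inp}\n\n"
            f"### Response:\n{output}"
        )
    return (
        "Below is an instruction that describes a task.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Response:\n{output}"
    )

example = format_instruction_example(
    "Summarize the text.", "LLMs are large neural networks.", "Large neural nets."
)
print(example)
```

During training, the loss is usually computed only on the tokens after "### Response:", so the model learns to produce the answer rather than echo the instruction.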

Ability Evaluation Experiments

I conduct a fine-grained evaluation of the abilities discussed in Section 7.1 and Section 7.2. For each ability, I select representative tasks and datasets and run evaluation experiments to examine the corresponding performance of LLMs.

ability_main
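A common metric in such task-level evaluations is exact-match accuracy over the test set. A minimal sketch, with made-up predictions and references:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answer
    after lowercasing and stripping surrounding whitespace."""
    norm = lambda s: s.strip().lower()
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["Paris", " 42 ", "blue"]   # hypothetical model outputs
refs = ["paris", "42", "red"]        # gold answers
print(exact_match_accuracy(preds, refs))  # 2 of 3 correct
```

Generation tasks often use softer metrics (F1, ROUGE, pass@k for code), but the harness shape is the same: normalize, compare, and average over the dataset.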

Impactful Papers

| Date | Keywords | Institute | Paper | Publication |
|------|----------|-----------|-------|-------------|
| 2017-06 | Transformers | Google | Attention Is All You Need | NeurIPS |
| 2018-06 | GPT 1.0 | OpenAI | Improving Language Understanding by Generative Pre-Training | |
| 2018-10 | BERT | Google | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | NAACL |
| 2019-02 | GPT 2.0 | OpenAI | Language Models are Unsupervised Multitask Learners | |
| 2019-09 | Megatron-LM | NVIDIA | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | |
| 2019-10 | T5 | Google | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | JMLR |
| 2019-10 | ZeRO | Microsoft | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | SC |
| 2020-01 | Scaling Law | OpenAI | Scaling Laws for Neural Language Models | |
| 2020-05 | GPT 3.0 | OpenAI | Language Models are Few-Shot Learners | NeurIPS |
| 2021-01 | Switch Transformers | Google | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | JMLR |
| 2021-08 | Codex | OpenAI | Evaluating Large Language Models Trained on Code | |
| 2021-08 | Foundation Models | Stanford | On the Opportunities and Risks of Foundation Models | |
| 2021-09 | FLAN | Google | Finetuned Language Models are Zero-Shot Learners | ICLR |
| 2021-10 | T0 | HuggingFace et al. | Multitask Prompted Training Enables Zero-Shot Task Generalization | ICLR |
| 2021-12 | GLaM | Google | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | ICML |
| 2021-12 | WebGPT | OpenAI | WebGPT: Browser-assisted question-answering with human feedback | |
| 2021-12 | Retro | DeepMind | Improving language models by retrieving from trillions of tokens | ICML |
| 2021-12 | Gopher | DeepMind | Scaling Language Models: Methods, Analysis & Insights from Training Gopher | |
| 2022-01 | CoT | Google | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | NeurIPS |
| 2022-01 | LaMDA | Google | LaMDA: Language Models for Dialog Applications | |
| 2022-06 | Minerva | Google | Solving Quantitative Reasoning Problems with Language Models | NeurIPS |
| 2022-01 | Megatron-Turing NLG | Microsoft & NVIDIA | Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model | |
| 2022-03 | InstructGPT | OpenAI | Training language models to follow instructions with human feedback | |
| 2022-04 | PaLM | Google | PaLM: Scaling Language Modeling with Pathways | |
| 2022-04 | Chinchilla | DeepMind | An empirical analysis of compute-optimal large language model training | NeurIPS |
| 2022-05 | OPT | Meta | OPT: Open Pre-trained Transformer Language Models | |
| 2022-05 | UL2 | Google | Unifying Language Learning Paradigms | |
| 2022-06 | Emergent Abilities | Google | Emergent Abilities of Large Language Models | TMLR |
| 2022-06 | BIG-bench | Google | Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models | |
| 2022-06 | METALM | Microsoft | Language Models are General-Purpose Interfaces | |
| 2022-09 | Sparrow | DeepMind | Improving alignment of dialogue agents via targeted human judgements | |
| 2022-10 | Flan-T5/PaLM | Google | Scaling Instruction-Finetuned Language Models | |
| 2022-10 | GLM-130B | Tsinghua | GLM-130B: An Open Bilingual Pre-trained Model | ICLR |
| 2022-11 | HELM | Stanford | Holistic Evaluation of Language Models | |
| 2022-11 | BLOOM | BigScience | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model | |
| 2022-11 | Galactica | Meta | Galactica: A Large Language Model for Science | |
| 2022-12 | OPT-IML | Meta | OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization | |
| 2023-01 | Flan 2022 Collection | Google | The Flan Collection: Designing Data and Methods for Effective Instruction Tuning | |
| 2023-02 | LLaMA | Meta | LLaMA: Open and Efficient Foundation Language Models | |
| 2023-02 | Kosmos-1 | Microsoft | Language Is Not All You Need: Aligning Perception with Language Models | |
| 2023-03 | PaLM-E | Google | PaLM-E: An Embodied Multimodal Language Model | |
| 2023-03 | GPT-4 | OpenAI | GPT-4 Technical Report | |
| 2023-04 | Pythia | EleutherAI et al. | Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | ICML |
| 2023-05 | Dromedary | CMU et al. | Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision | |
| 2023-05 | PaLM 2 | Google | PaLM 2 Technical Report | |
| 2023-05 | RWKV | Bo Peng | RWKV: Reinventing RNNs for the Transformer Era | |
| 2023-05 | DPO | Stanford | Direct Preference Optimization: Your Language Model is Secretly a Reward Model | |

Usage and Restrictions

I build a table summarizing usage restrictions on LLMs (e.g., for commercial and research purposes). In particular, I provide information from the perspective of both the models and their pretraining data. I urge users in the community to refer to the licensing information for public models and data and to use them responsibly. I also urge developers to pay special attention to licensing, and to make it transparent and comprehensive, to prevent any unwanted and unforeseen usage.

Models marked with ---> are fine-tuned derivatives of the base model above them.

| Model | Model license | Notable restrictions | Data license | Pretraining corpus |
|-------|---------------|----------------------|--------------|--------------------|
| **Encoder-only** | | | | |
| BERT series of models (general domain) | Apache 2.0 | | Public | BooksCorpus, English Wikipedia |
| RoBERTa | MIT License | | Public | BookCorpus, CC-News, OpenWebText, STORIES |
| ERNIE | Apache 2.0 | | Public | English Wikipedia |
| SciBERT | Apache 2.0 | | Public | BERT corpus, 1.14M papers from Semantic Scholar |
| LegalBERT | CC BY-SA 4.0 | | Public (except data from the Case Law Access Project) | EU legislation, US court cases, etc. |
| BioBERT | Apache 2.0 | | PubMed | PubMed, PMC |
| **Encoder-decoder** | | | | |
| T5 | Apache 2.0 | | Public | C4 |
| Flan-T5 | Apache 2.0 | | Public | C4, mixture of tasks (Fig. 2 in paper) |
| BART | Apache 2.0 | | Public | RoBERTa corpus |
| GLM | Apache 2.0 | | Public | BooksCorpus and English Wikipedia |
| ChatGLM | ChatGLM License | No use for illegal purposes or military research; no harming the public interest of society | N/A | 1T tokens of Chinese and English corpus |
| **Decoder-only** | | | | |
| GPT-2 | Modified MIT License | Use GPT-2 responsibly and clearly indicate your content was created using GPT-2 | Public | WebText |
| GPT-Neo | MIT License | | Public | Pile |
| GPT-J | Apache 2.0 | | Public | Pile |
| ---> Dolly | CC BY-NC 4.0 | | CC BY-NC 4.0; subject to the terms of use of data generated by OpenAI | Pile, Self-Instruct |
| ---> GPT4All-J | Apache 2.0 | | Public | GPT4All-J dataset |
| Pythia | Apache 2.0 | | Public | Pile |
| ---> Dolly v2 | MIT License | | Public | Pile, databricks-dolly-15k |
| OPT | OPT-175B License Agreement | No development relating to surveillance research or military use; no harming the public interest of society | Public | RoBERTa corpus, the Pile, PushShift.io Reddit |
| ---> OPT-IML | OPT-175B License Agreement | Same as OPT | Public | OPT corpus, extended version of Super-NaturalInstructions |
| YaLM | Apache 2.0 | | Unspecified | Pile, texts in Russian collected by the team |
| BLOOM | The BigScience RAIL License | No generating verifiably false information with the purpose of harming others; no content without expressly disclaiming that the text is machine-generated | Public | ROOTS corpus (Laurençon et al., 2022) |
| ---> BLOOMZ | The BigScience RAIL License | Same as BLOOM | Public | ROOTS corpus, xP3 |
| Galactica | CC BY-NC 4.0 | | N/A | The Galactica Corpus |
| LLaMA | Non-commercial bespoke license | No development relating to surveillance research or military use; no harming the public interest of society | Public | CommonCrawl, C4, GitHub, Wikipedia, etc. |
| ---> Alpaca | CC BY-NC 4.0 | | CC BY-NC 4.0; subject to the terms of use of data generated by OpenAI | LLaMA corpus, Self-Instruct |
| ---> Vicuna | CC BY-NC 4.0 | Subject to the terms of use of data generated by OpenAI; privacy practices of ShareGPT | | LLaMA corpus, 70K conversations from ShareGPT.com |
| ---> GPT4All | GPL-licensed LLaMA | | Public | GPT4All dataset |
| OpenLLaMA | Apache 2.0 | | Public | RedPajama |
| CodeGeeX | The CodeGeeX License | No use for illegal purposes or military research | Public | Pile, CodeParrot, etc. |
| StarCoder | BigCode OpenRAIL-M v1 license | No generating verifiably false information with the purpose of harming others; no content without expressly disclaiming that the text is machine-generated | Public | The Stack |
| MPT-7B | Apache 2.0 | | Public | mC4 (English), The Stack, RedPajama, S2ORC |
| Falcon | TII Falcon LLM License | Available under a license allowing commercial use | Public | RefinedWeb |

Open LLM datasets for pre-training

| Name | Release date | Paper/Blog | Dataset | Tokens (T) | License |
|------|--------------|------------|---------|------------|---------|
| starcoderdata | 2023/05 | StarCoder: A State-of-the-Art LLM for Code | starcoderdata | 0.25 | Apache 2.0 |
| RedPajama | 2023/04 | RedPajama, a project to create leading open-source models, starts by reproducing the LLaMA training dataset of over 1.2 trillion tokens | RedPajama-Data | 1.2 | Apache 2.0 |

Open LLM datasets for instruction-tuning

| Name | Release date | Paper/Blog | Dataset | Samples (K) | License |
|------|--------------|------------|---------|-------------|---------|
| MPT-7B-Instruct | 2023/05 | Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs | dolly_hhrlhf | 59 | CC BY-SA 3.0 |
| databricks-dolly-15k | 2023/04 | Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM | databricks-dolly-15k | 15 | CC BY-SA 3.0 |
| OIG (Open Instruction Generalist) | 2023/03 | The OIG Dataset | OIG | 44,000 | Apache 2.0 |

Open LLM datasets for alignment-tuning

| Name | Release date | Paper/Blog | Dataset | Samples (K) | License |
|------|--------------|------------|---------|-------------|---------|
| OpenAssistant Conversations Dataset | 2023/04 | OpenAssistant Conversations - Democratizing Large Language Model Alignment | oasst1 | 161 | Apache 2.0 |

Tutorials about LLM

  • [Andrej Karpathy] State of GPT video
  • [Hyung Won Chung] Instruction finetuning and RLHF lecture Youtube
  • [Jason Wei] Scaling, emergence, and reasoning in large language models Slides
  • [Susan Zhang] Open Pretrained Transformers Youtube
  • [Ameet Deshpande] How Does ChatGPT Work? Slides
  • [Yao Fu] Pre-training, Instruction Tuning, Alignment, and Specialization: On the Sources of Large Language Models' Abilities Bilibili
  • [Hung-yi Lee] An Analysis of How ChatGPT Works Youtube
  • [Jay Mody] GPT in 60 Lines of NumPy Link
  • [ICML 2022] Welcome to the "Big Model" Era: Techniques and Systems to Train and Serve Bigger Models Link
  • [NeurIPS 2022] Foundational Robustness of Foundation Models Link
  • [Andrej Karpathy] Let's build GPT: from scratch, in code, spelled out. Video|Code
  • [DAIR.AI] Prompt Engineering Guide Link
  • [邱锡鹏] Capability Analysis and Applications of Large Language Models Slides | Video
  • [Philipp Schmid] Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers Link
  • [HuggingFace] Illustrating Reinforcement Learning from Human Feedback (RLHF) Link
  • [HuggingFace] What Makes a Dialog Agent Useful? Link
  • [张俊林] The Road to AGI: Technical Essentials of Large Language Models (LLMs) Link
  • [大师兄] ChatGPT/InstructGPT Explained in Detail Link
  • [HeptaAI] Inside ChatGPT: InstructGPT and PPO Reinforcement Learning from Feedback Instructions Link
  • [Yao Fu] How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources Link
  • [Stephen Wolfram] What Is ChatGPT Doing … and Why Does It Work? Link
  • [Jingfeng Yang] Why did all of the public reproduction of GPT-3 fail? Link
  • [Hung-yi Lee] How ChatGPT Was (Probably) Made - The Socialization Process of GPT Video
  • [Keyvan Kambakhsh] Pure Rust implementation of a minimal Generative Pretrained Transformer code

Courses about LLM

  • [DeepLearning.AI] ChatGPT Prompt Engineering for Developers Homepage
  • [Princeton] Understanding Large Language Models Homepage
  • [OpenBMB] Open Course on Large Models Homepage
  • [Stanford] CS224N-Lecture 11: Prompting, Instruction Finetuning, and RLHF Slides
  • [Stanford] CS324-Large Language Models Homepage
  • [Stanford] CS25-Transformers United V2 Homepage
  • [Stanford Webinar] GPT-3 & Beyond Video
  • [李沐] In-depth Reading of the InstructGPT Paper Bilibili Youtube
  • [陳縕儂] OpenAI InstructGPT: Learning from Human Feedback - the Predecessor of ChatGPT Youtube
  • [李沐] HELM: Holistic Evaluation of Language Models Bilibili
  • [李沐] In-depth Reading of the GPT, GPT-2, and GPT-3 Papers Bilibili Youtube
  • [Aston Zhang] The Chain-of-Thought Paper Bilibili Youtube
  • [MIT] Introduction to Data-Centric AI Homepage

Tools for deploying LLM

  • FastChat - A distributed multi-model LLM serving system with a web UI and OpenAI-compatible RESTful APIs.

  • SkyPilot - Run LLMs and batch jobs on any cloud. Get maximum cost savings, highest GPU availability, and managed execution -- all with a simple interface.

  • vLLM - A high-throughput and memory-efficient inference and serving engine for LLMs

  • Text Generation Inference - A Rust, Python and gRPC server for text generation inference. Used in production at HuggingFace to power the LLM api-inference widgets.

  • Haystack - an open-source NLP framework that allows you to use LLMs and transformer-based models from Hugging Face, OpenAI and Cohere to interact with your own data.

  • Sidekick - Data integration platform for LLMs.

  • LangChain - Building applications with LLMs through composability

  • wechat-chatgpt - Use ChatGPT on WeChat via wechaty

  • promptfoo - Test your prompts. Evaluate and compare LLM outputs, catch regressions, and improve prompt quality.

  • Agenta - Easily build, version, evaluate and deploy your LLM-powered apps.
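Several of these servers (FastChat, vLLM, and Text Generation Inference in OpenAI-compatible mode) expose an OpenAI-style chat completions endpoint. The request body can be sketched as below; the endpoint URL and model name are placeholders for whatever your local deployment serves:

```python
import json

# Hypothetical local endpoint; FastChat and vLLM serve on port 8000 by default.
API_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "my-local-model",  # placeholder: use the name your server reports
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize RLHF in one sentence."},
    ],
    "temperature": 0.7,
    "max_tokens": 128,
}

body = json.dumps(payload)
print(body)
# Sending it requires a running server, e.g.:
# requests.post(API_URL, json=payload).json()
```

Because the wire format mirrors OpenAI's API, existing OpenAI client libraries can usually be pointed at the local server by overriding the base URL.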

What do release licenses mean in the LLM world?

  • Apache 2.0: Allows users to use the software for any purpose, to distribute it, to modify it, and to distribute modified versions of the software under the terms of the license, without concern for royalties.
  • MIT: Similar to Apache 2.0 but shorter and simpler. Also, in contrast to Apache 2.0, it does not require stating any significant changes made to the original code.
  • CC BY-SA-4.0: Allows (i) copying and redistributing the material and (ii) remixing, transforming, and building upon the material for any purpose, even commercially. But if you do the latter, you must distribute your contributions under the same license as the original. (Thus, may not be viable for internal teams.)
  • OpenRAIL-M v1: Allows royalty-free access and flexible downstream use and sharing of the model and modifications of it, and comes with a set of use restrictions (see Attachment A)
  • BSD-3-Clause: This version allows unlimited redistribution for any purpose as long as its copyright notices and the license's disclaimers of warranty are maintained.

Disclaimer: The information provided in this repo does not, and is not intended to, constitute legal advice. Maintainers of this repo are not responsible for the actions of third parties who use the models. Please seek legal advice before using models for commercial purposes.

✍️Note: This is a work in progress 🚧 !!! Watch the repo and I will keep you updated. Stay tuned 😊

Show your support

Give a 🌟 if this repo helped you!

@misc{Resources,
    title = {LLM-Resources-Papers-Frameworks-Tools},
    author = {Susant Achary},
    year = {2023}
}