👋 Присоединяйтесь к нам WeChat.
[ English | 中文 ]
Точная настройка большой языковой модели может быть настолько простой, как...
tutorial_en.mp4
Выберите расположение:
- Colab: https://colab.research.google.com/drive/1eRTPn37ltBbYsISy9Aw2NuI2Aq5CQrD9?usp=sharing
- Локально: Используйте эту инструкцию usage
- Возможности
- Производительность
- Изменения
- Поддерживаемые модели
- Поддерживаемые подходы к обучению
- Предоставленные наборы данных
- Требования
- Начало работы
- Проекты использующие LLaMA Factory
- Лицензия
- Цитирование
- Благодарность
- Разные модели: LLaMA, Mistral, Mixtral-MoE, Qwen, Yi, Gemma, Baichuan, ChatGLM, Phi, и другие.
- Интегрированные подходы: (Continuous) pre-training, supervised fine-tuning, reward modeling, PPO, DPO и ORPO.
- Масштабируемые ресурсы: 32-bit full-tuning, 16-bit freeze-tuning, 16-bit LoRA и 2/4/8-bit QLoRA через AQLM/AWQ/GPTQ/LLM.int8.
- Расширенные алгоритмы: GaLore, BAdam, DoRA, LongLoRA, LLaMA Pro, Mixture-of-Depths, LoRA+, LoftQ и Agent tuning.
- Практические трюки: FlashAttention-2, Unsloth, RoPE scaling, NEFTune и rsLoRA.
- Мониторинг экспериментов: LlamaBoard, TensorBoard, Wandb, MLflow, и другие.
- Более быстрый вывод: OpenAI совместимый API, Gradio UI и CLI с vLLM worker.
В сравнении с ChatGLM's P-Tuning, LoRA tuning в LLaMA Factory обеспечивает скорость обучения в 3.7 раз быстрее с большим баллом в Rouge на задаче генерации рекламного текста. Используя технику 4-bit quantization, LLaMA Factory's QLoRA повышает еще больше эффективность использования GPU памяти.
Определения
- Скорость обучения: количество обучающих выборок обрабатываемые в секунду во время обучения. (bs=4, cutoff_len=1024)
- Балл Rouge: Балл Rouge-2 набора разработки задачи генерации рекламного текста. (bs=4, cutoff_len=1024)
- Память GPU: Максимальное использование памяти GPU в обучении 4-bit quantized. (bs=1, cutoff_len=1024)
- We adopt
pre_seq_len=128
for ChatGLM's P-Tuning andlora_rank=32
for LLaMA Factory's LoRA tuning.
[24/04/21] We supported Mixture-of-Depths according to AstraMindAI's implementation. See examples/extras/mod
for usage.
[24/04/19] We supported Meta Llama 3 model series.
[24/04/16] We supported BAdam. See examples/extras/badam
for usage.
[24/04/16] We supported unsloth's long-sequence training (Llama-2-7B-56k within 24GB). It achieves 117% speed and 50% memory compared with FlashAttention-2, more benchmarks can be found in this page.
Full Changelog
[24/03/31] We supported ORPO. See examples/lora_single_gpu
for usage.
[24/03/21] Our paper "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models" is available at arXiv!
[24/03/20] We supported FSDP+QLoRA that fine-tunes a 70B model on 2x24GB GPUs. See examples/extras/fsdp_qlora
for usage.
[24/03/13] We supported LoRA+. See examples/extras/loraplus
for usage.
[24/03/07] We supported gradient low-rank projection (GaLore) algorithm. See examples/extras/galore
for usage.
[24/03/07] We integrated vLLM for faster and concurrent inference. Try --infer_backend vllm
to enjoy 270% inference speed. (LoRA is not yet supported, merge it first.)
[24/02/28] We supported weight-decomposed LoRA (DoRA). Try --use_dora
to activate DoRA training.
[24/02/15] We supported block expansion proposed by LLaMA Pro. See examples/extras/llama_pro
for usage.
[24/02/05] Qwen1.5 (Qwen2 beta version) series models are supported in LLaMA-Factory. Check this blog post for details.
[24/01/18] We supported agent tuning for most models, equipping model with tool using abilities by fine-tuning with --dataset glaive_toolcall
.
[23/12/23] We supported unsloth's implementation to boost LoRA tuning for the LLaMA, Mistral and Yi models. Try --use_unsloth
argument to activate unsloth patch. It achieves 170% speed in our benchmark, check this page for details.
[23/12/12] We supported fine-tuning the latest MoE model Mixtral 8x7B in our framework. See hardware requirement here.
[23/12/01] We supported downloading pre-trained models and datasets from the ModelScope Hub for Chinese mainland users. See this tutorial for usage.
[23/10/21] We supported NEFTune trick for fine-tuning. Try --neftune_noise_alpha
argument to activate NEFTune, e.g., --neftune_noise_alpha 5
.
[23/09/27] We supported --shift_attn
argument to enable shift short attention.
[23/09/23] We integrated MMLU, C-Eval and CMMLU benchmarks in this repo. See this example to evaluate your models.
[23/09/10] We supported FlashAttention-2. Try --flash_attn
argument to enable FlashAttention-2 if you are using RTX4090, A100 or H100 GPUs.
[23/08/12] We supported RoPE scaling to extend the context length of the LLaMA models. Try --rope_scaling linear
argument in training and --rope_scaling dynamic
argument at inference to extrapolate the position embeddings.
[23/08/11] We supported DPO training for instruction-tuned models. See this example to train your models.
[23/07/31] We supported dataset streaming. Try --streaming
and --max_steps 10000
arguments to load your dataset in streaming mode.
[23/07/29] We released two instruction-tuned 13B models at Hugging Face. See these Hugging Face Repos (LLaMA-2 / Baichuan) for details.
[23/07/18] We developed an all-in-one Web UI for training, evaluation and inference. Try train_web.py
to fine-tune models in your Web browser. Thank @KanadeSiina and @codemayq for their efforts in the development.
[23/07/09] We released FastEdit ⚡🩹, an easy-to-use package for editing the factual knowledge of large language models efficiently. Please follow FastEdit if you are interested.
[23/06/29] We provided a reproducible example of training a chat model using instruction-following datasets, see Baichuan-7B-sft for details.
[23/06/22] We aligned the demo API with the OpenAI's format where you can insert the fine-tuned model in arbitrary ChatGPT-based applications.
[23/06/03] We supported quantized training and inference (aka QLoRA). Try --quantization_bit 4/8
argument to work with quantized models.
Модель | Размер модели | Стандартный модуль | Шаблон |
---|---|---|---|
Baichuan2 | 7B/13B | W_pack | baichuan2 |
BLOOM | 560M/1.1B/1.7B/3B/7.1B/176B | query_key_value | - |
BLOOMZ | 560M/1.1B/1.7B/3B/7.1B/176B | query_key_value | - |
ChatGLM3 | 6B | query_key_value | chatglm3 |
Command-R | 35B/104B | q_proj,v_proj | cohere |
DeepSeek (MoE) | 7B/16B/67B | q_proj,v_proj | deepseek |
Falcon | 7B/40B/180B | query_key_value | falcon |
Gemma/CodeGemma | 2B/7B | q_proj,v_proj | gemma |
InternLM2 | 7B/20B | wqkv | intern2 |
LLaMA | 7B/13B/33B/65B | q_proj,v_proj | - |
LLaMA-2 | 7B/13B/70B | q_proj,v_proj | llama2 |
LLaMA-3 | 8B/70B | q_proj,v_proj | llama3 |
Mistral/Mixtral | 7B/8x7B/8x22B | q_proj,v_proj | mistral |
OLMo | 1B/7B | att_proj | olmo |
Phi-1.5/2 | 1.3B/2.7B | q_proj,v_proj | - |
Qwen | 1.8B/7B/14B/72B | c_attn | qwen |
Qwen1.5 (Code/MoE) | 0.5B/1.8B/4B/7B/14B/32B/72B | q_proj,v_proj | qwen |
StarCoder2 | 3B/7B/15B | q_proj,v_proj | - |
XVERSE | 7B/13B/65B | q_proj,v_proj | xverse |
Yi | 6B/9B/34B | q_proj,v_proj | yi |
Yuan | 2B/51B/102B | q_proj,v_proj | yuan |
Note
Стандартный модуль используется для аргумента --lora_target
, вы можете использовать --lora_target all
чтобы использовать все доступные модули.
Для "базовой" модели, аргумент --template
может быть выбран из default
, alpaca
, vicuna
и др. Но обязательно используйте соответствующий шаблон для "instruct/chat" моделей.
Запомните, используйте ТОТ ЖЕ САМЫЙ шаблон для обучения и вывода.
Пожалуйста, следуйте за constants.py для получения полного списка поддерживаемых моделей.
Вы можете добавить свой формат чата template.py.
Подход | Full-tuning | Freeze-tuning | LoRA | QLoRA |
---|---|---|---|---|
Pre-Training | ✅ | ✅ | ✅ | ✅ |
Supervised Fine-Tuning | ✅ | ✅ | ✅ | ✅ |
Reward Modeling | ✅ | ✅ | ✅ | ✅ |
PPO Training | ✅ | ✅ | ✅ | ✅ |
DPO Training | ✅ | ✅ | ✅ | ✅ |
ORPO Training | ✅ | ✅ | ✅ | ✅ |
Наборы данных для пред-обучения
Наборы данных для контроллируемеого файн-тюнинга
- Stanford Alpaca (en)
- Stanford Alpaca (zh)
- Alpaca GPT4 (en&zh)
- Self Cognition (zh)
- Open Assistant (multilingual)
- ShareGPT (zh)
- Guanaco Dataset (multilingual)
- BELLE 2M (zh)
- BELLE 1M (zh)
- BELLE 0.5M (zh)
- BELLE Dialogue 0.4M (zh)
- BELLE School Math 0.25M (zh)
- BELLE Multiturn Chat 0.8M (zh)
- UltraChat (en)
- LIMA (en)
- OpenPlatypus (en)
- CodeAlpaca 20k (en)
- Alpaca CoT (multilingual)
- OpenOrca (en)
- SlimOrca (en)
- MathInstruct (en)
- Firefly 1.1M (zh)
- Wiki QA (en)
- Web QA (zh)
- WebNovel (zh)
- Nectar (en)
- deepctrl (en&zh)
- Ad Gen (zh)
- ShareGPT Hyperfiltered (en)
- ShareGPT4 (en&zh)
- UltraChat 200k (en)
- AgentInstruct (en)
- LMSYS Chat 1M (en)
- Evol Instruct V2 (en)
- Glaive Function Calling V2 (en)
- Cosmopedia (en)
- Open Assistant (de)
- Dolly 15k (de)
- Alpaca GPT4 (de)
- OpenSchnabeltier (de)
- Evol Instruct (de)
- Dolphin (de)
- Booksum (de)
- Airoboros (de)
- Ultrachat (de)
Наборы данных на основе предпочтений
Некоторые наборы данных требуют подтверждения перед его использованием, мы рекомендуем предварительно авторизоваться в ваш аккаунт Hugging Face используя данные команды.
pip install --upgrade huggingface_hub
huggingface-cli login
Обязательно | Минимум | Рекоменд. |
---|---|---|
python | 3.8 | 3.10 |
torch | 1.13.1 | 2.2.0 |
transformers | 4.37.2 | 4.39.3 |
datasets | 2.14.3 | 2.18.0 |
accelerate | 0.27.2 | 0.28.0 |
peft | 0.9.0 | 0.10.0 |
trl | 0.8.1 | 0.8.1 |
Опционально | Минимум | Рекоменд. |
---|---|---|
CUDA | 11.6 | 12.2 |
deepspeed | 0.10.0 | 0.14.0 |
bitsandbytes | 0.39.0 | 0.43.0 |
flash-attn | 2.3.0 | 2.5.6 |
* estimated
Method | Bits | 7B | 13B | 30B | 70B | 8x7B | 8x22B |
---|---|---|---|---|---|---|---|
Full | AMP | 120GB | 240GB | 600GB | 1200GB | 900GB | 2400GB |
Full | 16 | 60GB | 120GB | 300GB | 600GB | 400GB | 1200GB |
Freeze | 16 | 20GB | 40GB | 80GB | 200GB | 160GB | 400GB |
LoRA/GaLore/BAdam | 16 | 16GB | 32GB | 64GB | 160GB | 120GB | 320GB |
QLoRA | 8 | 10GB | 20GB | 40GB | 80GB | 60GB | 160GB |
QLoRA | 4 | 6GB | 12GB | 24GB | 48GB | 30GB | 96GB |
QLoRA | 2 | 4GB | 8GB | 16GB | 24GB | 18GB | 48GB |
Откройте data/README.md для детальной проверки формата файлов набора данных. Вы можете использовать наборы данных HuggingFace / ModelScope hub или загрузить наборы данных на локальный диск.
Note
Пожалуйста обновите файл data/dataset_info.json
для использования собственного набора данных.
git clone https://github.com/hiyouga/LLaMA-Factory.git
conda create -n llama_factory python=3.10
conda activate llama_factory
cd LLaMA-Factory
pip install -e .[metrics]
Доступные дополнительные зависимости: deepspeed, metrics, unsloth, galore, badam, vllm, bitsandbytes, gptq, awq, aqlm, qwen, modelscope, quality
Для пользователей Windows
If you want to enable the quantized LoRA (QLoRA) on the Windows platform, you will be required to install a pre-built version of bitsandbytes
library, which supports CUDA 11.1 to 12.2, please select the appropriate release version based on your CUDA version.
pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.2.post2-py3-none-win_amd64.whl
Для включения FlashAttention-2 на платформе Windows, вам нужно будет установить компилированную версию библиотеки flash-attn
, с поддержкой CUDA 12.1 до 12.2. Пожалуйста, загрузите соответствующую версию с flash-attention на основе ваших требований.
Important
LLaMA Board GUI поддерживает обучение только на одном GPU, используйте CLI для распределенного обучения.
export CUDA_VISIBLE_DEVICES=0 # `set CUDA_VISIBLE_DEVICES=0` для Windows
export GRADIO_SERVER_PORT=7860 # `set GRADIO_SERVER_PORT=7860` для Windows
python src/train_web.py # или python -m llmtuner.webui.interface
Для пользователей Alibaba Cloud
If you encountered display problems in LLaMA Board on Alibaba Cloud, try using the following command to set environment variables before starting LLaMA Board:
export GRADIO_ROOT_PATH=/${JUPYTER_NAME}/proxy/7860/
docker build -f ./Dockerfile -t llama-factory:latest .
docker run --gpus=all \
-v ./hf_cache:/root/.cache/huggingface/ \
-v ./data:/app/data \
-v ./output:/app/output \
-e CUDA_VISIBLE_DEVICES=0 \
-p 7860:7860 \
--shm-size 16G \
--name llama_factory \
-d llama-factory:latest
docker compose -f ./docker-compose.yml up -d
Details about volume
- hf_cache: Utilize Hugging Face cache on the host machine. Reassignable if a cache already exists in a different directory.
- data: Place datasets on this dir of the host machine so that they can be selected on LLaMA Board GUI.
- output: Set export dir to this location so that the merged result can be accessed directly on the host machine.
See examples/README.md for usage.
Use python src/train_bash.py -h
to display arguments description.
CUDA_VISIBLE_DEVICES=0,1 API_PORT=8000 python src/api_demo.py \
--model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
--template llama3 \
--infer_backend vllm \
--vllm_enforce_eager
Если у вас есть проблемы с загрузкой моделей с Hugging Face, вы можете использовать ModelScope.
export USE_MODELSCOPE_HUB=1 # `set USE_MODELSCOPE_HUB=1` для Windows
Train the model by specifying a model ID of the ModelScope Hub as the --model_name_or_path
. You can find a full list of model IDs at ModelScope Hub, e.g., LLM-Research/Meta-Llama-3-8B-Instruct
.
Если у вас есть проекты, которые нужно включить здесь, сообщите нам по email или создайте pull request.
Нажмите для просмотра
- Wang et al. ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation. 2023. [arxiv]
- Yu et al. Open, Closed, or Small Language Models for Text Classification? 2023. [arxiv]
- Wang et al. UbiPhysio: Support Daily Functioning, Fitness, and Rehabilitation with Action Understanding and Feedback in Natural Language. 2023. [arxiv]
- Luceri et al. Leveraging Large Language Models to Detect Influence Campaigns in Social Media. 2023. [arxiv]
- Zhang et al. Alleviating Hallucinations of Large Language Models through Induced Hallucinations. 2023. [arxiv]
- Wang et al. Know Your Needs Better: Towards Structured Understanding of Marketer Demands with Analogical Reasoning Augmented LLMs. 2024. [arxiv]
- Wang et al. CANDLE: Iterative Conceptualization and Instantiation Distillation from Large Language Models for Commonsense Reasoning. 2024. [arxiv]
- Choi et al. FACT-GPT: Fact-Checking Augmentation via Claim Matching with LLMs. 2024. [arxiv]
- Zhang et al. AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts. 2024. [arxiv]
- Lyu et al. KnowTuning: Knowledge-aware Fine-tuning for Large Language Models. 2024. [arxiv]
- Yang et al. LaCo: Large Language Model Pruning via Layer Collaps. 2024. [arxiv]
- Bhardwaj et al. Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic. 2024. [arxiv]
- Yang et al. Enhancing Empathetic Response Generation by Augmenting LLMs with Small-scale Empathetic Models. 2024. [arxiv]
- Yi et al. Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding. 2024. [arxiv]
- Cao et al. Head-wise Shareable Attention for Large Language Models. 2024. [arxiv]
- Zhang et al. Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages. 2024. [arxiv]
- Kim et al. Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models. 2024. [arxiv]
- Yu et al. KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models. 2024. [arxiv]
- Huang et al. Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning. 2024. [arxiv]
- Duan et al. Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization. 2024. [arxiv]
- Xie and Schwertfeger. Empowering Robotics with Large Language Models: osmAG Map Comprehension with LLMs. 2024. [arxiv]
- Zhang et al. EDT: Improving Large Language Models' Generation by Entropy-based Dynamic Temperature Sampling. 2024. [arxiv]
- Weller et al. FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions. 2024. [arxiv]
- Hongbin Na. CBT-LLM: A Chinese Large Language Model for Cognitive Behavioral Therapy-based Mental Health Question Answering. 2024. [arxiv]
- Zan et al. CodeS: Natural Language to Code Repository via Multi-Layer Sketch. 2024. [arxiv]
- Liu et al. Extensive Self-Contrast Enables Feedback-Free Language Model Alignment. 2024. [arxiv]
- Luo et al. BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models. 2024. [arxiv]
- Du et al. Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model. 2024. [arxiv]
- Liu et al. Dynamic Generation of Personalities with Large Language Models. 2024. [arxiv]
- StarWhisper: A large language model for Astronomy, based on ChatGLM2-6B and Qwen-14B.
- DISC-LawLLM: A large language model specialized in Chinese legal domain, based on Baichuan-13B, is capable of retrieving and reasoning on legal knowledge.
- Sunsimiao: A large language model specialized in Chinese medical domain, based on Baichuan-7B and ChatGLM-6B.
- CareGPT: A series of large language models for Chinese medical domain, based on LLaMA2-7B and Baichuan-13B.
- MachineMindset: A series of MBTI Personality large language models, capable of giving any LLM 16 different personality types based on different datasets and training methods.
Данный репозиторий лицензирован как Apache-2.0 License.
Пожалуйста, соблюдайте лицензии моделей на веса: Baichuan2 / BLOOM / ChatGLM3 / Command-R / DeepSeek / Falcon / Gemma / InternLM2 / LLaMA / LLaMA-2 / LLaMA-3 / Mistral / OLMo / Phi-1.5/2 / Qwen / StarCoder2 / XVERSE / Yi / Yuan
Если данная работа была полезна для вас, пожалуйста цитируйте как:
@article{zheng2024llamafactory,
title={LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models},
author={Yaowei Zheng and Richong Zhang and Junhao Zhang and Yanhan Ye and Zheyan Luo and Yongqiang Ma},
journal={arXiv preprint arXiv:2403.13372},
year={2024},
url={http://arxiv.org/abs/2403.13372}
}
Данный репозиторий использует PEFT, TRL, QLoRA and FastChat.Спасибо им за отличную работу.