LLM+LoRA+RAG+..
✅ LLM Fine-Tuning
This article describes fine-tuning an LLM from Hugging Face, preparing the dataset, and running the result on my RTX 3060 desktop. The problem was that even with careful tuning of the various training parameters, the results were not satisfactory. In particular, training more aggressively (for example with a higher learning rate or more epochs) often made the model worse at general questions.
✅ LoRA
Training the Meta-Llama-3-8B-Instruct model with LoRA and converting it for use in Ollama can be done with the following steps. Given the earlier failures with the DeepSeek model, this time we follow a procedure tailored specifically to the LLaMA 3 model.
🧠 1. Fine-tuning Meta-Llama-3-8B-Instruct with LoRA
Prerequisites
- You must request access to the Meta-Llama-3-8B-Instruct model from Hugging Face.
- Download the model:
```bash
git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
```
- Install the transformers, peft, accelerate, and datasets libraries:
```bash
pip install -U transformers peft accelerate datasets
```
Dataset Preparation
- Prepare the data as JSON or CSV in the following format (a small sketch for building such a file follows below):
json { "question": "Prompt Contents", "answer": "Answer generated by Model" }
LoRA Training Code Example
- Train with Hugging Face's PEFT library.
- Use either SFTTrainer or the plain Trainer; SFTTrainer is a good fit for small datasets (see the sketch below).
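The following is a minimal, hedged sketch of the SFTTrainer route, assuming a recent version of trl and a `train.jsonl` with question/answer pairs; exact argument names vary between trl releases, so treat it as a starting point rather than a drop-in script:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Load the question/answer pairs and collapse each into a single "text" field
dataset = load_dataset("json", data_files="train.jsonl")["train"]
dataset = dataset.map(lambda ex: {"text": ex["question"] + "\n" + ex["answer"]})

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    train_dataset=dataset,
    peft_config=LoraConfig(r=4, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"),
    args=SFTConfig(
        output_dir="./sft-llama3-lora",
        dataset_text_field="text",
        per_device_train_batch_size=1,
        num_train_epochs=1,
    ),
)
trainer.train()
```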
🧠 2. Converting the Model for Ollama
Since Ollama expects models in GGUF format (.gguf), the following steps are required to use the trained LoRA model with it:
① LoRA Merge
Merge the trained LoRA adapter into the original LLaMA 3 model:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
lora_model = PeftModel.from_pretrained(base_model, "path/to/lora_adapter")
merged_model = lora_model.merge_and_unload()
merged_model.save_pretrained("merged-llama3")
```
② Convert to GGUF format
- Convert the merged model to .gguf format with llama.cpp's conversion script (convert_hf_to_gguf.py).
- Example:
```bash
python3 convert_hf_to_gguf.py merged-llama3 --outfile llama3-merged.gguf --outtype f16
```
③ Register with Ollama
Place the converted .gguf file next to a Modelfile that references it:

```
FROM ./llama3-merged.gguf
```

Then register and run the model:

```bash
ollama create my-llama3-lora -f Modelfile
ollama run my-llama3-lora
```
💡 Tips
- When training with LoRA, adjusting hyperparameters such as r, alpha, and dropout can have a significant impact on performance (see the sketch after these tips).
- Ollama works best with quantized GGUF models, so consider quantization types such as q4_K_M or q5_K_M when converting, and pick one that fits the VRAM of your GPU at home.
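As a reference for the first tip, this is the kind of adjustment involved; the values below are purely illustrative examples, not recommendations:

```python
from peft import LoraConfig

# Illustrative alternative LoRA hyperparameters (example values only)
peft_config = LoraConfig(
    r=16,              # larger rank = more trainable parameters
    lora_alpha=32,     # scaling factor applied to the LoRA update
    lora_dropout=0.1,  # dropout applied inside the LoRA layers
    bias="none",
    task_type="CAUSAL_LM",
)
```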
✅ LoRA Training
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.optim import AdamW
from tqdm import tqdm
# Local path to the model (or a Hugging Face model id)
model_id = "./Meta-Llama-3-8B-Instruct"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
# Load model
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto"
)
# LoRA Setup
peft_config = LoraConfig(
r=4,
lora_alpha=32,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
# LoRA Apply
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# Load Dataset
dataset = load_dataset("json", data_files="train.jsonl")
# Preprocessing: concatenate prompt and answer into one sequence so the labels
# align with the input_ids (adjust the keys to match your dataset, e.g. question/answer)
def format_example(example):
    text = example["instruction"] + "\n" + example["output"] + tokenizer.eos_token
    tokens = tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=2048,
    )
    # Mask padding positions so they do not contribute to the loss
    tokens["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(tokens["input_ids"], tokens["attention_mask"])
    ]
    return tokens

tokenized_ds = dataset["train"].map(format_example, remove_columns=dataset["train"].column_names)
tokenized_ds.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
dataloader = DataLoader(tokenized_ds, batch_size=1)
# Setup Optimizer
optimizer = AdamW(model.parameters(), lr=1e-4)
# Train Loop
model.train()
for epoch in range(1):  # single epoch for a quick run
    for batch in tqdm(dataloader):
        # The DataLoader already adds the batch dimension, so no unsqueeze is needed
        input_ids = batch["input_ids"].to(model.device)
        attention_mask = batch["attention_mask"].to(model.device)
        labels = batch["labels"].to(model.device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
# Save LoRA Adapter
model.save_pretrained("./lora-llama3")
tokenizer.save_pretrained("./lora-llama3")
```

If `device_map="auto"` causes an error, drop it and move the model to the GPU explicitly with `.to("cuda")`.
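A minimal sketch of that fallback, assuming a single CUDA GPU with enough free VRAM for the fp16 weights:

```python
import torch
from transformers import AutoModelForCausalLM

# Load without device_map and place the whole model on the GPU manually
model = AutoModelForCausalLM.from_pretrained(
    "./Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.float16,
).to("cuda")
```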
✅ Merge
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load Base Model
base_model = AutoModelForCausalLM.from_pretrained(
"./Meta-Llama-3-8B-Instruct",
torch_dtype=torch.float16,
device_map="auto"
)
# Load LoRA Adapter
lora_model = PeftModel.from_pretrained(base_model, "./lora-llama3")
# Merge
merged_model = lora_model.merge_and_unload()
# Save
merged_model.save_pretrained("./merged-llama3", safe_serialization=False)
tokenizer = AutoTokenizer.from_pretrained("./Meta-Llama-3-8B-Instruct")
tokenizer.save_pretrained("./merged-llama3")
```

✅ Verify the Fine-Tuned Model
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Setup
BASE_MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
LORA_PATH = "./lora-llama3"  # Path to the LoRA adapter saved during training
PROMPT = "Your question here ..."
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
# === Base Model ===
print("🚀Base Model Load ...")
base_model = AutoModelForCausalLM.from_pretrained(
BASE_MODEL,
device_map="auto",
torch_dtype=torch.float16
)
base_model.eval()
# === Input ===
inputs = tokenizer(PROMPT, return_tensors="pt").to(base_model.device)

# === Base Model Inference ===
# Run this BEFORE attaching the adapter: PeftModel.from_pretrained injects the LoRA
# layers into base_model in place, so the order matters for a fair comparison.
with torch.no_grad():
    base_out = base_model.generate(**inputs, max_new_tokens=200)
base_text = tokenizer.decode(base_out[0], skip_special_tokens=True)

# === Load Fine-Tuned (LoRA) Model ===
print("🚀 Loading LoRA-applied model ...")
lora_model = PeftModel.from_pretrained(
    base_model,
    LORA_PATH
)
lora_model.eval()

# === Fine-Tuned Model Inference ===
with torch.no_grad():
    lora_out = lora_model.generate(**inputs, max_new_tokens=200)
lora_text = tokenizer.decode(lora_out[0], skip_special_tokens=True)
# === Output ===
print("\n==============================")
print("📌 Prompt:")
print(PROMPT)
print("\n--- Base Model ---")
print(base_text)
print("\n--- LoRA Model ---")
print(lora_text)
print("==============================")
```

✅ Make a GGUF Model with llama.cpp
In the case of an Unsloth-trained model (a LLaMA 2 checkpoint in this example):
```bash
cd ../llama.cpp
rm -f ./llama2-local.gguf ./llama2-local-q8.gguf
python convert_hf_to_gguf.py /home/llm/unsloth/llama-2-7b-bnb-fine --outfile ./llama2-local.gguf --outtype f16
./build/bin/llama-quantize ./llama2-local.gguf llama2-local-q8.gguf q8_0
./build/bin/llama-server -m ./llama2-local-q8.gguf --port 8080 --host 0.0.0.0 --threads $(nproc) --n-gpu-layers 35
```
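To sanity-check the running server, you can query llama-server's OpenAI-compatible chat endpoint on the host and port used above; exact response fields can vary slightly between llama.cpp versions, so treat this as a rough sketch:

```python
import requests

# One-shot request against the locally running llama-server
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```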
✅ Ollama model
If you are happy with the results above, run the commands below.
📌 Modelfile
```
FROM ./llama2-local-q8.gguf
# System prompt: keeps the model answering consistently as an assistant
SYSTEM """
You are a helpful and knowledgeable assistant.
Answer the user's question directly and do not repeat the user's input.
If the question is unclear, ask for clarification.
"""

# Tells Ollama how to assemble the prompt; the {{ .System }} and {{ .Prompt }}
# placeholders must appear, otherwise the user's input never reaches the model
TEMPLATE """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
# Important: lets the model recognize the end-of-turn tokens and stop generating
PARAMETER stop "<|end_of_text|>"
PARAMETER stop "<|eot_id|>"
# Optionally add temperature and output-length parameters
PARAMETER temperature 0.7
PARAMETER num_predict 1024
```

📌 Create the Ollama Model
Build the Ollama model from the Modelfile and run it:
```bash
ollama create llama2-local-q8 -f Modelfile
ollama list
ollama serve
ollama run llama2-local-q8
```
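Once the model is registered, you can also exercise it outside the interactive CLI through Ollama's local REST API (default port 11434):

```python
import requests

# One-shot generation request against the locally registered model
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2-local-q8", "prompt": "What is LoRA?", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```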
✅ RAG, Open Web UI
```python
from flask import Flask, request, jsonify
from langchain_community.llms import Ollama  # provides the Ollama(...) LLM wrapper used below
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from flask_cors import CORS
import logging
import pandas as pd
from tabulate import tabulate
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s"
)
app = Flask(__name__)
CORS(app)
llm = Ollama(model="llama3")
# Embedding model setup
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-code-v1", model_kwargs={"device": "cuda"})
# Load the vector DB
db = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
prompt_template = PromptTemplate(
input_variables=["context", "question"],
template=(
"다음 문맥을 참고하여 질문에 답하세요.\n\n"
"문맥:\n{context}\n\n"
"질문:\n{question}\n\n"
"답변:"
)
)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=db.as_retriever(),
chain_type="stuff",
chain_type_kwargs={"prompt": prompt_template}
)
@app.route("/query", methods=["POST"])
def rag():
logging.info('POST(/query) arrived !')
data = request.get_json()
query = data.get("query", "")
if not query:
return jsonify({"error": "질문이 필요합니다."}), 400
logging.info(f"query : {query}")
answer = qa_chain.run(query)
logging.info(f"Answer : {answer}")
return jsonify({"answer": answer})
if __name__ == "__main__":
    app.run(debug=True, port=5000)
```
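A quick way to exercise the /query endpoint defined above (the question text is just an example):

```python
import requests

# Send a question to the RAG endpoint and print the generated answer
resp = requests.post(
    "http://localhost:5000/query",
    json={"query": "Summarize what the indexed documents say about LoRA."},
    timeout=60,
)
print(resp.json()["answer"])
```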
📌 MCP
```python
from mcp.server.fastmcp import FastMCP
import logging
import requests
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s"
)
mcp = FastMCP("chroma_rag")
@mcp.tool()
async def ask_chroma(query: str) -> str:
try:
logging.info(f"query received: {query}")
response = requests.post(
"http://localhost:5000/query",
json={"query": query},
timeout=10
)
logging.info(f"response status: {response.status_code}")
logging.info(f"response all data : {str(response)}")
data = response.json()
logging.info(f"==========================")
logging.info(f"response data : {data}")
        res_data = data.get("answer", "Could not retrieve an answer.")
res_format = type(res_data)
logging.info(f"return data : {res_data}, format : {res_format}")
return res_data
except Exception as e:
return f"오류 발생: {str(e)}"
if __name__ == "__main__":
    mcp.run()
```
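To test the tool end to end, an MCP client can launch this server over stdio and call ask_chroma. Here is a hedged sketch using the official Python MCP SDK; the file name mcp_server.py and the question are placeholders:

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # "mcp_server.py" stands in for whatever file contains the FastMCP server above
    params = StdioServerParameters(command="python", args=["mcp_server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool("ask_chroma", {"query": "What is LoRA?"})
            print(result.content)

asyncio.run(main())
```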