LLM+LoRA+RAG+..
✅ LLM Fine-Tuning
This article describes fine-tuning an LLM from Hugging Face, preparing the dataset, and running the result on my RTX 3060 desktop. The problem was that even with careful tuning of the various training parameters, the results were not satisfactory. In particular, training more aggressively (for example with a higher learning rate or more epochs) often made the model worse at general questions.
✅ LoRA
Training the Meta-Llama-3-8B-Instruct model with LoRA and converting it for use in Ollama can be done with the following steps. Given the earlier failures with the DeepSeek model, this time we follow a procedure tailored specifically to the LLaMA 3 model.
🧠 1. Fine-tuning Meta-Llama-3-8B-Instruct with LoRA
Prerequisites
- You must request access to the Meta-Llama-3-8B-Instruct model from Hugging Face.
- Download the model:
```bash
git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
```
- Install the transformers, peft, accelerate, and datasets libraries:
```bash
pip install -U transformers peft accelerate datasets
```
Dataset Preparation
- Prepare the data as JSON or CSV in the following format (a small sketch for building such a file follows below):
json { "question": "Prompt Contents", "answer": "Answer generated by Model" }
LoRA Training Code Example
- Train with Hugging Face's PEFT library.
- Use either SFTTrainer or the plain Trainer; SFTTrainer is a good fit for small datasets (see the sketch below).
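The following is a minimal, hedged sketch of the SFTTrainer route, assuming a recent version of trl and a `train.jsonl` with question/answer pairs; exact argument names vary between trl releases, so treat it as a starting point rather than a drop-in script:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Load the question/answer pairs and collapse each into a single "text" field
dataset = load_dataset("json", data_files="train.jsonl")["train"]
dataset = dataset.map(lambda ex: {"text": ex["question"] + "\n" + ex["answer"]})

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    train_dataset=dataset,
    peft_config=LoraConfig(r=4, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"),
    args=SFTConfig(
        output_dir="./sft-llama3-lora",
        dataset_text_field="text",
        per_device_train_batch_size=1,
        num_train_epochs=1,
    ),
)
trainer.train()
```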
🧠 2. Converting the Model for Ollama
Since Ollama expects models in GGUF format (.gguf), the following steps are required to use the trained LoRA model with it:
① LoRA Merge
Merge the trained LoRA adapter into the original LLaMA 3 model:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
lora_model = PeftModel.from_pretrained(base_model, "path/to/lora_adapter")
merged_model = lora_model.merge_and_unload()
merged_model.save_pretrained("merged-llama3")
```
② Convert to GGUF format
- Convert the merged model to .gguf format with llama.cpp's conversion script (convert_hf_to_gguf.py).
- Example:
```bash
python3 convert_hf_to_gguf.py merged-llama3 --outfile llama3-merged.gguf --outtype f16
```
③ Register with Ollama
Place the converted .gguf file next to a Modelfile that references it:

```
FROM ./llama3-merged.gguf
```

Then register and run the model:

```bash
ollama create my-llama3-lora -f Modelfile
ollama run my-llama3-lora
```
💡 Tips
- When training with LoRA, adjusting hyperparameters such as r, alpha, and dropout can have a significant impact on performance (see the sketch after these tips).
- Ollama works best with quantized GGUF models, so consider quantization types such as q4_K_M or q5_K_M when converting, and pick one that fits the VRAM of your GPU at home.
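As a reference for the first tip, this is the kind of adjustment involved; the values below are purely illustrative examples, not recommendations:

```python
from peft import LoraConfig

# Illustrative alternative LoRA hyperparameters (example values only)
peft_config = LoraConfig(
    r=16,              # larger rank = more trainable parameters
    lora_alpha=32,     # scaling factor applied to the LoRA update
    lora_dropout=0.1,  # dropout applied inside the LoRA layers
    bias="none",
    task_type="CAUSAL_LM",
)
```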
✅ LoRA Training
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.optim import AdamW
from tqdm import tqdm
# Local path to the model (or a Hugging Face model id)
model_id = "./Meta-Llama-3-8B-Instruct"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
# Load model
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
device_map="auto"
)
# LoRA Setup
peft_config = LoraConfig(
r=4,
lora_alpha=32,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
# LoRA Apply
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# Load Dataset
dataset = load_dataset("json", data_files="train.jsonl")
# Preprocessing: concatenate prompt and answer into one sequence so the labels
# align with the input_ids (adjust the keys to match your dataset, e.g. question/answer)
def format_example(example):
    text = example["instruction"] + "\n" + example["output"] + tokenizer.eos_token
    tokens = tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=2048,
    )
    # Mask padding positions so they do not contribute to the loss
    tokens["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(tokens["input_ids"], tokens["attention_mask"])
    ]
    return tokens

tokenized_ds = dataset["train"].map(format_example, remove_columns=dataset["train"].column_names)
tokenized_ds.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
dataloader = DataLoader(tokenized_ds, batch_size=1)
# Setup Optimizer
optimizer = AdamW(model.parameters(), lr=1e-4)
# Train Loop
model.train()
for epoch in range(1):  # single epoch for a quick run
    for batch in tqdm(dataloader):
        # The DataLoader already adds the batch dimension, so no unsqueeze is needed
        input_ids = batch["input_ids"].to(model.device)
        attention_mask = batch["attention_mask"].to(model.device)
        labels = batch["labels"].to(model.device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
# Save LoRA Adapter
model.save_pretrained("./lora-llama3")
tokenizer.save_pretrained("./lora-llama3")
```

If `device_map="auto"` causes an error, drop it and move the model to the GPU explicitly with `.to("cuda")`.
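A minimal sketch of that fallback, assuming a single CUDA GPU with enough free VRAM for the fp16 weights:

```python
import torch
from transformers import AutoModelForCausalLM

# Load without device_map and place the whole model on the GPU manually
model = AutoModelForCausalLM.from_pretrained(
    "./Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.float16,
).to("cuda")
```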
✅ Merge
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load Base Model
base_model = AutoModelForCausalLM.from_pretrained(
"./Meta-Llama-3-8B-Instruct",
torch_dtype=torch.float16,
device_map="auto"
)
# Load LoRA Adapter
lora_model = PeftModel.from_pretrained(base_model, "./lora-llama3")
# Merge
merged_model = lora_model.merge_and_unload()
# Save
merged_model.save_pretrained("./merged-llama3", safe_serialization=False)
tokenizer = AutoTokenizer.from_pretrained("./Meta-Llama-3-8B-Instruct")
tokenizer.save_pretrained("./merged-llama3")
```

✅ Verify the Fine-Tuned Model
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Setup
BASE_MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
LORA_PATH = "./lora-llama3"  # Path to the LoRA adapter saved during training
PROMPT = "Your question here ..."
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
# === Base Model ===
print("🚀Base Model Load ...")
base_model = AutoModelForCausalLM.from_pretrained(
BASE_MODEL,
device_map="auto",
torch_dtype=torch.float16
)
base_model.eval()
# === Input ===
inputs = tokenizer(PROMPT, return_tensors="pt").to(base_model.device)

# === Base Model Inference ===
# Run this BEFORE attaching the adapter: PeftModel.from_pretrained injects the LoRA
# layers into base_model in place, so the order matters for a fair comparison.
with torch.no_grad():
    base_out = base_model.generate(**inputs, max_new_tokens=200)
base_text = tokenizer.decode(base_out[0], skip_special_tokens=True)

# === Load Fine-Tuned (LoRA) Model ===
print("🚀 Loading LoRA-applied model ...")
lora_model = PeftModel.from_pretrained(
    base_model,
    LORA_PATH
)
lora_model.eval()

# === Fine-Tuned Model Inference ===
with torch.no_grad():
    lora_out = lora_model.generate(**inputs, max_new_tokens=200)
lora_text = tokenizer.decode(lora_out[0], skip_special_tokens=True)
# === Output ===
print("\n==============================")
print("📌 Prompt:")
print(PROMPT)
print("\n--- Base Model ---")
print(base_text)
print("\n--- LoRA Model ---")
print(lora_text)
print("==============================")
```

✅ Make a GGUF Model with llama.cpp
In the case of an Unsloth-trained model (a LLaMA 2 checkpoint in this example):
```bash
cd ../llama.cpp
rm -f ./llama2-local.gguf ./llama2-local-q8.gguf
python convert_hf_to_gguf.py /home/llm/unsloth/llama-2-7b-bnb-fine --outfile ./llama2-local.gguf --outtype f16
./build/bin/llama-quantize ./llama2-local.gguf llama2-local-q8.gguf q8_0
./build/bin/llama-server -m ./llama2-local-q8.gguf --port 8080 --host 0.0.0.0 --threads $(nproc) --n-gpu-layers 35
```
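To sanity-check the running server, you can query llama-server's OpenAI-compatible chat endpoint on the host and port used above; exact response fields can vary slightly between llama.cpp versions, so treat this as a rough sketch:

```python
import requests

# One-shot request against the locally running llama-server
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```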
✅ Ollama model
If you are happy with the results above, run the commands below.
📌 Modelfile
```
FROM ./llama2-local-q8.gguf
# System prompt: keeps the model answering consistently as an assistant
SYSTEM """
You are a helpful and knowledgeable assistant.
Answer the user's question directly and do not repeat the user's input.
If the question is unclear, ask for clarification.
"""

# Tells Ollama how to assemble the prompt; the {{ .System }} and {{ .Prompt }}
# placeholders must appear, otherwise the user's input never reaches the model
TEMPLATE """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""
# Important: lets the model recognize the end-of-turn tokens and stop generating
PARAMETER stop "<|end_of_text|>"
PARAMETER stop "<|eot_id|>"
# Optionally add temperature and output-length parameters
PARAMETER temperature 0.7
PARAMETER num_predict 1024
```

📌 Create the Ollama Model
Build the Ollama model from the Modelfile and run it:
```bash
ollama create llama2-local-q8 -f Modelfile
ollama list
ollama serve
ollama run llama2-local-q8
```
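Once the model is registered, you can also exercise it outside the interactive CLI through Ollama's local REST API (default port 11434):

```python
import requests

# One-shot generation request against the locally registered model
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2-local-q8", "prompt": "What is LoRA?", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```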
✅ RAG, Open Web UI
```python
from flask import Flask, request, jsonify
from langchain_community.llms import Ollama  # provides the Ollama(...) LLM wrapper used below
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from flask_cors import CORS
import logging
import pandas as pd
from tabulate import tabulate
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s"
)
app = Flask(__name__)
CORS(app)
llm = Ollama(model="llama3")
# Embedding model setup
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-code-v1", model_kwargs={"device": "cuda"})
# Load the vector DB
db = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
prompt_template = PromptTemplate(
input_variables=["context", "question"],
template=(
"다음 문맥을 참고하여 질문에 답하세요.\n\n"
"문맥:\n{context}\n\n"
"질문:\n{question}\n\n"
"답변:"
)
)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=db.as_retriever(),
chain_type="stuff",
chain_type_kwargs={"prompt": prompt_template}
)
@app.route("/query", methods=["POST"])
def rag():
logging.info('POST(/query) arrived !')
data = request.get_json()
query = data.get("query", "")
if not query:
return jsonify({"error": "질문이 필요합니다."}), 400
logging.info(f"query : {query}")
answer = qa_chain.run(query)
logging.info(f"Answer : {answer}")
return jsonify({"answer": answer})
if __name__ == "__main__":
    app.run(debug=True, port=5000)
```
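A quick way to exercise the /query endpoint defined above (the question text is just an example):

```python
import requests

# Send a question to the RAG endpoint and print the generated answer
resp = requests.post(
    "http://localhost:5000/query",
    json={"query": "Summarize what the indexed documents say about LoRA."},
    timeout=60,
)
print(resp.json()["answer"])
```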
📌 MCP
```python
from mcp.server.fastmcp import FastMCP
import logging
import requests
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s"
)
mcp = FastMCP("chroma_rag")
@mcp.tool()
async def ask_chroma(query: str) -> str:
try:
logging.info(f"query received: {query}")
response = requests.post(
"http://localhost:5000/query",
json={"query": query},
timeout=10
)
logging.info(f"response status: {response.status_code}")
logging.info(f"response all data : {str(response)}")
data = response.json()
logging.info(f"==========================")
logging.info(f"response data : {data}")
        res_data = data.get("answer", "Could not retrieve an answer.")
res_format = type(res_data)
logging.info(f"return data : {res_data}, format : {res_format}")
return res_data
except Exception as e:
return f"오류 발생: {str(e)}"
if __name__ == "__main__":
    mcp.run()
```
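To test the tool end to end, an MCP client can launch this server over stdio and call ask_chroma. Here is a hedged sketch using the official Python MCP SDK; the file name mcp_server.py and the question are placeholders:

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # "mcp_server.py" stands in for whatever file contains the FastMCP server above
    params = StdioServerParameters(command="python", args=["mcp_server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool("ask_chroma", {"query": "What is LoRA?"})
            print(result.content)

asyncio.run(main())
```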