
Fine-Tuning TinyLlama on WhatsApp Chats: Build Your Own Personal AI Chatbot! 🚀

6 min read · Feb 16, 2025


Introduction

Ever wondered what it would be like to have an AI that talks just like you and your friends? What if you could train an AI chatbot on your WhatsApp conversations and make it understand your slang, emotions, and inside jokes? Well, now you can!

In this guide, we'll fine-tune TinyLlama (1.1B Chat model) on WhatsApp chat data to create a personalized AI assistant that mirrors real-life conversations. We'll use QLoRA (Quantized Low-Rank Adaptation) to make fine-tuning memory efficient, even on consumer GPUs!

Why Fine-Tune TinyLlama?

  • ✅ Lightweight yet powerful: only 1.1B parameters, making it efficient.
  • ✅ Supports conversational AI: optimized for chat-based interactions.
  • ✅ Memory-efficient fine-tuning: uses QLoRA for better performance on low-resource GPUs.
  • ✅ Customizable: fine-tune on your own chat data to make the AI sound like you.

First, we will evaluate the output of the TinyLlama 1.1B Chat model when loaded without quantization and without any fine-tuning.

[Figure: GPU memory usage]

Next, we will assess its performance when loaded with quantization and without any fine-tuning.
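A minimal sketch of how these two baseline checks might look (the sample prompt and the memory print-out here are illustrative additions, not part of the original runs):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Baseline 1: fp16 load, no quantization, no fine-tuning
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="cuda"
)

# Baseline 2: 4-bit quantized load, still no fine-tuning
# quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
# model = AutoModelForCausalLM.from_pretrained(
#     model_name, quantization_config=quant_config, device_map="cuda"
# )

# Generate a sample reply and check GPU memory usage
prompt = "<|user|>\nHey Radhika! Kaisi ho?</s>\n<|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")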

Step 1: Download and Prepare Your WhatsApp Chat Data

First, you need to extract chat data from WhatsApp and format it properly.

Download the Dataset from Kaggle

You can use an existing WhatsApp-style conversation dataset from Kaggle:

Downloading via Kaggle Hub

Install the kagglehub package:

pip install kagglehub

Download the dataset:

# Download latest version
import kagglehub
path = kagglehub.dataset_download("siddikisahil47/conversation")

Now your dataset is ready for preprocessing and fine-tuning!
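Note that kagglehub.dataset_download returns the path of the downloaded dataset folder rather than a single file. A quick way to see which files it contains (the exact file names depend on the dataset):

import os

# The kagglehub path is a directory; list the files inside it
for name in os.listdir(path):
    print(os.path.join(path, name))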

Export Chat from WhatsApp (If Using Personal Data)

  1. Open WhatsApp and go to any chat.
  2. Tap on More Options (⋮) → Export Chat.
  3. Select Without Media and save the text file.
  4. Transfer the file to your system.

Preprocessing WhatsApp Chat Data

The raw chat file will look something like this:

Rohan: Hey Radhika! Kaisi ho?
Radhika: Hey Rohan, main bilkul thik hun. Tu bata, kaisa hai?
Rohan: I'm good too, yaar. Tumne suna ki next week school wali trip hai?

We need to convert this into JSONL format (ideal for fine-tuning).
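Each line of the JSONL file will hold one instruction/response pair, for example:

{"instruction": "Hey Radhika! Kaisi ho?", "response": "Hey Rohan, main bilkul thik hun. Tu bata, kaisa hai?"}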

Python Script to Format Data

import json
import re

# Input & output file paths
input_file = path  # Your raw chat file (the kagglehub path above is a folder; point this at the chat text file inside it)
output_file = "conversations.jsonl"  # Processed JSONL file

# Read raw chat file
with open(input_file, "r", encoding="utf-8") as f:
    lines = f.readlines()

data = []
current_convo = []

# Regex to match "Name: Message"
message_pattern = re.compile(r"^(.*?):\s(.*)")

for line in lines:
    line = line.strip()
    if not line:
        continue

    match = message_pattern.match(line)
    if match:
        sender, message = match.groups()

        if current_convo:  # If a previous message exists, save the exchange
            data.append({
                "instruction": current_convo[0],  # Previous message
                "response": message               # Current response
            })
        current_convo = [message]  # Current message becomes the next instruction

# Save as JSONL
with open(output_file, "w", encoding="utf-8") as f:
    for item in data:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

print(f"Processed {len(data)} conversations and saved to {output_file}.")

Step 2: Fine-Tune TinyLlama Using QLoRA

Now that we have our dataset ready, letโ€™s fine-tune TinyLlama/TinyLlama-1.1B-Chat-v1.0 using QLoRA.

Install Dependencies

pip install transformers datasets accelerate bitsandbytes peft trl

Fine-Tuning Script

import torch
import json
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# 4-bit quantization (QLoRA) configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",  # Reduce memory usage
    bnb_4bit_use_double_quant=True
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map={"": torch.cuda.current_device()},
    quantization_config=quantization_config
)

dataset = load_dataset("json", data_files={"train": "conversations.jsonl"})
dataset['train'][0]

{'instruction': 'Hey Radhika! Kaisi ho?',
 'response': 'Hey Rohan, main bilkul thik hun. Tu bata, kaisa hai?'}

Check Max Length

length = 0
for i in range(len(dataset['train'])):
    instruction_length = len(dataset['train'][i]['instruction'])
    response_length = len(dataset['train'][i]['response'])
    if instruction_length > length:
        length = instruction_length
    if response_length > length:
        length = response_length
print(f"max length : {length}")

max length : 454
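Keep in mind this measures length in characters, while the tokenizer's max_length is counted in tokens. A quick sketch to also check the longest combined example in tokens:

# Check the longest instruction + response pair in tokens (not characters)
max_tokens = 0
for example in dataset["train"]:
    text = example["instruction"] + "\n\n" + example["response"]
    max_tokens = max(max_tokens, len(tokenizer(text)["input_ids"]))
print(f"max length in tokens: {max_tokens}")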

Tokenize the dataset

def tokenize_function(examples):
    # Combine each instruction and response into one training text
    return tokenizer(
        [instr + "\n\n" + resp for instr, resp in zip(examples["instruction"], examples["response"])],
        padding="max_length",
        truncation=True,
        max_length=480  # Adjust based on your needs
    )

# Apply tokenization
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map: 100%|██████████| 33287/33287 [00:10<00:00, 3305.48 examples/s]

lora_config = LoraConfig(
    r=8,  # Low-rank adaptation dimension
    lora_alpha=16,
    lora_dropout=0.1,
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# === Step 4: Define Training Arguments ===
training_args = TrainingArguments(
    output_dir="./tinyllama-finetuned-whatsapp",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=2,
    learning_rate=2e-4,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    logging_steps=10,
    fp16=True,  # Use mixed precision for efficiency
    report_to="none",
)

# Hold out 10% of the data for evaluation
split_data = tokenized_datasets["train"].train_test_split(test_size=0.1)

# === Step 5: Initialize Trainer and Train ===
trainer = SFTTrainer(
    model=model,
    train_dataset=split_data["train"],
    eval_dataset=split_data["test"],  # ✅ Provide validation dataset
    args=training_args,
)

trainer.train()

# === Step 6: Save Fine-Tuned Model ===
model.save_pretrained("./tinyllama-finetuned-whatsapp")
tokenizer.save_pretrained("./tinyllama-finetuned-whatsapp")

print("✅ Fine-tuning complete! Model saved successfully.")

Step 3: Merge Adapter Weights with the Base Model

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from peft import PeftModel

# Define model paths
base_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
adapter_path = "tinyllama-finetuned-whatsapp\\checkpoint-11232"
merged_model_path = "tinyllama-finetuned-whatsapp-merged-model"

# Load base tokenizer
base_tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)

# Check if the fine-tuned model has a modified tokenizer
try:
    adapter_tokenizer = AutoTokenizer.from_pretrained(adapter_path, trust_remote_code=True)
    print("Loaded tokenizer from adapter.")
except Exception:
    adapter_tokenizer = base_tokenizer
    print("No separate tokenizer found in adapter path, using base tokenizer.")

# Merge tokenizers if new tokens were added
num_added_tokens = base_tokenizer.add_special_tokens(
    {"additional_special_tokens": adapter_tokenizer.additional_special_tokens}
)
if num_added_tokens > 0:
    print(f"Added {num_added_tokens} new tokens from adapter tokenizer.")

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name, torch_dtype=torch.float16, device_map="cuda"
)

# Load adapter
model = PeftModel.from_pretrained(base_model, adapter_path)

# Merge adapter into base model
model = model.merge_and_unload()  # Merges LoRA adapter weights into the base model

# Save the merged model and tokenizer
model.save_pretrained(merged_model_path)
base_tokenizer.save_pretrained(merged_model_path)

print(f"Merged model and tokenizer saved to {merged_model_path}")

Step 4: Test Your Fine-Tuned Model

from transformers import GenerationConfig
from time import perf_counter

def formatted_prompt(question) -> str:
    # TinyLlama chat template: user turn followed by assistant turn
    return f"<|user|>\n{question}</s>\n<|assistant|>"

def generate_response(user_input):
    prompt = formatted_prompt(user_input)

    generation_config = GenerationConfig(
        penalty_alpha=0.6,
        do_sample=True,
        top_k=5,
        temperature=0.5,
        repetition_penalty=1.2,
        max_new_tokens=50,
        pad_token_id=tokenizer.eos_token_id
    )

    start_time = perf_counter()
    inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
    outputs = model.generate(**inputs, generation_config=generation_config)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    output_time = perf_counter() - start_time
    print(f"Time taken for inference: {round(output_time, 2)} seconds")

generate_response(user_input='Hey Radhika, kesi ho?')

Conclusion

Congratulations! 🎉 You have successfully fine-tuned TinyLlama on WhatsApp chats to create a personalized chatbot that talks just like you! You can improve results further by training for more epochs; only 2 were used here.
This method can be used for:

  • Personal AI Assistants
  • Customer Support Bots
  • AI-driven Social Media Interactions


Written by Aditya Mangal

Tech enthusiast weaving stories of code and life. Writing about innovation, reflection, and the timeless dance between mind and heart.
