RAG for Resume Extraction: Why HNSW Outshines IVF
Resume parsing is no longer just about keyword matching — it’s about understanding context, skills, and nuanced experience. Retrieval-Augmented Generation (RAG), paired with LLMs, provides a transformative approach. But at its heart lies vector search, and choosing the right index can make or break your application.
In this blog, we pit HNSW against IVF for the task of resume extraction using RAG. We’ll walk through real code that extracts information from PDF resumes and shows why HNSW delivers more accurate, faster results.
💼 The Resume Use Case
Let’s say you’ve collected thousands of resumes, and you want to ask questions like:
“What is Aditya’s experience with Prescription Scanning using RAG?”
Instead of rule-based extraction, we chunk and embed resume content, store it in a vector index, and retrieve the best chunks at query time.
This requires:
- High recall (we must not miss key skills)
- Low latency (for real-time recruiter dashboards)
- Dynamic updates (new resumes come in daily)
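Once the best chunks are retrieved, the generation half of RAG is just prompt assembly plus an LLM call. The sketch below is illustrative rather than part of this post's code: retrieve stands in for the vector search we build next, and call_llm is a hypothetical placeholder for whatever LLM client you use.

```python
# Minimal sketch of the retrieve-then-generate loop (illustrative only).
# `retrieve` stands in for the vector search built below; `call_llm` is a
# hypothetical placeholder for your LLM client.
def answer_question(query, retrieve, call_llm, top_k=3):
    chunks = retrieve(query, top_k=top_k)              # nearest resume chunks
    context = "\n\n".join(c["text"] for c in chunks)   # stitch them into context
    prompt = (
        "Answer the recruiter's question using only the resume excerpts below.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return call_llm(prompt)                            # generation step of RAG
```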
🤠 Code Walkthrough: Resume Extraction with HNSW
We’ll use:
- pdfplumber for PDF parsing
- sentence-transformers for embeddings
- hnswlib and faiss for indexing
🔧 Step 1: Install Dependencies
pip install pdfplumber sentence-transformers hnswlib faiss-cpu
🔧 Step 2: Recursive Chunking and Embedding
from sentence_transformers import SentenceTransformer
import numpy as np
import pdfplumber
import re
import os
import hnswlib
import json

# Recursive chunking: split long text into overlapping word windows
def recursive_chunk(text, max_length=300, overlap=50):
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + max_length, len(words))
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start += max_length - overlap
    return chunks

def extract_chunks_from_pdf(pdf_path):
    all_chunks = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            text = page.extract_text()
            if text:
                # Split on blank lines or sentence boundaries
                paragraphs = re.split(r'\n{2,}|\.\s+', text)
                for para in paragraphs:
                    clean_para = para.strip()
                    if len(clean_para) > 50:  # skip very short fragments
                        chunks = recursive_chunk(clean_para)
                        for chunk in chunks:
                            all_chunks.append({"text": chunk, "page": page_num + 1})
    return all_chunks
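As a quick sanity check (not part of the original walkthrough), you can run recursive_chunk on a toy string with small window sizes to see how the overlapping windows behave:

```python
# Toy example: 12 words, windows of 5 words that step forward by 3 (overlap of 2)
sample = " ".join(f"w{i}" for i in range(12))
for c in recursive_chunk(sample, max_length=5, overlap=2):
    print(c)
# w0 w1 w2 w3 w4
# w3 w4 w5 w6 w7
# w6 w7 w8 w9 w10
# w9 w10 w11
```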
🔧 Step 3: Build HNSW Vector Index
def build_or_load_hnsw(chunks, embedding_model, hnsw_path='vector_db'):
    dim = 768  # embedding size of all-mpnet-base-v2
    if os.path.exists(f'{hnsw_path}.bin') and os.path.exists(f'{hnsw_path}_meta.json'):
        print("✅ Loading existing vector DB...")
        index = hnswlib.Index(space='l2', dim=dim)
        index.load_index(f'{hnsw_path}.bin')
        with open(f'{hnsw_path}_meta.json', 'r') as f:
            metadata = json.load(f)
        return index, metadata
    else:
        print("🛠️ Building new vector DB...")
        texts = [c["text"] for c in chunks]
        embeddings = embedding_model.encode(texts, normalize_embeddings=True)
        index = hnswlib.Index(space='l2', dim=dim)
        index.init_index(max_elements=len(texts), ef_construction=100, M=16)
        index.add_items(embeddings)
        index.save_index(f'{hnsw_path}.bin')
        with open(f'{hnsw_path}_meta.json', 'w') as f:
            json.dump(chunks, f)
        return index, chunks
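Because the post later argues that HNSW handles dynamic resume ingestion well, here is a rough sketch of what appending a newly received resume could look like. The add_resume_to_index helper is my own illustrative naming, not code from the original walkthrough:

```python
# Sketch: appending a newly received resume to an existing HNSW index.
# `add_resume_to_index` is an illustrative helper, not part of the original post.
def add_resume_to_index(pdf_path, index, metadata, embedding_model,
                        hnsw_path='vector_db'):
    new_chunks = extract_chunks_from_pdf(pdf_path)
    new_embeddings = embedding_model.encode(
        [c["text"] for c in new_chunks], normalize_embeddings=True)
    # Grow the index capacity, then insert the new vectors
    index.resize_index(index.get_current_count() + len(new_chunks))
    index.add_items(new_embeddings)
    metadata.extend(new_chunks)
    # Persist the updated index and metadata
    index.save_index(f'{hnsw_path}.bin')
    with open(f'{hnsw_path}_meta.json', 'w') as f:
        json.dump(metadata, f)
    return index, metadata
```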
🔧 Step 4: Query with HNSW
def hnsw_search(query, model, index, metadata, top_k=3):
    q_vec = model.encode([query], normalize_embeddings=True)
    labels, distances = index.knn_query(q_vec, k=top_k)
    results = []
    for rank, i in enumerate(labels[0]):
        # 'l2' space returns squared L2 distance, so this is only a rough
        # similarity proxy and can go negative (see the note below)
        similarity = 1 - distances[0][rank]
        results.append((metadata[i], similarity))
    return results
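A note on these scores: with normalized embeddings, hnswlib's 'l2' space reports squared Euclidean distance, which equals 2 - 2·cosine, so 1 - distance is only a rough proxy and can dip below zero (as in the output below). If you prefer scores that read directly as cosine similarity, one option, assumed here rather than taken from the original post, is to build the index in hnswlib's 'cosine' space, where the reported distance is 1 - cosine:

```python
# Alternative sketch (an assumption, not the setup used above): build the index
# in hnswlib's 'cosine' space so knn_query returns distance = 1 - cosine similarity.
def build_hnsw_cosine(chunks, embedding_model, dim=768):
    texts = [c["text"] for c in chunks]
    embeddings = embedding_model.encode(texts, normalize_embeddings=True)
    index = hnswlib.Index(space='cosine', dim=dim)
    index.init_index(max_elements=len(texts), ef_construction=100, M=16)
    index.add_items(embeddings)
    return index
# With such an index, `1 - distance` in hnsw_search is a true cosine similarity.
```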
🔧 Step 5: Query Resume
pdf_path = "Aditya_Mangal_Resume_2.pdf"
query = "what is project of prescription scanning with LLM and RAG?"
model = SentenceTransformer('all-mpnet-base-v2')
chunks = extract_chunks_from_pdf(pdf_path)
index, metadata = build_or_load_hnsw(chunks, model)
for result, score in hnsw_search(query, model, index, metadata):
    print(f"\n[Similarity: {score:.4f}] Page {result['page']}\n{result['text'][:300]}...")
✅ Loading existing vector DB...
[Similarity: 0.1782] Page 3
PROJECT Image-Data-Augmentor (Python Library) Prescription Scanning with LLM and RAG -Personal Project -Personal Project https://github.com/adityamangal1998/Image-Data-Augmentor Developed an AI-powered system to extract structured data from Developed a Python library for image data augmentation, cap...
[Similarity: -0.2556] Page 1
Programming Languages: C/C++, Python, Dart, Assembly • Built a Vision-Language Model (VLM) to detect and extract structured table Language,Embedded C data from scanned documents...
[Similarity: -0.2686] Page 3
Integrated RAG and LLMs for accurate and structured data extraction...
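One knob the walkthrough leaves at its library default is hnswlib's query-time ef parameter, which bounds how many candidates the graph search explores per query; raising it trades a little latency for recall. A minimal, assumed usage:

```python
# ef must be >= top_k; larger values explore more of the graph per query
index.set_ef(50)
results = hnsw_search(query, model, index, metadata, top_k=3)
```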
🔍 IVF Alternative for Comparison
import faiss
texts = [c["text"] for c in chunks]
embeddings = model.encode(texts, normalize_embeddings=True)
embeddings = np.array(embeddings).astype('float32')
dim = embeddings.shape[1]
quantizer = faiss.IndexFlatL2(dim)
ivf = faiss.IndexIVFFlat(quantizer, dim, 5)  # nlist = 5 clusters
ivf.train(embeddings)  # IVF must be trained (k-means) before adding vectors
ivf.add(embeddings)
ivf.nprobe = 2  # search only 2 of the 5 clusters at query time
q_vec = model.encode([query], normalize_embeddings=True)
q_vec = np.array(q_vec).astype('float32')
D, I = ivf.search(q_vec, k=3)
print("\n🔍 IVF Results with Similarity Scores:")
scores = 1 / (1 + D[0]) # Inverse L2 to get similarity-like scores
total_score = np.sum(scores)
for rank, idx in enumerate(I[0]):
    probability = scores[rank] / total_score  # Normalized to sum to 1
    print(f"\n[Prob: {probability:.4f}] Page {chunks[idx]['page']}")
    print(chunks[idx]['text'][:300] + "...")
🔍 IVF Results with Similarity Scores:
[Prob: 0.3919] Page 3
PROJECT Image-Data-Augmentor (Python Library) Prescription Scanning with LLM and RAG -Personal Project -Personal Project https://github.com/adityamangal1998/Image-Data-Augmentor Developed an AI-powered system to extract structured data from Developed a Python library for image data augmentation, cap...
[Prob: 0.3046] Page 2
• Image Processing: Utilized MATLAB for advanced visual data analysis and manipulation...
[Prob: 0.3035] Page 4
The system effectively removed noise, such as scanned image artifacts, watermarks, and other disturbances, from images, enhancing their clarity and quality...
📊 Results and Analysis
While both HNSW and IVF successfully retrieve the relevant chunk regarding “Prescription Scanning with RAG,” their approaches and scoring differ:
- HNSW returns the most relevant chunk with an absolute similarity score and tends to maintain high recall, even across large datasets.
- IVF reports relative probability scores that look well distributed, but it only searches the clusters selected by nprobe. It performs competitively when nprobe is set thoughtfully (see the recall sweep sketched below).
- On small datasets, IVF can look comparable, but in real-world, large-scale resume pipelines HNSW consistently delivers better recall and adaptability.
- HNSW also supports dynamic resume ingestion without retraining, while IVF requires re-clustering (re-training its coarse quantizer) as the corpus grows.
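To make the nprobe trade-off above concrete, here is a small sketch (not from the original post) that measures IVF recall against an exhaustive flat index used as ground truth, reusing embeddings, q_vec, dim, and ivf from the IVF example:

```python
# Sketch: recall@3 of IVF vs. an exact (flat) index, for different nprobe values.
# Reuses `embeddings`, `q_vec`, `dim`, and `ivf` from the IVF example above.
flat = faiss.IndexFlatL2(dim)
flat.add(embeddings)
_, exact_ids = flat.search(q_vec, k=3)              # exact top-3 neighbours

for nprobe in (1, 2, 5):
    ivf.nprobe = nprobe                             # number of clusters searched
    _, ivf_ids = ivf.search(q_vec, k=3)
    recall = len(set(exact_ids[0]) & set(ivf_ids[0])) / 3
    print(f"nprobe={nprobe}: recall@3 = {recall:.2f}")
```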