
RAG for Resume Extraction: Why HNSW Outshines IVF

4 min read · Jun 24, 2025


Resume parsing is no longer just about keyword matching — it’s about understanding context, skills, and nuanced experience. Retrieval-Augmented Generation (RAG), paired with LLMs, provides a transformative approach. But at its heart lies vector search, and choosing the right index can make or break your application.

In this blog, we pit HNSW against IVF for the task of resume extraction using RAG. We’ll walk through real code that extracts information from PDF resumes and shows why HNSW delivers more accurate, faster results.

💼 The Resume Use Case

Let’s say you’ve collected thousands of resumes, and you want to ask questions like:

“What is Aditya’s experience with Prescription Scanning using RAG?”

Instead of rule-based extraction, we chunk and embed resume content, store it in a vector index, and retrieve the best chunks at query time.

This requires:

  • High recall (we must not miss key skills)
  • Low latency (for real-time recruiter dashboards)
  • Dynamic updates (new resumes come in daily)

🤠 Code Walkthrough: Resume Extraction with HNSW

We’ll use:

  • pdfplumber for parsing
  • sentence-transformers for embeddings
  • hnswlib and faiss for indexing

🔧 Step 1: Install Dependencies

pip install pdfplumber sentence-transformers hnswlib faiss-cpu

🔧 Step 2: Imports and Recursive Chunking

from sentence_transformers import SentenceTransformer
import numpy as np
import pdfplumber
import re
import os
import hnswlib
import json

# Recursive chunking
def recursive_chunk(text, max_length=300, overlap=50):
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + max_length, len(words))
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start += max_length - overlap
    return chunks

def extract_chunks_from_pdf(pdf_path):
    all_chunks = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            text = page.extract_text()
            if text:
                paragraphs = re.split(r'\n{2,}|\.\s+', text)
                for para in paragraphs:
                    clean_para = para.strip()
                    if len(clean_para) > 50:
                        chunks = recursive_chunk(clean_para)
                        for chunk in chunks:
                            all_chunks.append({"text": chunk, "page": page_num + 1})
    return all_chunks
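
Before indexing anything, it can help to sanity-check the chunker. This is a small sketch (not part of the original walkthrough), assuming the resume PDF used later in Step 5 is in the working directory:

# Quick sanity check: how many chunks did we get, and what do they look like?
sample_chunks = extract_chunks_from_pdf("Aditya_Mangal_Resume_2.pdf")
print(f"Extracted {len(sample_chunks)} chunks")
print(sample_chunks[0]["page"], sample_chunks[0]["text"][:120])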

🔧 Step 3: Build HNSW Vector Index

def build_or_load_hnsw(chunks, embedding_model, hnsw_path='vector_db'):
    dim = 768
    if os.path.exists(f'{hnsw_path}.bin') and os.path.exists(f'{hnsw_path}_meta.json'):
        print("✅ Loading existing vector DB...")
        index = hnswlib.Index(space='l2', dim=dim)
        index.load_index(f'{hnsw_path}.bin')
        with open(f'{hnsw_path}_meta.json', 'r') as f:
            metadata = json.load(f)
        return index, metadata
    else:
        print("🛠️ Building new vector DB...")
        texts = [c["text"] for c in chunks]
        embeddings = embedding_model.encode(texts, normalize_embeddings=True)
        index = hnswlib.Index(space='l2', dim=dim)
        index.init_index(max_elements=len(texts), ef_construction=100, M=16)
        index.add_items(embeddings)
        index.save_index(f'{hnsw_path}.bin')
        with open(f'{hnsw_path}_meta.json', 'w') as f:
            json.dump(chunks, f)
        return index, chunks
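
One knob worth knowing about, though not set in the snippet above, is hnswlib's query-time ef parameter, which trades latency for recall. A minimal sketch, applied to the index returned by build_or_load_hnsw:

# Optional tuning (not in the original flow): ef controls how many candidate
# neighbours are explored per query; it must be >= k. Larger ef = better recall.
index.set_ef(50)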

🔧 Step 4: Query with HNSW

def hnsw_search(query, model, index, metadata, top_k=3):
    q_vec = model.encode([query], normalize_embeddings=True)
    labels, distances = index.knn_query(q_vec, k=top_k)
    results = []
    for rank, i in enumerate(labels[0]):
        # hnswlib's 'l2' space returns squared L2 distance, so this is only a
        # rough similarity proxy for normalized vectors (it can go negative)
        similarity = 1 - distances[0][rank]
        results.append((metadata[i], similarity))
    return results

🔧 Step 5: Query Resume

pdf_path = "Aditya_Mangal_Resume_2.pdf"
query = "what is project of prescription scanning with LLM and RAG?"
model = SentenceTransformer('all-mpnet-base-v2')
chunks = extract_chunks_from_pdf(pdf_path)
index, metadata = build_or_load_hnsw(chunks, model)
for result, score in hnsw_search(query, model, index, metadata):
print(f"\n[Similarity: {score:.4f}] Page {result['page']}\n{result['text'][:300]}...")
✅ Loading existing vector DB...

[Similarity: 0.1782] Page 3
PROJECT Image-Data-Augmentor (Python Library) Prescription Scanning with LLM and RAG -Personal Project -Personal Project https://github.com/adityamangal1998/Image-Data-Augmentor Developed an AI-powered system to extract structured data from Developed a Python library for image data augmentation, cap...

[Similarity: -0.2556] Page 1
Programming Languages: C/C++, Python, Dart, Assembly • Built a Vision-Language Model (VLM) to detect and extract structured table Language,Embedded C data from scanned documents...

[Similarity: -0.2686] Page 3
Integrated RAG and LLMs for accurate and structured data extraction...
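
A side note on those scores: because the index uses squared L2 distance, the 1 - distance heuristic can go negative, as the last two results show. If you prefer scores that behave like cosine similarity, one alternative, sketched here under the same normalized-embedding setup and not part of the original code, is to build the index in cosine space:

# Sketch: cosine-space HNSW index; knn_query returns cosine *distance*,
# so 1 - distance is a true cosine similarity.
cos_index = hnswlib.Index(space='cosine', dim=768)
cos_index.init_index(max_elements=len(chunks), ef_construction=100, M=16)
cos_index.add_items(model.encode([c["text"] for c in chunks], normalize_embeddings=True))

labels, distances = cos_index.knn_query(
    model.encode([query], normalize_embeddings=True), k=3
)
cosine_similarities = 1 - distances[0]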

🔍 IVF Alternative for Comparison

import faiss

texts = [c["text"] for c in chunks]
embeddings = model.encode(texts, normalize_embeddings=True)
embeddings = np.array(embeddings).astype('float32')

dim = embeddings.shape[1]
quantizer = faiss.IndexFlatL2(dim)
ivf = faiss.IndexIVFFlat(quantizer, dim, 5)  # nlist = 5 coarse clusters
ivf.train(embeddings)
ivf.add(embeddings)
ivf.nprobe = 2  # scan only 2 of the 5 clusters at query time

q_vec = model.encode([query], normalize_embeddings=True)
q_vec = np.array(q_vec).astype('float32')
D, I = ivf.search(q_vec, k=3)

print("\n🔍 IVF Results with Similarity Scores:")
scores = 1 / (1 + D[0])  # Inverse L2 to get similarity-like scores
total_score = np.sum(scores)

for rank, idx in enumerate(I[0]):
    probability = scores[rank] / total_score  # Normalized to sum to 1
    print(f"\n[Prob: {probability:.4f}] Page {chunks[idx]['page']}")
    print(chunks[idx]['text'][:300] + "...")

🔍 IVF Results with Similarity Scores:

[Prob: 0.3919] Page 3
PROJECT Image-Data-Augmentor (Python Library) Prescription Scanning with LLM and RAG -Personal Project -Personal Project https://github.com/adityamangal1998/Image-Data-Augmentor Developed an AI-powered system to extract structured data from Developed a Python library for image data augmentation, cap...

[Prob: 0.3046] Page 2
• Image Processing: Utilized MATLAB for advanced visual data analysis and manipulation...

[Prob: 0.3035] Page 4
The system effectively removed noise, such as scanned image artifacts, watermarks, and other disturbances, from images, enhancing their clarity and quality...
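
IVF's recall hinges on nprobe, the number of clusters scanned per query. Here is a minimal sketch of a recall sanity check (not in the original code), reusing the ivf index and q_vec from the snippet above: raising nprobe to nlist makes the search exhaustive, which shows how much the cluster pruning is costing.

# Hypothetical recall check: compare pruned vs. exhaustive IVF search
ivf.nprobe = 2
D_pruned, I_pruned = ivf.search(q_vec, k=3)

ivf.nprobe = ivf.nlist  # scan every cluster, i.e. no pruning
D_full, I_full = ivf.search(q_vec, k=3)

overlap = len(set(I_pruned[0]) & set(I_full[0]))
print(f"Pruned search recovered {overlap}/3 of the exhaustive top-3")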

📊 Results and Analysis

While both HNSW and IVF successfully retrieve the relevant chunk regarding “Prescription Scanning with RAG,” their approaches and scoring differ:

  • HNSW returns the most relevant chunk with an absolute similarity score and keeps recall high even as the collection grows.
  • IVF produces relative probability scores that look well distributed, but its search is confined to the clusters it probes; it is competitive only when nprobe is tuned thoughtfully.
  • On a small dataset like this one, IVF can look comparable, but in real-world, large-scale resume pipelines HNSW consistently comes out ahead on recall and adaptability.
  • HNSW also supports dynamic resume ingestion without retraining, while IVF requires re-clustering as the corpus changes; a minimal sketch of the HNSW update path follows this list.
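
To make the last point concrete, here is a small sketch (the file name is hypothetical, and the capacity handling is one reasonable approach, not the author's code) of ingesting a new resume into the live HNSW index without rebuilding it:

# New resume arrives: chunk, embed, and append to the existing HNSW index
new_chunks = extract_chunks_from_pdf("new_candidate_resume.pdf")  # hypothetical file
new_embeddings = model.encode([c["text"] for c in new_chunks], normalize_embeddings=True)

# Grow the index capacity, then add the new vectors with sequential ids
index.resize_index(index.get_max_elements() + len(new_chunks))
index.add_items(new_embeddings, ids=np.arange(len(metadata), len(metadata) + len(new_chunks)))
metadata.extend(new_chunks)

# With IVF, the coarse centroids were learned from the old data; keeping
# retrieval quality typically means re-training and re-adding everything.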

🏆 Verdict: HNSW Wins for Resume Intelligence

For resume intelligence, where recall, latency, and daily ingestion of new resumes all matter, HNSW is the clear choice: it retrieves the right chunks with high recall and absorbs new documents in place, while IVF's cluster structure has to be rebuilt as the corpus changes.


Written by Aditya Mangal

Tech enthusiast weaving stories of code and life. Writing about innovation, reflection, and the timeless dance between mind and heart.
