Turn Any Website into a Smart Chatbot with AI-Powered Content Parsing!
Alright, class! Today, we are going to discuss something far more useful than just scrolling endlessly through the internet. It’s called Content Parsing and Q&A, and if you pay attention, you might just build something cooler than your last failed attempt at a to-do list app. So, sit straight, open your brains, and let’s get into it!
What Is This Magical Thing?
This project is a Flask application that allows users to scrape content from URLs, store it in a vector database (ChromaDB, if you must know), and interact with the content using a chatbox and also use Cross Encoder. Think of it as your personal AI librarian — except it doesn’t give you judgmental looks for asking weird questions.
It uses:
- ChromaDB for storing and retrieving the scraped content
- Bedrock API to generate answers based on the content
- Flask for serving this masterpiece to the world
Features That Make This Project Cool
Scrapes content from one or more URLs (Yes, even that shady-looking news site you love)
Stores scraped content in ChromaDB (because your brain can’t remember everything)
It lets you query the stored content using a chatbox (finally, a chatbot that makes sense!)
Uses the Bedrock API to generate smart answers (unlike your roommate)
Requirements (Because Nothing Works by Magic)
To run this beast, you need:
Python 3.7+ (Don't even try Python 2.7, I will find you.)
Flask
requests
beautifulsoup4
chromadb
sentence-transformers
boto3If you don’t have these installed, don’t complain when things don’t work!
Installation (Follow This or Suffer!)
- Clone the repository like a true hacker:
git clone https://github.com/adityamangal1998/UI-Content-Parsing.git cd UI-Content-Parsing- Install the required dependencies:
pip install -r requirements.txt- Configure your AWS credentials in
config.py(or summon an AWS wizard to do it for you):
# AWS Configuration
AWS_ACCESS_KEY_ID = 'your-aws-access-key-id'
AWS_SECRET_ACCESS_KEY = 'your-aws-secret-access-key'
AWS_REGION_NAME = 'your-aws-region'
# Bedrock Model Configuration
BEDROCK_MODEL_ID = 'your-bedrock-model-id'
# ChromaDB Configuration
CHROMA_DB_PATH = './chroma_db'
CHROMA_COLLECTION_NAME = 'web_scraped_data'
# Cross-Encoder Model
CROSS_ENCODER_MODEL = 'cross-encoder/ms-marco-MiniLM-L-6-v2'
# Embedding Model
EMBEDDING_MODEL = 'BAAI/bge-base-en'If you mess this up, don’t come crying. Just read carefully!
Usage (A.K.A. How Not to Break It)
- Run the Flask application:
python app.py- Open your browser and go to: 👉 http://localhost:5000
- Paste one or more URLs in the input box and click “Scrape Content”.
- After scraping, enter a question in the chatbox and hit “Get Answer”.
Be amazed as your AI assistant gives you an answer (or be disappointed if you picked a bad website).
Project Structure (Because Organization Matters)
UI-Content-Parsing
├── app.py # Main Flask application
├── config.py # Secret keys and settings (guard it with your life)
├── model.py # Function to invoke Bedrock API
├── requirements.txt # Dependencies list (Do NOT ignore this!)
├── static/
│ ├── scripts.js # Frontend magic
│ └── styles.css # Make it pretty
├── templates/
│ └── index.html # Where the frontend action happens
├── utils.py # Scraping helper functions
└── vectordb.py # Handles ChromaDB storage and retrievalIf you randomly delete files, don’t blame me when things stop working.
Final Words
If you made it this far, congratulations! You now know how to scrape, store, and chat with web content. If you didn’t, well… go back and read again! This isn’t a bedtime story; it’s a real project that can make your life easier.
Now go forth, install it, break it (I mean, test it), and maybe even improve it! Just don’t ask it embarrassing questions — it remembers everything. 😏
Happy coding! 🚀