Recommended GitHub Repositories for Data Science
This is a list of GitHub repositories for data science, organized into categories so you can navigate by need, whether you're a beginner, an intermediate learner, or looking for projects and MLOps material.
1. Awesome Curated Lists (Your "Master Index")
These are massive resource collections covering tools, courses, books, and more.
- academic/awesome-datascience — A widely used curated list for data science. Covers everything from basics to real-world applications, books, courses, and tools.
- josephmisiti/awesome-machine-learning — Comprehensive list focused on ML algorithms, libraries, and resources across languages.
- vinta/awesome-python — Best Python ecosystem list (includes data science, ML, and visualization libraries).
- awesomedata/awesome-public-datasets — Curated high-quality open datasets for practice and projects.
- quantmind/awesome-data-science-viz — Focused on data visualization, analysis, and Python/web tools.
2. Learning Roadmaps & Structured Curricula
Perfect for self-paced learning with clear paths.
- microsoft/Data-Science-For-Beginners — Free 10-week curriculum with 20 lessons, notebooks, and exercises (Microsoft-backed).
- Avik-Jain/100-Days-Of-ML-Code — Classic 100-day ML challenge with daily code and concepts.
- CIS-Team/Data-Science-Roadmap-2025 — Complete A-to-Z roadmap (statistics → Python → ML → projects).
- krishnaik06/Perfect-Roadmap-To-Learn-Data-Science-In-2025 — Practical end-to-end roadmap with projects (includes NLP, MLOps, and deployment).
- mhmdkardosha/CAT-Reloaded-2025-Data-Science-Roadmap — Week-by-week structured tasks for beginners to advanced.
3. Core Libraries (Must-Know Foundations)
Daily tools every data scientist uses.
- pandas-dev/pandas — The essential library for data manipulation and analysis.
- scikit-learn/scikit-learn — Classical ML (classification, regression, clustering, preprocessing).
- numpy/numpy — Numerical computing foundation (works hand-in-hand with pandas).
- matplotlib/matplotlib + mwaskom/seaborn — Core visualization libraries (seaborn builds on matplotlib).
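To show how these libraries fit together in day-to-day work, here is a minimal sketch, assuming pandas, NumPy, and scikit-learn are installed. The data is synthetic and made up purely for illustration, and the column names `x` and `y` are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (invented for this example): y is roughly 2*x + 1 plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
df = pd.DataFrame({"x": x, "y": 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.size)})

# pandas: summary statistics (count, mean, std, quartiles) for each column.
summary = df.describe()

# scikit-learn: fit a one-feature linear regression directly on DataFrame columns.
model = LinearRegression().fit(df[["x"]], df["y"])
slope = float(model.coef_[0])
intercept = float(model.intercept_)

print(summary.loc[["count", "mean"]])
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

NumPy generates the arrays, pandas holds and summarizes them, and scikit-learn fits the model. A plotting call from matplotlib or seaborn would typically follow to inspect the fit.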
4. Machine Learning & Deep Learning Frameworks
For building models.
- pytorch/pytorch — Widely used in research and valued for its flexibility.
- tensorflow/tensorflow — End-to-end production-ready deep learning (Google-backed).
- keras-team/keras — High-level API for quick neural network prototyping (Keras 3 runs on TensorFlow, JAX, or PyTorch).
- huggingface/transformers — State-of-the-art NLP, LLMs, and multimodal models.
- dmlc/xgboost and microsoft/LightGBM — Gradient boosting powerhouses for tabular data competitions.
5. Hands-On Projects & Portfolio Builders
Build real projects to strengthen your GitHub profile.
- durgeshsamariya/Data-Science-Machine-Learning-Project-with-Source-Code — Huge collection of projects with code and explanations.
- veb-101/Data-Science-Projects — Curated list of project ideas with resources across industries.
- rhiever/Data-Analysis-and-Machine-Learning-Projects — Teaching-focused projects with datasets and code.
6. Bonus: MLOps, Production & Specialized
- DataTalksClub/machine-learning-zoomcamp (and DataTalksClub/mlops-zoomcamp) — Free project-based courses on ML engineering and production.
Quick Start Advice
- Beginner: Start with awesome-datascience + Data-Science-For-Beginners + pandas + 100-Days-Of-ML-Code.
- Intermediate/Advanced: Dive into PyTorch or Transformers + build 5–10 projects from the project repos.
- Pro Tip: Star these repos, fork interesting ones, and contribute small improvements. It's good practice for learning and strengthens your resume.
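If you want local copies of the beginner starter set, the `owner/name` slugs used throughout this list map directly onto GitHub URLs. The following is a small sketch using only the Python standard library. The helper names `repo_url` and `clone_command` are hypothetical, chosen for this example, and the script only prints commands rather than running them.

```python
# Beginner starter set, copied from this list as owner/name slugs.
STARTER_REPOS = [
    "academic/awesome-datascience",
    "microsoft/Data-Science-For-Beginners",
    "pandas-dev/pandas",
    "Avik-Jain/100-Days-Of-ML-Code",
]


def repo_url(slug: str) -> str:
    """Turn an owner/name slug into its GitHub web URL."""
    owner, name = slug.split("/", 1)
    return f"https://github.com/{owner}/{name}"


def clone_command(slug: str) -> str:
    """Build the git clone command for a slug (printed, not executed)."""
    return f"git clone {repo_url(slug)}.git"


for slug in STARTER_REPOS:
    print(clone_command(slug))
```

Paste the printed commands into a terminal to clone the repos. For large repositories such as pandas, adding `--depth 1` to the command keeps the download small.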