Pulsar Star Classification | Michael Chadolias

Overview

A machine learning system that classifies pulsar candidates from radio survey data. The project covers a complete ML pipeline, from data acquisition to production deployment, and uses the HTRU2 dataset to separate rare pulsar signals from noise and radio-frequency interference (best model ROC-AUC: 0.9768).

Live Demo: https://huggingface.co/spaces/mchadolias/pulsar-classification-htru2

Project Highlights

🎯 Key Features

Automated Data Pipeline: Downloads and preprocesses HTRU2 dataset from Kaggle
Multiple ML Models: Implements and compares Logistic Regression, Random Forest, Gradient Boosting, and XGBoost
Hyperparameter Tuning: Comprehensive GridSearchCV optimization
Production API: FastAPI-based REST API with real-time predictions
Docker Deployment: Containerized application ready for cloud deployment
Interactive Documentation: Swagger UI for easy API testing

📊 Model Performance

Best Model: XGBoost Classifier

ROC-AUC: 0.9768
Recall: 86.3%
Precision: 92.5%
F1-Score: 89.3%

The model handles the dataset's significant class imbalance (9.16% pulsars) while keeping recall at 86.3% and precision at 92.5%, so most pulsars are detected and few noise candidates are flagged as pulsars.
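
As a quick consistency check, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the two reported figures. A minimal sketch (the variable names are illustrative and not taken from the project code):

```python
# Reported metrics for the best model.
precision = 0.925
recall = 0.863

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.3f}")  # F1 = 0.893
```

The result matches the reported F1-Score of 89.3%.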

Technical Implementation

Dataset & Challenge

Source: HTRU2 Pulsar Dataset (Kaggle)
Samples: 17,898 candidate observations
Features: 8 statistical measures (mean, standard deviation, excess kurtosis, and skewness) computed for both the integrated pulse profile and the DM-SNR curve
Challenge: Highly imbalanced classification (9.16% positive class)
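
To make the imbalance concrete, the approximate class counts can be derived from the sample count and positive fraction listed here. The sketch below also computes the negative-to-positive ratio, which is the value commonly passed to XGBoost's `scale_pos_weight` parameter. Whether this project sets that parameter is an assumption, not something stated in this document.

```python
total_samples = 17_898
positive_fraction = 0.0916  # share of pulsars reported for the dataset

# Approximate class counts implied by the reported fraction.
n_pulsars = round(total_samples * positive_fraction)
n_non_pulsars = total_samples - n_pulsars

# Ratio of negatives to positives (a common choice for scale_pos_weight).
imbalance_ratio = n_non_pulsars / n_pulsars

print(n_pulsars, n_non_pulsars, round(imbalance_ratio, 2))  # 1639 16259 9.92
```

With roughly ten non-pulsars per pulsar, plain accuracy is a weak metric, which is why ROC-AUC, recall, and precision are reported instead.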

Machine Learning Pipeline

Data Preprocessing: Automated download, validation, and standardization
Model Training: 4 algorithms with 8-fold cross-validation
Hyperparameter Optimization: Grid search for optimal performance
Model Evaluation: Comprehensive metrics and feature importance analysis
Production Deployment: FastAPI with Docker containerization
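
The training and tuning steps can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the project's actual code: the data is a synthetic stand-in for HTRU2, Logistic Regression stands in for all four models, and the parameter grid and `class_weight="balanced"` setting are hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for HTRU2: 8 features, roughly 9% positives (assumption).
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=5, n_redundant=1,
    weights=[0.91, 0.09], random_state=42,
)

# Scaling lives inside the pipeline so it is refit on each training fold,
# which avoids leaking validation-fold statistics into training.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# Stratified folds keep the pulsar fraction similar in every split.
cv = StratifiedKFold(n_splits=8, shuffle=True, random_state=42)

search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical grid
    scoring="roc_auc",
    cv=cv,
)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_
print(best_params, round(best_auc, 4))
```

Swapping the `clf` step and its grid for Random Forest, Gradient Boosting, or XGBoost would follow the same pattern.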

Feature Importance Analysis

Top features contributing to pulsar classification:

ip_kurtosis: 0.4368 (Integrated Profile Kurtosis)
ip_skewness: 0.3520 (Integrated Profile Skewness)
dm_std: 0.0678 (DM-SNR Curve Standard Deviation)
ip_std: 0.0412 (Integrated Profile Standard Deviation)
ip_mean: 0.0298 (Integrated Profile Mean)
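
To see how concentrated the importance is, the listed scores can be accumulated from most to least important. A small sketch, assuming the scores are normalized so that all eight features sum to 1 (the usual convention for tree-based `feature_importances_`, but not confirmed here):

```python
# Importances as listed above (top five features only).
importances = {
    "ip_kurtosis": 0.4368,
    "ip_skewness": 0.3520,
    "dm_std": 0.0678,
    "ip_std": 0.0412,
    "ip_mean": 0.0298,
}

# Walk the features from most to least important, accumulating their share.
cumulative = 0.0
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    cumulative += score
    print(f"{name:<12} {score:.4f}  cumulative: {cumulative:.4f}")

# Combined share of the two strongest features.
top_two_share = sum(sorted(importances.values(), reverse=True)[:2])
```

Under that assumption, the kurtosis and skewness of the integrated profile together account for about 79% of the total importance, and the five listed features for about 93%.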

Quick Start

Local Development

# Clone repository
git clone https://github.com/mchadolias/pulsar-classification.git
cd pulsar-classification

# Install dependencies
uv sync

# Run complete pipeline
uv run python scripts/main.py

# Start API server
uv run python deployment/predict.py

Docker Deployment

# Build and run container
docker build -t pulsar-classification-api:latest .
docker run -it -p 9696:9696 pulsar-classification-api:latest

# Test API
curl http://localhost:9696/health
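
A freshly started container may need a moment before it accepts requests, so a script that runs right after `docker run` can poll the health endpoint instead of calling it once. The helper below is a hypothetical sketch using only the Python standard library; the function name, retry count, and delay are illustrative and not part of the project.

```python
import time
import urllib.request


def wait_for_health(url: str, retries: int = 10, delay: float = 1.0) -> bool:
    """Return True once `url` answers with HTTP 200, False after all retries fail."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Connection refused, timeout, or HTTP error: not ready yet.
            pass
        if attempt < retries - 1:
            time.sleep(delay)
    return False
```

Calling `wait_for_health("http://localhost:9696/health")` returns `True` as soon as the service responds and `False` if it never does within the retry budget.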

API Usage

The deployed API exposes four endpoints; the two POST endpoints return real-time pulsar classifications:

POST /predict - Single sample classification
POST /predict_batch - Batch predictions
GET /health - Service status check
GET /docs - Interactive API documentation

Example Prediction

import requests

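# Sample candidate: 8 feature values (assumed to follow the HTRU2 column order)
features = [99.367, 41.572, 1.547, 4.154, 27.555, 61.719, 2.208, 3.662]

response = requests.post(
    "https://pulsar-classification.fly.dev/predict",
    json={"features": features},
    timeout=10,  # avoid hanging if the service is unreachable
)
response.raise_for_status()  # fail loudly on HTTP errors

print(response.json())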
# {"probability": 0.960, "is_pulsar": true}
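
Before sending a request, a client can check that a candidate has the expected shape. The helper below is a hypothetical client-side sketch: it builds the same `{"features": [...]}` body used in the example above, and its name and error messages are illustrative. The server may apply its own validation as well.

```python
import math

N_FEATURES = 8  # each HTRU2 candidate is described by 8 features


def build_predict_payload(features):
    """Validate one candidate and wrap it in the {"features": [...]} request body."""
    values = [float(v) for v in features]
    if len(values) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} features, got {len(values)}")
    if not all(math.isfinite(v) for v in values):
        raise ValueError("features must be finite numbers")
    return {"features": values}


payload = build_predict_payload(
    [99.367, 41.572, 1.547, 4.154, 27.555, 61.719, 2.208, 3.662]
)
```

The resulting `payload` can be passed directly as the `json=` argument of `requests.post`.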

Business Impact

This system addresses a real astronomical challenge by automating the identification of pulsar stars, which:

Reduces manual verification workload for astronomers
Enables faster discovery of valuable astronomical objects
Provides consistent, reproducible classification results

Key Technologies

Machine Learning: scikit-learn, XGBoost, pandas, numpy
API Development: FastAPI, Pydantic, Uvicorn
Deployment: Docker, Fly.io
Development: uv package manager, Jupyter notebooks

Development Status

✅ Completed: Data pipeline, model training, API development
✅ Deployed: Production API on Fly.io
✅ Documented: Comprehensive README and examples

This project demonstrates end-to-end machine learning capabilities, from research to production deployment, and showcases a practical application of ML in a scientific domain.