Project: Predicting Mitochondrial Genetic Disorders

Tackling extreme class imbalance with robust resampling and carefully tuned ML models—while avoiding data leakage.

Overview

This project builds supervised learning models to predict mitochondrial genetic disorders from a Kaggle dataset. The work emphasizes rigorous evaluation under class imbalance, leakage-safe pipelines, and a comparative study of modern resampling strategies (SMOTE, Tomek Links, etc.) paired with models like SVM, Random Forest, Gradient Boosted Trees, and a Feedforward Neural Network (FNN).


Key Features

  • Literature-grounded setup: Surveyed three papers to align preprocessing, targets, and metrics with state-of-the-art practice.
  • Imbalance handling: Systematic comparison of Random over/undersampling, Tomek Links, and SMOTE variants.
  • Leakage prevention: End-to-end pipelines with stratified splits, resampling inside CV folds, and feature scaling fit only on training.
  • Model zoo + tuning: SVM, RF, GBT, and FNN with grid/random search (and early stopping where applicable).
  • Clear evaluation: Stratified K-fold CV with AUROC, AUPRC, F1, balanced accuracy, and calibration checks.

Dataset

  • Source: Kaggle (mitochondrial genetic disorder dataset).
  • Task: Multiclass (or multilabel, depending on how the targets are encoded) prediction of mitochondrial disorders.
  • Challenge: Severe class imbalance and potential feature/target leakage if resampling or scaling happens before splits.
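
The leakage pitfall noted above is easiest to see in code. A minimal illustration, where `X` and `y` stand in for hypothetical pre-loaded features and labels:

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# WRONG: resampling before the split lets synthetic near-copies of
# test-set rows leak into training, inflating every downstream metric.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_res, y_res, stratify=y_res, random_state=42)

# RIGHT: split first, then resample the training portion only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
X_tr, y_tr = SMOTE(random_state=42).fit_resample(X_tr, y_tr)
```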

Approach

  1. Problem Framing
    • Defined targets per the dataset documentation and literature.
    • Established a leakage-safe protocol: split → fit scaler on train → resample train only → train → evaluate on untouched val/test.
  2. Preprocessing
    • Missing values: imputed within the training fold only.
    • Scaling: standardization or min–max, as appropriate for the model (e.g., SVM and the FNN benefit from scaled inputs; tree ensembles do not require it).
  3. Imbalance Strategies
    • Oversampling: Random Oversampling; SMOTE (and variants if applicable).
    • Undersampling: Random Undersampling.
    • Hybrid: Tomek Links (borderline clean-up) + SMOTE.
    • All resampling performed inside each CV fold to avoid leakage (see the pipeline sketch after this list).
  4. Models & Tuning
    • SVM: Linear/RBF kernels; tuned C, gamma (a search sketch follows this list).
    • Random Forest: n_estimators, max_depth, class_weight.
    • Gradient Boosted Trees (e.g., XGBoost/LightGBM/Sklearn GBDT): learning_rate, n_estimators, max_depth, subsample.
    • FNN: Depth/width, activation, dropout, optimizer, batch size, epochs with early stopping and validation splits.
  5. Evaluation Protocol
    • Stratified K-fold CV (outer) for performance estimation.
    • Optional inner CV for hyperparameter search (nested CV).
    • Metrics: AUROC, AUPRC (key under imbalance), F1, balanced accuracy, per-class recall/precision, and confusion matrices.
    • Calibration: reliability curves / Brier score (optional).
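
The steps above compose into a single leakage-safe pipeline. Below is a minimal sketch using imblearn.pipeline.Pipeline (which applies resampling during fit only, never at predict time); `X` and `y` are an assumed pre-loaded numeric feature matrix and binary label vector, and all hyperparameters are illustrative rather than the project's tuned values.

```python
from imblearn.combine import SMOTETomek
from imblearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.preprocessing import StandardScaler

# Imputer, scaler, and resampler are all (re)fit on each training fold only.
pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("resample", SMOTETomek(random_state=42)),
    ("model", GradientBoostingClassifier(random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    pipe, X, y, cv=cv,
    scoring=["roc_auc", "average_precision", "f1", "balanced_accuracy"],
)
print({k: v.mean() for k, v in scores.items() if k.startswith("test_")})
```

The same pipeline slots into the hyperparameter search from step 4, so resampling stays inside each search fold. A hedged sketch with RandomizedSearchCV, reusing the `pipe` and `cv` objects above (the SVM search space is illustrative):

```python
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Swap the estimator step, then search over its parameters; resampling
# and scaling still happen inside each search fold.
pipe.set_params(model=SVC(kernel="rbf", random_state=42))
param_dist = {
    "model__C": loguniform(1e-2, 1e2),
    "model__gamma": loguniform(1e-4, 1e0),
}
search = RandomizedSearchCV(
    pipe, param_dist, n_iter=50, cv=cv,
    scoring="average_precision", random_state=42, n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Wrapping `search` itself in `cross_validate` yields the nested CV mentioned in step 5.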

Results & Highlights

  • Imbalance Handling: SMOTE (+ Tomek Links) generally improved minority-class recall and AUPRC without sacrificing too much precision.
  • Best Traditional Model: placeholder pending final numbers (e.g., Gradient Boosted Trees with SMOTE yielded the highest AUPRC and competitive F1).
  • Neural Baseline: FNN matched or surpassed tree methods when tuned and regularized, especially with robust early stopping and a class-balanced loss (sketched below).
  • Leakage Controls Matter: Running resampling outside CV inflated scores; placing it inside folds produced more realistic performance.
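
The class-balanced loss mentioned above can be as simple as inverse-frequency class weights. A hedged PyTorch sketch (the project may use Keras instead; `y_train` is an illustrative label array):

```python
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

# Weight each class inversely to its frequency in the training fold.
classes = np.unique(y_train)
weights = compute_class_weight("balanced", classes=classes, y=y_train)
criterion = torch.nn.CrossEntropyLoss(
    weight=torch.tensor(weights, dtype=torch.float32)
)
```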



Reproducibility

  • Environment: Python (scikit-learn, imbalanced-learn, numpy, pandas; for FNN, PyTorch or Keras/TensorFlow).
  • Determinism: Fixed random_state seeds; logged versions.
  • Pipelines: sklearn.pipeline.Pipeline / imblearn.pipeline.Pipeline to ensure train-fold-only fitting.
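
A minimal seeding and version-logging sketch (an assumed convention, not the project's actual script):

```python
import random

import imblearn
import numpy as np
import sklearn

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
print(f"scikit-learn {sklearn.__version__} | imbalanced-learn {imblearn.__version__}")
```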

PDF

Final Report PDF Link

Sample Commands

```bash
# Train + evaluate with stratified K-fold CV
python train.py \
  --model gbt \
  --cv 5 \
  --resampling smote_tomek \
  --scaler standard \
  --metrics auroc auprc f1 balacc \
  --seed 42

# Hyperparameter search (example)
python tune.py \
  --model svm \
  --search random \
  --n_iter 50 \
  --cv 5 \
  --resampling smote \
  --seed 42
```