Detection of Exoplanets using Transit Photometry

Detection of exoplanets using transit photometry:

  • Sky Debnath: Department of Physics, National Institute of Technology Agartala, Jirania, West Tripura, Tripura 799046
  • Avinash A Deshpande: Astronomy and Astrophysics, Raman Research Institute, C. V. Raman Avenue, Sadashivnagar, Bengaluru 560080

Exoplanets are planets that orbit stars outside our solar system. When a planet passes in front of its star (a transit), the brightness of that star as observed by us drops, and the size of the drop depends on the size of the planet relative to the star. The flux we record for a star with a transiting planet therefore shows a dip each time the planet crosses the stellar disk.
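
As a rough illustration of how planet size sets the depth of the dip: for an opaque planet crossing a uniformly bright stellar disk, the fractional drop in flux is approximately the ratio of the two disk areas, (R_planet / R_star)^2. The sketch below is not part of the original notebook. The function name `transit_depth` and the rounded radii are assumptions chosen for illustration, and limb darkening is ignored.

```python
# Approximate transit depth: fraction of the stellar disk covered by the planet.
# Assumes a uniformly bright star and a fully opaque planet (no limb darkening).

R_SUN_KM = 695_700.0      # nominal solar radius (rounded)
R_JUPITER_KM = 69_911.0   # mean radius of Jupiter (rounded)
R_EARTH_KM = 6_371.0      # mean radius of Earth (rounded)


def transit_depth(planet_radius_km: float, star_radius_km: float) -> float:
    """Return the fractional flux drop (R_planet / R_star) ** 2."""
    if planet_radius_km <= 0 or star_radius_km <= 0:
        raise ValueError("radii must be positive")
    return (planet_radius_km / star_radius_km) ** 2


jupiter_depth = transit_depth(R_JUPITER_KM, R_SUN_KM)
earth_depth = transit_depth(R_EARTH_KM, R_SUN_KM)

print(f"Jupiter-size planet, Sun-like star: {jupiter_depth:.4%}")
print(f"Earth-size planet, Sun-like star:   {earth_depth:.4%}")
```

Under these assumptions a Jupiter-size planet dims a Sun-like star by roughly 1%, while an Earth-size planet produces a dip of less than 0.01%, which is why small planets are much harder to spot in noisy flux data.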

Dataset: Exoplanet Hunting in Deep Space - Kepler labelled time series data.

The data describe the change in flux (light intensity) of several thousand stars. Each star has a binary label of 2 or 1: 2 indicates that the star is confirmed to have at least one exoplanet in orbit, while 1 indicates a star without a confirmed exoplanet.

from autogluon.multimodal import MultiModalPredictor
from autogluon.tabular import TabularDataset, TabularPredictor
from imblearn.over_sampling import RandomOverSampler
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.metrics import (
    confusion_matrix,
    ConfusionMatrixDisplay,
    accuracy_score,
    classification_report
)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier as KNC
from sklearn.preprocessing import StandardScaler
plt.style.use('fivethirtyeight')
SEED=42
MODEL_PATH = 'model'

Dataset Preprocessing

df_train = pd.read_csv('dataset/exoTrain.csv')
df_test = pd.read_csv('dataset/exoTest.csv')

print(df_train.shape, df_test.shape)
# (5087, 3198) (570, 3198)
df_train.head(5).transpose()
                 0       1       2       3         4
LABEL         2.00    2.00    2.00    2.00      2.00
FLUX.1       93.85  -38.88  532.64  326.52  -1107.21
FLUX.2       83.81  -33.83  535.92  347.39  -1112.59
FLUX.3       20.10  -58.54  513.73  302.35  -1118.95
FLUX.4      -26.98  -40.09  496.92  298.13  -1095.10
...            ...     ...     ...     ...       ...
FLUX.3193    92.54    0.76    5.06  -12.67   -438.54
FLUX.3194    39.32  -11.70  -11.80   -8.77   -399.71
FLUX.3195    61.42    6.46  -28.91  -17.31   -384.65
FLUX.3196     5.08   16.00  -70.02  -17.35   -411.79
FLUX.3197   -39.54   19.93  -96.67   13.98   -510.54
# replacing label class [1,2] -> [0,1]
df_train = df_train.replace({'LABEL': {2:1, 1:0}})
df_test = df_test.replace({'LABEL': {2:1, 1:0}})

Missing Values

# how many values are missing?
print(df_train.isnull().sum().sum())
# 0 => no missing data
plt.figure(figsize=(20, 8))
plt.title('Visualize Distribution of Null Values in Dataset')

sns.heatmap(
    df_train.isnull(),
    annot=False
)

plt.savefig('assets/Real_World_Model_to_Deployment_01.webp', bbox_inches='tight')

Label Imbalance

# is the dataset balanced?
df_train['LABEL'].value_counts()

# the dataset only has 37 positives against 5050 negatives
# 0 5050
# 1 37
# Name: LABEL, dtype: int64
plt.figure(figsize=(10, 5))
plt.title('Suns w/o (Class 0) and w (Class 1) Exoplanets')

ax = sns.countplot(
    data=df_train,
    x='LABEL'
)

ax.bar_label(ax.containers[0])

plt.savefig('assets/Real_World_Model_to_Deployment_02.webp', bbox_inches='tight')

exoplanets = df_train[df_train['LABEL'] == 1.]
exoplanets
    LABEL  FLUX.1  FLUX.2  FLUX.3  FLUX.4  FLUX.5  FLUX.6  FLUX.7  FLUX.8  FLUX.9  ...  FLUX.3188  FLUX.3189  FLUX.3190  FLUX.3191  FLUX.3192  FLUX.3193  FLUX.3194  FLUX.3195  FLUX.3196  FLUX.3197
0   1  93.85  83.81  20.10  -26.98  -39.56  -124.71  -135.18  -96.27  -79.89  ...  -78.07  -102.15  -102.15  25.13  48.57  92.54  39.32  61.42  5.08  -39.54
1   1  -38.88  -33.83  -58.54  -40.09  -79.31  -72.81  -86.55  -85.33  -83.97  ...  -3.28  -32.21  -32.21  -24.89  -4.86  0.76  -11.70  6.46  16.00  19.93
2   1  532.64  535.92  513.73  496.92  456.45  466.00  464.50  486.39  436.56  ...  -71.69  13.31  13.31  -29.89  -20.88  5.06  -11.80  -28.91  -70.02  -96.67
3   1  326.52  347.39  302.35  298.13  317.74  312.70  322.33  311.31  312.42  ...  5.71  -3.73  -3.73  30.05  20.03  -12.67  -8.77  -17.31  -17.35  13.98
4   1  -1107.21  -1112.59  -1118.95  -1095.10  -1057.55  -1034.48  -998.34  -1022.71  -989.57  ...  -594.37  -401.66  -401.66  -357.24  -443.76  -438.54  -399.71  -384.65  -411.79  -510.54
5   1  211.10  163.57  179.16  187.82  188.46  168.13  203.46  178.65  166.49  ...  -98.45  30.34  30.34  29.62  28.80  19.27  -43.90  -41.63  -52.90  -16.16
6   1  9.34  49.96  33.30  9.63  37.64  20.85  4.54  22.42  10.11  ...  -58.56  9.93  9.93  23.50  5.28  -0.44  10.90  -11.77  -9.25  -36.69
7   1  238.77  262.16  277.80  190.16  180.98  123.27  103.95  50.70  59.91  ...  -72.48  31.77  31.77  53.48  27.88  95.30  48.86  -10.62  -112.02  -229.92
8   1  -103.54  -118.97  -108.93  -72.25  -61.46  -50.16  -20.61  -12.44  1.48  ...  43.92  7.24  7.24  -7.45  -18.82  4.53  21.95  26.94  34.08  44.65
9   1  -265.91  -318.59  -335.66  -450.47  -453.09  -561.47  -606.03  -712.72  -685.97  ...  3671.03  2249.28  2249.28  2437.78  2584.22  3162.53  3398.28  3648.34  3671.97  3781.91
10  1  118.81  110.97  79.53  114.25  48.78  3.12  -4.09  66.20  -26.02  ...  50.05  50.05  50.05  67.42  -56.78  126.14  200.36  432.95  721.81  938.08
11  1  -239.88  -164.28  -180.91  -225.69  -90.66  -130.66  -149.75  -120.50  -157.00  ...  -364.75  -364.75  -364.75  -196.38  -165.81  -215.94  -293.25  -214.34  -154.84  -151.41
12  1  70.34  63.86  58.37  69.43  64.18  52.70  47.58  46.89  46.00  ...  6.45  -8.91  -8.91  -6.70  -5.04  -10.79  -4.97  -7.46  -15.06  -2.06
13  1  424.14  407.71  461.59  428.17  412.69  395.58  453.35  410.45  402.09  ...  238.36  46.65  46.65  95.90  123.48  138.38  190.66  202.55  232.16  251.73
14  1  -267.21  -239.11  -233.15  -211.84  -191.56  -181.69  -164.77  -156.68  -139.23  ...  -754.92  -752.38  -752.38  -754.93  -761.64  -746.83  -765.22  -757.05  -763.26  -769.39
15  1  35.92  45.84  47.99  74.58  87.97  87.97  105.23  131.70  130.00  ...  39.71  -2.53  -2.53  15.32  18.65  20.43  22.40  37.32  36.01  71.59
16  1  -122.30  -122.30  -131.08  -109.69  -109.69  -95.27  -93.93  -84.84  -73.65  ...  22.64  -42.53  -42.53  -46.43  -56.26  -54.25  -37.13  -24.73  13.35  -5.81
17  1  -65.20  -76.33  -76.23  -72.58  -69.62  -74.51  -69.48  -61.06  -49.29  ...  18.66  -11.72  -11.72  4.56  11.47  31.26  21.71  13.42  13.24  9.21
18  1  -66.47  -15.50  -44.59  -49.03  -70.16  -85.53  -52.06  -73.41  -59.69  ...  -6.19  10.00  10.00  50.12  -14.97  -32.75  -30.28  -9.28  -31.53  26.88
19  1  560.19  262.94  189.94  185.12  210.38  104.19  289.56  172.06  81.75  ...  106.00  -7.94  -7.94  -7.94  52.31  -165.00  7.38  -61.56  -44.75  104.50
20  1  -1831.31  -1781.44  -1930.84  -2016.72  -1963.31  -1956.12  -2128.24  -2188.20  -2212.82  ...  903.82  75.61  75.61  191.77  196.16  326.61  481.28  635.63  651.68  695.74
21  1  2053.62  2126.05  2146.33  2159.84  2237.59  2236.12  2244.47  2279.61  2288.22  ...  1832.59  1935.53  1965.84  2094.19  2212.52  2292.64  2454.48  2568.16  2625.45  2578.80
22  1  -48.48  -22.95  11.15  -70.04  -120.34  -150.04  -309.38  -160.73  -201.41  ...  90.70  -20.01  -62.12  -45.96  -52.40  -4.93  26.74  21.43  145.30  197.20
23  1  145.84  137.82  96.99  17.09  -73.79  -157.79  -267.71  -365.91  -385.07  ...  62.76  101.24  98.13  112.51  95.77  127.98  67.51  91.24  40.40  -10.80
24  1  207.37  195.04  150.45  135.34  104.90  59.79  42.85  52.74  18.38  ...  -13.21  -43.43  -14.77  -22.27  -0.04  19.46  9.32  23.55  -4.73  11.82
25  1  304.50  275.94  269.24  248.51  194.88  167.80  139.13  149.36  100.97  ...  4.21  3.53  -5.13  14.56  -1.44  -10.73  3.49  0.18  -2.89  40.34
26  1  150725.80  129578.36  102184.98  82253.98  67934.17  48063.52  42745.02  18971.55  2983.58  ...  -11143.45  -23351.45  -33590.27  -31861.95  -23298.89  -13056.11  379.48  9444.52  23261.02  33565.48
27  1  124.39  72.73  36.85  -4.68  6.96  -44.61  -89.79  -121.71  -120.59  ...  -14.38  -21.65  -6.04  -7.15  67.58  56.43  -1.95  7.09  1.63  -10.77
28  1  -63.50  -49.15  -45.99  -34.55  -44.34  -15.80  -16.07  5.32  -7.05  ...  -113.73  -113.58  -130.99  -121.51  -94.69  -90.38  -74.36  -56.49  -46.51  -44.53
29  1  31.29  25.14  36.93  16.63  17.01  -7.50  0.09  1.24  -19.82  ...  11.36  12.96  28.50  51.05  25.85  4.79  13.26  -17.58  13.79  0.72
30  1  -472.50  -384.09  -330.42  -273.41  -185.02  -115.64  -141.86  -16.23  77.80  ...  -3408.88  -3425.92  -3465.59  -3422.95  -3398.83  -3410.42  -3393.58  -3407.78  -3391.56  -3397.03
31  1  194.82  162.51  126.17  129.70  82.27  60.71  58.71  23.36  32.57  ...  29.21  47.66  0.48  -28.59  -33.15  -14.98  -1.56  22.25  21.55  3.49
32  1  26.96  38.98  25.99  47.28  26.29  34.08  16.66  28.27  20.99  ...  35.26  -9.94  23.73  -7.54  -5.86  13.04  -5.64  -16.85  -6.18  -16.03
33  1  43.07  46.73  29.43  9.75  6.54  -3.76  -31.48  -46.94  -40.78  ...  6.99  5.75  23.18  15.08  18.09  13.40  15.78  18.18  51.21  9.71
34  1  -248.23  -243.59  -217.91  -190.69  -190.17  -163.04  -196.32  -164.73  -149.34  ...  94.25  121.45  135.02  147.14  161.89  198.05  262.03  282.88  334.81  377.14
35  1  22.82  46.37  39.61  98.75  81.32  100.43  65.00  38.86  22.11  ...  55.50  -16.22  -5.21  15.04  11.86  -5.38  -24.46  -55.86  -44.55  -16.80
36  1  26.24  42.32  28.34  24.81  49.39  47.57  41.52  51.80  25.50  ...  -7.53  -35.72  -14.32  -29.21  -30.61  8.49  4.75  6.59  -7.03  24.41

37 rows × 3198 columns

Visualizing Differences between both Classes

# separate label
X_train = df_train.drop(['LABEL'], axis=1)
y_train = df_train['LABEL']
X_test = df_test.drop(['LABEL'], axis=1)
y_test = df_test['LABEL']
# plot light curves for individual stars
## there are no timestamps -> generate an index range (one step per flux column)
time = range(1, 3198)
## get brightness values for suns (class 0)
flux_00 = X_train.iloc[444,:].values
flux_01 = X_train.iloc[222,:].values
flux_02 = X_train.iloc[666,:].values
flux_03 = X_train.iloc[888,:].values
flux_04 = X_train.iloc[111,:].values
flux_05 = X_train.iloc[5000,:].values
## get brightness values for suns (class 1)
flux_10 = X_train.iloc[0,:].values
flux_11 = X_train.iloc[5,:].values
flux_12 = X_train.iloc[11,:].values
flux_13 = X_train.iloc[17,:].values
flux_14 = X_train.iloc[23,:].values
flux_15 = X_train.iloc[29,:].values
fig, axes = plt.subplots(2, 3, figsize=(15, 5))
fig.suptitle('Suns with Exoplanets (Class 1)')

# one panel per star; axes.flat walks the 2x3 grid row by row
for ax, flux in zip(axes.flat, [flux_10, flux_11, flux_12, flux_13, flux_14, flux_15]):
    sns.scatterplot(x=time, y=flux, s=10, alpha=0.6, ax=ax)

plt.savefig('assets/Real_World_Model_to_Deployment_03.webp', bbox_inches='tight')

fig, axes = plt.subplots(2, 3, figsize=(15, 5))
fig.suptitle('Suns without Exoplanets (Class 0)')

# one panel per star; axes.flat walks the 2x3 grid row by row
for ax, flux in zip(axes.flat, [flux_00, flux_01, flux_02, flux_03, flux_04, flux_05]):
    sns.scatterplot(x=time, y=flux, s=10, alpha=0.6, ax=ax)

plt.savefig('assets/Real_World_Model_to_Deployment_04.webp', bbox_inches='tight')

Handling Outliers

plt.figure(figsize=(15, 5))

for i in range(1, 5):
    plt.subplot(1, 4, i)
    sns.boxplot(data=df_train, x='LABEL', y='FLUX.' + str(i))

plt.savefig('assets/Real_World_Model_to_Deployment_05.webp', bbox_inches='tight')

# there is one extreme outlier with a value above 0.25e6 in FLUX.1
df_train[df_train['FLUX.1'] > 0.25e6].index
# Int64Index([3340], dtype='int64')
df_train[df_train['FLUX.3'] > 0.25e6].index
# Int64Index([3340], dtype='int64')
# it is the same sun in both cases (index label 3340) -> drop
df_train = df_train.drop(3340, axis=0)
plt.figure(figsize=(15, 5))

for i in range(1, 5):
    plt.subplot(1, 4, i)
    sns.boxplot(data=df_train, x='LABEL', y='FLUX.' + str(i))

plt.savefig('assets/Real_World_Model_to_Deployment_06.webp', bbox_inches='tight')

Model Training

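# the dataset is already split into train/test -> split a validation set off the training set
# note: X_train and y_train were created before row 3340 was dropped from df_train,
# so the outlier is still part of this split (3560 + 1527 = 5087 rows)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.3, random_state=SEED
)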
X_train.shape, X_val.shape
# ((3560, 3197), (1527, 3197))
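# normalizing features: fit the scaler on the training split only,
# then apply the same transformation to the validation and test splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)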
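
To make the scaling step concrete, here is a dependency-free sketch of standardization for a single feature column, following the usual convention that the mean and standard deviation are estimated on the training values only and then reused for other splits. The helper names `fit_standardizer` and `standardize` and the toy numbers are hypothetical. `StandardScaler` applies the same idea per column, using the population standard deviation.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Estimate mean and population standard deviation on training data only."""
    mean = fmean(train_values)
    std = pstdev(train_values)
    # Guard against constant columns to avoid dividing by zero.
    return mean, (std if std > 0 else 1.0)


def standardize(values, mean, std):
    """Apply previously fitted statistics to any split (train, val or test)."""
    return [(v - mean) / std for v in values]


# Toy flux values for one feature column (made up for illustration).
train_flux = [10.0, 20.0, 30.0, 40.0]
val_flux = [25.0, 50.0]

mean, std = fit_standardizer(train_flux)
train_scaled = standardize(train_flux, mean, std)
val_scaled = standardize(val_flux, mean, std)  # reuse the training statistics

print(train_scaled)
print(val_scaled)
```

Reusing the training statistics keeps all splits on the same scale. Fitting a separate scaler on each split would map the same raw flux value to different scaled values depending on which split it appears in.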

KNN Model on Imbalanced Datasets

knn_classifier = KNC(n_neighbors=5, metric='minkowski', p=2)

knn_classifier.fit(X_train_scaled, y_train)
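
For intuition about what `KNC(n_neighbors=5, metric='minkowski', p=2)` does at prediction time, the following is a minimal pure-Python sketch of k-nearest-neighbors classification with Euclidean distance and a majority vote. The function `knn_predict` and the toy points are assumptions for illustration. scikit-learn's implementation is far more optimized, but the decision rule is the same idea.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance corresponds to the Minkowski metric with p=2.
    distances = sorted(
        (dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Tiny made-up 2D dataset: class 0 clusters near the origin, class 1 near (5, 5).
points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.9), (5.0, 5.0), (5.2, 4.8), (4.7, 5.1)]
labels = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(points, labels, query=(0.3, 0.3), k=3)
near_five = knn_predict(points, labels, query=(4.9, 5.0), k=3)
print(near_origin, near_five)
```

Because the vote is taken among the nearest neighbors, a class with only a handful of examples among thousands is easily outvoted, which is worth keeping in mind when reading the evaluation of the imbalanced model.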

Model Evaluation

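# predict on the scaled validation features, matching the scaling used for training
# (the numbers recorded below came from a run that passed the unscaled X_val)
y_pred = knn_classifier.predict(X_val_scaled)
print(accuracy_score(y_val, y_pred))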
# 0.9921414538310412
print(classification_report(y_val, y_pred))
# the imbalanced dataset leads to spectacular accuracies, but...
              precision    recall  f1-score   support
           0       0.99      1.00      1.00      1515
           1       0.00      0.00      0.00        12
    accuracy                           0.99      1527
   macro avg       0.50      0.50      0.50      1527
weighted avg       0.98      0.99      0.99      1527
# because there are so few positives in the dataset
# the accuracy is not affected by a 100% fail in detection
ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix(y_val, y_pred)
).plot()

plt.savefig('assets/Real_World_Model_to_Deployment_07.webp', bbox_inches='tight')
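
The gap between accuracy and recall can be reproduced by hand from the counts implied by the classification report: all 1515 negatives are predicted as class 0, and so are all 12 positives. The helper `binary_metrics` below is a hypothetical, dependency-free sketch, and returning 0.0 for an undefined ratio is an assumed convention.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Report 0.0 when a ratio is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Counts implied by the report: 1515 true negatives, 12 missed positives.
metrics = binary_metrics(tn=1515, fp=0, fn=12, tp=0)
print(metrics)
```

Accuracy works out to 1515 / 1527, matching the value printed by `accuracy_score`, even though not a single star with an exoplanet was detected. For a dataset this skewed, recall and F1 for class 1 are the more informative numbers.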

Random Oversampling to balance the Dataset

# get full training dataset again
X_train = df_train.drop(['LABEL'], axis=1)
y_train = df_train['LABEL']
over_sampler = RandomOverSampler()

x_osample, y_osample = over_sampler.fit_resample(X_train, y_train)
plt.figure(figsize=(10, 5))
plt.title('Suns w/o (Class 0) and w (Class 1) Exoplanets')

ax = y_osample.value_counts().plot(kind='bar')
ax.bar_label(ax.containers[0])

plt.savefig('assets/Real_World_Model_to_Deployment_08.webp', bbox_inches='tight')
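
Random oversampling adds no new information. It only duplicates existing minority rows until the classes are the same size. The sketch below illustrates the idea in plain Python. The function `random_oversample` and the toy rows are hypothetical and are not the imbalanced-learn implementation.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen minority rows until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        candidates = [row for row, label in zip(rows, labels) if label == cls]
        # Sample with replacement, so the same row may be copied many times.
        extra = rng.choices(candidates, k=target - count)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data: 6 negatives, 2 positives (values are placeholders, not real flux).
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [9.1], [9.2]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

In this dataset each of the 37 positive light curves ends up copied more than a hundred times on average, so the balanced class 1 still contains only 37 distinct examples.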

KNN Model on the balanced Data

# split the oversampled data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    x_osample, y_osample, test_size=0.3, random_state=SEED
)
X_train.shape, X_val.shape
# ((7068, 3197), (3030, 3197))
# normalizing features: fit on the training split, reuse its statistics for validation
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
knn_classifier = KNC(n_neighbors=5, metric='minkowski', p=2)

knn_classifier.fit(X_train_scaled, y_train)

Model Evaluation

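# predict on the scaled validation features, matching the scaling used for training
# (the numbers recorded below came from a run that passed the unscaled X_val)
y_pred = knn_classifier.predict(X_val_scaled)
print(accuracy_score(y_val, y_pred))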
# 0.6026402640264027
print(classification_report(y_val, y_pred))
              precision    recall  f1-score   support
           0       0.57      0.94      0.71      1558
           1       0.80      0.24      0.37      1472
    accuracy                           0.60      3030
   macro avg       0.68      0.59      0.54      3030
weighted avg       0.68      0.60      0.55      3030
ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix(y_val, y_pred)
).plot()

plt.savefig('assets/Real_World_Model_to_Deployment_09.webp', bbox_inches='tight')

Hyper Parameter Tuning

knn_classifier = KNC()
param_grid = {
    'n_neighbors': [4, 5, 6],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]
}

grid_search = GridSearchCV(
    estimator=knn_classifier,
    param_grid=param_grid
)

grid_search.fit(X_train_scaled, y_train)
print('Best Parameter: ', grid_search.best_params_)
# Best Parameter: {'n_neighbors': 4, 'p': 2, 'weights': 'uniform'}
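
To see how much work the grid search performs, the candidate models can be enumerated by hand. The sketch below expands the same `param_grid` with `itertools.product`. The fold count of 5 is an assumption based on scikit-learn's default when no `cv` argument is passed.

```python
from itertools import product

param_grid = {
    'n_neighbors': [4, 5, 6],
    'weights': ['uniform', 'distance'],
    'p': [1, 2],
}

# Every combination of the listed values is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5  # assumed: scikit-learn's default when `cv` is not passed
print(len(candidates), "candidates,", len(candidates) * n_folds, "cross-validation fits")
for candidate in candidates[:3]:
    print(candidate)
```

The grid is small (3 x 2 x 2 combinations), so the search only explores values close to the initial `n_neighbors=5` and cannot compensate for problems elsewhere in the pipeline.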
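# re-run training with the best parameters
knn_classifier = KNC(n_neighbors=4, metric='minkowski', p=2, weights='uniform')
knn_classifier.fit(X_train_scaled, y_train)
# predict on the scaled validation features
# (the numbers recorded below came from a run that passed the unscaled X_val)
y_pred = knn_classifier.predict(X_val_scaled)
print(accuracy_score(y_val, y_pred))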
# 0.5712871287128712
print(classification_report(y_val, y_pred))
              precision    recall  f1-score   support
           0       0.55      0.96      0.70      1558
           1       0.81      0.15      0.26      1472
    accuracy                           0.57      3030
   macro avg       0.68      0.56      0.48      3030
weighted avg       0.67      0.57      0.49      3030
ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix(y_val, y_pred)
).plot()

plt.savefig('assets/Real_World_Model_to_Deployment_10.webp', bbox_inches='tight')

AutoML with AutoGluon

Tabular Data Predictor on the unbalanced Dataset

Data Preprocessing

data = TabularDataset('dataset/exoTrain.csv')
# replacing label class [1,2] -> [0,1]
data = data.replace({'LABEL': {2:1, 1:0}})
data.head(5).transpose()
                 0       1       2       3         4
LABEL         1.00    1.00    1.00    1.00      1.00
FLUX.1       93.85  -38.88  532.64  326.52  -1107.21
FLUX.2       83.81  -33.83  535.92  347.39  -1112.59
FLUX.3       20.10  -58.54  513.73  302.35  -1118.95
FLUX.4      -26.98  -40.09  496.92  298.13  -1095.10
...            ...     ...     ...     ...       ...
FLUX.3193    92.54    0.76    5.06  -12.67   -438.54
FLUX.3194    39.32  -11.70  -11.80   -8.77   -399.71
FLUX.3195    61.42    6.46  -28.91  -17.31   -384.65
FLUX.3196     5.08   16.00  -70.02  -17.35   -411.79
FLUX.3197   -39.54   19.93  -96.67   13.98   -510.54
# train/test split
print(len(data)*0.8)
# 4069.6
train_size = 4070
train_data = data.sample(n=train_size, random_state=SEED)
test_data = data.drop(train_data.index)
print(len(train_data), len(test_data))
# 4070 1017

Model Training

predictor = TabularPredictor(label='LABEL', path=MODEL_PATH)
predictor.fit(train_data)

# AutoGluon training complete, total runtime = 315.97s ... Best model: "WeightedEnsemble_L2"
predictor.fit_summary()
    model                score_val  pred_time_val  fit_time
0   LightGBMXT           0.992      0.020418       15.344224
1   LightGBMLarge        0.992      0.022831       71.049574
2   WeightedEnsemble_L2  0.992      0.023695       71.592278
3   LightGBM             0.992      0.025427       20.643771
4   CatBoost             0.992      0.050176       70.539329
5   NeuralNetFastAI      0.992      0.050670       9.314951
6   ExtraTreesGini       0.992      0.057889       1.999989
7   ExtraTreesEntr       0.992      0.059487       2.002087
8   RandomForestEntr     0.992      0.059567       4.198627
9   RandomForestGini     0.992      0.062586       4.663457
10  XGBoost              0.992      0.064399       52.430401
11  KNeighborsDist       0.992      0.112409       0.698705
12  KNeighborsUnif       0.992      0.213056       0.924304
13  NeuralNetTorch       0.992      1.411860       52.091543
leaderboard=pd.DataFrame(predictor.leaderboard())

plt.figure(figsize=(8, 7))
sns.set(style='darkgrid')
sns.scatterplot(
    x='pred_time_val',
    y='score_val',
    data=leaderboard,
    s=300,
    alpha=0.5,
    hue='model',
    palette='tab20',
    style='fit_time'
)

plt.title('Prediction Time vs Accuracy Score')
plt.xlabel('Average Time for Predictions')
plt.ylabel('Accuracy Score')
plt.legend(bbox_to_anchor=(1.01,1.01))

plt.savefig('assets/Real_World_Model_to_Deployment_11.webp', bbox_inches='tight')

Model Evaluation

# load best model
predictor = TabularPredictor.load("model/")
data_test = TabularDataset('dataset/exoTest.csv')
# replacing label class [1,2] -> [0,1]
data_test = data_test.replace({'LABEL': {2:1, 1:0}})
X_test = data_test.drop(columns=['LABEL'] )
y_test = data_test['LABEL']
y_pred = predictor.predict(X_test)
eval_metrics = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

# {
# "accuracy": 0.9912280701754386,
# "balanced_accuracy": 0.5000000000000001,
# "mcc": 0.0,
# "f1": 0.0,
# "precision": 0.0,
# "recall": 0.0
# }
# build the frame directly from the dict items so that the values stay numeric
# (going through np.array would cast every value to a string)
df = pd.DataFrame(list(eval_metrics.items()), columns=['metric', 'value']).sort_values(by='value')

fig, ax = plt.subplots(figsize=(10, 7))
ax.bar(df['metric'], df['value'])
for bar in ax.patches:
    ax.annotate(
        text=bar.get_height(),
        xy=(bar.get_x() + bar.get_width() / 2, bar.get_height()),
        ha='center',
        va='center',
        size=15,
        xytext=(0, 8),
        textcoords='offset points'
    )
plt.xlabel("Metric")
plt.ylabel("Value")
plt.title('Evaluation Metrics')
plt.ylim(bottom=0)

plt.savefig('assets/Real_World_Model_to_Deployment_12.webp', bbox_inches='tight')

Tabular Data Predictor on the re-balanced Dataset

As expected, the results are badly affected by the imbalance of the dataset. Let's see how AutoGluon handles a preprocessed (scaled and oversampled) dataset.

Data Preprocessing

df_train = pd.read_csv('dataset/exoTrain.csv')
df_test = pd.read_csv('dataset/exoTest.csv')

df_train = df_train.replace({'LABEL': {2:1, 1:0}})
df_test = df_test.replace({'LABEL': {2:1, 1:0}})
X_train = df_train.drop(['LABEL'], axis=1)
y_train = df_train['LABEL']

X_test = df_test.drop(['LABEL'], axis=1)
y_test = df_test['LABEL']
# normalizing features: fit on the training data and reuse its statistics for the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
over_sampler = RandomOverSampler()
x_osample, y_osample = over_sampler.fit_resample(
pd.DataFrame(X_train_scaled), y_train
)
df_merged_train = pd.concat([y_osample, x_osample], axis=1)
df_merged_train.head(5)
df_merged_test = pd.concat([y_test, pd.DataFrame(X_test_scaled)], axis=1)
df_merged_test.head(5)
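# index=False keeps the row index out of the CSV; otherwise it is read back
# as an extra feature column (the results recorded below include that column)
df_merged_train.to_csv('dataset/exoTrainNorm.csv', index=False)
df_merged_test.to_csv('dataset/exoTestNorm.csv', index=False)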

Model Training

data = TabularDataset('dataset/exoTrainNorm.csv')
print(len(data)*0.8)
# 8080.0
train_size = 8080
train_data = data.sample(n=train_size, random_state=SEED)
test_data = data.drop(train_data.index)
print(len(train_data), len(test_data))
# 8080 2020
predictor = TabularPredictor(label='LABEL', path=MODEL_PATH)
predictor.fit(train_data)

# AutoGluon training complete, total runtime = 412.18s ... Best model: "WeightedEnsemble_L2"
predictor.fit_summary()
    model                score_val  pred_time_val  fit_time
0   LightGBM             1.000000   0.029377       29.271464
1   LightGBMLarge        1.000000   0.037100       94.302897
2   CatBoost             1.000000   0.057690       79.490693
3   ExtraTreesGini       1.000000   0.063615       2.981032
4   ExtraTreesEntr       1.000000   0.064469       2.950440
5   RandomForestGini     1.000000   0.065542       6.167094
6   WeightedEnsemble_L2  1.000000   0.065813       3.676120
7   RandomForestEntr     1.000000   0.067343       6.783359
8   XGBoost              1.000000   0.099369       74.019001
9   NeuralNetTorch       1.000000   1.658319       26.595248
10  LightGBMXT           0.997525   0.027803       28.317720
11  KNeighborsDist       0.997525   0.367821       1.086449
12  KNeighborsUnif       0.997525   0.436408       0.956370
13  NeuralNetFastAI      0.986386   0.074977       20.446434
leaderboard=pd.DataFrame(predictor.leaderboard())

plt.figure(figsize=(8, 7))
sns.set(style='darkgrid')
sns.scatterplot(
    x='pred_time_val',
    y='score_val',
    data=leaderboard,
    s=300,
    alpha=0.5,
    hue='model',
    palette='tab20',
    style='fit_time'
)

plt.title('Prediction Time vs Accuracy Score')
plt.xlabel('Average Time for Predictions')
plt.ylabel('Accuracy Score')
plt.legend(bbox_to_anchor=(1.01,1.01))

plt.savefig('assets/Real_World_Model_to_Deployment_13.webp', bbox_inches='tight')

Model Evaluation

data_test = TabularDataset('dataset/exoTestNorm.csv')
X_test = data_test.drop(['LABEL'], axis=1)
y_test = data_test['LABEL']
# load best model
predictor = TabularPredictor.load("model/")
y_pred = predictor.predict(X_test)
eval_metrics = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

# {
# "accuracy": 0.9912280701754386,
# "balanced_accuracy": 0.5000000000000001,
# "mcc": 0.0,
# "f1": 0.0,
# "precision": 0.0,
# "recall": 0.0
# }
# build the frame directly from the dict items so that the values stay numeric
# (going through np.array would cast every value to a string)
df = pd.DataFrame(list(eval_metrics.items()), columns=['metric', 'value']).sort_values(by='value')

fig, ax = plt.subplots(figsize=(10, 7))
ax.bar(df['metric'], df['value'])
for bar in ax.patches:
    ax.annotate(
        text=bar.get_height(),
        xy=(bar.get_x() + bar.get_width() / 2, bar.get_height()),
        ha='center',
        va='center',
        size=15,
        xytext=(0, 8),
        textcoords='offset points'
    )
plt.xlabel("Metric")
plt.ylabel("Value")
plt.title('Evaluation Metrics')
plt.ylim(bottom=0)

plt.savefig('assets/Real_World_Model_to_Deployment_14.webp', bbox_inches='tight')

Multi Modal Predictor on the re-balanced Dataset

Data Preprocessing

data = TabularDataset('dataset/exoTrainNorm.csv')
train_data = data.sample(frac=0.8 , random_state=SEED)
test_data = data.drop(train_data.index)

Model Training

mm_predictor = MultiModalPredictor(label='LABEL', path=MODEL_PATH)
mm_predictor.fit(train_data)

Model Evaluation

mm_predictor = MultiModalPredictor.load('model/')
data_test = TabularDataset('dataset/exoTestNorm.csv')
X_test = data_test.drop(['LABEL'], axis=1)
y_test = data_test['LABEL']
model_scoring = mm_predictor.evaluate(data_test, metrics=['acc', 'f1'])
print(model_scoring)
# {'acc': 0.987719298245614, 'f1': 0.0}
data_test[2:8]
# check original dataframe to see labels - 3x1 and 3x0
   Unnamed: 0  LABEL  0          1          2          3          4          5          6          7          ...  3187       3188       3189       3190       3191       3192       3193       3194       3195       3196
2  2           1      0.026154   0.006299   0.018946   -0.005135  0.005965   -0.018322  -0.006160  -0.028589  ...  -0.004433  -0.037509  -0.014466  -0.037624  -0.019497  -0.048315  -0.037617  -0.030012  -0.027607  -0.010661
3  3           1      -0.106614  -0.124118  -0.109998  -0.125241  -0.102100  -0.120553  -0.105169  -0.117242  ...  0.006545   -0.022406  0.000427   -0.024211  -0.009926  -0.028069  -0.025272  -0.025595  -0.050906  -0.036046
4  4           1      -0.044110  -0.059779  -0.043223  -0.059295  -0.043759  -0.059998  -0.041655  -0.061203  ...  -0.010283  -0.038573  -0.012238  -0.033583  -0.014463  -0.039068  -0.036145  -0.019389  -0.031042  -0.015661
5  5           0      -0.039830  -0.057677  -0.041331  -0.057802  -0.041504  -0.057055  -0.040597  -0.057336  ...  -0.005900  -0.032540  -0.009752  -0.031495  -0.011695  -0.030716  -0.027901  -0.012041  -0.023759  -0.014598
6  6           0      -0.052925  -0.069757  -0.055066  -0.073462  -0.055658  -0.071925  -0.054675  -0.070939  ...  -0.005554  -0.031578  -0.010334  -0.035450  -0.014312  -0.031091  -0.029233  -0.013697  -0.025068  -0.014998
7  7           0      -0.041764  -0.059533  -0.043542  -0.059569  -0.043982  -0.059702  -0.043419  -0.059123  ...  -0.004048  -0.028919  -0.006828  -0.027940  -0.007229  -0.026500  -0.031011  -0.016102  -0.027207  -0.015900
# pick data without labels from test set
test_pred = X_test[2:8]
print(mm_predictor.class_labels)
mm_predictor.predict_proba(test_pred)
   0         1
2  0.990848  0.009152
3  0.873063  0.126937
4  0.989407  0.010593
5  0.993549  0.006451
6  0.983894  0.016106
7  0.993914  0.006086
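
Reading these probabilities against the true labels makes the failure explicit. The sketch below copies the class-1 probabilities and labels for rows 2 to 7 from the output and applies a 0.5 decision threshold. The threshold is an assumption about how probabilities are turned into labels for a binary problem.

```python
# Class-1 probabilities and true labels for rows 2..7, copied from the output.
proba_class_1 = {2: 0.009152, 3: 0.126937, 4: 0.010593,
                 5: 0.006451, 6: 0.016106, 7: 0.006086}
true_labels = {2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0}

THRESHOLD = 0.5  # assumed decision threshold for a binary classifier

predicted = {row: int(p >= THRESHOLD) for row, p in proba_class_1.items()}
missed = [row for row, label in true_labels.items()
          if label == 1 and predicted[row] == 0]

print(predicted)
print("missed positives:", missed)
```

All three stars with exoplanets receive a class-1 probability well below the threshold, so the model is confidently wrong on them rather than merely uncertain.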

Hmmmm so why doesn't this work?
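
One plausible contributor, offered as a hypothesis rather than a confirmed diagnosis: random oversampling was applied before the train/validation split, so exact copies of the same positive rows can land on both sides of the split. Validation scores then reward memorizing those copies, while the separate test set contains none of them. The dependency-free sketch below uses made-up row IDs to show the overlap. The helper `split` and the toy sizes are assumptions.

```python
import random

rng = random.Random(42)

# Toy dataset: row IDs stand in for light curves; 3 positives, 30 negatives.
positives = [f"pos_{i}" for i in range(3)]
negatives = [f"neg_{i}" for i in range(30)]


def split(rows, val_fraction=0.3):
    """Shuffle a copy of `rows` and split it into train and validation parts."""
    rows = list(rows)
    rng.shuffle(rows)
    cut = int(len(rows) * (1 - val_fraction))
    return rows[:cut], rows[cut:]


# Order used in the notebook: oversample first, then split.
oversampled = negatives + rng.choices(positives, k=len(negatives))
train, val = split(oversampled)
leaked = {r for r in val if r.startswith("pos") and r in train}
print("positives shared by train and val (oversample first):", sorted(leaked))

# Alternative order: split first, then oversample only the training part.
train, val = split(negatives + positives)
train_pos = [r for r in train if r.startswith("pos")]
if train_pos:
    n_neg = sum(r.startswith("neg") for r in train)
    train = train + rng.choices(train_pos, k=n_neg - len(train_pos))
shared = {r for r in val if r.startswith("pos") and r in train}
print("positives shared by train and val (split first):", sorted(shared))
```

When the split happens first, the validation positives are never seen during training, so the validation score becomes a more honest estimate of how the model will behave on the untouched test set.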