Multidimensional Scaling (MDS)

Multidimensional Scaling (MDS) is a family of statistical methods that map items into a low-dimensional space so that the pairwise distances between them are preserved as well as possible.

Dimensionality Reduction

Manifold learning is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high.

High-dimensional datasets can be very difficult to visualize. While data in two or three dimensions can be plotted to show the inherent structure of the data, equivalent high-dimensional plots are much less intuitive. To aid visualization of the structure of a dataset, the dimension must be reduced in some way.

The simplest way to accomplish this dimensionality reduction is by taking a random projection of the data. Though this allows some degree of visualization of the data structure, the randomness of the choice leaves much to be desired. In a random projection, it is likely that the more interesting structure within the data will be lost.
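A random projection can be sketched as multiplication with a random Gaussian matrix. The snippet below uses synthetic data purely for illustration; no structure in the data informs the choice of projection:

```python
import numpy as np

# project 10-dimensional synthetic data down to 2 dimensions with a
# random Gaussian matrix - the projection ignores any structure in X
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))   # synthetic high-dimensional data
R = rng.normal(size=(10, 2))     # random projection matrix
X_2d = X @ R                     # each row is now a 2D point
print(X_2d.shape)  # (200, 2)
```

Because `R` is chosen blindly, two well-separated clusters in the original space can land on top of each other in the projection.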

To address this concern, a number of supervised and unsupervised linear dimensionality reduction frameworks have been designed, such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). Manifold learning methods like MDS extend these ideas to non-linear structure in the data.

Dataset

A multivariate study of variation in two species of rock crab of genus Leptograpsus: a multivariate approach has been used to study morphological variation in the blue- and orange-form species of rock crab of the genus Leptograpsus. Objective criteria for the identification of the two species are established, based on the following characters:

  • SP: Species (blue or orange)
  • Sex: male or female
  • FL: Width of the frontal lobe of the carapace
  • RW: Width of the posterior region of the carapace (rear width)
  • CL: Length of the carapace along the midline
  • CW: Maximum width of the carapace
  • BD: Depth of the body

The dataset can be downloaded from Github.

(see the introduction in: Principal Component Analysis (PCA))

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.manifold import MDS

raw_data = pd.read_csv('data/A_multivariate_study_of_variation_in_two_species_of_rock_crab_of_genus_Leptograpsus.csv')

data = raw_data.rename(columns={
    'sp': 'Species',
    'sex': 'Sex',
    'index': 'Index',
    'FL': 'Frontal Lobe',
    'RW': 'Rear Width',
    'CL': 'Carapace Midline',
    'CW': 'Maximum Width',
    'BD': 'Body Depth'})

data['Species'] = data['Species'].map({'B':'Blue', 'O':'Orange'})
data['Sex'] = data['Sex'].map({'M':'Male', 'F':'Female'})

data_columns = ['Frontal Lobe', 'Rear Width', 'Carapace Midline',
                'Maximum Width', 'Body Depth']
# generate a class variable for all 4 classes
data['Class'] = data.Species + data.Sex

print(data['Class'].value_counts())
data.head(5)
  • BlueMale: 50
  • BlueFemale: 50
  • OrangeMale: 50
  • OrangeFemale: 50
|   | Species | Sex  | Index | Frontal Lobe | Rear Width | Carapace Midline | Maximum Width | Body Depth | Class    |
|---|---------|------|-------|--------------|------------|------------------|---------------|------------|----------|
| 0 | Blue    | Male | 1     | 8.1          | 6.7        | 16.1             | 19.0          | 7.0        | BlueMale |
| 1 | Blue    | Male | 2     | 8.8          | 7.7        | 18.1             | 20.8          | 7.4        | BlueMale |
| 2 | Blue    | Male | 3     | 9.2          | 7.8        | 19.0             | 22.4          | 7.7        | BlueMale |
| 3 | Blue    | Male | 4     | 9.6          | 7.9        | 20.1             | 23.1          | 8.2        | BlueMale |
| 4 | Blue    | Male | 5     | 9.8          | 8.0        | 20.3             | 23.0          | 8.2        | BlueMale |
# normalize data columns
data_norm = data.copy()
data_norm[data_columns] = MinMaxScaler().fit_transform(data[data_columns])

data_norm.describe()
|       | Index     | Frontal Lobe | Rear Width | Carapace Midline | Maximum Width | Body Depth |
|-------|-----------|--------------|------------|------------------|---------------|------------|
| count | 200.000000 | 200.000000  | 200.000000 | 200.000000       | 200.000000    | 200.000000 |
| mean  | 25.500000 | 0.527233     | 0.455365   | 0.529043         | 0.515053      | 0.511645   |
| std   | 14.467083 | 0.219832     | 0.187835   | 0.216382         | 0.209919      | 0.220953   |
| min   | 1.000000  | 0.000000     | 0.000000   | 0.000000         | 0.000000      | 0.000000   |
| 25%   | 13.000000 | 0.358491     | 0.328467   | 0.382219         | 0.384000      | 0.341935   |
| 50%   | 25.500000 | 0.525157     | 0.459854   | 0.528875         | 0.525333      | 0.503226   |
| 75%   | 38.000000 | 0.682390     | 0.569343   | 0.684650         | 0.664000      | 0.677419   |
| max   | 50.000000 | 1.000000     | 1.000000   | 1.000000         | 1.000000      | 1.000000   |
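Min-max scaling maps each feature column to the [0, 1] range, which the `describe()` output above confirms (min 0, max 1 for every feature). A quick standalone check on synthetic data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# synthetic feature matrix with an arbitrary location and scale
rng = np.random.default_rng(3)
X = rng.normal(loc=10, scale=4, size=(200, 5))

X_norm = MinMaxScaler().fit_transform(X)

# after scaling, every column spans exactly [0, 1]
print(X_norm.min(axis=0), X_norm.max(axis=0))
```

This matters for MDS because the method works on distances: without scaling, features measured in larger units would dominate the pairwise distances.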

2-Dimensional Plot

no_components = 2
n_init = 15
metric = True
n_stress = 'auto'

mds = MDS(
    n_components=no_components,
    n_init=n_init,
    metric=metric,
    normalized_stress=n_stress)

data_mds = mds.fit_transform(data_norm[data_columns])
print('Stress: ', mds.stress_)
# Stress: 3.886582480465905
# the more components you add, the smaller the
# stress becomes - meaning the embedding reproduces
# the original pairwise distances more faithfully

data_norm[['MDS1', 'MDS2']] = data_mds
data_norm.head(1)
|   | Species | Sex  | Index | Frontal Lobe | Rear Width | Carapace Midline | Maximum Width | Body Depth | Class    | MDS1      | MDS2      |
|---|---------|------|-------|--------------|------------|------------------|---------------|------------|----------|-----------|-----------|
| 0 | Blue    | Male | 1     | 0.056604     | 0.014599   | 0.042553         | 0.050667      | 0.058065   | BlueMale | -0.482199 | -0.917839 |
fig = plt.figure(figsize=(10, 8))
sns.scatterplot(x='MDS1', y='MDS2', hue='Class', data=data_norm)

*Multidimensional Scaling (MDS): 2D embedding coloured by class*

3-Dimensional Plot

no_components = 3
n_init = 15
metric = True
n_stress = 'auto'

mds = MDS(
    n_components=no_components,
    n_init=n_init,
    metric=metric,
    normalized_stress=n_stress)

data_mds = mds.fit_transform(data_norm[data_columns])
print('Stress: ', mds.stress_)
# Stress: 2.4601741009431457

data_norm[['MDS1', 'MDS2', 'MDS3']] = data_mds
data_norm.head(1)
|   | Species | Sex  | Index | Frontal Lobe | Rear Width | Carapace Midline | Maximum Width | Body Depth | Class    | MDS1      | MDS2    | MDS3     |
|---|---------|------|-------|--------------|------------|------------------|---------------|------------|----------|-----------|---------|----------|
| 0 | Blue    | Male | 1     | 0.056604     | 0.014599   | 0.042553         | 0.050667      | 0.058065   | BlueMale | -0.093961 | 0.80491 | 0.645809 |
class_colours = {
    'BlueMale': '#0027c4',      # blue
    'BlueFemale': '#f18b0a',    # orange
    'OrangeMale': '#0af10a',    # green
    'OrangeFemale': '#ff1500',  # red
}

colours = data_norm['Class'].apply(lambda x: class_colours[x])

x = data_norm.MDS1
y = data_norm.MDS2
z = data_norm.MDS3

fig = plt.figure(figsize=(10,10))
plt.title('MDS Data Analysis')
ax = fig.add_subplot(projection='3d')

ax.scatter(xs=x, ys=y, zs=z, s=50, c=colours)

*Multidimensional Scaling (MDS): 3D embedding coloured by class*