import pandas as pd                                # Used for handling tabular data (dataframes and CSVs)
import numpy as np                                 # Used for numerical operations and array manipulation
import matplotlib.pyplot as plt                    # Used for creating visualizations
import seaborn as sns                              # Used for statistical data visualization
from sklearn.preprocessing import LabelEncoder     # Used for converting categorical labels into numeric codes

plt.rcParams['figure.figsize'] = (8, 5)            # Set default figure size
sns.set(style='whitegrid')                         # Set default Seaborn style for plots

file_name = "GSE10810_Expression_Matrix_cleaned.csv"   # Input file name containing raw gene expression data
OUTPUT_CLEANED = "data_cleaned_with_labels.csv"        # Output file name for cleaned data

print("\nLoading data from:", file_name)              # Display message to show which file is being loaded
df = pd.read_csv(file_name, index_col=0)              # Read CSV file, use first column as index (gene names)
print("\nData before transpose:")                     # Display structure before transposition
display(df.head())                                    # Show first few rows of the dataframe

Loading data from: GSE10810_Expression_Matrix_cleaned.csv

Data before transpose:

print('\nTransposing data so rows = samples and columns = genes...')  
df = df.T                                             # Transpose matrix (samples become rows, genes columns)
print('\nData after transpose:')                      # Display confirmation
display(df.head())                                    # Show first few rows after transpose

Transposing data so rows = samples and columns = genes...

Data after transpose:

labels = ['Normal'] * 27 + ['Cancer'] * 31            # Define class labels: 27 Normal, 31 Cancer
if df.shape[0] != len(labels):
    raise ValueError(f"Number of samples (rows) = {df.shape[0]} doesn't match the expected {len(labels)} for label assignment. Update labels accordingly.")  # Ensure alignment

df['Label'] = labels                                  # Add Label column to dataframe
print('\nData with Label column added (first 10 rows):')  
display(df[['Label']].head(10))                       # Show first 10 labels
print('\n...and last 10 rows:')
display(df[['Label']].tail(10))                       # Show last 10 labels for verification

Data with Label column added (first 10 rows):

...and last 10 rows:
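The positional assignment above relies on the 27 control samples preceding the 31 tumor samples in the file. As a hedged alternative, the labels could be derived from the sample names themselves. The sketch below assumes the index follows the `control_sample_N` / `tumor_sample_N` pattern visible in the notebook output; the toy frame, `PREFIX_TO_LABEL`, and `labels_from_index` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Toy frame standing in for the transposed matrix (values are illustrative only).
toy = pd.DataFrame(
    {"DDR1": [8.9, 7.7, 11.3, 8.7], "RFC2": [6.9, 6.6, 7.8, 7.4]},
    index=["control_sample_1", "control_sample_2", "tumor_sample_1", "tumor_sample_2"],
)

# Hypothetical prefix-to-class mapping; an assumption, not part of the original notebook.
PREFIX_TO_LABEL = {"control": "Normal", "tumor": "Cancer"}


def labels_from_index(index):
    """Map each sample name to a class label using the text before the first underscore."""
    prefixes = pd.Series(index, index=index).str.split("_").str[0]
    labels = prefixes.map(PREFIX_TO_LABEL)
    if labels.isna().any():
        unknown = sorted(prefixes[labels.isna()].unique())
        raise ValueError(f"Unrecognized sample prefixes: {unknown}")
    return labels


toy_labels = labels_from_index(toy.index)
print(toy_labels.value_counts())
```

Because the label is read from each row's own name, a reordered input file would still be labeled correctly, and an unexpected sample name raises an error instead of being silently mislabeled.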

print('\nShape (samples, features):', df.shape)       # Display data dimensions
print('\nData types:')
print(df.dtypes[:10])                                # Show data types of first 10 columns
print('\nData info:')
df.info()                                            # Summary of dataframe including non-null counts
print('\nDescriptive statistics (first few columns):')
display(df.describe().iloc[:, :5])                   # Display descriptive stats for first 5 genes

Shape (samples, features): (58, 20826)

Data types:
DDR1      float64
RFC2      float64
HSPA6     float64
PAX8      float64
GUCA1A    float64
UBA7      float64
THRA      float64
PTPN21    float64
CCL5      float64
CYP2E1    float64
dtype: object

Data info:
<class 'pandas.core.frame.DataFrame'>
Index: 58 entries, control_sample_1 to tumor_sample_31
Columns: 20826 entries, DDR1 to Label
dtypes: float64(20825), object(1)
memory usage: 9.2+ MB

Descriptive statistics (first few columns):

print('\nMissing values per column (showing top 10 highest):')
print(df.isnull().sum().sort_values(ascending=False).head(10))   # Count missing values per column
print('\nTotal missing values in dataframe:', df.isnull().sum().sum())  # Count total missing values
print('\nDuplicate sample rows count:', df.duplicated().sum())   # Count duplicate rows

Missing values per column (showing top 10 highest):
DDR1      0
CENPM     0
RNF111    0
COQ6      0
DVL2      0
RRP1      0
UPF3B     0
DHRS11    0
KIF20A    0
XKR8      0
dtype: int64

Total missing values in dataframe: 0

Duplicate sample rows count: 0

print('\nClass distribution:')
print(df['Label'].value_counts())                    # Count samples per class

plt.figure()
df['Label'].value_counts().plot(kind='pie', autopct='%1.1f%%', startangle=90)  # Create pie chart
plt.title('Class distribution')                      # Add title
plt.ylabel('')                                       # Remove Y-axis label
plt.tight_layout()                                   # Adjust layout
plt.savefig('class_distribution_pie.png')            # Save figure
plt.show()                                           # Display plot

Class distribution:
Label
Cancer    31
Normal    27
Name: count, dtype: int64

X = df.iloc[:, 0:-1]                                 # Select all columns except Label as features
y = df.iloc[:, -1]                                   # Select Label column as target
print('\nFeatures shape:', X.shape)                  # Print features shape
print('Labels shape:', y.shape)                      # Print labels shape

Features shape: (58, 20825)
Labels shape: (58,)

label_encoder = LabelEncoder()                       # Initialize label encoder
label_encoder.fit(y)                                 # Fit encoder on labels
y_encoded = label_encoder.transform(y)               # Transform labels to numeric codes
labels_unique = label_encoder.classes_               # Retrieve unique label names
classes = np.unique(y_encoded)                       # Retrieve numeric label values

print('\nLabel classes:', labels_unique)             # Show label names
print('Encoded classes unique values:', classes)     # Show encoded numeric values

Label classes: ['Cancer' 'Normal']
Encoded classes unique values: [0 1]
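`LabelEncoder` sorts class names before assigning codes, which is why `Cancer` maps to 0 and `Normal` to 1 in the output above. If a downstream step expects the cancer class to be the positive class (1), an explicit mapping is one option. The sketch below uses a toy label series; `CLASS_TO_CODE` is a hypothetical name and the chosen coding is an assumption, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

y_toy = pd.Series(["Normal", "Cancer", "Cancer", "Normal"])

# np.unique returns the sorted class names, the same ordering LabelEncoder uses,
# so Cancer -> 0 and Normal -> 1.
classes_sorted, codes_sorted = np.unique(y_toy.to_numpy(), return_inverse=True)
print("Sorted classes:", list(classes_sorted))
print("Sorted codes:  ", codes_sorted.tolist())

# Hypothetical explicit mapping that makes Cancer the positive class.
CLASS_TO_CODE = {"Normal": 0, "Cancer": 1}
y_explicit = y_toy.map(CLASS_TO_CODE)
print("Explicit codes:", y_explicit.tolist())
```

Whichever coding is used, keeping the mapping next to the encoded vector avoids misreading metrics such as precision and recall later on.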

# Note: the IQR is computed here for inspection only; no genes are dropped. Actual filtering is done in the preprocessing notebook.
iqr = X.apply(lambda x: np.percentile(x, 75) - np.percentile(x, 25), axis=0)  # Compute IQR for each gene
print('\nIQR summary (first 10 genes):')
print(iqr.head(10))                                   # Display first 10 IQR values

IQR summary (first 10 genes):
DDR1      1.364742
RFC2      0.674510
HSPA6     0.681401
PAX8      0.261396
GUCA1A    0.230753
UBA7      0.438170
THRA      0.445374
PTPN21    0.864435
CCL5      1.640180
CYP2E1    0.304721
dtype: float64
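The per-gene IQR can also be computed without a Python-level lambda, which tends to matter with roughly 20,000 columns. The sketch below compares the `apply` form with a vectorized `DataFrame.quantile` form on toy data and shows what a later filtering step might look like. `IQR_THRESHOLD` and its value are hypothetical; the notebook does not state the cutoff used in the preprocessing stage.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(20, 4)), columns=["g1", "g2", "g3", "g4"])
X_toy["g4"] = 5.0  # a constant gene, so its IQR is exactly 0

# Column-by-column version, as in the cell above.
iqr_apply = X_toy.apply(lambda col: np.percentile(col, 75) - np.percentile(col, 25), axis=0)

# Vectorized version: both use linear interpolation by default.
iqr_vectorized = X_toy.quantile(0.75) - X_toy.quantile(0.25)

# Hypothetical threshold, for illustration only.
IQR_THRESHOLD = 0.1
keep = iqr_vectorized[iqr_vectorized > IQR_THRESHOLD].index
X_filtered = X_toy[keep]
print("Genes kept:", list(X_filtered.columns))
```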

print(f"\nSaving cleaned dataframe to {OUTPUT_CLEANED} ...")  # Display save message
df.to_csv(OUTPUT_CLEANED)                              # Save dataframe to CSV
print('Saved.')                                        # Confirmation message

Saving cleaned dataframe to data_cleaned_with_labels.csv ...
Saved.
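Since `to_csv` writes the sample names as the first column, a later notebook would need `index_col=0` to restore them as the index. The round trip below is a small sketch on a two-row toy frame, using an in-memory buffer in place of the real output file.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {"DDR1": [8.938406, 11.299196], "Label": ["Normal", "Cancer"]},
    index=["control_sample_1", "tumor_sample_22"],
)

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)  # first column becomes the sample index again

print(reloaded)
```

Without `index_col=0`, the sample names would come back as an extra unnamed column and the features would no longer be purely numeric.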

print('\n--- Summary ---')
print('Final dataframe shape:', df.shape)             # Show final shape
print('Number of genes (features):', X.shape[1])      # Show number of genes
print('Number of samples:', df.shape[0])              # Show number of samples
print('Class distribution:')
print(df['Label'].value_counts())                     # Show final class counts
print('\nNotebook 1 finished. Output files:')
print('-', OUTPUT_CLEANED)                            # Show saved data file
print('- class_distribution_pie.png')                 # Show saved figure

--- Summary ---
Final dataframe shape: (58, 20826)
Number of genes (features): 20825
Number of samples: 58
Class distribution:
Label
Cancer    31
Normal    27
Name: count, dtype: int64

Notebook 1 finished. Output files:
- data_cleaned_with_labels.csv
- class_distribution_pie.png

from sklearn.decomposition import PCA                # Used for principal component analysis (seaborn and matplotlib are already imported above)

plt.figure(figsize=(8, 5))
sns.histplot(df.iloc[:, :-1].values.flatten(), bins=50, kde=True, color='steelblue')
plt.title("Histogram of Gene Expression Values")
plt.xlabel("Expression Level")
plt.ylabel("Frequency")
plt.tight_layout()
plt.savefig("histogram_distribution.png")
plt.show()

corr = df.iloc[:, :-1].T.corr()                      # Compute correlation between samples
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap='viridis')
plt.title("Correlation Heatmap Between Samples")
plt.tight_layout()
plt.savefig("correlation_heatmap.png")
plt.show()
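`DataFrame.corr` correlates columns, which is why the frame is transposed before the call: after `.T`, each column is one sample. The sketch below checks on toy data that this matches `np.corrcoef`, which treats each row as a variable. The sample names `s1` to `s3` are placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# 3 samples (rows) x 50 genes (columns), mirroring the notebook's orientation.
toy = pd.DataFrame(rng.normal(size=(3, 50)), index=["s1", "s2", "s3"])

corr_pandas = toy.T.corr()                # Pearson correlation between samples
corr_numpy = np.corrcoef(toy.to_numpy())  # rows are treated as variables

print(corr_pandas.round(3))
```

Both results are a samples-by-samples matrix with ones on the diagonal; the pandas version keeps the sample names as row and column labels, which is convenient for the heatmap.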

X = df.iloc[:, :-1].values
y = df['Label'].values
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
plt.figure(figsize=(7, 5))
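sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=y, palette={'Normal': 'blue', 'Cancer': 'red'})
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%} of variance)")   # Label axes with explained variance
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%} of variance)")
plt.title("PCA Plot (2 Components)")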
plt.tight_layout()
plt.savefig("pca_plot.png")
plt.show()
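A two-component PCA plot is easier to interpret when the share of variance captured by each component is known. The sketch below, on synthetic data with two shifted groups, reads that share from the fitted model's `explained_variance_ratio_` attribute. The toy matrix and the size of the shift are assumptions made only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# 30 synthetic samples x 100 features; the second half is shifted to mimic two classes.
X_toy = rng.normal(size=(30, 100))
X_toy[15:] += 2.0

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_toy)  # PCA centers the data internally before projecting

ratios = pca.explained_variance_ratio_
xlabel = f"PC1 ({ratios[0]:.1%} variance)"
ylabel = f"PC2 ({ratios[1]:.1%} variance)"
print(xlabel, "|", ylabel)
```

If the two components together explain only a small fraction of the variance, visual separation (or overlap) in the scatter plot should be read with caution.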

plt.figure(figsize=(8, 5))
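mean_expr = df.iloc[:, :-1].mean(axis=1)              # Mean expression per sample across all genes
sns.boxplot(x=df['Label'], y=mean_expr)               # Compare per-sample means between classes
plt.ylabel("Mean expression per sample")              # Label the Y-axis explicitly
plt.title("Boxplot of Average Expression per Sample by Class")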
plt.tight_layout()
plt.savefig("boxplot_comparison.png")
plt.show()

!jupyter nbconvert --to html --embed-images "Notebook_1_Data_Exploration_and_Cleaning.ipynb" --output "Notebook_1_Data_Exploration_and_Cleaning.html"

[NbConvertApp] Converting notebook Notebook_1_Data_Exploration_and_Cleaning.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 1 image(s).
[NbConvertApp] Writing 366351 bytes to Notebook_1_Data_Exploration_and_Cleaning.html

	control_sample_1	control_sample_2	control_sample_3	control_sample_4	control_sample_5	control_sample_6	control_sample_7	control_sample_8	control_sample_9	control_sample_10	...	tumor_sample_22	tumor_sample_23	tumor_sample_24	tumor_sample_25	tumor_sample_26	tumor_sample_27	tumor_sample_28	tumor_sample_29	tumor_sample_30	tumor_sample_31
DDR1	8.938406	7.690918	7.896712	8.264164	7.861269	8.501498	8.488596	9.798451	8.338741	7.461595	...	11.299196	8.736040	7.974397	9.844574	10.223405	9.020612	10.000855	8.735304	9.830538	8.810173
RFC2	6.851183	6.621644	6.662755	6.857314	6.774225	6.479220	6.884122	7.126003	6.999524	6.522597	...	7.775935	7.362577	6.923662	7.102591	7.700892	6.958956	8.232016	7.728711	7.494395	7.838108
HSPA6	6.976834	7.691931	7.365933	6.635037	6.699210	8.174773	6.711386	6.449804	7.037992	6.983417	...	6.976639	6.706442	6.094835	5.796353	8.188928	5.884043	6.719410	7.068801	6.061562	7.824966
PAX8	4.869134	5.038698	5.152850	5.118771	5.086458	5.268379	5.020341	4.822571	5.083671	5.468431	...	5.440319	5.107261	5.252498	5.496686	5.141178	4.923892	5.137863	5.361833	4.778221	4.964566
GUCA1A	4.326948	4.275521	4.294037	4.321169	4.356412	4.364336	4.323865	4.180331	4.313354	4.716860	...	4.270065	4.709285	4.734315	4.975240	4.291261	4.204504	4.332470	4.933903	4.274410	4.511355

	DDR1	RFC2	HSPA6	PAX8	GUCA1A	UBA7	THRA	PTPN21	CCL5	CYP2E1	...	NQO2-AS1	ITGB1-DT	TNFRSF10A-DT	LOC400499	GALR3	NUS1P3	ZNF710-AS1	SAP25	TMEM231	LOC100505915
control_sample_1	8.938406	6.851183	6.976834	4.869134	4.326948	7.538037	6.159400	5.968835	6.700752	4.083799	...	3.632695	4.567796	5.174148	6.796700	5.290225	4.715049	6.244124	6.262145	5.343609	5.832579
control_sample_2	7.690918	6.621644	7.691931	5.038698	4.275521	7.676686	5.985778	6.069999	7.959567	4.135048	...	3.617865	4.424826	5.148048	6.775692	5.737877	4.467093	5.580614	6.404403	4.725281	5.874446
control_sample_3	7.896712	6.662755	7.365933	5.152850	4.294037	7.459939	6.040109	6.029395	7.324931	4.105643	...	3.918734	4.629768	4.978678	6.815992	5.809768	4.412147	5.929512	6.583424	5.258202	5.982806
control_sample_4	8.264164	6.857314	6.635037	5.118771	4.321169	7.374757	6.168062	6.041518	6.564204	3.936414	...	4.100015	4.480501	4.881933	6.756850	5.557724	4.474714	6.226731	6.176037	5.393638	5.657388
control_sample_5	7.861269	6.774225	6.699210	5.086458	4.356412	7.424526	6.191636	6.163735	6.840854	4.045273	...	3.806217	4.453589	5.106010	6.826434	5.547210	4.205264	5.512682	6.142524	4.957848	5.847746

	DDR1	RFC2	HSPA6	PAX8	GUCA1A
count	58.000000	58.000000	58.000000	58.000000	58.000000
mean	9.151427	7.148796	6.989258	5.065083	4.414583
std	1.066212	0.494729	0.722380	0.190153	0.197964
min	6.299351	6.479220	5.373444	4.696112	4.112385
25%	8.499650	6.773367	6.634868	4.922845	4.275924
50%	9.146583	7.023783	6.936467	5.054100	4.355980
75%	9.864392	7.447876	7.316269	5.184241	4.506677
max	11.711080	8.839413	9.382892	5.496686	4.975240

Machine Learning-Based Analysis of Gene Expression Profiles in Breast Cancer

Notebook 1 – Data Exploration & Cleaning

Workflow Overview

1.1.0 Import Required Libraries

1.1.1 Define File Parameters

1.1.2 Load Raw Data

1.1.3 Transpose Matrix Orientation

1.1.4 Add Class Labels

1.1.5 Inspect Data Integrity

1.1.6 Check Missing Values & Duplicates

1.1.7 Class Distribution Visualization

1.1.8 Separate Features and Labels

1.1.9 Encode Labels

1.2.0 Compute IQR (Preview)

1.3.0 Save Cleaned Dataset

1.4.0 Summary of Cleaning Stage

1.5.0 Additional Visualizations

1.5.1 Histogram of Expression Values

1.5.2 Correlation Heatmap Between Samples

1.5.3 PCA 2D Visualization

1.5.4 Boxplot of Average Expression per Sample

Export Notebook 1 to HTML

	Label
control_sample_1	Normal
control_sample_2	Normal
control_sample_3	Normal
control_sample_4	Normal
control_sample_5	Normal
control_sample_6	Normal
control_sample_7	Normal
control_sample_8	Normal
control_sample_9	Normal
control_sample_10	Normal

	Label
tumor_sample_22	Cancer
tumor_sample_23	Cancer
tumor_sample_24	Cancer
tumor_sample_25	Cancer
tumor_sample_26	Cancer
tumor_sample_27	Cancer
tumor_sample_28	Cancer
tumor_sample_29	Cancer
tumor_sample_30	Cancer
tumor_sample_31	Cancer