Machine Learning-Based Analysis of Gene Expression Profiles in Breast Cancer
Original Educational Template (Provided by course instructors)
Mohamed Hussein (Code restructuring, enhanced clarity, detailed annotations, and GitHub publication)
Date: 2025-09-05
Notebook 1 – Data Exploration & Cleaning
This notebook explores and cleans the gene expression dataset used for the breast cancer classification task.
The goal is to verify the data integrity, handle missing or duplicated entries, check feature distribution, and export a clean dataset for downstream preprocessing.
Workflow Overview
This notebook performs the following steps:
- Import required libraries
- Define input/output filenames
- Load raw gene expression data
- Transpose matrix orientation
- Add class labels (Normal / Cancer)
- Inspect data structure and statistics
- Check missing values and duplicates
- Visualize class distribution
- Separate features and labels
- Encode class labels numerically
- Compute IQR for variability
- Save cleaned dataset
- Display summary of the cleaning stage
- Generate additional visualizations (Histogram, Heatmap, PCA, Boxplot)
- Export the notebook to HTML
1.1.0 Import Required Libraries
In [5]:
import pandas as pd                                # Used for handling tabular data (dataframes and CSVs)
import numpy as np                                 # Used for numerical operations and array manipulation
import matplotlib.pyplot as plt                    # Used for creating visualizations
import seaborn as sns                              # Used for statistical data visualization
from sklearn.preprocessing import LabelEncoder     # Used for converting categorical labels into numeric codes
plt.rcParams['figure.figsize'] = (8, 5)            # Set default figure size
sns.set(style='whitegrid')                         # Set default Seaborn style for plots
1.1.1 Define File Parameters
In [7]:
file_name = "GSE10810_Expression_Matrix_cleaned.csv"   # Input file name containing raw gene expression data
OUTPUT_CLEANED = "data_cleaned_with_labels.csv"        # Output file name for cleaned data
1.1.2 Load Raw Data
In [9]:
print("\nLoading data from:", file_name)              # Display message to show which file is being loaded
df = pd.read_csv(file_name, index_col=0)              # Read CSV file, use first column as index (gene names)
print("\nData before transpose:")                     # Display structure before transposition
display(df.head())                                    # Show first few rows of the dataframe
Loading data from: GSE10810_Expression_Matrix_cleaned.csv
Data before transpose:
| | control_sample_1 | control_sample_2 | control_sample_3 | control_sample_4 | control_sample_5 | control_sample_6 | control_sample_7 | control_sample_8 | control_sample_9 | control_sample_10 | ... | tumor_sample_22 | tumor_sample_23 | tumor_sample_24 | tumor_sample_25 | tumor_sample_26 | tumor_sample_27 | tumor_sample_28 | tumor_sample_29 | tumor_sample_30 | tumor_sample_31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DDR1 | 8.938406 | 7.690918 | 7.896712 | 8.264164 | 7.861269 | 8.501498 | 8.488596 | 9.798451 | 8.338741 | 7.461595 | ... | 11.299196 | 8.736040 | 7.974397 | 9.844574 | 10.223405 | 9.020612 | 10.000855 | 8.735304 | 9.830538 | 8.810173 | 
| RFC2 | 6.851183 | 6.621644 | 6.662755 | 6.857314 | 6.774225 | 6.479220 | 6.884122 | 7.126003 | 6.999524 | 6.522597 | ... | 7.775935 | 7.362577 | 6.923662 | 7.102591 | 7.700892 | 6.958956 | 8.232016 | 7.728711 | 7.494395 | 7.838108 | 
| HSPA6 | 6.976834 | 7.691931 | 7.365933 | 6.635037 | 6.699210 | 8.174773 | 6.711386 | 6.449804 | 7.037992 | 6.983417 | ... | 6.976639 | 6.706442 | 6.094835 | 5.796353 | 8.188928 | 5.884043 | 6.719410 | 7.068801 | 6.061562 | 7.824966 | 
| PAX8 | 4.869134 | 5.038698 | 5.152850 | 5.118771 | 5.086458 | 5.268379 | 5.020341 | 4.822571 | 5.083671 | 5.468431 | ... | 5.440319 | 5.107261 | 5.252498 | 5.496686 | 5.141178 | 4.923892 | 5.137863 | 5.361833 | 4.778221 | 4.964566 | 
| GUCA1A | 4.326948 | 4.275521 | 4.294037 | 4.321169 | 4.356412 | 4.364336 | 4.323865 | 4.180331 | 4.313354 | 4.716860 | ... | 4.270065 | 4.709285 | 4.734315 | 4.975240 | 4.291261 | 4.204504 | 4.332470 | 4.933903 | 4.274410 | 4.511355 | 
5 rows × 58 columns
1.1.3 Transpose Matrix Orientation
In [11]:
print('\nTransposing data so rows = samples and columns = genes...')  
df = df.T                                             # Transpose matrix (samples become rows, genes columns)
print('\nData after transpose:')                      # Display confirmation
display(df.head())                                    # Show first few rows after transpose
Transposing data so rows = samples and columns = genes...
Data after transpose:
| | DDR1 | RFC2 | HSPA6 | PAX8 | GUCA1A | UBA7 | THRA | PTPN21 | CCL5 | CYP2E1 | ... | NQO2-AS1 | ITGB1-DT | TNFRSF10A-DT | LOC400499 | GALR3 | NUS1P3 | ZNF710-AS1 | SAP25 | TMEM231 | LOC100505915 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| control_sample_1 | 8.938406 | 6.851183 | 6.976834 | 4.869134 | 4.326948 | 7.538037 | 6.159400 | 5.968835 | 6.700752 | 4.083799 | ... | 3.632695 | 4.567796 | 5.174148 | 6.796700 | 5.290225 | 4.715049 | 6.244124 | 6.262145 | 5.343609 | 5.832579 | 
| control_sample_2 | 7.690918 | 6.621644 | 7.691931 | 5.038698 | 4.275521 | 7.676686 | 5.985778 | 6.069999 | 7.959567 | 4.135048 | ... | 3.617865 | 4.424826 | 5.148048 | 6.775692 | 5.737877 | 4.467093 | 5.580614 | 6.404403 | 4.725281 | 5.874446 | 
| control_sample_3 | 7.896712 | 6.662755 | 7.365933 | 5.152850 | 4.294037 | 7.459939 | 6.040109 | 6.029395 | 7.324931 | 4.105643 | ... | 3.918734 | 4.629768 | 4.978678 | 6.815992 | 5.809768 | 4.412147 | 5.929512 | 6.583424 | 5.258202 | 5.982806 | 
| control_sample_4 | 8.264164 | 6.857314 | 6.635037 | 5.118771 | 4.321169 | 7.374757 | 6.168062 | 6.041518 | 6.564204 | 3.936414 | ... | 4.100015 | 4.480501 | 4.881933 | 6.756850 | 5.557724 | 4.474714 | 6.226731 | 6.176037 | 5.393638 | 5.657388 | 
| control_sample_5 | 7.861269 | 6.774225 | 6.699210 | 5.086458 | 4.356412 | 7.424526 | 6.191636 | 6.163735 | 6.840854 | 4.045273 | ... | 3.806217 | 4.453589 | 5.106010 | 6.826434 | 5.547210 | 4.205264 | 5.512682 | 6.142524 | 4.957848 | 5.847746 | 
5 rows × 20825 columns
1.1.4 Add Class Labels
In [13]:
labels = ['Normal'] * 27 + ['Cancer'] * 31            # Define class labels: 27 Normal, 31 Cancer
if df.shape[0] != len(labels):
    raise ValueError(f"Number of samples (rows) = {df.shape[0]} doesn't match the {len(labels)} labels defined above. Update labels accordingly.")  # Ensure alignment
df['Label'] = labels                                  # Add Label column to dataframe
print('\nData with Label column added (first 10 rows):')  
display(df[['Label']].head(10))                       # Show first 10 labels
print('\n...and last 10 rows:')
display(df[['Label']].tail(10))                       # Show last 10 labels for verification
Data with Label column added (first 10 rows):
| | Label |
|---|---|
| control_sample_1 | Normal | 
| control_sample_2 | Normal | 
| control_sample_3 | Normal | 
| control_sample_4 | Normal | 
| control_sample_5 | Normal | 
| control_sample_6 | Normal | 
| control_sample_7 | Normal | 
| control_sample_8 | Normal | 
| control_sample_9 | Normal | 
| control_sample_10 | Normal | 
...and last 10 rows:
| | Label |
|---|---|
| tumor_sample_22 | Cancer | 
| tumor_sample_23 | Cancer | 
| tumor_sample_24 | Cancer | 
| tumor_sample_25 | Cancer | 
| tumor_sample_26 | Cancer | 
| tumor_sample_27 | Cancer | 
| tumor_sample_28 | Cancer | 
| tumor_sample_29 | Cancer | 
| tumor_sample_30 | Cancer | 
| tumor_sample_31 | Cancer | 
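The cell above assigns labels purely by row position (the first 27 rows are Normal, the next 31 are Cancer), so it silently depends on the sample order in the CSV. As a hedged alternative, the sketch below derives each label from the sample name instead. It assumes every sample name starts with `control_` or `tumor_`, as in the tables shown here; the helper `labels_from_sample_names` and the toy dataframe are hypothetical and not part of the original notebook.

```python
import pandas as pd


def labels_from_sample_names(sample_names):
    """Map each sample name to a class label based on its prefix.

    Hypothetical helper: assumes 'control_' means Normal and 'tumor_' means Cancer.
    """
    labels = []
    for name in sample_names:
        if name.startswith("control_"):
            labels.append("Normal")
        elif name.startswith("tumor_"):
            labels.append("Cancer")
        else:
            # Fail loudly rather than guess a class for an unknown name
            raise ValueError(f"Unrecognized sample name: {name!r}")
    return labels


# Toy frame standing in for the transposed expression matrix (rows = samples)
toy = pd.DataFrame(
    {"DDR1": [8.9, 7.7, 11.3], "RFC2": [6.9, 6.6, 7.8]},
    index=["control_sample_1", "control_sample_2", "tumor_sample_22"],
)
toy["Label"] = labels_from_sample_names(toy.index)
print(toy["Label"].value_counts())
```

With this approach a reordered or subsetted input file still gets correct labels, and an unexpected sample name raises an error instead of being mislabeled.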
1.1.5 Inspect Data Integrity
In [15]:
print('\nShape (samples, features):', df.shape)       # Display data dimensions
print('\nData types:')
print(df.dtypes[:10])                                # Show data types of first 10 columns
print('\nData info:')
df.info()                                            # Summary of dataframe including non-null counts
print('\nDescriptive statistics (first few columns):')
display(df.describe().iloc[:, :5])                   # Display descriptive stats for first 5 genes
Shape (samples, features): (58, 20826)
Data types:
DDR1      float64
RFC2      float64
HSPA6     float64
PAX8      float64
GUCA1A    float64
UBA7      float64
THRA      float64
PTPN21    float64
CCL5      float64
CYP2E1    float64
dtype: object
Data info:
<class 'pandas.core.frame.DataFrame'>
Index: 58 entries, control_sample_1 to tumor_sample_31
Columns: 20826 entries, DDR1 to Label
dtypes: float64(20825), object(1)
memory usage: 9.2+ MB
Descriptive statistics (first few columns):
| | DDR1 | RFC2 | HSPA6 | PAX8 | GUCA1A |
|---|---|---|---|---|---|
| count | 58.000000 | 58.000000 | 58.000000 | 58.000000 | 58.000000 | 
| mean | 9.151427 | 7.148796 | 6.989258 | 5.065083 | 4.414583 | 
| std | 1.066212 | 0.494729 | 0.722380 | 0.190153 | 0.197964 | 
| min | 6.299351 | 6.479220 | 5.373444 | 4.696112 | 4.112385 | 
| 25% | 8.499650 | 6.773367 | 6.634868 | 4.922845 | 4.275924 | 
| 50% | 9.146583 | 7.023783 | 6.936467 | 5.054100 | 4.355980 | 
| 75% | 9.864392 | 7.447876 | 7.316269 | 5.184241 | 4.506677 | 
| max | 11.711080 | 8.839413 | 9.382892 | 5.496686 | 4.975240 | 
1.1.6 Check Missing Values & Duplicates
In [17]:
print('\nMissing values per column (showing top 10 highest):')
print(df.isnull().sum().sort_values(ascending=False).head(10))   # Count missing values per column
print('\nTotal missing values in dataframe:', df.isnull().sum().sum())  # Count total missing values
print('\nDuplicate sample rows count:', df.duplicated().sum())   # Count duplicate rows
Missing values per column (showing top 10 highest):
DDR1      0
CENPM     0
RNF111    0
COQ6      0
DVL2      0
RRP1      0
UPF3B     0
DHRS11    0
KIF20A    0
XKR8      0
dtype: int64
Total missing values in dataframe: 0
Duplicate sample rows count: 0
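This dataset has no missing values or duplicated rows, so the notebook only reports the counts. For a dataset where the checks do find problems, one possible way to handle them is sketched below on a toy dataframe: drop duplicated sample rows, then fill remaining gaps with the per-gene median. The median strategy and the toy column names are assumptions made for illustration; the original notebook does not prescribe an imputation method.

```python
import numpy as np
import pandas as pd

# Toy data: sample s4 duplicates s3, and s2 has a missing value for GENE_A
toy = pd.DataFrame(
    {"GENE_A": [1.0, np.nan, 3.0, 3.0], "GENE_B": [2.0, 4.0, 6.0, 6.0]},
    index=["s1", "s2", "s3", "s4"],
)

# Drop duplicated sample rows, keeping the first occurrence
deduped = toy.drop_duplicates(keep="first")

# Fill each remaining gap with the median of that gene (column)
cleaned = deduped.fillna(deduped.median(numeric_only=True))

print("Rows before:", len(toy), "| rows after:", len(cleaned))
print("Remaining missing values:", int(cleaned.isnull().sum().sum()))
```

Deduplicating before imputing keeps repeated samples from skewing the medians used to fill the gaps.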
1.1.7 Class Distribution Visualization
In [19]:
print('\nClass distribution:')
print(df['Label'].value_counts())                    # Count samples per class
plt.figure()
df['Label'].value_counts().plot(kind='pie', autopct='%1.1f%%', startangle=90)  # Create pie chart
plt.title('Class distribution')                      # Add title
plt.ylabel('')                                       # Remove Y-axis label
plt.tight_layout()                                   # Adjust layout
plt.savefig('class_distribution_pie.png')            # Save figure
plt.show()                                           # Display plot
Class distribution:
Label
Cancer    31
Normal    27
Name: count, dtype: int64
1.1.8 Separate Features and Labels
In [21]:
X = df.iloc[:, 0:-1]                                 # Select all columns except Label as features
y = df.iloc[:, -1]                                   # Select Label column as target
print('\nFeatures shape:', X.shape)                  # Print features shape
print('Labels shape:', y.shape)                      # Print labels shape
Features shape: (58, 20825)
Labels shape: (58,)
1.1.9 Encode Labels
In [23]:
label_encoder = LabelEncoder()                       # Initialize label encoder
label_encoder.fit(y)                                 # Fit encoder on labels
y_encoded = label_encoder.transform(y)               # Transform labels to numeric codes
labels_unique = label_encoder.classes_               # Retrieve unique label names
classes = np.unique(y_encoded)                       # Retrieve numeric label values
print('\nLabel classes:', labels_unique)             # Show label names
print('Encoded classes unique values:', classes)     # Show encoded numeric values
Label classes: ['Cancer' 'Normal']
Encoded classes unique values: [0 1]
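`LabelEncoder` sorts class names alphabetically, which is why the output above maps 'Cancer' to 0 and 'Normal' to 1. If a downstream metric should treat Cancer as the positive class (1), an explicit mapping is one option. The sketch below is illustrative only: the dictionary `class_to_code` and the toy series are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy label series standing in for the notebook's y
y_toy = pd.Series(["Normal", "Cancer", "Cancer", "Normal"], name="Label")

# Hypothetical explicit mapping: Cancer is the positive class
class_to_code = {"Normal": 0, "Cancer": 1}

y_mapped = y_toy.map(class_to_code)

# map() yields NaN for any label missing from the dictionary, so check first
if y_mapped.isnull().any():
    raise ValueError("Found labels that are not in class_to_code")

y_mapped = y_mapped.astype(int)
print(y_mapped.tolist())
```

Whichever encoding is used, it has to stay consistent across all later notebooks so that predictions are decoded back to the right class names.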
1.2.0 Compute IQR (Preview)
In [25]:
# Note: the IQR is computed here for inspection only; genes are not dropped in this notebook (the actual filtering is done in the preprocessing notebook)
iqr = X.apply(lambda x: np.percentile(x, 75) - np.percentile(x, 25), axis=0)  # Compute IQR for each gene
print('\nIQR summary (first 10 genes):')
print(iqr.head(10))                                   # Display first 10 IQR values
IQR summary (first 10 genes):
DDR1      1.364742
RFC2      0.674510
HSPA6     0.681401
PAX8      0.261396
GUCA1A    0.230753
UBA7      0.438170
THRA      0.445374
PTPN21    0.864435
CCL5      1.640180
CYP2E1    0.304721
dtype: float64
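To give a rough idea of how these IQR values could be used later, the sketch below computes the per-gene IQR on toy data with `DataFrame.quantile` (whose default linear interpolation matches the `np.percentile` default) and previews which genes would pass a variability cutoff. The cutoff of 0.1 and the toy gene names are purely hypothetical; the threshold actually used in the preprocessing notebook is not stated here.

```python
import numpy as np
import pandas as pd

# Toy expression matrix: 20 samples x 5 variable genes, plus one constant gene
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(
    rng.normal(size=(20, 5)),
    columns=[f"gene_{i}" for i in range(5)],
)
X_toy["gene_flat"] = 1.0  # No variability across samples

# Per-gene IQR: 75th percentile minus 25th percentile of each column
iqr_toy = X_toy.quantile(0.75) - X_toy.quantile(0.25)

# Hypothetical cutoff, chosen only for this illustration
threshold = 0.1
genes_to_keep = iqr_toy[iqr_toy > threshold].index

print(f"{len(genes_to_keep)} of {X_toy.shape[1]} genes exceed IQR > {threshold}")
```

A gene with near-zero IQR barely changes between samples, so it is unlikely to help separate the Normal and Cancer classes.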
1.3.0 Save Cleaned Dataset
In [27]:
print(f"\nSaving cleaned dataframe to {OUTPUT_CLEANED} ...")  # Display save message
df.to_csv(OUTPUT_CLEANED)                              # Save dataframe to CSV
print('Saved.')                                        # Confirmation message
Saving cleaned dataframe to data_cleaned_with_labels.csv ...
Saved.
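Because the next notebook reads this CSV back in, it can be worth confirming that the file round-trips cleanly. The sketch below shows one way to check this on a toy dataframe written to a temporary directory; the file name `toy_cleaned.csv` is a placeholder, not a file produced by the original notebook.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame with the same layout as the cleaned data: genes plus a Label column
toy = pd.DataFrame(
    {"DDR1": [8.9, 11.3], "Label": ["Normal", "Cancer"]},
    index=["control_sample_1", "tumor_sample_22"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_cleaned.csv")  # Placeholder file name
    toy.to_csv(path)                                  # Index (sample names) is written too
    reloaded = pd.read_csv(path, index_col=0)         # Restore sample names as the index

# Compare structure, labels, and numeric values after the round trip
same_shape = reloaded.shape == toy.shape
same_index = list(reloaded.index) == list(toy.index)
same_labels = reloaded["Label"].tolist() == toy["Label"].tolist()
same_values = np.allclose(reloaded["DDR1"], toy["DDR1"])

print("Round trip OK:", same_shape and same_index and same_labels and same_values)
```

Reading with `index_col=0` matters here: without it, the sample names come back as an ordinary column and the shape no longer matches.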
1.4.0 Summary of Cleaning Stage
In [29]:
print('\n--- Summary ---')
print('Final dataframe shape:', df.shape)             # Show final shape
print('Number of genes (features):', X.shape[1])      # Show number of genes
print('Number of samples:', df.shape[0])              # Show number of samples
print('Class distribution:')
print(df['Label'].value_counts())                     # Show final class counts
print('\nNotebook 1 finished. Output files:')
print('-', OUTPUT_CLEANED)                            # Show saved data file
print('- class_distribution_pie.png')                 # Show saved figure
--- Summary ---
Final dataframe shape: (58, 20826)
Number of genes (features): 20825
Number of samples: 58
Class distribution:
Label
Cancer    31
Normal    27
Name: count, dtype: int64
Notebook 1 finished. Output files:
- data_cleaned_with_labels.csv
- class_distribution_pie.png
1.5.0 Additional Visualizations
1.5.1 Histogram of Expression Values
In [32]:
from sklearn.decomposition import PCA              # Used for the 2D PCA projection in section 1.5.3 (seaborn and matplotlib are already imported in 1.1.0)
plt.figure(figsize=(8, 5))
sns.histplot(df.iloc[:, :-1].values.flatten(), bins=50, kde=True, color='steelblue')
plt.title("Histogram of Gene Expression Values")
plt.xlabel("Expression Level")
plt.ylabel("Frequency")
plt.tight_layout()
plt.savefig("histogram_distribution.png")
plt.show()
1.5.2 Correlation Heatmap Between Samples
In [34]:
corr = df.iloc[:, :-1].T.corr()                      # Compute correlation between samples
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap='viridis')
plt.title("Correlation Heatmap Between Samples")
plt.tight_layout()
plt.savefig("correlation_heatmap.png")
plt.show()
1.5.3 PCA 2D Visualization
In [36]:
X = df.iloc[:, :-1].values
y = df['Label'].values
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
plt.figure(figsize=(7, 5))
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=y, palette={'Normal': 'blue', 'Cancer': 'red'})
plt.title("PCA Plot (2 Components)")
plt.tight_layout()
plt.savefig("pca_plot.png")
plt.show()
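A 2D PCA scatter is easier to interpret when the axes state how much variance each component captures. The sketch below runs on synthetic data (not the GSE10810 matrix) and reads `explained_variance_ratio_` from a fitted `PCA` object to build axis labels; the group shift applied to the toy data is an assumption made only so the two groups separate.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 30 samples x 50 features, first 15 samples shifted on 5 features
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(30, 50))
X_toy[:15, :5] += 3.0

pca_toy = PCA(n_components=2)
X_toy_pca = pca_toy.fit_transform(X_toy)

# Fraction of total variance captured by each retained component
ratios = pca_toy.explained_variance_ratio_

# Axis labels that could be passed to plt.xlabel / plt.ylabel
pc1_label = f"PC1 ({ratios[0]:.1%} of variance)"
pc2_label = f"PC2 ({ratios[1]:.1%} of variance)"
print(pc1_label)
print(pc2_label)
```

If the first two components explain only a small share of the variance, apparent separation (or overlap) between classes in the plot should be read with caution.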
1.5.4 Boxplot of Average Expression per Sample
In [38]:
plt.figure(figsize=(8, 5))
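sns.boxplot(x=df['Label'], y=df.iloc[:, :-1].mean(axis=1))  # Mean expression of each sample, grouped by class
plt.ylabel("Mean expression per sample")             # Label the Y-axis explicitly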
plt.title("Boxplot of Average Expression per Sample by Class")
plt.tight_layout()
plt.savefig("boxplot_comparison.png")
plt.show()
Next Step: Notebook 2 will focus on Preprocessing & Normalization.
Export Notebook 1 to HTML
In [41]:
!jupyter nbconvert --to html --embed-images "Notebook_1_Data_Exploration_and_Cleaning.ipynb" --output "Notebook_1_Data_Exploration_and_Cleaning.html"
[NbConvertApp] Converting notebook Notebook_1_Data_Exploration_and_Cleaning.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 1 image(s).
[NbConvertApp] Writing 366351 bytes to Notebook_1_Data_Exploration_and_Cleaning.html