Machine Learning-Based Analysis of Gene Expression Profiles in Breast Cancer

Original Educational Template (provided by course instructors)
Mohamed Hussein (Code restructuring, enhanced clarity, detailed annotations, and GitHub publication)

Date: 2025-09-05

Notebook 1 – Data Exploration & Cleaning

This notebook explores and cleans the gene expression dataset used for the breast cancer classification task.
The goal is to verify data integrity, check for missing or duplicated entries, inspect feature distributions, and export a clean, labeled dataset for downstream preprocessing.

Workflow Overview

This notebook performs the following steps:

  1. Import required libraries
  2. Define input/output filenames
  3. Load raw gene expression data
  4. Transpose matrix orientation
  5. Add class labels (Normal / Cancer)
  6. Inspect data structure and statistics
  7. Check missing values and duplicates
  8. Visualize class distribution
  9. Separate features and labels
  10. Encode class labels numerically
  11. Compute IQR for variability
  12. Save cleaned dataset
  13. Display summary of the cleaning stage
  14. Generate additional visualizations (Histogram, Heatmap, PCA, Boxplot)
  15. Export the notebook to HTML

1.1.0 Import Required Libraries

In [5]:
import pandas as pd                                # Used for handling tabular data (dataframes and CSVs)
import numpy as np                                 # Used for numerical operations and array manipulation
import matplotlib.pyplot as plt                    # Used for creating visualizations
import seaborn as sns                              # Used for statistical data visualization
from sklearn.preprocessing import LabelEncoder     # Used for converting categorical labels into numeric codes

plt.rcParams['figure.figsize'] = (8, 5)            # Set default figure size
sns.set(style='whitegrid')                         # Set default Seaborn style for plots

1.1.1 Define File Parameters

In [7]:
file_name = "GSE10810_Expression_Matrix_cleaned.csv"   # Input file name containing raw gene expression data
OUTPUT_CLEANED = "data_cleaned_with_labels.csv"        # Output file name for cleaned data

1.1.2 Load Raw Data

In [9]:
print("\nLoading data from:", file_name)              # Display message to show which file is being loaded
df = pd.read_csv(file_name, index_col=0)              # Read CSV file, use first column as index (gene names)
print("\nData before transpose:")                     # Display structure before transposition
display(df.head())                                    # Show first few rows of the dataframe
Loading data from: GSE10810_Expression_Matrix_cleaned.csv

Data before transpose:
control_sample_1 control_sample_2 control_sample_3 control_sample_4 control_sample_5 control_sample_6 control_sample_7 control_sample_8 control_sample_9 control_sample_10 ... tumor_sample_22 tumor_sample_23 tumor_sample_24 tumor_sample_25 tumor_sample_26 tumor_sample_27 tumor_sample_28 tumor_sample_29 tumor_sample_30 tumor_sample_31
DDR1 8.938406 7.690918 7.896712 8.264164 7.861269 8.501498 8.488596 9.798451 8.338741 7.461595 ... 11.299196 8.736040 7.974397 9.844574 10.223405 9.020612 10.000855 8.735304 9.830538 8.810173
RFC2 6.851183 6.621644 6.662755 6.857314 6.774225 6.479220 6.884122 7.126003 6.999524 6.522597 ... 7.775935 7.362577 6.923662 7.102591 7.700892 6.958956 8.232016 7.728711 7.494395 7.838108
HSPA6 6.976834 7.691931 7.365933 6.635037 6.699210 8.174773 6.711386 6.449804 7.037992 6.983417 ... 6.976639 6.706442 6.094835 5.796353 8.188928 5.884043 6.719410 7.068801 6.061562 7.824966
PAX8 4.869134 5.038698 5.152850 5.118771 5.086458 5.268379 5.020341 4.822571 5.083671 5.468431 ... 5.440319 5.107261 5.252498 5.496686 5.141178 4.923892 5.137863 5.361833 4.778221 4.964566
GUCA1A 4.326948 4.275521 4.294037 4.321169 4.356412 4.364336 4.323865 4.180331 4.313354 4.716860 ... 4.270065 4.709285 4.734315 4.975240 4.291261 4.204504 4.332470 4.933903 4.274410 4.511355

5 rows × 58 columns

1.1.3 Transpose Matrix Orientation

In [11]:
print('\nTransposing data so rows = samples and columns = genes...')  
df = df.T                                             # Transpose matrix (samples become rows, genes columns)
print('\nData after transpose:')                      # Display confirmation
display(df.head())                                    # Show first few rows after transpose
Transposing data so rows = samples and columns = genes...

Data after transpose:
DDR1 RFC2 HSPA6 PAX8 GUCA1A UBA7 THRA PTPN21 CCL5 CYP2E1 ... NQO2-AS1 ITGB1-DT TNFRSF10A-DT LOC400499 GALR3 NUS1P3 ZNF710-AS1 SAP25 TMEM231 LOC100505915
control_sample_1 8.938406 6.851183 6.976834 4.869134 4.326948 7.538037 6.159400 5.968835 6.700752 4.083799 ... 3.632695 4.567796 5.174148 6.796700 5.290225 4.715049 6.244124 6.262145 5.343609 5.832579
control_sample_2 7.690918 6.621644 7.691931 5.038698 4.275521 7.676686 5.985778 6.069999 7.959567 4.135048 ... 3.617865 4.424826 5.148048 6.775692 5.737877 4.467093 5.580614 6.404403 4.725281 5.874446
control_sample_3 7.896712 6.662755 7.365933 5.152850 4.294037 7.459939 6.040109 6.029395 7.324931 4.105643 ... 3.918734 4.629768 4.978678 6.815992 5.809768 4.412147 5.929512 6.583424 5.258202 5.982806
control_sample_4 8.264164 6.857314 6.635037 5.118771 4.321169 7.374757 6.168062 6.041518 6.564204 3.936414 ... 4.100015 4.480501 4.881933 6.756850 5.557724 4.474714 6.226731 6.176037 5.393638 5.657388
control_sample_5 7.861269 6.774225 6.699210 5.086458 4.356412 7.424526 6.191636 6.163735 6.840854 4.045273 ... 3.806217 4.453589 5.106010 6.826434 5.547210 4.205264 5.512682 6.142524 4.957848 5.847746

5 rows × 20825 columns
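
A small orientation check can guard against transposing twice, or against loading a file that is already sample-by-gene. The sketch below is illustrative only: it assumes sample IDs contain the substring `sample` (true for the names shown above) and that gene symbols do not. The helper `ensure_samples_as_rows` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def ensure_samples_as_rows(frame: pd.DataFrame, marker: str = "sample") -> pd.DataFrame:
    """Return the frame with samples as rows, transposing only when needed.

    Assumption: every sample ID contains `marker` and gene symbols do not.
    """
    rows_are_samples = frame.index.astype(str).str.contains(marker).all()
    cols_are_samples = frame.columns.astype(str).str.contains(marker).all()
    if rows_are_samples and not cols_are_samples:
        return frame            # already samples x genes
    if cols_are_samples and not rows_are_samples:
        return frame.T          # genes x samples, so transpose
    raise ValueError("Cannot infer orientation from row/column names.")


# Toy genes-by-samples matrix mimicking the layout of the raw file
toy = pd.DataFrame(
    [[8.9, 7.7, 11.3], [6.9, 6.6, 7.8]],
    index=["DDR1", "RFC2"],
    columns=["control_sample_1", "control_sample_2", "tumor_sample_1"],
)
oriented = ensure_samples_as_rows(toy)
print(oriented.shape)  # (3, 2)
```

Calling the helper a second time on `oriented` returns it unchanged, which makes the cell safe to re-run.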

1.1.4 Add Class Labels

In [13]:
labels = ['Normal'] * 27 + ['Cancer'] * 31            # Define class labels: 27 Normal followed by 31 Cancer (must match the sample order)
if df.shape[0] != len(labels):
    raise ValueError(
        f"Number of samples (rows) = {df.shape[0]} doesn't match the expected {len(labels)} "
        "for label assignment. Update labels accordingly."
    )                                                 # Ensure labels align with samples

df['Label'] = labels                                  # Add Label column to dataframe
print('\nData with Label column added (first 10 rows):')  
display(df[['Label']].head(10))                       # Show first 10 labels
print('\n...and last 10 rows:')
display(df[['Label']].tail(10))                       # Show last 10 labels for verification
Data with Label column added (first 10 rows):
Label
control_sample_1 Normal
control_sample_2 Normal
control_sample_3 Normal
control_sample_4 Normal
control_sample_5 Normal
control_sample_6 Normal
control_sample_7 Normal
control_sample_8 Normal
control_sample_9 Normal
control_sample_10 Normal
...and last 10 rows:
Label
tumor_sample_22 Cancer
tumor_sample_23 Cancer
tumor_sample_24 Cancer
tumor_sample_25 Cancer
tumor_sample_26 Cancer
tumor_sample_27 Cancer
tumor_sample_28 Cancer
tumor_sample_29 Cancer
tumor_sample_30 Cancer
tumor_sample_31 Cancer
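
The positional assignment above depends on the samples appearing in a fixed order. As an alternative, the labels could be derived from the sample names themselves. This is a minimal sketch, assuming that every control sample ID starts with `control` and every tumor sample ID starts with `tumor`, as in the output above. `PREFIX_TO_LABEL` and `labels_from_sample_names` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical prefix-to-class mapping, inferred from the sample names shown above
PREFIX_TO_LABEL = {"control": "Normal", "tumor": "Cancer"}


def labels_from_sample_names(sample_ids):
    """Map each sample ID to a class label based on its name prefix."""
    labels = []
    for sample_id in sample_ids:
        prefix = str(sample_id).split("_", 1)[0]
        if prefix not in PREFIX_TO_LABEL:
            raise ValueError(f"Unrecognized sample name: {sample_id!r}")
        labels.append(PREFIX_TO_LABEL[prefix])
    return labels


# Toy frame with samples deliberately out of order
toy = pd.DataFrame(
    {"DDR1": [8.9, 11.3, 7.7]},
    index=["control_sample_1", "tumor_sample_22", "control_sample_2"],
)
toy["Label"] = labels_from_sample_names(toy.index)
print(toy["Label"].tolist())  # ['Normal', 'Cancer', 'Normal']
```

Because each label is tied to its own sample name, reordering or dropping samples would not silently mislabel rows.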

1.1.5 Inspect Data Integrity

In [15]:
print('\nShape (samples, features):', df.shape)       # Display data dimensions
print('\nData types:')
print(df.dtypes[:10])                                # Show data types of first 10 columns
print('\nData info:')
df.info()                                            # Summary of dataframe including non-null counts
print('\nDescriptive statistics (first few columns):')
display(df.describe().iloc[:, :5])                   # Display descriptive stats for first 5 genes
Shape (samples, features): (58, 20826)

Data types:
DDR1      float64
RFC2      float64
HSPA6     float64
PAX8      float64
GUCA1A    float64
UBA7      float64
THRA      float64
PTPN21    float64
CCL5      float64
CYP2E1    float64
dtype: object

Data info:
<class 'pandas.core.frame.DataFrame'>
Index: 58 entries, control_sample_1 to tumor_sample_31
Columns: 20826 entries, DDR1 to Label
dtypes: float64(20825), object(1)
memory usage: 9.2+ MB

Descriptive statistics (first few columns):
DDR1 RFC2 HSPA6 PAX8 GUCA1A
count 58.000000 58.000000 58.000000 58.000000 58.000000
mean 9.151427 7.148796 6.989258 5.065083 4.414583
std 1.066212 0.494729 0.722380 0.190153 0.197964
min 6.299351 6.479220 5.373444 4.696112 4.112385
25% 8.499650 6.773367 6.634868 4.922845 4.275924
50% 9.146583 7.023783 6.936467 5.054100 4.355980
75% 9.864392 7.447876 7.316269 5.184241 4.506677
max 11.711080 8.839413 9.382892 5.496686 4.975240

1.1.6 Check Missing Values & Duplicates

In [17]:
print('\nMissing values per column (showing top 10 highest):')
print(df.isnull().sum().sort_values(ascending=False).head(10))   # Count missing values per column
print('\nTotal missing values in dataframe:', df.isnull().sum().sum())  # Count total missing values
print('\nDuplicate sample rows count:', df.duplicated().sum())   # Count duplicate rows
Missing values per column (showing top 10 highest):
DDR1      0
CENPM     0
RNF111    0
COQ6      0
DVL2      0
RRP1      0
UPF3B     0
DHRS11    0
KIF20A    0
XKR8      0
dtype: int64

Total missing values in dataframe: 0

Duplicate sample rows count: 0
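
The checks above could be bundled into one reusable report and extended with a few related tests (duplicate sample IDs, duplicate gene names, infinite values). The function `integrity_report` below is a hypothetical helper sketched on a toy frame, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def integrity_report(frame: pd.DataFrame) -> dict:
    """Collect basic integrity counts for a samples-by-genes dataframe."""
    numeric = frame.select_dtypes(include="number")
    return {
        "missing_values": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "duplicate_sample_ids": int(frame.index.duplicated().sum()),
        "duplicate_gene_names": int(frame.columns.duplicated().sum()),
        "infinite_values": int(np.isinf(numeric.to_numpy()).sum()),
    }


# Toy frame with one missing value and one duplicated sample
toy = pd.DataFrame(
    {"DDR1": [8.9, 8.9, np.nan], "RFC2": [6.8, 6.8, 7.1]},
    index=["control_sample_1", "control_sample_1", "tumor_sample_1"],
)
report = integrity_report(toy)
print(report)
```

On the real dataset, the notebook output above indicates that the first two counts would be zero.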

1.1.7 Class Distribution Visualization

In [19]:
print('\nClass distribution:')
print(df['Label'].value_counts())                    # Count samples per class

plt.figure()
df['Label'].value_counts().plot(kind='pie', autopct='%1.1f%%', startangle=90)  # Create pie chart
plt.title('Class distribution')                      # Add title
plt.ylabel('')                                       # Remove Y-axis label
plt.tight_layout()                                   # Adjust layout
plt.savefig('class_distribution_pie.png')            # Save figure
plt.show()                                           # Display plot
Class distribution:
Label
Cancer    31
Normal    27
Name: count, dtype: int64
[Figure: pie chart of the class distribution, saved as class_distribution_pie.png]
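
The class balance can also be expressed numerically. The sketch below rebuilds the label vector from the counts reported above (27 Normal, 31 Cancer) and computes class proportions and a simple majority-to-minority ratio. The variable names are illustrative.

```python
import pandas as pd

# Label vector rebuilt from the class counts reported above
labels = pd.Series(["Normal"] * 27 + ["Cancer"] * 31, name="Label")

counts = labels.value_counts()
proportions = labels.value_counts(normalize=True)   # fraction of samples per class
imbalance_ratio = counts.max() / counts.min()        # majority size / minority size

print(proportions.round(3))
print(f"Imbalance ratio: {imbalance_ratio:.3f}")
```

A ratio close to 1 suggests the two classes are roughly balanced.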

1.1.8 Separate Features and Labels

In [21]:
X = df.iloc[:, 0:-1]                                 # Select all columns except Label as features
y = df.iloc[:, -1]                                   # Select Label column as target
print('\nFeatures shape:', X.shape)                  # Print features shape
print('Labels shape:', y.shape)                      # Print labels shape
Features shape: (58, 20825)
Labels shape: (58,)

1.1.9 Encode Labels

In [23]:
label_encoder = LabelEncoder()                       # Initialize label encoder
label_encoder.fit(y)                                 # Fit encoder on labels
y_encoded = label_encoder.transform(y)               # Transform labels to numeric codes
labels_unique = label_encoder.classes_               # Retrieve unique label names
classes = np.unique(y_encoded)                       # Retrieve numeric label values

print('\nLabel classes:', labels_unique)             # Show label names
print('Encoded classes unique values:', classes)     # Show encoded numeric values
Label classes: ['Cancer' 'Normal']
Encoded classes unique values: [0 1]
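
Note that `LabelEncoder` sorts class names alphabetically, so here `Cancer` is encoded as 0 and `Normal` as 1. If a downstream metric treats 1 as the positive class, an explicit mapping may be easier to reason about. The following is a sketch under the assumption that `Cancer` should be the positive class. That convention and the name `CLASS_TO_CODE` are not from the original notebook.

```python
import pandas as pd

# Assumed convention (not from the original notebook): Cancer is the positive class
CLASS_TO_CODE = {"Normal": 0, "Cancer": 1}

y = pd.Series(["Normal", "Cancer", "Cancer", "Normal"], name="Label")
y_encoded = y.map(CLASS_TO_CODE)

# Unknown labels become NaN after map(), so fail loudly instead of continuing
if y_encoded.isnull().any():
    raise ValueError("Found labels outside the expected classes.")

y_encoded = y_encoded.astype(int)
print(y_encoded.tolist())  # [0, 1, 1, 0]
```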

1.2.0 Compute IQR (Preview)

In [25]:
# Note: the IQR is computed here for inspection only; actual filtering is done in the preprocessing notebook
iqr = X.quantile(0.75) - X.quantile(0.25)             # Compute IQR for each gene (vectorized; same linear interpolation as np.percentile)
print('\nIQR summary (first 10 genes):')
print(iqr.head(10))                                   # Display first 10 IQR values
IQR summary (first 10 genes):
DDR1      1.364742
RFC2      0.674510
HSPA6     0.681401
PAX8      0.261396
GUCA1A    0.230753
UBA7      0.438170
THRA      0.445374
PTPN21    0.864435
CCL5      1.640180
CYP2E1    0.304721
dtype: float64
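
To preview how an IQR-based filter might behave, the per-gene IQR can be compared against a cutoff. The sketch below uses synthetic data and a hypothetical threshold `IQR_THRESHOLD`. The actual cutoff used in the preprocessing notebook is not stated here, so the value is purely illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic samples-by-genes matrix, plus one gene with no variability
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(20, 5)), columns=[f"gene_{i}" for i in range(5)])
X["gene_flat"] = 5.0

# Per-gene interquartile range: 75th percentile minus 25th percentile
iqr = X.quantile(0.75) - X.quantile(0.25)

IQR_THRESHOLD = 0.5                        # hypothetical cutoff, for illustration only
keep = iqr[iqr > IQR_THRESHOLD].index      # genes that would survive the filter

print(iqr.round(3))
print(f"{len(keep)} of {X.shape[1]} genes pass the threshold")
```

The constant gene has an IQR of exactly 0 and is always excluded, which is the behavior such a filter is meant to provide.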

1.3.0 Save Cleaned Dataset

In [27]:
print(f"\nSaving cleaned dataframe to {OUTPUT_CLEANED} ...")  # Display save message
df.to_csv(OUTPUT_CLEANED)                              # Save dataframe to CSV
print('Saved.')                                        # Confirmation message
Saving cleaned dataframe to data_cleaned_with_labels.csv ...
Saved.
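
A quick round-trip check can confirm that the saved CSV reloads with the same sample IDs, labels, and values. This sketch writes a toy frame to a temporary directory rather than touching the real output file. The file name is reused only for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with the same layout as the cleaned dataset (genes + Label column)
toy = pd.DataFrame(
    {"DDR1": [8.938406, 11.299196], "Label": ["Normal", "Cancer"]},
    index=["control_sample_1", "tumor_sample_22"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "data_cleaned_with_labels.csv"
    toy.to_csv(out_path)                             # index (sample IDs) is written as the first column
    reloaded = pd.read_csv(out_path, index_col=0)    # read the sample IDs back as the index

same_ids = reloaded.index.tolist() == toy.index.tolist()
same_labels = reloaded["Label"].tolist() == toy["Label"].tolist()
same_values = bool((reloaded["DDR1"] - toy["DDR1"]).abs().max() < 1e-9)
print(same_ids, same_labels, same_values)  # True True True
```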

1.4.0 Summary of Cleaning Stage

In [29]:
print('\n--- Summary ---')
print('Final dataframe shape:', df.shape)             # Show final shape
print('Number of genes (features):', X.shape[1])      # Show number of genes
print('Number of samples:', df.shape[0])              # Show number of samples
print('Class distribution:')
print(df['Label'].value_counts())                     # Show final class counts
print('\nNotebook 1 finished. Output files:')
print('-', OUTPUT_CLEANED)                            # Show saved data file
print('- class_distribution_pie.png')                 # Show saved figure
--- Summary ---
Final dataframe shape: (58, 20826)
Number of genes (features): 20825
Number of samples: 58
Class distribution:
Label
Cancer    31
Normal    27
Name: count, dtype: int64

Notebook 1 finished. Output files:
- data_cleaned_with_labels.csv
- class_distribution_pie.png

1.5.0 Additional Visualizations

1.5.1 Histogram of Expression Values

In [32]:
from sklearn.decomposition import PCA                # Used later for the PCA plot (seaborn and matplotlib were imported in 1.1.0)

plt.figure(figsize=(8, 5))                           # Create a new figure
sns.histplot(df.iloc[:, :-1].values.flatten(), bins=50, kde=True, color='steelblue')  # Pool all expression values (Label column excluded)
plt.title("Histogram of Gene Expression Values")     # Add title
plt.xlabel("Expression Level")                       # Label X-axis
plt.ylabel("Frequency")                              # Label Y-axis
plt.tight_layout()                                   # Adjust layout
plt.savefig("histogram_distribution.png")            # Save figure
plt.show()                                           # Display plot
[Figure: histogram of gene expression values, saved as histogram_distribution.png]

1.5.2 Correlation Heatmap Between Samples

In [34]:
corr = df.iloc[:, :-1].T.corr()                      # Compute correlation between samples
plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap='viridis')
plt.title("Correlation Heatmap Between Samples")
plt.tight_layout()
plt.savefig("correlation_heatmap.png")
plt.show()
[Figure: correlation heatmap between samples, saved as correlation_heatmap.png]
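
Transposing before `.corr()` is what makes the correlation sample-by-sample rather than gene-by-gene, since pandas correlates columns. The sketch below shows this on synthetic data, cross-checks the result against `np.corrcoef`, and computes each sample's mean correlation with the others as one possible way to spot outlying samples. The data and variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic samples-by-genes matrix (4 samples, 50 genes)
rng = np.random.default_rng(42)
expr = pd.DataFrame(
    rng.normal(loc=7.0, scale=1.0, size=(4, 50)),
    index=["control_sample_1", "control_sample_2", "tumor_sample_1", "tumor_sample_2"],
    columns=[f"gene_{i}" for i in range(50)],
)

corr = expr.T.corr()                       # Pearson correlation between samples (4 x 4)
corr_np = np.corrcoef(expr.to_numpy())     # NumPy treats each row as a variable, so no transpose is needed

# Mean correlation of each sample with all other samples (diagonal masked out)
off_diagonal = corr.where(~np.eye(len(corr), dtype=bool))
mean_corr = off_diagonal.mean(axis=1)
print(mean_corr.sort_values())
```

A sample whose mean correlation is much lower than the rest may deserve a closer look before modeling.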

1.5.3 PCA 2D Visualization

In [36]:
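X_values = df.iloc[:, :-1].values                    # Expression matrix as a NumPy array (Label column excluded)
y_labels = df['Label'].values                        # Class labels used to color the points
pca = PCA(n_components=2)                            # Keep the first two principal components
X_pca = pca.fit_transform(X_values)                  # Fit PCA and project the samples (PCA centers the data internally)
plt.figure(figsize=(7, 5))
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=y_labels, palette={'Normal': 'blue', 'Cancer': 'red'})
plt.xlabel("PC1")                                    # Label X-axis
plt.ylabel("PC2")                                    # Label Y-axis
plt.title("PCA Plot (2 Components)")
plt.tight_layout()
plt.savefig("pca_plot.png")
plt.show()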
[Figure: 2D PCA scatter plot colored by class, saved as pca_plot.png]
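
A 2D PCA plot is easier to interpret when the share of variance captured by each component is known. The sketch below fits `PCA` on synthetic data (two groups with a shifted mean in a few features) and reads `explained_variance_ratio_`. The data are made up for illustration and say nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 30 samples x 100 features, first 15 samples shifted in 10 features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 100))
X_toy[:15, :10] += 3.0

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_toy)            # projected coordinates, shape (30, 2)
ratios = pca.explained_variance_ratio_      # fraction of total variance per component

print(f"PC1: {ratios[0]:.1%}, PC2: {ratios[1]:.1%}")
```

These ratios could be added to the axis labels, for example `plt.xlabel(f"PC1 ({ratios[0]:.1%})")`, so the plot shows how much structure the two axes actually capture.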

1.5.4 Boxplot of Average Expression per Sample

In [38]:
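sample_means = df.iloc[:, :-1].mean(axis=1)          # Mean expression across all genes for each sample
plt.figure(figsize=(8, 5))
sns.boxplot(x=df['Label'], y=sample_means)           # Compare per-sample means between classes
plt.ylabel("Mean expression")                        # Label Y-axis
plt.title("Boxplot of Average Expression per Sample by Class")
plt.tight_layout()
plt.savefig("boxplot_comparison.png")
plt.show()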
[Figure: boxplot of average expression per sample by class, saved as boxplot_comparison.png]
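
The quantities behind the boxplot can also be tabulated directly. This is a minimal sketch on a toy frame that computes the per-sample mean expression and summarizes it by class. The values and variable names are illustrative.

```python
import pandas as pd

# Toy samples-by-genes frame with a Label column
toy = pd.DataFrame(
    {
        "DDR1": [8.0, 9.0, 10.0, 11.0],
        "RFC2": [6.0, 7.0, 8.0, 9.0],
        "Label": ["Normal", "Normal", "Cancer", "Cancer"],
    },
    index=["control_sample_1", "control_sample_2", "tumor_sample_1", "tumor_sample_2"],
)

# Mean over genes for each sample, then summary statistics per class
sample_means = toy.drop(columns="Label").mean(axis=1).rename("mean_expression")
summary = sample_means.groupby(toy["Label"]).agg(["median", "min", "max"])
print(summary)
```

Large differences in per-sample means between classes can hint at a global intensity shift, which is one reason normalization is revisited in the next notebook.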

Next Step: Notebook 2 will focus on Preprocessing & Normalization.

Export Notebook 1 to HTML

In [41]:
!jupyter nbconvert --to html --embed-images "Notebook_1_Data_Exploration_and_Cleaning.ipynb" --output "Notebook_1_Data_Exploration_and_Cleaning.html"
[NbConvertApp] Converting notebook Notebook_1_Data_Exploration_and_Cleaning.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 1 image(s).
[NbConvertApp] Writing 366351 bytes to Notebook_1_Data_Exploration_and_Cleaning.html