diff --git a/README.md b/README.md
index 258d8d4..9d39137 100644
--- a/README.md
+++ b/README.md
@@ -11,7 +11,7 @@ Main aim of this project to is implement end-to-end ML pipelines on AWS sagemake
-- We’ll use a Scikit-Learn Pipeline for the transformations, and a Processing Step with a SKLearnProcessor to execute a preprocessing script. Check the SageMaker Pipelines Overview for an introduction to the fundamental components of a SageMaker Pipeline.
+- We’ll use a [Scikit-Learn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) for the transformations, and a [Processing Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-processing) with a [SKLearnProcessor](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#scikit-learn-processor) to execute a preprocessing script. Check the [SageMaker Pipelines Overview](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-sdk.html) for an introduction to the fundamental components of a SageMaker Pipeline.
 ### Step 1: EDA
@@ -46,12 +46,206 @@ penguins.head()
+- Now, let’s get the `summary statistics` for the features in our dataset.
+```
+penguins.describe(include="all")
+```
+
+
+
+- Let’s now display the distribution of values for the three categorical columns in our data:
+
+```
+species_distribution = penguins["species"].value_counts()
+island_distribution = penguins["island"].value_counts()
+sex_distribution = penguins["sex"].value_counts()
+
+print(species_distribution)
+print()
+print(island_distribution)
+print()
+print(sex_distribution)
+```
+
+
+- The categories in our data are distributed as follows:
+
+  - `species`: There are 3 species of penguins in the dataset: Adelie (152), Gentoo (124), and Chinstrap (68).
+  - `island`: Penguins are from 3 islands: Biscoe (168), Dream (124), and Torgersen (52).
+  - `sex`: We have 168 male penguins, 165 female penguins, and 1 penguin whose sex is recorded as the ambiguous value `.`.
+
+- Let’s replace the ambiguous value in the sex column with a null value:
+
+```
+import numpy as np
+
+penguins["sex"] = penguins["sex"].replace(".", np.nan)
+sex_distribution = penguins["sex"].value_counts()
+sex_distribution
+```
+
+
+- Next, let’s check for any missing values in the dataset.
+
+```
+penguins.isnull().sum()
+```
+
+
+- Let’s get rid of the missing values. For now, we are going to replace the missing values with the most frequent value in the column. Later, we’ll use a different strategy to replace missing numeric values.
+
+```
+from sklearn.impute import SimpleImputer
+
+imputer = SimpleImputer(strategy="most_frequent")
+penguins.iloc[:, :] = imputer.fit_transform(penguins)
+penguins.isnull().sum()
+```
+
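As a minimal sketch of what the most-frequent strategy does, the toy frame below has one missing entry per column, and `SimpleImputer(strategy="most_frequent")` fills each gap with that column's mode. The values are hypothetical and are not taken from the penguins data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values): one missing entry in each column.
toy = pd.DataFrame(
    {
        "sex": ["MALE", "FEMALE", "MALE", np.nan],
        "body_mass_g": [3750.0, 3800.0, np.nan, 3800.0],
    }
)

imputer = SimpleImputer(strategy="most_frequent")

# fit_transform returns a NumPy array, so rebuild a DataFrame from it.
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

Filling a numeric column with its mode is rarely ideal, which is presumably why the README plans a different strategy for numeric values later.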
+
+- Let’s visualize the distribution of `categorical features`.
+
+```
+import matplotlib.pyplot as plt
+
+fig, axs = plt.subplots(3, 1, figsize=(6, 10))
+
+axs[0].bar(species_distribution.index, species_distribution.values)
+axs[0].set_ylabel("Count")
+axs[0].set_title("Distribution of Species")
+
+axs[1].bar(island_distribution.index, island_distribution.values)
+axs[1].set_ylabel("Count")
+axs[1].set_title("Distribution of Island")
+
+axs[2].bar(sex_distribution.index, sex_distribution.values)
+axs[2].set_ylabel("Count")
+axs[2].set_title("Distribution of Sex")
+
+plt.tight_layout()
+plt.show()
+```
+
+- Let’s visualize the distribution of `numerical columns`.
+```
+fig, axs = plt.subplots(2, 2, figsize=(8, 6))
+
+axs[0, 0].hist(penguins["culmen_length_mm"], bins=20)
+axs[0, 0].set_ylabel("Count")
+axs[0, 0].set_title("Distribution of culmen_length_mm")
+
+axs[0, 1].hist(penguins["culmen_depth_mm"], bins=20)
+axs[0, 1].set_ylabel("Count")
+axs[0, 1].set_title("Distribution of culmen_depth_mm")
+
+axs[1, 0].hist(penguins["flipper_length_mm"], bins=20)
+axs[1, 0].set_ylabel("Count")
+axs[1, 0].set_title("Distribution of flipper_length_mm")
+
+axs[1, 1].hist(penguins["body_mass_g"], bins=20)
+axs[1, 1].set_ylabel("Count")
+axs[1, 1].set_title("Distribution of body_mass_g")
+
+plt.tight_layout()
+plt.show()
+```
+
+
+
+- Let’s display the covariance matrix of the dataset. The “covariance” measures how changes in one variable are associated with changes in a second variable. In other words, the covariance measures the degree to which two variables are linearly associated.
+
+```
+penguins.cov(numeric_only=True)
+```
+
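To make the definition concrete, here is a small sketch that computes the sample covariance by hand and checks it against pandas. The two columns hold made-up values; their names mirror the dataset only for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, chosen only to illustrate the formula.
toy = pd.DataFrame(
    {
        "body_mass_g": [3500.0, 4000.0, 4500.0, 5000.0],
        "flipper_length_mm": [185.0, 195.0, 205.0, 220.0],
    }
)

x = toy["body_mass_g"].to_numpy()
y = toy["flipper_length_mm"].to_numpy()

# Sample covariance: sum of products of deviations from the mean, divided by n - 1.
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# pandas uses the same n - 1 normalization by default.
pandas_cov = toy.cov().loc["body_mass_g", "flipper_length_mm"]

print(manual_cov, pandas_cov)
```

A positive value means the two variables tend to move together. Its magnitude depends on the units of both columns, so raw covariances are hard to compare across pairs of features.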
+
+- Here are three examples of what we get from interpreting the covariance matrix above:
+
+  - Penguins that weigh more tend to have a longer culmen.
+  - The more a penguin weighs, the shallower its culmen tends to be.
+  - The culmen depth varies little from penguin to penguin.
+
+- Let’s now display the correlation matrix. “Correlation” measures both the strength and direction of the linear relationship between two variables.
+
+```
+penguins.corr(numeric_only=True)
+```
+
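One way to see how correlation relates to covariance: the Pearson correlation is the covariance divided by the product of the two standard deviations, which keeps it between -1 and 1 regardless of units. A minimal sketch on made-up values (not the penguins data):

```python
import numpy as np
import pandas as pd

# Hypothetical measurements: depth falls as flipper length rises.
toy = pd.DataFrame(
    {
        "culmen_depth_mm": [18.5, 17.0, 15.5, 14.0, 13.5],
        "flipper_length_mm": [185.0, 192.0, 210.0, 215.0, 222.0],
    }
)

cov = toy.cov().loc["culmen_depth_mm", "flipper_length_mm"]
std_depth = toy["culmen_depth_mm"].std()      # sample standard deviation (n - 1)
std_flipper = toy["flipper_length_mm"].std()

# Pearson correlation = covariance / (std_x * std_y).
manual_corr = cov / (std_depth * std_flipper)
pandas_corr = toy.corr().loc["culmen_depth_mm", "flipper_length_mm"]

print(manual_corr, pandas_corr)
```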
+
+- Here are three examples of what we get from interpreting the correlation matrix above:
+
+  - Penguins that weigh more tend to have larger flippers.
+  - Penguins with a shallower culmen tend to have larger flippers.
+  - The length and depth of the culmen have a slight negative correlation.
+
+
+- Let’s display the distribution of species by island.
+
+```
+unique_species = penguins["species"].unique()
+
+fig, ax = plt.subplots(figsize=(6, 6))
+for species in unique_species:
+    data = penguins[penguins["species"] == species]
+    ax.hist(data["island"], bins=5, alpha=0.5, label=species)
+
+ax.set_xlabel("Island")
+ax.set_ylabel("Count")
+ax.set_title("Distribution of Species by Island")
+ax.legend()
+plt.show()
+```
+
+
+- Let’s display the distribution of species by sex.
+
+```
+fig, ax = plt.subplots(figsize=(6, 6))
+
+for species in unique_species:
+    data = penguins[penguins["species"] == species]
+    ax.hist(data["sex"], bins=3, alpha=0.5, label=species)
+
+ax.set_xlabel("Sex")
+ax.set_ylabel("Count")
+ax.set_title("Distribution of Species by Sex")
+
+ax.legend()
+plt.show()
+```
+
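The species-by-category breakdowns can also be read as a table. As an alternative to the overlapping histograms (this is not what the README itself uses), `pandas.crosstab` counts every combination of two categorical columns. The rows below are hypothetical.

```python
import pandas as pd

# Hypothetical rows, only to show the shape of the result.
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo", "Adelie"],
        "island": ["Torgersen", "Dream", "Biscoe", "Dream", "Biscoe", "Biscoe"],
    }
)

# Rows are species, columns are islands, cells are counts.
counts = pd.crosstab(toy["species"], toy["island"])

print(counts)
```

Swapping `"island"` for `"sex"` would give the corresponding table for the second plot.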
+### Step 2: Creating the Preprocessing Script
+- Fetch the data from an S3 bucket on AWS and send it to a `processing job` (a job running on AWS).
+- The processing job splits the data into three sets and transforms them. Its output is stored back in an S3 location called `Dataset splits`:
+  - `Training set`
+  - `Validation set`
+  - `Test set`
diff --git a/program/cohort.ipynb b/program/cohort.ipynb
index 96bc9b7..8d37162 100644
--- a/program/cohort.ipynb
+++ b/program/cohort.ipynb
@@ -67,16 +67,20 @@
     "\n",
     "import sys\n",
     "import logging\n",
+    "\n",
+    "# python unit test framework\n",
     "import ipytest\n",
+    "\n",
     "import json\n",
     "from pathlib import Path\n",
     "\n",
-    "\n",
+    "# Create the code folder and the inference code folder\n",
     "CODE_FOLDER = Path(\"code\")\n",
     "CODE_FOLDER.mkdir(parents=True, exist_ok=True)\n",
     "INFERENCE_CODE_FOLDER = CODE_FOLDER / \"inference\"\n",
     "INFERENCE_CODE_FOLDER.mkdir(parents=True, exist_ok=True)\n",
     "\n",
+    "# make the code folder available for imports\n",
     "sys.path.extend([f\"./{CODE_FOLDER}\", f\"./{INFERENCE_CODE_FOLDER}\"])\n",
     "\n",
     "DATA_FILEPATH = \"penguins.csv\"\n",
diff --git a/program/images/eda10.PNG b/program/images/eda10.PNG
new file mode 100644
index 0000000..2d8393b
Binary files /dev/null and b/program/images/eda10.PNG differ
diff --git a/program/images/eda11.png b/program/images/eda11.png
new file mode 100644
index 0000000..f320f4d
Binary files /dev/null and b/program/images/eda11.png differ
diff --git a/program/images/eda12.png b/program/images/eda12.png
new file mode 100644
index 0000000..33b46b6
Binary files /dev/null and b/program/images/eda12.png differ
diff --git a/program/images/eda2.PNG b/program/images/eda2.PNG
new file mode 100644
index 0000000..87e76b6
Binary files /dev/null and b/program/images/eda2.PNG differ
diff --git a/program/images/eda3.PNG b/program/images/eda3.PNG
new file mode 100644
index 0000000..6c5f814
Binary files /dev/null and b/program/images/eda3.PNG differ
diff --git a/program/images/eda4.PNG b/program/images/eda4.PNG
new file mode 100644
index 0000000..448e615
Binary files /dev/null and b/program/images/eda4.PNG differ
diff --git a/program/images/eda5.PNG b/program/images/eda5.PNG
new file mode 100644
index 0000000..3678d60
Binary files /dev/null and b/program/images/eda5.PNG differ
diff --git a/program/images/eda6.PNG b/program/images/eda6.PNG
new file mode 100644
index 0000000..663c3f7
Binary files /dev/null and b/program/images/eda6.PNG differ
diff --git a/program/images/eda7.png b/program/images/eda7.png
new file mode 100644
index 0000000..94c5044
Binary files /dev/null and b/program/images/eda7.png differ
diff --git a/program/images/eda8.png b/program/images/eda8.png
new file mode 100644
index 0000000..13dbf8b
Binary files /dev/null and b/program/images/eda8.png differ
diff --git a/program/images/eda9.PNG b/program/images/eda9.PNG
new file mode 100644
index 0000000..386ee6c
Binary files /dev/null and b/program/images/eda9.PNG differ
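The three-way split that Step 2 of the README describes could be sketched locally as below. This is only an illustration: the 70/15/15 proportions and the fixed seed are assumptions, `split_dataset` is a hypothetical helper name, and the diff does not show the actual preprocessing script or how it reads from and writes to S3.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_dataset(df, seed=42):
    """Split a frame into train/validation/test sets (assumed 70/15/15)."""
    # First carve off 30% of the rows, then halve that remainder.
    train, temp = train_test_split(df, test_size=0.3, random_state=seed)
    validation, test = train_test_split(temp, test_size=0.5, random_state=seed)
    return train, validation, test


# Hypothetical stand-in for the penguins data.
toy = pd.DataFrame({"feature": np.arange(100), "label": np.arange(100) % 3})

train, validation, test = split_dataset(toy)
print(len(train), len(validation), len(test))
```

Splitting before fitting any transformer keeps statistics from the validation and test sets out of the training pipeline.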