Skip to content

Commit

Permalink
added eda in the script
Browse files Browse the repository at this point in the history
  • Loading branch information
rohit-chandra committed Jan 22, 2024
1 parent d6f701d commit cea043a
Show file tree
Hide file tree
Showing 13 changed files with 200 additions and 2 deletions.
196 changes: 195 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ Main aim of this project to is implement end-to-end ML pipelines on AWS sagemake
<img src="program/images/training.png"/>
</p>

- We’ll use a Scikit-Learn Pipeline for the transformations, and a Processing Step with a SKLearnProcessor to execute a preprocessing script. Check the SageMaker Pipelines Overview for an introduction to the fundamental components of a SageMaker Pipeline.
- We’ll use a [Scikit-Learn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) for the transformations, and a [Processing Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-processing) with a [SKLearnProcessor](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#scikit-learn-processor) to execute a preprocessing script. Check the [SageMaker Pipelines Overview](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-sdk.html) for an introduction to the fundamental components of a SageMaker Pipeline.

### Step 1: EDA

Expand Down Expand Up @@ -46,12 +46,206 @@ penguins.head()
<img src="program/images/culmen.jpeg"/>
</p>

- Now, let’s get the `summary statistics` for the features in our dataset.

```
penguins.describe(include="all")
```

<p align="left">
<img src="program/images/eda2.PNG"/>
</p>

- Let’s now display the distribution of values for the three categorical columns in our data:

```
species_distribution = penguins["species"].value_counts()
island_distribution = penguins["island"].value_counts()
sex_distribution = penguins["sex"].value_counts()
print(species_distribution)
print()
print(island_distribution)
print()
print(sex_distribution)
```
<p align="left">
<img src="program/images/eda3.PNG"/>
</p>

- The distribution of the categories in our data are:

- `species`: There are 3 species of penguins in the dataset: Adelie (152), Gentoo (124), and Chinstrap (68).
- `island`: Penguins are from 3 islands: Biscoe (168), Dream (124), and Torgersen (52).
- `sex`: We have 168 male penguins, 165 female penguins, and 1 penguin with an ambiguous gender (.).

- Let’s replace the ambiguous value in the sex column with a null value:

```
penguins["sex"] = penguins["sex"].replace(".", np.nan)
sex_distribution = penguins["sex"].value_counts()
sex_distribution
```
<p align="left">
<img src="program/images/eda4.PNG"/>
</p>

- Next, let’s check for any missing values in the dataset.

```
penguins.isnull().sum()
```
<p align="left">
<img src="program/images/eda5.PNG"/>
</p>

- Let’s get rid of the missing values. For now, we are going to replace the missing values with the most frequent value in the column. Later, we’ll use a different strategy to replace missing numeric values.

```
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="most_frequent")
penguins.iloc[:, :] = imputer.fit_transform(penguins)
penguins.isnull().sum()
```
<p align="left">
<img src="program/images/eda6.PNG"/>
</p>

- Let’s visualize the distribution of `categorical features`.

```
import matplotlib.pyplot as plt
fig, axs = plt.subplots(3, 1, figsize=(6, 10))
axs[0].bar(species_distribution.index, species_distribution.values)
axs[0].set_ylabel("Count")
axs[0].set_title("Distribution of Species")
axs[1].bar(island_distribution.index, island_distribution.values)
axs[1].set_ylabel("Count")
axs[1].set_title("Distribution of Island")
axs[2].bar(sex_distribution.index, sex_distribution.values)
axs[2].set_ylabel("Count")
axs[2].set_title("Distribution of Sex")
plt.tight_layout()
plt.show()
```
<p align="left">
<img src="program/images/eda7.PNG"/>
</p>

- Let’s visualize the distribution of `numerical columns`.

```
fig, axs = plt.subplots(2, 2, figsize=(8, 6))
axs[0, 0].hist(penguins["culmen_length_mm"], bins=20)
axs[0, 0].set_ylabel("Count")
axs[0, 0].set_title("Distribution of culmen_length_mm")
axs[0, 1].hist(penguins["culmen_depth_mm"], bins=20)
axs[0, 1].set_ylabel("Count")
axs[0, 1].set_title("Distribution of culmen_depth_mm")
axs[1, 0].hist(penguins["flipper_length_mm"], bins=20)
axs[1, 0].set_ylabel("Count")
axs[1, 0].set_title("Distribution of flipper_length_mm")
axs[1, 1].hist(penguins["body_mass_g"], bins=20)
axs[1, 1].set_ylabel("Count")
axs[1, 1].set_title("Distribution of body_mass_g")
plt.tight_layout()
plt.show()
```

<p align="left">
<img src="program/images/eda8.PNG"/>
</p>

- Let’s display the covariance matrix of the dataset. The “covariance” measures how changes in one variable are associated with changes in a second variable. In other words, the covariance measures the degree to which two variables are linearly associated.

```
penguins.cov(numeric_only=True)
```
<p align="left">
<img src="program/images/eda9.PNG"/>
</p>

- Here are three examples of what we get from interpreting the covariance matrix below:

- Penguins that weight more tend to have a larger culmen.
- The more a penguin weights, the shallower its culmen tends to be.
- There’s a small variance between the culmen depth of penguins.

- Let’s now display the correlation matrix. “Correlation” measures both the strength and direction of the linear relationship between two variables.

```
penguins.corr(numeric_only=True)
```
<p align="left">
<img src="program/images/eda10.PNG"/>
</p>

- Here are three examples of what we get from interpreting the correlation matrix below:

- Penguins that weight more tend to have larger flippers.
- Penguins with a shallower culmen tend to have larger flippers.
- The length and depth of the culmen have a slight negative correlation.


- Let’s display the distribution of species by island.

```
unique_species = penguins["species"].unique()
fig, ax = plt.subplots(figsize=(6, 6))
for species in unique_species:
data = penguins[penguins["species"] == species]
ax.hist(data["island"], bins=5, alpha=0.5, label=species)
ax.set_xlabel("Island")
ax.set_ylabel("Count")
ax.set_title("Distribution of Species by Island")
ax.legend()
plt.show()
```
<p align="left">
<img src="program/images/eda11.PNG"/>
</p>

- Let’s display the distribution of species by sex.

```
fig, ax = plt.subplots(figsize=(6, 6))
for species in unique_species:
data = penguins[penguins["species"] == species]
ax.hist(data["sex"], bins=3, alpha=0.5, label=species)
ax.set_xlabel("Sex")
ax.set_ylabel("Count")
ax.set_title("Distribution of Species by Sex")
ax.legend()
plt.show()
```
<p align="left">
<img src="program/images/eda12.PNG"/>
</p>


### Step 2: Creating the Preprocessing Script

- Fetch the data from S3 bucket on AWS and send it to a `processing job` (job running on AWS)
- Processing Job splits the data into 3 sets and transforms and the output of this job gets stored back on S3 location called `Dataset splits`:
- `Training set`
- `Validation set`
- `Test set`



Expand Down
6 changes: 5 additions & 1 deletion program/cohort.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -67,16 +67,20 @@
"\n",
"import sys\n",
"import logging\n",
"\n",
"# python unit test framework\n",
"import ipytest\n",
"\n",
"import json\n",
"from pathlib import Path\n",
"\n",
"\n",
"# Create the code folder and the inference code folder\n",
"CODE_FOLDER = Path(\"code\")\n",
"CODE_FOLDER.mkdir(parents=True, exist_ok=True)\n",
"INFERENCE_CODE_FOLDER = CODE_FOLDER / \"inference\"\n",
"INFERENCE_CODE_FOLDER.mkdir(parents=True, exist_ok=True)\n",
"\n",
"# make the code folder available for imports\n",
"sys.path.extend([f\"./{CODE_FOLDER}\", f\"./{INFERENCE_CODE_FOLDER}\"])\n",
"\n",
"DATA_FILEPATH = \"penguins.csv\"\n",
Expand Down
Binary file added program/images/eda10.PNG
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added program/images/eda11.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added program/images/eda12.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added program/images/eda2.PNG
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added program/images/eda3.PNG
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added program/images/eda4.PNG
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added program/images/eda5.PNG
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added program/images/eda6.PNG
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added program/images/eda7.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added program/images/eda8.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added program/images/eda9.PNG
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit cea043a

Please sign in to comment.