# Must use Python code with explanations for each line because I need to learn the code as I go. Must be completed on time. Please don’t accept if you cannot complete this. This is for Data Science cour

Must use Python code with explanations for each line because I need to learn the code as I go.

Must be completed on time. Please don’t accept if you cannot complete this.

This is for a Data Science course.

**Instructions**:

* Do your work in Jupyter notebooks. Create new cell(s) after each problem and place your code, explanation, and solution in these cell(s).

* When you’re done, export your Jupyter notebook to an HTML file.

* Upload the HTML file to Canvas.

In this assignment, you will analyze the “diabetes” dataset, which can be downloaded from Canvas. This dataset has these columns:

1. Number of times pregnant

2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test

3. Diastolic blood pressure (mm Hg)

4. Triceps skin fold thickness (mm)

5. 2-Hour serum insulin (mu U/ml)

6. Body mass index (weight in kg/(height in m)^2)

7. Diabetes pedigree function

8. Age (years)

9. Outcome, which is 1 if the patient tested positive for diabetes and 0 otherwise.

**Problem 1**

Use glucose, blood pressure, skin thickness, BMI, and age as features.  Select only non-missing data to form the features for subsequent analysis.

Report the average, minimum, and maximum of each feature.

There should be about 530 data points with non-missing data. You will use this data for the next questions.
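A minimal sketch of this selection step. The column names (`Glucose`, `BloodPressure`, `SkinThickness`, `BMI`, `Age`) are assumptions because the assignment does not state them, and the sketch assumes missing measurements are stored as 0, which should be checked against the actual file. The small dataframe is a hypothetical stand-in for the Canvas data.

```python
import numpy as np   # numerical library, used here for the NaN (missing) marker
import pandas as pd  # dataframe library

# Hypothetical stand-in for the Canvas file. In the notebook you would load the
# real data instead, e.g. df = pd.read_csv("diabetes.csv") (file name is an assumption).
df = pd.DataFrame({
    "Glucose":       [148, 85, 0, 89, 137, 116],
    "BloodPressure": [72, 66, 64, 66, 40, 0],
    "SkinThickness": [35, 29, 0, 23, 35, 0],
    "BMI":           [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Age":           [50, 31, 32, 21, 33, 30],
    "Outcome":       [1, 0, 1, 0, 1, 0],
})

# The five features named in Problem 1 (column names are assumptions).
feature_cols = ["Glucose", "BloodPressure", "SkinThickness", "BMI", "Age"]

# Assumption: a 0 in these columns means "not measured".
# replace() turns each 0 into NaN, and dropna() removes any row that has a NaN.
features = df[feature_cols].replace(0, np.nan).dropna()

# agg() computes the three requested statistics per column;
# .T transposes the result so each feature becomes one row.
summary = features.agg(["mean", "min", "max"]).T

print(len(features), "rows with non-missing data")
print(summary)
```

Note that `dropna()` keeps the original row labels, so `features.index` still points back to the rows of `df`. That matters later when Outcome has to be matched to the scaled data.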

**Problem 2**

Show the distribution of BMI values of the data you got in the previous question.
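One common way to show a distribution is a histogram. The sketch below uses synthetic BMI values as a hypothetical stand-in for the filtered data, and the column name `BMI` and the choice of 20 bins are assumptions.

```python
import numpy as np               # random numbers for the stand-in data
import pandas as pd              # dataframe library
import matplotlib.pyplot as plt  # plotting library

# Synthetic stand-in for the filtered features from Problem 1 (not real data).
rng = np.random.default_rng(0)
features = pd.DataFrame({"BMI": rng.normal(33, 7, 500)})

# Create one figure with one set of axes to draw on.
fig, ax = plt.subplots(figsize=(6, 4))

# hist() splits the BMI range into 20 equal-width bins and counts rows per bin.
counts, bin_edges, _ = ax.hist(features["BMI"], bins=20, edgecolor="black")

# Label the axes so the figure can be read on its own.
ax.set_xlabel("BMI (kg/m^2)")
ax.set_ylabel("Number of patients")
ax.set_title("Distribution of BMI")
```

In a Jupyter notebook the figure is rendered when the cell finishes running.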

**Problem 3**

What is the range of BMI values for the middle 50% of the patients in this data (from the 25th to the 75th percentile of BMI)? Show how to get the values that bound this range.
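The middle 50% is bounded by the 25th and 75th percentiles, which pandas exposes through `quantile`. A small sketch with made-up BMI values (not taken from the dataset):

```python
import pandas as pd  # dataframe library

# Made-up BMI values standing in for features["BMI"] from Problem 1.
bmi = pd.Series([22.0, 25.0, 28.0, 31.0, 34.0, 37.0, 40.0, 43.0, 46.0], name="BMI")

# quantile() with a list returns one value per requested fraction:
# 0.25 is the lower bound of the middle 50%, 0.75 is the upper bound.
q1, q3 = bmi.quantile([0.25, 0.75])

# The width of the middle 50% is the interquartile range (IQR).
iqr = q3 - q1

print(f"Middle 50% of BMI: {q1} to {q3} (width {iqr})")
```

`bmi.describe()` reports the same two bounds in its `25%` and `75%` rows.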

**Problem 4**

These 5 features are not on the same scale. Distance-based machine learning methods require features to be on the same scale. Otherwise, the differences in scale emphasize one feature over the others in an uncontrolled fashion.

Use MinMaxScaler to rescale the features you got from Problem 1.

Then, create a new dataframe to store these scaled features. Make sure that the columns and indexes of the new dataframe are the same as those of the old dataframe.
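A sketch of the rescaling step. `MinMaxScaler.fit_transform` returns a plain NumPy array, so the column names and index have to be passed back in when building the new dataframe. The values and column names below are hypothetical stand-ins.

```python
import pandas as pd                             # dataframe library
from sklearn.preprocessing import MinMaxScaler  # rescales each column to [0, 1]

# Hypothetical filtered features; note the index has gaps (rows 2 and 5 were dropped).
features = pd.DataFrame({
    "Glucose":       [148.0, 85.0, 89.0, 137.0],
    "BloodPressure": [72.0, 66.0, 66.0, 40.0],
    "SkinThickness": [35.0, 29.0, 23.0, 35.0],
    "BMI":           [33.6, 26.6, 28.1, 43.1],
    "Age":           [50, 31, 21, 33],
}, index=[0, 1, 3, 4])

# Create the scaler; by default it maps each column's min to 0 and max to 1.
scaler = MinMaxScaler()

# fit_transform() learns the min and max of every column, then rescales the data.
# The result is a NumPy array without column names or row labels.
scaled_values = scaler.fit_transform(features)

# Rebuild a dataframe, reusing the original columns and index.
scaled = pd.DataFrame(scaled_values, columns=features.columns, index=features.index)

print(scaled)
```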

Next, melt the dataframe so you can compare the statistics/distributions of the rescaled features side by side in the next step.
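Melting turns a wide table (one column per feature) into a long table with one row per (feature, value) pair, which is the shape seaborn expects for side-by-side comparisons. A small sketch with made-up scaled values; the names `feature` and `scaled_value` are arbitrary choices for illustration.

```python
import pandas as pd  # dataframe library

# Made-up scaled features in wide format: one column per feature.
scaled = pd.DataFrame({
    "Glucose": [0.0, 0.5, 1.0],
    "BMI":     [0.2, 0.0, 1.0],
    "Age":     [1.0, 0.0, 0.4],
}, index=[0, 1, 3])

# melt() stacks all columns on top of each other.
# var_name names the new column holding the old column names,
# value_name names the new column holding the numbers.
long_df = scaled.melt(var_name="feature", value_name="scaled_value")

print(long_df)
```

By default `melt` discards the original index. Passing `ignore_index=False` keeps it if the row labels are still needed.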

Finally, use seaborn catplot to compare the 5-point statistics and distributions of the rescaled features.

Report the differences between the features in terms of median and middle-50% range.
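A sketch of the comparison step, assuming a box plot is an acceptable `catplot` kind for showing the 5-point statistics (`kind="violin"` or `kind="boxen"` would show more of the distribution shape). The data is random and already in [0, 1]; it only stands in for the real scaled features.

```python
import numpy as np     # random numbers for the stand-in data
import pandas as pd    # dataframe library
import seaborn as sns  # statistical plotting library

# Random stand-in for the MinMax-scaled features (values already between 0 and 1).
rng = np.random.default_rng(0)
scaled = pd.DataFrame(rng.random((100, 3)), columns=["Glucose", "BMI", "Age"])

# Long format: one row per (feature, value) pair.
long_df = scaled.melt(var_name="feature", value_name="scaled_value")

# One box per feature: the box spans the middle 50%, the line inside is the median.
g = sns.catplot(data=long_df, x="feature", y="scaled_value", kind="box")

# The same numbers as a table: 25%, 50% (median), and 75% for each feature.
stats = scaled.quantile([0.25, 0.5, 0.75]).T
stats.columns = ["q1", "median", "q3"]

# Width of the middle 50% for each feature.
stats["iqr"] = stats["q3"] - stats["q1"]

print(stats)
```

Reading the medians and `iqr` values off this table gives the numbers to quote in the written comparison.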

**Problem 5**

Compare the distributions/5-point statistics of the features rescaled using MinMaxScaler and StandardScaler.  You must show the figures and draw some conclusions on how these two methods differ.
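A sketch of one way to put the two scalers side by side. It uses pandas' built-in box plots rather than seaborn to keep the example short; the same frames could be melted and passed to `catplot` instead. The data is synthetic, and the helper name `rescale` is made up for this example.

```python
import numpy as np               # random numbers for the stand-in data
import pandas as pd              # dataframe library
import matplotlib.pyplot as plt  # plotting library
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the filtered features from Problem 1.
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "Glucose": rng.normal(120, 30, 200),
    "BMI":     rng.normal(33, 7, 200),
    "Age":     rng.integers(21, 80, 200),
})

def rescale(scaler, frame):
    """Apply a scaler and keep the original column names and index."""
    values = scaler.fit_transform(frame)
    return pd.DataFrame(values, columns=frame.columns, index=frame.index)

mm = rescale(MinMaxScaler(), features)     # each column mapped to [0, 1]
std = rescale(StandardScaler(), features)  # each column to mean 0, std 1

# Two panels next to each other, one per scaler.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
mm.plot.box(ax=axes[0], title="MinMaxScaler")
std.plot.box(ax=axes[1], title="StandardScaler")

# 5-point statistics of both versions in one table.
rows = ["min", "25%", "50%", "75%", "max"]
five_point = pd.concat(
    {"minmax": mm.describe().loc[rows], "standard": std.describe().loc[rows]},
    axis=1,
)
print(five_point)
```

MinMaxScaler fixes the endpoints (0 and 1) and lets the median float, while StandardScaler fixes the mean and spread and lets the endpoints float, so outliers show up differently in the two panels.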

**Problem 6**

Use KMeans to cluster the data into 2 clusters, using features scaled with MinMaxScaler.

Create a new column called “km2” on your scaled features data frame to store the cluster labels.

Create a new column called “Outcome” on your scaled features data frame and save the Outcome from the original data to this column.  To do this properly, you must get the data from only the indexes that exist in your scaled data frame.
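A sketch of the clustering and index-matching steps. The data is made up, with only two features to keep it short, and `n_init=10` and `random_state=0` are choices made here for reproducibility, not requirements of the assignment.

```python
import pandas as pd                 # dataframe library
from sklearn.cluster import KMeans  # k-means clustering

# Made-up original data: 10 rows, labelled 0..9.
df = pd.DataFrame({"Outcome": [0, 1, 0, 1, 1, 0, 0, 1, 0, 1]})

# Made-up scaled features: only 7 rows survived the missing-data filter,
# so the index has gaps (2, 5, and 8 are absent).
scaled = pd.DataFrame({
    "Glucose": [0.10, 0.20, 0.15, 0.90, 0.85, 0.80, 0.95],
    "BMI":     [0.20, 0.10, 0.25, 0.80, 0.90, 0.85, 0.70],
}, index=[0, 1, 3, 4, 6, 7, 9])

# Keep the feature names in a list so that columns added later
# (km2, Outcome) are never fed into the clustering by accident.
feature_cols = ["Glucose", "BMI"]

# Ask for 2 clusters; random_state makes the result repeatable.
km = KMeans(n_clusters=2, n_init=10, random_state=0)

# fit_predict() finds the cluster centers and returns a label (0 or 1) per row.
scaled["km2"] = km.fit_predict(scaled[feature_cols])

# df.loc[...] selects only the rows whose labels appear in scaled.index,
# so each Outcome lines up with the right patient.
scaled["Outcome"] = df.loc[scaled.index, "Outcome"]

print(scaled)
```

The cluster numbers are arbitrary: cluster 0 does not have to correspond to Outcome 0.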

Do the two clusters and two Outcome groups (0 and 1) overlap?

Report the counts and average values in each cluster in each group.  (There’ll be a total of 4 cluster/group combinations).
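Counts per combination can be read from a cross-tabulation, and averages from a group-by on both columns. A sketch on a made-up frame that already has `km2` and `Outcome` columns:

```python
import pandas as pd  # dataframe library

# Made-up scaled data with cluster labels and outcomes already attached.
scaled = pd.DataFrame({
    "Glucose": [0.1, 0.2, 0.9, 0.8, 0.7, 0.3],
    "BMI":     [0.2, 0.3, 0.8, 0.9, 0.6, 0.1],
    "km2":     [0, 0, 1, 1, 1, 0],
    "Outcome": [0, 1, 1, 1, 0, 0],
})

# crosstab() counts rows for every (cluster, outcome) pair.
# Rows are clusters, columns are outcomes.
counts = pd.crosstab(scaled["km2"], scaled["Outcome"])

# groupby() on both columns forms the 4 combinations;
# mean() averages every remaining feature within each combination.
means = scaled.groupby(["km2", "Outcome"]).mean()

print(counts)
print(means)
```

If most rows sit on one diagonal of `counts`, the clusters and the Outcome groups largely overlap; if the rows are spread evenly, they do not.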

Visualize (relplot on some 2 selected features) the data in the 4 cluster-Outcome combinations.  To see the combinations clearly, use hue, col, and row in relplot.
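A sketch of the grid plot. The choice of `Glucose` and `BMI` as the two features is arbitrary, and the data is random stand-in data with all 4 combinations present.

```python
import numpy as np     # random numbers for the stand-in data
import pandas as pd    # dataframe library
import seaborn as sns  # statistical plotting library

# Random stand-in data; km2 and Outcome are built so every combination occurs.
rng = np.random.default_rng(0)
scaled = pd.DataFrame({
    "Glucose": rng.random(80),
    "BMI":     rng.random(80),
    "km2":     np.tile([0, 1], 40),    # 0, 1, 0, 1, ...
    "Outcome": np.repeat([0, 1], 40),  # forty 0s followed by forty 1s
})

# relplot() draws one scatter panel per (row, col) combination:
# columns split by cluster, rows split by Outcome, color repeats the cluster.
g = sns.relplot(
    data=scaled,
    x="Glucose",
    y="BMI",
    hue="km2",
    col="km2",
    row="Outcome",
)
```

This gives a 2 x 2 grid, one panel per cluster-Outcome combination.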

**Problem 7**

Using the silhouette score as the scoring method for KMeans, determine the best number of clusters.

Create a new column called “km_opt” on your scaled features data frame to store the cluster labels.
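A sketch of the search: fit KMeans for several values of k, score each result with `silhouette_score`, and keep the k with the highest score. The candidate range 2 to 7 is an assumption, and the data is synthetic (three tight blobs), so the k it selects says nothing about the real dataset.

```python
import numpy as np                            # numerical library
import pandas as pd                           # dataframe library
from sklearn.cluster import KMeans            # k-means clustering
from sklearn.metrics import silhouette_score  # cluster quality score in [-1, 1]

# Synthetic stand-in: three well-separated blobs inside the unit square.
rng = np.random.default_rng(0)
centers = np.array([[0.2, 0.2], [0.8, 0.2], [0.5, 0.8]])
points = np.vstack([c + rng.normal(0, 0.03, (40, 2)) for c in centers])
scaled = pd.DataFrame(points, columns=["Glucose", "BMI"])

# Only these columns are used for clustering.
feature_cols = ["Glucose", "BMI"]

# Try each candidate k and remember its silhouette score.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled[feature_cols])
    # Higher is better: points are close to their own cluster and far from others.
    scores[k] = silhouette_score(scaled[feature_cols], labels)

# Pick the k whose score is largest.
best_k = max(scores, key=scores.get)

# Refit with the chosen k and store the labels.
best_model = KMeans(n_clusters=best_k, n_init=10, random_state=0)
scaled["km_opt"] = best_model.fit_predict(scaled[feature_cols])

print(scores)
print("best k:", best_k)
```

The silhouette score is undefined for a single cluster, which is why the search starts at k = 2.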

Do the clusters and the two Outcome groups (0 and 1) overlap?

Report the counts and average values in each cluster in each group.

Visualize (relplot on some 2 selected features) the data in the cluster-Outcome combinations. To see the combinations clearly, use hue, col, and row in relplot.