Pandas is a Python Analysis package. In this case we use data from "Winnable Battle Risk Factors and Health Indicators - Summary Report By State", with some commands. The data is [here][Pandas plotting]:
Big Data Analysis with Pandas |
Tutorial
For our analysis we are going to use Pandas which is a library for data analysis, and read in our CSV file [here]:
import pandas as pd ver=pd.read_csv("df.csv")
We now have an object named ver that we can analyse. I have used a small data set so that we can easily analyse the data, and show the various steps. So let's have a quick look at our data by viewing the first few rows:
Command used: ver.head(3) ------------------- RegionState Infant MR Heart Disease DR Stroke DR Suicide DR Homicide DR \ 0 National 6.0 167.0 36.5 13.0 5.10 1 Alabama 8.6 224.0 48.3 14.5 8.10 2 Alaska 5.8 146.6 32.3 22.1 4.70 Drug Poisoning DR Motor Vech DR Cancer DR 0 14.7 10.2 161.2 1 15.2 16.9 177.6 2 16.8 9.9 164.2
We can see that our columns are defined with the US state, then Infant Mortality Rate, Heart Disease Death Rate, and so on. Let's now examine the number of rows and columns that we have. For this we can use the len(ver):
Command used: len(ver) ------------------- 52
and this shows we have 52 rows (related to the number of US states). We can also view the columns used with the ver.columns property:
Command used: ver.columns ------------------- Index([u'RegionState', u'Infant MR', u'Heart Disease DR', u'Stroke DR', u'Suicide DR', u'Homicide DR', u'Drug Poisoning DR', u'Motor Vech DR', u'Cancer DR'], dtype='object')
and the data type of the data in the columns (using the ver.dtypes property):
Command used: ver.dtypes ------------------- RegionState object Infant MR float64 Heart Disease DR float64 Stroke DR float64 Suicide DR float64 Homicide DR object Drug Poisoning DR float64 Motor Vech DR float64 Cancer DR float64 dtype: object
In this case we see that most of our data columns are defined with values as floating point numbers.
Next, if we want to pin-out a specific column, such as the region/state:
Command used: ver['RegionState'] ------------------- 0 National 1 Alabama 2 Alaska 3 Arizona 4 Arkansas 5 California 6 Colorado 7 Connecticut 8 Delaware 9 District of Columbia 10 Florida 11 Georgia 12 Hawaii 13 Idaho 14 Illinois 15 Indiana 16 Iowa 17 Kansas 18 Kentucky 19 Louisiana 20 Maine 21 Maryland 22 Massachusetts 23 Michigan 24 Minnesota 25 Mississippi 26 Missouri 27 Montana 28 Nebraska 29 Nevada 30 New Hampshire 31 New Jersey 32 New Mexico 33 New York 34 North Carolina 35 North Dakota 36 Ohio 37 Oklahoma 38 Oregon 39 Pennsylvania 40 Rhode Island 41 South Carolina 42 South Dakota 43 Tennessee 44 Texas 45 Utah 46 Vermont 47 Virginia 48 Washington 49 West Virginia 50 Wisconsin 51 Wyoming Name: RegionState, dtype: object
We can the access the column data for Infant Mortality Rates with: (ver['Infant MR']):
Command used: ver['Infant MR'] ------------------- 0 6.0 1 8.6 2 5.8 3 5.2 4 7.8 5 4.8 6 5.1 7 4.8 8 6.4 9 6.7 10 6.1 11 7.0 12 6.4 13 5.6 14 6.0 15 7.2 16 4.2 17 6.5 18 6.4 19 8.7 20 7.1 21 6.6 22 4.2 23 7.0 24 5.1 ...
where we are listing Infant Morality Rates, and we can look at too columns together and to include the state/region:
Command used: ver[['RegionState','Infant MR','Heart Disease DR']] ------------------- RegionState Infant MR Heart Disease DR 0 National 6.0 167.0 1 Alabama 8.6 224.0 2 Alaska 5.8 146.6 3 Arizona 5.2 136.4 4 Arkansas 7.8 217.5 5 California 4.8 142.2 6 Colorado 5.1 130.3 7 Connecticut 4.8 145.6 8 Delaware 6.4 168.7 ...
If we want to see the basic stats for the data:
Command used: ver.describe() ------------------- Infant MR Heart Disease DR Stroke DR Suicide DR Drug Poisoning DR \ count 52.000000 52.000000 52.000000 52.000000 52.000000 mean 6.107692 167.740385 36.788462 14.851923 16.036538 std 1.170863 27.728214 5.678361 3.877006 5.602311 min 4.200000 116.500000 25.600000 7.800000 6.300000 25% 5.200000 147.875000 33.550000 12.600000 12.275000 50% 6.150000 159.750000 36.600000 14.200000 14.900000 75% 6.725000 183.625000 41.150000 17.850000 18.550000 max 9.600000 229.900000 48.800000 23.900000 35.500000 Motor Vech DR Cancer DR count 52.000000 52.000000 mean 11.336538 163.761538 std 4.612601 15.223771 min 3.500000 127.400000 25% 7.775000 154.775000 50% 10.250000 163.800000 75% 14.700000 172.075000 max 25.700000 198.800000
and so we can see that the average Infant Mortality Rate is 6.1 with a standard deviation of 1.17. The lowest Infant MR is 4,2 and the highest is 9.6.
Next we want to find out the Top 3 states with the lowest Infant Morality Rates:
Command used: ver.sort(['Infant MR']).head(3) ------------------- RegionState Infant MR Heart Disease DR Stroke DR Suicide DR \ 22 Massachusetts 4.2 137.1 28.7 8.2 16 Iowa 4.2 157.3 34.0 12.9 46 Vermont 4.3 156.6 31.7 18.7 Homicide DR Drug Poisoning DR Motor Vech DR Cancer DR 22 1.60 19.0 4.9 155.5 16 2.50 8.8 10.3 166.0 46 -- 13.9 7.0 167.9
And we see that Massachusetts has the lowest with 4.2. Now we can do a reserve (descending list) to analyse the highest Infant Mortality Rate:
Command used: ver.sort(['Infant MR'],ascending=False).head(3) ------------------- RegionState Infant MR Heart Disease DR Stroke DR Suicide DR \ 25 Mississippi 9.6 229.9 48.8 12.5 19 Louisiana 8.7 216.3 45.6 14.3 1 Alabama 8.6 224.0 48.3 14.5 Homicide DR Drug Poisoning DR Motor Vech DR Cancer DR 25 11.40 11.6 20.3 193.1 19 11.70 16.9 15.9 186.1 1 8.10 15.2 16.9 177.6
In this case Massachusetts has the best mortality rate for infants, but Mississippi has the worst. You should also be able to see that the Heart Disease Death Rate is also much lower in Massachusetts than Mississippi. In fact the Top 3 in both cases seem to differ greatly.
So is there a correlation between the Infant Mortality Rate and Heart Disease? With correlation, we get a value of +1.0 or -1.0 when there is a strong correlation, and 0.0 when there is no correlation. We can now run:
Command used: ver.corr() ------------------- Infant MR Heart Disease DR Stroke DR Suicide DR \ Infant MR 1.000000 0.718922 0.755534 -0.066048 Heart Disease DR 0.718922 1.000000 0.664636 -0.111640 Stroke DR 0.755534 0.664636 1.000000 0.118075 Suicide DR -0.066048 -0.111640 0.118075 1.000000 Drug Poisoning DR 0.151928 0.221105 0.033088 0.209854 Motor Vech DR 0.393358 0.360695 0.538783 0.527052 Cancer DR 0.698498 0.769802 0.642259 -0.130364 Drug Poisoning DR Motor Vech DR Cancer DR Infant MR 0.151928 0.393358 0.698498 Heart Disease DR 0.221105 0.360695 0.769802 Stroke DR 0.033088 0.538783 0.642259 Suicide DR 0.209854 0.527052 -0.130364 Drug Poisoning DR 1.000000 0.039965 0.243367 Motor Vech DR 0.039965 1.000000 0.226029 Cancer DR 0.243367 0.226029 1.000000
We can now see that for Infant MR that there is a strong correlation with Heart Disease DR (Death Rate), Stroke Rate DR and Cancer DR, while there is no direct correlation between the Suicide DR and Infant MR.There is, as expected a strong correlation between Heart Disease DR and the Stroke DR (0.664). The Suicide Rate MR does not have a strong correlation with most factors, apart from Motor Vehicle Death Rates.
If we want, we can just take Infant MR and Heart Disease DR:
Command used: ver[['Infant MR','Heart Disease DR']].corr() ------------------- Infant MR Heart Disease DR Infant MR 1.000000 0.718922 Heart Disease DR 0.718922 1.000000
We can also analyse the correlation in the average income against the other factors:
Av Income Infant MR -0.533132 Heart Disease DR -0.490078 Stroke DR -0.610283 Suicide DR -0.269937 Homicide DR -0.286942 Drug Poisoning DR -0.184603 Motor Vech DR -0.537847 Cancer DR -0.509941 Av Income 1.000000
and we can see the strongest correlation is between the Stroke Death Rate and Average Income. In this case there is a negative correlation, so that as the Stroke Death Rate goes up the Average Income goes down.
We can also look at the covariance which is a measure related to how how much two variables change in the same way. The larger the value, the greater the significance:
Command used: ver.cov() ------------------- Infant MR Heart Disease DR Stroke DR Suicide DR \ Infant MR 1.370920 23.340468 5.023228 -0.299819 Heart Disease DR 23.340468 768.853827 104.647534 -12.001550 Stroke DR 5.023228 104.647534 32.243786 2.599434 Suicide DR -0.299819 -12.001550 2.599434 15.031173 Drug Poisoning DR 0.996576 34.346927 1.052587 4.558066 Motor Vech DR 2.124419 46.132613 14.111802 9.425321 Cancer DR 12.450694 324.954917 55.520724 -7.694434 Drug Poisoning DR Motor Vech DR Cancer DR Infant MR 0.996576 2.124419 12.450694 Heart Disease DR 34.346927 46.132613 324.954917 Stroke DR 1.052587 14.111802 55.520724 Suicide DR 4.558066 9.425321 -7.694434 Drug Poisoning DR 31.385894 1.032756 20.756335 Motor Vech DR 1.032756 21.276090 15.872021 Cancer DR 20.756335 15.872021 231.763198
and where we can see that there is a strong covariance between the Infant MR and Heart Disease DR and Drug Poisoning DR. As we would expect Heart Disease DR has a strong covariance with Stroke DR (104.6).
We can also analysis for thresholds, such as for states with an Infant MR rate is greater than 4.5:
Command used: ver['Infant MR']> 4.8 ------------------- 0 True 1 True 2 True 3 True 4 True 5 False 6 True 7 False 8 True
We now want to see the US states who have an Infant MR greater than 7.0:
Command used: ver[(ver['Infant MR'] > 7.0)] ------------------- RegionState Infant MR Heart Disease DR Stroke DR Suicide DR \ 1 Alabama 8.6 224.0 48.3 14.5 4 Arkansas 7.8 217.5 45.4 17.3 15 Indiana 7.2 182.7 41.7 14.3 19 Louisiana 8.7 216.3 45.6 14.3 20 Maine 7.1 147.9 33.2 15.7 25 Mississippi 9.6 229.9 48.8 12.5 36 Ohio 7.3 186.4 40.0 12.6 49 West Virginia 7.6 192.9 45.3 18.1 Homicide DR Drug Poisoning DR Motor Vech DR Cancer DR 1 8.10 15.2 16.9 177.6 4 7.70 12.6 15.7 183.1 15 5.70 18.2 11.3 179.7 19 11.70 16.9 15.9 186.1 20 2.00 16.8 9.8 170.3 25 11.40 11.6 20.3 193.1 36 5.20 24.6 8.7 177.8 49 5.90 35.5 14.7 195.1
Now we will analyse those states with an Infant MR greater than 7.0, and who have a Heart Disease DR higher than
Command used: ver[(ver['Infant MR'] > 7.0) & (ver['Stroke DR'] > 45.0 )] ------------------- RegionState Infant MR Heart Disease DR Stroke DR Suicide DR \ 1 Alabama 8.6 224.0 48.3 14.5 4 Arkansas 7.8 217.5 45.4 17.3 19 Louisiana 8.7 216.3 45.6 14.3 25 Mississippi 9.6 229.9 48.8 12.5 49 West Virginia 7.6 192.9 45.3 18.1 Homicide DR Drug Poisoning DR Motor Vech DR Cancer DR 1 8.10 15.2 16.9 177.6 4 7.70 12.6 15.7 183.1 19 11.70 16.9 15.9 186.1 25 11.40 11.6 20.3 193.1 49 5.90 35.5 14.7 195.1
We can now perform some linear regression on two of the parameters (Infant MR and Heart Disease):
Command used: sm.OLS(ver['Infant MR'], ver['Heart Disease DR']).fit().summary() ------------------- OLS Regression Results ============================================================================== Dep. Variable: Infant MR R-squared: 0.982 Model: OLS Adj. R-squared: 0.982 Method: Least Squares F-statistic: 2862. Date: Thu, 07 Jul 2016 Prob (F-statistic): 1.78e-46 Time: 12:58:46 Log-Likelihood: -63.625 No. Observations: 52 AIC: 129.2 Df Residuals: 51 BIC: 131.2 Df Model: 1 Covariance Type: nonrobust ==================================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------------ Heart Disease DR 0.0363 0.001 53.502 0.000 0.035 0.038 ============================================================================== Omnibus: 2.111 Durbin-Watson: 2.249 Prob(Omnibus): 0.348 Jarque-Bera (JB): 2.001 Skew: -0.456 Prob(JB): 0.368 Kurtosis: 2.700 Cond. No. 1.00 ==============================================================================
Based on this data, in the two areas analysed, we now see that it is Alabama, Arkansas, Louisiana, MIssissippi and West Virginia that have the great health problems.
Data
The data used is [here]
RegionState,Infant MR,Heart Disease DR,Stroke DR,Suicide DR,Homicide DR,Drug Poisoning DR,Motor Vech DR,Cancer DR National,6.00,167.00,36.50,13.00,5.10,14.70,10.20,161.20 Alabama,8.60,224.00,48.30,14.50,8.10,15.20,16.90,177.60 Alaska,5.80,146.60,32.30,22.10,4.70,16.80,9.90,164.20 Arizona,5.20,136.40,28.30,18.00,5.00,18.20,11.40,142.70 Arkansas,7.80,217.50,45.40,17.30,7.70,12.60,15.70,183.10 California,4.80,142.20,33.90,10.50,4.60,11.10,7.90,144.10 Colorado,5.10,130.30,33.40,19.90,3.30,16.30,9.10,136.00 Connecticut,4.80,145.60,26.30,9.80,2.80,17.60,6.90,146.70 Delaware,6.40,168.70,38.80,13.20,6.60,20.90,12.90,167.30 District of Columbia,6.70,207.80,33.60,7.80,13.70,14.20,3.50,178.60 Florida,6.10,151.30,33.00,13.90,6.20,13.20,12.50,152.90 Georgia,7.00,179.70,42.60,12.60,6.60,11.90,11.50,165.50 Hawaii,6.40,136.70,34.10,13.80,2.20,10.90,6.70,140.00 Idaho,5.60,152.80,36.80,20.00,2.40,13.70,11.40,155.40 Illinois,6.00,169.70,37.40,10.50,6.20,13.10,7.20,168.90 Indiana,7.20,182.70,41.70,14.30,5.70,18.20,11.30,179.70 Iowa,4.20,157.30,34.00,12.90,2.50,8.80,10.30,166.00 Kansas,6.50,157.40,39.00,15.70,3.60,11.70,13.30,166.80 Kentucky,6.40,200.50,41.80,15.90,4.70,24.70,15.20,198.80 Louisiana,8.70,216.30,45.60,14.30,11.70,16.90,15.90,186.10 Maine,7.10,147.90,33.20,15.70,2.00,16.80,9.80,170.30 Maryland,6.60,167.80,38.00,9.80,6.70,17.40,7.40,161.70 Massachusetts,4.20,137.10,28.70,8.20,1.60,19.00,4.90,155.50 Michigan,7.00,200.90,37.90,13.30,6.30,18.00,9.10,174.10 Minnesota,5.10,116.50,34.00,12.20,1.90,9.60,6.60,152.60 Mississippi,9.60,229.90,48.80,12.50,11.40,11.60,20.30,193.10 Missouri,6.50,194.70,41.00,16.30,7.50,18.20,12.60,177.70 Montana,5.60,147.80,36.40,23.90,2.90,12.40,18.80,156.30 Nebraska,5.20,143.00,34.70,13.40,3.40,7.20,12.00,159.60 Nevada,5.30,197.20,33.80,19.60,6.30,18.40,10.20,164.50 New Hampshire,5.60,147.90,28.90,17.80,--,26.20,7.20,160.40 New Jersey,4.50,166.30,31.40,8.30,4.40,14.00,6.20,156.10 New Mexico,5.30,143.30,34.70,21.00,6.80,27.30,18.40,142.40 New York,4.90,178.30,26.10,8.10,3.40,11.30,5.30,151.80 North Carolina,7.00,158.70,43.00,13.00,5.60,13.80,12.90,169.30 North Dakota,6.00,149.20,35.50,17.80,--,6.30,18.30,152.30 Ohio,7.30,186.40,40.00,12.60,5.20,24.60,8.70,177.80 Oklahoma,6.70,228.10,43.00,19.10,6.60,20.30,17.30,179.90 Oregon,4.90,132.10,37.40,18.60,2.40,12.80,9.00,160.20 Pennsylvania,6.60,175.80,36.70,13.30,5.20,21.90,9.30,169.60 Rhode Island,6.50,160.80,25.60,10.10,2.50,23.40,4.90,167.00 South Carolina,6.90,181.10,44.20,15.20,7.60,14.40,17.10,171.40 South Dakota,6.40,154.60,38.80,17.10,3.10,7.80,15.90,163.40 Tennessee,6.80,205.60,45.80,14.10,6.00,19.50,14.70,184.20 Texas,5.80,169.90,41.60,12.20,5.20,9.70,13.10,152.90 Utah,5.20,151.00,37.90,20.50,2.00,22.40,8.70,127.40 Vermont,4.30,156.60,31.70,18.70,--,13.90,7.00,167.90 Virginia,6.20,156.10,37.00,12.90,4.10,11.70,8.40,161.50 Washington,4.50,137.20,34.30,15.20,3.10,13.30,6.50,155.50 West Virginia,7.60,192.90,45.30,18.10,5.90,35.50,14.70,195.10 Wisconsin,6.30,155.10,34.60,13.10,3.00,15.10,8.80,161.80 Wyoming,4.80,162.20,30.20,20.60,4.40,19.40,25.70,140.70