Python, Pandas and COVID-19

We live in a modern world of data, and we should be using it in every way that we can to understand trends. But for many, it will be…

Photo by NASA on Unsplash

Python, Pandas and COVID-19

We live in a modern world of data, and we should be using it in every way that we can to understand trends. But for many, it will be Microsoft Excel that they may turn to for any sort of analysis. But after Googling for transposing columns to rows, and then trying to merge row, you kinda give up.

The answer is to use Python and Pandas, and which allows us to quickly process data and get it into the right format. Let’s start on a data set which contains date, country, province, lat, long, confirmed cases, recovered cases, and deaths:

To say that we just want the current date (9 April 2020), and ignore all the other dates. For this we can create the code of:

import pandas as pd
filename = 'covid03.csv'
date="2020-04-09"
ds = pd.read_csv(filename)
df = pd.DataFrame(ds)
ds1 = df[df['Date'].str.contains(date)]
print (ds1)

This will filter out all the rows which are not matched to ‘2020–04–09’ and will give:

       Date            Country/Region  ... Recovered  Deaths
78 2020-04-09 Afghanistan ... 32.0 15.0
157 2020-04-09 Albania ... 165.0 23.0
236 2020-04-09 Algeria ... 347.0 235.0
315 2020-04-09 Andorra ... 58.0 25.0
394 2020-04-09 Angola ... 2.0 2.0
... ... ... ... ... ...
20460 2020-04-09 Vietnam ... 128.0 0.0
20539 2020-04-09 West Bank and Gaza ... 44.0 1.0
20618 2020-04-09 Western Sahara ... 0.0 0.0
20697 2020-04-09 Zambia ... 24.0 1.0
20776 2020-04-09 Zimbabwe ... 0.0 3.0

Next, we will fill in any Not A Number (NAN) with a blank, and drop the Date, Lat and Long columns:

ds1=ds1.fillna(‘’)
ds1=ds1.drop([‘Date’,’Lat’,’Long’],axis=1)

Next we will merge the Country/Region and Province/State columns, so that we get a unique column (“Name”), as, for example, China, France and the UK define its data with provinces. We can then drop the Country/Region and Province/State columns, and add the new one:

ds1[“Name”] = ds1[“Country/Region”] +” “+ ds1[“Province/State”]
ds1=ds1.drop([‘Country/Region’,’Province/State’],axis=1)
ds1=ds1[[‘Name’,’Confirmed’,’Recovered’,’Deaths’]]

Finally we can save our data frame to a CSV file, and load it up for analysis:

ds1.to_csv(‘out.csv’, index=False)

This gives:

The code for this is here:

Now let’s load this up for analysis [here]:

Those on the upper left hand side are the countries which have more recovered over confirmed. And for our data analysis we get:

OLS Regression Results 
====================================================================
Dep. Variable: Confirmed R-squared: 0.409
Model: OLS Adj. R-squared: 0.407
Method: Least Squares F-statistic: 171.0
Date: Fri, 10 Apr 2020 Prob (F-statistic): 4.76e-30
Time: 09:27:25 Log-Likelihood: -2881.3
No. Observations: 248 AIC: 5765.
Df Residuals: 247 BIC: 5768.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Recovered 3.1040 0.237 13.077 0.000 2.637 3.572
==============================================================================
Omnibus: 477.574 Durbin-Watson: 1.837
Prob(Omnibus): 0.000 Jarque-Bera (JB): 281695.867
Skew: 11.123 Prob(JB): 0.00
Kurtosis: 166.603 Cond. №1.00
==============================================================================

Conclusions

Spreadsheets are history, Python is the future!