Data Distributions and Charting in Python

I recently reviewed a research paper in which the authors casually quoted a standard deviation value. But the data itself did not follow a normal distribution, and so the metric was useless. There was no analysis of whether the data matched a normal distribution. In places, in research, I do detect a lack of understanding of the good old maths of data distributions, where some just reach for the latest k-means method and care little about actually applying scientific methods to the data.
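One quick sanity check before quoting a standard deviation is to count how much of the data falls within one, two and three standard deviations of the mean. The following is only a minimal sketch using the standard library; the helper name `sigma_band_coverage`, the seed and the sample sizes are assumptions made for illustration, and it is not a formal normality test.

```python
import random
import statistics

def sigma_band_coverage(values):
    """Return the fractions of values within 1, 2 and 3 standard deviations of the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    n = len(values)
    return tuple(
        sum(1 for v in values if abs(v - mean) <= k * std) / n
        for k in (1, 2, 3)
    )

rng = random.Random(42)  # fixed seed (an arbitrary choice) so the run is repeatable
normal_data = [rng.normalvariate(9.0, 2.0) for _ in range(10000)]
uniform_data = [rng.random() for _ in range(10000)]

# A normal distribution gives roughly 0.683, 0.954 and 0.997.
print("Normal: ", sigma_band_coverage(normal_data))
# Uniform data gives roughly 0.577, 1.0 and 1.0, so the usual reading of sigma does not apply.
print("Uniform:", sigma_band_coverage(uniform_data))
```

If the fractions are far from the normal values, the standard deviation should not be read as if the data were normally distributed.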

We can place our data into a number of bins, and then plot the number of values that fall into each bin. Dividing each count by the total number of values gives an estimate of the probability of occurrence within a bin range. In these plots we have uniform random values, a normal distribution, and other distributions:

Sample histograms
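The binning step can be sketched by hand, without a plotting library. This is an illustrative version only: the function name `histogram`, the equal-width bins between the minimum and maximum, and the seed are assumptions, and it expects the values not to be all identical.

```python
import random

def histogram(values, bins):
    """Count values into equal-width bins between min(values) and max(values)."""
    low, high = min(values), max(values)
    width = (high - low) / bins  # assumes high > low, i.e. the values are not all equal
    counts = [0] * bins
    for v in values:
        # The maximum value would index one past the end, so clamp it into the last bin.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return counts, edges

rng = random.Random(1)  # fixed seed (an arbitrary choice) so the run is repeatable
samples = [rng.normalvariate(9.0, 2.0) for _ in range(1000)]
counts, edges = histogram(samples, bins=10)
total = len(samples)
for i, count in enumerate(counts):
    # count / total estimates the probability of a value landing in this bin
    print(f"{edges[i]:6.2f} to {edges[i + 1]:6.2f}: {count / total:.3f}")
```

The printed fractions sum to one, and for normal samples the largest values sit in the bins around the mean.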

If our data distribution matches one of these, we can describe the data values with a distribution function. In the case of a normal distribution we have:

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where μ is the mean and σ is the standard deviation. We can use Python to generate samples from each distribution and plot their histograms [here]:

# https://asecuritysite.com/comms/dist
import matplotlib.pyplot as plt
import random
import sys

# Defaults, overridden by the command line arguments: file mu sigma samples
file = '1111'
mu = 9.0
sig = 2.0
samples = 10000

if len(sys.argv) > 1:
    file = str(sys.argv[1])
if len(sys.argv) > 2:
    mu = float(sys.argv[2])
if len(sys.argv) > 3:
    sig = float(sys.argv[3])
if len(sys.argv) > 4:
    samples = int(sys.argv[4])

fig, myplot = plt.subplots(4, 1, figsize=(8, 8))

# Uniform samples in the range 0 to 1
uniSamples = [random.random() for i in range(samples)]
myplot[0].hist(uniSamples, bins=100)
myplot[0].set_title("Uniform random number generator histogram")
myplot[0].set_xlabel("x")
myplot[0].set_ylabel("Frequency of occurrence")
print("Uni: ", uniSamples[0:10])  # Take a look at the first 10

# Normal samples with mean mu and standard deviation sig
normSamples = [random.normalvariate(mu, sig) for i in range(samples)]
myplot[1].hist(normSamples, bins=100)
myplot[1].set_title(rf"Normal Histogram RNG $\mu = {mu}$ and $\sigma = {sig}$")
myplot[1].set_xlabel("x")
myplot[1].set_ylabel("Frequency of occurrence")
print("Norm: ", normSamples[0:10])  # Take a look at the first 10

# Triangular samples between 0 and 1, with the mode at 0.5
triSamples = [random.triangular(0, 1, 0.5) for i in range(samples)]
myplot[2].hist(triSamples, bins=100)
myplot[2].set_title("Triangular Histogram RNG")
myplot[2].set_xlabel("x")
myplot[2].set_ylabel("Frequency of occurrence")
print("Tri: ", triSamples[0:10])  # Take a look at the first 10

# Weibull samples: here mu is used as the scale (alpha) and sig as the shape (beta)
weibullSamples = [random.weibullvariate(mu, sig) for i in range(samples)]
myplot[3].hist(weibullSamples, bins=100)
myplot[3].set_title("Weibull Histogram RNG")
myplot[3].set_xlabel("x")
myplot[3].set_ylabel("Frequency of occurrence")
print("Weibull: ", weibullSamples[0:10])  # Take a look at the first 10

# Apply the layout once all the titles and labels exist, so that they do not overlap
plt.tight_layout(w_pad=1.5, h_pad=2.0)
plt.savefig(file)
plt.show()
print("Saved to ", file)
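To relate the normal histogram back to its distribution function, the density for a mean μ and standard deviation σ can be evaluated directly as 1/(σ√(2π)) · exp(−(x−μ)²/(2σ²)). The sketch below is an illustration under assumptions: the function name `normal_pdf` and the chosen x values are made up for the example, and `statistics.NormalDist` from the standard library is used only as a cross-check.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

mu, sigma = 9.0, 2.0  # the default values used in the script above
for x in (5.0, 7.0, 9.0, 11.0, 13.0):
    ours = normal_pdf(x, mu, sigma)
    reference = NormalDist(mu, sigma).pdf(x)  # standard library cross-check
    print(f"x = {x:5.1f}  formula = {ours:.6f}  NormalDist = {reference:.6f}")
```

The density peaks at x = μ and is symmetric around it, which is the shape the normal histogram approaches as the number of samples grows.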

Here are some sample runs:

  • μ=5, σ=3, Samples=100
  • μ=5, σ=3, Samples=1,000
  • μ=5, σ=3, Samples=10,000
  • μ=5, σ=3, Samples=100,000
  • μ=5, σ=3, Samples=1,000, Bins=10
  • μ=5, σ=3, Samples=1,000, Bins=50
  • μ=5, σ=3, Samples=1,000, Bins=100
  • μ=5, σ=3, Samples=1,000, Bins=200
  • μ=5, σ=10
  • μ=5, σ=1
  • μ=5, σ=0.5
  • μ=5, σ=5
  • μ=55, σ=10. This is the mark distribution we often aim for in academia.
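The effect of the sample count in the runs above can also be seen numerically, by comparing the sample mean and standard deviation against the requested μ and σ. This is a rough sketch rather than the code behind those runs; the function name `estimate` and the fixed seed are assumptions for illustration.

```python
import random
import statistics

def estimate(mu, sigma, samples, seed=0):
    """Draw normal samples and return their sample mean and standard deviation."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) so the run is repeatable
    values = [rng.normalvariate(mu, sigma) for _ in range(samples)]
    return statistics.fmean(values), statistics.stdev(values)

for n in (100, 1000, 10000, 100000):
    mean, std = estimate(5.0, 3.0, n)
    print(f"samples = {n:>6}  mean = {mean:.3f}  std = {std:.3f}")
```

With more samples, the estimates tend to settle closer to μ=5 and σ=3, in the same way that the histogram looks smoother.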

Conclusions

So, don’t just reach for the k-means clustering method first. Know your data and its distribution!