Introduction to Statistics for Astronomers and Physicists

Section 0: Course Format and Outline, and a Crash Course in R and Python

Dr Angus H Wright

2022-04-06

Welcome

Welcome to the Introduction to Statistics for Astronomers and Physicists course for the Summer Semester 2022.

Course Philosophy

What is this course?

How will we do this?

Course Outline

This course will be taught in 4 parts, each spanning 2-4 weeks.

Section 1: Data Description, Analysis, and Modelling (Weeks 1-2)

When working in empirical science, modelling and understanding datasets is paramount. In this module we start by discussing the fundamentals of data modelling: theories of point and interval estimation, in the context of summary statistics (e.g. expectation values, confidence intervals), and estimation of data correlation and covariance. Students will learn the fundamentals of data mining and analysis, in a way that is applicable to all physical sciences.

Topics include:

Section 2: Probability & Decision Making (Weeks 3-5)

For all aspects of modern science, an understanding of probability is required. We cover a range of topics in probability, from decision theory and the fundamentals of probability theory, to standard probabilistic distributions and their origin. From this module, students will gain an insight into different statistical distributions that govern modern observational sciences, the interpretation of these distributions, and how one accurately models distributions of data in an unbiased manner.

Topics include:

Section 3: Bayesian Statistics (Weeks 6-8)

Bayes' theorem led to a revolution in statistics, via the concepts of prior and posterior evidence. In modern astronomy and physics, applications of Bayesian statistics are widespread. We begin with a study of Bayes' theorem, and the fundamental differences between frequentist and Bayesian analysis. We explore applications of Bayesian statistics through well-studied statistical problems (both within and outside of physics).

Topics include:

Section 4: Parameter Simulation, Optimisation, and Analysis (Weeks 9-12)

We apply our understanding of Bayesian statistics to the common problems of parameter simulation, optimisation, and inference. Students will learn the fundamentals of Monte Carlo Simulation, Markov Chain Monte Carlo (MCMC) analysis, hypothesis testing, and quantifying goodness-of-fit. We discuss common errors in parameter inference, including standard physical and astrophysical biases that corrupt statistical analyses.

Topics include:

Rmarkdown

Slides and lecture notes for this course are prepared in Rmarkdown, and provided to you after the lectures.

#in R
library(magicaxis) #provides magplot
theta=seq(0,2,by=0.01)
sinu=1+sin(2*pi*theta)
magplot(theta,sinu,type='l')
#or in python
import numpy as np
import matplotlib.pyplot as plt
theta=np.arange(0.,2.,0.01)
sinu=1+np.sin(2*np.pi*theta)
plt.plot(theta,sinu)
plt.show()

Rmarkdown

Information generated and stored within blocks is persistent, and code-blocks with different engines can also cross-communicate.

#Create some data in R
#E.g. random draws from f~N(0,1)
x=rnorm(1e3) 
y=rnorm(1e3) 
#Default plot in R
plot(x,y)
#and access it directly in python
plt.scatter(r.x,r.y)
plt.show()

Slido

To promote interaction, we will use Slido.

So ask whenever and about whatever you think is relevant/interesting/unclear. The more the better!

Learning Objectives

An example of the Learning Objectives for a later lecture is given here:

Learning Objectives: Summarising relationships in 2D

Understand graphical methods of exploring observations in two or more variables, such as:

Understand the concepts of:

Be able to describe the construction of a covariance/correlation matrix

Understand the differences between Pearson and Spearman Correlation, and describe the uses of each.

Understand the limitations of correlation measures, and the logical fallacy of correlation and causation.

Understand the concept of confounding variables and their role in correlation.

Lecture Notes, Slides

The lecture notes for the course, as well as the slides, are available via the website of the lecturer: https://anguswright.github.io

Lecture Materials

Statistics and Computing

This is a lecture course on Statistics and Statistical methods.

Why do we need programming?

Modern physics and astronomy requires an understanding of programming.

von Hoerner 1960

Why do we need programming?

Holmberg 1941

Why do we need programming?

Holmberg 1941

Why do we need programming?

The surprise? His work was computed entirely by hand.

Holmberg bulb arrangement

Reproducing Holmberg in 2022

Holmberg 1941

But scalability is the main benefit

Modern Science and the Requirement of Programming

The need for programming in the modern physical sciences is linked to the importance of statistics.

Take a simple optimisation problem, as an example:

\[\begin{equation} \begin{gathered} Y = {\rm Beta}(X, \alpha,\beta) + {\rm noise} \\ 0\leq X \leq 1 \\ \alpha, \beta \in [0,\infty) \end{gathered} \end{equation}\]
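For concreteness, here is a minimal R sketch of how such mock data might be generated (the seed, parameter values, and noise level are illustrative assumptions):

#Generate mock data: a Beta PDF plus Gaussian noise
set.seed(42)  #assumed seed, for reproducibility
x<-runif(1e3) #x values drawn uniformly from [0,1]
y<-dbeta(x,shape1=3.2,shape2=7.7)+rnorm(1e3,sd=0.1) #assumed shapes and noise level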

A simple optimisation

One might be inclined to attempt to fit a model to these data by hand, using trial and error:

A simple optimisation

Using this approach we can get a reasonable fit with \(\alpha=2.8,\beta=6.7\):

But is this solution near the truth? How close is good enough? And what are the uncertainties on the parameters?
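As a sketch (assuming the data are stored in vectors x and y, as above), this trial-and-error model can be overplotted in R:

#Plot the data and overlay the trial-and-error model
plot(x,y)
curve(dbeta(x,shape1=2.8,shape2=6.7),add=TRUE,col='red')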

True optimisation

#Estimate parameters using Nonlinear Least Squares (nls) in R 
fit_R=nls(y~dbeta(x,alpha,beta), #The function to fit
          data=list(x=x,y=y),    #The data 
          start=list(alpha=2,beta=5), #The initial guess
          algorithm='port',      #The algorithm 
          lower=c(0,0))          #The lower bounds
best_R=summary(fit_R)$parameters
#Estimate parameters using scipy.optimize.curve_fit in python
import scipy.optimize
import scipy.stats
best_py, cov_py = scipy.optimize.curve_fit(
         scipy.stats.beta.pdf, #The function to fit
         r.x, r.y,         #The data
         p0=[2,5],         #The initial guess 
         bounds=(0,np.inf),#The lower and upper bounds
         method='trf')     #The fitting algorithm

True optimisation

Obviously these fits are superior to those we can reach by hand, in terms of accuracy, effort, and runtime.

True optimisation

But the most important benefit is in terms of uncertainty estimation. Statistical computing is important for a wide range of reasons, but arguably the first and most important reason is for the computation of measures of uncertainty.

#Model statistics in R
summary(fit_R)
## 
## Formula: y ~ dbeta(x, alpha, beta)
## 
## Parameters:
##       Estimate Std. Error t value Pr(>|t|)    
## alpha  3.21166    0.01453   221.0   <2e-16 ***
## beta   7.70586    0.03728   206.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09854 on 998 degrees of freedom
## 
## Algorithm "port", convergence message: both X-convergence and relative convergence (5)

True optimisation

Best-fit model in R
Best-fit model in python

Do I need to know R or python?

Overall, there is one major consideration that will (and should) drive your choice of which language to learn first:

Focus on whichever suits you best (and understand both if you can)

Any perceived benefit or detriment of the languages will invariably be overwhelmed by whether or not you are able to share and discuss code together with your colleagues. So, this is mostly a case where joining the herd is probably the sensible choice.

What’s the difference between R and Python?

R:

python:

What will we use here?

The vast majority of examples in this course will be programmed in R.

However, I am happy to rewrite examples in python that you think would be particularly useful!




Tell me what problems you would like rewritten!

A Crash Course in R and python

A few important NBs:



I am certainly not a python expert.

A Crash Course in R and python

The following slides cover:

FAQ: What are python2 and python3?

Like all things, python has different versions.

When the language itself changes in some significant way, the primary version number is updated: python2 becomes python3.

Smaller (but still significant) changes happen too, and these cause the secondary version number to change: python3.5 becomes python3.6.

Importantly: The developers of python officially removed support for all python2 versions a few years ago.

There is no reason that you, in 2022, should be using python2.

For the remainder of this course, whenever I say python I will mean python3.

Installing and loading Libraries

Packages in R are installed from within the R session, whereas python packages are installed from the command line with a separate program, pip:
#within the R session
#Install a packages called "remotes"
install.packages("remotes") 
#from the _commandline_
#install the numpy package using pip 
pip install numpy
Loading these packages into R and python is something we’ve already seen a few times:
#in R: load the remotes package
library("remotes") 
#in python: load numpy
import numpy as np
To install packages from github:
#in R: load the remotes package
library("remotes") 
#and install the "Rfits" package 
#from github user ASGR:
install_github("ASGR/Rfits")
#from the commandline
#pip install the django package
pip install git+https://github.com/django/django.git

Code Structure and Control Functions

The python and R languages differ in the structure of their code:

#Conditional statements in R
#If statements 
if (condition) { 
  #Evaluate if condition == true
} else { 
  #Evaluate if condition == false
}

#For statements 
for (var in sequence) { 
  #Evaluate 
}

#While statements 
while (condition) { 
  #Evaluate until condition == false
}
#Conditional statements in python
#If statements 
if condition:  
  pass #Evaluate if condition == true
else:
  pass #Evaluate if condition == false

#For statements 
for var in sequence:  
  pass #Evaluate 

#While statements 
while condition: 
  pass #Evaluate until condition == false
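As a concrete, runnable illustration (with arbitrary example values), the three control structures combine in R as follows:

#A runnable R example combining all three structures
total<-0
for (i in 1:5) {      #loop over a sequence
  if (i %% 2 == 0) {  #condition: i is even
    total<-total+i
  }
}
while (total < 10) {  #loop until the condition is false
  total<-total+1
}
print(total)
## [1] 10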

Code Structure and Control Functions in R

The difference in “valid formatting” is clearest with nested loops:

#Valid Nested for loops in R 
mat<-matrix(0,3,3)
#Standard nested For loops 
for (col in 1:ncol(mat)) { 
  for (row in 1:col) { 
    mat[row,col]<-1
  }
}
mat
##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    0    1    1
## [3,]    0    0    1

Code Structure and Control Functions in R

#Without the brackets works...
mat<-matrix(0,3,3)
for (col in 1:ncol(mat)) 
  for (row in 1:col) 
    mat[row,col]<-2
mat
##      [,1] [,2] [,3]
## [1,]    2    2    2
## [2,]    0    2    2
## [3,]    0    0    2
#... but only for one line at a time
mat<-matrix(0,3,3)
for (col in 1:ncol(mat)) 
  for (row in 1:col) 
    silly<-"mistake"
    mat[row,col]<-3
mat
##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    0    0    0
## [3,]    0    0    3

Code Structure and Control Functions in R

#To hammer the point: 
#This works too (but please don't)
mat<-matrix(0,3,3)
for 
(col 
in 
1:ncol(mat)
)
{
for (row in 1:col) { mat[row,col]<-4
}} 
mat
##      [,1] [,2] [,3]
## [1,]    4    4    4
## [2,]    0    4    4
## [3,]    0    0    4

Code Structure and Control Functions in python

Conversely python has only one valid format for nested loops:

#Valid Nested for loops in python 
mat=np.zeros([3,3])
#Standard nested For loops 
for col in range(mat.shape[0]): 
  for row in range(col):  
    mat[row,col]=1
mat
## array([[0., 1., 1.],
##        [0., 0., 1.],
##        [0., 0., 0.]])

Built-in Functions

Recall our "random" function that we named \({\rm Beta}\):

\[ Y = {\rm Beta}(X, \alpha,\beta) + {\rm noise} \]

This is a standard statistical distribution called the “Beta distribution”, which is described by the “shape” parameters \(\alpha\) and \(\beta\). We will discuss this distribution more in a few weeks’ time.

Built-in Functions Example: \({\rm Beta}\)

In R, the \({\rm Beta}\) family of distributions is native. For example, if we want to evaluate the value of a particular Beta PDF (like the one from earlier) at some \(x\) value:

#Return the value of a Beta PDF at x=0.1
dbeta(0.1,shape1=3.2,shape2=7.7)
## [1] 1.331313

In python the \({\rm Beta}\) distributions come within the scipy package:

#Load the scipy.stats submodule
import scipy.stats
#Return the value we want
scipy.stats.beta.pdf(0.1,3.2,7.7)
## 1.3313131932303182

A note about python syntax

In python the . indicates access to “attributes” of an object. We’ll see more about this later, but this is important to understand now.

Importantly, we can shortcut some of this at the import stage.

#Import the stats module 
import scipy.stats as stat
#Use the function 
stat.beta.pdf(0.1,3.2,7.7)
## 1.3313131932303182

We can even import just a specific function:

#Import the beta function
from scipy.stats import beta 
#Use the function 
beta.pdf(0.1,3.2,7.7)
## 1.3313131932303182

Built-in Functions: help()

If we want to know more about a function in R, we use the help() command, or type the function name with a leading question mark:

#Documentation for the Beta functions in R
help(dbeta) # or ?dbeta 

R has very rigorous standards of documentation:

Built-in Functions: Examples

The last item is particularly useful, as examples for almost all functions in R can be run by just using the example() function:

#Running examples for functions in R
example(dbeta)

Built-in Functions: help()

Documentation for python functions can also be seen in a similar manner:

#Documentation for the Beta functions in python
help(stat.beta)
## Help on beta_gen in module scipy.stats._continuous_distns object:
## 
## class beta_gen(scipy.stats._distn_infrastructure.rv_continuous)
##  |  beta_gen(momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, extradoc=None, seed=None)
##  |  
##  |  A beta continuous random variable.
##  |  
##  |  %(before_notes)s
##  |  
##  |  Notes
##  |  -----
##  |  The probability density function for `beta` is:
##  |  
##  |  .. math::
##  |  
##  |      f(x, a, b) = \frac{\Gamma(a+b) x^{a-1} (1-x)^{b-1}}
##  |                        {\Gamma(a) \Gamma(b)}
##  |  
##  |  for :math:`0 <= x <= 1`, :math:`a > 0`, :math:`b > 0`, where
##  |  :math:`\Gamma` is the gamma function (`scipy.special.gamma`).
##  |  
##  |  `beta` takes :math:`a` and :math:`b` as shape parameters.
##  |  
##  |  %(after_notes)s
##  |  
##  |  %(example)s
##  |  
##  |  Method resolution order:
##  |      beta_gen
##  |      scipy.stats._distn_infrastructure.rv_continuous
##  |      scipy.stats._distn_infrastructure.rv_generic
##  |      builtins.object
##  |  
##  |  Methods defined here:
##  |  
##  |  fit(self, data, *args, **kwds)
##  |      Return MLEs for shape (if applicable), location, and scale
##  |      parameters from data.
##  |      
##  |      MLE stands for Maximum Likelihood Estimate.  Starting estimates for
##  |      the fit are given by input arguments; for any arguments not provided
##  |      with starting estimates, ``self._fitstart(data)`` is called to generate
##  |      such.
##  |      
##  |      One can hold some parameters fixed to specific values by passing in
##  |      keyword arguments ``f0``, ``f1``, ..., ``fn`` (for shape parameters)
##  |      and ``floc`` and ``fscale`` (for location and scale parameters,
##  |      respectively).
##  |      
##  |      Parameters
##  |      ----------
##  |      data : array_like
##  |          Data to use in calculating the MLEs.
##  |      arg1, arg2, arg3,... : floats, optional
##  |          Starting value(s) for any shape-characterizing arguments (those not
##  |          provided will be determined by a call to ``_fitstart(data)``).
##  |          No default value.
##  |      kwds : floats, optional
##  |          - `loc`: initial guess of the distribution's location parameter.
##  |          - `scale`: initial guess of the distribution's scale parameter.
##  |      
##  |          Special keyword arguments are recognized as holding certain
##  |          parameters fixed:
##  |      
##  |          - f0...fn : hold respective shape parameters fixed.
##  |            Alternatively, shape parameters to fix can be specified by name.
##  |            For example, if ``self.shapes == "a, b"``, ``fa`` and ``fix_a``
##  |            are equivalent to ``f0``, and ``fb`` and ``fix_b`` are
##  |            equivalent to ``f1``.
##  |      
##  |          - floc : hold location parameter fixed to specified value.
##  |      
##  |          - fscale : hold scale parameter fixed to specified value.
##  |      
##  |          - optimizer : The optimizer to use.  The optimizer must take ``func``,
##  |            and starting position as the first two arguments,
##  |            plus ``args`` (for extra arguments to pass to the
##  |            function to be optimized) and ``disp=0`` to suppress
##  |            output as keyword arguments.
##  |      
##  |      Returns
##  |      -------
##  |      mle_tuple : tuple of floats
##  |          MLEs for any shape parameters (if applicable), followed by those
##  |          for location and scale. For most random variables, shape statistics
##  |          will be returned, but there are exceptions (e.g. ``norm``).
##  |      
##  |      Notes
##  |      -----
##  |      This fit is computed by maximizing a log-likelihood function, with
##  |      penalty applied for samples outside of range of the distribution. The
##  |      returned answer is not guaranteed to be the globally optimal MLE, it
##  |      may only be locally optimal, or the optimization may fail altogether.
##  |      If the data contain any of np.nan, np.inf, or -np.inf, the fit routine
##  |      will throw a RuntimeError.
##  |      
##  |      In the special case where both `floc` and `fscale` are given, a
##  |      `ValueError` is raised if any value `x` in `data` does not satisfy
##  |      `floc < x < floc + fscale`.
##  |      
##  |      Examples
##  |      --------
##  |      
##  |      Generate some data to fit: draw random variates from the `beta`
##  |      distribution
##  |      
##  |      >>> from scipy.stats import beta
##  |      >>> a, b = 1., 2.
##  |      >>> x = beta.rvs(a, b, size=1000)
##  |      
##  |      Now we can fit all four parameters (``a``, ``b``, ``loc`` and ``scale``):
##  |      
##  |      >>> a1, b1, loc1, scale1 = beta.fit(x)
##  |      
##  |      We can also use some prior knowledge about the dataset: let's keep
##  |      ``loc`` and ``scale`` fixed:
##  |      
##  |      >>> a1, b1, loc1, scale1 = beta.fit(x, floc=0, fscale=1)
##  |      >>> loc1, scale1
##  |      (0, 1)
##  |      
##  |      We can also keep shape parameters fixed by using ``f``-keywords. To
##  |      keep the zero-th shape parameter ``a`` equal 1, use ``f0=1`` or,
##  |      equivalently, ``fa=1``:
##  |      
##  |      >>> a1, b1, loc1, scale1 = beta.fit(x, fa=1, floc=0, fscale=1)
##  |      >>> a1
##  |      1
##  |      
##  |      Not all distributions return estimates for the shape parameters.
##  |      ``norm`` for example just returns estimates for location and scale:
##  |      
##  |      >>> from scipy.stats import norm
##  |      >>> x = norm.rvs(a, b, size=1000, random_state=123)
##  |      >>> loc1, scale1 = norm.fit(x)
##  |      >>> loc1, scale1
##  |      (0.92087172783841631, 2.0015750750324668)
##  |  
##  |  ----------------------------------------------------------------------
##  |  Methods inherited from scipy.stats._distn_infrastructure.rv_continuous:
##  |  
##  |  __getstate__(self)
##  |  
##  |  __init__(self, momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, extradoc=None, seed=None)
##  |      Initialize self.  See help(type(self)) for accurate signature.
##  |  
##  |  cdf(self, x, *args, **kwds)
##  |      Cumulative distribution function of the given RV.
##  |      
##  |      Parameters
##  |      ----------
##  |      x : array_like
##  |          quantiles
##  |      arg1, arg2, arg3,... : array_like
##  |          The shape parameter(s) for the distribution (see docstring of the
##  |          instance object for more information)
##  |      loc : array_like, optional
##  |          location parameter (default=0)
##  |      scale : array_like, optional
##  |          scale parameter (default=1)
##  |      
##  |      Returns
##  |      -------
##  |      cdf : ndarray
##  |          Cumulative distribution function evaluated at `x`
##  |  
##  |  expect(self, func=None, args=(), loc=0, scale=1, lb=None, ub=None, conditional=False, **kwds)
##  |      Calculate expected value of a function with respect to the
##  |      distribution by numerical integration.
##  |      
##  |      The expected value of a function ``f(x)`` with respect to a
##  |      distribution ``dist`` is defined as::
##  |      
##  |                  ub
##  |          E[f(x)] = Integral(f(x) * dist.pdf(x)),
##  |                  lb
##  |      
##  |      where ``ub`` and ``lb`` are arguments and ``x`` has the ``dist.pdf(x)``
##  |      distribution. If the bounds ``lb`` and ``ub`` correspond to the
##  |      support of the distribution, e.g. ``[-inf, inf]`` in the default
##  |      case, then the integral is the unrestricted expectation of ``f(x)``.
##  |      Also, the function ``f(x)`` may be defined such that ``f(x)`` is ``0``
##  |      outside a finite interval in which case the expectation is
##  |      calculated within the finite range ``[lb, ub]``.
##  |      
##  |      Parameters
##  |      ----------
##  |      func : callable, optional
##  |          Function for which integral is calculated. Takes only one argument.
##  |          The default is the identity mapping f(x) = x.
##  |      args : tuple, optional
##  |          Shape parameters of the distribution.
##  |      loc : float, optional
##  |          Location parameter (default=0).
##  |      scale : float, optional
##  |          Scale parameter (default=1).
##  |      lb, ub : scalar, optional
##  |          Lower and upper bound for integration. Default is set to the
##  |          support of the distribution.
##  |      conditional : bool, optional
##  |          If True, the integral is corrected by the conditional probability
##  |          of the integration interval.  The return value is the expectation
##  |          of the function, conditional on being in the given interval.
##  |          Default is False.
##  |      
##  |      Additional keyword arguments are passed to the integration routine.
##  |      
##  |      Returns
##  |      -------
##  |      expect : float
##  |          The calculated expected value.
##  |      
##  |      Notes
##  |      -----
##  |      The integration behavior of this function is inherited from
##  |      `scipy.integrate.quad`. Neither this function nor
##  |      `scipy.integrate.quad` can verify whether the integral exists or is
##  |      finite. For example ``cauchy(0).mean()`` returns ``np.nan`` and
##  |      ``cauchy(0).expect()`` returns ``0.0``.
##  |      
##  |      The function is not vectorized.
##  |      
##  |      Examples
##  |      --------
##  |      
##  |      To understand the effect of the bounds of integration consider
##  |      
##  |      >>> from scipy.stats import expon
##  |      >>> expon(1).expect(lambda x: 1, lb=0.0, ub=2.0)
##  |      0.6321205588285578
##  |      
##  |      This is close to
##  |      
##  |      >>> expon(1).cdf(2.0) - expon(1).cdf(0.0)
##  |      0.6321205588285577
##  |      
##  |      If ``conditional=True``
##  |      
##  |      >>> expon(1).expect(lambda x: 1, lb=0.0, ub=2.0, conditional=True)
##  |      1.0000000000000002
##  |      
##  |      The slight deviation from 1 is due to numerical integration.
##  |  
##  |  fit_loc_scale(self, data, *args)
##  |      Estimate loc and scale parameters from data using 1st and 2nd moments.
##  |      
##  |      Parameters
##  |      ----------
##  |      data : array_like
##  |          Data to fit.
##  |      arg1, arg2, arg3,... : array_like
##  |          The shape parameter(s) for the distribution (see docstring of the
##  |          instance object for more information).
##  |      
##  |      Returns
##  |      -------
##  |      Lhat : float
##  |          Estimated location parameter for the data.
##  |      Shat : float
##  |          Estimated scale parameter for the data.
##  |  
##  |  isf(self, q, *args, **kwds)
##  |      Inverse survival function (inverse of `sf`) at q of the given RV.
##  |      
##  |      Parameters
##  |      ----------
##  |      q : array_like
##  |          upper tail probability
##  |      arg1, arg2, arg3,... : array_like
##  |          The shape parameter(s) for the distribution (see docstring of the
##  |          instance object for more information)
##  |      loc : array_like, optional
##  |          location parameter (default=0)
##  |      scale : array_like, optional
##  |          scale parameter (default=1)
##  |      
##  |      Returns
##  |      -------
##  |      x : ndarray or scalar
##  |          Quantile corresponding to the upper tail probability q.
##  |  
##  |  logcdf(self, x, *args, **kwds)
##  |      Log of the cumulative distribution function at x of the given RV.
##  |      
##  |      Parameters
##  |      ----------
##  |      x : array_like
##  |          quantiles
##  |      arg1, arg2, arg3,... : array_like
##  |          The shape parameter(s) for the distribution (see docstring of the
##  |          instance object for more information)
##  |      loc : array_like, optional
##  |          location parameter (default=0)
##  |      scale : array_like, optional
##  |          scale parameter (default=1)
##  |      
##  |      Returns
##  |      -------
##  |      logcdf : array_like
##  |          Log of the cumulative distribution function evaluated at x
##  |  
##  |  logpdf(self, x, *args, **kwds)
##  |      Log of the probability density function at x of the given RV.
##  |      
##  |      This uses a more numerically accurate calculation if available.
##  |      
##  |      Parameters
##  |      ----------
##  |      x : array_like
##  |          quantiles
##  |      arg1, arg2, arg3,... : array_like
##  |          The shape parameter(s) for the distribution (see docstring of the
##  |          instance object for more information)
##  |      loc : array_like, optional
##  |          location parameter (default=0)
##  |      scale : array_like, optional
##  |          scale parameter (default=1)
##  |      
##  |      Returns
##  |      -------
##  |      logpdf : array_like
##  |          Log of the probability density function evaluated at x
##  |  
##  |  logsf(self, x, *args, **kwds)
##  |      Log of the survival function of the given RV.
##  |      
##  |      Returns the log of the "survival function," defined as (1 - `cdf`),
##  |      evaluated at `x`.
##  |      
##  |      Parameters
##  |      ----------
##  |      x : array_like
##  |          quantiles
##  |      arg1, arg2, arg3,... : array_like
##  |          The shape parameter(s) for the distribution (see docstring of the
##  |          instance object for more information)
##  |      loc : array_like, optional
##  |          location parameter (default=0)
##  |      scale : array_like, optional
##  |          scale parameter (default=1)
##  |      
##  |      Returns
##  |      -------
##  |      logsf : ndarray
##  |          Log of the survival function evaluated at `x`.
##  |  
##  |  nnlf(self, theta, x)
##  |      Return negative loglikelihood function.
##  |      
##  |      Notes
##  |      -----
##  |      This is ``-sum(log pdf(x, theta), axis=0)`` where `theta` are the
##  |      parameters (including loc and scale).
##  |  
##  |  pdf(self, x, *args, **kwds)
##  |      Probability density function at x of the given RV.
##  |      
##  |      Parameters
##  |      ----------
##  |      x : array_like
##  |          quantiles
##  |      arg1, arg2, arg3,... : array_like
##  |          The shape parameter(s) for the distribution (see docstring of the
##  |          instance object for more information)
##  |      loc : array_like, optional
##  |          location parameter (default=0)
##  |      scale : array_like, optional
##  |          scale parameter (default=1)
##  |      
##  |      Returns
##  |      -------
##  |      pdf : ndarray
##  |          Probability density function evaluated at x
##  |  
##  |  ppf(self, q, *args, **kwds)
##  |      Percent point function (inverse of `cdf`) at q of the given RV.
##  |      
##  |      Parameters
##  |      ----------
##  |      q : array_like
##  |          lower tail probability
##  |      arg1, arg2, arg3,... : array_like
##  |          The shape parameter(s) for the distribution (see docstring of the
##  |          instance object for more information)
##  |      loc : array_like, optional
##  |          location parameter (default=0)
##  |      scale : array_like, optional
##  |          scale parameter (default=1)
##  |      
##  |      Returns
##  |      -------
##  |      x : array_like
##  |          quantile corresponding to the lower tail probability q.
##  |  
##  |  sf(self, x, *args, **kwds)
##  |      Survival function (1 - `cdf`) at x of the given RV.
##  |      
##  |      Parameters
##  |      ----------
##  |      x : array_like
##  |          quantiles
##  |      arg1, arg2, arg3,... : array_like
##  |          The shape parameter(s) for the distribution (see docstring of the
##  |          instance object for more information)
##  |      loc : array_like, optional
##  |          location parameter (default=0)
##  |      scale : array_like, optional
##  |          scale parameter (default=1)
##  |      
##  |      Returns
##  |      -------
##  |      sf : array_like
##  |          Survival function evaluated at x
##  |  
##  |  ----------------------------------------------------------------------
##  |  Methods inherited from scipy.stats._distn_infrastructure.rv_generic:
##  |  
##  |  __call__(self, *args, **kwds)
##  |      Freeze the distribution for the given arguments.
##  |      
##  |      Parameters
##  |      ----------
##  |      arg1, arg2, arg3,... : array_like
##  |          The shape parameter(s) for the distribution.  Should include all
##  |          the non-optional arguments, may include ``loc`` and ``scale``.
##  |      
##  |      Returns
##  |      -------
##  |      rv_frozen : rv_frozen instance
##  |          The frozen distribution.
##  |  
##  |  __setstate__(self, state)
##  |  
##  |  entropy(self, *args, **kwds)
##  |      Differential entropy of the RV.
##  |      
##  |      Parameters
##  |      ----------
##  |      arg1, arg2, arg3,... : array_like
##  |          The shape parameter(s) for the distribution (see docstring of the
##  |          instance object for more information).
##  |      loc : array_like, optional
##  |          Location parameter (default=0).
##  |      scale : array_like, optional  (continuous distributions only).
##  |          Scale parameter (default=1).
##  |      
##  |      Notes
##  |      -----
##  |      Entropy is defined base `e`:
##  |      
##  |      >>> drv = rv_discrete(values=((0, 1), (0.5, 0.5)))
##  |      >>> np.allclose(drv.entropy(), np.log(2.0))
##  |      True
##  |  
##  |  freeze(self, *args, **kwds)
##  |      Freeze the distribution for the given arguments.
##  |      
##  |      Parameters
##  |      ----------
##  |      arg1, arg2, arg3,... : array_like
##  |          The shape parameter(s) for the distribution.  Should include all
##  |          the non-optional arguments, may include ``loc`` and ``scale``.
##  |      
##  |      Returns
##  |      -------
##  |      rv_frozen : rv_frozen instance
##  |          The frozen distribution.
##  |  
##  |  interval(self, alpha, *args, **kwds)
##  |      Confidence interval with equal areas around the median.
##  |      
##  |      Parameters
##  |      ----------
##  |      alpha : array_like of float
##  |          Probability that an rv will be drawn from the returned range.
##  |          Each value should be in the range [0, 1].
##  |      arg1, arg2, ... : array_like
##  |          The shape parameter(s) for the distribution (see docstring of the
##  |          instance object for more information).
##  |      loc : array_like, optional
##  |          location parameter, Default is 0.
##  |      scale : array_like, optional
##  |          scale parameter, Default is 1.
##  |      
##  |      Returns
##  |      -------
##  |      a, b : ndarray of float
##  |          end-points of range that contain ``100 * alpha %`` of the rv's
##  |          possible values.
##  |  
##  |  mean(self, *args, **kwds)
##  |      Mean of the distribution.
##  |      
##  |      Parameters
##  |      ----------
##  |      arg1, arg2, arg3,... : array_like
##  |          The shape parameter(s) for the distribution (see docstring of the
##  |          instance object for more information)
##  |      loc : array_like, optional
##  |          location parameter (default=0)
##  |      scale : array_like, optional
##  |          scale parameter (default=1)
##  |      
##  |      Returns
##  |      -------
##  |      mean : float
##  |          the mean of the distribution
##  |  
##  |  median(self, *args, **kwds)
##  |      Median of the distribution.
##  |      
##  |      Parameters
##  |      ----------
##  |      arg1, arg2, arg3,... : array_like
##  |          The shape parameter(s) for the distribution (see docstring of the
##  |          instance object for more information)
##  |      loc : array_like, optional
##  |          Location parameter, Default is 0.
##  |      scale : array_like, optional
##  |          Scale parameter, Default is 1.
##  |      
##  |      Returns
##  |      -------
##  |      median : float
##  |          The median of the distribution.
##  |      
##  |      See Also
##  |      --------
##  |      rv_discrete.ppf
##  |          Inverse of the CDF
##  |  
##  |  moment(self, n, *args, **kwds)
##  |      n-th order non-central moment of distribution.
##  |      
##  |      Parameters
##  |      ----------
##  |      n : int, n >= 1
##  |          Order of moment.
##  |      arg1, arg2, arg3,... : float
##  |          The shape parameter(s) for the distribution (see docstring of the
##  |          instance object for more information).
##  |      loc : array_like, optional
##  |          location parameter (default=0)
##  |      scale : array_like, optional
##  |          scale parameter (default=1)
##  |  
##  |  rvs(self, *args, **kwds)
##  |      Random variates of given type.
##  |      
##  |      Parameters
##  |      ----------
##  |      arg1, arg2, arg3,... : array_like
##  |          The shape parameter(s) for the distribution (see docstring of the
##  |          instance object for more information).
##  |      loc : array_like, optional
##  |          Location parameter (default=0).
##  |      scale : array_like, optional
##  |          Scale parameter (default=1).
##  |      size : int or tuple of ints, optional
##  |          Defining number of random variates (default is 1).
##  |      random_state : {None, int, `~np.random.RandomState`, `~np.random.Generator`}, optional
##  |          If `seed` is `None` the `~np.random.RandomState` singleton is used.
##  |          If `seed` is an int, a new ``RandomState`` instance is used, seeded
##  |          with seed.
##  |          If `seed` is already a ``RandomState`` or ``Generator`` instance,
##  |          then that object is used.
##  |          Default is None.
##  |      
##  |      Returns
##  |      -------
##  |      rvs : ndarray or scalar
##  |          Random variates of given `size`.
##  |  
##  |  stats(self, *args, **kwds)
##  |      Some statistics of the given RV.
##  |      
##  |      Parameters
##  |      ----------
##  |      arg1, arg2, arg3,... : array_like
##  |          The shape parameter(s) for the distribution (see docstring of the
##  |          instance object for more information)
##  |      loc : array_like, optional
##  |          location parameter (default=0)
##  |      scale : array_like, optional (continuous RVs only)
##  |          scale parameter (default=1)
##  |      moments : str, optional
##  |          composed of letters ['mvsk'] defining which moments to compute:
##  |          'm' = mean,
##  |          'v' = variance,
##  |          's' = (Fisher's) skew,
##  |          'k' = (Fisher's) kurtosis.
##  |          (default is 'mv')
##  |      
##  |      Returns
##  |      -------
##  |      stats : sequence
##  |          of requested moments.
##  |  
##  |  std(self, *args, **kwds)
##  |      Standard deviation of the distribution.
##  |      
##  |      Parameters
##  |      ----------
##  |      arg1, arg2, arg3,... : array_like
##  |          The shape parameter(s) for the distribution (see docstring of the
##  |          instance object for more information)
##  |      loc : array_like, optional
##  |          location parameter (default=0)
##  |      scale : array_like, optional
##  |          scale parameter (default=1)
##  |      
##  |      Returns
##  |      -------
##  |      std : float
##  |          standard deviation of the distribution
##  |  
##  |  support(self, *args, **kwargs)
##  |      Return the support of the distribution.
##  |      
##  |      Parameters
##  |      ----------
##  |      arg1, arg2, ... : array_like
##  |          The shape parameter(s) for the distribution (see docstring of the
##  |          instance object for more information).
##  |      loc : array_like, optional
##  |          location parameter, Default is 0.
##  |      scale : array_like, optional
##  |          scale parameter, Default is 1.
##  |      Returns
##  |      -------
##  |      a, b : float
##  |          end-points of the distribution's support.
##  |  
##  |  var(self, *args, **kwds)
##  |      Variance of the distribution.
##  |      
##  |      Parameters
##  |      ----------
##  |      arg1, arg2, arg3,... : array_like
##  |          The shape parameter(s) for the distribution (see docstring of the
##  |          instance object for more information)
##  |      loc : array_like, optional
##  |          location parameter (default=0)
##  |      scale : array_like, optional
##  |          scale parameter (default=1)
##  |      
##  |      Returns
##  |      -------
##  |      var : float
##  |          the variance of the distribution
##  |  
##  |  ----------------------------------------------------------------------
##  |  Data descriptors inherited from scipy.stats._distn_infrastructure.rv_generic:
##  |  
##  |  __dict__
##  |      dictionary for instance variables (if defined)
##  |  
##  |  __weakref__
##  |      list of weak references to the object (if defined)
##  |  
##  |  random_state
##  |      Get or set the RandomState object for generating random variates.
##  |      
##  |      This can be either None, int, a RandomState instance, or a
##  |      np.random.Generator instance.
##  |      
##  |      If None (or np.random), use the RandomState singleton used by np.random.
##  |      If already a RandomState or Generator instance, use it.
##  |      If an int, use a new RandomState instance seeded with seed.

Notice that the python documentation here is enormous. It contains:

Variable assignment

Variable assignment in R and python can be done in the standard way using =.

#in R
a=10
b=a+4
print(b)
## [1] 14
#or in python
a=10
b=a+4
print(b)
## 14

More traditionally, the R assignment operator is the “arrow”: <-. That is, I can rewrite the R assignments above as:

#using arrows in R
rm("b") #get rid of the old "b"
exists("b") #Good, it's gone!
## [1] FALSE
a<-10
b<-a+4
print(b)
## [1] 14

Arrows in R

While assignment with the = is assumed to be right-to-left, assignment in the arrow notation follows the direction of the arrow…

#right-to-left arrows
rm('b') #get rid of the old "b"
exists('b') #Good, it's gone!
## [1] FALSE
10->a
a+4->b
print(b)
## [1] 14

The arrow notation has one common catastrophic failure:

Arrows in R

The main cause of this catastrophic failure is that a<-3 is parsed as the assignment a <- 3, not the comparison a < -3; the assignment returns the assigned value (here 3), which is interpreted as TRUE. See the below code for example:

#a is 9 
a=9
#Catastrophic Failure 
if (a<-3) { 
  print("Yes, a is < -3") 
} else { 
  print("No, a is > -3") 
}
## [1] "Yes, a is < -3"
#a is 9 
a=9
#Expected Behaviour 
if (a< -3) { 
  print("Yes, a is < -3") 
} else { 
  print("No, a is > -3") 
}
## [1] "No, a is > -3"

Custom Functions

Functions in R and python are specified in similar ways, but with slightly different syntax. Let’s construct a custom function that combines two vectors in quadrature, element-wise:

#Custom Functions in R
rms<-function(a,b) { 
  return(sqrt(a^2+b^2))
}
x<-seq(0,1,len=10)
y<-seq(2,3,len=10)
rms(a=x,b=y)
##  [1] 2.000000 2.114033 2.233306 2.357023 2.484520 2.615245 2.748737 2.884612 3.022549 3.162278
#Custom Functions in python
def rms(a,b): 
  return np.sqrt(a**2+b**2)

x=np.linspace(0,1,num=10)
y=np.linspace(2,3,num=10)
rms(a=x,b=y)
## array([2.        , 2.11403307, 2.23330569, 2.3570226 , 2.48451997,
##        2.61524495, 2.74873708, 2.88461222, 3.022549  , 3.16227766])

Variable types

Both languages have various variable types:

#Types in R
a=3        #numeric
a=3.1415   #numeric
a="3.1415" #character
#Types in python
a=3        #int
a=3.1415   #float
a="3.1415" #str
python3 dynamically typecasts integers to floats in division with /, which python2 does not!
#Arithmetic in R
a=10
b=14
print(b/a)
## [1] 1.4
#Arithmetic in python3
a=10
b=14
print(b/a)
## 1.4

Variable typecasting

You can see what is happening directly by looking at the class() (in R) or type() (in python) of the variables:
#Variable classes in R
c=b/a
class(a); class(b); class(c)
## [1] "numeric"
## [1] "numeric"
## [1] "numeric"
#Variable types in python
c=b/a
type(a); type(b); type(c)
## <class 'int'>
## <class 'int'>
## <class 'float'>

Note that if we force R to treat the input variables as integers, we see the same dynamic typecasting:

#Typecast in R
a=as.integer(a)
b=as.integer(b)
c=b/a
class(a); class(b); class(c)
## [1] "integer"
## [1] "integer"
## [1] "numeric"

Variable typecasting

So, in basically all practical respects, python has converged to the R behaviour in this regard.

#Integers in R
as.integer(10) == as.numeric(10)
## [1] TRUE
#Integers in python
int(10) == float(10) 
## True

Which makes sense, because the value of \(10\) is still the same whether you’re counting discretely (integers) or continuously (numeric/float).

Data types

In R and python there are many different ways to combine chunks of variables. We’re going to focus on a subset of these within R:

And on their python equivalents:

Vectors and np.arrays

In R, the vector is a fundamental unit of the language’s structure, and as a result essentially all operations that you perform can be done in a vectorised fashion. In python, the most widely used implementation of vectors is provided by the numpy package.

#Vectors in R
vec=c(1,2,3,4,5,6)
print(vec); class(vec)
## [1] 1 2 3 4 5 6
## [1] "numeric"
#Vectors in python
vec=np.array([1,2,3,4,5,6])
print(vec); type(vec)
## [1 2 3 4 5 6]
## <class 'numpy.ndarray'>

Vectors allow vector arithmetic:

#Vector Arithmetic in R
a=seq(0,100,length=12)
b=seq(100,30,length=12)
#now do some arbitrary arithmetic
sum((a+b)/sqrt(a^2+b^2))
## [1] 15.37089
#Vector Arithmetic in python
a=np.linspace(0,100,num=12)
b=np.linspace(100,30,num=12)
#now do some arbitrary arithmetic
((a+b)/np.sqrt(a**2+b**2)).sum()
## 15.370885778564148

Beware the caret in python!

#Vector Arithmetic in R
a=seq(0,100,length=12)
b=seq(100,30,length=12)
#now do some arbitrary arithmetic
sum((a+b)/sqrt(a^2+b^2))
## [1] 15.37089
#Vector Arithmetic in python
a=np.linspace(0,100,num=12)
b=np.linspace(100,30,num=12)
#now do some arbitrary arithmetic
((a+b)/np.sqrt(a**2+b**2)).sum()
## 15.370885778564148
#Double Star exponentiation in R
sum((a+b)/sqrt(a**2+b**2))
## [1] 15.37089
#Bitwise XOR operator (i.e. *not* exponentiation) in python
((a+b)/np.sqrt(a^2+b^2)).sum()
## Error in py_call_impl(callable, dots$args, dots$keywords): TypeError: ufunc 'bitwise_xor' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
## 
## Detailed traceback: 
##   File "<string>", line 1, in <module>

Arrays and np.ndarrays

#Vectors vs Arrays in R
a=seq(0,100,length=12)
b=array(seq(100,30,length=12))
str(a); str(b); identical(a,b)
##  num [1:12] 0 9.09 18.18 27.27 36.36 ...
##  num [1:12(1d)] 100 93.6 87.3 80.9 74.5 ...
## [1] FALSE
#but arithmetic is identical
sum((a+b)/sqrt(a^2+b^2))
## [1] 15.37089

Matrices

When we make the step up to two-dimensional arrays in R, we find that there is another class: matrix:

#Arrays vs Matrices in R
a=array(1:50,dim=c(10,5)) #2D array with dim 10x5
b=matrix(1:50,nrow=10,ncol=5) #matrix with dim 10x5
class(a); class(b); identical(a,b)
## [1] "matrix" "array"
## [1] "matrix" "array"
## [1] TRUE

The main reason for the distinction: matrices are everywhere in mathematics, and so many functions operate on matrices directly, but are not designed for higher dimensional arrays.
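For instance (a minimal sketch), matrix-specific operations such as the transpose and the matrix product work on matrices, but fail on higher-dimensional arrays:

#Matrix operations work on matrices...
m<-matrix(1:4,nrow=2)
t(m)      #transpose
m %*% m   #matrix product
#...but not on higher-dimensional arrays:
a<-array(1:8,dim=c(2,2,2))
#t(a) would error: "argument is not a matrix"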

In python one can generate arrays with \(2\) or more dimensions using a range of techniques, depending on initialisation:

#Multidimensional arrays in python
a=np.ones([10,5]) #filled with 1s
b=np.zeros([10,5]) #filled with 0s
c=np.ndarray([10,5]) #filled arbitrarily 

Multi-dimensional Arrays

To construct a multidimensional array from a predefined vector of numbers, we create a vector and “reshape” it into the desired multi-dimensional array.

#multidimensional array in R
a=array(seq(0,1,len=12),dim=c(3,5,2))
b=array(seq(0,1,len=10),dim=c(3,5,2))
print(a); print(b)
## , , 1
## 
##            [,1]      [,2]      [,3]      [,4]       [,5]
## [1,] 0.00000000 0.2727273 0.5454545 0.8181818 0.00000000
## [2,] 0.09090909 0.3636364 0.6363636 0.9090909 0.09090909
## [3,] 0.18181818 0.4545455 0.7272727 1.0000000 0.18181818
## 
## , , 2
## 
##           [,1]      [,2]      [,3]       [,4]      [,5]
## [1,] 0.2727273 0.5454545 0.8181818 0.00000000 0.2727273
## [2,] 0.3636364 0.6363636 0.9090909 0.09090909 0.3636364
## [3,] 0.4545455 0.7272727 1.0000000 0.18181818 0.4545455
## , , 1
## 
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 0.0000000 0.3333333 0.6666667 1.0000000 0.2222222
## [2,] 0.1111111 0.4444444 0.7777778 0.0000000 0.3333333
## [3,] 0.2222222 0.5555556 0.8888889 0.1111111 0.4444444
## 
## , , 2
## 
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 0.5555556 0.8888889 0.1111111 0.4444444 0.7777778
## [2,] 0.6666667 1.0000000 0.2222222 0.5555556 0.8888889
## [3,] 0.7777778 0.0000000 0.3333333 0.6666667 1.0000000
#multidimensional array with repetition in python
a=np.linspace(0,1,num=12).reshape([3,5,2]) #FAILS
## Error in py_call_impl(callable, dots$args, dots$keywords): ValueError: cannot reshape array of size 12 into shape (3,5,2)
## 
## Detailed traceback: 
##   File "<string>", line 1, in <module>

We can create arrays with regular structure using the concatenate() or repeat() functions in combination with reshape().

#Variable repetition in array creation
a=np.linspace(0,1,num=10)
a=np.concatenate((a,a,a)).reshape([3,5,2])
b=np.linspace(0,1,num=10).repeat(3).reshape([3,5,2]) 
print(a); np.all(a==r.b); np.all(b==r.b)
## [[[0.         0.11111111]
##   [0.22222222 0.33333333]
##   [0.44444444 0.55555556]
##   [0.66666667 0.77777778]
##   [0.88888889 1.        ]]
## 
##  [[0.         0.11111111]
##   [0.22222222 0.33333333]
##   [0.44444444 0.55555556]
##   [0.66666667 0.77777778]
##   [0.88888889 1.        ]]
## 
##  [[0.         0.11111111]
##   [0.22222222 0.33333333]
##   [0.44444444 0.55555556]
##   [0.66666667 0.77777778]
##   [0.88888889 1.        ]]]
## False
## False

Lists and dictionaries

python and R both have the functionality to specify complex structures of data called lists.

Declaration of lists in R uses the list() function:

#Lists in R
mylist=list(a=a,b=b,vec=c(1,2,3,4,5,6),str="this is a string")
str(mylist)
## List of 4
##  $ a  : num [1:3, 1:5, 1:2] 0 0.0909 0.1818 0.2727 0.3636 ...
##  $ b  : num [1:3, 1:5, 1:2] 0 0.111 0.222 0.333 0.444 ...
##  $ vec: num [1:6] 1 2 3 4 5 6
##  $ str: chr "this is a string"

python lists and dictionaries

In python we declare lists with square brackets [...] and dictionaries with curly braces {...}:

#Lists in python
mylist=[a,b,(1,2,3,4,5,6),
        "this is a string"]
print(mylist)
## [array([[[0.        , 0.11111111],
##         [0.22222222, 0.33333333],
##         [0.44444444, 0.55555556],
##         [0.66666667, 0.77777778],
##         [0.88888889, 1.        ]],
## 
##        [[0.        , 0.11111111],
##         [0.22222222, 0.33333333],
##         [0.44444444, 0.55555556],
##         [0.66666667, 0.77777778],
##         [0.88888889, 1.        ]],
## 
##        [[0.        , 0.11111111],
##         [0.22222222, 0.33333333],
##         [0.44444444, 0.55555556],
##         [0.66666667, 0.77777778],
##         [0.88888889, 1.        ]]]), array([[[0.        , 0.        ],
##         [0.        , 0.11111111],
##         [0.11111111, 0.11111111],
##         [0.22222222, 0.22222222],
##         [0.22222222, 0.33333333]],
## 
##        [[0.33333333, 0.33333333],
##         [0.44444444, 0.44444444],
##         [0.44444444, 0.55555556],
##         [0.55555556, 0.55555556],
##         [0.66666667, 0.66666667]],
## 
##        [[0.66666667, 0.77777778],
##         [0.77777778, 0.77777778],
##         [0.88888889, 0.88888889],
##         [0.88888889, 1.        ],
##         [1.        , 1.        ]]]), (1, 2, 3, 4, 5, 6), 'this is a string']
mydict={"a":a,"b":b,
        "vec":(1,2,3,4,5,6),
        "str":"this is a string"}
print(mydict)
## {'a': array([[[0.        , 0.11111111],
##         [0.22222222, 0.33333333],
##         [0.44444444, 0.55555556],
##         [0.66666667, 0.77777778],
##         [0.88888889, 1.        ]],
## 
##        [[0.        , 0.11111111],
##         [0.22222222, 0.33333333],
##         [0.44444444, 0.55555556],
##         [0.66666667, 0.77777778],
##         [0.88888889, 1.        ]],
## 
##        [[0.        , 0.11111111],
##         [0.22222222, 0.33333333],
##         [0.44444444, 0.55555556],
##         [0.66666667, 0.77777778],
##         [0.88888889, 1.        ]]]), 'b': array([[[0.        , 0.        ],
##         [0.        , 0.11111111],
##         [0.11111111, 0.11111111],
##         [0.22222222, 0.22222222],
##         [0.22222222, 0.33333333]],
## 
##        [[0.33333333, 0.33333333],
##         [0.44444444, 0.44444444],
##         [0.44444444, 0.55555556],
##         [0.55555556, 0.55555556],
##         [0.66666667, 0.66666667]],
## 
##        [[0.66666667, 0.77777778],
##         [0.77777778, 0.77777778],
##         [0.88888889, 0.88888889],
##         [0.88888889, 1.        ],
##         [1.        , 1.        ]]]), 'vec': (1, 2, 3, 4, 5, 6), 'str': 'this is a string'}

Lists and dictionaries

While the list and dictionary types in python must be referenced by number and name respectively, the R list can be accessed in both manners (i.e. they are ordered and named, rather than one-or-the-other):

#List indexing in R
mylist[[4]]; mylist[["str"]]; mylist$str 
## [1] "this is a string"
## [1] "this is a string"
## [1] "this is a string"
#Lists indexing in python
mylist[3]; mydict["str"]; #These work
## 'this is a string'
## 'this is a string'
mylist['str']; mydict[3]; #These error
## Error in py_call_impl(callable, dots$args, dots$keywords): TypeError: list indices must be integers or slices, not str
## 
## Detailed traceback: 
##   File "<string>", line 1, in <module>

The Python ‘tuple’

There is a special form of immutable (uneditable) list in python called the tuple.

#Tuples in python 
#Tuples are formally defined using parentheses 
tup = ( np.zeros([2,2]), "string", 90. ) 
tup 
## (array([[0., 0.],
##        [0., 0.]]), 'string', 90.0)
#But this works too!
tup = np.zeros([3,3]), "stringier", 0.99 
tup
## (array([[0., 0., 0.],
##        [0., 0., 0.],
##        [0., 0., 0.]]), 'stringier', 0.99)

Once defined, a tuple cannot be modified:

#Tuples cannot be edited
tup[1]='modified'
## Error in py_call_impl(callable, dots$args, dots$keywords): TypeError: 'tuple' object does not support item assignment
## 
## Detailed traceback: 
##   File "<string>", line 1, in <module>

The most common use of tuples in python is in function return statements, where you can do something interesting called “unpacking”:

#Define a function that returns a tuple 
def myFunction(): 
  return ( np.ones([2,2]), "new string", 10. ) 
#Unpack the results on-the-fly
mat, newstr, val = myFunction()
print(newstr)
## new string

This unpacking is unnamed. You need to know the position-order of the values returned by the function and unpack them in the correct order, lest the following happen:

#Bad Unpack 
newstr, mat, val = myFunction()
#Whoops
print(newstr)
## [[1. 1.]
##  [1. 1.]]

Data frames and pandas data frames

The final data type is a generalisation of the matrix/2D array, which allows for arbitrary variable types per column.

Data frames provide a natural way to store catalogues.
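As a minimal sketch (with hypothetical column names and values), a small catalogue might be built directly as a data frame in R:

#A toy catalogue as a data frame
catalogue<-data.frame(id=1:3,                    #integer column
                      ra=c(150.1,150.4,150.9),   #numeric column
                      name=c("A","B","C"))       #character column
str(catalogue)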

Playing with data frames

Let’s make use of one of R’s available datasets (of which there are lots!), called “Hitters” and containing “Major League Baseball Data from the \(1986\) and \(1987\) seasons”:

#Data Frames in R
df=get(data("Hitters",package='ISLR'))
#Let's just look at a few columns
df=df[,1:4]
print(head(df))
##                   AtBat Hits HmRun Runs
## -Andy Allanson      293   66     1   30
## -Alan Ashby         315   81     7   24
## -Alvin Davis        479  130    18   66
## -Andre Dawson       496  141    20   65
## -Andres Galarraga   321   87    10   39
## -Alfredo Griffin    594  169     4   74

We’ll use the same R data frame and convert it to a pandas DataFrame in python:

#Data Frames in python
import pandas as pd
#For simplicity, inherit R's data 
df=pd.DataFrame(r.df)
print(df.head())
##                    AtBat  Hits  HmRun  Runs
## -Andy Allanson       293    66      1    30
## -Alan Ashby          315    81      7    24
## -Alvin Davis         479   130     18    66
## -Andre Dawson        496   141     20    65
## -Andres Galarraga    321    87     10    39

Data frames and pandas data frames

Note that the data frames here are named in both columns and rows, and so can be accessed by either name or index (this is not unique to data frames in R, though):

#Data Frame referencing in R
print(df[141,])
##               AtBat Hits HmRun Runs
## -Jose Canseco   600  144    33   85
#Index like an array with name/number/both
df[141,'HmRun']; df['-Jose Canseco',3]; 
## [1] 33
## [1] 33
#Index like a list
df$HmRun[141]; df[["HmRun"]][141]
## [1] 33
## [1] 33
#Data Frame referencing in python
#Index by numbers with iloc
print(df.iloc[[82]])
##                 AtBat  Hits  HmRun  Runs
## -Don Mattingly    677   238     31   117
print(df.iloc[82,2])
## 31
#Index by name with loc
print(df.loc['-Don Mattingly','HmRun'])
## 31
#Index in dictionary/list style
print(df.HmRun[82])
## 31

Data frames and pandas data frames

One of the strengths of data frames is the ease of subsetting/selecting data:

#Data Frame subsetting in R (I)
print(df[df$HmRun>30,]) 
##                 AtBat Hits HmRun Runs
## -Don Baylor       585  139    31   93
## -Dave Kingman     561  118    35   70
## -Don Mattingly    677  238    31  117
## -Dave Parker      637  174    31   89
## -George Bell      641  198    31  101
## -Glenn Davis      574  152    31   91
## -Gary Gaetti      596  171    34   91
## -Jesse Barfield   589  170    40  107
## -Jose Canseco     600  144    33   85
## -Kirby Puckett    680  223    31  119
## -Rob Deer         466  108    33   75
#Data Frame subsetting in python (I)
print(df.loc[df.HmRun>30])
##                  AtBat  Hits  HmRun  Runs
## -Don Baylor        585   139     31    93
## -Dave Kingman      561   118     35    70
## -Don Mattingly     677   238     31   117
## -Dave Parker       637   174     31    89
## -George Bell       641   198     31   101
## -Glenn Davis       574   152     31    91
## -Gary Gaetti       596   171     34    91
## -Jesse Barfield    589   170     40   107
## -Jose Canseco      600   144     33    85
## -Kirby Puckett     680   223     31   119
## -Rob Deer          466   108     33    75

or using indexing with the R which() or python np.where() functions:

#Data Frame subsetting in R (II)
print(df[which(df$HmRun>30),]) 
##                 AtBat Hits HmRun Runs
## -Don Baylor       585  139    31   93
## -Dave Kingman     561  118    35   70
## -Don Mattingly    677  238    31  117
## -Dave Parker      637  174    31   89
## -George Bell      641  198    31  101
## -Glenn Davis      574  152    31   91
## -Gary Gaetti      596  171    34   91
## -Jesse Barfield   589  170    40  107
## -Jose Canseco     600  144    33   85
## -Kirby Puckett    680  223    31  119
## -Rob Deer         466  108    33   75
#Data Frame subsetting in python (II)
print(df.iloc[np.where(df.HmRun>30)])
##                  AtBat  Hits  HmRun  Runs
## -Don Baylor        585   139     31    93
## -Dave Kingman      561   118     35    70
## -Don Mattingly     677   238     31   117
## -Dave Parker       637   174     31    89
## -George Bell       641   198     31   101
## -Glenn Davis       574   152     31    91
## -Gary Gaetti       596   171     34    91
## -Jesse Barfield    589   170     40   107
## -Jose Canseco      600   144     33    85
## -Kirby Puckett     680   223     31   119
## -Rob Deer          466   108     33    75
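
pandas additionally offers a query() method, which lets us write the same selection as a string expression (arguably more readable for complex cuts); a minimal sketch:

#Data Frame subsetting in python (III)
#Same result as the selections above
print(df.query('HmRun > 30'))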

Data frames and pandas data frames

This allows us to trivially select complex subsets of data for analysis. For example, we can explore how the rate of successful hits relates to the rate of runs, by comparing the mean run rate of players with above- and below-median hit rates:

#Data Frame manipulation in R
df$HitRate<-df$Hits/df$AtBat #Define a new column
df[["RunRate"]]<-df$Runs/df$AtBat #This works too
#Ratio of mean RunRate, split by median in HitRate
print(
  mean(df[df$HitRate>median(df$HitRate),"RunRate"])/ 
  mean(df[df$HitRate<median(df$HitRate),"RunRate"])  
)
## [1] 1.195583

And the equivalent in python:

#Data Frame manipulation in python
df['HitRate']=df.Hits/df.AtBat #Define a new column
df.insert(4,"RunRate",df.Runs/df.AtBat) #This works too
##Ratio of mean RunRate, split by median in HitRate
hiMean=df.RunRate.loc[df.HitRate>np.median(df.HitRate)].mean()
loMean=df.RunRate.loc[df.HitRate<np.median(df.HitRate)].mean()  
print(hiMean/loMean)
## 1.195582607621297
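
To summarise the relationship between the two rates with a single number, pandas Series also have a built-in .corr() method (Pearson's \(r\) by default); a minimal sketch, with the output omitted:

#Correlation between HitRate and RunRate in python
print(df.HitRate.corr(df.RunRate))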

Indexing (!!IMPORTANT!!)

There is one difference between R and python that is absolutely fundamental and cannot be avoided: R is \(1\)-indexed, whereas python is \(0\)-indexed.

This means that the first element in an array (i.e. the element without any other elements in front of it…) is the “\(1^{\rm st}\)” element in R, and the “\(0^{\rm th}\)” element in python:

#Indexing in R
arr=c("a","b","c","d","e")
arr[1]
## [1] "a"
arr[length(arr)]
## [1] "e"
#Indexing in python
arr=np.array(["a","b","c","d","e"])
arr[1]
## 'b'
arr[arr.shape[0] - 1]
## 'e'

You can think of \(1\)-indexing as being a counting index (“first element, second element, third element, …”), whereas \(0\)-indexing is an offset index (“zero elements away from the first, one element away from the first, two…”).

The different choice of indexing affects negative indexing too:

#Zero and Negative indexing in R
arr[0]; arr[c(0,1)]; arr[-1]; arr[c(-1,0)] 
## character(0)
## [1] "a"
## [1] "b" "c" "d" "e"
## [1] "b" "c" "d" "e"
#Negative indexing in python
ind=np.array([0,1,2])
arr[ind]; arr[ind-1]
## array(['a', 'b', 'c'], dtype='<U1')
## array(['e', 'a', 'b'], dtype='<U1')
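
One more caveat worth flagging: a plain negative index means something entirely different in the two languages. In R it drops elements (as above), whereas in python it counts backwards from the end of the array. A quick sketch in python:

#Plain negative indices in python count from the end
print(arr[-1])
## e
print(arr[-2])
## d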

Finally, the indexing difference means that sequence construction also differs between the two languages (so as to remain consistent with the indexing):

#Sequence generation in R
seq(5)
## [1] 1 2 3 4 5
#Sequence generation in python 
np.arange(5)
## array([0, 1, 2, 3, 4])
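
If you need the two languages to generate the same sequence, be explicit about the start and stop values (remembering that np.arange excludes its stop value); a minimal sketch in python:

#Reproducing R's seq(5) in python
np.arange(1,6)
## array([1, 2, 3, 4, 5])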

Object Oriented vs Functional Code

To demonstrate the distinction between functional and object oriented programming, let’s look at a simple example: reconstructing a symmetric matrix from its eigenvalues and eigenvectors via matrix multiplication.

#Functional code in R
#Start with a symmetric matrix
mat<-matrix(c(6,2,2,3),nrow=2)
#Calculate the eigenvalues and eigenvectors 
eig<-eigen(mat)
#Check the results: A = V.diag(lambda).V^T 
eig$vectors %*% diag(eig$values) %*% t(eig$vectors)
##      [,1] [,2]
## [1,]    6    2
## [2,]    2    3
#Object-Oriented python
#Start with a symmetric matrix
mat=np.array([6,2,2,3]).reshape(2,2)
#Calculate the eigenvalues and eigenvectors 
eigVals, eigVecs = np.linalg.eig(mat)
#Check the results: A = V.diag(lambda).V^T 
eigVecs.dot(np.diag(eigVals)).dot(eigVecs.T)
## array([[6., 2.],
##        [2., 3.]])

We can actually make the difference clearer on the R side by invoking the matrix multiplication function in a more ‘traditional’ fashion:

#Fully Functional inner/dot products in R
`%*%`(`%*%`(eig$vectors,diag(eig$values)),t(eig$vectors))
##      [,1] [,2]
## [1,]    6    2
## [2,]    2    3

Note that python 3.5 introduced a dedicated infix operator for matrix multiplication, @, which brings the R and python implementations back into line:

#Using the functional @ operator in Python 3.5+:
eigVecs @ np.diag(eigVals) @ eigVecs.T
## array([[6., 2.],
##        [2., 3.]])
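
When verifying a reconstruction like this, it is safer to test approximate equality than to eyeball the printed matrix, because floating-point round-off means the product is rarely exactly equal to the original; a minimal sketch using numpy:

#Check the reconstruction numerically, within floating-point tolerance
print(np.allclose(mat, eigVecs @ np.diag(eigVals) @ eigVecs.T))
## True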

Object Oriented Programming and Classes

In python programming in particular, you will definitely run into object “classes”.

In python we define a class in a similar manner to function definitions:

#Class definition in python
class Human: 
  name="Angus"
  occupation="Lecturer"
  height=165.0
  age=32

person1= Human()
print(person1.name)
## Angus
print(person1)
## <__main__.Human object at 0x138a32970>
person1
## <__main__.Human object at 0x138a32970>

Object Oriented Programming and Classes

The power of classes comes with the combination of attributes and methods. There are three main methods that every class should have:

- initialisation (__init__),
- print (__str__), and
- representation (__repr__).

#Class definition in python
class Human: 
  def __init__(self, name, occupation, height, weight, age): 
    self.name=name
    self.occupation=occupation
    self.height=height
    self.weight=weight
    self.age=age
   
  def __str__(self): 
    return f'Person named {self.name} is {self.age} years old, {self.height}cm tall ' \
           f'and weighs {self.weight}kg.'  
           
  def __repr__(self): 
    return f'Human("{self.name}","{self.occupation}","{self.height}","{self.weight}","{self.age}")' 

person1= Human("Angus","Lecturer",165.0,65.0,30)
print(person1.name)
## Angus
print(person1)
## Person named Angus is 30 years old, 165.0cm tall and weighs 65.0kg.
person1
## Human("Angus","Lecturer","165.0","65.0","30")

Object Oriented Programming and Classes

We can add arbitrary additional methods to the class, such as functions that “grow” our Human object by \(1\) or \(n\) years:

#Class definition in python
class Human: 
  def __init__(self, name, occupation, height, weight, age): 
    self.name=name
    self.occupation=occupation
    self.height=height
    self.weight=weight
    self.age=age
  
  def __str__(self): 
    return f'Person named {self.name} is {self.age} years old, {self.height}cm tall ' \
           f'and weighs {self.weight}kg.'  
           
  def __repr__(self): 
    return f'Human("{self.name}","{self.occupation}","{self.height}","{self.weight}","{self.age}")' 
  
  def grow1(self): 
    self.age += 1 
    self.weight += 1.0
    self.height -= 0.5
    return self 
  
  def grow_n(self,n): 
    self.age += n 
    self.weight += 1.0*n
    self.height -= 0.5*n
    return self 


person1= Human("Angus","Lecturer",165.0,65.0,30)
print(person1)
## Person named Angus is 30 years old, 165.0cm tall and weighs 65.0kg.
print(person1.grow1())
## Person named Angus is 31 years old, 164.5cm tall and weighs 66.0kg.
print(person1.grow_n(10).grow1().grow1())
## Person named Angus is 43 years old, 158.5cm tall and weighs 78.0kg.
print(person1)
## Person named Angus is 43 years old, 158.5cm tall and weighs 78.0kg.
person1
## Human("Angus","Lecturer","158.5","78.0","43")

Note that the methods which act on self modify the object in-place: after the calls above, person1 itself has changed!
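
If you want to grow a copy while leaving the original untouched, one option is to deep-copy the object first; a minimal sketch using python's standard copy module:

#Grow a copy, not the original
import copy
person2 = copy.deepcopy(person1) #an independent copy
print(person2.grow_n(5))
## Person named Angus is 48 years old, 156.0cm tall and weighs 83.0kg.
print(person1) #the original is unchanged
## Person named Angus is 43 years old, 158.5cm tall and weighs 78.0kg.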

Object orientation in R

There are actually 4 different ways to implement object-oriented programming in R, known as S3, S4, RC, and R6.

We won’t use “OOP” in R in this course, but it’s worth demonstrating briefly how one can do it:

#Unlike S3, S4, and RC, R6 is currently not in base R.
library(R6)
#Class definition in R
Human <- R6Class("Human", list(
  name = NULL,
  occupation=NULL,
  height = NA, 
  weight = NA, 
  age = NA,
  initialize = function(name, occupation, height, weight, age) {
    self$name <- name
    self$occupation <- occupation
    self$height <- height
    self$weight <- weight
    self$age <- age
  },
  print = function() { 
    cat(paste('Person named',self$name,'is',self$age,'years old, '))
    cat(paste(self$height,'cm tall and weighs ',self$weight,'kg.\n',sep=''))
  },
  grow1 = function(){  
    self$age <- self$age + 1 
    self$weight <- self$weight + 1.0
    self$height <- self$height - 0.5
    invisible(self) #return self invisibly, to allow method chaining
  },
  grow_n = function(n){ 
    self$age <- self$age + n 
    self$weight <- self$weight + 1.0*n
    self$height <- self$height - 0.5*n
    invisible(self) 
  }
))

person1= Human$new("Angus","Lecturer",165.0,65.0,32)
print(person1)
## Person named Angus is 32 years old, 165cm tall and weighs 65kg.
print(person1$grow1())
## Person named Angus is 33 years old, 164.5cm tall and weighs 66kg.
print(person1$grow_n(10)$grow1()$grow1())
## Person named Angus is 45 years old, 158.5cm tall and weighs 78kg.
print(person1)
## Person named Angus is 45 years old, 158.5cm tall and weighs 78kg.
person1
## Person named Angus is 45 years old, 158.5cm tall and weighs 78kg.

Note that R produces the same output as print() when you simply request the object (as in the last line). If you want to see a representation of the object akin to python’s __repr__, you can try the function dput() — though, because R6 objects are environments, dput() is not very informative for them:

#Equivalent of __repr__ output in R
dput(person1)
## <environment>

Plotting my Data

Both R and python have very sophisticated plotting capabilities.

python’s matplotlib is a widely used tool for producing figures, with lots of examples available online.

In R one can trivially plot lots of things using base graphics. Additional packages such as magicaxis and ggplot are extremely powerful.

Plotting Data in R

This is the code to plot the data+model+uncertainty figures from earlier in R:

#Plotting in R with base "plot"
#Plot the scatter plot with a title 
plot(x,y,ylab=expression(paste('Beta'(alpha,beta))),xlab='x',pch=20,col='blue3',
        main=paste("a=",round(digits=2,best_R[1]),"±",round(digits=2,best_R[3]),"; ",
                   "b=",round(digits=2,best_R[2]),"±",round(digits=2,best_R[4])))
#Overplot the best-fit model line
lines(model.x,model.y,col='red',lwd=2,lty=1)
#Overplot the uncertainty region
lines(model.x,conf.int[,1],col='red',lwd=2,lty=2)
lines(model.x,conf.int[,2],col='red',lwd=2,lty=2)
#Add a legend
legend('topright',legend=c('Data','Best-fit Model','1-sigma uncertainty'),
       pch=c(20,NA,NA),lty=c(NA,1,2),col=c("blue3","red","red"),lwd=2,
       inset=c(0.05,0.05))

We can use the magicaxis package, which provides a layer on top of base plotting to optimise how figures are styled:

#Plotting in R with magicaxis
library(magicaxis)
#Plot the scatter plot with a title 
magplot(x,y,ylab=expression(paste('Beta'(alpha,beta))),xlab='x',pch=20,col='blue3',
        main=paste("a=",round(digits=2,best_R[1]),"±",round(digits=2,best_R[3]),"; ",
                   "b=",round(digits=2,best_R[2]),"±",round(digits=2,best_R[4])))
#Overplot the best-fit model line
lines(model.x,model.y,col='red',lwd=2,lty=1)
#Overplot the uncertainty region
lines(model.x,conf.int[,1],col='red',lwd=2,lty=2)
lines(model.x,conf.int[,2],col='red',lwd=2,lty=2)
#Add a legend
legend('topright',legend=c('Data','Best-fit Model','1-sigma uncertainty'),
       pch=c(20,NA,NA),lty=c(NA,1,2),col=c("blue3","red","red"),lwd=2,inset=c(0.05,0.05))

Plotting Data in python

Here is our equivalent plot in python using matplotlib:

#Plotting with matplotlib in python 
import matplotlib.pyplot as plt
import scipy.stats
#Do the standard scatter plot 
plt.scatter(r.x,r.y,label='Data')
#Overplot the best-fit model line 
plt.plot(model_x,scipy.stats.beta.pdf(model_x,best_py[0],best_py[1]),color='r',label='Best-fit Model')
#Overplot the uncertainty region 
plt.plot(model_x,scipy.stats.beta.pdf(model_x,best_py[0],best_py[1])-r.sigma,'--',color='r',label='1-sigma uncertainty')
plt.plot(model_x,scipy.stats.beta.pdf(model_x,best_py[0],best_py[1])+r.sigma,'--',color='r')
#Add a title 
plt.title("a="+str(np.round(best_py[0],2))+"±"+str(np.round(sigma_py[0,0],2))+"; "+
          "b="+str(np.round(best_py[1],2))+"±"+str(np.round(sigma_py[1,1],2)))
#Add a legend 
plt.legend()
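
Note that in a stand-alone python script (outside Rmarkdown, which renders figures automatically) you would finish by saving or displaying the figure; a minimal sketch, where the filename is just an example:

#Save to disk (before showing), or display interactively
#plt.savefig('bestfit.png', dpi=150)
plt.show()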

Finally, as a demonstration of the available plotting tools in R, I’m again going to exploit the example() function…

Here we have an example of sky-plots with magicaxis:

#Sky-plots in R with magicaxis
example("magproj")
## 
## magprj> # GAMA fields:
## magprj> par(mar=c(0.1,0.1,0.1,0.1))
## 
## magprj> magproj(c(129,141), c(-2,3), type='b', projection='aitoff', centre=c(180,0),
## magprj+ fliplong=TRUE, labloc=c(90,-45), col='red', labeltype = 'sex', crunch=TRUE)

## 
## magprj> magproj(c(211.5,223.5), c(-2,3), col='red', add=TRUE)
## 
## magprj> magproj(c(30.2,38.8), c(-10.25,-3.72), col='red', add=TRUE)
## 
## magprj> magproj(c(30.2,38.8), -6, type='l', add=TRUE, col='grey')
## 
## magprj> magproj(c(339,351), c(-35,-30), col='red', add=TRUE)
## 
## magprj> magecliptic(width=10,col=hsv(1/12,alpha=0.3),border=NA)
## 
## magprj> magecliptic(width=0,col='orange')
## 
## magprj> magMWplane(width=20,col=hsv(v=0,alpha=0.1),border=NA)
## 
## magprj> magMWplane(width=0,col='darkgrey')
## 
## magprj> magMW(pch=16, cex=2, col='darkgrey')
## 
## magprj> magsun(c(7,26), pch=16, cex=2, col='orange2') #An important date!
## 
## magprj> magproj(c(174,186), c(-3,2), col='red', add=TRUE)
## 
## magprj> #Plus SDSS:
## magprj> magproj(c(110,260), c(-4,70), border='blue', add=TRUE)
## 
## magprj> magproj(c(35,135,180,217.5,345), c(-3.72,3,2,3,-30)+10, type='t',
## magprj+ plottext=c('G02','G09','G12','G15','G23'), add=TRUE)
## 
## magprj> legend('topleft', legend=c('GAMA Regions','SDSS Main Survey'), col=c('red','blue'),
## magprj+ pch=c(15,NA), lty=c(NA,1), bty='n')
## 
## magprj> legend('topright', legend=c('Ecliptic','MW Plane'), col=c(hsv(c(1/12,0), v=c(1,0),
## magprj+ alpha=0.5)), pch=c(15,15), lty=c(1,1), bty='n')
## 
## magprj> legend('bottomleft', legend=c('Sun', 'MW Centre'), col=c('orange2','darkgrey'), pch=16,
## magprj+ bty='n')

Or plotting a colour image with magicaxis:

#Colour-image plots in R with magicaxis
library(FITSio)
example("magimageWCSRGB", run.dontrun = TRUE)
## 
## mWCSRG> image=readFITS(system.file("extdata", 'VIKING/mystery_VIKING_Z.fits', package="ProFound"))
## 
## mWCSRG> #Convenient image plotting for lists containing headers:
## mWCSRG> 
## mWCSRG> magimageWCS(image$imDat, header=image$hdr)

## 
## mWCSRG> magimageWCS(image)
## 
## mWCSRG> #First using the outer margins for tick labels:
## mWCSRG> 
## mWCSRG> par(mar=c(3.1,3.1,1.1,1.1))
## 
## mWCSRG> magimageWCS(image)

## 
## mWCSRG> magimageWCS(image, coord.type='deg')

## 
## mWCSRG> #Now removing the margins and putting labels inside the image:
## mWCSRG> 
## mWCSRG> par(mar=c(0,0,0,0))
## 
## mWCSRG> magimageWCS(image, margin=FALSE)

## 
## mWCSRG> magimageWCS(image, margin=FALSE, coord.type='deg')

## 
## mWCSRG> #We can make a WCS colour image of mismatched images:
## mWCSRG> 
## mWCSRG> VISTA_K=readFITS(system.file("extdata", 'VISTA_K.fits', package="magicaxis"))
## 
## mWCSRG> VST_r=readFITS(system.file("extdata", 'VST_r.fits', package="magicaxis"))
## 
## mWCSRG> GALEX_NUV=readFITS(system.file("extdata", 'GALEX_NUV.fits', package="magicaxis"))
## 
## mWCSRG> magimageWCSRGB(VISTA_K, VST_r, GALEX_NUV)

## 
## mWCSRG> magimageWCSRGB(VISTA_K, VST_r, GALEX_NUV, saturation=0.5)

## 
## mWCSRG> #To make direct magimageRGB plots of the outputs you must turn off magmap scaling:
## mWCSRG> 
## mWCSRG> temp=magimageWCSRGB(VISTA_K, VST_r, GALEX_NUV)

## 
## mWCSRG> magimageRGB(R=temp$R, G=temp$G, B=temp$B, magmap=FALSE)

## 
## mWCSRG> #We can map onto various WCS schemes easily too:
## mWCSRG> 
## mWCSRG> magimageWCSRGB(VISTA_K, VST_r, GALEX_NUV, VISTA_K$hdr)

## 
## mWCSRG> magimageWCSRGB(VISTA_K, VST_r, GALEX_NUV, VST_r$hdr)

## 
## mWCSRG> magimageWCSRGB(VISTA_K, VST_r, GALEX_NUV, GALEX_NUV$hdr)

Plotting data frames in R

Finally, here’s an example of the default plot for data frames in R:

#Data frame plots in R 
#Make a data frame with random gaussian data
cat<-data.frame(X=rnorm(1e3))
#Add some spuriously correlated variables 
cat$Y<-cat$X+rnorm(1e3,sd=0.5)
cat$Z<-cat$Y^2+rnorm(1e3)
#Include a character variable, just because we can
cat$Str<-sample(LETTERS[1:5],1e3,replace=TRUE,prob=c(0.05,0.5,0.2,0.15,0.1))
#This is our data
cat
#This is the default plot (but with transparent filled dots)
plot(cat,pch=20,col=hsv(v=0,a=0.1))

and the magicaxis equivalent:

#Data frame plots with magicaxis in R 
cat$Str<-NULL #Remove the string data 
magtri(cat,pch=20,col=hsv(v=0,a=0.8))
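
pandas provides a comparable pairs-plot helper for its data frames, scatter_matrix; a minimal sketch rebuilding the same toy catalogue in python (the variable names mirror the R example, and the alpha argument mimics the transparent points used above):

#Data frame pairs plots in python with pandas
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#Rebuild the toy catalogue with random gaussian data
cat = pd.DataFrame({'X': np.random.normal(size=1000)})
cat['Y'] = cat.X + np.random.normal(scale=0.5, size=1000)
cat['Z'] = cat.Y**2 + np.random.normal(size=1000)
#Default pairs plot of all numeric columns
pd.plotting.scatter_matrix(cat, alpha=0.2)
plt.show()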