Appendix 6
Further statistics
RPy and the R statistical package
The R statistical package is one of the most commonly used packages for analysing
statistical data, and it has its own language. There is a Python wrapper around it called
RPy. The main reason to use RPy is to interface from Python to a large body of existing
R code.
There are a few things to keep in mind when using RPy. Standard Python collection
types or NumPy arrays have to be converted into special RPy data types, and results that
are returned from R have to be suitably interpreted. Reading the R documentation is
crucial to using RPy.
We will illustrate the use of R via RPy for a few standard examples.
Binomial test
First we consider the binomial test, which is concerned with the number of occurrences of
an event that has a fixed probability of occurring, given a certain number of trials. R has a
method, ‘binom.test’, to do the binomial test. We create a function, binomialTailTest(),
which calls this method via RPy, and which has the same arguments as in our previous
version of the function in Chapter 22, which used SciPy.
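To make the quantity concrete, the one-tailed probability can also be computed directly from the binomial formula, without R. The following is only a minimal sketch for cross-checking, not the method used in this appendix: the function name binomialUpperTail is hypothetical, and the term-by-term sum is assumed to be adequate only for a moderate number of trials.

```python
from math import comb

def binomialUpperTail(count, nTrials, pEvent):
  # Hypothetical helper: P(X >= count) for a binomial distribution,
  # summed term by term. Converting very large binomial coefficients
  # to floats limits this to nTrials of roughly a thousand.
  total = 0.0
  for k in range(count, nTrials + 1):
    total += comb(nTrials, k) * pEvent ** k * (1.0 - pEvent) ** (nTrials - k)
  return total

print('Direct one tail', binomialUpperTail(530, 1000, 0.5))
```

The value printed should agree closely with the one-tailed result that R returns for the same inputs.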
First we need to import the RPy module, rpy2.robjects, which we call R below. This has
an object inside it, R.r, which is what we use to get hold of R methods, using dictionary
syntax keyed on the name of the R method. Here we want to use the R method binom.test,
and so R.r['binom.test'] is the Python version of this R method.
The R documentation tells us that this function has four arguments, x, n, p and
alternative, which correspond to our arguments count, nTrials, pEvent and oneSided,
although alternative is a string in R rather than a Boolean. In fact alternative can take three
values in R: ‘greater’, which is for our one-tailed calculation; ‘two.sided’, which is for our
two-tailed calculation; and ‘less’, which gives the opposite (lower) tail, the probability of
seeing count or fewer events, so we do not need that here.
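The three alternatives can be sketched in pure Python. This is an illustration under stated assumptions, not R's actual code: the helper names binomialPmf and binomialPValue are hypothetical, and the two-sided rule used here (summing the probabilities of all outcomes that are no more likely than the observed one, with a small relative tolerance) is our understanding of how binom.test behaves.

```python
from math import exp, lgamma, log

def binomialPmf(k, n, p):
  # Hypothetical helper: P(X = k) for n trials, computed via logarithms
  # to avoid overflow. Assumes 0 < p < 1.
  logProb = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
             + k * log(p) + (n - k) * log(1.0 - p))
  return exp(logProb)

def binomialPValue(count, n, p, alternative='greater'):
  probs = [binomialPmf(k, n, p) for k in range(n + 1)]
  if alternative == 'greater':
    # Upper tail: count or more events
    return sum(probs[count:])
  if alternative == 'less':
    # Lower tail: count or fewer events
    return sum(probs[:count + 1])
  if alternative == 'two.sided':
    # Assumed rule: all outcomes no more likely than the observed one
    limit = probs[count] * (1.0 + 1e-7)
    return min(1.0, sum(q for q in probs if q <= limit))
  raise ValueError('Unknown alternative: %s' % alternative)

greater = binomialPValue(530, 1000, 0.5, 'greater')
less = binomialPValue(530, 1000, 0.5, 'less')
twoSided = binomialPValue(530, 1000, 0.5, 'two.sided')
print(greater, less, twoSided)
```

Both tails include the observed count, so the ‘less’ and ‘greater’ values sum to 1.0 plus the probability of exactly count events rather than to exactly 1.0. For a symmetric case such as pEvent = 0.5, the two-sided value under this rule is twice the one-tailed value.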
In R, there is an optional fifth argument, conf.level, which defaults to 0.95 and which is
used if you want to calculate a confidence interval, for example. Here we are just
calculating a probability. We are using the four arguments in the expected R order, so in
fact we do not need the ‘key=value’ syntax and could just list the values, although the
code below names them for clarity.
The one oddity is how to extract what we want from the returned result. The R output
contains a lot of information. It turns out that the probability is item 2 of the result
considered as a collection. That is not obvious, and can only be determined by looking at
the output or at the list of returned components in the R documentation. Furthermore, we
need to access item 0 of that, because it is an RPy collection type of length one. This
leads to the strange-looking result[2][0].
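import rpy2.robjects as R

def binomialTailTest(count, nTrials, pEvent, oneSided=True):
  # R expects the alternative hypothesis as a string, not a Boolean
  alt = 'greater' if oneSided else 'two.sided'
  func = R.r['binom.test']
  result = func(x=count, n=nTrials, p=pEvent, alternative=alt)
  # Item 2 is the p-value, held in an R vector of length one
  return result[2][0]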
We can now test the function.
count = 530
nTrials = 1000
pEvent = 0.5
result = binomialTailTest(count, nTrials, pEvent, oneSided=True)
print('Binomial one tail', result)
result = binomialTailTest(count, nTrials, pEvent, oneSided=False)
print('Binomial two tail', result)