Archive for the ‘General post’ Category

Benford’s Law after converting count data to be in base 5

March 8, 2012 2 comments

Firstly, I know nothing about election fraud – this isn’t a serious post. But I do like to do some simple coding. Ben Goldacre posted on using Benford’s Law to look for evidence of Russian election fraud. Then Richie Cotton did the same, but using R. Commenters on both sites suggested that, as the data didn’t span many orders of magnitude, converting it to a lower base (e.g. base 5) may be helpful. I’ve no idea if this would be helpful, but the idea of messing around with the data was appealing, so here it is.

The as.binary() function was posted to R-help by Robin Hankin. The code to do the analysis was by Richie Cotton. Putting it all together gives:

The graph...

So there we have it – with the numerical data in base 5, the observed and expected values are closer together than with the numerical data in base 10. The overall dynamic range is from 1 to 30430 (in base 5).
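
For reference, the expected first-digit fractions under Benford’s Law in base b are log_b(1 + 1/d) for d = 1 to b - 1, which is what the benford.fraction column in the code computes. A minimal sketch that tabulates them for base 5 and base 10 (the helper name benford_fraction() is made up for illustration):

```r
# Expected first-digit fractions under Benford's Law for a given base.
# benford_fraction() is an illustrative helper, not part of any package.
benford_fraction <- function(base) {
  digits <- 1:(base - 1)
  fractions <- log(1 + 1 / digits, base = base)
  names(fractions) <- digits
  fractions
}

round(benford_fraction(5), 3)
round(benford_fraction(10), 3)

# The terms telescope to log_b(b), so the fractions sum to 1 in any base.
sum(benford_fraction(5))
```

In base 5 only the digits 1 to 4 can lead a number, so the curve has four points rather than nine.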

The data are here. The code you’ll need is:-

 
## repeat the analysis but with base 5
rm(list = ls())
library(reshape)
library(stringr)
library(ggplot2)
russian <- read.csv("Russian observed results - FullData.csv")

as.binary <- function(n,base=2 , r=FALSE){ 
  ## function written by robin hankin
  out <- NULL 
  while(n > 0) { 
    if(r) { 
      out <- c(out , n%%base) 
    } else { 
      out <- c(n%%base , out) 
    } 
    n <- n %/% base 
  } 
  ans <- str_c(out, collapse = "")
  return(ans)
}
russian <- melt(
  russian[, 9:13], 
  variable_name = "candidate"
  )
russian$base_5_value <- apply(as.matrix(russian$value), MARGIN = 1,
                              FUN = as.binary, base = 5)
## base 5 digits run from 0 to 4, so the first non-zero digit is one of 1-4
russian$base_5_value_1st <- str_extract(russian$base_5_value, "[1-4]")

first_digit_counts <- as.vector(table(russian$base_5_value_1st))

first_digit_actual_vs_expected <- data.frame(
  digit            = 1:4,
  actual.count     = first_digit_counts,   
  actual.fraction  = first_digit_counts / nrow(russian),
  benford.fraction = log(1 + 1 / (1:4), base = 5)
  )

a_vs_e <- melt(first_digit_actual_vs_expected[, c("digit", "actual.fraction", "benford.fraction")], id.var = "digit")
(fig1_lines <- ggplot(a_vs_e, aes(digit, value, colour = variable)) +
  geom_line() +
  scale_x_continuous(breaks = 1:4) +
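  ## ggplot2 0.9.0 and later: formatter = "percent" became labels = scales::percent
  scale_y_continuous(labels = scales::percent) +
  ylab("Counts with this first digit") +
  ## opts() has been replaced by theme()
  theme(legend.position = "none")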
 )

range(as.numeric(russian$base_5_value), na.rm = TRUE)
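
The while loop in as.binary() is skipped when n is 0, and it stops with an error when n is NA, so zero or missing counts need separate handling. A minimal sketch of a base conversion that covers both cases and checks itself with a round trip (to_base() and the counts vector are invented for illustration; this is not Robin Hankin's function):

```r
# Convert one non-negative whole number to a string of digits in `base`.
# to_base() is an illustrative name; it handles 0 and NA explicitly.
to_base <- function(n, base = 5) {
  if (is.na(n)) return(NA_character_)
  if (n == 0) return("0")
  digits <- integer(0)
  while (n > 0) {
    digits <- c(n %% base, digits)   # prepend the next least significant digit
    n <- n %/% base
  }
  paste(digits, collapse = "")
}

counts <- c(0, 4, 5, 124, 1990, NA)   # made-up values
base_5 <- vapply(counts, to_base, character(1), base = 5)
base_5

# Round trip: strtoi() parses the digit strings back into base 10.
strtoi(base_5, base = 5L)
```

The round trip also gives a quick way to read the range reported above: 30430 in base 5 is 1990 in base 10.
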

Categories: General post

How to check if a file exists with HTTP and R

September 1, 2010 8 comments

So, there’s probably an easier way to do this (please let me know if you know it)…

Suppose you’re working with a system which creates (binary) files and posts them for download on a website. You know the names of the files that will be created. However, they may not have been made yet (they’re generated on the fly, and appear in a vaguely random order over time). There are several of them, and you want to know which ones are already there so that, once enough have been uploaded, you can run an analysis.

I spent quite a bit of time trying to work this out, and eventually came up with the following solution:

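library(RCurl)
newurl <- c("http://cran.r-project.org/web/packages/RCurl/RCurl.pdf",
            "http://cran.r-project.org/web/packages/RCurl/RCurl2.pdf")
## the second URL (expected to be missing) is tried first
for (n in 2:1) {
   ## start with an empty placeholder; it stays empty if the download fails
   z <- ""
   try(z <- getBinaryURL(newurl[n], failonerror = TRUE))
   ## a successful download leaves a raw vector with many elements in z
   if (length(z) > 1) {
      print(paste(newurl[n], " exists", sep = ""))
   } else {
      print(paste(newurl[n], " doesn't exist", sep = ""))
   }
}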

This uses RCurl to download the file into a variable z. The code then checks whether z now contains the file.

If the file doesn’t exist, getBinaryURL() throws an error, and your loop (if you are doing several files) will quit. Wrapping the getBinaryURL() call in try() means that the error won’t stop the loop from trying the next file (if you don’t trust me, try doing the above without the try() wrapper). You can see how wrapping this in a loop could quickly go through several files and download the ones which exist.

I’d really like to be able to do this, but not actually download the whole file (e.g. just the first 100 bytes) to see how many files of interest have been created, and if enough have, then download them all. I just can’t work out how to yet – I tried the range option of getBinaryURL() but this just crashed R. This would be useful if you are collecting data in real time, and you know you need at least (for example) 80% of the data to be available before you jump into a computationally expensive algorithm.
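
One possible route, offered as a sketch rather than something tested against the setup described here: RCurl also provides url.exists(), which asks the server for the headers only and so avoids downloading the file body. The helper name files_ready() and its threshold and checker arguments are assumptions made for illustration, and a server that handles header-only requests badly would need a different approach.

```r
# Sketch: find out which URLs respond without downloading the files.
# files_ready() is a hypothetical helper. `checker` defaults to
# RCurl::url.exists(), which requests the headers but not the body.
files_ready <- function(urls, threshold = 0.8,
                        checker = function(u) RCurl::url.exists(u)) {
  available <- vapply(urls, function(u) {
    # Any error (no network, RCurl not installed, ...) counts as
    # "not available" so that one bad URL cannot stop the loop.
    isTRUE(tryCatch(checker(u), error = function(e) FALSE))
  }, logical(1))
  fraction <- mean(available)
  list(available = available,
       fraction  = fraction,
       ready     = fraction >= threshold)
}

# Example call (needs network access, so it is left commented out):
# newurl <- c("http://cran.r-project.org/web/packages/RCurl/RCurl.pdf",
#             "http://cran.r-project.org/web/packages/RCurl/RCurl2.pdf")
# status <- files_ready(newurl)
# if (status$ready) message("Enough files are up; start the downloads")
```

The checker is passed in as an argument so that the 80% logic can be tried with a fake function before pointing it at a real server.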

So, there must be an easier way to do all this, but can I find it? …


Visualizing 3d data – plotting quartiles separately

July 30, 2010 3 comments

In this previous post, we looked at displaying three dimensional data.  One major problem is that when there is a high density of data, it can be difficult to see what’s going on in a 3 dimensional plot.

One way of looking at the data in more detail is to break it up.  Take a look at this graph:

This is a plot of data of air quality in Nottingham, UK, taken hourly in 2009 (the code to create it in base R is on the bottom of the page).  On the left is a scatterplot of NO2 against ozone (plot A).   The different colours indicate grouping the data by the level of ozone into quartiles.  On the right are plots of the NO vs NO2 for the same data, but a  separate plot for each quartile of the ozone data.  The points are all colour co-ordinated, so the red points indicating the upper quartile of the ozone data in plot A are matched by red points in plot B.

So you can see by comparing plots E and D that, at the lowest quartile of ozone levels, there is a greater spread of both NO2 and NO.

How this is done is pretty simple (most of the code is to make things vaguely pretty).  Essentially, the values for x, y and z are put into a matrix xyz.  The rows of the matrix are ordered according to the z variable.  The row numbers which delineate each quartile are calculated, and then the plots B to E of x vs y are drawn, using only the rows for that quartile.  The axes are plotted so that they are the same scale for each of the plots. There’s not much room for the axis labels – so these are added afterwards with the legend command.

Then on the left the plot for y (on the horizontal axis) and z (on the vertical axis) is drawn, with some added lines to show where the boundaries of each quartile lie.  The colours are stored in the xyz matrix in the cols column.  Like most of my code, the graph is portable: you just need to input different values for x, y and z and re-label the names for each variable.  The original dataset is the same one which I have used for my previous posts.  It is from the UK airquality database.  If you copy this file into your working directory and run the code below, you’ll repeat the plot.

Any suggestions for improvements / comments would be most appreciated!

## name the columns of the data
columns <- c("date", "time", "NO", "NO_status", "NO_unit", "NO2",
	"NO2_status", "NO2_unit", "ozone", "ozone_status", "ozone_unit", 
	"SO2", "SO2_status", "SO2_unit")
## read in the data, store it in variable data
data <- read.csv("27899712853.csv", header = FALSE, skip = 7, 
	col.names = columns, stringsAsFactors = FALSE)

## now make the x,y and z variables

x <- data$NO
y <- data$NO2
z <- data$ozone
cols <- rep(1,length(z))

xyz <- cbind (x,y)
xyz <- cbind(xyz,z)
xyz <- cbind(xyz,cols)
colq1 <- 6
colq2 <- 4
colq3 <- 3
colq4 <- 2

xl <- "NO"
yl <- "NO2"
zl <- "Ozone"

point <- 20 

# re order by z
xyz <- xyz[order(xyz[,3]),]
# now define the row numbers for the quartile boundaries
maxxyz <-  nrow(xyz)
q1xyz <- round(maxxyz/4)
medianxyz <-  round(maxxyz/2)
q3xyz <- round(maxxyz*3/4)

# assign colours to xyz$col
xyz[1:q1xyz,4] <- colq1
xyz[q1xyz:medianxyz,4] <- colq2
xyz[medianxyz:q3xyz,4] <- colq3
xyz[q3xyz:nrow(xyz),4] <- colq4

# define the maximum values for x,y, and z 
# these are used to ensure all the axes are the same scale
maxx <- max(x, na.rm = TRUE)
maxy <- max(y, na.rm = TRUE)
maxz <- max(z, na.rm = TRUE)


# now make the plot
# first job is to save the graphics parameters currently used
def.par <- par(no.readonly = TRUE)
# define the margins around each plot
par("mar" = c(2,2,0.5,0.5))
# make the layout for the plot
layout(matrix(c(5,1,5,2,5,3,5,4), 4, 2, byrow = TRUE))

# now do the four plots on the right
plot(xyz[q3xyz:maxxyz,1],xyz[q3xyz:maxxyz,2], col = colq4, 
	xlab = xl, ylab = yl, pch=point, xlim = c(0,maxx), 
	ylim = c(0,maxy))
legend(x = "right", yl, bty = "n")
legend(x = "topright", "B", bty = "n")

plot(xyz[medianxyz:q3xyz,1],xyz[medianxyz:q3xyz,2], col = colq3,
	pch=point, xlim = c(0,maxx), ylim = c(0,maxy))
legend(x = "right", yl, bty = "n")
legend(x = "topright", "C", bty = "n")

plot(xyz[q1xyz:medianxyz,1],xyz[q1xyz:medianxyz,2], col = colq2, 
	pch=point, xlim = c(0,maxx), ylim = c(0,maxy))
legend(x = "right", yl, bty = "n")
legend(x = "topright", "D", bty= "n")

plot(xyz[1:q1xyz,1],xyz[1:q1xyz,2], col = colq1, pch=point, 
	xlim = c(0,maxx), ylim = c(0,maxy))
legend(x = "right", yl, bty = "n")
legend(x = "bottom", xl, bty = "n")
legend(x = "topright", "E", bty = "n")

# now do the plot on the left
plot(xyz[,2],xyz[,3], col = xyz[,4], pch=point, xlim = c(0,maxy))
legend(x = "bottom", yl, bty = "n")
legend(x = "right", zl, bty = "n")
legend(x = "topright", "A", bty = "n")

abline(h=xyz[q1xyz,3],col=3,lty=2)
abline(h=xyz[medianxyz,3],col=4)
abline(h=xyz[q3xyz,3],col=5,lty=2)

## reset the graphics display to default
par(def.par)
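
An alternative to sorting the matrix and working out row numbers by hand is to let quantile() and cut() label each row with its quartile. This is a sketch of that different technique, not the method used in the code above, and it runs on synthetic values rather than the air quality file:

```r
# Alternative sketch: label each value with its quartile using
# quantile() and cut() instead of sorting and counting rows.
# Synthetic data stands in for the ozone column here.
set.seed(1)
z <- runif(200, min = 0, max = 100)   # pretend ozone readings

breaks <- quantile(z, probs = seq(0, 1, by = 0.25), na.rm = TRUE)
quartile <- cut(z, breaks = breaks, include.lowest = TRUE,
                labels = c("Q1", "Q2", "Q3", "Q4"))

# One colour per quartile, matching colq1 to colq4 in the code above
quartile_cols <- c(Q1 = 6, Q2 = 4, Q3 = 3, Q4 = 2)
point_cols <- quartile_cols[as.character(quartile)]

table(quartile)
```

The rows keep their original order, so point_cols can be passed straight to the col argument of plot(). One caveat: with heavily tied data two quantile breaks can coincide, and cut() then stops with an error because the breaks are not unique.
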

Matrix scatterplot of the Airquality data using lattice

In this post we will build on the last one, and create a matrix scatterplot. The package lattice allows for some really excellent graphics. In case you haven’t already seen it I recommend the R Graph Gallery for some examples of what it can do – browse the graphics by package used to create them. We’ll use the same dataset as last time, where we made a plot of the NO levels in the atmosphere vs ozone levels for Nottingham, UK.

First step is to load the lattice package.

require("lattice")

Download the dataset from here, and put the file in your working directory. Now we’ll read the dataset into the data frame data.

columns <- c("date", "time", "NO", "NO_status", "NO_unit",
      "NO2", "NO2_status", "NO2_unit", "ozone", "ozone_status",
      "ozone_unit", "SO2", "SO2_status", "SO2_unit")
data <- read.csv("27899712853.csv", header = FALSE,
      skip = 7, col.names = columns, stringsAsFactors = FALSE)
x <- data$NO
y <- data$ozone
z <- data$SO2

So that it’s easier to follow, I’ve extracted 3 vectors from the data frame: x, y, and z.   These are the columns of the data for NO, ozone and SO2.  When working with graphs, I usually do this (in the last post I extracted x and y).  If I make a nice graphic I can then “cut and paste” it into another program, and just change the data in x, y and z and hey presto, the same graphic is instantly used with new data.

For a matrix scatterplot, we need to make a matrix of the variables to compare. We join the vectors into a matrix and then name the columns.

mat <- cbind(x,y)
mat <- cbind(mat,z)
colnames(mat) <- c("NO", "ozone", "SO2")

You can look at the first 10 lines of mat with

mat[1:10,]
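
As a small aside, the two cbind() calls and the colnames() assignment can be folded into one step. A minimal sketch, using made-up values in place of the air quality columns:

```r
# Made-up numbers standing in for data$NO, data$ozone and data$SO2
x <- c(12, 15, 9, 30)
y <- c(40, 35, 52, 18)
z <- c(3, 4, 2, 6)

# Naming the arguments to cbind() sets the column names in the same step
mat <- cbind(NO = x, ozone = y, SO2 = z)
colnames(mat)

# head() copes with matrices that have fewer than 10 rows,
# whereas mat[1:10, ] would stop with a subscript error here
head(mat, 10)
```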

Finally we create the matrix plot:

title <- "Matrix scatterplot of air pollutants"
print(splom(mat, main = title))

The final result is here:

For those unfamiliar with matrix scatterplots – this plot is essentially 3 scatterplots of x vs y, x vs z and y vs z.  The middle left plot is the scatterplot created in this previous post.  The package lattice can do lots more than this – get help for it with the command

?lattice

Getting going with R – importing data and plotting a simple graphic

The most difficult part of the learning curve in R is often getting going – many datasets are pre-installed in the packages and organised, so it is difficult to see how to import your own data into R.  This post takes you step by step through the process of making a table from a spreadsheet and then a simple graph.

The first thing is to get some data.  A .csv file is a common “spreadsheet” like file.  Currently I’m working with some air quality data downloaded from the UK air quality archive.  The data I’ve downloaded are from 2009, for Nottingham, UK, and contain automated measurements of Nitric Oxide, NO2, Ozone, and Sulphur Dioxide.  The file is here.  You can cut and paste the code below into R.

The first thing to do is put the data into a variable, called data.  Copy the spreadsheet file into your working directory. We then use read.csv() for this:

columns <- c("date", "time", "NO", "NO_status", "NO_unit", 
      "NO2", "NO2_status", "NO2_unit", "ozone", "ozone_status", 
      "ozone_unit", "SO2", "SO2_status", "SO2_unit")
data <- read.csv("27899712853.csv", header = FALSE, 
      skip = 7, col.names = columns, stringsAsFactors = FALSE)

We have also skipped the first 7 lines of the file (if you look at the file in Notepad, you’ll see that the first 7 lines are descriptions and a header).  I wanted my own headers, which I set in the columns vector.  stringsAsFactors = FALSE is important – without this, older versions of R (before 4.0.0) convert text columns to factors, and things can go wrong.
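
To see what skip and col.names do without the downloaded file, here is a minimal sketch that reads a few made-up lines from a string. The line contents and the shortened columns vector are invented for illustration; the real file has 7 lines to skip and 14 columns.

```r
# A tiny stand-in for the downloaded file: two description lines,
# one header line, then the data. All values are made up.
raw_lines <- c(
  "Air quality export",
  "Site: example",
  "Date,Time,Nitric Oxide,Status",
  "01/01/2009,01:00,12,V",
  "01/01/2009,02:00,15,V"
)
columns <- c("date", "time", "NO", "NO_status")

# skip = 3 drops the descriptions and the original header;
# col.names supplies our own headers instead
demo <- read.csv(text = paste(raw_lines, collapse = "\n"),
                 header = FALSE, skip = 3,
                 col.names = columns, stringsAsFactors = FALSE)
str(demo)
```
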

You can look at the data we’ve just imported using:

data[1:10,]

which shows the first 10 rows of the data (and all the columns).  R has lots of ways to access data from a table.  For example, we can look at the 5th to 10th measurements of NO using

data$NO[5:10]

So, let’s now do a plot.  A simple plot is to see what happens to NO levels over the whole dataset.  In which case, all you have to do is:

plot(data$NO)

For a more complex graph:

## start by saving the original graphical parameters 
def.par <- par(no.readonly = TRUE) 
x <- data$NO
y <- data$ozone
xlabel <- "NO"
ylabel <- "ozone"
layout(matrix(c(2,1,1,3,1,1), 2, 3, byrow = TRUE))
plot(x, y, xlab = xlabel, ylab = ylabel, pch = 20)
plot(x, xlab = NA, ylab = xlabel, pch = 20)
plot(y, xlab = NA, ylab = ylabel, pch = 20)
## reset the graphics display to default
par(def.par)

You should get something like:

So, what we’ve done here is used the layout command.  We’ve defined a matrix with 3 columns and 2 rows.   The numbers in the matrix tell R where the plots should go.   The matrix command which indicates this is:

matrix(c(2,1,1,3,1,1), 2, 3, byrow = TRUE)

and the output you get from this is:

     [,1] [,2] [,3]
[1,]    2    1    1
[2,]    3    1    1

Which shows that the second plot will be on the top left, and the third in the bottom left, and the 1st spread over the 4 cells of the table on the right.  The actual plots are simple.  We’ve defined x to be the NO data (using x <- data$NO ) and y to be ozone.  And then we’ve just plotted x and y against each other, and also in separate panes each like a time series. It’s worth playing with the numbers in this command to change the layout of the graph – can you stack the 3 graphs into a column?
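
One possible answer to the stacking question, as a sketch with made-up data: a layout matrix with 3 rows and 1 column puts the plots one above the other. The helper name plot_stacked() and the synthetic x and y values are assumptions for illustration.

```r
# Sketch: stack three plots in a single column with layout().
# The x and y values are synthetic stand-ins for data$NO and data$ozone.
set.seed(42)
x <- rnorm(50, mean = 20, sd = 5)
y <- rnorm(50, mean = 40, sd = 10)

# 3 rows, 1 column: plot 1 on top, plot 2 in the middle, plot 3 below
stacked <- matrix(1:3, nrow = 3, ncol = 1)
stacked

# plot_stacked() is an illustrative helper, not part of base R
plot_stacked <- function(x, y, layout_matrix) {
  def.par <- par(no.readonly = TRUE)
  on.exit(par(def.par))          # restore the graphics settings when done
  layout(layout_matrix)
  plot(x, y, xlab = "NO", ylab = "ozone", pch = 20)
  plot(x, xlab = NA, ylab = "NO", pch = 20)
  plot(y, xlab = NA, ylab = "ozone", pch = 20)
  invisible(layout_matrix)
}
```

Calling plot_stacked(x, y, stacked) draws the three panels; passing the 2 by 3 matrix from the post instead gives the original arrangement.
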
Well, that’s got us going for now.
There are of course much more complex plots which we can use and other ways to work with data, but later.


Welcome to Gosset’s student

Welcome to Gosset’s student, a blog about statistics, with a focus on using R.  We’re all learning R, as it’s constantly being improved.  The blog will aim for brevity, and a focussed approach to getting some stats done, rather than elegance of code.  Discussions which show how to replicate the result in a more elegant fashion, or which criticize the approach which the blog takes will be most welcome.
