## Benford’s Law after converting count data to be in base 5

Firstly, I know nothing about election fraud – this isn’t a serious post. But, I do like to do some simple coding. Ben Goldacre posted on using Benford’s Law to look for evidence of Russian election fraud. Then Richie Cotton did the same, but using R. Commenters on both sites suggested that as the data didn’t span a large order of magnitude, turning it into a lower base (e.g. base 5) may helpful. I’ve no idea if this would be helpful, but the idea of messing around with the data was appealing, so here it is.

The `as.binary()`

function was posted to R-help by Robin Hankin. The code to do the analysis was by Richie Cotton. Putting it all together gives:

So there we have it – with the numerical data in base 5, the observed and expected values are closer together than with the numerical data in base 10. The overall dynamic range is from 1 to 30430 (in base 5).

The data are here. The code you’ll need is:-

## repeat the analysis but with base 5 rm(list = ls()) library(reshape) library(stringr) library(ggplot2) russian <- read.csv("Russian observed results - FullData.csv") as.binary <- function(n,base=2 , r=FALSE){ ## function written by robin hankin out <- NULL while(n > 0) { if(r) { out <- c(out , n%%base) } else { out <- c(n%%base , out) } n <- n %/% base } ans <- str_c(out, collapse = "") return(ans) } russian <- melt( russian[, 9:13], variable_name = "candidate" ) russian$base_5_value <- apply(as.matrix(russian$value), MARGIN = 1, FUN = as.binary, base = 5) russian$base_5_value_1st = str_extract(russian$base_5_value, "[123456789]") first_digit_counts <- as.vector(table(russian$base_5_value_1st)) first_digit_actual_vs_expected <- data.frame( digit = 1:4, actual.count = first_digit_counts, actual.fraction = first_digit_counts / nrow(russian), benford.fraction = log(1 + 1 / (1:4), base = 5) ) a_vs_e <- melt(first_digit_actual_vs_expected[, c("digit", "actual.fraction", "benford.fraction")], id.var = "digit") (fig1_lines <- ggplot(a_vs_e, aes(digit, value, colour = variable)) + geom_line() + scale_x_continuous(breaks = 1:4) + scale_y_continuous(formatter = "percent") + ylab("Counts with this first digit") + opts(legend.position = "none") ) range(as.numeric(russian$base_5_value), na.rm = T)

## Machine learning techniques in the biomedical literature

There are relatively few articles published on using machine learning techniques on what many would consider “classical” biomedical study designs (e.g a sample size of 200 and about 10 parameters) and approaches to dealing with . But they may start being published. This is a list to get going with. No all of the article below fit into the above criteria but I’ve kept them here as they’re interesting (at least to me).

This post was motivated by this question on Crossvalidated. I will add to it as I find them or people point them out to me. It’s very short at the moment! Let me know of any broken links.

**Articles**

Statnikov A, Wang L, Aliferis CF A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics. 2008 Jul 22;9:319.

Van Loon K, Guiza F, Meyfroidt G, Aerts JM, Ramon J, Blockeel H, Bruynooghe M, Van Den Berghe G, Berckmans D. Dynamic data analysis and data mining for prediction of clinical stability. Stud Health Technol Inform. 2009;150:590-4.

Luaces O, Taboada F, Albaiceta GM, Domínguez LA, Enríquez P, Bahamonde A; GRECIA Group.Predicting the probability of survival in intensive care unit patients from a small number of variables and training examples.Artif Intell Med. 2009 Jan;45(1):63-76. Epub 2009 Jan 29.

Wu TT, Chen YF, Hastie T, Sobel E, Lange K. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics. 2009 Mar 15;25(6):714-21. Epub 2009 Jan 28.

Schwaighofer A, Schroeter T, Mika S, Blanchard G. Comb Chem High Throughput Screen. How wrong can we get? A review of machine learning approaches and error bars. 2009 Jun;12(5):453-68.

Huang H, Chanda P, Alonso A, Bader JS, Arking DE. Gene-based tests of association. PLoS Genet. 2011 Jul;7(7):e1002177. Epub 2011 Jul 28.

Liu Z, Shen Y, Ott J. Multilocus association mapping using generalized ridge logistic regression. BMC Bioinformatics. 2011 Sep 29;12:384.

Theses

Hug, Caleb W. Predicting the risk and trajectory of intensive care patients using survival models. 2006 Massachusetts Institute of Technology

**Talks / slides / videos:**

Victoria Stodden’s slides

## Logistic regression – simulation for a power calculation…

Please note–I’ve spotted a problem with the approach taken in this post – it seems to underestimate power in certain circumstances. I’ll post again with a correction or a more full explanation when I’ve sorted it.

So, I posted an answer on cross validation regarding logistic regression. I thought I’d post it in a little more depth here, with a few illustrative figures. It’s based on the approach which Stephen Kolassa described.

Power calculations for logistic regression are discussed in some detail in Hosmer and Lemeshow (Ch 8.5). One approach with R is to simulate a dataset a few thousand times, and see how often your dataset gets the p value right. If it does 95% of the time, then you have 95% power.

In this code we use the approach which Kleinman and Horton use to simulate data for a logistic regression. We then initially calculate the overall proportion of events. To change the number of events adjust odds.ratio. The independent variable is assumed to be normally distributed with mean 0 and variance 1.

nn <- 950 runs <- 10000 intercept <- log(9) odds.ratio <- 1.5 beta <- log(odds.ratio) proportion <- replicate( n = runs, expr = { xtest <- rnorm(nn) linpred <- intercept + (xtest * beta) prob <- exp(linpred)/(1 + exp(linpred)) runis <- runif(length(xtest),0,1) ytest <- ifelse(runis < prob,1,0) prop <- length(which(ytest <= 0.5))/length(ytest) } ) summary(proportion)

This plot shows how the intercept and odds ratio affect the overall proportion of events per trial:

When you’re happy that the proportion of events is right (with some prior knowledge of the dataset), you can then fit a model and calculate a p value for that model. We use R’s inbuilt function replicate to do this 10,000 times, and count the proportion where it gets it right (i.e. p < 0.05). The proportion of the time that the simulation correctly get's the p < 0.05 is essentially the power of the logistic regression for your number of cases, odds ratio and intercept.

result <- replicate( n = runs, expr = { xtest <- rnorm(nn) linpred <- intercept + (xtest * beta) prob <- exp(linpred)/(1 + exp(linpred)) runis <- runif(length(xtest),0,1) ytest <- ifelse(runis < prob,1,0) summary(model <- glm(ytest ~ xtest, family = "binomial"))$coefficients[2,4] < .05 } ) print(sum(result)/runs)

I checked it against the examples given in Hsieh, 1999. It seemed to work pretty well calculating the power to be within ~ 1% of the power of the examples given in table II of that paper.

We can do some interesting things with R. I simulated a range of odds ratios and a range of sample sizes. The plot of these looks like this (each line represents an odd ratio):-

We can also keep the odds ratio constant, but adjust the proportion of events per trial. This looks like this (each line represents an event rate):

As ever, if anyone can spot an error or suggest a simpler way to do this then let me know. I haven’t tested my simulation against any packages which calculate power for logistic regression, but if anyone can it would be great to hear from you.

## Science is vital – what we don’t know yet

This post is not about R (for a change). For working UK scientists, science is vital – sign the on-line petition to preserve science funding.

For my contribution of what we don’t know yet –

We don’t know whether we can use biomarkers of kidney injury to personalise the doses of medications to maximise the dose for the patient whilst minimizing any renal side effects.

## How to check if a file exists with HTTP and R

So, there’s probably an easier way to do this (please let me know if you know it)…

Suppose you’re working with a system which creates (binary) files and posts them for download on a website. You know the names of the files that will be created. However, they may not have been made yet (they’re generated on the fly, and appear in a vaguely random order over time). There are several of them and you want to know which ones are there yet, and when there are enough uploaded, run an analysis.

I spent quite a bit of time trying to work this out, and eventually came up with the following solution:

require(RCurl) newurl <- c("http://cran.r-project.org/web/packages/RCurl/RCurl.pdf", "http://cran.r-project.org/web/packages/RCurl/RCurl2.pdf") for (n in 2:1){ z <- "" try(z <- getBinaryURL(newurl[n], failonerror = TRUE)) if (length(z) > 1) {print(paste(newurl[n], " exists", sep = "")) } else {print(paste(newurl[n], " doesn't exist", sep = ""))} }

What this does is uses RCurl to download the file into a variable z. Then your system will check to see if z now contains the file.

If the file doesn’t exist, getBinaryURL() returns an error, and your loop (if you are doing several files) will quit. Wrapping the getBinaryURL() in try() means that the error won’t stop the loop from trying the next file (if you don’t trust me, try doing the above without the try wrapper). You can see how wrapping this in a loop could quickly go through several files and download ones which exist.

I’d really like to be able to do this, but not actually download the whole file (e.g. just the first 100 bytes) to see how many files of interest have been created, and if enough have, then download them all. I just can’t work out how to yet – I tried the range option of getBinaryURL() but this just crashed R. This would be useful if you are collecting data in real time, and you know you need at least (for example) 80% of the data to be available before you jump into a computationally expensive algorithm.

So, there must be an easier way to do all this, but can I find it? …

## An HSV colour wheel in R

If you’ve read any of my previous posts, you’ll notice that they’re rather scanty on colour. There’s a reason for this. Mainly, that to get a good colour output takes some time. I recently read a commentary in Nature methods (sorry if you don’t have access to it, but this looks like it may be the first part of an interesting series of articles), which discusses colour in graphics. The author suggests a colour wheel, and I thought I’d have a go in R:

You have to click on it to read the text, sorry. There’s probably much easier ways to do it, and it takes a silly amount of time to render (several seconds! – all those nested loops), but this code below makes the colour wheel. If you set the variables t.hue, t.sat and t.val, the bottom right box is the resulting colour (the box just to the bottom right of the colour wheel is the hue with sat and val set to 1.0). Then on the right is the plot of val, and below is the plot of sat. As you go anti-clockwise from the x axis round your hue increases from 0.0 to 1.0.

So you can play around with colour, see what works and what doesn’t. This uses the HSV approach, which seemed okay for my purposes. rgb2hsv() converts rgb into hsv (obviously), if you are more familiar with the RGB approach. There are lots of other resources for colour in R, one of my favourites is here, and of course you can always search R-bloggers.

## colour plot require(graphics) t.hue <- 0.65 ## this is the user entered hue, sat and value t.sat <- 0.5 t.val <- 0.9 def.par <- par(no.readonly = TRUE) layout( matrix(c(1,1,2,1,1,2,3,3,4), 3, 3, byrow = TRUE)) ## prepare the plot for the wheel x <- (-100:100)*0.01 y <- (-100:100)*0.01 ## blank plot to prepare the axis plot(x,y, pch = 20, col = 0, bty = "n", xaxt = "n", yaxt = "n", ann = F) ## make the wheel for (x in (-100:100)*0.01){ for (y in (-100:100)*0.01){ theta <- atan2(y,x) # theta is the angle hue <- Mod(theta/(pi)) # make the hue dependent upon the angle sat <- (x^2 + y^2) # make the saturation depend upon distance from origin if (x^2 + y^2 <= 1){ if (y > 0) {points(x,y, pch = 19, col = hsv(h = hue/2, s = sat, v = 1))} if (y < 0) {points(-x,y, pch = 19, col = hsv(h = hue/2 + 0.5, s = sat, v = 1))} } } } legend("center", "hue", bty = "n") text(0.9,0, labels = "0.0") text(0,0.9, labels = "0.25") text(-0.9,0, labels = "0.5") text(0,-0.9, labels = "0.75") ## bottom right colour box inset into wheel for (x in (80:100)*0.01){ for (y in (-80:-100)*0.01){ points (x,y, pch = 19, col = hsv(t.hue, s = 1, v = 1)) } } ## right sided v scale x <- (0:100)*0.01 y <- (0:100)*0.01 plot(x,y, pch = 20, col = 0, xaxt = "n", yaxt = "n", bty = "n", ann = F) for (x in (50:100)*0.01){ for (y in (0:100)*0.01){ hue <- t.hue sat <- 1 points(x,y, pch = 19, col = hsv(h = hue, s = sat, v = y)) } } legend("topleft", "value", bty = "n") arrows(0.0, t.val, 0.5, t.val,length = 0.01, angle = 20) ## bottom saturation scale x <- (0:100)*0.01 y <- (0:100)*0.01 plot(x,y, pch = 20, col = 0, xaxt = "n", yaxt = "n", bty = "n", ann = F) for (x in (0:100)*0.01){ for (y in (0:50)*0.01){ hue <- t.hue points(x,y, pch = 19, col = hsv(h = hue, s = x, v = 1)) } } legend("topleft", "saturation", bty = "n") arrows(t.sat,1.0, t.sat, 0.5, length = 0.01, angle = 20) ## bottom right plot x <- (0:100)*0.01 y <- (0:100)*0.01 plot(x,y, pch = 20, col = 0, xaxt = "n", yaxt = "n", bty = "n", ann = F) for (x in (0:25)*0.01){ for (y in (0:100)*0.01){ points(x,y, pch = 19, col = hsv(h = t.hue, s = t.sat, v = t.val)) } } legtr <- paste( "hue=", t.hue, sep = "") legr <- paste( "sat=", t.sat, sep = "") legbr <- paste("val=", t.val, sep = "") legend("topright", legtr, bty = "n") legend("right", legr, bty = "n") legend("bottomright", legbr, bty = "n") ## reset the graphics display to default par(def.par)

## Summary plots

So, when you first look at some data, it’s helpful to get a feel of it. One way to do this is to do a plot or two. I’ve found myself continuously doing the same series of plots for different datasets, so in the end I wrote this short code to put all the plots together as a time saving device. Not pretty, but gets the job done.

So on the top a histogram with a normal distribution plot. On the right a QQ normal plot, with an Anderson Darling p value. Then in the middle on the left is the same data put into different numbers of bins, to see how this affects the look of the data. And on the right, we pretend that each value is the next one in a time series with equal time intervals between readings, and plot these. Below this is the ACF and PACF plots.

Hope someone else finds this useful. If there’s easier ways to do this, let me know. To use the code – put your data into a text file as a series of numbers called data.txt in the working directory, and run this code:

## univariate data summary require(nortest) data <- as.numeric(scan ("data.txt")) # first job is to save the graphics parameters currently used def.par <- par(no.readonly = TRUE) par("plt" = c(.2,.95,.2,.8)) layout( matrix(c(1,1,2,2,1,1,2,2,4,5,8,8,6,7,9,10,3,3,9,10), 5, 4, byrow = TRUE)) #histogram on the top left h <- hist(data, breaks = "Sturges", plot = FALSE) xfit<-seq(min(data),max(data),length=100) yfit<-yfit<-dnorm(xfit,mean=mean(data),sd=sd(data)) yfit <- yfit*diff(h$mids[1:2])*length(data) plot (h, axes = TRUE, main = "Sturges") lines(xfit, yfit, col="blue", lwd=2) leg1 <- paste("mean = ", round(mean(data), digits = 4)) leg2 <- paste("sd = ", round(sd(data),digits = 4)) legend(x = "topright", c(leg1,leg2), bty = "n") ## normal qq plot qqnorm(data, bty = "n", pch = 20) qqline(data) p <- ad.test(data) leg <- paste("Anderson-Darling p = ", round(as.numeric(p[2]), digits = 4)) legend(x = "topleft", leg, bty = "n") ## boxplot (bottom left) boxplot(data, horizontal = TRUE) leg1 <- paste("median = ", round(median(data), digits = 4)) lq <- quantile(data, 0.25) leg2 <- paste("25th quantile = ", round(lq,digits = 4)) uq <- quantile(data, 0.75) leg3 <- paste("75th quantile = ", round(uq,digits = 4)) legend(x = "top", leg1, bty = "n") legend(x = "bottom", paste(leg2, leg3, sep = "; "), bty = "n") ## the various histograms with different bins h2 <- hist(data, breaks = (0:12 * (max(data) - min (data))/12)+min(data), plot = FALSE) plot (h2, axes = TRUE, main = "12 bins") h3 <- hist(data, breaks = (0:10 * (max(data) - min (data))/10)+min(data), plot = FALSE) plot (h3, axes = TRUE, main = "10 bins") h4 <- hist(data, breaks = (0:8 * (max(data) - min (data))/8)+min(data), plot = FALSE) plot (h4, axes = TRUE, main = "8 bins") h5 <- hist(data, breaks = (0:6 * (max(data) - min (data))/6)+min(data), plot = FALSE) plot (h5, axes = TRUE,main = "6 bins") ## the time series, ACF and PACF plot (data, main = "Time series", pch = 20) acf(data, lag.max = 20) pacf(data, lag.max = 20) ## reset the graphics display to default par(def.par)