## Visualizing 3d data – plotting quartiles separately

In this previous post, we’ve looked at displaying three dimensional data. One major problem is when there is a high density of data, it can be difficult to see what’s going on in a 3 dimensional plot.

One way of looking at the data in more detail is to break it up. Take a look at this graph:

This is a plot of data of air quality in Nottingham, UK, taken hourly in 2009 (the code to create it in base R is on the bottom of the page). On the left is a scatterplot of NO2 against ozone (plot A). The different colours indicate grouping the data by the level of ozone into quartiles. On the right are plots of the NO vs NO2 for the same data, but a separate plot for each quartile of the ozone data. The points are all colour co-ordinated, so the red points indicating the upper quartile of the ozone data in plot A are matched by red points in plot B.

So you can see by comparing plot E and D, that at the lowest quartile of ozone levels, there is a greater spread of both NO2 and NO.

How this is done is pretty simple (most of the code is to make things vaguely pretty). Essentially, the values for x,y and z are put into a matrix xyz. The rows of the matrix are ordered according to the z variable. The rows which deliniate each quartile are calculated, and then the plots for B to E of x vs y are drawn, using only the rows for that quartile. The axes are plotted so that they are the same scale for each of the plots. There’s not much room for the axis labels – so these are added afterwards with the legend command.

Then on the left the plot for y (on the horizontal axis) and z (on the vertical axis) is drawn, with some added lines to show where the boundaries of each quartile lie. The colours are stored in the xyz matrix in the col column. Like most of my code, the graph is portable, you just need to input different values for x, y and z and re-label the names for each variable. The original dataset is the same one which I have used for my previous posts. It is from the UK airquality database. If you copy this file into your working directory and run the code below, you’ll repeat the plot.

Any suggestions for improvements / comments would be most appreciated!

## name the columns of the data columns <- c("date", "time", "NO", "NO_status", "NO_unit", "NO2", "NO2_status", "NO2_unit", "ozone", "ozone_status", "ozone_unit", "SO2", "SO2_status", "SO2_unit") ## read in the data, store it in variable data data <- read.csv("27899712853.csv", header = FALSE, skip = 7, col.names = columns, stringsAsFactors = FALSE) ## now make the x,y and z variables x <- data$NO y <- data$NO2 z <- data$ozone cols <- rep(1,length(z)) xyz <- cbind (x,y) xyz <- cbind(xyz,z) xyz <- cbind(xyz,cols) colq1 <- 6 colq2 <- 4 colq3 <- 3 colq4 <- 2 xl <- "NO" yl <- "NO2" zl <- "Ozone" point <- 20 # re order by z xyz <- xyz[order(xyz[,3]),] # now define the row numbers for the quartile boundries maxxyz <- nrow(xyz) q1xyz <- round(maxxyz/4) medianxyz <- round(maxxyz/2) q3xyz <- round(maxxyz*3/4) # assign colours to xyz$col xyz[1:q1xyz,4] <- colq1 xyz[q1xyz:medianxyz,4] <- colq2 xyz[medianxyz:q3xyz,4] <- colq3 xyz[q3xyz:nrow(xyz),4] <- colq4 # define the maximum values for x,y, and z # these are used to ensure all the axes are the same scale maxx <- x[which.max(x)] maxy <- y[which.max(y)] maxz <- z[which.max(z)] # now make the plot # first job is to save the graphics parameters currently used def.par <- par(no.readonly = TRUE) # define the margins around each plot par("mar" = c(2,2,0.5,0.5)) # make the layout for the plot layout(matrix(c(5,1,5,2,5,3,5,4), 4, 2, byrow = TRUE)) # now do the four plots on the right plot(xyz[q3xyz:maxxyz,1],xyz[q3xyz:maxxyz,2], col = colq4, xlab = xl, ylab = yl, pch=point, xlim = c(0,maxx), ylim = c(0,maxy)) legend(x = "right", yl, bty = "n") legend(x = "topright", "B", bty = "n") plot(xyz[medianxyz:q3xyz,1],xyz[medianxyz:q3xyz,2], col = colq3, pch=point, xlim = c(0,maxx), ylim = c(0,maxy)) legend(x = "right", yl, bty = "n") legend(x = "topright", "C", bty = "n") plot(xyz[q1xyz:medianxyz,1],xyz[q1xyz:medianxyz,2], col = colq2, pch=point, xlim = c(0,maxx), ylim = c(0,maxy)) legend(x = "right", yl, bty = "n") legend(x = "topright", "D", bty= "n") plot(xyz[0:q1xyz,1],xyz[0:q1xyz,2], col = colq1, pch=point, xlim = c(0,maxx), ylim = c(0,maxy)) legend(x = "right", yl, bty = "n") legend(x = "bottom", xl, bty = "n") legend(x = "topright", "E", bty = "n") # now do the plot on the left plot(xyz[,2],xyz[,3], col = xyz[,4], pch=point, xlim = c(0,maxy)) legend(x = "bottom", yl, bty = "n") legend(x = "right", zl, bty = "n") legend(x = "topright", "A", bty = "n") abline(h=xyz[q1xyz,3],col=3,lty=2) abline(h=xyz[medianxyz,3],col=4) abline(h=xyz[q3xyz,3],col=5,lty=2) ## reset the graphics display to default par(def.par)

## Turning your data into a 3d chart

Some charts are to help you analyse data. Some charts are to wow people. 3d charts are often the latter, but occasionally the former. In this post, we’ll look at how to turn your data into a 3d chart.

Let’s use the data from this previous post. Use the code which turns the .csv spreadsheet into 3 variables, x, y, and z.

3d charts generally need other packages. We’ll kick off with scatterplot3d, which perhaps makes things too easy:

library(scatterplot3d) scatterplot3d(x,y,z, highlight.3d = T, angle = 75, scale.y = .5)

The difficulty with 3d plots is that by definition, you’re looking at a 3d plot on a 2d surface. Wouldn’t you like to be able to rotate that plot around a bit? We’ll use the package rgl. Then type:

library(rgl) plot3d(x,y,z)

This pulls up an interactive window which you can rotate. Very helpful? Perhaps, but there are too many plots. Perhaps you only want to look at the middle 33% of the plots (i.e. look at a subset of the plot)?

startplot <- 33 endplot <- 66 a <- round(startplot/100*length(x)) b <- round(endplot/100*length(x)) plot3d(x[a:b],y[a:b],z[a:b], col = heat.colors(1000))

This looks much better. We’ve said we’d start at 33% of the way through the x,y,z co-ordinates, and end at 66% with the startplot and endplot variables. This is helpful – remember this is one year of data, and we’ve just displayed the middle of the year. The heatmap also helps to distinguish between plots, but in this case it doesn’t add any extra data – more of that in posts to come.

## Quick scatterplot with associated histograms

R can produce some beautiful graphics, and there are some excellent packages, such as lattice and ggplot2 to represent data in original ways. But sometimes, all you want to do is explore the realtionship between pairs of variables with the minimum of fuss.

In this post we’ll use the data which we imported in the previous post to make a quick graphic. I’ll assume you already got as far as importing the data and placing the variable for NO concentration into x and ozone into y.

We’re going to make a scatterplot with the histogram of x below the x axis, and the histogram of y rotated anti-clockwise through 90 degrees and alongside the y axis (all will become clear). The first thing is to set up the graphics display:

## start by saving the original graphical parameters def.par <- par(no.readonly = TRUE) ## then change the margins around each plot to 1 par("mar" = c(1,1,1,1)) ## then set the layout of the graphic layout(matrix(c(2,1,1,2,1,1,4,3,3), 3, 3, byrow = TRUE))

The layout command tells R to split the graphical output into a 3 by 3 array of panels. Each panel is given a number corresponding to the order in which graphics are plotted into it. To see this array, type:

matrix(c(2,1,1,2,1,1,4,3,3), 3, 3, byrow = TRUE)

This output shows that the display is split into 4 zones. The top right is a large area for plot one, the top left is a smaller panel for plot 2, and the bottom right is for plot 3.

So then, we need something for the top right – a straight forward scatter plot of x vs y (we set the maximum for the x axis with the xlim parameter of plot and using the maxx variable, which contains the maximum value held in the vector:

maxx <- x[which.max(x)] maxy <- y[which.max(y)] plot(x, y, xlab = "", ylab = "", pch = 20, bty = "n", xlim = c(0, maxx), ylim = c(0,maxy))

Then, we need to create a histogram of the y values, and plot it to the left of the histogram appropriately orientated. To do this we first store a histogram into the variable yh, and then plot it with the barplot command. The reason for this is that barplots can be easily rotated:

breaks <- 50 yh <- hist(y, breaks = (maxy/breaks)*(0:breaks), plot = FALSE) barplot(-(yh$intensities),space=0,horiz=T, axes = FALSE)

The breaks variable stores the number of bins into which the histogram is divided, maxy is the maximum value for the vector y, yh is the histogram, and then barplot extracts the heights of the bars from the histogram object draws it as a bar chart, but flips it on its side. The negative sign before yh$intensities points the bars to the left rather than the right.

We do the same for the x values, and also then reset the graphics display to defaults.

xh <- hist(x, breaks = (maxx/breaks)*(0:breaks), plot = FALSE) barplot(-(xh$intensities),space=0,horiz=F, axes = FALSE) ## reset the graphics display to default par(def.par)

We get this output:

The advantage of this over the straight scatterplot is that you can see the density of overlapping points on the histogram. I’ve set the number of bins in the histogram to 50 – it’s worth playing around with this with your data. There are more elegant ways of doing this, but if you have paired variables x and y, and you want to quickly look at their distributions and association, this code works fine.

## Matrix scatterplot of the Airquality data using lattice

In this post we will build on the last one, and create a matrix scatterplot. The package lattice allows for some really excellent graphics. In case you haven’t already seen it I recommend the R Graph Gallery for some examples of what it can do – browse the graphics by package used to create them. We’ll use the same dataset as last time, where we made a plot of the NO levels in the atmosphere vs ozone levels for Nottingham, UK.

First step is to load the lattice package.

require("lattice")

Download the dataset from here, and put the file in your working directory. Now we’ll put the dataset into the matrix data.

columns <- c("date", "time", "NO", "NO_status", "NO_unit", "NO2", "NO2_status", "NO2_unit", "ozone", "ozone_status", "ozone_unit", "SO2", "SO2_status", "SO2_unit") data <- read.csv("27899712853.csv", header = FALSE, skip = 7, col.names = columns, stringsAsFactors = FALSE) x <- data$NO y <- data$ozone z <- data$SO2

So that it’s easier to follow, I’ve extracted 3 vectors from the matrix: x, y, and z. These are the columns of the data for NO, ozone and SO2. Hopefully this will help you follow things. When working with graphs, I usually do this (in the last post I extracted x and y). If I make a nice graphic I can then “cut and paste” it into another program, and just change the data in x, y and z and hey presto, the same graphic is instantly used with new data.

For a matrix scatterplot, we need to make a matrix of the variables to compare. We join the vectors into a matrix and then name the columns.

mat <- cbind(x,y) mat <- cbind(mat,z) colnames(mat) <- c("NO", "ozone", "SO2")

You can look at the first 10 lines of mat with

mat[1:10,]

Finally we create the matrix plot:

title <- "Matrix scatterplot of air polutants" print(splom(mat, main = title))

For those unfamiliar with scatterplots – this plot is essentially 3 scatterplots of x vs y, x vs z and y vs z. The middle left plot is the scatterplot created in this previous post. The package lattice can do lots more than this – get help on line for it with the command

?lattice

## Getting going with R – importing data and plotting a simple graphic

The most difficult part of the learning curve in R is often getting going – many datasets are pre-installed in the packages and organised, so it is difficult to see how you to import your own data into R. This post takes you step by step through the process of making a table from a spreadsheet and then a simple graph.

The first thing is to get some data. A .csv file is a common “spreadsheet” like file. Currently I’m working with some air quality data downloaded from the UK air quality archive. The data I’ve downloaded is of 2009 data from Nottingham, UK containing automated measurements of Nitric Oxide, NO2, Ozone, and Sulphur Dioxide. The file is here. You can cut and paste the code below into R.

The first thing to do is put the data into a variable, called data. Copy the spreadsheet file into your working directory. We then use the read.csv for this:

columns <- c("date", "time", "NO", "NO_status", "NO_unit", "NO2", "NO2_status", "NO2_unit", "ozone", "ozone_status", "ozone_unit", "SO2", "SO2_status", "SO2_unit") data <- read.csv("27899712853.csv", header = FALSE, skip = 7, col.names = columns, stringsAsFactors = FALSE)

data[1:10,]

data$NO[5:10]

plot(data$NO)

## start by saving the original graphical parameters def.par <- par(no.readonly = TRUE) x <- data$NO y <- data$ozone xlabel <- "NO" ylabel <- "ozone" layout(matrix(c(2,1,1,3,1,1), 2, 3, byrow = TRUE)) plot(x, y, xlab = xlabel, ylab = ylabel, pch = 20) plot(x, xlab = NA, ylab = xlabel, pch = 20) plot(y, xlab = NA, ylab = ylabel, pch = 20) ## reset the graphics display to default par(def.par)

matrix(c(2,1,1,3,1,1), 2, 3, byrow = TRUE)

## Welcome to Gosset’s student

Welcome to Gosset’s student, a blog about statistics, with a focus on using R. We’re all learning R, as it’s constantly being improved. The blog will aim for brevity, and a focussed approach to getting some stats done, rather than elegance of code. Discussions which show how to replicate the result in a more elegant fashion, or which criticize the approach which the blog takes will be most welcome.