Notes on my R / Git workflow

These are some notes on my current R / git workflow. It is still quite fluid, and git has enough quirks that I usually forget part of it!

Creating Projects

I've used both RStudio and Eclipse. RStudio makes it easier to create a 'project' and add a local git repo to it, but Eclipse has more functionality (like roxygen comment generation), so I prefer Eclipse.

In Eclipse 3.7, I have both StatET and EGit installed. To start, create a new project normally (File > New > R Project), and add any starting material such as R and Data folders, a readme, etc.

Right-click on the project name and select Team > Share Project. Select Git and then create a local Git repo. For some reason Eclipse has a check box to create the repo within the Eclipse workspace, and then gives you a warning that it's not recommended.

Then there are a few ways to commit: right-click on the project and choose Team > Commit, or use the Git Staging view tab. Whichever route you take, select which files to commit and enter a comment. Your name and email are stored in Preferences > Team > eGit.

Backing Up 'locally'

To 'back up' (and potentially make available anywhere) I have a Linux server called Pegasus tucked away somewhere that does many, many jobs. It's actually an old work desktop and a tad underpowered, but it does the job.


One job is to act as a backup server, and that goes for git too, using two pieces of software, Gitosis and gitview (although it seems Gitosis hasn't been updated in a few years and isn't being actively maintained, which means no new bugs!).

To add a new repo to my server, on the local machine:

cd ~/gitosis-admin
kate gitosis.conf

Add lines for the new repo, save and close, then commit and push the change so the server picks it up:

git commit -a -m "add repos for xxx"
git push

Then cd to the repo you're adding

git remote add pegasus gitosis@pegasus:PaulHurleyMisc.git

git push pegasus master

and the repo is magically on the server.  I can even visit http://pegasus/viewgit/index.php and see the new repo sitting there.

Backing up to the cloud, AKA GitHub

For things I'm happy to share, I have used GitHub as a great cloud-based way to share code (https://github.com/paulhurleyuk). The thing that always gets me is the need to create the repo on GitHub before pushing to it.

So create a repo on GitHub

then, on your local machine

git remote add github https://github.com/paulhurleyuk/testrepo.git

git push github master

I then get an error that something conflicts (because I have a file with the same name in both, usually readme.md), so I need to do

git pull github master

and then merge/drop any changes before doing git push again.

 

Some assorted Links

https://help.ubuntu.com/community/Git

 http://ao2.it/wiki/How_to_setup_a_GIT_server_with_gitosis_and_gitweb

http://lostechies.com/jasonmeridth/2010/05/25/gitosis-and-gitweb-part-1-setup/

 


Creating SVG Plots from R

I recently wanted to create a ggplot that I could then 'tweak' further. This is my solution: create an .svg file which can be loaded into a suitable application (I prefer Inkscape) and further edited / tweaked.

# Build an example Plot
library(ggplot2)
dataframe <- data.frame(fac = factor(c(1:4)), data1 = rnorm(400, 100, sd = 15))
dataframe$data2 <- dataframe$data1 * c(0.25, 0.5, 0.75, 1)
testplot <- qplot(x = fac, y = data2, data = dataframe, colour = fac, geom = c("boxplot", 
    "jitter"))
testplot

cairo 1


# Produce a PNG plot
library(Cairo)
Cairo(800, 800, file = "testplot12200.png", type = "png", bg = "transparent",
    pointsize = 12, units = "px", dpi = 200)
testplot
dev.off()

cairo 3



#Produce an svg file
library(Cairo)
# width and height are given once, by name, in inches (units = "in")
Cairo(width = 20, height = 20, file = "cairo_2.svg", type = "svg", bg = "transparent",
    pointsize = 12, units = "in", dpi = 400)
testplot
dev.off()



cairo 1
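
For comparison, base R ships its own svg() device (in grDevices), which avoids the extra package when the R build has cairo support. This is a minimal sketch under that assumption. The file name and the boxplot of the built-in ToothGrowth data are only illustrative, not the plot from this post.

```r
# Assumption: this R build has cairo support, which the base svg() device needs.
out_file <- file.path(tempdir(), "testplot_base.svg")  # hypothetical file name

if (capabilities("cairo")) {
  svg(out_file, width = 8, height = 8)  # size is given in inches
  boxplot(len ~ dose, data = ToothGrowth, main = "Example plot")
  dev.off()
}
```

The resulting file can be opened in Inkscape in the same way as the Cairo output.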

 


What lens should I buy next? Analysing and graphing a Digikam database using R

I use the open source photo management software Digikam (along with other tools such as Gimp and DarkTable). I obviously need very little encouragement to combine my geeky hobbies, so I quickly tried to interrogate Digikam with R, which is easy because Digikam keeps all its image info in a SQLite database, which R has support for.

So this post shows how I did it, along with some of the output, such as the focal length of my images over time. It looks like I need a telephoto lens! (This script and my Digikam db are in github here.)

library(RSQLite)
## Loading required package: DBI
library(ggplot2)
library(plyr)
m <- dbDriver("SQLite")
basedir <- "/home/paul/RStudio/DigikamR/"
con <- dbConnect(m, dbname = paste(basedir, "data/digikam4.db", sep = ""))


Now we've opened the database, we can examine some of the tables within it.


# List the tables in the database
dbListTables(con)
##  [1] "AlbumRoots"         "Albums"             "DownloadHistory"   
##  [4] "ImageComments"      "ImageCopyright"     "ImageHaarMatrix"   
##  [7] "ImageHistory"       "ImageInformation"   "ImageMetadata"     
## [10] "ImagePositions"     "ImageProperties"    "ImageRelations"    
## [13] "ImageTagProperties" "ImageTags"          "Images"            
## [16] "Searches"           "Settings"           "TagProperties"     
## [19] "Tags"               "TagsTree"

# List the columns of some of the interesting tables
names(dbReadTable(con, "ImageInformation"))
##  [1] "imageid"          "rating"           "creationDate"    
##  [4] "digitizationDate" "orientation"      "width"           
##  [7] "height"           "format"           "colorDepth"      
## [10] "colorModel"
names(dbReadTable(con, "ImageComments"))
## [1] "id"       "imageid"  "type"     "language" "author"   "date"    
## [7] "comment"
names(dbReadTable(con, "ImageMetadata"))
##  [1] "imageid"                      "make"                        
##  [3] "model"                        "lens"                        
##  [5] "aperture"                     "focalLength"                 
##  [7] "focalLength35"                "exposureTime"                
##  [9] "exposureProgram"              "exposureMode"                
## [11] "sensitivity"                  "flash"                       
## [13] "whiteBalance"                 "whiteBalanceColorTemperature"
## [15] "meteringMode"                 "subjectDistance"             
## [17] "subjectDistanceCategory"
names(dbReadTable(con, "ImageProperties"))
## [1] "imageid"  "property" "value"
names(dbReadTable(con, "ImagePositions"))
##  [1] "imageid"         "latitude"        "latitudeNumber" 
##  [4] "longitude"       "longitudeNumber" "altitude"       
##  [7] "orientation"     "tilt"            "roll"           
## [10] "accuracy"        "description"
names(dbReadTable(con, "Images"))
## [1] "id"               "album"            "name"            
## [4] "status"           "category"         "modificationDate"
## [7] "fileSize"         "uniqueHash"
names(dbReadTable(con, "TagProperties"))
## [1] "tagid"    "property" "value"

And now we can pull some of the interesting tables into a dataframe


# Pull some of the information together
Imgs <- dbReadTable(con, "Images")
ImgComments <- dbReadTable(con, "ImageComments")
ImgMeta <- dbReadTable(con, "ImageMetadata")
ImgInfo <- dbReadTable(con, "ImageInformation")
# and merge it together
ImgMerge <- merge(Imgs, ImgMeta, by.x = "id", by.y = "imageid")
ImgMerge <- merge(ImgMerge, ImgInfo, by.x = "id", by.y = "imageid")
# clean it up
ImgMerge$make <- as.factor(ImgMerge$make)
ImgMerge$model <- as.factor(ImgMerge$model)
ImgMerge$faperture <- as.factor(ImgMerge$aperture)
ImgMerge$fexposureTime <- as.factor(ImgMerge$exposureTime)
ImgMerge$fmodel <- as.factor(ImgMerge$model)
ImgMerge$Year <- format(as.POSIXct(ImgMerge$creationDate), format = "%Y")
ImgMerge$Month <- format(as.POSIXct(ImgMerge$creationDate), format = "%b")
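
As an alternative to reading whole tables and merging them in R, the join can be pushed down to SQLite with dbGetQuery(). The sketch below is only an illustration and is not the approach used in this post. It builds a tiny in-memory database with made-up rows (only the table and column names come from the listing above) so that it runs without a Digikam database.

```r
library(DBI)

# Toy stand-ins for two Digikam tables: invented rows, real column names
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "Images",
             data.frame(id = 1:3, name = c("a.jpg", "b.jpg", "c.jpg")))
dbWriteTable(con, "ImageMetadata",
             data.frame(imageid = 1:3,
                        model = c("NIKON D5000", "NIKON D5000", "OTHER"),
                        focalLength = c(18, 55, 200)))

# Join on the image id, keep only the needed columns, and filter in SQL
joined <- dbGetQuery(con, "
  SELECT i.id, i.name, m.model, m.focalLength
  FROM Images AS i
  JOIN ImageMetadata AS m ON m.imageid = i.id
  WHERE m.focalLength < 60")

dbDisconnect(con)
joined
```

Pulling only the needed columns this way should matter more as the database grows. For a few thousand images, merge() on whole tables is probably just as convenient.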

Here are some plots

# and draw some graphs
ggplot(data = subset(ImgMerge, focalLength < 60), aes(x = as.POSIXct(creationDate), 
    y = focalLength, colour = model)) + geom_point()

digikam 21


ggplot(data = ImgMerge, aes(x = focalLength)) + geom_histogram(binwidth = 5, 
    aes(colour = as.factor(model))) + facet_grid(model ~ .)

digikam 22


qplot(data = ImgMerge, x = as.numeric(as.character(aperture)), y = log(as.numeric(as.character(exposureTime))), 
    colour = as.factor(model), geom = "point")
## Warning: Removed 2638 rows containing missing values (geom_point).

digikam 23


ggplot(data = subset(ImgMerge, model == "NIKON D5000"), aes(x = focalLength)) + 
    geom_histogram(binwidth = 5) + facet_grid(Year ~ .)

digikam 24


ggplot(data = subset(ImgMerge, model == "NIKON D5000"), aes(x = as.POSIXct(creationDate), 
    y = focalLength)) + geom_point()
## Warning: Removed 14 rows containing missing values (geom_point).

digikam 25


ggplot(data = subset(ImgMerge, model == "NIKON D5000" & focalLength < 60), aes(x = as.POSIXct(creationDate), 
    y = focalLength)) + geom_point(alpha = 0.2)

digikam 26

 

 

 


Git Error when pushing with a large file

Quick note: I had an error recently where neither RStudio, EGit nor the command line would push my repo to GitHub. I can't remember the exact error, but after some googling I found this SO answer that solved it

git config http.postBuffer 524288000

This fixed my problem.
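
For reference, http.postBuffer is a size in bytes and only applies to pushes over HTTP(S). A quick check of what that number works out to, assuming it is meant as a whole number of MiB:

```r
# The value passed to git config, in bytes
post_buffer_bytes <- 524288000

# Convert to MiB (1 MiB = 1024^2 bytes)
post_buffer_mib <- post_buffer_bytes / 1024^2
post_buffer_mib
```

So the setting raises the buffer to 500 MiB.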


World Cup 2006 First Goal R Analysis

Quite a while ago my amazing wife asked me if it was possible to find the time of the first goal for the 2006 FIFA World Cup matches.  I was using R at the time and thought it was possible.  Here are the scripts I wrote to scrape the info from the FIFA website.  They're also posted on my github here.

There are three scripts: one scrapes the data and saves it as CSV files, the next does some processing and saves the results as CSV files, and the third produces some basic graphs.

Web Scraping Script

# Script to Scrape web pages to collect World Cup 2006 score data
#
# Author: Paul Hurley
###############################################################################
require(ggplot2)
require(plyr)
require(stringr)
require(RCurl)
require(XML)
#' Function to Sort a dataframe with a given list of columns
#' Cribbed from Spector, P. (2008). "Data Manipulation with R", UseR! Springer. Pg78
#' @param df Dataframe to be sorted
#' @param ... list of columns to sort on
#' @return A sorted dataframe
#' @author "Paul Hurley"
#' @export
#' @usage with(dataframe,sortframe(dataframe,column1, column2, column3))
#' @examples with(iris,sortframe(iris,Sepal.Length,Sepal.Width,Petal.Length))
sortframe<-function(df,...){df[do.call(order,list(...)),]}
goals<-function(match) {
theURL <-paste("http://www.fifa.com/worldcup/archive/germany2006/results/matches/match=974100",match,"/report.html",sep="")
webpage = tryCatch(getURL(theURL, header=FALSE, verbose=TRUE),
HTTPError = function(e) {
cat("HTTP error: ", e$message, "\n")
})
message(paste("Webpage size is ",nchar(webpage),sep=""))

[The remainder of the scraping script was garbled beyond recovery. The surviving fragments show that it read the page through a textConnection, parsed it with htmlParse and xpathSApply (div class="cont", xmlValue), built a three-column data frame of goals (Player, Team, Time) for each match, extracted the match details (teams, final score, match number, date, venue, attendance), ran goals() over the match numbers of each group and knockout round with ldply, combined the results with rbind, and wrote them to worldcup2006.csv in datadir.]

The processing script

# TODO: Add comment
#
# Author: paul
###############################################################################
require(ggplot2)
require(plyr)
require(stringr)
require(RCurl)
require(XML)
sortframe<-function(df,...){df[do.call(order,list(...)),]}
datadir<-"/home/paul/workspace/world_cup/data/"
world.cup.2006<-read.csv(paste(datadir, "worldcup2006.csv", sep=""))
world.cup.2006$Timen<-as.numeric(str_extract(as.character(world.cup.2006$Time)," [0-9]*"))
teamgoals<-ddply(subset(world.cup.2006,!is.na(Team)),.(Team),nrow)
top5<-subset(world.cup.2006,Team %in% (with(teamgoals,sortframe(teamgoals,-V1))$Team[1:5]))
top5$Team<-factor(top5$Team)
write.csv(top5, paste(datadir, "top5.csv", sep=""))
firstgoal<-ddply(world.cup.2006,.(match),function(df) {
# sort by goal time, then keep the earliest goal of the match
df<-with(df,sortframe(df,Timen))
return(df[1,])
})
write.csv(firstgoal,paste(datadir, "firstgoals.csv", sep=""))
top5firstgoal<-ddply(top5,.(match),function(df) {
df<-with(df,sortframe(df,Timen))
return(df[1,])
})
# the graphing script reads top5firstgoal.csv, so include the extension
write.csv(top5firstgoal,paste(datadir, "top5firstgoal.csv", sep=""))
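
The 'first goal per match' step boils down to sorting by goal time within each match and keeping the first row. Here is a minimal base-R sketch of that idea on made-up data. The column names match and Timen follow the script, while the teams and times are invented for illustration.

```r
# Invented example: two matches with goal times in minutes
goals_toy <- data.frame(
  match = c(1, 1, 2, 2, 2),
  Team  = c("A", "B", "C", "D", "C"),
  Timen = c(55, 12, 80, 30, 45)
)

# Sort by match, then by goal time within each match
sorted <- goals_toy[order(goals_toy$match, goals_toy$Timen), ]

# Keep the first (earliest) row for each match
firstgoal_toy <- sorted[!duplicated(sorted$match), ]
firstgoal_toy
```

The key point is that the sorted data frame has to be assigned before taking the first row. Sorting without keeping the result leaves the rows in their original order.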
 

and the graphs

# TODO: Add comment
#
# Author: paul
###############################################################################
require(ggplot2)
require(plyr)
require(stringr)
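datadir<-"/home/paul/workspace/world_cup/data/"
# world.cup.2006 is plotted below, so load it and rebuild the numeric goal time
world.cup.2006<-read.csv(paste(datadir, "worldcup2006.csv", sep=""))
world.cup.2006$Timen<-as.numeric(str_extract(as.character(world.cup.2006$Time)," [0-9]*"))
firstgoal<-read.csv(file=paste(datadir, "firstgoals.csv", sep=""))
top5firstgoal<-read.csv(file=paste(datadir, "top5firstgoal.csv", sep=""))
top5<-read.csv(file=paste(datadir, "top5.csv", sep=""))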
print(qplot(Timen,data=firstgoal, geom="histogram", binwidth=1))
qplot(Timen,data=firstgoal, geom="histogram", binwidth=5)
qplot(Timen,data=firstgoal, geom="histogram", binwidth=10)
qplot(factor(Team), Timen, data=firstgoal, geom="boxplot")+geom_jitter()
qplot(factor(Team), Timen, data=world.cup.2006, geom="boxplot")+geom_jitter()
qplot(Timen,data=top5firstgoal, geom="histogram", binwidth=1)
qplot(Timen,data=top5firstgoal, geom="histogram", binwidth=5)
ggplot(top5firstgoal,aes(Timen, fill=Team))+geom_density(alpha=0.2)
qplot(factor(Team), Timen, data=top5firstgoal, geom="boxplot")+geom_jitter()
qplot(factor(Team), Timen, data=top5, geom="boxplot")+geom_jitter()
 

time2score byteam box

 

time2score hist

 

time2score by team

 

 


What's the smallest amount you can't make with 5 coins ?

My amazing, awesome wife often comes up with little puzzles for our amazing children, and this one seemed destined to be solved in R. Using up to 5 coins (1p, 2p, 5p, 10p, 20p and 50p), she first asked our kids whether they could make every value up to 50p, and then what the smallest value they couldn't make was.

Here's my R solution (which took about 5mins less than our daughter took to answer the first question)

# What Amounts can't you make using up to 5 coins 1p to 50p
# 
# Author: Paul Hurley
library(ggplot2)
library(plyr)
# Define our coins
coins <- as.factor(c(0, 1, 2, 5, 10, 20, 50))
# build a list of all the possibilities
possibilities <- expand.grid(coin1 = coins, coin2 = coins, coin3 = coins, coin4 = coins, 
    coin5 = coins)
# calculate the result
possibilities$total <- as.numeric(as.character(possibilities$coin1)) + as.numeric(as.character(possibilities$coin2)) + 
    as.numeric(as.character(possibilities$coin3)) + as.numeric(as.character(possibilities$coin4)) + 
    as.numeric(as.character(possibilities$coin5))
# define our target values
targets <- 1:250
# what amounts aren't possible
targets[!targets %in% possibilities$total]
##  [1]  88  89  98  99 118 119 128 129 133 134 136 137 138 139 143 144 146
## [18] 147 148 149 158 159 163 164 166 167 168 169 173 174 176 177 178 179
## [35] 181 182 183 184 185 186 187 188 189 191 192 193 194 195 196 197 198
## [52] 199 203 204 206 207 208 209 211 212 213 214 215 216 217 218 219 221
## [69] 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238
## [86] 239 240 241 242 243 244 245 246 247 248 249

So, the smallest value we can't make is 88.

We can even produce a table of the number of ways to make each number, and a graph

tableofpossibilities <- ddply(.data = possibilities, .(total), nrow)
ggplot(data = possibilities, aes(x = total)) + geom_histogram(binwidth = 1)

plot of chunk unnamed-chunk-2

Then when I triumphantly told her, she asked, 'what about 4 coins ?'

# How about 4 coins? Build a list of all the possibilities
fourpossibilities <- expand.grid(coin1 = coins, coin2 = coins, coin3 = coins, 
    coin4 = coins)
# calculate the result
fourpossibilities$total <- as.numeric(as.character(fourpossibilities$coin1)) + 
    as.numeric(as.character(fourpossibilities$coin2)) + as.numeric(as.character(fourpossibilities$coin3)) + 
    as.numeric(as.character(fourpossibilities$coin4))
# what values can't be made ?
targets[!targets %in% fourpossibilities$total]
##   [1]  38  39  48  49  68  69  78  79  83  84  86  87  88  89  93  94  96
##  [18]  97  98  99 108 109 113 114 116 117 118 119 123 124 126 127 128 129
##  [35] 131 132 133 134 135 136 137 138 139 141 142 143 144 145 146 147 148
##  [52] 149 153 154 156 157 158 159 161 162 163 164 165 166 167 168 169 171
##  [69] 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188
##  [86] 189 190 191 192 193 194 195 196 197 198 199 201 202 203 204 205 206
## [103] 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
## [120] 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## [137] 241 242 243 244 245 246 247 248 249 250

So, the answer is 38, and a graph


ggplot(data = fourpossibilities, aes(x = total)) + geom_histogram(binwidth = 1)

plot of chunk unnamed-chunk-4
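
Since the follow-up question only changes the number of coins, the same search can be wrapped in a function. This is a minimal base-R sketch. The function name smallest_unmakeable is made up for illustration. It uses the same coin set as above and builds the reachable totals one coin at a time instead of with expand.grid.

```r
# Smallest whole amount that cannot be made with up to n_coins coins
smallest_unmakeable <- function(n_coins, coins = c(1, 2, 5, 10, 20, 50)) {
  reachable <- 0  # totals reachable with zero coins
  for (i in seq_len(n_coins)) {
    # adding one more coin to every total reached so far; keeping the old
    # totals as well covers the "up to" part of the puzzle
    reachable <- unique(c(reachable, as.vector(outer(reachable, coins, "+"))))
  }
  # one past the largest reachable total is never reachable, so min() is defined
  targets <- seq_len(max(reachable) + 1)
  min(targets[!targets %in% reachable])
}

smallest_unmakeable(5)
smallest_unmakeable(4)
```

The two calls give 88 and 38, matching the results above.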


Getting into R, RCommander, JGR and Deducer

I've been meaning to post something about R for a while, but never got started, and now have a pile of things I'd like to post, so it's time to get started.

I first started using R during my Master's dissertation, where I had to do some stats. I've since had several occasions where I needed to do some ad-hoc data analysis of one sort or another, and every time I've ended up using R to get it done. I now use R regularly, and while I can't describe myself as an expert, I'd say I'm an enthusiastic amateur.

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It is a full, proper programming language, even being Turing complete. It has a suite of operators for calculations on data in many forms, in particular arrays and matrices. It also has an enormous collection of add-on packages for pretty much any form of analysis, calculation or statistic that can be performed.

You can run R in a million different ways; the most basic is just using the R interpreter at the command line. Personally I use Eclipse / StatET / LaTeX, which I'll describe another time.

A colleague recently asked about the basics, so I've cribbed my email back to him here, where I suggested either Rcommander or JGR / Deducer, which both seem the ideal mid point of some extra menu/click functionality without trying to rebuild Excel in R.  Rcommander seems to be slightly better in terms of a statistics tool, and JGR/Deducer in terms of data exploration.
 
RCommander (http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/) is an add-on package to R.

To use RCommander, install R and run it, then in the console type

install.packages("Rcmdr", dependencies=TRUE)

This should install R Commander. To run it, type in the console

library(Rcmdr)
 

The alternative is JGR (Java GUI for R - http://rforge.net/JGR/ ) and an add-on to it, Deducer (http://www.deducer.org). They're both 'packages' in R, which means extra functionality that can be easily installed from within R and then used.

To install:
Ensure you have a fairly up-to-date Java installed.
Download and install the latest version of R from CRAN (Comprehensive R Archive Network - http://cran.r-project.org/ ).
Download and install the JGR client from http://rforge.net/JGR/
Start the JGR application. It will get any extra things it needs.
The JGR console should now be open. To load Deducer, go to 'Packages & Data' > 'Package Manager' and select Deducer and DeducerExtras.
 
Now you're using R !
 
Now you need to go find some help, I'd recommend some places to start with R;

 
