The R statistics package
Volume Number: 23 (2007)
Issue Number: 01
Column Tag: Statistics
The R statistics package
What is R and how you can use it.
By Mihalis Tsoukalos
Introduction
The R language for statistical computing is a free implementation of the S language, which is also used for statistical computing. Rick Becker, John Chambers and Allan Wilks developed S at the famous AT&T Bell Labs. The commercial version of S is called S-PLUS; the problem with S-PLUS is that it is very expensive.
Version 1.0.0 of R was first released in February 2000 and the latest version of R, at the time of writing this article, is 2.4.0. Version 2.3.1 is used for the purposes of this article.
R and S-PLUS can be used for statistical analysis and graphics. Put simply, you feed them the datasets that you want to analyze, and they let you explore and visualize the data in creative ways.
What, a statistics package in MacTech?
Well, you may wonder what a statistics package is doing in MacTech, so let me explain. First, statistics is not difficult in all of its aspects; you can stick to a small subset of statistics that is incredibly simple. Second, statistics can be very useful for systems administration purposes, including Mac OS X administration. Last, statistics is particularly useful when you want to generate a report for a boss who usually does not understand technical information very well.
If you still feel uncomfortable with statistics, please keep in mind that this article is not going to use higher-level statistics. What will be used are straightforward statistical methods and some R commands that generate a lot of handy and impressive graphical images.
Introducing R
The good news is that there is a Macintosh version of R that can run either as a console application or as a graphical application. R also runs on Windows as well as on other UNIX systems.
Figure 1 shows the console version of R whereas Figure 2 shows the graphical version. To run the console version you just have to type R, provided that the directory containing the R command is included in your PATH variable.
Figure 1: Running R from the console
As you can see, both versions of R have a similar text window in which you enter your commands. Nevertheless, the GUI version is more elegant and offers more options. By typing q() -this works in both versions- you can quit R. Additionally, you can quit the GUI version from its usual Mac menus.
Figure 2: The R GUI
R can also be used as a simple calculator, as the following examples illustrate:
> 1 + 5
[1] 6
> abs(-1.4)
[1] 1.4
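R works on whole vectors just as easily as on single numbers, which is what makes it convenient for the column-oriented data we will use later. A tiny sketch (the numbers are arbitrary):
> x <- c(3, 7, 1, 9)
> x * 2
[1]  6 14  2 18
> mean(x)
[1] 5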
R can also be used in batch mode -only with the command line version- where you store the desired commands in a file and execute them from the command line or a cron job. The following commands demonstrate it:
$ cat example.R
1 + 1;
5 - 8;
$ R CMD BATCH example.R
$ cat example.Rout
R : Copyright 2006, The R Foundation for Statistical Computing
Version 2.3.1 (2006-06-01)
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
[Previously saved workspace restored]
> invisible(options(echo = TRUE))
> 1 + 1;
[1] 2
> 5 - 8;
[1] -3
>
> proc.time()
[1] 0.634 0.169 0.818 0.000 0.000
>
$
As you can see, the output for the example.R batch file is stored in a file called example.Rout. If the batch commands generate any graphics files, those files will be created as well.
Learning more about R and Statistics
One of the most important things that you have to learn is how to get external data into R. This can be done using the read.table() command. The following example shows how to use it:
$ cat TEST.data
Name Salary Age
Mike 100000 25
Eugenia 200000 22
John 125000 26
PIK 250000 38
Antonis 180000 30
$ R
...
> SAL <- read.table("TEST.data", header=TRUE)
> SAL
Name Salary Age
1 Mike 100000 25
2 Eugenia 200000 22
3 John 125000 26
4 PIK 250000 38
5 Antonis 180000 30
>
In this example, a table was saved in a text file called TEST.data and loaded into R. Notice the header=TRUE parameter, which tells R that the first line of the TEST.data file is the header row of the table and should therefore be treated differently. Also notice that the SAL variable holds the whole table.
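You can also refer to a single column through the $ operator; for example, the following commands (a quick sketch using the table just loaded) print the salaries alone and compute their average:
> SAL$Salary
[1] 100000 200000 125000 250000 180000
> mean(SAL$Salary)
[1] 171000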
Imagine that you want to learn some information about your SAL data. The summary() command can be used as follows:
> summary(SAL)
Name Salary Age
Antonis:1 Min. :100000 Min. :22.0
Eugenia:1 1st Qu.:125000 1st Qu.:25.0
John :1 Median :180000 Median :26.0
Mike :1 Mean :171000 Mean :28.2
PIK :1 3rd Qu.:200000 3rd Qu.:30.0
Max. :250000 Max. :38.0
>
As you can see, this is a great way to summarize your data. Now, let us explain the output.
The Name column does not contain numbers, so R treats each value as a category, counts the occurrences of each one and prints the most frequent values. As far as the Salary and Age columns are concerned, which are both numeric, R calculates and displays the following six values:
- Min.: This is the minimum value of the whole data set.
- Median: The value that divides the sorted data set into two subsets (left and right) with the same number of elements. If the data set has an odd number of elements, the Median is part of the data set. If the data set has an even number of elements, the Median is the mean of the two center elements of the data set.
- 1st Qu.: The 1st Quartile (q1) is a value, not necessarily belonging to the data set, with the property that at most 25% of the data set values are smaller than q1 and at most 75% of the data set values are bigger than q1. Simplistically, you can think of it as the Median of the left half of the sorted data set.
If the number of elements of the data set is such that q1 does not belong to the data set, it is produced by interpolating the two values at the left (v) and the right (w) of its position in the sorted data set, for example as:
q1 = 0.75 * v + 0.25 * w
- Mean: This is the mean value of the data set (the sum of all values divided by the number of the items in the data set).
- 3rd Qu.: The 3rd Quartile (q3) is a value, not necessarily belonging to the data set, with the property that at most 75% of the data set values are smaller than q3 and at most 25% of the data set values are bigger than q3. Put simply, you can think of the 3rd Quartile as the Median of the right half of the sorted data set.
If the number of elements of the data set is such that q3 does not belong to the data set, it is produced by interpolating the two values at the left (v) and the right (w) of its position in the sorted data set, for example as:
q3 = 0.25 * v + 0.75 * w
- Max.: This is the maximum value found in the data set.
Please note that there are many conventions for computing Quartiles. If you try another statistical package, you may get slightly different results, as the short example below suggests.
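You can experiment with this inside R itself through the quantile() function, whose type parameter selects among several calculation methods (the default corresponds to the values that summary() reports). A small sketch using the salary data from above:
> quantile(SAL$Salary)
    0%    25%    50%    75%   100%
100000 125000 180000 200000 250000
> quantile(SAL$Salary, type = 6)   # another convention; gives slightly different quartiles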
Creating Graphics with R
In the main part of the article I am going to show you how to generate creative graphics with R; R has amazing graphical capabilities. Please look at the Bibliography and References section for more information. The presented examples are real examples, using real data.
WWW server example
For this example, I used some old log data from a real web server; the logs cover one week. Let me explain all the required steps.
First, let me show you some things about the log files using the wc command:
$ wc *.log
416041 6656584 119534721 day1.log
429039 6864552 123800090 day2.log
1185958 18975185 338653060 day3.log
1162803 18604776 330550972 day4.log
1157444 18519068 329710792 day5.log
1209289 19348537 342242234 day6.log
1078902 17262326 307343799 day7.log
6639476 106231028 1891835668 total
The wc command reports the number of lines, words and bytes in each file. As you can see, the log files are big, as this is a very popular web server.
The log files use the W3C extended log file format, which lists its fields in a #Fields: header line:
#Fields: date time c-ip cs-username s-ip ¬
cs-method cs-uri-stem cs-uri-query sc-status sc-bytes ¬
cs-bytes time-taken cs-version cs(User-Agent) cs(Cookie) cs(Referer)
I now have to decide which fields to use and extract from the log files. I will use the following fields:
time: the time of the request
sc-bytes: a number that shows the server to client bytes
cs-bytes: a number that shows the client to server bytes
time-taken: the time, in milliseconds, it took the web server to process the request. Please note that a value of 0 may indicate that the requested resource was served from a cache and therefore the web server did not have to process it.
The following UNIX shell script does what we want:
$ cat WWW.sh
#!/bin/bash
# Keep the time (HH:MM), sc-bytes, cs-bytes and time-taken fields,
# skipping the comment lines that start with '#'.
grep -v '^#' day1.log | awk '{print $2, $10, $11, $12}'¬
| sed 's/:/ /g' | awk '{print $1 ":" $2, $4, $5, $6}'
Its output, for the day1.log file, begins as follows:
00:00 137 465 0
00:00 142 471 0
00:00 13449 338 0
00:00 140 471 0
00:00 142 476 0
00:00 141 468 15
00:00 142 474 0
00:00 466 228 0
00:00 139 465 0
00:00 140 464 0
Of course, you have to change the day1.log string to fit your own filename. I did so for the rest of the web server log files. The files created are as follows (again using the output of the handy wc command):
$ wc *.data
416033 1664132 6816604 day1.data
429031 1716124 7026785 day2.data
1185942 4743768 19385103 day3.data
1162795 4651180 19041770 day4.data
1157440 4629760 18933110 day5.data
1209281 4837124 19748074 day6.data
1078894 4315576 17627914 day7.data
6639416 26557664 108579360 total
If you want header data in your files, you can add it by editing the output files manually. I put the line "Time sc cs timeTaken" at the beginning of each of the daily data files.
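As an aside, if you would rather not edit the data files by hand, read.table() can supply the column names itself; a small sketch, assuming the raw file has no header line:
> day1 <- read.table("day1.data", header = FALSE,
+     col.names = c("Time", "sc", "cs", "timeTaken"))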
Now, we are finally ready to use R to process some of our data. I used the Misc > Change Working Directory menu option (Command-D) -the setwd() command does the same thing from the console- to change my working directory so that I do not have to use full paths for my data files.
First, I am going to use the summary() command to get an overview of my day1 data.
> day1 <- read.table("day1.data", header=TRUE)
> summary(day1)
Time sc cs timeTaken
18:05 : 775 Min. : 0 Min. : 0.0 Min. : 0.0
17:32 : 708 1st Qu.: 141 1st Qu.: 378.0 1st Qu.: 0.0
12:21 : 697 Median : 142 Median : 431.0 Median : 0.0
17:07 : 696 Mean : 2997 Mean : 428.9 Mean : 253.3
10:15 : 693 3rd Qu.: 842 3rd Qu.: 464.0 3rd Qu.: 0.0
18:15 : 687 Max. :5686096 Max. :2340.0 Max. :1908734.0
(Other):411777
>
You can easily see which minutes were busiest: 18:05, 17:32, 12:21, 17:07, 10:15 and 18:15. You can also tell from the timeTaken column that the web server was serving requests pretty fast (because even the 3rd Qu. value is 0).
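You can quantify that last observation directly; for example, the following command computes the fraction of requests whose timeTaken value is 0 (the exact figure will of course depend on your own logs):
> mean(day1$timeTaken == 0)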
There is also a very quick way to represent a data set graphically. It can be done with the pairs(<dataset_name>) command, which plots every pair of columns in a data set against each other. The output of the
> pairs(day1)
can be seen in figure 3. Isn't it worth every statistical definition you have read in this article?
Figure 3: the output of pairs(day1) command
The attach() command takes a data set as its argument and lets you refer to the columns of the data set by name. In this example, I will use the day2 data. Also, check the objects() and search() commands, which help you discover existing objects.
> day2 <- read.table("day2.data", header=TRUE)
> attach(day2)
> objects()
[1] "day2"
> search()
[1] ".GlobalEnv" "day2" "tools:RGUI"
[4] "package:methods" "package:stats" "package:graphics"
[7] "package:grDevices" "package:utils" "package:datasets"
[10] "Autoloads" "package:base"
> objects(2)
[1] "Time" "cs" "sc" "timeTaken"
>
The plot(Time) command will produce Figure 4. Because Time is read in as a categorical (factor) variable, the plot shows the total number of connections per minute. It makes sense that after midnight there are fewer connections than during the rest of the day. On the other hand, unreasonable values may be the result of a network attack.
Figure 4: The output of the plot(Time) command
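If you prefer exact numbers to a plot, the same information can be tabulated; a quick sketch that lists the busiest minutes of the day2 data:
> conns <- table(Time)
> head(sort(conns, decreasing = TRUE))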
Last, imagine that you want to limit the range of values shown on both the x and y axes. You can do that by using the xlim and ylim parameters of the plot command. The following example shows this (the output can be seen in Figure 5):
> plot(cs, sc, xlim=c(450, 500), ylim=c(450,500))
Network data example
In this example, I will use network data. As many of you already know, the usual way to capture network data is with the tcpdump tool. The output of the tcpdump utility is difficult to read, but there are many tools (e.g. tcpshow, ethereal/wireshark) that will help you parse it. Anyway, imagine that you have readable tcpdump output that contains the following fields:
Figure 5: Limiting the values of the output
Packet Number: column title "Packet"
Time: column title "Time"
Time Difference from Previous Packet: column title "dt"
Source Port: column title "sp"
Destination Port: column title "dp"
I used the tcpshow tool, which produces output that looks as follows:
-----------------------------------
Packet 171
TIME: 15:00:15.367367 (0.000190)
LINK: 00:60:97:DE:54:36 -> 00:00:0C:04:41:BC type=IP
IP: 207.46.130.139 -> 172.16.117.52 hlen=20 TOS=00 dgramlen=40 id=003A
MF/DF=0/1 frag=0 TTL=64 proto=TCP cksum=C797
TCP: port http -> 1024 seq=1274940435 ack=3183900831
hlen=20 (data=0) UAPRSF=010000 wnd=32735 cksum=2A2F urg=0
DATA: <No data>
-----------------------------------
Packet 172
TIME: 15:00:15.455012 (0.087645)
LINK: 00:00:0C:04:41:BC -> 00:C0:4F:A3:58:23 type=IP
IP: 172.16.112.20 -> 192.168.1.10 hlen=20 TOS=00 dgramlen=60 id=0080
MF/DF=0/0 frag=0 TTL=63 proto=UDP cksum=9D5A
UDP: port domain -> domain hdr=8 data=32
DATA: .9..........
hostmaster.com.....
-----------------------------------
I used a small Perl script to extract the data (TCP traffic only) that I wanted from the tcpshow output. Remember that you may have to replace text values like http, telnet, etc., found in the tcpshow output with their service numbers so that R can treat them as numbers.
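I will not reproduce the Perl script here, but if you prefer to do the name-to-number conversion inside R, a named vector can act as a small lookup table. A rough sketch (the ports vector only needs to cover the service names that actually appear in your output, and svc is just a few made-up example values):
> ports <- c(http = "80", smtp = "25", telnet = "23", domain = "53")
> svc <- c("http", "1024", "telnet")
> as.numeric(ifelse(svc %in% names(ports), ports[svc], svc))
[1]   80 1024   23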
This time, I will also bring into play an additional R package for creating graphics, called lattice. The following command shows how to load the lattice package in R:
> library(lattice)
In order to get some help about the lattice package, you can type the following command:
> help(lattice)
After executing the last command inside the graphical version of R, you will get the output shown in figure 6.
Figure 6: help(lattice) graphical output
For the first example, I will use the data from the first three columns (Packet, Time, and dt) of the extracted data (the C3.data file). I executed the following three commands:
> c3 <- read.table("C3.data", header=TRUE)
> attach(c3)
> c3m <- as.matrix(read.table("C3.data", header=TRUE))
You already know the first command. The third command reads the data in again as a matrix (using as.matrix()), because some graphics functions that plot more than two variables will only accept data as a matrix. Do not forget to also run the library(lattice) command; it will be needed for the xyplot() example below.
Write the following commands in a text editor, then copy them and paste them into R. You will get Figure 7! This example is based on an existing R example that uses the volcano data set.
x <- 10*(1:nrow(c3m))
y <- 10*(1:ncol(c3m))
image(x, y, c3m, col = terrain.colors(100), axes = FALSE)
contour(x, y, c3m, add = TRUE, col = "peru")
axis(1, at = seq(100, 800, by = 100))
axis(2, at = seq(100, 600, by = 100))
box()
title(main = "c3m plot", font.main = 4)
Do not ask me about the physical meaning of that graph; if you know your data, you can tell more about such an image. This is just an example to give you an idea of R's capabilities.
Now, I am going to show you a more down-to-earth example. The following commands
> plot.new()
> xyplot(Packet ~ Time)
> title(main = "Packet vs Time", font.main = 4)
will plot Packet number versus Time -using data from the c3 dataset- as can be seen in Figure 8. Straight lines may represent complete HTTP transactions. You can see that there are moments when there is not much TCP traffic, whereas at other times the TCP traffic is very high.
Figure 7: Plotting the c3m data set
Figure 8: Packet vs Time plot
For the last example, I will use the data from the last two columns (sp and dp) of the extracted information (the C2.data file). First, run the following commands:
> c2 <- read.table("C2.data", header=TRUE)
> attach(c2)
> c2m <- as.matrix(read.table("C2.data", header=TRUE))
> summary(c2)
sp dp
http :2472 http :1923
smtp : 202 smtp : 223
telnet : 39 telnet : 42
1024 : 31 1024 : 38
2615 : 22 1306 : 36
1026 : 21 4233 : 27
(Other):2124 (Other):2622
As you can easily see, the summary command is very useful and meaningful this time. This traffic contains many HTTP, SMTP and TELNET requests. If you are concerned about security, you may want to reduce the number of TELNET connections and replace them with secure (ssh) connections.
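Another way to look at the same thing is to count the connections per destination port yourself; a short sketch using the attached c2 columns, which should reproduce the per-port counts that summary() reported:
> sort(table(dp), decreasing = TRUE)[1:6]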
Beyond such counts, since this sample contains non-numeric values, there is not much more numerical analysis to do. One thing you can still do is plot the data set (Figure 9):
> plot(c2)
> title(main = "dp vs sp plot", font.main = 4)
Figure 9: dp vs sp plot
One more thing
You may wonder by now why I talked about seven days of web log files although I used only two of them. Well, the answer is that I am going to use the rest of them now.
Run the following commands for each one of the seven web log data files:
head -n 1 day1.data > hour1
grep "^13:" day1.data >> hour1
This will create seven files, one for each day of the week, each containing the web log data between 13:00 and 13:59. Also, execute the following commands:
$ wc -l hour*
25647 hour1
23211 hour2
70192 hour3
67904 hour4
59699 hour5
60121 hour6
58629 hour7
365403 total
$ wc -l hour* > hour13.data
$ head -n 7 hour13.data > hour13
Now, let us go back to R and execute the following commands:
> hours <- read.table("hour13")
> attach(hours)
> barplot(V1, angle=c(45,135), density=20, col="grey", ¬
names=c("Sunday", "Monday", "Tuesday", "Wednesday", ¬
"Thursday", "Friday", "Saturday") )
> title(main="Web server connections from 13:00 to 13:59", font=5)
The output can be seen in Figure 10.
Figure 10: A bar plot
This procedure can easily be automated (and run from cron), so you can have a report of your data delivered to your email account every day!
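As a rough sketch of such an automated job (the script name and the PDF output file are just assumptions, not something used elsewhere in this article), the whole bar plot can be redrawn into a file by a small batch script and scheduled with cron alongside the shell commands shown earlier:
$ cat report13.R
# Hypothetical batch script: redraw the 13:00-13:59 bar plot into a PDF file.
hours <- read.table("hour13")
pdf("report13.pdf")
barplot(hours$V1, angle = c(45, 135), density = 20, col = "grey",
        names.arg = c("Sunday", "Monday", "Tuesday", "Wednesday",
                      "Thursday", "Friday", "Saturday"))
title(main = "Web server connections from 13:00 to 13:59")
dev.off()
$ R CMD BATCH report13.R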
What else can R do?
Now that you have learned more about R, let me briefly tell you what else R can do. Well, R can also perform the following:
1. Advanced data analysis.
2. Advanced statistics.
3. R includes a full, object-oriented programming language in which you can write your own programs.
Summary
In this article you learned a lot: some basic things about R, how to import data into R, and how to create graphics with R. These things can be very valuable for regular users and, especially, for system administrators.
Conclusions
I hope that this article did not contain too much statistics. I also hope that you have, by now, understood some of the capabilities of R. If you want to learn more about R, visit its home page and check out some of the proposed books.
The output of R should help you prove your points to your colleagues or your manager and get a general overview of your data.
Please let me know if you have any questions or if you want another article about R and its rich graphical capabilities.
Bibliography and References
R home page: http://www.R-project.org/
S-PLUS home page: http://www.insightful.com/products/splus/default.asp
R and DBMSs page: http://developer.R-project.org/db/
Venables, W.N. and Ripley, B.D., Modern Applied Statistics with S, 4th Ed., Springer-Verlag, 2002
Venables, W. N., Smith, D. M., An Introduction to R, Network Theory Ltd., 2002
Murrell, Paul, R Graphics, Chapman & Hall/CRC, 2006
Crawley, Michael, Statistics: An introduction using R, Wiley, 2005
Dalgaard, Peter, Introductory Statistics with R, Springer, 2004
Venables, W.N. and Ripley, B.D., S Programming, Springer, 2004
Christiansen, Tom and Torkington, Nathan, Perl Cookbook, 2nd Ed., O'Reilly, 2003
Mihalis Tsoukalos lives in Greece with his wife Eugenia and enjoys digital photography and writing articles. You can reach him at tsoukalos@sch.gr.