Introduction to R

I present a short introduction to R that is useful for the School. You need to execute and play with the commands.


To cite this version:
Didier Fraix-Burnet. Introduction to R. D. Fraix-Burnet & S. Girard. Statistics for Astrophysics: Clustering and Classification, 77, EDP Sciences, pp.3-12, 2016, EAS Publications Series, 978-2-7598-9001-9. 10.1051/eas/1677002. hal-01324928

Introduction
R is a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc. Its packages cover nearly all the methods found in the statistical literature. Please consult the R project homepage for further information.
R is an object-oriented interpreted language, made up of a collection of packages. Packages are installed by compiling C++ and Fortran code (with the gcc compiler). R is command-line driven, which makes it a wonderful tool for complex scripts and repeated procedures.
R is used by most statisticians in the world, mainly for research and development. There are more and more packages for astronomical purposes, including FITSio, a package to handle the FITS format.
CRAN (Comprehensive R Archive Network) is a network of ftp and web servers around the world that store identical, up-to-date versions of code and documentation for R. Please use the CRAN mirror nearest to you to minimize network load.
There are a great many help websites and blogs. Simply type your question into your favorite internet search engine and you will surely find the corresponding help, and even full scripts. When searching the web for R, type "r-cran" instead of "R"! For MATLAB users, there are tables of command equivalence.
You can find a Short Reference Card on the School website, together with the Codes and Data.

Installation of R
You can download a binary version for your platform. Linux users may check whether R is available in their distribution.
Warning: once R is installed, if you are behind a proxy, the proxy must be declared to be able to install and update packages. Add a line like
    http_proxy='YOUR PROXY ADDRESS:3128/'
to the file ~/.Renviron or at the end of the file /etc/R/Renviron.

Graphical User Interface (GUI) or Integrated Development Environment
An optional but valuable complement is a GUI for R, which provides an integrated environment with the command line terminal, the plot window and an editor for the script files, among others. This is extremely useful for saving commands into a script file, since you can execute any selection in this file.
I would strongly recommend RStudio. There are many others, one of which is included in the Mac binary.

Packages
Packages are collections of programs, commands and scripts. They must be installed, i.e. compiled, once. Then they can be loaded to be used, or detached when you are done.
Warning: commands of a newly loaded package hide (mask) commands with the same name already in memory until the new package is removed from memory ("detached"). R lists the masked commands when the package is loaded. Some commands automatically adapt their behaviour to the type of object (for instance, plot can have several slightly different behaviours). For this reason, and except for a basic set of packages, loaded packages are not kept in memory after the session ends.
Packages can be installed directly from a repository (you will be offered a choice of mirrors) or from a previously downloaded file.
To install packages:
    install.packages("fpc")
To load a package into memory and access its commands, use library(); to remove a package from memory, use detach().
Sometimes it is useful to update packages:
    update.packages(repos = "http://cran.univ-lyon1.fr/", ask = FALSE, checkBuilt = TRUE)
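The load and detach cycle can be tried without installing anything. The following is a minimal sketch, assuming that the splines package (normally shipped with R, and chosen here only for illustration) is available:

```r
# Load a package that ships with R (no installation or network access needed).
library(splines)

# The package now appears in the search path.
loaded <- "package:splines" %in% search()
print(loaded)

# Remove the package from memory; it stays installed on disk.
detach("package:splines", character.only = TRUE)
detached <- !("package:splines" %in% search())
print(detached)
```

The same pattern applies to any installed package, for instance fpc once install.packages("fpc") has been run.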

Files
R uses two important files which are saved under the current directory:
• .Rhistory: the history of the commands
• .RData: the environment file made of the functions and objects in memory.
The loaded packages are not kept when you quit R, to avoid confusion between commands in the next session.
It is a good idea to have different directories for different projects. You can move to the right directory before starting R, or change directory within R with setwd(). RStudio encourages you to create projects, which makes it very easy to handle several pieces of work and to switch from one to the other.
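Changing directory within R might look like the sketch below. The directory name my_project is hypothetical, and a temporary location is used so that nothing is written into your own folders:

```r
# Remember where we are so that the original directory can be restored.
old_dir <- getwd()

# Create a scratch project directory (hypothetical name) in R's temporary folder.
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)

# Move into it: .Rhistory and .RData would now be saved there when quitting R.
setwd(project_dir)
current <- basename(getwd())
print(current)

# Go back to the original directory.
setwd(old_dir)
```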

Basics
To open R, type R.
When in R, the prompt > will appear. It will not be shown in the following command lines (to ease copy and paste operations...).
To quit, type q(). To obtain help on a command, use help() or ?; use help.search() or ?? to find a command in unloaded packages.
Help pages all have the same structure with description, usage, format (options), Details, Sources and Examples at least. Do not hesitate to run the examples.
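As a small sketch of ways to explore commands besides reading the help page itself (the commands mean and read.table are only illustrative choices, and example() assumes the help files are installed):

```r
# Search the names of the commands available in the loaded packages.
matches <- apropos("^read\\.table$")
print(matches)

# Show the arguments of a command without opening its help page.
args(read.table)

# Run the Examples section of the help page of mean.
example(mean)
```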

Commands
Commands are always followed by parentheses. Without them, R prints the content of the object instead of executing it. If a parenthesis is not closed, the symbol + appears, asking for the missing closing ).
Two commands written on the same line have to be separated by ";". The function rm() removes an object from memory for good:
    a <- "Hello"; a
    rm(a); a
The last call to a then fails, because the object no longer exists.
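The behaviour described above can be checked with a short sketch. The use of exists() to test for the object is an addition for illustration, not part of the original example:

```r
# Two commands on the same line, separated by ";".
a <- "Hello"; b <- 2

# With parentheses, ls is executed and returns the names of the objects.
objects_before <- ls()
print(objects_before)

# rm() deletes the object for good; exists() confirms that it is gone.
rm(a)
a_exists <- exists("a", inherits = FALSE)
print(a_exists)
```

Typing ls without parentheses at the prompt prints the source code of the function instead of running it.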

Objects
Objects have values and can be of several data types. To give the value 1 to object a, the forms a <- 1, a = 1 and 1 -> a are equivalent. To print a value on the console, type the name of the object or use print().
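A minimal sketch of these equivalent assignments follows. The object names other than a, and the use of assign(), are illustrative additions:

```r
a <- 1            # usual assignment operator
b = 1             # equivalent at the top level
1 -> c_val        # rightward assignment
assign("d", 1)    # assignment by name, handy inside scripts

# All four objects hold the same value.
same <- identical(a, b) && identical(b, c_val) && identical(c_val, d)
print(same)

a           # typing the name prints the value
print(a)    # explicit printing, needed inside functions and loops
```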

Functions
Functions are defined with the function keyword and assigned to a name, like any other object.
Lists: lists are a general form of vector in which the various elements need not be of the same type, and are often themselves vectors or lists. Lists provide a convenient way to return the results of a statistical computation. For instance, assuming that a vector y already exists:
    A = matrix(1:15, ncol=5)
    x = list(mat=A, text="test list", vec=y)
    x[[2]]; x$vec
Matrices: matrices or more generally arrays are multi-dimensional generalizations of vectors. In fact, they are vectors that can be indexed by two or more indices and will be printed in special ways.
Arrays: an array can be considered as a multiply subscripted collection of data entries.
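The sketch below puts a function definition and the list example together. The function square_sum and the values of y are hypothetical, introduced only so that the code is self-contained:

```r
# Define a function: arguments in parentheses, body in braces.
# 'square_sum' is a hypothetical name chosen for this sketch.
square_sum <- function(v, power = 2) {
  sum(v^power)    # the last evaluated expression is the returned value
}
square_sum(1:3)   # 1 + 4 + 9 = 14

# y is an assumed example vector (its original definition is not in the text).
y <- c(2.5, 3.1, 4.7)

A <- matrix(1:15, ncol = 5)    # 3 rows and 5 columns, filled column by column
x <- list(mat = A, text = "test list", vec = y)

x[[2]]        # second element of the list: "test list"
x$vec         # element accessed by its name
dim(x$mat)    # 3 5
A[2, 3]       # row 2, column 3: 8
```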

Indices
It is important to understand indexing, since it can become somewhat tricky for lists or complex data frames. The following examples should help you in this respect.
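A few indexing cases are sketched here on made-up objects (all names and values are illustrative):

```r
v <- c(10, 20, 30, 40, 50)
v[2]            # second element: 20
v[c(1, 3)]      # first and third elements: 10 30
v[-1]           # everything except the first element
v[v > 25]       # logical indexing: 30 40 50

M <- matrix(1:6, nrow = 2)    # filled column by column
M[2, ]          # second row: 2 4 6
M[, 3]          # third column: 5 6

l <- list(num = v, mat = M)
l[1]            # single brackets return a list of length 1
l[[1]]          # double brackets return the element itself (the vector)
l$mat[1, 2]     # row 1, column 2 of the matrix stored in the list: 3

df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
df[2, "name"]             # row 2 of column name: "b"
df$id[df$name == "c"]     # id of the row whose name is "c": 3
```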
The entry "NA" (Not Available) indicates a missing value. Some functions can handle these missing values in different ways (by ignoring or replacing them).
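A minimal sketch of both strategies, on a made-up vector z:

```r
z <- c(1, NA, 3)

is.na(z)                 # FALSE TRUE FALSE
mean(z)                  # NA: the missing value propagates by default
mean(z, na.rm = TRUE)    # 2: the missing value is ignored

# Replace the missing values, here by 0 (an arbitrary choice).
z_filled <- z
z_filled[is.na(z_filled)] <- 0
z_filled                 # 1 0 3
```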

Object classes
Objects can be of different classes: integer, double, numeric, character, factor, logical. Factors provide compact ways to handle categorical data.
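The class of an object can be inspected with class(). The sketch below also builds a factor from hypothetical category labels:

```r
class(1L)        # "integer"
class(1.5)       # "numeric"
typeof(1.5)      # "double": the storage type behind the numeric class
class("text")    # "character"
class(TRUE)      # "logical"

# A factor stores categories as integer codes plus a set of levels.
# The labels below are made up for illustration.
spectral_type <- factor(c("G", "K", "G", "M", "K", "G"))
class(spectral_type)         # "factor"
levels(spectral_type)        # "G" "K" "M"
as.integer(spectral_type)    # 1 2 1 3 2 1
table(spectral_type)         # number of objects in each category
```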

An illustration with simple classification tools
As an illustration of both the graphical capabilities of R and some basic classification techniques using kmeans and DBSCAN, we use the built-in dataset iris.
Since there is generally not one best method, it is always a good idea to compare the results obtained using several approaches. Here is the comparison between kmeans and DBSCAN of the above analyses through a convenient graphical display:
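The original listings are not reproduced here, so the following is only a sketch of such an analysis, not the author's code. It assumes that DBSCAN is taken from the fpc package when it is installed; the seed, nstart, eps and MinPts are untuned, illustrative values:

```r
# Use the four numerical columns of the built-in iris data.
measurements <- iris[, 1:4]

# k-means with 3 clusters; the seed and nstart are illustrative choices.
set.seed(42)
km <- kmeans(measurements, centers = 3, nstart = 20)

# Cross-tabulate the clusters against the known species.
km_vs_species <- table(kmeans = km$cluster, species = iris$Species)
print(km_vs_species)

# DBSCAN is not in base R: it is taken from the fpc package if installed.
# eps and MinPts are assumed values, not tuned ones.
if (requireNamespace("fpc", quietly = TRUE)) {
  db <- fpc::dbscan(measurements, eps = 0.5, MinPts = 5)
  print(table(dbscan = db$cluster, kmeans = km$cluster))

  # Side-by-side display, drawn only in an interactive session.
  if (interactive()) {
    op <- par(mfrow = c(1, 2))
    plot(iris$Petal.Length, iris$Petal.Width, col = km$cluster,
         pch = 19, main = "kmeans")
    # DBSCAN labels noise points 0, hence the shift by one for the colours.
    plot(iris$Petal.Length, iris$Petal.Width, col = db$cluster + 1,
         pch = 19, main = "DBSCAN")
    par(op)
  }
}
```

The contingency tables give a quick numerical comparison, and the two panels show where the methods disagree.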

Other graphical examples
Here are other examples of graphics:

Import/export
    read.table(file.choose(), stringsAsFactors = FALSE, header = TRUE, quote = "\"", na.strings = "?", skip = 0, comment.char = "#")
    tab <- read.table(file = "myfile.txt")
    write.table(tab, file = "myfile2.txt", quote = FALSE, sep = " ", na = "?", row.names = FALSE)
Beware:
    save(x, file = "whatever.RData")
    load("whatever.RData", envir = newenvir)
The save() command saves the object x in an R binary format file named whatever.RData. There is a compression option. The load() command loads the objects stored in the R binary format file named whatever.RData. Be careful, because loading replaces all existing objects with the same names (here object x) in the current environment. To be safe, if you are not sure, you should load the file in a new environment as indicated. In this case two objects with the same name can exist, each in a different environment.
To define the new environment and use it, this is rather simple:
    newenvir <- new.env()
    ls(newenvir)
    newenvir$x
    attach(newenvir)
    x
    detach(newenvir)
    rm(x, envir = newenvir)
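The import/export commands above can be tried end to end with the sketch below. The data and object values are made up, and temporary files are used instead of myfile.txt and whatever.RData so that nothing is overwritten:

```r
# Small example table with a missing value (made-up data).
tab <- data.frame(id = 1:3, flux = c(1.2, NA, 3.4))

# Write it as text, coding missing values as "?", then read it back.
txt_file <- tempfile(fileext = ".txt")
write.table(tab, file = txt_file, quote = FALSE, sep = " ", na = "?",
            row.names = FALSE)
tab2 <- read.table(txt_file, header = TRUE, na.strings = "?",
                   stringsAsFactors = FALSE)

# Save an object in R binary format.
x <- "saved value"
rdata_file <- tempfile(fileext = ".RData")
save(x, file = rdata_file)

# Change x, then reload the saved copy into a separate environment.
x <- "current value"
newenvir <- new.env()
load(rdata_file, envir = newenvir)

x               # "current value": the current object was not replaced
newenvir$x      # "saved value": the loaded copy lives in newenvir
ls(newenvir)    # "x"
```

Loading into newenvir keeps the two versions of x apart, which is the safe behaviour described above.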