About R

In this post, we will look at an overview of the R language, how to install it, and some of its syntax.

So what is R?

From the R Project Documentation, this is the definition of R:

R is a language and environment for statistical computing and graphics. Many users think of R as a statistics system.

What sets R apart from other languages?

R is not a programming language like C or Java. It was not created by software engineers for software development. Instead, it was developed by statisticians as an interactive environment for data analysis.

Also, R is not only a programming language but also an "environment". This term is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.

How to run R code

There are various ways to run R code. You can run it in the R console or use RStudio.

RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, and a variety of robust tools for plotting, viewing history, debugging and managing your workspace.

rstudio-windows.png RStudio

To learn more about RStudio you can visit this post on RStudio Features

Installing R and RStudio

Note: Install R before you install RStudio.

The steps for installing R and RStudio include:

Step 1: Install R.

a. Download the R installer from this R CRAN Repository depending on your OS.

downloading_R.png

b. Run the installer. Default settings are fine.

Step 2: Install RStudio

a. Download RStudio from this RStudio Download Page depending on your OS:

downloadingRSTudio.png

b. Run the RStudio installer

For Ubuntu 20.04 you can also follow this post

Step 3: Check that R and RStudio are working

a. Open RStudio. It should open a window that looks similar to the image below.

b. In the left-hand pane, go to the console. At the ‘>’ prompt, type ‘4+5’ (without the quotes) and press Enter. An output line reading ‘[1] 9’ should appear, which means that R and RStudio are working.

testingRandRStusio.png

For more details on getting started with R, you can visit this R tutorial

R Basic Syntax

1. Declaring a variable

A variable provides us with named storage that our programs can manipulate. Variables can be assigned values using the leftward, rightward, or equals operator.

> x <- 34          # Assignment using leftward operator.
> 22 -> y          # Assignment using rightward operator; the value goes on the left.
> z = 55           # Assignment using equal operator.

Note: comments in R are preceded by #

In a programming language, the information we store in variables could be an integer, character, floating-point, boolean, etc.

In programming languages like C, C++, and Java, variables are declared with a data type; in Python and R, however, variables refer to objects. An object is a data structure that has attributes and methods that operate on those attributes.

In R, a variable is not declared with a data type; rather, it takes the data type of the R object assigned to it. R is therefore called a dynamically typed language, which means that the data type held by the same variable can change again and again during a program.
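To make the dynamic typing described above concrete, here is a minimal sketch. The variable names (v, class_1 and so on) are only illustrative.

```r
# The same name can be rebound to objects of different types.
v <- 42L               # an integer object
class_1 <- class(v)    # "integer"

v <- "forty-two"       # now a character object
class_2 <- class(v)    # "character"

v <- TRUE              # now a logical object
class_3 <- class(v)    # "logical"

print(c(class_1, class_2, class_3))
```

The variable v is never declared with a type; class() simply reports the type of whatever object is currently assigned to it.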

2. Data Structures and data types in R

Some data structures include: Vectors, Lists, Matrices, Arrays, Factors and Data Frames

Basic data types on which the R data structures are built include: Numeric, Integer, Character, Factor, and Logical.

You can check the data type of a variable using the class() function.

Numeric values are numbers that can have a decimal part.

num = 1.2           
class(num)                             # returns 'numeric'.

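Integers are whole numbers with no decimal part. In R, a number must be written with the L suffix to be stored as an integer; without it, the value is numeric.

myint = 10L                             
class(myint)                           # this returns 'integer'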

A character value is a letter or a combination of letters enclosed in quotes. It can contain letters, digits, or both.

mychar = "Hello There"        
class(mychar)
print(mychar)                        # the print() function is used to print out the variables.

Factors are a data type used to represent categorical (qualitative) values such as colors, good & bad, course or movie ratings, etc. They are useful in statistical modeling.

fac = factor(c("good", "bad", "ugly","good", "bad", "ugly"))
print(fac)

This returns:

[1] good bad  ugly good bad  ugly
Levels: bad good ugly
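The Logical type from the list of basic data types has no example above, so here is a short sketch of it. The variable names is_adult and flags are made up for illustration.

```r
# A comparison produces a logical value (TRUE or FALSE).
is_adult <- 27 >= 18
class(is_adult)              # returns 'logical'

# Logical values can be stored in a vector like any other type.
flags <- c(TRUE, FALSE, TRUE)

# In arithmetic, TRUE counts as 1 and FALSE as 0.
true_count <- sum(flags)
print(true_count)            # prints [1] 2
```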

Let's look at the various data structures we have mentioned:

1. Vectors

A vector is an object that stores multiple values of the same data type.

A vector can be created with a function c(), which will combine all the elements and return a one-dimensional array.

marks = c(88,65,90,40,65)
class(marks)
# returns 'numeric'
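Because a vector holds values of a single data type, c() converts mixed inputs to one common type. The sketch below illustrates this; the names mixed and num_lgl are illustrative only.

```r
# Mixing numbers, strings and logicals: everything becomes character.
mixed <- c(1, "a", TRUE)
class(mixed)        # returns 'character'
print(mixed)        # prints [1] "1"    "a"    "TRUE"

# Mixing numbers and logicals: TRUE becomes 1, the result is numeric.
num_lgl <- c(1.5, TRUE)
class(num_lgl)      # returns 'numeric'
```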

To check the length of the vector, we will use the length() function which returns the number of elements contained in a variable.

length(marks)            
# returns 5

We can access a specific element by its index

marks[4]            
# returns 40

Note: In R indexing, the first element is given an index of 1

Slicing can also be applied:

marks[2:5]       
# returns 65 90 40 65

Creating a character vector is similar to creating a numeric vector:

char_vector = c("a", "b", "c")
print(char_vector)
[1]  "a" "b" "c"
class(char_vector)
'character'

2. Matrix

A matrix is used to store values of the same data type. However, unlike vectors, matrices can hold two-dimensional data.

matrix.jpeg

The syntax for defining a matrix is:

M = matrix(vector, nrow=r, ncol=c, byrow=FALSE, dimnames=list(char_vector_rownames, char_vector_colnames))

For example, a 2 x 3 matrix:

M = matrix( c('AI','ML','DL','Tensorflow','Pytorch','Keras'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
     [,1]         [,2]      [,3]   
[1,] "AI"         "ML"      "DL"   
[2,] "Tensorflow" "Pytorch" "Keras"

Let's use the slicing concept and fetch elements from a row and column.

M[1:2,1:2] 
# The first dimension selects the first two rows while the second dimension will select the first two columns
     [,1]         [,2]     
[1,] "AI"         "ML"     
[2,] "Tensorflow" "Pytorch" 
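The matrix() syntax above also takes byrow and dimnames arguments that the example does not use. The following sketch shows their effect; the marks and the row and column names are invented for illustration.

```r
# Fill row by row and attach row and column names.
scores <- matrix(
  c(88, 65, 90, 40),
  nrow = 2, ncol = 2, byrow = TRUE,
  dimnames = list(c("Aditya", "Ayush"), c("Maths", "Physics"))
)
print(scores)

# With dimnames set, elements can be selected by name instead of position.
scores["Ayush", "Maths"]     # returns 90

# With the default byrow = FALSE the same values are filled column by column.
by_col <- matrix(c(88, 65, 90, 40), nrow = 2, ncol = 2)
by_col[2, 1]                 # returns 65
```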

3. Data Frame

A data frame is a more generalized form of a matrix. It holds data in a tabular fashion, and unlike a matrix, its columns can have different data types.

A data frame can be created using the data.frame() function.

Data frames are widely used in:

  • reading comma-separated (CSV) files and text files.
  • machine learning problems, especially for understanding, wrangling, plotting and visualizing numerical data.
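As a small illustration of the CSV use case in the list above, read.csv() returns a data frame. To keep the sketch self-contained it reads from an inline string through the text argument rather than from a file; the CSV content is made up.

```r
# Inline CSV text standing in for the contents of a file on disk.
csv_text <- "Person,Age\nAditya,26\nAyush,26\nAkshay,27"

# read.csv() parses the text and returns a data frame.
people <- read.csv(text = csv_text)

class(people)    # returns 'data.frame'
nrow(people)     # returns 3
print(people)
```

With a real file you would pass its path as the first argument instead of using text.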

Let's create a dummy dataset and learn some data frame specific functions.

dataset <- data.frame(
   Person = c("Aditya", "Ayush","Akshay"),
   Age = c(26, 26, 27),
   Weight = c(81,85, 90),
   Height = c(6,5.8,6.2),
   Salary = c(50000, 80000, 100000)
)
print(dataset)
  Person Age Weight Height Salary
1 Aditya  26     81    6.0  5e+04
2  Ayush  26     85    5.8  8e+04
3 Akshay  27     90    6.2  1e+05
class(dataset)
'data.frame'
nrow(dataset) 
# this will give you the number of rows that are there in the dataset dataframe
ncol(dataset) 
# this will give you the number of columns that are there in the dataset dataframe
df1 = rbind(dataset, dataset) 
# a row bind which will append the arguments in row fashion.
df1

This returns

  Person Age Weight Height Salary
1 Aditya  26     81    6.0  5e+04
2  Ayush  26     85    5.8  8e+04
3 Akshay  27     90    6.2  1e+05
4 Aditya  26     81    6.0  5e+04
5  Ayush  26     85    5.8  8e+04
6 Akshay  27     90    6.2  1e+05
df2 = cbind(df1, df1) 
# a column bind which will append the arguments in column fashion.
df2

This returns

  Person Age Weight Height Salary Person Age Weight Height Salary
1 Aditya  26     81    6.0  5e+04 Aditya  26     81    6.0  5e+04
2  Ayush  26     85    5.8  8e+04  Ayush  26     85    5.8  8e+04
3 Akshay  27     90    6.2  1e+05 Akshay  27     90    6.2  1e+05
4 Aditya  26     81    6.0  5e+04 Aditya  26     81    6.0  5e+04
5  Ayush  26     85    5.8  8e+04  Ayush  26     85    5.8  8e+04
6 Akshay  27     90    6.2  1e+05 Akshay  27     90    6.2  1e+05

To look at the first few rows we use the head() function while to look at the last few rows we use the tail() function.

head(df1, 3)
# here only the first three rows will be printed

this returns:

  Person Age Weight Height Salary
1 Aditya  26     81    6.0  5e+04
2  Ayush  26     85    5.8  8e+04
3 Akshay  27     90    6.2  1e+05
tail(df1, 3)
# here only the last three rows will be printed

this returns:

  Person Age Weight Height Salary
4 Aditya  26     81    6.0  5e+04
5  Ayush  26     85    5.8  8e+04
6 Akshay  27     90    6.2  1e+05
str(dataset) 
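# this returns the individual class or data type information for each column.
# (Person appears as a Factor here, as in R versions before 4.0.0; from R 4.0.0 onward
# it is shown as chr unless stringsAsFactors = TRUE is passed to data.frame().)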
'data.frame':    3 obs. of  5 variables:
 $ Person: Factor w/ 3 levels "Aditya","Akshay",..: 1 3 2
 $ Age   : num  26 26 27
 $ Weight: num  81 85 90
 $ Height: num  6 5.8 6.2
 $ Salary: num  5e+04 8e+04 1e+05

The summary() function helps to understand the statistics of the data frame. As shown below, for each numeric column it reports the minimum, first quartile, median, mean, third quartile and maximum, from which you can get some intuition about the distribution of your data. It also reports the number of missing values (NA's) in a column when there are any.

summary(dataset)
Person       Age            Weight          Height        Salary      
 Aditya:1   Min.   :26.00   Min.   :81.00   Min.   :5.8   Min.   : 50000  
 Akshay:1   1st Qu.:26.00   1st Qu.:83.00   1st Qu.:5.9   1st Qu.: 65000  
 Ayush :1   Median :26.00   Median :85.00   Median :6.0   Median : 80000  
            Mean   :26.33   Mean   :85.33   Mean   :6.0   Mean   : 76667  
            3rd Qu.:26.50   3rd Qu.:87.50   3rd Qu.:6.1   3rd Qu.: 90000  
            Max.   :27.00   Max.   :90.00   Max.   :6.2   Max.   :100000  
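The dataset above has no missing values, so the NA count does not appear in its summary. The sketch below uses a made-up vector with one missing value to show how summary() reports it.

```r
# A numeric vector with one missing value.
ages <- c(26, NA, 27)

# summary() adds an "NA's" entry when missing values are present.
age_summary <- summary(ages)
print(age_summary)
names(age_summary)

# Most statistics return NA unless missing values are removed explicitly.
mean(ages)                   # returns NA
mean(ages, na.rm = TRUE)     # returns 26.5
```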

4. Lists

Unlike vectors, a list can contain elements of various data types. It can contain vectors, functions, matrices, and even another list inside it (nested-list).

You construct lists by using the list() function.

mylist = list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
mylist

This returns:

[[1]]
[1] 1 2 3

[[2]]
[1] "a"

[[3]]
[1]  TRUE FALSE  TRUE

[[4]]
[1] 2.3 5.9
str(mylist)

This returns:

List of 4
 $ : int [1:3] 1 2 3
 $ : chr "a"
 $ : logi [1:3] TRUE FALSE TRUE
 $ : num [1:2] 2.3 5.9
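The nested-list case mentioned above, and the way list elements are read back, can be sketched as follows. The list and its element names are hypothetical.

```r
# A list holding a vector, a string and another list.
nested <- list(numbers = 1:3, label = "a", inner = list(flag = TRUE))

# Double brackets return the element itself.
nested[[1]]            # returns 1 2 3

# Named elements can also be reached with $, including inside a nested list.
nested$label           # returns "a"
nested$inner$flag      # returns TRUE

# Single brackets return a shorter list, not the element.
class(nested[1])       # returns 'list'
```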

If...else statement in R

Decision making is an important part of programming. This can be achieved in R programming using the conditional if...else statement.

If statement

The syntax of if statement is:

if (test_expression) {
statement
}

If the test_expression is TRUE, the statement gets executed. But if it’s FALSE, nothing happens.

For example:

x = 5
if (x > 0){
print("Positive number")
}

# The output is:
# [1] "Positive number"

If…else statement

The syntax of if…else statement is:

if (test_expression) {
statement1
} else {
statement2
}

The else part is optional and is only evaluated if test_expression is FALSE.

Note: else must be on the same line as the closing brace of the if statement.
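The if…else syntax above can be tried with a short sketch like this one. The variable names and messages are illustrative.

```r
x <- -3

# The else branch runs because the test expression is FALSE.
if (x > 0) {
  result <- "Positive number"
} else {
  result <- "Not a positive number"
}

print(result)    # prints [1] "Not a positive number"
```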

If…else Ladder

The if…else ladder (if…else…if) statement allows you to execute one block of code among more than two alternatives.

The syntax of the if…else ladder is:

if ( test_expression1) {
statement1
} else if ( test_expression2) {
statement2
} else if ( test_expression3) {
statement3
} else {
statement4
}

Only one statement will get executed depending upon the test_expressions.

Example:

x = 0
if (x < 0) {
print("Negative number")
} else if (x > 0) {
print("Positive number")
} else {
print("Zero")
}

# output is:
# [1] "Zero"

Loops

Loops are used in programming to repeat a specific block of code.

for loop in R

A for loop is used to iterate over a vector in R programming.

Syntax of for loop

for (val in sequence)
{
statement
}

Here, sequence is a vector and val takes on each of its values during the loop. In each iteration, the statement is evaluated.

Example:

A for loop to print out the elements in a vector:

x = c(2,5,3,9)
for (val in x) {
print(val)
}

This outputs:

[1] 2
[1] 5
[1] 3
[1] 9

Below is an example to count the number of even numbers in a vector.

x = c(2,5,3,9,8,11,6)
count = 0
for (val in x) {
if(val %% 2 == 0)  count = count+1
}
print(count)

# output is:
# [1] 3
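For comparison, the same count can be obtained without an explicit loop, because arithmetic and comparison operators in R work on whole vectors. This is offered as an alternative sketch, not a replacement for the loop above; the name even_count is illustrative.

```r
x <- c(2, 5, 3, 9, 8, 11, 6)

# x %% 2 == 0 gives a logical vector; sum() counts the TRUE values.
even_count <- sum(x %% 2 == 0)

print(even_count)    # prints [1] 3
```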

Functions

A function is a set of statements organized together to perform a specific task.

R has a large number of built-in functions, and users can also create their own functions.

1. Built-in functions

So far we have used several pre-existing functions in R, such as print(), data.frame() and list(), among others.

Note: R has a great many built-in functions, so before defining your own, it is worth checking whether what you need already exists.

2. User-defined functions

Function Definition

An R function is created by using the keyword function. The basic syntax of an R function definition is as follows:

function_name = function(arg_1, arg_2, ...) {
   Function body 
}

Function Components

The different parts of a function are:

  • Function Name − This is the actual name of the function. It is stored in R environment as an object with this name.

  • Arguments − An argument is a placeholder. When a function is invoked, you pass a value to the argument. Arguments are optional; that is, a function may contain no arguments. Also arguments can have default values.

  • Function Body − The function body contains a collection of statements that defines what the function does.

  • Return Value − The return value of a function is the last expression in the function body to be evaluated.

Example:

# Create a function to print squares of numbers in sequence.

new.function = function(a) {
   for(i in 1:a) {
      b = i^2
      print(b)
   }
}

# Call the function new.function supplying 6 as an argument.

new.function(6)

When we execute the above code, it produces the following result −

[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
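Two points from the component list above, default argument values and the return value being the last evaluated expression, are not shown by new.function. A minimal sketch, using a hypothetical function named power:

```r
# exponent has a default value of 2, so it may be omitted in a call.
power <- function(base, exponent = 2) {
  # The last expression evaluated is the return value; no return() call is needed.
  base^exponent
}

squared <- power(3)        # uses the default exponent, gives 9
custom  <- power(2, 5)     # overrides the default, gives 32

print(c(squared, custom))
```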

Object oriented programming in R

In R programming, OOP (object-oriented programming) provides classes and objects as its key tools to reduce and manage the complexity of a program.

R is a functional language that also uses OOP concepts.

We can think of a class as a sketch of a car. It contains all the details about the model_name, model_no, engine etc. Based on these descriptions we select a car. The car is the object. Each car object has its own characteristics and features.

An object is also called an instance of a class and the process of creating this object is called instantiation.

OOP has the following features:

  • Class
  • Object
  • Abstraction
  • Encapsulation
  • Polymorphism
  • Inheritance

The two most commonly used class systems in R are:

  • S3 Class
  • S4 Class

1. S3 Class

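An S3 class has no formal, predefined definition; an object becomes an instance of a class simply by having its class attribute set. Methods are dispatched through generic functions: the generic function makes a call to the method that matches the object's class.

Its syntax is: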

variable_name = list(attribute1, attribute2, attribute3....attributeN)

Example: In the following code, a Student class is defined. A list with the attributes name and roll number is created and given the class name Student; the resulting object is then printed.

# List creation with its attributes name and roll no.
a = list(name = "Adam", Roll_No = 15 )  

# Defining a class "Student"
class(a) = "Student"  

# Creation of object
a

Output:

$name
[1] "Adam"

$Roll_No
[1] 15

attr(,"class")
[1] "Student"
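The way an S3 generic function calls the method for an object's class can be sketched as follows. The generic name describe and its methods are hypothetical, chosen only for this illustration.

```r
# A generic function: UseMethod() picks the method from the object's class.
describe <- function(obj, ...) UseMethod("describe")

# Method used for objects of class "Student".
describe.Student <- function(obj, ...) {
  paste("Student", obj$name, "has roll number", obj$Roll_No)
}

# Fallback method for any other class.
describe.default <- function(obj, ...) {
  "Unknown object"
}

s <- list(name = "Adam", Roll_No = 15)
class(s) <- "Student"

describe(s)      # returns "Student Adam has roll number 15"
describe(42)     # returns "Unknown object"
```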

2. S4 Class

An S4 class has a formal, predefined definition. The S4 system provides functions for defining methods and generics, and it supports multiple dispatch.

Syntax:

setClass("myclass", slots=list(name="character", Roll_No="numeric")) 

Example:

# The setClass() function is used to create an S4 class with a list of slots.
setClass("Student", slots=list(name="character", Roll_No="numeric"))

# The new() function is used to create an object of class 'Student'
a = new("Student", name="Adam", Roll_No=20)  

# Calling object
a

Output:

An object of class "Student"
Slot "name":
[1] "Adam"

Slot "Roll_No":
[1] 20
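The functions for defining generics and methods mentioned above are setGeneric() and setMethod(). The sketch below defines a hypothetical generic named studentInfo for the Student class and reads slots with the @ operator.

```r
# Define the class with typed slots.
setClass("Student", slots = list(name = "character", Roll_No = "numeric"))

# Declare a generic, then register a method for class "Student".
setGeneric("studentInfo", function(obj) standardGeneric("studentInfo"))
setMethod("studentInfo", "Student", function(obj) {
  # Slots of an S4 object are accessed with @ rather than $.
  paste(obj@name, "has roll number", obj@Roll_No)
})

s <- new("Student", name = "Adam", Roll_No = 20)
studentInfo(s)       # returns "Adam has roll number 20"
isVirtualClass("Student")   # returns FALSE, the class can be instantiated
```

Unlike S3, the slot types are checked when the object is created, so passing a character value for Roll_No would raise an error.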

For more details on R OOP visit this R Object Oriented Programming post.

Installing packages

When you download R from the Comprehensive R Archive Network (CRAN), you get the "base" R system, which comes with basic functionality.

One reason R is so useful is the large collection of packages that extend the basic functionality of R.

R packages are developed and published by the larger R community.

The primary location for obtaining R packages is CRAN.

Packages can be installed with the install.packages() function in R.

For example, to install the rgdal package:

install.packages("rgdal")

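Note: The rgdal package provides bindings to the Geospatial Data Abstraction Library (GDAL) for reading, writing and converting between spatial formats. It was retired from CRAN in 2023, so on a current R installation this command may fail; the same install.packages() call works for any package that is available on CRAN.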

To use an installed package, you have to load it using the library() function.

For instance, after installing the rgdal package, to check the installed OGR drivers:

library(rgdal)                         # load the rgdal package
ogrDrivers()$name

This returns:

##  [1] "AeronavFAA"     "ARCGEN"         "AVCBin"         "AVCE00"        
##  [5] "BNA"            "CSV"            "DGN"            "DXF"           
##  [9] "EDIGEO"         "ESRI Shapefile" "Geoconcept"     "GeoJSON"       
## [13] "Geomedia"       "GeoRSS"         "GML"            "GMT"           
## [17] "GPSBabel"       "GPSTrackMaker"  "GPX"            "HTF"           
## [21] "Idrisi"         "KML"            "MapInfo File"   "Memory"        
## [25] "MSSQLSpatial"   "ODBC"           "ODS"            "OpenAir"       
## [29] "OpenFileGDB"    "PCIDSK"         "PDF"            "PDS"           
## [33] "PGDump"         "PGeo"           "REC"            "S57"           
## [37] "SDTS"           "SEGUKOOA"       "SEGY"           "SUA"           
## [41] "SVG"            "SXF"            "TIGER"          "UK .NTF"       
## [45] "VRT"            "Walk"           "WAsP"           "XLSX"          
## [49] "XPlane"
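A common pattern is to install a package only when it is missing and then load it. The sketch below uses the stats package, which ships with R, as a stand-in so that it runs without a network connection; substitute the name of the package you need.

```r
# Package name as a string; "stats" is only a placeholder that is always present.
pkg <- "stats"

# requireNamespace() returns FALSE when the package is not installed.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE lets library() accept the name held in a variable.
library(pkg, character.only = TRUE)

# Loaded packages appear on the search path as "package:<name>".
pkg_loaded <- paste0("package:", pkg) %in% search()
print(pkg_loaded)
```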
